DeFormer: correcting the usage of 'de' and 'dem' in the Swedish language

In modern spoken Swedish, the pronouns "de" and "dem" are pronounced the same way ("dom"). However, in written Swedish we still distinguish between the object form (dem) and the subject form (de) of third person plural pronouns.

Since they are homophones, an increasing number of Swedes have begun to confuse their usage, similar to "their" vs "they're", or "your" vs "you're" in English.

DeFormer is a transformer language model based on the Swedish KB-BERT. It has been trained to perform token classification. Each word piece (token) is classified in to one of three categories:

ord (all background words which are not de/dem belong to this category)
DE
DEM

Training data

DeFormer has been trained on sentences from the European Parliament and Swedish Wikipedia. These were downloaded from OPUS (the Open Parallel Corpus project). The data sources were selected because they were presumed to have correct language usage.

Only sentences containing de or dem -- or both -- were kept in the creation of the training dataset. In the table below, we present summary statistics of sentences that were kept after filtering the data.

Source	Sentences	# de	# dem	de/dem ratio
Europaparl sv.txt.gz	500660	465977	54331	8.57x
JRC-Acquis raw.sv.gz	417951	408576	17028	23.99x
Wikimedia sv.txt.gz	630601	602393	38852	15.48x
Total	1549212	1476946	110211	13.40x

In the training, random substitutions were introduced where de or dem were excahnged against the opposite word. The DeFormer model was then challenged to classify which of the previously mentioned 3 categories a word belonged to.

Accuracy

DeFormer was evaluated on a held out validation set of 31200 sentences from the above data sources. Random substitutions were introduced here as well to challenge the model. The table below displays the accuracy of the model. A significant proportion of the erroneous predictions were in the form of "de/dem som-constructions. These are ambiguous cases in Swedish and are not viewed as errors, since both forms are equally accepted.

	Accuracy
de	99.9%
dem	98.6%

Instructions to train the model

Download the data. If you are on a Linux/Mac system, you can run bash download_data.sh. This will create a data/ folder and download the relevant datasets. Windows users can check the same file and download the data files manually from the links in the file.

Secondly, run create_dedem_dataset.py. This extracts the sentences containing de/dem from each respective dataset and creates a file called dedem_corpus.feather.

Third, run train.py to train the model.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

DeFormer: correcting the usage of 'de' and 'dem' in the Swedish language

Training data

Accuracy

Instructions to train the model

Files

README.md

Latest commit

History

README.md

File metadata and controls

DeFormer: correcting the usage of 'de' and 'dem' in the Swedish language

Training data

Accuracy

Instructions to train the model