Releases: sorenlind/lemmy
Lemmatizing without POS tags
This release adds support for lemmatizing Danish text if you do not have POS tags. This makes it easier to use Lemmy if you do not have a Danish spaCy model. Note that the Swedish model still requires you to specify POS tags (either using spaCy or manually).
Using Lemmy if you do not have POS tags:
import lemmy
lemmatizer = lemmy.load("da")
lemmatizer.lemmatize("", "akvariernes")
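When POS tags are available for only some words, the two calling styles can be wrapped in one small helper. The sketch below is not part of Lemmy: `lemmatize_word` is a hypothetical name, and it assumes only that the lemmatizer object exposes `lemmatize(pos, word)` and accepts an empty string in place of a missing tag, as in the snippet above.

```python
def lemmatize_word(lemmatizer, word, pos=None):
    """Lemmatize `word`, passing an empty POS tag when none is known.

    `lemmatizer` is any object with a `lemmatize(pos, word)` method,
    for example the result of `lemmy.load("da")`. This helper is a
    hypothetical convenience wrapper, not part of Lemmy itself.
    """
    # An empty string stands in for the POS tag when no tag is available.
    return lemmatizer.lemmatize(pos or "", word)


# Intended usage (requires Lemmy to be installed):
#   import lemmy
#   lemmatizer = lemmy.load("da")
#   lemmatize_word(lemmatizer, "akvariernes")          # no POS tag
#   lemmatize_word(lemmatizer, "akvariernes", "NOUN")  # with POS tag
```

Since the Swedish model still requires POS tags, the tag-less path of a wrapper like this would only be appropriate for the Danish model.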
Support for Swedish
This release adds support for Swedish 🇸🇪.
Consequently, the API has changed: When loading a model you now need to specify a language (either 'da' or 'sv') as shown below.
Loading the standalone model:
lemmatizer = lemmy.load('da') # use 'sv' for Swedish
Loading the spaCy pipeline component:
pipe = lemmy.pipe.load('da') # use 'sv' for Swedish
Better support for ambiguous lemmatization
This release aligns the behavior of the spaCy pipeline component with standalone Lemmy to support ambiguous lemmatization.
Previously, the pipeline component would return None when more than one possible lemma was returned by Lemmy. The pipeline component now always returns a list of lemmas. For unambiguous words, the list will contain only one lemma, but in case of ambiguity, it can contain multiple lemmas. Consequently, the spaCy extension attribute has been renamed lemmas (plural).
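Because the attribute now always holds a list, downstream code that expects a single lemma per token has to decide what to do with ambiguous results. The following is a minimal sketch of one possible policy; `resolve_lemma` is a hypothetical helper, and falling back to a caller-supplied value on ambiguity is an assumption, not something Lemmy prescribes.

```python
def resolve_lemma(lemmas, fallback):
    """Collapse a list of candidate lemmas into a single string.

    Returns the only lemma when the list is unambiguous; otherwise
    returns `fallback` (for example the original token text).
    Hypothetical helper, not part of Lemmy or spaCy.
    """
    if lemmas and len(lemmas) == 1:
        return lemmas[0]
    # Ambiguous or empty result: leave the decision to the caller's fallback.
    return fallback


# With the pipeline component installed, the list would typically be read
# from the spaCy extension attribute, e.g.:
#   resolve_lemma(token._.lemmas, token.text)
```

Other policies, such as always taking the first candidate, are equally possible; the right choice depends on the application.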
First Experimental Release
🤘 Lemmy
Lemmy is a lemmatizer for Danish 🇩🇰. It comes pre-trained on Dansk Sprognævn’s (DSN) word list (‘fuldformliste’) and the Danish Universal Dependencies, and is ready for use. Lemmy also supports training on your own dataset.
The model currently included in Lemmy was evaluated on the Danish Universal Dependencies dev dataset and achieved an accuracy above 99%.
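For readers who want to run a comparable check on their own data, an accuracy figure of this kind can be computed roughly as sketched below. The release notes do not describe the exact evaluation procedure, so the scoring rule here is an assumption: a prediction counts as correct only when the lemmatizer returns exactly one lemma and it matches the gold lemma. The function name `lemma_accuracy` and the `(pos, word, gold_lemma)` triple format are likewise hypothetical.

```python
def lemma_accuracy(lemmatizer, examples):
    """Return the fraction of examples lemmatized exactly right.

    `examples` is an iterable of (pos, word, gold_lemma) triples.
    `lemmatizer` is any object with a `lemmatize(pos, word)` method
    that returns a list of candidate lemmas.
    Assumption: an ambiguous (multi-lemma) prediction counts as wrong.
    """
    examples = list(examples)
    if not examples:
        raise ValueError("no examples to evaluate")
    correct = sum(
        1
        for pos, word, gold in examples
        if lemmatizer.lemmatize(pos, word) == [gold]
    )
    return correct / len(examples)
```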
You can use Lemmy as a spaCy extension, more specifically a spaCy pipeline component. This is highly recommended and makes the lemmas easily accessible from the spaCy tokens. Lemmy makes use of POS tags to predict the lemmas. When wired up to the spaCy pipeline, Lemmy has the benefit of using spaCy’s built-in POS tagger.
Lemmy can also be used without spaCy, as a standalone lemmatizer. In that case, you will have to provide the POS tags. Alternatively, you can train a Lemmy model which does not depend on POS tags, though most likely the accuracy will suffer.
Lemmy is heavily inspired by the CST Lemmatizer for Danish.