v2.2.0: Norwegian & Lithuanian models, better Dutch NER, smaller install, faster matching & more
⚠️ This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions. If you've been training your own models, you'll need to retrain them with the new version.
✨ New features and improvements
- NEW: Pretrained core models for Norwegian (MIT) and Lithuanian (CC BY-SA).
- NEW: Better pre-trained Dutch NER using custom labelled UD corpus instead of WikiNER.
- NEW: Make spaCy roughly 5-10× smaller on disk (depending on your platform) by compressing and moving lookups to a separate package.
- NEW: `EntityLinker` and `KnowledgeBase` API to train and access entity linking models, plus scripts to train your own Wikidata models.
- NEW: 10× faster `PhraseMatcher` and improved phrase matching algorithm.
- NEW: `DocBin` class to efficiently serialize collections of `Doc` objects.
- NEW: Train text classification models on the command line with `spacy train` and get `textcat` results via the `Scorer`.
- NEW: `debug-data` command to validate your training and development data, get useful stats, and find problems like invalid entity annotations, cyclic dependencies, low data labels and more.
- NEW: Efficient `Lookups` class using Bloom filters that allows storing, accessing and serializing large dictionaries via `vocab.lookups`.
- Data augmentation in `spacy train` via the `--orth-variant-level` flag, which defines the percentage of occurrences of some tokens subject to replacement during training.
- Add `nlp.pipe_labels` (labels assigned by pipeline components) and include `"labels"` in `nlp.meta`.
- Support `spacy_displacy_colors` entry point to allow packages to add entity colors to `displacy`.
- Allow `template` config option in `displacy` to customize entity HTML template.
- Improve match pattern validation and handling of unsupported attributes.
- Add lookup lemmatization data for Croatian and Serbian.
- Update and improve language data for Chinese, Croatian, Thai, Romanian, Hindi and English.
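To picture what the `--orth-variant-level` flag listed above controls, the sketch below replaces tokens that have known spelling variants with a given probability. It is a plain-Python illustration of the idea as described in these notes, not spaCy's implementation; the `VARIANT_GROUPS` table and the `augment` function are hypothetical names introduced only for this example.

```python
import random

# Hypothetical variant groups (an assumption for this sketch):
# tokens inside one group are treated as interchangeable spellings.
VARIANT_GROUPS = [["'", "‘", "’"], ["...", "…"]]


def augment(tokens, orth_variant_level, rng):
    """Swap each token that has variants with probability `orth_variant_level`."""
    result = []
    for token in tokens:
        # Find the group this token belongs to, if any
        group = next((g for g in VARIANT_GROUPS if token in g), None)
        if group is not None and rng.random() < orth_variant_level:
            # Pick a different spelling from the same group
            result.append(rng.choice([v for v in group if v != token]))
        else:
            result.append(token)
    return result


tokens = ["Wait", "...", "it", "'", "s", "fine"]
rng = random.Random(0)  # fixed seed so runs are repeatable
unchanged = augment(tokens, 0.0, rng)  # level 0.0: nothing is replaced
all_swapped = augment(tokens, 1.0, rng)  # level 1.0: every variant token is replaced
```

A level between 0 and 1 leaves most of the training text untouched while still exposing the model to alternative spellings.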
🔴 Bug fixes
- Fix issue #3258: Reduce package size on disk by moving and compressing large dictionaries.
- Fix issue #3540: Update lemma and vector information after splitting a token.
- Fix issue #3687: Automatically skip duplicates in `Doc.retokenize`.
- Fix issue #3830: Retrain German model and fix `subtok` errors.
- Fix issue #3850: Allow customizing entity HTML template in displaCy.
- Fix issue #3879, #3951, #4154: Fix bug in `Matcher` retry loop that'd cause problems with `?` operator.
- Fix issue #3917: Raise error for negative token indices in `displacy`.
- Fix issue #3922: Add `PhraseMatcher.remove` method.
- Fix issue #3959, #4133: Make sure both `pos` and `tag` are correctly serialized.
- Fix issue #3972: Ensure `PhraseMatcher` returns multiple matches for identical rules.
- Fix issue #4020: Raise error for overlapping entities in `biluo_tags_from_offsets`.
- Fix issue #4051: Ensure retokenizer sets POS tags correctly on merge.
- Fix issue #4070: Improve token pattern checking without validation.
- Fix issue #4096: Add checks for cycles in `debug-data`.
- Fix issue #4100: Improve docs on phrase pattern attributes.
- Fix issue #4102: Correct mistakes in English lookup lemmatizer data.
- Fix issue #4104: Make visualized NER examples in docs more clear.
- Fix issue #4107: Automatically set span root attributes on merging.
- Fix issue #4111, #4170: Improve NER/IOB converters.
- Fix issue #4120: Correctly handle `?` operator at the end of pattern.
- Fix issue #4123: Provide more details in cycle error message `E069`.
- Fix issue #4138: Correctly open `.html` files as UTF-8 in `evaluate` command.
- Fix issue #4139: Make emoticon data a raw string.
- Fix issue #4148: Add missing API docs for `force` flag on `set_extension`.
- Fix issue #4155: Correct language code for Serbian.
- Fix issue #4165: Add more attributes to matcher validation schema.
- Fix issue #4190: Fix caching issue that'd cause tokenizer to not be deserialized correctly.
- Fix issue #4200: Work around `tqdm` bug that'd remove text color from terminal output.
- Fix issue #4229: Fix handling of pre-set entities.
- Fix issue #4238: Flush tokenizer cache when affixes, token_match, or special cases are modified.
- Fix issue #4242: Make `.pos`/`.tag` distinction more clear in the docs.
- Fix issue #4245: Fix bug that occurred when processing empty string in Korean.
- Fix issue #4262: Fix handling of spaces in Japanese.
- Fix issue #4269: Tokenize punctuation correctly in Kannada, Tamil, and Telugu and add unicode characters to default sentencizer config.
- Fix issue #4270: Fix `--vectors-loc` documentation.
- Fix issue #4302: Remove duplicate `Parser.tok2vec` property.
- Fix issue #4303: Correctly support `as_tuples` and `return_matches` in `Matcher.pipe`.
- Fix issue #4307: Ensure that pre-set entities are preserved and allow overwriting unset tokens.
- Fix issue #4308: Fix bug that could cause `PhraseMatcher` with very large lists to miss matches.
- Fix issue #4348: Ensure training doesn't crash with empty batches.
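Because `biluo_tags_from_offsets` now raises an error for overlapping entities (#4020), it can help to check character offsets before converting them. The following is a minimal sketch in plain Python that does not call spaCy; the helper name `find_overlaps` is hypothetical, and it assumes entities are given as `(start, end, label)` tuples with an exclusive end offset.

```python
def find_overlaps(entities):
    """Return pairs of (start, end, label) spans whose character ranges overlap."""
    # Sort by start offset so each span only needs to be compared with later ones
    ordered = sorted(entities, key=lambda ent: (ent[0], ent[1]))
    overlaps = []
    for i, current in enumerate(ordered):
        for other in ordered[i + 1:]:
            if other[0] >= current[1]:
                # Later spans start even further right, so none of them can overlap
                break
            overlaps.append((current, other))
    return overlaps


# Illustrative annotations: the first two spans share characters 3 to 5
entities = [(0, 5, "PERSON"), (3, 9, "ORG"), (12, 18, "GPE")]
conflicts = find_overlaps(entities)
```

Any pair reported here would need to be resolved by hand or by a project-specific rule before the annotations are converted.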
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- The lemmatization tables have been moved to their own package, `spacy-lookups-data`, which is not installed by default. If you're using pre-trained models, nothing changes, because the tables are now included in the model packages. If you want to use the lemmatizer for other languages that don't yet have pre-trained models (e.g. Turkish or Croatian) or start off with a blank model that contains lookup data (e.g. `spacy.blank("en")`), you'll need to explicitly install spaCy plus data via `pip install spacy[lookups]`. The data will be registered automatically via entry points.
- Lemmatization tables (rules, exceptions, index and lookups) are now part of the `Vocab` and serialized with it. This means that serialized objects (`nlp`, pipeline components, vocab) will now include additional data, and models written to disk will include additional files.
- The `Lemmatizer` class is now initialized with an instance of `Lookups` containing the rules and tables, instead of dicts as separate arguments. This makes it easier to share data tables and modify them at runtime. This is mostly internals, but if you've been implementing a custom `Lemmatizer`, you'll need to update your code.
- If you've been training your own models, you'll need to retrain them with the new version.
- The Dutch model has been trained on a new NER corpus (custom labelled UD instead of WikiNER), so its predictions may be very different compared to the previous version. The results should be significantly better and more generalizable, though.
- The `spacy download` command does not set the `--no-deps` pip argument anymore by default, meaning that model package dependencies (if available) will now also be downloaded and installed. If spaCy (which is also a model dependency) is not installed in the current environment, e.g. if a user has built from source, `--no-deps` is added back automatically to prevent spaCy from being downloaded and installed again from pip.
- The built-in `biluo_tags_from_offsets` converter is now stricter and will raise an error if entities are overlapping (instead of silently skipping them). If your data contains invalid entity annotations, make sure to clean it and resolve conflicts. You can now also use the new `debug-data` command to find problems in your data.
- Pipeline components can now overwrite IOB tags of tokens that are not yet part of an entity. Once a token has an `ent_iob` value set, it won't be reset to an "unset" state and will always have at least `O` assigned. `list(doc.ents)` now actually keeps the annotations on the token level consistent, instead of resetting `O` to an empty string.
- The default punctuation in the `Sentencizer` has been extended and now includes more characters common in various languages. This also means that the results it produces may change, depending on your text. If you want the previous behaviour with limited characters, set `punct_chars=[".", "!", "?"]` on initialization.
- The `PhraseMatcher` algorithm was rewritten from scratch and it's now 10× faster. The rewrite also resolved a few subtle bugs with very large terminology lists. So if you were matching large lists, you may see slightly different results. However, the results should now be fully correct. See #4309 for details on this change.
- The `Serbian` language class (introduced in v2.1.8) incorrectly used the language code `rs` instead of `sr`. This has now been fixed, so `Serbian` is now available via `spacy.lang.sr`.
- The `"sources"` in the `meta.json` have changed from a list of strings to a list of dicts. This is mostly internals, but if your code used `nlp.meta["sources"]`, you might have to update it.
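For code that reads `nlp.meta["sources"]` and has to work with models from before and after this change, one option is to normalize both formats to a list of dicts. This is a minimal sketch only: the helper name `normalize_sources`, the `"name"` key and the corpus names are assumptions made for illustration, so check the `meta.json` of your model for the keys it actually uses.

```python
def normalize_sources(sources):
    """Return every entry of meta["sources"] as a dict, whichever format it came in."""
    normalized = []
    for entry in sources or []:
        if isinstance(entry, str):
            # Older format: a plain string; wrap it under an assumed "name" key
            normalized.append({"name": entry})
        else:
            # Newer format: already a dict; copy it so callers can modify it safely
            normalized.append(dict(entry))
    return normalized


# Stand-ins for nlp.meta with hypothetical corpus names
old_meta = {"sources": ["Example Corpus A", "Example Corpus B"]}
new_meta = {"sources": [{"name": "Example Corpus A", "license": "example-license"}]}

old_names = [src["name"] for src in normalize_sources(old_meta["sources"])]
new_names = [src["name"] for src in normalize_sources(new_meta["sources"])]
```

Handling a missing or empty `"sources"` entry the same way as an empty list keeps downstream code free of special cases.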
📈 Benchmarks
| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
|---|---|---|---|---|---|---|---|---|
| `en_core_web_sm` | English | 2.2.0 | 91.61 | 89.71 | 97.03 | 85.07 | 𐄂 | 11 MB |
| `en_core_web_md` | English | 2.2.0 | 91.65 | 89.77 | 97.14 | 86.10 | ✓ | 91 MB |
| `en_core_web_lg` | English | 2.2.0 | 91.98 | 90.16 | 97.21 | 86.30 | ✓ | 789 MB |
| `de_core_news_sm` | German | 2.2.0 | 90.75 | 88.63 | 96.29 | 83.11 | 𐄂 | 14 MB |
| `de_core_news_md` | German | 2.2.0 | 91.26 | 89.36 | 96.44 | 83.42 | ✓ | 214 MB |
| `es_core_news_sm` | Spanish | 2.2.0 | 90.20 | 87.05 | 96.79 | 89.45 | 𐄂 | 15 MB |
| `es_core_news_md` | Spanish | 2.2.0 | 90.89 | 87.94 | 97.03 | 89.86 | ✓ | 74 MB |
| `pt_core_news_sm` | Portuguese | 2.2.0 | 89.53 | 86.07 | 79.96 | 87.97 | 𐄂 | 20 MB |
| `fr_core_news_sm` | French | 2.2.0 | 87.27 | 84.28 | 94.38 | 82.77 | 𐄂 | 14 MB |
| `fr_core_news_md` | French | 2.2.0 | 88.82 | 86.07 | 95.15 | 82.82 | ✓ | 84 MB |
| `it_core_news_sm` | Italian | 2.2.0 | 90.79 | 86.94 | 96.06 | 86.29 | 𐄂 | 13 MB |
| `nl_core_news_sm` | Dutch | 2.2.0 | 76.79 | 69.53 | 90.10 | 68.79 | 𐄂 | 14 MB |
| `el_core_news_sm` | Greek | 2.2.0 | 84.40 | 80.98 | 94.41 | 71.88 | 𐄂 | 10 MB |
| `el_core_news_md` | Greek | 2.2.0 | 87.96 | 84.88 | 96.38 | 77.59 | ✓ | 126 MB |
| `nb_core_news_sm` | Norwegian | 2.2.0 | 89.02 | 86.49 | 95.72 | 83.99 | 𐄂 | 12 MB |
| `lt_core_news_sm` | Lithuanian | 2.2.0 | 59.87 | 48.00 | 74.02 | 76.58 | 𐄂 | 12 MB |
| `xx_ent_wiki_sm` | Multi | 2.2.0 | - | - | - | 79.88 | 𐄂 | 3 MB |
💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. `Token.tag_`). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
📖 Documentation and examples
- Add "label scheme" section to all models in the models directory that lists the labels assigned by the different components.
- Extend the `sources` listed in the `meta.json` of pre-trained models with more details on the training corpora and include more information in the models directory.
- Add more examples of matching regular expressions.
- Add instructions for training an entity linking model.
- Add API docs for new `debug-data`, `EntityLinker`, `KnowledgeBase` and `Lookups`.
- Add new projects to the spaCy Universe.
- Add example for interactive model visualizer with Streamlit.
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @ICLRandD, @phiedulxp, @ajrader, @RyanZHe, @jenojp, @yanaiela, @isaric, @mrdbourke, @avramandrei, @Pavle992, @chkoar, @wannaphongcom, @BreakBB, @b1uec0in, @mihaigliga21, @tamuhey, @euand, @Hazoom, @SeanBE, @esemeniuc, @zqianem, @ajkl, @jaydeepborkar, @EarlGreyT and @er-raoniz for the pull requests and contributions.
Special thanks to our spaCy team @svlandeg and @adrianeboyd for the bug fixes and new features, @polm for the Bloom filters implementation and data compression and @yvespeirsman, @lemontheme, @jarib, @miktoki and @rokasramas for the help and resources for the new models.