Release v2.2.0: Norwegian & Lithuanian models, better Dutch NER, smaller install, faster matching & more · explosion/spaCy

⚠️ This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions. If you've been training your own models, you'll need to retrain them with the new version.

✨ New features and improvements

NEW: Pretrained core models for Norwegian (MIT) and Lithuanian (CC BY-SA).
NEW: Better pre-trained Dutch NER using a custom labelled UD corpus instead of WikiNER.
NEW: Make spaCy roughly 5-10× smaller on disk (depending on your platform) by compressing and moving lookups to a separate package.
NEW: EntityLinker and KnowledgeBase API to train and access entity linking models, plus scripts to train your own Wikidata models.
NEW: 10× faster PhraseMatcher and improved phrase matching algorithm.
NEW: DocBin class to efficiently serialize collections of Doc objects.
NEW: Train text classification models on the command line with spacy train and get textcat results via the Scorer.
NEW: debug-data command to validate your training and development data, get useful stats, and find problems like invalid entity annotations, cyclic dependencies, low data labels and more.
NEW: Efficient Lookups class using Bloom filters that allows storing, accessing and serializing large dictionaries via vocab.lookups.
Data augmentation in spacy train via the --orth-variant-level flag, which defines the percentage of occurrences of some tokens subject to replacement during training.
Add nlp.pipe_labels (labels assigned by pipeline components) and include "labels" in nlp.meta.
Support spacy_displacy_colors entry point to allow packages to add entity colors to displacy.
Allow template config option in displacy to customize entity HTML template.
Improve match pattern validation and handling of unsupported attributes.
Add lookup lemmatization data for Croatian and Serbian.
Update and improve language data for Chinese, Croatian, Thai, Romanian, Hindi and English.
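
The Bloom-filter idea behind the new Lookups class can be illustrated with a small standard-library sketch. This is not spaCy's implementation: `BloomFilter`, `FilteredTable`, and the sizes and hash scheme used here are hypothetical names and choices made only for this example. It shows how a filter can answer "definitely not present" before the underlying dictionary is consulted.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # Derive one bit position per seed by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


class FilteredTable:
    """Hypothetical dict wrapper that consults the filter before the dict."""

    def __init__(self, data):
        self.data = dict(data)
        self.bloom = BloomFilter()
        for key in self.data:
            self.bloom.add(key)

    def get(self, key, default=None):
        if key not in self.bloom:
            # The filter guarantees the key was never added.
            return default
        # A positive answer may be a false positive, so check the dict.
        return self.data.get(key, default)


lemma_lookup = FilteredTable({"went": "go", "mice": "mouse"})
print(lemma_lookup.get("went"))         # go
print(lemma_lookup.get("cats", "cats")) # cats (falls back to the default)
```

Because a Bloom filter never produces false negatives, the early return is safe. A false positive only costs one ordinary dictionary lookup.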

🔴 Bug fixes

Fix issue #3258: Reduce package size on disk by moving and compressing large dictionaries.
Fix issue #3540: Update lemma and vector information after splitting a token.
Fix issue #3687: Automatically skip duplicates in Doc.retokenize.
Fix issue #3830: Retrain German model and fix subtok errors.
Fix issue #3850: Allow customizing entity HTML template in displaCy.
Fix issue #3879, #3951, #4154: Fix bug in Matcher retry loop that'd cause problems with ? operator.
Fix issue #3917: Raise error for negative token indices in displacy.
Fix issue #3922: Add PhraseMatcher.remove method.
Fix issue #3959, #4133: Make sure both pos and tag are correctly serialized.
Fix issue #3972: Ensure PhraseMatcher returns multiple matches for identical rules.
Fix issue #4020: Raise error for overlapping entities in biluo_tags_from_offsets.
Fix issue #4051: Ensure retokenizer sets POS tags correctly on merge.
Fix issue #4070: Improve token pattern checking without validation.
Fix issue #4096: Add checks for cycles in debug-data.
Fix issue #4100: Improve docs on phrase pattern attributes.
Fix issue #4102: Correct mistakes in English lookup lemmatizer data.
Fix issue #4104: Make visualized NER examples in docs more clear.
Fix issue #4107: Automatically set span root attributes on merging.
Fix issue #4111, #4170: Improve NER/IOB converters.
Fix issue #4120: Correctly handle ? operator at the end of pattern.
Fix issue #4123: Provide more details in cycle error message E069.
Fix issue #4138: Correctly open .html files as UTF-8 in evaluate command.
Fix issue #4139: Make emoticon data a raw string.
Fix issue #4148: Add missing API docs for force flag on set_extension.
Fix issue #4155: Correct language code for Serbian.
Fix issue #4165: Add more attributes to matcher validation schema.
Fix issue #4190: Fix caching issue that'd cause tokenizer to not be deserialized correctly.
Fix issue #4200: Work around tqdm bug that'd remove text color from terminal output.
Fix issue #4229: Fix handling of pre-set entities.
Fix issue #4238: Flush tokenizer cache when affixes, token_match, or special cases are modified.
Fix issue #4242: Make .pos/.tag distinction more clear in the docs.
Fix issue #4245: Fix bug that occurred when processing empty string in Korean.
Fix issue #4262: Fix handling of spaces in Japanese.
Fix issue #4269: Tokenize punctuation correctly in Kannada, Tamil, and Telugu and add unicode characters to default sentencizer config.
Fix issue #4270: Fix --vectors-loc documentation.
Fix issue #4302: Remove duplicate Parser.tok2vec property.
Fix issue #4303: Correctly support as_tuples and return_matches in Matcher.pipe.
Fix issue #4307: Ensure that pre-set entities are preserved and allow overwriting unset tokens.
Fix issue #4308: Fix bug that could cause PhraseMatcher with very large lists to miss matches.
Fix issue #4348: Ensure training doesn't crash with empty batches.

⚠️ Backwards incompatibilities

This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
The lemmatization tables have been moved to their own package, spacy-lookups-data, which is not installed by default. If you're using pre-trained models, nothing changes, because the tables are now included in the model packages. If you want to use the lemmatizer for other languages that don't yet have pre-trained models (e.g. Turkish or Croatian) or start off with a blank model that contains lookup data (e.g. spacy.blank("en")), you'll need to explicitly install spaCy plus data via pip install spacy[lookups]. The data will be registered automatically via entry points.
Lemmatization tables (rules, exceptions, index and lookups) are now part of the Vocab and serialized with it. This means that serialized objects (nlp, pipeline components, vocab) will now include additional data, and models written to disk will include additional files.
The Lemmatizer class is now initialized with an instance of Lookups containing the rules and tables, instead of dicts as separate arguments. This makes it easier to share data tables and modify them at runtime. This is mostly internals, but if you've been implementing a custom Lemmatizer, you'll need to update your code.
If you've been training your own models, you'll need to retrain them with the new version.
The Dutch model has been trained on a new NER corpus (custom labelled UD instead of WikiNER), so its predictions may be very different compared to the previous version. The results should be significantly better and more generalizable, though.
The spacy download command no longer sets the --no-deps pip argument by default, meaning that model package dependencies (if available) will now also be downloaded and installed. If spaCy (which is also a model dependency) is not installed in the current environment, e.g. if a user has built from source, --no-deps is added back automatically to prevent spaCy from being downloaded and installed again from pip.
The built-in biluo_tags_from_offsets converter is now stricter and will raise an error if entities are overlapping (instead of silently skipping them). If your data contains invalid entity annotations, make sure to clean it and resolve conflicts. You can now also use the new debug-data command to find problems in your data.
Pipeline components can now overwrite IOB tags of tokens that are not yet part of an entity. Once a token has an ent_iob value set, it won't be reset to an "unset" state and will always have at least O assigned. list(doc.ents) now actually keeps the annotations on the token level consistent, instead of resetting O to an empty string.
The default punctuation in the Sentencizer has been extended and now includes more characters common in various languages. This also means that the results it produces may change, depending on your text. If you want the previous behaviour with limited characters, set punct_chars=[".", "!", "?"] on initialization.
The PhraseMatcher algorithm was rewritten from scratch and it's now 10× faster. The rewrite also resolved a few subtle bugs with very large terminology lists. So if you were matching large lists, you may see slightly different results – however, the results should now be fully correct. See #4309 for details on this change.
The Serbian language class (introduced in v2.1.8) incorrectly used the language code rs instead of sr. This has now been fixed, so Serbian is now available via spacy.lang.sr.
The "sources" in the meta.json have changed from a list of strings to a list of dicts. This is mostly internals, but if your code used nlp.meta["sources"], you might have to update it.
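
Since biluo_tags_from_offsets now raises an error on overlapping entities, it can help to scan your annotations before converting them. The following is a minimal standard-library sketch, not part of spaCy. It assumes entities are `(start, end, label)` character offsets with an exclusive end, and `find_overlaps` is a hypothetical helper name.

```python
def find_overlaps(entities):
    """Return pairs of (start, end, label) spans that share characters.

    Assumes character offsets with an exclusive end.
    """
    ordered = sorted(entities, key=lambda ent: (ent[0], ent[1]))
    overlaps = []
    for i, current in enumerate(ordered):
        for later in ordered[i + 1:]:
            if later[0] >= current[1]:
                # Spans are sorted by start, so nothing further can overlap.
                break
            overlaps.append((current, later))
    return overlaps


entities = [(0, 9, "PERSON"), (5, 14, "ORG"), (20, 26, "GPE")]
conflicts = find_overlaps(entities)
for first, second in conflicts:
    print("Overlapping annotations:", first, second)
```

Spans that merely touch, such as `(0, 3)` and `(3, 6)`, are not reported, because the end offset is exclusive. How to resolve a conflict (keep the longer span, drop both, or relabel by hand) depends on your data, so the sketch only reports the pairs.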

📈 Benchmarks

Model	Language	Version	UAS	LAS	POS	NER F	Vec	Size
`en_core_web_sm`	English	2.2.0	91.61	89.71	97.03	85.07	𐄂	11 MB
`en_core_web_md`	English	2.2.0	91.65	89.77	97.14	86.10	✓	91 MB
`en_core_web_lg`	English	2.2.0	91.98	90.16	97.21	86.30	✓	789 MB
`de_core_news_sm`	German	2.2.0	90.75	88.63	96.29	83.11	𐄂	14 MB
`de_core_news_md`	German	2.2.0	91.26	89.36	96.44	83.42	✓	214 MB
`es_core_news_sm`	Spanish	2.2.0	90.20	87.05	96.79	89.45	𐄂	15 MB
`es_core_news_md`	Spanish	2.2.0	90.89	87.94	97.03	89.86	✓	74 MB
`pt_core_news_sm`	Portuguese	2.2.0	89.53	86.07	79.96	87.97	𐄂	20 MB
`fr_core_news_sm`	French	2.2.0	87.27	84.28	94.38	82.77	𐄂	14 MB
`fr_core_news_md`	French	2.2.0	88.82	86.07	95.15	82.82	✓	84 MB
`it_core_news_sm`	Italian	2.2.0	90.79	86.94	96.06	86.29	𐄂	13 MB
`nl_core_news_sm`	Dutch	2.2.0	76.79	69.53	90.10	68.79	𐄂	14 MB
`el_core_news_sm`	Greek	2.2.0	84.40	80.98	94.41	71.88	𐄂	10 MB
`el_core_news_md`	Greek	2.2.0	87.96	84.88	96.38	77.59	✓	126 MB
`nb_core_news_sm`	Norwegian	2.2.0	89.02	86.49	95.72	83.99	𐄂	12 MB
`lt_core_news_sm`	Lithuanian	2.2.0	59.87	48.00	74.02	76.58	𐄂	12 MB
`xx_ent_wiki_sm`	Multi	2.2.0	-	-	-	79.88	𐄂	3 MB

💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).

📖 Documentation and examples

Add "label scheme" section to all models in the models directory that lists the labels assigned by the different components.
Extend the sources listed in the meta.json of pre-trained models with more details on the training corpora and include more information in the models directory.
Add more examples of matching regular expressions.
Add instructions for training an entity linking model.
Add API docs for new debug-data, EntityLinker, KnowledgeBase and Lookups.
Add new projects to the spaCy Universe.
Add example for interactive model visualizer with Streamlit.
Fix various typos and inconsistencies.

👥 Contributors

Thanks to @ICLRandD, @phiedulxp, @ajrader, @RyanZHe, @jenojp, @yanaiela, @isaric, @mrdbourke, @avramandrei, @Pavle992, @chkoar, @wannaphongcom, @BreakBB, @b1uec0in, @mihaigliga21, @tamuhey, @euand, @Hazoom, @SeanBE, @esemeniuc, @zqianem, @ajkl, @jaydeepborkar, @EarlGreyT and @er-raoniz for the pull requests and contributions.

Special thanks to our spaCy team @svlandeg and @adrianeboyd for the bug fixes and new features, @polm for the Bloom filters implementation and data compression and @yvespeirsman, @lemontheme, @jarib, @miktoki and @rokasramas for the help and resources for the new models.
