Release v2.2.4: Alpha support for Yoruba and Basque, language data improvements and lots of bug fixes · explosion/spaCy

✨ New features and improvements

NEW: Add Span.char_span method.
NEW: Base language support for Yoruba and Basque.
NEW: Add --tag-map-path argument to debug-data and train commands.
NEW: Add add_lemma option to displacy dependency visualizer.
Add IDX as an attribute available via Doc.to_array.
Improve speed of adding a large number of patterns to EntityRuler.
Replace mecab-python3 with fugashi for Japanese.
Improve language data for Norwegian, Luxembourgish, Finnish, Slovak, Romanian, Greek and German.
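
As a rough illustration of two of the additions above, the sketch below exports token character offsets via the new IDX attribute in Doc.to_array and builds a sub-span with Span.char_span. It assumes spaCy v2.2.4 or later is installed; the blank English pipeline and the sample sentence are arbitrary choices for the example, and the character offsets passed to Span.char_span are taken to be relative to the span's own text.

```python
import spacy
from spacy.attrs import IDX, LENGTH

# A blank English pipeline only tokenizes, so no model download is needed.
nlp = spacy.blank("en")
doc = nlp("Hello brave new world")

# Doc.to_array with IDX: one row per token, columns in the order requested.
arr = doc.to_array([IDX, LENGTH])
offsets = [int(row[0]) for row in arr]  # character offset of each token
print(offsets)

# Span.char_span: offsets are relative to the span's own text,
# so (6, 9) selects span.text[6:9], which is "new".
span = doc[1:4]             # "brave new world"
sub = span.char_span(6, 9)  # None if the offsets do not align with tokens
print(sub.text if sub is not None else None)
```

Like Doc.char_span, the method is expected to return None when the requested character range does not line up with token boundaries, so callers should check the result before using it.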

🔴 Bug fixes

Fix issues #3979, #4819, #4871: Add tok2vec parameters to train command.
Fix issue #4009: Fix use of pretrained vectors in text classifier.
Fix issue #4342: Improve CLI training with base model.
Fix issue #4432: Add destructors for states in TransitionSystem.
Fix issue #4440: Require HEAD for is_parsed in Doc.from_array.
Fix issue #4615: Update SHAPE docs and examples.
Fix issue #4665: Allow HEAD field in CoNLL-U format to be an underscore.
Fix issue #4673: Ensure correct array module is used when returning a vector via Vocab.
Fix issue #4674: Make set_entities in the KnowledgeBase more robust.
Fix issue #4677: Add missing tags to tag maps for el, es and pt.
Fix issue #4688: Iterate over lr_edges until Doc.sents are correct.
Fix issues #4703, #4823: Facilitate large training files.
Fix issue #4707: Auto-exclude disabled pipeline components when calling from_disk during load.
Fix issue #4717: Fix int value handling in Matcher.
Fix issue #4719: Add message when the CLI train script throws an exception.
Fix issue #4723: Update EntityLinker example.
Fix issue #4725: Take care of global vectors in multiprocessing.
Fix issue #4770: Include Doc.cats in serialization of Doc and DocBin.
Fix issue #4772: Fix bug in EntityLinker.predict.
Fix issue #4777: Fix link to user hooks in documentation.
Fix issue #4829: Update build dependencies in pyproject.toml.
Fix issue #4830: Warn for punctuation in entities when training with noise.
Fix issue #4833: Make example scripts work with transformer starter models.
Fix issue #4849: Fix serialization of ENT_ID.
Fix issue #4862: Fix and improve URL pattern.
Fix issue #4868: Include .pyx and .pxd files in the distribution.
Fix issue #4876: Add friendlier error to entity linking example script.
Fix issue #4903: Fix handling of custom underscore attributes during multiprocessing.
Fix issue #4924: Fix handling of empty docs or golds in Language.evaluate.
Fix issue #4934: Prevent updating component config if the Model was already defined.
Fix issue #4935: Fix Sentencizer.pipe for empty Doc.
Fix issue #4961: Remove old docs section links.
Fix issue #4965: Sync Span.__eq__ and Span.__hash__.
Fix issue #4975: Adjust srsly pin.
Fix issue #5048: Fix behavior of get_doc test utility.
Fix issue #5073: Normalize IS_SENT_START to SENT_START for Matcher.
Fix issue #5075: Make it impossible to create invalid heads with Doc.from_array.
Fix issue #5082: Correctly set vector of merged span in merge_entities.
Fix issue #5115: Ensure paths in Tokenizer.to_disk and Tokenizer.from_disk.
Fix issue #5117: Clarify behavior of Doc.is_ flags for empty Docs.
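
To show why a fix like #4965 matters, here is a minimal sketch of the general Python rule that objects comparing equal must also hash equal, otherwise sets and dict keys behave inconsistently. SpanKey and its fields are hypothetical stand-ins invented for this example. This is not spaCy's Span implementation, and the fields spaCy actually compares may differ.

```python
class SpanKey:
    """Hypothetical stand-in for a span; not spaCy's Span class."""

    def __init__(self, doc_id, start, end, label=""):
        self.doc_id = doc_id
        self.start = start
        self.end = end
        self.label = label

    def _key(self):
        # Single source of truth shared by comparison and hashing,
        # so the two can never drift apart.
        return (self.doc_id, self.start, self.end, self.label)

    def __eq__(self, other):
        if not isinstance(other, SpanKey):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())


a = SpanKey("doc-1", 2, 5, "ORG")
b = SpanKey("doc-1", 2, 5, "ORG")
c = SpanKey("doc-1", 2, 5, "PERSON")

# a and b are equal and hash alike, so the set keeps only one of them.
unique = {a, b, c}
print(len(unique))
```

Deriving both methods from one key tuple is a common way to keep them in sync: if __eq__ considered the label but __hash__ did not (or the reverse), equal-looking spans could end up as separate set members or dict keys.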

📖 Documentation and examples

Fix various typos and inconsistencies.
Add new projects to the spaCy Universe.

👥 Contributors

Thanks to @polm, @mmaybeno, @jarib, @questoph, @aajanki, @mr-bjerre, @Tclack88, @thiagola92, @tamuhey, @Olamyy, @AlJohri, @iechevarria, @iurshina, @lineality, @pbadeer, @BramVanroy, @kabirkhan, @ceteri, @omri374, @maknotavailable, @onlyanegg, @drndos, @ju-sh, @nlptechbook, @chkoar, @Jan-711, @MisterKeefe, @bryant1410, @mirfan899, @dhpollack and @mabraham for the pull requests and contributions!
