Release v2.1.5: Base support for Marathi and Korean, better pretraining, scores per entity and bug fixes · explosion/spaCy

✨ New features and improvements

NEW: Base language data for Marathi and Korean (via mecab-ko, mecab-ko-dic and natto-py).
Improve language data for Lithuanian, Spanish, Kannada, French, Norwegian and Hindi.
Add evaluation metrics per entity type.
Add resume logic to spacy pretrain.
Add optional id property to EntityRuler patterns.
Better introspection and IDE autocomplete for custom extension attributes.
Make Doc.is_sentenced always return True for single-token docs.
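
To illustrate what "evaluation metrics per entity type" means, here is a minimal, library-independent sketch. It is not spaCy's Scorer implementation. The function name scores_per_entity_type and the (start, end, label) tuple format are assumptions made for this example only. It computes precision, recall and F-score separately for each entity label, counting a predicted span as correct only if its boundaries and label match a gold span exactly.

```python
from collections import defaultdict


def scores_per_entity_type(gold, predicted):
    """Compute precision, recall and F-score for each entity label.

    Hypothetical helper for illustration: `gold` and `predicted` are
    iterables of (start, end, label) tuples. A prediction counts as a
    true positive only on an exact match of boundaries and label.
    """
    gold, predicted = set(gold), set(predicted)
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})

    # Every predicted span is either a true or a false positive.
    for span in predicted:
        key = "tp" if span in gold else "fp"
        counts[span[2]][key] += 1

    # Gold spans that were never predicted are false negatives.
    for span in gold - predicted:
        counts[span[2]]["fn"] += 1

    results = {}
    for label, c in counts.items():
        n_pred = c["tp"] + c["fp"]
        n_gold = c["tp"] + c["fn"]
        p = c["tp"] / n_pred if n_pred else 0.0
        r = c["tp"] / n_gold if n_gold else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        results[label] = {"p": p, "r": r, "f": f}
    return results


# Toy data: one PERSON found correctly, one ORG found, one ORG missed,
# and one spurious ORG predicted.
gold_spans = [(0, 1, "PERSON"), (3, 4, "ORG"), (6, 7, "ORG")]
pred_spans = [(0, 1, "PERSON"), (3, 4, "ORG"), (8, 9, "ORG")]
per_type = scores_per_entity_type(gold_spans, pred_spans)
print(per_type)
```

Scores in this sketch are fractions between 0 and 1. Breaking the numbers down by label makes it easier to see which entity types a model handles poorly, something a single aggregate score can hide.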

🔴 Bug fixes

Fix issue #3490: Add evaluation metrics per entity type to Scorer.
Fix issue #3526: Serialize EntityRuler settings correctly.
Fix issue #3558: Improve E024 error message for incorrect GoldParse.
Fix issue #3611: Fix bug when setting ngram parameter in text classifier.
Fix issue #3625: Improve default punctuation rules for Hindi.
Fix issue #3707: Improve introspection of custom attributes.
Fix issue #3737: Check if component is callable in Language.replace_pipe.
Fix issue #3743: Fix documentation of lex_id.
Fix issue #3749: Change vector training script to work with latest Gensim.
Fix issues #3762 and #3934: Make Doc.is_sentenced default to True for single-token Docs.
Fix issue #3802: Fix typo in docs example.
Fix issue #3811: Fix type of --seed option in spacy pretrain.
Fix issue #3822: Allow passing PhraseMatcher arguments to EntityRuler.
Fix issue #3839: Ensure the Matcher returns correct match IDs when used with operators.
Fix issue #3840: Improve error messages in spacy pretrain.
Fix issue #3853: Rename vectors if multiple models are loaded to prevent clashes.
Fix issue #3859: Update pretrain to prevent unintended overwriting of weight files.
Fix issue #3862: Fix matcher callback example.
Fix issue #3868: Add "v.s." to English tokenizer exceptions.
Fix issue #3869: Make Doc.count_by work as expected.
Fix issue #3880: Fix unflatten padding in Thinc when last element is empty.
Fix issue #3882: Exclude user_data when copying doc in displaCy.
Fix issue #3892: Update Tokenizer initialization docs.
Fix issue #3912: Make text classifier raise more friendly errors.

📖 Documentation and examples

Add documentation for Scorer, Language.evaluate and gold.docs_to_json.
Fix various typos and inconsistencies.

👥 Contributors

Thanks to @BreakBB, @ujwal-narayan, @estr4ng7d, @maknotavailable, @ramananbalakrishnan, @nipunsadvilkar, @NirantK, @munozbravo, @intrafindBreno, @Azagh3l, @jarib, @tokestermw, @polm, @skrcode, @kabirkhan, @demongolem, @elbaulp, @clarus, @BramVanroy, @rokasramas, @askhogan, @khellan, @kognate, @cedar101 and @yash1994 for the pull requests and contributions.
