v2.1.5: Base support for Marathi and Korean, better pretraining, scores per entity and bug fixes
✨ New features and improvements
- NEW: Base language data for Marathi and Korean (via `mecab-ko`, `mecab-ko-dic` and `natto-py`).
- Improve language data for Lithuanian, Spanish, Kannada, French, Norwegian and Hindi.
- Add evaluation metrics per entity type.
- Add resume logic to `spacy pretrain`.
- Add optional `id` property to EntityRuler patterns.
- Better introspection and IDE autocomplete for custom extension attributes.
- Make `Doc.is_sentenced` always return `True` for single-token docs.
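
To illustrate what the per-entity-type evaluation metrics listed above report, here is a minimal sketch in plain Python. It is not spaCy's `Scorer` implementation: the helper `per_type_prf`, the `(start, end, label)` span tuples and the sample data are all hypothetical, and a prediction is assumed to count as correct only on an exact match of boundaries and label.

```python
from collections import defaultdict


def per_type_prf(gold, pred):
    """Compute precision, recall and F-score separately for each entity label.

    `gold` and `pred` are iterables of (start, end, label) tuples.
    Hypothetical helper for illustration only, not part of spaCy.
    """
    gold, pred = set(gold), set(pred)
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    # A predicted span is a true positive only if it matches a gold span exactly.
    for span in pred:
        counts[span[2]]["tp" if span in gold else "fp"] += 1
    # Gold spans that were never predicted are false negatives.
    for span in gold - pred:
        counts[span[2]]["fn"] += 1
    scores = {}
    for label, c in counts.items():
        p = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        r = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"p": p, "r": r, "f": f}
    return scores


# Made-up example: one ORG span is predicted with the wrong start offset.
gold = [(0, 1, "PERSON"), (4, 6, "ORG"), (8, 9, "ORG")]
pred = [(0, 1, "PERSON"), (4, 6, "ORG"), (7, 9, "ORG")]
scores = per_type_prf(gold, pred)
print(scores)
```

Breaking the scores down by label this way shows which entity types a model handles poorly, which a single aggregate score can hide.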
🔴 Bug fixes
- Fix issue #3490: Add evaluation metrics per entity type to `Scorer`.
- Fix issue #3526: Serialize `EntityRuler` settings correctly.
- Fix issue #3558: Improve `E024` error message for incorrect `GoldParse`.
- Fix issue #3611: Fix bug when setting `ngram` parameter in text classifier.
- Fix issue #3625: Improve default punctuation rules for Hindi.
- Fix issue #3707: Improve introspection of custom attributes.
- Fix issue #3737: Check if component is callable in `Language.replace_pipe`.
- Fix issue #3743: Fix documentation of `lex_id`.
- Fix issue #3749: Change vector training script to work with latest Gensim.
- Fix issue #3762, #3934: Make `Doc.is_sentenced` default to `True` for single-token `Doc`s.
- Fix issue #3802: Fix typo in docs example.
- Fix issue #3811: Fix type of `--seed` option in `spacy pretrain`.
- Fix issue #3822: Allow passing `PhraseMatcher` arguments to `EntityRuler`.
- Fix issue #3839: Ensure the `Matcher` returns correct match IDs when used with operators.
- Fix issue #3840: Improve error messages in `spacy pretrain`.
- Fix issue #3853: Rename vectors if multiple models are loaded to prevent clashes.
- Fix issue #3859: Update `pretrain` to prevent unintended overwriting of weight files.
- Fix issue #3862: Fix matcher callback example.
- Fix issue #3868: Add `"v.s."` to English tokenizer exceptions.
- Fix issue #3869: Make `Doc.count_by` work as expected.
- Fix issue #3880: Fix unflatten padding in Thinc when last element is empty.
- Fix issue #3882: Exclude `user_data` when copying doc in displaCy.
- Fix issue #3892: Update `Tokenizer` initialization docs.
- Fix issue #3912: Make text classifier raise more friendly errors.
📖 Documentation and examples
- Add documentation for `Scorer`, `Language.evaluate` and `gold.docs_to_json`.
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @BreakBB, @ujwal-narayan, @estr4ng7d, @maknotavailable, @ramananbalakrishnan, @nipunsadvilkar, @NirantK, @munozbravo, @intrafindBreno, @Azagh3l, @jarib, @tokestermw, @polm, @skrcode, @kabirkhan, @demongolem, @elbaulp, @clarus, @BramVanroy, @rokasramas, @askhogan, @khellan, @kognate, @cedar101 and @yash1994 for the pull requests and contributions.