v2.1.5: Base support for Marathi and Korean, better pretraining, scores per entity and bug fixes

@ines released this 12 Jul 12:31
· 5912 commits to master since this release

✨ New features and improvements

  • NEW: Base language data for Marathi and Korean (via mecab-ko, mecab-ko-dic and natto-py).
  • Improve language data for Lithuanian, Spanish, Kannada, French, Norwegian and Hindi.
  • Add evaluation metrics per entity type.
  • Add resume logic to spacy pretrain.
  • Add optional id property to EntityRuler patterns.
  • Better introspection and IDE autocomplete for custom extension attributes.
  • Make Doc.is_sentenced always return True for single-token docs.
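
To illustrate what "evaluation metrics per entity type" means in practice, here is a minimal pure-Python sketch that computes precision, recall and F-score separately for each entity label. It does not use spaCy's Scorer and is not its implementation: the function name `per_type_prf`, the `(start, end, label)` span format and the sample data are all assumptions made for this example.

```python
from collections import defaultdict


def per_type_prf(gold, pred):
    """Return precision, recall and F-score for each entity label.

    gold, pred: iterables of (start, end, label) tuples (hypothetical format).
    A predicted span counts as correct only if start, end and label all match.
    """
    gold, pred = set(gold), set(pred)
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    # Predicted spans are true positives if they appear in gold, else false positives.
    for span in pred:
        counts[span[2]]["tp" if span in gold else "fp"] += 1
    # Gold spans that were never predicted are false negatives.
    for span in gold - pred:
        counts[span[2]]["fn"] += 1
    scores = {}
    for label, c in counts.items():
        p = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        r = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"p": p, "r": r, "f": f}
    return scores


# Made-up sample data: the second ORG span has the wrong end offset.
gold = [(0, 1, "PERSON"), (3, 4, "ORG"), (6, 8, "ORG")]
pred = [(0, 1, "PERSON"), (3, 4, "ORG"), (6, 7, "ORG")]
scores = per_type_prf(gold, pred)
```

An aggregate score over this sample would hide the fact that every error falls on ORG entities. The per-label breakdown makes that visible, which is the motivation for reporting scores per entity type.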

🔴 Bug fixes

  • Fix issue #3490: Add evaluation metrics per entity type to Scorer.
  • Fix issue #3526: Serialize EntityRuler settings correctly.
  • Fix issue #3558: Improve E024 error message for incorrect GoldParse.
  • Fix issue #3611: Fix bug when setting ngram parameter in text classifier.
  • Fix issue #3625: Improve default punctuation rules for Hindi.
  • Fix issue #3707: Improve introspection of custom attributes.
  • Fix issue #3737: Check if component is callable in Language.replace_pipe.
  • Fix issue #3743: Fix documentation of lex_id.
  • Fix issue #3749: Change vector training script to work with latest Gensim.
  • Fix issues #3762, #3934: Make Doc.is_sentenced default to True for single-token Docs.
  • Fix issue #3802: Fix typo in docs example.
  • Fix issue #3811: Fix type of --seed option in spacy pretrain.
  • Fix issue #3822: Allow passing PhraseMatcher arguments to EntityRuler.
  • Fix issue #3839: Ensure the Matcher returns correct match IDs when used with operators.
  • Fix issue #3840: Improve error messages in spacy pretrain.
  • Fix issue #3853: Rename vectors if multiple models are loaded to prevent clashes.
  • Fix issue #3859: Update pretrain to prevent unintended overwriting of weight files.
  • Fix issue #3862: Fix matcher callback example.
  • Fix issue #3868: Add "v.s." to English tokenizer exceptions.
  • Fix issue #3869: Make Doc.count_by work as expected.
  • Fix issue #3880: Fix unflatten padding in Thinc when last element is empty.
  • Fix issue #3882: Exclude user_data when copying doc in displaCy.
  • Fix issue #3892: Update Tokenizer initialization docs.
  • Fix issue #3912: Make text classifier raise more friendly errors.

📖 Documentation and examples

👥 Contributors

Thanks to @BreakBB, @ujwal-narayan, @estr4ng7d, @maknotavailable, @ramananbalakrishnan, @nipunsadvilkar, @NirantK, @munozbravo, @intrafindBreno, @Azagh3l, @jarib, @tokestermw, @polm, @skrcode, @kabirkhan, @demongolem, @elbaulp, @clarus, @BramVanroy, @rokasramas, @askhogan, @khellan, @kognate, @cedar101 and @yash1994 for the pull requests and contributions.