Releases: explosion/spaCy

v3.1.2: Improved spancat component and various bugfixes

20 Aug 13:13
e1f88de

✨ New features and improvements

  • NEW: Provide scores for the SpanCategorizer predictions.
  • NEW: Broader compatibility with type checkers thanks to .pyi stub files.
  • NEW: Auto-detect package dependencies in spacy package.
  • New INTERSECTS operator for the Matcher.
  • More debugging info for spacy project push and pull commands.
  • Allow passing in a precomputed array for speeding up multiple Span.as_doc calls.
  • The default da transformer is now the same as the one from the trained pipelines (Maltehb/danish-bert-botxo).
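The new INTERSECTS operator sits alongside the Matcher's other set-style predicates. The following is a minimal pure-Python sketch of the semantics as we understand them, not spaCy's implementation: the helper name `set_operator_matches` and the example feature values are hypothetical. In a real pattern, the operator would presumably appear in the form `{"MORPH": {"INTERSECTS": [...]}}`.

```python
from typing import Iterable


def set_operator_matches(
    operator: str, token_values: Iterable[str], pattern_values: Iterable[str]
) -> bool:
    """Evaluate one set-style predicate on a token's values (illustrative only)."""
    token_set = set(token_values)
    pattern_set = set(pattern_values)
    if operator == "IS_SUBSET":
        return token_set <= pattern_set
    if operator == "IS_SUPERSET":
        return token_set >= pattern_set
    if operator == "INTERSECTS":
        # True as soon as the two sets share at least one value.
        return not token_set.isdisjoint(pattern_set)
    raise ValueError(f"Unsupported operator: {operator}")


# Hypothetical morphological features of a single token.
token_morph = ["Number=Sing", "Tense=Past"]
wanted = ["Tense=Past", "Tense=Pres"]

print(set_operator_matches("INTERSECTS", token_morph, wanted))   # True
print(set_operator_matches("IS_SUBSET", token_morph, wanted))    # False
print(set_operator_matches("IS_SUPERSET", token_morph, wanted))  # False
```

INTERSECTS is the loosest of the three: a single shared value is enough, whereas IS_SUBSET and IS_SUPERSET constrain the whole set.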

🔴 Bug fixes

  • Fix issue #8767: Fix offsets of empty and out-of-bounds spans.
  • Fix issue #8774: Ensure debug data runs correctly with a custom tokenizer.
  • Fix issue #8784: Fix incorrect ISSUBSET and ISSUPERSET in schema and docs.
  • Fix issue #8796: Respect the no_skip value for spacy project run.
  • Fix issue #8810: Make ConsoleLogger flush after each logging line.
  • Fix issue #8819: Pass exclude when serializing the vocab.
  • Fix issue #8830: Avoid adding sourced vectors hashes if not necessary.
  • Fix issue #8970: Fix allow_overlap default for span categorizer scoring.
  • Fix issue #8982: Add glossary entry for _SP.
  • Fix issue #9007: Fix span categorizer training on nested entities.

📖 Documentation and examples

👥 Contributors

@adrianeboyd, @bbieniek, @DuyguA, @ezorita, @HLasse, @honnibal, @ines, @kabirkhan, @kevinlu1248, @ldorigo, @Ledenel, @nsorros, @polm, @svlandeg, @swfarnsworth, @themrmax, @thomashacker

v3.0.7: Bug fixes and base support for Azerbaijani

23 Jul 08:37
034ac0a

✨ New features and improvements

  • Alpha tokenization support for Azerbaijani.
  • Updates for French stop words.

🔴 Bug fixes

  • Fix issue #7629: Fix scoring normalization.
  • Fix issue #7886: Fix unknown tokens percentage in debug data.
  • Fix issue #7907: Update load_lookups return type and docstring.
  • Fix issue #7930: Make EntityLinker robust for nO=None.
  • Fix issue #7925: Skip vector ngram backoff if minn is not set.
  • Fix issue #7973: Fix debug model for transformers.
  • Fix issue #7988: Preserve existing ENT_KB_ID in ner annotation.
  • Fix issue #7992: Fix span offsets for Matcher(as_spans) on spans.
  • Fix issue #8004: Handle errors while multiprocessing.
  • Fix issue #8009: Fix Doc.from_docs() for all empty docs.
  • Fix issue #8012: Fix ensemble textcat with listener.
  • Fix issue #8054: Add ENT_ID and NORM to DocBin strings.
  • Fix issue #8055: Handle partial entities in Span.as_doc.
  • Fix issue #8062: Make all Span attrs writable.
  • Fix issue #8066: Update debug data for textcat.
  • Fix issue #8069: Custom warning if DocBin is too large.
  • Fix issue #8113: Support to/from_bytes for KnowledgeBase and EntityLinker.
  • Fix issue #8116: Fix offsets in Span.get_lca_matrix.
  • Fix issue #8132: Remove unsupported attrs from attrs.IDS.
  • Fix issue #8158: Ensure tolerance is passed on in spacy.batch_by_words.v1.
  • Fix issue #8169: Fix bug from EntityRuler: ent_ids returns None for phrases.
  • Fix issue #8208: Address missing config overrides post load of models.
  • Fix issue #8212: Add all symbols in Unicode Currency Symbols to currency characters.
  • Fix issue #8216: Don't add duplicate patterns in EntityRuler.
  • Fix issue #8244: Use context manager when reading model file.
  • Fix issue #8245: Fix other open calls without context managers.
  • Fix issue #8265: Address mypy errors.
  • Fix issue #8299: Restrict pymorphy2 requirement to pymorphy2 mode in Russian and Ukrainian lemmatizers.
  • Fix issue #8335: Raise error if deps not provided with heads in Doc.
  • Fix issue #8368: Preserve whitespace in Span.lemma_.
  • Fix issue #8396: Make JsonlReader path optional.
  • Fix issue #8421: Fix non-deterministic deduplication in Greek lemmatizer.
  • Fix issue #8423: Update validate CLI to fix compat and ignore warnings.
  • Fix issue #8426: Fix setting empty entities in Example.from_dict.
  • Fix issue #8487: Fix span offsets and keys in Doc.from_docs.
  • Fix issue #8584: Raise an error for textcat with <2 labels.
  • Fix issue #8551: Fix duplicate spacy package CLI opts.

👥 Contributors

@adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @fhopp, @frascuchon, @graue70, @ines, @jenojp, @jhroy, @jklaise, @juliensalinas, @meghanabhange, @michael-k, @narayanacharya6, @polm, @sevdimali, @svlandeg, @ZeeD

v3.1.1: Support for Ancient Greek and various bug fixes

20 Jul 08:40
ffaead8

✨ New features and improvements

  • Alpha tokenization support for Ancient Greek.
  • Implementation of a noun_chunk iterator for Dutch.
  • Support for black & flake8 as pre-commit hooks.
  • New spacy.ngram_range_suggester.v1 for suggesting a range of n-gram sizes for the spancat component.
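To illustrate what "suggesting a range of n-gram sizes" means, here is a small pure-Python sketch that enumerates candidate spans as (start, end) token offsets. It is not spaCy's code: the function name `ngram_range_spans` is hypothetical, and the parameter names `min_size` and `max_size` are chosen to mirror what we believe the registered suggester accepts.

```python
from typing import List, Tuple


def ngram_range_spans(
    n_tokens: int, min_size: int, max_size: int
) -> List[Tuple[int, int]]:
    """Return every (start, end) span whose length lies in [min_size, max_size]."""
    if min_size < 1 or max_size < min_size:
        raise ValueError("expected 1 <= min_size <= max_size")
    spans = []
    for size in range(min_size, max_size + 1):
        # Slide a window of this size over the document; the range is
        # empty when the window is longer than the document.
        for start in range(0, n_tokens - size + 1):
            spans.append((start, start + size))
    return spans


tokens = ["New", "York", "City", "subway"]
for start, end in ngram_range_spans(len(tokens), min_size=2, max_size=3):
    print(start, end, " ".join(tokens[start:end]))
```

A wider range gives the span categorizer more candidates to score, at the cost of more computation per document.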

🔴 Bug fixes

  • Fix issue #8638: Fix Azerbaijani initialization.
  • Fix issue #8639: Use 0-vector for OOV lexemes.
  • Fix issue #8640: Update lexeme ranks for loaded vectors.
  • Fix issue #8651: Fix ru and uk multiprocessing (with spawn).
  • Fix issue #8663: Preserve existing meta information with spacy package.
  • Fix issue #8718: Ensure that replace_pipe takes disabled components into account.

👥 Contributors

@adrianeboyd, @honnibal, @ines, @jmyerston, @julien-talkair, @KennethEnevoldsen, @mariosasko, @mylibrar, @polm, @rynoV, @svlandeg, @thomashacker, @yohasebe

v3.1.0: New pipelines for Catalan & Danish, SpanCategorizer for arbitrary overlapping spans, use predicted annotations during training, bug fixes & more

07 Jul 14:34
530b5d7

✨ New features and improvements

For more details, see the New in v3.1 usage guide.

📦 New trained pipelines

Package            Language   UPOS   Parser LAS   NER F
ca_core_news_sm    Catalan    98.2   87.4         79.8
ca_core_news_md    Catalan    98.3   88.2         84.0
ca_core_news_lg    Catalan    98.5   88.4         84.2
ca_core_news_trf   Catalan    98.9   93.0         91.2
da_core_news_trf   Danish     98.0   85.0         82.9

⚠️ Upgrading from v3.0

  • Due to the use of configs with extensive versioning, v3.0 pipelines should be compatible with v3.1. However, you may see slight differences in performance. Test your v3.0 pipeline with v3.1 against your test suite and, if the performance is identical, extend the spacy_version in your model package meta to ">=3.0.0,<3.2.0". If you run into degraded performance, retrain your pipeline with v3.1.
  • Use spacy init fill-config to update a v3.0 config for v3.1.
  • When sourcing a pipeline component that requires static vectors, it is now required to include the source model's vectors in [initialize.vectors].
  • Logger warnings have been converted to Python warnings. Use warnings.filterwarnings or the new helper method spacy.errors.filter_warning(action, error_msg='') to manage warnings.
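Because these are now ordinary Python warnings, the standard library's `warnings` module can filter them. The sketch below uses only the standard library and made-up messages in a "[W###] text" style. The codes and texts are placeholders, not real spaCy warnings, and the helper names are hypothetical. The `filter_warning` helper mentioned above should cover the same use case with less ceremony.

```python
import warnings
from typing import List, Optional


def emit_demo_warnings() -> None:
    # Placeholder messages; real warning codes and texts will differ.
    warnings.warn("[W036] demo: component has no patterns defined", UserWarning)
    warnings.warn("[W999] demo: some other warning", UserWarning)


def collect_messages(silence_code: Optional[str] = None) -> List[str]:
    """Run the demo and return the warning messages that were not filtered out."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        if silence_code is not None:
            # `message` is a regex matched against the start of the warning
            # text, so the square brackets have to be escaped.
            warnings.filterwarnings(
                "ignore", message=r"\[" + silence_code + r"\]"
            )
        emit_demo_warnings()
    return [str(w.message) for w in caught]


print(len(collect_messages()))   # 2
print(collect_messages("W036"))  # only the W999 message remains
```

Filters added later take precedence over earlier ones, which is why the "ignore" rule wins over the blanket "always" rule here.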

For more information, see Notes on upgrading from v3.0.

🔴 Bug fixes

  • Fix issue #7036: Use a context manager when reading model.
  • Fix issue #7629: Fix scoring normalization.
  • Fix issue #7799: Ensure spacy ray command works.
  • Fix issue #7807: Show warning if entity ruler runs without patterns.
  • Fix issue #7886: Fix unknown tokens percentage in debug data.
  • Fix issue #7930: Make EntityLinker robust for nO=None.
  • Fix issue #7925: Skip vector ngram backoff if minn is not set.
  • Fix issue #7973: Fix debug model for transformers.
  • Fix issue #7988: Preserve existing ENT_KB_ID in ner annotation.
  • Fix issue #8004: Handle errors while multiprocessing.
  • Fix issue #8009: Fix Doc.from_docs() for all empty docs.
  • Fix issue #8012: Fix ensemble textcat with listener.
  • Fix issue #8054: Add ENT_ID and NORM to DocBin strings.
  • Fix issue #8055: Handle partial entities in Span.as_doc.
  • Fix issue #8062: Make all Span attrs writable.
  • Fix issue #8066: Update debug data for textcat.
  • Fix issue #8069: Custom warning if DocBin is too large.
  • Fix issue #8099: Update Vietnamese tokenizer.
  • Fix issue #8113: Support to/from_bytes for KnowledgeBase and EntityLinker.
  • Fix issue #8116: Fix offsets in Span.get_lca_matrix.
  • Fix issue #8132: Remove unsupported attrs from attrs.IDS.
  • Fix issue #8158: Ensure tolerance is passed on in spacy.batch_by_words.v1.
  • Fix issue #8169: Fix bug from EntityRuler: ent_ids returns None for phrases.
  • Fix issue #8208: Address missing config overrides post load of models.
  • Fix issue #8212: Add all symbols in Unicode Currency Symbols to currency characters.
  • Fix issue #8216: Don't add duplicate patterns in EntityRuler.
  • Fix issue #8265: Address mypy errors.
  • Fix issue #8335: Raise error if deps not provided with heads in Doc.
  • Fix issue #8368: Preserve whitespace in Span.lemma_.
  • Fix issue #8388: Don't clobber vectors when loading components from source models.
  • Fix issue #8421: Fix non-deterministic deduplication in Greek lemmatizer.
  • Fix issue #8426: Fix setting empty entities in Example.from_dict.
  • Fix issue #8441: Add correct types for Language.pipe return values.
  • Fix issue #8487: Fix span offsets and keys in Doc.from_docs.
  • Fix issue #8559: Fix vectors check for sourced components.
  • Fix issue #8584: Raise an error for textcat with <2 labels.

👥 Contributors

@aajanki, @adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @explosion-bot, @fhopp, @frascuchon, @graue70, @gtoffoli, @honnibal, @ines, @jacopofar, @jenojp, @jhroy, @jklaise, @juliensalinas, @kevinlu1248, @ldorigo, @mathcass, @meghanabhange, @michael-k, @narayanacharya6, @NirantK, @nsorros, @polm, @sevdimali, @svlandeg, @themrmax, @xadrianzetx, @yohasebe, @ZeeD

v2.3.7: Bug fix for download CLI

04 Jun 18:56
cae72e4

🔴 Bug fixes

  • Fix issue #8286: Fix spacy download.

v2.3.6: Bug fixes and base support for Amharic

18 May 06:23
2c1de4b

✨ New features and improvements

  • Add base support for Amharic.
  • Add noun chunk iterator for Danish.
  • Updates to French, Portuguese and Romanian stop words.

🔴 Bug fixes

  • Fix issue #6705: Fix deserialization of null token_match and url_match for the tokenizer.
  • Fix issue #6712: Prevent overlapping noun chunks for Spanish.
  • Fix issue #6745: Fix minibatch iterator when size iterator is finished.
  • Fix issue #6759: Skip 0-length matches in the Matcher.
  • Fix issue #6771: Support IS_SENT_START in the PhraseMatcher.
  • Fix issue #6772: Fix Span.text for empty spans.
  • Fix issue #6820: Improve Doc.char_span alignment_mode handling.
  • Fix issue #6857: Remove --no-cache-dir when downloading models.
  • Fix issue #8115: Fix offsets in Span.get_lca_matrix.

👥 Contributors

Thanks to @alexcombessie, @AMArostegui, @bryant1410, @Cristianasp, @garethsparks, @jenojp, @jganseman, @jumasheff, @lorenanda, @ophelielacroix, @thomasbird, @timgates42, @tupui and @yosiasz for the pull requests and contributions.

v3.0.6: assemble CLI, Matcher alignments, training from streamed corpora and many bug fixes

23 Apr 12:15
df34444

✨ New features and improvements

  • New assemble CLI command for assembling a pipeline from a config without training.
  • Add support for match alignments in the Matcher to align matched tokens with matcher patterns.
  • Add support for training from streamed corpora.
  • Add support for W&B data and model checkpoint logging and versioning in spacy.WandbLogger.v2.
  • Extend Scorer.score_spans to support overlapping and unlabeled spans.
  • Update debug data for new v3 components.
  • Improve language data for Italian.
  • Various improvements to error handling and UX.
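As a rough illustration of what match alignments make possible, the sketch below groups matched tokens by the pattern position they were aligned with. It assumes the alignment is a list holding one pattern index per matched token, which is our reading of the feature. The helper name and the sample data are hypothetical and no spaCy calls are made.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_tokens_by_pattern(
    tokens: Sequence[str], alignments: Sequence[int]
) -> Dict[int, List[str]]:
    """Map each pattern index to the matched tokens aligned with it."""
    if len(tokens) != len(alignments):
        raise ValueError("expected one alignment entry per matched token")
    grouped: Dict[int, List[str]] = defaultdict(list)
    for token, pattern_index in zip(tokens, alignments):
        grouped[pattern_index].append(token)
    return dict(grouped)


# Assumed example: a three-part pattern whose first part may repeat,
# matched against four tokens.
matched_tokens = ["very", "very", "good", "movie"]
alignments = [0, 0, 1, 2]

print(group_tokens_by_pattern(matched_tokens, alignments))
# {0: ['very', 'very'], 1: ['good'], 2: ['movie']}
```

This kind of grouping is useful when a pattern contains quantifiers and you need to know which tokens a repeated or optional part actually consumed.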

🔴 Bug fixes

  • Fix issue #7408: Add vocab kwarg to spacy.load.
  • Fix issue #7419: Exclude user hooks in displacy conversion.
  • Fix issue #7421: Update --code usage in CLI commands.
  • Fix issue #7424: Preserve sent starts on retokenization without parse.
  • Fix issue #7440: Fix pymorphy2 lookup lemmatizer.
  • Fix issue #7471: Improve warnings related to listening components.
  • Fix issue #7488: Fix upstream check in pretraining.
  • Fix issue #7489: Support callbacks entry points.
  • Fix issue #7497: Merge doc.spans in Doc.from_docs().
  • Fix issue #7528: Preserve user data for DependencyMatcher on spans.
  • Fix issue #7557: Fix __add__ method for PRFScore.
  • Fix issue #7574: Fix conversion of custom extension data in Span.as_doc and Doc.from_docs.
  • Fix issue #7620: Fix replace_listeners in configs.
  • Fix issue #7626: Fix vectors data on GPU.
  • Fix issue #7630: Update NEL for entities crossing sentence boundaries.
  • Fix issue #7631: Fix parser sourcing in NER converter.
  • Fix issue #7642: Fix handling of hyphen string value in config files.
  • Fix issue #7655: Fix sent starts when converting from v2 JSON training format.
  • Fix issue #7674: Fix handling of unknown tokens in StaticVectors.
  • Fix issue #7690: Fix pickling of Lemmatizer.
  • Fix issue #7749: Update Tokenizer.explain for special cases in v3.
  • Fix issue #7755: Fix config parsing of ints/strings.
  • Fix issue #7836: Fix tokenizer cache flushing.
  • Fix issue #7847: Fix handling of boolean values in Example.from_dict for sent starts.

📖 Documentation and examples

  • Add documentation for legacy functions and architectures.
  • Add documentation for pretrained pipeline design.
  • Add more details about pipe and multiprocessing.
  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @alvaroabascar, @armsp, @AyushExel, @BramVanroy, @broaddeep, @bryant1410, @bsweileh, @dpalmasan, @Findus23, @graue70, @jaidevd, @koaning, @langdonholmes, @m0canu1, @meghanabhange, @paoloq, @plison, @richardpaulhudson, @SamEdwardes, @Stannislav for the pull requests and contributions!

v3.0.5: Bug fix for thinc requirement

10 Mar 11:32
53a3b96

🔴 Bug fixes

  • Fix related to issue #7075: Update thinc requirement for Jupyter notebook GPU warning.

v3.0.4: Fix tok2vec pretraining, source disabled components, better UX and bug fixes

10 Mar 01:22
3b911ee

✨ New features and improvements

  • Allow sourcing disabled components in config.
  • Support Doc.spans in Example.from_dict.
  • Improve transformer recommendations in quickstart widget and init config.
  • Improve language data for Bulgarian.
  • Various improvements to error handling and UX.

🔴 Bug fixes

  • Fix issue #6952, #7285, #7289: Make tok2vec pretraining and pretrain command work as expected again.
  • Fix issue #7062: Only evaluate named entities for NEL if there is a corresponding gold span.
  • Fix issue #7065: Correctly handle sentence boundaries in Span.sent.
  • Fix issue #7071: Fix conll converter option.
  • Fix issue #7100: Re-add n_sents to entity linker and fix config handling and I/O.
  • Fix issue #7122: Fix displaCy output in evaluate CLI.
  • Fix issue #7127: Fix initialization of UkrainianLemmatizer.
  • Fix issue #7176: Re-refactor Sentencizer to use Pipe API.
  • Fix issue #7182: Allow SpanGroup import from spacy.tokens.
  • Fix issue #7204: Adjust Cython compilation for setups with custom include paths.
  • Fix issue #7222: Correct YAML formatting in quickstart recommendations for bg and bn.
  • Fix issue #7225: Fix spans weakref in Doc.copy.
  • Fix issue #7237: Fix is_cython_func for additional imported code.
  • Fix issue #7250: Fix patience for identical scores.
  • Fix issue #7329: Make spacy.orth_variants.v1 and spacy.lower_case.v1 augmenters work as expected.
  • Fix issue #7352: Sort EntityRuler.labels alphabetically.

📖 Documentation and examples

  • Add documentation for textcat_multilabel component.
  • Extend documentation for Vocab.get_noun_chunks.
  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @MartinoMensio, @SergeyShk, @R1j1t, @palandlom, @dardoria, @tocic, @clippered, @graue70, @koaning and @jankrepl for the pull requests and contributions!

v3.0.3: Bug fixes for sentence segmentation and config filling

14 Feb 04:43
f4f46b6

🔴 Bug fixes

  • Fix issue #7035, #7056: Fix parser transition bug that could lead to incorrect sentence fragments.
  • Fix issue #7055: Preserve sourced components in init fill-config.

📖 Documentation and examples

  • Update spaCy Universe.

👥 Contributors

Thanks @MartinoMensio for the pull request!