Releases · explosion/spaCy

20 Aug 13:13

svlandeg

v3.1.2

e1f88de

v3.1.2: Improved spancat component and various bugfixes

✨ New features and improvements

NEW: Provide scores for the SpanCategorizer predictions.
NEW: Broader compatibility with type checkers thanks to .pyi stub files.
NEW: Auto-detect package dependencies in spacy package.
New INTERSECTS operator for the Matcher.
More debugging info for spacy project push and pull commands.
Allow passing in a precomputed array for speeding up multiple Span.as_doc calls.
The default da transformer is now the same as the one from the trained pipelines (Maltehb/danish-bert-botxo).
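
As an illustration of the new INTERSECTS operator, here is a minimal sketch, assuming it is applied to the MORPH attribute and that a token matches when its morphological features share at least one entry with the given list. The example words, features and the pattern name are made up for the demonstration; the Doc is built by hand so that no trained pipeline is needed.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-built Doc with illustrative morphological features (not model output).
doc = Doc(
    nlp.vocab,
    words=["the", "dogs", "barked"],
    morphs=["Definite=Def|PronType=Art", "Number=Plur", "Tense=Past"],
)

matcher = Matcher(nlp.vocab)
# Match any token whose MORPH features overlap with this list.
pattern = [{"MORPH": {"INTERSECTS": ["Number=Plur", "Definite=Def"]}}]
matcher.add("PLURAL_OR_DEFINITE", [pattern])  # hypothetical pattern name

matched_tokens = sorted(doc[start:end].text for _, start, end in matcher(doc))
print(matched_tokens)
```

In this sketch "barked" is not expected to match, because its only feature (Tense=Past) does not appear in the list.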

🔴 Bug fixes

Fix issue #8767: Fix offsets of empty and out-of-bounds spans.
Fix issue #8774: Ensure debug data runs correctly with a custom tokenizer.
Fix issue #8784: Fix incorrect ISSUBSET and ISSUPERSET in schema and docs.
Fix issue #8796: Respect the no_skip value for spacy project run.
Fix issue #8810: Make ConsoleLogger flush after each logging line.
Fix issue #8819: Pass exclude when serializing the vocab.
Fix issue #8830: Avoid adding sourced vectors hashes if not necessary.
Fix issue #8970: Fix allow_overlap default for span categorizer scoring.
Fix issue #8982: Add glossary entry for _SP.
Fix issue #9007: Fix span categorizer training on nested entities.

📖 Documentation and examples

New developer documentation covering spaCy's internals and code conventions.
Added a documentation section on preparing training data in spaCy's binary format.
Updated some error/log messages to be more informative.
Various updates to the documentation.
A few new additions to the spaCy universe.

👥 Contributors

@adrianeboyd, @bbieniek, @DuyguA, @ezorita, @HLasse, @honnibal, @ines, @kabirkhan, @kevinlu1248, @ldorigo, @Ledenel, @nsorros, @polm, @svlandeg, @swfarnsworth, @themrmax, @thomashacker

23 Jul 08:37

adrianeboyd

v3.0.7

034ac0a

v3.0.7: Bug fixes and base support for Azerbaijani

✨ New features and improvements

Alpha tokenization support for Azerbaijani.
Updates for French stop words.

🔴 Bug fixes

Fix issue #7629: Fix scoring normalization.
Fix issue #7886: Fix unknown tokens percentage in debug data.
Fix issue #7907: Update load_lookups return type and docstring.
Fix issue #7930: Make EntityLinker robust for nO=None.
Fix issue #7925: Skip vector ngram backoff if minn is not set.
Fix issue #7973: Fix debug model for transformers.
Fix issue #7988: Preserve existing ENT_KB_ID in ner annotation.
Fix issue #7992: Fix span offsets for Matcher(as_spans) on spans.
Fix issue #8004: Handle errors while multiprocessing.
Fix issue #8009: Fix Doc.from_docs() for all empty docs.
Fix issue #8012: Fix ensemble textcat with listener.
Fix issue #8054: Add ENT_ID and NORM to DocBin strings.
Fix issue #8055: Handle partial entities in Span.as_doc.
Fix issue #8062: Make all Span attrs writable.
Fix issue #8066: Update debug data for textcat.
Fix issue #8069: Custom warning if DocBin is too large.
Fix issue #8113: Support to/from_bytes for KnowledgeBase and EntityLinker.
Fix issue #8116: Fix offsets in Span.get_lca_matrix.
Fix issue #8132: Remove unsupported attrs from attrs.IDS.
Fix issue #8158: Ensure tolerance is passed on in spacy.batch_by_words.v1.
Fix issue #8169: Fix bug from EntityRuler: ent_ids returns None for phrases.
Fix issue #8208: Address missing config overrides post load of models.
Fix issue #8212: Add all symbols in Unicode Currency Symbols to currency characters.
Fix issue #8216: Don't add duplicate patterns in EntityRuler.
Fix issue #8244: Use context manager when reading model file.
Fix issue #8245: Fix other open calls without context managers.
Fix issue #8265: Address mypy errors.
Fix issue #8299: Restrict pymorphy2 requirement to pymorphy2 mode in Russian and Ukrainian lemmatizers.
Fix issue #8335: Raise error if deps not provided with heads in Doc.
Fix issue #8368: Preserve whitespace in Span.lemma_.
Fix issue #8396: Make JsonlReader path optional.
Fix issue #8421: Fix non-deterministic deduplication in Greek lemmatizer.
Fix issue #8423: Update validate CLI to fix compat and ignore warnings.
Fix issue #8426: Fix setting empty entities in Example.from_dict.
Fix issue #8487: Fix span offsets and keys in Doc.from_docs.
Fix issue #8584: Raise an error for textcat with <2 labels.
Fix issue #8551: Fix duplicate spacy package CLI opts.

👥 Contributors

@adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @fhopp, @frascuchon, @graue70, @ines, @jenojp, @jhroy, @jklaise, @juliensalinas, @meghanabhange, @michael-k, @narayanacharya6, @polm, @sevdimali, @svlandeg, @ZeeD

20 Jul 08:40

svlandeg

v3.1.1

ffaead8

v3.1.1: Support for Ancient Greek and various bug fixes

✨ New features and improvements

Alpha tokenization support for Ancient Greek.
Implementation of a noun_chunk iterator for Dutch.
Support for black & flake8 as pre-commit hooks.
New spacy.ngram_range_suggester.v1 for suggesting a range of n-gram sizes for the spancat component.
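
To make the n-gram range suggester concrete, the sketch below calls the Python function that, to my knowledge, is registered as spacy.ngram_range_suggester.v1 (build_ngram_range_suggester in spacy.pipeline.spancat; treat the import path as an assumption if your version differs). In a training config you would normally refer to the registered name instead. The sentence and the size range are arbitrary.

```python
import spacy
from spacy.pipeline.spancat import build_ngram_range_suggester

nlp = spacy.blank("en")
doc = nlp("Welcome to the bank")  # four tokens with the blank English tokenizer

# Suggest every span that is 1, 2 or 3 tokens long.
suggester = build_ngram_range_suggester(min_size=1, max_size=3)
candidates = suggester([doc])  # a Ragged array: one row of (start, end) per span

n_candidates = int(candidates.lengths[0])
spans = [tuple(row) for row in candidates.data.tolist()]
print(n_candidates, spans[:4])
```

For a four-token doc this yields 4 unigrams, 3 bigrams and 2 trigrams, which gives a feel for how quickly the candidate set grows with max_size.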

🔴 Bug fixes

Fix issue #8638: Fix Azerbaijani initialization.
Fix issue #8639: Use 0-vector for OOV lexemes.
Fix issue #8640: Update lexeme ranks for loaded vectors.
Fix issue #8651: Fix ru and uk multiprocessing (with spawn).
Fix issue #8663: Preserve existing meta information with spacy package.
Fix issue #8718: Ensure that replace_pipe takes disabled components into account.

👥 Contributors

@adrianeboyd, @honnibal, @ines, @jmyerston, @julien-talkair, @KennethEnevoldsen, @mariosasko, @mylibrar, @polm, @rynoV, @svlandeg, @thomashacker, @yohasebe

07 Jul 14:34

adrianeboyd

v3.1.0

530b5d7

v3.1.0: New pipelines for Catalan & Danish, SpanCategorizer for arbitrary overlapping spans, use predicted annotations during training, bug fixes & more

✨ New features and improvements

NEW: Trained pipelines for Catalan and a new transformer-based pipeline for Danish.
NEW: Experimental SpanCategorizer component for labeling arbitrary and potentially overlapping spans of text.
NEW: Use predicted annotations during training via the [training.annotating_components] config setting.
Alpha tokenization support for Azerbaijani.
Part-of-speech tag-based lemmatizers for Catalan and Italian.
The TextCatCNN and TextCatBOW architectures are now resizable.
Support updating the EntityRecognizer with known incorrect span annotations.
Auto-generate a pretty README.md based on the meta in spacy package.

For more details, see the New in v3.1 usage guide.

📦 New trained pipelines

Package	Language	UPOS	Parser LAS	NER F
`ca_core_news_sm`	Catalan	98.2	87.4	79.8
`ca_core_news_md`	Catalan	98.3	88.2	84.0
`ca_core_news_lg`	Catalan	98.5	88.4	84.2
`ca_core_news_trf`	Catalan	98.9	93.0	91.2
`da_core_news_trf`	Danish	98.0	85.0	82.9

⚠️ Upgrading from v3.0

Because configs are extensively versioned, v3.0 pipelines should be compatible with v3.1, although you may see slight differences in performance. Test your v3.0 pipeline with v3.1 against your test suite; if the performance is identical, extend the spacy_version in your model package meta to ">=3.0.0,<3.2.0". If you run into degraded performance, retrain your pipeline with v3.1.
Use spacy init fill-config to update a v3.0 config for v3.1.
When sourcing a pipeline component that requires static vectors, it is now required to include the source model's vectors in [initialize.vectors].
Logger warnings have been converted to Python warnings. Use warnings.filterwarnings or the new helper method spacy.errors.filter_warning(action, error_msg='') to manage warnings.
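
Since warnings are now regular Python warnings, the standard library is enough to filter them. The sketch below uses only warnings.filterwarnings; the two warning messages and their codes are hypothetical stand-ins, not real spaCy warning texts.

```python
import warnings

# Hypothetical messages for illustration only.
NOISY = "[W000] example warning that we want to silence"
OTHER = "[W001] example warning that we want to keep"


def emit_warnings():
    """Stand-in for library code that raises Python warnings."""
    warnings.warn(NOISY, UserWarning)
    warnings.warn(OTHER, UserWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # `message` is a regular expression matched against the start of the text.
    warnings.filterwarnings("ignore", message=r"\[W000\]")
    emit_warnings()

kept = [str(w.message) for w in caught]
print(kept)
```

The helper spacy.errors.filter_warning(action, error_msg='') mentioned above is described as a convenience for the same task; it is not called here so that the example runs without spaCy installed.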

For more information, see Notes on upgrading from v3.0.
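
The version-range step from the upgrade notes can be scripted. This is a minimal sketch, assuming the package meta lives in a JSON file (commonly meta.json) with a spacy_version key; the helper name set_spacy_version, the pipeline name and the starting range are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def set_spacy_version(meta_path: Path, version_range: str) -> dict:
    """Rewrite the spacy_version field of a pipeline meta file (hypothetical helper)."""
    meta = json.loads(meta_path.read_text(encoding="utf8"))
    meta["spacy_version"] = version_range
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf8")
    return meta


# Demonstrate on a throwaway file instead of a real package.
with tempfile.TemporaryDirectory() as tmp:
    meta_path = Path(tmp) / "meta.json"
    original = {"name": "my_pipeline", "spacy_version": ">=3.0.0,<3.1.0"}
    meta_path.write_text(json.dumps(original), encoding="utf8")

    updated = set_spacy_version(meta_path, ">=3.0.0,<3.2.0")
    on_disk = json.loads(meta_path.read_text(encoding="utf8"))

print(updated["spacy_version"])
```

Only widen the range after the pipeline has passed your own test suite under v3.1, as the note above recommends.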

🔴 Bug fixes

Fix issue #7036: Use a context manager when reading model.
Fix issue #7629: Fix scoring normalization.
Fix issue #7799: Ensure spacy ray command works.
Fix issue #7807: Show warning if entity ruler runs without patterns.
Fix issue #7886: Fix unknown tokens percentage in debug data.
Fix issue #7930: Make EntityLinker robust for nO=None.
Fix issue #7925: Skip vector ngram backoff if minn is not set.
Fix issue #7973: Fix debug model for transformers.
Fix issue #7988: Preserve existing ENT_KB_ID in ner annotation.
Fix issue #8004: Handle errors while multiprocessing.
Fix issue #8009: Fix Doc.from_docs() for all empty docs.
Fix issue #8012: Fix ensemble textcat with listener.
Fix issue #8054: Add ENT_ID and NORM to DocBin strings.
Fix issue #8055: Handle partial entities in Span.as_doc.
Fix issue #8062: Make all Span attrs writable.
Fix issue #8066: Update debug data for textcat.
Fix issue #8069: Custom warning if DocBin is too large.
Fix issue #8099: Update Vietnamese tokenizer.
Fix issue #8113: Support to/from_bytes for KnowledgeBase and EntityLinker.
Fix issue #8116: Fix offsets in Span.get_lca_matrix.
Fix issue #8132: Remove unsupported attrs from attrs.IDS.
Fix issue #8158: Ensure tolerance is passed on in spacy.batch_by_words.v1.
Fix issue #8169: Fix bug from EntityRuler: ent_ids returns None for phrases.
Fix issue #8208: Address missing config overrides post load of models.
Fix issue #8212: Add all symbols in Unicode Currency Symbols to currency characters.
Fix issue #8216: Don't add duplicate patterns in EntityRuler.
Fix issue #8265: Address mypy errors.
Fix issue #8335: Raise error if deps not provided with heads in Doc.
Fix issue #8368: Preserve whitespace in Span.lemma_.
Fix issue #8388: Don't clobber vectors when loading components from source models.
Fix issue #8421: Fix non-deterministic deduplication in Greek lemmatizer.
Fix issue #8426: Fix setting empty entities in Example.from_dict.
Fix issue #8441: Add correct types for Language.pipe return values.
Fix issue #8487: Fix span offsets and keys in Doc.from_docs.
Fix issue #8559: Fix vectors check for sourced components.
Fix issue #8584: Raise an error for textcat with <2 labels.

👥 Contributors

@aajanki, @adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @explosion-bot, @fhopp, @frascuchon, @graue70, @gtoffoli, @honnibal, @ines, @jacopofar, @jenojp, @jhroy, @jklaise, @juliensalinas, @kevinlu1248, @ldorigo, @mathcass, @meghanabhange, @michael-k, @narayanacharya6, @NirantK, @nsorros, @polm, @sevdimali, @svlandeg, @themrmax, @xadrianzetx, @yohasebe, @ZeeD

04 Jun 18:56

adrianeboyd

v2.3.7

cae72e4

v2.3.7: Bug fix for download CLI

🔴 Bug fixes

Fix issue #8286: Fix spacy download.

18 May 06:23

adrianeboyd

v2.3.6

2c1de4b

v2.3.6: Bug fixes and base support for Amharic

✨ New features and improvements

Add base support for Amharic.
Add noun chunk iterator for Danish.
Updates to French, Portuguese and Romanian stop words.

🔴 Bug fixes

Fix issue #6705: Fix deserialization of null token_match and url_match for the tokenizer.
Fix issue #6712: Prevent overlapping noun chunks for Spanish.
Fix issue #6745: Fix minibatch iterator when size iterator is finished.
Fix issue #6759: Skip 0-length matches in the Matcher.
Fix issue #6771: Support IS_SENT_START in the PhraseMatcher.
Fix issue #6772: Fix Span.text for empty spans.
Fix issue #6820: Improve Doc.char_span alignment_mode handling.
Fix issue #6857: Remove --no-cache-dir when downloading models.
Fix issue #8115: Fix offsets in Span.get_lca_matrix.

👥 Contributors

Thanks to @alexcombessie, @AMArostegui, @bryant1410, @Cristianasp, @garethsparks, @jenojp, @jganseman, @jumasheff, @lorenanda, @ophelielacroix, @thomasbird, @timgates42, @tupui and @yosiasz for the pull requests and contributions.

23 Apr 12:15

adrianeboyd

v3.0.6

df34444

v3.0.6: assemble CLI, Matcher alignments, training from streamed corpora and many bug fixes

✨ New features and improvements

New assemble CLI command for assembling a pipeline from a config without training.
Add support for match alignments in the Matcher to align matched tokens with matcher patterns.
Add support for training from streamed corpora.
Add support for W&B data and model checkpoint logging and versioning in spacy.WandbLogger.v2.
Extend Scorer.score_spans to support overlapping and unlabeled spans.
Update debug data for new v3 components.
Improve language data for Italian.
Various improvements to error handling and UX.
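
A short sketch of the match alignments feature, assuming it is enabled with the with_alignments=True argument when calling the Matcher and that each match then carries a list mapping every matched token to the index of the pattern entry it satisfied. The toy words and the pattern name are made up.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=["a", "a", "b"])

matcher = Matcher(nlp.vocab)
# Pattern entry 0: one or more "a"; pattern entry 1: a single "b".
pattern = [{"ORTH": "a", "OP": "+"}, {"ORTH": "b"}]
matcher.add("A_PLUS_B", [pattern])  # hypothetical pattern name

matches = matcher(doc, with_alignments=True)

# Map each matched (start, end) span to its per-token pattern indices.
alignments_by_span = {
    (start, end): list(alignment) for _, start, end, alignment in matches
}
print(alignments_by_span)
```

For the full span (0, 3), the first two tokens align with pattern entry 0 and the last token with entry 1, which is useful when you need to know which part of a pattern with operators consumed which tokens.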

🔴 Bug fixes

Fix issue #7408: Add vocab kwarg to spacy.load.
Fix issue #7419: Exclude user hooks in displacy conversion.
Fix issue #7421: Update --code usage in CLI commands.
Fix issue #7424: Preserve sent starts on retokenization without parse.
Fix issue #7440: Fix pymorphy2 lookup lemmatizer.
Fix issue #7471: Improve warnings related to listening components.
Fix issue #7488: Fix upstream check in pretraining.
Fix issue #7489: Support callbacks entry points.
Fix issue #7497: Merge doc.spans in Doc.from_docs().
Fix issue #7528: Preserve user data for DependencyMatcher on spans.
Fix issue #7557: Fix __add__ method for PRFScore.
Fix issue #7574: Fix conversion of custom extension data in Span.as_doc and Doc.from_docs.
Fix issue #7620: Fix replace_listeners in configs.
Fix issue #7626: Fix vectors data on GPU.
Fix issue #7630: Update NEL for entities crossing sentence boundaries.
Fix issue #7631: Fix parser sourcing in NER converter.
Fix issue #7642: Fix handling of hyphen string value in config files.
Fix issue #7655: Fix sent starts when converting from v2 JSON training format.
Fix issue #7674: Fix handling of unknown tokens in StaticVectors.
Fix issue #7690: Fix pickling of Lemmatizer.
Fix issue #7749: Update Tokenizer.explain for special cases in v3.
Fix issue #7755: Fix config parsing of ints/strings.
Fix issue #7836: Fix tokenizer cache flushing.
Fix issue #7847: Fix handling of boolean values in Example.from_dict for sent starts.

📖 Documentation and examples

Add documentation for legacy functions and architectures.
Add documentation for pretrained pipeline design.
Add more details about pipe and multiprocessing.
Fix various typos and inconsistencies.

👥 Contributors

Thanks to @alvaroabascar, @armsp, @AyushExel, @BramVanroy, @broaddeep, @bryant1410, @bsweileh, @dpalmasan, @Findus23, @graue70, @jaidevd, @koaning, @langdonholmes, @m0canu1, @meghanabhange, @paoloq, @plison, @richardpaulhudson, @SamEdwardes, @Stannislav for the pull requests and contributions!

10 Mar 11:32

adrianeboyd

v3.0.5

53a3b96

v3.0.5: Bug fix for thinc requirement

🔴 Bug fixes

Fix related to issue #7075: Update thinc requirement for Jupyter notebook GPU warning.

10 Mar 01:22

ines

v3.0.4

3b911ee

v3.0.4: Fix tok2vec pretraining, source disabled components, better UX and bug fixes

✨ New features and improvements

Allow sourcing disabled components in config.
Support Doc.spans in Example.from_dict.
Improve transformer recommendations in quickstart widget and init config.
Improve language data for Bulgarian.
Various improvements to error handling and UX.
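
As a sketch of Doc.spans support in Example.from_dict, the snippet below assumes the annotation dict takes a "spans" entry that maps a span group key to a list of (start_char, end_char, label) tuples. The text, the "cities" key and the LOC label are illustrative choices.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
doc = nlp.make_doc("I like London and Berlin")

# Character offsets: "London" is 7-13, "Berlin" is 18-24.
annotations = {"spans": {"cities": [(7, 13, "LOC"), (18, 24, "LOC")]}}
example = Example.from_dict(doc, annotations)

# The annotations end up on the reference Doc under the same key.
cities = [(span.text, span.label_) for span in example.reference.spans["cities"]]
print(cities)
```

Span groups created this way may overlap, unlike doc.ents, which is what makes them a fit for components such as the span categorizer.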

🔴 Bug fixes

Fix issue #6952, #7285, #7289: Make tok2vec pretraining and pretrain command work as expected again.
Fix issue #7062: Only evaluate named entities for NEL if there is a corresponding gold span.
Fix issue #7065: Correctly handle sentence boundaries in Span.sent.
Fix issue #7071: Fix conll converter option.
Fix issue #7100: Re-add n_sents to entity linker and fix config handling and I/O.
Fix issue #7122: Fix displaCy output in evaluate CLI.
Fix issue #7127: Fix initialization of UkrainianLemmatizer.
Fix issue #7176: Re-refactor Sentencizer to use Pipe API.
Fix issue #7182: Allow SpanGroup import from spacy.tokens.
Fix issue #7204: Adjust Cython compilation for setups with custom include paths.
Fix issue #7222: Correct YAML formatting in quickstart recommendations for bg and bn.
Fix issue #7225: Fix spans weakref in Doc.copy.
Fix issue #7237: Fix is_cython_func for additional imported code.
Fix issue #7250: Fix patience for identical scores.
Fix issue #7329: Make spacy.orth_variants.v1 and spacy.lower_case.v1 augmenters work as expected.
Fix issue #7352: Sort EntityRuler.labels alphabetically.

📖 Documentation and examples

Add documentation for textcat_multilabel component.
Extend documentation for Vocab.get_noun_chunks.
Fix various typos and inconsistencies.

👥 Contributors

Thanks to @MartinoMensio, @SergeyShk, @R1j1t, @palandlom, @dardoria, @tocic, @clippered, @graue70, @koaning and @jankrepl for the pull requests and contributions!

14 Feb 04:43

ines

v3.0.3

f4f46b6

v3.0.3: Bug fixes for sentence segmentation and config filling

🔴 Bug fixes

Fix issue #7035, #7056: Fix parser transition bug that could lead to incorrect sentence fragments.
Fix issue #7055: Preserve sourced components in init fill-config.

📖 Documentation and examples

Update spaCy Universe.

👥 Contributors

Thanks @MartinoMensio for the pull request!
