Releases: explosion/spaCy
v2.3.0: Models for Chinese, Danish, Japanese, Polish and Romanian, new updated models with vectors, faster loading, small API improvements & lots of bug fixes
⚠️ This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions. If you've been training your own models, you'll need to retrain them with the new version.
✨ New features and improvements
- NEW: Pretrained model families for Chinese, Danish, Japanese, Polish and Romanian, as well as larger models with vectors for Dutch, German, French, Italian, Greek, Lithuanian, Portuguese and Spanish. 29 new models and 46 model packages in total!
- NEW: 2-4× faster loading times for models with vectors and 2× smaller packages.
- NEW: Alpha support for Armenian, Gujarati and Malayalam.
- NEW: Lookup lemmatization for Polish.
- NEW: Allow `Matcher` to match on both `Doc` and `Span` objects.
- NEW: Add `Token.is_sent_end` property.
- Improve language data for Danish, Dutch, French, German, Italian, Lithuanian, Norwegian, Romanian and Spanish to better match UD corpora.
- Update language data for Danish, Kannada, Korean, Persian, Swedish and Urdu.
- Add support for `pkuseg` alongside `jieba` for Chinese.
- Switch from `fugashi` to `sudachipy` for Japanese.
- Improve punctuation used in sentencizer.
- Switch to new and more consistent alignment method in `gold.align`.
- Reduce stored lexemes data and move non-derivable features to `spacy-lookups-data`.
🔴 Bug fixes
- Fix issue #5056: Introduce support for matching `Span` objects.
- Fix issue #5086: Remove `Vectors.from_glove`.
- Fix issue #5131: Improve data processing in named entity linking scripts.
- Fix issue #5137: Fix passing of component configuration to component.
- Fix issue #5144: Fix sentence comparison in test util.
- Fix issue #5166: Fix handling of `exclusive_classes` in textcat ensemble.
- Fix issue #5170: Set rank for new vector in `Vocab.set_vector`.
- Fix issue #5181: Prevent `None` values in gold fields.
- Fix issue #5191: Fix `GoldParse` initialization when the number of tokens has changed.
- Fix issue #5193: Correctly pin `cupy-cuda` extra dependencies.
- Fix issue #5200: Fix minor bugs in train CLI.
- Fix issue #5216: Modify `Vectors.resize` to work with `cupy`.
- Fix issue #5228: Raise error for inplace resize with new vector dimension.
- Fix issue #5230: Fix `unittest` warnings when saving a model.
- Fix issue #5257: Use inline flags in `token_match` patterns.
- Fix issue #5278, #5359: Add missing `__init__.py` files to language data tests.
- Fix issue #5281: Fix comparison predicate handling for `!=`.
- Fix issue #5287: Normalize `TokenC.sent_start` values for `Matcher`.
- Fix issue #5292: Fix typo in option name `--n-save_every`.
- Fix issue #5303: Use `max(uint64)` for OOV lexeme rank.
- Fix issue #5311: Fix alignment of cards on landing page.
- Fix issue #5320: Fix `most_similar` for vectors with unused rows.
- Fix issue #5344: Prevent pip from installing spaCy on Python 3.4.
- Fix issue #5356: Fix bug in `Span.similarity` that could trigger `TypeError`.
- Fix issue #5361: Fix problems with lower and whitespace in variants.
- Fix issue #5373: Improve exceptions for `'d` (would/had) in English.
- Fix issue #5387: Fix logic in train CLI timing eval on CPU/GPU.
- Fix issue #5393, #5458: Fix check for overlapping spans in noun chunks.
- Fix issue #5429: Modify array type to accommodate `OOV_RANK`.
- Fix issue #5430: Check that row is within bounds when adding vector.
- Fix issue #5435: Use `Token.sent_start` for `Span.sent`.
- Fix issue #5436: Fix `ErrorsWithCodes().__class__` return value.
- Fix issue #5450: Disallow merging 0-length spans.
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- If you're training new models, you'll want to install the package `spacy-lookups-data`, which now includes both the lemmatization tables (as in v2.2) and the normalization tables (new in v2.3). If you're using pretrained models, nothing changes, because the relevant tables are included in the model packages.
- Due to the updated Universal Dependencies training data, the fine-grained part-of-speech tags will change for many provided language models. The coarse-grained part-of-speech tagset remains the same, but the mapping from particular fine-grained to coarse-grained tags may show minor differences.
- For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech tagsets contain new merged tags related to contracted forms, such as `ADP_DET` for French `"au"`, which maps to UPOS `ADP` based on the head `"à"`. This increases the accuracy of the models by improving the alignment between spaCy's tokenization and Universal Dependencies multi-word tokens used for contractions.
- spaCy's custom warnings have been replaced with native Python `warnings`. Instead of setting `SPACY_WARNING_IGNORE`, use the `warnings` filters to manage warnings.
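Since warning management now goes through the standard library, the old `SPACY_WARNING_IGNORE` environment variable can be replaced with a filter. A minimal stdlib-only sketch (the `[W008]` message text is illustrative; filter on whichever warning text you want to silence):

```python
import warnings

# Previously: SPACY_WARNING_IGNORE=W008 in the environment.
# Now: use standard warnings filters instead.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")                         # surface everything...
    warnings.filterwarnings("ignore", message=r"\[W008\]")  # ...except this pattern
    warnings.warn("[W008] Evaluating similarity on empty vectors.")
    warnings.warn("[W999] Some other warning.")

# Only the warning that doesn't match the ignore filter is recorded.
assert [str(w.message) for w in caught] == ["[W999] Some other warning."]
```

Because `filterwarnings` prepends its entry to the filter list, the ignore rule takes precedence over the catch-all `"always"` filter.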
📖 Documentation and examples
- Fix various typos and inconsistencies.
- Add new projects to the spaCy Universe.
- Move `bin/wiki_entity_linking` scripts for Wikipedia to `projects` repo.
🔥 ICYMI: We recently updated the free and interactive spaCy course to include translations for German (with German NLP examples), Spanish (with Spanish NLP examples) and Japanese, as well as videos for English and German. Translations for Chinese (with Chinese NLP examples), French (with French NLP examples) and Russian coming soon!
📦 Model packages (46)
Model | Language | Version | Vectors |
---|---|---|---|
`zh_core_web_sm` | Chinese | 2.3.0 | ✗ |
`zh_core_web_md` | Chinese | 2.3.0 | ✓ |
`zh_core_web_lg` | Chinese | 2.3.0 | ✓ |
`da_core_news_sm` | Danish | 2.3.0 | ✗ |
`da_core_news_md` | Danish | 2.3.0 | ✓ |
`da_core_news_lg` | Danish | 2.3.0 | ✓ |
`nl_core_news_sm` | Dutch | 2.3.0 | ✗ |
`nl_core_news_md` | Dutch | 2.3.0 | ✓ |
`nl_core_news_lg` | Dutch | 2.3.0 | ✓ |
`en_core_web_sm` | English | 2.3.0 | ✗ |
`en_core_web_md` | English | 2.3.0 | ✓ |
`en_core_web_lg` | English | 2.3.0 | ✓ |
`fr_core_news_sm` | French | 2.3.0 | ✗ |
`fr_core_news_md` | French | 2.3.0 | ✓ |
`fr_core_news_lg` | French | 2.3.0 | ✓ |
`de_core_news_sm` | German | 2.3.0 | ✗ |
`de_core_news_md` | German | 2.3.0 | ✓ |
`de_core_news_lg` | German | 2.3.0 | ✓ |
`el_core_news_sm` | Greek | 2.3.0 | ✗ |
`el_core_news_md` | Greek | 2.3.0 | ✓ |
`el_core_news_lg` | Greek | 2.3.0 | ✓ |
`it_core_news_sm` | Italian | 2.3.0 | ✗ |
`it_core_news_md` | Italian | 2.3.0 | ✓ |
`it_core_news_lg` | Italian | 2.3.0 | ✓ |
`ja_core_news_sm` | Japanese | 2.3.0 | ✗ |
`ja_core_news_md` | Japanese | 2.3.0 | ✓ |
`ja_core_news_lg` | Japanese | 2.3.0 | ✓ |
`lt_core_news_sm` | Lithuanian | 2.3.0 | ✗ |
`lt_core_news_md` | Lithuanian | 2.3.0 | ✓ |
`lt_core_news_lg` | Lithuanian | 2.3.0 | ✓ |
`nb_core_news_sm` | Norwegian Bokmål | 2.3.0 | ✗ |
`nb_core_news_md` | Norwegian Bokmål | 2.3.0 | ✓ |
`nb_core_news_lg` | Norwegian Bokmål | 2.3.0 | ✓ |
`pl_core_news_sm` | Polish | 2.3.0 | ✗ |
`pl_core_news_md` | Polish | 2.3.0 | ✓ |
`pl_core_news_lg` | Polish | 2.3.0 | ✓ |
`pt_core_news_sm` | Portuguese | 2.3.0 | ✗ |
`pt_core_news_md` | Portuguese | 2.3.0 | ✓ |
`pt_core_news_lg` | Portuguese | 2.3.0 | ✓ |
`ro_core_news_sm` | Romanian | 2.3.0 | ✗ |
`ro_core_news_md` | Romanian | 2.3.0 | ✓ |
`ro_core_news_lg` | Romanian | 2.3.0 | ✓ |
`es_core_news_sm` | Spanish | 2.3.0 | ✗ |
`es_core_news_md` | Spanish | 2.3.0 | ✓ |
`es_core_news_lg` | Spanish | 2.3.0 | ✓ |
`xx_ent_wiki_sm` | Multi-language | 2.3.0 | ✗ |
v2.2.4: Alpha support for Yoruba and Basque, language data improvements and lots of bug fixes
✨ New features and improvements
- NEW: Add `Span.char_span` method.
- NEW: Base language support for Yoruba and Basque.
- NEW: Add `--tag-map-path` argument to `debug-data` and `train` commands.
- NEW: Add `add_lemma` option to `displacy` dependency visualizer.
- Add `IDX` as an attribute available via `Doc.to_array`.
- Improve speed of adding large numbers of patterns to `EntityRuler`.
- Replace `python-mecab3` with `fugashi` for Japanese.
- Improve language data for Norwegian, Luxembourgish, Finnish, Slovak, Romanian, Greek and German.
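The new `Span.char_span` method above resolves a character-offset range to a token span, returning `None` when the offsets don't line up with token boundaries. A rough plain-Python sketch of that alignment idea (a hypothetical helper, not spaCy's implementation; assumes single-space-joined tokens):

```python
def char_span_to_tokens(tokens, start_char, end_char):
    """Map a character range to a token slice, or None if misaligned."""
    offsets, pos = [], 0
    for tok in tokens:
        offsets.append((pos, pos + len(tok)))
        pos += len(tok) + 1  # account for the joining space

    starts = [i for i, (s, e) in enumerate(offsets) if s == start_char]
    ends = [i for i, (s, e) in enumerate(offsets) if e == end_char]
    if not starts or not ends:
        return None  # offsets fall inside a token -> no valid span
    return tokens[starts[0]:ends[0] + 1]

# "New York City": "New" covers chars 0-3, "York" 4-8, "City" 9-13.
assert char_span_to_tokens(["New", "York", "City"], 0, 8) == ["New", "York"]
assert char_span_to_tokens(["New", "York", "City"], 0, 5) is None
```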
🔴 Bug fixes
- Fix issue #3979, #4819, #4871: Add `tok2vec` parameters to `train` command.
- Fix issue #4009: Fix use of pretrained vectors in text classifier.
- Fix issue #4342: Improve CLI training with base model.
- Fix issue #4432: Add destructors for states in `TransitionSystem`.
- Fix issue #4440: Require `HEAD` for `is_parsed` in `Doc.from_array`.
- Fix issue #4615: Update `SHAPE` docs and examples.
- Fix issue #4665: Allow `HEAD` field in CoNLL-U format to be an underscore.
- Fix issue #4673: Ensure correct array module is used when returning a vector via `Vocab`.
- Fix issue #4674: Make `set_entities` in the `KnowledgeBase` more robust.
- Fix issue #4677: Add missing tags to tag maps for `el`, `es` and `pt`.
- Fix issue #4688: Iterate over `lr_edges` until `Doc.sents` are correct.
- Fix issue #4703, #4823: Facilitate large training files.
- Fix issue #4707: Auto-exclude `disabled` when calling `from_disk` during load.
- Fix issue #4717: Fix int value handling in `Matcher`.
- Fix issue #4719: Add message when cli train script throws exception.
- Fix issue #4723: Update `EntityLinker` example.
- Fix issue #4725: Take care of global vectors in multiprocessing.
- Fix issue #4770: Include `Doc.cats` in serialization of `Doc` and `DocBin`.
- Fix issue #4772: Fix bug in `EntityLinker.predict`.
- Fix issue #4777: Fix link to user hooks in documentation.
- Fix issue #4829: Update build dependencies in `pyproject.toml`.
- Fix issue #4830: Warn for punctuation in entities when training with noise.
- Fix issue #4833: Make example scripts work with transformer starter models.
- Fix issue #4849: Fix serialization of `ENT_ID`.
- Fix issue #4862: Fix and improve URL pattern.
- Fix issue #4868: Include `.pyx` and `.pxd` files in the distribution.
- Fix issue #4876: Add friendlier error to entity linking example script.
- Fix issue #4903: Fix handling of custom underscore attributes during multiprocessing.
- Fix issue #4924: Fix handling of empty docs or golds in `Language.evaluate`.
- Fix issue #4934: Prevent updating component config if the `Model` was already defined.
- Fix issue #4935: Fix `Sentencizer.pipe` for empty `Doc`.
- Fix issue #4961: Remove old docs section links.
- Fix issue #4965: Sync `Span.__eq__` and `Span.__hash__`.
- Fix issue #4975: Adjust `srsly` pin.
- Fix issue #5048: Fix behavior of `get_doc` test utility.
- Fix issue #5073: Normalize `IS_SENT_START` to `SENT_START` for `Matcher`.
- Fix issue #5075: Make it impossible to create invalid heads with `Doc.from_array`.
- Fix issue #5082: Correctly set vector of merged span in `merge_entities`.
- Fix issue #5115: Ensure paths in `Tokenizer.to_disk` and `Tokenizer.from_disk`.
- Fix issue #5117: Clarify behavior of `Doc.is_` flags for empty `Doc`s.
📖 Documentation and examples
- Fix various typos and inconsistencies.
- Add new projects to the spaCy Universe.
👥 Contributors
Thanks to @polm, @mmaybeno, @jarib, @questoph, @aajanki, @mr-bjerre, @Tclack88, @thiagola92, @tamuhey, @Olamyy, @AlJohri, @iechevarria, @iurshina, @lineality, @pbadeer, @BramVanroy, @kabirkhan, @ceteri, @omri374, @maknotavailable, @onlyanegg, @drndos, @ju-sh, @nlptechbook, @chkoar, @Jan-711, @MisterKeefe, @bryant1410, @mirfan899, @dhpollack and @mabraham for the pull requests and contributions!
v2.2.3: Tokenizer.explain, Korean base support, dependency scores per label and bug fixes
✨ New features and improvements
- NEW: `Tokenizer.explain` method to see which rule or pattern was matched:

```python
tok_exp = nlp.tokenizer.explain("(don't)")
assert [t[0] for t in tok_exp] == ["PREFIX", "SPECIAL-1", "SPECIAL-2", "SUFFIX"]
assert [t[1] for t in tok_exp] == ["(", "do", "n't", ")"]
```

- NEW: Official Python 3.8 wheels for spaCy and its dependencies.
- Base language support for Korean.
- Add `Scorer.las_per_type` (labelled dependency scores per label).
- Rework Chinese language initialization and tokenization.
- Improve language data for Luxembourgish.
🔴 Bug fixes
- Fix issue #4573, #4645: Improve tokenizer usage docs.
- Fix issue #4575: Add error in `debug-data` if no dev docs are available.
- Fix issue #4582: Make `as_tuples=True` in `Language.pipe` work with multiprocessing.
- Fix issue #4590: Correctly call `on_match` in `DependencyMatcher`.
- Fix issue #4593: Build wheels for Python 3.8.
- Fix issue #4604: Fix realloc in `Retokenizer.split`.
- Fix issue #4656: Fix `conllu2json` converter when `-n` > 1.
- Fix issue #4662: Fix `Language.evaluate` for components without `.pipe` method.
- Fix issue #4670: Ensure `EntityRuler` is deserialized correctly from disk.
- Fix issue #4680: Raise error if non-string labels are added to `Tagger` or `TextCategorizer`.
- Fix issue #4691: Make `Vectors.find` return keys in correct order.
📖 Documentation and examples
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @yash1994, @walterhenry, @prilopes, @f11r, @questoph, @erip, @richardpaulhudson and @GuiGel for the pull requests and contributions.
v2.2.2: Multiprocessing, future APIs, Luxembourgish base support & simpler GPU install
✨ New features and improvements
- NEW: Support multiprocessing in `nlp.pipe` via the `n_process` argument (Python 3 only).
- Base language support for Luxembourgish.
- Add noun chunks iterator for Swedish.
- Retrained models for Greek, Norwegian Bokmål and Lithuanian that now correctly support parser-based sentence segmentation.
- Repackaged models for Greek and German with improved lookup tables via `spacy-lookups-data`.
- Add warning in `debug-data` for low sentences per doc ratio.
- Improve checks and errors related to ill-formed IOB input in `convert` and `debug-data` CLI.
- Support training dict format as JSONL.
- Make `EntityRuler` ID resolution 2× faster and support `"id"` in patterns to set `Token.ent_id`.
- Improve rendering of named entity spans in `displacy` for RTL languages.
- Update Thinc to ditch `thinc_gpu_ops` for simpler GPU install.
- Support Mish activation in `spacy pretrain`.
- Add forwards-compatible support for new `Language.disable_pipes` API, which will become the default in the future. The method can now also take a list of component names as its first argument (instead of a variable number of arguments).

```diff
- disabled = nlp.disable_pipes("tagger", "parser")
+ disabled = nlp.disable_pipes(["tagger", "parser"])
```

- Add forwards-compatible support for new `Matcher.add` and `PhraseMatcher.add` API, which will become the default in the future. The patterns are now the second argument and a list (instead of a variable number of arguments). The `on_match` callback becomes an optional keyword argument.

```diff
  patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
- matcher.add("GoogleNow", None, *patterns)
+ matcher.add("GoogleNow", patterns)
- matcher.add("GoogleNow", on_match, *patterns)
+ matcher.add("GoogleNow", patterns, on_match=on_match)
```

- Add new and improved tokenization alignment in `gold.align` behind a feature flag. The new alignment may produce backwards-incompatible results, so it won't be enabled by default before v3.0.

```python
import spacy.gold
spacy.gold.USE_NEW_ALIGN = True
```
🔴 Bug fixes
- Fix issue #1303: Support multiprocessing in `nlp.pipe`.
- Fix issue #1745: Ditch `thinc_gpu_ops` for simpler GPU install.
- Fix issue #2411: Update Thinc to fix compilation on cygwin.
- Fix issue #3412: Prevent division by zero in `Vectors.most_similar`.
- Fix issue #3618: Fix memory leak for long-running parsing processes.
- Fix issue #4241: Update Greek lookups in `spacy-lookups-data`.
- Fix issue #4269: Extend unicode character block for Sinhala.
- Fix issue #4362: Improve `URL_PATTERN` and handling in tokenizer.
- Fix issue #4373: Make `PhraseMatcher.vocab` consistent with `Matcher.vocab`.
- Fix issue #4377: Clarify serialization of extension attributes.
- Fix issue #4382: Improve usage of `pkg_resources` and handling of entry points.
- Fix issue #4386: Consider `batch_size` when sorting similar vectors.
- Fix issue #4389: Fix `ner_jsonl2json` converter.
- Fix issue #4397: Ensure `on_match` callback is executed in `PhraseMatcher`.
- Fix issue #4401, #4408: Fix sentence segmentation in Greek, Norwegian and Lithuanian models.
- Fix issue #4402: Fix issue with how training data was passed through the pipeline.
- Fix issue #4406: Correct spelling in lemmatizer API docs.
- Fix issue #4418, #4438: Improve knowledge base and Wikidata parsing.
- Fix issue #4435: Fix `PhraseMatcher.remove` for overlapping patterns.
- Fix issue #4443: Fix bug in `Vectors.most_similar`.
- Fix issue #4452: Fix `gold.docs_to_json` documentation.
- Fix issue #4463: Add missing `cats` to `GoldParse.from_annot_tuples` in `Scorer`.
- Fix issue #4470: Suppress convert output if writing to `stdout`.
- Fix issue #4475: Correct mistake in docs example.
- Fix issue #4485: Update tag maps and docs for English and German.
- Fix issue #4493: Update information in spaCy Universe.
- Fix issue #4496: Improve docs of `PhraseMatcher.add` arguments.
- Fix issue #4506: Ensure `Vectors.most_similar` returns `1.0` for identical vectors.
- Fix issue #4509: Fix `None` iteration error in entity linking script.
- Fix issue #4524: Fix typo in `Parser` sample construction of `GoldParse`.
- Fix issue #4528: Fix serialization of extension attribute values in `DocBin`.
- Fix issue #4529: Ensure `GoldParse` is initialized correctly with misaligned tokens.
- Fix issue #4538: Backport memory leak fix to v2.1.x branch and release v2.1.9.
⚠️ Backwards incompatibilities
- The unused attributes `lemma_rules`, `lemma_index`, `lemma_exc` and `lemma_lookup` of the `Language.Defaults` have now been removed to prevent confusion (e.g. if users add rules that then have no effect). The only place lemmatization tables are stored and can be modified at runtime is via `nlp.vocab.lookups`.

```diff
- nlp.Defaults.lemma_lookup["spaCies"] = "spaCy"
+ lemma_lookup = nlp.vocab.lookups.get_table("lemma_lookup")
+ lemma_lookup["spaCies"] = "spaCy"
```
📖 Documentation and examples
- Fix various typos and inconsistencies.
- Add more projects to the spaCy Universe.
👥 Contributors
Thanks to @tamuhey, @PeterGilles, @akornilo, @danielkingai2, @ghollah, @pberba, @gustavengstrom, @ju-sh, @kabirkhan, @ZhuoruLin, @nipunsadvilkar and @neelkamath for the pull requests and contributions.
v2.1.9: Backport memory leak fix
This is a small maintenance update that backports a bug fix for a memory leak that'd occur in long-running parsing processes. It's intended for users who can't or don't yet want to upgrade to spaCy v2.2 (e.g. because it requires retraining all the models). If you're able to upgrade, you shouldn't use this version and instead install the latest v2.2.
v2.2.1: Fix DocBin and Dutch model, improve Vectors.most_similar
✨ New features and improvements
- Make `Vectors.most_similar` return the top most similar vectors instead of only one.
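Conceptually, `most_similar` ranks entries by cosine similarity and returns the best n. A plain-Python sketch of that idea (illustrative only; spaCy's implementation is vectorized with numpy):

```python
import math

def most_similar(query, table, n=2):
    """Rank keys in `table` by cosine similarity to `query`, best first."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0
    return sorted(table, key=lambda key: cos(query, table[key]), reverse=True)[:n]

vectors = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
assert most_similar([1.0, 0.05], vectors) == ["cat", "dog"]
```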
🔴 Bug fixes
- Fix issue #4365: Fix tag map in Dutch model.
- Fix issue #4368: Fix initialization of `DocBin` with attributes.
📖 Documentation and examples
- Add API docs for `Vectors.most_similar`.
👥 Contributors
Thanks to @bintay and @svlandeg for the pull requests and contributions.
v2.2.0: Norwegian & Lithuanian models, better Dutch NER, smaller install, faster matching & more
⚠️ This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions. If you've been training your own models, you'll need to retrain them with the new version.
✨ New features and improvements
- NEW: Pretrained core models for Norwegian (MIT) and Lithuanian (CC BY-SA).
- NEW: Better pre-trained Dutch NER using custom labelled UD corpus instead of WikiNER.
- NEW: Make spaCy roughly 5-10× smaller on disk (depending on your platform) by compressing and moving lookups to a separate package.
- NEW: `EntityLinker` and `KnowledgeBase` API to train and access entity linking models, plus scripts to train your own Wikidata models.
- NEW: 10× faster `PhraseMatcher` and improved phrase matching algorithm.
- NEW: `DocBin` class to efficiently serialize collections of `Doc` objects.
- NEW: Train text classification models on the command line with `spacy train` and get `textcat` results via the `Scorer`.
- NEW: `debug-data` command to validate your training and development data, get useful stats, and find problems like invalid entity annotations, cyclic dependencies, low data labels and more.
- NEW: Efficient `Lookups` class using Bloom filters that allows storing, accessing and serializing large dictionaries via `vocab.lookups`.
- Data augmentation in `spacy train` via the `--orth-variant-level` flag, which defines the percentage of occurrences of some tokens subject to replacement during training.
- Add `nlp.pipe_labels` (labels assigned by pipeline components) and include `"labels"` in `nlp.meta`.
- Support `spacy_displacy_colors` entry point to allow packages to add entity colors to `displacy`.
- Allow `template` config option in `displacy` to customize entity HTML template.
- Improve match pattern validation and handling of unsupported attributes.
- Add lookup lemmatization data for Croatian and Serbian.
- Update and improve language data for Chinese, Croatian, Thai, Romanian, Hindi and English.
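The Bloom filters behind the new `Lookups` class are what make large tables cheap to query: a fixed-size bit array answers "definitely absent" or "probably present" without touching the full dictionary. A toy sketch of the data structure (illustrative only, not spaCy's implementation):

```python
import hashlib

class TinyBloom:
    """Toy Bloom filter: answers 'definitely absent' or 'probably present'."""

    def __init__(self, size=1024, n_hashes=3):
        self.size = size
        self.n_hashes = n_hashes
        self.bits = 0  # a Python int doubles as an arbitrary-size bit array

    def _positions(self, key):
        # Derive n_hashes bit positions from salted SHA-1 digests.
        for salt in range(self.n_hashes):
            digest = hashlib.sha1(f"{salt}:{key}".encode("utf8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bloom = TinyBloom()
bloom.add("lemma_lookup")
assert "lemma_lookup" in bloom  # Bloom filters never give false negatives
```

Misses are resolved with a handful of bit tests instead of a dictionary lookup, at the cost of a small, tunable false-positive rate.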
🔴 Bug fixes
- Fix issue #3258: Reduce package size on disk by moving and compressing large dictionaries.
- Fix issue #3540: Update lemma and vector information after splitting a token.
- Fix issue #3687: Automatically skip duplicates in `Doc.retokenize`.
- Fix issue #3830: Retrain German model and fix `subtok` errors.
- Fix issue #3850: Allow customizing entity HTML template in displaCy.
- Fix issue #3879, #3951, #4154: Fix bug in `Matcher` retry loop that'd cause problems with `?` operator.
- Fix issue #3917: Raise error for negative token indices in `displacy`.
- Fix issue #3922: Add `PhraseMatcher.remove` method.
- Fix issue #3959, #4133: Make sure both `pos` and `tag` are correctly serialized.
- Fix issue #3972: Ensure `PhraseMatcher` returns multiple matches for identical rules.
- Fix issue #4020: Raise error for overlapping entities in `biluo_tags_from_offsets`.
- Fix issue #4051: Ensure retokenizer sets POS tags correctly on merge.
- Fix issue #4070: Improve token pattern checking without validation.
- Fix issue #4096: Add checks for cycles in `debug-data`.
- Fix issue #4100: Improve docs on phrase pattern attributes.
- Fix issue #4102: Correct mistakes in English lookup lemmatizer data.
- Fix issue #4104: Make visualized NER examples in docs more clear.
- Fix issue #4107: Automatically set span root attributes on merging.
- Fix issue #4111, #4170: Improve NER/IOB converters.
- Fix issue #4120: Correctly handle `?` operator at the end of pattern.
- Fix issue #4123: Provide more details in cycle error message `E069`.
- Fix issue #4138: Correctly open `.html` files as UTF-8 in `evaluate` command.
- Fix issue #4139: Make emoticon data a raw string.
- Fix issue #4148: Add missing API docs for `force` flag on `set_extension`.
- Fix issue #4155: Correct language code for Serbian.
- Fix issue #4165: Add more attributes to matcher validation schema.
- Fix issue #4190: Fix caching issue that'd cause tokenizer to not be deserialized correctly.
- Fix issue #4200: Work around `tqdm` bug that'd remove text color from terminal output.
- Fix issue #4229: Fix handling of pre-set entities.
- Fix issue #4238: Flush tokenizer cache when affixes, token_match, or special cases are modified.
- Fix issue #4242: Make `.pos`/`.tag` distinction more clear in the docs.
- Fix issue #4245: Fix bug that occurred when processing empty string in Korean.
- Fix issue #4262: Fix handling of spaces in Japanese.
- Fix issue #4269: Tokenize punctuation correctly in Kannada, Tamil, and Telugu and add unicode characters to default sentencizer config.
- Fix issue #4270: Fix `--vectors-loc` documentation.
- Fix issue #4302: Remove duplicate `Parser.tok2vec` property.
- Fix issue #4303: Correctly support `as_tuples` and `return_matches` in `Matcher.pipe`.
- Fix issue #4307: Ensure that pre-set entities are preserved and allow overwriting unset tokens.
- Fix issue #4308: Fix bug that could cause `PhraseMatcher` with very large lists to miss matches.
- Fix issue #4348: Ensure training doesn't crash with empty batches.
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- The lemmatization tables have been moved to their own package, `spacy-lookups-data`, which is not installed by default. If you're using pre-trained models, nothing changes, because the tables are now included in the model packages. If you want to use the lemmatizer for other languages that don't yet have pre-trained models (e.g. Turkish or Croatian) or start off with a blank model that contains lookup data (e.g. `spacy.blank("en")`), you'll need to explicitly install spaCy plus data via `pip install spacy[lookups]`. The data will be registered automatically via entry points.
- Lemmatization tables (rules, exceptions, index and lookups) are now part of the `Vocab` and serialized with it. This means that serialized objects (`nlp`, pipeline components, vocab) will now include additional data, and models written to disk will include additional files.
- The `Lemmatizer` class is now initialized with an instance of `Lookups` containing the rules and tables, instead of dicts as separate arguments. This makes it easier to share data tables and modify them at runtime. This is mostly internals, but if you've been implementing a custom `Lemmatizer`, you'll need to update your code.
- If you've been training your own models, you'll need to retrain them with the new version.
- The Dutch model has been trained on a new NER corpus (custom labelled UD instead of WikiNER), so its predictions may be very different compared to the previous version. The results should be significantly better and more generalizable, though.
- The `spacy download` command does not set the `--no-deps` pip argument anymore by default, meaning that model package dependencies (if available) will now also be downloaded and installed. If spaCy (which is also a model dependency) is not installed in the current environment, e.g. if a user has built from source, `--no-deps` is added back automatically to prevent spaCy from being downloaded and installed again from pip.
- The built-in `biluo_tags_from_offsets` converter is now stricter and will raise an error if entities are overlapping (instead of silently skipping them). If your data contains invalid entity annotations, make sure to clean it and resolve conflicts. You can now also use the new `debug-data` command to find problems in your data.
- Pipeline components can now overwrite IOB tags of tokens that are not yet part of an entity. Once a token has an `ent_iob` value set, it won't be reset to an "unset" state and will always have at least `O` assigned. `list(doc.ents)` now actually keeps the annotations on the token level consistent, instead of resetting `O` to an empty string.
- The default punctuation in the `Sentencizer` has been extended and now includes more characters common in various languages. This also means that the results it produces may change, depending on your text. If you want the previous behaviour with limited characters, set `punct_chars=[".", "!", "?"]` on initialization.
- The `PhraseMatcher` algorithm was rewritten from scratch and it's now 10× faster. The rewrite also resolved a few subtle bugs with very large terminology lists. So if you were matching large lists, you may see slightly different results – however, the results should now be fully correct. See #4309 for details on this change.
- The `Serbian` language class (introduced in v2.1.8) incorrectly used the language code `rs` instead of `sr`. This has now been fixed, so `Serbian` is now available via `spacy.lang.sr`.
- The `"sources"` in the `meta.json` have changed from a list of strings to a list of dicts. This is mostly internals, but if your code used `nlp.meta["sources"]`, you might have to update it.
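The stricter `biluo_tags_from_offsets` behaviour described above boils down to an overlap check on character offsets. A sketch of the kind of validation you could run on your own annotations before converting (a hypothetical helper, not spaCy's code):

```python
def assert_no_overlaps(entities):
    """Raise ValueError if any (start, end, label) character spans overlap.

    Touching spans (one ends exactly where the next starts) are allowed.
    """
    last_end = -1
    for start, end, label in sorted(entities):
        if start < last_end:
            raise ValueError(
                f"({start}, {end}, {label!r}) overlaps a span ending at {last_end}"
            )
        last_end = end

assert_no_overlaps([(0, 5, "ORG"), (6, 10, "GPE")])       # fine
try:
    assert_no_overlaps([(0, 5, "ORG"), (3, 10, "GPE")])   # overlap
except ValueError:
    pass  # overlapping spans are rejected, as in the stricter converter
else:
    raise AssertionError("expected ValueError")
```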
📈 Benchmarks
Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
---|---|---|---|---|---|---|---|---|
`en_core_web_sm` | English | 2.2.0 | 91.61 | 89.71 | 97.03 | 85.07 | ✗ | 11 MB |
`en_core_web_md` | English | 2.2.0 | 91.65 | 89.77 | 97.14 | 86.10 | ✓ | 91 MB |
`en_core_web_lg` | English ... |
v2.1.8: Usability improvements and Serbian alpha tokenization
✨ New features and improvements
- NEW: Alpha tokenization support for Serbian
- Improve language data for Urdu.
- Support installing and loading model packages in the same session.
🔴 Bug fixes
- Fix issue #4002: Make `PhraseMatcher` work as expected for `NORM` attribute.
- Fix issue #4063: Improve docs on `Matcher` attributes.
- Fix issue #4068: Make Korean work as expected on Python 2.7.
- Fix issue #4069: Add `validate` option to `EntityRuler`.
- Fix issue #4074: Raise error if annotation dict in simple training style has unexpected keys.
- Fix issue #4081: Fix typo in `pyproject.toml`.
- Fix handling of keyword arguments in `Language.evaluate`.
📖 Documentation and examples
- Improve `Matcher` attribute docs.
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @akornilo, @mirfan899, @veer-bains, @seppeljordan, @Pavle992, @svlandeg, @jenojp and @adrianeboyd for the pull requests and contributions.
v2.1.7: Improved evaluation, better language factories and bug fixes
✨ New features and improvements
- Add `Token.tensor` and `Span.tensor` attributes.
- Support simple training format of `(text, annotations)` instead of only `(doc, gold)` for `nlp.evaluate`.
- Add support for `"lang_factory"` setting in model `meta.json` (see #4031).
- Also support `"requirements"` in `meta.json` to define packages for setup's `install_requires`.
- Improve `Pipe` base class methods and make them less presumptuous.
- Improve Danish and Korean tokenization.
- Improve error messages when deserializing model fails.
🔴 Bug fixes
- Fix issue #3669, #3962: Fix dependency copy in `Span.as_doc` that could cause segfault.
- Fix issue #3968: Fix bug in per-entity scores.
- Fix issue #4000: Improve entity linking API.
- Fix issue #4022: Fix error when Korean text contains special characters.
- Fix issue #4030: Handle edge case when calling `TextCategorizer.predict` with empty `Doc`.
- Fix issue #4045: Correct `Span.sent` docs.
- Fix issue #4048: Fix `init-model` command if there's no vocab.
- Fix issue #4052: Improve per-type scoring of NER.
- Fix issue #4054: Ensure the `lang` of `nlp` and `nlp.vocab` stay consistent.
- Fix bugs in `Token.similarity` and `Span.similarity` when called via hook.
📖 Documentation and examples
- Add documentation for `gold.align` helper.
- Add more explicit section on processing text.
- Improve documentation on disabling pipeline components.
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @sorenlind, @pmbaumgartner, @svlandeg, @FallakAsad, @BreakBB, @adrianeboyd, @polm, @b1uec0in, @mdaudali and @ejarkm for the pull requests and contributions.
v2.1.6: Fix order of symbols that caused tag maps to be out-of-sync
🔴 Bug fixes
- Fix issue #3958: Fix order of symbols that caused tag maps to be out-of-sync.