v3.5.0: New CLI commands, language updates, bug fixes and much more
✨ New features and improvements
- NEW: New `apply` CLI command to annotate new documents with a trained pipeline (#11376).
- NEW: New `benchmark` CLI command to benchmark pipelines. The new `benchmark speed` subcommand measures the speed of a pipeline, and the `benchmark accuracy` subcommand is a new alias for `evaluate` (#11902).
- NEW: New `find-threshold` CLI command to identify an optimal threshold for classification models (#11280).
- NEW: New `FUZZY` `Matcher` operator for fuzzy matches based on Levenshtein edit distance. In addition, the `FUZZY` and `REGEX` operators are now supported in combination with `IN`/`NOT_IN` (#11359).
- Language updates for Ancient Greek, Dutch, Russian, Slovenian and Ukrainian (#11345, #11162, #11426, #11753, #11811, #11997, more details below).
- Allow up to `typer` v0.7.x (#11720), `mypy` 0.990 (#11801) and `typing_extensions` v4.4.x (#12036).
- New `spacy.ConsoleLogger.v3` with expanded progress tracking (#11972).
- Improved scoring behavior for `textcat` with `spacy.textcat_scorer.v2` (#11696 and #11971) and `spacy.textcat_multilabel_scorer.v2` (#11820).
- Improved customizability of the knowledge base used for entity linking, with the default implementation being the new `InMemoryLookupKB` (#11268).
- Optional `before_update` callback that is invoked at the start of each training step (#11739).
- Improve performance of `SpanGroup` (#11380).
- Improve UX around `displacy.serve` when the default port is in use (#11948).
- Patch a security vulnerability in extracting tar files (#11746).
- Add equality definition for vectors (#11806).
- Allow interpolation of variables in directory names in projects (#11235).
- Update default component configs to use the latest `tok2vec` version (#11618).
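
The `FUZZY` operator listed above is described as matching on Levenshtein edit distance. As a rough illustration of that measure only, here is a minimal plain-Python sketch. It is not spaCy's implementation, and the `fuzzy_equal` helper and its `max_edits` parameter are hypothetical names introduced for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the Levenshtein edit distance between two strings."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute (or keep if equal)
                )
            )
        previous = current
    return previous[-1]


def fuzzy_equal(text: str, pattern: str, max_edits: int) -> bool:
    """Hypothetical helper: accept text within max_edits edits of pattern."""
    return levenshtein(text, pattern) <= max_edits


if __name__ == "__main__":
    print(levenshtein("definitely", "definately"))
    print(fuzzy_equal("definately", "definitely", max_edits=2))
```

The number of edits a fuzzy match should tolerate is a design choice; the threshold here is passed in explicitly rather than assumed.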
🔴 Bug fixes
- #11382: Fix lookup behavior for the French and Catalan lemmatizers.
- #11385: Ensure that downstream components can train properly on a frozen `tok2vec` or `transformer` layer.
- #11762: Support local file system remotes for projects.
- #11763: Raise an error when unsupported values are used for `textcat`.
- #11834: Ensure `Vocab.to_disk` respects the exclude setting for `lookups` and `vectors`.
- #12009: Fix a few typing issues for `SpanGroup` and `Span` objects.
- #12098: Correctly handle missing annotations in the edit tree lemmatizer.
⚠️ Backwards incompatibilities and model updates
The following changes may require you to update code that is using the relevant functionality:
- An error is now raised when unsupported values are given as input to train a `textcat` or `textcat_multilabel` model. Ensure that values are 0.0 or 1.0, as explained in the docs.
- As `KnowledgeBase` is now an abstract class, you should call the constructor of the new `InMemoryLookupKB` instead when you want to use spaCy's default KB implementation. If you've written a custom KB that inherits from `KnowledgeBase`, you'll need to implement its abstract methods, or alternatively inherit from `InMemoryLookupKB` instead.
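
To make the abstract-class change above more concrete, here is a minimal sketch of the general Python pattern using the standard `abc` module. The classes `BaseKB` and `InMemoryKB` and their methods are hypothetical stand-ins, not spaCy's actual `KnowledgeBase` or `InMemoryLookupKB` API. The sketch only shows why instantiating an abstract base fails while a concrete subclass works.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class BaseKB(ABC):
    """Hypothetical abstract knowledge base (stand-in, not spaCy's class)."""

    @abstractmethod
    def get_candidates(self, mention: str) -> List[str]:
        """Return candidate entity IDs for a mention."""


class InMemoryKB(BaseKB):
    """Hypothetical concrete KB that keeps aliases in a dictionary."""

    def __init__(self) -> None:
        self._aliases: Dict[str, List[str]] = {}

    def add_alias(self, alias: str, entities: List[str]) -> None:
        self._aliases[alias] = list(entities)

    def get_candidates(self, mention: str) -> List[str]:
        # Unknown mentions yield no candidates.
        return list(self._aliases.get(mention, []))


if __name__ == "__main__":
    try:
        BaseKB()  # abstract classes cannot be instantiated directly
    except TypeError as err:
        print(f"TypeError: {err}")

    kb = InMemoryKB()
    kb.add_alias("Douglas", ["Q42"])
    print(kb.get_candidates("Douglas"))
```

A custom subclass that leaves an abstract method unimplemented fails in the same way as `BaseKB()` does here, which is why the notes suggest either implementing the abstract methods or inheriting from the concrete default.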
The following changes may influence the output of your language pipeline or trained models:
- Updates to language defaults:
- Updates to model defaults:
  - Use the latest `tok2vec` defaults in all components (#11618).
  - Improve the default attributes used for the `textcat` and `textcat_multilabel` components (#11698).
  - Update the default scorer for `textcat` and `textcat_multilabel` to fix a bug related to `threshold` for `textcat` and to make it possible to score multiple `textcat`/`textcat_multilabel` components in a single pipeline with custom scorers. If no custom scorers are used, the `cat_p/r/f` scores will now only reflect the final component's labels and performance (#11696, #11820).
  - Correct the `token_acc` score to report the intended measure (`# correct tokens / # predicted tokens`, the same as in spaCy v2). The `token_acc` scores for v3.5 will be lower for the same performance because they were incorrectly inflated in v3.0-v3.4. The `token_p/r/f` scores should remain unchanged (#12073).
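
The `token_acc` measure described above (correct tokens divided by predicted tokens) can be illustrated with a small sketch. This is not spaCy's scorer: it assumes tokens are given as `(start, end)` character offsets, counts a predicted token as correct only if the same offsets appear in the gold tokenization, and returns 0.0 when nothing was predicted.

```python
from typing import List, Tuple

Offsets = Tuple[int, int]


def token_acc(predicted: List[Offsets], gold: List[Offsets]) -> float:
    """Share of predicted tokens whose offsets also occur in the gold tokens."""
    if not predicted:
        # Assumption for this sketch: no predictions scores 0.0.
        return 0.0
    gold_set = set(gold)
    correct = sum(1 for span in predicted if span in gold_set)
    return correct / len(predicted)


if __name__ == "__main__":
    # Text: "New York-based"
    gold = [(0, 3), (4, 8), (8, 9), (9, 14)]  # New | York | - | based
    predicted = [(0, 3), (4, 14)]             # New | York-based
    print(token_acc(predicted, gold))
```

In this example only one of the two predicted tokens matches the gold tokenization, so the score is 0.5.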
The following functionality will be changed in the near future - so it's best to start updating your scripts now to make them more generic:
- From v4 onwards, we'll rename the `master` branch to `main`.
📦 Trained pipelines updates
- The CNN pipelines add `IS_SPACE` as a `tok2vec` feature for `tagger` and `morphologizer` components to improve tagging of non-whitespace vs. whitespace tokens.
- The transformer pipelines require `spacy-transformers` v1.2, which uses the exact alignment from `tokenizers` for fast tokenizers instead of the heuristic alignment from `spacy-alignments`. For all trained pipelines except `ja_core_news_trf`, the alignments between spaCy tokens and transformer tokens may be slightly different. More details about the `spacy-transformers` changes are in the v1.2.0 release notes.
📖 Documentation and examples
- We've ported our website from Gatsby to Next 🥳
- Updated the documentation on supported languages.
- Added a note about experimental M1 GPU support to the installation quickstart.
- Included documentation for the `biluo_to_iob` and `iob_to_biluo` functions.
- Fixed model links in the v3.4 usage documentation.
- Removed "new" tags of functionality from spaCy v2.x.
- Various small additions, spelling and typo fixes.
- spaCy Universe additions:
  - greCy: Providing Ancient Greek models
  - spacy-pythainlp: Add Thai support for spaCy
- New projects:
  - Accelerate NER with Speedster (experimental)
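
For readers unfamiliar with the two tagging schemes behind the `biluo_to_iob` and `iob_to_biluo` functions mentioned above, the following plain-Python sketch shows the idea of the conversion. It is an illustration under stated assumptions rather than spaCy's implementation: the function names carry a `_sketch` suffix to mark them as hypothetical, and the IOB input is assumed to be well formed, with every entity starting on a `B-` tag.

```python
from typing import List


def biluo_to_iob_sketch(tags: List[str]) -> List[str]:
    """Map BILUO tags to IOB: L- becomes I-, U- becomes B-."""
    out = []
    for tag in tags:
        if tag.startswith("L-"):
            out.append("I-" + tag[2:])
        elif tag.startswith("U-"):
            out.append("B-" + tag[2:])
        else:
            out.append(tag)
    return out


def iob_to_biluo_sketch(tags: List[str]) -> List[str]:
    """Map IOB tags to BILUO by checking whether the entity continues."""
    out = []
    for i, tag in enumerate(tags):
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues if the next tag is I- with the same label.
        continues = next_tag.startswith("I-") and next_tag[2:] == tag[2:]
        if tag.startswith("B-"):
            out.append(tag if continues else "U-" + tag[2:])
        elif tag.startswith("I-"):
            out.append(tag if continues else "L-" + tag[2:])
        else:
            out.append(tag)
    return out


if __name__ == "__main__":
    iob = ["B-PER", "I-PER", "O", "B-LOC"]
    biluo = iob_to_biluo_sketch(iob)
    print(biluo)
    print(biluo_to_iob_sketch(biluo))
```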
👥 Contributors
@aaronzipp, @adrianeboyd, @albertvillanova, @ArchiDevil, @cfuerbachersparks, @damian-romero, @danieldk, @darigovresearch, @DSLituiev, @essenmitsosse, @gremur, @honnibal, @ines, @jmyerston, @JosPolfliet, @kadarakos, @koaning, @kwhumphreys, @ljvmiranda921, @MarcoGorelli, @orglce, @pmbaumgartner, @polm, @richardpaulhudson, @rmitsch, @ryndaniels, @shadeMe, @svlandeg, @thomashacker, @TrellixVulnTeam, @wannaphong, @zhiiw, @zrpxx