v2.2.4: Alpha support for Yoruba and Basque, language data improvements and lots of bug fixes
✨ New features and improvements
- NEW: Add `Span.char_span` method.
- NEW: Base language support for Yoruba and Basque.
- NEW: Add `--tag-map-path` argument to `debug-data` and `train` commands.
- NEW: Add `add_lemma` option to the `displacy` dependency visualizer.
- Add `IDX` as an attribute available via `Doc.to_array`.
- Improve speed of adding a large number of patterns to `EntityRuler`.
- Replace `python-mecab3` with `fugashi` for Japanese.
- Improve language data for Norwegian, Luxembourgish, Finnish, Slovak, Romanian, Greek and German.
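
As a rough illustration of two of the additions above, the sketch below calls `Span.char_span` and reads `IDX` through `Doc.to_array`. It assumes spaCy v2.2.4 or later is installed. The blank English pipeline and the sample sentence are arbitrary choices for the example, and the offsets passed to `Span.char_span` are relative to the start of the span, not the start of the document.

```python
import spacy
from spacy.attrs import IDX, LENGTH

# A blank pipeline only tokenizes, so no trained model has to be downloaded.
nlp = spacy.blank("en")
doc = nlp("I like New York in autumn.")

# Span.char_span: slice a span by character offsets relative to the span.
span = doc[1:5]              # "like New York in"
sub = span.char_span(5, 13)  # characters 5..13 of span.text
# char_span returns None when the offsets do not align with token boundaries.
sub_text = sub.text if sub is not None else None

# IDX in Doc.to_array: each token's character offset into the document text.
# Two attributes are requested so the result is a 2D array (tokens x attributes).
arr = doc.to_array([IDX, LENGTH])
offsets = [int(row[0]) for row in arr]
```

Here `offsets` should match `[token.idx for token in doc]`, which is a convenient sanity check when exporting token positions to NumPy.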
🔴 Bug fixes
- Fix issue #3979, #4819, #4871: Add `tok2vec` parameters to `train` command.
- Fix issue #4009: Fix use of pretrained vectors in text classifier.
- Fix issue #4342: Improve CLI training with base model.
- Fix issue #4432: Add destructors for states in `TransitionSystem`.
- Fix issue #4440: Require `HEAD` for `is_parsed` in `Doc.from_array`.
- Fix issue #4615: Update `SHAPE` docs and examples.
- Fix issue #4665: Allow `HEAD` field in CoNLL-U format to be an underscore.
- Fix issue #4673: Ensure correct array module is used when returning a vector via `Vocab`.
- Fix issue #4674: Make `set_entities` in the `KnowledgeBase` more robust.
- Fix issue #4677: Add missing tags to tag maps for `el`, `es` and `pt`.
- Fix issue #4688: Iterate over `lr_edges` until `Doc.sents` are correct.
- Fix issue #4703, #4823: Facilitate large training files.
- Fix issue #4707: Auto-exclude `disabled` when calling `from_disk` during load.
- Fix issue #4717: Fix int value handling in `Matcher`.
- Fix issue #4719: Add message when CLI train script throws exception.
- Fix issue #4723: Update `EntityLinker` example.
- Fix issue #4725: Take care of global vectors in multiprocessing.
- Fix issue #4770: Include `Doc.cats` in serialization of `Doc` and `DocBin`.
- Fix issue #4772: Fix bug in `EntityLinker.predict`.
- Fix issue #4777: Fix link to user hooks in documentation.
- Fix issue #4829: Update build dependencies in `pyproject.toml`.
- Fix issue #4830: Warn for punctuation in entities when training with noise.
- Fix issue #4833: Make example scripts work with transformer starter models.
- Fix issue #4849: Fix serialization of `ENT_ID`.
- Fix issue #4862: Fix and improve URL pattern.
- Fix issue #4868: Include `.pyx` and `.pxd` files in the distribution.
- Fix issue #4876: Add friendlier error to entity linking example script.
- Fix issue #4903: Fix handling of custom underscore attributes during multiprocessing.
- Fix issue #4924: Fix handling of empty docs or golds in `Language.evaluate`.
- Fix issue #4934: Prevent updating component config if the `Model` was already defined.
- Fix issue #4935: Fix `Sentencizer.pipe` for empty `Doc`.
- Fix issue #4961: Remove old docs section links.
- Fix issue #4965: Sync `Span.__eq__` and `Span.__hash__`.
- Fix issue #4975: Adjust `srsly` pin.
- Fix issue #5048: Fix behavior of `get_doc` test utility.
- Fix issue #5073: Normalize `IS_SENT_START` to `SENT_START` for `Matcher`.
- Fix issue #5075: Make it impossible to create invalid heads with `Doc.from_array`.
- Fix issue #5082: Correctly set vector of merged span in `merge_entities`.
- Fix issue #5115: Ensure paths in `Tokenizer.to_disk` and `Tokenizer.from_disk`.
- Fix issue #5117: Clarify behavior of `Doc.is_` flags for empty `Doc`s.
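
The fix for #4965 touches a general Python rule: objects that compare equal should also hash equal, otherwise sets and dict keys behave unpredictably. The sketch below shows that contract with a hypothetical `TokenSpan` class. It is a stand-in written for illustration only and is not spaCy's `Span` implementation; the choice of fields in the comparison key is an assumption.

```python
class TokenSpan:
    """Hypothetical stand-in for a span; not spaCy's Span class."""

    def __init__(self, doc_id, start, end, label=0):
        self.doc_id = doc_id
        self.start = start
        self.end = end
        self.label = label

    def _key(self):
        # One shared key keeps __eq__ and __hash__ in sync by construction.
        return (self.doc_id, self.start, self.end, self.label)

    def __eq__(self, other):
        if not isinstance(other, TokenSpan):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())


a = TokenSpan(doc_id=1, start=0, end=2)
b = TokenSpan(doc_id=1, start=0, end=2)
c = TokenSpan(doc_id=1, start=0, end=2, label=7)

# a and b collapse into one set entry; c differs by label and stays separate.
unique = {a, b, c}
```

Deriving both methods from the same tuple is one simple way to rule out the mismatch; if the two methods looked at different fields, `a == b` could hold while `a` and `b` still landed in different hash buckets.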
📖 Documentation and examples
- Fix various typos and inconsistencies.
- Add new projects to the spaCy Universe.
👥 Contributors
Thanks to @polm, @mmaybeno, @jarib, @questoph, @aajanki, @mr-bjerre, @Tclack88, @thiagola92, @tamuhey, @Olamyy, @AlJohri, @iechevarria, @iurshina, @lineality, @pbadeer, @BramVanroy, @kabirkhan, @ceteri, @omri374, @maknotavailable, @onlyanegg, @drndos, @ju-sh, @nlptechbook, @chkoar, @Jan-711, @MisterKeefe, @bryant1410, @mirfan899, @dhpollack and @mabraham for the pull requests and contributions!