Releases: explosion/spaCy

v2.0.8: IS_CURRENCY, PhraseMatcher stability and bug fixes

18 Feb 12:25

📊 Help us improve spaCy and take the User Survey 2018!


✨ New features and improvements

  • NEW: Lexical attribute IS_CURRENCY via Token.is_currency for currency symbols.
  • Add noun_chunks syntax iterator for Norwegian.
  • Add get_beam_parse method in ArcEager.
  • Revert changes to the Matcher in favour of the new and improved API (#1971) coming in v2.1.0.
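
As a rough illustration of what a currency-symbol check like IS_CURRENCY can look like, here is a minimal sketch that does not need spaCy installed. It assumes that "currency symbol" means every character belongs to the Unicode category Sc; the function name is_currency_like is hypothetical, and the release notes do not say how the attribute is actually implemented.

```python
import unicodedata


def is_currency_like(text):
    """Return True if text is non-empty and every character is a
    Unicode currency symbol (general category "Sc").

    Assumption: this mirrors the idea behind IS_CURRENCY; it is not
    taken from spaCy's source code.
    """
    return bool(text) and all(unicodedata.category(ch) == "Sc" for ch in text)


if __name__ == "__main__":
    for candidate in ["$", "€", "£", "USD", "5"]:
        print(candidate, is_currency_like(candidate))
```

Under this assumption, symbols such as "$" and "€" pass the check, while currency codes such as "USD" do not, because they consist of ordinary letters.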

🔴 Bug fixes

  • Fix issue #1706: Ensure files opened in from_disk are closed.
  • Fix issue #1733: Make model loading from package compatible with Python 3.4.
  • Fix issue #1832, #1928: Fix vector handling in init_model command.
  • Fix issue #1915: Pass in hyperparameters correctly during begin_training.
  • Fix issue #1924: Require html5lib in setup.py to prevent six error.
  • Fix issue #1929: Correctly handle NER with pre-set sentence boundaries.
  • Fix issue #1941: Improve documentation around model symlink on Windows.
  • Fix issue #1949: Correct Matcher docs to only include ORTH and LOWER.
  • Fix issue #1950: Fix bug in regex Matcher example.
  • Fix issue #1959: Execute custom pipeline component when using Language.pipe.
  • Fix issue #1964: Correct typo in glossary.
  • Fix issue #1974: Don't set random.seed globally in CLI commands.
  • Fix issue #1989: Correct documentation of match_id and improve example.

📖 Documentation and examples

  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @ohenrik, @tokestermw, @azarezade, @piratos, @mhaddy, @pktippa, @mdcclv, @oxinabox, @SThomasP, @DuyguA, @emulbreh, @ursachec and @enerrio for the pull requests and contributions.

v2.0.7: Fix bug in resuming model training

02 Feb 02:58

🔴 Bug fixes

  • Fix issue #1919: Fix missing config property in parser when resuming training.

v2.0.6: Persian tokenization, Turkish and Norwegian lookup lemmatizer, improved Matcher and lots of bug fixes

01 Feb 03:47

✨ New features and improvements

  • Alpha tokenization support for Persian.
  • Add lookup lemmatizer for Turkish.
  • Add lookup lemmatizer and tag map for Norwegian and improve tokenizer exceptions.
  • Improve model downloading and linking and use proper exit codes in CLI commands.

🔴 Bug fixes

  • Fix issue #1503: Fix Matcher bugs and behaviour of * and + operators.
  • Fix issue #1539: Fix Vectors.resize on Python 3.
  • Fix issue #1591: Fix compiler flags and remove march=native.
  • Fix issue #1606, #1698: Ensure LIKE_URL doesn't return True for email addresses.
  • Fix issue #1622: Use nlp.to_disk in spacy train command.
  • Fix issue #1633: Add missing Span.vocab property.
  • Fix issue #1640: Fix infinite recursion in token.sent_start.
  • Fix issue #1663, #1721, #1761, #1780: Download models with --no-deps to avoid conda errors.
  • Fix issue #1712, #1813: Don't raise deprecation warning in property.
  • Fix issue #1714: Make sure download and validate commands exit correctly.
  • Fix issue #1727: Don't overwrite pretrained_dims setting from cfg.
  • Fix issue #1728: Correct TextCategorizer documentation.
  • Fix issue #1750: Remove non-breaking spaces from Hindi examples.
  • Fix issue #1757: Fix rich comparison against None objects.
  • Fix issue #1758: Add English tokenizer exception for "would've".
  • Fix issue #1769: Make LIKE_NUM case-insensitive.
  • Fix issue #1774: Allow pickling of Chinese language class.
  • Fix issue #1781: Add missing dev dependency.
  • Fix issue #1799: Set l_edge and r_edge correctly for non-projective parses.
  • Fix issue #1807: Make set_vector add word to vocab.
  • Fix issue #1820: Correct documentation of Matcher operators.
  • Fix issue #1831: Allow vector loading to work on 1d data files.
  • Fix issue #1834: Fix sentence boundaries serialization.
  • Fix issue #1838: Clarify hyperparameters and alias usage in spacy train.
  • Fix issue #1851: Fix typo and use better serialization example.
  • Fix issue #1868: Make Vocab.__contains__ work with ints.
  • Fix issue #1883: Fix unpickling of Matcher.
  • Fix issue #1911: Improve error handling if pipeline component is not callable.
  • Fix issues with spacy init_model command.
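
To make the LIKE_NUM fix (#1769) above more concrete, the following sketch shows one way a case-insensitive "looks like a number" check can be written. The word list and the function name like_num_sketch are illustrative assumptions, not spaCy's actual language data or implementation.

```python
# Hypothetical, deliberately short word list for illustration only.
NUM_WORDS = {
    "zero", "one", "two", "three", "four", "five", "six", "seven",
    "eight", "nine", "ten", "hundred", "thousand", "million",
}


def like_num_sketch(text):
    """Return True if text resembles a number.

    Handles digits with separators ("10,000", "3.5"), simple
    fractions ("1/2") and number words in any casing ("Ten").
    """
    stripped = text.replace(",", "").replace(".", "")
    if stripped.isdigit():
        return True
    if stripped.count("/") == 1:
        numerator, denominator = stripped.split("/")
        if numerator.isdigit() and denominator.isdigit():
            return True
    # Lowercasing before the lookup is what makes the check
    # case-insensitive.
    return text.lower() in NUM_WORDS
```

Without the lowercasing step, "Ten" at the start of a sentence would be missed even though "ten" is recognised.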

👥 Contributors

Thanks to @cbilgili, @melanuria, @mpuels, @IsaacHaze, @sorenlind, @Bri-Will, @d99kris, @mdda, @kimfalk, @benjaminp, @zqhZY, @avinashrubird, @nirdesh37, @kwhumphreys, @fucking-signup, @wrathagom, @pbnsilva, @savkov, @matatusko, @GregDubbin, @avadhpatel, @azarezade, @ohenrik, @thomasopsomer, @Kimahriman and @hassanshamim for the pull requests and contributions.

v2.0.5: Fix vector pickling

07 Dec 10:01

✨ New features and improvements

  • Add spacy init-model command to create a model directory from raw data (similar to the spacy model command in v1.x).

🔴 Bug fixes

  • Fix an issue with the vector pickling that would cause vectors to be set to None.

📖 Documentation and examples

  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @mpuels and @GreenRiverRUS for the pull requests and contributions.

v2.0.4: Alpha support for Russian, various improvements & bug fixes

06 Dec 12:39

✨ New features and improvements

  • Alpha support for Russian via pymorphy2.
  • Improve language data for Danish, Italian and Dutch.
  • Add offsets_from_biluo_tags helper to convert BILUO notation to entity offsets.
  • Use POS instead of TAG by default in displaCy, to prevent visualisation issues in languages with long combined tags (e.g. Italian or Dutch).
  • Drop support for EOL Python 2.6 and 3.3.
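
The conversion performed by a helper like offsets_from_biluo_tags can be sketched in plain Python. The version below is a simplified stand-in, not spaCy's function: the name biluo_to_offsets is hypothetical, and it takes token character spans directly instead of a Doc object. In BILUO notation, B, I and L mark the beginning, inside and last token of a multi-token entity, U marks a single-token entity and O marks tokens outside any entity.

```python
def biluo_to_offsets(token_spans, tags):
    """Convert BILUO tags to (start_char, end_char, label) triples.

    token_spans: list of (start_char, end_char), one pair per token.
    tags: list of tags such as "B-PERSON", "L-PERSON", "U-GPE", "O".
    Malformed sequences (for example an "L-" tag with no open "B-")
    are silently skipped in this sketch.
    """
    entities = []
    start, label = None, None  # state of the currently open entity
    for (tok_start, tok_end), tag in zip(token_spans, tags):
        prefix, _, tag_label = tag.partition("-")
        if prefix == "U":
            entities.append((tok_start, tok_end, tag_label))
            start, label = None, None
        elif prefix == "B":
            start, label = tok_start, tag_label
        elif prefix == "I" and start is not None and tag_label == label:
            continue  # still inside the open entity
        elif prefix == "L" and start is not None and tag_label == label:
            entities.append((start, tok_end, label))
            start, label = None, None
        else:
            start, label = None, None  # "O" or a malformed tag
    return entities


if __name__ == "__main__":
    # Tokens of "Alex Smith lives in London" as character spans.
    spans = [(0, 4), (5, 10), (11, 16), (17, 19), (20, 26)]
    tags = ["B-PERSON", "L-PERSON", "O", "O", "U-GPE"]
    print(biluo_to_offsets(spans, tags))
    # → [(0, 10, 'PERSON'), (20, 26, 'GPE')]
```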

🔴 Bug fixes

  • Fix issue #1207: Fix Span.noun_chunks.
  • Fix issue #1494: Handle sequential infixes in tokenizer rules.
  • Fix issue #1587: Add note on attribute extension default arguments in docs.
  • Fix issue #1599: Fix typo in documentation.
  • Fix issue #1612: Ensure that Span.orth_ == Span.text.
  • Fix issue #1617: Make entity_relations.py example Python 2 compatible and fix French test.
  • Fix issue #1654: Fix off-by-one error in nlp.add_pipe when using after.
  • Fix issue #1674: Set correct requirement string in spacy package.
  • Fix issue with StringStore cleanup.

📖 Documentation and examples

  • Update resources page with new spaCy extensions.
  • Add "Unknown locale" error to troubleshooting guide.
  • Always use python -m spacy for CLI commands again to prevent issues on Windows etc.
  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @ligser, @pavillet, @yuukos, @GreenRiverRUS, @MartinoMensio, @raphael0202, @tokestermw, @fsonntag, @cclauss, @bdewilde, @markulrich, @sorenlind, @hugovk, @atomobianco, @twerkmeister, @mkdynamic and @jimregan for the pull requests and contributions.

v2.0.3: Improvements to tokenizer caching and serialization, plus various bug fixes

15 Nov 15:50

✨ New features and improvements

  • Require Thinc v6.10.1 to fix GPU installation and beam parsing.
  • Improve Turkish stop words.
  • Improve Hindi stop words.

🔴 Bug fixes

  • Fix issue #1248: Update English tokenizer and norm exceptions for "-in" and "-in'" verbs.
  • Fix issue #1506: Fix KeyError from cleaning up strings during Language.pipe (work in progress).
  • Fix issue #1521: Ensure path in Doc.to_disk and Doc.from_disk.
  • Fix issue #1525, #1582: Update fastText example to accommodate whitespace.
  • Fix issue #1541: Remove broken link from documentation.
  • Fix issue #1546: Add missing import to make util.minibatch work correctly.
  • Fix issue #1557: Add dummy serialization methods to Japanese tokenizer to allow saving and loading models.
  • Fix caching in Tokenizer (partially addresses performance regression in #1371 and #1508).

👥 Contributors

Thanks to @MathiasDesch, @mcsalgado, @Wahib, @ligser, @abhi18av, @DuyguA, @KMLDS and @yogendrasoni for the pull requests and contributions.

v2.0.2: Fix vector resizing and conda build

08 Nov 22:15

✨ New features and improvements

  • Add text examples for Hindi.

🔴 Bug fixes

  • Fix issue #1507, #1512, #1513, #1514, #1516: Improve new documentation and list of backwards incompatibilities.
  • Fix issue #1515: Correct print statement in train_textcat.py example.
  • Fix issue #1518: Make Vectors.resize work as expected.
  • Fix conda build.

👥 Contributors

Thanks to @danielhers and @abhi18av for the pull requests.

v2.0.1: Fix typo that prevented conda build

08 Nov 02:29

🔴 Bug fixes

  • Fix syntax error in language data examples that prevented conda build.

v2.0.0: Neural networks, 13 new models for 7+ languages, better training, custom pipelines, Pickle & lots of API improvements

07 Nov 22:14
5864635

We're very excited to finally introduce spaCy v2.0. The new version gets spaCy up to date with the latest deep learning technologies and makes it much easier to run spaCy in scalable cloud computing workflows. We've fixed over 60 bugs (every open bug!), including several long-standing issues, trained 13 neural network models for 7+ languages and added alpha tokenization support for 8 new languages. We also re-wrote almost all of the usage guides, API docs and code examples.

pip install -U spacy
conda install -c conda-forge spacy

✨ Major features and improvements

🔮 Models

spaCy v2.0 comes with 13 new convolutional neural network models for 7+ languages. The models have been designed and implemented from scratch specifically for spaCy. A novel bloom embedding strategy with subword features is used to support huge vocabularies in tiny tables.
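
The idea of fitting a huge vocabulary into a tiny table can be illustrated with a small hashing sketch. This is a generic bloom-style embedding written for illustration only; the class name, table size and choice of hash function are assumptions and do not describe spaCy's or Thinc's actual implementation. Each key is hashed several times with different seeds, and the rows selected in one small shared table are summed, so two keys only get the same vector if all of their hashes collide.

```python
import hashlib
import random


class BloomEmbeddingSketch:
    """Map arbitrary string keys to vectors using a small shared table."""

    def __init__(self, n_rows=1000, width=4, n_hashes=4, seed=0):
        rng = random.Random(seed)
        self.n_rows = n_rows
        self.width = width
        self.n_hashes = n_hashes
        # The table has a fixed number of rows, however many keys are seen.
        self.table = [
            [rng.uniform(-0.1, 0.1) for _ in range(width)]
            for _ in range(n_rows)
        ]

    def rows_for(self, key):
        """Return one row index per hash seed for the given key."""
        rows = []
        for i in range(self.n_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode("utf8")).digest()
            rows.append(int.from_bytes(digest[:8], "little") % self.n_rows)
        return rows

    def __call__(self, key):
        """Sum the table rows selected by each hash of the key."""
        vector = [0.0] * self.width
        for row in self.rows_for(key):
            for j in range(self.width):
                vector[j] += self.table[row][j]
        return vector


if __name__ == "__main__":
    embed = BloomEmbeddingSketch()
    # Subword-style keys (hypothetical feature names) share the same table.
    for key in ["norm=apple", "prefix=a", "suffix=ple", "shape=xxxx"]:
        print(key, embed(key))
```

In a trained model the table entries would be learned rather than random; the sketch only shows the lookup step.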

All core models include part-of-speech tags, dependency labels and named entities. Small models include only context-specific token vectors, while medium-sized and large models ship with word vectors. For more details, see the models directory or try our new model comparison tool.

| Name | Language | Features | Size |
| --- | --- | --- | --- |
| en_core_web_sm | English | Tagger, parser, entities | 35 MB |
| en_core_web_md | English | Tagger, parser, entities, vectors | 115 MB |
| en_core_web_lg | English | Tagger, parser, entities, vectors | 812 MB |
| en_vectors_web_lg | English | Vectors | 627 MB |
| de_core_news_sm | German | Tagger, parser, entities | 36 MB |
| es_core_news_sm | Spanish | Tagger, parser, entities | 35 MB |
| es_core_news_md | Spanish | Tagger, parser, entities, vectors | 93 MB |
| pt_core_news_sm | Portuguese | Tagger, parser, entities | 36 MB |
| fr_core_news_sm | French | Tagger, parser, entities | 37 MB |
| fr_core_news_md | French | Tagger, parser, entities, vectors | 106 MB |
| it_core_news_sm | Italian | Tagger, parser, entities | 34 MB |
| nl_core_news_sm | Dutch | Tagger, parser, entities | 34 MB |
| xx_ent_wiki_sm | Multi-language | Entities | 33 MB |

You can download a model by using its name or shortcut. To load a model, use spacy.load(), or import it as a module and call its load() method:

spacy download en_core_web_sm

import spacy
nlp = spacy.load('en_core_web_sm')

import en_core_web_sm
nlp = en_core_web_sm.load()

📈 Benchmarks

spaCy v2.0's new neural network models bring significant improvements in accuracy, especially for English Named Entity Recognition. The new en_core_web_lg model makes about 25% fewer mistakes than the corresponding v1.x model and is within 1% of the current state-of-the-art (Strubell et al., 2017). The v2.0 models are also cheaper to run at scale, as they require under 1 GB of memory per process.

English

| Model | spaCy | Type | UAS | LAS | NER F | POS | Size |
| --- | --- | --- | --- | --- | --- | --- | --- |
| en_core_web_sm-2.0.0 | v2.x | neural | 91.7 | 89.8 | 85.3 | 97.0 | 35 MB |
| en_core_web_md-2.0.0 | v2.x | neural | 91.7 | 89.8 | 85.9 | 97.1 | 115 MB |
| en_core_web_lg-2.0.0 | v2.x | neural | 91.9 | 90.1 | 85.9 | 97.2 | 812 MB |
| en_core_web_sm-1.1.0 | v1.x | linear | 86.6 | 83.8 | 78.5 | 96.6 | 50 MB |
| en_core_web_md-1.2.1 | v1.x | linear | 90.6 | 88.5 | 81.4 | 96.7 | 1 GB |
Spanish

Model spaCy Type UAS LAS NER F POS Size
es_core_news_sm-2.0.0 v2.x neural 89.8 86.8 88.7 96.9 35MB
es_core_news_md-2.0.0 v2.x neural 90.2 87.2 89.0 97.8 93MB
es_core_web_md-1.1.0 v1.x linear 87.5 n/a 94.2 96.7 377MB

For more details of the other models, see the models directory and model comparison tool.

🔴 Bug fixes

  • Fix issue #125, #228, #299, #377, #460, #606, #930: Add full Pickle support.
  • Fix issue #152, #264, #322, #343, #437, #514, #636, #785, #927, #985, #992, #1011: Fix and improve serialization and deserialization of Doc objects.
  • Fix issue #285, #1225: Fix memory growth problem when streaming data.
  • Fix issue #512: Improve parser to prevent it from returning two ROOT objects.
  • Fix issue #519, #611, #725: Retrain German model with better tokenized input.
  • Fix issue #524: Improve parser and handling of noun chunks.
  • Fix issue #621: Prevent double spaces from changing the parser result.
  • Fix issue #664, #999, #1026: Fix bugs that would prevent loading trained NER models.
  • Fix issue #671, #809, #856: Fix importing and loading of word vectors.
  • Fix issue #683, #1052, #1442: Don't require tag maps to provide SP tag.
  • Fix issue #753: Resolve bug that would tag OOV items as personal pronouns.
  • Fix issue #860, #956, #1085, #1381: Allow custom attribute extensions on Doc, Token and Span.
  • Fix issue #905, #954, #1021, #1040, #1042: Improve parsing model and allow faster accuracy updates.
  • Fix issue #933, #977, #1406: Update online demos.
  • Fix issue #995: Improve punctuation rules for Hebrew and other non-latin languages.
  • Fix issue #1008: train command finally works correctly if used without dev_data.
  • Fix issue #1012: Improve word vectors documentation.
  • Fix issue #1043: Improve NER models and allow faster accuracy updates.
  • Fix issue #1044: Fix bugs in French model and improve performance.
  • Fix issue #1051: Improve error messages if functionality needs a model to be installed.
  • Fix issue #1071: Correct typo of "whereve" in English tokenizer exceptions.
  • Fix issue #1088: Emoji are now split into separate tokens wherever possible.
  • Fix issue #1240: Allow merging Spans without keyword arguments.
  • Fix issue #1243: Resolve undefined names in deprecated functions.
  • Fix issue #1250: Fix caching bug that would cause tokenizer to ignore special case rules after first parse.
  • Fix issue #1257: Ensure the compare operator == works as expected on tokens.
  • Fix issue #1291: Improve documentation of training format.
  • Fix issue #1336: Fix bug that caused inconsistencies in NER results.
  • Fix issue #1375: Make sure Token.nbor raises IndexError correctly.
  • Fix issue #1450: Fix error when OP quantifier "*" ends the match pattern.
  • Fix issue #1452: Fix bug that would mutate the original text.

📖 Documentation and examples

  • NEW: Completely rewritten, reorganised and redesigned [usage](http...

v1.10.0: Alpha support for Thai & Russian, plus improvements and bug fixes

07 Nov 11:41

⚠️ Important note: This is a bridge release that gets the current state of the v1.x branch published. Stay tuned for v2.0.

✨ Major features and improvements

  • NEW: Alpha tokenization support for Thai and Russian.
  • NEW: Alpha support for Japanese part-of-speech tagging.
  • NEW: Dependency pattern-matching algorithm (see #1120).
  • Add support for getting a lowest common ancestor matrix via Doc.get_lca_matrix().
  • Improve capturing of English noun chunks.
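
For readers unfamiliar with the term, a lowest common ancestor (LCA) matrix over a dependency parse can be sketched as follows. This is a plain-Python illustration and not the code behind Doc.get_lca_matrix(); the function name and the input format (a list of head indices in which the root points to itself) are assumptions, as is the use of -1 for token pairs with no shared ancestor.

```python
def lca_matrix_sketch(heads):
    """Return an n x n matrix where entry [i][j] is the index of the
    lowest common ancestor of tokens i and j, or -1 if there is none.

    heads[i] is the index of token i's syntactic head; a root token
    has itself as head. The input is assumed to contain no cycles
    other than those root self-loops.
    """
    n = len(heads)

    def path_to_root(i):
        # Token i followed by its ancestors, ending at its root.
        path = [i]
        while heads[path[-1]] != path[-1]:
            path.append(heads[path[-1]])
        return path

    paths = [path_to_root(i) for i in range(n)]
    matrix = [[-1] * n for _ in range(n)]
    for i in range(n):
        ancestors_of_i = set(paths[i])
        for j in range(n):
            # Walk upwards from j; the first node that is also on
            # i's path is the lowest common ancestor.
            for node in paths[j]:
                if node in ancestors_of_i:
                    matrix[i][j] = node
                    break
    return matrix


if __name__ == "__main__":
    # "She ate pizza": both "She" (0) and "pizza" (2) attach to "ate" (1).
    print(lca_matrix_sketch([1, 1, 1]))
    # → [[0, 1, 1], [1, 1, 1], [1, 1, 2]]
```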

🔴 Bug fixes

  • Fix issue #1078: Simplify URL pattern.
  • Fix issue #1174: Fix NER model loading bug and make sure JSON keys are loaded as strings.
  • Fix issue #1291: Document correct JSON format for training.
  • Fix issue #1292: Fix error when adding custom infix rules.
  • Fix issue #1387: Ensure that lemmatizer respects exception rules.
  • Fix issue #1410: Support single value for attribute list in Doc.to_scalar and Doc.to_array.

📖 Documentation and examples

  • Document correct JSON format for training.
  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @raphael0202, @gideonite, @delirious-lettuce, @polm, @kevinmarsh, @IamJeffG, @Vimos, @ericzhao28, @galaxyh, @hscspring, @wannaphongcom, @Wellan89, @kokes, @mdcclv, @ameyuuno, @ramananbalakrishnan, @Demfier, @johnhaley81, @mayukh18 and @jnothman for the pull requests and contributions.