fix(deps): update dependency stanza to v1.9.2 #145
This PR contains the following updates: stanza 1.5.0 -> 1.9.2
Release Notes
stanfordnlp/stanza (stanza)
v1.9.2: Multilingual Coref
Multilingual coref!

New features
download_method=None now turns off HF downloads as well, for use in instances with no internet access https://github.com/stanfordnlp/stanza/pull/1408 https://github.com/stanfordnlp/stanza/issues/1399

New models
Bugfixes
v1.9.1: Multilingual Coref
Multilingual coref!

New features
download_method=None now turns off HF downloads as well, for use in instances with no internet access https://github.com/stanfordnlp/stanza/pull/1408 https://github.com/stanfordnlp/stanza/issues/1399

New models
Bugfixes
v1.9.0: Multilingual Coref
Multilingual coref!

New features
download_method=None now turns off HF downloads as well, for use in instances with no internet access https://github.com/stanfordnlp/stanza/pull/1408 https://github.com/stanfordnlp/stanza/issues/1399

New models
Bugfixes
v1.8.2: Old English, MWT improvements, and better memory management of Peft
Add an Old English pipeline, improve the handling of MWT for cases that should be easy, and improve the memory management of our usage of transformers with adapters.
Old English
MWT improvements
Fix words ending with -nna split into MWT stanfordnlp/handparsed-treebank@2c48d40 https://github.com/stanfordnlp/stanza/issues/1366
Fix MWT for English splitting into weird words by enforcing that the pieces add up to the whole (which is always the case in the English treebanks) https://github.com/stanfordnlp/stanza/issues/1371 https://github.com/stanfordnlp/stanza/pull/1378
Mark start_char and end_char on an MWT if it is composed of exactly its subwords stanfordnlp/stanza@2384089 https://github.com/stanfordnlp/stanza/issues/1361

Peft memory management
Other bugfixes and minor upgrades
Fix crash when trying to load previously unknown language https://github.com/stanfordnlp/stanza/issues/1360 stanfordnlp/stanza@381736f
Check that sys.stderr has isatty before manipulating it with tqdm, in case sys.stderr was monkeypatched: stanfordnlp/stanza@d180ae0 https://github.com/stanfordnlp/stanza/issues/1367
Try to avoid OOM in the POS tagger in the Pipeline by reducing its max batch length stanfordnlp/stanza@4271813
Fix usage of gradient checkpointing & a weird interaction with Peft (thanks to @Jemoka) stanfordnlp/stanza@597d48f
Other upgrades
Add * to the list of functional tags to drop in the constituency parser, helping Icelandic annotation stanfordnlp/stanza@57bfa8b https://github.com/stanfordnlp/stanza/issues/1356#issuecomment-19812169122
Can train depparse without using any of the POS columns, especially useful if training a cross-lingual parser: stanfordnlp/stanza@4048cae stanfordnlp/stanza@15b136b
Add a constituency model for German stanfordnlp/stanza@7a4f48c stanfordnlp/stanza@86ddaab https://github.com/stanfordnlp/stanza/issues/1368
v1.8.1: PEFT Integration (with bugfixes)
Integrating PEFT into several different annotators
We integrate PEFT into our training pipeline for several different models. This greatly reduces the size of models with finetuned transformers, letting us make the finetuned versions of those models the default_accurate model.

The biggest gains observed are with the constituency parser and the sentiment classifier. Previously, the default_accurate package used transformers where the head was trained but the transformer itself was not finetuned.

Model improvements
Features
Bugfixes
download_resources_json was broken: https://github.com/stanfordnlp/stanza/pull/1318 https://github.com/stanfordnlp/stanza/issues/1317 Thank you @ider-zh

Additional 1.8.1 Bugfixes
.get() stanfordnlp/stanza@13ee3d5 https://github.com/stanfordnlp/stanza/issues/1357
device arg in MultilingualPipeline would crash if device was passed for an individual Pipeline: stanfordnlp/stanza@44058a0

v1.8.0: PEFT integration
Integrating PEFT into several different annotators
We integrate PEFT into our training pipeline for several different models. This greatly reduces the size of models with finetuned transformers, letting us make the finetuned versions of those models the default_accurate model.

The biggest gains observed are with the constituency parser and the sentiment classifier. Previously, the default_accurate package used transformers where the head was trained but the transformer itself was not finetuned.

Model improvements
Features
Bugfixes
download_resources_json was broken: https://github.com/stanfordnlp/stanza/pull/1318 https://github.com/stanfordnlp/stanza/issues/1317 Thank you @ider-zh

v1.7.0: Neural coref!
Neural coref processor added!
Conjunction-Aware Word-Level Coreference Resolution
https://arxiv.org/abs/2310.06165
original implementation: https://github.com/KarelDO/wl-coref/tree/master
Updated form of Word-Level Coreference Resolution
https://aclanthology.org/2021.emnlp-main.605/
original implementation: https://github.com/vdobrovolskii/wl-coref
If you use Stanza's coref module in your work, please be sure to cite both of the above papers.
Special thanks to vdobrovolskii, who graciously agreed to allow for integration of his work into Stanza, to @KarelDO for his support of his training enhancement, and to @Jemoka for the LoRA PEFT integration, which makes the finetuning of the transformer based coref annotator much less expensive.
Currently there is one model provided, a transformer based English model trained from OntoNotes. The provided model is currently based on Electra-Large, as that is more harmonious with the rest of our transformer architecture. When we have LoRA integration with POS, depparse, and the other processors, we will revisit the question of which transformer is most appropriate for English.
Future work includes ZH and AR models from OntoNotes, additional language support from UD-Coref, and lower cost non-transformer models
https://github.com/stanfordnlp/stanza/pull/1309
Interface change: English MWT
English now has an MWT model by default. Text such as won't is now marked as a single token, split into two words, will and not. Previously it was expected to be tokenized into two pieces, but the Sentence object containing that text would not have a single Token object connecting the two pieces. See https://stanfordnlp.github.io/stanza/mwt.html and https://stanfordnlp.github.io/stanza/data_objects.html#token for more information.

Code that used to operate with "for word in sentence.words" will continue to work as before, but "for token in sentence.tokens" will now produce one object for MWT such as won't, cannot, Stanza's, etc.

Pipeline creation will not change, as MWT is automatically (but not silently) added at Pipeline creation time if the language and package includes MWT. https://github.com/stanfordnlp/stanza/pull/1314/commits/f22dceb93275fc724536b03b31c08a94617880ca https://github.com/stanfordnlp/stanza/pull/1314/commits/27983aefe191f6abd93dd49915d2515d7c3973d1
Other updates
Deprecated conll_as_string and doc2conll_text; use "{:C}".format(doc) instead stanfordnlp/stanza@e01650f
Sentences have a doc_id field if the document they are created from has a doc_id https://github.com/stanfordnlp/stanza/pull/1314/commits/8e2201f42cb99a5a3d8358ce59501c1d88f2585e

Updated requirements
The peft module used for finetuning the transformer used in the coref processor does not support those versions
Added peft as an optional dependency to transformer based installations
Added networkx as a dependency for reading enhanced dependencies
Added toml as a dependency for reading the coref config

v1.6.1: Multiple default models and a combined EN NER model
V1.6.1 is a patch of a bug in the Arabic POS tagger.
We also mark Python 3.11 as supported in the setup.py classifiers. This will be the last release that supports Python 3.6.

Multiple model levels
The package parameter for building the Pipeline now has three default settings:
- default, the same as before, where POS, depparse, and NER use the charlm, but lemma does not
- default-fast, where POS and depparse are built without the charlm, making them substantially faster on CPU. Some languages currently have non-charlm NER as well
- default-accurate, where the lemmatizer also uses the charlm, and other models use transformers if we have one for that language. Suggestions for more transformers to use are welcome

Furthermore, package dictionaries are now provided for each UD dataset which encompass the default versions of models for that dataset, although we do not further break that down into -fast and -accurate versions for each UD dataset.

PR: https://github.com/stanfordnlp/stanza/pull/1287
addresses https://github.com/stanfordnlp/stanza/issues/1259 and https://github.com/stanfordnlp/stanza/issues/1284
Multiple output heads for one NER model
The NER models now can learn multiple output layers at once.
https://github.com/stanfordnlp/stanza/pull/1289
Theoretically this could be used to save a bit of time on the encoder while tagging multiple classes at once, but the main use case was to crosstrain the OntoNotes model on the WorldWide English newswire data we collected. The effect is that the model learns to incorporate some named entities from outside the standard OntoNotes vocabulary into the main 18 class tagset, even though the WorldWide training data is only 8 classes.
Results of running the OntoNotes model, with charlm but not transformer, on the OntoNotes and WorldWide test sets:
We also produced combined models for nocharlm and with Electra as the input encoding. The new English NER models are the packages ontonotes-combined_nocharlm, ontonotes-combined_charlm, and ontonotes-combined_electra-large.

Future plans include using multiple NER datasets for other models as well.
Other features
Postprocessing of proposed tokenization is possible with dependency injection on the Pipeline (ty @Jemoka). When creating a Pipeline, you can now provide a callable via the tokenize_postprocessor parameter, and it can adjust the candidate list of tokens to change the tokenization used by the rest of the Pipeline https://github.com/stanfordnlp/stanza/pull/1290
Finetuning for transformers in the NER models: have not yet found helpful settings, though stanfordnlp/stanza@45ef544
SE and SME should both represent Northern Sami, a weird case where UD didn't use the standard 2 letter code https://github.com/stanfordnlp/stanza/issues/1279 stanfordnlp/stanza@88cd0df
charlm for PT (improves accuracy on non-transformer models): stanfordnlp/stanza@c10763d
build models with transformers for a few additional languages: MR, AR, PT, JA stanfordnlp/stanza@45b3875 stanfordnlp/stanza@0f3761e stanfordnlp/stanza@c55472a stanfordnlp/stanza@c10763d
Bugfixes
V1.6.1 fixes a bug in the Arabic POS model which was an unfortunate side effect of the NER change to allow multiple tag sets at once: stanfordnlp/stanza@b56f442
Scenegraph CoreNLP connection needed to be checked before sending messages: https://github.com/stanfordnlp/CoreNLP/issues/1346#issuecomment-1713267522 stanfordnlp/stanza@c71bf3f
run_ete.py was not correctly processing the charlm, meaning the whole thing wouldn't actually run stanfordnlp/stanza@16f29f3
Chinese NER model was pointing to the wrong pretrain https://github.com/stanfordnlp/stanza/issues/1285 stanfordnlp/stanza@82a0215
v1.6.0: Multiple default models and a combined EN NER model
Multiple model levels
The package parameter for building the Pipeline now has three default settings:
- default, the same as before, where POS, depparse, and NER use the charlm, but lemma does not
- default-fast, where POS and depparse are built without the charlm, making them substantially faster on CPU. Some languages currently have non-charlm NER as well
- default-accurate, where the lemmatizer also uses the charlm, and other models use transformers if we have one for that language. Suggestions for more transformers to use are welcome

Furthermore, package dictionaries are now provided for each UD dataset which encompass the default versions of models for that dataset, although we do not further break that down into -fast and -accurate versions for each UD dataset.

PR: https://github.com/stanfordnlp/stanza/pull/1287
addresses https://github.com/stanfordnlp/stanza/issues/1259 and https://github.com/stanfordnlp/stanza/issues/1284
Multiple output heads for one NER model
The NER models now can learn multiple output layers at once.
https://github.com/stanfordnlp/stanza/pull/1289
Theoretically this could be used to save a bit of time on the encoder while tagging multiple classes at once, but the main use case was to crosstrain the OntoNotes model on the WorldWide English newswire data we collected. The effect is that the model learns to incorporate some named entities from outside the standard OntoNotes vocabulary into the main 18 class tagset, even though the WorldWide training data is only 8 classes.
Results of running the OntoNotes model, with charlm but not transformer, on the OntoNotes and WorldWide test sets:
We also produced combined models for nocharlm and with Electra as the input encoding. The new English NER models are the packages ontonotes-combined_nocharlm, ontonotes-combined_charlm, and ontonotes-combined_electra-large.

Future plans include using multiple NER datasets for other models as well.
Other features
Postprocessing of proposed tokenization is possible with dependency injection on the Pipeline (ty @Jemoka). When creating a Pipeline, you can now provide a callable via the tokenize_postprocessor parameter, and it can adjust the candidate list of tokens to change the tokenization used by the rest of the Pipeline https://github.com/stanfordnlp/stanza/pull/1290
Finetuning for transformers in the NER models: have not yet found helpful settings, though stanfordnlp/stanza@45ef544
SE and SME should both represent Northern Sami, a weird case where UD didn't use the standard 2 letter code https://github.com/stanfordnlp/stanza/issues/1279 stanfordnlp/stanza@88cd0df
charlm for PT (improves accuracy on non-transformer models): stanfordnlp/stanza@c10763d
build models with transformers for a few additional languages: MR, AR, PT, JA stanfordnlp/stanza@45b3875 stanfordnlp/stanza@0f3761e stanfordnlp/stanza@c55472a stanfordnlp/stanza@c10763d
Bugfixes
Scenegraph CoreNLP connection needed to be checked before sending messages: https://github.com/stanfordnlp/CoreNLP/issues/1346#issuecomment-1713267522 stanfordnlp/stanza@c71bf3f
run_ete.py was not correctly processing the charlm, meaning the whole thing wouldn't actually run stanfordnlp/stanza@16f29f3
Chinese NER model was pointing to the wrong pretrain https://github.com/stanfordnlp/stanza/issues/1285 stanfordnlp/stanza@82a0215
v1.5.1: charlm & transformer integration in depparse
Features
depparse can have transformer as an embedding https://github.com/stanfordnlp/stanza/pull/1282/commits/ee171cd167900fbaac16ff4b1f2fbd1a6e97de0a
Lemmatizer can remember word,pos it has seen before with a flag https://github.com/stanfordnlp/stanza/issues/1263 stanfordnlp/stanza@a87ffd0
Scoring scripts for Flair and spaCy NER models (requires the appropriate packages, of course) https://github.com/stanfordnlp/stanza/pull/1282/commits/63dc212b467cd549039392743a0be493cc9bc9d8 https://github.com/stanfordnlp/stanza/pull/1282/commits/c42aed569f9d376e71708b28b0fe5b478697ba05 https://github.com/stanfordnlp/stanza/pull/1282/commits/eab062341480e055f93787d490ff31d923a68398
SceneGraph connection for the CoreNLP client https://github.com/stanfordnlp/stanza/pull/1282/commits/d21a95cc90443ec4737de6d7ba68a106d12fb285
Update constituency parser to reduce the learning rate on plateau. Fiddling with the learning rates significantly improves performance https://github.com/stanfordnlp/stanza/pull/1282/commits/f753a4f35b7c2cf7e8e6b01da3a60f73493178e1
Tokenize [] based on () rules if the original dataset doesn't have [] in it https://github.com/stanfordnlp/stanza/pull/1282/commits/063b4ba3c6ce2075655a70e54c434af4ce7ac3a9
Attempt to finetune the charlm when building models (have not found effective settings for this yet) https://github.com/stanfordnlp/stanza/pull/1282/commits/048fdc9c9947a154d4426007301d63d920e60db0
Add the charlm to the lemmatizer - this will not be the default, since it is slower, but it is more accurate https://github.com/stanfordnlp/stanza/pull/1282/commits/e811f52b4cf88d985e7dbbd499fe30dbf2e76d8d https://github.com/stanfordnlp/stanza/pull/1282/commits/66add6d519deb54ca9be5fe3148023a5d7d815e4 https://github.com/stanfordnlp/stanza/pull/1282/commits/f086de2359cce16ef2718c0e6e3b5deef1345c74
Bugfixes
Forgot to include the lemmatizer in CoreNLP 4.5.3, now in 4.5.4 stanfordnlp/stanza@4dda14b https://github.com/bjascob/LemmInflect/issues/14#issuecomment-1470954013
prepare_ner_dataset was always creating an Armenian pipeline, even for non-Armenian languages stanfordnlp/stanza@78ff85c
Fix an empty bulk_process throwing an exception https://github.com/stanfordnlp/stanza/pull/1282/commits/5e2d15d1aa59e4a1fee8bba1de60c09ba21bf53e https://github.com/stanfordnlp/stanza/issues/1278
Unroll the recursion in the Tarjan part of the Chuliu-Edmonds algorithm - should remove stack overflow errors https://github.com/stanfordnlp/stanza/pull/1282/commits/e0917b0967ba9752fdf489b86f9bfd19186c38eb
Minor updates
Put NER and POS scores on one line to make it easier to grep for: stanfordnlp/stanza@da2ae33 stanfordnlp/stanza@8c4cb04
Switch all pretrains to use a name which indicates their source, rather than the dataset they are used for: https://github.com/stanfordnlp/stanza/pull/1282/commits/d1c68ed01276b3cf1455d497057fbc0b82da49e5 and many others
Pipeline uses torch.no_grad() for a slight speed boost https://github.com/stanfordnlp/stanza/pull/1282/commits/36ab82edfc574d46698c5352e07d2fcb0d68d3b3
Generalize save names, which eventually allows for putting transformer, charlm, or nocharlm in the save name - this lets us distinguish different complexities of model https://github.com/stanfordnlp/stanza/pull/1282/commits/cc0845826973576d8d8ed279274e6509250c9ad5 for constituency, and others for the other models
Add the model's flags to the --help for the run scripts, such as https://github.com/stanfordnlp/stanza/pull/1282/commits/83c0901c6ca2827224e156477e42e403d330a16e https://github.com/stanfordnlp/stanza/pull/1282/commits/7c171dd8d066c6973a8ee18a016b65f62376ea4c https://github.com/stanfordnlp/stanza/pull/1282/commits/8e1d112bee42f2211f5153fcc89083b97e3d2600
Remove the dependency on six https://github.com/stanfordnlp/stanza/pull/1282/commits/6daf97142ebc94cca7114a8cda5a20bf66f7f707 (thank you @BLKSerene)

New Models
VLSP constituency stanfordnlp/stanza@500435d
VLSP constituency -> tagging stanfordnlp/stanza@cb0f22d
CTB 5.1 constituency https://github.com/stanfordnlp/stanza/pull/1282/commits/f2ef62b96c79fcaf0b8aa70e4662d33b26dadf31
Add support for CTB 9.0, although those models are not distributed yet https://github.com/stanfordnlp/stanza/pull/1282/commits/1e3ea8a10b2e485bc7c79c6ab41d1f1dd8c2022f
Added an Indonesian charlm
Indonesian constituency from ICON treebank https://github.com/stanfordnlp/stanza/pull/1218
All languages with pretrained charlms now have an option to use that charlm for dependency parsing
French combined models out of GSD, ParisStories, Rhapsodie, and Sequoia https://github.com/stanfordnlp/stanza/pull/1282/commits/ba64d37d3bf21af34373152e92c9f01241e27d8b
UD 2.12 support https://github.com/stanfordnlp/stanza/pull/1282/commits/4f987d2cd708ce4ca27935d347bb5b5d28a78058
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.