fix(deps): update dependency stanza to v1.9.2 #145
This PR contains the following updates: stanza 1.5.0 -> 1.9.2
Release Notes
stanfordnlp/stanza (stanza)
v1.9.2: Multilingual Coref
Multilingual coref!

New features
download_method=None now turns off HF downloads as well, for use in instances with no internet access https://github.com/stanfordnlp/stanza/pull/1408 https://github.com/stanfordnlp/stanza/issues/1399

New models
Bugfixes
v1.9.1: Multilingual Coref
Multilingual coref!

New features
download_method=None now turns off HF downloads as well, for use in instances with no internet access https://github.com/stanfordnlp/stanza/pull/1408 https://github.com/stanfordnlp/stanza/issues/1399

New models
Bugfixes
v1.9.0: Multilingual Coref
Multilingual coref!

New features
download_method=None now turns off HF downloads as well, for use in instances with no internet access https://github.com/stanfordnlp/stanza/pull/1408 https://github.com/stanfordnlp/stanza/issues/1399

New models
Bugfixes
v1.8.2: Old English, MWT improvements, and better memory management of Peft
Add an Old English pipeline, improve the handling of MWT for cases that should be easy, and improve the memory management of our usage of transformers with adapters.
Old English
MWT improvements
Fix words ending with -nna split into MWT stanfordnlp/handparsed-treebank@2c48d40 https://github.com/stanfordnlp/stanza/issues/1366
Fix MWT for English splitting into weird words by enforcing that the pieces add up to the whole (which is always the case in the English treebanks) https://github.com/stanfordnlp/stanza/issues/1371 https://github.com/stanfordnlp/stanza/pull/1378
Mark start_char and end_char on an MWT if it is composed of exactly its subwords stanfordnlp/stanza@2384089 https://github.com/stanfordnlp/stanza/issues/1361

Peft memory management
Other bugfixes and minor upgrades
Fix crash when trying to load previously unknown language https://github.com/stanfordnlp/stanza/issues/1360 stanfordnlp/stanza@381736f
Check that sys.stderr has isatty before manipulating it with tqdm, in case sys.stderr was monkeypatched: stanfordnlp/stanza@d180ae0 https://github.com/stanfordnlp/stanza/issues/1367
Try to avoid OOM in the POS tagger in the Pipeline by reducing its max batch length stanfordnlp/stanza@4271813
Fix usage of gradient checkpointing & a weird interaction with Peft (thanks to @Jemoka) stanfordnlp/stanza@597d48f
Other upgrades
Add * to the list of functional tags to drop in the constituency parser, helping Icelandic annotation stanfordnlp/stanza@57bfa8b https://github.com/stanfordnlp/stanza/issues/1356#issuecomment-19812169122
Can train depparse without using any of the POS columns, especially useful if training a cross-lingual parser: stanfordnlp/stanza@4048cae stanfordnlp/stanza@15b136b
Add a constituency model for German stanfordnlp/stanza@7a4f48c stanfordnlp/stanza@86ddaab https://github.com/stanfordnlp/stanza/issues/1368
v1.8.1: PEFT Integration (with bugfixes)
Integrating PEFT into several different annotators
We integrate PEFT into our training pipeline for several different models. This greatly reduces the size of models with finetuned transformers, letting us make the finetuned versions of those models the default_accurate model.

The biggest gains observed are with the constituency parser and the sentiment classifier. Previously, the default_accurate package used transformers where the head was trained but the transformer itself was not finetuned.

Model improvements
Features
Bugfixes
download_resources_json was broken: https://github.com/stanfordnlp/stanza/pull/1318 https://github.com/stanfordnlp/stanza/issues/1317 Thank you @ider-zh

Additional 1.8.1 Bugfixes
.get() stanfordnlp/stanza@13ee3d5 https://github.com/stanfordnlp/stanza/issues/1357
device arg in MultilingualPipeline would crash if device was passed for an individual Pipeline: stanfordnlp/stanza@44058a0

v1.8.0: PEFT integration
Integrating PEFT into several different annotators
We integrate PEFT into our training pipeline for several different models. This greatly reduces the size of models with finetuned transformers, letting us make the finetuned versions of those models the default_accurate model.

The biggest gains observed are with the constituency parser and the sentiment classifier. Previously, the default_accurate package used transformers where the head was trained but the transformer itself was not finetuned.

Model improvements
Features
Bugfixes
download_resources_json was broken: https://github.com/stanfordnlp/stanza/pull/1318 https://github.com/stanfordnlp/stanza/issues/1317 Thank you @ider-zh

v1.7.0: Neural coref!
Neural coref processor added!
Conjunction-Aware Word-Level Coreference Resolution
https://arxiv.org/abs/2310.06165
original implementation: https://github.com/KarelDO/wl-coref/tree/master
Updated form of Word-Level Coreference Resolution
https://aclanthology.org/2021.emnlp-main.605/
original implementation: https://github.com/vdobrovolskii/wl-coref
If you use Stanza's coref module in your work, please be sure to cite both of the above papers.
Special thanks to vdobrovolskii, who graciously agreed to allow for integration of his work into Stanza, to @KarelDO for his support of his training enhancement, and to @Jemoka for the LoRA PEFT integration, which makes the finetuning of the transformer based coref annotator much less expensive.
Currently there is one model provided, a transformer based English model trained from OntoNotes. The provided model is currently based on Electra-Large, as that is more harmonious with the rest of our transformer architecture. When we have LoRA integration with POS, depparse, and the other processors, we will revisit the question of which transformer is most appropriate for English.
Future work includes ZH and AR models from OntoNotes, additional language support from UD-Coref, and lower cost non-transformer models
https://github.com/stanfordnlp/stanza/pull/1309
Interface change: English MWT
English now has an MWT model by default. Text such as won't is now marked as a single token, split into two words, will and not. Previously it was expected to be tokenized into two pieces, but the Sentence object containing that text would not have a single Token object connecting the two pieces. See https://stanfordnlp.github.io/stanza/mwt.html and https://stanfordnlp.github.io/stanza/data_objects.html#token for more information.

Code that used to operate with "for word in sentence.words" will continue to work as before, but "for token in sentence.tokens" will now produce one object for MWT such as won't, cannot, Stanza's, etc.

Pipeline creation will not change, as MWT is automatically (but not silently) added at Pipeline creation time if the language and package includes MWT. https://github.com/stanfordnlp/stanza/pull/1314/commits/f22dceb93275fc724536b03b31c08a94617880ca https://github.com/stanfordnlp/stanza/pull/1314/commits/27983aefe191f6abd93dd49915d2515d7c3973d1
Other updates
Deprecated conll_as_string and doc2conll_text; use "{:C}".format(doc) instead stanfordnlp/stanza@e01650f
Sentences have a doc_id field if the document they are created from has a doc_id https://github.com/stanfordnlp/stanza/pull/1314/commits/8e2201f42cb99a5a3d8358ce59501c1d88f2585e

Updated requirements
The peft module used for finetuning the transformer used in the coref processor does not support those versions
Added peft as an optional dependency to transformer based installations
Added networkx as a dependency for reading enhanced dependencies
Added toml as a dependency for reading the coref config

v1.6.1: Multiple default models and a combined EN NER model
V1.6.1 is a patch of a bug in the Arabic POS tagger.
We also mark Python 3.11 as supported in the setup.py classifiers. This will be the last release that supports Python 3.6.

Multiple model levels
The package parameter for building the Pipeline now has three default settings:
- default, the same as before, where POS, depparse, and NER use the charlm, but lemma does not
- default-fast, where POS and depparse are built without the charlm, making them substantially faster on CPU. Some languages currently have non-charlm NER as well
- default-accurate, where the lemmatizer also uses the charlm, and other models use transformers if we have one for that language. Suggestions for more transformers to use are welcome

Furthermore, package dictionaries are now provided for each UD dataset which encompass the default versions of models for that dataset, although we do not further break that down into -fast and -accurate versions for each UD dataset.

PR: https://github.com/stanfordnlp/stanza/pull/1287
addresses https://github.com/stanfordnlp/stanza/issues/1259 and https://github.com/stanfordnlp/stanza/issues/1284
Multiple output heads for one NER model
The NER models now can learn multiple output layers at once.
https://github.com/stanfordnlp/stanza/pull/1289
Theoretically this could be used to save a bit of time on the encoder while tagging multiple classes at once, but the main use case was to crosstrain the OntoNotes model on the WorldWide English newswire data we collected. The effect is that the model learns to incorporate some named entities from outside the standard OntoNotes vocabulary into the main 18 class tagset, even though the WorldWide training data is only 8 classes.
Results of running the OntoNotes model, with charlm but not transformer, on the OntoNotes and WorldWide test sets:
We also produced combined models for nocharlm and with Electra as the input encoding. The new English NER models are the packages ontonotes-combined_nocharlm, ontonotes-combined_charlm, and ontonotes-combined_electra-large.

Future plans include using multiple NER datasets for other models as well.
Other features
Postprocessing of proposed tokenization is possible with dependency injection on the Pipeline (ty @Jemoka). When creating a Pipeline, you can now provide a callable via the tokenize_postprocessor parameter, and it can adjust the candidate list of tokens to change the tokenization used by the rest of the Pipeline https://github.com/stanfordnlp/stanza/pull/1290
Finetuning for transformers in the NER models: have not yet found helpful settings, though stanfordnlp/stanza@45ef544
SE and SME should both represent Northern Sami, a weird case where UD didn't use the standard 2 letter code https://github.com/stanfordnlp/stanza/issues/1279 stanfordnlp/stanza@88cd0df
charlm for PT (improves accuracy on non-transformer models): stanfordnlp/stanza@c10763d
build models with transformers for a few additional languages: MR, AR, PT, JA stanfordnlp/stanza@45b3875 stanfordnlp/stanza@0f3761e stanfordnlp/stanza@c55472a stanfordnlp/stanza@c10763d
Bugfixes
V1.6.1 fixes a bug in the Arabic POS model which was an unfortunate side effect of the NER change to allow multiple tag sets at once: stanfordnlp/stanza@b56f442
Scenegraph CoreNLP connection needed to be checked before sending messages: https://github.com/stanfordnlp/CoreNLP/issues/1346#issuecomment-1713267522 stanfordnlp/stanza@c71bf3f
run_ete.py was not correctly processing the charlm, meaning the whole thing wouldn't actually run stanfordnlp/stanza@16f29f3
Chinese NER model was pointing to the wrong pretrain https://github.com/stanfordnlp/stanza/issues/1285 stanfordnlp/stanza@82a0215
v1.6.0: Multiple default models and a combined EN NER model
Multiple model levels
The package parameter for building the Pipeline now has three default settings:
- default, the same as before, where POS, depparse, and NER use the charlm, but lemma does not
- default-fast, where POS and depparse are built without the charlm, making them substantially faster on CPU. Some languages currently have non-charlm NER as well
- default-accurate, where the lemmatizer also uses the charlm, and other models use transformers if we have one for that language. Suggestions for more transformers to use are welcome

Furthermore, package dictionaries are now provided for each UD dataset which encompass the default versions of models for that dataset, although we do not further break that down into -fast and -accurate versions for each UD dataset.

PR: https://github.com/stanfordnlp/stanza/pull/1287
addresses https://github.com/stanfordnlp/stanza/issues/1259 and https://github.com/stanfordnlp/stanza/issues/1284
Multiple output heads for one NER model
The NER models now can learn multiple output layers at once.
https://github.com/stanfordnlp/stanza/pull/1289
Theoretically this could be used to save a bit of time on the encoder while tagging multiple classes at once, but the main use case was to crosstrain the OntoNotes model on the WorldWide English newswire data we collected. The effect is that the model learns to incorporate some named entities from outside the standard OntoNotes vocabulary into the main 18 class tagset, even though the WorldWide training data is only 8 classes.
Results of running the OntoNotes model, with charlm but not transformer, on the OntoNotes and WorldWide test sets:
We also produced combined models for nocharlm and with Electra as the input encoding. The new English NER models are the packages ontonotes-combined_nocharlm, ontonotes-combined_charlm, and ontonotes-combined_electra-large.

Future plans include using multiple NER datasets for other models as well.
Other features
Postprocessing of proposed tokenization is possible with dependency injection on the Pipeline (ty @Jemoka). When creating a Pipeline, you can now provide a callable via the tokenize_postprocessor parameter, and it can adjust the candidate list of tokens to change the tokenization used by the rest of the Pipeline https://github.com/stanfordnlp/stanza/pull/1290
Finetuning for transformers in the NER models: have not yet found helpful settings, though stanfordnlp/stanza@45ef544
SE and SME should both represent Northern Sami, a weird case where UD didn't use the standard 2 letter code https://github.com/stanfordnlp/stanza/issues/1279 stanfordnlp/stanza@88cd0df
charlm for PT (improves accuracy on non-transformer models): stanfordnlp/stanza@c10763d
build models with transformers for a few additional languages: MR, AR, PT, JA stanfordnlp/stanza@45b3875 stanfordnlp/stanza@0f3761e stanfordnlp/stanza@c55472a stanfordnlp/stanza@c10763d
Bugfixes
Scenegraph CoreNLP connection needed to be checked before sending messages: https://github.com/stanfordnlp/CoreNLP/issues/1346#issuecomment-1713267522 stanfordnlp/stanza@c71bf3f
run_ete.py was not correctly processing the charlm, meaning the whole thing wouldn't actually run stanfordnlp/stanza@16f29f3
Chinese NER model was pointing to the wrong pretrain https://github.com/stanfordnlp/stanza/issues/1285 stanfordnlp/stanza@82a0215
v1.5.1: charlm & transformer integration in depparse
Features
depparse can have transformer as an embedding https://github.com/stanfordnlp/stanza/pull/1282/commits/ee171cd167900fbaac16ff4b1f2fbd1a6e97de0a
Lemmatizer can remember word,pos it has seen before with a flag https://github.com/stanfordnlp/stanza/issues/1263 stanfordnlp/stanza@a87ffd0
Scoring scripts for Flair and spaCy NER models (requires the appropriate packages, of course) https://github.com/stanfordnlp/stanza/pull/1282/commits/63dc212b467cd549039392743a0be493cc9bc9d8 https://github.com/stanfordnlp/stanza/pull/1282/commits/c42aed569f9d376e71708b28b0fe5b478697ba05 https://github.com/stanfordnlp/stanza/pull/1282/commits/eab062341480e055f93787d490ff31d923a68398
SceneGraph connection for the CoreNLP client https://github.com/stanfordnlp/stanza/pull/1282/commits/d21a95cc90443ec4737de6d7ba68a106d12fb285
Update constituency parser to reduce the learning rate on plateau. Fiddling with the learning rates significantly improves performance https://github.com/stanfordnlp/stanza/pull/1282/commits/f753a4f35b7c2cf7e8e6b01da3a60f73493178e1
Tokenize [] based on () rules if the original dataset doesn't have [] in it https://github.com/stanfordnlp/stanza/pull/1282/commits/063b4ba3c6ce2075655a70e54c434af4ce7ac3a9
Attempt to finetune the charlm when building models (have not found effective settings for this yet) https://github.com/stanfordnlp/stanza/pull/1282/commits/048fdc9c9947a154d4426007301d63d920e60db0
Add the charlm to the lemmatizer - this will not be the default, since it is slower, but it is more accurate https://github.com/stanfordnlp/stanza/pull/1282/commits/e811f52b4cf88d985e7dbbd499fe30dbf2e76d8d https://github.com/stanfordnlp/stanza/pull/1282/commits/66add6d519deb54ca9be5fe3148023a5d7d815e4 https://github.com/stanfordnlp/stanza/pull/1282/commits/f086de2359cce16ef2718c0e6e3b5deef1345c74
Bugfixes
Forgot to include the lemmatizer in CoreNLP 4.5.3, now in 4.5.4 stanfordnlp/stanza@4dda14b https://github.com/bjascob/LemmInflect/issues/14#issuecomment-1470954013
prepare_ner_dataset was always creating an Armenian pipeline, even for non-Armenian languages stanfordnlp/stanza@78ff85c
Fix an empty bulk_process throwing an exception https://github.com/stanfordnlp/stanza/pull/1282/commits/5e2d15d1aa59e4a1fee8bba1de60c09ba21bf53e https://github.com/stanfordnlp/stanza/issues/1278
Unroll the recursion in the Tarjan part of the Chuliu-Edmonds algorithm - should remove stack overflow errors https://github.com/stanfordnlp/stanza/pull/1282/commits/e0917b0967ba9752fdf489b86f9bfd19186c38eb
Minor updates
Put NER and POS scores on one line to make it easier to grep for: stanfordnlp/stanza@da2ae33 stanfordnlp/stanza@8c4cb04
Switch all pretrains to use a name which indicates their source, rather than the dataset they are used for: https://github.com/stanfordnlp/stanza/pull/1282/commits/d1c68ed01276b3cf1455d497057fbc0b82da49e5 and many others
Pipeline uses torch.no_grad() for a slight speed boost https://github.com/stanfordnlp/stanza/pull/1282/commits/36ab82edfc574d46698c5352e07d2fcb0d68d3b3
Generalize save names, which eventually allows for putting transformer, charlm, or nocharlm in the save name - this lets us distinguish different complexities of model https://github.com/stanfordnlp/stanza/pull/1282/commits/cc0845826973576d8d8ed279274e6509250c9ad5 for constituency, and others for the other models
Add the model's flags to the --help for the run scripts, such as https://github.com/stanfordnlp/stanza/pull/1282/commits/83c0901c6ca2827224e156477e42e403d330a16e https://github.com/stanfordnlp/stanza/pull/1282/commits/7c171dd8d066c6973a8ee18a016b65f62376ea4c https://github.com/stanfordnlp/stanza/pull/1282/commits/8e1d112bee42f2211f5153fcc89083b97e3d2600
Remove the dependency on six https://github.com/stanfordnlp/stanza/pull/1282/commits/6daf97142ebc94cca7114a8cda5a20bf66f7f707 (thank you @BLKSerene)

New Models
VLSP constituency stanfordnlp/stanza@500435d
VLSP constituency -> tagging stanfordnlp/stanza@cb0f22d
CTB 5.1 constituency https://github.com/stanfordnlp/stanza/pull/1282/commits/f2ef62b96c79fcaf0b8aa70e4662d33b26dadf31
Add support for CTB 9.0, although those models are not distributed yet https://github.com/stanfordnlp/stanza/pull/1282/commits/1e3ea8a10b2e485bc7c79c6ab41d1f1dd8c2022f
Added an Indonesian charlm
Indonesian constituency from ICON treebank https://github.com/stanfordnlp/stanza/pull/1218
All languages with pretrained charlms now have an option to use that charlm for dependency parsing
French combined models out of GSD, ParisStories, Rhapsodie, and Sequoia https://github.com/stanfordnlp/stanza/pull/1282/commits/ba64d37d3bf21af34373152e92c9f01241e27d8b
UD 2.12 support https://github.com/stanfordnlp/stanza/pull/1282/commits/4f987d2cd708ce4ca27935d347bb5b5d28a78058
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.