
fix(deps): update dependency stanza to v1.9.2 #145

Open · wants to merge 1 commit into main
Conversation

renovate[bot] (Contributor) commented Sep 9, 2023

This PR contains the following updates:

Package: stanza
Change: 1.5.0 -> 1.9.2

Release Notes

stanfordnlp/stanza (stanza)

v1.9.2: Multilingual Coref

Compare Source

multilingual coref!

new features

new models

bugfixes

v1.9.1: Multilingual Coref

Compare Source

multilingual coref!

new features

new models

bugfixes

v1.9.0: Multilingual Coref

Compare Source

multilingual coref!

new features

new models

bugfixes

v1.8.2: Old English, MWT improvements, and better memory management of Peft

Compare Source

Add an Old English pipeline, improve the handling of MWT for cases that should be easy, and improve the memory management of our usage of transformers with adapters.

Old English

MWT improvements

Peft memory management

Other bugfixes and minor upgrades

Other upgrades

v1.8.1: PEFT Integration (with bugfixes)

Compare Source

Integrating PEFT into several different annotators

We integrate PEFT into our training pipeline for several different models. This greatly reduces the size of models with finetuned transformers, letting us make the finetuned versions of those models the default_accurate model.

The biggest gains observed are with the constituency parser and the sentiment classifier.

Previously, the default_accurate package used transformers where the head was trained but the transformer itself was not finetuned.
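The size reduction from PEFT can be sketched with a back-of-the-envelope calculation: a LoRA adapter stores two low-rank factors instead of a full finetuned copy of each weight matrix. The dimensions below are hypothetical, not stanza's actual model sizes.

```python
# Why LoRA/PEFT shrinks finetuned checkpoints: instead of saving a finetuned
# copy of a d_in x d_out weight matrix, only two low-rank factors
# A (d_in x r) and B (r x d_out) are stored.
d_in, d_out, rank = 1024, 1024, 8

full_params = d_in * d_out                 # a fully finetuned weight matrix
lora_params = d_in * rank + rank * d_out   # the LoRA factors alone

print(full_params)                 # 1048576
print(lora_params)                 # 16384
print(full_params // lora_params)  # 64x fewer parameters to save
```

The same arithmetic explains why the finetuned variants became cheap enough to ship as the default_accurate package.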

Model improvements
Features
Bugfixes
Additional 1.8.1 Bugfixes

v1.8.0: PEFT integration

Compare Source

Integrating PEFT into several different annotators

We integrate PEFT into our training pipeline for several different models. This greatly reduces the size of models with finetuned transformers, letting us make the finetuned versions of those models the default_accurate model.

The biggest gains observed are with the constituency parser and the sentiment classifier.

Previously, the default_accurate package used transformers where the head was trained but the transformer itself was not finetuned.

Model improvements
Features
Bugfixes

v1.7.0: Neural coref!

Compare Source

Neural coref processor added!

Conjunction-Aware Word-Level Coreference Resolution
https://arxiv.org/abs/2310.06165
original implementation: https://github.com/KarelDO/wl-coref/tree/master

Updated form of Word-Level Coreference Resolution
https://aclanthology.org/2021.emnlp-main.605/
original implementation: https://github.com/vdobrovolskii/wl-coref

If you use Stanza's coref module in your work, please be sure to cite both of the above papers.

Special thanks to vdobrovolskii, who graciously agreed to allow integration of his work into Stanza, to @KarelDO for supporting the training enhancement, and to @Jemoka for the LoRA PEFT integration, which makes finetuning the transformer-based coref annotator much less expensive.

Currently there is one model provided, a transformer based English model trained from OntoNotes. The provided model is currently based on Electra-Large, as that is more harmonious with the rest of our transformer architecture. When we have LoRA integration with POS, depparse, and the other processors, we will revisit the question of which transformer is most appropriate for English.
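Coreference output is essentially a set of mention clusters over the document's words. A minimal stdlib sketch of that data shape (hypothetical structures, not stanza's actual coref output format):

```python
# Each coref chain groups mention spans that refer to the same entity.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class CorefChain:
    mentions: List[Tuple[int, int]]  # (start_word, end_word) spans, inclusive

words = ["Jane", "said", "she", "would", "visit", "her", "mother"]
chain = CorefChain(mentions=[(0, 0), (2, 2), (5, 5)])  # Jane / she / her

mention_texts = [" ".join(words[s:e + 1]) for s, e in chain.mentions]
print(mention_texts)  # ['Jane', 'she', 'her']
```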

Future work includes ZH and AR models from OntoNotes, additional language support from UD-Coref, and lower-cost non-transformer models.

https://github.com/stanfordnlp/stanza/pull/1309

Interface change: English MWT

English now has an MWT model by default. Text such as `won't` is now marked as a single token, split into two words, `will` and `not`. Previously it was tokenized into two pieces, but the Sentence object containing that text had no single Token object connecting them. See https://stanfordnlp.github.io/stanza/mwt.html and https://stanfordnlp.github.io/stanza/data_objects.html#token for more information.

Code that used to operate with `for word in sentence.words` will continue to work as before, but `for token in sentence.tokens` will now produce one object for MWTs such as `won't`, `cannot`, `Stanza's`, etc.

Pipeline creation does not change, as MWT is automatically (but not silently) added at Pipeline creation time if the language and package include MWT.
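The new token/word relationship can be illustrated with minimal stand-in classes (hypothetical, not stanza's real data objects): one Token now covers the several Words it was split into.

```python
# Minimal stand-ins for stanza's Token/Word objects, showing the MWT change:
# one Token ("won't") now wraps the two Words it splits into.
from dataclasses import dataclass
from typing import List

@dataclass
class Word:
    text: str

@dataclass
class Token:
    text: str
    words: List[Word]

@dataclass
class Sentence:
    tokens: List[Token]

    @property
    def words(self) -> List[Word]:
        return [w for t in self.tokens for w in t.words]

sent = Sentence(tokens=[
    Token("I", [Word("I")]),
    Token("won't", [Word("will"), Word("not")]),  # one token, two words
    Token("go", [Word("go")]),
])

print([t.text for t in sent.tokens])  # ['I', "won't", 'go']
print([w.text for w in sent.words])   # ['I', 'will', 'not', 'go']
```

Iterating words still yields the split pieces, so word-level code keeps working; only token-level code sees the combined MWT objects.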

https://github.com/stanfordnlp/stanza/pull/1314/commits/f22dceb93275fc724536b03b31c08a94617880ca https://github.com/stanfordnlp/stanza/pull/1314/commits/27983aefe191f6abd93dd49915d2515d7c3973d1

Other updates

Updated requirements

  • Support dropped for Python 3.6 and 3.7. The peft module used for finetuning the transformer in the coref processor does not support those versions.
  • Added peft as an optional dependency to transformer based installations
  • Added networkx as a dependency for reading enhanced dependencies. Added toml as a dependency for reading the coref config.

v1.6.1: Multiple default models and a combined EN NER model

Compare Source

v1.6.1 patches a bug in the Arabic POS tagger.

We also mark Python 3.11 as supported in the setup.py classifiers. This will be the last release that supports Python 3.6.

Multiple model levels

The package parameter for building the Pipeline now has three default settings:

  • default, the same as before, where POS, depparse, and NER use the charlm, but lemma does not
  • default-fast, where POS and depparse are built without the charlm, making them substantially faster on CPU. Some languages currently have non-charlm NER as well
  • default-accurate, where the lemmatizer also uses the charlm, and other models use transformers if we have one for that language. Suggestions for more transformers to use are welcome

Furthermore, package dictionaries are now provided for each UD dataset which encompass the default versions of models for that dataset, although we do not further break that down into -fast and -accurate versions for each UD dataset.
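The three package levels above amount to a lookup from level to per-processor model variant. A hypothetical sketch of that mapping (the real selection logic lives inside stanza's Pipeline, and default-accurate only uses a transformer when one exists for the language):

```python
# Hypothetical table of which embedding variant each processor uses per
# package level; illustrative only, not stanza's internal resolution code.
PACKAGE_LEVELS = {
    "default":          {"pos": "charlm", "depparse": "charlm", "ner": "charlm", "lemma": "nocharlm"},
    "default-fast":     {"pos": "nocharlm", "depparse": "nocharlm", "ner": "nocharlm", "lemma": "nocharlm"},
    "default-accurate": {"pos": "transformer", "depparse": "transformer", "ner": "transformer", "lemma": "charlm"},
}

def model_variant(package: str, processor: str) -> str:
    """Return which embedding variant a processor would use for a package level."""
    return PACKAGE_LEVELS[package][processor]

print(model_variant("default", "lemma"))           # nocharlm
print(model_variant("default-accurate", "lemma"))  # charlm
```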

PR: https://github.com/stanfordnlp/stanza/pull/1287

addresses https://github.com/stanfordnlp/stanza/issues/1259 and https://github.com/stanfordnlp/stanza/issues/1284

Multiple output heads for one NER model

The NER models now can learn multiple output layers at once.

https://github.com/stanfordnlp/stanza/pull/1289

Theoretically this could be used to save a bit of time on the encoder while tagging multiple classes at once, but the main use case was to crosstrain the OntoNotes model on the WorldWide English newswire data we collected. The effect is that the model learns to incorporate some named entities from outside the standard OntoNotes vocabulary into the main 18 class tagset, even though the WorldWide training data is only 8 classes.
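The shared-encoder, multiple-heads idea can be sketched in pure Python (a conceptual illustration, not stanza's implementation): the expensive encoding runs once, and each head maps it into its own tagset.

```python
# One shared encoder feeding several NER output heads, so multiple tagsets
# are predicted from a single pass over the expensive encoder.

def encode(tokens):
    # deterministic stand-in for the expensive shared encoder
    return [sum(map(ord, t)) % 97 for t in tokens]

def make_head(tagset):
    # each head maps the shared encoding into its own tagset
    def head(encoded):
        return [tagset[h % len(tagset)] for h in encoded]
    return head

heads = {
    "ontonotes": make_head(["O", "PERSON", "ORG", "GPE"]),  # 18 classes in the real model
    "worldwide": make_head(["O", "PER", "ORG"]),            # 8 classes in the real data
}

encoded = encode(["Stanford", "University"])  # encoder runs only once
tags = {name: head(encoded) for name, head in heads.items()}
```

Crosstraining then lets the 18-class head absorb entity types seen only in the 8-class WorldWide data.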

Results of running the OntoNotes model, with charlm but not transformer, on the OntoNotes and WorldWide test sets:

                     OntoNotes  WorldWide
original ontonotes       88.71      69.29
simplify-separate        88.24      75.75
simplify-connected       88.32      75.47

We also produced combined models for nocharlm and with Electra as the input encoding. The new English NER models are the packages ontonotes-combined_nocharlm, ontonotes-combined_charlm, and ontonotes-combined_electra-large.

Future plans include using multiple NER datasets for other models as well.

Other features

Bugfixes

v1.6.0: Multiple default models and a combined EN NER model

Compare Source

Multiple model levels

The package parameter for building the Pipeline now has three default settings:

  • default, the same as before, where POS, depparse, and NER use the charlm, but lemma does not
  • default-fast, where POS and depparse are built without the charlm, making them substantially faster on CPU. Some languages currently have non-charlm NER as well
  • default-accurate, where the lemmatizer also uses the charlm, and other models use transformers if we have one for that language. Suggestions for more transformers to use are welcome

Furthermore, package dictionaries are now provided for each UD dataset which encompass the default versions of models for that dataset, although we do not further break that down into -fast and -accurate versions for each UD dataset.

PR: https://github.com/stanfordnlp/stanza/pull/1287

addresses https://github.com/stanfordnlp/stanza/issues/1259 and https://github.com/stanfordnlp/stanza/issues/1284

Multiple output heads for one NER model

The NER models now can learn multiple output layers at once.

https://github.com/stanfordnlp/stanza/pull/1289

Theoretically this could be used to save a bit of time on the encoder while tagging multiple classes at once, but the main use case was to crosstrain the OntoNotes model on the WorldWide English newswire data we collected. The effect is that the model learns to incorporate some named entities from outside the standard OntoNotes vocabulary into the main 18 class tagset, even though the WorldWide training data is only 8 classes.

Results of running the OntoNotes model, with charlm but not transformer, on the OntoNotes and WorldWide test sets:

                     OntoNotes  WorldWide
original ontonotes       88.71      69.29
simplify-separate        88.24      75.75
simplify-connected       88.32      75.47

We also produced combined models for nocharlm and with Electra as the input encoding. The new English NER models are the packages ontonotes-combined_nocharlm, ontonotes-combined_charlm, and ontonotes-combined_electra-large.

Future plans include using multiple NER datasets for other models as well.

Other features

Bugfixes

v1.5.1: charlm & transformer integration in depparse

Compare Source

Features

depparse can have transformer as an embedding https://github.com/stanfordnlp/stanza/pull/1282/commits/ee171cd167900fbaac16ff4b1f2fbd1a6e97de0a

Lemmatizer can remember word,pos it has seen before with a flag https://github.com/stanfordnlp/stanza/issues/1263 stanfordnlp/stanza@a87ffd0
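The idea behind remembering previously seen (word, pos) pairs is a simple memoized lookup; a stdlib sketch with a hypothetical stand-in function, not stanza's Lemmatizer API:

```python
# A (word, pos)-keyed lemma cache: repeated inputs skip the expensive model.
from functools import lru_cache

@lru_cache(maxsize=None)
def lemmatize(word: str, pos: str) -> str:
    w = word.lower()
    # stand-in rule for the expensive seq2seq lemmatizer
    if pos == "NOUN" and w.endswith("s"):
        return w[:-1]
    return w

print(lemmatize("Cats", "NOUN"))    # cat  (computed)
print(lemmatize("Cats", "NOUN"))    # cat  (served from the cache)
print(lemmatize.cache_info().hits)  # 1
```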

Scoring scripts for Flair and spaCy NER models (requires the appropriate packages, of course) https://github.com/stanfordnlp/stanza/pull/1282/commits/63dc212b467cd549039392743a0be493cc9bc9d8 https://github.com/stanfordnlp/stanza/pull/1282/commits/c42aed569f9d376e71708b28b0fe5b478697ba05 https://github.com/stanfordnlp/stanza/pull/1282/commits/eab062341480e055f93787d490ff31d923a68398

SceneGraph connection for the CoreNLP client https://github.com/stanfordnlp/stanza/pull/1282/commits/d21a95cc90443ec4737de6d7ba68a106d12fb285

Update constituency parser to reduce the learning rate on plateau. Fiddling with the learning rates significantly improves performance https://github.com/stanfordnlp/stanza/pull/1282/commits/f753a4f35b7c2cf7e8e6b01da3a60f73493178e1

Tokenize [] based on () rules if the original dataset doesn't have [] in it https://github.com/stanfordnlp/stanza/pull/1282/commits/063b4ba3c6ce2075655a70e54c434af4ce7ac3a9

Attempt to finetune the charlm when building models (have not found effective settings for this yet) https://github.com/stanfordnlp/stanza/pull/1282/commits/048fdc9c9947a154d4426007301d63d920e60db0

Add the charlm to the lemmatizer - this will not be the default, since it is slower, but it is more accurate https://github.com/stanfordnlp/stanza/pull/1282/commits/e811f52b4cf88d985e7dbbd499fe30dbf2e76d8d https://github.com/stanfordnlp/stanza/pull/1282/commits/66add6d519deb54ca9be5fe3148023a5d7d815e4 https://github.com/stanfordnlp/stanza/pull/1282/commits/f086de2359cce16ef2718c0e6e3b5deef1345c74

Bugfixes

Forgot to include the lemmatizer in CoreNLP 4.5.3, now in 4.5.4 stanfordnlp/stanza@4dda14b https://github.com/bjascob/LemmInflect/issues/14#issuecomment-1470954013

prepare_ner_dataset was always creating an Armenian pipeline, even for non-Armenian languages stanfordnlp/stanza@78ff85c

Fix an empty bulk_process throwing an exception https://github.com/stanfordnlp/stanza/pull/1282/commits/5e2d15d1aa59e4a1fee8bba1de60c09ba21bf53e https://github.com/stanfordnlp/stanza/issues/1278

Unroll the recursion in the Tarjan part of the Chuliu-Edmonds algorithm - should remove stack overflow errors https://github.com/stanfordnlp/stanza/pull/1282/commits/e0917b0967ba9752fdf489b86f9bfd19186c38eb

Minor updates

Put NER and POS scores on one line to make it easier to grep for: stanfordnlp/stanza@da2ae33 stanfordnlp/stanza@8c4cb04

Switch all pretrains to use a name which indicates their source, rather than the dataset they are used for: https://github.com/stanfordnlp/stanza/pull/1282/commits/d1c68ed01276b3cf1455d497057fbc0b82da49e5 and many others

Pipeline uses torch.no_grad() for a slight speed boost https://github.com/stanfordnlp/stanza/pull/1282/commits/36ab82edfc574d46698c5352e07d2fcb0d68d3b3

Generalize save names, which eventually allows for putting transformer, charlm or nocharlm in the save name - this lets us distinguish different complexities of model https://github.com/stanfordnlp/stanza/pull/1282/commits/cc0845826973576d8d8ed279274e6509250c9ad5 for constituency, and others for the other models

Add the model's flags to the --help for the run scripts, such as https://github.com/stanfordnlp/stanza/pull/1282/commits/83c0901c6ca2827224e156477e42e403d330a16e https://github.com/stanfordnlp/stanza/pull/1282/commits/7c171dd8d066c6973a8ee18a016b65f62376ea4c https://github.com/stanfordnlp/stanza/pull/1282/commits/8e1d112bee42f2211f5153fcc89083b97e3d2600

Remove the dependency on six https://github.com/stanfordnlp/stanza/pull/1282/commits/6daf97142ebc94cca7114a8cda5a20bf66f7f707 (thank you @​BLKSerene )

New Models

VLSP constituency stanfordnlp/stanza@500435d

VLSP constituency -> tagging stanfordnlp/stanza@cb0f22d

CTB 5.1 constituency https://github.com/stanfordnlp/stanza/pull/1282/commits/f2ef62b96c79fcaf0b8aa70e4662d33b26dadf31

Add support for CTB 9.0, although those models are not distributed yet https://github.com/stanfordnlp/stanza/pull/1282/commits/1e3ea8a10b2e485bc7c79c6ab41d1f1dd8c2022f

Added an Indonesian charlm

Indonesian constituency from ICON treebank https://github.com/stanfordnlp/stanza/pull/1218

All languages with pretrained charlms now have an option to use that charlm for dependency parsing

French combined models out of GSD, ParisStories, Rhapsodie, and Sequoia https://github.com/stanfordnlp/stanza/pull/1282/commits/ba64d37d3bf21af34373152e92c9f01241e27d8b

UD 2.12 support https://github.com/stanfordnlp/stanza/pull/1282/commits/4f987d2cd708ce4ca27935d347bb5b5d28a78058


Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

@renovate renovate bot force-pushed the renovate/stanza-1.x-lockfile branch from 7dd98d7 to 271568f Compare October 3, 2023 07:52
@renovate renovate bot changed the title fix(deps): update dependency stanza to v1.5.1 fix(deps): update dependency stanza to v1.6.0 Oct 3, 2023
@renovate renovate bot force-pushed the renovate/stanza-1.x-lockfile branch from 271568f to 3b5b7a6 Compare October 6, 2023 07:21
@renovate renovate bot changed the title fix(deps): update dependency stanza to v1.6.0 fix(deps): update dependency stanza to v1.6.1 Oct 6, 2023
@renovate renovate bot force-pushed the renovate/stanza-1.x-lockfile branch from 3b5b7a6 to d7d111e Compare December 3, 2023 07:26
@renovate renovate bot changed the title fix(deps): update dependency stanza to v1.6.1 fix(deps): update dependency stanza to v1.7.0 Dec 3, 2023
@renovate renovate bot force-pushed the renovate/stanza-1.x-lockfile branch from d7d111e to 530b35d Compare February 25, 2024 11:26
@renovate renovate bot changed the title fix(deps): update dependency stanza to v1.7.0 fix(deps): update dependency stanza to v1.8.0 Feb 25, 2024
@renovate renovate bot force-pushed the renovate/stanza-1.x-lockfile branch from 530b35d to 7202919 Compare March 1, 2024 06:58
@renovate renovate bot changed the title fix(deps): update dependency stanza to v1.8.0 fix(deps): update dependency stanza to v1.8.1 Mar 1, 2024
@renovate renovate bot force-pushed the renovate/stanza-1.x-lockfile branch from 7202919 to 9d9207d Compare April 20, 2024 22:31
@renovate renovate bot changed the title fix(deps): update dependency stanza to v1.8.1 fix(deps): update dependency stanza to v1.8.2 Apr 20, 2024
@renovate renovate bot force-pushed the renovate/stanza-1.x-lockfile branch from 9d9207d to 784a3ae Compare September 12, 2024 09:53
@renovate renovate bot changed the title fix(deps): update dependency stanza to v1.8.2 fix(deps): update dependency stanza to v1.9.0 Sep 12, 2024
@renovate renovate bot force-pushed the renovate/stanza-1.x-lockfile branch from 784a3ae to a1abe67 Compare September 12, 2024 22:13
@renovate renovate bot changed the title fix(deps): update dependency stanza to v1.9.0 fix(deps): update dependency stanza to v1.9.1 Sep 12, 2024
@renovate renovate bot force-pushed the renovate/stanza-1.x-lockfile branch from a1abe67 to 9026dc7 Compare September 13, 2024 00:24
@renovate renovate bot changed the title fix(deps): update dependency stanza to v1.9.1 fix(deps): update dependency stanza to v1.9.2 Sep 13, 2024