John Snow Labs Spark-NLP 3.3.0: New ALBERT, XLNet, RoBERTa, XLM-RoBERTa, and Longformer for Token Classification, up to 50x faster saving of models, new ways to discover pretrained models and pipelines, new state-of-the-art models, and lots more! #6194

maziyarpanahi · 2021-09-29T15:04:19Z

maziyarpanahi
Sep 29, 2021
Maintainer

Overview

We are very excited to release Spark NLP 🚀 3.3.0! This release comes with new ALBERT, XLNet, RoBERTa, XLM-RoBERTa, and Longformer annotators for existing or fine-tuned Token Classification models from HuggingFace 🤗, up to 50x faster saving of Spark NLP models & pipelines, no more 2 GB limit on the size of imported TensorFlow models, many new functions to filter and display pretrained models & pipelines inside Spark NLP, bug fixes, and more!

We are proud to say Spark NLP 3.3.0 is still compatible across all major releases of Apache Spark used locally, by all Cloud providers such as EMR, and all managed services such as Databricks. The major releases of Apache Spark include Apache Spark 3.0.x/3.1.x (spark-nlp), Apache Spark 2.4.x (spark-nlp-spark24), and Apache Spark 2.3.x (spark-nlp-spark23).

As always, we would like to thank our community for their feedback, questions, and feature requests.

Major features and improvements

NEW: Starting with the Spark NLP 3.3.0 release, there is no size limit when you import TensorFlow models! You can now import TF Hub & HuggingFace models larger than 2 gigabytes.
NEW: Up to 50x faster saving Spark NLP models and pipelines! We have improved the way we package TensorFlow SavedModel while saving Spark NLP models & pipelines. For instance, it used to take up to 10 minutes to save the xlm_roberta_base model before Spark NLP 3.3.0, and now it only takes up to 15 seconds!
NEW: Introducing AlbertForTokenClassification annotator in Spark NLP 🚀. AlbertForTokenClassification can load ALBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using AlbertForTokenClassification or TFAlbertForTokenClassification in HuggingFace 🤗
NEW: Introducing XlnetForTokenClassification annotator in Spark NLP 🚀. XlnetForTokenClassification can load XLNet Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using XLNetForTokenClassification or TFXLNetForTokenClassification in HuggingFace 🤗
NEW: Introducing RoBertaForTokenClassification annotator in Spark NLP 🚀. RoBertaForTokenClassification can load RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using RobertaForTokenClassification or TFRobertaForTokenClassification in HuggingFace 🤗
NEW: Introducing XlmRoBertaForTokenClassification annotator in Spark NLP 🚀. XlmRoBertaForTokenClassification can load XLM-RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using XLMRobertaForTokenClassification or TFXLMRobertaForTokenClassification in HuggingFace 🤗
NEW: Introducing LongformerForTokenClassification annotator in Spark NLP 🚀. LongformerForTokenClassification can load Longformer Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using LongformerForTokenClassification or TFLongformerForTokenClassification in HuggingFace 🤗
NEW: Introducing new ResourceDownloader functions to easily look for pretrained models & pipelines inside Spark NLP (Python and Scala). You can filter models or pipelines by language, version, or annotator name:

from sparknlp.pretrained import *

# display and filter all available pretrained pipelines
ResourceDownloader.showPublicPipelines()
ResourceDownloader.showPublicPipelines(lang="en")
ResourceDownloader.showPublicPipelines(lang="en", version="3.2.0")

# display and filter all available pretrained models
ResourceDownloader.showPublicModels()
ResourceDownloader.showPublicModels("NerDLModel", version="3.2.0")
ResourceDownloader.showPublicModels("NerDLModel", "en")
ResourceDownloader.showPublicModels("XlmRoBertaEmbeddings", "xx")
+--------------------------+------+---------+
| Model                    | lang | version |
+--------------------------+------+---------+
| xlm_roberta_base         |  xx  | 3.1.0   |
| twitter_xlm_roberta_base |  xx  | 3.1.0   |
| xlm_roberta_xtreme_base  |  xx  | 3.1.3   |
| xlm_roberta_large        |  xx  | 3.3.0   |
+--------------------------+------+---------+

# remove all the downloaded models & pipelines to free up storage
ResourceDownloader.clearCache()

# display all available annotators that can be saved as a Model
ResourceDownloader.showAvailableAnnotators()

Welcoming Databricks Runtime 9.1 LTS, 9.1 ML, and 9.1 ML with GPU
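
As a rough illustration of the import-and-save flow described in the first two items above, the sketch below loads an exported TensorFlow SavedModel into one of the new annotators and writes it out as a Spark NLP model. It assumes Spark NLP 3.3.0 and PySpark are installed, and that the model was already exported from HuggingFace with its tokenizer and label files placed in the SavedModel's assets folder. The function name and both paths are hypothetical and not part of Spark NLP.

```python
def import_and_save_token_classifier(spark, saved_model_dir, output_dir):
    """Load an exported TF SavedModel into Spark NLP and save it to disk.

    Assumptions (not from the release notes): `saved_model_dir` points at a
    SavedModel exported from HuggingFace whose assets folder already holds
    the tokenizer and label files; `output_dir` is any writable path.
    """
    # Imported lazily so this file can be loaded without Spark NLP installed.
    from sparknlp.annotator import RoBertaForTokenClassification

    token_classifier = (
        RoBertaForTokenClassification.loadSavedModel(saved_model_dir, spark)
        .setInputCols(["document", "token"])
        .setOutputCol("ner")
    )
    # Saving is the step the release notes describe as up to 50x faster.
    token_classifier.write().overwrite().save(output_dir)
    return token_classifier
```

With a session created by `sparknlp.start()`, this could be called as `import_and_save_token_classifier(spark, "/tmp/my_model/saved_model/1", "/tmp/my_model_spark_nlp")`, where both paths are placeholders. The saved folder can later be read back with `RoBertaForTokenClassification.load(...)`.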

Bug Fixes

Fix a bug in RoBertaEmbeddings when all special tokens were identical
Fix a bug in RoBertaEmbeddings when a special token contained valid regex
Fix a bug that led to a memory leak inside the NorvigSweeting spell checker. For some inputs, this caused problems in pretrained pipelines such as explain_document_ml and explain_document_dl
Fix the wrong types being assigned to minCount and classCount in Python for ContextSpellCheckerApproach annotator
Fix explain_document_ml pretrained pipeline for Spark NLP 3.x on Apache Spark 2.x
Fix WordSegmenterModel wordseg_best model for Thai language
Fix WordSegmenterModel wordseg_large model for Chinese language

Models and Pipelines

Spark NLP 3.3.0 comes with:

New ALBERT, RoBERTa, XLNet, and XLM-RoBERTa models for Token Classification
New XLM-RoBERTa models in Luganda, Kinyarwanda, Igbo, Hausa, and Amharic languages

New Transformer Models

Model	Name	Build	Lang
RoBertaForTokenClassification	roberta_large_token_classifier_ontonotes	3.3.0	`en`
RoBertaForTokenClassification	roberta_large_token_classifier_conll03	3.3.0	`en`
RoBertaForTokenClassification	roberta_base_token_classifier_ontonotes	3.3.0	`en`
RoBertaForTokenClassification	roberta_base_token_classifier_conll03	3.3.0	`en`
RoBertaForTokenClassification	distilroberta_base_token_classifier_ontonotes	3.3.0	`en`
RoBertaForTokenClassification	roberta_token_classifier_zwnj_base_ner	3.3.0	`fa`
XlmRoBertaForTokenClassification	xlm_roberta_token_classifier_ner_40_lang	3.3.0	`xx`
AlbertForTokenClassification	albert_xlarge_token_classifier_conll03	3.3.0	`en`
AlbertForTokenClassification	albert_large_token_classifier_conll03	3.3.0	`en`
AlbertForTokenClassification	albert_base_token_classifier_conll03	3.3.0	`en`
XlnetForTokenClassification	xlnet_large_token_classifier_conll03	3.3.0	`en`
XlnetForTokenClassification	xlnet_base_token_classifier_conll03	3.3.0	`en`
XlmRoBertaEmbeddings	xlm_roberta_large	3.3.0	`xx`
XlmRoBertaEmbeddings	xlm_roberta_base_finetuned_luganda	3.3.0	`lg`
XlmRoBertaEmbeddings	xlm_roberta_base_finetuned_kinyarwanda	3.3.0	`rw`
XlmRoBertaEmbeddings	xlm_roberta_base_finetuned_igbo	3.3.0	`ig`
XlmRoBertaEmbeddings	xlm_roberta_base_finetuned_hausa	3.3.0	`ha`
XlmRoBertaEmbeddings	xlm_roberta_base_finetuned_amharic	3.3.0	`am`

The complete list of all 3700+ models & pipelines in 200+ languages is available on Models Hub.
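
The models in the table above are loaded by name through the matching annotator. The following is a minimal sketch, assuming Spark NLP 3.3.0 and PySpark are installed and a session has been started; the helper names `build_ner_pipeline` and `tag_sentences` are hypothetical, while the model name and language code are taken from the table.

```python
def build_ner_pipeline(model_name="roberta_base_token_classifier_conll03", lang="en"):
    """Assemble a document -> token -> NER pipeline around a pretrained model."""
    # Imported lazily so this file can be loaded without Spark NLP installed.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, RoBertaForTokenClassification

    document_assembler = (
        DocumentAssembler().setInputCol("text").setOutputCol("document")
    )
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    # The annotator class must match the "Model" column of the chosen row.
    token_classifier = (
        RoBertaForTokenClassification.pretrained(model_name, lang)
        .setInputCols(["document", "token"])
        .setOutputCol("ner")
    )
    return Pipeline(stages=[document_assembler, tokenizer, token_classifier])


def tag_sentences(spark, sentences):
    """Run the pipeline over plain strings and return text plus predicted labels."""
    data = spark.createDataFrame([[s] for s in sentences]).toDF("text")
    model = build_ner_pipeline().fit(data)
    return model.transform(data).select("text", "ner.result")
```

After `spark = sparknlp.start()`, calling `tag_sentences(spark, ["John Snow Labs is based in Delaware."]).show(truncate=False)` would download the model on first use and show the predicted labels next to each sentence. Swapping in another row of the table means changing both the annotator class and the model name.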

New Notebooks

Import hundreds of models in different languages to Spark NLP

Spark NLP	HuggingFace Notebooks	Colab
AlbertForTokenClassification	HuggingFace in Spark NLP - AlbertForTokenClassification
RoBertaForTokenClassification	HuggingFace in Spark NLP - RoBertaForTokenClassification
XlmRoBertaForTokenClassification	HuggingFace in Spark NLP - XlmRoBertaForTokenClassification

Documentation

TF Hub & HuggingFace to Spark NLP
Models Hub with new models
Spark NLP documentation
Spark NLP Scala APIs
Spark NLP Python APIs
Spark NLP Workshop notebooks
Spark NLP publications
Spark NLP in Action
Spark NLP training certification notebooks for Google Colab and Databricks
Spark NLP Display for visualization of different types of annotations
Discussions: engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

#PyPI

pip install spark-nlp==3.3.0

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.0

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.0

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.0
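
The package names above follow a regular pattern: an optional `-gpu` part, an optional Spark suffix, and the Scala version. The small helper below is only an illustrative sketch of that pattern for scripts that must pick a coordinate at run time; `spark_nlp_package` and its lookup table are hypothetical and not part of Spark NLP.

```python
SPARK_NLP_VERSION = "3.3.0"

# (artifact suffix, Scala version) for each supported Apache Spark line,
# copied from the package names listed in this section.
_SPARK_LINES = {
    "3.0": ("", "2.12"),
    "3.1": ("", "2.12"),
    "2.4": ("-spark24", "2.11"),
    "2.3": ("-spark23", "2.11"),
}


def spark_nlp_package(spark_version, gpu=False):
    """Return the --packages coordinate for a Spark version such as '3.1.2'."""
    major_minor = ".".join(spark_version.split(".")[:2])
    if major_minor not in _SPARK_LINES:
        raise ValueError(
            f"Spark NLP {SPARK_NLP_VERSION} has no build for Spark {spark_version}"
        )
    suffix, scala = _SPARK_LINES[major_minor]
    artifact = "spark-nlp" + ("-gpu" if gpu else "") + suffix
    return f"com.johnsnowlabs.nlp:{artifact}_{scala}:{SPARK_NLP_VERSION}"


print(spark_nlp_package("3.1.2"))
# com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.0
print(spark_nlp_package("2.4.8", gpu=True))
# com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.0
```

The returned string can be passed to `spark-shell --packages` or `pyspark --packages` exactly as in the commands above.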

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>3.3.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>3.3.0</version>
</dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark24_2.11</artifactId>
    <version>3.3.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
    <version>3.3.0</version>
</dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark23_2.11</artifactId>
    <version>3.3.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
    <version>3.3.0</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.3.0.jar
GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.3.0.jar
CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.3.0.jar
GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.3.0.jar
CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.3.0.jar
GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.3.0.jar
