John Snow Labs Spark-NLP 4.2.0: Wav2Vec2 for Automatic Speech Recognition (ASR), TAPAS for Table Question Answering, CamemBERT for Token Classification, new evaluation metrics for external datasets in all classifiers, much faster EntityRuler, over 3000+ state-of-the-art multi-lingual models & pipelines, and many more! #12842

maziyarpanahi · 2022-09-27T14:29:47Z

maziyarpanahi
Sep 27, 2022
Maintainer

📢 Overview

For the first time ever we are delighted to announce Automatic Speech Recognition (ASR) support in Spark NLP by using state-of-the-art Wav2Vec2 models at scale 🚀. This release also comes with Table Question Answering by TAPAS, CamemBERT for Token Classification, support for an external test dataset during training of all classifiers, much faster EntityRuler, 3000+ state-of-the-art models, and other enhancements and bug fixes!

We are also celebrating crossing 11000+ free and open-source models & pipelines in our Models Hub. 🎉 As always, we would like to thank our community for their feedback, questions, and feature requests.

⭐ New Features & improvements

NEW: Introducing Wav2Vec2ForCTC annotator in Spark NLP 🚀. Wav2Vec2ForCTC can load Wav2Vec2 models for the Automatic Speech Recognition (ASR) task. Wav2Vec2 is a multi-modal model, that combines speech and text. It's the first multi-modal model of its kind we welcome in Spark NLP. This annotator is compatible with all the models trained/fine-tuned by using Wav2Vec2ForCTC for PyTorch or TFWav2Vec2ForCTC for TensorFlow models in HuggingFace 🤗 (Introducing the first Automatic Speech Recognition annotator: Wav2Vec2ForCTC #12767)

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

NEW: Introducing TapasForQuestionAnswering annotator in Spark NLP 🚀. TapasForQuestionAnswering can load TAPAS Models with a cell selection head and optional aggregation head on top for question-answering tasks on tables (linear layers on top of the hidden-states output to compute logits and optional logits_aggregation), e.g. for SQA, WTQ or WikiSQL-supervised tasks. TAPAS is a BERT-based model specifically designed (and pre-trained) for answering questions about tabular data. This annotator is compatible with all the models trained/fine-tuned by using TapasForQuestionAnswering for PyTorch or TFTapasForQuestionAnswering for TensorFlow models in HuggingFace 🤗

TAPAS: Weakly Supervised Table Parsing via Pre-training

NEW: Introducing CamemBertForTokenClassification annotator in Spark NLP 🚀. CamemBertForTokenClassification can load CamemBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using CamembertForTokenClassification for PyTorch or TFCamembertForTokenClassification for TensorFlow in HuggingFace 🤗
(Introducing CamemBERT for Token Classification #12752)
Implementing setTestDataset to evaluate metrics on an external dataset during training of Text Classifiers in Spark NLP. This feature is similar to NerDLApproach where metrics are calculated on each Epoch and have been added to the following multi-class/multi-label text classifier annotators: ClassifierDLApproach, SentimentDLApproach, and MultiClassifierDLApproach (Adding setTestdatasetParam to Classifiers #12796)
Refactoring and improving EntityRuler annotator inference to up to 24x faster especially when used with a long list of labels/entities. We speed up the inference process by implementing the Aho-Corasick algorithm to match patterns in a string. This requires the following changes when using EntityRuler EntityRuler Latency Improvement #12634
Add support for S3 storage in the cache_folder where models are downloaded, extracted, and loaded from. Previously, we only supported all local file systems, HDFS, and DBFS. This new feature is especially useful for users on Kubernetes clusters with no access to HDFS or any other distributed file systems (Adding support for S3 in cache folder config #12707)
Implementing lookaround functionalities in DocumentNormalizer annotator. Currently, DocumentNormalizer has both lookahead and lookbehind functionalities. To extend support for more complex normalizations, especially within the clinical text we are introducing the lookaround feature (Feat lookaround in doc norm #12735)
Implementing setReplaceEntities param to NerOverwriter annotator to replace all the NER labels (entities) with the given new labels (entities) (Implement setReplaceEntities feature in NerOverwriter annotator #12745)

Bug Fixes

Fix a bug in generating the NerDL graph by using TF v2. The previous graph generated by the TFGraphBuilder annotator resulted in an exception when the length of the sequence was 1. This issue has been resolved and the new graphs created by TFGraphBuilder won't have this issue anymore (Fix bug in generating NER TF2 graphs when sequence length is 1 #12636)
Fix a bug introduced in the 4.0.0 release between Transformer-based Word Embeddings annotators. In the 4.0.0 release, the following annotators were migrated to BatchAnnotate to improve their performance, especially on GPU. However, a bug was introduced in sentence indices which when it is combined with SentenceEmbeddings for Text Classifications tasks (ClassifierDLApproach, SentimentDLApproach, and ClassifierDLApproach) resulted in low accuracy: AlbertEmbeddings, CamemBertEmbeddings, DeBertaEmbeddings, DistilBertEmbeddings, LongformerEmbeddings, RoBertaEmbeddings, XlmRoBertaEmbeddings, and XlnetEmbeddings (Fix a bug introduced in 4.0.0 release between Transformer-based Word Embeddings and SentenceEmbeddings #12641)
Add support for a list of questions and context in LightPipline. Previously, only one context and question at a time were supported in LightPipeline for Question Answering annotators. We have added support to fullAnnotate and annotate to receive two lists of questions and contexts (LightPipeline support for 2 Array Targets #12653)
Fix division by zero exception in the GPT2Transformer annotator when the setDoSample param was set to true (Fix/gpt2 edge case exceptions #12661)
Fix AttributeError when PretrainedPipeline is used in Python with ImageAssembler as one of the stages (Fix AttributeError for ImageAssembler in PretrainedPipeline #12813)

📓 New Notebooks

Spark NLP	Notebooks	Colab
Wav2Vec2ForCTC	Automatic Speech Recognition in Spark NLP
ViTForImageClassification	HuggingFace in Spark NLP - ViTForImageClassification
CamemBertForTokenClassification	HuggingFace in Spark NLP - CamemBertForTokenClassification
ClassifierDLApproach	ClassifierDL Train and Evaluate
MultiClassifierDLApproach	MultiClassifierDL Train and Evaluate
SentimentDLApproach	SentimentDL Train and Evaluate
Pretrained/cache_folder	Download & Load Models From S3
EntityRuler	EntityRuler
EntityRuler	EntityRuler Alphabet
EntityRuler	EntityRuler LightPipeline
EntityRuler	EntityRuler Without Storage
DocumentNormalizer	Apply Lookaround Patterns

You can visit Import Transformers in Spark NLP
You can visit Spark NLP Workshop for 100+ examples

Models

Spark NLP 4.2.0 comes with 3000+ state-of-the-art pre-trained transformer models in many languages.

Featured Models

Model	Name	Lang
Wav2Vec2ForCTC	asr_wav2vec2_base_100h_by_facebook	`en`
Wav2Vec2ForCTC	asr_wav2vec2_base_960h_by_facebook	`en`
Wav2Vec2ForCTC	asr_wav2vec2_large_960h	`en`
Wav2Vec2ForCTC	asr_wav2vec2_large_xlsr_53_german_by_facebook	`de`
Wav2Vec2ForCTC	asr_wav2vec2_large_xlsr_53_french_by_facebook	`fr`
Wav2Vec2ForCTC	asr_wav2vec2_large_xlsr_53_polish_by_facebook	`nl`
Wav2Vec2ForCTC	asr_wav2vec2_base_10k_voxpopuli	`hu`
Wav2Vec2ForCTC	asr_wav2vec2_base_10k_voxpopuli	`fi`
Wav2Vec2ForCTC	asr_wav2vec2_base_10k_voxpopuli	`it`
Wav2Vec2ForCTC	asr_wav2vec2_large_xlsr_japanese_hiragana	`ja`

Check 2000+ Wav2Vec2 models & pipelines for Models Hub - Automatic Speech Recognition (ASR)

Spark NLP covers the following languages:

English ,Multilingual ,Afrikaans ,Afro-Asiatic languages ,Albanian ,Altaic languages ,American Sign Language ,Amharic ,Arabic ,Argentine Sign Language ,Armenian ,Artificial languages ,Atlantic-Congo languages ,Austro-Asiatic languages ,Austronesian languages ,Azerbaijani ,Baltic languages ,Bantu languages ,Basque ,Basque (family) ,Belarusian ,Bemba (Zambia) ,Bengali, Bangla ,Berber languages ,Bihari ,Bislama ,Bosnian ,Brazilian Sign Language ,Breton ,Bulgarian ,Catalan ,Caucasian languages ,Cebuano ,Celtic languages ,Central Bikol ,Chichewa, Chewa, Nyanja ,Chilean Sign Language ,Chinese ,Chuukese ,Colombian Sign Language ,Congo Swahili ,Croatian ,Cushitic languages ,Czech ,Danish ,Dholuo, Luo (Kenya and Tanzania) ,Dravidian languages ,Dutch ,East Slavic languages ,Eastern Malayo-Polynesian languages ,Efik ,Esperanto ,Estonian ,Ewe ,Fijian ,Finnish ,Finnish Sign Language ,Finno-Ugrian languages ,French ,French-based creoles and pidgins ,Ga ,Galician ,Ganda ,Georgian ,German ,Germanic languages ,Gilbertese ,Greek (modern) ,Greek languages ,Gujarati ,Gun ,Haitian, Haitian Creole ,Hausa ,Hebrew (modern) ,Hiligaynon ,Hindi ,Hiri Motu ,Hungarian ,Icelandic ,Igbo ,Iloko ,Indic languages ,Indo-European languages ,Indo-Iranian languages ,Indonesian ,Irish ,Isoko ,Isthmus Zapotec ,Italian ,Italic languages ,Japanese ,Japanese ,Kabyle ,Kalaallisut, Greenlandic ,Kannada ,Kaonde ,Kinyarwanda ,Kirundi ,Kongo ,Korean ,Kwangali ,Kwanyama, Kuanyama ,Latin ,Latvian ,Lingala ,Lithuanian ,Louisiana Creole ,Lozi ,Luba-Katanga ,Luba-Lulua ,Lunda ,Lushai ,Luvale ,Macedonian ,Malagasy ,Malay ,Malayalam ,Malayo-Polynesian languages ,Maltese ,Manx ,Marathi (Marāṭhī) ,Marshallese ,Mexican Sign Language ,Mon-Khmer languages ,Morisyen ,Mossi ,Multiple languages ,Ndonga ,Nepali ,Niger-Kordofanian languages ,Nigerian Pidgin ,Niuean ,North Germanic languages ,Northern Sotho, Pedi, Sepedi ,Norwegian ,Norwegian Bokmål ,Norwegian Nynorsk ,Nyaneka ,Oromo ,Pangasinan ,Papiamento ,Persian (Farsi) ,Peruvian Sign Language ,Philippine languages ,Pijin ,Pohnpeian ,Polish ,Portuguese ,Portuguese-based creoles and pidgins ,Punjabi (Eastern) ,Romance languages ,Romanian ,Rundi ,Russian ,Ruund ,Salishan languages ,Samoan ,San Salvador Kongo ,Sango ,Semitic languages ,Serbo-Croatian ,Seselwa Creole French ,Shona ,Sindhi ,Sino-Tibetan languages ,Slavic languages ,Slovak ,Slovene ,Somali ,South Caucasian languages ,South Slavic languages ,Southern Sotho ,Spanish ,Spanish Sign Language ,Sranan Tongo ,Swahili ,Swati ,Swedish ,Tagalog ,Tahitian ,Tai ,Tamil ,Telugu ,Tetela ,Tetun Dili ,Thai ,Tigrinya ,Tiv ,Tok Pisin ,Tonga (Tonga Islands) ,Tonga (Zambia) ,Tsonga ,Tswana ,Tumbuka ,Turkic languages ,Turkish ,Tuvalu ,Tzotzil ,Ukrainian ,Umbundu ,Uralic languages ,Urdu ,Venda ,Venezuelan Sign Language ,Vietnamese ,Wallisian ,Walloon ,Waray (Philippines) ,Welsh ,West Germanic languages ,West Slavic languages ,Western Malayo-Polynesian languages ,Wolaitta, Wolaytta ,Wolof ,Xhosa ,Yapese ,Yiddish ,Yoruba ,Yucatec Maya, Yucateco ,Zande (individual language) ,Zulu

The complete list of all 11000+ models & pipelines in 230+ languages is available on Models Hub

📖 Documentation

TF Hub & HuggingFace to Spark NLP
Models Hub with new models
Spark NLP documentation
Spark NLP Scala APIs
Spark NLP Python APIs
Spark NLP Workshop notebooks
Spark NLP publications
Spark NLP in Action
Spark NLP training certification notebooks for Google Colab and Databricks
Spark NLP Display for visualization of different types of annotations
Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

#PyPI

pip install spark-nlp==4.2.0

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.0

M1

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.0

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>4.2.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>4.2.0</version>
</dependency>

spark-nlp-m1:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-m1_2.12</artifactId>
    <version>4.2.0</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.2.0.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.2.0.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.0.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.0.jar

What's Changed

Contributors

@maziyarpanahi @suvrat-joshi @danilojsl @josejuanmartinez @ahmedlone127 @Damla-Gurbaz @vankov @xusliebana @DevinTDHa @jsl-builder @Cabir40 @muhammetsnts @wolliq @Meryem1425 @pabla @C-K-Loan @rpranab @agsfer

Full Changelog: 4.1.0...4.2.0

This discussion was created from the release John Snow Labs Spark-NLP 4.2.0: Wav2Vec2 for Automatic Speech Recognition (ASR), TAPAS for Table Question Answering, CamemBERT for Token Classification, new evaluation metrics for external datasets in all classifiers, much faster EntityRuler, over 3000+ state-of-the-art multi-lingual models & pipelines, and many more!.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

{{title}}

Replies: 0 comments

Select a reply

maziyarpanahi Sep 27, 2022 Maintainer