John Snow Labs Spark-NLP 4.2.0: Wav2Vec2 for Automatic Speech Recognition (ASR), TAPAS for Table Question Answering, CamemBERT for Token Classification, new evaluation metrics for external datasets in all classifiers, much faster EntityRuler, over 3000+ state-of-the-art multi-lingual models & pipelines, and many more! #12842
maziyarpanahi
announced in
Announcement
📢 Overview
For the first time ever we are delighted to announce Automatic Speech Recognition (ASR) support in Spark NLP by using state-of-the-art Wav2Vec2 models at scale 🚀. This release also comes with Table Question Answering by TAPAS, CamemBERT for Token Classification, support for an external test dataset during training of all classifiers, much faster EntityRuler, 3000+ state-of-the-art models, and other enhancements and bug fixes!
We are also celebrating crossing 11000+ free and open-source models & pipelines in our Models Hub. 🎉 As always, we would like to thank our community for their feedback, questions, and feature requests.
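The EntityRuler speed-up in this release is attributed to the Aho-Corasick algorithm for matching many patterns in a string. The following is a minimal pure-Python sketch of that algorithm, included only to illustrate why lookup cost stays roughly flat as the pattern list grows. It is not Spark NLP's implementation, and the class name `AhoCorasick` and its `find_all` method are hypothetical names chosen for this example.

```python
from collections import deque


class AhoCorasick:
    """Illustrative Aho-Corasick automaton over plain strings."""

    def __init__(self, patterns):
        # One dict of outgoing edges per state; state 0 is the root.
        self.goto = [{}]
        self.fail = [0]
        self.out = [[]]  # patterns recognised when a state is reached
        for pattern in patterns:
            self._insert(pattern)
        self._build_failure_links()

    def _insert(self, pattern):
        # Walk or extend the trie one character at a time.
        state = 0
        for ch in pattern:
            nxt = self.goto[state].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = nxt
            state = nxt
        self.out[state].append(pattern)

    def _build_failure_links(self):
        # Breadth-first, so shallower states are finished before deeper ones.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit patterns that end at the failure target.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def find_all(self, text):
        """Return (start, end, pattern) for every match; end is exclusive."""
        matches = []
        state = 0
        for i, ch in enumerate(text):
            # Follow failure links until an edge for ch exists or we hit the root.
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for pattern in self.out[state]:
                matches.append((i - len(pattern) + 1, i + 1, pattern))
        return matches
```

For example, `AhoCorasick(["he", "she", "hers"]).find_all("ushers")` reports `she`, `he`, and `hers` in a single left-to-right pass over the text, instead of scanning the text once per pattern.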
⭐ New Features & improvements
Wav2Vec2ForCTC can load Wav2Vec2 models for the Automatic Speech Recognition (ASR) task. Wav2Vec2 is a multi-modal model that combines speech and text, and it is the first multi-modal model of its kind in Spark NLP. This annotator is compatible with all models trained or fine-tuned with Wav2Vec2ForCTC for PyTorch or TFWav2Vec2ForCTC for TensorFlow in HuggingFace 🤗 (Introducing the first Automatic Speech Recognition annotator: Wav2Vec2ForCTC #12767)
TapasForQuestionAnswering can load TAPAS models with a cell selection head and an optional aggregation head on top for question-answering tasks on tables (linear layers on top of the hidden-states output to compute logits and optional logits_aggregation), e.g. for SQA, WTQ, or WikiSQL-supervised tasks. TAPAS is a BERT-based model specifically designed (and pre-trained) for answering questions about tabular data. This annotator is compatible with all models trained or fine-tuned with TapasForQuestionAnswering for PyTorch or TFTapasForQuestionAnswering for TensorFlow in HuggingFace 🤗
CamemBertForTokenClassification can load CamemBERT models with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all models trained or fine-tuned with CamembertForTokenClassification for PyTorch or TFCamembertForTokenClassification for TensorFlow in HuggingFace 🤗 (Introducing CamemBERT for Token Classification #12752)
setTestDataset evaluates metrics on an external dataset during training of text classifiers in Spark NLP. As in NerDLApproach, the metrics are calculated on each epoch. The param has been added to the following multi-class/multi-label text classifier annotators: ClassifierDLApproach, SentimentDLApproach, and MultiClassifierDLApproach (Adding setTestdatasetParam to Classifiers #12796)
EntityRuler annotator inference is now up to 24x faster, especially when used with a long list of labels/entities. The speed-up comes from using the Aho-Corasick algorithm to match patterns in a string. This requires some changes when using EntityRuler; see EntityRuler Latency Improvement #12634
S3 is now supported for the cache_folder where models are downloaded, extracted, and loaded from. Previously, only local file systems, HDFS, and DBFS were supported. This new feature is especially useful for users on Kubernetes clusters with no access to HDFS or any other distributed file system (Adding support for S3 in cache folder config #12707)
lookaround functionality has been added to the DocumentNormalizer annotator. Currently, DocumentNormalizer has both lookahead and lookbehind functionalities. To extend support for more complex normalizations, especially within clinical text, we are introducing the lookaround feature (Feat lookaround in doc norm #12735)
The setReplaceEntities param has been added to the NerOverwriter annotator to replace all the NER labels (entities) with the given new labels (entities) (Implement setReplaceEntities feature in NerOverwriter annotator #12745)
Bug Fixes
A graph generated by the TFGraphBuilder annotator resulted in an exception when the length of the sequence was 1. This issue has been resolved, and new graphs created by TFGraphBuilder no longer have it (Fix bug in generating NER TF2 graphs when sequence length is 1 #12636)
LightPipeline's fullAnnotate and annotate can now receive two lists of questions and contexts (LightPipeline support for 2 Array Targets #12653)
Fixed edge-case exceptions in the GPT2Transformer annotator when the setDoSample param was set to true (Fix/gpt2 edge case exceptions #12661)
Fixed an AttributeError raised when PretrainedPipeline is used in Python with ImageAssembler as one of the stages (Fix AttributeError for ImageAssembler in PretrainedPipeline #12813)
📓 New Notebooks
Models
Spark NLP 4.2.0 comes with 3000+ state-of-the-art pre-trained transformer models in many languages.
Featured Models
Languages of the featured models: en, en, en, de, fr, nl, hu, fi, it, ja
Check the 2000+ Wav2Vec2 models & pipelines on Models Hub - Automatic Speech Recognition (ASR)
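As a rough sketch of how one of these Wav2Vec2 models might be wired into a pipeline, the function below combines AudioAssembler with Wav2Vec2ForCTC. The function name `build_asr_pipeline` and the column names `audio_content`, `audio_assembler`, and `text` are assumptions made for this example. It also assumes the audio has already been decoded into an array of float samples, and it uses the default pretrained model rather than a specific one from the Models Hub.

```python
def build_asr_pipeline():
    """Assemble a speech-to-text pipeline (needs PySpark and spark-nlp 4.2.0+)."""
    # Imported lazily so this module can be loaded without a Spark installation.
    from pyspark.ml import Pipeline
    from sparknlp.base import AudioAssembler
    from sparknlp.annotator import Wav2Vec2ForCTC

    # Turns a column of raw float samples into Spark NLP's audio annotation.
    audio_assembler = (
        AudioAssembler()
        .setInputCol("audio_content")  # assumed column name
        .setOutputCol("audio_assembler")
    )

    # pretrained() with no arguments downloads the annotator's default model.
    speech_to_text = (
        Wav2Vec2ForCTC.pretrained()
        .setInputCols(["audio_assembler"])
        .setOutputCol("text")
    )

    return Pipeline(stages=[audio_assembler, speech_to_text])
```

Given a DataFrame `df` with an `audio_content` column, something like `build_asr_pipeline().fit(df).transform(df).select("text.result")` would then return the transcriptions.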
Spark NLP covers the following languages:
English; Multilingual; Afrikaans; Afro-Asiatic languages; Albanian; Altaic languages; American Sign Language; Amharic; Arabic; Argentine Sign Language; Armenian; Artificial languages; Atlantic-Congo languages; Austro-Asiatic languages; Austronesian languages; Azerbaijani; Baltic languages; Bantu languages; Basque; Basque (family); Belarusian; Bemba (Zambia); Bengali, Bangla; Berber languages; Bihari; Bislama; Bosnian; Brazilian Sign Language; Breton; Bulgarian; Catalan; Caucasian languages; Cebuano; Celtic languages; Central Bikol; Chichewa, Chewa, Nyanja; Chilean Sign Language; Chinese; Chuukese; Colombian Sign Language; Congo Swahili; Croatian; Cushitic languages; Czech; Danish; Dholuo, Luo (Kenya and Tanzania); Dravidian languages; Dutch; East Slavic languages; Eastern Malayo-Polynesian languages; Efik; Esperanto; Estonian; Ewe; Fijian; Finnish; Finnish Sign Language; Finno-Ugrian languages; French; French-based creoles and pidgins; Ga; Galician; Ganda; Georgian; German; Germanic languages; Gilbertese; Greek (modern); Greek languages; Gujarati; Gun; Haitian, Haitian Creole; Hausa; Hebrew (modern); Hiligaynon; Hindi; Hiri Motu; Hungarian; Icelandic; Igbo; Iloko; Indic languages; Indo-European languages; Indo-Iranian languages; Indonesian; Irish; Isoko; Isthmus Zapotec; Italian; Italic languages; Japanese; Kabyle; Kalaallisut, Greenlandic; Kannada; Kaonde; Kinyarwanda; Kirundi; Kongo; Korean; Kwangali; Kwanyama, Kuanyama; Latin; Latvian; Lingala; Lithuanian; Louisiana Creole; Lozi; Luba-Katanga; Luba-Lulua; Lunda; Lushai; Luvale; Macedonian; Malagasy; Malay; Malayalam; Malayo-Polynesian languages; Maltese; Manx; Marathi (Marāṭhī); Marshallese; Mexican Sign Language; Mon-Khmer languages; Morisyen; Mossi; Multiple languages; Ndonga; Nepali; Niger-Kordofanian languages; Nigerian Pidgin; Niuean; North Germanic languages; Northern Sotho, Pedi, Sepedi; Norwegian; Norwegian Bokmål; Norwegian Nynorsk; Nyaneka; Oromo; Pangasinan; Papiamento; Persian (Farsi); Peruvian Sign Language; Philippine languages; Pijin; Pohnpeian; Polish; Portuguese; Portuguese-based creoles and pidgins; Punjabi (Eastern); Romance languages; Romanian; Rundi; Russian; Ruund; Salishan languages; Samoan; San Salvador Kongo; Sango; Semitic languages; Serbo-Croatian; Seselwa Creole French; Shona; Sindhi; Sino-Tibetan languages; Slavic languages; Slovak; Slovene; Somali; South Caucasian languages; South Slavic languages; Southern Sotho; Spanish; Spanish Sign Language; Sranan Tongo; Swahili; Swati; Swedish; Tagalog; Tahitian; Tai; Tamil; Telugu; Tetela; Tetun Dili; Thai; Tigrinya; Tiv; Tok Pisin; Tonga (Tonga Islands); Tonga (Zambia); Tsonga; Tswana; Tumbuka; Turkic languages; Turkish; Tuvalu; Tzotzil; Ukrainian; Umbundu; Uralic languages; Urdu; Venda; Venezuelan Sign Language; Vietnamese; Wallisian; Walloon; Waray (Philippines); Welsh; West Germanic languages; West Slavic languages; Western Malayo-Polynesian languages; Wolaitta, Wolaytta; Wolof; Xhosa; Yapese; Yiddish; Yoruba; Yucatec Maya, Yucateco; Zande (individual language); Zulu
The complete list of all 11000+ models & pipelines in 230+ languages is available on Models Hub
📖 Documentation
Installation
Python
# PyPI
pip install spark-nlp==4.2.0
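After installing from PyPI, a quick sanity check might look like the sketch below: it starts a Spark session through Spark NLP and confirms the installed version. The function name `start_spark_nlp` and its `expected_version` parameter are hypothetical helpers for this example, and the check assumes PySpark and a compatible Java runtime are already available.

```python
def start_spark_nlp(expected_version="4.2.0"):
    """Start a Spark NLP session and verify the installed library version."""
    # Imported lazily so this module can be loaded without Spark NLP installed.
    import sparknlp

    # sparknlp.start() creates (or reuses) a SparkSession with Spark NLP loaded.
    spark = sparknlp.start()

    installed = sparknlp.version()
    if installed != expected_version:
        raise RuntimeError(
            f"Expected spark-nlp {expected_version}, found {installed}"
        )
    return spark
```

Calling `start_spark_nlp()` returns the session on success and raises if a different version is installed, which can help catch a stale environment early.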
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):
GPU
M1
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:
spark-nlp-gpu:
spark-nlp-m1:
FAT JARs
CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.2.0.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.2.0.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.0.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.0.jar
What's Changed
Contributors
@maziyarpanahi @suvrat-joshi @danilojsl @josejuanmartinez @ahmedlone127 @Damla-Gurbaz @vankov @xusliebana @DevinTDHa @jsl-builder @Cabir40 @muhammetsnts @wolliq @Meryem1425 @pabla @C-K-Loan @rpranab @agsfer
Full Changelog: 4.1.0...4.2.0
This discussion was created from the release John Snow Labs Spark-NLP 4.2.0: Wav2Vec2 for Automatic Speech Recognition (ASR), TAPAS for Table Question Answering, CamemBERT for Token Classification, new evaluation metrics for external datasets in all classifiers, much faster EntityRuler, over 3000+ state-of-the-art multi-lingual models & pipelines, and many more!.