John Snow Labs Spark-NLP 4.0.0: New modern extractive Question Answering (QA) annotators for ALBERT, BERT, DistilBERT, DeBERTa, RoBERTa, Longformer, and XLM-RoBERTa, official support for Apple silicon M1, oneDNN support to improve CPU performance by up to 97%, transformers on GPU improved by up to 700%, 1000+ state-of-the-art models, and lots more! #9316
Overview
We are very excited to release Spark NLP 4.0.0! This has been one of the biggest releases we have ever done and we are so proud to share this with our community! 🎉
This release comes with official support for the Apple silicon M1 chip (for the first time), official support for Spark/PySpark 3.2, support for the oneAPI Deep Neural Network Library (oneDNN) to improve TensorFlow CPU performance by up to 97%, optimized transformer-based embeddings on GPU with performance gains of up to 700%, brand new modern extractive transformer-based Question Answering (QA) annotators for tasks like SQuAD based on the ALBERT, BERT, DistilBERT, DeBERTa, RoBERTa, Longformer, and XLM-RoBERTa architectures, 1000+ state-of-the-art models, a WordEmbeddingsModel that now works in clusters without HDFS/DBFS/S3 (such as Kubernetes), new Databricks and EMR support, new NER models achieving the highest F1 scores in Spark NLP, and many more enhancements and bug fixes!
We would like to mention that Spark NLP 4.0.0 drops support for Spark 2.3 and 2.4 (Scala 2.11). Starting with 4.0.0, we only support Spark/PySpark 3.x on Scala 2.12.
As always, we would like to thank our community for their feedback, questions, and feature requests.
Major features and improvements
- **NEW:** Support for the oneAPI Deep Neural Network Library (oneDNN) to improve TensorFlow on CPU by up to 97%. Enable it by setting the environment variable `export TF_ENABLE_ONEDNN_OPTS=1`
- **NEW:** A new `spark-nlp-m1` package that supports Apple silicon M1 on your macOS machine in Spark NLP 4.0.0
- **NEW:** `AlbertForQuestionAnswering` can load `ALBERT` models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `AlbertForQuestionAnswering` for PyTorch or `TFAlbertForQuestionAnswering` for TensorFlow in HuggingFace 🤗 (a usage sketch for these QA annotators follows this list)
- **NEW:** `BertForQuestionAnswering` can load `BERT` and `ELECTRA` models with a span classification head on top for extractive question-answering tasks like SQuAD. This annotator is compatible with all the models trained/fine-tuned by using `BertForQuestionAnswering` and `ElectraForQuestionAnswering` for PyTorch or `TFBertForQuestionAnswering` and `TFElectraForQuestionAnswering` for TensorFlow in HuggingFace 🤗
- **NEW:** `DeBertaForQuestionAnswering` can load `DeBERTa` v2 & v3 models with a span classification head on top for extractive question-answering tasks like SQuAD. This annotator is compatible with all the models trained/fine-tuned by using `DebertaV2ForQuestionAnswering` for PyTorch or `TFDebertaV2ForQuestionAnswering` for TensorFlow in HuggingFace 🤗
- **NEW:** `DistilBertForQuestionAnswering` can load `DistilBERT` models with a span classification head on top for extractive question-answering tasks like SQuAD. This annotator is compatible with all the models trained/fine-tuned by using `DistilBertForQuestionAnswering` for PyTorch or `TFDistilBertForQuestionAnswering` for TensorFlow in HuggingFace 🤗
- **NEW:** `LongformerForQuestionAnswering` can load `Longformer` models with a span classification head on top for extractive question-answering tasks like SQuAD. This annotator is compatible with all the models trained/fine-tuned by using `LongformerForQuestionAnswering` for PyTorch or `TFLongformerForQuestionAnswering` for TensorFlow in HuggingFace 🤗
- **NEW:** `RoBertaForQuestionAnswering` can load `RoBERTa` models with a span classification head on top for extractive question-answering tasks like SQuAD. This annotator is compatible with all the models trained/fine-tuned by using `RobertaForQuestionAnswering` for PyTorch or `TFRobertaForQuestionAnswering` for TensorFlow in HuggingFace 🤗
- **NEW:** `XlmRoBertaForQuestionAnswering` can load `XLM-RoBERTa` models with a span classification head on top for extractive question-answering tasks like SQuAD. This annotator is compatible with all the models trained/fine-tuned by using `XLMRobertaForQuestionAnswering` for PyTorch or `TFXLMRobertaForQuestionAnswering` for TensorFlow in HuggingFace 🤗
- **NEW:** Introducing the `enableInMemoryStorage` parameter in the `WordEmbeddingsModel` annotator. When enabled, the annotator no longer requires a distributed storage to unpack indices and performs everything in memory.
- Unified Spark NLP packages: `spark-nlp` for CPU, `spark-nlp-gpu` for GPU, and `spark-nlp-m1` for the new Apple silicon M1 on macOS. The need for Apache Spark specific packages like `spark-nlp-spark32` has been removed.
- New M1 support in the `sparknlp.start()` function in Python and Scala for Apple silicon M1 on macOS (`m1=True`)
- A new `setCaseSensitive` param for the transformer-based classifiers. This allows users to change this value if the model was saved/uploaded with the wrong case sensitivity parameter (BERT, ALBERT, DistilBERT, RoBERTa, DeBERTa, XLM-RoBERTa, and Longformer for XXXForSequenceClassification and XXXForTokenClassification).
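Since all of the new QA annotators share the same question/context input pattern, here is a minimal Python sketch of how one of them can be used in a pipeline. The pretrained model name (`bert_base_cased_qa_squad2`) and the `MultiDocumentAssembler` wiring are illustrative assumptions based on the Spark NLP documentation, not part of these release notes:

```python
import sparknlp
from sparknlp.base import MultiDocumentAssembler
from sparknlp.annotator import BertForQuestionAnswering
from pyspark.ml import Pipeline

spark = sparknlp.start()

# The QA annotators take a question and its context as two separate inputs
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

# Illustrative pretrained model name; any of the new *ForQuestionAnswering
# annotators (ALBERT, DistilBERT, DeBERTa, RoBERTa, Longformer, XLM-RoBERTa)
# is wired up the same way
span_classifier = BertForQuestionAnswering.pretrained("bert_base_cased_qa_squad2") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)  # the new setCaseSensitive param mentioned above

pipeline = Pipeline(stages=[document_assembler, span_classifier])

data = spark.createDataFrame(
    [["What is my name?", "My name is Clara and I live in Berkeley."]]
).toDF("question", "context")

pipeline.fit(data).transform(data).select("answer.result").show(truncate=False)
```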
Performance Improvements (Benchmarks)

We have introduced two major performance improvements for GPU and CPU devices in the Spark NLP 4.0.0 release.
The following benchmarks were done using a single Dell server with the following specs:
GPU
We have improved our batch processing approach for transformer-based Word Embeddings to improve their performance on GPU devices. These optimizations result in performance improvements of up to 700%. The detailed list of improved transformer models on GPU in comparison to Spark NLP 3.4.x:
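As a rough illustration of how the improved batching is exercised, a GPU session can be started and the batch size on a transformer-based embeddings annotator raised. The model name and batch size below are illustrative choices for this sketch, not the benchmark settings:

```python
import sparknlp
from sparknlp.annotator import BertEmbeddings

# Start Spark NLP with the GPU package/runtime enabled
spark = sparknlp.start(gpu=True)

# Larger batch sizes let the improved batching keep the GPU busy;
# "small_bert_L2_768" and batchSize=16 are illustrative only
embeddings = BertEmbeddings.pretrained("small_bert_L2_768") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setBatchSize(16)
```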
CPU
The oneAPI Deep Neural Network Library (oneDNN) optimizations are now available in Spark NLP 4.0.0, which uses TensorFlow 2.7.1. You can enable these CPU optimizations by setting the environment variable `TF_ENABLE_ONEDNN_OPTS=1`. The benchmarks below compare the last release, Spark NLP 3.4.3 on CPU, against Spark NLP 4.0.0 on CPU with oneDNN enabled.
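For example, a minimal Python sketch of enabling oneDNN before starting the session; the environment variable must be visible to the process running TensorFlow, and on a cluster it may also need to be set for executors (e.g. via `spark.executorEnv`):

```python
import os

# Equivalent to `export TF_ENABLE_ONEDNN_OPTS=1` in the shell; it must be set
# before the TensorFlow backend is initialized
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"

import sparknlp

# gpu=True and m1=True variants also exist for the GPU and Apple M1 packages
spark = sparknlp.start()
```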
Bug Fixes
Updated Requirements
Backward Compatibility
The `sparknlp.start()` functions in Python and Scala no longer have `spark23`, `spark24`, and `spark32` parameters. The default `sparknlp.start()` works on PySpark 3.0.x, 3.1.x, and 3.2.x without the need for any Spark-related flags.
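In practice, the plain default session is all that is needed on any supported PySpark 3.x version; a minimal sketch:

```python
import sparknlp

# No spark23/spark24/spark32 flags anymore; the default session covers
# PySpark 3.0.x, 3.1.x, and 3.2.x
spark = sparknlp.start()
print(sparknlp.version(), spark.version)
```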
Models and Pipelines

Spark NLP 4.0.0 comes with 1000+ state-of-the-art pre-trained transformer models in many languages.
New NER Models
The new `nerdl_conll_deberta_large` NER model (en) reaches 96% F1 on the CoNLL03 dev set, breaking the previously highest F1 score in Spark NLP by 1%. Two additional English NER models in this release reach 95.6% and 94% F1, respectively.
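A minimal sketch of loading the new model; `NerDLModel` usage is assumed to follow the standard pattern of an upstream pipeline producing sentences, tokens, and the same embeddings the model was trained with:

```python
from sparknlp.annotator import NerDLModel

# Assumes an upstream pipeline producing "sentence", "token", and "embeddings"
# columns (the embeddings must match those the model was trained with)
ner = NerDLModel.pretrained("nerdl_conll_deberta_large", "en") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")
```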
Featured Models
Featured models are available for English (en), Portuguese (pt), Chinese (zh), Arabic (ar), Korean (ko), Greek (el), and multilingual (xx) use cases; see Models Hub for the model names and links.
Spark NLP covers the following languages:
English, Multilingual, Afrikaans, Afro-Asiatic languages, Albanian, Altaic languages, American Sign Language, Amharic, Arabic, Argentine Sign Language, Armenian, Artificial languages, Atlantic-Congo languages, Austro-Asiatic languages, Austronesian languages, Azerbaijani, Baltic languages, Bantu languages, Basque, Basque (family), Belarusian, Bemba (Zambia), Bengali, Bangla, Berber languages, Bihari, Bislama, Bosnian, Brazilian Sign Language, Breton, Bulgarian, Catalan, Caucasian languages, Cebuano, Celtic languages, Central Bikol, Chichewa, Chewa, Nyanja, Chilean Sign Language, Chinese, Chuukese, Colombian Sign Language, Congo Swahili, Croatian, Cushitic languages, Czech, Danish, Dholuo, Luo (Kenya and Tanzania), Dravidian languages, Dutch, East Slavic languages, Eastern Malayo-Polynesian languages, Efik, Esperanto, Estonian, Ewe, Fijian, Finnish, Finnish Sign Language, Finno-Ugrian languages, French, French-based creoles and pidgins, Ga, Galician, Ganda, Georgian, German, Germanic languages, Gilbertese, Greek (modern), Greek languages, Gujarati, Gun, Haitian, Haitian Creole, Hausa, Hebrew (modern), Hiligaynon, Hindi, Hiri Motu, Hungarian, Icelandic, Igbo, Iloko, Indic languages, Indo-European languages, Indo-Iranian languages, Indonesian, Irish, Isoko, Isthmus Zapotec, Italian, Italic languages, Japanese, Kabyle, Kalaallisut, Greenlandic, Kannada, Kaonde, Kinyarwanda, Kirundi, Kongo, Korean, Kwangali, Kwanyama, Kuanyama, Latin, Latvian, Lingala, Lithuanian, Louisiana Creole, Lozi, Luba-Katanga, Luba-Lulua, Lunda, Lushai, Luvale, Macedonian, Malagasy, Malay, Malayalam, Malayo-Polynesian languages, Maltese, Manx, Marathi (Marāṭhī), Marshallese, Mexican Sign Language, Mon-Khmer languages, Morisyen, Mossi, Multiple languages, Ndonga, Nepali, Niger-Kordofanian languages, Nigerian Pidgin, Niuean, North Germanic languages, Northern Sotho, Pedi, Sepedi, Norwegian, Norwegian Bokmål, Norwegian Nynorsk, Nyaneka, Oromo, Pangasinan, Papiamento, Persian (Farsi), Peruvian Sign Language, Philippine languages, Pijin, Pohnpeian, Polish, Portuguese, Portuguese-based creoles and pidgins, Punjabi (Eastern), Romance languages, Romanian, Rundi, Russian, Ruund, Salishan languages, Samoan, San Salvador Kongo, Sango, Semitic languages, Serbo-Croatian, Seselwa Creole French, Shona, Sindhi, Sino-Tibetan languages, Slavic languages, Slovak, Slovene, Somali, South Caucasian languages, South Slavic languages, Southern Sotho, Spanish, Spanish Sign Language, Sranan Tongo, Swahili, Swati, Swedish, Tagalog, Tahitian, Tai, Tamil, Telugu, Tetela, Tetun Dili, Thai, Tigrinya, Tiv, Tok Pisin, Tonga (Tonga Islands), Tonga (Zambia), Tsonga, Tswana, Tumbuka, Turkic languages, Turkish, Tuvalu, Tzotzil, Ukrainian, Umbundu, Uralic languages, Urdu, Venda, Venezuelan Sign Language, Vietnamese, Wallisian, Walloon, Waray (Philippines), Welsh, West Germanic languages, West Slavic languages, Western Malayo-Polynesian languages, Wolaitta, Wolaytta, Wolof, Xhosa, Yapese, Yiddish, Yoruba, Yucatec Maya, Yucateco, Zande (individual language), Zulu.
The complete list of all 6000+ models & pipelines in 230+ languages is available on the Models Hub.
New Notebooks
Import hundreds of models in different languages to Spark NLP
You can visit Import Transformers in Spark NLP for more info
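As a rough sketch of that workflow, assuming a question-answering model already exported from HuggingFace as a TensorFlow SavedModel (the paths and names below are placeholders):

```python
import sparknlp
from sparknlp.annotator import BertForQuestionAnswering

spark = sparknlp.start()

# Path to the exported HuggingFace TensorFlow SavedModel (placeholder path)
export_path = "exported_model/saved_model/1"

# Load the exported model into Spark NLP and save it as a Spark NLP model
qa_model = BertForQuestionAnswering.loadSavedModel(export_path, spark) \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

qa_model.write().overwrite().save("bert_qa_spark_nlp")
```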
Documentation
Installation
Python
```bash
# PyPI
pip install spark-nlp==4.0.0
```
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, and 3.2.x (Scala 2.12):
GPU
M1
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, and 3.2.x:
spark-nlp-gpu:
spark-nlp-m1:
FAT JARs
CPU on Apache Spark 3.0.x/3.1.x/3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.0.0.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.0.0.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.0.0.jar
What's Changed
Full Changelog: 3.4.4...4.0.0
@vankov @mahmoodbayeshi @Ahmetemintek @DevinTDHa @albertoandreottiATgmail @KshitizGIT @jsl-models @gokhanturer @josejuanmartinez @murat-gunay @rpranab @wolliq @bunyamin-polat @pabla @danilojsl @agsfer @Meryem1425 @gadde5300 @muhammetsnts @Damla-Gurbaz @maziyarpanahi @jsl-builder @Cabir40 @suvrat-joshi