diff --git a/CHANGELOG b/CHANGELOG
index f8cfaf23daa84b..938c00eca75039 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,3 +1,25 @@
+========
+5.1.4
+========
+----------------
+New Features & Enhancements
+----------------
+* **NEW:** Introducing the `DocumentCharacterTextSplitter`, which allows users to split large documents into smaller chunks. `DocumentCharacterTextSplitter` takes a list of separators in order and splits subtexts if they are over the chunk length, considering optional overlap of the chunks.
+* **NEW:** Introducing support for ONNX Runtime in RobertaForSequenceClassification annotator
+* **NEW:** Introducing support for ONNX Runtime in RobertaForTokenClassification annotator
+* **NEW:** Introducing support for ONNX Runtime in RobertaForQuestionAnswering annotator
+* Adding an example to load a model directly from Azure using the `.load()` method. This example helps users understand how to configure Spark NLP to load models from Azure
+
+----------------
+Bug Fixes
+----------------
+* Fix a bug in the `Whisper` annotator that prevented some models from being imported
+* Fix BPE Tokenizer to include a flag controlling whether to always prepend a space before words (the previous behavior for embeddings)
+* Fix BPE Tokenizer to correctly convert and tokenize non-Latin and other special characters/words
+* Fix `RobertaForQuestionAnswering` to produce the same logits and indexes as the implementation in the Transformers library
+* Fix the return order of logits in `BertForQuestionAnswering` and `DistilBertForQuestionAnswering` annotators
+
+
 ========
 5.1.3
 ========
diff --git a/README.md b/README.md
index fe80bc91c0fcb1..5d0d4e50dd59c0 100644
--- a/README.md
+++ b/README.md
@@ -19,7 +19,7 @@
 Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. It provides **simple**, **performant** & **accurate** NLP annotations for machine learning pipelines that **scale** easily in a distributed environment.
-Spark NLP comes with **21000+** pretrained **pipelines** and **models** in more than **200+** languages.
+Spark NLP comes with **22000+** pretrained **pipelines** and **models** in more than **200+** languages.
 It also offers tasks such as **Tokenization**, **Word Segmentation**, **Part-of-Speech Tagging**, Word and Sentence **Embeddings**, **Named Entity Recognition**, **Dependency Parsing**, **Spell Checking**, **Text Classification**, **Sentiment Analysis**, **Token Classification**, **Machine Translation** (+180 languages), **Summarization**, **Question Answering**, **Table Question Answering**, **Text Generation**, **Image Classification**, **Image to Text (captioning)**, **Automatic Speech Recognition**, **Zero-Shot Learning**, and many more [NLP tasks](#features).
 **Spark NLP** is the only open-source NLP library in **production** that offers state-of-the-art transformers such as **BERT**, **CamemBERT**, **ALBERT**, **ELECTRA**, **XLNet**, **DistilBERT**, **RoBERTa**, **DeBERTa**, **XLM-RoBERTa**, **Longformer**, **ELMO**, **Universal Sentence Encoder**, **Facebook BART**, **Instructor**, **E5**, **Google T5**, **MarianMT**, **OpenAI GPT2**, and **Vision Transformers (ViT)** not only to **Python** and **R**, but also to **JVM** ecosystem (**Java**, **Scala**, and **Kotlin**) at **scale** by extending **Apache Spark** natively.
@@ -80,6 +80,7 @@ documentation and examples
 - Stop Words Removal
 - Token Normalizer
 - Document Normalizer
+- Document & Text Splitter
 - Stemmer
 - Lemmatizer
 - NGrams
@@ -157,8 +158,8 @@ documentation and examples
 - Easy ONNX and TensorFlow integrations
 - GPU Support
 - Full integration with Spark ML functions
-- +15000 pre-trained models in +200 languages!
-- +5800 pre-trained pipelines in +200 languages!
+- +16800 pre-trained models in +200 languages!
+- +5900 pre-trained pipelines in +200 languages!
- Multi-lingual NER models: Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hebrew, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Urdu, and more. @@ -167,11 +168,11 @@ documentation and examples To use Spark NLP you need the following requirements: - Java 8 and 11 -- Apache Spark 3.4.x, 3.3.x, 3.2.x, 3.1.x, 3.0.x +- Apache Spark 3.5.x, 3.4.x, 3.3.x, 3.2.x, 3.1.x, 3.0.x **GPU (optional):** -Spark NLP 5.1.3 is built with ONNX 1.15.1 and TensorFlow 2.7.1 deep learning engines. The minimum following NVIDIA® software are only required for GPU support: +Spark NLP 5.1.4 is built with ONNX 1.15.1 and TensorFlow 2.7.1 deep learning engines. The minimum following NVIDIA® software are only required for GPU support: - NVIDIA® GPU drivers version 450.80.02 or higher - CUDA® Toolkit 11.2 @@ -187,7 +188,7 @@ $ java -version $ conda create -n sparknlp python=3.7 -y $ conda activate sparknlp # spark-nlp by default is based on pyspark 3.x -$ pip install spark-nlp==5.1.3 pyspark==3.3.1 +$ pip install spark-nlp==5.1.4 pyspark==3.3.1 ``` In Python console or Jupyter `Python3` kernel: @@ -232,23 +233,23 @@ For more examples, you can visit our dedicated [examples](https://github.com/Joh ## Apache Spark Support -Spark NLP *5.1.3* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x - -| Spark NLP | Apache Spark 2.3.x | Apache Spark 2.4.x | Apache Spark 3.0.x | Apache Spark 3.1.x | Apache Spark 3.2.x | Apache Spark 3.3.x | Apache Spark 3.4.x | -|-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------| -| 5.0.x | NO | NO | YES | YES | YES | YES | YES | -| 4.4.x | NO | NO | YES | YES | YES | YES | YES | -| 4.3.x | NO | NO | YES | YES | YES | YES | NO | -| 4.2.x | NO | NO | YES | YES | YES | YES | NO | -| 4.1.x | NO | NO | YES | YES | YES | YES | NO | -| 
4.0.x | NO | NO | YES | YES | YES | YES | NO | -| 3.4.x | YES | YES | YES | YES | Partially | N/A | NO | -| 3.3.x | YES | YES | YES | YES | NO | NO | NO | -| 3.2.x | YES | YES | YES | YES | NO | NO | NO | -| 3.1.x | YES | YES | YES | YES | NO | NO | NO | -| 3.0.x | YES | YES | YES | YES | NO | NO | NO | -| 2.7.x | YES | YES | NO | NO | NO | NO | NO | - +Spark NLP *5.1.4* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x + +| Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x | +|-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------| +| 5.1.x | YES | YES | YES | YES | YES | YES | NO | NO | +| 5.0.x | YES | YES | YES | YES | YES | YES | NO | NO | +| 4.4.x | YES | YES | YES | YES | YES | YES | NO | NO | +| 4.3.x | NO | NO | YES | YES | YES | YES | NO | NO | +| 4.2.x | NO | NO | YES | YES | YES | YES | NO | NO | +| 4.1.x | NO | NO | YES | YES | YES | YES | NO | NO | +| 4.0.x | NO | NO | YES | YES | YES | YES | NO | NO | +| 3.4.x | NO | NO | N/A | Partially | YES | YES | YES | YES | +| 3.3.x | NO | NO | NO | NO | YES | YES | YES | YES | +| 3.2.x | NO | NO | NO | NO | YES | YES | YES | YES | +| 3.1.x | NO | NO | NO | NO | YES | YES | YES | YES | +| 3.0.x | NO | NO | NO | NO | YES | YES | YES | YES | +| 2.7.x | NO | NO | NO | NO | NO | NO | YES | YES | Find out more about `Spark NLP` versions from our [release notes](https://github.com/JohnSnowLabs/spark-nlp/releases). 
@@ -256,6 +257,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github
 | Spark NLP | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10| Scala 2.11 | Scala 2.12 |
 |-----------|------------|------------|------------|------------|------------|------------|------------|
+| 5.1.x | NO | YES | YES | YES | YES | NO | YES |
 | 5.0.x | NO | YES | YES | YES | YES | NO | YES |
 | 4.4.x | NO | YES | YES | YES | YES | NO | YES |
 | 4.3.x | YES | YES | YES | YES | YES | NO | YES |
@@ -271,7 +273,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github
 ## Databricks Support
-Spark NLP 5.1.3 has been tested and is compatible with the following runtimes:
+Spark NLP 5.1.4 has been tested and is compatible with the following runtimes:
 **CPU:**
@@ -309,6 +311,10 @@ Spark NLP 5.1.3 has been tested and is compatible with the following runtimes:
 - 13.2 ML
 - 13.3
 - 13.3 ML
+- 14.0
+- 14.0 ML
+- 14.1
+- 14.1 ML
 **GPU:**
@@ -329,10 +335,12 @@ Spark NLP 5.1.3 has been tested and is compatible with the following runtimes:
 - 13.1 ML & GPU
 - 13.2 ML & GPU
 - 13.3 ML & GPU
+- 14.0 ML & GPU
+- 14.1 ML & GPU
 ## EMR Support
-Spark NLP 5.1.3 has been tested and is compatible with the following EMR releases:
+Spark NLP 5.1.4 has been tested and is compatible with the following EMR releases:
 - emr-6.2.0
 - emr-6.3.0
@@ -346,6 +354,8 @@ Spark NLP 5.1.3 has been tested and is compatible with the following EMR release
 - emr-6.10.0
 - emr-6.11.0
 - emr-6.12.0
+- emr-6.13.0
+- emr-6.14.0
 Full list of [Amazon EMR 6.x releases](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html)
@@ -357,10 +367,10 @@ NOTE: The EMR 6.1.0 and 6.1.1 are not supported.
This is a cheatsheet for corresponding Spark NLP Maven package to Apache Spark / PySpark major version: -| Apache Spark | Spark NLP on CPU | Spark NLP on GPU | Spark NLP on AArch64 (linux) | Spark NLP on Apple Silicon | -|---------------------|--------------------|----------------------------|--------------------------------|--------------------------------------| -| 3.0/3.1/3.2/3.3/3.4 | `spark-nlp` | `spark-nlp-gpu` | `spark-nlp-aarch64` | `spark-nlp-silicon` | -| Start Function | `sparknlp.start()` | `sparknlp.start(gpu=True)` | `sparknlp.start(aarch64=True)` | `sparknlp.start(apple_silicon=True)` | +| Apache Spark | Spark NLP on CPU | Spark NLP on GPU | Spark NLP on AArch64 (linux) | Spark NLP on Apple Silicon | +|-------------------------|--------------------|----------------------------|--------------------------------|--------------------------------------| +| 3.0/3.1/3.2/3.3/3.4/3.5 | `spark-nlp` | `spark-nlp-gpu` | `spark-nlp-aarch64` | `spark-nlp-silicon` | +| Start Function | `sparknlp.start()` | `sparknlp.start(gpu=True)` | `sparknlp.start(aarch64=True)` | `sparknlp.start(apple_silicon=True)` | NOTE: `M1/M2` and `AArch64` are under `experimental` support. Access and support to these architectures are limited by the community and we had to build most of the dependencies by ourselves to make them compatible. We support these two @@ -370,18 +380,18 @@ architectures, however, they may not work in some environments. 
### Command line (requires internet connection) -Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x, Apache Spark 3.2.x, Apache Spark 3.3.x, and Apache Spark 3.4.x +Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x, Apache Spark 3.2.x, Apache Spark 3.3.x, Apache Spark 3.4.x, and Apache Spark 3.5.x -#### Apache Spark 3.x (3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x - Scala 2.12) +#### Apache Spark 3.x (3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x - Scala 2.12) ```sh # CPU -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4 -pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4 ``` The `spark-nlp` has been published to @@ -390,11 +400,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s ```sh # GPU -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.3 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.4 -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.3 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.4 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.3 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.4 ``` @@ -404,11 +414,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s ```sh # AArch64 -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.3 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.4 -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.3 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.4 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.3 +spark-submit 
--packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.4 ``` @@ -418,11 +428,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s ```sh # M1/M2 (Apple Silicon) -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.3 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.4 -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.3 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.4 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.3 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.4 ``` @@ -436,7 +446,7 @@ set in your SparkSession: spark-shell \ --driver-memory 16g \ --conf spark.kryoserializer.buffer.max=2000M \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 + --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4 ``` ## Scala @@ -447,14 +457,14 @@ coordinates: ### Maven -**spark-nlp** on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x: +**spark-nlp** on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x: ```xml com.johnsnowlabs.nlp spark-nlp_2.12 - 5.1.3 + 5.1.4 ``` @@ -465,7 +475,7 @@ coordinates: com.johnsnowlabs.nlp spark-nlp-gpu_2.12 - 5.1.3 + 5.1.4 ``` @@ -476,7 +486,7 @@ coordinates: com.johnsnowlabs.nlp spark-nlp-aarch64_2.12 - 5.1.3 + 5.1.4 ``` @@ -487,38 +497,38 @@ coordinates: com.johnsnowlabs.nlp spark-nlp-silicon_2.12 - 5.1.3 + 5.1.4 ``` ### SBT -**spark-nlp** on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x: +**spark-nlp** on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x: ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.1.3" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.1.4" ``` **spark-nlp-gpu:** ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.1.3" 
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.1.4" ``` **spark-nlp-aarch64:** ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64 -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.1.3" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.1.4" ``` **spark-nlp-silicon:** ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.1.3" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.1.4" ``` Maven @@ -540,7 +550,7 @@ If you installed pyspark through pip/conda, you can install `spark-nlp` through Pip: ```bash -pip install spark-nlp==5.1.3 +pip install spark-nlp==5.1.4 ``` Conda: @@ -569,7 +579,7 @@ spark = SparkSession.builder .config("spark.driver.memory", "16G") .config("spark.driver.maxResultSize", "0") .config("spark.kryoserializer.buffer.max", "2000M") - .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3") + .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4") .getOrCreate() ``` @@ -601,19 +611,19 @@ result = pipeline.annotate('The Mona Lisa is a 16th century oil painting created #### spark-nlp -- FAT-JAR for CPU on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x +- FAT-JAR for CPU on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x ```bash sbt assembly ``` -- FAT-JAR for GPU on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x +- FAT-JAR for GPU on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x ```bash sbt -Dis_gpu=true assembly ``` -- FAT-JAR for M! on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x +- FAT-JAR for M! 
on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x ```bash sbt -Dis_silicon=true assembly @@ -640,7 +650,7 @@ Use either one of the following options - Add the following Maven Coordinates to the interpreter's library list ```bash -com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 +com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4 ``` - Add a path to pre-built jar from [here](#compiled-jars) in the interpreter's library list making sure the jar is @@ -651,7 +661,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 Apart from the previous step, install the python module through pip ```bash -pip install spark-nlp==5.1.3 +pip install spark-nlp==5.1.4 ``` Or you can install `spark-nlp` from inside Zeppelin by using Conda: @@ -679,7 +689,7 @@ launch the Jupyter from the same Python environment: $ conda create -n sparknlp python=3.8 -y $ conda activate sparknlp # spark-nlp by default is based on pyspark 3.x -$ pip install spark-nlp==5.1.3 pyspark==3.3.1 jupyter +$ pip install spark-nlp==5.1.4 pyspark==3.3.1 jupyter $ jupyter notebook ``` @@ -696,7 +706,7 @@ export PYSPARK_PYTHON=python3 export PYSPARK_DRIVER_PYTHON=jupyter export PYSPARK_DRIVER_PYTHON_OPTS=notebook -pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4 ``` Alternatively, you can mix in using `--jars` option for pyspark + `pip install spark-nlp` @@ -723,7 +733,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi # -s is for spark-nlp # -g will enable upgrading libcudnn8 to 8.1.0 on Google Colab for GPU usage # by default they are set to the latest -!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.3 +!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.4 ``` [Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb) @@ -746,7 +756,7 @@ This script 
comes with the two options to define `pyspark` and `spark-nlp` versi # -s is for spark-nlp # -g will enable upgrading libcudnn8 to 8.1.0 on Kaggle for GPU usage # by default they are set to the latest -!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.3 +!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.4 ``` [Spark NLP quick start on Kaggle Kernel](https://www.kaggle.com/mozzie/spark-nlp-named-entity-recognition) is a live @@ -765,9 +775,9 @@ demo on Kaggle Kernel that performs named entity recognitions by using Spark NLP 3. In `Libraries` tab inside your cluster you need to follow these steps: - 3.1. Install New -> PyPI -> `spark-nlp==5.1.3` -> Install + 3.1. Install New -> PyPI -> `spark-nlp==5.1.4` -> Install - 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3` -> Install + 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4` -> Install 4. Now you can attach your notebook to the cluster and use Spark NLP! 
@@ -818,7 +828,7 @@ A sample of your software configuration in JSON on S3 (must be public access): "spark.kryoserializer.buffer.max": "2000M", "spark.serializer": "org.apache.spark.serializer.KryoSerializer", "spark.driver.maxResultSize": "0", - "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3" + "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4" } }] ``` @@ -827,7 +837,7 @@ A sample of AWS CLI to launch EMR cluster: ```.sh aws emr create-cluster \ ---name "Spark NLP 5.1.3" \ +--name "Spark NLP 5.1.4" \ --release-label emr-6.2.0 \ --applications Name=Hadoop Name=Spark Name=Hive \ --instance-type m4.4xlarge \ @@ -891,7 +901,7 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \ --enable-component-gateway \ --metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \ --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \ - --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 + --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4 ``` 2. On an existing one, you need to install spark-nlp and spark-nlp-display packages from PyPI. 
@@ -930,7 +940,7 @@ spark = SparkSession.builder .config("spark.kryoserializer.buffer.max", "2000m") .config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained") .config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage") - .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3") + .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4") .getOrCreate() ``` @@ -944,7 +954,7 @@ spark-shell \ --conf spark.kryoserializer.buffer.max=2000M \ --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \ --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 + --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4 ``` **pyspark:** @@ -957,7 +967,7 @@ pyspark \ --conf spark.kryoserializer.buffer.max=2000M \ --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \ --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 + --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4 ``` **Databricks:** @@ -1229,16 +1239,16 @@ spark = SparkSession.builder .config("spark.driver.memory", "16G") .config("spark.driver.maxResultSize", "0") .config("spark.kryoserializer.buffer.max", "2000M") - .config("spark.jars", "/tmp/spark-nlp-assembly-5.1.3.jar") + .config("spark.jars", "/tmp/spark-nlp-assembly-5.1.4.jar") .getOrCreate() ``` - You can download provided Fat JARs from each [release notes](https://github.com/JohnSnowLabs/spark-nlp/releases), please pay attention to pick the one that suits your environment depending on the device (CPU/GPU) and Apache Spark - version (3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x) + version (3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x) - If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need to put the Fat JAR on a distributed FileSystem such as HDFS, 
DBFS, S3, etc. ( - i.e., `hdfs:///tmp/spark-nlp-assembly-5.1.3.jar`) + i.e., `hdfs:///tmp/spark-nlp-assembly-5.1.4.jar`) Example of using pretrained Models and Pipelines in offline: diff --git a/build.sbt b/build.sbt index 09a29f16499a47..ab4480b192823f 100644 --- a/build.sbt +++ b/build.sbt @@ -6,7 +6,7 @@ name := getPackageName(is_silicon, is_gpu, is_aarch64) organization := "com.johnsnowlabs.nlp" -version := "5.1.3" +version := "5.1.4" (ThisBuild / scalaVersion) := scalaVer diff --git a/docs/_layouts/landing.html b/docs/_layouts/landing.html index f9be7fb96ee5ac..40a26f72e5d375 100755 --- a/docs/_layouts/landing.html +++ b/docs/_layouts/landing.html @@ -201,7 +201,7 @@

{{ _section.title }}

{% highlight bash %} # Using PyPI - $ pip install spark-nlp==5.1.3 + $ pip install spark-nlp==5.1.4 # Using Anaconda/Conda $ conda install -c johnsnowlabs spark-nlp @@ -274,6 +274,7 @@

NLP Features

  • Tokenization
  • Word Segmentation
  • Stop Words Removal
+  • Document & Text Splitter
  • Normalizer
  • Stemmer
  • Lemmatizer
  • @@ -339,8 +340,8 @@

    NLP Features

  • Easy ONNX and TensorFlow integrations
  • GPU Support
  • Full integration with Spark ML functions
-  • 15000+ pre-trained models in 200+ languages!
-  • 5800+ pre-trained pipelines in 200+ languages!
+  • 16800+ pre-trained models in 200+ languages!
+  • 5900+ pre-trained pipelines in 200+ languages!
  • {% highlight python %} diff --git a/docs/en/concepts.md b/docs/en/concepts.md index 8605840b4c4c4d..f2e19127611594 100644 --- a/docs/en/concepts.md +++ b/docs/en/concepts.md @@ -66,7 +66,7 @@ $ java -version $ conda create -n sparknlp python=3.7 -y $ conda activate sparknlp # spark-nlp by default is based on pyspark 3.x -$ pip install spark-nlp==5.1.3 pyspark==3.3.1 jupyter +$ pip install spark-nlp==5.1.4 pyspark==3.3.1 jupyter $ jupyter notebook ``` diff --git a/docs/en/examples.md b/docs/en/examples.md index 2007c8b10bd263..319b43664fd3f6 100644 --- a/docs/en/examples.md +++ b/docs/en/examples.md @@ -18,7 +18,7 @@ $ java -version # should be Java 8 (Oracle or OpenJDK) $ conda create -n sparknlp python=3.7 -y $ conda activate sparknlp -$ pip install spark-nlp==5.1.3 pyspark==3.3.1 +$ pip install spark-nlp==5.1.4 pyspark==3.3.1 ```
    @@ -40,7 +40,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi # -p is for pyspark # -s is for spark-nlp # by default they are set to the latest -!bash colab.sh -p 3.2.3 -s 5.1.3 +!bash colab.sh -p 3.2.3 -s 5.1.4 ``` [Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb) is a live demo on Google Colab that performs named entity recognitions and sentiment analysis by using Spark NLP pretrained pipelines. diff --git a/docs/en/hardware_acceleration.md b/docs/en/hardware_acceleration.md index 73e66b581fc690..bbba02def8f91d 100644 --- a/docs/en/hardware_acceleration.md +++ b/docs/en/hardware_acceleration.md @@ -49,7 +49,7 @@ Since the new Transformer models such as BERT for Word and Sentence embeddings a | DeBERTa Large | +477%(5.8x) | | Longformer Base | +52%(1.5x) | -Spark NLP 5.1.3 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support: +Spark NLP 5.1.4 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support: - NVIDIA® GPU drivers version 450.80.02 or higher - CUDA® Toolkit 11.2 diff --git a/docs/en/install.md b/docs/en/install.md index 97a8aea99cac27..7a148496246749 100644 --- a/docs/en/install.md +++ b/docs/en/install.md @@ -17,22 +17,22 @@ sidebar: ```bash # Install Spark NLP from PyPI -pip install spark-nlp==5.1.3 +pip install spark-nlp==5.1.4 # Install Spark NLP from Anacodna/Conda conda install -c johnsnowlabs spark-nlp # Load Spark NLP with Spark Shell -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4 # Load Spark NLP with PySpark -pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4 # Load Spark NLP with Spark Submit -spark-submit --packages 
com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4 # Load Spark NLP as external JAR after compiling and building Spark NLP by `sbt assembly` -spark-shell --jars spark-nlp-assembly-5.1.3.jar +spark-shell --jars spark-nlp-assembly-5.1.4.jar ```
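
The `--packages` values bumped in this hunk all follow one pattern: group `com.johnsnowlabs.nlp`, an artifact name suffixed with the Scala version (`_2.12`), then the release version. The sketch below shows how such a coordinate could be assembled for the CPU, GPU, AArch64 and Apple Silicon artifacts named elsewhere in this diff. It is only an illustration: `maven_coordinate`, `packages_command` and the `ARTIFACTS` table are hypothetical names introduced here, not part of Spark NLP.

```python
# Hypothetical helper for illustration only; none of these names exist in Spark NLP.
GROUP_ID = "com.johnsnowlabs.nlp"
SCALA_VERSION = "2.12"

# Artifact names as they appear in the README cheatsheet touched by this diff.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "aarch64": "spark-nlp-aarch64",
    "silicon": "spark-nlp-silicon",
}


def maven_coordinate(version: str, target: str = "cpu") -> str:
    """Return 'group:artifact_scala:version' for the given hardware target."""
    if target not in ARTIFACTS:
        raise ValueError(
            f"unknown target {target!r}; expected one of {sorted(ARTIFACTS)}"
        )
    return f"{GROUP_ID}:{ARTIFACTS[target]}_{SCALA_VERSION}:{version}"


def packages_command(tool: str, version: str, target: str = "cpu") -> str:
    """Build a '<tool> --packages <coordinate>' command line as a string."""
    return f"{tool} --packages {maven_coordinate(version, target)}"


if __name__ == "__main__":
    # Print the three launch commands shown in the install instructions.
    for tool in ("spark-shell", "pyspark", "spark-submit"):
        print(packages_command(tool, "5.1.4"))
```

Keeping the version in a single argument, as assumed here, would make a release bump like this one a one-line change instead of a search-and-replace across many files.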
    @@ -55,7 +55,7 @@ $ java -version # should be Java 8 (Oracle or OpenJDK) $ conda create -n sparknlp python=3.8 -y $ conda activate sparknlp -$ pip install spark-nlp==5.1.3 pyspark==3.3.1 +$ pip install spark-nlp==5.1.4 pyspark==3.3.1 ``` Of course you will need to have jupyter installed in your system: @@ -83,7 +83,7 @@ spark = SparkSession.builder \ .config("spark.driver.memory","16G")\ .config("spark.driver.maxResultSize", "0") \ .config("spark.kryoserializer.buffer.max", "2000M")\ - .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3")\ + .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4")\ .getOrCreate() ``` @@ -100,7 +100,7 @@ spark = SparkSession.builder \ com.johnsnowlabs.nlp spark-nlp_2.12 - 5.1.3 + 5.1.4 ``` @@ -111,7 +111,7 @@ spark = SparkSession.builder \ com.johnsnowlabs.nlp spark-nlp-gpu_2.12 - 5.1.3 + 5.1.4 ``` @@ -122,7 +122,7 @@ spark = SparkSession.builder \ com.johnsnowlabs.nlp spark-nlp-silicon_2.12 - 5.1.3 + 5.1.4 ``` @@ -133,7 +133,7 @@ spark = SparkSession.builder \ com.johnsnowlabs.nlp spark-nlp-aarch64_2.12 - 5.1.3 + 5.1.4 ``` @@ -145,28 +145,28 @@ spark = SparkSession.builder \ ```scala // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.1.3" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.1.4" ``` **spark-nlp-gpu:** ```scala // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.1.3" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.1.4" ``` **spark-nlp-silicon:** ```scala // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.1.3" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.1.4" ``` **spark-nlp-aarch64:** ```scala // 
https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64 -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.1.3" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.1.4" ``` Maven Central: [https://mvnrepository.com/artifact/com.johnsnowlabs.nlp](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp) @@ -248,7 +248,7 @@ maven coordinates like these: com.johnsnowlabs.nlp spark-nlp-silicon_2.12 - 5.1.3 + 5.1.4 ``` @@ -256,7 +256,7 @@ or in case of sbt: ```scala // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.1.3" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.1.4" ``` If everything went well, you can now start Spark NLP with the `m1` flag set to `true`: @@ -293,7 +293,7 @@ spark = sparknlp.start(apple_silicon=True) ## Installation for Linux Aarch64 Systems -Starting from version 5.1.3, Spark NLP supports Linux systems running on an aarch64 +Starting from version 5.1.4, Spark NLP supports Linux systems running on an aarch64 processor architecture. The necessary dependencies have been built on Ubuntu 16.04, so a recent system with an environment of at least that will be needed. @@ -341,7 +341,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi # -p is for pyspark # -s is for spark-nlp # by default they are set to the latest -!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.3 +!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.4 ``` [Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb) is a live demo on Google Colab that performs named entity recognitions and sentiment analysis by using Spark NLP pretrained pipelines. 
@@ -363,7 +363,7 @@ Run the following code in Kaggle Kernel and start using spark-nlp right away. ## Databricks Support -Spark NLP 5.1.3 has been tested and is compatible with the following runtimes: +Spark NLP 5.1.4 has been tested and is compatible with the following runtimes: **CPU:** @@ -401,6 +401,10 @@ Spark NLP 5.1.3 has been tested and is compatible with the following runtimes: - 13.2 ML - 13.3 - 13.3 ML +- 14.0 +- 14.0 ML +- 14.1 +- 14.1 ML **GPU:** @@ -421,6 +425,8 @@ Spark NLP 5.1.3 has been tested and is compatible with the following runtimes: - 13.1 ML & GPU - 13.2 ML & GPU - 13.3 ML & GPU +- 14.0 ML & GPU +- 14.1 ML & GPU
    @@ -439,7 +445,7 @@ Spark NLP 5.1.3 has been tested and is compatible with the following runtimes: 3.1. Install New -> PyPI -> `spark-nlp` -> Install - 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3` -> Install + 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4` -> Install 4. Now you can attach your notebook to the cluster and use Spark NLP! @@ -459,7 +465,7 @@ Note: You can import these notebooks by using their URLs. ## EMR Support -Spark NLP 5.1.3 has been tested and is compatible with the following EMR releases: +Spark NLP 5.1.4 has been tested and is compatible with the following EMR releases: - emr-6.2.0 - emr-6.3.0 @@ -471,6 +477,10 @@ Spark NLP 5.1.3 has been tested and is compatible with the following EMR release - emr-6.8.0 - emr-6.9.0 - emr-6.10.0 +- emr-6.11.0 +- emr-6.12.0 +- emr-6.13.0 +- emr-6.14.0 Full list of [Amazon EMR 6.x releases](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html) @@ -518,7 +528,7 @@ A sample of your software configuration in JSON on S3 (must be public access): "spark.kryoserializer.buffer.max": "2000M", "spark.serializer": "org.apache.spark.serializer.KryoSerializer", "spark.driver.maxResultSize": "0", - "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3" + "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4" } } ] @@ -528,7 +538,7 @@ A sample of AWS CLI to launch EMR cluster: ```sh aws emr create-cluster \ ---name "Spark NLP 5.1.3" \ +--name "Spark NLP 5.1.4" \ --release-label emr-6.2.0 \ --applications Name=Hadoop Name=Spark Name=Hive \ --instance-type m4.4xlarge \ @@ -793,7 +803,7 @@ We recommend using `conda` to manage your Python environment on Windows. 
Now you can use the downloaded binary by navigating to `%SPARK_HOME%\bin` and running -Either create a conda env for python 3.6, install *pyspark==3.3.1 spark-nlp numpy* and use Jupyter/python console, or in the same conda env you can go to spark bin for *pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3*. +Either create a conda env for python 3.6, install *pyspark==3.3.1 spark-nlp numpy* and use Jupyter/python console, or in the same conda env you can go to spark bin for *pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4*. @@ -821,12 +831,12 @@ spark = SparkSession.builder \ .config("spark.driver.memory","16G")\ .config("spark.driver.maxResultSize", "0") \ .config("spark.kryoserializer.buffer.max", "2000M")\ - .config("spark.jars", "/tmp/spark-nlp-assembly-5.1.3.jar")\ + .config("spark.jars", "/tmp/spark-nlp-assembly-5.1.4.jar")\ .getOrCreate() ``` - You can download provided Fat JARs from each [release notes](https://github.com/JohnSnowLabs/spark-nlp/releases), please pay attention to pick the one that suits your environment depending on the device (CPU/GPU) and Apache Spark version (3.x) -- If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. (i.e., `hdfs:///tmp/spark-nlp-assembly-5.1.3.jar`) +- If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. (i.e., `hdfs:///tmp/spark-nlp-assembly-5.1.4.jar`) Example of using pretrained Models and Pipelines in offline: diff --git a/docs/en/spark_nlp.md b/docs/en/spark_nlp.md index 0358445d5f4af2..41c988ba3348c9 100644 --- a/docs/en/spark_nlp.md +++ b/docs/en/spark_nlp.md @@ -25,7 +25,7 @@ Spark NLP is built on top of **Apache Spark 3.x**. 
For using Spark NLP you need: **GPU (optional):** -Spark NLP 5.1.3 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support: +Spark NLP 5.1.4 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support: - NVIDIA® GPU drivers version 450.80.02 or higher - CUDA® Toolkit 11.2 diff --git a/python/README.md b/python/README.md index fe80bc91c0fcb1..5d0d4e50dd59c0 100644 --- a/python/README.md +++ b/python/README.md @@ -19,7 +19,7 @@ Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. It provides **simple**, **performant** & **accurate** NLP annotations for machine learning pipelines that **scale** easily in a distributed environment. -Spark NLP comes with **21000+** pretrained **pipelines** and **models** in more than **200+** languages. +Spark NLP comes with **22000+** pretrained **pipelines** and **models** in more than **200+** languages. It also offers tasks such as **Tokenization**, **Word Segmentation**, **Part-of-Speech Tagging**, Word and Sentence **Embeddings**, **Named Entity Recognition**, **Dependency Parsing**, **Spell Checking**, **Text Classification**, **Sentiment Analysis**, **Token Classification**, **Machine Translation** (+180 languages), **Summarization**, **Question Answering**, **Table Question Answering**, **Text Generation**, **Image Classification**, **Image to Text (captioning)**, **Automatic Speech Recognition**, **Zero-Shot Learning**, and many more [NLP tasks](#features). 
**Spark NLP** is the only open-source NLP library in **production** that offers state-of-the-art transformers such as **BERT**, **CamemBERT**, **ALBERT**, **ELECTRA**, **XLNet**, **DistilBERT**, **RoBERTa**, **DeBERTa**, **XLM-RoBERTa**, **Longformer**, **ELMO**, **Universal Sentence Encoder**, **Facebook BART**, **Instructor**, **E5**, **Google T5**, **MarianMT**, **OpenAI GPT2**, and **Vision Transformers (ViT)** not only to **Python** and **R**, but also to **JVM** ecosystem (**Java**, **Scala**, and **Kotlin**) at **scale** by extending **Apache Spark** natively. @@ -80,6 +80,7 @@ documentation and examples - Stop Words Removal - Token Normalizer - Document Normalizer +- Document & Text Splitter - Stemmer - Lemmatizer - NGrams @@ -157,8 +158,8 @@ documentation and examples - Easy ONNX and TensorFlow integrations - GPU Support - Full integration with Spark ML functions -- +15000 pre-trained models in +200 languages! -- +5800 pre-trained pipelines in +200 languages! +- +16800 pre-trained models in +200 languages! +- +5900 pre-trained pipelines in +200 languages! - Multi-lingual NER models: Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hebrew, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Urdu, and more. @@ -167,11 +168,11 @@ documentation and examples To use Spark NLP you need the following requirements: - Java 8 and 11 -- Apache Spark 3.4.x, 3.3.x, 3.2.x, 3.1.x, 3.0.x +- Apache Spark 3.5.x, 3.4.x, 3.3.x, 3.2.x, 3.1.x, 3.0.x **GPU (optional):** -Spark NLP 5.1.3 is built with ONNX 1.15.1 and TensorFlow 2.7.1 deep learning engines. The minimum following NVIDIA® software are only required for GPU support: +Spark NLP 5.1.4 is built with ONNX 1.15.1 and TensorFlow 2.7.1 deep learning engines. 
The minimum following NVIDIA® software are only required for GPU support: - NVIDIA® GPU drivers version 450.80.02 or higher - CUDA® Toolkit 11.2 @@ -187,7 +188,7 @@ $ java -version $ conda create -n sparknlp python=3.7 -y $ conda activate sparknlp # spark-nlp by default is based on pyspark 3.x -$ pip install spark-nlp==5.1.3 pyspark==3.3.1 +$ pip install spark-nlp==5.1.4 pyspark==3.3.1 ``` In Python console or Jupyter `Python3` kernel: @@ -232,23 +233,23 @@ For more examples, you can visit our dedicated [examples](https://github.com/Joh ## Apache Spark Support -Spark NLP *5.1.3* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x - -| Spark NLP | Apache Spark 2.3.x | Apache Spark 2.4.x | Apache Spark 3.0.x | Apache Spark 3.1.x | Apache Spark 3.2.x | Apache Spark 3.3.x | Apache Spark 3.4.x | -|-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------| -| 5.0.x | NO | NO | YES | YES | YES | YES | YES | -| 4.4.x | NO | NO | YES | YES | YES | YES | YES | -| 4.3.x | NO | NO | YES | YES | YES | YES | NO | -| 4.2.x | NO | NO | YES | YES | YES | YES | NO | -| 4.1.x | NO | NO | YES | YES | YES | YES | NO | -| 4.0.x | NO | NO | YES | YES | YES | YES | NO | -| 3.4.x | YES | YES | YES | YES | Partially | N/A | NO | -| 3.3.x | YES | YES | YES | YES | NO | NO | NO | -| 3.2.x | YES | YES | YES | YES | NO | NO | NO | -| 3.1.x | YES | YES | YES | YES | NO | NO | NO | -| 3.0.x | YES | YES | YES | YES | NO | NO | NO | -| 2.7.x | YES | YES | NO | NO | NO | NO | NO | - +Spark NLP *5.1.4* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x + +| Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x | 
+|-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------| +| 5.1.x | YES | YES | YES | YES | YES | YES | NO | NO | +| 5.0.x | YES | YES | YES | YES | YES | YES | NO | NO | +| 4.4.x | YES | YES | YES | YES | YES | YES | NO | NO | +| 4.3.x | NO | NO | YES | YES | YES | YES | NO | NO | +| 4.2.x | NO | NO | YES | YES | YES | YES | NO | NO | +| 4.1.x | NO | NO | YES | YES | YES | YES | NO | NO | +| 4.0.x | NO | NO | YES | YES | YES | YES | NO | NO | +| 3.4.x | NO | NO | N/A | Partially | YES | YES | YES | YES | +| 3.3.x | NO | NO | NO | NO | YES | YES | YES | YES | +| 3.2.x | NO | NO | NO | NO | YES | YES | YES | YES | +| 3.1.x | NO | NO | NO | NO | YES | YES | YES | YES | +| 3.0.x | NO | NO | NO | NO | YES | YES | YES | YES | +| 2.7.x | NO | NO | NO | NO | NO | NO | YES | YES | Find out more about `Spark NLP` versions from our [release notes](https://github.com/JohnSnowLabs/spark-nlp/releases). 
@@ -256,6 +257,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github | Spark NLP | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10| Scala 2.11 | Scala 2.12 | |-----------|------------|------------|------------|------------|------------|------------|------------| +| 5.1.x | NO | YES | YES | YES | YES | NO | YES | | 5.0.x | NO | YES | YES | YES | YES | NO | YES | | 4.4.x | NO | YES | YES | YES | YES | NO | YES | | 4.3.x | YES | YES | YES | YES | YES | NO | YES | @@ -271,7 +273,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github ## Databricks Support -Spark NLP 5.1.3 has been tested and is compatible with the following runtimes: +Spark NLP 5.1.4 has been tested and is compatible with the following runtimes: **CPU:** @@ -309,6 +311,10 @@ Spark NLP 5.1.3 has been tested and is compatible with the following runtimes: - 13.2 ML - 13.3 - 13.3 ML +- 14.0 +- 14.0 ML +- 14.1 +- 14.1 ML **GPU:** @@ -329,10 +335,12 @@ Spark NLP 5.1.3 has been tested and is compatible with the following runtimes: - 13.1 ML & GPU - 13.2 ML & GPU - 13.3 ML & GPU +- 14.0 ML & GPU +- 14.1 ML & GPU ## EMR Support -Spark NLP 5.1.3 has been tested and is compatible with the following EMR releases: +Spark NLP 5.1.4 has been tested and is compatible with the following EMR releases: - emr-6.2.0 - emr-6.3.0 @@ -346,6 +354,8 @@ Spark NLP 5.1.3 has been tested and is compatible with the following EMR release - emr-6.10.0 - emr-6.11.0 - emr-6.12.0 +- emr-6.13.0 +- emr-6.14.0 Full list of [Amazon EMR 6.x releases](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html) @@ -357,10 +367,10 @@ NOTE: The EMR 6.1.0 and 6.1.1 are not supported.
This is a cheatsheet for corresponding Spark NLP Maven package to Apache Spark / PySpark major version: -| Apache Spark | Spark NLP on CPU | Spark NLP on GPU | Spark NLP on AArch64 (linux) | Spark NLP on Apple Silicon | -|---------------------|--------------------|----------------------------|--------------------------------|--------------------------------------| -| 3.0/3.1/3.2/3.3/3.4 | `spark-nlp` | `spark-nlp-gpu` | `spark-nlp-aarch64` | `spark-nlp-silicon` | -| Start Function | `sparknlp.start()` | `sparknlp.start(gpu=True)` | `sparknlp.start(aarch64=True)` | `sparknlp.start(apple_silicon=True)` | +| Apache Spark | Spark NLP on CPU | Spark NLP on GPU | Spark NLP on AArch64 (linux) | Spark NLP on Apple Silicon | +|-------------------------|--------------------|----------------------------|--------------------------------|--------------------------------------| +| 3.0/3.1/3.2/3.3/3.4/3.5 | `spark-nlp` | `spark-nlp-gpu` | `spark-nlp-aarch64` | `spark-nlp-silicon` | +| Start Function | `sparknlp.start()` | `sparknlp.start(gpu=True)` | `sparknlp.start(aarch64=True)` | `sparknlp.start(apple_silicon=True)` | NOTE: `M1/M2` and `AArch64` are under `experimental` support. Access and support to these architectures are limited by the community and we had to build most of the dependencies by ourselves to make them compatible. We support these two @@ -370,18 +380,18 @@ architectures, however, they may not work in some environments. 
### Command line (requires internet connection) -Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x, Apache Spark 3.2.x, Apache Spark 3.3.x, and Apache Spark 3.4.x +Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x, Apache Spark 3.2.x, Apache Spark 3.3.x, Apache Spark 3.4.x, and Apache Spark 3.5.x -#### Apache Spark 3.x (3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x - Scala 2.12) +#### Apache Spark 3.x (3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x - Scala 2.12) ```sh # CPU -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4 -pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4 ``` The `spark-nlp` has been published to @@ -390,11 +400,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s ```sh # GPU -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.3 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.4 -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.3 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.4 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.3 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.4 ``` @@ -404,11 +414,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s ```sh # AArch64 -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.3 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.4 -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.3 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.4 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.3 +spark-submit 
--packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.4 ``` @@ -418,11 +428,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s ```sh # M1/M2 (Apple Silicon) -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.3 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.4 -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.3 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.4 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.3 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.4 ``` @@ -436,7 +446,7 @@ set in your SparkSession: spark-shell \ --driver-memory 16g \ --conf spark.kryoserializer.buffer.max=2000M \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 + --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4 ``` ## Scala @@ -447,14 +457,14 @@ coordinates: ### Maven -**spark-nlp** on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x: +**spark-nlp** on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x: ```xml com.johnsnowlabs.nlp spark-nlp_2.12 - 5.1.3 + 5.1.4 ``` @@ -465,7 +475,7 @@ coordinates: com.johnsnowlabs.nlp spark-nlp-gpu_2.12 - 5.1.3 + 5.1.4 ``` @@ -476,7 +486,7 @@ coordinates: com.johnsnowlabs.nlp spark-nlp-aarch64_2.12 - 5.1.3 + 5.1.4 ``` @@ -487,38 +497,38 @@ coordinates: com.johnsnowlabs.nlp spark-nlp-silicon_2.12 - 5.1.3 + 5.1.4 ``` ### SBT -**spark-nlp** on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x: +**spark-nlp** on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x: ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.1.3" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.1.4" ``` **spark-nlp-gpu:** ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.1.3" 
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.1.4" ``` **spark-nlp-aarch64:** ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64 -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.1.3" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.1.4" ``` **spark-nlp-silicon:** ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.1.3" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.1.4" ``` Maven @@ -540,7 +550,7 @@ If you installed pyspark through pip/conda, you can install `spark-nlp` through Pip: ```bash -pip install spark-nlp==5.1.3 +pip install spark-nlp==5.1.4 ``` Conda: @@ -569,7 +579,7 @@ spark = SparkSession.builder .config("spark.driver.memory", "16G") .config("spark.driver.maxResultSize", "0") .config("spark.kryoserializer.buffer.max", "2000M") - .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3") + .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4") .getOrCreate() ``` @@ -601,19 +611,19 @@ result = pipeline.annotate('The Mona Lisa is a 16th century oil painting created #### spark-nlp -- FAT-JAR for CPU on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x +- FAT-JAR for CPU on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x ```bash sbt assembly ``` -- FAT-JAR for GPU on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x +- FAT-JAR for GPU on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x ```bash sbt -Dis_gpu=true assembly ``` -- FAT-JAR for M! on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x +- FAT-JAR for M1
on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x ```bash sbt -Dis_silicon=true assembly @@ -640,7 +650,7 @@ Use either one of the following options - Add the following Maven Coordinates to the interpreter's library list ```bash -com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 +com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4 ``` - Add a path to pre-built jar from [here](#compiled-jars) in the interpreter's library list making sure the jar is @@ -651,7 +661,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 Apart from the previous step, install the python module through pip ```bash -pip install spark-nlp==5.1.3 +pip install spark-nlp==5.1.4 ``` Or you can install `spark-nlp` from inside Zeppelin by using Conda: @@ -679,7 +689,7 @@ launch the Jupyter from the same Python environment: $ conda create -n sparknlp python=3.8 -y $ conda activate sparknlp # spark-nlp by default is based on pyspark 3.x -$ pip install spark-nlp==5.1.3 pyspark==3.3.1 jupyter +$ pip install spark-nlp==5.1.4 pyspark==3.3.1 jupyter $ jupyter notebook ``` @@ -696,7 +706,7 @@ export PYSPARK_PYTHON=python3 export PYSPARK_DRIVER_PYTHON=jupyter export PYSPARK_DRIVER_PYTHON_OPTS=notebook -pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4 ``` Alternatively, you can mix in using `--jars` option for pyspark + `pip install spark-nlp` @@ -723,7 +733,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi # -s is for spark-nlp # -g will enable upgrading libcudnn8 to 8.1.0 on Google Colab for GPU usage # by default they are set to the latest -!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.3 +!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.4 ``` [Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb) @@ -746,7 +756,7 @@ This script 
comes with the two options to define `pyspark` and `spark-nlp` versi # -s is for spark-nlp # -g will enable upgrading libcudnn8 to 8.1.0 on Kaggle for GPU usage # by default they are set to the latest -!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.3 +!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.4 ``` [Spark NLP quick start on Kaggle Kernel](https://www.kaggle.com/mozzie/spark-nlp-named-entity-recognition) is a live @@ -765,9 +775,9 @@ demo on Kaggle Kernel that performs named entity recognitions by using Spark NLP 3. In `Libraries` tab inside your cluster you need to follow these steps: - 3.1. Install New -> PyPI -> `spark-nlp==5.1.3` -> Install + 3.1. Install New -> PyPI -> `spark-nlp==5.1.4` -> Install - 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3` -> Install + 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4` -> Install 4. Now you can attach your notebook to the cluster and use Spark NLP! 
@@ -818,7 +828,7 @@ A sample of your software configuration in JSON on S3 (must be public access): "spark.kryoserializer.buffer.max": "2000M", "spark.serializer": "org.apache.spark.serializer.KryoSerializer", "spark.driver.maxResultSize": "0", - "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3" + "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4" } }] ``` @@ -827,7 +837,7 @@ A sample of AWS CLI to launch EMR cluster: ```.sh aws emr create-cluster \ ---name "Spark NLP 5.1.3" \ +--name "Spark NLP 5.1.4" \ --release-label emr-6.2.0 \ --applications Name=Hadoop Name=Spark Name=Hive \ --instance-type m4.4xlarge \ @@ -891,7 +901,7 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \ --enable-component-gateway \ --metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \ --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \ - --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 + --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4 ``` 2. On an existing one, you need to install spark-nlp and spark-nlp-display packages from PyPI. 
@@ -930,7 +940,7 @@ spark = SparkSession.builder .config("spark.kryoserializer.buffer.max", "2000m") .config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained") .config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage") - .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3") + .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4") .getOrCreate() ``` @@ -944,7 +954,7 @@ spark-shell \ --conf spark.kryoserializer.buffer.max=2000M \ --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \ --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 + --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4 ``` **pyspark:** @@ -957,7 +967,7 @@ pyspark \ --conf spark.kryoserializer.buffer.max=2000M \ --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \ --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 + --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4 ``` **Databricks:** @@ -1229,16 +1239,16 @@ spark = SparkSession.builder .config("spark.driver.memory", "16G") .config("spark.driver.maxResultSize", "0") .config("spark.kryoserializer.buffer.max", "2000M") - .config("spark.jars", "/tmp/spark-nlp-assembly-5.1.3.jar") + .config("spark.jars", "/tmp/spark-nlp-assembly-5.1.4.jar") .getOrCreate() ``` - You can download provided Fat JARs from each [release notes](https://github.com/JohnSnowLabs/spark-nlp/releases), please pay attention to pick the one that suits your environment depending on the device (CPU/GPU) and Apache Spark - version (3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x) + version (3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x) - If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need to put the Fat JAR on a distributed FileSystem such as HDFS, 
DBFS, S3, etc. ( - i.e., `hdfs:///tmp/spark-nlp-assembly-5.1.3.jar`) + i.e., `hdfs:///tmp/spark-nlp-assembly-5.1.4.jar`) Example of using pretrained Models and Pipelines in offline: diff --git a/python/docs/conf.py b/python/docs/conf.py index 60259e13b84167..2e9ecc6385bd75 100644 --- a/python/docs/conf.py +++ b/python/docs/conf.py @@ -23,7 +23,7 @@ author = "John Snow Labs" # The full version, including alpha/beta/rc tags -release = "5.1.3" +release = "5.1.4" pyspark_version = "3.2.3" # -- General configuration --------------------------------------------------- diff --git a/python/setup.py b/python/setup.py index 75874a87e6df69..38d28d3339fb82 100644 --- a/python/setup.py +++ b/python/setup.py @@ -41,7 +41,7 @@ # project code, see # https://packaging.python.org/en/latest/single_source_version.html - version='5.1.3', # Required + version='5.1.4', # Required # This is a one-line description or tagline of what your project does. This # corresponds to the 'Summary' metadata field: diff --git a/python/sparknlp/__init__.py b/python/sparknlp/__init__.py index 0de1c23009bca8..6dafd53a57e17f 100644 --- a/python/sparknlp/__init__.py +++ b/python/sparknlp/__init__.py @@ -128,7 +128,7 @@ def start(gpu=False, The initiated Spark session. """ - current_version = "5.1.3" + current_version = "5.1.4" if params is None: params = {} @@ -309,4 +309,4 @@ def version(): str The current Spark NLP version. 
""" - return '5.1.3' + return '5.1.4' diff --git a/scripts/colab_setup.sh b/scripts/colab_setup.sh index ef3b7525bbf752..66fc63fae4fddb 100644 --- a/scripts/colab_setup.sh +++ b/scripts/colab_setup.sh @@ -1,7 +1,7 @@ #!/bin/bash #default values for pyspark, spark-nlp, and SPARK_HOME -SPARKNLP="5.1.3" +SPARKNLP="5.1.4" PYSPARK="3.2.3" while getopts s:p:g option diff --git a/scripts/kaggle_setup.sh b/scripts/kaggle_setup.sh index f09b7f7cd16132..cf07f133fd051d 100644 --- a/scripts/kaggle_setup.sh +++ b/scripts/kaggle_setup.sh @@ -1,7 +1,7 @@ #!/bin/bash #default values for pyspark, spark-nlp, and SPARK_HOME -SPARKNLP="5.1.3" +SPARKNLP="5.1.4" PYSPARK="3.2.3" while getopts s:p:g option diff --git a/scripts/sagemaker_setup.sh b/scripts/sagemaker_setup.sh index dc5e6114357233..ab85b329fc5256 100644 --- a/scripts/sagemaker_setup.sh +++ b/scripts/sagemaker_setup.sh @@ -1,7 +1,7 @@ #!/bin/bash # Default values for pyspark, spark-nlp, and SPARK_HOME -SPARKNLP="5.1.3" +SPARKNLP="5.1.4" PYSPARK="3.2.3" echo "Setup SageMaker for PySpark $PYSPARK and Spark NLP $SPARKNLP"