Bump version and update CHANGELOG [run doc]
maziyarpanahi committed Oct 26, 2023
1 parent c9c0eee commit b856b05
Showing 16 changed files with 248 additions and 195 deletions.
22 changes: 22 additions & 0 deletions CHANGELOG
@@ -1,3 +1,25 @@
========
5.1.4
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing the `DocumentCharacterTextSplitter`, which allows users to split large documents into smaller chunks. `DocumentCharacterTextSplitter` takes an ordered list of separators and splits any subtext that exceeds the chunk length, with an optional overlap between chunks.
* **NEW:** Introducing support for ONNX Runtime in RobertaForSequenceClassification annotator
* **NEW:** Introducing support for ONNX Runtime in RobertaForTokenClassification annotator
* **NEW:** Introducing support for ONNX Runtime in RobertaForQuestionAnswering annotator
* Adding an example of loading a model directly from Azure with the `.load()` method. The example shows how to configure Spark NLP to load models from Azure
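
As a rough illustration of the splitting behaviour described for `DocumentCharacterTextSplitter`, here is a minimal plain-Python sketch. It is not Spark NLP's implementation: the function names (`split_text`, `_split_recursive`), the single-space joiner, and measuring length in characters are all assumptions made for this example.

```python
from typing import List


def _split_recursive(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split `text` on the separators in order until every piece fits `chunk_size`."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every `chunk_size` characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    first, rest = separators[0], separators[1:]
    pieces: List[str] = []
    for part in text.split(first):  # separators are assumed to be non-empty strings
        pieces.extend(_split_recursive(part, rest, chunk_size))
    return pieces


def split_text(text: str, separators: List[str], chunk_size: int, overlap: int = 0) -> List[str]:
    """Greedily merge the split pieces into chunks of at most `chunk_size` characters.

    Hypothetical helper: pieces are re-joined with a single space, and up to
    `overlap` trailing characters (in whole pieces) are repeated in the next chunk.
    """
    joiner = " "
    chunks: List[str] = []
    current: List[str] = []
    for piece in _split_recursive(text, separators, chunk_size):
        if current and len(joiner.join(current + [piece])) > chunk_size:
            chunks.append(joiner.join(current))
            # Carry over trailing pieces that fit inside the overlap budget.
            carried: List[str] = []
            for prev in reversed(current):
                if len(joiner.join([prev] + carried)) > overlap:
                    break
                carried.insert(0, prev)
            # Drop carried pieces if they would push the new chunk over the limit.
            while carried and len(joiner.join(carried + [piece])) > chunk_size:
                carried.pop(0)
            current = carried
        current.append(piece)
    if current:
        chunks.append(joiner.join(current))
    return chunks


if __name__ == "__main__":
    print(split_text("aaa bbb ccc ddd", ["\n\n", " "], chunk_size=7, overlap=3))
```

The separators are tried in the order given, so a coarse separator such as a blank line is preferred over a fine one such as a space. The finer separator is only used on pieces that are still too long.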

----------------
Bug Fixes
----------------
* Fix a bug in the `Whisper` annotator that prevented some models from being imported
* Fix BPE Tokenizer to include a flag whether or not to always prepend a space before words (previous behavior for embeddings)
* Fix BPE Tokenizer to correctly convert and tokenize non-latin and other special characters/words
* Fix `RobertaForQuestionAnswering` to produce the same logits and indexes as the implementation in the Transformers library
* Fix the return order of logits in `BertForQuestionAnswering` and `DistilBertForQuestionAnswering` annotators
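
To show why the return order of logits matters for question answering, the sketch below picks an answer span from start and end logits in plain Python. It is an illustration only and does not reproduce the annotators' internals: `best_answer_span` and `max_answer_len` are hypothetical names, and the logit values are made up.

```python
from typing import List, Tuple


def best_answer_span(
    start_logits: List[float],
    end_logits: List[float],
    max_answer_len: int = 30,
) -> Tuple[int, int]:
    """Return the (start, end) token indexes with the highest combined logit score.

    Only spans with start <= end and at most `max_answer_len` tokens are considered.
    """
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


if __name__ == "__main__":
    start = [0.1, 2.0, 0.3, 0.0]  # made-up values
    end = [0.0, 0.2, 1.5, 0.1]    # made-up values
    print(best_answer_span(start, end))  # correct order
    print(best_answer_span(end, start))  # swapped order selects a different span
```

Passing the two logit vectors in the wrong order still yields a valid-looking span, which is why a swapped return order can go unnoticed until the answers are compared against a reference implementation.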


========
5.1.3
========
162 changes: 86 additions & 76 deletions README.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion build.sbt
@@ -6,7 +6,7 @@ name := getPackageName(is_silicon, is_gpu, is_aarch64)

organization := "com.johnsnowlabs.nlp"

version := "5.1.3"
version := "5.1.4"

(ThisBuild / scalaVersion) := scalaVer

7 changes: 4 additions & 3 deletions docs/_layouts/landing.html
@@ -201,7 +201,7 @@ <h3 class="grey h3_title">{{ _section.title }}</h3>
<div class="highlight-box">
{% highlight bash %}
# Using PyPI
$ pip install spark-nlp==5.1.3
$ pip install spark-nlp==5.1.4

# Using Anaconda/Conda
$ conda install -c johnsnowlabs spark-nlp
@@ -274,6 +274,7 @@ <h4 class="blue h4_title">NLP Features</h4>
<li>Tokenization</li>
<li>Word Segmentation</li>
<li>Stop Words Removal</li>
<li>Document & Text Splitter</li>
<li>Normalizer</li>
<li>Stemmer</li>
<li>Lemmatizer</li>
@@ -339,8 +340,8 @@ <h4 class="blue h4_title">NLP Features</h4>
<li>Easy <strong>ONNX</strong> and <strong>TensorFlow</strong> integrations</li>
<li><strong>GPU</strong> Support</li>
<li>Full integration with <strong>Spark ML</strong> functions</li>
<li><strong>15000+</strong> pre-trained <strong>models </strong> in <strong>200+ languages! </strong>
<li><strong>5800+</strong> pre-trained <strong>pipelines </strong> in <strong>200+ languages! </strong>
<li><strong>16800+</strong> pre-trained <strong>models </strong> in <strong>200+ languages! </strong>
<li><strong>5900+</strong> pre-trained <strong>pipelines </strong> in <strong>200+ languages! </strong>
</ul>
</div>
{% highlight python %}
2 changes: 1 addition & 1 deletion docs/en/concepts.md
@@ -66,7 +66,7 @@ $ java -version
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
$ pip install spark-nlp==5.1.3 pyspark==3.3.1 jupyter
$ pip install spark-nlp==5.1.4 pyspark==3.3.1 jupyter
$ jupyter notebook
```

4 changes: 2 additions & 2 deletions docs/en/examples.md
@@ -18,7 +18,7 @@ $ java -version
# should be Java 8 (Oracle or OpenJDK)
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
$ pip install spark-nlp==5.1.3 pyspark==3.3.1
$ pip install spark-nlp==5.1.4 pyspark==3.3.1
```

</div><div class="h3-box" markdown="1">
@@ -40,7 +40,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
# -p is for pyspark
# -s is for spark-nlp
# by default they are set to the latest
!bash colab.sh -p 3.2.3 -s 5.1.3
!bash colab.sh -p 3.2.3 -s 5.1.4
```

[Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb) is a live demo on Google Colab that performs named entity recognition and sentiment analysis using Spark NLP pretrained pipelines.
2 changes: 1 addition & 1 deletion docs/en/hardware_acceleration.md
@@ -49,7 +49,7 @@ Since the new Transformer models such as BERT for Word and Sentence embeddings a
| DeBERTa Large | +477%(5.8x) |
| Longformer Base | +52%(1.5x) |

Spark NLP 5.1.3 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support:
Spark NLP 5.1.4 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support:

- NVIDIA® GPU drivers version 450.80.02 or higher
- CUDA® Toolkit 11.2
64 changes: 37 additions & 27 deletions docs/en/install.md
@@ -17,22 +17,22 @@ sidebar:

```bash
# Install Spark NLP from PyPI
pip install spark-nlp==5.1.3
pip install spark-nlp==5.1.4

# Install Spark NLP from Anaconda/Conda
conda install -c johnsnowlabs spark-nlp

# Load Spark NLP with Spark Shell
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4

# Load Spark NLP with PySpark
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4

# Load Spark NLP with Spark Submit
spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3
spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4

# Load Spark NLP as external JAR after compiling and building Spark NLP by `sbt assembly`
spark-shell --jars spark-nlp-assembly-5.1.3.jar
spark-shell --jars spark-nlp-assembly-5.1.4.jar
```

</div><div class="h3-box" markdown="1">
@@ -55,7 +55,7 @@ $ java -version
# should be Java 8 (Oracle or OpenJDK)
$ conda create -n sparknlp python=3.8 -y
$ conda activate sparknlp
$ pip install spark-nlp==5.1.3 pyspark==3.3.1
$ pip install spark-nlp==5.1.4 pyspark==3.3.1
```

Of course you will need to have jupyter installed in your system:
@@ -83,7 +83,7 @@ spark = SparkSession.builder \
.config("spark.driver.memory","16G")\
.config("spark.driver.maxResultSize", "0") \
.config("spark.kryoserializer.buffer.max", "2000M")\
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3")\
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4")\
.getOrCreate()
```

@@ -100,7 +100,7 @@ spark = SparkSession.builder \
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>5.1.3</version>
<version>5.1.4</version>
</dependency>
```

@@ -111,7 +111,7 @@ spark = SparkSession.builder \
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>5.1.3</version>
<version>5.1.4</version>
</dependency>
```

@@ -122,7 +122,7 @@ spark = SparkSession.builder \
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
<version>5.1.3</version>
<version>5.1.4</version>
</dependency>
```

@@ -133,7 +133,7 @@ spark = SparkSession.builder \
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
<version>5.1.3</version>
<version>5.1.4</version>
</dependency>
```

@@ -145,28 +145,28 @@ spark = SparkSession.builder \

```scala
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.1.3"
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.1.4"
```

**spark-nlp-gpu:**

```scala
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.1.3"
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.1.4"
```

**spark-nlp-silicon:**

```scala
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.1.3"
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.1.4"
```

**spark-nlp-aarch64:**

```scala
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.1.3"
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.1.4"
```

Maven Central: [https://mvnrepository.com/artifact/com.johnsnowlabs.nlp](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp)
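
The Maven and sbt entries above differ only in the artifact name, the Scala suffix, and the version. As a small convenience sketch, the coordinate string passed to `--packages` or `spark.jars.packages` can be assembled as follows. The helper `maven_coordinate` and the variant keys are hypothetical names chosen for this example. The group ID, artifact names, and `_2.12` suffix are taken from the snippets above.

```python
# Artifact names as they appear in the dependency snippets above.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "silicon": "spark-nlp-silicon",
    "aarch64": "spark-nlp-aarch64",
}

GROUP_ID = "com.johnsnowlabs.nlp"


def maven_coordinate(version: str, variant: str = "cpu", scala_version: str = "2.12") -> str:
    """Build a `group:artifact_scala:version` coordinate for the chosen variant."""
    if variant not in ARTIFACTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {sorted(ARTIFACTS)}")
    return f"{GROUP_ID}:{ARTIFACTS[variant]}_{scala_version}:{version}"


if __name__ == "__main__":
    print(maven_coordinate("5.1.4"))
    print(maven_coordinate("5.1.4", variant="gpu"))
```

Keeping the version in one place like this is one way to avoid editing the same string across many files on each release.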
@@ -248,15 +248,15 @@ maven coordinates like these:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
<version>5.1.3</version>
<version>5.1.4</version>
</dependency>
```

or in case of sbt:

```scala
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.1.3"
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.1.4"
```

If everything went well, you can now start Spark NLP with the `m1` flag set to `true`:
@@ -293,7 +293,7 @@ spark = sparknlp.start(apple_silicon=True)

## Installation for Linux Aarch64 Systems

Starting from version 5.1.3, Spark NLP supports Linux systems running on an aarch64
Starting from version 5.1.4, Spark NLP supports Linux systems running on an aarch64
processor architecture. The necessary dependencies have been built on Ubuntu 16.04, so a
recent system with an environment of at least that will be needed.

@@ -341,7 +341,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
# -p is for pyspark
# -s is for spark-nlp
# by default they are set to the latest
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.3
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.4
```

[Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb) is a live demo on Google Colab that performs named entity recognition and sentiment analysis using Spark NLP pretrained pipelines.
@@ -363,7 +363,7 @@ Run the following code in Kaggle Kernel and start using spark-nlp right away.

## Databricks Support

Spark NLP 5.1.3 has been tested and is compatible with the following runtimes:
Spark NLP 5.1.4 has been tested and is compatible with the following runtimes:

**CPU:**

@@ -401,6 +401,10 @@ Spark NLP 5.1.3 has been tested and is compatible with the following runtimes:
- 13.2 ML
- 13.3
- 13.3 ML
- 14.0
- 14.0 ML
- 14.1
- 14.1 ML

**GPU:**

@@ -421,6 +425,8 @@ Spark NLP 5.1.3 has been tested and is compatible with the following runtimes:
- 13.1 ML & GPU
- 13.2 ML & GPU
- 13.3 ML & GPU
- 14.0 ML & GPU
- 14.1 ML & GPU

</div><div class="h3-box" markdown="1">

@@ -439,7 +445,7 @@ Spark NLP 5.1.3 has been tested and is compatible with the following runtimes:
3.1. Install New -> PyPI -> `spark-nlp` -> Install
3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3` -> Install
3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4` -> Install
4. Now you can attach your notebook to the cluster and use Spark NLP!
@@ -459,7 +465,7 @@ Note: You can import these notebooks by using their URLs.

## EMR Support

Spark NLP 5.1.3 has been tested and is compatible with the following EMR releases:
Spark NLP 5.1.4 has been tested and is compatible with the following EMR releases:

- emr-6.2.0
- emr-6.3.0
@@ -471,6 +477,10 @@ Spark NLP 5.1.3 has been tested and is compatible with the following EMR release
- emr-6.8.0
- emr-6.9.0
- emr-6.10.0
- emr-6.11.0
- emr-6.12.0
- emr-6.13.0
- emr-6.14.0

Full list of [Amazon EMR 6.x releases](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html)

@@ -518,7 +528,7 @@ A sample of your software configuration in JSON on S3 (must be public access):
"spark.kryoserializer.buffer.max": "2000M",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.driver.maxResultSize": "0",
"spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3"
"spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4"
}
}
]
@@ -528,7 +538,7 @@ A sample of AWS CLI to launch EMR cluster:

```sh
aws emr create-cluster \
--name "Spark NLP 5.1.3" \
--name "Spark NLP 5.1.4" \
--release-label emr-6.2.0 \
--applications Name=Hadoop Name=Spark Name=Hive \
--instance-type m4.4xlarge \
@@ -793,7 +803,7 @@ We recommend using `conda` to manage your Python environment on Windows.
Now you can use the downloaded binary by navigating to `%SPARK_HOME%\bin` and
running
Either create a conda env for python 3.6, install *pyspark==3.3.1 spark-nlp numpy* and use Jupyter/python console, or in the same conda env you can go to spark bin for *pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3*.
Either create a conda env for python 3.6, install *pyspark==3.3.1 spark-nlp numpy* and use Jupyter/python console, or in the same conda env you can go to spark bin for *pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4*.
<img class="image image--xl" src="/assets/images/installation/90126972-c03e5500-dd64-11ea-8285-e4f76aa9e543.jpg" style="width:100%; align:center; box-shadow: 0 3px 6px rgba(0,0,0,0.16), 0 3px 6px rgba(0,0,0,0.23);"/>
@@ -821,12 +831,12 @@ spark = SparkSession.builder \
.config("spark.driver.memory","16G")\
.config("spark.driver.maxResultSize", "0") \
.config("spark.kryoserializer.buffer.max", "2000M")\
.config("spark.jars", "/tmp/spark-nlp-assembly-5.1.3.jar")\
.config("spark.jars", "/tmp/spark-nlp-assembly-5.1.4.jar")\
.getOrCreate()
```
- You can download provided Fat JARs from each [release notes](https://github.com/JohnSnowLabs/spark-nlp/releases), please pay attention to pick the one that suits your environment depending on the device (CPU/GPU) and Apache Spark version (3.x)
- If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. (i.e., `hdfs:///tmp/spark-nlp-assembly-5.1.3.jar`)
- If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. (i.e., `hdfs:///tmp/spark-nlp-assembly-5.1.4.jar`)
Example of using pretrained Models and Pipelines offline:
2 changes: 1 addition & 1 deletion docs/en/spark_nlp.md
@@ -25,7 +25,7 @@ Spark NLP is built on top of **Apache Spark 3.x**. For using Spark NLP you need:

**GPU (optional):**

Spark NLP 5.1.3 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support:
Spark NLP 5.1.4 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support:

- NVIDIA® GPU drivers version 450.80.02 or higher
- CUDA® Toolkit 11.2