JohnSnowLabs · maziyarpanahi · Dec 15, 2024 · Sep 28, 2024 · Oct 18, 2024 · Oct 18, 2024
diff --git a/CHANGELOG b/CHANGELOG
@@ -1,3 +1,36 @@
+========
+5.5.1
+========
+----------------
+New Features & Enhancements
+----------------
+* `BertForMultipleChoice` Transformer Added. Enhanced BERT’s capabilities to handle multiple-choice tasks such as standardized test questions and survey or quiz automation.
+* Integrated New Tasks and Documentation:
+  * Added support and documentation for the following tasks:
+  * Automatic Speech Recognition
+  * Dependency Parsing
+  * Image Captioning
+  * Image Classification
+  * Landing Page
+  * Question Answering
+  * Summarization
+  * Table Question Answering
+  * Text Classification
+  * Text Generation
+  * Text Preprocessing
+  * Token Classification
+  * Translation
+  * Zero-Shot Classification
+  * Zero-Shot Image Classification
+* `PromptAssembler` Annotator Introduced. Introduced a new annotator that constructs prompts for LLMs using a chat template and a sequence of messages. Accepts an array of tuples with roles (“system”, “user”, “assistant”) and message texts. Utilizes llama.cpp as a backend for template parsing, supporting basic template applications.
+
+----------------
+Bug Fixes
+----------------
+* Resolved Pretrained Model Loading Issue on DBFS Systems.
+* Fixed a bug where pretrained models were not found when running AutoGGUF model pipelines on Databricks due to incorrect path handling of gguf files.
+
+
 ========
 5.5.0
 ========

diff --git a/README.md b/README.md
@@ -63,7 +63,7 @@ $ java -version
 $ conda create -n sparknlp python=3.7 -y
 $ conda activate sparknlp
 # spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==5.5.0 pyspark==3.3.1
+$ pip install spark-nlp==5.5.1 pyspark==3.3.1
 ```
 
 In Python console or Jupyter `Python3` kernel:
@@ -129,7 +129,7 @@ For a quick example of using pipelines and models take a look at our official [d
 
 ### Apache Spark Support
 
-Spark NLP *5.5.0* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
+Spark NLP *5.5.1* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
 
 | Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
 |-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
@@ -157,7 +157,7 @@ Find out more about 4.x `SparkNLP` versions in our official [documentation](http
 
 ### Databricks Support
 
-Spark NLP 5.5.0 has been tested and is compatible with the following runtimes:
+Spark NLP 5.5.1 has been tested and is compatible with the following runtimes:
 
 | **CPU**            | **GPU**            |
 |--------------------|--------------------|
@@ -174,7 +174,7 @@ We are compatible with older runtimes. For a full list check databricks support
 
 ### EMR Support
 
-Spark NLP 5.5.0 has been tested and is compatible with the following EMR releases:
+Spark NLP 5.5.1 has been tested and is compatible with the following EMR releases:
 
 | **EMR Release**    |
 |--------------------|
@@ -205,7 +205,7 @@ deployed to Maven central. To add any of our packages as a dependency in your ap
 from our official documentation.
 
 If you are interested, there is a simple SBT project for Spark NLP to guide you on how to use it in your
-projects [Spark NLP SBT S5.5.0r](https://github.com/maziyarpanahi/spark-nlp-starter)
+projects [Spark NLP SBT S5.5.1r](https://github.com/maziyarpanahi/spark-nlp-starter)
 
 ### Python
 
@@ -250,7 +250,7 @@ In Spark NLP we can define S3 locations to:
 
 Please check [these instructions](https://sparknlp.org/docs/en/install#s3-integration) from our official documentation.
 
-## Document5.5.0
+## Document5.5.1
 
 ### Examples
 
@@ -283,7 +283,7 @@ the Spark NLP library:
     keywords = {Spark, Natural language processing, Deep learning, Tensorflow, Cluster},
     abstract = {Spark NLP is a Natural Language Processing (NLP) library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment. Spark NLP comes with 1100+ pretrained pipelines and models in more than 192+ languages. It supports nearly all the NLP tasks and modules that can be used seamlessly in a cluster. Downloaded more than 2.7 million times and experiencing 9x growth since January 2020, Spark NLP is used by 54% of healthcare organizations as the world’s most widely used NLP library in the enterprise.}
     }
-}5.5.0
+}5.5.1
 ```
 
 ## Community support

diff --git a/build.sbt b/build.sbt
@@ -6,7 +6,7 @@ name := getPackageName(is_silicon, is_gpu, is_aarch64)
 
 organization := "com.johnsnowlabs.nlp"
 
-version := "5.5.0"
+version := "5.5.1"
 
 (ThisBuild / scalaVersion) := scalaVer
 
@@ -156,7 +156,8 @@ lazy val utilDependencies = Seq(
     exclude ("com.fasterxml.jackson.dataformat", "jackson-dataformat-cbor"),
   greex,
   azureIdentity,
-  azureStorage)
+  azureStorage,
+  jsoup)
 
 lazy val typedDependencyParserDependencies = Seq(junit)
 
@@ -185,8 +186,8 @@ val llamaCppDependencies =
     Seq(llamaCppGPU)
   else if (is_silicon.equals("true"))
     Seq(llamaCppSilicon)
-//  else if (is_aarch64.equals("true"))
-//    Seq(openVinoCPU)
+  else if (is_aarch64.equals("true"))
+    Seq(llamaCppAarch64)
   else
     Seq(llamaCppCPU)
 

diff --git a/conda/meta.yaml b/conda/meta.yaml
@@ -1,13 +1,13 @@
 {% set name = "spark-nlp" %}
-{% set version = "5.5.0" %}
+{% set version = "5.5.1" %}
 
 package:
   name: {{ name|lower }}
   version: {{ version }}
 
 source:
   url: https://pypi.io/packages/source/{{ name[0] }}/{{ name }}/spark-nlp-{{ version }}.tar.gz
-  sha256: edc71585f462f548770bd13899686f10d88fa4a4a6e201bc1bf9c7711e398dc0
+  sha256: e8ddaf939a1b0acbe0d7b6d6a67f7fa0c5a73339d9e4563e3c1aba1cf0039409
 
 build:
   noarch: python

diff --git a/docs/_data/navigation.yml b/docs/_data/navigation.yml
@@ -44,6 +44,8 @@ sparknlp:
         url: /docs/en/pipelines
       - title: General Concepts
         url: /docs/en/concepts
+      - title: Tasks
+        url: /docs/en/tasks/landing_page
       - title: Annotators
         url: /docs/en/annotators
       - title: Transformers

diff --git a/docs/_layouts/landing.html b/docs/_layouts/landing.html
@@ -201,7 +201,7 @@ <h3 class="grey h3_title">{{ _section.title }}</h3>
                   <div class="highlight-box">
     {% highlight bash %}
     # Using PyPI
-    $ pip install spark-nlp==5.5.0
+    $ pip install spark-nlp==5.5.1
 
     # Using Anaconda/Conda
     $ conda install -c johnsnowlabs spark-nlp

diff --git a/docs/_posts/Cabir40/2024-10-21-bge_medembed_base_v0_1_en.md b/docs/_posts/Cabir40/2024-10-21-bge_medembed_base_v0_1_en.md
@@ -0,0 +1,101 @@
+---
+layout: model
+title: English bge_medembed_base_v0_1 BGEEmbeddings from abhinand
+author: John Snow Labs
+name: bge_medembed_base_v0_1
+date: 2024-10-21
+tags: [embedding, en, open_source, bge, medical, onnx]
+task: Embeddings
+language: en
+edition: Spark NLP 5.5.0
+spark_version: 3.0
+supported: true
+engine: onnx
+annotator: BGEEmbeddings
+article_header:
+  type: cover
+use_language_switcher: "Python-Scala-Java"
+---
+
+## Description
+
+Pretrained BGEEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
+`bge_medembed_base_v0_1` is a English model originally trained by abhinand
+
+{:.btn-box}
+<button class="button button-orange" disabled>Live Demo</button>
+<button class="button button-orange" disabled>Open in Colab</button>
+[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bge_medembed_base_v0_1_en_5.5.0_3.0_1729515433167.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
+[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bge_medembed_base_v0_1_en_5.5.0_3.0_1729515433167.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
+
+## How to use
+
+
+
+<div class="tabs-box" markdown="1">
+{% include programmingLanguageSelectScalaPythonNLU.html %}
+```python
+
+document_assembler = DocumentAssembler()\
+      .setInputCol("text")\
+      .setOutputCol("document")
+
+embeddings = BGEEmbeddings.pretrained("bge_medembed_base_v0_1","en")\
+      .setInputCols(["document"])\
+      .setOutputCol("embeddings")       
+
+pipeline = Pipeline(
+    stages = [
+        document_assembler, 
+        embeddings
+])
+
+data = spark.createDataFrame([["I love spark-nlp"]]).toDF("text")
+
+result = pipeline.fit(data).transform(data)
+
+```
+```scala
+
+val document_assembler = new DocumentAssembler() 
+    .setInputCol("text") 
+    .setOutputCol("document")
+
+val embeddings = BGEEmbeddings.pretrained("bge_medembed_base_v0_1","en") 
+    .setInputCols(Array("document")) 
+    .setOutputCol("embeddings")
+
+val pipeline = new Pipeline().setStages(Array(document_assembler, embeddings))
+
+val data = Seq("I love spark-nlp").toDS.toDF("text")
+
+val result = pipeline.fit(data).transform(data)
+
+```
+</div>
+
+## Results
+
+```bash
+
++----------------------------------------------------------------------------------------------------+
+|                                                                                       bge_embedding|
++----------------------------------------------------------------------------------------------------+
+|[{sentence_embeddings, 0, 15, I love spark-nlp, {sentence -> 0}, [-0.018065551, -0.032784615, 0.0...|
++----------------------------------------------------------------------------------------------------+
+
+```
+
+{:.model-param}
+## Model Information
+
+{:.table-model}
+|---|---|
+|Model Name:|bge_medembed_base_v0_1|
+|Compatibility:|Spark NLP 5.5.0+|
+|License:|Open Source|
+|Edition:|Official|
+|Input Labels:|[document]|
+|Output Labels:|[bge]|
+|Language:|en|
+|Size:|389.7 MB|
diff --git a/docs/_posts/Cabir40/2024-10-21-bge_medembed_large_v0_1_en.md b/docs/_posts/Cabir40/2024-10-21-bge_medembed_large_v0_1_en.md
@@ -0,0 +1,101 @@
+---
+layout: model
+title: English bge_medembed_large_v0_1 BGEEmbeddings from abhinand
+author: John Snow Labs
+name: bge_medembed_large_v0_1
+date: 2024-10-21
+tags: [embedding, en, open_source, bge, medical, onnx]
+task: Embeddings
+language: en
+edition: Spark NLP 5.5.0
+spark_version: 3.0
+supported: true
+engine: onnx
+annotator: BGEEmbeddings
+article_header:
+  type: cover
+use_language_switcher: "Python-Scala-Java"
+---
+
+## Description
+
+Pretrained BGEEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
+`bge_medembed_large_v0_1` is a English model originally trained by abhinand
+
+{:.btn-box}
+<button class="button button-orange" disabled>Live Demo</button>
+<button class="button button-orange" disabled>Open in Colab</button>
+[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bge_medembed_large_v0_1_en_5.5.0_3.0_1729515260623.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
+[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bge_medembed_large_v0_1_en_5.5.0_3.0_1729515260623.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
+
+## How to use
+
+
+
+<div class="tabs-box" markdown="1">
+{% include programmingLanguageSelectScalaPythonNLU.html %}
+```python
+
+document_assembler = DocumentAssembler()\
+      .setInputCol("text")\
+      .setOutputCol("document")
+
+embeddings = BGEEmbeddings.pretrained("bge_medembed_large_v0_1","en")\
+      .setInputCols(["document"])\
+      .setOutputCol("embeddings")       
+
+pipeline = Pipeline(
+    stages = [
+        document_assembler, 
+        embeddings
+])
+
+data = spark.createDataFrame([["I love spark-nlp"]]).toDF("text")
+
+result = pipeline.fit(data).transform(data)
+
+```
+```scala
+
+val document_assembler = new DocumentAssembler() 
+    .setInputCol("text") 
+    .setOutputCol("document")
+
+val embeddings = BGEEmbeddings.pretrained("bge_medembed_large_v0_1","en") 
+    .setInputCols(Array("document")) 
+    .setOutputCol("embeddings")
+
+val pipeline = new Pipeline().setStages(Array(document_assembler, embeddings))
+
+val data = Seq("I love spark-nlp").toDS.toDF("text")
+
+val result = pipeline.fit(data).transform(data)
+
+```
+</div>
+
+## Results
+
+```bash
+
++----------------------------------------------------------------------------------------------------+
+|                                                                                       bge_embedding|
++----------------------------------------------------------------------------------------------------+
+|[{sentence_embeddings, 0, 15, I love spark-nlp, {sentence -> 0}, [-0.018065551, -0.032784615, 0.0...|
++----------------------------------------------------------------------------------------------------+
+
+```
+
+{:.model-param}
+## Model Information
+
+{:.table-model}
+|---|---|
+|Model Name:|bge_medembed_large_v0_1|
+|Compatibility:|Spark NLP 5.5.0+|
+|License:|Open Source|
+|Edition:|Official|
+|Input Labels:|[document]|
+|Output Labels:|[bge]|
+|Language:|en|
+|Size:|1.2 GB|