Add Pooling Average to Broken XXXForSentenceEmbedding annotators (#14328)

* SPARKNLP-1036: Onnx Example notebooks (#14234)

* SPARKNLP-1036: Fix dev python kernel names

* SPARKNLP-1036: Bump transformers version

* SPARKNLP-1036: Fix Colab buttons

* SPARKNLP-1036: Pin onnx version for compatibility

* SPARKNLP-1036: Upgrade Spark version

* SPARKNLP-1036: Minor Fixes

* SPARKNLP-1036: Clean Metadata

* SPARKNLP-1036: Add/Adjust Documentation

- Note for supported Spark Version of Annotators
- added missing Documentation for BGEEmbeddings

* Fixies (#14307)

* adding fix for broken annotators

---------

Co-authored-by: Devin Ha <[email protected]>
Co-authored-by: Lev <[email protected]>
Co-authored-by: Maziyar Panahi <[email protected]>
4 people authored Jun 12, 2024
1 parent 85c90dd commit 1cba7e3
Showing 57 changed files with 30,397 additions and 31,852 deletions.
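
The commit title refers to average pooling for sentence embeddings, but the code changes themselves are not shown in this excerpt. As a rough illustration only, the sketch below shows what masked average pooling usually means: token vectors are averaged while padding positions are ignored. The function name `average_pool` and the toy inputs are assumptions for this example and are not taken from the Spark NLP source.

```python
from typing import List


def average_pool(token_embeddings: List[List[float]],
                 attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    # Keep only the vectors of real (non-padding) tokens.
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    # Component-wise mean over the kept token vectors.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Three token vectors; the last one is padding and is masked out.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = average_pool(tokens, mask)
```

Without the mask, the padding vector would dominate the mean, which is why pooling is normally restricted to real tokens.
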
160 changes: 160 additions & 0 deletions docs/en/annotator_entries/BGEEmbeddings.md
@@ -0,0 +1,160 @@
{%- capture title -%}
BGEEmbeddings
{%- endcapture -%}

{%- capture description -%}
Sentence embeddings using BGE.

BGE, or BAAI General Embeddings, is a model that maps any text to a low-dimensional dense
vector, which can be used for tasks such as retrieval, classification, clustering, or semantic
search.
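
As a rough illustration of the retrieval use case, the sketch below ranks passages by cosine similarity to a query vector. It uses plain Python and toy 3-dimensional vectors in place of real BGE output. The helper names `cosine_similarity` and `rank_passages` are assumptions for this example and are not part of the Spark NLP API.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_passages(query: Sequence[float],
                  passages: Sequence[Sequence[float]]) -> List[int]:
    """Return passage indices ordered from most to least similar."""
    scores = [cosine_similarity(query, p) for p in passages]
    return sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for real embeddings.
query_vec = [1.0, 0.0, 0.0]
passage_vecs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_passages(query_vec, passage_vecs)
```

In a real pipeline the vectors would come from the annotator's output column rather than being written by hand.
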

Note that this annotator is only supported for Spark Versions 3.4 and up.
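
One way to guard against an unsupported runtime is to check the version string before building the pipeline. The sketch below is a minimal example under that assumption. The helper `is_supported_spark` and the constant `MIN_SPARK_VERSION` are hypothetical names, not part of Spark NLP.

```python
from typing import Tuple

# Minimum (major, minor) Spark version stated in the note above.
MIN_SPARK_VERSION: Tuple[int, int] = (3, 4)


def is_supported_spark(version: str) -> bool:
    """Compare the major.minor part of a Spark version string to the minimum."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= MIN_SPARK_VERSION


supported = is_supported_spark("3.4.1")
unsupported = is_supported_spark("3.3.2")
```

In a running session, the string to check would typically be `spark.version`.
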

Pretrained models can be loaded with `pretrained` of the companion object:

```scala
val embeddings = BGEEmbeddings.pretrained()
.setInputCols("document")
.setOutputCol("embeddings")
```

The default model is `"bge_base"`, if no name is provided.

For available pretrained models please see the
[Models Hub](https://sparknlp.org/models?q=BGE).

For extended examples of usage, see
[BGEEmbeddingsTestSpec](https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/embeddings/BGEEmbeddingsTestSpec.scala).

**Sources** :

[C-Pack: Packaged Resources To Advance General Chinese Embedding](https://arxiv.org/pdf/2309.07597)

[BGE Github Repository](https://github.com/FlagOpen/FlagEmbedding)

**Paper abstract**

*We introduce C-Pack, a package of resources that significantly advance the field of general
Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive
benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive
text embedding dataset curated from labeled and unlabeled Chinese corpora for training
embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models
outperform all prior Chinese text embeddings on C-MTEB by up to +10% upon the time of the
release. We also integrate and optimize the entire suite of training methods for C-TEM. Along
with our resources on general Chinese embedding, we release our data and models for English
text embeddings. The English models achieve state-of-the-art performance on the MTEB benchmark;
meanwhile, our released English data is 2 times larger than the Chinese data. All these
resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.*
{%- endcapture -%}

{%- capture input_anno -%}
DOCUMENT
{%- endcapture -%}

{%- capture output_anno -%}
SENTENCE_EMBEDDINGS
{%- endcapture -%}

{%- capture python_example -%}
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
embeddings = BGEEmbeddings.pretrained() \
.setInputCols(["document"]) \
.setOutputCol("bge_embeddings")
embeddingsFinisher = EmbeddingsFinisher() \
.setInputCols(["bge_embeddings"]) \
.setOutputCols("finished_embeddings") \
.setOutputAsVector(True)
pipeline = Pipeline().setStages([
documentAssembler,
embeddings,
embeddingsFinisher
])
data = spark.createDataFrame([
["query: how much protein should a female eat"],
["passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. " +
"But, as you can see from this chart, you'll need to increase that if you're expecting or training for a " +
"marathon. Check out the chart below to see how much protein you should be eating each day."],
]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
| result|
+--------------------------------------------------------------------------------+
|[[8.0190285E-4, -0.005974853, -0.072875895, 0.007944068, 0.026059335, -0.0080...|
|[[0.050514214, 0.010061974, -0.04340176, -0.020937217, 0.05170225, 0.01157857...|
+--------------------------------------------------------------------------------+
{%- endcapture -%}

{%- capture scala_example -%}
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.BGEEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val embeddings = BGEEmbeddings.pretrained("bge_base", "en")
.setInputCols("document")
.setOutputCol("bge_embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
.setInputCols("bge_embeddings")
.setOutputCols("finished_embeddings")
.setOutputAsVector(true)

val pipeline = new Pipeline().setStages(Array(
documentAssembler,
embeddings,
embeddingsFinisher
))

val data = Seq("query: how much protein should a female eat",
"passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. " +
"But, as you can see from this chart, you'll need to increase that if you're expecting or training for a " +
"marathon. Check out the chart below to see how much protein you should be eating each day."
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
| result|
+--------------------------------------------------------------------------------+
|[[8.0190285E-4, -0.005974853, -0.072875895, 0.007944068, 0.026059335, -0.0080...|
|[[0.050514214, 0.010061974, -0.04340176, -0.020937217, 0.05170225, 0.01157857...|
+--------------------------------------------------------------------------------+
{%- endcapture -%}

{%- capture api_link -%}
[BGEEmbeddings](/api/com/johnsnowlabs/nlp/embeddings/BGEEmbeddings)
{%- endcapture -%}

{%- capture python_api_link -%}
[BGEEmbeddings](/api/python/reference/autosummary/sparknlp/annotator/embeddings/bge_embeddings/index.html#sparknlp.annotator.embeddings.bge_embeddings.BGEEmbeddings)
{%- endcapture -%}

{%- capture source_link -%}
[BGEEmbeddings](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/nlp/embeddings/BGEEmbeddings.scala)
{%- endcapture -%}

{% include templates/anno_template.md
title=title
description=description
input_anno=input_anno
output_anno=output_anno
python_example=python_example
scala_example=scala_example
api_link=api_link
python_api_link=python_api_link
source_link=source_link
%}
1 change: 1 addition & 0 deletions docs/en/annotators.md
@@ -45,6 +45,7 @@ There are two types of Annotators:
{:.table-model-big}
|Annotator|Description|Version |
|---|---|---|
{% include templates/anno_table_entry.md path="" name="BGEEmbeddings" summary="Sentence embeddings using BGE."%}
{% include templates/anno_table_entry.md path="" name="BigTextMatcher" summary="Annotator to match exact phrases (by token) provided in a file against a Document."%}
{% include templates/anno_table_entry.md path="" name="Chunk2Doc" summary="Converts a `CHUNK` type column back into `DOCUMENT`. Useful when trying to re-tokenize or do further analysis on a `CHUNK` result."%}
{% include templates/anno_table_entry.md path="" name="ChunkEmbeddings" summary="This annotator utilizes WordEmbeddings, BertEmbeddings etc. to generate chunk embeddings from either Chunker, NGramGenerator, or NerConverter outputs."%}
2 changes: 1 addition & 1 deletion docs/en/auxiliary.md
@@ -66,7 +66,7 @@ import com.johnsnowlabs.nlp.Annotation
**Examples:**

Complete usage examples can be seen here:
https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/234-release-candidate/jupyter/annotation/english/spark-nlp-basics/spark-nlp-basics-functions.ipynb
[https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/234-release-candidate/jupyter/annotation/english/spark-nlp-basics/spark-nlp-basics-functions.ipynb](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/234-release-candidate/jupyter/annotation/english/spark-nlp-basics/spark-nlp-basics-functions.ipynb)

<div class="tabs-box tabs-new" markdown="1">

2 changes: 1 addition & 1 deletion docs/en/install.md
@@ -760,7 +760,7 @@ Finally, use **jupyter_notebook_config.json** for the password:
To take full advantage of Spark NLP on Windows (8 or 10), you need to set up and install Apache Spark, Apache Hadoop, Java, and a Python environment correctly by following these instructions: [https://github.com/JohnSnowLabs/spark-nlp/discussions/1022](https://github.com/JohnSnowLabs/spark-nlp/discussions/1022)
</div><div class="h3-box" markdown="1">\
</div><div class="h3-box" markdown="1">
### How to correctly install Spark NLP on Windows
2 changes: 2 additions & 0 deletions docs/en/transformer_entries/E5Embeddings.md
@@ -8,6 +8,8 @@ Sentence embeddings using E5.
E5, an instruction-finetuned text embedding model that can generate text embeddings tailored
to any task (e.g., classification, retrieval, clustering, text evaluation, etc.)

Note that this annotator is only supported for Spark Versions 3.4 and up.

Pretrained models can be loaded with `pretrained` of the companion object:

```scala
2 changes: 2 additions & 0 deletions docs/en/transformer_entries/MPNetEmbeddings.md
@@ -10,6 +10,8 @@ Understanding by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu. MPNet a
pre-training method, named masked and permuted language modeling, to inherit the advantages of
masked language modeling and permuted language modeling for natural language understanding.

Note that this annotator is only supported for Spark Versions 3.4 and up.

Pretrained models can be loaded with `pretrained` of the companion object:

```scala
2 changes: 1 addition & 1 deletion docs/en/transformers.md
@@ -9,7 +9,7 @@ modify_date: "2023-06-18"
use_language_switcher: "Python-Scala-Java"
show_nav: true
sidebar:
nav: sparknlp
nav: sparknlp
---

<script> {% include scripts/transformerUseCaseSwitcher.js %} </script>
4 changes: 2 additions & 2 deletions docs/index.md
@@ -314,7 +314,7 @@ data:

- title:
image:
src: https://upload.wikimedia.org/wikipedia/fr/thumb/8/8e/Centre_national_de_la_recherche_scientifique.svg/2048px-Centre_national_de_la_recherche_scientifique.svg.png
src: https://iscpif.fr/wp-content/uploads/2023/11/Logo-CNRS-ISCPIF.png
url: https://iscpif.fr/
style: "padding: 30px;"
is_row: true
@@ -344,7 +344,7 @@ data:
is_row: true
- title:
image:
src: https://upload.wikimedia.org/wikipedia/commons/thumb/f/f1/Columbia_University_shield.svg/1184px-Columbia_University_shield.svg.png
src: https://miro.medium.com/v2/resize:fit:1024/0*3qIWoFnZgVUtsXB-.png
url: https://www.columbia.edu/
style: "padding: 25px;"
is_row: true