# Documentation

This directory contains the documentation for Model2Vec, formatted in Markdown. It is organized as follows:
- [usage.md](https://github.com/MinishLab/model2vec/blob/main/docs/usage.md): a technical overview of how to use Model2Vec.
- [integrations.md](https://github.com/MinishLab/model2vec/blob/main/docs/integrations.md): examples of how to use Model2Vec in various downstream libraries.
- [what_is_model2vec.md](https://github.com/MinishLab/model2vec/blob/main/docs/what_is_model2vec.md): a high-level overview of how Model2Vec works.

# Integrations

Model2Vec can be used in a variety of downstream libraries. This document provides examples of how to use Model2Vec in some of these libraries.

## Table of Contents
- [Sentence Transformers](#sentence-transformers)
- [LangChain](#langchain)
- [Txtai](#txtai)
- [Chonkie](#chonkie)
- [Transformers.js](#transformersjs)

## Sentence Transformers

Model2Vec can be used directly in [Sentence Transformers](https://github.com/UKPLab/sentence-transformers). The following code snippet shows how to load a Model2Vec model into a Sentence Transformer model:

```python
from sentence_transformers import SentenceTransformer

# Load a Model2Vec model from the Hub
model = SentenceTransformer("minishlab/potion-base-8M")

# Make embeddings
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])
```

The following code snippet shows how to distill a model directly into a Sentence Transformer model:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

# Distill a static embedding model and wrap it in a Sentence Transformer
static_embedding = StaticEmbedding.from_distillation("BAAI/bge-base-en-v1.5", device="cpu", pca_dims=256)
model = SentenceTransformer(modules=[static_embedding])
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])
```
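
The resulting embeddings can be compared like any other Sentence Transformer output; for example, a quick cosine-similarity check using `sentence_transformers.util` (a minimal sketch, not part of the original snippet):

```python
from sentence_transformers import util

# Cosine similarity between the two example sentence embeddings
print(util.cos_sim(embeddings[0], embeddings[1]))
```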

For more documentation, please refer to the [Sentence Transformers documentation](https://sbert.net/docs/package_reference/sentence_transformer/models.html#sentence_transformers.models.StaticEmbedding).

## LangChain

Model2Vec can be used in [LangChain](https://github.com/langchain-ai/langchain) via the `langchain-community` package. For more information, see the [LangChain Model2Vec docs](https://python.langchain.com/docs/integrations/text_embedding/model2vec/). The following code snippet shows how to use Model2Vec in LangChain after installing the package with `pip install langchain-community`:

```python
from langchain_community.embeddings import Model2vecEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.schema import Document

# Initialize a Model2Vec embedder
embedder = Model2vecEmbeddings("minishlab/potion-base-8M")

# Create some example texts
texts = [
    "Enduring Stew",
    "Hearty Elixir",
    "Mighty Mushroom Risotto",
    "Spicy Meat Skewer",
    "Fruit Salad",
]

# Embed the texts
embeddings = embedder.embed_documents(texts)

# Or, create a vector store and query it
documents = [Document(page_content=text) for text in texts]
vector_store = FAISS.from_documents(documents, embedder)
query = "Risotto"
query_vector = embedder.embed_query(query)
retrieved_docs = vector_store.similarity_search_by_vector(query_vector, k=1)
```
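
The retrieved documents can then be read back via their `page_content`; a short hedged follow-up (the exact ranking depends on the model):

```python
# Most similar document to the query; likely the risotto text
print(retrieved_docs[0].page_content)
```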

## Txtai

Model2Vec can be used in [txtai](https://github.com/neuml/txtai) for text embeddings, nearest-neighbors search, and any of the other functionalities that txtai offers. The following code snippet shows how to use Model2Vec in txtai after installing the `txtai` package (including the `vectors` dependency) with `pip install txtai[vectors]`:

```python
from txtai import Embeddings

# Load a model2vec model
embeddings = Embeddings(path="minishlab/potion-base-8M", method="model2vec", backend="numpy")

# Create some example texts
texts = ["Enduring Stew", "Hearty Elixir", "Mighty Mushroom Risotto", "Spicy Meat Skewer", "Chilly Fruit Salad"]

# Create embeddings for downstream tasks
vectors = embeddings.batchtransform(texts)

# Or create a nearest-neighbors index and search it
embeddings.index(texts)
result = embeddings.search("Risotto", 1)
```
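
With this minimal setup (no content storage enabled), `search` should return `(id, score)` tuples, where the id corresponds to the position of the text in the indexed list; a small sketch under that assumption:

```python
# Unpack the best hit; ids are list positions when indexing a plain list of strings
uid, score = result[0]
print(texts[uid], score)
```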

## Chonkie

Model2Vec is the default model for semantic chunking in [Chonkie](https://github.com/bhavnicksm/chonkie). To use Model2Vec for semantic chunking in Chonkie, simply install Chonkie with `pip install chonkie[semantic]` and pass one of the `potion` models as the embedding model. The following code snippet shows how to use Model2Vec in Chonkie's `SDPMChunker`:

```python
from chonkie import SDPMChunker

# Create some example text to chunk
text = "It's dangerous to go alone! Take this."

# Initialize the chunker with a potion model
chunker = SDPMChunker(
    embedding_model="minishlab/potion-base-8M",
    similarity_threshold=0.3
)

# Chunk the text
chunks = chunker.chunk(text)
```
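
The returned chunks can then be inspected or passed downstream; a short sketch, assuming each chunk exposes `text` and `token_count` attributes as in current Chonkie releases:

```python
# Print each chunk and its token count
for chunk in chunks:
    print(chunk.text, chunk.token_count)
```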

## Transformers.js

To use a Model2Vec model in [transformers.js](https://github.com/huggingface/transformers.js), the following code snippet can be used as a starting point:

```javascript
import { AutoModel, AutoTokenizer, Tensor } from '@huggingface/transformers';

const modelName = 'minishlab/potion-base-8M';

const modelConfig = {
    config: { model_type: 'model2vec' },
    dtype: 'fp32',
    revision: 'refs/pr/1'
};
const tokenizerConfig = {
    revision: 'refs/pr/2'
};

const model = await AutoModel.from_pretrained(modelName, modelConfig);
const tokenizer = await AutoTokenizer.from_pretrained(modelName, tokenizerConfig);

const texts = ['hello', 'hello world'];
const { input_ids } = await tokenizer(texts, { add_special_tokens: false, return_tensor: false });

// Compute, for each text, the offset of its first token in the flattened id array
const cumsum = arr => arr.reduce((acc, num, i) => [...acc, num + (acc[i - 1] || 0)], []);
const offsets = [0, ...cumsum(input_ids.slice(0, -1).map(x => x.length))];

const flattened_input_ids = input_ids.flat();
const modelInputs = {
    input_ids: new Tensor('int64', flattened_input_ids, [flattened_input_ids.length]),
    offsets: new Tensor('int64', offsets, [offsets.length])
};

const { embeddings } = await model(modelInputs);
console.log(embeddings.tolist()); // output matches the Python version
```

Note that this requires the Model2Vec model to have a `model.onnx` file and several required tokenizer files. To generate these for a model that does not have them yet, the following code snippet can be used:

```bash
python scripts/export_to_onnx.py --model_path <path-to-a-model2vec-model> --save_path "<path-to-save-the-onnx-model>"
```

# Usage

This document provides an overview of how to use Model2Vec for inference, distillation, training, and evaluation.

## Table of Contents
- [Inference](#inference)
  - [Inference with a pretrained model](#inference-with-a-pretrained-model)
  - [Inference with the Sentence Transformers library](#inference-with-the-sentence-transformers-library)
- [Distillation](#distillation)
  - [Distilling from a Sentence Transformer](#distilling-from-a-sentence-transformer)
  - [Distilling from a loaded model](#distilling-from-a-loaded-model)
  - [Distilling with the Sentence Transformers library](#distilling-with-the-sentence-transformers-library)
  - [Distilling with a custom vocabulary](#distilling-with-a-custom-vocabulary)
- [Training](#training)
  - [Training a classifier](#training-a-classifier)
- [Evaluation](#evaluation)
  - [Installation](#installation)
  - [Evaluation Code](#evaluation-code)

## Inference

### Inference with a pretrained model

Inference works as follows. The example below uses one of our own models, but you can also load a local model or another model from the Hub.

```python
from model2vec import StaticModel

# Load a model from the Hub. You can optionally pass a token when loading a private model
model = StaticModel.from_pretrained(model_name="minishlab/potion-base-8M", token=None)

# Make embeddings
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])

# Make sequences of token embeddings
token_embeddings = model.encode_as_sequence(["It's dangerous to go alone!", "It's a secret to everybody."])
```

### Inference with the Sentence Transformers library

The following code snippet shows how to use a Model2Vec model in the [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) library. This is useful if you want to use the model in a Sentence Transformers pipeline.

```python
from sentence_transformers import SentenceTransformer

# Load a Model2Vec model from the Hub
model = SentenceTransformer("minishlab/potion-base-8M")

# Make embeddings
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])
```

## Distillation

### Distilling from a Sentence Transformer

The following code can be used to distill a model from a Sentence Transformer. As mentioned above, this produces a really small model that might be less performant.

```python
from model2vec.distill import distill

# Distill a Sentence Transformer model
m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", pca_dims=256)

# Save the model
m2v_model.save_pretrained("m2v_model")
```

### Distilling from a loaded model

If you already have a model loaded, or need to load a model in some special way, we also offer an interface to distill models in memory.

```python
from transformers import AutoModel, AutoTokenizer

from model2vec.distill import distill_from_model

# Assuming a loaded model and tokenizer
model_name = "baai/bge-base-en-v1.5"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Distill directly from the in-memory model and tokenizer
m2v_model = distill_from_model(model=model, tokenizer=tokenizer, pca_dims=256)

m2v_model.save_pretrained("m2v_model")
```

### Distilling with the Sentence Transformers library

The following code snippet shows how to distill a model using the [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) library. This is useful if you want to use the model in a Sentence Transformers pipeline.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

static_embedding = StaticEmbedding.from_distillation("BAAI/bge-base-en-v1.5", device="cpu", pca_dims=256)
model = SentenceTransformer(modules=[static_embedding])
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])
```

### Distilling with a custom vocabulary

If you pass a vocabulary, you get a set of static word embeddings, together with a custom tokenizer for exactly that vocabulary. This is comparable to how you would use GloVe or traditional word2vec, but doesn't actually require a corpus or data.

```python
from model2vec.distill import distill

# Load a vocabulary as a list of strings
vocabulary = ["word1", "word2", "word3"]

# Distill a Sentence Transformer model with the custom vocabulary
m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", vocabulary=vocabulary)

# Save the model
m2v_model.save_pretrained("m2v_model")

# Or push it to the hub
m2v_model.push_to_hub("my_organization/my_model", token="<it's a secret to everybody>")
```

By default, this will distill a model with a subword tokenizer, combining the model's (subword) vocabulary with the new vocabulary. If you want a word-level tokenizer instead (with only the passed vocabulary), set the `use_subword` parameter to `False`, e.g.:

```python
m2v_model = distill(model_name=model_name, vocabulary=vocabulary, use_subword=False)
```

**Important note:** we assume the passed vocabulary is sorted by rank frequency, i.e., we don't care about the actual word frequencies, but we do assume that the most frequent word is first and the least frequent word is last. If you're not sure whether this is the case, set `apply_zipf` to `False`. This disables the weighting, but will also make performance a little bit worse.
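
For example, to distill with a vocabulary whose ordering is unknown (a minimal sketch reusing the `model_name` and `vocabulary` variables from above):

```python
# Disable rank-frequency weighting for an unsorted vocabulary
m2v_model = distill(model_name=model_name, vocabulary=vocabulary, apply_zipf=False)
```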

## Training

### Training a classifier

Model2Vec can be used to train a classifier on top of a distilled model, as the following code snippet shows. For more advanced usage, as well as results, please refer to the [training documentation](https://github.com/MinishLab/model2vec/blob/main/model2vec/train/README.md).
```python | ||
import numpy as np | ||
from datasets import load_dataset | ||
from model2vec.train import StaticModelForClassification | ||
|
||
# Initialize a classifier from a pre-trained model | ||
classifer = StaticModelForClassification.from_pretrained("minishlab/potion-base-8M") | ||
|
||
# Load a dataset | ||
ds = load_dataset("setfit/subj") | ||
train = ds["train"] | ||
test = ds["test"] | ||
|
||
X_train, y_train = train["text"], train["label"] | ||
X_test, y_test = test["text"], test["label"] | ||
|
||
# Train the classifier | ||
classifier.fit(X_train, y_train) | ||
|
||
# Evaluate the classifier | ||
y_hat = classifier.predict(X_test) | ||
accuracy = np.mean(np.array(y_hat) == np.array(y_test)) * 100 | ||
``` | ||

## Evaluation

### Installation

Our models can be evaluated using our [evaluation package](https://github.com/MinishLab/evaluation). Install the evaluation package with:

```bash
pip install git+https://github.com/MinishLab/evaluation.git@main
```

### Evaluation Code

The following code snippet shows how to evaluate a Model2Vec model:

```python
from model2vec import StaticModel

from evaluation import CustomMTEB, get_tasks, parse_mteb_results, make_leaderboard, summarize_results
from mteb import ModelMeta

# Get all available tasks
tasks = get_tasks()
# Define the CustomMTEB object with the specified tasks
evaluation = CustomMTEB(tasks=tasks)

# Load the model
model_name = "m2v_model"
model = StaticModel.from_pretrained(model_name)

# Optionally, add model metadata in MTEB format
model.mteb_model_meta = ModelMeta(
    name=model_name, revision="no_revision_available", release_date=None, languages=None
)

# Run the evaluation
results = evaluation.run(model, eval_splits=["test"], output_folder="results")

# Parse the results and summarize them
parsed_results = parse_mteb_results(mteb_results=results, model_name=model_name)
task_scores = summarize_results(parsed_results)

# Print the results in a leaderboard format
print(make_leaderboard(task_scores))
```

# What is Model2Vec?

This document provides a high-level overview of how Model2Vec works.

The base model2vec technique works by passing a vocabulary through a sentence transformer model, then reducing the dimensionality of the resulting embeddings using PCA, and finally weighting the embeddings using SIF weighting (previously zipf weighting). During inference, we simply take the mean of all token embeddings occurring in a sentence.
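
To make the pipeline concrete, here is a minimal numpy/scikit-learn sketch of the steps above. The input arrays, the helper names, and the use of `sklearn.decomposition.PCA` are illustrative assumptions, not the actual Model2Vec implementation; the SIF formula is the one given at the end of this document.

```python
import numpy as np
from sklearn.decomposition import PCA

def distill_static_embeddings(
    teacher_token_embeddings: np.ndarray,  # (vocab_size, hidden_dim), output of a sentence transformer
    token_probabilities: np.ndarray,       # (vocab_size,), corpus probability of each token
    pca_dims: int = 256,
) -> np.ndarray:
    # 1. Reduce dimensionality with PCA
    reduced = PCA(n_components=pca_dims).fit_transform(teacher_token_embeddings)
    # 2. Weight each token embedding with SIF: w = 1e-3 / (1e-3 + proba)
    weights = 1e-3 / (1e-3 + token_probabilities)
    return reduced * weights[:, None]

def encode(token_ids: list[int], static_embeddings: np.ndarray) -> np.ndarray:
    # Inference: the mean of all token embeddings occurring in the sentence
    return static_embeddings[token_ids].mean(axis=0)
```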

Our [potion models](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062) are pre-trained using [tokenlearn](https://github.com/MinishLab/tokenlearn), a technique to pre-train model2vec distillation models. These models are created with the following steps:
- **Distillation**: We distill a Model2Vec model from a Sentence Transformer model, using the method described above.
- **Sentence Transformer inference**: We use the Sentence Transformer model to create mean embeddings for a large number of texts from a corpus.
- **Training**: We train a model to minimize the cosine distance between the mean embeddings generated by the Sentence Transformer model and the mean embeddings generated by the Model2Vec model (see the loss sketch after this list).
- **Post-training re-regularization**: We re-regularize the trained embeddings by first performing PCA, and then weighting the embeddings using `smooth inverse frequency (SIF)` weighting with the following formula: `w = 1e-3 / (1e-3 + proba)`. Here, `proba` is the probability of the token in the corpus we used for training.
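
The training objective in the third step can be written as a simple cosine-distance loss; a hedged PyTorch sketch (the function name and batching are illustrative, not the actual tokenlearn code):

```python
import torch
import torch.nn.functional as F

def cosine_distance_loss(student_mean: torch.Tensor, teacher_mean: torch.Tensor) -> torch.Tensor:
    """Mean of 1 - cos(student, teacher) over a batch of sentence embeddings."""
    cos = F.cosine_similarity(student_mean, teacher_mean, dim=-1)
    return (1.0 - cos).mean()
```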