In Batch Negatives #3072
How can I create a dataset where, for each query, I have one positive and k negatives?
Can I do it this way? |
Hello! Yes, this is possible. If you have k negatives, then you'll have to use a custom loss function, as there's no non-IBN loss that takes more than triplets. That should be fine, though. Here is an example:
from __future__ import annotations
from collections.abc import Iterable
from typing import Any
import torch
import torch.nn.functional as F
from torch import Tensor, nn
from datasets import load_dataset
from sentence_transformers import (
SentenceTransformer,
SentenceTransformerTrainer,
SentenceTransformerTrainingArguments,
SentenceTransformerModelCardData,
)
from sentence_transformers.SentenceTransformer import SentenceTransformer
from sentence_transformers.evaluation import TripletEvaluator
class KTupleLoss(nn.Module):
    def __init__(
        self, model: SentenceTransformer, scale: float = 20.0
    ) -> None:
        super().__init__()
        self.model = model
        self.scale = scale
        self.cross_entropy_loss = nn.CrossEntropyLoss()

    def forward(self, sentence_features: Iterable[dict[str, Tensor]], labels: Tensor) -> Tensor:
        embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
        # Collect the anchor, positive, and negative embeddings
        anchor_embeddings = embeddings[0]  # [batch_size, embedding_dim]
        positive_embeddings = embeddings[1]  # [batch_size, embedding_dim]
        negative_embeddings = torch.stack(embeddings[2:], dim=1)  # [batch_size, num_negatives, embedding_dim]
        # Normalize them
        anchor_embeddings = torch.nn.functional.normalize(anchor_embeddings, p=2, dim=-1)
        positive_embeddings = torch.nn.functional.normalize(positive_embeddings, p=2, dim=-1)
        negative_embeddings = torch.nn.functional.normalize(negative_embeddings, p=2, dim=-1)
        # Compute the similarity scores, i.e. 1) pairwise cosine similarity between anchor and positive,
        # and 2) pairwise cosine similarity between anchor and negatives
        pos_similarity = (anchor_embeddings * positive_embeddings).sum(1, keepdim=True)  # [batch_size, 1]
        anchor_embeddings_3d = anchor_embeddings.unsqueeze(1)  # [batch_size, 1, embedding_dim]
        neg_similarity = torch.matmul(anchor_embeddings_3d, negative_embeddings.transpose(1, 2)).squeeze(1)  # [batch_size, num_negatives]
        # Concatenate the positive and negative similarity scores so we have 1 + num_negatives similarity scores per anchor
        scores = torch.cat((pos_similarity, neg_similarity), dim=1) * self.scale  # [batch_size, 1 + num_negatives]
        # Set the labels as 0, i.e. the positive sample is always the first one in the scores tensor
        labels = torch.zeros(scores.size(0), dtype=torch.long, device=scores.device)
        return self.cross_entropy_loss(scores, labels)

    def get_config_dict(self) -> dict[str, Any]:
        return {"scale": self.scale}
# 1. Load a model to finetune with 2. (Optional) model card data
model = SentenceTransformer(
"microsoft/mpnet-base",
model_card_data=SentenceTransformerModelCardData(
language="en",
license="apache-2.0",
model_name="MPNet base trained on AllNLI triplets",
)
)
# 3. Load a dataset to finetune on
dataset = load_dataset("sentence-transformers/all-nli", "triplet")
train_dataset = dataset["train"].select(range(100_000))
eval_dataset = dataset["dev"].select(range(1_000))
test_dataset = dataset["test"]
# This is a simple way to turn this into a dataset with k negative samples
def to_k_tuple(sample, k: int = 5):
    return {
        "anchor": sample["anchor"],
        "positive": sample["positive"],
        "negative": sample["negative"],
        **{
            f"negative_{i}": sample["negative"] for i in range(k - 1)
        },
    }
train_dataset = train_dataset.map(to_k_tuple, fn_kwargs={"k": 5})
eval_dataset = eval_dataset.map(to_k_tuple, fn_kwargs={"k": 5})
test_dataset = test_dataset.map(to_k_tuple, fn_kwargs={"k": 5})
# 4. Define a loss function
loss = KTupleLoss(model)
# 5. (Optional) Specify training arguments
run_name = "mpnet-base-all-nli-ktuple"
args = SentenceTransformerTrainingArguments(
# Required parameter:
output_dir=f"models/{run_name}",
# Optional training parameters:
num_train_epochs=1,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
learning_rate=2e-5,
warmup_ratio=0.1,
fp16=True, # Set to False if you get an error that your GPU can't run on FP16
bf16=False, # Set to True if you have a GPU that supports BF16
# Optional tracking/debugging parameters:
eval_strategy="steps",
eval_steps=100,
save_strategy="steps",
save_steps=100,
save_total_limit=2,
logging_steps=100,
run_name=run_name, # Will be used in W&B if `wandb` is installed
)
# 6. (Optional) Create an evaluator & evaluate the base model
dev_evaluator = TripletEvaluator(
anchors=eval_dataset["anchor"],
positives=eval_dataset["positive"],
negatives=eval_dataset["negative"],
name="all-nli-dev",
)
dev_evaluator(model)
# 7. Create a trainer & train
trainer = SentenceTransformerTrainer(
model=model,
args=args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
loss=loss,
evaluator=dev_evaluator,
)
trainer.train()
# (Optional) Evaluate the trained model on the test set
test_evaluator = TripletEvaluator(
anchors=test_dataset["anchor"],
positives=test_dataset["positive"],
negatives=test_dataset["negative"],
name="all-nli-test",
)
test_evaluator(model)
# 8. Save the trained model
model.save_pretrained(f"models/{run_name}/final") The The loss can be optimized I believe (I think you can just take This is akin to the MultipleNegativesRankingLoss, except no in-batch negatives. You can create a dataset_final[t] = Dataset.from_dict({
"query": query,
"positive": positive,
"negative_1": negative_1,
"negative_2": negative_2,
...,
"negative_k": negative_k,
}) Here are my first logs: as you can see the model indeed learns (even though this test script makes k negatives in a bit of a hacky way by just repeating the actual 1 negative)
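For reference, here is a minimal, untested sketch of that possible simplification as a drop-in KTupleLoss.forward: the positive is stacked together with the negatives so a single batched matmul scores all candidates at once. It only illustrates the idea and is not a tested implementation:
def forward(self, sentence_features: Iterable[dict[str, Tensor]], labels: Tensor) -> Tensor:
    embeddings = [self.model(features)["sentence_embedding"] for features in sentence_features]
    # [batch_size, embedding_dim] anchors and [batch_size, 1 + num_negatives, embedding_dim] candidates
    anchors = F.normalize(embeddings[0], p=2, dim=-1)
    candidates = F.normalize(torch.stack(embeddings[1:], dim=1), p=2, dim=-1)
    # Cosine similarity of each anchor against its own positive and negatives: [batch_size, 1 + num_negatives]
    scores = torch.bmm(candidates, anchors.unsqueeze(-1)).squeeze(-1) * self.scale
    # The positive candidate is always at index 0
    labels = torch.zeros(scores.size(0), dtype=torch.long, device=scores.device)
    return self.cross_entropy_loss(scores, labels)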
I do want to say that in-batch negatives often help.
|
Thanks for sharing!
|
Hi,
File ~/.local/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py:623, in SentenceTransformer.encode(self, sentences, prompt_name, prompt, batch_size, show_progress_bar, output_value, precision, convert_to_numpy, convert_to_tensor, device, normalize_embeddings, **kwargs)
File ~/.local/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py:690, in SentenceTransformer.forward(self, input, **kwargs)
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
File ~/.local/lib/python3.10/site-packages/sentence_transformers/models/Transformer.py:393, in Transformer.forward(self, features, **kwargs)
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
File /opt/conda/lib/python3.10/site-packages/peft/peft_model.py:1849, in PeftModelForFeatureExtraction.forward(self, input_ids, attention_mask, inputs_embeds, output_attentions, output_hidden_states, return_dict, **kwargs)
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
File /opt/conda/lib/python3.10/site-packages/peft/tuners/tuners_utils.py:103, in BaseTuner.forward(self, *args, **kwargs)
TypeError: NVEmbedModel.forward() got an unexpected keyword argument 'inputs_embeds'
|
I think this is because PEFT expects models to have the "standard" signature, e.g. https://github.com/huggingface/transformers/blob/f297af55dfc27485189f352cd36b4683de12e0b3/src/transformers/models/qwen2/modeling_qwen2.py#L808-L820
But NV-Embed-v2 does not seem to have this parameter: https://huggingface.co/nvidia/NV-Embed-v2/blob/main/modeling_nvembed.py#L397
I think the only solution is to fix it in the NV-Embed-v2 modeling code itself.
cc @BenjaminBossan as this is related to PEFT with a custom architecture - feel free to correct me if my above hypothesis is wrong.
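As a quick, hypothetical sanity check (not part of the original discussion, and it assumes the remote code loads via AutoModel), you can inspect the forward signature directly; if inputs_embeds is missing, the TypeError above is expected:
import inspect
from transformers import AutoModel

# Downloads NV-Embed-v2's remote modeling code and checks whether its forward accepts `inputs_embeds`
base_model = AutoModel.from_pretrained("nvidia/NV-Embed-v2", trust_remote_code=True)
print("inputs_embeds" in inspect.signature(base_model.forward).parameters)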
|
Doing it this way resolves that issue, but I don't know whether it is the correct way or not: model = SentenceTransformer('NV-Embed-v2', trust_remote_code=True) |
If that path contains an adapter (i.e. it was saved from a PEFT-adapted model), then loading it like
model = SentenceTransformer('nv-embed-v2-ft/checkpoint-150', trust_remote_code=True)
model.max_seq_length = 4096
model.tokenizer.padding_side = "right"
should be equivalent, but I'm not 100% sure.
|
model = SentenceTransformer('nv-embed-v2-ft/checkpoint-150', trust_remote_code=True) |
Okay, that may be a bug in ST, will look into it shortly. |
thanks |
This depends. If you use some PEFT methods like prefix-tuning or p-tuning (all "prompt learning" methods), yes, we need to make some assumptions about the underlying model, like its forward method accepting inputs_embeds.
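For illustration (a hedged contrast, not from the original reply): LoRA wraps existing Linear layers and therefore does not require the base model to accept inputs_embeds, whereas prompt-learning methods prepend virtual token embeddings that PEFT passes via inputs_embeds:
from peft import LoraConfig, PromptTuningConfig, TaskType

# LoRA: injects low-rank updates into existing projection layers; no inputs_embeds needed
lora_config = LoraConfig(task_type=TaskType.FEATURE_EXTRACTION, r=16, lora_alpha=32)

# Prompt tuning (a "prompt learning" method): prepends virtual tokens, so PEFT must be able
# to pass inputs_embeds to the base model's forward
prompt_config = PromptTuningConfig(task_type=TaskType.FEATURE_EXTRACTION, num_virtual_tokens=16)
|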
# See https://huggingface.co/collections/tomaarsen/training-with-prompts-672ce423c85b4d39aed52853 for some already trained models
import logging
import random
from peft import get_peft_config, get_peft_model, LoraConfig, TaskType
import numpy
import torch
from datasets import Dataset, load_dataset
from sentence_transformers import (
SentenceTransformer,
SentenceTransformerModelCardData,
SentenceTransformerTrainer,
SentenceTransformerTrainingArguments,
)
from sentence_transformers.evaluation import NanoBEIREvaluator
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers
logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO)
random.seed(12)
torch.manual_seed(12)
numpy.random.seed(12)
# Each query needs to be accompanied by a corresponding instruction describing the task.
task_name_to_instruct = {"example": "Given a question, retrieve passages that answer the question",}
query_prefix = "Instruct: "+task_name_to_instruct["example"]+"\nQuery: "
# Feel free to adjust these variables:
use_prompts = True
include_prompts_in_pooling = True
# 1. Load a model to finetune with 2. (Optional) model card data
model = SentenceTransformer(
'nvidia/NV-Embed-v2',trust_remote_code=True,
)
model.set_pooling_include_prompt(include_prompts_in_pooling)
model.max_seq_length = 4096 #32768
model.tokenizer.padding_side="right"
# 2. Create a LoRA adapter for the model & add it
peft_config = LoraConfig(
task_type=TaskType.FEATURE_EXTRACTION,
inference_mode=False,
r=16,
lora_alpha=32,
lora_dropout=0.1,
target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)
model.add_adapter(peft_config,adapter_name='adaptor_1')
# 2. (Optional) Define prompts
if use_prompts:
    query_prompt = query_prefix
    corpus_prompt = ""
    prompts = {
        "query": query_prompt,
        "answer": corpus_prompt,
    }
from datasets import load_from_disk
# Load the saved dataset back into a Dataset object
# (replace these with any MS MARCO-style triplet dataset, e.g. from Sentence Transformers)
train_dataset = load_from_disk("train_triplet_ours")
eval_dataset = load_from_disk("test_triplet_ours")
# 4. Define a loss function
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=4)
# 5. (Optional) Specify training arguments
run_name = "nv-embed-v2-nq"
if use_prompts:
    run_name += "-prompts"
    if not include_prompts_in_pooling:
        run_name += "-exclude-pooling-prompts"
args = SentenceTransformerTrainingArguments(
# Required parameter:
output_dir=f"models/{run_name}",
# Optional training parameters:
num_train_epochs=1,
per_device_train_batch_size=256,
per_device_eval_batch_size=256,
learning_rate=2e-5,
warmup_steps=500,
fp16=False, # Set to False if you get an error that your GPU can't run on FP16
bf16=True, # Set to True if you have a GPU that supports BF16
batch_sampler=BatchSamplers.NO_DUPLICATES, # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
# Optional tracking/debugging parameters:
eval_strategy="steps",
eval_steps=500,
save_strategy="steps",
save_steps=50,
save_total_limit=5,
logging_steps=1,
logging_first_step=True,
dataloader_drop_last=True,
run_name=run_name, # Will be used in W&B if `wandb` is installed
seed=12,
prompts=prompts if use_prompts else None,
)
# 6. (Optional) Create an evaluator & evaluate the base model
dev_evaluator = NanoBEIREvaluator(
query_prompts=query_prompt if use_prompts else None,
corpus_prompts=corpus_prompt if use_prompts else None,
)
# dev_evaluator(model)
# 7. Create a trainer & train
trainer = SentenceTransformerTrainer(
model=model,
args=args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
loss=loss,
evaluator=dev_evaluator,
)
if __name__=="__main__":
trainer.train()
# 8. Save the trained model
model.save_pretrained(f"models/{run_name}/final")
# (Optional) Evaluate the trained model on the evaluator after training
dev_evaluator(model)
# 9. (Optional) Push it to the Hugging Face Hub
# model.push_to_hub(run_name)
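After training, a hedged sketch of how the same instruction prefix would be used at inference (reusing query_prefix from above; prompt and normalize_embeddings are standard encode arguments, and the example texts are made up):
queries = ["what is the capital of france?"]
passages = ["Paris is the capital and largest city of France."]
query_embeddings = model.encode(queries, prompt=query_prefix, normalize_embeddings=True)
passage_embeddings = model.encode(passages, normalize_embeddings=True)
print(model.similarity(query_embeddings, passage_embeddings))
|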
@riyajatar37003 I took your code with some small changes (bfloat16, smaller batch size to fit in memory, using this dataset) and it passed for me locally. Where exactly do you get the error? Do you have additional code where you load the model for inference and that's where it fails? |
I tried this code just to load
model = SentenceTransformer('nv-embed-v2-ft/checkpoint-150', trust_remote_code=True)
model.max_seq_length = 4096
model.tokenizer.padding_side="right"
|
If it works, then please share your sentence-transformers and transformers versions.
|
I could successfully run:
model = SentenceTransformer(<path>, trust_remote_code=True, model_kwargs=dict(torch_dtype=torch.bfloat16))
model.max_seq_length = 4096
model.tokenizer.padding_side = "right"
The versions I use:
* PEFT installed from source
* transformers installed from source
* sentence-transformers 3.3.1
* torch 2.5.1
|
Okay, none of the above packages are installed from source on my side.
Let me try again.
Thanks for the quick response!
|
I ran into a similar problem recently where I wanted to avoid in-batch negatives, and I created a solution that was almost identical to the one you shared. The issue is that MNR loss creates [batch_size, embedding_size] tensors, but for the above case it's [batch_size, n_negatives + 2, embedding_size], which quickly blows up. I currently have to train with 1/4 of the MNR batch size. @tomaarsen, any tips for optimization? |
Hmm, I believe MNRL should also use
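As a generic, hedged suggestion for the memory pressure (assuming the usual module layout, where model[0].auto_model is the wrapped Hugging Face transformer): trade compute for activation memory with gradient checkpointing, and keep the effective batch size via gradient accumulation:
# Recompute activations during the backward pass instead of storing them all
model[0].auto_model.gradient_checkpointing_enable()

args = SentenceTransformerTrainingArguments(
    output_dir="models/ktuple-low-mem",  # hypothetical output directory
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size of 16
)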
|
Hi @tomaarsen, so during encoding, which token's representation will be considered as the embedding?
|
Sentence Transformer models consist of a few steps:
text -> tokens -> token embeddings -> text embeddings
In the last transition, i.e. token embeddings -> text embeddings, we do pooling. For example mean pooling (the text embedding is the average of all token embeddings), or CLS embedding (the text embedding is the first token embedding).
Some researchers add a prompt or instruction text in front of their text, like query: or Represent this sentence for searching relevant passages: , and some of those want to exclude the token embeddings of the prompt/instruction from the eventual pooling. If you call model.set_pooling_include_prompt(False), then the prompt will not be included in the pooling.
In my tests (see https://sbert.net/examples/training/prompts/README.html#training-script - Experiments with bert-base-uncased), I got the best performance when keeping include_prompt at the default True.
Details: https://sbert.net/examples/training/prompts/README.html
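To make the pooling step concrete, here is a small self-contained sketch of mean pooling over token embeddings (toy tensors, not tied to any particular model):
import torch

token_embeddings = torch.randn(2, 6, 768)             # [batch_size, seq_len, hidden_dim]
attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0],
                               [1, 1, 1, 0, 0, 0]])   # 1 = real token, 0 = padding
mask = attention_mask.unsqueeze(-1).float()           # [batch_size, seq_len, 1]
# Mean pooling: average only over the non-padding token embeddings
text_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
print(text_embeddings.shape)                          # torch.Size([2, 768])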
|
Thanks, man, for such an explanation.
|
Hi,
Is there any way to disable in-batch negatives during training in Sentence Transformers?
Thanks
@tomaarsen