
MSMARCO training with SentenceTransformersTrainer instead of deprecated training scripts #3128

sirCamp opened this issue Dec 10, 2024 · 1 comment

sirCamp commented Dec 10, 2024

Hi there, first of all, thanks for your great work!
I'm wondering what the correct approach is to train on MS MARCO in a way similar to train_bi-encoder_margin-mse.py, where the positives and negatives are sampled differently every epoch, but using the SentenceTransformersTrainer instead of the deprecated training method, so that I can use multi-GPU training and a more structured approach.

I'm also wondering what the exact procedure is for using the evaluators together with accelerate or torch.distributed.

Thanks!

@tomaarsen (Collaborator) commented

Hello!
Apologies for the delay, I've been working on a release.

The exact approach from that script is tricky to reproduce, because Sentence Transformers now works with Dataset instances, which makes it harder to fully resample the data every epoch. Instead, you can now train with multiple negatives at a time by creating a column for each negative (see the Loss Overview docs); a small sketch of that format follows below.
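As a minimal sketch of the "one column per negative" format (the data, column names, and base model here are purely illustrative, not from this issue), MultipleNegativesRankingLoss accepts datasets laid out as (anchor, positive, negative_1, ..., negative_n):

from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Illustrative base model
model = SentenceTransformer("distilbert-base-uncased")

# One column for the query, one for the positive, and one column per negative
train_dataset = Dataset.from_dict({
    "query": ["what is python", "who wrote hamlet"],
    "positive": ["Python is a programming language.", "Hamlet was written by Shakespeare."],
    "negative_1": ["A python is a large snake.", "Macbeth is a Shakespeare tragedy."],
    "negative_2": ["Monty Python is a comedy group.", "Hamlet is also a small village."],
})

# The loss treats every extra column after the positive as an additional hard negative
loss = MultipleNegativesRankingLoss(model)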

Alternatively, you can create a Dataset with all triplets, e.g. https://huggingface.co/datasets/sentence-transformers/msmarco-msmarco-distilbert-base-tas-b/viewer/triplet-hard. You can use that one out of the box:

from datasets import load_dataset

train_dataset = load_dataset(
    "sentence-transformers/msmarco-msmarco-distilbert-base-tas-b",
    "triplet-hard",
    split="train",
)
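For completeness, here is a rough sketch of how that triplet dataset could be plugged into the SentenceTransformerTrainer; the base model, hyperparameters, and output directory are illustrative assumptions, not values from this issue. Launching the same script with torchrun or accelerate launch gives multi-GPU data-parallel training.

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Illustrative base model
model = SentenceTransformer("distilbert-base-uncased")

# Triplet dataset from above, with query / positive / negative columns
train_dataset = load_dataset(
    "sentence-transformers/msmarco-msmarco-distilbert-base-tas-b",
    "triplet-hard",
    split="train",
)

# MultipleNegativesRankingLoss works directly on these triplets; MarginMSELoss
# would instead require a dataset that also carries cross-encoder scores as labels
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="output/msmarco-distilbert",  # illustrative
    num_train_epochs=1,
    per_device_train_batch_size=64,
    fp16=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()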

Regarding the evaluator instances: sadly, they simply don't work well on multi-GPU right now. During training they only run on process 0, and if you want to run an evaluator prior to training, you can wrap the call in if trainer.is_local_process_zero(): so it only computes on one of the GPUs, but that won't make it any quicker. A small sketch of that pattern follows below.
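As a loose sketch of that pattern: the evaluator choice, the dev split, and the column names are illustrative assumptions, and trainer / model / train_dataset refer to the training sketch above.

from sentence_transformers.evaluation import TripletEvaluator

# Illustrative dev set: a small slice of the triplet data loaded above
dev_dataset = train_dataset.select(range(1_000))

dev_evaluator = TripletEvaluator(
    anchors=dev_dataset["query"],       # assumed column names
    positives=dev_dataset["positive"],
    negatives=dev_dataset["negative"],
    name="msmarco-dev",
)

# Only evaluate on the main process; the other ranks skip straight to training
if trainer.is_local_process_zero():
    dev_evaluator(model)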

  • Tom Aarsen
