
Asymmetric semantic search with multilingual-MiniLM-L12-v2? #1463

Open
ErfolgreichCharismatisch opened this issue Mar 13, 2022 · 4 comments
ErfolgreichCharismatisch commented Mar 13, 2022

I have a corpus of 144,491 entries of around 2,000 characters each, consisting of phrases in English and German.

Each entry is monolingual.

My goal is to enter a query, such as a question or a set of keywords, and get back the index of the best-fitting entry in the corpus.

I am currently using sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 with

import torch
from sentence_transformers import util

# corpus_embeddings are precomputed with the same embedder
query_embedding = embedder.encode(query, convert_to_tensor=True, batch_size=6).to('cuda')
cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
top_results = torch.topk(cos_scores, k=top_k)

This gives reasonable results, but is there a better approach?

I am asking because this is an asymmetric semantic search, which according to your documentation should use the MSMARCO models, yet those are English-only and https://www.sbert.net/examples/training/ms_marco/multilingual/README.html seems unfinished.

Is the idea to

  1. take any MSMARCO model,
  2. transfer it to German via knowledge distillation as described in https://towardsdatascience.com/a-complete-guide-to-transfer-learning-from-english-to-other-languages-using-sentence-embeddings-8c427f8804a9 (see the sketch after this list), then
  3. detect the language of each corpus entry,
  4. apply SentenceTransformer_german to all German entries and SentenceTransformer_english accordingly, and then
  5. merge the results of both?
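For step 2, the multilingual knowledge-distillation setup in sentence-transformers looks roughly like the following. This is only a minimal sketch: the teacher choice msmarco-distilbert-base-v4, the parallel-sentence file name, the hyperparameters, and the output path are all placeholder assumptions, not a definitive training script.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: English asymmetric-search model; student: multilingual model to be aligned to it (example choices)
teacher_model = SentenceTransformer('msmarco-distilbert-base-v4')
student_model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Parallel EN->DE sentence pairs; the student learns to map both languages
# onto the teacher's English embeddings (MSE between embeddings)
train_data = ParallelSentencesDataset(student_model=student_model, teacher_model=teacher_model)
train_data.load_data('parallel-sentences-en-de.tsv.gz')  # hypothetical parallel corpus file

train_dataloader = DataLoader(train_data, shuffle=True, batch_size=32)
train_loss = losses.MSELoss(model=student_model)

student_model.fit(train_objectives=[(train_dataloader, train_loss)],
                  epochs=1,
                  warmup_steps=1000,
                  output_path='msmarco-minilm-en-de-distilled')  # hypothetical output directory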

Which approach using SBERT do you suggest?

@nreimers
Member

Yes, this is currently a work in progress. We hope to start the training process soon.

For German & English, there are some MSMARCO English-German models on the hub:
https://huggingface.co/sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-trained-scratch
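For reference, a minimal way to use such a model for asymmetric search, assuming corpus (a list of strings) and query are already defined; util.semantic_search is the standard SBERT retrieval helper, and this sketch uses its default cosine-similarity scoring:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-trained-scratch')

corpus_embeddings = model.encode(corpus, convert_to_tensor=True, show_progress_bar=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve the top-5 corpus entries for the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)[0]
for hit in hits:
    print(hit['corpus_id'], hit['score'])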


ErfolgreichCharismatisch commented Mar 13, 2022

Interesting. To avoid double training, I used

teacher_model_name = 'multi-qa-MiniLM-L6-cos-v1'
student_model_name = 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'

and got the 2,000-step result at https://drive.google.com/drive/folders/1--U-RQJscmfiZ7BxCzayLBLImO10HsRc?usp=sharing, which can be reused.
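To reuse a checkpoint like that, point SentenceTransformer at the downloaded folder; a brief sketch, where the local path is hypothetical:

from sentence_transformers import SentenceTransformer

# Hypothetical path to the downloaded 2000-step checkpoint folder
model = SentenceTransformer('./minilm-l12-multilingual-distilled-2000-steps')
corpus_embeddings = model.encode(corpus, convert_to_tensor=True, show_progress_bar=True)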

@nishanthcgit

Eagerly awaiting the results of this training!

@nickchomey

I just made this comment in another, similar issue; it should solve this problem.

Has anyone here tried the newest multilingual cross-encoder model? It uses a multilingual version of MiniLM trained on a multilingual version of the MS MARCO dataset. It doesn't appear to be in the SBERT documentation, but I just stumbled upon it while browsing Hugging Face: https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1

There isn't any benchmark data, but this paper seems to have used a fairly similar process and shows that these multilingual datasets/models provide very competitive results when compared to monolingual datasets. https://arxiv.org/pdf/2108.13897.pdf
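For anyone who wants to try it: a cross-encoder scores (query, passage) pairs directly, so it is typically used to re-rank the top hits from a bi-encoder rather than to search the whole corpus. A minimal sketch, assuming query and a list candidate_passages (e.g. the top bi-encoder hits) are already available:

from sentence_transformers import CrossEncoder

# Multilingual cross-encoder trained on mMARCO
reranker = CrossEncoder('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')

# Score each (query, candidate) pair; higher score = more relevant
pairs = [(query, passage) for passage in candidate_passages]
scores = reranker.predict(pairs)

# Re-rank the candidates by cross-encoder score
reranked = sorted(zip(candidate_passages, scores), key=lambda x: x[1], reverse=True)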
