EN-DE MS-Marco #1011
Hey everyone, we currently use the msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned in production. Thus, I can confirm that an EN-DE cross-encoder is a must-have. Cross-encoders are much more fine-grained and result in a significant performance boost. We really look forward to production-ready EN-DE information retrieval models (including a v3 bi-encoder 🙂). |
I can confirm that as well. I have searched all over the internet. There were many requests, but I could hardly find a working model. An EN-DE cross-encoder / bi-encoder is a must-have for us as well. |
Same for me. I have been waiting for an EN-DE cross-encoder since issue #695 was opened in January. Also a production-ready EN-DE bi-encoder (v3) would be great!!! |
Same here, could you possibly have a closer look at this @nreimers? |
Multilingual information retrieval with more than 2 languages would be awesome, but DE-EN is still our main priority. Please add a multilingual cross-encoder with at least EN and DE to this repository @nreimers, if it is feasible. I think there are a lot of people who would benefit greatly. |
Sorry for my inactivity on this topic. Will finish the work on sentence-transformers v2 this week. After that, I have some more free capacity. Will start to upload the German training data and finalize the translation code so that more languages can be supported. Will then start training with this code, based on mixed German and English data (and potentially more languages once the translated versions are ready). |
Thanks @nreimers, you're the best. Great commitment from your side! |
Great support as always @nreimers. |
Hi, they also uploaded a translated version of MS MARCO. Great to see that effort :) More to come. |
This is great news @nreimers. Thanks for the info. More and more people become aware of the power of multilingual information retrieval. |
Hey @nreimers, thanks for your fast reply. Are you sure these models support English-German and not just German? |
@janandreschweiger Will release English-German Cross-Encoder models soon. |
Perfect, thank you @nreimers. |
I'm sorry for the confusion, but the "cross" in the name comes from "cross-encoder". |
Hello @nreimers, we have a live demo on Monday, August 2. Do you think that a first version of these new models will be ready by then? If not, I would be happy if you could give us a rough time frame for the bi- and cross-encoders. Thanks a lot. |
Great work @nreimers on multilinguality, thank you. I saw you pushed the new multilingual MiniLM v2 models to the model hub; do you plan to use them for CrossEncoder knowledge distillation? I am also interested in multilingual cross-encoders. I have translated some MS MARCO passages to French using your translation scripts (thanks again), and I have tried distilling with the cross-v2 script into those mMiniLMv2 models, using only the English passages, the French translated passages, or a mix of both. In any case, every time I get a CUDA memory error a few iterations in (around 9000), even though the only line of code that changed from your script is the model_name (from microsoft/MiniLM-L12-H384-uncased to 'nreimers/mMiniLMv2-L12-H384-distilled-from-XLMR-Large', since I wanted to test whether distillation using your method from a monolingual teacher to a multilingual student using only English data yields anything). What GPU did you use for these CrossEncoder-v2 scripts? |
Hi @nero-nazok I just started to train a cross-encoder for EN-DE. First results after 30k steps look quite promising; with some toy examples it gave quite good re-ranking results, but a more careful evaluation will be needed. Will release this model this week (and hopefully a better model further down the line). Had to fix several issues with the Neural Machine Translation generation; it was sadly rather unstable when applied to noisy data such as MS MARCO. @RachelKer The max_length parameter is quite important, as the memory requirement grows quadratically with the input length. So if the input text is twice as long => 4 times more GPU memory is needed. I currently let the models run with a max_length of 350 and a batch size of 16. In that case, about 14GB of GPU memory is needed. |
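For illustration, a rough sketch of training with the memory settings mentioned above (max_length 350, batch size 16), assuming the sentence-transformers CrossEncoder training API; the base model and training pairs are placeholders, not the actual MS MARCO setup:

```python
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# Placeholder (query, passage) pairs with relevance labels.
train_samples = [
    InputExample(texts=["what is a cross encoder", "A cross-encoder scores a query-passage pair jointly."], label=1.0),
    InputExample(texts=["what is a cross encoder", "The weather in Munich is sunny today."], label=0.0),
]

# max_length dominates GPU memory: attention cost grows roughly quadratically
# with the sequence length, so max_length=350 at batch size 16 needs ~14GB here.
model = CrossEncoder('microsoft/MiniLM-L12-H384-uncased', num_labels=1, max_length=350)
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)
model.fit(train_dataloader=train_dataloader, epochs=1, warmup_steps=100)
```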
Thanks for the quick reply @nreimers. I also look forward to the new cross- and bi-encoder models (DE, EN). |
Hi @bmw-friedrich-mayr @nero-nazok @RachelKer For some performance numbers, see: I have not yet trained new bi-encoders. On the test datasets (TREC DL for EN-EN and DE-EN and GermanDPR for DE-DE re-ranking), cross-encoder models perform quite well. We also see quite a nice boost compared to the bi-encoder models. |
Thanks for all of your effort @nreimers. Great work as always! |
Thanks @nreimers for publishing them in time!!! |
Thanks @nreimers! I tried your cross-encoder model today. At first glance, everything worked perfectly. The model understands the meaning of the text in both English and German. However, I have come across a devastating weakness that affects all cross-encoders. We have a classic setup with a bi-encoder and a cross-encoder, as you described it here: https://www.sbert.net/examples/applications/information-retrieval/README.html The problem: Example: Tested models: Since real users usually search with keywords and proper names, this problem unfortunately makes the cross-encoders unusable. Therefore, we currently only have a pure bi-encoder search, which is a shame. Do you have any idea how to solve this problem @nreimers? |
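For context, a minimal sketch of the bi-encoder + cross-encoder (retrieve and re-rank) setup referenced above, assuming a small in-memory corpus; the model names are the EN-DE models discussed in this thread, and the corpus and query are illustrative:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

corpus = ["Prospekt_Tigon_DE.pdf", "Prospekt_Stromerzeuger_DE.pdf", "Annual report of BMW 2020"]

# Stage 1: bi-encoder retrieves candidate documents.
bi_encoder = SentenceTransformer('sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned')
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)

query = "BMW annual report"
hits = util.semantic_search(bi_encoder.encode(query, convert_to_tensor=True), corpus_emb, top_k=3)[0]

# Stage 2: cross-encoder re-ranks the candidates.
cross_encoder = CrossEncoder('cross-encoder/msmarco-MiniLM-L6-en-de-v1', max_length=512)
scores = cross_encoder.predict([(query, corpus[hit['corpus_id']]) for hit in hits])
for hit, score in sorted(zip(hits, scores), key=lambda x: x[1], reverse=True):
    print(score, corpus[hit['corpus_id']])
```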
Same for me @nreimers. This problem makes the models worse than a keyword search. The bi-encoders, on the other hand, work perfectly. Hopefully this can be fixed soon. Thanks in advance! Also many thanks to you @nero-nazok for writing a detailed report. |
Hi @nero-nazok For bi-encoders, cossim(A, A) = 1, so a perfect match will always rank the highest. For cross-encoders, this is not necessarily the case. What might help in your case is a query classifier. See here for a long thread with implementations: Basically, you only run the CrossEncoder when you notice that the query is a question or a more complex query, e.g. you could check if the query contains spaces. Or when only a single hit (or 2-3 hits) contains an exact string match => don't run the CrossEncoder.

When I run the following example:

from sentence_transformers import CrossEncoder

model_name = 'cross-encoder/msmarco-MiniLM-L12-en-de-v1'
model = CrossEncoder(model_name, max_length=512)

query = "Prospekt_Tigon_DE.pdf"
docs = ["Prospekt_Tigon_DE.pdf", "Prospekt_Stromerzeuger_DE.pdf"]

print(model_name)
print(model.predict([(query, doc) for doc in docs]))

model_name = 'cross-encoder/msmarco-MiniLM-L6-en-de-v1'
model = CrossEncoder(model_name, max_length=512)
print(model_name)
print(model.predict([(query, doc) for doc in docs]))

I get as an output:
So both cross-encoders give document 1 (Prospekt_Tigon_DE.pdf) a much higher score than document 2 (Prospekt_Stromerzeuger_DE.pdf). Did you use some other query / docs? Or were these keywords embedded in more text? Would be great if you could post a working (minimal) code example so that I can have a more detailed look. |
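Following up on the query-classifier idea above, a rough sketch of such a gate, assuming the candidate documents are plain strings; the function name and thresholds are illustrative:

```python
def should_rerank(query, candidate_docs):
    """Return True if the CrossEncoder re-ranking step should run."""
    # Keyword / file-name style queries (no spaces): keep the bi-encoder ranking.
    if " " not in query.strip():
        return False
    # If only a handful of candidates contain the query verbatim,
    # the exact matches are probably what the user wants.
    exact_hits = sum(query.lower() in doc.lower() for doc in candidate_docs)
    if 0 < exact_hits <= 3:
        return False
    return True
```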
Hi @nreimers, thanks for your quick reply! The problem was on my side. We use Docker and save the models in the file system. I mistakenly saved the model with the SentenceTransformer class like for the bi-encoder. Saving:
Loading:
|
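For anyone hitting the same issue, a minimal sketch of the save/load pattern, assuming the standard sentence-transformers classes; the model names come from this thread and the paths are illustrative:

```python
from sentence_transformers import CrossEncoder, SentenceTransformer

# Bi-encoder: saved and loaded with the SentenceTransformer class.
bi_encoder = SentenceTransformer('sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned')
bi_encoder.save('/models/bi-encoder')
bi_encoder = SentenceTransformer('/models/bi-encoder')

# Cross-encoder: save and load with the CrossEncoder class, not SentenceTransformer.
cross_encoder = CrossEncoder('cross-encoder/msmarco-MiniLM-L6-en-de-v1', max_length=512)
cross_encoder.save('/models/cross-encoder')
cross_encoder = CrossEncoder('/models/cross-encoder', max_length=512)
```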
Thanks @nero-nazok, I made a similar mistake. The cross-encoder models work perfectly now @nreimers. |
I just want to mention one shortcoming that we encountered. We are developing an enterprise search, so we have long text files on various topics. The problem is that for transformer models we have to divide a text file into its paragraphs. As a result, the context of the document is often lost. Example: Our solution ideas: Maybe you have an idea about this problem @nreimers? |
@nreimers @nero-nazok I can confirm that this is a major drawback of a neural search. It is probably the biggest challenge my team and I are facing right now. A good solution to this problem would fundamentally change the game. |
Hello @nreimers! I did a detailed performance analysis of your 4 DE-EN models. If the search documents were in English, your models were almost always able to find the right result. But I noticed a significant drop in performance with German documents. Later I found your benchmarks, which confirmed this: https://huggingface.co/cross-encoder/msmarco-MiniLM-L6-en-de-v1#performance In particular, the bi-encoders suffer from this issue. GermanDPR DE-DE:
That is a pity because 80% of our documents are German. Will the next version of the bi-encoders overcome this weakness? |
@paologruber I made some tests with about 150 custom queries. Unfortunately, I experience this issue as well. I think @nreimers mentioned that he will work on a new bi-encoder when the cross-encoders are finished. |
Is there already a timeline @bmw-friedrich-mayr? |
@nero-nazok @bmw-friedrich-mayr If available, a good option is to encode your paragraphs like this: title+" "+paragraph where title reflects what the document is about. For Wikipedia, it would be the article title. You could also try to see if there is a "paragraph" title. This allows the model to capture the larger context of the paragraph. ========= I am currently working on more authentic training data for German. Hopefully it leads to an improvement. Will also test training additionally on GermanDPR, which should boost the German-German capability of the model. |
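A minimal sketch of the title + paragraph encoding suggested above, assuming the bi-encoder from this thread; the document title and paragraphs are placeholders:

```python
from sentence_transformers import SentenceTransformer

# Hypothetical document split into paragraphs.
doc_title = "Tigon product brochure"
paragraphs = ["First paragraph text ...", "Second paragraph text ..."]

# Prepend the title so every passage keeps the document-level context.
passages = [doc_title + " " + p for p in paragraphs]

bi_encoder = SentenceTransformer('sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned')
passage_embeddings = bi_encoder.encode(passages, convert_to_tensor=True)
```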
Thanks @nreimers for your answer. Can the new training data be used to close the DE performance gap of both bi- and cross-encoders? Although the EN-EN and DE-EN capabilities are really useful, the German-German performance is the most important one for our business. Unfortunately, there are almost no good German retrieval models. Therefore I am really grateful for your work. Many thanks and good luck in advance with the new models. |
@nero-nazok Not sure, will see if it helps training and by how much. Cross-lingual is usually quite challenging, and sadly there is not much good training data available for other languages. |
Thanks @nreimers, the performance of most German information retrieval models is quite low. We really appreciate your effort to move the ball forward. A score that is comparable to English would be a game changer for us. |
Great @nreimers, hopefully this will improve the overall performance. I have a bi-encoder, cross-encoder setup. Unfortunately, when just searching with a combination of keywords (e.g. "Audi CEO", "BMW revenue"), the model is significantly weaker than a keyword-based search. |
Thanks @nreimers for your work on DE-EN cross-encoders. Like all the others, I also look forward to a new DE-EN bi-encoder. It would be awesome to have German performance similar to English. In addition, we also experience issues when searching through longer paragraphs (~250 words / paragraph) where the keywords are unknown to the model. This unfortunately happens often as we use the model on technical documents. It would be great if the model could still work at least as well as a keyword search when the words are unknown. Do you have any ideas/plans regarding this @nreimers? Maybe there is another dataset that could be applied afterwards. Thank you!! |
Hi @ace-kay-law-neo For production settings, it makes sense to combine semantic search with keyword search, which is also known as hybrid search. Have a look at: You can either run dense and keyword search independently and merge the results (there are different options to do this), or you use one of the hybrid approaches. There is not much search software available that allows you to run hybrid search. I think the only one I'm familiar with is Vespa.ai, which might support such hybrid search. |
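A rough sketch of the first option (running dense and keyword search independently and merging the scores), assuming the rank_bm25 package for the keyword side and the bi-encoder from this thread for the dense side; the corpus, query and fusion weight are illustrative:

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = ["Prospekt_Tigon_DE.pdf", "BMW revenue report 2020", "Audi CEO announcement"]

# Keyword side: BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

# Dense side: bi-encoder embeddings.
bi_encoder = SentenceTransformer('sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned')
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)

query = "BMW revenue"
bm25_scores = bm25.get_scores(query.lower().split())
dense_scores = util.cos_sim(bi_encoder.encode(query, convert_to_tensor=True), corpus_emb)[0]

# Naive fusion: weighted sum of (roughly normalized) scores.
alpha = 0.5
max_bm25 = max(float(bm25_scores.max()), 1e-6)
fused = [alpha * (b / max_bm25) + (1 - alpha) * float(d) for b, d in zip(bm25_scores, dense_scores)]
print(sorted(zip(corpus, fused), key=lambda x: x[1], reverse=True))
```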
Hi @nreimers, my team has tested your EN-DE cross-encoder quite a bit and we want to use it in production. We have a Java environment, so we run the models with an ONNX runtime. We have reimplemented the BertTokenizer in Java and it works like a charm for your bi-encoder. Unfortunately, the cross-encoder tokenizer is different from any tokenizer I have seen so far. For example: for subwording, the cross-encoder tokenizer uses a '_' instead of '#' or '##'. DE-EN Cross Encoder:
Regular Cross Encoder:
As you can see, the two cross-encoders are quite different although they are from the same class PreTrainedTokenizerFast. It would be awesome if you could give us some information about how this tokenizer works, so we can replicate it in Java. Thanks a lot @nreimers!! |
Hi @janandreschweiger Multilingual models usually use a SentencePiece tokenizer: Found this Java version, don't know if it works: |
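To see the difference between the two tokenizers, a small sketch using the Hugging Face tokenizers directly; the model names are taken from this thread, and the exact token pieces produced depend on the vocabulary:

```python
from transformers import AutoTokenizer

sp_tok = AutoTokenizer.from_pretrained('cross-encoder/msmarco-MiniLM-L6-en-de-v1')  # SentencePiece-based
wp_tok = AutoTokenizer.from_pretrained('bert-base-uncased')                         # WordPiece-based

word = "Stromerzeuger"
print(sp_tok.tokenize(word))  # SentencePiece marks word starts with the '▁' character
print(wp_tok.tokenize(word))  # WordPiece prefixes continuation pieces with '##'
```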
Thank you @nreimers. This helps us a lot. But your bi-encoder (sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-trained-scratch) uses a regular BertTokenizer with just a different vocabulary, right? Sample vocabulary of your bi-encoder:
|
@janandreschweiger Yes, the bi-encoder is based on DistilmBERT, which uses word pieces. Word-piece tokenization has several issues, especially for multilingual tokenization. Hence, more recent multilingual transformer models use SentencePiece instead of word-piece tokenization. |
Has anyone here tried the newest multilingual cross-encoder model? It uses a multilingual version of MiniLM and a multilingual version of the MS MARCO dataset. It doesn't appear to be in the SBERT documentation, but I just stumbled upon it while browsing HF. https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 There isn't any benchmark data, but this paper seems to have used a fairly similar process and shows that these multilingual datasets/models provide very competitive results when compared to monolingual ones. https://arxiv.org/pdf/2108.13897.pdf @nreimers I made this same comment in various other issues that I found so that a) more people could learn about this and b) it could all be consolidated in one place for you to close a bunch of issues. Since this seems to be an important innovation and there are surely many other issues that I didn't find/tag, perhaps it would be worth adding this model to the SBERT documentation, and maybe even making some sort of announcement? Edit: It would also be interesting to see how this new dataset, MIRACL, compares: https://github.com/project-miracl/miracl |
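For anyone who wants to try it, a minimal usage sketch of that multilingual cross-encoder; the query/passage pairs are illustrative:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1', max_length=512)
scores = model.predict([
    ("How high is BMW's revenue?", "BMW reported annual revenue of roughly 100 billion euros."),
    ("How high is BMW's revenue?", "The Tigon brochure describes the product lineup."),
])
print(scores)  # higher score = more relevant passage
```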
Hi @nreimers, Hi Sentence-transformers community,
First of all, I want to thank you for your continued support throughout the years. I have been following this repository for three years now and I'm amazed by the progress that is made on a monthly basis 👍.
After digging through several forums I discovered that many people are interested in multilingual information retrieval, especially German-English. We are no exception. We currently use your msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned, which already yields good results. Unfortunately this model seems to be outdated compared to your amazing v3 models and is therefore not mentioned in your docs.
My kind request / question is whether you could finish your work on EN-DE information retrieval (based on MS MARCO), if there is enough demand. I think there are many people who have been waiting in anticipation for your EN-DE cross-encoder and v3 bi-encoder for quite some time. 😅