Pretrained multilingual model for sentence embedding with a Max Sequence Length > 128 #1476
Hi,
is there any pretrained multilingual model (for sentence embedding) with a Max Sequence Length > 128 (e.g. 256 or 512)?
distiluse-base-multilingual-cased-v1, distiluse-base-multilingual-cased-v2, paraphrase-multilingual-MiniLM-L12-v2, and paraphrase-multilingual-mpnet-base-v2 all have a Max Sequence Length of 128 tokens.
Thanks

Comments
@nreimers: You can try to use them with longer inputs, but the quality is unclear, as they have not been trained on longer inputs. Otherwise, we are currently working on multilingual models for longer inputs.
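For anyone who wants to try that, sentence-transformers exposes the limit as a mutable max_seq_length attribute. A minimal sketch (the model choice and the raised limit of 256 are arbitrary; quality beyond 128 tokens is untested, per the comment above):

```python
# Minimal sketch: probing one of the listed models beyond its trained
# 128-token limit. Quality past 128 tokens is untested (see above).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
print(model.max_seq_length)  # 128 for this model

# max_seq_length is a plain attribute; it can be raised up to the
# underlying transformer's positional-embedding limit.
model.max_seq_length = 256

emb = model.encode("A paragraph that runs well past 128 tokens ...")
print(emb.shape)  # (768,)
```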
@nreimers Any experiments/results with sequence chunking and then, e.g., averaging the resulting tensor stack?
See: Averaging works if you have a few vectors and they are on the same topic. It doesn't work for many vectors or if they are on different topics.
Thanks, and agreed. Under the strong assumption that a longer sequence (e.g. a news article) has a single "dominant" topic, this might be a proxy for document embedding and similarity search. Just wanted to check before experimenting.
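To make that experiment concrete, here is a rough sketch of the chunk-then-average idea discussed above. The model choice, sentence-based chunking, and the ~100-word cap are my assumptions, not anything prescribed in this thread:

```python
# Sketch: split a long document into pieces that fit the 128-token
# window, embed each piece, and mean-pool the piece embeddings.
# Works best when the document has a single dominant topic (see above).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def embed_long_text(text: str) -> np.ndarray:
    # Naive chunking: greedily pack sentences until a rough word cap.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    chunks, current = [], ""
    for sent in sentences:
        candidate = (current + " " + sent).strip()
        if current and len(candidate.split()) > 100:  # ~128 tokens, roughly
            chunks.append(current)
            current = sent
        else:
            current = candidate
    if current:
        chunks.append(current)
    vectors = model.encode(chunks)  # shape: (n_chunks, dim)
    return vectors.mean(axis=0)     # single "document" vector
```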
@nreimers Can you share some details on your current strategy for training multilingual models to perform better on longer inputs? Most OPUS data is aligned at the sentence level. Would it be valid, for example, to concatenate several aligned sentences from each respective language into longer paragraphs/documents (even if the sentences aren't necessarily from the same document to begin with)? I trained and published a Swedish Sentence BERT. Any advice on how to come by or create training data of sufficient length to ensure vectors are of high quality for longer inputs?
@Lauler I think that is valid, but it would be better if the sentences are consecutive. You could get these, e.g., from the TED 2020 dataset.
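A hedged sketch of that suggestion: grouping consecutive aligned sentence pairs into paragraph-level parallel examples. The pair format and group size are assumptions; loading TED 2020 itself is left out:

```python
# Sketch: turn consecutive aligned sentence pairs (e.g. from TED 2020,
# which preserves talk order) into longer parallel "paragraphs" for
# training. Group size of 4 is an arbitrary assumption.
from typing import List, Tuple

def make_paragraph_pairs(
    pairs: List[Tuple[str, str]],  # consecutive (english, translated) sentences
    group_size: int = 4,
) -> List[Tuple[str, str]]:
    paragraphs = []
    for i in range(0, len(pairs) - group_size + 1, group_size):
        window = pairs[i:i + group_size]
        en = " ".join(src for src, _ in window)
        xx = " ".join(tgt for _, tgt in window)
        paragraphs.append((en, xx))
    return paragraphs

# Usage: feed the concatenated pairs into the usual multilingual
# knowledge-distillation setup in place of single sentence pairs.
```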
(Copying my message from there.) I'm sharing my own 'rolling' SBERT script to avoid clipping the sentences. It's seemingly functional but not very elegant; a class would be better, of course, but I just hope it helps someone:
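The script itself did not survive this copy of the thread. Below is a sketch of the approach as described here and in the follow-up comment (overlapping token windows, max-pooled window embeddings); the window size, stride, and model name are my assumptions:

```python
# Sketch of a "rolling" window over the tokenized input: embed each
# overlapping window and max-pool across windows so no tokens are
# clipped. Not the original script, which is missing from this scrape.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
WINDOW = 128  # tokens per window (the model's trained limit)
STRIDE = 64   # half-window overlap between consecutive windows

def rolling_encode(text: str) -> np.ndarray:
    """Embed overlapping token windows and max-pool across them."""
    tokens = model.tokenizer.tokenize(text)
    if len(tokens) <= WINDOW:
        return model.encode(text)
    pieces = []
    for start in range(0, len(tokens), STRIDE):
        piece = model.tokenizer.convert_tokens_to_string(tokens[start:start + WINDOW])
        pieces.append(piece)
        if start + WINDOW >= len(tokens):
            break
    vectors = model.encode(pieces)  # shape: (n_windows, dim)
    return vectors.max(axis=0)      # max-pooling, as in the fixed script
```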
@thiswillbeyourgithub When I run your chunking code, I get an error.
Hi, very sorry about that. I made a few mistakes in the previous code and hastily pasted it on GitHub. I've fixed the code in my comment above. It's a bit more expensive to run because it gradually moves the window's end and then its beginning, but it works fine. I also used max-pooling instead of a sum. Thanks, and sorry!
@thiswillbeyourgithub thank you! |