Pretrained multilingual model for sentence embedding with a Max Sequence Length > 128 #1476

Open
candalfigomoro opened this issue Mar 22, 2022 · 11 comments

Comments

candalfigomoro commented Mar 22, 2022

Hi,

is there any pretrained multilingual model (for sentence embedding) with a Max Sequence Length > 128 (e.g. 256 or 512)?

distiluse-base-multilingual-cased-v1, distiluse-base-multilingual-cased-v2, paraphrase-multilingual-MiniLM-L12-v2, and paraphrase-multilingual-mpnet-base-v2 all have a Max Sequence Length of 128 tokens.

Thanks

@nreimers
Member

You can try to use them with longer inputs, but the quality is unclear, as they have not been trained for them.

Otherwise, we are currently working on multilingual models for longer inputs.

@candalfigomoro
Author

@nreimers
Thanks. Can I use them "off the shelf", or is there something I have to set to avoid truncation to 128 tokens?
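
For reference: sentence-transformers exposes the truncation limit as a writable max_seq_length attribute, so a minimal sketch would be something like the following (quality beyond the trained length is unclear, as noted above, and the hard ceiling is the underlying transformer's position embeddings):

```python
from sentence_transformers import SentenceTransformer

# Example with one of the models listed above.
model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
print(model.max_seq_length)  # -> 128 (the default for this model)

# Raise the truncation limit; 512 is the position-embedding ceiling here.
model.max_seq_length = 512
embedding = model.encode("A long multilingual document ...")
```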


zidsi commented Mar 22, 2022

@nreimers any experiments/results with sequence chunking and e.g. averaging the resulting tensor stack?

@nreimers
Member

See:
https://www.sbert.net/examples/applications/computing-embeddings/README.html#input-sequence-length

Averaging: This works if you have a few vectors and they are on the same topic. It doesn't work for many vectors, or if they are on different topics.
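
A minimal sketch of that averaging strategy (the chunking here is a naive fixed-size word window; splitting on sentence boundaries would be better):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def embed_by_averaging(text: str, chunk_words: int = 100) -> np.ndarray:
    # Split the text into fixed-size word windows.
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)] or [text]
    vectors = model.encode(chunks)  # one vector per chunk
    # Averaging is only sensible for a few chunks on the same topic,
    # as noted above.
    return vectors.mean(axis=0)
```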


zidsi commented Mar 22, 2022

Thanks, agreed. Under the strong assumption that a longer sequence (e.g. a news article) has a single "dominant" topic, this might be a proxy for document embedding and similarity search. Just wanted to check before experimenting.
For multi-topic documents, I guess clustering the "chunks" and using the top-n cluster centroids as the document's "topics" could be used instead of a single document vector.
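
A sketch of that multi-topic idea, assuming scikit-learn's KMeans for the chunk clustering (the chunking itself is left out; see the snippet above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def topic_centroids(chunks: list[str], n_topics: int = 3) -> np.ndarray:
    """Cluster chunk embeddings; return the centroids of the largest
    clusters as the document's 'topic' vectors."""
    vectors = model.encode(chunks)
    k = min(n_topics, len(chunks))
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
    sizes = np.bincount(km.labels_, minlength=k)
    order = np.argsort(sizes)[::-1]  # largest clusters first
    return km.cluster_centers_[order]
```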


Lauler commented May 19, 2022

@nreimers Can you share some details on your current strategy for training multilingual models to perform better for longer inputs? Most OPUS data is aligned on a sentence level.

Would it be valid to, for example, concatenate several aligned sentences from each respective language into longer paragraphs/documents (even if the sentences aren't necessarily from the same document to begin with)?

I trained and published a Swedish Sentence BERT with the help of the sentence-transformers package. However, there has been some interest and some requests for a longer max_seq_length than the current limit, which I set to 256.

Any advice on how to come by or create training data of sufficient length to ensure vectors are of high quality for longer inputs?

@nreimers
Member

@Lauler I think that is valid, but it would be better if the sentences are consecutive. You could get these e.g. from the TED 2020 dataset.
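
A sketch of that concatenation idea with consecutive sentences, assuming the parallel corpus (e.g. TED 2020) is already loaded as two aligned sentence lists:

```python
import random

def make_long_pairs(src_sents: list[str], tgt_sents: list[str],
                    min_n: int = 3, max_n: int = 8) -> list[tuple[str, str]]:
    """Concatenate runs of consecutive aligned sentences into longer
    parallel 'paragraphs' for training on longer inputs."""
    assert len(src_sents) == len(tgt_sents)
    pairs, i = [], 0
    while i < len(src_sents):
        n = random.randint(min_n, max_n)  # vary the paragraph length
        pairs.append((" ".join(src_sents[i:i + n]),
                      " ".join(tgt_sents[i:i + n])))
        i += n
    return pairs
```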


thiswillbeyourgithub commented Sep 24, 2023

(copying my message from there)

I'm sharing my own "rolling" SBERT script to avoid clipping the sentences. It's seemingly functional but not very elegant; a class would be better, of course. I just hope it helps someone:

moved the code to this comment

Related to #2236 #147 #1164 #695 #498 #364 #1918 #491 #285


aflip commented Oct 16, 2023

@thiswillbeyourgithub when I run your chunking code, I get an AssertionError: "Some sentences are too long for the transformer and are cropped." Apparently the rolling average failed. I have a corpus of texts that are between 80 and 600 words. If I run this without that assertion, the outputs seem OK.

@thiswillbeyourgithub

Hi, very sorry about that. I made a few mistakes in the previous code and hastily pasted it on GitHub. I fixed my above comment with the corrected code. It's quite a bit more expensive to run because it gradually moves the window's end and then its beginning, but it works fine. Also, I used max pooling instead of a sum.

Thanks and sorry!
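
The fixed script itself lives in the comment linked above; the following is only a rough sketch of the general idea as described (overlapping windows combined with max pooling; the window and stride values are illustrative, not the author's):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def rolling_embed(text: str, window: int = 100, stride: int = 50) -> np.ndarray:
    # Embed overlapping word windows so no part of the text is cropped,
    # then combine them with element-wise max pooling.
    words = text.split()
    starts = range(0, max(len(words) - window, 0) + 1, stride)
    windows = [" ".join(words[i:i + window]) for i in starts]
    vectors = model.encode(windows or [text])
    return vectors.max(axis=0)
```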


aflip commented Oct 17, 2023

@thiswillbeyourgithub thank you!
