
EmbeddingRetriever does not account for longer documents #3240

Closed
1 task done
bglearning opened this issue Sep 19, 2022 · 3 comments
Labels
1.x · P3 Low priority, leave it in the backlog · topic:retriever · type:bug Something isn't working · wontfix This will not be worked on

Comments

bglearning (Contributor) commented Sep 19, 2022

Describe the bug
The EmbeddingRetriever does not account for long sequences. More precisely, it passes the documents on to the underlying encoder model (sentence-transformers), which truncates each sequence before embedding it.

As such,

  • the embedding isn't based on the full document, and
  • multiple documents can end up with identical embeddings if they share the same starting sequence up to the model's max sequence length.

Note: this is also the case for table QA.

Expected behavior

  • Short-term: a relevant warning message.
  • Long-term: some way to natively handle long documents.

From: @sjrl

I believe, as @ju-gu mentioned, it would be helpful to add a warning message when documents passed to the EmbeddingRetriever are longer than the max_seq_length supported by the loaded embedding model. This would be very similar to the warning thrown by the FARMReader (I think only during training), which tells the user that the texts passed to the reader are being truncated.
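
For illustration only, a minimal sketch of what such a check could look like, calling sentence-transformers directly; the model name and the `warn_if_truncated` helper are assumptions, not Haystack's actual implementation:

```python
import logging

from sentence_transformers import SentenceTransformer

logger = logging.getLogger(__name__)

# Model name chosen only for illustration.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


def warn_if_truncated(texts, model):
    """Hypothetical helper: warn when a text exceeds the model's max sequence length."""
    for i, text in enumerate(texts):
        n_tokens = len(model.tokenizer(text)["input_ids"])
        if n_tokens > model.max_seq_length:
            logger.warning(
                "Document %d has %d tokens, but the embedding model only uses the "
                "first %d; the rest will be truncated.",
                i, n_tokens, model.max_seq_length,
            )
```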

We also discussed alternatives to truncation offline in Slack. For example, we could split each long document into smaller text chunks, pass each chunk through the EmbeddingRetriever, and then pool (e.g. mean or max) the chunk embeddings together. This post here explains the concept well but falls short of actually evaluating how well the resulting embedding vectors perform.

This would allow us to create a single embedding that "represents" the whole document. However, as @mathislucka pointed out, this is not how the embedding retriever models were trained, and there don't seem to be good benchmarks in the NLP community for evaluating this approach.
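
For reference, a rough sketch of the chunk-and-pool idea under discussion; the word-based chunking, chunk size, and mean pooling are arbitrary choices for illustration, not a recommendation:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Model name chosen only for illustration.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


def embed_long_document(text: str, chunk_words: int = 200) -> np.ndarray:
    """Split a long document into word-based chunks, embed each chunk,
    and mean-pool the chunk embeddings into a single document vector."""
    words = text.split()
    chunks = [
        " ".join(words[i : i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ] or [""]
    chunk_embeddings = model.encode(chunks)  # shape: (n_chunks, dim)
    return chunk_embeddings.mean(axis=0)     # mean pooling; max pooling would be .max(axis=0)
```

As noted above, retrieval quality with such pooled vectors is unproven, since the models were not trained this way.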

Additional resources:

To Reproduce
This Colab Notebook.
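
For readers without access to the notebook, a minimal sketch of the truncation effect, calling sentence-transformers directly rather than going through the EmbeddingRetriever (model name chosen only for illustration):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # max_seq_length is 256 here
shared_prefix = "common introduction " * 300  # well beyond the model's max sequence length

doc_a = shared_prefix + "This document is about chemistry."
doc_b = shared_prefix + "This document is about medieval history."

emb_a, emb_b = model.encode([doc_a, doc_b])
# Both embeddings are computed from the truncated (identical) prefix,
# so the similarity is ~1.0 even though the documents differ.
print(util.cos_sim(emb_a, emb_b))
```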

FAQ Check

@sjrl sjrl added the type:bug Something isn't working label Sep 27, 2022
@masci masci added the P2 Medium priority, add to the next sprint if no P1 available label Nov 24, 2022
@masci masci added P3 Low priority, leave it in the backlog and removed P2 Medium priority, add to the next sprint if no P1 available labels Jan 25, 2023
liorshk commented Feb 23, 2023

Hi,
Any update on this issue?

julian-risch (Member) commented

I can share some intuition about why the idea of pooling won't work well for long documents.

First, pooling ignores the order of the chunks/words. This is not a big concern if the chunks are not just a single token or a few tokens but hundreds of tokens. However, in the linked post, pooling is also used to calculate one embedding for all the words in one chunk. In that case, the more words go into one pool, the more generic the resulting embedding becomes, and this effect is amplified if we later also pool the embeddings of the chunks.

As an example, imagine a long book with thousands of words. If you calculate the embedding of each word and then average all of them, the resulting document embedding will be very generic and similar to the embeddings of many other books, simply because they have so many words in common. Stop words will have the biggest effect on the document embedding, and the document embeddings will be hardly usable.
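
A toy illustration of that dilution effect with synthetic word vectors (purely illustrative numbers, not a benchmark and not real embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# Synthetic word vectors: a shared pool of "stop words" plus distinct content words.
stop_words = rng.normal(size=(500, dim))
content_a = rng.normal(size=(50, dim))   # content words unique to book A
content_b = rng.normal(size=(50, dim))   # content words unique to book B


def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Each "book" embedding is the mean over all of its word vectors.
book_a = np.vstack([stop_words, content_a]).mean(axis=0)
book_b = np.vstack([stop_words, content_b]).mean(axis=0)

# The shared stop words dominate the average, so the pooled embeddings
# end up very similar despite completely disjoint content words.
print(cos(book_a, book_b))                                  # close to 1.0
print(cos(content_a.mean(axis=0), content_b.mean(axis=0)))  # close to 0.0
```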

Venkatesh-Balavadani commented

Hi, I am working on a use case that involves longer-context documents and have been experimenting with the EmbeddingRetriever for it. The contents are truncated at 512 tokens, and because of this the RAG pipeline is not performing well. Do you have any other workaround, such as processing the documents in batches or using one of the embedding approaches for longer documents mentioned above?

@masci masci added the 1.x label Apr 7, 2024
@masci masci added the wontfix This will not be worked on label May 10, 2024
@masci masci closed this as completed May 10, 2024