
EmbeddingRetriever does not account for longer documents #3240

Closed
1 task done
bglearning opened this issue Sep 19, 2022 · 3 comments
Labels
1.x · P3 Low priority, leave it in the backlog · topic:retriever · type:bug Something isn't working · wontfix This will not be worked on

Comments

bglearning (Contributor) commented Sep 19, 2022

Describe the bug
The EmbeddingRetriever does not account for long sequences. More precisely, it passes the documents on to the underlying encoder model (sentence-transformers), which truncates each sequence before embedding it.

As such,

  • the embedding isn't based on the full document, and
  • multiple documents can end up with identical embeddings if they share the same starting sequence up to the model's max sequence length.

Note: this is also the case for table QA.

Expected behavior

  • Short-term: a relevant warning message.
  • Long-term: some way to natively handle long documents.

From: @sjrl

I believe, as @ju-gu mentioned, it would be helpful to add a warning message when documents passed to the EmbeddingRetriever are longer than the max_seq_length supported by the loaded embedding model. This would be very similar to the warning thrown by the FARMReader (I think only during training), which tells the user that the texts passed to the reader are being truncated.
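
For illustration only, a minimal sketch of what such a check could look like, calling sentence-transformers directly; the model name and the `warn_if_truncated` helper are assumptions, not Haystack's actual implementation:

```python
import logging

from sentence_transformers import SentenceTransformer

logger = logging.getLogger(__name__)

# Model name chosen only for illustration.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


def warn_if_truncated(texts, model):
    """Hypothetical helper: warn when a text exceeds the model's max sequence length."""
    for i, text in enumerate(texts):
        n_tokens = len(model.tokenizer(text)["input_ids"])
        if n_tokens > model.max_seq_length:
            logger.warning(
                "Document %d has %d tokens, but the embedding model only uses the "
                "first %d; the rest will be truncated.",
                i, n_tokens, model.max_seq_length,
            )
```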

We also discussed alternatives to truncation offline in Slack. For example, we could split each long document into smaller text chunks, pass each chunk through the EmbeddingRetriever, and then pool (e.g. mean or max) the chunk embeddings together. This post here explains the concept well but falls short of actually evaluating how well the resulting embedding vectors perform.

This would allow us to create a single embedding that "represents" the whole document. However, as @mathislucka pointed out, this is not how the embedding retriever models were trained, and there don't seem to be good benchmarks in the NLP community for evaluating this approach.
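
For reference, a rough sketch of the chunk-and-pool idea under discussion; the word-based chunking, chunk size, and mean pooling are arbitrary choices for illustration, not a recommendation:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Model name chosen only for illustration.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


def embed_long_document(text: str, chunk_words: int = 200) -> np.ndarray:
    """Split a long document into word-based chunks, embed each chunk,
    and mean-pool the chunk embeddings into a single document vector."""
    words = text.split()
    chunks = [
        " ".join(words[i : i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ] or [""]
    chunk_embeddings = model.encode(chunks)  # shape: (n_chunks, dim)
    return chunk_embeddings.mean(axis=0)     # mean pooling; max pooling would be .max(axis=0)
```

As noted above, retrieval quality with such pooled vectors is unproven, since the models were not trained this way.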

Additional resources:

To Reproduce
This Colab Notebook.
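
For readers without access to the notebook, a minimal sketch of the truncation effect, calling sentence-transformers directly rather than going through the EmbeddingRetriever (model name chosen only for illustration):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # max_seq_length is 256 here
shared_prefix = "common introduction " * 300  # well beyond the model's max sequence length

doc_a = shared_prefix + "This document is about chemistry."
doc_b = shared_prefix + "This document is about medieval history."

emb_a, emb_b = model.encode([doc_a, doc_b])
# Both embeddings are computed from the truncated (identical) prefix,
# so the similarity is ~1.0 even though the documents differ.
print(util.cos_sim(emb_a, emb_b))
```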

FAQ Check

@sjrl sjrl added the type:bug Something isn't working label Sep 27, 2022
@masci masci added the P2 Medium priority, add to the next sprint if no P1 available label Nov 24, 2022
@masci masci added P3 Low priority, leave it in the backlog and removed P2 Medium priority, add to the next sprint if no P1 available labels Jan 25, 2023
liorshk commented Feb 23, 2023

Hi,
Any update on this issue?

julian-risch (Member) commented

I can share some intuition about why the idea of pooling won't work well for long documents.

First, pooling ignores the order of the chunks/words. This is not a big concern if the chunks are not just a single token or a few tokens but hundreds of tokens. However, in the linked post, pooling is also used to calculate one embedding for all the words in one chunk. In that case, the more words go into one pool, the more generic the resulting embedding becomes, and this effect is amplified if we later also pool the embeddings of the chunks.

As an example, imagine a long book with thousands of words. If you calculate the embedding of each word and then average all of them, the resulting document embedding will be very generic and similar to the embeddings of many other books, simply because they have so many words in common. Stop words will have the biggest effect on the document embedding, and the document embeddings will be hardly usable.
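
A toy illustration of that dilution effect with synthetic word vectors (purely illustrative numbers, not a benchmark and not real embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# Synthetic word vectors: a shared pool of "stop words" plus distinct content words.
stop_words = rng.normal(size=(500, dim))
content_a = rng.normal(size=(50, dim))   # content words unique to book A
content_b = rng.normal(size=(50, dim))   # content words unique to book B


def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Each "book" embedding is the mean over all of its word vectors.
book_a = np.vstack([stop_words, content_a]).mean(axis=0)
book_b = np.vstack([stop_words, content_b]).mean(axis=0)

# The shared stop words dominate the average, so the pooled embeddings
# end up very similar despite completely disjoint content words.
print(cos(book_a, book_b))                                  # close to 1.0
print(cos(content_a.mean(axis=0), content_b.mean(axis=0)))  # close to 0.0
```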

Venkatesh-Balavadani commented

Hi, I am working on a use case that involves longer-context documents and have been experimenting with the EmbeddingRetriever for it. The contents are truncated at 512 tokens, and because of this the RAG pipeline is not performing well. Do you have any other workaround, such as processing the documents in batches or using one of the embedding approaches for longer documents mentioned above?

@masci masci added the 1.x label Apr 7, 2024
@masci masci added the wontfix This will not be worked on label May 10, 2024
@masci masci closed this as completed May 10, 2024