EmbeddingRetriever does not account for longer documents #3240
Labels
- 1.x
- P3 (Low priority, leave it in the backlog)
- topic:retriever
- type:bug (Something isn't working)
- wontfix (This will not be worked on)
Describe the bug
The EmbeddingRetriever does not account for long sequences. More precisely, it passes the documents to the underlying encoder model (a sentence-transformers model), which truncates each sequence to the model's maximum length before embedding.
As such:
i) the embedding is not based on the full document, and
ii) multiple documents can end up with identical embeddings if they share the same starting sequence up to the model's max length.
Note: this is also the case for TableQA.
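The collision in (ii) can be illustrated with a toy sketch. The `embed` function below is a hypothetical stand-in for the real sentence-transformers encoder (which truncates at the token level, not the word level); it only mimics the prefix-truncation behavior:

```python
MAX_LEN = 5  # stand-in for the model's max sequence length

def embed(text: str) -> tuple:
    # Truncate to the first MAX_LEN words, then map each word to a
    # deterministic number -- a toy embedding that mimics how the
    # encoder only ever sees the truncated prefix.
    words = text.split()[:MAX_LEN]
    return tuple(sum(ord(c) for c in w) for w in words)

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps straight into the river"

# Both documents share the first MAX_LEN words, so everything after
# the truncation point is ignored and the embeddings collide.
assert embed(doc_a) == embed(doc_b)
```

Any retriever ranking based on these embeddings cannot distinguish the two documents, even though their content diverges after the prefix.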
Expected behavior
From: @sjrl
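One common mitigation (a sketch only, not necessarily what @sjrl proposed) is to split a long document into windows of at most the model's max length, embed each window, and mean-pool the window embeddings so the whole document contributes. Using the same toy word-level encoder as a stand-in:

```python
MAX_LEN = 5  # stand-in for the model's max sequence length

def embed_window(words: list) -> list:
    # Toy deterministic per-window embedding (hypothetical stand-in
    # for the real encoder).
    return [sum(ord(c) for c in w) for w in words]

def embed_full(text: str) -> list:
    words = text.split()
    # Non-overlapping windows covering the whole document.
    windows = [words[i:i + MAX_LEN] for i in range(0, len(words), MAX_LEN)]
    vecs = [embed_window(w) for w in windows]
    # Mean-pool element-wise; pad shorter window vectors with zeros.
    dim = max(len(v) for v in vecs)
    padded = [v + [0] * (dim - len(v)) for v in vecs]
    return [sum(col) / len(vecs) for col in zip(*padded)]

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps straight into the river"

# With pooling over all windows, text beyond the truncation point
# influences the embedding, so the two documents no longer collide.
assert embed_full(doc_a) != embed_full(doc_b)
```

Mean-pooling is one of several choices here; max-pooling or keeping per-window embeddings and retrieving at the window level are alternatives with different trade-offs.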
To Reproduce
This Colab Notebook.
FAQ Check