There appear to be duplicate embeddings in the production embedding store #691
Comments
Yeah, the number of vectors in production is 100k instead of the expected ~15k, roughly 7x larger than expected, likely due to these extra re-embeddings.
My guess is that it's in reindexCardEmbeddings: it bulk-fetches all of the items, but the cardsInfo is coming back incorrectly. Looks like it gets the content field but not the card_id field?
BTW this "lots of duplicates of the same card content and embedding" is likely why the semanticSort in #688 was finding so many non-existent embeddings? Maybe? Because it was fetching a random embedding for that cardID?
The limit was 10, which was comically too low, so we erroneously reindexed just about everything when there was a miss. Part of #691.
…t, hit the endpoint. Before, if we had a small cardsContent at all, we'd erroneously think that every card without a record in it didn't exist. Now, if for whatever reason we get too few items, we can still be resilient and notice they already exist. Part of #691.
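A minimal sketch of the resilience idea described in that commit message, not the actual reindexCardEmbeddings code. The names `EmbeddingRecord`, `cardsNeedingEmbedding`, and `existsInStore` are hypothetical, and the existence check is kept synchronous for brevity (a real store lookup would presumably be async). The point is that a card missing from a possibly truncated bulk result is only re-embedded after a targeted lookup confirms it is really absent.

```typescript
// Hypothetical record shape; the real store's field names may differ.
interface EmbeddingRecord {
  cardID: string;
  version: number;
}

// Hypothetical targeted lookup: does an embedding exist for this card and version?
type ExistsCheck = (cardID: string, version: number) => boolean;

// Returns the IDs of cards that genuinely need a new embedding.
// `wanted` maps each card ID to its current version.
function cardsNeedingEmbedding(
  wanted: Record<string, number>,
  bulkResult: EmbeddingRecord[],
  existsInStore: ExistsCheck
): string[] {
  // Index whatever the bulk fetch returned, even if it was cut short by a limit.
  const seen = new Set<string>();
  for (const record of bulkResult) {
    seen.add(`${record.cardID}:${record.version}`);
  }

  const missing: string[] = [];
  for (const cardID of Object.keys(wanted)) {
    const version = wanted[cardID];
    if (seen.has(`${cardID}:${version}`)) continue;
    // Absent from the bulk result is not proof of absence from the store:
    // confirm with a targeted lookup before deciding to re-embed.
    if (!existsInStore(cardID, version)) missing.push(cardID);
  }
  return missing;
}
```

With this shape, a too-small bulk result costs extra lookups rather than extra embeddings.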
The bug that has now been fixed was causing a lot of duplicate embeddings to be stored every time reindexCardEmbeddings was run, which was on every deploy.
This is now fixed and deployed to production.
Running reindexCardEmbeddings appears to add embeddings, even for cards that should already be in the store.
If you look at any given point ID and search by similar, you'll find a huge number of duplicates with the same card ID and version.
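One way to quantify the duplication, sketched under assumptions: if the stored points can be listed with their payloads, grouping them by card ID and version shows which pairs have more than one point. The `StoredPoint` shape and the `findDuplicateGroups` helper are hypothetical (only the `card_id` field name comes from this thread), and how the points are listed depends on the store's client.

```typescript
// Hypothetical shape of a stored point; only `card_id` is named in this thread.
interface StoredPoint {
  id: string;
  card_id: string;
  version: number;
}

// Groups point IDs by (card_id, version) and keeps only the groups
// containing more than one point, i.e. the duplicates.
function findDuplicateGroups(points: StoredPoint[]): Record<string, string[]> {
  const groups: Record<string, string[]> = {};
  for (const point of points) {
    const key = `${point.card_id}:${point.version}`;
    if (!groups[key]) groups[key] = [];
    groups[key].push(point.id);
  }

  const duplicates: Record<string, string[]> = {};
  for (const key of Object.keys(groups)) {
    if (groups[key].length > 1) duplicates[key] = groups[key];
  }
  return duplicates;
}
```

Every point ID after the first in each returned group would be a candidate for deletion.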