There appear to be duplicate embeddings in the production embedding store #691

Closed
jkomoros opened this issue Apr 13, 2024 · 6 comments

Comments

@jkomoros
Owner

Running reindexCardEmbeddings appears to add embeddings even for cards that should already be in the store?

If you look at any given point ID and search for similar points, you'll find a huge number of duplicates with the same card ID and version.
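
One way to confirm the symptom is to group the stored points by card ID and version and see which groups hold more than one point. The following is a minimal sketch that works on an in-memory list; the `PointPayload` shape, its field names, and `findDuplicateGroups` are assumptions for illustration, not the project's actual types or the Qdrant API.

```typescript
// Hypothetical shape of one stored point; field names are assumptions.
interface PointPayload {
  pointID: string;
  cardID: string;
  version: number;
}

// Groups point IDs by "cardID@version" and returns only the groups that
// contain more than one point, i.e. the duplicates.
function findDuplicateGroups(points: PointPayload[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const p of points) {
    const key = `${p.cardID}@${p.version}`;
    const ids = groups.get(key) ?? [];
    ids.push(p.pointID);
    groups.set(key, ids);
  }
  const duplicates = new Map<string, string[]>();
  groups.forEach((ids, key) => {
    if (ids.length > 1) duplicates.set(key, ids);
  });
  return duplicates;
}
```

A healthy store would produce an empty map here, since each card at a given version should have exactly one embedding.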

@jkomoros
Owner Author

Yeah, the number of vectors in production is 100k instead of the expected ~15k, roughly 7x larger than expected, likely due to these extra re-embeddings.

@jkomoros
Owner Author

jkomoros commented Apr 13, 2024

  • Figure out why every time reindexCardEmbeddings is called it's storing a whole new item (did it ever work?)
  • Get gulp configure-qdrant to work again
  • Redeploy reindexCardEmbeddings in dev, then hit the endpoint and verify it works
  • Update private config on macbook
  • Blow away the whole dev index and rebuild
  • Blow away the whole prod build, then put the new cluster and API key in the config, then run gulp configure-qdrant and also npm run generate:config and npm run generate:env
  • After verifying that worked ^, blow away the whole prod index and rebuild
  • Switch back to free tier of Qdrant once it's under control and embedding index rebuilt

jkomoros added a commit that referenced this issue Apr 13, 2024
jkomoros added a commit that referenced this issue Apr 13, 2024
@jkomoros
Owner Author

My guess is that the problem is in reindexCardEmbeddings: it's bulk-fetching all of the items, but the cardsInfo is coming back incorrectly. Looks like it gets the content field but not the card_id field?

jkomoros reopened this Apr 14, 2024
@jkomoros
Owner Author

BTW, this "lots of duplicates of the same card content and embedding" problem is likely why the semanticSort in #688 was finding so many non-existent embeddings? Maybe because it was fetching a random embedding for that cardID?

jkomoros added a commit that referenced this issue Apr 14, 2024
The limit was 10, which was comically too low, so we erroneously reindexed just about everything when there was a miss.

Part of #691.
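
A fixed per-request limit only truncates results silently if the caller stops after the first response. As a rough sketch of the alternative, the loop below keeps requesting pages until the store reports there is nothing left. `StoredPoint`, `Page`, `FetchPage`, and `fetchAllPoints` are hypothetical names for illustration; the actual fix in the commit may simply have raised the limit, and no real Qdrant client call is shown here.

```typescript
// Hypothetical shapes for illustration; not the project's actual types.
interface StoredPoint {
  id: string;
  cardID: string;
  version: number;
}

interface Page {
  points: StoredPoint[];
  // Cursor for the next page, or null when there are no more results.
  next: string | null;
}

// Assumed signature of a function that fetches one page from the store.
type FetchPage = (cursor: string | null, limit: number) => Promise<Page>;

// Requests pages until the store says there are none left, so a small
// per-request limit cannot make existing points look like misses.
async function fetchAllPoints(
  fetchPage: FetchPage,
  limit: number = 100
): Promise<StoredPoint[]> {
  const all: StoredPoint[] = [];
  let cursor: string | null = null;
  do {
    const page: Page = await fetchPage(cursor, limit);
    all.push(...page.points);
    cursor = page.next;
  } while (cursor !== null);
  return all;
}
```

With this shape, the page size affects only the number of round trips, not which cards are considered already embedded.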
jkomoros added a commit that referenced this issue Apr 14, 2024
…t, hit the endpoint.

Before, if we had a small cardsContent at all, we'd erroneously think that every card without a record in it didn't exist.

Now if for whatever reason we get too few items, we can still be resilient and notice they already exist.

Part of #691.
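
The resilience described in this commit message could look roughly like the following: treat the bulk result as a hint, and before re-embedding a card that is missing from it, check whether that card already exists in the store. This is a sketch under assumptions, not the committed code; `CardVersion`, `keyFor`, `cardsToEmbed`, and the `existsInStore` callback are hypothetical, and a real existence check would likely be asynchronous.

```typescript
// Hypothetical identifier for one card at one version.
interface CardVersion {
  cardID: string;
  version: number;
}

// Stable key for one card at one version.
function keyFor(c: CardVersion): string {
  return `${c.cardID}@${c.version}`;
}

// Returns the cards that genuinely need a new embedding.
// bulkResult may be truncated, so absence from it is not proof that a
// card is missing; existsInStore (an assumed per-card lookup) is
// consulted only for cards the bulk result did not cover.
function cardsToEmbed(
  wanted: CardVersion[],
  bulkResult: CardVersion[],
  existsInStore: (c: CardVersion) => boolean
): CardVersion[] {
  const known = new Set(bulkResult.map(keyFor));
  return wanted.filter((c) => !known.has(keyFor(c)) && !existsInStore(c));
}
```

The per-card lookup costs extra requests when the bulk result is short, but it keeps a truncated fetch from turning into a full re-embed.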
jkomoros added a commit that referenced this issue Apr 14, 2024
@jkomoros
Owner Author

The bug that has been fixed was causing a lot of duplicate embeddings to be stored every time reindexCardEmbeddings was run, which was on every deploy.

jkomoros added a commit that referenced this issue Apr 14, 2024
@jkomoros
Owner Author

This is now fixed and deployed to production.
