In memory vector search information retrieval metrics #100

Open
4 of 13 tasks
kuraisle opened this issue Jan 3, 2025 · 0 comments
kuraisle commented Jan 3, 2025

Is this the right issue type?

  • Yes, I'm planning work for this project team.

Summary

For quick comparisons of embedding models for vector search, it would be useful to have an option that avoids the overhead of loading embeddings into a database and then querying it. A faster version that simply loads the embeddings for a subset of the OMOP vocabulary into memory and computes similarity there would help with rapid prototyping.

I've kept embeddings in parquet files, so the new option should be able to read those.
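
A minimal loading sketch, assuming pandas and numpy, with one row per concept and illustrative column names (`concept_id`, `embedding`) that may not match the actual parquet schema:

```python
import numpy as np
import pandas as pd


def load_embeddings(path: str, concept_ids: list[int] | None = None):
    """Load embeddings from a parquet file, optionally restricted to a subset.

    Assumes a "concept_id" column and an "embedding" column holding a
    list/array of floats; both names are illustrative, not a fixed schema.
    """
    df = pd.read_parquet(path)
    if concept_ids is not None:
        df = df[df["concept_id"].isin(concept_ids)]
    matrix = np.vstack(df["embedding"].to_numpy())  # shape: (n_concepts, dim)
    return df["concept_id"].to_numpy(), matrix
```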

Acceptance Criteria

  • Embeddings can be loaded from parquet files
  • A subset of embeddings can be taken from the parquet file
  • Cosine similarity and dot product can be calculated (see the similarity sketch after this list)
  • Metrics exist for vector search
    • Top-k precision, recall, and F-score
    • Relative position of the correct answer in the vocabulary
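
A sketch of the in-memory similarity step, assuming the query and the stored embeddings are numpy arrays as returned by the hypothetical loader above:

```python
import numpy as np


def dot_product_scores(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Dot product of one query vector against all stored embeddings."""
    return matrix @ query


def cosine_scores(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity of one query vector against all stored embeddings."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return (matrix @ query) / np.where(norms == 0.0, 1.0, norms)


def top_k(scores: np.ndarray, concept_ids: np.ndarray, k: int = 10) -> np.ndarray:
    """Concept ids of the k highest-scoring embeddings, best first."""
    order = np.argsort(scores)[::-1][:k]
    return concept_ids[order]
```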

Tasks

  • Embedding loading code
  • Similarity calculation code
  • Metrics code (see the metrics sketch after this list)
  • Unit tests
  • Documentation
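
A sketch of the per-query retrieval metrics named in the acceptance criteria (top-k precision, recall, F-score, and the rank of the correct answer); function names and signatures are placeholders, not an agreed interface:

```python
def precision_recall_f_at_k(retrieved_ids, relevant_ids, k: int):
    """Top-k precision, recall, and F-score for a single query."""
    top = set(retrieved_ids[:k])
    relevant = set(relevant_ids)
    hits = len(top & relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    return precision, recall, f_score


def rank_of_correct_answer(ranked_ids, correct_id):
    """1-based position of the correct concept in the ranked vocabulary,
    or None if it does not appear."""
    ranked = list(ranked_ids)
    return ranked.index(correct_id) + 1 if correct_id in ranked else None
```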

Confirm creation

  • This issue is ready