
Embeddings: work out how and when to run upload scripts #136

Open
josephjclark opened this issue Dec 18, 2024 · 0 comments

The Apollo server comes with a connected vector database populated with embedding data. At the time of writing there are three datasets - docs, loinc and snomed - but more will be added later.

Each embedded dataset will include a script to populate the database. We'll likely need to maintain databases for staging, production, and individual dev setups, so these scripts must be easy to run and should have minimal inputs (ideally none, apart from credentials in env vars).

There are multiple ways we might set up the scripts:

  • As an ad-hoc Python script, i.e. we run cd services/my_embedding && python upload.py to populate the database. This would run on a local machine. I don't really like the idea of managing multiple envs on a dev machine - a dev would have one environment for their local development, which is fine, but uploading to the production database from there is a bit hairier. It's usually better to do that sort of thing in CI, with a nice clean env.
  • Through an endpoint on the Apollo server, i.e. we POST to apollo/embeddings/my_embedding/upload and the server uses its own environment to update the embeddings table. The worry here is that the embedding might be slow and expensive (think downloading and embedding 1 GB of loinc codes), and our production environments may not be able to handle it. We'd also need an auth solution, because we can't have just anyone posting to the server to flush the database.
  • We might even add CLI support which calls up to the server and can do things like prompting and easily reading in input files?
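To make the first option concrete, here's a minimal sketch of what such a script's shape could be. Everything here is an assumption for illustration - the env var name (EMBEDDINGS_DB_URL) and function names are hypothetical, not part of the Apollo codebase - but it shows the "no inputs except credentials in env vars" property:

```python
"""Hypothetical sketch of option 1: services/my_embedding/upload.py.
Invoked as `python upload.py` with no arguments; the target database
comes entirely from the environment."""
import os


def get_db_url() -> str:
    # Credentials come only from env vars, so the script needs no arguments
    # and cannot silently fall back to a default (possibly wrong) target.
    url = os.environ.get("EMBEDDINGS_DB_URL", "")
    if not url:
        raise SystemExit(
            "EMBEDDINGS_DB_URL is not set; refusing to guess a target database"
        )
    return url


def upload() -> None:
    url = get_db_url()
    # Placeholder for the real work: download the dataset, compute
    # embeddings, and write them into the vector database at `url`.
    print(f"would populate the embeddings table at {url}")
```

Because the only input is an env var, the same script runs unchanged on a dev machine or in a CI job - only the environment differs.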

So I'm not sure really. All solutions seem flawed.

We certainly do need a standalone script, so we'll start from there. The question is how to make it work effectively and safely across dev, staging, and production environments.
