
Embeddings: work out how and when to run upload scripts #136

Open
josephjclark opened this issue Dec 18, 2024 · 0 comments

The Apollo server comes with a connected vector database populated with embedding data. At the time of writing there are three datasets - docs, loinc and snomed - but more will be added later.

Each embedded dataset will include a script to populate the database. We'll likely need to maintain databases for staging, production, and individual dev setups, so these scripts must be easy to run and should have minimal inputs (ideally none, apart from credentials in env vars).

There are multiple ways we might set up the scripts:

  • As an ad-hoc Python script, i.e. we run cd services/my_embedding && python upload.py to populate the database. This would run on a local machine. I don't really like the idea of managing multiple envs on a dev machine - a dev would have one environment for their local development, which is fine, but uploading to the production database from there is a bit hairier. It's usually better to do that sort of thing in CI, with a nice clean env.
  • Through an endpoint on the Apollo server, i.e. we POST to apollo/embeddings/my_embedding/upload and the server uses its own environment to update the embeddings table. The worry here is that the embedding might be slow and expensive (think downloading and embedding 1 GB of loinc codes), and our production environments may not be able to handle it. We'd also need an auth solution, because we can't have just anyone posting to the server to flush the database.
  • We might even add CLI support which calls up to the server and can do things like prompting and easily reading in input files?
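To make the first option concrete, here's a minimal sketch of what such a script's shape could be. Everything here is an assumption for illustration - the env var name (EMBEDDINGS_DB_URL) and function names are hypothetical, not part of the Apollo codebase - but it shows the "no inputs except credentials in env vars" property:

```python
"""Hypothetical sketch of option 1: services/my_embedding/upload.py.
Invoked as `python upload.py` with no arguments; the target database
comes entirely from the environment."""
import os


def get_db_url() -> str:
    # Credentials come only from env vars, so the script needs no arguments
    # and cannot silently fall back to a default (possibly wrong) target.
    url = os.environ.get("EMBEDDINGS_DB_URL", "")
    if not url:
        raise SystemExit(
            "EMBEDDINGS_DB_URL is not set; refusing to guess a target database"
        )
    return url


def upload() -> None:
    url = get_db_url()
    # Placeholder for the real work: download the dataset, compute
    # embeddings, and write them into the vector database at `url`.
    print(f"would populate the embeddings table at {url}")
```

Because the only input is an env var, the same script runs unchanged on a dev machine or in a CI job - only the environment differs.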

So I'm not sure really. All solutions seem flawed.

We certainly do need a standalone script, so we'll start from there. The question is how to make it work effectively and safely across dev, staging, and production environments.
