The Apollo server comes with a connected vector database populated with embedding data. At the time of writing there are three datasets - docs, loinc and snomed - but more will be added later.
Each embedded dataset will include a script to populate the database. We'll likely need to maintain databases for staging, production and individual devs. So it's gotta be easy to run these scripts, and they should have minimal inputs (ideally none, apart from credentials in env vars).
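As a rough sketch of what one of these scripts might look like (the env var names `EMBEDDINGS_DB_URL` / `EMBEDDING_API_KEY` and the helper bodies are all made up, not the real implementation):

```python
# upload.py - minimal sketch of a dataset population script; env var names
# and helper bodies are placeholders.
import os
import sys


def load_dataset() -> list[str]:
    # Dataset-specific: read the docs/loinc/snomed source files.
    return ["example record"]


def embed(records: list[str], api_key: str) -> list[list[float]]:
    # Call out to the embedding model; stubbed with zero vectors here.
    return [[0.0] * 8 for _ in records]


def upsert(db_url: str, vectors: list[list[float]]) -> None:
    # Idempotent write into the vector table (details elided).
    print(f"would upsert {len(vectors)} vectors into {db_url}")


def main() -> None:
    # The only inputs are credentials and a target, via env vars, per the goal above.
    db_url = os.environ.get("EMBEDDINGS_DB_URL")
    api_key = os.environ.get("EMBEDDING_API_KEY")
    if not db_url or not api_key:
        sys.exit("EMBEDDINGS_DB_URL and EMBEDDING_API_KEY must be set")
    upsert(db_url, embed(load_dataset(), api_key))


if __name__ == "__main__":
    main()
```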
There are multiple ways we might set up the scripts:
- As an ad-hoc Python script, i.e. we do `cd services/my_embedding && python upload.py` to populate the database (roughly the shape of the sketch above). This would run on a local machine. I don't really like the idea of managing multiple envs on a dev machine - a dev having one environment for their local development is fine, but when uploading to the production database it's a bit hairier. That sort of thing is usually better done in CI, with a nice clean env.
- Through an endpoint on the Apollo server, i.e. we POST to `apollo/embeddings/my_embedding/upload` and the server uses its own environment to run an update of the embeddings table. The worry is that the embedding might be slow and expensive (think downloading and embedding 1 GB of LOINC codes), and our production environments may not be able to handle it. We'd also need an auth solution, because we can't have just anyone POSTing to the server to flush the database (see the sketch after this list).
- We might even add CLI support which calls up to the server and can do things like prompting the user and reading in input files?
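On the auth worry for the endpoint option, a shared-token check is probably the minimum. Purely as an illustration (FastAPI is an arbitrary choice here - this assumes nothing about the actual Apollo server stack, and all names are invented):

```python
# Illustrative only: a token-gated upload endpoint that pushes the slow
# embed-and-upsert work out of the request cycle. Framework and names are
# assumptions, not the real Apollo server.
import os
import secrets

from fastapi import BackgroundTasks, FastAPI, Header, HTTPException

app = FastAPI()


def repopulate(dataset: str) -> None:
    # Long-running: download the source data, embed it, upsert it. Elided.
    ...


@app.post("/embeddings/{dataset}/upload")
def upload(dataset: str,
           background: BackgroundTasks,
           authorization: str = Header("")) -> dict:
    # Shared-token auth: only callers holding the secret can flush the table.
    expected = f"Bearer {os.environ['EMBEDDINGS_UPLOAD_TOKEN']}"
    if not secrets.compare_digest(authorization, expected):
        raise HTTPException(status_code=403, detail="bad token")
    # Return immediately; the repopulation runs as a background task.
    background.add_task(repopulate, dataset)
    return {"status": "started", "dataset": dataset}
```

Even with that, the slow/expensive concern still stands - a background task is still running on the production box.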
So I'm not sure really. All solutions seem flawed.
We certainly do need a standalone script, so we'll start from there. The question is how to make this work effectively and safely with multiple production environments.
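One pattern that might help with the safety part (names hypothetical): make the target environment an explicit, mandatory input, and refuse production runs without an extra confirmation, so CI can set both vars deliberately but a stray local run can't hit prod by accident:

```python
# Hypothetical guard for multi-environment runs: the target environment must
# be named explicitly, and production requires an extra confirmation flag.
import os
import sys

KNOWN_ENVS = {"dev", "staging", "production"}


def resolve_target_env() -> str:
    env = os.environ.get("APOLLO_ENV", "")
    if env not in KNOWN_ENVS:
        sys.exit(f"APOLLO_ENV must be one of {sorted(KNOWN_ENVS)}, got {env!r}")
    if env == "production" and os.environ.get("CONFIRM_PRODUCTION") != "yes":
        sys.exit("refusing a production run without CONFIRM_PRODUCTION=yes")
    return env
```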