Integrate vector search into pipeline #19

Open
1 of 2 tasks
kuraisle opened this issue Aug 5, 2024 · 0 comments
Collaborator
kuraisle commented Aug 5, 2024

There are 697 informal names in Esmond’s dataset where the LLM gave a sensible output (not blank or a “No specific drug name” response). Of these, 63 (9%) are exact matches to a concept in the RxNorm vocabulary. Of the rest, a vector search gives exactly the same answer as GPT-3 in 208 cases. Since the vector search has a far lower computational cost and can successfully answer at least 39% of queries (271 of 697), it's worth integrating into the pipeline. A little further effort might yield additional improvements.
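A quick sanity check on those figures (counts taken from the paragraph above):

```python
# Counts reported above, from Esmond's dataset
total_sensible = 697   # informal names with a sensible LLM output
exact_matches = 63     # exact matches to an RxNorm concept
vector_agrees = 208    # vector search returns the same answer as GPT-3

exact_rate = exact_matches / total_sensible
combined_rate = (exact_matches + vector_agrees) / total_sensible

print(f"{exact_rate:.0%}")     # ~9%
print(f"{combined_rate:.0%}")  # ~39%
```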

My experiment used roughly this code:

```python
from os import environ
from urllib.parse import quote_plus

import pandas as pd
import txtai
from dotenv import load_dotenv
from sqlalchemy import create_engine

# Load database credentials from a .env file
load_dotenv()

DB_HOST = environ["DB_HOST"]
DB_USER = environ["DB_USER"]
DB_PASSWORD = quote_plus(environ["DB_PASSWORD"])  # escape special characters
DB_NAME = environ["DB_NAME"]
DB_PORT = environ["DB_PORT"]
DB_SCHEMA = environ["DB_SCHEMA"]

connection_string = (
    f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{DB_NAME}"
)
engine = create_engine(connection_string)

# Pull every RxNorm concept from the concept table
rxnorm_concepts = pd.read_sql(
    f"""
    SELECT concept_id, concept_name
    FROM {DB_SCHEMA}.concept
    WHERE vocabulary_id = 'RxNorm'
    """,
    con=engine,
)

# Embed the concept names with a PubMedBERT-based model and index them;
# each indexed item is a (id, text, tags) tuple
embeddings = txtai.Embeddings(path="neuml/pubmedbert-base-embeddings", content=True)
embeddings.index(
    rxnorm_concepts.apply(lambda x: (x.concept_id, x.concept_name, None), axis=1)
)
```

Then, using `embeddings.search()`, you can fetch the closest n matches for a query string.
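For illustration, the nearest-neighbour lookup that `embeddings.search()` performs amounts to ranking indexed vectors by cosine similarity to the query vector. A minimal sketch with made-up concept ids and toy 2-d vectors (not real embeddings):

```python
import numpy as np

def top_n(query_vec, index_vecs, ids, n=3):
    """Return the n ids whose vectors are most cosine-similar to the query."""
    index = np.asarray(index_vecs, dtype=float)
    q = np.asarray(query_vec, dtype=float)
    # Normalise the rows and the query so a dot product is cosine similarity
    index = index / np.linalg.norm(index, axis=1, keepdims=True)
    q = q / np.linalg.norm(q)
    scores = index @ q
    order = np.argsort(scores)[::-1][:n]
    return [(ids[i], float(scores[i])) for i in order]

# Toy example: three hypothetical concept ids with fake 2-d "embeddings"
ids = [1001, 1002, 1003]
vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
print(top_n([1.0, 0.1], vecs, ids, n=2))  # 1001 ranks first, then 1002
```

In the real pipeline the vectors come from the PubMedBERT model and txtai handles the indexing and ranking for us.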

We need to
