Pipeline for non-drug concepts #40
Using Co-connect twins as an example

The dataset

We have a dataset to use as an example for adding non-drug domains to LLettuce's use case: the TwinsUK phenobase. The part that's interesting for us is the "Variables" sheet of a spreadsheet I was sent. Within this sheet, two columns are interesting: PhenotypeName and PhenotypeDescription.
There are 8144 of these name/description pairs. In OMOP, there's a "CO-CONNECT TWINS" vocabulary. 4234 of the PhenotypeNames match a CO-CONNECT TWINS concept. I've retrieved the standard concepts for these non-standard concepts. This provides a nice example for us to test versions of LLettuce. The PhenotypeDescription is the kind of long description of something that LLettuce is well positioned to parse into standard concepts. By making this PhenotypeDescription -> PhenotypeName -> CO-CONNECT TWINS -> OMOP standard concept chain, I've made a table of:
Getting LLettuce to predict the right column from the left is what we want to test. The exact format in which the OMOP standard concepts are represented isn't important; it could be JSON or anything else, as long as it can be parsed into a set of relationships to concepts. An important thing to note is that a single description can map to multiple concepts.

Preliminary test

I fine-tuned Flan-T5-small on an 80%/10% train/test split of the dataset. It did OK, given the small size of the model. I calculated the precision, recall, and

Future direction

A useful comparison to make will be between a fine-tuned Flan-T5 model and Llama 3.1. The steps for this will be:
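Because a mapping can point to multiple concepts, one way to score a prediction in either model is to treat the predicted and expected concepts as sets. The sketch below is only an illustration of that idea, not the evaluation code used in the preliminary test; the function names and the concept IDs in the usage note are hypothetical.

```python
from typing import Iterable, List, Tuple


def set_precision_recall(
    predicted: Iterable[int], expected: Iterable[int]
) -> Tuple[float, float]:
    """Precision and recall of one predicted concept set against the expected set."""
    pred = set(predicted)
    gold = set(expected)
    true_positives = len(pred & gold)
    # An empty set on either side gives 0.0 rather than a division error.
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def micro_average(
    pairs: List[Tuple[Iterable[int], Iterable[int]]]
) -> Tuple[float, float]:
    """Pool counts over all (predicted, expected) pairs before dividing."""
    tp = n_pred = n_gold = 0
    for predicted, expected in pairs:
        pred, gold = set(predicted), set(expected)
        tp += len(pred & gold)
        n_pred += len(pred)
        n_gold += len(gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return precision, recall
```

For example, `set_precision_recall([1, 2], [2, 3, 4, 5])` scores one correct concept out of two predicted and four expected. Micro-averaging weights descriptions with many target concepts more heavily; averaging the per-description scores instead would weight every description equally.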
Now that we have shown that LLettuce can work for drug concepts, we need to expand to non-drug concepts.
This could be as simple as removing references to "domain = 'Drug'" from queries. However, the OMOP vocabularies are large, so querying the whole database will be slow. NLP could help us narrow the search. Here's a rough scheme:
Estimating the class will be less useful - the main thing will be to narrow it down to domain. I would guess it's harder for an NLP system to achieve, too, so we can test, but
For this to work we will need a new pipeline
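As a rough illustration of narrowing by domain, the sketch below makes the domain filter an optional parameter instead of a hard-coded "domain = 'Drug'". It is not LLettuce's actual query code: it uses an in-memory SQLite table with a few columns from the OMOP `concept` table (`concept_id`, `concept_name`, `domain_id`, `standard_concept`), and the rows, `build_toy_db`, and `search_concepts` are all made up for the example. The estimated domain would come from whatever NLP step the pipeline ends up using.

```python
import sqlite3
from typing import List, Optional, Tuple


def build_toy_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the OMOP concept table (toy rows only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE concept ("
        "concept_id INTEGER PRIMARY KEY, concept_name TEXT, "
        "domain_id TEXT, standard_concept TEXT)"
    )
    conn.executemany(
        "INSERT INTO concept VALUES (?, ?, ?, ?)",
        [
            (1, "Toy drug: aspirin tablet", "Drug", "S"),
            (2, "Toy measurement: aspirin level", "Measurement", "S"),
            (3, "Toy condition: aspirin allergy", "Condition", "S"),
            (4, "Toy legacy aspirin code", "Drug", None),  # non-standard
        ],
    )
    return conn


def search_concepts(
    conn: sqlite3.Connection, term: str, domain: Optional[str] = None
) -> List[Tuple[int, str, str]]:
    """Search standard concepts by name, optionally restricted to one domain."""
    sql = (
        "SELECT concept_id, concept_name, domain_id FROM concept "
        "WHERE standard_concept = 'S' AND concept_name LIKE ?"
    )
    params: List[str] = [f"%{term}%"]
    if domain is not None:
        # Only add the domain filter when an estimate is available.
        sql += " AND domain_id = ?"
        params.append(domain)
    sql += " ORDER BY concept_id"
    return conn.execute(sql, params).fetchall()
```

Calling `search_concepts(conn, "aspirin")` searches every domain, while `search_concepts(conn, "aspirin", domain="Measurement")` only touches the estimated domain. If the domain estimate turns out to be unreliable, passing `None` falls back to the full search.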