feat: add interval logic for l2g features #812
base: dev
# feature will be the same for any gene associated with a studyLocus)
local_max.withColumn(
    "regional_maximum",
    f.max(local_feature_name).over(Window.partitionBy("studyLocusId")),
Why is this the maximum? According to the table and what we discussed, it should be the mean.
https://docs.google.com/spreadsheets/d/1wUs1AprRCCGItZmgDhc1fF5BtwCSosdzFv4NQ8V6Dtg/edit?gid=452826388#gid=452826388
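To make the difference concrete, here is a minimal plain-Python sketch (not the PR's Spark code) of the two candidate aggregations over all genes that share a studyLocusId. The row values and the helper name `regional_aggregate` are hypothetical and only serve as an illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (studyLocusId, geneId, local feature value).
rows = [
    ("sl1", "geneA", 0.9),
    ("sl1", "geneB", 0.3),
    ("sl1", "geneC", 0.0),
    ("sl2", "geneD", 0.5),
]


def regional_aggregate(rows, agg):
    """Aggregate the local feature over all genes of each studyLocusId.

    Mirrors an aggregation over Window.partitionBy("studyLocusId"): every
    gene in the same studyLocus receives the same regional value.
    """
    by_locus = defaultdict(list)
    for study_locus_id, _gene_id, value in rows:
        by_locus[study_locus_id].append(value)
    return {locus: agg(values) for locus, values in by_locus.items()}


regional_max = regional_aggregate(rows, max)
regional_mean = regional_aggregate(rows, mean)
```

With these made-up values the two choices disagree for `sl1` (0.9 for the maximum versus roughly 0.4 for the mean), so anything derived from the regional value depends on which aggregation is used.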
Thank you for the changes Jack!!!
The logic to build the features looks good! Please see my comments, but they are more along the lines of how we process the interval data in the L2G step.
I suggested processing all interval sources to make the process simpler, but since the code is set up to take source names and paths individually, and changing that would be messy, it's also fine to leave it as is, as long as the interval_paths parameter is correctly configured.
The implemented changes wouldn't run, because an Interval dataset is created with a mismatching schema. I would encourage you to:
- Add any features you add to the test_l2g_feature_matrix.py suite, to make sure that the code doesn't crash
- In the same file, add a semantic test for the common logic
- Update the documentation pages
- Pull the dev branch to bring in the changes to the feature matrix step
…1_l2g_intervals
…d test for interval features
We have investigated the intervals-only V2G dataset. The problem is that a single variant can carry interval information for multiple genes (up to 200) from one interval source. In addition, the intervals-only V2G dataset is too big and could be up to 4x the size of the variant index. It is therefore not feasible to include interval data inside the variant index. Instead, the "processed interval" dataset (which requires the gene index) will be generated each release, and the feature generation step will intersect the variants found within each studyLocus with the "processed interval" dataset, using an overlap approach, to generate the scores needed for the interval features.
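As a rough illustration of the overlap approach, the plain-Python sketch below intersects variant positions with interval ranges on the same chromosome. It is not the PR's implementation: the field layout, the helper name `overlap_scores`, the inclusive bounds, and the choice to keep the maximum score per (studyLocusId, geneId) pair are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical processed intervals: (chromosome, start, end, geneId, score).
intervals = [
    ("1", 100, 200, "geneA", 0.8),
    ("1", 150, 300, "geneB", 0.4),
    ("2", 100, 200, "geneC", 0.6),
]

# Hypothetical variants within studyLoci: (studyLocusId, chromosome, position).
variants = [
    ("sl1", "1", 160),
    ("sl1", "1", 250),
    ("sl2", "2", 500),
]


def overlap_scores(variants, intervals):
    """Return the best interval score per (studyLocusId, geneId).

    A variant overlaps an interval when both are on the same chromosome and
    start <= position <= end (inclusive bounds are an assumption).
    """
    # Group intervals by chromosome so each variant only scans its own one.
    by_chromosome = defaultdict(list)
    for chromosome, start, end, gene_id, score in intervals:
        by_chromosome[chromosome].append((start, end, gene_id, score))

    best = {}
    for study_locus_id, chromosome, position in variants:
        for start, end, gene_id, score in by_chromosome[chromosome]:
            if start <= position <= end:
                key = (study_locus_id, gene_id)
                # Keep the highest score seen for this pair (an assumption).
                best[key] = max(best.get(key, score), score)
    return best


scores = overlap_scores(variants, intervals)
```

Because the join happens at feature-generation time, only variants that actually sit in a studyLocus are matched against intervals, which avoids materialising the much larger per-variant interval table described above.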
…ntropy into xg1_l2g_intervals
✨ Context
Adds interval-based features to the L2G model, based on the feature list (opentargets/issues#3521).
opentargets/issues#3512
🛠 What does this PR implement
🙈 Missing
More features from the Andersson and Thurman interval datasets.
🚦 Before submitting
- Branch up to date with the dev branch?
- Local tests pass (make test)?
- Pre-commit rules pass (poetry run pre-commit run --all-files)?