feat(l2gprediction): add score explanation based on features #939

Merged
merged 53 commits into from
Feb 19, 2025

@ireneisdoomed ireneisdoomed commented Dec 3, 2024

✨ Context

This PR closes opentargets/issues#3664

This is what the prioritisation for the 44acafc7985c3180b072394a28d7bad9 locus row looks like:

--------------------------------------------------------------------------------------------------------------------------
 geneId        | ENSG00000075073
 shapleyValues | {pQtlColocH4MaximumNeighbourhood -> -0.08760687195471195, eQtlColocH4Maximum -> -0.14339800289527474, distanceTssMean -> 0.5949956624176115, vepMeanNeighbourhood -> -0.011421102911212407, geneCount500kb -> 0.504088935973282, eQtlColocClppMaximumNeighbourhood -> 3.8670788812320505E-4, credibleSetConfidence -> 0.3579454324922001, distanceTssMeanNeighbourhood -> 2.1287083945895433, distanceSentinelTssNeighbourhood -> 0.06814055239948229, pQtlColocH4Maximum -> -4.429795836247109E-5, sQtlColocH4MaximumNeighbourhood -> -0.0014771542785202165, pQtlColocClppMaximum -> 0.0, distanceFootprintMeanNeighbourhood -> 3.5838524396517637, eQtlColocH4MaximumNeighbourhood -> 9.521371211488606, sQtlColocH4Maximum -> 0.08286727373455878, eQtlColocClppMaximum -> 1.8582865064964005, distanceSentinelTss -> 1.2031946462979695, sQtlColocClppMaximum -> -0.19218105019873652, distanceFootprintMean -> -0.21218899079465908, sQtlColocClppMaximumNeighbourhood -> -0.8441121532817452, isProteinCoding -> 1.3144695281881593, pQtlColocClppMaximumNeighbourhood -> -0.019016532039634684, distanceSentinelFootprintNeighbourhood -> 0.3102929859324838, vepMaximumNeighbourhood -> 0.047033325222490756, proteinGeneCount500kb -> 0.2529121503720645, vepMean -> 0.28520782840325626, distanceSentinelFootprint -> 0.3116443797012342, vepMaximum -> 0.0} 
--------------------------------------------------------------------------------------------------------------------------
 geneId        | ENSG00000156515
 shapleyValues | {pQtlColocH4MaximumNeighbourhood -> -0.06818768537940262, eQtlColocH4Maximum -> -0.14336530875137046, distanceTssMean -> 1.5530291409428105, vepMeanNeighbourhood -> 0.27627936610430315, geneCount500kb -> 0.7291870883206015, eQtlColocClppMaximumNeighbourhood -> -0.27041447664625295, credibleSetConfidence -> 0.27395001762237353, distanceTssMeanNeighbourhood -> 1.8170243838499736, distanceSentinelTssNeighbourhood -> 0.06262635310529577, pQtlColocH4Maximum -> -4.429795836247109E-5, sQtlColocH4MaximumNeighbourhood -> -0.0015916146826161937, pQtlColocClppMaximum -> 0.0, distanceFootprintMeanNeighbourhood -> 3.3691309949297295, eQtlColocH4MaximumNeighbourhood -> 9.585267225244944, sQtlColocH4Maximum -> 0.5632581225491728, eQtlColocClppMaximum -> 0.37135374806985344, distanceSentinelTss -> -0.12253906908689487, sQtlColocClppMaximum -> -0.14436242678936584, distanceFootprintMean -> -0.12424150830529755, sQtlColocClppMaximumNeighbourhood -> 1.0804192712631597, isProteinCoding -> 1.1216576014723711, pQtlColocClppMaximumNeighbourhood -> -0.019016532039634684, distanceSentinelFootprintNeighbourhood -> 0.46839704438965074, vepMaximumNeighbourhood -> -0.034280363722216364, proteinGeneCount500kb -> 0.39402035712660993, vepMean -> 0.7099149349443197, distanceSentinelFootprint -> 0.3021197989635074, vepMaximum -> 0.0} 

All results available at: gs://ot-team/irene/l2g/06122024/locus_to_gene_predictions
All predictions have their corresponding explanations.

🛠 What does this PR implement

  • New shapleyValues field (map type) in the prediction schema
  • New util convert_map_type_to_columns, which converts the feature annotation in the locusToGeneFeatures map type to a dataframe that can be passed to the SHAP explainer
  • model is now an instance attribute of the Predictions dataset
  • New explain method in the predictions dataset: it calculates Shapley values and returns a new object with the added column
  • Edited the step to add this information
  • Enhancement in Dataset.filter so that the returned new instance keeps the attributes of the original object. This was necessary to propagate the model instance attribute after each modification of the predictions dataset.
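
To illustrate the last point, here is a minimal sketch of a filter that keeps instance attributes. ToyDataset, its rows list and the example values are hypothetical stand-ins made up for illustration, not the gentropy classes or real predictions; the only idea taken from this PR is that filtering should return a new instance that still carries model.

```python
from __future__ import annotations

import copy
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a dataset wrapper (not the gentropy class)."""

    rows: list[dict[str, Any]]
    model: Any = None  # extra instance attribute that must survive filtering

    def filter(self, condition: Callable[[dict[str, Any]], bool]) -> ToyDataset:
        # Shallow-copy the instance so every attribute (including `model`)
        # is carried over, then swap in the filtered rows.
        new_instance = copy.copy(self)
        new_instance.rows = [row for row in self.rows if condition(row)]
        return new_instance


# Made-up rows, for illustration only.
predictions = ToyDataset(
    rows=[{"geneId": "geneA", "score": 0.9}, {"geneId": "geneB", "score": 0.1}],
    model="trained-model-placeholder",
)
filtered = predictions.filter(lambda row: row["score"] >= 0.5)
```

Copying the instance instead of calling the constructor with only the filtered data means any attribute added later is propagated without touching filter again.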

🙈 Missing

  • Running the step end to end: so far I have only verified that it works by running predictions.explain() interactively.

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g. poetry run pre-commit run --all-files)?

@ireneisdoomed ireneisdoomed marked this pull request as ready for review December 4, 2024 16:32
@ireneisdoomed ireneisdoomed requested a review from d0choa December 4, 2024 16:32
@github-actions github-actions bot added the Method label Dec 6, 2024
@ireneisdoomed

The new version fixes the bug in the previous one by avoiding dictionary operations: the new map is now built by joining the initial dataframe with the dataframe of contributions.
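
A plain-Python sketch of that join, to show the idea without Spark. The function name attach_contributions, the rowId key and all values are assumptions made up for illustration; the actual implementation works on Spark dataframes and may differ in its keys and column names.

```python
from typing import Any


def attach_contributions(
    predictions: list[dict[str, Any]],
    contributions: list[dict[str, Any]],
    key: str,
) -> list[dict[str, Any]]:
    """Inner-join two row lists on `key` and pack the contribution columns into one map."""
    contributions_by_key = {row[key]: row for row in contributions}
    joined = []
    for row in predictions:
        match = contributions_by_key.get(row[key])
        if match is None:
            continue  # inner join: skip predictions without contributions
        # Every non-key column of the contributions row becomes a map entry.
        shapley_map = {name: value for name, value in match.items() if name != key}
        joined.append({**row, "shapleyValues": shapley_map})
    return joined


# Made-up rows, for illustration only.
example_predictions = [{"rowId": "a", "score": 0.8}, {"rowId": "b", "score": 0.2}]
example_contributions = [{"rowId": "a", "featureX": 0.5, "featureY": -0.1}]
example_joined = attach_contributions(example_predictions, example_contributions, key="rowId")
```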

Getting the Shapley values takes time, but in my experiments creating the Spark dataframe from the Pandas df was the real bottleneck. The code might run into memory issues when run locally on a very big dataframe.

I have tried to avoid this by using Pandas UDFs, taking this and this as a guide, but Spark kept crashing due to serialization issues. Prediction time has gone from 6 min to 13 min (job). All predictions have their explanations built in.

d0choa commented Jan 21, 2025

Related to opentargets/issues#3709

@ireneisdoomed ireneisdoomed removed the request for review from d0choa January 27, 2025 19:22
@ireneisdoomed

Dataset with real annotated predictions: gs://ot-team/irene/shap/1302/predictions.

The inputs to generate the new predictions are based on the 24.12 release.

QC

  • 973,480 predictions after filtering, same as gs://open-targets-pre-data-releases/24.12-uo_test-3/output/genetics/parquet/l2g_predictions
  • All of them with the same score
  • After approximating the probabilities, the sum of the SHAP values plus the base value matches the L2G score in practically all cases, within a margin of error of 0.001
+--------------------+------+
|sum_equals_l2g_score| count|
+--------------------+------+
|                true|973442|
|               false|    38|
+--------------------+------+
  • Those 38 predictions tend to be low: their median is ~0.2
  • The marginal contributions for the top scoring gene of the locus we were looking at (2089b267ff0a27715af4b75d81abd834) make sense. Values are not the same as the ones here (different background data), but the trends hold:
    • Biggest contribution is distanceSentinelFootprintNeighbourhood (-0.27) balanced out by distanceSentinelFootprint (0.26)
    • VEP mean has a big impact (0.17)
    • And the sceQTL is also relevant (0.09642758)
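
The additivity check from the QC list can be sketched in plain Python as follows. The function name sum_equals_l2g_score mirrors the column name in the table above, but its signature and the example numbers are assumptions made up for illustration; the real check ran over the Spark dataframe of predictions.

```python
import math


def sum_equals_l2g_score(
    base_value: float,
    shapley_values: dict[str, float],
    score: float,
    tolerance: float = 1e-3,
) -> bool:
    """Return True if base value + sum of contributions matches the score within `tolerance`."""
    reconstructed = base_value + math.fsum(shapley_values.values())
    return abs(reconstructed - score) <= tolerance


# Made-up numbers, for illustration only.
toy_contributions = {"featureA": 0.25, "featureB": -0.02}
within_tolerance = sum_equals_l2g_score(0.07, toy_contributions, score=0.3004)
outside_tolerance = sum_equals_l2g_score(0.07, toy_contributions, score=0.31)
```

Here the reconstructed value is 0.30, so a score of 0.3004 passes the check and a score of 0.31 fails it.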

The job took 30 minutes.

@project-defiant project-defiant self-requested a review February 18, 2025 16:20

@project-defiant project-defiant left a comment

Is it on purpose that the same shapBaseValue is stored on every row?

+-------------+-------+
|shapBaseValue|  count|
+-------------+-------+
|  0.070535816|973480|
+-------------+-------+

@ireneisdoomed

Is it on purpose that the same shapBaseValue is stored on every row?

+-------------+-------+
|shapBaseValue|  count|
+-------------+-------+
|  0.070535816|973480|
+-------------+-------+

Yes. This value is constant for all predictions.

@project-defiant project-defiant left a comment

Please have a look at the comments, nothing major, just wanted to clarify.

Great work on the explainability!

src/gentropy/dataset/dataset.py (outdated, resolved)
src/gentropy/dataset/l2g_prediction.py (resolved)
src/gentropy/dataset/l2g_prediction.py (resolved)
src/gentropy/dataset/l2g_prediction.py (resolved)
@ireneisdoomed ireneisdoomed merged commit f952f6c into dev Feb 19, 2025
7 checks passed
@ireneisdoomed ireneisdoomed deleted the il-shapley-predictions branch February 19, 2025 15:17
Successfully merging this pull request may close these issues.

Add shapley values to L2G predictions