feat(l2gprediction): add score explanation based on features #939

ireneisdoomed · 2024-12-03T11:40:12Z

✨ Context

This is what the prioritisation for the 44acafc7985c3180b072394a28d7bad9 locus looks like:

--------------------------------------------------------------------------------------------------------------------------
 geneId        | ENSG00000075073
 shapleyValues | {pQtlColocH4MaximumNeighbourhood -> -0.08760687195471195, eQtlColocH4Maximum -> -0.14339800289527474, distanceTssMean -> 0.5949956624176115, vepMeanNeighbourhood -> -0.011421102911212407, geneCount500kb -> 0.504088935973282, eQtlColocClppMaximumNeighbourhood -> 3.8670788812320505E-4, credibleSetConfidence -> 0.3579454324922001, distanceTssMeanNeighbourhood -> 2.1287083945895433, distanceSentinelTssNeighbourhood -> 0.06814055239948229, pQtlColocH4Maximum -> -4.429795836247109E-5, sQtlColocH4MaximumNeighbourhood -> -0.0014771542785202165, pQtlColocClppMaximum -> 0.0, distanceFootprintMeanNeighbourhood -> 3.5838524396517637, eQtlColocH4MaximumNeighbourhood -> 9.521371211488606, sQtlColocH4Maximum -> 0.08286727373455878, eQtlColocClppMaximum -> 1.8582865064964005, distanceSentinelTss -> 1.2031946462979695, sQtlColocClppMaximum -> -0.19218105019873652, distanceFootprintMean -> -0.21218899079465908, sQtlColocClppMaximumNeighbourhood -> -0.8441121532817452, isProteinCoding -> 1.3144695281881593, pQtlColocClppMaximumNeighbourhood -> -0.019016532039634684, distanceSentinelFootprintNeighbourhood -> 0.3102929859324838, vepMaximumNeighbourhood -> 0.047033325222490756, proteinGeneCount500kb -> 0.2529121503720645, vepMean -> 0.28520782840325626, distanceSentinelFootprint -> 0.3116443797012342, vepMaximum -> 0.0} 
--------------------------------------------------------------------------------------------------------------------------
 geneId        | ENSG00000156515
 shapleyValues | {pQtlColocH4MaximumNeighbourhood -> -0.06818768537940262, eQtlColocH4Maximum -> -0.14336530875137046, distanceTssMean -> 1.5530291409428105, vepMeanNeighbourhood -> 0.27627936610430315, geneCount500kb -> 0.7291870883206015, eQtlColocClppMaximumNeighbourhood -> -0.27041447664625295, credibleSetConfidence -> 0.27395001762237353, distanceTssMeanNeighbourhood -> 1.8170243838499736, distanceSentinelTssNeighbourhood -> 0.06262635310529577, pQtlColocH4Maximum -> -4.429795836247109E-5, sQtlColocH4MaximumNeighbourhood -> -0.0015916146826161937, pQtlColocClppMaximum -> 0.0, distanceFootprintMeanNeighbourhood -> 3.3691309949297295, eQtlColocH4MaximumNeighbourhood -> 9.585267225244944, sQtlColocH4Maximum -> 0.5632581225491728, eQtlColocClppMaximum -> 0.37135374806985344, distanceSentinelTss -> -0.12253906908689487, sQtlColocClppMaximum -> -0.14436242678936584, distanceFootprintMean -> -0.12424150830529755, sQtlColocClppMaximumNeighbourhood -> 1.0804192712631597, isProteinCoding -> 1.1216576014723711, pQtlColocClppMaximumNeighbourhood -> -0.019016532039634684, distanceSentinelFootprintNeighbourhood -> 0.46839704438965074, vepMaximumNeighbourhood -> -0.034280363722216364, proteinGeneCount500kb -> 0.39402035712660993, vepMean -> 0.7099149349443197, distanceSentinelFootprint -> 0.3021197989635074, vepMaximum -> 0.0}

All results available at: gs://ot-team/irene/l2g/06122024/locus_to_gene_predictions
All predictions have their corresponding explanations.

🛠 What does this PR implement

New shapleyValues field (map type) in the prediction schema.
New util convert_map_type_to_columns, which converts the feature annotation in the locusToGeneFeatures map column into a dataframe that can be passed to the SHAP explainer.
Added model as an instance attribute of the predictions dataset.
New explain method in the predictions dataset. It calculates Shapley values and returns a new object with the added column.
Edited the step to add this information.
Enhanced Dataset.filter so that the new instance it returns keeps the instance attributes. This was necessary to propagate the model instance attribute after each modification of the predictions dataset.
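
To illustrate the last point, here is a minimal plain-Python sketch of a filter that returns a new instance while keeping every instance attribute. It is not the gentropy implementation: the toy Dataset and Predictions classes, the rows attribute, the scores and the placeholder model are all assumptions made for the example.

```python
import copy
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class Dataset:
    """Toy stand-in for a dataset wrapper; rows are plain dicts here."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows

    def filter(self, condition: Callable[[Row], bool]) -> "Dataset":
        # Shallow-copy the instance so that every attribute set on it
        # (for example a model) is carried over to the new object.
        new = copy.copy(self)
        # Replace only the data; the original instance is left untouched.
        new.rows = [row for row in self.rows if condition(row)]
        return new


class Predictions(Dataset):
    """Hypothetical predictions dataset that also carries its model."""

    def __init__(self, rows: List[Row], model: Any = None) -> None:
        super().__init__(rows)
        self.model = model


# Made-up scores, for illustration only.
predictions = Predictions(
    rows=[
        {"geneId": "ENSG00000075073", "score": 0.9},
        {"geneId": "ENSG00000156515", "score": 0.1},
    ],
    model="fitted-model-placeholder",
)
high_scoring = predictions.filter(lambda row: row["score"] >= 0.5)
```

Copying the instance rather than calling the constructor means subclasses do not need to re-pass their extra attributes, which is the behaviour the explain method relies on for model.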

🙈 Missing

Running the step end to end: so far I have only checked that it works by running predictions.explain() interactively.

🚦 Before submitting

Do these changes cover one single feature (one change at a time)?
Did you read the contributor guideline?
Did you make sure to update the documentation with your changes?
Did you make sure there is no commented out code in this PR?
Did you follow conventional commits standards in PR title and commit messages?
Did you make sure the branch is up-to-date with the dev branch?
Did you write any new necessary tests?
Did you make sure the changes pass local tests (make test)?
Did you make sure the changes pass pre-commit rules (e.g. poetry run pre-commit run --all-files)?

ireneisdoomed · 2024-12-06T17:21:01Z

The new version fixes the bug in the previous one by avoiding operations on dictionaries: the new map is now built by joining the initial dataframe with the dataframe of contributions.

Computing the Shapley values takes time, but in my experiments the real bottleneck was creating the Spark dataframe from the Pandas dataframe. The code may run into memory issues when run locally on a very large dataframe.

I tried to avoid this by using Pandas UDFs, taking this and this as a guide, but Spark kept crashing due to serialization issues. Prediction has now gone from 6 min to 13 min (job). All predictions have their explanations built in.
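
The join-based approach described above can be sketched in plain Python, without Spark or SHAP. This is only an illustration of the idea: the feature order, the helper names map_to_columns and attach_contributions, the zero default for missing features and all the numbers are assumptions, and the contributions are placeholders rather than real explainer output.

```python
from typing import Any, Dict, List, Tuple

Key = Tuple[str, str]  # (studyLocusId, geneId)

# Hypothetical feature order; the explainer must see the columns
# in the same order the model was trained with.
FEATURES: List[str] = ["distanceTssMean", "vepMean", "isProteinCoding"]

# Made-up rows: each prediction carries its features as a map.
rows: List[Dict[str, Any]] = [
    {"studyLocusId": "locus1", "geneId": "geneA",
     "features": {"vepMean": 0.5, "distanceTssMean": 0.9, "isProteinCoding": 1.0}},
    {"studyLocusId": "locus1", "geneId": "geneB",
     "features": {"distanceTssMean": 0.2, "isProteinCoding": 1.0}},
]


def map_to_columns(rows: List[Dict[str, Any]], features: List[str]) -> List[List[float]]:
    """Pivot each row's feature map into a fixed-order list of values.

    Missing features default to 0.0 (an assumption of this sketch).
    """
    return [[row["features"].get(name, 0.0) for name in features] for row in rows]


def attach_contributions(
    rows: List[Dict[str, Any]],
    contributions: Dict[Key, List[float]],
    features: List[str],
) -> List[Dict[str, Any]]:
    """Join contributions back to the rows by key and store them as a map."""
    by_key = {key: dict(zip(features, values)) for key, values in contributions.items()}
    return [
        {**row, "shapleyValues": by_key[(row["studyLocusId"], row["geneId"])]}
        for row in rows
    ]


matrix = map_to_columns(rows, FEATURES)

# Placeholder contributions standing in for the explainer's output.
contributions: Dict[Key, List[float]] = {
    ("locus1", "geneA"): [0.30, 0.10, 0.05],
    ("locus1", "geneB"): [-0.20, 0.00, 0.05],
}
explained = attach_contributions(rows, contributions, FEATURES)
```

Joining on the row key keeps each contribution attached to the right prediction regardless of row order, and the original rows are not mutated.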

d0choa · 2025-01-21T17:09:01Z

Related to opentargets/issues#3709

ireneisdoomed · 2025-02-13T19:21:55Z

Dataset with real annotated predictions: gs://ot-team/irene/shap/1302/predictions.

The inputs to generate the new predictions are based on the 24.12 release.

QC

973,480 predictions after filtering, same as gs://open-targets-pre-data-releases/24.12-uo_test-3/output/genetics/parquet/l2g_predictions
All of them have the same score as in that dataset
After approximating the probabilities, the sum of the SHAP values plus the base value adds up to the L2G score in practically all cases, within a tolerance of 0.001

+--------------------+------+
|sum_equals_l2g_score| count|
+--------------------+------+
|                true|973442|
|               false|    38|
+--------------------+------+

Those 38 predictions tend to have low scores; their median is ~0.2.
The marginal contributions for the top scoring gene of the locus we were looking at (2089b267ff0a27715af4b75d81abd834) make sense. The values are not the same as the ones here (different background data), but the trends hold:
- Biggest contribution is distanceSentinelFootprintNeighbourhood (-0.27) balanced out by distanceSentinelFootprint (0.26)
- VEP mean has a big impact (0.17)
- And the sceQTL is also relevant (0.09642758)
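
The additivity check from the QC above can be sketched as a small function. This is a hedged illustration, not the code used to produce the table: the function name shap_sum_matches_score and all the numbers are made up, and only the 0.001 tolerance comes from the comment.

```python
import math
from typing import Dict


def shap_sum_matches_score(
    base_value: float,
    shap_values: Dict[str, float],
    score: float,
    tolerance: float = 0.001,
) -> bool:
    """Return True if base value + sum of contributions is within tolerance of the score."""
    total = base_value + math.fsum(shap_values.values())
    return abs(total - score) <= tolerance


# Made-up contributions for a single prediction (illustrative only).
example_shap = {"featureA": 0.25, "featureB": -0.05, "featureC": 0.10}

within_tolerance = shap_sum_matches_score(0.07, example_shap, score=0.3705)
outside_tolerance = shap_sum_matches_score(0.07, example_shap, score=0.38)
```

Here the contributions plus the base value sum to 0.37, so a score of 0.3705 passes the check and a score of 0.38 does not.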

The job has taken 30 mins.

project-defiant

Is that on purpose to store the repetition of the ShapBaseValue at each row?

+-------------+-------+
|shapBaseValue|  count|
+-------------+-------+
|  0.070535816|973480|
+-------------+-------+

ireneisdoomed · 2025-02-18T16:50:43Z

Is that on purpose to store the repetition of the ShapBaseValue at each row?
+-------------+-------+
|shapBaseValue|  count|
+-------------+-------+
|  0.070535816|973480|
+-------------+-------+

Yes. This value is constant for all predictions.

project-defiant

Please have a look at the comments, nothing major, just wanted to clarify.

Great work on the explainability!

src/gentropy/dataset/dataset.py

src/gentropy/dataset/l2g_prediction.py

ireneisdoomed added 3 commits December 3, 2024 09:33

feat(prediction): add model as instance attribute

72259fc

feat: added convert_map_type_to_columns spark util

9e8c491

feat(prediction): new method explain returns shapley values

450a937

github-actions bot added size-S Dataset Feature labels Dec 3, 2024

ireneisdoomed added 3 commits December 4, 2024 15:58

feat(prediction): explain returns predictions with shapley values

08ae6bd

chore: compute shapleyValues in the l2g step

9d40e62

Merge branch 'dev' of https://github.com/opentargets/gentropy into il…

125425f

…-shapley-predictions

github-actions bot added size-M Step and removed size-S labels Dec 4, 2024

ireneisdoomed marked this pull request as ready for review December 4, 2024 16:32

ireneisdoomed requested a review from d0choa December 4, 2024 16:32

ireneisdoomed added 6 commits December 5, 2024 17:53

refactor: use pandas udf instead

f407512

refactor: forget about udfs and get shaps single threaded

f542395

chore: remove reference to chromatin interaction data in HF card

9403fe6

fix(l2g_prediction): methods that return new instance preserve attribute

1bc6f3a

feat(dataset): filter method preserves all instance attributes

8420933

Merge branch 'dev' of https://github.com/opentargets/gentropy into il…

8a85f4f

…-shapley-predictions

github-actions bot added the Method label Dec 6, 2024

ireneisdoomed added 2 commits January 27, 2025 14:44

Merge branch 'dev' of https://github.com/opentargets/gentropy into il…

e6249c0

…-shapley-predictions

feat(l2gmodel): add features_list as model attribute and load it from…

b987496

… the hub metadata

ireneisdoomed removed the request for review from d0choa January 27, 2025 19:22

ireneisdoomed added 4 commits January 28, 2025 10:15

chore: merge

3e99415

fix: pass correct order of features to shapley explainer

12de669

feat(l2g): predict mode to extract feature list from model, not from …

78027da

…config

feat(l2g): pass default features list if model is loaded from a path

48b78ab

github-actions bot added the size-L label Jan 28, 2025

pre-commit-ci bot and others added 9 commits January 28, 2025 18:30

chore: pre-commit auto fixes [...]

30a4676

feat: report as log odds

9a98332

feat: calculate scaled probabilities

1fc73ca

chore(l2gprediction): remove shapBaseProbability

625992a

chore: correct typo in add_features and make schemas non nullable

134bc51

fix: rename columns in pandas df after pivoting

ee44c46

fix: add raw shap contributions

e927b44

chore: merge

4bca7c1

Merge branch 'il-shapley-predictions' of https://github.com/opentarge…

7b9aa03

…ts/gentropy into il-shapley-predictions

github-actions bot added size-M and removed size-L labels Feb 6, 2025

ireneisdoomed mentioned this pull request Feb 6, 2025

SHAP values are not additive in a probability space opentargets/issues#3755

Open

3 tasks

ireneisdoomed added 9 commits February 12, 2025 17:03

fix(model): when saving create directory if not exists

cfc4529

feat(l2g): bundle model and training data in hf

fc32ba4

feat(model): include data when loading model

37b83ac

feat: final version of shap explanations

62f45b4

Merge branch 'dev' of https://github.com/opentargets/gentropy into il…

c388ff5

…-shapley-predictions

fix: do not infer features_list from df

c635e18

fix: get_features_list_from_metadata returned cols that were not feat…

58f35d9

…ures

refactor(model): read training data in the local filesystem w pandas

d45acea

chore: successful run, remove test

7b826a4

project-defiant self-requested a review February 18, 2025 16:20

project-defiant reviewed Feb 18, 2025

project-defiant approved these changes Feb 19, 2025

Merge branch 'dev' into il-shapley-predictions

0f6b7e0

ireneisdoomed merged commit f952f6c into dev Feb 19, 2025
7 checks passed

ireneisdoomed deleted the il-shapley-predictions branch February 19, 2025 15:17
