Parsing dbNSFP values in seqr #3896

gilmorera · 2024-02-16T00:07:14Z

gilmorera
Feb 16, 2024

Hi all.

Our team has updated to dbNSFP v4.5a, and we have some questions about how the dbNSFP fields are parsed in seqr and the seqr-pipeline.

For sites with multiple transcripts, fields like VEST4, REVEL, and AlphaMissense (new in v4.5a) contain multiple values.

#chr	pos(1-based)	ref	alt	rs_dbSNP	hg18_chr	hg18_pos(1-based)	genename	Ensembl_geneid	Ensembl_transcriptid	Ensembl_proteinid	APPRIS	GENCODE_basic	TSL	VEP_canonical	VEST4_score	REVEL_score	REVEL_rankscore	AlphaMissense_score	AlphaMissense_rankscore	AlphaMissense_pred
2	26311219	G	A	rs139522210	2	26387591	ADGRF3;ADGRF3;ADGRF3	ENSG00000173567;ENSG00000173567;ENSG00000173567	ENST00000333478;ENST00000421160;ENST00000311519	ENSP00000327396;ENSP00000388537;ENSP00000307831	principal3;alternative2;alternative2	Y;Y;Y	5;2;1	.;.;YES	0.089;0.085;0.081	.;.;.	.	.;.;.	.	.;.;.
2	26311219	G	C	rs139522210	2	26387591	ADGRF3;ADGRF3;ADGRF3	ENSG00000173567;ENSG00000173567;ENSG00000173567	ENST00000333478;ENST00000421160;ENST00000311519	ENSP00000327396;ENSP00000388537;ENSP00000307831	principal3;alternative2;alternative2	Y;Y;Y	5;2;1	.;.;YES	0.126;0.16;0.18	0.048;0.048;0.048	0.13305	.;0.0628;0.0678	0.02733	.;B;B

From what I can tell, these values are currently parsed on the frontend by picking one of (or the first of?) the non-missing values, i.e. those that are not `.`. However, there isn't always just one non-missing value. Should we instead use one of the transcript quality tags to pick the right score (looking at you, canonical 😄)?

From the dbNSFP readme:

25	APPRIS: APPRIS annotation for the transcripts matching Ensembl_transcriptid
		Multiple entries separated by ";". Potential values: principal1, principal2, 
		principal3, principal4, principal5, alternative1, alternative2. 
		See https://useast.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
26	GENCODE_basic: Whether the transcript belongs to GENCODE_basic (5' and 3' complete
		transcripts). Multiple entries separated by ";", matching Ensembl_transcriptid.
		See https://useast.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
27	TSL: Transcript Support Level.
		Multiple entries separated by ";", matching Ensembl_transcriptid.
		Potential values: 1 to 5, NA. 
		See https://useast.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
28	**VEP_canonical:** canonical transcript used in Ensembl.
		Multiple entries separated by ";", matching Ensembl_transcriptid.
		See https://useast.ensembl.org/Help/Glossary?id=521
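
To make the "matching Ensembl_transcriptid" wording concrete, here is a minimal sketch of how the `;`-delimited fields line up by position. The helper name `split_per_transcript` is hypothetical and is not part of seqr or dbNSFP; the row is a subset of the G>C example above. It assumes that a field with a single entry (such as a rankscore) applies to every transcript.

```python
from typing import Dict, List


def split_per_transcript(row: Dict[str, str], fields: List[str]) -> List[Dict[str, str]]:
    """Zip ';'-delimited dbNSFP fields into one dict per transcript (hypothetical helper)."""
    columns = {name: row[name].split(';') for name in fields}
    n = len(columns['Ensembl_transcriptid'])
    records = []
    for i in range(n):
        record = {}
        for name, values in columns.items():
            if len(values) == n:
                # One entry per transcript: take the entry at the same position.
                record[name] = values[i]
            elif len(values) == 1:
                # Assumption: a single entry applies to all transcripts.
                record[name] = values[0]
            else:
                raise ValueError(f'{name} has {len(values)} entries, expected {n} or 1')
        records.append(record)
    return records


row = {
    'Ensembl_transcriptid': 'ENST00000333478;ENST00000421160;ENST00000311519',
    'APPRIS': 'principal3;alternative2;alternative2',
    'TSL': '5;2;1',
    'VEP_canonical': '.;.;YES',
    'VEST4_score': '0.126;0.16;0.18',
    'REVEL_rankscore': '0.13305',
}
records = split_per_transcript(row, list(row))
```

In this row the third record (ENST00000311519) is the one flagged `YES` in VEP_canonical, and its VEST4 score is 0.18.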

If so, the best approach seems to be a custom "select" function for these `;`-delimited fields that picks the score belonging to the VEP_canonical transcript for REVEL, VEST4, etc., similar to the way it's done for gnomAD:
https://github.com/broadinstitute/seqr-loading-pipelines/blob/54ed7bb07c719cdc831d1277afbfa595d01b3fc6/v03_pipeline/lib/reference_data/config.py#L96-L122
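
As a rough illustration of what such a "select" function could look like, here is a plain-Python sketch. It is not the pipeline's Hail code, and the name `select_canonical` is hypothetical. It assumes the canonical transcript is flagged with `YES` in VEP_canonical (as in the rows above), and it falls back to the first non-missing value when no canonical score is available; that fallback is an assumption, not existing seqr behavior.

```python
from typing import Optional

MISSING = '.'


def select_canonical(scores: Optional[str], canonical_flags: Optional[str]) -> Optional[float]:
    """Return the score at the VEP_canonical position, if there is one (hypothetical helper)."""
    if not scores or not canonical_flags:
        return None
    values = scores.split(';')
    flags = canonical_flags.split(';')
    # Only trust positional matching when both fields have the same number of entries.
    if len(values) == len(flags):
        for value, flag in zip(values, flags):
            if flag == 'YES' and value != MISSING:
                return float(value)
    # Assumed fallback: first non-missing value, or None if every entry is '.'.
    return next((float(v) for v in values if v != MISSING), None)


vest4 = select_canonical('0.126;0.16;0.18', '.;.;YES')
alpha_missense = select_canonical('.;0.0628;0.0678', '.;.;YES')
all_missing = select_canonical('.;.;.', '.;.;YES')
no_canonical = select_canonical('0.089;0.085;0.081', '.;.;.')
```

For the G>C row this selects 0.18 for VEST4 and 0.0678 for AlphaMissense, the values at the position flagged `YES`.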

gilmorera · 2024-02-16T13:26:06Z

gilmorera
Feb 16, 2024
Author

I remember seeing a "parse" function somewhere too, but I couldn't find it. I did find an example of how this type of data is parsed in the seqr code:

seqr/seqr/utils/search/elasticsearch/constants.py

Lines 341 to 345 in 9b2815f

    
    'dbnsfp_VEST4_score': {
        'response_key': 'vest',
        'format_value': lambda x: x and next((v for v in x.split(';') if v != '.'), None),
    },
    'dbnsfp_MutPred_score': {'response_key': 'mut_pred', 'format_value': lambda x: None if x == '-' else x},
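
To check what that `format_value` lambda does with multi-transcript values, it can be run on its own against the example strings from the first post. This is only a quick standalone test of the lambda as quoted; the variable names are mine.

```python
# The VEST4 format_value lambda, copied from the seqr snippet quoted above.
format_value = lambda x: x and next((v for v in x.split(';') if v != '.'), None)

first_of_three = format_value('0.126;0.16;0.18')   # all three entries present
skips_missing = format_value('.;0.0628;0.0678')    # leading '.' is skipped
all_missing = format_value('.;.;.')                # nothing but '.'
empty_input = format_value('')                     # falsy input is returned unchanged
```

So the lambda returns the first non-missing entry as a string ('0.126' and '0.0628' here) and `None` when every entry is `.`. It does not consult VEP_canonical or any other transcript tag.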

hanars · 2024-02-16T17:24:08Z

hanars
Feb 16, 2024
Maintainer

So dbNSFP comes into seqr in two places: one to annotate genes and one to annotate variants.
The gene-level annotations are added via a management command: https://github.com/broadinstitute/seqr/blob/master/reference_data/management/commands/update_dbnsfp_gene.py

The variant-level annotations are added in the loading pipeline when variants are joined with the reference data table. Generally speaking, the code for generating and updating the reference data is not well structured for other groups to run; we have no documentation for it, and it's a little finicky. We make the reference data table freely available for download, which is how we recommend users interact with this reference data. However, if you are curious about how the parsing is done for dbNSFP when creating that table, the relevant code is here:
https://github.com/broadinstitute/seqr-loading-pipelines/blob/main/download_and_create_reference_datasets/v02/hail_scripts/write_dbnsfp_ht.py

We have also recently updated seqr to support a new v3 loading pipeline. While this is not yet ready to be used by other groups, you are welcome to take a look at the code to see how we plan to parse dbNSFP going forward with the new pipeline:
https://github.com/broadinstitute/seqr-loading-pipelines/blob/main/v03_pipeline/lib/reference_data/config.py#L209
https://github.com/broadinstitute/seqr-loading-pipelines/blob/main/v03_pipeline/lib/reference_data/config.py#L70
