Scripts: add vcf processing script and dockerfile #153
base: develop
Conversation
You can use the image to run the `parse_vcf` python module, which will parse a gnomAD annotation vcf and write the filtered results to a csv that can be imported into a database.

```bash
docker run --rm -v `pwd`:`pwd` -w `pwd` --entrypoint="python" bcftools -m parse_vcf --source gnomad_2.1.1_exome --out my-local-results-directory https://gnomad-public-us-east-1.s3.amazonaws.com/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz
```
Do we know what gnomAD version we're supposed to use going forward, or is the consensus that we will always use the most up-to-date one?
For the 2.x releases, the most recent is what we'd like, I think. By specifying the source/version in the `--source` arg, there will be a record of it for each annotation, which hopefully will cut down on confusion. The 3.x releases are for GRCh38 only, so there may be special considerations there, but I would also assume the latest would be best. If a new version is released and users would like to see it, the script can be run again and the database updated.
## What the script does
The parsing script will process a vcf of variant annotations, returning a subset of fields and combining them into a single csv, which can then be moved to a database. By filtering out unnecessary values, the size of the annotation file can be greatly reduced. Besides the `pos`, `chrom`, `alt`, and `ref` fields, the script will also return `nhomalt`, `AF`, `AC`, and relevant `VEP` fields (if available). If you would like to adjust this output, the script will have to be manually updated (see the `parse_vcf_response` function). There is also some data type coercion, tailored to the configuration of the mongo database for ease of import, that may need to be changed if the script is used for other purposes.
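To make the shape of that output concrete, here is a rough sketch of the kind of per-record filtering and type coercion described above. It is not the actual `parse_vcf_response` implementation: the reader library (pysam), the default values, and the file names are all assumptions.

```python
# Illustrative sketch only -- see parse_vcf_response in the script for the real logic.
# Assumes pysam as the VCF reader; the actual module may read records differently.
import csv
import pysam

FIELDS = ["chrom", "pos", "ref", "alt", "nhomalt", "AF", "AC"]

def filter_record(rec) -> dict:
    """Keep only the fields of interest and coerce types for database import."""
    info = rec.info
    return {
        "chrom": rec.chrom,
        "pos": int(rec.pos),
        "ref": rec.ref,
        "alt": rec.alts[0] if rec.alts else "",
        # INFO fields declared Number=A come back as tuples; take the first allele.
        "nhomalt": int(info.get("nhomalt", (0,))[0]),
        "AF": float(info.get("AF", (0.0,))[0]),
        "AC": int(info.get("AC", (0,))[0]),
    }

with pysam.VariantFile("gnomad.genomes.r2.1.1.sites.vcf.bgz") as vcf, \
        open("results.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for rec in vcf:
        writer.writerow(filter_record(rec))
```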
To confirm, we are only using `nhomalt`, `AF`, and `AC` from gnomAD, right?
Yeah -- for now we aren't using the VEP fields, as they're coming from CADD. They can either be removed at processing time by commenting out the appropriate lines in the script or excluded from the database by not uploading them into mongo. Or they could be included and simply not queried, in case they might be needed later.
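If the "exclude from the database" route is taken, the VEP columns could also be dropped from the finished csv just before import. A minimal sketch, assuming column names that may not match the script's actual output header:

```python
# Drop everything except the core columns before loading into mongo.
# Column names here are assumptions -- check the csv header the script actually writes.
import pandas as pd

CORE_COLUMNS = {"chrom", "pos", "ref", "alt", "nhomalt", "AF", "AC"}

df = pd.read_csv("results.csv")
vep_columns = [c for c in df.columns if c not in CORE_COLUMNS]
df.drop(columns=vep_columns).to_csv("results_no_vep.csv", index=False)
```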
The `parse_vcf` module is very much a work in progress and currently has many limitations.
- If execution is interrupted, it will be difficult to restart. While passing a region to the `--start-position` argument will begin the script at a given locus, this is not guaranteed to restart the script where it left off: jobs are handled in parallel and failed jobs are pushed back into the queue, so care will have to be taken to determine which regions remain to be processed when examining an incomplete results file. Also bear in mind that the script will not remove a csv file that already exists; if you are restarting from scratch after an error, be sure to manually remove the results file first (or point to a new one), otherwise the results will be appended to the extant file.
This might be a silly question, but say we are using `parse_vcf` for a large VCF and somewhere along the line the script fails due to some errors that we did not foresee. We'd have to delete the CSV file that existed and rerun the whole process again?
Yes, this is the sad truth. I think that to reliably restart the process, there would need to be some extra work done to parse the csv, compare with what's in the vcf headers, determine the missing rows, and relaunch the parsing script accordingly (the script would probably need to be updated to take a config of many regions).
Right, that's what I thought too. Thanks for clarifying!
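As a rough illustration of the restart bookkeeping discussed above, a hypothetical helper (not part of this PR) could compare a partial results csv against the contigs declared in the VCF header. The column names, file paths, and use of pysam are assumptions, and because jobs run in parallel the per-contig maximum below is not a reliable completeness check on its own.

```python
# Hypothetical restart helper: report, per contig, the position after the last
# one found in the partial csv. Parallel/retried jobs mean earlier gaps are
# still possible, so this is only a starting point.
import pandas as pd
import pysam

def remaining_regions(vcf_path: str, csv_path: str) -> dict:
    done = pd.read_csv(csv_path, usecols=["chrom", "pos"])
    last_done = done.groupby("chrom")["pos"].max().to_dict()
    regions = {}
    with pysam.VariantFile(vcf_path) as vcf:
        for name, contig in vcf.header.contigs.items():
            start = last_done.get(name, 0)
            if contig.length is None or start < contig.length:
                regions[name] = start + 1  # candidate restart locus for this contig
    return regions
```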
## Limitations
The `parse_vcf` module is very much a work in progress and currently has many limitations.
For failed jobs, where would we see the logs?
You'll see them printed to the console, but better logging is definitely needed!
```python
def parse_vep(df: pd.DataFrame):
    # FYI
    VEP_HEADERS = "Allele,Consequence,IMPACT,SYMBOL,Gene,Feature_type,Feature,BIOTYPE,EXON,INTRON,HGVSc,HGVSp,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,ALLELE_NUM,DISTANCE,STRAND,FLAGS,VARIANT_CLASS,MINIMISED,SYMBOL_SOURCE,HGNC_ID,CANONICAL,TSL,APPRIS,CCDS,ENSP,SWISSPROT,TREMBL,UNIPARC,GENE_PHENO,SIFT,PolyPhen,DOMAINS,HGVS_OFFSET,GMAF,AFR_MAF,AMR_MAF,EAS_MAF,EUR_MAF,SAS_MAF,AA_MAF,EA_MAF,ExAC_MAF,ExAC_Adj_MAF,ExAC_AFR_MAF,ExAC_AMR_MAF,ExAC_EAS_MAF,ExAC_FIN_MAF,ExAC_NFE_MAF,ExAC_OTH_MAF,ExAC_SAS_MAF,CLIN_SIG,SOMATIC,PHENO,PUBMED,MOTIF_NAME,MOTIF_POS,HIGH_INF_POS,MOTIF_SCORE_CHANGE,LoF,LoF_filter,LoF_flags,LoF_info\n"
```
Where do these headers come from?
This is a list of the VEP headers that I copy/pasted from (somewhere) for reference.
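In case it helps to tie them back to the logic below, the field indices (e.g. `BIOTYPE_INDEX`) could be derived from that header string as shown here; whether the script computes them this way or hard-codes them is an assumption.

```python
# Derive positional indices from the VEP header reference string.
fields = VEP_HEADERS.strip().split(",")
CANONICAL_INDEX = fields.index("CANONICAL")  # 26
BIOTYPE_INDEX = fields.index("BIOTYPE")      # 7
```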
```python
    is_canonical = True
if i == BIOTYPE_INDEX and val == "protein_coding":
    is_protein_coding = True
if is_canonical and is_protein_coding:
```
Can you elaborate on the significance behind this logic? Is VEP only reliable for canonical and protein coding genes?
There will be a VEP entry for every transcript, and there might be several transcripts for a given SNP --- for now we're only interested in the canonical, protein-coding ones.
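To make that selection concrete, it amounts to something like the sketch below; the pipe-separated entries and the "YES" value for `CANONICAL` follow VEP's usual vcf conventions, and the actual loop in the script may differ.

```python
# Keep only the canonical, protein-coding transcript annotation for a variant.
def pick_canonical_protein_coding(csq_entries, canonical_index, biotype_index):
    for entry in csq_entries:       # one pipe-delimited entry per transcript
        vals = entry.split("|")
        if vals[canonical_index] == "YES" and vals[biotype_index] == "protein_coding":
            return vals
    return None
```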