Scripts: add vcf processing script and dockerfile #153
base: develop
Conversation
You can use the image to run the `parse_vcf` python module, which will parse a gnomAD annotation vcf and write the filtered results to a csv that can be imported into a database.

```bash
docker run --rm -v `pwd`:`pwd` -w `pwd` --entrypoint="python" bcftools -m parse_vcf --source gnomad_2.1.1_exome --out my-local-results-directory https://gnomad-public-us-east-1.s3.amazonaws.com/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz
```
Do we know what gnomAD version we're supposed to use going forward, or is the consensus that we will always use the most up-to-date one?
For the 2.x releases, the most recent is what we'd like, I think. By specifying the source/version in the `--source` arg, there will be a record of it for each annotation, which hopefully will cut down on confusion. The 3.x releases are for GRCh38 only, so there may be special considerations there, but I would also assume the latest would be best. If a new version is released and users would like to see it, the script can be run again and the database updated.
## What the script does
The parsing script will process a vcf of variant annotations, returning a subset of fields and combining them into a single csv, which can then be moved to a database. By filtering out unnecessary values, the size of the annotation file can be greatly reduced. Besides the `pos`, `chrom`, `alt`, and `ref` fields, the script will also return `nhomalt`, `AF`, `AC`, and relevant `VEP` fields (if available). If you would like to adjust this output, the script will have to be manually updated (see the `parse_vcf_response` function). There is also some data type coercion, tailored to the configuration of the mongo database for ease of import, that may need to be changed if the script is used for other purposes.
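To make the shape of that output concrete, here is a rough sketch of the kind of per-record filtering and type coercion described above. It is not the actual `parse_vcf_response` implementation: the reader library (pysam), the default values, and the file names are all assumptions.

```python
# Illustrative sketch only -- see parse_vcf_response in the script for the real logic.
# Assumes pysam as the VCF reader; the actual module may read records differently.
import csv
import pysam

FIELDS = ["chrom", "pos", "ref", "alt", "nhomalt", "AF", "AC"]

def filter_record(rec) -> dict:
    """Keep only the fields of interest and coerce types for database import."""
    info = rec.info
    return {
        "chrom": rec.chrom,
        "pos": int(rec.pos),
        "ref": rec.ref,
        "alt": rec.alts[0] if rec.alts else "",
        # INFO fields declared Number=A come back as tuples; take the first allele.
        "nhomalt": int(info.get("nhomalt", (0,))[0]),
        "AF": float(info.get("AF", (0.0,))[0]),
        "AC": int(info.get("AC", (0,))[0]),
    }

with pysam.VariantFile("gnomad.genomes.r2.1.1.sites.vcf.bgz") as vcf, \
        open("results.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for rec in vcf:
        writer.writerow(filter_record(rec))
```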
To confirm, we are only using `nhomalt`, `AF`, and `AC` from gnomAD, right?
Yeah -- for now we aren't using the VEP fields, as they're coming from CADD. They can either be removed at processing time by commenting out the appropriate lines in the script or excluded from the database by not uploading them into mongo. Or they could be included and simply not queried, in case they might be needed later.
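If the "exclude from the database" route is taken, the VEP columns could also be dropped from the finished csv just before import. A minimal sketch, assuming column names that may not match the script's actual output header:

```python
# Drop everything except the core columns before loading into mongo.
# Column names here are assumptions -- check the csv header the script actually writes.
import pandas as pd

CORE_COLUMNS = {"chrom", "pos", "ref", "alt", "nhomalt", "AF", "AC"}

df = pd.read_csv("results.csv")
vep_columns = [c for c in df.columns if c not in CORE_COLUMNS]
df.drop(columns=vep_columns).to_csv("results_no_vep.csv", index=False)
```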
The `parse_vcf` module is very much a work in progress and currently has many limitations.
- If execution is interrupted, it will be difficult to restart. While passing a region to the `--start-position` argument will begin the script at a given locus, this is not guaranteed to restart the script where it left off: jobs are handled in parallel and failed jobs are pushed back into the queue, so care will have to be taken to determine which regions remain to be processed when examining an incomplete results file. Also bear in mind that the script will not remove a csv file that already exists; if you are restarting from scratch after an error, be sure to manually remove the results file first (or point to a new one), otherwise the results will be appended to the extant file.
This might be a silly question, but say we are using `parse_vcf` for a large VCF and somewhere along the line the script fails due to some errors that we did not foresee. We'd have to delete the CSV file that existed and rerun the whole process again?
Yes, this is the sad truth. I think that to reliably restart the process, there would need to be some extra work done to parse the csv, compare with what's in the vcf headers, determine the missing rows, and relaunch the parsing script accordingly (the script would probably need to be updated to take a config of many regions).
Right, that's what I thought too. Thanks for clarifying!
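As a rough illustration of the restart bookkeeping discussed above, a hypothetical helper (not part of this PR) could compare a partial results csv against the contigs declared in the VCF header. The column names, file paths, and use of pysam are assumptions, and because jobs run in parallel the per-contig maximum below is not a reliable completeness check on its own.

```python
# Hypothetical restart helper: report, per contig, the position after the last
# one found in the partial csv. Parallel/retried jobs mean earlier gaps are
# still possible, so this is only a starting point.
import pandas as pd
import pysam

def remaining_regions(vcf_path: str, csv_path: str) -> dict:
    done = pd.read_csv(csv_path, usecols=["chrom", "pos"])
    last_done = done.groupby("chrom")["pos"].max().to_dict()
    regions = {}
    with pysam.VariantFile(vcf_path) as vcf:
        for name, contig in vcf.header.contigs.items():
            start = last_done.get(name, 0)
            if contig.length is None or start < contig.length:
                regions[name] = start + 1  # candidate restart locus for this contig
    return regions
```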
## Limitations
The `parse_vcf` module is very much a work in progress and currently has many limitations.
For failed jobs, where would we see the logs?
You'll see them printed to the console, but better logging is definitely needed!
```python
def parse_vep(df: pd.DataFrame):
    # FYI
    VEP_HEADERS = "Allele,Consequence,IMPACT,SYMBOL,Gene,Feature_type,Feature,BIOTYPE,EXON,INTRON,HGVSc,HGVSp,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,ALLELE_NUM,DISTANCE,STRAND,FLAGS,VARIANT_CLASS,MINIMISED,SYMBOL_SOURCE,HGNC_ID,CANONICAL,TSL,APPRIS,CCDS,ENSP,SWISSPROT,TREMBL,UNIPARC,GENE_PHENO,SIFT,PolyPhen,DOMAINS,HGVS_OFFSET,GMAF,AFR_MAF,AMR_MAF,EAS_MAF,EUR_MAF,SAS_MAF,AA_MAF,EA_MAF,ExAC_MAF,ExAC_Adj_MAF,ExAC_AFR_MAF,ExAC_AMR_MAF,ExAC_EAS_MAF,ExAC_FIN_MAF,ExAC_NFE_MAF,ExAC_OTH_MAF,ExAC_SAS_MAF,CLIN_SIG,SOMATIC,PHENO,PUBMED,MOTIF_NAME,MOTIF_POS,HIGH_INF_POS,MOTIF_SCORE_CHANGE,LoF,LoF_filter,LoF_flags,LoF_info\n"
```
Where do these headers come from?
This is a list of the VEP headers that I copy/pasted from (somewhere) for reference.
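In case it helps to tie them back to the logic below, the field indices (e.g. `BIOTYPE_INDEX`) could be derived from that header string as shown here; whether the script computes them this way or hard-codes them is an assumption.

```python
# Derive positional indices from the VEP header reference string.
fields = VEP_HEADERS.strip().split(",")
CANONICAL_INDEX = fields.index("CANONICAL")  # 26
BIOTYPE_INDEX = fields.index("BIOTYPE")      # 7
```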
```python
    is_canonical = True
if i == BIOTYPE_INDEX and val == "protein_coding":
    is_protein_coding = True
if is_canonical and is_protein_coding:
```
Can you elaborate on the significance behind this logic? Is VEP only reliable for canonical and protein coding genes?
There will be a VEP entry for every transcript, and there might be several transcripts for a given SNP --- for now we're only interested in the canonical, protein-coding ones.
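To make that selection concrete, it amounts to something like the sketch below; the pipe-separated entries and the "YES" value for `CANONICAL` follow VEP's usual vcf conventions, and the actual loop in the script may differ.

```python
# Keep only the canonical, protein-coding transcript annotation for a variant.
def pick_canonical_protein_coding(csq_entries, canonical_index, biotype_index):
    for entry in csq_entries:       # one pipe-delimited entry per transcript
        vals = entry.split("|")
        if vals[canonical_index] == "YES" and vals[biotype_index] == "protein_coding":
            return vals
    return None
```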