Scripts: add vcf processing script and dockerfile #153

Open

cklamann wants to merge 1 commit into develop
Conversation

@cklamann (Contributor) commented Dec 9, 2021

No description provided.

@cklamann requested a review from hannahle December 9, 2021 17:35
You can use the image to run the `parse_vcf` python module, which will parse a gnomAD annotation vcf and write the filtered results to a csv that can be imported into a database.

```bash
docker run --rm -v `pwd`:`pwd` -w `pwd` --entrypoint="python" bcftools -m parse_vcf --source gnomad_2.1.1_exome --out my-local-results-directory https://gnomad-public-us-east-1.s3.amazonaws.com/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz
```
Collaborator:

Do we know what gnomad version we're supposed to use going forward, or is the consensus that we will always use the most up-to-date one?

Contributor Author:

For the 2.x releases, I think the most recent is what we'd like. By specifying the source/version in the `--source` arg, there will be a record of it for each annotation, which hopefully will cut down on confusion. The 3.x releases are for GRCh38 only, so there may be special considerations there, but I would also assume the latest would be best. If a new version is released and users would like to see it, the script can be run again and the database updated.


## What the script does

The parsing script will process a vcf of variant annotations, returning a subset of fields and combining them into a single csv, which can then be moved to a database. By filtering out unnecessary values, the size of the annotation file can be greatly reduced. Besides the `pos`, `chrom`, `alt`, and `ref` fields, the script will also return `nhomalt`, `AF`, `AC`, and relevant `VEP` fields (if available). If you would like to adjust this output, the script will have to be manually updated (see the `parse_vcf_response` function). There is also some data type coercion tailored to the configuration of the mongo database for ease of import; this may need to be changed if the script is used for other purposes.
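For orientation, here is a minimal sketch of the kind of field selection and type coercion described above. It is not the actual `parse_vcf_response` logic, and the flat record layout shown is an assumption.

```python
# Hypothetical sketch only: keep the fields of interest and coerce types so the
# rows import cleanly into mongo. The real logic lives in parse_vcf_response and
# is not reproduced here.
def filter_record(record: dict) -> dict:
    return {
        "chrom": str(record["chrom"]),
        "pos": int(record["pos"]),
        "ref": record["ref"],
        "alt": record["alt"],
        "nhomalt": int(record["nhomalt"]),
        "AC": int(record["AC"]),
        "AF": float(record["AF"]),
    }
```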
Collaborator:

To confirm, we are only using nhomalt, AF, and AC from gnomad, right?

Contributor Author:

Yeah -- for now we aren't using the VEP fields, as they're coming from CADD. They can either be removed at processing time by commenting out the appropriate lines in the script or excluded from the database by not uploading them into mongo. Or they could be included and simply not queried, in case they might be needed later.


## Limitations

The `parse_vcf` module is very much a work in progress and currently has many limitations.

- If execution is interrupted, it will be difficult to restart. While passing a region to the `--start-position` argument will begin the script at a given locus, this is not guaranteed to restart the script where it left off. Because jobs are handled in parallel and, moreover, failed jobs are pushed back into the queue, care will have to be taken to determine which regions remain to be processed when examining an incomplete results file. Also, bear in mind that the script will not remove a csv file that already exists, so if you are restarting the script from scratch after an error, be sure to manually remove the results file first (or point to a new one); otherwise new results will be appended to the existing file (see the restart sketch below).
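As a rough sketch of such a restart (the region format passed to `--start-position` and the results filename are assumptions, not documented behaviour):

```bash
# Hypothetical restart: remove or rename the stale results file first, then
# resume from an approximate locus; regions before it may still need re-checking.
rm my-local-results-directory/results.csv   # filename is illustrative
docker run --rm -v `pwd`:`pwd` -w `pwd` --entrypoint="python" bcftools \
  -m parse_vcf --source gnomad_2.1.1_exome --out my-local-results-directory \
  --start-position 1:1000000 \
  https://gnomad-public-us-east-1.s3.amazonaws.com/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz
```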
Collaborator:

This might be a silly question, but say we are using parse_vcf for a large VCF and, somewhere along the line, the script fails due to errors we did not foresee. We'd have to delete the existing CSV file and rerun the whole process?

Contributor Author:

Yes, this is the sad truth. I think that to reliably restart the process, there would need to be some extra work done to parse the csv, compare it with what's in the vcf headers, determine the missing rows, and relaunch the parsing script accordingly (we'd probably need to update the script to take a config of many regions).
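A rough sketch of that bookkeeping, assuming the results csv keeps the `chrom` and `pos` columns described above (this helper is not part of the current script):

```python
import pandas as pd

# Hypothetical helper: report, per chromosome, the highest position that made it
# into a partial results csv. Because jobs run in parallel and failed jobs are
# re-queued, earlier positions may still be missing, so treat this only as a
# starting point for deciding which regions to re-run.
def last_written_positions(results_csv: str) -> pd.Series:
    df = pd.read_csv(results_csv, usecols=["chrom", "pos"])
    return df.groupby("chrom")["pos"].max()

print(last_written_positions("results.csv"))  # filename is illustrative
```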

Collaborator:

Right, that's what I thought too. Thanks for clarifying!


Collaborator:

For failed jobs, where would we see the logs?

Contributor Author:

You'll see them printed to the console, but better logging is definitely needed!


```python
def parse_vep(df: pd.DataFrame):
    # FYI
    VEP_HEADERS = "Allele,Consequence,IMPACT,SYMBOL,Gene,Feature_type,Feature,BIOTYPE,EXON,INTRON,HGVSc,HGVSp,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,ALLELE_NUM,DISTANCE,STRAND,FLAGS,VARIANT_CLASS,MINIMISED,SYMBOL_SOURCE,HGNC_ID,CANONICAL,TSL,APPRIS,CCDS,ENSP,SWISSPROT,TREMBL,UNIPARC,GENE_PHENO,SIFT,PolyPhen,DOMAINS,HGVS_OFFSET,GMAF,AFR_MAF,AMR_MAF,EAS_MAF,EUR_MAF,SAS_MAF,AA_MAF,EA_MAF,ExAC_MAF,ExAC_Adj_MAF,ExAC_AFR_MAF,ExAC_AMR_MAF,ExAC_EAS_MAF,ExAC_FIN_MAF,ExAC_NFE_MAF,ExAC_OTH_MAF,ExAC_SAS_MAF,CLIN_SIG,SOMATIC,PHENO,PUBMED,MOTIF_NAME,MOTIF_POS,HIGH_INF_POS,MOTIF_SCORE_CHANGE,LoF,LoF_filter,LoF_flags,LoF_info\n"
```
Collaborator:

Where do these headers come from?

Contributor Author:

This is a list of the VEP headers that I copy/pasted from (somewhere) for reference.
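As a sketch of how that reference list can be put to use, the header string could be turned into a name-to-index map. The constant names below are assumptions that mirror the indexes used further down, not the script's actual code.

```python
# Hypothetical: map each VEP field name to its position so pipe-delimited VEP
# entries can be indexed by name. Assumes VEP_HEADERS as defined above is in scope.
VEP_FIELDS = VEP_HEADERS.strip().split(",")
VEP_FIELD_INDEX = {name: i for i, name in enumerate(VEP_FIELDS)}

BIOTYPE_INDEX = VEP_FIELD_INDEX["BIOTYPE"]      # used below to spot protein_coding entries
CANONICAL_INDEX = VEP_FIELD_INDEX["CANONICAL"]  # assumed; the canonical check is not shown in the excerpt
```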

```python
        is_canonical = True
    if i == BIOTYPE_INDEX and val == "protein_coding":
        is_protein_coding = True
if is_canonical and is_protein_coding:
```
Collaborator:

Can you elaborate on the significance behind this logic? Is VEP only reliable for canonical and protein coding genes?

Contributor Author:

There will be a VEP entry for every transcript, of which there might be several for a given SNP --- for now we're only interested in canonical/protein coding.
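A condensed sketch of the selection that logic performs, using the index constants from the header sketch above. The comma/pipe delimiters and the "YES" canonical flag are assumptions about how the VEP annotation is laid out in this vcf.

```python
# Hypothetical, condensed version of the filter described above: from the
# comma-separated list of per-transcript VEP entries, keep only those that are
# both canonical and protein_coding.
def pick_canonical_protein_coding(vep_value: str) -> list:
    kept = []
    for entry in vep_value.split(","):  # one entry per transcript
        fields = entry.split("|")
        if fields[CANONICAL_INDEX] == "YES" and fields[BIOTYPE_INDEX] == "protein_coding":
            kept.append(entry)
    return kept
```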
