---
title: "SNV Calling and Filtering"
author: "Eugene Gardner"
date: "03 December 2020"
output:
  html_document:
    toc: true
    toc_depth: 2
---
# 1. Startup and Introduction
This Markdown documents the annotation and extraction of SNV and InDel variants from UKBB FE WES files. This annotation uses a custom java pipeline, of which the precompiled jar is included at `scripts/SNVCounter.jar`. The source code and Eclipse project for this pipeline are available in `src/SNVCounter/`.
This code makes use of the following tools:
1. CADD requires a clone and install of their code repo: https://github.com/kircherlab/CADD-scripts/
2. VEP requires a clone and install of their code repo: https://github.com/Ensembl/ensembl-vep
    * Below I provide a rough set of instructions to install [LOFTEE](https://github.com/konradjk/loftee) by Konrad Karczewski from gnomAD, which can be _very_ annoying to set up for hg38
3. To liftover MPC and PEXT to hg38, we need [CrossMap](http://crossmap.sourceforge.net/)
4. As a supplement to CrossMap, a local download of the GRCh38 reference is needed to check that our mapping is correct
5. [bcftools](http://samtools.github.io/bcftools/bcftools.html) is used in many places to parse information out of downloaded vcf/bcf files.
    * The bcftools plugin [split-vep](https://samtools.github.io/bcftools/howtos/plugin.split-vep.html) is also used when parsing vep output. Information on how to use plugins with bcftools is available [here](https://samtools.github.io/bcftools/howtos/index.html).
6. bgzip and tabix, which are provided as part of [htslib](https://github.com/samtools/htslib).
7. bedtools, which is used for various genome arithmetic functions
**Note**: The intent of this document is _not_ to be a runnable document like `PhenotypeTesting.Rmd`; instead, it offers a guide to how SNV/InDel annotation and curation was performed. We have provided the output of this document to UK Biobank, which you can download at:
`https://fake.ukbbiobank_url.com`
You can view a compiled html version of this document with all code run either within this repository at `compiled_htmls/SNVCalling_Filtering.html` or on [github](https://htmlpreview.github.io/?https://github.com/eugenegardner/UKBBFertility/blob/master/compiled_html/SNVCalling_Filtering.html).
```{r setup}
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE ## Warnings turned off to squelch unnecessary ggplot noise in the knitted document. Have checked all for accuracy.
)
## Quietly Load Libraries
load.package <- function(name) {
suppressMessages(suppressWarnings(library(name, quietly = T, warn.conflicts = F, character.only = T)))
}
load.package("readxl") ## Read Supplemental Excel tables
load.package("data.table") ## Better than data.frame
load.package("biomaRt") ## Get gene lists we need
load.package("tidyverse") ## Takes care of ggplot, tidyr, dplyr, and stringr
```
# 2. Required Exome Variant Files
The first step in analysing UK Biobank WES data is to actually acquire the variant files themselves. Please see [this page](https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=23146) for additional information on acquiring the required files. All information is stored in population-level pVCF format as provided by UK Biobank.
**Note**: This document makes use of the latest UKBB release (a.k.a. "The 200k")
**Note**: DO NOT delete the raw files after you download them. VEP annotation requires them to work properly!
Roughly, files were acquired and processed into a VCF format file with the following commands:
```{bash Get UKBB VCF, eval = F}
## Going to put these files in vcfs/individual_vcfs/ to make examples in this document consistent:
## Note that UK Biobank has stored variants across 977 separate VCFs and you will need to run these commands to download each pVCF
mkdir -p vcfs/individual_vcfs/
cd vcfs/individual_vcfs/
## First download the blocks reference file:
wget -nd biobank.ndph.ox.ac.uk/showcase/showcase/auxdata/pvcf_blocks.txt
## Then this command downloads ONE pVCF to the folder vcfs/individual_vcfs/. You will need to download the additional 976 files.
## Note that you must also have your keyfile provided by UKB pointed to with the -a flag
# -c is the chromosome and -b is the chromosome slice
gfetch 23156 -ak44165.key -c1 -b0
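## To fetch every one of the 977 blocks you could loop over pvcf_blocks.txt, roughly like this
## (a sketch; assumes the same key file, and gfetch may want numeric codes for the X/Y chromosomes):
while read -r idx chrom block start end; do
  gfetch 23156 -ak44165.key -c"${chrom}" -b"${block}"
done < pvcf_blocks.txt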
## You will then need to run the following on each file:
bcftools index ukb23156_cY_bZ.vcf.gz
## Where:
# Y = chromosome
# Z = block
## Once all 977 vcf slices are downloaded and indexed, go back up one directory (into vcfs/) and run the following to generate a bed file list of VCF files with a tabix index:
cd ../ ## Now in vcfs/
perl -ane 'chomp $_; $loc = $ENV{"PWD"}; print "chr$F[1]\t$F[3]\t$F[4]\t$loc/individual_vcfs/ukb23156_c$F[1]_b$F[2]_v1.vcf.gz\n";' individual_vcfs/pvcf_blocks.txt > ukbb_200k.locations.bed
bgzip ukbb_200k.locations.bed
tabix -p bed ukbb_200k.locations.bed.gz ## Yes, I know bed is 0-based but I handle that in my subsequent code
```
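If everything worked, each row of `ukbb_200k.locations.bed.gz` maps one pVCF block to its genomic interval, with the absolute path to that block's vcf in the fourth column. A quick way to eyeball this (the coordinates and path below are illustrative, not real block boundaries):
```{bash Check locations bed, eval = F}
## Run from the repository root:
zcat vcfs/ukbb_200k.locations.bed.gz | head -n 2
## Expected layout (illustrative values):
# chr1    1          1234567    /abs/path/vcfs/individual_vcfs/ukb23156_c1_b0_v1.vcf.gz
# chr1    1234568    2345678    /abs/path/vcfs/individual_vcfs/ukb23156_c1_b1_v1.vcf.gz
```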
# 3. Generating Annotation Sources
This process is slightly complicated because the code was written to be project-agnostic, and therefore has to account for more possible sources of data than are strictly needed here. I have provided a simplified version of this annotation for hg38 UK Biobank data, but this code should be fairly adaptable to other sequencing formats.
**Note**: All annotations below are done *without* the "chr" prefix!
Basic Steps:
1. Variant Agnostic Databases -- For each of the below annotations I need to generate GRCh38 versions. Annotations are stored in the root folder `annotations/hg38/`, with the corresponding directory name in brackets below:
    * CADD scores [`cadd/`]
        + We also have to generate CADD scores for variants not included in the default build.
        + I believe this is completely limited to InDels.
    * gnomAD allele frequencies [`gnomad/`]
        + Note that I am using AF_nfe, which is the non-Finnish European AF
    * MPC scores [`mpc/`]
        + Have to lift this over to hg38
    * PEXT scores [`pext/`]
        + Have to lift this over to hg38
2. Variant Specific Databases -- Need to generate annotations for each VCF, as variant loci/values will not be identical. These annotations are stored in directories alongside the vcf bed file, sharing its prefix and suffixed with the name of the information within (here `ukbb_200k.locations.*`):
    * VQSR annotations [`ukbb_200k.locations.vqsr/`]
        + For the purposes of UKBB this amounts to a 'fake' list of VQSR values for each variant, as we "assume" that these variants were already filtered to a high sensitivity/specificity. This just ensures that all variants pass default VQSR filtering in the java pipeline
    * VEP annotations [`ukbb_200k.locations.vep/`]
    * Supplemental CADD annotations [`ukbb_200k.locations.cadd/`]
All of the basic annotation is done in the root directory. As noted above, per-vcf annotations are located in `./vcfs/`. The actual "mashing together" of annotations is performed by the java pipeline described in section 4 below.
**Note**: _Many_ of the below commands will take a long time, either due to long download times or to actual processing time. It is highly recommended to run these with some sort of parallel compute architecture. Nonetheless, all of these commands should be runnable on a local machine.
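If you do not have a cluster scheduler handy, even simple per-block steps (such as indexing each downloaded pVCF) can be fanned out locally, for example with GNU parallel (a sketch, assuming GNU parallel is installed; an LSF/SGE array job works just as well):
```{bash Example parallel fan out, eval = F}
## Index every downloaded pVCF block, 16 jobs at a time:
ls vcfs/individual_vcfs/ukb23156_c*_b*_v1.vcf.gz | parallel -j 16 'bcftools index {}'
```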
## 3A. Create Required Directory Structure
Note that the variant-specific directories share a prefix with the vcf bed file that you made above in section 2 (`ukbb_200k.locations`). This must be maintained for downstream code to function.
```{bash Create Directories, eval = F}
## Make top level directories:
mkdir -p annotations/hg38/
## Make annotation-specific directories:
mkdir annotations/hg38/cadd/
mkdir annotations/hg38/gnomad/
mkdir annotations/hg38/mpc/
mkdir annotations/hg38/pext/
## Make variant-specific directories:
mkdir vcfs/ukbb_200k.locations.cadd/
mkdir vcfs/ukbb_200k.locations.vep/
mkdir vcfs/ukbb_200k.locations.vqsr/
```
## 3B. Variant Agnostic Databases
### CADD
Here we download the precomputed genome-wide CADD SNV scores for GRCh38; at the time of generating the data used in this study, this is what was available. There is no post-processing required for these files: simply download them, ensure the names are consistent with what is below, and you are good to go:
```{bash Obtain CADD Annotations, eval = F}
## Go to CADD directory:
cd annotations/hg38/cadd/
# Download CADD Table and index
curl -o whole_genome_SNVs.tsv.gz http://krishna.gs.washington.edu/download/CADD/v1.5/GRCh38/whole_genome_SNVs.tsv.gz
curl -o whole_genome_SNVs.tsv.gz.tbi http://krishna.gs.washington.edu/download/CADD/v1.5/GRCh38/whole_genome_SNVs.tsv.gz.tbi
```
### gnomAD
This code can work in two ways, depending on whether you have the gnomAD *v3* sites VCF as a local download. You can either use:
1. a web-enabled version of bcftools to stream sites (slower, but no storage needed), or
2. your own local download (which can also be used for other projects).
    * Make sure you download the .tbi index as well!
gnomAD v3 sites are available here: `https://storage.googleapis.com/gnomad-public/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.vcf.bgz`
```{bash Obtain Gnomad Annotations, eval = F}
## Go to gnomAD directory:
cd annotations/hg38/gnomad/
# Parse w/BCFtools:
## Online version:
bcftools query https://storage.googleapis.com/gnomad-public/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.vcf.bgz -f "%CHROM\t%POS\t%REF\t%ALT\t%FILTER\t%INFO/AF_nfe\n" > gnomad.tsv
## Offline version:
## Obviously change the path to your local download:
bcftools query /path/to/gnomadv3/gnomad.genomes.r3.0.sites.vcf.bgz -f "%CHROM\t%POS\t%REF\t%ALT\t%FILTER\t%INFO/AF_nfe\n" > gnomad.tsv
## Remove 'chr' prefix, bgzip, and tabix index.
sed -i 's_chr__' gnomad.tsv
bgzip gnomad.tsv
tabix -s 1 -b 2 -e 2 gnomad.tsv.gz
```
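As a quick sanity check, the finished table should be tabix-queryable, with one row per allele containing CHROM, POS, REF, ALT, FILTER, and AF_nfe (the position and values below are illustrative):
```{bash Check gnomAD table, eval = F}
## Run from the repository root; any small interval will do:
tabix annotations/hg38/gnomad/gnomad.tsv.gz 1:55505000-55520000 | head -n 1
## Expected layout (illustrative values):
# 1    55505477    C    T    PASS    1.2e-05
```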
### MPC
We obtain MPC scores from [Samocha et al.](https://www.biorxiv.org/content/10.1101/148353v1), who developed a score for missense constraint. The table we use includes MPC values for all missense variants in the genome. As this file does not exist for hg38, we need to use CrossMap to lift it over.
We hope that CrossMap is doing a good job, but we also do further checks as part of running our annotation engine to ensure that variants are being annotated properly.
```{bash Make hg38 MPC, eval = F}
## Go to MPC directory:
cd annotations/hg38/mpc/
## Get hg19 MPC:
curl -o fordist_constraint_official_mpc_values_v2.txt.gz https://storage.googleapis.com/gnomad-public/legacy/exacv1_downloads/release1/regional_missense_constraint/fordist_constraint_official_mpc_values_v2.txt.gz
curl -o fordist_constraint_official_mpc_values_v2.txt.gz.tbi https://storage.googleapis.com/gnomad-public/legacy/exacv1_downloads/release1/regional_missense_constraint/fordist_constraint_official_mpc_values_v2.txt.gz.tbi
## Download the hg19 -> hg38 chainfile from UCSC:
curl -o GRCh37_to_GRCh38.chain.gz ftp://ftp.ensembl.org/pub/assembly_mapping/homo_sapiens/GRCh37_to_GRCh38.chain.gz
## Need to make a fake VCF file for crossmap to use
# need a VCF so that CrossMap will check REF/ALT
# Encode the original file as an INFO field so we can decode it later
# Make sure to add chr because liftover requires it (chain file and all...)
zcat fordist_constraint_official_mpc_values_v2.txt.gz | perl -ne 'chomp $_; @F = split("\t", $_); if ($F[0] =~ /chrom/) {print "#CHROM\tPOS\tID\tREF\tALT\tQUALT\tFILTER\tINFO\n";} else {print "chr$F[0]\t$F[1]\t.\t$F[2]\t$F[3]\t.\t.\tMPC=" . join("|",@F) . "\n"}' > mpc.hg19.vcf
## Run crossmap:
python /path/to/CrossMap.py vcf GRCh37_to_GRCh38.chain.gz mpc.hg19.vcf /path/to/Homo_sapiens.GRCh38_15.fa mpc.hg38.vcf
## Convert back to the original file format:
grep -v '#' mpc.hg38.vcf | perl -ane 'chomp $_; $F[7] =~ s/MPC=//; @M = split("\\|", $F[7]); $M[0] = $F[0]; $M[1] = $F[1]; $M[2] = $F[3]; $M[3] = $F[4]; $M[0] =~ s/chr//; print join("\t", @M) . "\n";' > mpc.hg38.unsorted.tsv
## sort, zip, index
sort -k 1,1 -k 2,2n mpc.hg38.unsorted.tsv > mpc.tsv
bgzip mpc.tsv
tabix -s 1 -b 2 -e 2 mpc.tsv.gz
```
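A minimal spot-check of the liftover, assuming samtools and the local GRCh38 fasta used above are available: pull one lifted record and confirm that its REF allele matches the reference sequence at the new coordinate.
```{bash Spot check MPC liftover, eval = F}
## mpc.tsv columns after conversion are: chrom (no 'chr'), pos, ref, alt, then the original MPC fields:
read -r chrom pos ref rest < <(zcat annotations/hg38/mpc/mpc.tsv.gz | head -n 1)
## The REF allele spans pos..pos+len(ref)-1; add a 'chr' prefix here if your fasta uses one:
samtools faidx /path/to/Homo_sapiens.GRCh38_15.fa "${chrom}:${pos}-$((pos + ${#ref} - 1))"
```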
### PEXT
We obtain PEXT scores from [Cummings et al.](https://www.nature.com/articles/s41586-020-2329-2). This manuscript utilized data from GTEx to determine the proportion of transcripts in which a variant is located that are actually expressed. As with MPC, this file does not exist for hg38, and we need to use CrossMap to lift it over. The protocol is thus very similar.
```{bash Make hg38 pext, eval=F}
## Go to PEXT directory:
cd annotations/hg38/pext/
## First get hg19 version files:
curl -o all.possible.snvs.tx_annotated.GTEx.v7.021520.tsv.bgz https://storage.googleapis.com/gnomad-public/papers/2019-tx-annotation/pre_computed/all.possible.snvs.tx_annotated.GTEx.v7.021520.tsv.bgz
curl -o all.possible.snvs.tx_annotated.GTEx.v7.021520.tsv.bgz.tbi https://storage.googleapis.com/gnomad-public/papers/2019-tx-annotation/pre_computed/all.possible.snvs.tx_annotated.GTEx.v7.021520.tsv.bgz.tbi
## Parse mean-pext using python script (easier w/python as per-tissue scores are stored as a python table/json):
# This takes a pretty long time to run, so a good idea to launch on parallel compute cluster
./scripts/format_pext.py all.possible.snvs.tx_annotated.GTEx.v7.021520.tsv.bgz pext.tsv
## Download the hg19 -> hg38 chainfile from UCSC:
curl -o GRCh37_to_GRCh38.chain.gz ftp://ftp.ensembl.org/pub/assembly_mapping/homo_sapiens/GRCh37_to_GRCh38.chain.gz
## Need to make a fake VCF file for crossmap to use
# need a VCF so that CrossMap will check REF/ALT
# Make sure to add chr because liftover requires it (chain file and all...)
# Also add header so that we can parse fake INFO fields we create, set to strings to keep proper format
perl -ane 'chomp $_; print "$F[0]\t$F[1]\t.\t$F[3]\t$F[4]\t.\t.\tAVG=$F[5];CER=$F[6];GENE=$F[7]\n";' pext.tsv > pext.vcf
echo -e "##fileformat=VCFv4.3\n##INFO=<ID=AVG,Number=1,Type=String,Description=\"AVG Expression\">\n##INFO=<ID=CER,Number=1,Type=String,Description=\"CER Expression\">\n##INFO=<ID=GENE,Number=1,Type=String,Description=\"Gene name\">\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO" | cat - pext.vcf > pext.hg19.vcf
## Run crossmap:
# Also takes a good amount of time to run
python /path/to/CrossMap.py vcf GRCh37_to_GRCh38.chain.gz pext.hg19.vcf /path/to/Homo_sapiens.GRCh38_15.fa pext.hg38.vcf
## Convert back to the original file format:
bcftools query -f "%CHROM\t%POS\t%REF\t%ALT\t%INFO/AVG\t%INFO/CER\t%INFO/GENE\n" pext.hg38.vcf > pext.hg38.unsorted.tsv
## sort, zip, index
sort -k 1,1 -k 2,2n pext.hg38.unsorted.tsv > pext.tsv
bgzip pext.tsv
tabix -s 1 -b 2 -e 2 pext.tsv.gz
```
## 3C. Variant Specific Databases
For all of the following sections, I performed the required tasks *SPLIT* by each VCF that was downloaded in section 2 (977 separate vcf.gz files), and then `cat` the results together to generate a single file, as documented below. Each section below concerns the annotation of one of these vcf.gz files; it is up to the user to decide how best to parallelize this across all 977.
### VQSR
As UK Biobank provided variants in pseudo-plink VCF format, VQSR values are not included. Because this code is also used for other projects, I still annotate a VQSR value for every variant; for UK Biobank calls, we simply set it to 10 (to make sure all variants pass even strict VQSR filters).
```{bash Obtain VQSR Annotations, eval = F}
## Go to VQSR directory:
cd vcfs/ukbb_200k.locations.vqsr/
## This is a FAKE file with all 10s for VQSR:
## The 1_1.tsv is named based on <chr>_<block start coordinate>.tsv. The first vcf.gz file starts at chr1:1, thus the file name
bcftools norm -m - -Ou ukb23156_c1_b0_v1.vcf.gz | bcftools query -f "%CHROM\t%POS\t%REF\t%ALT\t%FILTER\t10\n" -o 1_1.tsv
sed -i 's_chr__' 1_1.tsv
## After running the above for all 977 files, we do the following:
cat *.tsv | sed 's_chr__' | sort -k 1,1 -k2,2n > vqsr.tsv
bgzip vqsr.tsv
tabix -s 1 -b 2 -e 2 vqsr.tsv.gz
```
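A sketch of how the per-block step could be looped, with the `<chr>_<block start coordinate>.tsv` names taken straight from `pvcf_blocks.txt` (run from `vcfs/ukbb_200k.locations.vqsr/`; adjust the relative paths if your blocks or block list live elsewhere):
```{bash Loop VQSR blocks, eval = F}
## Generate one fake-VQSR table per block; the 'chr' prefix is stripped by the cat step above:
while read -r idx chrom block start end; do
  bcftools norm -m - -Ou ../individual_vcfs/ukb23156_c${chrom}_b${block}_v1.vcf.gz | \
    bcftools query -f "%CHROM\t%POS\t%REF\t%ALT\t%FILTER\t10\n" -o "${chrom}_${start}.tsv"
done < ../individual_vcfs/pvcf_blocks.txt
```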
### CADD
As the CADD library files created above do not contain most InDel scores, we need to supplement them with a set of CADD scores for the InDels in the primary plink file. This annotation requires a local install of [CADD](https://github.com/kircherlab/CADD-scripts/), the setup of which is not covered in this document.
As CADD takes a good amount of time, this code is partially parallelized by the script `scripts/run_cadd.pl`.
```{bash run CADD supplemental annotations, eval = F}
## Go to CADD directory:
cd vcfs/ukbb_200k.locations.cadd/
## This is provided for example, but CADD recommends an install via conda.
## Make sure you have it/activate it prior to running CADD via the provided script
conda activate cadd-env-1.6
## Run CADD -- I have provided an example script of how I did this at scripts/run_cadd.pl. Please see that for how this was done:
./run_cadd.pl
## This will generate 977 files with the suffix *.processed.tsv which we then merge together to get our final annotation file:
cat *.processed.tsv | sort -k 1,1 -k2,2n > cadd.tsv
bgzip cadd.tsv
tabix -s 1 -b 2 -e 2 cadd.tsv.gz
```
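If you prefer not to use the provided wrapper, the per-block step boils down to pulling the InDel sites out of a block and scoring them with your local CADD install, roughly like this (a sketch; run from `vcfs/ukbb_200k.locations.cadd/`, and the exact `CADD.sh` options depend on your install):
```{bash CADD per block sketch, eval = F}
## Extract InDel sites (no genotypes) from one block and score them with CADD:
bcftools view -v indels -G -Ov -o indels_1_1.vcf ../individual_vcfs/ukb23156_c1_b0_v1.vcf.gz
CADD.sh -g GRCh38 -o indels_1_1.cadd.tsv.gz indels_1_1.vcf
```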
### VEP
For VEP, we use a build of v102, which can be obtained from github with the following commands:
```
git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep/
git checkout release/102
```
Newer builds should be OK, but have not been tested. I run VEP annotation in 977 chunks (dictated by the number of individual VCF chunks). One needs to make a "stripped" (sites-only) vcf.gz file for each of the VCF files downloaded from UK Biobank, as running VEP on a VCF with genotypes takes A LONG time. I also make use of the bcftools plugin [+split-vep](https://samtools.github.io/bcftools/howtos/plugin.split-vep.html), which is VERY helpful!
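Making the stripped files is straightforward with bcftools, for example (a sketch; run from `vcfs/ukbb_200k.locations.vep/`, and the output name just needs to match what the VEP command below expects):
```{bash Strip genotypes for VEP, eval = F}
## Drop sample genotypes so VEP only sees sites (much faster to annotate):
bcftools view -G -Oz -o ukb23156_c1_b0_v1.stripped.vcf.gz ../individual_vcfs/ukb23156_c1_b0_v1.vcf.gz
bcftools index -t ukb23156_c1_b0_v1.stripped.vcf.gz
```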
This code is also run with help from the script `scripts/process_vep.pl` -- see below.
A **Note** on LOFTEE:
The most complicated part of getting VEP to work was building loftee for GRCh38. Ensure you have a plugins dir for VEP as described [here](https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html). Some tips when doing this:
1. Pull the grch38 version of Konrad's [git repo](https://github.com/konradjk/loftee):
2. Download each of the required files [for hg38](https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/):
* human ancestor
* sql database
* GERP database
3. For loftee hg38 to run, it needs additional perl packages, most of which can be installed easily via cpan (not listed here). The only one that was an issue was Bio::DB::BigWig, which has a whole custom setup to follow. I fixed this by installing an older version of the "Kent" UCSC repo [v334](https://github.com/ucscGenomeBrowser/kent/tree/v334_branch) as referenced in [this issue](https://github.com/GMOD/GBrowse-Adaptors/issues/17). This also requires you to edit a file when installing kent, as noted in the same issue.
```{bash Run VEP, eval = F}
## Go to VEP directory
cd vcfs/ukbb_200k.locations.vep/
## The actual VEP command (DO NOT USE without editing for your own vep build)
vep \
--cache \
--dir_cache .vep/ \
--fasta .vep/homo_sapiens/102_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz \
--offline \
--format vcf \
--dir_plugins .vep/Plugins \
-i ukb23156_c1_b0_v1.stripped.vcf.gz \
--plugin SpliceRegion,Extended \
--plugin GeneSplicer,.vep/Plugins/GeneSplicer/bin/linux/genesplicer,.vep/Plugins/GeneSplicer/human \
--plugin UTRannotator,.vep/Plugins/uORF_starts_ends_GRCh38_PUBLIC.txt \
--plugin CADD,whole_genome_SNVs.tsv.gz,gnomad.genomes.r3.0.indel.tsv.gz \
--fork 4 \
--everything \
--plugin LoF,loftee_path:.vep/Plugins,human_ancestor_fa:.vep/Plugins/GRCh38_human_ancestor.fa.gz,conservation_file:.vep/Plugins/loftee.sql,gerp_bigwig:.vep/Plugins/gerp_conservation_scores.homo_sapiens.GRCh38.bw \
--plugin REVEL,.vep/Plugins/grch38_tabbed_revel.tsv.gz \
--plugin SpliceAI,snv=spliceai_scores.raw.snv.hg38.vcf.gz,indel=spliceai_scores.raw.indel.hg38.vcf.gz \
--vcf \
-o ukb23156_c1_b0_v1.vep.vcf.gz \
--compress_output bgzip \
--allele_number \
--verbose \
--force
## I then do post-processing on each of these files with the bcftools plugin '+split-vep' via the script process_vep.pl
./process_vep.pl ukb23156_c1_b0_v1.vep.vcf.gz 1_1.vep.sites.tsv
## And then once all files are processed, put them all together:
cat *.sites.tsv | sort -k 1,1 -k 2,2n > vep.tsv
bgzip vep.tsv
tabix -s 1 -b 2 -e 2 vep.tsv.gz
```
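For reference, the heart of that post-processing is the `+split-vep` plugin, which flattens the VEP CSQ annotation into a plain table; a minimal sketch (not the exact fields extracted by `process_vep.pl`) looks like this:
```{bash split-vep sketch, eval = F}
## One row per variant/consequence pair, pulling a few example VEP subfields:
bcftools +split-vep -d \
  -f '%CHROM\t%POS\t%REF\t%ALT\t%SYMBOL\t%Consequence\t%LoF\n' \
  ukb23156_c1_b0_v1.vep.vcf.gz | head
```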
# 4. Variant Engine
This next section runs the annotation engine I have written in java. The engine comes as both a compiled .jar file (compiled for Java 14) and raw source files, which contain additional minor documentation. The source is located at `src/SNVCounter/` and comes with a pre-set Eclipse project (`src/SNVCounter/.project`) that can be imported into Eclipse for editing purposes.
## 4A. Preparing Gene Lists
This code block is almost _identical_ to the "Curating Genelists" section of the `PhenotypeTesting.Rmd` document. The only exception is that we create a 'coordinates.txt' file for both hg19 (not used for the hg38 analysis) and hg38, which is used as the primary genelist input for our annotation engine.
Reiterating the gene lists that we use here:
1. ENSEMBL-downloaded resources from [BioMart](https://www.ensembl.org/biomart/martview/0511514c231557b5d24ace4e8f7862e0).
2. pLI information from the [gnomAD project](https://storage.googleapis.com/gnomad-public/release/2.1.1/constraint/gnomad.v2.1.1.lof_metrics.by_gene.txt.bgz).
3. s~het~ from [Weghorn et al](https://doi.org/10.1093/molbev/msz092). The reference file is included in this repository (`rawdata/genelists/shet.weghorn.txt`).
**Note**: These scripts assume you have `curl` installed on your system, which _should_ be true on macOS and most *nix systems. Please change the scripts below if this is not the case.
### Download Resources from BioMart
```{r Generate biomart resources}
## Hg19
ensembl <- useMart("ensembl", host="http://grch37.ensembl.org", dataset = "hsapiens_gene_ensembl")
hg19.table <- data.table(getBM(attributes = c('ensembl_gene_id','chromosome_name','start_position','end_position','hgnc_id','hgnc_symbol','ensembl_transcript_id'),mart = ensembl))
hg19.table <- hg19.table[!grep("_",chromosome_name)]
write.table(hg19.table,"rawdata/genelists/hg19.genes.txt",col.names=F,row.names=F,quote=F,sep="\t")
## Hg38
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
hg38.table <- data.table(getBM(attributes = c('ensembl_gene_id','chromosome_name','start_position','end_position','hgnc_id','hgnc_symbol','strand'),mart = ensembl))
hg38.table[,hgnc_id:=str_remove(hgnc_id,"HGNC:"),by=1:nrow(hg38.table)]
hg38.table <- hg38.table[!grep("CHR_",chromosome_name)]
write.table(hg38.table,"rawdata/genelists/hg38.genes.txt",col.names=F,row.names=F,quote=F,sep="\t")
rm(hg19.table,hg38.table,ensembl)
```
### s~het~ Gene Lists
```{bash Generate sHET gene lists}
perl -ane 'chomp $_; @F = split("\t", $_); print "$F[0]\t$F[7]\n";' rawdata/genelists/shet.weghorn.txt > rawdata/genelists/shet.processed.weghorn.txt
perl -ane 'chomp $_; @F = split("\t", $_); print "$F[0]\t$F[1]\n";' rawdata/genelists/shet.cassa.txt > rawdata/genelists/shet.processed.cassa.txt
## sHET gene lists (have to attach ENSG):
scripts/matcher.pl -file1 rawdata/genelists/hg19.genes.txt -col1 5 -file2 rawdata/genelists/shet.processed.weghorn.txt -r | perl -ane 'chomp $_; print "$F[2]\t$F[0]\t$F[1]\t$F[6]\n";' > rawdata/genelists/shet.hgnc.txt
scripts/matcher.pl -file1 rawdata/genelists/hg19.genes.txt -col1 5 -file2 rawdata/genelists/shet.processed.cassa.txt -r | perl -ane 'chomp $_; print "$F[2]\t$F[0]\t$F[1]\t$F[6]\n";' > rawdata/genelists/shet.cassa.hgnc.txt
```
### Hg19 Gene Lists
```{bash Generate hg19 Gene Lists}
## Download gnomAD scores:
curl -o rawdata/genelists/gnomad.v2.1.1.lof_metrics.by_gene.txt.bgz https://storage.googleapis.com/gnomad-public/release/2.1.1/constraint/gnomad.v2.1.1.lof_metrics.by_gene.txt.bgz
## Rename the file so gunzip will recognise it (gnomAD distributes it with a .bgz suffix):
mv rawdata/genelists/gnomad.v2.1.1.lof_metrics.by_gene.txt.bgz rawdata/genelists/gnomad.v2.1.1.lof_metrics.by_gene.txt.gz
gunzip -f rawdata/genelists/gnomad.v2.1.1.lof_metrics.by_gene.txt.gz
## Create a reference file of just ENSG and pLI, while removing genes w/o a pLI score:
perl -ane 'chomp $_; @F = split("\t", $_); if ($F[20] ne "NA") {print "$F[63]\t$F[20]\n";}' rawdata/genelists/gnomad.v2.1.1.lof_metrics.by_gene.txt > rawdata/genelists/hg19.all_genes_with_pli.txt
## Add additional info from biomart that we acquired:
# pLI file:
scripts/matcher.pl -file1 rawdata/genelists/hg19.genes.txt -file2 rawdata/genelists/hg19.all_genes_with_pli.txt -r | perl -ne 'chomp $_; @F = split("\t", $_); print "$F[0]\t$F[3]\t$F[4]\t$F[5]\t$F[6]\t$F[7]\t$F[8]\t$F[1]\n";' > rawdata/genelists/hg19.all_genes_with_pli.2.txt
mv rawdata/genelists/hg19.all_genes_with_pli.2.txt rawdata/genelists/hg19.all_genes_with_pli.txt
# sHET file:
scripts/matcher.pl -file1 rawdata/genelists/hg19.genes.txt -file2 rawdata/genelists/shet.hgnc.txt -r | perl -ne 'chomp $_; @F = split("\t", $_); print "$F[0]\t$F[5]\t$F[6]\t$F[7]\t$F[8]\t$F[9]\t$F[10]\t$F[2]\n";' > rawdata/genelists/hg19.all_genes_with_shet.txt
```
### Hg38 Gene Lists
```{bash Generate hg38 Gene lists}
# Try and match genes to Hg19 based on HGNC ID
# Generate a list of hg19 genes with HGNC IDs:
perl -ane 'chomp $_; if ($F[4] ne "NA" && $F[4] ne "") {print "$F[4]\t$F[0]\t$F[5]\n";}' rawdata/genelists/hg19.genes.txt | sort | uniq > rawdata/genelists/hg19.trans.txt
scripts/matcher.pl -file1 rawdata/genelists/hg19.trans.txt -file2 rawdata/genelists/hg38.genes.txt -col2 4 -r | perl -ane 'chomp $_; print "$F[0]\t$F[1]\t$F[2]\t$F[3]\t$F[4]\t$F[5]\t$F[8]\n";' > rawdata/genelists/hg38.hgnc.matched.txt
# Ask which genes have a pLI score:
scripts/matcher.pl -file1 rawdata/genelists/hg38.hgnc.matched.txt -col1 6 -file2 rawdata/genelists/hg19.all_genes_with_pli.txt -r | perl -ane 'chomp $_; @F = split("\t", $_); print "$F[8]\t$F[9]\t$F[10]\t$F[11]\t$F[12]\t$F[13]\t$F[0]\t$F[7]\n";' > rawdata/genelists/hg38.all_genes_with_pli.txt
# Ask which genes have a sHET score:
scripts/matcher.pl -file1 rawdata/genelists/hg38.hgnc.matched.txt -col1 6 -file2 rawdata/genelists/hg19.all_genes_with_shet.txt -r | perl -ane 'chomp $_; @F = split("\t", $_); print "$F[8]\t$F[9]\t$F[10]\t$F[11]\t$F[12]\t$F[13]\t$F[0]\t$F[7]\n";' > rawdata/genelists/hg38.all_genes_with_shet.txt
# There is a fairly large caveat here: I label the genes with their hg19 ENSG ID so that I can be consistent in my R code below!!! This doesn't impact too many genes; most have the same ID in both builds (~200-300 differ)
# This gets a translatable list to hg19 ENSG###:
perl -ne 'chomp $_; @F = split("\t", $_); print "$F[0]\t$F[6]\n";' rawdata/genelists/hg38.all_genes_with_pli.txt > rawdata/genelists/hg38_to_hg19_ENSG.txt
perl -ne 'chomp $_; @F = split("\t", $_); print "$F[0]\t$F[6]\n";' rawdata/genelists/hg38.all_genes_with_shet.txt >> rawdata/genelists/hg38_to_hg19_ENSG.txt
sort rawdata/genelists/hg38_to_hg19_ENSG.txt | uniq > rawdata/genelists/hg38_to_hg19_ENSG.2.txt
mv rawdata/genelists/hg38_to_hg19_ENSG.2.txt rawdata/genelists/hg38_to_hg19_ENSG.txt
```
## 4B. Running
This annotation engine requires all flags; descriptions can also be seen by running with only the `--help` flag. Briefly, these flags are:
* genelist: This is the file created above: `hg38.coordinates.txt`
* gene: An integer between 1 and the number of genes in hg38.coordinates.txt (currently 18,756 genes).
* genomeversion: The version of the genome being used. This should always be hg38 for UK Biobank data, but can be changed to hg19 for other datasets. This takes care of two things:
    + Tells the engine which annotation set to use at `annotations/hg**/`
    + It is also a hacky way of handling the 'chr' prefix for chromosome IDs typical of hg38 VCFs. I currently don't have a way to deal with hg38 VCFs that don't use 'chr' and, for hg19, vice versa.
* vcf: path to the UK Biobank VCF we downloaded above, with annotations at the correct locus. For the 200k data this is the *BED file* that has coordinates attached to each vcf chunk. This can also be a single vcf file if you choose to run it this way.
    + The java code provided knows to use this bed file to find the vcf slice that contains variants for the gene given by `-gene`
* samples: list of samples to include in annotations, with one ID per line.
    + This is simply the output of `bcftools query -l ukb23156_c1_b1.vcf.gz`, where the vcf file used can be ANY of the VCF chunks acquired from UKBiobank
    + Sample IDs must match identically to those found in the header. UKBiobank WES VCF IDs look like "123456", where 123456 is the eid of the individual.
    + This is important to get correct, as this is how AF is calculated and how individuals are included in the final output.
* maf: The MAF filter to apply to variants. This _only_ filters based on allele frequencies calculated from the VCF file itself, not the provided gnomAD MAFs. It can be set higher, but could generate a *very* large file. For this project we exclude variants with MAF > 1e-3.
* outdir: Where to store per-gene txt files like: `results_ukbb_wes/ENSG00000003056.rare_variants.tsv`
* rootdir: Where the annotations we created above are stored.
This command is intended to be parallelized; each job essentially annotates one gene at a time within the file given to `-genelist`. In the code chunk below I give the raw command to run the first gene in this file, followed by an example LSF submission for parallelization.
```{bash Run Annotations, eval = F}
## Example to run one gene.
java -XX:+UseSerialGC -Xms2400M -Xmx2400M -jar /nfs/ddd0/eg15/SNVCounter.jar -genelist annotations/hg38/genelists/hg38.coordinates.txt -gene 1 -genomeversion hg38 -vcf vcfs/ukbb_200k.locations.bed.gz -samples ukbb_ids.200k.txt -maf 0.001 -outdir results_ukbb_wes_200k/ -rootdir annotations/
## Example job submission command for LSF:
bsub -q normal -M 4000 -o gridout/count.%J.%I -J 'SNV[1-18791]%1000' "java -XX:+UseSerialGC -Djava.io.tmpdir=/lustre/scratch115/teams/hurles/users/eg15/INTERVAL/snv_cognition/tmp/ -Xms4000M -Xmx4000M -jar /nfs/users/nfs_e/eg15/SNVCounter.jar -genelist annotations/hg38/genelists/hg38.coordinates.txt -gene \$LSB_JOBINDEX -genomeversion hg38 -vcf vcfs/ukbb_200k.locations.bed.gz -samples ukbb_ids.200k.txt -maf 0.001 -outdir results_ukbb_wes_200k/ -rootdir annotations/"
## Mash the genes together:
grep -hv 'NONE' results_ukbb_wes_200k/*.rare_variants.tsv > counts.ukbb_wes.200k.txt
```
# 5. Postprocessing
Due to the large amount of data, we can no longer load 'counts.ukbb_wes.200k.txt' directly into a local version of RStudio and process it here; instead, we process it with the script `scripts/process_snvs.200k.R`. This script calculates both regular s~het~ scores and s~het~ scores that exclude specific genes for use in `PhenotypeTesting.Rmd`. It requires ~20Gb of memory to run.
## 5A. Gene Exclusion Data
Each of these sections downloads some "exclusion" list that we then use to generate a modified s~het~ for a specific analysis we do in `PhenotypeTesting.Rmd`.
### Expression-Based Exclusion
This code just quickly downloads GTEx expression data (v7) from the [GTEx Consortium](https://gtexportal.org/home/datasets) and loads it into R.
```{bash Download GTEx}
curl -o rawdata/genelists/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_median_tpm.gct.gz https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_median_tpm.gct.gz
gunzip -f rawdata/genelists/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_median_tpm.gct.gz
```
We then load it into memory here.
```{r Load Gene Expression}
## GTEX expression data:
expression <- fread("rawdata/genelists/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_median_tpm.gct")
expression[,c("gene","vers"):=tstrsplit(gene_id,".",fixed=T),by=1:nrow(expression)]
setnames(expression,"gene","hg19.GENE")
## Load sHET Genes:
shet.genes <- fread("rawdata/genelists/shet.hgnc.txt")
setnames(shet.genes,names(shet.genes),c("hg19.GENE","GENE","sHET.val","HGNC.ID"))
shet.genes[,deciles:=cut(sHET.val,breaks=quantile(sHET.val,seq(0,1,by=0.1)),include.lowest = T)]
shet.genes[,sHET.val.binary:=cut(sHET.val,breaks=c(0,0.15,1),labels = c("lt_015","gt_015"),right = F)]
## Only retain genes with an sHET score:
expression <- merge(expression, shet.genes[,"hg19.GENE"], by = "hg19.GENE")
```
#### Making Gene Lists For Calculating sHET Burden
This block generates "expressed gene" lists for the purposes of creating tissue-specific s~het~ burdens. These individual files are placed in a subdirectory (`tissues/`), located within the `genelists/` directory:
```{bash make tissue dir}
mkdir -p rawdata/genelists/tissues/
```
```{r make gene lists}
## Generate highly expressed gene lists for all tissues:
for (t in names(expression)[4:56]) {
t.file <- str_remove_all(t, " ")
t.file <- str_replace_all(t.file,"-","_")
t.file <- str_replace_all(t.file,"\\(\\S+\\)","")
t.file <- str_to_lower(t.file)
file <- paste0("rawdata/genelists/tissues/",t.file,"High.txt")
fwrite(expression[get(t) > 0.5, "hg19.GENE"], file, quote = F, row.names = F, col.names = F, sep = "\t")
}
```
### Disease-Based Exclusion
This section of the document is not evaluated, as we need to acquire disease gene resources from three locations. The actual acquisition of these files is trivial, but I did not want to attempt to make it reproducible due to the potential for links to break. If this section needs to be reproduced, download the resources from the _rough_ locations below and run the code in this section. Otherwise, move to the next section, where the file produced by this section has already been generated. Locations to place the necessary files and the file dates used in the manuscript are listed below:
1. Clinvar - https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/archive_2.0/2019/clinvar_20191003.vcf.gz
    + `rawdata/genelists/clinvar.vcf.gz`
    + date: October 3, 2019
2. DDG2P - http://www.ebi.ac.uk/gene2phenotype/downloads/DDG2P.csv.gz
    + `rawdata/genelists/DDG2P.csv`
    + date: November 12, 2019
3. OMIM - https://www.omim.org/downloads
    + `rawdata/genelists/morbidmap.txt`
    + date: October 8, 2019
**Note**: To download OMIM morbid map you will need to register and place a copy of this file at: `rawdata/genelists/morbidmap.txt`
```{r Process Disease Genes, eval = F}
## Clinvar
clinvarVCFfiltered <- read_tsv("rawdata/genelists/clinvar.vcf.gz", comment = '#', col_names = F, col_types = 'cccccccccccccc') %>%
mutate(PHEN = str_remove(str_extract(X8, 'CLNDN=[^;]*;'), "CLNDN=")) %>%
mutate(PHEN = str_remove(PHEN, "not_provided")) %>%
mutate(PHEN = str_remove(PHEN, "|")) %>%
mutate(PHEN = str_remove(PHEN, ";")) %>%
filter(PHEN != '') %>%
mutate(GENE = str_remove(str_extract(X8, 'GENEINFO=[^;]*:'), "GENEINFO=")) %>%
mutate(CLNSIG = str_remove(str_extract(X8, 'CLNSIG=[^;]*;'), "CLNSIG=")) %>%
filter(CLNSIG %in% c("Pathogenic/Likely_pathogenic;", "Pathogenic;", "Likely_pathogenic;")) %>%
mutate(CLNSIG = str_remove(CLNSIG, ";")) %>%
mutate(ID = X3) %>%
select(c(X1,X2,ID, PHEN, GENE, CLNSIG)) %>%
drop_na(PHEN, GENE)
clinvarGenes <- clinvarVCFfiltered %>%
select(GENE) %>%
distinct() %>%
transform(GENE = strsplit(GENE, "\\|")) %>%
unnest(GENE) %>%
mutate(GENE = gsub(":.*", "", GENE))
clinvarGenes <- data.table(clinvarGenes)
## DDG2P
ddg2p <- fread("rawdata/genelists/DDG2P.csv") %>%
rename(hgnc_id = `hgnc id`)
develGenes <- ddg2p %>%
filter(`DDD category` %in% c("probable","confirmed","both DD and IF")) %>%
distinct(`hgnc_id`)
develGenes <- data.table(develGenes)
## OMIM
omimGenes <- read_tsv("rawdata/genelists/morbidmap.txt", skip = 4, comment = '#', col_types = 'ccic',
col_names = c("Pheno", "Gene", "MIM_Number", "Cyto_Location")) %>%
select(Pheno, Gene) %>%
drop_na() %>%
mutate(Gene = gsub(",.*", "", Gene)) %>%
mutate(omim_nondisease = ifelse(grepl("\\[", Pheno), T,F)) %>%
mutate(omim_complex = ifelse(grepl("\\{", Pheno), T,F)) %>%
mutate(omim_provisional = ifelse(grepl("\\?", Pheno), T,F))
omimGenes <- omimGenes %>%
filter(!omim_complex & !omim_nondisease)
omimGenes <- data.table(omimGenes)
hgnc.to.ENSG <- fread("rawdata/genelists/hg19.coordinates.txt",fill=T)
setnames(hgnc.to.ENSG,names(hgnc.to.ENSG),c("hg19.GENE","chr","start","end","hgnc_id","hgnc.symbol"))
clinvarGenes <- unique(merge(clinvarGenes,hgnc.to.ENSG,by.x="GENE",by.y="hgnc.symbol"))
develGenes <- unique(merge(develGenes,hgnc.to.ENSG,by="hgnc_id"))
omimGenes <- unique(merge(omimGenes[,c("Gene")],hgnc.to.ENSG,by.x="Gene",by.y="hgnc.symbol"))
diseaseGenes <- bind_rows(clinvarGenes[,c("hg19.GENE")],develGenes[,c("hg19.GENE")],omimGenes[,c("hg19.GENE")])
diseaseGenes <- unique(diseaseGenes)
write.table(diseaseGenes, "rawdata/genelists/diseaseGenes.txt",col.names = F, row.names = F, quote = F, sep ="\t")
rm(ddg2p, develGenes, omimGenes, clinvarVCFfiltered, clinvarGenes, hgnc.to.ENSG, diseaseGenes)
```
### Male Infertility-Based Exclusion
This is a list of genes confirmed as playing a role in male infertility from [Manon Oud and Joris Veltman](https://academic.oup.com/humrep/article/34/5/932/5377831). We download and process the data directly from their supplement (Tables S3-S6):
**Note**: I have no idea if the link below will work for everyone. If it doesn't, download "Supplementary Table 3-6" manually from [here](https://academic.oup.com/humrep/article/34/5/932/5377831#supplementary-data) and save the file as `rawdata/genelists/male_infertility_genes.xlsx`
```{bash Get Male Infertility Genes, eval = F}
curl -L -o rawdata/genelists/male_infertility_genes.xlsx "https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/humrep/34/5/10.1093_humrep_dez022/2/dez022_supplementary_tables_siii_-_svi_final.xlsx?Expires=1610104943&Signature=aNNfVC~2fDVGJtkTwMipconDNKJ-wBpIAuirlY9Vlb10rdiEhGSlRiV2w01eiRwPdg~c7N6j~5Vvle-XbWzPs8hrNnDJkTZSzTJSzAY8rJJcGsvQrrZrmakP87O9iIEuKvsBqCyhLU04osczMLaWVRL7oSX~MmBiPNOgWIOUs7nQodIhFAGf9gyscadyUJ9q3yL5ptybEOcd2VbkeDNuHgRCWuhbE1KB7LQStWbRp6gxXR6JMf8qSHWxknccaxBTdhQ3Pk0bLYL3dxOjr2PYikPlogfO98JF2HgakoJmumx-zJ1wh~4RuB9tJykz3or6FMfOqJLSOiFPF9XVEUJ0zQ__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA"
```
```{r Process Male Infertility Gene XLSX}
male.genes.xlsx <- data.table(read_xlsx("rawdata/genelists/male_infertility_genes.xlsx",sheet = "Supplementary Table SIV",skip=1))
male.genes.xlsx <- male.genes.xlsx[4:nrow(male.genes.xlsx)]
male.genes.xlsx <- male.genes.xlsx[,c("HGNC gene name","Inheritance pattern in human","Conclusion")]
male.genes.xlsx <- male.genes.xlsx[`HGNC gene name` != ""]
male.genes.xlsx <- male.genes.xlsx[Conclusion!="No evidence"]
male.genes.xlsx <- male.genes.xlsx[`Inheritance pattern in human`!= "XL" & `Inheritance pattern in human` != "YL"]
## Need to rename some genes as they used Hg38 gene IDs:
male.genes.xlsx[,`HGNC gene name`:=if_else(`HGNC gene name`=="CATSPERE","C1orf101",`HGNC gene name`)]
male.genes.xlsx[,`HGNC gene name`:=if_else(`HGNC gene name`=="CFAP43","WDR96",`HGNC gene name`)]
male.genes.xlsx[,`HGNC gene name`:=if_else(`HGNC gene name`=="CFAP44","WDR52",`HGNC gene name`)]
male.genes.xlsx[,`HGNC gene name`:=if_else(`HGNC gene name`=="CFAP69","C7orf63",`HGNC gene name`)]
male.genes.xlsx[,`HGNC gene name`:=if_else(`HGNC gene name`=="DNAAF4","DYX1C1",`HGNC gene name`)]
male.genes.xlsx[,`HGNC gene name`:=if_else(`HGNC gene name`=="DNAAF5","HEATR2",`HGNC gene name`)]
write.table(male.genes.xlsx,"rawdata/genelists/male_infertility_genes.txt",row.names=F,col.names=F,sep="\t",quote=F)
rm(male.genes.xlsx)
```
```{bash Annotate Male Infertility Genes}
scripts/matcher.pl -file1 rawdata/genelists/hg19.genes.txt -file2 rawdata/genelists/male_infertility_genes.txt -col1 5 -r | perl -ne 'chomp $_; @F = split("\t", $_); splice(@F, 4, 9); print join("\t", @F) . "\n";' > rawdata/genelists/male_infertility_genes.annotated.txt
```
The same matching step is applied to the mouse infertility gene list (`rawdata/genelists/mouse_infertility_genes.txt`):
```{bash Annotate Mouse Infertility Genes}
scripts/matcher.pl -file1 rawdata/genelists/hg19.genes.txt -file2 rawdata/genelists/mouse_infertility_genes.txt -col1 5 -r | perl -ne 'chomp $_; @F = split("\t", $_); splice(@F, 4, 9); print join("\t", @F) . "\n";' > rawdata/genelists/mouse_infertility_genes.annotated.txt
```
## 5B. Run Postprocessing
The script `scripts/process_snvs.200k.R` takes NO command-line inputs, but the following files need to be in the directory you execute the script from, *EXCEPT* for the tissue files, which need to be in `./tissues/`:
1. hg38 --> hg19 Gene name translation file: `hg38_to_hg19_ENSG.txt` (created in section: "Hg38 Gene Lists")
2. s~het~ scores file: `shet.hgnc.txt` (created in section "s~het~ Gene Lists")
3. Cassa et al s~het~ scores file: `shet.cassa.hgnc.txt` (created in section "s~het~ Gene Lists")
4. pLI gene scores file: `hg19.all_genes_with_pli.txt` (created in section "Hg19 Gene Lists")
5. Variant counts file: `counts.ukbb_wes.200k.txt` (created in section "4B. Running")
6. sample ID files: `ukbb_ids.200k.txt` (created in section "4B. Running")
7. Individuals to keep file: `indv_to_keep.txt`
    + This file is a list of all unrelated, European ancestry individuals that will be included in this study (n ~ 350k). The generation of this file is described at the top of `PhenotypeTesting.Rmd`.
8. Disease genes: `diseaseGenes.txt` (created in section "Disease-Based Exclusion")
9. Male infertility genes: `male_infertility_genes.annotated.txt` (created in section "Male Infertility-Based Exclusion")
10. Mouse infertility genes: `mouse_infertility_genes.annotated.txt` (created in section "Male Infertility-Based Exclusion")
11. Tissue-specific lists: in `./tissues/`, named like `stomachHigh.txt` (created in section "Expression-Based Exclusion")
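One way to stage these inputs before running the script (a sketch; adjust the source paths to wherever you created each file):
```{bash Stage postprocessing inputs, eval = F}
## Collect all required inputs into one working directory, with tissue lists in ./tissues/:
mkdir -p postprocessing/tissues/
cp rawdata/genelists/hg38_to_hg19_ENSG.txt rawdata/genelists/shet.hgnc.txt rawdata/genelists/shet.cassa.hgnc.txt \
   rawdata/genelists/hg19.all_genes_with_pli.txt rawdata/genelists/diseaseGenes.txt \
   rawdata/genelists/male_infertility_genes.annotated.txt rawdata/genelists/mouse_infertility_genes.annotated.txt \
   counts.ukbb_wes.200k.txt ukbb_ids.200k.txt indv_to_keep.txt postprocessing/
cp rawdata/genelists/tissues/*High.txt postprocessing/tissues/
```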
The script will generate three .rds formatted outputs that are used as the direct input of `PhenotypeTesting.Rmd` in section 4D of that document:
1. 200k_counts.rds - Main file of per-individual variant data
2. counts.rds - Number of total variants per-person
3. genes.rds - Number of total variants per-gene
```{bash Postprocessing, eval = F}
./process_snvs.200k.R
```