---
title: "SNV Calling and Filtering"
author: "Eugene Gardner"
date: "03 December 2020"
output:
  html_document:
    toc: true
    toc_depth: 2
---
# 1. Startup and Introduction
This Markdown documents the annotation and extraction of SNV and InDel variants from UKBB FE WES files. This annotation uses a custom java pipeline, of which the precompiled jar is included at `scripts/SNVCounter.jar`. The source code and Eclipse project for this pipeline are available in `src/SNVCounter/`.
This code makes use of the following tools:
1. CADD requires a clone and install of their code repo: https://github.com/kircherlab/CADD-scripts/
2. VEP requires a clone and install of their code repo: https://github.com/Ensembl/ensembl-vep
    * Below I provide a rough set of instructions to install [LOFTEE](https://github.com/konradjk/loftee) by Konrad Karczewski from gnomAD, which can be _very_ annoying to set up for hg38
3. To liftover MPC and PEXT to hg38, we need [CrossMap](http://crossmap.sourceforge.net/)
4. As a supplement to CrossMap, a local download of the GRCh38 reference is needed to check that our mapping is correct
5. [bcftools](http://samtools.github.io/bcftools/bcftools.html) is used in many places to parse information out of downloaded vcf/bcf files.
    * The bcftools plugin [split-vep](https://samtools.github.io/bcftools/howtos/plugin.split-vep.html) is also used when parsing vep output. Information on how to use plugins with bcftools is available [here](https://samtools.github.io/bcftools/howtos/index.html).
6. bgzip and tabix, which are provided as part of [htslib](https://github.com/samtools/htslib).
7. bedtools, which is used for various genome arithmetic functions
**Note**: The intent of this document is _not_ to be a runnable document like `PhenotypeTesting.Rmd`; instead, it offers a guide to how SNV/InDel annotation and curation was performed. We have provided the output of this document to UK Biobank, which you can download at:
`https://fake.ukbbiobank_url.com`
You can view a compiled html version of this document with all code run either within this repository at `compiled_htmls/SNVCalling_Filtering.html` or on [github](https://htmlpreview.github.io/?https://github.com/eugenegardner/UKBBFertility/blob/master/compiled_html/SNVCalling_Filtering.html).
```{r setup}
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE ## Warnings turned off to squelch unnecessary ggplot noise in the knitted document. Have checked all for accuracy.
)
## Quietly Load Libraries
load.package <- function(name) {
suppressMessages(suppressWarnings(library(name, quietly = T, warn.conflicts = F, character.only = T)))
}
load.package("readxl") ## Read Supplemental Excel tables
load.package("data.table") ## Better than data.frame
load.package("biomaRt") ## Get gene lists we need
load.package("tidyverse") ## Takes care of ggplot, tidyr, dplyr, and stringr
```
# 2. Required Exome Variant Files
The first step in analysing UK Biobank WES data is to actually acquire the variant files themselves. Please see [this page](https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=23146) for additional information on acquiring the required files. All information is stored in population-level pVCF format as provided by UK Biobank.
**Note**: This document makes use of the latest UKBB release (a.k.a. "The 200k")
**Note**: DO NOT delete the raw files after you download them. VEP annotation requires them to work properly!
Roughly, files were acquired and processed into a VCF format file with the following commands:
```{bash Get UKBB VCF, eval = F}
## Going to put these files in vcfs/individual_vcfs/ to make examples in this document consistent:
## Note that UK Biobank has stored variants across 977 separate VCFs and you will need to run these commands to download each pVCF
mkdir -p vcfs/individual_vcfs/
cd vcfs/individual_vcfs/
## First download the blocks reference file:
wget -nd biobank.ndph.ox.ac.uk/showcase/showcase/auxdata/pvcf_blocks.txt
## Then this command downloads ONE pVCF to the folder vcfs/individual_vcfs/. You will need to download the additional 976 files.
## Note that you must also have your keyfile provided by UKB pointed to with the -a flag
# -c is the chromosome and -b is the chromosome slice
gfetch 23156 -ak44165.key -c1 -b0
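## To fetch every one of the 977 blocks you could loop over pvcf_blocks.txt, roughly like this
## (a sketch; assumes the same key file, and gfetch may want numeric codes for the X/Y chromosomes):
while read -r idx chrom block start end; do
  gfetch 23156 -ak44165.key -c"${chrom}" -b"${block}"
done < pvcf_blocks.txt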
## You will then need to run the following on each file:
bcftools index ukb23156_cY_bZ.vcf.gz
## Where:
# Y = chromosome
# Z = block
## Once all 977 vcf slices are downloaded and indexed, go back up one directory (into vcfs/) and run the following to generate a bed file list of VCF files with a tabix index:
cd ../ ## Now in vcfs/
perl -ane 'chomp $_; $loc = $ENV{"PWD"}; print "chr$F[1]\t$F[3]\t$F[4]\t$loc/individual_vcfs/ukb23156_c$F[1]_b$F[2]_v1.vcf.gz\n";' individual_vcfs/pvcf_blocks.txt > ukbb_200k.locations.bed
bgzip ukbb_200k.locations.bed
tabix -p bed ukbb_200k.locations.bed.gz ## Yes, I know bed is 0-based but I handle that in my subsequent code
```
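If everything worked, each row of `ukbb_200k.locations.bed.gz` maps one pVCF block to its genomic interval, with the absolute path to that block's vcf in the fourth column. A quick way to eyeball this (the coordinates and path below are illustrative, not real block boundaries):
```{bash Check locations bed, eval = F}
## Run from the repository root:
zcat vcfs/ukbb_200k.locations.bed.gz | head -n 2
## Expected layout (illustrative values):
# chr1    1          1234567    /abs/path/vcfs/individual_vcfs/ukb23156_c1_b0_v1.vcf.gz
# chr1    1234568    2345678    /abs/path/vcfs/individual_vcfs/ukb23156_c1_b1_v1.vcf.gz
```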
# 3. Generating Annotation Sources
This process is slightly complicated because the code was written to be project-agnostic, and therefore has to account for more possible sources of data than are strictly needed here. I have provided a simplified version of this annotation for hg38 UK Biobank data, but this code should be fairly adaptable to other sequencing formats.
**Note**: All annotations below are done *without* the "chr" prefix!
Basic Steps:
1. Variant Agnostic Databases -- For each of the below annotations I need to generate GRCh38 versions. Annotations are stored in the root folder `annotations/hg38/`, with the corresponding directory name in brackets below:
    * CADD scores [`cadd/`]
        + We also have to generate CADD scores for variants not included in the default build.
        + I believe this is completely limited to InDels.
    * gnomAD allele frequencies [`gnomad/`]
        + Note that I am using AF_nfe, which is the non-Finnish European AF
    * MPC scores [`mpc/`]
        + Have to lift this over to hg38
    * PEXT scores [`pext/`]
        + Have to lift this over to hg38
2. Variant Specific Databases -- Need to generate annotations for each VCF, as variant loci/values will not be identical. These annotations are stored in directories alongside the vcf bed file, sharing its prefix and suffixed with the name of the information within (here `ukbb_200k.locations.*`):
    * VQSR annotations [`ukbb_200k.locations.vqsr/`]
        + For the purposes of UKBB this amounts to a 'fake' list of VQSR values for each variant, as we "assume" that these variants were already filtered to a high sensitivity/specificity. This just ensures that all variants pass default VQSR filtering in the java pipeline
    * VEP annotations [`ukbb_200k.locations.vep/`]
    * Supplemental CADD annotations [`ukbb_200k.locations.cadd/`]
All of the basic annotation is done in the root directory. As noted above, per-vcf annotations are located in `./vcfs/`. The actual "mashing together" of annotations is performed by the java pipeline described in section 4 below.
**Note**: _Many_ of the below commands will take a long time, either due to long download times or to actual processing time. It is highly recommended to run these with some sort of parallel compute architecture. Nonetheless, all of these commands should be runnable on a local machine.
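If you do not have a cluster scheduler handy, even simple per-block steps (such as indexing each downloaded pVCF) can be fanned out locally, for example with GNU parallel (a sketch, assuming GNU parallel is installed; an LSF/SGE array job works just as well):
```{bash Example parallel fan out, eval = F}
## Index every downloaded pVCF block, 16 jobs at a time:
ls vcfs/individual_vcfs/ukb23156_c*_b*_v1.vcf.gz | parallel -j 16 'bcftools index {}'
```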
## 3A. Create Required Directory Structure
Note that the variant-specific directories share a prefix with the vcf bed file that you made above in section 2 (`ukbb_200k.locations`). This must be maintained for downstream code to function.
```{bash Create Directories, eval = F}
## Make top level directories:
mkdir -p annotations/hg38/
## Make annotation-specific directories:
mkdir annotations/hg38/cadd/
mkdir annotations/hg38/gnomad/
mkdir annotations/hg38/mpc/
mkdir annotations/hg38/pext/
## Make variant-specific directories:
mkdir vcfs/ukbb_200k.locations.cadd/
mkdir vcfs/ukbb_200k.locations.vep/
mkdir vcfs/ukbb_200k.locations.vqsr/
```
## 3B. Variant Agnostic Databases
### CADD
Here we download the precomputed genome-wide CADD SNV scores for GRCh38; at the time of generating the data used in this study, this is what was available. There is no post-processing required for these files: simply download them, ensure the names are consistent with what is below, and you are good to go:
```{bash Obtain CADD Annotations, eval = F}
## Go to CADD directory:
cd annotations/hg38/cadd/
# Download CADD Table and index
curl -o whole_genome_SNVs.tsv.gz http://krishna.gs.washington.edu/download/CADD/v1.5/GRCh38/whole_genome_SNVs.tsv.gz
curl -o whole_genome_SNVs.tsv.gz.tbi http://krishna.gs.washington.edu/download/CADD/v1.5/GRCh38/whole_genome_SNVs.tsv.gz.tbi
```
### gnomAD
This code can work in two ways, depending on whether you have the gnomAD *v3* sites VCF as a local download. You can either use:
1. a web-enabled version of bcftools to stream sites (slower, but no storage needed), or
2. your own local download (which can also be used for other projects).
    * Make sure you download the .tbi index as well!
gnomAD v3 sites are available here: `https://storage.googleapis.com/gnomad-public/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.vcf.bgz`
```{bash Obtain Gnomad Annotations, eval = F}
## Go to gnomAD directory:
cd annotations/hg38/gnomad/
# Parse w/BCFtools:
## Online version:
bcftools query https://storage.googleapis.com/gnomad-public/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.vcf.bgz -f "%CHROM\t%POS\t%REF\t%ALT\t%FILTER\t%INFO/AF_nfe\n" > gnomad.tsv
## Offline version:
## Obviously change the path to your local download:
bcftools query /path/to/gnomadv3/gnomad.genomes.r3.0.sites.vcf.bgz -f "%CHROM\t%POS\t%REF\t%ALT\t%FILTER\t%INFO/AF_nfe\n" > gnomad.tsv
## Remove 'chr' prefix, bgzip, and tabix index.
sed -i 's_chr__' gnomad.tsv
bgzip gnomad.tsv
tabix -s 1 -b 2 -e 2 gnomad.tsv.gz
```
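As a quick sanity check, the finished table should be tabix-queryable, with one row per allele containing CHROM, POS, REF, ALT, FILTER, and AF_nfe (the position and values below are illustrative):
```{bash Check gnomAD table, eval = F}
## Run from the repository root; any small interval will do:
tabix annotations/hg38/gnomad/gnomad.tsv.gz 1:55505000-55520000 | head -n 1
## Expected layout (illustrative values):
# 1    55505477    C    T    PASS    1.2e-05
```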
### MPC
We obtain MPC scores from [Samocha et al.](https://www.biorxiv.org/content/10.1101/148353v1), who developed a score for missense constraint. The table we use includes MPC values for all missense variants in the genome. As this file does not exist for hg38, we need to use CrossMap to lift it over.
We hope that CrossMap is doing a good job, but we also do further checks as part of running our annotation engine to ensure that variants are being annotated properly.
```{bash Make hg38 MPC, eval = F}
## Go to MPC directory:
cd annotations/hg38/mpc/
## Get hg19 MPC:
curl -o fordist_constraint_official_mpc_values_v2.txt.gz https://storage.googleapis.com/gnomad-public/legacy/exacv1_downloads/release1/regional_missense_constraint/fordist_constraint_official_mpc_values_v2.txt.gz
curl -o fordist_constraint_official_mpc_values_v2.txt.gz.tbi https://storage.googleapis.com/gnomad-public/legacy/exacv1_downloads/release1/regional_missense_constraint/fordist_constraint_official_mpc_values_v2.txt.gz.tbi
## Download the hg19 -> hg38 chainfile from UCSC:
curl -o GRCh37_to_GRCh38.chain.gz ftp://ftp.ensembl.org/pub/assembly_mapping/homo_sapiens/GRCh37_to_GRCh38.chain.gz
## Need to make a fake VCF file for crossmap to use
# need a VCF so that CrossMap will check REF/ALT
# Encode the original file as an INFO field so we can decode it later
# Make sure to add chr because liftover requires it (chain file and all...)
zcat fordist_constraint_official_mpc_values_v2.txt.gz | perl -ne 'chomp $_; @F = split("\t", $_); if ($F[0] =~ /chrom/) {print "#CHROM\tPOS\tID\tREF\tALT\tQUALT\tFILTER\tINFO\n";} else {print "chr$F[0]\t$F[1]\t.\t$F[2]\t$F[3]\t.\t.\tMPC=" . join("|",@F) . "\n"}' > mpc.hg19.vcf
## Run crossmap:
python /path/to/CrossMap.py vcf GRCh37_to_GRCh38.chain.gz mpc.hg19.vcf /path/to/Homo_sapiens.GRCh38_15.fa mpc.hg38.vcf
## Convert back to the original file format:
grep -v '#' mpc.hg38.vcf | perl -ane 'chomp $_; $F[7] =~ s/MPC=//; @M = split("\\|", $F[7]); $M[0] = $F[0]; $M[1] = $F[1]; $M[2] = $F[3]; $M[3] = $F[4]; $M[0] =~ s/chr//; print join("\t", @M) . "\n";' > mpc.hg38.unsorted.tsv
## sort, zip, index
sort -k 1,1 -k 2,2n mpc.hg38.unsorted.tsv > mpc.tsv
bgzip mpc.tsv
tabix -s 1 -b 2 -e 2 mpc.tsv.gz
```
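A minimal spot-check of the liftover, assuming samtools and the local GRCh38 fasta used above are available: pull one lifted record and confirm that its REF allele matches the reference sequence at the new coordinate.
```{bash Spot check MPC liftover, eval = F}
## mpc.tsv columns after conversion are: chrom (no 'chr'), pos, ref, alt, then the original MPC fields:
read -r chrom pos ref rest < <(zcat annotations/hg38/mpc/mpc.tsv.gz | head -n 1)
## The REF allele spans pos..pos+len(ref)-1; add a 'chr' prefix here if your fasta uses one:
samtools faidx /path/to/Homo_sapiens.GRCh38_15.fa "${chrom}:${pos}-$((pos + ${#ref} - 1))"
```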
### PEXT
We obtain PEXT scores from [Cummings et al.](https://www.nature.com/articles/s41586-020-2329-2). This manuscript utilized data from GTEx to determine the proportion of transcripts in which a variant is located that are actually expressed. As with MPC, this file does not exist for hg38, and we need to use CrossMap to lift it over. The protocol is thus very similar.
```{bash Make hg38 pext, eval=F}
## Go to PEXT directory:
cd annotations/hg38/pext/
## First get hg19 version files:
curl -o all.possible.snvs.tx_annotated.GTEx.v7.021520.tsv.bgz https://storage.googleapis.com/gnomad-public/papers/2019-tx-annotation/pre_computed/all.possible.snvs.tx_annotated.GTEx.v7.021520.tsv.bgz
curl -o all.possible.snvs.tx_annotated.GTEx.v7.021520.tsv.bgz.tbi https://storage.googleapis.com/gnomad-public/papers/2019-tx-annotation/pre_computed/all.possible.snvs.tx_annotated.GTEx.v7.021520.tsv.bgz.tbi
## Parse mean-pext using python script (easier w/python as per-tissue scores are stored as a python table/json):
# This takes a pretty long time to run, so a good idea to launch on parallel compute cluster
./scripts/format_pext.py all.possible.snvs.tx_annotated.GTEx.v7.021520.tsv.bgz pext.tsv
## Download the hg19 -> hg38 chainfile from UCSC:
curl -o GRCh37_to_GRCh38.chain.gz ftp://ftp.ensembl.org/pub/assembly_mapping/homo_sapiens/GRCh37_to_GRCh38.chain.gz
## Need to make a fake VCF file for crossmap to use
# need a VCF so that CrossMap will check REF/ALT
# Make sure to add chr because liftover requires it (chain file and all...)
# Also add header so that we can parse fake INFO fields we create, set to strings to keep proper format
perl -ane 'chomp $_; print "$F[0]\t$F[1]\t.\t$F[3]\t$F[4]\t.\t.\tAVG=$F[5];CER=$F[6];GENE=$F[7]\n";' pext.tsv > pext.vcf
echo -e "##fileformat=VCFv4.3\n##INFO=<ID=AVG,Number=1,Type=String,Description=\"AVG Expression\">\n##INFO=<ID=CER,Number=1,Type=String,Description=\"CER Expression\">\n##INFO=<ID=GENE,Number=1,Type=String,Description=\"Gene name\">\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO" | cat - pext.vcf > pext.hg19.vcf
## Run crossmap:
# Also takes a good amount of time to run
python /path/to/CrossMap.py vcf GRCh37_to_GRCh38.chain.gz pext.hg19.vcf /path/to/Homo_sapiens.GRCh38_15.fa pext.hg38.vcf
## Convert back to the original file format:
bcftools query -f "%CHROM\t%POS\t%REF\t%ALT\t%INFO/AVG\t%INFO/CER\t%INFO/GENE\n" pext.hg38.vcf > pext.hg38.unsorted.tsv
## sort, zip, index
sort -k 1,1 -k 2,2n pext.hg38.unsorted.tsv > pext.tsv
bgzip pext.tsv
tabix -s 1 -b 2 -e 2 pext.tsv.gz
```
## 3C. Variant Specific Databases
For all of the following sections, I performed the required tasks *SPLIT* by each VCF that was downloaded in section 2 (977 separate vcf.gz files), and then `cat` the results together to generate a single file, as documented below. Each section below concerns the annotation of one of these vcf.gz files; it is up to the user to decide how best to parallelize this across all 977.
### VQSR
As UK Biobank provided variants in pseudo-plink VCF format, VQSR values are not included. Because this code is also used for other projects, I still annotate a VQSR value for every variant; for UK Biobank calls, we simply set it to 10 (to make sure all variants pass even strict VQSR filters).
```{bash Obtain VQSR Annotations, eval = F}
## Go to VQSR directory:
cd vcfs/ukbb_200k.locations.vqsr/
## This is a FAKE file with all 10s for VQSR:
## The 1_1.tsv is named based on <chr>_<block start coordinate>.tsv. The first vcf.gz file starts at chr1:1, thus the file name
bcftools norm -m - -Ou ukb23156_c1_b0_v1.vcf.gz | bcftools query -f "%CHROM\t%POS\t%REF\t%ALT\t%FILTER\t10\n" -o 1_1.tsv
sed -i 's_chr__' 1_1.tsv
## After running the above for all 977 files, we do the following:
cat *.tsv | sed 's_chr__' | sort -k 1,1 -k2,2n > vqsr.tsv
bgzip vqsr.tsv
tabix -s 1 -b 2 -e 2 vqsr.tsv.gz
```
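A sketch of how the per-block step could be looped, with the `<chr>_<block start coordinate>.tsv` names taken straight from `pvcf_blocks.txt` (run from `vcfs/ukbb_200k.locations.vqsr/`; adjust the relative paths if your blocks or block list live elsewhere):
```{bash Loop VQSR blocks, eval = F}
## Generate one fake-VQSR table per block; the 'chr' prefix is stripped by the cat step above:
while read -r idx chrom block start end; do
  bcftools norm -m - -Ou ../individual_vcfs/ukb23156_c${chrom}_b${block}_v1.vcf.gz | \
    bcftools query -f "%CHROM\t%POS\t%REF\t%ALT\t%FILTER\t10\n" -o "${chrom}_${start}.tsv"
done < ../individual_vcfs/pvcf_blocks.txt
```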
### CADD
As the CADD library files created above do not contain most InDel scores, we need to supplement them with a set of CADD scores for the InDels in the primary plink file. This annotation requires a local install of [CADD](https://github.com/kircherlab/CADD-scripts/), the setup of which is not covered in this document.
As CADD takes a good amount of time, this code is partially parallelized by the script `scripts/run_cadd.pl`.
```{bash run CADD supplemental annotations, eval = F}
## Go to CADD directory:
cd vcfs/ukbb_200k.locations.cadd/
## This is provided for example, but CADD recommends an install via conda.
## Make sure you have it/activate it prior to running CADD via the provided script
conda activate cadd-env-1.6
## Run CADD -- I have provided an example script of how I did this at scripts/run_cadd.pl. Please see that for how this was done:
./run_cadd.pl
## This will generate 977 files with the suffix *.processed.tsv which we then merge together to get our final annotation file:
cat *.processed.tsv | sort -k 1,1 -k2,2n > cadd.tsv
bgzip cadd.tsv
tabix -s 1 -b 2 -e 2 cadd.tsv.gz
```
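If you prefer not to use the provided wrapper, the per-block step boils down to pulling the InDel sites out of a block and scoring them with your local CADD install, roughly like this (a sketch; run from `vcfs/ukbb_200k.locations.cadd/`, and the exact `CADD.sh` options depend on your install):
```{bash CADD per block sketch, eval = F}
## Extract InDel sites (no genotypes) from one block and score them with CADD:
bcftools view -v indels -G -Ov -o indels_1_1.vcf ../individual_vcfs/ukb23156_c1_b0_v1.vcf.gz
CADD.sh -g GRCh38 -o indels_1_1.cadd.tsv.gz indels_1_1.vcf
```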
### VEP
For VEP, we use a build of v102, which can be obtained from github with the following commands:
```
git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep/
git checkout release/102
```
Newer builds should be OK, but have not been tested. I run VEP annotation in 977 chunks (dictated by the number of individual VCF chunks). One needs to make a "stripped" (sites-only) vcf.gz file for each of the VCF files downloaded from UK Biobank, as running VEP on a VCF with genotypes takes A LONG time. I also make use of the bcftools plugin [+split-vep](https://samtools.github.io/bcftools/howtos/plugin.split-vep.html), which is VERY helpful!
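Making the stripped files is straightforward with bcftools, for example (a sketch; run from `vcfs/ukbb_200k.locations.vep/`, and the output name just needs to match what the VEP command below expects):
```{bash Strip genotypes for VEP, eval = F}
## Drop sample genotypes so VEP only sees sites (much faster to annotate):
bcftools view -G -Oz -o ukb23156_c1_b0_v1.stripped.vcf.gz ../individual_vcfs/ukb23156_c1_b0_v1.vcf.gz
bcftools index -t ukb23156_c1_b0_v1.stripped.vcf.gz
```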
This code is also run with help from the script `scripts/process_vep.pl` -- see below.
A **Note** on LOFTEE:
The most complicated part of getting VEP to work was building loftee for GRCh38. Ensure you have a plugins dir for VEP as described [here](https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html). Some tips when doing this:
1. Pull the grch38 version of Konrad's [git repo](https://github.com/konradjk/loftee):
2. Download each of the required files [for hg38](https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/):
* human ancestor
* sql database
* GERP database
3. For loftee hg38 to run, it needs additional perl packages, most of which can be installed easily via cpan (not listed here). The only one that was an issue was Bio::DB::BigWig, which has a whole custom setup to follow. I fixed this by installing an older version of the "Kent" UCSC repo [v334](https://github.com/ucscGenomeBrowser/kent/tree/v334_branch) as referenced in [this issue](https://github.com/GMOD/GBrowse-Adaptors/issues/17). This also requires you to edit a file when installing kent, as noted in the same issue.
```{bash Run VEP, eval = F}
## Go to VEP directory
cd vcfs/ukbb_200k.locations.vep/
## The actual VEP command (DO NOT USE without editing for your own vep build)
vep \
--cache \
--dir_cache .vep/ \
--fasta .vep/homo_sapiens/102_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz \
--offline \
--format vcf \
--dir_plugins .vep/Plugins \
-i ukb23156_c1_b0_v1.stripped.vcf.gz \
--plugin SpliceRegion,Extended \
--plugin GeneSplicer,.vep/Plugins/GeneSplicer/bin/linux/genesplicer,.vep/Plugins/GeneSplicer/human \
--plugin UTRannotator,.vep/Plugins/uORF_starts_ends_GRCh38_PUBLIC.txt \
--plugin CADD,whole_genome_SNVs.tsv.gz,gnomad.genomes.r3.0.indel.tsv.gz \
--fork 4 \
--everything \
--plugin LoF,loftee_path:.vep/Plugins,human_ancestor_fa:.vep/Plugins/GRCh38_human_ancestor.fa.gz,conservation_file:.vep/Plugins/loftee.sql,gerp_bigwig:.vep/Plugins/gerp_conservation_scores.homo_sapiens.GRCh38.bw \
--plugin REVEL,.vep/Plugins/grch38_tabbed_revel.tsv.gz \
--plugin SpliceAI,snv=spliceai_scores.raw.snv.hg38.vcf.gz,indel=spliceai_scores.raw.indel.hg38.vcf.gz \
--vcf \
-o ukb23156_c1_b0_v1.vep.vcf.gz \
--compress_output bgzip \
--allele_number \
--verbose \
--force
## I then do post-processing on each of these files with the bcftools plugin '+split-vep' via the script process_vep.pl
./process_vep.pl ukb23156_c1_b0_v1.vep.vcf.gz 1_1.vep.sites.tsv
## And then once all files are processed, put them all together:
cat *.sites.tsv | sort -k 1,1 -k 2,2n > vep.tsv
bgzip vep.tsv
tabix -s 1 -b 2 -e 2 vep.tsv.gz
```
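For reference, the heart of that post-processing is the `+split-vep` plugin, which flattens the VEP CSQ annotation into a plain table; a minimal sketch (not the exact fields extracted by `process_vep.pl`) looks like this:
```{bash split-vep sketch, eval = F}
## One row per variant/consequence pair, pulling a few example VEP subfields:
bcftools +split-vep -d \
  -f '%CHROM\t%POS\t%REF\t%ALT\t%SYMBOL\t%Consequence\t%LoF\n' \
  ukb23156_c1_b0_v1.vep.vcf.gz | head
```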
# 4. Variant Engine
This next section runs the annotation engine I have written in java. The engine comes as both a compiled .jar file (compiled for Java 14) and raw source files, which contain additional minor documentation. The source is located at `src/SNVCounter/` and comes with a pre-set Eclipse project (`src/SNVCounter/.project`) that can be imported into Eclipse for editing purposes.
## 4A. Preparing Gene Lists
This code block is almost _identical_ to the "Curating Genelists" section of the `PhenotypeTesting.Rmd` document. The only exception is that we create a 'coordinates.txt' file for both hg19 (not used for the hg38 analysis) and hg38, which is used as the primary genelist input for our annotation engine.
Reiterating the gene lists that we use here:
1. ENSEMBL-downloaded resources from [BioMart](https://www.ensembl.org/biomart/martview/0511514c231557b5d24ace4e8f7862e0).
2. pLI information from the [gnomAD project](https://storage.googleapis.com/gnomad-public/release/2.1.1/constraint/gnomad.v2.1.1.lof_metrics.by_gene.txt.bgz).
3. s~het~ from [Weghorn et al](https://doi.org/10.1093/molbev/msz092). The reference file is included in this repository (`rawdata/genelists/shet.weghorn.txt`).
**Note**: These scripts assume you have `curl` installed on your system, which _should_ be true on macOS and most *nix systems. Please change the scripts below if this is not the case.
### Download Resources from BioMart
```{r Generate biomart resources}
## Hg19
ensembl <- useMart("ensembl", host="http://grch37.ensembl.org", dataset = "hsapiens_gene_ensembl")
hg19.table <- data.table(getBM(attributes = c('ensembl_gene_id','chromosome_name','start_position','end_position','hgnc_id','hgnc_symbol','ensembl_transcript_id'),mart = ensembl))
hg19.table <- hg19.table[!grep("_",chromosome_name)]
write.table(hg19.table,"rawdata/genelists/hg19.genes.txt",col.names=F,row.names=F,quote=F,sep="\t")
## Hg38
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
hg38.table <- data.table(getBM(attributes = c('ensembl_gene_id','chromosome_name','start_position','end_position','hgnc_id','hgnc_symbol','strand'),mart = ensembl))
hg38.table[,hgnc_id:=str_remove(hgnc_id,"HGNC:"),by=1:nrow(hg38.table)]
hg38.table <- hg38.table[!grep("CHR_",chromosome_name)]
write.table(hg38.table,"rawdata/genelists/hg38.genes.txt",col.names=F,row.names=F,quote=F,sep="\t")
rm(hg19.table,hg38.table,ensembl)
```
### s~het~ Gene Lists
```{bash Generate sHET gene lists}
perl -ane 'chomp $_; @F = split("\t", $_); print "$F[0]\t$F[7]\n";' rawdata/genelists/shet.weghorn.txt > rawdata/genelists/shet.processed.weghorn.txt
perl -ane 'chomp $_; @F = split("\t", $_); print "$F[0]\t$F[1]\n";' rawdata/genelists/shet.cassa.txt > rawdata/genelists/shet.processed.cassa.txt
## sHET gene lists (have to attach ENSG):
scripts/matcher.pl -file1 rawdata/genelists/hg19.genes.txt -col1 5 -file2 rawdata/genelists/shet.processed.weghorn.txt -r | perl -ane 'chomp $_; print "$F[2]\t$F[0]\t$F[1]\t$F[6]\n";' > rawdata/genelists/shet.hgnc.txt
scripts/matcher.pl -file1 rawdata/genelists/hg19.genes.txt -col1 5 -file2 rawdata/genelists/shet.processed.cassa.txt -r | perl -ane 'chomp $_; print "$F[2]\t$F[0]\t$F[1]\t$F[6]\n";' > rawdata/genelists/shet.cassa.hgnc.txt
```
### Hg19 Gene Lists
```{bash Generate hg19 Gene Lists}
## Download gnomAD scores:
curl -o rawdata/genelists/gnomad.v2.1.1.lof_metrics.by_gene.txt.bgz https://storage.googleapis.com/gnomad-public/release/2.1.1/constraint/gnomad.v2.1.1.lof_metrics.by_gene.txt.bgz
## Rename the file so gunzip will recognise it (gnomAD distributes it with a .bgz suffix):
mv rawdata/genelists/gnomad.v2.1.1.lof_metrics.by_gene.txt.bgz rawdata/genelists/gnomad.v2.1.1.lof_metrics.by_gene.txt.gz
gunzip -f rawdata/genelists/gnomad.v2.1.1.lof_metrics.by_gene.txt.gz
## Create a reference file of just ENSG and pLI, while removing genes w/o a pLI score:
perl -ane 'chomp $_; @F = split("\t", $_); if ($F[20] ne "NA") {print "$F[63]\t$F[20]\n";}' rawdata/genelists/gnomad.v2.1.1.lof_metrics.by_gene.txt > rawdata/genelists/hg19.all_genes_with_pli.txt
## Add additional info from biomart that we acquired:
# pLI file:
scripts/matcher.pl -file1 rawdata/genelists/hg19.genes.txt -file2 rawdata/genelists/hg19.all_genes_with_pli.txt -r | perl -ne 'chomp $_; @F = split("\t", $_); print "$F[0]\t$F[3]\t$F[4]\t$F[5]\t$F[6]\t$F[7]\t$F[8]\t$F[1]\n";' > rawdata/genelists/hg19.all_genes_with_pli.2.txt
mv rawdata/genelists/hg19.all_genes_with_pli.2.txt rawdata/genelists/hg19.all_genes_with_pli.txt
# sHET file:
scripts/matcher.pl -file1 rawdata/genelists/hg19.genes.txt -file2 rawdata/genelists/shet.hgnc.txt -r | perl -ne 'chomp $_; @F = split("\t", $_); print "$F[0]\t$F[5]\t$F[6]\t$F[7]\t$F[8]\t$F[9]\t$F[10]\t$F[2]\n";' > rawdata/genelists/hg19.all_genes_with_shet.txt
```
### Hg38 Gene Lists
```{bash Generate hg38 Gene lists}
# Try and match genes to Hg19 based on HGNC ID
# Generate a list of hg19 genes with HGNC IDs:
perl -ane 'chomp $_; if ($F[4] ne "NA" && $F[4] ne "") {print "$F[4]\t$F[0]\t$F[5]\n";}' rawdata/genelists/hg19.genes.txt | sort | uniq > rawdata/genelists/hg19.trans.txt
scripts/matcher.pl -file1 rawdata/genelists/hg19.trans.txt -file2 rawdata/genelists/hg38.genes.txt -col2 4 -r | perl -ane 'chomp $_; print "$F[0]\t$F[1]\t$F[2]\t$F[3]\t$F[4]\t$F[5]\t$F[8]\n";' > rawdata/genelists/hg38.hgnc.matched.txt
# Ask which genes have a pLI score:
scripts/matcher.pl -file1 rawdata/genelists/hg38.hgnc.matched.txt -col1 6 -file2 rawdata/genelists/hg19.all_genes_with_pli.txt -r | perl -ane 'chomp $_; @F = split("\t", $_); print "$F[8]\t$F[9]\t$F[10]\t$F[11]\t$F[12]\t$F[13]\t$F[0]\t$F[7]\n";' > rawdata/genelists/hg38.all_genes_with_pli.txt
# Ask which genes have a sHET score:
scripts/matcher.pl -file1 rawdata/genelists/hg38.hgnc.matched.txt -col1 6 -file2 rawdata/genelists/hg19.all_genes_with_shet.txt -r | perl -ane 'chomp $_; @F = split("\t", $_); print "$F[8]\t$F[9]\t$F[10]\t$F[11]\t$F[12]\t$F[13]\t$F[0]\t$F[7]\n";' > rawdata/genelists/hg38.all_genes_with_shet.txt
# There is a fairly large caveat here: I label the genes with their hg19 ENSG ID so that I can be consistent in my R code below!!! This doesn't impact too many genes; most have the same ID in both builds (~200-300 differ)
# This gets a translatable list to hg19 ENSG###:
perl -ne 'chomp $_; @F = split("\t", $_); print "$F[0]\t$F[6]\n";' rawdata/genelists/hg38.all_genes_with_pli.txt > rawdata/genelists/hg38_to_hg19_ENSG.txt
perl -ne 'chomp $_; @F = split("\t", $_); print "$F[0]\t$F[6]\n";' rawdata/genelists/hg38.all_genes_with_shet.txt >> rawdata/genelists/hg38_to_hg19_ENSG.txt
sort rawdata/genelists/hg38_to_hg19_ENSG.txt | uniq > rawdata/genelists/hg38_to_hg19_ENSG.2.txt
mv rawdata/genelists/hg38_to_hg19_ENSG.2.txt rawdata/genelists/hg38_to_hg19_ENSG.txt
```
## 4B. Running
This annotation engine requires all flags; descriptions can also be seen by running with only the `--help` flag. Briefly, these flags are:
* genelist: This is the file created above: `hg38.coordinates.txt`
* gene: An integer between 1 and the number of genes in hg38.coordinates.txt (currently 18,756 genes).
* genomeversion: The version of the genome being used. This should always be hg38 for UK Biobank data, but can be changed to hg19 for other datasets. This takes care of two things:
    + Tells the engine which annotation set to use at `annotations/hg**/`
    + It is also a hacky way of handling the 'chr' prefix for chromosome IDs typical of hg38 VCFs. I currently don't have a way to deal with hg38 VCFs that don't use 'chr' and, for hg19, vice versa.
* vcf: path to the UK Biobank VCF we downloaded above, with annotations at the correct locus. For the 200k data this is the *BED file* that has coordinates attached to each vcf chunk. This can also be a single vcf file if you choose to run it this way.
    + The java code provided knows to use this bed file to find the vcf slice that contains variants for the gene given by `-gene`
* samples: list of samples to include in annotations, with one ID per line.
    + This is simply the output of `bcftools query -l ukb23156_c1_b1.vcf.gz`, where the vcf file used can be ANY of the VCF chunks acquired from UKBiobank
    + Sample IDs must match identically to those found in the header. UKBiobank WES VCF IDs look like "123456", where 123456 is the eid of the individual.
    + This is important to get correct, as this is how AF is calculated and how individuals are included in the final output.
* maf: The MAF filter to apply to variants. This _only_ filters based on allele frequencies calculated from the VCF file itself, not the provided gnomAD MAFs. It can be set higher, but could generate a *very* large file. For this project we exclude variants with MAF > 1e-3.
* outdir: Where to store per-gene txt files like: `results_ukbb_wes/ENSG00000003056.rare_variants.tsv`
* rootdir: Where the annotations we created above are stored.
This command is intended to be parallelized; each job essentially annotates one gene at a time within the file given to `-genelist`. In the code chunk below I give the raw command to run the first gene in this file, followed by an example LSF submission for parallelization.
```{bash Run Annotations, eval = F}
## Example to run one gene.
java -XX:+UseSerialGC -Xms2400M -Xmx2400M -jar /nfs/ddd0/eg15/SNVCounter.jar -genelist annotations/hg38/genelists/hg38.coordinates.txt -gene 1 -genomeversion hg38 -vcf vcfs/ukbb_200k.locations.bed.gz -samples ukbb_ids.200k.txt -maf 0.001 -outdir results_ukbb_wes_200k/ -rootdir annotations/
## Example job submission command for LSF:
bsub -q normal -M 4000 -o gridout/count.%J.%I -J 'SNV[1-18791]%1000' "java -XX:+UseSerialGC -Djava.io.tmpdir=/lustre/scratch115/teams/hurles/users/eg15/INTERVAL/snv_cognition/tmp/ -Xms4000M -Xmx4000M -jar /nfs/users/nfs_e/eg15/SNVCounter.jar -genelist annotations/hg38/genelists/hg38.coordinates.txt -gene \$LSB_JOBINDEX -genomeversion hg38 -vcf vcfs/ukbb_200k.locations.bed.gz -samples ukbb_ids.200k.txt -maf 0.001 -outdir results_ukbb_wes_200k/ -rootdir annotations/"
## Mash the genes together:
grep -hv 'NONE' results_ukbb_wes_200k/*.rare_variants.tsv > counts.ukbb_wes.200k.txt
```
# 5. Postprocessing
Due to the large amount of data, we can no longer load 'counts.ukbb_wes.200k.txt' directly into a local version of RStudio and process it here; instead, we process it with the script `scripts/process_snvs.200k.R`. This script calculates both regular s~het~ scores and s~het~ scores that exclude specific genes for use in `PhenotypeTesting.Rmd`. It requires ~20Gb of memory to run.
## 5A. Gene Exclusion Data
Each of these sections downloads some "exclusion" list that we then use to generate a modified s~het~ for a specific analysis we do in `PhenotypeTesting.Rmd`.
### Expression-Based Exclusion
This code just quickly downloads GTEx expression data (v7) from the [GTEx Consortium](https://gtexportal.org/home/datasets) and loads it into R.
```{bash Download GTEx}
curl -o rawdata/genelists/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_median_tpm.gct.gz https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_median_tpm.gct.gz
gunzip -f rawdata/genelists/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_median_tpm.gct.gz
```
We then load it into memory here.
```{r Load Gene Expression}
## GTEX expression data:
expression <- fread("rawdata/genelists/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_median_tpm.gct")
expression[,c("gene","vers"):=tstrsplit(gene_id,".",fixed=T),by=1:nrow(expression)]
setnames(expression,"gene","hg19.GENE")
## Load sHET Genes:
shet.genes <- fread("rawdata/genelists/shet.hgnc.txt")
setnames(shet.genes,names(shet.genes),c("hg19.GENE","GENE","sHET.val","HGNC.ID"))
shet.genes[,deciles:=cut(sHET.val,breaks=quantile(sHET.val,seq(0,1,by=0.1)),include.lowest = T)]
shet.genes[,sHET.val.binary:=cut(sHET.val,breaks=c(0,0.15,1),labels = c("lt_015","gt_015"),right = F)]
## Only retain genes with an sHET score:
expression <- merge(expression, shet.genes[,"hg19.GENE"], by = "hg19.GENE")
```
#### Making Gene Lists For Calculating sHET Burden
This block generates "expressed gene" lists for the purposes of creating tissue-specific s~het~ burdens. These individual files are placed in a subdirectory (`tissues/`), located within the `genelists/` directory:
```{bash make tissue dir}
mkdir -p rawdata/genelists/tissues/
```
```{r make gene lists}
## Generate highly expressed gene lists for all tissues:
for (t in names(expression)[4:56]) {
t.file <- str_remove_all(t, " ")
t.file <- str_replace_all(t.file,"-","_")
t.file <- str_replace_all(t.file,"\\(\\S+\\)","")
t.file <- str_to_lower(t.file)
file <- paste0("rawdata/genelists/tissues/",t.file,"High.txt")
fwrite(expression[get(t) > 0.5, "hg19.GENE"], file, quote = F, row.names = F, col.names = F, sep = "\t")
}
```
### Disease-Based Exclusion
This section of the document is not evaluated, as we need to acquire disease gene resources from three locations. The actual acquisition of these files is trivial, but I did not want to attempt to make it reproducible due to the potential for links to break. If this section needs to be reproduced, download the resources from the _rough_ locations below and run the code in this section. Otherwise, move to the next section, where the file produced by this section has already been generated. Locations to place the necessary files and the file dates used in the manuscript are listed below:
1. Clinvar - https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/archive_2.0/2019/clinvar_20191003.vcf.gz
    + `rawdata/genelists/clinvar.vcf.gz`
    + date: October 3, 2019
2. DDG2P - http://www.ebi.ac.uk/gene2phenotype/downloads/DDG2P.csv.gz
    + `rawdata/genelists/DDG2P.csv`
    + date: November 12, 2019
3. OMIM - https://www.omim.org/downloads
    + `rawdata/genelists/morbidmap.txt`
    + date: October 8, 2019
**Note**: To download OMIM morbid map you will need to register and place a copy of this file at: `rawdata/genelists/morbidmap.txt`
```{r Process Disease Genes, eval = F}
## Clinvar
clinvarVCFfiltered <- read_tsv("rawdata/genelists/clinvar.vcf.gz", comment = '#', col_names = F, col_types = 'cccccccccccccc') %>%
mutate(PHEN = str_remove(str_extract(X8, 'CLNDN=[^;]*;'), "CLNDN=")) %>%
mutate(PHEN = str_remove(PHEN, "not_provided")) %>%
mutate(PHEN = str_remove(PHEN, "|")) %>%
mutate(PHEN = str_remove(PHEN, ";")) %>%
filter(PHEN != '') %>%
mutate(GENE = str_remove(str_extract(X8, 'GENEINFO=[^;]*:'), "GENEINFO=")) %>%
mutate(CLNSIG = str_remove(str_extract(X8, 'CLNSIG=[^;]*;'), "CLNSIG=")) %>%
filter(CLNSIG %in% c("Pathogenic/Likely_pathogenic;", "Pathogenic;", "Likely_pathogenic;")) %>%
mutate(CLNSIG = str_remove(CLNSIG, ";")) %>%
mutate(ID = X3) %>%
select(c(X1,X2,ID, PHEN, GENE, CLNSIG)) %>%
drop_na(PHEN, GENE)
clinvarGenes <- clinvarVCFfiltered %>%
select(GENE) %>%
distinct() %>%
transform(GENE = strsplit(GENE, "\\|")) %>%
unnest(GENE) %>%
mutate(GENE = gsub(":.*", "", GENE))
clinvarGenes <- data.table(clinvarGenes)
## DDG2P
ddg2p <- fread("rawdata/genelists/DDG2P.csv") %>%
rename(hgnc_id = `hgnc id`)
develGenes <- ddg2p %>%
filter(`DDD category` %in% c("probable","confirmed","both DD and IF")) %>%
distinct(`hgnc_id`)
develGenes <- data.table(develGenes)
## OMIM
omimGenes <- read_tsv("rawdata/genelists/morbidmap.txt", skip = 4, comment = '#', col_types = 'ccic',
col_names = c("Pheno", "Gene", "MIM_Number", "Cyto_Location")) %>%
select(Pheno, Gene) %>%
drop_na() %>%
mutate(Gene = gsub(",.*", "", Gene)) %>%
mutate(omim_nondisease = ifelse(grepl("\\[", Pheno), T,F)) %>%
mutate(omim_complex = ifelse(grepl("\\{", Pheno), T,F)) %>%
mutate(omim_provisional = ifelse(grepl("\\?", Pheno), T,F))
omimGenes <- omimGenes %>%
filter(!omim_complex & !omim_nondisease)
omimGenes <- data.table(omimGenes)
hgnc.to.ENSG <- fread("rawdata/genelists/hg19.coordinates.txt",fill=T)
setnames(hgnc.to.ENSG,names(hgnc.to.ENSG),c("hg19.GENE","chr","start","end","hgnc_id","hgnc.symbol"))
clinvarGenes <- unique(merge(clinvarGenes,hgnc.to.ENSG,by.x="GENE",by.y="hgnc.symbol"))
develGenes <- unique(merge(develGenes,hgnc.to.ENSG,by="hgnc_id"))
omimGenes <- unique(merge(omimGenes[,c("Gene")],hgnc.to.ENSG,by.x="Gene",by.y="hgnc.symbol"))
diseaseGenes <- bind_rows(clinvarGenes[,c("hg19.GENE")],develGenes[,c("hg19.GENE")],omimGenes[,c("hg19.GENE")])
diseaseGenes <- unique(diseaseGenes)
write.table(diseaseGenes, "rawdata/genelists/diseaseGenes.txt",col.names = F, row.names = F, quote = F, sep ="\t")
rm(ddg2p, develGenes, omimGenes, clinvarVCFfiltered, clinvarGenes, hgnc.to.ENSG, diseaseGenes)
```
### Male Infertility-Based Exclusion
This is a list of genes confirmed as playing a role in male infertility from [Manon Oud and Joris Veltman](https://academic.oup.com/humrep/article/34/5/932/5377831). We download and process the data directly from their supplement (Tables S3-S6):
**Note**: I have no idea if the link below will work for everyone. If it doesn't, download "Supplementary Table 3-6" manually from [here](https://academic.oup.com/humrep/article/34/5/932/5377831#supplementary-data) and save the file as `rawdata/genelists/male_infertility_genes.xlsx`
```{bash Get Male Infertility Genes, eval = F}
curl -L -o rawdata/genelists/male_infertility_genes.xlsx "https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/humrep/34/5/10.1093_humrep_dez022/2/dez022_supplementary_tables_siii_-_svi_final.xlsx?Expires=1610104943&Signature=aNNfVC~2fDVGJtkTwMipconDNKJ-wBpIAuirlY9Vlb10rdiEhGSlRiV2w01eiRwPdg~c7N6j~5Vvle-XbWzPs8hrNnDJkTZSzTJSzAY8rJJcGsvQrrZrmakP87O9iIEuKvsBqCyhLU04osczMLaWVRL7oSX~MmBiPNOgWIOUs7nQodIhFAGf9gyscadyUJ9q3yL5ptybEOcd2VbkeDNuHgRCWuhbE1KB7LQStWbRp6gxXR6JMf8qSHWxknccaxBTdhQ3Pk0bLYL3dxOjr2PYikPlogfO98JF2HgakoJmumx-zJ1wh~4RuB9tJykz3or6FMfOqJLSOiFPF9XVEUJ0zQ__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA"
```
```{r Process Male Infertility Gene XLSX}
male.genes.xlsx <- data.table(read_xlsx("rawdata/genelists/male_infertility_genes.xlsx",sheet = "Supplementary Table SIV",skip=1))
male.genes.xlsx <- male.genes.xlsx[4:nrow(male.genes.xlsx)]
male.genes.xlsx <- male.genes.xlsx[,c("HGNC gene name","Inheritance pattern in human","Conclusion")]
male.genes.xlsx <- male.genes.xlsx[`HGNC gene name` != ""]
male.genes.xlsx <- male.genes.xlsx[Conclusion!="No evidence"]
male.genes.xlsx <- male.genes.xlsx[`Inheritance pattern in human`!= "XL" & `Inheritance pattern in human` != "YL"]
## Need to rename some genes as they used Hg38 gene IDs:
male.genes.xlsx[,`HGNC gene name`:=if_else(`HGNC gene name`=="CATSPERE","C1orf101",`HGNC gene name`)]
male.genes.xlsx[,`HGNC gene name`:=if_else(`HGNC gene name`=="CFAP43","WDR96",`HGNC gene name`)]
male.genes.xlsx[,`HGNC gene name`:=if_else(`HGNC gene name`=="CFAP44","WDR52",`HGNC gene name`)]
male.genes.xlsx[,`HGNC gene name`:=if_else(`HGNC gene name`=="CFAP69","C7orf63",`HGNC gene name`)]
male.genes.xlsx[,`HGNC gene name`:=if_else(`HGNC gene name`=="DNAAF4","DYX1C1",`HGNC gene name`)]
male.genes.xlsx[,`HGNC gene name`:=if_else(`HGNC gene name`=="DNAAF5","HEATR2",`HGNC gene name`)]
write.table(male.genes.xlsx,"rawdata/genelists/male_infertility_genes.txt",row.names=F,col.names=F,sep="\t",quote=F)
rm(male.genes.xlsx)
```
```{bash Annotate Male Infertility Genes}
scripts/matcher.pl -file1 rawdata/genelists/hg19.genes.txt -file2 rawdata/genelists/male_infertility_genes.txt -col1 5 -r | perl -ne 'chomp $_; @F = split("\t", $_); splice(@F, 4, 9); print join("\t", @F) . "\n";' > rawdata/genelists/male_infertility_genes.annotated.txt
```
The same matching step is applied to the mouse infertility gene list (`rawdata/genelists/mouse_infertility_genes.txt`):
```{bash Annotate Mouse Infertility Genes}
scripts/matcher.pl -file1 rawdata/genelists/hg19.genes.txt -file2 rawdata/genelists/mouse_infertility_genes.txt -col1 5 -r | perl -ne 'chomp $_; @F = split("\t", $_); splice(@F, 4, 9); print join("\t", @F) . "\n";' > rawdata/genelists/mouse_infertility_genes.annotated.txt
```
## 5B. Run Postprocessing
The script `scripts/process_snvs.200k.R` takes NO command-line inputs, but the following files need to be in the directory you execute the script from, *EXCEPT* for the tissue files, which need to be in `./tissues/`:
1. hg38 --> hg19 Gene name translation file: `hg38_to_hg19_ENSG.txt` (created in section: "Hg38 Gene Lists")
2. s~het~ scores file: `shet.hgnc.txt` (created in section "s~het~ Gene Lists")
3. Cassa et al s~het~ scores file: `shet.cassa.hgnc.txt` (created in section "s~het~ Gene Lists")
4. pLI gene scores file: `hg19.all_genes_with_pli.txt` (created in section "Hg19 Gene Lists")
5. Variant counts file: `counts.ukbb_wes.200k.txt` (created in section "4B. Running")
6. sample ID files: `ukbb_ids.200k.txt` (created in section "4B. Running")
7. Individuals to keep file: `indv_to_keep.txt`
    + This file is a list of all unrelated, European ancestry individuals that will be included in this study (n ~ 350k). The generation of this file is described at the top of `PhenotypeTesting.Rmd`.
8. Disease genes: `diseaseGenes.txt` (created in section "Disease-Based Exclusion")
9. Male infertility genes: `male_infertility_genes.annotated.txt` (created in section "Male Infertility-Based Exclusion")
10. Mouse infertility genes: `mouse_infertility_genes.annotated.txt` (created in section "Male Infertility-Based Exclusion")
11. Tissue-specific lists: in `./tissues/`, named like `stomachHigh.txt` (created in section "Expression-Based Exclusion")
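One way to stage these inputs before running the script (a sketch; adjust the source paths to wherever you created each file):
```{bash Stage postprocessing inputs, eval = F}
## Collect all required inputs into one working directory, with tissue lists in ./tissues/:
mkdir -p postprocessing/tissues/
cp rawdata/genelists/hg38_to_hg19_ENSG.txt rawdata/genelists/shet.hgnc.txt rawdata/genelists/shet.cassa.hgnc.txt \
   rawdata/genelists/hg19.all_genes_with_pli.txt rawdata/genelists/diseaseGenes.txt \
   rawdata/genelists/male_infertility_genes.annotated.txt rawdata/genelists/mouse_infertility_genes.annotated.txt \
   counts.ukbb_wes.200k.txt ukbb_ids.200k.txt indv_to_keep.txt postprocessing/
cp rawdata/genelists/tissues/*High.txt postprocessing/tissues/
```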
The script will generate three .rds formatted outputs that are used as the direct input of `PhenotypeTesting.Rmd` in section 4D of that document:
1. 200k_counts.rds - Main file of per-individual variant data
2. counts.rds - Number of total variants per-person
3. genes.rds - Number of total variants per-gene
```{bash Postprocessing, eval = F}
./process_snvs.200k.R
```