Single cohort analysis using SMMAT
First of all, [get your data ready](https://github.com/hmgu-itg/burden_testing/wiki/Data-input), if you haven't already done so.
This step extracts all positions and alleles in the cohort of interest, to inform the variant selector of the available pool of variants. This is achieved using the step1 script:
step1 [cohort_name] [vcf_file]
[cohort_name] : name of the cohort
[vcf_file] : tabixed, gzipped VCF file. This can be single-chromosome or an autosomal file (but nothing else).
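Before running step1, it is worth checking that the index actually sits next to the VCF. A minimal pre-flight check (the file name is a hypothetical placeholder):

```shell
# Check that a tabix index (.tbi) exists alongside the bgzipped VCF.
# "mycohort.vcf.gz" is a placeholder; substitute your own path.
VCF_PATH=mycohort.vcf.gz
if [ -f "${VCF_PATH}.tbi" ]; then
  echo "index found for $VCF_PATH"
else
  echo "missing tabix index for $VCF_PATH; create it with: tabix -p vcf $VCF_PATH" >&2
fi
```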
If you are using Singularity, you have to wrap the call in a singularity exec invocation, like so:
singularity exec -B $(pwd) burden.1.6 step1 $COHORT_NAME $VCF_PATH
If you are using a job scheduler like SLURM, the call becomes:
sbatch --mem 2G -e step1.e -o step1.o --wrap "singularity exec -B $(pwd) burden.1.6 step1 $COHORT_NAME $VCF_PATH"
This task is not memory-intensive so small requirements should be sufficient.
The output files follow the standardized naming [cohort_name][.chrom_name].variantlist.gz, where chrom_name is determined automatically by the script.
If you produced per-chromosome files, you will need to concatenate them:
for i in {1..22}; do zcat $cohort_name.chr$i.variantlist.gz; done | sed 's/^chr//' | gzip > $cohort_name.all.variantlist.gz
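To see what the concatenation does, here is the same pipeline run on two tiny mock per-chromosome files (file names and contents are made up for illustration):

```shell
cohort_name=demo
# Two mock per-chromosome variant lists, with the "chr" prefix present.
printf 'chr1\t100\trs1\tA\tG\n' | gzip > $cohort_name.chr1.variantlist.gz
printf 'chr2\t200\trs2\tC\tT\n' | gzip > $cohort_name.chr2.variantlist.gz
# Concatenate and strip the leading "chr", exactly as in the command above.
for i in 1 2; do zcat $cohort_name.chr$i.variantlist.gz; done \
  | sed 's/^chr//' | gzip > $cohort_name.all.variantlist.gz
zcat $cohort_name.all.variantlist.gz
```

The merged file contains both variants with the chromosome names normalised to plain numbers.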
At this stage, you can filter your variant list to select variants which you are interested in. For convenience, we provide a script to filter on missingness and minor allele frequency:
single_cohort_munge_variantlist [variantlist] [max_missingness] [max_af]
⚠️ If you do not want to filter, or if you apply custom filters (such as indel removal using e.g. R), please still run single_cohort_munge_variantlist [variantlist] 1 1
afterwards to produce a properly formatted tabixed file.
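As an example of such a custom filter, here is a minimal indel-removal sketch in awk, assuming the five-column chr, pos, ID, ref, alt layout (the input file is mocked up; remember to still run single_cohort_munge_variantlist afterwards):

```shell
# Mock input: one SNP and one indel, in the 5-column chr/pos/ID/ref/alt layout.
printf '1\t100\trs1\tA\tG\n1\t200\trs2\tAT\tA\n' | gzip > demo.all.variantlist.gz
# Keep biallelic SNPs only: ref and alt must both be single bases.
zcat demo.all.variantlist.gz \
  | awk 'length($4) == 1 && length($5) == 1' \
  | gzip > demo.snps.variantlist.gz
zcat demo.snps.variantlist.gz
```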
Next, you will need the eQTL, gene region and regulatory region information. This is downloaded from online sources by the prepare-regions script. To get a list of all available options to the prepare-regions command, run
prepare-regions -h
Usage: ./prepare-regions -o <output directory> : required, output directory
-n : optional, do not download Eigen scores
-c : optional, do not download CADD scores
-x : optional, create backup of the downloaded data
-r : optional, re-use previous downloads if present
-d : optional, just download data and exit, do not process them
-s : optional, do not perform checksums (unsafe)
-t : optional, directory to store temporary data, default: "/tmp"
-e <ftp://ftp.ensembl.org> : optional, Ensembl FTP server
prepare-regions performs two implicit steps:
- download all the necessary data
- merge it
Normally, both steps should be run in one go:
prepare-regions -o $(pwd)/geneset_data
However, if you want to run the downloading and merging steps separately, you can just download data first:
prepare-regions -o $(pwd)/geneset_data -d -r
(note that you can restart this step, in case of Internet connectivity issues)
and then do the merging:
prepare-regions -o $(pwd)/geneset_data -r
The downloading and merging process takes some time and downloads a sizeable amount of data (about 200 GB). The above command also generates a config.txt file in the geneset_data directory, which contains the locations of the files needed for the analyses. If files in the output directory (geneset_data) are later moved to a new location, the corresponding entries in config.txt must be updated manually.
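After moving files, you can sanity-check the config in a loop like the one below. This is only a sketch: the key=path line format shown here is an assumption, so adapt the parsing to the real config.txt layout.

```shell
# Check that every file referenced in a config still exists after a move.
# The key=path format and the entry names are hypothetical.
touch present.bed
printf 'GENCODE_FILE=present.bed\nEIGEN_FILE=missing.bed\n' > config.demo.txt
while IFS='=' read -r key path; do
  [ -e "$path" ] || echo "stale entry: $key -> $path"
done < config.demo.txt
```

On the mock config, this reports the EIGEN_FILE entry as stale, since missing.bed does not exist.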
This data can now be cross-referenced with your variants to determine which variants fall within which regions of interest, and to build test sets from that.
Your variant data and the gene overlap data can now be combined to produce set annotations with weights. The variant annotator that does this allows a lot of flexibility in how you define your test sets. We have found these four variant selection methods to give good results:
name | description | arguments |
---|---|---|
exon severe | variants with a "high" predicted consequence according to Ensembl (roughly equivalent to more severe than missense) | -g exon -o |
exon CADD | all exonic variants (+50 bp outside of exons) weighted by CADD scores | -g exon -x 50 -s CADD |
exon regulatory | same as above, but weighted by Eigen scores (phred-scaled). Variants in regulatory regions that overlap with eQTL for that gene are also included. | -g exon -x 50 -e promoter,enhancer,TF_bind -l promoter,enhancer,TF_bind -s EigenPhred |
regulatory only | same as above but excluding exonic variants | -e promoter,enhancer,TF_bind -l promoter,enhancer,TF_bind -s EigenPhred |
The full command-line arguments of the variant selector are described below. To make parallelisation easier, we expose the -d argument, which splits all Gencode genes into a fixed number of chunks. The number of the chunk being computed is specified by -c, or by setting the job array environment variables $LSB_JOBINDEX or $SLURM_ARRAY_TASK_ID for LSF and SLURM, respectively.
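The chunking idea can be sketched as follows. This only mimics the concept of -d/-c partitioning on a mock gene list; make-group-file's actual internal partitioning may differ.

```shell
# Sketch of -d/-c style chunking: split a gene list into d chunks, take chunk c.
# Gene IDs are mock values; the modulo scheme is illustrative only.
printf 'ENSG01\nENSG02\nENSG03\nENSG04\nENSG05\n' > genes.txt
d=2
c=${SLURM_ARRAY_TASK_ID:-1}   # fall back to chunk 1 outside a job array
awk -v d="$d" -v c="$c" 'NR % d == c % d' genes.txt
```

With d=2, chunk 1 gets the odd-numbered lines and chunk 2 the even-numbered ones, so every gene is processed exactly once across the array.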
Create group file
version: v13 Last modified: 2020.Oct.09
Usage: make-group-file <parameters>
Variant filter options:
-C - config file (required, no default)
-i - input variant list, tab-separated, bgzipped and TABIX indexed (required, no default, each line has to have 5 fields: chr,pos,ID,ref,alt)
-g - comma separated list of GENCODE features (gene, exon, transcript, CDS or UTR)
-e - comma separated list of GTEx features (promoter, CTCF, enhancer, promoterFlank, openChrom, TF_bind or allreg)
-l - comma separated list of overlap features (promoter, CTCF, enhancer, promoterFlank, openChrom, TF_bind or allreg)
-x - extend genomic regions by this amount (bp) (default: 0)
-o - include variants with severe consequences only (more severe than missense)
Parameters to set up scores for variants:
-s - apply weighting; accepted values: CADD, EigenPhred
-t - the value with which the scores will be shifted (default value: if Eigen score weighting specified: 1, otherwise: 0)
-k - below the specified cutoff value, the variants will be excluded (default: 0)
Gene list and chunking:
-L - file with gene IDs (if not specified all genes will be analyzed).
-d - total number of chunks (default: 1).
-c - chunk number (default: 1). Takes precedence over SLURM_ARRAY_TASK_ID etc.
General options:
-w - output directory (required, no default)
Other options:
-h - print this message and exit
An example (chr21 genes) that can be run on a single machine with 10 threads:
for i in {1..10}; do
make-group-file \
-L chr21.genes \
-C config.txt \
-i Pomak.chr21.variantlist.gz \
-g exon -o -w $(pwd) -d 10 -c $i &
done
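The loop above backgrounds all ten chunks and returns immediately. A common pattern for waiting on them and catching failures is sketched below, with true/false standing in for the real make-group-file calls:

```shell
# Wait for backgrounded chunk jobs and count failures.
# 'true'/'false' are stand-ins for the real make-group-file invocations;
# here chunk 2 is made to fail on purpose.
pids=()
for i in 1 2 3; do
  if [ "$i" -eq 2 ]; then false & else true & fi
  pids+=($!)
done
failed=0
for pid in "${pids[@]}"; do
  wait "$pid" || failed=$((failed+1))
done
echo "$failed chunk(s) failed"
```

A nonzero count tells you which run to inspect before concatenating the group files.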
⚠️ Please note that make-group-file can be a bit slow even for small gene lists. This is because it has to read the entire list of exons and regulatory features into memory to compute variant overlaps.
📝 Note: The -L option accepts a one-column file listing genes as either ENSG IDs or gencode gene symbols (see the gencode.basic.annotation.tsv.gz file generated by prepare-regions during the "Turning the variant info file into a group file" step).
After the step above has completed, concatenate your group files as follows:
find . -name "group_file*.txt" -exec cat {} + > concat.group.file.txt
If you have used multiple selection methods for the same genes, you can postfix the gene names with the selection method (e.g. ENSGXXXXX.exon_severe). We also recommend removing genes with only a single filtered SNP at this stage:
cut -f1 concat.group.file.txt | sort | uniq -c | awk '$1==1{print $2}' > singlesnp.genes.txt
fgrep -wvf singlesnp.genes.txt concat.group.file.txt > concat.group.file.filtered.txt
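On a tiny mock group file, the single-SNP filter behaves like this (the mock is simplified to two columns; real group files carry more):

```shell
# Mock group file: GENE1 has two variants, GENE2 only one.
printf 'GENE1\t1:100\nGENE1\t1:200\nGENE2\t2:300\n' > mock.group.file.txt
# Same pipeline as above: find genes seen exactly once, then drop their lines.
cut -f1 mock.group.file.txt | sort | uniq -c | awk '$1==1{print $2}' > mock.single.genes.txt
fgrep -wvf mock.single.genes.txt mock.group.file.txt > mock.filtered.txt
cat mock.filtered.txt
```

Only the two GENE1 lines survive; GENE2, having a single variant, is removed.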
You will need a relatedness matrix to run analyses using this pipeline, and all analysed samples must be present in it. The script supports GEMMA and GCTA matrices; we recommend the latter. GCTA produces a set of files with the same prefix and the extensions .grm.gz and .grm.id.
At this point you should have the following files:
- GDS file
- group file containing testing groups
- relatedness matrix files
- phenotype file
Simply pass these as arguments to step2 as shown below:
usage: step2 [--] [--help] [--cohort-name COHORT-NAME] [--matrix-prefix
MATRIX-PREFIX] [--matrix-type MATRIX-TYPE] [--GDS GDS] [--pheno
PHENO] [--group-file GROUP-FILE] [--fam-file FAM-FILE]
[--threads THREADS] [--out OUT]
Wrapper for single-cohort SMMAT analysis
flags:
-h, --help show this help message and exit
optional arguments:
-c, --cohort-name Name of the cohort, will be used as output
prefix. Assigned at beginning of project, should
remain constant throughout.
-m, --matrix-prefix Prefix of matrix files for GCTA matrix
([prefix].grm.gz and [prefix].grm.id are
expected to exist), or matrix file for GEMMA.
-t, --matrix-type Format of the matrix. GCTA and GEMMA formats are
supported. [default: GCTA]
-y, --GDS GDS (genotype) file.
-p, --pheno Phenotype file. Two columns, whitespace or
comma-delimited, with header. The first column
is the sample "id" and is named as such. The
second is the phenotype name. Please avoid
exotic characters in the phenotype name.
-g, --group-file SMMAT group file as provided by the analysis
team.
-f, --fam-file [optional] fam file for GEMMA matrices.
--threads [optional] Number of parallel threads. [default:
1]
-o, --out [optional] Output file. If not set,
[cohort].[trait].out will be used.
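As with step1, a step2 run is typically wrapped in Singularity and submitted through a scheduler. The sketch below only assembles such a call: all file names are placeholders, the 8G memory figure is a guess to be adjusted to your data, and the flags are taken from the usage above.

```shell
# Assemble a step2 call from the documented flags; all paths are placeholders.
COHORT=mycohort
STEP2="step2 --cohort-name $COHORT --matrix-prefix $COHORT.gcta --matrix-type GCTA"
STEP2="$STEP2 --GDS $COHORT.gds --pheno $COHORT.pheno.txt"
STEP2="$STEP2 --group-file concat.group.file.filtered.txt --threads 3 --out test"
# Wrap it the same way as the step1 SLURM example; memory is a guess.
echo "sbatch --mem 8G -e step2.e -o step2.o --wrap \"singularity exec -B \$(pwd) burden.1.6 $STEP2\""
```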
This will produce several output files. If you ran step2 --out test --threads 3, the script will generate the following:
- test, a tab-delimited main results file
- test.score.{1..3} and test.var.{1..3}. These files are used for meta-analysis and contain scores for the variants included in the tests. If you are doing single-cohort analysis only, you can delete or archive them.
In the test results file above, we recommend using O.pval as the main p-value. You can consider E.pval as well; however, we have found it to be inflated in a number of cases.
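To pull out groups passing a significance threshold, an awk one-liner that locates the O.pval column by name is usually enough. The sketch below runs on a mock results file; the real step2 output has more columns, but the header-lookup approach does not depend on their order. The 5e-6 threshold is only an example, not a recommendation.

```shell
# Mock results file: tab-delimited with a header that includes O.pval.
printf 'group\tO.pval\tE.pval\nGENE1\t1e-7\t2e-7\nGENE2\t0.4\t0.5\n' > mock.out
# Find the O.pval column from the header, then print groups below the cutoff.
awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) if ($i == "O.pval") col=i; next}
            $col < 5e-6 {print $1}' mock.out
```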