
Single cohort analysis using SMMAT


First of all, [get your data ready](https://github.com/hmgu-itg/burden_testing/wiki/Data-input), if you haven't already done so.

Step 1: Creating per-cohort variant files

This step extracts all positions and alleles present in the cohort of interest. This informs the variant selector of the available pool of variants. It is achieved using the step1 script.

Syntax

step1 [cohort_name] [vcf_file]

   [cohort_name] : name of the cohort
   [vcf_file]    : tabixed, gzipped VCF file. This can be a single-chromosome file or a file covering all autosomes (but nothing else).

If you are using Singularity, you have to wrap the call in a singularity exec invocation, like so:

singularity exec -B $(pwd) burden.1.6 step1 $COHORT_NAME $VCF_PATH

If you are using a job scheduler like SLURM, the call becomes:

sbatch --mem 2G -e step1.e -o step1.o --wrap "singularity exec -B $(pwd) burden.1.6 step1 $COHORT_NAME $VCF_PATH"

This task is not memory-intensive, so a small memory request should be sufficient.

The output files will have the standardized naming [cohort_name][.chrom_name].variantlist.gz where chrom_name is automatically determined by the script.
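As a quick sanity check, you can list the output and peek at the first few records (the cohort name and chromosome suffix below are placeholders):

ls -lh $COHORT_NAME.*.variantlist.gz
zcat $COHORT_NAME.chr1.variantlist.gz | head -n 3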

Post-processing and variant filtering

If you produced per-chromosome files, you will need to concatenate them:

for i in {1..22}; do zcat $cohort_name.chr$i.variantlist.gz; done | sed 's/^chr//' | gzip > $cohort_name.all.variantlist.gz

At this stage, you can filter your variant list to select the variants you are interested in. For convenience, we provide a script to filter on missingness and minor allele frequency:

single_cohort_munge_variantlist [variantlist] [max_missingness] [max_af]

⚠️ If you do not want to filter, or if you apply custom filters (such as indel removal) yourself, e.g. in R, please still run single_cohort_munge_variantlist [variantlist] 1 1 afterwards to produce a properly formatted, tabixed file.
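For illustration, a minimal shell sketch of such a custom filter, removing indels before the mandatory munge call. The file names and the ref/alt column positions (4 and 5) are assumptions; adjust them to your variant list's actual layout:

# keep only rows whose ref and alt alleles are single bases, i.e. drop indels
zcat $cohort_name.all.variantlist.gz \
  | awk -F'\t' 'length($4)==1 && length($5)==1' \
  | gzip > $cohort_name.noindel.variantlist.gz
# re-run the munge script with no extra filtering to format and tabix the result
single_cohort_munge_variantlist $cohort_name.noindel.variantlist.gz 1 1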

Turning the variant info file into a group file

For this, you will first need the eQTL, gene region and regulatory region information. This is downloaded from online sources by the prepare-regions script. To get a list of all available options to the prepare-regions command, run

prepare-regions -h
Usage: ./prepare-regions -o <output directory> : required, output directory
          -n : optional, do not download Eigen scores
          -c : optional, do not download CADD scores
          -x : optional, create backup of the downloaded data
          -r : optional, re-use previous downloads if present
          -d : optional, just download data and exit, do not process them
          -s : optional, do not perform checksums (unsafe)
          -t : optional, directory to store temporary data, default: "/tmp"
          -e <ftp://ftp.ensembl.org> : optional, Ensembl FTP server

prepare-regions performs two implicit steps:

  • download all the necessary data
  • merge it

Normally, both steps should be run in one go:

prepare-regions -o $(pwd)/geneset_data

However, if you want to run the downloading and merging steps separately, you can just download data first:

prepare-regions -o $(pwd)/geneset_data -d -r

(note that you can restart this step in case of Internet connectivity issues)

and then do the merging:

prepare-regions -o $(pwd)/geneset_data -r

The downloading and merging process will take some time and will download a sizeable amount of data (around 200 GB). The above command also generates a config.txt file in the geneset_data directory, which contains the locations of the files needed for the analyses. If files in the output directory (geneset_data) are later moved to a new location, the corresponding entries in config.txt must be updated manually.
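For example, if you move the whole geneset_data directory, a search-and-replace on the old path prefix is usually enough; this assumes the config.txt entries store absolute paths (both prefixes below are placeholders):

# rewrite the old location prefix to the new one in every config.txt entry
sed -i 's|/old/path/geneset_data|/new/path/geneset_data|g' /new/path/geneset_data/config.txt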

With this data available, you can cross-reference it with your variants to see which ones fall within which regions of interest, and build sets from that.

Building sets using the variant selector

Your variant data and the gene overlap data can now be combined to produce set annotations with weights. The variant annotator that does this allows a lot of flexibility in how you define your test sets. We have found these four variant selection methods to give good results:

| name | description | arguments |
|------|-------------|-----------|
| exon severe | variants with a "high" predicted consequence according to Ensembl (roughly equivalent to more severe than missense) | -g exon -o |
| exon CADD | all exonic variants (+50 bp outside of exons), weighted by CADD scores | -g exon -x 50 -s CADD |
| exon regulatory | same as above, but weighted by Eigen scores (phred-scaled); variants in regulatory regions that overlap with eQTLs for that gene are also included | -g exon -x 50 -e promoter,enhancer,TF_bind -l promoter,enhancer,TF_bind -s EigenPhred |
| regulatory only | same as above, but excluding exonic variants | -e promoter,enhancer,TF_bind -l promoter,enhancer,TF_bind -s EigenPhred |

The full command line arguments of the variant selector are described below. To make parallelisation easier, we expose the -d argument, which splits all GENCODE genes into a fixed number of chunks. The number of the chunk being computed is specified with -c, or by setting the scheduler environment variables $LSB_JOBINDEX or $SLURM_ARRAY_TASK_ID for LSF and SLURM, respectively.

Create group file
version: v13 Last modified: 2020.Oct.09

Usage: make-group-file <parameters>

Variant filter options:
     -C  - config file (required, no default)
     -i  - input variant list, tab-separated, bgzipped and TABIX indexed (required, no default, each line has to have 5 fields: chr,pos,ID,ref,alt)
     -g  - comma separated list of GENCODE features (gene, exon, transcript, CDS or UTR)
     -e  - comma separated list of GTEx features (promoter, CTCF, enhancer, promoterFlank, openChrom, TF_bind or allreg)
     -l  - comma separated list of overlap features (promoter, CTCF, enhancer, promoterFlank, openChrom, TF_bind or allreg)
     -x  - extend genomic regions by this amount (bp) (default: 0)
     -o  - include variants with severe consequences only (more severe than missense)

Parameters to set up scores for variants:
     -s  - apply weighting; accepted values: CADD, EigenPhred
     -t  - the value with which the scores will be shifted (default value: if Eigen score weighting specified: 1, otherwise: 0)
     -k  - below the specified cutoff value, the variants will be excluded (default: 0)

Gene list and chunking:
     -L  - file with gene IDs (if not specified all genes will be analyzed).
     -d  - total number of chunks (default: 1).
     -c  - chunk number (default: 1). Takes precedence over SLURM_ARRAY_TASK_ID etc.

General options:
     -w  - output directory (required, no default)

Other options:
     -h  - print this message and exit

Example

An example (chr21 genes) that can be run on a single machine with 10 threads:

for i in {1..10}; do
  make-group-file \
    -L chr21.genes \
    -C config.txt \
    -i Pomak.chr21.variantlist.gz \
    -g exon -o -w $(pwd) -d 10 -c $i &
done
wait # block until all 10 background chunks have finished
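On SLURM, the same chunking can instead be driven by a job array; a sketch, assuming 50 chunks, the Singularity image used in earlier steps, and a guessed memory requirement:

# one array task per chunk; -c is omitted, so make-group-file picks the chunk
# number up from SLURM_ARRAY_TASK_ID
sbatch --array=1-50 --mem 4G -e mgf.%a.e -o mgf.%a.o --wrap \
  "singularity exec -B $(pwd) burden.1.6 make-group-file \
     -C config.txt -i $COHORT_NAME.all.variantlist.gz \
     -g exon -o -w $(pwd) -d 50"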

⚠️ Please note that make-group-file can be a bit slow even for small gene lists. This is because it has to read the entire list of exons and regulatory features into memory to compute variant overlaps.
📝 Note: The -L option accepts a single-column file listing genes as either Ensembl gene IDs (ENSG…) or GENCODE gene symbols (see the gencode.basic.annotation.tsv.gz file generated by prepare-regions in the Turning the variant info file into a group file step above).

Concatenating group files

After the step above has completed, concatenate your group files as follows:

find . -name "group_file*.txt" -exec cat {} + > concat.group.file.txt

If you have used multiple selection methods for the same genes, you can postfix the gene names with the selection method (e.g. ENSGXXXXX.exon_severe); a sketch of one way to do this follows the commands below. We also recommend removing genes with only a single filtered SNP at this stage:

cut -f1 concat.group.file.txt | sort | uniq -c | awk '$1==1{print $2}' > singlesnp.genes.txt
fgrep -wvf singlesnp.genes.txt concat.group.file.txt > concat.group.file.filtered.txt
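As for the gene-name postfixing mentioned above, a minimal sketch, assuming one concatenated group file per selection method and the gene name in the first tab-separated column (file and method names are placeholders). If you use it, run it before the single-SNP filter above:

for method in exon_severe exon_CADD; do
  # append ".method" to the gene name in column 1 of each per-method file
  awk -v m=$method 'BEGIN{FS=OFS="\t"} {$1=$1"."m; print}' group_file.$method.txt
done > concat.group.file.txt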

Generating the matrix

You will need a relatedness matrix to run analyses using this pipeline. All analysed samples must be present in the matrix. The script supports GEMMA and GCTA matrices; we recommend the latter. GCTA produces a set of files with the same prefix and the extensions .grm.gz and .grm.id.
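If you still need to build the matrix, a sketch using GCTA from a PLINK binary fileset (the bfile prefix mycohort is a placeholder):

# --make-grm-gz writes the gzipped text GRM and its ID file, i.e. the
# mycohort.grm.gz / mycohort.grm.id pair step2 expects with --matrix-type GCTA
gcta64 --bfile mycohort --make-grm-gz --out mycohort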

Step 2: Running single-cohort SMMAT

At this point you should have the following files:

  • GDS file
  • group file containing testing groups
  • relatedness matrix files
  • phenotype file

Simply pass these as arguments to step2, as described below:

usage: step2 [--] [--help] [--cohort-name COHORT-NAME] [--matrix-prefix
       MATRIX-PREFIX] [--matrix-type MATRIX-TYPE] [--GDS GDS] [--pheno
       PHENO] [--group-file GROUP-FILE] [--fam-file FAM-FILE]
       [--threads THREADS] [--out OUT]

Wrapper for single-cohort SMMAT analysis

flags:
  -h, --help           show this help message and exit

optional arguments:
  -c, --cohort-name    Name of the cohort, will be used as output
                       prefix. Assigned at beginning of project, should
                       remain constant throughout.
  -m, --matrix-prefix  Prefix of matrix files for GCTA matrix
                       ([prefix].grm.gz and [prefix].grm.id are
                       expected to exist), or matrix file for GEMMA.
  -t, --matrix-type    Format of the matrix. GCTA and GEMMA formats are
                       supported. [default: GCTA]
  -y, --GDS            GDS (genotype) file.
  -p, --pheno          Phenotype file. Two columns, whitespace or
                       comma-delimited, with header. The first column
                       is the sample "id" and is named as such. The
                       second is the phenotype name. Please avoid
                       exotic characters in the phenotype name.
  -g, --group-file     SMMAT group file as provided by the analysis
                       team.
  -f, --fam-file       [optional] fam file for GEMMA matrices.
  --threads            [optional] Number of parallel threads. [default:
                       1]
  -o, --out            [optional] Output file. If not set,
                       [cohort].[trait].out will be used.
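Putting it together, a sketch of a full call, wrapped in Singularity as in the earlier steps (all file names are placeholders):

singularity exec -B $(pwd) burden.1.6 step2 \
  --cohort-name $COHORT_NAME \
  --matrix-prefix mycohort --matrix-type GCTA \
  --GDS mycohort.gds \
  --pheno trait.csv \
  --group-file concat.group.file.filtered.txt \
  --threads 3 --out test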

This will produce several output files. If you ran step2 --out test --threads 3, the script will generate the following:

  • test, a tab-delimited main results file
  • test.score.{1..3} and test.var.{1..3}. These files are used for meta-analysis and contain scores for variants included in the tests. If you are doing single-cohort analysis only, you can delete or archive these files.

In the test result file above, we recommend using O.pval as the main p-value. You can consider E.pval as well; however, we have found it to be inflated in a number of cases.
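For a quick look at the top hits, you can sort the main results file on the O.pval column; a sketch, assuming the column name appears verbatim in the header of the tab-delimited file test:

# locate the O.pval column by name, then print the 10 smallest p-values
col=$(head -n 1 test | tr '\t' '\n' | grep -nxF 'O.pval' | cut -d: -f1)
head -n 1 test
tail -n +2 test | sort -t$'\t' -g -k${col},${col} | head -n 10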