
Data input

Phenotypes

Phenotypes must be provided as two-column, tab-separated files with a header line, one phenotype per file. The first column is the sample ID (its header is always id); the second column holds the phenotype values, and its header is the phenotype name. To make parsing easier later on, avoid spaces and colons in phenotype names. For example:

id         height
SAMPLE001  0.593
SAMPLE002  -0.135

Note that the file above contains standardised values. We advise you to regress any relevant covariates directly out of your phenotype and to standardise and normalise the residuals, since adding covariates in the analysis is not currently supported. If you would like this feature added, please raise a GitHub issue or contact the developers.
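As an illustration, here is a minimal sketch of that preparation, assuming R is available; the file and covariate names (height_raw.tsv with columns id and height, covars.tsv with columns id, age and sex) are hypothetical, and the rank-based inverse-normal transform shown is one standard choice, not a prescribed one:

Rscript -e '
  ph <- read.table("height_raw.tsv", header = TRUE, sep = "\t")  ## columns: id, height
  cv <- read.table("covars.tsv", header = TRUE, sep = "\t")      ## columns: id, age, sex (hypothetical)
  d <- merge(ph, cv, by = "id")
  res <- residuals(lm(height ~ age + sex, data = d))             ## regress covariates out
  d$height <- qnorm((rank(res) - 0.5) / length(res))             ## rank-based inverse-normal transform
  write.table(d[, c("id", "height")], "height.tsv", sep = "\t", quote = FALSE, row.names = FALSE)
'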

Genotypes

You will need per-chromosome or genome-wide VCF files; for parallelism, we advise per-chromosome files. The following VCF INFO fields should be present:

  • AC : allele count
  • AN : allele number
  • AF : allele frequency
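If any of these tags are missing, they can usually be recomputed with the bcftools fill-tags plugin (file names below are placeholders):

bcftools +fill-tags input.vcf.gz -Oz -o input.tagged.vcf.gz -- -t AC,AN,AF
bcftools index -t input.tagged.vcf.gz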

Note: Since this pipeline is designed for sequencing data, the only FORMAT field that will be used is GT. Dosages are not supported. Please contact the developers if you would like this feature added.

If you plan to run a LoF analysis, the following fields produced by Loftee annotation must be added to the VCF:

  • LoF_conf: LoF confidence assigned by Loftee. Takes the value LC, HC or -, for low-confidence, high-confidence or not LoF, respectively.
  • LoF_comment: Comment produced by Loftee. Can take any value assigned by Loftee, such as NON_CAN_SPLICE or NON_CAN_SPLICE_SURR. This field is not parsed by the pipeline.

Please contact the developers if you would like help or information on how to add these fields (you have to run Loftee yourself, since the relevant Ensembl REST API plugin has not been functional for years).
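Once annotated, a quick sanity check that both fields made it into the VCF header (the file name is just an example):

bcftools view -h annotated.chr1.vcf.gz | grep -E 'LoF_conf|LoF_comment'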

Producing GDS files

The above VCFs are used by the variant parser and are required no matter which pipeline you want to run. If you want to run a SMMAT pipeline (which we recommend), you will need to convert your VCFs to GDS, since SMMAT does not support any other format.

This is done using the VCF2GDS script, which takes 3 arguments:

VCF2GDS [input_vcf] [output_gds] [thread_number]

The thread number is optional and defaults to 1 if absent (we advise using at least 5 threads to speed things up). We have found that per-chromosome conversion combined with thread-based parallelism completes in a reasonable time, for example:

for i in {1..22}; do
  singularity exec -B $(pwd) burden_latest VCF2GDS chr$i.vcf.gz $cohort_name.chr$i.gds 10 ## submit this to your job scheduler
done
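As the comment above suggests, each conversion can be handed to your job scheduler. A sketch for SLURM, with illustrative resource requests (bsub or qsub equivalents work just as well):

for i in {1..22}; do
  ## resource requests are placeholders; match them to your cluster and to the thread count you pass to VCF2GDS
  sbatch --cpus-per-task=10 --mem=8G --wrap \
    "singularity exec -B $(pwd) burden_latest VCF2GDS chr$i.vcf.gz $cohort_name.chr$i.gds 10"
done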

It is good practice to prefix the name of the output GDS with the name of your cohort or analysis. We strip most of the fields from your VCFs, so the conversion shouldn't take long: about 30 minutes for ~2,000 samples using the parallelisation above. The -B $(pwd) option binds the current directory into the container; as always with containers, local directories are not available inside unless you bind them.
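For instance, if your VCFs live outside the working directory, bind that path too; the -B flag can be repeated (paths here are placeholders):

singularity exec -B $(pwd) -B /path/to/vcfs burden_latest VCF2GDS /path/to/vcfs/chr1.vcf.gz cohort.chr1.gds 10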

If you plan on spending some time prototyping or experimenting, you can also open an interactive shell:

singularity shell -B $(pwd) -s /bin/bash burden_latest

Then you will be inside the container and can run VCF2GDS directly.
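For example (file names are placeholders):

VCF2GDS chr1.vcf.gz test.chr1.gds 5  ## run from inside the container shell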