Nextflow pipeline to prepare your data for Genotype Imputation
Before running this pipeline, make sure you have the following dependencies installed:
curl -s https://get.nextflow.io | bash
- Docker
Build docker image before run the pipeline:
docker build -t genepi/pre-imputation-qc . # don't ingore the dot here
Test the pipeline and the created docker image with test-data:
nextflow run main.nf
Create a config file (e.g. my-project.config
in your project folder) and set the paths to your data:
params.project = "my-project"
params.input = "data/*/*.{map,ped}"
params.output = "output"
params.chip = "GSAMD-24v3-0-EA_20034606_A1.b37"
params.strand_file = "data/${params.chip}.strand"
params.refalt_file = "data/${params.chip}.RefAlt"
params.chunkSize= 20000000
params.minSampleCallRate = 0.5
params.minSnpCallRate = 0.9
Execute it the pipeline with you project specific configuration from your project folder:
nextflow run main.nf -c my-project.config
A html report is created in output
and the vcf.gz
are ready to submit them to the Michigan Imputation Server.
Here's a table of the parameters along with their descriptions and default values:
Parameter | Description | Default Value | Required |
---|---|---|---|
project | Name of the project | null | Yes |
input | Directory containing input PLINK files | null | Yes |
output | Output directory for pipeline results | "output/genotyped" | Yes |
chip | Type of genotyping chip used | null | Yes |
build | Reference genome build (e.g., hg19, hg38) | null | Yes |
strand_file | Path to the strand file | null | Yes |
refalt_file | Path to the reference/alternate allele file | null | Yes |
chunkSize | Chunk size for processing data | 20000000 | No |
minSampleCallRate | Minimum sample call rate threshold | 0.5 | No |
minSnpCallRate | Minimum SNP call rate threshold | 0.9 | No |
maf | Minor allele frequency threshold | 0 | No |
hwe | Hardy-Weinberg equilibrium p-value threshold | 1E-6 | No |
cleanSampleIds | Clean sample IDs (true/false) | false | No |
excludeSamples | File containing samples to exclude | null | No |
useDoubleId | Use double IDs for samples (true/false) | true | No |
These parameters allow customization of the pipeline's behavior and thresholds for various data processing steps. You can adjust these values as needed to fit your specific project requirements.
-
Clean Sample IDs: (Optional) Clean sample IDs in input files if specified.
-
Exclude Samples: (Optional) Exclude specified samples from the analysis.
-
Filter and Fix Strand Flips: Filter and fix strand flips using provided strand and ref/alt allele files.
-
Merge VCF Files: Merge filtered VCF files.
-
Filter Merged VCF: Further filter the merged VCF file.
-
Create Final PLINK: Generate the final PLINK files.
-
Split Into Chromosomes: Split the final PLINK files into separate chromosomes.
-
Create Report: Generate a report summarizing the pipeline's execution.
- Processed PLINK files for each chromosome.
- Various reports and statistics files.
Lukas Forer (@lukfor), Institute of Genetic Epidemiology, Medical University of Innsbruck
This pipeline is distributed under the MIT License.