Preconfigured pipeline for converting Illumina reads into VCF for Apis mellifera.
The input should be demultiplexed.
Reads for each sample should be non-interleaved (i.e. separate r1 and r2 files).
- Check pairing
- Strict barcode check
- Filter contaminants
- Trim adaptors
- Map against reference genome
- Call SNPs with freebayes (each contig run separately in parallel)
- Genotyping stats
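The per-contig split in the SNP calling step can be illustrated with a short sketch. This is not the pipeline's own code: it assumes a samtools faidx index (.fai) for the reference, in which the first tab-separated column is the contig name and the second is its length. The helper name contig_regions and the contig names are hypothetical.

```python
import csv
import io


def contig_regions(fai_handle):
    """Return (contig, length) for each contig listed in a .fai index."""
    reader = csv.reader(fai_handle, delimiter="\t")
    # .fai columns: name, length, offset, linebases, linewidth
    return [(row[0], int(row[1])) for row in reader if row]


# Toy index with two made-up contigs (names and lengths are illustrative only)
example_fai = io.StringIO(
    "contig_a\t1000\t10\t60\t61\n"
    "contig_b\t500\t1040\t60\t61\n"
)
regions = contig_regions(example_fai)
for contig, length in regions:
    # Each contig name could be given to its own freebayes run via --region
    print(contig, length)
```

Splitting by contig keeps each job independent, so a workflow manager can schedule the jobs in parallel and combine the per-contig results afterwards.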
In another pipeline (coming soon):
- Make a set of haplotypes from the haploid individuals (drones)
- Use whatshap to set this as a haplotype and phase the pools
Use the singularity container from the Releases tab or the Docker container from GHCR. The container provides:
bbmap 38.90
bwa 0.7.17-r1188
freebayes 1.3.2
python 3.8.5
R 3.6.3 with data.table 1.12.8 and ggplot2 3.3.0
samtools 1.10 and bcftools 1.10.2 using htslib 1.10.2-3
vcflib 1.0.1
vcftools 0.1.16
If you have the above dependencies installed, you can install the pipeline with pip3:
pip3 install \
git+git://github.com/tomharrop/honeybee-genotype-pipeline.git
- threads: Number of threads to use. Intermediate files are pipes, so at least 4 threads are required.
- restart_times: Number of times to restart failing jobs.
- ref: Reference genome (uncompressed fasta).
- samples_csv: a csv file with the following columns:
  - sample: sample name (will be propagated to output files and VCF);
  - barcode: sample barcode, will be checked with 0 allowed mismatches;
  - r1_path: path to R1 file for this sample;
  - r2_path: path to R2 file for this sample;
  - metadata (optional): currently not used.
- outdir: Output directory.
- cnv_map: Read in a whitespace-delimited file of sample names and ploidy, e.g. for genotyping drones and pools in a single run. See freebayes --help for more info.
- ploidy: Ploidy for freebayes, e.g. 1 for haploid, 2 for diploid.
- csd: Do a separate freebayes run to pick up all alleles at the csd locus (i.e. --region NC_037640.1:11771679-11781139).
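As an illustration of the samples_csv layout described above, the following sketch parses a small csv and checks that the four required columns are present. It is not part of the pipeline. The function name read_samples, the sample names, the barcodes and the file paths are all hypothetical.

```python
import csv
import io

# Columns listed as required in the parameter description; metadata is optional
REQUIRED_COLUMNS = ["sample", "barcode", "r1_path", "r2_path"]


def read_samples(handle):
    """Parse a samples csv and check that the required columns are present."""
    reader = csv.DictReader(handle)
    fieldnames = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in fieldnames]
    if missing:
        raise ValueError(f"samples csv is missing columns: {missing}")
    return list(reader)


# Made-up example with one drone and one pool (all values are placeholders)
example_csv = io.StringIO(
    "sample,barcode,r1_path,r2_path,metadata\n"
    "drone_1,ACGTACGT,data/drone_1_r1.fastq,data/drone_1_r2.fastq,\n"
    "pool_1,TGCATGCA,data/pool_1_r1.fastq,data/pool_1_r2.fastq,\n"
)
samples = read_samples(example_csv)
print([row["sample"] for row in samples])
```

Running a check like this before starting the pipeline may catch a misnamed column early, rather than partway through a run.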
honeybee-genotype-pipeline [-h] [-n] [--threads int]
                           [--restart_times RESTART_TIMES] --ref REF
                           --samples_csv SAMPLES_CSV --outdir OUTDIR
                           [--cnv_map CNV_MAP | --ploidy PLOIDY]
                           [--csd]

optional arguments:
  -h, --help            show this help message and exit
  -n                    Dry run
  --threads int         Number of threads. Default: 4
  --restart_times RESTART_TIMES
                        number of times to restart failing jobs (default 0)
  --ref REF             Reference genome in uncompressed fasta
  --samples_csv SAMPLES_CSV
                        Sample csv (see README)
  --outdir OUTDIR       Output directory
  --cnv_map CNV_MAP     Read a copy number map from the BED file FILE
  --ploidy PLOIDY       Ploidy for freebayes (e.g. 1 for haploid, 2 for
                        diploid)
  --csd                 Do a separate freebayes run to genotype the csd locus
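A file for the --cnv_map option could be generated along these lines. This is only a sketch, assuming the sample-level format of one sample name and one ploidy value per line, separated by whitespace. The function name format_cnv_map, the sample names and the pool ploidy of 20 are hypothetical placeholders, not values taken from the pipeline.

```python
def format_cnv_map(ploidy_by_sample):
    """Return one 'sample<TAB>ploidy' line per sample."""
    lines = [f"{sample}\t{ploidy}" for sample, ploidy in ploidy_by_sample.items()]
    return "\n".join(lines) + "\n"


# Hypothetical samples: a haploid drone and a pool with an assumed ploidy of 20
ploidy_by_sample = {"drone_1": 1, "pool_1": 20}
cnv_map_text = format_cnv_map(ploidy_by_sample)
print(cnv_map_text, end="")
```

The sample names in this file would need to match the sample column of the samples csv, since those names are propagated to the VCF.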
n.b. freebayes doesn't print in the snakemake rulegraph, because it comes after a checkpoint rule. The input is from markdup and generate_regions.