-
Notifications
You must be signed in to change notification settings - Fork 25
Workflow Examples
- Quickstart
- Graph construction
- GFA output
- Merge graphs
- Kmer distance matrix
- Kmer error cleaning
- Read threading
- Link error cleaning
- Merge links
- Calling variants
- Assemble contigs
- Filter subgraph
- Filter reads
- Plot graph
This is the fastest way to use McCortex to go from sequence files to VCFs. It runs with the most common settings and you only need to run two commands. You need to create a text file listing your samples and their input sequences files (any mix of SAM/BAM/FASTA/FASTQ or gzipped FASTA/FASTQ is ok). Then give this file to scripts/make-pipeline.pl
, along with a reference if you have one. Read more about the McCortex Pipeline. Otherwise the sections below will walk you through the individual steps to use McCortex.
Construct graphs for three samples (using 70GB ram):
mccortex31 build -m 70G -k 31 --sample NA12878 --seq input.fq.gz NA12878.ctx
mccortex31 build -m 70G -k 31 --sample Mickey --seq2 reads.1.fq.gz:reads.2.fq.gz Mickey.ctx
mccortex31 build -m 70G -k 31 --sample Minnie --seq data.bam Minnie.ctx
Construct graph for the reference (hg19)
mccortex31 build -m 70G -k 31 --sample hg19 --seq hg19.fa.gz hg19.ctx
Generate a GFA representation of the unitig graph with:
mccortex31 unitigs --gfa graph.ctx > graph.gfa
Merge multiple graphs together. Graphs are kept as separate colours unless the --flatten
option is used. To combine three graphs into refAndSamples.ctx:
mccortex31 join --out refAndSamples.ctx NA12878.ctx Mickey.ctx Minnie.ctx
To generate a matrix of how many kmers are shared with between any pair of colours, use the dist
command:
mccortex31 dist -m 50G --out dist.tsv NA12878.ctx Mickey.ctx Minnie.ctx
Generated file dist.tsv
is a tab separated matrix like so:
. col0 col1
col0 2874186023 2107456247
col1 0 2796104371
There are two options here depending on sequence coverage.
A) 'Clean' graphs to remove sequencing error (per sample, for high coverage samples):
mccortex31 clean -m 60G --out NA12878.clean.ctx NA12878.ctx
mccortex31 clean -m 60G --out Mickey.clean.ctx Mickey.ctx
mccortex31 clean -m 60G --out Minnie.clean.ctx Minnie.ctx
...then merge graphs into file refAndSamples.ctx
. Uses 80GB ram:
mccortex31 join -m 80G --out refAndSamples.clean.ctx hg19.ctx NA12878.clean.ctx Mickey.clean.ctx Minnie.clean.ctx
B) Alternatively merge uncleaned samples then clean on the population (multiple low depth samples). Uses 100GB ram:
mccortex31 clean -m 100G --out samples.clean.ctx NA12878.ctx Mickey.ctx Minnie.ctx
mccortex31 join -m 80G --out refAndSamples.ctx hg19.ctx samples.clean.ctx
Now we have cortex graphs of the reference and our samples with sequencing error removed.
C) Another way to achieve pooled cleaning is to create a graph of kmers you want to keep, and intersect every sample with it. Note 0:
loads each sample into colour 0
:
mccortex31 clean -m 100G --out keep.kmers.ctx 0:NA12878.ctx 0:Mickey.ctx 0:Minnie.ctx
mccortex31 join -m 100G --out pop.clean.ctx --intersect keep.kmers.ctx NA12878.ctx Mickey.ctx Minnie.ctx
The command mccortex31 clean
has the option of writing out histograms of supernode lengths and coverages with the --covg-before
, --covg-after
, --len-before
and --len-after
arguments. You can plot the results of these using a pair of scripts bundled with McCortex in the scripts/
directory (requires R
and R packages ggplot2
and gridExtra
):
scripts/R/plot-covg-hist.R covg.before_cleaning.csv covg.before_cleaning.pdf
scripts/R/plot-covg-hist.R covg.after_cleaning.csv covg.after_cleaning.pdf
scripts/R/plot-length-hist.R length.before_cleaning.csv length.before_cleaning.pdf
scripts/R/plot-length-hist.R length.after_cleaning.csv length.after_cleaning.pdf
Read threading aligns reads against the graph and generates graph `links' which store long range connectivity information. This is per-sample information (each sample has its own set of links).
Before threading reads through the graph, you must run inferedges
:
mccortex31 inferedges -m 50G in.ctx
This edits in.ctx
to add missing edges between kmers. Later steps assume all edges between adjacent kmers already exist.
Before threading reads, paired end reads that overlap should be merged. There are many good read merging programs that can be used, read more on the read threading page.
Thread reads through the graph to assist in assembly, generating file refAndSamples.ctp.gz
:
mccortex31 thread -m 80G --seq2 NA12878.1.fq:NA12878.2.fq --out NA12878.ctp.gz refAndSamples.ctx:1
mccortex31 thread -m 80G --seqi Mickey.bam --out Mickey.ctp.gz refAndSamples.ctx:2
mccortex31 thread -m 80G --seqi Minnie.sam --out Minnie.ctp.gz refAndSamples.ctx:3
Now the link files NA12878.ctp.gz
, Mickey.ctp.gz
and Minnie.ctp.gz
are ready to be used alongside their respective graphs, e.g.:
mccortex31 contigs -m 100G -p 1:NA12878.ctp.gz -p 2:Mickey.ctp.gz -p 3:Minnie.ctp.gz refAndSamples.ctx
The reference is colour 0 in the above example. Read more on graph links.
Once a set of links have been generated, we need to remove links due to sequencing error. This is done in two steps: first we use the first 1,000 kmers (with links) to pick a cleaning threshold, then we use the cleaning threshold to create a cleaned set of links.
mccortex31 links -T link.stats.txt -L 1000 graph.raw.ctp.gz
LINK_THRESH=`grep 'suggested_cutoff=' link.stats.txt | grep -oE '[0-9,]+$'`
mccortex31 links --clean $LINK_THRESH --out graph.clean.ctp.gz graph.raw.ctp.gz
Cleaning links also removes redundant links. Combined with removing rare links, this reduces the memory required to store graph connectivity.
Similar to the join
command, pjoin
merges link files.
mccortex31 pjoin -m 30G -o refAndNA12878.ctp.gz ref.ctp.gz NA12878.ctp.gz
There are two algorithms for identifying putative variants in a McCortex graph -- the bubble and breakpoint callers. Both work with one or more graph colours (i.e. multi-sample). The bubble caller optionally takes a reference colour, the breakpoint caller requires a reference.
Calling variants is a four step process:
- make graph calls with
breakpoints
orbubbles
command - convert calls to VCF with
calls2vcf
command - sort VCFs, merge and remove duplicates, compress and index etc.
- add sample coverage with
vcfcov
command - genotype samples with
vcfgeno
command
mccortex31 bubbles -m 100G --out denovo.bubbles.gz --haploid 0 -p refAndSamples.ctp.gz refAndSamples.ctx
Uses the graph and links (refAndSamples.ctx
, refAndSamples.ctp.gz
) to call variants that are saved to the file denovo.bubbles.gz
. We also specify that the first colour is a haploid sample (with --haploid 0
). This is useful in removing bubbles caused by repeats rather than real variants.
Next we extract the 5' flanks from the bubble file:
./scripts/cortex_print_flanks.sh denovo.bubbles.gz > denovo.5pflank.fa
Creates the file denovo.5pflank.fa
. We then map the putative variant flanks to our reference (in this case hg19). We can do this with stampy:
stampy.py -G hg19 hg19.fa # create hg19.stidx
stampy.py -g hg19 -H hg19 # create hg19.sthash
stampy.py -g hg19 -h hg19 --inputformat=fasta -M denovo.5pflank.fa > denovo.5pflank.sam
or BWA:
bwa index hg19.fa
bwa mem hg19.fa denovo.5pflank.fa > denovo.5pflank.sam
Then we align the alleles to the reference and generate a VCF
mccortex31 calls2vcf -F denovo.5pflank.sam -o denovo.vcf bubbles.txt.gz hg19.fa
This generates a VCF of sites without genotypes and will contain duplicates. See the VCF post processing section to sort, remove duplicates, index and clean up your VCF.
If you have an assembled genome, you can compare your sequenced samples to it to identify large variants and re-arrangements:
mccortex31 breakpoints -m 100G --seq hg19.fa --out breakpoints.txt.gz -p PlutoAndGoofy.ctp.gz PlutoAndGoofy.ctx
The breakpoints command identifies where your samples diverge from the reference and rejoin it elsewhere. This could be small regions of clustered SNPs, indels, inversions or translocations.
We can generate a VCF of small events using calls2vcf
:
mccortex31 calls2vcf -o breakpoints.vcf breakpoints.txt.gz hg19.fa
This generates a VCF of sites without genotypes and will contain duplicates. See the VCF post processing section to sort, remove duplicates, index and clean up your VCF.
Results can be plotted on a Circos plot using a perl script in scripts/
:
gzip -dc breakpoints.txt.gz | scripts/make-circos.pl out-dir -
cd out-dir
circos
Plot appears in out-dir/circos.png
.
Once you have called variants with the Bubble Caller or the Breakpoint Caller, you are left with an unsorted VCF calls.raw.vcf
. Our proposed post-processing sorts, removes duplicates, renames variants, compresses and indexes the VCF:
./scripts/bash/vcf-sort calls.raw.vcf > calls.sorted.vcf
bcftools norm --remove-duplicates --fasta-ref ref.fa --multiallelics -any $< | \
./scripts/bash/vcf-rename > calls.norm.vcf
bgzip calls.norm.vcf
bcftools index calls.norm.vcf.gz
Next see VCF coverage and genotyping.
Given a sorted VCF, a reference and a cortex graph, we can annotate the VCF with the coverage for each sample in the graph:
mccortex31 vcfcov -m 60G --out out.vcf.gz --ref ref.fa in.vcf graph.ctx
Note: the VCF must be sorted and the reference must contain all the chromosomes in the VCF. The order of the chromosomes in the VCF and reference do not matter and do not need to be the same.
For each sample in the graph, we add a sample to the VCF with the same name. For each sample we add two fields:
-
K31R
mean coverage of kmers unique to the reference -
K31A
mean coverage of kmers unique to the alternative allele
vcfcov
works with biallelic and multiallelic VCFs. The VCF must be sorted and will be printed in the order it was read in. With no --out
argument vcfcov
prints to STDOUT.
Coverage is calculated by assembling all local haplotypes and finding which kmers uniquely support an alternative allele. We then report the number and mean kmer coverage on each reference and alternative allele.
See full page on VCF Genotyping.
After adding sample coverage for variants using the vcfcov
-- one sample at a time or all at once -- you can genotype a VCF with:
mccortex31 vcfgeno -m 60G --out out.vcf.gz --cov 40 in.vcf graph.ctx
Adding -r
(--rm-cov
) will remove the annotations added by vcfcov
after setting the genotype.
See full page on VCF Genotyping.
Pull out contigs from an individual in a population. Fills in low coverage gaps in sequence from the population if unconfounded.
mccortex31 contigs -p refAndSamples.ctp.gz --colour 1 --out refAndSamples.contigs.fa refAndSamples.ctx
The contigs
command uses random kmers to seed contigs, therefore it may print duplicate contigs and some contigs that are substrings of others. To remove duplicate contigs you can use the rmsubstr
command:
mccortex31 rmsubstr -m 10G --out refAndSamples.contigs.rmdup.fa refAndSamples.contigs.fa
More on the contig assembly page.
Pull out a region of interest using a set of reads or kmers. The subgraph command searches out from the seed kmers by dist
kmers (dist >= 0
).
mccortex31 subgraph --seq contig.fa contig.ctx 100 graph.ctx
Input is graph.ctx
, output is saved to contig.ctx
. You can specify particular colours on the input to keep in the output. Add the --invert
option to invert the kmers selected - only take those further than dist
kmers.
Filter reads by whether or not they share a kmer with a graph:
mccortex31 reads [--fasta|--fastq] --seq in1.fa out.se --seq2 in.1.fq:in.2.fq out.pair in.ctx
Will save reads from in1.fa
to out.se.fa.gz
and from in.1.fq
,in.2.fq
to out.pair.1.fa.gz
,out.pair.2.fa.gz
. Output options are --fasta
or --fastq
. Paired end reads (--seq2
) are considered touching the graph if either read shares one or more kmers with the graph. Add the --invert
option to select reads that DO NOT touch the graph.
Requires GraphViz which provides the dot
command. Flattens any colours in the graph:
./scripts/perl/mccortex-graph-to-graphviz.pl graph.ctx:0,3 | dot -Tpdf > kmers.pdf
Or to merge adjacent kmers add the --simplify
option:
./scripts/perl/mccortex-graph-to-graphviz.pl --simplify graph.ctx:0,3 | dot -Tpdf > contigs.pdf
Or you can just use ./scripts/perl/mccortex-graph-to-graphviz.pl
to generate dot files.
You can generate a plot of a de Bruijn graph straight from sequence files using scripts/seq2pdf.sh
:
./scripts/seq2pdf.sh --simplify 3 <(echo ACAACACGT) <(echo CCACACAA) > out.pdf
Usage is: ./scripts/seq2pdf.sh [--simplify] <kmer> <file1> [file2] > out.pdf
. Files can be of types FASTA, FASTQ, SAM, BAM, TXT or even gzipped. Requires graphviz.