-
Notifications
You must be signed in to change notification settings - Fork 196
Mapping long reads with Giraffe
This tutorial covers mapping long reads to a pangenome reference with vg giraffe
. This tutorial covers DNA reads from long-read sequencers like those from Oxford Nanopore or PacBio; for RNA-seq, see Long‐read RNA‐seq with pre‐existing pangenome. For short reads like Illumina or Element, see Mapping short reads with Giraffe.
Long-read support was released in vg 1.63.0. Make sure to install a new enough version of vg.
You will need a file of reads to align. For this example, we will use longread/hifi.fq
. You will also need a pangenome to map to. For this example, we will use the "AF-Filtered VG indexes" graph from the Human Pangenome Reference Consortium version 1.1 release, based on a CHM13 linear reference. You can get it here:
You can also pre-download the distance index for the graph to save on indexing time:
Don't download any other index files; the HPRC 1.1 graphs pre-date long-read support in Giraffe, and the other index files need to be re-generated.
If you don't want to use the full-size graph, you can use a small test graph included with vg to follow along.
vg gbwt --gbz-format -g hprc-v1.1-mc-chm13.d9.gbz -G longread/graph.gfa
While vg giraffe
can do its own indexing, you can pre-index the graph to prevent your first vg giraffe
run from needing to do it (and then distribute the indexes to your cluster nodes, if operating on a cluster). You can do this with:
vg autoindex --workflow lr-giraffe --prefix hprc-v1.1-mc-chm13.d9 --gbz hprc-v1.1-mc-chm13.d9.gbz
This will create some index files.
ls hprc-v1.1-mc-chm13.d9.*
hprc-v1.1-mc-chm13.d9.dist
hprc-v1.1-mc-chm13.d9.gbz
hprc-v1.1-mc-chm13.d9.longread.withzip.min
hprc-v1.1-mc-chm13.d9.longread.zipcodes
The new files are:
-
hprc-v1.1-mc-chm13.d9.dist
(the distance index, if you don't have it already) -
hprc-v1.1-mc-chm13.d9.longread.withzip.min
(the "minimizers" used to find seeds, with embedded "zipcodes") -
hprc-v1.1-mc-chm13.d9.longread.zipcodes
(the zipcodes too large to store in the minimizer file)
To invoke vg giraffe
in long-read mode, use the --parameter-preset
/-b
option to specify hifi
(for PacBio HiFi reads) or r10
(for Oxford Nanopore R10 chemistry reads), as appropriate for your data. Use the --gbz-name
/-Z
option to specify the .gbz
graph file, and the --fastq-in
/-f
option to specify the input reads in FASTQ format. You can also add --progress
/-p
for informative progress messages, and remember to redirect the aligned reads to a GAM file.
vg giraffe -b hifi -Z hprc-v1.1-mc-chm13.d9.gbz -f longread/hifi.fq -p >hifi.mapped.gam
Giraffe will automatically find the correct long-read indexes to go with the .gbz
graph, or generate them if they are missing. If you want to pass them explicitly, you can use the --minimizer-name
/-m
, --zipcode-name
/-z
, and --dist-name
/-d
options.
Giraffe will also guess the number of threads to use, but you can override this with the --threads
/-t
option.
For more on the many available vg giraffe
options, see vg manpage#giraffe.
If you prefer standard GAF-format output, you can ask for that with the --output-format
/-o
option:
vg giraffe -b hifi -Z hprc-v1.1-mc-chm13.d9.gbz -f longread/hifi.fq -p -o GAF >hifi.mapped.gaf
And if you want to go straight to linear-reference BAM, that's available too. But when working with long reads, remember to add --prune-low-cplx
/-P
to the command when surjecting to BAM, to re-compute slippery parts of the alignment against the linear target reference, since with long reads you won't be able to do indel realignment later.
vg giraffe -b hifi -Z hprc-v1.1-mc-chm13.d9.gbz -f longread/hifi.fq -p -o BAM -P >hifi.mapped.bam
Once you have aligned your reads, you will want to work with them.
If you have a GAM, you can get some quality control statistics with:
vg stats -a hifi.mapped.gam
If you're following along with the single-read example input, you might see something like this:
Total alignments: 1
Total primary: 1
Total secondary: 0
Total aligned: 1
Total perfect: 1
Total gapless (softclips allowed): 1
Total paired: 0
Total properly paired: 0
Alignment score: mean 15580, median 15580, stdev 0, max 15580 (1 reads)
Mapping quality: mean 60, median 60, stdev 0, max 60 (1 reads)
Insertions: 0 bp in 0 read events
Deletions: 0 bp in 0 read events
Substitutions: 0 bp in 0 read events
Softclips: 0 bp in 0 read events
Total time: 0.00598654 seconds
Speed: 167.041 reads/second
If you have unusually few reads with mapping quality 60, or an unusually large number of softclipped bases, that could be a sign of trouble.
If you mapped your reads in one format, you might need to convert them to a different one for analysis.
You can convert GAM to GAF:
vg convert --gam-to-gaf hifi.mapped.gam hprc-v1.1-mc-chm13.d9.gbz >hifi.mapped.gaf
You can also go the other way and convert GAF to GAM:
vg convert --gaf-to-gam hifi.mapped.gaf hprc-v1.1-mc-chm13.d9.gbz >hifi.mapped.gam
Note that these conversions require the graph file!
If you are trying to use a vg giraffe
GAM file with an older non-vg tool that reads GAM, or an old version of vg, and you get messages like:
what(): [vg::io::MessageIterator] obsolete, invalid, or corrupt input at message 12345 group 45678
Then you may be dealing with a tool that does not understand the run-level metadata embedded in the GAM file by newer versions of Giraffe. In that case, converting from GAM to GAF and back to GAM again can be used to remove that metadata.
If you have GAM or GAF alignments and you want linear-reference BAM files, for tools like DeepVariant, you can use vg surject
to "surject" your alignments and squash them down to a linear reference. With long reads, it is important to use the --prune-low-cplx
/-P
option of vg surject
, because it is common for the reads to span low-complexity regions such as tandem repeats, and most downstream tools (such as variant callers) will expect local alignments to such regions to be optimal against the linear reference.
vg surject -b -x hprc-v1.1-mc-chm13.d9.gbz hifi.mapped.gam --prune-low-cplx >hifi.mapped.bam
If you want to surject GAF alignments, you also need the --gaf-input
/-G
option:
vg surject -b -x hprc-v1.1-mc-chm13.d9.gbz -G hifi.mapped.gaf --prune-low-cplx >hifi.mapped.bam
After following along with the tutorial, remember to clean up:
rm hifi.mapped.{gaf,gam,bam} hprc-v1.1-mc-chm13.d9.{dist,gbz,longread.withzip.min,longread.zipcodes}