Skip to content

Mapping long reads with Giraffe

Adam Novak edited this page Jan 24, 2025 · 11 revisions

This tutorial covers mapping long reads to a pangenome reference with vg giraffe. This tutorial covers DNA reads from long-read sequencers like those from Oxford Nanopore or PacBio; for RNA-seq, see Long‐read RNA‐seq with pre‐existing pangenome. For short reads like Illumina or Element, see Mapping short reads with Giraffe.

Installation

Long-read support was released in vg 1.63.0. Make sure to install a new enough version of vg.

Obtaining Data

You will need a file of reads to align. For this example, we will use longread/hifi.fq. You will also need a pangenome to map to. For this example, we will use the "AF-Filtered VG indexes" graph from the Human Pangenome Reference Consortium version 1.1 release, based on a CHM13 linear reference. You can get it here:

You can also pre-download the distance index for the graph to save on indexing time:

Don't download any other index files; the HPRC 1.1 graphs pre-date long-read support in Giraffe, and the other index files need to be re-generated.

If you don't want to use the full-size graph, you can use a small test graph included with vg to follow along.
vg gbwt --gbz-format -g hprc-v1.1-mc-chm13.d9.gbz -G longread/graph.gfa

Indexing the Graph

While vg giraffe can do its own indexing, you can pre-index the graph to prevent your first vg giraffe run from needing to do it (and then distribute the indexes to your cluster nodes, if operating on a cluster). You can do this with:

vg autoindex --workflow lr-giraffe --prefix hprc-v1.1-mc-chm13.d9 --gbz hprc-v1.1-mc-chm13.d9.gbz

This will create some index files.

ls hprc-v1.1-mc-chm13.d9.*
hprc-v1.1-mc-chm13.d9.dist
hprc-v1.1-mc-chm13.d9.gbz
hprc-v1.1-mc-chm13.d9.longread.withzip.min
hprc-v1.1-mc-chm13.d9.longread.zipcodes

The new files are:

  • hprc-v1.1-mc-chm13.d9.dist (the distance index, if you don't have it already)
  • hprc-v1.1-mc-chm13.d9.longread.withzip.min (the "minimizers" used to find seeds, with embedded "zipcodes")
  • hprc-v1.1-mc-chm13.d9.longread.zipcodes (the zipcodes too large to store in the minimizer file)

Mapping Long Reads

To invoke vg giraffe in long-read mode, use the --parameter-preset/-b option to specify hifi (for PacBio HiFi reads) or r10 (for Oxford Nanopore R10 chemistry reads), as appropriate for your data. Use the --gbz-name/-Z option to specify the .gbz graph file, and the --fastq-in/-f option to specify the input reads in FASTQ format. You can also add --progress/-p for informative progress messages, and remember to redirect the aligned reads to a GAM file.

vg giraffe -b hifi -Z hprc-v1.1-mc-chm13.d9.gbz -f longread/hifi.fq -p >hifi.mapped.gam

Giraffe will automatically find the correct long-read indexes to go with the .gbz graph, or generate them if they are missing. If you want to pass them explicitly, you can use the --minimizer-name/-m, --zipcode-name/-z, and --dist-name/-d options.

Giraffe will also guess the number of threads to use, but you can override this with the --threads/-t option.

For more on the many available vg giraffe options, see vg manpage#giraffe.

Mapping in Other Formats

If you prefer standard GAF-format output, you can ask for that with the --output-format/-o option:

vg giraffe -b hifi -Z hprc-v1.1-mc-chm13.d9.gbz -f longread/hifi.fq -p -o GAF >hifi.mapped.gaf

And if you want to go straight to linear-reference BAM, that's available too. But when working with long reads, remember to add --prune-low-cplx/-P to the command when surjecting to BAM, to re-compute slippery parts of the alignment against the linear target reference, since with long reads you won't be able to do indel realignment later.

vg giraffe -b hifi -Z hprc-v1.1-mc-chm13.d9.gbz -f longread/hifi.fq -p -o BAM -P >hifi.mapped.bam

Working with Long Read Alignments

Once you have aligned your reads, you will want to work with them.

Collecting GAM Statistics

If you have a GAM, you can get some quality control statistics with:

vg stats -a hifi.mapped.gam

If you're following along with the single-read example input, you might see something like this:

Total alignments: 1
Total primary: 1
Total secondary: 0
Total aligned: 1
Total perfect: 1
Total gapless (softclips allowed): 1
Total paired: 0
Total properly paired: 0
Alignment score: mean 15580, median 15580, stdev 0, max 15580 (1 reads)
Mapping quality: mean 60, median 60, stdev 0, max 60 (1 reads)
Insertions: 0 bp in 0 read events
Deletions: 0 bp in 0 read events
Substitutions: 0 bp in 0 read events
Softclips: 0 bp in 0 read events
Total time: 0.00598654 seconds
Speed: 167.041 reads/second

If you have unusually few reads with mapping quality 60, or an unusually large number of softclipped bases, that could be a sign of trouble.

Converting Formats

If you mapped your reads in one format, you might need to convert them to a different one for analysis.

You can convert GAM to GAF:

vg convert --gam-to-gaf hifi.mapped.gam hprc-v1.1-mc-chm13.d9.gbz >hifi.mapped.gaf

You can also go the other way and convert GAF to GAM:

vg convert --gaf-to-gam hifi.mapped.gaf hprc-v1.1-mc-chm13.d9.gbz >hifi.mapped.gam

Note that these conversions require the graph file!

Stripping GAM Metadata

If you are trying to use a vg giraffe GAM file with an older non-vg tool that reads GAM, or an old version of vg, and you get messages like:

what(): [vg::io::MessageIterator] obsolete, invalid, or corrupt input at message 12345 group 45678

Then you may be dealing with a tool that does not understand the run-level metadata embedded in the GAM file by newer versions of Giraffe. In that case, converting from GAM to GAF and back to GAM again can be used to remove that metadata.

Surjecting Long Reads to BAM

If you have GAM or GAF alignments and you want linear-reference BAM files, for tools like DeepVariant, you can use vg surject to "surject" your alignments and squash them down to a linear reference. With long reads, it is important to use the --prune-low-cplx/-P option of vg surject, because it is common for the reads to span low-complexity regions such as tandem repeats, and most downstream tools (such as variant callers) will expect local alignments to such regions to be optimal against the linear reference.

vg surject -b -x hprc-v1.1-mc-chm13.d9.gbz hifi.mapped.gam --prune-low-cplx >hifi.mapped.bam

If you want to surject GAF alignments, you also need the --gaf-input/-G option:

vg surject -b -x hprc-v1.1-mc-chm13.d9.gbz -G hifi.mapped.gaf --prune-low-cplx >hifi.mapped.bam

Conclusion

After following along with the tutorial, remember to clean up:

rm hifi.mapped.{gaf,gam,bam} hprc-v1.1-mc-chm13.d9.{dist,gbz,longread.withzip.min,longread.zipcodes}
Clone this wiki locally