Skip to content

Long‐read RNA‐seq with pre‐existing pangenome

Adam Novak edited this page Jan 25, 2025 · 5 revisions

This tutorial supplements "Transcriptomic analyses" by providing complete steps for long-read RNA-seq (e.g. Kinnex) quantification with a pan-transcriptome made from a .gbz pangenome.

All example files here are assumed to be in the working directory. Input files provided by the user are called graph.gbz, annotations.ref.gff3, and reads.fastq.

Installation

Install vg following its README. Version 1.63.0 or later is required for long-read mapping. Also install rpvg following its README.

If you don't want to follow the instructions, you can get `rpvg` for Linux x86\_64 with a shell script.
if ! which rpvg ; then
    if [[ "$(uname -sm)" == "Linux x86_64" ]] ; then
        wget https://github.com/jonassibbesen/rpvg/releases/download/v1.0/rpvg-v1.0_linux.tar.gz
        tar -xvf rpvg-v1.0_linux.tar.gz rpvg-v1.0_linux/bin/rpvg
        mv rpvg-v1.0_linux/bin/rpvg "$(dirname "$(which vg)")"
        rmdir rpvg-v1.0_linux/bin
        rmdir rpvg-v1.0_linux
        rm rpvg-v1.0_linux.tar.gz
    fi
    # Otherwise follow the actual instructions
fi

Pan-transcriptome creation

A spliced pangenome simply adds splice junction edges between exons and thus is reference-based. By contrast, for a pan-transcriptome, splice junctions are projected onto the non-reference haplotypes. Thus, a pan-transcriptome has paths for all haplotype transcripts.

Some considerations for the annotation file:

  • Chromosome names in the annotations must match the pangenome’s reference haplotype’s contigs. Check the graph’s chromosome names with vg paths --list --reference-paths -x graph.gbz. Some paths in the graph might have modified PanSN (sample#haplotype#contig#offset) names; it’s OK if the graph has names with offsets and the annotations don’t..
  • Choose --transcript-tag to be the tag that connects exons to their parent transcript. Make sure it distinguishes between chrX/Y annotations for pseudoautosomal genes.
  • Annotations must correspond to the correct reference genome. For example, CHM13’s first version lacked a Y chromosome, and graphs built from it often use GRCh38’s Y, so the chrY genes in a CHM13 v2 annotation file may need to be swapped out for GRCh38’s chrY annotations.

The “Construction” section of the "Transcriptomic analyses" page has an alternative more efficient three-step process for pan-transcriptome creation. This process needs a VCF file (which is used for GBWT creation). It cannot be used when starting from a .gbz pangenome, which already includes a GBWT.

Graph creation requires large amounts of time and memory. With a HPRC v1.1 pangenome (90 haplotypes, 3.4G GBZ), 10k valid transcripts on CHM13, and 16 cores, it takes two hours and 140GB even without projection. With projection it was still running after a week. With pre-made annotations for other haplotypes (without projection), and a total of 13 million transcripts, it took two hours and 230GB.

If you don't want to use the full-size graph, you can use a small test graph included with vg to follow along.
vg gbwt --gbz-format -g graph.gbz -G longread/graph.gfa
cp longread/annotations.gff3 ./annotations.ref.gff3
cp longread/annotations.gff3 ./annotations.sample1.gff3
cp longread/annotations.gff3 ./annotations.sample2.gff3
cp longread/kinnex.fq ./reads.fastq

Using projection (only reference annotations)

If annotations are only available for the reference haplotype, they can be projected onto the pangenome’s other haplotypes via the --proj-embed-paths option:

vg rna --progress --proj-embed-paths --gbz-format graph.gbz \
    --transcripts annotations.ref.gff3 --transcript-tag Parent \
    --write-info info.tsv --gbwt-bidirectional --write-gbwt transcripts.gbwt > pantran.pg

Make sure the contig names in the first column of your GFF3 match the contig names in your graph. Check your available reference and generic paths with:

vg paths --list --reference-paths --generic-paths -x graph.gbz

In particular, if your target paths are in the graph in PanSN format, as something like CHM13#0#chr1, but your GFF3 just calls for chr1, it won't work.

Without projection (annotations for all haplotypes)

If annotations are available for all haplotypes, such as for HPRC v1.1, simply use multiple --transcripts/-n options. All annotation files must have the same --transcript-tag. Disable projection via --use-hap-ref. This option is significantly more efficient.

vg rna --progress --use-hap-ref --gbz-format graph.gbz \
    --transcripts annotations.ref.gff3 --transcript-tag Parent \
    -n annotations.sample1.gff3 -n annotations.sample2.gff3 \
    `# ...more sample annotations...` \
    --write-info info.tsv --gbwt-bidirectional --write-gbwt transcripts.gbwt > pantran.pg

Again, the paths in the GFF3 files need to match the names of the paths in the graph. If your annotations for, say, sample NA12878 are not working, check what the graph calls that sample's paths with:

vg paths --list --sample NA12878 -x graph.gbz

Then make sure the corresponding GFF3 file matches those path names.

Read alignment

vg giraffe requires a lot of indexes, which it can make for itself. Creating indexes in a step separate from read alignment has the major advantage of breaking up the job into self-contained pieces. To do this, use the vg autoindex command, which pre-creates indexes. This example uses a read file shipped with vg to create the indexes.

# Build .gbz from .gbwt and .pg
vg gbwt -x pantran.pg --gbz-format -g pantran.gbz transcripts.gbwt
# Build indexes
vg autoindex --workflow lr-giraffe --prefix pantran --gbz pantran.gbz
# Use preset "-b hifi" for Kinnex
vg giraffe --progress -b hifi --gbz-name pantran.gbz \
    --fastq-in reads.fastq --mapq-score-scale 0.01 > giraffe.gam

The resulting alignments may be QC’ed by vg stats -a giraffe.gam. Note the --mapq-score-scale 0.01 option. Long reads’ scores are penalized by vg giraffe’s mapping quality logic using this scaling factor, which must be set to bring the alignment scores down to an appropriate range for the mapping quality heuristics. The value used here empirically moves the median MAPQ of Kinnex reads from ~10 to 60.

Transcript quantification

The GAMs produced by vg giraffe have metadata incompatible with rpvg. Either use the linked PR, or remove metadata with vg convert --gam-to-gaf and --gaf-to-gam.

# Sanitize the GAM
vg convert --gam-to-gaf giraffe.gam pantran.pg > giraffe.gaf
vg convert --gaf-to-gam giraffe.gaf pantran.pg > giraffe.gaf.gam

The rpvg run requires a .xg version of the graph, so make one form the .pg file:

# rpvg requires .xg, not the .pg or .gbz from previous steps
vg convert --xg-out pantran.pg > pantran.xg

This step requires a GBWT with transcripts; use the one from vg rna. Additionally, as the vg giraffe -b hifi command did not use quality-adjusted scores, use --score-not-qual.

# Don't try this part unless you actually installed rpvg
if ! which rpvg ; then exit 0 ; fi

# Actually run rpvg to quantify transcripts
rpvg --long-reads --score-not-qual --single-path \
    -a giraffe.gaf.gam -g pantran.xg -p transcripts.gbwt \
    --path-info info.tsv -i haplotype-transcripts -o quantification
After following along, you can clean up the extraneous files.
rm info.tsv transcripts.gbwt pantran.{xg,pg,gbz,dist,longread.withzip.min,longread.zipcodes} giraffe.{gaf,gam,gaf.gam} graph.gbz annotations.{ref,sample1,sample2}.gff3 reads.fastq
rm -Rf quantification
Clone this wiki locally