-
Notifications
You must be signed in to change notification settings - Fork 196
vg manpage
Adam Novak edited this page Jan 24, 2025
·
4 revisions
vg - variation graph tool, vg version v1.62.0-1441-g22a8b60cf "Ranzano".
vg is a toolkit for variation graph data structures, interchange formats, alignment, genotyping, and variant calling methods.
For more in-depth explanations of tools and workflows, see the vg wiki page
This is an incomplete list of vg subcommands. For a complete list, run vg help
.
-
Graph construction and indexing
See the wiki page for an overview of vg indexes.
-
vg autoindex
: automatically construct a graph and indexes for a specific workflow (e.g. giraffe, rpvg). wiki page -
vg construct
: manually construct a graph from a reference and variants. wiki page -
vg index
: manually build individual indexes (xg, distance, GCSA, etc). wiki page -
vg gbwt
: manually build and manipulate GBWTs and indexes (GBWTgraph, GBZ, r-index). wiki page -
vg minimizer
: manually build a minimizer index for mapping. -
vg haplotypes
: haplotype sample a graph. Recommended for mapping with giraffe. wiki page
-
- Read mapping
- Downstream analyses
-
Working with read alignments
-
vg gamsort
: sort a GAM/GAF file or index a sorted GAM file. -
vg filter
: filter alignments by properties. -
vg surject
: project alignments on a graph onto a linear reference (gam/gaf->bam/sam/cram). -
vg inject
: project alignments on a linear reference onto a graph (bam/sam/cram->gam/gaf). -
vg sim
: simulate reads from a graph. wiki page
-
- Graph and read statistics
- Manipulating a graph
-
Conversion between formats
-
vg convert
: convert between handle graph formats and GFA, and between alignment formats. -
vg view
: convert between non-handle graph formats and alignment formats (dot, json, turtle...). -
vg surject
: project alignments on a graph onto a linear reference (gam/gaf->bam/sam/cram). -
vg inject
: project alignments on a linear reference onto a graph (bam/sam/cram->gam/gaf). -
vg paths
: extract a fasta from a graph. wiki page
-
- Subgraph extraction
Annotate alignments with graphs and graphs with alignments.
vg: option `--help' requires an argument
usage: vg annotate [options] >output.{gam,vg,tsv}
graph annotation options:
-x, --xg-name FILE xg index or graph to annotate (required)
-b, --bed-name FILE a BED file to convert to GAM. May repeat.
-f, --gff-name FILE a GFF3 file to convert to GAM. May repeat.
-g, --ggff output at GGFF subgraph annotation file instead of GAM (requires -s)
-F, --gaf-output output in GAF format rather than GAM
-s, --snarls FILE file containing snarls to expand GFF intervals into
alignment annotation options:
-a, --gam FILE file of Alignments to annotate (required)
-x, --xg-name FILE xg index of the graph against which the Alignments are aligned (required)
-p, --positions annotate alignments with reference positions
-m, --multi-position annotate alignments with multiple reference positions
-l, --search-limit N when annotating with positions, search this far for paths, or -1 to not search (default: 0 (auto from read length))
-b, --bed-name FILE annotate alignments with overlapping region names from this BED. May repeat.
-n, --novelty output TSV table with header describing how much of each Alignment is novel
-P, --progress show progress
-t, --threads use the specified number of threads
Mapping tool-oriented index construction from interchange formats.
usage: vg autoindex [options]
options:
output:
-p, --prefix PREFIX prefix to use for all output (default: index)
-w, --workflow NAME workflow to produce indexes for, can be provided multiple
times. options: map, mpmap, rpvg, giraffe, sr-giraffe, lr-giraffe (default: map)
input data:
-r, --ref-fasta FILE FASTA file containing the reference sequence (may repeat)
-v, --vcf FILE VCF file with sequence names matching -r (may repeat)
-i, --ins-fasta FILE FASTA file with sequences of INS variants from -v
-g, --gfa FILE GFA file to make a graph from
-G, --gbz FILE GBZ file to make indexes from
-x, --tx-gff FILE GTF/GFF file with transcript annotations (may repeat)
-H, --hap-tx-gff FILE GTF/GFF file with transcript annotations of a named haplotype (may repeat)
configuration:
-f, --gff-feature STR GTF/GFF feature type (col. 3) to add to graph (default: exon)
-a, --gff-tx-tag STR GTF/GFF tag (in col. 9) for transcript ID (default: transcript_id)
logging and computation:
-T, --tmp-dir DIR temporary directory to use for intermediate files
-M, --target-mem MEM target max memory usage (not exact, formatted INT[kMG])
(default: 1/2 of available)
-t, --threads NUM number of threads (default: all available)
-V, --verbosity NUM log to stderr (0 = none, 1 = basic, 2 = debug; default 1)
-h, --help print this help message to stderr and exit
usage: vg call [options] <graph> > output.vcf
Call variants or genotype known variants
support calling options:
-k, --pack FILE Supports created from vg pack for given input graph
-m, --min-support M,N Minimum allele support (M) and minimum site support (N) for call [default = 2,4]
-e, --baseline-error X,Y Baseline error rates for Poisson model for small (X) and large (Y) variants [default= 0.005,0.01]
-B, --bias-mode Use old ratio-based genotyping algorithm as opposed to porbablistic model
-b, --het-bias M,N Homozygous alt/ref allele must have >= M/N times more support than the next best allele [default = 6,6]
GAF options:
-G, --gaf Output GAF genotypes instead of VCF
-T, --traversals Output all candidate traversals in GAF without doing any genotyping
-M, --trav-padding N Extend each flank of traversals (from -T) with reference path by N bases if possible
general options:
-v, --vcf FILE VCF file to genotype (must have been used to construct input graph with -a)
-a, --genotype-snarls Genotype every snarl, including reference calls (use to compare multiple samples)
-A, --all-snarls Genotype all snarls, including nested child snarls (like deconstruct -a)
-c, --min-length N Genotype only snarls with at least one traversal of length >= N
-C, --max-length N Genotype only snarls where all traversals have length <= N
-f, --ref-fasta FILE Reference fasta (required if VCF contains symbolic deletions or inversions)
-i, --ins-fasta FILE Insertions fasta (required if VCF contains symbolic insertions)
-s, --sample NAME Sample name [default=SAMPLE]
-r, --snarls FILE Snarls (from vg snarls) to avoid recomputing.
-g, --gbwt FILE Only call genotypes that are present in given GBWT index.
-z, --gbz Only call genotypes that are present in GBZ index (applies only if input graph is GBZ).
-N, --translation FILE Node ID translation (as created by vg gbwt --translation) to apply to snarl names in output
-O, --gbz-translation Use the ID translation from the input gbz to apply snarl names to snarl names and AT fields in output
-p, --ref-path NAME Reference path to call on (multipile allowed. defaults to all paths)
-S, --ref-sample NAME Call on all paths with given sample name (cannot be used with -p)
-o, --ref-offset N Offset in reference path (multiple allowed, 1 per path)
-l, --ref-length N Override length of reference in the contig field of output VCF
-d, --ploidy N Ploidy of sample. Only 1 and 2 supported. (default: 2)
-R, --ploidy-regex RULES use the given comma-separated list of colon-delimited REGEX:PLOIDY rules to assign
ploidies to contigs not visited by the selected samples, or to all contigs simulated
from if no samples are used. Unmatched contigs get ploidy 2 (or that from -d).
-n, --nested Activate nested calling mode (experimental)
-I, --chains Call chains instead of snarls (experimental)
--progress Show progress
-t, --threads N number of threads to use
usage: vg chunk [options] > [chunk.vg]
Splits a graph and/or alignment into smaller chunks
Graph chunks are saved to .vg files, read chunks are saved to .gam files, and haplotype annotations are
saved to .annotate.txt files, of the form <BASENAME>-<i>-<region name or "ids">-<start>-<length>.<ext>.
The BASENAME is specified with -b and defaults to "./chunks".
For a single-range chunk (-p or -r), the graph data is sent to standard output instead of a file.
options:
-x, --xg-name FILE use this graph or xg index to chunk subgraphs
-G, --gbwt-name FILE use this GBWT haplotype index for haplotype extraction (for -T)
-a, --gam-name FILE chunk this gam file instead of the graph (multiple allowed)
-g, --gam-and-graph when used in combination with -a, both gam and graph will be chunked
-F, --in-gaf input alignment is a sorted bgzipped GAF
path chunking:
-p, --path TARGET write the chunk in the specified (0-based inclusive, multiple allowed)
path range TARGET=path[:pos1[-pos2]] to standard output
-P, --path-list FILE write chunks for all path regions in (line - separated file). format
for each as in -p (all paths chunked unless otherwise specified)
-e, --input-bed FILE write chunks for all (0-based end-exclusive) bed regions
-S, --snarls FILE write given path-range(s) and all snarls fully contained in them, as alternative to -c
id range chunking:
-r, --node-range N:M write the chunk for the specified node range to standard output
-R, --node-ranges FILE write the chunk for each node range in (newline or whitespace separated) file
-n, --n-chunks N generate this many id-range chunks, which are determined using the xg index
simple gam chunking:
-m, --gam-split-size N split gam (specified with -a, sort/index not required) up into chunks with at most N reads each
component chunking:
-C, --components create a chunk for each connected component. if a targets given with (-p, -P, -r, -R), limit to components containing them
-M, --path-components create a chunk for each path in the graph's connected component
general:
-s, --chunk-size N create chunks spanning N bases (or nodes with -r/-R) for all input regions.
-o, --overlap N overlap between chunks when using -s [0]
-E, --output-bed FILE write all created chunks to a bed file
-b, --prefix BASENAME write output chunk files with the given base name. Files for chunk i will
be named: <BASENAME>-<i>-<name>-<start>-<length>.<ext> [./chunk]
-c, --context-steps N expand the context of the chunk this many node steps [1]
-l, --context-length N expand the context of the chunk by this many bp [0]
-T, --trace trace haplotype threads in chunks (and only expand forward from input coordinates).
Produces a .annotate.txt file with haplotype frequencies for each chunk.
--no-embedded-haplotypes Don't load haplotypes from the graph. It is possible to -T without any haplotypes available.
-f, --fully-contained only return GAM alignments that are fully contained within chunk
-O, --output-fmt Specify output format (vg, pg, hg, gfa). [pg (vg with -T)]
-t, --threads N for tasks that can be done in parallel, use this many threads [1]
-h, --help
vg: unrecognized option `--help'
usage: vg construct [options] >new.vg
options:
construct from a reference and variant calls:
-r, --reference FILE input FASTA reference (may repeat)
-v, --vcf FILE input VCF (may repeat)
-n, --rename V=F match contig V in the VCFs to contig F in the FASTAs (may repeat)
-a, --alt-paths save paths for alts of variants by SHA1 hash
-A, --alt-paths-plain save paths for alts of variants by variant ID if possible, otherwise SHA1
(IDs must be unique across all input VCFs)
-R, --region REGION specify a VCF contig name or 1-based inclusive region (may repeat, if on different contigs)
-C, --region-is-chrom don't attempt to parse the regions (use when the reference
sequence name could be inadvertently parsed as a region)
-z, --region-size N variants per region to parallelize (default: 1024)
-t, --threads N use N threads to construct graph (defaults to numCPUs)
-S, --handle-sv include structural variants in construction of graph.
-I, --insertions FILE a FASTA file containing insertion sequences
(referred to in VCF) to add to graph.
-f, --flat-alts N don't chop up alternate alleles from input VCF
-l, --parse-max N don't chop up alternate alleles from input VCF longer than N (default: 100)
-i, --no-trim-indels don't remove the 1bp reference base from alt alleles of indels.
-N, --in-memory construct the entire graph in memory before outputting it.
construct from a multiple sequence alignment:
-M, --msa FILE input multiple sequence alignment
-F, --msa-format format of the MSA file (options: fasta, clustal; default fasta)
-d, --drop-msa-paths don't add paths for the MSA sequences into the graph
shared construction options:
-m, --node-max N limit the maximum allowable node sequence size (default: 32)
nodes greater than this threshold will be divided
Note: nodes larger than ~1024 bp can't be GCSA2-indexed
-p, --progress show progress
Convert graphs between handle-graph compliant formats as well as GFA.
usage: vg convert [options] <input-graph>
input options:
-g, --gfa-in input in GFA format
-r, --in-rgfa-rank N import rgfa tags with rank <= N as paths [default=0]
-b, --gbwt-in FILE input graph is a GBWTGraph using the GBWT in FILE
--ref-sample STR change haplotypes for this sample to reference paths (may repeat)
gfa input options (use with -g):
-T, --gfa-trans FILE write gfa id conversions to FILE
output options:
-v, --vg-out output in VG's original Protobuf format [DEPRECATED: use -p instead].
-a, --hash-out output in HashGraph format
-p, --packed-out output in PackedGraph format [default]
-x, --xg-out output in XG format
-f, --gfa-out output in GFA format
-H, --drop-haplotypes do not include haplotype paths in the output
(useful with GBWTGraph / GBZ inputs)
gfa output options (use with -f):
-P, --rgfa-path STR write given path as rGFA tags instead of lines
(multiple allowed, only rank-0 supported)
-Q, --rgfa-prefix STR write paths with given prefix as rGFA tags instead of lines
(multiple allowed, only rank-0 supported)
-B, --rgfa-pline paths written as rGFA tags also written as lines
-W, --no-wline Write all paths as GFA P-lines instead of W-lines.
Allows handling multiple phase blocks and subranges used together.
--gbwtgraph-algorithm Always use the GBWTGraph library GFA algorithm.
Not compatible with other GFA output options or non-GBWT graphs.
--vg-algorithm Always use the VG GFA algorithm. Works with all options and graph types,
but can't preserve original GFA coordinates.
--no-translation When using the GBWTGraph algorithm, convert the graph directly to GFA.
Do not use the translation to preserve original coordinates.
alignment options:
-G, --gam-to-gaf FILE convert GAM FILE to GAF
-F, --gaf-to-gam FILE convert GAF FILE to GAM
general options:
-t, --threads N use N threads (defaults to numCPUs)
usage: vg deconstruct [options] [-p|-P] <PATH> <GRAPH>
Outputs VCF records for Snarls present in a graph (relative to a chosen reference path).
options:
-p, --path NAME A reference path to deconstruct against (multiple allowed).
-P, --path-prefix NAME All paths [excluding GBWT threads / non-reference GBZ paths] beginning with NAME used as reference (multiple allowed).
Other non-ref paths not considered as samples.
-r, --snarls FILE Snarls file (from vg snarls) to avoid recomputing.
-g, --gbwt FILE consider alt traversals that correspond to GBWT haplotypes in FILE (not needed for GBZ graph input).
-T, --translation FILE Node ID translation (as created by vg gbwt --translation) to apply to snarl names and AT fields in output
-O, --gbz-translation Use the ID translation from the input gbz to apply snarl names to snarl names and AT fields in output
-a, --all-snarls Process all snarls, including nested snarls (by default only top-level snarls reported).
-c, --context-jaccard N Set context mapping size used to disambiguate alleles at sites with multiple reference traversals (default: 10000).
-u, --untangle-travs Use context mapping to determine the reference-relative positions of each step in allele traversals (AP INFO field).
-K, --keep-conflicted Retain conflicted genotypes in output.
-S, --strict-conflicts Drop genotypes when we have more than one haplotype for any given phase (set by default when using GBWT input).
-C, --contig-only-ref Only use the CONTIG name (and not SAMPLE#CONTIG#HAPLOTYPE etc) for the reference if possible (ie there is only one reference sample).
-L, --cluster F Cluster traversals whose (handle) Jaccard coefficient is >= F together (default: 1.0) [experimental]
-n, --nested Write a nested VCF, including special tags. [experimental]
-R, --star-allele Use *-alleles to denote alleles that span but do not cross the site. Only works with -n
-t, --threads N Use N threads
-v, --verbose Print some status messages
Filter alignments by properties.
vg: unrecognized option `--help'
usage: vg filter [options] <alignment.gam> > out.gam
Filter alignments by properties.
options:
-M, --input-mp-alns input is multipath alignments (GAMP) rather than GAM
-n, --name-prefix NAME keep only reads with this prefix in their names [default='']
-N, --name-prefixes FILE keep reads with names with one of many prefixes, one per nonempty line
-e, --exact-name match read names exactly instead of by prefix
-a, --subsequence NAME keep reads that contain this subsequence
-A, --subsequences FILE keep reads that contain one of these subsequences, one per nonempty line
-p, --proper-pairs keep reads that are annotated as being properly paired
-P, --only-mapped keep reads that are mapped
-X, --exclude-contig REGEX drop reads with refpos annotations on contigs matching the given regex (may repeat)
-F, --exclude-feature NAME drop reads with the given feature in the "features" annotation (may repeat)
-s, --min-secondary N minimum score to keep secondary alignment
-r, --min-primary N minimum score to keep primary alignment
-L, --max-length N drop reads with length > N
-O, --rescore re-score reads using default parameters and only alignment information
-f, --frac-score normalize score based on length
-u, --substitutions use substitution count instead of score
-o, --max-overhang N drop reads whose alignments begin or end with an insert > N [default=99999]
-m, --min-end-matches N drop reads that don't begin with at least N matches on each end
-S, --drop-split remove split reads taking nonexistent edges
-x, --xg-name FILE use this xg index or graph (required for -S and -D)
-v, --verbose print out statistics on numbers of reads dropped by what.
-V, --no-output print out statistics (as above) but do not write out filtered GAM.
-T, --tsv-out FIELD[;FIELD] do not write filtered gam but a tsv of the given fields
-q, --min-mapq N drop alignments with mapping quality < N
-E, --repeat-ends N drop reads with tandem repeat (motif size <= 2N, spanning >= N bases) at either end
-D, --defray-ends N clip back the ends of reads that are ambiguously aligned, up to N bases
-C, --defray-count N stop defraying after N nodes visited (used to keep runtime in check) [default=99999]
-d, --downsample S.P drop all but the given portion 0.P of the reads. S may be an integer seed as in SAMtools
-R, --max-reads N drop all but N reads. Nondeterministic on multiple threads.
-i, --interleaved assume interleaved input. both ends will be dropped if either fails filter
-I, --interleaved-all assume interleaved input. both ends will be dropped if *both* fail filters
-b, --min-base-quality Q:F drop reads with where fewer than fraction F bases have base quality >= PHRED score Q.
-G, --annotation K[:V] keep reads if the annotation is present and not false or empty. If a value is given, keep reads if the values are equal
similar to running jq 'select(.annotation.K==V)' on the json
-c, --correctly-mapped keep only reads that are marked as correctly-mapped
-l, --first-alignment keep only the first alignment for each read. Must be run with 1 thread
-U, --complement apply the complement of the filter implied by the other arguments.
-B, --batch-size work in batches of the given number of reads [default=512]
-t, --threads N number of threads [1]
Use an index to find nodes, edges, kmers, paths, or positions.
vg: unrecognized option `--help'
usage: vg find [options] >sub.vg
options:
graph features:
-x, --xg-name FILE use this xg index or graph (instead of rocksdb db)
-n, --node ID find node(s), return 1-hop context as graph
-N, --node-list FILE a white space or line delimited list of nodes to collect
--mapping FILE also include nodes that map to the selected node ids
-e, --edges-end ID return edges on end of node with ID
-s, --edges-start ID return edges on start of node with ID
-c, --context STEPS expand the context of the subgraph this many steps
-L, --use-length treat STEPS in -c or M in -r as a length in bases
-P, --position-in PATH find the position of the node (specified by -n) in the given path
-I, --list-paths write out the path names in the index
-r, --node-range N:M get nodes from N to M
-G, --gam GAM accumulate the graph touched by the alignments in the GAM
--connecting-start POS find the graph connecting from POS (node ID, + or -, node offset) to --connecting-end
--connecting-end POS find the graph connecting to POS (node ID, + or -, node offset) from --connecting-start
--connecting-range INT traverse up to INT bases when going from --connecting-start to --connecting-end (default: 100)
subgraphs by path range:
-p, --path TARGET find the node(s) in the specified path range(s) TARGET=path[:pos1[-pos2]]
-R, --path-bed FILE read our targets from the given BED FILE
-E, --path-dag with -p or -R, gets any node in the partial order from pos1 to pos2, assumes id sorted DAG
-W, --save-to PREFIX instead of writing target subgraphs to stdout,
write one per given target to a separate file named PREFIX[path]:[start]-[end].vg
-K, --subgraph-k K instead of graphs, write kmers from the subgraphs
-H, --gbwt FILE when enumerating kmers from subgraphs, determine their frequencies in this GBWT haplotype index
alignments:
-l, --sorted-gam FILE use this sorted, indexed GAM file
-F, --sorted-gaf FILE use this sorted, indexed GAF file
-o, --alns-on N:M write alignments which align to any of the nodes between N and M (inclusive)
-A, --to-graph VG get alignments to the provided subgraph
sequences:
-g, --gcsa FILE use this GCSA2 index of the sequence space of the graph (required for sequence queries)
-S, --sequence STR search for sequence STR using
-M, --mems STR describe the super-maximal exact matches of the STR (gcsa2) in JSON
-B, --reseed-length N find non-super-maximal MEMs inside SMEMs of length at least N
-f, --fast-reseed use fast SMEM reseeding algorithm
-Y, --max-mem N the maximum length of the MEM (default: GCSA2 order)
-Z, --min-mem N the minimum length of the MEM (default: 1)
-D, --distance return distance on path between pair of nodes (-n). if -P not used, best path chosen heurstically
-Q, --paths-named S return all paths whose names are prefixed with S (multiple allowed)
vg: unrecognized option `--help'
gamsort: sort a GAM/GAF file, or index a sorted GAM file
Usage: gamsort [Options] gamfile
Options:
-i / --index FILE produce an index of the sorted GAM file
-d / --dumb-sort use naive sorting algorithm (no tmp files, faster for small GAMs)
-s / --shuffle Shuffle reads by hash (GAM only)
-p / --progress Show progress.
-G / --gaf-input Input is a GAF file.
-c / --chunk-size Number of reads per chunk when sorting GAFs.
-t / --threads Use the specified number of threads.
usage: vg gbwt [options] [args]
Manipulate GBWTs. Input GBWTs are loaded from input args or built in earlier steps.
The input graph is provided with one of -x, -G, or -Z
General options:
-x, --xg-name FILE read the graph from FILE
-o, --output FILE write output GBWT to FILE
-d, --temp-dir DIR use directory DIR for temporary files
-p, --progress show progress and statistics
GBWT construction parameters (for steps 1 and 4):
--buffer-size N GBWT construction buffer size in millions of nodes (default 100)
--id-interval N store path ids at one out of N positions (default 1024)
Multithreading:
--num-jobs N use at most N parallel build jobs (for -v, -G, -A, -l, -P; default 4)
--num-threads N use N parallel search threads (for -b and -r; default 8)
Step 1: GBWT construction (requires -o and one of { -v, -G, -Z, -E, A }):
-v, --vcf-input index the haplotypes in the VCF files specified in input args in parallel
(inputs must be over different contigs; requires -x, implies -f)
(does not store graph contigs in the GBWT)
--preset X use preset X (available: 1000gp)
--inputs-as-jobs create one build job for each input instead of using first-fit heuristic
--parse-only store the VCF parses without building GBWTs
(use -o for the file name prefix; skips subsequent steps)
--ignore-missing do not warn when variants are missing from the graph
--actual-phasing do not interpret unphased homozygous genotypes as phased
--force-phasing replace unphased genotypes with randomly phased ones
--discard-overlaps skip overlapping alternate alleles if the overlap cannot be resolved
instead of creating a phase break
--batch-size N index the haplotypes in batches of N samples (default 200)
--sample-range X-Y index samples X to Y (inclusive, 0-based)
--rename V=P VCF contig V matches path P in the graph (may repeat)
--vcf-variants variants in the graph use VCF contig names instead of path names
--vcf-region C:X-Y restrict VCF contig C to coordinates X to Y (inclusive, 1-based; may repeat)
--exclude-sample X do not index the sample with name X (faster than -R; may repeat)
-G, --gfa-input index the walks or paths in the GFA file (one input arg)
--max-node N chop long segments into nodes of at most N bp (default 1024, use 0 to disable)
--path-regex X parse metadata as haplotypes from path names using regex X instead of vg-parser-compatible rules
--path-fields X parse metadata as haplotypes, mapping regex submatches to these fields instead of using vg-parser-compatible rules
--translation FILE write the segment to node translation table to FILE
-Z, --gbz-input extract GBWT and GBWTGraph from GBZ input (one input arg)
--translation FILE write the segment to node translation table to FILE
-I, --gg-in FILE load GBWTGraph from FILE and GBWT from input (one input arg)
-E, --index-paths index the embedded non-alt paths in the graph (requires -x, no input args)
-A, --alignment-input index the alignments in the GAF files specified in input args (requires -x)
--gam-format the input files are in GAM format instead of GAF format
Step 2: Merge multiple input GBWTs (requires -o):
-m, --merge use the insertion algorithm
-f, --fast fast merging algorithm (node ids must not overlap)
-b, --parallel use the parallel algorithm
--chunk-size N search in chunks of N sequences (default 1)
--pos-buffer N use N MiB position buffers for each search thread (default 64)
--thread-buffer N use N MiB thread buffers for each search thread (default 256)
--merge-buffers N merge 2^N thread buffers into one file per merge job (default 6)
--merge-jobs N run N parallel merge jobs (default 4)
Step 3: Alter GBWT (requires -o and one input GBWT):
-R, --remove-sample X remove the sample with name X from the index (may repeat)
--set-tag K=V set a GBWT tag (may repeat)
--set-reference X set sample X as the reference (may repeat)
Step 4: Path cover GBWT construction (requires an input graph, -o, and one of { -a, -l, -P }):
-a, --augment-gbwt add a path cover of missing components (one input GBWT)
-l, --local-haplotypes sample local haplotypes (one input GBWT)
-P, --path-cover build a greedy path cover (no input GBWTs)
-n, --num-paths N find N paths per component (default 64 for -l, 16 otherwise)
-k, --context-length N use N-node contexts (default 4)
--pass-paths include named graph paths in local haplotype or greedy path cover GBWT
Step 5: GBWTGraph construction (requires an input graph and one input GBWT):
-g, --graph-name FILE build GBWTGraph and store it in FILE
--gbz-format serialize both GBWT and GBWTGraph in GBZ format (makes -o unnecessary)
Step 6: R-index construction (one input GBWT):
-r, --r-index FILE build an r-index and store it in FILE
Step 7: Metadata (one input GBWT):
-M, --metadata print basic metadata
-C, --contigs print the number of contigs
-H, --haplotypes print the number of haplotypes
-S, --samples print the number of samples
-L, --list-names list contig/sample names (use with -C or -S)
-T, --path-names list path names
--tags list GBWT tags
Step 8: Paths (one input GBWT):
-c, --count-paths print the number of paths
-e, --extract FILE extract paths in SDSL format to FILE
usage:
vg giraffe -Z graph.gbz [-d graph.dist -m graph.min] <input options> [other options] > output.gam
vg giraffe -Z graph.gbz --haplotype-name graph.hapl --kff-name sample.kff <input options> [other options] > output.gam
Fast haplotype-aware short read mapper.
basic options:
-Z, --gbz-name FILE map to this GBZ graph
-m, --minimizer-name FILE use this minimizer index
-z, --zipcode-name FILE use these additional distance hints
-d, --dist-name FILE cluster using this distance index
-p, --progress show progress
-t, --threads INT number of mapping threads to use
-b, --parameter-preset NAME set computational parameters (chaining-sr / default / fast / hifi / r10 / srold) [default]
-h, --help print full help with all available options
input options:
-G, --gam-in FILE read and realign GAM-format reads from FILE
-f, --fastq-in FILE read and align FASTQ-format reads from FILE (two are allowed, one for each mate)
-i, --interleaved GAM/FASTQ input is interleaved pairs, for paired-end alignment
--comments-as-tags intepret comments in name lines as SAM-style tags and annotate alignments with them
haplotype sampling:
--haplotype-name FILE sample from haplotype information in FILE
--kff-name FILE sample according to kmer counts in FILE
--index-basename STR name prefix for generated graph/index files (default: from graph name)
alternate graphs:
-x, --xg-name FILE map to this graph (if no -Z / -g), or use this graph for HTSLib output
-g, --graph-name FILE map to this GBWTGraph (if no -Z)
-H, --gbwt-name FILE use this GBWT index (when mapping to -x / -g)
output options:
-N, --sample NAME add this sample name
-R, --read-group NAME add this read group
-o, --output-format NAME output the alignments in NAME format (gam / gaf / json / tsv / SAM / BAM / CRAM) [gam]
--ref-paths FILE ordered list of paths in the graph, one per line or HTSlib .dict, for HTSLib @SQ headers
--named-coordinates produce GAM/GAF outputs in named-segment (GFA) space
-P, --prune-low-cplx prune short and low complexity anchors during linear format realignment
-n, --discard discard all output alignments (for profiling)
--output-basename NAME write output to a GAM file beginning with the given prefix for each setting combination
--report-name NAME write a TSV of output file and mapping speed to the given file
--show-work log how the mapper comes to its conclusions about mapping locations
Giraffe parameters:
-A, --rescue-algorithm NAME use algorithm NAME for rescue (none / dozeu / gssw) [dozeu]
--fragment-mean FLOAT force the fragment length distribution to have this mean (requires --fragment-stdev)
--fragment-stdev FLOAT force the fragment length distribution to have this standard deviation (requires --fragment-mean)
--set-refpos set refpos field on reads to reference path positions they visit
--track-provenance track how internal intermediate alignment candidates were arrived at
--track-correctness track if internal intermediate alignment candidates are correct (implies --track-provenance)
--track-position coarsely track linear reference positions of good intermediate alignment candidates (implies --track-provenance)
-B, --batch-size INT number of reads or pairs per batch to distribute to threads [512]
program options:
--watchdog-timeout INT complain after INT seconds working on a read or read pair [10]
--log-reads log each read being mapped
-B, --batch-size INT complain after INT seconds working on a read or read pair [512]
scoring options:
--match INT use this match score [1]
--mismatch INT use this mismatch penalty [4]
--gap-open INT use this gap open penalty [6]
--gap-extend INT use this gap extension penalty [1]
--full-l-bonus INT the full-length alignment bonus [5]
result options:
-M, --max-multimaps INT produce up to INT alignments for each read [1]
computational parameters:
-c, --hit-cap INT use all minimizers with at most INT hits [10]
-C, --hard-hit-cap INT ignore all minimizers with more than INT hits [500]
-F, --score-fraction FLOAT select minimizers between hit caps until score is FLOAT of total [0.9]
-U, --max-min INT use at most INT minimizers, 0 for no limit [500]
--min-coverage-flank INT when trying to cover the read with minimizers, count INT towards the coverage of each minimizer on each side [250]
--num-bp-per-min INT use maximum of number minimizers calculated by READ_LENGTH / INT and --max-min [1000]
--downsample-window-length INT maximum window length for downsampling [18446744073709551615]
--downsample-window-count INT downsample minimizers with windows of length read_length/INT, 0 for no downsampling [0]
-D, --distance-limit INT cluster using this distance limit [200]
-e, --max-extensions INT extend up to INT clusters [800]
-a, --max-alignments INT align up to INT extensions [8]
-s, --cluster-score FLOAT only extend clusters if they are within INT of the best score [50]
-S, --pad-cluster-score FLOAT also extend clusters within INT of above threshold to get a second-best cluster [20]
-u, --cluster-coverage FLOAT only extend clusters if they are within FLOAT of the best read coverage [0.3]
--max-extension-mismatches INT maximum number of mismatches to pass through in a gapless extension [4]
-v, --extension-score INT only align extensions if their score is within INT of the best score [1]
-w, --extension-set FLOAT only align extension sets if their score is within INT of the best score [20]
-O, --no-dp disable all gapped alignment
-r, --rescue-attempts INT attempt up to INT rescues per read in a pair [15]
-L, --max-fragment-length INT assume that fragment lengths should be smaller than INT when estimating the fragment length distribution [2000]
--exclude-overlapping-min exclude overlapping minimizers
--paired-distance-limit FLOAT cluster pairs of read using a distance limit FLOAT standard deviations greater than the mean [2]
--rescue-subgraph-size FLOAT search for rescued alignments FLOAT standard deviations greater than the mean [4]
--rescue-seed-limit INT attempt rescue with at most INT seeds [100]
--explored-cap use explored minimizer layout cap on mapping quality
--mapq-score-window INT window to rescale score to for mapping quality, or 0 if not used [0]
--mapq-score-scale FLOAT scale scores for mapping quality [1]
long-read/chaining parameters:
--align-from-chains chain up extensions to create alignments, instead of doing each separately
--zipcode-tree-score-threshold FLOAT only fragment trees if they are within INT of the best score [50]
--pad-zipcode-tree-score-threshold FLOAT also fragment trees within INT of above threshold to get a second-best cluster [20]
--zipcode-tree-coverage-threshold FLOAT only fragment trees if they are within FLOAT of the best read coverage [0.3]
--zipcode-tree-scale FLOAT at what fraction of the read length should zipcode trees be split up [2]
--min-to-fragment INT minimum number of fragmenting problems to run [4]
--max-to-fragment INT maximum number of fragmenting problems to run [10]
--max-direct-chain INT take up to this many fragments per zipcode tree and turn them into chains instead of chaining. If this is 0, do chaining. [0]
--gapless-extension-limit INT do gapless extension to seeds in a tree before fragmenting if the read length is less than this [0]
--fragment-max-lookback-bases INT maximum distance to look back when making fragments [300]
--fragment-max-lookback-bases-per-base FLOAT maximum distance to look back when making fragments, per base [0.03]
--max-fragments INT how many fragments should we try to make when fragmenting something [18446744073709551615]
--fragment-max-indel-bases INT maximum indel length in a transition when making fragments [2000]
--fragment-max-indel-bases-per-base FLOAT maximum indel length in a transition when making fragments, per read base [0.2]
--fragment-gap-scale FLOAT scale for gap scores when fragmenting [1]
--fragment-points-per-possible-match FLOAT points to award non-indel connecting bases when fragmenting [0]
--fragment-score-fraction FLOAT minimum fraction of best fragment score to retain a fragment [0.1]
--fragment-max-min-score FLOAT maximum for fragment score threshold based on the score of the best fragment [1.79769e+308]
--fragment-min-score FLOAT minimum score to retain a fragment [60]
--fragment-set-score-threshold FLOAT only chain fragments in a tree if their overall score is within this many points of the best tree [0]
--min-chaining-problems INT ignore score threshold to get this many chaining problems [1]
--max-chaining-problems INT do no more than this many chaining problems [2147483647]
--max-lookback-bases INT maximum distance to look back when chaining [3000]
--max-lookback-bases-per-base FLOAT maximum distance to look back when chaining, per read base [0.3]
--max-indel-bases INT maximum indel length in a transition when chaining [2000]
--max-indel-bases-per-base FLOAT maximum indel length in a transition when chaining, per read base [0.2]
--item-bonus INT bonus for taking each item when fragmenting or chaining [0]
--item-scale FLOAT scale for items' scores when fragmenting or chaining [1]
--gap-scale FLOAT scale for gap scores when chaining [1]
--points-per-possible-match FLOAT points to award non-indel connecting bases when chaining [0]
--chain-score-threshold FLOAT only align chains if their score is within this many points of the best score [100]
--min-chains INT ignore score threshold to get this many chains aligned [4]
--min-chain-score-per-base FLOAT do not align chains with less than this score per read base [0.01]
--max-min-chain-score INT accept chains with this score or more regardless of read length [200]
--max-skipped-bases INT when skipping seeds in a chain for alignment, allow a gap of at most INT in the graph [0]
--max-chains-per-tree INT align up to this many chains from each tree [1]
--max-chain-connection INT maximum distance across which to connect seeds with WFAExtender when aligning a chain [100]
--max-tail-length INT maximum length of a tail to align with WFAExtender when aligning a chain [100]
--max-dp-cells INT maximum number of alignment cells to allow in a tail or BGA connection [18446744073709551615]
--max-tail-gap INT maximum number of gap bases to allow in a Dozeu tail [18446744073709551615]
--max-middle-gap INT maximum number of gap bases to allow in a middle connection [18446744073709551615]
--max-tail-dp-length INT maximum number of bases in a tail to do DP for, to avoid score overflow [30000]
--max-middle-dp-length INT maximum number of bases in a middle connection to do DP for, before making it a tail [2147483647]
--wfa-max-mismatches INT maximum mismatches (or equivalent-scoring gaps) to allow in the shortest WFA connection or tail [2]
--wfa-max-mismatches-per-base FLOAT maximum additional mismatches (or equivalent-scoring gaps) to allow per involved read base in WFA connections or tails [0.1]
--wfa-max-max-mismatches INT maximum mismatches (or equivalent-scoring gaps) to allow in the longest WFA connection or tail [20]
--wfa-distance INT band distance to allow in the shortest WFA connection or tail [10]
--wfa-distance-per-base FLOAT band distance to allow per involved read base in WFA connections or tails [0.1]
--wfa-max-distance INT band distance to allow in the longest WFA connection or tail [200]
--sort-by-chain-score order alignment candidates by chain score instead of base-level score
--min-unique-node-fraction FLOAT minimum fraction of an alignment that must be from distinct oriented nodes for the alignment to be distinct [0]
vg: unrecognized option `--help'
Usage:
vg haplotypes [options] -k kmers.kff -g output.gbz graph.gbz
vg haplotypes [options] -H output.hapl graph.gbz
vg haplotypes [options] -i graph.hapl -k kmers.kff -g output.gbz graph.gbz
vg haplotypes [options] -i graph.hapl --vcf-input variants.vcf graph.gbz > output.tsv
vg haplotypes [options] -i graph.hapl -k kmers.kff --extract M:N graph.gbz > output.fa
vg haplotypes [options] -i graph.hapl -k kmers.kff --classify output.kmers graph.gbz
vg haplotypes [options] -i graph.hapl --statistics ref_sample graph.gbz > output.tsv
Haplotype sampling based on kmer counts.
Output files:
-g, --gbz-output X write the output GBZ to X
-H, --haplotype-output X write haplotype information to X
Input files:
-d, --distance-index X use this distance index (default: <basename>.dist)
-r, --r-index X use this r-index (default: <basename>.ri)
-i, --haplotype-input X use this haplotype information (default: generate)
-k, --kmer-input X use kmer counts from this KFF file (required for --gbz-output)
Options for generating haplotype information:
--kmer-length N kmer length for building the minimizer index (default: 29)
--window-length N window length for building the minimizer index (default: 11)
--subchain-length N target length (in bp) for subchains (default: 10000)
--linear-structure extend subchains to avoid haplotypes visiting them multiple times
Options for sampling haplotypes:
--preset X use preset X (default, haploid, diploid)
--coverage N kmer coverage in the KFF file (default: estimate)
--num-haplotypes N generate N haplotypes (default: 4)
sample from N candidates (with --diploid-sampling; default: 32)
--present-discount F discount scores for present kmers by factor F (default: 0.9)
--het-adjustment F adjust scores for heterozygous kmers by F (default: 0.05)
--absent-score F score absent kmers -F/+F (default: 0.8)
--haploid-scoring use a scoring model without heterozygous kmers
--diploid-sampling choose the best pair from the sampled haplotypes
--extra-fragments in diploid sampling, select all candidates in bad subchains
--badness F threshold for the badness of a subchain (default: 4)
--include-reference include named and reference paths in the output
Other options:
-v, --verbosity N verbosity level (0 = silent, 1 = basic, 2 = detailed, 3 = debug; default: 0)
-t, --threads N approximate number of threads (default: 8 on this system)
Developer options:
--validate validate the generated information (may be slow)
--vcf-input X map the variants in VCF file X to subchains
--contig-prefix X a prefix for transforming VCF contig names into GBWT contig names
--extract M:N extract haplotypes in chain M, subchain N in FASTA format
--score-output X write haplotype scores to X (use with --extract)
--classify X classify kmers and write output to X
--statistics X output subchain statistics over reference sample X
Manipulate node ids.
usage: vg ids [options] <graph1.vg> [graph2.vg ...] >new.vg
options:
-c, --compact minimize the space of integers used by the ids
-i, --increment N increase ids by N
-d, --decrement N decrease ids by N
-j, --join make a joint id space for all the graphs that are supplied
by iterating through the supplied graphs and incrementing
their ids to be non-conflicting (modifies original files)
-m, --mapping FILE create an empty node mapping for vg prune
-s, --sort assign new node IDs in (generalized) topological sort order
vg: unrecognized option `--help'
usage: vg index [options] <graph1.vg> [graph2.vg ...]
Creates an index on the specified graph or graphs. All graphs indexed must
already be in a joint ID space.
general options:
-b, --temp-dir DIR use DIR for temporary files
-t, --threads N number of threads to use
-p, --progress show progress
xg options:
-x, --xg-name FILE use this file to store a succinct, queryable version of the graph(s), or read for GCSA or distance indexing
-L, --xg-alts include alt paths in xg
gcsa options:
-g, --gcsa-out FILE output a GCSA2 index to the given file
-f, --mapping FILE use this node mapping in GCSA2 construction
-k, --kmer-size N index kmers of size N in the graph (default 16)
-X, --doubling-steps N use this number of doubling steps for GCSA2 construction (default 4)
-Z, --size-limit N limit temporary disk space usage to N gigabytes (default 2048)
-V, --verify-index validate the GCSA2 index using the input kmers (important for testing)
gam indexing options:
-l, --index-sorted-gam input is sorted .gam format alignments, store a GAI index of the sorted GAM in INPUT.gam.gai
vg in-place indexing options:
--index-sorted-vg input is ID-sorted .vg format graph chunks, store a VGI index of the sorted vg in INPUT.vg.vgi
snarl distance index options
-j --dist-name FILE use this file to store a snarl-based distance index
--snarl-limit N don't store snarl distances for snarls with more than N nodes (default 10000)
if N is 0 then don't store distances, only the snarl tree
--no-nested-distance only store distances along the top-level chain
usage: vg inject -x graph.xg [options] input.[bam|sam|cram] >output.gam
options:
-x, --xg-name FILE use this graph or xg index (required, non-XG formats also accepted)
-i, --add-identity add the 'identity' statistic (% of matching base pairs) to output GAM
-r, --rescore re-score alignments
-o, --output-format NAME output the alignments in NAME format (gam / gaf / json) [gam]
-t, --threads N number of threads to use
vg: unrecognized option `--help'
usage: vg map [options] -d idxbase -f in1.fq [-f in2.fq] >aln.gam
Align reads to a graph.
graph/index:
-d, --base-name BASE use BASE.xg and BASE.gcsa as the input index pair
-x, --xg-name FILE use this xg index or graph (defaults to <graph>.vg.xg)
-g, --gcsa-name FILE use this GCSA2 index (defaults to <graph>.gcsa)
-1, --gbwt-name FILE use this GBWT haplotype index (defaults to <graph>.gbwt)
algorithm:
-t, --threads N number of compute threads to use
-k, --min-mem INT minimum MEM length (if 0 estimate via -e) [0]
-e, --mem-chance FLOAT set {-k} such that this fraction of {-k} length hits will by chance [5e-4]
-c, --hit-max N ignore MEMs who have >N hits in our index (0 for no limit) [2048]
-Y, --max-mem INT ignore mems longer than this length (unset if 0) [0]
-r, --reseed-x FLOAT look for internal seeds inside a seed longer than FLOAT*--min-seed [1.5]
-u, --try-up-to INT attempt to align up to the INT best candidate chains of seeds (1/2 for paired) [128]
-l, --try-at-least INT attempt to align at least the INT best candidate chains of seeds [1]
-E, --approx-mq-cap INT weight MQ by suffix tree based estimate when estimate less than FLOAT [0]
--id-mq-weight N scale mapping quality by the alignment score identity to this power [2]
-W, --min-chain INT discard a chain if seeded bases shorter than INT [0]
-C, --drop-chain FLOAT drop chains shorter than FLOAT fraction of the longest overlapping chain [0.45]
-n, --mq-overlap FLOAT scale MQ by count of alignments with this overlap in the query with the primary [0]
-P, --min-ident FLOAT accept alignment only if the alignment identity is >= FLOAT [0]
-H, --max-target-x N skip cluster subgraphs with length > N*read_length [100]
-w, --band-width INT band width for long read alignment [256]
-O, --band-overlap INT band overlap for long read alignment [{-w}/8]
-J, --band-jump INT the maximum number of bands of insertion we consider in the alignment chain model [128]
-B, --band-multi INT consider this many alignments of each band in banded alignment [16]
-Z, --band-min-mq INT treat bands with less than this MQ as unaligned [0]
-I, --fragment STR fragment length distribution specification STR=m:μ:σ:o:d [5000:0:0:0:1]
max, mean, stdev, orientation (1=same, 0=flip), direction (1=forward, 0=backward)
-U, --fixed-frag-model don't learn the pair fragment model online, use {-I} without update
-p, --print-frag-model suppress alignment output and print the fragment model on stdout as per {-I} format
--frag-calc INT update the fragment model every INT perfect pairs [10]
--fragment-x FLOAT calculate max fragment size as frag_mean+frag_sd*FLOAT [10]
--mate-rescues INT attempt up to INT mate rescues per pair [64]
-S, --unpaired-cost INT penalty for an unpaired read pair [17]
--no-patch-aln do not patch banded alignments by locally aligning unaligned regions
--xdrop-alignment use X-drop heuristic (much faster for long-read alignment)
--max-gap-length maximum gap length allowed in each contiguous alignment (for X-drop alignment) [40]
scoring:
-q, --match INT use this match score [1]
-z, --mismatch INT use this mismatch penalty [4]
--score-matrix FILE read a 4x4 integer substitution scoring matrix from a file
-o, --gap-open INT use this gap open penalty [6]
-y, --gap-extend INT use this gap extension penalty [1]
-L, --full-l-bonus INT the full-length alignment bonus [5]
--drop-full-l-bonus remove the full length bonus from the score before sorting and MQ calculation
-a, --hap-exp FLOAT the exponent for haplotype consistency likelihood in alignment score [1]
--recombination-penalty FLOAT use this log recombination penalty for GBWT haplotype scoring [20.7]
-A, --qual-adjust perform base quality adjusted alignments (requires base quality input)
preset:
-m, --alignment-model STR use a preset alignment scoring model, either "short" (default) or "long" (for ONT/PacBio)
"long" is equivalent to `-u 2 -L 63 -q 1 -z 2 -o 2 -y 1 -w 128 -O 32`
input:
-s, --sequence STR align a string to the graph in graph.vg using partial order alignment
-V, --seq-name STR name the sequence using this value (for graph modification with new named paths)
-T, --reads FILE take reads (one per line) from FILE, write alignments to stdout
-b, --hts-input FILE align reads from htslib-compatible FILE (BAM/CRAM/SAM) stdin (-), alignments to stdout
-G, --gam-input FILE realign GAM input
-f, --fastq FILE input fastq or (2-line format) fasta, possibly compressed, two are allowed, one for each mate
-F, --fasta FILE align the sequences in a FASTA file that may have multiple lines per reference sequence
--comments-as-tags intepret comments in name lines as SAM-style tags and annotate alignments with them
-i, --interleaved fastq or GAM is interleaved paired-ended
-N, --sample NAME for --reads input, add this sample
-R, --read-group NAME for --reads input, add this read group
output:
-j, --output-json output JSON rather than an alignment stream (helpful for debugging)
-%, --gaf output alignments in GAF format
--surject-to TYPE surject the output into the graph's paths, writing TYPE := bam |sam | cram
--ref-paths FILE ordered list of paths in the graph, one per line or HTSlib .dict, for HTSLib @SQ headers
--buffer-size INT buffer this many alignments together before outputting in GAM [512]
-X, --compare realign GAM input (-G), writing alignment with "correct" field set to overlap with input
-v, --refpos-table for efficient testing output a table of name, chr, pos, mq, score
-K, --keep-secondary produce alignments for secondary input alignments in addition to primary ones
-M, --max-multimaps INT produce up to INT alignments for each read [1]
-Q, --mq-max INT cap the mapping quality at INT [60]
--exclude-unaligned exclude reads with no alignment
-D, --debug print debugging information about alignment to stderr
--log-time print runtime to stderr
usage: vg minimizer [options] -d graph.dist -o graph.min graph
Builds a (w, k)-minimizer index or a (k, s)-syncmer index of the threads in the GBWT
index. The graph can be any HandleGraph, which will be transformed into a GBWTGraph.
The transformation can be avoided by providing a GBWTGraph or a GBZ graph.
Required options:
-d, --distance-index X annotate the hits with positions in this distance index
-o, --output-name X store the index to file X
Minimizer options:
-k, --kmer-length N length of the kmers in the index (default 29, max 31)
-w, --window-length N choose the minimizer from a window of N kmers (default 11)
-c, --closed-syncmers index closed syncmers instead of minimizers
-s, --smer-length N use smers of length N in closed syncmers (default 18)
Weighted minimizers:
-W, --weighted use weighted minimizers
--threshold N downweight kmers with more than N hits (default 500)
--iterations N downweight frequent kmers by N iterations (default 3)
--fast-counting use the fast kmer counting algorithm (default)
--save-memory use the space-efficient kmer counting algorithm
--hash-table N use 2^N-cell hash tables for kmer counting (default: guess)
Other options:
-z, --zipcode-name X store the distances that are too big to file X
if -z is not specified, some distances may be discarded
-l, --load-index X load the index from file X and insert the new kmers into it
(overrides minimizer / weighted minimizer options)
-g, --gbwt-name X use the GBWT index in file X (required with a non-GBZ graph)
-p, --progress show progress information
-t, --threads N use N threads for index construction (default 8)
(using more than 16 threads rarely helps)
--no-dist build the index without distance index annotations (not recommended)
usage: vg mod [options] <graph.vg> >[mod.vg]
Modifies graph, outputs modified on stdout.
options:
-P, --label-paths don't edit with -i alignments, just use them for labeling the graph
-c, --compact-ids should we sort and compact the id space? (default false)
-b, --break-cycles use an approximate topological sort to break cycles in the graph
-n, --normalize normalize the graph so that edges are always non-redundant
(nodes have unique starting and ending bases relative to neighbors,
and edges that do not introduce new paths are removed and neighboring
nodes are merged)
-U, --until-normal N iterate normalization until convergence, or at most N times
-z, --nomerge-pre STR do not let normalize (-n, -U) zip up any pair of nodes that both belong to path with prefix STR
-E, --unreverse-edges flip doubly-reversing edges so that they are represented on the
forward strand of the graph
-s, --simplify remove redundancy from the graph that will not change its path space
-d, --dagify-step N copy strongly connected components of the graph N times, forwarding
edges from old to new copies to convert the graph into a DAG
-w, --dagify-to N copy strongly connected components of the graph forwarding
edges from old to new copies to convert the graph into a DAG
until the shortest path through each SCC is N bases long
-L, --dagify-len-max N stop a dagification step if the unrolling component has this much sequence
-f, --unfold N represent inversions accessible up to N from the forward
component of the graph
-O, --orient-forward orient the nodes in the graph forward
-N, --remove-non-path keep only nodes and edges which are part of paths
-A, --remove-path keep only nodes and edges which are not part of any path
-k, --keep-path NAME keep only nodes and edges in the path
-R, --remove-null removes nodes that have no sequence, forwarding their edges
-g, --subgraph ID gets the subgraph rooted at node ID, multiple allowed
-x, --context N steps the subgraph out by N steps (default: 1)
-p, --prune-complex remove nodes that are reached by paths of --length which
cross more than --edge-max edges
-S, --prune-subgraphs remove subgraphs which are shorter than --length
-l, --length N for pruning complex regions and short subgraphs
-X, --chop N chop nodes in the graph so they are not more than N bp long
-u, --unchop where two nodes are only connected to each other and by one edge
replace the pair with a single node that is the concatenation of their labels
-e, --edge-max N only consider paths which make edge choices at <= this many points
-M, --max-degree N unlink nodes that have edge degree greater than N
-m, --markers join all head and tails nodes to marker nodes
('###' starts and '$$$' ends) of --length, for debugging
-y, --destroy-node ID remove node with given id
-a, --cactus convert to cactus graph representation
-v, --sample-vcf FILE for a graph with allele paths, compute the sample graph from the given VCF
-G, --sample-graph FILE subset an augmented graph to a sample graph using a Locus file
-t, --threads N for tasks that can be done in parallel, use this many threads
usage: vg mpmap [options] -x graph.xg -g index.gcsa [-f reads1.fq [-f reads2.fq] | -G reads.gam] > aln.gamp
Multipath align reads to a graph.
basic options:
graph/index:
-x, --graph-name FILE graph (required; XG format recommended but other formats are valid, see `vg convert`)
-g, --gcsa-name FILE use this GCSA2/LCP index pair for MEMs (required; both FILE and FILE.lcp, see `vg index`)
-d, --dist-name FILE use this snarl distance index for clustering (recommended, see `vg index`)
-s, --snarls FILE align to alternate paths in these snarls (unnecessary if providing -d, see `vg snarls`)
input:
-f, --fastq FILE input FASTQ (possibly gzipped), can be given twice for paired ends (for stdin use -)
-i, --interleaved input contains interleaved paired ends
-C, --comments-as-tags intepret comments in name lines as SAM-style tags and annotate alignments with them
algorithm presets:
-n, --nt-type TYPE sequence type preset: 'DNA' for genomic data, 'RNA' for transcriptomic data [RNA]
-l, --read-length TYPE read length preset: 'very-short', 'short', or 'long' (approx. <50bp, 50-500bp, and >500bp) [short]
-e, --error-rate TYPE error rate preset: 'low' or 'high' (approx. PHRED >20 and <20) [low]
output:
-F, --output-fmt TYPE format to output alignments in: 'GAMP for' multipath alignments, 'GAM' or 'GAF' for single-path
alignments, 'SAM', 'BAM', or 'CRAM' for linear reference alignments (may also require -S) [GAMP]
-S, --ref-paths FILE paths in the graph either 1) one per line in a text file, or 2) in an HTSlib .dict, to treat as
reference sequences for HTSlib formats (see -F) [all paths]
-N, --sample NAME add this sample name to output
-R, --read-group NAME add this read group to output
-p, --suppress-progress do not report progress to stderr
computational parameters:
-t, --threads INT number of compute threads to use [all available]
advanced options:
algorithm:
-X, --not-spliced do not form spliced alignments, even if aligning with --nt-type 'rna'
-M, --max-multimaps INT report (up to) this many mappings per read [10 rna / 1 dna]
-a, --agglomerate-alns combine separate multipath alignments into one (possibly disconnected) alignment
-r, --intron-distr FILE intron length distribution (from scripts/intron_length_distribution.py)
-Q, --mq-max INT cap mapping quality estimates at this much [60]
-b, --frag-sample INT look for this many unambiguous mappings to estimate the fragment length distribution [1000]
-I, --frag-mean FLOAT mean for a pre-determined fragment length distribution (also requires -D)
-D, --frag-stddev FLOAT standard deviation for a pre-determined fragment length distribution (also requires -I)
-G, --gam-input FILE input GAM (for stdin, use -)
-u, --map-attempts INT perform (up to) this many mappings per read (0 for no limit) [24 paired / 64 unpaired]
-c, --hit-max INT use at most this many hits for any match seeds (0 for no limit) [1024 DNA / 100 RNA]
scoring:
-A, --no-qual-adjust do not perform base quality adjusted alignments even when base qualities are available
-q, --match INT use this match score [1]
-z, --mismatch INT use this mismatch penalty [4 low error, 1 high error]
-o, --gap-open INT use this gap open penalty [6 low error, 1 high error]
-y, --gap-extend INT use this gap extension penalty [1]
-L, --full-l-bonus INT add this score to alignments that align each end of the read [mismatch+1 short, 0 long]
-w, --score-matrix FILE read a 4x4 integer substitution scoring matrix from a file (in the order ACGT)
-m, --remove-bonuses remove full length alignment bonuses in reported scores
Convert alignments to a compact coverage index.
usage: vg pack [options]
options:
-x, --xg FILE use this basis graph (any format accepted, does not have to be xg)
-o, --packs-out FILE write compressed coverage packs to this output file
-i, --packs-in FILE begin by summing coverage packs from each provided FILE
-g, --gam FILE read alignments from this GAM file (could be '-' for stdin)
-a, --gaf FILE read alignments from this GAF file (could be '-' for stdin)
-d, --as-table write table on stdout representing packs
-D, --as-edge-table write table on stdout representing edge coverage
-u, --as-qual-table write table on stdout representing average node mapqs
-e, --with-edits record and write edits rather than only recording graph-matching coverage
-b, --bin-size N number of sequence bases per CSA bin [default: inf]
-n, --node ID write table for only specified node(s)
-N, --node-list FILE a white space or line delimited list of nodes to collect
-Q, --min-mapq N ignore reads with MAPQ < N and positions with base quality < N [default: 0]
-c, --expected-cov N expected coverage. used only for memory tuning [default : 128]
-s, --trim-ends N ignore the first and last N bases of each read
-t, --threads N use N threads (defaults to numCPUs)
Traverse paths in the graph.
vg: unrecognized option `--help'
usage: vg paths [options]
options:
input:
-x, --xg FILE use the paths and haplotypes in this graph FILE. Supports GBZ haplotypes.
(Also accepts -v, --vg)
-g, --gbwt FILE use the threads in the GBWT index in FILE
(graph also required for most output options; -g takes priority over -x)
output graph (.vg format)
-V, --extract-vg output a path-only graph covering the selected paths
-d, --drop-paths output a graph with the selected paths removed
-r, --retain-paths output a graph with only the selected paths retained
-n, --normalize-paths output a graph where all equivalent paths in a site a merged (using selected paths to snap to if possible)
output path data:
-X, --extract-gam print (as GAM alignments) the stored paths in the graph
-A, --extract-gaf print (as GAF alignments) the stored paths in the graph
-L, --list print (as a list of names, one per line) the path (or thread) names
-E, --lengths print a list of path names (as with -L) but paired with their lengths
-M, --metadata print a table of path names and their metadata
-C, --cyclicity print a list of path names (as with -L) but paired with flag denoting the cyclicity
-F, --extract-fasta print the paths in FASTA format
-c, --coverage print the coverage stats for selected paths (not including cylces)
path selection:
-p, --paths-file FILE select the paths named in a file (one per line)
-Q, --paths-by STR select the paths with the given name prefix
-S, --sample STR select the haplotypes or reference paths for this sample
-a, --variant-paths select the variant paths added by 'vg construct -a'
-G, --generic-paths select the generic, non-reference, non-haplotype paths
-R, --reference-paths select the reference paths
-H, --haplotype-paths select the haplotype paths paths
configuration:
-o, --overlay apply a ReferencePathOverlayHelper to the graph
-t, --threads N number of threads to use [all available]. applies only to snarl finding within -n
usage: vg prune [options] <graph.vg> >[output.vg]
Prunes the complex regions of the graph for GCSA2 indexing. Pruning the graph
removes embedded paths.
Pruning parameters:
-k, --kmer-length N kmer length used for pruning
defaults: 24 with -P; 24 with -r; 24 with -u
-e, --edge-max N remove the edges on kmers making > N edge choices
defaults: 3 with -P; 3 with -r; 3 with -u
-s, --subgraph-min N remove subgraphs of < N bases
defaults: 33 with -P; 33 with -r; 33 with -u
-M, --max-degree N if N > 0, remove nodes with degree > N before pruning
defaults: 0 with -P; 0 with -r; 0 with -u
Pruning modes (-P, -r, and -u are mutually exclusive):
-P, --prune simply prune the graph (default)
-r, --restore-paths restore the edges on non-alt paths
-u, --unfold-paths unfold non-alt paths and GBWT threads
-v, --verify-paths verify that the paths exist after pruning
(potentially very slow)
Unfolding options:
-g, --gbwt-name FILE unfold the threads from this GBWT index
-m, --mapping FILE store the node mapping for duplicates in this file (required with -u)
-a, --append-mapping append to the existing node mapping
Other options:
-p, --progress show progress
-t, --threads N use N threads (default: 8)
-d, --dry-run determine the validity of the combination of options
usage: vg rna [options] graph.[vg|pg|hg|gbz] > splicing_graph.[vg|pg|hg]
General options:
-t, --threads INT number of compute threads to use [1]
-p, --progress show progress
-h, --help print help message
Input options:
-n, --transcripts FILE transcript file(s) in gtf/gff format; may repeat
-m, --introns FILE intron file(s) in bed format; may repeat
-y, --feature-type NAME parse only this feature type in the gtf/gff (parses all if empty) [exon]
-s, --transcript-tag NAME use this attribute tag in the gtf/gff file(s) as id [transcript_id]
-l, --haplotypes FILE project transcripts onto haplotypes in GBWT index file
-z, --gbz-format input graph is in GBZ format (contains both a graph and haplotypes (GBWT index))
Construction options:
-j, --use-hap-ref use haplotype paths in GBWT index as reference sequences (disables projection)
-e, --proj-embed-paths project transcripts onto embedded haplotype paths
-c, --path-collapse TYPE collapse identical transcript paths across no|haplotype|all paths [haplotype]
-k, --max-node-length INT chop nodes longer than maximum node length (0 disables chopping) [0]
-d, --remove-non-gene remove intergenic and intronic regions (deletes all paths in the graph)
-o, --do-not-sort do not topological sort and compact the graph
-r, --add-ref-paths add reference transcripts as embedded paths in the graph
-a, --add-hap-paths add projected transcripts as embedded paths in the graph
Output options:
-b, --write-gbwt FILE write pantranscriptome transcript paths as GBWT index file
-v, --write-hap-gbwt FILE write input haplotypes as a GBWT with node IDs matching the output graph
-f, --write-fasta FILE write pantranscriptome transcript sequences as fasta file
-i, --write-info FILE write pantranscriptome transcript info table as tsv file
-q, --out-exclude-ref exclude reference transcripts from pantranscriptome output
-g, --gbwt-bidirectional use bidirectional paths in GBWT index construction
usage: vg sim [options]
Samples sequences from the xg-indexed graph.
basic options:
-x, --xg-name FILE use the graph in FILE (required)
-n, --num-reads N simulate N reads or read pairs
-l, --read-length N simulate reads of length N
-r, --progress show progress information
output options:
-a, --align-out write alignments in GAM-format
-J, --json-out write alignments in json
--multi-position annotate alignments with multiple reference positions
simulation parameters:
-F, --fastq FILE match the error profile of NGS reads in FILE, repeat for paired reads (ignores -l,-f)
-I, --interleaved reads in FASTQ (-F) are interleaved read pairs
-s, --random-seed N use this specific seed for the PRNG
-e, --sub-rate FLOAT base substitution rate (default 0.0)
-i, --indel-rate FLOAT indel rate (default 0.0)
-d, --indel-err-prop FLOAT proportion of trained errors from -F that are indels (default 0.01)
-S, --scale-err FLOAT scale trained error probabilities from -F by this much (default 1.0)
-f, --forward-only don't simulate from the reverse strand
-p, --frag-len N make paired end reads with given fragment length N
-v, --frag-std-dev FLOAT use this standard deviation for fragment length estimation
-N, --allow-Ns allow reads to be sampled from the graph with Ns in them
--max-tries N attempt sampling operations up to N times before giving up [100]
-t, --threads number of compute threads (only when using FASTQ with -F) [1]
simulate from paths:
-P, --path PATH simulate from this path (may repeat; cannot also give -T)
-A, --any-path simulate from any path (overrides -P)
-m, --sample-name NAME simulate from this sample (may repeat; requires -g)
-R, --ploidy-regex RULES use the given comma-separated list of colon-delimited REGEX:PLOIDY rules to assign
ploidies to contigs not visited by the selected samples, or to all contigs simulated
from if no samples are used. Unmatched contigs get ploidy 2.
-g, --gbwt-name FILE use samples from this GBWT index
-T, --tx-expr-file FILE simulate from an expression profile formatted as RSEM output (cannot also give -P)
-H, --haplo-tx-file FILE transcript origin info table from vg rna -i (required for -T on haplotype transcripts)
-u, --unsheared sample from unsheared fragments
-E, --path-pos-file FILE output a TSV with sampled position on path of each read (requires -F)
usage: vg stats [options] [<graph file>]
options:
-z, --size size of graph
-N, --node-count number of nodes in graph
-E, --edge-count number of edges in graph
-l, --length length of sequences in graph
-L, --self-loops number of self-loops
-s, --subgraphs describe subgraphs of graph
-H, --heads list the head nodes of the graph
-T, --tails list the tail nodes of the graph
-e, --nondeterm list the nondeterministic edge sets
-c, --components print the strongly connected components of the graph
-A, --is-acyclic print if the graph is acyclic or not
-n, --node ID consider node with the given id
-d, --to-head show distance to head for each provided node
-t, --to-tail show distance to head for each provided node
-a, --alignments FILE compute stats for reads aligned to the graph
-r, --node-id-range X:Y where X and Y are the smallest and largest node id in the graph, respectively
-o, --overlap PATH for each overlapping path mapping in the graph write a table:
PATH, other_path, rank1, rank2
multiple allowed; limit comparison to those provided
-O, --overlap-all print overlap table for the cartesian product of paths
-R, --snarls print statistics for each snarl
--snarl-contents print out a table of <snarl, depth, parent, contained node ids>
--snarl-sample print out reference coordinates on given sample
-C, --chains print statistics for each chain
-F, --format graph format from {VG-Protobuf, PackedGraph, HashGraph, XG}. Can't detect Protobuf if graph read from stdin
-D, --degree-dist print degree distribution of the graph.
-b, --dist-snarls FILE print the sizes and depths of the snarls in a given distance index.
-p, --threads N number of threads to use [all available]
-v, --verbose output longer reports
-P, --progress show progress
usage: vg surject [options] <aln.gam> >[proj.cram]
Transforms alignments to be relative to particular paths.
options:
-x, --xg-name FILE use this graph or xg index (required)
-t, --threads N number of threads to use
-p, --into-path NAME surject into this path or its subpaths (many allowed, default: reference, then non-alt generic)
-F, --into-paths FILE surject into path names listed in HTSlib sequence dictionary or path list FILE
-i, --interleaved GAM is interleaved paired-ended, so when outputting HTS formats, pair reads
-M, --multimap include secondary alignments to all overlapping paths instead of just primary
-G, --gaf-input input file is GAF instead of GAM
-m, --gamp-input input file is GAMP instead of GAM
-c, --cram-output write CRAM to stdout
-b, --bam-output write BAM to stdout
-s, --sam-output write SAM to stdout
-l, --subpath-local let the multipath mapping surjection produce local (rather than global) alignments
-T, --max-tail-len N only align up to N bases of read tails (default: 10000)
-g, --max-graph-scale X make reads unmapped if alignment target subgraph size exceeds read length by a factor of X (default: 819.2 or 134218 with -S)
-P, --prune-low-cplx prune short and low complexity anchors during realignment
-I, --max-slide N look for offset duplicates of anchors up to N bp away when pruning (default: 6)
-a, --max-anchors N use no more than N anchors per target path (default: unlimited)
-S, --spliced interpret long deletions against paths as spliced alignments
-A, --qual-adj adjust scoring for base qualities, if they are available
-N, --sample NAME set this sample name for all reads
-R, --read-group NAME set this read group for all reads
-f, --max-frag-len N reads with fragment lengths greater than N will not be marked properly paired in SAM/BAM/CRAM
-L, --list-all-paths annotate SAM records with a list of all attempted re-alignments to paths in SS tag
-C, --compression N level for compression [0-9]
-V, --no-validate skip checking whether alignments plausibly are against the provided graph
-w, --watchdog-timeout N warn when reads take more than the given number of seconds to surject
-r, --progress show progress
format conversions for graphs and alignments
vg: unrecognized option `--help'
usage: vg view [options] [ <graph.vg> | <graph.json> | <aln.gam> | <read1.fq> [<read2.fq>] ]
options:
-g, --gfa output GFA format (default)
-F, --gfa-in input GFA format, reducing overlaps if they occur
-v, --vg output VG format [DEPRECATED, use vg convert instead]
-V, --vg-in input VG format only
-j, --json output JSON format
-J, --json-in input JSON format (use with e.g. -a as necessary)
-c, --json-stream streaming conversion of a VG format graph in line delimited JSON format
(this cannot be loaded directly via -J)
-G, --gam output GAM format (vg alignment format: Graph Alignment/Map)
-Z, --translation-in input is a graph translation description
-t, --turtle output RDF/turtle format (can not be loaded by VG)
-T, --turtle-in input turtle format.
-r, --rdf_base_uri set base uri for the RDF output
-a, --align-in input GAM format, or JSON version of GAM format
-A, --aln-graph GAM add alignments from GAM to the graph
-q, --locus-in input is Locus format, or JSON version of Locus format
-z, --locus-out output is Locus format
-Q, --loci FILE input is Locus format for use by dot output
-d, --dot output dot format
-S, --simple-dot simplify the dot output; remove node labels, simplify alignments
-u, --noseq-dot shows size information instead of sequence in the dot output
-e, --ascii-labels use labels for paths or superbubbles with char/colors rather than emoji
-Y, --ultra-label label nodes with emoji/colors that correspond to ultrabubbles
-m, --skip-missing skip mappings to nodes not in the graph when drawing alignments
-C, --color color nodes that are not in the reference path (DOT OUTPUT ONLY)
-p, --show-paths show paths in dot output
-w, --walk-paths add labeled edges to represent paths in dot output
-n, --annotate-paths add labels to normal edges to represent paths in dot output
-M, --show-mappings with -p print the mappings in each path in JSON
-I, --invert-ports invert the edge ports in dot so that ne->nw is reversed
-s, --random-seed N use this seed when assigning path symbols in dot output
-b, --bam input BAM or other htslib-parseable alignments
-f, --fastq-in input fastq (output defaults to GAM). Takes two
positional file arguments if paired
-X, --fastq-out output fastq (input defaults to GAM)
-i, --interleaved fastq is interleaved paired-ended
-L, --pileup output VG Pileup format
-l, --pileup-in input VG Pileup format, or JSON version of VG Pileup format
-B, --distance-in input distance index
-R, --snarl-in input VG Snarl format
-E, --snarl-traversal-in input VG SnarlTraversal format
-K, --multipath-in input VG MultipathAlignment format (GAMP), or JSON version of GAMP format
-k, --multipath output VG MultipathAlignment format (GAMP)
-D, --expect-duplicates don't warn if encountering the same node or edge multiple times
-x, --extract-tag TAG extract and concatenate messages with the given tag
--first only extract the first message with the requested tag
--verbose explain the file being read with --extract-tag
--threads N for parallel operations use this many threads [1]
Bugs can be reported at: https://github.com/vgteam/vg/issues
For technical support, please visit: https://www.biostars.org/tag/vg/