XTree (CrossTree) is a fast, k-mer-based sequence lookup and tallying tool designed for efficient analysis of metagenomic data. It offers a range of functionalities including genome coverage analysis, maximum-likelihood read redistribution, cross-tabulation of annotation hierarchies, and per-query annotation and statistics.
This documentation covers XTree version 2.00c.
- Total and unique genome coverage analysis
- Maximum-likelihood read redistribution
- Cross-tabulation of two annotation hierarchies (e.g., taxonomy and function)
- Per-query annotation and statistics
- Multiple granularity reporting: per-reference (genome-level), per-rank (redistributed), or per-rank (LCA [lowest common ancestor])
- High-speed processing (often 10-100x faster than alignment)
- Support for standard input, compressed files, and both FASTA and FASTQ formats
- Adamantium mode for robust k-mer analysis
xtree {BUILD,ALIGN} [options]
--seqs <arg>
: Input sequences file--log-out <arg>
: Log output file--threads <arg>
: Number of threads to use--db <arg>
: Database file (input for ALIGN, output for BUILD)
--map <arg>
: Taxonomy mapping file--comp <arg>
: Compression level (0-2)--k <arg>
: K-mer size
--confidence <arg>
: Confidence threshold for LCA--perq-out <arg>
: Per-query output file--ref-out <arg>
: Reference output file--tax-out <arg>
: Taxonomy output file--cov-out <arg>
: Coverage output file--orthog-out <arg>
: Orthogonal output file--redistribute
: Enable redistribution algorithm--fast-redistribute
: Enable single-pass redistribution--shallow-lca
: Use shallow LCA algorithm--copymem
: Copy database to local memory--doforage
: Include all references in coverage table--half-forage
: Include references with at least 2/3 of max k-mer matches--no-adamantium
: Disable adamantium mode
- FASTA format
- Linearized: one line per header, one line per sequence
- Use
lingenome
tool to create compliant files
- Tab-delimited text file
- Columns: sequence name, taxonomy string, optional second taxonomy/function string
- Taxonomy strings are semicolon-delimited
- FASTA or FASTQ format (gzipped or uncompressed)
- Maximum sequence length: 100MB
Build mode creates an XTree database from reference sequences and taxonomic information.
- Compression levels (0-2): Higher levels recommended for large databases
- K-mer size: Affects sensitivity and specificity (default: 29)
- Log output: Provides detailed statistics on reference sequences
Align mode processes query sequences against the XTree database.
- Confidence threshold for LCA assignment
- Redistribution algorithm for ambiguous mappings
- Adamantium mode for robust k-mer analysis
-
Per-query Output (
--perq-out
)- Detailed information for each query sequence
-
Reference Output (
--ref-out
)- Tabulated summary of matched reference sequences
-
Taxonomy Output (
--tax-out
)- Tabulated summary of matched taxonomies
- LCA applied when
--redistribute
is not used
-
Coverage Output (
--cov-out
)- Detailed coverage statistics for reference sequences
- Additional columns when adamantium mode is enabled
-
Orthogonal Output (
--orthog-out
)- Cross-tabulation of two hierarchies (e.g., taxonomy and function)
- Multithreading support for improved performance
- Compression level 2 recommended for large databases
--copymem
option for improved database access speed--shallow-lca
for faster LCA calculation in certain cases
- Maximum read length: 100MB
- Large databases may require significant memory
- Confidence threshold interpretation may change in future versions
XTree provides a powerful and flexible tool for analyzing metagenomic data, offering a balance between speed and accuracy. Its various output options and optimization features make it suitable for a wide range of metagenomic studies and large-scale sequence analysis projects. Users should refer to the detailed addendum for in-depth information on database structure, input formats, output details, and advanced features.
XTree uses a custom database format (.xtr
) with the following structure:
- Version byte [1] | compression level
- Size of prefix [1]
- Size of suffix [1]
- Size of kmer_t [1]
- Number of references [4]
- Number of k-mers [8]
- All prefix indices [1 << (2 * #2) x 8]
- All k-mer data [(#2 + #4) * #6]
- Number of adamantine k-mer targets [8]
- All adamantine k-mer targets [(#2 + #4) * #9]
- All reference lengths [#5 x 4]
- Size of string data [8]
- All strings [#9]
- Number of Ix1 in map [4] [0 means skip rest of file]
- String size for Ix1 [8]
- Ix1 strings dump [#12]
- Number of Ix2 in map [4] [can be 0/skipped if no h2 map]
- String size of Ix2 [8]
- Ix2 strings dump [#15]
- HPairs[0] dump [num ref by 4]
- HPairs[1] dump [num ref by 4]
- Format: FASTA
- Requirements:
- Linearized format: exactly one line per header and one subsequent line for the entire sequence
- No word wrap or blocks of sequence allowed
- Use the
lingenome
tool to create compliant reference files from a directory of FASTA files
Example:
>Reference1
ATCGATCGATCGATCGATCG...
>Reference2
GCTAGCTAGCTAGCTAGCTA...
- Format: Tab-delimited text file
- Columns:
- Exact name of the reference sequence (without ">" character or file extensions)
- Semicolon-delimited list of strings denoting taxonomic (or other hierarchical) ranks, from most general to most specific
- (Optional) Another taxonomic (or other hierarchical) ranking in the same format as the second column
Example:
Ref1 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli Function1;Function2;Function3
Ref2 Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Bacillus subtilis Function1;Function4;Function5
- Formats supported: FASTA or FASTQ
- Can be gzipped or uncompressed
- For FASTA format, sequences must be linearized (no line breaks within sequences)
- Maximum sequence length: 100MB
- 0: No compression (default for small marker-gene databases)
- 1: Light compression (keep k-mers if the characters before the prefix are "A" then not "A")
- 2: Recommended for large whole-genome databases (keep k-mers if the characters before the prefix are "A" then "C")
The k-mer size (-k
option) affects sensitivity and specificity:
- Longer k-mers: More specific but less sensitive
- Shorter k-mers: More sensitive but less specific
- Default: 29 (optimizes specificity; consider using 21 for sensitivity)
When using the --log-out
option in BUILD mode, XTree generates a tab-delimited log file with the following columns for each reference sequence:
- Reference: Name of the reference sequence
- UsableLen: Length of the sequence usable for k-mer generation (total length minus k-1 and ambiguous bases)
- TotalKmers: Total number of k-mers in the sequence
- UniqKmers: Number of k-mers unique to this sequence within the entire database
- UniqKmers_SC: Number of k-mers unique to this sequence, including those appearing multiple times within it
- AdamantineKmers (WeakKmers): Number of initially unique k-mers that become non-unique when considering single-nucleotide mutations
Note: The truly "adamantine" k-mers, those remaining unique even after mutation, would be the difference between UniqKmers and WeakKmers.
The --confidence
option sets the confidence level for LCA assignment:
- Scale: 0-1 (lowest to highest confidence)
- Interpretation (at compression level 2):
- 0.25: ~80% statistical confidence
- 0.5: 95%+ statistical confidence
- 0.125: ~50% statistical confidence
- Recommended: 0.25 for whole-genome databases
The --redistribute
option enables a maximum-likelihood algorithm to assign ambiguously mapping query sequences to the most likely reference or taxonomy.
- Uses a multi-pass algorithm to refine assignments
- Can be limited to a single pass with
--fast-redistribute
The LCA algorithm is applied in the following scenarios:
-
When generating the taxonomy output (
--tax-out
):- If
--redistribute
is not used, LCA is applied when a query matches multiple taxa at the chosen confidence level. - Broader-ranked taxa appear for queries that can't be confidently assigned to a specific taxon.
- If
-
During per-query processing:
- Used when a query doesn't have a clear, confident match to a single reference or taxon.
- Applied when there are multiple top-scoring references or when the best match's k-mer count ratio is below the confidence threshold.
-
LCA calculation can be optimized using the
--shallow-lca
option.
- Enabled by default, can be disabled with
--no-adamantium
- Identifies "adamantine" k-mers that remain unique even after single-nucleotide mutations
- Provides additional columns in the coverage output for more robust analysis
Tab-delimited file with the following columns:
- Query sequence name
- Reference sequence name
- K-mers matched at assigned taxonomic level [most matched; other matched]
- LCA taxonomy at the specified confidence level
- Total k-mers matched
Example:
Query1 Ref1 [100;50] Bacteria;Proteobacteria;Gammaproteobacteria 150
Query2 Ref2 [80;40] Bacteria;Firmicutes;Bacilli 120
Tab-delimited file with two columns:
- Reference sequence name
- Total number of query sequences that matched that reference
Example:
Ref1 500
Ref2 300
Tab-delimited file with two columns:
- Taxonomy
- Count
- If
--redistribute
is used, only full-rank taxa will appear - If
--redistribute
is not used, LCA algorithm is applied, and broader-ranked taxa may appear
Example:
Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 300
Bacteria;Firmicutes;Bacilli 200
Tab-delimited file with the following columns:
- Reference: Reference sequence name
- Ref_len: Length of the reference sequence
- Total_kmers: Total number of k-mers in the reference sequence
- Total_unique: Number of unique k-mers in the reference sequence
- Kmers_found: Total k-mers from query sequences that mapped to this reference
- Unique_kmers_found: Unique k-mers from query sequences that mapped to this reference
- Kmers_covered: Number of k-mers in the reference sequence that experienced at least one matching k-mer from a query sequence
- Unique_kmers_covered: Number of unique k-mers in the reference sequence that experienced at least one matching k-mer from a query sequence
- Proportion_covered: Decimal fraction of the reference sequence that experienced at least one matching k-mer from a matching read
- Unique_proportion_covered: Decimal fraction of unique k-mers in the reference sequence that experienced at least one matching k-mer from a matching read
- Reads_covered: Number of query sequences (not k-mers) that piled up onto the reference sequence
- Exp: Expected coverage based on a Poisson distribution
- Suspicion: A measure of potential contamination or strain-level variation
- Coverage_est: Estimated coverage of the reference sequence
- Num_reads_est: Estimated number of reads that mapped to the reference sequence
When "adamantium" mode is enabled (default), additional columns are included:
- Adamantine_tot: Total number of adamantine k-mers in the reference sequence
- Adamantium_found: Number of adamantine k-mers found in the query sequences
- Adamantium_covered: Number of adamantine k-mers covered by query sequences
- Adamantium_prop: Proportion of adamantine k-mers covered
- XCov_adamantium: Cross-coverage estimate using adamantine k-mers
- Reads_covered_adamantium: Estimated number of reads covered using adamantine k-mers
Tab-delimited file with three columns:
- First hierarchy (e.g., taxonomy)
- Second hierarchy (e.g., function)
- Count
Example:
Bacteria;Proteobacteria;Gammaproteobacteria Function1;Function2 150
Bacteria;Firmicutes;Bacilli Function1;Function4 100
- Use
--threads
to set the number of threads for parallel processing - For large databases, use compression level 2 (
--comp 2
) - The
--copymem
option can improve performance by copying the database to local memory - Use
--shallow-lca
for faster LCA calculation in certain cases
- Maximum read length is set to 100MB
- Large whole-genome databases may require significant memory
- The confidence threshold interpretation may change in future versions to match common expectations (e.g., 0.80 meaning 80% confidence)
- The program includes basic error checking for input file formats and command-line arguments
- It will print error messages and exit if it encounters issues like malformed input files or invalid options
XTree (CrossTree) is a powerful and flexible tool for metagenomic analysis, offering a balance between speed and accuracy. Its various output options and optimization features make it suitable for a wide range of metagenomic studies and large-scale sequence analysis projects.
Key features include:
- Fast k-mer-based sequence lookup and tallying
- Genome coverage analysis and maximum-likelihood read redistribution
- Cross-tabulation of annotation hierarchies (e.g., taxonomy and function)
- Per-query annotation and statistics
- Support for multiple reporting granularities
Users should pay close attention to input file requirements, particularly for reference sequences and taxonomy mapping in BUILD mode. The various output formats provide flexibility in analyzing results, from per-query details to broad taxonomic and functional summaries.
By understanding the details provided in this documentation, researchers can effectively leverage XTree's capabilities to process and analyze large-scale metagenomic datasets, gaining valuable insights into community composition and functional potential.