XTree (CrossTree) Documentation

Overview

XTree (CrossTree) is a fast, k-mer-based sequence lookup and tallying tool for analyzing metagenomic data. It supports genome coverage analysis, maximum-likelihood read redistribution, cross-tabulation of annotation hierarchies, and per-query annotation and statistics.

Version

This documentation covers XTree version 2.00c.

Features

Total and unique genome coverage analysis
Maximum-likelihood read redistribution
Cross-tabulation of two annotation hierarchies (e.g., taxonomy and function)
Per-query annotation and statistics
Multiple granularity reporting: per-reference (genome-level), per-rank (redistributed), or per-rank (LCA [lowest common ancestor])
High-speed processing (often 10-100x faster than alignment)
Support for standard input, compressed files, and both FASTA and FASTQ formats
Adamantium mode for robust k-mer analysis

Usage

xtree {BUILD,ALIGN} [options]

Options for both BUILD and ALIGN modes

--seqs <arg>: Input sequences file
--log-out <arg>: Log output file
--threads <arg>: Number of threads to use
--db <arg>: Database file (input for ALIGN, output for BUILD)

BUILD Mode Options

--map <arg>: Taxonomy mapping file
--comp <arg>: Compression level (0-2)
--k <arg>: K-mer size

ALIGN Mode Options

--confidence <arg>: Confidence threshold for LCA
--perq-out <arg>: Per-query output file
--ref-out <arg>: Reference output file
--tax-out <arg>: Taxonomy output file
--cov-out <arg>: Coverage output file
--orthog-out <arg>: Orthogonal output file
--redistribute: Enable redistribution algorithm
--fast-redistribute: Enable single-pass redistribution
--shallow-lca: Use shallow LCA algorithm
--copymem: Copy database to local memory
--doforage: Include all references in coverage table
--half-forage: Include references with at least 2/3 of max k-mer matches
--no-adamantium: Disable adamantium mode
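
As a rough illustration of how the options above combine, the following Python sketch assembles argument lists for one BUILD run and one ALIGN run. It assumes an xtree executable is available on the PATH. The helper names (build_args, align_args), the default values, and all file names are hypothetical; only flags listed above are used.

```python
import subprocess  # only needed if the command is actually launched


def build_args(seqs, tax_map, db, comp=2, k=29, threads=4, exe="xtree"):
    """Assemble a BUILD-mode argument list from the documented flags."""
    return [exe, "BUILD",
            "--seqs", seqs,        # linearized reference FASTA
            "--map", tax_map,      # taxonomy mapping file
            "--db", db,            # database file to write
            "--comp", str(comp),
            "--k", str(k),
            "--threads", str(threads)]


def align_args(seqs, db, tax_out, cov_out=None, confidence=0.25,
               redistribute=False, threads=4, exe="xtree"):
    """Assemble an ALIGN-mode argument list from the documented flags."""
    args = [exe, "ALIGN",
            "--seqs", seqs,        # query FASTA/FASTQ
            "--db", db,            # database file to read
            "--threads", str(threads),
            "--confidence", str(confidence),
            "--tax-out", tax_out]
    if cov_out is not None:
        args += ["--cov-out", cov_out]
    if redistribute:
        args.append("--redistribute")
    return args


# To launch a run (hypothetical file names):
# subprocess.run(build_args("refs.lin.fna", "refs.map", "refs.xtr"), check=True)
```

Building the argument list separately from launching it makes the command easy to log or inspect before a long run.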

Input File Requirements

Reference Sequences (BUILD mode)

FASTA format
Linearized: one line per header, one line per sequence
Use lingenome tool to create compliant files

Taxonomy Mapping (BUILD mode)

Tab-delimited text file
Columns: sequence name, taxonomy string, optional second taxonomy/function string
Taxonomy strings are semicolon-delimited

Query Sequences (ALIGN mode)

FASTA or FASTQ format (gzipped or uncompressed)
Maximum sequence length: 100MB

Build Mode

Build mode creates an XTree database from reference sequences and taxonomic information.

Key Options

Compression levels (0-2): Higher levels recommended for large databases
K-mer size: Affects sensitivity and specificity (default: 29)
Log output: Provides detailed statistics on reference sequences

Align Mode

Align mode processes query sequences against the XTree database.

Key Features

Confidence threshold for LCA assignment
Redistribution algorithm for ambiguous mappings
Adamantium mode for robust k-mer analysis

Output Options

Per-query Output (--perq-out)
- Detailed information for each query sequence
Reference Output (--ref-out)
- Tabulated summary of matched reference sequences
Taxonomy Output (--tax-out)
- Tabulated summary of matched taxonomies
- LCA applied when --redistribute is not used
Coverage Output (--cov-out)
- Detailed coverage statistics for reference sequences
- Additional columns when adamantium mode is enabled
Orthogonal Output (--orthog-out)
- Cross-tabulation of two hierarchies (e.g., taxonomy and function)

Performance Considerations

Multithreading support for improved performance
Compression level 2 recommended for large databases
--copymem option for improved database access speed
--shallow-lca for faster LCA calculation in certain cases

Limitations

Maximum read length: 100MB
Large databases may require significant memory
Confidence threshold interpretation may change in future versions

Conclusion

XTree provides a powerful and flexible tool for analyzing metagenomic data, offering a balance between speed and accuracy. Its various output options and optimization features make it suitable for a wide range of metagenomic studies and large-scale sequence analysis projects. Users should refer to the detailed addendum for in-depth information on database structure, input formats, output details, and advanced features.

Addendum: Comprehensive XTree Documentation

Database Structure

XTree uses a custom database format (.xtr) with the following structure. Bracketed values give field sizes in bytes, and #N refers to the value stored in field N:

1. Version byte [1] | compression level
2. Size of prefix [1]
3. Size of suffix [1]
4. Size of kmer_t [1]
5. Number of references [4]
6. Number of k-mers [8]
7. All prefix indices [1 << (2 * #2) x 8]
8. All k-mer data [(#2 + #4) * #6]
9. Number of adamantine k-mer targets [8]
10. All adamantine k-mer targets [(#2 + #4) * #9]
11. All reference lengths [#5 x 4]
12. Size of string data [8]
13. All strings [#12]
14. Number of Ix1 in map [4] [0 means skip rest of file]
15. String size for Ix1 [8]
16. Ix1 strings dump [#15]
17. Number of Ix2 in map [4] [can be 0/skipped if no h2 map]
18. String size of Ix2 [8]
19. Ix2 strings dump [#18]
20. HPairs[0] dump [num ref by 4]
21. HPairs[1] dump [num ref by 4]
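
To make the layout above more concrete, here is a minimal Python sketch that reads the first six fixed-width fields and derives the sizes of the two blocks that follow them. It assumes unsigned little-endian integers, which the layout does not state, and it leaves the first byte undecoded because the way version and compression level share it is not described here. The function name read_xtr_header is hypothetical, and the demonstration header is synthetic.

```python
import io
import struct
from typing import BinaryIO, Dict


def read_xtr_header(fh: BinaryIO) -> Dict[str, int]:
    """Read the six leading fields of an .xtr file (1+1+1+1+4+8 bytes).

    Assumption: unsigned, little-endian integers with no padding.
    """
    raw = fh.read(16)
    if len(raw) != 16:
        raise ValueError("file too short to contain an XTree header")
    version_byte, prefix, suffix, kmer_t, n_refs, n_kmers = struct.unpack(
        "<BBBBIQ", raw)
    return {
        "version_byte": version_byte,  # raw byte, left undecoded
        "prefix_size": prefix,
        "suffix_size": suffix,
        "kmer_t_size": kmer_t,
        "n_refs": n_refs,
        "n_kmers": n_kmers,
        # Sizes of the next two blocks, computed as written in the layout.
        "prefix_index_bytes": (1 << (2 * prefix)) * 8,
        "kmer_data_bytes": (prefix + kmer_t) * n_kmers,
    }


# Synthetic header for demonstration only (not taken from a real database).
demo = io.BytesIO(struct.pack("<BBBBIQ", 2, 13, 16, 8, 3, 1000))
header = read_xtr_header(demo)
```

With a real database, the file would be opened in binary mode (`open(path, "rb")`) and passed to the same function.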

Input File Formats

Reference Sequences (for BUILD mode)

Format: FASTA
Requirements:
- Linearized format: exactly one line per header and one subsequent line for the entire sequence
- No word wrap or blocks of sequence allowed
- Use the lingenome tool to create compliant reference files from a directory of FASTA files

Example:

>Reference1
ATCGATCGATCGATCGATCG...
>Reference2
GCTAGCTAGCTAGCTAGCTA...
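
The linearized layout can be illustrated with a short Python sketch that joins wrapped sequence lines so each record occupies exactly two lines. This is only an illustration of the format, not a substitute for lingenome, whose exact behavior is not described in this document. The function name linearize_fasta is hypothetical.

```python
from typing import Iterable, Iterator


def linearize_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield one header line, then one sequence line, for each record."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                # Emit the previous record before starting a new one.
                yield header
                yield "".join(chunks)
            header, chunks = line, []
        else:
            if header is None:
                raise ValueError("sequence data found before the first header")
            chunks.append(line)
    if header is not None:
        yield header
        yield "".join(chunks)
```

For example, passing an open, wrapped FASTA file handle to linearize_fasta and writing each yielded line followed by a newline produces a file in the two-lines-per-record layout shown above.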

Taxonomy Mapping File (for BUILD mode)

Format: Tab-delimited text file
Columns:
1. Exact name of the reference sequence (without ">" character or file extensions)
2. Semicolon-delimited list of strings denoting taxonomic (or other hierarchical) ranks, from most general to most specific
3. (Optional) Another taxonomic (or other hierarchical) ranking in the same format as the second column

Example:

Ref1    Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli    Function1;Function2;Function3
Ref2    Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Bacillus subtilis    Function1;Function4;Function5
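
A mapping file can be checked before a long BUILD run. The Python sketch below parses one line according to the column description above and rejects lines with the wrong number of columns or a leading ">" in the name. The function name parse_map_line is hypothetical, and the checks are a suggestion rather than a description of XTree's own validation.

```python
from typing import List, Optional, Tuple


def parse_map_line(line: str) -> Tuple[str, List[str], Optional[List[str]]]:
    """Split one mapping line into (name, hierarchy 1, optional hierarchy 2)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (2, 3):
        raise ValueError(f"expected 2 or 3 tab-separated columns, got {len(fields)}")
    name = fields[0]
    if not name or name.startswith(">"):
        raise ValueError("reference name must be non-empty and must not start with '>'")
    h1 = fields[1].split(";")  # ranks from most general to most specific
    h2 = fields[2].split(";") if len(fields) == 3 else None
    return name, h1, h2
```

Running this over every line and comparing the collected names with the headers of the reference FASTA catches most mismatches between the two files.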

Query Sequences (for ALIGN mode)

Formats supported: FASTA or FASTQ
Can be gzipped or uncompressed
For FASTA format, sequences must be linearized (no line breaks within sequences)
Maximum sequence length: 100MB

Build Mode Details

Compression Levels

0: No compression (default for small marker-gene databases)
1: Light compression (keep k-mers if the characters before the prefix are "A" then not "A")
2: Recommended for large whole-genome databases (keep k-mers if the characters before the prefix are "A" then "C")

K-mer Size

The k-mer size (--k option) affects sensitivity and specificity:

Longer k-mers: More specific but less sensitive
Shorter k-mers: More sensitive but less specific
Default: 29 (optimizes specificity; consider using 21 for sensitivity)
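
The sensitivity side of this trade-off follows from simple counting: a read of length L contains L - k + 1 k-mers, and a single mismatch invalidates every k-mer that overlaps it (up to k of them). The Python sketch below works through that arithmetic; the function names are hypothetical and the calculation is independent of XTree's internals.

```python
def kmers_in_read(read_len: int, k: int) -> int:
    """Number of k-mers in a read of the given length."""
    return max(read_len - k + 1, 0)


def kmers_surviving_mismatch(read_len: int, k: int, pos: int) -> int:
    """K-mers that do not overlap a single mismatch at 0-based position pos."""
    total = kmers_in_read(read_len, k)
    # A k-mer starting at s covers positions s .. s + k - 1, so it overlaps
    # pos when pos - k + 1 <= s <= pos (clipped to valid start positions).
    first = max(pos - k + 1, 0)
    last = min(pos, read_len - k)
    overlapping = max(last - first + 1, 0)
    return total - overlapping
```

For a 150-base read with one mismatch at position 75, k = 29 leaves 93 of 122 k-mers intact, while k = 21 leaves 109 of 130.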

Log Output Option (`--log-out`)

When using the --log-out option in BUILD mode, XTree generates a tab-delimited log file with the following columns for each reference sequence:

Reference: Name of the reference sequence
UsableLen: Length of the sequence usable for k-mer generation (total length minus k-1 and ambiguous bases)
TotalKmers: Total number of k-mers in the sequence
UniqKmers: Number of k-mers unique to this sequence within the entire database
UniqKmers_SC: Number of k-mers unique to this sequence, including those appearing multiple times within it
AdamantineKmers (WeakKmers): Number of initially unique k-mers that become non-unique when considering single-nucleotide mutations

Note: Despite its name, this column counts the weak k-mers. The truly "adamantine" k-mers, those that remain unique even after mutation, are the difference between UniqKmers and WeakKmers.

Align Mode Details

Confidence Threshold

The --confidence option sets the confidence level for LCA assignment:

Scale: 0-1 (lowest to highest confidence)
Interpretation (at compression level 2):
- 0.125: ~50% statistical confidence
- 0.25: ~80% statistical confidence
- 0.5: 95%+ statistical confidence
Recommended: 0.25 for whole-genome databases

Redistribution Algorithm

The --redistribute option enables a maximum-likelihood algorithm to assign ambiguously mapping query sequences to the most likely reference or taxonomy.

Uses a multi-pass algorithm to refine assignments
Can be limited to a single pass with --fast-redistribute
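
The general idea of multi-pass redistribution can be sketched in Python as follows. This is a generic, expectation-maximization-style illustration, not XTree's actual algorithm, whose likelihood model is not described in this document. Each ambiguous read is split among its candidate references in proportion to the current abundance estimates, and the estimates are then recomputed. The function name and data layout are hypothetical.

```python
from typing import Dict, List, Sequence


def redistribute(unique_counts: Dict[str, float],
                 ambiguous: List[Sequence[str]],
                 passes: int = 10) -> Dict[str, float]:
    """Iteratively share ambiguous reads among their candidate references.

    unique_counts: reads that map to exactly one reference.
    ambiguous: one sequence of candidate reference names per ambiguous read.
    """
    estimate = dict(unique_counts)
    for _ in range(passes):
        updated = dict(unique_counts)  # unambiguous reads never move
        for candidates in ambiguous:
            weights = [estimate.get(ref, 0.0) for ref in candidates]
            total = sum(weights)
            if total == 0:
                # No evidence for any candidate: split the read evenly.
                weights = [1.0] * len(candidates)
                total = float(len(candidates))
            for ref, weight in zip(candidates, weights):
                updated[ref] = updated.get(ref, 0.0) + weight / total
        estimate = updated
    return estimate
```

Setting passes=1 corresponds loosely to the single-pass behavior of --fast-redistribute; additional passes let well-supported references absorb more of the ambiguous reads.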

Lowest Common Ancestor (LCA) Usage

The LCA algorithm is applied in the following scenarios:

When generating the taxonomy output (--tax-out):
- If --redistribute is not used, LCA is applied when a query matches multiple taxa at the chosen confidence level.
- Broader-ranked taxa appear for queries that can't be confidently assigned to a specific taxon.
During per-query processing:
- Used when a query doesn't have a clear, confident match to a single reference or taxon.
- Applied when there are multiple top-scoring references or when the best match's k-mer count ratio is below the confidence threshold.
LCA calculation can be optimized using the --shallow-lca option.
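
For semicolon-delimited taxonomy strings, a lowest common ancestor can be computed as the longest shared run of leading ranks. The Python sketch below shows only this core step. It does not model the confidence threshold or the --shallow-lca variant, and the function name lca is hypothetical.

```python
from typing import Iterable


def lca(taxa: Iterable[str]) -> str:
    """Return the deepest taxonomy prefix shared by all input strings."""
    split = [t.split(";") for t in taxa]
    common = []
    # zip stops at the shortest lineage, so shallow inputs limit the result.
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        common.append(ranks[0])
    return ";".join(common)
```

For the two references in the mapping example earlier, the result is "Bacteria", because their lineages diverge at the second rank.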

Adamantium Mode

Enabled by default, can be disabled with --no-adamantium
Identifies "adamantine" k-mers that remain unique even after single-nucleotide mutations
Provides additional columns in the coverage output for more robust analysis

Output Formats

Per-query Output (`--perq-out`)

Tab-delimited file with the following columns:

Query sequence name
Reference sequence name
K-mers matched at assigned taxonomic level [most matched; other matched]
LCA taxonomy at the specified confidence level
Total k-mers matched

Example:

Query1  Ref1    [100;50]    Bacteria;Proteobacteria;Gammaproteobacteria    150
Query2  Ref2    [80;40]     Bacteria;Firmicutes;Bacilli    120
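
A per-query line in this layout can be parsed with a few lines of Python. The sketch assumes exactly the five tab-separated columns listed above and the bracketed "[most;other]" form shown in the example. The PerQuery type and parse_perq_line function are hypothetical names.

```python
from typing import NamedTuple


class PerQuery(NamedTuple):
    query: str
    reference: str
    most_matched: int
    other_matched: int
    lca: str
    total_kmers: int


def parse_perq_line(line: str) -> PerQuery:
    """Parse one tab-delimited per-query line into typed fields."""
    query, reference, pair, lca, total = line.rstrip("\n").split("\t")
    # The third column looks like "[100;50]": strip brackets, split on ';'.
    most, other = pair.strip("[]").split(";")
    return PerQuery(query, reference, int(most), int(other), lca, int(total))
```

Parsed records can then be filtered, for instance to keep only queries whose total k-mer count exceeds a chosen minimum.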

Reference Output (`--ref-out`)

Tab-delimited file with two columns:

Reference sequence name
Total number of query sequences that matched that reference

Example:

Ref1    500
Ref2    300

Taxonomy Output (`--tax-out`)

Tab-delimited file with two columns:

Taxonomy
Count

If --redistribute is used, only full-rank taxa will appear
If --redistribute is not used, LCA algorithm is applied, and broader-ranked taxa may appear

Example:

Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli    300
Bacteria;Firmicutes;Bacilli    200
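
Because taxa in this file can sit at different depths, it is often convenient to collapse counts to a fixed rank depth before comparing samples. The Python sketch below truncates each taxonomy string to at most a given number of ranks and sums the counts. Taxa shallower than the requested depth are kept as they are. The function name collapse_to_depth is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def collapse_to_depth(rows: Iterable[Tuple[str, int]], depth: int) -> Dict[str, int]:
    """Sum counts after truncating each taxonomy to at most `depth` ranks."""
    totals: Counter = Counter()
    for taxonomy, count in rows:
        ranks = taxonomy.split(";")
        totals[";".join(ranks[:depth])] += count
    return dict(totals)
```

Applied to the example above with depth 2, the two rows collapse to "Bacteria;Proteobacteria" with 300 and "Bacteria;Firmicutes" with 200.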

Coverage Output (`--cov-out`)

Tab-delimited file with the following columns:

Reference: Reference sequence name
Ref_len: Length of the reference sequence
Total_kmers: Total number of k-mers in the reference sequence
Total_unique: Number of unique k-mers in the reference sequence
Kmers_found: Total k-mers from query sequences that mapped to this reference
Unique_kmers_found: Unique k-mers from query sequences that mapped to this reference
Kmers_covered: Number of k-mers in the reference sequence matched by at least one k-mer from a query sequence
Unique_kmers_covered: Number of unique k-mers in the reference sequence matched by at least one k-mer from a query sequence
Proportion_covered: Decimal fraction of the reference sequence matched by at least one k-mer from a query sequence
Unique_proportion_covered: Decimal fraction of unique k-mers in the reference sequence matched by at least one k-mer from a query sequence
Reads_covered: Number of query sequences (not k-mers) that piled up onto the reference sequence
Exp: Expected coverage based on a Poisson distribution
Suspicion: A measure of potential contamination or strain-level variation
Coverage_est: Estimated coverage of the reference sequence
Num_reads_est: Estimated number of reads that mapped to the reference sequence

When "adamantium" mode is enabled (default), additional columns are included:

Adamantine_tot: Total number of adamantine k-mers in the reference sequence
Adamantium_found: Number of adamantine k-mers found in the query sequences
Adamantium_covered: Number of adamantine k-mers covered by query sequences
Adamantium_prop: Proportion of adamantine k-mers covered
XCov_adamantium: Cross-coverage estimate using adamantine k-mers
Reads_covered_adamantium: Estimated number of reads covered using adamantine k-mers
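
A common use of the coverage table is to keep only references whose unique k-mers are well covered. The Python sketch below does this with the standard csv module. It assumes the file begins with a header row that uses the column names listed above; the 0.1 cutoff is an arbitrary illustration, not a recommended value; and the demonstration rows are made up.

```python
import csv
import io
from typing import List, TextIO


def confident_references(handle: TextIO, min_unique_prop: float = 0.1) -> List[str]:
    """Return references whose Unique_proportion_covered meets the cutoff.

    Assumption: the first row is a header with the documented column names.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["Reference"]
            for row in reader
            if float(row["Unique_proportion_covered"]) >= min_unique_prop]


# Made-up rows with only the two columns this example needs.
demo = io.StringIO(
    "Reference\tUnique_proportion_covered\n"
    "RefA\t0.42\n"
    "RefB\t0.03\n"
)
hits = confident_references(demo)
```

With a real coverage file, the handle would come from `open(path, newline="")`, and the same approach extends to the adamantium columns when they are present.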

Orthogonal Output (`--orthog-out`)

Tab-delimited file with three columns:

First hierarchy (e.g., taxonomy)
Second hierarchy (e.g., function)
Count

Example:

Bacteria;Proteobacteria;Gammaproteobacteria    Function1;Function2    150
Bacteria;Firmicutes;Bacilli    Function1;Function4    100
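
The three-column layout is a "long" form of a cross-tabulation. The Python sketch below pivots it into a nested dictionary keyed first by the first hierarchy and then by the second, which is convenient for building a taxonomy-by-function matrix. The function name crosstab is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def crosstab(rows: Iterable[Tuple[str, str, int]]) -> Dict[str, Dict[str, int]]:
    """Pivot (hierarchy1, hierarchy2, count) rows into a nested dictionary."""
    table: Dict[str, Dict[str, int]] = defaultdict(dict)
    for h1, h2, count in rows:
        # Sum counts in case the same pair appears on more than one line.
        table[h1][h2] = table[h1].get(h2, 0) + count
    return dict(table)
```

Rows can be produced by splitting each line of the file on tabs and converting the last field to an integer.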

Performance Optimization

Use --threads to set the number of threads for parallel processing
For large databases, use compression level 2 (--comp 2)
The --copymem option can improve performance by copying the database to local memory
Use --shallow-lca for faster LCA calculation in certain cases

Limitations and Considerations

Maximum read length is set to 100MB
Large whole-genome databases may require significant memory
The confidence threshold interpretation may change in future versions to match common expectations (e.g., 0.80 meaning 80% confidence)

Error Handling

The program includes basic error checking for input file formats and command-line arguments
It will print error messages and exit if it encounters issues like malformed input files or invalid options

Conclusion

XTree (CrossTree) is a powerful and flexible tool for metagenomic analysis, offering a balance between speed and accuracy. Its various output options and optimization features make it suitable for a wide range of metagenomic studies and large-scale sequence analysis projects.

Key features include:

Fast k-mer-based sequence lookup and tallying
Genome coverage analysis and maximum-likelihood read redistribution
Cross-tabulation of annotation hierarchies (e.g., taxonomy and function)
Per-query annotation and statistics
Support for multiple reporting granularities

Users should pay close attention to input file requirements, particularly for reference sequences and taxonomy mapping in BUILD mode. The various output formats provide flexibility in analyzing results, from per-query details to broad taxonomic and functional summaries.

By understanding the details provided in this documentation, researchers can effectively leverage XTree's capabilities to process and analyze large-scale metagenomic datasets, gaining valuable insights into community composition and functional potential.
