F1ALA: ultrafast and memory-efficient Ancestral Lineage Annotation for huge SARS-CoV-2 phylogeny using F1-score

The unprecedented scale of the global SARS-CoV-2 phylogeny overwhelmed most common ancestral lineage annotation methods (such as matUtils and PastML) when assigning PANGO lineage information to unlabeled nodes in a rooted tree. Furthermore, the accuracy of these annotation methods has not been clearly established. To resolve these challenges, we developed an efficient and accurate ancestral lineage annotation method (F1ALA). It uses the F1-score to evaluate the confidence of assigning a lineage annotation at a specific ancestor node, given the lineage labels of the taxa in the tree. F1ALA achieved ultrafast speed (less than 13 minutes) with significantly less memory usage (3.6 GB) when annotating 2,277 PANGO lineages in a phylogeny with 5.26 million taxa, allowing real-time lineage tracking to be performed on a laptop computer. In benchmarks on three phylogenies with 100K, 660K and 5.26M taxa, F1ALA significantly outperformed matUtils and was comparable to PastML in annotation accuracy in both empirical and simulated tests. The high efficiency of F1ALA enables the refinement of a huge SARS-CoV-2 phylogeny by pruning all taxa whose labels are inconsistent with their closest annotation nodes and re-inserting them. We demonstrated that this refinement optimized the SARS-CoV-2 phylogenetic tree topology, achieving a larger tree log-likelihood and a smaller parsimony score.

If you have any questions about using F1ALA, please email [email protected].

How It Works

Illustration of the algorithm for ancestral lineage annotation. Given a tree with 5 taxa (Node5-9) and 4 internal nodes (Node1-4), where Node5-7 are labeled lineage A and Node8-9 are labeled lineage B, ancestral lineage annotation is computed in three steps. Step 1: Extract the candidate annotation nodes for lineage A (Node1-3 and Node5-7) and lineage B (Node1, Node3-4 and Node8-9), shown in the black-background headers of the top two tables. Step 2: Determine the order in which lineages are assigned, based on the annotation confidence score, i.e., the largest F1-score for each lineage (A = 4/5 and B = 1, underlined in the top two tables). Lineage B is therefore assigned first and lineage A second; the order is shown as ① and ② in the bottom two tables. Step 3: Assign the annotation for B at Node4 first (middle table), then for A at Node1. When F1-scores are recalculated for the candidate annotation nodes of lineage A, taxa Node8-9 are excluded because they have already been assigned to the confirmed annotation at Node4 (bottom table). Note that the F1-score tables for lineage B, shown in the middle, are the same in Step 1 and Step 3.
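The three steps can be traced with a minimal Python sketch. It is not the F1ALA implementation: the topology in CHILDREN is an assumption chosen to be consistent with the candidate nodes and F1-scores quoted above (the figure itself is not reproduced here), and the function names are hypothetical. The F1-score of a node for a lineage is taken as 2·TP / (2·TP + FP + FN), where TP counts taxa under the node that carry the lineage label, FP counts taxa under the node with another label, and FN counts taxa with the label that lie outside the node.

```python
from fractions import Fraction

# Assumed topology (hypothetical, chosen to match the scores in the text).
CHILDREN = {
    "Node1": ["Node2", "Node3"],
    "Node2": ["Node5", "Node6"],
    "Node3": ["Node7", "Node4"],
    "Node4": ["Node8", "Node9"],
}
LABELS = {"Node5": "A", "Node6": "A", "Node7": "A", "Node8": "B", "Node9": "B"}


def tips_under(node):
    """Return the set of taxa (tips) in the subtree rooted at node."""
    if node not in CHILDREN:
        return {node}
    tips = set()
    for child in CHILDREN[node]:
        tips |= tips_under(child)
    return tips


def f1(node, lineage, active):
    """F1-score of annotating `lineage` at `node`, counting only active taxa."""
    tips = tips_under(node) & active
    members = {t for t in active if LABELS[t] == lineage}
    tp = len(tips & members)
    if tp == 0:
        return Fraction(0)
    fp = len(tips - members)
    fn = len(members - tips)
    return Fraction(2 * tp, 2 * tp + fp + fn)


def best_node(lineage, active):
    """Candidate node with the largest F1-score for the lineage."""
    candidates = list(CHILDREN) + list(LABELS)
    return max(candidates, key=lambda n: f1(n, lineage, active))


def annotate():
    active = set(LABELS)
    lineages = sorted(set(LABELS.values()))
    # Step 2: order lineages by their largest F1-score, highest first.
    order = sorted(
        lineages,
        key=lambda lin: f1(best_node(lin, active), lin, active),
        reverse=True,
    )
    # Step 3: assign in that order, excluding taxa already covered.
    result = {}
    for lin in order:
        node = best_node(lin, active)
        result[lin] = (node, f1(node, lin, active))
        active -= tips_under(node)
    return order, result


if __name__ == "__main__":
    order, result = annotate()
    print(order)
    for lineage, (node, score) in result.items():
        print(lineage, node, score)
```

Under the assumed topology, lineage B is assigned first at Node4, and lineage A is then assigned at Node1 once Node8-9 are excluded from the recalculation.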

Installation

A precompiled executable program is available as F1ALA.jar (requires Java 11 or above).

git clone https://github.com/id-bioinfo/F1ALA.git
cd F1ALA
chmod a+x f1ala
# Optional: compile F1ALA from source code
make

For conda installation,

conda create -n f1ala
conda activate f1ala
conda install f1ala::f1ala

Quick Usage

Ancestral lineage annotation

Infer the lineage information at the ancestor nodes in a given rooted tree with labeled taxa.

cd /home/ytye/f1ala_github/Benchmark_datasets/100k
/home/ytye/f1ala_github/f1ala --annotation -t 100k_tree_InnodeNameAdded.nwk --label 100k_pangolin.tsv  --output 1248_in_100k_annotation.tsv -T 8

N.B. If you encounter a StackOverflowError, increase the Java settings --xmx, --xms and --xss.

Annotation statistics and visualization

Write the annotation details to the output file, including annotation_node, annotation_node_precedor, distance_to_root, pangolineage, F1score and samples.
Write the annotation visualization to file graph-data-generated.js. Move this file to the provided 'visual' folder, then open 'graph.html' in a browser.
Collapse the tree by lineages and write this collapsed tree to file [#lineages]_collapsedTree.nwk.
Remove inconsistent taxa and write this pruned tree to file [#consistent_taxa]_removedTree.nwk (used for tree refinement using other phylogenetic insertion methods, e.g., UShER).
Write the inconsistent taxa names and their lineages to file [#inconsistent_taxa]unKeepSamples[#consistent_taxa]_tree.tsv.

cd /home/ytye/f1ala_github/Benchmark_datasets/100k
/home/ytye/f1ala_github/f1ala --annotation_details -t 100k_tree_InnodeNameAdded.nwk --label 100k_pangolin.tsv --assignment 1248_in_100k_annotation.tsv --output 1248_in_100k_annotation_details.tsv -T 8

Tree refinement

Refine a phylogeny by pruning all taxa whose labels are inconsistent with their closest annotated ancestors and re-inserting them using phylogenetic insertion methods such as TIPars and UShER.
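As a rough illustration of what "inconsistent with their closest annotated ancestors" means, the sketch below walks from each taxon towards the root until it meets an annotated node and compares labels. The tree, annotations and labels are hypothetical toy data, and the function names are not part of F1ALA.

```python
# Hypothetical toy data: child -> parent map, annotated nodes and taxon labels.
PARENT = {
    "Node2": "Node1", "Node3": "Node1",
    "Node5": "Node2", "Node6": "Node2",
    "Node7": "Node3", "Node4": "Node3",
    "Node8": "Node4", "Node9": "Node4",
}
ANNOTATION = {"Node1": "A", "Node4": "B"}
LABELS = {"Node5": "A", "Node6": "B", "Node7": "A", "Node8": "B", "Node9": "B"}


def closest_annotation(tip):
    """Lineage of the nearest annotated ancestor of tip, or None if there is none."""
    node = PARENT.get(tip)
    while node is not None:
        if node in ANNOTATION:
            return ANNOTATION[node]
        node = PARENT.get(node)
    return None


def inconsistent_taxa():
    """Taxa whose own label differs from their closest annotated ancestor."""
    return sorted(t for t, lab in LABELS.items() if closest_annotation(t) != lab)


if __name__ == "__main__":
    print(inconsistent_taxa())
```

In this toy input, Node6 is labeled B but sits under the lineage A annotation at Node1, so it is the only taxon that would be pruned and re-inserted.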

Phylogenetic insertion using TIPars

Including the ancestral lineage annotation step:

cd /home/ytye/f1ala_github/Benchmark_datasets/100k
/home/ytye/f1ala_github/f1ala --refinement -t 100k_tree_InnodeNameAdded.nwk -s 100k_taxa.fas -a 100k_anc.fas --label 100k_pangolin.tsv --output refined_tree.nwk -T 8 -x 8G

Excluding the ancestral lineage annotation step, which can instead be computed by other methods, e.g., PastML and matUtils:

cd /home/ytye/f1ala_github/Benchmark_datasets/100k
/home/ytye/f1ala_github/f1ala --refinement_from_annotation -t 100k_tree_InnodeNameAdded.nwk -s 100k_taxa.fas -a 100k_anc.fas --label 100k_pangolin.tsv --assignment 1248_in_100k_annotation.tsv --output refined_tree.nwk -T 8 -x 8G

Phylogenetic insertion using UShER

Use the [#consistent_taxa]_removedTree.nwk produced after ancestral lineage annotation by TIPars2:

cd /home/ytye/f1ala_github/Benchmark_datasets/100k
/home/ytye/f1ala_github/f1ala --annotation -t 100k_tree_InnodeNameAdded.nwk --label 100k_pangolin.tsv  --output 1248_in_100k_annotation.tsv -T 8
/home/ytye/f1ala_github/f1ala --annotation_details -t 100k_tree_InnodeNameAdded.nwk --label 100k_pangolin.tsv --assignment 1248_in_100k_annotation.tsv --output 1248_in_100k_annotation_details.tsv -T 8
usher -v taxa.vcf -t 81784_removedTree.tree -d ./usher -o ./usher/81784_AddTo_100k.pb

The tree refined by UShER is written to ./usher/final-tree.nh.

Tree bubbling

Collapse the tree into multiple clusters based on the ancestral lineage annotation. Large clusters (>exploreTreeNodeLimit) are further stratified into multiple bubbles by BFS search. Small clusters (<smallClusterLimit) and small bubbles (<smallBubbleLimit) are merged. Clusters are linked to bubbles.

cd /home/ytye/f1ala_github/Benchmark_datasets/100k
/home/ytye/f1ala_github/f1ala --tree_BFS -t 100k_tree_InnodeNameAdded.nwk --label 100k_pangolin.tsv  --output 100k_tree_bfs.tsv --exploreTreeNodeLimit 2000 --smallBubbleLimit 5 --smallClusterLimit 5 -T 8

The output TSV file includes 8 fields.

bubble_type : 1 is cluster and 2 is bubble
annotation_node : root of the subtree for cluster or bubble
annotation_node_precedor : predecessor of this annotation_node, where the predecessor is also a cluster or bubble
dist_to_precedor : total branch length from annotation_node_precedor to annotation_node
parent_node : parent node of this annotation_node in the input tree
pangolineage : lineage label annotated by 'Ancestral lineage annotation'
num_nodes : number of nodes in the cluster or bubble
nodes : a list of nodes in the cluster or bubble (separated by comma)
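For downstream analysis, the fields listed above can be loaded with Python's standard csv module. The sketch below assumes the file is tab-separated and begins with a header row that uses exactly these field names; neither assumption is confirmed here, and the sample record is invented for illustration.

```python
import csv
import io

COLUMNS = [
    "bubble_type", "annotation_node", "annotation_node_precedor",
    "dist_to_precedor", "parent_node", "pangolineage", "num_nodes", "nodes",
]

# Invented sample record (hypothetical values), with an assumed header row.
SAMPLE = "\t".join(COLUMNS) + "\n" + "\t".join(
    ["1", "Node4", "Node1", "0.002", "Node3", "B", "3", "Node4,Node8,Node9"]
) + "\n"

KIND = {1: "cluster", 2: "bubble"}


def read_bubbles(handle):
    """Parse bubble records, converting numeric fields and the node list."""
    rows = []
    for rec in csv.DictReader(handle, delimiter="\t"):
        rec["bubble_type"] = KIND[int(rec["bubble_type"])]
        rec["dist_to_precedor"] = float(rec["dist_to_precedor"])
        rec["num_nodes"] = int(rec["num_nodes"])
        rec["nodes"] = rec["nodes"].split(",")
        rows.append(rec)
    return rows


if __name__ == "__main__":
    for row in read_bubbles(io.StringIO(SAMPLE)):
        print(row["bubble_type"], row["annotation_node"], row["nodes"])
```

To read a real output file, pass an open file handle in place of the io.StringIO sample; if the file has no header row, supply fieldnames=COLUMNS to csv.DictReader.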

Graft subtrees

Graft a set of subtrees onto a big tree. The subtrees and the big tree should share two common samples as anchors. One anchor is used for outgroup rooting in all provided subtrees and in the big tree. The graft method scales the branch lengths of each subtree to match the big tree, based on the distance between the two anchors. After grafting, the final tree is rooted by midpoint outgroup.
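The branch-length scaling can be pictured with a small Python sketch. It assumes the scale factor is the ratio of the anchor-to-anchor distance in the big tree to the same distance in the subtree, which is one plausible reading of the description rather than the confirmed F1ALA procedure. Trees are represented as hypothetical child -> (parent, branch_length) maps, and the node names are invented.

```python
def distances_to_ancestors(tree, tip):
    """Map each ancestor of tip (and tip itself) to its path length from tip."""
    dist = {tip: 0.0}
    node, total = tip, 0.0
    while node in tree:
        parent, length = tree[node]
        total += length
        dist[parent] = total
        node = parent
    return dist


def anchor_distance(tree, a, b):
    """Path length between tips a and b through their closest common ancestor."""
    da = distances_to_ancestors(tree, a)
    db = distances_to_ancestors(tree, b)
    return min(da[n] + db[n] for n in da if n in db)


def scale_subtree(subtree, bigtree, a, b):
    """Rescale subtree branch lengths so the anchor distance matches bigtree."""
    factor = anchor_distance(bigtree, a, b) / anchor_distance(subtree, a, b)
    scaled = {c: (p, length * factor) for c, (p, length) in subtree.items()}
    return scaled, factor


if __name__ == "__main__":
    # Hypothetical trees sharing anchors "A" and "B".
    bigtree = {"A": ("root", 1.0), "B": ("root", 3.0)}
    subtree = {"A": ("x", 0.5), "x": ("root", 0.5), "B": ("root", 1.0), "C": ("x", 0.25)}
    scaled, factor = scale_subtree(subtree, bigtree, "A", "B")
    print(factor, scaled["C"])
```

Here the anchors are 4.0 apart in the big tree and 2.0 apart in the subtree, so every subtree branch is doubled before grafting.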

cd /home/ytye/f1ala_github/Benchmark_datasets/subtrees
/home/ytye/f1ala_github/f1ala --graft_subtrees -t bigtree.nwk --subtrees subtrees.tsv --outgroup A --output merge.nwk --print2screen true

Prune tips

Prune tips from a tree by listing the tips you want to retain in an input file.

cd /home/ytye/f1ala_github/Benchmark_datasets/subtrees
/home/ytye/f1ala_github/f1ala --prune_tips -t bigtree.nwk --retain_tips <retain_tips.tsv> --output pruned.nwk --print2screen true

How to Cite

Yongtao Ye, Marcus H Shum, Isaac Wu, Carlos Chau, Ningqi Zhao, David K Smith, Joseph T Wu, Tommy T Lam, F1ALA: ultrafast and memory-efficient ancestral lineage annotation applied to the huge SARS-CoV-2 phylogeny, Virus Evolution, Volume 10, Issue 1, 2024, veae056, https://doi.org/10.1093/ve/veae056

Acknowledgements

This project is supported by the Theme Based Research Scheme (T11-705/21-N), the Health and Medical Research Fund (COVID1903011-549 WP1) and the Innovation and Technology Commission’s InnoHK funding (D²4H).
