read2tree is a software tool that builds alignment matrices for tree inference directly from sequencing reads. To do so, it uses the OMA database and a set of reads. Its strength is that it bypasses several steps that a standard analysis needs to obtain such a matrix: read filtering, assembly, gene prediction, gene annotation, all-vs-all comparison, orthology prediction, alignment, and concatenation.
If you want to re-run a read2tree analysis (after an error, or after changing the inputs), please remove the mplog.log file and the output folder first. Alternatively, start from an empty folder. Otherwise, read2tree might reuse the faulty output of a previous unfinished run.
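The clean-up described above can be wrapped in a small helper. This is a minimal sketch, not part of read2tree: the function name `clean_r2t_run` is hypothetical, and it assumes that mplog.log sits in the directory from which read2tree was started and that the output folder is the one passed to --output_path.

```shell
# Hypothetical helper (not shipped with read2tree): remove the leftovers of a
# previous run so that the next run starts from scratch.
# Assumption: mplog.log lives in the working directory of the previous run.
clean_r2t_run() {
    local workdir="${1:-.}"      # directory read2tree was started from
    local outdir="${2:-output}"  # folder that was given to --output_path
    rm -f "${workdir}/mplog.log"
    # ":?" aborts instead of expanding to an empty path by accident
    rm -rf "${workdir:?}/${outdir:?}"
}
```

For example, `clean_r2t_run . output` removes ./mplog.log and ./output before a fresh run.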
We are working on a new read2tree version using the minimap2 aligner, which is much faster. For this version, the --read_type argument can be short, long-hifi, or long-ont. You can also set the number of threads used by minimap2, for example with --threads 40.
You can watch David Dylus's presentation on Read2Tree as part of the SIB in silico talks.
To cite Read2Tree, please refer to the paper published in Nature Biotechnology:
David Dylus, Adrian Altenhoff, Sina Majidian, Fritz J. Sedlazeck & Christophe Dessimoz.
Inference of phylogenetic trees directly from raw sequencing reads using Read2Tree. Nat Biotechnol (2023). https://doi.org/10.1038/s41587-023-01753-4/.
There are three ways to install read2tree. You can choose any of them.
To set up read2tree on your local machine from source please follow the instructions below.
First, we need to create a fresh conda environment:
conda create -n r2t python=3.10.8
The following python packages are needed: numpy, scipy, cython, lxml, tqdm, pysam, pyparsing, requests, filelock, natsort, pyyaml, biopython, ete3, dendropy.
You can install all of them using:
conda install -c conda-forge biopython numpy Cython ete3 lxml tqdm scipy pyparsing requests natsort pyyaml filelock
conda install -c bioconda dendropy pysam
In addition, you need the following software: mafft (multiple sequence aligner), iqtree (phylogenomic inference), ngmlr and ngm/nextgenmap (long and short read mappers), and samtools. All of them can be installed using conda:
conda install -c bioconda mafft iqtree ngmlr nextgenmap samtools
Then, you can install the read2tree package after downloading it from this GitHub repo using:
git clone https://github.com/DessimozLab/read2tree.git
cd read2tree
python setup.py install
Alternatively, read2tree can be installed from Bioconda into a fresh conda environment:
conda create -n r2t python=3.10.8
conda install -c bioconda read2tree
Alternatively, you could also try using mamba. Caution: please read about the compatibility of conda and mamba in one environment.
The Dockerfile is also available in this repository. An example of how to run the container is given in the test example section.
A prebuilt container can be pulled from Docker Hub:
docker pull dessimozlab/read2tree:latest
To run read2tree two things are required as input:
- The DNA sequencing reads as FASTQ file(s).
- A set of reference orthologous groups, i.e. marker genes.
On our wiki page, you can find information on how to obtain the marker genes using the OMA browser. You can set the value of Maximum nr of markers to 200 or 400. Once you have downloaded the tgz file, run:
tar xvzf marker_genes_*.tgz
ls marker_genes/*.fna | wc -l
cat marker_genes/*.fna > dna_ref.fa
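After concatenating the marker files, it can be worth checking that dna_ref.fa holds as many FASTA records as the individual .fna files combined. The sketch below is only a sanity check under that assumption; the function names `count_fasta_records` and `check_dna_ref` are hypothetical and not part of read2tree.

```shell
# Count FASTA records, i.e. header lines starting with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}

# Succeed only if the concatenated reference holds exactly as many records
# as all .fna files in the marker directory together.
check_dna_ref() {
    local marker_dir="$1" dna_ref="$2"
    local expected actual
    expected=$(cat "$marker_dir"/*.fna | grep -c '^>')
    actual=$(count_fasta_records "$dna_ref")
    [ "$expected" -eq "$actual" ]
}
```

A possible call is `check_dna_ref marker_genes dna_ref.fa && echo "reference looks complete"`.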
The output of Read2Tree is the concatenated alignment as a FASTA file in which each record corresponds to one species. We also provide the option --tree for inferring the species tree, using IQ-TREE by default. In single-species mode, one command runs the whole analysis:
read2tree --tree --standalone_path marker_genes/ --reads read_1.fastq read_2.fastq --output_path output --dna_reference dna_ref.fa
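In multi-species mode, first create the reference, then map the reads of each species separately, and finally merge all mappings:
read2tree --standalone_path marker_genes/ --output_path output --reference --dna_reference dna_ref.fa # this creates just the reference folders 01 - 03
read2tree --standalone_path marker_genes/ --output_path output --reads species1_R1.fastq species1_R2.fastq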
read2tree --standalone_path marker_genes/ --output_path output --reads species2_R1.fastq species2_R2.fastq
read2tree --standalone_path marker_genes/ --output_path output --reads species3_R1.fastq species3_R2.fastq
read2tree --standalone_path marker_genes/ --output_path output --merge_all_mappings --tree
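The per-species mapping commands above all follow one pattern, so they can be generated in a loop. The sketch below is a dry run that only prints the commands. The function name `r2t_mapping_cmds` is hypothetical, and it assumes paired-end files named <species>_R1.fastq and <species>_R2.fastq, as in the commands above.

```shell
# Dry run: print one read2tree mapping command per species name.
# Assumption: reads are stored as <species>_R1.fastq / <species>_R2.fastq.
r2t_mapping_cmds() {
    local sp
    for sp in "$@"; do
        echo "read2tree --standalone_path marker_genes/ --output_path output --reads ${sp}_R1.fastq ${sp}_R2.fastq"
    done
}
```

Calling `r2t_mapping_cmds species1 species2 species3` prints the three mapping commands. Once they look right, they can be run one by one or submitted as separate jobs, followed by the --merge_all_mappings step.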
To obtain bootstrap values, a measure of support for the internal nodes, you can run the following:
thread=20
iqtree -T ${thread} -s output/concat_*_aa.phy -bb 1000
The .phy file is either concat_sample_aa.phy or concat_merge_aa.phy, corresponding to single- or multi-species mode, respectively.
It is also possible to trim the MSA with trimAl:
trimal -in <inputfile> -out <outputfile> -automated1
For closely related species, you can infer the tree from the MSA of nucleotide sequences instead:
thread=20
iqtree -T ${thread} -s output/concat_*_dna.phy
The goal of this test example is to infer a species tree for Mus musculus using its sequencing reads. You can download the full read data from SRR5171076 using sra-tools. Alternatively, a small read dataset is provided in the tests folder. For this example, we use five species as the reference: Mnemiopsis leidyi, Xenopus laevis, Homo sapiens, Gorilla gorilla, and Rattus norvegicus. Using the OMA browser, we downloaded 20 marker genes of these five species as the reference orthologous groups, located in the folder tests/marker_genes.
cd tests
read2tree --debug --tree --standalone_path marker_genes/ --reads sample_1.fastq sample_2.fastq --output_path output/ --dna_reference dna_ref.fa
Alternatively, with Docker (the mounted paths assume that the command is started from the repository root, not from tests):
docker run --rm -i -v $PWD/tests:/input -v $PWD/tests/:/reads -v $PWD/output:/out -v $PWD/run:/run dessimozlab/read2tree:latest --tree --standalone_path /input/marker_genes --dna_reference /input/cds-marker_genes.fasta.gz --reads /reads/sample_1.fastq --output_path /out
You can check the inferred species tree for the sample and five reference species in Newick format:
$cat output/tree_sample_1.nwk
(sample_1:0.0106979811,((HUMAN:0.0041202790,GORGO:0.0272785216):0.0433094119,(XENLA:0.1715052824,MNELE:0.9177670816):0.1141311779):0.0613339433,RATNO:0.0123413734);
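To quickly check which taxa ended up in the tree, the leaf labels can be pulled out of the Newick file with standard tools. This is a rough sketch rather than a Newick parser: the function name `newick_leaves` is hypothetical, and it assumes labels made only of letters, digits, and underscores, as in the tree above.

```shell
# Print the leaf labels of a Newick file, one per line.
# A leaf label follows "(" or "," and is terminated by ":"; internal nodes
# follow ")" and are therefore skipped.
# Assumption: labels contain only letters, digits and underscores.
newick_leaves() {
    grep -oE '[(,][A-Za-z0-9_]+:' "$1" | tr -d '(,:'
}
```

For the tree above, `newick_leaves output/tree_sample_1.nwk` lists sample_1 together with the five reference species codes.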
For the full description of output files please check our wiki page.
Note that species names are treated as 5-letter codes, e.g. XENLA = Xenopus laevis. If you want to rerun your analysis, make sure that you have moved or deleted the output files of the previous run. Otherwise, read2tree continues from the progress of the previous analysis.
For running on clusters, you can run the first step of read2tree such that folders 01, 02 and 03 are computed (this allows for mapping). This can be done using the '--reference' option. Since read2tree re-orders the OGs into the included species, it is possible to split the mapping step per species using multiple threads for the mapper. For this the '--single_mapping' option is available.
Hint: As read2tree exploits the progress package, the user can benefit from continuing unfinished runs. However, if you want to conduct a new analysis with different inputs, you need to remove the output of previous runs or change the output_path.
To see the details of the arguments, please take a look at our wiki page.
Installing on macOS sometimes raises this error:
raise ValueError, 'unknown locale: %s' % localename
ValueError: unknown locale: UTF-8
This can be mitigated using:
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
- version 0.1.5:
  - fix issue with UnknownSeq being removed in Biopython>1.80
  - removing unused modeltester wrappers
- version 0.1.4:
  - allow reference folders not named marker_genes (#12)
  - update environment.yml file to contain all dependencies (#16)
  - documentation improvements
  - CI/CD pipeline
- version 0.1.3:
  - improvements of documentation
  - adding support for docker
  - small bugfixes
- version 0.1.2: packaging
- version 0.1.0: Adding covid analysis
- version 0.0: Initial work
- David Dylus (main author)
- Adrian Altenhoff
The authors would like to thank Alex Warwick for his help with setting up such a package.
This project is licensed under the MIT License.