
IV. Tutorials and results interpretation


You will find here several pre-configured analyses, which can be run immediately after installing PEWO. Each section below goes through the results and plots generated by PEWO and comments on their interpretation.

  1. Fast test of Pruning-based ACcuracy (PAC) evaluation
  2. Placement accuracy for a bacterial taxonomy (16S rRNA)
  3. Placement accuracy for HIV genomes
  4. Search for the most accurate taxonomic markers
  5. CPU/RAM requirements evaluation
  6. Faster evaluation with Likelihood-based ACcuracy evaluation (LAC)

Example 1 : Fast test of Pruning-based ACcuracy (PAC) evaluation

This example is a short introduction to how PEWO can be configured and which result tables and plots are produced by the Accuracy procedure. This procedure was published and described several times (Berger et al. 2010; Matsen et al. 2010; Linard et al. 2019). PEWO provides an automated implementation of it. Note that the results produced by the current example are NOT representative of the actual accuracy of the tested methods: only a few random prunings are computed to limit computational time, whereas a decent accuracy evaluation would require at least 50 prunings.
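For intuition, below is a toy sketch of one PAC iteration, not PEWO's actual code: the random branch choice merely stands in for a real placement tool, and the ete3 dependency is an assumption of this sketch.

# Toy sketch of one PAC iteration (illustration only, not PEWO's code).
# A random branch stands in for a real placement tool (EPA-ng, PPlacer, ...).
import random
from ete3 import Tree

tree = Tree("((((A,B),(C,D)),(E,F)),(G,H));")

pruned = random.choice(tree.get_descendants())    # 1) pick a subtree to prune
expected = pruned.up                              # branch where placements are expected
leaves = pruned.get_leaf_names()                  # the queries to re-place
pruned.detach()

for leaf in leaves:                               # 2) mock "placement" of each query
    observed = random.choice(tree.get_descendants())
    # 3) topological node distance between expected and observed branches
    print(leaf, tree.get_distance(expected, observed, topology_only=True))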

a) Run the example

After PEWO installation, from the installation directory execute every command of example/1_fast_test_of_accuracy_procedure/README.md. It requires around 15 min of computation using 2 CPU cores (on an i7 2.5 GHz CPU) and 4 GB of RAM.

b) Raw results

Node Distance measures for every query and every tested condition are in results.csv. Below are 3 lines extracted from this file:

software  pruning  nd  e_nd   r    h  g        ms  sb  mp  k  o  red   ar
epang     0        2   1.030  150  1  0.99999
pplacer   0        2   1.029  150               6   3   40
rappas    0        1   1.000  150                           8  2  0.99  RAXMLNG

Columns 1 to 4 : the software, the pruning, and the Node Distance (nd) and expected Node Distance (e_nd) measured for each placement. As several query sequences may be produced by a single pruning (several leaves pruned and/or short reads requested in the configuration), one measurement is reported per placed sequence.

Other columns : parameters and parameter values associated with each tested software. Parameter association is described in the config.yaml file. For instance, RAPPAS placements are influenced by 4 parameters: 1) the k-mer size (k), 2) the probability threshold omega (o), 3) the ratio of alignment reduction (red) and 4) the software used for ancestral reconstruction (ar).

c) Summaries and plots

For each pruning, an average node distance is computed from the node distances associated with all reads built from the pruned leaves. These means are stored in a batch of csv files, whose filenames include the software name and the nature of the measure (ND = node distance; eND = expected node distance).

mean_per_pruning_eND_rappas.csv
mean_per_pruning_ND_pplacer.csv
[...]

Then, an average is computed across all prunings. The corresponding statistics are in "summary" csv files (both averaging steps are reproduced in the pandas sketch after the file list).

summary_table_eND_rappas.csv
summary_table_ND_pplacer.csv
[...]
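Both averaging steps can be reproduced from results.csv with a few lines of pandas; a minimal sketch, assuming the file is comma-separated and read from the working directory:

# Recompute the per-pruning means and the cross-pruning summary (sketch).
import pandas as pd

df = pd.read_csv("results.csv")                                    # one line per placed query
per_pruning = df.groupby(["software", "pruning"])["e_nd"].mean()   # mean_per_pruning_* files
summary = per_pruning.groupby(level="software").mean()             # summary_table_* files
print(summary)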

Summary means are plotted for visual exploration of the results:

summary_plot_eND_rappas.svg
summary_plot_ND_pplacer.svg
[...]

These plots are dynamically generated. The number of plotted values depends on the number of parameters that were tested and the number of resulting parameter combinations. In the current example, only 1 value per software is plotted (the value measured for the default parameters of each tested method). Plots exploring more parameter combinations are shown in the following examples.

Example 2 : Placement accuracy for a bacterial taxonomy

When one possesses a reference taxonomy and aims at species identification, phylogenetic placement is one possible solution. Today, several methods are available, based on different algorithms. The current example illustrates how PEWO can help to evaluate phylogenetic placement accuracy using the same reference tree but different methods and different parameter combinations. To do so, the "Accuracy" procedure, described here, is used.

The reference taxonomy is a phylogeny based on 16S rRNA sequences for a broad range of bacterial orders. It contains 150 taxa. EPA-ng, PPlacer, RAPPAS and APPLES are tested, with several values tested for each of their respective parameters.

a) Experimental setup

We use the pre-filled PEWO configuration file example/2_placement_accuracy_for_a_bacterial_taxonomy/config.yaml. Below are listed the fields of this yaml file that are important for the current analysis.

Accuracy procedure configuration :

  • dataset_align : path to the reference alignment alignment_150.fasta.
  • dataset_tree : path to the reference tree RAxML_bipartitions.150.BEST.WITH, which was built from the reference alignment.
  • test_soft : phylogenetic placement software to test, here set to [rappas,epang,pplacer,apples] to test these 4 programs.
  • read_length : set to [150,300] to measure accuracy for two fragment lengths (matching current Illumina technologies). The reference alignment is 1300 bp long; using a large value such as 10000 would use the full sequence length for the queries, not shorter fragments.
  • pruning_count : the number of pruning experiments that will be run. To shorten the computations required by this example, only a few prunings are set. Prunings are done by randomly selecting a tree node and pruning its subtree. A correct evaluation would require at least 50, if not 100, prunings to cover many scenarios, from easy (a single leaf pruned) to hard (a large proportion of the tree pruned). A short sketch for inspecting these fields follows this list.
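These fields can be quickly inspected from Python; a minimal sketch, assuming PyYAML is installed and PEWO is run from its installation directory:

# Read the accuracy-procedure fields discussed above (sketch).
import yaml

with open("example/2_placement_accuracy_for_a_bacterial_taxonomy/config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["dataset_align"])    # path to alignment_150.fasta
print(cfg["dataset_tree"])     # path to RAxML_bipartitions.150.BEST.WITH
print(cfg["test_soft"])        # [rappas, epang, pplacer, apples]
print(cfg["read_length"])      # [150, 300]
print(cfg["pruning_count"])    # small here; >=50 for a serious evaluation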

Per software configuration :

The following lines of the configuration file determine the parameter combinations that are tested for each software; the resulting condition counts are verified in the sketch after this list.

  • PPlacer : three parameters impact the placements (max-strikes, strike-box and max-pitches) and two values are tested for each parameter ([6,24], [3,12] and [40,160], respectively). This will produce accuracy measurements for a total of 16 conditions: read_length (2) x [ values (2) ^ parameters (3) ]

  • EPA-ng : two heuristics are tested (["h1","h2"]) with one parameter and two values per heuristic (h1: g: [0.999,0.99999] and h2: G: [0.01,0.1]). This will produce accuracy measurements for a total of 8 conditions: read_length (2) x heuristics (2) x [ values (2) ^ parameter (1) ]

  • APPLES : two parameters (method and criterion) are tested, each with three different values ([OLS,FM,BE] and [MLSE,ME,HYBRID], respectively). This will produce accuracy measurements for a total of 18 conditions: read_length (2) x [ values (3) ^ parameters (2) ]

  • RAPPAS : this is an alignment-free approach which first requires a database construction step before the placements themselves. Database construction is based on ancestral state reconstruction, for which default parameters are kept (reduction: [0.99], arsoft: [RAXMLNG]). The experiment focuses on the effect of parameters k and omega, which define how many phylo-kmers will be computed during database construction. Two values are tested for each ([7,8] and [1.5,2.0], respectively). This will produce accuracy measurements for a total of 8 conditions: read_length (2) x [ values (2) ^ parameters (2) ]
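The condition counts given above can be checked by enumerating the combinations; a short sketch, assuming the parameter values listed in this example:

# Enumerate the tested conditions per software (sketch).
from itertools import product

read_lengths = [150, 300]
pplacer = list(product(read_lengths, [6, 24], [3, 12], [40, 160]))      # ms, sb, mp
epang = list(product(read_lengths, [("h1", 0.999), ("h1", 0.99999),
                                    ("h2", 0.01), ("h2", 0.1)]))        # heuristic, g/G
apples = list(product(read_lengths, ["OLS", "FM", "BE"],
                      ["MLSE", "ME", "HYBRID"]))                        # method, criterion
rappas = list(product(read_lengths, [7, 8], [1.5, 2.0]))                # k, omega

print(len(pplacer), len(epang), len(apples), len(rappas))               # 16 8 18 8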

b) Run the example

After PEWO installation, from the installation directory execute every command of example/2_placement_accuracy_for_a_bacterial_taxonomy/README.md. It requires around 4 hours of computation using 2 CPU cores (on an i7 2.5 GHz CPU) and 8 GB of RAM.

c) Result files

Result files and their content are identical to those described in example 1.

d) Plots and interpretation

Node Distance (ND) and expected Node Distance (eND) are metrics describing the topological distance between the branch where the placement of a sequence was expected and the branch where it was actually observed. Results of our experiments can be visually explored by opening the following plots:

Plot comments

These plots are dynamically generated, with the following rules:

  • by default, tested read lengths are differentiated by lines, with one read length per line of gray boxes. Here r:150 and r:300 correspond to the read lengths requested in the configuration file.
  • to be comparable, the plots of all tested software use the same color scale.
  • when more than 3 parameters are tested, gray boxes are multiplied on each line (each line remains reserved for a read length). Here, for PPlacer, we have 2 gray boxes per line. The top line of gray boxes is for the 150 bp read length; columns correspond to parameter 'mp' (max-pitches), with mp=40 (column 1) and mp=160 (column 2). In each gray box you will find the combinations of the other parameters ('sb' for strike-box and 'ms' for max-strikes).

Plot interpretation

PLOTS 1, 2 and 3: You will observe that EPA-ng, PPlacer and RAPPAS produce similar placement accuracy on this reference phylogeny. For EPA-ng and PPlacer, the tested parameter values have little impact, and longer reads produce more accurate placements (expected, as they are more informative).

In the case of RAPPAS, a k-mer size of 8 or higher is necessary to produce a similar accuracy (larger k-mers = higher placement accuracy, but heavier databases).

With APPLES, the ME criterion (Minimum Evolution) produces particularly inaccurate placements (y axis, 2nd line), whatever the tested method (x axis, the different methods used to measure distances in APPLES). This is expected, as the input phylogenies are optimised under the Maximum Likelihood criterion.

Example 3: Placement accuracy for HIV genomes

This example describes results of the Accuracy procedure on full viral genomes. Before reading further, we advise you to read example 2, which describes in detail how PEWO is configured.

a) Experimental setup

Here the reference phylogeny contains 104 full HIV genomes (~9 kb long) and includes most known subtypes. EPA-ng is tested with two different heuristics (h1 and h2). RAPPAS is tested with two values of k and omega. Only a few prunings are launched in this example; dozens of prunings would make a better evaluation (at the price of longer computations).

b) Run the example

After PEWO installation, from the installation directory execute every command of example/3_placement_accuracy_for_HIV_genomes/README.md. It requires around 2 hours of computation using 2 CPU cores (on an i7 2.5 GHz CPU) and 8 GB of RAM.

c) Results and interpretations

The following plots are generated:

  • In this example, EPA-ng heuristic 2 (slower heuristic) shows the best accuracy when at least 10% of the top scoring tree branches are explored by the heuristic prior to local branch length optimisations.

  • RAPPAS comes second when considering k=8; with k=9, RAPPAS performs better. Only computations related to k=9 are launched by this (longer) example, but adding k=10 in the configuration file and relaunching PEWO would launch the supplementary computations, and the plots would be automatically updated, testing whether equal accuracy can be attained by both software.

  • Finally, EPA-ng heuristic 1 (the fastest) shows the worst performance. Lower values of the parameter 'g' may lead to better performance and, as with RAPPAS and k=10, PEWO could be relaunched after modifying these parameter values in the configuration file. Only the related computations would be launched, and the results and plots would be automatically updated.

  • Please note how different the color scale is compared to example 2. PEWO plots are dynamically generated, and the lowest/highest observed measures are used to determine this scale. Here the red color indicates the lowest scores and should not be interpreted as low-quality placements ! Actually, here every method shows expected Node Distances below 3, meaning that over the 104 HIV genomes, placements are, on average, <3 nodes away from the correct branch, which is pretty good !

Example 4: Search for the most accurate taxonomic markers

What placement accuracy can be expected for reference phylogenies based on different taxonomic markers (16S, cox1) ? PEWO can highlight which locus will likely produce the best identifications under different approaches, and answering such questions is the topic of the current example.

We possess a set of ~900 Coleopteran complete mitochondrial genomes. With such a reference taxonomy, several marker genes are candidates for future phylogenetic placements of metagenomic or metabarcoding data. In particular, cox1, cytb, 16S and 12S are classical marker genes in these studies. But which marker will produce the most accurate identifications with phylogenetic placement approaches ?

a) Experimental setup

In this example, we built 4 reference phylogenies using ~900 complete mitochondrial genomes from Linard et al, 2018. For each phylogeny, we ran the Accuracy procedure of PEWO and tested 3 programs (EPA-ng, RAPPAS and PPlacer) with default parameters, plus the supplementary condition k=10 for RAPPAS.

Results will be compared to determine which marker is potentially the best option for accurate placements.

b) Run the example

After PEWO installation, from the installation directory execute the commands of example/4_search_for_most_accurate_taxonomic_marker/README.md. Each marker requires around 6 hours of computation using 2 CPU cores (on an i7 2.5 GHz CPU) and 8 GB of RAM.

c) Results and interpretation

We compiled the results in the table above. Each line reports one of the four mitochondrial loci. Node Distance (ND) and expected Node Distance (eND) are reported for each software. For RAPPAS, results are also reported for k=10, the phylo-kmer size for which similar (16S, cox1) or better (cytb) placement accuracies are observed. For each locus, the lowest observed eND (i.e., best accuracy) is highlighted in bold.

We can conclude the following :

!!!WARNING!!! These conclusions are based on a toy example where only 10 prunings were run. They should be confirmed by a deeper Accuracy analysis based on 50 to 100 prunings !

  • If 12S is selected for species identification : PPlacer or EPA-ng are the appropriate software. In this example, RAPPAS shows lower accuracy (eND=6.84 vs eND=4.52 and eND=4.87 for PPlacer and EPA-ng, respectively). This can be explained by a locus reference alignment that is very gappy. Indeed, high gap proportions are a problem for the first version of the k-mer based approach of RAPPAS (Linard et al, 2019).

  • If 16S is selected : trends are similar to 12S. The 16S locus used in this example is shorter than the 12S, which may explain why placements are less accurate than with 12S. Interestingly, RAPPAS accuracy is closer to the alignment-based methods (this alignment is less gappy).

  • If cox1 is selected : for every software, the observed eND is almost twice the value observed with 12S or 16S. This locus shows a lower resolution for species identification based on phylogenetic placement, despite being almost twice the size of the 12S or 16S loci. One reason can be invoked: the selected cox1 sequences show little sequence diversity, and the ~900 selected species represent only a few families (some clades, homogeneous in terms of cox1 sequence, are over-represented).

  • If cytb is selected : conclusions are similar to the cox1 example. Interestingly, RAPPAS appears to perform better than alignment-based software on this particular locus.

Example 5: CPU/RAM requirements evaluation

When using phylogenetic placement on large datasets, on large trees, or on a daily basis, one may be interested in the computational resources necessary to compute these placements. The 'resource' procedure of PEWO is designed to measure placement cost in terms of memory and CPU.

a) Experimental setup

In config.yaml, the setup of a 'resource' test is done in 3 steps:

  • Prepare a query fasta file containing many query sequences to be placed on the selected reference tree. Of course, these reads have to align to the corresponding reference alignment (overlapping locus and gene). Set the path to this fasta file in the field query_user.

  • Choose the number of repeats. Every command line necessary for placement, including compulsory pre-placement steps (e.g., ancestral reconstruction and database build for RAPPAS), will be launched n times. Memory/CPU/time measurements will be summarized as means over these repeats.

  • Configure which placement software, which parameters and which parameter values are tested. Every combination will be repeated n times. A sketch for inspecting this setup follows the list.
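A minimal sketch for checking this setup from Python, assuming PyYAML; query_user is the field named above, while the key holding the repeat count is an assumption and may differ in your PEWO version:

# Inspect the 'resource' test setup (sketch; the 'repeats' key name is a guess).
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["query_user"])     # fasta file with the query reads to place
print(cfg.get("repeats"))    # hypothetical key: number of repeats per command
print(cfg["test_soft"])      # software whose commands will be measured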

b) Run the example

After PEWO installation, from the installation directory execute commands of example/5_CPU_RAM_requirements_evaluation/README.md . It requires around 4 hours of computation using 2 CPU cores (on a i7 2.5GHz CPU) and 8Gb of RAM.

c) Results and interpretation

Raw measurements for each repeat and each launched command are written in working_directory/benchmarks/ :

#RAPPAS ancestral reconstruction
benchmarks/0_red0.99_arRAXMLNG_ansrec_benchmark.tsv
#RAPPAS DB build
benchmarks/0_k8_o2.0_red0.99_arRAXMLNG_rappas-dbbuild_benchmark.tsv
#RAPPAS placement
benchmarks/0_length0_k8_o2.0_red0.99_arRAXMLNG_rappas-placement_benchmark.tsv

#Reads alignments prior to placement
benchmarks/0_length0_hmmer-align_benchmark.tsv
#EPA-ng placement
benchmarks/0_length0_g0.99999_epang-h1-placement_benchmark.tsv
[...]

Means over the repeats are then computed and reported in the file resources.tsv. It contains the following columns (the memory metrics are further illustrated after the table):

column             description
s                  CPU seconds
h.m.s              wall clock time (hours.minutes.seconds)
max_rss            Peak Resident Set Size: total memory actually held in RAM by the process. RSS can be misleading because it counts all the shared libraries used by the process, even though a shared library is loaded into memory only once, regardless of how many processes use it.
max_vms            Peak Virtual Memory Size: total accessible address space of the process, including memory that may not be resident in RAM (e.g., mallocs that have been allocated but not written to). VMS is of very little use to determine the real memory usage of a process.
max_uss            Peak Unique Set Size: memory that is completely unique to the process. USS is an extremely useful number because it indicates the true incremental cost of running a particular process; when the process is killed, the USS is the memory that is actually returned to the system.
max_pss            Peak Proportional Set Size: differs from RSS in that shared libraries are counted proportionally. PSS is a very useful number because summing the PSS of all processes gives a good representation of the total memory usage of the system.
io_in              input volume
io_out             output volume
mean_load          infrastructure load
repeat             repeat number
operation          name of the operation (rappas-db-build, rappas-placement, epa-ng ...)
all other columns  parameters requested in the configuration file, one per column; the structure is identical to the one described in example 1.
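The four max_* columns correspond to standard process-level memory metrics. As an illustration (not part of PEWO), psutil reports the same quantities for a live process; a sketch, noting that uss/pss require Linux:

# Inspect RSS/VMS/USS/PSS for the current process (sketch, psutil assumed).
import psutil

mem = psutil.Process().memory_full_info()
print(mem.rss, mem.vms, mem.uss, mem.pss)   # bytes; compare with the max_* columns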

Resources themselves are measured via Snakemake's benchmark feature. For information, we noticed that the io_in and io_out fields can be buggy (an issue in Snakemake's benchmarking methods?). They should be interpreted with caution.
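To compare conditions at a glance, resources.tsv can be aggregated with pandas; a minimal sketch, assuming the columns described above:

# Mean runtime and peak RSS per operation, over all repeats (sketch).
import pandas as pd

res = pd.read_csv("resources.tsv", sep="\t")
print(res.groupby("operation")[["s", "max_rss"]].mean())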

Example 6: Faster evaluation with Likelihood-based ACcuracy evaluation (LAC)

This example introduces a different procedure for testing and comparing placement algorithms. For each query sequence, PEWO produces a modified tree by extending the original tree with the placed sequence: the best placement reported by a given software is used to create the new node. Then the overall log-likelihood of the modified tree is calculated. These log-likelihood values can then be used to compare different placement algorithms.
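As an illustration of the grafting step, here is a toy sketch with ete3 (not PEWO's implementation); evaluating the log-likelihood of the extended tree would then be delegated to an external phylogenetic tool:

# Graft a query sequence on its best placement branch (toy sketch).
from ete3 import Tree

tree = Tree("((A:1,B:1):1,(C:1,D:1):1);")
edge = tree.search_nodes(name="C")[0]           # branch reported as best placement

parent = edge.up
middle = parent.add_child(dist=edge.dist / 2)   # split the chosen branch in two
edge.detach()
edge.dist /= 2
middle.add_child(edge)
middle.add_child(name="query", dist=edge.dist)  # attach the placed sequence

print(tree.write(format=1))   # extended tree, ready for log-likelihood evaluation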

a) Run the example

After PEWO installation, from the installation directory execute commands of example/6_placement_likelihood/README.md.

b) Raw results

Likelihood measures for every query and every tested condition are in likelihood.csv. Below are 3 lines extracted from this file:

software  ar       g      k    length  likelihood  mp    ms   o    pruning  query                                              red   sb
epang              0.999       0       -39709.165                  0        8f43808640b0e835dfd588c7c30a955c78cf2252-17084050
pplacer                        0       -39707.929  40.0  6.0       0        8f43808640b0e835dfd588c7c30a955c78cf2252-17084050        3.0
rappas    RAXMLNG         6.0  0       -39709.503             1.5  0        8f43808640b0e835dfd588c7c30a955c78cf2252-17084050  0.99

c) Summaries and plots

For every software configuration, the mean log-likelihood of the modified trees is computed. In this example, each mean is taken over 1000 log-likelihood values, which is the number of sequences in the input dataset. Mean log-likelihoods can be used to compare different software configurations and parameter values. The values are stored in the following csv files (and can be recomputed as sketched after the list):

summary_table_LL_rappas.csv
summary_table_LL_pplacer.csv
[...]
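These means can also be recomputed directly from likelihood.csv; a minimal sketch, assuming pandas and the columns shown in section b):

# Mean log-likelihood of the extended trees, per software (sketch).
import pandas as pd

ll = pd.read_csv("likelihood.csv")
print(ll.groupby("software")["likelihood"].mean())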

Also, these values are plotted for visual exploration of the results:

summary_plot_LL_rappas.svg
summary_plot_LL_pplacer.svg
[...]

These plots are dynamically generated. The number of plotted values depends on the number of parameters and the parameter combinations set up by the user in the PEWO config file. More informative plots exploring more parameter combinations are shown in the previous examples.