Skip to content

Latest commit

 

History

History
86 lines (61 loc) · 5.89 KB

README.md

File metadata and controls

86 lines (61 loc) · 5.89 KB

README

This repo provides a pipeline for analyzing the stability of taxa in all given alignments. We add the option of first subsampling the given set of alignments to generate a dataset of alignment that contain different numbers of sequences and also vary in sequence length.

Data

The path to the directory containing the data on which the pipeline is to be run needs to be provided in config.yaml as data_folder. This directory should contain subdirectories which themselves contain alignments as nexus or fasta files, or symlinks to such files.

Prerequisites

Python packages that are required for running this pipeline can be found in environment.yml. A conda environment called phylostability for our analysis can be installed using conda env create -f environment.yml.

The non-python utility CONSEL can be downloaded and installed separately using the instructions included on the homepage.

The line defining path_to_consel in the file config.yaml should be updated to reflect the local build of CONSEL

It also requires a local build of raxml-ng, which is extensively documented.

The line defining path_to_raxml in the file config.yaml should be updated to reflect the local build of RAxML

Running the pipeline

To run the workflow, execute:

snakemake -c36

Note that this will use 36 cores, you can adjust accordingly. The number of cores to be used also needs to be updated in config.yaml.

Subsampling datasets

By default, our pipeline subsamples the dataset you input in config.yaml so that num_samples (in config.yaml) alignments are chosen for the analysis. The num_samples alignments are chosen randomly so that different alignment lengths and number of taxa are included (aligmnents are binned according to these parameters and drawn uniformly from the bins). If you want to run the analysis on all provided alignments, set num_samples in config.yaml to be the total number of datasets. Alternatively, you could delete selected_data/ from the data_folder path in Analysis_snakefile and run snakemake --snakefile Analysis_snakefile -cX.

If the pipeline has run successfully, rerunning snakemake -c36 will add num_samples to the already existing run and will perform random forest regression and classification on all datasets together. The default stability_measure used in the analysis is the normalized Taxon Influence Index (normalised_tii), but this can be set to be rf_radius in config.yaml.

Output

The names of datasets that are subsampled and used in the stability analysis are saved in the file {data_folder}/selected_datasets.csv, where data_folder is the name of the directory containing all data as set in config.yaml. All data produced in the analysis is saved in the directory {data_folder}/selected_data/. This includes a number of csv files as well as alignments and trees, which are saved in directories named by the names of the alignment files as provided in the input data. The most important csv file is probably {data_folder}/selected_data/rf_regression_combined_statistics.csv, which contains all summary statistics computed for all taxa, which are then used for training random forests (or a balanced subset of these taxa is used). The following plots summarize results of our stability analysis, including results of random forest regression and classification, and can be found in {data_folder}/selected_data/plots:

Filename Plot
au_pie_chart.pdf Pie chart summarizing number of taxa for which AU-test is significant/non-significant and stability_measure is zero/non-zero
au_test_classifier_features Feature importances of random forest classifier trained to predict AU-test output (significant vs non-significant)
au_test_classifier_results.pdf ROC curve for classifier predicting AU-test significance vs non-significance
combined_random_forest_features.pdf Feature importances of random forest classifier and regressor trained to predict stability_measure (classifier predicts TII zero vs non-zero)
discrete_random_forest_model_features.pdf Feature importances of random forest classifier predicting stability_measure
dist_instability_to_low_bootstrap.pdf Average normalized topological distance of unstable edges to closest low bootstrap support node ($<70$)
normalised_tii.pdf Histogram showing number of times each normalized TII value is observed
random_forest_classifier_results.pdf ROC curve of random forest classifier predicting stability_measure zero vs non-zero
random_forest_model_features.pdf Feature importances of random forest regression predicting stability_measure
random_forest_results.pdf Scatterplot of predicted vs true values of stability_measure
rf_radius_num_taxa.pdf Scatterplot of RF radius vs number of taxa of alignment of the taxon for which we observe that RF radius
rf_radius.pdf Histogram showing number of times each normalized RF radius is observed
tii.pdf Histogram showing number of times each TII is observed (not normalized)

The random forests classifiers require a minimum amount of data in each class. If there are not enough datasets for all classes in the classifier, there will be an error message. The remaining analysis will be performed, but no random forest will be built and therefore not all output plots will be computed.