This repository is in support of the PathCORE-T paper (https://doi.org/10.1101/147645). It contains all the code and necessary data/metadata to repeat all analyses in the paper.

greenelab/PathCORE-T-analysis

Overview

This repository contains the scripts to run the analyses described in the PathCORE-T paper. Running ./ANALYSIS.sh is sufficient to reproduce the results in the paper. To use PathCORE-T in your own analyses, please review the sections from The PathCORE-T analysis workflow onwards in this README.

We released two Python packages for PathCORE-T:

The two packages are used in this analysis repository.

The data directory

A README is provided in the ./data directory with details about the scripts to download and/or process datasets, data source citations, etc.

The figures directory

All figures in the PathCORE-T paper are also available here.

The jupyter-notebooks directory

Scripts used to generate Figure 3 and Supplemental Figure 2 are provided in notebook format. We have found that we can offer greater detail about each of the figures in this format.

Tutorials

This directory also contains two notebooks that users can read through or run when getting started with PathCORE-T analysis:

The PathCORE-T analysis workflow

Please review one of the analysis_<dataset>_<model>.sh scripts for an example of the workflow.

In the figure below, (a) is used to generate the weight matrix and (b) specifies the inputs to the PathCORE-T analysis in (c):

[Figure: PathCORE-T analysis workflow diagram]

Scripts (in order of execution):

  1. run_network_creation.py

    Iterates through a directory of weight matrices generated by a feature construction algorithm that has been applied to a transcriptomic dataset. Multiple weight matrices can be constructed from the same algorithm initialized with different random seeds. The eADAGE example uses multiple weight matrices, whereas the two NMF examples only use one weight matrix.

  2. run_permutation_test.py

    Iterates through a directory of network files and applies a permutation test to the networks to determine edge significance. If there is more than 1 network file in the directory, the networks are combined to make a single aggregate network. Edges that are significant under their corresponding nulls (generated by the permutation test) are kept in the final network.
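To make the permutation step concrete, here is a minimal sketch of a generic permutation test for co-occurrence edges. It is not the implementation used by run_permutation_test.py. It assumes each feature is represented as a list of pathway names, that an edge weight is the number of features in which two pathways appear together, and that the null is built by shuffling pathway labels across features while keeping each feature's size. All function names, parameters, and the significance threshold are hypothetical.

```python
import itertools
import random
from collections import Counter


def cooccurrence_counts(feature_pathways):
    """Count how many features contain each unordered pathway pair."""
    counts = Counter()
    for pathways in feature_pathways:
        # set() guards against a pathway listed twice in one feature
        for pair in itertools.combinations(sorted(set(pathways)), 2):
            counts[pair] += 1
    return counts


def permute_features(feature_pathways, rng):
    """Shuffle pathway labels across features, keeping each feature's size."""
    pool = [p for pathways in feature_pathways for p in pathways]
    rng.shuffle(pool)
    permuted, start = [], 0
    for pathways in feature_pathways:
        permuted.append(pool[start:start + len(pathways)])
        start += len(pathways)
    return permuted


def significant_edges(feature_pathways, n_permutations=1000, alpha=0.05, seed=0):
    """Keep edges whose observed weight is rarely matched under the null."""
    rng = random.Random(seed)
    observed = cooccurrence_counts(feature_pathways)
    exceed = Counter()
    for _ in range(n_permutations):
        null = cooccurrence_counts(permute_features(feature_pathways, rng))
        for edge, weight in observed.items():
            if null.get(edge, 0) >= weight:
                exceed[edge] += 1
    # Empirical p-value with add-one smoothing so it is never exactly zero
    pvalues = {e: (exceed[e] + 1) / (n_permutations + 1) for e in observed}
    return {e: w for e, w in observed.items() if pvalues[e] <= alpha}
```

Each edge is compared against its own null distribution, which mirrors the description above of keeping edges that are significant under their corresponding nulls. A real analysis would also need a multiple-testing correction, which this sketch omits.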

Additional:

  • constants directory

    This module allows for import of two dictionaries: GENE_SIGNATURE_DEFINITIONS and SHORTEN_PATHWAY_NAMES. These are intended to be modified when you need to run PathCORE-T using a feature construction algorithm and/or pathway definitions different from those in our case studies.

    In most cases, the files in constants are the only ones you should need to modify to run an analysis of your own.

  • utils.py

    Utility functions for file reading & processing.

Web application database setup

Here we describe the steps taken to prepare the database that backs the PathCORE-T demo application. The demo application is built on the Flask microframework and deployed on Heroku. The database is a MongoDB instance hosted on mLab.

Both Heroku and mLab provide free tier options for their services.

Note that analysis_Paeruginosa_eADAGE.sh passes the --metadata flag to run_network_creation.py. The metadata directory this produces is read during the web application setup, which is carried out by running web_db_Paeruginosa_eADAGE.sh.

Scripts (in order of execution):

  • web_initialize_db.py

    Creates the following collections:

    • genes: Stores the gene identifiers. Assumes these can be retrieved from the first column (the row names/index) of the transcriptomic dataset. For the PAO1 example, we provided an additional file (for more information, see data/README.md) that has the common names corresponding to the gene locus tags specified in the compendium.

    • pathways: Stores the pathway & definition information from the pathway definitions file.

    • sample_labels: Stores the sample labels and the corresponding normalized expression values. Assumes the labels can be retrieved from the first row (the header) of the transcriptomic dataset and each column is the vector of expression values corresponding to that sample.

    • network_edges: Stores the network files in the networks directory created by running run_network_creation.py.

    • network_feature_signatures: Stores the feature gene signature information in the metadata directory created by running run_network_creation.py ... --metadata

    • network_feature_pathways: Stores the feature pathway definitions in the metadata directory created by running run_network_creation.py ... --metadata

    • sample_annotations: Specific to the PAO1 example, we store additional information about the samples in the compendium that can be displayed on the web application (for more information about the sample annotations file, see data/README.md).

  • web_edge_page_data.py

    Creates the collection pathcore_edge_data, which stores all information needed to render an edge page. The script computes gene odds ratios and sample "summary" expression scores, and creates heatmaps based on these values.
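The layout assumptions for the genes and sample_labels collections (gene identifiers in the first column, sample labels in the header row, one expression vector per sample column) can be illustrated with a short sketch. This is not the code in web_initialize_db.py. The function name, the document field names ("gene", "sample", "expression"), the tab delimiter, and the assumption that the header row has a leading cell above the gene column are all hypothetical.

```python
import csv
import io


def parse_compendium(handle, delimiter="\t"):
    """Split a genes-by-samples table into gene and sample documents."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    # Assumption: the first header cell labels the gene column
    sample_names = header[1:]
    genes = []
    columns = [[] for _ in sample_names]
    for row in reader:
        if not row:
            continue  # skip blank lines
        genes.append(row[0])
        for column, value in zip(columns, row[1:]):
            column.append(float(value))
    gene_docs = [{"gene": g} for g in genes]
    sample_docs = [
        {"sample": name, "expression": column}
        for name, column in zip(sample_names, columns)
    ]
    return gene_docs, sample_docs


example = "gene\tS1\tS2\nPA0001\t0.1\t0.2\nPA0002\t0.3\t0.4\n"
gene_docs, sample_docs = parse_compendium(io.StringIO(example))
```

Lists of dictionaries like these are the shape a MongoDB client expects for a bulk insert, so each list could be written to its collection in a single call.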

Additional:

  • utils_setup_PAO1_example.py

    Utility functions in support of the PAO1 example. Gets the gene common names and sample annotations information.

PathCORE-T web application setup

Step 1: mLab setup

  • Register for an mLab account at mLab.com.
  • Create new: Create a free sandbox database (0.5 GB).
  • Database Users tab: Add a user to the new database that has write-access.
  • Create a credentials file (see example-mLab-credentials.yml).
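As a rough illustration of how database credentials are typically turned into a connection string, the sketch below builds a standard MongoDB URI from a dictionary. The keys used here ("user", "password", "host", "port", "database") are assumptions and may not match the fields in example-mLab-credentials.yml, so check that file for the actual names. The helper name is also hypothetical.

```python
from urllib.parse import quote_plus


def build_mongo_uri(credentials):
    """Build a mongodb:// URI from a dict of (hypothetical) credential fields."""
    # Percent-encode the user and password so characters such as '@' or '/'
    # do not break the URI
    user = quote_plus(credentials["user"])
    password = quote_plus(credentials["password"])
    host = credentials["host"]
    port = int(credentials["port"])
    database = credentials["database"]
    return f"mongodb://{user}:{password}@{host}:{port}/{database}"


# Placeholder values only; a real file would hold your own mLab details
example_credentials = {
    "user": "writer",
    "password": "p@ss/word",
    "host": "example.mlab.com",
    "port": 12345,
    "database": "pathcore",
}
uri = build_mongo_uri(example_credentials)
```

In practice the dictionary would be loaded from the YAML credentials file rather than written inline, and the file should be kept out of version control.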

Step 2: Run web_initialize_db.py

Step 3: Run web_edge_page_data.py

Fork the PathCORE-T-demo repository. Follow the setup instructions in the repository's README. Update or remove any text or code specific to the eADAGE-based, KEGG PAO1 case study so that the web application accurately describes and supports your analysis.