Toolbox for performing gene set enrichment analysis in Matlab (including ensemble enrichment)

benfulcher/GeneCategoryEnrichmentAnalysis

Gene Category Enrichment Analysis including Custom Null Ensembles

This is a Matlab toolbox for performing gene-category enrichment analysis relative to two different types of null models:

  1. Random-gene nulls, in which each category is assessed relative to null categories of the same size whose genes are chosen at random. This follows the permutation-based method of Gene Score Resampling (as implemented in ermineJ).
  2. Ensemble-based nulls, in which categories are assessed relative to an ensemble of randomized phenotypes.

The toolbox was introduced in our paper:

Instructions for performing the basic functions of these analyses are in the wiki 📓.

The package is currently set up to perform enrichment on Gene Ontology (GO) Biological Process annotations, but could be modified straightforwardly to use other types of GO annotations, or even to use other annotation systems like KEGG.

Pull requests to improve the functionality and clarity of documentation are very welcome!

Alternative Packages

Note that this repository is no longer in active development, but the same null-testing procedure has been re-implemented in other packages. I would recommend investigating these alternatives:

Repository Organization

The package is organized into directories as follows:

Data:

  1. RawData: all data downloaded from external sources (like GO, MouseMine, etc.)
  2. ProcessedData: raw data processed into Matlab-readable files.

Code:

  1. DataProcessing: code required to process raw data.
  2. GeneScoreResampling, EnsembleEnrichment: code to run both random-gene and randomized-phenotype enrichment analysis.
  3. ResultsComparison: code to compare GSEA results to ermineJ.
  4. Peripheral: additional code files.

To initialize this toolbox, all of these subdirectories should be added to the Matlab path by running the startup script.
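
If you prefer to set the path by hand, the same effect can be sketched as follows. This is a minimal example that assumes the current folder is the root of the cloned repository (an assumption of this sketch, not something the toolbox requires); `addpath` and `genpath` are standard Matlab functions.

```matlab
% Assumption: the current folder is the root of the cloned repository
repoRoot = pwd;

% genpath returns repoRoot together with all of its subfolders;
% addpath places them on the Matlab search path for this session
addpath(genpath(repoRoot));
```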

Running an Analysis

A summary of how to run an enrichment analysis with this package is given here; please read the wiki 📓 for more detailed instructions.

NOTE: This package originally relied on MySQL downloads of the GO data, but GO no longer provides its ontology in this format. As a workaround, the oboConversion directory contains instructions and code for converting recent GO releases (available as go-basic.obo files) into an SQLite database, and the DataProcessing scripts have been updated with SQLite commands so that a MySQL connection is no longer needed.

Preparation: Defining gene-to-category annotations

The first step in running an enrichment analysis is to define the set of gene categories and the genes annotated to each category. Precomputed annotations of this kind, using hierarchy-propagated gene-to-category annotations for GO biological processes (processed on 2019-04-17), can be downloaded from this partner Zenodo data repository.
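
To illustrate what "hierarchy-propagated" means, here is a minimal, self-contained sketch on a toy hierarchy. It is not the toolbox's processing code: the four categories, five genes, and all variable names are hypothetical. The idea shown is that every category inherits the genes annotated to its descendants.

```matlab
% Toy hierarchy of 4 categories (hypothetical, not real GO terms):
% category 1 is the root; 2 and 3 are children of 1; 4 is a child of 2.
numTerms = 4;
isChildOf = false(numTerms);  % isChildOf(c,p) is true if p is a direct parent of c
isChildOf(2,1) = true;
isChildOf(3,1) = true;
isChildOf(4,2) = true;

% Direct annotations: rows are categories, columns are 5 hypothetical genes
direct = false(numTerms,5);
direct(2,1) = true;       % gene 1 is annotated directly to category 2
direct(3,[2 3]) = true;   % genes 2 and 3 are annotated directly to category 3
direct(4,[4 5]) = true;   % genes 4 and 5 are annotated directly to category 4

% Propagate up the hierarchy: each parent inherits the genes of its children.
% Repeat until nothing changes, so that genes reach all ancestors.
propagated = direct;
hasChanged = true;
while hasChanged
    inherited = (double(isChildOf')*double(propagated)) > 0;
    updated = propagated | inherited;
    hasChanged = ~isequal(updated,propagated);
    propagated = updated;
end

% Number of genes annotated to each category after propagation
categorySizes = sum(propagated,2);
disp(categorySizes');
```

After propagation the root category contains all five genes, even though none was annotated to it directly.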

Code in this repository also allows you to reprocess these annotations from raw GO data, as described on this wiki page. You can test this pipeline using the term and term2term tables from a MySQL download of the GO term data on 2019-04-17, which are also available in the associated Zenodo data repository.

Performing Enrichment

All parameters are set using GiveMeDefaultEnrichmentParams, as described in the wiki.

Gene-score resampling (random-gene null)

The Gene Score Resampling method assesses significance relative to a 'random-gene null', and is implemented in the SingleEnrichment function. Instructions to implement this are in the wiki.
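
The logic of a random-gene null can be sketched in a few lines. This is a conceptual illustration on simulated data, not the code inside SingleEnrichment: the gene scores, the category, the choice of the mean as the category score, and all variable names are assumptions made for the example.

```matlab
rng(1);  % fix the random seed so the example is reproducible

numGenes = 200;
geneScores = randn(numGenes,1);   % hypothetical per-gene scores
categoryGenes = 1:15;             % hypothetical category: the first 15 genes
% Plant a signal so the category scores higher than a typical random gene set
geneScores(categoryGenes) = geneScores(categoryGenes) + 2;

% Observed category score: the mean score of the category's genes
categorySize = numel(categoryGenes);
observedScore = mean(geneScores(categoryGenes));

% Null distribution: mean score of equally sized sets of randomly chosen genes
numNulls = 5000;
nullScores = zeros(numNulls,1);
for i = 1:numNulls
    randomGenes = randperm(numGenes,categorySize);  % sample without replacement
    nullScores(i) = mean(geneScores(randomGenes));
end

% One-sided permutation p-value: proportion of null scores that are
% at least as large as the observed score
pValue = mean(nullScores >= observedScore);
fprintf('Observed score = %.2f, p = %.4f\n',observedScore,pValue);
```

Because the null sets are matched to the category's size, a large category is not favored simply for containing more genes.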

Ensemble enrichment

Ensemble enrichment computes the enrichment of a given phenotype relative to an ensemble of randomized phenotypes, as described in our paper.

This proceeds in two steps, as described in the wiki: ComputeAllCategoryNulls precomputes the category nulls, and EnsembleEnrichment evaluates significance relative to these nulls.
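
The two steps can be sketched conceptually on simulated data. This is not the toolbox's implementation. It makes several assumptions for the sake of illustration: gene scores are Pearson correlations between each gene's expression and the phenotype across regions, the category score is the mean gene score, and the null ensemble is generated by shuffling the phenotype. All variable names are hypothetical. The sketch uses implicit expansion, so it needs Matlab R2016b or later.

```matlab
rng(1);  % fix the random seed so the example is reproducible

numRegions = 50;
numGenes = 100;
expressionData = randn(numRegions,numGenes);  % hypothetical region x gene matrix
phenotype = randn(numRegions,1);              % hypothetical phenotype, one value per region
categoryGenes = 1:10;                         % hypothetical category

% Standardize vectors to zero mean and unit norm, so that a dot product
% between two standardized vectors is their Pearson correlation
standardize = @(x) (x - mean(x))/norm(x - mean(x));
expressionStd = expressionData - mean(expressionData,1);
expressionStd = expressionStd./sqrt(sum(expressionStd.^2,1));

% Step 1: precompute category nulls from an ensemble of randomized phenotypes
% (here: random shuffles of the phenotype, an assumption of this sketch)
numNulls = 2000;
nullScores = zeros(numNulls,1);
for i = 1:numNulls
    nullPhenotype = phenotype(randperm(numRegions));
    nullGeneCorrs = expressionStd'*standardize(nullPhenotype);
    nullScores(i) = mean(nullGeneCorrs(categoryGenes));
end

% Step 2: score the real phenotype and compare it to the precomputed nulls
geneCorrs = expressionStd'*standardize(phenotype);
observedScore = mean(geneCorrs(categoryGenes));
pValue = mean(nullScores >= observedScore);
fprintf('Observed score = %.3f, p = %.4f\n',observedScore,pValue);
```

Unlike the random-gene null, the genes in the category stay fixed here; only the phenotype is randomized, so the null keeps the correlation structure between the category's genes. The category nulls in step 1 do not depend on the real phenotype values being tested in step 2, which is why they can be computed once and stored.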