A fully concurrent pipeline for querying transcript-level GTEx data in specific tissues

IMS-Bio2Core-Facility/GTExSnake

Snakemake Workflow: GTExSnake

MIT License | Status: Active | CI/CD | Codestyle: Black | Codestyle: snakefmt

Find us in the "Standardized Usage" Section of the Snakemake Workflow Catalog

If you find the project useful, leave us a star on GitHub!

A list of changes can be found in our CHANGELOG.

Motivation

There are a number of circumstances where transcript-level expression data for a specific tissue is highly valuable. For tissue-dependent expression data, there are few resources better than GTEx. Its medianTranscriptExpression query provides the necessary data: it returns the median expression of each transcript of a gene in a given tissue. Here, we query a list of genes against a region-specific subset of GTEx.

Notes on Installation

A full walkthrough on how to install and use this pipeline can be found here.

To take advantage of Singularity, you'll need to install it separately. If you are running on a Linux system, Singularity can be installed from conda like so:

conda install -n snakemake -c conda-forge singularity

It's a bit more challenging for other operating systems. Your best bet is to follow their instructions here. But don't worry! Singularity is not required! Snakemake will still run each step in its own Conda environment; it just won't put each Conda environment in a container.
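As a minimal sketch of that fallback, the snippet below always asks Snakemake for Conda environments and adds the Singularity flag only when a `singularity` executable is found on the PATH. The helper name `deployment_flags` and the core count of 4 are assumptions for illustration; check the walkthrough for the invocation the pipeline actually recommends.

```shell
# Hypothetical helper: pick Snakemake deployment flags based on what is installed.
deployment_flags() {
  flags="--use-conda"
  # Add containers only if Singularity is available on this machine.
  if command -v singularity >/dev/null 2>&1; then
    flags="$flags --use-singularity"
  fi
  printf '%s\n' "$flags"
}

# Print the command rather than run it (4 cores is an arbitrary choice).
echo "snakemake --cores 4 $(deployment_flags)"
```

Printing the command first makes it easy to confirm which mode you will get before starting a run.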

Get the Source Code

Alternatively, you may grab the source code. You likely will not need these steps if you aren't planning to contribute.

Navigate to our release page on GitHub and download the most recent version. The following will do the trick:

curl -s https://api.github.com/repos/IMS-Bio2Core-Facility/GTExSnake/releases/latest |
grep tarball_url |
cut -d " " -f 4 |
tr -d '",' |
xargs -n1 curl -sL |
tar xzf -

After querying the GitHub API for the most recent release information, we grep for the desired URL, split the line on spaces and extract the fourth field, trim the superfluous quotes and commas, use xargs to pass the URL to curl (allowing for redirects), and un-tar the files. Easy!
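To see what the middle of that pipeline does without touching the network, here is a small sketch that runs the same grep, cut, and tr steps on a single sample line. The sample line mimics the two-space indentation of the API's JSON output; the tag `v0.0.0` and the function name `extract_tarball_url` are made up for illustration.

```shell
# Illustrative input: one line shaped like the API's JSON (the tag v0.0.0 is hypothetical).
sample_line='  "tarball_url": "https://api.github.com/repos/IMS-Bio2Core-Facility/GTExSnake/tarball/v0.0.0",'

# Same three steps as the one-liner: keep the matching line, take the 4th
# space-separated field (two leading spaces yield two empty fields), then
# strip the double quotes and the trailing comma.
extract_tarball_url() {
  grep tarball_url | cut -d " " -f 4 | tr -d '",'
}

url=$(printf '%s\n' "$sample_line" | extract_tarball_url)
echo "$url"
```

Note that the field number depends on the indentation of the JSON, so the one-liner would need adjusting if the API ever changed its formatting.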

Alternatively, for the bleeding edge, please clone the repo like so:

git clone https://github.com/IMS-Bio2Core-Facility/GTExSnake

⚠️ Heads Up! The bleeding edge may not be stable, as it contains all active development.

However you choose to install it, cd into the directory.

Reproducibility

Reproducible results are a cornerstone of the scientific process. By running the pipeline with Snakemake in a Docker image using Conda environments, we ensure that no aspect of the pipeline is left to chance. You will get our analysis, as we ran it, with the software versions, as we used them.

We also strive to make this pipeline as FAIR/O compliant as possible. To that end, it will be available on the Snakemake Workflow Catalog in addition to the usual availability on GitHub.

Unfortunately, any query to an unstable API is inherently not reproducible. Thus, changes in BioMart or GTEx could impact the results. We recognise this as an inherent limitation, and will do our best to keep abreast of API changes that impact the pipeline.

Data

The pipeline requires no input data other than a list of gene names specified in config/config.yaml.

References

It is surprisingly challenging to align RefSeq IDs and Ensembl IDs. This is further complicated because GTEx uses Gencode v26 under the hood. As this is not the most up-to-date version, it proved quite frustrating to find the desired version numbers for each gene. To combat this, the pipeline takes 3 different approaches in parallel:

  1. Gencode v26 GTF annotations are downloaded from EBI, so the user only needs to supply gene names.
  2. A query is made to BioMart to retrieve RefSeq IDs for each ENST returned by GTEx.
  3. Data from MANE is added to help identify consensus transcripts.
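To illustrate the idea behind the first approach, the sketch below looks up the versioned Ensembl gene ID for a gene name in GTF-formatted text. It is not the pipeline's own code: the function name `gene_id_for` is hypothetical, and the two GTF records are synthetic stand-ins (`GENE1`, `ENSG00000000000.1`) rather than real Gencode v26 entries.

```shell
# Synthetic, tab-separated GTF records (not real annotations): one gene, one transcript.
sample_gtf=$(printf 'chr1\tHAVANA\tgene\t1\t100\t.\t+\t.\tgene_id "ENSG00000000000.1"; gene_type "protein_coding"; gene_name "GENE1";\nchr1\tHAVANA\ttranscript\t1\t100\t.\t+\t.\tgene_id "ENSG00000000000.1"; transcript_id "ENST00000000000.1"; gene_name "GENE1";')

# Hypothetical helper: read GTF on stdin, print the versioned gene_id for gene name $1.
gene_id_for() {
  awk -F '\t' -v name="$1" '
    # Only "gene" records whose attribute column names the requested gene.
    $3 == "gene" && $9 ~ ("gene_name \"" name "\";") {
      if (match($9, /gene_id "[^"]+"/)) {
        # Skip the 9-character prefix `gene_id "` and drop the closing quote.
        print substr($9, RSTART + 9, RLENGTH - 10)
      }
    }'
}

printf '%s\n' "$sample_gtf" | gene_id_for GENE1
```

With a real annotation file, the same function could be fed from a decompressed GTF, for example via `gunzip -c` on the downloaded archive.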

Contributing

If you are interested in helping us improve the pipeline, please see our guides on contributing and be sure to abide by our code of conduct!