eDAVE - extension for GDC Data Analysis, Visualization, and Exploration (DAVE) Tools
The goal of this project is to provide a highly efficient GUI to analyse, visualize and explore datasets from GDC (Genomic Data Commons). This project contains two main components:
- data-processing-pipeline/ -> implements the data processing pipeline that builds the local data repository.
- app/ -> implements the Dash-based app, which is an interface to the local data repository.
jan.binkowski[at]pum.edu.pl
Because of technical limitations, the web-based eDAVE provides access to only part of the data deposited in the GDC. To overcome this limitation, you can run eDAVE locally. To do so, follow one of the two alternative paths described below, and also open data-processing-pipeline/config.json and update the fields listed below.
- FILES_LIMIT # max files/samples in local data repository
- MIN_COMMON_SAMPLES # min common samples (exp and met data) per single category
- MIN_SAMPLES_PER_SAMPLE_GROUP # min number of samples per single category
- MAX_SAMPLES_PER_SAMPLE_GROUP # max number of samples per single category
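The fields above can be edited by hand, but a small helper may be convenient when switching between test and production settings. The following is a minimal sketch, assuming config.json is a JSON object with these fields as top-level keys; the helper name update_config and the example value are hypothetical and not part of eDAVE.

```python
import json
from pathlib import Path


def update_config(path, **overrides):
    """Load a JSON config, overwrite the given top-level keys, and save it back.

    Assumption: the config is a flat JSON object whose top-level keys include
    the fields being overridden (e.g. FILES_LIMIT).
    """
    config_path = Path(path)
    config = json.loads(config_path.read_text(encoding="utf-8"))

    # Refuse unknown keys so that a typo does not silently add a new field.
    unknown = sorted(set(overrides) - set(config))
    if unknown:
        raise KeyError(f"Unknown config field(s): {', '.join(unknown)}")

    config.update(overrides)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Example call (illustrative value, run from the main eDAVE directory):
# update_config("data-processing-pipeline/config.json", FILES_LIMIT=100)
```

Rejecting unknown keys is a design choice of this sketch: it keeps a misspelled field name from being written to the file unnoticed.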
1.1 make sure that you have Python >= 3.10 installed
1.2 install poetry dependency manager
pip install poetry
1.3 clone repository
git clone https://github.com/ClinicalEpigeneticsLaboratory/eDAVE.git
1.4 open project directory and install required dependencies
poetry install
1.5 install pre-commit [optional]
poetry run pre-commit install
1.6 Alternative for steps 1.2-1.5 using Makefile
make set_up
This script builds the data repository required to run the Dash app. It is based on the GDC API and the GDC data transfer tool.
Please note that FIELDS, FILTERS, and GDC TRANSFER TOOL EXECUTABLE are declared in the data-processing-pipeline/config.json file.
Additionally, the GDC API requires a maximum FILES_LIMIT parameter. For testing purposes this parameter should be a relatively small number, e.g. 100 (default). In production mode, however, it should be 100000.
cd data-processing-pipeline/
poetry run python run.py # Please be patient, usually it takes around 12-24h to download all datasets
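To illustrate how FIELDS, FILTERS and FILES_LIMIT might come together in a request to the GDC API, here is a minimal sketch that only builds the query parameters for the public files endpoint. It assumes FILES_LIMIT corresponds to the API's size parameter; how run.py actually uses these settings is not shown in this README, and the function name, field names and filter below are illustrative, not the project's configuration.

```python
import json

# Public GDC API endpoint for file metadata.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_files_query(fields, filters, files_limit=100):
    """Build query parameters for a GDC files request.

    Assumption: files_limit plays the role of FILES_LIMIT and is passed
    to the API as the `size` parameter (maximum number of returned records).
    """
    if files_limit < 1:
        raise ValueError("files_limit must be a positive integer")

    return {
        "filters": json.dumps(filters),  # the API expects filters as a JSON string
        "fields": ",".join(fields),      # comma-separated list of fields to return
        "format": "JSON",
        "size": str(files_limit),
    }


# Illustrative values only; the real ones are declared in config.json.
example_fields = ["file_id", "file_name", "cases.case_id"]
example_filters = {
    "op": "in",
    "content": {
        "field": "files.data_type",
        "value": ["Gene Expression Quantification"],
    },
}

params = build_files_query(example_fields, example_filters, files_limit=100)
print(params["size"])
```

The resulting dictionary could then be passed to an HTTP client, for example as the params argument of requests.get(GDC_FILES_ENDPOINT, params=params), if the requests package is available.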
Please note that to run the app in production mode, you must set debug: false in the app/config.json file. Please remember that the app requires an existing local data repository from step 2.
cd app/
poetry run python app.py # development mode
poetry run gunicorn app:server # production mode
Alternatively, you can run the app in a Docker container. This solution covers all the steps described in path 1.
git clone https://github.com/ClinicalEpigeneticsLaboratory/eDAVE.git && cd eDAVE/
docker build . -t edave # build an image. Please be patient, usually it takes around 12-24h to download all datasets
# Optional: to view summary of image vulnerabilities and recommendations
# docker scout quickview
# once the image is created you may start the container using the following command
docker run -p 8000:8000 edave # run container
To ensure code quality we use black, isort, pylint and bandit. To run these tools:
make
or specifically:
make black
make isort
make pylint
make bandit
To run unit tests, open the main eDAVE directory and type:
make tests_data_processing_pipeline # to run data processing pipeline tests
make tests_app # to run app tests