In this folder you can find a set of examples and scripts showcasing the usage of the library in various deployment settings:
- Simple example of local execution: introductory example
- Local and Remote computation
- GWAS
  - Local execution: simplified version of the query shown in the paper. The computation is performed locally, without the support of a cluster.
  - On Google Cloud Storage: extended version of the example shown in the paper, with more visualizations. Data reside on Google Cloud Storage. NB: this query cannot be executed in the Docker image.
  - On HDFS: same example, but data reside on HDFS. NB: this query cannot be executed in the Docker image.

The HTML folder contains the previous notebooks in HTML format, for easy consultation through a web browser.

- Transcriptional Interaction and Co-regulation Analyser (TICA): the last and most complex application example of the library. This query has been tested and deployed on AWS EMR. We have a script for every cell line. NB: this query cannot be executed in the Docker image.
The `data` folder contains the following example datasets:

- `genes`: used in the first applicative example of the manuscript, about Local/Remote computation
- `HG19_ENCODE_BROAD`: used in the local version of the GWAS analysis
For the TICA query, the user needs to download the complete set of GDM datasets from the following public S3 bucket: https://s3.us-east-2.amazonaws.com/geco-repository/geco-repository-minimal.tar.gz
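For reference, a minimal shell sketch of the download and extraction steps (only the archive URL comes from this document; everything else is standard tooling):

```bash
# Download the GDM dataset archive from the public S3 bucket
wget https://s3.us-east-2.amazonaws.com/geco-repository/geco-repository-minimal.tar.gz

# Unpack the archive in the current directory
tar -xzf geco-repository-minimal.tar.gz
```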
In order to run the programs which make use of a Spark cluster with a Hadoop file system, it is necessary to have:
- A correctly installed Hadoop file system: you can download Hadoop from this link and then follow this guide to set up yours
- A correctly installed Spark distribution: you can download it from this link and then follow the instructions at this link
- The GMQL repository data used in the workflows
  - you can download the whole set of GDM datasets used in the queries from this link
  - unpack the `tar.gz` file
  - use `hdfs dfs -put ./geco-repository hdfs:///` to put the contents of the uncompressed folder in HDFS, as shown in the sketch below
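A minimal shell sketch of this last step, assuming the archive is the one hosted on the public S3 bucket mentioned above and that it unpacks into a `geco-repository` folder:

```bash
# Copy the uncompressed repository folder into the root of HDFS
hdfs dfs -put ./geco-repository hdfs:///

# Check that the datasets are now visible in HDFS
hdfs dfs -ls /geco-repository
```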
We evaluated the performance of the system using Amazon Web Services Elastic MapReduce (EMR), which offers the possibility to specify the Hadoop cluster configuration, the number of nodes, and the instance specifications. For completeness, we also provide as supplementary material the AWS command line scripts to set up an EMR cluster for every configuration defined in the paper. They are available in the `cluster_configuration` folder:
- `AWS_EMR_1m_1s.sh`: 1 master and 1 slave
- `AWS_EMR_1m_3s.sh`: 1 master and 3 slaves
- `AWS_EMR_1m_5s.sh`: 1 master and 5 slaves
- `AWS_EMR_1m_10s.sh`: 1 master and 10 slaves
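As a rough illustration of the kind of command these scripts wrap, here is a hedged sketch of an AWS CLI invocation that would start a cluster with 1 master and 3 slaves; the cluster name, release label, instance type, and key pair below are placeholders, not the values used in the paper (those are defined in the provided scripts):

```bash
# Hypothetical example: create an EMR cluster with 1 master + 3 slave nodes.
# Release label, instance type, and key name are placeholders; see the
# cluster_configuration scripts for the actual configurations.
aws emr create-cluster \
    --name "gmql-benchmark-1m-3s" \
    --release-label emr-5.12.0 \
    --applications Name=Hadoop Name=Spark \
    --instance-type m4.xlarge \
    --instance-count 4 \
    --use-default-roles \
    --ec2-attributes KeyName=my-key-pair
```

Note that `--instance-count` includes the master node, so a 1-master/3-slave configuration corresponds to a count of 4.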