There is a complete tutorial on using R for the Sevenbridges Cancer Genomics Cloud, by Tengfei Yin.
I have also written two additional READMEs to support the practical and theoretical course I gave at IARC in February 2017:
- using Cancer Genomics Cloud interface to run our first analysis
- using R and CWL to run reproducible analyses
Sevenbridges maintains a GitHub repository for its API client, CWL schema, meta schema and SDK helper in R, here.
Tutorials from Tengfei Yin for multiple tasks can be found in the GitHub repository, including interesting ones for TCGA data analysis:
- Use R on the CancerGenomicsCloud
- Describe and execute Common Workflow Language (CWL) Tools and Workflows in R
- Browse data on the Cancer Genomics Cloud via the Data Explorer, a SPARQL query, or the Datasets API
A good scheme for using the R API to analyse TCGA data is the following:
- Create your docker image or use an existing one.
- Choose the machine you want. The default is m4.2xlarge (8 CPUs, 32 GB, 40 cents/h). You pay for at least 1 hour.
- Create a tool or a workflow (directly in R, or by importing a CWL file, which can be written in JSON or YAML).
- Add specific data to your project (use the queries to keep the analysis reproducible).
- Run your analysis with a loop over your files.
- Filter and count TCGA entities with dataset API
- GUI Data Browser tutorial. CAUTION: filters applied after the "file" entity are ignored when you add the queried files to your project.
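To get a feeling for what a run costs, the default m4.2xlarge rate and the one-hour minimum mentioned above can be turned into a small estimate. This is only a sketch in base R: `estimate_task_cost` is a hypothetical helper, the rate is the 40 cents/h figure quoted above, and it assumes a flat hourly rate with no rounding beyond the one-hour minimum. Storage cost is ignored.

```r
# Rough compute-cost estimate for one or more tasks.
# Assumptions (not taken from the CGC documentation): flat hourly rate,
# billed time never below one hour, no other rounding, storage ignored.
estimate_task_cost <- function(run_hours, rate_per_hour = 0.40) {
  stopifnot(is.numeric(run_hours), all(run_hours >= 0))
  billed_hours <- pmax(run_hours, 1)  # "you pay for at least 1 hour"
  billed_hours * rate_per_hour
}

# Three tasks running 15 minutes, 1 hour and 3.5 hours
estimate_task_cost(c(0.25, 1, 3.5))
```

A 15-minute task is therefore estimated at the same price as a one-hour task, which is worth keeping in mind when looping over many small files.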
The steps are as follows:
- use the GUI to add lung BAM files to your project
- use this R script to:
  - load the platypus and bgzip JSON tools
  - connect them into a workflow
  - add the workflow to your project (the two previous steps can be skipped if your app is already present in the project)
  - loop over the BAM files to run the variant calling on each sample
  - download each VCF file locally
  - transfer each VCF from the local computer to the IARC HPC
  - delete the VCF files on the CGC (first check that every VCF file was downloaded correctly)
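The check before deleting the VCF files on the CGC can be as simple as comparing local file sizes with the sizes reported by the platform. The sketch below uses base R only. `vcf_download_ok` is a hypothetical helper, and it assumes you have already collected the expected sizes in bytes, for example from the file metadata.

```r
# TRUE for each VCF that exists locally and whose size matches the
# expected size in bytes; FALSE for missing or truncated files.
vcf_download_ok <- function(local_paths, expected_sizes) {
  stopifnot(length(local_paths) == length(expected_sizes))
  local_sizes <- file.size(local_paths)  # NA when the file is missing
  !is.na(local_sizes) & local_sizes == expected_sizes
}

# Toy demonstration: one 10-byte file that exists, one that does not
demo_file <- tempfile(fileext = ".vcf.gz")
writeBin(as.raw(1:10), demo_file)
missing_file <- file.path(tempdir(), "missing_sample.vcf.gz")
vcf_download_ok(c(demo_file, missing_file), c(10, 10))
# TRUE FALSE
```

Only the files flagged `TRUE` should then be deleted on the CGC. A checksum comparison would be stricter than a size comparison, if checksums are available.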
The R API can be used to analyse several task features, such as:
- task execution time (queue + run)
- task price (computing + storage)
This script is an example of task analysis; it produces this sort of picture.
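As a minimal illustration of this kind of analysis (not the script referred to above), queue time, run time and total price can be derived from a table of task timestamps. The data frame below is entirely made up for the example, and its column names (`created`, `started`, `ended`, `price_compute`, `price_storage`) are hypothetical. In practice these values would come from the task details returned by the API.

```r
# Toy task table with invented values, for illustration only.
tasks <- data.frame(
  name          = c("sample_A", "sample_B"),
  created       = as.POSIXct(c("2017-02-01 10:00:00", "2017-02-01 10:00:00"), tz = "UTC"),
  started       = as.POSIXct(c("2017-02-01 10:05:00", "2017-02-01 10:20:00"), tz = "UTC"),
  ended         = as.POSIXct(c("2017-02-01 11:35:00", "2017-02-01 12:20:00"), tz = "UTC"),
  price_compute = c(0.60, 0.80),
  price_storage = c(0.02, 0.03)
)

# Execution time = queue time + run time, in minutes
tasks$queue_min <- as.numeric(difftime(tasks$started, tasks$created, units = "mins"))
tasks$run_min   <- as.numeric(difftime(tasks$ended, tasks$started, units = "mins"))
tasks$total_min <- tasks$queue_min + tasks$run_min

# Task price = computing + storage
tasks$price_total <- tasks$price_compute + tasks$price_storage

tasks[, c("name", "queue_min", "run_min", "total_min", "price_total")]
```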
- A query returns only the first 100 files.
- If you upload a JSON file through the GUI, you cannot run a task with that app from R.
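One generic way around the 100-file limit is to request results page by page until a short page comes back. The sketch below shows only the looping logic in base R. `fetch_all` and `fake_page` are hypothetical names, and the `offset`/`limit` arguments are an assumption about how the underlying query is paged, so check the API documentation before relying on it.

```r
# Collect all results from a paged query.
# fetch_page must be a function(offset, limit) returning a list with
# at most `limit` items (assumed interface, not a sevenbridges function).
fetch_all <- function(fetch_page, limit = 100) {
  results <- list()
  offset <- 0
  repeat {
    page <- fetch_page(offset = offset, limit = limit)
    results <- c(results, page)
    if (length(page) < limit) break  # a short page means we reached the end
    offset <- offset + limit
  }
  results
}

# Stand-in for a real query: 250 fake file names served page by page
all_files <- as.list(sprintf("file_%03d.bam", 1:250))
fake_page <- function(offset, limit) {
  idx <- seq_along(all_files)
  all_files[idx > offset & idx <= offset + limit]
}

length(fetch_all(fake_page))
# 250
```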
The CGC uses two types of Amazon EC2 pricing for instances: On-Demand and Spot. On-Demand instances are purchased at a fixed rate, while the price of Spot Instances varies according to supply and demand.
- The CGC strategy is to bid the On-Demand instance price for Spot instances.
- AWS EC2 will terminate your Spot instance if the bid price < the market price.
- In this case, the task continues on an On-Demand instance.
- If a Spot instance is terminated within its first hour of running, you are not charged for it.
- Spot instances are not recommended for time-critical jobs.
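The rules above can be written down as a small decision function, which makes it easier to reason about what happens to a given task. This is a toy model in base R, not an interface to AWS or the CGC: `spot_outcome` and its arguments are hypothetical, and the prices are whatever you pass in.

```r
# Toy model of the Spot rules listed above.
# market_price:    current Spot market price per hour
# on_demand_price: On-Demand price per hour (used as the bid)
# hours_run:       hours the Spot instance has been running
spot_outcome <- function(market_price, on_demand_price, hours_run) {
  bid <- on_demand_price             # the CGC bids the On-Demand price
  terminated <- bid < market_price   # EC2 terminates when bid < market price
  list(
    terminated   = terminated,
    continues_on = if (terminated) "on-demand" else "spot",
    # no Spot charge if the instance is terminated within its first hour
    spot_charged = !(terminated && hours_run < 1)
  )
}

# Market price rises above the bid after 30 minutes of running
str(spot_outcome(market_price = 0.50, on_demand_price = 0.40, hours_run = 0.5))
```

In this example the Spot instance is terminated, the task moves to an On-Demand instance, and the interrupted Spot time is not charged.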