GitHub - cBioPortal/gdc-et-pipeline: GSoC : Spring-batch based system to extract and transform GDC hosted data to suitable cBioPortal file formats.

GDC ET Pipeline is a Spring Batch based tool that transforms cancer genomic data from NCI's GDC repository into file formats that can be loaded into cBioPortal.

Slides discussing this pipeline can be found here (2017) and here (2019).

The pipeline currently requires that the data being transformed is available on the filesystem. To download data, use the GDC Portal to generate a manifest file containing all of the files desired, then use the GDC Data Transfer Tool to download the files.

GDC Portal: https://portal.gdc.cancer.gov/

GDC Data Transfer Tool: https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Getting_Started/

Downloading data with the data transfer tool can be done on the command line as follows:

gdc-client download -m <MANIFEST_FILE.txt>
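Because the pipeline reads the downloaded files from disk, it can help to confirm that every manifest entry actually arrived before running it. The following is a minimal sketch, not part of the pipeline: `check_downloads` is a hypothetical helper, and it assumes the manifest is tab-separated with the file UUID and file name in its first two columns, and that each file was saved as `<download_dir>/<uuid>/<filename>`. Adjust it if your download layout differs.

```shell
# Hypothetical helper: list manifest entries missing from a download directory.
# Assumptions: tab-separated manifest with a header row, UUID in column 1,
# file name in column 2, files stored as <download_dir>/<uuid>/<filename>.
check_downloads() {
  manifest=$1
  download_dir=$2
  missing=0
  tab=$(printf '\t')
  {
    read -r _header   # skip the header row
    # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
    while IFS=$tab read -r id filename _rest || [ -n "$id" ]; do
      [ -n "$id" ] || continue   # ignore blank lines
      if [ ! -f "$download_dir/$id/$filename" ]; then
        echo "missing: $id/$filename"
        missing=$((missing + 1))
      fi
    done
  } < "$manifest"
  # Succeed only when nothing is missing.
  [ "$missing" -eq 0 ]
}
```

Calling `check_downloads <MANIFEST_FILE.txt> <DOWNLOAD_DIRECTORY>` prints one line per missing file and returns a non-zero status if any file is absent.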

GDC ET Pipeline

The pipeline uses a manifest file to determine which data files it must process. More details on the manifest file can be found in the [GDC Data Transfer Tool documentation](https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Preparing_for_Data_Download_and_Upload/). The pipeline expects the user to have already downloaded the actual data files from the GDC repository.

GDC hosts several data types. The pipeline currently supports converting Clinical, Mutation, CNA, and Expression data files into files that are ready to import into cBioPortal. More details about these file types can be found in the cBioPortal documentation.

Brief overview of steps

The batch runs in several steps, each accomplishing a different task. The current implementation stage of each step is noted, and each step can be extended further.

Running the pipeline

The batch takes several required options as well as optional parameters that the user can provide.

$JAVA_HOME/bin/java org.cbio.gdcpipeline.GDCPipelineApplication -<option>
List of options:
-c,--cancer_study_id <arg>             [REQUIRED]  Cancer Study Id
-o,--output <arg>                      [REQUIRED]  output directory for files
-s,--source <arg>                      [REQUIRED]  source directory for files
-d,--datatypes <arg>                   [OPTIONAL]  Datatypes to run. Default is All
-f,--filter_normal_sample <arg>        [OPTIONAL]  True or False. Flag to filter
                                                   normal samples. Default is True
-i,--isoformOverrideSource <arg>       [OPTIONAL]  Isoform Override Source. Default
                                                   is 'uniprot'
-m,--manifest_file <arg>               [OPTIONAL]  Manifest file path
-separate_mafs,--separate_mafs <arg>   [OPTIONAL]  True or False. Process MAF files
                                                   individually or merge together.
                                                   Default is False
-h,--help                                          shows this help document and
                                                   quits.

After the data has been downloaded using the GDC Data Transfer Tool described above, you can call the pipeline like so:

$JAVA_HOME/bin/java -jar target/gdcpipeline-0.0.1-SNAPSHOT.jar -c <CANCER_STUDY_ID> -m <MANIFEST_FILE> -o <OUTPUT_DESTINATION> -s <DOWNLOADED_RAW_FILES_LOCATION>
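A small wrapper can catch a missing manifest or source directory before the JVM starts. This is only a sketch under stated assumptions: `run_pipeline` and the `JAVA_BIN` override are hypothetical names, not part of the pipeline. The jar path and the `-c`, `-m`, `-o`, and `-s` options are taken from the call above. Treating the manifest as mandatory and creating the output directory up front are choices made for this example, not documented pipeline requirements.

```shell
# Hypothetical wrapper: validate inputs, then launch the pipeline.
# JAVA_BIN is an illustrative override; it defaults to $JAVA_HOME/bin/java.
run_pipeline() {
  study_id=$1
  manifest=$2
  output_dir=$3
  source_dir=$4
  jar=target/gdcpipeline-0.0.1-SNAPSHOT.jar   # path used in this README

  [ -n "$study_id" ] || { echo "cancer study id is required" >&2; return 1; }
  [ -f "$manifest" ] || { echo "manifest not found: $manifest" >&2; return 1; }
  [ -d "$source_dir" ] || { echo "source directory not found: $source_dir" >&2; return 1; }
  # Creating the output directory is an assumption made for this sketch.
  mkdir -p "$output_dir" || return 1

  "${JAVA_BIN:-$JAVA_HOME/bin/java}" -jar "$jar" \
    -c "$study_id" -m "$manifest" -o "$output_dir" -s "$source_dir"
}
```

For example, `run_pipeline <CANCER_STUDY_ID> <MANIFEST_FILE> <OUTPUT_DESTINATION> <DOWNLOADED_RAW_FILES_LOCATION>` stops with an error message if the manifest or source directory does not exist, and otherwise runs the same command as above.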
