code related to Shu et al., 2016
tnbc_pipeline_scripts.py contains all Python scripts for data analysis in Shu et al., 2016
Many of the functions and scripts here call functions and code from the Bradner lab pipeline https://github.com/BradnerLab/pipeline
PREREQUISITES:
- Clone this repo and cd into the TNBC folder
  - All paths are set relative to this folder
  - It is recommended that you cd into the TNBC folder to run all analysis
- Please install bamliquidator https://github.com/BradnerLab/pipeline/wiki/bamliquidator
  - Set the bamliquidator path globally as bamliquidator
  - Set the bamliquidator_batch path globally as bamliquidator_batch
  - Install macs1.4.2 and add it to the global path as macs14 (http://liulab.dfci.harvard.edu/MACS/Download.html)
  - Install bowtie2, samtools, cufflinks, cuffquant, and R
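As a quick sanity check on the prerequisites above, the following minimal sketch reports which of the listed executables cannot be found on the PATH. It is not part of the repository. It assumes Python 3 (for `shutil.which`), and the function name `find_missing_tools` is hypothetical; the executable names are the ones given in the list above.

```python
import shutil

# Executable names taken from the prerequisites list; adjust them if your
# installation exposes the tools under different names.
REQUIRED_TOOLS = [
    "bamliquidator",
    "bamliquidator_batch",
    "macs14",
    "bowtie2",
    "samtools",
    "cufflinks",
    "cuffquant",
    "R",
]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All prerequisite tools were found on PATH.")
```

This only confirms that an executable with the expected name is reachable. It does not check versions, so macs14 and bowtie2 versions still need to be verified by hand.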
- Clone the Bradner lab pipeline https://github.com/BradnerLab/pipeline
  - Edit the paths section of pipeline_dfci.py to correctly point to the pipeline directory, samtools, bamliquidator, and bamliquidator_batch
- Data tables
  - A number of data tables are included to help organize files for analysis
  - Each row represents a specific dataset. The UNIQUE_ID and NAME correspond to the sample name in GEO for Series GSE63584
  - The BACKGROUND column gives the control for a given dataset
  - ENRICHED_REGION and ENRICHED_MACS refer to the locations of peak files generated by various peak calling algorithms
  - For more information, consult the loadDataTable function in the pipeline_dfci.py module
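To inspect a data table outside the pipeline, a minimal sketch along the following lines may help. It is not the pipeline's loadDataTable function. It assumes the table is tab-delimited with a header row whose column names match those described above (UNIQUE_ID, NAME, BACKGROUND, and so on); check your copy of the table before relying on it. The function names `read_data_table` and `background_by_name` are hypothetical.

```python
import csv


def read_data_table(path):
    """Read a tab-delimited table with a header row into a list of dicts.

    Assumption: the first line holds the column names.
    """
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def background_by_name(rows):
    """Map each dataset NAME to its BACKGROUND value (empty values become None)."""
    return {row["NAME"]: (row.get("BACKGROUND") or None) for row in rows}
```

Calling `background_by_name(read_data_table(path))` on a table gives a quick overview of which control is paired with each dataset.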
- Obtaining raw data
  - Raw fastqs may be obtained here: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE63584
  - It is recommended that fastqs be deposited into the ./fastq folder
- Generate sorted and indexed bam files for each sample
  - Align raw fastqs to the hg19 genome. The bowtie2 index can be found here: ftp://igenome:[email protected]/Homo_sapiens/UCSC/hg19/Homo_sapiens_UCSC_hg19.tar.gz
  - For alignment we used bowtie2 v2.2.1 with default parameters except for -k 1
  - Place bams in the ./bam folder, or edit the "FILE_PATH" column in CHIP_DATA_TABLE.txt to point to the folder containing the bams
  - Bams should be named to include the UNIQUE_ID as in CHIP_DATA_TABLE.txt. The UNIQUE_ID is derived from the GEO sample name
  - Bams should be sorted and indexed, and each should have a corresponding .bai file with the same name
  - NOTE: code will not work if multiple bams in the directory share the same UNIQUE_ID
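The naming rules for bams can be checked before running any analysis. The sketch below is not part of the repository. It assumes that a bam "includes" a UNIQUE_ID when the ID appears as a substring of the file name, and that the index is named by appending `.bai` to the bam file name (as `samtools index` does by default); both are assumptions to adjust to your own layout. The function names `find_bams_for_id` and `check_bams` are hypothetical.

```python
import os


def find_bams_for_id(bam_dir, unique_id):
    """Return sorted .bam file names in bam_dir whose name contains unique_id."""
    return sorted(
        name
        for name in os.listdir(bam_dir)
        if name.endswith(".bam") and unique_id in name
    )


def check_bams(bam_dir, unique_ids):
    """Return a dict mapping each failing UNIQUE_ID to a short problem description."""
    problems = {}
    for unique_id in unique_ids:
        matches = find_bams_for_id(bam_dir, unique_id)
        if len(matches) == 0:
            problems[unique_id] = "no bam found"
        elif len(matches) > 1:
            # Several bams sharing one UNIQUE_ID is the case the NOTE warns about.
            problems[unique_id] = "multiple bams: " + ", ".join(matches)
        elif not os.path.exists(os.path.join(bam_dir, matches[0] + ".bai")):
            # Assumed index naming: <name>.bam.bai next to the bam.
            problems[unique_id] = "missing index " + matches[0] + ".bai"
    return problems
```

An empty result from `check_bams("./bam", ids)` means every ID matched exactly one indexed bam. Because matching is by substring, an ID that is a prefix of another ID will be reported as having multiple bams.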
- Running python analysis
  - cd into the repo directory
  - Run each block of code by uncommenting it and then running python ./tnbc_pipeline_scripts.py
  - All paths are set relative to the TNBC folder
- Running R scripts
  - All R scripts are found in ./figure_code
  - Additional R scripts are used for expression analysis and to generate plots of ChIP-Seq data
  - Output figures can be found in ./figures