This repository contains scripts and files related to the analysis of RNAseq, ATACseq, and CUT&RUN data to investigate the contrasting roles of MSH2 and MLH1 in tumor metastasis.
The code provided here is fully reproducible. Except for the original `fastq.gz` files, all other files and results can be recreated using the scripts included in this repository. Due to size constraints and data ownership, large files and non-public files are omitted. For access to these files, please see the `.gitignore` and email [email protected].
To run these pipelines, navigate to the HiPerGator directory of your choice and run `git clone https://github.com/HeatherKates/TanziaMLH1MSH2.git`.
Each project is divided into subanalyses, and each subanalysis is run on its own. RNAseq and ATACseq are divided into MLH1 and MSH2 directories. CUT&RUN is further divided into KO and R4 directories.
Each analysis has a `data/` folder that contains a symlink to the actual location of the data on HiPerGator.
Clone this repository to your directory of choice, e.g.
`ssh [email protected]`
`cd /blue/zhangw/hkates`
`git clone https://github.com/HeatherKates/TanziaMLH1MSH2.git`
ALL SCRIPTS SHOULD BE RUN FROM THE MAIN ANALYSIS DIRECTORY.
`cd TanziaMLH1MSH2/RNAseq`
`nano scripts/1a_fastqc.sbatch` to change resource requests (or edit another way)
`sbatch scripts/1a_fastqc.sbatch`
MAKE SURE THAT YOU CHANGE THE HIPERGATOR RESOURCE REQUESTS IN ANY .SBATCH FILES.
Nothing else needs to be changed.
Run .sbatch scripts by typing `sbatch FILENAME`.
Run .R scripts by typing `module load R` and then `Rscript FILENAME`.
Executing scripts may create directories and files in the respective `results/` folders, overwriting any existing files. To re-run an analysis without overwriting files, rename the directory and re-clone the repo before starting again.
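One way to do the rename step is sketched below. It assumes "the directory" means the cloned repository folder. The `archive_dir` helper and the timestamp suffix are illustrative and are not part of this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped in this repo): rename a directory with a
# timestamp suffix so that a later run or a fresh clone cannot overwrite it.
archive_dir() {
    local dir="${1%/}"                      # drop a trailing slash, if any
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    local archived
    archived="${dir}_$(date +%Y%m%d_%H%M%S)"
    mv -- "$dir" "$archived"
    echo "$archived"                        # report the new name
}

# Example, run from the directory that holds the clone:
#   archive_dir TanziaMLH1MSH2
#   git clone https://github.com/HeatherKates/TanziaMLH1MSH2.git
```

The function prints the new directory name, so it can be recorded in a log or reused in a later command.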
This directory contains the RNA-seq pipeline files and scripts to analyze the roles of MSH2 and MLH1 in tumor metastasis. Below is an overview of the main folders and the contents within each directory.
- `MLH1/`:
  - `results/`: This directory contains subdirectories corresponding to the main stages of the RNA-seq analysis pipeline.
    - `1_fastqc/`: Quality control outputs from FASTQ files.
    - `2_fastp/`: Contains results from FASTQ trimming and preprocessing with Fastp.
    - `3_salmon/`: Outputs from transcript quantification using Salmon.
    - `4_deseq2/`: DESeq2 results for differential expression analysis, including MSH2 and MLH1 comparisons.
    - `5_GSEA/`: Results from Gene Set Enrichment Analysis (GSEA).
    - `6_GO/`: GO enrichment analysis outputs.
  - `scripts/`: Contains all scripts used to perform each step of the pipeline, named sequentially to indicate the recommended order of execution.
    - `1a_fastqc.sbatch`: Script to perform quality control using FastQC.
    - `1b_summarize_fastqc.bash`: Summarizes FastQC results for review.
    - `2_fastp.sbatch`: Trims and preprocesses FASTQ files with Fastp.
    - `3a_salmon_human_index.sbatch`: Creates a human transcriptome index for Salmon.
    - `3b_salmon_human.sbatch`: Runs Salmon for transcript quantification.
    - `4_deseq2.R`: DESeq2 analysis for MLH1-related samples.
    - `5_GSEA.R`: Gene Set Enrichment Analysis for MLH1.
    - `6_GO.R`: GO enrichment analysis for MLH1-related genes.
    - `logs/`: Directory for log files generated by each step of the pipeline.
- `01_summarize_deseq2.R`: Summarizes DESeq2 results for both MSH2 and MLH1 comparisons.
- `02_deseq2_viz.R`: Generates visualizations of DESeq2 differential expression results.
Each script is designed to facilitate reproducibility and can be run sequentially to perform the complete RNA-seq analysis for the study.
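The sequential run could be automated with Slurm job dependencies, as in the sketch below. The script names are taken from the listing in this README. The chaining, the `scripts/` path relative to the analysis directory, and the `DRY_RUN` switch are assumptions made for illustration. With `DRY_RUN=1` (the default here) the commands are only printed.

```shell
#!/usr/bin/env bash
# Sketch: submit the RNA-seq sbatch steps in order, each one waiting for the
# previous job to finish successfully. DRY_RUN is a hypothetical switch.
DRY_RUN="${DRY_RUN:-1}"

steps=(
    scripts/1a_fastqc.sbatch
    scripts/2_fastp.sbatch
    scripts/3a_salmon_human_index.sbatch
    scripts/3b_salmon_human.sbatch
)

prev=""
for step in "${steps[@]}"; do
    cmd=(sbatch --parsable)                 # --parsable prints only the job ID
    if [ -n "$prev" ]; then
        cmd+=(--dependency="afterok:${prev}")
    fi
    cmd+=("$step")
    if [ "$DRY_RUN" = "1" ]; then
        echo "${cmd[*]}"
        prev="JOBID"                        # placeholder; real IDs come from sbatch
    else
        prev="$("${cmd[@]}")"
    fi
done
```

The `.bash` and `.R` steps are left out because they are not submitted with `sbatch`. They would be run by hand once the jobs they depend on have finished.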
This pipeline processes ATAC-seq data starting from raw FASTQ files through trimming, alignment, peak calling, and differential binding analysis. Below is a summary of each step and the associated input/output files.
Runs Trim Galore to remove adapter sequences from the raw FASTQ files.
Input Files:
- Raw FASTQ files (`R1` and `R2`) for each lane:
  - Example: `231-MLH1KO-1_S1_L001_R1_001.fastq.gz`, `231-MLH1KO-1_S1_L001_R2_001.fastq.gz`

Output Files:
- Trimmed FASTQ files for each lane:
  - Example: `MLH1KO-1_L001_val_1.fq.gz`, `MLH1KO-1_L001_val_2.fq.gz`
- Trimming report files:
  - Example: `231-MLH1KO-1_S1_L001_R1_001.fastq.gz_trimming_report.txt`
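A paired-end Trim Galore call that would produce the file names above might look like the following dry-run sketch. The `results/trimmed` output directory is hypothetical, and the use of `--basename` is inferred from the output names, not taken from the repository's script.

```shell
#!/usr/bin/env bash
# Dry-run sketch: build and print a Trim Galore command for one lane.
sample="MLH1KO-1"
lane="L001"
r1="231-${sample}_S1_${lane}_R1_001.fastq.gz"
r2="231-${sample}_S1_${lane}_R2_001.fastq.gz"
outdir="results/trimmed"                    # hypothetical output location

# In paired mode, --basename NAME yields NAME_val_1.fq.gz and NAME_val_2.fq.gz.
trim_cmd=(trim_galore --paired --basename "${sample}_${lane}"
          --output_dir "$outdir" "$r1" "$r2")
echo "${trim_cmd[*]}"
# To execute for real: "${trim_cmd[@]}"
```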
Runs FastQC on both raw and trimmed FASTQ files to assess read quality.
Input Files:
- Raw FASTQ files (`R1`, `R2`) for each lane
- Trimmed FASTQ files (`val_1.fq.gz`, `val_2.fq.gz`)

Output Files:
- FastQC HTML reports:
  - Example: `231-MLH1KO-1_S1_L001_R1_001_fastqc.html`
- FastQC data files (`fastqc_data.txt` inside the `.zip`):
  - Example: `231-MLH1KO-1_S1_L001_R1_001_fastqc.zip`
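As a rough illustration, FastQC can take raw and trimmed files in a single call. The output directory and thread count below are assumptions, and the command is printed instead of executed.

```shell
#!/usr/bin/env bash
# Dry-run sketch: one FastQC call over raw and trimmed reads for one lane.
outdir="results/fastqc"                     # hypothetical output directory
inputs=(
    231-MLH1KO-1_S1_L001_R1_001.fastq.gz
    231-MLH1KO-1_S1_L001_R2_001.fastq.gz
    MLH1KO-1_L001_val_1.fq.gz
    MLH1KO-1_L001_val_2.fq.gz
)
fastqc_cmd=(fastqc --outdir "$outdir" --threads 4 "${inputs[@]}")

echo "mkdir -p $outdir"                     # FastQC expects the directory to exist
echo "${fastqc_cmd[*]}"
```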
Aligns trimmed reads to the reference genome using Bowtie2.
Input Files:
- Trimmed FASTQ files from Trim Galore:
  - Example: `MLH1KO-1_L001_val_1.fq.gz`, `MLH1KO-1_L001_val_2.fq.gz`

Output Files:
- SAM alignment files:
  - Example: `MLH1KO-1.sam`
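Because the per-lane inputs become a single per-sample SAM file, one plausible approach is to pass all lanes to Bowtie2 as comma-separated lists. In the sketch below, the second lane name, the index prefix, and the `--very-sensitive -X 2000` settings are assumptions, not values confirmed by the repository.

```shell
#!/usr/bin/env bash
# Dry-run sketch: align all lanes of one sample in a single Bowtie2 call.
sample="MLH1KO-1"
lanes=(L001 L002)                           # only L001 appears in this README
index="reference/bowtie2/genome"            # hypothetical index prefix

r1_list=""
r2_list=""
for lane in "${lanes[@]}"; do
    # ${var:+,} inserts a comma only when the list already has an entry.
    r1_list+="${r1_list:+,}${sample}_${lane}_val_1.fq.gz"
    r2_list+="${r2_list:+,}${sample}_${lane}_val_2.fq.gz"
done

align_cmd=(bowtie2 --very-sensitive -X 2000 -p 8 -x "$index"
           -1 "$r1_list" -2 "$r2_list" -S "${sample}.sam")
echo "${align_cmd[*]}"
```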
Converts SAM to BAM, applies filtering, and sorts BAM files for downstream analysis.
Input Files:
- SAM alignment files:
  - Example: `MLH1KO-1.sam`

Output Files:
- Filtered and sorted BAM files:
  - Example: `MLH1KO-1.sorted.bam`
- Flagstat statistics:
  - Example: `MLH1KO-1.bam.flagstat.log`
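A minimal sketch of this step with Samtools follows. The filters shown (MAPQ of at least 30, properly paired reads only) are assumptions, since the README does not state which filters are applied, and `filter_and_sort` is a hypothetical function name. The actual script may also write an intermediate unsorted BAM, which this sketch skips.

```shell
#!/usr/bin/env bash
# Sketch: SAM -> filtered BAM -> sorted BAM, plus flagstat summary.
filter_and_sort() {
    local sample="$1"
    # -b: write BAM, -q 30: minimum mapping quality, -f 2: properly paired only
    samtools view -b -q 30 -f 2 "${sample}.sam" \
        | samtools sort -o "${sample}.sorted.bam" -
    samtools flagstat "${sample}.sorted.bam" > "${sample}.bam.flagstat.log"
}

# Only run when the tool and the input are actually present.
if command -v samtools >/dev/null 2>&1 && [ -f "MLH1KO-1.sam" ]; then
    filter_and_sort "MLH1KO-1"
else
    echo "samtools or MLH1KO-1.sam not found; skipping"
fi
```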
Calls ATAC-seq peaks using Genrich.
Input Files:
- Filtered and sorted BAM files from Samtools:
  - Example: `MLH1KO-1.sorted.bam`

Output Files:
- NarrowPeak files (peak regions):
  - Example: `MLH1KO-1.narrowPeak`
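A Genrich call in ATAC-seq mode could look like the dry-run sketch below. The duplicate removal and the exclusion of `chrM` are assumptions, and the mitochondrial contig name depends on the reference used. Note that Genrich expects BAM files sorted by read name (`samtools sort -n`); this README does not say which sort order `MLH1KO-1.sorted.bam` uses.

```shell
#!/usr/bin/env bash
# Dry-run sketch: call peaks for one sample with Genrich.
sample="MLH1KO-1"

# -j: ATAC-seq mode, -r: remove PCR duplicates, -e chrM: skip mitochondrial reads
genrich_cmd=(Genrich -t "${sample}.sorted.bam" -o "${sample}.narrowPeak"
             -j -r -e chrM)
echo "${genrich_cmd[*]}"
```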
Creates BigWig files from BAM files for visualization in genome browsers.
Input Files:
- Filtered and sorted BAM files from Samtools:
  - Example: `MLH1KO-1.sorted.bam`

Output Files:
- BigWig files for visualization:
  - Example: `MLH1KO-1.bw`
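The README does not name the tool used for this step. Assuming deepTools `bamCoverage`, a dry-run sketch might be as follows. The bin size, CPM normalization, and processor count are illustrative choices. `bamCoverage` needs a coordinate-sorted, indexed BAM.

```shell
#!/usr/bin/env bash
# Dry-run sketch: index a BAM and convert it to BigWig (deepTools is assumed).
sample="MLH1KO-1"

index_cmd=(samtools index "${sample}.sorted.bam")
bw_cmd=(bamCoverage --bam "${sample}.sorted.bam" --outFileName "${sample}.bw"
        --binSize 10 --normalizeUsing CPM --numberOfProcessors 4)

echo "${index_cmd[*]}"
echo "${bw_cmd[*]}"
```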
Performs differential binding analysis between groups (e.g., KO vs WT) using DiffBind.
Input Files:
- NarrowPeak files for each sample:
  - Example: `MLH1KO-1.narrowPeak`
- BAM files for each sample:
  - Example: `MLH1KO-1.sorted.bam`

Output Files:
- CSV file with differential binding results:
  - Example: `MLH1_differential_binding_results.csv`
- Filtered differential binding results for KO and R4 samples:
  - Example: `MLH1KO_filtered_differential_binding_results.csv`
  - Example: `MLH1R4_filtered_differential_binding_results.csv`
- BED file with peaks data (optional):
  - Example: `MLH1_filtered_differential_binding_results.bed`
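DiffBind reads its inputs from a sample sheet that pairs each sample's BAM file with its peak file. The sketch below generates such a sheet from sample names. Only `MLH1KO-1` appears in this README; the other sample names, the `make_diffbind_sheet` helper, and the rule of deriving condition and replicate from the name are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: write a DiffBind-style sample sheet (CSV) to standard output.
make_diffbind_sheet() {
    echo "SampleID,Condition,Replicate,bamReads,Peaks,PeakCaller"
    local s cond rep
    for s in "$@"; do
        cond="${s%-*}"                      # text before the last hyphen, e.g. MLH1KO
        rep="${s##*-}"                      # text after the last hyphen, e.g. 1
        echo "${s},${cond},${rep},${s}.sorted.bam,${s}.narrowPeak,narrow"
    done
}

# Hypothetical sample names, apart from MLH1KO-1.
samples=(MLH1KO-1 MLH1KO-2 MLH1WT-1 MLH1WT-2)
make_diffbind_sheet "${samples[@]}"
```

Redirecting the output to a file (for example `> samplesheet_MLH1.csv`, a hypothetical name) gives a sheet that can be passed to DiffBind in R.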
Warnings and issues encountered during the pipeline run are logged in `logs/`.
This directory contains the CUT&RUN pipeline files and scripts to analyze the roles of MSH2 and MLH1 in tumor metastasis. Below is an overview of the main folders and the contents within each directory.
CUT&RUN analysis is performed using the nf-core/cutandrun pipeline (https://github.com/nf-core/cutandrun/tree/master), so scripts are minimal and the output directory structure follows https://nf-co.re/cutandrun/3.2.2/results/cutandrun/results-6e1125d4fee4ea7c8b70ed836bb0e92a89e3305f/.
Dependencies are provided using a custom conda environment loaded in the .sbatch script.
- `MSH2`: This directory contains the analysis related to peak calling in MSH2KO and MSH2R4 cell lines
  - `KO/`: Peak calling in MSH2KO cell lines
    - `samplesheet.csv`: This is the main input for nf-core/cutandrun that lists the groups, read files, and controls for each sample
    - `cutandrun.sbatch`: .sbatch script to run nf-core/cutandrun pipeline. Change resource requests to your own.
    - `*fastq.gz`: symlinks to the original read files on /orange
  - `R4/`: Peak calling in MSH2R4 cell lines
    - `samplesheet.csv`: This is the main input for nf-core/cutandrun that lists the groups, read files, and controls for each sample
    - `cutandrun.sbatch`: .sbatch script to run nf-core/cutandrun pipeline. Change resource requests to your own.
    - `*fastq.gz`: symlinks to the original read files on /orange
- `MLH1`: This directory contains the analysis related to peak calling in MLH1KO and MLH1R4 cell lines
  - `KO/`: Peak calling in MLH1KO cell lines
    - `samplesheet.csv`: This is the main input for nf-core/cutandrun that lists the groups, read files, and controls for each sample
    - `cutandrun.sbatch`: .sbatch script to run nf-core/cutandrun pipeline. Change resource requests to your own.
    - `*fastq.gz`: symlinks to the original read files on /orange
  - `R4/`: Peak calling in MLH1R4 cell lines
    - `samplesheet.csv`: This is the main input for nf-core/cutandrun that lists the groups, read files, and controls for each sample
    - `cutandrun.sbatch`: .sbatch script to run nf-core/cutandrun pipeline. Change resource requests to your own.
    - `*fastq.gz`: symlinks to the original read files on /orange
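Before submitting `cutandrun.sbatch`, it can help to confirm that `samplesheet.csv` has the column header nf-core/cutandrun 3.x expects. The sketch below does that and prints a pipeline command. `check_samplesheet` is a hypothetical helper, and the printed command is an illustration, not the contents of the repository's `.sbatch` script. A real run also needs a reference genome option and any site-specific profile.

```shell
#!/usr/bin/env bash
# Sketch: validate the samplesheet header, then print a nextflow command.
expected_header="group,replicate,fastq_1,fastq_2,control"

check_samplesheet() {
    local sheet="$1" header
    header="$(head -n 1 "$sheet" | tr -d '\r')"   # tolerate Windows line endings
    if [ "$header" != "$expected_header" ]; then
        echo "unexpected header in $sheet: $header" >&2
        return 1
    fi
}

# Version 3.2.2 matches the results URL cited in this README.
run_cmd=(nextflow run nf-core/cutandrun -r 3.2.2
         --input samplesheet.csv --outdir results)
echo "${run_cmd[*]}"

# Usage, from a KO/ or R4/ directory:
#   check_samplesheet samplesheet.csv && sbatch cutandrun.sbatch
```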
To ensure full reproducibility, follow the steps below:
- Download Original Data: Obtain the original `fastq.gz` files, or run the scripts on HiPerGator, where they will access the data in /orange (permission is restricted to users in zhangw).
- Run Scripts: Use the scripts provided in the `scripts` directory to process the data and generate results.
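Since the pipelines reach the reads through symlinks, a quick check that every link resolves can catch missing data or permission problems before any job is submitted. This is a minimal sketch; `list_broken_links` is a hypothetical helper and is not part of the repository.

```shell
#!/usr/bin/env bash
# Sketch: print every symlink under a directory whose target cannot be reached.
list_broken_links() {
    local dir="${1:-data}"
    # test -e follows the link, so it fails when the target is missing
    # or unreadable; the negation keeps only those links.
    find "$dir" -type l ! -exec test -e {} \; -print
}

# Usage, from an analysis directory:
#   list_broken_links data
```

An empty result means every symlink under the directory points at something accessible to the current user.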
For access to large files or any other inquiries, please contact:
Heather Kates
Email: [email protected]
None
This research was supported by [unknown]