Snakemake pipeline for CUT&Tag analysis
If you do not have a Snakemake environment, please follow the instructions in the maxsonBraunLab Snakemake setup repository to create one.
NOTE: Some of the pipeline setup instructions are tailored for working on Oregon Health and Science University's Linux-based high-performance computing cluster (ARC, formerly Exacloud). If you have access to another computing cluster, then some steps may need to be modified accordingly.
NOTE: Exacloud users transitioning to ARC at the end of 2024 may need to set a new container cache directory (APPTAINER_CACHEDIR) in their ~/.bashrc file, as shown in the Set Singularity/Apptainer cache directory section below.
On the GitHub repo for this pipeline, click the green "Code" button near the top right corner and copy the Clone / HTTPS URL (ends with .git). Then run the following on the command line:
# cd into a project directory
# paste the copied URL after `git clone` command to get a copy of pipeline
git clone <URL for repo>
# navigate to the main pipeline directory (where the snakefile is)
cd <main pipeline directory>
# create a directory for your fastq files
mkdir -p data/raw
# link fastqs to data/raw
ln -s /path/to/fastq/files/sample1_R1.fastq.gz data/raw
ln -s /path/to/fastq/files/sample1_R2.fastq.gz data/raw
ln -s /path/to/fastq/files/sample2_R1.fastq.gz data/raw
ln -s /path/to/fastq/files/sample2_R2.fastq.gz data/raw
...
# make scripts executable
chmod +x src/*.py src/*.sh *.sh
IMPORTANT
After creating the symlinks, rename all the symlinks in data/raw to the following format: {condition}_{replicate}_{mark}_{R1|R2}.fastq.gz
For example, a file with the original name LIB200706TB_M6Q3_RBP1_S93_L001_R1_001.fastq.gz would be renamed to M6Q_3_RBP1_R1.fastq.gz.
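If many files need renaming, a scripted rename can help. Below is a minimal sketch, assuming the original names follow the Illumina-style pattern in the example above; the sed expression is illustrative only and must be adapted to your actual file names:
# illustrative only: adapt the regular expression to your own naming scheme
cd data/raw
for f in *_001.fastq.gz; do
    # e.g. LIB200706TB_M6Q3_RBP1_S93_L001_R1_001.fastq.gz -> M6Q_3_RBP1_R1.fastq.gz
    new=$(echo "$f" | sed -E 's/^[^_]+_M6Q([0-9])_(RBP1)_S[0-9]+_L[0-9]+_(R[12])_001\.fastq\.gz$/M6Q_\1_\2_\3.fastq.gz/')
    [ "$f" != "$new" ] && mv -v "$f" "$new"
done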
Activate an environment containing Snakemake, and then run the make_sample_sheet.py script from the root of the directory:
python src/make_sample_sheet.py data/raw
This will make a samplesheet for the experiment called samplesheet.tsv in the root of the directory. The files src/deseq2_metadata.csv and src/diffbind_config.csv will also be created. The contents of the samplesheet will be structured like the following example:
sample R1 R2 mark condition igg
HoxE_1_IgG data/raw/HoxE_1_IgG_R1.fastq.gz data/raw/HoxE_1_IgG_R2.fastq.gz IgG HoxE HoxE_1_IgG
HoxE_1_Rbp1 data/raw/HoxE_1_Rbp1_R1.fastq.gz data/raw/HoxE_1_Rbp1_R2.fastq.gz Rbp1 HoxE HoxE_1_Rbp1
HoxE_2_Rbp1 data/raw/HoxE_2_Rbp1_R1.fastq.gz data/raw/HoxE_2_Rbp1_R2.fastq.gz Rbp1 HoxE HoxE_2_Rbp1
HoxE_3_Rbp1 data/raw/HoxE_3_Rbp1_R1.fastq.gz data/raw/HoxE_3_Rbp1_R2.fastq.gz Rbp1 HoxE HoxE_3_Rbp1
HoxM_1_IgG data/raw/HoxM_1_IgG_R1.fastq.gz data/raw/HoxM_1_IgG_R2.fastq.gz IgG HoxM HoxM_1_IgG
HoxM_1_Rbp1 data/raw/HoxM_1_Rbp1_R1.fastq.gz data/raw/HoxM_1_Rbp1_R2.fastq.gz Rbp1 HoxM HoxM_1_Rbp1
HoxM_2_Rbp1 data/raw/HoxM_2_Rbp1_R1.fastq.gz data/raw/HoxM_2_Rbp1_R2.fastq.gz Rbp1 HoxM HoxM_2_Rbp1
HoxM_3_Rbp1 data/raw/HoxM_3_Rbp1_R1.fastq.gz data/raw/HoxM_3_Rbp1_R2.fastq.gz Rbp1 HoxM HoxM_3_Rbp1
HoxW_1_IgG data/raw/HoxW_1_IgG_R1.fastq.gz data/raw/HoxW_1_IgG_R2.fastq.gz IgG HoxW HoxW_1_IgG
HoxW_1_Rbp1 data/raw/HoxW_1_Rbp1_R1.fastq.gz data/raw/HoxW_1_Rbp1_R2.fastq.gz Rbp1 HoxW HoxW_1_Rbp1
HoxW_2_Rbp1 data/raw/HoxW_2_Rbp1_R1.fastq.gz data/raw/HoxW_2_Rbp1_R2.fastq.gz Rbp1 HoxW HoxW_2_Rbp1
HoxW_3_Rbp1 data/raw/HoxW_3_Rbp1_R1.fastq.gz data/raw/HoxW_3_Rbp1_R2.fastq.gz Rbp1 HoxW HoxW_3_Rbp1
The script splits each file name on '_' and uses the first field for the condition and the third field for the mark. The 'igg' column is initially identical to the 'sample' column and should be manually replaced with the sample name of the IgG (or other control) you would like to use for that sample. If the sample is itself an IgG, the entry can stay the same as its own name; this won't affect peak calling.
So a fixed version of the table above would look like this:
sample R1 R2 mark condition igg
HoxE_1_IgG data/raw/HoxE_1_IgG_R1.fastq.gz data/raw/HoxE_1_IgG_R2.fastq.gz IgG HoxE HoxE_1_IgG
HoxE_1_Rbp1 data/raw/HoxE_1_Rbp1_R1.fastq.gz data/raw/HoxE_1_Rbp1_R2.fastq.gz Rbp1 HoxE HoxE_1_IgG
HoxE_2_Rbp1 data/raw/HoxE_2_Rbp1_R1.fastq.gz data/raw/HoxE_2_Rbp1_R2.fastq.gz Rbp1 HoxE HoxE_1_IgG
HoxE_3_Rbp1 data/raw/HoxE_3_Rbp1_R1.fastq.gz data/raw/HoxE_3_Rbp1_R2.fastq.gz Rbp1 HoxE HoxE_1_IgG
HoxM_1_IgG data/raw/HoxM_1_IgG_R1.fastq.gz data/raw/HoxM_1_IgG_R2.fastq.gz IgG HoxM HoxM_1_IgG
HoxM_1_Rbp1 data/raw/HoxM_1_Rbp1_R1.fastq.gz data/raw/HoxM_1_Rbp1_R2.fastq.gz Rbp1 HoxM HoxM_1_IgG
HoxM_2_Rbp1 data/raw/HoxM_2_Rbp1_R1.fastq.gz data/raw/HoxM_2_Rbp1_R2.fastq.gz Rbp1 HoxM HoxM_1_IgG
HoxM_3_Rbp1 data/raw/HoxM_3_Rbp1_R1.fastq.gz data/raw/HoxM_3_Rbp1_R2.fastq.gz Rbp1 HoxM HoxM_1_IgG
HoxW_1_IgG data/raw/HoxW_1_IgG_R1.fastq.gz data/raw/HoxW_1_IgG_R2.fastq.gz IgG HoxW HoxW_1_IgG
HoxW_1_Rbp1 data/raw/HoxW_1_Rbp1_R1.fastq.gz data/raw/HoxW_1_Rbp1_R2.fastq.gz Rbp1 HoxW HoxW_1_IgG
HoxW_2_Rbp1 data/raw/HoxW_2_Rbp1_R1.fastq.gz data/raw/HoxW_2_Rbp1_R2.fastq.gz Rbp1 HoxW HoxW_1_IgG
HoxW_3_Rbp1 data/raw/HoxW_3_Rbp1_R1.fastq.gz data/raw/HoxW_3_Rbp1_R2.fastq.gz Rbp1 HoxW HoxW_1_IgG
For this example there was only one IgG per condition, so the sample name of that IgG was used for every sample in the condition. If each sample had its own control file, each entry would correspond to the IgG for that sample. If only one IgG was used in the whole experiment, its sample name could be used in every row. If you are not using IgG, set USEIGG to false in config.yml and leave the samplesheet unmodified.
Edit the runtime configuration in the file config.yml (a hypothetical sketch follows this list):
- If using Singularity/Apptainer containers instead of Conda, specify the path to the folder containing the Singularity image files (.sif).
- Specify whether or not to trim adapters and filter out low-quality reads from raw reads to improve downstream alignment rate. You can decide based on adapter content in the sequencing core's FastQC reports.
- Specify the path to a FASTA file containing sequences to trim from reads. The default is src/adapter_seqs.fa; additional sequences can be added to the file if needed. This file path is ignored if adapter trimming is set to False.
- Specify the path to the fastq_screen configuration file.
- Specify the path to the Bowtie2 index for the genome you are aligning to.
- Specify the path to the genome FASTA file.
- Specify whether or not to use IgG for peak calling.
- Specify the path to the chromosome sizes file.
- Optionally, specify the path to a tab-delimited file of blacklisted genomic regions to filter out. If blacklist removal is not desired, set this to an empty string ("").
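For orientation, a filled-in config.yml might look roughly like the sketch below. Only SINGULARITY_IMAGE_FOLDER and USEIGG are key names confirmed elsewhere in these docs; the remaining keys are illustrative placeholders, so match them to the keys in the shipped config.yml:
# hypothetical sketch -- use the actual key names from the pipeline's config.yml
SINGULARITY_IMAGE_FOLDER: /path/to/sif_folder   # only used for Singularity/Apptainer runs
TRIM_ADAPTERS: True                             # placeholder key: toggle adapter trimming
ADAPTER_FASTA: src/adapter_seqs.fa              # ignored if adapter trimming is False
FASTQ_SCREEN_CONFIG: /path/to/fastq_screen.conf # placeholder key
BOWTIE2_INDEX: /path/to/bowtie2_index           # placeholder key
GENOME_FASTA: /path/to/genome.fa                # placeholder key
USEIGG: True                                    # set to False if not using IgG controls
CHROM_SIZES: /path/to/chrom.sizes               # placeholder key
BLACKLIST: ""                                   # empty string disables blacklist removal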
The pipeline detects samples in the subdirectory data/raw with the following assumptions:
- Reads are paired-end.
- Read 1 and Read 2 are designated using "_R1" and "_R2".
- Samples are named using the following convention: {condition}_{replicate}_{mark}_{R1|R2}.fastq.gz. This format affects the output files and ensures that bigwig files from the same mark are merged (a quick sanity check follows this list).
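As a quick sanity check of the naming convention, the sketch below lists any FASTQ in data/raw whose name does not have the four underscore-delimited fields described above (the regex is a loose check for illustration, not part of the pipeline):
# flag files that do not match {condition}_{replicate}_{mark}_{R1|R2}.fastq.gz
for f in data/raw/*.fastq.gz; do
    basename "$f" | grep -Eq '^[^_]+_[^_]+_[^_]+_R[12]\.fastq\.gz$' \
        || echo "non-conforming name: $f"
done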
Make sure the src/deseq2_metadata.csv looks right. The file is created when you run make_sample_sheet.py and should have the following properties:
- Two columns labeled "sample" and "condition".
- The sample column corresponds to the individual biological replicates (includes all fields around the "_" delimiter).
- The condition column holds the condition for each sample, taken from the first field of the "_"-delimited sample name.
- If you have multiple conditions and marks to analyze, you can introduce more columns into this file and adjust deseq2.R to account for the extra covariates (an illustrative example follows the file listing below).
Below is an example of what the src/deseq2_metadata.csv file might look like for an experiment with various conditions, replicates, and marks/antibody targets. Note that the IgG samples are not included in differential expression analysis:
sample,condition
HoxE_1_Rbp1,HoxE
HoxE_2_Rbp1,HoxE
HoxE_3_Rbp1,HoxE
HoxM_1_Rbp1,HoxM
HoxM_2_Rbp1,HoxM
HoxM_3_Rbp1,HoxM
HoxW_1_Rbp1,HoxW
HoxW_2_Rbp1,HoxW
HoxW_3_Rbp1,HoxW
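If you do introduce covariates, the file might look like the hypothetical sketch below (here a 'mark' column for an experiment with two antibody targets; the second mark name is invented for illustration), and deseq2.R would need to be adjusted to use the extra column:
sample,condition,mark
HoxE_1_Rbp1,HoxE,Rbp1
HoxE_1_H3K4me3,HoxE,H3K4me3
HoxM_1_Rbp1,HoxM,Rbp1
HoxM_1_H3K4me3,HoxM,H3K4me3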
Please follow the instructions in the "Snakemake + SLURM integration" section below if you are running the pipeline as a batch job and do not yet have a SLURM profile set up. The SLURM profile configures default settings for Snakemake to interact with SLURM. More information can be found here.
NOTE: If you already have a SLURM profile set up to run Snakemake with Conda (i.e., includes settings like use-conda, conda-prefix) but would like to run Snakemake with SLURM and Singularity/Apptainer integration, please follow the instructions in the "Snakemake + SLURM + Singularity/Apptainer integration" section below.
Download the slurm folder from the maxsonBraunLab repository and copy the entire thing to ~/.config/snakemake.
Your file configuration for SLURM should be as follows:
~/.config/snakemake/slurm/<files>
Change the file permissions for the scripts in the slurm folder so that they are executable. To do this, run:
chmod +x ~/.config/snakemake/slurm/slurm*
NOTE: The ~/.config/snakemake/slurm/config.yaml file contains settings for Snakemake to interact with SLURM and, optionally, Conda or Singularity/Apptainer. If you already have an existing SLURM profile configured to run Snakemake with Conda (i.e., it includes settings like use-conda, conda-prefix), then you will need to create a separate profile for running Snakemake with Singularity/Apptainer. To do this:
- Copy contents of base slurm profile into another folder for slurm_singularity profile:
cp -r ~/.config/snakemake/slurm ~/.config/snakemake/slurm_singularity
- Make profile scripts executable:
chmod +x ~/.config/snakemake/slurm_singularity/slurm*
- Remove any Conda-specific settings (e.g. use-conda, conda-prefix, etc.) from ~/.config/snakemake/slurm_singularity/config.yaml
- (Optional) Add the following lines to the end of the ~/.config/snakemake/slurm_singularity/config.yaml file:
use-singularity: True
keep-going: True
rerun-incomplete: True
printshellcmds: True
If you would like to run the pipeline using Singularity/Apptainer containers instead of Conda, please follow the additional setup and execution instructions in the "Reproducible results with Snakemake + Singularity/Apptainer" section below.
A "dry-run" can be accomplished to see what and how files will be generated by using the command:
snakemake -nrp
To invoke the pipeline, please use either of the two options below:
# run in interactive mode. Recommended for running light jobs.
snakemake -j <n cores> --use-conda --conda-prefix $CONDA_PREFIX_1/envs
# run in batch mode. Recommended for running intensive jobs.
sbatch run_pipeline_conda.sh
For users running the pipeline in batch mode, run_pipeline_conda.sh is a wrapper script that contains the following command:
snakemake -j <n jobs> --use-conda --conda-prefix $CONDA_PREFIX_1/envs --profile slurm --cluster-config cluster.yaml
Additional setup instructions are provided in the wrapper script.
You can standardize further arguments for running the pipeline in batch mode by setting up a Snakemake SLURM profile; the maxsonBraunLab slurm repository contains instructions for doing so.
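For reference, a profile's config.yaml typically pins defaults that would otherwise be passed on the command line. The sketch below uses standard Snakemake option names, but the values and the exact contents of the maxsonBraunLab profile are assumptions; defer to that repository:
# hypothetical ~/.config/snakemake/slurm/config.yaml -- values are placeholders
jobs: 50                      # max concurrent SLURM jobs
use-conda: True
conda-prefix: /path/to/envs   # shared conda environment location
keep-going: True
rerun-incomplete: True
printshellcmds: True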
To ensure the reproducibility of your results and reduce environment setup issues, we recommend running the Snakemake workflow using Singularity/Apptainer containers. These containers standardize the installation of bioinformatics software (e.g. bowtie2, samtools, DESeq2).
Make sure to complete the general pipeline/data setup instructions above before running the pipeline.
If you have access to the MaxsonLab storage space on ARC, then you can skip the first setup step below and use the default SINGULARITY_IMAGE_FOLDER path specified in the config.yml file to access the containers for this pipeline.
If you have previously run this pipeline with Singularity/Apptainer and already built the needed containers, then you can set the SINGULARITY_IMAGE_FOLDER path in config.yml to the folder where your container images are stored.
Do this step if you do not have access to the MaxsonLab storage space on ARC and have not run this pipeline with Singularity/Apptainer before. You will need to build the necessary containers from the definition files provided in the singularity_definition_files folder.
To build containers without requiring root access on ARC, you will need to create a Sylabs account (you can use your GitHub account to log in). After logging in, navigate to Dashboard > Access Tokens and create a new access token. Make sure to copy and save the token in a secure place.
This token will allow you to access the Sylabs remote builder tool from the command line. Note that every user is limited to 500 minutes of build time per month.
After generating a Sylabs access token, you will need to log into Sylabs from the ARC command line. To do this:
# get onto an interactive/compute node
srun -p interactive --time=3:00:00 --pty bash
# input your access token when prompted
singularity remote login
Now you're ready to start building containers. To do this, navigate to the main folder of the pipeline (where the Snakefile is) and run the singularity_build_remote.sh script as follows:
# create folder to store build logs
mkdir -p jobs/singularity_build_remote
# make sure to follow any additional instructions in the script file before executing
# provide the path to a folder where you want to store your container images
# (don't include a slash "/" at the end of path)
sbatch singularity_build_remote.sh <path_to_output_folder>
By default, Singularity/Apptainer will create and use a cache directory in your personal home folder (i.e. in /home/users/<username>). This may create problems, as space in a user's home folder on ARC is limited. To avoid space issues, you can point the Singularity/Apptainer cache directory to a folder in your lab group directory like this:
# make a cache folder inside your lab user folder
mkdir -p /home/groups/MaxsonLab/<your_user_folder>/cache
# make the path to the cache folder accessible to other processes
export SINGULARITY_CACHEDIR="/home/groups/MaxsonLab/<your_user_folder>/cache"
export APPTAINER_CACHEDIR="/home/groups/MaxsonLab/<your_user_folder>/cache"
If you are an experienced user, you can add the export lines above to your ~/.bashrc file. Otherwise, run the export commands before executing the pipeline.
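For example, to persist the cache setting across sessions (a minimal sketch; substitute your own folder path):
# append the cache exports to ~/.bashrc so future shells pick them up
echo 'export SINGULARITY_CACHEDIR="/home/groups/MaxsonLab/<your_user_folder>/cache"' >> ~/.bashrc
echo 'export APPTAINER_CACHEDIR="/home/groups/MaxsonLab/<your_user_folder>/cache"' >> ~/.bashrc
source ~/.bashrc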
Singularity/Apptainer documentation on Exacloud can be found here.
Singularity/Apptainer documentation on ARC can be found here.
A "dry-run" can be accomplished to see what and how files will be generated by using the command:
snakemake -nrp
To invoke the pipeline, please use either of the two options below.
NOTE: Make sure to use double quotes for the --bind argument, and insert an integer for the -j flag. The --bind argument binds the host (ARC) paths to the container so it can access the genome indices and the raw sequencing files. When Snakemake is executed directly in an interactive terminal, the -j flag represents the maximum number of cores to use; when executed via a batch script, it represents the maximum number of jobs to run at a time.
Option 1: Singularity/Apptainer + interactive run
# get onto an interactive/compute node
srun --time=12:00:00 --pty bash
# set folder paths
# fastq_folder should be absolute/full path to folder containing raw FASTQ files (not the symlinks)
indices_folder="/home/groups/MaxsonLab/indices"
fastq_folder="/set/full/path/here"
# run pipeline
snakemake -j <n_cores> \
--use-singularity \
--singularity-args "--bind $indices_folder,$fastq_folder"
Option 2: Singularity/Apptainer + slurm (batch) run
For users running the Singularity/Apptainer version of the pipeline in batch mode, run_pipeline_singularity.sh is a wrapper script for the pipeline. You will need to add the appropriate FASTQ folder path to the script prior to running. Additional instructions are provided in the wrapper script.
# run pipeline
sbatch run_pipeline_singularity.sh
The exact output layout depends on your configuration, e.g. with adapter trimming enabled and blacklist removal disabled (the default) versus with blacklist removal enabled. Below is an explanation of each output directory:
aligned - sorted and aligned sample BAM files
callpeaks - the output of callpeaks.py, with peak files and normalized bigwigs for each sample
counts - the raw counts for each mark over a consensus peak list
custom_report - an HTML file with quick summary tables of QC metrics for each sample, grouped by mark and condition for convenient comparison; complements the QC report generated by MultiQC
deseq2 - the results of DESeq2 for each counts table (by mark). Note that because each input counts table is generated over consensus peak regions, DESeq2 normalizes these tables by counts in consensus peaks rather than by entire library size
diffbind - the results of DiffBind for each mark. Note that here, counts are normalized by entire library size rather than just counts in peaks
dtools - fingerprint plot data for MultiQC to use
fastp - adapter-trimmed FASTQ files (if the adapter-trimming option is enabled in config.yml)
fastqc - FastQC results
fastq_screen - fastq_screen results
logs - runtime logs for each Snakemake rule
markd - duplicate-marked BAM files
multiqc - contains multiqc_report.html, an aggregate QC report for the experiment
plotEnrichment - FRiP statistics for each mark
preseq - library complexity data for the MultiQC report
Each mark should have the following output files:
"data/deseq2/{mark}/{mark}-rld-pca.png" - PCA of counts after reguarlized log transformation.
"data/deseq2/{mark}/{mark}-vsd-pca.png" - PCA of counts after variance stabilizing transformation.
"data/deseq2/{mark}/{mark}-normcounts.csv" - normalized count for each sample in each consensus peak.
"data/deseq2/{mark}/{mark}-lognormcounts.csv" - log2 normalized counts for each sample in each consensus peak.
"data/deseq2/{mark}/{mark}-rld.png", - the sdVsMean plot using regularized log transformation.
"data/deseq2/{mark}/{mark}-vsd.png" - the sdVsMean plot using variance stabilizing transformation.
"data/deseq2/{mark}/{mark}-vsd-dist.png" - the sample distance matrix after variance stabilizing transformation.
"data/deseq2/{mark}/{mark}-rld-dist.png" - the sample distance matrix using regularized log transformation.
"data/deseq2/{mark}/{mark}-dds.rds" - the R object with the results of running the DEseq() function.
For each contrast, the full differential results are written to a file ending in -diffexp.tsv, and those with an adjusted p-value less than 0.05 are written to a file with the extension -sig05-diffexp.tsv. A summary of the results using an alpha of 0.05 is also written to a file with the extension -sig05-diffexp-summary.txt. Additionally, two MA plots are written to files ending in plotMA.png, with differential peaks (adjusted p-value less than 0.1) highlighted.
See the following paper for further explanations of the above plots and data transforms: https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#exporting-results-to-csv-files