--| Overview
--| Usage
--| Options
--| Output
--| Resources
--| Schedulers
--| Software
atac_chip_preprocess is a containerized Nextflow pipeline for preprocessing of ATAC-seq and ChIP-seq data.
The pipeline consists of these steps, starting from a samplesheet to read the fastq files:
- Validation of the samplesheet to ensure that fastq files are not duplicated and all file paths exist. If any validation fails then the process will return an informative error for debuggung.
- Initial fastq QC with
fastqc
. - Merging of lane/technical replicates per sample, see the samplesheet section below on how to indicate technical replicates (=multiple fastq files) per sample.
- Adapter and quality trimming with
fastp
. - Mapping of reads with
bowtie2
, by default with the--very-sensitive
(and-X 2000
for paired-end data) flags. The pipeline does not include a dedicated indexing step for the reference genome since this is a one-liner viabowtie2-build
. The user needs to build the index upfront and then provide the path to the folder with the*.bt2
index files via--index
. The files will then automatically be found in that folder. It is expected that only a single set of index file is in that folder. - Duplicate marking with
samblaster
. - Removal of alignments with MAPQ below 20, removal of non-primary or supplementary alignments, removal of reads not mapped to primary chromosomes (the regex to keep alignments is
chr[1-9,X,Y]
), removal of mitochondrial alignments and duplicate reads withsamtools
and combinations of GNU tools. - For paired-end data fetching of insert size metrics with
picard CollectInsertSizeMetrics
. - For ATAC-seq data extraction of transposome insertion events (cutsites, that is the 5' end of alignments) using GNU tool combinations, output in gzipped BED format.
- Peak calling with
macs2
. For ATAC-seq the options--keep-dup=all --nomodel --extsize 100 --shift -50 --min-length 250
are used by default, else it is only--keep-dup=all
since the deduplicated BAM files are used as input. Then, peaks are filtering against NGS blacklists (ENCODE + mitochondrial homologs in the nuclear genome, the latter for ATAC-seq only) usingbedtools
. This is species-specific. Currently, human and mouse is supported via the flag--species
with eitherhs
ormm
. - Calculation of Fractions Of Reads in Peaks (FRiPs) as a signal-to-noise QC metric with
featureCounts
. - Creation of bigwig tracks (raw counts) for quick visual inspection of data quality with
bedtools genomecov
. - Summary report collecting fastqc, trimming, alignment metrics and FRiP/featureCounts metrics with
MultiQC
. - Output of all used software versions and the exact command lines per process and sample using custom scripts.
Execute this command to run the pipeline on a tiny test dataset with minimal resources to explore outputs:
NXF_VER=23.04.0 nextflow run atpoint/atac_chip_preprocess -r main -profile docker,test --keep_merge --keep_trim
An overview of current software versions and exact command lines when using default settings of the pipeline can be found here.
The minimal parameters the user has to provide are the following ones:
--samplesheet
: path to a samplesheet csv file with three columns, beingsample
(the sample name),r1
(path to R1) andr2
(path to R2), where r2 can be empty. If empty, then the sample is considered single-end.--index
: path to a folder containing abowtie2
index with the typical*.bt2
files. Note, it is the path to the folder, not the path to the index basename, as the pipeline will find the bt2 files automatically.--species
: either ofmm
orhs
to let the peak caller know whether mouse or human data are provided, so it gets the effective genome length right.
Note that the bowtie2
index must be produced upfront, we did not include that into the pipeline as it is trivially just bowtie2-build genome.fa idx
.
On our HPC we typically use this command below, with Apptainer (currently the profiles docker
, singularity
and apptainer
are supported) as container engine and SLURM as executor. If any other executor shall be used then the user needs to add it to the scheduler config file file.
# Example for mouse ATAC-seq data
NXF_VER=23.04.0 nextflow run atpoint/atac_chip_preprocess -r main -profile apptainer,slurm --samplesheet path/to/samplesheet.csv --index path/to/index_folder --species mm
# Example for mouse ChIP-seq data
NXF_VER=23.04.0 nextflow run atpoint/atac_chip_preprocess -r main -profile apptainer,slurm --samplesheet path/to/samplesheet.csv --index path/to/index_folder --species mm --atacseq false
Use either of -profile docker/singularity/apptainer
to use any of these container engines.
The pipeline uses (in our opinion) reasonable defaults for all processing steps. Still, the following options exist for customization:
General options
--outdir
, path to desired output folder collecting all results. By default, it is./atac_chip_preprocess_results/
in the directory from which the pipeline is launched.--atacseq
, a logical, set tofalse
if processing something like ChIP-seq data, by defaulttrue
for ATAC-seq data.
Filtering options
--blacklist
: path to a BED file to filter peaks against. By default when--species
ismm
then the provided mm10 blacklist is used, forhs
the hg38 one is used.--filter_blacklist
: logical, set tofalse
to turn off any blacklist filtering, defaulttrue
. If any species other than human and mouse is used this can be set to false so no filtering takes place and both--species
and--blacklist
have no effect, hence can be left at defaults.--flag_remove
: a numeric flag to be used withsamtools view -F
, so indicating which alignments to remove. Default is 3332, so discard unmapped, not primary, supplementary alignments and duplicates . See here for details.--chr_regex
: a groovy-compatible regex to indicate which chromosomes to keep in the BAM alignments. Default ischr[1-9,X,Y]
which means keep everything starting withchr
and then a number or X/Y. That in turn removes typical decoys (chrEBV
) and unplaced/random contigs such aschrU...
. As a result it keeps only the primary autosomes and sex chromosomes.--min_mapq
: an integer, keep only alignments with MAPQ greater than that, default is 20.--fragment_length
: for single-end data an average expected fragment length to extend reads to fragments for bigwig creation and FRiP calculation, default is 250. That is only used if--atacseq false
as for ATAC-seq data everything is based on the transposome cutsites (that is the 5' ends of the alignments).keep_merge
: logical, whether to keep the merged fastq files, else they're not published to the output directory.keep_trim
: logical, whether to keep the trimmed fastq files, else they're not published to the output directory.
Process options
--do_not_trim
: logical, whether to skip adapter and quality trimming.--trim_additional
: additional arguments for thefastp
trimming process beyond what is coded in the module definition, default--dont_eval_duplication -z 6
to skip duplicate level assessment and to compress outputs--align_additional
: additional arguments for thebowtie2
alignment process beyond what is coded in the module definition, default is-X 2000 --very-sensitive
, seebowtie2
manual.--sort_additional
: additional arguments for thesamtools sort
process beyond what is coded in the module definition, default is-l 6
to compress the resulting BAM file to that level. Do not add-m
or-@
here, as resources are hardcoded in thenextflow.config
file.--filter_additional
: additional arguments for thesamtools view
filtering process beyond what is described above and given with the-q
and-F
flags--macs_additional
: additional arguments for themacs2 callpeak
, default is in any case--keep-dup=all
since we provide already deduplicated data to that process and if ATAC-seq data are processed (default) then--nomodel --extsize 100 --shift -50 --min-length 250
to provide some smoothing to the pileup when using the cutsites for peak calling.
By default, all outputs will be collected in ./atac_chip_preprocess_results/
relative to the directory from which the pipeline is launched. Use --outdir
to change this. Output folders are:
- alignments_filtered: Sorted bam, bam index and flagstats for the filtered alignments.
- alignments_unfiltered: Sorted bam, bam index and flagstats for the unfiltered alignments.
- bed_files: In ATAC-seq mode gzipped BED files with the cutsites (5' end of filtered alignments) per sample. These are used for peak calling in the pipeline.
- bigwig: Bigwig files of filtered alignments, without any normalization, for a quick visual inspection of data quality.
- fastq_merged: The merged fastq files in case there were technical replicates per sample and
--keep_merge
was used. - fastq_trimmed: The trimmed fastq files and trimming stats from
fastp
in case trimming was activated (default yes) and--keep-trim
was used. - fastqc: The fastqc per-sample outputs.
- frips: A file
frips_all.txt
with the FRiP score per sample. If a sample had a FRiP of zero then currently it is not listed but ignored. - misc: Contains the chromsizes extracted from the BAM files and the insert sizes per sample in case of paired-end data.
- multiqc: The multiQC summary report summarizing all fastqc, trim, alignment and FRiP stats.
- peaks: The narrowPeak and summit BED files per sample from
macs2
. - pipeline_info: A file
command_lines.txt
summarizing all used command lines per process and sample, and a filesortware_versions.txt
summarizing all software versions that were used in the pipeline.
The nextflow.config files contains hardcoded defaults towards resources for the individual processes, suitable for use on HPC or workstation environments. The most demanding process is the alignment steps, requiring 16 threads and 16GB of RAM per sample to finish in a reasonable amount of time.
The schedulers.config file currently contains a single scheduler profile for SLURM as used on or HPC,
submitting jobs (if using -profile slurm
) to a quere called normal
with a maximum 8h of walltime. Custom profiles should be added to this config. Users can add custom configurations here.
For reproducibility we recommend to use the container options (Docker, Singularity, Apptainer) for -profile docker,singularity,apptainer
to take care of all software, using a provided Docker-based image. By default, the pipeline does not support conda/mamba. However, the user can create such a software environment with the environment.yml
file, and then simply run the pipeline without specifying any of the above container engines, so the software available in the current environment will be used.