wbaopaul edited this page Mar 9, 2020 · 1 revision

scATAC-pro Manual

How to set up user configuration file

The user configuration file configure_user.txt is a required input (specified with the -c flag) for running a module; it assigns the parameters/options that the module uses, if any. The sections below show, module by module, how to specify the needed parameters/options in the configuration file. Modules that have no parameters/options to specify are not mentioned. The default settings are fine for the vast majority of modules, but you need to change the genome name, the mapping index path and the genome annotation files, which vary between data sets.
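As a rough illustration of how such a file could be read programmatically, the sketch below parses parameter assignments in Python. The `KEY = VALUE` layout, the `##` comment marker and the function name `read_config` are assumptions made for this example and are not confirmed by this manual; check the configure_user.txt shipped with scATAC-pro for the actual syntax.

```python
def read_config(text):
    """Parse 'KEY = VALUE' lines into a dict.

    Assumed layout (hypothetical): one parameter per line, '##' starts a comment.
    """
    params = {}
    for raw in text.splitlines():
        line = raw.split("##", 1)[0].strip()  # drop the assumed '##' comment
        if not line or "=" not in line:
            continue  # skip blank lines and anything that is not an assignment
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


# Example content using parameter names and values from the tables in this manual.
example = """
OUTPUT_PREFIX = pbmc10k   ## sample/dataset name
GENOME_NAME = hg38
MACS_OPTS = -q 0.01 -g hs --nomodel --extsize 200 --shift -100
"""
config = read_config(example)
```

Values are kept as strings here, so multi-word option strings such as MACS_OPTS survive unchanged.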

global setting

| Parameter | Value | Note |
| --- | --- | --- |
| OUTPUT_PREFIX | pbmc10k | The name used as the prefix for outputs, usually the sample/dataset name |
| IsSingleEnd | FALSE | Set it to TRUE if the reads are single-ended |
| BLACKLIST | annotation/hg38_blacklist.bed | Blacklisted genomic regions, used to remove artifactual peaks/bins |
| PROMOTERS | annotation/hg38_promoter.bed | Promoter file used to calculate the QC metrics |
| ENHANCERS | annotation/hg38_enhancer.bed | Enhancer file used to calculate the QC metrics |
| TSS | annotation/hg38_tss.bed | Transcription start site file used to calculate the QC metrics and annotate peaks/bins |
| GENOME_NAME | hg38 | Used for TF motif enrichment and footprinting analysis |
| plotEPS | TRUE | Whether to plot figures in .eps format when generating the summary report |

trimming

| Parameter | Value | Note |
| --- | --- | --- |
| TRIM_METHOD | trim_galore | Adapter trimming method, three options: trim_galore/Trimmomatic/none |
| ADAPTER_SEQ | NA | Set it to the path of the adapter .fa file if TRIM_METHOD is set to Trimmomatic, otherwise ignore it |

mapping

| Parameter | Value | Note |
| --- | --- | --- |
| MAPPING_METHOD | bwa | Read alignment method, three options: bwa/bowtie/bowtie2 |
| BWA_OPTS | -t 16 | Additional options for bwa; ignore it if MAPPING_METHOD is not set to bwa |
| BWA_INDEX | PATH_TO_INDEX | bwa index for the genome used (the path of the genome's .fa file); ignore it if MAPPING_METHOD is not set to bwa |
| BOWTIE_OPTS | --quiet -p 16 | Additional options for bowtie; ignore it if MAPPING_METHOD is not set to bowtie |
| BOWTIE_INDEX | PATH_TO_INDEX/GENOME_PREFIX | bowtie index for the genome used (the directory of the genome's .ebwt file); ignore it if MAPPING_METHOD is not set to bowtie |
| BOWTIE2_OPTS | --quiet -p 16 | Additional options for bowtie2; ignore it if MAPPING_METHOD is not set to bowtie2 |
| BOWTIE2_INDEX | PATH_TO_INDEX | bowtie2 index for the genome used (the directory of the genome's .bt2 file); ignore it if MAPPING_METHOD is not set to bowtie2 |
| MAPQ | 30 | Reads with MAPQ less than 30 are filtered out for downstream modules |
| CELL_MAPQ_QC | TRUE | Report mapping QC for cell barcodes (requires running module get_bam4Cells) |
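The table implies that only the options and index belonging to the chosen MAPPING_METHOD are used. A minimal Python sketch of that selection logic follows; the helper names `mapping_settings` and `passes_mapq` are hypothetical, and the config is assumed to be available as a plain dict of strings. It is an illustration of the rule, not scATAC-pro's own code.

```python
# Which config keys matter for each aligner (names taken from the table above).
INDEX_KEY = {"bwa": "BWA_INDEX", "bowtie": "BOWTIE_INDEX", "bowtie2": "BOWTIE2_INDEX"}
OPTS_KEY = {"bwa": "BWA_OPTS", "bowtie": "BOWTIE_OPTS", "bowtie2": "BOWTIE2_OPTS"}


def mapping_settings(config):
    """Return (method, index, options) for the selected aligner only."""
    method = config["MAPPING_METHOD"]
    if method not in INDEX_KEY:
        raise ValueError(f"unknown MAPPING_METHOD: {method!r}")
    return method, config[INDEX_KEY[method]], config.get(OPTS_KEY[method], "")


def passes_mapq(mapq, cutoff=30):
    """Reads with MAPQ below the cutoff are filtered out, so keep mapq >= cutoff."""
    return mapq >= cutoff


config = {"MAPPING_METHOD": "bwa", "BWA_OPTS": "-t 16", "BWA_INDEX": "PATH_TO_INDEX"}
settings = mapping_settings(config)
```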

call_peak

| Parameter | Value | Note |
| --- | --- | --- |
| PEAK_CALLER | MACS2 | Peak calling method, four options: MACS2/BIN/COMBINED/GEM |
| MACS_OPTS | -q 0.01 -g hs --nomodel --extsize 200 --shift -100 | Additional options for calling macs2; no need to specify -t -n -f |
| BIN_RESL | 5000 | Bin resolution in base pairs if PEAK_CALLER is set to BIN or COMBINED |
| CHROM_SIZE_FILE | annotations/chrom_hg38.sizes | The chromosome sizes file |
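To make the role of BIN_RESL and CHROM_SIZE_FILE concrete, the Python sketch below tiles each chromosome into fixed-width bins. It assumes the chromosome sizes file has two whitespace-separated columns (name, length) and uses 0-based half-open intervals; both are assumptions for illustration, and the way scATAC-pro builds its bins may differ. The function names are hypothetical.

```python
def parse_chrom_sizes(text):
    """Read 'chrom<whitespace>size' lines into a dict (assumed two-column layout)."""
    sizes = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            sizes[fields[0]] = int(fields[1])
    return sizes


def make_bins(chrom_sizes, resolution):
    """Tile each chromosome with bins of `resolution` bp; the last bin may be shorter."""
    bins = []
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, resolution):
            bins.append((chrom, start, min(start + resolution, size)))
    return bins


# Toy chromosome of 12,000 bp, binned at the BIN_RESL value shown in the table.
sizes = parse_chrom_sizes("chrA\t12000\n")
bins = make_bins(sizes, 5000)
```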

call_cell

| Parameter | Value | Note |
| --- | --- | --- |
| CELL_CALLER | FILTER | Cell calling method, three options: FILTER/EmptyDrop/cellranger |
| EmptyDrop_FDR | 0.001 | FDR cutoff for the EmptyDrop algorithm; ignore it if CELL_CALLER is not set to EmptyDrop |
| FILTER_BC_CUTOFF | --min_uniq_frags 5000 --max_uniq_frags 50000 --min_frac_peak 0.5 --min_frac_tss 0.0 --min_frac_promoter --min_frac_enhancer --max_frac_mito 0.1 | The per-barcode QC cutoffs used to define cells if CELL_CALLER is set to FILTER: the minimum number of unique fragments, the maximum number of unique fragments, the minimum fractions of fragments in peaks, in TSSs, in promoters and in enhancers, and the maximum fraction of fragments in the mitochondrial genome; ignore it otherwise |
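The FILTER rule can be pictured as a set of per-barcode minimum and maximum checks. The Python sketch below parses a cutoff string and applies it to one barcode's QC values. It is only an illustration: the QC field names (`uniq_frags`, `frac_peak`, and so on), the inclusive bounds and the helper names are assumptions, and only cutoffs that carry a value in the table above are used in the example string.

```python
def parse_cutoffs(opts):
    """Turn '--min_x 1 --max_y 2' into {'min_x': 1.0, 'max_y': 2.0}.

    Flags that are not followed by a value are skipped rather than guessed.
    """
    tokens = opts.split()
    cutoffs = {}
    i = 0
    while i < len(tokens):
        name = tokens[i]
        has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("--")
        if name.startswith("--") and has_value:
            cutoffs[name[2:]] = float(tokens[i + 1])
            i += 2
        else:
            i += 1
    return cutoffs


def is_cell(qc, cutoffs):
    """Check one barcode's QC metrics against every min_/max_ cutoff (inclusive)."""
    for name, limit in cutoffs.items():
        bound, metric = name.split("_", 1)  # 'min_uniq_frags' -> ('min', 'uniq_frags')
        value = qc[metric]
        if bound == "min" and value < limit:
            return False
        if bound == "max" and value > limit:
            return False
    return True


cutoffs = parse_cutoffs(
    "--min_uniq_frags 5000 --max_uniq_frags 50000 "
    "--min_frac_peak 0.5 --min_frac_tss 0.0 --max_frac_mito 0.1"
)
good = {"uniq_frags": 8000, "frac_peak": 0.6, "frac_tss": 0.2, "frac_mito": 0.05}
shallow = {"uniq_frags": 1200, "frac_peak": 0.6, "frac_tss": 0.2, "frac_mito": 0.05}
```

With these cutoffs, `is_cell(good, cutoffs)` passes every check, while `shallow` fails the minimum unique fragment requirement.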

clustering

| Parameter | Value | Note |
| --- | --- | --- |
| norm_by | tf-idf | Normalization method, three options: tf-idf/log/NA |
| Top_Variable_Features | 10000 | Number or fraction of variable features used for seurat. A value between 0 and 1 is interpreted as a fraction of the total number of features |
| REDUCTION | pca | Dimension reduction method: pca/lda; UMAP and TSNE are calculated automatically from it |
| nREDUCTION | 30 | The number of reduced dimensions, an integer |
| CLUSTERING_METHOD | seurat | Clustering method, one of these options: seurat/cisTopic/kmeans/LSI/SCRAT/scABC/chromVAR |
| K_CLUSTERS | An integer or NULL | The number of expected cell clusters; if K_CLUSTERS is set to NULL, the resolution parameter of the Louvain algorithm is set to 0.2 |
| prepCello | TRUE | Generate an object for VisCello (for visualization) |
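The two readings of Top_Variable_Features (a count, or a fraction when the value lies between 0 and 1) can be sketched in Python as follows. The function name is hypothetical, and treating values strictly between 0 and 1 as fractions and capping counts at the number of available features are assumptions for this example.

```python
def n_variable_features(setting, n_total):
    """Resolve Top_Variable_Features into an absolute number of features."""
    if setting <= 0:
        raise ValueError("Top_Variable_Features must be positive")
    if setting < 1:
        # Fraction of the total number of features (assumed: strictly between 0 and 1).
        return int(setting * n_total)
    # Otherwise an absolute count, capped at what is available (assumption).
    return min(int(setting), n_total)


as_count = n_variable_features(10000, 80000)
as_fraction = n_variable_features(0.25, 40000)
capped = n_variable_features(10000, 6000)
```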

split_bam

| Parameter | Value | Note |
| --- | --- | --- |
| SPLIT_BAM2CLUSTER | TRUE | Whether to extract a bam file for each cell cluster; this module is necessary if you want to do footprinting analysis |

runDA

| Parameter | Value | Note |
| --- | --- | --- |
| RUN_DA | TRUE | Whether to run differential accessibility analysis |
| group1 | 0:1 | Either the name(s) of one or more cell clusters, separated by colons, or 'one'. If set to 'one', all one-vs-rest comparisons are performed |
| group2 | rest | Either the name(s) of one or more cell clusters, separated by colons, or 'rest' |
| test_use | wilcox | Statistical test used for the differential accessibility analysis: negbinom/LR/wilcox/t/DESeq2 |
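The group1/group2 conventions can be illustrated with a small Python sketch that expands a specification into explicit cluster lists. The function name `expand_comparisons` is hypothetical and the code only mirrors the rules stated in the table; it is not taken from scATAC-pro.

```python
def expand_comparisons(group1, group2, clusters):
    """Expand group specs into a list of (group1_clusters, group2_clusters) pairs."""
    if group1 == "one":
        # All one-vs-rest comparisons: each cluster against every other cluster.
        return [([c], [o for o in clusters if o != c]) for c in clusters]
    g1 = group1.split(":")  # colon-separated cluster names, e.g. '0:1'
    if group2 == "rest":
        g2 = [c for c in clusters if c not in g1]
    else:
        g2 = group2.split(":")
    return [(g1, g2)]


clusters = ["0", "1", "2", "3"]
pair = expand_comparisons("0:1", "rest", clusters)
one_vs_rest = expand_comparisons("one", "rest", clusters)
```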

runGO

| Parameter | Value | Note |
| --- | --- | --- |
| RUN_GO | TRUE | Whether to run GO analysis after running DA |
| GO_TYPE | BP | Type of GO terms, one of three options: BP/CC/kegg |

footprint

| Parameter | Value | Note |
| --- | --- | --- |
| DO_FOOTPRINT | FALSE | Whether to perform TF footprinting analysis |
| group1_fp | 0 | Either the name of a cell cluster or 'one'. If set to 'one', all one-vs-rest comparisons are conducted |
| group2_fp | rest | Either the name of a cell cluster or 'rest' |

runCicero

| Parameter | Value | Note |
| --- | --- | --- |
| RUN_Cicero | TRUE | Whether to predict cis chromatin interactions |
| Cicero_Plot_Region | chr5:140610000-140640000 | Plot the cis chromatin interactions within Cicero_Plot_Region in the summary report |
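Cicero_Plot_Region follows the common `chrom:start-end` notation. A short Python sketch for checking such a value before a run is given below; the helper name `parse_region` and the requirement that start be smaller than end are assumptions made for illustration.

```python
import re

# chrom:start-end, e.g. chr5:140610000-140640000
REGION = re.compile(r"^([^:\s]+):(\d+)-(\d+)$")


def parse_region(region):
    """Split 'chrom:start-end' into (chrom, start, end) with integer coordinates."""
    match = REGION.match(region)
    if match is None:
        raise ValueError(f"not a chrom:start-end region: {region!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start >= end:
        raise ValueError("region start must be smaller than region end")
    return chrom, start, end


region = parse_region("chr5:140610000-140640000")
```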

integrate

| Parameter | Value | Note |
| --- | --- | --- |
| Integrate_By | seurat | Integration method, one of seurat/pool/harmony |
| prepCello4Integration | TRUE | Whether to prepare a VisCello object for the integrated object |

More details about inputs and outputs for all analysis modules

Note that this is a long table; scroll right to read it.

| Module | Input | Output |
| --- | --- | --- |
| demplx_fastq | Fastq files for both reads and index, separated by commas, like: PE1_fastq,PE2_fastq,index1_fastq,index2_fastq,index3_fastq.... Multiple index files are supported and fastq files can be compressed (e.g. .gz files) | Demultiplexed fastq1 and fastq2 files with the index information embedded in the read name as @index3_index2_index1:original_read_name, saved in output/demplxed_fastq/ |
| trimming | Demultiplexed fastq1 and fastq2 files | Trimmed demultiplexed fastq1 and fastq2 files, saved in output/trimmed_fastq/. This module can be skipped if TRIM_METHOD is set to 'none' when running module process |
| mapping | The demultiplexed and trimmed paired-end fastq files, separated by a comma: pe1.fastq,pe2.fastq | Position-sorted bam file and position-sorted MAPQ30 bam file, saved in output/mapping_result/; plain text files of mapping QC metrics and the fragments.txt file, saved in output/summary/ |
| call_peak | The position-sorted MAPQ30 bam file output by the mapping module. Note that the blacklist annotation and CHROM_SIZE_FILE are used to filter out potentially artifactual peaks. The bam file does not need to be in scATAC-pro format, because the peaks are called on the aggregated bam file | The peaks/features file in plain text format, saved as output/peaks/PEAK_CALLER/OUTPUT_PREFIX_features_BlacklistRemoved.bed |
| get_mtx | The peaks/features file output by the call_peak module. The module searches for the fragments.txt file in the directory output/summary/ to construct the matrix | The raw peak-by-cell sparse matrix along with the corresponding barcodes and features files, saved in output/raw_matrix/PEAK_CALLER/ as matrix.mtx, barcodes.txt and features.txt |
| qc_per_barcode | The fragments.txt file and the peaks/features file, separated by a comma. This module can only be run after module mapping and module call_peak | QC metrics for each barcode, saved as output/summary/qc_per_barcode.summary in plain text format |
| aggr_signal | Position-sorted MAPQ30 bam file output by module mapping | Aggregated data in .bw and .bedgraph files, which can be uploaded to a genome browser for visualization, saved in output/signal/. A TSS-by-window count matrix (in .mtx.gz format, +/- 1000 bp of each TSS) is also created in output/signal, which can be used to plot the TSS enrichment profile when generating the summary report |
| call_cell | The raw peak-by-cell sparse matrix file output by the get_mtx module. This module can only be run after module get_mtx and module qc_per_barcode | The filtered peak-by-cell sparse matrix and the corresponding barcodes and features files, saved in output/filtered_matrix/PEAK_CALLER/CELL_CALLER/ as matrix.mtx, barcodes.txt and features.txt, respectively |
| get_bam4Cells | Bam file for aggregated data output by module mapping, and a barcodes.txt file output by module call_cell, separated by a comma | Bam file and mapping QC (optional) for cell barcodes, saved in output/mapping_result/cell_barcodes.MAPQ30 and output/summary/cell_barcodes.MappingStats, respectively |
| clustering | The filtered peak-by-cell sparse matrix output by the call_cell module | A seurat object with metadata 'active_clusters' for the cell clustering labels, saved as output/downstream_analysis/PEAK_CALLER/CELL_CALLER/seurat_obj.rds. The cell barcodes by cluster table, saved in output/downstream_analysis/cell_cluster_table.txt file, the UMAP plot colored by clustering label, and a VisCello input object as output/downstream_analysis/PEAK_CALLER/CELL_CALLER/VisCello_Obj if parameter prepCello is set to TRUE. If CLUSTERING_METHOD is set to 'chromVAR', a chromVAR object is saved as output/downstream_analysis/PEAK_CALLER/CELL_CALLER/chromVAR_obj.rds as well |
| split_bam | The cell barcodes by cluster table file (output/downstream_analysis/PEAK_CALLER/CELL_CALLER/cell_cluster_table.txt), output by the clustering module | Bam (saved in output/downstream/PEAK_CALLER/CELL_CALLER/data_by_cluster), .bw, and .bedgraph files (saved in output/signal/) of the aggregated signal for the cells in each cluster |
| runDA | Two groups, specified either as '0:1,2', in which group1 consists of clusters 0 and 1 and group2 consists of cluster 2, or as '0,rest' or 'one,rest' | The differentially accessible features with statistical significance information, saved in output/downstream_analysis/PEAK_CALLER/CELL_CALLER/differential_accessible_features_group1_vs_group2.txt |
| motif_analysis | The filtered peak-by-cell sparse matrix file output by the call_cell module | A chromVAR object, a table and a heatmap of differentially enriched TFs in each cluster, saved in output/filtered_matrix/PEAK_CALLER/CELL_CALLER/ |
| footprint | Two groups, specified as '1,2', '1,rest' or 'one,rest' | A table and a heatmap of differentially bound TFs for each group, saved in output/downstream_analysis/PEAK_CALLER/CELL_CALLER/ |
| runCicero | The seurat_obj.rds file output by the clustering module, saved in output/downstream_analysis/PEAK_CALLER/CELL_CALLER/seurat_obj.rds | Gene activity object in .rds format, saved as output/downstream_analysis/PEAK_CALLER/CELL_CALLER/cicero_gene_activity.rds, and predicted cis-chromatin interactions in plain text format, saved in output/downstream_analysis/PEAK_CALLER/CELL_CALLER/cicero_inteactions.txt |
| integrate | Peak files called from different data sets, separated by commas | A seurat object for the integrated data and a UMAP plot colored by clustering labels. If Integrate_By is set to 'seurat', an 'integrated' assay is created in the seurat object and cell clustering by the Louvain algorithm is performed on the PCs of the 'integrated' assay. If Integrate_By is set to 'harmony', the clustering is performed on the reduced dimension 'harmony'. If Integrate_By is set to 'pool', the data are simply pooled, and the confounding factors of per-cell sequencing depth and dataset ID are regressed out |
| report | Path to the directory of summary QC files (output/summary by default) | A summary report in html format, saved in output/summary/, along with .eps figures for each panel, saved in output/summary/Figures/ |
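The read-name convention produced by demplx_fastq (@index3_index2_index1:original_read_name) can be illustrated with the Python sketch below, which builds and splits such names. The function names are hypothetical, and the reversed index order simply follows the pattern shown in the table; this is an illustration of the naming scheme, not the module's own code.

```python
def tag_read_name(original_name, indexes):
    """Embed index sequences in a read name as @index3_index2_index1:original_read_name."""
    barcode = "_".join(reversed(indexes))  # last index first, as in the table above
    return "@" + barcode + ":" + original_name.lstrip("@")


def split_read_name(tagged_name):
    """Recover (index parts, original read name) from a tagged read name."""
    # Split on the first ':' only, since original read names often contain ':' themselves.
    barcode, original = tagged_name.lstrip("@").split(":", 1)
    return barcode.split("_"), original


tagged = tag_read_name("@NB501:1:1101", ["AAA", "CCC", "GGG"])
parts, original = split_read_name(tagged)
```

Here `parts` holds the index sequences in the order they appear in the read name, so reversing it gives back the order of the index fastq files.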