forked from wbaopaul/scATAC-pro
Manual
wbaopaul edited this page Mar 9, 2020
Contents
- [scATAC-pro Manual](#scatac-pro-manual)
The user configuration file configure_user.txt is a required input (specified with the flag -c) for running a module; it assigns the parameters/options that the module uses, if any. The tables below show how to specify the parameters/options in the configuration file, module by module. Modules that have no parameters/options to specify are not listed. The default settings are fine for the vast majority of modules, but you need to change the genome name, the mapping index path, and the genome annotation files, which vary between data sets.
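
As a quick way to inspect which values a run would pick up, the configuration file can be read into a dictionary. The sketch below assumes one `KEY = VALUE` pair per line with an optional trailing `##` comment; that syntax is an assumption for illustration and is not spelled out on this page. The function name `parse_config` is hypothetical.

```python
def parse_config(text):
    """Parse 'KEY = VALUE' lines into a dict.

    Assumed syntax (not confirmed by this manual): one pair per line,
    with '##' starting an optional trailing comment.
    """
    params = {}
    for raw in text.splitlines():
        line = raw.split("##", 1)[0].strip()  # drop the trailing comment
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


# Parameter names and values are taken from the tables on this page.
example = """\
OUTPUT_PREFIX = pbmc10k   ## sample name
GENOME_NAME = hg38
MAPPING_METHOD = bwa
BWA_INDEX = PATH_TO_INDEX
"""
config = parse_config(example)
print(config["GENOME_NAME"])
```
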
Parameter | Value | Note |
---|---|---|
OUTPUT_PREFIX | pbmc10k | The name used as prefix for outputs, usually sample/dataset name |
IsSingleEnd | FALSE | Set it to TRUE if the reads are single-ended |
BLACKLIST | annotation/hg38_blacklist.bed | Blacklisted genomic regions, used to remove artifactual peaks/bins |
PROMOTERS | annotation/hg38_promoter.bed | Promoter file, used to calculate QC metrics |
ENHANCERS | annotation/hg38_enhancer.bed | Enhancer file, used to calculate QC metrics |
TSS | annotation/hg38_tss.bed | Transcription start site file, used to calculate QC metrics and annotate peaks/bins |
GENOME_NAME | hg38 | Used for TF motif enrichment and footprinting analysis |
plotEPS | TRUE | Whether to plot figures in .eps format when generating the summary report |
Parameter | Value | Note |
---|---|---|
TRIM_METHOD | trim_galore | Adapter trimming method, three options: trim_galore/Trimmomatic/none |
ADAPTER_SEQ | NA | Set it to the path of the adapter .fa file if TRIM_METHOD is set to Trimmomatic, otherwise ignore it |
Parameter | Value | Note |
---|---|---|
MAPPING_METHOD | bwa | Read alignment method, three options: bwa/bowtie/bowtie2 |
BWA_OPTS | -t 16 | Additional options for bwa; ignore it if MAPPING_METHOD is not set to bwa |
BWA_INDEX | PATH_TO_INDEX | bwa index of the genome in use (the path of the genome's .fa file); ignore it if MAPPING_METHOD is not set to bwa |
BOWTIE_OPTS | --quiet -p 16 | Additional options for bowtie; ignore it if MAPPING_METHOD is not set to bowtie |
BOWTIE_INDEX | PATH_TO_INDEX/GENOME_PREFIX | bowtie index of the genome in use (the directory of the genome's .ebwt files); ignore it if MAPPING_METHOD is not set to bowtie |
BOWTIE2_OPTS | --quiet -p 16 | Additional options for bowtie2; ignore it if MAPPING_METHOD is not set to bowtie2 |
BOWTIE2_INDEX | PATH_TO_INDEX | bowtie2 index of the genome in use (the directory of the genome's .bt2 files); ignore it if MAPPING_METHOD is not set to bowtie2 |
MAPQ | 30 | Reads with MAPQ less than 30 are filtered out for downstream modules |
CELL_MAPQ_QC | TRUE | Report mapping QC for cell barcodes (requires running module get_bam4Cells) |
Parameter | Value | Note |
---|---|---|
PEAK_CALLER | MACS2 | Peak calling method, four options: MACS2/BIN/COMBINED/GEM |
MACS_OPTS | -q 0.01 -g hs --nomodel --extsize 200 --shift -100 | Additional options to call macs2; no need to specify -t -n -f |
BIN_RESL | 5000 | Bin resolution in base pairs, used if PEAK_CALLER is set to BIN or COMBINED |
CHROM_SIZE_FILE | annotations/chrom_hg38.sizes | File listing the chromosome sizes |
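
To illustrate how BIN_RESL and CHROM_SIZE_FILE relate, here is a minimal sketch of fixed-width genome binning. It assumes the chromosome size file has two whitespace-separated columns (name, length) and that the last bin of each chromosome is truncated at the chromosome end; the manual does not describe how scATAC-pro builds its bins, so this is a generic illustration only. The functions `read_chrom_sizes` and `make_bins` and the chromosome names are hypothetical.

```python
def read_chrom_sizes(text):
    """Read 'chrom<whitespace>size' lines into a dict (assumed file layout)."""
    sizes = {}
    for line in text.splitlines():
        if line.strip():
            chrom, size = line.split()[:2]
            sizes[chrom] = int(size)
    return sizes


def make_bins(chrom_sizes, resolution=5000):
    """Tile each chromosome with half-open bins [start, end) of fixed width."""
    bins = []
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, resolution):
            # The last bin is clipped to the chromosome length.
            bins.append((chrom, start, min(start + resolution, size)))
    return bins


# Toy chromosome sizes, not real genome data.
sizes = read_chrom_sizes("chrA\t12000\nchrB\t5000\n")
bins = make_bins(sizes, resolution=5000)
for b in bins:
    print(*b, sep="\t")
```
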
Parameter | Value | Note |
---|---|---|
CELL_CALLER | FILTER | Cell calling method, three options: FILTER/EmptyDrop/cellranger |
EmptyDrop_FDR | 0.001 | FDR cutoff for the EmptyDrop algorithm; ignore it if CELL_CALLER is not set to EmptyDrop |
FILTER_BC_CUTOFF | --min_uniq_frags 5000 --max_uniq_frags 50000 --min_frac_peak 0.5 --min_frac_tss 0.0 --min_frac_promoter --min_frac_enhancer --max_frac_mito 0.1 | Per-barcode QC cutoffs used to define cells if CELL_CALLER is set to FILTER: the minimum and maximum number of unique fragments, the minimum fraction of fragments in peaks, in TSSs, in promoters and in enhancers, and the maximum fraction of fragments in the mitochondrial genome; ignore it otherwise |
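
The FILTER cutoffs amount to a set of lower and upper bounds applied to each barcode's QC metrics. The sketch below shows the idea under several assumptions: every flag is followed by a numeric value, bounds are inclusive, and the QC record keys (`uniq_frags`, `frac_peak`, `frac_mito`) are hypothetical names derived from the flag names rather than the actual column names of qc_per_barcode.summary.

```python
import shlex


def parse_cutoffs(opts):
    """Turn '--min_x 1 --max_y 2' into {'min_x': 1.0, 'max_y': 2.0}.

    Assumes every flag is followed by exactly one numeric value.
    """
    tokens = shlex.split(opts)
    if len(tokens) % 2 != 0:
        raise ValueError("expected flag/value pairs")
    cutoffs = {}
    for flag, value in zip(tokens[0::2], tokens[1::2]):
        if not flag.startswith("--") or value.startswith("--"):
            raise ValueError(f"malformed pair: {flag} {value}")
        cutoffs[flag[2:]] = float(value)
    return cutoffs


def is_cell(qc, cutoffs):
    """Return True if a barcode's QC record satisfies every min/max bound."""
    for name, bound in cutoffs.items():
        kind, metric = name.split("_", 1)  # e.g. 'min', 'uniq_frags'
        value = qc[metric]
        if kind == "min" and value < bound:
            return False
        if kind == "max" and value > bound:
            return False
    return True


cutoffs = parse_cutoffs(
    "--min_uniq_frags 5000 --max_uniq_frags 50000 "
    "--min_frac_peak 0.5 --max_frac_mito 0.1"
)
# Hypothetical QC records for two barcodes.
good = {"uniq_frags": 8000, "frac_peak": 0.62, "frac_mito": 0.03}
sparse = {"uniq_frags": 1200, "frac_peak": 0.70, "frac_mito": 0.02}
print(is_cell(good, cutoffs), is_cell(sparse, cutoffs))
```
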
Parameter | Value | Note |
---|---|---|
norm_by | tf-idf | Normalization method, three options: tf-idf/log/NA |
Top_Variable_Features | 10000 | Number or fraction of variable features used for seurat. A value between 0 and 1 is interpreted as a fraction of the total number of features |
REDUCTION | pca | Dimension reduction method: pca/lda; UMAP and TSNE will be calculated automatically accordingly |
nREDUCTION | 30 | Number of reduced dimensions, an integer |
CLUSTERING_METHOD | seurat | Clustering method, one of these options: seurat/cisTopic/kmeans/LSI/SCRAT/scABC/chromVAR |
K_CLUSTERS | An integer or NULL | The expected number of cell clusters; if K_CLUSTERS is set to NULL, the resolution parameter of the Louvain algorithm is set to 0.2 |
prepCello | TRUE | Generate object for VisCello (for visualization) |
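
The dual meaning of Top_Variable_Features (a count, or a fraction of all features) can be sketched as below. Treating only values strictly between 0 and 1 as fractions, rounding the result, and capping a count at the number of available features are all assumptions of this sketch; the manual only says that a value in the 0-1 range means a fraction. The function name is hypothetical.

```python
def n_variable_features(setting, n_total):
    """Resolve a Top_Variable_Features-style value to a feature count."""
    if 0 < setting < 1:
        # Interpreted as a fraction of all features (assumed rounding).
        return int(round(setting * n_total))
    # Otherwise interpreted as an absolute count, capped at n_total.
    return min(int(setting), n_total)


print(n_variable_features(10000, 80000))
print(n_variable_features(0.25, 80000))
```
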
Parameter | Value | Note |
---|---|---|
SPLIT_BAM2CLUSTER | TRUE | Whether to extract a bam file for each cell cluster; this module is necessary if you want to do footprinting analysis |
Parameter | Value | Note |
---|---|---|
RUN_DA | TRUE | Run differential accessibility analysis or not |
group1 | 0:1 | Either the name(s) of one or multiple cell clusters, separated by colon, or 'one'. If specified as 'one', will perform all one-vs-rest comparisons |
group2 | rest | Either the name(s) of one or multiple cell clusters, separated by colon, or 'rest' |
test_use | wilcox | Statistical test used for differential accessibility analysis, one of negbinom/LR/wilcox/t/DESeq2 |
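
The group1/group2 conventions ('0:1', 'one', 'rest') can be read as a small expansion rule. The sketch below shows one plausible interpretation: colon-separated names form a group, 'rest' means all clusters not in group1, and 'one' expands to every one-vs-rest comparison. The function `resolve_comparisons` is hypothetical and is not part of scATAC-pro.

```python
def resolve_comparisons(group1, group2, clusters):
    """Expand group1/group2 settings into (group1, group2) cluster lists."""
    clusters = [str(c) for c in clusters]
    if group1 == "one":
        # All one-vs-rest comparisons, one per cluster.
        return [([c], [o for o in clusters if o != c]) for c in clusters]
    g1 = group1.split(":")
    if group2 == "rest":
        g2 = [c for c in clusters if c not in g1]
    else:
        g2 = group2.split(":")
    return [(g1, g2)]


# Settings from the table above, with four hypothetical clusters 0-3.
print(resolve_comparisons("0:1", "rest", [0, 1, 2, 3]))
print(len(resolve_comparisons("one", "rest", [0, 1, 2, 3])))
```
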
Parameter | Value | Note |
---|---|---|
RUN_GO | TRUE | Run GO analysis or not after running DA |
GO_TYPE | BP | Type of GO terms, one of three options: BP/CC/kegg |
Parameter | Value | Note |
---|---|---|
DO_FOOTPRINT | FALSE | Perform TF footprinting analysis or not |
group1_fp | 0 | Either the name of a cell cluster or 'one'. If specified as 'one', will conduct all one-vs-rest comparisons |
group2_fp | rest | Either the name of a cell cluster or 'rest' |
Parameter | Value | Note |
---|---|---|
RUN_Cicero | TRUE | Predict cis-chromatin interactions or not |
Cicero_Plot_Region | chr5:140610000-140640000 | Plot cis chromatin interactions within Cicero_Plot_Region on the summary report |
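
Cicero_Plot_Region uses the common `chrom:start-end` notation. A minimal sketch for validating such a string before a run is shown below; the helper `parse_region` is hypothetical, and the requirement that start be smaller than end is an assumption of this sketch.

```python
import re


def parse_region(region):
    """Split 'chrom:start-end' into (chrom, start, end) with integer coordinates."""
    match = re.fullmatch(r"(\w+):(\d+)-(\d+)", region.replace(",", ""))
    if match is None:
        raise ValueError(f"not a chrom:start-end region: {region!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start >= end:
        raise ValueError("region start must be smaller than region end")
    return chrom, start, end


chrom, start, end = parse_region("chr5:140610000-140640000")
print(chrom, end - start)
```
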
Parameter | Value | Note |
---|---|---|
Integrate_By | seurat | Integration method, one of seurat/pool/harmony |
prepCello4Integration | TRUE | Prepare VisCello object for integrated object or not |
Note that this is a wide table; scroll right to read it.
Module | Input | Output |
---|---|---|
demplx_fastq | Fastq files for both reads and index, separated by comma like: PE1_fastq,PE2_fastq,index1_fastq,index2_fastq,index3_fastq... Multiple index files are supported, and fastq files can be in compressed format (e.g. .gz file) | Demultiplexed fastq1 and fastq2 files with index information embedded in the read name as: @index3_index2_index1:original_read_name, saved in output/demplxed_fastq/ |
trimming | Demultiplexed fastq1 and fastq2 files. | Trimmed demultiplexed fastq1 and fastq2 files, saved in output/trimmed_fastq/. This module can be skipped if TRIM_METHOD is set to 'none' when running module process |
mapping | The demultiplexed and trimmed paired-end fastq files, separated by comma: pe1.fastq,pe2.fastq | Position-sorted bam file and position-sorted MAPQ30 bam file, saved in output/mapping_result/; plain text files of mapping QC metrics and the fragments.txt file, saved in output/summary/ |
call_peak | The position-sorted MAPQ30 bam file output by the mapping module. Note that the blacklist annotation and CHROM_SIZE_FILE are used to filter out potentially artifactual peaks. The bam file does not need to be in scATAC-pro format to call peaks, because the peaks are called on the aggregated bam file. | The peaks/features file in plain text format, saved as output/peaks/PEAK_CALLER/OUTPUT_PREFIX_features_BlacklistRemoved.bed. |
get_mtx | The peaks/features file output by the call_peak module. The module searches for the fragments.txt file in the directory output/summary/ to construct the matrix | The raw peak-by-cell sparse matrix along with the corresponding barcodes and features files, saved in output/raw_matrix/PEAK_CALLER/ as matrix.mtx, barcodes.txt and features.txt. |
qc_per_barcode | The fragments.txt file and the peaks/features file, separated by comma. This module can only be performed after running module mapping and module call_peak | QC metrics for each barcode, saved as output/summary/qc_per_barcode.summary in plain text format |
aggr_signal | Position-sorted MAPQ30 bam file output by module mapping | Aggregated data in .bw and .bedgraph files, which can be uploaded to and visualized in a genome browser, saved in output/signal/. A TSS-by-window count matrix (in .mtx.gz format, +/- 1000 bp of each TSS) is also created in output/signal, which can be used to plot the TSS enrichment profile when generating the summary report |
call_cell | The raw peak-by-cell sparse matrix file output by the get_mtx module. This module can only be performed after running module get_mtx and module qc_per_barcode | The filtered peak-by-cell sparse matrix and the corresponding barcodes and features files, saved in output/filtered_matrix/PEAK_CALLER/CELL_CALLER/ as matrix.mtx, barcodes.txt and features.txt, respectively |
get_bam4Cells | Bam file for aggregated data outputted from module mapping, and a barcodes.txt file outputted from module call_cell, separated by comma | Bam file and mapping QC (optional) for cell barcodes saved in output/mapping_result/cell_barcodes.MAPQ30 and output/summary/cell_barcodes.MappingStats, respectively |
clustering | The filtered peak-by-cell sparse matrix output by the call_cell module | A seurat object with metadata 'active_clusters' for the cell clustering labels, saved as output/downstream_analysis/PEAK_CALLER/CELL_CALLER/seurat_obj.rds; the cell barcodes by cluster table, saved in output/downstream_analysis/cell_cluster_table.txt; the UMAP plot colored by clustering label; and a VisCello input object saved as output/downstream_analysis/PEAK_CALLER/CELL_CALLER/VisCello_Obj if parameter prepCello is set to TRUE. If CLUSTERING_METHOD is set to 'chromVAR', a chromVAR object is saved as output/downstream_analysis/PEAK_CALLER/CELL_CALLER/chromVAR_obj.rds as well |
split_bam | The cell barcodes by cluster table file (output/downstream_analysis/PEAK_CALLER/CELL_CALLER/cell_cluster_table.txt), outputted from clustering module | Bam (saved in output/downstream/PEAK_CALLER/CELL_CALLER/data_by_cluster), .bw, and .bedgraph files (saved in output/signal/) for aggregated signal for cells in each cluster |
runDA | Two groups, specified either as '0:1,2', in which group1 consists of clusters 0 and 1 and group2 consists of cluster 2, or as '0,rest' or 'one,rest'. | The differentially accessible features with statistical significance information, saved in output/downstream_analysis/PEAK_CALLER/CELL_CALLER/differential_accessible_features_group1_vs_group2.txt. |
motif_analysis | The filtered peak-by-cell sparse matrix file output by the call_cell module. | A chromVAR object, a table and a heatmap of differentially enriched TFs in each cluster, saved in output/filtered_matrix/PEAK_CALLER/CELL_CALLER/. |
footprint | Two groups, specified as '1,2', '1,rest' or 'one,rest' | A table and a heatmap of differentially bound TFs for each group, saved in output/downstream_analysis/PEAK_CALLER/CELL_CALLER/ |
runCicero | seurat_obj.rds file outputted from clustering module, saved in output/downstream_analysis/PEAK_CALLER/CELL_CALLER/seurat_obj.rds | Gene activity object in .rds format, saved as output/downstream_analysis/PEAK_CALLER/CELL_CALLER/cicero_gene_activity.rds, and predicted cis-chromatin interactions in plain text format, saved in output/downstream_analysis/PEAK_CALLER/CELL_CALLER/cicero_inteactions.txt |
integrate | Peak files called from different data sets, separated by comma | A seurat object for the integrated data and a UMAP plot colored by clustering labels. If Integrate_By is set to 'seurat', an 'integrated' assay is created in the seurat object and cell clustering by the Louvain algorithm is performed on the PCs of the 'integrated' assay. If Integrate_By is set to 'harmony', the clustering is performed on the reduced dimension 'harmony'. If Integrate_By is set to 'pool', the data sets are simply pooled and the confounding factors of sequencing depth per cell and dataset ID are regressed out |
report | Path to the directory of summary QC files: output/summary by default | A summary report in html format, saved in output/summary/, along with .eps figures for each panel saved in output/summary/Figures/ |
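
The demplx_fastq row above states that index information is embedded in the read name as `@index3_index2_index1:original_read_name`. A minimal sketch of recovering the barcode from such a header follows. It assumes the index sequences themselves contain neither ':' nor '_', so the first colon separates the barcode from the original name; the function `split_read_name` and the example header are hypothetical.

```python
def split_read_name(header):
    """Split '@index3_index2_index1:original_read_name' into its parts.

    Returns (list of index sequences in the embedded order, original name).
    """
    name = header[1:] if header.startswith("@") else header
    # Only the first ':' is a separator; the original name may contain more.
    barcode, original = name.split(":", 1)
    return barcode.split("_"), original


# Made-up header; the original read name keeps its own colons.
indexes, original = split_read_name("@GGA_TTC_ACG:NB501:7:H3:1:11101")
print(indexes, original)
```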