sc_preprocess

Introduction
Details
Usage
- Indexing
- Quantification
Output

Introduction

sc_preprocess is a containerized preprocessing pipeline for 10x scRNA-seq data written in Nextflow. The pipeline basically implements the code suggestions of the salmon/alevin developers in their tutorial. This covers generation of a genome-decoyed and expanded (spliced+unspliced) transcriptome index directly from reference annotations, quantification of reads against this index, generation of spliced- and unspliced count tables and a basic QC summary report. It optionally supports feature barcoding experiments such as CITE-Seq or cell hashtag oligos (HTO). The indexing procedure relies on extraction of spliced and unspliced counts from a genome/GTF using eisaR followed by index building with salmon. The quantification happens via alevin. Generation of count tables from the quantification results is achieved via tximeta. A QC summary report is provided by alevinQC.

Run the test profile after cloning this repo to see produced output, see also the output description.

NXF_VER=22.10.4 nextflow run main.nf -profile docker,test --features_file $(realpath test/hto.tsv) --samplesheet $(realpath test/samplesheet.csv)

For a summary of all current software versions and command lines (based on the test data) see the command line and software version summaries in the misc folder.

Workflow

The pipeline covers these steps:

Generate a genome-decoyed and expanded reference transcriptome containing all spliced (=exonic) and unspliced (intronic) transcript sequences. This expanded transcriptome is required for quantification to allow output of both spliced- and unspliced counts, e.g. for velocity analysis. Also, this expanded reference is decoyed by the entire reference genome to perform selective alignments in order to improve mapping accuracy by capturing reads that better align to the genome than the transcriptome, removing potential gDNA contaminations and other spurious mappings. The indexing of this expanded reference is done by salmon.
Processing of the reads (fastq files read from a CSV samplesheet) against the expanded reference using alevin performing cellular barcode (CB) detection, read mapping, Unique Molecular Identifier (UMI) deduplication and gene count estimation, resulting in per-cell gene-level abundance estimations.
Split the obtained quantifications into spliced and unspliced count tables and save these in mtx format for easy distribution and loading into downstream analysis environments such as R.
Create a per-sample summary report using alevinQC which includes relevant QC metrics such as the average number of reads per cell, number of detected genes per cell and others. Also, a table and plot summarizing number of cells per sample is generated using custom scripts.
Optionally, quantify matched reads from a feature barcoding experiment such as CITE-seq or HTO cell hashing against a provided set of reference barcode sequences. The obtained counts are filtered for the cellular barcodes detected in the RNA quantification. In feature barcode mode the pipeline will eventually return a count matrix that only contains those cellular barcodes detected in both the RNA and feature barcode experiment. Some feature barcode sets such as totalSeq-B/C require barcode translation. We provide the default translation table from the CellRanger GitHub in the assets/ folder to perform this translation if --translate_barcodes is set and --translate_table points to that table. See also this thread at biostars.org for details.

Usage

Typically one would build an index first and then run the quantification separately as shown below. Still, the entire workflow including indexing and quantification can also be run as a single command.

Indexing

Typical command line for the indexing on a cluster with SLURM and Singularity, using GENCODE annotations. If not using GENCODE then the --idx_args argument can be omitted.

NXF_VER=22.10.4 nextflow run main.nf -profile singularity,slurm --idx_only --genome path/to/genome.fa.gz --gtf path/to/gtf.gz --idx_args '\--gencode'

This will parse both spliced and unspliced transcript sequences directly from the genome using the GTF as guide and then run the indexing process on these files, using the provided genome as decoy. Six CPUs and 30GB of RAM are hardcoded in the default profile for this step, and this is sufficient for mouse data.

Indexing parameters

These parameters can be used to customize the indexing:

--only_idx: logical, whether to only run the indexing process without any quantification and then exit
--genome: path to the genome fasta file (.fa.gz)
--gtf: path to the GTF reference file (.gtf.gz)
--gene_id: name of the column storing the gene ID in the GTF, default gene_id
--gene_name: name of the column storing the gene name in the GTF, default gene_name
--gene_type: name of the column storing the gene biotype in the GTF, default gene_type
--chrM: name of the mitochondrial chromosome, default chrM
--rrna: gene biotype for ribosomal RNAs, default rRNA
--read_length: an integer indicating the read length of R2 (the non-CB/non-UMI read), default 150 --idx_outdir: name of output folder inside the sc_preprocess_results directly to store the output
--idx_name: name of the index in which the actual index is stored inside --idx_outdir, default idx
--idx_args: additional arguments passed to salmon index step, default --sparse

Quantification

Typical command line for the quantification on a HPC with SLURM and Singularity, assuming presence of feature barcodes that require barcode translation:

NXF_VER=22.10.4 nextflow run main.nf -profile singularity,slurm \
    --samplesheet path/to/samplesheet.csv \
    --idx path_to_idx_folder \
    --tgmap path_to_tgmap --rrnagenes path_to_rrnagenes_file --mtgenes path_to_mtrnagenes_file \
    --features path_to_expanded_features_file --gene2type path_to_gene2type_file \
    --features_file path/to/hto.tsv --translate_table path/to/assets/translate_table.txt.gz --translate_barcodes

This will quantify the reads against the index, create spliced- and unspliced count tables and create the alevinQC report. As resources we hardcoded 6 CPUs and 30GB of RAM which works well for mouse samples of typical size. See details below.

Samplesheet

The pipeline reads the fastq file pairs from a samplesheet which is a five-column CSV with a header.

sample,r1,r2,libtype,is_fb
sample1,$baseDir/test/sample1_1.fastq.gz,$baseDir/test/sample1_2.fastq.gz,ISR,false
sample1,$baseDir/test/sample1a_1.fastq.gz,$baseDir/test/sample1a_2.fastq.gz,ISR,false
sample1,$baseDir/test/sample1SF_1.fastq.gz,$baseDir/test/sample1SF_2.fastq.gz,ISR,true
sample2,$baseDir/test/sample2_1.fastq.gz,$baseDir/test/sample2_2.fastq.gz,ISR,false

The first line is the mandatory header followed by the sample to quantify:

column1 (sample) is the per-sample name to be used in the output. This can be any Unix-compatible name, and there is no need to match this name with the fastq file names as other software may require you to do.
column2/3 (r1/2) are the paths to the fastq files for that sample. It either must be the full absolute path (don't use ~) or alternative a path relative to the directory that this pipeline is started from using either of the three implicit Nextflow variables $baseDir, $projectDir and $launchDir (see example below).
column4 is the library type so basically the strandedness and layout of the library. That would be ISR for 10x Chromium data.
column5 (is_fb) is a logical column with either true or false indicating whether this fastq file pair is a feature barcode experiment. If so then these files will be quantified against the barcode library provided by --features-file (see below). This column can be empty and then defaults to false. See below for details on feature barcoding experiments.

An example using relative paths (same samples as in the example above) could be:

sample,r1,r2,libtype,is_fb
sample1,$baseDir/test/sample1_1.fastq.gz,$baseDir/test/sample1_2.fastq.gz,ISR,false
sample1,$baseDir/test/sample1a_1.fastq.gz,$baseDir/test/sample1a_2.fastq.gz,ISR,false
sample1,$baseDir/test/sample1SF_1.fastq.gz,$baseDir/test/sample1SF_2.fastq.gz,ISR,true
sample2,$baseDir/test/sample2_1.fastq.gz,$baseDir/test/sample2_2.fastq.gz,ISR,false

Technical replicates

...from sequencing the same library on multiple lanes or in individual sequencing runs can be specified in the samplesheet by using the same sample_id for these files. In the example above sample1 has two technical replicates (sample_1/2.fastq.gz and sample1a_1/2.fastq.gz) which will be merged by the pipeline prior to quantification.

Feature barcoding experiments

If feature barcode experiments are to be quantified then the user must provide a --features_file which is a tab-separated list that in column1 stores the name of the feature barcode and in column2 the sequence, for example:

hto_1	ACCCACCAGTAAGAC
hto_2	GGTCGAGAGCATTCA
hto_3	CTTGCCGCATGTCAT

If this file is provided then an index will be made for these files and the feature barcode fastq files will be quantified against it. The counts of the feature barcodes will be included to the bottom of the gene expression count tables. If barcode translation must be performed (totalSeq-B/C), then use --translate_barcodes --translate_table path/to/translate_table.txt.gz.

Quantification parameters

These parameters can be used to customize the quantification:

--r1_type: this flag defines structure of read1 in the experiment so the position and length of the CB and UMI. This must consist of the two options --bc-geometry --umi-geometry from alevin. The defaults are:
--bc-geometry 1[1-16] --umi-geometry 1[17-28] and indicate that the BC is in read1 from position 1-16 and the UMI from position 17-28, so 16bp CB and 12bp UMI as in Chromium V3. For Chromium V2 it would be 10bp UMIs so one would use --r1_type '\--bc-geometry 1[1-16] --umi-geometry 1[17-26]'.
--r2_type: same as above but defining the structure of read2. For Chromium V3 (the default) it would be:
--r2_type '\--read-geometry 2[1-91]' meaning that the first 91bp of R2 should be used for quantification, as recommended by 10x. One could also use 1-end here so alevin would only the entire read, e.g. in case of 150bp reads. We stick with the default of 91bp here fir consistency between runs as not every sequencing run may produce 150bp reads, depending on the machine and run mode.
--R2_type_fb: same as --r2_type but the read2 structure for feature barcode experiments. The defaults here assume totalSeqB feature barcoding:
--read-geometry 2[11-25] meaning that the feature barcodes are 15bp long and starting from at position 11. The original CITE-seq protocol would be [1-15].
--quants_args: this flag allows to pass further arguments to the alevin processing. This can be any of the allowed alevin arguments (see its manual) such as generating inferential replicates. Is must not include any of -o -i -p as these options are already defined internally. See the Alevin manual or the help via salmon alevin -h for details on further options.

Output

The pipeline will produce an output folder sc_preprocess_results in the location from which the pipeline was launched that contains:

alevin_idx: folder with the expanded transcriptome index (idx_gentrome) and the feature barcode index (idx_features)
alevin_quant: folder with the alevin outputs (one folder per sample)
mtx: folder with the expression matrices as mtx.gz and the column and row annotations as tsv.gz. If feature barcode (FBs) libraries were present for a particular sample then the returned files will already be filtered for CBs found in both the RNA and FB experiment. The FB counts will be appended to the mtx.gz so will be the last entries in these files. The same goes for the sample_feature.tsv.gz file, where the feature barcode names will be the last entries.
qc_dir: folder with the alevinQC html reports for each quantified library -- RNA and FB (if present). Note that in case of FBs this QC report will be based on the full RNA/FB experiment that has not been filtered for CBs present in both experiments. It is useful to judge the quality of both experiments independently.
In this folder there will also be a file summary_cellnumbers.txt which summarizes the number of detected cells per sample in the RNA experiment and the number of intersecting cells with the FB experiment. If no FBs were present for that sample NAs are returned. The numbers are also presented as a barplot in summary_cellnumbers.pdf.
pipeline_info: folder contains software_versions.txt listing all software versions used in the pipeline and command_lines.txt listing the exact command lines used in the processes

Name		Name	Last commit message	Last commit date
Latest commit History 188 Commits
.github/workflows		.github/workflows
assets		assets
bin		bin
configs		configs
functions		functions
misc		misc
modules		modules
test		test
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CITATIONS.md		CITATIONS.md
CONTAINERLOG.md		CONTAINERLOG.md
Dockerfile		Dockerfile
README.md		README.md
environment.yml		environment.yml
main.nf		main.nf
nextflow.config		nextflow.config
schema.nf		schema.nf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

sc_preprocess

Introduction

Workflow

Usage

Indexing

Quantification

Output

About

Releases 1

Packages

Languages

ATpoint/sc_preprocess

Folders and files

Latest commit

History

Repository files navigation

sc_preprocess

Introduction

Workflow

Usage

Indexing

Quantification

Output

About

Topics

Resources

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages