scPipe is an R package for barcode demultiplexing, transcript mapping and quality control of raw sequencing data generated by multiple 3' end sequencing protocols, including CEL-seq, MARS-seq, Chromium 10x and Drop-seq. scPipe produces a gene count matrix along with a user-friendly HTML report that summarises data quality. The count matrix can be used as input for downstream analyses including normalization, visualization and statistical testing.
The package is under active development. Feel free to ask any questions or submit a pull request.
- [21/09/2017] scPipe now uses the SingleCellExperiment class.
Install from Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("scPipe")

Or install from GitHub:

install.packages("devtools")
devtools::install_github("LuyiTian/scPipe")
The general workflow of scPipe is illustrated in the following figure:
- The sc_trim_barcode function will reformat each read and put the cell barcode and UMI sequence into the fastq read names:

    @ACGATCGA_TAGAGC#SIMULATE_SEQ::002::000::0000::0
    AAGACGTCTAAGGGCGGTGTACACCCTTTTGAGCAATGATTGCACAACCTGCGATCACCTTATACAGAATTAT
    +
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
- After alignment, the sc_exon_mapping function will put the cell barcode and UMI into the BAM file with different tags, together with gene information:

    AAAGTCAA_AACTCA#SIMULATE_SEQ::007::000::0013::10 0 ERCC-00171 142 40 73M * 0 0 GCCTCGGGAATAAGCTGACGGTGACAAGGTTTCCCCCTAATCGAGACGCTGCAATAACACAGGGGCATACAGT AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA HI:i:1 NH:i:1 NM:i:0 GE:Z:ERCC-00171 YC:Z:AAAGTCAA YM:Z:AACTCA YE:i:-364

  In this example the cell barcode is AAAGTCAA with tag YC, the UMI is AACTCA with tag YM, and the gene that this read maps to is ERCC-00171 with tag GE. This read is located 364 bp upstream of the transcription end site (TES), which is stored in the YE tag.
- The sc_demultiplex function will look for the cell barcode in the BAM file (by default in the YC tag) and compare it against the known cell barcode annotation file, which is a csv file consisting of two columns. The first column is the cell name and the second column is the cell barcode. For Chromium 10x and Drop-seq data we can run sc_detect_bc to find the barcodes and generate the cell barcode annotation file before running sc_demultiplex. An example barcode annotation file is available in the package from system.file("extdata", "barcode_anno.csv", package = "scPipe"). The output of sc_demultiplex will be multiple csv files, one for each cell. Each file has three columns: the first contains the gene id, the second the UMI sequence, and the third the relative location of the read to the TES. These files are used for sc_gene_counting.
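To make the record layouts described above more concrete, here is a minimal base R sketch that unpacks them by hand. It does not call scPipe and is not how the package works internally. The helper names (parse_read_name, parse_tags), the column names of the toy table and the toy rows themselves are hypothetical. The last part assumes that gene counting amounts to tallying distinct UMIs per gene, which ignores any UMI error correction the package may apply.

```r
# Part 1: read name written by the barcode-trimming step (copied from the example above).
read_name <- "@ACGATCGA_TAGAGC#SIMULATE_SEQ::002::000::0000::0"

# Layout assumed from the examples: @<cell barcode>_<UMI>#<original read name>
parse_read_name <- function(x) {
  m <- regmatches(x, regexec("^@?([ACGTN]+)_([ACGTN]+)#(.*)$", x))[[1]]
  if (length(m) == 0) stop("read name does not match the assumed layout")
  list(barcode = m[2], umi = m[3], original = m[4])
}
parsed <- parse_read_name(read_name)

# Part 2: optional fields of the aligned record (copied from the example above).
tag_fields <- c("HI:i:1", "NH:i:1", "NM:i:0", "GE:Z:ERCC-00171",
                "YC:Z:AAAGTCAA", "YM:Z:AACTCA", "YE:i:-364")

# Each optional field has the form TAG:TYPE:VALUE; integer values use type "i".
parse_tags <- function(fields) {
  parts  <- strsplit(fields, ":", fixed = TRUE)
  keys   <- vapply(parts, `[`, character(1), 1)
  types  <- vapply(parts, `[`, character(1), 2)
  # Re-join everything after the type in case a value itself contains ":".
  values <- vapply(parts, function(p) paste(p[-(1:2)], collapse = ":"),
                   character(1))
  out <- as.list(values)
  out[types == "i"] <- lapply(values[types == "i"], as.integer)
  names(out) <- keys
  out
}
tags <- parse_tags(tag_fields)

# Part 3: a made-up per-cell table with the three columns described above
# (gene id, UMI sequence, position relative to the TES). Column names are
# hypothetical; check a real output file for the actual header.
cell_reads <- data.frame(
  gene_id  = c("ERCC-00171", "ERCC-00171", "ERCC-00171", "ERCC-00002"),
  UMI      = c("AACTCA", "AACTCA", "GGTACA", "AACTCA"),
  position = c(-364L, -364L, -120L, -88L),
  stringsAsFactors = FALSE
)

# Simplifying assumption: one molecule per distinct (gene, UMI) pair.
unique_molecules <- unique(cell_reads[, c("gene_id", "UMI")])
umi_counts <- vapply(split(unique_molecules$UMI, unique_molecules$gene_id),
                     length, integer(1))
```

Here parsed holds the barcode and UMI recovered from the read name, tags is a named list keyed by tag (for example tags$YC and tags$YE), and umi_counts is a named integer vector with one entry per gene. For real data, use the package functions rather than this sketch.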
For further examples see the vignette.
This package is inspired by the scater and scran packages. The idea to put cell barcode and UMI sequences into the BAM file is from Drop-seq tools. We thank Dr Aaron Lun for suggestions on package development.