This is a wrapped package of the above workflow with additional checks on the Celltag Library sequences. This package have a dependency on R version (R >= 3.5.0). This can be used as an alternative approach for this pipeline.
(You might need to install devtools to be able to install from Github first)
install.packages("devtools")
Install the package from GitHub.
library("devtools")
devtools::install_github("morris-lab/CloneHunter")
Load the package
library("CloneHunter")
In the first section, we would like to evaluate the CellTag library complexity using sequencing. Following is an example using the sequencing data we generated in lab for pooled CellTag library V2.
# Read in data file that come with the package
fpath <- system.file("extdata", "V2-1_R1.zip", package = "CloneHunter")
extract.dir <- "."
# Extract the dataset
unzip(fpath, overwrite = FALSE, exdir = ".")
full.fpath <- paste0(extract.dir, "/", "V2-1_S2_L001_R1_001.fastq")
# Set up output file directory
output.path <- "./celltag_extracted_v2-1_r1.txt"
# Extract the CellTags
extracted.cell.tags <- CellTagExtraction(fastq.bam.input = full.fpath, celltag.version = "v2", extraction.output.filename = output.path, save.fullTag = FALSE, save.onlyTag = FALSE)
The extracted CellTag - extracted.cell.tags
variable - contains a list of two vectors as following.
First Vector-extracted.cell.tags[[1]] |
Second Vector-extracted.cell.tags[[1]] |
---|---|
Full CellTag with conserved regions | 8nt CellTag region |
# Count the occurrence of each CellTag
cell.tag.count <- as.data.frame(table(extracted.cell.tags[[2]]), stringsAsFactors = F)
# Sort the CellTags in descending order of occurrence
cell.tag.count.sort <- cell.tag.count[order(-cell.tag.count$Freq), ]
colnames(cell.tag.count.sort) <- c("CellTag", "Count")
Here are are generating the whitelist for this CellTag library - CellTag V2. This will remove the CellTags with an occurrence number below the threshold. The threshold (using 90th percentile as an example) is determined: floor[(90th quantile)/10]. The percentile can be changed while calling the function. Occurrence scatter plots are saved under the output.dir
, which could be further used to determine the percentile for each different CellTag library.
whitelisted.cell.tag <- CellTagWhitelistFiltering(count.sorted.table = cell.tag.count.sort, percentile = 0.9, output.dir="./", output.count.file = "my_favourite_v1.csv", save.output = TRUE)
The generated whitelist for each library can be used to filter and clean the single-cell CellTag UMI matrices.
In this section, we are presenting an alternative approach that utilizes this package to carry out CellTag extraction, quantification, and generation of UMI count matrices. This can be also accomplished via the workflow supplied - https://github.com/morris-lab/CellTagWorkflow.
Note: Using the package could be slow for the extraction part. For reference, it took approximately an hour to extract from a 40Gb BAM file using a maximum of 8Gb of memory.
Here we would follow the same step as in https://github.com/morris-lab/CellTagWorkflow to download the a BAM file from the Sequence Read Archive (SRA) server. Again, this file is quite large. Hence, it might take a while to download.
# bash
wget https://sra-download.ncbi.nlm.nih.gov/traces/sra65/SRZ/007347/SRR7347033/hf1.d15.possorted_genome_bam.bam
In this step, we will extract the CellTag information from the BAM file, which contains information including cell barcodes, CellTag and Unique Molecular Identifiers (UMI). The result generated from this extraction will be a data table containing the following information. The result will then be saved into a tab-delimited table on the directory given.
Cell Barcode | Unique Molecular Identifier | CellTag Motif |
---|---|---|
Cell.BC | UMI | Cell.Tag |
extracted.cell.tags.bam <- CellTagExtraction(fastq.bam.input="./hf1.d15.possorted_genome_bam.bam", celltag.version="v1", extraction.output.filename="./hf1.d15.possorted.celltag.tsv", FALSE, FALSE)
In this step, we will quantify the CellTag UMI counts and generate the UMI count matrices. This function will take in two inputs, including the barcode tsv file generated by 10X and the CellTag data table from Step 2. The result will contain the final matrix with the following information. The result will also be saved under the same directory as where the tab-delimited table was generated in Step 2.
Cell Barcode | CellTag Motif 1 | CellTag Motif 2 | <all tags detected> | CellTag Motif N |
---|---|---|---|---|
Cell.BC | Motif 1 | Motif 2 | <all tags detected> | Motif N |
ct.mtx <- CellTagMatrixCount("barcodes.tsv", "./hf1.d15.possorted.celltag.tsv")
The generated CellTag UMI count matrices can then be used in the following steps for clone identification.
In this section, we are presenting an alternative approach that utilizes this package that we established to carry out clone calling with single-cell CellTag UMI count matrices. In this pipeline below, we are using a subset dataset generated from the full data (Full data can be found here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE99915). Briefly, in our lab, we reprogram mouse embryonic fibroblasts (MEFs) to induced endoderm progenitors (iEPs). This dataset is a single-cell dataset that contains cells collected from different time points during the process. This subset is a part of the first replicate of the data. It contains cells collected at Day 28 with three different CellTag libraries - V1, V2 & V3.
These matrices can be obtained from the first part of this tutorial. In this pipeline below, the matrices were saved as .Rds files. It will need to be changed when saving the matrices into different data types.
# Read the RDS file
dt.mtx.path <- system.file("extdata", "hf1.d28.prefiltered.Rds", package = "CloneHunter")
sc.celltag <- readRDS(dt.mtx.path)
# Change the rownames
rnm <- sc.celltag$Cell.BC
sc.celltag <- sc.celltag[,-1]
rownames(sc.celltag) <- rnm
If CellTag error correction was not planned to be performed, move to the next step to carry out binarization and metric-based filtering. Otherwise, before binarization and additional filtering, please move on to the CellTag Error Correction section and come back to the binarization with the collapsed matrix, collapsed.mtx
.
Here we would like to binarize the count matrix to contain 0 or 1, where 0 indicates no such CellTag found in a single cell and 1 suggests the existence of such CellTag. The suggested cutoff that marks existence or absence is at least 2 counts per CellTag per Cell. For details, please refer to the paper - https://www.nature.com/articles/s41586-018-0744-4
# Calling binarization
binary.sc.celltag <- SingleCellDataBinarization(sc.celltag, 2)
We then generate scatter plots for the number of total celltag counts in each cell and the number each tag across all cells. These plots could help us further in filtering and cleaning the data.
metric.p <- MetricPlots(celltag.data = binary.sc.celltag)
print(paste0("Mean CellTags Per Cell: ", metric.p[[1]]))
print(paste0("Mean CellTag Frequency across Cells: ", metric.p[[2]]))
Note: This filters the single-cell data based on the whitelist of CellTags one by one. By mean of that, if three CellTag libraries were used, the following commands need to be executed for 3 times and result matrices can be further joined (Example provided).
Based on the whitelist generated earlier, we filter the UMI count matrix to contain only whitelisted CelTags.
whitelist.sc.data.v2 <- SingleCellDataWhitelist(binary.sc.celltag, whitels.cell.tag.file = "./my_favourite_v2_1.csv")
For all three CellTags,
##########
# Only run if this sample has been tagged with more than 1 CellTags libraries
##########
## NOT RUN
whitelist.sc.data.v1 <- SingleCellDataWhitelist(binary.sc.celltag, whitels.cell.tag.file = "./my_favourite_v1.csv")
whitelist.sc.data.v2 <- SingleCellDataWhitelist(binary.sc.celltag, whitels.cell.tag.file = "./my_favourite_v2_1.csv")
whitelist.sc.data.v3 <- SingleCellDataWhitelist(binary.sc.celltag, whitels.cell.tag.file = "./my_favourite_v3.csv")
For each version of CellTag library, it should be processed through the following steps one by one to call clones for different pooled CellTag library.
Recheck the metric as similar as Step 3
metric.p2 <- MetricPlots(celltag.data = whitelist.sc.data.v2)
print(paste0("Mean CellTags Per Cell: ", metric.p2[[1]]))
print(paste0("Mean CellTag Frequency across Cells: ", metric.p2[[2]]))
metric.filter.sc.data <- MetricBasedFiltering(whitelisted.celltag.data = whitelist.sc.data.v2, cutoff = 20, comparison = "less")
metric.filter.sc.data.2 <- MetricBasedFiltering(whitelisted.celltag.data = metric.filter.sc.data, cutoff = 2, comparison = "greater")
metric.p3 <- MetricPlots(celltag.data = metric.filter.sc.data.2)
print(paste0("Mean CellTags Per Cell: ", metric.p3[[1]]))
print(paste0("Mean CellTag Frequency across Cells: ", metric.p3[[2]]))
If it looks good, proceed to the following steps to call the clones.
This calculates pairwise Jaccard similarities among cells using the filtered CellTag UMI count matrix. This will generate a Jaccard similarity matrix and plot a correlation heatmap with cells ordered by hierarchical clustering. The matrix and plot will be saved in the current working directory.
jac.mtx <- JaccardAnalysis(whitelisted.celltag.data = metric.filter.sc.data.2)
Based on the Jaccard similarity matrix, we can call clones of cells. A clone will be selected if the correlations inside of the clones passes the cutoff given (here, 0.7 is used. It can be changed based on the heatmap/correlation matrix generated above). Using this part, a list containing the clonal identities of all cells and the count information for each clone. The tables will be saved in the given directory and filename.
clone.id | cell.barcode |
---|---|
Clonal ID | Cell BC |
Clone.ID | Frequency |
---|---|
Clonal ID | The cell number in the clone |
Clone.result <- CloneCalling(Jaccard.Matrix = jac.mtx, output.dir = "./", output.filename = "clone_calling_result.csv", correlation.cutoff = 0.7)
In this step, we will identify CellTags with similar sequences and collapse similar CellTags to the centroid CellTag. For more information, please refer to starcode software - https://github.com/gui11aume/starcode. Briefly, starcode clusters DNA sequences based on the Levenshtein distances between each pair of sequences, from which we collapse similar CellTag sequences to correct for potential errors occurred during single-cell RNA-sequencing process. Default maximum distance from starcode was used to cluster the CellTags.
First, we will prepare the data to the format that could be accepted by starcode. This function accepts three inputs including the unfiltered single-cell data, the single-cell full UMI count matrix and the output csv file to save to. The output will be a data frame containing the CellTag information with their corresponding cell barcode and UMI counts. In this function, we concatenate the CellTag with cell barcode and use the combined sequences as input to execute Starcode. The file to be used for Starcode will be stored under the same directory as the output file and with the name provided and the suffix of "collapse.txt".
# Expecting matrix with each column = a cell, each row = a celltag
sc.celltag.t <- t(sc.celltag)
colnames(sc.celltag.t) <- rownames(sc.celltag)
# Generating the collapsing files
collapse.df <- CellTagDataForCollapsing(sc.cell.tag, sc.celltag.t, "./my_favoriate.csv")
Following the instruction for Starcode, we will run the following command to generate the result from starcode.
./starcode -s --print-clusters ./my_favoriate_collapse.txt > ./collapsing_result.txt
With the collapsed results, we will regenerate the CellTag x Cell Barcode matrix. The output will be a matrix that contain the combined counts and collapsed CellTags. Also, the output will be saved under the output file directory given.
collapsed.mtx <- CellTagDataPostCollapsing(sc.cell.tag, "./collapsing_result.txt", "./my_favoriate.csv", "./collapsed_matrix.RDS")