Replies: 6 comments
-
Hi, thank you for your interest in Bambu. We do want to look into whether we can adapt Bambu for single-cell data in the future; however, there are no imminent plans to release this yet.
-
@andredsim : thanks for the reply. I am working with @atrull314 on the development of an nf-core pipeline, and indeed we have discussed adding bambu to it. That being said, I think the above plan will be our first pass - what you suggested has intrigued me, and I can see it being a feasible step (with what we are currently working on serving as a way to compare results at a high level, and also providing the annotations you mentioned). Thus, I will be happy to look into this when we get to it, and if we find anything useful I will be sure to share it here (or if someone else jumps ahead and tests the suggestion, we will be sure to add that here as well).
-
Thanks, yes, please share any updates! It looks like Bambu for transcript discovery might be a great addition to the pipeline. I also joined the Slack channel on nf-core.
-
Hi @lianov , I am part of the bambu team and am planning to expand bambu to support single-cell data as well. We are interested in the nf-core pipeline you are developing.
-
@lingminhao : That would be great, and yes, we have a Slack channel. Looking forward to our discussions.
-
Hello, I wanted to follow up on our discussion with an update related to the approach suggested by @andredsim. What we have found so far is that bambu fails on specific subsetted BAMs - we are unsure if there is a workaround for this at the moment, and we hope the example provided below (along with the data) can aid this discussion. If there is any misinterpretation of the suggested approach on our end, we are happy to change it.

**Background**

For context, the data we use here is derived from a public dataset with both GridION and PromethION runs (ERR9958133 for GridION and ERR9958135 for PromethION in the script below). In particular, the GridION dataset uses the higher-quality chemistry we are targeting for the pipeline (Q20), so this is our main dataset. However, we also make use of the PromethION dataset as a stress test for high depth (we do expect that samples processed with this pipeline will have high depth, given the single-cell/nuclei context).

**Overview of steps in test**

After initial processing (code and data are below; a rough sketch of the per-cell subsetting is included right after this list), we follow these steps:

1. Run bambu on the full GridION and PromethION BAMs with `discovery = TRUE` and `quant = FALSE` to obtain a common set of extended annotations.
2. Write the extended annotations to a GTF with `writeToGTF()`.
3. Re-run `prepareAnnotations()` on the extended GTF.
4. Quantify the per-cell-barcode subsetted BAMs against the extended annotations with `discovery = FALSE` and `quant = TRUE`.
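For completeness, here is a minimal sketch of how per-cell BAMs like the ones used in step 4 could be generated, assuming the corrected cell barcode is stored in the `CB` tag and the input BAM is coordinate-sorted and indexed (this is an illustrative Rsamtools version only; the pipeline's actual preprocessing may differ):

```r
# Illustrative only: subset one sample BAM into per-barcode BAMs with Rsamtools,
# assuming corrected cell barcodes are stored in the "CB" tag.
library(Rsamtools)

subset_bam_by_barcode <- function(barcode, bam, out_dir) {
  # keep only alignments whose CB tag equals the requested barcode
  param <- ScanBamParam(tagFilter = list(CB = barcode))
  out_bam <- file.path(out_dir,
                       paste0(sub("\\.bam$", "", basename(bam)), "_", barcode, ".bam"))
  filterBam(bam, destination = out_bam, param = param)
  out_bam
}

# e.g. for a hypothetical barcode whitelist:
# barcodes <- readLines("barcode_whitelist.txt")
# lapply(barcodes, subset_bam_by_barcode,
#        bam = "input_data/ERR9958133.corrected.dedup.bam",
#        out_dir = "input_data/ERR9958133_subset_bam")
```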
Step number 4 is where we find issues. Initially, it seems to work, until we hit an issue with one specific BAM file (BAM name: `ERR9958133_GACTTCCTCTGTCGCT.bam`; see the NOTE at the end of the `Try1` block in the script below).
While processing all subsetted BAMs in a single bambu run is the preferred approach, I have gone ahead and also tested running bambu per subsetted BAM to see if we could spot other issues (in the code below, this is under `Try2`). In that test, a different BAM (`ERR9958133_AAGCGTTGTTTGATCG.bam`) also causes a fatal error (see the NOTE at the end of the script).
This is where we are - it is unclear to us whether we should be filtering out specific BAMs that do not meet one or more of bambu's assumptions (it is also unclear what we might be missing, and we would need to know more to make this generic enough for an nf-core pipeline analyzing different data types/sources). At a high level, we did look at the size of the BAMs in case they were too small for bambu (note these tests have only been done with the GridION sample so far). While these two specific BAMs are smaller than others, we do see other BAMs of similar size where we did not encounter a fatal error - just a data point. A rough sketch of the kind of pre-filter/skip logic we have in mind is included after the script below.

**Data and code**

I am providing a public Globus endpoint which should have all the data needed to reproduce the issue. If you run into any problems accessing it, please let us know. Depending on the scope of this issue, we may move forward without including bambu in an initial release, BUT we would love to include it as soon as we can work out a solution for single-cell/nuclei data and proceed with further tests (whether before the first release or in future releases). Thus, we are happy to work with you all to push this discussion forward.

Globus endpoint: https://app.globus.org/file-manager?origin_id=d1a6e641-7072-4477-8aa7-40fa4f0a5622&origin_path=%2F
The script is also present in the Globus endpoint.

```r
# Brief tests linked to the discussion with the bambu authors (https://github.com/GoekeLab/bambu/discussions/342)
######################
### LOAD LIBRARIES ###
######################
library(bambu)
################
### BAM PATH ###
################
# setting 2 diff. variables for quick tests
#bam_files_small == GridION
#bam_files_large == PromethION
bam_files_small <- "./input_data/ERR9958133.corrected.dedup.bam"
bam_files_large <- "./input_data/ERR9958135.corrected.dedup.bam"
gtf <- "./input_data/gencode.v31.annotation.gtf"
genome_file <- "./input_data/GRCh38.primary_assembly.genome.fa"
##########################
### PREPARE ANNOTATION ###
##########################
bambuAnnotations <- prepareAnnotations(gtf)
########################################
### PERFORM DISCOVERY ON ALL SAMPLES ###
########################################
# Here, we set quant = FALSE while discovery = TRUE to save a common set of annotations
se_all_no_quant <- bambu(reads = c(bam_files_small, bam_files_large),
annotations = bambuAnnotations,
genome = genome_file,
lowMemory = FALSE,
discovery = TRUE,
quant = FALSE,
verbose = TRUE)
se_all_no_quant
# save extended annotation for use on per-sample runs:
writeToGTF(se_all_no_quant, file = "extended_annotation_all_samples.gtf")
###########################
### Re-PREPARE ANNOTATION #
###########################
# from the GTF generated above, re-prepare the annotation object
new_gtf <- "./extended_annotation_all_samples.gtf"
bambuAnnotations <- prepareAnnotations(new_gtf)
######################################
### QUANTIFY per cell data GridION ###
######################################
# testing for GridION
# quantify per cell barcode:
bam_files_ERR9958133 <- list.files(path = "input_data/ERR9958133_subset_bam",
pattern = "\\.bam$",
full.names = TRUE)
bam_files_ERR9958133
#### Try1: all bams (per-cell within a sample) in a single bambu run ####
# would be the cleaner path
dir.create("rcOutDir_ERR9958133", recursive = TRUE)
se_ERR9958133_per_cell <- bambu(reads = bam_files_ERR9958133,
annotations = bambuAnnotations,
genome = genome_file,
ncore = 1,
lowMemory = FALSE,
discovery = FALSE, # we use previously discovered annotations
quant = TRUE,
rcOutDir = "rcOutDir_ERR9958133",
verbose = TRUE)
se_ERR9958133_per_cell
#NOTE: bam_files_ERR9958133[499] causes issues ("input_data/ERR9958133_subset_bam/ERR9958133_GACTTCCTCTGTCGCT.bam")
#### Try2: processing cell barcodes separately... ####
dir.create("per_cell_outs_ERR9958133/", recursive = TRUE)
mapply(FUN = function(x) {
se_per_cell <- bambu(reads = bam_files_ERR9958133[x],
annotations = bambuAnnotations,
genome = genome_file,
ncore = 1,
lowMemory = FALSE,
discovery = FALSE, # we use previously discovered annotations
quant = TRUE,
verbose = TRUE)
# save standard bambu outputs with basename of bam as prefix
writeBambuOutput(se_per_cell,
path = "per_cell_outs_ERR9958133/",
prefix = paste0(sub("\\.bam$", "", basename(bam_files_ERR9958133[x])),"_"))
}, x=1:length(bam_files_ERR9958133))
#NOTE: bam_files_ERR9958133[39] causes issues ("input_data/ERR9958133_subset_bam/ERR9958133_AAGCGTTGTTTGATCG.bam")
```
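To make the discussion concrete, this is roughly the kind of workaround we could imagine on the pipeline side while the underlying issue is investigated: skipping very small per-cell BAMs and isolating per-barcode failures so one bad BAM does not abort the whole batch. This is only a sketch (untested against the data above), and the `min_reads` cutoff is arbitrary:

```r
# Sketch only (untested on the data above): skip very small per-cell BAMs and
# wrap each bambu call in tryCatch so a single failing barcode does not stop the run.
library(Rsamtools)
library(bambu)

min_reads <- 50  # arbitrary illustrative cutoff

run_one_cell <- function(bam, annotations, genome, out_path) {
  # count alignment records in the BAM and skip barcodes with too few reads
  n_records <- countBam(bam)$records
  if (n_records < min_reads) {
    message("Skipping ", basename(bam), " (only ", n_records, " records)")
    return(NA)
  }
  tryCatch({
    se <- bambu(reads = bam,
                annotations = annotations,
                genome = genome,
                ncore = 1,
                discovery = FALSE,
                quant = TRUE,
                verbose = FALSE)
    writeBambuOutput(se, path = out_path,
                     prefix = paste0(sub("\\.bam$", "", basename(bam)), "_"))
    TRUE
  }, error = function(e) {
    message("bambu failed on ", basename(bam), ": ", conditionMessage(e))
    FALSE
  })
}

# status <- vapply(bam_files_ERR9958133, run_one_cell, logical(1),
#                  annotations = bambuAnnotations,
#                  genome = genome_file,
#                  out_path = "per_cell_outs_ERR9958133/")
```

If filtering really is the right call, knowing which bambu assumptions to check for would help us pick a better criterion than raw read count.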
-
Hi,
I just had a question on whether there are plans for this tool to support single-cell data. I know that there is a way to produce the transcript-to-read mappings and build the count matrix ourselves, but because that takes the quantification piece out, I was just curious if there were any plans to integrate this within bambu so there would be an option to produce single-cell barcode matrices?
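For reference, this is roughly what I mean by building the matrix ourselves - a minimal sketch that assumes we already have per-read transcript assignments (e.g. exported after a bambu run) and per-read cell barcodes; the object and column names here are hypothetical:

```r
# Illustrative sketch: build a transcript-by-barcode count matrix from
# per-read transcript assignments and per-read cell barcodes.
# Column names (read_id, transcript_id, barcode) are hypothetical.
library(Matrix)

build_cell_matrix <- function(read_to_tx, read_to_bc) {
  # read_to_tx: data.frame with columns read_id, transcript_id
  # read_to_bc: data.frame with columns read_id, barcode
  m  <- merge(read_to_tx, read_to_bc, by = "read_id")
  tx <- factor(m$transcript_id)
  bc <- factor(m$barcode)
  # one count per assigned read; duplicate (transcript, barcode) pairs are summed
  sparseMatrix(i = as.integer(tx),
               j = as.integer(bc),
               x = 1,
               dims = c(nlevels(tx), nlevels(bc)),
               dimnames = list(levels(tx), levels(bc)))
}
```

Of course this skips bambu's quantification step, which is exactly why a built-in option would be nice.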
Thanks for your help!