Can I use output from cellranger-arc in cellbender ? #121

beetlejuice007 · 2022-01-29T16:50:59Z

Can I use output from cellranger-arc in cellbender ?

mhulke · 2022-02-02T14:48:46Z

I'm not part of the cellbender team, but yes you can. I've done it before and the program works just fine. Just out of curiosity, do you plan to remove the ATAC peaks from the raw matrix before running cellbender or leave them in? I've been leaving them in, but I would be curious how removing them affects filtering and downstream processing.

beetlejuice007 · 2022-02-04T21:17:28Z

ok thanks. I was able to run cellbender. I did not remove the ATAC peaks. I did not get any errors.
But i noticed that my output.h5 (197MB) file is much bigger than input.h5 file (70MB) (Also the filtered.h5 is 36MB. ).
Is this expected ? and why did the file size increase so much ?

mhulke · 2022-02-07T20:16:21Z

Sorry, that's a question for the dev team. My files usually end up larger, so I wouldn't be too worried about it unless you see something really strange in the data.

sjfleming · 2022-03-30T19:34:34Z

This is a great catch @hemantgujar and @mhulke ... it took me a very long time to notice this for some reason.

But I finally figured it out. The cellranger people apparently spent a lot of time thinking about and optimizing the compression of their HDF5 files, it turns out! And I did not... but my next update will fix this issue, by compressing the HDF5 files in the same optimized way that cellranger does.

Long story short, the fact that the outputs are so much larger than the inputs is the result of sub-optimal compression by me, but it doesn't affect the data in any way. The output files do contain a few more things, like the per-cell embeddings and cell probabilities and some more latent variables. But these are pretty small in terms of file size, and they do not account for the big discrepancy. Once this fix is put in place, the outputs will be smaller files than the inputs.

sjfleming · 2022-03-30T19:41:42Z

As for the cellranger-arc ATAC data, I will have to look into that. I am glad to hear the code runs without error, but I wonder if it is altering the ATAC data?

https://support.10xgenomics.com/single-cell-multiome-atac-gex/software/pipelines/latest/output/matrices

It looks like Peaks is a new valid value for "feature type", so I need to update the code to reflect that. I will make it so that the user can choose which "feature type"s to operate on, and the excluded features will be unused and untouched in the output.

smilesw · 2022-09-04T08:47:49Z

Hello, there
Cellbender is originally developed for RNA-seq dataset. So I plan to remove ATAC peaks from raw_feature_bc_matrix.h5, which is output by cellranger-arc. But does anyone know how I can get only gex data as h5 file as input file for Cellbender by removing ATAC peak data?
Thanks in advance for any help.

vincycheng · 2022-10-27T16:44:52Z

Hello!
I am also trying to use Cellbender on the cellranger-arc output with expected-cells 20000 and total-droplets-included 40000.
The issue I have is that, I got 40000 cell barcodes from Cellbender. I don't know if it has anything to do with the existence of the ATAC peak. It seems surreal to have 40000 true cells.
Any advice would be appreciated!

mhulke · 2022-10-27T16:56:40Z

To avoid having ATAC peaks mess with the Cellbender output, I usually feed my scRNA data through Cellranger separately from Cellranger-arc, and then use that h5 file in Cellbender. Afterward, pairing up the RNA and ATAC data using the barcodes downstream is fast and easy. You could also remove the peak data from the arc h5 before feeding it into Cellbender.

sjfleming · 2022-10-28T16:21:49Z

@smilesw , sorry for the delay, somehow I missed your comment.
Your best bet might actually be to use an anndata h5ad file as input for cellbender. It would be kind of tricky to remove the ATAC features from the h5 file. You can use scanpy to load the raw h5 file, manipulate the data so that the ATAC features are removed, and then save the anndata object as an h5ad file. CellBender can directly read the h5ad file.

With my next update, I will make this much easier by allowing users to choose which features they want to include in the analysis.

sjfleming · 2022-10-28T16:25:11Z

@vincycheng in your case it sounds like cellbender's cell calling might not have worked very well on your dataset. Sometimes a sub-optimal run results in cellbender saying "all droplets contain cells" when really they don't. If you can include a screenshot of the output PDF, maybe I can make a few suggestions about how to modify the parameters.

It could be the case that the ATAC peak information threw off cellbender... I have not done testing myself to know if this is a problem. If you wanted to try removing the ATAC data, you could do either as @mhulke suggested, or you could try to use an h5ad file and manually remove the ATAC features, as I suggested in the previous post. It's probably worth a try!

vincycheng · 2022-11-04T17:28:19Z

Thank you @sjfleming for your reply. I did removed the ATAC peak information and ran cellbender. The result came out a bit more "reasonable". ~30,000 cells were being called. There seems to be some other issues with this dataset, but I will open another discussion once I gather more information. Thanks again!

jingxinfu · 2023-01-04T22:02:44Z

Hi @sjfleming, I had the same issue with cellbender (default parameters and cellranger-arc estimated cells) outputting much more cells than the original cell ranger-arc output. And I did remove peak features and saved data into an additional h5ad file for the cellbender input. Would you mind suggesting how to deal with this issue?

sjfleming · 2023-01-11T16:05:48Z

Hi @jingxinfu ! I have maybe a few comments on these three runs for which you attached output PDFs:

This looks like a "very hard" dataset... perhaps something I would even consider a QC failure (before cellbender). The reason I say that is because, based on the UMI curve in panel 2, I cannot even tell by eye where the cells might be and where the empty droplets might be. It's just a smooth drop-off, no elbow or knee or inflection point in the plot at all. One question I would ask: is it possible there are more than 25000 cells in that dataset? That the empty droplets are different in UMI counts from cells, but that we just are not seeing any empty droplets before droplet 25000? (If that were the case, I would say you should re-run with a larger value for --total-dropelts-included, so that you include a few thousand empty droplets, as well as all the cells.)
I see somewhat of an inflection point around droplet 20000. Are you expecting about 20000 cells in this dataset? I think cellbender is sort of trying to find those cells, but it is struggling because it doesn't have a lot of examples of empty droplets. I would also try re-running with a larger value for --total-dropelts-included, so that you include a few thousand empty droplets, as well as all the cells. Maybe 35000 total droplets?
Now this example seems to show what I would think are pretty accurate cell calls, based on just looking at the UMI curve. The red cell probabilities show that cellbender has more-or-less identified that inflection point in the UMI curve near droplet 11000. Were you expecting a different output? To me that seems probably correct.

jingxinfu · 2023-01-11T18:38:13Z

Thanks for your suggestions! @sjfleming I will test it out for the first two samples. And for the third sample, I found that CellBender tends to keep more mitochondrial-associated UMIs when comparing the UMIs changes between CellRanger ARC and CellBender.

sjfleming · 2023-01-11T20:10:49Z

Hi @jingxinfu , for that third sample, if you mean that CellBender is identifying more droplets as "cells" and that those extra "cells" compared to CellRanger ARC tend to have more mitochondrial gene expression, then I think you're seeing something very common. Let me explain:

CellBender will tend to keep everything that is a "non-empty" droplet. We usually refer to these as "cells" and say it's the "cell probability"... but what it really is, is the probability that a droplet is "not empty". It doesn't actually mean that it contains a high quality cell.

We do urge people to perform cell quality control QC after running CellBender, in order to eliminate precisely those kinds of things: debris and dead/dying cells that might have super high mitochondrial gene counts.

Here's an example from the public 10x Genomics human-mouse mixture dataset:

In the plot above, CellBender has identified all droplets that are not gray color as "non-empty". The gray droplets have been identified by CellBender as "empty". I believe this is actually the correct outcome. But, you can see that the "arms" in the plot with high mitochondrial read fraction (red) are clearly something very different from the "good quality cells" that have low mitochondrial read fraction.

So, even though CellBender (correctly) identifies those (red) droplets as "non-empty", they should probably be filtered out by a cell QC step before using the data in downstream analyses, since those high mitochondrial count droplets are possibly(??) not what you'd be interested in for further analysis.

jingxinfu · 2023-01-11T21:47:49Z

That's good to know. Thank you so much for the explanation in detail!

ccruizm · 2023-02-23T09:30:56Z

Hi @sjfleming! Following the use of CellBender with the h5 generated by cellranger-arc, you mentioned you would implement a feature that allows users to choose which data type can be included in the analysis. Is that already implemented in the tool? If I use --exclude-antibody-capture, would this exclude the ATAC and only use the GEX data? or does only work with data that contains protein information? (although this is not implemented in CellBender v3 😅)

Thanks in advance!

sjfleming · 2023-02-24T15:27:40Z

Hi @ccruizm , that feature is not yet implemented in an official release. But I will make sure to include it in v0.3.0, which I anticipate releasing by March 10. My development version of that feature is the argument --exclude-feature-types, which is an optional input argument on the sf_dev_0.3.0_postreg branch in this repository (I pushed a change just this minute to allow "Peaks" as a feature to be excluded). You can install cellbender from that branch of the GitHub repository in order to test it.

The command would be something like

cellbender remove-background \
    ...
    --exclude-feature-types Peaks

sjfleming · 2023-02-24T15:28:32Z

The idea would be that, in the output file, the Peaks features are left unchanged, identical to the input. And they would not be used in the analysis of the other feature types (the gene expression).

ccruizm · 2023-02-24T15:55:17Z

Thanks for the speedy reply! I ran both h5 containing the peaks and without them and saw differences as others reported. But having an h5 output also preserving the ATAC information is better! Going to test it now! 🤓

sjfleming · 2023-02-24T16:00:59Z

I haven't tested it yet myself, so please do let me know if it works / doesn't! Hopefully I can make any adjustments and get it working.

Out of curiosity, how would you describe the differences you saw when including / excluding the Peaks?

ccruizm · 2023-02-24T16:46:23Z

I just gave it a try, but it did not work 😞. I saw discrepancies in several metrics (prior counts of cells and empty droplets, including additional barcodes and empty droplets, Largest surely-empty droplets, etc.) comparing the runs I made using the h5 that includes both features, h5ad RNA only and h5 excluding the features using --exclude-feature-types Peaks (the last one gives an error Inference procedure terminated early due to a NaN value in: mu, lam (see the attached txt file).

cellbender.txt

About the differences (see below), I mainly see a reduction in the number of cells called. Each column is a different sample, and the third one shows a 'cleaner' estimation of the number of cells (if I am interpreting it correctly).

By the way, in case someone else needs it, this is how I extracted the RNA assay from the h5 files to input into the cellbender pipeline.

import os
import scanpy as sc
import muon as mu

data_dir = "/data/"

for i in os.listdir(data_dir):
    if os.path.isdir(os.path.join(data_dir, i)):
        mdata = mu.read_10x_h5(os.path.join(data_dir, i, "outs/raw_feature_bc_matrix.h5"))
        rna = mdata.mod['rna']
        rna.write(os.path.join(data_dir, i, "outs/raw_feature_bc_matrix_rna_only.h5ad"), compression = 'gzip')

sjfleming · 2023-02-24T18:36:28Z

Thank you so much for running that comprehensive test! Shoot, okay. I am going to have to do some debugging here. The fact that the code runs fine when you manually subset to RNA means that there is a bug somewhere in my exclusion of features from the analysis. I will track it down!

sjfleming · 2023-02-24T18:37:33Z

Any chance you could run it with the flag --debug and show me the log again? :)

cellbender remove-background \
    ...
    --debug

ccruizm · 2023-02-27T12:27:59Z

Sure! here it is :)

output.log

sjfleming · 2023-02-27T20:28:42Z

Thanks @ccruizm ! Unfortunately I will need to do more digging to track this down, but I will keep you posted. I will make sure it's working for the v0.3.0 release

kunalkathuria · 2023-06-12T19:20:17Z

Hi,

I ran into the same issue of trying to ignore "Peaks" from Cellranger-arc multiomic data. Perhaps a quick way to do this currently is to list all the ATAC-feature indices (they are usually listed after all the genes in the Cellranger output features.tsv.gz) and supply that list to the --blacklist-genes option? Would this work?

Best,

Kunal

kunalkathuria · 2023-06-13T21:13:39Z

Or is there a fix available now?

sjfleming · 2023-07-10T18:05:37Z

@kunalkathuria I think that your suggestion would work (potentially!). But let me know if you get a chance to try running the code on the sf_dev_0.3.0_postreg branch. I think it may be fixed there.

sjfleming · 2023-08-08T19:03:46Z

Closed by #238

sjfleming self-assigned this Mar 30, 2022

sjfleming added the enhancement New feature or improvement label Mar 30, 2022

sjfleming added this to the v0.2.1 milestone Mar 30, 2022

sjfleming mentioned this issue Nov 16, 2022

Use of CellBender for scATACseq data? #167

Open

sjfleming added the bug Something isn't working label Feb 27, 2023

sjfleming mentioned this issue Feb 27, 2023

High MT genes ratio at some samples #179

Closed

sjfleming mentioned this issue Mar 28, 2023

v0.3.0 #189

Closed

ollieeknight mentioned this issue Apr 28, 2023

Trouble installing dev branch #209

Closed

sjfleming mentioned this issue Jun 7, 2023

Cell probabilities remain close to one #220

Closed

sjfleming mentioned this issue Aug 6, 2023

v0.3.0 #238

Merged

sjfleming closed this as completed Aug 8, 2023

sjfleming mentioned this issue Sep 20, 2023

Compatibility with 10X Multiome data? #284

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Can I use output from cellranger-arc in cellbender ? #121

Can I use output from cellranger-arc in cellbender ? #121

beetlejuice007 commented Jan 29, 2022

mhulke commented Feb 2, 2022

beetlejuice007 commented Feb 4, 2022 •

edited

Loading

mhulke commented Feb 7, 2022

sjfleming commented Mar 30, 2022

sjfleming commented Mar 30, 2022

smilesw commented Sep 4, 2022

vincycheng commented Oct 27, 2022 •

edited

Loading

mhulke commented Oct 27, 2022

sjfleming commented Oct 28, 2022

sjfleming commented Oct 28, 2022

vincycheng commented Nov 4, 2022

jingxinfu commented Jan 4, 2023 •

edited

Loading

sjfleming commented Jan 11, 2023

jingxinfu commented Jan 11, 2023

sjfleming commented Jan 11, 2023 •

edited

Loading

jingxinfu commented Jan 11, 2023

ccruizm commented Feb 23, 2023 •

edited

Loading

sjfleming commented Feb 24, 2023

sjfleming commented Feb 24, 2023

ccruizm commented Feb 24, 2023

sjfleming commented Feb 24, 2023

ccruizm commented Feb 24, 2023

sjfleming commented Feb 24, 2023

sjfleming commented Feb 24, 2023

ccruizm commented Feb 27, 2023 •

edited

Loading

sjfleming commented Feb 27, 2023

kunalkathuria commented Jun 12, 2023

kunalkathuria commented Jun 13, 2023

sjfleming commented Jul 10, 2023

sjfleming commented Aug 8, 2023

Can I use output from cellranger-arc in cellbender ? #121

Can I use output from cellranger-arc in cellbender ? #121

Comments

beetlejuice007 commented Jan 29, 2022

mhulke commented Feb 2, 2022

beetlejuice007 commented Feb 4, 2022 • edited Loading

mhulke commented Feb 7, 2022

sjfleming commented Mar 30, 2022

sjfleming commented Mar 30, 2022

smilesw commented Sep 4, 2022

vincycheng commented Oct 27, 2022 • edited Loading

mhulke commented Oct 27, 2022

sjfleming commented Oct 28, 2022

sjfleming commented Oct 28, 2022

vincycheng commented Nov 4, 2022

jingxinfu commented Jan 4, 2023 • edited Loading

sjfleming commented Jan 11, 2023

jingxinfu commented Jan 11, 2023

sjfleming commented Jan 11, 2023 • edited Loading

jingxinfu commented Jan 11, 2023

ccruizm commented Feb 23, 2023 • edited Loading

sjfleming commented Feb 24, 2023

sjfleming commented Feb 24, 2023

ccruizm commented Feb 24, 2023

sjfleming commented Feb 24, 2023

ccruizm commented Feb 24, 2023

sjfleming commented Feb 24, 2023

sjfleming commented Feb 24, 2023

ccruizm commented Feb 27, 2023 • edited Loading

sjfleming commented Feb 27, 2023

kunalkathuria commented Jun 12, 2023

kunalkathuria commented Jun 13, 2023

sjfleming commented Jul 10, 2023

sjfleming commented Aug 8, 2023

beetlejuice007 commented Feb 4, 2022 •

edited

Loading

vincycheng commented Oct 27, 2022 •

edited

Loading

jingxinfu commented Jan 4, 2023 •

edited

Loading

sjfleming commented Jan 11, 2023 •

edited

Loading

ccruizm commented Feb 23, 2023 •

edited

Loading

ccruizm commented Feb 27, 2023 •

edited

Loading