Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can I use output from cellranger-arc in cellbender ? #121

Closed
beetlejuice007 opened this issue Jan 29, 2022 · 30 comments
Closed

Can I use output from cellranger-arc in cellbender ? #121

beetlejuice007 opened this issue Jan 29, 2022 · 30 comments
Assignees
Labels
bug Something isn't working enhancement New feature or improvement
Milestone

Comments

@beetlejuice007
Copy link

Can I use output from cellranger-arc in cellbender ?

@mhulke
Copy link

mhulke commented Feb 2, 2022

I'm not part of the cellbender team, but yes you can. I've done it before and the program works just fine. Just out of curiosity, do you plan to remove the ATAC peaks from the raw matrix before running cellbender or leave them in? I've been leaving them in, but I would be curious how removing them affects filtering and downstream processing.

@beetlejuice007
Copy link
Author

beetlejuice007 commented Feb 4, 2022

ok thanks. I was able to run cellbender. I did not remove the ATAC peaks. I did not get any errors.
But i noticed that my output.h5 (197MB) file is much bigger than input.h5 file (70MB) (Also the filtered.h5 is 36MB. ).
Is this expected ? and why did the file size increase so much ?

@mhulke
Copy link

mhulke commented Feb 7, 2022

Sorry, that's a question for the dev team. My files usually end up larger, so I wouldn't be too worried about it unless you see something really strange in the data.

@sjfleming
Copy link
Member

This is a great catch @hemantgujar and @mhulke ... it took me a very long time to notice this for some reason.

But I finally figured it out. The cellranger people apparently spent a lot of time thinking about and optimizing the compression of their HDF5 files, it turns out! And I did not... but my next update will fix this issue, by compressing the HDF5 files in the same optimized way that cellranger does.

Long story short, the fact that the outputs are so much larger than the inputs is the result of sub-optimal compression by me, but it doesn't affect the data in any way. The output files do contain a few more things, like the per-cell embeddings and cell probabilities and some more latent variables. But these are pretty small in terms of file size, and they do not account for the big discrepancy. Once this fix is put in place, the outputs will be smaller files than the inputs.

@sjfleming
Copy link
Member

As for the cellranger-arc ATAC data, I will have to look into that. I am glad to hear the code runs without error, but I wonder if it is altering the ATAC data?

https://support.10xgenomics.com/single-cell-multiome-atac-gex/software/pipelines/latest/output/matrices

It looks like Peaks is a new valid value for "feature type", so I need to update the code to reflect that. I will make it so that the user can choose which "feature type"s to operate on, and the excluded features will be unused and untouched in the output.

@sjfleming sjfleming self-assigned this Mar 30, 2022
@sjfleming sjfleming added the enhancement New feature or improvement label Mar 30, 2022
@sjfleming sjfleming added this to the v0.2.1 milestone Mar 30, 2022
@smilesw
Copy link

smilesw commented Sep 4, 2022

Hello, there
Cellbender is originally developed for RNA-seq dataset. So I plan to remove ATAC peaks from raw_feature_bc_matrix.h5, which is output by cellranger-arc. But does anyone know how I can get only gex data as h5 file as input file for Cellbender by removing ATAC peak data?
Thanks in advance for any help.

@vincycheng
Copy link

vincycheng commented Oct 27, 2022

Hello!
I am also trying to use Cellbender on the cellranger-arc output with expected-cells 20000 and total-droplets-included 40000.
The issue I have is that, I got 40000 cell barcodes from Cellbender. I don't know if it has anything to do with the existence of the ATAC peak. It seems surreal to have 40000 true cells.
Any advice would be appreciated!

@mhulke
Copy link

mhulke commented Oct 27, 2022

To avoid having ATAC peaks mess with the Cellbender output, I usually feed my scRNA data through Cellranger separately from Cellranger-arc, and then use that h5 file in Cellbender. Afterward, pairing up the RNA and ATAC data using the barcodes downstream is fast and easy. You could also remove the peak data from the arc h5 before feeding it into Cellbender.

@sjfleming
Copy link
Member

@smilesw , sorry for the delay, somehow I missed your comment.
Your best bet might actually be to use an anndata h5ad file as input for cellbender. It would be kind of tricky to remove the ATAC features from the h5 file. You can use scanpy to load the raw h5 file, manipulate the data so that the ATAC features are removed, and then save the anndata object as an h5ad file. CellBender can directly read the h5ad file.

With my next update, I will make this much easier by allowing users to choose which features they want to include in the analysis.

@sjfleming
Copy link
Member

@vincycheng in your case it sounds like cellbender's cell calling might not have worked very well on your dataset. Sometimes a sub-optimal run results in cellbender saying "all droplets contain cells" when really they don't. If you can include a screenshot of the output PDF, maybe I can make a few suggestions about how to modify the parameters.

It could be the case that the ATAC peak information threw off cellbender... I have not done testing myself to know if this is a problem. If you wanted to try removing the ATAC data, you could do either as @mhulke suggested, or you could try to use an h5ad file and manually remove the ATAC features, as I suggested in the previous post. It's probably worth a try!

@vincycheng
Copy link

Thank you @sjfleming for your reply. I did removed the ATAC peak information and ran cellbender. The result came out a bit more "reasonable". ~30,000 cells were being called. There seems to be some other issues with this dataset, but I will open another discussion once I gather more information. Thanks again!

@jingxinfu
Copy link

jingxinfu commented Jan 4, 2023

Hi @sjfleming, I had the same issue with cellbender (default parameters and cellranger-arc estimated cells) outputting much more cells than the original cell ranger-arc output. And I did remove peak features and saved data into an additional h5ad file for the cellbender input. Would you mind suggesting how to deal with this issue?
image
image
image

@sjfleming
Copy link
Member

Hi @jingxinfu ! I have maybe a few comments on these three runs for which you attached output PDFs:

  1. This looks like a "very hard" dataset... perhaps something I would even consider a QC failure (before cellbender). The reason I say that is because, based on the UMI curve in panel 2, I cannot even tell by eye where the cells might be and where the empty droplets might be. It's just a smooth drop-off, no elbow or knee or inflection point in the plot at all. One question I would ask: is it possible there are more than 25000 cells in that dataset? That the empty droplets are different in UMI counts from cells, but that we just are not seeing any empty droplets before droplet 25000? (If that were the case, I would say you should re-run with a larger value for --total-dropelts-included, so that you include a few thousand empty droplets, as well as all the cells.)
  2. I see somewhat of an inflection point around droplet 20000. Are you expecting about 20000 cells in this dataset? I think cellbender is sort of trying to find those cells, but it is struggling because it doesn't have a lot of examples of empty droplets. I would also try re-running with a larger value for --total-dropelts-included, so that you include a few thousand empty droplets, as well as all the cells. Maybe 35000 total droplets?
  3. Now this example seems to show what I would think are pretty accurate cell calls, based on just looking at the UMI curve. The red cell probabilities show that cellbender has more-or-less identified that inflection point in the UMI curve near droplet 11000. Were you expecting a different output? To me that seems probably correct.

@jingxinfu
Copy link

Thanks for your suggestions! @sjfleming I will test it out for the first two samples. And for the third sample, I found that CellBender tends to keep more mitochondrial-associated UMIs when comparing the UMIs changes between CellRanger ARC and CellBender.

@sjfleming
Copy link
Member

sjfleming commented Jan 11, 2023

Hi @jingxinfu , for that third sample, if you mean that CellBender is identifying more droplets as "cells" and that those extra "cells" compared to CellRanger ARC tend to have more mitochondrial gene expression, then I think you're seeing something very common. Let me explain:

CellBender will tend to keep everything that is a "non-empty" droplet. We usually refer to these as "cells" and say it's the "cell probability"... but what it really is, is the probability that a droplet is "not empty". It doesn't actually mean that it contains a high quality cell.

We do urge people to perform cell quality control QC after running CellBender, in order to eliminate precisely those kinds of things: debris and dead/dying cells that might have super high mitochondrial gene counts.

Here's an example from the public 10x Genomics human-mouse mixture dataset:
image

In the plot above, CellBender has identified all droplets that are not gray color as "non-empty". The gray droplets have been identified by CellBender as "empty". I believe this is actually the correct outcome. But, you can see that the "arms" in the plot with high mitochondrial read fraction (red) are clearly something very different from the "good quality cells" that have low mitochondrial read fraction.

So, even though CellBender (correctly) identifies those (red) droplets as "non-empty", they should probably be filtered out by a cell QC step before using the data in downstream analyses, since those high mitochondrial count droplets are possibly(??) not what you'd be interested in for further analysis.

@jingxinfu
Copy link

That's good to know. Thank you so much for the explanation in detail!

@ccruizm
Copy link

ccruizm commented Feb 23, 2023

Hi @sjfleming! Following the use of CellBender with the h5 generated by cellranger-arc, you mentioned you would implement a feature that allows users to choose which data type can be included in the analysis. Is that already implemented in the tool? If I use --exclude-antibody-capture, would this exclude the ATAC and only use the GEX data? or does only work with data that contains protein information? (although this is not implemented in CellBender v3 😅)

Thanks in advance!

@sjfleming
Copy link
Member

Hi @ccruizm , that feature is not yet implemented in an official release. But I will make sure to include it in v0.3.0, which I anticipate releasing by March 10. My development version of that feature is the argument --exclude-feature-types, which is an optional input argument on the sf_dev_0.3.0_postreg branch in this repository (I pushed a change just this minute to allow "Peaks" as a feature to be excluded). You can install cellbender from that branch of the GitHub repository in order to test it.

The command would be something like

cellbender remove-background \
    ...
    --exclude-feature-types Peaks

@sjfleming
Copy link
Member

The idea would be that, in the output file, the Peaks features are left unchanged, identical to the input. And they would not be used in the analysis of the other feature types (the gene expression).

@ccruizm
Copy link

ccruizm commented Feb 24, 2023

Thanks for the speedy reply! I ran both h5 containing the peaks and without them and saw differences as others reported. But having an h5 output also preserving the ATAC information is better! Going to test it now! 🤓

@sjfleming
Copy link
Member

I haven't tested it yet myself, so please do let me know if it works / doesn't! Hopefully I can make any adjustments and get it working.

Out of curiosity, how would you describe the differences you saw when including / excluding the Peaks?

@ccruizm
Copy link

ccruizm commented Feb 24, 2023

I just gave it a try, but it did not work 😞. I saw discrepancies in several metrics (prior counts of cells and empty droplets, including additional barcodes and empty droplets, Largest surely-empty droplets, etc.) comparing the runs I made using the h5 that includes both features, h5ad RNA only and h5 excluding the features using --exclude-feature-types Peaks (the last one gives an error Inference procedure terminated early due to a NaN value in: mu, lam (see the attached txt file).

cellbender.txt

About the differences (see below), I mainly see a reduction in the number of cells called. Each column is a different sample, and the third one shows a 'cleaner' estimation of the number of cells (if I am interpreting it correctly).
Screenshot 2023-02-24 at 17 11 17
By the way, in case someone else needs it, this is how I extracted the RNA assay from the h5 files to input into the cellbender pipeline.

import os
import scanpy as sc
import muon as mu

data_dir = "/data/"

for i in os.listdir(data_dir):
    if os.path.isdir(os.path.join(data_dir, i)):
        mdata = mu.read_10x_h5(os.path.join(data_dir, i, "outs/raw_feature_bc_matrix.h5"))
        rna = mdata.mod['rna']
        rna.write(os.path.join(data_dir, i, "outs/raw_feature_bc_matrix_rna_only.h5ad"), compression = 'gzip')

@sjfleming
Copy link
Member

Thank you so much for running that comprehensive test! Shoot, okay. I am going to have to do some debugging here. The fact that the code runs fine when you manually subset to RNA means that there is a bug somewhere in my exclusion of features from the analysis. I will track it down!

@sjfleming
Copy link
Member

Any chance you could run it with the flag --debug and show me the log again? :)

cellbender remove-background \
    ...
    --debug

@ccruizm
Copy link

ccruizm commented Feb 27, 2023

Sure! here it is :)

output.log

@sjfleming
Copy link
Member

Thanks @ccruizm ! Unfortunately I will need to do more digging to track this down, but I will keep you posted. I will make sure it's working for the v0.3.0 release

@kunalkathuria
Copy link

Hi,

I ran into the same issue of trying to ignore "Peaks" from Cellranger-arc multiomic data. Perhaps a quick way to do this currently is to list all the ATAC-feature indices (they are usually listed after all the genes in the Cellranger output features.tsv.gz) and supply that list to the --blacklist-genes option? Would this work?

Best,

Kunal

@kunalkathuria
Copy link

Or is there a fix available now?

@sjfleming
Copy link
Member

@kunalkathuria I think that your suggestion would work (potentially!). But let me know if you get a chance to try running the code on the sf_dev_0.3.0_postreg branch. I think it may be fixed there.

@sjfleming sjfleming mentioned this issue Aug 6, 2023
@sjfleming
Copy link
Member

Closed by #238

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or improvement
Projects
None yet
Development

No branches or pull requests

8 participants