How are priors calculated? #147

apredeus · 2022-08-17T10:47:26Z

Hi,

Could you please comment on how exactly are priors on empty droplets and cells calculated?

I have an unusual dataset with a very "steep" cell curve - first ~5k cells have high UMIs, and then there's a pretty sharp drop. For some reason, prior on empty droplets is huge, although I'd actually expect it to be rather low? Changing --low-count-threshold does nothing.

Thank you in advance!

The text was updated successfully, but these errors were encountered:

apredeus · 2022-08-17T15:19:25Z

Basically, prior on empty droplets somehow becomes more than that on cells. Here's the log:

cellbender:remove-background: Command:
cellbender remove-background --input GSM4288835/output/Gene/raw/ --output GSM4288835.cellbender.out/cellbender_out.h5 --cuda --expected-cells 5001 --total-droplets-included 8001 --low-count-threshold 20
cellbender:remove-background: 2022-08-17 11:44:25
cellbender:remove-background: Running remove-background
cellbender:remove-background: Loading data from directory GSM4288835/output/Gene/raw/
cellbender:remove-background: CellRanger v3 format
cellbender:remove-background: Trimming dataset for inference.
cellbender:remove-background: Including 18269 genes that have nonzero counts.
cellbender:remove-background: Prior on counts in empty droplets is 4446
cellbender:remove-background: Prior on counts for cells is 3854
cellbender:remove-background: Excluding barcodes with counts below 2223
Traceback (most recent call last):
File "/usr/local/bin/cellbender", line 11, in
load_entry_point('cellbender', 'console_scripts', 'cellbender')()
File "/opt/cellbender/cellbender/base_cli.py", line 101, in main
cli_dict[args.tool].run(args)
File "/opt/cellbender/cellbender/remove_background/cli.py", line 109, in run
main(args)
File "/opt/cellbender/cellbender/remove_background/cli.py", line 204, in main
run_remove_background(args)
File "/opt/cellbender/cellbender/remove_background/cli.py", line 151, in run_remove_background
SingleCellRNACountsDataset(input_file=args.input_file,
File "/opt/cellbender/cellbender/remove_background/data/dataset.py", line 99, in init
self._trim_dataset_for_analysis(total_droplet_barcodes=total_droplet_barcodes,
File "/opt/cellbender/cellbender/remove_background/data/dataset.py", line 280, in _trim_dataset_for_analysis
assert num_barcodes_above_umi_cutoff > n_cells,
AssertionError: There are no empty droplets with UMI counts over the lower cutoff of 2223. Some empty droplets are necessary for the analysis. Reduce the --low-count-threshold parameter.

sjfleming · 2022-08-19T16:22:31Z

Hi @apredeus , yes these priors are currently computed using some heuristics that honestly I think are not very robust. Your dataset seems to be an example.

The current heuristic for finding empty droplet size (after removing droplets with UMI count below --low-count-threshold) is to find the mode (after taking a log and rounding it... pretty strange actually). This modal UMI count value is then assumed to be the "empty droplet plateau". This works a lot of the time, but not all the time.

CellBender/cellbender/remove_background/data/dataset.py

Lines 1436 to 1438 in cb2d209

    
           empty_log_counts = mode(np.round(np.log1p(counts[counts > cut]), 
        
                                            decimals=1))[0] 
        
           empty_counts = int(np.expm1(empty_log_counts).item())

I am definitely working to improve the prior calculation heuristics. But in the mean time I might be able to give a bit more advice if you can post a picture of the UMI curve.

apredeus · 2022-08-19T16:39:44Z

Great, thank you very much - this clarifies a lot.

I'm putting together a database of some 10,000 scRNA-seq datasets at Sanger, and I'd like to use Cellbender on all of them. From what we find so far, about 10-20% of runs are problematic with current priors (so far I use --expected-cells from STARsolo/CellRanger, and modify total droplets if the expected number is over 20k). We've tried few things for the problematic samples, but nothing very systematic. If you'd like, we can coordinate our efforts - e.g. I can test new presets on problematic/single-nuc samples, etc.

sjfleming · 2022-09-21T21:01:10Z

Hi @apredeus sorry for the slow reply, and that sounds fantastic! Very glad to hear you'll be running CellBender on your samples. That's a massive database.

I would very much like to get that fraction of problematic runs reduced way below the 10-20% you're currently seeing. I know that makes things painful when you've got a lot of samples.

I'll keep you updated about new presets for priors. If you notice any trends, like what you mentioned above happening repeatedly

first ~5k cells have high UMIs, and then there's a pretty sharp drop
--> prior on empty droplets is huge

I'd love to know

sjfleming · 2023-08-08T19:33:44Z

This changed a bit in v0.3.0

The prior-finding logic is now somewhat compartmentalized here
https://github.com/broadinstitute/CellBender/blob/master/cellbender/remove_background/data/priors.py

But additionally, there are now input arguments --force-cell-umi-prior and --force-empty-umi-prior which will override any heuristics.

Closed by #238

sjfleming closed this as completed Aug 8, 2023

sjfleming self-assigned this Aug 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How are priors calculated? #147

How are priors calculated? #147

apredeus commented Aug 17, 2022

apredeus commented Aug 17, 2022

sjfleming commented Aug 19, 2022

apredeus commented Aug 19, 2022

sjfleming commented Sep 21, 2022

sjfleming commented Aug 8, 2023

How are priors calculated? #147

How are priors calculated? #147

Comments

apredeus commented Aug 17, 2022

apredeus commented Aug 17, 2022

sjfleming commented Aug 19, 2022

apredeus commented Aug 19, 2022

sjfleming commented Sep 21, 2022

sjfleming commented Aug 8, 2023