Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How are priors calculated? #147

Closed
apredeus opened this issue Aug 17, 2022 · 5 comments
Closed

How are priors calculated? #147

apredeus opened this issue Aug 17, 2022 · 5 comments
Assignees

Comments

@apredeus
Copy link

Hi,

Could you please comment on how exactly are priors on empty droplets and cells calculated?

I have an unusual dataset with a very "steep" cell curve - first ~5k cells have high UMIs, and then there's a pretty sharp drop. For some reason, prior on empty droplets is huge, although I'd actually expect it to be rather low? Changing --low-count-threshold does nothing.

Thank you in advance!

@apredeus
Copy link
Author

Basically, prior on empty droplets somehow becomes more than that on cells. Here's the log:

cellbender:remove-background: Command:
cellbender remove-background --input GSM4288835/output/Gene/raw/ --output GSM4288835.cellbender.out/cellbender_out.h5 --cuda --expected-cells 5001 --total-droplets-included 8001 --low-count-threshold 20
cellbender:remove-background: 2022-08-17 11:44:25
cellbender:remove-background: Running remove-background
cellbender:remove-background: Loading data from directory GSM4288835/output/Gene/raw/
cellbender:remove-background: CellRanger v3 format
cellbender:remove-background: Trimming dataset for inference.
cellbender:remove-background: Including 18269 genes that have nonzero counts.
cellbender:remove-background: Prior on counts in empty droplets is 4446
cellbender:remove-background: Prior on counts for cells is 3854
cellbender:remove-background: Excluding barcodes with counts below 2223
Traceback (most recent call last):
File "/usr/local/bin/cellbender", line 11, in
load_entry_point('cellbender', 'console_scripts', 'cellbender')()
File "/opt/cellbender/cellbender/base_cli.py", line 101, in main
cli_dict[args.tool].run(args)
File "/opt/cellbender/cellbender/remove_background/cli.py", line 109, in run
main(args)
File "/opt/cellbender/cellbender/remove_background/cli.py", line 204, in main
run_remove_background(args)
File "/opt/cellbender/cellbender/remove_background/cli.py", line 151, in run_remove_background
SingleCellRNACountsDataset(input_file=args.input_file,
File "/opt/cellbender/cellbender/remove_background/data/dataset.py", line 99, in init
self._trim_dataset_for_analysis(total_droplet_barcodes=total_droplet_barcodes,
File "/opt/cellbender/cellbender/remove_background/data/dataset.py", line 280, in _trim_dataset_for_analysis
assert num_barcodes_above_umi_cutoff > n_cells,
AssertionError: There are no empty droplets with UMI counts over the lower cutoff of 2223. Some empty droplets are necessary for the analysis. Reduce the --low-count-threshold parameter.

@sjfleming
Copy link
Member

Hi @apredeus , yes these priors are currently computed using some heuristics that honestly I think are not very robust. Your dataset seems to be an example.

The current heuristic for finding empty droplet size (after removing droplets with UMI count below --low-count-threshold) is to find the mode (after taking a log and rounding it... pretty strange actually). This modal UMI count value is then assumed to be the "empty droplet plateau". This works a lot of the time, but not all the time.

empty_log_counts = mode(np.round(np.log1p(counts[counts > cut]),
decimals=1))[0]
empty_counts = int(np.expm1(empty_log_counts).item())

I am definitely working to improve the prior calculation heuristics. But in the mean time I might be able to give a bit more advice if you can post a picture of the UMI curve.

@apredeus
Copy link
Author

Great, thank you very much - this clarifies a lot.

I'm putting together a database of some 10,000 scRNA-seq datasets at Sanger, and I'd like to use Cellbender on all of them. From what we find so far, about 10-20% of runs are problematic with current priors (so far I use --expected-cells from STARsolo/CellRanger, and modify total droplets if the expected number is over 20k). We've tried few things for the problematic samples, but nothing very systematic. If you'd like, we can coordinate our efforts - e.g. I can test new presets on problematic/single-nuc samples, etc.

@sjfleming
Copy link
Member

Hi @apredeus sorry for the slow reply, and that sounds fantastic! Very glad to hear you'll be running CellBender on your samples. That's a massive database.

I would very much like to get that fraction of problematic runs reduced way below the 10-20% you're currently seeing. I know that makes things painful when you've got a lot of samples.

I'll keep you updated about new presets for priors. If you notice any trends, like what you mentioned above happening repeatedly

first ~5k cells have high UMIs, and then there's a pretty sharp drop
--> prior on empty droplets is huge

I'd love to know

@sjfleming
Copy link
Member

This changed a bit in v0.3.0

The prior-finding logic is now somewhat compartmentalized here
https://github.com/broadinstitute/CellBender/blob/master/cellbender/remove_background/data/priors.py

But additionally, there are now input arguments --force-cell-umi-prior and --force-empty-umi-prior which will override any heuristics.

Closed by #238

@sjfleming sjfleming self-assigned this Aug 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants