Support for non-H5 inputs #38

Closed
jpcartailler opened this issue Oct 22, 2019 · 9 comments

@jpcartailler

Greetings,

I'm very excited to try this approach, but I can't seem to get our data into it. Our data comes from the inDrops method. I went through the trouble of passing the data through Seurat/LOOM to generate .h5 files, which unfortunately do not seem to be compatible with CellBender (see the "ValueError: blocks must be 2-D" error below).

Is there any chance that you could introduce a more generic/accessible format that could be used as CellBender input? Ultimately, we all start with barcodes and genes. A sparse matrix would be convenient, for example.

Alternatively, if you know of a good way to load inDrops data into CellBender, then that would really make my day!

JP

cellbender:remove-background: Command:
cellbender remove-background --input data.KO_Gene_new.cells.h5ad --output output.h5 --cuda --expected-cells 500 --total-droplets-included 1000 --epochs 100
cellbender:remove-background: 2019-10-22 11:59:42
cellbender:remove-background: Running remove-background
cellbender:remove-background: Loading data from file data.KO_Gene_new.cells.h5ad
cellbender:remove-background: CellRanger v2 format
Traceback (most recent call last):
  File "C:\Users\c\AppData\Local\Continuum\miniconda3\envs\CellBender\Scripts\cellbender-script.py", line 11, in <module>
    load_entry_point('cellbender', 'console_scripts', 'cellbender')()
  File "c:\users\c\cellbender\cellbender\base_cli.py", line 101, in main
    cli_dict[args.tool].run(args)
  File "c:\users\c\cellbender\cellbender\remove_background\cli.py", line 92, in run
    main(args)
  File "c:\users\c\cellbender\cellbender\remove_background\cli.py", line 185, in main
    run_remove_background(args)
  File "c:\users\c\cellbender\cellbender\remove_background\cli.py", line 143, in run_remove_background
    args.low_count_threshold)
  File "c:\users\c\cellbender\cellbender\remove_background\data\dataset.py", line 82, in __init__
    self._load_data()
  File "c:\users\c\cellbender\cellbender\remove_background\data\dataset.py", line 125, in _load_data
    self.data = get_matrix_from_cellranger_h5(self.input_file)
  File "c:\users\c\cellbender\cellbender\remove_background\data\dataset.py", line 874, in get_matrix_from_cellranger_h5
    count_matrix = sp.vstack(csc_list, format='csc')
  File "C:\Users\c\AppData\Local\Continuum\miniconda3\envs\CellBender\lib\site-packages\scipy\sparse\construct.py", line 499, in vstack
    return bmat([[b] for b in blocks], format=format, dtype=dtype)
  File "C:\Users\c\AppData\Local\Continuum\miniconda3\envs\CellBender\lib\site-packages\scipy\sparse\construct.py", line 548, in bmat
    raise ValueError('blocks must be 2-D')
ValueError: blocks must be 2-D
@johnchamberlin

I was about to post the same question. But I learned that you can use .mtx format as input, which might be easier to synthesize than .h5. See the example here:
https://cellbender.readthedocs.io/en/latest/getting_started/remove_background/index.html

The only hiccup was that the genes/features file has to be named "genes.tsv", not "features.tsv". I am using STARsolo instead of CellRanger, and STARsolo writes "features.tsv".
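A minimal sketch of that renaming step, in case it helps anyone else. It assumes the source directory holds uncompressed matrix.mtx, barcodes.tsv, and features.tsv, as described above; the function name `make_v2_style_dir` is made up for illustration and is not part of CellBender or STARsolo. It copies rather than renames, so the original output stays untouched.

```python
import shutil
from pathlib import Path


def make_v2_style_dir(src_dir, dst_dir):
    """Copy an mtx-format count directory, naming the feature file genes.tsv.

    Assumption: src_dir holds uncompressed matrix.mtx, barcodes.tsv, and
    features.tsv. This helper is hypothetical, not a CellBender function.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    # The matrix and barcode files keep their names.
    for name in ("matrix.mtx", "barcodes.tsv"):
        shutil.copyfile(src / name, dst / name)
    # Only the feature file gets a new name; its contents are copied unchanged.
    shutil.copyfile(src / "features.tsv", dst / "genes.tsv")
    return dst
```

The resulting directory can then be passed to `--input` in place of the original one.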

@sjfleming
Member

Yes, at the moment, the easiest approach is to try to get your data into the format of either CellRanger v2 or CellRanger v3, in their mtx format.

The .h5 file input expects the format to be exactly as CellRanger has it, so that's a bit more of a pain to pull off. The sparse mtx and tsv format should work for you though.

The two CellRanger versions are a bit different (v2 has genes.tsv, v3 has features.tsv.gz), but if you can get your data into that fairly generic format, you can use the tool directly. Check out
https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/2.1/output/matrices
for details on formatting for CellRanger v2.
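For a hand-built directory, a quick consistency check before running the tool might look like the sketch below. It assumes the CellRanger v2 layout (uncompressed matrix.mtx, genes.tsv, barcodes.tsv, with genes as matrix rows and barcodes as matrix columns). The function names are hypothetical and this is not CellBender's own loader; it only compares the size line of the mtx file against the line counts of the two tsv files.

```python
from pathlib import Path


def read_mtx_shape(mtx_path):
    """Return (n_rows, n_cols, n_entries) from a Matrix Market coordinate file."""
    with open(mtx_path) as fh:
        for line in fh:
            # Lines starting with '%' are the banner or comments.
            if line.startswith("%") or not line.strip():
                continue
            n_rows, n_cols, n_entries = (int(x) for x in line.split())
            return n_rows, n_cols, n_entries
    raise ValueError(f"no size line found in {mtx_path}")


def count_lines(path):
    """Count the non-blank lines in a text file."""
    with open(path) as fh:
        return sum(1 for line in fh if line.strip())


def check_v2_style_dir(directory):
    """Return a list of problems found; an empty list means the checks passed."""
    d = Path(directory)
    problems = [
        f"missing {name}"
        for name in ("matrix.mtx", "genes.tsv", "barcodes.tsv")
        if not (d / name).is_file()
    ]
    if problems:
        return problems
    n_rows, n_cols, _ = read_mtx_shape(d / "matrix.mtx")
    n_genes = count_lines(d / "genes.tsv")
    n_barcodes = count_lines(d / "barcodes.tsv")
    # Assumption: rows index genes and columns index barcodes.
    if n_rows != n_genes:
        problems.append(f"matrix has {n_rows} rows but genes.tsv has {n_genes} lines")
    if n_cols != n_barcodes:
        problems.append(f"matrix has {n_cols} columns but barcodes.tsv has {n_barcodes} lines")
    return problems
```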

@c5creative We are interested in adding more compatibility for other file formats in the future. Could you point me to some documentation of the file specification for your InDrops format (and maybe a public example data file)?

@achamess

Just chiming in. I also used STARsolo. I had to manually rename features.tsv to genes.tsv, and then it worked. Alternatively, one can use DropletUtils to make h5 files. Either way, there is an intermediate step between the STARsolo outputs and CellBender. Good tool, btw. Really cleans up my data.

@jpcartailler
Author

Thanks, everyone, for the feedback!

@sjfleming - With regard to the inDrops data: in our case, we originally used the inDrops pipeline for the data processing/filtering. I can share some output with you (it's large), but in short, it's a tab-delimited file with barcodes as rows and genes/features as columns:
[image]
Since one can provide mtx, I think that is sufficient as a generic means of loading data.
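For reference, a rough sketch of how such a table could be turned into the mtx layout using only the Python standard library. Several things here are assumptions rather than facts about the inDrops pipeline: that the header row has a corner cell followed by gene names, that every count is an integer, and that repeating the gene name in both columns of genes.tsv is acceptable. The function name is made up. All non-zero entries are held in memory, so this is only meant for illustration.

```python
import csv
from pathlib import Path


def dense_table_to_mtx_dir(table_path, out_dir):
    """Convert a tab-delimited barcodes-by-genes table to mtx + tsv files.

    Assumed input: a header row (corner cell, then gene names), followed by
    one row per barcode (barcode, then one integer count per gene).
    Returns (n_genes, n_barcodes, n_nonzero).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    barcodes, entries = [], []
    with open(table_path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        genes = next(reader)[1:]  # drop the corner cell
        for row in reader:
            if not row:
                continue
            barcodes.append(row[0])
            col = len(barcodes)  # 1-based column index of this barcode
            for gene_idx, value in enumerate(row[1:], start=1):
                count = int(value)
                if count != 0:
                    # Matrix rows are genes and columns are barcodes.
                    entries.append((gene_idx, col, count))
    with open(out / "matrix.mtx", "w") as fh:
        fh.write("%%MatrixMarket matrix coordinate integer general\n")
        fh.write(f"{len(genes)} {len(barcodes)} {len(entries)}\n")
        for gene_idx, col, count in entries:
            fh.write(f"{gene_idx} {col} {count}\n")
    with open(out / "genes.tsv", "w") as fh:
        for gene in genes:
            # Two columns (ID, name); the same label is reused for both.
            fh.write(f"{gene}\t{gene}\n")
    with open(out / "barcodes.tsv", "w") as fh:
        fh.writelines(barcode + "\n" for barcode in barcodes)
    return len(genes), len(barcodes), len(entries)
```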

@achamess - great tip, I went back and reprocessed with STARsolo and now am making progress with CellBender

@sjfleming
Member

We are currently adding functionality to read inputs in the DGE matrix format from dropseq, and if there's interest, we could also add a file parser for inDrop data. But glad to hear you've made progress.

sjfleming self-assigned this Jul 29, 2020
sjfleming added this to the v0.2 milestone Jul 29, 2020
sjfleming added the enhancement (New feature or improvement) label Aug 19, 2020
@Hrovatin

If further input and output formats are added, h5ad might be a good choice for both.

@sjfleming
Member

Interesting point... it would require the user to have anndata and h5py installed, but it could be doable...

@sjfleming
Member

The h5ad addition is now live thanks to @jacobkimmel

The next commit will also add support for the DropSeq file format, which is a zipped dense count matrix in tabular form, much like the transpose of the inDrop format above.
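To illustrate the "transpose" relationship, here is a small sketch that flips a barcodes-by-genes tab-delimited table into a genes-by-barcodes one and gzips it. It assumes "zipped" means gzip, that the input table is rectangular with a header row, and that the corner cell should read "GENE"; those details and the function name are assumptions for illustration, not a statement of what the CellBender parser requires.

```python
import csv
import gzip


def transpose_table_to_gz(in_path, out_path, corner="GENE"):
    """Write the transpose of a tab-delimited table as a gzipped text file.

    Assumed input: header row (corner cell, then gene names), followed by one
    row per barcode. Output: header row (corner, then barcodes), followed by
    one row per gene.
    """
    with open(in_path, newline="") as fh:
        rows = [row for row in csv.reader(fh, delimiter="\t") if row]
    rows[0][0] = corner  # label for the top-left cell (an assumption)
    # zip(*rows) turns columns into rows; it assumes equal-length rows.
    transposed = zip(*rows)
    with gzip.open(out_path, "wt", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        writer.writerows(transposed)
```

The whole table is read into memory, which is fine for a small example but may not be for a full dense matrix.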

Let me know if there is still desire for the inDrop format you've shown @c5creative

sjfleming modified the milestones: v0.2, v0.1, v0.2.1 May 3, 2021
sjfleming mentioned this issue Mar 28, 2023
sjfleming mentioned this issue Aug 6, 2023
@sjfleming
Member

Closed by #238
