read_10x_h5 error in scanpy 1.9.1 #128

francois-a · 2022-05-03T00:42:42Z

Hi, I'm not sure if this is related to other h5 issues (and may be a scanpy bug), but scanpy 1.9.1 throws ValueError: Illegal slicing argument for scalar dataspace when opening *.cellbender_output_filtered.h5 files (this works fine with scanpy 1.8.2).

The text was updated successfully, but these errors were encountered:

sjfleming · 2022-05-03T15:14:43Z

François! Hope all is well!

So, I think I know what this is... any chance the input data was in CellRanger v2 format? CellBender outputs a file in "v2 format" if the input was v2 format. I guess I wanted to keep output like input, although this may have been a bad choice in hindsight, since v3 format from CellRanger makes a lot more sense in terms of its structure, and is better-maintained by writers of other tools.

If that's what's going on, it is a scanpy bug, which I am attempting to fix here
scverse/scanpy#2248

sjfleming · 2022-05-03T15:15:25Z

If you are seeing that with a CellRanger v3 formatted output file from CellBender, let me know!

beetlejuice007 · 2022-05-17T20:03:48Z

I am getting same error.
i am usng cellranger-arc they have v2 (latest).

dnjst · 2022-06-28T15:25:31Z

I'm getting this error with CellRanger v3 formatted output

scanpy v1.9.1
h5py 3.7.0

`~/project/conda/notebook_env3_clone1/lib/python3.8/site-packages/scanpy/readwrite.py in read_10x_h5(filename, genome, gex_only, backup_url)
    181         v3 = '/matrix' in f
    182     if v3:
--> 183         adata = _read_v3_10x_h5(filename, start=start)
    184         if genome:
    185             if genome not in adata.var['genome'].values:

~/project/conda/notebook_env3_clone1/lib/python3.8/site-packages/scanpy/readwrite.py in _read_v3_10x_h5(filename, start)
    266         try:
    267             dsets = {}
--> 268             _collect_datasets(dsets, f["matrix"])
    269 
    270             from scipy.sparse import csr_matrix

~/project/conda/notebook_env3_clone1/lib/python3.8/site-packages/scanpy/readwrite.py in _collect_datasets(dsets, group)
    254     for k, v in group.items():
    255         if isinstance(v, h5py.Dataset):
--> 256             dsets[k] = v[:]
    257         else:
    258             _collect_datasets(dsets, v)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

~/project/conda/notebook_env3_clone1/lib/python3.8/site-packages/h5py/_hl/dataset.py in __getitem__(self, args, new_dtype)
    798         if self.shape == ():
    799             fspace = self.id.get_space()
--> 800             selection = sel2.select_read(fspace, args)
    801             if selection.mshape is None:
    802                 arr = numpy.zeros((), dtype=new_dtype)

~/project/conda/notebook_env3_clone1/lib/python3.8/site-packages/h5py/_hl/selections2.py in select_read(fspace, args)
     99     """
    100     if fspace.shape == ():
--> 101         return ScalarReadSelection(fspace, args)
    102 
    103     raise NotImplementedError()

~/project/conda/notebook_env3_clone1/lib/python3.8/site-packages/h5py/_hl/selections2.py in __init__(self, fspace, args)
     84             self.mshape = ()
     85         else:
---> 86             raise ValueError("Illegal slicing argument for scalar dataspace")
     87 
     88         self.mspace = h5s.create(h5s.SCALAR)

ValueError: Illegal slicing argument for scalar dataspace`

sjfleming · 2022-07-05T17:47:11Z

I think it's scalars and strings in h5 fields, since scanpy uses h5py to read h5 files using [:] notation (see here h5py/h5py#1779)

I guess I could make sure everything that gets written is wrapped up into a list... sort of strange to have to do that, but it seems faster than hoping scanpy will change their dataloader.

For now, for existing files, here is a workaround:

Install pytables using conda install pytables
Use the utility ptrepack as follows to create a new h5 file that can be loaded by scanpy

ptrepack CELLBENDER_OUTPUT.h5:/matrix MODIFIED_CELLBENDER_OUTPUT.h5:/matrix

This takes just the "matrix" part of the cellbender output file and puts it in its own new h5 file, without any of the cellbender metadata. That new file should be readable directly by scanpy.

sjfleming · 2022-07-05T17:52:30Z

Another option is to use this function anndata_from_h5() as in

adata = anndata_from_h5(file='CELLBENDER_OUTPUT.h5')

This loads an AnnData object that can be used in scanpy as normal. It's just a drop-in replacement for scanpy's data loader.

Function below:

import tables
import numpy as np
import scipy.sparse as sp
import anndata
from typing import Dict, Optional


def anndata_from_h5(file: str,
                    analyzed_barcodes_only: bool = True) -> 'anndata.AnnData':
    """Load an output h5 file into an AnnData object for downstream work.

    Args:
        file: The h5 file
        analyzed_barcodes_only: False to load all barcodes, so that the size of
            the AnnData object will match the size of the input raw count matrix.
            True to load a limited set of barcodes: only those analyzed by the
            algorithm. This allows relevant latent variables to be loaded
            properly into adata.obs and adata.obsm, rather than adata.uns.

    Returns:
        adata: The anndata object, populated with inferred latent variables
            and metadata.

    """

    d = dict_from_h5(file)
    X = sp.csc_matrix((d.pop('data'), d.pop('indices'), d.pop('indptr')),
                      shape=d.pop('shape')).transpose().tocsr()

    # check and see if we have barcode index annotations, and if the file is filtered
    barcode_key = [k for k in d.keys() if (('barcode' in k) and ('ind' in k))]
    if len(barcode_key) > 0:
        max_barcode_ind = d[barcode_key[0]].max()
        filtered_file = (max_barcode_ind >= X.shape[0])
    else:
        filtered_file = True

    if analyzed_barcodes_only:
        if filtered_file:
            # filtered file being read, so we don't need to subset
            print('Assuming we are loading a "filtered" file that contains only cells.')
            pass
        elif 'barcode_indices_for_latents' in d.keys():
            X = X[d['barcode_indices_for_latents'], :]
            d['barcodes'] = d['barcodes'][d['barcode_indices_for_latents']]
        elif 'barcodes_analyzed_inds' in d.keys():
            X = X[d['barcodes_analyzed_inds'], :]
            d['barcodes'] = d['barcodes'][d['barcodes_analyzed_inds']]
        else:
            print('Warning: analyzed_barcodes_only=True, but the key '
                  '"barcodes_analyzed_inds" or "barcode_indices_for_latents" '
                  'is missing from the h5 file. '
                  'Will output all barcodes, and proceed as if '
                  'analyzed_barcodes_only=False')

    # Construct the anndata object.
    adata = anndata.AnnData(X=X,
                            obs={'barcode': d.pop('barcodes').astype(str)},
                            var={'gene_name': (d.pop('gene_names') if 'gene_names' in d.keys()
                                               else d.pop('name')).astype(str)},
                            dtype=X.dtype)
    adata.obs.set_index('barcode', inplace=True)
    adata.var.set_index('gene_name', inplace=True)

    # For CellRanger v2 legacy format, "gene_ids" was called "genes"... rename this
    if 'genes' in d.keys():
        d['id'] = d.pop('genes')

    # For purely aesthetic purposes, rename "id" to "gene_id"
    if 'id' in d.keys():
        d['gene_id'] = d.pop('id')

    # If genomes are empty, try to guess them based on gene_id
    if 'genome' in d.keys():
        if np.array([s.decode() == '' for s in d['genome']]).all():
            if '_' in d['gene_id'][0].decode():
                print('Genome field blank, so attempting to guess genomes based on gene_id prefixes')
                d['genome'] = np.array([s.decode().split('_')[0] for s in d['gene_id']], dtype=str)

    # Add other information to the anndata object in the appropriate slot.
    _fill_adata_slots_automatically(adata, d)

    # Add a special additional field to .var if it exists.
    if 'features_analyzed_inds' in adata.uns.keys():
        adata.var['cellbender_analyzed'] = [True if (i in adata.uns['features_analyzed_inds'])
                                            else False for i in range(adata.shape[1])]

    if analyzed_barcodes_only:
        for col in adata.obs.columns[adata.obs.columns.str.startswith('barcodes_analyzed')
                                     | adata.obs.columns.str.startswith('barcode_indices')]:
            try:
                del adata.obs[col]
            except Exception:
                pass
    else:
        # Add a special additional field to .obs if all barcodes are included.
        if 'barcodes_analyzed_inds' in adata.uns.keys():
            adata.obs['cellbender_analyzed'] = [True if (i in adata.uns['barcodes_analyzed_inds'])
                                                else False for i in range(adata.shape[0])]

    return adata


def dict_from_h5(file: str) -> Dict[str, np.ndarray]:
    """Read in everything from an h5 file and put into a dictionary."""
    d = {}
    with tables.open_file(file) as f:
        # read in everything
        for array in f.walk_nodes("/", "Array"):
            d[array.name] = array.read()
    return d


def _fill_adata_slots_automatically(adata, d):
    """Add other information to the adata object in the appropriate slot."""

    for key, value in d.items():
        try:
            if value is None:
                continue
            value = np.asarray(value)
            if len(value.shape) == 0:
                adata.uns[key] = value
            elif value.shape[0] == adata.shape[0]:
                if (len(value.shape) < 2) or (value.shape[1] < 2):
                    adata.obs[key] = value
                else:
                    adata.obsm[key] = value
            elif value.shape[0] == adata.shape[1]:
                if value.dtype.name.startswith('bytes'):
                    adata.var[key] = value.astype(str)
                else:
                    adata.var[key] = value
            else:
                adata.uns[key] = value
        except Exception:
            print('Unable to load data into AnnData: ', key, value, type(value))

esrice · 2022-07-14T15:58:23Z

FYI, I tried the ptrepack solution but am sadly still getting the same error (Illegal slicing argument for scalar dataspace) when trying to read the ptrepack output file with scanpy's read_10x_h5().

sjfleming · 2022-07-15T21:27:20Z

@esrice Thanks for letting me know. That is confusing to me... there must be something going on that I'm not understanding. Is it possible for you to share that .h5 file here so I can take a look and see what's going on with scanpy's data loader?

esrice · 2022-07-18T13:42:20Z

Sure, here is an h5 output by cellbender that is parsed fine by scanpy v8 but not v9. Github doesn't allow h5 uploads in comments so here's a download link:

https://oc1.rnet.missouri.edu/index.php/s/fqOxxkjaxkcizUE/download

sjfleming · 2022-07-19T21:21:20Z

Okay thank you @esrice , now I see. Yes there are still some extra values packed into the h5 file in the cellbender output for v0.2. I will soon update to v0.3, where the ptrepack solution might work. But for now, that suggestion did not make sense.

The best thing to do is to use the above data loader code for the time being: the anndata_from_h5() function.

In v0.3.0, the data loading functions will be distributed as part of cellbender, and also the format of the h5 output files will be tweaked to be compatible with the scanpy v1.9+ data loader.

HugoCornu · 2022-08-08T17:48:26Z

thanks for that anndata_from_h5() script !

@ivirshup

After @ivirshup's pytables PR (#2064) we started having issues with loading h5 files with scalar datasets, such as those created by CellBender (broadinstitute/CellBender#128). It is currently not an issue for the 10X h5 files for now since they don't have any scalars, however it'd be good to just handle scalars as well as arrays 1- to fix the cellbender file loading problem 2- to fix potential problems we might end up having if 10X h5 format includes scalar datasets.

gokceneraslan · 2022-10-03T22:01:45Z

@sjfleming you might like scverse/scanpy#2344 :)

sjfleming · 2022-10-04T14:57:17Z

Woohoo, thanks @gokceneraslan !

@ivirshup

* Handle scalar datasets too After @ivirshup's pytables PR (#2064) we started having issues with loading h5 files with scalar datasets, such as those created by CellBender (broadinstitute/CellBender#128). It is currently not an issue for the 10X h5 files for now since they don't have any scalars, however it'd be good to just handle scalars as well as arrays 1- to fix the cellbender file loading problem 2- to fix potential problems we might end up having if 10X h5 format includes scalar datasets. * Add a scalar to the multiple_genomes.h5 test file * Fixes #2203

carmensandoval · 2023-02-03T02:23:17Z

Thanks @gokceneraslan ! Is this change available in a public version of Scanpy? I'm on 1.9.1 and still seeing this error.

Name: scanpy
Version: 1.9.1

org_1 = sc.read_10x_h5('../cellbender/SAM24425933_cellbender_out_filtered.h5')

Output exceeds the [size limit](command:workbench.action.openSettings?[). Open the full output data [in a text editor](command:workbench.action.openLargeOutput?a8abfd26-9931-48ee-951f-ff7127d3d029)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [6], line 1
----> 1 org_1 = sc.read_10x_h5('../cellbender/SAM24425933_cellbender_out_filtered.h5')

File ~/mambaforge/envs/pegasus/lib/python3.9/site-packages/scanpy/readwrite.py:183, in read_10x_h5(filename, genome, gex_only, backup_url)
    181     v3 = '/matrix' in f
    182 if v3:
--> 183     adata = _read_v3_10x_h5(filename, start=start)
    184     if genome:
    185         if genome not in adata.var['genome'].values:

File ~/mambaforge/envs/pegasus/lib/python3.9/site-packages/scanpy/readwrite.py:268, in _read_v3_10x_h5(filename, start)
    266 try:
    267     dsets = {}
--> 268     _collect_datasets(dsets, f["matrix"])
    270     from scipy.sparse import csr_matrix
    272     M, N = dsets['shape']

File ~/mambaforge/envs/pegasus/lib/python3.9/site-packages/scanpy/readwrite.py:256, in _collect_datasets(dsets, group)
    254 for k, v in group.items():
    255     if isinstance(v, h5py.Dataset):
--> 256         dsets[k] = v[:]
    257     else:
    258         _collect_datasets(dsets, v)
...
---> 86     raise ValueError("Illegal slicing argument for scalar dataspace")
     88 self.mspace = h5s.create(h5s.SCALAR)
     89 self.fspace = fspace

ValueError: Illegal slicing argument for scalar dataspace

sjfleming · 2023-08-08T19:04:27Z

Closed by #238

gokceneraslan mentioned this issue Oct 3, 2022

Handle scalar datasets too scverse/scanpy#2344

Merged

sjfleming mentioned this issue Jan 11, 2023

10x H5 loading Error #174

Closed

sjfleming mentioned this issue Mar 28, 2023

v0.3.0 #189

Closed

kunalkathuria mentioned this issue Jul 5, 2023

error running scanpy highly_variable_genes #227

Open

sjfleming mentioned this issue Aug 6, 2023

v0.3.0 #238

Merged

sjfleming closed this as completed Aug 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

read_10x_h5 error in scanpy 1.9.1 #128

read_10x_h5 error in scanpy 1.9.1 #128

francois-a commented May 3, 2022

sjfleming commented May 3, 2022

sjfleming commented May 3, 2022

beetlejuice007 commented May 17, 2022

dnjst commented Jun 28, 2022 •

edited

Loading

sjfleming commented Jul 5, 2022 •

edited

Loading

sjfleming commented Jul 5, 2022

esrice commented Jul 14, 2022

sjfleming commented Jul 15, 2022

esrice commented Jul 18, 2022

sjfleming commented Jul 19, 2022

HugoCornu commented Aug 8, 2022

gokceneraslan commented Oct 3, 2022

sjfleming commented Oct 4, 2022

carmensandoval commented Feb 3, 2023

sjfleming commented Aug 8, 2023

read_10x_h5 error in scanpy 1.9.1 #128

read_10x_h5 error in scanpy 1.9.1 #128

Comments

francois-a commented May 3, 2022

sjfleming commented May 3, 2022

sjfleming commented May 3, 2022

beetlejuice007 commented May 17, 2022

dnjst commented Jun 28, 2022 • edited Loading

sjfleming commented Jul 5, 2022 • edited Loading

sjfleming commented Jul 5, 2022

esrice commented Jul 14, 2022

sjfleming commented Jul 15, 2022

esrice commented Jul 18, 2022

sjfleming commented Jul 19, 2022

HugoCornu commented Aug 8, 2022

gokceneraslan commented Oct 3, 2022

sjfleming commented Oct 4, 2022

carmensandoval commented Feb 3, 2023

sjfleming commented Aug 8, 2023

dnjst commented Jun 28, 2022 •

edited

Loading

sjfleming commented Jul 5, 2022 •

edited

Loading