Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_10x_h5 error in scanpy 1.9.1 #128

Closed
francois-a opened this issue May 3, 2022 · 15 comments
Closed

read_10x_h5 error in scanpy 1.9.1 #128

francois-a opened this issue May 3, 2022 · 15 comments

Comments

@francois-a
Copy link

Hi, I'm not sure if this is related to other h5 issues (and may be a scanpy bug), but scanpy 1.9.1 throws ValueError: Illegal slicing argument for scalar dataspace when opening *.cellbender_output_filtered.h5 files (this works fine with scanpy 1.8.2).

@sjfleming
Copy link
Member

François! Hope all is well!

So, I think I know what this is... any chance the input data was in CellRanger v2 format? CellBender outputs a file in "v2 format" if the input was v2 format. I guess I wanted to keep output like input, although this may have been a bad choice in hindsight, since v3 format from CellRanger makes a lot more sense in terms of its structure, and is better-maintained by writers of other tools.

If that's what's going on, it is a scanpy bug, which I am attempting to fix here
scverse/scanpy#2248

@sjfleming
Copy link
Member

If you are seeing that with a CellRanger v3 formatted output file from CellBender, let me know!

@beetlejuice007
Copy link

I am getting same error.
i am usng cellranger-arc they have v2 (latest).

@dnjst
Copy link

dnjst commented Jun 28, 2022

I'm getting this error with CellRanger v3 formatted output

scanpy v1.9.1
h5py 3.7.0

`~/project/conda/notebook_env3_clone1/lib/python3.8/site-packages/scanpy/readwrite.py in read_10x_h5(filename, genome, gex_only, backup_url)
    181         v3 = '/matrix' in f
    182     if v3:
--> 183         adata = _read_v3_10x_h5(filename, start=start)
    184         if genome:
    185             if genome not in adata.var['genome'].values:

~/project/conda/notebook_env3_clone1/lib/python3.8/site-packages/scanpy/readwrite.py in _read_v3_10x_h5(filename, start)
    266         try:
    267             dsets = {}
--> 268             _collect_datasets(dsets, f["matrix"])
    269 
    270             from scipy.sparse import csr_matrix

~/project/conda/notebook_env3_clone1/lib/python3.8/site-packages/scanpy/readwrite.py in _collect_datasets(dsets, group)
    254     for k, v in group.items():
    255         if isinstance(v, h5py.Dataset):
--> 256             dsets[k] = v[:]
    257         else:
    258             _collect_datasets(dsets, v)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

~/project/conda/notebook_env3_clone1/lib/python3.8/site-packages/h5py/_hl/dataset.py in __getitem__(self, args, new_dtype)
    798         if self.shape == ():
    799             fspace = self.id.get_space()
--> 800             selection = sel2.select_read(fspace, args)
    801             if selection.mshape is None:
    802                 arr = numpy.zeros((), dtype=new_dtype)

~/project/conda/notebook_env3_clone1/lib/python3.8/site-packages/h5py/_hl/selections2.py in select_read(fspace, args)
     99     """
    100     if fspace.shape == ():
--> 101         return ScalarReadSelection(fspace, args)
    102 
    103     raise NotImplementedError()

~/project/conda/notebook_env3_clone1/lib/python3.8/site-packages/h5py/_hl/selections2.py in __init__(self, fspace, args)
     84             self.mshape = ()
     85         else:
---> 86             raise ValueError("Illegal slicing argument for scalar dataspace")
     87 
     88         self.mspace = h5s.create(h5s.SCALAR)

ValueError: Illegal slicing argument for scalar dataspace`

@sjfleming
Copy link
Member

sjfleming commented Jul 5, 2022

I think it's scalars and strings in h5 fields, since scanpy uses h5py to read h5 files using [:] notation (see here h5py/h5py#1779)

I guess I could make sure everything that gets written is wrapped up into a list... sort of strange to have to do that, but it seems faster than hoping scanpy will change their dataloader.

For now, for existing files, here is a workaround:

  1. Install pytables using conda install pytables
  2. Use the utility ptrepack as follows to create a new h5 file that can be loaded by scanpy
ptrepack CELLBENDER_OUTPUT.h5:/matrix MODIFIED_CELLBENDER_OUTPUT.h5:/matrix

This takes just the "matrix" part of the cellbender output file and puts it in its own new h5 file, without any of the cellbender metadata. That new file should be readable directly by scanpy.

@sjfleming
Copy link
Member

Another option is to use this function anndata_from_h5() as in

adata = anndata_from_h5(file='CELLBENDER_OUTPUT.h5')

This loads an AnnData object that can be used in scanpy as normal. It's just a drop-in replacement for scanpy's data loader.

Function below:

import tables
import numpy as np
import scipy.sparse as sp
import anndata
from typing import Dict, Optional


def anndata_from_h5(file: str,
                    analyzed_barcodes_only: bool = True) -> 'anndata.AnnData':
    """Load an output h5 file into an AnnData object for downstream work.

    Args:
        file: The h5 file
        analyzed_barcodes_only: False to load all barcodes, so that the size of
            the AnnData object will match the size of the input raw count matrix.
            True to load a limited set of barcodes: only those analyzed by the
            algorithm. This allows relevant latent variables to be loaded
            properly into adata.obs and adata.obsm, rather than adata.uns.

    Returns:
        adata: The anndata object, populated with inferred latent variables
            and metadata.

    """

    d = dict_from_h5(file)
    X = sp.csc_matrix((d.pop('data'), d.pop('indices'), d.pop('indptr')),
                      shape=d.pop('shape')).transpose().tocsr()

    # check and see if we have barcode index annotations, and if the file is filtered
    barcode_key = [k for k in d.keys() if (('barcode' in k) and ('ind' in k))]
    if len(barcode_key) > 0:
        max_barcode_ind = d[barcode_key[0]].max()
        filtered_file = (max_barcode_ind >= X.shape[0])
    else:
        filtered_file = True

    if analyzed_barcodes_only:
        if filtered_file:
            # filtered file being read, so we don't need to subset
            print('Assuming we are loading a "filtered" file that contains only cells.')
            pass
        elif 'barcode_indices_for_latents' in d.keys():
            X = X[d['barcode_indices_for_latents'], :]
            d['barcodes'] = d['barcodes'][d['barcode_indices_for_latents']]
        elif 'barcodes_analyzed_inds' in d.keys():
            X = X[d['barcodes_analyzed_inds'], :]
            d['barcodes'] = d['barcodes'][d['barcodes_analyzed_inds']]
        else:
            print('Warning: analyzed_barcodes_only=True, but the key '
                  '"barcodes_analyzed_inds" or "barcode_indices_for_latents" '
                  'is missing from the h5 file. '
                  'Will output all barcodes, and proceed as if '
                  'analyzed_barcodes_only=False')

    # Construct the anndata object.
    adata = anndata.AnnData(X=X,
                            obs={'barcode': d.pop('barcodes').astype(str)},
                            var={'gene_name': (d.pop('gene_names') if 'gene_names' in d.keys()
                                               else d.pop('name')).astype(str)},
                            dtype=X.dtype)
    adata.obs.set_index('barcode', inplace=True)
    adata.var.set_index('gene_name', inplace=True)

    # For CellRanger v2 legacy format, "gene_ids" was called "genes"... rename this
    if 'genes' in d.keys():
        d['id'] = d.pop('genes')

    # For purely aesthetic purposes, rename "id" to "gene_id"
    if 'id' in d.keys():
        d['gene_id'] = d.pop('id')

    # If genomes are empty, try to guess them based on gene_id
    if 'genome' in d.keys():
        if np.array([s.decode() == '' for s in d['genome']]).all():
            if '_' in d['gene_id'][0].decode():
                print('Genome field blank, so attempting to guess genomes based on gene_id prefixes')
                d['genome'] = np.array([s.decode().split('_')[0] for s in d['gene_id']], dtype=str)

    # Add other information to the anndata object in the appropriate slot.
    _fill_adata_slots_automatically(adata, d)

    # Add a special additional field to .var if it exists.
    if 'features_analyzed_inds' in adata.uns.keys():
        adata.var['cellbender_analyzed'] = [True if (i in adata.uns['features_analyzed_inds'])
                                            else False for i in range(adata.shape[1])]

    if analyzed_barcodes_only:
        for col in adata.obs.columns[adata.obs.columns.str.startswith('barcodes_analyzed')
                                     | adata.obs.columns.str.startswith('barcode_indices')]:
            try:
                del adata.obs[col]
            except Exception:
                pass
    else:
        # Add a special additional field to .obs if all barcodes are included.
        if 'barcodes_analyzed_inds' in adata.uns.keys():
            adata.obs['cellbender_analyzed'] = [True if (i in adata.uns['barcodes_analyzed_inds'])
                                                else False for i in range(adata.shape[0])]

    return adata


def dict_from_h5(file: str) -> Dict[str, np.ndarray]:
    """Read in everything from an h5 file and put into a dictionary."""
    d = {}
    with tables.open_file(file) as f:
        # read in everything
        for array in f.walk_nodes("/", "Array"):
            d[array.name] = array.read()
    return d


def _fill_adata_slots_automatically(adata, d):
    """Add other information to the adata object in the appropriate slot."""

    for key, value in d.items():
        try:
            if value is None:
                continue
            value = np.asarray(value)
            if len(value.shape) == 0:
                adata.uns[key] = value
            elif value.shape[0] == adata.shape[0]:
                if (len(value.shape) < 2) or (value.shape[1] < 2):
                    adata.obs[key] = value
                else:
                    adata.obsm[key] = value
            elif value.shape[0] == adata.shape[1]:
                if value.dtype.name.startswith('bytes'):
                    adata.var[key] = value.astype(str)
                else:
                    adata.var[key] = value
            else:
                adata.uns[key] = value
        except Exception:
            print('Unable to load data into AnnData: ', key, value, type(value))

@esrice
Copy link

esrice commented Jul 14, 2022

FYI, I tried the ptrepack solution but am sadly still getting the same error (Illegal slicing argument for scalar dataspace) when trying to read the ptrepack output file with scanpy's read_10x_h5().

@sjfleming
Copy link
Member

@esrice Thanks for letting me know. That is confusing to me... there must be something going on that I'm not understanding. Is it possible for you to share that .h5 file here so I can take a look and see what's going on with scanpy's data loader?

@esrice
Copy link

esrice commented Jul 18, 2022

Sure, here is an h5 output by cellbender that is parsed fine by scanpy v8 but not v9. Github doesn't allow h5 uploads in comments so here's a download link:

https://oc1.rnet.missouri.edu/index.php/s/fqOxxkjaxkcizUE/download

@sjfleming
Copy link
Member

Okay thank you @esrice , now I see. Yes there are still some extra values packed into the h5 file in the cellbender output for v0.2. I will soon update to v0.3, where the ptrepack solution might work. But for now, that suggestion did not make sense.

The best thing to do is to use the above data loader code for the time being: the anndata_from_h5() function.

In v0.3.0, the data loading functions will be distributed as part of cellbender, and also the format of the h5 output files will be tweaked to be compatible with the scanpy v1.9+ data loader.

@HugoCornu
Copy link

thanks for that anndata_from_h5() script !

gokceneraslan added a commit to scverse/scanpy that referenced this issue Oct 3, 2022
After @ivirshup's pytables PR (#2064) we started having issues with loading h5 files with scalar datasets, such as those created by CellBender (broadinstitute/CellBender#128). It is currently not an issue for the 10X h5 files for now since they don't have any scalars, however it'd be good to just handle scalars as well as arrays 1- to fix the cellbender file loading problem 2- to fix potential problems we might end up having if 10X h5 format includes scalar datasets.
@gokceneraslan
Copy link

@sjfleming you might like scverse/scanpy#2344 :)

@sjfleming
Copy link
Member

Woohoo, thanks @gokceneraslan !

gokceneraslan added a commit to scverse/scanpy that referenced this issue Oct 10, 2022
* Handle scalar datasets too

After @ivirshup's pytables PR (#2064) we started having issues with loading h5 files with scalar datasets, such as those created by CellBender (broadinstitute/CellBender#128). It is currently not an issue for the 10X h5 files for now since they don't have any scalars, however it'd be good to just handle scalars as well as arrays 1- to fix the cellbender file loading problem 2- to fix potential problems we might end up having if 10X h5 format includes scalar datasets.

* Add a scalar to the multiple_genomes.h5 test file

* Fixes #2203
@carmensandoval
Copy link

Thanks @gokceneraslan ! Is this change available in a public version of Scanpy? I'm on 1.9.1 and still seeing this error.

Name: scanpy
Version: 1.9.1

org_1 = sc.read_10x_h5('../cellbender/SAM24425933_cellbender_out_filtered.h5')

Output exceeds the [size limit](command:workbench.action.openSettings?[). Open the full output data [in a text editor](command:workbench.action.openLargeOutput?a8abfd26-9931-48ee-951f-ff7127d3d029)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [6], line 1
----> 1 org_1 = sc.read_10x_h5('../cellbender/SAM24425933_cellbender_out_filtered.h5')

File ~/mambaforge/envs/pegasus/lib/python3.9/site-packages/scanpy/readwrite.py:183, in read_10x_h5(filename, genome, gex_only, backup_url)
    181     v3 = '/matrix' in f
    182 if v3:
--> 183     adata = _read_v3_10x_h5(filename, start=start)
    184     if genome:
    185         if genome not in adata.var['genome'].values:

File ~/mambaforge/envs/pegasus/lib/python3.9/site-packages/scanpy/readwrite.py:268, in _read_v3_10x_h5(filename, start)
    266 try:
    267     dsets = {}
--> 268     _collect_datasets(dsets, f["matrix"])
    270     from scipy.sparse import csr_matrix
    272     M, N = dsets['shape']

File ~/mambaforge/envs/pegasus/lib/python3.9/site-packages/scanpy/readwrite.py:256, in _collect_datasets(dsets, group)
    254 for k, v in group.items():
    255     if isinstance(v, h5py.Dataset):
--> 256         dsets[k] = v[:]
    257     else:
    258         _collect_datasets(dsets, v)
...
---> 86     raise ValueError("Illegal slicing argument for scalar dataspace")
     88 self.mspace = h5s.create(h5s.SCALAR)
     89 self.fspace = fspace

ValueError: Illegal slicing argument for scalar dataspace

@sjfleming
Copy link
Member

Closed by #238

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants