10x H5 loading Error #174

Closed
llumdi opened this issue Dec 15, 2022 · 9 comments

@llumdi

llumdi commented Dec 15, 2022

I cannot load the cellbender output (v0.2.2) with sc.read_10x_h5 (scanpy==1.9.1). Cellbender was completed successfully.

adata_cellbender = sc.read_10x_h5('./cellbender.h5')

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [13], line 1
----> 1 temp = sc.read_10x_h5('/projects/site/pred/cit_tdag/projects/other/CRO_scouting_ONCdiscovery/raw_data/cellbender_output_v0.2.0/Nuvisan_Tumor_Brain/cellbender.h5')

File ~/scratch/conda/envs/besca25_LAS/lib/python3.8/site-packages/scanpy/readwrite.py:183, in read_10x_h5(filename, genome, gex_only, backup_url)
    181     v3 = '/matrix' in f
    182 if v3:
--> 183     adata = _read_v3_10x_h5(filename, start=start)
    184     if genome:
    185         if genome not in adata.var['genome'].values:

File ~/scratch/conda/envs/besca25_LAS/lib/python3.8/site-packages/scanpy/readwrite.py:268, in _read_v3_10x_h5(filename, start)
    266 try:
    267     dsets = {}
--> 268     _collect_datasets(dsets, f["matrix"])
    270     from scipy.sparse import csr_matrix
    272     M, N = dsets['shape']

File ~/scratch/conda/envs/besca25_LAS/lib/python3.8/site-packages/scanpy/readwrite.py:256, in _collect_datasets(dsets, group)
    254 for k, v in group.items():
    255     if isinstance(v, h5py.Dataset):
--> 256         dsets[k] = v[:]
    257     else:
    258         _collect_datasets(dsets, v)

File h5py/_objects.pyx:54, in h5py._objects.with_phil.wrapper()

File h5py/_objects.pyx:55, in h5py._objects.with_phil.wrapper()

File ~/scratch/conda/envs/besca25_LAS/lib/python3.8/site-packages/h5py/_hl/dataset.py:800, in Dataset.__getitem__(self, args, new_dtype)
    798 if self.shape == ():
    799     fspace = self.id.get_space()
--> 800     selection = sel2.select_read(fspace, args)
    801     if selection.mshape is None:
    802         arr = numpy.zeros((), dtype=new_dtype)

File ~/scratch/conda/envs/besca25_LAS/lib/python3.8/site-packages/h5py/_hl/selections2.py:101, in select_read(fspace, args)
     96 """ Top-level dispatch function for reading.
     97 
     98 At the moment, only supports reading from scalar datasets.
     99 """
    100 if fspace.shape == ():
--> 101     return ScalarReadSelection(fspace, args)
    103 raise NotImplementedError()

File ~/scratch/conda/envs/besca25_LAS/lib/python3.8/site-packages/h5py/_hl/selections2.py:86, in ScalarReadSelection.__init__(self, fspace, args)
     84     self.mshape = ()
     85 else:
---> 86     raise ValueError("Illegal slicing argument for scalar dataspace")
     88 self.mspace = h5s.create(h5s.SCALAR)
     89 self.fspace = fspace

ValueError: Illegal slicing argument for scalar dataspace

I would really appreciate a fix. Thanks a lot!

@sjfleming
Member

Hi @llumdi ! This is related to #128 and should have been fixed by this recent change in scanpy: scverse/scanpy#2344

I don't know exactly which scanpy versions contain this fix. But I have heard some other people say that the scanpy loader is still giving them this error.

If you want a workaround (and the newest scanpy version still gives this error), then I would recommend loading the cellbender output into anndata format using this custom code:

#128 (comment)
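For context on why that workaround helps: `v[:]` is an illegal slicing argument for 0-d (scalar) HDF5 datasets, while `v[()]` reads both scalars and arrays. Below is a minimal, scalar-safe variant of the recursive collector from the traceback above (a sketch for illustration, not scanpy's actual fixed code):

```python
import h5py


def collect_datasets(dsets: dict, group: h5py.Group):
    """Recursively gather all datasets under an HDF5 group.

    Scalar (0-d) datasets must be read with v[()]; slicing them with
    v[:] raises "ValueError: Illegal slicing argument for scalar
    dataspace", which is the error shown in this issue.
    """
    for k, v in group.items():
        if isinstance(v, h5py.Dataset):
            dsets[k] = v[()] if v.shape == () else v[:]
        else:
            collect_datasets(dsets, v)
```

With this change, files that mix scalar metadata (as CellBender output does) and array datasets load without error.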

@sjfleming
Member

Let me know if you run into a problem trying that out! Happy to discuss more

@llumdi
Author

llumdi commented Jan 17, 2023

Hi @sjfleming,
Thank you so much for the workaround. It worked!

@carmensandoval

I am able to load cellbender output using the function anndata_from_h5 referenced above (thank you!).
However, I can't write these objects to file. Is this expected? Is there a workaround to be able to write?

from anndata_from_h5 import *
org_1 = anndata_from_h5("../cellbender/SAM24425933_cellbender_out_filtered.h5")

Assuming we are loading a "filtered" file that contains only cells.

org_1

AnnData object with n_obs × n_vars = 4341 × 58302
    obs: 'latent_RT_efficiency', 'latent_cell_probability', 'latent_scale'
    var: 'ambient_expression', 'feature_type', 'genome', 'gene_id'
    uns: 'contamination_fraction_params', 'fraction_data_used_for_testing', 'lambda_multiplier', 'overdispersion_mean_and_scale', 'target_false_positive_rate', 'test_elbo', 'test_epoch', 'training_elbo_per_epoch'
    obsm: 'latent_gene_encoding'
sc.write('org_1.h5ad', org_1)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/mambaforge/envs/pegasus/lib/python3.9/site-packages/anndata/_io/utils.py:214, in report_write_key_on_error.<locals>.func_wrapper(elem, key, val, *args, **kwargs)
    213 try:
--> 214     return func(elem, key, val, *args, **kwargs)
    215 except Exception as e:

File ~/mambaforge/envs/pegasus/lib/python3.9/site-packages/anndata/_io/specs/registry.py:175, in write_elem(f, k, elem, modifiers, *args, **kwargs)
    174 else:
--> 175     _REGISTRY.get_writer(dest_type, t, modifiers)(f, k, elem, *args, **kwargs)

File ~/mambaforge/envs/pegasus/lib/python3.9/site-packages/anndata/_io/specs/registry.py:24, in write_spec.<locals>.decorator.<locals>.wrapper(g, k, *args, **kwargs)
     22 @wraps(func)
     23 def wrapper(g, k, *args, **kwargs):
---> 24     result = func(g, k, *args, **kwargs)
     25     g[k].attrs.setdefault("encoding-type", spec.encoding_type)

File ~/mambaforge/envs/pegasus/lib/python3.9/site-packages/anndata/_io/specs/methods.py:307, in write_basic(f, k, elem, dataset_kwargs)
    306 """Write methods which underlying library handles nativley."""
--> 307 f.create_dataset(k, data=elem, **dataset_kwargs)

File ~/mambaforge/envs/pegasus/lib/python3.9/site-packages/h5py/_hl/group.py:161, in Group.create_dataset(self, name, shape, dtype, data, **kwds)
    159         group = self.require_group(parent_path)
--> 161 dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
    162 dset = dataset.Dataset(dsid)
...
    224     ) from e

TypeError: Scalar datasets don't support chunk/filter options

Above error raised while writing key 'fraction_data_used_for_testing' of <class 'h5py._hl.group.Group'> to /

@sjfleming
Member

@carmensandoval , I definitely want writing to work! I am betting that the issue lies in

uns: 'fraction_data_used_for_testing', 'lambda_multiplier', 'target_false_positive_rate'

since I think these values are scalars, and scalars are exactly what we are having trouble reading and writing.

So one option would be to just delete the fields and forget about them:

for key in ['fraction_data_used_for_testing', 'lambda_multiplier', 'target_false_positive_rate']:
    del org_1.uns[key]

and then try

org_1.write('org_1.h5ad')

Or, if you want to keep those fields, you could try

import numpy as np

for key in ['fraction_data_used_for_testing', 'lambda_multiplier', 'target_false_positive_rate']:
    org_1.uns[key] = np.array(org_1.uns[key])  # wrap it in an array

and then try

org_1.write('org_1.h5ad')
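The underlying failure mode is easy to reproduce without anndata at all: HDF5 filters require chunked storage, and scalar (0-d) datasets cannot be chunked, so h5py raises the TypeError seen above when compression options are passed. A standalone reproduction (file path created here is temporary and illustrative):

```python
import tempfile

import h5py
import numpy as np

path = tempfile.mktemp(suffix=".h5")
with h5py.File(path, "w") as f:
    try:
        # Scalar dataset + filter options -> TypeError:
        # "Scalar datasets don't support chunk/filter options"
        f.create_dataset("scalar", data=0.25, compression="gzip")
    except TypeError as e:
        print(e)
    # A length-1 1-d array avoids the scalar code path entirely:
    f.create_dataset("arr", data=np.array([0.25]), compression="gzip")
```

Note that `np.array(value)` on a Python scalar still yields a 0-d array; if the wrapping above does not help, storing a length-1 array (`np.array([value])`) is another option, at the cost of changing the stored shape.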

@vitkl

vitkl commented Feb 24, 2023

I have the same issue with scanpy 1.9.1, and scanpy 1.9.2 also fails, with the error below:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /envs/lightning19_scvi10_cuda113_torch113/lib/python3.9/site-packages/scanpy/readwrite.py:288, in _read_v3_10x_h5(filename, start)
    277 matrix = csr_matrix(
    278     (data, dsets['indices'], dsets['indptr']),
    279     shape=(N, M),
    280 )
    281 adata = AnnData(
    282     matrix,
    283     obs=dict(obs_names=dsets['barcodes'].astype(str)),
    284     var=dict(
    285         var_names=dsets['name'].astype(str),
    286         gene_ids=dsets['id'].astype(str),
    287         feature_types=dsets['feature_type'].astype(str),
--> 288         genome=dsets['genome'].astype(str),
    289     ),
    290 )
    291 logg.info('', time=start)

KeyError: 'genome'

During handling of the above exception, another exception occurred:

Exception                                 Traceback (most recent call last)
Cell In[11], line 1
----> 1 adata_cb, ad_cb_list = read_cellbender_output(
      2     sc_data_folder, 
      3     smp_list=sample_list['folder'],
      4     metadata=sample_list, type='filtered'
      5 )
      6 adata_cb.obs['total_counts'] = np.array(adata_cb.X.sum(1)).flatten()
      7 # adata = adata[adata.obs['total_counts'] > 1000, :].copy()
      8 
      9 # export aggregated and pre-processed data

Cell In[9], line 34, in read_cellbender_output(file_path, smp_list, metadata, type, umi_filter)
     32     ad[sample_name].obs_names = ad[sample_name].obs['barcode']+"_"+ad[sample_name].obs['sample_id']
     33 except Exception as e:
---> 34     raise e
     35     print(f'{sample_name} {e}')
     36     smp_list = smp_list[smp_list != sample_name]

Cell In[9], line 14, in read_cellbender_output(file_path, smp_list, metadata, type, umi_filter)
     12     if type in i and 'h5' in i:
     13         file = i
---> 14 ad[sample_name] = sc.read_10x_h5(file_path + sample_name +'/'+file)
     15 ad[sample_name].var.rename(columns = {'gene_ids':'ENSEMBL'}, inplace = True)
     16 ad[sample_name].var['SYMBOL'] = ad[sample_name].var.index

File /envs/lightning19_scvi10_cuda113_torch113/lib/python3.9/site-packages/scanpy/readwrite.py:183, in read_10x_h5(filename, genome, gex_only, backup_url)
    181     v3 = '/matrix' in f
    182 if v3:
--> 183     adata = _read_v3_10x_h5(filename, start=start)
    184     if genome:
    185         if genome not in adata.var['genome'].values:

File /envs/lightning19_scvi10_cuda113_torch113/lib/python3.9/site-packages/scanpy/readwrite.py:294, in _read_v3_10x_h5(filename, start)
    292     return adata
    293 except KeyError:
--> 294     raise Exception('File is missing one or more required datasets.')

Exception: File is missing one or more required datasets.

@sjfleming
Member

@vitkl , that is the first time I'm seeing that one. I'm curious: if you run sc.read_10x_h5() on your raw input file, does that work? If the input file has a genome field, then the output file should also have a genome field.
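One quick way to check this, without going through scanpy, is to open the file with h5py and look for the feature-level datasets the v3 loader reads in the traceback above (`id`, `name`, `feature_type`, `genome`). A sketch (helper name is illustrative):

```python
import h5py


def missing_feature_datasets(path: str) -> list:
    """Report which feature datasets used by scanpy's v3 10x loader are
    absent from an h5 file (sketch based on the traceback above)."""
    required = ["id", "name", "feature_type", "genome"]
    with h5py.File(path, "r") as f:
        feats = f["matrix/features"]
        return [k for k in required if k not in feats]
```

For the file in the error above, this would presumably report `['genome']`.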

@iaaka

iaaka commented Feb 24, 2023

I'll intervene since I did the analysis @vitkl is talking about. I used input in MTX format (it was multiome data, so I needed to subset the GEX part, and it appears much easier to save the subset in MTX format than in h5).
The input MTX can be read with scanpy.read_10x_mtx.

@sjfleming
Member

@iaaka It seems like you have found a real problem here, so I appreciate you bringing it to my attention. This is what happened:

  1. The MTX loader did not capture any genome information (I don't think genome is ever included in the MTX format? : https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/matrices)
  2. The h5 output writer did not write any genome information, since none was loaded.
  3. Scanpy expects a genome field in its data loader. While this is kind of annoying from the standpoint of flexibility, it is also reasonable to expect genome information.

I think what I'll do to fix this moving forward is to include a genome field in the output that just says "NA" (for "not available") or something like that. Or maybe an empty string "". That way scanpy's data loader will work.
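A sketch of that fix, as a standalone patch for an existing output file (helper name is hypothetical; it writes a placeholder `genome` dataset sized to match the feature names so loaders that require the field can proceed):

```python
import h5py
import numpy as np


def add_placeholder_genome(path: str, placeholder: bytes = b"NA"):
    """Add a placeholder 'genome' dataset under matrix/features if one is
    missing. (Sketch of the fix described above; not CellBender's code.)"""
    with h5py.File(path, "a") as f:
        feats = f["matrix/features"]
        if "genome" not in feats:
            n = feats["name"].shape[0]
            feats.create_dataset("genome", data=np.array([placeholder] * n))
```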

For now, I would say that your best option is to use my data loader function that currently lives here:

def anndata_from_h5(file: str,

You could also use one of the other data loading functions, if you wanted, such as this

def load_anndata_from_input_and_output(input_file: str,

which loads the input raw file and the cellbender output file jointly, so that you can have everything in one object.
