Error in is.factor(values) && levels(values) == c("FALSE", "TRUE") when using readH5AD #8

mt1022 · 2024-01-25T06:55:41Z

Dear developers,

I am trying to read an h5ad file into R with MuData and encountered the following error:

Error in is.factor(values) && levels(values) == c("FALSE", "TRUE") : 
  'length = 2' in coercion to 'logical(1)'

This is my command:

library(Seurat)
library(MuData)
library(tidyverse)

fca_head <- readH5AD('s_fca_biohub_head_10x.h5ad')

I am not sure what's going wrong. This dataset can be loaded by scanpy.read_h5ad in python. I also tried to down-sample a subset of cells of this dataset in python and save a new h5ad. The down-sampled file cannot be read in R either.

The file s_fca_biohub_head_10x.h5ad is available from Fly cell atlas: https://cloud.flycellatlas.org/index.php/s/LAEybPc2HZnpzKs

The text was updated successfully, but these errors were encountered:

ilia-kats · 2024-01-25T07:52:14Z

I think that should work with the current master branch. Can you install the current version from Github with remotes::install_github("ilia-kats/MuData") and try again?

@gtca Any chance we can update the package in Bioc?

mt1022 · 2024-01-25T08:52:05Z

You are right. I installed the latest developmental version and the down-sampled (and simplified) data can be loaded without any errors or warnings. It seems that the full dataset can be loaded too. However, although currently there is no error messages, this process is extremely slow. it has been running for about 20 min and has not finished at the time of writing :(. The full dataset has about 100k cells x 2k genes and can be loaded by scanpy in less than 1 min.

ilia-kats · 2024-01-27T13:03:19Z

It looks like it's spending all that time inside rhdf5's _H5Dread C function. I don't think there is anything we can do about it and I also don't know why it's taking that much longer in R compared to Python, since both are using the same HDF5 C library underneath. Perhaps @grimbough has an idea?

grimbough · 2024-01-29T14:17:29Z

It looks like it's spending most of the time reading some fairly large compound datasets e.g. /uns/leiden_res14.0__rank_genes_groups/names I'm not too familiar with the file structure, so I'm not sure if they need to be compound datatypes, but the look like they could just be arrays of string which I suspect would be quicker to read.

I'm actually surprised that the python implementation is so fast. Using the h5dump command line tool to read the largest of those datasets takes over 3 minutes:

-> % { time h5dump -d /uns/leiden_res50.0__rank_genes_groups/names s_fca_biohub_head_10x.h5ad 1> /dev/null ; }
h5dump -d /uns/leiden_res50.0__rank_genes_groups/names  > /dev/null  194,93s user 0,82s system 99% cpu 3:16,47 total

Do the python and R implementations actually do the same things?

mt1022 · 2024-01-29T15:27:03Z

The file structure can be examined in python:

import scanpy as sc
fca_head = sc.read_h5ad('s_fca_biohub_head_10x.h5ad')
fca_head.uns['leiden_res14.0__rank_genes_groups']['names']

/uns/leiden_res14.0__rank_genes_groups/names is an array of tuples of strings. I have no idea why they use such complex data structure :(. It was probably generated by other software they used in the analysis.

ilia-kats · 2024-01-29T15:30:59Z

Do the python and R implementations actually do the same things?

They should for most things, but I have to admit we haven't paid a lot of attention to compound datatypes (presumably the equivalent to numpy.recarray) because we have never seen them used extensively in the wild and people tend to use DataFrames for this functionality nowadays.

grimbough · 2024-01-29T15:38:40Z

Do the python and R implementations actually do the same things?

They should for most things, but I have to admit we haven't paid a lot of attention to compound datatypes (presumably the equivalent to numpy.recarray) because we have never seen them used extensively in the wild and people tend to use DataFrames for this functionality nowadays.

I guess my question was really "does scanpy actually read whatever /uns/leiden_res14.0__rank_genes_groups/names is?". My first guess would have been that this was an inefficiency in how rhdf5 was dealing with compound data types, but based on how long it takes h5dump to read that same structure, it looks like it's just slow regardless. Given that, I'd be very surprised python was actually reading it, but I'm very naive about the scanpy functionality.

ilia-kats · 2024-01-29T15:43:56Z

/uns is read unconditionally by both Python and R. /X is only read if backed=False, also in both Python and R. /layers is read unconditionally in Python, but is only read if backed=FALSE in R.

grimbough · 2024-01-29T15:51:57Z

Interesting, I wonder how come it's so much more efficient in python. As you said earlier, they all ultimately use H5Dread.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error in is.factor(values) && levels(values) == c("FALSE", "TRUE") when using readH5AD #8

Error in is.factor(values) && levels(values) == c("FALSE", "TRUE") when using readH5AD #8

mt1022 commented Jan 25, 2024

ilia-kats commented Jan 25, 2024

mt1022 commented Jan 25, 2024 •

edited

Loading

ilia-kats commented Jan 27, 2024

grimbough commented Jan 29, 2024

mt1022 commented Jan 29, 2024

ilia-kats commented Jan 29, 2024

grimbough commented Jan 29, 2024

ilia-kats commented Jan 29, 2024

grimbough commented Jan 29, 2024

Error in is.factor(values) && levels(values) == c("FALSE", "TRUE") when using readH5AD #8

Error in is.factor(values) && levels(values) == c("FALSE", "TRUE") when using readH5AD #8

Comments

mt1022 commented Jan 25, 2024

ilia-kats commented Jan 25, 2024

mt1022 commented Jan 25, 2024 • edited Loading

ilia-kats commented Jan 27, 2024

grimbough commented Jan 29, 2024

mt1022 commented Jan 29, 2024

ilia-kats commented Jan 29, 2024

grimbough commented Jan 29, 2024

ilia-kats commented Jan 29, 2024

grimbough commented Jan 29, 2024

mt1022 commented Jan 25, 2024 •

edited

Loading