[Feature request] dedicated tutorial on updating database values #3638

nick-youngblut · 2025-01-29T01:57:28Z

Is your feature request related to a problem? Please describe.

It would be helpful to have a dedicated tutorial on updating values in a database (e.g., convert all NaN values in an obs metadata column to 0). I believe this would involve:

1. Querying
   - For NaN values, I've tried tiledbsoma.AxisQuery(value_filter='is_null(my_column)'), but that doesn't work.
2. Converting the obs metadata to a pandas dataframe
3. Modifying the values of interest (e.g., NaN => 0) in the dataframe
4. Updating the records of interest in the TileDB database
   - Really not sure how to do this
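
A minimal sketch of the query-and-modify steps done client-side in pandas, for the case where a null value filter is not available. The frame below is a small in-memory stand-in for obs that has already been read into pandas; the column values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for obs metadata already loaded into pandas (illustrative values only).
obs = pd.DataFrame(
    {
        "obs_id": ["cell_1", "cell_2", "cell_3", "cell_4"],
        "umi_count": np.array([10.0, np.nan, 7.0, np.nan], dtype="float32"),
    }
)

# "Query" the rows of interest with a boolean mask instead of a value filter.
missing = obs["umi_count"].isna()
n_missing = int(missing.sum())

# Modify only the selected rows; using a float32 zero keeps the column's dtype.
obs.loc[missing, "umi_count"] = np.float32(0)
```

Keeping the column's dtype unchanged is a deliberate choice here, on the assumption that the frame will later be written back to a column with the same type.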

Describe alternatives you've considered

The existing tutorials, API docs, and getting help from LLMs

spencerseale · 2025-02-11T17:37:49Z

Hi @nick-youngblut,

Thanks for the feature request!

You can update obs by replacing the entire obs dataframe using tiledbsoma.io.update_obs (docs). That should help you with updating obs attributes. Please follow up here once you've tried this!
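
A hedged sketch of what that round trip might look like: read obs into pandas, apply a change, and hand the whole frame back to tiledbsoma.io.update_obs. The names rewrite_obs, check_row_count, and transform are hypothetical helpers introduced for illustration. The row-count check reflects my understanding that update_obs expects the same number of rows; please confirm against the docs for your tiledbsoma version.

```python
from typing import Callable

import pandas as pd


def check_row_count(old_obs: pd.DataFrame, new_obs: pd.DataFrame) -> None:
    """Fail early if the edit added or removed rows (hypothetical helper)."""
    if len(old_obs) != len(new_obs):
        raise ValueError(
            f"row count changed: {len(old_obs)} -> {len(new_obs)}"
        )


def rewrite_obs(
    experiment_uri: str,
    transform: Callable[[pd.DataFrame], pd.DataFrame],
) -> None:
    """Read obs, apply `transform` in pandas, and write the full frame back."""
    # Imported lazily so the helper above can be used without tiledbsoma installed.
    import tiledbsoma
    import tiledbsoma.io

    # Read the current obs dataframe into pandas.
    with tiledbsoma.Experiment.open(experiment_uri) as exp:
        obs_df = exp.obs.read().concat().to_pandas()

    new_obs = transform(obs_df)
    check_row_count(obs_df, new_obs)

    # Reopen in write mode and replace obs with the modified frame.
    with tiledbsoma.Experiment.open(experiment_uri, "w") as exp:
        tiledbsoma.io.update_obs(exp, new_obs)
```

A call might look like rewrite_obs(uri, lambda df: df.assign(umi_count=df["umi_count"].fillna(0))), where uri is a placeholder for your experiment's location.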

We're currently working on releasing a new TileDB Academy doc focused exclusively on updating and adding new arrays to existing tiledbsoma.Experiment objects. This should answer your questions on:

Updating the records of interest in the TileDB database

Your note on querying for null-type values is great feedback. I'll discuss with the team about adding this to the docs as well.

Spencer

nick-youngblut · 2025-02-12T16:23:44Z

I missed tiledbsoma.io.update_obs. Thanks so much for pointing that out! I was already trying to plan out how to do a large database migration (efficiently) just to add/remove columns. I'll give it a shot ASAP. Thanks again!

In regards to null-type, I found that some of the values are encoded as literal "NaN" strings, while others are actually NaN values. I'm not sure why this happened, since I've appended data in the same manner for all datasets added to the database.

johnkerl · 2025-02-12T16:54:12Z

In regards to null-type, I found that some of the values are encoded as literal "NaN" strings, while others are actually NaN values. I'm not sure why this happened, since I've appended data in the same manner for all datasets added to the database.

@nick-youngblut this is timely.

We have a couple things in flight at present:

If it's not too much trouble -- can you please share what your adata.obs.dtypes are (Pandas) for some of the inputs, as well as your exp.obs.schema for the resulting TileDB-SOMA experiments?

nick-youngblut · 2025-02-13T16:30:51Z

My adata.obs.dtypes are:

gene_count               int64
umi_count              float32
barcode                 object
SRX_accession           object
lib_prep                object
tech_10x                object
organism                object
tissue                  object
disease                 object
purturbation            object
cell_line               object
czi_collection_id       object
czi_collection_name     object

My exp.obs.schema is:

soma_joinid: int64 not null
obs_id: large_string
gene_count: int64
umi_count: float
barcode: large_string
SRX_accession: dictionary<values=string, indices=int32, ordered=0>
lib_prep: dictionary<values=string, indices=int32, ordered=0>
tech_10x: dictionary<values=string, indices=int32, ordered=0>
organism: dictionary<values=string, indices=int32, ordered=0>
tissue: dictionary<values=string, indices=int32, ordered=0>
disease: dictionary<values=string, indices=int32, ordered=0>
purturbation: dictionary<values=string, indices=int32, ordered=0>
cell_line: dictionary<values=string, indices=int32, ordered=0>
czi_collection_id: dictionary<values=string, indices=int32, ordered=0>
czi_collection_name: dictionary<values=string, indices=int32, ordered=0>

Converting some of the database obs to a pandas dataframe results in the following dtypes:

soma_joinid               int64
obs_id                   object
gene_count                int64
umi_count               float32
barcode                  object
SRX_accession          category
lib_prep               category
tech_10x               category
organism               category
tissue                 category
disease                category
purturbation           category
cell_line              category
czi_collection_id      category
czi_collection_name    category

I hope that info is helpful!
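
The listings above show object columns coming back as category where the schema stores them as dictionary types. A small pandas-only sketch for spotting such changes programmatically follows; dtype_changes is a hypothetical helper, and the two frames are made-up stand-ins for adata.obs and the obs read back from the experiment.

```python
import pandas as pd


def dtype_changes(before: pd.DataFrame, after: pd.DataFrame) -> dict:
    """Map each shared column to (old dtype, new dtype) where the two differ."""
    shared = [col for col in before.columns if col in after.columns]
    return {
        col: (str(before[col].dtype), str(after[col].dtype))
        for col in shared
        if before[col].dtype != after[col].dtype
    }


# Illustrative stand-ins: one string column stored as object, one integer column.
before = pd.DataFrame(
    {
        "tissue": pd.Series(["liver", "lung"], dtype=object),
        "gene_count": [100, 200],
    }
)
# Simulate the round trip turning the string column into a categorical.
after = before.assign(tissue=before["tissue"].astype("category"))

changes = dtype_changes(before, after)
```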

johnkerl · 2025-02-13T16:34:54Z

Thanks @nick-youngblut! And in which column(s) are you seeing the NaN/"NaN" misbehavior?

nick-youngblut · 2025-02-13T16:37:17Z

A bit of background might be helpful. My workflow for building the tiledb-soma database is:

1. I have a large number of datasets in mtx file format (STARsolo output)
2. I am converting batches of mtx files into a combined h5ad and adding obs metadata pulled from a relational database
   - This is a separate HPC job from loading the tiledb-soma database; I've found that h5ad production is MUCH faster than appending to the tiledb-soma database
3. For each resulting h5ad file: append to the tiledb-soma database
   - This job is much slower than creating the batched h5ad files

Currently, I'm batching tens of mtx files into 1 h5ad file (less memory than hundreds). Do you think tiledb-soma database appends are faster with:

- fewer, but larger h5ad files
- more, smaller h5ad files
- doesn't seem to matter

Appending datasets to the tiledb-soma database takes a lot of time, so any suggestions on optimizing this process would be greatly appreciated!
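
For trying different batch sizes, the grouping step can be sketched with the standard library alone. The helper name batched and the file names are hypothetical; each yielded batch would become one combined h5ad and then one append.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical dataset paths; 25 datasets grouped 10 at a time.
mtx_dirs = [f"dataset_{i:03d}" for i in range(25)]
batch_sizes = [len(batch) for batch in batched(mtx_dirs, 10)]
```

On Python 3.12 and later, itertools.batched provides the same grouping (as tuples) without a custom helper.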

nick-youngblut · 2025-02-13T16:38:13Z

which column(s) are you seeing the NaN/"NaN" misbehavior?

The NaN values are seen in czi_collection_id and czi_collection_name since some datasets do not include this information.
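
One way to reconcile the mix of literal "NaN" strings and real missing values before writing obs back is to normalize the column in pandas. This is a sketch only: normalize_missing and the token list are assumptions, and the sample values are invented.

```python
import numpy as np
import pandas as pd

# Strings treated as "missing"; adjust to whatever actually appears in the data.
MISSING_TOKENS = ("NaN", "nan")


def normalize_missing(series: pd.Series, tokens=MISSING_TOKENS) -> pd.Series:
    """Turn literal missing-value strings into real NaN, returned as category."""
    as_object = series.astype(object)
    is_token = as_object.isin(list(tokens))
    # Keep values that are not tokens; replace tokens with a real missing value.
    cleaned = as_object.where(~is_token, np.nan)
    return cleaned.astype("category")


# Illustrative column with both kinds of "missing".
raw = pd.Series(["coll_a", "NaN", np.nan, "coll_b"], dtype=object)
n_string_nan = int(raw.isin(list(MISSING_TOKENS)).sum())
n_real_nan = int(raw.isna().sum())
clean = normalize_missing(raw)
```

Counting the two kinds separately first (n_string_nan and n_real_nan) may also help trace which input batches introduced the string form.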

johnkerl · 2025-02-13T16:53:34Z

Do you think tiledb-soma database appends are faster with:

Thanks @nick-youngblut for the question! In fact I do not have numbers on this performance question.

My gut feeling is it doesn't matter too much as the bulk of the time is (in my experience) in X writes, and those are 'chunked' already within a single .h5ad ingest. I think the 10 MTX -> 1 H5AD sounds like a good ratio, a priori ...

nick-youngblut · 2025-02-13T17:13:30Z

From limited testing, I've found that separating the creation of small batched h5ad files (e.g., 10-20 datasets) and then appending them to a database increases the throughput by ~10x. However, I don't know if that really scales with batching larger numbers of datasets. I'll check it empirically. Thanks!
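
A rough timing harness for that empirical check might look like the following. The ingest argument is a hypothetical callable standing in for "build one h5ad from this batch and append it"; nothing here calls tiledbsoma directly.

```python
import time
from typing import Callable, List, Sequence, Tuple


def time_batches(
    batches: Sequence[Sequence[str]],
    ingest: Callable[[Sequence[str]], None],
) -> List[Tuple[int, float]]:
    """Run `ingest` on each batch and record (batch size, elapsed seconds)."""
    timings = []
    for batch in batches:
        start = time.perf_counter()
        ingest(batch)
        timings.append((len(batch), time.perf_counter() - start))
    return timings


def throughput(timings: Sequence[Tuple[int, float]]) -> float:
    """Datasets ingested per second across all recorded batches."""
    total_items = sum(count for count, _ in timings)
    total_seconds = sum(seconds for _, seconds in timings)
    return total_items / total_seconds if total_seconds > 0 else float("inf")
```

Running this once per candidate batch size and comparing the throughput values would show whether the gain holds as batches grow.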

spencerseale · 2025-02-13T19:31:41Z

@nick-youngblut our commercial product takes care of scaling, logging, and cataloging TileDB-SOMAs along with your other data assets. We have solved the problems you're likely encountering; if you'd like to discuss that further, you can reach out to me or our team.

johnkerl assigned spencerseale Feb 11, 2025

johnkerl added the "documentation" label (Improvements or additions to documentation) Feb 11, 2025
