[Feature request] dedicated tutorial on updating database values #3638

Open
nick-youngblut opened this issue Jan 29, 2025 · 10 comments
Labels
documentation Improvements or additions to documentation

@nick-youngblut

nick-youngblut commented Jan 29, 2025

Is your feature request related to a problem? Please describe.

It would be helpful to have a dedicated tutorial on updating values in a database (e.g., convert all NaN values in an obs metadata column to 0). I believe this would involve:

  • Querying
    • For NaN values, I've tried tiledbsoma.AxisQuery(value_filter='is_null(my_column)'), but that doesn't work.
  • Converting the obs metadata to a pandas dataframe
  • Modifying the values of interest (e.g., NaN => 0) in the dataframe
  • Updating the records of interest in the TileDB database
    • Really not sure how to do this
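The querying, conversion, and modification steps in this list can be sketched as follows, assuming the obs table fits in memory. Rather than relying on a `value_filter`, the missing values are located client-side in pandas. The experiment path, the column name `my_column`, and the helpers `read_obs` and `fill_missing` are placeholders for illustration, not part of TileDB-SOMA; `exp.obs.read().concat().to_pandas()` is the usual way to pull obs into a DataFrame.

```python
import numpy as np
import pandas as pd


def read_obs(experiment_uri: str) -> pd.DataFrame:
    """Hypothetical helper: load the whole obs table into pandas."""
    import tiledbsoma  # imported lazily so the pandas part runs without it

    with tiledbsoma.Experiment.open(experiment_uri) as exp:
        return exp.obs.read().concat().to_pandas()


def fill_missing(obs: pd.DataFrame, column: str, value) -> pd.DataFrame:
    """Return a copy of obs with missing entries in `column` set to `value`."""
    out = obs.copy()
    out[column] = out[column].fillna(value)
    return out


# Stand-in for read_obs("path/to/experiment") so the example runs on its own.
obs = pd.DataFrame({"soma_joinid": [0, 1, 2], "my_column": [1.5, np.nan, 3.0]})
print(obs["my_column"].isna().sum())  # how many rows are missing a value
fixed = fill_missing(obs, "my_column", 0)
print(fixed)
```

Working on a copy keeps the original frame available for comparison before anything is written back.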

Describe alternatives you've considered

The existing tutorials, API docs, and getting help from LLMs

@johnkerl johnkerl added the documentation Improvements or additions to documentation label Feb 11, 2025
@spencerseale
Collaborator

Hi @nick-youngblut,

Thanks for the feature request!

You can update obs by replacing the entire obs DataFrame using tiledbsoma.io.update_obs (docs). That should help you with updating obs attributes. Please follow up here once you've tried this!
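A minimal sketch of that read-modify-replace cycle, assuming the modified DataFrame keeps the same rows as the stored obs. The helper names `check_same_rows` and `replace_obs` are hypothetical, and the row check is a precaution of this sketch rather than something taken from the library docs; consult the `update_obs` documentation for the exact constraints it enforces.

```python
import pandas as pd


def check_same_rows(old_obs: pd.DataFrame, new_obs: pd.DataFrame) -> None:
    """Guard (an assumption of this sketch): same rows, in the same order."""
    if len(old_obs) != len(new_obs):
        raise ValueError("new obs must have the same number of rows")
    same_ids = old_obs["soma_joinid"].to_numpy() == new_obs["soma_joinid"].to_numpy()
    if not same_ids.all():
        raise ValueError("soma_joinid values must be unchanged")


def replace_obs(experiment_uri: str, transform) -> None:
    """Read obs, apply `transform` (DataFrame -> DataFrame), write it back."""
    import tiledbsoma  # imported lazily; only needed when actually writing
    import tiledbsoma.io

    # Read with a read-mode handle, then reopen in write mode for the update.
    with tiledbsoma.Experiment.open(experiment_uri) as exp:
        old_obs = exp.obs.read().concat().to_pandas()
    new_obs = transform(old_obs)
    check_same_rows(old_obs, new_obs)
    with tiledbsoma.Experiment.open(experiment_uri, "w") as exp:
        tiledbsoma.io.update_obs(exp, new_obs)
```

A call might look like `replace_obs("path/to/experiment", lambda df: df.fillna({"my_column": 0}))`, where both the path and the column name are placeholders.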

We're currently working on releasing a new TileDB Academy doc focused exclusively on updating and adding new arrays to existing tiledbsoma.Experiment objects. This should answer your questions on:

> Updating the records of interest in the TileDB database

Your note on querying for null-type values is great feedback. I'll discuss with the team about adding this to the docs as well.

Spencer

@nick-youngblut
Author

I missed tiledbsoma.io.update_obs. Thanks so much for pointing that out! I was already trying to plan out how to do a large database migration (efficiently) just to add/remove columns. I'll give it a shot ASAP. Thanks again!

In regards to null-type, I found that some of the values are encoded as literal "NaN" strings, while others are actually NaN values. I'm not sure why this happened, since I've appended data in the same manner for all datasets added to the database.
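A mixed column like this can be inspected and normalized in pandas before writing it back. The sketch below is illustrative only: `normalize_nulls` is a hypothetical helper, and the set of string spellings treated as missing is an assumption to adjust for the actual data.

```python
import numpy as np
import pandas as pd

# Assumption: these string spellings should count as missing values.
NULL_STRINGS = {"NaN", "nan"}


def normalize_nulls(series: pd.Series) -> pd.Series:
    """Turn string spellings of 'missing' into real missing values."""
    as_object = series.astype(object)
    # Keep entries that are not null-like strings; replace the rest with NaN.
    return as_object.where(~as_object.isin(NULL_STRINGS), np.nan)


col = pd.Series(["abc", "NaN", np.nan, "def"])
print(col.isna().sum())                   # counts only the real NaN
print(normalize_nulls(col).isna().sum())  # also counts the "NaN" string
```

Comparing the two counts per column is a quick way to see where the literal strings ended up.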

@johnkerl
Member

johnkerl commented Feb 12, 2025

> In regards to null-type, I found that some of the values are encoded as literal "NaN" strings, while others are actually NaN values. I'm not sure why this happened, since I've appended data in the same manner for all datasets added to the database.

@nick-youngblut this is timely.

We have a couple things in flight at present:

If it's not too much trouble -- can you please share what your adata.obs.dtypes are (Pandas) for some of the inputs, as well as your exp.obs.schema for the resulting TileDB-SOMA experiments?

@nick-youngblut
Author

My adata.obs.dtypes are:

gene_count               int64
umi_count              float32
barcode                 object
SRX_accession           object
lib_prep                object
tech_10x                object
organism                object
tissue                  object
disease                 object
purturbation            object
cell_line               object
czi_collection_id       object
czi_collection_name     object

My exp.obs.schema is:

soma_joinid: int64 not null
obs_id: large_string
gene_count: int64
umi_count: float
barcode: large_string
SRX_accession: dictionary<values=string, indices=int32, ordered=0>
lib_prep: dictionary<values=string, indices=int32, ordered=0>
tech_10x: dictionary<values=string, indices=int32, ordered=0>
organism: dictionary<values=string, indices=int32, ordered=0>
tissue: dictionary<values=string, indices=int32, ordered=0>
disease: dictionary<values=string, indices=int32, ordered=0>
purturbation: dictionary<values=string, indices=int32, ordered=0>
cell_line: dictionary<values=string, indices=int32, ordered=0>
czi_collection_id: dictionary<values=string, indices=int32, ordered=0>
czi_collection_name: dictionary<values=string, indices=int32, ordered=0>

Converting some of the database obs to a pandas dataframe results in the following dtypes:

soma_joinid               int64
obs_id                   object
gene_count                int64
umi_count               float32
barcode                  object
SRX_accession          category
lib_prep               category
tech_10x               category
organism               category
tissue                 category
disease                category
purturbation           category
cell_line              category
czi_collection_id      category
czi_collection_name    category

I hope that info is helpful!
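Because the string columns come back from the database as pandas `category` columns, filling missing values in them takes one extra step: the fill value has to be registered as a category first, otherwise pandas rejects it. A small sketch, in which the example values and the `"unknown"` placeholder are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for a categorical obs column read back from the database.
col = pd.Series(["c1", np.nan, "c2"], dtype="category", name="czi_collection_id")

# Register the fill value as a category, then fill the missing entries.
filled = col.cat.add_categories(["unknown"]).fillna("unknown")

print(filled.dtype)     # still a categorical column
print(filled.tolist())
```

Staying within the categorical dtype keeps the column compatible with the dictionary-encoded schema shown above.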

@johnkerl
Member

Thanks @nick-youngblut! And in which column(s) are you seeing the NaN/"NaN" misbehavior?

@nick-youngblut
Author

A bit of background might be helpful. My workflow for building the tiledb-soma database is:

  • I have a large number of datasets in mtx file format (STARsolo output)
  • I am converting batches of mtx files into a combined h5ad and adding obs metadata pulled from a relational database
    • This is a separate HPC job from loading the tiledb-soma database; I've found that h5ad production is MUCH faster than appending to the tiledb-soma database
  • For each resulting h5ad file: append to the tiledb-soma database
    • This job is much slower than creating the batched h5ad files

Currently, I'm batching tens of mtx files into one h5ad file (which needs less memory than batching hundreds). Do you think tiledb-soma database appends are faster with:

  • fewer, but larger h5ad files
  • more, smaller h5ad files
  • doesn't seem to matter

Appending datasets to the tiledb-soma database takes a lot of time, so any suggestions on optimizing this process would be greatly appreciated!
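The batching step described above can be sketched with the standard library alone. The helper name `batched`, the file names, and the batch size of 10 are all placeholders; each yielded batch would then be combined into one h5ad file by whatever conversion step is already in use.

```python
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, List


def batched(paths: Iterable[Path], batch_size: int) -> Iterator[List[Path]]:
    """Yield consecutive groups of at most `batch_size` paths."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(paths)
    while batch := list(islice(it, batch_size)):
        yield batch


# Hypothetical input: 25 mtx directories grouped 10 at a time.
mtx_dirs = [Path(f"sample_{i}") for i in range(25)]
for n, batch in enumerate(batched(mtx_dirs, 10)):
    print(n, len(batch))
```

Keeping the batch size as a single parameter makes it easy to compare a few values when tuning memory use against throughput.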

@nick-youngblut
Author

> which column(s) are you seeing the NaN/"NaN" misbehavior?

The NaN values are seen in czi_collection_id and czi_collection_name since some datasets do not include this information.

@johnkerl
Member

> Do you think tiledb-soma database appends are faster with:

Thanks @nick-youngblut for the question! In fact I do not have numbers on this performance question.

My gut feeling is it doesn't matter too much as the bulk of the time is (in my experience) in X writes, and those are 'chunked' already within a single .h5ad ingest. I think the 10 MTX -> 1 H5AD sounds like a good ratio, a priori ...

@nick-youngblut
Author

From limited testing, I've found that separating the creation of small batched h5ad files (e.g., 10-20 datasets) and then appending them to a database increases the throughput by ~10x. However, I don't know if that really scales with batching larger numbers of datasets. I'll check it empirically. Thanks!
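One way to run that empirical check is a small timing harness that reports datasets per second for a given batch size. This is only a sketch: `datasets_per_second` is a hypothetical helper, and `ingest` stands in for whatever function builds an h5ad from a batch and appends it to the database.

```python
import time
from typing import Callable, Sequence


def datasets_per_second(
    ingest: Callable[[Sequence[str]], None],
    datasets: Sequence[str],
    batch_size: int,
) -> float:
    """Time `ingest` over all datasets, processed `batch_size` at a time."""
    start = time.perf_counter()
    for i in range(0, len(datasets), batch_size):
        ingest(datasets[i : i + batch_size])
    elapsed = time.perf_counter() - start
    return len(datasets) / elapsed if elapsed > 0 else float("inf")


# Dummy ingest that only records batch sizes, so the harness runs anywhere.
seen = []
rate = datasets_per_second(
    lambda batch: seen.append(len(batch)),
    [f"ds_{i}" for i in range(45)],
    batch_size=20,
)
print(seen, rate)
```

Running it with the real ingest function over a few batch sizes (for example 10, 20, 50) would show whether the throughput gain keeps scaling.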

@spencerseale
Collaborator

spencerseale commented Feb 13, 2025

@nick-youngblut our commercial product takes care of scaling, logging, and cataloging TileDB-SOMAs along with your other data assets. We have solved the problems you're likely encountering; if you'd like to discuss that further, you can reach out to me or our team.

3 participants