tiledb #120
Comments
@rabernat - I've seen this before too and it does look quite promising. I get the impression from reading the documentation that the Python API is not quite ready to go yet. In theory though, this could be yet another backend for xarray. |
Hello folks. We just released a new version of TileDB, which has optimized support for AWS S3 and a Python API that uses NumPy (there is a lot of info and examples in our docs: https://docs.tiledb.io/). We would love to get some feedback from you. |
Hi @stavrospapadopoulos ! Thanks for chiming in. Do you happen to have a demo TileDB deployment somewhere publicly available that we could hit to try things out? |
Thanks @stavrospapadopoulos for sharing this information! TileDB looks like a promising possible storage layer for xarray / pangeo. I have read through the documentation and have a couple of questions that I hope you can clarify whenever you might have the time: Technical Questions
Social Questions
|
Oh and one more technical question
|
@jakebolewski, @tdenniston will chime in (probably next week), but here is my take. Responses to Technical Questions
@mrocklin we will put together a demo very soon as well.
Responses to Social Questions
|
I suspect that positional indexing may not be required. XArray is accustomed to dealing with data stores that don't support this (like numpy itself). It has logic to deal with it. I suspect that the desired answer is instead "anything numpy can provide, including integers, slices (of all varieties), lists, or boolean numpy arrays" or some subset of that.
The ideal solution from our perspective would be some publicly-available data on a GCS bucket (we could probably front the cost for this) and an easy way to install TileDB, ideally with a conda package. We could also set up a smaller example to run from S3, but are unlikely to host it.
Conda
I'm biased here (I am employed by Anaconda) but I strongly recommend creating a conda package if you're looking for adoption within the broader numeric Python community. One way to do this is through conda-forge, which operates a build farm with linux/windows/mac support. |
Correct. Currently we support integers and slices, whereas we are working on lists and boolean numpy arrays.
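For reference, the four kinds of index mentioned above look like this on a plain NumPy array. This is only a sketch of the NumPy behavior being asked for; it does not use TileDB, and the array `a` and the variable names are made up for illustration.

```python
import numpy as np

# A small 3x4 array standing in for data held by any storage backend.
a = np.arange(12).reshape(3, 4)

by_integer = a[1]              # integer: a single row
by_slice = a[0:2, ::2]         # slices: rows 0-1, every other column
by_list = a[[0, 2]]            # list of integers: rows 0 and 2
by_mask = a[a[:, 0] >= 4]      # boolean array: rows whose first entry is >= 4

print(by_integer.tolist())
print(by_slice.tolist())
print(by_list.tolist())
print(by_mask.tolist())
```

A backend that handles only the first two cases can still be used; the other two would then have to be emulated by whatever sits on top of it.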
As I mentioned above, TileDB provides tight integration with the backends, which means that we implement our own low-level IO functionality using the backend's C or C++ SDK. The new release has AWS S3 support. GCS is in our roadmap, but it will require quite some work to tightly integrate with it using its SDK. Do you prefer GCS to S3? Please note that we have no bias - we will eventually support both. As a first step, we can certainly provide you with access to an S3 bucket + a conda package.
As mentioned above, we are almost there. :) |
The software packages have no preference. But the particular distributed deployment of http://pangeo.pydata.org/ runs on GCP, which would make it trivial for folks to try running scalability and performance tests without paying data transfer costs (though short term I don't think that these will amount to much). On a personal/OSS note I'd like to push people to support more than the dominant cloud vendor. |
Having data on some cloud accessible system would make it easy to repeat this experience: https://youtu.be/rSOJKbfNBNk |
Also cc @llllllllll who has been interested in comparing array storage systems |
OK, all this sounds good. We will set something up on S3 so that we can get some feedback, and we can run detailed benchmarks once we have the GCS integration. Thanks for taking a look! |
One can also run dask/xarray feasibility studies and benchmarks on a single machine. It's just a bit less compelling due to the wealth of options with a local disk.
|
On a side note, this is our up-to-date repo, which we maintain and further develop at TileDB-Inc: https://github.com/TileDB-Inc/TileDB
Not to be confused with the one that I used to develop at Intel Labs: https://github.com/Intel-HLS/TileDB |
Good to know. I suspect that this is due to Intel-HLS/TileDB#72. Want me to resubmit?
|
No worries, fixed (with thanks). |
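Thanks so much for your quick and detailed replies to our questions. At this point, it is clear to me that we definitely need to evaluate TileDB as a potential backend for xarray. I imagine the first step will be to evaluate TileDB with dask. Perhaps @mrocklin can suggest the best approach to that. The main practical obstacle for testing with xarray will be the need for a new xarray backend. I see two main options:
1. Duplicate / modify the zarr backend. This involves about 500 lines of code (https://github.com/pydata/xarray/blob/master/xarray/backends/zarr.py) plus another 200 of tests (https://github.com/pydata/xarray/blob/master/xarray/tests/test_backends.py#L1132-L1362). (Note that the backend code is due for a refactor, pydata/xarray#1087.)
2. Create a tiledbnetcdf project, similar to h5netcdf (https://github.com/shoyer/h5netcdf). This would essentially wrap TileDB with the same API as the netcdf4-python (https://github.com/Unidata/netcdf4-python) library, allowing xarray to use it as a storage backend with less (but still some) new backend code.
I think option 2 is attractive for several reasons. Geosciences are potentially one of the biggest sources of TileDB-type data (e.g. NASA is putting > 300 PB of data into the cloud over the next 5 years). NetCDF is already a familiar interface for our community, so this might significantly lower the bar for the broader community. The pangeo project is stretched pretty thin right now in terms of developer time. So the main challenge will be to find someone to work on this. Maybe @sirrice can help us get a Columbia CS student interested. |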
Testing locally with dask is probably pretty easy, assuming that TileDB
exposes a Python object that supports slicing (which I think we've already
established that it does).
Relevant dask docs here:
http://dask.pydata.org/en/latest/array-creation.html#numpy-slicing
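A rough sketch of what such a local test could look like. `SliceableStore` is a hypothetical stand-in for whatever object TileDB exposes (it is not the TileDB API); all it does is provide `shape`, `dtype`, `ndim` and slicing, which is what `dask.array.from_array` needs. The dask import is treated as optional so the snippet still runs, eagerly, where dask is not installed.

```python
import numpy as np

try:
    import dask.array as da  # optional dependency for this sketch
except ImportError:
    da = None


class SliceableStore:
    """Hypothetical stand-in for an on-disk array that supports slicing."""

    def __init__(self, data):
        self._data = np.asarray(data)
        self.shape = self._data.shape
        self.dtype = self._data.dtype
        self.ndim = self._data.ndim

    def __getitem__(self, key):
        # A real store would read only the requested region from disk here.
        return self._data[key]


def lazy_mean(store, chunks):
    """Mean over the whole store, read chunk by chunk when dask is available."""
    if da is None:
        # Fallback without dask: read everything at once.
        return float(np.asarray(store[...]).mean())
    arr = da.from_array(store, chunks=chunks)
    return float(arr.mean().compute())


store = SliceableStore(np.arange(24, dtype="float64").reshape(4, 6))
result = lazy_mean(store, chunks=(2, 3))
print(result)
```

Swapping `SliceableStore` for a real array object is the only change the test should need, provided that object accepts the slices dask passes to it.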
|
cc @kaipak, who may be interested in this discussion |
Yep, I've been tagging along for the discussion :). |
Before we consider integrating TileDB with xarray or building a netcdf wrapper for TileDB, we suggest we perform some simple experiments to evaluate TileDB's performance on parallel writes and reads on nd-arrays using dask. We need to start with the following questions:
|
On normal file systems NetCDF4 on HDF5 is probably the thing to beat, at least for the science domains that interest the community active on this issue tracker (which is sizable). On cloud-based systems there is no obvious contender. I wrote about cloud options and concerns here: http://matthewrocklin.com/blog/work/2018/02/06/hdf-in-the-cloud
I don't think that this matters much. This community's workloads are typically write-once read-many. Data tends to be delivered in the form of NetCDF4 files. I don't think that people care strongly about the cost-to-convert. I think that the broader point is that you have to convert, which is a negative for any format other than NetCDF.
Those are both common. So too are large parallel POSIX file systems. |
Also, to be clear, if your goal is to gain adoption then I suspect that demonstrating modest speed improvements is not sufficient. Disk IO isn't necessarily a pain point, and the inertia behind existing data formats is very high. Instead I think you would need to go a bit broader and demonstrate that various workflows now become feasible where they were not feasible before. Parallel and random access into cloud storage is one such workflow, but there are likely others. |
If you are looking for an appropriate test dataset, I would recommend something from NASA. For example: A single file (e.g. ftp://podaac-ftp.jpl.nasa.gov/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2018/057/20180226090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc) is 377 MB, and there is one every day for the past 16 years, ~2.2 TB total. Any reasonable subset of this will make a good test dataset. Simple yet representative things our users might want to do are:
There are of course more complicated workflows, but these are representative of the I/O bound ones. |
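As a quick sanity check on the size quoted above (one 377 MB file per day for 16 years), the arithmetic works out as follows. Decimal units (1 TB = 1,000,000 MB) and an average of 365.25 days per year are assumptions made here.

```python
file_size_mb = 377          # size of one daily file, from the comment above
years = 16                  # length of the record, from the comment above

days = round(years * 365.25)        # average year length, including leap days
total_mb = file_size_mb * days
total_tb = total_mb / 1_000_000     # decimal terabytes

print(days, total_mb, round(total_tb, 1))
```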
Dask.array can do all of these computations. The question would be how things feel when Dask accesses data from TileDB when doing these computations. |
@rabernat - I like the strategy you outlined. Would it make sense to create a repo with a set of performance benchmarks based on this data collection? Ideally it would enable various I/O backends (e.g. HDF5 files, Fuse to S3, zarr, hsds, etc.) to be plugged in. This would seem a good framework to evaluate performance for these common use cases. |
@rabernat - I was planning on doing some benchmarking with hdf5lib vs s3vfd vs hsds. I can start with some of the simpler codes and then ask you for help when I get stuck! For now I'm thinking to keep it to just a single client (i.e. without dask distributed). Is the Sea Temperature data on S3? Alternatively, we could use the DCP-30 dataset: s3://nasanex/NEX-DCP30. Any issues with that? |
Yes, I really like this idea. A modular system would be ideal and would save us from writing a lot of boilerplate.
This is kind of what xarray already does! 😏 Unfortunately, we do not have direct xarray support for some of these storage layers (e.g. TileDB), so we can't just use xarray. We will end up reinventing some of xarray's backend logic, but I suppose that's acceptable. Does it make sense to try to integrate airspeed velocity (asv), or is that overkill? @kaipak, a Pangeo intern here at Columbia, has some time to contribute to this benchmark project. @jreadey, it would be great if you two could work together. |
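One possible shape for such a modular benchmark, as a minimal sketch only. Every name here (`BACKENDS`, `register`, `run_benchmarks`, the `in_memory` backend) is hypothetical, and the in-memory backend stands in for real openers such as HDF5, zarr, hsds or TileDB, which would each be registered the same way.

```python
import time
from typing import Callable, Dict, List

# Each backend is a zero-argument function that returns the data to work on.
Backend = Callable[[], List[List[float]]]
BACKENDS: Dict[str, Backend] = {}


def register(name: str):
    """Decorator that adds an opener function to the backend registry."""
    def decorator(fn: Backend) -> Backend:
        BACKENDS[name] = fn
        return fn
    return decorator


@register("in_memory")
def load_in_memory() -> List[List[float]]:
    # Stand-in data; a real backend would open a file or a cloud store here.
    return [[float(r * 4 + c) for c in range(4)] for r in range(3)]


def column_mean(rows: List[List[float]]) -> List[float]:
    """Example workload: mean along the first axis."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def run_benchmarks(workload) -> Dict[str, dict]:
    """Run the same workload against every registered backend and time it."""
    results = {}
    for name, opener in BACKENDS.items():
        start = time.perf_counter()
        output = workload(opener())
        results[name] = {"seconds": time.perf_counter() - start, "output": output}
    return results


results = run_benchmarks(column_mean)
print(results["in_memory"]["output"])
```

Keeping the workload separate from the opener means each new storage layer only has to supply one function, which is most of the boilerplate saving discussed above.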
Reopening this issue because I think there is a lot of great discussion and there are some new developments worth pointing out. Thanks to @normanb TileDB support is now available in GDAL>3.0 (OSGeo/gdal#1402)! Documentation here: https://gdal.org/drivers/raster/tiledb.html. The upcoming GDAL 3.1 release will also have better support for multidimensional raster data. See:
Once rasterio releases support for gdal>3 (conda-forge/gdal-feedstock#326), this should provide an easy way to bring TileDB into an xarray dataset. See also rasterio/rasterio#1699. I'd love to see a Zarr driver for GDAL as well, and given the overlap with TileDB, it seems like this would be fairly straightforward to implement (https://gdal.org/tutorials/raster_driver_tut.html). There has been a fair amount of discussion and interest on how to convert existing h5 and netcdf files to Zarr format (just one example here: #686). Maybe someone is already doing this? But it seems like there isn't much cross talk between the zarr, gdal, and rasterio communities. I'm happy to open some issues in the respective repos, but probably won't have time in the near future myself for a PR to GDAL. pinging @jakebolewski, @shoyer, @jhamman, @davidbrochart, @rabernat, @lewismc |
Also a quick illustration how to convert h5 to tiledb and read with gdal
Currently trying to read over s3 throws an error:
I'm pretty sure this is b/c it's a folder, so maybe I need to point to a file under .tiledb? But I'm not sure how to read these since it's a directory structure on S3:
|
@scottyhq the TileDB driver within GDAL uses its own methods to access S3 as opposed to GDAL's own. To access the array try
The GDAL driver code currently relies on there being a config file to determine S3 access https://github.com/OSGeo/gdal/blob/master/gdal/frmts/tiledb/tiledbdataset.cpp#L908. This isn't ideal for public buckets with no config parameters, and I will change this check to remove the need for the config file when I add the new GDAL multi-dimensional API to this driver that is currently in master https://gdal.org/tutorials/multidimensional_api_tut.html. The TileDB configuration parameters are described here: https://docs.tiledb.io/en/stable/tutorials/config.html#summary-of-parameters. If there are any problems then I am happy to work through them. |
Thanks @normanb for the pointers. I'm still having trouble getting tiledb to work with a config file. Reading the local file without a config file works (
I've tried the simple case of pointing TILEDB_CONFIG at a file containing a copy of the default config (
|
@scottyhq I ran through the notebook you linked to. Your last command should be gdalinfo -oo TILEDB_CONFIG=tiledb.config s3://pangeo-data-upload-virginia/gdal3-test/DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb and the config should include the following keys (and add your own values)
The config is used to determine whether to access S3 or not, which is a bug (as noted above), as a config file can also be used locally. I will change that with my current updates. |
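A small sketch of preparing the config file and the `gdalinfo` call described above. The list of keys from the comment is not preserved in this thread, so the three `vfs.s3.*` keys below are an assumption taken from the TileDB configuration parameter names, the values are placeholders, and the one `key value` pair per line layout is also assumed. The script only builds the command; it does not run GDAL.

```python
import tempfile
from pathlib import Path


def write_tiledb_config(path, params):
    """Write one 'key value' pair per line (assumed TileDB config layout)."""
    lines = [f"{key} {value}" for key, value in params.items()]
    path.write_text("\n".join(lines) + "\n")
    return path


def gdalinfo_command(config_path, array_uri):
    """Build the gdalinfo call with the TILEDB_CONFIG open option."""
    return ["gdalinfo", "-oo", f"TILEDB_CONFIG={config_path}", array_uri]


# Assumed keys with placeholder values; replace with your own settings.
params = {
    "vfs.s3.region": "us-east-1",
    "vfs.s3.aws_access_key_id": "YOUR_KEY_ID",
    "vfs.s3.aws_secret_access_key": "YOUR_SECRET_KEY",
}

workdir = Path(tempfile.mkdtemp())
config_path = write_tiledb_config(workdir / "tiledb.config", params)
command = gdalinfo_command(
    config_path,
    "s3://pangeo-data-upload-virginia/gdal3-test/"
    "DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb",
)
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run` on a machine where a GDAL build with the TileDB driver is installed.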
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date. |
Great conversations! Just wondering if there are still plans afoot to implement an Xarray backend for TileDB? (That would be wonderful!) |
I would love having xarray + https://feedback.tiledb.com/tiledb-core/p/support-axes-labels
This way, we could have e.g. native-performance string indexing. |
I would say there is broad interest in TileDB support in xarray. Xarray is developed by volunteers. If anyone on this thread wants to volunteer to implement a TileDB backend, the xarray devs will be glad to provide advice.
|
I would be happy to implement a TileDB backend for xarray.
I am waiting on TileDB to implement the heterogeneous dimensions feature, which should be out soon.
Stan
On Feb 4, 2020, at 9:52 AM, Jack Kelly wrote:
I hear you, @rabernat! I've sent a tweet (https://twitter.com/jack_kelly/status/1224706970801901568) asking for folks to help implement a TileDB backend to Xarray :)
|
Awesome, thank you Stan! |
@petacube I'm also interested in tiledb's heterogenous dimensions feature. Do you know roughly when it might be implemented? I found the feature listed in tiledb's roadmap, but I couldn't see it in the tiledb github issue queue. |
@JackKelly this feature is currently under heavy development. It is blocked by #93. You can see some related merged PRs here. I expect the feature to be ready by early March. |
Just want to mention that the heterogeneous dimensions feature has been implemented in the TileDB 2.0 release (1 May). |
Should we re-open this issue? (It was automatically closed due to inactivity. But maybe now's the time to re-open this issue, because heterogeneous dims are now implemented in TileDB?) |
@petacube would you still be interested in implementing a TileDB backend for xarray? It would be hugely useful! :) |
We can, of course.
Do you guys have a project/data to test it on?
|
Awesome! Yeah, I plan to benchmark TileDB vs Zarr vs cloud-optimised-GeoTIFF on a satellite dataset (on Google Cloud Storage) over the next few weeks. |
Just FYI folks, here is a good thread about our plans to model netcdf data with TileDB, after great discussions with @rabernat and @DPeterK, which is very relevant to our TileDB-xarray integration. We will first share a proposal on a general spec for modeling labeled arrays (encompassing netcdf), as well as a proof implementation with our upcoming "axes labels" core TileDB feature. Eventually, this axes labels feature will drive the TileDB-xarray integration. |
Just to update this discussion: We developed a read-only TileDB backend for xarray available here. This uses the new xarray backend plugin architecture that is soon-to-be-released. |
Thanks so much @jp-dark! So exciting! Just curious how you recommend starting the transition to TileDB without a way to write files? Our community doesn't have any existing TileDB data. With Zarr, we can basically write code like this:
ds = xr.open_dataset('file.nc')
ds.to_zarr('store.zarr')
...and convert all our existing data to a new format. What sort of workflow would you recommend for converting existing netCDF data to TileDB? |
@rabernat This xarray plugin was developed for other customers who would like to visualize and manipulate data from existing TileDB arrays with xarray. A full-fledged NetCDF data model in TileDB (including an ingestor from NetCDF) and a specification for comment are coming soon.
As long as we are discussing cloud storage formats, it seems like we should be looking at TileDB:
https://www.tiledb.io/
https://github.com/TileDB-Inc/TileDB
If it lives up to the promised goals, TileDB could really be the solution to our storage problems.
cc @kaipak, @sirrice