tiledb #120

Closed
rabernat opened this issue Feb 20, 2018 · 108 comments

Comments

@rabernat
Member

As long as we are discussing cloud storage formats, it seems like we should be looking at TileDB:
https://www.tiledb.io/
https://github.com/TileDB-Inc/TileDB

TileDB manages data that can be represented as dense or sparse arrays. It can support any number of dimensions and store in each array element any number of attributes of various data types.

If it lives up to the promised goals, TileDB could really be the solution to our storage problems.

cc @kaipak, @sirrice

@jhamman
Member

jhamman commented Feb 20, 2018

@rabernat - I've seen this before too and it does look quite promising. I get the impression from reading the documentation that the Python API is not quite ready to go yet. In theory though, this could be yet another backend for xarray.

@stavrospapadopoulos

Hello folks. We just released a new version of TileDB, which has optimized support for AWS S3 and a Python API that uses NumPy (there is a lot of info and examples in our docs: https://docs.tiledb.io/). We would love to get some feedback from you.

cc @jakebolewski, @tdenniston

@mrocklin
Member

Hi @stavrospapadopoulos ! Thanks for chiming in.

Do you happen to have a demo TileDB deployment somewhere publicly available that we could hit to try things out?

@rabernat
Member Author

rabernat commented Feb 23, 2018

Thanks @stavrospapadopoulos for sharing this information! TileDB looks like a promising possible storage layer for xarray / pangeo.

I have read through the documentation and have a couple of questions that I hope you can clarify whenever you might have the time:

Technical Questions

  1. Does TileDB use a client / server model? This was my impression when I saw the word "DB", but, looking more closely at the documentation, it appears that it is more like a [virtual] file format + an API for accessing those [virtual] files (similar to HDF5 or zarr).
  2. What sort of indexing is supported on the array-like objects returned by tiledb (e.g. the dense_array object in this example)?
  3. Does TileDB support distributed reads / writes on the same array (either via a shared filesystem or an S3 bucket) from multiple processes on a distributed cluster? If locking is required, does the python API provide a way for the user to pass a shared lock object?
  4. Does the python API support asynchronous I/O?
  5. The docs contain lots of examples, but can you point me to the complete python API documentation?
  6. Any estimate on when the conda-forge package will be available? (This greatly lowers the bar for the python community to try something new.)

Social Questions

  1. Can you point us towards some projects that are currently using TileDB?
  2. How does TileDB compare to zarr, our current library of choice for chunked / compressed array storage?
  3. Is anyone from your team interested in collaborating more directly on testing TileDB in the context of dask / xarray?

@rabernat
Member Author

Oh and one more technical question

  1. Is there a way to store array metadata in TileDB (e.g. units or other arbitrary attributes), as in HDF5, netCDF, zarr, etc?

@stavrospapadopoulos

@jakebolewski, @tdenniston will chime in (probably next week), but here is my take.

Responses to Technical Questions

  1. TileDB is an embeddable library (written in C and C++, more similar to HDF5 than zarr) that comes with a Python wrapper. You should truly think of TileDB as an alternative to HDF5, with (among some other important things) the extra capability of storing sparse arrays in addition to dense and also supporting a wide variety of backends in addition to posix filesystems (currently HDFS and S3, more in the future).

    • About “DB” in the name: when we started working on TileDB at MIT, we thought that we would be an alternative to SciDB and follow its client/server architecture. We eventually decided to make TileDB a lightweight embeddable library (although we are exploring implementing some client/server functionality along with a REST API). We kept the name because people already knew the software as TileDB and because, well, we liked it. :)
  2. The Python API currently supports only positional indexing. We assume though that you want this, so this is where we are going. Please stay tuned.

  3. Absolutely, this is the big strength of TileDB.

    • First, TileDB supports both concurrent reads and writes (even interleaved) without any locking. Please check our concurrency and consistency model. For any operation where locking is required (e.g., creating an array), we implement our own locking (via mutexes, as well as file locks for the backends that support them) on the C++ side.

    • Second, note that TileDB implements its own tight integration with the various backends via C/C++ SDKs (e.g., libhdfs for HDFS and AWS C++ SDK for S3), and its own LRU cache. We do not rely on another package (e.g., s3fs, as zarr does). This enables us to do some pretty cool stuff (performance-wise) under the covers (e.g., we use our own async IO internally, we are currently implementing additional parallel functionality, etc.). You may also want to understand how our updates work; the approach is ideal for append-only backends like S3. So, TileDB is designed and continuously optimized for extreme parallelism for distributed backends, even beyond posix.

  4. Our C and C++ APIs do, but we haven’t wrapped this for Python yet. It is trivial to do so for a NULL callback, but we need more time to make it work for arbitrary Python functions (e.g., lambdas). This is certainly on our roadmap.

  5. There is a menu item API Reference in the docs. Here are the Python API docs.

  6. We have already started working on this, we will publish it very soon. For now, the easiest way to start playing is

docker pull tiledb/tiledb:1.2.0
docker run -it tiledb/tiledb:1.2.0
python3
import tiledb

@mrocklin we will put together a demo very soon as well.

  1. Yes. Please take a look at our key-value store functionality and a Python example. You can attach a key-value store to an array (say, with URI /path/to/array/), by storing it in URI, say, /path/to/array/meta. The C/C++ implementation is very flexible (and much more powerful than HDF5's attributes), as it allows you to store any type of keys (not just strings) and any number of attributes of arbitrary types as values (not just strings), inheriting all the benefits from TileDB arrays. Currently the Python API supports only string keys/values, but we will extend it to support equivalent functionality to C/C++ very soon.
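
The lock-free update model described in technical answer 3 can be illustrated with a toy sketch. It assumes, as a simplification, that every write lands in its own immutable, timestamp-named fragment and that readers overlay fragments oldest-first. `FragmentStore`, `write`, and `read` are hypothetical names invented for this illustration; this is not TileDB's API or on-disk format, and the in-process counter merely stands in for a real timestamp.

```python
import itertools
import json
import os
import tempfile
import uuid


class FragmentStore:
    """Toy 1-D dense array where each write becomes a new immutable fragment file."""

    def __init__(self, root, length, fill=0):
        self.root = root
        self.length = length
        self.fill = fill
        self._clock = itertools.count(1)  # assumption: stand-in for a timestamp
        os.makedirs(root, exist_ok=True)

    def write(self, start, values):
        """Store values for cells [start, start + len(values)) as a new fragment."""
        values = list(values)
        if start < 0 or start + len(values) > self.length:
            raise IndexError("write falls outside the array domain")
        name = f"__{next(self._clock):013d}_{uuid.uuid4().hex}.json"
        tmp = os.path.join(self.root, name + ".tmp")
        with open(tmp, "w") as fh:
            json.dump({"start": start, "values": values}, fh)
        # Publishing via rename means a reader never sees a half-written fragment,
        # and existing fragments are never modified, so no lock is taken.
        os.replace(tmp, os.path.join(self.root, name))

    def read(self):
        """Overlay fragments in name (time) order, so the newest write wins."""
        out = [self.fill] * self.length
        names = sorted(n for n in os.listdir(self.root) if n.endswith(".json"))
        for name in names:
            with open(os.path.join(self.root, name)) as fh:
                frag = json.load(fh)
            start = frag["start"]
            out[start:start + len(frag["values"])] = frag["values"]
        return out


store = FragmentStore(tempfile.mkdtemp(), length=6)
store.write(0, [1, 2, 3])
store.write(2, [9, 9])
print(store.read())  # [1, 2, 9, 9, 0, 0]
```

Because a write only ever adds a new object, the same pattern also suits object stores such as S3, where objects cannot be modified in place.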

Responses to Social Questions

  1. TileDB is used at the core of GenomicsDB, which is the product of a collaboration between Intel HLS and the Broad Institute (which we started when I was still at Intel). GenomicsDB is officially a part of GATK 4.0. We are currently working with the Oak Ridge National Lab on another genomics project. We built a LiDAR data adaptor for Naval Postgraduate School (which we will release pretty soon). We have also started a POC with the JGI on yet another genomics project. We would love to see what value TileDB can bring to a use case like pangeo :).

  2. TileDB is very similar to zarr in that respect. Some of the key differences: (i) TileDB is built exclusively in C and C++, which allows us to bind it to other high-level languages beyond Python as well (e.g., our Java bindings are coming up soon, and we are starting work on R, which is very popular in genomics), (ii) TileDB natively supports sparse arrays with a powerful new format, and (iii) TileDB builds its own tight integration with the storage backends, rather than relying on generic libraries like s3fs, which allows us to do some nice low-level optimizations. We would be very interested to compare TileDB vs zarr on your workloads though.

  3. Absolutely! We love what you guys do and we will benefit enormously from your feedback. Please let me know if you would like to start an email thread ({stavros,jake,tyler}@tiledb.io).

@mrocklin
Member

The Python API currently supports only positional indexing. We assume though that you want this, so this is where we are going. Please stay tuned.

I suspect that positional indexing may not be required. XArray is accustomed to dealing with data stores that don't support this (like numpy itself). It has logic to deal with it. I suspect that the desired answer is instead "anything numpy can provide, including integers, slices (of all varieties), lists, or boolean numpy arrays" or some subset of that.

@mrocklin we will put together a demo very soon as well.

The ideal solution from our perspective would be some publicly-available data on a GCS bucket (we could probably front the cost for this) and an easy way to install TileDB, ideally with a conda package. We could also set up a smaller example to run from S3, but are unlikely to host.

Conda

I'm biased here (I am employed by Anaconda) but I strongly recommend creating a conda package if you're looking for adoption within the broader numeric Python community.

One way to do this is through conda-forge, which operates a build farm with linux/windows/mac support.
This is a community group that is very friendly and supportive. As an example, here are some links for HDF5 and H5Py which I'm guessing is similar-to but more-complex than your situation:

@stavrospapadopoulos

I suspect that positional indexing may not be required. XArray is accustomed to dealing with data stores that don't support this (like numpy itself). It has logic to deal with it. I suspect that the desired answer is instead "anything numpy can provide, including integers, slices (of all varieties), lists, or boolean numpy arrays" or some subset of that.

Correct. Currently we support integers and slices, whereas we are working on lists and boolean numpy arrays.
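
To make the indexing vocabulary concrete, here is a minimal sketch of the four key types mentioned above (integers, slices, lists, boolean masks) applied to a plain 1-D Python list. `select` is a hypothetical helper written only for this illustration; it is not part of TileDB, xarray, or NumPy, and it merely mimics NumPy's 1-D behaviour.

```python
def select(values, key):
    """Apply a NumPy-style 1-D key: int, slice, list of positions, or boolean mask."""
    if isinstance(key, bool):
        raise TypeError("a bare bool is not a supported key in this sketch")
    if isinstance(key, int):
        return values[key]                  # single element
    if isinstance(key, slice):
        return values[key]                  # contiguous or strided range
    key = list(key)
    if key and all(isinstance(k, bool) for k in key):
        if len(key) != len(values):
            raise IndexError("boolean mask must match the array length")
        return [v for v, keep in zip(values, key) if keep]  # mask selection
    return [values[k] for k in key]         # arbitrary positions, repeats allowed


data = [10, 20, 30, 40, 50]
print(select(data, 2))                                  # 30
print(select(data, slice(1, None, 2)))                  # [20, 40]
print(select(data, [4, 0, 0]))                          # [50, 10, 10]
print(select(data, [True, False, True, False, True]))   # [10, 30, 50]
```

Integers and slices describe one regular region, whereas lists and masks can touch scattered cells, which may be one reason they are harder for a storage backend to serve efficiently.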

The ideal solution from our perspective would be some publicly-available data on a GCS bucket (we could probably front the cost for this) and an easy way to install TileDB, ideally with a conda package. We could also set up a smaller example to run from S3, but are unlikely to host.

As I mentioned above, TileDB provides tight integration with the backends, which means that we implement our own low-level IO functionality using the backend's C or C++ SDK. The new release has AWS S3 support. GCS is in our roadmap, but it will require quite some work to tightly integrate with it using its SDK. Do you prefer GCS to S3? Please note that we have no bias - we will eventually support both.

As a first step, we can certainly provide you with access to an S3 bucket + a conda package.

I'm biased here (I am employed by Anaconda) but I strongly recommend creating a conda package if you're looking for adoption within the broader numeric Python community.

As mentioned above, we are almost there. :)

@mrocklin
Member

Do you prefer GCS to S3?

The software packages have no preference.

But the particular distributed deployment of http://pangeo.pydata.org/ runs on GCP, which would make it trivial for folks to try running scalability and performance tests without paying data transfer costs (though short term I don't think that these will amount to much).

On a personal/OSS note I'd like to push people to support more than the dominant cloud vendor.

@mrocklin
Member

Having data on some cloud accessible system would make it easy to repeat this experience: https://youtu.be/rSOJKbfNBNk

@mrocklin
Member

Also cc @llllllllll who has been interested in comparing array storage systems

@stavrospapadopoulos

OK, all this sounds good. We will set something up on S3 so that we can get some feedback, and we can run detailed benchmarks once we have the GCS integration. Thanks for taking a look!

@mrocklin
Member

mrocklin commented Feb 23, 2018 via email

@stavrospapadopoulos

On a side note, this is our up-to-date repo, which we maintain and further develop at TileDB-Inc:
https://github.com/TileDB-Inc/TileDB

Not to be confused with the one that I used to develop at Intel Labs:
https://github.com/Intel-HLS/TileDB

@mrocklin
Member

mrocklin commented Feb 23, 2018 via email

@stavrospapadopoulos

stavrospapadopoulos commented Feb 23, 2018

No worries, fixed (with thanks).

@rabernat
Member Author

Thanks so much for your quick and detailed replies to our questions. At this point, it is clear to me that we definitely need to evaluate TileDB as a potential backend for xarray.

I imagine the first step will be to evaluate TileDB with dask. Perhaps @mrocklin can suggest the best approach to that.

The main practical obstacle for testing with xarray will be the need for a new xarray backend. I see two main options:

  1. Duplicate / modify the zarr backend. This involves about 500 lines of code plus another 200 of tests. (Note that the backend code is due for a refactor, WIP: New DataStore / Encoder / Decoder API for review pydata/xarray#1087.)
  2. Create a tiledbnetcdf project, similar to h5netcdf. This would essentially wrap TileDB with the same API as the netcdf4-python library, allowing xarray to use it as a storage backend with less (but still some) new backend code.

I think option 2 is attractive for several reasons. The geosciences are potentially one of the biggest sources of TileDB-type data (e.g. NASA is putting > 300 PB of data into the cloud over the next 5 years). NetCDF is already a familiar interface for our community, so this might significantly lower the bar for the broader community.

The pangeo project is stretched pretty thin right now in terms of developer time. So the main challenge will be to find someone to work on this. Maybe @sirrice can help us get a Columbia CS student interested.

@mrocklin
Member

mrocklin commented Feb 26, 2018 via email

@rabernat
Member Author

cc @kaipak, who may be interested in this discussion

@kaipak

kaipak commented Feb 27, 2018

Yep, I've been tagging along for the discussion :).

@stavrospapadopoulos

Before we consider integrating TileDB with xarray or building a netcdf wrapper for TileDB, we suggest we perform some simple experiments to evaluate TileDB's performance on parallel writes and reads on nd-arrays using dask.

We need to start with the following questions:

  1. What are we comparing with (e.g., zarr, hdf5, netcdf)?
  2. What is the format of the input data to be ingested to TileDB and the other approaches? One idea would be to generate some nd-array data and store them in some simple CSV or flat binary format. We are open to any suggestions here.
  3. Where will the input and ingested data be stored? We suggest two experiments: (i) both locally, (ii) both on S3. In the future we can test with more backends.

@mrocklin
Member

What are we comparing with (e.g., zarr, hdf5, netcdf)?

On normal file systems NetCDF4 on HDF5 is probably the thing to beat, at least for the science domains that interest the community active on this issue tracker (which is sizable). On cloud-based systems there is no obvious contender. I wrote about cloud options and concerns here: http://matthewrocklin.com/blog/work/2018/02/06/hdf-in-the-cloud

What is the format of the input data to be ingested to TileDB and the other approaches? One idea would be to generate some nd-array data and store them in some simple CSV or flat binary format. We are open to any suggestions here.

I don't think that this matters much. This community's workloads are typically write-once read-many. Data tends to be delivered in the form of NetCDF4 files. I don't think that people care strongly about the cost-to-convert. I think that the broader point is that you have to convert, which is a negative for any format other than NetCDF.

Where will the input and ingested data be stored? We suggest two experiments: (i) both locally, (ii) both on S3. In the future we can test with more backends.

Those are both common. So too are large parallel POSIX file systems.

@mrocklin
Member

mrocklin commented Feb 27, 2018

Also, to be clear, if your goal is to gain adoption then I suspect that demonstrating modest factor speed improvements is not sufficient. Disk IO isn't necessarily a pain point, and the inertia to existing data formats is very high. Instead I think you would need to go a bit more broadly and demonstrate that various workflows now become feasible where they were not feasible before. Parallel and random access into cloud-storage is one such workflow, but there are likely others.

@rabernat
Member Author

rabernat commented Feb 27, 2018

If you are looking for an appropriate test dataset, I would recommend something from NASA. For example:
GHRSST Level 4 MUR Global Foundation Sea Surface Temperature Analysis (v4.1)

A single file (e.g. ftp://podaac-ftp.jpl.nasa.gov/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2018/057/20180226090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc) is 377 MB, and there is one every day for the past 16 years, ~2.2 TB total. Any reasonable subset of this will make a good test dataset.
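
As a quick sanity check of that size estimate, the arithmetic can be reproduced in a few lines. The only assumptions are one file per day, an average year of 365.25 days, and decimal units (1 TB = 1e6 MB).

```python
file_mb = 377            # size of one daily file, from the comment above
years = 16
days = years * 365.25    # assumption: one file per day, average year length
total_tb = file_mb * days / 1e6   # decimal units: 1 TB = 1e6 MB
print(f"{days:.0f} files, about {total_tb:.1f} TB")  # 5844 files, about 2.2 TB
```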

Simple yet representative things our users might want to do are:

  • get a timeseries at a single point
  • aggregations
    • global mean / variance of SST as a function of time
    • mean / variance over longitude and time (i.e. zonal mean)
    • monthly climatology
  • convolution (smoothing) over spatial dimensions
  • multidimensional Fourier transforms

There are of course more complicated workflows, but these are representative of the I/O bound ones.
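
For concreteness, here is a minimal NumPy sketch of several of the operations listed above, run on a small synthetic (time, lat, lon) cube. The array shape and the random data are assumptions standing in for the real SST product, the "monthly" grouping assumes exactly 30 time steps per month, and the means are not weighted by grid-cell area as a real global mean would be.

```python
import numpy as np

rng = np.random.default_rng(0)
n_years, n_lat, n_lon = 2, 18, 36          # toy sizes, not the real MUR grid
n_time = n_years * 12 * 30                 # assumption: 12 months of 30 days
sst = rng.normal(loc=15.0, scale=2.0, size=(n_time, n_lat, n_lon))

# Time series at a single grid point: one cell from every time step.
point_series = sst[:, 9, 18]

# Global mean and variance as a function of time (unweighted).
global_mean = sst.mean(axis=(1, 2))
global_var = sst.var(axis=(1, 2))

# Zonal mean: average over longitude and time, leaving a latitude profile.
zonal_mean = sst.mean(axis=(0, 2))

# Monthly climatology: average each calendar month over all years and days.
monthly_clim = sst.reshape(n_years, 12, 30, n_lat, n_lon).mean(axis=(0, 2))

# Smoothing: 3-point running mean along longitude, periodic at the dateline.
smoothed = (np.roll(sst, 1, axis=2) + sst + np.roll(sst, -1, axis=2)) / 3.0

# Fourier transform over the two spatial dimensions of every time step.
spectrum = np.fft.fft2(sst, axes=(1, 2))

print(point_series.shape, global_mean.shape, zonal_mean.shape, monthly_clim.shape)
```

The access patterns differ sharply: the point time series touches one cell in every time step, while the global mean reads every cell once, so a chunk layout that favours one tends to penalise the other.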

@mrocklin
Member

Dask.array can do all of these computations. The question would be how things feel when Dask accesses data from TileDB when doing these computations.

@stavrospapadopoulos

@mrocklin we are on the same page about adoption and TileDB data access using Dask.
@rabernat this is very helpful, we can start with that.

@jreadey

jreadey commented Feb 27, 2018

@rabernat - I like the strategy you outlined. Would it make sense to create a repo with a set of performance benchmarks based on this data collection? Ideally it would enable various I/O backends (e.g. HDF5 files, Fuse to S3, zarr, hsds, etc.) to be plugged in. This would seem a good framework to evaluate performance for these common use cases.

@jhamman
Member

jhamman commented Feb 27, 2018

@jreadey et al. I think a formal benchmarking repository would be a great contribution to the community (see also #45 and #5). This is something I'd be happy to see move forward and would eagerly participate in getting setup.

@jreadey

jreadey commented Feb 27, 2018

@rabernat - I was planning on doing some benchmarking with hdf5lib vs s3vfd vs hsds. I can start with some of the simpler codes and then ask you for help when I get stuck!

For now I'm thinking to keep it to just a single client (i.e. without dask distributed).

Is the Sea Temperature data on S3? Alternatively, we could use the DCP-30 dataset: s3://nasanex/NEX-DCP30. Any issues with that?

@rabernat
Member Author

Would it make sense to create a repo with a set of performance benchmarks based on this data collection?

Yes, I really like this idea. A modular system would be ideal and would save us from writing a lot of boilerplate.
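
One possible shape for such a modular harness is sketched below: each backend registers a reader under a name, and every registered backend is timed on the same task. All names here (`BACKENDS`, `register`, `run_benchmarks`, and the in-memory backend) are hypothetical; real backends would wrap HDF5, zarr, hsds, or TileDB readers behind the same signature.

```python
import time
from typing import Callable, Dict, List

# A backend is any callable returning the time series at grid point (y, x).
Reader = Callable[[int, int], List[float]]
BACKENDS: Dict[str, Reader] = {}


def register(name: str) -> Callable[[Reader], Reader]:
    """Decorator that plugs a reader into the benchmark suite under `name`."""
    def wrap(fn: Reader) -> Reader:
        BACKENDS[name] = fn
        return fn
    return wrap


# Stand-in data: 100 time steps on a 4 x 5 grid, held in memory.
_CUBE = [[[float(t + y + x) for x in range(5)] for y in range(4)] for t in range(100)]


@register("in-memory")
def read_in_memory(y: int, x: int) -> List[float]:
    return [step[y][x] for step in _CUBE]


def run_benchmarks(y: int, x: int, repeats: int = 3) -> Dict[str, float]:
    """Return the best wall-clock time in seconds per backend for one task."""
    results = {}
    for name, reader in BACKENDS.items():
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            reader(y, x)
            timings.append(time.perf_counter() - start)
        results[name] = min(timings)
    return results


if __name__ == "__main__":
    for name, seconds in run_benchmarks(y=2, x=3).items():
        print(f"{name}: {seconds:.6f} s")
```

Adding a storage layer then means writing one decorated function, and the timing loop itself could be swapped for a dedicated benchmarking tool without touching the backends.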

Ideally it would enable various I/O backends (e.g. HDF5 files, Fuse to S3, zarr, hsds, etc.) to be plugged in.

This is kind of what xarray already does! 😏 Unfortunately, we do not have direct xarray support for some of these storage layers (e.g. TileDB), so we can't just use xarray. We will end up reinventing some of xarray's backend logic, but I suppose that's acceptable.

Does it make sense to try to integrate airspeedvelocity or is that overkill?

@kaipak, a Pangeo intern here at Columbia, has some time to contribute to this benchmark project. @jreadey, it would be great if you two could work together.

@stale stale bot closed this as completed Jun 10, 2019
@scottyhq
Member

scottyhq commented Aug 14, 2019

Reopening this issue because I think there is a lot of great discussion and there are some new developments worth pointing out.

Thanks to @normanb TileDB support is now available in GDAL>3.0 (OSGeo/gdal#1402)! Documentation here: https://gdal.org/drivers/raster/tiledb.html.

The upcoming GDAL 3.1 release will also have better support for multidimensional raster data. See:

Once rasterio releases support GDAL>3 (conda-forge/gdal-feedstock#326), this should provide an easy way to bring TileDB data into an xarray dataset. See also rasterio/rasterio#1699.

I'd love to see a Zarr driver for GDAL as well, and given the overlap with TileDB, it seems like this would be fairly straightforward to implement (https://gdal.org/tutorials/raster_driver_tut.html). There has been a fair amount of discussion and interest on how to convert existing h5 and netcdf files to Zarr format (just one example here: #686). Maybe someone is already doing this? But it seems like there isn't much cross-talk between the zarr, GDAL, and rasterio communities. I'm happy to open some issues in the respective repos, but probably won't have time in the near future myself for a PR to GDAL.

pinging @jakebolewski, @shoyer, @jhamman, @davidbrochart, @rabernat, @lewismc

@scottyhq scottyhq reopened this Aug 14, 2019
@stale stale bot removed the stale label Aug 14, 2019
@scottyhq
Member

Also a quick illustration of how to convert h5 to TileDB and read it with GDAL:

conda create -n gdal3 gdal=3
wget https://github.com/OSGeo/gdal/raw/master/autotest/gdrivers/data/DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.h5
gdal_translate -of TileDB -SDS DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.h5 DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb
gdalinfo DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb

Currently trying to read over s3 throws an error:

gdalinfo /vsis3/pangeo-data-upload-virginia/gdal3-test/DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb
ERROR 1: [TileDB::StorageManager] Error: Cannot open array; Array does not exist
gdalinfo failed - unable to open '/vsis3/pangeo-data-upload-virginia/gdal3-test/DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb'.

I'm pretty sure this is because it's a folder, so maybe I need to point to a file under .tiledb? But I'm not sure how to read these, since it's a directory structure on S3:

DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb/
├── DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tdb.aux.xml
├── __1565818782587_1565818782587_4938da7e10f443a5bb583be9ad448ee1
│   ├── __fragment_metadata.tdb
│   ├── solar_zenith_angle.tdb
│   └── viewing_zenith_angle.tdb
├── __array_schema.tdb
└── __lock.tdb

@normanb

normanb commented Aug 15, 2019

@scottyhq the TileDB driver within GDAL uses its own methods to access S3, as opposed to the GDAL VSI approach.

To access the array try

gdalinfo -oo TILEDB_CONFIG=aws.config s3://pangeo-data-upload-virginia/gdal3-test/DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb

The GDAL driver code currently relies on there being a config file to determine S3 access: https://github.com/OSGeo/gdal/blob/master/gdal/frmts/tiledb/tiledbdataset.cpp#L908. This isn't ideal for public buckets with no config parameters. I will change this check to remove the need for the config file when I add the new GDAL multi-dimensional API, which is currently in master, to this driver: https://gdal.org/tutorials/multidimensional_api_tut.html.

The TileDB configuration parameters are described here - https://docs.tiledb.io/en/stable/tutorials/config.html#summary-of-parameters

If there are any problems then I am happy to work through them.

@scottyhq
Member

Thanks @normanb for the pointers. I'm still having trouble getting tiledb to work with a config file.
Full details here: https://gist.github.com/scottyhq/8222b99c3400209f96826d09389482c1

Reading the local file without a config file works (gdalinfo DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb):


Driver: TileDB/TileDB
Files: DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb/DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tdb
       DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb
       DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb/DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tdb.aux.xml
Size is 512, 512
Metadata:
  solar_zenith_angle_long_name=solar zenith angle
  solar_zenith_angle_standard_name=solar_zenith_angle
  solar_zenith_angle_units=degrees
  solar_zenith_angle_valid_range=0 90 
  solar_zenith_angle__FillValue=-999 
  viewing_zenith_angle_long_name=viewing zenith angle
  viewing_zenith_angle_units=degrees
  viewing_zenith_angle_valid_range=0 90 
  viewing_zenith_angle__FillValue=-999 
Subdatasets:
  SUBDATASET_1_NAME=TILEDB:"DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb":solar_zenith_angle
  SUBDATASET_1_DESC=[1x360x180] solar_zenith_angle (Float32)
  SUBDATASET_2_NAME=TILEDB:"DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb":viewing_zenith_angle
  SUBDATASET_2_DESC=[1x360x180] viewing_zenith_angle (Float32)
Corner Coordinates:
Upper Left  (    0.0,    0.0)
Lower Left  (    0.0,  512.0)
Upper Right (  512.0,    0.0)
Lower Right (  512.0,  512.0)
Center      (  256.0,  256.0)

I've tried the simple case of pointing TILEDB_CONFIG at a file containing a copy of the default config (CPL_DEBUG=ON gdalinfo -oo TILEDB_CONFIG=tiledb.config DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb), but then get a different error:


ERROR 4: `DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb' not recognized as a supported file format.
gdalinfo failed - unable to open 'DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb'.

@normanb

normanb commented Aug 28, 2019

@scottyhq I ran through the notebook you linked to. Your last command should be

gdalinfo -oo TILEDB_CONFIG=tiledb.config s3://pangeo-data-upload-virginia/gdal3-test/DeepBlue-SeaWiFS-1.0_L3_20100101_v004-20130604T131317Z.tiledb

and the config should include the following keys (and add your own values)

vfs.s3.aws_access_key_id  xxxxxx
vfs.s3.aws_secret_access_key xxxxx

The config is used to determine whether to access S3 or not, which is a bug (as noted above), since a config file can also be used locally. I will change that with my current updates.
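
A small helper can generate such a file. The sketch below assumes TileDB's plain-text config format of one whitespace-separated `key value` pair per line, and assumes the credentials live in the usual `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` environment variables. `write_tiledb_config` is a hypothetical helper, not part of GDAL or TileDB.

```python
import os


def write_tiledb_config(path, env=os.environ):
    """Write the two S3 credential keys to a TileDB-style config file."""
    settings = {
        "vfs.s3.aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "vfs.s3.aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }
    with open(path, "w") as fh:
        for key, value in settings.items():
            fh.write(f"{key} {value}\n")
    os.chmod(path, 0o600)  # the file holds secrets, so keep it private
    return path
```

Calling `write_tiledb_config("tiledb.config")` before running the `gdalinfo -oo TILEDB_CONFIG=tiledb.config s3://...` command above keeps the secrets out of shell history and out of the notebook.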

@stale

stale bot commented Oct 27, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Oct 27, 2019
@stale

stale bot commented Nov 3, 2019

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

@stale stale bot closed this as completed Nov 3, 2019
@JackKelly

Great conversations! Just wondering if there are still plans afoot to implement an Xarray backend for TileDB? (That would be wonderful!)

@Hoeze

Hoeze commented Feb 4, 2020

I would love having xarray + https://feedback.tiledb.com/tiledb-core/p/support-axes-labels
This way, we could have e.g. native-performance string indexing.

@rabernat
Member Author

rabernat commented Feb 4, 2020 via email

@JackKelly

I hear you, @rabernat! I've sent a tweet asking for folks to help implement a TileDB backend to Xarray :)

@petacube

petacube commented Feb 4, 2020 via email

@JackKelly

Awesome, thank you Stan!

@JackKelly

@petacube I'm also interested in tiledb's heterogeneous dimensions feature. Do you know roughly when it might be implemented? I found the feature listed in tiledb's roadmap, but I couldn't see it in the tiledb github issue queue.

@stavrospapadopoulos

@JackKelly this feature is currently under heavy development. It is blocked by #93. You can see some related merged PRs here. I expect the feature to be ready by early March.

@Ledenel

Ledenel commented May 18, 2020

Just want to mention that the heterogeneous dimensions feature has been implemented in the TileDB 2.0 release (1 May).

@JackKelly

Should we re-open this issue? (It was automatically closed due to inactivity. But maybe now's the time to re-open this issue, because heterogeneous dims are now implemented in TileDB?)

@JackKelly

@petacube would you still be interested in implementing a TileDB backend for xarray? It would be hugely useful! :)

@petacube

petacube commented Aug 17, 2020 via email

@JackKelly

Awesome! Yeah, I plan to benchmark TileDB vs Zarr vs cloud-optimised-GeoTIFF on a satellite dataset (on Google Cloud Storage) over the next few weeks.

@stavrospapadopoulos

Just FYI folks, here is a good thread about our plans to model netcdf data with TileDB, after great discussions with @rabernat and @DPeterK, which is very relevant to our TileDB-xarray integration. We will first share a proposal on a general spec for modeling labeled arrays (encompassing netcdf), as well as a proof implementation with our upcoming "axes labels" core TileDB feature. Eventually, this axes labels feature will drive the TileDB-xarray integration.

@jp-dark

jp-dark commented Mar 11, 2021

Just to update this discussion: We developed a read-only TileDB backend for xarray available here. This uses the new xarray backend plugin architecture that is soon-to-be-released.

@rabernat
Member Author

Thanks so much @jp-dark! So exciting!

Just curious how you recommend starting the transition to tiledb without a way to write files? Our community doesn't have any existing tiledb data. With Zarr, we can basically write code like this

ds = xr.open_dataset('file.nc')
ds.to_zarr('store.zarr')

...and convert all our existing data to a new format. What sort of workflow would you recommend for converting existing netCDF data to tiledb?

@jp-dark

jp-dark commented Mar 12, 2021

@rabernat This xarray plugin was developed for other customers that would like to visualize and manipulate data from existing TileDB arrays with xarray. A full-fledged NetCDF data model in TileDB (including an ingestor from NetCDF) and a specification for comment are coming soon.
