Cataloguing the data #19

mattjbr123 · 2024-10-09T15:17:39Z

Thread to start thinking about how we want to catalogue the data (and therefore integrate with EIDC).

mattjbr123 · 2024-10-09T15:25:51Z

Pangeo-forge used to have a catalogue page like this:

which, when you clicked on the dataset of interest, gave you some example code you could use in a notebook or script to access the data from where it sits on object storage:

I think this has now more or less been replaced by arraylake? (At least, it has not been accessible since the major rewrite of the pangeo-forge docs and the shift away from providing bakeries.)
But anyway, it could be a nice source of ideas. For example, we could integrate this code snippet into the EIDC catalogue pages, and/or have a drop-down list of cloud-compute-compatible datasets on datalabs which, when clicked, pre-populates a notebook with this code.
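As a rough sketch of what "pre-populating a notebook" could look like, the helper below builds the access snippet as text from a catalogue entry. Everything named here is an assumption for illustration: `make_access_snippet`, the bucket URL and the endpoint URL are hypothetical, and the generated code assumes the dataset is a Zarr store readable with `xarray`, `zarr` and `s3fs`.

```python
def make_access_snippet(store_url, endpoint_url=None, anon=True):
    """Return example Python code (as a string) for opening a Zarr store.

    Hypothetical helper: the catalogue page or datalabs would paste the
    returned text into a notebook cell.
    """
    # Options passed through fsspec to s3fs; anon=True means no credentials.
    storage_options = {"anon": anon}
    if endpoint_url is not None:
        # Non-AWS object stores need an explicit endpoint.
        storage_options["client_kwargs"] = {"endpoint_url": endpoint_url}

    return (
        "import xarray as xr\n"
        "\n"
        "ds = xr.open_zarr(\n"
        f"    {store_url!r},\n"
        f"    storage_options={storage_options!r},\n"
        ")\n"
        "ds\n"
    )


# Placeholder URLs, not real datasets.
snippet = make_access_snippet(
    "s3://example-bucket/example-dataset.zarr",
    endpoint_url="https://object-store.example.org",
)
print(snippet)
```

Generating the snippet as plain text keeps it independent of where it ends up, so the same function could feed both an EIDC catalogue page and a datalabs notebook template.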

mattjbr123 · 2024-10-23T17:28:55Z

Earthmover have released Icechunk, the version-control-of-ARCO aspect of arraylake. This does not contain the cataloguing features of arraylake (that remains something you can only get when paying for arraylake).

Is there a use for Icechunk in this project?
✔ Setup is simple
✔ All changes to datasets are tracked
✔ Storing N versions of a dataset doesn't take up O(N x dataset_size) of storage
✔ Could lend itself to provenance tracking

✖ Might be overkill for datasets that are never updated
✖ Another code library for users to get their heads around (examples would be essential to mitigate this, but it takes us a little away from the "use the data as if it were on disk" idea, which is easier with some boilerplate "load the data and get out of the way" fsspec code)
✖ Integrating icechunk with EIDC could be trickier, given icechunk does modify the storage format of the files
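
To make the storage point above concrete, here is a toy sketch of why versioning need not cost O(N x dataset_size): if each version is just a manifest pointing at chunks, unchanged chunks are shared between versions. This is not Icechunk's API or its actual implementation; `ToyVersionedStore` and the content-hashing approach are assumptions chosen purely for illustration.

```python
import hashlib


class ToyVersionedStore:
    """Toy chunk store with snapshots; NOT how Icechunk is implemented."""

    def __init__(self):
        self.chunks = {}     # content hash -> chunk bytes, shared by all versions
        self.snapshots = []  # one manifest per version: chunk key -> content hash

    def commit(self, dataset):
        """Record a new version; only previously unseen chunks are stored."""
        manifest = {}
        for key, data in dataset.items():
            digest = hashlib.sha256(data).hexdigest()
            self.chunks.setdefault(digest, data)
            manifest[key] = digest
        self.snapshots.append(manifest)
        return len(self.snapshots) - 1  # version id

    def read(self, version, key):
        """Read a chunk as it was in the given version."""
        return self.chunks[self.snapshots[version][key]]

    def stored_bytes(self):
        return sum(len(chunk) for chunk in self.chunks.values())


store = ToyVersionedStore()

# Version 0: ten chunks of 1000 bytes each.
v0_data = {f"temp/{i}": bytes([i]) * 1000 for i in range(10)}
v0 = store.commit(v0_data)

# Version 1: identical except for one rewritten chunk.
v1_data = dict(v0_data)
v1_data["temp/3"] = b"\xff" * 1000
v1 = store.commit(v1_data)

# Two versions are kept, but only one extra chunk was stored.
print(store.stored_bytes())
```

Both versions stay readable, while the second commit adds only the one changed chunk rather than a second full copy.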

Spinning off into a new issue #25 as related but tangential to cataloguing

mattjbr123 added this to Gridded data conversion initial product Oct 9, 2024

mattjbr123 converted this from a draft issue Oct 9, 2024

mattjbr123 self-assigned this Oct 9, 2024

mattjbr123 mentioned this issue Oct 23, 2024

Data Version Control of data in the object store #25

Open
