Cataloguing the data #19

Open
mattjbr123 opened this issue Oct 9, 2024 · 2 comments
@mattjbr123
Collaborator

mattjbr123 commented Oct 9, 2024

Thread to start thinking about how we want to catalogue the data (and therefore integrate with EIDC).

@mattjbr123 mattjbr123 converted this from a draft issue Oct 9, 2024
@mattjbr123 mattjbr123 self-assigned this Oct 9, 2024
@mattjbr123
Collaborator Author

Pangeo-forge used to have a catalogue page like this:

[Image: the former Pangeo-forge catalogue page]

Clicking on the dataset of interest would then give you some example code that you could use in a notebook or script to access the data from where it sits on object storage:

[Image: example data-access code shown for a dataset]

I think this has now kinda been replaced by arraylake? (At least, it's no longer accessible since the major rewrite of the pangeo-forge docs and the shift away from providing bakeries.)
But anyway, it could provide a nice starting point for ideas. For example, we could integrate this code snippet into the EIDC catalogue pages, and/or have a drop-down list of cloud-compute-compatible datasets on datalabs which, when clicked, pre-populates a notebook with this code.
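
As a rough illustration of the "pre-populate a notebook" idea, here is a minimal sketch of a helper that turns a catalogue entry into a ready-to-paste access snippet. Everything named here is an assumption for illustration only: `make_access_snippet` is a hypothetical helper, the bucket URL is made up, and the generated code assumes the dataset is a consolidated Zarr store on S3-compatible object storage that can be read anonymously via fsspec and xarray.

```python
from textwrap import dedent


def make_access_snippet(title: str, store_url: str, anonymous: bool = True) -> str:
    """Return notebook-ready Python code that opens a Zarr store on object storage.

    Hypothetical helper: the title and URL would come from a catalogue record.
    """
    return dedent(f"""\
        # {title}
        import fsspec
        import xarray as xr

        # Map the object store to a dict-like interface that xarray can read.
        store = fsspec.get_mapper({store_url!r}, anon={anonymous})
        ds = xr.open_zarr(store, consolidated=True)
        ds
        """)


# Example with a made-up dataset title and bucket URL.
snippet = make_access_snippet("Example dataset", "s3://example-bucket/example.zarr")
print(snippet)
```

The returned string could be shown on a catalogue page or written into the first cell of a new notebook. Generating it from the catalogue record, rather than hand-writing it per dataset, would keep the snippet and the record in sync.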

@mattjbr123
Collaborator Author

mattjbr123 commented Oct 23, 2024

Earthmover have released Icechunk, the version-control-of-ARCO aspect of arraylake. This does not contain the cataloguing features of arraylake (that remains something you can only get when paying for arraylake).

Is there a use for Icechunk in this project?
✔ Setup is simple
✔ All changes to datasets are tracked
✔ N new versions of the dataset don't take up O(N × dataset_size) storage
✔ Could lend itself to provenance tracking

✖ Might be overkill for datasets that are never updated
✖ Another code library for users to get their heads around. Examples would be essential to mitigate this, but it takes us a little away from the "use the data as if it were on disk" idea, which is easier with some boilerplate "load the data and get out of the way" fsspec code
✖ Integrating Icechunk with EIDC could be trickier, given that Icechunk does modify the storage format of the files
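
To put rough numbers on the storage point in the list above, here is a back-of-the-envelope sketch. It assumes that each new version only writes the chunks that changed and ignores any metadata overhead. The dataset size, number of updates and changed fraction are hypothetical, and the function names are illustrative only.

```python
def full_copy_storage(dataset_size_gb: float, n_versions: int) -> float:
    """Storage if every new version is saved as a complete copy of the dataset."""
    return dataset_size_gb * (1 + n_versions)


def changed_chunk_storage(
    dataset_size_gb: float, n_versions: int, changed_fraction: float
) -> float:
    """Storage if each new version only writes the chunks that changed."""
    return dataset_size_gb * (1 + n_versions * changed_fraction)


# Hypothetical case: a 1000 GB dataset with 10 updates, each touching 2% of the chunks.
print(full_copy_storage(1000, 10))
print(changed_chunk_storage(1000, 10, 0.02))
```

Under these assumptions, full copies would need about 11,000 GB while chunk-level versioning would need about 1,200 GB. For a dataset that is never updated the two are identical, which is the "might be overkill" case.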

Spinning off into a new issue #25 as related but tangential to cataloguing
