[Catalog utility functions] find_chunking_info #218

Open
Thomas-Moore-Creative opened this issue Oct 14, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@Thomas-Moore-Creative

Thomas-Moore-Creative commented Oct 14, 2024

Is your feature request related to a problem? Please describe.

Using settings like xarray_open_kwargs effectively requires understanding the underlying NetCDF data structure, and that in turn requires discovering the native file chunking.
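
As a rough illustration of why native chunking matters here: once the on-disk chunk shape is known, the `chunks` entry passed through xarray_open_kwargs can be chosen as whole multiples of it, so that each dask chunk covers complete on-disk chunks. The helper name `aligned_chunks`, the dimension names, and the chunk sizes below are all hypothetical.

```python
def aligned_chunks(native, multiples):
    """Scale native (on-disk) chunk sizes by whole-number factors.

    native:    dict mapping dimension name -> on-disk chunk length
    multiples: dict mapping dimension name -> integer factor (default 1)
    """
    chunks = {}
    for dim, size in native.items():
        factor = multiples.get(dim, 1)
        if not isinstance(factor, int) or factor < 1:
            raise ValueError(f"factor for {dim!r} must be a positive integer")
        chunks[dim] = size * factor
    return chunks


# Hypothetical native chunking for one variable (illustrative values only).
native = {"time": 1, "lat": 300, "lon": 300}

# Read 12 time steps per dask chunk, keeping the native lat/lon chunking.
xarray_open_kwargs = {"chunks": aligned_chunks(native, {"time": 12})}
print(xarray_open_kwargs)
```

A dict like this could then be handed to the catalog's loading call, assuming that call accepts an `xarray_open_kwargs` argument as intake-esm's `to_dataset_dict` does.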

Describe the feature you'd like

  1. a place for the community to help build utility functions that would support the https://github.com/ACCESS-NRI/access-nri-intake-catalog
  2. a specific function to "find the native chunking information" for a dataset in the catalog, simple enough for first-time users to use and understand
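
A minimal sketch of what such a function might look like, assuming the dataset was opened with an xarray backend (such as netCDF4) that records on-disk chunking under the `chunksizes` key of each variable's `.encoding`. This is not the ACDtools implementation; the signature and return shape are assumptions, and a small stand-in object replaces a real dataset so the example runs without any files.

```python
from types import SimpleNamespace


def find_chunking_info(ds):
    """Return {variable name: {dimension: on-disk chunk length}}.

    `ds` is anything with a `data_vars` mapping whose values expose `.dims`
    and `.encoding`, e.g. an xarray.Dataset. Variables without recorded
    chunk sizes (contiguous storage, or a backend that does not report
    them) map to None.
    """
    info = {}
    for name, var in ds.data_vars.items():
        chunksizes = var.encoding.get("chunksizes")
        if chunksizes is None:
            info[name] = None
        else:
            info[name] = dict(zip(var.dims, chunksizes))
    return info


# Stand-in for an opened dataset (made-up values). With real data this would
# be something like: ds = xarray.open_dataset(path, engine="netcdf4")
fake_ds = SimpleNamespace(
    data_vars={
        "temp": SimpleNamespace(
            dims=("time", "lat", "lon"),
            encoding={"chunksizes": (1, 300, 300)},
        ),
        "area": SimpleNamespace(dims=("lat", "lon"), encoding={}),
    }
)

print(find_chunking_info(fake_ds))
```

Returning a plain dict keeps the result easy to inspect and to feed into a table or into a `chunks` setting.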

Describe alternatives you've considered

Writing my own pre-alpha functions, like find_chunking_info, here: https://github.com/Thomas-Moore-Creative/ACDtools/blob/main/ACDtools. But I'd like a place to collaborate on utilities that is easier for the whole community to see and share.

Additional context

  • I'm a user of the data.
  • I'm not trained professionally as a software engineer, but I'm working on it.
  • I'm using the tabulate function (from tabulate import tabulate) to display simple tables in the Jupyter UI.
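
For illustration, chunking information could be turned into a simple table with tabulate roughly as follows. The values are made up, and the helper name `chunking_rows` and the column headers are assumptions. Because tabulate is a third-party package, the sketch falls back to plain printing if it is not installed.

```python
try:
    from tabulate import tabulate
except ImportError:  # tabulate is a third-party package and may be absent
    tabulate = None


def chunking_rows(info):
    """Flatten {variable: {dim: chunk length} or None} into table rows."""
    rows = []
    for name, chunks in info.items():
        if chunks is None:
            rows.append([name, "-", "not chunked / unknown"])
        else:
            dims = ", ".join(chunks)
            sizes = " x ".join(str(n) for n in chunks.values())
            rows.append([name, dims, sizes])
    return rows


# Made-up chunking information for two variables.
info = {
    "temp": {"time": 1, "lat": 300, "lon": 300},
    "area": None,
}

headers = ["variable", "dimensions", "native chunks"]
rows = chunking_rows(info)

if tabulate is not None:
    print(tabulate(rows, headers=headers, tablefmt="github"))
else:
    for row in [headers] + rows:
        print(" | ".join(row))
```

In a notebook, `tablefmt="html"` is another of tabulate's built-in formats and may display more nicely than the text table printed here.
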
@aidanheerdegen
Member

Hola @Thomas-Moore-Creative,

This sounds cool. I'm thinking the best place for something like this might be a repo on the ACCESS Community Hub organisation. It makes it more straightforward to collaborate as we could make you an admin on the repo.

https://github.com/ACCESS-Community-Hub

Then we could point to it from this repo.

Does that sound like a good way forward?

@Thomas-Moore-Creative
Author

Sounds fine to me @aidanheerdegen - thanks. Do you, @rbeucher, @dougiesquire, or any of your software engineering gurus have advice on how to structure this repo so it's portable, flexible, and available to all on NCI?

@aidanheerdegen
Member

Dougie is on leave, so he's out of the picture.

To make it available on gadi I'd say we should add conda packaging. We could also arrange to publish it to the accessnri anaconda channel, or create another access community channel.

We can deal with that later.

As for repo structure, first decision might be flat layout vs src layout, and then isolate functionality in sub-directories.

Is that the sort of thing you were thinking about @Thomas-Moore-Creative?

Do you have any opinions @marc-white?

@marc-white
Collaborator

I think the main thing to determine is a question of scope. What exactly are you trying to do? Is it just doing some stuff to work out the native chunking of netCDF files, or are you looking to expand this to include more tools later down the track?

Then, once you've worked out the answer to that question, that will inform your answer to the next question: should this come in as a part of access-nri-intake-catalog, or should it be spun off into its own utility package?

@Thomas-Moore-Creative
Author

Thanks @marc-white.

What I'm trying to do is get my projects done, which requires using the access-nri-intake-catalog, and for me that means data discovery, building search filters, and understanding data structure to allow optimal analysis-ready-data workflows to be built for specific datasets.

I highlighted just one type of very simple utility that I'm building ("find the native chunking information") in this issue, but I am wondering out loud if there is a better place to be developing helper utilities than my personal repos. Maybe the questions are:

  • does ACCESS-NRI think utilities to help users discover and load data from the access-nri-intake-catalog are a real and general need for the community?
  • where and how (in what repo) might ACCESS-NRI want these functions to be collaborated on, if at all?

@charles-turner-1
Collaborator

charles-turner-1 commented Oct 16, 2024

> As for repo structure, first decision might be flat layout vs src layout, and then isolate functionality in sub-directories.

I'd recommend going with src layout: it seems to be Dougie's preferred layout, and it would keep things consistent with this package itself and the related intake-dataframe-catalog.

As to whether this should be included within access-nri-intake-catalog or as a standalone package, I would suggest the latter. Lots of the functionality of the catalog, e.g. loading datasets, is actually performed by intake-esm, and I suspect that this might cause complications. My vote would be for a separate package, something like access-intake-utils, and then we try to keep the interdependencies as minimal as possible.
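
For concreteness, a src layout for such a package might look something like the tree below. Only the name access-intake-utils comes from this thread; the module and file names are hypothetical placeholders.

```text
access-intake-utils/
├── pyproject.toml
├── README.md
├── src/
│   └── access_intake_utils/
│       ├── __init__.py
│       └── chunking.py        (e.g. find_chunking_info)
└── tests/
    └── test_chunking.py
```

In a flat layout, the access_intake_utils directory would instead sit directly at the repository root, next to pyproject.toml.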

@Thomas-Moore-Creative
Author

> My vote would be for a separate package - something like access-intake-utils - and then we try to keep the interdependencies as minimal as possible.

From a user's point of view this makes sense to me. Thanks for the advice.

Can we start with an access-intake-utils repo in https://github.com/ACCESS-Community-Hub, as suggested by @aidanheerdegen above?

@rbeucher
Member

I agree with @charles-turner-1. A separate package is the way to go for now. Feel free to start in ACCESS-Community-Hub.

Projects
Status: Backlog
Development

No branches or pull requests

5 participants