Data organization #45
Great thoughts @orianac! @jhamman and I started thinking a bit about the organization of the azure directory. With cleanup helper functions, rechunker results could be cleaned up at the end of a flow with fsspec. Isolating the 'temp' products in their own read/write bucket might save us from accidental deletions of the obs/gcm data. A very rough schema:

Would love some input!
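To make the fsspec cleanup idea concrete, here's a minimal sketch of an end-of-flow helper. The container/prefix is a placeholder rather than a settled schema, and it assumes adlfs is installed and Azure credentials are available in the environment:

```python
# Hypothetical sketch: wipe rechunker temp stores at the end of a flow.
# The "az://cmip6/temp/" prefix is a placeholder, not a final layout.
# Assumes adlfs is installed and Azure credentials are configured.
import fsspec


def cleanup_temp_stores(prefix: str = "az://cmip6/temp/") -> None:
    """Recursively delete everything under the temp prefix."""
    fs, _, paths = fsspec.get_fs_token_paths(prefix)
    for path in paths:
        if fs.exists(path):
            # recursive delete of all temp zarr stores under the prefix
            fs.rm(path, recursive=True)
```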
Curious how we're thinking about differentiating between …
Proposal to have 3 buckets:

…
We also need to put results somewhere. Maybe:

…

Though the …
Thanks for the feedback. Does this structure capture everything? Everything has read/write/delete permissions, except for the input data.
Small comment: should the …?
They definitely can be! If we wanted to add more input data, would we temporarily change the permissions?
The containers themselves will always have read/write permissions. What we're really talking about is, as a matter of practice, using read-only access methods (credentials or otherwise) for certain storage spaces. I think in most cases we can access the cmip6 and training containers without credentials, which will limit us to read-only. For other containers, we should probably be using custom SAS tokens for specific applications.
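For illustration, here's a rough sketch of the SAS-token pattern using azure-storage-blob and adlfs. The account name and key are placeholders (they'd come from a secret store in practice), and this is one possible approach rather than a settled recipe:

```python
# Sketch of scoped, read-only access via a container SAS token.
# Assumes azure-storage-blob and adlfs are installed; the account
# name/key below are placeholders, not real credentials.
from datetime import datetime, timedelta, timezone

import fsspec
from azure.storage.blob import ContainerSasPermissions, generate_container_sas

sas_token = generate_container_sas(
    account_name="myaccount",      # placeholder
    container_name="cmip6",
    account_key="<account-key>",   # placeholder; keep out of source control
    permission=ContainerSasPermissions(read=True, list=True),  # read-only
    expiry=datetime.now(timezone.utc) + timedelta(hours=8),
)

# adlfs accepts the SAS token directly; writes through this fs will fail
fs = fsspec.filesystem("az", account_name="myaccount", sas_token=sas_token)
print(fs.ls("cmip6/"))
```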
Coming back to this issue, I'd like to get a few things finalized. In particular, there's some additional cleanup needed in the …

Then, can we work on moving the following paths to the …?

These last three will require some path updates to the data catalogs in this repo, as well as the two catalogs listed above. cc @andersy005 for visibility on the catalog side.
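On the catalog side, here's a hedged sketch of how one might audit an intake catalog for entries still pointing at an old container after a move. The catalog path, the old prefix, and the assumption that each entry exposes its `urlpath` via `describe()` are all illustrative:

```python
# Hypothetical audit: flag catalog entries that still reference the
# old container. Catalog path and prefix below are placeholders.
import intake

cat = intake.open_catalog("catalogs/master.yaml")  # placeholder path
for name in list(cat):
    entry = cat[name]
    # describe() returns the entry's metadata, including its driver args
    urlpath = str(entry.describe().get("args", {}).get("urlpath", ""))
    if "az://cmip6/temp/" in urlpath:  # illustrative old prefix
        print(f"{name} still points at the old container: {urlpath}")
```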
As a way to offload my thinking about caching / data store organization / clean-up routines (@norlandrhagen @jhamman @tcchiao), here are some thoughts on how the data inputs and outputs for the bcsd workflow are currently organized, and how that system could be improved. Hopefully this can complement whatever scheme you're setting up! The outputs fall into two groups:
- `az://cmip6/intermediates/` (the stores are generated by `make_bcsd_flow_paths()`): these are the intermediate steps that are most likely to be shared by multiple runs, and they are also the downstream outputs of the more time-intensive steps at natural breakpoints in the processing. These are the ones that make sense to cache, and that is already set up for you, since the inputs/outputs of all tasks are strings (see the sketch below).
- `az://cmip6/temp/`: the rechunked versions of the datasets could potentially be useful to multiple runs (e.g., obs on the same GCM grid but a different GCM, or a different downscaling method); however, there is a trade-off in keeping them around. So I think the stuff stored in `az://cmip6/temp/` (or wherever you move it to) could be wiped at the termination of a run. But I wanted to highlight this as another potential optimization point.
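To illustrate why string inputs/outputs make caching straightforward, here's a minimal sketch of path-based caching: a task can return its target URI immediately when the store already exists. The function and paths are hypothetical, not taken from the repo:

```python
# Sketch of path-based caching: because tasks pass zarr store URIs
# (strings) rather than datasets, a task can skip recomputation when
# its target already exists. Names below are illustrative.
import fsspec
import xarray as xr


def maybe_compute(source_path: str, target_path: str) -> str:
    """Return target_path, computing and writing it only if absent."""
    fs, _, _ = fsspec.get_fs_token_paths(target_path)
    if fs.exists(target_path):
        return target_path  # cache hit: reuse the intermediate store

    ds = xr.open_zarr(fsspec.get_mapper(source_path))
    result = ds  # placeholder for the real processing step
    result.to_zarr(fsspec.get_mapper(target_path), mode="w")
    return target_path
```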