Data organization #45

Open · orianac opened this issue Nov 23, 2021 · 9 comments
orianac (Member) commented Nov 23, 2021

As a way to offload my thinking about caching/data store organization/clean-up routines (@norlandrhagen @jhamman @tcchiao), here are some thoughts about how the data inputs and outputs of the bcsd workflow are currently organized and how that system could be improved. Hopefully this complements whatever scheme you're setting up!

  • Currently there is a set of stores in az://cmip6/intermediates/ (generated by make_bcsd_flow_paths()). These are the intermediate products most likely to be shared by multiple runs, and they sit at natural breakpoints downstream of the most time-intensive processing steps. They are the ones that make sense to cache, and caching already works because the inputs/outputs of all tasks are strings (see the sketch below this list).
  • There is also a set of temporary stores (many of them, as a result of the rechunking routines), written under randomly generated names in a temporary area, az://cmip6/temp/. The rechunked versions of the datasets could potentially be useful to multiple runs (e.g. the same obs on the same GCM grid but a different GCM, or a different downscaling method); however, there is a trade-off in keeping them around. So I think the stuff stored in az://cmip6/temp/ (or wherever you move it to) could be wiped at the termination of a run, but I wanted to highlight it as another potential optimization point.
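A hypothetical illustration of the first point (the real helper is make_bcsd_flow_paths(); the names below are made up): because each task's output location is a plain string computed from the run parameters, any flow that asks for the same parameters resolves to the same store and can reuse it.

```python
# Illustrative only -- not the real make_bcsd_flow_paths(). The point is
# that a store path is a pure function of the run parameters, so repeated
# runs with the same parameters hit the same (cached) store.
def make_intermediate_path(gcm: str, obs: str, step: str) -> str:
    return f"az://cmip6/intermediates/{step}/{gcm}_{obs}.zarr"


# e.g. two flows sharing a GCM/obs pair reuse the same coarsened-obs store
path = make_intermediate_path("MIROC6", "ERA5", "coarsen_obs")
```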
norlandrhagen (Contributor) commented

Great thoughts @orianac! @jhamman and I have started thinking about the organization of the Azure directory a bit. With cleanup helper functions, rechunker results could be removed at the end of a flow with fsspec (see the sketch below). Isolating the 'temp' products in a bucket with read/write permissions might save us from accidental deletions of obs/GCM data.
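For concreteness, a minimal version of such a helper (a sketch assuming adlfs is installed and Azure credentials are configured in the environment):

```python
import fsspec


def cleanup_temp(prefix: str = "az://cmip6/temp/") -> None:
    """Remove all temporary rechunker stores under `prefix` at the end of a flow."""
    fs, _, paths = fsspec.get_fs_token_paths(prefix)
    for path in fs.ls(paths[0]):
        fs.rm(path, recursive=True)
```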

A very rough schema:

  • scratch/ (bucket with read/write/delete permissions)
    • rechunker_results (currently temp/)
    • intermediate_products (flow outputs)
  • observation_data/ (bucket with read-only access)
    • ERA5
    • ERA5 daily
    • CR2MET
    • etc.
  • CMIP6/ (bucket with read-only access)
    • CMIP6
    • ScenarioMIP
  • prefect/ (bucket for Prefect cloud storage)

Would love some input!

tcchiao (Contributor) commented Nov 29, 2021

Curious how we're thinking about differentiating between the scratch/intermediate_products, scratch/rechunker_results, and prefect buckets? For example, is prefect/ for storing final results only? Should all disposable intermediate material go into rechunker_results even when it doesn't come from the rechunker?

tcchiao (Contributor) commented Dec 3, 2021

Proposal to have 3 buckets:

  1. intermediary -- intermediary outputs that we might want to inspect (e.g. bias correction results). Files should live at identifiable paths (e.g. {gcm}_{obs}_{method}.zarr). No automatic cleaning.
  2. cache -- anything we might want to cache to save time but wouldn't want to inspect (e.g. rechunker output). File paths can be random strings. No automatic cleaning.
  3. scratch -- things that can be deleted after each model run or each week (e.g. rechunker intermediate output). Can have automatic cleaning (see the sketch below).
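A sketch of what scratch's automatic cleaning could look like, assuming the backend implements fsspec's modified() and using a placeholder bucket name:

```python
from datetime import datetime, timedelta, timezone

import fsspec


def sweep_scratch(prefix: str = "az://scratch/", max_age_days: int = 7) -> None:
    """Delete scratch entries untouched for more than `max_age_days`."""
    fs, _, paths = fsspec.get_fs_token_paths(prefix)
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    for path in fs.ls(paths[0]):
        # assumes modified() is implemented and returns an aware datetime;
        # zarr stores are directories, so a real version might instead check
        # the newest file inside each store
        if fs.modified(path) < cutoff:
            fs.rm(path, recursive=True)
```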

orianac (Member, Author) commented Dec 3, 2021

We also need to put results somewhere. Maybe:

results/
├── data/
│   ├── daily/
│   ├── monthly/
│   └── annual/
└── analyses/
    ├── qaqc/
    └── metrics/

Though results/data would probably end up being housed in some other bucket, for now we could work under that structure. Thoughts?
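A tiny hypothetical helper for that layout (bucket and group names are placeholders):

```python
def results_path(kind: str, group: str, name: str, root: str = "az://results") -> str:
    """e.g. results_path('data', 'monthly', 'MIROC6_ERA5') -> az://results/data/monthly/MIROC6_ERA5.zarr"""
    return f"{root}/{kind}/{group}/{name}.zarr"
```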

norlandrhagen (Contributor) commented Dec 3, 2021

Thanks for the feedback. Does this structure capture everything? Everything has read / write / delete permissions, except for the input data.

.
├── flow_outputs (read / write / delete)
│   ├── cache
│   ├── intermediary
│   │   ├── {gcm}_{obs}.zarr
│   │   └── {gcm}_{obs}_{method}.zarr
│   └── scratch
├── inputs (read / write)
│   ├── CMIP6
│   ├── scenariomip
│   ├── CR2MET
│   ├── ERA5
│   └── ERA5_daily
├── prefect (read / write / delete)
└── results (read / write / delete)
    ├── analyses
    │   ├── metrics
    │   └── qaqc
    └── data
        ├── annual
        ├── daily
        └── monthly
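One hypothetical way to encode those access modes next to the layout, so per-container credentials could be generated from a single source of truth:

```python
# Intended access mode per top-level container, mirroring the tree above.
# Note: "inputs" is read/write here, but see the read-only discussion below.
CONTAINER_PERMISSIONS = {
    "flow_outputs": {"read", "write", "delete"},
    "inputs": {"read", "write"},
    "prefect": {"read", "write", "delete"},
    "results": {"read", "write", "delete"},
}
```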

        

jhamman (Contributor) commented Dec 3, 2021

Small comment: should the CMIP6 and input_data buckets be read-only?

norlandrhagen (Contributor) commented

They definitely can be! If we wanted to add more input data, would we temporarily change the permissions?

jhamman (Contributor) commented Dec 6, 2021

The containers themselves will always have read/write permissions. What we're really talking about is, as a matter of practice, using read-only access methods (credentials or otherwise) to access certain storage spaces.

I think in most cases, we can access the cmip6 and training containers without credentials, which will limit us to read-only. For other containers, we should probably be using custom SAS tokens for specific applications.
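In fsspec/adlfs terms that might look like the following (account name and token are placeholders):

```python
import fsspec

# No credentials: public containers like cmip6 can be read anonymously,
# which in practice limits us to read-only access.
fs_ro = fsspec.filesystem("az", account_name="<account>")

# A SAS token minted for a specific application scopes what that job can touch.
fs_rw = fsspec.filesystem("az", account_name="<account>", sas_token="<application-scoped-sas>")
```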

jhamman (Contributor) commented Mar 25, 2022

Coming back to this issue, I'd like to get a few things finalized. In particular, there's some additional cleanup needed in the cmip6 container. @norlandrhagen, can you confirm we aren't using the following paths, and then remove them:

  • CR2MET_pr_v2.0_day_1979_2020_005deg
  • CR2MET_t2m_v2.0_day_1979_2020_005deg
  • CR2MET_tmax_v2.0_day_1979_2020_005deg
  • CR2MET_tmin_v2.0_day_1979_2020_005deg
  • intermediates
  • results
  • temp

Then can we work on moving the following paths to the training container:

  • cmip6/ERA5 -> training/ERA5
  • cmip6/ERA5_daily -> training/ERA5_daily
  • cmip6/ERA5_RDA -> training/ERA5_RDA
  • cmip6/ERA5_catalog.json -> training/ERA5_catalog.json
  • cmip6/ERA5_catalog.csv -> training/ERA5_catalog.csv

These last three will require some path updates to the data catalogs in this repo as well as the two catalogs listed above.
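A sketch of the move with fsspec, assuming both containers live in the same storage account and the credentials allow reading cmip6 and writing training:

```python
import fsspec

fs = fsspec.filesystem("az", account_name="<account>", sas_token="<token>")

moves = {
    "cmip6/ERA5": "training/ERA5",
    "cmip6/ERA5_daily": "training/ERA5_daily",
    "cmip6/ERA5_RDA": "training/ERA5_RDA",
    "cmip6/ERA5_catalog.json": "training/ERA5_catalog.json",
    "cmip6/ERA5_catalog.csv": "training/ERA5_catalog.csv",
}
for src, dst in moves.items():
    fs.copy(src, dst, recursive=True)
    # verify the copy before deleting the source: fs.rm(src, recursive=True)
```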

cc @andersy005 for visibility on the catalog side.
