Brainstorming about overwriting / appending / etc #29

Open · rabernat opened this issue Aug 9, 2021 · 2 comments

Comments

rabernat (Contributor) commented Aug 9, 2021

As noted in pangeo-forge/staged-recipes#67, we need to resolve some details about how bakeries should handle the scenario where data already exists in a target location.

An important principle is that the recipe object, as provided by the recipe box (github repo), should not explicitly know about its target. The recipe describes the source data and how to transform it into ARCO. Therefore, the recipe should not care whether the target data already exists. That's the bakery's problem.

I made this flow chart to start thinking through the logic.

[Flowchart: Pangeo Forge Target Update Flow]

Edit link:
https://lucid.app/lucidchart/invitations/accept/inv_2d95a404-c359-457e-b472-ae10f268e662?viewport_loc=-153%2C372%2C1889%2C1012%2C0_0

Consequences

If we accept this flow, it raises several important points that we will have to address in the implementation.

  • We should have a force-overwrite option. But where would it get set, and what would trigger a force-overwrite situation?
  • Bakeries need to be able to manually clear their targets, which means we need code that handles this. Should that become part of the Prefect flow, or should it happen outside of Prefect? This part is still unclear to me. (A rough sketch of one possibility follows this list.)
  • Recipes need to become "append aware". A recipe class (e.g. XarrayZarrRecipe) should declare that it knows how to handle appending, and we need a way to tell the recipe where to pick up. A common scenario would be a continuously updating timeseries. The easiest unit for appending is the chunk: we could specify a subset of chunks, similar to what we do when we call prune. (See the second sketch below.)
  • In that scenario, we would also need a way to persist a count of which chunks have already been stored. I see two choices:
    • inside the target data itself, e.g. in the xarray metadata, or
    • in some external database; perhaps we could use the metadata cache for this?
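
To make the "clear the target" bullet a bit more concrete, here is a rough sketch of what a bakery-side step might look like, assuming an fsspec-compatible target and a Prefect 1.x flow. The force_overwrite flag and the task itself are hypothetical, not existing pangeo-forge API:

```python
import fsspec
from prefect import Flow, task


@task
def maybe_clear_target(target_url: str, force_overwrite: bool = False) -> None:
    """Bakery-side step: remove existing data at the target before the recipe runs.

    The recipe itself never sees the target, consistent with the principle above.
    """
    fs, _, (path,) = fsspec.get_fs_token_paths(target_url)
    if fs.exists(path):
        if not force_overwrite:
            raise ValueError(
                f"{target_url} already exists; pass force_overwrite=True to clear it"
            )
        fs.rm(path, recursive=True)


with Flow("bakery-run") as flow:
    maybe_clear_target("s3://my-bakery/target.zarr", force_overwrite=True)
    # ...the recipe's own tasks would be chained after this...
```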
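
And a sketch of how the "append aware" piece and the stored-chunk count could fit together if we kept the count in the target's own metadata (the first choice above). The attribute name and the is_appendable / copy_appending names are hypothetical, just meant to mirror the existing copy_pruned pattern:

```python
import fsspec
import xarray as xr

# Hypothetical attribute in which the bakery records how far the recipe has written.
CHUNKS_WRITTEN_ATTR = "pangeo_forge:chunks_written"


def get_chunks_written(target_url: str) -> int:
    """Return how many chunks have already been stored in the target (0 if none)."""
    fs, _, (path,) = fsspec.get_fs_token_paths(target_url)
    if not fs.exists(path):
        return 0
    ds = xr.open_zarr(target_url)
    return int(ds.attrs.get(CHUNKS_WRITTEN_ATTR, 0))


# The bakery could then hand the resume point to an append-aware recipe, e.g.:
#
#   if recipe.is_appendable:
#       recipe = recipe.copy_appending(start_chunk=get_chunks_written(target_url))
```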

cc @cisaacstern @sharkinsspatial

cisaacstern (Member) commented Aug 16, 2021

Noting that in today's (2021-08-16) Pangeo Forge Coordination Meeting, we decided to start with a default overwrite/rewrite policy. This may require bakery operators to manually delete existing data, or could perhaps entail targets being versioned, to allow for rewriting rather than overwriting. Further details in the meeting minutes.

simonff commented Aug 17, 2021

Drive-by comment based on the Earth Engine experience: many datasets have a formal notion of data generations. E.g., early / provisional / permanent climate data that get recomputed as more precise inputs arrive, or the RT tier vs. the T1/T2 tiers for Landsat. We have an internal system for annotating such datasets so that overwrites are allowed only when the new data belong to a newer generation.

Another consideration is file mtime, if available. If you put the max mtime of the input files into the metadata of the ingestion result, you can make an intelligent decision about whether to overwrite during the next run.
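
For example, a bakery could record and check something like the following (a sketch only; the attribute name is made up, and fs.modified is not implemented for every fsspec filesystem):

```python
import fsspec
import xarray as xr

MAX_MTIME_ATTR = "source_max_mtime"  # hypothetical attribute on the ingestion result


def max_input_mtime(input_urls) -> float:
    """Latest modification time across the input files, as a POSIX timestamp."""
    mtimes = []
    for url in input_urls:
        fs, _, (path,) = fsspec.get_fs_token_paths(url)
        mtimes.append(fs.modified(path).timestamp())
    return max(mtimes)


def needs_overwrite(target_url: str, input_urls) -> bool:
    """True if any input file is newer than what the existing target was built from."""
    ds = xr.open_zarr(target_url)
    return max_input_mtime(input_urls) > ds.attrs.get(MAX_MTIME_ATTR, 0.0)
```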
