As noted in pangeo-forge/staged-recipes#67, we need to resolve some details about how bakeries should handle the scenario where data already exists in a target location.
An important principle is that the recipe object, as provided by the recipe box (GitHub repo), should not explicitly know about its target. The recipe describes the source data and how to transform it into ARCO. Therefore, the recipe should not care whether the target data already exist. That's the bakery's problem.
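To make the separation concrete, here is a minimal sketch of how a bakery might attach storage at bake time. It uses the pangeo-forge-recipes classes of this era (`XarrayZarrRecipe`, `FSSpecTarget`), but the source URLs and the target path are hypothetical, and the exact attribute assignment should be read as illustrative rather than prescriptive.

```python
import s3fs
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.recipes import XarrayZarrRecipe
from pangeo_forge_recipes.storage import FSSpecTarget

# The recipe, as it lives in the recipe box, describes only the source data
# and the transformation; hypothetical source URLs stand in here.
pattern = pattern_from_file_sequence(
    [f"https://data.example.com/file_{i}.nc" for i in range(3)],
    concat_dim="time",
)
recipe = XarrayZarrRecipe(pattern, inputs_per_chunk=1)

# The bakery alone decides where the ARCO output lives, attaching a target
# only at bake time.
fs = s3fs.S3FileSystem()
recipe.target = FSSpecTarget(fs=fs, root_path="s3://my-bakery/example.zarr")
```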
I made this flow chart to start thinking through the logic.
Consequences

If we accept this flow, it raises several important points we will have to implement; a rough sketch of the resulting bakery-side logic follows the list below.
We should have a force-overwrite option. But where would this get set, and what would trigger a force-overwrite situation?
Bakeries need to be able to manually clear their targets, which means we need code that handles this. Should that become part of the Prefect flow, or should it happen outside of Prefect? This is still unclear to me.
Recipes need to become "append aware". A recipe class (e.g. XarrayZarrRecipe) should declare that it knows how to handle appending, and we need a way to tell recipes where to pick up. A common scenario would be a continuously updating timeseries. The easiest unit for appending would be the chunk: we could specify a subset of chunks, similar to what we do when we call prune.
In this scenario, we would need a way to persist a count of which chunks have already been stored. I see two choices:
Inside the target data itself, e.g. in the xarray metadata
In some external database; perhaps we could use the metadata cache for this?

cc @cisaacstern @sharkinsspatial
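To ground these points, here is a minimal sketch of the bakery-side flow described above. It assumes the manual-execution interface of pangeo-forge-recipes (`to_function`); the `force_overwrite` flag, the `n_chunks_written` attribute, and the `copy_subset` method are hypothetical names for the mechanisms under discussion, not existing API.

```python
import fsspec
import zarr


def bake(recipe, target_url: str, force_overwrite: bool = False):
    """Sketch of the decision flow for a target that may already hold data."""
    fs, _, (path,) = fsspec.get_fs_token_paths(target_url)
    store = fs.get_mapper(path)

    if not zarr.storage.contains_group(store):
        recipe.to_function()()       # nothing at the target yet: full write
        return

    if force_overwrite:
        fs.rm(path, recursive=True)  # the bakery manually clears the target
        recipe.to_function()()
        return

    # Target exists and overwriting is not allowed: attempt an append.
    done = zarr.open_group(store, mode="r").attrs.get("n_chunks_written")
    if done is None:
        raise RuntimeError("target exists but records no append progress")
    # `copy_subset` is a hypothetical append-aware analogue of `copy_pruned`,
    # selecting only the chunks that have not been written yet.
    recipe.copy_subset(start_chunk=done).to_function()()
```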
Noting that in today's (2021-08-16) Pangeo Forge Coordination Meeting, we decided to start with a default overwrite/rewrite policy. This may require bakery operators to manually delete existing data, or could perhaps entail versioning targets to allow rewriting rather than overwriting (sketched below). Further details are in the meeting minutes.
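One way to picture "rewriting rather than overwriting": each run writes to a fresh versioned root under the target, and a small pointer file is updated last. This is only a sketch of the idea with hypothetical names and layout, not an existing Pangeo Forge mechanism.

```python
import json

import fsspec


def next_versioned_root(fs: fsspec.AbstractFileSystem, base: str) -> str:
    """Pick the next unused root: <base>/v1, <base>/v2, ..."""
    if not fs.exists(base):
        return f"{base}/v1"
    names = [p.rsplit("/", 1)[-1] for p in fs.ls(base)]
    versions = [int(n[1:]) for n in names if n.startswith("v") and n[1:].isdigit()]
    return f"{base}/v{max(versions, default=0) + 1}"


def publish(fs: fsspec.AbstractFileSystem, base: str, new_root: str) -> None:
    # Flipping this pointer is the only in-place write; previous versions
    # stay intact for readers until the bakery garbage-collects them.
    with fs.open(f"{base}/latest.json", "w") as f:
        json.dump({"root": new_root}, f)
```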
Drive-by comment based on the Earth Engine experience: many datasets have a formal notion of data generations, e.g. early/provisional/permanent climate data that get recomputed as more precise inputs arrive, or the RT tier vs. the T1/T2 tiers for Landsat. We have an internal system for annotating such datasets to support overwriting only when the new data belong to a newer generation.
Another consideration is file mtime, if available. If you put the max mtime of the input files into the metadata of the ingestion result, you can make an intelligent decision about whether or not to overwrite during the next run (a sketch combining both ideas follows).
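Combining these two comments into one sketch: stamp the generation tier and the max input mtime into the ingestion result's attributes, then consult them on the next run. The attribute names and the tier ordering below are illustrative, not Earth Engine's actual scheme.

```python
import os

# Illustrative ordering of data generations, least to most authoritative.
GENERATION_RANK = {"early": 0, "provisional": 1, "permanent": 2}


def provenance(input_files: list[str], generation: str) -> dict:
    """Metadata to stamp onto the ingestion result (e.g. zarr group attrs)."""
    return {
        "generation": generation,
        "max_input_mtime": max(os.path.getmtime(f) for f in input_files),
    }


def should_overwrite(existing: dict, new: dict) -> bool:
    old_rank = GENERATION_RANK[existing["generation"]]
    new_rank = GENERATION_RANK[new["generation"]]
    if new_rank != old_rank:
        # Only data from a newer generation may replace the existing result.
        return new_rank > old_rank
    # Same generation: overwrite only if some input file is newer than
    # anything the existing result was built from.
    return new["max_input_mtime"] > existing["max_input_mtime"]
```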