Brainstorming about overwriting / appending / etc #29

Open · rabernat opened this issue Aug 9, 2021 · 2 comments

Comments

rabernat (Contributor) commented Aug 9, 2021

As noted in pangeo-forge/staged-recipes#67, we need to resolve some details about how bakeries should handle the scenario where data already exists in a target location.

An important principle is that the recipe object, as provided by the recipe box (github repo), should not explicitly know about its target. The recipe describes the source data and how to transform it into ARCO. Therefore, the recipe should not care whether the target data already exists. That's the bakery's problem.

I made this flow chart to start thinking through the logic.

[Flowchart: Pangeo Forge Target Update Flow]

Edit link:
https://lucid.app/lucidchart/invitations/accept/inv_2d95a404-c359-457e-b472-ae10f268e662?viewport_loc=-153%2C372%2C1889%2C1012%2C0_0

Consequences

If we accept this flow, it raises several important points that we will have to address in the implementation.

  • We should have a force-overwrite option. But where would it get set, and what would trigger a force-overwrite situation?
  • Bakeries need to be able to manually clear their targets, which means we need code that handles this. Should that become part of the Prefect flow, or should it happen outside of Prefect? This part is still unclear to me. (A rough sketch of one possibility follows this list.)
  • Recipes need to become "append aware". A recipe class (e.g. XarrayZarrRecipe) should declare that it knows how to handle appending, and we need a way to tell the recipe where to pick up. A common scenario would be a continuously updating timeseries. The easiest unit for appending is the chunk: we could specify a subset of chunks, similar to what we do when we call prune. (See the second sketch below.)
  • In that scenario, we would also need a way to persist a count of which chunks have already been stored. I see two choices:
    • inside the target data itself, e.g. in the xarray metadata, or
    • in some external database; perhaps we could use the metadata cache for this?
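
To make the "clear the target" bullet a bit more concrete, here is a rough sketch of what a bakery-side step might look like, assuming an fsspec-compatible target and a Prefect 1.x flow. The force_overwrite flag and the task itself are hypothetical, not existing pangeo-forge API:

```python
import fsspec
from prefect import Flow, task


@task
def maybe_clear_target(target_url: str, force_overwrite: bool = False) -> None:
    """Bakery-side step: remove existing data at the target before the recipe runs.

    The recipe itself never sees the target, consistent with the principle above.
    """
    fs, _, (path,) = fsspec.get_fs_token_paths(target_url)
    if fs.exists(path):
        if not force_overwrite:
            raise ValueError(
                f"{target_url} already exists; pass force_overwrite=True to clear it"
            )
        fs.rm(path, recursive=True)


with Flow("bakery-run") as flow:
    maybe_clear_target("s3://my-bakery/target.zarr", force_overwrite=True)
    # ...the recipe's own tasks would be chained after this...
```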
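
And a sketch of how the "append aware" piece and the stored-chunk count could fit together if we kept the count in the target's own metadata (the first choice above). The attribute name and the is_appendable / copy_appending names are hypothetical, just meant to mirror the existing copy_pruned pattern:

```python
import fsspec
import xarray as xr

# Hypothetical attribute in which the bakery records how far the recipe has written.
CHUNKS_WRITTEN_ATTR = "pangeo_forge:chunks_written"


def get_chunks_written(target_url: str) -> int:
    """Return how many chunks have already been stored in the target (0 if none)."""
    fs, _, (path,) = fsspec.get_fs_token_paths(target_url)
    if not fs.exists(path):
        return 0
    ds = xr.open_zarr(target_url)
    return int(ds.attrs.get(CHUNKS_WRITTEN_ATTR, 0))


# The bakery could then hand the resume point to an append-aware recipe, e.g.:
#
#   if recipe.is_appendable:
#       recipe = recipe.copy_appending(start_chunk=get_chunks_written(target_url))
```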

cc @cisaacstern @sharkinsspatial

cisaacstern (Member) commented Aug 16, 2021

Noting that in today's (2021-08-16) Pangeo Forge Coordination Meeting, we decided to start with a default overwrite/rewrite policy. This may require bakery operators to manually delete existing data, or could perhaps entail targets being versioned, to allow for rewriting rather than overwriting. Further details in the meeting minutes.

simonff commented Aug 17, 2021

Drive-by comment based on the Earth Engine experience: many datasets have a formal notion of data generations. E.g., early / provisional / permanent climate data that get recomputed as more precise inputs arrive, or the RT tier vs. the T1/T2 tiers for Landsat. We have an internal system for annotating such datasets so that overwrites are allowed only when the new data belong to a newer generation.

Another consideration is file mtime, if available. If you put the max mtime of the input files into the metadata of the ingestion result, you can make an intelligent decision about whether to overwrite during the next run.
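
For example, a bakery could record and check something like the following (a sketch only; the attribute name is made up, and fs.modified is not implemented for every fsspec filesystem):

```python
import fsspec
import xarray as xr

MAX_MTIME_ATTR = "source_max_mtime"  # hypothetical attribute on the ingestion result


def max_input_mtime(input_urls) -> float:
    """Latest modification time across the input files, as a POSIX timestamp."""
    mtimes = []
    for url in input_urls:
        fs, _, (path,) = fsspec.get_fs_token_paths(url)
        mtimes.append(fs.modified(path).timestamp())
    return max(mtimes)


def needs_overwrite(target_url: str, input_urls) -> bool:
    """True if any input file is newer than what the existing target was built from."""
    ds = xr.open_zarr(target_url)
    return max_input_mtime(input_urls) > ds.attrs.get(MAX_MTIME_ATTR, 0.0)
```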
