
Dynamically determine injections based on installed set of packages #86

Merged: 9 commits into main on Oct 3, 2023

Conversation

@yuvipanda (Collaborator) commented Aug 19, 2023

Earlier, we hardcoded which transformations get which values injected into them. This meant we couldn't use the full power of Beam, making transforms like
pangeo-forge-cmr (https://github.com/yuvipanda/pangeo-forge-cmr), which inherit from the default transforms in pangeo-forge-recipes, impossible to support. Hardcoding also deeply couples the transforms in pangeo-forge-recipes to a particular version of pangeo-forge-runner, forcing them to move in lockstep.

Instead, we use Python's standard extension mechanism: entry points. We define an entrypoint group (`pangeo_forge_recipes.injection_specs`) and expect packages to provide a function that implements it.
There are three parts to injections:

  1. An "injection spec", provided by other installed packages (such as
    pangeo_forge_recipes, pangeo_forge_cmr, etc). This specifies what
    values exactly will be injected as args for which callables.
    It is in the form of a dictionary, and looks like this:

    ```
    {
      "<callable-1-name>": {
        "<argument-1-name>": "<value-spec>",
        "<argument-2-name>": "<value-spec>"
      },
      "<callable-2-name>": {
        "<argument-1-name>": "<value-spec>",
        "<argument-2-name>": "<value-spec>"
      }
    }
    ```
    

    `<value-spec>` specifies what value should be injected. Two strings
    are currently supported:

    1. `OUTPUT_ROOT` - Root path that *output* should be written to. Will
       be an FSSpec path.
    2. `CACHE_ROOT` - (Optional) Root path that should be used for caching
       input values if necessary.

    Additional values may be provided in the future.

    An example is:

    ```
    {
        'StoreToZarr': {
            'target_root': 'OUTPUT_ROOT',
        },
        'OpenURLWithFSSpec': {
            'cache': 'CACHE_ROOT'
        }
    }
    ```
    

    We considered making this an Enum, but that would have required all
    packages that provide entrypoints to also *import* pangeo_forge_runner.
    This was deemed too complicated, so raw strings are used.

  2. "Injection spec values", calculated by pangeo-forge-runner. This is simply a
    mapping of "<value-spec>" to the specific value that will be injected for
    that "<value-spec>" in this particular run. This might look like:

    ```
    {
      "OUTPUT_ROOT": <An fsspec object>,
      "CACHE_ROOT": <another fsspec object>
    }
    ```
    
  3. "Injections", ready to be passed on to the rewriter! This merges (1) and (2),
    and looks like:

    ```
    {
        'StoreToZarr': {
            'target_root': <An fsspec object pointing to output for this run>
        },
        'OpenURLWithFSSpec': {
            'cache': <Another fsspec object pointing to where this bakery stores cache>
        }
    }
    ```
    

    This is what is actually injected into the recipes in the end.
This allows the transforms in pangeo-forge-recipes to evolve without being coupled to a particular version of pangeo-forge-runner, and makes extension by arbitrary third-party packages easy.
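The merge in step (3) can be sketched as a simple dictionary walk. This is a minimal sketch with a hypothetical function name, not the actual runner implementation:

```python
def build_injections(spec, spec_values):
    """Merge an injection spec (1) with injection spec values (2)."""
    injections = {}
    for callable_name, args in spec.items():
        resolved = {
            arg: spec_values[value_spec]
            for arg, value_spec in args.items()
            if value_spec in spec_values  # optional values may be absent
        }
        if resolved:
            injections[callable_name] = resolved
    return injections

spec = {
    "StoreToZarr": {"target_root": "OUTPUT_ROOT"},
    "OpenURLWithFSSpec": {"cache": "CACHE_ROOT"},
}
# Pretend CACHE_ROOT was not configured for this run:
values = {"OUTPUT_ROOT": "memory://output/job-1"}
print(build_injections(spec, values))
# -> {'StoreToZarr': {'target_root': 'memory://output/job-1'}}
```

Note how `OpenURLWithFSSpec` drops out entirely when `CACHE_ROOT` has no configured value, which matches `CACHE_ROOT` being optional in the spec above.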

Includes #86
Appropriate change in pangeo-forge-recipes: pangeo-forge/pangeo-forge-recipes#566
Change in pangeo-forge-earthdatalogin: yuvipanda/pangeo-forge-earthdatalogin#2

@codecov bot commented Aug 19, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (35e7732) 95.79% compared to head (ae4c41c) 96.01%.
Report is 2 commits behind head on main.

Additional details and impacted files
```
@@            Coverage Diff             @@
##             main      #86      +/-   ##
==========================================
+ Coverage   95.79%   96.01%   +0.22%     
==========================================
  Files          13       14       +1     
  Lines         428      452      +24     
==========================================
+ Hits          410      434      +24     
  Misses         18       18              
```

| Files | Coverage | Δ |
| --- | --- | --- |
| pangeo_forge_runner/commands/bake.py | 90.00% <100.00%> | +0.41% ⬆️ |
| pangeo_forge_runner/plugin.py | 100.00% <100.00%> | ø |


Much clearer than using a fixture that called out to pip
yuvipanda added a commit to yuvipanda/pangeo-forge-earthdatalogin that referenced this pull request Aug 19, 2023
Without this, this package doesn't get info about where
to store caches while fetching from remote sources.

Works alongside pangeo-forge/pangeo-forge-runner#86
@yuvipanda added the test-dataflow label ("Add this label to PRs to trigger Dataflow integration test.") on Aug 20, 2023
```
{
    'StoreToZarr': {
        'target_root': 'OUTPUT_ROOT',
```
@cisaacstern (Member) commented Sep 1, 2023

@yuvipanda, I'm gradually understanding this. Thanks again for your patience as I review it, it's a great learning opportunity for me.

Question: Did you have a rationale for not going 1:1 on naming here, i.e.

```diff
         'StoreToZarr': {
-            'target_root': 'OUTPUT_ROOT',
+            'target_root': 'TARGET_ROOT',
```

?

Perhaps OUTPUT_ROOT is more general/reusable?

@yuvipanda (Collaborator, Author) commented
@cisaacstern yes, the reason was that OUTPUT_ROOT is more general. 'target' felt very Zarr-specific.

@cisaacstern (Member) commented

@yuvipanda I see that, and also the more I reflect on it, the more I find myself caught on the fact that, IIUC, the OUTPUT_ROOT is always parsed as:

```python
injection_values = {
    "OUTPUT_ROOT": target_storage.get_forge_target(job_name=self.job_name),
}
```

which means it always becomes a pangeo_forge_recipes.storage.FSSpecTarget (via pangeo_forge_runner.storage.TargetStorage)...

```python
from pangeo_forge_recipes import storage
cls = getattr(storage, self.pangeo_forge_target_class)
return cls(
```

```python
class TargetStorage(StorageTargetConfig):
    """
    Storage configuration for where the baked data should be stored
    """

    pangeo_forge_target_class = "FSSpecTarget"
```

with that in mind, I think it would be more intuitive to me if this value-spec was called either FSSPEC_TARGET after the -recipes type it is parsed to (but that's probably too specific) or, after the -runner type that parses it TARGET_STORAGE (maybe this is better). And the same for the cache target. So to make the suggestion more specific, how about:

```diff
 `<value-spec>` specifies what value should be injected. Currently
    supported are two strings:
-   1. `OUTPUT_ROOT` - Root path that *output* should be written to. Will
-      be an FSSpec path.
+   1. `TARGET_STORAGE` - Location *output* should be written to. Will
+      be parsed to a `pangeo_forge_recipes.storage.FSSpecTarget` object.
-   2. `CACHE_ROOT` - (Optional) Root path that should be used for caching
-      input values if necessary.
+   2. `INPUT_CACHE_STORAGE` - (Optional) Location that should be used for caching
+      input values. Parsed to `pangeo_forge_recipes.storage.CacheFSSpecTarget`.
```

This way the injection spec is not inventing its own terminology: we just use the name of the pangeo-forge-runner parsing interface as the name of the value-spec. Thoughts?

@yuvipanda (Collaborator, Author) commented

Yeah, I'm totally fine with standardizing and not picking our own terminology! I think the question is whether we should include the 'root' terminology here. Prior to pangeo-forge-runner, 'root' didn't really exist as terminology, but it does now, since we put multiple job outputs from the same run under one root - #68 introduced this, I think. So the question now is: is this TARGET_STORAGE or TARGET_STORAGE_ROOT?

@cisaacstern (Member) commented

Awesome.

So the question now is - is this TARGET_STORAGE or TARGET_STORAGE_ROOT?

I am less attached to one or the other of these choices than I am to the idea that, whichever it is, the value-spec name should be the screaming snake case version of the -runner class which parses it, i.e. (just because I was curious how to do that... 😆):

```python
import re
cls_name = "TargetStorage"
re.sub("(?<!^)(?=[A-Z])", "_", cls_name).upper()
# --> 'TARGET_STORAGE'
```

So I'd say 'TARGET_STORAGE' because:

  1. It changes less, because it doesn't require changing the parsing class names (If we prefer 'TARGET_STORAGE_ROOT', then I'd say let's rename TargetStorage -> TargetStorageRoot.)
  2. It's more concise.
  3. In terms of its meaning, I think it can be taken to imply "target storage device/location" in the general sense, and not the fully-qualified final output path.

Thoughts?

@norlandrhagen commented
Late to the naming party, but I just used this branch and it was super helpful to run my recipe. Not super opinionated, but I think I like the idea of using some variation of target_storage. Whether that's target_storage or target_storage_root, either way seems good.

@cisaacstern (Member) commented

Thanks for the quick merge of #108 @yuvipanda!

I'll address the failing tests here now.

@cisaacstern (Member) commented

TYSM @yuvipanda! For the PR and for your patience with my glacial review pace 🙏

@cisaacstern cisaacstern merged commit 241c167 into main Oct 3, 2023
11 checks passed
@yuvipanda yuvipanda deleted the injections branch November 3, 2023 12:26
Labels
test-dataflow Add this label to PRs to trigger Dataflow integration test.
3 participants