-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dynamically determine injections based on installed set of packages #86
Conversation
Earlier, we were hardcoding what transformations get what values injected into them. This means we can't actually use the full power of beam, making transforms like pangeo-forge-cmr (https://github.com/yuvipanda/pangeo-forge-cmr) that inherit from the default transforms in pangeo-forge-recipes impossible. Hardcoding also deeply couples the transforms in pangeo-forge-recipes to a particular version of pangeo-forge-runner, necessitating them to move in lockstep. Instead, we use the default python extension mechanism - entrypoints. We define an entrypoint group (pangeo_forge_recipes.injection_specs), and expect packages to provide a function that implements it. There are three parts of injections: 1. An "injection spec", provided by other installed packages (such as pangeo_forge_recipes, pangeo_forge_cmr, etc). This specifies what *values* exactly will be injected as args for *which* callables. It is in the form of a dictionary, and looks like this: ``` { "<callable-1-name>": { "<argument-1-name>": "<value-spec>", "<argument-2-name>": "<value-spec>" }, "<callable-2-name>": { "<argument-1-name>": "<value-spec>", "<argument-2-name>": "<value-spec>" } } ``` `<value-spec>` specifies what value should be injected. Currently supported are two strings: 1. `OUTPUT_ROOT` - Root path that *output* should be written to. Will be an FSSpec path. 2. `CACHE_ROOT` - (Optional) Root path that should be used for caching input values if necessary. Additional values may be provided in the future. An example is: ``` { 'StoreToZarr': { 'target_root': 'OUTPUT_ROOT', }, 'OpenURLWithFSSpec': { 'cache': 'CACHE_ROOT' } } ``` We considered making this into an Enum, but that would have required all packages that provide entrypoints also *import* pangeo_forge_runner. This was deemed too complicating, and hence raw strings are used. 2. "Injection spec values", calculated by pangeo-forge-runner. This is simply a mapping of "<value-spec>" to a specific value that will be injected for that "<value-spec>" in this particular run. This might look like: ``` { "OUTPUT_ROOT": <A fsspec object>, "CACHE_ROOT": <another fsspec object> } ``` 3. "Injections", ready to be passed on to the rewriter! This merges (1) and (2), and looks like: ``` { 'StoreToZarr': { 'target_root': <An fsspec object pointing to output for this run> }, 'OpenURLWithFSSpec': { 'cache': <Another fsspec object pointing to where this bakery stores cache> } } ``` This is what is actually injected into the recipes in the end. This allows the transforms in recipes to evolve without having to be coupled to a particular version of runner, and allows very easy extension by arbitrary third party packages.
Codecov ReportAll modified lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #86 +/- ##
==========================================
+ Coverage 95.79% 96.01% +0.22%
==========================================
Files 13 14 +1
Lines 428 452 +24
==========================================
+ Hits 410 434 +24
Misses 18 18
☔ View full report in Codecov by Sentry. |
Much clearer than using a fixture that called out to pip
Without this, this package doesn't get info about where to store caches while fetching from remote sources. Works alongside pangeo-forge/pangeo-forge-runner#86
pangeo_forge_runner/plugin.py
Outdated
``` | ||
{ | ||
'StoreToZarr': { | ||
'target_root': 'OUTPUT_ROOT', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yuvipanda, I'm gradually understanding this. Thanks again for your patience as I review it, it's a great learning opportunity for me.
Question: Did you have a rationale for not going 1:1 on naming here, i.e.
'StoreToZarr': {
- 'target_root': 'OUTPUT_ROOT',
+ 'target_root': 'TARGET_ROOT',
?
Perhaps OUTPUT_ROOT
is more general/reusable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cisaacstern yes, OUTPUT_ROOT
is more general was the reason. 'target' felt very zarr specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yuvipanda I see that, and also the more I reflect on it, the more I find myself caught on the fact that, IIUC, the OUTPUT_ROOT
is always parsed as:
pangeo-forge-runner/pangeo_forge_runner/commands/bake.py
Lines 178 to 180 in 8d61651
injection_values = { | |
"OUTPUT_ROOT": target_storage.get_forge_target(job_name=self.job_name), | |
} |
which means it always becomes a pangeo_forge_recipes.storage.FSSpecTarget
(via pangeo_forge_runner.storage.TargetStorage
)...
pangeo-forge-runner/pangeo_forge_runner/storage.py
Lines 55 to 59 in 8d61651
from pangeo_forge_recipes import storage | |
cls = getattr(storage, self.pangeo_forge_target_class) | |
return cls( |
pangeo-forge-runner/pangeo_forge_runner/storage.py
Lines 75 to 80 in 8d61651
class TargetStorage(StorageTargetConfig): | |
""" | |
Storage configuration for where the baked data should be stored | |
""" | |
pangeo_forge_target_class = "FSSpecTarget" |
with that in mind, I think it would be more intuitive to me if this value-spec was called either FSSPEC_TARGET
after the -recipes
type it is parsed to (but that's probably too specific) or, after the -runner
type that parses it TARGET_STORAGE
(maybe this is better). And the same for the cache target. So to make the suggestion more specific, how about:
`<value-spec>` specifies what value should be injected. Currently
supported are two strings:
- 1. `OUTPUT_ROOT` - Root path that *output* should be written to. Will
- be an FSSpec path.
+ 1. `TARGET_STORAGE` - Location *output* should be written to. Will
+ be parsed to an `pangeo_forge_recipes.storge.FSSpecTarget` object.
- 2. `CACHE_ROOT` - (Optional) Root path that should be used for caching
- input values if necessary.
+ 2. `INPUT_CACHE_STORAGE` - (Optional) Location that should be used for caching
+ input values. Parsed to `pangeo_forge_recipes.storge.CacheFSSpecTarget`.
This way the injection spec is not inventing its own terminology: we just use the name of the pangeo-forge-runner
parsing interface as the name of the value-spec. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm totally fine with standardizing and not picking own terminology! I think the question is one of - should we include or not include the 'root' terminology here? Prior to pangeo-forge-runner, 'root' didn't really exist as terminology, but it now does as we put multiple job outputs from the same run - #68 introduced this I think. So the question now is - is this TARGET_STORAGE
or TARGET_STORAGE_ROOT
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome.
So the question now is - is this TARGET_STORAGE or TARGET_STORAGE_ROOT?
I am less attached to one or the other of these choices, than I am to the idea that whichever it is, the value-spec name should be the screaming snake case version of the -runner
class which will parses it, i.e. (just because I was curious how to do that... 😆):
import re
cls_name = "TargetStorage"
re.sub("(?<!^)(?=[A-Z])", "_", cls_name).upper()
# --> 'TARGET_STORAGE'
So I'd say 'TARGET_STORAGE' because:
- It changes less, because it doesn't require changing the parsing class names (If we prefer 'TARGET_STORAGE_ROOT', then I'd say let's rename
TargetStorage
->TargetStorageRoot
.) - It's more concise.
- In terms of its meaning, I think it can be taken to imply "target storage device/location" in the general sense, and not the fully-qualified final output path.
Thoughts?
Late to the naming party, but I just used this branch and it was super helpful to run my recipe. Not super opinionated, but I think I like the idea of using some variation of |
[base: `injections`] Rename value specs
Thanks for the quick merge of #108 @yuvipanda! I'll address the failing tests here now. |
TYSM @yuvipanda! For the PR and for your patience with my glacial review pace 🙏 |
Earlier, we were hardcoding what transformations get what values injected into them. This means we can't actually use the full power of beam, making transforms like
pangeo-forge-cmr (https://github.com/yuvipanda/pangeo-forge-cmr) that inherit from the default transforms in pangeo-forge-recipes impossible. Hardcoding also deeply couples the transforms in pangeo-forge-recipes to a particular version of pangeo-forge-runner, necessitating them to move in lockstep.
Instead, we use the default python extension mechanism - entrypoints. We define an entrypoint group (pangeo_forge_recipes.injection_specs), and expect packages to provide a function that implements it.
There are three parts of injections:
An "injection spec", provided by other installed packages (such as
pangeo_forge_recipes, pangeo_forge_cmr, etc). This specifies what
values exactly will be injected as args for which callables.
It is in the form of a dictionary, and looks like this:
<value-spec>
specifies what value should be injected. Currentlysupported are two strings:
OUTPUT_ROOT
- Root path that output should be written to. Willbe an FSSpec path.
CACHE_ROOT
- (Optional) Root path that should be used for cachinginput values if necessary.
Additional values may be provided in the future.
An example is:
We considered making this into an Enum, but that would have required all
packages that provide entrypoints also import pangeo_forge_runner. This
was deemed too complicating, and hence raw strings are used.
"Injection spec values", calculated by pangeo-forge-runner. This is simply a
mapping of "" to a specific value that will be injected for
that "" in this particular run. This might look like:
"Injections", ready to be passed on to the rewriter! This merges (1) and (2),
and looks like:
This is what is actually injected into the recipes in the end.
This allows the transforms in recipes to evolve without having to be coupled to a particular version of runner, and allows very easy extension by arbitrary third party packages.
Includes #86
Appropriate change in pangeo-forge-recipes: pangeo-forge/pangeo-forge-recipes#566
Change in pangeo-forge-earthdatalogin: yuvipanda/pangeo-forge-earthdatalogin#2