
Dynamically determine injections based on installed set of packages #86

Merged: 9 commits into main on Oct 3, 2023

Conversation

@yuvipanda (Collaborator) commented Aug 19, 2023

Earlier, we hardcoded which transformations get which values injected into them. This meant we couldn't use the full power of Beam, making transforms like
pangeo-forge-cmr (https://github.com/yuvipanda/pangeo-forge-cmr), which inherit from the default transforms in pangeo-forge-recipes, impossible to support. Hardcoding also deeply couples the transforms in pangeo-forge-recipes to a particular version of pangeo-forge-runner, forcing them to move in lockstep.

Instead, we use Python's standard extension mechanism: entry points. We define an entrypoint group (`pangeo_forge_recipes.injection_specs`) and expect packages to provide a function that implements it.
There are three parts to injections:

  1. An "injection spec", provided by other installed packages (such as
    pangeo_forge_recipes, pangeo_forge_cmr, etc). This specifies what
    values exactly will be injected as args for which callables.
    It is in the form of a dictionary, and looks like this:

    ```
    {
      "<callable-1-name>": {
        "<argument-1-name>": "<value-spec>",
        "<argument-2-name>": "<value-spec>"
      },
      "<callable-2-name>": {
        "<argument-1-name>": "<value-spec>",
        "<argument-2-name>": "<value-spec>"
      }
    }
    ```
    

    `<value-spec>` specifies what value should be injected. Two strings
    are currently supported:

    1. `OUTPUT_ROOT` - Root path that *output* should be written to. Will
       be an FSSpec path.
    2. `CACHE_ROOT` - (Optional) Root path that should be used for caching
       input values if necessary.

    Additional values may be provided in the future.

    An example is:

    ```
    {
        'StoreToZarr': {
            'target_root': 'OUTPUT_ROOT',
        },
        'OpenURLWithFSSpec': {
            'cache': 'CACHE_ROOT'
        }
    }
    ```
    

    We considered making this an Enum, but that would have required all
    packages that provide entrypoints to also *import* pangeo_forge_runner.
    This was deemed too complicated, so raw strings are used.

  2. "Injection spec values", calculated by pangeo-forge-runner. This is simply a
    mapping of "<value-spec>" to the specific value that will be injected for
    that "<value-spec>" in this particular run. This might look like:

    ```
    {
      "OUTPUT_ROOT": <An fsspec object>,
      "CACHE_ROOT": <another fsspec object>
    }
    ```
    
  3. "Injections", ready to be passed on to the rewriter! This merges (1) and (2),
    and looks like:

    ```
    {
        'StoreToZarr': {
            'target_root': <An fsspec object pointing to output for this run>
        },
        'OpenURLWithFSSpec': {
            'cache': <Another fsspec object pointing to where this bakery stores cache>
        }
    }
    ```
    

    This is what is actually injected into the recipes in the end.
This allows the transforms in pangeo-forge-recipes to evolve without being coupled to a particular version of pangeo-forge-runner, and makes extension by arbitrary third-party packages easy.
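The merge in step (3) can be sketched as a simple dictionary walk. This is a minimal sketch with a hypothetical function name, not the actual runner implementation:

```python
def build_injections(spec, spec_values):
    """Merge an injection spec (1) with injection spec values (2)."""
    injections = {}
    for callable_name, args in spec.items():
        resolved = {
            arg: spec_values[value_spec]
            for arg, value_spec in args.items()
            if value_spec in spec_values  # optional values may be absent
        }
        if resolved:
            injections[callable_name] = resolved
    return injections

spec = {
    "StoreToZarr": {"target_root": "OUTPUT_ROOT"},
    "OpenURLWithFSSpec": {"cache": "CACHE_ROOT"},
}
# Pretend CACHE_ROOT was not configured for this run:
values = {"OUTPUT_ROOT": "memory://output/job-1"}
print(build_injections(spec, values))
# -> {'StoreToZarr': {'target_root': 'memory://output/job-1'}}
```

Note how `OpenURLWithFSSpec` drops out entirely when `CACHE_ROOT` has no configured value, which matches `CACHE_ROOT` being optional in the spec above.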

Includes #86
Appropriate change in pangeo-forge-recipes: pangeo-forge/pangeo-forge-recipes#566
Change in pangeo-forge-earthdatalogin: yuvipanda/pangeo-forge-earthdatalogin#2

@codecov bot commented Aug 19, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (35e7732) 95.79% compared to head (ae4c41c) 96.01%.
Report is 2 commits behind head on main.

Additional details and impacted files
```
@@            Coverage Diff             @@
##             main      #86      +/-   ##
==========================================
+ Coverage   95.79%   96.01%   +0.22%     
==========================================
  Files          13       14       +1     
  Lines         428      452      +24     
==========================================
+ Hits          410      434      +24     
  Misses         18       18              
```

| Files | Coverage | Δ |
| --- | --- | --- |
| pangeo_forge_runner/commands/bake.py | 90.00% <100.00%> | +0.41% ⬆️ |
| pangeo_forge_runner/plugin.py | 100.00% <100.00%> | ø |


Much clearer than using a fixture that called out to pip
yuvipanda added a commit to yuvipanda/pangeo-forge-earthdatalogin that referenced this pull request Aug 19, 2023
Without this, this package doesn't get info about where
to store caches while fetching from remote sources.

Works alongside pangeo-forge/pangeo-forge-runner#86
@yuvipanda added the test-dataflow label ("Add this label to PRs to trigger Dataflow integration test.") on Aug 20, 2023
```
{
    'StoreToZarr': {
        'target_root': 'OUTPUT_ROOT',
```
@cisaacstern (Member) commented Sep 1, 2023

@yuvipanda, I'm gradually understanding this. Thanks again for your patience as I review it, it's a great learning opportunity for me.

Question: Did you have a rationale for not going 1:1 on naming here, i.e.

```diff
         'StoreToZarr': {
-            'target_root': 'OUTPUT_ROOT',
+            'target_root': 'TARGET_ROOT',
```

?

Perhaps OUTPUT_ROOT is more general/reusable?

@yuvipanda (Collaborator, Author) commented
@cisaacstern yes, the reason was that OUTPUT_ROOT is more general. 'target' felt very Zarr-specific.

@cisaacstern (Member) commented

@yuvipanda I see that, and also the more I reflect on it, the more I find myself caught on the fact that, IIUC, the OUTPUT_ROOT is always parsed as:

```python
injection_values = {
    "OUTPUT_ROOT": target_storage.get_forge_target(job_name=self.job_name),
}
```

which means it always becomes a pangeo_forge_recipes.storage.FSSpecTarget (via pangeo_forge_runner.storage.TargetStorage)...

```python
from pangeo_forge_recipes import storage
cls = getattr(storage, self.pangeo_forge_target_class)
return cls(
```

```python
class TargetStorage(StorageTargetConfig):
    """
    Storage configuration for where the baked data should be stored
    """

    pangeo_forge_target_class = "FSSpecTarget"
```

with that in mind, I think it would be more intuitive to me if this value-spec was called either FSSPEC_TARGET after the -recipes type it is parsed to (but that's probably too specific) or, after the -runner type that parses it TARGET_STORAGE (maybe this is better). And the same for the cache target. So to make the suggestion more specific, how about:

```diff
 `<value-spec>` specifies what value should be injected. Currently
    supported are two strings:
-   1. `OUTPUT_ROOT` - Root path that *output* should be written to. Will
-      be an FSSpec path.
+   1. `TARGET_STORAGE` - Location *output* should be written to. Will
+      be parsed to a `pangeo_forge_recipes.storage.FSSpecTarget` object.
-   2. `CACHE_ROOT` - (Optional) Root path that should be used for caching
-      input values if necessary.
+   2. `INPUT_CACHE_STORAGE` - (Optional) Location that should be used for caching
+      input values. Parsed to `pangeo_forge_recipes.storage.CacheFSSpecTarget`.
```

This way the injection spec is not inventing its own terminology: we just use the name of the pangeo-forge-runner parsing interface as the name of the value-spec. Thoughts?

@yuvipanda (Collaborator, Author) commented

Yeah, I'm totally fine with standardizing and not picking our own terminology! I think the question is whether we should include the 'root' terminology here. Prior to pangeo-forge-runner, 'root' didn't really exist as terminology, but it does now, since we put multiple job outputs from the same run under one root - #68 introduced this, I think. So the question now is: is this TARGET_STORAGE or TARGET_STORAGE_ROOT?

@cisaacstern (Member) commented

Awesome.

So the question now is - is this TARGET_STORAGE or TARGET_STORAGE_ROOT?

I am less attached to one or the other of these choices than I am to the idea that, whichever it is, the value-spec name should be the screaming snake case version of the -runner class which parses it, i.e. (just because I was curious how to do that... 😆):

```python
import re
cls_name = "TargetStorage"
re.sub("(?<!^)(?=[A-Z])", "_", cls_name).upper()
# --> 'TARGET_STORAGE'
```

So I'd say 'TARGET_STORAGE' because:

  1. It changes less, because it doesn't require changing the parsing class names (If we prefer 'TARGET_STORAGE_ROOT', then I'd say let's rename TargetStorage -> TargetStorageRoot.)
  2. It's more concise.
  3. In terms of its meaning, I think it can be taken to imply "target storage device/location" in the general sense, and not the fully-qualified final output path.

Thoughts?

@norlandrhagen commented
Late to the naming party, but I just used this branch and it was super helpful to run my recipe. Not super opinionated, but I think I like the idea of using some variation of target_storage. Whether that's target_storage or target_storage_root, either way seems good.

@cisaacstern (Member) commented

Thanks for the quick merge of #108 @yuvipanda!

I'll address the failing tests here now.

@cisaacstern (Member) commented

TYSM @yuvipanda! For the PR and for your patience with my glacial review pace 🙏

@cisaacstern cisaacstern merged commit 241c167 into main Oct 3, 2023
11 checks passed
@yuvipanda yuvipanda deleted the injections branch November 3, 2023 12:26
Labels
test-dataflow Add this label to PRs to trigger Dataflow integration test.
3 participants