Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLI: cache outputs & Fuzzy Starts #645

Merged
merged 36 commits into from
Apr 12, 2024
Merged

CLI: cache outputs & Fuzzy Starts #645

merged 36 commits into from
Apr 12, 2024

Conversation

josephjclark
Copy link
Collaborator

@josephjclark josephjclark commented Apr 2, 2024

Short Description

This PR:

  • Writes the output of each step/job to disk if --cache-steps is passed
  • Uses cache data as input when running with --start if input is not explicitly passed in
  • Supports "fuzzy" matching on start node names (ie, pass a partial node id or name and it'll match)
  • Supports --end and --only as CLI arguments

Related issue

Fixes #375

Docs

PR open at OpenFn/docs#467

Caching

If the --cache-steps flag is passed to the CLI, every step will write its output to a local folder as a JSON file.

Caching is useful for debugging, but the real benefit comes from re-running a workflow from a fixed point.

If the --start flag is passed, the CLI will automatically load the correct input state from the cache for that step. This is is clearly logged.

Note that --cache-steps and --start are mutually exclusive. If you run --start without --cache-steps then the cached output will NOT be updated.

To work out the right input state, the CLI must find upstream step of the start job. At the moment, this is easy because in a workflow, each step can only have one input. If that rule ever changes, this part of the feature becomes more complex (I've added a comment in workflow validation to this effect).

The cache is written to a folder called .cli-cache adjacent to the workflow or job file. A sub folder will be created with the workflow name (which defaults to the file name), and a json file will be created for each step (with the step id).

So for .tmp/workflow.json you'll get a cache path something like ./.cli-cache/workflow/step-1.json.

Caching is off by default.

Setting the OPENFN_ALWAYS_CACHE_STEPS env var will default step caching ON. Disable by passing --no-cache-steps.

Fuzzy Starts

The PR also enables "fuzzy" start points to be defined.

  • If the start node exactly matches a step id, that step will be used as the start
  • If the start node partially matches a step id OR NAME, that step will be used
  • If the start fuzzily matches multiple steps, an error will be thrown.

This behaviour exists outside of the caching stuff, but the caching stuff benefits from it because it's easier to specify a start node now.

The idea here is that a) you may want to use the step name or id as the start point, depending on what your workflow json looks like; and b) if you're using a project downloaded from Lightning, the ids are a nightmare to type

TODO

  • Tidy typings
  • Add a bunch of unit and integration tests
  • Use the logger properly in cli/src/util/cache.ts
  • Add corresponding docs (readme and docs repo)
  • Maybe add an env var to default cache enabled (for power users)
  • Add an API to clear the cache I am not going to do this, its actually easier for everyone to rm -rf ./cli-cache
  • Consider what happens in the event of an error
  • Consider what happens if state is falsy (does it even matter?)
  • See "name or id?" - I think basically we need to allow --start to fuzzy match against step names and ids, for good UX
  • Ensure the worker will always run with the cache off (this is trivial if cache is off by default, or else may need a commit on the worker)
  • We should probably clear the cache on every run (except the start step), so that renaming steps doesn't mess things up, or older steps which didn't run don't interfere with the output.
  • generate a .gitignore?
  • Make --cache and --start mutually exclusive (probably)

Checklist before requesting a review

  • I have performed a self-review of my code
  • I have added unit tests
  • Changesets have been added (if there are production code changes)

@josephjclark

This comment was marked as resolved.

@josephjclark

This comment was marked as resolved.

@josephjclark

This comment was marked as resolved.

@josephjclark
Copy link
Collaborator Author

I am not going to implement a clear cache command. There's too much low level complexity associated with it.

If you do openfn clear-cache, it's not clear which cache. The repo cache? The metadata cache? The job cache? It's no good.

We could take a path to the cache. But is that a path to .cli-cache? The parent folder? The target workflow? Do you want to clear for all workflows or just for one workflow? Basically the path feels so ambiguous and unintuitive, it's not clear to me what it'll do.

Ok, so we can confirm the path before we do anything, but it's still an annoying command. Isn't rm -rf just easier
anyway? We all know what that does.

The other option is openfn workflow.json --clear-cache, which will clear the cache associated with that workflow. But it's weird because it won't run the workflow. It's even weirder if you do the long form of openfn execute workflow.json --clear-cache. It also doesn't help you clear the cache for all workflows.

Maybe we'll come back to it later - right now it feels super high complexity for incredibly low value.

@josephjclark josephjclark changed the title CLI: cache outputs CLI: cache outputs & Fuzzy Starts Apr 8, 2024
@josephjclark

This comment was marked as resolved.

@josephjclark

This comment was marked as outdated.

@josephjclark josephjclark marked this pull request as ready for review April 9, 2024 08:44
@josephjclark

This comment was marked as outdated.

@josephjclark
Copy link
Collaborator Author

Hi @mtuchi, when you've got a spare half hour can you take another look at this, maybe run a couple of tests? Thanks!

@mtuchi
Copy link
Contributor

mtuchi commented Apr 10, 2024

@josephjclark i have tested the --cache-steps and --start step-id. Everything work as expected
I have also tested openfn workflow.json --clear-cache. It didn't work, I don't think we need it honestly. It's very easy to clear .cli-cache if i needed

@josephjclark josephjclark changed the base branch from main to release/next April 11, 2024 11:51
@josephjclark
Copy link
Collaborator Author

Hi @mtuchi - last one I hope!

This PR now supports --end and --only

I plan to merge and release this in the morning. Let me know if you get a chance to test it out

@josephjclark

This comment was marked as resolved.

Always report when you're looking for a cache, and only warn if no input was found
If start is passed but state is not, we always try and load from the cache and warn if we dind't load anything

If no start is passed, we never try to load from the cache
let customEnd;

// Handle start, end and only
if (options.only) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This interaction here - not just the working out which step to run, but how it is fed to the runtime AND reported to the user - is not well tested.

It's taken quite a while to work out a good UX.

I don't want to spend much more time on this but maybe I'll see if I can find a good way to test this. I kinda worry that it'll take half a day to get a good suite of tests on it - not worth it right now.

@josephjclark

This comment was marked as resolved.

@josephjclark josephjclark requested a review from mtuchi April 12, 2024 10:31
@josephjclark josephjclark merged commit b86355d into release/next Apr 12, 2024
5 checks passed
@josephjclark josephjclark deleted the cli-cache branch April 12, 2024 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

CLI: cache workflow steps
2 participants