CLI: cache outputs & Fuzzy Starts #645

josephjclark · 2024-04-02T14:50:37Z

Short Description

This PR:

Writes the output of each step/job to disk if --cache-steps is passed
Uses cache data as input when running with --start if input is not explicitly passed in
Supports "fuzzy" matching on start node names (i.e., pass a partial node id or name and it'll match)
Supports --end and --only as CLI arguments

Related issue

Fixes #375

Docs

PR open at OpenFn/docs#467

Caching

If the --cache-steps flag is passed to the CLI, every step will write its output to a local folder as a JSON file.
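As a rough illustration of what "write its output to a local folder as a JSON file" could look like, here is a minimal sketch. The helper name `writeStepOutput` and its signature are hypothetical and are not taken from the CLI source.

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

// Hypothetical helper: the name and signature are illustrative only,
// not the CLI's real API.
function writeStepOutput(
  cacheDir: string,
  stepId: string,
  output: unknown
): string {
  // Create the cache folder (and any missing parents)
  fs.mkdirSync(cacheDir, { recursive: true });
  const file = path.join(cacheDir, `${stepId}.json`);
  // Pretty-print so the cached state is easy to read while debugging.
  // undefined is written as null so the file is always valid JSON.
  fs.writeFileSync(file, JSON.stringify(output ?? null, null, 2));
  return file;
}
```

Each step gets its own file, so a later run can pick up the output of any single step without parsing one large cache file.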

Caching is useful for debugging, but the real benefit comes from re-running a workflow from a fixed point.

If the --start flag is passed, the CLI will automatically load the correct input state from the cache for that step. This is clearly logged.

Note that --cache-steps and --start are independent of each other. If you run --start without --cache-steps then the cached output will NOT be updated.
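The rules for choosing the input state can be sketched roughly as follows. This is an illustration under assumptions, not the handler's real code: `resolveInputState`, the `readCache` callback, the `Logger` shape, the log messages and the empty-object default state are all hypothetical.

```typescript
type Logger = { info: (msg: string) => void; warn: (msg: string) => void };

// Hypothetical sketch. readCache returns the cached output of a step,
// or undefined if nothing has been cached for it.
function resolveInputState(
  opts: { start?: string; state?: unknown },
  upstreamStepId: string | undefined,
  readCache: (stepId: string) => unknown,
  log: Logger
): unknown {
  // Explicitly passed input always wins over the cache
  if (opts.state !== undefined) return opts.state;
  // Without --start the cache is never consulted
  if (!opts.start) return {};
  // The start step has nothing upstream, so there is nothing to load
  if (upstreamStepId === undefined) return {};

  // Always report that we are looking in the cache...
  log.info(`Looking for cached output of step "${upstreamStepId}"`);
  const cached = readCache(upstreamStepId);
  if (cached === undefined) {
    // ...and only warn when nothing was found
    log.warn(`No cached input found for "${opts.start}", using empty state`);
    return {};
  }
  return cached;
}
```

Passing `readCache` in as a callback keeps the decision logic separate from the file system, which also makes it easier to unit test.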

To work out the right input state, the CLI must find the upstream step of the start job. At the moment, this is easy because in a workflow, each step can only have one input. If that rule ever changes, this part of the feature becomes more complex (I've added a comment in workflow validation to this effect).
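A minimal sketch of the upstream lookup, relying on the one-input-per-step rule. The function name `getUpstreamStepId` appears in this PR's commit messages, but the body and the `Step` shape below (a `next` that is either a step id or an object keyed by step id) are assumptions made for illustration.

```typescript
// Assumed step shape: `next` is either a single step id or an object
// whose keys are the ids of downstream steps.
type Step = { id: string; next?: string | Record<string, unknown> };

// Illustrative body only; not copied from the CLI source.
function getUpstreamStepId(steps: Step[], stepId: string): string | undefined {
  for (const step of steps) {
    // Steps without a `next` have no downstream steps
    if (!step.next) continue;
    const targets =
      typeof step.next === 'string' ? [step.next] : Object.keys(step.next);
    // Each step has at most one input, so the first hit is the only one
    if (targets.includes(stepId)) return step.id;
  }
  return undefined;
}
```

If steps were ever allowed more than one input, this would need to return a list and the caller would have to decide which cached output to use.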

The cache is written to a folder called .cli-cache adjacent to the workflow or job file. A subfolder will be created with the workflow name (which defaults to the file name), and a JSON file will be created for each step (named with the step id).

So for .tmp/workflow.json you'll get a cache path something like ./.cli-cache/workflow/step-1.json.
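The layout described above could be computed along these lines. `getCachePath` is a hypothetical helper written for this example. Treating the default workflow name as the file name without its extension is an assumption based on the example path.

```typescript
import * as path from 'node:path';

// Hypothetical helper: builds <workflow dir>/.cli-cache/<workflow name>/<step id>.json
function getCachePath(
  workflowPath: string,
  stepId: string,
  workflowName?: string
): string {
  // The cache folder sits next to the workflow or job file
  const baseDir = path.dirname(workflowPath);
  // Assumption: the default name is the file name without its extension
  const name =
    workflowName ?? path.basename(workflowPath, path.extname(workflowPath));
  return path.join(baseDir, '.cli-cache', name, `${stepId}.json`);
}
```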

Caching is off by default.

Setting the OPENFN_ALWAYS_CACHE_STEPS env var will default step caching ON. Disable by passing --no-cache-steps.
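The precedence between the flag and the env var can be sketched like this. `shouldCacheSteps` is a hypothetical name, and treating any non-empty value of OPENFN_ALWAYS_CACHE_STEPS as "on" is an assumption, since the accepted values are not spelled out here.

```typescript
// Hypothetical sketch of flag/env precedence.
// flag is true for --cache-steps, false for --no-cache-steps,
// and undefined when neither was passed.
function shouldCacheSteps(
  flag: boolean | undefined,
  env: Record<string, string | undefined>
): boolean {
  // An explicit flag always beats the environment
  if (typeof flag === 'boolean') return flag;
  // Assumption: any non-empty value switches the default to ON
  return Boolean(env['OPENFN_ALWAYS_CACHE_STEPS']);
}
```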

Fuzzy Starts

The PR also enables "fuzzy" start points to be defined.

If the start node exactly matches a step id, that step will be used as the start
If the start node partially matches a step id OR NAME, that step will be used
If the start fuzzily matches multiple steps, an error will be thrown.
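The three rules above could be implemented roughly as follows. This is a sketch, not the matcher added in this PR: `resolveStart` is a hypothetical name, and using a case-insensitive substring test for "partial match", as well as throwing when nothing matches, are assumptions.

```typescript
type Step = { id: string; name?: string };

// Hypothetical sketch of fuzzy start resolution. Returns the matched step id.
function resolveStart(steps: Step[], start: string): string {
  // 1. An exact id match always wins
  const exact = steps.find((s) => s.id === start);
  if (exact) return exact.id;

  // 2. Otherwise look for a partial match on id OR name
  //    (assumption: case-insensitive substring)
  const needle = start.toLowerCase();
  const [first, ...rest] = steps.filter(
    (s) =>
      s.id.toLowerCase().includes(needle) ||
      (s.name ?? '').toLowerCase().includes(needle)
  );

  if (!first) {
    // Assumption: no match at all is also an error
    throw new Error(`No step matches "${start}"`);
  }
  // 3. More than one fuzzy match is ambiguous
  if (rest.length > 0) {
    throw new Error(`"${start}" matches multiple steps`);
  }
  return first.id;
}
```

Checking for an exact id first means a short id like `upload` still resolves even when it is a substring of other step ids.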

This behaviour exists outside of the caching stuff, but the caching stuff benefits from it because it's easier to specify a start node now.

The idea here is that a) you may want to use the step name or id as the start point, depending on what your workflow json looks like; and b) if you're using a project downloaded from Lightning, the ids are a nightmare to type.

TODO

Checklist before requesting a review

I have performed a self-review of my code
I have added unit tests
Changesets have been added (if there are production code changes)

josephjclark · 2024-04-05T15:37:28Z

I am not going to implement a clear cache command. There's too much low level complexity associated with it.

If you do openfn clear-cache, it's not clear which cache. The repo cache? The metadata cache? The job cache? It's no good.

We could take a path to the cache. But is that a path to .cli-cache? The parent folder? The target workflow? Do you want to clear for all workflows or just for one workflow? Basically the path feels so ambiguous and unintuitive, it's not clear to me what it'll do.

Ok, so we can confirm the path before we do anything, but it's still an annoying command. Isn't rm -rf just easier anyway? We all know what that does.

The other option is openfn workflow.json --clear-cache, which will clear the cache associated with that workflow. But it's weird because it won't run the workflow. It's even weirder if you do the long form of openfn execute workflow.json --clear-cache. It also doesn't help you clear the cache for all workflows.

Maybe we'll come back to it later - right now it feels super high complexity for incredibly low value.

josephjclark · 2024-04-10T09:13:28Z

Hi @mtuchi, when you've got a spare half hour can you take another look at this, maybe run a couple of tests? Thanks!

mtuchi · 2024-04-10T11:37:54Z

@josephjclark I have tested --cache-steps and --start step-id. Everything works as expected.
I have also tested openfn workflow.json --clear-cache. It didn't work, but honestly I don't think we need it. It's very easy to clear .cli-cache if I need to.

josephjclark · 2024-04-11T16:52:11Z

Hi @mtuchi - last one I hope!

This PR now supports --end and --only

I plan to merge and release this in the morning. Let me know if you get a chance to test it out

josephjclark · 2024-04-12T09:45:56Z

packages/cli/src/execute/handler.ts

+  let customEnd;
+
+  // Handle start, end and only
+  if (options.only) {


This interaction here - not just the working out which step to run, but how it is fed to the runtime AND reported to the user - is not well tested.

It's taken quite a while to work out a good UX.

I don't want to spend much more time on this but maybe I'll see if I can find a good way to test this. I kinda worry that it'll take half a day to get a good suite of tests on it - not worth it right now.
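One way to make this interaction easier to test would be to pull the start/end/only resolution into a small pure function. The sketch below is hypothetical and is not the code in handler.ts. In particular, treating --only as shorthand for setting both start and end to the same step, and letting it take precedence over --start and --end, are assumptions.

```typescript
type RangeOptions = { start?: string; end?: string; only?: string };
type Range = { start: string | undefined; end: string | undefined };

// Hypothetical sketch: work out which steps bound the run.
function resolveRange(options: RangeOptions): Range {
  if (options.only) {
    // Assumption: --only means "start at this step and stop after it",
    // and it takes precedence over --start and --end
    return { start: options.only, end: options.only };
  }
  // Otherwise pass start and end through untouched
  return { start: options.start, end: options.end };
}
```

A function like this has no dependency on the runtime or the logger, so each combination of flags can be covered by a one-line assertion.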

josephjclark added 5 commits April 2, 2024 15:08

cli: write job output to disk if --cache is passed

79567a0

cli: don't cache by default

d6f5a8f

cli: use cached input state if appropriate

e2a56cc

cli: for a start node, find the input from the upstream node

43eacb7

cli: don't use step name in the cache

c67d0e4

josephjclark added 5 commits April 5, 2024 13:01

cli: return input state properly

8a6c55a

cli: tweak logging and fix file writes

015f08e

fix state loading with and without cache

33562f5

cli: fix log output and order

4dcd7e3

cli: clear the cache when running a workflow with cache enabled

2fe2dce

prevents junk building up

josephjclark added 6 commits April 8, 2024 14:03

cli: fix typings

d8be760

update package lock

a0c40d6

cli: unit tests for getUpstreamStepId

a3f6e49

prettier

dde19b6

cli: add a fuzzy string match for step names

05f9553

cli: added error handling and fuzzy matching to cli

5e22ac0

josephjclark changed the title ~~CLI: cache outputs~~ CLI: cache outputs & Fuzzy Starts Apr 8, 2024

josephjclark added 2 commits April 9, 2024 08:23

cli: add env var to default step caching on

20b833b

cli: cache -> cache-steps

3231841

josephjclark added 4 commits April 9, 2024 08:33

cli: fix refactor

acf7483

readme

df4d0a9

cli: unit tests for step cache

b8f87cc

cli: generate gitignore in cli cache

83f4318

josephjclark marked this pull request as ready for review April 9, 2024 08:44

cli: review tweaks

d83657f

josephjclark mentioned this pull request Apr 9, 2024

Document step caching in the CLI OpenFn/docs#467

Merged

changeset

ea248a3

josephjclark changed the base branch from main to release/next April 11, 2024 11:51

josephjclark added 7 commits April 11, 2024 16:11

Merge branch 'cli-cache' of github.com-josephjclark:OpenFn/kit into c…

98989c0

…li-cache

runtime: support an end option

0c17ee9

The runtime will exit after this step has been executed (even if there are more steps outstanding)

runtime: changeset

cecdb60

cli: support end and only

e666bec

cli: unit tests and fixes for only

a614623

cli: comment

0f21ce3

tests: fix tests

20fab1b

josephjclark added 3 commits April 12, 2024 10:02

cli: fix an issue where load-step blows up if there's no next object

42bc8a7

cli: tweak log levels

34cec30

Always report when you're looking for a cache, and only warn if no input was found

cli: sort out UX of loading from the cache

898a390

If start is passed but state is not, we always try to load from the cache and warn if we didn't load anything. If no start is passed, we never try to load from the cache.

josephjclark commented Apr 12, 2024

cli: fix integration test

bf1f940

cli: update readme

49c7ff4

josephjclark requested a review from mtuchi April 12, 2024 10:31

josephjclark merged commit b86355d into release/next Apr 12, 2024
5 checks passed

josephjclark deleted the cli-cache branch April 12, 2024 15:34
