
Support for Crunch automation #395

Open
jamesrkg opened this issue Mar 8, 2021 · 9 comments


jamesrkg commented Mar 8, 2021

@jjdelc @malecki can we please get the features needed to manage Crunch Automation from scrunch added soon? We plan to use this heavily in the near future. For scheduled/repeated project processing we want to migrate all intra-dataset actions to automation scripts; however, we need to push these using scrunch because they will be the punctuation between inter-dataset actions (still performed with traditional scrunch calls).


jjdelc commented Mar 9, 2021

I started working on this https://github.com/Crunch-io/scrunch/tree/dataset-scripts

jamesrkg (Author) commented:

Great.

In the docstring for DatasetScripts.collapse you mention "too many scripts". Can you describe the limitations/expectations around using multiple scripts? For tracking studies we'll likely have cyclical processes such as fork > scripts > mergeback > repeat. Do the scripts added on those forks accumulate permanently over time? Any concerns to be aware of? If we're controlling/storing scripts outside of the dataset, is it possible to clean up scripts that ran successfully to prevent this, perhaps with something like DatasetScripts.execute(..., store=False)?


jamesrkg commented Mar 10, 2021

Another question: is there any functional difference between reverting to a pre-script savepoint and reverting to any other savepoint? Are script savepoints simply a convenience around regular dataset savepoints that are being managed as part of the script execution process?


jjdelc commented Mar 11, 2021

> Can you describe the limitations/expectations around using multiple scripts?

We've seen some automation users run many single-command scripts, quickly accumulating hundreds of scripts. Each script carries the cost of a dataset savepoint to allow for revert, so even if a single-line script seems small, it takes a bit of storage space.

The expectation is to perform most (ideally all) transformations in one script. The revert-repeat cycle works well: you run the script, and if you don't like the results you revert, adjust the script, and re-run. Reverting a script deletes its entry, so reverted attempts don't accumulate; the scripts list only contains successfully executed scripts.
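
A minimal sketch of that revert-repeat cycle from scrunch, assuming the dataset-scripts branch ends up exposing something like `ds.scripts.execute(...)` plus a revert on the script entry; the method names here are illustrative, not the final API:

```python
from scrunch import get_dataset

# Assumes an already-authenticated scrunch session (see scrunch.connect()).
ds = get_dataset("Tracker wave 12")

script_body = open("monthly_recodes.txt").read()  # Crunch Automation commands

ds.scripts.execute(script_body)   # hypothetical: creates a savepoint, then runs

# If the results look wrong: revert (this deletes the script's entry, so failed
# attempts don't accumulate), edit the script on disk, and run it again.
ds.scripts.all()[-1].revert()     # hypothetical accessor/method names
ds.scripts.execute(open("monthly_recodes.txt").read())
```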

> is there any functional difference between reverting to a pre-script savepoint and reverting to any other savepoint?

Scripts generate a savepoint before executing. These savepoints are the same as any other savepoint in the system. If you go to your savepoints list and revert to a savepoint that was created by a script, you'll bring the world back to that point, same as with any other savepoint. Scripts are part of that world, so reverting also brings the scripts list back in time to that mark.

Script entities do have a dedicated /revert/ endpoint which does a bit of extra trickery. If you revert a script (instead of reverting only its associated savepoint) it will bring the dataset's state back to that savepoint, but it will also delete any artifact that was created by the script. Normal savepoint reverting does not delete artifacts; reverting a script will delete any filter or multitable created by it.

We are still testing these behaviors; they can be confusing, so feedback on them will be helpful.

Just to make this comment more complicated: scripts also expose an /undo/ endpoint (not exposed in this scrunch API) that, unlike /revert/, only deletes the artifacts and variables the script created but does not revert to a savepoint. This means that undoing a script (unlike reverting it) wouldn't make you lose appends you'd done after that script.

I opted to expose only one of these to spare users this confusion.
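
To make the difference concrete (outside scrunch, which only wraps revert), here is a rough sketch of hitting the two script endpoints directly; the URL shape and the authentication step are placeholders, only the /revert/ and /undo/ path fragments come from the description above:

```python
import requests

session = requests.Session()
# ... authentication against the Crunch API is assumed to have happened here ...

# Placeholder URL shape for a script entity.
script_url = "https://app.crunch.io/api/datasets/<dataset_id>/scripts/<script_id>/"

# Option A -- revert: roll the dataset back to the script's pre-execution
# savepoint AND delete the artifacts (filters, multitables) the script created.
# Work done after the script (e.g. appends) is lost.
session.post(script_url + "revert/")

# Option B -- undo (not exposed in scrunch): only delete the artifacts and
# variables the script created; no savepoint rollback, so later appends survive.
session.post(script_url + "undo/")
```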

jamesrkg (Author) commented:

OK, thanks for the details. We'll make sure to cover this in training so that, where possible, the number of scripts is minimized by collecting all adjacent actions into a single script.

Say a project had 40 countries to update in the master dataset every month. For each update, for each country, a script would be used because the process would be: fork streaming > script > append to fork of master > mergeback.
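
For concreteness, I imagine that monthly cycle looking roughly like this in scrunch; the fork/append/merge method names and the per-country script helper are assumptions for illustration, not a confirmed API:

```python
from scrunch import get_dataset

def load_country_script(country):
    # Hypothetical helper returning the Crunch Automation text for one country.
    return open(f"scripts/{country}.txt").read()

master = get_dataset("Master tracker")     # assumes an authenticated session
countries = ["AU", "NZ", "SG"]             # ... up to the full 40 per month

for country in countries:
    streaming = get_dataset(f"{country} streaming")
    wave = streaming.fork(description=f"{country} monthly cut")
    wave.scripts.execute(load_country_script(country))   # intra-dataset recodes

    master_fork = master.fork(description=f"{country} update")
    master_fork.append_dataset(wave)       # inter-dataset action via scrunch
    master.merge(master_fork)              # mergeback into the master dataset
```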

Does this mean the master dataset would end up with 38+ scripts/savepoints each month? What if the update frequency increased to twice-monthly or even weekly (2-4x as many scripts being used per month) - what does the extreme end of this look like from your POV? What if the project went on for 5 years?

Essentially, can you anticipate a point at which we'd encounter an issue, and what should we know/do from the beginning to mitigate any adverse effects? Is it possible to clean up savepoints/executed scripts some time in the future when we know they're never going to be reverted again?


jjdelc commented Mar 15, 2021

I believe yes: a script execution gets recorded as an action, so when you merge the fork, that action (and all the steps the script performed) will get replayed. You will end up with all the scripts from all the forks.

The fact that a dataset has hundreds of executed scripts isn't a problem for the system, but it is a problem for the user, because it becomes hard to make sense of what's going on with so many scripts. In the master tracker I suppose they can be ignored, since nobody would be reverting/re-running scripts there.

The scripts API provides a /collapse/ endpoint that concatenates all executed scripts back into a single one. You can do that at any point to always go back to a single script.
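
In scrunch that cleanup could end up being a one-liner; the spelling below is assumed to mirror the /collapse/ endpoint and the DatasetScripts.collapse docstring mentioned earlier in this thread:

```python
# Consolidate all executed scripts back into a single concatenated script.
ds.scripts.collapse()
```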

jamesrkg (Author) commented:

Thanks @jjdelc. It might not make sense to collapse these scripts, because they couldn't be replayed all at once. Since automation scripts offer only intra-dataset functionality, they will be punctuated by inter-dataset actions directed by other parts of scrunch (e.g. forking, merging back, joining, appending, comparing). I suppose looking only at the universe of scripts housed in a dataset ignores these kinds of actions.


jjdelc commented Mar 17, 2021

@jamesrkg I need to make a correction: what I said above is incorrect. I ran some tests last night to verify how scripts behave on forks and merges.

It turns out that the action recording the script's creation does not get replayed when the fork gets merged back, but all the steps executed by the script will.

So you can proceed forking, scripting, and merging, and your scripts will only live in the forks. The main dataset will not keep a record that those modifications happened via scripts.
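
A small way to see that behavior, under the same assumed method names as above (`scripts.all()` is a hypothetical listing accessor):

```python
fork = master.fork(description="AU update")
fork.scripts.execute(script_body)   # the script entry is recorded on the fork only
master.merge(fork)                  # the script's steps replay; the entry does not

print(len(fork.scripts.all()))      # 1 -- the fork keeps its script
print(len(master.scripts.all()))    # 0 -- master has no record this came via a script
```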

jamesrkg (Author) commented:

Thanks for confirming @jjdelc, I think overall that is a good thing. 👍
