
Support for Crunch automation #395

Open
jamesrkg opened this issue Mar 8, 2021 · 9 comments


jamesrkg commented Mar 8, 2021

@jjdelc @malecki can we please get the features needed to manage Crunch Automation from scrunch added soon? We plan to use this heavily in the near future. For scheduled/repeated project processing we want to migrate all intra-dataset actions to automation scripts; however, we need to push these using scrunch because they will be the punctuation between inter-dataset actions (still performed with traditional scrunch calls).


jjdelc commented Mar 9, 2021

I started working on this https://github.com/Crunch-io/scrunch/tree/dataset-scripts

jamesrkg (Author) commented:

Great.

In the docstring for DatasetScripts.collapse you mention "too many scripts". Can you describe the limitations/expectations around using multiple scripts? For tracking studies we'll likely have cyclical processes such as fork > scripts > mergeback > repeat. Do the scripts added on those forks accumulate permanently over time? Any concerns to be aware of? If we're controlling/storing scripts outside of the dataset, is it possible to clean up scripts that ran successfully to prevent this, perhaps with something like DatasetScripts.execute(..., store=False)?


jamesrkg commented Mar 10, 2021

Another question: is there any functional difference between reverting to a pre-script savepoint and reverting to any other savepoint? Are script savepoints simply a convenience around regular dataset savepoints that are being managed as part of the script execution process?


jjdelc commented Mar 11, 2021

> Can you describe the limitations/expectations around using multiple scripts?

We've seen some automation users run many single-command scripts, quickly accumulating hundreds of scripts. Each script carries the cost of a dataset savepoint to allow for revert, so even if a single-line script seems small, it takes a bit of storage space.

The expectation is to perform most (ideally all) transformations in one script. The revert-repeat cycle works well: you run the script, and if you don't like the results you revert, adjust the script, and re-run. Reverting a script deletes its entry, so reverted attempts don't accumulate; the scripts list only contains successfully executed scripts.
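
A minimal sketch of that revert-repeat cycle from scrunch, assuming the dataset-scripts branch ends up exposing something like `ds.scripts.execute(...)` plus a revert on the script entry; the method names here are illustrative, not the final API:

```python
from scrunch import get_dataset

# Assumes an already-authenticated scrunch session (see scrunch.connect()).
ds = get_dataset("Tracker wave 12")

script_body = open("monthly_recodes.txt").read()  # Crunch Automation commands

ds.scripts.execute(script_body)   # hypothetical: creates a savepoint, then runs

# If the results look wrong: revert (this deletes the script's entry, so failed
# attempts don't accumulate), edit the script on disk, and run it again.
ds.scripts.all()[-1].revert()     # hypothetical accessor/method names
ds.scripts.execute(open("monthly_recodes.txt").read())
```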

> is there any functional difference between reverting to a pre-script savepoint and reverting to any other savepoint?

Scripts generate a savepoint before executing. These savepoints are the same as any other savepoint in the system. If you go to your savepoints list and revert to a savepoint that was created by a script, you'll bring the world back to that point, same as with any other savepoint. Scripts are part of that world, so reverting also brings the scripts list back in time to that mark.

Script entities do have a dedicated /revert/ endpoint which does a bit of extra trickery. If you revert a script (instead of reverting only its associated savepoint) it will bring the dataset's state back to that savepoint, but it will also delete any artifact that was created by the script. Normal savepoint reverting does not delete artifacts; reverting a script will delete any filter or multitable created by it.

We are still testing these behaviors; they can be confusing, so feedback on them will be helpful.

Just to make this comment more complicated: scripts also expose an /undo/ endpoint (not exposed in this scrunch API) that, unlike /revert/, only deletes the artifacts and variables the script created but does not revert to a savepoint. This means that undoing a script (unlike reverting it) wouldn't make you lose appends you'd done after that script.

I opted to expose only one of these to spare users this confusion.
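
To make the difference concrete (outside scrunch, which only wraps revert), here is a rough sketch of hitting the two script endpoints directly; the URL shape and the authentication step are placeholders, only the /revert/ and /undo/ path fragments come from the description above:

```python
import requests

session = requests.Session()
# ... authentication against the Crunch API is assumed to have happened here ...

# Placeholder URL shape for a script entity.
script_url = "https://app.crunch.io/api/datasets/<dataset_id>/scripts/<script_id>/"

# Option A -- revert: roll the dataset back to the script's pre-execution
# savepoint AND delete the artifacts (filters, multitables) the script created.
# Work done after the script (e.g. appends) is lost.
session.post(script_url + "revert/")

# Option B -- undo (not exposed in scrunch): only delete the artifacts and
# variables the script created; no savepoint rollback, so later appends survive.
session.post(script_url + "undo/")
```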

jamesrkg (Author) commented:

OK, thanks for the details. We'll make sure to cover this in training so that, where possible, the number of scripts is minimized by collecting all adjacent actions into a single script.

Say a project had 40 countries to update in the master dataset every month. For each update, for each country, a script would be used because the process would be: fork streaming > script > append to fork of master > mergeback.
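
For concreteness, I imagine that monthly cycle looking roughly like this in scrunch; the fork/append/merge method names and the per-country script helper are assumptions for illustration, not a confirmed API:

```python
from scrunch import get_dataset

def load_country_script(country):
    # Hypothetical helper returning the Crunch Automation text for one country.
    return open(f"scripts/{country}.txt").read()

master = get_dataset("Master tracker")     # assumes an authenticated session
countries = ["AU", "NZ", "SG"]             # ... up to the full 40 per month

for country in countries:
    streaming = get_dataset(f"{country} streaming")
    wave = streaming.fork(description=f"{country} monthly cut")
    wave.scripts.execute(load_country_script(country))   # intra-dataset recodes

    master_fork = master.fork(description=f"{country} update")
    master_fork.append_dataset(wave)       # inter-dataset action via scrunch
    master.merge(master_fork)              # mergeback into the master dataset
```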

Does this mean the master dataset would end up with 38+ scripts/savepoints each month? What if the update frequency increased to twice-monthly or even weekly (2-4x as many scripts being used per month) - what does the extreme end of this look like from your POV? What if the project went on for 5 years?

Essentially, can you anticipate a point at which we'd encounter an issue, and what should we know/do from the beginning to mitigate any adverse effects? Is it possible to clean up savepoints/executed scripts some time in the future when we know they're never going to be reverted again?


jjdelc commented Mar 15, 2021

I believe yes: a script execution gets recorded as an action, so when you merge the fork, that action (and all the steps the script performed) will get replayed. You will end up with all the scripts from all the forks.

The fact that a dataset has hundreds of executed scripts isn't a problem for the system, but it is a problem for the user, because it becomes hard to make sense of what's going on with so many scripts. In the master tracker I suppose they can be ignored, since nobody would be reverting/re-running scripts there.

The scripts API provides a /collapse/ endpoint that concatenates all executed scripts back into a single one. You can do that at any point to always go back to a single script.
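
In scrunch that cleanup could end up being a one-liner; the spelling below is assumed to mirror the /collapse/ endpoint and the DatasetScripts.collapse docstring mentioned earlier in this thread:

```python
# Consolidate all executed scripts back into a single concatenated script.
ds.scripts.collapse()
```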

jamesrkg (Author) commented:

Thanks @jjdelc. It might not make sense to collapse these scripts, because they couldn't be replayed all at once. Since automation scripts offer only intra-dataset functionality, they will be punctuated by inter-dataset actions directed by other parts of scrunch (e.g. forking, merging back, joining, appending, comparing). I suppose looking only at the universe of scripts housed in a dataset ignores these kinds of actions.


jjdelc commented Mar 17, 2021

@jamesrkg I need to make a correction: what I said above is incorrect. I ran some tests last night to verify how scripts behave on forks and merges.

It turns out that the action recording the script's creation does not get replayed when the fork gets merged back, but all the steps executed by the script will.

So you can proceed forking, scripting, and merging, and your scripts will only live in the forks. The main dataset will not keep a record that those modifications happened via scripts.
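
A small way to see that behavior, under the same assumed method names as above (`scripts.all()` is a hypothetical listing accessor):

```python
fork = master.fork(description="AU update")
fork.scripts.execute(script_body)   # the script entry is recorded on the fork only
master.merge(fork)                  # the script's steps replay; the entry does not

print(len(fork.scripts.all()))      # 1 -- the fork keeps its script
print(len(master.scripts.all()))    # 0 -- master has no record this came via a script
```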

jamesrkg (Author) commented:

Thanks for confirming @jjdelc, I think overall that is a good thing. 👍
