Safeguards don't always stop the experiment #5

Open
WixoLeo opened this issue Jun 15, 2021 · 6 comments · May be fixed by #6

Comments

@WixoLeo

WixoLeo commented Jun 15, 2021

Because signals are unreliable, they don't always stop the experiment. If a signal is raised, any try/except block that happens to be running can catch and swallow it, so the experiment keeps running.
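
A minimal sketch of the failure mode described above, assuming the interruption arrives as an exception raised inside whatever code is running at that moment. `InterruptExecution`, `deliver_interrupt` and `run_experiment` are local stand-ins written for this illustration, and the signal delivery is simulated with a plain `raise` so the example stays deterministic.

```python
class InterruptExecution(Exception):
    """Local stand-in for the exception a signal handler would raise."""


def deliver_interrupt():
    # In the real scenario this would be raised asynchronously by a handler.
    raise InterruptExecution("safeguard probe failed")


def activity_with_broad_except():
    try:
        deliver_interrupt()
    except Exception:
        # The broad handler also catches the interruption and hides it.
        return "swallowed"
    return "completed"


def activity_with_narrow_except():
    try:
        deliver_interrupt()
    except ValueError:
        # Only ValueError is handled, so the interruption propagates.
        return "swallowed"
    return "completed"


def run_experiment(activities):
    results = []
    try:
        for activity in activities:
            results.append(activity())
    except InterruptExecution:
        results.append("interrupted")
    return results
```

With the broad handler the runner never sees the interruption and moves on to the next activity; with the narrow handler it does.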

We would like to suggest a quick workaround.
In safeguards, we would set an environment variable, "CHAOS_STOP", whenever a safeguard probe fails.
Then, in controls, we can check whether that variable is set before every activity and stop the experiment by preventing the activity from running, perhaps by running exit gracefully again or by creating a proper way to stop experiments from controls.
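
A rough sketch of the proposed workaround, assuming everything runs inside a single Python process. The `StopExperiment` exception and the `on_safeguard_failure`, `before_activity` and `run` functions are hypothetical names used only for illustration; only the `CHAOS_STOP` variable name comes from the proposal.

```python
import os

STOP_VARIABLE = "CHAOS_STOP"  # name taken from the proposal above


class StopExperiment(Exception):
    """Hypothetical exception used to abort the run from a control."""


def on_safeguard_failure():
    # Would be called when a safeguard probe falls outside its tolerance.
    os.environ[STOP_VARIABLE] = "1"


def before_activity(activity_name):
    # Hypothetical hook invoked before every activity.
    if os.environ.get(STOP_VARIABLE):
        raise StopExperiment(f"{STOP_VARIABLE} is set, skipping {activity_name!r}")


def run(activities):
    """Run (name, callable) pairs until the stop variable is seen."""
    executed = []
    try:
        for name, func in activities:
            before_activity(name)
            func()
            executed.append(name)
    except StopExperiment:
        pass
    return executed
```

For example, if the second activity triggers `on_safeguard_failure()`, that activity still completes and the third one never starts.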

@alexander-gorelik

@Lawouach any ETA for that? If you don't have time, we can make the change ourselves and open a PR.

@Lawouach
Contributor

No ETA. Setting an environment variable that other processes can see is usually not possible, since the change is made in the private space of the process. Unless my Unix-fu is rusty :p
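
A small sketch of that point, assuming a Python interpreter is available at `sys.executable`: a child process sets `CHAOS_STOP` in its own environment, and the parent never sees it. The `demo` function is a hypothetical helper for this illustration.

```python
import os
import subprocess
import sys


def demo():
    """Return (value seen by the child, value seen by the parent)."""
    os.environ.pop("CHAOS_STOP", None)  # start from a clean parent environment
    child_code = (
        "import os; "
        "os.environ['CHAOS_STOP'] = '1'; "
        "print(os.environ['CHAOS_STOP'])"
    )
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        capture_output=True,
        text=True,
        check=True,
    )
    seen_by_child = result.stdout.strip()
    # The child's change lived only in the child's copy of the environment.
    seen_by_parent = os.environ.get("CHAOS_STOP")
    return seen_by_child, seen_by_parent
```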

Perhaps a different approach would be to rely on a well-known lock file instead?
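
One way the lock-file idea could look, as a sketch only. The file location and the `request_stop`, `stop_requested` and `clear_stop` helpers are assumptions made for this example; any path both sides agree on would do.

```python
import tempfile
from pathlib import Path

# Hypothetical well-known location shared by the safeguard and its observers.
STOP_FILE = Path(tempfile.gettempdir()) / "chaos-stop.lock"


def request_stop(path=STOP_FILE):
    # Called when a safeguard probe fails: create the marker file.
    path.touch()


def stop_requested(path=STOP_FILE):
    # Any process on the same machine can check for the marker.
    return path.exists()


def clear_stop(path=STOP_FILE):
    # Remove the marker before a new run (requires Python 3.8+).
    path.unlink(missing_ok=True)
```

Unlike an environment variable, the file is visible to any process that can read the directory, which is what makes it usable for signalling from outside.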

@Lawouach
Contributor

Thanks. Not sure this answers the initial question. What Leo is actually asking for (being able to stop before/after activities) is already supported, at least in the sense that you can already create a controller that does this.

But, as mentioned elsewhere, the challenge is not synchronizing threads. The challenge is interrupting blocking calls that live outside the Python VM.

Say you call time.sleep(30): the call blocks beneath the Python VM itself. You cannot gracefully interrupt such native calls.
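
A short sketch of the difference, under the assumption that the stop request arrives as a flag set from another thread. The function names are hypothetical. A plain `time.sleep` runs to completion regardless of the flag, whereas code written to wait on a `threading.Event` returns as soon as the flag is set.

```python
import threading
import time


def blocking_activity(duration, stop_event):
    """Sleeps for the whole duration; the event cannot cut time.sleep short."""
    started = time.monotonic()
    time.sleep(duration)
    return time.monotonic() - started


def cooperative_activity(duration, stop_event):
    """Waits on the event instead, so it returns once the event is set."""
    started = time.monotonic()
    stop_event.wait(timeout=duration)
    return time.monotonic() - started


def run_with_stop(activity, duration, stop_after):
    """Run the activity and request a stop after `stop_after` seconds."""
    stop_event = threading.Event()
    timer = threading.Timer(stop_after, stop_event.set)
    timer.start()
    try:
        return activity(duration, stop_event)
    finally:
        timer.cancel()
```

The cooperative version only helps when the activity's author controls the waiting code; a blocking call inside a third-party library behaves like `blocking_activity`.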

Now, if you are only looking for a change in the safeguard control to say "don't exit randomly but wait for the next 'before_activity'", we can indeed do that without a lock file or env variable. But I had understood that you needed the env variable to terminate the chaos process from outside, by monitoring that var/file. But you still have to accept that if the activity is doing a long blocking call, I will not be able to do anything until it has terminated.
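
The behaviour described here can be sketched as follows, assuming a runner that checks a stop flag before each activity. `run_experiment` and `demo` are hypothetical, and the safeguard is simulated with a timer thread. The stop request arrives during the first activity, but it only takes effect once that blocking call has returned.

```python
import threading
import time


def run_experiment(activities, stop_event):
    """Run (name, callable) pairs, checking the stop flag before each one."""
    executed = []
    for name, func in activities:
        if stop_event.is_set():  # the hypothetical "before activity" check
            break
        func()
        executed.append(name)
    return executed


def demo():
    stop_event = threading.Event()
    # Stand-in for a safeguard probe failing while the first activity runs.
    safeguard = threading.Timer(0.05, stop_event.set)
    safeguard.start()
    started = time.monotonic()
    executed = run_experiment(
        [
            ("long-blocking-call", lambda: time.sleep(0.3)),
            ("next-activity", lambda: None),
        ],
        stop_event,
    )
    safeguard.join()
    return executed, time.monotonic() - started
```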

@Lawouach Lawouach linked a pull request Jun 17, 2021 that will close this issue
@Lawouach
Contributor

Please give that branch a try and let me know if it helps.

Add this flag to the safeguards arguments (next to the probes list):

"interrupt_after_activity": true

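For context, a sketch of where that flag might sit, written as a Python dict that serializes to the experiment's JSON. The control name, the module path and the empty probes list are assumptions for illustration; only the `interrupt_after_activity` flag and its placement next to the probes list come from the comment above.

```python
import json

safeguards_control = {
    "name": "safeguards",  # assumed control name
    "provider": {
        "type": "python",
        # Assumed module path for the safeguards control; check your installation.
        "module": "chaosaddons.controls.safeguards",
        "arguments": {
            "probes": [],  # your existing safeguard probes go here
            "interrupt_after_activity": True,
        },
    },
}

# Print the JSON form that would go under the experiment's "controls" key.
print(json.dumps({"controls": [safeguards_control]}, indent=2))
```
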
@WixoLeo
Author

WixoLeo commented Jun 20, 2021

[image]
For some reason the interruption happens only after the next activity, not after the current one.
(I defined the same probe twice in safeguards and set the tolerance on the first one so that it would not be met.)
