Safeguards don't always stop the experiment #5

WixoLeo · 2021-06-15T09:44:22Z

Because signals are unreliable, they don't always stop the experiment. If a signal is raised, any try/except block along the way can catch and swallow it, so the experiment keeps running.
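To illustrate the failure mode, here is a minimal sketch. It assumes the stop signal surfaces as an ordinary exception; `InterruptExecution`, `flaky_remote_call` and `run` are hypothetical names made up for this example, not the library's actual code.

```python
class InterruptExecution(Exception):
    """Hypothetical stand-in for the exception a stop signal turns into."""


def flaky_remote_call():
    # Pretend the stop signal arrives while this call is in progress.
    raise InterruptExecution("safeguard asked the experiment to stop")


def careless_activity():
    try:
        flaky_remote_call()
    except Exception:
        # Broad handler meant for network errors; it also eats the interruption.
        return "recovered"


def careful_activity():
    try:
        flaky_remote_call()
    except InterruptExecution:
        raise  # let the stop request propagate
    except Exception:
        return "recovered"


def run(activity):
    """Return True if the experiment was stopped, False if it kept going."""
    try:
        activity()
    except InterruptExecution:
        return True
    return False
```

With `careless_activity` the runner never sees the interruption and carries on; with `careful_activity` it does.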

We would like to suggest a quick workaround.
In safeguards, we would set an environment variable "CHAOS_STOP" whenever a safeguard probe fails.
Then, in controls, we can check whether that variable is set before every activity and stop the experiment by preventing the activity from running, perhaps by calling exit gracefully again or by creating a proper way of stopping experiments from controls.
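Roughly, the idea could look like the sketch below. Only the variable name "CHAOS_STOP" comes from the proposal; `on_safeguard_failure`, `before_activity`, `run_experiment` and the local `InterruptExecution` class are hypothetical placeholders. It assumes the safeguard and the control run in the same process.

```python
import os

STOP_VAR = "CHAOS_STOP"  # name taken from the proposal above


class InterruptExecution(Exception):
    """Hypothetical stand-in for whatever exception stops the run."""


def on_safeguard_failure():
    # Called (hypothetically) by the safeguard when one of its probes fails.
    os.environ[STOP_VAR] = "1"


def before_activity(activity_name):
    # Hypothetical hook invoked before each activity.
    if os.environ.get(STOP_VAR):
        raise InterruptExecution(f"{STOP_VAR} set, skipping {activity_name!r}")


def run_experiment(activities):
    """Run (name, callable) pairs until the stop variable is seen."""
    executed = []
    for name, func in activities:
        try:
            before_activity(name)
        except InterruptExecution:
            break
        func()
        executed.append(name)
    return executed
```

If the second of three activities trips the safeguard, `run_experiment` returns only the first two names: the activity in flight still completes, and the next one is never started.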

alexander-gorelik · 2021-06-16T11:55:13Z

@Lawouach any ETA for that? If you don't have time, we can make the change and open a PR.

Lawouach · 2021-06-16T14:38:49Z

No ETA. Setting an environment variable so that other processes can see it is usually not possible, since the change happens in the private space of the process. Unless my Unix-fu is rusty :p
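A quick way to check this, sketched with the standard library only: a child process sets the variable in its own environment, and the parent looks for it afterwards. The helper name `child_sets_variable` is made up for this example.

```python
import os
import subprocess
import sys

# Code run by the child: set the variable, then print what the child sees.
CHILD_CODE = (
    "import os; os.environ['CHAOS_STOP'] = '1'; print(os.environ['CHAOS_STOP'])"
)


def child_sets_variable():
    """Run a child that sets CHAOS_STOP, then report what each side sees."""
    os.environ.pop("CHAOS_STOP", None)
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        capture_output=True,
        text=True,
        check=True,
    )
    seen_by_child = result.stdout.strip()
    seen_by_parent = os.environ.get("CHAOS_STOP")
    return seen_by_child, seen_by_parent
```

The child sees its own change, while the parent's environment is left untouched.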

Perhaps a different approach would be to rely on a well-known lock file instead?
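A minimal sketch of the lock-file idea, assuming both processes agree on a path. The file name `chaos-stop.lock` and the three helper functions are hypothetical, chosen only for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical well-known location; any path both processes agree on works.
STOP_FILE = Path(tempfile.gettempdir()) / "chaos-stop.lock"


def request_stop(reason=""):
    # Written by whichever process notices the failed safeguard.
    STOP_FILE.write_text(reason)


def stop_requested():
    # Polled by the process running the experiment, e.g. before each activity.
    return STOP_FILE.exists()


def clear_stop():
    # Remove a stale file so the next run starts clean (Python 3.8+).
    STOP_FILE.unlink(missing_ok=True)
```

Unlike an environment variable, the file is visible to any process that can read the directory, so an external monitor could create or watch it.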

alexander-gorelik · 2021-06-17T09:21:30Z

Found two very interesting articles on the matter:
https://www.pythonforthelab.com/blog/starting-and-synchronizing-threads/
https://www.pythonforthelab.com/blog/handling-and-sharing-data-between-threads/

Lawouach · 2021-06-17T09:28:51Z

Thanks. Not sure this answers the initial question. What Leo actually requested (being able to stop before/after activities) is already supported, at least in the sense that you can already create a controller that does this.

But, as mentioned elsewhere, the challenge is not synchronizing threads. The challenge is interrupting blocking calls that live outside the Python VM.

Say you call time.sleep(30): the call blocks in native code beneath the Python VM itself. You cannot gracefully interrupt such native calls.
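As a rough illustration, the sketch below sets a stop flag while a worker is busy. `time.sleep` ignores the flag and runs to completion, whereas `threading.Event.wait` returns as soon as the flag is set. The function names are made up for this example, and the durations are shortened from 30 seconds to keep it quick.

```python
import threading
import time


def blocking_worker(stop, duration):
    # time.sleep() never looks at the stop flag; it runs to completion.
    started = time.monotonic()
    time.sleep(duration)
    return time.monotonic() - started


def cooperative_worker(stop, duration):
    # Event.wait() returns as soon as the flag is set, or after the timeout.
    started = time.monotonic()
    stop.wait(timeout=duration)
    return time.monotonic() - started


def elapsed_when_stopped_early(worker, duration=0.5, stop_after=0.1):
    """Run worker in the current thread and set the stop flag from a timer."""
    stop = threading.Event()
    timer = threading.Timer(stop_after, stop.set)
    timer.start()
    try:
        return worker(stop, duration)
    finally:
        timer.cancel()
```

The cooperative worker comes back shortly after the flag is set; the blocking one only returns once the full sleep has elapsed. An activity stuck in a native call it does not control behaves like the blocking worker.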

Now, if you are only looking for a change in the safeguard control to say "don't exit randomly but wait for the next 'before_activity'", we can indeed do that without a lock file or env variable. But I had understood that you needed the env variable to terminate the chaos process from outside, by monitoring for that var/file. Either way, you still have to accept that if the activity is doing a long blocking call, I will not be able to do anything until it has terminated.

Lawouach · 2021-06-17T12:18:34Z

Please give that branch a try and let me know if it helps.

Add this flag to the safeguards arguments (next to the probes list):

"interrupt_after_activity": true
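For context, here is where the flag might sit in a full control declaration, written as a Python dict and dumped to JSON. Only the flag name and its position next to the probes list come from this thread. The module path `chaosaddons.controls.safeguards`, the probe and its URL are assumptions for illustration and should be checked against your own experiment file.

```python
import json

# Assumed layout of the control declaration; the module path and the probe
# below are illustrative guesses, only the flag name comes from this thread.
safeguard_control = {
    "name": "safeguard",
    "provider": {
        "type": "python",
        "module": "chaosaddons.controls.safeguards",  # assumed module path
        "arguments": {
            "probes": [
                {
                    "name": "service-is-healthy",  # hypothetical probe
                    "type": "probe",
                    "tolerance": 200,
                    "provider": {
                        "type": "http",
                        "url": "http://localhost:8080/health",  # hypothetical
                    },
                }
            ],
            # The new flag goes next to the probes list.
            "interrupt_after_activity": True,
        },
    },
}

print(json.dumps({"controls": [safeguard_control]}, indent=2))
```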

WixoLeo · 2021-06-20T13:50:48Z

For some reason, the interruption happens only after the next activity, not after the current one.
(I've put two identical probes in the safeguards and set an unmet tolerance on the first one.)

Lawouach linked a pull request Jun 17, 2021 that will close this issue

Add a flag to the sefaguard control to delay it #6

Open
