[core] [wip] Rework memory store signal checking in C++ instead of cython #49319
Signed-off-by: dayshah <[email protected]>
@@ -31,6 +32,16 @@ const int64_t kUnhandledErrorGracePeriodNanos = static_cast<int64_t>(5e9);
// when there are too many local objects.
const int kMaxUnhandledErrorScanItems = 1000;

namespace {

Status signal_status = Status::OK();
This can be called from multiple threads, right? Should we make it safe by making it thread_local or guarding it with a mutex?
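One way to address the concern above, sketched under assumptions: a signal handler can run on any thread, so `thread_local` would not help, and locking a mutex inside a handler is not async-signal-safe. A lock-free atomic avoids both problems. The names `g_pending_signal` and `ConsumePendingSignal` are hypothetical and do not come from the PR; only `SignalHandler` mirrors a name in the diff.

```cpp
#include <atomic>
#include <csignal>

namespace {

// Hypothetical stand-in for the plain global `Status signal_status`.
// Stores the last signal number seen, or 0 if none is pending.
// Assumption: std::atomic<int> is lock-free on the target platform,
// which is required for use inside a signal handler.
std::atomic<int> g_pending_signal{0};

// Only performs an atomic store, which is async-signal-safe
// when the atomic is lock-free.
void SignalHandler(int signum) {
  g_pending_signal.store(signum, std::memory_order_relaxed);
}

}  // namespace

// Returns the pending signal number (0 if none) and clears it atomically,
// so concurrent callers cannot both observe the same signal.
int ConsumePendingSignal() {
  return g_pending_signal.exchange(0, std::memory_order_relaxed);
}
```

A caller would translate the returned signal number into a `Status` on its own thread, outside the handler, so no non-trivial object is ever written from signal context.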
@jjyao If this makes sense, I can also replace the rest of the signal checking logic in the core worker with C++ code, instead of passing it down through Cython code to the CoreWorkerProcess.
timed_out = remaining_timeout <= 0;
{
  std::signal(SIGINT, SignalHandler);
  std::signal(SIGTERM, SignalHandler);
Are you sure you want to overwrite the default SIGTERM handler? That means any exception thrown will not have any effect.
I'm trying to replicate the behavior of the check_signals mentioned in the PR description. I think SIGTERM maps to SystemExit; will double check.
Hmm, possibly not. I removed SIGTERM and am keeping only SIGINT for now; looking at the previous PR that introduced check_signals, we really just need to check for ctrl+c.
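A minimal sketch of the SIGINT-only approach discussed in this thread, assuming the handler should be active only for the duration of a blocking call. `ScopedSigintHandler`, `OnSigint`, and `g_sigint_seen` are hypothetical names, not taken from the PR. The guard leaves SIGTERM untouched and restores whatever SIGINT handler was installed before it.

```cpp
#include <atomic>
#include <csignal>

// Set from the handler, read by the waiting code.
std::atomic<bool> g_sigint_seen{false};

void OnSigint(int /*signum*/) {
  g_sigint_seen.store(true, std::memory_order_relaxed);
}

// Hypothetical RAII guard: installs OnSigint for SIGINT only and
// restores the previous handler when it goes out of scope.
class ScopedSigintHandler {
 public:
  ScopedSigintHandler() : previous_(std::signal(SIGINT, OnSigint)) {}

  ~ScopedSigintHandler() {
    // std::signal returns SIG_ERR on failure; in that case there is
    // nothing valid to restore.
    if (previous_ != SIG_ERR) {
      std::signal(SIGINT, previous_);
    }
  }

  ScopedSigintHandler(const ScopedSigintHandler &) = delete;
  ScopedSigintHandler &operator=(const ScopedSigintHandler &) = delete;

 private:
  void (*previous_)(int);
};
```

Restoring the previous handler matters in an embedded-interpreter setting, since the host process may have installed its own SIGINT handler; whether that is sufficient for the core worker is not something this sketch establishes.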
@dayshah the title mentions WIP; let me know when this is ready for review.
Yup, my bad for assigning before I realized a test is failing in premerge. Investigating.
Why are these changes needed?
Right now, the Monitor thread used for compiled graphs calls ray.get, and the get call can stay alive up to the point where the Python interpreter starts shutting down. This can cause a segfault when acquiring the GIL in check_signals, as mentioned by @kevin85421 in #47864 (comment). Also, according to the documentation for PyErr_CheckSignals (https://docs.python.org/3/c-api/exceptions.html#c.PyErr_CheckSignals), which check_signals currently uses, it doesn't actually work on threads other than the main Python thread. By moving the signal checking into C++ code, we no longer need to acquire the GIL here and can also correctly check for signals outside the main thread.
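The idea described above can be sketched as a wait loop that polls a flag between short waits, with no Python call involved. This is an illustration under assumptions, not the PR's code: `WaitWithSignalCheck`, `WaitResult`, `g_interrupted`, and the 10 ms slice are all hypothetical, and the flag is assumed to be set by a SIGINT handler installed elsewhere.

```cpp
#include <algorithm>
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

enum class WaitResult { kReady, kTimedOut, kInterrupted };

// Assumed to be set to true by a SIGINT handler (not shown here).
std::atomic<bool> g_interrupted{false};

// Waits until `ready` becomes true, the timeout expires, or an interrupt
// is observed. The wait is split into short slices so the flag is checked
// regularly without acquiring the GIL or calling into Python.
WaitResult WaitWithSignalCheck(std::mutex &mu,
                               std::condition_variable &cv,
                               const bool &ready,
                               std::chrono::milliseconds timeout) {
  using Clock = std::chrono::steady_clock;
  const Clock::time_point deadline = Clock::now() + timeout;
  const Clock::duration slice = std::chrono::milliseconds(10);

  std::unique_lock<std::mutex> lock(mu);
  while (!ready) {
    if (g_interrupted.load(std::memory_order_relaxed)) {
      return WaitResult::kInterrupted;
    }
    const Clock::time_point now = Clock::now();
    if (now >= deadline) {
      return WaitResult::kTimedOut;
    }
    const Clock::duration remaining = deadline - now;
    // Sleep for one slice or the remaining time, whichever is shorter.
    cv.wait_for(lock, std::min(remaining, slice));
  }
  return WaitResult::kReady;
}
```

Because the loop only reads an atomic flag, it behaves the same on the main thread and on a background thread such as the Monitor thread mentioned above.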
Verified that ctrl+c works with a standard ray.get(result of task with while True), and that the compiled graphs tests pass consistently with this change, whereas before one exception-throwing DAG test would usually segfault.

Current check_signals Cython implementation:
ray/python/ray/_raylet.pyx
Lines 2360 to 2365 in 9375c1f
Related issue number
#48806 #47864
Checks
git commit -s) in this PR.
scripts/format.sh to lint the changes in this PR.
method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.