Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] [wip] Rework memory store signal checking in C++ instead of cython #49319

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

dayshah
Copy link
Contributor

@dayshah dayshah commented Dec 17, 2024

Why are these changes needed?

So right now the Monitor thread used for compiled graphs calls ray.get and the get call can stay alive up to the point of the python interpreter starting to shutdown. This can cause a segfault when acquiring the GIL in check_signals as mentioned by @kevin85421 here #47864 (comment). Also based on the documentation of https://docs.python.org/3/c-api/exceptions.html#c.PyErr_CheckSignals which is currently used in check_signals, it doesn't actually work on threads that aren't the main python thread. By moving the signal checking to C++ code we no longer need to acquire the gil here and can also correctly check for signals outside the main thread.

Verified that ctrl+c works with a standard ray.get(result of task with while True) and also that compiled graphs tests pass consistently with this vs. before when usually one exception throwing dag test would segfault.

Current check_signals cython implementation:

ray/python/ray/_raylet.pyx

Lines 2360 to 2365 in 9375c1f

cdef CRayStatus check_signals() nogil:
with gil:
# The Python exceptions are not handled if it is raised from cdef,
# so we have to handle it here.
try:
PyErr_CheckSignals()

Related issue number

#48806 #47864

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -31,6 +32,16 @@ const int64_t kUnhandledErrorGracePeriodNanos = static_cast<int64_t>(5e9);
// when there are too many local objects.
const int kMaxUnhandledErrorScanItems = 1000;

namespace {

Status signal_status = Status::OK();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this has the possibility of being called from multiple threads right, should we make this safe by making it thread_local or guarded by a mutex?

@dayshah
Copy link
Contributor Author

dayshah commented Dec 17, 2024

@jjyao If this makes sense, I can also replace the rest of the signal checking logic in the core worker with C++ code vs. passing it down through cython code to the CoreWorkerProcess

@dayshah dayshah requested a review from jjyao December 17, 2024 23:45
@dayshah dayshah added the go add ONLY when ready to merge, run all tests label Dec 17, 2024
timed_out = remaining_timeout <= 0;
{
std::signal(SIGINT, SignalHandler);
std::signal(SIGTERM, SignalHandler);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure you want to overwrite default SIGTERM handler? Which means exception thrown will not have any effect

Copy link
Contributor Author

@dayshah dayshah Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to replicate the behavior of the check_signals mentioned in the pr description, I think sigterm maps to the SystemExit, will double check

Copy link
Contributor Author

@dayshah dayshah Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm possibly not, removed sigterm and only keeping sigint for now, we really just need to check for ctrl+c looking at the previous pr that introduced check_signals

Signed-off-by: dayshah <[email protected]>
@dayshah dayshah changed the title [core] Rework memory store signal checking in CPP instead of cython [core] [wip] Rework memory store signal checking in C++ instead of cython Dec 18, 2024
@dayshah dayshah marked this pull request as draft December 18, 2024 05:06
@dayshah dayshah removed the go add ONLY when ready to merge, run all tests label Dec 18, 2024
@ruisearch42
Copy link
Contributor

@dayshah the title mentioned WIP, let me know when this is ready for review

@dayshah
Copy link
Contributor Author

dayshah commented Dec 19, 2024

@dayshah the title mentioned WIP, let me know when this is ready for review

yup, my bad on assigning before I realized a test is failing in premerge, investigating

@dayshah dayshah removed the request for review from jjyao December 19, 2024 22:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants