
IO::Event::Interrupt#signal mangles errno #127

Open
jscheid opened this issue Jan 13, 2025 · 5 comments

Comments

@jscheid

jscheid commented Jan 13, 2025

IO::Event::Interrupt#signal overwrites errno, which can cause havoc, including VM-level errors such as this one.

The patch below (against Ruby master) fixes that particular error; however, I'm not sure it's the correct general fix, as it seems too narrow. Ideally, similar code would be used in the signal implementation directly, and in any other functions where errno could be corrupted unexpectedly.

diff --git a/scheduler.c b/scheduler.c
index bb406a5cf6..64f1e844ee 100644
--- a/scheduler.c
+++ b/scheduler.c
@@ -412,7 +412,10 @@ rb_fiber_scheduler_unblock(VALUE scheduler, VALUE blocker, VALUE fiber)
 {
     RUBY_ASSERT(rb_obj_is_fiber(fiber));
 
-    return rb_funcall(scheduler, id_unblock, 2, blocker, fiber);
+    int saved_errno = errno;
+    VALUE result = rb_funcall(scheduler, id_unblock, 2, blocker, fiber);
+    errno = saved_errno;
+    return result;
 }
 
 /*

To reproduce, use this repo and run the following command a few times:

docker compose run --build --rm --env IO_EVENT_SELECTOR=Select test

Sooner or later you should hit the case where io_binwrite uses rb_mutex_synchronize here with a scheduler set, while returning -1 because the socket is ECONNREFUSED.

In this case rb_mutex_synchronize uses rb_fiber_scheduler_unblock, which, due to the mangling by signal, resets errno to zero. Ruby doesn't tolerate that when io_binwrite returns -1: rb_sys_fail_on_write ends up invoking rb_bug in that situation.

@ioquatix
Member

Thanks for pointing this out. As an aside, I strongly dislike having a write lock in the IO object.

My general feeling is that operations requiring errno should be kept close together, e.g.

if (operation(...) == -1) {
  do_something_with(errno);
}

Or even better, we follow the pattern using negative error codes:

int error = operation(...);
if (error < 0) {
  if (-error == EWHATEVER) {
    ...
  }
}
Because of the reasons you mention, most of the new IO code uses the latter pattern, e.g.: https://github.com/ruby/ruby/blob/5f3d1eeb55ec69591e76633346d8a4812d3cc36b/io_buffer.c#L2782

I think, as you said, saving and restoring errno doesn't feel like the right solution. More specifically, it feels like it will hide bugs rather than fix or prevent them.

I think the best solution is to ensure that an operation() with a well-defined usage of errno is not corrupted by internal behaviour. In other words, we should be clear about which operations can modify errno, and then ensure those operations are wrapped appropriately when used inside functions that also have a well-defined usage of errno.

I need to review the code in more detail to see the impact.

@ioquatix
Member

ioquatix commented Jan 14, 2025

For some reason, your fix does not resolve the issue for me:

VALUE
rb_fiber_scheduler_unblock(VALUE scheduler, VALUE blocker, VALUE fiber)
{
    RUBY_ASSERT(rb_obj_is_fiber(fiber));

    int saved_errno = errno;

    VALUE result = rb_funcall(scheduler, id_unblock, 2, blocker, fiber);

    fprintf(stderr, "Restoring errno: %d -> %d\n", errno, saved_errno);
    errno = saved_errno;

    return result;
}

And the output:

> bundle exec ruby test.rb 0.5
...
Using async_percent: 0.5
[pry-remote] Waiting for client on druby://0.0.0.0:9876
Start 1 (async)
Start 2 (async)
Start 3 (async)
Start 4 (sync)
Restoring errno: 0 -> 0
Restoring errno: 0 -> 0
Start 5 (sync)
Restoring errno: 61 -> 61
Restoring errno: 0 -> 0
Start 6 (sync)
Restoring errno: 0 -> 0
Start 7 (sync)
Restoring errno: 0 -> 0
Start 8 (async)
Start 9 (async)
Start 10 (async)
Done 1
[hang]

Am I missing something?

@ioquatix
Member

Cross referencing socketry/async#368

@jscheid
Author

jscheid commented Jan 14, 2025

@ioquatix sorry for any confusion. There are two issues:

  1. The hang, discussed in the issue you've cross referenced. It occurs with any selector implementation, isn't fixed by the patch above, and is very much still an issue that I hope you can help track down.
  2. A crash (VM bug report), discussed here; an example of it is in the Gist linked above. It happens sometimes, but not always, with the repro for (1), but only when using the Select selector implementation. This crash is fixed by the patch.

As far as I can tell these two issues are unrelated, except that the same repro test case happens to trigger both (one more reliably than the other).

Does this make sense?

@ioquatix
Member

Yes, that clarifies the situation, thanks!
