
IO::Event::Interrupt#signal mangles errno #127

Open
jscheid opened this issue Jan 13, 2025 · 5 comments

Comments

@jscheid

jscheid commented Jan 13, 2025

IO::Event::Interrupt#signal overwrites errno, which can cause havoc, including VM-level errors such as this one.

The patch below (against Ruby master) fixes that particular error; however, I'm not sure it's the correct general fix, as it seems too narrow. Ideally, similar code would be used in the signal implementation directly, and in any other functions where errno could be corrupted unexpectedly.

diff --git a/scheduler.c b/scheduler.c
index bb406a5cf6..64f1e844ee 100644
--- a/scheduler.c
+++ b/scheduler.c
@@ -412,7 +412,10 @@ rb_fiber_scheduler_unblock(VALUE scheduler, VALUE blocker, VALUE fiber)
 {
     RUBY_ASSERT(rb_obj_is_fiber(fiber));
 
-    return rb_funcall(scheduler, id_unblock, 2, blocker, fiber);
+    int saved_errno = errno;
+    VALUE result = rb_funcall(scheduler, id_unblock, 2, blocker, fiber);
+    errno = saved_errno;
+    return result;
 }
 
 /*

To reproduce, use this repo and run the following command a few times:

docker compose run --build --rm --env IO_EVENT_SELECTOR=Select test

Sooner or later you should hit the case where io_binwrite uses rb_mutex_synchronize here with a scheduler set, while returning -1 because the socket is ECONNREFUSED.

In this case rb_mutex_synchronize uses rb_fiber_scheduler_unblock, which, due to the mangling by signal, resets errno to zero. Ruby doesn't tolerate that when io_binwrite returns -1: rb_sys_fail_on_write ends up invoking rb_bug in that situation.

@ioquatix
Member

Thanks for pointing this out. As an aside, I strongly dislike having a write lock in the IO object.

My general feeling is that operations requiring errno should be kept close together, e.g.

if (operation(...) == -1) {
  do_something_with(errno);
}

Or even better, we follow the pattern using negative error codes:

int error = operation(...);
if (error < 0) {
  if (-error == EWHATEVER) {
    ...
  }
}
Because of the reasons you mention, most of the new IO code uses the latter pattern, e.g.: https://github.com/ruby/ruby/blob/5f3d1eeb55ec69591e76633346d8a4812d3cc36b/io_buffer.c#L2782

I think, as you said, saving and restoring errno doesn't feel like the right solution. More specifically, it feels like it will hide bugs rather than fix or prevent them.

I think the best solution is to ensure that an operation() with a well-defined usage of errno is not corrupted by internal behaviour. In other words, we should be clear about which operations can modify errno, and then ensure those operations are wrapped appropriately when used inside functions that also have a well-defined usage of errno.

I need to review the code in more detail to see the impact.

@ioquatix
Member

ioquatix commented Jan 14, 2025

For some reason, your fix does not resolve the issue for me:

VALUE
rb_fiber_scheduler_unblock(VALUE scheduler, VALUE blocker, VALUE fiber)
{
    RUBY_ASSERT(rb_obj_is_fiber(fiber));

    int saved_errno = errno;

    VALUE result = rb_funcall(scheduler, id_unblock, 2, blocker, fiber);

    fprintf(stderr, "Restoring errno: %d -> %d\n", errno, saved_errno);
    errno = saved_errno;

    return result;
}

And the output:

> bundle exec ruby test.rb 0.5
...
Using async_percent: 0.5
[pry-remote] Waiting for client on druby://0.0.0.0:9876
Start 1 (async)
Start 2 (async)
Start 3 (async)
Start 4 (sync)
Restoring errno: 0 -> 0
Restoring errno: 0 -> 0
Start 5 (sync)
Restoring errno: 61 -> 61
Restoring errno: 0 -> 0
Start 6 (sync)
Restoring errno: 0 -> 0
Start 7 (sync)
Restoring errno: 0 -> 0
Start 8 (async)
Start 9 (async)
Start 10 (async)
Done 1
[hang]

Am I missing something?

@ioquatix
Member

Cross referencing socketry/async#368

@jscheid
Author

jscheid commented Jan 14, 2025

@ioquatix sorry for any confusion. There are two issues:

  1. The hang, discussed in the issue you've cross referenced. It occurs with any selector implementation, isn't fixed by the patch above, and is very much still an issue that I hope you can help track down.
  2. A crash (VM bug report), discussed here; an example of it is in the Gist linked above. It happens sometimes, but not always, with the repro for (1), but only when using the Select selector implementation. This crash is fixed by the patch.

As far as I can tell these two issues are unrelated, except that the same repro test case happens to trigger both (one more reliably than the other).

Does this make sense?

@ioquatix
Member

Yes, that clarifies the situation, thanks!
