Missing write readiness notifications using Child with Stdio::piped? #5644
-
I have a program that, among other things, spawns subprocesses (usually Python scripts) and communicates with them over pipes using a line-oriented protocol. The protocol is half-duplex, i.e. at each point in time only one of the two processes is expected to write a line while the other is expected to read one.

This mostly works, but sometimes the program and its subprocess deadlock. After adding timeouts on the Rust/tokio/parent side, I found that it was the write which blocked indefinitely, e.g.

This does not really make sense to me: when the pipe buffer is full, the read should succeed and empty it. However, increasing the kernel's pipe buffer size to 1 MB, which is larger than any line I am trying to write so that all writes should succeed immediately, apparently resolved the issue. To me, it looks as if the write readiness notifications are simply lost when the pipe buffer is significantly smaller than the lines I try to write. (While the timeouts are currently set to 60 s, the deadlocks really are deadlocks, persisting for days if not resolved manually or via the timeouts.)

Of course, a bug in my code is much more likely, but at this point I have run out of ideas for diagnosing it. Also, because the protocol is half-duplex, the classic pipe buffer deadlock involving two pipes should not happen, and indeed continuously draining

To add some context, the parent code is here and the child code is here and here. I am sorry for not having a more minimal reproducer, but I have a hard time reproducing it at all. For example, it seems to happen only on our staging VM running Ubuntu 22.04 (kernel 5.15), but not on our development VM currently running Ubuntu 23.04 (kernel 6.2). Given the kernel version difference, I wonder whether the "recent" rework of the kernel's pipe subsystem has anything to do with it, but that is admittedly a long shot.
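For readers who want to try the pipe buffer workaround mentioned above: the post does not say how the buffer size was raised. One per-pipe way on Linux is the `F_SETPIPE_SZ` fcntl command, sketched below in Python. The helper name `grow_pipe` is made up for this illustration, and an unprivileged process can only go up to the limit in `/proc/sys/fs/pipe-max-size`.

```python
import fcntl
import os

# Linux-specific fcntl commands. The numeric fallbacks are the values from
# <linux/fcntl.h>, used when this Python version does not export the names.
F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)
F_GETPIPE_SZ = getattr(fcntl, "F_GETPIPE_SZ", 1032)


def grow_pipe(fd: int, size: int) -> int:
    """Request a pipe capacity of at least `size` bytes; return the capacity the kernel reports."""
    # The kernel may round the request up; it raises OSError if the
    # request exceeds what this process is allowed to allocate.
    fcntl.fcntl(fd, F_SETPIPE_SZ, size)
    return fcntl.fcntl(fd, F_GETPIPE_SZ)


if __name__ == "__main__":
    r, w = os.pipe()
    print("default capacity:", fcntl.fcntl(w, F_GETPIPE_SZ))
    print("after resize:", grow_pipe(w, 1 << 20))
    os.close(r)
    os.close(w)
```

A capacity larger than every line lets each write complete without waiting for the reader, which would hide a lost wakeup rather than explain it.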
EDIT: To make this more concrete, when they deadlock, the parent waits here and the child blocks here.

EDIT: A minimal reproducer which just ping-pongs 500 kB worth of line data between a Rust parent and a Python child in a tight loop does not seem to trigger the issue. So at least some of the other work done by the runtime seems to be required.
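For illustration, the half-duplex ping-pong described in the edit above could look roughly like the sketch below. It is a Python-only stand-in, not the author's reproducer: the real parent is Rust/tokio, and the names `CHILD` and `ping_pong` are hypothetical. It uses blocking I/O, so it exercises only the protocol shape and the kernel pipes, not tokio's readiness handling.

```python
import subprocess
import sys

# Hypothetical child: reads one full line, then echoes it back and flushes.
CHILD = r"""
import sys
for line in sys.stdin:
    sys.stdout.write(line)
    sys.stdout.flush()
"""


def ping_pong(rounds: int, line_len: int) -> int:
    """Exchange `rounds` lines of `line_len` bytes with the echo child.

    Returns the number of lines that came back unchanged.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    line = b"x" * (line_len - 1) + b"\n"
    ok = 0
    try:
        for _ in range(rounds):
            # Half-duplex: write the whole line first, then read the reply.
            # The child drains its stdin while we write, so a line larger
            # than the pipe buffer still goes through.
            child.stdin.write(line)
            child.stdin.flush()
            if child.stdout.readline() == line:
                ok += 1
    finally:
        child.stdin.close()  # EOF ends the child's read loop
        child.stdout.close()
        child.wait()
    return ok


if __name__ == "__main__":
    print(ping_pong(rounds=10, line_len=500_000))
```

Because the child reads a complete line before it writes anything, each side only ever blocks on a pipe the other side is actively servicing. That is why the classic two-pipe deadlock is not expected with this protocol.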
Replies: 3 comments 8 replies
-
I am testing this against the 5.19 HWE kernel shipped with Ubuntu 22.04 now. Maybe the problem magically disappears that way...
-
I fear this could be an issue in tokio or the underlying I/O stack after all: I replaced that single call to I would be grateful for any insights or advice on what I should collect to turn this into an actionable bug report.
-
This seems to have been an instance of #6133 after all.
Could this be an instance of #6133?