
session::Client_session::sync_connect() can block for seconds given certain user-code behavior on server side. #27

Open · ygoldfeld opened this issue Nov 16, 2024 · 1 comment
Labels: bug

@ygoldfeld (Contributor):

[ CC - @wkorbe who ran into this (albeit I believe he was intentionally trying to simulate a "stuck process") when integrating Flow-IPC into a real application. ]

The ipc::session::Client_session::sync_connect() overload is (as of this writing) the way a session-client initially connects to the opposing process (hence, the session-server). So in the opposing process one normally performs:

// Construct it: this places a CNS (PID) file in the proper location and "begins listening."
ipc::session::Session_server /* <-- or similar */ srv(...);
// Now or later... accept a connection (or wait for one and then accept).
srv.async_accept(on_connect_func); // on_connect_func() shall receive a ready-to-go Session.
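
And on the client side, the counterpart is roughly as follows. (A minimal sketch in the same spirit as the snippet above; construction details are elided, and the exact overload set is per the API docs.)

// Construct it (this does not yet contact the server).
ipc::session::Client_session /* <-- or similar */ cli(...);
// Now connect: under the hood this finds the CNS (PID) file placed by the server and connects.
ipc::Error_code err;
cli.sync_connect(&err); // Per the contract: returns quickly, with falsy err on success.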

The expected behavior (contract) of Client_session::sync_connect() is, either:

  • fail immediately (non-blocking), as indicated by Error_code/exception; or
  • succeed immediately (non-blocking), as indicated by falsy Error_code/no exception.

The short version is, this contract breaks in this situation and manner:

  • Session_server is properly constructed; but there is no .async_accept() in progress.
  • It should either fail or succeed quickly (the docs, formally speaking, don't make clear which) - non-blockingly - but instead it blocks until an .async_accept() does get called, or 5 seconds pass, whichever happens first.

The work-around is "just" to not do this on the server-side: Only construct the Session_server once ready to actually accept a session. (After that, the continuing "work-around" is essentially to act in the standard fashion recommended in the Guided Manual. Namely, if the server is simple and only accepts one session at a time, then clearly the client side shouldn't attempt to sync_connect() again until the first session dies; and once it does the session-server should either shut down or do another .async_accept(). If the server is more complex - capable of concurrent sessions - then it should .async_accept() immediately upon handling an .async_accept() callback being invoked. Assuming no improper blocking within this loop on the server, there is no problem.)

So essentially the "work-around" is likely how one would code the server side anyway. However, that's not 100% true: it is fairly natural to construct a thing early on and only use it (.async_accept() here) somewhat later, after potentially lengthy other setup. At that point, constructing it only once ready to potentially accept becomes, in fact, a work-around.

Naturally, work-arounds aside, it is not good to break the documented contract formally speaking - and, as shown just above, practically speaking too.

SIDE NOTE: From memory, I (@ygoldfeld) think I have a to-do somewhere in the code about this. I am 50% certain of this. If so I should have filed an issue immediately but neglected to do so (human error, unintentional).

--

How to solve it internally?

The reason for this is not too difficult to understand. The session-connect procedure basically is:

  • Server places the CNS (PID) file into a location based on the App::m_name, which both sides know and pass (in App structs) to the Session_server and Client_session ctors respectively.
  • Server then immediately (also in its ctor) sets up a Unix domain server socket (ipc::transport::Native_socket_stream_acceptor) which immediately listen()s. That is, incoming socket connect()s are queued by the kernel, so each low-level socket connect() immediately succeeds from that point on.
  • Then client and server engage in a fairly complex message exchange to set up many things like channels and SHM. This exchange is non-blocking (albeit heavy-weight, to the tune of possibly tens of milliseconds). Once this is done on each side, the server's async_accept() async-triggers the on-accept user callback, and on the client side sync_connect() returns. Though heavy-weight (but quite rare), sync_connect() is thus non-blocking; while async_accept() is "quick," once the low-level UDS connect() arrives.

The problem here is about when the last bullet point actually occurs. On the server the answer is: only once at least one .async_accept() is outstanding. Until then, the client side "sees" the initial Unix domain socket connect succeed, while on the server only the kernel knows about it. Once .async_accept() is called, the low-level accept() actually executes, and the rest of the message exchange happens.

That's just the way I coded it, as it is straightforward and seems correct. It's not quite correct, though: if .async_accept() does not occur, or occurs late, then sync_connect() thinks it's off to the races (the low-level connect() succeeded immediately, as expected), while the server knows nothing of it. The client has no way of knowing it's really still sitting only in the kernel's UDS backlog.

So, not hard to understand. As for fixing it... somewhat surprisingly to me, it's actually not obvious. Simple ideas tend to be either wrong or not-quite-great:

  • Could augment server side (e.g., making it an option in the low-level Native_socket_stream_acceptor) to not allow this backlogging of incoming connect()s, when there are not 1+ async_accept()s pending. That would solve the problem during the initial "gap" (between Session_server ctor and first .async_accept())... but it would also break things during subsequent .async_accept()less "gaps." For example, suppose the code is written (and this would be proper) to .async_accept(F), then handle the new session in F (non-blockingly), then .async_accept() again. If a client attempts sync_connect() between the two, it'll fail. Not a catastrophe perhaps, but typically code would be written to now wait a while and try again in N seconds - when really it would already work quite soon, once the next .async_accept() is executed on server.
  • Could simply add (internally) a timeout inside .sync_connect() - e.g., 100msec - so it would give up and yield an error unless the message exchange starts within 100msec of the UDS connect() success. This is a decent solution. However it's hard to know whether it should be 100msec or 50 or 500; on a sufficiently slow machine, even an "immediate" success could conceivably take that long. So it's brittle in that way. It could be configurable... but that sucks for the user. The whole point of .sync_connect() - as opposed to having an .async_connect() - is simplicity, synchronicity, non-blockingness when connecting.

Upon kicking around solutions in my mind I couldn't escape the following conclusion (not that it's all that hairy really).

The bottom line is: backlogging incoming .sync_connect()s on the server, when a Session_server exists but happens to have no .async_accept() pending at the moment, is good behavior. Think of it as analogous to a plain TCP or stream-UDS connect: if the peer is listening in any capacity, then a connect should work; whether further traffic occurs in a timely fashion is a separate concern, and indeed if there is no matching accept() on the opposing side, the client can eventually detect this and break the half-functioning channel. So in our case: if there's a Session_server listening, then connects should fully work from the client's point of view. However, in practical terms, the server side would then remain in almost-PEER state until the user .async_accept()s successfully and then does Server_session::init_handlers(). Therefore later synchronous calls can block, if the user indeed leaves an .async_accept()less Session_server around for an extended period of time. Those calls:

  • Session::open_channel(): This is formally allowed to block, but usually it's undesirable. We can give tips (see notes below re. docs) for how to avoid it - plus recommend using init-channels anyway when possible, so open_channel() can be skipped entirely (see the sketch after this list).
  • struc::Channel::sync_request() (and other API combos that prefer non-blockingly-fast request responses). This too is formally allowed to block, and usually it's undesirable. Again, we can give tips for how to avoid it.
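
For instance, regarding the open_channel() tip: a hedged sketch of the init-channels approach on the client side. (The exact overload shape and parameter order here are from memory and are an assumption to be checked against the API docs; the point is simply that channels requested at connect time arrive already in PEER state, so no later open_channel() is needed.)

// Ask for channels to be opened as part of session establishment itself.
Session::Channels init_chans_by_cli(2); // We (the client) want 2 channels ready at session start.
Session::Channels init_chans_by_srv;    // Plus however many the server side requests.
ipc::Error_code err;
cli.sync_connect(cli.mdt_builder(), &init_chans_by_cli, nullptr, &init_chans_by_srv, &err);
// On falsy err: all those channels are immediately usable; open_channel() can be skipped entirely.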

In addition, consider adding a simple change wherein we in fact do not listen - nor even place the CNS (PID) file (etc.) - until the first .async_accept(). This one could go either way, but leaving a constructed, not-yet-accepting Session_server around and async_accept()ing for the first time later could be pretty common - IMO it reduces entropy to reject sync_connect()s during that time period. (Note: Doing this alone would mitigate the original issue quite a lot in practice. It would then still remain formally a problem, but in appropriate user server-side patterns it would no longer come up.)

NOTE! It is important to explicitly document the expected behavior. This can be subtle, particularly after the above solution. We should take great care to (1) be formally accurate and (2) be informally helpful. Namely:

  • sync_connect() is non-blocking and in particular will succeed as long as the opposing Session_server is constructed. However, if the app coder on the server allows for blocking-length periods sans .async_accept(), then subsequent operations might not be instantly responsive - until .async_accept() and .init_handlers() get the session to PEER state instead of almost-PEER.
  • Moreover, that situation will be avoided before the first .async_accept(): server will reject connects during this time.
  • Moreover, the app code on the server can avoid that situation by:
    • In one-concurrent-session setups: doing .async_accept() quickly once a session is hosed.
    • In many-concurrent-sessions setups: doing .async_accept() quickly once an .async_accept() triggers callback.
  • These measures are proper techniques in any case.
  • If not heeded, the things that might (from the client-side point of view) be not-instantly-responsive within a given apparently-PEER-state session are at least: open_channel(), sync_request(), and algorithms conceptually similar to sync_request(). It is typically recommended to heed the advice, as there is great power in the simplicity of always knowing that a session is in full PEER state on both sides, and therefore that all bidirectional request-like exchanges will be instantly responsive.

--

Impl detail notes:

As it is currently structured (namely all in session::Session_server_impl), the change should not be that hard (though it'll feel "unfortunate" given how mercifully simple today's session::Session_server_impl is). Currently .async_accept() simply triggers N_s_s_acceptor::async_accept() 1-1; once the latter fires successfully, directly from that thread W we do the Server_session impl's async-accept-log-in stuff, which performs the message exchange. So it's really just a wrapper around those 2 steps (plus a couple of hooks for SHM-enabled sessions, which don't concern us here at all).

So now it will have to be somewhat more complex - in fact similar to N_s_s_acceptor internals. It'll need to do the equivalent of today's async_accept() - but mandatorily have one outstanding at all times; keep a queue of the user's outstanding "outer" .async_accept() "requests"; and do the usual surplus/deficit thing as in N_s_s_a and some other places in Flow-IPC. (That means the dtor will need to fire operation-aborted callbacks if the queue is non-empty at dtor time... and so on. Use the aforementioned classes like N_s_s_a as models; conceptually it'll be a copy-paste of that.)

We will probably need to add a thread W to Session_server_impl (so that, if there is a surplus of available almost-PEER sessions and a request then comes in, the callback can fire immediately from thread W via W.post()); and mutex-protect the surplus queue/etc., as that state (m_state?) may be modified from at least the Server_sessions' many threads W, from the new thread W, and I believe from N_s_s_a's thread W. Perf is no big deal in this context, so that's fine; just be careful with the multi-threading.
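
A rough, generic sketch of that surplus/deficit bookkeeping follows. (This is not existing Flow-IPC code: the class, member, and type names are hypothetical stand-ins; error plumbing, the operation-aborted-on-dtor path, and the lazy-init of the listening machinery are omitted.)

#include <functional>
#include <memory>
#include <mutex>
#include <queue>

// Hypothetical sketch of the bookkeeping Session_server_impl would gain.
class Accept_queue
{
public:
  using Session_ptr = std::shared_ptr<void>;                       // Stand-in for an almost-PEER Server_session.
  using On_accept = std::function<void (Session_ptr)>;             // Stand-in for the user's on-accept handler.
  using Post_func = std::function<void (std::function<void ()>)>;  // Stand-in for "post onto our thread W."

  explicit Accept_queue(Post_func post_to_thread_w) : m_post(std::move(post_to_thread_w)) {}

  // Called internally each time an accept + log-in exchange completes (session now almost-PEER).
  void on_internal_session_ready(Session_ptr session)
  {
    On_accept user_func;
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      if (m_user_requests.empty()) { m_surplus.push(std::move(session)); return; } // Surplus: bank it.
      user_func = std::move(m_user_requests.front()); m_user_requests.pop();       // Deficit: satisfy oldest.
    }
    m_post([f = std::move(user_func), s = std::move(session)]() mutable { f(std::move(s)); });
  }

  // User-facing async_accept(): consume a banked session if any; otherwise queue the request.
  void async_accept(On_accept user_func)
  {
    Session_ptr session;
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      if (m_surplus.empty()) { m_user_requests.push(std::move(user_func)); return; }
      session = std::move(m_surplus.front()); m_surplus.pop();
    }
    m_post([f = std::move(user_func), s = std::move(session)]() mutable { f(std::move(s)); });
  }

private:
  Post_func m_post;
  std::mutex m_mutex;                     // Guards both queues; they're touched from multiple threads W.
  std::queue<Session_ptr> m_surplus;      // Accepted-but-unclaimed (almost-PEER) sessions.
  std::queue<On_accept> m_user_requests;  // Outstanding user .async_accept() requests.
};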

Oh and, again, basically keep everything (maybe m_state?) null until the first .async_accept(); at which point do the equivalent of today's ctor and begin the internal .async_accept() chain.

transport_test is a decent place to have this auto-tested.

--

@ygoldfeld (Contributor, Author) commented:

Background thoughts since then: addendum:

The approach above solves the immediate problem (Client_session::sync_connect() potentially blocking, given certain opposing behavior, against the promised contract). However, as noted in that wall of text, there's the practical issue of what happens on the client side if a sync_connect() immediately succeeds, but on the server side it was, for the time being, only a background acceptance with no .async_accept() yet. Or, equivalently for what follows, if they async_accept()ed but didn't immediately init_handlers() for whatever reason. Either way, on the server the session is in almost-PEER state, not PEER state; but the client session is in PEER state and doesn't know any better.

At this point whether there's any practical issue depends on how the protocol/code is set up. If the client side doesn't do open_channel() (e.g., all channels are init-channels - the easier thing in most cases), and all traffic along channels is done using potentially not-instant-result async_request() calls and similar async techniques, then there's no problem. The code is ready for things to not be instant (in network programming this is required; but this is local IPC, and that's different).

However, if the client code (1) does do open_channel() and/or sync_request() and similar techniques, AND (2) relies on these things being non-blocking/instant - then... Well, first note that this makes coding quite a bit simpler. If you do sync_request() in the middle of high-performance code but can reasonably expect it to be non-blocking - simply because you know the other side is in PEER state and has set up all request handlers, etc., right at the start - then you can skip async IPC entirely. And that's great! Of course it's not suitable for every algorithm/protocol; but when it is suitable, it's super-nice to take advantage of it.

So in that case, in the aforementioned scenario, such code on the client side might hit unexpected blocking or unexpectedly long responses at the start of the session, and/or unexpectedly blocking open_channel() (if applicable).

This can, technically, even happen in a properly written server-side algorithm where one assumes only one session at a time (not multiple clients at a time but just one) and therefore performs async_accept(), which succeeds; then works along that session; then, if it gets hosed, performs another async_accept(). If the session was hosed by an unusual scenario such as a stuck/zombie opposing (client) process, then this might be detected only a few seconds after entering that state (idle timeout); hence async_accept() + init_handlers() would potentially run only a few seconds after the client (for whatever reason) re-initialized and attempted sync_connect() again. So sync_connect() would succeed immediately, but the server app code isn't actually executing its side of the protocol until seconds later; during this time things might block or take longer than one would like/assume. (Contrived, as such a session death is unusual - typically, even on crash, it'll be detected by the other side quickly, as various channels including the session master channel FD(s) go bad.)

So how to deal with it? Well, it's not fixable in the existing API, because the bottom line is the client requires some async step around connect time to wait until the server is "really" ready. (Reminder: This is all only under this specific scenario/way of coding that relies on all exchanges to be quick, after session is ready.)

The solution I like for this (I would claim) corner case is to simply enable the client side to easily enough wait for the server to be "really" ready (PEER state, as opposed to almost-PEER state). This can be an optional API (a usage sketch follows the list):

  • (async-I/O core API) Client_session::async_sync(F): Similar to async_request() in spirit, but basically invokes completion handler F() in standard fashion once the server side internally indicates it's really-ready. Internally: Server_session::init_handlers() executes session_master_channel.expect_msg(X, Fint), where X is a new internal message, maybe ClientReadinessReq; and Fint() send()s the response ClientReadinessRsp (note: the whole thing is one-off). (If send() fails, hose the channel in standard fashion... we're in PEER state now.) async_sync() itself is a wrapper around async_request(X, F), very straightforward. Oh, and refuse/return false if async_sync() or sync() has already been called for this session.
  • (async-I/O core API) Client_session::sync(optional_timeout): Blocking wrapper around the above.
  • (sync_io wrapper API) Provide Client_session_adapter::async_sync() counterpart around core()->async_sync(). Use the usual pipe technique internally for this, like the other stuff in Session_adapter. Only a few lines of code. (Reminder: Cannot provide a blocking API, similarly to how there is no sync_io::Channel::sync_request().)
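
For concreteness, a sketch of the intended client-side usage of this proposed API (hypothetical by definition, since none of async_sync()/sync() exists yet):

// Existing step: non-blocking; succeeds as long as the opposing Session_server is constructed.
ipc::session::Client_session /* <-- or similar */ cli(...);
ipc::Error_code err;
cli.sync_connect(&err);
// Proposed, optional extra step: wait until the server side is *really* ready (PEER, not almost-PEER).
cli.async_sync([](const ipc::Error_code& err)
{
  // From here on, open_channel()/sync_request()-style calls can be assumed instantly responsive.
});
// Or the blocking flavor, with an optional timeout:
//   cli.sync(boost::chrono::milliseconds(250));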

To be clear, this thing would be optional to use. (It's also basically syntactic sugar: One can do it today manually with an async exchange along some channel in the session. But it's nice syntactic sugar that saves a lot of boilerplate and also encourages a convenient pattern in some scenarios.) We could test it in transport_test along with the other crazy session stuff exercised therein.
