session::Client_session::sync_connect() can block for seconds given certain user-code behavior on server side. #27
Background thoughts since then (addendum): The approach above solves the immediate problem (Client_session::sync_connect() potentially blocking given certain opposing behavior, against the promised contract). However, as noted in that wall of text, there's the practical issue of what happens on the client side if a sync_connect() immediately succeeds, but in the meantime it was a background acceptance on the server side without an async_accept() yet. Or, equivalently for what follows, if they async_accept()ed but didn't immediately init_handlers() for whatever reason. Either way, on the server the session is in almost-PEER state, not PEER state; but the client session is in PEER state and doesn't know any better.

At this point whether there's any practical issue depends on how the protocol/code is set up. If the client side doesn't do open_channel() (e.g., all channels are init-channels - the easier thing in most cases), and all traffic along channels is done using potentially not-instant-result async_request() calls and similar async techniques, then there's no problem. The code is ready for things to not be instant (in network programming this is required - but this is local IPC, and that's different).

However, if the client code (1) does do open_channel() and/or sync_request() and similar techniques, AND (2) relies on these things being non-blocking/instant - then.... Well, first note that this makes coding quite a bit simpler. If you do sync_request() in the middle of high-performance code but can reasonably expect it to be non-blocking - simply because you know the other side is in PEER state and has set up all request handlers, etc., right at the start - then you can skip async IPC entirely. And that's great! Of course it's not suitable to every algorithm/protocol; but when it is suitable, it's super-nice to take advantage of it. So in that case, in the aforementioned scenario, such code on the client side might hit unexpected blocking or unexpectedly long responses at the start of the session, and/or an unexpectedly blocking open_channel() (if applicable).

This can, technically, even happen with a properly written server-side algorithm that assumes only one session at a time (not multiple clients at a time but just one) and therefore performs async_accept(), which succeeds; then works along that session; then, if it gets hosed, performs another async_accept(). If the session was hosed due to an unusual scenario such as a stuck/zombie opposing (client) process, then this might be detected a few seconds after entering that state (idle timeout); hence async_accept() + init_handlers() would only run a few seconds - potentially - after the client (for whatever reason) re-initialized and attempted sync_connect() again. So sync_connect() would succeed immediately, but the server app code isn't actually executing its side of the protocol until seconds later... during this time things might block or take longer than one would like/assume. (Contrived, as such a session death is unusual - typically, even on crash, it'll be detected by the other side quickly, as various channels including the session master channel FD(s) go bad.)

So how to deal with it? Well, it's not fixable in the existing API, because the bottom line is the client requires some async step around connect time to wait until the server is "really" ready. (Reminder: this is all only under this specific scenario/way of coding that relies on all exchanges being quick, once the session is ready.)
The solution I like for this (I would claim) corner case is simply to enable the client side to easily enough wait for the server to be "really" ready (PEER state, as opposed to almost-PEER state). This can be an optional API:
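For illustration only, here is a minimal sketch of how such an opt-in call might look from the client's point of view. The method name sync_wait_server_ready() is invented here and does not exist in Flow-IPC; only sync_connect() is an existing call, and its exact signature is approximated.

```cpp
// Hedged sketch: sync_wait_server_ready() is a HYPOTHETICAL name, invented purely to
// illustrate the proposal; only sync_connect() below is an existing Flow-IPC call.
template<typename Client_session_t>
bool connect_and_wait_until_server_ready(Client_session_t& session)
{
  ipc::Error_code err_code;
  session.sync_connect(&err_code); // Existing call: must return quickly per its contract.
  if (err_code) { return false; }

  // Proposed optional step: return only once the opposing Server_session has reached
  // full PEER state (server user did async_accept() *and* init_handlers()), as opposed
  // to the almost-PEER state a background acceptance leaves it in.  After this, the
  // client may reasonably treat open_channel()/sync_request()-style calls as non-blocking.
  session.sync_wait_server_ready(&err_code); // Hypothetical API.
  return !err_code;
}
```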
To be clear, this thing would be optional to use. (It's also basically syntactic sugar: One can do it today manually with an async exchange along some channel in the session. But it's nice syntactic sugar that saves a lot of boilerplate and also encourages a convenient pattern in some scenarios.) We could test it in transport_test along with the other crazy session stuff exercised therein.
[ CC - @wkorbe who ran into this (albeit I believe he was intentionally trying to simulate a "stuck process") when integrating Flow-IPC into a real application. ]
An ipc::session::Client_session::sync_connect() overload is (as of this writing) the way a session-client initially connects to the opposing process (hence session-server). So in the opposing process one normally performs:
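Roughly the following (a hedged sketch: template parameters, constructor arguments, and error handling are elided, and the exact overloads may differ from what's shown):

```cpp
// Hedged server-side sketch; types/ctor args elided, names as discussed in this issue.
Session_server session_srv{/* logger, Server_app, allowed Client_app set, ... */};

Server_session session; // Target of the accept.
session_srv.async_accept(&session, [&](const Error_code& err_code)
{
  if (err_code) { return; /* handle error */ }
  // The log-in exchange is done; move the session from almost-PEER to full PEER state.
  session.init_handlers([](const Error_code& /* session-hosed */) { /* ... */ });
  // The session is now usable: init-channels, open_channel(), structured channels, etc.
});
```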
The expected behavior (contract) of Client_session::sync_connect() is, either:
The short version is, this contract breaks in this situation and manner:
The work-around is "just" to not do this on the server-side: Only construct the Session_server once ready to actually accept a session. (After that, the continuing "work-around" is essentially to act in the standard fashion recommended in the Guided Manual. Namely, if the server is simple and only accepts one session at a time, then clearly the client side shouldn't attempt to sync_connect() again until the first session dies; and once it does the session-server should either shut down or do another .async_accept(). If the server is more complex - capable of concurrent sessions - then it should .async_accept() immediately upon handling an .async_accept() callback being invoked. Assuming no improper blocking within this loop on the server, there is no problem.)
So essentially the "work-around" is likely to be how one would code the server-side anyway. However that's not 100% true, in that it is fairly natural to construct a thing early-on and further use it (.async_accept() here) somewhat later after potentially lengthy other setup. So at that point constructing it only once ready to potentially accept becomes in fact a work-around.
Naturally, work-arounds aside, it is not good to break the documented contract formally speaking - and, as shown just above, practically speaking too.
SIDE NOTE: From memory, I (@ygoldfeld) think I have a to-do somewhere in the code about this. I am 50% certain of this. If so I should have filed an issue immediately but neglected to do so (human error, unintentional).
--
How to solve it internally?
The reason for this is not too difficult to understand. The session-connect procedure basically is:
The problem here is when the last bullet point actually occurs. On the server the answer is: only once at least one .async_accept() is outstanding. Until then, the client side "sees" the initial Unix domain socket connect, while on the server only the kernel knows about it. Once .async_accept() is called, the low-level accept() is actually executed, and the rest of the message exchange happens.
That's just the way I coded it, as it is straightforward and seems correct. It's not quite correct, though: if .async_accept() does not occur, or occurs late, then sync_connect() thinks it's off to the races (the low-level connect() succeeded immediately, as expected), while the server knows nothing of it. The client has no way of knowing it's really still just sitting in the kernel UDS backlog.
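This is standard kernel behavior for listening stream sockets, not anything Flow-IPC-specific. The following plain-POSIX snippet (illustrative socket path, no error handling) demonstrates that a UDS connect() completes even though the listener never calls accept():

```cpp
// Plain-POSIX illustration (not Flow-IPC code): a connect() to a listening Unix-domain
// socket succeeds as soon as the kernel queues it in the listen backlog -- before the
// server ever calls accept().  This mirrors what sync_connect()'s low-level step sees.
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstring>
#include <cstdio>

int main()
{
  const char* path = "/tmp/backlog_demo.sock"; // Illustrative path.
  ::unlink(path);

  sockaddr_un addr{};
  addr.sun_family = AF_UNIX;
  std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

  // "Server": bind + listen, but deliberately never accept().
  int srv = ::socket(AF_UNIX, SOCK_STREAM, 0);
  ::bind(srv, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  ::listen(srv, 8);

  // "Client": connect() succeeds immediately anyway -- the connection merely sits in
  // the kernel backlog, with the server-side application unaware of it.
  int cli = ::socket(AF_UNIX, SOCK_STREAM, 0);
  int rc = ::connect(cli, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  std::printf("connect() returned %d (0 = success, despite no accept())\n", rc);

  ::close(cli);
  ::close(srv);
  ::unlink(path);
  return 0;
}
```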
So, not hard to understand. As for fixing it... somewhat surprisingly to me, it's actually not obvious. Simple ideas tend to be either wrong or not-quite-great:
Upon kicking around solutions in my mind I couldn't escape the following conclusion (not that it's all that hairy really).
The bottom line is: backlogging incoming .sync_connect()s on the server, when a Session_server exists but happens to have no .async_accept() pending at the moment, is good behavior. Think of it as analogous to a simple TCP or stream-UDS connect: if the other side is listening in any capacity, then a connect should work; whether further traffic occurs in a timely fashion is a separate concern, and indeed if there is no matching accept() on the opposing side, then the client can eventually detect this and break the half-functioning channel. So in our case: if there's a Session_server listening, then connects should fully work from the client's point of view.

However, in practical terms, the server side would then remain in almost-PEER state until the user .async_accept()s successfully and then does Server_session::init_handlers(). Therefore, later synchronous calls can block, if the user indeed has an .async_accept()-less Session_server for an extended period of time. Those calls:
In addition, consider adding a simple change wherein we in fact do not listen - nor even place the CNS (PID) file (etc.) - until the first .async_accept(). This one could go either way, but leaving a constructed, not-yet-accepting Session_server around and async_accept()ing for the first time later could be pretty common - IMO it reduces entropy to reject sync_connect()s during that time period. (Note: Doing this alone would mitigate the original issue quite a lot in practice. It would then still remain formally a problem, but with appropriate user server-side patterns it would no longer come up.)
NOTE! It is important to explicitly document the expected behavior. This can be subtle, particularly after the above solution. We should take great care to (1) be formally accurate and (2) be informally helpful. Namely:
--
Impl detail notes:
As it is currently structured (namely, all in session::Session_server_impl), the change should not be that hard (though it'll feel "unfortunate" given how mercifully simple today's session::Session_server_impl is). Currently .async_accept() simply triggers N_s_s_acceptor::async_accept() 1-1; once the latter fires successfully, directly from that thread W we do the Server_session impl's async-accept-log-in stuff, which does the message exchange. So it's really just a wrapper around those 2 steps (plus a couple of hooks for SHM-enabled sessions... which don't concern us here at all).
So now it will have to be somewhat more complex - in fact similar to N_s_s_acceptor internals. It'll need to do the equivalent of today's async_accept() - but mandatorily keep one outstanding at all times; keep a queue of the user's outstanding "outer" .async_accept() "requests"; and do the usual surplus/deficit thing as in N_s_s_a and some other places in Flow-IPC. (That means the dtor will need to fire operation-aborted callbacks if that queue is non-empty at destruction time... and so on. Use the aforementioned classes like N_s_s_a as a reference; conceptually it'll be a copy-paste of that.)
We will probably need to add a thread W to Session_server_impl (so that, if there is a surplus of available almost-PEER sessions and a request comes in, the callback can be fired immediately from thread W via W.post()); and mutex-protect the surplus queue/etc., as that state (m_state?) may be modified from at least the Server_sessions' many threads W, from the new thread W, and, I believe, from N_s_s_a's thread W. Perf is no big deal in this context, so that's fine; just be careful with the multi-threading.
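A hedged sketch of that surplus/deficit bookkeeping, with invented names and simplified types (the real thing would live inside Session_server_impl, post completions onto thread W rather than invoking inline, and fire operation-aborted callbacks from the dtor for anything left in the deficit queue):

```cpp
// Hedged sketch (invented names, simplified types) of the surplus/deficit bookkeeping
// described above; not actual Session_server_impl code.
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <utility>

struct Almost_peer_session {};                             // Placeholder for a server session in almost-PEER state.
using Session_ptr = std::shared_ptr<Almost_peer_session>;
using Accept_request = std::function<void (Session_ptr)>;  // User's "outer" async_accept() callback (simplified).

class Accept_queue
{
public:
  // Called from the internal accept chain when a background log-in completes.
  void on_session_ready(Session_ptr session)
  {
    Accept_request req;
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      if (m_pending_requests.empty()) { m_surplus_sessions.push_back(std::move(session)); return; }
      req = std::move(m_pending_requests.front());
      m_pending_requests.pop_front();
    }
    req(std::move(session)); // In real code: post onto thread W instead of invoking inline.
  }

  // Called when the user invokes the "outer" async_accept().
  void on_user_accept(Accept_request req)
  {
    Session_ptr session;
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      if (m_surplus_sessions.empty()) { m_pending_requests.push_back(std::move(req)); return; }
      session = std::move(m_surplus_sessions.front());
      m_surplus_sessions.pop_front();
    }
    req(std::move(session)); // Again: real code would W.post() this.
  }

private:
  std::mutex m_mutex;                            // Protects both queues (multiple threads touch them).
  std::deque<Session_ptr> m_surplus_sessions;    // Surplus: sessions accepted with no user request pending.
  std::deque<Accept_request> m_pending_requests; // Deficit: user requests with no session ready yet.
};
```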
Oh and, again, basically keep everything (maybe m_state?) null until the first .async_accept(); at which point do the equivalent of today's ctor and begin the internal .async_accept() chain.
transport_test is a decent place to have this auto-tested.