Race of `SessionStarted` event and startup-cmd execution in multi-node batch-type cluster sessions
#2326
Labels: comp:agent, comp:manager, type:bug, urgency:5
Background
Currently, our batch-type session works as follows:
1. [...]
2. The agent generates `KernelStarted` event(s).
   i. If it is the `main1` container and the given kernel type is "batch", the agent triggers execution of the `startup_command` asynchronously.
   ii. When the `startup_command` finishes, the agent generates a `SessionSuccess`, `SessionFailure`, or `KernelTerminated` event depending on the result. Let's call this a "completion signal".
3. The manager receives the `KernelStarted` event(s). If it is a cluster session, the manager changes the session status to RUNNING and generates the `SessionStarted` event after receiving `KernelStarted` events from all belonging kernels (containers).
Problem
Since steps 2-i and 3 are asynchronous, the manager may ignore the completion signal: it thinks the session is not ready yet, and the batch-result processing is not a forced termination. (And actually, it shouldn't be!)
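The race can be reproduced in miniature with a toy model. The sketch below is not Backend.AI code: `ToyManager`, its status strings, and the timing values are hypothetical stand-ins, and it only assumes that the manager drops a completion signal that arrives while the session is not yet RUNNING.

```python
import asyncio


class ToyManager:
    """Hypothetical stand-in for the manager's session state handling."""

    def __init__(self, cluster_size):
        self.cluster_size = cluster_size
        self.started_kernels = 0
        self.status = "PREPARING"
        self.ignored_signals = []

    def on_kernel_started(self):
        # Step 3: the session becomes RUNNING only after every kernel reported in.
        self.started_kernels += 1
        if self.started_kernels == self.cluster_size:
            self.status = "RUNNING"

    def on_completion_signal(self, signal):
        # Assumption: a completion signal for a not-yet-RUNNING session is dropped.
        if self.status != "RUNNING":
            self.ignored_signals.append(signal)
            return
        self.status = "TERMINATED"


async def run_session(startup_seconds, sibling_delay):
    manager = ToyManager(cluster_size=2)

    async def main_kernel():
        manager.on_kernel_started()
        # Step 2-i: the startup command begins right away, without waiting for step 3.
        await asyncio.sleep(startup_seconds)
        manager.on_completion_signal("SessionSuccess")

    async def sibling_kernel():
        # The sibling container reports KernelStarted a little later.
        await asyncio.sleep(sibling_delay)
        manager.on_kernel_started()

    await asyncio.gather(main_kernel(), sibling_kernel())
    return manager


fast = asyncio.run(run_session(startup_seconds=0.0, sibling_delay=0.05))
slow = asyncio.run(run_session(startup_seconds=0.2, sibling_delay=0.05))
```

With the near-instant startup command (`fast`), the signal arrives before the second `KernelStarted` and is dropped, so the toy session stays in RUNNING. With the longer one (`slow`), the signal arrives afterwards and is processed.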
We have observed that this can really happen with the following combinations:
- A `startup_command` like `echo hi`.
We also confirmed that adding arbitrary delays in the `startup_command` (like `sleep 10`) reduces the possibility of this race.
There is also another potential problem due to this:
- User code that expects the sibling containers to be ready when it starts may break.
Plans to Fix
We need to ensure that step 2-i always happens after step 3, to prevent both the completion signal being ignored and the breakage of user code that expects the sibling containers to be ready when it starts.
The goal is to start execution of the `startup_command` after the manager-generated `SessionStarted` event.
Idea 1 (Fire-and-forget from the manager's perspective)
To achieve this, we could introduce a "session creation tracker" concept in the agent, similar to the one in the manager API handlers.
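As a rough illustration of the agent-side part of this idea, the sketch below keeps one `asyncio.Event` per session ID and parks the batch execution on it. The names `SessionStartTracker`, `register()`, `mark_started()`, and `run_batch_when_started()` are hypothetical and not taken from the Backend.AI codebase, `execute_batch` is a placeholder coroutine, and registering at `KernelStarted` time is an assumption.

```python
import asyncio


class SessionStartTracker:
    """Hypothetical agent-side bookkeeper: one asyncio.Event per session ID."""

    def __init__(self):
        self._events = {}

    def register(self, session_id):
        # Assumption: called when the agent sends KernelStarted for a main kernel.
        self._events.setdefault(session_id, asyncio.Event())

    def mark_started(self, session_id):
        # Called when the agent receives SessionStarted for a tracked session.
        event = self._events.get(session_id)
        if event is not None:
            event.set()

    async def run_batch_when_started(self, session_id, execute_batch):
        # Used instead of starting the batch task directly:
        # wait for SessionStarted first, then run the startup command.
        try:
            await self._events[session_id].wait()
            await execute_batch()
        finally:
            self._events.pop(session_id, None)


async def demo():
    order = []
    tracker = SessionStartTracker()

    async def execute_batch():  # placeholder for the real batch execution
        order.append("startup_command")

    tracker.register("sess-1")
    task = asyncio.create_task(
        tracker.run_batch_when_started("sess-1", execute_batch)
    )
    await asyncio.sleep(0.01)  # the batch task is now parked on the event
    order.append("SessionStarted")
    tracker.mark_started("sess-1")
    await task
    return order


order = asyncio.run(demo())
```

A real implementation would presumably also need a timeout or cleanup path for sessions whose `SessionStarted` event never arrives; the sketch leaves that out.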
In the manager:
In the agent:
- Keep `asyncio.Event` objects for each observed session ID.
- [...] `KernelStarted` event if it is a main kernel.
- Replace the `execute_batch()` task creation after sending the `KernelStarted` event with an async function that waits for the `asyncio.Event` of the parent session ID before continuing.
- Set the `asyncio.Event` object when the agent receives a `SessionStarted` event that maps to the session ID in the bookkeeper mapping.
Idea 2 (Transfer more control to the manager)
Or, we could move the trigger to execute the startup-command from the agent to the manager.
In the agent:
- [...] `exec_batch()` task and returns immediately.
In the manager:
- [...] `SessionStarted` event.
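Idea 2 could be sketched roughly as follows, again with hypothetical names: `ToyAgent.start_batch_execution()` stands in for whatever agent call the manager would invoke, `on_all_kernels_started()` for the manager-side handler, and the shared log exists only to show the ordering.

```python
import asyncio


class ToyAgent:
    """Hypothetical agent exposing a call that only schedules the batch task."""

    def __init__(self, log):
        self.log = log
        self.tasks = []

    async def start_batch_execution(self, session_id):
        # Create the batch task and return immediately, without awaiting it.
        self.tasks.append(asyncio.create_task(self._exec_batch(session_id)))

    async def _exec_batch(self, session_id):
        await asyncio.sleep(0)  # placeholder for running the startup command
        self.log.append(f"startup_command finished: {session_id}")


class ToyManager:
    """Hypothetical manager that triggers the batch execution itself."""

    def __init__(self, agent, log):
        self.agent = agent
        self.log = log

    async def on_all_kernels_started(self, session_id):
        self.log.append(f"SessionStarted: {session_id}")
        # Only now ask the agent to run the startup command.
        await self.agent.start_batch_execution(session_id)


async def demo():
    log = []
    agent = ToyAgent(log)
    manager = ToyManager(agent, log)
    await manager.on_all_kernels_started("sess-1")
    await asyncio.gather(*agent.tasks)
    return log


log = asyncio.run(demo())
```

Because the manager issues the call only after it has generated `SessionStarted`, the ordering holds by construction, at the cost of one extra manager-to-agent round trip per batch session.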