Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race of SessionStarted event and startup-cmd execution in multi-node batch-type cluster sessions #2326

Closed
achimnol opened this issue Jun 21, 2024 · 1 comment · Fixed by #2327
Assignees
Labels
comp:agent Related to Agent component comp:manager Related to Manager component type:bug Reports about that are not working urgency:5 It is imperative that action be taken right away.
Milestone

Comments

@achimnol
Copy link
Member

achimnol commented Jun 21, 2024

Background

Currently, our batch-type session works as follows:

  1. The agent spawns container(s) and generates KernelStarted event(s).
  2. If the started container is the main1 container and the given kernel type is "batch", the agent triggers execution of the startup_command asynchronously.
    1. When the startup_command finishes, the agent generates a SessionSuccess, SessionFailure, or KernelTerminated event depending on the result. Let's call this a "completion signal".
  3. The manager receives KernelStarted event(s). If it is a cluster session, the manager changes the session status to RUNNING and generates the SessionStarted event after receiving KernelStarted events from all belonging kernels (containers).
  4. In normal conditions, if not forced, the manager allows the state transition to TERMINATING only after entering RUNNING.

Problem

Since the step 2-i and 3 are asynchronous, there is a possibility that the completion signal is ignored by the manager as it thinks the session is not ready yet because the batch-result processing is not a forced termination. (And actually, it shouldn't be!)

We have observed this could really happen with the following combinations:

  • The sesison is a multi-node cluster session including more than 2 containers.
  • It is created as a batch-type session.
  • It is given a very short-lived startup_command like echo hi.

We also confirmed that adding arbitrary delays in the startup_command (like sleep 10) reduces the possibility of this race.

There is also another potential problem due to this:

  • If the user's startup command requires all other sibling containers to be ready to execute, it may fail depending on the timing of kernel creation in other agent nodes.

Plans to Fix

We need to ensure that the step 2-i happens after the step 3 always, to prevent both the ignorance of completion signals and breaking of user codes expecting that the sibling containers are ready when started.

The goal is to start execution of startup_command after the manager-generated SessionStarted event.

Idea 1 (Fire-and-forget from the manager's perspective)

To achieve this, we could introduce a concept like "session creation tracker" to the agent as in the manager API handlers.

In the manager:

  • Pass the parent session IDs along with kernel IDs to the agents when creating kernels via agent RPC.

In the agent:

  • Add a bookkeeper mapping to store asyncio.Event objects for each observed session ID.
  • Create an entry to this mapping before sending the KernelStarted event if it is a main kernel.
  • Wrap execute_batch() task creation after sending the KernelStarted event with an async function that waits for the asyncio.Event of the parent session ID before continuing.
  • Resolve (and remove) the asyncio.Event object when the agent receives a SessionStarted event that maps to the session ID in the bookkeeper mapping.
  • To avoid memory leaks, ensure cleaning up the bookkeeper mapping entry when any main kernel is destroyed.

Idea 2 (Transfer more control to the manager)

Or, we could move the trigger to execute the startup-command from the agent to the manager.

In the agent:

  • Add a new RPC function which spawns the exec_batch() task and returns immediately.

In the manager:

  • Let it invoke this RPC function after generating the SessionStarted event.
@achimnol achimnol added type:bug Reports about that are not working comp:manager Related to Manager component comp:agent Related to Agent component urgency:5 It is imperative that action be taken right away. labels Jun 21, 2024
@achimnol achimnol added this to the 24.03 milestone Jun 21, 2024
@achimnol
Copy link
Member Author

After discussion with @fregataa, we will try the second approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp:agent Related to Agent component comp:manager Related to Manager component type:bug Reports about that are not working urgency:5 It is imperative that action be taken right away.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants