fix: Run batch execution after session starts #2327

Merged 13 commits on Jul 13, 2024

@fregataa fregataa commented Jun 21, 2024

resolves #2326

How the lifetime of the batch session changes

  1. When the manager determines that a batch session's status is RUNNING, it calls the src.ai.backend.manager.registry.trigger_batch_execution() API.
  2. That API invokes the trigger_batch_execution() RPC function on the agent where the main kernel is allocated.
  3. The agent spawns an asynchronous task that runs the batch execution.
  4. The rest of the lifecycle remains the same: the agent dispatches a SessionSuccess event when the batch execution finishes, and the manager that consumes the event destroys the batch session.
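
The four steps above can be sketched roughly as follows. This is a minimal, self-contained illustration using plain asyncio, not the actual Backend.AI code: the `SketchAgent` and `SketchManager` classes, the `on_status_calculated()` and `consume_events()` methods, the in-memory event list, and the `"python train.py"` command are all hypothetical stand-ins. Only the `trigger_batch_execution()` name, the RUNNING status, and the SessionSuccess event come from the description.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class SketchAgent:
    """Hypothetical stand-in for the agent that hosts the main kernel."""

    events: list = field(default_factory=list)
    tasks: set = field(default_factory=set)

    async def trigger_batch_execution(self, session_id: str, startup_command: str) -> None:
        # Step 3: the RPC handler returns right away and leaves the actual
        # batch execution to a background asyncio task.
        task = asyncio.create_task(self._run_batch(session_id, startup_command))
        self.tasks.add(task)  # keep a reference so the task is not garbage-collected
        task.add_done_callback(self.tasks.discard)

    async def _run_batch(self, session_id: str, startup_command: str) -> None:
        await asyncio.sleep(0)  # placeholder for running the startup command
        # Step 4: report completion with a SessionSuccess event.
        self.events.append(("SessionSuccess", session_id))


@dataclass
class SketchManager:
    """Hypothetical stand-in for the manager side."""

    agent: SketchAgent
    destroyed: list = field(default_factory=list)

    async def on_status_calculated(
        self, session_id: str, session_type: str, status: str, startup_command: str
    ) -> None:
        # Steps 1 and 2: trigger the batch execution only once the session is RUNNING.
        if session_type == "batch" and status == "RUNNING":
            await self.agent.trigger_batch_execution(session_id, startup_command)

    async def consume_events(self) -> None:
        # Step 4: destroy the batch session when SessionSuccess is consumed.
        for name, session_id in self.agent.events:
            if name == "SessionSuccess":
                self.destroyed.append(session_id)
        self.agent.events.clear()


async def demo(session_type: str = "batch", status: str = "RUNNING") -> list:
    agent = SketchAgent()
    manager = SketchManager(agent)
    await manager.on_status_calculated("sess-1", session_type, status, "python train.py")
    await asyncio.gather(*agent.tasks)  # wait for the background batch task, if any
    await manager.consume_events()
    return manager.destroyed


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch, a session that has not reached RUNNING never starts its batch task, which is the ordering guarantee the change is meant to provide.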

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • API server-client counterparts (e.g., manager API -> client SDK)

@fregataa fregataa added this to the 24.03 milestone Jun 21, 2024
@fregataa fregataa self-assigned this Jun 21, 2024

graphite-app bot commented Jun 21, 2024

Your org has enabled the Graphite merge queue for merging into main

Add the label “flow:merge-queue” to the PR and Graphite will automatically add it to the merge queue when it’s ready to merge. Or use the label “flow:hotfix” to add to the merge queue as a hot fix.

You must have a Graphite account and log in to Graphite in order to use the merge queue. Sign up using this link.

@github-actions github-actions bot added the following labels Jun 21, 2024: comp:manager (Related to Manager component), comp:agent (Related to Agent component), size:M (30~100 LoC), type:bug (Reports about that are not working), urgency:5 (It is imperative that action be taken right away.)
@fregataa fregataa force-pushed the fix/run-batch-execution-after-session-starts branch from 82d6f94 to 75e93a6 Compare June 23, 2024 06:05
@fregataa fregataa marked this pull request as ready for review June 23, 2024 15:02
@fregataa fregataa requested a review from achimnol June 24, 2024 03:33

@achimnol achimnol left a comment

Hm... in my setup, it does not seem to work well. Investigating more.

  • Session events are missing from the event stream (only kernel events appear).
    • According to the manager log, the "SessionStarted" event is fired.
  • There is no way to check whether the batch execution task is actually running. The container log shows nothing?

Update:
After restarting my dev services, it seems to work as expected, except that SessionStarted events are often missing. 🤔

@achimnol achimnol left a comment

We also need to retrigger the execution of the startup command when we restart the session. See the tail of the AgentRegistry.restart_session() method.
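
One possible shape of that change, as a hedged sketch only: the `SketchRegistry` class below is a hypothetical stand-in and does not reproduce the real AgentRegistry, its signatures, or its RPC calls. It only illustrates retriggering the batch execution at the tail of the restart path, and only for batch-type sessions.

```python
import asyncio


class SketchRegistry:
    """Hypothetical stand-in; not the real AgentRegistry."""

    def __init__(self) -> None:
        self.calls: list = []  # records the order of operations for inspection

    async def _restart_kernels(self, session_id: str) -> None:
        # Placeholder for the existing restart logic.
        self.calls.append(("restart_kernels", session_id))

    async def trigger_batch_execution(self, session_id: str) -> None:
        # Placeholder for the RPC call to the agent holding the main kernel.
        self.calls.append(("trigger_batch_execution", session_id))

    async def restart_session(self, session_id: str, session_type: str) -> None:
        await self._restart_kernels(session_id)
        # Tail of the restart path: rerun the startup command for batch sessions.
        if session_type == "batch":
            await self.trigger_batch_execution(session_id)


async def restart_demo(session_type: str) -> list:
    registry = SketchRegistry()
    await registry.restart_session("sess-1", session_type)
    return registry.calls


if __name__ == "__main__":
    print(asyncio.run(restart_demo("batch")))
```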

@fregataa fregataa requested a review from achimnol July 10, 2024 13:26
@achimnol achimnol enabled auto-merge July 13, 2024 15:07
@achimnol achimnol added this pull request to the merge queue Jul 13, 2024
Merged via the queue into main with commit 09785db Jul 13, 2024
29 checks passed
@achimnol achimnol deleted the fix/run-batch-execution-after-session-starts branch July 13, 2024 15:12
lablup-octodog pushed a commit that referenced this pull request Jul 13, 2024
Co-authored-by: Joongi Kim <[email protected]>
Backported-from: main (24.09)
Backported-to: 24.03
Backport-of: 2327
github-merge-queue bot pushed a commit that referenced this pull request Jul 13, 2024

Successfully merging this pull request may close these issues.

Race of SessionStarted event and startup-cmd execution in multi-node batch-type cluster sessions