
[core] Improve handling of Mesos resource offers #582

Merged: 1 commit merged into master from scheduler-cleanup on Jun 28, 2024
Conversation

@teo teo (Member) commented Jun 27, 2024

OCTRL-777

This commit addresses an issue with timing between a Mesos REVIVE call and its corresponding OFFERS event.

Specifically (a sketch of the resulting flow follows the list):

  • The channel that pipes incoming deployment requests into the OFFERS handler is now buffered.
  • We retry an unsatisfiable task deployment 3 times before giving up.
  • The deployment request is now passed as a pointer.
  • The response from the OFFERS handler to task.Manager.acquireTasks now goes through its own channel, one per request.
  • The deployment request is now enqueued immediately before a REVIVE, as opposed to after, in order to prevent a race with Mesos.

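For illustration, here is a minimal, self-contained Go sketch of the flow these points describe: a buffered request channel, a per-request response channel, the enqueue-before-REVIVE ordering, and the 3-attempt retry. All names (deploymentRequest, scheduler, handleOffers, acquireTasks, revive) are hypothetical stand-ins and do not reflect the actual AliECS code.

```go
// Runnable sketch of the request flow described above; names are hypothetical.
package main

import (
	"errors"
	"fmt"
)

// Each deployment request carries its own response channel, so the OFFERS
// handler replies to exactly the request it served (one channel per request).
type deploymentRequest struct {
	tasks     []string
	outcomeCh chan error
}

type scheduler struct {
	// Buffered, so enqueueing a request never blocks the caller.
	deployRequestsCh chan *deploymentRequest
}

func newScheduler() *scheduler {
	return &scheduler{deployRequestsCh: make(chan *deploymentRequest, 16)}
}

// revive stands in for the Mesos REVIVE call.
func (s *scheduler) revive() { fmt.Println("REVIVE sent") }

// handleOffers stands in for the OFFERS event handler: it takes the next
// queued deployment request and answers it on its private channel.
func (s *scheduler) handleOffers(satisfiable bool) {
	req := <-s.deployRequestsCh
	if !satisfiable {
		req.outcomeCh <- errors.New("offers cannot satisfy request")
		return
	}
	req.outcomeCh <- nil
}

// acquireTasks enqueues the request *before* calling REVIVE, so the OFFERS
// event triggered by REVIVE cannot race past an empty queue, and retries an
// unsatisfiable deployment up to 3 times before giving up.
func (s *scheduler) acquireTasks(tasks []string) error {
	var err error
	for attempt := 1; attempt <= 3; attempt++ {
		req := &deploymentRequest{tasks: tasks, outcomeCh: make(chan error, 1)}
		s.deployRequestsCh <- req // enqueue first...
		s.revive()                // ...then ask Mesos for fresh offers
		if err = <-req.outcomeCh; err == nil {
			return nil
		}
		fmt.Printf("attempt %d failed: %v\n", attempt, err)
	}
	return fmt.Errorf("deployment failed after 3 attempts: %w", err)
}

func main() {
	s := newScheduler()
	// Simulate the OFFERS handler running concurrently: the first two rounds
	// of offers are unsatisfiable, the third one succeeds.
	go func() {
		for _, ok := range []bool{false, false, true} {
			s.handleOffers(ok)
		}
	}()
	if err := s.acquireTasks([]string{"readout", "qc"}); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("all tasks deployed")
}
```

Running the sketch prints two failed attempts followed by "all tasks deployed"; in the real scheduler the handler is driven by actual Mesos OFFERS events rather than a simulated loop.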
@teo teo requested review from knopers8 and justonedev1 June 27, 2024 14:56
@knopers8 knopers8 (Collaborator) left a comment


I have to admit I don't fully understand the code here, but I would not block the PR from going into the next release; I'd rather discuss it during our ECS weekly.

Comment on lines +538 to +540
// ↑ Not all roles could be deployed. If some were critical,
// we cannot proceed with running this environment. Either way,
// we keep the roles running since they might be useful in the future.
Collaborator

I don't understand. Where is the part of code which tries to re-deploy only the undeployed tasks?

Member Author

I guess the comment is misleading. They would be useful in the future if task reuse was enabled, but it was disabled some years ago since start-stop-start wasn't there yet. In case of partial deployment we'd actually fail, but to my knowledge the current undeployable cases we have are actually early failures where no tasks are deployed.

@teo teo merged commit dd30081 into master Jun 28, 2024
2 checks passed
@teo teo deleted the scheduler-cleanup branch June 28, 2024 11:40