
[core] Improve handling of Mesos resource offers #582

Merged: 1 commit merged into master from scheduler-cleanup on Jun 28, 2024
Conversation

@teo teo (Member) commented Jun 27, 2024

OCTRL-777

This commit addresses an issue with timing between a Mesos REVIVE call and its corresponding OFFERS event.

Specifically (a sketch of the resulting flow follows the list):

  • The channel that pipes incoming deployment requests into the OFFERS handler is now buffered.
  • We retry an unsatisfiable task deployment 3 times before giving up.
  • The deployment request is now passed as a pointer.
  • The response from the OFFERS handler to task.Manager.acquireTasks now goes through its own channel, one per request.
  • The deployment request is now enqueued immediately before a REVIVE, as opposed to after, in order to prevent a race with Mesos.

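For illustration, here is a minimal, self-contained Go sketch of the flow these points describe: a buffered request channel, a per-request response channel, the enqueue-before-REVIVE ordering, and the 3-attempt retry. All names (deploymentRequest, scheduler, handleOffers, acquireTasks, revive) are hypothetical stand-ins and do not reflect the actual AliECS code.

```go
// Runnable sketch of the request flow described above; names are hypothetical.
package main

import (
	"errors"
	"fmt"
)

// Each deployment request carries its own response channel, so the OFFERS
// handler replies to exactly the request it served (one channel per request).
type deploymentRequest struct {
	tasks     []string
	outcomeCh chan error
}

type scheduler struct {
	// Buffered, so enqueueing a request never blocks the caller.
	deployRequestsCh chan *deploymentRequest
}

func newScheduler() *scheduler {
	return &scheduler{deployRequestsCh: make(chan *deploymentRequest, 16)}
}

// revive stands in for the Mesos REVIVE call.
func (s *scheduler) revive() { fmt.Println("REVIVE sent") }

// handleOffers stands in for the OFFERS event handler: it takes the next
// queued deployment request and answers it on its private channel.
func (s *scheduler) handleOffers(satisfiable bool) {
	req := <-s.deployRequestsCh
	if !satisfiable {
		req.outcomeCh <- errors.New("offers cannot satisfy request")
		return
	}
	req.outcomeCh <- nil
}

// acquireTasks enqueues the request *before* calling REVIVE, so the OFFERS
// event triggered by REVIVE cannot race past an empty queue, and retries an
// unsatisfiable deployment up to 3 times before giving up.
func (s *scheduler) acquireTasks(tasks []string) error {
	var err error
	for attempt := 1; attempt <= 3; attempt++ {
		req := &deploymentRequest{tasks: tasks, outcomeCh: make(chan error, 1)}
		s.deployRequestsCh <- req // enqueue first...
		s.revive()                // ...then ask Mesos for fresh offers
		if err = <-req.outcomeCh; err == nil {
			return nil
		}
		fmt.Printf("attempt %d failed: %v\n", attempt, err)
	}
	return fmt.Errorf("deployment failed after 3 attempts: %w", err)
}

func main() {
	s := newScheduler()
	// Simulate the OFFERS handler running concurrently: the first two rounds
	// of offers are unsatisfiable, the third one succeeds.
	go func() {
		for _, ok := range []bool{false, false, true} {
			s.handleOffers(ok)
		}
	}()
	if err := s.acquireTasks([]string{"readout", "qc"}); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("all tasks deployed")
}
```

Running the sketch prints two failed attempts followed by "all tasks deployed"; in the real scheduler the handler is driven by actual Mesos OFFERS events rather than a simulated loop.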
@teo teo requested review from knopers8 and justonedev1 June 27, 2024 14:56
@knopers8 knopers8 (Collaborator) left a comment


I have to admit I don't fully understand the code here, but I would not block the PR from going into the next release; I'd rather discuss it during our ECS weekly.

Comment on lines +538 to +540
// ↑ Not all roles could be deployed. If some were critical,
// we cannot proceed with running this environment. Either way,
// we keep the roles running since they might be useful in the future.
Collaborator

I don't understand. Where is the part of code which tries to re-deploy only the undeployed tasks?

Member Author

I guess the comment is misleading. They would be useful in the future if task reuse was enabled, but it was disabled some years ago since start-stop-start wasn't there yet. In case of partial deployment we'd actually fail, but to my knowledge the current undeployable cases we have are actually early failures where no tasks are deployed.

@teo teo merged commit dd30081 into master Jun 28, 2024
2 checks passed
@teo teo deleted the scheduler-cleanup branch June 28, 2024 11:40