Nomad Agent Simultaneous Restart Behavior #673
Comments
Minimal Job Configuration
{
  "ID": "1",
  "Name": "1",
  "Type": "batch",
  "TaskGroups": [
    {
      "Name": "default-group",
      "Count": 1,
      "RestartPolicy": {
        "Attempts": 3,
        "Interval": 3600000000000,
        "Delay": 15000000000,
        "Mode": "fail",
        "RenderTemplates": false
      },
      "ReschedulePolicy": {
        "Attempts": 3,
        "Interval": 21600000000000,
        "Delay": 60000000000,
        "DelayFunction": "exponential",
        "MaxDelay": 240000000000,
        "Unlimited": false
      },
      "Tasks": [
        {
          "Name": "default-task",
          "Driver": "docker",
          "Config": {
            "command": "sleep",
            "force_pull": true,
            "image": "openhpi/co_execenv_python:3.8",
            "network_mode": "none",
            "args": [
              "infinity"
            ]
          }
        }
      ]
    }
  ]
}
We conducted 5 repetitions, each with 5 jobs running while the agents were restarted.
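The duration fields in the job configuration above are given in nanoseconds, which makes the policies hard to read at a glance. The following is a minimal sketch that converts those values and lists the delays the reschedule policy would produce. It assumes that the `exponential` delay function doubles the delay on each attempt until `MaxDelay` is reached. The helper names `ns_to_timedelta` and `reschedule_delays` are hypothetical and not part of Nomad or Poseidon.

```python
from datetime import timedelta


def ns_to_timedelta(ns: int) -> timedelta:
    """Convert a nanosecond duration from the JSON job spec to a timedelta."""
    # Integer division avoids float rounding; sub-microsecond parts are dropped.
    return timedelta(microseconds=ns // 1000)


def reschedule_delays(delay_ns: int, max_delay_ns: int, attempts: int) -> list:
    """List the delay before each reschedule attempt.

    Assumption: the delay doubles per attempt and is capped at max_delay_ns.
    """
    delays = []
    current = delay_ns
    for _ in range(attempts):
        delays.append(min(current, max_delay_ns))
        current *= 2
    return delays


# Values copied from the job configuration in this issue.
restart_policy = {"Attempts": 3, "Interval": 3600000000000, "Delay": 15000000000}
reschedule_policy = {
    "Attempts": 3,
    "Interval": 21600000000000,
    "Delay": 60000000000,
    "MaxDelay": 240000000000,
}

print("restart interval:", ns_to_timedelta(restart_policy["Interval"]))
print("restart delay:", ns_to_timedelta(restart_policy["Delay"]))
print("reschedule interval:", ns_to_timedelta(reschedule_policy["Interval"]))
for attempt, ns in enumerate(
    reschedule_delays(
        reschedule_policy["Delay"],
        reschedule_policy["MaxDelay"],
        reschedule_policy["Attempts"],
    ),
    start=1,
):
    print(f"reschedule attempt {attempt}: wait {ns_to_timedelta(ns)}")
```

Read this way, the job allows 3 restarts per hour with a 15 s delay, and 3 reschedules per 6 hours with delays of 60 s, 120 s, and 240 s under the stated assumption.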
Questions
@MrSerth I would ask these questions in an upstream issue, agreed?
Because Poseidon |
We have to specify the observed behavior more precisely. The behavior differs depending on how recently the job has been created. If we (stop and) create the job anew before each restart, the recreation mostly succeeds:
If we restart the agents multiple times for the same job, the recreation fails:
The behavior happens only with the drain on shutdown configuration:
leave_on_interrupt = true
leave_on_terminate = true
client {
  drain_on_shutdown {
    deadline = "15s"
  }
}
|
Thanks for investigating here. I am currently a bit unsure how to interpret these results.
|
Good open questions, I will forward them.
That's on my wording. How recently the job was deployed does not seem to have an influence. Instead, the number of agent restarts seems to be what matters.
|
We are currently blocked by the upstream issue and are waiting for a response. |
In #612 we noticed that on a simultaneous restart of all Nomad agents, some Jobs are completely removed and disappear, others are dead but still listed, and some are being restarted.
complete
/deadJob
-JobDeregistered
events?
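The three outcomes described in this issue (jobs that disappear completely, jobs that are dead but still listed, and jobs that are being restarted) can be told apart by comparing the job list before and after the agent restart. Below is a minimal sketch, assuming each snapshot is a mapping from job ID to the job status string reported by Nomad (`pending`, `running`, or `dead`). The function name `classify_jobs`, the snapshot format, and the sample data are hypothetical; fetching the snapshots from Nomad is left out, and the sample values are not measurements from our test runs.

```python
from typing import Dict, List


def classify_jobs(before: Dict[str, str], after: Dict[str, str]) -> Dict[str, List[str]]:
    """Group the jobs known before the restart by what happened to them.

    removed: no longer listed at all
    dead:    still listed, but with status "dead"
    alive:   still listed as "pending" or "running" (restarted or untouched)
    """
    result: Dict[str, List[str]] = {"removed": [], "dead": [], "alive": []}
    for job_id in sorted(before):
        status = after.get(job_id)
        if status is None:
            result["removed"].append(job_id)
        elif status == "dead":
            result["dead"].append(job_id)
        else:
            result["alive"].append(job_id)
    return result


# Made-up example: 5 jobs before the restart, as in one repetition.
before = {"1": "running", "2": "running", "3": "running", "4": "running", "5": "running"}
after = {"1": "running", "2": "dead", "4": "pending"}

for outcome, job_ids in classify_jobs(before, after).items():
    print(outcome, job_ids)
```

A status comparison alone cannot distinguish a job that was restarted from one that was never interrupted; that would additionally require looking at the job's allocations.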