Nomad Agent Simultaneous Restart Behavior #673

Open

mpass99 opened this issue Sep 4, 2024 · 5 comments
Labels: bug (Something isn't working), pending

Comments

@mpass99
Contributor

mpass99 commented Sep 4, 2024

In #612 we noticed that when all Nomad agents are restarted simultaneously, some Jobs are removed completely, others are dead but still listed, and some are restarted.

  • Create Reproduction Steps for this behavior
  • Create an upstream Issue
    • Why are most of the Jobs not restarted after the agents are ready again?
    • Why are some Jobs completely removed while others are still listed as complete/dead?
    • Why do we not receive Job-JobDeregistered events?
mpass99 added the bug label on Sep 4, 2024
@mpass99
Contributor Author

mpass99 commented Sep 5, 2024

Create Reproduction Steps for this behavior

Minimal Job Configuration

{
  "ID": "1",
  "Name": "1",
  "Type": "batch",
  "TaskGroups": [
    {
      "Name": "default-group",
      "Count": 1,
      "RestartPolicy": {
        "Attempts": 3,
        "Interval": 3600000000000,
        "Delay": 15000000000,
        "Mode": "fail",
        "RenderTemplates": false
      },
      "ReschedulePolicy": {
        "Attempts": 3,
        "Interval": 21600000000000,
        "Delay": 60000000000,
        "DelayFunction": "exponential",
        "MaxDelay": 240000000000,
        "Unlimited": false
      },
      "Tasks": [
        {
          "Name": "default-task",
          "Driver": "docker",
          "Config": {
            "command": "sleep",
            "force_pull": true,
            "image": "openhpi/co_execenv_python:3.8",
            "network_mode": "none",
            "args": [
              "infinity"
            ]
          }
        }
      ]
    }
  ]
}
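
For reference, the job above could also be registered programmatically. The following is a minimal sketch, assuming the Go API package github.com/hashicorp/nomad/api and an agent reachable via the default configuration (RestartPolicy and ReschedulePolicy are omitted for brevity):

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// Assumes NOMAD_ADDR or the default http://127.0.0.1:4646.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Minimal batch job mirroring the JSON configuration above.
	job := api.NewBatchJob("1", "1", "global", 50)
	group := api.NewTaskGroup("default-group", 1)
	task := api.NewTask("default-task", "docker")
	task.Config = map[string]interface{}{
		"image":        "openhpi/co_execenv_python:3.8",
		"command":      "sleep",
		"args":         []string{"infinity"},
		"network_mode": "none",
		"force_pull":   true,
	}
	group.AddTask(task)
	job.AddTaskGroup(group)

	resp, _, err := client.Jobs().Register(job, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("evaluation:", resp.EvalID)
}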

We conducted 5 repetitions, each with 5 jobs running while the agents were restarted.
Interestingly, the recreation counts neither as a Restart nor as a Rescheduling, although the events state that the Allocation is being migrated (see the sketch below the results table).

Repetition | Running Jobs after Restart
---------- | --------------------------
1          | 5/5
2          | 4/5
3          | 5/5
4          | 0/5
5          | 5/5
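
To attribute these numbers, the Restart and Reschedule counters of the job's allocations can be inspected after an agent restart. A minimal sketch, again assuming the Go API package github.com/hashicorp/nomad/api and the job ID "1" from the configuration above:

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// List all allocations of job "1", including stopped ones.
	stubs, _, err := client.Jobs().Allocations("1", true, nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, stub := range stubs {
		alloc, _, err := client.Allocations().Info(stub.ID, nil)
		if err != nil {
			log.Fatal(err)
		}
		// Sum the per-task restart counters.
		var restarts uint64
		for _, state := range alloc.TaskStates {
			restarts += state.Restarts
		}
		// Count the reschedule events tracked for this allocation.
		reschedules := 0
		if alloc.RescheduleTracker != nil {
			reschedules = len(alloc.RescheduleTracker.Events)
		}
		fmt.Printf("alloc %s: status=%s restarts=%d reschedules=%d\n",
			stub.ID, stub.ClientStatus, restarts, reschedules)
	}
}

If the recreation were counted as a Restart or a Rescheduling, one of these counters should increase; per the observation above, neither does.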

Questions

  • Why is the Recreation unreliable in this scenario?
  • Why is the migration (and recreation of the Docker container) counted neither as Restart nor as Rescheduling?
  • Why do we not receive Job-JobDeregistered events? (See the event-stream sketch below this list.)
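
Regarding the last question, deregistration events could be observed via the Nomad event stream. A sketch, assuming the Go API package github.com/hashicorp/nomad/api; a JobDeregistered event would be expected on the Job topic when a job is removed:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Subscribe to events on the Job topic; the third argument is the index to start streaming from.
	topics := map[api.Topic][]string{api.TopicJob: {"*"}}
	stream, err := client.EventStream().Stream(context.Background(), topics, 0, nil)
	if err != nil {
		log.Fatal(err)
	}
	for batch := range stream {
		if batch.Err != nil {
			log.Fatal(batch.Err)
		}
		for _, event := range batch.Events {
			// A job removal should surface here as a "JobDeregistered" event.
			fmt.Printf("type=%s key=%s index=%d\n", event.Type, event.Key, event.Index)
		}
	}
}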

@MrSerth I would ask these questions in an upstream Issue, okay?

Why are some Jobs completely removed while others are still listed as complete/dead?

Because Poseidon purges runner jobs that are being stopped. This seems not only superfluous but likely aggravates the scenario described above: when a job might be recreated, we purge it in the middle of the recreation process.
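
For illustration, the difference between stopping and purging lies in the second parameter of Jobs().Deregister in the Go API package github.com/hashicorp/nomad/api. A sketch, reusing the client setup from the registration example above:

// Stop the job but keep it in Nomad's state; it stays listed as dead/complete.
if _, _, err := client.Jobs().Deregister("1", false, nil); err != nil {
	log.Fatal(err)
}

// Stop the job AND purge it from Nomad's state; it disappears from job listings.
// This purging is what Poseidon currently does for stopped runner jobs.
if _, _, err := client.Jobs().Deregister("1", true, nil); err != nil {
	log.Fatal(err)
}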

@mpass99
Contributor Author

mpass99 commented Sep 5, 2024

We have to describe the observed behavior more precisely. The behavior differs depending on how recently the job has been created.

If we (stop and re)create the job each time before the restart, the recreation mostly succeeds:

Repetition | Running Jobs after Restart
---------- | --------------------------
1          | 5/5
2          | 5/5
3          | 5/5
4          | 4/5

If we restart the agents multiple times for the same job, the recreation fails:

Repetition | Running Jobs after first Restart (timestamp) | after second Restart (timestamp) | Time waited
---------- | -------------------------------------------- | -------------------------------- | -----------
1          | 5/5                                          | 0/5                              | 10 minutes
2          | 5/5 (1725533193)                             | 0/5 (1725533276)                 | 8 minutes
3          | 5/5 (1725533749)                             | 0/5 (1725533794)                 | 2 minutes
4          | 5/5 (1725543688)                             | 0/5 (1725544945)                 | 30 minutes

The behavior occurs only with the drain_on_shutdown configuration:

leave_on_interrupt = true
leave_on_terminate = true

client {
  drain_on_shutdown {
    deadline = "15s"
  }
}
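
To narrow this down further, one could verify whether all clients actually return to an eligible, non-draining state after such a restart. A sketch, assuming the Go API package github.com/hashicorp/nomad/api:

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// List all client nodes with their status, scheduling eligibility, and drain flag.
	nodes, _, err := client.Nodes().List(nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, node := range nodes {
		fmt.Printf("node %s: status=%s eligibility=%s draining=%t\n",
			node.Name, node.Status, node.SchedulingEligibility, node.Drain)
	}
}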

@MrSerth
Member

MrSerth commented Sep 5, 2024

Thanks for investigating here. I am currently a bit unsure how to interpret these results.

  • I get that the drain_on_shutdown has an influence, but I currently don't understand
    • why it affects the second restart only
    • why it makes the situation worse (and not better); shouldn't this help to avoid losing jobs?
  • Why is it important how recently the job has been deployed? I could understand if we hit the restart / rescheduling limit, but for the first agent restart it shouldn't make any difference how "old" the job is.
  • Since I don't have any answers for your three questions, you may proceed to ask them upstream (or some other suitable discussion list) 👍

@mpass99
Contributor Author

mpass99 commented Sep 9, 2024

why it affects the second restart only
why it makes the situation worse (and not better); shouldn't this help to avoid losing jobs?

Good open questions, I will forward them.

Why is it important how recently the job has been deployed?

That's due to my wording. How recently the job has been deployed does not actually seem to have an influence. Instead, the number of agent restarts the job has gone through seems to be what matters.

Since I don't have any answers for your three questions, you may proceed to ask them upstream

See hashicorp/nomad#23937

@MrSerth
Member

MrSerth commented Sep 25, 2024

We are currently blocked by the upstream issue and are waiting for a response.
