Nomad Agent Simultaneous Restart Behavior #673

Open

mpass99 opened this issue Sep 4, 2024 · 5 comments
Labels: bug (Something isn't working), pending

Comments

@mpass99
Contributor

mpass99 commented Sep 4, 2024

In #612 we noticed that when all Nomad agents are restarted simultaneously, some Jobs are removed completely, others are dead but still listed, and some are restarted.

  • Create Reproduction Steps for this behavior
  • Create an upstream Issue
    • Why are most of the Jobs not restarted after the agents are ready again?
    • Why are some Jobs completely removed while others are still listed as complete/dead?
    • Why do we not receive Job-JobDeregistered events?
mpass99 added the bug label on Sep 4, 2024
@mpass99
Contributor Author

mpass99 commented Sep 5, 2024

Create Reproduction Steps for this behavior

Minimal Job Configuration

{
  "ID": "1",
  "Name": "1",
  "Type": "batch",
  "TaskGroups": [
    {
      "Name": "default-group",
      "Count": 1,
      "RestartPolicy": {
        "Attempts": 3,
        "Interval": 3600000000000,
        "Delay": 15000000000,
        "Mode": "fail",
        "RenderTemplates": false
      },
      "ReschedulePolicy": {
        "Attempts": 3,
        "Interval": 21600000000000,
        "Delay": 60000000000,
        "DelayFunction": "exponential",
        "MaxDelay": 240000000000,
        "Unlimited": false
      },
      "Tasks": [
        {
          "Name": "default-task",
          "Driver": "docker",
          "Config": {
            "command": "sleep",
            "force_pull": true,
            "image": "openhpi/co_execenv_python:3.8",
            "network_mode": "none",
            "args": [
              "infinity"
            ]
          }
        }
      ]
    }
  ]
}
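
For reference, the job above could also be registered programmatically. The following is a minimal sketch, assuming the Go API package github.com/hashicorp/nomad/api and an agent reachable via the default configuration (RestartPolicy and ReschedulePolicy are omitted for brevity):

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// Assumes NOMAD_ADDR or the default http://127.0.0.1:4646.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Minimal batch job mirroring the JSON configuration above.
	job := api.NewBatchJob("1", "1", "global", 50)
	group := api.NewTaskGroup("default-group", 1)
	task := api.NewTask("default-task", "docker")
	task.Config = map[string]interface{}{
		"image":        "openhpi/co_execenv_python:3.8",
		"command":      "sleep",
		"args":         []string{"infinity"},
		"network_mode": "none",
		"force_pull":   true,
	}
	group.AddTask(task)
	job.AddTaskGroup(group)

	resp, _, err := client.Jobs().Register(job, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("evaluation:", resp.EvalID)
}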

We conducted 5 repetitions, each with 5 jobs running while the agents were restarted.
Interestingly, the recreation counts neither as a Restart nor as a Rescheduling, although the events state that the Allocation is being migrated (see the sketch below the results table).

Repetition | Running Jobs after Restart
---------- | --------------------------
1          | 5/5
2          | 4/5
3          | 5/5
4          | 0/5
5          | 5/5
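
To attribute these numbers, the Restart and Reschedule counters of the job's allocations can be inspected after an agent restart. A minimal sketch, again assuming the Go API package github.com/hashicorp/nomad/api and the job ID "1" from the configuration above:

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// List all allocations of job "1", including stopped ones.
	stubs, _, err := client.Jobs().Allocations("1", true, nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, stub := range stubs {
		alloc, _, err := client.Allocations().Info(stub.ID, nil)
		if err != nil {
			log.Fatal(err)
		}
		// Sum the per-task restart counters.
		var restarts uint64
		for _, state := range alloc.TaskStates {
			restarts += state.Restarts
		}
		// Count the reschedule events tracked for this allocation.
		reschedules := 0
		if alloc.RescheduleTracker != nil {
			reschedules = len(alloc.RescheduleTracker.Events)
		}
		fmt.Printf("alloc %s: status=%s restarts=%d reschedules=%d\n",
			stub.ID, stub.ClientStatus, restarts, reschedules)
	}
}

If the recreation were counted as a Restart or a Rescheduling, one of these counters should increase; per the observation above, neither does.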

Questions

  • Why is the Recreation unreliable in this scenario?
  • Why is the migration (and recreation of the Docker container) counted neither as Restart nor as Rescheduling?
  • Why do we not receive Job-JobDeregistered events? (See the event-stream sketch below this list.)
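
Regarding the last question, deregistration events could be observed via the Nomad event stream. A sketch, assuming the Go API package github.com/hashicorp/nomad/api; a JobDeregistered event would be expected on the Job topic when a job is removed:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Subscribe to events on the Job topic; the third argument is the index to start streaming from.
	topics := map[api.Topic][]string{api.TopicJob: {"*"}}
	stream, err := client.EventStream().Stream(context.Background(), topics, 0, nil)
	if err != nil {
		log.Fatal(err)
	}
	for batch := range stream {
		if batch.Err != nil {
			log.Fatal(batch.Err)
		}
		for _, event := range batch.Events {
			// A job removal should surface here as a "JobDeregistered" event.
			fmt.Printf("type=%s key=%s index=%d\n", event.Type, event.Key, event.Index)
		}
	}
}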

@MrSerth I would ask these questions in an upstream Issue, okay?

Why are some Jobs completely removed while others are still listed as complete/dead?

Because Poseidon purges runner jobs that are being stopped. This seems not only superfluous but likely aggravates the scenario described above: when a job might be recreated, we purge it in the middle of the recreation process.
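
For illustration, the difference between stopping and purging lies in the second parameter of Jobs().Deregister in the Go API package github.com/hashicorp/nomad/api. A sketch, reusing the client setup from the registration example above:

// Stop the job but keep it in Nomad's state; it stays listed as dead/complete.
if _, _, err := client.Jobs().Deregister("1", false, nil); err != nil {
	log.Fatal(err)
}

// Stop the job AND purge it from Nomad's state; it disappears from job listings.
// This purging is what Poseidon currently does for stopped runner jobs.
if _, _, err := client.Jobs().Deregister("1", true, nil); err != nil {
	log.Fatal(err)
}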

@mpass99
Contributor Author

mpass99 commented Sep 5, 2024

We have to describe the observed behavior more precisely. The behavior differs depending on how recently the job has been created.

If we (stop and re)create the job each time before the restart, the recreation mostly succeeds:

Repetition | Running Jobs after Restart
---------- | --------------------------
1          | 5/5
2          | 5/5
3          | 5/5
4          | 4/5

If we restart the agents multiple times for the same job, the recreation fails:

Repetition | Running Jobs after first Restart (timestamp) | after second Restart (timestamp) | Time waited
---------- | -------------------------------------------- | -------------------------------- | -----------
1          | 5/5                                          | 0/5                              | 10 minutes
2          | 5/5 (1725533193)                             | 0/5 (1725533276)                 | 8 minutes
3          | 5/5 (1725533749)                             | 0/5 (1725533794)                 | 2 minutes
4          | 5/5 (1725543688)                             | 0/5 (1725544945)                 | 30 minutes

The behavior occurs only with the drain_on_shutdown configuration:

leave_on_interrupt = true
leave_on_terminate = true

client {
  drain_on_shutdown {
    deadline = "15s"
  }
}
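
To narrow this down further, one could verify whether all clients actually return to an eligible, non-draining state after such a restart. A sketch, assuming the Go API package github.com/hashicorp/nomad/api:

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// List all client nodes with their status, scheduling eligibility, and drain flag.
	nodes, _, err := client.Nodes().List(nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, node := range nodes {
		fmt.Printf("node %s: status=%s eligibility=%s draining=%t\n",
			node.Name, node.Status, node.SchedulingEligibility, node.Drain)
	}
}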

@MrSerth
Member

MrSerth commented Sep 5, 2024

Thanks for investigating here. I am currently a bit unsure how to interpret these results.

  • I get that the drain_on_shutdown has an influence, but I currently don't understand
    • why it affects the second restart only
    • why it makes the situation worse (and not better); shouldn't this help to avoid losing jobs?
  • Why is it important how recently the job has been deployed? I could understand if we hit the restart / rescheduling limit, but for the first agent restart it shouldn't make any difference how "old" the job is.
  • Since I don't have any answers for your three questions, you may proceed to ask them upstream (or some other suitable discussion list) 👍

@mpass99
Contributor Author

mpass99 commented Sep 9, 2024

why it affects the second restart only
why it makes the situation worse (and not better); shouldn't this help to avoid losing jobs?

Good open questions, I will forward them.

Why is it important how recently the job has been deployed?

That's due to my wording. How recently the job has been deployed does not actually seem to have an influence. Instead, the number of agent restarts the job has gone through seems to be what matters.

Since I don't have any answers for your three questions, you may proceed to ask them upstream

See hashicorp/nomad#23937

@MrSerth
Member

MrSerth commented Sep 25, 2024

We are currently blocked by the upstream issue and are waiting for a response.
