Elastic Agent should support a restart action #3367

jlind23 · 2023-09-07T11:57:56Z

In order to give our users the ability to remotely restart an Agent, the Elastic Agent should support a restart action.
We have two paths forward:
This should have the exact same behaviour as the elastic-agent restart command.

jlind23 · 2023-09-07T12:01:27Z

@cmacknz @pierrehilbert I believe this is the first piece to deliver before working on elastic/kibana#144585

cmacknz · 2023-09-07T15:49:07Z

Yes this needs to come first.

pierrehilbert · 2023-09-07T15:50:03Z

From my point of view, we "just" need to add a new handler in the same way we did it for remote diag or upgrades.
What is the priority of this new feature compared to the others we have?

blakerouse · 2023-09-07T15:53:43Z

The hardest part of adding the action is how to handle the ACK. Do we ACK before we restart or after we restart? If we ACK after, we need to handle sending it once the agent is back up; if we ACK before, we need to ensure the ACK is received by Fleet Server before we perform the restart.

cmacknz · 2023-09-07T16:05:49Z

We should acknowledge only after the restart has completed. This is basically the same problem as acknowledging an upgrade, and the easiest solution to implement without edge cases will be to report the state as part of the periodic check-in rather than trying to guarantee delivery of a single acknowledgment across restarts.

There are probably a few ways to accomplish this. Off the top of my head, including any kind of monotonic number in the check-in that resets on a restart is one way for Fleet to detect this without an explicit acknowledgement. Candidates include a counter for the number of restarts performed, the accumulated time since the last restart / since the agent process started, etc.

We could also persist the acknowledgement to disk and just retry it infinitely until it gets through, but this will have a lot more edge cases to test.

elasticmachine · 2024-05-16T20:58:45Z

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

nchaulet · 2025-01-16T14:03:23Z

Hi, I am working on the tech definition for the Kibana part (elastic/kibana#144585) and I am wondering what we will have to send to the agent in that case. Would a new action type, something like RESTART, work for you?

cmacknz · 2025-01-16T15:15:07Z

Not defined yet, a restart action is one way, but we could also define it as a special case of the upgrade action so we can reuse the state reporting and acknowledgement mechanisms from that.

nchaulet · 2025-01-16T17:53:57Z

> Not defined yet, a restart action is one way, but we could also define it as a special case of the upgrade action so we can reuse the state reporting and acknowledgement mechanisms from that.

Could it be confusing and misleading to use the upgrade state reporting for that? What if a restart and an upgrade action are triggered at the same time? Would it be better, if we need state reporting, to have some new fields inspired by what has been done for upgrade?
Trying to understand the need here a little more: what information could we want to report after a restart other than acknowledging that the restart has been done? Could there be some restart failures?

cmacknz · 2025-01-16T19:18:59Z

The hardest part of a restart action is that the ack should technically happen after the restart, which requires that state be tracked outside of the agent process, which is exactly what our upgrade state reporting is built to do. Also, "restart yourself" is already a step that exists in the upgrade state machine. Perhaps there is a separate restart action that is just an alias for some part of this existing process.

I don't want to over specify this before someone else has a chance to think more in depth about this, but it feels like we can reuse a lot of the existing upgrade machinery to make sure this works properly at an implementation level. We may not actually want to recycle the upgrade action to trigger this.

nimarezainia · 2025-01-16T23:40:57Z

How about our audit logs? We would want to see in the audit logs or the activity panel that the agent was "explicitly" restarted. I believe for that we need a RESTART action as a separate action sent to the agent. As Craig mentions, the RESTART action should invoke the relevant portion of the upgrade process. It's battle-tested and has all the necessary state machine states and reporting.

The persona issuing the restart may be different from the one doing the upgrade. Upgrades happen at scale, but in reality a restart may be singular: someone trying to troubleshoot, for example. So separate sessions would be issuing these commands. We need to ensure there's a paper trail and that these actions are distinguished from one another.

> Trying to understand the need here a little more: what information could we want to report after a restart other than acknowledging that the restart has been done? Could there be some restart failures?

Almost always the agent is already in a failure state when the user issues a restart (as a desperate last-ditch attempt to restore it), without having resolved the underlying issue causing the agent to be unhealthy. Chances are that after the restart the agent may end up in the same unhealthy/failed state.

nchaulet · 2025-01-17T16:08:35Z

> I don't want to over specify this before someone else has a chance to think more in depth about this, but it feels like we can reuse a lot of the existing upgrade machinery to make sure this works properly at an implementation level. We may not actually want to recycle the upgrade action to trigger this.

👍 Who could take a more in-depth look at this? From what I understand, the implementation could be something like:

- Fleet UI sends a RESTART action
- this is handled by the agent with:
  - an audit log entry about the restart
  - the same code path as upgrade for the restart itself
  - an ack after the restart
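
The steps above could be sketched in Go roughly as follows. Every dependency is injected as a function so the ordering is visible: `RestartFlow` and its fields are hypothetical names for this example, and `Restart` merely stands in for whatever part of the upgrade code path would be reused.

```go
package main

import "fmt"

// RestartFlow is a hypothetical wiring of the steps needed for a RESTART action.
type RestartFlow struct {
	Audit       func(msg string)             // writes an audit log entry
	SaveMarker  func(actionID string) error  // persists state outside process memory
	LoadMarker  func() (string, bool)        // reads a persisted marker, if any
	ClearMarker func() error                 // removes the marker once acked
	Restart     func() error                 // stand-in for the reused upgrade restart step
	Ack         func(actionID string) error  // acknowledges the action to Fleet
}

// HandleRestart runs before the process goes down: audit, persist, restart.
func (f RestartFlow) HandleRestart(actionID string) error {
	f.Audit(fmt.Sprintf("restart requested by action %s", actionID))
	if err := f.SaveMarker(actionID); err != nil {
		return fmt.Errorf("saving restart marker: %w", err)
	}
	return f.Restart()
}

// OnStartup runs in the new process: if a marker exists, ack it and clear it.
// When the ack fails the marker is kept so a later attempt can retry.
func (f RestartFlow) OnStartup() error {
	id, ok := f.LoadMarker()
	if !ok {
		return nil
	}
	if err := f.Ack(id); err != nil {
		return fmt.Errorf("acking restart %s: %w", id, err)
	}
	return f.ClearMarker()
}
```

The marker is saved before restarting because, as discussed earlier in the thread, the acknowledgement has to survive the process going away; whether that state lives in a file, in the upgrade state reporting, or in the check-in payload is still an open question here.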

cmacknz · 2025-01-17T16:35:21Z

@jlind23 is the best person to assign one of the control plane engineers to help refine this.

jlind23 mentioned this issue Sep 7, 2023

Add ability to remotely restart an agent elastic/kibana#144585

Open

5 tasks

jlind23 transferred this issue from elastic/kibana Sep 7, 2023

jlind23 added the Team:Elastic-Agent label Sep 7, 2023

ycombinator added the Team:Elastic-Agent-Control-Plane label May 16, 2024
