Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Elastic Agent should support a restart action #3367

Open
Tracked by #144585
jlind23 opened this issue Sep 7, 2023 · 13 comments
Open
Tracked by #144585

Elastic Agent should support a restart action #3367

jlind23 opened this issue Sep 7, 2023 · 13 comments
Labels
Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@jlind23
Copy link
Contributor

jlind23 commented Sep 7, 2023

In order to give our users the ability to remotely restart an Agent, the Elastic Agent should support a restart action.
We have two path forward:
This should have the exact same behaviour as the elastic-agent restart command.

@jlind23 jlind23 transferred this issue from elastic/kibana Sep 7, 2023
@jlind23 jlind23 added the Team:Elastic-Agent Label for the Agent team label Sep 7, 2023
@jlind23
Copy link
Contributor Author

jlind23 commented Sep 7, 2023

@cmacknz @pierrehilbert I believe this is the first piece to deliver before working on elastic/kibana#144585

@cmacknz
Copy link
Member

cmacknz commented Sep 7, 2023

Yes this needs to come first.

@pierrehilbert
Copy link
Contributor

From my point of view, we "just" need to add a new handler in the same way we did it for remote diag or upgrades.
What is the priority for this new feature compare to the other we have?

@blakerouse
Copy link
Contributor

The hardest part of adding the action is how to handle the ACK. Do we ACK before we restart or after we restart. If we do it after then we need to handle that, if we do it before we need to ensure the ACK is received by Fleet Server before we perform the restart.

@cmacknz
Copy link
Member

cmacknz commented Sep 7, 2023

We should acknowledge only after the restart has completed. This is basically the same problem as acknowledging an upgrade, and the easiest solution to implement without edge cases will be to report the state in as part of the periodic check in rather than trying to guarantee delivery of a single acknowledgment across restarts.

There are probably a few ways to accomplish this, off the top of my head including any kind of monotonic number in the check in that resets on a restart is one way for Fleet to detect this without an explicit acknowledgement. A counter for the number of restarts performed, the accumulated time since the last restart / since the agent process started, etc.

We could also persist the acknowledgement to disk and just retry it infinitely until it gets through, but this will have a lot more edge cases to test.

@ycombinator ycombinator added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label May 16, 2024
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@nchaulet
Copy link
Member

Hi I am working on the tech definition for the kibana part elastic/kibana#144585 and I am wondering what we will have to send to the agent in that case? does a new action type something like RESTART will work for you?

@cmacknz
Copy link
Member

cmacknz commented Jan 16, 2025

Not defined yet, a restart action is one way, but we could also define it as a special case of the upgrade action so we can reuse the state reporting and acknowledgement mechanisms from that.

@nchaulet
Copy link
Member

Not defined yet, a restart action is one way, but we could also define it as a special case of the upgrade action so we can reuse the state reporting and acknowledgement mechanisms from that.

Could it be confusing, and misleading to use the upgrade state reporting for that? what if a restart and an upgrade action are triggered in the same time? will it be better if we need a state reporting to have some new fields, inspired by what have been done for upgrade?
Trying to understand a little more the need here, what information could we want to report after a restart other than acknowledging the restart as been done? could there by some restarts failure?

@cmacknz
Copy link
Member

cmacknz commented Jan 16, 2025

The hardest parts of a restart action is that the ack should technically happen after the restart, which requires that state be tracked outside of the agent process, which is exactly what our upgrade state reporting is built to do. Also, "restart yourself" is already a step that exists in the upgrade state machine. Perhaps there is a separate restart action that is just an alias for some part of this existing process.

I don't want to over specify this before someone else has a chance to think more in depth about this, but it feels like we can reuse a lot of the existing upgrade machinery to make sure this works properly at an implementation level. We may not actually want to recycle the upgrade action to trigger this.

@nimarezainia
Copy link
Contributor

How about our audit logs? we would want to see in the audit logs or the activity panel that the agent was "explicitly" restarted. I believe for that we need a RESTART action as a separate action sent to the agent. As Craig mentions, the RESTART action should invoke the relevant portion of the upgrade process. It's battle tested and has all the necessary state machine states and reporting.

The persona issuing the restart may be different to the one that is doing the upgrade. Upgrade happens at scale but in reality restart may be singular - some one trying to troubleshoot for example. So separate sessions would be issuing these commands. We need to ensure there's a paper-trail and these actions are distinguished from one another.

Trying to understand a little more the need here, what information could we want to report after a restart other than acknowledging the restart as been done? could there by some restarts failure?

Almost always the agent is already in a failure state when the user issues a restart (as a desperate last effort move to restore), without having resolved the underlying issue causing the agent to be unhealthy. Chances are that after the restart agent may end up in the same unhealthy/failed state.

@nchaulet
Copy link
Member

I don't want to over specify this before someone else has a chance to think more in depth about this, but it feels like we can reuse a lot of the existing upgrade machinery to make sure this works properly at an implementation level. We may not actually want to recycle the upgrade action to trigger this.

👍 who could take a look more in depth about this? from what I understand the implementation could be something like:

  • Fleet UI send a RESTART action
  • this is handled by the agent with:
  • audit log about the restart
  • use the same code path as upgrade for restart
  • ack after restart

@cmacknz
Copy link
Member

cmacknz commented Jan 17, 2025

@jlind23 is the best person to assign one of the control plane engineers to help refine this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

No branches or pull requests

8 participants