-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Elastic Agent should support a restart action #3367
Comments
@cmacknz @pierrehilbert I believe this is the first piece to deliver before working on elastic/kibana#144585 |
Yes this needs to come first. |
From my point of view, we "just" need to add a new handler in the same way we did it for remote diag or upgrades. |
The hardest part of adding the action is how to handle the ACK. Do we ACK before we restart or after we restart. If we do it after then we need to handle that, if we do it before we need to ensure the ACK is received by Fleet Server before we perform the restart. |
We should acknowledge only after the restart has completed. This is basically the same problem as acknowledging an upgrade, and the easiest solution to implement without edge cases will be to report the state in as part of the periodic check in rather than trying to guarantee delivery of a single acknowledgment across restarts. There are probably a few ways to accomplish this, off the top of my head including any kind of monotonic number in the check in that resets on a restart is one way for Fleet to detect this without an explicit acknowledgement. A counter for the number of restarts performed, the accumulated time since the last restart / since the agent process started, etc. We could also persist the acknowledgement to disk and just retry it infinitely until it gets through, but this will have a lot more edge cases to test. |
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
Hi I am working on the tech definition for the kibana part elastic/kibana#144585 and I am wondering what we will have to send to the agent in that case? does a new action type something like |
Not defined yet, a restart action is one way, but we could also define it as a special case of the upgrade action so we can reuse the state reporting and acknowledgement mechanisms from that. |
Could it be confusing, and misleading to use the upgrade state reporting for that? what if a restart and an upgrade action are triggered in the same time? will it be better if we need a state reporting to have some new fields, inspired by what have been done for upgrade? |
The hardest parts of a restart action is that the ack should technically happen after the restart, which requires that state be tracked outside of the agent process, which is exactly what our upgrade state reporting is built to do. Also, "restart yourself" is already a step that exists in the upgrade state machine. Perhaps there is a separate restart action that is just an alias for some part of this existing process. I don't want to over specify this before someone else has a chance to think more in depth about this, but it feels like we can reuse a lot of the existing upgrade machinery to make sure this works properly at an implementation level. We may not actually want to recycle the upgrade action to trigger this. |
How about our audit logs? we would want to see in the audit logs or the activity panel that the agent was "explicitly" restarted. I believe for that we need a RESTART action as a separate action sent to the agent. As Craig mentions, the RESTART action should invoke the relevant portion of the upgrade process. It's battle tested and has all the necessary state machine states and reporting. The persona issuing the restart may be different to the one that is doing the upgrade. Upgrade happens at scale but in reality restart may be singular - some one trying to troubleshoot for example. So separate sessions would be issuing these commands. We need to ensure there's a paper-trail and these actions are distinguished from one another.
Almost always the agent is already in a failure state when the user issues a restart (as a desperate last effort move to restore), without having resolved the underlying issue causing the agent to be unhealthy. Chances are that after the restart agent may end up in the same unhealthy/failed state. |
👍 who could take a look more in depth about this? from what I understand the implementation could be something like:
|
@jlind23 is the best person to assign one of the control plane engineers to help refine this. |
In order to give our users the ability to remotely restart an Agent, the Elastic Agent should support a restart action.
We have two path forward:
This should have the exact same behaviour as the
elastic-agent restart
command.The text was updated successfully, but these errors were encountered: