[Fleet]Restart elastic agents from fleet server remotely #2523
Comments
Hi @allamiro! Thanks for taking the time to suggest this. We've discussed this recently internally as well. I'm curious, in what scenarios are you needing to restart Agent? I suspect we may have a bug if this is something you need to do on a regular basis. |
@joshdover I believe that the Fleet Server, as the managing platform for agents, lacks the ability to start connectors remotely. This ability could be useful for various reasons, not just limited to a specific scenario. Some of these reasons:
This ability is available on different SIEM Agent management applications and I think it would be great and will add great value if Elastic has it as well. |
This is a duplicate of #2311 and #144585. However, most of the questions that were asked but not answered on those issues are answered in this one. It seems a lot of people need this feature. My recommendation is to:
What will be the agent state after the user starts the restart? When does the state change back to healthy? What if it never successfully restarts? My recommendation: when a user initiates a restart, the agent's state transitions to "restarting". At this point, the agent stops processing new data and attempts to shut down gracefully. If the shutdown is successful, the agent's state changes to "stopped". If not, the agent's state may remain "restarting" or transition to an "error" state. Once the agent has successfully restarted and is able to process new data, its state transitions back to "healthy". If an agent is unable to restart successfully, its state may remain "restarting" or transition to an "error" or "failed restart" state after a specific period of time. |
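The transitions proposed above could be sketched as a small state table. This is only an illustration of the recommendation, not how Fleet or Elastic Agent models state today. The event names (`restart_requested`, `shutdown_ok`, `shutdown_failed`, `started_ok`, `timeout`) are assumptions made up for this sketch, and it picks "error" wherever the text says the state may remain "restarting" or become "error". Allowing a restart from "error" is also an assumption.

```python
from enum import Enum


class AgentState(Enum):
    HEALTHY = "healthy"
    RESTARTING = "restarting"
    STOPPED = "stopped"
    ERROR = "error"


# Hypothetical transition table for the proposal above.
# Keys are (current state, event); values are the resulting state.
TRANSITIONS = {
    (AgentState.HEALTHY, "restart_requested"): AgentState.RESTARTING,
    (AgentState.ERROR, "restart_requested"): AgentState.RESTARTING,
    (AgentState.RESTARTING, "shutdown_ok"): AgentState.STOPPED,
    (AgentState.RESTARTING, "shutdown_failed"): AgentState.ERROR,
    (AgentState.RESTARTING, "timeout"): AgentState.ERROR,
    (AgentState.STOPPED, "started_ok"): AgentState.HEALTHY,
    (AgentState.STOPPED, "timeout"): AgentState.ERROR,
}


def next_state(state: AgentState, event: str) -> AgentState:
    """Return the state that follows `event`, or raise if the event is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not valid in state {state.value!r}"
        ) from None
```

A successful restart would then walk "healthy" to "restarting" to "stopped" and back to "healthy", while a timeout in either intermediate state ends in "error".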
@allamiro This seems to be the focus of your use case and is definitely already supported. You should not need to restart Agent in order for a new integration to be started. For example, if I add the "System" integration to an agent policy in the Fleet UI, within seconds, all the agents enrolled in that policy should start the necessary processes to collect that data and start shipping it to Elasticsearch. No restart required. If this is not happening for you, I suspect you may be running into a bug. Could you open a new elastic-agent issue with steps to reproduce this (especially which integration you have seen this happen on) and link it here so we can investigate it?
I think this is a valid use case in some cases, but it definitely shouldn't be the norm. An alternative to a full agent restart could be to offer an on/off toggle on the integration's row inside the Agent Policy UI.
Thanks for taking the time to weigh in on this. I agree we need this level of visibility into the restart process if we decide to implement it. In general, we need ways to surface more granular state information like this. |
@joshdover One of the issues we have involves possible transient errors in the Elastic Agent logs. It is also discussed at https://discuss.elastic.co/t/fail-to-checkin-to-fleet-server/318932/7, and restarting the service seems to fix the situation. I totally agree it doesn't happen often, but it's important to have the option just in case. A question: is there a built-in recovery check in the agent that restarts the daemon when it sees an error, perhaps attempting a restart up to 3 times when it detects a problem? I see a built-in check for connecting to the Fleet Server, but I'm not sure whether there are also recovery checks for errors and those types of issues. |
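To make the question concrete, the kind of recovery check being asked about could look roughly like the sketch below. It is not taken from the Elastic Agent code base. The `restart` callable, `max_attempts`, and `delay_seconds` are hypothetical names, and `restart` stands in for whatever actually restarts the daemon and reports success.

```python
import time
from typing import Callable


def restart_with_retries(
    restart: Callable[[], bool],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Try `restart` up to `max_attempts` times.

    Returns the 1-based attempt number that succeeded, or 0 if every
    attempt failed. `restart` is a hypothetical hook that returns True
    when the daemon came back up.
    """
    for attempt in range(1, max_attempts + 1):
        if restart():
            return attempt
        # Wait between attempts, but not after the final failure.
        if attempt < max_attempts:
            sleep(delay_seconds)
    return 0
```

A caller could treat a return value of 0 as the point where the agent should report an error state instead of retrying forever.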
There are tons of scenarios where this is needed, for example zombie processes or update errors in integrations. Here is an example: [elastic_agent][error] Unit state changed log-e2ec1ee0-8152-11ed-95ab-01bcb4735c50 (CONFIGURING->FAILED): [failed to reloading inputs: 1 error: Error creating runner from config: Can only start an input when all related states are finished: {Id: native::173-64771, Finished: false, Fileinfo: &{messages 404568 384 {112726762 63822407977 0x55d7cf19cb60} {64771 173 1 33152 0 0 0 0 404568 4096 792 {1686810787 776220378} {1686811177 112726762} {1686811177 112726762} [0 0 0]}}, Source: /var/log/messages, Offset: 439661, Timestamp: 2023-06-15 02:44:43.199529844 -0400 -04 m=+32073.104975279, TTL: -1ns, Type: log, Meta: map[], FileStateOS: 173-64771}]. This error occurred while updating the System integration in a large fleet. I could see two ways to solve it: remove the integration from the policies and re-add it (which is no joke, because there are many policies), or restart the agent (the less viable option, given the number of machines). Both ways solved the problem. I opted for the first and it was resolved. If I had had a restart button for the agent, it would have been much easier. When everything works well, it's true that the button seems almost unnecessary, but when something fails (which is more common than we would like), it is really missed. |
I second this. I just had to log in to a system and restart the agent because of an integration issue. |
Same reasons for me; this would be a nice feature. |
I also think being able to restart an agent from Fleet would be extremely useful. I have had a number of support cases which required me to restart the agent. I had to open a ticket with IT support so they could log in to the endpoint and restart the agent for me. That takes days. There are also a number of issues with the agent that a simple restart would fix. Sometimes agents appear unhealthy in Fleet, and changing the logging level from info to debug and back will kick-start the agent enough for it to go healthy. Reading through this thread, I see a number of people who use the product every day asking for a feature that seems reasonable to me. Then I read a number of comments stating why the feature shouldn't be needed, from people who are probably not operating the product in production at scale within an enterprise. Thanks for your support, and I hope you consider adding this feature for the benefit of your customers. |
According to https://discuss.elastic.co/t/restart-elastic-agent-from-fleet-centralized-management/318646, the restart feature should be on the roadmap. An Elastic team member confirmed this. Sometimes exactly what tjputzGSA mentioned happens. Agents appear unhealthy, for example:
Only a restart helps. Creating a support ticket and waiting until support restarts the agent is not ideal. |
@mrhackcz Their initial plan was to incorporate the agent restart, but I think they may have veered off that course. |
I've addressed and provided recommendations for the questions posed by the Elastic team, which were previously unaddressed in all the tickets related to this issue. If needed please reference them on that ticket. |
I think the work on fleet-server to support a new action type will be minimal. It may just be a matter of defining a new action type/data struct in the OpenAPI doc (after #3060 has been merged). |
Any updates on this? |
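For illustration only, a new action record of the kind described above might carry something like the fields below. Everything here is hypothetical: the `RESTART` type string, the field names, and the timeout are assumptions for this sketch and are not taken from the fleet-server OpenAPI document.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class RestartAction:
    """Hypothetical restart action; not an existing fleet-server type."""

    action_id: str
    agents: List[str]
    type: str = "RESTART"  # assumed type name, for illustration only
    timeout_seconds: int = 300  # assumed: how long to wait before reporting failure


def to_json(action: RestartAction) -> str:
    """Serialize the action to a JSON document with stable key order."""
    return json.dumps(asdict(action), sort_keys=True)
```

The timeout field ties back to the earlier point about an agent that never comes back: whoever tracks the action needs some bound after which it is marked as failed.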
Describe the enhancement:
Ability to restart Elastic agents on one or multiple systems from the fleet section through the fleet server
Describe a specific use case for the enhancement or feature:
Restarting Elastic Agents from Fleet Server can save time and effort compared to manually restarting each individual agent. This is especially useful when managing a large number of agents across multiple hosts. It allows you to push updated configuration to the agents without having to update each one manually. This can save time and reduce the risk of errors.