Add ability to remotely restart an agent #144585
Comments
Pinging @elastic/fleet (Team:Fleet) |
Questions:
|
Closing in favour of https://github.com/elastic/ingest-dev/issues/1221 |
@nimarezainia Is this issue intentionally reopened? |
@juliaElastic this is the public issue. I had closed it by mistake in favor of the private one to reduce duplicates. We should close the public issue once the implementation is complete. Hope this makes sense. The private issue has the bulk of the prioritization and implementation discussions. |
I think we're still not yet aligned on whether or not we want to support this at all. If we do support it, I think it should be an advanced action not exposed in the UI and we should have telemetry to track usage as ideally this isn't needed often. |
We currently have 1150 agents deployed in our environment. Each time Logstash is restarted, all agents appear to work fine, but some are no longer able to send data. They are still shown as healthy, and I couldn't find anything wrong in Kibana, but they no longer send data. If I restart the elastic-agent, it works fine again. That is why I need this feature. |
@amolnater-qasource As part of the Logstash test cases you run, is this included? If not, worth adding it then. |
Thank you for the update @jlind23 We have added a testcase where the Logstash is restarted when connected to the elastic-agent under Fleet test suite at link: Please let us know if we are missing anything here. |
@amolnater-qasource can you please check this as soon as possible? I want to check if we have a really bad problem here. |
@jlind23 We have revalidated this scenario on latest 8.10.0 BC2 kibana cloud environment and found this issue not reproducible there. Observations:
For reconfirming we tried several times to reproduce this, however the data resumed for the agent as soon as logstash service gets up. Few other scenarios tried:
This issue isn't reproducible this way either. Screen Recording: Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-08-29.19-30-40.mp4 After Restart: Data.streams.-.Fleet.-.Elastic.-.Google.Chrome.2023-08-29.19-35-05.mp4 Build details: Please let us know if anything else is required from our end. Thanks! |
@ThomSwiss please let us know if we're running the tests in a different way, we're unable to reproduce. If this does reproduce for you, would be great to share your agent diagnostics files and we're happy to investigate further. |
@ThomSwiss also what version are you on? |
@amitkanfer, @nimarezainia We use the newest Agent version, 8.9.1. We had the same issue with older releases as well. We have not used every release, but I am sure this was also a problem with the 8.7.x releases. I will try to query the received data to find out which clients no longer send data. Then I can run diagnostics on them. I hope to have this answer in the next 1-2 days. |
I did a lot of tests over the last 2 days. I can now tell you: Elastic Agent works correctly after restarting Logstash. My test case:
I am sorry for my wrong post. But it is still unclear to me when this happened in the past. I remember at least two occurrences in the last 2 years where we had to restart all agents to get them to send data correctly again. I guess it sometimes also helped when we just changed the Fleet policy, for example by adding or disabling PowerShell logs in the Windows integration. I now have this script and also my logs. I will check carefully whether this appears again and will come back if I have details, also with diagnostics. Thanks for your work! Elastic is a great product. |
@pierrehilbert @blakerouse Does Elastic Agent have a restart command that can be sent down from Fleet, just like upgrade or any other action? |
From what I know, we don't have an action handler to restart the Agent. |
Correct. The Elastic Agent doesn't support that action. |
Today I had a problem with an Elastic Agent Custom log integration: I made an error in the processor field (the Kibana Fleet GUI didn't show me an error; I had two \ signs in a replace pattern). I saved it successfully. Later the Agent changed to not healthy. I corrected the error, but the client did not change back to healthy. The message did not disappear. I waited at least 15 minutes. Then I restarted the Elastic Agent (I had to log in to the device). After the restart, all was fine. If you are interested, I did an analysis before I restarted. This is a typical use case for a restart. |
@nimarezainia updated the issue description following the chat we had. |
This is my suggestion: |
Thanks for this information. Since we are providing this capability via an API only, wouldn't the logic you describe be better accommodated by the user's code that invokes this API? |
This would also help if Metricbeat or other Beats contain memory leak bugs in the future. |
We do have a plus 30K agent infrastructure, with agents running also in remote locations. Utilising such an API would be of great advantage. Do you have any update on when that will become available? |
We have plus 12K agents and sometimes have Agents that don't do anything. After the last Windows patch, we again had to restart some agents because they didn't run correctly; they just didn't send data. After the restart, all was fine. |
@ThomSwiss this shouldn't be happening and I consider it a bug. Could you open a support case with us, if possible, so the issue can be diagnosed? Not denying that this feature would be useful, just want to ensure the primary problem is addressed. It would be great to obtain the diagnostics file or any error the agents produce that could give us a clue. |
I would also like it to be implemented in Kibana. Of course, we often use the API. I think it should work like when we update our 15000 Agents: I can just specify that it should update them, for example, within the next 24 hours. |
We have a feature that allows scheduling actions. It is currently used for upgrades, and we could probably provide something similar for restart (note that it requires an enterprise license). @nimarezainia do you think this would be a good way to solve it, by providing a scheduled restart action? @juliaElastic I know you worked a lot on upgrades. Do you see any problem with the current scheduling feature that could cause issues here, or have any thoughts on whether it's a bad idea to use that feature for restart? |
@nchaulet I think it's good to use the same scheduling feature to send a restart action, most of the logic is in fleet-server that delivers actions at the scheduled time, spread out over the rollout period. |
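To illustrate what "spread out over the rollout period" can mean, here is a minimal sketch that assigns each agent an evenly spaced delivery time inside a rollout window. It is not fleet-server's actual algorithm; the function name `spread_delivery_times` and its parameters are hypothetical and chosen only for this example.

```python
from datetime import datetime, timedelta


def spread_delivery_times(agent_ids, start, rollout_seconds):
    """Return {agent_id: delivery_time}, evenly spaced across the window.

    Hypothetical helper for illustration only; fleet-server may
    distribute actions differently.
    """
    if not agent_ids:
        return {}
    # Each agent gets its own slot; the first slot begins at `start`.
    step = rollout_seconds / len(agent_ids)
    return {
        agent_id: start + timedelta(seconds=index * step)
        for index, agent_id in enumerate(agent_ids)
    }


if __name__ == "__main__":
    window_start = datetime(2024, 1, 1, 22, 0, 0)
    plan = spread_delivery_times(["agent-1", "agent-2", "agent-3"], window_start, 3600)
    for agent_id, when in plan.items():
        print(agent_id, when.isoformat())
```

With a one-hour window and three agents, the restarts would be delivered 20 minutes apart instead of all at once.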
@nchaulet the scheduling/rollout upgrade you mention was required mainly due to the fact that during the upgrade there's a download phase of the binary, which at scale could exhaust the network bandwidth available. So we allow the user to schedule outside of working hours and ensure that download is deterministic across a period. The restart is a bit different. For starters it doesn't need a download. I also suspect that majority of the times it would be applied to a small number of agents in a troubleshooting scenario - to fix an unhealthy agent. I can justify a user needing to "fix" their agent in the future. I would suggest we avoid this complication for now and solicit feedback to see if it becomes a requirement. At this moment I don't see it as a requirement. |
@nimarezainia so your suggestion is to keep it as simple as possible, correct? So just a restart action and a bulk restart action in the UI? |
There are some cases where a simple restart of an Agent may resolve common problems. Currently there's no way to do this remotely.
In order to allow this action we should offer a new API endpoint that will be shipped under an experimental status for now.
This endpoint should accept one or multiple Agent IDs in order to operate a bulk restart if needed.
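As a rough sketch of how a client might call such an endpoint, the helper below builds a request description for one or several agent IDs. The issue does not specify the endpoint, so the path `/api/fleet/agents/bulk_restart`, the `agents` body field, and the function name `build_restart_request` are all assumptions made for illustration. The `kbn-xsrf` header is the one Kibana HTTP APIs generally expect on write requests.

```python
import json

# Hypothetical path: the issue does not name the endpoint.
RESTART_PATH = "/api/fleet/agents/bulk_restart"


def build_restart_request(agent_ids):
    """Build a request description for restarting one or more agents.

    Accepts a single agent ID string or an iterable of IDs.
    The body shape ({"agents": [...]}) is an assumption.
    """
    if isinstance(agent_ids, str):
        agent_ids = [agent_ids]
    # Drop duplicate IDs while keeping the caller's order.
    unique_ids = list(dict.fromkeys(agent_ids))
    if not unique_ids:
        raise ValueError("at least one agent ID is required")
    return {
        "method": "POST",
        "path": RESTART_PATH,
        "headers": {"kbn-xsrf": "true", "Content-Type": "application/json"},
        "body": json.dumps({"agents": unique_ids}),
    }


if __name__ == "__main__":
    request = build_restart_request(["agent-1", "agent-2", "agent-1"])
    print(request["method"], request["path"])
    print(request["body"])
```

Accepting either a single ID or a list keeps the single-agent and bulk cases on one code path, which matches the idea of one endpoint serving both.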
Depends on
This is a two-step issue: