Add ability to remotely restart an agent #144585

Open
5 tasks
jlind23 opened this issue Nov 4, 2022 · 32 comments
Assignees
Labels
QA:Needs Validation Issue needs to be validated by QA Team:Elastic-Agent-Control-Plane Team:Fleet Team label for Observability Data Collection Fleet team

Comments

@jlind23
Contributor

jlind23 commented Nov 4, 2022

There are some cases where a simple restart of an Agent may resolve common problems. Currently there's no way to do this remotely.
In order to allow this action we should offer a new API endpoint that will be shipped under an experimental status for now.
This endpoint should accept one or multiple Agent IDs so that a bulk restart can be performed if needed.

Depends on

This is a two-step issue:

  • Allow this for a single Elastic Agent
  • Allow this for multiple Elastic Agents
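
To make the proposal concrete, here is a minimal sketch of how a client might build a request for such an endpoint. The endpoint does not exist at the time of this issue: the path `/api/fleet/agents/bulk_restart`, the `agents` body field, and the helper `build_restart_request` are hypothetical names chosen purely for illustration, not part of any shipped Fleet API.

```python
import json
from typing import Iterable

# Hypothetical path: no restart endpoint exists in Fleet at the time of this issue.
HYPOTHETICAL_RESTART_PATH = "/api/fleet/agents/bulk_restart"


def build_restart_request(agent_ids: Iterable[str]) -> dict:
    """Describe a restart request for an iterable of one or more agent IDs.

    Duplicates are dropped while preserving order, and an empty
    selection is rejected before anything would be sent.
    """
    unique_ids = list(dict.fromkeys(agent_ids))  # ordered de-duplication
    if not unique_ids:
        raise ValueError("at least one agent ID is required")
    return {
        "method": "POST",
        "path": HYPOTHETICAL_RESTART_PATH,
        # Hypothetical body shape: a single "agents" list covers both
        # the single-agent and the bulk case.
        "body": json.dumps({"agents": unique_ids}),
    }
```

Using one list-based body for both cases would let the single-agent step and the bulk step share the same request shape.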
@jlind23 jlind23 added the Team:Fleet Team label for Observability Data Collection Fleet team label Nov 4, 2022
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@joshdover
Contributor

Questions:

  • What will be the agent state after the user starts the restart? When does the state change back to healthy? What if it never successfully restarts?
  • We need to consider the potential impact on the user's Fleet or Elasticsearch cluster. It's possible that restarting all agents at once leads to a high volume of backlogged data being ingested. If ES performance is degraded, operating Fleet may not be possible.
    • Ideally this is something the whole system can handle through back-pressure, but such a test has not been done with Fleet.
    • Should we allow or require that users schedule bulk restarts with a maintenance window to avoid this, at least for more than X agents? Or warn them about the potential for high data volumes/instability?
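
One way to frame the first question is as a timeout on the agent's next healthy check-in. The sketch below is only an illustration under assumed names: the `"healthy"` status string, the `restart_outcome` helper, and the `(timestamp, status)` check-in tuples are hypothetical, since the issue does not define what a restarting agent would report.

```python
from typing import Sequence, Tuple

# Hypothetical status name; the issue does not define restart-related states.
HEALTHY = "healthy"


def restart_outcome(
    checkins: Sequence[Tuple[float, str]],
    restart_sent_at: float,
    timeout_s: float,
) -> str:
    """Classify a restart from (timestamp, status) check-ins.

    Returns "recovered" if a healthy check-in arrives after the restart
    was sent and no later than timeout_s seconds afterwards; otherwise
    returns "timed_out", which covers an agent that never comes back.
    """
    deadline = restart_sent_at + timeout_s
    for ts, status in checkins:
        if restart_sent_at < ts <= deadline and status == HEALTHY:
            return "recovered"
    return "timed_out"
```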

@nimarezainia
Contributor

Closing in favour of https://github.com/elastic/ingest-dev/issues/1221

@juliaElastic
Contributor

@nimarezainia Is this issue intentionally reopened?

@nimarezainia
Contributor

nimarezainia commented Jun 5, 2023

@nimarezainia Is this issue intentionally reopened?

@juliaElastic this is the public issue. I had mistakenly closed it in favor of the private one to reduce duplicates. We should close the public issue once the implementation is complete. Hope this makes sense. The private issue has the bulk of the prioritization and implementation discussions.

@joshdover
Contributor

joshdover commented Aug 15, 2023

I think we're still not yet aligned on whether or not we want to support this at all. If we do support it, I think it should be an advanced action not exposed in the UI and we should have telemetry to track usage as ideally this isn't needed often.

@ThomSwiss

We currently have 1150 agents deployed in our environment.
Most of them send their data to one of two Logstash instances.

Each time Logstash is restarted, all agents appear to work fine, but some are no longer able to send data. They are still shown as healthy, and I couldn't find anything wrong in Kibana, yet they no longer send data. If I restart the elastic-agent, it works fine again. That is why we need this feature.

@jlind23
Contributor Author

jlind23 commented Aug 29, 2023

@amolnater-qasource As part of the Logstash test cases you run, is this included? If not, worth adding it then.

@amolnater-qasource

Thank you for the update @jlind23

We have added a testcase where the Logstash is restarted when connected to the elastic-agent under Fleet test suite at link:

Please let us know if we are missing anything here.
Thanks!

@jlind23
Contributor Author

jlind23 commented Aug 29, 2023

@amolnater-qasource can you please check this as soon as possible? I want to check if we have a really bad problem here.

@amolnater-qasource

@jlind23 We have revalidated this scenario on latest 8.10.0 BC2 kibana cloud environment and found this issue not reproducible there.

Observations:

  • On restarting logstash output, new data is generated for the connected agent after 10-15 seconds as soon as Logstash is up.

To reconfirm, we tried several times to reproduce this; however, data resumed for the agent as soon as the Logstash service came back up.

A few other scenarios we tried:

  • Restarted Elastic-Agent from services and then restarted Logstash.
  • Stopped the agent until it went offline, brought the host back up, and then restarted Logstash.
  • Set agent logs to debug level and then restarted Logstash.

The issue isn't reproducible this way either.

Screen Recording:
Before Restart:

Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-08-29.19-30-40.mp4

After Restart:

Data.streams.-.Fleet.-.Elastic.-.Google.Chrome.2023-08-29.19-35-05.mp4

Build details:
VERSION: 8.10.0
BUILD: 66107 BC2
COMMIT: fa3473f

Please let us know if anything else is required from our end.

Thanks!

@amitkanfer

@ThomSwiss please let us know if we're running the tests in a different way, we're unable to reproduce. If this does reproduce for you, would be great to share your agent diagnostics files and we're happy to investigate further.

@nimarezainia
Contributor

@ThomSwiss also what version are you on?

@ThomSwiss

@amitkanfer, @nimarezainia
Thanks for your help!

We use the newest Agent version, 8.9.1. We had the same issue with older releases as well. We have not used every release, but I am sure this was also a problem with the 8.7.x releases.

I will try to query the received data to find out which clients no longer send data. Then I can run diagnostics on them. I hope to have an answer in the next 1-2 days.

@ThomSwiss

ThomSwiss commented Aug 31, 2023

I ran a lot of tests over the last 2 days. I can now tell you: Elastic Agent works correctly after a Logstash restart.

My test case:

  • With a PowerShell script, run many times:
    • Get all 984 Elastic Agents with status healthy, all Windows
    • Count the number of records received per agent during the last 30 minutes on the winlogbeat data view (includes logs-system.application, logs-system.security, logs-windows.powershell, logs-windows.powershell_operational and a few more)
    • List all agents that sent fewer than 30 records during the last 30 minutes
  • Compare these lists across many runs
  • Restart the two Logstash instances that receive Elastic Agent input on port 5044
  • Result: We still received data after restarting Logstash
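
The comparison step of this test case can be sketched as follows. This is not the original PowerShell script, only an illustration in Python: the helper names `quiet_agents` and `persistently_quiet` are hypothetical, and the per-agent record counts are assumed to have been fetched already, with agents that sent nothing present with a count of 0.

```python
from typing import Dict, List, Set


def quiet_agents(record_counts: Dict[str, int], min_records: int = 30) -> Set[str]:
    """Return agents that sent fewer than min_records in the query window."""
    return {agent for agent, count in record_counts.items() if count < min_records}


def persistently_quiet(runs: List[Dict[str, int]], min_records: int = 30) -> Set[str]:
    """Return agents that were below the threshold in every run.

    An agent that is quiet in only some runs may simply have little to
    log; one that is quiet in all runs is a candidate for diagnostics.
    """
    if not runs:
        return set()
    quiet_sets = [quiet_agents(run, min_records) for run in runs]
    return set.intersection(*quiet_sets)
```

For example, comparing a run taken before the Logstash restart with one taken after it would surface only the agents that stayed quiet across both.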

I am sorry for my incorrect post. But I am still unsure when this happened in the past. I remember at least two occurrences in the last 2 years where we had to restart all agents to get them to send data correctly again. I believe it sometimes also helped to just change the Fleet policy, for example by adding or disabling PowerShell logs in the Windows integration. I now have this script and also my logs. I will check carefully whether this happens again and will come back with details, including diagnostics, if it does.

Thanks for your work! Elastic is a great product.

@jlind23
Contributor Author

jlind23 commented Sep 6, 2023

@pierrehilbert @blakerouse Does Elastic Agent have a restart command that can be sent down from Fleet, just like upgrade or any other action?

@pierrehilbert
Contributor

From what I know, we don't have an action handler to restart the Agent.
@blakerouse if you can keep me honest here

@blakerouse

Correct. The Elastic Agent doesn't support that action.

@ThomSwiss

Today I had a problem with an Elastic Agent Custom log integration: I made an error in the processor field (I had two \ signs in a replace pattern, and the Kibana Fleet GUI didn't show me an error). It saved successfully. Later the agent changed to unhealthy.

I corrected the error, but the agent did not change back to healthy and the message did not disappear. I waited at least 15 minutes. Then I restarted the Elastic Agent (I had to log in to the device). After the restart, all was fine. If you are interested, I collected diagnostics before I restarted. This is a typical use case for a restart.

@jlind23
Contributor Author

jlind23 commented Sep 7, 2023

@nimarezainia updated the issue description following the chat we had.
cc @kpollich for awareness

@allamiro

allamiro commented Sep 12, 2023

Questions:

  • What will be the agent state after the user starts the restart? When does the state change back to healthy? What if it never successfully restarts?

  • We need to consider the potential impact on the user's Fleet or Elasticsearch cluster. It's possible that restarting all agents at once leads to a high volume of backlogged data being ingested. If ES performance is degraded, operating Fleet may not be possible.

  • Ideally this is something the whole system can handle through back-pressure, but such a test has not been done with Fleet.

  • Should we allow or require that users schedule bulk restarts with a maintenance window to avoid this, at least for more than X agents? Or warn them about the potential for high data volumes/instability?

This is my suggestion:
I believe limiting simultaneous restarts to no more than 10 to 20 agents could avoid the need to schedule a maintenance window. If more than 20 agents need to be restarted, the system should prompt the admin to schedule a maintenance window outside of operational hours. When executing bulk restarts, the system shouldn't restart all agents simultaneously; instead, it should process them in batches of 20 to 30 at a time.
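
The batching idea can be sketched in a few lines. This is an illustration only: `restart_batch` stands in for whatever call would eventually trigger a restart (no such API exists yet), and the names `batches` and `restart_in_batches` as well as the `pause_s` parameter are hypothetical.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batches(agent_ids: Sequence[str], batch_size: int = 20) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size agent IDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(agent_ids), batch_size):
        yield list(agent_ids[start:start + batch_size])


def restart_in_batches(
    agent_ids: Sequence[str],
    restart_batch: Callable[[List[str]], None],  # hypothetical restart call
    batch_size: int = 20,
    pause_s: float = 0.0,
) -> int:
    """Send restarts batch by batch, pausing between batches.

    Returns the number of batches sent.
    """
    sent = 0
    for batch in batches(agent_ids, batch_size):
        if sent and pause_s > 0:
            time.sleep(pause_s)  # give the cluster time to absorb backlog
        restart_batch(batch)
        sent += 1
    return sent
```

The same loop works whether it runs in the user's own script or behind a UI action; only the caller of `restart_in_batches` changes.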

@juliaElastic juliaElastic added the QA:Needs Validation Issue needs to be validated by QA label Sep 13, 2023
@nimarezainia
Contributor

This is my suggestion :
I believe restricting the initiation of no more than 10 to 20 agents simultaneously could help bypass scheduling during a maintenance window. If there's a need to restart more than 20 agents, the system should prompt the admin to schedule a maintenance window outside of operational hours. When executing bulk restarts, the system shouldn't restart all agents simultaneously; instead, it should process them in batches of 20 to 30 at a time.

Thanks for this information. Since we are providing this capability via an API only, wouldn't the logic you describe be better accommodated by the user's code that invokes this API?

@zez3

zez3 commented Dec 4, 2023

This would also help if Metricbeat or other Beats contain memory leak bugs in the future.

@msecpim

msecpim commented Nov 10, 2024

We have an infrastructure of more than 30K agents, some of them running in remote locations. Such an API would be a great advantage. Do you have any update on when it will become available?

@ThomSwiss

We have more than 12K agents and sometimes have agents that don't do anything. After the last Windows patch, we again had to restart some agents because they were not running correctly: they simply stopped sending data. After the restart, all was fine.

@nimarezainia
Contributor

We have plus 12K agents and sometimes have Agents that doesn't do anything. After the last windows patch, we had again to restart some agents, because they didn't run correctly, they just don't send data. After restart, all was fine.

@ThomSwiss this shouldn't be happening and I consider it a bug. Could you open a support case with us, if possible, so the issue can be diagnosed? I'm not denying that this feature would be useful; I just want to ensure the primary problem is addressed. It would be great to obtain the diagnostics file or any errors the agents produce, which could give us a clue.

@allamiro

allamiro commented Dec 26, 2024

This is my suggestion :
I believe restricting the initiation of no more than 10 to 20 agents simultaneously could help bypass scheduling during a maintenance window. If there's a need to restart more than 20 agents, the system should prompt the admin to schedule a maintenance window outside of operational hours. When executing bulk restarts, the system shouldn't restart all agents simultaneously; instead, it should process them in batches of 20 to 30 at a time.

thanks for this information. Since we are providing this capability via an API only, wouldn't the logic you describe better be accomodated by the user's code that invokes this API?

While I understand that the logic could technically be implemented in the user's code when invoking the API, I think the goal is to streamline the user experience and provide a safeguard against potential misuse directly within the GUI. By integrating this functionality at the GUI level, we can ensure consistent enforcement of these rules, even for users who may lack the expertise or resources to handle such logic programmatically. This approach aligns with providing a more robust and user-friendly solution.
Would you agree this might better serve a broader range of use cases?
For instance, ArcSight Management Center (ArcMC) and many other SIEM solutions offer this capability to streamline the management of agents and connectors. It's unclear why we would delegate such a critical feature to user code rather than have it managed by Fleet and made available through the GUI in Kibana.
Image

@ThomSwiss

I would also like this to be implemented in Kibana. Of course, we often use the API. I think it should work like when we update our 15000 agents: I can just tell it to update them over, for example, the next 24 hours.

@nchaulet
Member

We have a feature that allows scheduling actions. It is currently used for upgrades, and we could probably provide something similar for restart (note that it requires an Enterprise license).

Image

@nimarezainia do you think this would be a good way to solve this: providing a scheduled restart action?

@juliaElastic I know you worked a lot on upgrades. Do you see any issue with the current scheduling feature that could cause problems here, or any reason it would be a bad idea to use that feature for restart?

@juliaElastic
Contributor

@nchaulet I think it's good to use the same scheduling feature to send a restart action, most of the logic is in fleet-server that delivers actions at the scheduled time, spread out over the rollout period.
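
As a rough illustration of what "spread out over the rollout period" could mean, the sketch below assigns evenly spaced delivery times to a list of agents. It is not fleet-server's actual algorithm, which this thread does not describe; the function name `spread_schedule` and the even-spacing rule are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Sequence


def spread_schedule(
    agent_ids: Sequence[str],
    start: datetime,
    rollout: timedelta,
) -> Dict[str, datetime]:
    """Assign each agent an evenly spaced delivery time.

    The first agent is scheduled at start and every later agent one
    step after the previous one, so all times fall before start + rollout.
    """
    if not agent_ids:
        return {}
    step = rollout / len(agent_ids)
    return {agent: start + i * step for i, agent in enumerate(agent_ids)}


# Example: four agents over a 24-hour rollout are scheduled 6 hours apart.
example_start = datetime(2025, 1, 1, tzinfo=timezone.utc)
example_plan = spread_schedule(["a", "b", "c", "d"], example_start, timedelta(hours=24))
```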

@nimarezainia
Contributor

@nchaulet the scheduling/rollout upgrade you mention was required mainly due to the fact that during the upgrade there's a download phase of the binary, which at scale could exhaust the network bandwidth available. So we allow the user to schedule outside of working hours and ensure that download is deterministic across a period.

The restart is a bit different. For starters, it doesn't need a download. I also suspect that the majority of the time it would be applied to a small number of agents in a troubleshooting scenario, to fix an unhealthy agent. I can justify a user needing to "fix" their agent in the future.

I would suggest we avoid this complication for now and solicit feedback to see if it becomes a requirement. At this moment I don't see it as a requirement.

@nchaulet
Member

@nimarezainia so your suggestion would be to keep it as simple as possible, correct? So just a restart action and a bulk restart action in the UI?
