[Fleet]Restart elastic agents from fleet server remotely #2523

Open
Tracked by #144585
allamiro opened this issue Apr 21, 2023 · 15 comments
Labels
Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team Team:Fleet Label for the Fleet team

Comments

@allamiro

allamiro commented Apr 21, 2023

Describe the enhancement:
Ability to restart Elastic agents on one or multiple systems from the fleet section through the fleet server
Describe a specific use case for the enhancement or feature:

Restarting Elastic Agents from Fleet Server can save time and effort compared to manually restarting each individual agent. This is especially useful when managing a large number of agents across multiple hosts. It allows you to push updated configuration to the agent without having to manually update each individual agent. This can save time and reduce the risk of errors.

@michel-laterman michel-laterman added Team:Fleet Label for the Fleet team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Apr 21, 2023
@allamiro allamiro changed the title Restart elastic agents from fleet server remotely [Fleet]Restart elastic agents from fleet server remotely Apr 22, 2023
@joshdover
Contributor

Hi @allamiro! Thanks for taking the time to suggest this. We've discussed this recently internally as well.

I'm curious, in what scenarios are you needing to restart Agent? I suspect we may have a bug if this is something you need to do on a regular basis.

@allamiro
Author

allamiro commented Apr 27, 2023

@joshdover I believe that the fleet server, as the managing platform for agents, lacks the ability to start connectors remotely. This ability could be useful for various reasons, not just limited to a specific scenario. Some of these reasons are:

  • Automated processes: In many automated processes, it may be necessary to start a connector remotely to trigger a workflow or process. For example, if you have an automated data ingestion process that uses a connector to collect data from a specific source, you may need to start the connector remotely to initiate the data collection process.

  • Troubleshooting: If a connector is experiencing issues, it may be necessary to start it remotely to diagnose and resolve the problem. For example, if a connector is not running correctly, you may need to restart it remotely to see if that resolves the issue.

  • Remote work or permission: With more people working remotely, the ability to start connectors remotely can be useful for accessing data or applications that are not available locally. For example, you may need to start a connector remotely to establish a connection when accessing data from a remote location, or when you are not part of the team with access to the server.

  • Scalability: When you have a large number of connectors running, it may be easier to start them remotely rather than manually starting each one individually or attempting to log on to the devices. This can save time and reduce the risk of errors.

This ability is available in other SIEM agent management applications, and I think it would add great value if Elastic had it as well.

@allamiro
Author

allamiro commented Apr 27, 2023

This is a duplicate of #2311 and #144585; however, most of the questions that were asked but not answered on those are answered in this issue.

It seems a lot of people need this feature. My recommendation is to:

  1. Limit a single restart action to a small number of agents (for example 1 to 5, and no more than 20 at one time) until the open questions about Elastic Agent and system stability, back-pressure handling, and the potential impact on Fleet or the Elasticsearch cluster are all addressed.
  2. If a task is scheduled to restart 10 agents at a time, the system should not allow another restart until the previous task has completed and all agents have reported their latest status, whether healthy or failed to start.
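
A minimal sketch of the two rules above, assuming a hypothetical in-memory scheduler. The names `RestartScheduler`, `RestartBatch`, and `MAX_BATCH` are illustrative only and are not part of Fleet or Elastic Agent:

```python
from dataclasses import dataclass, field

MAX_BATCH = 20  # hypothetical cap, taken from the "no more than 20" suggestion
TERMINAL_STATES = {"healthy", "failed"}  # states that count as "reported back"


@dataclass
class RestartBatch:
    agent_ids: list
    status: dict = field(default_factory=dict)

    def report(self, agent_id: str, state: str) -> None:
        """Record the latest state reported by one agent in this batch."""
        if agent_id not in self.agent_ids:
            raise KeyError(f"{agent_id} is not part of this batch")
        self.status[agent_id] = state

    @property
    def done(self) -> bool:
        """True once every agent has reported a terminal state."""
        return all(self.status.get(a) in TERMINAL_STATES for a in self.agent_ids)


class RestartScheduler:
    def __init__(self, max_batch: int = MAX_BATCH):
        self.max_batch = max_batch
        self.current = None

    def submit(self, agent_ids: list) -> RestartBatch:
        # Rule 1: cap the number of agents restarted at one time.
        if not agent_ids or len(agent_ids) > self.max_batch:
            raise ValueError(f"batch size must be between 1 and {self.max_batch}")
        # Rule 2: refuse a new batch while the previous one is unfinished.
        if self.current is not None and not self.current.done:
            raise RuntimeError("previous restart batch has not completed")
        self.current = RestartBatch(list(agent_ids))
        return self.current
```

A caller would `submit` a batch, feed agent check-ins into `report`, and only be able to submit the next batch once `done` is true.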

What will be the agent state after the user starts the restart? When does the state change back to healthy? What if it never successfully restarts?

My recommendation is that when a user initiates a restart, the agent's state transitions to a "restarting" state. At this point, the agent stops processing new data and attempts to gracefully shut down. If the shutdown is successful, the agent's state will change to "stopped". If not, the agent's state may remain in "restarting" or transition to an "error" state. Once the agent has successfully restarted and is able to process new data, its state will transition back to "healthy". If an agent is unable to successfully restart, its state may remain in "restarting" or transition to an "error" or "failed restart" state after a specific period of time.
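
The proposed lifecycle can be sketched as a small state machine. This is only an illustration of the recommendation above: the state names come from the proposal, while `ALLOWED`, `transition`, `apply_timeout`, the 300-second default, and the "error" to "restarting" retry edge are assumptions rather than existing agent behaviour:

```python
from enum import Enum


class AgentState(Enum):
    HEALTHY = "healthy"
    RESTARTING = "restarting"
    STOPPED = "stopped"
    ERROR = "error"


# Allowed transitions, following the proposal. ERROR -> RESTARTING is an added
# assumption so that a failed agent can be restarted again.
ALLOWED = {
    AgentState.HEALTHY: {AgentState.RESTARTING},
    AgentState.RESTARTING: {AgentState.STOPPED, AgentState.ERROR},
    AgentState.STOPPED: {AgentState.HEALTHY, AgentState.ERROR},
    AgentState.ERROR: {AgentState.RESTARTING},
}


def transition(current: AgentState, target: AgentState) -> AgentState:
    """Return the new state, or raise if the proposal does not allow the move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def apply_timeout(state: AgentState, started_at: float, now: float,
                  timeout_seconds: float = 300.0) -> AgentState:
    """Mark an unfinished restart as failed once the timeout has elapsed."""
    in_progress = state in (AgentState.RESTARTING, AgentState.STOPPED)
    if in_progress and now - started_at > timeout_seconds:
        return AgentState.ERROR
    return state
```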

@joshdover
Contributor

lacks the ability to start connectors remotely

@allamiro This seems to be the focus of your use case and is definitely already supported. You should not need to restart Agent in order for a new integration to be started. For example, if I add the "System" integration to an agent policy in the Fleet UI, within seconds, all the agents enrolled in that policy should start the necessary processes to collect that data and start shipping it to Elasticsearch. No restart required.

If this is not happening for you, I suspect you may be running into a bug. Could you open a new elastic-agent issue with steps to reproduce this (especially which integration you have seen this happen on) and link it here so we can investigate it?

  • Troubleshooting: If a connector is experiencing issues, it may be necessary to start it remotely to diagnose and resolve the problem. For example, if a connector is not running correctly, you may need to restart it remotely to see if that resolves the issue.

I think this is a valid use case in some cases, but it definitely shouldn't be the norm. Another option to a full agent restart could be to offer an on/off toggle on the integration's row inside the Agent Policy UI.

What will be the agent state after the user starts the restart? When does the state change back to healthy? What if it never successfully restarts?

My recommendation is that when a user initiates a restart, the agent's state transitions to a "restarting" state. At this point, the agent stops processing new data and attempts to gracefully shut down. If the shutdown is successful, the agent's state will change to "stopped". If not, the agent's state may remain in "restarting" or transition to an "error" state. Once the agent has successfully restarted and is able to process new data, its state will transition back to "healthy". If an agent is unable to successfully restart, its state may remain in "restarting" or transition to an "error" or "failed restart" state after a specific period of time.

Thanks for taking the time to weigh in on this. I agree we need this level of visibility into the restart process if we decide to implement it. In general, we need ways to surface more granular state information like this.

@allamiro
Author

allamiro commented May 11, 2023

@joshdover One of the issues we have involves possible transient errors in the Elastic Agent logs. It is also discussed at https://discuss.elastic.co/t/fail-to-checkin-to-fleet-server/318932/7, and it seems restarting the service fixes the situation. I totally agree it doesn't happen often, but it's important to have the option just in case.

A question: is there a built-in recovery check in the agent that restarts the daemon when it sees an error, perhaps attempting a restart up to 3 times when it detects a problem? I see a built-in check for connecting to the Fleet Server, but I'm not sure whether there are also recovery checks for errors and those types of issues.
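
To make the question concrete, the kind of recovery check meant here could look like the sketch below. It is purely hypothetical and does not describe what Elastic Agent currently does; `restart_with_retries` and its parameters are made-up names, and the `restart` callable stands in for whatever would actually restart the daemon:

```python
import time
from typing import Callable


def restart_with_retries(restart: Callable[[], bool],
                         max_attempts: int = 3,
                         delay_seconds: float = 0.0,
                         sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `restart` until it returns True, at most `max_attempts` times.

    Returns the 1-based attempt number that succeeded, or 0 if every
    attempt failed and the agent should be reported as unhealthy.
    """
    for attempt in range(1, max_attempts + 1):
        if restart():
            return attempt
        if attempt < max_attempts:
            sleep(delay_seconds)  # wait before trying again
    return 0
```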

@ITSEC-Hescalona

ITSEC-Hescalona commented Jun 15, 2023

There are tons of scenarios where this is needed, for example zombie processes or update errors in integrations. Here is an example:

[elastic_agent][error] Unit state changed log-e2ec1ee0-8152-11ed-95ab-01bcb4735c50 (CONFIGURING->FAILED): [failed to reloading inputs: 1 error: Error creating runner from config: Can only start an input when all related states are finished: {Id: native::173-64771, Finished: false, Fileinfo: &{messages 404568 384 {112726762 63822407977 0x55d7cf19cb60} {64771 173 1 33152 0 0 0 0 404568 4096 792 {1686810787 776220378} {1686811177 112726762} {1686811177 112726762} [0 0 0]}}, Source: /var/log/messages, Offset: 439661, Timestamp: 2023-06-15 02:44:43.199529844 -0400 -04 m=+32073.104975279, TTL: -1ns, Type: log, Meta: map[], FileStateOS: 173-64771}].

This error occurred while updating the System integration in a large fleet. I could see two ways to solve it: 1. remove the integration from the policies and re-add it (which is no joke because there are many policies), or 2. restart the agent (the least viable option due to the number of machines). Both ways solved the problem. I opted for the first path and it was resolved. If I had had a restart button for the agent, it would have been much easier.

When everything works well, it's true that it seems almost unnecessary, but when something fails (which is more common than we would like), that button is really missed.

@defensivedepth

I second this. I just had to log in to a system and restart the agent because of an integration issue.

@matthiasledergerber

Same reasons for me; this would be a nice feature.

@tjputzGSA

I also think being able to restart an agent from Fleet would be extremely useful. I have had a number of support cases which required me to restart the agent. I had to open a ticket with IT support so they could log in to the endpoint and restart the agent for me. That takes days.

There are also a number of issues with the agent that a simple restart would fix. Sometimes agents appear unhealthy in Fleet, and changing the logging level from info to debug and back will kick-start the agent enough for it to become healthy.

Reading through this thread I see a number of people that use the product every day asking for a feature which seems to be reasonable to me. Then I read a number of comments stating why the feature shouldn't be needed from people that probably are not operating the product in production at scale within an enterprise.

Thanks for your support, and I hope you consider adding this feature for the benefit of your customers.

@mrhackcz

mrhackcz commented Sep 5, 2023

According to https://discuss.elastic.co/t/restart-elastic-agent-from-fleet-centralized-management/318646, the restart feature should be on the roadmap. An Elastic team member confirmed this.

Sometimes exactly what tjputzGSA mentioned happens. Agents appear unhealthy, for example:

Unit state changed log-default-logfile-mongodb-9e979d90-4b74-11ee-b0f8-a3719086b4fe (CONFIGURING->FAILED): [failed to reloading inputs: 1 error: Error creating runner from config: Can only start an input when all related states are finished: {Id: native::4070-28, Finished: false, Fileinfo: &{mongod.log 31263563411 384 {519849631 63829522593 0x561969c80b60} {28 4070 1 33152 120 130 0 0 31263563411 131072 9104745 {1677589338 111252080} {1693925793 519849631} {1693925793 519849631} [0 0 0]}}, Source: /var/log/mongodb/mongod.log, Offset: 31365736316, Timestamp: 2023-09-05 23:11:13.774047588 +0200 CEST m=+22483.546855525, TTL: -1ns, Type: log, Meta: map[], FileStateOS: 4070-28}]

Only a restart helps. Creating a support ticket and waiting until support restarts it is not ideal.

@allamiro
Author

allamiro commented Sep 6, 2023

@mrhackcz Their initial plan was to incorporate the agent restart, but I think they may have veered off that course.

@jlind23
Contributor

jlind23 commented Sep 7, 2023

@allamiro @mrhackcz This is still part of our roadmap for now, but it has not been prioritised yet compared to other tasks we had to tackle. I'll link all the related issues together in order to give more visibility.

@allamiro
Author

allamiro commented Sep 12, 2023

@allamiro @mrhackcz This is still part of our roadmap for now, but it has not been prioritised yet compared to other tasks we had to tackle. I'll link all the related issues together in order to give more visibility.

I've addressed and provided recommendations for the questions posed by the Elastic team, which were previously unaddressed in all the tickets related to this issue. If needed please reference them on that ticket.

@michel-laterman
Contributor

I think the work on fleet-server to support a new action type will be minimal; it may just be defining a new action type/data struct in the OpenAPI doc (after #3060 has been merged).
A majority of the work needed to enable this will be in elastic-agent and fleet-ui.
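
For illustration only, a new action type/data struct might carry little more than a type tag and a list of target agents. The sketch below is not the fleet-server schema: `RestartAction`, its field names, the `"RESTART"` type value, and `encode` are all hypothetical, and Python is used here only for brevity:

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RestartAction:
    action_id: str               # hypothetical unique ID for tracking the action
    agent_ids: list              # agents the action targets
    type: str = "RESTART"        # hypothetical action type name
    timeout_seconds: int = 300   # hypothetical deadline before reporting failure


def encode(action: RestartAction) -> str:
    """Serialize the action to JSON, as it might be delivered to an agent."""
    return json.dumps(asdict(action), sort_keys=True)
```

The agent side would then need a handler for this type, and the UI a way to create it, which matches the point that most of the work sits outside fleet-server.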

@allamiro
Author

Any updates on this?

9 participants