You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When performing SSH auto-discovery on cloud providers, discovery service automatically installs Teleport on the discovered instances. It works in the successful scenario however when something fails (e.g. we fail to install Teleport package via SSM on EC2 instance) the experience is subpar:
There is no good indication that some nodes failed to join and which ones and why.
There is no clear way for a user to find out what failed and where to go to troubleshoot.
There is no easy way for a user to retry the operation.
Oftentimes Teleport would keep trying to reinstall and failing and it can't self-repair.
Proposed solution(-s)
In the order of the required effort, I think these are the steps that we need to take to improve the reliability and observability of auto-discovery, focused specifically on SSH for now since that seems to be most fragile at the moment.
Improve audit logging
As a cheap first step, let's make sure that all successes and failures of the discovery service are captured in the audit log. I think we do have audit logs for it but let's reinspect them and make sure they're useful and in case of failure contain all required information for a user about:
Which node installation failed on.
Reason for the failure (e.g. install script error).
Link to the relevant part of Cloud Console for troubleshooting (e.g. SSM execution logs for EC2).
Make sure auto-discovery install/upgrade scripts can handle small problems
We should identify most common failure scenarios and make sure that the auto-discovery system and the install script can overcome them gracefully upon retry or once the issue has been resolved. Some examples that come to mind (but not exhaustive):
Lack of IAM permissions (should also be clearly evident from audit log, to the point above).
Installation failed part-way due to external issues (like, repos, networking, etc.).
Installation failed due to bad config.
The system should detect that it failed to install before (as opposed to, for example, this being an installation done by something else) and take appropriate actions like uninstall and start from scratch.
Make use of notification system
Auto-enrollment failures should be captured in our new notification system.
Build auto-discovery dashboard
( see #41909 )
I think this is what will be most useful for users but also require most engineering effort and design support but the idea is to have a dedicated dashboard that shows auto-discovery status where users can see/do things like:
See whether auto-discovery is enabled, turn it on/off and update discovery config.
See relevant auto-discovery status, for example nodes that fail to enroll and errors explaining why.
Manually trigger "retry" (or "fix") on the nodes that failed.
Description
When performing SSH auto-discovery on cloud providers, discovery service automatically installs Teleport on the discovered instances. It works in the successful scenario however when something fails (e.g. we fail to install Teleport package via SSM on EC2 instance) the experience is subpar:
Proposed solution(-s)
In the order of the required effort, I think these are the steps that we need to take to improve the reliability and observability of auto-discovery, focused specifically on SSH for now since that seems to be most fragile at the moment.
Improve audit logging
As a cheap first step, let's make sure that all successes and failures of the discovery service are captured in the audit log. I think we do have audit logs for it but let's reinspect them and make sure they're useful and in case of failure contain all required information for a user about:
Make sure auto-discovery install/upgrade scripts can handle small problems
We should identify most common failure scenarios and make sure that the auto-discovery system and the install script can overcome them gracefully upon retry or once the issue has been resolved. Some examples that come to mind (but not exhaustive):
The system should detect that it failed to install before (as opposed to, for example, this being an installation done by something else) and take appropriate actions like uninstall and start from scratch.
Make use of notification system
Auto-enrollment failures should be captured in our new notification system.
Build auto-discovery dashboard
( see #41909 )
I think this is what will be most useful for users but also require most engineering effort and design support but the idea is to have a dedicated dashboard that shows auto-discovery status where users can see/do things like:
Related issues:
#31180
Tasks
Replace installation script with go code
/etc/os-release
#42614sh
inoneoff
script #42633oneoff
script: Use user directory instead of /tmp for downloading teleport #43029teleport install autodiscover-node
command for Server Auto Discovery #43466The text was updated successfully, but these errors were encountered: