Improve robustness and auditing of discovery service #37620

r0mant · 2024-01-31T22:43:12Z

Description

When performing SSH auto-discovery on cloud providers, the discovery service automatically installs Teleport on the discovered instances. This works when everything succeeds, but when something fails (e.g. we fail to install the Teleport package via SSM on an EC2 instance), the experience is subpar:

There is no good indication that some nodes failed to join, which ones, or why.
There is no clear way for a user to find out what failed and where to go to troubleshoot.
There is no easy way for a user to retry the operation.
Teleport often keeps trying to reinstall and failing, and it cannot self-repair.

Proposed solution(-s)

In order of required effort, I think these are the steps we need to take to improve the reliability and observability of auto-discovery. They focus specifically on SSH for now, since that seems to be the most fragile part at the moment.

Improve audit logging

As a cheap first step, let's make sure that all successes and failures of the discovery service are captured in the audit log. I think we already have audit events for this, but let's re-inspect them and make sure they are useful and that, in case of failure, they tell the user everything they need to know:

Which node the installation failed on.
The reason for the failure (e.g. install script error).
A link to the relevant part of the cloud console for troubleshooting (e.g. SSM execution logs for EC2).
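
To make the three requirements above concrete, here is a minimal sketch of what a failure record could carry. It is only an illustration: `InstallFailureEvent`, `build_failure_event`, and the field names are hypothetical and do not reflect Teleport's actual audit event schema, and the URL in the usage note is a placeholder.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class InstallFailureEvent:
    """Hypothetical audit record for a failed auto-discovery install."""
    instance_id: str          # which node the installation failed on
    reason: str               # why it failed, e.g. an install script error
    troubleshooting_url: str  # where in the cloud console to look at the logs


def build_failure_event(instance_id: str, reason: str,
                        troubleshooting_url: str) -> InstallFailureEvent:
    """Build the record, refusing to emit one that lacks a required field."""
    fields = {
        "instance_id": instance_id,
        "reason": reason,
        "troubleshooting_url": troubleshooting_url,
    }
    missing = [name for name, value in fields.items() if not value.strip()]
    if missing:
        raise ValueError("audit event is missing: " + ", ".join(missing))
    return InstallFailureEvent(**fields)
```

For example, `build_failure_event("i-0123456789abcdef0", "install script exited with status 1", "https://console.example.com/executions/abc")` returns a record that `asdict` can turn into a plain dictionary, while an empty `reason` raises `ValueError`. Rejecting incomplete records at construction time is one way to guarantee that every failure in the audit log is actionable.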

Make sure auto-discovery install/upgrade scripts can handle small problems

We should identify the most common failure scenarios and make sure that the auto-discovery system and the install script can recover from them gracefully, either on retry or once the underlying issue has been resolved. Some examples that come to mind (not an exhaustive list):

Lack of IAM permissions (this should also be clearly evident from the audit log, per the point above).
Installation failed part-way due to external issues (e.g. package repos, networking).
Installation failed due to bad config.

The system should detect that one of its own earlier install attempts failed (as opposed to, for example, an installation done by something else) and take appropriate action, such as uninstalling and starting from scratch.
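
The detection described above could be sketched as a small decision function. This is an assumption-heavy illustration, not how the installer works today: it supposes the auto-discovery installer writes a marker when it starts and records completion when it finishes, and the names `Action` and `decide_action` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    INSTALL = "install"
    REINSTALL = "uninstall, then install from scratch"
    SKIP = "leave the host untouched"


def decide_action(teleport_present: bool, marker_present: bool,
                  install_completed: bool) -> Action:
    """Decide what to do with a discovered host.

    marker_present: a (hypothetical) marker left by a previous
        auto-discovery install attempt exists on the host.
    install_completed: that attempt recorded a successful finish.
    """
    if not marker_present:
        # No marker: either a clean host, or Teleport was installed by
        # something else and must not be overwritten.
        return Action.SKIP if teleport_present else Action.INSTALL
    if install_completed:
        # Our own earlier install finished fine; nothing to repair.
        return Action.SKIP
    # Our own earlier attempt started but never finished: start over.
    return Action.REINSTALL
```

Keeping the decision a pure function of observed state makes each failure scenario easy to enumerate and test, independent of how the state is actually collected on the host.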

Make use of notification system

Auto-enrollment failures should be captured in our new notification system.

Build auto-discovery dashboard

(see #41909)
I think this is what will be most useful for users, but it will also require the most engineering effort and design support. The idea is to have a dedicated dashboard that shows auto-discovery status, where users can see and do things like:

See whether auto-discovery is enabled, turn it on/off and update discovery config.
See relevant auto-discovery status, for example nodes that fail to enroll and errors explaining why.
Manually trigger "retry" (or "fix") on the nodes that failed.
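
As a rough illustration of the status view, the data behind such a dashboard could be a per-error grouping of failed nodes. The function name `summarize_enrollment` and the input shape (pairs of instance ID and error text, with `None` meaning success) are assumptions made for this sketch only.

```python
from collections import defaultdict


def summarize_enrollment(attempts):
    """Group enrollment attempts into a dashboard-style summary.

    attempts: iterable of (instance_id, error) pairs, where error is
    None when the node enrolled successfully.
    """
    enrolled = 0
    failed = defaultdict(list)
    for instance_id, error in attempts:
        if error is None:
            enrolled += 1
        else:
            failed[error].append(instance_id)
    # Sort the IDs so the summary is stable between refreshes.
    return {
        "enrolled": enrolled,
        "failed": {error: sorted(ids) for error, ids in failed.items()},
    }
```

Grouping by error rather than by node surfaces the common case where one misconfiguration (for example, a missing IAM permission) breaks many instances at once, and gives a natural target for a bulk "retry" action.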

Related issues:
#31180

Tasks

Improve messages when EC2 Auto Discover with SSM fails #41465

backport/branch/v15 discovery documentation no-changelog size/md
Add SSM Commands stdout/err to audit log #41478

backport/branch/v15 no-changelog size/sm
SSMRun Audit Event: add invocation url #41663

backport/branch/v15 no-changelog size/sm
Fix missing SSHD_CONFIG variable in Default Agentless Installer script #41523

backport/branch/v13 backport/branch/v14 backport/branch/v15 size/sm
EC2 Auto Discover with SSM: add script stdout and stderr to audit log #41479

backport/branch/v15 size/md
Fix EC2 Auto Discover SSM failure when sending an extra param #41532

backport/branch/v15 size/sm
EC2 Auto Discover with SSM: add invocation url to audit log #41689

backport/branch/v15 size/sm
EC2 Auto Discover SSM: add support for debugging custom SSM Docs #41706

backport/branch/v15 documentation no-changelog size/md
Auto Discover Servers: recover from bad configuration #44282

backport/branch/v15 backport/branch/v16 size/md
Auto Discover Server: add support for Rocky and Almalinux #44283

backport/branch/v16 size/sm
Auto Discover Server must not overwrite manual changes #44637

backport/branch/v16 no-changelog size/md

Replace installation script with go code

Extract more information from /etc/os-release #42614

backport/branch/v15 backport/branch/v16 no-changelog size/sm
Allow arguments for the oneoff script #42621

backport/branch/v15 backport/branch/v16 no-changelog size/sm
Use sh in oneoff script #42633

backport/branch/v15 backport/branch/v16 no-changelog size/sm
Azure IMDS: add location, vmid, subscription and group #42857

backport/branch/v15 backport/branch/v16 no-changelog size/sm
oneoff script: Use user directory instead of /tmp for downloading teleport #43029

backport/branch/v15 backport/branch/v16 no-changelog size/sm
Script oneoff: add optional command prefix (sudo) #43234

backport/branch/v15 backport/branch/v16 no-changelog size/sm
Teleport install command for discovery agents (go) #43423

backport/branch/v16 no-changelog size/xl
Use teleport install autodiscover-node command for Server Auto Discovery #43466

backport/branch/v16 no-changelog size/sm
r0mant added the bug, ux, ui, and discover labels Jan 31, 2024

r0mant assigned marcoandredinis Jan 31, 2024

marcoandredinis mentioned this issue Jul 3, 2024

Teleport install command for discovery agents (go) #43423

Merged

marcoandredinis mentioned this issue Jul 16, 2024

Auto Discover Servers: recover from bad configuration #44282

Merged

r0mant commented Jan 31, 2024 • edited by marcoandredinis
