[ws-manager] Re-create workspace pods on rejection #20243

geropl · 2024-09-26T08:28:54Z

Description

This PR enables ws-manager to re-create a workspace pod when it detects that the pod is stuck. This i
s done by stopping the failed workspace, wait a configurable timeout (15s default), and create a pod with the same name/ID again.

The root cause is an unfixed race condition in kubelet: node labels are out of sync, and suddenly the Pod - although already scheduled to a node - does no longer fit onto the node, so the kubelet marks it as "rejected": A situation, in which it is "stuck".
It shows up in both AWS and GCP, and has a ~1% chance of occurring (from our rough investigation).

This change makes it so that ws-manager detects this condition (which is luckily very straight forward and clearly distinguishable), and once detected, marks it with the PodRejected condition. From there on, workspace is treated differently. This ensures that this change remains (nearly) zero-impact for the other 99% of workspaces (with the exception of some changes to context handling in ws-daemon's pod control loop, see below).

To make sure this change does not break anything it has been thoroughly stress-tested using a loadgen + a breaker program which "rejects" every workspace that is created exactly once (see the "DEV" commit).

TODO

Discussion: "Cool, but it's a big change. Is this worth it?"

Yes, I think so. And that's because:

a) it's fairly isolated (the change does impose very little risk to the majority/99% of workspaces) which continue to function 💯% as they do today.
b) it fixes a long-standing bug (it's a kubernetes bug, granted, but it directly affects Gitpod usability)

Changes in Detail

The general theme is that we maintain all the existing logic of starting and stopping workspaces. The only conceptual changes are:

Stopping a workspace with PodRejected works a bit differently, as a) we don't care about a backup, b) it might fail at any point during Pending/Creating and c) we need to cleanup everything we know about that workspace in ws-daemon, as there might be a new pod with the exact same ID in the near future, soon.
- the brunt of the changes are actually related to enabling ws-daemon to "wipe" a workspace, in a synchronized way ( this is only exercised in this failure condition!)
- this includes using a dedicated condition to communicated this being done (next to BackupFailure/BackupComplete): StateWiped
once a workspace with PodRejected has been stopped, it gets eventually re-created

ws-manager

Two new config fields:

PodRecreationMaxRetries: (default: 3): how often we are trying to re-create a given Workspace
PodRecreationBackoff (default: 15s): how long two wait before creating a new pod after the old one has been deleted

The workspace CRD gets two new fields:

.status.podRecreated: the number of times a workspace has gone through the re-creation logic (usually 0)
.status.podDeletionTime: The time the workspace pod was seen removed the first time by ws-manager (set in status.go)
- this is used to make sure
  There is a 3rd field podStarts which has been there since forever, which is finally used now. 🙂

Two new conditions got introduced to, which are :

PodRejected (set in status.go)
- true: when the pod is failing (pending -> stopped)
- false: when the pod is re-created
StateWiped (set in ws-daemon, and checked in status.go)
- used to synchronize the "workspace state wiped" state from ws-daemon back to ws-manager. Only once this is set a "rejected workspace" is considered "stopped".

ws-daemon

Workspace control loop

On handleWorkspaceStop, we call doWipeWorkspace instead of doWorkspaceContentBackup. It calls workspace_operations.WipeWorkspace, and sets the StateWiped condition accordingly.

WipeWorkspace does:

calls the regular "dispose" hooks
calls an additional "dispose" hook, which tries to call the IWS, and if successful, runs WipingTeardown
- this unmount the shiftfs mounts from inside the workspace
- ℹ️ this might be a change we actually can use for all workspaces. But that's a separate discussion.
calls dispatch.DisposeWorkspace (see below)
disposes all stored metadata (like on regular shutdown)

Pod control loop

This control loop is fueled by a PodInformer, and before this change it's shutdown mechanics were not sync'ed with the Workspace control loop, and timing worked on "best-effort" basis.

dispatch.DisposeWorkspace got added (which again is only called in the PodRejected case, cmp. above), which:

cancels the context of all running go routines
waits for all of them to report being cancelled (WaitGroup)
calls containerRuntime.DisposeContainer, which clears some internally held ID mappings between Workspace, Pod and Container
ℹ️ blocks future pod updates (excluding "deletions") from being processed (pod is being identified by .Name + .CreationTimestamp!)

Diagram

I tried to capture the new state changes in a diagram (source):

Related Issue(s)

Fixes ENT-530

How to test

Check tests are sensible and 🟢

Manual: regular workspace

Manual: prebuild workspace

Manual: using `loadgen` and the "rejection"-program

Loom of my testing is here: https://www.loom.com/share/e2eae31965dc48829b415c81f3005f1b?sid=6b4acf8d-06d0-4d21-b3e2-ef3182f8e974
Summary screen:

open a workspace on this branch,
connect to the ephemeral-gpl4 cluster (announce here so you don't block others)
prepare to run a loadgen test as usual
in parallel, start "rejection" tool: cd scheduler && go run main.go -kubeconfig ~/.kube/config
run loadgen
notice how all workspaces come up nicely 😌

TODO(gpl): add loom so others don't have to do this

Documentation

Preview status

Gitpod was successfully deployed to your preview environment.

🏷️ Name - gpl-530-retry
🔗 URL - gpl-530-retry.preview.gitpod-dev.com/workspaces.
📚 Documentation - See our internal documentation for information on how to interact with your preview environment.
📦 Version - gpl-530-retry-gha.30031
🗒️ Logs - GCP Logs Explorer

Build Options

Build

/werft with-werft
Run the build with werft instead of GHA
leeway-no-cache
/werft no-test
Run Leeway with --dont-test

Publish

/werft publish-to-npm
/werft publish-to-jb-marketplace

Installer

analytics=segment
with-dedicated-emulation
workspace-feature-flags
Add desired feature flags to the end of the line above, space separated

Preview Environment / Integration Tests

/werft with-local-preview
If enabled this will build install/preview
/werft with-preview
/werft with-large-vm
/werft with-gce-vm
If enabled this will create the environment on GCE infra
/werft preemptible
Saves cost. Untick this only if you're really sure you need a non-preemtible machine.
with-integration-tests=all
Valid options are all, workspace, webapp, ide, jetbrains, vscode, ssh. If enabled, with-preview and with-large-vm will be enabled.
with-monitoring

/hold

github-actions · 2024-10-18T15:04:33Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

geropl · 2024-10-24T09:25:31Z

Still work in progress

github-actions · 2024-11-04T15:05:02Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

socket-security · 2024-11-11T08:35:01Z

New dependencies detected. Learn more about Socket for GitHub ↗︎

Package	New capabilities	Transitives	Size
golang/k8s.io/[email protected]	unsafe	`0`	23.9 MB
golang/k8s.io/[email protected]	environment, filesystem, network, shell, unsafe	`0`	4.17 MB
golang/k8s.io/[email protected]	environment, filesystem, network, shell, unsafe	`0`	13.9 MB

View full report↗︎

WVerlaek

overall nice work!
Left a few comments/questions, didn't look too deeply into the ws-daemon changes

components/ws-daemon/pkg/daemon/markunmount.go

components/ws-daemon/pkg/cgroup/cgroup.go

components/ws-daemon/pkg/controller/workspace_operations.go

components/ws-daemon/pkg/dispatch/dispatch.go

components/ws-manager-mk2/controllers/workspace_controller.go

scheduler/main.go

components/ws-manager-mk2/controllers/workspace_controller.go

components/ws-daemon/pkg/netlimit/netlimit.go

components/ws-daemon/pkg/dispatch/dispatch.go

Furisto · 2024-11-14T10:52:37Z

components/ws-daemon/pkg/controller/workspace_operations.go

+		}
+	}()
+
+	timeout := time.NewTicker(10 * time.Second)


Why do we restrict the delete to 10 seconds?

Added a comment.

tl;dr: If we don't have the timeout, there is a chance that that this operation blocks forever (at least before I fixed the mark unmounting). By adding the timeout guarantee it never blocks the workspace from being deleted (because that depends on the "wiping" to finish.

We trade it for the slight chance of not wiping cleanly, which might become a problem iff the new workspace pod is scheduled to the same node again. Which I deem ok, because then we are talking about an error case in the 0.1-0.01% range, which is already way better than the original 1%.
All of this, again, only if the mark unmounting did not fix this problem in the first place. 👍

…g fix

geropl · 2024-11-14T16:27:10Z

@iQQBot I added the loom for the load test (incl. rejector running, so "breaking" every workspace pod once), and addressed all the other comments. Would be great if you could review, and do a manual smoke test on the preview (once it's there again) that the "normal case" still works as expected.

Two things I still need to test (tomorrow):

have dev/rejector running against the preview
manually (+ via prebuild) start workspaces
check that the UX does not break 👍

Once that's done, we can merge. 🙂

geropl · 2024-11-15T10:11:12Z

Done testing! 🥳 🎉

Added a small logging improvement, otherwise just need ✔️ to 🚢 !

iQQBot

Code LGTM, try regular start and stop, it works like before

A question: If content-init has already started and the pod re-schedule at same node, and content is big enough, what will happen? Is there a possibility of a data overlap issue?

components/ws-manager-mk2/controllers/status.go

geropl · 2024-11-15T12:27:54Z

A question: If content-init has already started and the pod re-schedule at same node, and content is big enough, what will happen? Is there a possibility of a data overlap issue?

The approach in this PR is the "wiping mode" as you found out, where first all running handlers etc. are stopped, and finally all content directories are removed. The oprations are synchronized on multiple levels (one e.g. for content-init), so this should* not happen.

*: saying "should", because the ws-daemon code was surprinsingly complex, as we have three layers of feeback loops (k8s workspace, k8s pod, containerd) we have to synchronize. There still might be a whole somewhere, but after the extensive testing, I'm absolutely sure we are taking about 0.01% here, instead of the 1% we have atm.

geropl · 2024-11-15T12:28:02Z

/unhold

iQQBot · 2024-11-15T12:31:15Z

The approach in this PR is the "wiping mode" as you found out, where first all running handlers etc. are stopped, and finally all content directories are removed. The oprations are synchronized on multiple levels (one e.g. for content-init), so this should* not happen.

content-init is a special case that executes using another process and has no related context to control (i.e. the context is the context of ws-deamon).

geropl · 2024-11-15T12:42:28Z

content-init is a special case that executes using another process and has no related context to control (i.e. the context is the context of ws-deamon).

But content init happens in sync. And all workspace-state related changes are handled by workspace_controller.go, using the controller framework by k8s, which I understood handles one event at time (e.g. handles synchronization).

roboquat added do-not-merge/hold do-not-merge/work-in-progress size/L labels Sep 26, 2024

geropl changed the title ~~[WIP][ws-manager] Re-create workspace pods on rejection (incl. test)~~ [WIP][ws-manager] Re-create workspace pods on rejection Oct 1, 2024

roboquat added size/XL and removed size/L labels Oct 1, 2024

geropl force-pushed the gpl/530-retry branch 2 times, most recently from 2a35e06 to ac1c86b Compare October 8, 2024 14:43

github-actions bot added the meta: stale This issue/PR is stale and will be closed soon label Oct 18, 2024

github-actions bot closed this Oct 23, 2024

geropl reopened this Oct 24, 2024

geropl force-pushed the gpl/530-retry branch from cefc61e to d874dfb Compare October 24, 2024 09:39

github-actions bot removed the meta: stale This issue/PR is stale and will be closed soon label Oct 24, 2024

geropl force-pushed the gpl/530-retry branch 2 times, most recently from 9aeba20 to d3b0ed0 Compare October 25, 2024 12:58

github-actions bot added the meta: stale This issue/PR is stale and will be closed soon label Nov 4, 2024

geropl force-pushed the gpl/530-retry branch from d3b0ed0 to f8436b9 Compare November 8, 2024 15:11

roboquat added size/XXL and removed size/XL labels Nov 8, 2024

github-actions bot removed the meta: stale This issue/PR is stale and will be closed soon label Nov 9, 2024

geropl force-pushed the gpl/530-retry branch 2 times, most recently from 669e328 to 2f27e7b Compare November 11, 2024 08:33

geropl changed the title ~~[WIP][ws-manager] Re-create workspace pods on rejection~~ [ws-manager] Re-create workspace pods on rejection Nov 11, 2024

geropl marked this pull request as ready for review November 11, 2024 15:00

geropl requested review from a team as code owners November 11, 2024 15:00

roboquat removed the do-not-merge/work-in-progress label Nov 11, 2024

github-actions bot added team: team-engine team: team-enterprise labels Nov 11, 2024

WVerlaek reviewed Nov 12, 2024

View reviewed changes

geropl added 2 commits November 12, 2024 16:07

[ws-manager] Re-create workspace pods (incl. test)

5e1d3e3

[gpctl] Fix "workspaces list"

68830be

geropl force-pushed the gpl/530-retry branch from ba21ee7 to 88c40a7 Compare November 12, 2024 16:08

Furisto reviewed Nov 14, 2024

View reviewed changes

geropl added 2 commits November 14, 2024 16:19

[ws-daemon, ws-manager] Review comments, logging cleanups and orderin…

72f87e0

…g fix

[dev/rejector] Add the tool we use to test PodRejection under "dev"

42a0e3a

geropl force-pushed the gpl/530-retry branch from 88c40a7 to 42a0e3a Compare November 14, 2024 16:23

[bridge] Log nested status shape (and properly scrubbing it)

e4a1756

geropl requested a review from a team as a code owner November 15, 2024 10:10

github-actions bot added the team: team-experience label Nov 15, 2024

iQQBot approved these changes Nov 15, 2024

View reviewed changes

components/ws-manager-mk2/controllers/status.go Show resolved Hide resolved

roboquat removed the do-not-merge/hold label Nov 15, 2024

roboquat merged commit 99cc66b into main Nov 15, 2024
18 checks passed

roboquat deleted the gpl/530-retry branch November 15, 2024 12:28

[ws-manager] Re-create workspace pods on rejection #20243

[ws-manager] Re-create workspace pods on rejection #20243

Uh oh!

Conversation

geropl commented Sep 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

TODO

Discussion: "Cool, but it's a big change. Is this worth it?"

Changes in Detail

ws-manager

ws-daemon

Diagram

Related Issue(s)

How to test

Check tests are sensible and 🟢

Manual: regular workspace

Manual: prebuild workspace

Manual: using loadgen and the "rejection"-program

Documentation

Preview status

Build Options

Uh oh!

github-actions bot commented Oct 18, 2024

Uh oh!

geropl commented Oct 24, 2024

Uh oh!

github-actions bot commented Nov 4, 2024

Uh oh!

socket-security bot commented Nov 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

WVerlaek left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Furisto Nov 14, 2024

Choose a reason for hiding this comment

Uh oh!

geropl Nov 14, 2024

Choose a reason for hiding this comment

Uh oh!

geropl commented Nov 14, 2024

Uh oh!

geropl commented Nov 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

iQQBot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

geropl commented Nov 15, 2024

Uh oh!

geropl commented Nov 15, 2024

Uh oh!

Uh oh!

iQQBot commented Nov 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

geropl commented Nov 15, 2024

Uh oh!

Uh oh!

geropl commented Sep 26, 2024 •

edited

Loading

Manual: using `loadgen` and the "rejection"-program

socket-security bot commented Nov 11, 2024 •

edited

Loading

geropl commented Nov 15, 2024 •

edited

Loading

iQQBot commented Nov 15, 2024 •

edited

Loading