[ws-manager] Re-create workspace pods on rejection #20243
Conversation
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Still work in progress
overall nice work!
Left a few comments/questions, didn't look too deeply into the ws-daemon changes
timeout := time.NewTicker(10 * time.Second)
Why do we restrict the delete to 10 seconds?
Added a comment.
tl;dr: If we don't have the timeout, there is a chance that this operation blocks forever (at least before I fixed the mark unmounting). By adding the timeout we guarantee it never blocks the workspace from being deleted (because that depends on the "wiping" to finish).
We trade it for the slight chance of not wiping cleanly, which might become a problem iff the new workspace pod is scheduled to the same node again. Which I deem ok, because then we are talking about an error case in the 0.1-0.01% range, which is already way better than the original 1%.
All of this, again, only if the mark unmounting did not fix this problem in the first place. 👍
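For illustration, a minimal sketch of that trade-off (not the PR's actual code; `wipeWithTimeout` and its callers are assumed names): the wipe runs in a goroutine and is abandoned once the deadline passes, so deletion is never blocked.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// wipeWithTimeout bounds the wipe so it can never block workspace deletion.
// Hypothetical helper for illustration only.
func wipeWithTimeout(wipe func(context.Context) error, limit time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), limit)
	defer cancel()

	done := make(chan error, 1)
	go func() { done <- wipe(ctx) }()

	select {
	case err := <-done:
		return err // wipe finished within the limit
	case <-ctx.Done():
		// Give up waiting: deletion proceeds, at the cost of a possibly unclean wipe.
		return ctx.Err()
	}
}

func main() {
	err := wipeWithTimeout(func(ctx context.Context) error {
		time.Sleep(50 * time.Millisecond) // stand-in for the real wipe work
		return nil
	}, 10*time.Second)
	fmt.Println("wiped cleanly:", err == nil, "timed out:", errors.Is(err, context.DeadlineExceeded))
}
```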
@iQQBot I added the loom for the load test (incl. rejector running, so "breaking" every workspace pod once), and addressed all the other comments. Would be great if you could review, and do a manual smoke test on the preview (once it's there again) that the "normal case" still works as expected. Two things I still need to test (tomorrow):
Once that's done, we can merge. 🙂
Done testing! 🥳 🎉 Added a small logging improvement, otherwise just need ✔️ to 🚢!
Code LGTM. Tried regular start and stop; it works like before.
A question: if content-init has already started and the pod is re-scheduled to the same node, and the content is big enough, what will happen? Is there a possibility of a data overlap issue?
The approach in this PR is the "wiping mode", as you found out, where first all running handlers etc. are stopped, and finally all content directories are removed. The operations are synchronized on multiple levels (one e.g. for content-init), so this should* not happen. *: saying "should", because the ws-daemon code was surprisingly complex, as we have three layers of feedback loops (k8s workspace, k8s pod, containerd) we have to synchronize. There still might be a hole somewhere, but after the extensive testing, I'm absolutely sure we are talking about 0.01% here, instead of the 1% we have atm.
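As a rough illustration of that kind of synchronization (purely a sketch with assumed names, not the ws-daemon implementation): a per-workspace lock is one way to guarantee content-init and wiping can never overlap.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// workspaceOps serializes operations for one workspace; illustrative only.
type workspaceOps struct {
	mu sync.Mutex
}

// InitContent runs content initialization; it holds the lock for the duration.
func (w *workspaceOps) InitContent(ctx context.Context, init func(context.Context) error) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	return init(ctx)
}

// Wipe blocks until a running content-init has finished before removing anything.
func (w *workspaceOps) Wipe(ctx context.Context, wipe func(context.Context) error) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	return wipe(ctx)
}

func main() {
	ops := &workspaceOps{}
	_ = ops.InitContent(context.Background(), func(context.Context) error { return nil })
	_ = ops.Wipe(context.Background(), func(context.Context) error { return nil })
	fmt.Println("content-init and wipe never overlap")
}
```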
/unhold
content-init is a special case that executes using another process and has no related context to control (i.e. the context is the context of ws-daemon).
But content init happens in sync. And all workspace-state related changes are handled by |
Description
This PR enables `ws-manager` to re-create a workspace pod when it detects that the pod is stuck. This is done by stopping the failed workspace, waiting a configurable timeout (15s default), and creating a pod with the same name/ID again.

The root cause is an unfixed race condition in kubelet: node labels are out of sync, and suddenly the pod (although already scheduled to a node) no longer fits onto the node, so the kubelet marks it as "rejected": a situation in which it is "stuck".

It shows up in both AWS and GCP, and has a ~1% chance of occurring (from our rough investigation).

This change makes it so that ws-manager detects this condition (which is luckily very straightforward and clearly distinguishable) and, once detected, marks the workspace with the `PodRejected` condition. From there on, the workspace is treated differently. This ensures that this change remains (nearly) zero-impact for the other 99% of workspaces (with the exception of some changes to context handling in ws-daemon's pod control loop, see below).

To make sure this change does not break anything, it has been thoroughly stress-tested using `loadgen` plus a "breaker" program which "rejects" every workspace that is created exactly once (see the "DEV" commit).

TODO
Discussion: "Cool, but it's a big change. Is this worth it?"
Yes, I think so. And that's because:
Changes in Detail
The general theme is that we maintain all the existing logic of starting and stopping workspaces. The only conceptual changes are:

- `PodRejected` works a bit differently when stopping, as a) we don't care about a backup, b) it might fail at any point during `Pending`/`Creating`, and c) we need to clean up everything we know about that workspace in ws-daemon, as there might be a new pod with the exact same ID in the near future.
- We teach `ws-daemon` to "wipe" a workspace, in a synchronized way (this is only exercised in this failure condition!)
- A new condition, analogous to `BackupFailure`/`BackupComplete`: `StateWiped`
- Once a workspace with `PodRejected` has been stopped, it gets eventually re-created

ws-manager
Two new config fields:

- `PodRecreationMaxRetries` (default: `3`): how often we are trying to re-create a given Workspace
- `PodRecreationBackoff` (default: `15s`): how long to wait before creating a new pod after the old one has been deleted

The workspace CRD gets two new fields:
- `.status.podRecreated`: the number of times a workspace has gone through the re-creation logic (usually `0`)
- `.status.podDeletionTime`: the time the workspace pod was seen removed for the first time by ws-manager (set in `status.go`)

There is a 3rd field, `podStarts`, which has been there since forever and is finally used now. 🙂

Two new conditions got introduced, which are:
- `PodRejected` (set in `status.go`)
  - `true`: when the pod is failing (pending -> stopped)
  - `false`: when the pod is re-created
- `StateWiped` (set in ws-daemon, and checked in `status.go`)
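To make the new config and status fields above concrete, here is a rough sketch (simplified types and an assumed helper name, not the actual ws-manager code) of how the retry cap, the backoff, and the status fields could gate re-creation:

```go
package main

import (
	"fmt"
	"time"
)

// Simplified stand-ins for the real configuration and status types.
type Config struct {
	PodRecreationMaxRetries int           // default 3
	PodRecreationBackoff    time.Duration // default 15s
}

type WorkspaceStatus struct {
	PodRecreated    int        // .status.podRecreated
	PodDeletionTime *time.Time // .status.podDeletionTime
}

// shouldRecreate is a hypothetical helper: may a rejected workspace pod be created again?
func shouldRecreate(cfg Config, status WorkspaceStatus, now time.Time) bool {
	if status.PodRecreated >= cfg.PodRecreationMaxRetries {
		return false // retry budget exhausted, surface the failure instead
	}
	if status.PodDeletionTime == nil {
		return false // the old pod has not been observed as deleted yet
	}
	// wait out the backoff after the old pod was seen removed
	return now.Sub(*status.PodDeletionTime) >= cfg.PodRecreationBackoff
}

func main() {
	deleted := time.Now().Add(-20 * time.Second)
	ok := shouldRecreate(
		Config{PodRecreationMaxRetries: 3, PodRecreationBackoff: 15 * time.Second},
		WorkspaceStatus{PodRecreated: 1, PodDeletionTime: &deleted},
		time.Now(),
	)
	fmt.Println("re-create now:", ok) // true: backoff elapsed, retries left
}
```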
ws-daemon

Workspace control loop
On `handleWorkspaceStop`, we call `doWipeWorkspace` instead of `doWorkspaceContentBackup`. It calls `workspace_operations.WipeWorkspace`, and sets the `StateWiped` condition accordingly.

`WipeWorkspace` does:

- `WipingTeardown`
- `dispatch.DisposeWorkspace` (see below)
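A minimal sketch of that wipe sequence as described above (stub functions with assumed names; the real implementation lives in ws-daemon's `workspace_operations`):

```go
package main

import (
	"context"
	"fmt"
	"os"
)

// wipingTeardown stands in for the teardown step: unmount marks, stop handlers, etc.
func wipingTeardown(ctx context.Context, workspaceID string) error { return nil }

// disposeWorkspace stands in for dispatch.DisposeWorkspace (see "Pod control loop" below).
func disposeWorkspace(ctx context.Context, workspaceID string) error { return nil }

// wipeWorkspace runs the whole wipe in order and reports whether it succeeded,
// which is what the StateWiped condition gets set from.
func wipeWorkspace(ctx context.Context, workspaceID, contentDir string) error {
	if err := wipingTeardown(ctx, workspaceID); err != nil {
		return err
	}
	if err := disposeWorkspace(ctx, workspaceID); err != nil {
		return err
	}
	// finally remove everything kept on disk for this workspace
	return os.RemoveAll(contentDir)
}

func main() {
	err := wipeWorkspace(context.Background(), "ws-123", "/tmp/ws-123")
	fmt.Println("StateWiped:", err == nil)
}
```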
Pod control loop

This control loop is fueled by a `PodInformer`, and before this change its shutdown mechanics were not sync'ed with the `Workspace` control loop; timing worked on a "best-effort" basis.

`dispatch.DisposeWorkspace` got added (which again is only called in the `PodRejected` case, cmp. above), which:

- waits for all running operations on that pod to finish (`WaitGroup`)
- calls `containerRuntime.DisposeContainer`, which clears some internally held ID mappings between Workspace, Pod and Container (`.Name` + `.CreationTimestamp`!)
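Roughly, the disposal could look like the sketch below (assumed type and field names, not the actual dispatch package): wait for in-flight handlers, then drop the in-memory mappings so a re-created pod with the same workspace ID starts from a clean slate.

```go
package main

import (
	"fmt"
	"sync"
)

// dispatcher is an illustrative stand-in for ws-daemon's dispatch state.
type dispatcher struct {
	mu       sync.Mutex
	inFlight map[string]*sync.WaitGroup // per pod (.Name + .CreationTimestamp)
	mapping  map[string]string          // workspace ID -> container ID, etc.
}

// DisposeWorkspace blocks until running handlers finish, then forgets the pod.
func (d *dispatcher) DisposeWorkspace(podKey, workspaceID string) {
	d.mu.Lock()
	wg := d.inFlight[podKey]
	d.mu.Unlock()

	if wg != nil {
		wg.Wait() // let in-flight operations drain first
	}

	d.mu.Lock()
	defer d.mu.Unlock()
	delete(d.inFlight, podKey)
	delete(d.mapping, workspaceID) // a re-created pod with the same ID starts clean
}

func main() {
	d := &dispatcher{
		inFlight: map[string]*sync.WaitGroup{},
		mapping:  map[string]string{"ws-123": "containerd-id"},
	}
	d.DisposeWorkspace("pod-abc@2024-01-01", "ws-123")
	fmt.Println("mappings left:", len(d.mapping))
}
```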
Diagram
I tried to capture the new state changes in a diagram (source):
Related Issue(s)
Fixes ENT-530
How to test
- Check tests are sensible and 🟢
- Manual: regular workspace
- Manual: prebuild workspace
- Manual: using `loadgen` and the "rejection"-program

Loom of my testing is here: https://www.loom.com/share/e2eae31965dc48829b415c81f3005f1b?sid=6b4acf8d-06d0-4d21-b3e2-ef3182f8e974

Summary screen:

- use the `ephemeral-gpl4` cluster (announce here so you don't block others)
- `cd scheduler && go run main.go -kubeconfig ~/.kube/config`
- `loadgen`
- TODO(gpl): add loom so others don't have to do this
Documentation