containerMode: kubernetes: Command.RunScriptStep in packages/k8s/src/index.ts never returns #124
Comments
It's 100% reproducible. Please let me know how I can help debug the issue. This is the last blocker (famous last words ;-)) to moving our dedicated runners to ARC/GKE. |
I'm thinking it might be related to #107 |
I thought so as well; I added the try/catch with logging, but it never got hit, unfortunately. :-( Patch:
I'm far from a websocket/typescript person, so I'm very much out on a limb here... |
Maybe you could try to extend the catch with this: catch (error) {
core.error(`BT Failed to exec pod step: ${error}`)
reject(error)
} And see if that does it? |
The try-catch there shouldn't be the problem, since it can only fail if websocket communication fails (a naive opinion for now; I have not found time to actually work on it, but I hope to soon). I'm hoping it is something much simpler, like changing tty to true. But if you find the problem before me, I would be happy to review it 😄! |
@nielstenboom I'll spin up a test first thing tomorrow! Note that I did not see the error logging, meaning I didn't land in the catch... :-( |
@nikola-jokic I was trying to figure out if I could somehow log in the node if the websocket went down, but I couldn't wrap my head around it. |
Feel free to hit me up with snippets to try, and I'll take them for a spin! |
@nielstenboom Took your PR #125 for a spin -- same hang, unfortunately. |
Hmmm unfortunate :( Which k8s version are you guys running? One more thing you could try maybe is bumping the @kubernetes/client-node package? No clue if it'll help, just spewing random ideas here haha 😄 |
It's GKE: v1.27.7-gke.1056000 Random ideas are very much welcome! TY! |
Hey @bjoto, I am failing to reproduce the issue... Can you please add logs so we can trace where exactly in the hook we block?
export async function execPodStep(
  command: string[],
  podName: string,
  containerName: string,
  stdin?: stream.Readable
): Promise<void> {
  const exec = new k8s.Exec(kc)
  await new Promise(function (resolve, reject) {
    exec
      .exec(
        namespace(),
        podName,
        containerName,
        command,
        process.stdout,
        process.stderr,
        stdin ?? null,
        false /* tty */,
        resp => {
          // kube.exec returns an error if exit code is not 0, but we can't actually get the exit code
          if (resp.status === 'Success') {
            resolve(resp.code)
          } else {
            core.debug(
              JSON.stringify({
                message: resp?.message,
                details: resp?.details
              })
            )
            reject(resp?.message)
          }
        }
      )
      // eslint-disable-next-line github/no-then
      .catch(err => reject(err))
  })
}
|
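For anyone following along, the kind of tracing asked for above could be sketched as a small wrapper that logs before and after an awaited call. This is only an illustration: `traced` and its parameters are hypothetical names, not part of the hook, and the `BT` prefix simply mirrors the debug prefix used elsewhere in this thread.

```typescript
// Hypothetical helper (not part of the hook): log before and after an awaited
// call so that a hang shows up as a "pre" line with no matching "post" line.
async function traced<T>(
  label: string,
  fn: () => Promise<T>,
  log: (msg: string) => void = msg => console.log(msg)
): Promise<T> {
  log(`BT ${label}: pre`)
  try {
    const result = await fn()
    log(`BT ${label}: post`)
    return result
  } catch (error) {
    // A rejection is logged and re-thrown so the caller still sees it.
    log(`BT ${label}: failed: ${error}`)
    throw error
  }
}
```

If a call such as `traced('execPodStep', () => execPodStep(...))` prints the "pre" line but neither "post" nor "failed", the awaited promise never settled.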
@nikola-jokic TY for spending time on this! I'll try the patch. Wdym by inspecting the pod? What data are you looking for? LMK, and I'll dig it out. Some clarifications: The workflow job is running on a separate node than the runner. I'm exec:ing into the workflow to verify that the job has completed correctly. The pod is running, but idle. |
@nikola-jokic Re-ran with your patch, and same behavior -- the step hangs. Here is the context of the 'BT' logs:
Successful step:
Hanging step:
Never get to the "BT... post" log, and no exceptions. |
Did you try executing the script manually with these inputs? It does look from the log that the script hangs, but is it possible that the script does not exit with these inputs? Can you please try executing any other simple script? This does not indicate any problem with run script step especially since web socket does not throw. |
I kubectl-exec into the pod/container, and verify that the script has finished successfully (it's not running, and the logs from the script contain correct output), but for some reason the execPod never returns.
A simpler script, e.g. https://github.com/bjoto/linux/actions/runs/7356297121/job/20026212978 completes from a runner perspective. |
I have tried not using the promise, but from inspecting the source code here, it does seem that we should only catch the websocket error the way I did in the previous comment... Did you try changing the tty parameter? If the script works on the regular runner, is it possible that resource requirements are a problem for the job pod? Can you monitor the job pod to see if there are any interruptions? |
Progress! The step that hangs doesn't provide any output to stdout/stderr for ~1.5h. So, I tried adding a keepalive:
And the step succeeds! So it seems like the exec socket times out, without notifying the caller, similar to what SSH can experience. This seems to be a workaround -- ugly -- but still. |
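The keepalive idea can be sketched in TypeScript, assuming the long-running check is launched from a Node script inside the job container. `withKeepalive` and its parameters are hypothetical names, and this is not the snippet that was used in the workflow above; it only illustrates writing a periodic line to stdout so the exec stream is never idle.

```typescript
// Hypothetical helper: run a long, silent task while periodically emitting a
// heartbeat line, so the stream carrying stdout never sits idle.
async function withKeepalive<T>(
  task: () => Promise<T>,
  intervalMs: number,
  write: (line: string) => void = line => console.log(line)
): Promise<T> {
  const timer = setInterval(() => write('keepalive'), intervalMs)
  try {
    return await task()
  } finally {
    // Stop the heartbeat whether the task resolved or rejected.
    clearInterval(timer)
  }
}
```

The interval would need to be shorter than whatever idle timeout applies to the connection, which is not established in this thread.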
To summarize: this seems to be related to how kubectl-exec behaves. The workaround above (making sure that the channel is active) is sufficient for me to move forward with using ARC. A quick search suggests that kubectl-exec timeouts for long-running jobs are not uncommon. Potential fixes, like changing the kubelet config (https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1 streamingConnectionIdleTimeout and similar), might work, but AFAIU this is not something that can be easily changed on managed k8s like GKE/Autopilot. @nikola-jokic I'm open to closing this issue for now. I think that as long as we're depending on kubectl-exec we can experience issues like this. :-( |
I agree with you, this is really related to kubernetes and node client that we depend on... If the websocket does not throw, and callback is never called, there is nothing we can do on the hook. 😞 |
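One thing a caller can do in general when a promise may never settle is bound the wait with a deadline. The sketch below is a generic pattern, not something the hook does; `withTimeout` is a hypothetical name. It would not fix a silently dropped exec socket, and a deadline shorter than a legitimately long step would fail valid jobs; it only turns a silent hang into an explicit error.

```typescript
// Hypothetical helper: reject if the wrapped promise has not settled within
// `ms` milliseconds. The underlying operation is NOT cancelled.
function withTimeout<T>(promise: Promise<T>, ms: number, what: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${what} did not settle within ${ms} ms`)),
      ms
    )
    promise.then(
      value => {
        clearTimeout(timer)
        resolve(value)
      },
      error => {
        clearTimeout(timer)
        reject(error)
      }
    )
  })
}
```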
I'm running ARC in containerMode: kubernetes, and for some workflow steps, the step never completes.
This is deployed on GKE (Google K8s Engine) with Autopilot.
Workflow:
The last Run checks step never completes. I've done a build of the Runner that has additional logging, and I can verify that Command.RunScriptStep in packages/k8s/src/index.ts never completes -- i.e. the hook script /home/runner/k8s/index.js never returns.
When I exec into the workflow container, I can verify that the script has completed, and is not running any more.
The Run check step takes ~1h to complete.
Expected:
Helm setup: