New Supervisor (Kallichore) failed to start in smoke test #5337
Comments
I also ran into this error today on Ubuntu, so it appears to not be isolated to Windows.
@midleman Did you see this error interactively, or in a smoke test run? If the latter, could you link it? (So far I've been unable to reproduce on Windows.)
In a smoke test run, but it's on my branch, let me grab the logs ...
Sorry, I got confused looking at failures. This one was on Ubuntu 22 (as the CI link correctly points to), so ignore the Windows part.
This scenario passed in my smoke test run with Kallichore enabled so perhaps this is timing related? Some notes from an initial look:
So what's interesting here is that Kallichore starts successfully the first time, but after a project switch the IDE restarts and Kallichore doesn't.
This change improves diagnostics and logging for (hopefully rare) cases in which the kernel supervisor _itself_ cannot start. This is distinct from the cases where R or Python can't start. So far we've seen just one of these in the wild, as a result of running on an unsupported OS.

The approach is to borrow the wrapper script technique from the Jupyter Adapter (formerly used to invoke the kernels themselves). The wrapper script acts as a sort of supervisor for the supervisor; it eats the output of the supervisor process and writes it to a file. If the supervisor exits unexpectedly at startup, the output file is written to the log channel, and the user is directed there to view errors.

As an additional benefit, this runs the supervisor under a `bash` process on Unix-alikes, so any environment variables or configuration set up in `.bashrc` (etc) will now be available to the supervisor.

Addresses #5611. May help us figure out #5337.

### QA Notes

An easy way to test this is to replace your `kcserver` binary with a shell script that emits some nonsense and then exits immediately with a nonzero status code. If you're feeling ambitious, you could also test this on the OS named in #5611.

Also, did you know that the [longest worm in the world](https://en.wikipedia.org/wiki/Lineus_longissimus) can reach up to 55 meters? Crazy.
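The wrapper idea described above can be sketched in a few lines. This is only an illustration and not the Positron implementation (which, per the description, uses a wrapper script run under `bash` on Unix-alikes); here Python's `subprocess` stands in for the script. The function name `launch_with_capture`, the `startup_window` parameter, and the fake supervisor command are all hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def launch_with_capture(command, output_file, startup_window=1.0):
    """Start `command` with stdout/stderr redirected to `output_file`.

    Returns (proc, output). `output` is the captured text if the process
    exited within `startup_window` seconds, or None if it is still running.
    """
    with open(output_file, "wb") as log:
        # The child inherits its own copy of the file handle, so the
        # parent can close its copy as soon as the process is spawned.
        proc = subprocess.Popen(command, stdout=log, stderr=subprocess.STDOUT)
    try:
        proc.wait(timeout=startup_window)
    except subprocess.TimeoutExpired:
        # Still running after the startup window: treat as started.
        return proc, None
    # Exited during startup: hand back whatever it wrote so the caller
    # can surface it in a log channel.
    return proc, Path(output_file).read_text(errors="replace")


if __name__ == "__main__":
    # Hypothetical stand-in for a broken supervisor binary: it emits
    # some nonsense and exits immediately with a nonzero status.
    fake_supervisor = [
        sys.executable,
        "-c",
        "import sys; print('nonsense'); sys.exit(3)",
    ]
    with tempfile.TemporaryDirectory() as tmp:
        proc, output = launch_with_capture(
            fake_supervisor, Path(tmp) / "supervisor.log", startup_window=30
        )
        if output is not None:
            print(f"supervisor exited with code {proc.returncode}:")
            print(output)
```

The fake supervisor mirrors the QA suggestion of swapping in a program that prints nonsense and exits with a nonzero code; a caller would check `proc.returncode` and forward `output` to wherever errors are shown.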
A real user has run into this here: #5910
Here's a recent hit on this issue
Here are the logs from another failure this morning.
If we're still seeing a generic timeout after adf5792 then it means that the output file containing the supervisor's standard output is not even getting created. So this could be a very early failure that is resulting in the supervisor never even launching (let alone emitting any output). In the past I've tried debugging this by making the terminal visible so that we can see what's going on during launch, but that screws up the tests because they are confused by the presence of a terminal where one is not expected. Maybe worth another try, though, since I can't think of any other way to get visibility into what's happening.
This change addresses an issue frequently seen in tests wherein the kernel supervisor appears to fail to connect at startup. After several rounds of debugging, it eventually became clear that there was no actual reconnection failure; the problem was simply that the CI machines (especially on Windows hardware) are very slow, and the client gave up retrying before the supervisor process was fully started.

The fix is to wait much longer before giving up, and to base the number of retries on wall clock time rather than attempts. Formerly, we could exhaust retries in as little as 1.5-2 seconds; now we wait up to 10 seconds.

Here are 2 e2e test runs from a branch with this change:
https://github.com/posit-dev/positron/actions/runs/12934857967
https://github.com/posit-dev/positron/actions/runs/12934185231

Addresses #5337.

### QA Notes

This change is primarily intended to address the issue in CI and shouldn't have much impact on the product; the only downside of this change is that now if there really _is_ some issue that causes a connectivity failure, it takes 10 seconds for it to show up instead of 2. These situations should not be common in the wild, since practically all of the issues we have seen manifest as an error starting/launching the supervisor, which we can detect without waiting for a timeout.
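A retry loop bounded by wall clock time rather than by attempt count might look roughly like the sketch below. It is not the code from the change itself: `connect_with_deadline`, `try_connect`, and the 0.25 second interval are assumptions made for illustration; only the 10 second ceiling comes from the description above. The clock and sleep functions are injectable so the behavior can be checked without real waiting.

```python
import time


def connect_with_deadline(
    try_connect,
    timeout_s=10.0,
    interval_s=0.25,
    clock=time.monotonic,
    sleep=time.sleep,
):
    """Call `try_connect` until it succeeds or `timeout_s` has elapsed.

    Returns (result, attempts). Raises TimeoutError once the deadline
    has passed, however many attempts that took.
    """
    deadline = clock() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        try:
            return try_connect(), attempts
        except ConnectionError as err:
            # Give up based on elapsed time, not on the attempt count,
            # so a slow machine gets the full window to finish starting.
            if clock() >= deadline:
                raise TimeoutError(
                    f"no connection after {timeout_s}s ({attempts} attempts)"
                ) from err
            sleep(interval_s)
```

With a fixed attempt budget, fast-failing connection attempts can burn through every retry in a second or two; tying the loop to a deadline keeps the total wait predictable regardless of how quickly each attempt fails.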
Verified Fixed
Positron Version(s): 2025.02.0-112
Test scenario(s): Re-ran the full test suite and repeated runs of the New Project Wizard tests on Windows many, many times and have been unable to reproduce.
Notes: Since @jmcphers extended the supervisor startup time, we haven’t been able to consistently reproduce this issue. I’ve also added logging of the Terminal Contents at the end of each test. If the issue crops up again, we’ll reopen this ticket with the additional information and address it accordingly... but hoping we don't have to! 😄
System details:
Positron and OS details:
Positron: main, after Kallichore set as default supervisor
Operating System: Ubuntu 22
Interpreter details:
Describe the issue:
In CI smoke test, the test-explorer test failed due to Kallichore not starting.
See CI Run
And another CI Smoke Run
This is the first time we've seen this error, so we're not sure how prevalent it is.
Expected or desired behavior:
Kallichore to start up
Were there any error messages in the UI, Output panel, or Developer Tools console?