
New Supervisor (Kallichore) failed to start in smoke test #5337

Closed
jonvanausdeln opened this issue Nov 12, 2024 · 11 comments

@jonvanausdeln
Contributor

jonvanausdeln commented Nov 12, 2024

System details:

Positron and OS details:

Positron: main after kallichore set as default supervisor
Operating System: Ubuntu 22

Interpreter details:

Describe the issue:

In CI smoke test, the test-explorer test failed due to Kallichore not starting.

[Image: screenshot of the test failure]

See CI Run
And another CI Smoke Run

This is the first time we've seen this error, so we're not sure how prevalent it is.

Expected or desired behavior:

Kallichore to start up

Were there any error messages in the UI, Output panel, or Developer Tools console?

@midleman
Contributor

I also ran into this error today on Ubuntu, so it appears to not be isolated to Windows.

@jmcphers
Collaborator

@midleman Did you see this error interactively, or in a smoke test run? If the latter, could you link it?

(so far I've been unable to reproduce on Windows)

@midleman
Contributor

In a smoke test run, but it's on my branch; let me grab the logs ...

@jonvanausdeln
Contributor Author

Sorry, I got confused looking at the failures. This one was on Ubuntu 22 (as the CI link correctly points to), so ignore the Windows part.

@jmcphers
Collaborator

This scenario passed in my smoke test run with Kallichore enabled, so perhaps this is timing-related?

Some notes from an initial look:

2024-11-12 14:49:25.336 [info] Positron Kernel Supervisor activated
Starting Kallichore server /home/runner/work/positron/positron/extensions/positron-supervisor/resources/kallichore/kcserver on port 55683
Kallichore 0.1.18 server online with 0 sessions
Kallichore server started in 273ms
...
14:49:32 [INFO] [client r-97ae99a4-0] Websocket closed by client

...

2024-11-12 14:49:38.638 [info] Positron Kernel Supervisor activated
Starting Kallichore server /home/runner/work/positron/positron/extensions/positron-supervisor/resources/kallichore/kcserver on port 51542
Waiting for Kallichore server to start (attempt 1, 589ms)
Waiting for Kallichore server to start (attempt 6, 853ms)
Waiting for Kallichore server to start (attempt 11, 1130ms)
Waiting for Kallichore server to start (attempt 16, 1391ms)
Kallichore server did not start after 1599ms

So what's interesting here is that Kallichore starts successfully the first time, but after a project switch the IDE restarts and Kallichore doesn't.

@jmcphers jmcphers self-assigned this Nov 12, 2024
@jmcphers jmcphers added this to the 2024.12.0 Pre-Release milestone Nov 12, 2024

jmcphers added a commit that referenced this issue Dec 12, 2024
This change improves diagnostics and logging for (hopefully rare) cases
in which the kernel supervisor _itself_ cannot start. This is distinct
from the cases where R or Python can't start. So far we've seen just one
of these in the wild, as a result of running on an unsupported OS.

The approach is to borrow the wrapper script technique from the Jupyter
Adapter (formerly used to invoke the kernels themselves). The wrapper
script acts as a sort of supervisor for the supervisor; it eats the
output of the supervisor process and writes it to a file. If the
supervisor exits unexpectedly at startup, the output file is written to
the log channel, and the user is directed there to view errors.

As an additional benefit, this runs the supervisor under a `bash`
process on Unix-alikes, so any environment variables or configuration
set up in `.bashrc` (etc) will now be available to the supervisor.

Addresses #5611.

May help us figure out #5337.

### QA Notes

An easy way to test this is to replace your `kcserver` binary with a
shell script that emits some nonsense and then exits immediately with a
nonzero status code. If you're feeling ambitious, you could also test
this on the OS named in #5611.

Also, did you know that the [longest worm in the
world](https://en.wikipedia.org/wiki/Lineus_longissimus) can reach up to
55 meters? Crazy.
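
For illustration only, here is a minimal TypeScript sketch of the flow the commit above describes. This is not the actual positron-supervisor code (the real wrapper is a bash script), and all names here (`startSupervisor`, `wrapperPath`, `outputPath`, `logChannel`) are hypothetical: the supervisor is launched under a bash wrapper that redirects its output to a file, and if the process exits unexpectedly at startup, that file is surfaced in the log channel.

```ts
import { spawn } from 'child_process';
import { existsSync, readFileSync } from 'fs';

// Hypothetical sketch only -- names and signatures are illustrative, not the
// real positron-supervisor code. The wrapper script is assumed to redirect the
// supervisor's stdout/stderr into `outputPath` and then run kcserver.
function startSupervisor(
  wrapperPath: string,
  kcserverPath: string,
  outputPath: string,
  logChannel: { appendLine(msg: string): void },
) {
  // Running under bash also picks up environment set in .bashrc and friends.
  const child = spawn('/bin/bash', [wrapperPath, outputPath, kcserverPath]);

  child.on('exit', (code) => {
    if (code !== 0) {
      // The supervisor died at startup: dump whatever it printed to the log
      // channel instead of failing later with a generic connection timeout.
      const output = existsSync(outputPath)
        ? readFileSync(outputPath, 'utf8')
        : '(no output file was created)';
      logChannel.appendLine(`Kernel supervisor exited with code ${code}:`);
      logChannel.appendLine(output);
    }
  });

  return child;
}
```

Capturing the output in a file (rather than only piping it live) means the diagnostics are still available to dump into the log after the process has already died, which is the point of the wrapper-for-the-wrapper approach described above.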
@juliasilge
Contributor

A real user has run into this here: #5910

@jonvanausdeln
Contributor Author

Here's a recent hit on this issue
https://github.com/posit-dev/positron/actions/runs/12895509300

jmcphers added a commit that referenced this issue Jan 22, 2025
…5983)

This change is intended to help troubleshoot #5337, which continues to
show up occasionally in CI but for which no one has found a reliable
local repro.

It shows any output emitted by the supervisor alongside the connection
timeout message, for easy debugging.
@jonvanausdeln
Contributor Author

Here are the logs from another failure this morning.

logs-new-project-python.test.ts.zip

@jmcphers
Collaborator

If we're still seeing a generic timeout after adf5792, then it means that the output file containing the supervisor's standard output is not even getting created. So this could be a very early failure that results in the supervisor never even launching (let alone emitting any output).

In the past I've tried debugging this by making the terminal visible so that we can see what's going on during launch, but that screws up the tests because they are confused by the presence of a terminal where one is not expected. Maybe worth another try, though, since I can't think of any other way to get visibility into what's happening.

jmcphers added a commit that referenced this issue Jan 23, 2025
This change addresses an issue frequently seen in tests wherein the
kernel supervisor appears to fail to connect at startup. After several
rounds of debugging, it eventually became clear that there was no actual
reconnection failure; the problem was simply that the CI machines
(especially on Windows hardware) are very slow, and the client gave up
retrying before the supervisor process was fully started.

The fix is to wait much longer before giving up, and to base the number of
retries on wall-clock time rather than attempts. Formerly, we could
exhaust retries in as little as 1.5-2 seconds; now we wait up to 10
seconds.

Here are 2 e2e test runs from a branch with this change:

https://github.com/posit-dev/positron/actions/runs/12934857967
https://github.com/posit-dev/positron/actions/runs/12934185231

Addresses #5337.

### QA Notes

This change is primarily intended to address the issue in CI and
shouldn't have much impact on the product; the only downside of this
change is that now if there really _is_ some issue that causes a
connectivity failure, it takes 10 seconds for it to show up instead of
2.

These situations should not be common in the wild, since practically all
of the issues we have seen manifest as an error starting/launching the
supervisor, which we can detect without waiting for a timeout.
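
As a rough illustration of the wall-clock-based retry described in the commit above, here is a sketch with hypothetical names (not the shipped code); `tryConnect` stands in for whatever call checks that the server is answering:

```ts
// Hypothetical sketch: retry until a wall-clock deadline passes instead of a
// fixed attempt count, so slow CI machines get the full startup window.
async function waitForServer(
  tryConnect: () => Promise<boolean>, // illustrative: resolves true once the server responds
  timeoutMs = 10_000,
  intervalMs = 100,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  let attempts = 0;
  while (Date.now() < deadline) {
    attempts++;
    if (await tryConnect()) {
      return;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(
    `Kallichore server did not start after ${timeoutMs}ms (${attempts} attempts)`,
  );
}
```

The attempt count is kept only for logging; the loop ends when the deadline elapses, which matches the behavior the commit describes.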
@midleman
Contributor

Verified Fixed

Positron Version(s): 2025.02.0-112
OS Version(s): Ubuntu/Windows

Test scenario(s)

Re-ran the full test suite and repeated the New Project Wizard tests on Windows many, many times and have been unable to reproduce.

Notes

Since @jmcphers extended the supervisor startup time, we haven’t been able to consistently reproduce this issue. I’ve also added logging of the Terminal Contents at the end of each test. If the issue crops up again, we’ll reopen this ticket with the additional information and address it accordingly... but hoping we don't have to! 😄
