
Kernel nanny proposal #14

Closed
wants to merge 2 commits into from

Conversation

takluyver
Member

As discussed at the dev meeting. There are a few TODOs which have not yet been decided. We can bikeshed about them, or whoever gets to implementing the relevant bits first can try their preferred option ;-).

Pinging @JanSchulz, who was interested in this for IRkernel logging.

@jankatins

jankatins commented Apr 18, 2016

Not sure if I understand the details, but currently the notebook isn't very good at shutting down the R kernel on Windows, because the R kernel is not a single process but rather a chain like R.exe -> cmd -> rterm.exe [see https://github.com/jupyter/jupyter_client/issues/104]. I'm not sure whether the "nanny" can detect such a thing without a heartbeat.

[Such things might happen even for python kernels, if you use a batch file with activate <env> & python <kernel startup line>, which is needed to get the correct PATH in a kernel...]

@takluyver
Member Author

I suspect it won't make much of a difference either way in that situation. Both currently and with the kernel nanny, it will send shutdown_request to ask the kernel to shut itself down, and if it doesn't shut down within some time period, it will terminate it more forcefully. I'd guess that second bit is where it goes wrong, since it only knows about the top-level process that it started.

Besides fiddling with the time we wait for the kernel to shut itself down, I'm not sure what we could do to improve that.
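
To make that concrete, here is a rough sketch of the flow described above, assuming the kernel was started with subprocess.Popen; `client.shutdown()` is a placeholder for whatever sends shutdown_request, not a real API:

```python
# Illustrative sketch only: `client` and its shutdown() method stand in
# for whatever sends shutdown_request over zmq; they are not a real API.
import subprocess

def shutdown_kernel(client, proc, timeout=5.0):
    client.shutdown()                # polite: ask the kernel to exit itself
    try:
        proc.wait(timeout=timeout)   # give it some time to comply
    except subprocess.TimeoutExpired:
        # Forceful fallback: this only terminates the top-level process
        # (e.g. R.exe or a batch file), not grandchildren like rterm.exe.
        proc.terminate()
        proc.wait()
```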

instructing the nanny to shut down the kernel.
* A new message type on the control channel from the frontend to the nanny,
instructing the nanny to signal/interrupt the kernel. (*TODO: Expose all Unix
signals, or just SIGINT?*)
Member

A signal_request message makes the most sense. I don't think there's a reason to limit to interrupt/term/kill, all of which we probably want.
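
Purely as an illustration of what that might look like, the content of such a message could be as simple as the sketch below; neither the message name nor the field is part of the current spec:

```python
# Hypothetical content for a signal_request message on the control channel.
signal_request_content = {
    "signum": 2,  # POSIX signal number, e.g. 2 for SIGINT
}
```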

Member Author

For Unix systems that certainly makes sense. For Windows, should we just pick some numbers to refer to the available ways we have of interrupting/stopping the kernel process?

Member

I think only one or two signals work on Windows reliably, but they are still integers, aren't they?

Member Author

AIUI Windows doesn't really have signals at all, but Python exposes certain similar operations through the same interface it uses for signals on Windows. The description of os.kill has some useful info:

https://docs.python.org/3/library/os.html#os.kill

We could quite reasonably expose the same set of options with the same meanings as Python does, of course.
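
A minimal sketch of how the nanny could dispatch such a request while mirroring Python's os.kill() semantics; the function name and the exact set of values accepted on Windows are assumptions for illustration:

```python
# Sketch: deliver a requested signal to the kernel process, following the
# os.kill() semantics linked above. On Windows, os.kill() only supports
# CTRL_C_EVENT / CTRL_BREAK_EVENT (for processes started in a new process
# group); any other value terminates the process via TerminateProcess.
import os
import signal
import sys

def deliver_signal(kernel_pid, signum):
    if sys.platform == "win32":
        allowed = {signal.CTRL_C_EVENT, signal.CTRL_BREAK_EVENT, signal.SIGTERM}
        if signum not in allowed:
            raise ValueError("unsupported signal on Windows: %r" % signum)
    os.kill(kernel_pid, signum)
```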


Given that the nanny process is going to run on the same machine as the kernel, it makes sense for the frontend to ask the nanny to interrupt the kernel via a message similar to shutdown_request; the nanny then interrupts the kernel process by sending the appropriate signal.

Member Author

Right, that's exactly how this will work. We're just trying to work out what form the message will take. If all the world was Unix, we'd almost certainly just call it signal_request, and pass a signal number or name. But things get a bit more complicated when we consider kernels running on Windows.


For the Windows problems, see here: jupyter/jupyter_client#104

@minrk
Member

minrk commented Apr 19, 2016

Thanks for writing this up, @takluyver!


When a frontend wants to start a kernel, it currently instantiates a `KernelManager`
object which reads the kernelspec to find how to start the kernel, writes a
connection file, and launches the kernel process. With this process, it will
Member

Perhaps for clarity:
"With this proposed process, the frontend will..."

Member Author

Thanks, tweaked

@willingc
Member

@takluyver Nicely designed and clearly written. I made a drawing by hand of the frontends, nanny, kernel, and channels; let me know if you would like a copy. 😄

- There will be a consistent way to start kernels without a frontend
(`jupyter kernel --kernel x`).
- Kernel stdout & stderr can be captured at the OS level, with real-time updates
of output.
Member

👍

@takluyver
Member Author

Thanks all!

@willingc, yes, it would be good to see your drawing, to check if the explanation conveyed what I was thinking clearly.

@willingc
Member

@takluyver Here's the link to the drawing's folder on Dropbox: https://www.dropbox.com/sh/kzc9bom60c9e57x/AAAWcdlGo8RZB9cklEv7jC2ua?dl=0

@takluyver
Member Author

Thanks, that looks good.

@willingc
Member

@takluyver Great. You detailed things out very clearly 🔑

advantages over the current situation, including:

- Kernels will no longer need to implement the 'heartbeat' for frontends to
check that they are still alive.


How would the nanny process check the kernel is alive?

Member Author

e.g. subprocess.Popen.poll(), but depending on how it's written, there may well be smarter ways. On Unix, the parent process is sent SIGCHLD when one of its children dies.
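
A quick sketch of both options; the kernel command line is just an example:

```python
# Sketch: two ways a nanny could notice the kernel dying without the
# kernel implementing any heartbeat socket. The command line is illustrative.
import signal
import subprocess

kernel = subprocess.Popen(
    ["python", "-m", "ipykernel_launcher", "-f", "connection.json"]
)

# Option 1: poll. Returns None while the child is alive, else its exit code.
def kernel_is_alive():
    return kernel.poll() is None

# Option 2 (Unix only): get notified as soon as a child process exits.
def _on_sigchld(signum, frame):
    if kernel.poll() is not None:
        print("kernel exited with code", kernel.returncode)

if hasattr(signal, "SIGCHLD"):  # not available on Windows
    signal.signal(signal.SIGCHLD, _on_sigchld)
```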

@n-riesco

I would suggest that this proposal is split into four proposals:

  1. a proposal to replace SIGINT.
  2. a proposal for kernels to capture the low level stderr and stdout streams and forward them to the frontend.
  3. a proposal to introduce the command jupyter kernel --kernel x
  4. a proposal (specific to IPython) that will use a nanny process to implement the first two proposals.

@takluyver
Member Author

a proposal (specific to IPython) that will use a nanny process to implement the first two proposals.

This is absolutely not specific to IPython - the proposal is for the nanny process to be used for all kernels.

Your 1 & 2 don't really make sense without a nanny process. 3 is doable, but it's a more incidental benefit. I don't see the benefit of splitting this up into smaller pieces: it's one change to the architecture that lets us do a number of useful things, which seems like exactly the right scope for a JEP.

This was also discussed at the in-person dev meeting, and while I don't want to suggest that it's closed for discussion, we did spend some time hashing out what we wanted, and I'd really hope that the remaining issues to work out are details, not the fundamental nature of the proposal.

@n-riesco

On 21/04/16 17:54, Thomas Kluyver wrote:

Your 1 & 2 don't really make sense without a nanny process.

I think the kernel is in a better position to handle 1 and especially 2 than an agnostic nanny process:

  • a kernel, if really needed, can implement its own nanny process to handle 1 and 2
  • the nanny process cannot determine the origin of stdout and stderr without the kernel's help
  • a kernel can always capture the low level stdout/stderr

@lbustelo
Contributor

@takluyver How about for kernels that are remote? I know this is not officially supported by the notebook server, but it is something that we've experimented with and could be a requirement in certain deployments.

@jasongrout
Member

@n-riesco - the idea in the proposal is that rather than having every kernel implement the capturing and signal/interrupt logic, we'd implement it once outside of the kernel and everyone automatically benefits. As for capturing output, that's opt-in for a kernel, so a kernel absolutely can do its own input/output instead of having the nanny handle it. The nanny makes it much easier to have this automatically taken care of.

@jasongrout
Member

Another concern brought up in the meeting was the latency introduced in forwarding messages through the nanny. Can you mention that in the proposal? I thought @minrk said he might run some tests to get some idea about how much the latency on messages would be impacted by this proposal.

@n-riesco

n-riesco commented Apr 21, 2016

@takluyver I'm sorry for suggesting a proposal split.

Here are 2 suggestions to the current proposal:

  • make the nanny an opt-in feature declared in the kernel spec (currently, IJavascript can run in the official Docker images for Node.js; if the nanny were made compulsory, then the nanny and all its dependencies would have to be installed in the Docker container).
  • consider extending the jupyter protocol with a message for the kernel to record log messages (so that frontends have access to them).

@jasongrout
Member

@minrk - how heavy-weight do you see the nanny being? I imagined either a python file, or a lightweight OS-specific C program with zeromq as a dependency.

@takluyver
Member Author

@lbustelo The idea is that the nanny and the kernel are always running together on the same system. They may both be remote from the frontend (e.g. the notebook server), and this will work much like it already does - zmq messages sent over the network. One of the key advantages of this is that it will allow interrupting remote kernels, which is currently impossible.

@n-riesco @jasongrout I definitely see the nanny as being a lightweight thing with few dependencies. In the first instance, it will likely be written in Python, because that's what we can write and debug most effectively, but I may later use it as an excuse to brush up on a language like Rust or Go, which will make it even lighter.

consider extending the jupyter protocol with a message for the kernel to record log messages (so that frontends have access to them).

The logging system is what you'll want to rely on to debug problems with the messaging, so I want it to be a) a separate mechanism, and b) as simple as possible, like 'open this file and write to it'. We can still arrange things so that the frontend can get the kernel logs by making the log file a named pipe, and having the frontend read it. I like this Unix-y approach here, because it provides a lot of flexibility while requiring very little complexity in the kernels.
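
A rough, Unix-only sketch of the named-pipe arrangement; the path and function name are made up for illustration:

```python
# Sketch: the kernel is told "open this file and write log lines to it";
# the frontend (or nanny) reads the other end of the FIFO in real time.
import os

log_path = "/tmp/kernel-log.fifo"  # illustrative per-kernel log location
if not os.path.exists(log_path):
    os.mkfifo(log_path)  # Unix only

def tail_kernel_log():
    # Opening a FIFO for reading blocks until a writer (the kernel) opens it;
    # lines then arrive as soon as they are written.
    with open(log_path) as pipe:
        for line in pipe:
            print("[kernel log]", line, end="")
```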

@n-riesco

On 22/04/16 14:32, Thomas Kluyver wrote:

[...] We can still arrange things so that the frontend can get the kernel logs by making the log file a named pipe, and having the frontend read it.

How would that work in the case of remote kernels?

@takluyver
Member Author

There are advantages either way; this way allows the nanny to be running as a service, so remote frontends have a standard way to connect, see which kernel specs are available, and start one. Neither of us felt strongly, but I want to pick something so I can go and prototype it.

Even with nanny-per-system, I'd still like to provide an entry point to start the nanny and immediately start one kernel, for the cases where you're spinning up a VM or container to run a single kernel.

@jasongrout
Member

I want to pick something so I can go and prototype it.

Definitely +1!

@lbustelo
Contributor

lbustelo commented Dec 9, 2016

When thinking about remote frontends, it starts to feel like this may be something that belongs in kernel gateway. See https://github.com/jupyter/kernel_gateway_demos/tree/master/nb2kg.

@takluyver
Member Author

The proposed 'kernel nanny', now that we're leaning towards one nanny managing multiple kernels, is indeed quite like kernel gateway. The key difference is that this would expose a ZeroMQ interface, rather than an HTTP/websockets interface.

  1. Do we still want to do this, or should we always use HTTP for managing remote kernels? I see some potential advantages of exposing a ZMQ interface, but I don't know if they're compelling enough:
  • Security could be simpler in some situations as it doesn't rely on SSL certificates.
  • There may be more (or better supported) zmq libraries for some languages than websocket client libraries, as the main purpose of websockets is to talk to a browser. I haven't surveyed zmq & websocket client libraries, though.
  2. If we do want a ZMQ interface for managing kernels, would it share much code with kernel gateway? Would it make sense for it to be part of the kernel gateway project?

@tristanz

As a consumer, I find ZMQ and custom encryption to be a cost compared to WebSockets + SSL.

@takluyver
Member Author

Right, there are definitely situations where that's easier. But I could imagine other situations where dealing with SSL certificates is more complex than a simple shared secret for the connection.

@lbustelo
Contributor

@takluyver Check out the support in kernel gateway for personalities. It might be possible to enhance it to expose ZMQ.

@parente
Member

parente commented Feb 17, 2017

The kernel gateway is using the notebook server code as a package and so only knows how to communicate with the outside world using HTTP/Websocket based protocols. Adding a zmq personality would be quite an undertaking.

Also, I'm not sure what the future holds for KG in general. Notebook has subsumed some of its features (e.g., token auth, kernel activity monitoring) and jupyter_server is on the enhancement proposal table for separating the frontend and backend bits of notebook.

@minrk
Member

minrk commented Feb 17, 2017

websockets definitely have some nice characteristics when talking over remote networks (dealing with all of the connections over a single port is a big one). I think supporting additional transports for remote kernels is a bit of a separate discussion, though. Adding the KernelNanny should make such a proposal easier to implement, since we would know that we always have a process running next to the kernel which could serve as that multiplexer. Part of the point of the nanny proposal is that it is transparent to both the kernel and the client.

The most basic functionality that I want from KernelNanny is the original idea in IPEP 12:

  • KernelClient gets the interrupt/restart/etc methods of KernelManager that can be expected to work for all kernels.

Since the kernel client API is all zmq, it makes the most sense to me for these to be zmq request/reply messages like all the rest, rather than adding additional HTTP requests to an otherwise zmq API. If we went with HTTP, these would presumably be regular HTTP requests, though, not websockets, so the bar is lower for clients to talk to them. It does seem internally inconsistent to have some zmq requests and some HTTP requests for the same API, though.

To me, starting remote kernels, which is solved by KG, is not part of the nanny proposal. But they do creep closer together when we make the nanny a singleton kernel-providing service rather than a simple manager of one kernel, since it has to gain start/list APIs in addition to the more basic stop/restart.

@takluyver
Member Author

I was roughly thinking that with local kernels, the server would integrate the nanny functionality (i.e. it would keep handling interrupts as it currently does), and the remote case would be handled by nb2kg (or a ZMQ equivalent if we wrote one). I suspect it would be no harder to write a KernelManager/KernelClient pair wrapping the HTTP/websockets interface than to write a ZMQ kernel gateway.

Maybe this points back to the nanny being per kernel, if we've already got KG as the multiple kernel manager.

I think I need to think more carefully about what situations we actually want remote kernels in, and how we manage them (e.g. putting kernels in docker lends itself to one kernel per container; does the nanny process need to be inside the container with the kernel? Or can it do what it needs from outside?).

@minrk
Member

minrk commented Feb 21, 2017

I was roughly thinking that with local kernels, the server would integrate the nanny functionality (i.e. it would keep handling interrupts as it currently does)

I think that makes sense. The QtConsole and jupyter_console and all other entrypoints would also need to run the nanny or talk to an existing one.

putting kernels in docker lends itself to one kernel per container; does the nanny process need to be inside the container with the kernel? Or can it do what it needs from outside?

I think it can go either way. The simplest version is the nanny in the container with the kernel, where there's really no awareness that docker is involved. The more sophisticated version is a "DockerKernelNanny", akin to the DockerSpawner in JupyterHub, where 'raw' kernels are in containers and the nanny does all of its activities via the docker API.

@thomasjm

Hi all -- sorry to comment on an old thread, but I'm hoping to get clarification about something...what is the current status of the deprecation of kernel heartbeats? There are some mentions of deprecating the heartbeat in this proposal, but the latest version of the messaging spec that I can find still contains them.

This is troubling me because I just noticed a kernel (gophernotes) which seems to have dropped heartbeat support in July 2017. I somehow missed the memo about this whole thing, so could someone let me know if heartbeats are dead and the nanny process is how things work now? Is there another source of documentation/truth besides http://jupyter-client.readthedocs.io ? Thanks!

@takluyver
Member Author

No problem!

The Jupyter notebook doesn't rely on heartbeats, and since that's the interface most people use, some kernels only worry about supporting it. We could make all interfaces work without the heartbeat so long as the kernel runs on the same machine (which it normally does). I don't remember which ones already work like this.

Getting rid of the heartbeat entirely was waiting on this proposal because it would have provided another way to tell if a remote kernel is still alive.

@thomasjm

Got it, thanks! Is there some official way to keep up-to-date on the latest version of the messaging spec? Speaking as someone working on an alternate Jupyter frontend, dropping heartbeats seems like a major change that would merit a new major version number on the "Messaging in Jupyter" page...

@takluyver
Member Author

If we ever do officially drop them, that would certainly be a new protocol version. For now, kernels are theoretically expected to implement it, but in typical cases there's no need, so many don't. We should probably clarify that in the messaging doc.

@thomasjm

Cool, thanks for the clarification. Tagging @dwhitena so he sees these comments.

thomasjm referenced this pull request in gopherdata/gophernotes Oct 25, 2017
@dwhitena

Thanks @thomasjm. We will work on supporting this in gophernotes.

@sushantd195

sushantd195 commented Jun 18, 2022

a proposal for kernels to capture the low level stderr and stdout streams and forward them to the frontend.

This is still an issue, as seen in the attached screenshot. I'm on the latest IRkernel and reticulate.

[screenshot attached]

@cladosphaero

cladosphaero commented Oct 14, 2022

Hi @takluyver @jasongrout was this ever implemented? It looks like OS-level output is still suppressed by Jupyter, as evidenced by reticulate not displaying Python stdout in IRkernel. If not, do you have any suggestions for manually rerouting OS-level output to Jupyter frontend?

@krassowski
Member

There is an adjacent pre-proposal which addresses some of the same issues (but not IRkernel logging issue) over at #117.

@Zsailer
Member

Zsailer commented Mar 4, 2024

Hi all 👋, Zach here from the @jupyter/software-steering-council.

We're working through old JEPs and closing proposals that are no longer active or may not be relevant anymore. Under Jupyter's new governance model, we have an active Software Steering Council which reviews JEPs weekly. We are catching up on the backlog now. Since there has been no active discussion on this JEP in a while, I'd propose we close it here (we'll leave it open for two more weeks in case you'd like to revive the conversation). If you would like to re-open the discussion after we close it, you are welcome to do that too.
