Kernel nanny proposal #14

Closed
wants to merge 2 commits into from
117 changes: 117 additions & 0 deletions kernel-nanny.md
@@ -0,0 +1,117 @@
# Kernel 'nanny' processes

## Summary

We propose to start Jupyter kernels through a 'nanny' process, which will always
be running on the same machine as its associated kernel. This offers various
advantages over the current situation, including:

- Kernels will no longer need to implement the 'heartbeat' for frontends to
check that they are still alive.

How would the nanny process check the kernel is alive?

Member Author

e.g. subprocess.Popen.poll(), but depending on how it's written, there may well be smarter ways. On Unix, the parent process is sent SIGCHLD when one of its children dies.

- We will be able to interrupt remote kernels (SIGINT cannot be sent over the network)

How about a message like shutdown_request?

Member Author

shutdown_request is sent to tell the kernel to shut itself down. However, if it's currently executing code, it won't process that message - or any other message - until it finishes. Interrupting is how we break it out of execution.

That's a kernel implementation detail. IJavascript doesn't suffer from that problem (suffers from others, though 😛 ).

Member

From the perspective of writing a native app, we'll do this directly: sending shutdown_request first followed by shutting the kernel process down directly.

- There will be a consistent way to start kernels without a frontend
(`jupyter kernel --kernel x`).

Perhaps I'm misunderstanding the proposal. Is `jupyter kernel` the nanny? That is, is `jupyter kernel` kernel-agnostic, and will it be able to launch any kernel?

Member Author

jupyter kernel will be a way to launch the nanny, yes (it may not be called exactly that, but that's the idea).

Member

This brings me back to what I was thinking when we talked about the kernel nanny, which is that we should consider a general daemon for launching kernels on a system, similar to docker's interface (CLI + API).

- Kernel stdout & stderr can be captured at the OS level, with real-time updates
of output.

Member

👍


## The basics

When a frontend wants to start a kernel, it currently instantiates a `KernelManager`
object which reads the kernelspec to find how to start the kernel, writes a
connection file, and launches the kernel process. With the proposed changes, it will
instead launch the kernel nanny on the machine where the kernel is to run, and
the nanny will be responsible for creating the connection file and launching
the kernel process.
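
As a rough illustration of that hand-off, the following sketch shows a nanny writing the connection file and then starting the kernel. It is not the proposed implementation: `launch_kernel` and the contents of `connection_info` are hypothetical, and the only convention taken from existing kernelspecs is the `{connection_file}` placeholder in `argv`.

```python
import json
import os
import subprocess
import sys
import tempfile


def launch_kernel(argv_template, connection_info):
    """Write a connection file, then start the kernel process.

    argv_template: the kernelspec's argv list, in which the literal string
    "{connection_file}" stands for the connection file path.
    connection_info: dict of connection details (hypothetical contents).
    """
    # In this proposal the nanny, not the frontend, creates the connection file.
    runtime_dir = tempfile.mkdtemp(prefix="nanny-")
    connection_file = os.path.join(runtime_dir, "kernel.json")
    with open(connection_file, "w") as f:
        json.dump(connection_info, f)

    # Substitute the path into the command line taken from the kernelspec.
    argv = [arg.replace("{connection_file}", connection_file)
            for arg in argv_template]
    kernel = subprocess.Popen(argv)
    return kernel, connection_file


if __name__ == "__main__":
    # A stand-in "kernel" that only parses its connection file and exits.
    fake_kernel = [sys.executable, "-c",
                   "import json, sys; json.load(open(sys.argv[1]))",
                   "{connection_file}"]
    kernel, path = launch_kernel(fake_kernel, {"transport": "tcp", "ip": "127.0.0.1"})
    kernel.wait()
```

Because the nanny holds the `Popen` object, it is also the natural place to watch for the kernel exiting and to deliver signals to it.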

**Rejected alternative:** One kernel nanny process per machine, able to start
multiple kernels. This would be more complex, but we may come back to it later
if the overhead of one nanny per kernel is too much.

Member

Ah, so this now takes the one-nanny-per-kernel approach. Glad you cleared this up.

## Socket connections

Currently, the frontend connects to five sockets to communicate with the kernel:

* Shell
* Control (priority, used for shutdown)
* Iopub (kernel to frontend only, for output)
* Stdin (used to request input from the frontend)
* Heartbeat

Of these, shell and stdin will remain connected directly between the kernel and
the frontend. Control and iopub (see output capturing) will be connected through
the nanny, i.e. each channel will have one socket for communications between
the frontend and the nanny, and a second socket for communications between the
nanny and the kernel. (*TODO: What are these called in the connection file? Or
do we have two connection files?*) The heartbeat will only be between the
frontend and the nanny, to detect situations such as network failures.

## Messaging changes

* A new message type on the control channel from the frontend to the nanny,
instructing the nanny to shut down the kernel.
* A new message type on the control channel from the frontend to the nanny,
instructing the nanny to signal/interrupt the kernel. (*TODO: Expose all Unix
signals, or just SIGINT?*)

Member

A signal_request message makes the most sense. I don't think there's a reason to limit to interrupt/term/kill, all of which we probably want.

Member Author

For Unix systems that certainly makes sense. For Windows, should we just pick some numbers to refer to the available ways we have of interrupting/stopping the kernel process?

Member

I think only one or two signals work on Windows reliably, but they are still integers, aren't they?

Member Author

AIUI Windows doesn't really have signals at all, but Python exposes certain similar operations through the same interface it uses for signals on Windows. The description of os.kill has some useful info:

https://docs.python.org/3/library/os.html#os.kill

We could quite reasonably expose the same set of options with the same meanings as Python does, of course.

Given that the nanny process is going to run in the same machine as the kernel, it makes sense that the nanny process is asked to interrupt the kernel by means of a message similar to shutdown_request, then the nanny process interrupts the kernel process by sending the appropriate signal.

Member Author

Right, that's exactly how this will work. We're just trying to work out what form the message will take. If all the world was Unix, we'd almost certainly just call it signal_request, and pass a signal number or name. But things get a bit more complicated when we consider kernels running on Windows.

For the windows problems, see here: jupyter/jupyter_client#104

* Heartbeat becomes a broadcast signal from the nanny to all connected frontends,
rather than a REQ/REP pattern (which was only the case before because pyzmq
makes it easy to echo messages without grabbing the GIL).
* New broadcast message from nanny to frontends when kernel dies unexpectedly,
including exit status.
* New end of output message, from nanny to frontends? (*TODO: yes/no?*)
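
To make the signal and death-notification items above more concrete, here is a minimal sketch of the nanny side. The message shapes are assumptions: `signal_request` is the name suggested in the review comments, while `signal_reply`, `kernel_died`, and the `signum` and `exit_status` fields are hypothetical. It uses `Popen.send_signal()` and `Popen.poll()`, so on Windows only the operations Python exposes there would be available.

```python
import signal
import subprocess
import sys


def handle_signal_request(kernel, msg):
    """Deliver the signal named in a (hypothetical) signal_request message."""
    signum = msg["content"]["signum"]
    if kernel.poll() is not None:
        # The kernel has already exited, so there is nothing to signal.
        return {"msg_type": "signal_reply",
                "content": {"status": "error", "evalue": "kernel is not running"}}
    kernel.send_signal(signum)
    return {"msg_type": "signal_reply", "content": {"status": "ok"}}


def death_notice(kernel):
    """Return a broadcast message if the kernel has exited, else None."""
    exit_status = kernel.poll()  # None while the process is still running
    if exit_status is None:
        return None
    return {"msg_type": "kernel_died", "content": {"exit_status": exit_status}}


if __name__ == "__main__":
    # A stand-in kernel that just sleeps until it is signalled.
    kernel = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    request = {"msg_type": "signal_request", "content": {"signum": signal.SIGTERM}}
    print(handle_signal_request(kernel, request))
    kernel.wait()
    print(death_notice(kernel))
```

A real nanny would more likely react to SIGCHLD or wait on the process handle than poll, as noted in the review comments. Polling is used here only to keep the sketch short.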

## Output capturing

In IPython, we capture stdout/stderr at the Python level (sys.std*). Code which
writes to stdout/stderr at a lower level (e.g. C extensions) will send its output
to the terminal where the frontend was started, instead of to the frontend.
Many other kernels suffer from similar issues.

We know of tricks using `dup2` to redirect the low-level file handles within the
kernel, but we don't want each kernel to reimplement this, and it is not
possible on Windows.

To this end, when the kernel nanny starts the kernel, it will be able to create
stdout and stderr as pipes, and turn data read from them into *stream* messages
to be sent to the frontend via the iopub channel. However, this may make
debugging difficult or show unwanted output if kernel authors are using the
terminal to debug the kernel implementation. Therefore, output capturing will
only be enabled if the kernel opts in via its `kernel.json` specification:

"capture_stdstreams": true

Member

A problem that this introduces that we should address is that now kernels cannot write to the terminal at all - there is no way for kernels to have logging without going straight to a file.

Member

(duh, I should finish reading, I see # Kernel logging two seconds down)

other potential problems:

  • kernels echoing writes to stdout/stderr from the higher level to the lower level
  • kernel output cannot be linked to an execution cell, this will cause problems when the user selects the frontend option to run all the cells of a notebook

Member Author

> kernels echoing writes to stdout/stderr from the higher level to the lower level

Can you expand on this? I don't follow what you mean.

> kernel output cannot be linked to an execution cell, this will cause problems when the user selects the frontend option to run all the cells of a notebook

That shouldn't be a problem: the notebook does that by queueing an execution request for each cell. It has to do that for the output from each cell to go to the right place.

For instance, when console.log('Hello, World!') was executed, earlier versions of IJavascript would send an iopub message and would also print 'Hello, World!' to the terminal where IJavascript was running.

Member Author

That's precisely the kind of reason that output capturing will be opt-in - kernels need to be ready for it, and then kernel authors can flip the switch to enable it. It won't be enabled for earlier versions of IJavascript, so that won't be a problem.
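
For reference, the OS-level capture described in the proposal text could look roughly like the sketch below. It assumes the nanny starts the kernel with `subprocess.Popen` and reads each pipe in its own thread. The helper names are hypothetical, and the messages are plain dicts in the shape of *stream* messages (`name` and `text`) rather than fully signed protocol messages.

```python
import queue
import subprocess
import sys
import threading


def _pump(pipe, name, outbox):
    """Read a pipe until EOF, queueing each line as a stream message."""
    for line in iter(pipe.readline, b""):
        outbox.put({
            "msg_type": "stream",
            "content": {"name": name, "text": line.decode("utf-8", "replace")},
        })
    pipe.close()


def start_with_capture(argv):
    """Start a kernel with stdout and stderr connected to pipes."""
    kernel = subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    outbox = queue.Queue()  # messages waiting to be sent on iopub
    threads = []
    for pipe, name in ((kernel.stdout, "stdout"), (kernel.stderr, "stderr")):
        t = threading.Thread(target=_pump, args=(pipe, name, outbox), daemon=True)
        t.start()
        threads.append(t)
    return kernel, outbox, threads


if __name__ == "__main__":
    # os.write bypasses sys.stdout, much as a C extension would.
    child = [sys.executable, "-c", "import os; os.write(1, b'from fd 1\\n')"]
    kernel, outbox, threads = start_with_capture(child)
    kernel.wait()
    for t in threads:
        t.join()
    while not outbox.empty():
        print(outbox.get())
```

Reading line by line keeps the sketch simple. A nanny aiming for real-time updates of partial lines would probably read whatever bytes are available instead.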


### Output synchronisation

With this proposal, there are multiple asynchronous channels for output coming
from the kernel: the `iopub` socket from the kernel to the nanny, and the pipes
carrying stdout and stderr. At present, the `status` message with
`execution_state: idle` marks the end of output on the iopub channel.

Kernels that opt in to output capturing should print a delimiter (*TODO: define
delimiter*) on each of stdout and stderr, before and after running user code.
The delimiter will include the message ID of the execute_request message.
The nanny will not forward these to the frontend, but will use the 'before'
delimiters to determine which execution the output resulted from, and the 'after'
delimiters to detect when stream output is finished. The nanny will tell the
frontend when all output from an execute_request, on iopub and the two pipes,
is complete (*TODO: using status:idle, or a new message?*).
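
The delimiter format is still undefined, so the following is only a sketch of how the nanny might consume such markers on one pipe. The `BEGIN` and `END` strings are invented for illustration and are not part of the proposal. The function yields hypothetical `("text", parent_id, line)` and `("done", parent_id, None)` events.

```python
# Hypothetical markers: the proposal leaves the delimiter format as a TODO.
BEGIN = "\x00jupyter-begin "
END = "\x00jupyter-end "


def tag_output(lines):
    """Attach the current execute_request message ID to each line of output.

    Delimiter lines are consumed here and never forwarded to the frontend.
    """
    parent_id = None  # ID of the execute_request currently producing output
    for line in lines:
        if line.startswith(BEGIN):
            parent_id = line[len(BEGIN):].strip()
        elif line.startswith(END):
            # Stream output for this request is finished on this pipe.
            yield ("done", line[len(END):].strip(), None)
            parent_id = None
        else:
            yield ("text", parent_id, line)
```

Under this scheme the nanny would report an execution as complete only after seeing a `done` event for the same message ID on both stdout and stderr, plus the kernel's idle status on iopub.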

This is assuming synchronous execution.

Member Author

It is, but that's already generally our assumption, and I don't see any other way to do it. There is no side channel in sync with stdout/stderr by which we can convey metadata like the parent message ID, so it has to be in-band.

This is all in addition to the existing mechanisms for kernels to send output, so kernels for which async is really important should focus on capturing output in process and sending the correct messages.

I would let the kernel handle that. If the kernel can identify the source of a stream message, let the kernel send the appropriate iopub reply.

Member Author

Yes, if the kernel is sending the stream message itself, it should absolutely do that. This is for the case where the data is going over the stdout/stderr pipes. If you can capture all stdout/stderr well enough within IJavascript, there's no need for you to enable output capturing.

**Rejected alternative:** Frontends monitoring when output is complete on each
channel. The frontend output handling logic would have to know whether the
kernel in use used output capturing or not, and the logic would have to be
written for each frontend. With the scheme we decided upon, the logic only needs
to be implemented once in the nanny process, and the frontend can remain ignorant
of whether the kernel has enabled output capturing.

### Kernel logging

Where kernels have previously been using low-level stdout/stderr to log to the
terminal, they need a new way to produce diagnostic logs which shouldn't be
displayed in the frontend. Kernels opting in to output capturing will be
started with an environment variable `JUPYTER_KERNEL_LOG` set. The kernel
should treat this as a filesystem path, which it can open and write logs to.

Suggestion: add a new type of message for kernels to send their log messages to the frontend.

Member Author

Thanks for the suggestion, but this logging is what you're going to be using if your kernel's messages are not getting to the frontend for whatever reason. So I think it really needs to be a) a separate channel, and b) as technically simple as possible.

Can the kernel log output be disabled? If yes, how is that indicated? Variable not set, or set to an empty value?
As it is described now, the nanny would have to create an empty file and set JUPYTER_KERNEL_LOG for every kernel that uses output capturing, even if that kernel does not produce diagnostic logs. Disabling log output from the kernelspec, for example, would be a way to avoid that overhead.

Kernels should make minimal assumptions about the type of file they are opening.
It may be a regular file, a FIFO (or named pipe), or the slave end of the tty
where the frontend was started, on systems where that is possible. This will
likely depend on configuration settings in the frontend, and possibly on how the
frontend is started. It should never be a directory, however.
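
On the kernel side, honouring `JUPYTER_KERNEL_LOG` might look like the sketch below. `make_kernel_logger` is a hypothetical helper. Treating an unset or empty variable as "logging disabled" is an assumption, since the proposal does not yet say how disabling is signalled.

```python
import logging
import os


def make_kernel_logger(name="kernel"):
    """Build a logger that writes to the path in JUPYTER_KERNEL_LOG, if set."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    log_path = os.environ.get("JUPYTER_KERNEL_LOG")
    if log_path:
        # Append mode never truncates, so the same code can be pointed at a
        # regular file, a FIFO, or a tty. Line buffering keeps logs timely.
        stream = open(log_path, "a", buffering=1)
        handler = logging.StreamHandler(stream)
        handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    else:
        # Assumption: no variable means the kernel should not log anywhere.
        handler = logging.NullHandler()
    logger.addHandler(handler)
    return logger
```

A kernel would call this once at startup and use the returned logger for its diagnostics instead of writing to low-level stdout or stderr.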