[GIT PULL] man/io_uring_internal: Man page about high lvl inner workings of io_uring #1256
base: master
@@ -0,0 +1,282 @@ | ||
.TH io_uring_internals 7 2024-10-05 "Linux" "Linux Programmer's Manual" | ||
.SH NAME | ||
io_uring_internals \- high-level inner workings of io_uring | ||
.SH SYNOPSIS | ||
.nf | ||
.B "#include <linux/io_uring.h>" | ||
.fi | ||
.PP | ||
.SH DESCRIPTION | ||
.PP | ||
.B io_uring | ||
is a Linux-specific, asynchronous API for submitting requests to | ||
the kernel. Applications pass requests to the kernel via a shared ring buffer, | ||
the | ||
.I Submission Queue | ||
(SQ), and receive notifications of the completion of these requests via the | ||
.I Completion Queue | ||
(CQ). An important detail here is that after a request has been submitted to | ||
the kernel, some CPU time has to be spent in kernel space to perform the | ||
required submission- and completion-related work. | ||
The mechanism used to provide this CPU time, as well as which process does so | ||
and when, differs in | ||
.I io_uring | ||
from the traditional API provided by regular syscalls. | ||
|
||
.PP | ||
.SH Traditional Syscall Driven I/O | ||
.PP | ||
For regular syscalls the CPU time for this work is directly provided by the | ||
process issuing the syscall, with the submission-side work in kernel space | ||
being executed directly after the context switch. In the case of polled I/O, | ||
the time for completion-related work is subsequently provided the same way. | ||
In the case of interrupt-driven I/O the CPU time is provided, | ||
depending on the driver in question, by either the traditional top- and | ||
bottom-half IRQ approach or via threaded IRQ handling. The CPU time for | ||
completion work is thus provided by the CPU on which the hardware | ||
interrupt arrives, or, if threaded IRQ handling is used, by the CPU to which | ||
the dedicated kernel worker thread gets scheduled. | ||
|
||
.PP | ||
.SH The Submission Side Work | ||
.PP | ||
|
||
The work required in kernel space on the submission side mostly consists of | ||
checking the SQ for newly arrived SQEs, parsing them, checking them for | ||
validity and permissions, and then passing them on to the responsible | ||
subsystem, such as a block device driver, the networking stack, etc. An | ||
important note here is that | ||
.I io_uring | ||
guarantees that submitting a request to the responsible subsystem, and thus | ||
the | ||
.IR io_uring_enter (2) | ||
syscall made to submit new requests, | ||
.B will never | ||
.BR block . | ||
However, the mechanism by which io_uring achieves this generally depends on | ||
the capabilities of the file a request operates on. While the mechanism | ||
.I io_uring | ||
ends up utilizing for this is not directly observable to the application, it | ||
does have significant performance implications. | ||
There are generally four scenarios: | ||
.PP | ||
1. The operation is finished in its entirety immediately. Examples of this | ||
are reads or writes to a pipe or socket, or reads and writes to regular | ||
files not using direct I/O that can be served via the page cache. In this | ||
scenario the corresponding CQE is posted inline as well and will thus be | ||
visible to the application even before the | ||
.IR io_uring_enter (2) | ||
call returns. | ||
|
||
2. The operation is not finished inline, but can be submitted fully | ||
asynchronously. How | ||
.I io_uring | ||
handles the asynchronous completion depends on whether interrupt-driven or | ||
polled I/O is used (see the Completion Side Work section). An example of a | ||
backend capable of this fully asynchronous operation is the NVMe driver. | ||
|
||
3. The operation is not finished inline, but the file can signal readiness, | ||
indicating when the operation can be retried. Examples of such files are any | ||
pollable files, including sockets, pipes, etc. It should be noted that these | ||
retry operations are performed during subsequent | ||
.IR io_uring_enter (2) | ||
calls, if SQ polling is not used. The operation is thus performed in the | ||
context of the submitting thread and there are no additional threads | ||
involved. If SQ polling is used, the retries are performed by the SQ poll | ||
thread. | ||
|
||
4. The operation is not finished inline and the file is incapable of signaling | ||
when it is ready to do I/O. This is the only case in which | ||
.I io_uring | ||
will | ||
.I async punt | ||
the request, i.e. offload the potentially blocking execution of the request to | ||
an asynchronous worker thread. (See IO WQ section below) | ||
.PP | ||
|
||
.PP | ||
.SH The Completion Side Work | ||
.PP | ||
|
||
The work required in kernel space on the completion side mostly comes in the | ||
form of various request-type-dependent obligations, such as copying buffers, | ||
parsing packet headers, etc., as well as posting a CQE to the CQ to inform the | ||
application of the completion of the request. | ||
|
||
.PP | ||
.SH Who does the work | ||
.PP | ||
|
||
One of | ||
the primary motivations behind | ||
.I io_uring | ||
was to reduce or entirely avoid the overhead of the syscalls used to provide | ||
the required CPU time in kernel space. The mechanism that | ||
.I io_uring | ||
utilizes to achieve this depends on its configuration, with different | ||
trade-offs between configurations with respect to e.g. CPU efficiency and | ||
latency. | ||
|
||
With the default configuration, the primary mechanism to provide this | ||
kernel-space CPU time in | ||
.I io_uring | ||
is also a syscall: | ||
.IR io_uring_enter (2). | ||
This still differs from making requests via their respective syscalls | ||
directly, such as | ||
.IR read (2), | ||
in the sense that it allows for batching in a more flexible way than is e.g. | ||
possible via | ||
.IR readv (2), | ||
as different request types can be freely mixed and matched, and chains of | ||
dependent requests, such as a | ||
.IR send (2) | ||
followed by a | ||
.IR recv (2), | ||
can be submitted with one syscall. Furthermore it is possible to both process | ||
new submissions and process arrived completions within the same | ||
.IR io_uring_enter (2) | ||
call. Applications can set the flag | ||
.I IORING_ENTER_GETEVENTS | ||
to, in addition to processing any pending submissions, process any arrived | ||
completions and | ||
optionally wait until a specified number of completions have arrived before | ||
returning. | ||
|
||
If polled I/O is used, all completion-related work is performed during the | ||
.IR io_uring_enter (2) | ||
call. For interrupt-driven I/O, the CPU receiving the hardware interrupt | ||
schedules the remaining work, including posting the CQE, to be | ||
performed via task work. Any outstanding task work is performed during any | ||
user-kernel space transition. By default, the CPU that received the hardware | ||
interrupt will, after scheduling the task work, interrupt a user space process | ||
via an inter-processor interrupt (IPI), which will cause it to enter the | ||
kernel and thus perform the scheduled work. While this ensures a timely | ||
delivery of the CQE, it is a relatively disruptive and high-overhead | ||
operation. To avoid this, applications can configure | ||
.I io_uring | ||
via | ||
.I IORING_SETUP_COOP_TASKRUN | ||
to elide the IPI. Applications must then ensure that they perform some syscall | ||
every so often to be able to observe new completions, but benefit from eliding | ||
the overhead of the IPIs. Additionally, | ||
.I io_uring | ||
can be configured to inform an application that it should now | ||
perform a syscall to reap new completions by setting | ||
.IR IORING_SETUP_TASKRUN_FLAG . | ||
This will result in | ||
.I io_uring | ||
setting | ||
.I IORING_SQ_TASKRUN | ||
in the SQ flags once the application should do so. This mechanism can be | ||
restricted further via | ||
.IR IORING_SETUP_DEFER_TASKRUN , | ||
which results in the task work only being executed when | ||
.IR io_uring_enter (2) | ||
is called with | ||
.I IORING_ENTER_GETEVENTS | ||
set, rather than at any context switch. This gives the application more | ||
control over when the work is executed, thus enabling e.g. more opportunities | ||
for batching. | ||
|
||
.PP | ||
.SH IO Threads | ||
.PP | ||
|
||
For SQ polling and the IO WQ (see below), | ||
.I io_uring | ||
utilizes special threads called | ||
.I IO | ||
.IR Threads . | ||
These are threads that only run in kernel space and never exit to user space, | ||
but they are notably different from | ||
.I kernel | ||
.IR threads , | ||
which are e.g. used for threaded interrupt handling. While kernel threads are | ||
not associated with any user space thread, IO Threads, like pthreads, | ||
inherit the file table, memory mappings, credentials, etc. from their parent. | ||
In the case of | ||
.I io_uring | ||
any IO thread of an instance is a child of the process that created that | ||
.I io_uring | ||
instance. This relation has many of the usual implications, e.g. one can | ||
profile these threads and measure their resource consumption via the | ||
children-specific options of | ||
.IR getrusage (2) | ||
and | ||
.IR perf_event_open (2). | ||
|
||
.PP | ||
.SH Submission Queue Polling | ||
.PP | ||
|
||
SQ polling introduces a dedicated IO thread that performs essentially all | ||
submission- and completion-related work: fetching SQEs from the SQ, | ||
submitting requests, polling requests if configured for I/O polling, and | ||
posting CQEs. Notably, async-punted requests are still processed by the IO | ||
WQ, so as not to | ||
hinder the progress of other requests (see the Submission Side Work section | ||
for when the async punt will occur). If the SQ poll thread does not have any | ||
work to do for a user-supplied timeout, it goes to sleep. SQ polling removes | ||
the need for any syscall during operation, besides waking up the SQ poll | ||
thread after long periods of inactivity, and thus reduces per-request | ||
overhead at the cost of a high constant upkeep cost. | ||
|
||
.PP | ||
.SH IO Work Queue | ||
.PP | ||
|
||
The IO WQ is a pool of IO threads used to execute any requests that can not be | ||
submitted in a non-blocking way (see the Submission Side Work section for when | ||
this is the case). After either the SQ poll thread or a user space | ||
thread calling | ||
.IR io_uring_enter (2) | ||
fails the initial attempt to submit the request without blocking, it passes | ||
the request on to an IO WQ thread that then performs the blocking submission. | ||
This mechanism ensures that | ||
.IR io_uring , | ||
unlike e.g. AIO, never blocks on any of the submission paths. However, the | ||
blocking nature of the submission, the passing of the request to another | ||
thread, as well as the scheduling of the IO WQ threads are all overheads that | ||
are ideally avoided. Significant IO WQ activity can thus be seen as an | ||
indicator that | ||
something is very likely going wrong. Similarly, the flag | ||
.I IOSQE_ASYNC | ||
should only be used if the user knows that a request will always, or is very | ||
likely to, be async punted. It is not needed to ensure that the submission | ||
will not block, as | ||
.I io_uring | ||
guarantees to never block in any case. | ||
|
||
.PP | ||
.SH Kernel Thread Management | ||
.PP | ||
|
||
Each user space process utilizing | ||
.I io_uring | ||
possesses an | ||
.I io_uring | ||
context, which manages all | ||
.I io_uring | ||
instances created within said process via | ||
.IR io_uring_setup (2). | ||
By default, both the SQ poll thread and the IO WQ thread pool are | ||
dedicated to each | ||
.I io_uring | ||
instance; they are thus not shared within a process and are never shared | ||
between different processes. However, sharing these between two or more | ||
instances can be achieved during setup via | ||
.IR IORING_SETUP_ATTACH_WQ . | ||
The threads of the IO WQ are created lazily in response to requests being | ||
async punted and fall into two accounts: the | ||
bounded account, responsible for requests with a generally bounded execution | ||
time, such as block I/O, and the unbounded account, for requests with | ||
unbounded execution time, such as e.g. recv operations. | ||
The maximum thread count of each account is by default 2 * NPROC and can be | ||
adjusted via | ||
.IR IORING_REGISTER_IOWQ_MAX_WORKERS . | ||
Their CPU affinity can be adjusted via | ||
.IR IORING_REGISTER_IOWQ_AFF . | ||
|
||
.SH SEE ALSO | ||
.BR io_uring (7), | ||
.BR io_uring_enter (2), | ||
.BR io_uring_register (2), | ||
.BR io_uring_setup (2) |