POC: Windows registered IO #918

Matthias247
Contributor

This change demonstrates how quinn could make use of windows
registered IO (RIO), or other kinds of completion based IO.

Upfront note: The change is incomplete, buggy, and will leak all memory.
Don't even think about using it as is. This is just a quick hack to get
some ideas about integration and about achievable performance.

Integrating with registered IO requires the following changes to get
things working:

  1. The endpoint now runs on its own dedicated IO thread instead of
    running on the shared tokio runtime. This allows it to use any platform
    specific IO primitives it needs. In this case we are using RIO,
    and a custom eventloop which waits until new IO is possible using
    a windows ManualResetEvent. Waiting via IOCP, or letting the thread
    busy-spin, is also possible.
    The actual loop, which is implemented in EndpointDriver::run, is not
    that different from the existing EndpointDriver::poll method.
  2. Reading and writing data with RIO is submission+completion oriented,
    and requires buffers to be registered with kernel space for the complete
    lifetime of the socket. To accommodate those requirements,
    a buffer pool is allocated when the socket + endpoint are created, and
    reused throughout the lifetime of the endpoint. The endpoint makes
    sure the maximum possible number of concurrent receive operations is
    scheduled. Transmit operations get scheduled whenever data to transmit
    is available and TX buffers are available.
  3. Since the rest of quinn is not aware of pinned buffers, and requires
    Vec to transmit outgoing buffers and BytesMut to decode incoming
    datagrams, all datagrams are copied once from the IO buffers to those
    higher level buffers. This could theoretically be optimized.
  4. Since the endpoint is no longer an async task, it can no longer receive
    instructions from connections using an async channel. This adds a custom
    channel implementation for this purpose, which consists of a trivial
    synchronized queue and a wakeup of the endpoint eventloop.
  5. The endpoint can't use tokio::spawn anymore to spawn new connections,
    since it is not running inside a tokio context.
    Therefore a runtime handle needs to be explicitly propagated.
  6. The socket needs to be created with the WSA_FLAG_REGISTERED_IO flag. Therefore
    UDP sockets created via std::net::UdpSocket unfortunately can't be trivially
    forwarded. It is debatable whether this means the quinn library
    should be responsible for creating all sockets, or whether it should still
    accept external sockets but explicitly require that those have been configured
    with all the necessary flags.
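
To make point 4 more concrete, here is a minimal sketch of a synchronized queue combined with an eventloop wakeup. It is not the code from this change: `Wakeup` and `EndpointQueue` are hypothetical names, and a `Mutex<bool>` plus `Condvar` stands in for the windows ManualResetEvent so the example only needs the standard library.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};

/// Hypothetical stand-in for a manual-reset event: it stays signaled
/// until the waiter resets it.
struct Wakeup {
    signaled: Mutex<bool>,
    cond: Condvar,
}

impl Wakeup {
    fn new() -> Self {
        Wakeup { signaled: Mutex::new(false), cond: Condvar::new() }
    }

    /// Signals the event and wakes a blocked waiter, if any.
    fn set(&self) {
        *self.signaled.lock().unwrap() = true;
        self.cond.notify_all();
    }

    /// Blocks until the event is signaled, then resets it.
    fn wait_and_reset(&self) {
        let mut signaled = self.signaled.lock().unwrap();
        while !*signaled {
            signaled = self.cond.wait(signaled).unwrap();
        }
        *signaled = false;
    }
}

/// Hypothetical channel from connections to the endpoint IO thread.
struct EndpointQueue<T> {
    queue: Mutex<VecDeque<T>>,
    wakeup: Wakeup,
}

impl<T> EndpointQueue<T> {
    fn new() -> Arc<Self> {
        Arc::new(EndpointQueue {
            queue: Mutex::new(VecDeque::new()),
            wakeup: Wakeup::new(),
        })
    }

    /// Called by connections: enqueue first, then wake the eventloop,
    /// so a wakeup is never observed before its message.
    fn send(&self, msg: T) {
        self.queue.lock().unwrap().push_back(msg);
        self.wakeup.set();
    }

    /// Called by the endpoint thread: block until something was sent.
    fn wait(&self) {
        self.wakeup.wait_and_reset();
    }

    /// Called by the endpoint thread after waking: take all pending
    /// messages. May return an empty Vec after a redundant wakeup.
    fn drain(&self) -> Vec<T> {
        self.queue.lock().unwrap().drain(..).collect()
    }
}
```

In a real eventloop the same event would presumably also be signaled by IO completions, so the endpoint thread would check both the completion queue and this message queue after every wakeup.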

Most of the points outlined here would also be required to support io_uring
or AF_XDP with buffers pre-registered with the kernel, or just
sendmsg/sendmmsg using MSG_ZEROCOPY, which has similar
requirements.
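
The buffer handling these mechanisms have in common (points 2 and 3 above) can be sketched roughly as follows. This is only an illustration of the bookkeeping, under the assumption of fixed-size slots in a single allocation: `BufferPool` and its methods are hypothetical names, and the actual registration of the memory with the kernel is omitted.

```rust
/// Hypothetical fixed pool of equally sized datagram buffers backed by
/// one allocation that is never moved or resized after creation.
struct BufferPool {
    storage: Box<[u8]>,
    slot_size: usize,
    free: Vec<usize>,
}

impl BufferPool {
    /// Allocates all slots up front. A real implementation would
    /// register `storage` with the kernel here (not shown).
    fn new(slots: usize, slot_size: usize) -> Self {
        BufferPool {
            storage: vec![0u8; slots * slot_size].into_boxed_slice(),
            slot_size,
            free: (0..slots).rev().collect(),
        }
    }

    /// Reserves a slot for a receive or transmit operation.
    /// Returns None when every slot is in flight.
    fn acquire(&mut self) -> Option<usize> {
        self.free.pop()
    }

    /// Mutable view of one slot, e.g. to fill a TX buffer.
    fn slot_mut(&mut self, slot: usize) -> &mut [u8] {
        let start = slot * self.slot_size;
        &mut self.storage[start..start + self.slot_size]
    }

    /// Handles a completion of `len` bytes in `slot`: copies the data
    /// once into a higher level buffer and returns the slot to the pool.
    fn complete(&mut self, slot: usize, len: usize) -> Vec<u8> {
        assert!(len <= self.slot_size, "completion larger than slot");
        let start = slot * self.slot_size;
        let data = self.storage[start..start + len].to_vec();
        self.free.push(slot);
        data
    }
}
```

The `to_vec` call in `complete` is the one copy mentioned in point 3. Avoiding it would require the higher layers to work directly on pool slots.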

Performance with this approach varies. The benchmarks indicate a
throughput somewhere between 180MB/s and 330MB/s. Once a benchmark
is started, it consistently reports either the low or the high value.
Some comments in the msquic repository indicate that this might be
due to RSS (receive side scaling). Maybe something can be improved here
by making sure the endpoint IO thread runs on the ideal core.

A follow-up POC which could be built, but isn't part of this demo,
would also move the Connection handling onto the new dedicated
IO thread.

Collaborator

Ralith commented Nov 15, 2020

> Some comments in the msquic repository indicate that this might be due to RSS

I think RSS is only involved when you're receiving on multiple threads in parallel.
