<?xml version='1.0' encoding='utf-8'?>
<?xml-model href='rfc7991bis.rnc'?>
<rfc
xmlns:xi='http://www.w3.org/2001/XInclude'
category='std'
docName='draft-lxin-quic-socket-apis-00'
ipr='trust200902'
submissionType='IETF'
consensus='true'
xml:lang='en'
version='3'>
<front>
<title abbrev='QUIC socket APIs'>
Sockets API Extensions for In-kernel QUIC Implementations</title>
<seriesInfo name='Internet-Draft' value='draft-lxin-quic-socket-apis-00'/>
<author fullname='Xin Long' initials='L' role='editor' surname='Xin'>
<organization>Red Hat</organization>
<address>
<postal>
<street>20 Deerfield Drive</street>
<city>Ottawa</city>
<region>ON</region>
<country>CA</country>
</postal>
<email>[email protected]</email>
</address>
</author>
<author fullname='Moritz Buhl' initials='M.' role='editor' surname='Buhl'>
<organization>Technical University of Munich</organization>
<address>
<postal>
<street>Boltzmannstrasse 3</street>
<city>Garching</city>
<code>85748</code>
<country>Germany</country>
</postal>
<email>[email protected]</email>
</address>
</author>
<author fullname='Marcelo Ricardo Leitner' initials='M' role='editor'
surname='Leitner'>
<organization>Red Hat</organization>
<address>
<postal>
<street>Av. Brg. Faria Lima, 3732</street>
<city>Sao Paulo</city>
<region>SP</region>
<country>BR</country>
</postal>
<email>[email protected]</email>
</address>
</author>
<date year='2024'/>
<area>Web and Internet Transport</area>
<workgroup>Internet Engineering Task Force</workgroup>
<keyword>QUIC socket APIs</keyword>
<abstract>
<t>This document describes a mapping of in-kernel QUIC implementations
into a sockets API. The benefits of this mapping include compatibility
for TCP applications, access to new QUIC features, and a consolidated
error and event notification scheme. In-kernel QUIC enables usage for
both userspace applications and kernel consumers.</t>
</abstract>
</front>
<middle>
<section title='Introduction'>
<t>The QUIC protocol, as defined in <xref target='RFC9000'/>, offers a
UDP-based, secure transport with flow-controlled streams for efficient
communication, low-latency connection setup, and network path migration,
ensuring confidentiality, integrity, and availability across various
deployments.</t>
<t>In-kernel QUIC implementations will be able to offer several key
advantages:</t>
<ul>
<li>Seamless Integration for Kernel Subsystems: Kernel subsystems such as
SMB and NFS can operate over QUIC seamlessly after the handshake,
leveraging the netlink APIs.</li>
<li>Efficient ALPN Routing: An in-kernel implementation can perform ALPN
routing within the kernel, directing incoming requests to the appropriate
applications across different processes based on ALPN.</li>
<li>Performance Enhancements: By minimizing data duplication through
zero-copy techniques such as sendfile(), and by paving the way for crypto
offloading in NICs, an in-kernel implementation improves performance and
prepares for future optimizations.</li>
<li>Standardized Socket APIs for QUIC: An in-kernel implementation
standardizes the socket APIs for QUIC, covering essential operations such
as listen(), accept(), connect(), sendmsg(), recvmsg(), close(),
getsockopt(), setsockopt(), getsockname(), and getpeername().</li>
</ul>
<t>The socket APIs have provided a standard mapping of the Internet
Protocol suite to many operating systems. Both TCP
<xref target='RFC9293'/> and UDP <xref target='RFC0768'/> have benefited
from this standard representation and access method across many diverse
platforms. SCTP <xref target='RFC6458'/> has also created its own socket
APIs. Based on <xref target='RFC6458'/>, this document defines a method to
map the existing socket APIs for use with in-kernel QUIC, providing both
a base for access to new features and compatibility so that most existing
TCP applications can be migrated to QUIC with few (if any) changes.</t>
<t>Some of the QUIC mechanisms cannot be adequately mapped to an
existing socket interface. In some cases, it is more desirable to
have a new interface instead of using existing socket calls.</t>
<section title='Conventions'>
<t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in
<xref target='RFC2119'/>.</t>
</section>
</section>
<section title='Data Types'>
<t>Whenever possible, Portable Operating System Interface (POSIX) data
types defined in IEEE-1003.1-2008 are used: uintN_t means an
unsigned integer of exactly N bits (e.g., uint16_t). This document
also assumes the argument data types from POSIX when possible (e.g.,
the final argument to setsockopt() is a socklen_t value). Whenever
buffer sizes are specified, the POSIX size_t data type is used.</t>
</section>
<section title='Interface'>
<t>A typical QUIC server uses the following socket calls in sequence to
prepare an endpoint for servicing requests:</t>
<ul>
<li>socket()</li>
<li>bind()</li>
<li>listen()</li>
<li>accept()</li>
<li>quic_server_handshake()</li>
<li>recvmsg()</li>
<li>sendmsg()</li>
<li>close()</li>
</ul>
<t>It is similar to a TCP server, except for the quic_server_handshake()
call, which handles the TLS message exchange to complete the handshake.
See <xref target='advanced_handshake'/>.</t>
<t>All TLS handshake messages carried in QUIC packets MUST be processed in
userspace. When a Client Initial packet is received, it triggers accept()
to create a new socket and return. However, the TLS handshake message
contained in this packet will be processed by quic_server_handshake() via
the newly created socket.</t>
<t>A typical QUIC client uses the following calls in sequence to set up
a connection with a server to request services:</t>
<ul>
<li>socket()</li>
<li>connect()</li>
<li>quic_client_handshake()</li>
<li>sendmsg()</li>
<li>recvmsg()</li>
<li>close()</li>
</ul>
<t>It is similar to a TCP client, except for the quic_client_handshake()
call, which handles the TLS message exchange to complete the handshake.
See <xref target='advanced_handshake'/>.</t>
<t>On the client side, connect() SHOULD NOT send any packets to the server.
Instead, all TLS handshake messages are generated by the TLS library and
sent in quic_client_handshake().</t>
<t>In the implementation, one QUIC socket represents a single QUIC
connection and MAY manage multiple UDP sockets simultaneously to support
connection migration or future multipath features. Conversely, a single
lower-layer UDP socket MAY serve multiple QUIC sockets.</t>
<section title='Basic Operation'>
<section title='socket()'>
<t>Applications use socket() to create a socket descriptor to represent
a QUIC endpoint.</t>
<t>The function prototype is</t>
<sourcecode type='language C'>
int socket(int domain,
int type,
int protocol);
</sourcecode>
<t>and one uses PF_INET or PF_INET6 as the domain, SOCK_STREAM or
SOCK_DGRAM as the type, and IPPROTO_QUIC as the protocol.</t>
<t>Note that QUIC does not have a protocol number allocated by IANA.
Similar to IPPROTO_MPTCP in Linux, IPPROTO_QUIC is simply a value used
when opening a QUIC socket, and its value MAY vary depending on the
implementation.</t>
<t>The function returns a socket descriptor or -1 in case of an error.
Using the PF_INET domain indicates the creation of an endpoint that MUST
use only IPv4 addresses, while PF_INET6 creates an endpoint that MAY use
both IPv6 and IPv4 addresses. See <xref target='RFC3493' section='3.7'/>.
</t>
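<t>As a minimal sketch of the call described above, the following helper
opens a QUIC socket and reports the error when the call fails. The helper
name quic_socket_open() and the fallback value 261 for IPPROTO_QUIC are
assumptions made for illustration only; as noted above, the actual value is
implementation-dependent.</t>
<sourcecode type='language C'><![CDATA[
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Assumption: placeholder value, used only when the platform headers
 * do not define IPPROTO_QUIC. */
#ifndef IPPROTO_QUIC
#define IPPROTO_QUIC 261
#endif

/* Open a QUIC socket for the given domain (PF_INET or PF_INET6).
 * Returns the socket descriptor, or -1 with errno preserved. */
int quic_socket_open(int domain)
{
	int sd = socket(domain, SOCK_DGRAM, IPPROTO_QUIC);

	if (sd < 0) {
		int saved = errno;

		fprintf(stderr, "socket(IPPROTO_QUIC): %s\n",
			strerror(saved));
		errno = saved;
	}
	return sd;
}
]]></sourcecode>
<t>A return value of -1 with errno set to EPROTONOSUPPORT typically
indicates that the running kernel provides no QUIC support under the
protocol value that was used.</t>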
</section>
<section title='bind()'>
<t>Applications use bind() to specify with which local address and port
the QUIC endpoint SHOULD associate itself.</t>
<t>The function prototype of bind() is</t>
<sourcecode type='language C'>
int bind(int sd,
struct sockaddr *addr,
socklen_t addrlen);
</sourcecode>
<t>and the arguments are</t>
<ul>
<li>sd: The socket descriptor returned by socket().</li>
<li>addr: The address structure (struct sockaddr_in for an IPv4 address
or struct sockaddr_in6 for an IPv6 address. See <xref target='RFC3493'/>).
</li>
<li>addrlen: The size of the address structure.</li>
</ul>
<t>bind() returns 0 on success and -1 in case of an error.</t>
<t>Applications cannot call bind() multiple times to associate multiple
addresses with an endpoint. After the first bind() call, all subsequent
calls will return an error. However, multiple applications MAY bind()
to the same address and port, sharing the same lower UDP socket in the
kernel.</t>
<t>The IP address part of addr MAY be specified as a wildcard (e.g.,
INADDR_ANY for IPv4 or IN6ADDR_ANY_INIT or in6addr_any for IPv6). If the
IPv4 sin_port or IPv6 sin6_port is set to 0, the operating system will
choose an ephemeral port for the endpoint.</t>
<t>If bind() is not explicitly called before connect() on the client, the
system will automatically determine a valid source address based on the
routing table and assign an ephemeral port to bind the socket during
connect().</t>
<t>Completing the bind() process does not permit the QUIC endpoint to
accept inbound QUIC connection requests on the server. This capability
is only enabled after a listen() system call, as described below,
is performed on the socket. </t>
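<t>The wildcard address and ephemeral port behavior described above can be
sketched as follows. The helper name bind_any_ipv4() is hypothetical. It
operates on any socket descriptor, so it can be tried on a UDP socket on
systems where QUIC sockets are not available.</t>
<sourcecode type='language C'><![CDATA[
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Bind sd to the IPv4 wildcard address. A port of 0 asks the operating
 * system to choose an ephemeral port.
 * Returns 0 on success and -1 in case of an error. */
int bind_any_ipv4(int sd, unsigned short port)
{
	struct sockaddr_in addr;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(port);

	return bind(sd, (struct sockaddr *)&addr, sizeof(addr));
}
]]></sourcecode>
<t>Calling the helper a second time on the same descriptor is expected to
return -1, in line with the rule that only the first bind() call
succeeds.</t>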
</section>
<section title='listen()'>
<t>An application uses listen() to mark a socket as being able to accept
new connections.</t>
<t>The function prototype is</t>
<sourcecode type='language C'>
int listen(int sd,
int backlog);
</sourcecode>
<t>and the arguments are</t>
<ul>
<li>sd: The socket descriptor of the endpoint.</li>
<li>backlog: If backlog is non-zero, enable listening, else disable
listening.</li>
</ul>
<t>listen() returns 0 on success and -1 in case of an error.</t>
<t>The implementation SHOULD allow the kernel to parse the ALPN from a
Client Initial packet and direct the incoming request based on it to
different listening sockets (binding to the same address and port).
These sockets could belong to different user processes or kernel
threads. The ALPNs for sockets are set via the ALPN socket option
<xref target='sockopt_alpn'/> before calling listen().</t>
<t>If no ALPNs are configured before calling listen(), the listening
socket will only be capable of accepting client connections that do
not specify any ALPN.</t>
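<t>The socket(), bind(), and listen() steps on the server can be sketched
as one helper. The name open_listener() is hypothetical, and the socket
type and protocol are left as parameters so that the same code can be
exercised with TCP. Setting the ALPN socket option before listen() is
omitted here, because the option name is defined in a later section.</t>
<sourcecode type='language C'><![CDATA[
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Create a socket, bind it to the IPv4 wildcard address and the given
 * port, and enable listening with a non-zero backlog.
 * Returns the listening descriptor, or -1 with errno preserved. */
int open_listener(int type, int protocol, unsigned short port)
{
	struct sockaddr_in addr;
	int sd = socket(PF_INET, type, protocol);

	if (sd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(port);

	if (bind(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(sd, 1) < 0) {
		int saved = errno;

		close(sd);
		errno = saved;
		return -1;
	}
	return sd;
}
]]></sourcecode>
<t>For QUIC, a caller would pass SOCK_STREAM or SOCK_DGRAM as the type and
the implementation's IPPROTO_QUIC value as the protocol. Passing
SOCK_STREAM and IPPROTO_TCP yields an ordinary TCP listener.</t>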
</section>
<section title='accept()'>
<t>Applications use the accept() call to remove a QUIC connection request
from the accept queue of the endpoint. A new socket descriptor will be
returned from accept() to represent the newly formed connection
request.</t>
<t>The function prototype is</t>
<sourcecode type='language C'>
int accept(int sd,
struct sockaddr *addr,
socklen_t *addrlen);
</sourcecode>
<t>and the arguments are</t>
<ul>
<li>sd: The socket descriptor of the endpoint.</li>
<li>addr: The address structure (struct sockaddr_in for an IPv4 address
or struct sockaddr_in6 for an IPv6 address. See <xref target='RFC3493'/>).
</li>
<li>addrlen: A pointer to the size of the address structure. On return,
it holds the size of the returned address.</li>
</ul>
<t>The function returns the socket descriptor for the newly formed
connection request on success and -1 in case of an error.</t>
<t>Note that the incoming Client Initial packet triggers the accept() call,
and the TLS message carried by the Client Initial packet will be queued in
the receive queue of the socket returned by accept(). This TLS message will
then be received and processed by userspace through the newly returned
socket, ensuring that the TLS handshake is completed in userspace.</t>
</section>
<section title='connect()'>
<t>Applications use connect() to perform routing and determine the
appropriate source address and port to bind if bind() has not been called.
Additionally, connect() initializes the connection ID and installs the
initial keys necessary for encrypting handshake packets.</t>
<t>The function prototype is</t>
<sourcecode type='language C'>
int connect(int sd,
const struct sockaddr *addr,
socklen_t addrlen);
</sourcecode>
<t>and the arguments are</t>
<ul>
<li>sd: The socket descriptor of the endpoint.</li>
<li>addr: The address structure (struct sockaddr_in for an IPv4 address
or struct sockaddr_in6 for an IPv6 address. See <xref target='RFC3493'/>).
</li>
<li>addrlen: The size of the address structure.</li>
</ul>
<t>connect() returns 0 on success and -1 on error.</t>
<t>connect() MUST be called before sending any handshake message.</t>
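<t>A minimal sketch of the address setup for connect() follows. The helper
name connect_ipv4() is hypothetical. The helper only fills in the address
structure and calls connect(); the handshake itself is a separate step.</t>
<sourcecode type='language C'><![CDATA[
#include <errno.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Connect sd to an IPv4 address given in dotted-decimal form.
 * Returns 0 on success and -1 on error; errno is set to EINVAL
 * when ip cannot be parsed. */
int connect_ipv4(int sd, const char *ip, unsigned short port)
{
	struct sockaddr_in addr;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(port);
	if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) {
		errno = EINVAL;
		return -1;
	}
	return connect(sd, (const struct sockaddr *)&addr, sizeof(addr));
}
]]></sourcecode>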
</section>
<section title='close()'>
<t>Applications use close() to gracefully close down a connection.</t>
<t>The function prototype is</t>
<sourcecode type='language C'>
int close(int sd);
</sourcecode>
<t>and the arguments are</t>
<ul>
<li>sd: The socket descriptor of the connection to be closed.</li>
</ul>
<t>close() returns 0 on success and -1 in case of an error.</t>
<t>After an application calls close() on a socket descriptor, no further
socket operations will succeed on that descriptor.</t>
<t>close() sends a CONNECTION_CLOSE frame to the peer. The close
information MAY be set via the CONNECTION_CLOSE socket option
<xref target='sockopt_close'/> before calling close().</t>
</section>
<section title='shutdown()'>
<t>QUIC differs from TCP in that it does not have half close
semantics.</t>
<t>The function prototype is</t>
<sourcecode type='language C'>
int shutdown(int sd,
int how);
</sourcecode>
<t>and the arguments are</t>
<ul>
<li>sd: The socket descriptor of the connection to be closed.</li>
<li><t>how: Specifies the type of shutdown</t>
<ul>
<li>SHUT_RD: Disables further receive operations. All incoming data and
connection requests SHOULD be discarded quietly.</li>
<li>SHUT_WR: Disables further send operations. It SHOULD send a
CONNECTION_CLOSE frame.</li>
<li>SHUT_RDWR: Similar to SHUT_WR.</li>
</ul>
</li>
</ul>
<t>shutdown() returns 0 on success and -1 in case of an error.</t>
<t>Note that users MAY use SHUT_WR to send the CONNECTION_CLOSE frame multiple times.
The implementation MUST be capable of unblocking sendmsg(), recvmsg(), and
accept() operations with SHUT_RDWR.</t>
</section>
<section title='sendmsg() and recvmsg()'>
<t>An application uses the sendmsg() and recvmsg() calls to transmit
data to and receive data from its peer.</t>
<t>The function prototypes are</t>
<sourcecode type='language C'>
ssize_t sendmsg(int sd,
const struct msghdr *message,
int flags);
ssize_t recvmsg(int sd,
struct msghdr *message,
int flags);
</sourcecode>
<t>and the arguments are</t>
<ul>
<li>sd: The socket descriptor of the endpoint.</li>
<li>message: Pointer to the msghdr structure that contains a single user
message and possibly some ancillary data. See <xref target='struct'/>
for a complete description of the data structures.</li>
<li>flags: No new flags are defined for QUIC at this level. See
<xref target='struct'/> for QUIC-specific flags used in the
msghdr structure.</li>
</ul>
<t>sendmsg() returns the number of bytes accepted by the kernel or -1 in
case of an error. recvmsg() returns the number of bytes received or
-1 in case of an error.</t>
<t>If the application does not provide enough buffer space to completely
receive a data message in recvmsg(), MSG_EOR will not be set in msg_flags.
Successive reads will consume more of the same message until the entire
message has been delivered, and MSG_EOR will be set. This is particularly
useful for reading datagram and event messages.</t>
<t>As described in <xref target='struct'/>, different types of ancillary
data MAY be sent and received along with user data.</t>
<t>During the handshake, users SHOULD use sendmsg() and recvmsg() with
Handshake msg_control <xref target='struct_handshake'/> to pass raw TLS
messages to and from the kernel, and exchange those TLS messages in
userspace with the help of a third-party TLS library, such as GnuTLS.
A pair of high-level APIs MAY be defined to wrap the handshake process
in userspace. See <xref target='advanced_handshake'/>.</t>
<t>After the handshake, users SHOULD use sendmsg() and recvmsg() with
Stream msg_control <xref target='struct_stream'/> to pass data messages to
and from the kernel along with stream_id and stream_flags. A pair of
high-level APIs MAY be defined to wrap Stream msg_control.
See <xref target='advanced_stream'/>.</t>
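<t>The MSG_EOR behavior described above suggests a receive loop along the
following lines. This is a sketch only: the helper name
recv_whole_message() is hypothetical, no ancillary data is requested, and
the loop also stops when the peer closes or the buffer is full.</t>
<sourcecode type='language C'><![CDATA[
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Read into buf until MSG_EOR is reported in msg_flags, the peer
 * closes, or cap bytes have been stored.
 * Returns the number of bytes stored, or -1 in case of an error. */
ssize_t recv_whole_message(int sd, char *buf, size_t cap)
{
	size_t used = 0;

	while (used < cap) {
		struct iovec iov = {
			.iov_base = buf + used,
			.iov_len = cap - used,
		};
		struct msghdr msg;
		ssize_t n;

		memset(&msg, 0, sizeof(msg));
		msg.msg_iov = &iov;
		msg.msg_iovlen = 1;

		n = recvmsg(sd, &msg, 0);
		if (n < 0)
			return -1;
		if (n == 0)	/* no more data from the peer */
			break;
		used += (size_t)n;
		if (msg.msg_flags & MSG_EOR)	/* message is complete */
			break;
	}
	return (ssize_t)used;
}
]]></sourcecode>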
</section>
<section title='send(), recv(), read() and write()'>
<t>Applications MAY use send() and recv() to transmit and receive data with
basic access to the peer.</t>
<t>The function prototypes are</t>
<sourcecode type='language C'>
ssize_t send(int sd,
const void *msg,
size_t len,
int flags);
ssize_t recv(int sd,
void *buf,
size_t len,
int flags);
</sourcecode>
<t>and the arguments are</t>
<ul>
<li>sd: The socket descriptor of the endpoint.</li>
<li>msg: The message to be sent.</li>
<li>buf: The buffer to store a received message.</li>
<li>len: The size of the message or the size of the buffer.</li>
<li>flags: (described below).</li>
</ul>
<t>send() returns the number of bytes accepted by the kernel or -1 in
case of an error. recv() returns the number of bytes received or
-1 in case of an error.</t>
<t>Since ancillary data (msg_control field) cannot be used, the flags will
operate as described in <xref target='struct_msghdr'/>, but without the
context provided by Stream or Handshake msg_control. While sending, the
flags function as intended; however, when receiving, users will not be
able to obtain any flags from the kernel.</t>
<t>send() can transmit data as datagram messages when MSG_DATAGRAM is set
in the flags, and as stream messages on the most recently opened stream
if MSG_DATAGRAM is not set. However, it cannot send handshake messages.</t>
<t>recv() can receive datagram, stream, or event messages, but it cannot
determine the message type or stream ID without the appropriate flags and
ancillary data from the kernel. It SHOULD return -1 with errno set to
EINVAL when attempting to receive a handshake message.</t>
<t>Alternatively, applications can use read() and write() to transmit and
receive data to and from a peer. These functions are similar to recv() and
send() but offer less functionality, as they do not allow the use of a
flags parameter.</t>
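<t>Because send() returns the number of bytes accepted by the kernel,
which can be fewer than requested, callers often loop until the whole
buffer has been handed over. The following sketch shows one such loop. The
helper name send_all() is hypothetical, and passing the same flags on
every call is an assumption that may not suit every flag.</t>
<sourcecode type='language C'><![CDATA[
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Call send() repeatedly until len bytes have been accepted.
 * Returns the number of bytes accepted, or -1 in case of an error. */
ssize_t send_all(int sd, const void *buf, size_t len, int flags)
{
	const char *p = buf;
	size_t done = 0;

	while (done < len) {
		ssize_t n = send(sd, p + done, len - done, flags);

		if (n < 0) {
			if (errno == EINTR)	/* interrupted, retry */
				continue;
			return -1;
		}
		if (n == 0)	/* no progress, give up */
			break;
		done += (size_t)n;
	}
	return (ssize_t)done;
}
]]></sourcecode>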
</section>
<section title='setsockopt() and getsockopt()'>
<t>Applications use setsockopt() and getsockopt() to set or retrieve
socket options. Socket options are used to change the default
behavior of socket calls. They are described in
<xref target='sockopt'/>.</t>
<t>The function prototypes are</t>
<sourcecode type='language C'>
int getsockopt(int sd,
int level,
int optname,
void *optval,
socklen_t *optlen);
int setsockopt(int sd,
int level,
int optname,
const void *optval,
socklen_t optlen);
</sourcecode>
<t>and the arguments are</t>
<ul>
<li>sd: The socket descriptor.</li>
<li>level: Set to SOL_QUIC for all QUIC options.</li>
<li>optname: The option name.</li>
<li>optval: The buffer to store the value of the option.</li>
<li>optlen: The size of the buffer (or the length of the option
returned).</li>
</ul>
<t>These functions return 0 on success and -1 in case of an error.</t>
</section>
<section title='getsockname() and getpeername()'>
<t>Applications use getsockname() to retrieve the locally bound socket
address of the specified socket and use getpeername() to retrieve the
peer socket address. These functions are particularly useful when
connection migration occurs, and the corresponding event notifications
are not enabled.</t>
<t>The function prototypes are</t>
<sourcecode type='language C'>
int getsockname(int sd,
struct sockaddr *address,
socklen_t *len);
int getpeername(int sd,
struct sockaddr *address,
socklen_t *len);
</sourcecode>
<t>and the arguments are</t>
<ul>
<li>sd: The socket descriptor to be queried.</li>
<li>address: On return, one locally bound or peer address (chosen by the
QUIC stack) is stored in this buffer. If the socket is an IPv4 socket,
the address will be IPv4. If the socket is an IPv6 socket, the
address will be either an IPv6 or IPv4 address.</li>
<li>len: The caller SHOULD set the length of the address buffer here. On
return, this is set to the length of the returned address.</li>
</ul>
<t>These functions return 0 on success and -1 in case of an error.</t>
<t>If the actual length of the address is greater than the length of the
supplied sockaddr structure, the stored address will be truncated.</t>
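<t>The following sketch retrieves the locally bound address with
getsockname() and converts it to text. The helper name local_endpoint() is
hypothetical. A struct sockaddr_storage buffer is used so that neither an
IPv4 nor an IPv6 address is truncated.</t>
<sourcecode type='language C'><![CDATA[
#include <errno.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Store the local address of sd as text in host and its port in
 * *port. Returns 0 on success and -1 in case of an error. */
int local_endpoint(int sd, char *host, socklen_t hostlen,
		   unsigned short *port)
{
	struct sockaddr_storage ss;
	socklen_t len = sizeof(ss);

	if (getsockname(sd, (struct sockaddr *)&ss, &len) < 0)
		return -1;

	if (ss.ss_family == AF_INET) {
		const struct sockaddr_in *a4 =
			(const struct sockaddr_in *)&ss;

		if (!inet_ntop(AF_INET, &a4->sin_addr, host, hostlen))
			return -1;
		*port = ntohs(a4->sin_port);
	} else if (ss.ss_family == AF_INET6) {
		const struct sockaddr_in6 *a6 =
			(const struct sockaddr_in6 *)&ss;

		if (!inet_ntop(AF_INET6, &a6->sin6_addr, host, hostlen))
			return -1;
		*port = ntohs(a6->sin6_port);
	} else {
		errno = EAFNOSUPPORT;
		return -1;
	}
	return 0;
}
]]></sourcecode>
<t>The same pattern applies to getpeername(). Calling either function again
after a connection migration would be one way to observe the new address
when event notifications are not enabled.</t>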
</section>
</section>
<section title='Advanced Operation'>
<section title='quic_sendmsg() and quic_recvmsg()'
anchor='advanced_stream'>
<t>An application uses quic_sendmsg() and quic_recvmsg() calls to
transmit data to and receive data from its peer with stream_id and
flags.</t>
<t>The function prototype of quic_sendmsg() is</t>
<sourcecode type='language C'>
ssize_t quic_sendmsg(int sd,
const void *msg,
size_t len,
int64_t sid,
uint32_t flags);
</sourcecode>
<t>and the arguments are</t>
<ul>
<li>sd: The socket descriptor.</li>
<li>msg: The message to be sent.</li>
<li>len: The length of the message.</li>
<li>sid: The stream_id of the stream on which to send.</li>
<li>flags: function as stream_flags in <xref target='struct_stream'/> and
msg_flags/flags in <xref target='struct_msghdr'/>. Any unknown flags
passed into the kernel MUST be rejected with an error returned.</li>
</ul>
<t>quic_sendmsg() returns the number of bytes accepted by the kernel or -1
in case of an error.</t>
<t>The function prototype of quic_recvmsg() is</t>
<sourcecode type='language C'>
ssize_t quic_recvmsg(int sd,
void *msg,
size_t len,
int64_t *sid,
uint32_t *flags);
</sourcecode>
<t>and the arguments are</t>
<ul>
<li>sd: The socket descriptor.</li>
<li>msg: The message buffer to be filled.</li>
<li>len: The length of the message buffer.</li>
<li>sid: A pointer that receives the stream_id of the stream on which the
data was received.</li>
<li>flags: function as stream_flags in <xref target='struct_stream'/> and
msg_flags/flags in <xref target='struct_msghdr'/> used for passing flags
to the kernel and then obtaining them from the kernel. Any unknown
flags passed into the kernel MUST be rejected with an error returned.</li>
</ul>
<t>quic_recvmsg() returns the number of bytes received or -1 in case of
an error.</t>
<t>These two functions wrap sendmsg() and recvmsg() with Stream
information msg_control and are important for using multiple QUIC
streams. See an example in <xref target='example_stream'/>.</t>
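<t>To illustrate how such a wrapper could be built on top of sendmsg(), the
following sketch attaches one Stream msg_control item to a message. It is
not the definition used by any implementation: the structure layout
(sketch_stream_info), the fallback values for SOL_QUIC and
QUIC_STREAM_INFO, and the function names are assumptions made for
illustration. Passing the flags argument both as stream_flags and as the
sendmsg() flags follows the description of the flags argument above.</t>
<sourcecode type='language C'><![CDATA[
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Assumptions: placeholder values, used only when the platform
 * headers do not define these names. */
#ifndef SOL_QUIC
#define SOL_QUIC 288
#endif
#ifndef QUIC_STREAM_INFO
#define QUIC_STREAM_INFO 0
#endif

/* Assumed layout of the Stream msg_control payload. */
struct sketch_stream_info {
	int64_t  stream_id;
	uint32_t stream_flags;
};

/* Attach one Stream msg_control item to msg, using cbuf as storage.
 * Returns 0 on success and -1 if cbuf is too small. */
int sketch_set_stream_cmsg(struct msghdr *msg, void *cbuf, size_t cbuflen,
			   int64_t sid, uint32_t stream_flags)
{
	struct sketch_stream_info info;
	struct cmsghdr *cmsg;

	if (cbuflen < CMSG_SPACE(sizeof(info)))
		return -1;
	memset(cbuf, 0, CMSG_SPACE(sizeof(info)));
	msg->msg_control = cbuf;
	msg->msg_controllen = CMSG_SPACE(sizeof(info));

	cmsg = CMSG_FIRSTHDR(msg);
	cmsg->cmsg_level = SOL_QUIC;
	cmsg->cmsg_type = QUIC_STREAM_INFO;
	cmsg->cmsg_len = CMSG_LEN(sizeof(info));

	memset(&info, 0, sizeof(info));
	info.stream_id = sid;
	info.stream_flags = stream_flags;
	memcpy(CMSG_DATA(cmsg), &info, sizeof(info));
	return 0;
}

/* Send data on stream sid with one Stream msg_control item. */
ssize_t sketch_quic_sendmsg(int sd, const void *data, size_t len,
			    int64_t sid, uint32_t flags)
{
	union {
		char buf[CMSG_SPACE(sizeof(struct sketch_stream_info))];
		struct cmsghdr align;	/* forces suitable alignment */
	} ctrl;
	struct iovec iov = { .iov_base = (void *)data, .iov_len = len };
	struct msghdr msg;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	if (sketch_set_stream_cmsg(&msg, ctrl.buf, sizeof(ctrl.buf),
				   sid, flags) < 0)
		return -1;
	return sendmsg(sd, &msg, (int)flags);
}
]]></sourcecode>
<t>Building the control buffer through CMSG_SPACE, CMSG_LEN, and CMSG_DATA
keeps the sketch independent of platform-specific padding.</t>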
</section>
<section title='quic_client/server_handshake()'
anchor='advanced_handshake'>
<t>An application uses quic_client_handshake() or quic_server_handshake()
to initiate a QUIC handshake, either with Certificate or PSK mode, from
the client or server side.</t>
<t>The function prototype of quic_server_handshake() is:</t>
<sourcecode type='language C'>
int quic_server_handshake(int sd,
const char *pkey_file,
const char *cert_file,
const char *alpns);
</sourcecode>
<t>and the arguments are</t>
<ul>
<li>sd: The socket descriptor.</li>
<li>pkey_file: Private key file for Certificate mode or pre-shared
key file for PSK mode.</li>
<li>cert_file: Certificate file for Certificate mode or null for
PSK mode.</li>
<li>alpns: The supported ALPNs, separated by ','.</li>
</ul>
<t>The function prototype of quic_client_handshake() is:</t>
<sourcecode type='language C'>
int quic_client_handshake(int sd,
const char *pkey_file,
const char *hostname,
const char *alpns);
</sourcecode>
<t>and the arguments are</t>
<ul>
<li>sd: The socket descriptor.</li>
<li>pkey_file: Pre-shared key file for PSK mode.</li>
<li>hostname: Server name for Certificate mode.</li>
<li>alpns: The supported ALPNs, separated by ','.</li>
</ul>
<t>These functions return 0 on success and an error code in case of an
error.</t>
</section>
<section title='quic_handshake()' anchor='advanced_do_handshake'>
<t>Using quic_handshake() allows an application to have greater control
over the configuration of the handshake session.</t>
<t>The function prototype is</t>
<sourcecode type='language C'>
int quic_handshake(void *session);
</sourcecode>
<t>and the arguments are</t>
<ul>
<li>session: A TLS session, which is represented differently across
various TLS libraries, such as gnutls_session_t in GnuTLS or SSL * in
OpenSSL.</li>
</ul>
<t>With the session argument, users can configure additional parameters at
TLS level and define custom client_handshake() and server_handshake()
functions. An example is provided in
<xref target='example_early_handshake'/>.</t>
<t>In the future, quic_handshake() MAY be considered for integration into
these TLS libraries to provide comprehensive support for the QUIC stack.</t>
<t>Here are some guidelines for handling TLS and QUIC communications
between the kernel and userspace when implementing quic_handshake():</t>
<ul>
<li><t>Handling Raw TLS Messages:</t>
<t>The implementation SHOULD utilize sendmsg() and recvmsg() with Handshake
msg_control <xref target='struct_handshake'/> to send and receive raw TLS
messages between the kernel and userspace. These messages should be
processed in userspace using a TLS library, such as GnuTLS.</t>
</li>
<li><t>Processing the TLS QUIC Transport Parameters Extension:</t>
<t>The implementation SHOULD retrieve the local TLS QUIC transport
parameters extension from the kernel using the TRANSPORT_PARAM_EXT
socket option <xref target='sockopt_transport_param_ext'/> for
building TLS messages. Additionally, remote handshake parameters should
be set in the kernel using the same socket option for constructing QUIC
packets.</t>
</li>
<li><t>Setting Secrets for Different Crypto Levels:</t>
<t>The implementation SHOULD set secrets for various crypto levels using
the CRYPTO_SECRET socket option <xref target='sockopt_crypto_secret'/>.</t>
</li>
</ul>
</section>
</section>
</section>
<section title='Data Structures' anchor='struct'>
<t>This section discusses key data structures specific to QUIC that are
used with sendmsg() and recvmsg() calls. These structures control QUIC
endpoint operations and provide access to ancillary information and
notifications.</t>
<section title='The msghdr and cmsghdr Structures' anchor='struct_msghdr'>
<t>The msghdr structure used in sendmsg() and recvmsg() calls, along with
the ancillary data it carries, is crucial for applications to set and
retrieve various control information from the QUIC endpoint.</t>
<t>The msghdr and the related cmsghdr structures are defined and
discussed in detail in <xref target='RFC3542'/>. They are defined as</t>
<sourcecode type='language C'>
struct msghdr {
void *msg_name; /* ptr to socket address structure */
socklen_t msg_namelen; /* size of socket address structure */
struct iovec *msg_iov; /* scatter/gather array */
int msg_iovlen; /* # elements in msg_iov */
void *msg_control; /* ancillary data */
socklen_t msg_controllen; /* ancillary data buffer length */
int msg_flags; /* flags on message */
};
struct cmsghdr {
socklen_t cmsg_len; /* # bytes, including this header */
int cmsg_level; /* originating protocol */
int cmsg_type; /* protocol-specific type */
/* followed by unsigned char cmsg_data[]; */
};
</sourcecode>
<t>The msg_name is not used when sending a message with sendmsg().</t>
<t>The scatter/gather buffers, or I/O vectors (pointed to by the msg_iov
field) are treated by QUIC as a single user message for both sendmsg()
and recvmsg().</t>
<t>The QUIC stack uses the ancillary data (msg_control field) to
communicate the attributes, such as QUIC_STREAM_INFO, of the message
stored in msg_iov to the socket endpoint. The different ancillary
data types are described in <xref target='control_struct'/>.</t>
<t>On send side:</t>
<ul>
<li><t>The flags parameter in sendmsg() can be set to:</t>
<ul>
<li>MSG_MORE: Indicates that data will be held until the next data is
sent without this flag.</li>
<li>MSG_DONTWAIT: Prevents blocking if no send buffer space is
available.</li>
<li>MSG_DATAGRAM: Sends data as an unreliable datagram.</li>
</ul>
<t>Additionally, the flags can also be set to the values in stream_flags
on the send side <xref target='struct_stream'/> if Stream msg_control is
not being used. In this case, the most recently opened stream will be
used for sending data.</t>
</li>
<li><t>msg_flags of msghdr passed to the kernel is ignored.</t></li>
</ul>
<t>On receive side:</t>
<ul>
<li><t>The flags parameter in recvmsg() might be set to:</t>
<ul>
<li>MSG_DONTWAIT: Prevents blocking if there is no data in recv
buffer.</li>
</ul>
</li>
<li><t>msg_flags of msghdr returned from the kernel might be set to:</t>
<ul>
<li>MSG_EOR: Indicates that the received data is read completely.</li>
<li>MSG_DATAGRAM: Indicates that the received data is a datagram.</li>
<li>MSG_NOTIFICATION: Indicates that the received data is a notification
message.</li>
</ul>
<t>msg_flags might also be set to the values in stream_flags on the
receive side <xref target='struct_stream'/> if the Stream msg_control is
not being used. In this case, the stream ID for received data is not
visible to the application.</t>
</li>
</ul>
</section>
<section title='Ancillary Data Considerations and Semantics'>
<t>Programming with ancillary socket data (msg_control) involves some
subtleties and pitfalls, which are discussed below.</t>
<section title='Multiple Items and Ordering'>
<t>Multiple ancillary data items MAY be included in any call to sendmsg()
or recvmsg(). These MAY include QUIC-specific items, non-QUIC items (such
as IP-level items), or both. The ordering of ancillary data items, whether
QUIC-related or from another protocol, is implementation-dependent and not
significant. Therefore, applications MUST NOT rely on any specific
ordering. </t>
<t>QUIC_STREAM_INFO and QUIC_HANDSHAKE_INFO type ancillary data always
correspond to the data in the msghdr's msg_iov member. At most one
ancillary data item of these types is allowed per sendmsg() or recvmsg()
call.</t>
</section>
<section title='Accessing and Manipulating Ancillary Data'>
<t>Applications can infer the presence of data or ancillary data by
examining the msg_iovlen and msg_controllen msghdr members,
respectively.</t>
<t>Implementations MAY have different padding requirements for ancillary
data, so portable applications SHOULD make use of the macros
CMSG_FIRSTHDR, CMSG_NXTHDR, CMSG_DATA, CMSG_SPACE, and CMSG_LEN. See
<xref target='RFC3542'/> for more information. The following is an example
from <xref target='RFC3542'/>, demonstrating the use of these macros to
access ancillary data:</t>
<sourcecode type='language C'>
struct msghdr msg;
struct cmsghdr *cmsgptr;

/* fill in msg */
/* call recvmsg() */

for (cmsgptr = CMSG_FIRSTHDR(&msg); cmsgptr != NULL;
     cmsgptr = CMSG_NXTHDR(&msg, cmsgptr)) {
    if (cmsgptr->cmsg_len == 0) {
        /* Error handling */
        break;
    }
    if (cmsgptr->cmsg_level == ... && cmsgptr->cmsg_type == ... ) {
        u_char *ptr;

        ptr = CMSG_DATA(cmsgptr);
        /* process data pointed to by ptr */
    }
}
</sourcecode>
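<t>As a complement to the fragment above, the following is a minimal,
self-contained sketch of the same traversal wrapped in a reusable lookup
function. The name find_cmsg_data is hypothetical and is not part of the
API described in this document; the level and type values are supplied by
the caller.</t>

```c
#include <stddef.h>
#include <sys/socket.h>

/*
 * Return a pointer to the payload of the first ancillary data item whose
 * level and type match, or NULL if no such item is present.
 * find_cmsg_data is a hypothetical helper name used for illustration.
 */
static void *find_cmsg_data(struct msghdr *msg, int level, int type)
{
    struct cmsghdr *cmsg;

    for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL;
         cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_len == 0)
            break; /* malformed item: stop rather than loop forever */
        if (cmsg->cmsg_level == level && cmsg->cmsg_type == type)
            return CMSG_DATA(cmsg);
    }
    return NULL;
}
```

<t>Because the function relies only on the CMSG_* macros, it makes no
assumption about item ordering or padding, in line with the guidance in
this section.</t>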
</section>
<section title='Control Message Buffer Sizing'>
<t>The information conveyed via QUIC_STREAM_INFO and QUIC_HANDSHAKE_INFO
ancillary data will often be fundamental to the correct and sane
operation of the sockets application. For example, if an application
needs to send and receive data on different QUIC streams, QUIC_STREAM_INFO
ancillary data is indispensable.</t>
<t>Given that some ancillary data is critical and that multiple
ancillary data items MAY appear in any order, applications SHOULD be
carefully written to always provide a large enough buffer to contain
all possible ancillary data that can be presented by recvmsg(). If
the buffer is too small and crucial data is truncated, this can result
in a fatal error condition.</t>
<t>Thus, it is essential that applications be able to deterministically
calculate the maximum required buffer size to pass to recvmsg(). One
constraint imposed on this specification that makes this possible is
that all ancillary data definitions are of a fixed length. One way
to calculate the maximum required buffer size might be to take the
sum of the sizes of all enabled ancillary data item structures, as
calculated by CMSG_SPACE. For example, if we enabled QUIC_STREAM_INFO
and IPV6_RECVPKTINFO <xref target='RFC3542'/>, we would calculate and
allocate the buffer size as follows:</t>
<sourcecode type='language C'>
size_t total;
void *buf;
total = CMSG_SPACE(sizeof(struct quic_stream_info)) +
CMSG_SPACE(sizeof(struct in6_pktinfo));
buf = malloc(total);
</sourcecode>
<t>We could then use this buffer (buf) for msg_control on each call to
recvmsg() and be assured that we would not lose any ancillary data to
truncation.</t>
</section>
</section>
<section title='QUIC msg_control Structures' anchor='control_struct'>
<section title='Stream Information' anchor='struct_stream'>
<t>This control message (cmsg) specifies QUIC stream options for sendmsg()
and describes QUIC stream information about a received message via
recvmsg(). It uses struct quic_stream_info:</t>
<sourcecode type='language C'>
struct quic_stream_info {
    uint64_t stream_id;
    uint32_t stream_flags;
};
</sourcecode>
<t>On the send side:</t>
<ul>
<li><t>stream_id:</t>
<ul>
<li><t>-1:</t>
<ul>
<li>If MSG_STREAM_NEW is set: Opens the next bidirectional stream and
uses it for sending data.</li>
<li>If both MSG_STREAM_NEW and MSG_STREAM_UNI are set: Opens the next
unidirectional stream and uses it for sending data.</li>
<li>Otherwise: Uses the most recently opened stream for sending data.</li>
</ul>
</li>
<li><t>Any other value: The specified stream ID is used. Its two least
significant bits encode the stream type:</t>
<ul>
<li>QUIC_STREAM_TYPE_SERVER_MASK(0x1): Indicates whether it is a
server-initiated stream.</li>
<li>QUIC_STREAM_TYPE_UNI_MASK(0x2): Indicates whether it is a
unidirectional stream.</li>
</ul>
</li>
</ul>
</li>
<li><t>stream_flags:</t>
<ul>
<li>MSG_STREAM_NEW: Open a stream and send the first data.</li>
<li>MSG_STREAM_FIN: Send the last data and close the stream.</li>
<li>MSG_STREAM_UNI: Open the next unidirectional stream.</li>
<li>MSG_STREAM_DONTWAIT: Open the stream without blocking.</li>
<li>MSG_STREAM_SNDBLOCK: Send a STREAMS_BLOCKED frame when there is no
capacity to open the stream.</li>
</ul>
</li>
</ul>
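<t>The stream type encoding described for stream_id can be illustrated
with a short sketch. The mask values are taken from the list above; the
helper function names are hypothetical and exist only for this
example.</t>

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask values as listed in the text above. */
#ifndef QUIC_STREAM_TYPE_SERVER_MASK
#define QUIC_STREAM_TYPE_SERVER_MASK 0x1
#endif
#ifndef QUIC_STREAM_TYPE_UNI_MASK
#define QUIC_STREAM_TYPE_UNI_MASK    0x2
#endif

/* Hypothetical helpers: classify a stream ID by its two low bits. */
static bool stream_is_server_initiated(uint64_t stream_id)
{
    return (stream_id & QUIC_STREAM_TYPE_SERVER_MASK) != 0;
}

static bool stream_is_unidirectional(uint64_t stream_id)
{
    return (stream_id & QUIC_STREAM_TYPE_UNI_MASK) != 0;
}
```

<t>For example, stream ID 0 is a client-initiated bidirectional stream,
while stream ID 3 is a server-initiated unidirectional stream.</t>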
<t>On the receive side:</t>
<ul>
<li><t>stream_id: Identifies the stream to which the received data
belongs.</t></li>
<li><t>stream_flags:</t>
<ul>
<li>MSG_STREAM_FIN: Indicates that the received data is the last data for
this stream.</li>
</ul>
</li>
</ul>
<t>This cmsg is specifically used for sending user stream data, including
early or 0-RTT data. When sending user unreliable datagrams, this cmsg
SHOULD NOT be set.</t>
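<t>The following sketch shows one possible way to attach a
QUIC_STREAM_INFO item to a msghdr before calling sendmsg(). It is an
illustration under stated assumptions, not a definitive implementation:
the numeric values given for SOL_QUIC and QUIC_STREAM_INFO are
placeholders (real values come from the implementation's header), and
quic_fill_stream_msg is a hypothetical helper name.</t>

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* ASSUMPTION: placeholder values for illustration only. */
#ifndef SOL_QUIC
#define SOL_QUIC 288
#endif
#ifndef QUIC_STREAM_INFO
#define QUIC_STREAM_INFO 0
#endif

struct quic_stream_info {
    uint64_t stream_id;
    uint32_t stream_flags;
};

/*
 * Hypothetical helper: prepare msg so that it carries the data in iov
 * plus one QUIC_STREAM_INFO item. cbuf must be suitably aligned for
 * struct cmsghdr. Returns 0 on success, -1 if cbuf is too small.
 */
static int quic_fill_stream_msg(struct msghdr *msg, struct iovec *iov,
                                void *cbuf, size_t cbuf_len,
                                uint64_t stream_id, uint32_t stream_flags)
{
    struct quic_stream_info info;
    struct cmsghdr *cmsg;

    if (cbuf_len < CMSG_SPACE(sizeof(info)))
        return -1;

    memset(msg, 0, sizeof(*msg));
    memset(cbuf, 0, CMSG_SPACE(sizeof(info)));
    msg->msg_iov = iov;
    msg->msg_iovlen = 1;
    msg->msg_control = cbuf;
    msg->msg_controllen = CMSG_SPACE(sizeof(info));

    cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = SOL_QUIC;
    cmsg->cmsg_type = QUIC_STREAM_INFO;
    cmsg->cmsg_len = CMSG_LEN(sizeof(info));

    /* Copy through a local struct so padding bytes are zeroed. */
    memset(&info, 0, sizeof(info));
    info.stream_id = stream_id;
    info.stream_flags = stream_flags;
    memcpy(CMSG_DATA(cmsg), &info, sizeof(info));
    return 0;
}
```

<t>An application would then pass the prepared msghdr to sendmsg() on a
QUIC socket. The same buffer layout applies on the receive side, where
the stack fills in the item and the application reads it back with
CMSG_DATA.</t>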
</section>
<section title='Handshake Information' anchor='struct_handshake'>
<t>This control message (cmsg) provides information for sending and
receiving handshake/TLS messages via sendmsg() or recvmsg(). It uses
struct quic_handshake_info:</t>
<sourcecode type='language C'>
struct quic_handshake_info {
    uint8_t crypto_level;
};
</sourcecode>
<t>crypto_level: Specifies the encryption level of the data:</t>
<ul>
<li>QUIC_CRYPTO_INITIAL: Initial level data.</li>
<li>QUIC_CRYPTO_HANDSHAKE: Handshake level data.</li>
</ul>
<t>This cmsg is used only during the handshake process.</t>
</section>
</section>
</section>
<section title='QUIC Events and Notifications'>
<t>A QUIC application may need to understand and process events and
errors that occur within the QUIC stack, such as stream updates,
max_stream changes, connection close, connection migration, key updates,
and new tokens. These events are categorized under the quic_event_type
enum:</t>
<sourcecode type='language C'>
enum quic_event_type {
    QUIC_EVENT_NONE,
    QUIC_EVENT_STREAM_UPDATE,
    QUIC_EVENT_STREAM_MAX_STREAM,
    QUIC_EVENT_CONNECTION_ID,
    QUIC_EVENT_CONNECTION_CLOSE,
    QUIC_EVENT_CONNECTION_MIGRATION,
    QUIC_EVENT_KEY_UPDATE,
    QUIC_EVENT_NEW_TOKEN,
    QUIC_EVENT_NEW_SESSION_TICKET,
};
</sourcecode>
<t>When a notification arrives, recvmsg() returns the notification in the
application-supplied data buffer via msg_iov and sets MSG_NOTIFICATION
in msg_flags of msghdr in <xref target='struct_stream'/>.</t>
<t>The first byte of the received data indicates the type of the event,
corresponding to one of the values in the quic_event_type enum. The
subsequent bytes contain the content of the event, meaning the length of
the content is the total data length minus one byte. To manage and enable
these events, refer to the EVENT socket option
<xref target='sockopt_event'/>.</t>
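<t>The notification layout described above (one type byte followed by the
event content) can be sketched as follows. The enum is copied from this
section; struct quic_event_view and quic_event_split are hypothetical
names introduced only for this example.</t>

```c
#include <stddef.h>
#include <stdint.h>

enum quic_event_type {
    QUIC_EVENT_NONE,
    QUIC_EVENT_STREAM_UPDATE,
    QUIC_EVENT_STREAM_MAX_STREAM,
    QUIC_EVENT_CONNECTION_ID,
    QUIC_EVENT_CONNECTION_CLOSE,
    QUIC_EVENT_CONNECTION_MIGRATION,
    QUIC_EVENT_KEY_UPDATE,
    QUIC_EVENT_NEW_TOKEN,
    QUIC_EVENT_NEW_SESSION_TICKET,
};

/* Hypothetical view onto a notification returned by recvmsg(). */
struct quic_event_view {
    uint8_t        type;        /* one of enum quic_event_type */
    const uint8_t *content;     /* bytes following the type byte */
    size_t         content_len; /* total length minus one */
};

/*
 * Split a received notification of len bytes into type and content.
 * Returns 0 on success, -1 if the buffer holds no type byte.
 */
static int quic_event_split(const void *buf, size_t len,
                            struct quic_event_view *ev)
{
    const uint8_t *p = buf;

    if (len < 1)
        return -1;
    ev->type = p[0];
    ev->content = p + 1;
    ev->content_len = len - 1;
    return 0;
}
```

<t>An application would call such a function only when MSG_NOTIFICATION
is set in msg_flags, passing the msg_iov buffer and the length returned
by recvmsg().</t>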
<section title='QUIC Notification Structure' anchor='notification'>
<section title='QUIC_EVENT_STREAM_UPDATE'>
<t>Only notifications with one of the following states are delivered to
userspace:</t>
<ul>
<li><t>QUIC_STREAM_SEND_STATE_RECVD</t>
<t>An update is delivered when all data on the stream has been
acknowledged. This indicates that the peer has confirmed receipt of
all sent data for this stream.</t>
</li>
<li><t>QUIC_STREAM_SEND_STATE_RESET_SENT</t>
<t>An update is delivered only if a STOP_SENDING frame is received from
the peer and a RESET_STREAM frame is sent in response. The STOP_SENDING
frame can be sent by the peer via the STREAM_STOP_SENDING socket option
<xref target='sockopt_stream_stop_sending'/>.</t>
</li>
<li><t>QUIC_STREAM_SEND_STATE_RESET_RECVD</t>
<t>An update is delivered when a RESET_STREAM frame has been received and
acknowledged by the peer. The RESET_STREAM frame can be sent via the
STREAM_RESET socket option <xref target='sockopt_stream_reset'/>.</t>
</li>
<li><t>QUIC_STREAM_RECV_STATE_RECV</t>
<t>An update is delivered only when the last fragment of data has not
yet arrived. This event is sent to inform the application that there is
pending data for the stream.</t>
</li>
<li><t>QUIC_STREAM_RECV_STATE_SIZE_KNOWN</t>
<t>An update is delivered only if data arrives out of order. This
indicates that the size of the data is known, even though the fragments
are not in sequential order.</t>
</li>
<li><t>QUIC_STREAM_RECV_STATE_RECVD</t>
<t>An update is delivered when all data on the stream has been fully
received. This signifies that the application has received the complete
data for the stream.</t>
</li>
<li><t>QUIC_STREAM_RECV_STATE_RESET_RECVD</t>
<t>An update is delivered when a RESET_STREAM frame is received. This
indicates that the peer has reset the stream and that no further data is
to be expected.</t>
</li>
</ul>
<t>Data format in the event:</t>
<sourcecode type='language C'>
struct quic_stream_update {
    uint64_t id;
    uint32_t state;
    uint32_t errcode;
    uint64_t finalsz;
};
</sourcecode>
<t>id: The stream ID.</t>
<t>state: The new stream state. All valid states are listed above.</t>
<t>errcode: Error code for the application protocol. It is used for the
RESET_SENT or RESET_RECVD state update on the send side, and for the
RESET_RECVD update on the receive side.</t>
<t>finalsz: The final size of the stream. It is used for the SIZE_KNOWN,
RESET_RECVD, or RECVD state updates on the receive side.</t>
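<t>Since the event content starts one byte into the receive buffer, it is
generally not aligned for direct structure access. The sketch below
copies the content into a local struct quic_stream_update instead. It
assumes that the content carries the structure in host layout, and
quic_stream_update_decode is a hypothetical helper name.</t>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct quic_stream_update {
    uint64_t id;
    uint32_t state;
    uint32_t errcode;
    uint64_t finalsz;
};

/*
 * Hypothetical helper: content points just past the event type byte.
 * ASSUMPTION: the content holds a struct quic_stream_update in host
 * layout. Returns 0 on success, -1 if the content is too short.
 */
static int quic_stream_update_decode(const uint8_t *content,
                                     size_t content_len,
                                     struct quic_stream_update *out)
{
    if (content_len < sizeof(*out))
        return -1;
    memcpy(out, content, sizeof(*out)); /* safe for unaligned input */
    return 0;
}
```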
</section>
<section title='QUIC_EVENT_STREAM_MAX_STREAM'>
<t>This notification is delivered when a MAX_STREAMS frame is received. It
is particularly useful when the MSG_STREAM_DONTWAIT flag in stream_flags
was used to open, via the STREAM_OPEN socket option
<xref target='sockopt_stream_open'/>, a stream whose ID exceeds the
current maximum stream count. After receiving this notification, the
application SHOULD attempt to open the stream again.</t>
<t>Data format in the event:</t>
<sourcecode type='language C'>
uint64_t max_stream;
</sourcecode>
<t>It indicates the maximum stream limit for a specific stream type. The
stream type is encoded in the two least significant bits, and the maximum
stream limit is calculated by shifting max_stream right by 2 bits.</t>
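<t>The decoding of max_stream described above can be sketched as follows.
The function names are hypothetical; the bit layout is the one stated in
this section.</t>

```c
#include <stdint.h>

/* Hypothetical helper: stream type held in the two low bits. */
static unsigned int quic_max_stream_type(uint64_t max_stream)
{
    return (unsigned int)(max_stream & 0x3);
}

/* Hypothetical helper: stream limit held in the remaining bits. */
static uint64_t quic_max_stream_limit(uint64_t max_stream)
{
    return max_stream >> 2;
}
```

<t>For example, a max_stream value of 402 (binary 110010010) yields
stream type 2 (unidirectional, client-initiated) and a limit of 100.</t>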
</section>
<section title='QUIC_EVENT_CONNECTION_ID'>
<t>This notification is delivered when any source or destination connection
IDs are retired. This usually occurs during connection migration or when
managing connection IDs via the CONNECTION_ID socket option
<xref target='sockopt_connid'/>.</t>
<t>Data format in the event:</t>
<sourcecode type='language C'>
struct quic_connection_id_info {
    uint8_t  dest;
    uint32_t active;
    uint32_t prior_to;
};
</sourcecode>
<ul>
<li>dest: Indicates whether to operate on destination connection IDs.</li>
<li>active: The number of the connection ID in use.</li>
<li>prior_to: The lowest connection ID number.</li>
</ul>
</section>
<section title='QUIC_EVENT_CONNECTION_CLOSE'>
<t>This notification is delivered when a CONNECTION_CLOSE frame is received
from the peer. The peer can set the close information via the
CONNECTION_CLOSE socket option <xref target='sockopt_close'/> before
calling close().</t>
<t>Data format in the event:</t>
<sourcecode type='language C'>
struct quic_connection_close {
    uint32_t errcode;
    uint8_t  frame;
    uint8_t  phrase[];
};
</sourcecode>
<t>errcode: Error code for the application protocol.</t>
<t>frame: Frame type that caused the closure.</t>
<t>phrase: Optional string for additional details.</t>
</section>