IPv6 Hop-by-Hop & Destination Option #56
Changes from 13 commits
3771848
10355fd
fe346e4
d4972e5
703fd49
cc70169
800c506
7456957
c5c1a52
9e75a1f
06a5ea0
4638a0f
4601746
5ff83a8
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -494,19 +494,69 @@ corruption at preceding hops. | |
|
||
## Header Location | ||
|
||
We describe three encapsulation formats in this specification, covering | ||
We describe five encapsulation formats in this specification, covering | ||
different deployment scenarios, with and without network virtualization: | ||
|
||
1. *INT over TCP/UDP* - A shim header is inserted following TCP/UDP | ||
2. "INT over IPv6" - INT Headers are carried in the IPv6 packets as Hop-by-Hop option. | ||
3. *INT over TCP/UDP* - A shim header is inserted following TCP/UDP | ||
header. INT Headers are carried between this shim header and TCP/UDP payload. | ||
This approach doesn’t rely on any tunneling/virtualization mechanism and is | ||
versatile to apply INT to both native and virtualized traffic. | ||
2. *INT over VXLAN* - VXLAN generic protocol extensions | ||
4. *INT over VXLAN* - VXLAN generic protocol extensions | ||
(draft-ietf-nvo3-vxlan-gpe) are used to carry INT Headers between | ||
the VXLAN header and the encapsulated VXLAN payload. | ||
3. *INT over Geneve* - Geneve is an extensible tunneling framework, allowing | ||
5. *INT over Geneve* - Geneve is an extensible tunneling framework, allowing | ||
Geneve options to be defined for INT Headers. | ||
|
||
### INT over IPv6 | ||
|
||
INT in IPv6 can be supported by encapsulating the INT Metadata Header and | ||
Metadata in "option data" field of the Hop-by-Hop Options header. In order | ||
for INT to work in IPv6 networks, INT must be explicitly enabled per interface | ||
on every node within the INT domain. Unless a particular interface is explicitly | ||
enabled (i.e. explicitly configured) for INT, a router MUST drop packets which | ||
contain extension headers carrying INT Metadata Header and Metadata. This | ||
ensures that INT data does not unintentionally get forwarded outside the | ||
INT domain. | ||
|
||
IPv6 Hop-by-Hop Option format for carrying INT Header and | ||
Metadata: | ||
|
||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||
| Option Type | Opt Data Len |Reserved (MBZ) | INT TYPE | | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+ | ||
| Variable Option Data (INT Metadata Headers and Metadata) | | | ||
. . | | ||
. . N | ||
. . T | ||
. . | | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+ | ||
|
||
* Option Type: 8-bit identifier of the type of option. | ||
|
||
001xxxxx 8-bit identifier of the type of option. xxxxx=TBD_IANA_INT_HOP_BY_HOP_OPTION_IPV6. | ||
001xxxxx 8-bit identifier of the type of option. xxxxx=TBD_IANA_INT_DESTINATION_OPTION_IPV6. | ||
Review comment: Looking at the IANA registry, there are a total of 32 code points of which 17 have already been allocated. The registration procedure is IESG Approval, IETF Review or Standards Action. IOAM is asking for 4 code points, which seems unlikely. The chances for INT to get any code points are not high.
Review comment: I agree.
Review comment: I see two options:
Review comment: I see another problem with the corresponding IETF IOAM IPv6 draft. The text says that "a router MUST drop packets which contain extension headers carrying IOAM data-fields", to "ensure that the IOAM data does not unintentionally get forwarded outside the IOAM domain." However, they asked for an Option Type codepoint starting with "00", which means when the option type is unrecognized, "skip over this option and continue processing the header". If the text is correct, then they should ask for any of the other codepoint prefixes "01" (discard the packet), "10" (discard and send ICMP parameter problem, code 2, back to the packet's source address), or "11" (discard and send ICMP only if the packet's destination address was not a multicast address).
Review comment: I will close loop with IETF and address this comment.
Review comment: Whatever we do, two codepoints will not fly. At a minimum we would have to go with TBD_IANA_INT_OPTION_IPV6 (not distinguishing between INT hop-by-hop and INT destination), which would later get resolved to either experimental hop-by-hop options codepoint or whatever IOAM has assigned. If we go with IOAM then the INT Type values might need to be shifted to avoid conflicts. I also wonder if we should use xxx or yyy for the first 3 bits as well given the other open issue I stated above.
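Editorial sketch: the codepoint-prefix concern raised in this thread can be checked mechanically. The snippet below decodes the three high-order bits of an IPv6 option type following the general rules of RFC 8200, Section 4.2. The function name `decode_option_type` and the sample value are hypothetical and are not part of the PR.

```python
# Sketch only: interprets the high-order bits of an 8-bit IPv6 option type.
# Top two bits: action when the option is unrecognized.
# Third bit: whether the option data may change en route.
UNRECOGNIZED_ACTION = {
    0b00: "skip over this option and continue processing the header",
    0b01: "discard the packet",
    0b10: "discard and send ICMP Parameter Problem, Code 2",
    0b11: "discard and send ICMP Parameter Problem, Code 2, unless destination was multicast",
}


def decode_option_type(option_type: int) -> dict:
    if not 0 <= option_type <= 0xFF:
        raise ValueError("option type must fit in one octet")
    act = option_type >> 6
    chg = (option_type >> 5) & 0x1
    return {
        "act": act,
        "action_if_unrecognized": UNRECOGNIZED_ACTION[act],
        "may_change_en_route": bool(chg),
        "low_bits": option_type & 0x1F,
    }


# The "001" prefix used in the draft text, with placeholder low-order bits of 0.
info = decode_option_type(0b00100000)
print(info["action_if_unrecognized"])
```

With a `00` prefix, a node that does not recognize the option skips it and keeps forwarding, which is the mismatch with the "MUST drop" text that the reviewer points out.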
||
|
||
* Opt Data Len: 8-bit unsigned integer. Length of the Reserved and Option Data field of this | ||
option, in octets. | ||
|
||
* Reserved (MBZ): 16 bit field, must be filled with zeroes upon transmission and ignored upon | ||
Review comment: 8 bit field
||
reception. | ||
|
||
* Type: This field indicates the type of INT Metadata Header and Metadata following. | ||
Review comment: nit: The figure above labels the field as "INT TYPE". We should make this consistent, I guess with "INT Type"?
||
Two Type values are used: one for the hop-by-hop header type and the other for | ||
the destination header type (See Section [#sec-int-header-types]). | ||
|
||
* Variable Option Data: Variable length field. INT Metadata Header and Metadata, multiple of | ||
four octets in length. | ||
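Editorial sketch: one way to serialize the option described by the field list above. It follows the figure (8-bit Reserved, 8-bit INT Type) and assumes that Opt Data Len counts every octet after the length octet, that is, Reserved, INT Type and the variable data. The option type `0x20` and INT Type `1` are placeholders, since the real values are still TBD in this PR.

```python
import struct


def build_int_option(option_type: int, int_type: int, int_data: bytes) -> bytes:
    """Pack Option Type, Opt Data Len, Reserved (MBZ), INT Type, then INT data."""
    if len(int_data) % 4 != 0:
        raise ValueError("INT Metadata Header and Metadata must be a multiple of 4 octets")
    # Assumption: the length covers Reserved + INT Type + variable option data.
    opt_data_len = 2 + len(int_data)
    if opt_data_len > 0xFF:
        raise ValueError("option data does not fit in the 8-bit Opt Data Len field")
    reserved = 0  # must be zero on transmission
    return struct.pack("!BBBB", option_type, opt_data_len, reserved, int_type) + int_data


# Placeholder codepoints; 12 zero octets stand in for an INT metadata header.
option = build_int_option(0x20, 1, bytes(12))
print(option.hex())
```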
|
||
The INT IPv6 options defined here have alignment requirements. Specifically, they require 4n alignment. | ||
This ensures that 4 octet fields of the INT metadata, such as Hop Latency, are aligned at a multiple-of-4 | ||
offset from the start of the Hop-by-Hop Options header. In INT v2.0, there are 4 octets in the | ||
shim header and 12 octets in the fixed header. In order to maintain IPv6 extension header 8-octet | ||
alignment...padding requirement TBD | ||
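Editorial sketch: the padding requirement is still TBD in this PR, so the following only illustrates the generic RFC 8200 arithmetic under one possible reading. It assumes a single INT option, leading padding so that the option starts at a 4n offset, and trailing padding so that the whole Hop-by-Hop Options header is a multiple of 8 octets. The helper names are hypothetical.

```python
def pad_option(n: int) -> bytes:
    """Return n octets of RFC 8200 padding (Pad1 for 1 octet, PadN otherwise)."""
    if n == 0:
        return b""
    if n == 1:
        return b"\x00"  # Pad1
    return bytes([1, n - 2]) + bytes(n - 2)  # PadN: type 1, length, zero fill


def build_hop_by_hop(next_header: int, int_option: bytes) -> bytes:
    # Next Header and Hdr Ext Len occupy offsets 0 and 1, so 2 octets of
    # leading padding place the INT option at offset 4 (4n alignment).
    lead = pad_option((-2) % 4)
    body = lead + int_option
    # Trailing padding brings the header to a multiple of 8 octets.
    trail = pad_option((-(2 + len(body))) % 8)
    total = 2 + len(body) + len(trail)
    hdr_ext_len = total // 8 - 1  # 8-octet units, not counting the first 8
    if hdr_ext_len > 0xFF:
        raise ValueError("Hop-by-Hop Options header too long")
    return bytes([next_header, hdr_ext_len]) + body + trail


# 41 = IPv6 as the next header; a 16-octet stand-in for the INT option.
hbh = build_hop_by_hop(41, bytes(16))
print(len(hbh), hbh[1])
```

Under these assumptions a 16-octet option yields a 24-octet extension header with 4 octets of trailing padding.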
|
||
Review comment: We need to add some text to clarify the following regarding INT within IPv6:
Review comment: Let us discuss the padding issue in person.
||
|
||
### INT over TCP/UDP | ||
|
||
In case the traffic being monitored is not encapsulated by any virtualization | ||
|
@@ -755,11 +805,14 @@ and the metadata itself. | |
INT Metadata Header and Metadata Stack: | ||
` | ||
0 1 2 3 | ||
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | ||
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | ||
|
||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||
|Ver = 2|Res|D|E|M| Reserved | Hop ML |RemainingHopCnt| | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||
| Ver |Rep|C|E|M| Reserved | Hop ML |RemainingHopCnt| | ||
| Instruction Bitmap | Domain Specific ID | | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||
| Instruction Bitmap | Reserved | | ||
| DS Instruction | DS Flags | | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||
| INT Metadata Stack (Each hop inserts Hop ML * 4B of metadata) | | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||
|
@@ -769,38 +822,16 @@ INT Metadata Header and Metadata Stack: | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||
` | ||
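Editorial sketch: a parser for the 12-byte version 2 header layout shown in the diagram, using the bit widths from the field descriptions in this diff (Ver 4b, Res 2b, D/E/M 1b each, Reserved 10b, Hop ML 5b, Remaining Hop Count 8b). The class and function names are hypothetical. The order of DS Instruction and DS Flags follows this header diagram; the IPv6 example later in the diff shows them swapped, so treat that order as an assumption.

```python
import struct
from typing import NamedTuple


class IntMdHeader(NamedTuple):
    ver: int
    res: int
    d: bool
    e: bool
    m: bool
    hop_ml: int
    remaining_hop_cnt: int
    instruction_bitmap: int
    domain_specific_id: int
    ds_instruction: int
    ds_flags: int


def parse_int_md_header(buf: bytes) -> IntMdHeader:
    if len(buf) < 12:
        raise ValueError("INT metadata header is 12 bytes long")
    word0, instr, ds_id, ds_instr, ds_flags = struct.unpack("!IHHHH", buf[:12])
    return IntMdHeader(
        ver=word0 >> 28,
        res=(word0 >> 26) & 0x3,
        d=bool((word0 >> 25) & 1),
        e=bool((word0 >> 24) & 1),
        m=bool((word0 >> 23) & 1),
        hop_ml=(word0 >> 8) & 0x1F,  # bits 22..13 are reserved
        remaining_hop_cnt=word0 & 0xFF,
        instruction_bitmap=instr,
        domain_specific_id=ds_id,
        ds_instruction=ds_instr,
        ds_flags=ds_flags,
    )


# Ver = 2, D set, Hop ML = 3, Remaining Hop Count = 8, Switch ID requested.
raw = struct.pack("!IHHHH", (2 << 28) | (1 << 25) | (3 << 8) | 8, 0x8000, 0, 0, 0)
hdr = parse_int_md_header(raw)
print(hdr)
```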
|
||
* INT metadata header is 8 bytes long followed by a stack of INT metadata. | ||
* INT metadata header is 12 bytes long followed by a stack of INT metadata. | ||
Each metadata is either 4 bytes or 8 bytes in length. Each INT hop adds | ||
the same length of metadata. The total length of the metadata stack is | ||
variable as different packets may traverse different paths and hence | ||
different number of INT hops. | ||
|
||
* The fields in the INT metadata header are interpreted the following way: | ||
- Ver (4b): INT metadata header version. Should be 1 for this version. | ||
- Rep (2b): Replication requested. Support for this request is optional. If | ||
this value is non-zero, the device may replicate the INT packet. This is useful | ||
to explore all the valid physical forwarding paths when multi-path forwarding | ||
techniques (e.g., ECMP, LAG) are used in the network. Note the Rep bits should | ||
be used judiciously (e.g., only for probe packets, not for every data packet). | ||
While we recommend that Rep bits be set only for probe packets, the INT | ||
architecture does not (and perhaps cannot) disallow use of the Rep bits for real | ||
data packets. | ||
- 0: No replication requested. | ||
- 1: Port-level (L2-level) replication requested. If the INT packet is | ||
forwarded through a logical port that is a port-channel (LAG), then replicate | ||
the packet on each physical port in the port-channel and send a single copy per | ||
physical port. | ||
- 2: Next-hop-level (L3-level) replication requested. Forward the packet | ||
to each L3 ECMP next-hop valid for the destination address, with INT headers | ||
replicated in each forwarded copy. | ||
- 3: Port-level and Next-hop-level replication requested. | ||
- C (1b): Copy. | ||
- If replication is requested for data packets, the INT Sink must be | ||
able to distinguish the original packet from replicas so that it can forward | ||
only original packets up the protocol stack, and drop all the replicas. The C | ||
bit must be set to 1 on each copy, whenever an INT hop replicates a packet. | ||
The original packet must have C bit set to 0. | ||
- C bit must be set to 0 in the original packet by INT source | ||
- Ver (4b): INT metadata header version. Should be 2 for this version. | ||
- Res (2b): Reserved | ||
- D (1b): Discard Copy/Clone. The INT Sink should discard the packet after extracting INT data. | ||
- E (1b): Max Hop Count exceeded. | ||
- This flag must be set if a device cannot prepend its own metadata due to | ||
the Remaining Hop Count reaching zero. | ||
|
@@ -821,17 +852,18 @@ The original packet must have C bit set to 0. | |
switch(es) set the M bit based on knowledge of the network topology | ||
and "Switch ID, Ingress port ID, Egress port ID" tuples in the INT | ||
metadata stack. | ||
- R: Reserved bits. | ||
- Hop ML (5b): Per-hop Metadata Length, the length of metadata in 4-Byte words | ||
to be inserted at each INT hop. | ||
- While the largest value of Per-hop Metadata Length is 31, an INT-capable | ||
device may be limited in the maximum number of instructions it can process | ||
and/or maximum length of metadata it can insert in data packets. An INT | ||
hop that cannot process all instructions must still insert Per-hop | ||
Metadata Length \* 4 bytes, with all-ones reserved value (4 or 8 bytes | ||
of 0xFF depending on the length of metadata) for the metadata | ||
corresponding to instructions it cannot process. An INT hop that | ||
cannot insert Per-hop Metadata Length \* 4 bytes must skip INT | ||
- R (10b): Reserved bits. | ||
|
||
- Hop ML (5b): Per-hop Metadata Length, the length of metadata, including the | ||
Domain Specific Metadata, in 4-Byte words to be inserted at each INT hop. | ||
- The largest value of Per-hop Metadata Length for baseline and domain specific | ||
metadata is 31. An INT-capable device may be limited in the maximum number | ||
of instructions it can process and/or maximum length of metadata it can | ||
insert in data packets. An INT hop that cannot process all instructions | ||
must still insert Per-hop Metadata Length \* 4 bytes, with all-ones | ||
reserved value (4 or 8 bytes of 0xFF depending on the length of metadata) | ||
for the metadata corresponding to instructions it cannot process. An | ||
INT hop that cannot insert Per-hop Metadata Length \* 4 bytes must skip INT | ||
processing altogether and not insert any metadata in the packet. | ||
- Remaining Hop Count (8b): The remaining number of hops that are allowed to | ||
add their metadata to the packet. | ||
|
@@ -844,6 +876,7 @@ The original packet must have C bit set to 0. | |
- When a packet is received with the Remaining Hop Count equal to 0, the | ||
device must ignore the INT instruction, pushing no new metadata onto | ||
the stack, and the device must set the E bit. | ||
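Editorial sketch: the Remaining Hop Count rule above, applied to the first 32-bit word of the version 2 header (E at bit 24, Hop ML in bits 12..8, Remaining Hop Count in bits 7..0, counting from the least significant bit). Decrementing the count when metadata is pushed is an assumption here, since that part of the text is not visible in this hunk. `process_hop` is a hypothetical name.

```python
E_BIT = 1 << 24


def process_hop(word0: int, hop_metadata: bytes):
    """Return (updated first header word, metadata to push onto the stack)."""
    remaining = word0 & 0xFF
    hop_ml = (word0 >> 8) & 0x1F
    if remaining == 0:
        # No budget left: push nothing and flag Max Hop Count exceeded.
        return word0 | E_BIT, b""
    if len(hop_metadata) != hop_ml * 4:
        raise ValueError("each hop must insert exactly Hop ML * 4 bytes")
    # Assumption: a hop that inserts metadata decrements the count by one.
    return (word0 & ~0xFF) | (remaining - 1), hop_metadata


exhausted = (2 << 28) | (1 << 8) | 0
new_word, pushed = process_hop(exhausted, b"\x00\x00\x00\x01")
print(hex(new_word), pushed)
```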
|
||
* INT instructions are encoded as a bitmap in the 16-bit INT Instruction field: | ||
each bit corresponds to a specific standard metadata as specified in Section 3. | ||
- bit0 (MSB): Switch ID | ||
|
@@ -854,11 +887,42 @@ each bit corresponds to a specific standard metadata as specified in Section 3. | |
- bit5: Egress timestamp | ||
- bit6: Level 2 Ingress Port ID + Egress Port ID (4 bytes each) | ||
- bit7: Egress port Tx utilization | ||
- bit8: Buffer ID (8 bits) + Buffer occupancy (24 bits) | ||
- bit15: Checksum Complement | ||
- The remaining bits are reserved. | ||
Each instruction requests 4 bytes of metadata to be inserted at each hop, | ||
except if bit 6 is set, which requires 8 bytes of metadata. Per-hop | ||
metadata length is set accordingly at the INT source. | ||
|
||
Semantics of Queue occupancy and Buffer occupancy is the default semantics of | ||
those two metadata. Additional semantics as needed for different implementation | ||
can be defined in the metadata semantics YANG model. | ||
|
||
Details of the metadata semantics YANG model can be accessed at the link below: | ||
https://github.com/p4lang/p4-applications/blob/master/telemetry/code/models/p4-dtel-metadata-semantics.yang | ||
|
||
Bits 0 - 14 are Baseline INT Instructions. Each instruction requests 4 bytes of metadata to be | ||
inserted at each hop, except for bit 6. If bit 6 is set, the instruction requires 8 bytes of | ||
metadata. Per-hop metadata length is set accordingly at the INT source. | ||
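Editorial sketch: how an INT source might derive the baseline part of the per-hop metadata length from the 16-bit instruction bitmap, with bit 0 as the MSB and bit 6 costing 8 bytes. Counting bit 15 (Checksum Complement) as one 4-byte word is an assumption, and `baseline_words` is a hypothetical name.

```python
def baseline_words(instruction_bitmap: int) -> int:
    """Number of 4-byte words requested per hop by the instruction bitmap."""
    if not 0 <= instruction_bitmap <= 0xFFFF:
        raise ValueError("instruction bitmap is a 16-bit field")
    words = 0
    for bit in range(16):
        # bit 0 is the most significant bit of the field
        if instruction_bitmap & (0x8000 >> bit):
            words += 2 if bit == 6 else 1
    return words


# Switch ID (bit 0) plus Level 2 port IDs (bit 6): 4 + 8 bytes = 3 words.
print(baseline_words(0x8000 | 0x0200))
```

Any Domain Specific Metadata would be added on top of this value when the source sets Hop ML.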
|
||
* Domain Specific ID (16b): the unique ID of the INT Domain. | ||
|
||
* DS Instruction (16b): Instruction bit map specific to the INT domain identified by the | ||
Domain Specific ID. Domain Specific Instruction is an instruction that requires additional | ||
processing of Domain Specific Flags (DS Flags) for the INT Domain identified by Domain Specific ID. | ||
|
||
If the Domain Specific ID matches any Domain ID known to this node, then additional processing | ||
of the Domain Specific Flags and Domain Specific Instruction is required and Domain Specific | ||
Metadata is appended to the Baseline Metadata before Checksum Complement is inserted. The amount | ||
of Domain Specific Metadata must be a multiple of 4 bytes, determined from the Domain Specific | ||
Instruction and consistent with the per-hop metadata length (Hop ML) set by the INT source. | ||
|
||
If the Domain Specific ID does not match any Domain ID known to this node, then | ||
the node is required to either: | ||
|
||
- Pad the node's INT Metadata stack with the special all-ones reserved value for a | ||
Domain Specific Metadata length, calculated by subtracting from the Hop ML a length | ||
computed from all bits in the 16-bit INT Instruction field, or | ||
|
||
- Skip INT processing altogether and not insert any metadata into the packet. | ||
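Editorial sketch: the first of the two options above, for a node that does not know the Domain Specific ID. The Domain Specific Metadata length is taken as Hop ML minus the length implied by the 16-bit INT Instruction field, and is filled with the all-ones reserved value. Function names are hypothetical, and treating every set bit other than bit 6 as one 4-byte word is an assumption.

```python
def baseline_words(instruction_bitmap: int) -> int:
    """4-byte words implied by the instruction bitmap (bit 0 = MSB, bit 6 = 2 words)."""
    return sum(
        (2 if bit == 6 else 1)
        for bit in range(16)
        if instruction_bitmap & (0x8000 >> bit)
    )


def unknown_domain_padding(hop_ml: int, instruction_bitmap: int) -> bytes:
    """All-ones filler for the Domain Specific Metadata this node cannot produce."""
    ds_words = hop_ml - baseline_words(instruction_bitmap)
    if ds_words < 0:
        raise ValueError("Hop ML is smaller than the baseline metadata length")
    return b"\xff" * (4 * ds_words)


# Hop ML = 5 words, baseline instructions need 3 words: pad 2 words (8 bytes).
padding = unknown_domain_padding(5, 0x8000 | 0x0200)
print(padding.hex())
```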
|
||
* Each INT Transit device along the path that supports INT adds its own metadata | ||
values as specified in the instruction bitmap immediately after the INT metadata | ||
header. | ||
|
@@ -885,7 +949,7 @@ header. | |
used for a 4B metadata in a subsequent minor version while still being | ||
backward compatible with this specification. However, an instruction bit | ||
marked reserved in this specification may be used for a 8B metadata only | ||
in the next major version, breaking backward compatibility and requring all | ||
in the next major version, breaking backward compatibility and requiring all | ||
INT switches to be upgraded to the new major version. For example | ||
a version 1.0 INT switch cannot operate alongside version 2.0 INT switches | ||
if a new 8B metadata is introduced in version 2.0, as the version 1.0 | ||
|
@@ -898,18 +962,20 @@ header. | |
metadata header. | ||
* Summary of the field usage | ||
- The INT Source must set the following fields: | ||
- Ver, Rep, C, M, Per-hop Metadata Length, Remaining Hop Count, | ||
- Ver, D, M, Per-hop Metadata Length, Remaining Hop Count, | ||
and Instruction Bitmap. | ||
- INT Source must set all reserved bits to zero. | ||
- INT Source may set the Domain-specific fields. | ||
- Intermediate devices can set the following fields: | ||
- C, E, M, Remaining Hop Count | ||
- D, E, M, Remaining Hop Count, Domain-specific fields | ||
* The length (in bytes) of the INT metadata stack must always | ||
be a multiple of (Per-hop Metadata Length \* 4). This length can be determined | ||
by subtracting the total INT fixed header sizes (12 bytes) | ||
from (shim header length \* 4). | ||
For INT over Geneve it is 8 bytes subtracted from (length in Geneve tunnel | ||
option header \* 4). | ||
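Editorial sketch: the length rule above as a small check an INT sink could run. It uses the 12-byte fixed header size stated in the text as the default and lets the caller pass 8 for the Geneve case. `metadata_stack_info` is a hypothetical name.

```python
def metadata_stack_info(length_words: int, hop_ml: int, fixed_header_bytes: int = 12):
    """Return (stack length in bytes, number of hops that inserted metadata)."""
    stack_bytes = length_words * 4 - fixed_header_bytes
    if stack_bytes < 0:
        raise ValueError("length field is smaller than the fixed headers")
    per_hop = hop_ml * 4
    if per_hop == 0:
        if stack_bytes != 0:
            raise ValueError("non-empty stack with Hop ML of zero")
        return 0, 0
    if stack_bytes % per_hop != 0:
        raise ValueError("stack length is not a multiple of Hop ML * 4")
    return stack_bytes, stack_bytes // per_hop


# Shim length of 9 words, Hop ML of 3: 36 - 12 = 24 bytes, so 2 hops.
print(metadata_stack_info(9, 3))
```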
|
||
|
||
# Examples | ||
|
||
This section shows example INT Headers with two hosts (Host1 and Host2), | ||
|
@@ -1017,7 +1083,40 @@ INT Metadata Header and Metadata Stack, followed by TCP payload: | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||
| TCP payload | | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||
` | ||
|
||
|
||
## Example with INT over IPv6 using Hop-by-Hop option | ||
|
||
The format of the IPv6 packet with Hop-by-Hop option for INT-MD | ||
(Embedded Metadata) where no other Hop-by-Hop options are present | ||
is shown below: | ||
|
||
0 1 2 3 | ||
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||
|Version| Traffic Class | Flow Label | | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||
| Payload Length | Nxt HDR = HbyH| Hop Limit | | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||
| (Outer) Source IPv6 Address | | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||
| (Outer) Destination IPv6 Address | | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+ | ||
| Nxt HDR = IPv6| HbyH Ext Len | Padding|(MBZ) | | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||
| Option Type | Opt Data Len |Reserved (MBZ) | INT TYPE | | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+ | ||
|Ver = 2|Res|D|E|M| Reserved | Hop ML |RemainingHopCnt| | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | ||
| Instruction Bitmap | Domain Specific ID | I | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ N | ||
| DS Flags | DS Instruction | T | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | ||
| Variable Option Data (INT DATA) | | | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+ | ||
| Payload Original Packet | | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||
|
||
|
||
## Example with INT over VXLAN GPE | ||
|
||
|
Review comment: Should we add INT over IPv6 after the other three encaps? Especially because the text in the paragraph is referring to scenarios where "INT over VXLAN or Geneve is not helpful"
Review comment: I followed what was done earlier. TCP/UDP was listed first and it referenced encaps. I just stuck to that. I am fine with changing the order.
Review comment: I may be woefully out of date on IPv6 extension header behavior, but regarding the option '"INT over IPv6" - INT Headers are carried in the IPv6 packets as Hop-by-Hop option.', I had thought that switches in practice have to punt packets with an IPv6 Hop-by-Hop extension header to the slow path, e.g. software forwarding on a general purpose CPU.
I did a quick search and found that RFC 7045 (published Dec 2013) says this in Section 2.2 "Hop-by-Hop Options":
The IPv6 Hop-by-Hop Options header SHOULD be processed by
intermediate forwarding nodes as described in [RFC2460]. However, it
is to be expected that high-performance routers will either ignore it
or assign packets containing it to a slow processing path. Designers
planning to use a hop-by-hop option need to be aware of this likely
behaviour.
Is there really a desire to put INT data into a header that will likely result in slow path processing in the network?