-
Notifications
You must be signed in to change notification settings - Fork 5
Initial IPv6 Support #533
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Initial IPv6 Support #533
Conversation
Removes one of the multiple duplicate types that implement an IP Prefix. Signed-off-by: Trey Aspelund <[email protected]>
Moves to ddm/oxnet types to represent an IP Prefix instead of custom types. Step 2 in standardizing around oxnet Prefix types. Signed-off-by: Trey Aspelund <[email protected]>
Remove unused BgpAttributes4, BgpAttributes6, OriginChangeSet, and Status types. There were a couple other unused types that seemed more forward-looking, so I left those in place (i.e. fn to_buf(), Policy4Key) Signed-off-by: Trey Aspelund <[email protected]>
Updates Db to incorporate new methods that support IPv6. The RIB is now split into three separate types: - Rib: Holds Prefix (protocol-agnostic enum) - Rib4: Holds Prefix4 (IPv4 specific struct) - Rib6: Holds Prefix6 (IPv6 specific struct) Updates bestpaths() to be completely prefix-agnostic (now accepts paths as an argument in lieu of a prefix and Rib). Callers of prefix add/remove methods are now responsible for managing locks for the appropriate RIB structures, allowing each protocol's RIB to be locked independently and giving the caller the option to reduce how many times the lock needs to be acquired/released. Signed-off-by: Trey Aspelund <[email protected]>
Initial implementation of IPv6 Static Routing: - New mgd APIs - New mgadm commands (mirroring v4 static routing commands) Signed-off-by: Trey Aspelund <[email protected]>
Signed-off-by: Trey Aspelund <[email protected]>
During creation of Prefix4/Prefix6 structs, an Ipv{4,6}Addr was blindly
allowed w/o any validation or zeroing of the host bits. This allowed
for the RIB to hold multiple entries for routes w/ the same underlying
subnet (e.g. 1.1.1.1/24 and 1.1.1.2/24 would be considered unique keys,
but the underlying subnet is still 1.1.1.0/24), possibly even with
differing next-hop values.
This commit addresses that issue by adding a step to zero the host bits
during the creation of new Prefix4/Prefix6 structs. Figuring out a
migration strategy for the on-disk sled::Db remains a concern, but
this prevents new problematic entries from being created.
Fixes: #529
Signed-off-by: Trey Aspelund <[email protected]>
Second half of the fix for #529. First we moved to a constructor that enforced all host bits be zeroed when creating a Prefix* type. Now, we check whether the routes in the on-disk db are valid (host bits unset) and remove them if they're invalid. This also moves most of the logic into methods of the Prefix* types, with the exception of the mg-admin-client mirrors of them, as progenitor can't know about methods/traits from the openapi spec so that logic has to live somewhere. Fixes: #529 Signed-off-by: Trey Aspelund <[email protected]>
Signed-off-by: Trey Aspelund <[email protected]>
Signed-off-by: Trey Aspelund <[email protected]>
Some helpful parsing and retry macros were sitting in bgp and other crates couldn't use them... so I moved them to mg-common. Signed-off-by: Trey Aspelund <[email protected]>
Adds APIs, db methods, mgadm commands and unit tests for IPv6 BGP Origin Signed-off-by: Trey Aspelund <[email protected]>
Signed-off-by: Trey Aspelund <[email protected]>
Signed-off-by: Trey Aspelund <[email protected]>
Signed-off-by: Trey Aspelund <[email protected]>
Moves RIB queries from /bgp to /rib (this will likely require a corresponding update in Nexus), as rib_in and rib_loc are not specific to BGP. Moves RIB logic into its own set of files (mgd/src/rib_admin.rs, mgadm/src/rib.rs). Refactors print_rib() and move it into rib.rs. Adds support for server-side filtering of RIB queries by AddressFamily and Protocol (BGP/Static). Removes a bunch of duplicate types and impls by using progenitor's "replace" directive in generate_api!() macro to automatically map progenitor's auto-derived types back to their original rdb types (this is something we should do further refactoring on, as there is quite a lot of type/impl duplication going on). Fixes: #525 Signed-off-by: Trey Aspelund <[email protected]>
Signed-off-by: Trey Aspelund <[email protected]>
| // Ensure that r1's peer session to r2 has gone back to connect. | ||
| r2.shutdown(); | ||
| d2.shutdown(); | ||
| wait_for_eq!(r1_session.state(), FsmStateKind::Connect); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we expecting state machine behavior changes? It looks like tests are now failing here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason I modified this is because I believed the FSM state observed is timing-dependent, not consistent.
It seemed to me like the goal of this test case is to confirm that "shutdown" on one side brings the session down.
If so, then looking for "!established" is a less timing-dependent way to confirm that things have been completely torn down.
Otherwise, the state on the active side could be "connect", "opensent" (since the listener isn't torn down on the shutdown side, the transport can still come up) or "idle".
I was hoping to avoid a scenario where the test fails because the active peer is in "idle" or "opensent" and we were only looking for "connect".
I don't have super strong opinions about this though, so if reverting that change means the test passes consistently again, then I'm fine with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure at what point this test became unreliable. I'll go back through and figure out which of my changes borked things. I don't suspect the issue is with the logic change of "state != established", but I could be wrong
bgp/src/router.rs
Outdated
| }; | ||
|
|
||
| for p in prefixes { | ||
| update.nlri.push(p.clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't v6 be in mp_reach_nlri/mp_unreach_nlri?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually, yes. MP_{REACH,UNREACH}_NLRI hasn't been implemented yet
Signed-off-by: Trey Aspelund <[email protected]>
Signed-off-by: Trey Aspelund <[email protected]>
Signed-off-by: Trey Aspelund <[email protected]>
Properly handle connection termination scenarios to prevent zombie threads and improve error logging. Changes include: - Distinguish between recv timeout and disconnection in channel connections - Add explicit break on disconnection to cleanly exit recv loops - Detect EOF (n == 0 reads) and convert to UnexpecteEof error - Break recv loop on TCP read errors to prevent infinite retry loops - Add debug logging when peer disconnects These changes ensure that threads properly terminate when connections close, preventing resource leaks and reducing noise in logs from threads attempting to read from closed connections. Signed-off-by: Trey Aspelund <[email protected]>
fs2 was used for file locking in the LoopbackIpManager test helper, but it does not have support for Illumos as a target OS. This removes the fs2 dependency in favor of using the libc crate that supports Illumos, MacOS and Linux. Signed-off-by: Trey Aspelund <[email protected]>
Somewhere along the line in the refactor, some of the md5 code got orphaned and was no longer being used, which went unnoticed while doing dev on MacOS (for some reason even using --all-targets for cargo build, check and clippy, this still didn't surface?). This updates the BgpConnection code to use the md5 logic again (i.e. actually running the SA keepalive thread). Signed-off-by: Trey Aspelund <[email protected]>
The ipadm syntax used to install/uninstall addresses on illumos was invalid - specifically, the `addrobj` string was not alphanumeric, which resulted in an "Invalid argument provided" error. This updates the `addrobj` string to use "dot" as delimiters between octets since it's valid/alphanumeric. Signed-off-by: Trey Aspelund <[email protected]>
Signed-off-by: Trey Aspelund <[email protected]>
Connection Collision Resolution requires that the BGP-ID is used as a tie-breaker to determine which connection to close: the connection initiated by the router with the lowest ID is closed. We were correctly using the BGP-ID as the tie-breaker, but were implicitly using the "new" connection (whichever completed second) as if it were always initiated by the peer and the "existing" connection (whichever completed first) as if it were initiated by the local system. This is a bad assumption since the ordering of connection completion is non-deterministic. This fixes issues that can arise when those assumptions are made, i.e. r1 and r2 agree that r2 has the higher BGP-ID, but each decide to close a different connection. Signed-off-by: Trey Aspelund <[email protected]>
Signed-off-by: Trey Aspelund <[email protected]>
The accept() path (via the Dispatcher) for TCP was incorrectly passing the local bind address to the BgpConnection constructor (with_conn()) which is set to [::]:179 in non-test environments, which resulted in us later encoding this into an Update message in the NEXT_HOP field. This addresses the issue by correctly passing the local SocketAddr of the live socket, which is an IPv4 address for an IPv4 TCP connection. This also adds both tx and rx validation of NEXT_HOP length based on AFI of prefixes and their next-hop. Signed-off-by: Trey Aspelund <[email protected]>
|
Interop tests are now passing with the latest commit. Will do some more manual validation to try and force Collisions to occur, since that is the primary reason for the FSM changes |
Updates connect() logic to use either a plain IPv4 or IPv6 address from the new TcpStream's local SocketAddr, rather than using it as-is. This is needed because our listening socket is a dual-stack socket (binding to [::] with `v6_only` disabled), so any new IPv4 TcpStreams we accept() from it are presented to the application as an IPv6 socket with an IPv4-mapped IPv6 address (e.g. a new connection from 192.168.0.1 is handed to the application with a peer IP of ::ffff:192.168.0.1). The IP we strip from the TcpStream is then used if we attempt to bind() the socket to a specific Source IP before initiating an outbound connection, which we bail out of upon a failure. So no outbound connections will be initiated while this is busted. The IP is also used to determine the address-family of the next-hop we encode in sent Update messages, so we could inadvertently encode 16 bytes instead of 4 for the next-hop if this is wrong. Updates the bind_addr of a SessionRunner to be a Option all the way up through the API handler (and the test instantiation), where it is always set to None by the API (until we add support for update-source). This prevents erroneous bind() calls in non-test environments. Signed-off-by: Trey Aspelund <[email protected]>
Signed-off-by: Trey Aspelund <[email protected]>
Add a ConnectionCreator check in OpenSent/OpenConfirm to ensure we never enter FsmState::ConnectionCollision with two connections that were initiated by the same peer (local or remote). It was observed during manual stress testing that mgd would sometimes enter Connection Collision with two connections that it initiated. This happens because connect() calls for outbound TCP sockets are currently handled as a one-shot operation in a separate thread without any synchronization methods to allow for a graceful shutdown to occur (e.g. sending a Shutdown message to the connect thread via a channel). However, since we use connect_timeout(), the timing window for these duplicate outbound connections is bounded by the timeout supplied (currently 100ms). Regardless, any timing window is still a window, even if this was only observed irregularly (seen 74 times after running overnight with 2 peers who were resetting the session once per second), so this adds logic to disallow this condition. If a second connection completes, the FSM will check its direction (inbound or outbound) and reject it if it matches the existing connection's direction. This also removes a couple (now) unneeded clippy lints, and renames SessionEvent::Connected to SessionEvent::TcpConnectionAcked (now aligning with RFC event 16, Tcp_CR_Acked). Signed-off-by: Trey Aspelund <[email protected]>
- Adds docstrings for several different types within session.rs that describe the SessionRunner and BgpConnection design choices. - Renames the FSM state methods to "fsm_<state>" format so it's clear which function implements which FsmState handler. - Moves capabilities_received into PeerConnection, since that state is inherently tied to a BgpConnection that has received an Open message rather than the SessionRunner itself. - Renames PeerConnection.capabilities_received -> PeerConnection.caps. - Renames SessionRunner.capabilities_sent -> SessionRunner.caps_tx. - Renames SessionRunner.primary_connection -> SessionRunner.primary. - Introduces PrimaryConnection enum that wraps either a Partial or Full BgpConnection, depending on whether it has the peer's Open data. - Changes SessionRunner.primary to hold a PrimaryConnection enum rather than simply a ConnectionId. This provides a handle to the connection and the peer's Open data so we can expose it to queries. - Adds a couple connection retries timer increments. - Moves a couple ConnectRetryTimer stop calls earlier so we don't forget to stop the timer before transitioning out of Connect/Active. Signed-off-by: Trey Aspelund <[email protected]>
Shortens names of SessionClock / ConnectionClock timers to make SessionRunner code shorter/narrower (avoiding cargo fmt indentation woes). Adds enabled checks for timer expiration FSM events. This is needed in the event of a valid timer event being added to the event queue behind another event that makes the timer event invalid. Example: T0: TcpConnectionConfirmed pushed into event queue T1: ConnectRetryTimerExpires pushed into event queue T2: TcpConnectionConfirmed read from queue, ConnectRetryTime stopped/disabled T3: FSM transition to OpenSent T4: ConnectRetryTimerExpires read from queue -> FSM error! T5: FSM transition to Idle In this case, the ConnectRetryTimerExpires event is pushed into the event queue because the timer legitimately expired. However, another event that advances the FSM state (e.g. TcpConnectionConfirmed, TcpConnectionAcked) was already in the queue to be processed, and the handler stops ConnectRetryTimer but doesn't clear pending ConnectRetryTimerExpires FSM events from the queue. The event queue is not read from again until after moving into the next FSM state, where the old event is then read and reacted to despite the timer being disabled. This is avoided by using the timer's enabled status as an indicator of whether its events are valid. Signed-off-by: Trey Aspelund <[email protected]>
Signed-off-by: Trey Aspelund <[email protected]>
Add connector_handle field to SessionRunner to track the spawned connector thread throughout its lifecycle. This enables detection of connector thread panics and prevents spawning multiple simultaneous connector threads for the same session. Changes BgpConnector::connect() to return JoinHandle instead of () so the caller can manage the thread. The handle is checked on each connection attempt: if a thread is still running, new spawns are skipped; if finished, it's joined to detect panics before spawning a new thread. On shutdown, any running connector thread is joined for clean teardown. Adds connector_panics counter to SessionCounters for observability and introduces initiate_connection() helper to consolidate the spawn logic. Signed-off-by: Trey Aspelund <[email protected]>
Fix missing connection cleanup in connection_collision_open_sent() when transitioning to SessionSetup via the early-Keepalive path. The collision handler has two paths to SessionSetup: 1. Both connections receive Opens -> resolve_collision() 2. One connection receives Keepalive before both Opens Path 1 correctly calls stop() on the losing connection, unregistering it from the connection registry. Path 2 directly returned SessionSetup without cleanup, leaving the losing connection registered with its receive loop still active. In the failing scenario: 1. Connection A receives Open, stores it in exist_open 2. Connection A receives Keepalive before connection B gets Open 3. FSM transitions Connection A to SessionSetup 4. Connection B remains registered and continues receiving messages 5. Connection B's queued Open arrives after R2 reaches Established 6. Open handler in Established sees B as "unexpected known connection" 7. FSM Error notification sent, terminating session Fix by calling stop(StopReason::CollisionResolution) on the losing connection before transitioning to SessionSetup, matching the behavior of resolve_collision(). Also fix: - Missing Keepalive transmission on Open receipt (RFC 4271 violation) - Wrong connection timer restart (restarting 'new' instead of 'exist') - Add RFC 4271 excerpts to ConnectionCollision methods to make protocol requirements more clear. Signed-off-by: Trey Aspelund <[email protected]>
When two BGP routers start simultaneously with identical timer
configurations, they can enter a synchronization deadlock. Both
routers begin in Idle state, which rejects inbound connections per
RFC 4271. When they simultaneously initiate outbound connections,
each router rejects the other's inbound connection attempt, causing
both to fail and return to Idle state via the IdleHoldTimer.
Because both routers have identical IdleHoldTimer intervals, they
restart connection attempts at exactly the same cadence, perpetuating
the deadlock indefinitely. This manifests as both routers cycling
through Idle → Connect/Active → Idle without ever establishing a
session.
While RFC 4271 Section 10 recommends jitter for ConnectRetryTimer
and DelayOpenTimer to prevent synchronized behavior, it does not
explicitly mention IdleHoldTimer. However, IdleHoldTimer serves a
similar role in controlling retry timing after the DampPeerOscillations
state machine dampens connection attempts. In our implementation,
IdleHoldTimer is the primary control for retry cadence when moving
from Idle back to connection attempts (ConnectRetryTimer is disabled in
Idle per the RFC).
The synchronization issue was exposed by channel-based tests where
zero network latency and deterministic timing amplified the race
condition, causing test_basic_peering_active to fail intermittently
(iteration 20/300). While less likely with TCP due to network jitter,
the issue could theoretically occur in real deployments under unlucky
timing conditions.
Implementation adds jitter infrastructure for both IdleHoldTimer and
ConnectRetryTimer following RFC 4271's recommended 0.75-1.0 range:
- Add Timer::jitter_range field with recalculation every timer reset
- Add SessionInfo::{idle_hold_jitter, connect_retry_jitter} configuration
- Enable idle_hold_jitter by default (0.75, 1.0) to fix deadlock
- Disable connect_retry_jitter by default (None) - infrastructure ready
but not needed since ConnectRetryTimer doesn't control retry timing
in the Idle state scenario
- Support runtime jitter updates without session reset
- Add rand dependency for thread-safe RNG via rand::thread_rng()
Signed-off-by: Trey Aspelund <[email protected]>
Let's add a channel_id that is consistent across routers to make life easier when debugging. Signed-off-by: Trey Aspelund <[email protected]>
Renames PrimaryConnection enum to ConnectionKind. Updates resolve_collision() to take BGP-ID as an argument rather than taking it from one of the connections, and to use ConnectionKind enum for the two connections instead of strictly PeerConnections. This allows the callers to consistently order their args by connection completion time (`exist` = first, `new` = second) regardless of how far along the connection is. This is preparation for a refactor of the connection collision logic where we enter resolve_collision() earlier than we did before. Signed-off-by: Trey Aspelund <[email protected]>
Fix an intermittent test failure caused by a race condition during connection collision resolution. When two connections collide during the OpenConfirm state, the losing connection is closed but may still have in-flight messages in the event queue. The FSM was processing messages from dead connections without validating the ConnectionID, causing it to transition to Idle if a non-keepalive message was received for a connection other than the one owned by FsmState::OpenConfirm. This fix adds connection ID validation in OpenConfirm's Message handler, following the same pattern used by other event handlers like HoldTimerExpires. Messages from unknown or inactive connections are now: 1. Logged as unexpected 2. If the connection is known, it's closed with ConnectionRejected 3. If the connection is unknown, it's simply ignored 4. The FSM remains in OpenConfirm state Signed-off-by: Trey Aspelund <[email protected]>
Move collision resolution from lazy (wait for both Opens) to eager (resolve on first Open). When either connection receives an Open message during collision state, immediately trigger collision resolution instead of waiting in a loop for both connections. This eliminates the manual state tracking (exist_open/new_open) variables and simplifies the collision FSM. The collision detection is now more direct and better aligned with RFC 4271 section 6.8. Also fix error paths in connection_collision where set_primary_conn(None) was incorrectly called when one connection failed during Open message parsing, invalid message reception, or timer expiration. Now correctly promote the fallback connection to primary, maintaining the invariant that a single connection must always be designated primary. Signed-off-by: Trey Aspelund <[email protected]>
The mgadm commands to query the RIB have been moved to a new CLI,
so update the test to use the proper command.
Old: mgadm bgp status {imported,selected} $ASN
New: mgadm rib status {imported,selected}
Signed-off-by: Trey Aspelund <[email protected]>
Add collision resolution notification handling in OpenSent, OpenConfirm,
and Established states to prevent death spirals when one peer resolves
a collision before the other peer detects it.
When a peer receives a Collision Resolution notification and hasn't yet
detected the collision itself (second connection still in-flight), the
RFC-prescribed transition to Idle causes the in-flight connection to be
rejected. This creates a death spiral where both peers repeatedly reject
each other's connections.
The fix transitions to Active state instead of Idle when receiving a
Collision Resolution notification. This allows the in-flight connection
to be accepted, breaking the death spiral. ConnectRetryTimer provides
fallback if the expected connection never arrives.
Also refactor collision resolution into a pure function and fix several
related issues:
- Move register_conn() after send_open() to prevent sends on failed conn
- Silent drop rejected connections in Idle (avoid notification storms)
- Add IoError stop reason for failed sends during collision detection
- Fix typo in CeaseErrorSubcode display ("Collition" → "Collision")
Signed-off-by: Trey Aspelund <[email protected]>
Signed-off-by: Trey Aspelund <[email protected]>
Consolidate BGP's Prefix struct into the shared rdb::Prefix enum, standardizing prefix representation across the codebase. This adds explicit address family context to prefix_from_wire(), required for correct IPv4/IPv6 disambiguation in wire format parsing. Key improvements: - Prefix type is Copy, eliminating clones in announcement/origin methods - Property-based tests ensure wire format round-trip correctness for both IPv4 and IPv6 prefixes - CI job added for high-intensity proptest execution (1M+ cases per run) All BGP wire format message types continue to work correctly with the consolidated Prefix representation. Signed-off-by: Trey Aspelund <[email protected]>
Move the "passive or active" logic used when transitioning out of Idle state into a helper method. Immediately transition out of Idle if IdleHoldTimer is not configured using the new helper method, rather than by injecting an FSM event into the event queue. This ensures no enqueued events are pulled out of the event queue while in Idle state, where they could potentially be ignored because of the rules around Idle state. Signed-off-by: Trey Aspelund <[email protected]>
When both peers are configured to use identical IdleHoldTimer, they can encounter a situation where one peer tries to connect to the other while it's in Idle, which gets rejected. Upon that rejection (rather, the resulting send() failure) the first peer drops back to Idle. Then when the second peer comes out of Idle and attempts to connect, the first peer will reject its connection while it's sitting in Idle. This death spiral can sometimes happen indefinitely, which presents itself as random test failures when the BGP session fails to establish within the 30 second timout in the tests. By transitioning immediately out of Idle (i.e. disabling IdleHoldTime) without reading from the event queue, we avoid the whole situation where connections are rejected by the rules of the Idle FSM state. Signed-off-by: Trey Aspelund <[email protected]>
In an attempt to avoid the death spiral involved with both peers using IdleHoldTime (getting stuck in an Idle <-> Connect loop), I had tried adding special handling for Collision Connection Resolution Notification messages. After realizing that this problem is inherent to the Idle state (rejecting connections that complete asynchronously) and adjusting tests to disable IdleHoldTimer (aligning with the default config), I realized that we actually do not need this non-compliant handling. Signed-off-by: Trey Aspelund <[email protected]>
This PR adds initial support for IPv6 static routing in maghemite, with the very beginnings of BGP support (really just originated IPv6 prefixes).
This addresses a few bugs found along the way (e.g. #529), adds unit tests for old/new functionality, and does a good bit of code cleanup, e.g.
Along the way I found that our handling of Connection Collisions was incorrect in a number of ways
e.g.
.unwrap_or(0)would return 0 all the time leading to incorrect behavior when the peer has a higher RID)This eventually became a pretty significant rewrite of much of the FSM for various reasons,
e.g.
Some highlights from the FSM work:
BgpConnectortrait that mirrors theBgpListenertrait but for outbound connectionsTcpConnectionAcked(previously namedConnected) for inbound,TcpConnectionConfirmedfor outbound)BgpConnectororBgpListenerbefore a connection is sent to the FSM/SessionRunnerConnectionIdtype is introduced and added as a member of the FsmEvent variants that are meant only for a specific connection (this allows the FSM event queue to be multiplexed for multiple connections, since events can now be associated with a specific connection. e.g. a Notification for connection X doesn't affect connection Y)FsmState::ConnectionCollisionhas been introduced to provide an entire pseudo-state (as in, not RFC defined) where we can emulate an FSM for two connections in parallel despite only implementing a single FSMFsmEventvariants have been moved to anUnusedEventenum (decorated with dead_code) so we don't have to clutter the FSM methods with handlers for them (I deliberately wanted to make sure we don't use catch-all match arms)enum FsmEventnow holds event type sub-enums that group the FsmEvents based on the category of event (admin, session-wide, and connection-specific)FsmStatemethods have been almost completely rewritten in order to align with the RFC (some events took incorrect actions, some logic lived in the wrong state, some states held onto connections when they shouldn't, etc.)LoopbackIpManageradded to add/remove IPs as needed for TCP tests to work w/o changing the addr_to_session paradigm).slog::o!