Skip to content

Conversation

@jgallagher
Copy link
Contributor

Yesterday @internet-diglett and I were looking at some weird a4x2 failures where sled-agent successfully started the switch zone but failed to configure uplinks within it. This PR fixes a race condition and a subsequent logic bug which together were causing that failure.

I'm not sure if it's possible to hit this race condition in real hardware. I tried going over a real sled-agent startup log to figure out if we just happened to start up the switch zone "fast enough", or if something in the real startup path was (implicitly) blocked on that setup being done. I think it's the latter but don't have great confidence in that; this is based on comparing timestamps of logs and things that appear backed up waiting on mutexes held during the whole switch zone setup process. All of this is pretty gnarly; we have multiple issues discussing the need for some rework here anyway, but this is yet another spot fix to unblock active work.


Race condition: If we get our underlay info while we're still starting up the switch zone, we don't inform the task doing that startup about it, and therefore it doesn't attempt to configure uplinks.

We saw these sequence in sled-agent's logs:

19:44:47.976Z INFO SledAgent (ServiceManager): Ensuring scrimlet services (enabling services)
    file = sled-agent/src/services.rs:3177
19:44:47.976Z INFO SledAgent (ServiceManager): Enabling switch zone (new)
    file = sled-agent/src/services.rs:3549
19:44:47.976Z INFO SledAgent (ServiceManager): Starting switch zone
    file = sled-agent/src/services.rs:4002

... snip ...

19:44:51.456Z INFO SledAgent (ServiceManager): Ensuring scrimlet services (enabling services)
    file = sled-agent/src/services.rs:3177
19:44:51.456Z INFO SledAgent (ServiceManager): Enabling switch zone (already underway)
    file = sled-agent/src/services.rs:3562

This is consistent with initializing starting the switch zone with no underlay information (i.e., we pass underlay_info=None here):

(SwitchZoneState::Disabled, Some(request)) => {
info!(log, "Enabling {zone_typestr} zone (new)");
self.clone().start_switch_zone(
&mut sled_zone,
request,
filesystems,
data_links,
underlay_info,
);
}

and then subsequently swapping out the request inside the SwitchZoneState::Initializing variant here:

(
SwitchZoneState::Initializing { request, .. },
Some(new_request),
) => {
info!(log, "Enabling {zone_typestr} zone (already underway)");
// The zone has not started yet -- we can simply replace
// the next request with our new request.
*request = new_request;
}

In the "swapping out the request" path, we actually have Some(underlay_info), but we're discarding it: it's not stored in request or new_request - we only passed it as an argument to start_switch_zone. The first commit, fcf094d, fixes this by moving the underlay_info into request instead of passing it as function argument. Now when we swap out the request, the task running to perform initialization has access to the underlay_info and will attempt to configure uplinks.


Logic bug: Once we fixed the above, we saw the "ensure switch zone uplinks" worker stop after a single attempt as though it was told to:

21:29:03.504Z INFO SledAgent (ServiceManager): initialized switch zone (underlay info available: will attempt uplink configuration)
    file = sled-agent/src/services.rs:4034
21:29:03.506Z INFO SledAgent (ServiceManager): Determining physical location of our switch zone at fd00:1122:3344:101::2
    file = sled-agent/src/bootstrap/early_networking.rs:370
21:29:13.096Z WARN SledAgent (ServiceManager): Failed to configure switch zone uplinks
    error = Early networking setup error: Error during request to MGS: failed to determine local switch ID via MGS: Communication Error: error sending request for url (http://[fd00:1122:3344:101::2]:12225/local/switch-id): error sending request for url (http://[fd00:1122:3344:101::2]:12225/local/switch-id): client error (Connect): tcp connect error: Connection refused (os error 146)
    file = sled-agent/src/services.rs:3322
21:29:13.096Z INFO SledAgent (ServiceManager): instructed to give up on switch zone uplink configuration
    file = sled-agent/src/services.rs:3333

When we first start initializing the switch zone, we spawn a task to do the work and store it in the ::Initializing variant's worker field:

worker: Some(Task {
exit_tx,
initializer: tokio::task::spawn(async move {
self.initialize_switch_zone_loop(underlay_info, exit_rx)
.await
}),
}),

Once that task gets an Ok(_) response from try_initialize_switch_zone(), it will attempt to configure uplinks until the exit channel is sent an explicit message or is dropped:

// Then, if we have underlay info, go into a loop trying to configure
// our uplinks. As above, retry until we succeed or are told to stop.
if let Some(underlay_info) = underlay_info {
self.ensure_switch_zone_uplinks_configured_loop(
&underlay_info,
exit_rx,
)
.await;
}

However, inside of try_initialize_switch_zone() itself, the last thing it does before returning is change the state from ::Initializing to ::Running, with no worker task:

*sled_zone = SwitchZoneState::Running {
request: request.clone(),
zone: Box::new(zone),
worker: None,
};

This causes exit_tx to be dropped, which causes ensure_switch_zone_uplinks_configured_loop() to bail out after a single attempt, as we see in the logs above. This is fixed in 07aa525, which moves the worker task into the ::Running state instead of dropping it. (The ::Running state can have a non-None worker if we reconfigure the switch zone, so the supporting code already expects this to be present sometimes, and knows to stop the task when appropriate.)

This bug was mostly introduced by above (not fully correct!) change to fix the race condition: prior to that change, the ::Initializing state never had the underlay_info in it anyway, so ensure_switch_zone_uplinks_configured_loop() wouldn't have even been called.

Copy link
Contributor

@andrewjstone andrewjstone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks fine based on the very hard to parse description. Trusting that this fixes the issue I'm fine with a merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants