Fix race between sled-agent and zone-setup service #6152
- Fixes #6149
- Most zones run the `zone-network-setup` service once, at startup, with their underlay addresses already provided by the sled-agent. That's not true for the switch zone, which starts with only a localhost address and is given an underlay address by the sled-agent only after the bootstrapping process has proceeded further.

  However, the zone-setup-service previously deleted its IP interfaces prior to setting the underlay address on them, apparently as a workaround for oxidecomputer/stlouis#435. That's fine for other zones, but in the switch zone it races with the sled-agent setting that underlay address later. It's possible for the zone-setup-service to delete the interface _after_ those addresses are set, which prevents the rest of the control plane from deploying correctly.

  This fixes the issue by simply removing that call to `ipadm delete-if` in the zone-setup-service. The mentioned issue has been resolved, and the workaround is no longer needed.
- Make the `zone-network-setup` service depend on the network milestone instead of `multi-user`. This just moves it a bit earlier in the dependency graph, and should not be strictly necessary. We might want to move the sled-agent's notion of "zone readiness" to depend on `multi-user` instead of `single-user` in the future, so this could help with that.
- Extract a few constants and clean up some whitespace.
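The race described above can be illustrated with a toy model. This is only a sketch, not code from this PR: `ZoneInterface`, `Step`, `run`, and `has_underlay` are hypothetical names, and `fd00::1` is a placeholder address. It replays the two steps in either order and shows that the underlay address survives only when the delete happens first, or when the delete step is dropped entirely, which is the approach taken here.

```rust
/// Toy model of the switch zone's IP interface (hypothetical; not an omicron type).
#[derive(Debug, Default, Clone, PartialEq)]
struct ZoneInterface {
    addrs: Vec<String>,
}

/// The two operations that can interleave in the switch zone.
#[derive(Debug, Clone, Copy)]
enum Step {
    /// The sled-agent assigns the underlay address (placeholder value below).
    SledAgentSetUnderlay,
    /// The zone-setup service deletes the IP interface (the old workaround).
    ZoneSetupDeleteIf,
}

/// Apply the steps in the given order and return the final interface state.
fn run(steps: &[Step]) -> ZoneInterface {
    let mut iface = ZoneInterface::default();
    for step in steps {
        match step {
            Step::SledAgentSetUnderlay => iface.addrs.push("fd00::1".to_string()),
            // Deleting the interface discards every address configured on it.
            Step::ZoneSetupDeleteIf => iface.addrs.clear(),
        }
    }
    iface
}

/// True if the interface still carries an underlay address.
fn has_underlay(iface: &ZoneInterface) -> bool {
    !iface.addrs.is_empty()
}
```

In this model, `run(&[Step::SledAgentSetUnderlay, Step::ZoneSetupDeleteIf])` ends with no address, while the reverse order and the delete-free sequence both keep it, so removing the delete step makes the outcome independent of ordering.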
I've now run this a few dozen times on my machine, and not seen any of the races mentioned in the issue. Instead, tracing the
That order is not strictly guaranteed, since there is still a window of concurrency between the sled-agent and the zone-setup tool. Note that this is necessary: the zone-setup tool attempts to create a default route, which will fail until the zone has an underlay IPv6 address. That's created by the sled-agent. As |
Thanks for finding the bug and providing a quick fix! Left a couple of comments, but otherwise looks good to me.
I'll leave the review of the manifest dependencies to @citrus-it though. He'll have a better grasp on that than me.
A couple of nits on the manifest, and I concur with @karencfv's comment on `IPV4_STATIC_ADDROBJ_NAME`.
- Addrobj name typo
- Cleanup zone setup manifest