Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race between sled-agent and zone-setup service #6152

Merged
merged 2 commits into from
Jul 25, 2024
Merged

Conversation

bnaecker
Copy link
Collaborator

  • Fixes Possible race between sled-agent and self-assembling switch zone #6149
  • Most zones run the zone-network-setup once, at startup, with their underlay addresses already provided by the sled-agent. That's not true for the switch zone, which starts with only a localhost address, and then is provided an underlay address by the sled-agent only after the bootstrapping process has proceededed further. However, the zone-setup-service previously deleted its IP interfaces prior to setting the underlay address on it, apparently as a workaround for https://github.com/oxidecomputer/stlouis/issues/435. That's fine for other zones, but that races with the sled-agent setting that underlay address later in the switch zone. It's possible for the zone-setup-service to delete the interface after those addresses are set, which obviously prevents the rest of the control plane from deploying correctly. This fixes the issue by simply removing that call to ipadm delete-if in the zone-setup-service. The mentioned issue has been resolved, and the workaround is no longer needed.
  • Move the zone-network-setup service depend on the network milestone, instead of multi-user. This just moves it earlier a bit in the dependency graph, though should not be strictly necessary. We might want to move the sled-agent's notion of "zone readiness" to depend on multi-user instead of single-user in the future, so this could help with that.
  • Extract out a few constants, some whitespace cleanup

- Fixes #6149
- Most zones run the `zone-network-setup` once, at startup, with their
  underlay addresses already provided by the sled-agent. That's not true
  for the switch zone, which starts with only a localhost address, and
  then is provided an underlay address by the sled-agent only after the
  bootstrapping process has proceededed further. However, the
  zone-setup-service previously deleted its IP interfaces prior to
  setting the underlay address on it, apparently as a workaround for
  oxidecomputer/stlouis#435. That's fine for
  other zones, but that races with the sled-agent setting that underlay
  address later in the switch zone. It's possible for the
  zone-setup-service to delete the interface _after_ those addresses are
  set, which obviously prevents the rest of the control plane from
  deploying correctly. This fixes the issue by simply removing that call
  to `ipadm delete-if` in the zone-setup-service. The mentioned issue
  has been resolved, and the workaround is no longer needed.
- Move the `zone-network-setup` service depend on the network milestone,
  instead of multi-user. This just moves it earlier a bit in the
  dependency graph, though should not be strictly necessary. We might
  want to move the sled-agent's notion of "zone readiness" to depend on
  `multi-user` instead of `single-user` in the future, so this could
  help with that.
- Extract out a few constants, some whitespace cleanup
@bnaecker
Copy link
Collaborator Author

I've now run this a few dozen times on my machine, and not seen any of the races mentioned in the issue. Instead, tracing the ipadm calls always shows this:

bnaecker@shale : ~/omicron $ pfexec dtrace -n 'proc:::exec-success /execname == "ipadm" && strstr(curpsinfo->pr_psargs, "oxControlService0") != NULL || strstr(curpsinfo->pr_psargs, "fd00:1122:3344:101::2") != NULL/ { printf("zone %d: '%s'\n", curpsinfo->pr_zoneid, curpsinfo->pr_psargs); }'
dtrace: description 'proc:::exec-success ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  3   2857         exec_common:exec-success zone 11: /usr/sbin/ipadm create-if -t oxControlService0

  2   2857         exec_common:exec-success zone 11: /usr/sbin/ipadm set-ifprop -t -p mtu=9000 -m ipv4 oxControlService0

 15   2857         exec_common:exec-success zone 11: /usr/sbin/ipadm set-ifprop -t -p mtu=9000 -m ipv6 oxControlService0

  4   2857         exec_common:exec-success zone 11: /usr/sbin/ipadm show-addr -p -o ADDR oxControlService0/omicron6

  0   2857         exec_common:exec-success zone 11: /usr/sbin/ipadm show-addr -p -o TYPE oxControlService0/ll

 15   2857         exec_common:exec-success zone 11: /usr/sbin/ipadm create-addr -t -T addrconf oxControlService0/ll

  2   2857         exec_common:exec-success zone 11: bash -c /usr/sbin/ipadm create-addr -t -T static -a fd00:1122:3344:101::2/64 ox

  2   2857         exec_common:exec-success zone 11: /usr/sbin/ipadm create-addr -t -T static -a fd00:1122:3344:101::2/64 oxControlS

  4   2857         exec_common:exec-success zone 11: /usr/sbin/ipadm show-addr -p -o ADDR oxControlService0/omicron6

That order is not strictly guaranteed, since there is still a window of concurrency between the sled-agent and the zone-setup tool. Note that this is necessary: the zone-setup tool attempts to create a default route, which will fail until the zone has an underlay IPv6 address. That's created by the sled-agent. As dpd and tfportd are both dependent on the zone-setup service running to completion, we can't really avoid that AFAIK.

Copy link
Contributor

@karencfv karencfv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for finding the bug and providing a quick fix! Left a couple of comments, but otherwise looks good to me.

I'll leave the review of the manifest dependencies to @citrus-it though. He'll have a better grasp on that than me.

illumos-utils/src/addrobj.rs Outdated Show resolved Hide resolved
illumos-utils/src/addrobj.rs Outdated Show resolved Hide resolved
zone-setup/src/bin/zone-setup.rs Show resolved Hide resolved
Copy link
Contributor

@citrus-it citrus-it left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of nits on the manifest, and I concur with @karencfv's comment on IPV4_STATIC_ADDROBJ_NAME.

smf/zone-network-setup/manifest.xml Outdated Show resolved Hide resolved
smf/zone-network-setup/manifest.xml Outdated Show resolved Hide resolved
- Addrobj name typo
- Cleanup zone setup manifest
@bnaecker bnaecker enabled auto-merge (squash) July 25, 2024 17:25
@bnaecker bnaecker merged commit 1c27e88 into main Jul 25, 2024
22 checks passed
@bnaecker bnaecker deleted the fix-zone-setup-race branch July 25, 2024 17:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Possible race between sled-agent and self-assembling switch zone
3 participants