Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dont ignore failure to create cgroup after timeout #349

Merged
merged 1 commit into from
Nov 5, 2024

Conversation

jchorl
Copy link
Contributor

@jchorl jchorl commented Sep 18, 2024

I noticed that creating a cgroup will silently ignore timeouts and continue on. Concretely, I've hit cases where a cgroup fails to get created, and the caller ends up looking for cgroups.controllers only to find it isn't there.

It's very difficult as a caller to deal with this case, where NewSystemd succeeds but the group doesn't exist.

I traced the origins of this code to 5efa14e#diff-3331981e4ac06a8d9b06e91842b7f2759c7af3b65287e489a88385948d311ebdR672 - which was written 5+ years ago.

runc added roughly the same logic here: opencontainers/runc#3782

@dcantah
Copy link
Member

dcantah commented Sep 20, 2024

Looks good, but can you squash the commits and write the same blurb you have in the PR description in the commit message? Thanks!

@jchorl
Copy link
Contributor Author

jchorl commented Sep 21, 2024

Looks good, but can you squash the commits and write the same blurb you have in the PR description in the commit message? Thanks!

Done!

log.G(ctx).Warnf("Timed out while waiting for StartTransientUnit(%s) completion signal from dbus. Continuing...", group)
case <-time.After(systemdStartUnitTimeout):
attemptFailedUnitReset(conn, group)
return fmt.Errorf("Timed out while waiting for StartTransientUnit(%s) completion signal from dbus after %v", group, systemdStartUnitTimeout)
Copy link
Member

@dcantah dcantah Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, one nit: Make Timed lowercase to appease the Golang idiom gods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Before this commit, creating a cgroup would silently ignore timeouts and
carry on. Concretely, this caused cases where a cgroup failed to create,
but the caller doesn't realize and ends up looking for files that should
exist (e.g. cgroups.controllers), only to find they don't exist. It's
very difficult as a caller to deal with this case, where NewSystemd
succeeds but the group doesn't exist.

The origins of this code seem to trace back to an initial implementation
written 5+ years ago:
containerd@5efa14e#diff-3331981e4ac06a8d9b06e91842b7f2759c7af3b65287e489a88385948d311ebdR672

runc added roughly the same logic here to deal with the same issue:
opencontainers/runc#3782

Now, containerd will also error if a cgroup cannot be created within the
timeout window.

Signed-off-by: Josh Chorlton <[email protected]>
@jchorl
Copy link
Contributor Author

jchorl commented Nov 5, 2024

@dcantah what's the path to getting this merged?

@dcantah
Copy link
Member

dcantah commented Nov 5, 2024

@jchorl Just needs another approval, I'll shop this around to folks

@fuweid fuweid merged commit ffca7ca into containerd:main Nov 5, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants