podman ps: fix racy pod name query #23325

Merged
merged 2 commits into from
Jul 18, 2024

Luap99
Member

@Luap99 Luap99 commented Jul 18, 2024

The pod name was queried without holding the container lock, so the pod could be deleted in the meantime. podman then failed with "no such pod" because the errors.Is() check matched the wrong error.

Move the query into the locked code; this should prevent anyone from removing the pod while the container is part of it. Also fix the returned error: there is no reason to special-case one specific error, so just wrap any error here so callers at least know where it happened.

Fixes #23282

Does this PR introduce a user-facing change?

Fixed a race condition that could make podman ps --pod fail.

@Luap99 Luap99 added the No New Tests Allow PR to proceed without adding regression tests label Jul 18, 2024
Contributor

openshift-ci bot commented Jul 18, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99


We were not able to find or create Copr project packit/containers-podman-23325 specified in the config with the following error:

Packit received HTTP 500 Internal Server Error from Copr Service. Check the Copr status page: https://copr.fedorainfracloud.org/status/stats/, or ask for help in Fedora Build System matrix channel https://matrix.to/#/#buildsys:fedoraproject.org.

Unless the HTTP status code above is >= 500, please check your configuration for:

  1. typos in owner and project name (groups need to be prefixed with @)
  2. whether the project name doesn't contain not allowed characters (only letters, digits, underscores, dashes and dots must be used)
  3. whether the project itself exists (Packit creates projects only in its own namespace)
  4. whether Packit is allowed to build in your Copr project
  5. whether your Copr project/group is not private

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 18, 2024
@Luap99
Member Author

Luap99 commented Jul 18, 2024

@edsantiago another one

@Luap99
Member Author

Luap99 commented Jul 18, 2024

/hold need to do more debugging #23326 (comment)

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 18, 2024
The pod name was queried without holding the container lock, so the pod
could be deleted in the meantime. podman then failed with "no such pod"
because the errors.Is() check matched the wrong error.

Move the query into the locked code; this should prevent anyone from
removing the pod while the container is part of it. Also fix the
returned error: there is no reason to special-case one specific error,
so just wrap any error here so callers at least know where it happened.
However, this is not good enough because the batch doesn't update the
state, which means it sees everything as it was before the container was
locked. In this case the ctr and pod might already have been removed, so
let the caller skip both ctr-removed and pod-removed errors.

Fixes containers#23282

Signed-off-by: Paul Holzinger <[email protected]>
@Luap99
Member Author

Luap99 commented Jul 18, 2024

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 18, 2024
@Luap99
Member Author

Luap99 commented Jul 18, 2024

@edsantiago please try this version, seems to work for me now (of course there are still other flakes)

If a pod is removed while podman pod stats is running, there is a race
where the command might fail with "no such pod". This is not a user
error, so, like the ps/ls commands, skip the pod and move on to the next
one.

Fixes containers#23327

Signed-off-by: Paul Holzinger <[email protected]>
@edsantiago
Member

Looks good - I'm now seeing the system connection remove flake and no others. New push on #23275.

@edsantiago
Member

The flake is refusing to manifest on my laptop now.

LGTM!

@mheon
Member

mheon commented Jul 18, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 18, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 164ecb2 into containers:main Jul 18, 2024
82 checks passed
@Luap99 Luap99 deleted the ps-pod-err branch July 18, 2024 17:54
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Oct 17, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Oct 17, 2024
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
locked - please file new issue/PR: Assist humans wanting to comment on an old issue or PR with locked comments.
No New Tests: Allow PR to proceed without adding regression tests
release-note
Development

Successfully merging this pull request may close these issues.

completion: container diff: no such pod
3 participants