Kopia backup failed on 5k pods in a single namespace, exiting with "podvolumebackups ... already exists, the server was not able to generate a unique name for the object" #6807
Comments
@sseago @shubham-pampattiwar @shawn-hurley FYI. Thank you @TzahiAshkenazi!
I'm going to take this one on. From initial analysis, it looks like the root cause is the use of `generateName`: the random suffix the apiserver appends is not guaranteed to be unique, and Velero does not retry when the generated name collides.
I've seen this same error in the past when generating test data from a bash script, creating objects from yaml that use `generateName`.
Related to kubernetes/kubernetes#115489
I don't think we need to retry whenever a resource is created. Could you clarify whether this is highly reproducible?
@reasonerjt The only time we need to retry is when we set `generateName` rather than an explicit `name`, since that is the case where the apiserver generates the (possibly colliding) suffix.
I've done one test of the `String(n int) string` function; below are the statistics from one run, including a variant that increases the length of the random alphanumeric string. The effect is most obvious when the length of the random string is increased, and retrying when a collision happens also works. In our real-world scenario, the number of resources created with the same prefix within a namespace does not exceed 1 million (a 0.32% collision rate in the above run). If we retry the create once after a collision, the probability of both attempts colliding is 0.32% * 0.32% ≈ 0.001%, which means roughly one repeated name per 100,000 resources created in a namespace. This probability is extremely low.
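As a rough sanity check on those numbers, here is a minimal, self-contained Go sketch (not taken from the test above; it assumes the 5-character suffix length and the 27-character alphabet that apimachinery's `rand.String` is commonly described as using) that counts suffix collisions at the 5,000-object scale from this issue:

```go
package main

import (
	"fmt"
	"math/rand"
)

// Assumption: alphabet matches apimachinery's rand.String character set
// (vowels and easily confused characters removed); suffix length 5 matches
// the apiserver name generator's randomLength.
const alphabet = "bcdfghjklmnpqrstvwxz2456789"
const suffixLen = 5

func randomSuffix(r *rand.Rand) string {
	b := make([]byte, suffixLen)
	for i := range b {
		b[i] = alphabet[r.Intn(len(alphabet))]
	}
	return string(b)
}

func main() {
	r := rand.New(rand.NewSource(1))
	seen := make(map[string]bool)
	collisions := 0
	// Simulate creating 5000 PodVolumeBackups with the same generateName prefix.
	for i := 0; i < 5000; i++ {
		s := randomSuffix(r)
		if seen[s] {
			collisions++
		}
		seen[s] = true
	}
	// With 27^5 ≈ 14.3M possible suffixes, the birthday bound makes at least
	// one collision among 5000 draws quite likely (~58%).
	fmt.Printf("collisions among 5000 names: %d\n", collisions)
}
```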
So in the above restore example, we're seeing that this is common enough to happen twice in a single restore. @TzahiAshkenazi note that my PR should handle the restore case too. I've basically taken every instance where Velero relies on GenerateName and used the client-go/util/retry func in all cases to retry up to 5 times when the generated name is repeated.
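To illustrate the shape of that pattern (a sketch only, not the actual PR code; the `createPVB` helper and the controller-runtime client usage here are hypothetical stand-ins):

```go
package podvolume // hypothetical package name

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"

	velerov1api "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// createPVB is a hypothetical helper: it retries the Create call when the
// apiserver reports "already exists", which can happen because the random
// suffix appended to GenerateName is not guaranteed to be unique.
func createPVB(ctx context.Context, c client.Client, pvb *velerov1api.PodVolumeBackup) error {
	// retry.DefaultRetry attempts the operation up to 5 times with a short
	// backoff; we retry only on AlreadyExists errors, so any other failure
	// is returned immediately.
	return retry.OnError(retry.DefaultRetry, apierrors.IsAlreadyExists, func() error {
		return c.Create(ctx, pvb)
	})
}
```

Because the submitted object keeps `generateName` set and an empty `name`, each retried Create gets a fresh server-generated suffix.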
thanks for the update @sseago |
@sseago |
@TzahiAshkenazi I just updated the fix PR to resolve a conflict with recent commits on main. At this point, the PR has one approval, so as soon as a second maintainer approves it, it will be merged (and thus on-target for Velero 1.13 inclusion) @shubham-pampattiwar @reasonerjt Can we also consider this for 1.12.1? The bug is currently impacting our performance testing against 1.12. I can submit a cherry-pick PR against release-1.12 once the PR against main merges. |
+1 for velero 1.12.1 |
issue #6807: Retry failed create when using generateName
When creating resources with generateName, apimachinery does not guarantee uniqueness when it appends the random suffix to the generateName stub, so if it fails with already exists error, we need to retry. Signed-off-by: Scott Seago <[email protected]>
[release-1.12]: CP issue #6807: Retry failed create when using generateName
What steps did you take and what happened:
Running a kopia backup.
Created a single namespace (SN) with 5000 pods & PVCs (each PVC 32MB, storage class = cephfs).
The backup CR failed and exited with "PartiallyFailed".
Steps & Commands:
Create a single namespace with 5000 Pods & PVs. (Each PV is 32MB without data, created using BusyBox.) sc = cephfs
Create the Backup CR
cr_name: backup-kopia-busybox-perf-single-5000-pods-cephfs
./velero backup create backup-kopia-busybox-perf-single-5000-pods-cephfs --include-namespaces busybox-perf-single-ns-5000-pods --default-volumes-to-fs-backup=true --snapshot-volumes=false -n velero-1-12
The backup CR failed and exited with "PartiallyFailed" after 2:09:33 (HH:MM:SS).
Errors:
What did you expect to happen:
The backup CR completes successfully.
Anything else you would like to add:
Attached a link to the possible root cause of the issue:
https://github.com/kubernetes/apiserver/blame/master/pkg/storage/names/generate.go#L45
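For context, the linked generator behaves roughly like the sketch below (paraphrased, not an exact copy of the upstream file): it appends a 5-character random suffix to the base name and performs no uniqueness check, leaving collision handling to the caller.

```go
// Paraphrased sketch of k8s.io/apiserver/pkg/storage/names/generate.go;
// may not match the exact upstream source.
package names

import (
	"fmt"

	utilrand "k8s.io/apimachinery/pkg/util/rand"
)

const (
	maxNameLength          = 63
	randomLength           = 5
	MaxGeneratedNameLength = maxNameLength - randomLength
)

type simpleNameGenerator struct{}

// GenerateName appends a short random suffix to the base; it does not check
// whether the resulting name already exists, so concurrent or repeated calls
// with the same base can collide.
func (simpleNameGenerator) GenerateName(base string) string {
	if len(base) > MaxGeneratedNameLength {
		base = base[:MaxGeneratedNameLength]
	}
	return fmt.Sprintf("%s%s", base, utilrand.String(randomLength))
}
```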
Environment:
Version: release-1.12-dev
features:
Kubernetes version: Kustomize Version: v4.5.4
Kubernetes installer & version:
Cloud provider or hardware configuration:
OCP running over BM servers
3 masters & 12 workers nodes
OS (e.g. from /etc/os-release):
NAME="Red Hat Enterprise Linux CoreOS"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION="411.86.202306021408-0"
VERSION_ID="4.11"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 411.86.202306021408-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.11/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.11"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.11"
OPENSHIFT_VERSION="4.11"
RHEL_VERSION="8.6"
OSTREE_VERSION="411.86.202306021408-0"
P.S.
I couldn't run the `velero debug` command since I re-deployed Velero for new test cases. If the attached logs are missing some info that is needed, I can reproduce the issue again.
kopia_logs.tar.gz