Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataUpload CR sometimes not updated / finalized or Bytes missing? #8497

Closed
realjump opened this issue Dec 9, 2024 · 4 comments
Closed

DataUpload CR sometimes not updated / finalized or Bytes missing? #8497

realjump opened this issue Dec 9, 2024 · 4 comments
Assignees
Milestone

Comments

@realjump
Copy link

realjump commented Dec 9, 2024

What steps did you take and what happened:
Velero has been upgraded to 1.15.0. When executing a backup, I noticed that for a few DataUploads that are in status Completed, the status for BytesDone and TotalBytes differ, so that BytesDone < TotalBytes, but status is still Completed.

What did you expect to happen:
All DataUploads should only have the status Completed, insofar as BytesDone = TotalBytes, as otherwise you could ask yourself whether this is an error when updating the CR or whether bytes were actually not successfully uploaded to S3.

The following information will help us better understand what's going on:
As requested from the discussion #8492, here is the information on the quantity and the node utilization at this time on CPU:

./check_datauploads.sh
backupname-backup-20241206063243-7rwjd is Completed but 42990145943 < 58373579741 at xxx00214.xxx.de
backupname-backup-20241206063243-dvk4k is Completed but 878837521525 < 886194227785 at xxx00114.xxx.de
backupname-backup-20241206063243-xjxp7 is Completed but 16786601663 < 21805896215 at xxx00114.xxx.de
backupname-backup-20241206063243-zlxtm is Completed but 447458401232 < 466810484914 at xxx00214.xxx.de

kubectl get datauploads.velero.io -n velero | grep backupname-backup-20241206063243 | wc -l
733

As you can see, the behavior occurred in 4 of a total of 733 DataUploads as part of the same backup process.

The two nodes mentioned have uploaded some more DataUploads in this backup process (i.e. not only those that show the corresponding behavior). They are RHEL 8.10 nodes and are each equipped with 20 cores (or 40 threaded) and 512GB RAM.

The CPU load over the entire backup process is as follows for the two nodes (I am only including the nodes on which the behavior occurred in the pods, otherwise there would be too many):

xxx00114.xxx.de:
image

xxx00214.xxx.de:
image

Please note the timezones, the dashboard has CET timezone (+1) by default.

backupname-backup-20241206063243-7rwjd: Start 2024-12-06T11:57:24Z, Completion: 2024-12-06T12:06:22Z
backupname-backup-20241206063243-dvk4k: Start 2024-12-06T08:55:37Z, Completion: 2024-12-06T11:47:19Z
backupname-backup-20241206063243-xjxp7: Start 2024-12-06T07:59:21Z, Completion: 2024-12-06T08:08:30Z
backupname-backup-20241206063243-zlxtm: Start 2024-12-06T08:56:30Z, Completion: 2024-12-06T09:24:45Z

Anything else you would like to add:

In this section, I will add Pod-Log excerpts for the affected DataUploads. The only difference or commonality in the affected pods is a timeout, which I cannot see in the other data uploads at random.

Judging by the logs, it looks to me as if the DataPath is complete and nothing is missing - only the CR update is not complete. You can see that these pods remain on Completed for a little longer (30 seconds - 1 minute) and the status only changes from InProgress to Completed in the CR after "deletion". The other DataUpload pods end immediately after completion.

backupname-backup-20241206063243-7rwjd:
image

backupname-backup-20241206063243-dvk4k:
image

backupname-backup-20241206063243-xjxp7:
image

image

backupname-backup-20241206063243-zlxtm:
image

"Working" one - backupname-backup-20241206063243-tsdh5:
image

image

Environment:

  • Velero version (use velero version): 1.15.0
  • Velero features (use velero client config get features): EnableCSI
  • Kubernetes version (use kubectl version): 1.27.10
  • Kubernetes installer & version: kubeadm 1.27.10 as customized Ansible rollout
  • OS (e.g. from /etc/os-release): RHEL 8.10

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@Lyndon-Li
Copy link
Contributor

This is due to limitation of Kubernete Event that it doesn't wait and drain out the queue before closing the event sink.
But this problem should be rare, as the current behavior, Velero will wait for as long as 1 min before closing the event sink.

Let me try to enhance this in 1.16.

@Lyndon-Li Lyndon-Li added this to the v1.16 milestone Dec 11, 2024
@Lyndon-Li
Copy link
Contributor

You can use the backed up data safely even for the DU with this problem, the backup has actually completed, merely, the progress is not updated correctly due to not receiving the last several event updates.

@realjump
Copy link
Author

@Lyndon-Li all clear, thanks for the quick confirmation that the log entries are correct, only the CR is no longer finally updated. Maybe a longer timeout is not a bad idea in case someone else encounters the same behavior.

@msfrucht
Copy link
Contributor

We've also seen this issue as well. The kopia backup integrity is in good state on restore. Just the reporting of BytesDone vs BytesTotal is incorrect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants