Implementation for the restore queue #128
base: master
Conversation
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: mpryc. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@mateusoliveira43 note the file where I've moved the common types. Let me know if that is fine; currently we were mixing backup types inside the restore object: https://github.com/migtools/oadp-non-admin/pull/128/files#diff-bce3cc24a9a3be49f46a28f3a11baff2a1f07f5e735a1dbc683e535e1aaf625d
@mpryc it is fine. I would change the name to something like
CRD updates for the following non-admin PR: migtools/oadp-non-admin#128 Signed-off-by: Michal Pryc <[email protected]>
Blocked by: openshift/oadp-operator#1607
@mpryc avoid using
var veleroRestoreList velerov1.RestoreList
labelSelector := client.MatchingLabels{labelKey: labelValue}

if err := clientInstance.List(ctx, &veleroRestoreList, client.InNamespace(namespace), labelSelector); err != nil {
What you want here is to get all NAC "owned" Velero restores, right?
I suggest doing it like this:
func GetActiveOwnedVeleroRestores(ctx context.Context, clientInstance client.Client, namespace string) ([]velerov1.Restore, error) {
	var veleroRestoreList velerov1.RestoreList
	if err := ListObjectsByLabel(ctx, clientInstance, namespace, constant.ManagedByLabel, constant.ManagedByLabelValue, &veleroRestoreList); err != nil {
		return nil, err
	}
	// Keep only restores that have not completed yet and carry the expected NAC metadata.
	var activeRestores []velerov1.Restore
	for _, restore := range veleroRestoreList.Items {
		if restore.Status.CompletionTimestamp == nil && CheckVeleroRestoreMetadata(restore) {
			activeRestores = append(activeRestores, restore)
		}
	}
	if len(activeRestores) == 0 {
		return nil, nil
	}
	return activeRestores, nil
}
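For context, here is a hedged sketch of how a caller could turn that slice into an estimated queue position; the function name and the counting rule are illustrative, not the PR's actual implementation:

// Hypothetical helper (not the PR's actual code): estimate where targetRestore
// sits among the still-active NAC-owned Velero restores returned above.
// Position 1 means the restore is the next one to be served.
func estimateRestoreQueuePosition(activeRestores []velerov1.Restore, targetRestore *velerov1.Restore) int {
	position := 1
	for i := range activeRestores {
		// Restores created before the target are assumed to be ahead of it in the queue.
		if activeRestores[i].CreationTimestamp.Before(&targetRestore.CreationTimestamp) {
			position++
		}
	}
	return position
}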
I want all the NAC-owned restores that do not have restore.Status.CompletionTimestamp, which means they are still subject to being handled. We are computing the queue number based on the list of objects that are still subject to some work by Velero; the ones that are completed (have the CompletionTimestamp) are not relevant anymore.
Will change the implementation to use ListObjectsByLabel.
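For reference, a minimal sketch of what a ListObjectsByLabel helper with the signature used above could look like; the actual helper in internal/common/function/function.go may differ:

// Sketch only, assuming the call signature used in the suggestion above; the real
// helper in internal/common/function/function.go may differ.
func ListObjectsByLabel(ctx context.Context, clientInstance client.Client, namespace string, labelKey string, labelValue string, objectList client.ObjectList) error {
	// List objects of the given kind in the namespace, filtered by a single label key/value pair.
	return clientInstance.List(ctx, objectList,
		client.InNamespace(namespace),
		client.MatchingLabels{labelKey: labelValue},
	)
}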
Should this also be changed? I don't want this one to behave any differently. In this PR or a separate one?
oadp-non-admin/internal/common/function/function.go
Lines 246 to 267 in 56afada
func GetActiveVeleroBackupsByLabel(ctx context.Context, clientInstance client.Client, namespace, labelKey, labelValue string) ([]velerov1.Backup, error) {
	var veleroBackupList velerov1.BackupList
	labelSelector := client.MatchingLabels{labelKey: labelValue}
	if err := clientInstance.List(ctx, &veleroBackupList, client.InNamespace(namespace), labelSelector); err != nil {
		return nil, err
	}
	// Filter out backups with a CompletionTimestamp
	var activeBackups []velerov1.Backup
	for _, backup := range veleroBackupList.Items {
		if backup.Status.CompletionTimestamp == nil {
			activeBackups = append(activeBackups, backup)
		}
	}
	if len(activeBackups) == 0 {
		return nil, nil
	}
	return activeBackups, nil
}
I am OK doing the change in this PR as well.
I will leave it as it is in this PR, if you agree. I changed the backup function and the tests started failing, always showing queue 0. I don't want to spend time in this PR debugging why; the current implementation works as expected.
Here is what I checked (even without checking metadata):
func GetActiveOwnedVeleroBackups(ctx context.Context, clientInstance client.Client, namespace string) ([]velerov1.Backup, error) {
	veleroBackupList := &velerov1.BackupList{}
	if err := ListObjectsByLabel(ctx, clientInstance, namespace, constant.ManagedByLabel, constant.ManagedByLabelValue, veleroBackupList); err != nil {
		return nil, err
	}
	// Filter out backups with a CompletionTimestamp
	var activeBackups []velerov1.Backup
	for _, backup := range veleroBackupList.Items {
		if backup.Status.CompletionTimestamp == nil {
			activeBackups = append(activeBackups, backup)
		}
	}
	if len(activeBackups) == 0 {
		return nil, nil
	}
	return activeBackups, nil
}
I do not think we need a new predicate and handler for the queue.
Doing these changes on the current ones, does that not give the same result?
diff --git a/internal/handler/velerorestore_handler.go b/internal/handler/velerorestore_handler.go
index 515de66..ff2b23c 100644
--- a/internal/handler/velerorestore_handler.go
+++ b/internal/handler/velerorestore_handler.go
@@ -21,6 +21,7 @@ import (
"k8s.io/apimachinery/pkg/types"
"k8s.io/client-go/util/workqueue"
+ "sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/event"
"sigs.k8s.io/controller-runtime/pkg/reconcile"
@@ -29,7 +30,10 @@ import (
)
// VeleroRestoreHandler contains event handlers for Velero Restore objects
-type VeleroRestoreHandler struct{}
+type VeleroRestoreHandler struct {
+ Client client.Client
+ OADPNamespace string
+}
// Create event handler
func (VeleroRestoreHandler) Create(_ context.Context, _ event.CreateEvent, _ workqueue.RateLimitingInterface) {
@@ -37,18 +41,42 @@ func (VeleroRestoreHandler) Create(_ context.Context, _ event.CreateEvent, _ wor
}
// Update event handler adds Velero Restore's NonAdminRestore to controller queue
-func (VeleroRestoreHandler) Update(ctx context.Context, evt event.UpdateEvent, q workqueue.RateLimitingInterface) {
+func (h VeleroRestoreHandler) Update(ctx context.Context, evt event.UpdateEvent, q workqueue.RateLimitingInterface) {
logger := function.GetLogger(ctx, evt.ObjectNew, "VeleroRestoreHandler")
annotations := evt.ObjectNew.GetAnnotations()
nonAdminRestoreName := annotations[constant.NarOriginNameAnnotation]
nonAdminRestoreNamespace := annotations[constant.NarOriginNamespaceAnnotation]
- q.Add(reconcile.Request{NamespacedName: types.NamespacedName{
- Name: nonAdminRestoreName,
- Namespace: nonAdminRestoreNamespace,
- }})
- logger.V(1).Info("Handled Update event")
+ if function.CheckVeleroRestoreMetadata(evt.ObjectNew) {
+ q.Add(reconcile.Request{NamespacedName: types.NamespacedName{
+ Name: nonAdminRestoreName,
+ Namespace: nonAdminRestoreNamespace,
+ }})
+ logger.V(1).Info("Handled Update event")
+ }
+
+ restores, err := function.GetActiveVeleroRestoresByLabel(ctx, h.Client, h.OADPNamespace)
+ if err != nil {
+ logger.Error(err, "Failed to get Velero Restores by label")
+ return
+ }
+
+ if restores != nil {
+ for _, restore := range restores {
+ annotations := restore.GetAnnotations()
+ originName := annotations[constant.NarOriginNameAnnotation]
+ originNamespace := annotations[constant.NarOriginNamespaceAnnotation]
+
+ if originName != nonAdminRestoreName || originNamespace != nonAdminRestoreNamespace {
+ logger.V(1).Info("Processing Queue update for the NonAdmin Restore referenced by Velero Restore", "Name", restore.Name, constant.NamespaceString, restore.Namespace, "CreatedAt", restore.CreationTimestamp)
+ q.Add(reconcile.Request{NamespacedName: types.NamespacedName{
+ Name: originName,
+ Namespace: originNamespace,
+ }})
+ }
+ }
+ }
}
// Delete event handler
diff --git a/internal/predicate/velerorestore_predicate.go b/internal/predicate/velerorestore_predicate.go
index 8dfdae8..e63e75b 100644
--- a/internal/predicate/velerorestore_predicate.go
+++ b/internal/predicate/velerorestore_predicate.go
@@ -19,6 +19,7 @@ package predicate
import (
"context"
+ velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
"sigs.k8s.io/controller-runtime/pkg/event"
"github.com/migtools/oadp-non-admin/internal/common/function"
@@ -40,6 +41,12 @@ func (p VeleroRestorePredicate) Update(ctx context.Context, evt event.UpdateEven
logger.V(1).Info("Accepted Update event")
return true
}
+ newRestore, _ := evt.ObjectNew.(*velerov1.Restore)
+ oldRestore, _ := evt.ObjectOld.(*velerov1.Restore)
+ if oldRestore.Status.CompletionTimestamp == nil && newRestore.Status.CompletionTimestamp != nil {
+ logger.V(1).Info("Accepted Update event: Restore completion timestamp")
+ return true
+ }
}
logger.V(1).Info("Rejected Update event")
Here I think we should stick with the handler/predicate approach for two reasons:
- Be consistent with how backup works:
oadp-non-admin/internal/predicate/velerobackup_queue_predicate.go
Lines 37 to 60 in 56afada
func (p VeleroBackupQueuePredicate) Update(ctx context.Context, evt event.UpdateEvent) bool {
	logger := function.GetLogger(ctx, evt.ObjectNew, "VeleroBackupQueuePredicate")

	// Ensure the new and old objects are of the expected type
	newBackup, okNew := evt.ObjectNew.(*velerov1.Backup)
	oldBackup, okOld := evt.ObjectOld.(*velerov1.Backup)
	if !okNew || !okOld {
		logger.V(1).Info("Rejected Backup Update event: invalid object type")
		return false
	}

	namespace := newBackup.GetNamespace()
	if namespace == p.OADPNamespace {
		if oldBackup.Status.CompletionTimestamp == nil && newBackup.Status.CompletionTimestamp != nil {
			logger.V(1).Info("Accepted Backup Update event: new completion timestamp")
			return true
		}
	}

	logger.V(1).Info("Rejected Backup Update event: no changes to the CompletionTimestamp in the VeleroBackup object")
	return false
}

func (h VeleroBackupQueueHandler) Update(ctx context.Context, evt event.UpdateEvent, q workqueue.RateLimitingInterface) {
- In k8s the predicate comes before the handler, so we won't be invoking reconcile in cases where we can save some calls. In the above we only accept events where the CompletionTimestamp changes from nil to something, meaning the object is "done". Only those are interesting for us with regards to how the queue works: those objects may have entered the queue before ones whose queue info needs updating, while anything that happened later is placed later in the queue and therefore cannot affect the objects in front of it. This saves quite a few calls at the predicate level. Secondly, you are removing the annotations check from the handler, which makes reconcile responsible for it. Reconcile comes after the handler, which also means making more calls instead of reducing them.
I made some tests out of a cluster, and I am afraid we need to merge them together.
If the queue predicate is triggered, both handlers are triggered, right?
If an admin backup triggers this, the Velero backup predicate will try to add a NonAdminBackup with empty name and namespace to the queue, right? And this creates an endless error.
I don't follow this use case?
The following handlers are triggered after predicates:
oadp-non-admin/internal/controller/nonadminbackup_controller.go
Lines 699 to 700 in 56afada
Watches(&velerov1.Backup{}, &handler.VeleroBackupHandler{}).
Watches(&velerov1.Backup{}, &handler.VeleroBackupQueueHandler{
So both handlers are triggered based on the composite predicate; each handler does a different job.
This one only for the object which actually triggered the event:
https://github.com/migtools/oadp-non-admin/blob/master/internal/handler/velerobackup_handler.go#L41-L53
And this one for all other objects to update statuses:
https://github.com/migtools/oadp-non-admin/blob/master/internal/handler/velerobackup_handler.go#L41-L53
See the namespace check.
If a backup not owned by NAC updates to completed, does that trigger the queue predicate? Yes, right?
Then both handlers are called, and velerobackup_handler.go will try to add a NonAdminBackup with empty name and namespace to the queue, right? And this creates an endless error.
Will open a PR for that then.
Can you wait for that one before continuing on this?
I really do prefer to have a working and similar implementation on both parts. With test day having proven this works, I don't want to modify too much in this area at the moment, as I want to focus on #36 and then improve if we find issues. This for me is not an issue, just a small improvement in the implementation. Going off the reconcile is fine imo at this moment.
A bit more on why I think it's really not that important to make this effort at the moment:
- We already have a tested and working implementation; there are possibly areas to improve, but we need to focus on other parts (sync controller, nabsl).
- The current flow is pretty clean; we have:
a) predicates for each type of interesting event (first defence/filter against unnecessary reconciles)
b) separate handlers for Velero events that work with:
the current NAB object: q.Add(reconcile.Request{NamespacedName: types.NamespacedName{
other interested NAB objects: https://github.com/migtools/oadp-non-admin/blob/master/internal/handler/velerobackup_queue_handler.go#L74-L81
The two above are not causing the same NAB to be reconciled twice.
If we want to make sure a Velero object event does not trigger reconcile for a non-existing NAB, we would need to move the check for whether the NAB really exists into the handler, which is not the way it should work. The Reconcile function is the proper place for this check, as it centralizes the logic, ensuring there is only one point where the NAB's existence is verified, rather than duplicating this check in the handler.
Aren't we rewriting reconcile yet one more time?
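For illustration, a minimal sketch of what that existence check at the top of Reconcile could look like; the receiver name and error handling are assumptions, not the controller's actual code:

// Hypothetical sketch of the NAB existence check centralized in Reconcile.
// apierrors is "k8s.io/apimachinery/pkg/api/errors".
func (r *NonAdminBackupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	nab := &nacv1alpha1.NonAdminBackup{}
	if err := r.Get(ctx, req.NamespacedName, nab); err != nil {
		if apierrors.IsNotFound(err) {
			// The NonAdminBackup referenced by the event does not exist (anymore);
			// nothing to reconcile and no requeue is needed.
			return ctrl.Result{}, nil
		}
		return ctrl.Result{}, err
	}
	// ... the rest of the reconcile logic for an existing NonAdminBackup ...
	return ctrl.Result{}, nil
}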
@mateusoliveira43 @mpryc Let's keep the restore queue work similar to what we had in the backup queue. If there are cases that are not covered with the current approach, let's create an issue for those, and the issues can be fixed in follow-on PRs.
@mateusoliveira43 based on what @shubham-pampattiwar wrote, is there anything left for me to do to get this merged? Please let me know so I can work on it if needed.
@@ -357,6 +364,9 @@ var _ = ginkgo.Describe("Test full reconcile loop of NonAdminRestore Controller"
				LastTransitionTime: metav1.NewTime(time.Now()),
			},
		},
		QueueInfo: &nacv1alpha1.QueueInfo{
I do not know what happens if we tell Velero to restore a backup in progress. I suspect it fails.
To make this test scenario more real, I would change VeleroBackup.Status to phase completed, have a CompletionTimestamp, and set QueueInfo.EstimatedQueuePosition to zero.
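A minimal sketch of that suggested fixture, assuming the standard Velero API types (values are illustrative):

// Illustrative test fixture for the suggestion above: a Velero Backup that is
// already Completed, so the NonAdminRestore's EstimatedQueuePosition is
// expected to come out as zero after reconciliation.
// velerov1 is "github.com/vmware-tanzu/velero/pkg/apis/velero/v1",
// metav1 is "k8s.io/apimachinery/pkg/apis/meta/v1".
completedBackupStatus := velerov1.BackupStatus{
	Phase:               velerov1.BackupPhaseCompleted,
	CompletionTimestamp: &metav1.Time{Time: time.Now()},
}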
Yeah, you cannot restore an InProgress backup.
Will change this; it's just a test and it will be more aligned with a real-world scenario.
That said, if a user attempts to restore an InProgress backup, it should still be added to the queue. When the time comes to process that InProgress backup, it will simply fail immediately, effectively reducing the queue size at that point.
The VeleroBackup.Status and CompletionTimestamp have been updated. However, I want to ensure that EstimatedQueuePosition remains greater than 0. In our test, we've set it to a random number, 5. This is intentional, to verify that regardless of the state of the objects or the status number before reconciliation, the queue position is always accurately calculated during reconciliation. Using 0 both before and after would fail to test scenarios where the position needs to be updated to reflect the correct number after the operation.
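For illustration, a hedged sketch of that black-box check in an envtest-style suite; the variable names (nar, k8sClient) and the update/wait mechanics are assumptions, not the suite's actual code:

// Hypothetical sketch: seed an arbitrary non-zero queue position and assert that
// reconciliation recalculates it rather than trusting the pre-existing value.
nar.Status.QueueInfo = &nacv1alpha1.QueueInfo{EstimatedQueuePosition: 5}
gomega.Expect(k8sClient.Status().Update(ctx, nar)).To(gomega.Succeed())

// ... trigger and wait for the NonAdminRestore to be reconciled ...

gomega.Expect(k8sClient.Get(ctx, types.NamespacedName{Name: nar.Name, Namespace: nar.Namespace}, nar)).To(gomega.Succeed())
// With the target backup already Completed, the recalculated position should be zero.
gomega.Expect(nar.Status.QueueInfo.EstimatedQueuePosition).To(gomega.BeZero())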
But this cannot happen:
oadp-non-admin/internal/common/function/function.go
Lines 280 to 284 in 9ba3da3
// If the target backup has a CompletionTimestamp, it means that it's already served.
if targetBackup.Status.CompletionTimestamp != nil {
	queueInfo.EstimatedQueuePosition = 0
	return queueInfo, nil
}
I understand, but the purpose of the test is to ensure that any future changes to the code don’t alter this behavior, correct? You’re suggesting modifying the test scenario to match the current implementation, but I believe we should take a more black-box approach to verify that the implementation behaves as expected.
CRD updates for the following non-admin PR: migtools/oadp-non-admin#128 Signed-off-by: Michal Pryc <[email protected]>
Adds the estimated queue to the NAB restore object Signed-off-by: Michal Pryc <[email protected]>
Test coverage for the status queue within restore operation. Signed-off-by: Michal Pryc <[email protected]>
Signed-off-by: Michal Pryc <[email protected]>
- Added coverage for the GetRestoreQueueInfo
- Fixed status comparison to include nil QueueInfo checks
- Modified test to cover Restore of BackingOff Backup
- Added Completed status for the Backup in test
Signed-off-by: Michal Pryc <[email protected]>
@mateusoliveira43 @shubham-pampattiwar rebased and addressed comments in the 4th commit to this PR, so it's easier to review the latest changes.
@@ -388,7 +418,7 @@ var _ = ginkgo.Describe("Test full reconcile loop of NonAdminRestore Controller"
			},
		},
	}),
-	ginkgo.Entry("Should update NonAdminRestore until it invalidates and then delete it", nonAdminRestoreFullReconcileScenario{
+	ginkgo.Entry("Should not include queueInfo for the given backup as NonAdminBackup is not ready to be restored (BackingOff)", nonAdminRestoreFullReconcileScenario{
I would not update the test name; the test scenario did not change much.
This info can be added as a comment on line 449.
I made it more precise about what it does, so why not make it better, given that I've changed the status check that affected this run?
Adds the estimated queue to the NAB restore object
Why the changes were made
To include an estimated queue number for the restore object.
How to test the changes made
make simulation-test