fix: handle resource snapshot missing but work already synced and add cro/ro annotation to all the works #936

Merged
ryanzhang-oss merged 7 commits into Azure:main from the fix-resourcesnapshot-deleted branch on Oct 30, 2024

Conversation

ryanzhang-oss
Contributor

@ryanzhang-oss ryanzhang-oss commented Oct 24, 2024

  1. Handle the case where the resource snapshot has already been deleted when the work generator tries to generate the work from the binding. We can safely continue if the work is up to date.
  2. Add hash annotations for the cluster resource override snapshots and the resource override snapshots to every work, so that we can tell in the future which override snapshots a work was generated from.
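
The two changes above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the `work` struct, the annotation keys, and the helpers `hashNames`, `annotateWork`, and `canContinue` are all hypothetical names invented for the example, and the real hash scheme may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// Hypothetical annotation keys; the keys used by the project may differ.
const (
	croHashAnnotation = "example.io/cluster-resource-override-snapshot-hash"
	roHashAnnotation  = "example.io/resource-override-snapshot-hash"
)

// work is a simplified stand-in for a Work object (assumption).
type work struct {
	Annotations map[string]string
}

// hashNames returns a SHA-256 hex digest of the snapshot names in the given order.
func hashNames(names []string) string {
	sum := sha256.Sum256([]byte(strings.Join(names, ",")))
	return hex.EncodeToString(sum[:])
}

// annotateWork records which override snapshots the work was generated from.
func annotateWork(w *work, croNames, roNames []string) {
	if w.Annotations == nil {
		w.Annotations = map[string]string{}
	}
	w.Annotations[croHashAnnotation] = hashNames(croNames)
	w.Annotations[roHashAnnotation] = hashNames(roNames)
}

// canContinue reports whether work generation may proceed: either the
// resource snapshot still exists, or it is gone but the works are up to date.
func canContinue(snapshotFound, worksUpToDate bool) bool {
	return snapshotFound || worksUpToDate
}

func main() {
	w := &work{}
	annotateWork(w, []string{"cro-1"}, []string{"ro-1", "ro-2"})
	fmt.Println(len(w.Annotations), canContinue(false, true))
}
```

With annotations like these, a later reconcile could compare the stored hashes against the binding's current override snapshots instead of needing the deleted resource snapshot.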

Description of your changes

Fixes #

I have:

  • Run make reviewable to ensure this PR is ready for review.

How has this code been tested

Special notes for your reviewer

@ryanzhang-oss ryanzhang-oss changed the title from "fix: handle resource snapshot missing but work already synced and add cro/ro label to all the works" to "fix: handle resource snapshot missing but work already synced and add cro/ro annotation to all the works" Oct 24, 2024
@ryanzhang-oss ryanzhang-oss force-pushed the fix-resourcesnapshot-deleted branch 2 times, most recently from 78db570 to 251b3a3 Compare October 25, 2024 20:52
Contributor

@michaelawyu michaelawyu left a comment

Added a few comments. LGTM ;)

pkg/controllers/workgenerator/controller.go
@@ -485,6 +503,32 @@ func (r *Reconciler) syncAllWork(ctx context.Context, resourceBinding *fleetv1be
return true, updateAny.Load(), nil
}

// areAllWorkSynced checks if all the works are synced with the resource binding.
func areAllWorkSynced(existingWorks map[string]*fleetv1beta1.Work, resourceBinding *fleetv1beta1.ClusterResourceBinding, _, _ string) bool {
syncedCondition := resourceBinding.GetCondition(string(fleetv1beta1.ResourceBindingWorkSynchronized))
Contributor

Hi Ryan! For clusters that have been in the faulted state, this condition might have been removed. Of course, we have instructed the impacted user to fix things manually, and they are the only affected party we know of, so it might be alright.
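
One way to guard against the missing-condition case raised in this comment is sketched below. It is only an illustration, assuming that a removed condition should be treated as "not synced"; the `condition` and `binding` types, the `workSynchronized` constant, and `isWorkSynced` are simplified, hypothetical stand-ins and not the project's actual API or the logic of `areAllWorkSynced`.

```go
package main

import "fmt"

// condition is a simplified stand-in for a status condition (assumption).
type condition struct {
	Type               string
	Status             string // "True", "False", or "Unknown"
	ObservedGeneration int64
}

// binding is a simplified stand-in for a resource binding (assumption).
type binding struct {
	Generation int64
	Conditions []condition
}

// Hypothetical condition type name used only in this sketch.
const workSynchronized = "WorkSynchronized"

// getCondition returns the condition of the given type, or nil if absent.
func (b *binding) getCondition(condType string) *condition {
	for i := range b.Conditions {
		if b.Conditions[i].Type == condType {
			return &b.Conditions[i]
		}
	}
	return nil
}

// isWorkSynced treats a missing condition as "not synced" and requires the
// condition to be true for the binding's current generation.
func isWorkSynced(b *binding) bool {
	c := b.getCondition(workSynchronized)
	if c == nil {
		return false // condition was removed, e.g. on a faulted cluster
	}
	return c.Status == "True" && c.ObservedGeneration == b.Generation
}

func main() {
	faulted := &binding{Generation: 2}
	fmt.Println(isWorkSynced(faulted))
}
```

Under this assumption, a binding whose condition was cleared falls back to the normal path that needs the resource snapshot, rather than being reported as already synced.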

michaelawyu previously approved these changes Oct 29, 2024
Contributor

@michaelawyu michaelawyu left a comment

LGTM ;)

@ryanzhang-oss ryanzhang-oss force-pushed the fix-resourcesnapshot-deleted branch from c7d247c to a8225a9 Compare October 29, 2024 20:50
@ryanzhang-oss ryanzhang-oss force-pushed the fix-resourcesnapshot-deleted branch from a8225a9 to f079f32 Compare October 29, 2024 21:10
Eventually(crpStatusActual, eventuallyDuration, eventuallyInterval).Should(Succeed(), "Failed to update CRP status as expected")
})

It("update work to trigger a work generator reconcile", func() {
Contributor

@michaelawyu michaelawyu Oct 30, 2024

Hi Ryan! Is this step necessary? It clears the manifestConditions on the work objects, which will set the failedPlacements part of the resource binding status to nil, but this does not seem to be related to the test case. Sorry if I missed anything.

Contributor

And normally the work object should have already been marked as unavailable after the image change.

Contributor Author

If the work doesn't change, the already available binding won't get reconciled, so its status will stay "ready" and the test won't hit this bug.

if err != nil {
return err
}
testDeployment.Spec.Template.Spec.Containers[0].Image = "1.26.2"
Contributor

Ah, Ryan, I think this would still be an invalid image name? But it does align with the situation Infosys encountered (that is, even if the old snapshot has been removed and all clusters are failing, updates should still be processed).

Contributor Author

this image is valid

binding.Spec.ResourceSnapshotName = "next"
Expect(k8sClient.Update(ctx, binding)).Should(Succeed())
updateRolloutStartedGeneration(&binding)
// check the binding status that it should be marked as override succeed but not synced
Contributor

Hi Ryan! Just a nit: here it means rollout started but not overridden, right?

Contributor

@michaelawyu michaelawyu left a comment

LGTM ;)

@ryanzhang-oss ryanzhang-oss merged commit cb9a7a0 into Azure:main Oct 30, 2024
12 checks passed
2 participants