Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workspace resource is stuck with external-create-pending annotation if provider-terraform pod is deleted during Create #340

Open
bobh66 opened this issue Jul 14, 2022 · 9 comments
Labels
bug Something isn't working

Comments

@bobh66
Copy link
Contributor

bobh66 commented Jul 14, 2022

What happened?

The provider-terraform Create() implementation runs terraform apply which can take "a while" - certainly tens of seconds and often several minutes.

If the provider pod is terminated while terraform apply is running, the external-create-pending annotation can get stuck on the Workspace resource and the managed Reconciler will not process it without manual intervention to remove the annotation.

This is most easily demonstrated when provider-terraform has code that detected the context cancellation and returns failure from Create():

crossplane-contrib/provider-terraform#76

This should cause the Reconciler to set the ExternalCreateFailed annotation, which would allow the resource to be retried on the next cycle. However the update of the resource to set that annotation fails because the context has already been cancelled, and so the resource is stuck with the pending annotation.

The logs indicate that the critical annotations cannot be updated because the context has been canceled:

2022-07-14T01:33:06.401Z        DEBUG   provider-terraform      Cannot create external resource {"controller": "managed/workspace.tf.crossplane.io", "request": "/test7", "uid": "9b803b6d-2380-47dc-91e2-99f94d0dd0b5", "version": "64480800", "external-name": "", "error": "cannot apply Terraform configuration: context canceled shutdown while running terraform command: error waiting for child process to terminate: context canceled", "errorVerbose": "context canceled\nerror waiting for child process to terminate\ngithub.com/crossplane-contrib/provider-terraform/internal/terraform.runCommand\n\t/home/bobh/git/bobh66/provider-terraform/internal/terraform/terraform.go:528\ngithub.com/crossplane-contrib/provider-terraform/internal/terraform.Harness.Apply\n\t/home/bobh/git/bobh66/provider-terraform/internal/terraform/terraform.go:474\ngithub.com/crossplane-contrib/provider-terraform/internal/controller/workspace.(*external).Update\n\t/home/bobh/git/bobh66/provider-terraform/internal/controller/workspace/workspace.go:331\ngithub.com/crossplane-contrib/provider-terraform/internal/controller/workspace.(*external).Create\n\t/home/bobh/git/bobh66/provider-terraform/internal/controller/workspace/workspace.go:315\ngithub.com/crossplane/crossplane-runtime/pkg/reconciler/managed.(*Reconciler).Reconcile\n\t/home/bobh/git/bobh66/provider-terraform/vendor/github.com/crossplane/crossplane-runtime/pkg/reconciler/managed/reconciler.go:886\ngithub.com/crossplane/crossplane-runtime/pkg/ratelimiter.(*Reconciler).Reconcile\n\t/home/bobh/git/bobh66/provider-terraform/vendor/github.com/crossplane/crossplane-runtime/pkg/ratelimiter/reconciler.go:54\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/bobh/git/bobh66/provider-terraform/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/bobh/git/bobh66/provider-terraform/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/bobh/git/bobh66/provider-terraform/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/bobh/git/bobh66/provider-terraform/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581\ncontext canceled shutdown while running terraform command\ngithub.com/crossplane-contrib/provider-terraform/internal/terraform.runCommand\n\t/home/bobh/git/bobh66/provider-terraform/internal/terraform/terraform.go:528\ngithub.com/crossplane-contrib/provider-terraform/internal/terraform.Harness.Apply\n\t/home/bobh/git/bobh66/provider-terraform/internal/terraform/terraform.go:474\ngithub.com/crossplane-contrib/provider-terraform/internal/controller/workspace.(*external).Update\n\t/home/bobh/git/bobh66/provider-terraform/internal/controller/workspace/workspace.go:331\ngithub.com/crossplane-contrib/provider-terraform/internal/controller/workspace.(*external).Create\n\t/home/bobh/git/bobh66/provider-terraform/internal/controller/workspace/workspace.go:315\ngithub.com/crossplane/crossplane-runtime/pkg/reconciler/managed.(*Reconciler).Reconcile\n\t/home/bobh/git/bobh66/provider-terraform/vendor/github.com/crossplane/crossplane-runtime/pkg/reconciler/managed/reconciler.go:886\ngithub.com/crossplane/crossplane-runtime/pkg/ratelimiter.(*Reconciler).Reconcile\n\t/home/bobh/git/bobh66/provider-terraform/vendor/github.com/crossplane/crossplane-runtime/pkg/ratelimiter/reconciler.go:54\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/bobh/git/bobh66/provider-terraform/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/bobh/git/bobh66/provider-terraform/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/bobh/git/bobh66/provider-terraform/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/bobh/git/bobh66/provider-terraform/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581\ncannot apply Terraform configuration\ngithub.com/crossplane-contrib/provider-terraform/internal/controller/workspace.(*external).Update\n\t/home/bobh/git/bobh66/provider-terraform/internal/controller/workspace/workspace.go:332\ngithub.com/crossplane-contrib/provider-terraform/internal/controller/workspace.(*external).Create\n\t/home/bobh/git/bobh66/provider-terraform/internal/controller/workspace/workspace.go:315\ngithub.com/crossplane/crossplane-runtime/pkg/reconciler/managed.(*Reconciler).Reconcile\n\t/home/bobh/git/bobh66/provider-terraform/vendor/github.com/crossplane/crossplane-runtime/pkg/reconciler/managed/reconciler.go:886\ngithub.com/crossplane/crossplane-runtime/pkg/ratelimiter.(*Reconciler).Reconcile\n\t/home/bobh/git/bobh66/provider-terraform/vendor/github.com/crossplane/crossplane-runtime/pkg/ratelimiter/reconciler.go:54\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/bobh/git/bobh66/provider-terraform/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/bobh/git/bobh66/provider-terraform/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/bobh/git/bobh66/provider-terraform/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/bobh/git/bobh66/provider-terraform/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581"}
2022-07-14T01:33:06.401Z        DEBUG   provider-terraform      cannot update managed resource annotations      {"controller": "managed/workspace.tf.crossplane.io", "request": "/test7", "uid": "9b803b6d-2380-47dc-91e2-99f94d0dd0b5", "version": "64480800", "external-name": "", "error": "cannot update critical annotations: context canceled"}
2022-07-14T01:33:06.401Z        ERROR   controller.managed/workspace.tf.crossplane.io   Reconciler error        {"reconciler group": "tf.crossplane.io", "reconciler kind": "Workspace", "name": "test7", "namespace": "", "error": "cannot update managed resource status: context canceled"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /home/bobh/git/bobh66/provider-terraform/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /home/bobh/git/bobh66/provider-terraform/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
2022-07-14T01:33:06.401Z        DEBUG   provider-terraform      Reconciling     {"controller": "managed/workspace.tf.crossplane.io", "request": "/test7"}
2022-07-14T01:33:06.401Z        DEBUG   provider-terraform      cannot determine creation result - remove the crossplane.io/external-create-pending annotation if it is safe to proceed {"controller": "managed/workspace.tf.crossplane.io", "request": "/test7", "uid": "9b803b6d-2380-47dc-91e2-99f94d0dd0b5", "version": "64480909", "external-name": "test7"}
2022-07-14T01:33:06.401Z        ERROR   controller.managed/workspace.tf.crossplane.io   Reconciler error        {"reconciler group": "tf.crossplane.io", "reconciler kind": "Workspace", "name": "test7", "namespace": "", "error": "cannot update managed resource status: context canceled"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /home/bobh/git/bobh66/provider-terraform/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /home/bobh/git/bobh66/provider-terraform/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
2022-07-14T01:33:06.401Z        DEBUG   events  Warning {"object": {"kind":"Workspace","name":"test7","uid":"9b803b6d-2380-47dc-91e2-99f94d0dd0b5","apiVersion":"tf.crossplane.io/v1alpha1","resourceVersion":"64480909"}, "reason": "CannotCreateExternalResource", "message": "cannot apply Terraform configuration: context canceled shutdown while running terraform command: error waiting for child process to terminate: context canceled"}
2022-07-14T01:33:06.401Z        DEBUG   events  Warning {"object": {"kind":"Workspace","name":"test7","uid":"9b803b6d-2380-47dc-91e2-99f94d0dd0b5","apiVersion":"tf.crossplane.io/v1alpha1","resourceVersion":"64480909"}, "reason": "CannotUpdateManagedResource", "message": "cannot update managed resource annotations: cannot update critical annotations: context canceled"}

How can we reproduce it?

Create a terraform Workspace object that uses the kubernetes backend with a time_sleep resource that has a create_duration of 5 minutes

After applying the Workspace object, delete the provider-terraform pod within 5 minutes

Observe the above log messages and that the resource cannot be managed by the Reconciler because of the orphaned external-create-pending annotation

What environment did it happen in?

Crossplane version: 1.8.1

Question - should the Create() process be considered to be idempotent? Should it always be possible to rerun Create() even when the external-create-pending annotation is set?

The existing implementation of the external-create annotations has the "incomplete" check immediately at the start of reconciliation. I'm wondering if it would make sense to allow the Observe to happen and only check for incomplete external creation if the Observe indicates that the resource exists? If the resource does not exist then maybe Create should be responsible for creating it, even if the pending annotation is set?

@ytsarev @negz FYI

@bobh66 bobh66 added the bug Something isn't working label Jul 14, 2022
@negz
Copy link
Member

negz commented Jul 15, 2022

I'm wondering if it would make sense to allow the Observe to happen and only check for incomplete external creation if the Observe indicates that the resource exists?

Unfortunately that won't cover the key case this logic was added for, which is when Observe might not be unable to know whether the resource exists. Some resources have non-deterministic identifiers - e.g. when you call Create the external system generates an identifier and returns it to you. If we have a "partial create" such that we successfully created a resource in the external system but didn't record its identifier it will appear on the next reconcile that it was never created and Observe will move on to calling Create again, resulting in creating duplicates.

You can read a little more about this in crossplane/crossplane#3037 (comment).

So this is kind of "working as intended" - we intentionally take the conservative route and ask that a human checks and removes the offending annotation when we get into this state. I can see that this isn't great for provider-terraform that is at higher risk of getting into this state due to its very long-running Create logic though. I'm open to ideas to address this - maybe we make it possible to opt out of this functionality?

@bobh66
Copy link
Contributor Author

bobh66 commented Jul 15, 2022

Thanks Nic, that makes sense. What about if we allow the Observe to happen, which would give provider-terraform a chance to remove or update the external-create-pending annotation, and then do the same check that is there today, just after the Observe instead of before? Otherwise I think we would need to add some more complex "opt out" logic so the Reconciler would "know" to not enfore that check. If we let the provider manage the annotations (when it wants to) then the Reconciler can continue as it does today.

@patst
Copy link

patst commented Oct 24, 2023

We see that behaviour quite often as well using the grafana-provider, even without Pod restarts.

In our observations that often happens when bursts of resources are synchronized (created). We see some throttling warnings from the API-server (managed AKS).
We have the suspection that some API-Server requests fail, leaving the resources in the state described above.

We have about 500 managed resources for that provider
and that leaves us in a state where we often have to cleanup the resources by delete-recreate them or remove the annotation manually.

@toastwaffle
Copy link
Contributor

@negz @bobh66 With upbound/provider-terraform#231 landed, this is now my team's biggest pain point when using provider-terraform. How do you feel about adding a ReconcilerOption which disables this check? provider-terraform can then expose that option with a flag (something like --disable-create-incomplete-check) for users to opt in to (i.e. the default behaviour is to retain the existing behaviour)

@bobh66
Copy link
Contributor Author

bobh66 commented Feb 6, 2024

@toastwaffle I'm not thrilled about bypassing the annotation but I agree we have to do something if this is going to be a common problem. How often are you seeing this? What is causing the hung Workspace? Is the provider pod restarting? Is the context timing out?

I wonder if provider-terraform should be exempt from this annotation given that it runs terraform plan and can presumably determine the remote state in many (most?) cases and therefore it is more likely to be able to dig itself out of a hole than the other providers. The downside of that is that it hides the root cause of whatever is causing the problem to happen in the first place, which would be a better thing to fix.

@toastwaffle
Copy link
Contributor

It definitely happens when the provider pod is restarted while an apply is running, but I think timeouts might also have the same result given an error is still returned on timeout.

In our specific case, our CI/CD pipeline caused the pod to be restarted frequently - we're using a sidecar to provide credentials for our internal module registry, and every time CD ran it rolled the pod. We disabled CI/CD for provider-terraform, and now the pod is mostly only restarting when we want it to (yay!). That means we are seeing this much less often, but it does make releasing changes and upgrading the provider a little more work than it could be.

IIRC, the context used by crossplane-runtime to update the annotations is cancelled when the pod receives the signal to die, which means it can't apply the external-create-success annotation even with the new graceful shutdown from your PR. We'd have to plumb a separate context through (or use context.Background()) to allow that write to succeed and prevent the check failing.

One alternative model we could use is disabling the check on a per-resource level, rather than globally for the controller. That could be done with an annotation, or an explicit field.

@toastwaffle
Copy link
Contributor

It occurs to me that upbound/provider-terraform#189 could also provide a solution for this. If long-running applies/destroys were performed as separate Jobs, the "external create" which the annotation refers to would actually be creating the job, which should be ~instantaneous and thus less likely to hit this problem if the provider restarts. The provider can keep track of the job in a status field to avoid recreating it.

I think using Jobs in this way would come at the cost of needing to schedule the job and re-initialise the workspace (with no access to a plugin cache), but for long running applies that cost should be negligible in comparison, while also providing scalability benefits. This would need to be controlled by a parameter on the Workspace resource such that smaller/simpler workspaces could still be run in-process without the additional complexity of spawning Jobs.

@bobh66 what do you think of this? If you agree with the idea in principle, I can hopefully persuade my manager to give me some time to develop a proof of concept

@turkenh
Copy link
Member

turkenh commented Jun 6, 2024

We are hitting the same problem in provider-helm and even provider-k8s. Basically, resources are stuck if the create request is interrupted somehow, e.g. a pod restart or api server misbehaving.

Some resources have non-deterministic identifiers - e.g. when you call Create the external system generates an identifier and returns it to you

This doesn't hold for all type of resources. It makes sense for a VPC but not a Bucket for example. Provider Helm for example, always knows the name of the release (as it passes it to the client), so no risk of leaking releases for a given MR.

I'm open to ideas to address this - maybe we make it possible to opt out of this functionality?

I believe opting out for individually managed resources sounds like a decent solution.

@Kidswiss
Copy link

Kidswiss commented Nov 22, 2024

We are hitting the same problem in provider-helm and even provider-k8s. Basically, resources are stuck if the create request is interrupted somehow, e.g. a pod restart or api server misbehaving.

We're having this problem mainly with provider-kubernetes. It happens on rather busy clusters that already have quite some k8s API load. And when a lot of compositions are instantiated that create a lot of resources for the provider. Reducing the --max-reconcile-rate seems to mitigate this.

The create-pending annotation gets set, but adding the created-failed/create-succeeded annotation might not work due to rate-limitting from the kube API or other issues, leaving the manage resources in this state.

Given that kubernetes objects have deterministic names, the usefulness of the creation-pending annotation is somewhat questionable from what I'm getting from @negz explanation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

6 participants