
Unable to remove duplicate or stale instance entries of a service in Consul catalog when Consul Connect inject enabled pod moves from one node to another. Currently running with an agentless setup. #4219

Open
MageshSrinivasulu opened this issue Jul 30, 2024 · 4 comments
Labels
type/bug Something isn't working

Comments

@MageshSrinivasulu

After upgrading Consul from version 1.14.10 to 1.16.6 using an agentless setup, I noticed duplicate entries of the same instances under a service. I am unable to remove them, and one of the entries is orphaned, pointing to a pod that is no longer running in the cluster.

[screenshot: duplicate instance entries for the service in the Consul UI]

How can I resolve this?
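
For reference, this is roughly how I am listing the registered instances outside of the UI. It is a minimal sketch that assumes the Consul HTTP API is reachable at http://localhost:8500; the service name and ACL token below are placeholders.

```python
# Minimal sketch: list every registered instance of a service straight from the
# Consul catalog, so duplicate/stale entries are visible outside the UI.
# Assumptions: the Consul HTTP API is reachable at localhost:8500, and the
# service name / ACL token below are placeholders.
import requests

CONSUL_ADDR = "http://localhost:8500"   # placeholder: wherever the Consul HTTP API is exposed
SERVICE_NAME = "my-service"             # placeholder service name
TOKEN = "my-acl-token"                  # placeholder ACL token (only needed if ACLs are enabled)

resp = requests.get(
    f"{CONSUL_ADDR}/v1/catalog/service/{SERVICE_NAME}",
    headers={"X-Consul-Token": TOKEN},
    timeout=10,
)
resp.raise_for_status()

for inst in resp.json():
    # ServiceAddress is the pod IP for connect-injected workloads in my setup
    print(inst["Node"], inst["ServiceID"], inst["ServiceAddress"], inst["ServicePort"])
```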

@MageshSrinivasulu added the type/bug (Something isn't working) label on Jul 30, 2024
@MageshSrinivasulu (Author) commented Jul 31, 2024

Apart from deleting the node that no longer exists, what helps is scaling the impacted service down to zero and back up again, which removes the duplicate or stale entries. These are only temporary fixes; the issue persists.
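
For completeness, this is roughly what the manual cleanup looks like when I delete a stale registration by hand. It is a minimal sketch against the catalog deregister endpoint; the Consul address, node name, service ID, and token are placeholders taken from whatever the listing above shows as stale.

```python
# Minimal sketch of the manual cleanup: deregister one stale service instance
# via the Consul catalog API. Address, node name, service ID, and token are
# placeholders; omitting "ServiceID" would deregister the whole node instead.
import requests

CONSUL_ADDR = "http://localhost:8500"       # placeholder address
TOKEN = "my-acl-token"                      # placeholder ACL token

payload = {
    "Node": "stale-node-name",              # node the stale entry is registered against
    "ServiceID": "stale-service-id",        # the orphaned instance's service ID
}

resp = requests.put(
    f"{CONSUL_ADDR}/v1/catalog/deregister",
    json=payload,
    headers={"X-Consul-Token": TOKEN},
    timeout=10,
)
resp.raise_for_status()
print("deregistered:", resp.json())         # Consul returns true on success
```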

Below is how I can consistently reproduce the issue:

  1. Pod A is running on node A.
  2. Cordon node A.
  3. Let pod A get rescheduled onto node B.
  4. This leaves two entries for the instance in the Consul catalog: one with the old pod A IP and one with the new pod A IP. The health of the new pod A entry flips between healthy and unhealthy, while the old pod A entry is always unhealthy.

This is crazy. In Kubernetes, a pod can move between nodes at any point in time. When it moves, Consul must deregister the old entry and create a new one.
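
To make the stale entry obvious during the reproduction above, I cross-check the pod IPs Kubernetes currently reports against what Consul has registered. This is only a rough sketch; it assumes the official kubernetes Python client, a local kubeconfig, and placeholder namespace, label selector, service name, address, and token values.

```python
# Minimal sketch: compare the pod IPs Kubernetes currently reports for the
# workload against the addresses registered in the Consul catalog, so the
# stale (old pod A) entry stands out. Namespace, label selector, service
# name, Consul address, and token below are all placeholders.
import requests
from kubernetes import client, config

CONSUL_ADDR = "http://localhost:8500"
TOKEN = "my-acl-token"
SERVICE_NAME = "my-service"
K8S_NAMESPACE = "my-namespace"
LABEL_SELECTOR = "app=my-service"

# Pod IPs Kubernetes knows about right now
config.load_kube_config()
pods = client.CoreV1Api().list_namespaced_pod(K8S_NAMESPACE, label_selector=LABEL_SELECTOR)
live_ips = {p.status.pod_ip for p in pods.items if p.status.pod_ip}

# Instances Consul has registered, with their health checks
resp = requests.get(
    f"{CONSUL_ADDR}/v1/health/service/{SERVICE_NAME}",
    headers={"X-Consul-Token": TOKEN},
    timeout=10,
)
resp.raise_for_status()

for entry in resp.json():
    addr = entry["Service"]["Address"] or entry["Node"]["Address"]
    statuses = sorted({c["Status"] for c in entry["Checks"]})
    stale = "STALE" if addr not in live_ips else "live"
    print(f"{entry['Service']['ID']}  {addr}  checks={statuses}  {stale}")
```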

@MageshSrinivasulu changed the title from "Unable to remove the duplicate instance entries of a service in consul catalog after upgrading consul with agentless setup" to "Unable to remove duplicate or stale instance entries of a service in Consul catalog when Consul Connect inject enabled pod moves from one node to another. Currently running with an agentless setup." on Jul 31, 2024
@MageshSrinivasulu (Author) commented Jul 31, 2024

@david-yu @blake Can you please guide me on this?

@MageshSrinivasulu (Author)

Below is what I found in the connect-inject pod logs. I have masked the actual service name:

2024-08-01T01:06:32.083Z ERROR controller.endpoints failed to deregister endpoints {"name": "SERVICE", "ns": "NAMESPACE", "error": "2 errors occurred:\n\t* failed to update service health status for pod NAMESPACE/POD to critical: Unexpected response code: 500 (rpc error making call: Unknown service ID 'SERVICE ID' for check ID 'NAMESPACE/SERVICE ID')\n\t* failed to update service health status for pod NAMESPACE/POD to critical: Unexpected response code: 500 (rpc error making call: Unknown service ID 'SERVICE ID-sidecar-proxy' for check ID 'NAMESPACE/SERVICE ID-sidecar-proxy')\n\n"}
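
To compare the service IDs in that error against what Consul actually still has checks registered for, I dump the checks attached to the service. Again a minimal sketch with placeholder address, service name, and token.

```python
# Minimal sketch: list the health checks Consul has registered for the service,
# to compare their ServiceIDs against the "Unknown service ID" values in the
# endpoints-controller error above. Address, service name, and token are placeholders.
import requests

CONSUL_ADDR = "http://localhost:8500"
SERVICE_NAME = "my-service"
TOKEN = "my-acl-token"

resp = requests.get(
    f"{CONSUL_ADDR}/v1/health/checks/{SERVICE_NAME}",
    headers={"X-Consul-Token": TOKEN},
    timeout=10,
)
resp.raise_for_status()

for check in resp.json():
    print(check["Node"], check["ServiceID"], check["CheckID"], check["Status"])
```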

@MageshSrinivasulu (Author) commented Aug 6, 2024

This is how I found a working version of Consul when trying to upgrade from 1.14.10 to 1.16.6.

The nearest working version is 1.15.9. All the versions from 1.15.10 to 1.16.6 have one issue or another; they are not stable and the results are not consistent.

I was able to deploy the agentless feature of Consul with 1.15.9 and didn't observe any major issues.

[screenshot]

The issue linked below is predominant in the 1.16 release:

hashicorp/consul#19717

I kindly request that some attention be given to this issue.
