
Cluster with virtual kubelet blocking NEG sync #2508

Open
marwanad opened this issue Mar 22, 2024 · 7 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@marwanad
Member

We have a cluster with some virtual kubelet (VK) nodes; those VK nodes have no provider IDs. After a GKE upgrade, which moved the ingress pods to new hosts, we got the error below on the ingress Service, with the NEGs failing to add any endpoints.

Warning  SyncNetworkEndpointGroupFailed  35m (x10 over 27h)  neg-controller         Failed to sync NEG "k8s1-endpoint-bla" (will not retry): Failed to lookup NEG in zone "", candidate zones map[us-central1-a:{} us-central1-b:{} us-central1-c:{} us-central1-f:{}], err - Key Key{"k8s1-endpoint-bla"} not valid for zonal resource NetworkEndpointGroup k8s1-endpoint-bla 

We tracked this down to the following codepath:

return nil, nil, fmt.Errorf("Failed to lookup NEG in zone %q, candidate zones %v, err - %w", zone, candidateZonesMap, err)

After removing the virtual nodes, the NEGs sync again, but the issue is a bit confusing: how does an empty zone string make it there, given this check?

// getZone extracts the zone component from the node's providerID
// (GCE provider IDs have the form gce://project/zone/instance).
func getZone(node *api_v1.Node) (string, error) {
	if node.Spec.ProviderID == "" {
		return "", fmt.Errorf("%w: node %s does not have providerID", ErrProviderIDNotFound, node.Name)
	}
	matches := providerIDRE.FindStringSubmatch(node.Spec.ProviderID)
	if len(matches) != 4 {
		return "", fmt.Errorf("%w: providerID %q of node %s is not valid", ErrSplitProviderID, node.Spec.ProviderID, node.Name)
	}
	if matches[2] == "" {
		return "", fmt.Errorf("%w: node %s has an empty zone", ErrSplitProviderID, node.Name)
	}
	return matches[2], nil
}

if err != nil {
	logger.Error(err, "Failed to get zone from providerID", "nodeName", n.Name)
	continue // nodes whose zone cannot be determined are skipped
}

GKE version: v1.27.11-gke.1118000
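
As a quick way to spot the offending nodes, you can list everything in the cluster that has no providerID set; that is how the VK nodes here surface. A minimal client-go sketch (kubeconfig handling is an assumption; a kubectl jsonpath filter over .spec.providerID works just as well):

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (assumption: running outside the cluster).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		// Virtual kubelet nodes are the ones with no providerID at all.
		if n.Spec.ProviderID == "" {
			fmt.Printf("node %s has no providerID\n", n.Name)
		}
	}
}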

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 22, 2024
@gauravkghildiyal
Member

After removing the virtual nodes, the NEGs sync again, but the issue is a bit confusing: how does an empty zone string make it there, given this check?

I think GKE 1.27 might not have these changes yet, so the latest code from master here may not be entirely representative of all GKE versions.

/cc @songrx1997
/cc @swetharepakula

@marwanad
Member Author

We've seen another failure mode where the controller would fail to sync IPs and the LB backends would end up with stale endpoints.

@marwanad
Member Author

We've hit the above with 1.28.8-gke.1095000 (although the nodes were on 1.27):

  Warning  SyncNetworkEndpointGroupFailed  33s (x7 over 2m23s)  neg-controller         Failed to sync NEG "k8s1-blaxxxx" (will retry): failed to get current NEG endpoints: Failed to lookup NEG in zone "", candidate zones map[us-central1-a:{} us-central1-b:{} us-central1-c:{} us-central1-f:{}], err - Key Key{"k8s1-blaxxxx"} not valid for zonal resource NetworkEndpointGroup k8s1-blaxxxx

@swetharepakula
Member

The fix is available starting in 1.29.1-gke.1119000. We have just backported it to Ingress 1.26, which will be released to GKE 1.28 in the next few weeks. We will include a release note when we do.

@marwanad
Member Author

marwanad commented May 9, 2024

@swetharepakula seems like upgrading to 1.29 did the trick. I am slightly confused by the comment "Ingress 1.26 which will be released to GKE 1.28 in the next few weeks": what is the current versioning mapping between the release-xx branches and what is running on GKE? I was expecting release-1.28 to be what's on GKE 1.28, but that doesn't seem to be the case.

The README.md used to be updated with this mapping but hasn't been for a long time. Knowing this information would be great for debugging and mitigating issues on our end before we escalate to support.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 7, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 6, 2024