
Agents with toleration can't scale up nodes on tainted node pool in GKE #1291

Open
charlesmulder opened this issue Jan 21, 2025 · 0 comments
Labels: bug (Something isn't working)

charlesmulder commented Jan 21, 2025

Describe the bug

I have tainted an autoscaling GKE node pool with the taint cloud.google.com/gke-nodepool=jenkins-agent-pool, but agents that carry a matching toleration cannot be spun up because of the taint.

If I remove the taint from the node pool, agents are spun up.
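For context, a pool tainted this way would typically be created along the following lines (a hedged sketch; the cluster name, node counts, and the NoSchedule effect are assumptions inferred from the toleration in values.yaml below):

```bash
# Hypothetical recreation of the tainted, autoscaling agent pool.
# "my-cluster" and the node counts are placeholders; the NoSchedule
# effect is inferred from the toleration configured for the agents.
gcloud container node-pools create jenkins-agent-pool \
  --cluster my-cluster \
  --node-taints cloud.google.com/gke-nodepool=jenkins-agent-pool:NoSchedule \
  --enable-autoscaling --min-nodes 0 --max-nodes 3
```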

The example in the chart's values.yaml shows how to add a toleration to an agent using yamlTemplate, which I have implemented as follows:

agent:
  kubernetesConnectTimeout: 30
  kubernetesReadTimeout: 30
  image: 
    repository: jenkins/inbound-agent
    tag: "alpine-jdk21"
  #idleMinutes: 5
  #websocket: true
  alwaysPullImage: true
  showRawYaml: true
  resources:
    requests: 
      cpu: 600m
      memory: 2G
    limits: 
      cpu: 940m
      memory: 2.75G
  podName: jenkins-agent
  connectTimeout: 300
  yamlTemplate: |-
    apiVersion: v1
    kind: Pod
    spec:
      tolerations:
      - key: cloud.google.com/gke-nodepool
        operator: Equal
        value: jenkins-agent-pool
        effect: NoSchedule
  yamlMergeStrategy: "merge"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-nodepool
            operator: In
            values:
            - jenkins-agent-pool
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"

Version of Helm and Kubernetes

- Helm: version.BuildInfo{Version:"v3.13", GitCommit:"", GitTreeState:"", GoVersion:"go1.23.1"}
- Kubernetes: 1.30.5-gke.1713000

Chart version

jenkins/jenkins 5.8.2

What happened?

1. Create a tainted, autoscaling agent node pool on GKE
2. Add the agent toleration to values.yaml as per the example above
3. Run a Jenkins job
4. The job log reports that Jenkins can't scale up nodes due to the taint (see the sketch below for inspecting the pending pod)
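A quick way to see the scheduler's and autoscaler's view of the stuck agent is to inspect the pending pod's events (the pod name and the jenkins namespace below are placeholders):

```bash
# Describe the pending agent pod to see FailedScheduling /
# "pod didn't trigger scale-up"-style events (names are placeholders):
kubectl -n jenkins describe pod jenkins-agent-xxxx

# Or list recent events for the whole namespace, newest last:
kubectl -n jenkins get events --sort-by=.lastTimestamp
```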

What you expected to happen?

A node to be created for the agent.

How to reproduce it

Deploy the chart with the same agent values.yaml excerpt shown in the bug description above, then run any job.
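To narrow down whether the problem is on the pod side or the node pool side, one could check both (a sketch; the pod, namespace, and cluster names are placeholders):

```bash
# Did the toleration from yamlTemplate actually reach the agent pod spec?
kubectl -n jenkins get pod jenkins-agent-xxxx \
  -o jsonpath='{.spec.tolerations}'

# Is the taint part of the node pool config, i.e. what the autoscaler's
# node template sees when simulating a scale-up?
gcloud container node-pools describe jenkins-agent-pool \
  --cluster my-cluster --format='value(config.taints)'
```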

Anything else we need to know?

I searched the kubernetes-plugin issues and found a potentially relevant one, though it concerns the controller rather than the agent: "Tolerations are not getting overwritten via 'Raw YAML for the Pod'".

charlesmulder added the bug label on Jan 21, 2025