Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SHIP-0039: Build Scheduler Options #205

Merged
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
282 changes: 282 additions & 0 deletions ships/0039-build-scheduler-opts.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,282 @@
<!--
Copyright The Shipwright Contributors

SPDX-License-Identifier: Apache-2.0
-->

---
title: build-scheduler-options
authors:
- "@adambkaplan"
reviewers:
- "@apoorvajagtap"
- "@HeavyWombat"
approvers:
- "@qu1queee"
- "@SaschaSchwarze0"
creation-date: 2024-05-15
last-updated: 2024-06-20
status: Implementable
see-also: []
replaces: []
superseded-by: []
---

# Build Scheduler Options

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [x] Design details are appropriately documented from clear requirements
- [x] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [docs](/docs/)

## Open Questions [optional]

- Should this be enabled always? Should we consider an alpha -> beta lifecycle for this feature? (ex: off by default -> on by default)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We never did that on a per-feature base so far. I know Tekton is doing that. I am not really in support of this. One groups arbitrary independent features into a group (called alpha) that one can either all enable or not. I like more the way Knative does it (https://knative.dev/docs/serving/configuration/feature-flags) on a per-feature base. Though, we never did that in Shipwright so far.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that I prefer knative-style individual feature flags to "all or nothing." Is this worth a separate SHIP, or just discussing/implementing as a feature within builds?


## Summary

Add API options that influece where `BuildRun` pods are scheduled on Kubernetes. This can be
acomplished through the following mechanisms:

- [Node Selectors](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector)
- [Taints and Tolerations](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/)
- [Custom Schedulers](https://kubernetes.io/docs/tasks/extend-kubernetes/configure-multiple-schedulers/)

## Motivation

Today, `BuildRun` pods will run on arbitrary nodes - developers, platform engineers, and admins do
not have the ability to control where a specific build pod will be scheduled. Teams may have
several motivations for controlling where a build pod is scheduled:

- Builds can be CPU/memory/storage intensive. Scheduling on larger worker nodes with additional
memory or compute can help ensure builds succeed.
- Clusters may have mutiple worker node architectures and even OS (Windows nodes). Container images
adambkaplan marked this conversation as resolved.
Show resolved Hide resolved
are by their nature specific to the OS and CPU architecture, and default to the host operating
system and architecture. Builds may need to specify OS and architecture through node selectors.
- The default Kubernetes scheduler may not efficiently schedule build workloads - especially
considering how Tekton implements step containers and sidecars. A custom scheduler optimized for
Tekton or other batch workloads may lead to better cluster utulization.

### Goals

- Allow build pods to run on specific nodes using node selectors.
- Allow build pods to tolerate node taints.
- Allow build pods to use a custom scheduler.

### Non-Goals

- Primary feature support for multi-arch builds.
- Allow node selection, pod affinity, and taint toleration to be set at the cluster level.
While this may be desirable, it requires a more sophisticated means of configuring the build
controller. Setting default values for scheduling options can be considered as a follow-up
feature.
- Prevent use of build pod scheduling fields. This is best left to an admission controller like
[OPA Gatekeeper](https://www.openpolicyagent.org/docs/latest/kubernetes-introduction/) or
[Kyverno](https://kyverno.io/).
- Allow build pods to set node affinity/anti-affinity rules. Affinity/anti-affinity is an
incredibly rich and complex API (see [docs](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity)
for more information). We should strive to provide a simpler interface that is tailored
specifically to builds. This feature is being dropped to narrow the scope of this SHIP. Build
affinity rules can/should be addressed in a follow up feature.

## Proposal

### User Stories

#### Node Selection - platform engineer

As a platform engineer, I want builds to use node selectors to ensure they are scheduled on nodes
optimized for builds so that builds are more likely to succeed

#### Node Selection - arch-specific images

As a developer, I want to select the OS and architecture of my build's node so that I can run
builds on worker nodes with multiple architectures.

#### Taint toleration - cluster admin

As a cluster admin, I want builds to be able to tolerate provided node taints so that they can
be scheduled on nodes that are not suitable/designated for application workloads.

#### Custom Scheduler

As a platform engineer/cluster admin, I want builds to use a custom scheduler so that I can provide
my own scheduler that is optimized for my build workloads.

### Implementation Notes

#### API Updates
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would not it be also very valid that the build strategy author defines those. It knows which platforms are supported for by the build tool(s) that are used in the strategy. A first-class property "supported architectures" on the strategy would be very interesting for this also thinking forward towards multi-arch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this worth considering in the SHIP that (should) come out of our discussion in shipwright-io/build#1119. I'm less inclined to add nodeSelector and related APIs to the build strategy. As you identified (and I noted as a drawback) - these scheduler APIs leak the internals of the cluster. Adding these APIs to the build strategy could make it harder to share strategies across clusters.


The `BuildSpec` API for Build and BuildRun will be updated to add the following fields:

```yaml
spec:
...
nodeSelector: # map[string]string
<node-label>: "label-value"
tolerations: # []Toleration
- key: "taint-key"
operator: Exists|Equal
value: "taint-value"
schedulerName: "custom-scheduler-name" # string
```

The `nodeSelector` and `schedulerName` fields will use golang primitives that match their k8s
equivalents.

#### Tolerations

The Tolerations API for Shipwright will support a limited subset of the upstream Kubernetes
Tolerations API. For simplicity, any Shipwright Build or BuildRun with a toleration set will use
the `NoSchedule` [taint effect](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/).

```yaml
spec:
tolerations: # Optional array
- key: "taint-key" # Aligns with upstream k8s taint labels. Required
operator: Exists|Equal # Aligns with upstream k8s - key exists or node label key = value. Required
value: "taint-value" # Alights with upstream k8s taint value. Optional.
```

As with upstream k8s, the Shipwright Tolerations API array should support
[strategic merge JSON patching](https://kubernetes.io/docs/tasks/manage-kubernetes-objects/update-api-object-kubectl-patch/#notes-on-the-strategic-merge-patch).

#### Precedence Ordering and Value Merging

Values in `BuildRun` will override those in the referenced `Build` object (if present). Values for
`nodeSelector` and `tolerations` should use strategic merge logic when possible:

- `nodeSelector` merges using map keys. If the map key is present in the `Build` and `BuildRun`
object, the `BuildRun` overrides the value.
- `tolerations` merges using the taint key. If the taint key is present in the `Build` and
`BuildRun` object, the `BuildRun` overrides the value.

This allows the `BuildRun` object to "inherit" values from a parent `Build` object.

#### Impact on Tekton TaskRun

Tekton supports tuning the pod of the `TaskRun` using the
[podTemplate](https://tekton.dev/docs/pipelines/taskruns/#specifying-a-pod-template) field. When
Shipwright creates the `TaskRun` for a build, the respective node selector, tolerations, and
scheduler name can be passed through.

#### Command Line Enhancements

The `shp` CLI _may_ be enhanced to add flags that set the node selector, tolerations, and custom
scheduler for a `BuildRun`. For example, `shp build run` can have the following new options:

- `--node=<key>=<value>`: Use the node label key/value pair in the selector. Can be set more than
once for multiple key/value pairs..
- `--tolerate=<key>` or `--tolerate=<key>=<value>`: Tolerate the taint key, in one of two ways:
- First form: taint key `Exists`.
- Second form: taint key `Equals` provided value.
- In both cases, this flag can be set more than once.
- `--scheduler=<name>`: use custom scheduler with given name. Can only be set once.


#### Hardening Guidelines

Exposing `nodeSelector` and `tolerations` to end developers adds risk with respect to overall
system availability. Some platform teams may not want these Kubernetes internals exposed or
modifiable by end developers at all. To address these concerns, a hardening guideline for
Shipwright Builds should also be published alongside documentation for this feature. This guideline
should recommend the use of third party admission controllers (ex: OPA, Kyverno) to prevent builds
from using values that impact system availability and performance. For example:

- Block toleration of `node.kubernetes.io/*` taints. These are reserved for nodes that are not
ready to receive workloads for scheduling.
- Block node selectors with the `node-role.kubernetes.io/control-plane` label key. This is reserved
for control plane components (`kube-apiserver`, `kube-controller-manager`, etc.)
- Block toleration of the `node-role.kubernetes.io/control-plane` taint key. Same as above.

See the [well known labels](https://kubernetes.io/docs/reference/labels-annotations-taints/#node-role-kubernetes-io-control-plane)
documentation for more information.

### Test Plan

- Unit testing can verify that the generated `TaskRun` object for a build contains the desired pod
template fields.
- End to end tests using `KinD` is possible for the `nodeSelector` and `tolerations` fields:
- KinD has support for configuring multiple [nodes](https://kind.sigs.k8s.io/docs/user/configuration/#nodes)
- Once set up, KinD nodes can simulate real nodes when it comes to pod scheduling, node labeling,
and node taints.
- End to end testing for the `schedulerName` field requires the deployment of a custom scheduler,
plus code to verify that the given scheduler was used. This is non-trivial (see
[upstream example](https://kubernetes.io/docs/tasks/extend-kubernetes/configure-multiple-schedulers/#specify-schedulers-for-pods))
and adds a potential failure point to the test suite. Relying on unit testing alone is our best
option.


### Release Criteria

TBD

**Note:** *Section not required until targeted at a release.*

#### Removing a deprecated feature [if necessary]

Not applicable.

#### Upgrade Strategy [if necessary]

The top-level API fields will be optional and default to Golang empty values.
On upgrade, these values will remain empty on existing `Build`/`BuildRun` objects.


### Risks and Mitigations

**Risk:** Node selector field allows disruptive workloads (builds) to be scheduled on control plane
nodes.

*Mitigation*: Hardening guideline added as a requirement for this feature. There may be some
cluster topologies (ex: single node clusters) where scheduling builds on the "control plane" is not
only desirable, but necessary. Hardening guidelines referencing third party admission controllers
preserves flexibility while giving cluster administrators/platform teams the knowledge needed to
harden their environments as they see fit.


## Drawbacks

Exposing these fields leaks - to a certain extent - our abstraction over Kubernetes. This proposal
places k8s pod scheduling fields up front in the API for `Build` and `BuildRun`, a deviation from
Tekton which exposes the fields through a `PodTemplate` sub-field. Cluster administrators may not
want end developers to have control over where these pods are scheduled - they may instead wish to
control pod scheduling through Tekton's
[default pod template](https://github.com/tektoncd/pipeline/blob/main/docs/podtemplates.md#supported-fields)
mechanism at the controller level.

Exposing `nodeSelector` may also conflict with future enhancements to support
[multi-architecture image builds](https://github.com/shipwright-io/build/issues/1119). A
hypothetical build that fans out individual image builds to nodes with desired OS/architecture
pairs may need to explicitly set the `k8s.io/os` and `k8s.io/architecture` node selector fields on
generated `TaskRuns`. With that said, there is currently no mechanism for Shipwright to control
where builds execute on clusters with multiple worker node architectures and operating systems.


## Alternatives

An earlier draft of this proposal included `affinity` for setting pod affinity/anti-affinity rules.
This was rejected due to the complexities of Kubernetes pod affinity and anti-affinity. We need
more concrete user stories from the community to understand what - if anything - we should do with
respect to distributing build workloads through affinity rules. This may also conflict with
Tekton's [affinity assistant](https://tekton.dev/docs/pipelines/affinityassistants/) feature - an optional configuration that is enabled by default in upstream Tekton.

An earlier draft also included the ability to set default values for these fields at the cluster
level. This would be similar to Tekton's capability with the Pipeline controller configuration.
Since this option is available at the Tekton pipeline level, adding nearly identical features to
Shipwright is being deferred. Tuning pod template values with the Tekton pipeline controller may
also be an acceptable alternative to this feature in some circumstances.


## Infrastructure Needed [optional]

No additional infrastructure antipated.
Test KinD clusters may need to deploy with additional nodes where these features can be verified.

## Implementation History

- 2024-05-15: Created as `provisional`
- 2024-06-20: Draft updated to `implementable`