This guide walks you through implementing sharding for your own controller. As a prerequisite, the sharding components must be installed in the cluster, see Install the Sharding Components.
After installing the sharding components, you can go ahead and configure a `ClusterRing` object for your controller.
For all controllers that you want to shard, configure the controller's main resource and the controlled resources in `ClusterRing.spec.resources`.
As an example, let's consider a subset of kube-controller-manager's controllers: `Deployment` and `ReplicaSet`.
- The `Deployment` controller reconciles the `deployments` resource and controls `replicasets`.
- The `ReplicaSet` controller reconciles the `replicasets` resource and controls `pods`.
The corresponding `ClusterRing` for the `Deployment` controller would need to be configured like this:
```yaml
apiVersion: sharding.timebertt.dev/v1alpha1
kind: ClusterRing
metadata:
  name: kube-controller-manager-deployment
spec:
  resources:
  - group: apps
    resource: deployments
    controlledResources:
    - group: apps
      resource: replicasets
```
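For illustration, a corresponding `ClusterRing` for the `ReplicaSet` controller could look like the following sketch (the object name is arbitrary; `pods` belong to the core API group, so `group` is left empty):

```yaml
apiVersion: sharding.timebertt.dev/v1alpha1
kind: ClusterRing
metadata:
  name: kube-controller-manager-replicaset
spec:
  resources:
  - group: apps
    resource: replicasets
    controlledResources:
    - group: "" # pods belong to the core API group
      resource: pods
```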
To allow the sharder to reassign the sharded objects during rebalancing, we need to grant it the corresponding permissions.
These permissions need to be granted explicitly for the resources configured in the `ClusterRing`.
Otherwise, the sharder would effectively require `cluster-admin` access.
For the `Deployment` example above, we would use these RBAC manifests:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sharding:clusterring:kube-controller-manager
rules:
- apiGroups:
  - apps
  resources:
  - deployments
  - replicasets
  verbs:
  - list
  - patch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: sharding:clusterring:kube-controller-manager
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: sharding:clusterring:kube-controller-manager
subjects:
- kind: ServiceAccount
  name: sharder
  namespace: sharding-system
```
To support sharding in your Kubernetes controller, only three aspects need to be implemented:

- announce ring membership and shard health: maintain individual shard `Leases` instead of performing leader election on a single `Lease`
- only watch, cache, and reconcile objects assigned to the respective shard: add a shard-specific label selector to watches
- acknowledge object movements during rebalancing: remove the shard and drain labels when the drain label is set and stop reconciling the object
`pkg/shard` contains reusable reference implementations for these aspects.
`cmd/shard` serves as an example shard implementation that shows how to put the pieces together in controllers based on controller-runtime.
However, sharding can also be implemented in controllers that don't use controller-runtime or that are written in a programming language other than Go.

The following sections outline the exact requirements that a sharded controller needs to fulfill and then show how to implement them in controllers based on controller-runtime.
Don't be scared by the long descriptions: implementing these aspects is simple (especially when reusing the helpers designed for controller-runtime controllers) and only needs to be done once.
The detailed descriptions simply ensure the requirements are clear if you need to implement them yourself.
In short: ensure your shard maintains a `Lease` object like the following and only runs its controllers as long as it holds the `Lease`:
```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  labels:
    alpha.sharding.timebertt.dev/clusterring: my-clusterring
  name: my-operator-565df55f4b-5vwpj
  namespace: operator-system
spec:
  holderIdentity: my-operator-565df55f4b-5vwpj # needs to equal the Lease's name
  leaseDurationSeconds: 15 # pick whatever you would use for leader election as well
```
Most controllers already perform leader election using a central `Lease` lock object.
Only the instance that is elected as the leader is allowed to run the controllers.
If it fails to renew the `Lease` in time, another instance is allowed to acquire the `Lease` and run the controllers.
Hence, an instance must not run any controllers once it loses its `Lease`.
In fact, for safety, most implementations exit the entire process when they fail to renew the lock.
On graceful termination (e.g., during a rolling update), the active leader may release the lock by setting the `holderIdentity` field of the `Lease` to the empty string.
This allows another instance to acquire the `Lease` immediately without waiting for it to expire, which enables quick leadership handovers.
The same mechanisms apply to sharded controllers.
However, instead of using a central `Lease` object for all instances, each instance acquires and maintains its own `Lease` object to announce itself to the sharder.
A shard may only run its controllers as long as it holds its shard `Lease`.
I.e., when it fails to renew the shard `Lease` in time, it also needs to stop all controllers.
Similar to usual leader election, a shard may release its shard `Lease` on graceful termination by removing the `holderIdentity`.
This immediately triggers reassignments by the sharder to minimize the duration in which no shard is acting on a subset of objects.

In essence, all the existing machinery for leader election can be reused for maintaining the shard `Lease` – with two minor changes.
First, the shard `Lease` needs to be labelled with `alpha.sharding.timebertt.dev/clusterring=<clusterring-name>` to specify which `ClusterRing` the shard belongs to.
Second, the name of the shard `Lease` needs to match its `holderIdentity`.
By default, the instance's hostname is used for both values.
If the `holderIdentity` differs from the name, the sharder assumes that the shard is unavailable.
In controller-runtime, you can configure your shard to maintain its shard `Lease` as follows:
```go
package main

import (
	shardlease "github.com/timebertt/kubernetes-controller-sharding/pkg/shard/lease"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func run() error {
	restConfig := config.GetConfigOrDie()

	shardLease, err := shardlease.NewResourceLock(restConfig, nil, shardlease.Options{
		ClusterRingName: "my-clusterring",
	})
	if err != nil {
		return err
	}

	mgr, err := manager.New(restConfig, manager.Options{
		// SHARD LEASE
		// Use manager's leader election mechanism for maintaining the shard lease.
		// With this, controllers will only run as long as manager holds the shard lease.
		// After graceful termination, the shard lease will be released.
		LeaderElection:                      true,
		LeaderElectionResourceLockInterface: shardLease,
		LeaderElectionReleaseOnCancel:       true,

		// other options ...
	})
	if err != nil {
		return err
	}

	// add controllers and start manager as usual ...

	return nil
}
```
In short: use the following label selector on watches for all sharded resources listed in the `ClusterRing`:

```
shard.alpha.sharding.timebertt.dev/clusterring-50d858e0-example: my-operator-565df55f4b-5vwpj
```
The sharder assigns all sharded objects by adding a shard label that is specific to the `ClusterRing` (resources can be part of multiple `ClusterRings`).
The shard label's key consists of the `shard.alpha.sharding.timebertt.dev/clusterring-` prefix, followed by the first 8 hex characters of the SHA256 checksum of the `ClusterRing` name, followed by a `-`, followed by the `ClusterRing` name itself.
The key part after the `/` is shortened to 63 characters so that it is a valid label key.
The checksum is included in the label key to derive unique label keys even for `ClusterRings` with long names that would otherwise make the part after the `/` exceed the 63-character limit.
The shard label's value is the name of the shard, i.e., the name of the shard `Lease` and the shard `Lease`'s `holderIdentity`.
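If your controller is not based on controller-runtime or not written in Go, you can derive the label key yourself.
The following minimal sketch illustrates the derivation described above; the `shardLabelKey` helper is hypothetical and only mirrors the documented pattern (in Go, prefer the library's helpers shown further below):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// shardLabelKey derives the ClusterRing-specific shard label key:
// the "clusterring-" prefix part, the first 8 hex characters of the SHA256
// checksum of the ClusterRing name, a "-", and the ClusterRing name itself,
// with the part after "/" truncated to 63 characters.
func shardLabelKey(clusterRingName string) string {
	checksum := sha256.Sum256([]byte(clusterRingName))
	suffix := "clusterring-" + hex.EncodeToString(checksum[:])[:8] + "-" + clusterRingName

	// ensure the part after "/" is a valid label key (at most 63 characters)
	if len(suffix) > 63 {
		suffix = suffix[:63]
	}

	return "shard.alpha.sharding.timebertt.dev/" + suffix
}

func main() {
	// prints: shard.alpha.sharding.timebertt.dev/clusterring-50d858e0-example
	fmt.Println(shardLabelKey("example"))
}
```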
Once you have determined the shard label key for your `ClusterRing`, use it as a selector on all watches that your controller starts for any of the sharded resources.
With this, the shard only caches the objects assigned to it and the controllers only reconcile these objects.
Note that when you use a label or field selector on a watch connection and the label or field changes so that the selector no longer matches, the API server emits a `DELETED` watch event.
In controller-runtime, you can configure your shard to only watch and reconcile assigned objects as follows.
This snippet works with controller-runtime v0.16 and v0.17; other versions might require a deviating configuration.
```go
package main

import (
	shardingv1alpha1 "github.com/timebertt/kubernetes-controller-sharding/pkg/apis/sharding/v1alpha1"
	"k8s.io/apimachinery/pkg/labels"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func run() error {
	// ...

	mgr, err := manager.New(restConfig, manager.Options{
		// FILTERED WATCH CACHE
		Cache: cache.Options{
			// Configure cache to only watch objects that are assigned to this shard.
			// This shard only watches sharded objects, so we can configure the label selector on the cache's global level.
			// If your shard watches sharded objects as well as non-sharded objects, use cache.Options.ByObject to configure
			// the label selector on object level.
			DefaultLabelSelector: labels.SelectorFromSet(labels.Set{
				shardingv1alpha1.LabelShard(shardingv1alpha1.KindClusterRing, "", "my-clusterring"): shardLease.Identity(),
			}),
		},

		// other options ...
	})
	// ...
}
```
In short: ensure your sharded controller acknowledges drain operations.
When the sharder adds a drain label like the following, the controller needs to remove both the shard and the drain label and stop reconciling the object.

```
drain.alpha.sharding.timebertt.dev/clusterring-50d858e0-example
```
When the sharder needs to move an object from an available shard to another shard for rebalancing, it first adds the drain label to instruct the currently responsible shard to stop reconciling the object.
The shard needs to acknowledge this operation, as the sharder must prevent concurrent reconciliations of the same object in multiple shards.
The drain label's key is specific to the `ClusterRing` and follows the same pattern as the shard label (see above).
The drain label's value is irrelevant; only the presence of the label matters.
Apart from changing the controller's business logic to first check the drain label, also ensure that the watch event filtering logic (predicates) always reacts to events with the drain label set, independent of the controller's actual predicates.
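Conceptually, the drain handling a shard needs to perform looks roughly like the following sketch of a controller-runtime reconciler.
The `ExampleReconciler` type and its `ShardLabelKey`/`DrainLabelKey` fields are hypothetical stand-ins for your own reconciler and the label keys described above; the helpers shown below implement this logic for you:

```go
package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// ExampleReconciler is a hypothetical reconciler for ConfigMaps in a sharded setup.
type ExampleReconciler struct {
	Client        client.Client
	ShardLabelKey string // ClusterRing-specific shard label key
	DrainLabelKey string // ClusterRing-specific drain label key
}

func (r *ExampleReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	obj := &corev1.ConfigMap{}
	if err := r.Client.Get(ctx, req.NamespacedName, obj); err != nil {
		return reconcile.Result{}, client.IgnoreNotFound(err)
	}

	if _, drain := obj.Labels[r.DrainLabelKey]; drain {
		// Acknowledge the drain operation: remove both the drain and the shard label
		// so that the sharder can assign the object to another shard, and stop
		// reconciling the object.
		patch := client.MergeFrom(obj.DeepCopy())
		delete(obj.Labels, r.DrainLabelKey)
		delete(obj.Labels, r.ShardLabelKey)
		return reconcile.Result{}, r.Client.Patch(ctx, obj, patch)
	}

	// ... actual reconciliation logic ...
	return reconcile.Result{}, nil
}
```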
In controller-runtime, you can reuse the helpers for constructing correct predicates and a wrapping reconciler that correctly implements the drain operation as follows:
```go
package controller

import (
	shardcontroller "github.com/timebertt/kubernetes-controller-sharding/pkg/shard/controller"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// AddToManager adds a controller to the manager.
// shardName must match the shard lease's name/identity.
func (r *Reconciler) AddToManager(mgr manager.Manager, clusterRingName, shardName string) error {
	// ACKNOWLEDGE DRAIN OPERATIONS
	// Use the shardcontroller package as helpers for:
	// - a predicate that triggers when the drain label is present (even if the actual predicates don't trigger)
	// - wrapping the actual reconciler in a reconciler that handles the drain operation for us
	return builder.ControllerManagedBy(mgr).
		Named("example").
		For(&corev1.ConfigMap{}, builder.WithPredicates(shardcontroller.Predicate(clusterRingName, shardName, MyConfigMapPredicate()))).
		Owns(&corev1.Secret{}, builder.WithPredicates(MySecretPredicate())).
		Complete(
			shardcontroller.NewShardedReconciler(mgr).
				For(&corev1.ConfigMap{}). // must match the kind in For() above
				InClusterRing(clusterRingName).
				WithShardName(shardName).
				MustBuild(r),
		)
}
```