Create v2 of node distribution standard (issues/#494) #524

Merged: 12 commits, Jun 17, 2024
29 changes: 29 additions & 0 deletions Standards/scs-0214-v1-k8s-node-distribution.md
@@ -80,6 +80,34 @@ If the standard is used by a provider, the following decisions are binding and v…
can also be scaled vertically first before scaling horizontally.
- Worker node distribution MUST be indicated to the user through some kind of labeling
in order to enable (anti)-affinity for workloads over "failure zones".
- To provide metadata about the node distribution, which also enables testing of this standard,
providers MUST label their K8s nodes with the labels listed below (an illustrative sketch
follows the list).
- `topology.kubernetes.io/zone`

Corresponds to the label described in the [K8s labels documentation][k8s-labels-docs].
It identifies a logical failure zone on the provider side, e.g. a server rack on the
same electrical circuit or multiple machines connected to the internet through a single
network path. Exactly how a zone is defined is up to the provider.
In most cases the field is populated automatically, either by the kubelet or by external
mechanisms such as the cloud controller manager.

- `topology.kubernetes.io/region`

Corresponds to the label described in the [K8s labels documentation][k8s-labels-docs].
It describes the combination of one or more failure zones into a region or domain,
i.e. a larger logical failure unit. An example would be a building housing the racks
that make up several zones, since all of them fail if, for example, the power for the
building is cut. Again, exactly how a region is defined is up to the provider.
In most cases the field is populated automatically, either by the kubelet or by external
mechanisms such as the cloud controller manager.

- `topology.scs.community/host-id`

This is an SCS-specific label; it MUST contain the hostID of the physical machine running
the hypervisor (NOT: the hostID of a virtual machine). Here, the hostID is an arbitrary identifier,
which need not contain the actual hostname, but it should nonetheless be unique to the host.
This helps identify the distribution over underlying physical machines,
which would be masked if VM hostIDs were used.
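
As an illustration (not part of the normative text), the following minimal Python sketch
lists these labels for every node using the official Kubernetes Python client. It assumes
cluster access via a local kubeconfig; any label values are provider-specific.

```python
# Illustrative sketch only: print the distribution-related labels of all nodes.
from kubernetes import client, config

REQUIRED_LABELS = (
    "topology.kubernetes.io/zone",
    "topology.kubernetes.io/region",
    "topology.scs.community/host-id",
)

def main() -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    for node in client.CoreV1Api().list_node().items:
        labels = node.metadata.labels or {}
        print(node.metadata.name)
        for key in REQUIRED_LABELS:
            # A missing label would be a conformance issue under this standard.
            print(f"  {key} = {labels.get(key, '<missing>')}")

if __name__ == "__main__":
    main()
```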

## Conformance Tests

@@ -92,3 +120,4 @@ It also produces warnings and informational outputs, if e.g. labels don't seem t…
[k8s-ha]: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/
[k8s-large-clusters]: https://kubernetes.io/docs/setup/best-practices/cluster-large/
[scs-0213-v1]: https://github.com/SovereignCloudStack/standards/blob/main/Standards/scs-0213-v1-k8s-nodes-anti-affinity.md
[k8s-labels-docs]: https://kubernetes.io/docs/reference/labels-annotations-taints/#topologykubernetesiozone
129 changes: 129 additions & 0 deletions Standards/scs-0214-v2-k8s-node-distribution.md
@@ -0,0 +1,129 @@
---
title: Kubernetes Node Distribution and Availability
type: Standard
status: Draft
replaces: scs-0214-v1-k8s-node-distribution.md
track: KaaS
---

## Introduction

A Kubernetes instance is provided as a cluster, which consists of a set of machines,
so-called nodes. A cluster is composed of a control plane and at least one worker node.
The control plane manages the worker nodes and therefore the pods in the cluster by making
decisions about scheduling, event detection and rights management. Inside the control plane,
multiple components exist, which can be duplicated and distributed over multiple nodes
inside the cluster. Typically, no user workloads are run on these nodes, in order to
separate the control plane components from user workloads, which could otherwise pose a security risk.

### Glossary

The following terms are used throughout this document:

| Term | Meaning |
|---------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Worker | Virtual or bare-metal machine that hosts customer workloads. |
| Control Plane | Virtual or bare-metal machine that hosts the container orchestration layer, which exposes the API and interfaces to define, deploy, and manage the lifecycle of containers. |
| Machine | Virtual or bare-metal entity with computational capabilities. |

## Motivation

In normal day-to-day operation, operational failures are not unusual, whether due to
hardware wear and tear, software misconfiguration, external problems or user error.
Whatever the source of such an outage, it always means downtime for operations and users
and possibly even data loss.
Therefore, a Kubernetes cluster in a productive environment should be distributed over
multiple "failure zones" in order to provide fault-tolerance and high availability.
This is especially important for the control plane of the cluster, since it contains the
state of the whole cluster. A failure of this component could mean an unrecoverable failure
of the whole cluster.

## Design Considerations

Most design considerations of this standard follow the previously written Decision Record
[Kubernetes Nodes Anti Affinity][scs-0213-v1] as well as the Kubernetes documents about
[High Availability][k8s-ha] and [Best practices for large clusters][k8s-large-clusters].

SCS prefers distributed, highly available systems because of their obvious advantages,
such as fault tolerance and data redundancy. However, SCS also recognizes the costs and
overhead this effort imposes on providers, since the infrastructure must include hardware
that is used solely to provide failover capacity or duplication.

The document [Best practices for large clusters][k8s-large-clusters] describes the concept of a failure zone.
The term isn't defined any further, but in this context it can be described as a set of
physical (computing) machines in such proximity to each other (through physical or logical
interconnection of some kind) that specific problems inside this zone would put all of
these machines at risk of failure or shutdown. Important data or services should therefore
not reside in only one failure zone.
How such a failure zone should be defined depends on the risk model of the service/data
and its owner as well as on the capabilities of the provider. Zones could range from
single machines or racks up to whole datacenters or even regions, which could be coupled
by shared infrastructure such as electrical grids. They are therefore purely logical
entities, which shouldn't be defined further in this document.

## Decision

This standard formulates the requirement for the distribution of Kubernetes nodes in order
to provide a fault-tolerant and available Kubernetes cluster infrastructure.

The control plane nodes MUST be distributed over multiple physical machines.
Kubernetes provides [best-practices][k8s-zones] on this topic, which are also RECOMMENDED by SCS.

At least one control plane instance MUST be run in each "failure zone" used for the cluster;
more instances per "failure zone" are possible in order to provide fault tolerance inside a zone.
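
For context (illustrative only, not normative): with a stacked etcd, the cluster remains
writable only while a majority (quorum) of etcd members is reachable, so a three-member
control plane tolerates exactly one member failure. The sketch below shows that arithmetic
and why placing two of three control plane nodes on one physical host undermines the goal
of this section.

```python
# Illustrative sketch only: etcd (Raft) quorum arithmetic for a stacked control plane.
def etcd_failure_tolerance(members: int) -> int:
    quorum = members // 2 + 1   # majority required to keep etcd writable
    return members - quorum     # members that may fail simultaneously

for n in (1, 3, 5):
    print(f"{n} member(s) -> tolerates {etcd_failure_tolerance(n)} failure(s)")
# 3 members tolerate exactly 1 failure; if 2 of them share a physical host,
# losing that host stops etcd and with it the cluster's control plane.
```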
Comment on lines +69 to +73
Member:

I experienced an interesting scenario in which the node distribution test will succeed.
The SCS compliance results in SovereignCloudStack/k8s-cluster-api-provider#742 (comment) show `INFO: The nodes are distributed across 2 host-ids.`
I am wondering now: if 3 control plane nodes are distributed over 2 hosts, then 2 control plane nodes are located on the same physical host. What will happen to the k8s cluster if this host goes down? AFAIK etcd has only one-member failure tolerance in this case, so it will stop functioning!

Member:

These lines in the standard and corresponding test don't require hard anti-affinity.

Contributor Author:

I understand the concern. We would probably need to adapt the lines above the ones you mentioned from

> The control plane nodes MUST be distributed over multiple physical machines.
> Kubernetes provides [best-practices][k8s-zones] on this topic, which are also RECOMMENDED by SCS.

to

> The control plane nodes MUST be distributed over multiple physical machines; a control plane MUST contain at least three nodes and be distributed over three machines.
> Kubernetes provides [best-practices][k8s-zones] on this topic, which are also RECOMMENDED by SCS.

Would that be alright?

Member:

Regarding "at least three nodes": can this be considered part of a node distribution standard, or is it better suited for some k8s HA standard?

Member:

Also, can we somehow incorporate the fact that etcd can be external? Then we are probably good with the current formulation; this comment is valid only if a stacked etcd is used for the k8s cluster.

Contributor Author (@cah-hbaum, Jun 17, 2024):

Probably more relevant for an HA standard, but TBF that was always the problem I had with this standard in general, since node distribution (or rather fault tolerance) also kind of (at least partly) includes (high) availability.
Personally, after working on this standard this long, I would probably change it up altogether to create a (high) availability standard that is based on this document and make this distribution standard optional, since not every cluster needs to be distributed (IMO).

Coming back to the case you presented above:

> I am wondering now: if 3 control plane nodes are distributed over 2 hosts, then 2 control plane nodes
> are located on the same physical host. What will happen to the k8s cluster if this host goes down?
> AFAIK etcd has only one-member failure tolerance in this case, so it will stop functioning!

I understand this scenario (and as far as I know you're right about the etcd behavior). The question is how to address this scenario in the standard. The number of physical nodes shouldn't be smaller than three if the etcd in use has 3 members.
So I guess this means this is still a distribution issue for us and not an HA issue?

Member:

> Also, can we somehow incorporate the fact that etcd can be external? Then we are probably good with the current formulation; this comment is valid only if a stacked etcd is used for the k8s cluster.

Now I am reading that you also need 3+3 nodes for an external etcd. Do you know why that is?

> I understand this scenario (and as far as I know you're right about the etcd behavior). The question is how to address this scenario in the standard. The number of physical nodes shouldn't be smaller than three if the etcd in use has 3 members.
> So I guess this means this is still a distribution issue for us and not an HA issue?

https://kubernetes.io/docs/setup/best-practices/multiple-zones/#control-plane-behavior mentions:

> If availability is an important concern, select at least three failure zones and replicate each individual control plane component (API server, scheduler, etcd, cluster controller manager) across at least three failure zones.

So maybe we can keep it as is and only move it to the HA standard later?

Contributor Author:

Interesting, I already read about this some time ago.
I guess the idea behind the external etcd is separation of concerns in your infrastructure and better control (and possibly security) for a cluster. So it makes sense to separate the underlying etcd nodes and the Kubernetes control plane. As the linked article mentions:

> This topology decouples the control plane and etcd member. It therefore provides an HA setup where losing a
> control plane instance or an etcd member has less impact and does not affect the cluster redundancy
> as much as the stacked HA topology.

Now I think that use case wouldn't be implemented that often, but it's an interesting thing to consider for a setup with required redundancy.

Personally, I would put it into a future high availability standard.

I will mention all this in #639 and we can keep discussing everything there.

Contributor Author:

Wrote and mentioned most of the stuff in the description of #639


Worker nodes are RECOMMENDED to be distributed over multiple zones. This policy makes
it OPTIONAL to provide a worker node in each "failure zone", meaning that worker nodes
can also be scaled vertically first before scaling horizontally.

To provide metadata about the node distribution and possibly enable efficient workload
scheduling, which also makes this standard testable, providers MUST label their K8s nodes
with the labels listed below (an illustrative scheduling sketch follows the list).
These labels MUST be kept up to date with the current state of the deployment.

- `topology.kubernetes.io/zone`

Corresponds to the label described in the [K8s labels documentation][k8s-labels-docs].
It identifies a logical failure zone on the provider side, e.g. a server rack on the
same electrical circuit or multiple machines connected to the internet through a single
network path. Exactly how a zone is defined is up to the provider.
In most cases the field is populated automatically, either by the kubelet or by external
mechanisms such as the cloud controller manager.

- `topology.kubernetes.io/region`

Corresponds to the label described in the [K8s labels documentation][k8s-labels-docs].
It describes the combination of one or more failure zones into a region or domain,
i.e. a larger logical failure unit. An example would be a building housing the racks
that make up several zones, since all of them fail if, for example, the power for the
building is cut. Again, exactly how a region is defined is up to the provider.
In most cases the field is populated automatically, either by the kubelet or by external
mechanisms such as the cloud controller manager.

- `topology.scs.community/host-id`

This is an SCS-specific label; it MUST contain the hostID of the physical machine running
the hypervisor (NOT: the hostID of a virtual machine). Here, the hostID is an arbitrary identifier,
which need not contain the actual hostname, but it should nonetheless be unique to the host.
This helps identify the distribution over underlying physical machines,
which would be masked if VM hostIDs were used.
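
As an illustration of how these labels enable (anti-)affinity scheduling (not part of the
normative text), the sketch below uses the Kubernetes Python client to require that
replicas of a hypothetical app `web` are placed in different failure zones; the app name
and image are placeholders.

```python
# Illustrative sketch only: pod anti-affinity across failure zones via the zone label.
from kubernetes import client

anti_affinity = client.V1Affinity(
    pod_anti_affinity=client.V1PodAntiAffinity(
        required_during_scheduling_ignored_during_execution=[
            client.V1PodAffinityTerm(
                # Keep pods carrying the label app=web apart ...
                label_selector=client.V1LabelSelector(match_labels={"app": "web"}),
                # ... with respect to the provider-assigned failure zone.
                topology_key="topology.kubernetes.io/zone",
            )
        ]
    )
)

pod_spec = client.V1PodSpec(
    containers=[client.V1Container(name="web", image="nginx:stable")],
    affinity=anti_affinity,
)
```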

## Conformance Tests

The script `k8s-node-distribution-check.py` checks the nodes available with a user-provided
kubeconfig file. Based on the labels `topology.scs.community/host-id`,
`topology.kubernetes.io/zone`, `topology.kubernetes.io/region` and `node-role.kubernetes.io/control-plane`,
the script then determines whether the nodes are distributed according to this standard.
If this isn't the case, the script produces an error.
It also produces warnings and informational outputs, e.g., if labels don't seem to be set.
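
For illustration, a much-simplified approximation of this kind of check is sketched below;
it is not the actual `k8s-node-distribution-check.py` and merely assumes cluster access via
a kubeconfig.

```python
# Illustrative sketch only: count distinct host-ids and zones of control plane nodes.
from kubernetes import client, config

def check_control_plane_distribution() -> int:
    config.load_kube_config()
    hosts, zones = set(), set()
    for node in client.CoreV1Api().list_node().items:
        labels = node.metadata.labels or {}
        if "node-role.kubernetes.io/control-plane" not in labels:
            continue  # only control plane nodes are relevant for this check
        hosts.add(labels.get("topology.scs.community/host-id", "<unset>"))
        zones.add(labels.get("topology.kubernetes.io/zone", "<unset>"))
    if len(hosts) < 2:
        print("ERROR: control plane is not distributed over multiple physical machines")
        return 1
    print(f"INFO: control plane spans {len(hosts)} host-id(s) and {len(zones)} zone(s)")
    return 0

if __name__ == "__main__":
    raise SystemExit(check_control_plane_distribution())
```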

## Previous standard versions

This is version 2 of the standard; it extends [version 1](scs-0214-v1-k8s-node-distribution.md) with the
requirements regarding node labeling.

[k8s-ha]: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/
[k8s-large-clusters]: https://kubernetes.io/docs/setup/best-practices/cluster-large/
[scs-0213-v1]: https://github.com/SovereignCloudStack/standards/blob/main/Standards/scs-0213-v1-k8s-nodes-anti-affinity.md
[k8s-labels-docs]: https://kubernetes.io/docs/reference/labels-annotations-taints/#topologykubernetesiozone
[k8s-zones]: https://kubernetes.io/docs/setup/best-practices/multiple-zones/