OCPEDGE-1191: feat: initial arbiter cluster enhancement #1674

Draft: eggfoobar wants to merge 1 commit into base: master

Conversation

eggfoobar (Contributor)

No description provided.

openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type.) Sep 4, 2024

openshift-ci-robot commented Sep 4, 2024

@eggfoobar: This pull request references OCPEDGE-1191 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.18.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci bot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) Sep 4, 2024

openshift-ci bot commented Sep 4, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


openshift-ci bot commented Sep 4, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign deads2k for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


### Goals

- Provide a new node arbiter role type that supports HA but is not a full master


Suggested change
- Provide a new node arbiter role type that supports HA but is not a full master
- Provide a new arbiter node role type that achieves HA but does not act as a full master node

### Goals

- Provide a new node arbiter role type that supports HA but is not a full master
- Support installing OpenShift with 2 regular nodes and 1 arbiter node.


Suggested change
- Support installing OpenShift with 2 regular nodes and 1 arbiter node.
- Support installing OpenShift with 2 master nodes and 1 arbiter node.


- Provide a new node arbiter role type that supports HA but is not a full master
- Support installing OpenShift with 2 regular nodes and 1 arbiter node.
- The arbiter node hardware requirement will be lower than regular nodes.


Suggested change
- The arbiter node hardware requirement will be lower than regular nodes.
- The arbiter node hardware requirements will be lower than regular nodes in both cost and performance.

- Running the arbiter node offsite
- Running the arbiter node as a VM local to the cluster
- Having a single arbiter supporting multiple clusters
- Moving from 2 + 1 to conventional 3 node cluster


Suggested change
- Moving from 2 + 1 to conventional 3 node cluster
- Moving from 2 + 1 to a conventional 3 node cluster


Why are we stating this as a non-goal? The JIRA feature asks for this (see requirement number 6 in the description)

be scheduled on the arbiter node. The arbiter node will be tainted to make sure
that only deployments that tolerate that taint are scheduled on the arbiter.

Things that we are proposing of changing.


Suggested change
Things that we are proposing of changing.
Functionality that we are proposing to change:
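
For reference, the taint/toleration mechanism quoted above ("The arbiter node will be tainted to make sure that only deployments that tolerate that taint are scheduled on the arbiter") would look roughly like the sketch below. This is purely illustrative: the taint key `node-role.kubernetes.io/arbiter` and the node/pod names are assumptions, not something the enhancement has settled on.

```yaml
# Illustrative sketch only; taint key and names are hypothetical.
# The arbiter node carries a NoSchedule taint...
apiVersion: v1
kind: Node
metadata:
  name: arbiter-0                           # hypothetical node name
spec:
  taints:
  - key: node-role.kubernetes.io/arbiter    # hypothetical taint key
    effect: NoSchedule
---
# ...and only workloads that explicitly tolerate that taint (e.g. etcd) land on it.
apiVersion: v1
kind: Pod
metadata:
  name: etcd-arbiter-0                      # hypothetical pod name
spec:
  tolerations:
  - key: node-role.kubernetes.io/arbiter    # must match the node taint above
    operator: Exists
    effect: NoSchedule
  containers:
  - name: etcd
    image: example.com/etcd:placeholder     # placeholder image, not a real reference
```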


A few drawbacks we have is that we will be creating a new variant of OpenShift
that implements a new unique way of doing HA for kubernetes. This does mean an
increase in the test matrix and all together a different type of tests since


...and all together a different type of tests since

since what?


- Running e2e test would be preferred but might prove to be tricky due to the
asymmetry in the control plane
- We need a strategy for validating install and test failures


We may need to modify a lot of the tests in origin to account for this new configuration or add a whole new test suite to accommodate it


- Ability to utilize the enhancement end to end
- End user documentation, relative API stability
- Sufficient test coverage


How do we define sufficient here?


What's the reason we need a Dev Preview? Could we start with a Tech Preview right away, to reduce potential time to market? We could extend the TP phase if required.

We originally had tried using the pre-existing features in OCP, such as setting
a node as NoSchedule to avoid customer workloads going on the arbiter node.
While this whole worked as expected, the problem we faced is that the desire is
to use a very lower powered and cheap device as the arbiter, this method would


Suggested change
to use a very lower powered and cheap device as the arbiter, this method would
to use a device that is lower power and is cheaper as the arbiter. This method would

a node as NoSchedule to avoid customer workloads going on the arbiter node.
While this whole worked as expected, the problem we faced is that the desire is
to use a very lower powered and cheap device as the arbiter, this method would
still run a lot of the overhead on the arbiter node.


Suggested change
still run a lot of the overhead on the arbiter node.
still run most of the OCP overhead on the arbiter node.
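
For context, the closest pre-existing knob in this space is the Scheduler config sketched below; whether this is exactly the mechanism that was tried is my assumption. Setting `mastersSchedulable: false` keeps user workloads off control-plane nodes via the standard `node-role.kubernetes.io/master:NoSchedule` taint, but as the text notes, the node still carries the full control-plane footprint.

```yaml
# Existing OCP configuration that keeps user workloads off control-plane nodes.
# It does not shrink the control plane itself, which is the problem described above.
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
spec:
  mastersSchedulable: false
```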

- Running the arbiter node offsite
- Running the arbiter node as a VM local to the cluster
- Having a single arbiter supporting multiple clusters
- Moving from 2 + 1 to conventional 3 node cluster


As far as I know, changing the topology mode is not allowed.


The etcd / OCP topology would not change, to my understanding - it remains a 3 node cluster.


- Provide a new node arbiter role type that supports HA but is not a full master
- Support installing OpenShift with 2 regular nodes and 1 arbiter node.
- The arbiter node hardware requirement will be lower than regular nodes.


What about including a hardware example from Daniel's presentation in order to give an idea of where this might fit?


## Open Questions [optional]

1. In the future it might be desired to add another master and convert to a


As far as I know, changing the topology mode is not allowed.


@brandisher left a comment


Questions from my side:

  1. What is the concrete list of control plane components that will run on the arbiter node?
  2. What non-control plane or control-plane supporting components need to exist on the arbiter node? (e.g. cluster observability)
  3. How do we let cluster admins (or other RH operators) opt in to adding components to the arbiter? I'm thinking of cases where a customer has a monitoring workload that they need to run on the control plane, or we have operators with agents that must run on the control plane like ACM or GitOps.
  4. Are there any considerations needed for components that utilize the infrastructureTopology field to determine their replicas? (I believe this applies to observability components like Prometheus et al.)
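
On question 4, here is a sketch of the existing Infrastructure status fields being referred to. What an arbiter cluster should report in them is exactly the open decision the question raises; the values shown are placeholders, not a proposal.

```yaml
# Existing config.openshift.io/v1 Infrastructure status (read-only, reported by
# the cluster) that components consult when sizing their replicas.
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  controlPlaneTopology: HighlyAvailable      # placeholder; an arbiter-specific value is an open question
  infrastructureTopology: HighlyAvailable    # used by e.g. monitoring to pick replica counts
```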


### Goals

- Provide a new node arbiter role type that supports HA but is not a full master


What does "supports HA" mean?

ideas for future features.

- Running the arbiter node offsite
- Running the arbiter node as a VM local to the cluster


Suggested change
- Running the arbiter node as a VM local to the cluster
- Running a virtualized arbiter node.

This suggestion is a little nitpicky and could be ignored if it doesn't make sense. I interpreted this non-goal to mean that we're not intending to support a virtualized arbiter node in any capacity: from within the cluster, adjacent to the cluster, or remote to the cluster.


I think we should not exclude running the arbiter node on a hypervisor. There might be situations where this would actually be helpful. I think the key point is really not to run the arbiter on OCPVirt on the same cluster, as this spoils the idea of 3 node redundancy.


#### For Cloud Installs

1. User sits down at the computer.


What about GitOps installations? Just kidding :-)

This can be dropped IMO. The location of the cluster admin is an implementation detail.

#### For Cloud Installs

1. User sits down at the computer.
2. The user creates an `install-config.yaml` like normal.


Suggested change
2. The user creates an `install-config.yaml` like normal.
2. The user creates an `install-config.yaml`.
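
To make the flow concrete, a 2 + 1 `install-config.yaml` might look roughly like the abbreviated sketch below. The `arbiter` machine pool is entirely hypothetical (the enhancement has not defined the installer schema), and required fields such as platform, networking, and the pull secret are omitted.

```yaml
# Hypothetical, abbreviated install-config.yaml for a 2 masters + 1 arbiter cluster.
# The "arbiter" pool is an assumption for illustration only.
apiVersion: v1
metadata:
  name: example-arbiter-cluster
controlPlane:
  name: master
  replicas: 2
arbiter:                    # hypothetical new machine pool
  name: arbiter
  replicas: 1
compute:
- name: worker
  replicas: 0
```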


#### For Baremetal Installs

1. User sits down at the computer.


Same as above - this line can be dropped.

#### For Baremetal Installs

1. User sits down at the computer.
2. The user creates an `install-config.yaml` like normal.


Suggested change
2. The user creates an `install-config.yaml` like normal.
2. The user creates an `install-config.yaml`.

mitigations we can take against that is to make sure we are testing installs and
updates.

Another risk we run is customers using an arbiter node with improper disk speeds


I'm leaning towards suggesting that this risk be removed, since we document etcd disk best practices and they'd also apply here. With that in mind, I don't think this risk is specific to this enhancement. If anyone else wants to give this idea a 👍 then maybe it can be dropped; otherwise it can be left as-is.

reviewers:
- "@tjungblu"
- "@patrickdillon"
- "@williamcaban"


William is no longer involved in control plane; please tag Ramon Acedo.


1. In the future it might be desired to add another master and convert to a
compact cluster, do we want to support changing ControlPlaneTopology field
after the fact?


See above - this is already listed in the requirements and should be possible. Maybe not in the initial release, but then in the next one.


Feature Requirement number 8 states:
It must be possible to explicitly schedule additional workload to the arbiter node. That is important for 3rd-party solutions (e.g. a storage provider) which also have quorum-based mechanisms.
The use case is e.g. ODF in replica-2 mode, where some ODF components also need 3 deployments (the ceph mon for example).
How is this supported by this ER? I think we should mention / describe this case.
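
For illustration, explicitly scheduling such a quorum member (e.g. a ceph mon) onto the arbiter could look like the sketch below. The `node-role.kubernetes.io/arbiter` label and taint key are assumptions carried over from the taint discussion above, not something the enhancement defines.

```yaml
# Hypothetical sketch: a third-party quorum member opting in to run on the
# arbiter via a node selector plus a toleration for the arbiter taint.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: quorum-member-arbiter                # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: quorum-member
  template:
    metadata:
      labels:
        app: quorum-member
    spec:
      nodeSelector:
        node-role.kubernetes.io/arbiter: ""      # hypothetical node label
      tolerations:
      - key: node-role.kubernetes.io/arbiter     # hypothetical taint key
        operator: Exists
        effect: NoSchedule
      containers:
      - name: member
        image: example.com/quorum-member:placeholder   # placeholder image
```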

