Update docs/design.md #2242

AndrewSirenko · 2024-11-25T14:44:15Z

What type of PR is this?

/kind documentation

What is this PR about? / Why do we need it?

This PR updates docs/design.md so that it is no longer out of date. It also provides a high-level overview of both components of the EBS CSI Driver and adds a few sequence diagrams to help visualize the most common driver workflows.

Note: Look at PR in 'rich diff' mode to see new sequence diagrams.

How was this change tested?

n/a

Does this PR introduce a user-facing change?

Update docs/design.md and add high-level overview.

k8s-ci-robot · 2024-11-25T14:44:20Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andrewsirenko. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

github-actions · 2024-11-25T14:46:13Z

Code Coverage Diff

This PR does not change the code coverage

AndrewSirenko · 2024-11-25T14:47:31Z

docs/design.md

+
+To illustrate these relationships between Kubernetes, the EBS CSI Driver, and AWS, we can look at what happens when you dynamically provision a volume for a given stateful workload:
+
+```mermaid


Instead of having in-line diagrams, we can embed a .mermaid file from an attachments directory.

AndrewSirenko · 2024-11-25T16:19:10Z

docs/design.md

 ## High level overview of CSI calls

+TODO Question: Thoughts on turning this into a Markdown table? 


I lean towards having the RPCs listed in a table, so it's cleaner to look at.

More detailed explanations can be done in an appendix.

If we are in favor of this I will add a commit that turns it into a table (or three).

I really like this idea, I'm game for it.

ElijahQuinones · 2024-11-26T14:41:11Z

docs/design.md

+
+The ebs-csi-controller watches K8s storage resources for changes, acts on those changes by making AWS EC2 API calls, and then updates those K8s resources once all associated EBS volumes reflect those changes. Inside the ebs-csi-controller pod are several CSI sidecar containers that interact with Kubernetes resources, as well as the ebs-plugin container which manages EBS Volumes via AWS API calls. The sidecars trigger these EBS volume operations by making CSI Remote Procedure Calls (RPCs) against the ebs-plugin's controller service. 
+
+The ebs-csi-node DaemonSet ensures the Kubelet can manage the attached EBS storage devices via the ebs-plugin container. The Kubelet ensures these EBS volumes appear as block devices or mounted filesystem directories inside the container workloads, by triggering CSI RPCs against the ebs-plugin's node service.   


nit gRPC not RPC

ElijahQuinones · 2024-11-26T14:46:47Z

docs/design.md

-4. Kubernetes sleeps for some time.
+1. csi-attacher calls ControllerPublishVolume(vol1, nodeA) ), i.e. "attach vol1 to nodeA".
+2. ebs-plugin checks vol1 and sees it's not attached to nodeA yet. It calls EC2 AttachVolume(vol1, nodeA).
+3. The attachment takes a long time, RPC times out.


nit: RPC -> gRPC

ElijahQuinones · 2024-11-26T14:52:17Z

docs/design.md

 ## High level overview of CSI calls

+TODO Question: Thoughts on turning this into a Markdown table? 


I really like this idea, I'm game for it.

ElijahQuinones · 2024-11-26T14:56:11Z

docs/design.md

+
+AWS exposes one unified ModifyVolume API to change the size, volume-type, IOPS, or throughput of your volume. AWS imposes a 6-hour cooldown after a successful volume modification.
+
+However, the CSI Specification exposes two separate RPCs that rely on ebs-plugin calling this EC2 ModifyVolume API: ControllerExpandVolume, for increasing volume size, and ControllerModifyVolume, for all other volume modifications. To avoid the 6-hour cooldown, we coalesce these separate expansion and modification requests by waiting for up to two seconds, and then perform one merged EC2 ModifyVolume API Call.


nit: RPCs -> gRPCs

ElijahQuinones · 2024-11-26T15:02:30Z

Thank you for making these improvements to the docs!

One high-level comment, which I also left inline: we use RPC a lot in these docs, but we actually mean gRPC, which is slightly different because it is a more optimized implementation of RPC (see "RPC Versus gRPC").

AndrewSirenko · 2024-11-26T16:51:06Z

we use RPC a lot in these docs but we actually mean gRPC

I was following the convention laid out in spec/spec.md where I use RPC anytime I mean "magic call that may happen on another computer, but written as if it were a local procedure call", and gRPC is more of an implementation detail (the specific implementation of RPC that CSI Spec relies upon).

But happy to change this if other folks agree.

Perhaps a repository glossary page is in order, like the one in SOCI Snapshotter.

torredil

Largely lgtm. Especially the sequence diagrams, this revision significantly elevates the doc.

/lgtm!

torredil · 2024-11-29T16:28:07Z

docs/design.md

+#### Not Implemented
+
+- ListVolumes
+- GetCapacity
+- ControllerGetVolume


Suggest removing this snippet since it's duplicating information.

torredil · 2024-11-29T16:28:37Z

docs/design.md

 #### ValidateVolumeCapabilities

 Check whether access mode is supported for each capability

 #### ListVolumes

-Not implemented in the initial release, Kubernetes does not need it.
+Not implemented, Kubernetes does not need it.


np: Not implemented, Kubernetes does not need it. -> Not implemented.

torredil · 2024-11-29T17:07:13Z

docs/design.md

-The perfect CSI driver should be stateless. After start, it should recover its state by observing the actual status of AWS (i.e. describe instances / volumes). Current cloud provider follows this approach, however there are some corner cases around restarts when Kubernetes can try to attach two volumes to the same device on a node.
-
-When the stateless driver is not possible, it can use some persistent storage outside of the driver. Since the driver should support multiple Container Orchestrators (like Mesos), it must not use Kubernetes APIs. It should use AWS APIs instead to persist its state if needed (like AWS DynamoDB). We assume that costs of using such db will be negligible compared to rest of Kubernetes.
+The ideal CSI driver is stateless. After start, it should recover its state by observing the actual status of AWS (i.e. describe instances / volumes).

 ### No credentials on nodes

 General security requirements we follow in Kubernetes is "if a node gets compromised then the damage is limited to the node". Paranoid people typically dedicate handful of nodes in Kubernetes cluster as "infrastructure nodes" and dedicate these nodes to run "infrastructure pods" only. Regular users can't run their pods there. CSI attacher and provisioner is an example of such "infrastructure pod" - it need permission to create/delete any PV in Kubernetes and CSI driver running there needs credentials to create/delete volumes in AWS.

 There should be a way how to run the CSI driver (=container) in "node mode" only. Such driver would then respond only to node service RPCs and it would not have any credentials to AWS (or very limited credentials, e.g. only to Describe things). Paranoid people would deploy CSI driver in "node only" mode on all nodes where Kubernetes runs user containers.


Thoughts on also removing the "No credentials on nodes" section? (which I understand has been in here for several years). It adds an arbitrary restriction; it's not actually a design requirement. At the very least, please replace "Paranoid people" with "security conscious users".

Hmm, I'm inclined to keep this section because I believe it lays out a few great points on why daemonset pod should deploy ebs-plugin in node mode, and why daemonset pods should have limited credentials.

Will replace 'paranoid' wording though. Thanks!

I am still a bit confused about the second paragraph of this chapter. Why should people run an EBS CSI driver in "node only" mode, which is not able to provision any EBS volume (because it only has Describe* permissions)?

Both the EBS CSI Driver Controller Deployment and Node Daemonset pods rely on the same ebs-plugin container. The difference is that in the controller pod we set the plugin's mode to controller, and in the node pods we set the plugin's mode to node. For example, this node mode is set on Daemonset's ebs-plugin container in the helm template here

Because the node pods set the ebs-plugin container to node mode, it won't respond to Controller Service RPCs (e.g. CreateVolume).

Today, we also don't give any of the daemonset's pods access to the EBS CSI Controller IAM Role by associating the node pods with a different ServiceAccount than the controller pods, because the Node Service RPCs like NodeStageVolume don't need to talk to the EC2 API.

Finally, we ensure only the EBS CSI Controller pods have the higher-risk K8s RBAC for actions like patching PV and VolumeAttachment resources.

Because of the 3 points above, a security conscious customer can ensure that the EBS CSI Controller pods are only scheduled on extra hardened nodes. If an intruder gains access to any of the other nodes on the cluster they wouldn't be able to create/delete EBS volumes via the ebs-csi-node pod, or mess with the cluster's storage resources (PVs, VAs). At least, that's my understanding of the 'why'.

Did I understand your question correctly @youwalther65?

Exactly. Great, I'd like to see this answer in the doc itself!
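The mode split discussed in this thread can be sketched in Go as follows. This is a simplified illustration, not the driver's code: the `plugin` type, the `newPlugin` constructor, and the `Handle` method are hypothetical, and real gRPC service registration is replaced by a plain map. The one thing taken from the CSI spec is that the Identity service is served alongside both the Controller and the Node service.

```go
package main

import "fmt"

type mode string

const (
	controllerMode mode = "controller"
	nodeMode       mode = "node"
)

// plugin records which CSI services this process serves.
type plugin struct {
	services map[string]bool
}

// newPlugin registers only the services that belong to the given mode.
func newPlugin(m mode) (*plugin, error) {
	p := &plugin{services: map[string]bool{"Identity": true}}
	switch m {
	case controllerMode:
		p.services["Controller"] = true
	case nodeMode:
		p.services["Node"] = true
	default:
		return nil, fmt.Errorf("unknown mode %q", m)
	}
	return p, nil
}

// Handle rejects any RPC whose service was not registered for this mode.
func (p *plugin) Handle(service, rpc string) error {
	if !p.services[service] {
		return fmt.Errorf("%s/%s: service not registered in this mode", service, rpc)
	}
	return nil
}

func main() {
	node, err := newPlugin(nodeMode)
	if err != nil {
		panic(err)
	}
	fmt.Println("NodeStageVolume:", node.Handle("Node", "NodeStageVolume"))
	fmt.Println("CreateVolume:", node.Handle("Controller", "CreateVolume"))
}
```

Because a node-mode process never registers the Controller service, it has no code path that needs AWS credentials, which is what lets the node pods run with a separate, less privileged ServiceAccount.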

ConnorJC3 · 2024-12-06T19:35:35Z

docs/design.md

 ```

 #### Probe

- Check that the driver is configured and it can do simple AWS operations, e.g. describe volumes or so.
- This call is used by Kubernetes liveness probe to check that the driver is healthy. It's called every ~10 seconds, so it should not do anything "expensive" or time consuming.  (10 seconds are configurable, we can recommend higher values).
+- Check that the driver is configured can do simple AWS operations, e.g. describe volumes or so.


This isn't actually true (and I don't think ever was?) and should be dropped

ConnorJC3 · 2024-12-06T19:36:14Z

docs/design.md


 #### GetCapacity

-Not implemented in the initial release, Kubernetes does not need it.
+Not implemented, Kubernetes does not need it.


Same as above, drop "Kubernetes does not need it"

Actually, there are so many RPCs that we don't implement (e.g. anything related to group snapshot) we should probably drop any not implemented RPCs from the doc rather than trying to list them all.

ConnorJC3 · 2024-12-06T19:39:05Z

docs/design.md

+
+If you remember one thing from this document, remember that:
+- ebs-csi-controller interacts with K8s storage resources, calls AWS EC2 APIs
+- ebs-csi-node runs on every Node, used by Kubelet to perform privileged operations on attached storage devices


Drop this paragraph entirely, it is just a duplicate of information already listed in a "high level summary".

It's a short summary that we want to re-iterate because of how essential it is. When ramping up folks on the EBS CSI Driver, I've had to re-iterate this point several times because it is not obvious.

I can rename the section to high-level overview if that helps, but I value keeping this here.

The wording might be a bit misleading, and some folks may think these are containers. Probably something like:
EBS CSI controller component runs as a K8s Deployment called ebs-csi-controller and interacts ...
EBS CSI node component runs as a K8s DaemonSet called ebs-csi-node on every node, ...

ConnorJC3 · 2024-12-06T19:40:01Z

docs/design.md


 - CreateVolume call must first check that the requested EBS volume has been already provisioned and return it if so. It should create a new volume only when such volume does not exist.
 - ControllerPublish (=i.e. attach) does not do anything and returns "success" when given volume is already attached to requested node.
 - DeleteVolume does not do anything and returns success when given volume is already deleted (i.e. it does not exist, we don't need to check that it had existed and someone really deleted it)

-Note that it's task of the CSI driver to make these calls idempotent if related AWS API call is not.
+Note that it's task of the ebs-plugin to make these calls idempotent even if the related AWS API call is not.
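The three idempotency rules quoted above can be sketched in Go against an in-memory stand-in for the cloud. Everything here is hypothetical (the `fakeCloud` type, its methods, and the `vol-N` IDs); it only illustrates the "check first, then act" shape, not how the driver talks to EC2.

```go
package main

import (
	"errors"
	"fmt"
)

type volume struct {
	ID         string
	Name       string
	AttachedTo string
}

// fakeCloud is an in-memory stand-in for the storage backend.
type fakeCloud struct {
	byName map[string]*volume
	nextID int
}

var errAttachedElsewhere = errors.New("volume is attached to another node")

// CreateVolume returns the existing volume for name if there is one and
// creates a new volume only when none exists.
func (c *fakeCloud) CreateVolume(name string) *volume {
	if v, ok := c.byName[name]; ok {
		return v
	}
	c.nextID++
	v := &volume{ID: fmt.Sprintf("vol-%d", c.nextID), Name: name}
	c.byName[name] = v
	return v
}

// ControllerPublish succeeds without doing anything when the volume is
// already attached to the requested node.
func (c *fakeCloud) ControllerPublish(name, node string) error {
	v, ok := c.byName[name]
	if !ok {
		return fmt.Errorf("volume %q not found", name)
	}
	if v.AttachedTo == node {
		return nil
	}
	if v.AttachedTo != "" {
		return errAttachedElsewhere
	}
	v.AttachedTo = node
	return nil
}

// DeleteVolume succeeds whether or not the volume still exists.
func (c *fakeCloud) DeleteVolume(name string) error {
	delete(c.byName, name) // deleting a missing key is a no-op
	return nil
}

func main() {
	c := &fakeCloud{byName: map[string]*volume{}}
	first := c.CreateVolume("pvc-1")
	second := c.CreateVolume("pvc-1") // a retried call gets the same volume back
	fmt.Println(first.ID == second.ID)
	fmt.Println(c.ControllerPublish("pvc-1", "nodeA"), c.ControllerPublish("pvc-1", "nodeA"))
	fmt.Println(c.DeleteVolume("pvc-1"), c.DeleteVolume("pvc-1"))
}
```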


Drop the first half of this diff, ebs-plugin is a confusing term and means the same thing as "CSI driver". We should not be using what is essentially an internal term with no meaning when "CSI Driver" is the official and well-understood term.

I politely initially disagree with this feedback. I think the phrasing 'ebs-plugin' is clearer because that is what the container is called within our driver. Also the upstream csi driver documentation refers to this component (that gets called by sidecars and interacts with storage provider) as driver-plugin, no?

Calling both the whole deployment, and the container itself the EBS CSI Driver confuses many operators new to our driver, including me when I first started.

Open to hear more thoughts though.

Well, sometimes CSI documentation does use plugin to refer to sidecars and company

Node Plugin The node component should be deployed on every node in the cluster through a DaemonSet. It consists of the CSI driver that implements the CSI Node service and the node-driver-registrar sidecar container

Is there a better term we can use to differentiate? "ebs-plugin container"?

Or do we just call it "CSI Node/Controller service"

Calling both the whole deployment, and the container itself the EBS CSI Driver confuses many operators new to our driver, including me when I first started.

I would agree this could be confusing, but it is the former ("the whole deployment"), not the latter ("the container itself") that is inaccurate and causing the confusion. I don't think this document describes the Deployment or DaemonSet anywhere as "the EBS CSI Driver" with no qualifications, but if it does that language should be altered.

Using the term ebs-plugin causes confusion because it is not the standard terminology. It is important to get the terminology right here because the term "CSI Driver" is used this way extensively throughout the community and upstream code/documentation (e.g. https://kubernetes-csi.github.io/docs).

Depending on how the EBS CSI Driver is deployed, it may not even have an ebs-plugin container. For example, here is how the semi-popular Gardener tool deploys the driver, they use the name csi-driver for the driver container: https://github.com/gardener/gardener-extension-provider-aws/blob/master/charts/internal/shoot-system-components/charts/csi-driver-node/templates/daemonset.yaml#L43

Let me include a short glossary/terminology page à la https://github.com/awslabs/soci-snapshotter/blob/main/docs/glossary.md in a future PR where we can make this distinction crisp.

Using the term ebs-plugin causes confusion because it is not the standard terminology.

On the other hand the term 'ebs-plugin container' here can be helpful because it lets operators know exactly which container to look at for troubleshooting. In the glossary section, we can mention that as of v1.37.0, our csi driver is deployed as the ebs-plugin container by default.

What's confusing to newcomers here is that the whole package of Deployment + Daemonset is often referred to as the EBS CSI Driver add-on. Perhaps a glossary can remedy this.

Depending on how the EBS CSI Driver is deployed, it may not even have an ebs-plugin container. For example Gardener.

Do they even deploy the EBS CSI Driver Controller? I consider installation methods not explicitly approved by this repository as out of scope for this document.

A glossary is a great idea. For me, mentioning and explaining the role of the ebs-plugin container in both the controller Deployment and the node DaemonSet is helpful. And it's used in the graphs as well!

ConnorJC3 · 2024-12-06T19:40:42Z

docs/design.md


 ### Timeouts

-gRPC always passes a timeout together with a request. After this timeout, the gRPC client call actually returns. The server (=CSI driver) can continue processing the call and finish the operation, however it has no means how to inform the client about the result.
+gRPC always passes a timeout together with a request. After this timeout, the gRPC client call actually returns. The server (i.e. ebs-plugin) can continue processing the call and finish the operation, however it has no means how to inform the client about the result.


Ditto above, drop this diff.

ConnorJC3 · 2024-12-06T19:48:38Z

docs/design.md

 #### NodePublishVolume

 Just bind-mount the volume.

+#### NodeUnstageVolume
+
+Just unmount the volume.


I understand you didn't add it, but suggest dropping "Just" from all these sections.

ConnorJC3 · 2024-12-06T19:49:09Z

docs/design.md

+If the attached volume has been formatted with a filesystem, resize the filesystem.
+
+#### NodeGetVolumeStats
+


Empty section?

Forgot to leave a TODO here.

Might add the following: "Returns the amount of available/total/used bytes and inodes for a given volume."
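For reference, the proposed description maps onto a `statfs` call on the volume's mount path. The Go sketch below is an assumption about how such numbers can be gathered on Linux using only the standard library; the `volumeStats` type and `statVolume` function are hypothetical names and this is not presented as the driver's implementation.

```go
package main

import (
	"fmt"
	"syscall"
)

// volumeStats mirrors the six numbers named in the proposed description.
type volumeStats struct {
	TotalBytes, AvailableBytes, UsedBytes    uint64
	TotalInodes, AvailableInodes, UsedInodes uint64
}

// statVolume reads filesystem statistics for the filesystem mounted at path.
func statVolume(path string) (volumeStats, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return volumeStats{}, fmt.Errorf("statfs %s: %w", path, err)
	}
	// Field widths differ between platforms, so convert everything to uint64.
	blockSize := uint64(st.Bsize)
	blocks, free, avail := uint64(st.Blocks), uint64(st.Bfree), uint64(st.Bavail)
	files, filesFree := uint64(st.Files), uint64(st.Ffree)
	return volumeStats{
		TotalBytes:      blocks * blockSize,
		AvailableBytes:  avail * blockSize, // space usable by unprivileged users
		UsedBytes:       (blocks - free) * blockSize,
		TotalInodes:     files,
		AvailableInodes: filesFree,
		UsedInodes:      files - filesFree,
	}, nil
}

func main() {
	stats, err := statVolume("/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", stats)
}
```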

ConnorJC3 · 2024-12-06T19:51:48Z

docs/design.md


-Kubernetes will retry failed calls, usually after some exponential backoff. Kubernetes heavily relies on idempotency here - i.e. when the driver finished an operation after the client timed out, the driver will get the same call again and it should return success/error based on success/failure of the previous operation.
+Kubernetes sidecars will retry failed calls after exponential backoff. These sidecars rely on idempotency here - i.e. when ebs-plugin finished an operation after the client timed out, the ebs-plugin will get the same call again, and it should return success/error based on success/failure of the previous operation.


Ditto above, drop the "driver" -> "ebs-plugin" part of this diff.

ConnorJC3 · 2024-12-06T19:56:30Z

docs/design.md

-Very rarely a node gets too busy and kubelet starves for CPU. It does not unmount a volume when it should and Kubernetes initiates detach of the volume.
+## EBS CSI Driver on Kubernetes
+
+### High Level Summary


This section is dangerously close to documenting CSI and/or CSI sidecar design rather than EBS CSI design. For example, if the information here can be found in the CSI spec's architecture section (https://github.com/container-storage-interface/spec/blob/master/spec.md#architecture) we are better off linking there than making our own version.

AndrewSirenko · 2024-12-06T22:08:50Z

docs/design.md

+    activate p
+    note over p: Wait up to 2s for other RPC
+    s->>p:  ControllerModifyVolume RPC
+


Need to remember to add a describevolumes call here.

And that modification state could be either optimizing OR completed
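As a small sketch of the check mentioned here: EC2 reports a volume modification state of modifying, optimizing, completed, or failed, and the note above says the driver can treat either optimizing or completed as done. The function name `modificationSettled` is hypothetical, and the state string is assumed to come from a DescribeVolumesModifications-style response.

```go
package main

import "fmt"

// modificationSettled reports whether a volume modification can be treated
// as finished. Both "optimizing" and "completed" count as done.
func modificationSettled(state string) (bool, error) {
	switch state {
	case "optimizing", "completed":
		return true, nil
	case "modifying":
		return false, nil // still in progress, check again later
	case "failed":
		return false, fmt.Errorf("volume modification ended in state %q", state)
	default:
		return false, fmt.Errorf("unexpected modification state %q", state)
	}
}

func main() {
	for _, state := range []string{"modifying", "optimizing", "completed", "failed"} {
		done, err := modificationSettled(state)
		fmt.Println(state, done, err)
	}
}
```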

Update docs/design.md

c525b5a

k8s-ci-robot added the kind/documentation Categorizes issue or PR as related to documentation. label Nov 25, 2024

k8s-ci-robot requested review from ElijahQuinones and torredil November 25, 2024 14:44

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 25, 2024

AndrewSirenko commented Nov 25, 2024

Add EC2 ModifyVolume request coalescing diagram to docs/design.md

0fff4f3

AndrewSirenko commented Nov 25, 2024

ElijahQuinones reviewed Nov 26, 2024

fixup! Update docs/design.md

6e844a8

torredil approved these changes Nov 29, 2024

k8s-ci-robot assigned torredil Nov 29, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 29, 2024

ElijahQuinones approved these changes Nov 29, 2024

k8s-ci-robot assigned ElijahQuinones Nov 29, 2024

ConnorJC3 reviewed Dec 6, 2024

AndrewSirenko commented Dec 6, 2024


