
Update docs/design.md #2242

Open. Wants to merge 3 commits into master.

Conversation

@AndrewSirenko (Contributor) commented Nov 25, 2024:

What type of PR is this?

/kind documentation

What is this PR about? / Why do we need it?

This PR updates docs/design.md so that it is no longer out of date. It also provides a high-level overview of both components of the EBS CSI Driver and adds a few sequence diagrams to help visualize the most common driver workflows.

Note: Look at PR in 'rich diff' mode to see new sequence diagrams.

How was this change tested?

n/a

Does this PR introduce a user-facing change?

Update docs/design.md and add high-level overview. 

@k8s-ci-robot added the kind/documentation label (Categorizes issue or PR as related to documentation.) on Nov 25, 2024
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andrewsirenko. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA.) and the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) on Nov 25, 2024

Code Coverage Diff

This PR does not change the code coverage


To illustrate these relationships between Kubernetes, the EBS CSI Driver, and AWS, we can look at what happens when you dynamically provision a volume for a given stateful workload:

(mermaid sequence diagram omitted in this diff excerpt)
@AndrewSirenko (Contributor, Author) commented:

Instead of having in-line diagrams, we can embed a .mermaid file from an attachments directory instead.

## High level overview of CSI calls

TODO Question: Thoughts on turning this into a Markdown table?
@AndrewSirenko (Contributor, Author) commented Nov 25, 2024:

I lean towards having the RPCs listed in a table, so it's cleaner to look at.

More detailed explanations can be done in an appendix.

If we are in favor of this I will add a commit that turns it into a table (or three).

Member commented:

I really like this idea, I'm game for it.


The ebs-csi-controller watches K8s storage resources for changes, acts on those changes by making AWS EC2 API calls, and then updates those K8s resources once all associated EBS volumes reflect those changes. Inside the ebs-csi-controller pod are several CSI sidecar containers that interact with Kubernetes resources, as well as the ebs-plugin container which manages EBS Volumes via AWS API calls. The sidecars trigger these EBS volume operations by making CSI Remote Procedure Calls (RPCs) against the ebs-plugin's controller service.

The ebs-csi-node DaemonSet ensures the Kubelet can manage the attached EBS storage devices via the ebs-plugin container. The Kubelet ensures these EBS volumes appear as block devices or mounted filesystem directories inside the container workloads, by triggering CSI RPCs against the ebs-plugin's node service.
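To make the controller/node split concrete, here is a minimal, hypothetical sketch of how a single plugin binary might register either CSI service based on a mode flag (illustrative names, and assuming a csi spec Go module recent enough to provide the Unimplemented* embeds; this is not the driver's actual code):

```go
package main

import (
	"flag"
	"log"
	"net"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
)

// Stub services; a real plugin implements the full CSI interfaces.
type identityService struct{ csi.UnimplementedIdentityServer }
type controllerService struct{ csi.UnimplementedControllerServer }
type nodeService struct{ csi.UnimplementedNodeServer }

func main() {
	mode := flag.String("mode", "node", "serve the controller or node CSI service")
	flag.Parse()

	lis, err := net.Listen("unix", "/csi/csi.sock")
	if err != nil {
		log.Fatalf("listen: %v", err)
	}
	srv := grpc.NewServer()

	// The identity service (GetPluginInfo, Probe) is served in both modes.
	csi.RegisterIdentityServer(srv, &identityService{})

	switch *mode {
	case "controller":
		// Controller pod: CreateVolume, ControllerPublishVolume, ...
		csi.RegisterControllerServer(srv, &controllerService{})
	case "node":
		// DaemonSet pod: NodeStageVolume, NodePublishVolume, ...
		csi.RegisterNodeServer(srv, &nodeService{})
	}
	log.Fatal(srv.Serve(lis))
}
```

Under this sketch the controller Deployment would run the binary in controller mode and the node DaemonSet in node mode (the flag name here is illustrative).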
Member commented:

nit: gRPC, not RPC

4. Kubernetes sleeps for some time.
1. csi-attacher calls ControllerPublishVolume(vol1, nodeA), i.e. "attach vol1 to nodeA".
2. ebs-plugin checks vol1 and sees it's not attached to nodeA yet. It calls EC2 AttachVolume(vol1, nodeA).
3. The attachment takes a long time, RPC times out.
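The idempotency that makes this retry safe can be sketched as follows; the cloud interface and state strings here are hypothetical stand-ins for the driver's EC2 wrapper:

```go
package driver

import (
	"context"
	"errors"
	"fmt"
)

// cloud is a hypothetical minimal interface over the EC2 API; the real
// driver's cloud layer is more involved.
type cloud interface {
	AttachmentState(ctx context.Context, volID, nodeID string) (state, device string, err error)
	AttachVolume(ctx context.Context, volID, nodeID string) (device string, err error)
	WaitForAttachmentState(ctx context.Context, volID, state string) error
}

// attach is idempotent, so a retry after a timed-out gRPC call converges on
// the result of the original EC2 AttachVolume instead of attaching twice.
func attach(ctx context.Context, c cloud, volID, nodeID string) (string, error) {
	state, device, err := c.AttachmentState(ctx, volID, nodeID)
	if err != nil {
		return "", err
	}
	switch state {
	case "attached":
		return device, nil // the earlier call already succeeded
	case "attaching":
		// An earlier attach is still in flight; surface a retriable error
		// (mapped to gRPC ABORTED) instead of issuing a second AttachVolume.
		return "", errors.New("attach in progress, retry later")
	}
	device, err = c.AttachVolume(ctx, volID, nodeID)
	if err != nil {
		return "", fmt.Errorf("AttachVolume: %w", err)
	}
	return device, c.WaitForAttachmentState(ctx, volID, "attached")
}
```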
Member commented:

nit: RPC -> gRPC


AWS exposes one unified ModifyVolume API to change the size, volume-type, IOPS, or throughput of your volume. AWS imposes a 6-hour cooldown after a successful volume modification.

However, the CSI Specification exposes two separate RPCs that rely on ebs-plugin calling this EC2 ModifyVolume API: ControllerExpandVolume, for increasing volume size, and ControllerModifyVolume, for all other volume modifications. To avoid the 6-hour cooldown, we coalesce these separate expansion and modification requests by waiting for up to two seconds, and then perform one merged EC2 ModifyVolume API Call.
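To illustrate the coalescing idea, here is a minimal sketch of merging concurrent modification requests per volume within a short window (illustrative only; the driver's actual implementation is more involved):

```go
package modify

import (
	"sync"
	"time"
)

// pendingModification accumulates parameters from separate
// ControllerExpandVolume / ControllerModifyVolume calls for one volume.
type pendingModification struct {
	sizeGiB    int32
	volumeType string
	iops       int32
	done       chan struct{} // closed once the merged EC2 call finishes
	err        error
}

// coalescer merges modification requests arriving within `window` so a
// single EC2 ModifyVolume call can apply them together.
type coalescer struct {
	mu      sync.Mutex
	pending map[string]*pendingModification // keyed by volume ID
	window  time.Duration                   // e.g. 2 * time.Second
	apply   func(volID string, m *pendingModification) error
}

// request folds this call's parameters into any pending modification for
// the volume and waits for the result of the single merged API call.
func (c *coalescer) request(volID string, update func(*pendingModification)) error {
	c.mu.Lock()
	m, ok := c.pending[volID]
	if !ok {
		m = &pendingModification{done: make(chan struct{})}
		c.pending[volID] = m
		time.AfterFunc(c.window, func() {
			c.mu.Lock()
			delete(c.pending, volID) // later requests start a new batch
			c.mu.Unlock()
			m.err = c.apply(volID, m) // one merged EC2 ModifyVolume call
			close(m.done)
		})
	}
	update(m) // merge this RPC's parameters into the batch
	c.mu.Unlock()

	<-m.done // every merged caller sees the same result
	return m.err
}
```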
Member commented:

nit: RPCs -> gRPCs

@ElijahQuinones (Member) commented:

Thank you for these improvements to the docs!

One high-level comment, which I also noted inline: we use RPC a lot in these docs, but we actually mean gRPC, which is slightly different since it is a more optimized implementation of RPC (see RPC Versus gRPC).

@AndrewSirenko (Contributor, Author) commented Nov 26, 2024:

we use RPC a lot in these docs but we actually mean gRPC

I was following the convention laid out in spec/spec.md, where I use RPC any time I mean "magic call that may happen on another computer, but written as if it were a local procedure call"; gRPC is more of an implementation detail (the specific implementation of RPC that the CSI Spec relies upon).

But happy to change this if other folks agree.

Perhaps a repository glossary page is in order, like SOCI Snapshotter's.

@torredil (Member) left a comment:

Largely lgtm. The sequence diagrams especially; this revision significantly elevates the doc.

/lgtm!

Comment on lines +284 to +288
#### Not Implemented

- ListVolumes
- GetCapacity
- ControllerGetVolume
Member commented:

Suggest removing this snippet since it duplicates information.

#### ValidateVolumeCapabilities

Check whether access mode is supported for each capability

#### ListVolumes

Not implemented in the initial release, Kubernetes does not need it.
Not implemented, Kubernetes does not need it.
Member commented:

np: Not implemented, Kubernetes does not need it. -> Not implemented.

The perfect CSI driver should be stateless. After start, it should recover its state by observing the actual status of AWS (i.e. describe instances / volumes). Current cloud provider follows this approach, however there are some corner cases around restarts when Kubernetes can try to attach two volumes to the same device on a node.

When the stateless driver is not possible, it can use some persistent storage outside of the driver. Since the driver should support multiple Container Orchestrators (like Mesos), it must not use Kubernetes APIs. It should use AWS APIs instead to persist its state if needed (like AWS DynamoDB). We assume that costs of using such db will be negligible compared to rest of Kubernetes.
The ideal CSI driver is stateless. After start, it should recover its state by observing the actual status of AWS (i.e. describe instances / volumes).
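For example, recovering attachment state on startup might look like the following sketch, using the AWS SDK for Go v2 (simplified error handling; names other than the SDK's are illustrative):

```go
package driver

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// recoverAttachments rebuilds in-memory state by asking EC2 which volumes
// are attached to this instance, instead of persisting state anywhere.
func recoverAttachments(ctx context.Context, instanceID string) error {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return err
	}
	client := ec2.NewFromConfig(cfg)
	out, err := client.DescribeVolumes(ctx, &ec2.DescribeVolumesInput{
		Filters: []types.Filter{{
			Name:   aws.String("attachment.instance-id"),
			Values: []string{instanceID},
		}},
	})
	if err != nil {
		return err
	}
	for _, v := range out.Volumes {
		for _, a := range v.Attachments {
			fmt.Printf("volume %s attached at %s (%s)\n",
				aws.ToString(v.VolumeId), aws.ToString(a.Device), a.State)
		}
	}
	return nil
}
```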

### No credentials on nodes

General security requirements we follow in Kubernetes is "if a node gets compromised then the damage is limited to the node". Paranoid people typically dedicate handful of nodes in Kubernetes cluster as "infrastructure nodes" and dedicate these nodes to run "infrastructure pods" only. Regular users can't run their pods there. CSI attacher and provisioner is an example of such "infrastructure pod" - it need permission to create/delete any PV in Kubernetes and CSI driver running there needs credentials to create/delete volumes in AWS.

There should be a way how to run the CSI driver (=container) in "node mode" only. Such driver would then respond only to node service RPCs and it would not have any credentials to AWS (or very limited credentials, e.g. only to Describe things). Paranoid people would deploy CSI driver in "node only" mode on all nodes where Kubernetes runs user containers.
Member commented:

Thoughts on also removing the "No credentials on nodes" section? (which I understand has been in here for several years). It adds an arbitrary restriction; it's not actually a design requirement. At the very least, please replace "Paranoid people" with "security conscious users".

@AndrewSirenko (Contributor, Author) commented Dec 4, 2024:

Hmm, I'm inclined to keep this section because I believe it lays out a few great points on why the daemonset pods should deploy ebs-plugin in node mode, and why those pods should have limited credentials.

Will replace 'paranoid' wording though. Thanks!

@youwalther65 commented:

I am still a bit confused about the second paragraph of this chapter. Why should people run an EBS CSI driver in "node only" mode, which is not able to provision any EBS volume (because it only has Describe* permissions)?

@AndrewSirenko (Contributor, Author) commented Dec 6, 2024:

Both the EBS CSI Driver Controller Deployment and Node Daemonset pods rely on the same ebs-plugin container. The difference is that in the controller pod we set the plugin's mode to controller, and in the node pods we set the plugin's mode to node. For example, this node mode is set on Daemonset's ebs-plugin container in the helm template here

Because the node pods set the ebs-plugin container to node mode, it won't respond to Controller Service RPCs (ControllerCreateVolume).

Today, we also don't give any of the daemonset's pods access to the EBS CSI Controller IAM Role by associating the node pods with a different ServiceAccount than the controller pods, because the Node Service RPCs like NodeStageVolume don't need to talk to the EC2 API.

Finally, we ensure only the EBS CSI Controller pods have the higher-risk K8s RBAC for actions like patching PV and VolumeAttachment resources.

Because of the 3 points above, a security conscious customer can ensure that the EBS CSI Controller pods are only scheduled on extra hardened nodes. If an intruder gains access to any of the other nodes on the cluster they wouldn't be able to create/delete EBS volumes via the ebs-csi-node pod, or mess with the cluster's storage resources (PVs, VAs). At least, that's my understanding of the 'why'.

Did I understand your question correctly @youwalther65?

@youwalther65 commented:

Exactly. Great, I'd like to see this answer in the doc itself!

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) on Nov 29, 2024

#### Probe

- Check that the driver is configured and it can do simple AWS operations, e.g. describe volumes or so.
- This call is used by Kubernetes liveness probe to check that the driver is healthy. It's called every ~10 seconds, so it should not do anything "expensive" or time consuming. (10 seconds are configurable, we can recommend higher values).
- Check that the driver is configured can do simple AWS operations, e.g. describe volumes or so.
Contributor commented:

This isn't actually true (and I don't think ever was?) and should be dropped
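Consistent with that comment, a Probe handler can stay trivially cheap. A minimal sketch, assuming a recent csi spec Go module (not the driver's actual implementation):

```go
package driver

import (
	"context"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/protobuf/types/known/wrapperspb"
)

type identityService struct{}

// Probe reports readiness without touching AWS: the liveness probe calls it
// roughly every 10 seconds, so it must stay cheap.
func (s *identityService) Probe(ctx context.Context, req *csi.ProbeRequest) (*csi.ProbeResponse, error) {
	return &csi.ProbeResponse{Ready: wrapperspb.Bool(true)}, nil
}
```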


#### GetCapacity

Not implemented in the initial release, Kubernetes does not need it.
Not implemented, Kubernetes does not need it.
Contributor commented:

Same as above, drop "Kubernetes does not need it"

Contributor commented:

Actually, there are so many RPCs that we don't implement (e.g. anything related to group snapshots) that we should probably drop any not-implemented RPCs from the doc rather than trying to list them all.


If you remember one thing from this document, remember that:
- ebs-csi-controller interacts with K8s storage resources, calls AWS EC2 APIs
- ebs-csi-node runs on every Node, used by Kubelet to perform privileged operations on attached storage devices
Contributor commented:

Drop this paragraph entirely, it is just a duplicate of information already listed in a "high level summary".

@AndrewSirenko (Contributor, Author) commented:

It's a short summary that we want to reiterate because of how essential it is. When ramping up folks on the EBS CSI Driver, I've had to reiterate this point several times because it is not obvious.

I can rename the section to high-level overview if that helps, but I value keeping this here.

@youwalther65 commented Dec 9, 2024:

The wording might be a bit misleading, and some folks may think these are containers. Probably something like:
EBS CSI controller component runs as a K8s Deployment called ebs-csi-controller and interacts ...
EBS CSI node component runs as a K8s DaemonSet called ebs-csi-node on every node, ...


- CreateVolume call must first check that the requested EBS volume has been already provisioned and return it if so. It should create a new volume only when such volume does not exist.
- ControllerPublish (=i.e. attach) does not do anything and returns "success" when given volume is already attached to requested node.
- DeleteVolume does not do anything and returns success when given volume is already deleted (i.e. it does not exist, we don't need to check that it had existed and someone really deleted it)

Note that it's task of the CSI driver to make these calls idempotent if related AWS API call is not.
Note that it's task of the ebs-plugin to make these calls idempotent even if the related AWS API call is not.
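A sketch of the CreateVolume idempotency pattern, using a hypothetical minimal cloud interface (the real driver matches volumes by the CSI volume name, e.g. via tags):

```go
package driver

import (
	"context"
	"fmt"
)

// ec2Cloud is a hypothetical minimal interface; the real driver wraps EC2.
type ec2Cloud interface {
	// FindVolumeByName looks a volume up by the idempotency token that the
	// CSI sidecar passes as the volume name.
	FindVolumeByName(ctx context.Context, name string) (volID string, sizeGiB int32, err error)
	CreateVolume(ctx context.Context, name string, sizeGiB int32) (volID string, err error)
}

// createVolume is idempotent: repeated calls with the same name return the
// already-provisioned volume instead of creating a duplicate.
func createVolume(ctx context.Context, c ec2Cloud, name string, sizeGiB int32) (string, error) {
	if volID, existingSize, err := c.FindVolumeByName(ctx, name); err == nil && volID != "" {
		if existingSize != sizeGiB {
			// Same name but incompatible parameters must be an error.
			return "", fmt.Errorf("volume %s already exists with size %d", name, existingSize)
		}
		return volID, nil // the previous CreateVolume already succeeded
	}
	return c.CreateVolume(ctx, name, sizeGiB)
}
```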
Contributor commented:

Drop the first half of this diff; ebs-plugin is a confusing term and means the same thing as "CSI driver". We shouldn't be using what is essentially an internal term with no meaning when "CSI Driver" is the official and well-understood term.

@AndrewSirenko (Contributor, Author) commented:

I politely disagree with this feedback, at least initially. I think the phrasing 'ebs-plugin' is clearer because that is what the container is called within our driver. Also, the upstream CSI driver documentation refers to this component (which gets called by sidecars and interacts with the storage provider) as the driver plugin, no?

Calling both the whole deployment, and the container itself the EBS CSI Driver confuses many operators new to our driver, including me when I first started.

Open to hear more thoughts though.

@AndrewSirenko (Contributor, Author) commented Dec 6, 2024:

Well, sometimes the CSI documentation does use plugin to refer to the sidecars and company:

Node Plugin

The node component should be deployed on every node in the cluster through a DaemonSet. It consists of the CSI driver that implements the CSI Node service and the node-driver-registrar sidecar container

Is there a better term we can use to differentiate? "ebs-plugin container"?

Or do we just call it "CSI Node/Controller service"

Contributor commented:

Calling both the whole deployment, and the container itself the EBS CSI Driver confuses many operators new to our driver, including me when I first started.

I would agree this could be confusing, but it is the former ("the whole deployment"), not the latter ("the container itself") that is inaccurate and causing the confusion. I don't think this document describes the Deployment or DaemonSet anywhere as "the EBS CSI Driver" with no qualifications, but if it does that language should be altered.

Using the term ebs-plugin causes confusion because it is not the standard terminology. It is important to get the terminology right here because the term "CSI Driver" is used this way extensively throughout the community and upstream code/documentation (e.g. https://kubernetes-csi.github.io/docs).

Depending on how the EBS CSI Driver is deployed, it may not even have an ebs-plugin container. For example, here is how the semi-popular Gardener tool deploys the driver, they use the name csi-driver for the driver container: https://github.com/gardener/gardener-extension-provider-aws/blob/master/charts/internal/shoot-system-components/charts/csi-driver-node/templates/daemonset.yaml#L43

@AndrewSirenko (Contributor, Author) commented Dec 9, 2024:

Let me include a short glossary/terminology page à la https://github.com/awslabs/soci-snapshotter/blob/main/docs/glossary.md in a future PR, where we can make this distinction crisp.

Using the term ebs-plugin causes confusion because it is not the standard terminology.

On the other hand the term 'ebs-plugin container' here can be helpful because it lets operators know exactly which container to look at for troubleshooting. In the glossary section, we can mention that as of v1.37.0, our csi driver is deployed as the ebs-plugin container by default.

What's confusing to newcomers here is that the whole package of Deployment + Daemonset is often referred to as the EBS CSI Driver add-on. Perhaps a glossary can remedy this.


Depending on how the EBS CSI Driver is deployed, it may not even have an ebs-plugin container. For example Gardener.

Do they even deploy the EBS CSI Driver Controller? I consider installation methods not explicitly approved by this repository as out of scope for this document.

@youwalther65 commented Dec 9, 2024:

A glossary is a great idea. For me, mentioning and explaining the role of the ebs-plugin container in both the controller Deployment and the node DaemonSet is helpful. And it's used in the graphs as well!


### Timeouts

gRPC always passes a timeout together with a request. After this timeout, the gRPC client call actually returns. The server (=CSI driver) can continue processing the call and finish the operation, however it has no means how to inform the client about the result.
gRPC always passes a timeout together with a request. After this timeout, the gRPC client call actually returns. The server (i.e. ebs-plugin) can continue processing the call and finish the operation, however it has no means how to inform the client about the result.
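A client-side sketch of this behavior (hypothetical sidecar-style code): the deadline bounds the client's wait, not the server's work, so the caller must retry the identical request and rely on server-side idempotency:

```go
package sidecar

import (
	"context"
	"time"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// callAttach gives the server a fixed deadline. On DeadlineExceeded, the
// server may still finish the attach; the caller retries the same request
// later and the idempotent server reports the final result.
func callAttach(client csi.ControllerClient, req *csi.ControllerPublishVolumeRequest) (*csi.ControllerPublishVolumeResponse, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()
	resp, err := client.ControllerPublishVolume(ctx, req)
	if status.Code(err) == codes.DeadlineExceeded {
		// The plugin may still be attaching; retry the identical call
		// after backoff rather than treating this as a hard failure.
		return nil, err
	}
	return resp, err
}
```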
Contributor commented:

Ditto above, drop this diff.

#### NodePublishVolume

Just bind-mount the volume.

#### NodeUnstageVolume

Just unmount the volume.
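A minimal sketch of those two operations using the k8s.io/mount-utils package (illustrative; real code validates paths and handles corrupted mounts):

```go
package driver

import (
	mountutils "k8s.io/mount-utils"
)

// nodePublish bind-mounts the staged volume path into the pod's target
// directory.
func nodePublish(stagingPath, targetPath string) error {
	mounter := mountutils.New("") // defaults to the system mount binary
	return mounter.Mount(stagingPath, targetPath, "", []string{"bind"})
}

// nodeUnstage unmounts the staging path.
func nodeUnstage(stagingPath string) error {
	return mountutils.New("").Unmount(stagingPath)
}
```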
Contributor commented:

I understand you didn't add it, but suggest dropping "Just" from all these sections.

If the attached volume has been formatted with a filesystem, resize the filesystem.

#### NodeGetVolumeStats

Contributor commented:

Empty section?

@AndrewSirenko (Contributor, Author) commented:

Forgot to leave a TODO here.

Might add the following: "Returns the amount of available/total/used bytes and inodes for a given volume."
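For illustration, such stats can be gathered with a statfs call on the volume path (Linux-only sketch, not necessarily how the driver implements it):

```go
package driver

import (
	"golang.org/x/sys/unix"
)

// volumeStats returns available/total/used bytes and inodes for a mounted
// volume path, the numbers a NodeGetVolumeStats response would report.
func volumeStats(volumePath string) (availBytes, totalBytes, usedBytes, availInodes, totalInodes, usedInodes int64, err error) {
	var st unix.Statfs_t
	if err = unix.Statfs(volumePath, &st); err != nil {
		return
	}
	availBytes = int64(st.Bavail) * int64(st.Bsize)
	totalBytes = int64(st.Blocks) * int64(st.Bsize)
	usedBytes = totalBytes - int64(st.Bfree)*int64(st.Bsize)
	availInodes = int64(st.Ffree)
	totalInodes = int64(st.Files)
	usedInodes = totalInodes - availInodes
	return
}
```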


Kubernetes will retry failed calls, usually after some exponential backoff. Kubernetes heavily relies on idempotency here - i.e. when the driver finished an operation after the client timed out, the driver will get the same call again and it should return success/error based on success/failure of the previous operation.
Kubernetes sidecars will retry failed calls after exponential backoff. These sidecars rely on idempotency here - i.e. when ebs-plugin finished an operation after the client timed out, the ebs-plugin will get the same call again, and it should return success/error based on success/failure of the previous operation.
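A sketch of the retry side of this contract (hypothetical sidecar-style code), re-sending the same idempotent call with exponential backoff:

```go
package sidecar

import (
	"context"
	"time"
)

// retryWithBackoff re-sends the same idempotent call until it succeeds or
// the overall context expires, doubling the delay between attempts.
func retryWithBackoff(ctx context.Context, call func(context.Context) error) error {
	delay := time.Second
	for {
		err := call(ctx)
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return err // give up; a later reconcile will retry again
		case <-time.After(delay):
		}
		if delay < 5*time.Minute {
			delay *= 2 // exponential backoff, capped
		}
	}
}
```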
Contributor commented:

Ditto above, drop the "driver" -> "ebs-plugin" part of this diff.

Very rarely a node gets too busy and kubelet starves for CPU. It does not unmount a volume when it should and Kubernetes initiates detach of the volume.
## EBS CSI Driver on Kubernetes

### High Level Summary
Contributor commented:

This section is dangerously close to documenting CSI and/or CSI sidecar design rather than EBS CSI design. For example, if the information here can be found in the CSI spec's architecture section (https://github.com/container-storage-interface/spec/blob/master/spec.md#architecture) we are better off linking there than making our own version.

activate p
note over p: Wait up to 2s for other RPC
s->>p: ControllerModifyVolume RPC

@AndrewSirenko (Contributor, Author) commented:

Need to remember to add a DescribeVolumes call here.

@AndrewSirenko (Contributor, Author) commented:

And that modification state could be either optimizing or completed.
