-
Notifications
You must be signed in to change notification settings - Fork 806
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update docs/design.md #2242
base: master
Are you sure you want to change the base?
Update docs/design.md #2242
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Code Coverage DiffThis PR does not change the code coverage |
|
||
To illustrate these relationships between Kubernetes, the EBS CSI Driver, and AWS, we can look at what happens when you dynamically provision a volume for a given stateful workload: | ||
|
||
```mermaid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having in-line diagrams, we can embed a .mermaid
file from an attachments directory instead.
## High level overview of CSI calls | ||
|
||
TODO Question: Thoughts on turning this into a Markdown table? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I lean towards having the RPCs listed in a table, so it's cleaner to look at.
More detailed explanations can be done in an appendix.
If we are in favor of this I will add a commit that turns it into a table (or three).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like this idea im game for it.
|
||
The ebs-csi-controller watches K8s storage resources for changes, acts on those changes by making AWS EC2 API calls, and then updates those K8s resources once all associated EBS volumes reflect those changes. Inside the ebs-csi-controller pod are several CSI sidecar containers that interact with Kubernetes resources, as well as the ebs-plugin container which manages EBS Volumes via AWS API calls. The sidecars trigger these EBS volume operations by making CSI Remote Procedure Calls (RPCs) against the ebs-plugin's controller service. | ||
|
||
The ebs-csi-node DaemonSet ensures the Kubelet can manage the attached EBS storage devices via the ebs-plugin container. The Kubelet ensures these EBS volumes appear as block devices or mounted filesystem directories inside the container workloads, by triggering CSI RPCs against the ebs-plugin's node service. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit gRPC not RPC
4. Kubernetes sleeps for some time. | ||
1. csi-attacher calls ControllerPublishVolume(vol1, nodeA) ), i.e. "attach vol1 to nodeA". | ||
2. ebs-plugin checks vol1 and sees it's not attached to nodeA yet. It calls EC2 AttachVolume(vol1, nodeA). | ||
3. The attachment takes a long time, RPC times out. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: RPC -> gRPC
## High level overview of CSI calls | ||
|
||
TODO Question: Thoughts on turning this into a Markdown table? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like this idea im game for it.
|
||
AWS exposes one unified ModifyVolume API to change the size, volume-type, IOPS, or throughput of your volume. AWS imposes a 6-hour cooldown after a successful volume modification. | ||
|
||
However, the CSI Specification exposes two separate RPCs that rely on ebs-plugin calling this EC2 ModifyVolume API: ControllerExpandVolume, for increasing volume size, and ControllerModifyVolume, for all other volume modifications. To avoid the 6-hour cooldown, we coalesce these separate expansion and modification requests by waiting for up to two seconds, and then perform one merged EC2 ModifyVolume API Call. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: RPCs -> gRPCs
Thank you for doing these improvements to the docs ! One high level comment that I also added inline comments about is we use RPC alot in these docs but we actually mean gRPC which is slightly different as it is a more optimized implementation of RPC RPC Versus gRPC |
I was following the convention laid out in spec/spec.md where I use RPC anytime I mean "magic call that may happen on another computer, but written as if it were a local procedure call", and gRPC is more of an implementation detail (the specific implementation of RPC that CSI Spec relies upon). But happy to change this if other folks agree. Perhaps a repository glossary page is in order, like SOCI Snapshotter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Largely lgtm. Especially the sequence diagrams, this revision significantly elevates the doc.
/lgtm!
#### Not Implemented | ||
|
||
- ListVolumes | ||
- GetCapacity | ||
- ControllerGetVolume |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest removing this snippet since its duplicating information.
#### ValidateVolumeCapabilities | ||
|
||
Check whether access mode is supported for each capability | ||
|
||
#### ListVolumes | ||
|
||
Not implemented in the initial release, Kubernetes does not need it. | ||
Not implemented, Kubernetes does not need it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np: Not implemented, Kubernetes does not need it.
-> Not implemented.
The perfect CSI driver should be stateless. After start, it should recover its state by observing the actual status of AWS (i.e. describe instances / volumes). Current cloud provider follows this approach, however there are some corner cases around restarts when Kubernetes can try to attach two volumes to the same device on a node. | ||
|
||
When the stateless driver is not possible, it can use some persistent storage outside of the driver. Since the driver should support multiple Container Orchestrators (like Mesos), it must not use Kubernetes APIs. It should use AWS APIs instead to persist its state if needed (like AWS DynamoDB). We assume that costs of using such db will be negligible compared to rest of Kubernetes. | ||
The ideal CSI driver is stateless. After start, it should recover its state by observing the actual status of AWS (i.e. describe instances / volumes). | ||
|
||
### No credentials on nodes | ||
|
||
General security requirements we follow in Kubernetes is "if a node gets compromised then the damage is limited to the node". Paranoid people typically dedicate handful of nodes in Kubernetes cluster as "infrastructure nodes" and dedicate these nodes to run "infrastructure pods" only. Regular users can't run their pods there. CSI attacher and provisioner is an example of such "infrastructure pod" - it need permission to create/delete any PV in Kubernetes and CSI driver running there needs credentials to create/delete volumes in AWS. | ||
|
||
There should be a way how to run the CSI driver (=container) in "node mode" only. Such driver would then respond only to node service RPCs and it would not have any credentials to AWS (or very limited credentials, e.g. only to Describe things). Paranoid people would deploy CSI driver in "node only" mode on all nodes where Kubernetes runs user containers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on also removing the "No credentials on nodes" section? (which I understand has been in here for several years). It adds an arbitrary restriction, its not actually a design requirement. At the very least, please replace "Paranoid people" with "security conscious users".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'm inclined to keep this section because I believe it lays out a few great points on why daemonset pod should deploy ebs-plugin in node mode, and why daemonset pods should have limited credentials.
Will replace 'paranoid' wording though. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am still a bit confused about the second paragraph of this chapter. Why should people run an EBS CSI driver in "node only" mode which is not able to provision any EBS volume (because it only has Describe* permissions)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both the EBS CSI Driver Controller Deployment and Node Daemonset pods rely on the same ebs-plugin
container. The difference is that in the controller pod we set the plugin's mode to controller
, and in the node pods we set the plugin's mode to node
. For example, this node mode is set on Daemonset's ebs-plugin container in the helm template here
Because the node pods set the ebs-plugin container to node mode, it won't respond to Controller Service RPCs (ControllerCreateVolume).
Today, we also don't give any of the daemonset's pods access to the EBS CSI Controller IAM Role by associating the node pods with a different ServiceAccount than the controller pods, because the Node Service RPCs like NodeStageVolume don't need to talk to the EC2 API.
Finally, we ensure only the EBS CSI Controller pods have the higher-risk K8s RBAC for actions like patching PV and VolumeAttachment resources.
Because of the 3 points above, a security conscious customer can ensure that the EBS CSI Controller pods are only scheduled on extra hardened nodes. If an intruder gains access to any of the other nodes on the cluster they wouldn't be able to create/delete EBS volumes via the ebs-csi-node
pod, or mess with the cluster's storage resources (PVs, VAs). At least, that's my understanding of the 'why'.
Did I understand your question correctly @youwalther65?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly. Great, I'd like to see this answer in the doc itself!
``` | ||
|
||
#### Probe | ||
|
||
- Check that the driver is configured and it can do simple AWS operations, e.g. describe volumes or so. | ||
- This call is used by Kubernetes liveness probe to check that the driver is healthy. It's called every ~10 seconds, so it should not do anything "expensive" or time consuming. (10 seconds are configurable, we can recommend higher values). | ||
- Check that the driver is configured can do simple AWS operations, e.g. describe volumes or so. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't actually true (and I don't think ever was?) and should be dropped
|
||
#### GetCapacity | ||
|
||
Not implemented in the initial release, Kubernetes does not need it. | ||
Not implemented, Kubernetes does not need it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, drop "Kubernetes does not need it"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, there are so many RPCs that we don't implement (e.g. anything related to group snapshot) we should probably drop any not implemented RPCs from the doc rather than trying to list them all.
|
||
If you remember one thing from this document, remember that: | ||
- ebs-csi-controller interacts with K8s storage resources, calls AWS EC2 APIs | ||
- ebs-csi-node runs on every Node, used by Kubelet to perform privileged operations on attached storage devices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop this paragraph entirely, it is just a duplicate of information already listed in a "high level summary".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a short summary that we want to re-iterate because of how essential it is. When ramping up folks on the EBS CSI Driver, I've had to re-iterate this point several times because it is not obvious.
I can rename the section to high-level overview if that helps, but I value keeping this here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The wording might be misleading a bit and some folks may think these are containers . Probably something like:
EBS CSI controller component runs as a K8s Deployment called ebs-csi-controller and interacts ...
EBS CSI node component runs as a K8s DaemonSet called ebs-csi-node on every node, ...
|
||
- CreateVolume call must first check that the requested EBS volume has been already provisioned and return it if so. It should create a new volume only when such volume does not exist. | ||
- ControllerPublish (=i.e. attach) does not do anything and returns "success" when given volume is already attached to requested node. | ||
- DeleteVolume does not do anything and returns success when given volume is already deleted (i.e. it does not exist, we don't need to check that it had existed and someone really deleted it) | ||
|
||
Note that it's task of the CSI driver to make these calls idempotent if related AWS API call is not. | ||
Note that it's task of the ebs-plugin to make these calls idempotent even if the related AWS API call is not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop the first half of this diff, ebs-plugin
is a confusing term and means the same thing as "CSI driver". We should be using what is essentially an internal term with no meaning, when "CSI Driver" is the official and well understood term.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I politely initially disagree with this feedback. I think the phrasing 'ebs-plugin' is clearer because that is what the container is called within our driver. Also the upstream csi driver documentation refers to this component (that gets called by sidecars and interacts with storage provider) as driver-plugin, no?
Calling both the whole deployment, and the container itself the EBS CSI Driver confuses many operators new to our driver, including me when I first started.
Open to hear more thoughts though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, sometimes CSI documentation does use plugin to refer to sidecars and company
Node Plugin
The node component should be deployed on every node in the cluster through a DaemonSet. It consists of the CSI driver that implements the CSI Node service and the node-driver-registrar sidecar container
Is there a better term we can use to differentiate? "ebs-plugin container"?
Or do we just call it "CSI Node/Controller service"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling both the whole deployment, and the container itself the EBS CSI Driver confuses many operators new to our driver, including me when I first started.
I would agree this could be confusing, but it is the former ("the whole deployment"), not the latter ("the container itself") that is inaccurate and causing the confusion. I don't think this document describes the Deployment
or DaemonSet
anywhere as "the EBS CSI Driver" with no qualifications, but if it does that language should be altered.
Using the term ebs-plugin
causes confusion because it is not the standard terminology. It is important to get the terminology right here because the term "CSI Driver" is used this way extensively throughout the community and upstream code/documentation (e.g. https://kubernetes-csi.github.io/docs).
Depending on how the EBS CSI Driver is deployed, it may not even have an ebs-plugin
container. For example, here is how the semi-popular Gardener tool deploys the driver, they use the name csi-driver
for the driver container: https://github.com/gardener/gardener-extension-provider-aws/blob/master/charts/internal/shoot-system-components/charts/csi-driver-node/templates/daemonset.yaml#L43
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me include a short glossary/terminology ala https://github.com/awslabs/soci-snapshotter/blob/main/docs/glossary.md in a future PR where we can make this distinction crisp.
Using the term ebs-plugin causes confusion because it is not the standard terminology.
On the other hand the term 'ebs-plugin container' here can be helpful because it lets operators know exactly which container to look at for troubleshooting. In the glossary section, we can mention that as of v1.37.0, our csi driver is deployed as the ebs-plugin
container by default.
What's confusing to newcomers here is that the whole package of Deployment + Daemonset is often referred to as the EBS CSI Driver add-on. Perhaps a glossary can remedy this.
Depending on how the EBS CSI Driver is deployed, it may not even have an ebs-plugin container. For example Gardener.
Do they even deploy the EBS CSI Driver Controller? I consider installation methods not explicitly approved by this repository as out of scope for this document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A glossary is a great idea. For me, mentioning and explaining the role of the ebs-plugin container in both controller deployment and node DaemonSet , is helpful. And it's used in the graphs as well!
|
||
### Timeouts | ||
|
||
gRPC always passes a timeout together with a request. After this timeout, the gRPC client call actually returns. The server (=CSI driver) can continue processing the call and finish the operation, however it has no means how to inform the client about the result. | ||
gRPC always passes a timeout together with a request. After this timeout, the gRPC client call actually returns. The server (i.e. ebs-plugin) can continue processing the call and finish the operation, however it has no means how to inform the client about the result. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto above, drop this diff.
#### NodePublishVolume | ||
|
||
Just bind-mount the volume. | ||
|
||
#### NodeUnstageVolume | ||
|
||
Just unmount the volume. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand you didn't add it, but suggest dropping "Just" from all these sections.
If the attached volume has been formatted with a filesystem, resize the filesystem. | ||
|
||
#### NodeGetVolumeStats | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empty section?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to leave a TODO here.
Might add the following: "Returns the amount of available/total/used bytes and inodes for a given volume.
|
||
Kubernetes will retry failed calls, usually after some exponential backoff. Kubernetes heavily relies on idempotency here - i.e. when the driver finished an operation after the client timed out, the driver will get the same call again and it should return success/error based on success/failure of the previous operation. | ||
Kubernetes sidecars will retry failed calls after exponential backoff. These sidecars rely on idempotency here - i.e. when ebs-plugin finished an operation after the client timed out, the ebs-plugin will get the same call again, and it should return success/error based on success/failure of the previous operation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto above, drop the "driver" -> "ebs-plugin" part of this diff.
Very rarely a node gets too busy and kubelet starves for CPU. It does not unmount a volume when it should and Kubernetes initiates detach of the volume. | ||
## EBS CSI Driver on Kubernetes | ||
|
||
### High Level Summary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section is dangerously close to documenting CSI and/or CSI sidecar design rather than EBS CSI design. For example, if the information here can be found in the CSI spec's architecture section (https://github.com/container-storage-interface/spec/blob/master/spec.md#architecture) we are better off linking there than making our own version.
activate p | ||
note over p: Wait up to 2s for other RPC | ||
s->>p: ControllerModifyVolume RPC | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to remember to add a describevolumes call here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And that modification state could be either optimizing OR completed
What type of PR is this?
/kind documentation
What is this PR about? / Why do we need it?
This PR updates docs/design.md so that it is no longer out of date. It also provides a high-level overview of both components of the EBS CSI Driver and adds a few sequence diagrams to help visualize the most common driver workflows.
Note: Look at PR in 'rich diff' mode to see new sequence diagrams.
How was this change tested?
n/a
Does this PR introduce a user-facing change?