diff --git a/docs/design.md b/docs/design.md
index 96c3ffae29..08c456169f 100644
--- a/docs/design.md
+++ b/docs/design.md
@@ -1,78 +1,116 @@
# AWS EBS CSI Driver
-## Problems with current in-tree cloud provider
-### Cache of used / free device names
-
-On AWS, it's the client who [must assign device names](https://aws.amazon.com/premiumsupport/knowledge-center/ebs-stuck-attaching/) to volumes when calling AWS.AttachVolume. At the same time, AWS [imposes some restrictions on the device names](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/device_naming.html).
-
-Therefore Kubernetes AWS volume plugin maintains cache of used / free device names for each node. This cache is lost when controller-manager process restarts. We try to populate the cache during startup, however there are some corner cases when this fails. TODO: exact flow how we can get wrong cache.
-
-It would be great if either AWS itself assigned the device names, or there would be robust way how to restore the cache after restart, e.g. using some persistent database. Kubernetes should not care about the device names at all.
-
-### DescribeVolumes quota
-
-In order to attach/detach volume to/from a node, current AWS cloud provider issues AWS.AttachVolume/DetachVolume call and then it polls DescribeVolume until the volume is attached or detached. The frequency of DescribeVolume is quite high to minimize delay between AWS finishing attachment of the volume and Kubernetes discovering that. Sometimes we even hit API quota for these calls.
-
-It would be better if CSI driver could get reliable and fast event from AWS when a volume has become attached / detached.
-
-Or the driver could batch the calls and issue one big DescribeVolume call with every volume that's being attached/detached in it.
-
-### AWS API weirdness
-
-AWS API is quite different to all other clouds.
-
-- AWS.AttachVolume/DetachVolume can take ages (days, weeks) to complete. For example, when Kubernetes tries to detach a volume that's still mounted on a node, it will be Detaching until the volume is unmounted or Kubernetes issues force-detach. All other clouds return sane error relatively quickly (e.g. "volume is in use") or force-detach the volume.
-- AWS.DetachVolume with force-detach is useless. Documentation says: Forced detachment of a stuck volume can cause damage to the file system or the data it contains or an inability to attach a new volume using the same device name, unless you reboot the instance.
- - We cannot reboot instance after each force-detach nor we can afford to "loose" a device name. AWS supports only 40 volumes per node and even that is quite low number already.
-- AWS.CreateVolume is not idempotent. There is no way to create a volume with ID provided by user. Such call would then fail when such volume already exists.
-- AWS.CreateVolume does not return errors when creating an encrypted volume using either non-existing or non-available KMS key (e.g. with wrong permission). It returns success instead and it even returns volumeID of some volume! This volume exists for a short while and it's deleted in couple of seconds.
-### Errors with slow kubelet
-
-Very rarely a node gets too busy and kubelet starves for CPU. It does not unmount a volume when it should and Kubernetes initiates detach of the volume.
+## EBS CSI Driver on Kubernetes
+
+### High Level Summary
+
+The EBS CSI Driver is used by Container Orchestrators to manage the lifecycle of AWS Elastic Block Store (EBS) volumes. It complies with the [Container Storage Interface Specification](https://github.com/container-storage-interface/spec/blob/master/spec.md).
+
+On Kubernetes (K8s), the driver is split into two components: the ebs-csi-controller Deployment and the ebs-csi-node DaemonSet.
+
+The ebs-csi-controller watches K8s storage resources for changes, acts on those changes by making AWS EC2 API calls, and then updates those K8s resources once all associated EBS volumes reflect those changes. Inside the ebs-csi-controller pod are several CSI sidecar containers that interact with Kubernetes resources, as well as the ebs-plugin container which manages EBS Volumes via AWS API calls. The sidecars trigger these EBS volume operations by making CSI Remote Procedure Calls (RPCs) against the ebs-plugin's controller service.
+
+The ebs-csi-node DaemonSet ensures the Kubelet can manage the attached EBS storage devices via the ebs-plugin container. The Kubelet ensures these EBS volumes appear as block devices or mounted filesystem directories inside the container workloads, by triggering CSI RPCs against the ebs-plugin's node service.
+
+If you remember one thing from this document, remember that:
+- ebs-csi-controller interacts with K8s storage resources and calls AWS EC2 APIs.
+- ebs-csi-node runs on every Node and is used by the Kubelet to perform privileged operations on attached storage devices.
+
+To illustrate these relationships between Kubernetes, the EBS CSI Driver, and AWS, we can look at what happens when you dynamically provision a volume for a given stateful workload:
+
+```mermaid
+sequenceDiagram
+ actor o as Operator
+ participant k as Kubernetes
+ participant c as ebs-csi-controller
+ participant a as AWS EC2 API
+ participant n as ebs-csi-node
+ participant os as Node OS
+
+ o->>k: Create PVC + Pod
+ activate k
+
+ note over k: Pod 'Pending'<br/>PVC 'Pending'
+ k->>c: Notifies about PVC
+ activate c
+ c->>a: Ensure volume created
+ a-->>c:
+ c-->>k: Create PV handle=<br/>PVC and PV 'bound'
+ deactivate c
+
+ note over k: Pod 'Scheduled'<br/>VolumeAttachment (VA) created
+ k->>c: Notify about VolumeAttachment (VA)
+ activate c
+ c->>a: Ensure volume attached
+ a-->>c:
+ c-->>k: Update VA:<br/>Attached=true<br/>DevicePath='/dev/xvdaa'
+ deactivate c
+
+ note over k: Pod 'ContainerCreating'
+ k->>n: Kubelet Triggers NodeStageVolume
+ activate n
+ opt If Filesystem
+ n->>os: Format/fsck device
+ os-->>n:
+ end
+ n->>os: Mount to '/var/lib/kubelet//volumes/'
+ os-->>n:
+ n-->>k: Kubelet update Node.volumesAttached
+ deactivate n
+
+ k->>n: Kubelet Triggers NodePublishVolume
+ activate n
+ n->>os: Bind-mount to container's 'mountPath'
+ os-->>n:
+ n-->>k: Kubelet update Node.volumesInUse
+ deactivate n
+
+ note over k: Pod 'Running'
+ k-->>o: Stateful workload running
+ deactivate k
+```
-## Requirements
+## CSI Driver Design Requirements
### Idempotency
-All CSI driver calls should be idempotent. A CSI method call with the same parameters must always return the same result. It's task of CSI driver to ensure that. Examples:
+All CSI calls should be idempotent. The CSI plugin must ensure that a CSI call with the same parameters will always return the same result. Examples:
- CreateVolume call must first check that the requested EBS volume has been already provisioned and return it if so. It should create a new volume only when such volume does not exist.
- ControllerPublish (=i.e. attach) does not do anything and returns "success" when given volume is already attached to requested node.
- DeleteVolume does not do anything and returns success when given volume is already deleted (i.e. it does not exist, we don't need to check that it had existed and someone really deleted it)
-Note that it's task of the CSI driver to make these calls idempotent if related AWS API call is not.
+Note that it is the task of the ebs-plugin to make these calls idempotent even if the related AWS API call is not.
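
The check-then-act pattern from the examples above can be sketched as follows. This is a minimal illustration, not the driver's actual code: `FakeEC2` is a hypothetical in-memory stand-in for the AWS API, and looking a volume up by a name tag is an assumption made for the example.

```python
import itertools


class FakeEC2:
    """Hypothetical in-memory stand-in for EC2: every create call makes a NEW volume."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.volumes = {}  # volume_id -> {"name_tag": str, "size_gib": int}

    def create_volume(self, name_tag, size_gib):
        volume_id = f"vol-{next(self._ids):04d}"
        self.volumes[volume_id] = {"name_tag": name_tag, "size_gib": size_gib}
        return volume_id

    def find_by_name_tag(self, name_tag):
        for volume_id, volume in self.volumes.items():
            if volume["name_tag"] == name_tag:
                return volume_id
        return None

    def delete_volume(self, volume_id):
        del self.volumes[volume_id]  # raises KeyError if the volume is missing


def create_volume(cloud, name, size_gib):
    """Idempotent wrapper: return the existing volume if one was already provisioned."""
    existing = cloud.find_by_name_tag(name)
    if existing is not None:
        return existing
    return cloud.create_volume(name, size_gib)


def delete_volume(cloud, volume_id):
    """Idempotent wrapper: deleting a volume that does not exist is a success."""
    if volume_id in cloud.volumes:
        cloud.delete_volume(volume_id)
    return "success"
```

Calling `create_volume` twice with the same name returns the same volume ID, and calling `delete_volume` twice returns success both times, even though the underlying fake API is not idempotent.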
### Timeouts
-gRPC always passes a timeout together with a request. After this timeout, the gRPC client call actually returns. The server (=CSI driver) can continue processing the call and finish the operation, however it has no means how to inform the client about the result.
+gRPC always passes a timeout together with a request. After this timeout, the gRPC client call returns. The server (i.e. ebs-plugin) can continue processing the call and finish the operation, but it has no means of informing the client about the result.
-Kubernetes will retry failed calls, usually after some exponential backoff. Kubernetes heavily relies on idempotency here - i.e. when the driver finished an operation after the client timed out, the driver will get the same call again and it should return success/error based on success/failure of the previous operation.
+Kubernetes sidecars will retry failed calls after exponential backoff. These sidecars rely on idempotency here: when ebs-plugin finishes an operation after the client has timed out, it will receive the same call again and should return success or error based on the outcome of the previous operation.
Example:
-1. Kubernetes calls ControllerPublishVolume(vol1, nodeA) ), i.e. "attach vol1 to nodeA".
-2. The CSI driver checks vol1 and sees it's not attached to nodeA yet. It calls AttachVolume(vol1, nodeA).
-3. The attachment takes a long time, Kubernetes times out.
-4. Kubernetes sleeps for some time.
+1. csi-attacher calls ControllerPublishVolume(vol1, nodeA), i.e. "attach vol1 to nodeA".
+2. ebs-plugin checks vol1 and sees it's not attached to nodeA yet. It calls EC2 AttachVolume(vol1, nodeA).
+3. The attachment takes a long time and the RPC times out.
+4. csi-attacher sleeps for some time.
5. AWS finishes attaching of the volume.
-6. Kubernetes re-issues ControllerPublishVolume(vol1, nodeA) again.
-7. The CSI driver checks vol1 and sees it is attached to nodeA and returns success immediately.
+6. csi-attacher re-issues ControllerPublishVolume(vol1, nodeA).
+7. ebs-plugin checks vol1 and sees it is attached to nodeA and returns success immediately.
Note that there are some issues:
-- Kubernetes can change its mind at any time. E.g. a user that wanted to run a pod on the node in the example got impatient so he deleted the pod at step 4. In this case Kubernetes will call ControllerUnpublishVolume(vol1, nodeA) to "cancel" the attachment request. It's up to the driver to do the right thing - e.g. wait until the volume is attached and then issue detach() and wait until the volume is detached and \*then\* return from
-- Note that Kubernetes may time out waiting for ControllerUnpublishVolume too. In this case, it will keep calling it until it gets confirmation from the driver that the volume has been detached (i.e. until the driver returns either success or non-timeout error) or it needs the volume attached to the node again (and it will call ControllerPublishVolume in that case).
+- Kubernetes can change its mind at any time. E.g. a user who wanted to run a pod on the node in the example got impatient and deleted the pod at step 4. In this case csi-attacher will call ControllerUnpublishVolume(vol1, nodeA) to "cancel" the attachment request. It's up to the ebs-plugin to do the right thing - e.g. wait until the volume is attached and then issue detach() and wait until the volume is detached and \*then\* return from
+- Note that Kubernetes may time out waiting for ControllerUnpublishVolume too. In this case, it will keep calling it until it gets confirmation from the driver that the volume has been detached (i.e. until ebs-plugin returns either success or non-timeout error) or it needs the volume attached to the node again (and it will call ControllerPublishVolume in that case).
- The same applies to NodeStage and NodePublish calls ("mount device, mount volume"). These are typically much faster than attach/detach, still they must be idempotent when it comes to timeouts.
-It looks complicated, but it should be actually simple - always check that if the required operation has been already done
+In summary, always check whether the required operation has already been done.
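
The seven-step example above can be simulated with a short sketch. Everything here is hypothetical: `FakeCloud`, `DeadlineExceeded`, and `attach_with_retries` are illustrative names, and the timeout is modeled by always raising while the volume is still attaching.

```python
class DeadlineExceeded(Exception):
    """Stands in for a gRPC deadline expiring on the client side."""


class FakeCloud:
    """Hypothetical cloud where an attachment only completes some time after the call."""

    def __init__(self):
        self.state = {}  # (volume, node) -> "attaching" | "attached"
        self.attach_calls = 0

    def attach_volume(self, volume, node):
        self.attach_calls += 1
        self.state[(volume, node)] = "attaching"

    def finish_pending(self):
        # Simulates AWS finishing every in-flight attachment (step 5).
        for key in self.state:
            self.state[key] = "attached"


def controller_publish(cloud, volume, node):
    """Check first, act only if needed, so a retried call is harmless."""
    state = cloud.state.get((volume, node))
    if state == "attached":
        return "success"  # step 7: already done, return immediately
    if state is None:
        cloud.attach_volume(volume, node)  # step 2: first call starts the attach
    # Simplification: the attach is slow, so the RPC always times out here.
    raise DeadlineExceeded(f"{volume} still attaching to {node}")


def attach_with_retries(cloud, volume, node, max_attempts=5):
    """Caller side: retry the same RPC until it succeeds."""
    for attempt in range(1, max_attempts + 1):
        try:
            return controller_publish(cloud, volume, node), attempt
        except DeadlineExceeded:
            cloud.finish_pending()  # stands in for "caller sleeps, AWS finishes"
    raise RuntimeError("gave up waiting for attachment")
```

In this simulation the second attempt succeeds and the cloud sees exactly one attach call, which is the property the retry logic depends on.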
### Restarts
The CSI driver should survive its own crashes or reboots of the node where it runs. For the controller service, Kubernetes will either start a new driver on a different node or re-elect a new leader of stand-by drivers. For the node service, Kubernetes will start a new driver shortly.
-The perfect CSI driver should be stateless. After start, it should recover its state by observing the actual status of AWS (i.e. describe instances / volumes). Current cloud provider follows this approach, however there are some corner cases around restarts when Kubernetes can try to attach two volumes to the same device on a node.
-
-When the stateless driver is not possible, it can use some persistent storage outside of the driver. Since the driver should support multiple Container Orchestrators (like Mesos), it must not use Kubernetes APIs. It should use AWS APIs instead to persist its state if needed (like AWS DynamoDB). We assume that costs of using such db will be negligible compared to rest of Kubernetes.
+The ideal CSI driver is stateless. After start, it should recover its state by observing the actual status of AWS (i.e. describe instances / volumes).
### No credentials on nodes
@@ -80,8 +118,16 @@ General security requirements we follow in Kubernetes is "if a node gets co
There should be a way how to run the CSI driver (=container) in "node mode" only. Such driver would then respond only to node service RPCs and it would not have any credentials to AWS (or very limited credentials, e.g. only to Describe things). Paranoid people would deploy CSI driver in "node only" mode on all nodes where Kubernetes runs user containers.
+### Cache of used / free device names
+
+On AWS, it's the client who [must assign device names](https://aws.amazon.com/premiumsupport/knowledge-center/ebs-stuck-attaching/) to volumes when calling AWS.AttachVolume. At the same time, AWS [imposes some restrictions on the device names](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/device_naming.html).
+
+Therefore, the ebs-plugin maintains a cache of used / free device names for each node. This cache is lost when the container restarts. We try to populate the cache during startup; however, there are some corner cases where this fails. TODO: document the exact flow that can lead to a wrong cache.
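
A per-node device name cache of this kind might look like the sketch below. It is an illustration under assumptions: `DeviceNameCache` and its methods are hypothetical names, and `CANDIDATE_NAMES` is an illustrative pool only (the names AWS actually allows are described in the device naming documentation linked above).

```python
import string

# Illustrative pool only; not the authoritative list of names AWS accepts.
CANDIDATE_NAMES = [f"/dev/xvda{c}" for c in string.ascii_lowercase]


class DeviceNameCache:
    """Hypothetical in-memory cache of device names in use on each node."""

    def __init__(self):
        self._in_use = {}  # node -> {device_name: volume_id}

    def rebuild(self, node, attached):
        """Repopulate after a restart from observed attachments (device_name -> volume_id)."""
        self._in_use[node] = dict(attached)

    def assign(self, node, volume_id):
        """Return the name already held by this volume, else the first free one."""
        used = self._in_use.setdefault(node, {})
        for name, holder in used.items():
            if holder == volume_id:
                return name  # idempotent: same volume gets the same name
        for name in CANDIDATE_NAMES:
            if name not in used:
                used[name] = volume_id
                return name
        raise RuntimeError(f"no free device names left on {node}")

    def release(self, node, volume_id):
        """Mark the name held by this volume as free again."""
        used = self._in_use.get(node, {})
        for name, holder in list(used.items()):
            if holder == volume_id:
                del used[name]
```

Because the cache lives in memory, `rebuild` has to be fed from the cloud's view of each instance after a restart; the corner cases mentioned above arise when that view is incomplete or stale.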
+
## High level overview of CSI calls
+TODO Question: Thoughts on turning this into a Markdown table?
+
### Identity Service RPC
#### GetPluginInfo
@@ -90,7 +136,7 @@ Blindly return:
```
Name: ebs.csi.aws.com
- VendorVersion: 0.x.y
+ VendorVersion: 1.x.y
```
#### GetPluginCapabilities
@@ -101,12 +147,13 @@ Blindly return:
Capabilities:
- CONTROLLER_SERVICE
- ACCESSIBILITY_CONSTRAINTS
+ - ...
```
#### Probe
-- Check that the driver is configured and it can do simple AWS operations, e.g. describe volumes or so.
-- This call is used by Kubernetes liveness probe to check that the driver is healthy. It's called every ~10 seconds, so it should not do anything "expensive" or time consuming. (10 seconds are configurable, we can recommend higher values).
+- Check that the driver is configured and can perform simple AWS operations, e.g. describe volumes.
+- This call is used by the Kubernetes liveness probe to check that the driver is healthy. It's called every ~10 seconds, so it should not do anything "expensive" or time-consuming. (The 10-second interval is configurable; we can recommend higher values.)
### Controller Service RPC
@@ -114,15 +161,33 @@ Blindly return:
Checks that the requested volume was not created yet and creates it.
-- Idempotency: several calls with the same name parameter must return the same volume. We can store this name in volume tags in case the driver crashes after CreateVolume call and before returning a response. In other words:
- - The driver first looks for an existing volume with tag CSIVolumeName=. It returns it if it's found.
- - When such volume is not found, it calls CreateVolume() to create the required volume with tag CSIVolumeName=
- - _Is this robust enough? Can this happen on AWS?_
- 1. A driver calls CreateVolume() and dies before the new volume is created.
- 2. New driver quickly starts, gets the same CreateVolume call, checks that there is no volume with given tag (previous CreateVolume() from step 1. has not finished yet) and issues a new CreateVolume().
- 3. Both AWS.CreateVolume() calls succeed -> the driver has provisioned 2 volumes for one driver.CreateVolume call.
- Snapshot: if creating volume from snapshot, read the snapshot ID from request.
+```mermaid
+sequenceDiagram
+ participant s as CSI Provisioner
+ participant d as EBS Plugin
+ participant e as AWS API
+
+ s->>d: CreateVolume RPC
+ activate d
+
+ d->>d: Parse Request + Ensure Idempotency
+
+ d->>e: EC2 CreateVolume
+ activate e
+ e-->>d: Get volumeID<br/>state == 'creating'
+
+
+ d->>e: Poll EC2 DescribeVolumes
+ note over e: 1-3 Seconds
+ e-->>d: Volume state == 'available'
+ deactivate e
+
+ d-->>s: Return Volume Response
+ deactivate d
+```
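
The polling step in the diagram above can be sketched as a small helper. This is a hedged illustration: `wait_for_state` is a hypothetical name, `describe` is an injected callable standing in for a DescribeVolumes lookup, and the fixed delay is an assumption (a real implementation would likely use backoff).

```python
import time


def wait_for_state(describe, volume_id, wanted, attempts=10, delay=1.0, sleep=time.sleep):
    """Poll describe(volume_id) until it reports the wanted state.

    describe: callable returning the current state string for a volume.
    sleep: injectable so the loop can be exercised without real waiting.
    """
    for _ in range(attempts):
        state = describe(volume_id)
        if state == wanted:
            return state
        sleep(delay)
    raise TimeoutError(f"{volume_id} did not reach state {wanted!r}")
```

Injecting `describe` and `sleep` keeps the helper independent of any particular AWS client and makes the retry count easy to verify.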
+
#### DeleteVolume
Checks if the required volume exists and is "available" (not attached anywhere) and deletes it if so. Returns success if the volume can't be found. Returns error if the volume is attached anywhere.
@@ -133,21 +198,65 @@ Checks if the required volume exists and is "available" (not attached
- Checks that given volume is available (i.e. not attached to any other node) and returns error if it is attached.
- Chooses the right device name for the volume on the node (more on that below) and issues AttachVolume. TODO: this has complicated idempotency expectations. It cancels previously called ControllerUnpublishVolume that may be still in progress (i.e. AWS is still detaching the volume and Kubernetes now wants the volume to be attached back).
+```mermaid
+sequenceDiagram
+ participant s as CSI Attacher
+ participant d as EBS Plugin
+ participant e as AWS API
+
+ s->>d: ControllerPublishVolume RPC
+ activate d
+
+ d->>d: Parse Request + Ensure Idempotency
+
+ d->>e: EC2 DescribeInstances
+
+ e-->>d: Get instanceID + all device names
+
+ d->>d: Assign likely unused device name
+
+ d->>e: EC2 AttachVolume
+ activate e
+ e-->>d: Volume state == 'attaching'
+
+ d->>e: Poll EC2 DescribeVolumes
+ note over e: 2+ seconds
+ e-->>d: Volume state == 'in-use'
+ deactivate e
+
+ d-->>s: Return device path
+ deactivate d
+```
+
#### ControllerUnpublishVolume
Checks that given volume is not attached to given node. Returns success if so. Issues AWS.DetachVolume and marks the detached device name as free (more on that below). TODO: this has complicated idempotency expectations. It cancels previously called ControllerPublishVolume (i.e.AWS is still attaching the volume and Kubernetes now wants the volume to be detached).
+#### ControllerExpandVolume
+
+Checks that the given volume has not been expanded yet, calls EC2 ModifyVolume, and ensures the modification enters the 'optimizing' state.
+
+Note: If ControllerModifyVolume is triggered within 2 seconds of ControllerExpandVolume, they will share an EC2 ModifyVolume call.
+
+#### ControllerModifyVolume
+
+Checks whether the volume needs modification, calls EC2 ModifyVolume, and ensures the modification enters the 'optimizing' state.
+
+- If tags need to be created/modified, call EC2 CreateTags.
+
+Note: If ControllerExpandVolume is triggered within 2 seconds of ControllerModifyVolume, they will share an EC2 ModifyVolume call.
+
#### ValidateVolumeCapabilities
Check whether access mode is supported for each capability
#### ListVolumes
-Not implemented in the initial release, Kubernetes does not need it.
+Not implemented; Kubernetes does not need it.
#### GetCapacity
-Not implemented in the initial release, Kubernetes does not need it.
+Not implemented; Kubernetes does not need it.
#### ControllerGetCapabilities
@@ -157,19 +266,26 @@ Blindly return:
rpc:
- CREATE\_DELETE\_VOLUME
- PUBLISH\_UNPUBLISH\_VOLUME
+ - ...
```
#### CreateSnapshot
-Not implemented yet.
+Creates a new snapshot from a source volume.
#### DeleteSnapshot
-Not implemented yet.
+Deletes a snapshot.
#### ListSnapshots
-Not implemented yet.
+Lists all snapshots managed by the EBS CSI Driver.
+
+#### Not Implemented
+
+- ListVolumes
+- GetCapacity
+- ControllerGetVolume
### Node Service RPC
@@ -183,18 +299,24 @@ Not implemented yet.
Steps 3 and 4 can take some time, so the driver must ensure idempotency somehow.
-#### NodeUnstageVolume
-
-Just unmount the volume.
-
#### NodePublishVolume
Just bind-mount the volume.
+#### NodeUnstageVolume
+
+Just unmount the volume.
+
#### NodeUnpublishVolume
Just unmount the volume.
+#### NodeExpandVolume
+
+If the attached volume has been formatted with a filesystem, resize the filesystem.
+
+#### NodeGetVolumeStats
+
#### NodeGetInfo
Blindly return:
@@ -217,6 +339,66 @@ Blindly return:
- STAGE\_UNSTAGE\_VOLUME
```
+## Coalescing ControllerExpandVolume & ControllerModifyVolume
+
+### EC2 ModifyVolume and Request Coalescing
+
+AWS exposes one unified ModifyVolume API to change the size, volume-type, IOPS, or throughput of your volume. AWS imposes a 6-hour cooldown after a successful volume modification.
+
+However, the CSI Specification exposes two separate RPCs that rely on ebs-plugin calling this EC2 ModifyVolume API: ControllerExpandVolume, for increasing volume size, and ControllerModifyVolume, for all other volume modifications. To avoid the 6-hour cooldown, we coalesce these separate expansion and modification requests by waiting for up to two seconds and then performing one merged EC2 ModifyVolume API call.
+
+Here is an overview of what may happen when you patch a PVC's size and VolumeAttributesClassName at the same time:
+
+```mermaid
+sequenceDiagram
+ participant o as Kubernetes
+ box ebs-csi-controller Pod
+ participant s as csi-resizer
+ participant p as ebs-plugin<br/>(controller service)
+ end
+ participant a as AWS API
+ participant n as ebs-plugin<br/>(node service)
+
+ o->>s: Updated PVC VACName + PVC capacity
+ activate o
+ activate s
+
+ s->>p: ControllerExpandVolume RPC
+ activate p
+ note over p: Wait up to 2s for other RPC
+ s->>p: ControllerModifyVolume RPC
+
+ p->>p: Merge Expand + Modify Requests
+ p->>a: EC2 CreateTags
+ a-->>p:
+ p->>a: EC2 ModifyVolume
+ activate a
+ a-->>p:
+ p->>a: Poll EC2 DescribeVolumeModifications
+ note over a: 1+ seconds
+ a-->>p: Volume state == 'optimizing'
+ deactivate a
+
+ p-->>s: EBS Volume Modified
+ s-->>o: Emit VolumeModify Success Event
+ p-->>s: EBS Volume Expanded
+ deactivate p
+
+ alt if Block Device
+ s-->>o: Emit ExpandVolume Success Event
+ else if Filesystem
+ s-->>o: Mark PVC as FSResizeRequired
+ deactivate s
+ o->>n: Kubelet triggers NodeExpandVolume RPC
+ activate n
+ n->>n: Online resize of FS
+ note over n: 1+ seconds
+ n-->>o: Resize Success
+ deactivate n
+ deactivate o
+ end
+```
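
The merge step can be illustrated with a simplified sketch. It is not the driver's implementation: `ModifyCoalescer`, `PendingModification`, and the injected `modify_volume` callable are hypothetical, time is passed in explicitly instead of using timers, and the sketch always waits the full window rather than "up to" two seconds.

```python
from dataclasses import dataclass, field
from typing import Optional

WINDOW_SECONDS = 2.0  # coalescing window described above


@dataclass
class PendingModification:
    first_seen: float
    size_gib: Optional[int] = None
    options: dict = field(default_factory=dict)


class ModifyCoalescer:
    """Hypothetical merger of expand and modify requests per volume."""

    def __init__(self, modify_volume):
        # modify_volume: callable(volume_id, size_gib, options), e.g. one EC2 ModifyVolume call.
        self._modify_volume = modify_volume
        self._pending = {}

    def _entry(self, volume_id, now):
        return self._pending.setdefault(volume_id, PendingModification(first_seen=now))

    def expand(self, volume_id, size_gib, now):
        self._entry(volume_id, now).size_gib = size_gib

    def modify(self, volume_id, options, now):
        self._entry(volume_id, now).options.update(options)

    def flush(self, now):
        """Issue one merged call for each volume whose window has elapsed."""
        due = [v for v, p in self._pending.items() if now - p.first_seen >= WINDOW_SECONDS]
        for volume_id in due:
            pending = self._pending.pop(volume_id)
            self._modify_volume(volume_id, pending.size_gib, pending.options)
        return len(due)
```

An expand request at t=0 and a modify request at t=0.5 for the same volume result in a single merged call once the window has elapsed, so only one modification counts against the cooldown.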
+
## Driver modes
Traditionally, you run the CSI controllers together with the EBS driver in the same Kubernetes cluster.
@@ -235,15 +417,3 @@ Example 2: `AWS_REGION=us-west-1 /bin/aws-ebs-csi-driver controller --extra-volu
- `node`: This will only start the node service of the CSI driver.\
Example: `/bin/aws-ebs-csi-driver node --endpoint=unix://...`
-
-## Custom volume limits
-
-For the Kubernetes in-tree volume provisioners (including the `kubernetes.io/aws-ebs` provisioner) it was possible for administrators to provide a custom volume limit overwrite (see https://kubernetes.io/docs/concepts/storage/storage-limits/#custom-limits).
-This solution is not working with CSI any longer.
-As part of [#347](https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/347) we discuss how we can implement a sophisticated computation of the volume attach limit per node (e.g., based on the used machine types and already attached network interfaces).
-However, it turns out that such optimal implementation is not easily achievable.
-Each AWS machine type has different volume limits.
-Today, the EBS CSI driver parses the machine type name and then decides the volume limit.
-Unfortunately, this is only a rough approximation and not good enough in most cases.
-In order to allow existing clusters that are leveraging/relying on this feature to migrate to CSI, the EBS CSI driver is supporting the `--volume-attach-limit` flag.
-Specifying the volume attach limit via command line is the alternative until a more sophisticated solution presents itself (dynamically discovering the maximum number of attachable volume per EC2 machine type).