From 2260819b982dd71bb9ecd84a49b8e95ce1e25a7c Mon Sep 17 00:00:00 2001
From: xing-yang
Date: Tue, 28 Jan 2020 03:45:13 +0000
Subject: [PATCH] Address review comments

---
 .../sig-storage/20190530-pv-health-monitor.md | 63 +++++++++----------
 1 file changed, 31 insertions(+), 32 deletions(-)

diff --git a/keps/sig-storage/20190530-pv-health-monitor.md b/keps/sig-storage/20190530-pv-health-monitor.md
index bd4a44eaaa60..ca0456cc572e 100644
--- a/keps/sig-storage/20190530-pv-health-monitor.md
+++ b/keps/sig-storage/20190530-pv-health-monitor.md
@@ -36,8 +36,8 @@ status: implementable
     - [Add Node Volume Health Function](#add-node-volume-health-function)
   - [External controller](#external-controller)
     - [CSI interface](#csi-interface)
-    - [Node failure event](#node-failure-event)
-  - [External agent](#external-agent)
+    - [Node down event](#node-down-event)
+  - [External node agent](#external-node-agent)
     - [CSI interface](#csi-interface-1)
   - [Alternatives](#alternatives)
     - [Alternative option 1](#alternative-option-1)
@@ -84,7 +84,7 @@ There could be conditions that cannot be reported by a CSI driver. There could b

 The Kubernetes components that monitor the volumes and report events with volume health information include the following:

-* An external monitoring agent on each kubernetes worker node.
+* An external monitoring node agent on each Kubernetes worker node.
 * An external monitoring controller on the master node.

 Details will be described in the following proposal section.
@@ -93,12 +93,6 @@ Details will be described in the following proposal section.

 Volume monitoring is the main focus of this proposal. Reactions are not in the scope of this proposal.

-- If a volume provisioned by the CSI driver is deleted, we need to report an event and log a message to inform users.
-
-- If mounting error occurs, we need to report an event and log a message.
-
-- If other errors occur, we also need to report an event and log a message.
-
 The main architecture is shown below:

 ![pv health monitor architecture](./pv-health-monitor.png)
@@ -116,11 +110,10 @@ Three main parts are involved here in the architecture.
     - Trigger controller RPC to check the health condition of the CSI volumes.
     - The external controller sidecar will also watch for node failure events.

-- External Agent:
-  - The external agent will be deployed as a sidecar together with the CSI node driver on every Kubernetes worker node.
+- External Node Agent:
+  - The external node agent will be deployed as a sidecar together with the CSI node driver on every Kubernetes worker node.
   - Trigger node RPC to check volume's mounting conditions.
-  - Note that currently we do not have CSI support for local storage. As a workaround, we may check the local volumes directly by the agent at first, and then move the checks to RPC interfaces when it is ready.
-
+  - Note that currently we do not have CSI support for local storage. When the support is available, we will implement the relevant CSI monitoring interfaces as well.

 ## Implementation

@@ -128,10 +121,10 @@ Three main parts are involved here in the architecture.
 Container Storage Interface (CSI) specification will be modified to provide volume health checks, leveraging existing RPCs and adding new ones.

-- GetVolume RPC
+- Add new GetVolume controller RPC
   - External controller calls a new `GetVolume` RPC to check the health condition of a particular PV instead of calling `ListVolumes` with a `volume_id` filter. Some CSI drivers chose not to implement `ListVolumes` as it is an expensive operation. Therefore it is better to introduce a new RPC, `GetVolume`, with a new capability.
-- NodeGetVolumeStats RPC
-  - For any PV that is mounted, the external agent calls `NodeGetVolumeStats` to see if volume is still mounted;
+- Extend the existing NodeGetVolumeStats RPC
+  - For any PV that is mounted, the external node agent calls `NodeGetVolumeStats` to see if the volume is still mounted;
   - Calls `NodeGetVolumeStats` to check if the volume is usable, e.g., filesystem corruption, bad blocks, etc.

 Two new controller capabilities are added. If a CSI driver supports the `GET_VOLUME` capability, it MUST support calling `GetVolume` to provide general volume information in `GetVolumeResponse`. If a driver supports `GET_VOLUME_HEALTH`, it MUST provide additional health information in `GetVolumeResponse`.
@@ -162,7 +155,7 @@ message GetVolumeResponse {
   // health shows error conditions reported by the SP.
   // This field MUST be specified if the
   // GET_VOLUME_HEALTH controller capability is supported.
-  VolumeHealth health = 1;
+  repeated VolumeHealth health = 1;
 }

 // This field is REQUIRED
@@ -187,16 +180,15 @@ message VolumeHealth {

 The following common error codes are proposed for volume health:

 * VolumeNotFound
-* VolumeUnmounted
 * OutOfCapacity
-* RWIOError
 * DiskFailure
 * DiskRemoved
 * ConfigError
 * NetworkError
 * DiskDegrading
+* VolumeUnmounted
+* RWIOError
 * FilesystemCorruption
-* NodeFailure

 Add `GET_VOLUME` and `GET_VOLUME_HEALTH` controller capabilities. If a driver supports the `GET_VOLUME` capability, it MUST implement the `GetVolume` RPC. If a driver supports `GET_VOLUME_HEALTH`, it MUST support the `health` field in the `status` field in `GetVolumeResponse`.
@@ -382,20 +374,19 @@ message NodeServiceCapability {
 ### External controller

 #### CSI interface

-Call GetVolume() RPC for volumes periodically to check the health condition of volumes themselves.
+Call GetVolume() RPC for volumes periodically to check the health condition of the volumes themselves. The frequency of the check should be tunable. A configuration option will be available in the external controller to adjust this value.

-#### Node failure event
-Watch node failure events.
-In the case of a node failure, the controller will report an event for all PVCs on that node.
-For network storage in the case of a node failure, the controller will log an event.
+#### Node down event
+Watch node down events.
+When a node goes down, the controller will report an event for all local PVCs on that node.
+For network storage, in the case of a node failure the controller will only log a general message, not one specific to individual PVCs, because the controller has no knowledge of which PVCs are in use on the affected node.

-### External agent
+### External node agent

 #### CSI interface

-Call NodeGetVolumeStats() RPC to check the mounting and other health conditions.
+Call NodeGetVolumeStats() RPC to check the mounting and other health conditions. The frequency of the check should be tunable. A configuration option will be available in the external node agent to adjust this value.

-Call both GetVolume() and NodeGetVolumeStats() RPCs for local storage when local storage CSI support is enabled.
-For now, check local volumes directly by the agent.
+The external node agent will go through all the pods on its node and find all the volumes used by those pods. It will watch those volumes and call NodeGetVolumeStats() to find out their health status, as sketched below.
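
Below is a minimal Go sketch of the node agent's polling loop described above. It is illustrative only: the `monitor` package name, the `mountedVolume` type, and the `listMountedVolumes` helper are assumptions made for this example and are not part of the proposal; the sketch only relies on the existing `NodeGetVolumeStats` call against the CSI node service.

```go
// Package monitor: illustrative sketch of the external node agent's check loop.
package monitor

import (
	"context"
	"log"
	"time"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
)

// mountedVolume pairs a CSI volume ID with its publish path on this node.
// (Hypothetical type; a real agent would derive this from the pods on the node.)
type mountedVolume struct {
	VolumeID   string
	VolumePath string
}

// monitorNodeVolumes periodically calls NodeGetVolumeStats for every volume
// mounted on this node. The interval corresponds to the tunable check
// frequency mentioned above; listMountedVolumes is a caller-supplied helper
// that returns the volumes currently used by pods on the node.
func monitorNodeVolumes(ctx context.Context, conn *grpc.ClientConn,
	listMountedVolumes func() []mountedVolume, interval time.Duration) {
	client := csi.NewNodeClient(conn)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, v := range listMountedVolumes() {
				_, err := client.NodeGetVolumeStats(ctx, &csi.NodeGetVolumeStatsRequest{
					VolumeId:   v.VolumeID,
					VolumePath: v.VolumePath,
				})
				if err != nil {
					// A failing stats call suggests the volume is no longer
					// healthy (for example, it was unmounted or the filesystem
					// is corrupted). A real agent would report an event here.
					log.Printf("volume %s at %s appears unhealthy: %v",
						v.VolumeID, v.VolumePath, err)
				}
			}
		}
	}
}
```

In a real deployment the agent would map the result (or the error) to an event on the pods or PVCs that use the volume rather than only logging it.
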
 ### Alternatives
@@ -468,6 +459,8 @@ To avoid stale information being stored on a PVC, each periodic update will mark

 As mentioned earlier, reaction is not in the scope of this proposal but will be considered in the future. Before reacting to any negative health condition, the controller responsible for the reaction should call GetVolume() again to ensure the information is up to date.

+Alternative option 1 is not in the main proposal because it requires giving all nodes access to a credential that has the ability to modify all PVCs. This is not desirable, as it would allow a malicious node to put an unhealthy status on any PVC.
+
 #### Alternative option 2

 If the agent on the node cannot be used to modify the PVC status and the monitoring logic cannot be added to Kubelet directly, we can introduce a CRD to represent the volume health. This volume health CRD is in the same namespace as the PVC that it is monitoring. It contains the PVC name and health conditions as defined in the main option. It needs to have a one-to-one mapping with the PVC. In the PVC status, there will be a `volumeHealthName` field pointing back to the volume health CRD.
@@ -512,6 +505,8 @@ type PersistentVolumeClaimStatus struct {
 }
 ```

+Alternative option 2 does not address the concerns from option 1. Assuming the node-level sidecar runs in a pod under a service account, we need to grant that service account permission to modify any instance of the volume health CR, and then the monitoring controller just copies that into the PVC status.
+
 #### Alternative option 3

 We can also reuse PV Taints and introduce a new taint called PVUnhealthMessage for the PV health condition, whose key is fixed (PVUnhealthMessage) and whose value can be set differently.
@@ -535,6 +530,8 @@ Note that:
 - all the VolumeTaintEffects are NoEffect at first; we may discuss the reactions later in another proposal;
 - the taint Value is a string for now; it is theoretically possible that several errors are detected for one PV, so we may extend the string to cover this situation: combine the errors together, separated by semicolons or other symbols.

+This was initially brought up to address the external data populator use case, but there were lots of concerns about this proposal. We do not want our main proposal to depend on something that is not approved.
+
 #### Optional HTTP(RPC) service

 Create a HTTP(RPC) service to receive volume health condition reports from other components.

 Users can extend the volume health condition monitoring by setting up their own monitoring agents to detect storage system specific issues and report the results to this service.

 This optional service is out of scope for this KEP but may be introduced in a future KEP if needed.

+This option is not in the main proposal because it is a push-based method, while CSI uses a pull-based method.

 ## Graduation Criteria
 ### Alpha -> Beta
 * Feature complete, including:
   * External controller volume health monitoring.
-  * External agent volume health monitoring.
+  * External node agent volume health monitoring.
   * Tests outlined in design proposal implemented.

 ### Beta -> GA
@@ -557,11 +556,11 @@ This optional service is out of scope for this KEP but may be introduced in a fu

 ## Test Plan
 ### Unit tests
 * Unit tests for external controller volume health monitoring.
-* Unit tests for external agent volume health monitoring.
+* Unit tests for external node agent volume health monitoring.

 ### E2E tests
 * e2e tests for external controller volume health monitoring.
-* e2e tests for external agent volume health monitoring.
+* e2e tests for external node agent volume health monitoring.

 ## Implementation History
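
For illustration, here is a similarly hedged Go sketch of the controller-side reaction to a node down event described in the "Node down event" section above. It assumes a standard client-go `record.EventRecorder`; the `pvcsOnNode` lookup and the `NodeDown` reason string are hypothetical placeholders, not names defined by this KEP.

```go
// Package monitor: illustrative sketch of the external controller's node down handling.
package monitor

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

// reportNodeDown emits a warning event on every local PVC that resides on the
// failed node, so that users can see the impact via `kubectl describe pvc`.
// pvcsOnNode is a caller-supplied helper that returns the local PVCs on a node.
func reportNodeDown(recorder record.EventRecorder, nodeName string,
	pvcsOnNode func(node string) []*v1.PersistentVolumeClaim) {
	for _, pvc := range pvcsOnNode(nodeName) {
		recorder.Event(pvc, v1.EventTypeWarning, "NodeDown",
			"node "+nodeName+" hosting this local volume appears to be down")
	}
}
```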