Address review comments
xing-yang committed Jan 28, 2020
1 parent e001825 commit 73e206a
Showing 1 changed file with 33 additions and 38 deletions.
71 changes: 33 additions & 38 deletions keps/sig-storage/20190530-pv-health-monitor.md
@@ -36,8 +36,8 @@ status: implementable
- [Add Node Volume Health Function](#add-node-volume-health-function)
- [External controller](#external-controller)
- [CSI interface](#csi-interface)
- [Node failure event](#node-failure-event)
- [External agent](#external-agent)
- [Node down event](#node-down-event)
- [External node agent](#external-node-agent)
- [CSI interface](#csi-interface-1)
- [Alternatives](#alternatives)
- [Alternative option 1](#alternative-option-1)
@@ -80,11 +80,11 @@ If the volume is mounted on a pod and used by an application, the following prob

If the CSI driver has implemented the CSI volume health function proposed in this design document, Kubernetes could communicate with the CSI driver to retrieve any errors detected by the underlying storage system. Kubernetes can then report an event and log an error about the affected PVC so that the user can inquire about this information and decide how to handle it. For example, if the volume is out of capacity, the user can request a volume expansion to get more space. In the first phase, volume health monitoring is informational only, as it is only reported in events and logs. In the future, we will also look into how Kubernetes may use volume health information to reconcile automatically.

There could be conditions that cannot be reported by a CSI driver. There could be network failure where Kubernetes may not be able to get response from the CSI driver. In this case, a call to the CSI driver may time out. There could be network congestion which causes slow response. One or more nodes where the volume is attached to may be down. This can be monitored and detected by the volume health controller so that user knows what has happened.
There could be conditions that cannot be reported by a CSI driver. One or more nodes where the volume is attached may be down. This can be monitored and detected by the volume health controller so that the user knows what has happened.

The Kubernetes components that monitor the volumes and report events with volume health information include the following:

* An external monitoring agent on each kubernetes worker node.
* An external monitoring node agent on each Kubernetes worker node.
* An external monitoring controller on the master node.

Details will be described in the following proposal section.
@@ -93,19 +93,9 @@ Details will be described in the following proposal section.

Volume monitoring is the main focus of this proposal. Reactions are not in the scope of this proposal.

- If a volume provisioned by the CSI driver is deleted, we need to report an event and log a message to inform users.

- If mounting error occurs, we need to report an event and log a message.

- If other errors occur, we also need to report an event and log a message.

The main architecture is shown below:

![pv health monitor architecture](./pv-health-monitor.png)

The following areas will be the focus of this proposal at first:

- The health condition checking of volumes themselves. For example, whether the volume is deleted, whether the usage is reaching the threshold, and so on.
- Provide a mechanism for CSI drivers to report volume health problems. For example, whether the volume is deleted, whether the usage is reaching the threshold, and so on.
- Mounting conditions checking.
- Other errors that could affect the usability of the volume.

@@ -116,22 +106,21 @@ Three main parts are involved here in the architecture.
- Trigger controller RPC to check the health condition of the CSI volumes.
- The external controller sidecar will also watch for node failure events.

- External Agent:
- The external agent will be deployed as a sidecar together with the CSI node driver on every Kubernetes worker node.
- External Node Agent:
- The external node agent will be deployed as a sidecar together with the CSI node driver on every Kubernetes worker node.
- Trigger node RPC to check volume's mounting conditions.
- Note that currently we do not have CSI support for local storage. As a workaround, we may check the local volumes directly by the agent at first, and then move the checks to RPC interfaces when it is ready.

- Note that currently we do not have CSI support for local storage. When that support is available, we will implement the relevant CSI monitoring interfaces as well.

## Implementation

### CSI changes

The Container Storage Interface (CSI) specification will be modified to provide volume health checks, leveraging existing RPCs and adding new ones.

- GetVolume RPC
- Add new GetVolume controller RPC
- The external controller calls a new `GetVolume` RPC to check the health condition of a particular PV instead of calling `ListVolumes` with a `volume_id` filter. Some CSI drivers choose not to implement `ListVolumes` because it is an expensive operation, so it is better to introduce a new `GetVolume` RPC guarded by a new capability.
- NodeGetVolumeStats RPC
- For any PV that is mounted, the external agent calls `NodeGetVolumeStats` to see if volume is still mounted;
- Extend the existing NodeGetVolumeStats RPC
- For any PV that is mounted, the external node agent calls `NodeGetVolumeStats` to see if the volume is still mounted;
- Calls `NodeGetVolumeStats` to check if the volume is usable, e.g., to detect filesystem corruption, bad blocks, etc.

Two new controller capabilities are added. If a CSI driver supports `GET_VOLUME` capability, it MUST support calling `GetVolume` to provide general volume information in `GetVolumeResponse`. If a driver supports `GET_VOLUME_HEALTH`, it MUST provide additional health information in `GetVolumeResponse`.
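
As a rough illustration, the external controller could gate its health checks on these capabilities. In the Go sketch below, `ControllerGetCapabilities` is an existing CSI RPC, but the `GET_VOLUME` value is only proposed here and does not exist in the released CSI spec, so a plain string comparison stands in for the eventual generated enum constant.

```go
// Sketch: check whether the driver advertises the proposed GET_VOLUME capability
// before starting health checks. ControllerGetCapabilities is an existing CSI RPC;
// the "GET_VOLUME" value is proposed in this KEP, not part of the released spec.
func supportsGetVolume(ctx context.Context, client csi.ControllerClient) (bool, error) {
	resp, err := client.ControllerGetCapabilities(ctx, &csi.ControllerGetCapabilitiesRequest{})
	if err != nil {
		return false, err
	}
	for _, c := range resp.GetCapabilities() {
		if c.GetRpc().GetType().String() == "GET_VOLUME" {
			return true, nil
		}
	}
	return false, nil
}
```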
@@ -162,7 +151,7 @@ message GetVolumeResponse {
// health shows error conditions reported by the SP.
// This field MUST be specified if the
// GET_VOLUME_HEALTH controller capability is supported.
VolumeHealth health = 1;
repeated VolumeHealth health = 1;
}
// This field is REQUIRED
@@ -187,16 +176,15 @@ message VolumeHealth {

The following common error codes are proposed for volume health:
* VolumeNotFound
* VolumeUnmounted
* OutOfCapacity
* RWIOError
* DiskFailure
* DiskRemoved
* ConfigError
* NetworkError
* DiskDegrading
* VolumeUnmounted
* RWIOError
* FilesystemCorruption
* NodeFailure

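As a sketch only, these codes could be grouped on the Kubernetes side roughly as follows; the Go type name and the controller/node grouping are illustrative assumptions, not part of the CSI spec change.

```go
// Illustrative Go constants for the proposed condition codes; the type name and
// the controller/node grouping below are assumptions, not part of the CSI change.
type VolumeConditionCode string

const (
	// Typically reported by the controller plugin via GetVolume.
	VolumeNotFound VolumeConditionCode = "VolumeNotFound"
	OutOfCapacity  VolumeConditionCode = "OutOfCapacity"
	DiskFailure    VolumeConditionCode = "DiskFailure"
	DiskRemoved    VolumeConditionCode = "DiskRemoved"
	ConfigError    VolumeConditionCode = "ConfigError"
	NetworkError   VolumeConditionCode = "NetworkError"
	DiskDegrading  VolumeConditionCode = "DiskDegrading"

	// Typically reported by the node plugin via NodeGetVolumeStats.
	VolumeUnmounted      VolumeConditionCode = "VolumeUnmounted"
	RWIOError            VolumeConditionCode = "RWIOError"
	FilesystemCorruption VolumeConditionCode = "FilesystemCorruption"

	// Detected by the external controller when a node goes down.
	NodeFailure VolumeConditionCode = "NodeFailure"
)
```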
Add `GET_VOLUME` and `GET_VOLUME_HEALTH` controller capabilities. If a driver supports the `GET_VOLUME` capability, it MUST implement the `GetVolume` RPC. If a driver supports `GET_VOLUME_HEALTH`, it MUST support the `health` field within the `status` field in `GetVolumeResponse`.

@@ -382,20 +370,19 @@ message NodeServiceCapability {
### External controller

#### CSI interface
Call GetVolume() RPC for volumes periodically to check the health condition of volumes themselves.
Call GetVolume() RPC for volumes periodically to check the health condition of volumes themselves. The frequency of the check should be tunable. A configuration option will be available in the external controller to adjust this value.
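
A minimal sketch of this polling loop is shown below. The `GetVolume` client method, its request/response types, and the health accessors are assumptions based on the proposed spec change rather than an existing CSI API, and `reportEvent` stands in for whatever mechanism posts an event on the PVC bound to the volume. The controller would invoke this on a ticker whose period is the configurable check interval.

```go
// Sketch of one polling pass by the external controller. GetVolume, its
// request/response types, and the health accessors are assumptions from the
// proposed spec change; reportEvent is a hypothetical helper that posts an
// event on the PVC bound to the volume.
func pollVolumes(ctx context.Context, client csi.ControllerClient, volumeIDs []string,
	reportEvent func(volumeID, errorCode, message string)) {
	for _, id := range volumeIDs {
		resp, err := client.GetVolume(ctx, &csi.GetVolumeRequest{VolumeId: id})
		if err != nil {
			// The call may time out or fail; log and retry on the next tick.
			klog.Errorf("GetVolume(%q) failed: %v", id, err)
			continue
		}
		for _, h := range resp.GetStatus().GetHealth() {
			reportEvent(id, h.GetErrorCode(), h.GetMessage())
		}
	}
}
```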

#### Node failure event
Watch node failure events.
In the case of a node failure, the controller will report an event for all PVCs on that node.
For network storage in the case of a node failure, the controller will log an event.
#### Node down event
* Watch node down events.
* The controller will track which pods are using which PVCs and which nodes they are scheduled to.
* If a node goes down, the controller will report an event for all PVCs on that node, as sketched below.

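A sketch of the node-down handling follows, assuming the controller keeps an index from node name to the PVCs used by pods scheduled there (the `pvcRefs` argument, hypothetical bookkeeping not shown here); the lister and event recorder follow the usual client-go pattern.

```go
// Sketch: report a warning event on every PVC used by pods scheduled to a node
// that is down. pvcRefs would come from the controller's pod->PVC->node index.
func reportNodeDown(nodeName string, pvcRefs []types.NamespacedName,
	pvcLister corelisters.PersistentVolumeClaimLister, recorder record.EventRecorder) {
	for _, ref := range pvcRefs {
		pvc, err := pvcLister.PersistentVolumeClaims(ref.Namespace).Get(ref.Name)
		if err != nil {
			klog.Errorf("failed to get PVC %s: %v", ref.String(), err)
			continue
		}
		// Informational only in the first phase: record an event, take no action.
		recorder.Eventf(pvc, v1.EventTypeWarning, "NodeFailure",
			"node %s hosting a pod that uses this volume is down", nodeName)
	}
}
```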
### External agent
### External node agent

#### CSI interface
Call NodeGetVolumeStats() RPC to check the mounting and other health conditions.
Call NodeGetVolumeStats() RPC to check the mounting and other health conditions. The frequency of the check should be tunable. A configuration option will be available in the external node agent to adjust this value.

Call both GetVolume() and NodeGetVolumeStats() RPCs for local storage when local storage CSI support is enabled.
For now, check local volumes directly by the agent.
The external node agent will go through all the pods on that node and find all volumes used by those pods. It will watch those volumes and call NodeGetVolumeStats() to find out their health status.
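
The sketch below shows one pass of this check, assuming the agent has already mapped the pods on the node to the CSI volumes they mount; the `mountedVolume` type and `reportEvent` helper are hypothetical agent internals, while `NodeGetVolumeStats` and its request type are the existing CSI node API.

```go
// Sketch of one pass of the node agent's check. mountedVolume and reportEvent are
// hypothetical agent internals; NodeGetVolumeStats is the existing CSI node RPC.
type mountedVolume struct {
	VolumeID   string // CSI volume handle
	VolumePath string // path where the volume is mounted for the pod
	PVCKey     string // namespace/name of the bound PVC, used when reporting events
}

func checkMountedVolumes(ctx context.Context, client csi.NodeClient, volumes []mountedVolume,
	reportEvent func(pvcKey, message string)) {
	for _, v := range volumes {
		_, err := client.NodeGetVolumeStats(ctx, &csi.NodeGetVolumeStatsRequest{
			VolumeId:   v.VolumeID,
			VolumePath: v.VolumePath,
		})
		if err != nil {
			// The volume may no longer be mounted or usable; surface this on the PVC.
			reportEvent(v.PVCKey, err.Error())
		}
	}
}
```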

### Alternatives

@@ -468,6 +455,8 @@ To avoid stale information being stored on a PVC, each periodic update will mark

As mentioned earlier, reaction is not in the scope of this proposal but will be considered in the future. Before reacting to any negative health condition, the controller responsible for the reaction should call GetVolume() again to ensure the information is up to date.

Alternative option 1 is not in the main proposal because it requires giving all nodes access to a credential that has the ability to modify all PVCs. This is not desirable as it allows a malicious node to put an unhealthy status on a PVC.

#### Alternative option 2

If the agent on the node cannot be used to modify the PVC status and the monitoring logic cannot be added to Kubelet directly, we can introduce a CRD to represent the volume health. This volume health CRD is in the same namespace as the PVC that it is monitoring. It contains the PVC name and health conditions as defined in the main option. It needs to have a one-to-one mapping with the PVC. In the PVC status, there will be a `volumeHealthName` field pointing back to the volume health CRD.
@@ -512,6 +501,8 @@ type PersistentVolumeClaimStatus struct {
}
```
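
For illustration, the per-PVC volume health CR could look roughly like the sketch below; the type and field names, including `VolumeHealthCondition`, are placeholders rather than names defined by this proposal.

```go
// Sketch of the per-PVC volume health CR; all names here are illustrative only.
type VolumeHealth struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Name of the PVC this object monitors; the CR lives in the same namespace
	// and has a one-to-one mapping with that PVC.
	PVCName string `json:"pvcName"`

	// Health conditions, with the same shape as in the main option; the monitoring
	// controller copies these into the PVC status.
	Conditions []VolumeHealthCondition `json:"conditions,omitempty"`
}
```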

Alternative option 2 does not address the concerns from option 1. Assuming the node-level sidecar runs in a pod with a service account, we need to grant that service account permission to modify any instance of the volume health CR, and the monitoring controller then simply copies that into the PVC status.

#### Alternative option 3

We can also reuse PV Taints and introduce a new taint called PVUnhealthMessage for the PV health condition: its key is fixed (PVUnhealthMessage) and its value can be set to different error messages.
@@ -535,6 +526,8 @@ Note that:
- all the VolumeTaintEffects are NoEffect at first; we may discuss the reactions later in another proposal;
- the taint Value is a string; it is theoretically possible that several errors are detected for one PV, so we may extend the string to cover this situation by combining the errors, separated by semicolons or other symbols (illustrated below).

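For illustration only, since PV taints do not exist in Kubernetes today, the hypothetical taint could carry values like the following:

```go
// Illustrative only: PV taints do not exist in Kubernetes today, so VolumeTaint is
// a hypothetical type showing what the PVUnhealthMessage taint could carry.
type VolumeTaint struct {
	Key    string // fixed key "PVUnhealthMessage" for health conditions
	Value  string // one or more error codes, joined with semicolons
	Effect string // "NoEffect" at first; reactions are out of scope
}

var exampleTaint = VolumeTaint{
	Key:    "PVUnhealthMessage",
	Value:  "RWIOError;FilesystemCorruption",
	Effect: "NoEffect",
}
```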
This was initially brought up to address the external data populator use case, but there were lots of concerns about this proposal. We do not want our main proposal to depend on something that is not approved.

#### Optional HTTP(RPC) service

Create an HTTP(RPC) service to receive volume health condition reports from other components.
@@ -543,11 +536,13 @@ Users can extend the volume health condition monitoring by setting up their own

This optional service is out of scope for this KEP but may be introduced in a future KEP if needed.

This option is not in the main proposal because it is a push-based method, while CSI uses a pull-based method.

## Graduation Criteria
### Alpha -> Beta
* Feature complete, including:
* External controller volume health monitoring.
* External agent volume health monitoring.
* External node agent volume health monitoring.
* Tests outlined in design proposal implemented.

### Beta -> GA
@@ -557,11 +552,11 @@ This optional service is out of scope for this KEP but may be introduced in a fu
## Test Plan
### Unit tests
* Unit tests for external controller volume health monitoring.
* Unit tests for external agent volume health monitoring.
* Unit tests for external node agent volume health monitoring.

### E2E tests
* e2e tests for external controller volume health monitoring.
* e2e tests for external agent volume health monitoring.
* e2e tests for external node agent volume health monitoring.

## Implementation History

