PV health monitor KEP #1077
Conversation
Welcome @NickrenREN!
This KEP is quite interesting to me, and I think this topic is quite important, so I am curious about your answers. I have some questions about the goals, as well as some questions/suggestions on the concrete proposal.
Hi @cdickmann, thanks for your comments! In addition to what @NickrenREN has replied to you on Slack, I want to add that I have an action item to add CSI support in this KEP, which should cover the remote PV part. In general, new features will only be added to support CSI drivers.
## Implementation
In the case of a reserved volume (shared between multiple nodes, but RWO where this node does not own the reservation), do we need such a state? The volume could be functioning perfectly well but simply not be available to this node at the moment.
Good question. If the volume itself is problematic, we should definitely taint it.
If the volume is shared between multiple nodes and is RWX, and it becomes unavailable to one specific node because of a network problem or something similar, I think we need to taint it, because it may cause data loss.
But if it is RWO and the node does not own the reservation, IIUC the volume is healthy and will not lead to data loss; as you mentioned above, it may only lead to reduced performance. I prefer not tainting it at first.
It seems that we need to list the specific causes. I will rethink the extension mechanism and validate whether Taint is suitable here.
Thanks for the updates. I had some thoughts especially on the CSI portion.
* Checks if volume still exists and/or is attached
* Checks if volume is still mounted and usable

The Container Storage Interface (CSI) specification will be modified to add two RPCs, ControllerCheckVolume and NodeCheckVolume. A monitor controller (external-monitor) that is responsible for watching the PersistentVolumeClaim, PersistentVolume, and VolumeAttachment API objects will be added. The external-monitor can be deployed as a sidecar container together with the CSI driver. For PVs, the external-monitor calls the CSI driver’s monitor interface to check volume health status.
Why is it necessary to deploy a new daemon and further add to complexity? Intuitively, it feels to me that this should be able to fit into existing daemons, as it just extends CSI slightly to cover more state. CSI already communicates multiple pieces of state in both directions, and this is a natural extension to me.
Currently, for every new feature added in kubernetes-csi, a new sidecar container is added accordingly. For example, we have external-provisioner, external-attacher, external-snapshotter, external-resizer, etc. This way it is easier to maintain each component, but I do see your point that this adds more complexity as well. I'll think more about this.
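For illustration only, here is a rough Go sketch of the kind of interface such a sidecar would consume from the driver if the two RPCs are added; the message and method shapes are assumptions pending the actual CSI spec change, not the final API.

```go
package monitor

import "context"

// CheckVolumeRequest and CheckVolumeResponse are hypothetical messages for
// the proposed ControllerCheckVolume/NodeCheckVolume RPCs; field names are
// illustrative only.
type CheckVolumeRequest struct {
	VolumeID string
}

type CheckVolumeResponse struct {
	Healthy bool
	Message string // human-readable description of any detected problem
}

// VolumeHealthChecker sketches what the external-monitor sidecar would call
// on the co-located CSI driver over its local gRPC endpoint: one check at the
// controller level (does the volume still exist?) and one at the node level
// (is it still attached/mounted and usable?).
type VolumeHealthChecker interface {
	ControllerCheckVolume(ctx context.Context, req *CheckVolumeRequest) (*CheckVolumeResponse, error)
	NodeCheckVolume(ctx context.Context, req *CheckVolumeRequest) (*CheckVolumeResponse, error)
}
```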
* ControllerCheckVolume RPC
This seems to define a poll model. Would it also be possible to have a push mechanism? I would assume that many storage systems underlying the CSI driver would actively notice and react to issues, and be able to notify upper layers in an expedient way. Thus, it should be possible for the CSI driver to communicate this. This is similar to the SCSI device specs, where a physical drive or physical SCSI controller can raise a notification when it detects an issue, so that upper layers can take immediate action. Being expedient can be important for applications that depend on application-level replication and need to know when to kick in rebuilds to reduce downtime.
That's a good point. Let me think about this.
Hi, @xing-yang @cdickmann @richardelling
The reactions need further discussion and are not in the scope of this doc.
- All the VolumeTaintEffects are NoEffect at first; we may talk about the reactions later in another proposal.
- The taint Value is a string for now. It is theoretically possible that several errors are detected for one PV; we may extend the string to cover this situation by combining the errors together, split by semicolons or other symbols.
If there is only one key and values are in free-text form, it will be difficult to add reactions later based on that information. Should we keep keys such as "VolumeNotAttached" or "VolumeNotMounted" and define a few constant string values, e.g., VOLUME_DELETED, FILESYSTEM_CORRUPTED? Maybe allow users to define their own custom values too, but those will always have "NoEffect".
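To make the suggestion concrete, a minimal sketch (hypothetical names, not part of the KEP text) of structured keys with a small set of constant values:

```go
package monitor

// Hypothetical taint keys identifying which check reported the problem.
const (
	TaintKeyVolumeNotAttached = "VolumeNotAttached"
	TaintKeyVolumeNotMounted  = "VolumeNotMounted"
)

// Hypothetical constant values describing the detected condition; custom,
// user-defined values would also be allowed but always carry NoEffect.
const (
	TaintValueVolumeDeleted       = "VOLUME_DELETED"
	TaintValueFilesystemCorrupted = "FILESYSTEM_CORRUPTED"
)

// VolumeTaint mirrors the node-taint shape proposed for PVs: a key, a
// machine-readable value, and an effect (NoEffect in the first phase).
type VolumeTaint struct {
	Key    string
	Value  string
	Effect string // "NoEffect" for now
}
```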
- External controller calls ControllerCheckVolume() to check the health condition of PVs themselves, for example, if the PVs are deleted ...
- Move Attach checking here?
- NodeCheckVolume RPC
- If volume is attached to this node, the external agent calls NodeCheckVolume() to see if volume is still attached; move this check to ControllerCheckVolume?
We can keep this here. In the case of multi-attach, it is easier to check on each node as discussed earlier.
// Identity information for a specific volume. This field is
// OPTIONAL. It can be used to list only a specific volume.
// ListVolumes will return with current volume information.
string volume_id = 3;
new RPC GetVolume()
Instead of adding a new RPC, we can leverage the existing NodeGetVolumeStats RPC.

```
rpc NodeGetVolumeStats (NodeGetVolumeStatsRequest)
```
A pull-based method for fetching utilization makes sense, but for status changes (health) maybe we should consider push?
A push-based method is brought up in the section "HTTP(RPC) Service". We didn't adopt that approach as the main proposal because CSI uses a pull-based method. I'll add some clarification there.
For a push model, can the node agent define some RPC service that CSI drivers can call out to?
@serathius can you point to any existing design that uses a push-based model?
In the current CSI spec, I don't see a push model. Therefore the design is based on a pull model. We can ask about it at the CSI community meeting during the CSI spec review. If a push model is possible, we can come back to update the KEP later.
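For reference, a push model (if the CSI community were to add one) could look roughly like the following Go sketch; this is purely an illustration of the idea discussed above, and nothing like it exists in the CSI spec today.

```go
package monitor

import "context"

// VolumeHealthEvent is a hypothetical message a CSI driver could push when it
// notices a problem asynchronously (similar to the SCSI notifications
// mentioned earlier in this thread).
type VolumeHealthEvent struct {
	VolumeID  string
	Condition string // e.g. "FilesystemCorruption"
	Message   string
}

// HealthEventSink sketches a callback service the node agent or external
// controller could expose for drivers that support push-style reporting.
type HealthEventSink interface {
	ReportVolumeHealth(ctx context.Context, ev VolumeHealthEvent) error
}
```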
Call GetVolume() RPC for volumes periodically to check the health condition of volumes themselves.

#### Node failure event
Watch node failure events.
What is a node failure event here? Can you give an example?
Assuming that this is some k8s event on the Node object, does it mean that it will be copying events from Node to PVC?
This refers to a node down event. So the monitoring controller will be checking if the node is still up and will only report an event when the node is down. I'll reword this section to clarify.
Yes, it is handled differently. For local volumes, we can figure out what local volumes are on a particular node, so an event can be reported for every PVC affected. For network storage, we will only report a general node down event, not specific to any volume. Added clarifications.
What object is the node down event reported on? Also, how will you tell the difference between a local volume and a network volume?
For local storage, @NickrenREN said there is a way to tell what volumes are on that node. So this will be reported on the PVC objects.
If the CSI driver has implemented the CSI volume health function proposed in this design document, Kubernetes could communicate with the CSI driver to retrieve any errors detected by the underlying storage system. Kubernetes can report an event and log an error about this PVC so that users can inspect this information and decide how to handle it. For example, if the volume is out of capacity, the user can request a volume expansion to get more space. In the first phase, volume health monitoring is informational only, as it is only reported in events and logs. In the future, we will also look into how Kubernetes may use volume health information to automatically reconcile.

There could be conditions that cannot be reported by a CSI driver. There could be a network failure where Kubernetes may not be able to get a response from the CSI driver; in this case, a call to the CSI driver may time out. There could be network congestion which causes slow responses. One or more nodes where the volume is attached may be down. This can be monitored and detected by the volume health controller so that the user knows what has happened.
A controller sidecar calling a CSI driver goes through localhost. I don't think you can detect network errors that way?
I said "conditions that cannot be reported by a CSI driver" on line 83 so I meant CSI driver cannot report a network failure. Maybe I should just remove it to avoid confusion.
The main architecture is shown below:

![pv health monitor architecture](./pv-health-monitor.png)
The diagram seems a little out of date after addressing the comments. Should we remove it for now?
Sure
The following areas will be the focus of this proposal at first:

- The health condition checking of volumes themselves. For example, whether the volume is deleted, whether the usage is reaching the threshold, and so on.
The proposal is not actually doing the health or mount checks, right? The proposal is about providing a mechanism for plugins to report volume health/errors.
Yes, will modify this.
The following common error codes are proposed for volume health:
* VolumeNotFound
Yeah, we can discuss the error codes in more detail during the CSI spec review.
#### Node down event
Watch node down events.
In the case that a node goes down, the controller will report an event for all local PVCs on that node.
For network storage, in the case of a node failure, the controller will just log a general message, not specific to individual PVCs, because the controller has no knowledge of what PVCs are on the affected node.
When there's a local volume CSI driver, I'm not sure how the controller will be able to tell the difference between network and local.
One possibility is to track which pods are using which PVCs and what nodes they got scheduled to?
Sure.
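To illustrate the pod-tracking idea, here is a sketch of how the controller could map a down node to the PVCs used by pods scheduled there; the pod list is assumed to come from an informer cache, and the helper name is hypothetical.

```go
package monitor

import corev1 "k8s.io/api/core/v1"

// pvcsOnNode returns the names of PVCs referenced by pods scheduled to the
// given node; when that node goes down, an event could be reported on each of
// these PVC objects.
func pvcsOnNode(pods []*corev1.Pod, nodeName string) []string {
	var claims []string
	for _, pod := range pods {
		if pod.Spec.NodeName != nodeName {
			continue
		}
		for _, vol := range pod.Spec.Volumes {
			if vol.PersistentVolumeClaim != nil {
				claims = append(claims, vol.PersistentVolumeClaim.ClaimName)
			}
		}
	}
	return claims
}
```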
### External controller

#### CSI interface
Call GetVolume() RPC for volumes periodically to check the health condition of volumes themselves. The frequency of the check should be tunalbe. A configure option will be available in the external controller to adjust this value.
typo: tunable
Fixed.
still see typo
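For illustration, the tunable polling described in the quoted section could look roughly like the following in the external controller; the flag name and the GetVolume call signature are assumptions, since the RPC is still under discussion.

```go
package monitor

import (
	"context"
	"flag"
	"log"
	"time"
)

// monitorInterval would be exposed as a configuration option on the external
// controller so operators can tune how often volume health is checked.
var monitorInterval = flag.Duration("monitor-interval", time.Minute,
	"interval between volume health checks")

// getVolumeFunc stands in for the proposed GetVolume() CSI call; it returns
// an error when the driver reports the volume as unhealthy.
type getVolumeFunc func(ctx context.Context, volumeID string) error

// pollVolumes checks every known volume on each tick; in the real controller
// a failure would be turned into a PVC event rather than just a log line.
func pollVolumes(ctx context.Context, volumeIDs []string, getVolume getVolumeFunc) {
	ticker := time.NewTicker(*monitorInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, id := range volumeIDs {
				if err := getVolume(ctx, id); err != nil {
					log.Printf("volume %s reported unhealthy: %v", id, err)
				}
			}
		}
	}
}
```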
The following areas will be the focus of this proposal at first:

- Provide a mechanism for CSI drivers to report volume health problems. For example, whether the volume is deleted, whether the usage is reaching the threshold, and so on.
- Mounting conditions checking.
Does the node agent actually check for mounts? Or is it the responsibility of the driver?
Maybe we should just say it provides a way for CSI drivers to report volume health problems at the controller and node levels.
Yeah, driver will check it. I'll reword this.
- Provide a mechanism for CSI drivers to report volume health problems. For example, whether the volume is deleted, whether the usage is reaching the threshold, and so on.
- Mounting conditions checking.
- Other errors that could affect the usability of the volume.
Are these "other errors" things that a CSI driver would not be able to detect?
This refers to errors the CSI drivers can detect. I'll reword.
Call GetVolume() RPC for volumes periodically to check the health condition of volumes themselves. The frequency of the check should be tunalbe. A configure option will be available in the external controller to adjust this value.

#### Node down event
* Watch node down events.
Would this be watching Node status?
Yes, it will watch node status. In addition, I think we also need to ping the node, as it could be an unplanned shutdown.
* Volume health feature deployed in production and has gone through at least one K8s upgrade.

## Test Plan
### Unit tests
It would be nice to be more specific about the test cases: what error scenarios will we be testing? What drivers are we going to use to test?
Also, any plan for stress/scale tests?
Updated.
@msau42 comments addressed. PTAL. Thanks.
Call GetVolume() RPC for volumes periodically to check the health condition of volumes themselves. The frequency of the check should be tunable. A configure option will be available in the external controller to adjust this value.

#### Node down event
* Watch node down events by checking node status and also pinging the node.
There are already NodeLease objects that the kubelet periodically updates (like a heartbeat). We could possibly use that, but even then, I'm still not sure why we cannot just rely on the NodeController to mark the Node as unhealthy when the NodeLease is stale. It seems like we're crossing over responsibilities here.
Fixed
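Following the suggestion to lean on the existing heartbeat machinery rather than pinging nodes directly, here is a sketch of treating the Node Ready condition (maintained by the node controller from kubelet heartbeats/NodeLease) as the node-down signal; the helper name is hypothetical.

```go
package monitor

import corev1 "k8s.io/api/core/v1"

// nodeIsDown reports whether a node should be considered down based on its
// Ready condition; the controller would watch Node objects and react when
// this flips, instead of implementing its own ping.
func nodeIsDown(node *corev1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status != corev1.ConditionTrue
		}
	}
	// No Ready condition reported; treat the node state as unknown/down.
	return true
}
```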
Squashed commits into 1.
/lgtm
/assign @saad-ali
/approve
Overall I like that this KEP has reduced scope to just creating events (informing the user) with no programmatic response. That makes it easier to implement. I'm still concerned about encoding error codes in CSI, but we can go over that in the CSI API review.
* DiskDegrading
* VolumeUnmounted
* RWIOError
* FilesystemCorruption
Enumerating the failure types as error codes in CSI makes me worry. Are we going to be able to capture all possible use cases for all possible storage systems? How will we handle backwards compat?
The main reason to have error codes (as opposed to opaque strings) is to enable programmatic response. Like the existing CSI error codes, we should detail the expected recovery behavior for each error code, which may help us reduce to a minimum set.
Sure. Let's discuss more in the CSI API review.
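To illustrate the reviewer's point about pairing each code with expected recovery behavior, here is a sketch; the code set and the guidance text are placeholders, subject to the CSI API review.

```go
package monitor

// VolumeErrorCode enumerates the condition codes proposed in this KEP.
type VolumeErrorCode string

const (
	DiskDegrading        VolumeErrorCode = "DiskDegrading"
	VolumeUnmounted      VolumeErrorCode = "VolumeUnmounted"
	RWIOError            VolumeErrorCode = "RWIOError"
	FilesystemCorruption VolumeErrorCode = "FilesystemCorruption"
)

// suggestedRecovery sketches per-code recovery guidance; documenting this for
// each code may help reduce the set to a minimum, as suggested above.
var suggestedRecovery = map[VolumeErrorCode]string{
	DiskDegrading:        "back up data and plan to migrate the volume",
	VolumeUnmounted:      "remount the volume or restart the affected pod",
	RWIOError:            "check the storage backend, then retry the I/O",
	FilesystemCorruption: "run a filesystem check before further writes",
}
```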
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: msau42, NickrenREN, saad-ali.
Will the volume health event be attached to the pod using the volume?
@blackgold Yes, we are planning to do this; tracking issue: kubernetes-csi/external-health-monitor#10
Moving kubernetes/community#1484 here