
Uncertain handling for attach #400

Open
msau42 opened this issue Dec 28, 2022 · 17 comments
Labels
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@msau42
Collaborator

msau42 commented Dec 28, 2022

It looks like we check for finalError on ControllerPublishVolume:

return nil, isFinalError(err), err

But we ignore it later on:

// We're not interested in `detached` return value, the controller will

And we rely on the driver's implementation of ControllerUnpublishVolume to properly detect whether an attach operation is still in progress.

This differs from how we handle uncertain states for other volume operations, such as provision and mount. Should we consider adding uncertain handling for attach?
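For context, here is a minimal Go sketch of the final-vs-temporary error classification. It mirrors the isFinalError pattern used in the CSI sidecars, but the exact set of codes below follows csi-provisioner and is illustrative rather than a quote of the attacher's source:

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isFinalError reports whether a failed CSI RPC is definitely done, or
// whether the operation may still be running in the storage backend.
func isFinalError(err error) bool {
	st, ok := status.FromError(err)
	if !ok {
		// Not a gRPC status: the call failed before reaching the
		// driver, so nothing can still be in flight.
		return true
	}
	switch st.Code() {
	case codes.Canceled, // the client canceled the request
		codes.DeadlineExceeded,  // timeout: the backend may still finish the attach
		codes.Unavailable,       // connection broken or server shutting down
		codes.ResourceExhausted, // server temporarily out of resources
		codes.Aborted:           // CSI: an operation is already pending for the volume
		return false // uncertain: the operation may still complete later
	}
	// Every other code means the driver definitively rejected the call.
	return true
}

func main() {
	fmt.Println(isFinalError(status.Error(codes.DeadlineExceeded, "timed out")))  // false: uncertain
	fmt.Println(isFinalError(status.Error(codes.InvalidArgument, "bad request"))) // true: final
}
```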

@msau42
Collaborator Author

msau42 commented Dec 28, 2022

cc @jsafrane @gnufied

@jsafrane
Contributor

jsafrane commented Jan 5, 2023

IIRC, the external-attacher always calls ControllerUnpublish when a VolumeAttachment gets a DeletionTimestamp, regardless of whether the previous attach succeeded, or whether it failed with a final or a temporary error. In this sense, an attachment is always "uncertain" until fully attached.

It could be possible to optimize this by marking attachments as "not attached" after a final ControllerPublish error and skipping ControllerUnpublish in that case. I am not sure it's worth the effort.
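To make that behavior concrete, here is a minimal sketch (not the actual external-attacher source; csiAttach and csiDetach are hypothetical stand-ins for the ControllerPublishVolume/ControllerUnpublishVolume wrappers) of a reconcile step that always detaches once a DeletionTimestamp is present:

```go
package main

import (
	"fmt"

	storagev1 "k8s.io/api/storage/v1"
)

// Hypothetical stand-ins for the ControllerPublishVolume /
// ControllerUnpublishVolume RPC wrappers.
func csiAttach(va *storagev1.VolumeAttachment) error { fmt.Println("attach", va.Name); return nil }
func csiDetach(va *storagev1.VolumeAttachment) error { fmt.Println("detach", va.Name); return nil }

// syncVA sketches the reconcile decision described above: once the
// VolumeAttachment is being deleted, detach is attempted unconditionally,
// without consulting whether the earlier attach succeeded, failed with a
// final error, or merely timed out.
func syncVA(va *storagev1.VolumeAttachment) error {
	if va.DeletionTimestamp != nil {
		return csiDetach(va)
	}
	return csiAttach(va)
}

func main() {
	va := &storagev1.VolumeAttachment{}
	va.Name = "csi-123"
	_ = syncVA(va) // attaches; set va.DeletionTimestamp to exercise the detach path
}
```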

@bswartz

bswartz commented Jan 5, 2023

I agree it sounds like it should not cause problems, but one could imagine that some plugin depends on the current Kubernetes behavior to avoid leaking resources.

My reading is that the spec's requirements for the CO are ambiguous in this area. There is a lot of "the CO may choose to" language that suggests the CO has little, if any, obligation here.

@msau42
Collaborator Author

msau42 commented Jan 6, 2023

The main challenge is that the current behavior relies on the plugin to potentially keep state about pending attach requests, which may be difficult. You can have a sequence like:

  1. Attach disk started
  2. Attach disk timed out on client side, but is still queued in the backend
  3. Pod is deleted and we trigger detach
  4. Detach returns success because disk is not attached to the node yet.
  5. Queued attach operation proceeds and attaches the disk to the old node.

Some ways that a plugin could address this are:

  • During step 4, check for pending attach operations
  • During step 4, issue a detach operation regardless and wait for the operation to complete

However, keeping track of pending operations may be difficult. For example, in GCP, operation metadata only provides the instance id, not the disk id. So the CSI driver would potentially have to: 1) serialize operations per instance, which would hurt performance; 2) cache operations in memory and only serialize on restart; or 3) keep operation state in some CRD.
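As a minimal sketch of option 2, assuming an in-memory approach (all names here are hypothetical, not the GCP driver's actual code), a cache could record at most one pending operation per volume so that a later detach notices a still-queued attach:

```go
package attachcache

import (
	"fmt"
	"sync"
)

// OpKind distinguishes which operation is currently in flight for a volume.
type OpKind string

const (
	OpAttach OpKind = "attach"
	OpDetach OpKind = "detach"
)

// Cache tracks at most one pending operation per volume ID. It is lost on
// restart, which is why a real driver might also serialize on shutdown or
// keep the state in a CRD, as noted above.
type Cache struct {
	mu      sync.Mutex
	pending map[string]OpKind // volume ID -> in-flight operation
}

func New() *Cache {
	return &Cache{pending: make(map[string]OpKind)}
}

// Start records an operation; it fails if another one is still pending for
// the same volume, which is how step 4 in the sequence above would notice
// the queued attach.
func (c *Cache) Start(volumeID string, kind OpKind) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if prev, ok := c.pending[volumeID]; ok {
		return fmt.Errorf("operation %q still pending for volume %s", prev, volumeID)
	}
	c.pending[volumeID] = kind
	return nil
}

// Finish clears the record once the backend reports the operation done.
func (c *Cache) Finish(volumeID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.pending, volumeID)
}
```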

@msau42
Collaborator Author

msau42 commented Jan 18, 2023

For reference, here is a prototype of adding operation caching in the GCP driver: kubernetes-sigs/gcp-compute-persistent-disk-csi-driver#923

@xing-yang
Contributor

Discussed at the triage meeting. We'll work on a design similar to how the provisioner works; the attach/detach operations will be synchronized with an operations cache.
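As a hypothetical illustration of that design, a driver's ControllerUnpublishVolume could consult an operations cache like the one sketched in the earlier comment and return a retryable error while an attach is still pending. The attachcache import path and the driver type are assumptions, not part of any real driver:

```go
package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"

	"example.com/attachcache" // hypothetical path to the cache sketched above
)

type driver struct {
	ops *attachcache.Cache
}

func (d *driver) ControllerUnpublishVolume(ctx context.Context, req *csi.ControllerUnpublishVolumeRequest) (*csi.ControllerUnpublishVolumeResponse, error) {
	// If an attach (or another detach) is still pending for this volume,
	// return Aborted so the CO retries instead of assuming "detached".
	if err := d.ops.Start(req.VolumeId, attachcache.OpDetach); err != nil {
		return nil, status.Error(codes.Aborted, err.Error())
	}
	defer d.ops.Finish(req.VolumeId)

	// ... issue the backend detach here and wait for it to complete ...
	return &csi.ControllerUnpublishVolumeResponse{}, nil
}
```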

@xing-yang
Contributor

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jan 18, 2023
@xing-yang
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jan 18, 2023
@Sneha-at
Contributor

I will work on this issue; I'm unable to assign it to myself.

@k8s-triage-robot

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels May 1, 2023
@msau42
Collaborator Author

msau42 commented Sep 12, 2023

/priority important-longterm
/triage accepted

@k8s-ci-robot k8s-ci-robot added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Sep 12, 2023
@k8s-triage-robot

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jan 19, 2024
@msau42
Collaborator Author

msau42 commented Mar 9, 2024

/priority important-longterm
/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Mar 9, 2024
@k8s-triage-robot

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jun 7, 2024
@msau42
Collaborator Author

msau42 commented Aug 2, 2024

/remove-priority important-soon

@k8s-ci-robot k8s-ci-robot removed the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Aug 2, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 31, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 1, 2024
@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 1, 2024