Do not use beta API for hyperdisk in multi-writer mode. #1864

Closed
wants to merge 31 commits into from

Conversation

karkunpavan

What type of PR is this?
/kind bug

What this PR does / why we need it:
Hyperdisk multi-writer support is now in compute/v1, but the persistent disk CSI driver still calls v0.beta, which leads to an error when creating a hyperdisk in multi-writer mode. Error:

Warning ProvisioningFailed 31m (x15 over 57m) pd.csi.storage.gke.io_wduan-1030a-g-vdnqh-master-0_dc46e9f6-1725-4f15-922c-99f0e281e102 failed to provision volume with StorageClass
"balanced-storage": rpc error: code = InvalidArgument desc = CreateVolume failed: rpc error: code = InvalidArgument desc = CreateVolume failed to create single zonal disk pvc-0b65d680-636b-46c6-876d-d3a6c412c3ef: failed to insert zonal disk: unknown Insert disk error:
googleapi: Error 400: Invalid value for field 'resource.multiWriter': 'true'.
Cannot specify the multi writer field for 'hyperdisk-balanced' disks,
please use access mode instead., invalid%!v(MISSING)

To fix this we will call the v1 API which accepts accessMode = ReadWriteMany.
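
A minimal sketch of the difference in the request body, assuming the field names that appear in the error message and in this description. The `diskRequest` struct and `buildDiskRequest` helper are hypothetical and only stand in for the generated compute client types; they are not the driver's code. `READ_WRITE_MANY` is assumed to be the wire value for the ReadWriteMany access mode.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// diskRequest is a hypothetical, simplified stand-in for the disk resource
// sent on insert. Only the fields relevant to this PR are modelled.
type diskRequest struct {
	Name        string `json:"name"`
	Type        string `json:"type"`
	MultiWriter bool   `json:"multiWriter,omitempty"` // rejected for hyperdisk types
	AccessMode  string `json:"accessMode,omitempty"`  // accepted by the v1 API
}

// buildDiskRequest expresses multi-writer through accessMode for hyperdisk
// types and through the multiWriter field for every other disk type.
func buildDiskRequest(name, diskType string, multiWriter bool) diskRequest {
	req := diskRequest{Name: name, Type: diskType}
	if !multiWriter {
		return req
	}
	if strings.HasPrefix(diskType, "hyperdisk") {
		req.AccessMode = "READ_WRITE_MANY"
	} else {
		req.MultiWriter = true
	}
	return req
}

func main() {
	for _, diskType := range []string{"hyperdisk-balanced", "pd-ssd"} {
		body, err := json.Marshal(buildDiskRequest("pvc-example", diskType, true))
		if err != nil {
			panic(err)
		}
		fmt.Println(string(body))
	}
}
```

The point of the sketch is only that a hyperdisk request never carries `multiWriter`, which is the field the API rejected in the error above.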

Which issue(s) this PR fixes:

Fixes #1863

Special notes for your reviewer:

  • Discussed this issue and the fix with Matthew
  • Making this change only for hyperdisk* disk types; multi-writer support for other disk types is not the focus of this fix.

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. labels Nov 7, 2024

linux-foundation-easycla bot commented Nov 7, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Nov 7, 2024
@k8s-ci-robot
Contributor

Welcome @karkunpavan!

It looks like this is your first PR to kubernetes-sigs/gcp-compute-persistent-disk-csi-driver 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/gcp-compute-persistent-disk-csi-driver has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 7, 2024
@k8s-ci-robot
Contributor

Hi @karkunpavan. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Nov 7, 2024
@mattcary
Contributor

mattcary commented Nov 8, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 8, 2024
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 14, 2024
@karkunpavan
Author

Detailed code changes as requested. Updated the PR with all of these code changes.

  1. In insertZonalDisk(), the PD CSI driver currently calls the beta API whenever multi-writer is set. We need to add a switch based on disk type: use the v1 API for hyperdisks and the beta API for persistent disks.

  2. Similar changes are needed in insertRegionalDisk().

  3. The AccessMode flag is already set here, so no changes are needed to set it explicitly for hyperdisks.

  4. GetMultiWriter() currently assumes that the v1 API does not support multi-writer; this needs to be fixed. It should return true when AccessMode == READ_WRITE_MANY.

  5. Remove the comments about hyperdisk not supporting multi-writer.
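
The version switch in points 1 and 2 and the GetMultiWriter() change in point 4 could look roughly like the sketch below. The `cloudDisk` type, the `apiVersion` constants and the `insertVersion` helper are hypothetical simplifications written for illustration; the real driver wraps the generated compute types and its signatures may differ.

```go
package main

import (
	"fmt"
	"strings"
)

type apiVersion string

const (
	apiV1   apiVersion = "v1"
	apiBeta apiVersion = "beta"
)

// cloudDisk is a hypothetical wrapper holding whichever API representation
// of the disk was returned.
type cloudDisk struct {
	version     apiVersion
	accessMode  string // populated from the v1 representation
	multiWriter bool   // populated from the beta representation
}

// insertVersion picks the API used to create a disk: beta only for
// multi-writer persistent disks, v1 for everything else (including
// multi-writer hyperdisks).
func insertVersion(diskType string, multiWriter bool) apiVersion {
	if multiWriter && !strings.HasPrefix(diskType, "hyperdisk") {
		return apiBeta
	}
	return apiV1
}

// GetMultiWriter reports multi-writer for both representations instead of
// assuming that v1 disks can never be multi-writer.
func (d cloudDisk) GetMultiWriter() bool {
	switch d.version {
	case apiV1:
		return d.accessMode == "READ_WRITE_MANY"
	case apiBeta:
		return d.multiWriter
	default:
		return false
	}
}

func main() {
	fmt.Println(insertVersion("hyperdisk-balanced", true))
	fmt.Println(insertVersion("pd-ssd", true))
	fmt.Println(cloudDisk{version: apiV1, accessMode: "READ_WRITE_MANY"}.GetMultiWriter())
}
```

Matching on the `hyperdisk` prefix mirrors the "hyperdisk*" scope stated in the PR description; a stricter allow-list of disk types would work equally well.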

@karkunpavan
Author

/retest-required

The failures are SSH timeouts which seem to be unrelated to my changes.

@karkunpavan
Author

/retest - seems like a flaky test which timed out.

@k8s-ci-robot
Contributor

@karkunpavan: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

  • /test pull-gcp-compute-persistent-disk-csi-driver-e2e
  • /test pull-gcp-compute-persistent-disk-csi-driver-kubernetes-integration
  • /test pull-gcp-compute-persistent-disk-csi-driver-sanity
  • /test pull-gcp-compute-persistent-disk-csi-driver-unit
  • /test pull-gcp-compute-persistent-disk-csi-driver-verify

The following commands are available to trigger optional jobs:

  • /test pull-gcp-compute-persistent-disk-csi-driver-e2e-windows-2019

Use /test all to run the following jobs that were automatically triggered:

  • pull-gcp-compute-persistent-disk-csi-driver-e2e
  • pull-gcp-compute-persistent-disk-csi-driver-kubernetes-integration
  • pull-gcp-compute-persistent-disk-csi-driver-sanity
  • pull-gcp-compute-persistent-disk-csi-driver-unit
  • pull-gcp-compute-persistent-disk-csi-driver-verify

In response to this:

/retest - seems like a flaky test which timed out.


@karkunpavan
Author

/retest

Sneha-at and others added 15 commits November 27, 2024 23:33
…lows for more accurate error code reporting if gRPC functionality is refactored
…ging

Refactor metric defer() statements to gRPC metric interceptor
Don't overwrite libc in distroless debian base image
update prow rc with 1.15.3-rc1 release candidate
The volume attribute class tests are valid for 1.31+ cluster
By making the param configurable in the run-k8s-integration-*.sh we can
disable the flag in k8s clusters lesser than intended minor version
Make the volume attribute class file a configurable input in the tester script
Make a new release candidate with 1.15.3
This fixes a regression introduced in kubernetes-sigs#1876 where the driver would start
panicking on startup if `--http-endpoint` was specified. This was caused
by the metrics not being initialized anymore during startup. The
proposed fix involves using the `Reset` methods of the metrics object
instead of trying to redefine them each time they need to be reset.
@gtomitsuka

@mattcary @leiyiz any potential follow-up on why the test could be failing?

this is pretty important to my company, would let us move from standard-rwx with 1TB min capacity to something much more fine-grained. happy to support with anything

@mattcary
Contributor

> @mattcary @leiyiz any potential follow-up on why the test could be failing?
>
> this is pretty important to my company, would let us move from standard-rwx with 1TB min capacity to something much more fine-grained. happy to support with anything

This is not a replacement for standard-rwx. It enables multiwriter on a block device, but this does not mean you have a multiwriter filesystem. A distributed filesystem is hard to make, and multiwriter devices are only one piece of the puzzle. ext4 and xfs are not distributed.

@gtomitsuka

gtomitsuka commented Dec 20, 2024

@mattcary I understand, my case specifically is that we have Performance pods (= one pod per node) running ML models on GKE, and we'd like to enable rolling updates for it while keeping the models persistent to reduce overhead.

Right now, this rules us out from either doing rolling updates or using hyperdisk, due to a "true" distributed FS requirement, which has very high overhead. Making it read-only would prohibit us from using some Python libraries we use like stanza.

Our understanding is that this is the k8s equivalent to using a cannon to catch a fish and in practice a much simpler solution would suffice, since writes are extremely rare and no more than one pod would be writing at any given time (we can also guarantee this using external locks if needed).

Am I understanding this incorrectly?

@mattcary
Contributor

@gtomitsuka The kernel filesystem modules try very hard to do things like cache in memory. So even if writes are rare, the in-memory structures on different machines are going to be out of date and desynchronized.

It seems possible to have some ROX with a single writer, which unmounts all readers, updates the single writer, and then remounts, while keeping all the disks attached. But that will require a new csi driver anyway.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: karkunpavan
Once this PR has been reviewed and has the lgtm label, please assign mattcary for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 9, 2025
@karkunpavan
Author

Closing this pull request, will raise a new one with planned changes

@karkunpavan karkunpavan closed this Jan 9, 2025
@k8s-ci-robot
Contributor

@karkunpavan: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-gcp-compute-persistent-disk-csi-driver-unit | ce02f42 | link | true | /test pull-gcp-compute-persistent-disk-csi-driver-unit |
| pull-gcp-compute-persistent-disk-csi-driver-e2e | ce02f42 | link | true | /test pull-gcp-compute-persistent-disk-csi-driver-e2e |
| pull-gcp-compute-persistent-disk-csi-driver-verify | ce02f42 | link | true | /test pull-gcp-compute-persistent-disk-csi-driver-verify |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
10 participants