Do not use beta API for hyperdisk in multi-writer mode. #1864

karkunpavan · 2024-11-07T14:42:53Z

What type of PR is this?
/kind bug

What this PR does / why we need it:
Hyperdisk multi-writer support is now in compute/v1, but the persistent disk CSI driver still calls v0.beta, which leads to an error when creating a hyperdisk in multi-writer mode. Error:

Warning ProvisioningFailed 31m (x15 over 57m) pd.csi.storage.gke.io_wduan-1030a-g-vdnqh-master-0_dc46e9f6-1725-4f15-922c-99f0e281e102 failed to provision volume with StorageClass
"balanced-storage": rpc error: code = InvalidArgument desc = CreateVolume failed: rpc error: code = InvalidArgument desc = CreateVolume failed to create single zonal disk pvc-0b65d680-636b-46c6-876d-d3a6c412c3ef: failed to insert zonal disk: unknown Insert disk error:
googleapi: Error 400: Invalid value for field 'resource.multiWriter': 'true'.
Cannot specify the multi writer field for 'hyperdisk-balanced' disks,
please use access mode instead., invalid%!v(MISSING)

To fix this we will call the v1 API which accepts accessMode = ReadWriteMany.
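
As a rough illustration of the field difference the error message points at, the sketch below picks between `multiWriter` and `accessMode` based on the disk type. It is not the driver's code: `diskInsertBody`, `isHyperdisk`, and `newMultiWriterBody` are hypothetical names, the struct models only the two fields discussed here, and the `type` value is shown as a bare name rather than the full resource URL the real API expects.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// diskInsertBody is a hypothetical, trimmed-down stand-in for the disk
// resource sent on insert. Only the fields relevant to this PR are modeled.
type diskInsertBody struct {
	Name        string `json:"name"`
	Type        string `json:"type"`
	MultiWriter bool   `json:"multiWriter,omitempty"` // rejected for hyperdisk, per the error above
	AccessMode  string `json:"accessMode,omitempty"`  // what the error asks for instead
}

// isHyperdisk reports whether a disk type name belongs to the hyperdisk family.
// The prefix check is an assumption made for this sketch.
func isHyperdisk(diskType string) bool {
	return strings.HasPrefix(diskType, "hyperdisk-")
}

// newMultiWriterBody sets whichever multi-writer field the disk type expects.
func newMultiWriterBody(name, diskType string) diskInsertBody {
	body := diskInsertBody{Name: name, Type: diskType}
	if isHyperdisk(diskType) {
		body.AccessMode = "READ_WRITE_MANY"
	} else {
		body.MultiWriter = true
	}
	return body
}

func main() {
	for _, diskType := range []string{"hyperdisk-balanced", "pd-ssd"} {
		out, err := json.Marshal(newMultiWriterBody("example-disk", diskType))
		if err != nil {
			panic(err)
		}
		fmt.Println(string(out))
	}
}
```

With this shape, a hyperdisk request never carries `multiWriter`, which is the field the 400 response rejects.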

Which issue(s) this PR fixes:

Fixes #1863

Special notes for your reviewer:

Discussed this issue and the fix with Matthew.
This change applies only to hyperdisk* disk types; multi-writer support for other disk types is not the focus of this fix.

Does this PR introduce a user-facing change?:

NONE

linux-foundation-easycla · 2024-11-07T14:42:58Z

The committers listed above are authorized under a signed CLA.

✅ login: k8s-ci-robot / name: Kubernetes Prow Robot (d8c777e, c29a4ee, bc9302c, 486a336, f872f15, 13bf655, e85e9f8, 5675f13, a06d250, 2b5cadc, e457c73)
✅ login: pwschuurman (0862fd6, a1ed4e4, 50f8af2)
✅ login: saikat-royc / name: Saikat Roychowdhury (524895b, 96338ca, 2e9c71c)
✅ login: karkunpavan / name: Pavan Karkun (35d3d55, 4168542, 37e19fd, 8081b51, ce02f42)
✅ login: travisyx / name: Travis Xiang (51fb1d0, 1a7eeeb, 6039d2b)
✅ login: Fricounet / name: Baptiste Girard-Carrabin (7407af0)
✅ login: mattcary / name: Matt Cary (a28f8d3, 0f74882, 4249ee3)
✅ login: amacaskill / name: Alexis MacAskill (4f2fd63)
✅ login: Sneha-at / name: Sneha-at (6449ad1)

k8s-ci-robot · 2024-11-07T14:43:03Z

Welcome @karkunpavan!

It looks like this is your first PR to kubernetes-sigs/gcp-compute-persistent-disk-csi-driver 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/gcp-compute-persistent-disk-csi-driver has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2024-11-07T14:43:04Z

Hi @karkunpavan. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

mattcary · 2024-11-08T16:31:31Z

/ok-to-test

karkunpavan · 2024-11-14T05:45:54Z

Detailed code changes as requested. Updated the PR with all of these code changes.

In insertZonalDisk(), the PD CSI driver currently calls the beta API whenever multi-writer is set. We need to add a switch based on disk type: use the v1 API for hyperdisk and the beta API for persistent disks.
Similar changes are needed in insertRegionalDisk().
The AccessMode flag is already set here, so no changes are needed to set it explicitly for hyperdisks.
GetMultiWriter() currently assumes that the v1 API does not support multi-writer, and this needs to be fixed. It should return true when AccessMode == READ_WRITE_MANY.
Remove the comments about hyperdisk not supporting multi-writer.
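
The changes listed above could look roughly like the following sketch. It is an assumption-laden illustration, not the code in this PR: `apiVersion`, `insertAPIVersion`, `cloudDisk`, and its `multiWriter` method are hypothetical names standing in for the driver's own version switch and GetMultiWriter(), and the hyperdisk check is a simple prefix match.

```go
package main

import (
	"fmt"
	"strings"
)

// apiVersion names the Compute API surface used for a disk call.
type apiVersion string

const (
	apiV1   apiVersion = "v1"
	apiBeta apiVersion = "beta"
)

const accessModeReadWriteMany = "READ_WRITE_MANY"

// insertAPIVersion is a hypothetical helper: multi-writer persistent disks
// still go through beta, while hyperdisk and all non-multi-writer disks use v1.
func insertAPIVersion(diskType string, multiWriter bool) apiVersion {
	if multiWriter && !strings.HasPrefix(diskType, "hyperdisk") {
		return apiBeta
	}
	return apiV1
}

// cloudDisk is a hypothetical stand-in for the driver's disk wrapper.
type cloudDisk struct {
	version         apiVersion
	accessMode      string // populated for v1 disks
	betaMultiWriter bool   // populated for beta disks
}

// multiWriter mirrors the proposed GetMultiWriter() behavior: a v1 disk
// counts as multi-writer when its access mode is READ_WRITE_MANY.
func (d cloudDisk) multiWriter() bool {
	switch d.version {
	case apiV1:
		return d.accessMode == accessModeReadWriteMany
	case apiBeta:
		return d.betaMultiWriter
	default:
		return false
	}
}

func main() {
	fmt.Println(insertAPIVersion("hyperdisk-balanced", true))
	fmt.Println(insertAPIVersion("pd-ssd", true))

	hd := cloudDisk{version: apiV1, accessMode: accessModeReadWriteMany}
	fmt.Println(hd.multiWriter())
}
```

Keeping the version choice in one helper would let insertZonalDisk() and insertRegionalDisk() share the same rule instead of duplicating the disk-type check.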

karkunpavan · 2024-11-14T06:39:28Z

/retest-required

The failures are SSH timeouts which seem to be unrelated to my changes.

karkunpavan · 2024-11-14T07:45:28Z

/retest - seems like a flaky test which timed out.

k8s-ci-robot · 2024-11-14T07:45:43Z

@karkunpavan: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

/test pull-gcp-compute-persistent-disk-csi-driver-e2e
/test pull-gcp-compute-persistent-disk-csi-driver-kubernetes-integration
/test pull-gcp-compute-persistent-disk-csi-driver-sanity
/test pull-gcp-compute-persistent-disk-csi-driver-unit
/test pull-gcp-compute-persistent-disk-csi-driver-verify

The following commands are available to trigger optional jobs:

/test pull-gcp-compute-persistent-disk-csi-driver-e2e-windows-2019

Use /test all to run the following jobs that were automatically triggered:

pull-gcp-compute-persistent-disk-csi-driver-e2e
pull-gcp-compute-persistent-disk-csi-driver-kubernetes-integration
pull-gcp-compute-persistent-disk-csi-driver-sanity
pull-gcp-compute-persistent-disk-csi-driver-unit
pull-gcp-compute-persistent-disk-csi-driver-verify

In response to this:

/retest - seems like a flaky test which timed out.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

karkunpavan · 2024-11-14T08:34:47Z

/retest

This fixes a regression introduced in kubernetes-sigs#1876 where the driver would start panicking on startup if `--http-endpoint` was specified. This was caused by the metrics not being initialized anymore during startup. The proposed fix involves using the `Reset` methods of the metrics object instead of trying to redefine them each time they need to be reset.

gtomitsuka · 2024-12-20T04:22:26Z

@mattcary @leiyiz any potential follow-up on why the test could be failing?

this is pretty important to my company, would let us move from standard-rwx with 1TB min capacity to something much more fine-grained. happy to support with anything

mattcary · 2024-12-20T16:57:10Z

> @mattcary @leiyiz any potential follow-up on why the test could be failing?
>
> this is pretty important to my company, would let us move from standard-rwx with 1TB min capacity to something much more fine-grained. happy to support with anything

This is not a replacement for standard-rwx. It enables multiwriter on a block device, but this does not mean you have a multiwriter filesystem. A distributed filesystem is hard to make, and multiwriter devices are only one piece of the puzzle. ext4 and xfs are not distributed.

gtomitsuka · 2024-12-20T17:42:40Z

@mattcary I understand, my case specifically is that we have Performance pods (= one pod per node) running ML models on GKE, and we'd like to enable rolling updates for it while keeping the models persistent to reduce overhead.

Right now, this rules us out from either doing rolling updates or using hyperdisk, due to a "true" distributed FS requirement, which has very high overhead. Making it read-only would prohibit us from using some Python libraries we use like stanza.

Our understanding is that this is the k8s equivalent to using a cannon to catch a fish and in practice a much simpler solution would suffice, since writes are extremely rare and no more than one pod would be writing at any given time (we can also guarantee this using external locks if needed).

Am I understanding this incorrectly?

mattcary · 2024-12-20T18:42:36Z

@gtomitsuka The kernel filesystem modules try very hard to do things like cache in memory. So even if writes are rare, the in-memory structures on different machines are going to be out of date and desynchronized.

It seems possible to have some ROX with a single writer, which unmounts all readers, updates the single writer, and then remounts, while keeping all the disks attached. But that will require a new csi driver anyway.

k8s-ci-robot · 2025-01-09T13:22:19Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: karkunpavan
Once this PR has been reviewed and has the lgtm label, please assign mattcary for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

karkunpavan · 2025-01-09T13:27:37Z

Closing this pull request, will raise a new one with planned changes

k8s-ci-robot · 2025-01-09T13:27:45Z

@karkunpavan: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-gcp-compute-persistent-disk-csi-driver-unit | `ce02f42` | link | true | `/test pull-gcp-compute-persistent-disk-csi-driver-unit` |
| pull-gcp-compute-persistent-disk-csi-driver-e2e | `ce02f42` | link | true | `/test pull-gcp-compute-persistent-disk-csi-driver-e2e` |
| pull-gcp-compute-persistent-disk-csi-driver-verify | `ce02f42` | link | true | `/test pull-gcp-compute-persistent-disk-csi-driver-verify` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Do not use beta API for hyperdisk in multi-writer mode.

8081b51

k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. labels Nov 7, 2024

k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Nov 7, 2024

k8s-ci-robot requested review from leiyiz and mattcary November 7, 2024 14:43

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 7, 2024

k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Nov 7, 2024

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 8, 2024

karkunpavan added 2 commits November 14, 2024 04:35

Merge remote-tracking branch 'upstream/master' into multi-witer

35d3d55

Update GetMultiWriter() to reflect support for multi-writer in v1 disks

4168542

k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 14, 2024

travisyx and others added 5 commits November 15, 2024 18:32

Upgrade resizer to v1.12.0

1a7eeeb

Upgrade rc resizer to v1.12.0

6039d2b

Merge pull request kubernetes-sigs#1872 from travisyx/master

a28f8d3

Upgrade resizer to v1.12.0

Require VACs to use SI units

51fb1d0

Fix ./hack/verify-docker-deps.sh script to run on build platform

50f8af2

Sneha-at and others added 15 commits November 27, 2024 23:33

update changelog

6449ad1

Merge pull request kubernetes-sigs#1877 from Sneha-at/update-changelog

2b5cadc

Update change log

Migrate metric defer() statements to gRPC metric interceptor. This al…

a1ed4e4

…lows for more accurate error code reporting if gRPC functionality is refactored

Merge pull request kubernetes-sigs#1876 from pwschuurman/grpc-err-log…

a06d250

…ging Refactor metric defer() statements to gRPC metric interceptor

Don't overwrite libc in distroless debian base image

0862fd6

Merge pull request kubernetes-sigs#1883 from pwschuurman/dockerfile-fix

486a336

Don't overwrite libc in distroless debian base image

update prow rc with 1.15.3-rc1 release candidate

524895b

Merge pull request kubernetes-sigs#1888 from saikat-royc/master

e457c73

update prow rc with 1.15.3-rc1 release candidate

Merge pull request kubernetes-sigs#1875 from travisyx/master

f872f15

Require VACs to use SI units

Make the volume attribute class file a configurable input

2e9c71c

The volume attribute class tests are valid for 1.31+ cluster By making the param configurable in the run-k8s-integration-*.sh we can disable the flag in k8s clusters lesser than intended minor version

Merge pull request kubernetes-sigs#1889 from saikat-royc/fix-vac-tests

bc9302c

Make the volume attribute class file a configurable input in the tester script

skip xfs test for GCE test skip

4f2fd63

Merge pull request kubernetes-sigs#1891 from amacaskill/test-skip

e85e9f8

Skip xfs test for GCE test skip

create new rc for 1.15.3 (kubernetes-sigs#1893)

96338ca

Make a new release candidate with 1.15.3

mattcary and others added 5 commits January 2, 2025 10:20

Use correct path in error message for udev tooling

0f74882

Merge pull request kubernetes-sigs#1897 from mattcary/toolpath

13bf655

Use correct path in error message for udev tooling

Merge pull request kubernetes-sigs#1895 from DataDog/fricounet/upstre…

c29a4ee

…am/fix-panic [metrics] Fix panic during metrics manager startup

adds the changes to support hyperdisk multi-writer mode and updates t…

37e19fd

…o e2e tests

Merge branch 'multi-witer' of https://github.com/karkunpavan/gcp-comp…

ce02f42

…ute-persistent-disk-csi-driver into multi-witer

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 9, 2025

karkunpavan closed this Jan 9, 2025
