Consider support for RAID in local provisioner #65

schallert · 2019-03-18T21:26:48Z

For context, this issue stems from a Slack conversation with @msau42 that we wanted to capture here.

It would be awesome if the local static provisioner could support RAID'ing devices together.

Currently, if a user wants to RAID local disks together they must do so manually before presenting the provisioner with a filesystem or block device. Some ways this can currently be achieved include but are not limited to:

1. Constructing the RAID volume at node provisioning time and passing it to the provisioner either formatted as an FS or as a block device.
2. Using an init container for the local static provisioner that RAIDs disks together, either as block devices or after dismantling FS's (this was @msau42's idea).
3. Using some other process to accomplish RAID'ing/formatting, then labeling the nodes that have been set up and using NodeAffinity to run the local provisioner only on those nodes.

Option (1) is unfortunately not suitable for managed Kubernetes platforms, where the only knobs the user has may be how many local FS/block volumes they want, not how they're formatted. The other options have the benefit of potentially being compatible with a managed platform, but they require manual intervention from the user. Many of these goals may be accomplished with the LVM provisioner, but it sounds like that might be far enough off to warrant work in the meantime.

If it aligns with the goals of the local static provisioner, it would be really helpful if the provisioner could handle being presented with block devices and RAID'ing (and potentially formatting) them before creating the PV. To relax the constraint of requiring block devices maybe this same functionality could eventually include deconstructing FS's as well. I noticed "Local block devices as a volume source, with partitioning and fs formatting" is on the roadmap so maybe this could fit in there?

I wanted to start this issue as a place to discuss some of these issues. If we can reach consensus on how to proceed I'd be happy to help contribute.

msau42 · 2019-03-19T18:25:27Z

cc @gnufied who has also been working on an operator

In the past, we've wanted to keep a clear separation between environment-specific prep and the general PV lifecycle management, but I can see value in providing some optional helpers if it's beneficial to many users (and I have seen many requests for supporting raid setup). I would still like to keep it separate from the actual provisioner process so that we don't complicate the logic there (and also potentially require installing mdadm in the container image for everyone). So either options 2) or 3) sounds good to me.

I think the biggest question to figure out is how will the disk names be passed in? List every disk? Pattern match? Nodes can have different number/names for disks.

cofyc · 2019-05-14T07:35:18Z

I'd prefer the second option.

> I think the biggest question to figure out is how will the disk names be passed in? List every disk? Pattern match? Nodes can have different number/names for disks.

Yes, if we want to support local volume prep in various environments, the configuration must be flexible.

This is my proposal, what do you think?

class "local" {
    dir = "/mnt/raid-local"
    # mode defaults to "filesystem"
    # mode = "filesystem"
}

class "local-device" {
    dir = "/mnt/raid-local-device"
    mode = "block"
}

#
# For all gke-demo-default-pool-* nodes, we combine all local SSDs into one
# raid0 disk and format/mount it into "local" class directory.
#
node "gke-demo-default-pool-*" {
    raid0 md0 {
        class = "local"
        disks = ["/dev/disk/by-id/google-local-ssd-*"]
    }   
}

#
# For all gke-demo-another-pool-* nodes, we combine two local SSDs into one
# raid0 disk and link the disk to "local-device" class directory.
#
node "gke-demo-another-pool-*" {
    raid0 md0 {
        class = "local-device"
        disks = ["/dev/disk/by-id/google-local-ssd-0", "/dev/disk/by-id/google-local-ssd-1"]
    }
    raid0 md1 {
        class = "local-device"
        disks = ["/dev/disk/by-id/google-local-ssd-2", "/dev/disk/by-id/google-local-ssd-3"]
    }
    raid0 md2 {
        class = "local-device"
        disks = ["/dev/disk/by-id/google-local-ssd-4", "/dev/disk/by-id/google-local-ssd-5"]
    }
}

The configuration language is HCL, which is used by Terraform.
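To make the proposal a little more concrete, here is a minimal Python sketch of how a prep step might pick the array definitions for a node and build an mdadm command line. Everything in it is an assumption for illustration: `NODE_RULES` is a hand-written stand-in for two node blocks of the proposal (no HCL parser is involved), and `arrays_for_node` and `mdadm_create_argv` are hypothetical helper names. The sketch only builds the argument list and never runs mdadm.

```python
from fnmatch import fnmatchcase

# Hypothetical in-memory form of two node blocks from the proposal. The keys
# ("name", "class", "disks") are illustrative, not an existing schema.
NODE_RULES = {
    "gke-demo-default-pool-*": [
        {"name": "md0", "class": "local",
         "disks": ["/dev/disk/by-id/google-local-ssd-*"]},
    ],
    "gke-demo-another-pool-*": [
        {"name": "md0", "class": "local-device",
         "disks": ["/dev/disk/by-id/google-local-ssd-0",
                   "/dev/disk/by-id/google-local-ssd-1"]},
    ],
}


def arrays_for_node(node_name, rules=NODE_RULES):
    """Return the array definitions whose node pattern matches node_name."""
    matched = []
    for node_pattern, arrays in rules.items():
        if fnmatchcase(node_name, node_pattern):
            matched.extend(arrays)
    return matched


def mdadm_create_argv(array_name, devices):
    """Build (but do not run) an mdadm command line for a RAID 0 array."""
    if len(devices) < 2:
        # Choice made for this sketch: refuse arrays with fewer than two members.
        raise ValueError("this sketch expects at least two devices per array")
    return ["mdadm", "--create", f"/dev/{array_name}", "--level=0",
            f"--raid-devices={len(devices)}", *devices]
```

Glob entries such as `google-local-ssd-*` would still have to be expanded to real device paths on the node (for example with `glob.glob`) before being passed to `mdadm_create_argv`.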

gnufied · 2019-05-14T18:05:57Z

We have been working on a local-storage operator that uses the following API to let users specify the disks that local-storage-provisioner can use: https://github.com/openshift/local-storage-operator/blob/master/pkg/apis/local/v1alpha1/types.go#L54 (example: https://github.com/openshift/local-storage-operator/blob/master/examples/olm/create-cr.yaml )

@cofyc An earlier version of the API we proposed for local-storage-operator allowed wildcards and regexps, but we quickly realized that we might also have to give users an exclusion mechanism (for example, don't use this disk but use the others that match this regex). It might be worth starting small, keeping the surface area of the API small, gathering user feedback, and then iterating on the design. If we allow wildcards/regexes from v1, it will be hard to roll them back.
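The exclusion concern can be illustrated with a small Python sketch. It is not the local-storage-operator API: `select_devices`, `include`, and `exclude` are hypothetical names, and the patterns are shell-style globs matched with `fnmatchcase` from the standard library rather than regexes.

```python
from fnmatch import fnmatchcase


def select_devices(available, include, exclude=()):
    """Return the devices matching any include pattern and no exclude pattern.

    `available` is a list of device paths found on the node. Explicit paths
    work too, because a path without wildcards only matches itself.
    """
    return sorted(
        dev for dev in set(available)
        if any(fnmatchcase(dev, pattern) for pattern in include)
        and not any(fnmatchcase(dev, pattern) for pattern in exclude)
    )
```

Even this small version needs two parameters and a precedence rule (exclude wins over include), which is the kind of API surface that is hard to take back once users depend on it.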

I agree with @msau42 that separating disk preparation from general PV lifecycle management is a good idea, and since the kubelet itself is capable of formatting disks, this provisioner does not need to do that (at least for non-RAID volumes).

cofyc · 2019-05-15T03:27:57Z

A CRD is a more flexible and Kubernetes-native way to configure this, so it seems a good idea to have an operator do these tasks (option 3). I had a discussion with @gnufied, and we can add RAID support to local-storage-operator. What do you think?

cofyc · 2019-05-15T12:07:02Z

Another, simpler solution is to annotate the node to tell the provisioner, or a sidecar of it, to combine the disks before mounting the combined disk (filesystem mode) or symlinking it into the discovery directory.

fejta-bot · 2019-08-13T12:07:25Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

gregwebs · 2019-08-22T22:02:14Z

I have implemented solution 2 for GKE. You can choose between LVM or RAID. It assumes that you want to combine all the available disks together (which is not necessarily a correct assumption for how everyone does K8s): https://github.com/pingcap/tidb-operator/blob/master/manifests/gke/local-ssd-provision/local-ssd-provision.yaml

msau42 · 2019-08-22T22:11:22Z

/remove-lifecycle stale

Awesome! We can consider adding the script here in some addons folder if you think that would be beneficial.

gregwebs · 2019-08-22T22:20:51Z

To be more generally useful you would probably want to do disk combining based on some node pool labeling scheme or other metadata available at startup.
This solution also causes a failure when the node restarts due to brittleness in GKE startup scripts. This has been reported in multiple places. When reporting this to GKE support they told me that un-mounting disks is not supported at this time and they don't care to make this situation more transparent in their documentation.

nyurik · 2019-10-01T23:10:18Z

@gregwebs this is awesome, I have been looking for something like this for a very long time! It would be absolutely awesome to have this as a ready-to-use component rather than a large code copy/paste. A few notes:

- GCP will soon (hopefully) introduce NVMe devices: gcloud alpha already supports the --local-ssd-volumes parameter. They will be listed as /dev/nvme* rather than ssd*. Also, it seems they can be created without being formatted, using format=block.
- For some reason mdadm kept returning exit code 141, despite seemingly completing successfully. Have you had that issue?
- Could that code be packaged into a published Docker Hub image? I already created nyurik/kuberaid (it uses a very simple script to force format), but yours is far better and more thorough.

Thank you for your awesome work on this!

fejta-bot · 2019-12-30T23:51:51Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

gregwebs · 2020-01-02T04:00:35Z

@nyurik sorry I missed your message. GCP improvements here are still in the alpha phase.
I haven't seen errors from mdadm.
We just updated the script for an incompatibility with newer GKE image versions.
You are welcome to take the script for your docker image.

fejta-bot · 2020-02-01T04:26:04Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

msau42 · 2020-02-10T17:17:28Z

/lifecycle frozen

nerddelphi · 2020-04-10T05:41:00Z

/remove-lifecycle frozen

@msau42 @gregwebs @nyurik @cofyc @gnufied @schallert
Hi there.

I'm excited about using Local SSDs in GKE and making a RAID-0 volume from these disks. However, even with the DaemonSet from the local-static-provisioner Helm chart, an initContainer, and the RAID script from @gregwebs, I have a critical issue when simulating disruption scenarios.

If I use a StatefulSet with PVC, like this

  volumeClaimTemplates:
  - metadata:
      name: local-vol
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "local-storage"
      resources:
        requests:
          storage: 700Gi

and the node is recreated after a node-pool upgrade (for example), the StatefulSet pod gets stuck in the Pending state and I have to delete its PVC and the pod manually (the PV is deleted automatically after the PVC is deleted, since the PV's disk no longer exists). The new pod is then scheduled on the upgraded node, and a new PVC (pointing to a new PV) is created as well.

My Pod description:

Are you facing this issue? If so, how do you deal with it? If not, what do you suggest?

Thank you.

cofyc · 2020-04-10T06:53:21Z

hi, @nerddelphi

Your manual operation is correct, but unfortunately there is no automatic solution right now. I'm thinking about writing a cloud controller to automate this.
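For reference, the manual cleanup described above can be written down as a small Python helper that only builds the kubectl commands. This is a sketch, not an existing tool: `cleanup_commands` is a hypothetical name and the StatefulSet name used in the usage note is made up. It relies on the StatefulSet convention that a pod is named `<statefulset>-<ordinal>` and its PVC `<claim template>-<pod>`.

```python
def cleanup_commands(statefulset, ordinal, claim_template, namespace="default"):
    """Return kubectl argument lists that delete a stuck pod's PVC and the pod."""
    pod = f"{statefulset}-{ordinal}"
    pvc = f"{claim_template}-{pod}"
    return [
        # --wait=false lets the command return even if a finalizer delays the
        # actual removal of the PVC until the pod is gone.
        ["kubectl", "delete", "pvc", pvc, "--namespace", namespace, "--wait=false"],
        ["kubectl", "delete", "pod", pod, "--namespace", namespace],
    ]
```

For a StatefulSet named `web` with the `local-vol` claim template from the question, `cleanup_commands("web", 0, "local-vol")` yields the commands for pod `web-0` and PVC `local-vol-web-0`. Each list could then be passed to `subprocess.run`.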

andyzhangx · 2020-05-23T01:03:26Z

Since PR #187 has already added the namePattern parameter, what about this lightweight design: add a new parameter, raid, to storageClassMap. With the example config below, provisioner discovery would work as follows:

In the discovery loop, check whether /dev/md0 exists. If it exists, skip; if not:
- discover all /dev/nvme* devices (with a basic capacity check), format those devices, and combine them into a RAID array at /dev/md0.
- create a new PV with /dev/md0 as a Filesystem volumeMode volume.

If raid is empty, don't set up RAID, which stays compatible with the default config.

So on every agent node, the provisioner would create at most one new PV, with local.path: /dev/md0 and Filesystem volumeMode.

apiVersion: v1
kind: ConfigMap
metadata:
  name: local-provisioner-config
  namespace: default
data:
  storageClassMap: |
    fast-disks:
      hostDir: /dev
      mountDir: /dev
      blockCleanerCommand:
        - "/scripts/shred.sh"
        - "2"
      volumeMode: Filesystem
      fsType: ext4
      namePattern: "nvme*"
      raid: "md0"  # proposed new parameter
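The decision logic of the discovery loop described above can be sketched in a few lines of Python. This is only an illustration of the proposal, not provisioner code: `plan_discovery_pass` is a hypothetical function, it works on a plain list of names found under hostDir, and it leaves out the capacity check, the formatting, and the actual RAID creation.

```python
from fnmatch import fnmatchcase


def plan_discovery_pass(dev_names, name_pattern, raid):
    """Decide what one discovery pass would do for a storage class.

    dev_names: names present under hostDir, e.g. ["nvme0n1", "sda"].
    name_pattern, raid: the namePattern and (proposed) raid settings.
    """
    if not raid:
        # raid unset: behave like the default config, no RAID setup.
        return {"action": "none"}
    if raid in dev_names:
        # The array already exists, so this pass has nothing to set up.
        return {"action": "skip"}
    members = sorted(n for n in dev_names if fnmatchcase(n, name_pattern))
    if not members:
        # Assumption of this sketch: with no matching devices, do nothing.
        return {"action": "none"}
    return {
        "action": "create",
        "array": f"/dev/{raid}",
        "members": [f"/dev/{n}" for n in members],
    }
```

Keeping the decision separate from the side effects would make the "at most one PV per node" behaviour easy to test without touching real devices.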

nerddelphi · 2020-05-23T01:21:33Z

@andyzhangx Excellent job! Is there any way to do this with the SCSI interface, since GKE only supports NVMe in alpha clusters (beta and GA are SCSI)?

Thank you!

andyzhangx · 2020-05-23T01:36:20Z

> @andyzhangx Excellent job! Is there any way to do this with the SCSI interface, since GKE only supports NVMe in alpha clusters (beta and GA are SCSI)?
>
> Thank you!

I am not aware of that. Do you have a link about the SCSI interface support? @nerddelphi

nerddelphi · 2020-05-26T16:10:46Z

@andyzhangx I'm using local SSDs on my GKE nodes and I can confirm that only the SCSI interface is available on GKE beta/GA clusters (there are only /dev/sdX disks on the node, pointing to the local SSDs).

NVMe is available in alpha -> https://cloud.google.com/sdk/gcloud/reference/alpha/container/node-pools/create#--local-ssd-volumes

@nyurik said the same in this thread as well -> #65 (comment)

Perhaps your code could check whether the local SSDs are NVMe or SCSI (or ask the user).

Thanks!

msau42 · 2020-05-26T20:45:23Z

@nerddelphi I believe the namePattern could be used to support matching SCSI disks, although I think there will be challenges in distinguishing a SCSI local SSD from a SCSI PD if you match on /dev/sd*. I believe the same issue happens for NVMe as well.

@andyzhangx regarding extending the provisioner, can we make the setup action scriptable, similar to what we do for cleaning block devices? This will make the solution more customizable to any configuration.

fejta-bot · 2020-08-24T21:39:59Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot · 2020-09-23T22:22:19Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Bessonov · 2020-09-28T09:22:03Z

/remove-lifecycle rotte

fejta-bot · 2020-10-28T09:49:35Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot · 2020-10-28T09:49:42Z

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

msau42 · 2021-09-01T03:37:23Z

/reopen
/remove-lifecycle rotten
/lifecycle frozen

k8s-ci-robot · 2021-09-01T03:37:31Z

@msau42: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten
/lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

msau42 · 2021-09-01T03:37:55Z

/kind feature

geastman3 · 2023-07-26T17:34:56Z

Any movement on this? This would be ideal for a solution I am working on.

msau42 · 2023-07-26T17:40:18Z

We published an example of a DaemonSet that can RAID the disks on GKE. If you're not on GKE, the example could be adapted to other environments: https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner/blob/master/docs/getting-started.md#option-2-gke

cofyc mentioned this issue Apr 17, 2019

sidecar to automate tasks #78

Closed

k8s-ci-robot added the lifecycle/stale label Aug 13, 2019

k8s-ci-robot removed the lifecycle/stale label Aug 22, 2019

msau42 mentioned this issue Oct 4, 2019

Creating RAID-0 with local SSDs on GCP kubernetes-retired/external-storage#1223

Closed

k8s-ci-robot added the lifecycle/stale label Dec 30, 2019

k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Feb 1, 2020

k8s-ci-robot added lifecycle/frozen and removed lifecycle/rotten labels Feb 10, 2020

msau42 mentioned this issue Feb 10, 2020

Consider adding SPDK support #167

Closed

k8s-ci-robot removed the lifecycle/frozen label Apr 10, 2020

nerddelphi mentioned this issue Apr 10, 2020

Rotating cloud instances with PVCs in a StatefulSet #181

Open

mmatczuk mentioned this issue May 26, 2020

GKE RAID improvements scylladb/scylla-operator#93

Closed

k8s-ci-robot added the lifecycle/stale label Aug 24, 2020

k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Sep 23, 2020

k8s-ci-robot closed this as completed Oct 28, 2020

alice-sawatzky mentioned this issue May 3, 2021

Helm chart init container support #251

Merged

k8s-ci-robot reopened this Sep 1, 2021

k8s-ci-robot added lifecycle/frozen and removed lifecycle/rotten labels Sep 1, 2021

k8s-ci-robot added the kind/feature label Sep 1, 2021
