Dynamic Accelerator Slicer (DAS) is an operator that dynamically partitions GPU accelerators in Kubernetes and OpenShift. It currently ships with a reference implementation for NVIDIA Multi-Instance GPU (MIG) and is designed to support additional technologies such as NVIDIA MPS or GPUs from other vendors.
Minimum supported OpenShift versions: 4.18.21 and 4.19.6.
Key capabilities of the Dynamic Accelerator Slicer (DAS) Operator:

- On-demand partitioning of GPUs via a custom Kubernetes operator.
- Scheduler integration that allocates NVIDIA MIG slices through a plugin located at `pkg/scheduler/plugins/mig/mig.go`.
- `AllocationClaim` custom resource to track slice reservations (`pkg/apis/dasoperator/v1alpha1/allocation_types.go`).
- Emulated mode to exercise the workflow without real hardware.
This project uses just for task automation. Install just first:
# On macOS
brew install just
# On Fedora/RHEL
dnf install just
# On Ubuntu/Debian
apt install just
# Or via cargo
cargo install just
- Configure your images by editing `related_images.your-username.json` with your registry:

  [
    {"name": "instaslice-operator-next", "image": "quay.io/your-username/instaslice-operator:latest"},
    {"name": "instaslice-webhook-next", "image": "quay.io/your-username/instaslice-webhook:latest"},
    {"name": "instaslice-scheduler-next", "image": "quay.io/your-username/instaslice-scheduler:latest"},
    {"name": "instaslice-daemonset-next", "image": "quay.io/your-username/instaslice-daemonset:latest"}
  ]

- Build and push all images:

  just build-push-parallel

- Deploy to OpenShift (with emulated mode for development):

  export EMULATED_MODE=enabled
  export RELATED_IMAGES=related_images.your-username.json
  just deploy-das-ocp

- Test the installation:

  kubectl apply -f test/test-pod-emulated.yaml
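For orientation, a minimal pod that requests a GPU slice might look like the sketch below. The image and the resource name are placeholders (pods requesting slices are rewritten by the project's webhook onto the `mig.das.com` extended resource described later), so treat `test/test-pod-emulated.yaml` as the authoritative example.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: quay.io/example/cuda-workload:latest   # placeholder image
    command: ["sleep", "3600"]
    resources:
      limits:
        # Placeholder slice resource; copy the real name from test/test-pod-emulated.yaml.
        mig.das.com/mig-1g.5gb: "1"
```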
For OpenShift clusters with GPU hardware:
- Deploy prerequisites:

  just deploy-cert-manager-ocp
  just deploy-nfd-ocp
  just deploy-nvidia-ocp

- Deploy DAS operator:

  export EMULATED_MODE=disabled
  export RELATED_IMAGES=related_images.your-username.json
  just deploy-das-ocp

- Test with GPU workload:

  kubectl apply -f test/test-pod.yaml
For local development:
- Run operator locally (requires scheduler, webhook, and daemonset images to be built and pushed beforehand):

  # Build and push images first
  just build-push-parallel

  # Run the operator locally; set EMULATED_MODE to control hardware emulation
  EMULATED_MODE=enabled just run-local

- Run tests:

  just test-e2e

- Check code quality:

  just lint
- Log in to podman and have a repository created for the operator bundle.
- Set `BUNDLE_IMAGE` to point to your repository and tag of choice.
- Run `just bundle-generate` to generate the bundle manifests.
- Run `just build-push-bundle` to build and push the bundle image to your repository.
- Run `just deploy-cert-manager-ocp` to install cert-manager on OpenShift.
- Run `just deploy-nfd-ocp` to install Node Feature Discovery (NFD) on OpenShift.
- Run `just deploy-nvidia-ocp` to install the NVIDIA GPU operator on OpenShift.
- Run `operator-sdk run bundle --namespace <namespace> ${BUNDLE_IMAGE}` to deploy the operator.
- Apply the `DASOperator` custom resource to initialize the operator:

  kubectl apply -f deploy/03_instaslice_operator.cr.yaml
Running `generate bundle` is the first step to publishing an operator to a catalog and deploying it with OLM. A CSV manifest is generated by collecting data from the set of manifests passed to this command, such as CRDs, RBAC, etc., and applying that data to a "base" CSV manifest.

The steps to provide a base CSV:

- Create a base CSV file that contains the desired metadata. The base CSV file name can be arbitrary; we follow the convention `{operator-name}.base.clusterserviceversion.yaml`.
- Put the base CSV file in the `deploy` folder. This is the folder from which the `generate bundle` command collects the k8s manifests. Note that the base CSV file can also be placed inside a sub-directory within the `deploy` folder.
- Make sure that the `metadata.name` of the base CSV is the same as the package name provided to the `generate bundle` command, otherwise the `generate bundle` command will ignore the base CSV and generate from an empty CSV.
Layout of an example `deploy` folder:
tree deploy/
deploy/
├── crds
│ └── foo-operator.crd.yaml
├── base-csv
│ └── foo-operator.base.clusterserviceversion.yaml
├── deployment.yaml
├── role.yaml
├── role_binding.yaml
├── service_account.yaml
└── webhooks.yaml
The bundle generation command:
operator-sdk generate bundle --input-dir deploy --version 0.1.0 --output-dir=bundle --package foo-operator
The base CSV yaml:
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: foo-operator.base
  annotations:
    alm-examples:
    # other annotations can be placed here
spec:
  displayName: Instaslice
  version: 0.0.2
  apiservicedefinitions:
  customresourcedefinitions:
  install:
  installModes:
  - supported: false
    type: OwnNamespace
  - supported: false
    type: SingleNamespace
  - supported: false
    type: MultiNamespace
  - supported: true
    type: AllNamespaces
  maturity: alpha
  minKubeVersion: 1.16.0
  provider:
    name: Codeflare
    url: https://github.com/openshift/instaslice-operator
  relatedImages:
  keywords:
  - Foo
  links:
  - name: My Operator
    url: https://github.com/foo/bar
  maintainers:
  description:
  icon:
- There is no need to provide any permissions or a deployment spec inside the base CSV.
- Note that the `metadata.name` of the base CSV has a prefix of `foo-operator.`, which adheres to the format `{package name}.`.
- If there are multiple CSV files inside the deploy folder, the one encountered first in lexical order will be selected as the base CSV.
The CSV generation details can be found by inspecting the bundle generation code here: https://github.com/operator-framework/operator-sdk/blob/0eefc52889ff3dfe4af406038709e6c5ba7398e5/internal/generate/clusterserviceversion/clusterserviceversion.go#L148-L159
Emulated mode allows the operator to publish synthetic GPU capacity and skip NVML calls. This is handy for development and CI environments with no hardware. Emulated mode is controlled via the EMULATED_MODE environment variable.

The EMULATED_MODE environment variable is read by the operator at startup and determines how the daemonset components behave:

- `disabled` (default): Normal operation mode that requires real MIG-capable GPU hardware and makes NVML calls.
- `enabled`: Emulated mode that simulates MIG-capable GPU capacity without requiring actual hardware.
For local development:
# Run operator locally with emulation
EMULATED_MODE=enabled just run-local
For deployment:
# Deploy with emulated mode enabled
export EMULATED_MODE=enabled
export RELATED_IMAGES=related_images.your-username.json
just deploy-das-ocp
For production with MIG-capable GPUs:
# Deploy with emulated mode disabled (default)
export EMULATED_MODE=disabled
export RELATED_IMAGES=related_images.your-username.json
just deploy-das-ocp
The operator reads the EMULATED_MODE environment variable at startup and passes this configuration to the daemonset pods running on each node. When emulated mode is enabled:
- The daemonset skips hardware detection and NVML library calls
- Synthetic GPU resources are published to simulate hardware capacity
- MIG slicing operations are simulated rather than performed on real hardware
This allows for testing and development of the operator functionality without requiring physical GPU hardware.
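The gating pattern is simple to illustrate. The following Go sketch is hypothetical and heavily simplified (the type and helper names are invented; the real daemonset code is the source of truth); it only shows how an `EMULATED_MODE` check can select a synthetic GPU lister instead of an NVML-backed one.

```go
package emulation

import "os"

// GPULister abstracts how the daemonset discovers MIG-capable GPUs.
type GPULister interface {
	ListGPUs() ([]string, error)
}

// emulatedLister publishes synthetic GPU identifiers without touching NVML.
type emulatedLister struct{}

func (emulatedLister) ListGPUs() ([]string, error) {
	// Synthetic capacity used to exercise scheduling and slicing without hardware.
	return []string{"GPU-emulated-0", "GPU-emulated-1"}, nil
}

// NewLister picks the backend based on EMULATED_MODE ("disabled" is the default).
func NewLister(nvmlBacked GPULister) GPULister {
	if os.Getenv("EMULATED_MODE") == "enabled" {
		return emulatedLister{}
	}
	// Normal operation: enumerate real MIG-capable GPUs via NVML.
	return nvmlBacked
}
```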
This project includes a Justfile for convenient task automation. The Justfile provides several commands for building, pushing, and deploying the operator components.
Install just command runner:
# On macOS
brew install just
# On Fedora/RHEL
dnf install just
# On Ubuntu/Debian
apt install just
# Or via cargo
cargo install just
List all available commands:
just
View current configuration:
just info
Run the operator locally for development:
# Set EMULATED_MODE to 'enabled' for simulated GPUs or 'disabled' for real hardware
EMULATED_MODE=enabled just run-local
Run end-to-end tests:
just test-e2e
Run tests with a specific focus:
just test-e2e focus="GPU slices"
Generate operator bundle:
just bundle-generate
Build and push bundle image:
just build-push-bundle
Build and push developer bundle:
just build-push-developer-bundle
Deploy NVIDIA GPU operator to OpenShift:
just deploy-nvidia-ocp
Remove NVIDIA GPU operator from OpenShift:
just undeploy-nvidia-ocp
Deploy cert-manager for OpenShift:
just deploy-cert-manager-ocp
Remove cert-manager from OpenShift:
just undeploy-cert-manager-ocp
Deploy cert-manager for Kubernetes:
just deploy-cert-manager
Deploy Node Feature Discovery (NFD) operator for OpenShift:
just deploy-nfd-ocp
Run all linting (markdown and Go):
just lint
Run all linting with automatic fixes:
just lint-fix
Run only Go linting:
just lint-go
Run only markdown linting:
just lint-md
Run Go linting and automatically fix issues:
just lint-go-fix
Run markdown linting and automatically fix issues:
just lint-md-fix
Clean up all deployed Kubernetes resources:
just undeploy
Build and push individual component images:
just build-push-scheduler # Build and push scheduler image
just build-push-daemonset # Build and push daemonset image
just build-push-operator # Build and push operator image
just build-push-webhook # Build and push webhook image
Build and push all images in parallel:
just build-push-parallel
Deploy DAS on OpenShift Container Platform:
just deploy-das-ocp
Generate CRDs and clients:
just regen-crd # Generate CRDs into manifests directory
just regen-crd-k8s # Generate CRDs directly into deploy directory
just generate-clients # Generate client code
just verify-codegen # Verify generated client code is up to date
just generate # Generate all - CRDs and clients
Copy related_images.developer.json to related_images.username.json as a template and modify it to point at your developer image repositories.

cp related_images.developer.json related_images.username.json
# Edit related_images.username.json to use your registry, e.g. quay.io/username/image:latest
Then set the RELATED_IMAGES environment variable to related_images.username.json.
RELATED_IMAGES=related_images.username.json just
The Justfile uses environment variables for configuration. You can customize these by setting them in your environment or creating a .env file:

- `PODMAN` - Container runtime (default: `podman`)
- `KUBECTL` - Kubernetes CLI (default: `oc`)
- `EMULATED_MODE` - Enable emulated mode (default: `disabled`)
- `RELATED_IMAGES` - Path to related images JSON file (default: `related_images.json`)
- `DEPLOY_DIR` - Deployment directory (default: `deploy`)
- `OPERATOR_SDK` - Operator SDK binary (default: `operator-sdk`)
- `OPERATOR_VERSION` - Operator version for bundle generation (default: `0.1.0`)
- `GOLANGCI_LINT` - Golangci-lint binary (default: `golangci-lint`)
Example:
export EMULATED_MODE=enabled
just deploy-das-ocp
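The same settings can also live in a `.env` file at the repository root; the example below simply mirrors the variables listed above (whether `just` loads it automatically depends on the Justfile's dotenv configuration, which is not shown in this document).

```bash
# .env
EMULATED_MODE=enabled
RELATED_IMAGES=related_images.username.json
KUBECTL=oc
PODMAN=podman
```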
The diagram below summarizes how the operator components interact. Pods requesting GPU slices are mutated by a webhook to use the `mig.das.com` extended resource. The scheduler plugin tracks slice availability and creates `AllocationClaim` objects processed by the device plugin on each node.
The plugin integrates with the Kubernetes scheduler and runs through three framework phases:
- Filter – ensures the node is MIG capable and stages `AllocationClaim`s for suitable GPUs.
- Score – prefers nodes with the most free MIG slice slots after considering existing and staged claims.
- PreBind – promotes staged claims on the selected node to `created` and removes the rest.
Once promoted, the device plugin provisions the slices.
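For orientation, the following is a minimal, hypothetical skeleton of a scheduler framework plugin wired into those three phases. It is not the project's `mig.go` plugin: the type name, node label, and staging logic are placeholders; only the scheduler framework interfaces (`FilterPlugin`, `ScorePlugin`, `PreBindPlugin`) and their signatures are standard Kubernetes APIs.

```go
package mig

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// migPlugin is a placeholder; the real plugin lives in pkg/scheduler/plugins/mig/mig.go.
type migPlugin struct{}

var (
	_ framework.FilterPlugin  = &migPlugin{}
	_ framework.ScorePlugin   = &migPlugin{}
	_ framework.PreBindPlugin = &migPlugin{}
)

func (p *migPlugin) Name() string { return "mig" }

// Filter rejects nodes that are not MIG capable; the real plugin also stages AllocationClaims here.
func (p *migPlugin) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	if nodeInfo.Node().Labels["example.com/mig-capable"] != "true" { // hypothetical label
		return framework.NewStatus(framework.Unschedulable, "node is not MIG capable")
	}
	return framework.NewStatus(framework.Success)
}

// Score prefers nodes with more free MIG slice slots (slot counting omitted in this sketch).
func (p *migPlugin) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	freeSlots := int64(0) // would be computed from existing and staged claims
	return freeSlots, framework.NewStatus(framework.Success)
}

func (p *migPlugin) ScoreExtensions() framework.ScoreExtensions { return nil }

// PreBind promotes staged claims on the chosen node and discards the rest.
func (p *migPlugin) PreBind(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) *framework.Status {
	// promote staged AllocationClaims for nodeName to "created" here
	return framework.NewStatus(framework.Success)
}
```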
The daemonset advertises GPU resources only after the NVIDIA GPU Operator's `ClusterPolicy` reports a Ready state. This prevents the scheduler from scheduling pods on a node before the GPU Operator has initialized the drivers.
`AllocationClaim` is a namespaced CRD that records which MIG slice will be prepared for a pod. Claims start in the `staged` state and transition to `created` once all requests are satisfied. Each claim stores the GPU UUID, slice position and pod reference.
Example:
$ kubectl get allocationclaims -n das-operator
NAME AGE
8835132e-8a7a-4766-a78f-0cb853d165a2-busy-0 61s
$ kubectl get allocationclaims -n das-operator -o yaml
apiVersion: inference.redhat.com/v1alpha1
kind: AllocationClaim
...
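For illustration only, a claim's spec might carry fields along the lines of the sketch below. The field names are hypothetical; only the apiVersion, kind, and the fact that a claim records a GPU UUID, slice position, pod reference, and `staged`/`created` state come from this document, so use `kubectl get allocationclaims -o yaml` for the real schema.

```yaml
apiVersion: inference.redhat.com/v1alpha1
kind: AllocationClaim
metadata:
  name: 8835132e-8a7a-4766-a78f-0cb853d165a2-busy-0
  namespace: das-operator
spec:
  # Hypothetical field names for the data a claim is documented to store.
  gpuUUID: GPU-8835132e-8a7a-4766-a78f-0cb853d165a2
  slicePosition: 0
  podRef:
    name: busy
    namespace: default
status:
  state: staged   # promoted to "created" once all requests are satisfied
```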
All components run in the `das-operator` namespace:
kubectl get pods -n das-operator
Inspect the active claims:
kubectl get allocationclaims -n das-operator
On the node, verify that the CDI devices were created:
ls -l /var/run/cdi/
Increase verbosity by editing the `DASOperator` resource and setting `operatorLogLevel` to `Debug` or `Trace`.
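As a sketch, the edit might look like the snippet below. The apiVersion, resource name, and namespace are assumptions (check `deploy/03_instaslice_operator.cr.yaml` for the real values); only the `DASOperator` kind and the `operatorLogLevel` field are taken from this document.

```yaml
# apiVersion, name, and namespace below are assumptions; see deploy/03_instaslice_operator.cr.yaml.
apiVersion: das.example.com/v1alpha1
kind: DASOperator
metadata:
  name: cluster
  namespace: das-operator
spec:
  operatorLogLevel: Debug   # or Trace
```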
Run all unit tests for the project:
make test
Run unit tests with verbose output:
go test -v ./pkg/...
Run unit tests with coverage:
go test -cover ./pkg/...
A running cluster with a valid `KUBECONFIG` is required:
just test-e2e
You can focus on specific tests:
just test-e2e focus="GPU slices"
Due to kubernetes/kubernetes#128043, pods may enter an `UnexpectedAdmissionError` state if admission fails. Pods managed by higher-level controllers such as Deployments will be recreated automatically. Naked pods, however, must be cleaned up manually with `kubectl delete pod`. Using controllers is recommended until the upstream issue is resolved.
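For example, running the workload through a Deployment instead of a naked pod is enough to get automatic recreation. The image and the slice resource name below are placeholders (copy the real resource request from `test/test-pod.yaml`):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mig-workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mig-workload
  template:
    metadata:
      labels:
        app: mig-workload
    spec:
      containers:
      - name: cuda
        image: quay.io/example/cuda-workload:latest   # placeholder image
        resources:
          limits:
            # Placeholder slice resource; use the request from test/test-pod.yaml.
            mig.das.com/mig-1g.5gb: "1"
```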
Remove the deployed resources with:
just undeploy
Contributions are welcome! Please open issues or pull requests.
This project is licensed under the Apache 2.0 License.