Added GPU enabled sandbox image. #3256
base: master
Conversation
Thank you for opening this pull request! 🙌 These tips will help get your PR across the finish line:
|
This is great. I'm wondering if there is a way to do this in a more scalable way. Namely, perhaps we can refactor our sandbox image in a way that the community can easily layer new functionality on (e.g. GPUs). That way teams can build their own sandbox images, and run |
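To make the layering idea above concrete, here is a rough, purely hypothetical sketch; the base image reference, file name, and tags are placeholders rather than existing Flyte artifacts:

# Hypothetical: a team derives its own sandbox image from an official base
# instead of forking the whole Dockerfile (base image name is a placeholder)
cat > Dockerfile.custom <<'EOF'
FROM flyte-sandbox-base:latest
# layer GPU libraries, drivers, or other team-specific tooling here
EOF
docker build -f Dockerfile.custom -t my-team/flyte-sandbox:gpu .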
@jeevb Interesting idea! :-) I think there could be some creativity unlocked by such a solution. On the flip side I think there is a great benefit to having an easy way to start an "official" sandbox/demo cluster for the users that don't have k8s-expertise. The sandbox cluster is really perfect for that! GPU capabilities is (unfortunately) a must for a lot of data science use cases however, so I think that would be a nice addition to the official images. |
@ahlgol I am working on deploying flyte on prem and this is really a great help. I was able to build the However, when I start the sandbox cluster with Am I missing something? What is the correct way to use the gpu image? |
Didn't even know about the --image parameter to flytectl - nice :-). What I did for testing was to replace the local image with the one built with GPU support and spin up a cluster with a regular |
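For reference, the --image route mentioned above would look roughly like this; the image name and tag are placeholders for whatever you tagged your GPU build as:

# Start the demo cluster from a custom sandbox image instead of the bundled default
flytectl demo start --image myrepo/flyte-sandbox:gpu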
Right; so if I understand this correctly, what is currently missing from the gpu demo image is the new bootstrapping functionality @jeevb added in February. I will try to update my PR with this and try it out, but in the meantime you can manually add it to the Dockerfile.gpu with something like this:
@@ -10,7 +10,19 @@ WORKDIR /build
COPY images/manifest.txt images/preload ./
RUN --security=insecure ./preload manifest.txt
+FROM --platform=${BUILDPLATFORM} golang:1.19-bullseye AS bootstrap
+ARG TARGETARCH
+ENV CGO_ENABLED 0
+ENV GOARCH "${TARGETARCH}"
+ENV GOOS linux
+
+WORKDIR /flyteorg/build
+COPY bootstrap/go.mod bootstrap/go.sum ./
+RUN go mod download
+COPY bootstrap/ ./
+RUN --mount=type=cache,target=/root/.cache/go-build --mount=type=cache,target=/root/go/pkg/mod \
+ go build -o dist/flyte-sandbox-bootstrap cmd/bootstrap/main.go
# syntax=docker/dockerfile:1.4-labs
#Following
@@ -57,6 +69,8 @@ COPY images/tar/${TARGETARCH}/ /var/lib/rancher/k3s/agent/images/
COPY manifests/ /var/lib/rancher/k3s/server/manifests-staging/
COPY bin/ /bin/
+COPY --from=bootstrap /flyteorg/build/dist/flyte-sandbox-bootstrap /bin/
+
VOLUME /var/lib/kubelet
VOLUME /var/lib/rancher/k3s
VOLUME /var/lib/cni
I could then deploy it with
I haven't had time to test it out, but the cluster starts up as it should. Please let me know if it works for you. |
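For anyone trying to reproduce this, a hedged sketch of building and deploying the image, assuming the build-gpu make target added in this PR and a placeholder tag for the result:

# Build the GPU-enabled sandbox image (make target added in this PR)
cd docker/sandbox-bundled
make build-gpu

# Recreate the demo cluster from the freshly built image;
# replace the tag with whatever your build actually produced
flytectl demo start --image flyte-sandbox-gpu:latest --force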
@Nan2018 PR updated now. |
This is really cool. How do we get it into the official version? |
@ahlgol I was able to build the GPU image with the updated PR, but the sandbox container still immediately exited with code 1 (same with docker run). What version of flytectl did you test with? I am on
|
Yeah, I have the same one... Do you have docker configured with the nvidia container runtime? https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html Notice the test in that guide. I will also do a new test from a clean environment... |
@Nan2018 something you can try is starting up the container with bash as the entrypoint:
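As a concrete example of that debugging step; the image tag below is a placeholder for your GPU build, and --gpus all assumes the NVIDIA container toolkit is installed:

# Override the entrypoint so the container drops into a shell instead of
# starting k3s, which makes an immediate exit with code 1 easier to inspect
docker run --rm -it --gpus all --entrypoint bash myrepo/flyte-sandbox:gpu

# Inside the container, check whether the GPU is visible
nvidia-smi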
|
That would be awesome, but I don't know... :-( |
ok brainstorming |
The conversation continued on Slack, but just as a reference, for /etc/docker/daemon.json:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
} |
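In case it helps anyone following along, the usual follow-up after editing daemon.json is roughly the following; the CUDA image tag is only an example:

# Restart Docker so the new default runtime takes effect
sudo systemctl restart docker

# With the nvidia runtime as the default, a plain container should now see the GPU
docker run --rm nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi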
What's the status? |
I'm not sure who you're asking @davidmirror-ops :-) From my side, I can't maintain this out of tree, as I'm still not using flyte on a regular basis. From the project's side, it seems that it can't be included since there are no GPUs available in the build/testing environment. |
Hi @ahlgol |
@gakumar49606 However, given the and , it seems that if the nvidia container runtime is the default runtime, adding the environment variable |
@ahlgol I found the way out.
With the above changes, |
Glad it worked, and thanks for letting me know :-) |
@ahlgol How's it going with this PR? |
Hello @Future-Outlier! The main problem with this PR was on the receiving end; as there was no way to test it, they couldn't accept it. I think it still has value though, so if you have time to keep it up to date and provide support in the chat, you're more than welcome to! :-) Let me know if you need any help. //Björn |
My laptop has a GPU; I will try to test it today. |
My laptop had a GPU. I tested it sometime back in Aug. It worked just fine and picked up the driver !!! |
The "nvidia/cuda" base image should be changed; I am currently testing it. |
Please update the cuda version.
|
@Future-Outlier There you go... |
Hi @ahlgol, I hope this message finds you well. I'm currently focusing on finalizing the PR and could use your assistance with a couple of tasks. Could you please: Merge the PR with the latest master branch. |
@ahlgol Can you join Flyte's community? |
The process to test it
1. Start the sandbox with the GPU image:
flytectl demo start --image futureoutlier/flyte-sandbox:gpu-v2 --disable-agent --force
2. Set the config in the flyte sandbox-config:
kubectl edit configmap flyte-sandbox-config -n flyte
plugins:
  k8s:
    resource-tolerations:
      - nvidia.com/gpu:
        - key: "key1"
          operator: "Equal"
          value: "value1"
          effect: "NoSchedule"
kubectl rollout restart deployment flyte-sandbox -n flyte
3. Run the job:
from flytekit import ImageSpec, Resources, task

gpu = "1"

@task(
    retries=2,
    cache=True,
    cache_version="1.0",
    requests=Resources(gpu=gpu),
    # container_image=ImageSpec(
    #     cuda="12.2",
    #     python_version="3.9.13",
    #     packages=["torch"],
    #     apt_packages=["git"],
    #     registry="futureoutlier",)
)
def check_if_gpu_available() -> bool:
    import torch
    return torch.cuda.is_available() |
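To actually kick off step 3, one possible invocation, assuming the task above is saved as gpu_check.py (the file name is hypothetical):

# Run the task remotely against the sandbox cluster
pyflyte run --remote gpu_check.py check_if_gpu_available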
Ensure that your setup for running GPUs on Kubernetes is correct.
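One quick way to verify this, assuming the NVIDIA device plugin is running so the GPU is advertised as an allocatable resource:

# A GPU-ready node should report a non-zero nvidia.com/gpu value
kubectl describe nodes | grep -i nvidia.com/gpu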
|
GPU Issue Full Steps Guide (Please change to root user, don't use sudo)
0. Prerequisites
Ensure you have installed them and you can run them all
1. Ensure that your setting for running GPU on Kubernetes is correct
2. Build the GPU Image
  1. Create a GPU Dockerfile, and add the relevant change
     I will give you 2 options here: apply @ahlgol's diff, or use mine (already applied). @ahlgol's PR is here: https://github.com/flyteorg/flyte/pull/3256/files
  2. Change the
|
PR Updates: Please merge my changes here: https://github.com/danpf/flyte/tree/danpf-sandbox-gpu into this branch. Diff:
|
@ahlgol Can you sign off all your previous commits and merge the diff? |
A new Dockerfile and build-target "build-gpu" in docker/sandbox-bundled that builds a CUDA enabled image named flyte-sandbox-gpu.
Note to reviewers
Changes have been added following info from these sources (plus some trial and error):
https://itnext.io/enabling-nvidia-gpus-on-k3s-for-cuda-workloads-a11b96f967b0
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
https://k3d.io/v5.4.6/usage/advanced/cuda/