CUDAw

Goal

CUDAw is built on CUDA wrapper and implements virtual GPU memory translation, so that CUDA applications can be paused and resumed without the user being aware of it. It traces and records every GPU memory operation, such as cudaMalloc and cudaMemcpy, and every kernel function call, which lets applications correctly use virtual GPU memory managed by administrators. It also supports checkpointing a user application: releasing its GPU memory, suspending kernel function execution, and restoring the application at any specified time.

Based on CUDAw, we have implemented a mechanism for sharing GPUs across multiple workloads. Our additional work is as follows:

Anytime pause and resume of CUDA application
Fine-grained time-sharing of GPU cores and memory
Fine-grained GPU memory virtualization for CUDA application
Deep learning job migration across nodes
Co-design of deep learning jobs and scheduler

Approach

Like CUDA wrapper, we wrap the NVIDIA runtime API using dlsym and capture every memory operation and kernel launch. For generality and ease of use, we wrap all possible CUDA calls, so the wrapper can be used on its own as a fake libcudart.so.
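The wrapping idea can be sketched for a single call. This is a minimal illustration, not the code in cudawrt.c: it models cudaError_t as a plain int so that it builds without the CUDA headers, and it assumes the real runtime is found through a hypothetical environment variable, CUDAW_REAL_CUDART, because a library posing as libcudart.so cannot look itself up by that name. The names cudaw_status_t, CUDAW_ERR_NO_RUNTIME, cudaw_malloc_calls and cudaw_real_symbol are likewise invented for the example.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumption: cudaError_t is modelled as int (0 == cudaSuccess) so the
 * sketch compiles without the CUDA headers. */
typedef int cudaw_status_t;
#define CUDAW_ERR_NO_RUNTIME (-1)        /* hypothetical error code */

typedef cudaw_status_t (*cuda_malloc_fn)(void **, size_t);

unsigned long cudaw_malloc_calls = 0;    /* hypothetical trace counter */

/* Look up a symbol in the real runtime. The library path is read from a
 * hypothetical environment variable; NULL is returned if it is unset or
 * the library cannot be opened. */
static void *cudaw_real_symbol(const char *name) {
    static void *handle = NULL;
    assert(name != NULL);
    if (handle == NULL) {
        const char *path = getenv("CUDAW_REAL_CUDART");
        if (path == NULL)
            return NULL;
        handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
        if (handle == NULL)
            return NULL;
    }
    return dlsym(handle, name);
}

/* Same name and argument list as the runtime API, so the dynamic linker
 * binds the application's calls to this function instead. */
cudaw_status_t cudaMalloc(void **dev_ptr, size_t size) {
    static cuda_malloc_fn real_fn = NULL;

    cudaw_malloc_calls++;                               /* record the call */
    fprintf(stderr, "[trace] cudaMalloc(size=%zu)\n", size);

    if (real_fn == NULL)
        real_fn = (cuda_malloc_fn)cudaw_real_symbol("cudaMalloc");
    if (real_fn == NULL)
        return CUDAW_ERR_NO_RUNTIME;                    /* nothing to forward to */
    return real_fn(dev_ptr, size);                      /* forward to the real runtime */
}
```

Built as a shared object (for example with `gcc -shared -fPIC`, adding `-ldl` on older glibc versions), a file like this would take the place of the runtime library for one function; every other wrapped call follows the same resolve, record, forward pattern.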

We implement full GPU memory management and address translation for CUDA applications; in other words, a complete GPU memory abstraction layer. When a user application tries to allocate or access GPU memory, our wrapper handles the request and passes the true GPU address to the real libcudart.so, which carries out the operation on the GPU device through CUDA.
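One way to picture the abstraction layer is a table that maps the virtual addresses handed to the application onto the real addresses returned by the runtime. The sketch below is an assumption about how such a table could look, not a description of vaddr.c; the type names, the fixed capacity, the 256-byte alignment and the VADDR_BASE constant are all hypothetical. Addresses are plain integers here, so the example runs without a GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VTABLE_CAP 64                  /* hypothetical fixed capacity */
#define VADDR_BASE 0x100000000000ULL   /* hypothetical start of the virtual range */

typedef struct {
    uint64_t vbase;   /* address handed to the application */
    uint64_t rbase;   /* address from the real allocator; 0 while released */
    size_t   size;    /* length of the allocation in bytes */
} vrange_t;

typedef struct {
    vrange_t ranges[VTABLE_CAP];
    size_t   count;
    uint64_t next_vaddr;
} vtable_t;

void vtable_init(vtable_t *t) {
    assert(t != NULL);
    t->count = 0;
    t->next_vaddr = VADDR_BASE;
}

/* Register a real allocation and return its virtual address (0 on failure). */
uint64_t vtable_map(vtable_t *t, uint64_t rbase, size_t size) {
    if (t->count == VTABLE_CAP || size == 0)
        return 0;
    vrange_t *r = &t->ranges[t->count++];
    r->vbase = t->next_vaddr;
    r->rbase = rbase;
    r->size  = size;
    /* Advance to the next 256-byte boundary so ranges never overlap. */
    t->next_vaddr += ((uint64_t)size + 255) & ~(uint64_t)255;
    return r->vbase;
}

/* Translate a virtual address, including interior pointers, to the real
 * address. Returns 0 if the address is unmapped or currently released. */
uint64_t vtable_translate(const vtable_t *t, uint64_t vaddr) {
    for (size_t i = 0; i < t->count; i++) {
        const vrange_t *r = &t->ranges[i];
        if (vaddr >= r->vbase && vaddr - r->vbase < r->size)
            return r->rbase ? r->rbase + (vaddr - r->vbase) : 0;
    }
    return 0;
}

/* Point an existing virtual range at a new real allocation, e.g. after a
 * resume. The application's virtual address stays the same. */
int vtable_rebind(vtable_t *t, uint64_t vbase, uint64_t new_rbase) {
    for (size_t i = 0; i < t->count; i++) {
        if (t->ranges[i].vbase == vbase) {
            t->ranges[i].rbase = new_rbase;
            return 0;
        }
    }
    return -1;
}
```

Because the application only ever sees the virtual base, rebinding a range to a different real address does not invalidate any pointer the application holds, which is the property that pause and resume depend on.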

We use some novel techniques to obtain the parameters of the kernel functions passed to cudaLaunchKernel, which tell us which memory addresses to translate. We have implemented automatic discovery of kernel function parameters.
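cudaLaunchKernel receives the kernel parameters as an array of pointers, one per parameter, so translation means rewriting the pointer-valued entries before the call is forwarded. The sketch below shows only that rewriting step and assumes the number and size of the parameters are already known; it does not attempt the automatic discovery described above. The descriptor kernel_sig_t, the callback type, the function translate_kernel_args and the demo constants are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_KERNEL_ARGS 32                 /* hypothetical limit */
#define DEMO_VBASE 0x100000000000ULL       /* hypothetical virtual range start */
#define DEMO_RBASE 0x7000ULL               /* hypothetical real base address */
#define DEMO_SIZE  4096ULL

/* Hypothetical descriptor: count and byte size of each kernel parameter. */
typedef struct {
    size_t count;
    size_t sizes[MAX_KERNEL_ARGS];
} kernel_sig_t;

/* Returns the real address, or 0 if vaddr is not a managed virtual address. */
typedef uint64_t (*translate_fn)(uint64_t vaddr);

/* Demo translator covering one range; a real table would hold many. */
uint64_t demo_translate(uint64_t vaddr) {
    if (vaddr >= DEMO_VBASE && vaddr - DEMO_VBASE < DEMO_SIZE)
        return DEMO_RBASE + (vaddr - DEMO_VBASE);
    return 0;
}

/* Build out_args for the real launch. Pointer-sized parameters whose value
 * lies in a managed range are replaced by a translated copy kept in
 * storage; everything else is passed through untouched. The caller's
 * argument values are never modified. Returns the number of translated
 * parameters, or -1 if the signature is too long. */
int translate_kernel_args(const kernel_sig_t *sig, void **args,
                          uint64_t *storage, void **out_args,
                          translate_fn translate) {
    assert(sig != NULL && translate != NULL);
    if (sig->count > MAX_KERNEL_ARGS)
        return -1;
    int translated = 0;
    for (size_t i = 0; i < sig->count; i++) {
        out_args[i] = args[i];
        if (sig->sizes[i] != sizeof(uint64_t))
            continue;                       /* cannot hold a device pointer */
        uint64_t value;
        memcpy(&value, args[i], sizeof value);
        uint64_t real = translate(value);
        if (real == 0)
            continue;                       /* ordinary 8-byte value */
        storage[i] = real;
        out_args[i] = &storage[i];
        translated++;
    }
    return translated;
}
```

A value-based check like this would mistranslate an 8-byte integer that happens to fall inside a managed range; placing the virtual range at addresses unlikely to occur as ordinary data reduces that risk but does not remove it.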

We also implement mechanisms that make CUDAw easier to use, and we co-design a checkpoint notifier with a custom deep learning training (DLT) job scheduler on Kubernetes. Any-time checkpointing and migration of jobs open up new possibilities for the scheduler's strategy and performance.
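At the level of a single allocation, any-time checkpointing comes down to copying the contents out, releasing the memory, and later allocating again and copying the contents back. The sketch below illustrates that cycle under a strong simplification: host malloc and memcpy stand in for GPU memory so the example runs without a GPU, where a GPU version would use cudaMalloc, cudaMemcpy and cudaFree. The tracked_alloc_t record and both function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record for one tracked allocation. `device` stands in for
 * GPU memory; host memory is used here so the sketch runs anywhere. */
typedef struct {
    void  *device;   /* live buffer, NULL while paused */
    void  *saved;    /* host copy, NULL while running */
    size_t size;     /* length in bytes */
} tracked_alloc_t;

/* Pause: copy the contents out, then release the live buffer.
 * Returns 0 on success, -1 if already paused or out of memory. */
int checkpoint_alloc(tracked_alloc_t *a) {
    assert(a != NULL);
    if (a->device == NULL)
        return -1;
    a->saved = malloc(a->size);
    if (a->saved == NULL)
        return -1;
    memcpy(a->saved, a->device, a->size);   /* stand-in for a device-to-host copy */
    free(a->device);                        /* stand-in for releasing GPU memory */
    a->device = NULL;
    return 0;
}

/* Resume: allocate a fresh buffer, copy the saved contents back, and drop
 * the host copy. The new buffer may live at a different address.
 * Returns 0 on success, -1 if not paused or out of memory. */
int restore_alloc(tracked_alloc_t *a) {
    assert(a != NULL);
    if (a->saved == NULL)
        return -1;
    a->device = malloc(a->size);            /* stand-in for a GPU allocation */
    if (a->device == NULL)
        return -1;
    memcpy(a->device, a->saved, a->size);   /* stand-in for a host-to-device copy */
    free(a->saved);
    a->saved = NULL;
    return 0;
}
```

Since the restored buffer can end up at a different address, a cycle like this only works when the application addresses its memory through a translation layer rather than through raw device pointers.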

Ongoing work includes reducing the overhead of checkpointing and developing the framework and algorithms for co-designing the scheduler with these techniques. We will release job migration support shortly.

Usage

To be continued...


yzs981130/cudaw
