[PROPOSAL]: unify the access interface to different devices, such as CPU, GPU, XPU #3350
ftian1 started this conversation in Development | Core
Summary
This RFC proposes a unified access interface to different devices (CPU and GPU) across the components of ColossalAI. The proposal will also make it easier to add Intel XPU support to ColossalAI in the future.
Motivation
The major features of ColossalAI are built on NVIDIA GPUs and the CUDA stack. This limits the range of device types on which ColossalAI can run LLM workloads.
For example, `utils/cuda.py`, `context/parallel_context.py`, and some other modules each expose their own separate interfaces for other components to access the `cpu` or `gpu` device. Besides that, many internal components invoke `torch.cuda` explicitly. We would like to propose a unified device access interface that supports not only NVIDIA GPUs but also other device types, such as Intel x86 CPUs and XPUs.
Proposal
NOTE 1: Currently the proposal focuses mainly on the ColossalAI training part. ColossalAI inference support is out of scope here.
NOTE 2: This RFC focuses on the Python API level only. Replacing CUDA kernels with CPU implementations, and the corresponding upper-layer features (such as `nn.optimizer`, `nn.layer`, Gemini, and so on), are out of scope here; we plan to cover them in separate RFCs/PRs.
As ColossalAI training is designed to speed up NLP & LLM training through data and model parallelism, it already has a central place to store the execution context in `core.global_context`. The first part of the proposal is to extend this `core.global_context` structure to get and set device-related information. The user-facing `engine` and `trainer` APIs will rely on it to copy tensors to the device the application is running on. The details are sketched below:
From the user's view, the training code is simpler than before, as the user no longer needs to specify a device explicitly. The new logic automatically moves/casts tensors to the device in use. For example, if the underlying hardware is an NVIDIA GPU, tensors automatically go to the CUDA device; if it is an Intel x86 CPU, tensors stay on the CPU side.
Below is some sample code to demonstrate this idea.
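A minimal sketch of the user-facing behavior described above, assuming a hypothetical `get_current_device` helper and a toy `Engine` wrapper (neither is the actual ColossalAI API):

```python
import torch


def get_current_device() -> torch.device:
    # Hypothetical helper: in the proposal, this would read the device type
    # from core.global_context; here we just detect CUDA availability.
    return torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")


class Engine:
    """Toy engine that places the model and batches on the context device."""

    def __init__(self, model: torch.nn.Module):
        self.device = get_current_device()
        self.model = model.to(self.device)

    def __call__(self, batch: torch.Tensor) -> torch.Tensor:
        # User code never calls .cuda() explicitly; placement follows the
        # device recorded in the context.
        return self.model(batch.to(self.device))


model = torch.nn.Linear(4, 2)
engine = Engine(model)
out = engine(torch.randn(3, 4))  # runs on GPU if available, otherwise CPU
```

The same script runs unmodified on a CUDA machine or a pure-CPU machine, which is the point of routing all placement decisions through the shared context.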
Future Work
This RFC focuses on discussing the unified device access interface. We will gradually add CPU support to the internal components to make the whole functionality work, following the TODO list below.