sycl: always set the main device after initialization #7909
Conversation
Because we are using the main device to determine the context for USM host allocations, we need to ensure it is set to a valid value after initialization, so set device zero as the initial main device. This also adds a small refactor to the GPU detection logic to ensure all GPUs are from the same backend. Although unlikely due to the max compute unit check, the prior code would attempt to use GPUs from different backends together if they happened to have the same maximum number of compute units. As an added bonus, the updates also work with GPUs using the OpenCL backend.
@@ -17419,6 +17414,7 @@ GGML_API GGML_CALL void ggml_backend_sycl_set_mul_device_mode() {
         g_sycl_gpu_mgr = new sycl_gpu_mgr();
         g_ggml_sycl_backend_gpu_mode = SYCL_MUL_GPU_MODE;
         ggml_init_by_gpus(g_sycl_gpu_mgr->get_gpu_count());
    +    ggml_sycl_set_main_device(0);
I think this breaks multi-GPU semantics, @NeoZhangJianyu can you try this on a multi-GPU env?
I convinced myself that this would be OK even in a multi-GPU environment, though admittedly I haven't tested this myself, so it'd be great to confirm it works.

My thinking is: we're eventually going to set the main device via some other codepath, such as via `ggml_backend_sycl_init` (probably through `llama_new_context_with_model`). We may even set the main device multiple times. This is all fine; we just need some valid initial value, so that if we happen to look up the SYCL queue and hence the SYCL context, say to allocate host USM when loading a model, we have a valid value to perform the lookup.
@bashbaug Currently, multi-GPU mode only supports Level Zero devices. The SYCL backend supports two modes: single GPU and multiple GPUs. So I think the current PR should be updated according to the description above.
After going through the changed code, I think this PR should be refactored entirely.
If you don't mind, I'd like to know the original issue this PR is meant to fix.
@@ -17400,6 +17394,7 @@ GGML_API GGML_CALL void ggml_backend_sycl_set_single_device_mode(int main_gpu_id
         g_sycl_gpu_mgr = new sycl_gpu_mgr(main_gpu_id);
         g_ggml_sycl_backend_gpu_mode = SYCL_SINGLE_GPU_MODE;
         ggml_init_by_gpus(g_sycl_gpu_mgr->get_gpu_count());
    +    ggml_sycl_set_main_device(0);
In single-GPU mode, the main device ID is set by the command-line parameter.
Setting it to 0 here will in fact disable the `--main-gpu` parameter.
So please remove this line.
What about `ggml_sycl_set_main_device(main_gpu_id)`?
I was confused by this initially also, but I think zero is the only safe and correct initial value. Here's why:

There are two sets of devices we can get and iterate through. The first is the set of devices returned by `dpct::dev_mgr::instance().get_device()`. This is the set of all devices in the system, and `main_gpu_id` is an index into this set. The second is the set of devices stored in `sycl_gpu_mgr`. This is essentially a "filtered" set of devices we've chosen to use, and it can be indexed from zero to `sycl_gpu_mgr->get_gpu_count()`.

In the case where we choose a main GPU on the command line, the filtering is performed when we create the `sycl_gpu_mgr` above:

Line 17392 in 172c825

    g_sycl_gpu_mgr = new sycl_gpu_mgr(main_gpu_id);

After the filtering occurs, the only valid index to pass to `ggml_sycl_set_main_device()` is index zero, because there is only one device in the `sycl_gpu_mgr`.
@@ -17419,6 +17414,7 @@ GGML_API GGML_CALL void ggml_backend_sycl_set_mul_device_mode() {
         g_sycl_gpu_mgr = new sycl_gpu_mgr();
         g_ggml_sycl_backend_gpu_mode = SYCL_MUL_GPU_MODE;
         ggml_init_by_gpus(g_sycl_gpu_mgr->get_gpu_count());
    +    ggml_sycl_set_main_device(0);
In multiple-GPU mode, setting the main GPU is not needed; GPU #0 is always the default main GPU.
Unfortunately this isn't the case:

Line 3374 in 172c825

    static int g_main_device = -1;

We could change the initial value of `g_main_device` from -1 to 0, but we'd probably also want to change some other initial values to stay in sync, say for `g_main_device_id`. It seems safer to me to just call `ggml_sycl_set_main_device(0)` instead, but let me know what you prefer.
Since #7640, SYCL support has been broken. On which base did you test your code?
I think it's well-understood what causes the breakage: host memory is being allocated (

The original PR #7777 fixes this by allocating & freeing the host memory using the correct context. Unfortunately (sometimes?) the main device isn't set before

This PR, I believe, attempts to fix this by ensuring that a main device is set early enough.
It's kind of confusing that we get `sycl::malloc_host` calls before a call to `ggml_backend_sycl_init`, which should in turn call `ggml_sycl_set_main_device(device);`.

@bashbaug do you know where these calls are coming from?
@@ -17400,6 +17394,7 @@ GGML_API GGML_CALL void ggml_backend_sycl_set_single_device_mode(int main_gpu_id
         g_sycl_gpu_mgr = new sycl_gpu_mgr(main_gpu_id);
         g_ggml_sycl_backend_gpu_mode = SYCL_SINGLE_GPU_MODE;
         ggml_init_by_gpus(g_sycl_gpu_mgr->get_gpu_count());
    +    ggml_sycl_set_main_device(0);
What about `ggml_sycl_set_main_device(main_gpu_id)`?
Yes, I agree - we also need to prevent this because SYCL does not allow creating a context from devices from different platforms: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:interface.context.class

The check I added to ensure all of our chosen devices come from the same SYCL platform also ensures that we do not use two "logical devices" based on the same "physical device", so I think this is covered.
Yeah, here's a stack trace showing where the call is coming from:
Short answer: it's coming from
I believe #7777 has been fixed in #7710, as confirmed by AidanBeltonS. Could you give it a try?
Fixes an issue reported in llama-bench and elsewhere after merging #7777, see also #7858.
Testing done (on an Intel A750) - all commands executed successfully:
    $ ./llama-bench -m ./models/llama-2-7b-chat.Q4_K_M.gguf -ngl 77 --mmap 0
    $ ONEAPI_DEVICE_SELECTOR=opencl:gpu ./llama-bench -m ./models/llama-2-7b-chat.Q4_K_M.gguf -ngl 77 --mmap 0
    $ ONEAPI_DEVICE_SELECTOR=ext_oneapi_level_zero:* ./llama-bench -m ./models/llama-2-7b-chat.Q4_K_M.gguf -ngl 77 --mmap 0