sycl: always set the main device after initialization #7909
Conversation
Because we are using the main device to determine the context for USM host allocations, we need to ensure it is set to a valid value after initialization, so set device zero as the initial main device. This also adds a small refactor to the GPU detection logic to ensure all GPUs are from the same backend. Although unlikely due to the max compute unit check, the prior code would attempt to use GPUs from different backends together if they happened to have the same maximum number of compute units. As an added bonus, the updates also work with GPUs using the OpenCL backend.
@@ -17419,6 +17414,7 @@ GGML_API GGML_CALL void ggml_backend_sycl_set_mul_device_mode() {
         g_sycl_gpu_mgr = new sycl_gpu_mgr();
         g_ggml_sycl_backend_gpu_mode = SYCL_MUL_GPU_MODE;
         ggml_init_by_gpus(g_sycl_gpu_mgr->get_gpu_count());
    +    ggml_sycl_set_main_device(0);
I think this breaks multi-GPU semantics, @NeoZhangJianyu can you try this on a multi-GPU env?
I convinced myself that this would be OK even in a multi-GPU environment, though admittedly I haven't tested this myself, so it'd be great to confirm it works.

My thinking is: we're eventually going to set the main device via some other codepath, such as via `ggml_backend_sycl_init` (probably through `llama_new_context_with_model`). We may even set the main device multiple times. This is all fine; we just need some valid initial value, so that if we happen to look up the SYCL queue and hence the SYCL context, say to allocate host USM when loading a model, we have a valid value to perform the lookup.
@bashbaug Currently, multi-GPU mode only supports Level Zero devices. The SYCL backend supports two modes: single GPU and multiple GPUs. So I think the current PR should be updated according to the description above.
After going through the changed code, I think this PR should be refactored entirely.
If you don't mind, I'd like to know the original issue this PR is meant to fix.
@@ -17400,6 +17394,7 @@ GGML_API GGML_CALL void ggml_backend_sycl_set_single_device_mode(int main_gpu_id
         g_sycl_gpu_mgr = new sycl_gpu_mgr(main_gpu_id);
         g_ggml_sycl_backend_gpu_mode = SYCL_SINGLE_GPU_MODE;
         ggml_init_by_gpus(g_sycl_gpu_mgr->get_gpu_count());
    +    ggml_sycl_set_main_device(0);
In single-GPU mode, the main device ID is set by the command-line parameter.
Setting it to 0 here will in fact disable the `--main-gpu` parameter.
So please remove this line.
What about `ggml_sycl_set_main_device(main_gpu_id)`?
I was confused by this initially also, but I think zero is the only safe and correct initial value. Here's why:

There are two sets of devices we can get and iterate through. The first is the set of devices returned by `dpct::dev_mgr::instance().get_device()`. This is the set of all devices in the system, and `main_gpu_id` is an index into this set. The second is the set of devices stored in `sycl_gpu_mgr`. This is essentially a "filtered" set of devices we've chosen to use, and it can be indexed from zero to `sycl_gpu_mgr->get_gpu_count()`.

In the case where we choose a main GPU on the command line, the filtering is performed when we create the `sycl_gpu_mgr` above:

Line 17392 in 172c825

    g_sycl_gpu_mgr = new sycl_gpu_mgr(main_gpu_id);

After the filtering occurs, the only valid index to pass to `ggml_sycl_set_main_device()` is index zero, because there is only one device in the `sycl_gpu_mgr`.
@@ -17419,6 +17414,7 @@ GGML_API GGML_CALL void ggml_backend_sycl_set_mul_device_mode() {
         g_sycl_gpu_mgr = new sycl_gpu_mgr();
         g_ggml_sycl_backend_gpu_mode = SYCL_MUL_GPU_MODE;
         ggml_init_by_gpus(g_sycl_gpu_mgr->get_gpu_count());
    +    ggml_sycl_set_main_device(0);
In multiple-GPU mode, setting the main GPU is not needed; GPU #0 is always the default main GPU.
Unfortunately this isn't the case:

Line 3374 in 172c825

    static int g_main_device = -1;

We could change the initial value of `g_main_device` from -1 to 0, but we'd probably also want to change some other initial values to stay in sync, say for `g_main_device_id`. It seems safer to me to just call `ggml_sycl_set_main_device(0)` instead, but let me know what you prefer.
Since #7640, SYCL support has been broken. On which base did you test your code?
I think it's well-understood what causes the breakage: host memory is being allocated (

The original PR #7777 fixes this by allocating & freeing the host memory using the correct context. Unfortunately (sometimes?) the main device isn't set before

This PR, I believe, attempts to fix this by ensuring that a main device is set early enough.
It's kind of confusing that we get `sycl::malloc_host` calls before a call to `ggml_backend_sycl_init`, which should in turn call `ggml_sycl_set_main_device(device);`.

@bashbaug do you know where these calls are coming from?
@@ -17400,6 +17394,7 @@ GGML_API GGML_CALL void ggml_backend_sycl_set_single_device_mode(int main_gpu_id
         g_sycl_gpu_mgr = new sycl_gpu_mgr(main_gpu_id);
         g_ggml_sycl_backend_gpu_mode = SYCL_SINGLE_GPU_MODE;
         ggml_init_by_gpus(g_sycl_gpu_mgr->get_gpu_count());
    +    ggml_sycl_set_main_device(0);
What about `ggml_sycl_set_main_device(main_gpu_id)`?
Yes, I agree - we also need to prevent this because SYCL does not allow creating a context from devices from different platforms: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:interface.context.class

The check I added to ensure all of our chosen devices come from the same SYCL platform also ensures that we do not use two "logical devices" based on the same "physical device", so I think this is covered.
Yeah, here's a stack trace showing where the call is coming from:
Short answer: it's coming from
I believe #7777 has been fixed in #7710, as confirmed by AidanBeltonS. Could you give it a try?
Fixes an issue reported in llama-bench and elsewhere after merging #7777, see also #7858.
Testing done (on an Intel A750) - all commands executed successfully:
    $ ./llama-bench -m ./models/llama-2-7b-chat.Q4_K_M.gguf -ngl 77 --mmap 0
    $ ONEAPI_DEVICE_SELECTOR=opencl:gpu ./llama-bench -m ./models/llama-2-7b-chat.Q4_K_M.gguf -ngl 77 --mmap 0
    $ ONEAPI_DEVICE_SELECTOR=ext_oneapi_level_zero:* ./llama-bench -m ./models/llama-2-7b-chat.Q4_K_M.gguf -ngl 77 --mmap 0