Recursive + GPU seems broken #548

therault · 2023-05-24T21:09:50Z

Describe the bug

An assert triggers when running a PTG test that combines recursive and CUDA bodies.

To Reproduce

Steps to reproduce the behavior:

Checkout version master (5306015)
Compile on a machine with CUDA with NOISIER and PARANOID debugs, and CUDA enabled, and RECURSIVE support enabled
Install, compile dplasma master (ICLDisco/dplasma@8b36497) using this install of parsec
Run test testing_spotrf -N 256 -t 128 -x 1 -g 1 on a machine with a GPU
See assert triggering.

Additional context

I tracked the issue, suspecting a problem with versioning from Aurelien's report, but it appears that we run first the CUDA kernel of POTRF(1), then we proceed to run the RECURSIVE kernel of POTRF(1) (which messes up the status of the CUDA copies, leading to unexpected issues raised by the assert).

Followup tasks

When this issue is resolved, this PR needs to be undone. Recursive disabled temporarily must be re-enabled when upstream issue fixed dplasma#84

The text was updated successfully, but these errors were encountered:

therault · 2023-05-25T19:34:47Z

Exploring this issue, I see the following behavior:

__parsec_execute selects first the RECURSIVE hook for POTRF(0)
During the preamble (generated code) of the RECURSIVE hook, we transfer the ownership of the tile to the device 0. We do this unconditionally, i.e. even if the hook decides to return NEXT because we decide to not split the task in something more refined.
Then, the RECURSIVE hook discovers that we don't want to split the task in multiple elements and it returns NEXT
Then, we select the CUDA hook, and execute it.
First thing the CUDA hook does is decide where to run (get_best_device), and that eventually acquire the data on the CUDA device.

For POTRF(0) that works, more or less, because the data was on the CPU and the CUDA task is the one moving it on the GPU.

But then, SYRK(0, 1) executes on the CUDA device and acquires A(1, 1) on the GPU.

When POTRF(1) is tried, we do like for POTRF(0), and in particular, while testing for the RECURSIVE device, we transfer the ownership to device 0, without pulling the data up to the CPU. Then we return NEXT.

When the CUDA POTRF(1) task is scheduled, it finds that it needs to do a transfer CPU -> GPU because the CPU is supposed to be the owner of the data (because of the transfer_ownership).

I believe that last time we tried the RECURSIVE device, we were running only the GEMM tasks on the GPU.
Now that the RECURSIVE task (POTRF) can run on the GPU, we have that problem.

I'm not sure how to solve it though.... I don't think the hook task of RECURSIVE should acquire the data before it decides if it's going to split the task in a DAG... And even when it does so, it's sub-optimal to acquire the data on the CPU... Maybe we want to schedule the operation at a finer grain on the GPU.... And in any case, if it acquires the data on the CPU, it should move the data from GPU to CPU.

therault · 2023-05-26T15:44:24Z

We looked at the code during the code review meeting, and we found a series of issues that need addressing:

the recursive device should make use of the evaluate() function of the task class interface to decide if it's going to recurse or not. In particular, it should not try to acquire data at this step. George is going to provide a PR to extend JDF with this;
because the code relies on the CPU data copy for flow tracking, the GPU code lies to the runtime by fake-synching the CPU data copy and the GPU data copy. We should remove these hacks (lines 2399 to 2430 of device_cuda_module.c), and allow tracking from any data copy.
If the recursive device decides to refine a task, it should indeed acquire the data on the CPU device, and that should go through the whole path of asking the CUDA manager to synch up the data, then split it etc...

Actionable item: meanwhile, recursive device will be disabled for POTRF, which is the one operation that is susceptible to these issues (because all kernels now execute on GPU. That also makes the recursive device less attractive for that operation).

…RECURSIVE device in test mode.

omor1 · 2023-05-31T21:31:17Z

parsec/parsec/recursive.h

Line 55 in 5306015

parsec_mca_device_taskpool_restrict( tp, PARSEC_DEV_CPU );

Doesn't this line force recursive taskpools to execute on the CPU? I'm not entirely surprised that there are issues with recursive + GPU, since I don't think it was ever a supported configuration.

bosilca · 2023-06-01T02:31:20Z

The title might not reflect it clearly, but this is not what the issue is. The issue here is that recursive will indeed run only on the CPU, but it inherits a data that might be located on the GPU and it might fail to transfer it before unfolding the recursive algorithm.

See issue ICLDisco#548 Signed-off-by: Aurelien Bouteiller <[email protected]>

See issue #548 Signed-off-by: Aurelien Bouteiller <[email protected]>

therault added the bug Something isn't working label May 24, 2023

therault added a commit to therault/dplasma that referenced this issue May 29, 2023

Provide a temporary solution to ICLDisco/parsec#548 by disabling the …

e735700

…RECURSIVE device in test mode.

therault mentioned this issue May 29, 2023

Temporarily disable RECURSIVE device in CI test mode ICLDisco/dplasma#83

Merged

abouteiller mentioned this issue Jun 1, 2023

Recursive disabled temporarily must be re-enabled when upstream issue fixed ICLDisco/dplasma#84

Open

bosilca self-assigned this Jun 2, 2023

abouteiller added a commit to abouteiller/parsec that referenced this issue Oct 24, 2023

Disable recursive device by default (temporarily)

1405846

See issue ICLDisco#548 Signed-off-by: Aurelien Bouteiller <[email protected]>

abouteiller mentioned this issue Oct 24, 2023

Disable recursive device by default (temporarily) #585

Merged

bosilca pushed a commit that referenced this issue Nov 4, 2023

Disable recursive device by default (temporarily) (#585)

08fa2b2

See issue #548 Signed-off-by: Aurelien Bouteiller <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recursive + GPU seems broken #548

Recursive + GPU seems broken #548

therault commented May 24, 2023 •

edited by abouteiller

Loading

therault commented May 25, 2023

therault commented May 26, 2023

omor1 commented May 31, 2023

bosilca commented Jun 1, 2023

Recursive + GPU seems broken #548

Recursive + GPU seems broken #548

Comments

therault commented May 24, 2023 • edited by abouteiller Loading

Describe the bug

To Reproduce

Additional context

Followup tasks

therault commented May 25, 2023

therault commented May 26, 2023

omor1 commented May 31, 2023

bosilca commented Jun 1, 2023

therault commented May 24, 2023 •

edited by abouteiller

Loading