Recursive + GPU seems broken #548

Open · therault opened this issue May 24, 2023 · 4 comments
Labels: bug (Something isn't working)
therault (Contributor) commented May 24, 2023

Describe the bug

An assert triggers when running a PTG test that combines recursive and CUDA bodies.

To Reproduce

Steps to reproduce the behavior:

  1. Check out master (5306015)
  2. Compile on a machine with CUDA, with the NOISIER and PARANOID debug options enabled, CUDA enabled, and RECURSIVE support enabled
  3. Install, then compile dplasma master (ICLDisco/dplasma@8b36497) against this install of parsec
  4. Run the test testing_spotrf -N 256 -t 128 -x 1 -g 1 on a machine with a GPU
  5. See the assert trigger.

Additional context

I tracked the issue, suspecting a versioning problem based on Aurelien's report, but it appears that we first run the CUDA kernel of POTRF(1) and then proceed to run the RECURSIVE kernel of POTRF(1), which messes up the status of the CUDA copies and leads to the unexpected condition caught by the assert.

Follow-up tasks

therault added the bug (Something isn't working) label on May 24, 2023
therault (Contributor, Author) commented:
Exploring this issue, I see the following behavior:

  • __parsec_execute first selects the RECURSIVE hook for POTRF(0).
  • During the preamble (generated code) of the RECURSIVE hook, we transfer ownership of the tile to device 0. We do this unconditionally, i.e. even if the hook later decides to return NEXT because we choose not to split the task into something more refined.
  • Then the RECURSIVE hook discovers that we don't want to split the task into multiple elements, and it returns NEXT.
  • Then we select the CUDA hook and execute it.
  • The first thing the CUDA hook does is decide where to run (get_best_device), which eventually acquires the data on the CUDA device.

For POTRF(0) that works, more or less, because the data was on the CPU and the CUDA task is the one moving it to the GPU.

But then, SYRK(0, 1) executes on the CUDA device and acquires A(1, 1) on the GPU.

When POTRF(1) is tried, we proceed as for POTRF(0): in particular, while testing for the RECURSIVE device, we transfer ownership to device 0 without pulling the data up to the CPU. Then we return NEXT.

When the CUDA POTRF(1) task is scheduled, it finds that it needs to do a CPU -> GPU transfer, because the CPU is supposed to be the owner of the data (as a result of the transfer_ownership).

I believe that the last time we tried the RECURSIVE device, we were running only the GEMM tasks on the GPU.
Now that the RECURSIVE task (POTRF) can also run on the GPU, we have this problem.

I'm not sure how to solve it, though. I don't think the hook task of RECURSIVE should acquire the data before it decides whether it's going to split the task into a DAG. And even when it does split, it's sub-optimal to acquire the data on the CPU; maybe we want to schedule the finer-grain operation on the GPU. In any case, if it acquires the data on the CPU, it should move the data from GPU to CPU.
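To make the sequence above concrete, here is a minimal, self-contained C sketch of the control flow as I understand it. The types, helpers, and the split decision are stand-ins for illustration only (they are not the PaRSEC API); only the "decline and return NEXT so the next hook is tried" convention mirrors the real hook protocol.

```c
#include <stdbool.h>
#include <stdio.h>

/* Stand-in types -- not the PaRSEC API, just enough to show the sequence. */
typedef enum { HOOK_DONE, HOOK_NEXT } hook_ret_t;
typedef struct { const char *name; int owner_device; bool on_gpu; } tile_t;
typedef struct { const char *name; tile_t *tile; bool worth_splitting; } task_t;

/* (1) Generated preamble of the RECURSIVE hook: ownership is moved to
 *     device 0 (the CPU) unconditionally, without copying the data back. */
static hook_ret_t recursive_hook(task_t *t)
{
    t->tile->owner_device = 0;      /* transfer_ownership, no data pull */
    if (!t->worth_splitting)
        return HOOK_NEXT;           /* (2) decline: fall through to CUDA hook */
    /* ... otherwise unfold the task into a finer-grain DAG on the CPU ... */
    return HOOK_DONE;
}

/* (3) CUDA hook: the CPU is now the nominal owner, so a stale CPU -> GPU
 *     transfer gets scheduled, which is what trips the assert. */
static hook_ret_t cuda_hook(task_t *t)
{
    if (t->tile->owner_device == 0 && t->tile->on_gpu)
        printf("%s: stale CPU -> GPU transfer of %s (assert would fire)\n",
               t->name, t->tile->name);
    return HOOK_DONE;
}

int main(void)
{
    tile_t a11    = { "A(1,1)", /*owner*/ 1, /*on_gpu*/ true };   /* left by SYRK(0,1) */
    task_t potrf1 = { "POTRF(1)", &a11, /*worth_splitting*/ false };

    if (recursive_hook(&potrf1) == HOOK_NEXT)  /* RECURSIVE declines ...      */
        cuda_hook(&potrf1);                    /* ... CUDA sees the wrong owner */
    return 0;
}
```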

therault (Contributor, Author) commented:
We looked at the code during the code review meeting, and we found a series of issues that need addressing:

  • the recursive device should make use of the evaluate() function of the task class interface to decide whether it's going to recurse or not. In particular, it should not try to acquire data at this step. George is going to provide a PR to extend JDF with this (see the sketch after this list);
  • because the code relies on the CPU data copy for flow tracking, the GPU code lies to the runtime by fake-synching the CPU data copy and the GPU data copy. We should remove these hacks (lines 2399 to 2430 of device_cuda_module.c) and allow tracking from any data copy;
  • if the recursive device decides to refine a task, it should indeed acquire the data on the CPU device, and that should go through the whole path of asking the CUDA manager to sync up the data, then split it, etc.
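A minimal sketch of the first point, assuming the evaluate() extension works roughly like the existing hooks: the decision to recurse is made from task metadata alone (e.g. tile size against a cutoff) and never touches the data copies, so ownership is unaffected when the recursive device declines. All names below are illustrative, not the actual JDF/PaRSEC interface.

```c
#include <stdbool.h>

#define RECURSIVE_CUTOFF 64            /* hypothetical tile-size threshold */

/* Only the task geometry is consulted; no data copy is read or acquired. */
typedef struct { int mb, nb; } task_geometry_t;

/* true  -> the recursive device will unfold this task into a finer DAG
 * false -> skip this device; the next hook (e.g. CUDA) is tried, with
 *          data ownership exactly as it was before the evaluation. */
static bool recursive_evaluate(const task_geometry_t *geom)
{
    return geom->mb > RECURSIVE_CUTOFF && geom->nb > RECURSIVE_CUTOFF;
}
```

The point is that declining has no side effect at all, which is exactly what the current generated preamble violates by transferring ownership before the decision.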

Actionable item: in the meantime, the recursive device will be disabled for POTRF, which is the one operation susceptible to these issues (all kernels now execute on the GPU, which also makes the recursive device less attractive for that operation).

therault added a commit to therault/dplasma that referenced this issue May 29, 2023
omor1 (Contributor) commented May 31, 2023

parsec_mca_device_taskpool_restrict( tp, PARSEC_DEV_CPU );

Doesn't this line force recursive taskpools to execute on the CPU? I'm not entirely surprised that there are issues with recursive + GPU, since I don't think it was ever a supported configuration.
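For reference, the quoted restriction is applied to the taskpool that the recursive device builds for the unfolded sub-DAG. The fragment below is a hedged sketch of that context: build_recursive_subdag() is a hypothetical placeholder, and only the restrict call itself is taken from the code.

```c
/* Hedged sketch of the surrounding context -- build_recursive_subdag() is a
 * hypothetical placeholder; only the restrict call is quoted from the code. */
parsec_taskpool_t *tp = build_recursive_subdag(task);
parsec_mca_device_taskpool_restrict( tp, PARSEC_DEV_CPU );
/* The unfolded sub-DAG will run on the CPU. The bug discussed above is that
 * the parent tile may still be owned by a GPU copy at this point, and nothing
 * pulls it back to the CPU before the sub-DAG starts executing. */
```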

bosilca (Contributor) commented Jun 1, 2023

The title might not reflect it clearly, but that is not what the issue is. The issue here is that recursive will indeed run only on the CPU, but it inherits data that may be located on the GPU, and it may fail to transfer that data back before unfolding the recursive algorithm.
