[hip] Added hip_device_group_device to the runtime. #18790

Open
wants to merge 16 commits into
base: main

Contributor

@AWoloszyn AWoloszyn commented Oct 16, 2024

This gives us an interface for creating a logical device from a set of physical hip devices. In a future PR I plan on removing the normal hip_device, but for now, until the device_group_device is completed and hardened, I am keeping the original around. There are also some optimizations to do for when we have a single device in our device group.

This implementation currently passes CTS (as well as the new CTS tests added for device groups), but there is some work to complete.

  • Fix memory pooling (Will be a follow-up PR)
  • Make sure that collectives work as expected. (Follow-up PR)
  • Optimize our synchronization.
    • Currently synchronization across physical GPUs goes through the host. We should be able to avoid that, but it will take some additional work.
  • Rework the CTS tests a bit so that they are just normal CTS tests that get ignored if needed.
  • Move any CUDA-specific bits back out into cuda.
  • Fix iree_hal_hip_device_queue_flush, which should no longer try to use the work queue.
  • Audit all new functions and make sure static is used where necessary.

Collaborator

@benvanik benvanik left a comment

A doozy! It'll take me a bit to go through all of this but I've sprinkled a few comments in to start.

How much of this code would change if you weren't trying to keep the old hip_device around? Are there any simplifications you could do? If so, do you think you'll remember them or should you tag them with TODO(#XXX) and track that in an issue? Given the complexity here I want to ensure we don't end up with both copies or shadows of the old copy living forever. It doesn't feel like much here would or should change if you have 1 device or N devices. If you're just not sure about your code yet, it's ok to let this sit in a branch for a bit while you get it to a point of stability. That's better than it getting context switched out of your head after it lands and us ending up with lingering design decisions that were made for short term staging.

The major thing I'm concerned about is the several extra vtables, as they imply a level of decoupling that brings about a lot of complexity in the code. Updating any signature for any call now requires traversing several layers of indirection in several files (including shared utils and, given that it's in utils/, across other backends) and reading the code becomes more difficult. Given that I'm hell-bent on deleting HIP it feels like additional baggage for something that is unlikely to be reused. I know there's a hope of sharing this with CUDA, but since that's not currently in the plans and CUDA would be the only mid-term/long-term user of it (maybe), the added cost feels hard to swallow for the project as a whole. Avoiding the vtables and keeping things simple is going to add the least overhead to the project, followed second by moving this out of utils/ and keeping it local to the hip target. Shared utils dirs should be for durable things that we want to ossify and be heavily reused both in-tree and out-of-tree - we may need this now but we don't want this forever :)
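
To make the indirection concern concrete, here is a minimal C sketch contrasting a call routed through a vtable with a direct call. Every name in it is hypothetical and chosen for illustration; none of them come from this PR or from IREE.

```c
#include <assert.h>

// Hypothetical names for illustration only; none of these exist in IREE.
typedef struct example_queue_t example_queue_t;

// Vtable-style decoupling: each operation is reached through a function
// pointer, so a signature change touches the vtable type, every
// implementation, and every wrapper.
typedef struct example_queue_vtable_t {
  int (*submit)(example_queue_t* queue, int work_item);
} example_queue_vtable_t;

struct example_queue_t {
  const example_queue_vtable_t* vtable;
  int submitted_count;
};

// Backend-specific implementation installed in the vtable.
static int example_hip_queue_submit(example_queue_t* queue, int work_item) {
  (void)work_item;  // The payload is irrelevant to this sketch.
  return ++queue->submitted_count;
}

static const example_queue_vtable_t example_hip_queue_vtable = {
    .submit = example_hip_queue_submit,
};

// Shared wrapper: one more layer a reader has to traverse to find the code
// that actually runs.
static int example_queue_submit(example_queue_t* queue, int work_item) {
  assert(queue && queue->vtable && queue->vtable->submit);
  return queue->vtable->submit(queue, work_item);
}

// Direct style: a single definition, local to the backend.
static int example_hip_submit_direct(int* submitted_count, int work_item) {
  (void)work_item;
  return ++(*submitted_count);
}
```

In the vtable version, following a call to `example_queue_submit` means reading the wrapper, the vtable type, and the table initializer before reaching `example_hip_queue_submit`. The direct version has one place to read and one place to change.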

A good way to reason about HAL code is that it should be optimized for deletion/rewrites/refactorings: what we have will be deleted and rewritten several more times, the API will change as new devices/device types/features are introduced, and it's almost always better to have some duplication than it is to have things tightly coupled across the deletion/rewrite boundaries. A bulk of what's happening here in particular is plumbing, and plumbing pays the highest cost of spaghettification and the lowest cost of duplication (as find/replace can solve the duplication but can't solve the spaghetti).

Happy to chat more about this - I think we can simplify things and keep the scope small to unblock the work requiring this without adding too much extra complexity to the rest of the system. Anything that adds complexity just to HIP is fine and it's just the stuff that bleeds out of hip/ that is my concern.

// iree_hal_hip_device_group_device_t
//===----------------------------------------------------------------------===//

typedef enum iree_hip_device_group_device_commandbuffer_type_e {
Collaborator

command_buffer

@ScottTodd ScottTodd added the hal/hip Runtime HIP HAL backend label Oct 17, 2024
@AWoloszyn AWoloszyn force-pushed the multidevice branch 3 times, most recently from 3fe45ae to f3019a6 Compare November 4, 2024 20:01
Instead of rebasing each of the individual 30+ changes, rebase
the entire thing, because there were a number of conflicts
against main.

Signed-off-by: Andrew Woloszyn <[email protected]>
IREE_ASSERT_ARGUMENT(base_driver);
IREE_ASSERT_ARGUMENT(out_device);

uint64_t multi_count = 0;
Collaborator

still not iree_host_size_t
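
For context, a minimal sketch of the convention being asked for, assuming `iree_host_size_t` is an alias for the host's `size_t`. A hypothetical local typedef and helper stand in for the real definitions so the snippet is self-contained. The idea is that host-side counts and array indices use the host's native size type instead of a fixed-width `uint64_t`.

```c
#include <assert.h>
#include <stddef.h>

// Stand-in for illustration; assumed to mirror IREE's iree_host_size_t.
typedef size_t example_host_size_t;

// Counts the valid (non-negative) entries in a list of device ordinals.
// Both the index and the count are host sizes, so they match the type used
// for array lengths and allocation sizes on the host.
static example_host_size_t example_count_valid(const int* ordinals,
                                               example_host_size_t capacity) {
  assert(ordinals || capacity == 0);
  example_host_size_t count = 0;
  for (example_host_size_t i = 0; i < capacity; ++i) {
    if (ordinals[i] >= 0) ++count;
  }
  return count;
}
```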


#include "iree/hal/drivers/hip/util/queue.h"

#include "iree/base/api.h"
Collaborator

no need to include something in a .c already included in the header

Suggested change
#include "iree/base/api.h"

Contributor Author

I know we are (at least loosely) following the Google style guide, but I just want to make sure we are intentionally ignoring it here.

Anywhere else we are ignoring it that I should know about?

https://google.github.io/styleguide/cppguide.html#Include_What_You_Use

It was submitting command buffers with an empty affinity.
Changed to IREE_HAL_QUEUE_AFFINITY_ANY instead.

Signed-off-by: Andrew Woloszyn <[email protected]>
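
A minimal sketch of why an empty affinity is a problem, assuming queue affinity is a 64-bit bitmask in which each set bit permits one queue and that `IREE_HAL_QUEUE_AFFINITY_ANY` has every bit set. The typedef, constant, and helper below are hypothetical stand-ins, not the IREE definitions.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical stand-ins; assumed to mirror a bitmask-style queue affinity.
typedef uint64_t example_queue_affinity_t;
#define EXAMPLE_QUEUE_AFFINITY_ANY ((example_queue_affinity_t)(-1))

// Returns the index of the first queue permitted by |affinity| among
// |queue_count| queues, or -1 if the mask permits none of them.
static int example_select_queue(example_queue_affinity_t affinity,
                                int queue_count) {
  assert(queue_count >= 0 && queue_count <= 64);
  for (int i = 0; i < queue_count; ++i) {
    if (affinity & ((example_queue_affinity_t)1 << i)) return i;
  }
  return -1;
}
```

Under these assumptions, an empty mask (`0`) matches no queue at all, so a submission carrying it has nowhere to go, while the all-bits-set value matches whichever queue is available.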
This allows us to allocate/deallocate async so long as we are using the
default hip allocator. Based on iree-org#19074

---------

Signed-off-by: Andrew Woloszyn <[email protected]>
Labels
hal/hip Runtime HIP HAL backend
3 participants