-
Notifications
You must be signed in to change notification settings - Fork 618
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[hip] Added hip_device_group_device to the runtime. #18790
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A doozy! It'll take me a bit to go through all of this but I've sprinkled a few comments in to start.
How much of this code would change if you weren't trying to keep the old hip_device around? Are there any simplifications you could do? If so, do you think you'll remember them or should you tag them with TODO(#XXX)
and track that in an issue? Given the complexity here I want to ensure we don't end up with both copies or shadows of the old copy living forever. It doesn't feel like much here would or should change if you have 1 device or N devices and if you're just not sure about your code yet it's ok to let this sit in a branch for a bit while you get it to a point of stability. It's better that then it getting context switched out of your head after it lands and then we end up with lingering design decisions that were made for short term staging.
The major thing I'm concerned about is the several extra vtables as they imply a level of decoupling that brings about a lot of complexity in the code. Updating any signature for any call now requires traversing several layers of indirection in several files (including shared utils and given that it's in utils/ across other backends) and reading the code becomes more difficult. Given that I'm hell-bent on deleting HIP it feels like additional baggage for something that is unlikely to be reused. I know there's a hope of sharing this with CUDA but since that's not currently in the plans and CUDA would be the only mid-term/long-term user of it (maybe) the added cost feels hard to swallow for the project as a whole. Avoiding the vtables and keeping things simple is going to add the least overhead to the project followed second by moving this out of utils/ and keeping it local to the hip target would be best. Shared utils dirs should be for durable things that we want to ossify and be heavily reused both in-tree and out-of-tree - we may need this now but we don't want this forever :)
A good way to reason about HAL code is that is should be optimized for deletion/rewrites/refactorings: what we have will be deleted and rewritten several more times, the API will change as new devices/device types/features are introduced, and it's almost always better to have some duplication than it is to have things tightly coupled across the deletion/rewrite boundaries. A bulk of what's happening here in particular is plumbing, and plumbing pays the highest cost of spaghettification and the lowest cost of duplication (as find/replace can solve the duplication but can't solve the spaghetti).
Happy to chat more about this - I think we can simplify things and keep the scope small to unblock the work requiring this without adding too much extra complexity to the rest of the system. Anything that adds complexity just to HIP is fine and it's just the stuff that bleeds out of hip/ that is my concern.
// iree_hal_hip_device_group_device_t | ||
//===----------------------------------------------------------------------===// | ||
|
||
typedef enum iree_hip_device_group_device_commandbuffer_type_e { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
command_buffer
cd7443a
to
aaa52bc
Compare
3fe45ae
to
f3019a6
Compare
Instead of rebasing each of the individual 30+ changes, rebase the entire thing, because there were a number of conflicts against main. Signed-off-by: Andrew Woloszyn <[email protected]>
3675196
to
e1d13e4
Compare
IREE_ASSERT_ARGUMENT(base_driver); | ||
IREE_ASSERT_ARGUMENT(out_device); | ||
|
||
uint64_t multi_count = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still not iree_host_size_t
|
||
#include "iree/hal/drivers/hip/util/queue.h" | ||
|
||
#include "iree/base/api.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to include something in a .c already included in the header
#include "iree/base/api.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know we are (at least loosely) following the google style guide, but just want to make sure we are intentionally ignoring it here.
Anywhere else we are ignoring it that I should know about?
https://google.github.io/styleguide/cppguide.html#Include_What_You_Use
It was submitting command buffers with an empty affinity. Changed to IREE_HAL_QUEUE_AFFINITY_ANY instead. Signed-off-by: Andrew Woloszyn <[email protected]>
This allows us to allocate/deallocate async so long as we are using the default hip allocator. Based on iree-org#19074 --------- Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
d974379
to
c0494aa
Compare
This gives us an interface for creating a logical device from a set of physical hip devices. In a future PR I plan on removing the normal hip_device ut for now, until the device_group_device is completed and hardened, I am keeping the original around. There are also some optimizations to do for when we have a single device in our device group.
This implementation currently passes CTS (as well as the new CTS tests added for device groups), but there is some work to complete.
iree_hal_hip_device_queue_flush
which should no longer try and use the work queue.