-
Notifications
You must be signed in to change notification settings - Fork 769
[WIP] No handler submit #18842
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: sycl
Are you sure you want to change the base?
[WIP] No handler submit #18842
Conversation
void queue_impl::submit_no_handler( | ||
const std::shared_ptr<queue_impl> &Self, | ||
detail::NDRDescT NDRDesc, const char *KernelName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we template this on the number of dimensions, like you did for queue::submit_no_handler
?
I think that would allow us to get rid of detail::NDRDescT
and just pass the nd_range<D>
all the way down the stack. We could get rid of padId
and padRange
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review. It is a good point, that there is potential for optimization related to NDRDescT. Currently, the enqueueImpKernel scheduler function takes it as an argument, and applies some transformations, so I left it as is. I think there is work in progress to optimize this, and once it is complete, I can update the flow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't agree. We have so much instance of ugliest code duplication added with a promise to refactor once design settles. Please do the right thing from the start.
sycl/include/sycl/queue.hpp
Outdated
const char *KernelN = detail::getKernelName<KernelName>(); | ||
KernelType Kernel = KernelFunc; | ||
void *KernelFuncPtr = reinterpret_cast<void *>(&Kernel); | ||
int KernelNumParams = detail::getKernelNumParams<KernelName>(); | ||
detail::kernel_param_desc_t (*KernelParamDescGetter)(int) = &(detail::getKernelParamDesc<KernelName>); | ||
bool IsKernelESIMD = detail::isKernelESIMD<KernelName>(); | ||
bool HasSpecialCapt = detail::hasSpecialCaptures<KernelName>(); | ||
detail::KernelNameBasedCacheT *KernelNameBasedCachePtr = detail::getKernelNameBasedCache<KernelName>(); | ||
|
||
assert(HasSpecialCapt == false); | ||
assert(IsKernelESIMD == false); | ||
|
||
submit_no_handler_impl(Range, KernelN, KernelFuncPtr, KernelNumParams, KernelParamDescGetter, | ||
KernelNameBasedCachePtr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How feasible do you think it is to basically inline submit_no_handler_impl
here?
This is definitely a huge improvement over what we had before, but we're still taking compile-time information (the kernel name, kernel parameters, etc) and turning it into run-time information that crosses over into libsycl.so
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point, I'm not sure if we can inline this. The public header doesn't have access to the internal queue_impl object, also at some point the templated parameters need to be transformed into run-time ones. Do you have any suggestions how to handle those parameters more efficiently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might need to go through these and figure out what has to become a runtime parameter and why, and consider changing the interface to the impl
object. My understanding is that some of the original decisions that were made for handler
were motivated by design issues that no longer apply.
For example, I think the KernelName
and KernelParamDescGetter
here had to be stored as runtime information before, because the type information couldn't possibly be stored between the call to parallel_for
and the deferred submission inside handler::finalize()
. And since the information had to be stored in a runtime variable, there was no reason not to push the functions using that information into the impl
.
Things are different with the new design, because we can submit directly to the backend without deferring anything. I think that means we can lift things out of the impl
and back into the headers, where it makes sense to do so.
I haven't done an exhaustive analysis, but here's an example of the sort of thing we should do, focusing on KernelName
:
- We convert
KernelType
toKernelName
here only so we can eventually pass it toenqueueImpKernel
. enqueueImpKernel
only needs theKernelName
to find the rightur_kernel_handle_t
handle.- If
enqueueImpKernel
(or a new version of it) accepted aur_kernel_handle_t
instead of aKernelName
, we could:- Use
KernelType
to jump straight to aKernelNameBasedCacheT
; - Look up the right
ur_kernel_handle_t
; and then - Pass the
ur_kernel_handle_t
tolibsycl.so
.
- Use
...which would bypass the need to create a std::string
or do any string operations. (We could also choose to pass just the KernelNameBasedCacheT*
, instead of a ur_kernel_handle_t
).
Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that means we can lift things out of the impl and back into the headers, where it makes sense to do so.
ABI stability is another major reason why we want everything in impl
and not in the sycl::__class__
objects. I haven't looked into the PR yet, so I'd just say that "lifting" code is fine, lifting data is likely not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, and I'm not going to propose that we do anything other than follow our existing ABI break policy.
All I meant was that I expect there are going to be many instances where something doesn't need to be stored at all in the new design. We should take advantage of that to do as much as possible in the body of parallel_for
, and pass the minimum amount of information through to the impl
.
For example, instead of passing a function pointer through to the impl
object to set arguments:
detail::kernel_param_desc_t (*KernelParamDescGetter)(int) = &(detail::getKernelParamDesc<KernelName>);
...
enqueueImpKernel(..., KernelParamDescGetter, ...);
We could split enqueueImpKernel
into a few separate functions and move anything type-dependent back into the header, something like:
// Interaction with the scheduler is probably too complicated to put in the headers.
enqueueImpKernelSchedulerParts(...);
// We have the KernelName type info here, so leverage it.
ur_kernel_handle_t Kernel = detail::getKernelNameBasedCache<KernelName>()->getKernel();
// Doing this here instead of in impl allows the compiler to unroll and optimize away the branches.
// (Because we know everything about the kernel at compile-time.)
for (int i = 0; i < detail::getKernelNumParams<KernelName>(); ++i) {
switch (detail::getKernelParamDesc<KernelName>(i))
{
case accessor:
enqueueImpKernelSetAccessorArg(Kernel, i, ...);
case std_layout:
enqueueImpKernelSetArg(Kernel, i, ...);
}
}
enqueueImpKernelSubmit(Kernel);
Kernel submission is very complicated, especially in the old handler
path, so I'm not sure exactly what the new path has to look like. But hopefully this gives you a bit more insight into where I'm coming from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, as I said, I agree with "lifting code" :)
f0422df
to
161bef1
Compare
161bef1
to
d016855
Compare
q, [&](handler &h) { launch_grouped<KernelType>(h, r, size, k); }, | ||
codeLoc); | ||
(void)codeLoc; | ||
q.parallel_for_no_handler(nd_range<1>(r, size), k); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we call it queue::launch_grouped
?
// no_handler | ||
|
||
private: | ||
// NOTE: the name of this function - "kernel_single_task" - is used by the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this live outside queue
directly in sycl::detail
? If private queue
access is an issue, then maybe
struct LaunchUtils {
static blah-blah(...) {}
};
class queue {
friend struct LaunchUtils;
}
|
||
void queue_impl::extractArgsAndReqsFromLambda( | ||
char *LambdaPtr, detail::kernel_param_desc_t (*ParamDescGetter)(int), | ||
size_t NumKernelParams, std::vector<ArgDesc> &Args) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't you just return Args
?
} | ||
|
||
void queue_impl::submit_no_handler( | ||
const std::shared_ptr<queue_impl> &Self, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessary, see #18715.
void *KernelFunc, int KernelNumParams, | ||
detail::kernel_param_desc_t (*KernelParamDescGetter)(int), | ||
detail::KernelNameBasedCacheT *KernelNameBasedCachePtr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, this should be a single argument. Ideally, renamed KerneNameBasedCachePtr
->TypeErasedKernelInfo
. I've had a discussion with @sergey-semenov about this, he was going to prototype that change. Can you sync with him about this?
|
||
// TODO external event | ||
|
||
bool KernelFastPath = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very bad variable name. Fast
in a legacy code as a hack is somewhat meaningful, but in the new highly-optimized code we need to be very specific about what exactly is fast here and why it isn't true in general.
std::unique_lock<std::mutex> Lock(MMutex); | ||
MDefaultGraphDeps.LastEventPtr = EventImpl; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this work with the existing queue_impl::MLastEvent
if there are concurrent old-style submissions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the intention is to support concurrent new and old submissions.
void queue_impl::submit_no_handler( | ||
const std::shared_ptr<queue_impl> &Self, | ||
detail::NDRDescT NDRDesc, const char *KernelName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't agree. We have so much instance of ugliest code duplication added with a promise to refactor once design settles. Please do the right thing from the start.
|
||
if (isInOrder() && LastEvent && !Scheduler::CheckEventReadiness(MContext, LastEvent)) { | ||
KernelFastPath = false; | ||
ur_event_handle_t LastEventHandle = LastEvent->getHandle(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the event is not ready, I think getHandle() can return NULL (if the command is a host task or not yet enqueued yet)
No description provided.