Skip to content

[WIP] No handler submit #18842

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: sycl
Choose a base branch
from
Draft

Conversation

slawekptak
Copy link
Contributor

No description provided.

Comment on lines +443 to +439
void queue_impl::submit_no_handler(
const std::shared_ptr<queue_impl> &Self,
detail::NDRDescT NDRDesc, const char *KernelName,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we template this on the number of dimensions, like you did for queue::submit_no_handler?

I think that would allow us to get rid of detail::NDRDescT and just pass the nd_range<D> all the way down the stack. We could get rid of padId and padRange

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the review. It is a good point, that there is potential for optimization related to NDRDescT. Currently, the enqueueImpKernel scheduler function takes it as an argument, and applies some transformations, so I left it as is. I think there is work in progress to optimize this, and once it is complete, I can update the flow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree. We have so much instance of ugliest code duplication added with a promise to refactor once design settles. Please do the right thing from the start.

Comment on lines 2853 to 2869
const char *KernelN = detail::getKernelName<KernelName>();
KernelType Kernel = KernelFunc;
void *KernelFuncPtr = reinterpret_cast<void *>(&Kernel);
int KernelNumParams = detail::getKernelNumParams<KernelName>();
detail::kernel_param_desc_t (*KernelParamDescGetter)(int) = &(detail::getKernelParamDesc<KernelName>);
bool IsKernelESIMD = detail::isKernelESIMD<KernelName>();
bool HasSpecialCapt = detail::hasSpecialCaptures<KernelName>();
detail::KernelNameBasedCacheT *KernelNameBasedCachePtr = detail::getKernelNameBasedCache<KernelName>();

assert(HasSpecialCapt == false);
assert(IsKernelESIMD == false);

submit_no_handler_impl(Range, KernelN, KernelFuncPtr, KernelNumParams, KernelParamDescGetter,
KernelNameBasedCachePtr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How feasible do you think it is to basically inline submit_no_handler_impl here?

This is definitely a huge improvement over what we had before, but we're still taking compile-time information (the kernel name, kernel parameters, etc) and turning it into run-time information that crosses over into libsycl.so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point, I'm not sure if we can inline this. The public header doesn't have access to the internal queue_impl object, also at some point the templated parameters need to be transformed into run-time ones. Do you have any suggestions how to handle those parameters more efficiently?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might need to go through these and figure out what has to become a runtime parameter and why, and consider changing the interface to the impl object. My understanding is that some of the original decisions that were made for handler were motivated by design issues that no longer apply.

For example, I think the KernelName and KernelParamDescGetter here had to be stored as runtime information before, because the type information couldn't possibly be stored between the call to parallel_for and the deferred submission inside handler::finalize(). And since the information had to be stored in a runtime variable, there was no reason not to push the functions using that information into the impl.

Things are different with the new design, because we can submit directly to the backend without deferring anything. I think that means we can lift things out of the impl and back into the headers, where it makes sense to do so.

I haven't done an exhaustive analysis, but here's an example of the sort of thing we should do, focusing on KernelName:

  • We convert KernelType to KernelName here only so we can eventually pass it to enqueueImpKernel.
  • enqueueImpKernel only needs the KernelName to find the right ur_kernel_handle_t handle.
  • If enqueueImpKernel (or a new version of it) accepted a ur_kernel_handle_t instead of a KernelName, we could:
    • Use KernelType to jump straight to a KernelNameBasedCacheT;
    • Look up the right ur_kernel_handle_t; and then
    • Pass the ur_kernel_handle_t to libsycl.so.

...which would bypass the need to create a std::string or do any string operations. (We could also choose to pass just the KernelNameBasedCacheT*, instead of a ur_kernel_handle_t).

Does that make sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that means we can lift things out of the impl and back into the headers, where it makes sense to do so.

ABI stability is another major reason why we want everything in impl and not in the sycl::__class__ objects. I haven't looked into the PR yet, so I'd just say that "lifting" code is fine, lifting data is likely not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, and I'm not going to propose that we do anything other than follow our existing ABI break policy.

All I meant was that I expect there are going to be many instances where something doesn't need to be stored at all in the new design. We should take advantage of that to do as much as possible in the body of parallel_for, and pass the minimum amount of information through to the impl.

For example, instead of passing a function pointer through to the impl object to set arguments:

detail::kernel_param_desc_t (*KernelParamDescGetter)(int) = &(detail::getKernelParamDesc<KernelName>);
...
enqueueImpKernel(..., KernelParamDescGetter, ...);

We could split enqueueImpKernel into a few separate functions and move anything type-dependent back into the header, something like:

// Interaction with the scheduler is probably too complicated to put in the headers.
enqueueImpKernelSchedulerParts(...);

// We have the KernelName type info here, so leverage it.
ur_kernel_handle_t Kernel = detail::getKernelNameBasedCache<KernelName>()->getKernel();

// Doing this here instead of in impl allows the compiler to unroll and optimize away the branches.
// (Because we know everything about the kernel at compile-time.)
for (int i = 0; i < detail::getKernelNumParams<KernelName>(); ++i) {
  switch (detail::getKernelParamDesc<KernelName>(i))
  {
     case accessor:
       enqueueImpKernelSetAccessorArg(Kernel, i, ...);
     case std_layout:
       enqueueImpKernelSetArg(Kernel, i, ...);
   }
}
enqueueImpKernelSubmit(Kernel);

Kernel submission is very complicated, especially in the old handler path, so I'm not sure exactly what the new path has to look like. But hopefully this gives you a bit more insight into where I'm coming from.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, as I said, I agree with "lifting code" :)

q, [&](handler &h) { launch_grouped<KernelType>(h, r, size, k); },
codeLoc);
(void)codeLoc;
q.parallel_for_no_handler(nd_range<1>(r, size), k);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we call it queue::launch_grouped?

// no_handler

private:
// NOTE: the name of this function - "kernel_single_task" - is used by the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this live outside queue directly in sycl::detail? If private queue access is an issue, then maybe

struct LaunchUtils {
  static blah-blah(...) {}
};

class queue {
friend struct LaunchUtils;
}


void queue_impl::extractArgsAndReqsFromLambda(
char *LambdaPtr, detail::kernel_param_desc_t (*ParamDescGetter)(int),
size_t NumKernelParams, std::vector<ArgDesc> &Args) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't you just return Args?

}

void queue_impl::submit_no_handler(
const std::shared_ptr<queue_impl> &Self,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary, see #18715.

Comment on lines +440 to +442
void *KernelFunc, int KernelNumParams,
detail::kernel_param_desc_t (*KernelParamDescGetter)(int),
detail::KernelNameBasedCacheT *KernelNameBasedCachePtr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, this should be a single argument. Ideally, renamed KerneNameBasedCachePtr->TypeErasedKernelInfo. I've had a discussion with @sergey-semenov about this, he was going to prototype that change. Can you sync with him about this?


// TODO external event

bool KernelFastPath = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very bad variable name. Fast in a legacy code as a hack is somewhat meaningful, but in the new highly-optimized code we need to be very specific about what exactly is fast here and why it isn't true in general.

std::unique_lock<std::mutex> Lock(MMutex);
MDefaultGraphDeps.LastEventPtr = EventImpl;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work with the existing queue_impl::MLastEvent if there are concurrent old-style submissions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the intention is to support concurrent new and old submissions.

Comment on lines +443 to +439
void queue_impl::submit_no_handler(
const std::shared_ptr<queue_impl> &Self,
detail::NDRDescT NDRDesc, const char *KernelName,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree. We have so much instance of ugliest code duplication added with a promise to refactor once design settles. Please do the right thing from the start.


if (isInOrder() && LastEvent && !Scheduler::CheckEventReadiness(MContext, LastEvent)) {
KernelFastPath = false;
ur_event_handle_t LastEventHandle = LastEvent->getHandle();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the event is not ready, I think getHandle() can return NULL (if the command is a host task or not yet enqueued yet)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants