[WIP] No handler submit #18842

slawekptak · 2025-06-06T08:12:36Z

No description provided.

Pennycook · 2025-06-06T08:38:24Z

sycl/source/detail/queue_impl.cpp

+void queue_impl::submit_no_handler(
+  const std::shared_ptr<queue_impl> &Self,
+  detail::NDRDescT NDRDesc, const char *KernelName,


Can we template this on the number of dimensions, like you did for queue::submit_no_handler?

I think that would allow us to get rid of detail::NDRDescT and just pass the nd_range<D> all the way down the stack. We could get rid of padId and padRange

Thank you for the review. It is a good point, that there is potential for optimization related to NDRDescT. Currently, the enqueueImpKernel scheduler function takes it as an argument, and applies some transformations, so I left it as is. I think there is work in progress to optimize this, and once it is complete, I can update the flow.

I don't agree. We have so much instance of ugliest code duplication added with a promise to refactor once design settles. Please do the right thing from the start.

Pennycook · 2025-06-06T08:45:54Z

sycl/include/sycl/queue.hpp

+    const char *KernelN = detail::getKernelName<KernelName>();
+    KernelType Kernel = KernelFunc;
+    void *KernelFuncPtr = reinterpret_cast<void *>(&Kernel);
+    int KernelNumParams = detail::getKernelNumParams<KernelName>();
+    detail::kernel_param_desc_t (*KernelParamDescGetter)(int) = &(detail::getKernelParamDesc<KernelName>);
+    bool IsKernelESIMD = detail::isKernelESIMD<KernelName>();
+    bool HasSpecialCapt = detail::hasSpecialCaptures<KernelName>();
+    detail::KernelNameBasedCacheT *KernelNameBasedCachePtr = detail::getKernelNameBasedCache<KernelName>();
+
+    assert(HasSpecialCapt == false);
+    assert(IsKernelESIMD == false);
+
+    submit_no_handler_impl(Range, KernelN, KernelFuncPtr, KernelNumParams, KernelParamDescGetter,
+      KernelNameBasedCachePtr);


How feasible do you think it is to basically inline submit_no_handler_impl here?

This is definitely a huge improvement over what we had before, but we're still taking compile-time information (the kernel name, kernel parameters, etc) and turning it into run-time information that crosses over into libsycl.so.

At this point, I'm not sure if we can inline this. The public header doesn't have access to the internal queue_impl object, also at some point the templated parameters need to be transformed into run-time ones. Do you have any suggestions how to handle those parameters more efficiently?

I think we might need to go through these and figure out what has to become a runtime parameter and why, and consider changing the interface to the impl object. My understanding is that some of the original decisions that were made for handler were motivated by design issues that no longer apply.

For example, I think the KernelName and KernelParamDescGetter here had to be stored as runtime information before, because the type information couldn't possibly be stored between the call to parallel_for and the deferred submission inside handler::finalize(). And since the information had to be stored in a runtime variable, there was no reason not to push the functions using that information into the impl.

Things are different with the new design, because we can submit directly to the backend without deferring anything. I think that means we can lift things out of the impl and back into the headers, where it makes sense to do so.

I haven't done an exhaustive analysis, but here's an example of the sort of thing we should do, focusing on KernelName:

We convert KernelType to KernelName here only so we can eventually pass it to enqueueImpKernel.

enqueueImpKernel only needs the KernelName to find the right ur_kernel_handle_t handle.

If enqueueImpKernel (or a new version of it) accepted a ur_kernel_handle_t instead of a KernelName, we could:

Use KernelType to jump straight to a KernelNameBasedCacheT;

Look up the right ur_kernel_handle_t; and then

Pass the ur_kernel_handle_t to libsycl.so.

...which would bypass the need to create a std::string or do any string operations. (We could also choose to pass just the KernelNameBasedCacheT*, instead of a ur_kernel_handle_t).

Does that make sense?

I think that means we can lift things out of the impl and back into the headers, where it makes sense to do so.

ABI stability is another major reason why we want everything in impl and not in the sycl::__class__ objects. I haven't looked into the PR yet, so I'd just say that "lifting" code is fine, lifting data is likely not.

I agree, and I'm not going to propose that we do anything other than follow our existing ABI break policy.

All I meant was that I expect there are going to be many instances where something doesn't need to be stored at all in the new design. We should take advantage of that to do as much as possible in the body of parallel_for, and pass the minimum amount of information through to the impl.

For example, instead of passing a function pointer through to the impl object to set arguments:

detail::kernel_param_desc_t (*KernelParamDescGetter)(int) = &(detail::getKernelParamDesc<KernelName>); ... enqueueImpKernel(..., KernelParamDescGetter, ...);

We could split enqueueImpKernel into a few separate functions and move anything type-dependent back into the header, something like:

// Interaction with the scheduler is probably too complicated to put in the headers. enqueueImpKernelSchedulerParts(...); // We have the KernelName type info here, so leverage it. ur_kernel_handle_t Kernel = detail::getKernelNameBasedCache<KernelName>()->getKernel(); // Doing this here instead of in impl allows the compiler to unroll and optimize away the branches. // (Because we know everything about the kernel at compile-time.) for (int i = 0; i < detail::getKernelNumParams<KernelName>(); ++i) { switch (detail::getKernelParamDesc<KernelName>(i)) { case accessor: enqueueImpKernelSetAccessorArg(Kernel, i, ...); case std_layout: enqueueImpKernelSetArg(Kernel, i, ...); } } enqueueImpKernelSubmit(Kernel);

Kernel submission is very complicated, especially in the old handler path, so I'm not sure exactly what the new path has to look like. But hopefully this gives you a bit more insight into where I'm coming from.

Yeah, as I said, I agree with "lifting code" :)

aelovikov-intel · 2025-06-06T14:40:34Z

sycl/include/sycl/khr/free_function_commands.hpp

-      q, [&](handler &h) { launch_grouped<KernelType>(h, r, size, k); },
-      codeLoc);
+  (void)codeLoc;
+  q.parallel_for_no_handler(nd_range<1>(r, size), k);


Why don't we call it queue::launch_grouped?

aelovikov-intel · 2025-06-06T14:41:26Z

sycl/include/sycl/queue.hpp

+  // no_handler
+
+private:
+  // NOTE: the name of this function - "kernel_single_task" - is used by the


Can this live outside queue directly in sycl::detail? If private queue access is an issue, then maybe

struct LaunchUtils { static blah-blah(...) {} }; class queue { friend struct LaunchUtils; }

aelovikov-intel · 2025-06-06T14:44:22Z

sycl/source/detail/queue_impl.cpp

+
+void queue_impl::extractArgsAndReqsFromLambda(
+    char *LambdaPtr, detail::kernel_param_desc_t (*ParamDescGetter)(int),
+    size_t NumKernelParams, std::vector<ArgDesc> &Args) {


Why can't you just return Args?

aelovikov-intel · 2025-06-06T14:44:54Z

sycl/source/detail/queue_impl.cpp

+}
+
+void queue_impl::submit_no_handler(
+  const std::shared_ptr<queue_impl> &Self,


Not necessary, see #18715.

aelovikov-intel · 2025-06-06T14:46:24Z

sycl/source/detail/queue_impl.cpp

+  void *KernelFunc, int KernelNumParams,
+  detail::kernel_param_desc_t (*KernelParamDescGetter)(int),
+  detail::KernelNameBasedCacheT *KernelNameBasedCachePtr) {


IMO, this should be a single argument. Ideally, renamed KerneNameBasedCachePtr->TypeErasedKernelInfo. I've had a discussion with @sergey-semenov about this, he was going to prototype that change. Can you sync with him about this?

aelovikov-intel · 2025-06-06T14:47:24Z

sycl/source/detail/queue_impl.cpp

+
+  // TODO external event
+
+  bool KernelFastPath = true;


This is a very bad variable name. Fast in a legacy code as a hack is somewhat meaningful, but in the new highly-optimized code we need to be very specific about what exactly is fast here and why it isn't true in general.

aelovikov-intel · 2025-06-06T14:49:31Z

sycl/source/detail/queue_impl.cpp

+      std::unique_lock<std::mutex> Lock(MMutex);
+      MDefaultGraphDeps.LastEventPtr = EventImpl;
+    }
+  }


Does this work with the existing queue_impl::MLastEvent if there are concurrent old-style submissions?

Yes, the intention is to support concurrent new and old submissions.

aelovikov-intel · 2025-06-06T14:50:23Z

sycl/source/detail/queue_impl.cpp

+void queue_impl::submit_no_handler(
+  const std::shared_ptr<queue_impl> &Self,
+  detail::NDRDescT NDRDesc, const char *KernelName,


I don't agree. We have so much instance of ugliest code duplication added with a promise to refactor once design settles. Please do the right thing from the start.

igchor · 2025-06-06T15:33:50Z

sycl/source/detail/queue_impl.cpp

+
+    if (isInOrder() && LastEvent && !Scheduler::CheckEventReadiness(MContext, LastEvent)) {
+      KernelFastPath = false;
+      ur_event_handle_t LastEventHandle = LastEvent->getHandle();


If the event is not ready, I think getHandle() can return NULL (if the command is a host task or not yet enqueued yet)

slawekptak had a problem deploying to WindowsCILock June 6, 2025 08:12 — with GitHub Actions Failure

slawekptak requested review from Pennycook, uditagarwal97 and aelovikov-intel June 6, 2025 08:22

slawekptak had a problem deploying to WindowsCILock June 6, 2025 08:37 — with GitHub Actions Failure

Pennycook reviewed Jun 6, 2025

View reviewed changes

slawekptak force-pushed the no_handler_submit branch from f0422df to 161bef1 Compare June 6, 2025 10:32

slawekptak had a problem deploying to WindowsCILock June 6, 2025 10:32 — with GitHub Actions Failure

slawekptak had a problem deploying to WindowsCILock June 6, 2025 10:54 — with GitHub Actions Failure

slawekptak force-pushed the no_handler_submit branch from 161bef1 to d016855 Compare June 6, 2025 14:18

slawekptak had a problem deploying to WindowsCILock June 6, 2025 14:19 — with GitHub Actions Failure

slawekptak had a problem deploying to WindowsCILock June 6, 2025 14:45 — with GitHub Actions Failure

aelovikov-intel reviewed Jun 6, 2025

View reviewed changes

[WIP] No handler submit

d016855

igchor reviewed Jun 6, 2025

View reviewed changes

[WIP] No handler submit #18842

Are you sure you want to change the base?

[WIP] No handler submit #18842

Uh oh!

Conversation

slawekptak commented Jun 6, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!