Enhance querying kernels preferred wgsize #16186

omarahmed1111 · 2024-11-26T14:44:24Z

Work-group sizes currently rely on the device maximum rather than the maximum reported by a kernel query. This can raise an error, because the device maximum may be larger than what the kernel is actually allowed to use.

This PR makes choosing the wgsize safer for the kernels. The approach has two parts:

If the reduction kernel was given a name by the user (parallel_for<class Name>), we use this name to query the best wgsize for this kernel.
If the reduction kernel was not named by the user, we use an approximate but safe approach: we query all the reduction kernels in the SYCL application for their best wgsize, pick the minimum, and use it for the kernel.
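The two-part selection described above can be sketched in plain C++. This is only an illustration of the decision logic, not the code in this PR: `chooseWGSize` and its parameters are hypothetical names, and the per-kernel limits are assumed to have been obtained elsewhere (for example from a kernel bundle query).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper, not part of the SYCL runtime.
//  - DeviceMax: device-wide maximum work-group size.
//  - NamedKernelMax: per-kernel limit, present only when the user named the
//    kernel and it could therefore be queried directly.
//  - AllReductionKernelMax: limits of every reduction kernel in the
//    application, used as the conservative fallback.
inline std::size_t
chooseWGSize(std::size_t DeviceMax, std::optional<std::size_t> NamedKernelMax,
             const std::vector<std::size_t> &AllReductionKernelMax) {
  assert(DeviceMax > 0 && "a device supports at least one work-item");
  // Part 1: a named kernel can be queried exactly.
  if (NamedKernelMax)
    return std::min(DeviceMax, *NamedKernelMax);
  // Part 2: otherwise take the minimum over all reduction kernels, which is
  // safe for any of them but may be smaller than strictly necessary.
  std::size_t Result = DeviceMax;
  for (std::size_t Limit : AllReductionKernelMax)
    Result = std::min(Result, Limit);
  return Result;
}
```

The fallback never exceeds any single kernel's limit, which is what makes it safe, at the cost of possibly under-using kernels that could run with larger groups.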

The second, approximate part could be made more accurate by using this PR, which would give each reduction kernel a unique name and so make it possible to query them at runtime.

Co-authored-by: Georgi Mirazchiyski <[email protected]>

omarahmed1111 · 2024-12-02T11:57:20Z

@intel/llvm-reviewers-runtime Could I get a review on this when possible? Thanks!

uditagarwal97 · 2024-12-02T12:35:19Z

sycl/test/abi/sycl_symbols_linux.dump

-_ZN4sycl3_V16detail22reduGetPreferredWGSizeERSt10shared_ptrINS1_10queue_implEEm
+_ZN4sycl3_V16detail28reduGetPreferredDeviceWGSizeERSt10shared_ptrINS1_10queue_implEEm


ABI breaking changes? If so, we need to put them under the fpreview-breaking-flag.

I think there is little to no need to rename reduGetPreferredWGSize to reduGetPreferredDeviceWGSize. Leaving it as it was avoids an ABI breaking change too.

uditagarwal97 · 2024-12-02T12:43:25Z

sycl/include/sycl/reduction.hpp

+
+  // If the reduction kernel is not name defined, we won't be able to query the
+  // exact kernel for the best wgsize, so we query all the reduction kernels for
+  // thier wgsize and use the minimum wgsize as a safe and approximate option.


Suggested change

// thier wgsize and use the minimum wgsize as a safe and approximate option.

// their wgsize and use the minimum wgsize as a safe and approximate option.

uditagarwal97 · 2024-12-02T12:46:08Z

sycl/include/sycl/reduction.hpp

@@ -2741,7 +2779,29 @@ void reduction_parallel_for(handler &CGH, range<Dims> Range,
  // TODO: currently the preferred work group size is determined for the given


Should this TODO be updated based on the changes in this PR?

GeorgeWeb

I've added just an initial comment for now, based on a first skim. I'll finish the review with a more involved follow-up look.

As a first pass this looks okay, though it will be inaccurate for unnamed lambda kernels / auto_name as they are not queried via kernel bundles here.

Ideally, I would like to hear from @steffenlarsen and/or @aelovikov-intel, if time allows, whether they have any suggestions on tackling this issue design-wise. (Here was a related, quite brutal attempt to use kernel bundles in all cases, for unnamed as well as named kernel lambdas, #16009, but the refactoring is quite large and unsightly.)

GeorgeWeb · 2024-12-02T12:52:01Z

sycl/include/sycl/reduction.hpp

@@ -1515,6 +1536,8 @@ template <> struct NDRangeReduction<reduction::strategy::range_basic> {
    using Name = __sycl_reduction_kernel<reduction::MainKrn, KernelName,
                                         reduction::strategy::range_basic>;

+    WGSize = std::min(WGSize, reduGetPreferredKernelWGSize<Name>(Queue));
+
    CGH.parallel_for<Name>(NDRange, Properties, [=](nd_item<1> NDId) {


If we are now recalculating WGSize based on the named kernel's info query, we likely need to update the NDRange for the kernel dispatch here and in all of the other reduction strategy specialisations.
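To illustrate why the range has to follow the work-group size, here is a minimal sketch in plain C++. `LaunchConfig` and `makeLaunchConfig` are hypothetical names, and rounding the global size up to a whole number of groups is an assumption made for the example; the actual reduction strategies may derive their ranges differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the (global, local) pair of a 1-D nd_range.
struct LaunchConfig {
  std::size_t GlobalSize;
  std::size_t LocalSize;
};

// Recomputes both sizes from the clamped work-group size, so the global size
// stays a multiple of the local size after the per-kernel limit is applied.
inline LaunchConfig makeLaunchConfig(std::size_t NWorkItems,
                                     std::size_t DeviceWGSize,
                                     std::size_t KernelWGSize) {
  assert(NWorkItems > 0 && DeviceWGSize > 0 && KernelWGSize > 0);
  std::size_t WGSize = std::min(DeviceWGSize, KernelWGSize);
  std::size_t NGroups = (NWorkItems + WGSize - 1) / WGSize; // round up
  return {NGroups * WGSize, WGSize};
}
```

If only the local size were lowered while a previously computed global size was kept, the global size might no longer be divisible by the local size, which an nd_range launch does not allow.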

GeorgeWeb · 2024-12-02T18:05:21Z

sycl/include/sycl/reduction.hpp

+    auto ExecBundle =
+        get_kernel_bundle<KernelName, bundle_state::executable>(Ctx, {Dev});
+    kernel Kernel = ExecBundle.template get_kernel<KernelName>();
+    MaxWGSize = Kernel.template get_info<work_group_size>(Dev);


Similarly to reduGetPreferredWGSize in reduction.cpp, I think this function should probably also respect the SYCL_REDUCTION_PREFERRED_WORKGROUP_SIZE environment variable value from SYCLConfig.
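As a rough sketch of what respecting an override could look like, the plain C++ below combines a queried per-kernel limit with a user-supplied value. It assumes, purely for illustration, that the override is a single positive integer; the real SYCL_REDUCTION_PREFERRED_WORKGROUP_SIZE syntax and the SYCLConfig interface may differ, and `applyPreferredOverride` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical helper: OverrideText is the raw override value (or nullptr
// when unset). Assumption: the value is a plain decimal integer.
inline std::size_t applyPreferredOverride(std::size_t KernelMaxWGSize,
                                          const char *OverrideText) {
  assert(KernelMaxWGSize > 0);
  if (OverrideText == nullptr || *OverrideText == '\0')
    return KernelMaxWGSize; // nothing requested
  char *End = nullptr;
  unsigned long long Parsed = std::strtoull(OverrideText, &End, 10);
  // Ignore values that are not entirely numeric, or that are zero.
  if (End == OverrideText || *End != '\0' || Parsed == 0)
    return KernelMaxWGSize;
  // Never exceed what the kernel itself allows.
  if (Parsed >= KernelMaxWGSize)
    return KernelMaxWGSize;
  return static_cast<std::size_t>(Parsed);
}
```

A caller could pass `std::getenv("SYCL_REDUCTION_PREFERRED_WORKGROUP_SIZE")` as the second argument. Clamping to the kernel limit is a design choice of this sketch, so that an override cannot reintroduce the launch error the PR is trying to avoid.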

aelovikov-intel · 2024-12-02T23:42:59Z

Ideally, I would like to hear from @steffenlarsen and/or @aelovikov-intel, if time allows, whether they have any suggestions on tackling this issue design-wise. (Here was a related, quite brutal attempt to use kernel bundles in all cases, for unnamed as well as named kernel lambdas, #16009, but the refactoring is quite large and unsightly.)

I haven't had time today, but will try to do it tomorrow.

aelovikov-intel · 2024-12-03T21:30:17Z

I had a chat with @tahonermann and he pointed me to this

llvm/sycl/include/sycl/kernel.hpp

Lines 44 to 58 in 4881d6d

    
/// Helper struct to get a kernel name type based on given \c Name and \c Type
/// types: if \c Name is undefined (is a \c auto_name) then \c Type becomes
/// the \c Name.
template <typename Name, typename Type> struct get_kernel_name_t {
  using name = Name;
};

/// Specialization for the case when \c Name is undefined.
/// This is only legal with our compiler with the unnamed lambda extension or if
/// the kernel is a functor object. For the case where \c Type is a lambda
/// function and unnamed lambdas are disabled, the compiler will issue a
/// diagnostic.
template <typename Type> struct get_kernel_name_t<detail::auto_name, Type> {
  using name = Type;
};

and suggested we could do something like this here:

  template <KernelName, KernelType>
  auto reduction_parallel_for(...) {
    using MainKrn = unnamed ? RedMainKrn<KernelType> : RedMainKrn<KernelName>;
    
    auto kb = get_kernel_bundle<MainKrn>(...);
    
    // Use kb to deduce submit info...
    
    q.parallel_for<MainKrn>(...);
    
    // repeat for AuxKrn.
  }
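The `unnamed ? ... : ...` selection in the pseudocode above has to happen at the type level. A minimal sketch of that selection in standard C++17 follows, reusing the shape of `get_kernel_name_t`. Here `auto_name` is a local stand-in for `sycl::detail::auto_name`, and `RedMainKrn`, `MainKrnFor`, `UserName` and `Functor` are hypothetical names used only for the example.

```cpp
#include <type_traits>

struct auto_name {}; // stand-in for sycl::detail::auto_name

// Same shape as the helper quoted from kernel.hpp.
template <typename Name, typename Type> struct get_kernel_name_t {
  using name = Name;
};
template <typename Type> struct get_kernel_name_t<auto_name, Type> {
  using name = Type;
};

// Hypothetical tag type that marks "main reduction kernel for T".
template <typename T> struct RedMainKrn;

// Wraps the user's name when one was given, otherwise the kernel type.
template <typename KernelName, typename KernelType>
using MainKrnFor =
    RedMainKrn<typename get_kernel_name_t<KernelName, KernelType>::name>;

struct UserName; // an explicit kernel name
struct Functor { // a named functor type used as the kernel
  void operator()() const {}
};

static_assert(
    std::is_same_v<MainKrnFor<UserName, Functor>, RedMainKrn<UserName>>,
    "explicit name is wrapped");
static_assert(
    std::is_same_v<MainKrnFor<auto_name, Functor>, RedMainKrn<Functor>>,
    "kernel type is wrapped when no name was given");
```

The example deliberately uses a functor for the unnamed case. Whether the same wrapping is accepted when the kernel type is a lambda depends on the compiler's kernel-naming rules, which this sketch does not model.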

omarahmed1111 · 2024-12-12T17:00:53Z

I had a chat with @tahonermann and he pointed me to this

llvm/sycl/include/sycl/kernel.hpp

Lines 44 to 58 in 4881d6d

/// Helper struct to get a kernel name type based on given \c Name and \c Type
/// types: if \c Name is undefined (is a \c auto_name) then \c Type becomes
/// the \c Name.
template <typename Name, typename Type> struct get_kernel_name_t {
  using name = Name;
};

/// Specialization for the case when \c Name is undefined.
/// This is only legal with our compiler with the unnamed lambda extension or if
/// the kernel is a functor object. For the case where \c Type is a lambda
/// function and unnamed lambdas are disabled, the compiler will issue a
/// diagnostic.
template <typename Type> struct get_kernel_name_t<detail::auto_name, Type> {
  using name = Type;
};

and suggested we could do something like this here:
  template <KernelName, KernelType>
  auto reduction_parallel_for(...) {
    using MainKrn = unnamed ? RedMainKrn<KernelType> : RedMainKrn<KernelName>;
    
    auto kb = get_kernel_bundle<MainKrn>(...);
    
    // Use kb to deduce submit info...
    
    q.parallel_for<MainKrn>(...);
    
    // repeat for AuxKrn.
  }

@aelovikov-intel sorry for the late reply. I gave that a try, but the kernel type seems to be an unnamed type in the context of reduction kernels. It gave me this error:

error: unnamed type "the_kernel_given_name_that_I_provided" is invalid; provide a kernel name, or use '-fsycl-unnamed-lambda' to enable unnamed kernel lambdas

I might be misunderstanding something here, so it would be great if you could elaborate on the idea. (I was using the KernelType template parameter passed to the reduction kernel classes, like here.) This is what I was doing:

using Name = reduction::MainKrn<KernelName, reduction::strategy::multi, KernelType>;
using ReduName = std::conditional_t<std::is_same_v<KernelName, auto_name>, Name, KernelName>;
q.parallel_for<ReduName>(...);

aelovikov-intel · 2024-12-12T17:51:50Z

or use '-fsycl-unnamed-lambda' to enable unnamed kernel lambdas

Why isn't this enabled?

using ReduName = std::conditional_t<std::is_same_v<KernelName, auto_name>, Name, KernelName>;

You'd still need to wrap KernelName with reduction::MainKrn, but that should be irrelevant for the error you have.

tahonermann · 2024-12-12T21:51:17Z

or use '-fsycl-unnamed-lambda' to enable unnamed kernel lambdas

Why isn't this enabled?

I suspect it is. I think the issue is that a named kernel is being provided (a specialization of reduction::MainKrn) that is parameterized by an unnamed type. The compiler doesn't allow that even when unnamed lambda support is enabled.

The only workaround I was able to come up with was to wrap the kernel object instead of wrapping the kernel name. This might impose some overhead, but it should work. https://godbolt.org/z/35T9absYd.

template<typename KernelName, int Disambiguator>
struct WrappedKernelName;

template<typename KernelName = sycl::detail::auto_name, typename KernelType>
void f(sycl::handler &h, KernelType k) {
  constexpr bool IsUnnamed = std::is_same_v<KernelName, sycl::detail::auto_name>;
  if constexpr (IsUnnamed) {
    h.single_task([k]{ k(); });
  } else {
    h.single_task<WrappedKernelName<KernelName, 1>>(k);
  }
}
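The dispatch pattern above can be exercised without a SYCL toolchain by replacing the handler with a mock. The sketch below does that in standard C++17. `MockHandler`, `submit` and `MyKernel` are hypothetical names, `auto_name` is a local stand-in for `sycl::detail::auto_name`, and the mock only records whether a name was supplied, so it says nothing about how a real compiler names kernels.

```cpp
#include <type_traits>

struct auto_name {}; // stand-in for sycl::detail::auto_name

template <typename KernelName, int Disambiguator> struct WrappedKernelName;

// Mock of the handler: runs the callable immediately and records whether an
// explicit kernel name was supplied.
struct MockHandler {
  bool LastWasNamed = false;
  int Launches = 0;

  template <typename KernelName = auto_name, typename KernelType>
  void single_task(KernelType K) {
    LastWasNamed = !std::is_same_v<KernelName, auto_name>;
    ++Launches;
    K();
  }
};

struct MyKernel; // an explicit, user-provided kernel name

template <typename KernelName = auto_name, typename KernelType>
void submit(MockHandler &H, KernelType K) {
  if constexpr (std::is_same_v<KernelName, auto_name>) {
    // No name given: wrap the kernel object in a fresh lambda and let the
    // name default to auto_name.
    H.single_task([K] { K(); });
  } else {
    // Name given: wrap the name and pass the kernel object through.
    H.single_task<WrappedKernelName<KernelName, 1>>(K);
  }
}
```

Calling `submit(H, f)` takes the first branch and `submit<MyKernel>(H, f)` takes the second. In both cases the library ends up with a kernel identity distinct from the user's own, which is the property the workaround is after.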

omarahmed1111 · 2024-12-16T14:53:22Z

The only workaround I was able to come up with was to wrap the kernel object instead of wrapping the kernel name. This might impose some overhead, but it should work. https://godbolt.org/z/35T9absYd.
template<typename KernelName, int Disambiguator>
struct WrappedKernelName;

template<typename KernelName = sycl::detail::auto_name, typename KernelType>
void f(sycl::handler &h, KernelType k) {
  constexpr bool IsUnnamed = std::is_same_v<KernelName, sycl::detail::auto_name>;
  if constexpr (IsUnnamed) {
    h.single_task([k]{ k(); });
  } else {
    h.single_task<WrappedKernelName<KernelName, 1>>(k);
  }
}

@tahonermann I tried that, and it seems reasonable for wrapping the kernel name, but I am still a little confused about how we should get the unnamed kernel's name at runtime in order to query it for the wgsize.

tahonermann · 2024-12-16T17:02:31Z

@omarahmed1111,

I tried that, and it seems reasonable for wrapping the kernel name, but I am still a little confused about how we should get the unnamed kernel's name at runtime in order to query it for the wgsize.

What I demonstrated was wrapping the kernel name when a named type is provided and wrapping the kernel object in a lambda otherwise (and letting the kernel name default to auto_name).

The SYCL 2020 specification doesn't provide an interface to reflect a kernel name given a kernel type or object. This seems intentional since the same kernel type and/or object can be associated with multiple (explicitly provided) kernel names. If we can design a useful interface to reflect kernel names that appropriately handles the potential 1-N relationship, I think it would be worthwhile proposing it for standardization.

In the meantime, you can use the __builtin_sycl_unique_stable_name() builtin function to look up the name that the Intel SYCL library will use for implicitly named kernel object invocations (this reflects the name that is used when the kernel name is defaulted to sycl::detail::auto_name).

As part of the SYCL upstreaming effort, we are planning to retire the __builtin_sycl_unique_stable_name() builtin function in favor of a set of builtins that reflect various properties of SYCL kernels. Feel free to use that builtin function now, just understand that you'll be required to migrate to something else in the (hopefully) near future.

omarahmed1111 · 2024-12-19T17:04:37Z

@omarahmed1111,

I tried that, and it seems reasonable for wrapping the kernel name, but I am still a little confused about how we should get the unnamed kernel's name at runtime in order to query it for the wgsize.

What I demonstrated was wrapping the kernel name when a named type is provided and wrapping the kernel object in a lambda otherwise (and letting the kernel name default to auto_name).

Ah okay, that makes sense.

The SYCL 2020 specification doesn't provide an interface to reflect a kernel name given a kernel type or object. This seems intentional since the same kernel type and/or object can be associated with multiple (explicitly provided) kernel names. If we can design a useful interface to reflect kernel names that appropriately handles the potential 1-N relationship, I think it would be worthwhile proposing it for standardization.

Yeah, an interface like that would be useful for making those cases more concrete. I might give it some thought and see if I can come up with some good ideas.

In the meantime, you can use the __builtin_sycl_unique_stable_name() builtin function to look up the name that the Intel SYCL library will use for implicitly named kernel object invocations (this reflects the name that is used when the kernel name is defaulted to sycl::detail::auto_name).

As part of the SYCL upstreaming effort, we are planning to retire the __builtin_sycl_unique_stable_name() builtin function in favor of a set of builtins that reflect various properties of SYCL kernels. Feel free to use that builtin function now, just understand that you'll be required to migrate to something else in the (hopefully) near future.

Thanks for sharing this info; I wasn't aware of __builtin_sycl_unique_stable_name(). I was trying to avoid heavy refactoring in this PR, either by finding a way to query the kernel's preferred wgsize accurately without refactoring the reduction kernels, or by making the preferred wgsize more of an estimate (the current state of this PR). If we used __builtin_sycl_unique_stable_name() and then had to migrate away from it later, the other PR seems like the more suitable long-term solution anyway. I think my time is better spent completing the other PR.

omarahmed1111 requested a review from a team as a code owner November 26, 2024 14:44

omarahmed1111 requested a review from uditagarwal97 November 26, 2024 14:44

omarahmed1111 requested a review from GeorgeWeb November 26, 2024 14:45

omarahmed1111 force-pushed the enhance-querying-kernel-wgsize branch from b6a69bc to d57635b Compare November 26, 2024 17:20

Enhance querying kernels preferred wgsize

71739a8

Co-authored-by: Georgi Mirazchiyski <[email protected]>

omarahmed1111 force-pushed the enhance-querying-kernel-wgsize branch from d57635b to 71739a8 Compare November 28, 2024 14:53

uditagarwal97 reviewed Dec 2, 2024

GeorgeWeb reviewed Dec 2, 2024

omarahmed1111 marked this pull request as draft January 10, 2025 12:29
