Skip to content

Enhance querying kernels preferred wgsize #16186

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: sycl
Choose a base branch
from

Conversation

omarahmed1111
Copy link
Contributor

Work-group sizes currently rely on device maximum rather than the max from a kernel query. This could result in an error raised as the device maximum could be more than what the kernel is actually allowed to use.

This PR uses an approach to make choosing the wgsize more safer for the kernels. The approach used composed of 2 sides:

  • if the reduction kernel was given a name by the user parallel_for<class Name> then we use this name to query the best wgsize for this kernel.
  • If the reduction kernel is not name defined by the user, we use an approximate safe approach where we query all the reduction kernels in the sycl application for their best wgsize, and we pick the minimum wgsize and use it for the kernel.

The second approximate approach part could be more accurate by using this PR that would give each reduction kernel a unique name that would make querying them possible at runtime.

@omarahmed1111
Copy link
Contributor Author

@intel/llvm-reviewers-runtime Could I get a review on this when it is possible, Thanks!

_ZN4sycl3_V16detail22reduGetPreferredWGSizeERSt10shared_ptrINS1_10queue_implEEm
_ZN4sycl3_V16detail28reduGetPreferredDeviceWGSizeERSt10shared_ptrINS1_10queue_implEEm
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ABI breaking changes? If so, we need to put them under the fpreview-breaking-flag.

Copy link
Contributor

@GeorgeWeb GeorgeWeb Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is little to no need to rename reduGetPreferredWGSize to reduGetPreferredDeviceWGSize. Leaving it as it was avoids an ABI breaking change too.


// If the reduction kernel is not name defined, we won't be able to query the
// exact kernel for the best wgsize, so we query all the reduction kernels for
// thier wgsize and use the minimum wgsize as a safe and approximate option.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// thier wgsize and use the minimum wgsize as a safe and approximate option.
// their wgsize and use the minimum wgsize as a safe and approximate option.

@@ -2741,7 +2779,29 @@ void reduction_parallel_for(handler &CGH, range<Dims> Range,
// TODO: currently the preferred work group size is determined for the given
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this TODO be updated based on the changes in this PR?

Copy link
Contributor

@GeorgeWeb GeorgeWeb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added just an initial comment now based on first skim, I'll finish up review on a follow-up more involved look.

As a first pass this looks okay, though it will be inaccurate for unnamed lambda kernels / auto_name as they are not queried via kernel bundles here.

Ideally, I am interested to hear from @steffenlarsen and/or @aelovikov-intel if time allows them, if have any suggestions on tackling this issue design-wise. (here was a related quite brutal attempt to use kernel bundles in all cases / unnamed vs named kernel lambdas #16009 but the refactoring is quite large and unsightly)

@@ -1515,6 +1536,8 @@ template <> struct NDRangeReduction<reduction::strategy::range_basic> {
using Name = __sycl_reduction_kernel<reduction::MainKrn, KernelName,
reduction::strategy::range_basic>;

WGSize = std::min(WGSize, reduGetPreferredKernelWGSize<Name>(Queue));

CGH.parallel_for<Name>(NDRange, Properties, [=](nd_item<1> NDId) {
Copy link
Contributor

@GeorgeWeb GeorgeWeb Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are recalculating WGSize now based on the named kernel's info query, we likely need to update the NDRange for the kernel dispatch here and in all of the other reduction strategy implementations specialisations.

auto ExecBundle =
get_kernel_bundle<KernelName, bundle_state::executable>(Ctx, {Dev});
kernel Kernel = ExecBundle.template get_kernel<KernelName>();
MaxWGSize = Kernel.template get_info<work_group_size>(Dev);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to reduGetPreferredWGSize in reduction.cpp, I think this function should probably also respect the SYCL_REDUCTION_PREFERRED_WORKGROUP_SIZE environment variable value from SYCLConfig.

@aelovikov-intel
Copy link
Contributor

Ideally, I am interested to hear from @steffenlarsen and/or @aelovikov-intel if time allows them, if have any suggestions on tackling this issue design-wise. (here was a related quite brutal attempt to use kernel bundles in all cases / unnamed vs named kernel lambdas #16009 but the refactoring is quite large and unsightly)

I haven't had time today, but will try to do it tomorrow.

@aelovikov-intel
Copy link
Contributor

I had a chat with @tahonermann and he pointed me to this

/// Helper struct to get a kernel name type based on given \c Name and \c Type
/// types: if \c Name is undefined (is a \c auto_name) then \c Type becomes
/// the \c Name.
template <typename Name, typename Type> struct get_kernel_name_t {
using name = Name;
};
/// Specialization for the case when \c Name is undefined.
/// This is only legal with our compiler with the unnamed lambda extension or if
/// the kernel is a functor object. For the case where \c Type is a lambda
/// function and unnamed lambdas are disabled, the compiler will issue a
/// diagnostic.
template <typename Type> struct get_kernel_name_t<detail::auto_name, Type> {
using name = Type;
};
and suggested we could do something like this here:

  template <KernelName, KernelType>
  auto reduction_parallel_for(...) {
    using MainKrn = unnamed ? RedMainKrn<KernelType> : RedMainKrn<KernelName>;
    
    auto kb = get_kernel_bundle<MainKrn>(...);
    
    // Use kb to deduce submit info...
    
    q.parallel_for<MainKrn>(...);
    
    // repeat for AuxKrn.
  }

@omarahmed1111
Copy link
Contributor Author

I had a chat with @tahonermann and he pointed me to this

/// Helper struct to get a kernel name type based on given \c Name and \c Type
/// types: if \c Name is undefined (is a \c auto_name) then \c Type becomes
/// the \c Name.
template <typename Name, typename Type> struct get_kernel_name_t {
using name = Name;
};
/// Specialization for the case when \c Name is undefined.
/// This is only legal with our compiler with the unnamed lambda extension or if
/// the kernel is a functor object. For the case where \c Type is a lambda
/// function and unnamed lambdas are disabled, the compiler will issue a
/// diagnostic.
template <typename Type> struct get_kernel_name_t<detail::auto_name, Type> {
using name = Type;
};

and suggested we could do something like this here:

  template <KernelName, KernelType>
  auto reduction_parallel_for(...) {
    using MainKrn = unnamed ? RedMainKrn<KernelType> : RedMainKrn<KernelName>;
    
    auto kb = get_kernel_bundle<MainKrn>(...);
    
    // Use kb to deduce submit info...
    
    q.parallel_for<MainKrn>(...);
    
    // repeat for AuxKrn.
  }

@aelovikov-intel sry for late reply, I gave that a try but the kernel type seem to be an unnamed type in the context of reduction kernels. It gave me this error:

error: unnamed type "the_kernel_given_name_that_I_provided" is invalid; provide a kernel name, or use '-fsycl-unnamed-lambda' to enable unnamed kernel lambdas

I might have a misunderstanding here so if you could elaborate the idea more that would be great. (I was using the kernelType template param passed to the reduction kernel classes like here). I was doing it like that:

using Name = reduction::MainKrn<KernelName, reduction::strategy::multi, KernelType>;
using ReduName = std::conditional_t<std::is_same_v<KernelName, auto_name>, Name, KernelName>;
q.parallel_for<ReduName>(...);

@aelovikov-intel
Copy link
Contributor

aelovikov-intel commented Dec 12, 2024

or use '-fsycl-unnamed-lambda' to enable unnamed kernel lambdas

Why isn't this enabled?

using ReduName = std::conditional_t<std::is_same_v<KernelName, auto_name>, Name, KernelName>;

You'd still need to wrap KernelName with reduction::MainKrn, but that should be irrelevant for the error you have.

@tahonermann
Copy link
Contributor

or use '-fsycl-unnamed-lambda' to enable unnamed kernel lambdas

Why isn't this enabled?

I suspect it is. I think the issue is that a named kernel is being provided (a specialization of reduction::MainKrn) that is parameterized by an unnamed type. The compiler doesn't allow that even when unnamed lambda support is enabled.

The only workaround I was able to come up with was to, instead of wrapping the kernel name, to wrap the kernel object instead. This might impose some overhead, but it should work. https://godbolt.org/z/35T9absYd.

template<typename KernelName, int Disambiguator>
struct WrappedKernelName;

template<typename KernelName = sycl::detail::auto_name, typename KernelType>
void f(sycl::handler &h, KernelType k) {
  constexpr bool IsUnnamed = std::is_same_v<KernelName, sycl::detail::auto_name>;
  if constexpr (IsUnnamed) {
    h.single_task([k]{ k(); });
  } else {
	h.single_task<WrappedKernelName<KernelName, 1>>(k);
  }
}

@omarahmed1111
Copy link
Contributor Author

omarahmed1111 commented Dec 16, 2024

The only workaround I was able to come up with was to, instead of wrapping the kernel name, to wrap the kernel object instead. This might impose some overhead, but it should work. https://godbolt.org/z/35T9absYd.

template<typename KernelName, int Disambiguator>
struct WrappedKernelName;

template<typename KernelName = sycl::detail::auto_name, typename KernelType>
void f(sycl::handler &h, KernelType k) {
  constexpr bool IsUnnamed = std::is_same_v<KernelName, sycl::detail::auto_name>;
  if constexpr (IsUnnamed) {
    h.single_task([k]{ k(); });
  } else {
	h.single_task<WrappedKernelName<KernelName, 1>>(k);
  }
}

@tahonermann I tried that and it seems reasonable for wrapping the kernel name but i am still a little confused by how should we get the unnamed kernel name at runtime to query it for the wgsize?

@tahonermann
Copy link
Contributor

@omarahmed1111,

I tried that and it seems reasonable for wrapping the kernel name but i am still a little confused by how should we get the unnamed kernel name at runtime to query it for the wgsize?

What I demonstrated was wrapping the kernel name when a named type is provided and wrapping the kernel object in a lambda otherwise (and letting the kernel name default to auto_name).

The SYCL 2020 specification doesn't provide an interface to reflect a kernel name given a kernel type or object. This seems intentional since the same kernel type and/or object can be associated with multiple (explicitly provided) kernel names. If we can design a useful interface to reflect kernel names that appropriately handles the potential 1-N relationship, I think it would be worthwhile proposing it for standardization.

In the meantime, you can use the __builtin_sycl_unique_stable_name() builtin function to lookup the name that the Intel SYCL library will use for implicitly named kernel object invocations (this reflects the name that is used when the kernel name is defaulted to sycl::detail::auto_name).

As part of the SYCL upstreaming effort, we are planning to retire the __builtin_sycl_unique_stable_name() builtin function in favor of a set of builtins that reflect various properties of SYCL kernels. Feel free to use that builtin function now, just understand that you'll be required to migrate to something else in the (hopefully) near future.

@omarahmed1111
Copy link
Contributor Author

@omarahmed1111,

I tried that and it seems reasonable for wrapping the kernel name but i am still a little confused by how should we get the unnamed kernel name at runtime to query it for the wgsize?

What I demonstrated was wrapping the kernel name when a named type is provided and wrapping the kernel object in a lambda otherwise (and letting the kernel name default to auto_name).

Ah okay, that make sense.

The SYCL 2020 specification doesn't provide an interface to reflect a kernel name given a kernel type or object. This seems intentional since the same kernel type and/or object can be associated with multiple (explicitly provided) kernel names. If we can design a useful interface to reflect kernel names that appropriately handles the potential 1-N relationship, I think it would be worthwhile proposing it for standardization.

Yeah, that would be useful to have an interface like that to make that cases more concrete. Might give that a thought and see if I could come with some good ideas about that.

In the meantime, you can use the __builtin_sycl_unique_stable_name() builtin function to lookup the name that the Intel SYCL library will use for implicitly named kernel object invocations (this reflects the name that is used when the kernel name is defaulted to sycl::detail::auto_name).

As part of the SYCL upstreaming effort, we are planning to retire the __builtin_sycl_unique_stable_name() builtin function in favor of a set of builtins that reflect various properties of SYCL kernels. Feel free to use that builtin function now, just understand that you'll be required to migrate to something else in the (hopefully) near future.

Thanks for sharing this info, I wasn't aware of the __builtin_sycl_unique_stable_name(). I was trying to avoid the heavy refactoring in this PR by trying to see if there is a way to get an accurate query about the kernel preferred wgsize without refactoring the reduction kernels or even trying to make the kernel preferred wgsize a more of an estimation (current situation in this PR). I think if we used __builtin_sycl_unique_stable_name() and had to migrate from it later. Then, it seems the other PR is a more suitable long term solution anyway then. I think I will use the time better to complete the other PR as it should be a better long term solution for that.

@omarahmed1111 omarahmed1111 marked this pull request as draft January 10, 2025 12:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants