Improve SYCL backend __parallel_for performance for large input sizes #1870

Draft · mmichel11 wants to merge 26 commits into main from dev/mmichel11/parallel_for_sub_group_stride
Conversation

@mmichel11 (Contributor) commented Sep 20, 2024:

Summary

This PR improves __parallel_for performance for large input sizes by switching to an nd-range kernel that processes multiple inputs per work item, which enables us to utilize the full hardware memory bandwidth.

Details

On some target architectures, we are currently not hitting roofline memory bandwidth in our __parallel_for pattern. The cause is that our SYCL basic kernel implementation processes only a single element per work item, which is insufficient to fully utilize memory bandwidth on these architectures. Processing multiple inputs per work item enables us to issue enough loads / stores to saturate the hardware bandwidth, and explicitly striding across either a sub-group or a work-group ensures a coalesced access pattern.

An nd-range kernel has been added for large input sizes. It uses a heuristic based on the smallest-sized type in the set of provided ranges to determine the number of iterations to process per work item. This drastically improves performance on target architectures for large inputs across nearly all for-based algorithms.

A second kernel has been added, rather than merging both paths into a single kernel, to prevent extra runtime dispatch within the kernel, which hurts performance for small inputs. The trade-off is a small runtime overhead for selecting the best path on the host and the compilation of two kernels. For small-to-medium inputs, the SYCL basic kernel performs best.
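To illustrate the approach (a simplified, hypothetical sketch rather than the PR's actual kernel), the snippet below processes several elements per work item with a sub-group stride so that the items of a sub-group touch a contiguous, coalesced block of memory on every iteration. The function name strided_for_each, the iteration count, and the work-group size are illustrative values, and __data is assumed to point to USM memory:

#include <sycl/sycl.hpp>
#include <cstddef>
#include <cstdint>

// Sketch: each work item handles __iters_per_item elements, strided by the
// sub-group size. The work-group size is chosen as a multiple of any likely
// sub-group size so that all sub-groups are full except possibly the last.
template <typename _T, typename _F>
sycl::event
strided_for_each(sycl::queue& __q, _T* __data, std::size_t __count, _F __f)
{
    constexpr std::size_t __iters_per_item = 8; // heuristic, e.g. driven by sizeof(_T)
    constexpr std::size_t __wg_size = 256;      // illustrative work-group size
    const std::size_t __per_wg = __wg_size * __iters_per_item;
    const std::size_t __num_groups = (__count + __per_wg - 1) / __per_wg;

    return __q.parallel_for(
        sycl::nd_range<1>(sycl::range<1>(__num_groups * __wg_size), sycl::range<1>(__wg_size)),
        [=](sycl::nd_item<1> __item) {
            const auto __sub_group = __item.get_sub_group();
            const std::uint32_t __sg_size = __sub_group.get_local_linear_range();
            const std::uint32_t __sg_local_id = __sub_group.get_local_linear_id();
            // First element owned by this sub-group, then this item's offset within it.
            const std::size_t __sg_start =
                (__item.get_global_linear_id() - __sg_local_id) * __iters_per_item;
            std::size_t __idx = __sg_start + __sg_local_id;
            // Sub-group-strided loop: on iteration __i the sub-group accesses a
            // contiguous block starting at __sg_start + __i * __sg_size.
            for (std::size_t __i = 0; __i < __iters_per_item; ++__i, __idx += __sg_size)
                if (__idx < __count)
                    __f(__data[__idx]);
        });
}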

@SergeyKopienko (Contributor) commented:

@mmichel11 I have taken a look at the history of this branch; it probably makes sense to rebase your branch onto the current state of main or merge the main branch into your PR, since there are a lot of new commits in main now.

@mmichel11 force-pushed the dev/mmichel11/parallel_for_sub_group_stride branch from e4b40a5 to 8b9c3c9 on September 25, 2024 13:22
@mmichel11 added this to the 2022.8.0 milestone on Sep 25, 2024
@dmitriy-sobolev (Contributor) left a comment:

There are some things not directly related to the main part of the PR, which looks good to me.

const bool __is_full_sub_group =
    __sub_group_start_idx + __iters_per_work_item * __sub_group_size <= __count;
const std::size_t __work_item_idx = __sub_group_start_idx + __sub_group_local_id;
return std::make_tuple(__work_item_idx, __sub_group_size, __is_full_sub_group);
@dmitriy-sobolev (Contributor) commented Oct 2, 2024:

Could you include <tuple>?

@mmichel11 (Contributor, Author) replied:

Done

Contributor:

From my point of view, the usage of std::make_tuple for primitive types doesn't make sense at all.

@mmichel11 (Contributor, Author) replied:

Just std::tuple is now used in the new PR.

Comment on lines 788 to 808
template <typename _Tuple>
class __min_tuple_type_size;

template <typename _T>
class __min_tuple_type_size<std::tuple<_T>>
{
  public:
    static constexpr std::size_t value = sizeof(_T);
};

template <typename _T, typename... _Ts>
class __min_tuple_type_size<std::tuple<_T, _Ts...>>
{
    static constexpr std::size_t __min_type_value_ts = __min_tuple_type_size<std::tuple<_Ts...>>::value;

  public:
    static constexpr std::size_t value = std::min(sizeof(_T), __min_type_value_ts);
};

template <typename _Tuple>
inline constexpr std::size_t __min_tuple_type_size_v = __min_tuple_type_size<_Tuple>::value;
@dmitriy-sobolev (Contributor) commented Oct 2, 2024:

This can be simplified:

template<typename _Tuple>
struct __min_tuple_type_size;

template<typename... Ts>
struct __min_tuple_type_size<std::tuple<Ts...>> {
    static constexpr std::size_t value = std::min({sizeof(Ts)...});
};

The _v alias is not necessary, as it is used only once.
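For illustration, a possible compile-time check of the simplified trait above (assuming <tuple>, <algorithm>, and <cstddef> are included):

// The fold over sizeof(Ts)... selects the smallest element size at compile time.
static_assert(__min_tuple_type_size<std::tuple<double, int, char>>::value == sizeof(char));
static_assert(__min_tuple_type_size<std::tuple<float>>::value == sizeof(float));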

@mmichel11 (Contributor, Author) replied:

Thanks, I have replaced __min_tuple_type_size with __min_nested_type_size, which avoids having to flatten the tuple first, and applied these ideas there.
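For readers following along, a minimal sketch of what such a recursive utility could look like — the actual __min_nested_type_size in the new PR may differ, and a real version would presumably also specialize for oneapi::dpl::__internal::tuple:

#include <algorithm>
#include <cstddef>
#include <tuple>

// Base case: for a non-tuple type the minimum nested size is simply sizeof(_T).
template <typename _T>
struct __min_nested_type_size
{
    static constexpr std::size_t value = sizeof(_T);
};

// Recursive case: for a tuple, take the minimum over the (recursively computed)
// sizes of its element types, without flattening first.
template <typename... _Ts>
struct __min_nested_type_size<std::tuple<_Ts...>>
{
    static constexpr std::size_t value = std::min({__min_nested_type_size<_Ts>::value...});
};

static_assert(__min_nested_type_size<std::tuple<double, std::tuple<int, char>>>::value == sizeof(char));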

@@ -793,6 +793,25 @@ struct __decay_with_tuple_specialization<::std::tuple<_Args...>>
template <typename... _Args>
using __decay_with_tuple_specialization_t = typename __decay_with_tuple_specialization<_Args...>::type;

// Flatten nested std::tuple or oneapi::dpl::__internal::tuple types into a single std::tuple.
template <typename _T>
struct __flatten_std_or_internal_tuple
Contributor:

Optional suggestion: __flatten_std_or_internal_tuple -> __flatten_tuple.

@mmichel11 (Contributor, Author) replied:

This utility has been removed.

@@ -793,6 +793,25 @@ struct __decay_with_tuple_specialization<::std::tuple<_Args...>>
template <typename... _Args>
using __decay_with_tuple_specialization_t = typename __decay_with_tuple_specialization<_Args...>::type;

// Flatten nested std::tuple or oneapi::dpl::__internal::tuple types into a single std::tuple.
template <typename _T>
struct __flatten_std_or_internal_tuple
Contributor:

I'd recommend moving __flatten_std_or_internal_tuple into utils.h. It is a niche utility not related to the core part of the class.

@mmichel11 (Contributor, Author) replied:

This was originally done because tuple_impl.h includes utils.h; otherwise, we would have to forward declare our internal tuple to avoid a circular dependency.

The new utility supports arbitrary types and doesn't require any specializations for our internal tuple, so it can easily be placed in utils.h.

__stride_recommender(const sycl::nd_item<1>& __item, std::size_t __count, std::size_t __iters_per_work_item,
                     std::size_t __work_group_size)
{
    if constexpr (oneapi::dpl::__internal::__is_spirv_target_v)
Contributor:

Could you include utils.h where __is_spirv_target_v is defined?

@mmichel11 (Contributor, Author) replied:

Done

Commits:

* 128 byte memory operations are performed instead of 512 after inspecting the assembly. Processing 512 bytes per sub-group still seems to be the best value after experimentation.
* Move __stride_recommender into __parallel_for_large_submitter
* Use {} to invoke constructor
* Simplify if-else statements in for dispatch
@mmichel11 force-pushed the dev/mmichel11/parallel_for_sub_group_stride branch from 8a95b24 to 33337f8 on November 6, 2024 19:46
@mmichel11 marked this pull request as draft on November 6, 2024 19:52
@mmichel11 (Contributor, Author) commented:

Thanks for the reviews everyone. I have addressed all current comments.

As discussed offline, we will use the current state of the PR as a starting point to introduce vectorized load / store paths where they are performant by rewriting our bricks for parallel_for. I will likely open a second PR into this branch with these changes once they are complete.

// Compile two kernels: one for small-to-medium inputs and a second for large. This avoids runtime checks within a
// single kernel that worsen performance for small cases. If the number of iterations of the large submitter is 1,
// then only compile the basic kernel as the two versions are effectively the same.
if constexpr (__large_submitter::__iters_per_work_item > 1)
Contributor:

Can we combine this if constexpr with the next if?
Is there any real benefit from having these two separate conditions?

@mmichel11 (Contributor, Author) replied:

Combining the two would turn it into a runtime if conditional. Even if __iters_per_work_item is known at compile time and the compiler can optimize it out, there is still a chance the kernel would be unnecessarily compiled. I think it is best to keep the if constexpr so we can be sure to avoid compiling the large submitter whenever possible.
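A small, self-contained toy that mirrors the pattern under discussion (the submitter types and the threshold are placeholders, not the PR's actual code); it shows why the outer branch must be if constexpr — when __iters_per_work_item is 1, the large branch is discarded at compile time and that path is never instantiated:

#include <cstddef>
#include <iostream>

// Placeholder stand-ins for the two submitters.
struct __large_submitter
{
    static constexpr std::size_t __iters_per_work_item = 8; // may be 1 for large element types
    void operator()(std::size_t __n) const { std::cout << "large kernel, n=" << __n << '\n'; }
};

struct __basic_submitter
{
    void operator()(std::size_t __n) const { std::cout << "basic kernel, n=" << __n << '\n'; }
};

void __dispatch(std::size_t __count)
{
    constexpr std::size_t __large_threshold = 1 << 20; // illustrative cut-over point
    // If __iters_per_work_item == 1, this branch is discarded at compile time,
    // so the large-submitter path is never instantiated or compiled.
    if constexpr (__large_submitter::__iters_per_work_item > 1)
    {
        if (__count >= __large_threshold)
        {
            __large_submitter{}(__count);
            return;
        }
    }
    __basic_submitter{}(__count);
}

int main()
{
    __dispatch(1000);    // basic kernel
    __dispatch(1 << 22); // large kernel
}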

{
    assert(oneapi::dpl::__ranges::__get_first_range_size(__rngs...) > 0);
    _PRINT_INFO_IN_DEBUG_MODE(__exec);
    auto __event = __exec.queue().submit([&__rngs..., &__brick, &__exec, __count](sycl::handler& __cgh) {
Contributor:

I don't think we have any other cases where we capture the policy in the submit call.
Maybe it would be better to evaluate

            const std::size_t __work_group_size =
                oneapi::dpl::__internal::__max_work_group_size(__exec, __max_work_group_size);

outside of submit and capture __work_group_size by value?

@mmichel11 (Contributor, Author) replied:

Sure, applied to the new PR (this one will be closed soon).

const std::size_t __work_item_idx = __work_group_start_idx + __item.get_local_linear_id();
const bool __is_full_work_group =
    __work_group_start_idx + __iters_per_work_item * __work_group_size <= __count;
return std::make_tuple(__work_item_idx, __work_group_size, __is_full_work_group);
Contributor:

The same comment as above regarding std::make_tuple.

@mmichel11 (Contributor, Author) replied:

Just std::tuple is now used in the new PR.

Comment on lines +334 to +336
const std::size_t __num_items = __num_groups * __work_group_size;
__cgh.parallel_for<_Name...>(
    sycl::nd_range(sycl::range<1>(__num_items), sycl::range<1>(__work_group_size)),
Contributor:

Suggested change:

- const std::size_t __num_items = __num_groups * __work_group_size;
- __cgh.parallel_for<_Name...>(
-     sycl::nd_range(sycl::range<1>(__num_items), sycl::range<1>(__work_group_size)),
+ __cgh.parallel_for<_Name...>(
+     sycl::nd_range(sycl::range<1>(__num_groups * __work_group_size), sycl::range<1>(__work_group_size)),

Contributor:

It's more readable from my point of view.

@mmichel11 (Contributor, Author) replied:

Agreed, it is in the new PR.

if constexpr (oneapi::dpl::__internal::__is_spirv_target_v)
{
    const __dpl_sycl::__sub_group __sub_group = __item.get_sub_group();
    const std::uint32_t __sub_group_size = __sub_group.get_local_linear_range();
Contributor:

All these functions return std::size_t. Could you please explain why you are using std::uint32_t instead?

@mmichel11 (Contributor, Author) replied:

The signature of the sub-group member function is uint32_t get_local_linear_range() const, and related functions also return uint32_t. Were you perhaps thinking of the group class?

{
    const std::uint8_t __adjusted_iters_per_work_item =
        oneapi::dpl::__internal::__dpl_ceiling_div(__count - __idx, __stride);
    for (std::uint8_t __i = 0; __i < __adjusted_iters_per_work_item; ++__i)
Contributor:

Should we use _ONEDPL_PRAGMA_UNROLL for this for-loop too?

@mmichel11 (Contributor, Author) replied:

Not in this case (there is a similar case in the strided loop utility in the new PR). Because the loop bound is computed at run time, the loop cannot be unrolled.

This path is only taken by the last sub-group / work-group, so the performance impact is negligible.
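For context, a sketch of the two loop shapes being discussed (assuming __is_full, __idx, __stride, __count, and a brick __f are in scope; _ONEDPL_PRAGMA_UNROLL and __dpl_ceiling_div are the oneDPL utilities already referenced above):

if (__is_full)
{
    // Full sub-group / work-group: the trip count is the compile-time constant
    // __iters_per_work_item, so the compiler can fully unroll the loop.
    _ONEDPL_PRAGMA_UNROLL
    for (std::uint8_t __i = 0; __i < __iters_per_work_item; ++__i)
        __f(__idx + __i * __stride);
}
else
{
    // Last sub-group / work-group: the trip count depends on __count and is only
    // known at run time, so an unroll pragma would not help here.
    const std::uint8_t __adjusted_iters_per_work_item =
        oneapi::dpl::__internal::__dpl_ceiling_div(__count - __idx, __stride);
    for (std::uint8_t __i = 0; __i < __adjusted_iters_per_work_item; ++__i)
        __f(__idx + __i * __stride);
}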

@mmichel11 modified the milestone: 2022.8.0 on Dec 17, 2024
mmichel11 added a commit that referenced this pull request on Dec 20, 2024