Re-implement SYCL backend parallel_for to improve bandwidth utilization #1976

Open · wants to merge 92 commits into base: main

Commits (92)
f3acdca
Optimize memory transactions in SYCL backend parallel for
mmichel11 Sep 5, 2024
06e06ff
clang-format
mmichel11 Sep 5, 2024
ab7a75f
Correct comment and error handling.
mmichel11 Sep 6, 2024
ec0761c
__num_groups bugfix
mmichel11 Sep 10, 2024
281f642
Introduce stride recommender for different targets and better distrib…
mmichel11 Sep 16, 2024
6ffb904
Cleanup
mmichel11 Sep 16, 2024
fad85fe
Unroll loop if possible
mmichel11 Sep 18, 2024
329f000
Revert "Unroll loop if possible"
mmichel11 Sep 18, 2024
420bd6c
Use a small and large kernel in parallel for
mmichel11 Sep 20, 2024
ef78c6a
Improve __iters_per_work_item heuristic.
mmichel11 Sep 20, 2024
7883c3e
Code cleanup
mmichel11 Sep 20, 2024
5c12d66
Clang format
mmichel11 Sep 23, 2024
36a602b
Update comments
mmichel11 Sep 23, 2024
4d645f6
Bugfix in comment
mmichel11 Sep 23, 2024
ca9a06f
More cleanup and better handle non-full case
mmichel11 Sep 23, 2024
3713d62
Rename __ndi to __item for consistency with codebase
mmichel11 Sep 24, 2024
305bf2b
Update all comments on kernel naming trick
mmichel11 Sep 24, 2024
3b50010
Handle non-full case in a cleaner way
mmichel11 Sep 24, 2024
8e5de99
Switch min tuple type utility to return size of type
mmichel11 Sep 24, 2024
65e0b05
Remove unnecessary template parameter
mmichel11 Sep 24, 2024
257815a
Make non-template function inline for ODR compliance
mmichel11 Sep 24, 2024
3929705
If the iters per work item is 1, then only compile the basic pfor kernel
mmichel11 Sep 24, 2024
31a7aae
Address several PR comments
mmichel11 Sep 25, 2024
08d24aa
Remove free function __stride_recommender
mmichel11 Sep 25, 2024
1748a6b
Accept ranges as forwarding references in __parallel_for_large_submitter
mmichel11 Sep 25, 2024
cc829e5
Address reviewer comments
mmichel11 Nov 6, 2024
8dc7706
Introduce vectorized for-path for small types and parallel_backend_sy…
mmichel11 Dec 16, 2024
1309f6a
Improve testing and cleanup of code
mmichel11 Dec 16, 2024
288499f
clang format
mmichel11 Dec 16, 2024
d683b72
Miscellaneous fixes identified during testing
mmichel11 Dec 17, 2024
b4cfcae
clang-format
mmichel11 Dec 17, 2024
62c104f
Fix ordering to __vector_load call
mmichel11 Dec 17, 2024
b525ab7
Add support for vectorization with C++20 parallel range APIs
mmichel11 Dec 17, 2024
7d16c16
Add device copyable specializations for new walk patterns
mmichel11 Dec 17, 2024
f9d63aa
Align vector_walk implementation with other vector functors
mmichel11 Dec 17, 2024
9aa36e1
Add back non-spirv path
mmichel11 Dec 17, 2024
b6d5d98
Further improve test coverage
mmichel11 Dec 17, 2024
4c1a974
Restore original shift_left due to implicit implementation requiremen…
mmichel11 Dec 17, 2024
bebd84b
Fix issues in vectorized rotate
mmichel11 Dec 18, 2024
02d0a18
Fix fpga parallel for compilation issues
mmichel11 Dec 18, 2024
1c3f455
Restore initial shift_left_right.pass.cpp
mmichel11 Dec 18, 2024
774e6f0
Fix test side issue when unnamed lambdas are disabled
mmichel11 Dec 18, 2024
cad0e1b
Add a vector path specialization for std::swap_ranges
mmichel11 Dec 18, 2024
0c2c9a8
General code cleanup
mmichel11 Dec 18, 2024
7aa5bf8
Bugfix with __pattern_swap using nanoranges
mmichel11 Dec 18, 2024
62a19fd
clang-format
mmichel11 Dec 19, 2024
b2128fe
Address applicable comments from PR #1870
mmichel11 Dec 20, 2024
2b1281b
Refactor __lazy_ctor_storage deleter
mmichel11 Jan 2, 2025
1c4ed8c
Address review comments
mmichel11 Jan 2, 2025
d0a66ae
Remove intrusive test macro and adjust input sizes in test framework
mmichel11 Jan 4, 2025
ac6d945
Make walk_scalar_base and walk_vector_or_scalar_base structs
mmichel11 Jan 4, 2025
4654b1d
Add missing max_n
mmichel11 Jan 4, 2025
59ea1ec
Add constructors for for-based bricks
mmichel11 Jan 4, 2025
bbee988
Remove extraneous {} and add constructor to custom_brick
mmichel11 Jan 6, 2025
33dc8b7
Limit recursive searching of __min_nested_type_size to tuples
mmichel11 Jan 6, 2025
8a0f4b5
Work around compiler vectorization issue
mmichel11 Jan 6, 2025
0f81298
Add missing decays
mmichel11 Jan 7, 2025
971edae
Add compile time check to ensure we do not get buffer pointer on host
mmichel11 Jan 7, 2025
e7309c9
Revert "Work around compiler vectorization issue"
mmichel11 Jan 7, 2025
d5c7157
Remove all begin() calls on views in vectorization paths
mmichel11 Jan 7, 2025
0280f7c
Remove unused __is_passed_directly_range utility
mmichel11 Jan 7, 2025
52ce868
Rename __scalar_path / __vector_path to __scalar_path_impl / __vector…
mmichel11 Jan 8, 2025
ab70533
Correct __vector_walk deleters and a type in __reverse_copy
mmichel11 Jan 8, 2025
a26cdba
Set upper limit of 10,000,000 for get_pattern_for_max_n
mmichel11 Jan 9, 2025
6db2d58
General cleanup and renaming for consistency
mmichel11 Jan 9, 2025
2e378ea
Explicitly list template types in specializations of __is_vectorizabl…
mmichel11 Jan 13, 2025
f387a4f
Remove unnecessary local variables
mmichel11 Jan 14, 2025
8a387b2
Remove unnecessary local variables in async and numeric headers
mmichel11 Jan 14, 2025
2ccb478
Correct optimization in __reverse_functor and improve explanation
mmichel11 Jan 16, 2025
af2e16f
Rename custom_brick to __custom_brick
mmichel11 Jan 16, 2025
6a4db2c
Rename __n to __full_range_size in vec utils and fix potential unused…
mmichel11 Jan 17, 2025
5e31e07
Remove unnecessary ternary operator and replace _Idx template with st…
mmichel11 Jan 17, 2025
5fe7c58
Add note to __reverse_copy, __rotate_copy, and minor cleanup
mmichel11 Jan 21, 2025
c9bf4c5
Switch runtime check to compile time check in __reverse_copy
mmichel11 Jan 21, 2025
35e5912
Update comment in __reverse_copy
mmichel11 Jan 21, 2025
fff3647
Remove the usage of __lazy_ctor_storage from all vectorization paths
mmichel11 Jan 16, 2025
deccd49
Remove unneeded template
mmichel11 Jan 21, 2025
1f1b87d
Remove __lazy_ctor_storage::__get_callable_deleter
mmichel11 Jan 21, 2025
b28826a
Address review comments
mmichel11 Jan 22, 2025
5efeb2e
Cleanup some types
mmichel11 Jan 22, 2025
8062c4e
Use __pstl_assign instead of operator= and revert bad change
mmichel11 Jan 22, 2025
96e6349
Avoid modulo in loop body of __rotate_copy
mmichel11 Jan 22, 2025
4d2255f
Make variables const where appropriate
mmichel11 Jan 22, 2025
43a92f6
::std -> std changes, add missing include, and clang-format
mmichel11 Jan 23, 2025
dee9659
Refactor __vector_path_impl of __brick_shift_left
mmichel11 Jan 24, 2025
ae9035f
Add TODO comment to unify vector and strided loop utils
mmichel11 Jan 24, 2025
ab6c28a
Add a vectorized path for __brick_shift_left
mmichel11 Jan 24, 2025
d6b870c
Update comment in __brick_shift_left
mmichel11 Jan 24, 2025
4bf4fe7
Clarify comment
mmichel11 Jan 24, 2025
0c8bf8a
Disambiguate kernel names when testing with multiple types and cleanup
mmichel11 Jan 24, 2025
5285df0
Add comments on access modes and link to GitHub issue
mmichel11 Jan 24, 2025
0d9e1e9
Update __pattern_unique comment
mmichel11 Jan 24, 2025
@@ -27,7 +27,7 @@ namespace oneapi::dpl::experimental::kt::gpu::esimd::__impl
{

//------------------------------------------------------------------------
// Please see the comment for __parallel_for_submitter for optional kernel name explanation
// Please see the comment above __parallel_for_small_submitter for optional kernel name explanation
//------------------------------------------------------------------------

template <bool __is_ascending, ::std::uint8_t __radix_bits, ::std::uint16_t __data_per_work_item,
18 changes: 12 additions & 6 deletions include/oneapi/dpl/internal/async_impl/async_impl_hetero.h
@@ -44,7 +44,9 @@ __pattern_walk1_async(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _For

auto __future_obj = oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n, __buf.all_view());
unseq_backend::walk1_vector_or_scalar<_ExecutionPolicy, _Function, decltype(__buf.all_view())>{
Contributor:
  1. walk1_vector_or_scalar is a rather long name. Maybe keep walk_n?
     As far as I understand, this is just a renaming, not a second "walker"?
  2. It probably makes sense to add a constructor for automatic type deduction
     (for example, see https://godbolt.org/z/z3Yfhbo5W).

Contributor Author:
  1. walk_n is still used in some other places and is currently more generic, so we do need a separate name. I do think something that reflects the different vector / scalar paths is best.
  2. I am not as familiar with CTAD, but my understanding is that all template types must be deduced from the constructor. The problem is that _Ranges... is only passed through the class template to establish tuning parameters, so it cannot be deduced from a constructor and must be explicitly specified by the caller. Since there is no partial CTAD as far as I am aware, I do not think it is possible to implement unless we pass some unused ranges through the constructor to deduce types. Is this correct, and if so, do you think it is still the best approach?

Contributor:
One thing you could consider is a "make" function which provides the partial deduction for you. You can provide _ExecutionPolicy and _Ranges... types explicitly as template args to a make function, and _Function could be deduced.
I personally think it's a bit of overkill for little benefit, but it's an option.
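A minimal, standalone sketch of that idea (illustrative names only, not the actual oneDPL API): the policy and range types are supplied explicitly, while the function type is deduced from the argument.

#include <cstddef>
#include <utility>

// Stand-in for the real brick; only the template-parameter ordering matters here.
template <typename _ExecutionPolicy, typename _Function, typename... _Ranges>
struct walk1_vector_or_scalar
{
    _Function __f;
    std::size_t __n;
};

// "make" helper: _ExecutionPolicy and _Ranges... are given explicitly,
// _Function is deduced from the argument, so the caller avoids spelling it out.
template <typename _ExecutionPolicy, typename... _Ranges, typename _Function>
auto
__make_walk1(_Function __f, std::size_t __n)
{
    return walk1_vector_or_scalar<_ExecutionPolicy, _Function, _Ranges...>{std::move(__f), __n};
}

// Usage (hypothetical types):
//   auto __brick = __make_walk1<policy_t, decltype(__buf.all_view())>(__f, static_cast<std::size_t>(__n));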

Contributor Author:
Thanks for pointing out how this can be done. I agree that it does not save much, and just listing the template types explicitly is the most straightforward option in this case.

Contributor (@MikeDvorskiy, Jan 24, 2025):
Regarding

2. "The problem with this is that _Ranges... is only passed through the class template to establish tuning parameters"

Basically, walk_vector_or_scalar_base is not really a base class. It only exists to compute three compile-time constants based on the input Ranges... types.
There is nothing to inherit: no implementation, no API.

In other words, it is a kind of "Params" type. It can be defined on the fly wherever the three compile-time constants __can_vectorize, __preferred_vector_size, and __preferred_iters_per_item are needed:

Params<Range1>::__preferred_iters_per_item
or
Params<Range1, Range2>::__preferred_vector_size
or, in the general case with a parameter pack:
Params<Ranges...>::__preferred_vector_size
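A self-contained sketch of this "Params" idea (names and heuristics are illustrative placeholders, not the oneDPL implementation), deriving the three constants purely from the range types:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <type_traits>

template <typename... _Ranges>
struct Params
{
    // Smallest element size across all ranges (assumes each range exposes value_type).
    static constexpr std::size_t __min_value_size =
        std::min({sizeof(typename std::decay_t<_Ranges>::value_type)...});
    // Placeholder heuristics: vectorize only for small element types.
    static constexpr bool __can_vectorize = __min_value_size <= 4;
    static constexpr std::uint8_t __preferred_vector_size = __can_vectorize ? 4 : 1;
    static constexpr std::uint8_t __preferred_iters_per_item = __can_vectorize ? 4 : 1;
};

// Queried on the fly, as in the comment above:
//   Params<Range1>::__preferred_iters_per_item
//   Params<Range1, Range2>::__preferred_vector_size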

Contributor Author:
I believe decoupling walk_vector_or_scalar_base from the brick into a Params class, as suggested, only works if every for-based brick is vectorizable, which is not the case (e.g. __custom_brick, __brick_shift_left). If the parallel for implementation were able to determine these parameters from the ranges alone, then we would not have to pass the range types through the brick's class templates.

What the inheritance gets us is that it ties a particular brick to a strategy with or without vectorization (walk_vector_or_scalar_base or walk_scalar_base), while still allowing special cases such as __brick_shift_left to be implemented without inheritance. The parallel for implementation then queries these compile-time parameters from the brick, which may inherit them from the base.

Contributor:
I see your point, @MikeDvorskiy, for the majority of the uses of these compile-time constants: they could be calculated inline in the functions based on the input range types. However, there are a few external uses, as described in this comment, from the parallel_for launch code. That usage requires the derived struct to carry the range type information prior to the actual function calls, and it is easiest if you can just query it like this.

You could instead have traits or helpers that take the range information when querying this at the parallel_for launch level. I don't have a strong opinion between the two without having the other full implementation to compare against.

Contributor (@danhoeflinger, Jan 24, 2025):
To try to clarify further, it could maybe look like this:

auto [__idx, __stride, __is_full] = __stride_recommender(
    __item, __count, __iters_per_work_item, _Fp::__preferred_vector_size<_Ranges...>, __work_group_size);

Contributor Author:
I see the initial point now. My understanding is that this is implementable through a variable template within the brick.

Something like:

template <typename _Rngs>
constexpr static std::uint8_t __preferred_vector_size = ...

There would be duplicated definitions of the variables in each brick, but we would not need to pass the ranges through the brick's template parameters. There are pros and cons to each approach, and the functionality and performance of each should be identical.

At this point in the review, I believe it is too late to make such a large design decision if we want to make the milestone. My suggestion is that we defer this to an issue and address it in the 2022.9.0 milestone.
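For reference, a minimal sketch of the variable-template alternative described above (illustrative names and a placeholder heuristic, not the actual brick):

#include <cstdint>

template <typename _Function>
struct __example_brick // the ranges are no longer class template parameters
{
    _Function __f;

    // Tuning constant exposed as a member variable template over the range types;
    // the 4-byte threshold is a placeholder heuristic.
    template <typename... _Ranges>
    static constexpr std::uint8_t __preferred_vector_size =
        ((sizeof(typename _Ranges::value_type) <= 4) && ...) ? 4 : 1;
};

// The parallel_for launch code would then query it as, e.g.:
//   _Fp::template __preferred_vector_size<_Ranges...>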

__f, static_cast<std::size_t>(__n)},
Contributor:
The static_cast<std::size_t> looks suspicious... What is the reason for doing it?

Contributor Author (@mmichel11, Jan 22, 2025):
The reason here is that __n may be a signed difference type obtained from taking the difference of two iterators while the constructor accepts a std::size_t, so we see compilation errors without this cast.

If it's preferred, I can add a templated type for the size to the constructor, so we can avoid the need for this cast.
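A minimal sketch of that alternative (illustrative name only): the constructor accepts any integral size type and performs the conversion internally, so the call site needs no cast.

#include <cstddef>

struct __example_walk
{
    std::size_t __n;

    template <typename _Size>
    explicit __example_walk(_Size __count) : __n(static_cast<std::size_t>(__count))
    {
    }
};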

__n, __buf.all_view());
return __future_obj;
}

@@ -67,7 +69,9 @@ __pattern_walk2_async(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _For

auto __future = oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n, __buf1.all_view(), __buf2.all_view());
unseq_backend::walk2_vectors_or_scalars<_ExecutionPolicy, _Function, decltype(__buf1.all_view()),
decltype(__buf2.all_view())>{__f, static_cast<std::size_t>(__n)},
__n, __buf1.all_view(), __buf2.all_view());

return __future.__make_future(__first2 + __n);
}
@@ -91,10 +95,12 @@ __pattern_walk3_async(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _For
oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::write, _ForwardIterator3>();
auto __buf3 = __keep3(__first3, __first3 + __n);

auto __future =
oneapi::dpl::__par_backend_hetero::__parallel_for(_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n,
__buf1.all_view(), __buf2.all_view(), __buf3.all_view());
auto __future = oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, std::forward<_ExecutionPolicy>(__exec),
unseq_backend::walk3_vectors_or_scalars<_ExecutionPolicy, _Function, decltype(__buf1.all_view()),
decltype(__buf2.all_view()), decltype(__buf3.all_view())>{
__f, static_cast<size_t>(__n)},
__n, __buf1.all_view(), __buf2.all_view(), __buf3.all_view());

return __future.__make_future(__first3 + __n);
}
37 changes: 26 additions & 11 deletions include/oneapi/dpl/internal/binary_search_impl.h
@@ -37,13 +37,19 @@ enum class search_algorithm
binary_search
};

template <typename Comp, typename T, search_algorithm func>
struct custom_brick
#if _ONEDPL_BACKEND_SYCL
template <typename Comp, typename T, typename _Range, search_algorithm func>
struct __custom_brick : oneapi::dpl::unseq_backend::walk_scalar_base<_Range>
{
Comp comp;
T size;
bool use_32bit_indexing;

__custom_brick(Comp comp, T size, bool use_32bit_indexing)
: comp(std::move(comp)), size(size), use_32bit_indexing(use_32bit_indexing)
{
}

template <typename _Size, typename _ItemId, typename _Acc>
void
search_impl(_ItemId idx, _Acc acc) const
@@ -68,17 +74,23 @@ struct custom_brick
get<2>(acc[idx]) = (value != end_orig) && (get<1>(acc[idx]) == get<0>(acc[value]));
}
}

template <typename _ItemId, typename _Acc>
template <typename _IsFull, typename _ItemId, typename _Acc>
void
operator()(_ItemId idx, _Acc acc) const
__scalar_path_impl(_IsFull, _ItemId idx, _Acc acc) const
Contributor:
I believe we could improve this code by replacing the run-time bool use_32bit_indexing with a compile-time indexing-type specialization.
I found only 3 places with the code

const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max();

so it is not a big deal to add an if statement outside and call __parallel_for in both branches with different index types. Inside the brick, the condition check would then be eliminated entirely.
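A standalone illustration of that suggestion (not the oneDPL code): the index width is chosen once at the dispatch site and becomes a template parameter, so the kernel body carries no per-element runtime branch.

#include <cstddef>
#include <cstdint>
#include <limits>

// Kernel with the index type fixed at compile time; no runtime branch inside.
template <typename _IndexT, typename _Acc>
void
search_kernel(_Acc& acc, std::size_t n)
{
    for (_IndexT idx = 0; idx < static_cast<_IndexT>(n); ++idx)
    {
        // ... binary-search body, all index arithmetic done in _IndexT ...
        (void)acc;
    }
}

template <typename _Acc>
void
dispatch(_Acc& acc, std::size_t n)
{
    // The branch happens once here instead of per element inside the brick,
    // at the cost of instantiating (and JIT-compiling) two kernels.
    if (n <= std::numeric_limits<std::uint32_t>::max())
        search_kernel<std::uint32_t>(acc, n);
    else
        search_kernel<std::uint64_t>(acc, n);
}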

Contributor Author:
As discussed offline, I will reevaluate performance here and provide an update. The advantage of the current approach is that we only compile a single kernel, whereas your suggestion may improve kernel performance at the cost of increased JIT overhead.

Contributor Author:
I re-checked performance here, and the results are similar to my initial experimentation. For small problem sizes (e.g. <16k elements) there is a noticeable performance benefit to adding the second kernel, but it only saves a few microseconds (e.g. ~10 us with two kernels vs. ~13 us with one kernel and runtime dispatch). I would consider this case less important, however, since I do not expect binary search to be used with so few search keys.

For larger inputs, the effect of the runtime dispatch is not measurable. I suspect this is because __custom_brick is quite heavy for a brick, as it performs multiple memory accesses, making the impact of the if ... else dispatch less noticeable. For this reason, I suggest we keep the current approach, which compiles faster.

We can discuss further if needed, but I suggest that discussion be kept separate from this PR, since we do not touch the implementation details of binary_search here apart from adjusting the brick to work with the new design.

{
if (use_32bit_indexing)
search_impl<std::uint32_t>(idx, acc);
else
search_impl<std::uint64_t>(idx, acc);
}
template <typename _IsFull, typename _ItemId, typename _Acc>
void
operator()(_IsFull __is_full, _ItemId idx, _Acc acc) const
{
__scalar_path_impl(__is_full, idx, acc);
}
};
#endif

template <class _Tag, typename Policy, typename InputIterator1, typename InputIterator2, typename OutputIterator,
typename StrictWeakOrdering>
@@ -155,7 +167,8 @@ lower_bound_impl(__internal::__hetero_tag<_BackendTag>, Policy&& policy, InputIt
const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max();
__bknd::__parallel_for(
_BackendTag{}, ::std::forward<decltype(policy)>(policy),
custom_brick<StrictWeakOrdering, decltype(size), search_algorithm::lower_bound>{comp, size, use_32bit_indexing},
__custom_brick<StrictWeakOrdering, decltype(size), decltype(zip_vw), search_algorithm::lower_bound>{
Contributor:
I would suggest using automatic type deduction via the constructor of __custom_brick.

comp, size, use_32bit_indexing},
value_size, zip_vw)
.__deferrable_wait();
return result + value_size;
@@ -187,7 +200,8 @@ upper_bound_impl(__internal::__hetero_tag<_BackendTag>, Policy&& policy, InputIt
const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max();
__bknd::__parallel_for(
_BackendTag{}, std::forward<decltype(policy)>(policy),
custom_brick<StrictWeakOrdering, decltype(size), search_algorithm::upper_bound>{comp, size, use_32bit_indexing},
__custom_brick<StrictWeakOrdering, decltype(size), decltype(zip_vw), search_algorithm::upper_bound>{
comp, size, use_32bit_indexing},
value_size, zip_vw)
.__deferrable_wait();
return result + value_size;
@@ -217,10 +231,11 @@ binary_search_impl(__internal::__hetero_tag<_BackendTag>, Policy&& policy, Input
auto result_buf = keep_result(result, result + value_size);
auto zip_vw = make_zip_view(input_buf.all_view(), value_buf.all_view(), result_buf.all_view());
const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max();
__bknd::__parallel_for(_BackendTag{}, std::forward<decltype(policy)>(policy),
custom_brick<StrictWeakOrdering, decltype(size), search_algorithm::binary_search>{
comp, size, use_32bit_indexing},
value_size, zip_vw)
__bknd::__parallel_for(
_BackendTag{}, std::forward<decltype(policy)>(policy),
__custom_brick<StrictWeakOrdering, decltype(size), decltype(zip_vw), search_algorithm::binary_search>{
comp, size, use_32bit_indexing},
value_size, zip_vw)
.__deferrable_wait();
return result + value_size;
}