-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Re-implement SYCL backend parallel_for
to improve bandwidth utilization
#1976
base: main
Are you sure you want to change the base?
Changes from all commits
f3acdca
06e06ff
ab7a75f
ec0761c
281f642
6ffb904
fad85fe
329f000
420bd6c
ef78c6a
7883c3e
5c12d66
36a602b
4d645f6
ca9a06f
3713d62
305bf2b
3b50010
8e5de99
65e0b05
257815a
3929705
31a7aae
08d24aa
1748a6b
cc829e5
8dc7706
1309f6a
288499f
d683b72
b4cfcae
62c104f
b525ab7
7d16c16
f9d63aa
9aa36e1
b6d5d98
4c1a974
bebd84b
02d0a18
1c3f455
774e6f0
cad0e1b
0c2c9a8
7aa5bf8
62a19fd
b2128fe
2b1281b
1c4ed8c
d0a66ae
ac6d945
4654b1d
59ea1ec
bbee988
33dc8b7
8a0f4b5
0f81298
971edae
e7309c9
d5c7157
0280f7c
52ce868
ab70533
a26cdba
6db2d58
2e378ea
f387a4f
8a387b2
2ccb478
af2e16f
6a4db2c
5e31e07
5fe7c58
c9bf4c5
35e5912
fff3647
deccd49
1f1b87d
b28826a
5efeb2e
8062c4e
96e6349
4d2255f
43a92f6
dee9659
ae9035f
ab6c28a
d6b870c
4bf4fe7
0c8bf8a
5285df0
0d9e1e9
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -44,7 +44,9 @@ __pattern_walk1_async(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _For | |
|
||
auto __future_obj = oneapi::dpl::__par_backend_hetero::__parallel_for( | ||
_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec), | ||
unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n, __buf.all_view()); | ||
unseq_backend::walk1_vector_or_scalar<_ExecutionPolicy, _Function, decltype(__buf.all_view())>{ | ||
__f, static_cast<std::size_t>(__n)}, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The reason here is that If it's preferred, I can add a templated type for the size to the constructor, so we can avoid the need for this cast. |
||
__n, __buf.all_view()); | ||
return __future_obj; | ||
} | ||
|
||
|
@@ -67,7 +69,9 @@ __pattern_walk2_async(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _For | |
|
||
auto __future = oneapi::dpl::__par_backend_hetero::__parallel_for( | ||
_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec), | ||
unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n, __buf1.all_view(), __buf2.all_view()); | ||
unseq_backend::walk2_vectors_or_scalars<_ExecutionPolicy, _Function, decltype(__buf1.all_view()), | ||
decltype(__buf2.all_view())>{__f, static_cast<std::size_t>(__n)}, | ||
__n, __buf1.all_view(), __buf2.all_view()); | ||
|
||
return __future.__make_future(__first2 + __n); | ||
} | ||
|
@@ -91,10 +95,12 @@ __pattern_walk3_async(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _For | |
oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::write, _ForwardIterator3>(); | ||
auto __buf3 = __keep3(__first3, __first3 + __n); | ||
|
||
auto __future = | ||
oneapi::dpl::__par_backend_hetero::__parallel_for(_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec), | ||
unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n, | ||
__buf1.all_view(), __buf2.all_view(), __buf3.all_view()); | ||
auto __future = oneapi::dpl::__par_backend_hetero::__parallel_for( | ||
_BackendTag{}, std::forward<_ExecutionPolicy>(__exec), | ||
unseq_backend::walk3_vectors_or_scalars<_ExecutionPolicy, _Function, decltype(__buf1.all_view()), | ||
decltype(__buf2.all_view()), decltype(__buf3.all_view())>{ | ||
__f, static_cast<size_t>(__n)}, | ||
__n, __buf1.all_view(), __buf2.all_view(), __buf3.all_view()); | ||
|
||
return __future.__make_future(__first3 + __n); | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -37,13 +37,19 @@ enum class search_algorithm | |
binary_search | ||
}; | ||
|
||
template <typename Comp, typename T, search_algorithm func> | ||
struct custom_brick | ||
#if _ONEDPL_BACKEND_SYCL | ||
template <typename Comp, typename T, typename _Range, search_algorithm func> | ||
struct __custom_brick : oneapi::dpl::unseq_backend::walk_scalar_base<_Range> | ||
{ | ||
Comp comp; | ||
T size; | ||
bool use_32bit_indexing; | ||
|
||
__custom_brick(Comp comp, T size, bool use_32bit_indexing) | ||
: comp(std::move(comp)), size(size), use_32bit_indexing(use_32bit_indexing) | ||
{ | ||
} | ||
|
||
template <typename _Size, typename _ItemId, typename _Acc> | ||
void | ||
search_impl(_ItemId idx, _Acc acc) const | ||
|
@@ -68,17 +74,23 @@ struct custom_brick | |
get<2>(acc[idx]) = (value != end_orig) && (get<1>(acc[idx]) == get<0>(acc[value])); | ||
} | ||
} | ||
|
||
template <typename _ItemId, typename _Acc> | ||
template <typename _IsFull, typename _ItemId, typename _Acc> | ||
void | ||
operator()(_ItemId idx, _Acc acc) const | ||
__scalar_path_impl(_IsFull, _ItemId idx, _Acc acc) const | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe that we may try to improve this code by replacing run-time const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max(); so it's not big deal to add There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As discussed offline, I will reevaluate performance here and provide an update. The advantage of the current approach is that we only compile a single kernel whereas your suggestion may improve kernel performance with the cost of increased JIT overhead. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I re-checked performance here, and the results are similar to my initial experimentation. For small problem sizes (e.g. <16k elements) there is a noticeable performance benefit for adding the second kernel. It only saves a few microseconds (e.g. ~10 us with 2 kernels ~13 us with one with runtime dispatch). I would consider this case less important, however, since I do not expect binary search to be used with so few search keys. For larger inputs, the effect of the runtime dispatch is not measurable. I suspect this is because We can discuss more if needed, but I suggest it be separate from this PR as we do not touch the implementation details of |
||
{ | ||
if (use_32bit_indexing) | ||
search_impl<std::uint32_t>(idx, acc); | ||
else | ||
search_impl<std::uint64_t>(idx, acc); | ||
} | ||
template <typename _IsFull, typename _ItemId, typename _Acc> | ||
void | ||
operator()(_IsFull __is_full, _ItemId idx, _Acc acc) const | ||
{ | ||
__scalar_path_impl(__is_full, idx, acc); | ||
} | ||
}; | ||
#endif | ||
|
||
template <class _Tag, typename Policy, typename InputIterator1, typename InputIterator2, typename OutputIterator, | ||
typename StrictWeakOrdering> | ||
|
@@ -155,7 +167,8 @@ lower_bound_impl(__internal::__hetero_tag<_BackendTag>, Policy&& policy, InputIt | |
const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max(); | ||
__bknd::__parallel_for( | ||
_BackendTag{}, ::std::forward<decltype(policy)>(policy), | ||
custom_brick<StrictWeakOrdering, decltype(size), search_algorithm::lower_bound>{comp, size, use_32bit_indexing}, | ||
__custom_brick<StrictWeakOrdering, decltype(size), decltype(zip_vw), search_algorithm::lower_bound>{ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would suggest using auto type deduction via constructor of |
||
comp, size, use_32bit_indexing}, | ||
value_size, zip_vw) | ||
.__deferrable_wait(); | ||
return result + value_size; | ||
|
@@ -187,7 +200,8 @@ upper_bound_impl(__internal::__hetero_tag<_BackendTag>, Policy&& policy, InputIt | |
const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max(); | ||
__bknd::__parallel_for( | ||
_BackendTag{}, std::forward<decltype(policy)>(policy), | ||
custom_brick<StrictWeakOrdering, decltype(size), search_algorithm::upper_bound>{comp, size, use_32bit_indexing}, | ||
__custom_brick<StrictWeakOrdering, decltype(size), decltype(zip_vw), search_algorithm::upper_bound>{ | ||
comp, size, use_32bit_indexing}, | ||
value_size, zip_vw) | ||
.__deferrable_wait(); | ||
return result + value_size; | ||
|
@@ -217,10 +231,11 @@ binary_search_impl(__internal::__hetero_tag<_BackendTag>, Policy&& policy, Input | |
auto result_buf = keep_result(result, result + value_size); | ||
auto zip_vw = make_zip_view(input_buf.all_view(), value_buf.all_view(), result_buf.all_view()); | ||
const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max(); | ||
__bknd::__parallel_for(_BackendTag{}, std::forward<decltype(policy)>(policy), | ||
custom_brick<StrictWeakOrdering, decltype(size), search_algorithm::binary_search>{ | ||
comp, size, use_32bit_indexing}, | ||
value_size, zip_vw) | ||
__bknd::__parallel_for( | ||
_BackendTag{}, std::forward<decltype(policy)>(policy), | ||
__custom_brick<StrictWeakOrdering, decltype(size), decltype(zip_vw), search_algorithm::binary_search>{ | ||
comp, size, use_32bit_indexing}, | ||
value_size, zip_vw) | ||
.__deferrable_wait(); | ||
return result + value_size; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
walk1_vector_or_scalar
is so long name...M.b. keep
walk_n
?As far as I understand it is just renaming, not the second "walker"?
(for example see https://godbolt.org/z/z3Yfhbo5W )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
walk_n
is still used in some other places and is currently more generic, so we do need a separate name. I do think something that reflects the different vector / scalar paths is best._Ranges...
is only passed through the class template to establish tuning parameters, so it cannot be deduced from a constructor and must be explicitly specified by the caller. Since there is no partial CTAD as far as I am aware, then I do not think it is possible to implement unless we pass some unused ranges through the constructor to deduce types. Is this correct and if so, do you think it is still the best approach?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing you could consider is a "make" function which provides the partial deduction for you. You can provide
_ExecutionPolicy
and_Ranges...
types explicitly as template args to a make function, and_Function
could be deduced.I personally think its a bit overkill for little benefit, but its an option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing out how this can be done. I agree that it does not save as much and just listing the template types is the most straightforward in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ragarding
Basically, walk_vector_or_scalar_base is not a base class. It is just for calculation 3 compile time constant, based on input
Ranges...
types.There is nothing to inherit - not implementation, not API.
Other words, it is some "Params" type. It can be defined on the fly where you need the mentioned 3 compile time constants
__can_vectorize
,__preferred_vector_size
and__preferred_iters_per_item
:Params<Range1>::__preferred_iters_per_item
or
Params<Range1, Range2>::__preferred_vector_size
or in general case with parameter pack:
Params<Ranges...>::__preferred_vector_size
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This decoupling of
walk_vector_or_scalar_base
from the brick into aParams
class as suggested I believe only works if every for-based brick is vectorizable which is not the case (e.g.__custom_brick
,__brick_shift_left
). If the parallel for implementation was able to determine these parameters from the ranges alone, then we would not have to pass these range types through the brick's class templates.What the inheritance gets us is that it ties a particular brick to a strategy with or without vectorization
walk_vector_or_scalar_base
andwalk_scalar_base
while also allowing special cases to be implemented without inheritance such as__brick_shift_left
. The parallel for implementation then queries these compile-time parameters from the brick which may be inherited from the base.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point @MikeDvorskiy, for the majority of the usage of these compile time constants. They could be calculated inline in the functions based on the input range types. However, there are a few exterior public uses as described in this comment from the parallel_for launch code. This usage requires this derived struct to contain the
Range
type information prior to the actual function calls, and its easiest if you can just query it like this.You could instead have traits or helpers where you could pass range info when querying this stuff at the parallel_for launch level. I don't have a strong opinion between the two without having the other full implementation to compare to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To try to clarify further, it could maybe look like this:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the initial point now. My understanding is that this is implementable through a variable template within the brick.
Something like:
There would be duplicated definitions of the variables in each brick, but we would not need to pass the ranges through the brick's template parameters. There are pros and cons with each approach and functionality / performance of each should be identical.
At this point in the review, I believe it is too late to make such a large design decision if we want to make it to the milestone. My suggestion is we defer this to an issue and address in the
2022.9.0
milestone.