Re-implement SYCL backend parallel_for to improve bandwidth utilization #1976
base: main
Conversation
128-byte memory operations are performed instead of 512-byte operations after inspecting the assembly. Processing 512 bytes per sub-group still seems to be the best value after experimentation. Signed-off-by: Matthew Michel <[email protected]>
…ute work for small inputs Signed-off-by: Matthew Michel <[email protected]>
This reverts commit e4cbceb. Small sizes are slightly slower, and no "real" benefit is observed for horizontal vectorization.
Small but measurable overheads can be observed for small inputs when runtime dispatch is present in the kernel to check for the correct path to take. Letting the compiler handle the small input case in the original kernel shows the best performance. Signed-off-by: Matthew Michel <[email protected]>
We now flatten the user-provided ranges and find the minimum-sized type to estimate the best __iters_per_work_item. This benefits performance in calls that wrap multiple buffers in a single input / output through a zip_iterator (e.g. dpct::scatter_if in the SYCLomatic compatibility headers). Signed-off-by: Matthew Michel <[email protected]>
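For illustration, here is a minimal sketch of that heuristic under stated assumptions: the helper name and the 512-byte-per-sub-group target are taken from the experiments described above, not from the actual oneDPL code.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>

// Hypothetical helper (illustrative name): choose how many elements each work
// item should process so that a sub-group touches roughly `target_bytes_per_sub_group`
// bytes, based on the smallest value type among the flattened ranges.
template <typename... _Ts>
constexpr std::size_t
pick_iters_per_work_item(std::size_t target_bytes_per_sub_group, std::size_t sub_group_size)
{
    constexpr std::size_t min_type_size = std::min({sizeof(_Ts)...});
    const std::size_t bytes_per_work_item = target_bytes_per_sub_group / sub_group_size;
    return std::max<std::size_t>(1, bytes_per_work_item / min_type_size);
}

int main()
{
    // A zip of std::uint8_t and float ranges: the std::uint8_t range dominates
    // the decision because it is the smallest element type.
    std::cout << pick_iters_per_work_item<std::uint8_t, float>(512, 32) << '\n'; // 16
    std::cout << pick_iters_per_work_item<float>(512, 32) << '\n';               // 4
}
```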
```cpp
__vector_path(_IsFull __is_full, const _ItemId __idx, _Range __rng) const
{
    // This is needed to enable vectorization
    auto __raw_ptr = __rng.begin();
```
- I think `__raw_ptr` isn't a very good name, because `begin()` is usually associated with an iterator in one's mind, while "raw" usually suggests a pointer.
- Do we really need the local variable `__raw_ptr` here? Can we pass `__rng.begin()` into the `__vector_walk` call instead of that variable?
In the contexts in which we vectorize, `begin()` does return pointers, but I agree the name is confusing.

I have addressed this in a different way due to a performance issue. With `uint8_t` types, I found the compiler was not properly vectorizing even when calling `begin()` on the set of ranges within the kernel, leading to performance regressions (about 30% slower than where we should be). Calling `begin()` from the host and passing it to the submitter to use in the kernel resolves the issue and gives us good performance.

Since `begin()` is called on all ranges and passed through the bricks from the submitter, I have switched from the `_Rng` naming to `_Acc` here, as the underlying type may not be a range. Additional template types are also needed.

Update: please see the comment #1976 (comment). All of the `begin()` calls in this context have been removed.
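For what it's worth, a minimal, self-contained sketch of the workaround described above (with USM allocation and illustrative names, not the actual submitter/brick code): the pointer is obtained on the host and captured by the kernel, so the device code only ever sees a raw pointer and the per-work-item loop is easy for the compiler to vectorize.

```cpp
#include <sycl/sycl.hpp>
#include <cstdint>

int main()
{
    sycl::queue q;
    constexpr std::size_t n = 1 << 20;
    std::uint8_t* data = sycl::malloc_shared<std::uint8_t>(n, q);

    // Stand-in for "call begin() on the host": the kernel captures a raw pointer.
    auto* acc = data;

    q.parallel_for(sycl::range<1>{n / 4}, [=](sycl::id<1> idx) {
         // Each work item processes 4 adjacent elements through the raw pointer.
         std::size_t base = idx[0] * 4;
         for (std::size_t i = 0; i < 4; ++i)
             acc[base + i] = static_cast<std::uint8_t>(acc[base + i] + 1);
     }).wait();

    sycl::free(data, q);
}
```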
So now we have 3 entities with defined constexpr variables. Do these constexpr variables really have different semantics? And if the semantics of these entities are the same, maybe it makes sense to do some re-design so that we have only one entity.
In some moments the implementation details remind me of … But what if, instead of two different functions,

```cpp
template <typename _IsFull, typename _ItemId>
void
__vector_path(_IsFull __is_full, const _ItemId __idx, _Range __rng) const
{
    // This is needed to enable vectorization
    auto __raw_ptr = __rng.begin();
    oneapi::dpl::__par_backend_hetero::__vector_walk<__base_t::__preferred_vector_size>{__n}(__is_full, __idx, __f,
                                                                                             __raw_ptr);
}

// _IsFull is ignored here. We assume that boundary checking has been already performed for this index.
template <typename _IsFull, typename _ItemId>
void
__scalar_path(_IsFull, const _ItemId __idx, _Range __rng) const
{
    __f(__rng[__idx]);
}
```

we have two functions with the same name and the same form, differing only in the first parameter type, which would be used as a dispatch tag? Please take a look at …
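A minimal sketch of that tag-dispatch idea, with made-up type names (these are not the oneDPL types): one call operator name, overloaded on a tag type that selects the vector or scalar path.

```cpp
#include <cstddef>
#include <iostream>

// Illustrative tags standing in for the vector / scalar path selection.
struct vector_tag {};
struct scalar_tag {};

template <typename F>
struct walk_brick
{
    F f;

    // Same name, same shape; the first parameter type selects the path.
    template <typename Range>
    void operator()(vector_tag, std::size_t idx, Range& rng) const
    {
        // A real implementation would hand a raw pointer to a vector walk here.
        for (std::size_t i = 0; i < 4; ++i)
            f(rng[idx * 4 + i]);
    }

    template <typename Range>
    void operator()(scalar_tag, std::size_t idx, Range& rng) const
    {
        f(rng[idx]);
    }
};

int main()
{
    int data[16] = {};
    walk_brick<void (*)(int&)> brick{+[](int& x) { ++x; }};
    brick(vector_tag{}, 0, data); // processes elements 0..3
    brick(scalar_tag{}, 8, data); // processes element 8
    std::cout << data[0] << ' ' << data[8] << '\n'; // 1 1
}
```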
One more point: …
First round of review. I've not gotten to all the details yet, but this is enough to be interesting.
```cpp
static inline std::tuple<std::size_t, std::size_t, bool>
__stride_recommender(const sycl::nd_item<1>& __item, std::size_t __count, std::size_t __iters_per_work_item,
                     std::size_t __adj_elements_per_work_item, std::size_t __work_group_size)
{
```
Is this a general utility that might be useful for other commutative operations beyond just parallel_for, or is there a reason you believe it is specific to this algorithm / kernel?
If we think it might be useful, we could lift it to a general utility level. Obviously we don't need to incorporate it elsewhere in this PR. An alternative is to add an issue to explore this and only lift it if we find utility.
@SergeyKopienko and I had a brief discussion regarding this in the first PR: #1870 (comment).
I think making it a static member was the best choice in the initial PR. However, in this new PR I am pushing to expose more general utilities for future use (e.g. strided loops, vectorization paths) to enforce good memory access patterns, and I think the stride recommender can be a good general utility. @SergeyKopienko What are your thoughts on adding this function as a utility?
Unless there is a clear and easy consensus to make it a utility, let's just leave it here and add an issue to explore its use elsewhere as a utility at a later time. I don't want to derail the PR for this purpose.
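For readers following along, here is a guess at how such a stride recommender could be consumed. The names, the exact index math, and the meaning of the returned tuple (base index, stride, "is full" flag) are assumptions for illustration, not the PR's implementation: each work item starts at a work-group-coalesced base index and hops by a fixed stride, so adjacent work items touch adjacent elements on every iteration.

```cpp
#include <cstddef>
#include <iostream>
#include <tuple>

// Hypothetical host-side analogue of a stride recommender (the real
// __stride_recommender takes a sycl::nd_item and may choose differently per device).
std::tuple<std::size_t, std::size_t, bool>
recommend_stride(std::size_t global_id, std::size_t count, std::size_t iters_per_work_item,
                 std::size_t work_group_size)
{
    std::size_t group_id = global_id / work_group_size;
    std::size_t local_id = global_id % work_group_size;
    std::size_t base = group_id * work_group_size * iters_per_work_item + local_id;
    std::size_t stride = work_group_size;
    bool is_full = base + stride * (iters_per_work_item - 1) < count;
    return {base, stride, is_full};
}

int main()
{
    constexpr std::size_t count = 40, iters = 4, wg = 8;
    for (std::size_t id = 0; id < 2 * wg; ++id)
    {
        auto [base, stride, full] = recommend_stride(id, count, iters, wg);
        std::cout << "work item " << id << ": indices";
        for (std::size_t i = 0; i < iters; ++i)
            if (base + i * stride < count)
                std::cout << ' ' << base + i * stride;
        std::cout << (full ? "" : " (partial)") << '\n';
    }
}
```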
```cpp
struct custom_brick
#if _ONEDPL_BACKEND_SYCL
template <typename Comp, typename T, typename _Range, search_algorithm func>
struct custom_brick : oneapi::dpl::unseq_backend::walk_scalar_base<_Range>
```
Let's fix the naming of this while we're touching all its instances: `__custom_brick`.
It seems that the historical convention within the internal/ directory is to not use any leading underscores, although it has changed a bit over time.
I do not have a strong preference whether we make this change or leave it as is, but maybe it fits into a broader discussion regarding the remaining implementations in this directory.
I'm not sure there is a compelling reason, other than resistance to making purely cosmetic changes in the changelog, to keep a different convention here. That is why I suggest adjusting it while we are already touching all (or most) instances of it. Perhaps someone with longer historical knowledge of this code could chime in if there is a reason to keep the different convention.
Not super important to me, so optional nitpick.
I will wait a bit longer to see if anyone has objections. If not, then I will add this suggestion.
```cpp
auto __raw_ptr3 = __rng3.begin();
```

```cpp
oneapi::dpl::__internal::__lazy_ctor_storage<_ValueType1> __rng1_vector[__base_t::__preferred_vector_size];
oneapi::dpl::__internal::__lazy_ctor_storage<_ValueType2> __rng2_vector[__base_t::__preferred_vector_size];
```
I think it should be possible to combine the walk*_vectors_or_scalars structs together with some complicated fold expressions, lambdas, tuples, and std::apply.
Take a look at the first answer of https://stackoverflow.com/questions/7230621/how-can-i-iterate-over-a-packed-variadic-template-argument-list. I think you could do something similar, chaining instructions together by returning tuples and then applying them with std::apply.
Here is an example I was playing with: https://godbolt.org/z/vc8dK4ed6
In the end, I'm not sure if (1) it's actually possible and (2) it's worth the complexity to consolidate these structs, but it's worth considering...
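For reference, a small self-contained sketch of the std::apply plus fold-expression technique from that Stack Overflow answer; the names are illustrative only, not the walk*_vectors_or_scalars structs themselves.

```cpp
#include <cstddef>
#include <iostream>
#include <tuple>
#include <utility>

// Apply the same per-index step to every range in a parameter pack via a fold.
template <typename... Ranges>
void
load_all(std::size_t idx, Ranges&... rngs)
{
    // Left fold over operator<<: one "load" (here, a print) per range.
    (std::cout << ... << rngs[idx]) << '\n';
}

// Keep the ranges in a tuple and use std::apply to turn it back into a pack.
template <typename... Ranges>
void
walk(std::size_t idx, std::tuple<Ranges...>& pack)
{
    std::apply([idx](auto&... rngs) { load_all(idx, rngs...); }, pack);
}

int main()
{
    int a[4] = {1, 2, 3, 4};
    double b[4] = {0.5, 1.5, 2.5, 3.5};
    auto pack = std::forward_as_tuple(a, b);
    walk(2, pack); // prints "32.5" (element 2 of each range)
}
```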
* Deleter is now a callable lambda returned from a static function in the class
* Deleter accepts an l-value reference to __lazy_ctor_storage
Signed-off-by: Matthew Michel <[email protected]>
With uint8_t types, the icpx compiler fails to vectorize even when calling begin() on our range within a kernel to pull out a raw pointer. To work around this issue, begin() needs to be called on the host and passed to the kernel. Signed-off-by: Matthew Michel <[email protected]>
These three cases are all unique when you consider what they define. The three unique cases I mention are the following: …
This reverts commit 1336735.
…_path_impl Signed-off-by: Matthew Michel <[email protected]>
High Level Description
This PR improves hardware bandwidth utilization of oneDPL's SYCL backend parallel_for pattern through two ideas:
Implementation Details
(e.g. binary_search)
To implement this approach, the parallel_for kernel rewrite from #1870 was adopted with additional changes to handle vectorization paths. Additionally, generic vectorization and strided loop utilities have been defined with the intention for these to be applicable in other portions of the codebase as well. Tests have been expanded to ensure coverage of vectorization paths.
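As a purely illustrative sketch (not the PR's kernel, and using a simple grid stride rather than the exact work-group/sub-group stride chosen in parallel_backend_sycl_for.h), the difference between a per-work-item blocked loop and a strided loop looks roughly like this:

```cpp
#include <sycl/sycl.hpp>

int main()
{
    sycl::queue q;
    constexpr std::size_t n = 1 << 20, iters = 4;
    float* a = sycl::malloc_shared<float>(n, q);

    // Blocked: work item i touches [i*4, i*4+4) -- contiguous per item, but
    // neighbouring work items are 16 bytes apart on every iteration.
    q.parallel_for(sycl::range<1>{n / iters}, [=](sycl::id<1> i) {
         for (std::size_t k = 0; k < iters; ++k)
             a[i[0] * iters + k] += 1.0f;
     }).wait();

    // Strided: work item i touches i, i + stride, i + 2*stride, ... -- on each
    // iteration neighbouring work items load adjacent elements, which is the
    // kind of coalesced access that improves bandwidth utilization on GPUs.
    q.parallel_for(sycl::range<1>{n / iters}, [=](sycl::id<1> i) {
         const std::size_t stride = n / iters;
         for (std::size_t k = 0; k < iters; ++k)
             a[i[0] + k * stride] += 1.0f;
     }).wait();

    sycl::free(a, q);
}
```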
This PR supersedes #1870. Initially, the plan was to merge this PR into #1870, but after comparing the diffs, I believe the most straightforward approach is to target this directly to main.