WIP: numa_partitioner for parallel_for. #1461
base: master
Conversation
JhaShweta1 commented Jul 30, 2024 (edited)
- Added numa_partitioner for parallel_for. The files "test1.cpp" and "test.pp" will be deleted once I add tests.
- Added scan.
- Labelling it 'WIP' because this still needs to be added for sort and for_each too.
@JhaShweta1, could you look into adding oneTBB support for overriding the Distributed Ranges (DR) approach, which introduces the concept of Distributed Data Structures (DDS)? With DDS, one large flat array is broken into segments, each allocated on a separate NUMA domain. The motivation for introducing oneTBB overrides is to give developers who use DR an option to achieve the best performance on CPUs by using oneTBB.
Not a full review, but comments on first-touch and which algorithms to focus on.
include/oneapi/tbb/blocked_range.h
Outdated
@@ -108,6 +108,22 @@ class blocked_range {
        // only comparison 'less than' is required from values of blocked_range objects
        __TBB_ASSERT( !(my_begin < r.my_end) && !(r.my_end < my_begin), "blocked_range has been split incorrectly" );
    }

    // fill elements with their index values
    void first_touch(std::vector<Value>& container) const {
I think having a first_touch function as part of the range is not the right abstraction. This looks like an algorithm. Also, the first-touch code would be application-specific and likely achieved by combining a range, partitioner and algorithm (very likely parallel_for). So I would not expect an explicit first_touch algorithm either.
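To make this concrete, here is a minimal sketch of a first-touch pass written as an ordinary parallel_for over a blocked_range, rather than as a member of the range. The free function name first_touch, the std::vector<double> element type, and the choice of static_partitioner are illustrative assumptions, not part of this PR.

#include <vector>
#include <oneapi/tbb/blocked_range.h>
#include <oneapi/tbb/parallel_for.h>
#include <oneapi/tbb/partitioner.h>

// Sketch only: application-level first touch expressed with the existing
// range/partitioner/algorithm combination instead of a range member function.
// The first write to each element places its page on the NUMA node of the
// thread that executes that chunk.
void first_touch(std::vector<double>& data) {
    oneapi::tbb::parallel_for(
        oneapi::tbb::blocked_range<std::size_t>(0, data.size()),
        [&](const oneapi::tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                data[i] = 0.0;  // first write to this page
        },
        oneapi::tbb::static_partitioner{});
}

Because the same range/partitioner combination would later drive the compute loops, pages end up on the nodes that will actually use them, which is the point of first touch.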
include/oneapi/tbb/blocked_range2d.h
Outdated
@@ -89,6 +89,19 @@ class blocked_range2d {
    //! The columns of the iteration space
    const col_range_type& cols() const { return my_cols; }

    // First touch method
    template <typename Container>
    void first_touch(Container& container) const {
Same as the previous comment.
include/oneapi/tbb/blocked_range3d.h
Outdated
@@ -100,6 +100,19 @@ class blocked_range3d {
    //! The columns of the iteration space
    const col_range_type& cols() const { return my_cols; }

    // First touch method
    template <typename Container>
    void first_touch(Container& container) const {
Same as the previous comment.
include/oneapi/tbb/parallel_reduce.h
Outdated
@@ -402,7 +402,42 @@ class lambda_reduce_body {
    }
};

template<typename BasePartitioner>
template<typename Range, typename Body>
void numa_partitioner<BasePartitioner>::execute_reduce(const Range& range, Body& body) const{
For now let's focus on parallel_for. We may consider parallel_reduce later.
include/oneapi/tbb/parallel_scan.h
Outdated
template<typename BasePartitioner>
template<typename Range, typename Body>
void numa_partitioner<BasePartitioner>::execute_scan(const Range& range, Body& body) const{
Let's skip parallel_scan completely. There is already only a limited set of partitioners that work with parallel_scan, so let's not worry about making this work.
include/oneapi/tbb/parallel_for.h
Outdated
@@ -179,6 +179,33 @@ task* start_for<Range, Body, Partitioner>::cancel(execution_data& ed) {
    return nullptr;
}

template<typename BasePartitioner>
template<typename Range, typename Body>
void numa_partitioner<BasePartitioner>::execute_for(const Range& range, const Body& body) const{
Is it necessary to define a member function of numa_partitioner inside the parallel_for header? That seems very unusual.
include/oneapi/tbb/parallel_for.h
Outdated
std::vector<Range> subranges;
split_range(range, subranges, num_numa_nodes);
std::vector<oneapi::tbb::task_group> task_groups(num_numa_nodes);
initialize_arena();
Can the task_arenas be initialized once instead of during each parallel_for execution? I would expect that a partitioner like this could be created once and then passed to a number of parallel_for calls, amortizing the initialization cost.
void initialize_arena() const {
    for (std::size_t node = 0; node < num_numa_nodes; ++node) {
        this->arenas.emplace_back(tbb::task_arena::constraints().set_numa_id(node));
If the same instance is used across multiple parallel_for calls, won't the arenas vector keep growing? I think initialize_arena would be invoked repeatedly.
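For illustration, a sketch of the one-time setup the two comments above suggest, with one arena per NUMA node built in a constructor from oneapi::tbb::info::numa_nodes(). The class name numa_arena_set is hypothetical; it only stands in for the arena-owning part of the proposed numa_partitioner.

#include <cstddef>
#include <vector>
#include <oneapi/tbb/info.h>
#include <oneapi/tbb/task_arena.h>

// Sketch only: create the per-NUMA-node arenas once, at construction time,
// so reusing the same object across many algorithm invocations neither
// re-initializes the arenas nor grows the vector on every call.
class numa_arena_set {
public:
    numa_arena_set() {
        std::vector<oneapi::tbb::numa_node_id> ids = oneapi::tbb::info::numa_nodes();
        arenas.reserve(ids.size());
        for (oneapi::tbb::numa_node_id id : ids)
            arenas.emplace_back(oneapi::tbb::task_arena::constraints{}.set_numa_id(id));
    }
    std::size_t size() const { return arenas.size(); }
    oneapi::tbb::task_arena& operator[](std::size_t i) { return arenas[i]; }
private:
    std::vector<oneapi::tbb::task_arena> arenas;
};

Using tbb::info::numa_nodes() also avoids assuming the node ids are contiguous from 0, as the loop in initialize_arena above does.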
tbb::numa_partitioner<tbb::affinity_partitioner> n_partitioner(ap);

// Test parallel_for with numa_partitioner and a lambda body
parallel_for(range, body, n_partitioner);
You need a test with more than one parallel_for invocation. I think that would uncover some of the design issues.
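Something along these lines, as a sketch: numa_partitioner and its constructor follow the API proposed in this PR (and would need whatever header declares it); the rest is standard oneTBB.

#include <cstddef>
#include <vector>
#include <oneapi/tbb/blocked_range.h>
#include <oneapi/tbb/parallel_for.h>
#include <oneapi/tbb/partitioner.h>
// plus the header that declares the proposed numa_partitioner

// Sketch of a test that reuses one numa_partitioner instance across several
// parallel_for invocations; repeated use is what would expose issues such as
// the arenas being re-initialized (and the arenas vector growing) each call.
int main() {
    const std::size_t n = 1 << 20;
    std::vector<int> data(n, 0);

    tbb::affinity_partitioner ap;
    tbb::numa_partitioner<tbb::affinity_partitioner> n_partitioner(ap);  // proposed API

    for (int pass = 0; pass < 4; ++pass) {
        tbb::parallel_for(
            tbb::blocked_range<std::size_t>(0, n),
            [&](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i)
                    ++data[i];
            },
            n_partitioner);
    }

    // Every element must have been incremented once per pass.
    for (std::size_t i = 0; i < n; ++i)
        if (data[i] != 4) return 1;
    return 0;
}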
I second the above comment: the arenas must be initialized only once, before any calls to parallel_for.