WIP: numa_partitioner for parallel_for. #1461
base: master
Conversation
JhaShweta1 commented Jul 30, 2024 (edited)
- Added numa_partitioner for parallel_for. The files "test1.cpp" and "test.pp" will be deleted once I add tests.
- Added scan.
- Labelling it 'WIP' because this still needs to be added for sort and for_each too.
@JhaShweta1, could you look into adding oneTBB support for overriding the Distributed Ranges (DR) approach, which introduces the concept of Distributed Data Structures (DDS)? With DDS, one large flat array is broken into segments, each allocated on a separate NUMA domain. The motivation for introducing oneTBB overrides is to give developers who use DR an option to achieve the best performance on CPUs by using oneTBB.
Not a full review, but comments on first-touch and which algorithms to focus on.
include/oneapi/tbb/blocked_range.h
Outdated
@@ -108,6 +108,22 @@ class blocked_range {
        // only comparison 'less than' is required from values of blocked_range objects
        __TBB_ASSERT( !(my_begin < r.my_end) && !(r.my_end < my_begin), "blocked_range has been split incorrectly" );
    }

    // fill elements with their index values
    void first_touch(std::vector<Value>& container) const {
I think having a first_touch function as part of the range is not the right abstraction. This looks like an algorithm. Also, the first-touch code would be application-specific and likely achieved by combining a range, partitioner and algorithm (very likely parallel_for). So I would not expect an explicit first_touch algorithm either.
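To make this concrete, here is a minimal sketch of a first-touch pass written as an ordinary parallel_for over a blocked_range, rather than as a member of the range. The free function name first_touch, the std::vector<double> element type, and the choice of static_partitioner are illustrative assumptions, not part of this PR.

#include <vector>
#include <oneapi/tbb/blocked_range.h>
#include <oneapi/tbb/parallel_for.h>
#include <oneapi/tbb/partitioner.h>

// Sketch only: application-level first touch expressed with the existing
// range/partitioner/algorithm combination instead of a range member function.
// The first write to each element places its page on the NUMA node of the
// thread that executes that chunk.
void first_touch(std::vector<double>& data) {
    oneapi::tbb::parallel_for(
        oneapi::tbb::blocked_range<std::size_t>(0, data.size()),
        [&](const oneapi::tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                data[i] = 0.0;  // first write to this page
        },
        oneapi::tbb::static_partitioner{});
}

Because the same range/partitioner combination would later drive the compute loops, pages end up on the nodes that will actually use them, which is the point of first touch.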
include/oneapi/tbb/blocked_range2d.h
Outdated
@@ -89,6 +89,19 @@ class blocked_range2d {
    //! The columns of the iteration space
    const col_range_type& cols() const { return my_cols; }

    // First touch method
    template <typename Container>
    void first_touch(Container& container) const {
Same as the previous comment.
include/oneapi/tbb/blocked_range3d.h
Outdated
@@ -100,6 +100,19 @@ class blocked_range3d {
    //! The columns of the iteration space
    const col_range_type& cols() const { return my_cols; }

    // First touch method
    template <typename Container>
    void first_touch(Container& container) const {
Same as the previous comment.
include/oneapi/tbb/parallel_reduce.h
Outdated
@@ -402,7 +402,42 @@ class lambda_reduce_body {
    }
};

template<typename BasePartitioner>
template<typename Range, typename Body>
void numa_partitioner<BasePartitioner>::execute_reduce(const Range& range, Body& body) const{
For now let's focus on parallel_for. We may consider parallel_reduce later.
include/oneapi/tbb/parallel_scan.h
Outdated
template<typename BasePartitioner>
template<typename Range, typename Body>
void numa_partitioner<BasePartitioner>::execute_scan(const Range& range, Body& body) const{
Let's skip parallel_scan completely. There is already only a limited set of partitioners that work with parallel_scan, so let's not worry about making this work.
include/oneapi/tbb/parallel_for.h
Outdated
@@ -179,6 +179,33 @@ task* start_for<Range, Body, Partitioner>::cancel(execution_data& ed) {
    return nullptr;
}

template<typename BasePartitioner>
template<typename Range, typename Body>
void numa_partitioner<BasePartitioner>::execute_for(const Range& range, const Body& body) const{
Is it necessary to define a member function of numa_partitioner inside the parallel_for header? That seems very unusual.
include/oneapi/tbb/parallel_for.h
Outdated
std::vector<Range> subranges;
split_range(range, subranges, num_numa_nodes);
std::vector<oneapi::tbb::task_group> task_groups(num_numa_nodes);
initialize_arena();
Can the task_arenas be initialized once instead of during each parallel_for execution? I would expect that a partitioner like this could be created once and then passed to a number of parallel_for calls, amortizing the initialization cost.
void initialize_arena() const {
    for (std::size_t node = 0; node < num_numa_nodes; ++node) {
        this->arenas.emplace_back(tbb::task_arena::constraints().set_numa_id(node));
If the same instance is used across multiple parallel_for calls, won't the arenas vector keep growing? I think initialize_arena would be invoked repeatedly.
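For illustration, a sketch of the one-time setup the two comments above suggest, with one arena per NUMA node built in a constructor from oneapi::tbb::info::numa_nodes(). The class name numa_arena_set is hypothetical; it only stands in for the arena-owning part of the proposed numa_partitioner.

#include <cstddef>
#include <vector>
#include <oneapi/tbb/info.h>
#include <oneapi/tbb/task_arena.h>

// Sketch only: create the per-NUMA-node arenas once, at construction time,
// so reusing the same object across many algorithm invocations neither
// re-initializes the arenas nor grows the vector on every call.
class numa_arena_set {
public:
    numa_arena_set() {
        std::vector<oneapi::tbb::numa_node_id> ids = oneapi::tbb::info::numa_nodes();
        arenas.reserve(ids.size());
        for (oneapi::tbb::numa_node_id id : ids)
            arenas.emplace_back(oneapi::tbb::task_arena::constraints{}.set_numa_id(id));
    }
    std::size_t size() const { return arenas.size(); }
    oneapi::tbb::task_arena& operator[](std::size_t i) { return arenas[i]; }
private:
    std::vector<oneapi::tbb::task_arena> arenas;
};

Using tbb::info::numa_nodes() also avoids assuming the node ids are contiguous from 0, as the loop in initialize_arena above does.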
tbb::numa_partitioner<tbb::affinity_partitioner> n_partitioner(ap);

// Test parallel_for with numa_partitioner and a lambda body
parallel_for(range, body, n_partitioner);
You need a test with more than one parallel_for invocation. I think that would uncover some of the design issues.
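Something along these lines, as a sketch: numa_partitioner and its constructor follow the API proposed in this PR (and would need whatever header declares it); the rest is standard oneTBB.

#include <cstddef>
#include <vector>
#include <oneapi/tbb/blocked_range.h>
#include <oneapi/tbb/parallel_for.h>
#include <oneapi/tbb/partitioner.h>
// plus the header that declares the proposed numa_partitioner

// Sketch of a test that reuses one numa_partitioner instance across several
// parallel_for invocations; repeated use is what would expose issues such as
// the arenas being re-initialized (and the arenas vector growing) each call.
int main() {
    const std::size_t n = 1 << 20;
    std::vector<int> data(n, 0);

    tbb::affinity_partitioner ap;
    tbb::numa_partitioner<tbb::affinity_partitioner> n_partitioner(ap);  // proposed API

    for (int pass = 0; pass < 4; ++pass) {
        tbb::parallel_for(
            tbb::blocked_range<std::size_t>(0, n),
            [&](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i)
                    ++data[i];
            },
            n_partitioner);
    }

    // Every element must have been incremented once per pass.
    for (std::size_t i = 0; i < n; ++i)
        if (data[i] != 4) return 1;
    return 0;
}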
I second the above comment: the arenas must be initialized only once, before any calls to parallel_for.