[RFC] Added numa_support rfc #1535
# Simplified NUMA support in oneTBB

## Introduction

In Non-Uniform Memory Access (NUMA) systems, the cost of a memory access depends on the
*nearness* of the processor to the memory resource on which the accessed data resides.
While oneTBB has core support that enables developers to tune for NUMA systems, we believe
this support can be simplified and improved to provide a better user experience.

This early proposal recommends addressing four areas for improvement:

1. improved reliability of HWLOC-dependent topology and pinning support,
2. addition of NUMA-aware allocation,
3. simplified approaches to associate task distribution with data placement, and
4. where possible, improved out-of-the-box performance for high-level oneTBB features.

We expect that this draft proposal may be broken into smaller proposals based on feedback
and prioritization of the suggested features.

The features for NUMA tuning already available in the oneTBB 1.3 specification include:

- Functions in the `tbb::info` namespace **[info_namespace]**
  - `std::vector<numa_node_id> numa_nodes()`
  - `int default_concurrency(numa_node_id id = oneapi::tbb::task_arena::automatic)`
- `tbb::task_arena::constraints` in **[scheduler.task_arena]**

The example below demonstrates the use of these APIs. It creates one `task_arena` per NUMA node
available on the system, with the threads of each arena pinned to that node, submits work across
those `task_arena` objects and into associated `task_group` objects, and then waits for the work,
again using both the `task_arena` and `task_group` objects.

Comment on lines +28 to +31

I think the code can be made simpler with

This pattern of

So, the documentation shows a suboptimal pattern then. In particular, it does not explicitly set the number of reserved slots to 0, and essentially can lead to undersubscription. Why repeat the same mistake one more time? :)

    #include "oneapi/tbb/task_group.h"
    #include "oneapi/tbb/task_arena.h"

    #include <cstddef>
    #include <vector>

    // Placeholder iteration count; NUM_STEPS is not defined in the original example.
    constexpr int NUM_STEPS = 10;

    int main() {
        std::vector<oneapi::tbb::numa_node_id> numa_nodes = oneapi::tbb::info::numa_nodes();
        std::vector<oneapi::tbb::task_arena> arenas(numa_nodes.size());
        std::vector<oneapi::tbb::task_group> task_groups(numa_nodes.size());

        // Initialize the arenas and place memory
        for (std::size_t i = 0; i < numa_nodes.size(); i++) {
            arenas[i].initialize(oneapi::tbb::task_arena::constraints(numa_nodes[i]));
            arenas[i].execute([i] {
                // allocate/place memory on NUMA node i
            });
        }

        for (int j = 0; j < NUM_STEPS; ++j) {
            // Distribute work across the arenas / NUMA nodes
            for (std::size_t i = 0; i < numa_nodes.size(); i++) {
                arenas[i].execute([&task_groups, i] {
                    task_groups[i].run([] {
                        /* executed by the thread pinned to specified NUMA node */
                    });
                });
            }

            // Wait for the work in each arena / NUMA node to complete
            for (std::size_t i = 0; i < numa_nodes.size(); i++) {
                arenas[i].execute([&task_groups, i] {
                    task_groups[i].wait();
                });
            }
        }

        return 0;
    }

### The need for application-specific knowledge

In general, when tuning a parallel application for NUMA systems, the goal is to expose sufficient
parallelism while minimizing (or at least controlling) data access and communication costs. The
tradeoffs involved in this tuning often rely on application-specific knowledge.

In particular, NUMA tuning typically involves:

1. Understanding the overall application problem and its use of algorithms and data containers
2. Placement of data container objects onto memory resources
3. Distribution of tasks to hardware resources that optimize for data placement

As shown in the previous example, the oneTBB 1.3 specification only provides low-level
support for NUMA optimization. The `tbb::info` namespace provides topology discovery, and the
combination of `task_arena`, `task_arena::constraints` and `task_group` provides a mechanism for
placing tasks onto specific processors. There is no high-level support for memory allocation
or placement, or for guiding the task distribution of algorithms.

### Issues that should be resolved in the oneTBB library

**The behavior of existing features is not always predictable.** A note in
section **[info_namespace]** of the oneTBB specification says of
the function `std::vector<numa_node_id> numa_nodes()`: "If error occurs during system topology
parsing, returns vector containing single element that equals to `task_arena::automatic`."

In practice, the error often occurs because HWLOC is not detected on the system. While the
oneTBB documentation states in several places that HWLOC is required for NUMA support and
even provides guidance on
[how to check for HWLOC](https://www.intel.com/content/www/us/en/docs/onetbb/get-started-guide/2021-12/next-steps.html),
the failure to resolve HWLOC at runtime silently returns a default of `task_arena::automatic`. This
default does not pin threads to NUMA nodes. It is too easy to write code similar to the preceding
example and be unaware that a HWLOC installation error (or lack of HWLOC) has undone all your effort.

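
One way to avoid being caught out by this silent fallback is to test the result of `numa_nodes()` before building per-node arenas. The sketch below is a minimal illustration, not part of oneTBB: `looks_like_topology_fallback` and `automatic_id` are hypothetical names, and the helper takes a plain `std::vector<int>` so that it compiles without the library. It assumes that `numa_node_id` is an integer type and that `task_arena::automatic` equals `-1`, as in current oneTBB headers. Real code would pass `oneapi::tbb::info::numa_nodes()` and `oneapi::tbb::task_arena::automatic` instead.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for oneapi::tbb::task_arena::automatic
// (assumption: the library defines it as -1).
constexpr int automatic_id = -1;

// Returns true when `nodes` matches the documented error fallback:
// exactly one element, equal to the "automatic" id.
bool looks_like_topology_fallback(const std::vector<int>& nodes,
                                  int automatic = automatic_id) {
    // The documented fallback always has one element, so an empty
    // vector is treated as a caller error in this sketch.
    assert(!nodes.empty());
    return nodes.size() == 1 && nodes.front() == automatic;
}
```

A caller could log a warning or stop when the helper returns `true`. Whether a healthy single-node system can produce the same result is not stated in the specification text quoted above, so this check is best treated as a heuristic.
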
**Getting good performance using these tools requires notable manual coding effort by users.** As we
can see in the preceding example, if we want to spread work across the NUMA nodes in
a system we need to query the topology using functions in the `tbb::info` namespace, create
one `task_arena` per NUMA node, along with one `task_group` per NUMA node, and then add an
extra loop that iterates over these `task_arena` and `task_group` objects to execute the
work on the desired NUMA nodes. We also need to handle all container allocations using OS-specific
APIs (or behaviors, such as first-touch) to allocate or place them on the appropriate NUMA nodes.

**The out-of-the-box performance of the generic TBB APIs on NUMA systems is not good enough.**
Should the oneTBB library do anything special by default if the system is a NUMA system? Or should
regular random stealing distribute the work across all of the cores, regardless of which NUMA node
first touched the data?

Is it reasonable for a developer to expect that a series of loops, such as the ones that follow, will
try to create a NUMA-friendly distribution of tasks so that accesses to the same elements of `b` and `c`
in the two loops are from the same NUMA nodes? Or is this too much to expect without providing hints?

    tbb::parallel_for(0, N,
        [](int i) {
            b[i] = f(i);
            c[i] = g(i);
        });

    tbb::parallel_for(0, N,
        [](int i) {
            a[i] = b[i] + c[i];
        });

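
To make the question concrete, the sketch below shows the kind of bookkeeping a developer does by hand today: it splits the index range `[0, N)` into one contiguous chunk per NUMA node, so that both loops can send the same indices to the same node. `split_range` and `owner_of` are hypothetical helper names, not oneTBB APIs, and mapping chunk `p` to NUMA node `p` is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using index_range = std::pair<std::size_t, std::size_t>;  // half-open [first, second)

// Split [0, n) into `parts` contiguous ranges whose sizes differ by at most one.
std::vector<index_range> split_range(std::size_t n, std::size_t parts) {
    assert(parts > 0);
    std::vector<index_range> chunks;
    chunks.reserve(parts);
    const std::size_t base = n / parts;   // minimum chunk size
    const std::size_t extra = n % parts;  // the first `extra` chunks get one more element
    std::size_t begin = 0;
    for (std::size_t p = 0; p < parts; ++p) {
        const std::size_t end = begin + base + (p < extra ? 1 : 0);
        chunks.emplace_back(begin, end);
        begin = end;
    }
    return chunks;
}

// Index of the chunk that owns element i, or chunks.size() if i is out of range.
std::size_t owner_of(std::size_t i, const std::vector<index_range>& chunks) {
    for (std::size_t p = 0; p < chunks.size(); ++p) {
        if (i >= chunks[p].first && i < chunks[p].second) return p;
    }
    return chunks.size();
}
```

With `N = 10` and three nodes the chunks are `[0, 4)`, `[4, 7)` and `[7, 10)`. Each chunk could then be run inside the `task_arena` pinned to its node, as in the first example, in both loops.
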
## Proposal

### Increased availability of NUMA support

The oneTBB 1.3 specification states for `tbb::info::numa_nodes`, "If error occurs during system
topology parsing, returns vector containing single element that equals to task_arena::automatic."

Since the oneTBB library dynamically loads the HWLOC library, a misconfiguration can cause HWLOC
to fail to be found. In that case, a call like:

    std::vector<oneapi::tbb::numa_node_id> numa_nodes = oneapi::tbb::info::numa_nodes();

will return a vector with a single element of `task_arena::automatic`. This behavior, as we have noticed
through user questions, can lead to unexpected performance from NUMA optimizations. When running
on a NUMA system, a developer who has not fully read the documentation may expect that `numa_nodes()`
will give a proper accounting of the NUMA nodes. When the code, without raising any alarm, returns only
a single, valid element due to the environmental configuration (such as lack of HWLOC), it is too easy
for developers to not notice that the code is acting in a valid, but unexpected way.

We propose that the oneTBB library implementation include, wherever possible, a statically-linked fallback
to decrease the likelihood of such failures. The oneTBB specification will remain unchanged.

### NUMA-aware allocation

We will define allocators or other features that simplify the process of allocating or placing data onto
specific NUMA nodes.

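
The proposal does not yet define an interface, so the following is only a sketch of one possible shape: a standard-conforming allocator that carries a target NUMA node. `node_tagged_allocator` and `node_vector` are hypothetical names, not oneTBB APIs, and the allocation itself is a placeholder that uses `::operator new`. A real implementation would need an OS-specific facility to bind the memory to the node.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical allocator that remembers which NUMA node its memory is meant for.
template <typename T>
class node_tagged_allocator {
public:
    using value_type = T;

    explicit node_tagged_allocator(int node) noexcept : node_(node) {
        assert(node >= 0);  // this sketch only accepts explicit node ids
    }

    // Rebinding constructor required by the allocator requirements.
    template <typename U>
    node_tagged_allocator(const node_tagged_allocator<U>& other) noexcept
        : node_(other.node()) {}

    T* allocate(std::size_t n) {
        // Placeholder: plain heap allocation, NOT bound to node_.
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }

    void deallocate(T* p, std::size_t) noexcept { ::operator delete(p); }

    int node() const noexcept { return node_; }

private:
    int node_;
};

// Allocators compare equal when they target the same node.
template <typename T, typename U>
bool operator==(const node_tagged_allocator<T>& a, const node_tagged_allocator<U>& b) noexcept {
    return a.node() == b.node();
}

template <typename T, typename U>
bool operator!=(const node_tagged_allocator<T>& a, const node_tagged_allocator<U>& b) noexcept {
    return !(a == b);
}

// Convenience alias for a vector whose storage is tagged with a node.
template <typename T>
using node_vector = std::vector<T, node_tagged_allocator<T>>;
```

A `node_vector<double>` constructed with `node_tagged_allocator<double>(1)` would record node 1 as its intended placement, and the tag travels with the container through `get_allocator()`.
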
### Simplified approaches to associate task distribution with data placement

As discussed earlier, NUMA-aware allocation is just the first step in optimizing for NUMA architectures.
We also need to deliver mechanisms to guide task distribution so that tasks are executed on execution
resources that are near to the data they access. oneTBB already provides low-level support through
`tbb::info` and `tbb::task_arena`, but we should up-level this support into the high-level algorithms,
flow graph and containers where appropriate.

### Improved out-of-the-box performance for high-level oneTBB features

For high-level oneTBB features that are modified to provide improved NUMA support, we should try to
align default behaviors for those features with user expectations when used on NUMA systems.

## Open Questions

1. Do we need simplified support, or are users that want NUMA support in oneTBB
   willing, or perhaps even prefer, to manage the details manually?
2. Is it reasonable to expect good out-of-the-box performance on NUMA systems
   without user hints or guidance?

I would probably call it "Improved NUMA support". Correspondingly, the RFC folder could be `numa_support_improvements`, meaning that NUMA support is a core feature and improvements are the gist of the proposal.

As far as I understand, the whole `numa_support_improvements` or `simplified_numa_support` directory will be moved to the `rfcs/supported` directory once these improvements are accepted. There may be another set of NUMA improvements in the future, which could result in another `numa_support_improvements` directory being created in the same place. And then, when this new set is again accepted, it moves to the same directory. I see a potential naming clash issue... It is not related to the naming of this directory, but to the naming approach in general. Surely, we could use `numa_support_improvements2` as the name of the new directory, but I believe we can do better from the very beginning.

I propose having the directory with the name related to the feature itself, e.g., `numa_support`, without additions such as `simplified` or `improvement`. This way we will convey the idea that the documents inside directly affect the support of a particular feature. For resolving naming clashes, I propose naming the file as precisely as possible for what the proposal changes, avoiding general terms/adjectives such as `improved`, `increased`, etc. For example, for the sub-RFC that I wrote, I suggest naming the file something like `introduce_tbbbind_static_library` or `introduce_tbbbind_statically_linked_with_hwloc`; for NUMA-aware allocators, something like `introduce_numa-aware_allocator`; for task_group dependencies, something like `introduce_dependencies_for_tasks_in_task_group`; and so on. This way we would avoid name clashes while still grouping similar RFCs together in a dedicated folder such as `numa_support`. Otherwise, I am afraid that the feature is not elaborated enough to be proposed since it sounds too generic in our mind.