diff --git a/rfcs/proposed/numa_support/README.md b/rfcs/proposed/numa_support/README.md new file mode 100755 index 0000000000..c19927f4a6 --- /dev/null +++ b/rfcs/proposed/numa_support/README.md @@ -0,0 +1,156 @@ +# NUMA support + +## Introduction + +In Non-Uniform Memory Access (NUMA) systems, the cost of memory accesses depends on the +*nearness* of the processor to the memory resource on which the accessed data resides. +While oneTBB has core support that enables developers to tune for NUMA +systems, we believe this support can be simplified and improved to provide +a better user experience. + +This RFC acts as an umbrella for sub-proposals that address four areas for improvement: + +1. improved reliability of HWLOC-dependent topology and pinning support, +2. addition of NUMA-aware allocation, +3. simplified approaches to associate task distribution with data placement and +4. where possible, improved out-of-the-box performance for high-level oneTBB features. + +We expect that this draft proposal will spawn sub-proposals that will progress +independently based on feedback and prioritization of the suggested features. + +The features for NUMA tuning already available in the oneTBB 1.3 specification include: + +- Functions in the `tbb::info` namespace **[info_namespace]** + - `std::vector<numa_node_id> numa_nodes()` + - `int default_concurrency(numa_node_id id = oneapi::tbb::task_arena::automatic)` +- `tbb::task_arena::constraints` in **[scheduler.task_arena]** + +Below is an example based on existing oneTBB documentation that demonstrates the use of these APIs +to pin the threads of separate arenas to each of the NUMA nodes available on a system, submit work +across those `task_arena` objects and into associated `task_group` objects, and then wait for that work +to complete, again using both the `task_arena` and `task_group` objects. 
+ + void constrain_for_numa_nodes() { + std::vector<tbb::numa_node_id> numa_nodes = tbb::info::numa_nodes(); + std::vector<tbb::task_arena> arenas(numa_nodes.size()); + std::vector<tbb::task_group> task_groups(numa_nodes.size()); + + // initialize each arena, each constrained to a different NUMA node + for (int i = 0; i < numa_nodes.size(); i++) + arenas[i].initialize(tbb::task_arena::constraints(numa_nodes[i]), 0); + + // enqueue work to all but the first arena, using the task_group to track work + // by using defer, the task_group reference count is incremented immediately + for (int i = 1; i < numa_nodes.size(); i++) + arenas[i].enqueue( + task_groups[i].defer([] { + tbb::parallel_for(0, N, [](int j) { f(w); }); + }) + ); + + // directly execute the work to completion in the remaining arena + arenas[0].execute([] { + tbb::parallel_for(0, N, [](int j) { f(w); }); + }); + + // join the other arenas to wait on their task_groups + for (int i = 1; i < numa_nodes.size(); i++) + arenas[i].execute([&task_groups, i] { task_groups[i].wait(); }); + } + +### The need for application-specific knowledge + +In general, when tuning a parallel application for NUMA systems, the goal is to expose sufficient +parallelism while minimizing (or at least controlling) data access and communication costs. The +tradeoffs involved in this tuning often rely on application-specific knowledge. + +In particular, NUMA tuning typically involves: + +1. Understanding the overall application problem and its use of algorithms and data containers +2. Placement/allocation of data container objects onto memory resources +3. Distribution of tasks to hardware resources that optimize for data placement + +As shown in the previous example, the oneTBB 1.3 specification only provides low-level +support for NUMA optimization. The `tbb::info` namespace provides topology discovery, and the +combination of `task_arena`, `task_arena::constraints` and `task_group` provides a mechanism for +placing tasks onto specific processors. 
There is no high-level support for memory allocation +or placement, or for guiding the task distribution of algorithms. + +### Issues that should be resolved in the oneTBB library + +**The behavior of existing features is not always predictable.** In +section **[info_namespace]** of the oneTBB specification, the description of +the function `std::vector<numa_node_id> numa_nodes()` includes the note, "If error occurs during system topology +parsing, returns vector containing single element that equals to `task_arena::automatic`." + +In practice, this error can occur because HWLOC is not detected on the system. While the +oneTBB documentation states in several places that HWLOC is required for NUMA support and +even provides guidance on +[how to check for HWLOC](https://www.intel.com/content/www/us/en/docs/onetbb/get-started-guide/2021-12/next-steps.html), +a failure to resolve HWLOC at runtime silently results in a default of `task_arena::automatic`. This +default does not pin threads to NUMA nodes. It is too easy to write code similar to the preceding +example and be unaware that a HWLOC installation error (or lack of HWLOC) has undone all your effort. + +**Getting good performance using these tools requires notable manual coding effort by users.** As we +can see in the preceding example, if we want to spread work across the NUMA nodes in +a system we might need to query the topology using functions in the `tbb::info` namespace, create +one `task_arena` per NUMA node, along with one `task_group` per NUMA node, and then add an +extra loop that iterates over these `task_arena` and `task_group` objects to execute the +work on the desired NUMA nodes. We also need to handle all container allocations using OS-specific +APIs (or behaviors, such as first-touch) to allocate or place them on the appropriate NUMA nodes. 
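The silent fallback described in this section can at least be detected by the caller. The following is a minimal sketch, not part of oneTBB: the helper name `topology_is_fallback` is hypothetical, and `kAutomatic` is a stand-in that assumes `tbb::task_arena::automatic` has the value `-1`. Real code would compare against the oneTBB constant directly.

```cpp
#include <vector>

// Stand-in for tbb::task_arena::automatic (assumed here to be -1).
// In real code, use the oneTBB constant instead of this value.
constexpr int kAutomatic = -1;

// Hypothetical helper: returns true when the node list matches the documented
// error fallback, that is, a single element equal to task_arena::automatic.
bool topology_is_fallback(const std::vector<int>& numa_nodes) {
    return numa_nodes.size() == 1 && numa_nodes.front() == kAutomatic;
}
```

Under these assumptions, an application could pass the result of `tbb::info::numa_nodes()` to such a check before creating per-node arenas, and report an error or take a non-NUMA code path when it returns `true`. A list with a single valid node ID is not treated as a fallback.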
+ +**The out-of-the-box performance of the generic TBB APIs on NUMA systems is not good enough.** +Should the oneTBB library do anything special by default if the system is a NUMA system? Or should +regular random stealing distribute the work across all of the cores, regardless of which NUMA node first +touched the data? + +Is it reasonable for a developer to expect that a series of loops, such as the ones that follow, will +try to create a NUMA-friendly distribution of tasks so that accesses to the same elements of `b` and `c` +in the two loops are from the same NUMA nodes? Or is this too much to expect without providing hints? + + tbb::parallel_for(0, N, + [](int i) { + b[i] = f(i); + c[i] = g(i); + }); + + tbb::parallel_for(0, N, + [](int i) { + a[i] = b[i] + c[i]; + }); + +## Possible Sub-Proposals + +### Increased availability of NUMA support + +See [sub-RFC for increased availability of NUMA API](tbbbind-link-static-hwloc.org) + + +### Add NUMA-constrained arenas + +See [sub-RFC for creation and use of NUMA-constrained arenas](numa-arenas-creation-and-use.org) + +### NUMA-aware allocation + +Define allocators or other features that simplify the process of allocating or placing data onto +specific NUMA nodes. + +### Simplified approaches to associate task distribution with data placement + +As discussed earlier, NUMA-aware allocation is just the first step in optimizing for NUMA architectures. +We also need to deliver mechanisms to guide task distribution so that tasks are executed on execution +resources that are near the data they access. oneTBB already provides low-level support through +`tbb::info` and `tbb::task_arena`, but we should raise this support into the high-level algorithms, +flow graph and containers where appropriate. + +### Improved out-of-the-box performance for high-level oneTBB features 
+ +For high-level oneTBB features that are modified to provide improved NUMA support, we can try to +align default behaviors for those features with user expectations when used on NUMA systems. + +## Open Questions + +1. Do we need simplified support, or are users who want NUMA support in oneTBB +willing, or perhaps even prefer, to manage the details manually? +2. Is it reasonable to expect good out-of-the-box performance on NUMA systems +without user hints or guidance? diff --git a/rfcs/proposed/numa_support/tbbbind-link-static-hwloc.org b/rfcs/proposed/numa_support/tbbbind-link-static-hwloc.org new file mode 100755 index 0000000000..ebda06992e --- /dev/null +++ b/rfcs/proposed/numa_support/tbbbind-link-static-hwloc.org @@ -0,0 +1,120 @@ +# -*- fill-column: 80; -*- + +#+title: Link ~tbbbind~ with Static HWLOC for NUMA API predictability + +*Note:* This document is a sub-RFC of the [[file:README.md][umbrella RFC about improving NUMA +support]]. Specifically, it covers the "Increased availability of NUMA support" section. + +* Introduction +oneTBB has a soft dependency on several variants of ~tbbbind~, which the library +loads during the initialization stage. Each ~tbbbind~, in turn, has a hard +dependency on a specific version of the HWLOC library [1, 2]. The soft +dependency means that the library continues the execution even if the system +loader fails to resolve the hard dependency on HWLOC for ~tbbbind~. In this +case, oneTBB does not discover the hardware topology. Instead, it defaults to +viewing all CPU cores as uniform, consistent with TBB behavior when NUMA +constraints are not used. 
As a result, the following code returns +values that do not reflect the actual topology: + +#+begin_src C++ +std::vector<oneapi::tbb::numa_node_id> numa_nodes = oneapi::tbb::info::numa_nodes(); +std::vector<oneapi::tbb::core_type_id> core_types = oneapi::tbb::info::core_types(); +#+end_src + +This lack of valid HW topology, caused by the absence of a third-party library, +is the major problem with the current oneTBB behavior. The lack of diagnostics +makes the problem difficult for developers to detect: +the code continues to run but fails to use NUMA as intended. + +Dependency on a shared HWLOC library has the following benefits: +1. Code reuse, with all of its positive consequences, including + relying on the same code that has been tested and debugged and allowing the OS + to share it among different processes, which in turn improves cache + locality and memory footprint. That's the primary purpose of shared + libraries. +2. A drop-in replacement. Users are able to use their own version of HWLOC + without recompilation of oneTBB. This specific version of HWLOC could include + a hotfix to support particular and/or new hardware that a customer has, but + whose support is not yet upstreamed to the HWLOC project. It is also possible + that such support won't be upstreamed at all if that hardware is not going to + be widely available. It could also be a development version of + HWLOC that someone wants to test on their systems first. Of course, they can + do it with the static version as well, but that's more cumbersome as it + requires recompilation of every dependent component. + +The only disadvantage of depending on the HWLOC library dynamically is that +developers who use oneTBB's NUMA support API need to make sure the library is +available and can be found by oneTBB. Depending on the distribution model of a +developer's code, this is achieved either by: +1. Asking the end user to have the necessary version of a dependency pre-installed. +2. 
Bundling the necessary HWLOC version together with the other pieces of a product + release. + +However, the requirement to fulfill one of the above steps for the NUMA API to +start paying off may be considered an inconvenience and, more +importantly, it is not always obvious that one of these steps is needed, +especially given the silent behavior when the HWLOC library cannot be found in the +environment. + +The proposal is to reduce the effect of the disadvantage of relying on a dynamic +HWLOC library. The improvements involve statically linking HWLOC with one of the +~tbbbind~ libraries distributed together with oneTBB. At the same time, you +retain the flexibility to specify a different version of the HWLOC library if needed. + +Since HWLOC 1.x is an older version and modern operating systems install HWLOC +2.x by default, the probability of users being restricted to HWLOC 1.x is +relatively small. Thus, we can reuse the filename of the ~tbbbind~ library +linked to HWLOC 1.x for the library linked against a static HWLOC 2.x. + +* Proposal +1. Replace the dynamic link of the ~tbbbind~ library that is currently linked + against HWLOC 1.x with a link to a static HWLOC library version 2.x. +2. Add loading of that ~tbbbind~ variant as the last attempt to resolve the + dependency on functionality provided by the ~tbbbind~ layer. +3. Update the oneTBB documentation, including + [[https://uxlfoundation.github.io/oneTBB/search.html?q=tbb%3A%3Ainfo][these + pages]], to detail the steps for identifying which ~tbbbind~ is being used. + +** Advantages +1. The proposed behavior introduces a fallback mechanism for resolving the HWLOC + library dependency when it is not in the environment, while still preferring + user-provided versions. 
As a result, the problematic oneTBB API usage works + as expected, returning an enumerated list of actual NUMA nodes and core types + on the system the code is running on, provided that the loaded HWLOC library + works on that system and that the application properly distributes all + oneTBB binaries and sets the environment so that the necessary variant of + the ~tbbbind~ library can be found and loaded. +2. Dropping support for HWLOC 1.x avoids introducing an additional ~tbbbind~ + variant while maintaining support for widely used versions of HWLOC. + +** Disadvantages +By default, there are still no diagnostics if you fail to correctly set up an +environment with your version of HWLOC. However, specifying the ~TBB_VERSION=1~ +environment variable helps identify configuration issues quickly. + +* Alternative Handling for Missing System Topology +An alternative behavior when the HWLOC library cannot be found is to be more explicit +about the missing component: either issue a warning or +refuse to work unless one of the ~tbbbind~ variants is loaded (e.g., throw +an exception). + +These alternative approaches compare to the proposed one as follows. +** Common Advantages +- Explicitly indicates that the functionality being used does not work, instead + of failing silently. +- Avoids the need to distribute an additional variant of the ~tbbbind~ library. + +** Common Disadvantages +- Requires an additional step from the user to resolve the problem. In other + words, it does not provide a complete solution to the problem. + +*** Disadvantages of Issuing a Warning +- The warning may go unnoticed, especially if standard streams are closed. + +*** Disadvantages of Throwing an Exception +- May break existing code that does not expect an exception to be thrown. +- Requires introduction of an additional exception hierarchy. + +* References +1. [[https://www.open-mpi.org/projects/hwloc/][HWLOC project main page]] +2. 
[[https://github.com/open-mpi/hwloc][HWLOC project repository on GitHub]]