[RFC] Adding API for parallel block to task_arena to warm-up/retain/release worker threads #1522

Draft · wants to merge 2 commits into base: master
108 changes: 108 additions & 0 deletions rfcs/proposed/parallel_block_for_task_arena/README.md
# Adding API for parallel block to task_arena to warm-up/retain/release worker threads

## Introduction

In oneTBB, there has never been an API that allows users to block worker threads within the arena.
This design choice was made to preserve the composability of the application.<br>
Since oneTBB is a dynamic runtime based on task stealing, threads will migrate from one arena to
another while they have tasks to execute.<br>
Before PR#1352, workers moved to the thread pool to sleep once there were no arenas with active
demand. However, PR#1352 introduced a busy-wait block time that blocks a thread for an
`implementation-defined` duration if there is no active demand in arenas. This change significantly
improved performance in cases where the application is run on high thread count systems.<br>
The main idea is that usually, after one parallel computation ends,
another will start after some time. The default block time is a heuristic to utilize this,
covering most cases within its duration.
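The mechanism can be pictured with a minimal standalone sketch (this is not oneTBB source, and the 100 microsecond figure is an arbitrary stand-in for the implementation-defined duration): a worker that runs out of tasks spins for the block time, checking for new demand, before going to sleep.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical sketch of a busy-wait block time: new demand arriving within
// the block time is picked up with low latency; otherwise the worker sleeps.
std::atomic<bool> has_demand{false};

void worker_wait_loop() {
    using clock = std::chrono::steady_clock;
    const auto block_time = std::chrono::microseconds(100); // implementation-defined
    const auto deadline = clock::now() + block_time;
    while (clock::now() < deadline) {
        if (has_demand.load(std::memory_order_relaxed))
            return;                // new work appeared: resume stealing
        std::this_thread::yield(); // keep spinning within the block time
    }
    // block time expired with no demand: go to sleep in the thread pool
}
```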

The default behavior of oneTBB with these changes does not affect performance when oneTBB is used
as the single parallel runtime.<br>
However, some cases where several runtimes are used together might be affected. For example, if an
application builds a pipeline where oneTBB is used for one stage and OpenMP is used for a
subsequent stage, there is a chance that oneTBB workers will interfere with OpenMP threads.
This interference might result in slight oversubscription,
which in turn might lead to underperformance.

This problem can be resolved with an API that indicates when parallel computation is done,
allowing worker threads to be released from the arena,
essentially overriding the default block-time.<br>

This problem can be considered from another angle. Essentially, if the user can indicate where
parallel computation ends, they can also indicate where it starts.

<img src="parallel_block_introduction.png" width=800>

With this approach, the user not only releases threads when necessary
but also specifies a programmable block where worker threads should stick to the
executing arena.

## Proposal

Let's consider the guarantees that an API for explicit parallel blocks can provide:
* Start of parallel block:
  * Indicates the point from which the scheduler can use a hint and stick threads to the arena.
  * Serves as a warm-up hint to the scheduler, making some worker threads immediately available
    at the start of the real computation.
* "Parallel block" itself:
  * The scheduler can implement different busy-wait policies to retain threads in the arena.
* End of parallel block:
  * Indicates the point from which the scheduler can drop a hint
    and unstick threads from the arena.
  * Indicates that worker threads should ignore
    the default block time (introduced by PR#1352) and leave.

Start of parallel block:<br>
The warm-up hint should have guarantees similar to `task_arena::enqueue` from a signaling standpoint:
users should expect that the scheduler will do its best to make some threads available in the arena.

"Parallel block" itself:<br>
The guarantee for retaining threads is a hint to the scheduler;
thus, no real guarantee is provided. The scheduler can ignore the hint and
move threads to another arena or to sleep if conditions are met.

End of parallel block:<br>
It can indicate that worker threads should ignore the default block time; however,
if work is submitted immediately after the end of the parallel block,
the default block time is restored.

But what if the user would like to disable the default block time entirely?<br>
The extended block-time heuristic is unsuitable for tasks submitted in unpredictable
patterns and with unpredictable durations. In such cases, there should be an API to disable
the default block time in the arena entirely.

```cpp
class task_arena {
    void indicate_start_of_parallel_block(bool do_warmup = false);
    void indicate_end_of_parallel_block(bool disable_default_block_time = false);
    void disable_default_block_time();
    void enable_default_block_time();
};

namespace this_task_arena {
    void indicate_start_of_parallel_block(bool do_warmup = false);
    void indicate_end_of_parallel_block(bool disable_default_block_time = false);
    void disable_default_block_time();
    void enable_default_block_time();
}
```

Review thread on `indicate_start_of_parallel_block`:

**Contributor:** How about

```cpp
void retain_threads();
void release_threads();
```

`*_parallel_block` is misleading, since even your example shows serial parts of the region.

**Contributor (author):** Hmmm, I think `retain_threads` and `release_threads` provide an unnecessary guarantee, as if the threads will actually be retained. Should it be something more relaxed? Perhaps `make_sticky` and `make_unsticky` suit better, because we can set the definition of "sticky".

**Contributor:** Do you think "sticky" could make people think of thread-to-core affinity? Since constraints are used for affinity, I'm thinking the likelihood of confusion is low, so I'm OK with `make_sticky` and `make_unsticky`.

**Contributor:** Shouldn't all this be about work rather than threads? Threads are the execution resources, which should not be exposed to the user; that is the original idea of the TBB library. Therefore, I suggest something like `expect_[more/less_]parallel_work` or `assume_[more/less_]parallelism` as a suitably loose terminology for what the library should tend to "think" about the user's code when this API is used.

**Contributor (author):** TBB exposes some level of "thread logic" with observers. I'm not sure whether this API should expose this logic too. If we want to extend these functions with additional guarantees, such as "warm-up" or "leave earlier", perhaps we cannot ignore threads completely.

Review thread on `disable_default_block_time`:

**Contributor:** The end user doesn't know what the default block time is, and it will be platform-dependent. The first set of functions indicates a region of interest, while these `default_block_time` functions change a property on the `task_arena` that is not tied to a region. That makes me think it is better expressed as a constraint. Are there known cases where this needs to be disabled and then re-enabled dynamically? If the first two functions became something like `retain_threads` and `release_threads`, what would these be named? What about `set_sleep_policy(sleep_quickly | sleep_slowly)` or something like that?

**Contributor (author):** Perhaps it is a good idea to move it to constraints: if you need different guarantees, just use different arenas; the same is applicable to priorities. If we include the property as part of the constraints, which in turn represent HW resources, perhaps the name should represent how these resources will be used, like `greedy` or something. @akukanov, what do you think?

**Contributor:** A couple of questions:
* How would users think about "greedy" relative to per-arena priorities? It might imply some kind of priority between a greedy normal arena and a non-greedy normal arena, even though that wouldn't be the case.
* In suggesting `sleep_quickly` and `sleep_slowly`, which I admit are not great names, I was trying to find something that indicated more about the wastefulness of holding onto resources once you have them, while there is nothing better to do with them, in contrast to greediness in acquiring resources, perhaps from some other competing arenas. I think this is the key point of `disable|enable_default_block_time`: while it is a form of greediness, it is more about the amount of wastefulness tolerated to reduce startup latency of the next parallel algorithm.

**Contributor:** When I think about this, thoughts about being nice/responsive to the demand from other arenas come to mind. But I am not sure how to better combine these two sets of API, as they are somewhat mutually exclusive; consider "expect more parallel work to appear, but be responsive to resource demand from other arenas". Actually, this contradiction applies to the proposed design as well. Perhaps we need to express that mutual exclusiveness somehow in the API.

**Contributor (author), Oct 7, 2024:**

> expect more parallel work to appear but be responsive to resource demand from other arenas

What part of the proposal is stating this? (It would expect both properties simultaneously.)

**Contributor:** It is not stated explicitly, but it sort of implies the question: what will it mean if I invoke `indicate_start_of_parallel_block` and then call `disable_default_block_time` right after that?
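To illustrate the intended call pattern for the pipeline scenario from the introduction, here is a self-contained sketch. `mock_task_arena` is a stand-in that only records the hints, since the proposed members do not yet exist in `tbb::task_arena`:

```cpp
#include <cassert>

// Mock stand-in for the proposed API (these members do not exist in
// tbb::task_arena yet); it only records the hints so that the intended
// call pattern can be illustrated.
struct mock_task_arena {
    bool in_parallel_block = false;
    bool default_block_time_enabled = true;

    void indicate_start_of_parallel_block(bool do_warmup = false) {
        (void)do_warmup;  // a real scheduler would also wake some workers ahead of time
        in_parallel_block = true;
    }
    void indicate_end_of_parallel_block(bool disable_default_block_time = false) {
        in_parallel_block = false;
        if (disable_default_block_time)
            default_block_time_enabled = false;
    }
};

// Intended pattern for the oneTBB/OpenMP pipeline from the introduction:
// hint the start (with warm-up) before the oneTBB stage, then release the
// workers before handing the CPU over to the OpenMP stage.
void run_tbb_stage(mock_task_arena& arena) {
    arena.indicate_start_of_parallel_block(/*do_warmup=*/true);
    // ... arena.execute([]{ /* tbb::parallel_for(...) */ }); ...
    arena.indicate_end_of_parallel_block(/*disable_default_block_time=*/true);
}
```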

If the end of the parallel block is not indicated by the user, it will be done automatically when
the last public reference is removed from the arena (i.e., task_arena is destroyed or a thread
is joined for an implicit arena). This ensures correctness is
preserved (threads will not stick forever).
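A user who wants the end-of-block hint issued deterministically, rather than relying on arena destruction, could pair the two hints in an RAII guard. The helper below is hypothetical and not part of the proposal; `Arena` is any type providing the proposed members:

```cpp
// Hypothetical user-side RAII helper (not part of the proposal): issues the
// start hint on construction and guarantees the matching end hint on scope
// exit, even if the block is left early via return or exception.
template <typename Arena>
class parallel_block_guard {
    Arena& arena_;
public:
    explicit parallel_block_guard(Arena& a, bool do_warmup = false) : arena_(a) {
        arena_.indicate_start_of_parallel_block(do_warmup);
    }
    ~parallel_block_guard() {
        arena_.indicate_end_of_parallel_block();
    }
    parallel_block_guard(const parallel_block_guard&) = delete;
    parallel_block_guard& operator=(const parallel_block_guard&) = delete;
};
```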

## Considerations

Retaining worker threads should be implemented with care because
it might introduce performance problems if:
* Threads cannot migrate to another arena because they
stick to the current one.
* Compute resources are not homogeneous, e.g., the CPU is hybrid.
Heavier involvement of less performant core types might result in artificial work
imbalance in the arena.


## Open Questions in Design

Some open questions that remain:
* Are the suggested APIs sufficient?
* Are there additional use cases that should be considered that we missed in our analysis?