-
Notifications
You must be signed in to change notification settings - Fork 141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API to specify number of threads, from threadpool, to use for the task #17
base: master
Are you sure you want to change the base?
Conversation
@Maratyszcza would love to get your inputs here. Thanks! |
b382813
to
668abfa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kimishpatel
Overall I'm still not understanding why capping the # threads requires the bulk of the changes here.
Is there a simple explanation?
@@ -85,6 +86,12 @@ pthreadpool_t pthreadpool_create(size_t threads_count); | |||
*/ | |||
size_t pthreadpool_get_threads_count(pthreadpool_t threadpool); | |||
|
|||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we include the why?
E.g.
API to cap the number of threads used to do work, rather than those
currently available in the runtime's threadpool.
-
- This is useful to counter potential performance degradation
-
- of using more threads than optimal for the device and use-case
-
- such as the OS scheduling threads to run on smaller cores
-
- at the cost of threading overhead.
Also indicate what happens if num_threads > # threads in pool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I need to do add that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to use pthreadpool_get_threads_count
/pthreadpool_set_threads_count
for the new API functions and add a new function pthreadpool_get_max_threads_count
to return the number of threads in the thread pool (what pthreadpool_get_threads_count
does now)
src/gcd.c
Outdated
@@ -73,6 +75,15 @@ struct pthreadpool* pthreadpool_create(size_t threads_count) { | |||
return threadpool; | |||
} | |||
|
|||
void pthreadpool_cap_num_threads(size_t num_threads) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this check vs threads_count?
And return a bool to say if it actually capped anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We dont have access to threadpool object here. I think your terminology of max threads actually better explains the behavior. It would so do not use more than max_threads
. Capping somehow implies that some pre-existing value has to be capped?
src/gcd.c
Outdated
@@ -99,15 +110,16 @@ PTHREADPOOL_INTERNAL void pthreadpool_parallelize( | |||
|
|||
/* Locking of completion_mutex not needed: readers are sleeping on command_condvar */ | |||
const struct fxdiv_divisor_size_t threads_count = threadpool->threads_count; | |||
const struct fxdiv_divisor_size_t num_threads_to_use = fxdiv_init_size_t(min(threads_count.value, capped_num_threads)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be done directly in the method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which method?
src/pthreads.c
Outdated
@@ -1,3 +1,34 @@ | |||
/* | |||
* Overall architecture: | |||
* 1. capped_num_threads is used to specifiy max threads to use. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we should be clear capped_num_threads is <= threads_count.
Let's not introduce another term (max threads).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying that we should document this? Because pthreadpool_cap_num_threads
does not know threadpool size.
src/pthreads.c
Outdated
* However thread 2 will never see cmd 3 because the masked bit is same as cmd1 and it is not perceived as new command. | ||
* To fix this we must add this invariant: | ||
* - Master thread must synchronize with all threads before submitting next command regardless of the eligibility of threads to paricipate in the work. | ||
* Testing: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean you ran this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is the test I added. I still could not figure out a good way to ensure that only specified number of threads are used. Of course I validated manually but nothing else besides that.
src/pthreads.c
Outdated
uint32_t last_flags, | ||
size_t thread_id) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this code seems to have tabs vs spaces, you should fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad. Not sure how that creeped in.
src/pthreads.c
Outdated
* At this point thread 2 starts waiting with last_command = cmd 1 because it never saw cmd 2. Master thread submits cmd 3. At cmd 3 all three threads are eligible. | ||
* However thread 2 will never see cmd 3 because the masked bit is same as cmd1 and it is not perceived as new command. | ||
* To fix this we must add this invariant: | ||
* - Master thread must synchronize with all threads before submitting next command regardless of the eligibility of threads to paricipate in the work. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have you estimated the cost of doing this sync?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only on one model we wanted to test, but I need to do more comprehensive benchmarking.
src/pthreads.c
Outdated
@@ -291,6 +388,15 @@ struct pthreadpool* pthreadpool_create(size_t threads_count) { | |||
return threadpool; | |||
} | |||
|
|||
void pthreadpool_cap_num_threads(size_t num_threads) { | |||
assert(num_threads > 0); | |||
capped_num_threads = num_threads; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comments as in the other file.
That way we don't need another num_threads_to_use but can use capped_num_threads directly which is cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I asked this before but to be sure, directly where?
src/pthreads.c
Outdated
@@ -322,7 +428,11 @@ PTHREADPOOL_INTERNAL void pthreadpool_parallelize( | |||
|
|||
/* Locking of completion_mutex not needed: readers are sleeping on command_condvar */ | |||
const struct fxdiv_divisor_size_t threads_count = threadpool->threads_count; | |||
pthreadpool_store_relaxed_size_t(&threadpool->active_threads, threads_count.value - 1 /* caller thread */); | |||
const struct fxdiv_divisor_size_t num_threads_to_use = fxdiv_init_size_t(min(threads_count.value, capped_num_threads)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
align
src/threadpool-object.h
Outdated
* As per this change, this feature is not available in GCD based | ||
* pthreadpool | ||
*/ | ||
pthreadpool_atomic_size_t num_threads_to_use; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment here, we should just have capped_num_threads
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean just use capped_num_threads? Thats what you mean by directly? That we cannot do because capped_num_threads is thread local variable and each thread will have different value for that which we cannot communicate to each except via atomic variable. But possible I overlooked something.
Not sure what you mean. Are you saying that your intuitive understanding is that it should be a simpler change? The changes in other places are required because we want to:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My biggest concern is that this code doesn't do what you expect: it still wakes up all threads in pthreadpool, and them sends some of them to sleep right away. I.e. work dispatch doesn't use all threads, but still pays the full latency cost of synchronizing all threads, even unused ones.
Also, many formatting inconsistencies, please format similarly to existing pthreadpool code.
include/pthreadpool.h
Outdated
@@ -1,6 +1,7 @@ | |||
#ifndef PTHREADPOOL_H_ | |||
#define PTHREADPOOL_H_ | |||
|
|||
#include <stdbool.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unnecessary, none of the functions in the header use bool
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aaah. This is from my dev setup. Will fix.
src/pthreads.c
Outdated
@@ -54,6 +85,7 @@ | |||
#include "threadpool-object.h" | |||
#include "threadpool-utils.h" | |||
|
|||
thread_local size_t capped_num_threads = UINT_MAX; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thread-local cap doesn't make sense, the same thread can work with different pthreadpool_t
objects and multiple threads can submit tasks to the same pthreadpool_t
object.
Limit on the number of threads should be a property of pthreadpool_t
object, not of a thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that seems fair. I had something else in mind, but what you said makes more sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually for the case of "multiple threads can submit tasks to the same pthreadpool_t object ", the issue is this. If capped_num_threads
was part of pthreadpool_t
object then we would have to atomically update it and depending on whose update finishes last we might apply different caps. So that may break when thread 1 sets cap to 2 but later on it gets set to 3 by thread 2 and then thread 1 runs with the cap of 3 threads and so does thread 2. In order to make this work we will have to change *parallelize*
API to account for cap.
But for "the same thread can work with different pthreadpool_t objects", what you said makes sense.
Dont have a good solution but did want to point it out in case I missed something.
That is correct, but to do more appropriate fix, it requires us to have separate command buffer for each thread in the pool. However, that is a much larger change so I refrained from it. Another reason was that it also complicates work stealing, although it should be doable. My thought was to try such a change in a follow-up PR, but I am open to suggestions and if you think you want to do that from the get-go, thats also ok.
Yes, my bad. I did not realize this. WIll fix. |
Also btw for the behavior we observed where 4 threads can get mapped to 3 cores, which results in thread swapping each other out, this solution still works even though unused threads are spuriously woken up, since, I suppose, latency of compute tends to be longer than waking and going back to sleep. |
668abfa
to
c7d1ae2
Compare
@Maratyszcza can you please re-review this? In the latest commit I have addressed two of your concerns:
I personally feel the second commit somewhat diminishes the value of what we are trying to achieve here. If you have multiple threads, each running something (pytorch models in this instance) that uses a global threadpool (and I would assume this to be more common pattern), then this is what would happen: Thread 1 sets max # of threads on threadpool object and subsequently thread 2 sets it to another value. Now if the runs of both threads are interleaved then thread 1 is forced to use value set by thread 2. If you have a better suggestion, I am happy to hear. Look forward to your comments. |
c7d1ae2
to
f5d213f
Compare
include/pthreadpool.h
Outdated
* Purpose of this is to ameliorate some perf degradation observed | ||
* due to OS mapping a given set of threads to fewer cores. | ||
*/ | ||
void pthreadpool_set_max_num_threads(struct pthreadpool* threadpool, size_t num_threads); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use pthreadpool_t
instead of struct pthreadpool*
include/pthreadpool.h
Outdated
* due to OS mapping a given set of threads to fewer cores. | ||
*/ | ||
void pthreadpool_set_max_num_threads(struct pthreadpool* threadpool, size_t num_threads); | ||
size_t pthreadpool_get_max_num_threads(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function should have void
in the parameter list for compatibility with C
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aah good to know. Did not know this.
@@ -20,7 +20,6 @@ | |||
#include "threadpool-object.h" | |||
#include "threadpool-utils.h" | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revert
@@ -79,6 +79,31 @@ struct PTHREADPOOL_CACHELINE_ALIGNED thread_info { | |||
*/ | |||
HANDLE thread_handle; | |||
#endif | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why all changes in this file are needed. Please revert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this PR adds per thread wakeup logic so that only participating threads are woken up. That is why this change was needed. If you have better suggestions I am open to it.
src/windows.c
Outdated
@@ -22,6 +22,7 @@ | |||
#include "threadpool-object.h" | |||
#include "threadpool-utils.h" | |||
|
|||
thread_local size_t max_num_threads = UINT_MAX; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be no thread-local variables
@@ -53,11 +54,11 @@ static void wait_worker_threads(struct pthreadpool* threadpool, uint32_t event_i | |||
} | |||
|
|||
static uint32_t wait_for_new_command( | |||
struct pthreadpool* threadpool, | |||
struct thread_info* thread, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the API change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because now wakeup and command both are per thread.
src/windows.c
Outdated
@@ -147,6 +148,7 @@ struct pthreadpool* pthreadpool_create(size_t threads_count) { | |||
return NULL; | |||
} | |||
threadpool->threads_count = fxdiv_init_size_t(threads_count); | |||
pthreadpool_store_relaxed_size_t(&threadpool->num_threads_to_use, threads_count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong indentation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry.
src/windows.c
Outdated
@@ -190,6 +195,14 @@ struct pthreadpool* pthreadpool_create(size_t threads_count) { | |||
return threadpool; | |||
} | |||
|
|||
void pthreadpool_set_max_num_threads(struct pthreadpool* threadpool, size_t num_threads) { | |||
pthread_mutex_lock(&threadpool->execution_mutex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
threadpool->execution_mutex
is a WinAPI mutex handle, use WaitForSingleObject
/ReleaseMutex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aah. Thanks for pointing this out. Surprised that internal windows build did not fail. I will make sure to build this on windows.
src/gcd.c
Outdated
@@ -21,6 +21,8 @@ | |||
#include "threadpool-object.h" | |||
#include "threadpool-utils.h" | |||
|
|||
thread_local size_t max_num_threads = UINT_MAX; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be here
src/gcd.c
Outdated
@@ -73,6 +76,14 @@ struct pthreadpool* pthreadpool_create(size_t threads_count) { | |||
return threadpool; | |||
} | |||
|
|||
void pthreadpool_set_max_num_threads(struct pthreadpool* threadpool, size_t num_threads) { | |||
pthread_mutex_lock(&threadpool->execution_mutex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
threadpool->execution_mutex
doesn't exist when targeting GCD, use threadpool->execution_semaphore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, the code needs substantial changes:
- Avoid changing existing internal APIs unless it is absolutely needed to add the new functionality. This PR is already very big by itself, and mixed-in changes in unrelated APIs make it hard to review.
- Please rename the existing
threads_count
tomax_threads_count
and usethreads_count
for the new functionality. - Validate that
threads_count <= max_threads_count
once in the public API that sets this variable. Doesn't clip tomax_threads_count
in other functions, just assume this holds (optionally add anassert
for this). - Make sure it is tested on Windows and iOS/Mac. I suspect the current version may not compile on these platforms.
Responding here:
Internal API changes are needed because we add per thread command and wait logic.
Good suggestion. Will follow up.
Makes sense.
Will do. @Maratyszcza this leaves me with one high level question that I mentioned in my earlier message. I am copy pasting that here: Should thread_count be set per threadpool object or set as a thread local variable. If you have multiple threads, each running something (pytorch models in this instance) that uses a global threadpool (and I would assume this to be more common pattern), then this is what would happen: Thread 1 sets max # of threads on threadpool object and subsequently thread 2 sets it to another value. Now if the runs of both threads are interleaved then thread 1 is forced to use value set by thread 2. I do understand your concern that with thread local pattern, multiple threadpool objects being subject to the same constraint. |
pthreadpool is a low-level library, and for low-level libraries it is preferred to avoid global objects. If necessary, users can implement this functionality at a higher level on top of pthreadpool.
I don't expect it to be a common use-case to use different number of thread pool threads depending on which thread called into it, especially with the current implementation that still wakes up all threads. |
We usually use a singleton pthreadpool object. Thus multiple threads use single threadpool. There are cases when multiple models can be running simultaneously. When threads running individual models are not interleaved, implementation of this PR is ok. However, we dont have any control over how these run in sw stack in which pytorch is integrated. If two threads running two different models start using new API then we may run into weird performance issues which might be hard to debug, such as the last thread to set the thread count wins. Issue is once we expose this API to client, they expect certain behavior which is not guaranteed. That is why thread local setting seems to make more sense to me. But if you feel strongly about this, I understand. |
f5d213f
to
ef5e4b7
Compare
2301913
to
e5f0dfe
Compare
e5f0dfe
to
2a8a028
Compare
Tested on windows and mac |
2a8a028
to
bca5ac6
Compare
Summary: This diff splits the command command queueu and instead uses commands specific to each thread. This enables: - Waking up only subset of threads needed. - Waiting for only subset of threads In this commit the number of threads to use is a thread local variable. Subsetquent commit makes that a property of threadpool object Test Plan: pthreadpool-test Reviewers: Subscribers: Tasks: Tags:
Summary: This commits changes API to set max num threads. It applies the limit to the pthreadpool object. Test Plan: Reviewers: Subscribers: Tasks: Tags:
bca5ac6
to
45b9c96
Compare
Summary:
This PR adds a way use fewer threads than configured with in
pthreadpool. Occassionaly it has been seen that using the # of thredas =
logical core is not efficient. This can be due to system load and
varying other factors that lead threads either being mapped to slower
cores or being mapped to fewer than logical core (as actually seen).
Thus this PR attempt to fix this.
Approach:
specified threads.
for next chunk of work and thus give up their cpu slot.
Both pthreads.c windows.c are modified to add this feature.
Test Plan:
4 tests are added to check this.
Reviewers:
Subscribers:
Tasks:
Tags: