
ThreadPool cleanup (3/n): Switch to vectorized API & remove unused/co… #32

Merged
saeta merged 6 commits into master from threadpool-cleanup on May 24, 2020

Conversation

@saeta saeta (Owner) commented May 22, 2020

…nfusing extensions and implementations.

This change takes a first^H^H^H^H^H^Hthird whack at a bunch of tech debt:

  1. Removes the Naive thread pool implementation from PJoin.
  2. Removes the unnecessary TypedComputeThreadPool protocol refinement.
  3. Removes the badly implemented extensions that implemented parallelFor
    in terms of join.
  4. Removes use of rethrows, as the rethrows language feature is not
    expressive enough to allow the performance optimizations for the non-throwing
    case.
  5. Adds a vectorized API (which improves performance).
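To illustrate the difference the vectorized API makes (a simplified sketch using names styled after the diff, with a serial runner standing in for the real pool): instead of invoking a closure once per index, the pool hands each invocation a contiguous chunk, amortizing scheduling overhead across many iterations.

```swift
// Sketch of the two closure shapes (simplified; not the library's code).
// Element-wise: invoked once per index.
typealias ParallelForBody = (Int) -> Void
// Vectorized: invoked once per contiguous chunk [start, end), amortizing
// scheduling overhead across many iterations.
typealias VectorizedParallelForBody = (Int, Int) -> Void

// Serial stand-in for the pool: a real pool splits [0, n) across workers.
func parallelFor(n: Int, _ fn: VectorizedParallelForBody) {
  fn(0, n)
}

var sum = 0
parallelFor(n: 10) { start, end in
  for i in start..<end { sum += i }
}
// sum == 45
```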

Performance measurements:

After:

name                                                                   time         std                   iterations  
--------------------------------------------------------------------------------------------------------------------
NonBlockingThreadPool: join, one level                                 700.0 ns     ± 70289.84998218225   127457      
NonBlockingThreadPool: join, two levels                                2107.0 ns    ± 131041.5070696377   31115       
NonBlockingThreadPool: join, three levels                              4960.0 ns    ± 178122.9562964306   15849       
NonBlockingThreadPool: join, four levels, three on thread pool thread  5893.0 ns    ± 224021.47900401088  13763       
NonBlockingThreadPool: parallel for, one level                         22420.0 ns   ± 203689.69689780468  7581        
NonBlockingThreadPool: parallel for, two levels                        500985.5 ns  ± 642136.0139757036   1390        

Before:

name                                                                   time          std                   iterations  
---------------------------------------------------------------------------------------------------------------------
NonBlockingThreadPool: join, one level                                 728.0 ns      ± 78662.43173968921   115554      
NonBlockingThreadPool: join, two levels                                2149.0 ns     ± 144611.11773139169  30425       
NonBlockingThreadPool: join, three levels                              5049.0 ns     ± 188450.6773907647   15157       
NonBlockingThreadPool: join, four levels, three on thread pool thread  5951.0 ns     ± 229270.51587738466  10255       
NonBlockingThreadPool: parallel for, one level                         4919427.5 ns  ± 887590.5386061076   302         
NonBlockingThreadPool: parallel for, two levels                        4327151.0 ns  ± 855302.611386676    313         

@saeta saeta requested review from dabrahams and pschuh May 22, 2020 17:42
@saeta saeta (Owner, Author) commented May 22, 2020

Note: I'm trying to break up a very large refactoring I've been working on in #11 (and related branches) into more easily reviewable pieces. Happy to explain how they all fit together out-of-band as appropriate.

@dabrahams dabrahams (Collaborator):

IIUC the descriptions of parts 3/4 aren't quite right

  • You don't just remove parallelFor; you reimplement it.
  • You replace rethrows with overloads rather than simply removing it.

I know it isn't always easy to break up a big stack of work, but personally, I'd have put parts 3, 4, 5 each in its own PR. If you want a magit tutorial I could probably show you how to make this sort of thing go more easily.
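The overload approach being described can be made concrete with a minimal sketch (simplified; not the library's actual implementation): a non-throwing overload skips error bookkeeping entirely, while a throwing overload collects and re-surfaces an error, in the style of the `if let e = err { throw e }` pattern visible in the diff below.

```swift
struct MiniPool {
  // Non-throwing overload: no error plumbing, so it can take the fast path.
  func join(_ a: () -> Void, _ b: () -> Void) {
    a()
    b()
  }

  // Throwing overload: runs both closures, then surfaces the first error.
  func join(_ a: () throws -> Void, _ b: () throws -> Void) throws {
    var err: Error? = nil
    do { try a() } catch { err = error }
    do { try b() } catch { if err == nil { err = error } }
    if let e = err { throw e }
  }
}

// Swift's overload resolution picks the non-throwing version for
// non-throwing closures, so the fast path needs no `try` at the call site.
let pool = MiniPool()
var x = 0
pool.join({ x += 1 }, { x += 2 })
// x == 3
```

This is what `rethrows` cannot express: with a single `rethrows` function, both throwing and non-throwing callers share one implementation, so the non-throwing case cannot be specialized.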

@dabrahams dabrahams (Collaborator) left a review:

Approving to unblock progress but please consider all the suggestions and file issues if addressing them needs to be delayed.

@@ -292,6 +292,40 @@ public class NonBlockingThreadPool<Environment: ConcurrencyPlatform>: ComputeThr
if let e = err { throw e }
}

public func parallelFor(n: Int, _ fn: VectorizedParallelForFunction) {
Collaborator:

Doc comment please!!!

Unless "parallel for on a thread pool" is a very well-established concept, I'd consider renaming these. for in Swift is something we do over a Sequence, and there's no sequence here. Is this something that could be written as an extension on Collection that accepts a thread pool as an argument? Your n repetitions could be well-represented by 0..<n.

Owner Author:

+1 to doc comment. PTAL?

I believe that this should most often be accessed as an operation on a random access (or likely some form of "splittable") collection. But in any case, that will have to be generic over the thread pool itself, so we don't get away from having this method and coming up with a name for it.

Note: I started going in this direction a while back but I think that direction needs a "reboot". For now, I'd like to focus on getting this low-level API implemented correctly and efficiently, and we can then refactor and/or stack on the further abstractions.

FWIW: I started out by having VectorizedParallelForFunction take a range instead of 2 integers representing the start and end, but that makes type inference not work as well (as code requires annotations because the alternative API induces an ambiguity between the non-vectorized and vectorized APIs).

Collaborator:

  • Will comment on the doc comment separately (GitHub doesn't make this super convenient)
  • All collections are "splittable" for reasonable definitions of the term, but maybe you mean collections whose disjoint slices can be mutated in parallel. The more general concept is those that “have disjoint parts that can be projected for mutation in parallel.” You could imagine a collection of pairs where you mutate the first of each pair in one thread and the second in another.
  • Not quite sure what you wanted me to notice at that link. The main thing I took away was, "why is he rebasing those slices?" which probably wasn't the point 😉
  • IMO it's questionable whether we really want the non-vectorized ones and whether they should have the same spelling anyway.

executeParallelFor(0, n)
}

public func parallelFor(n: Int, _ fn: ThrowingVectorizedParallelForFunction) throws {
Collaborator:

Doc comment please!

Owner Author:

+1; done. (Although I suspect that this comment could be improved...)

// Divide into 2 & recurse.
let rangeSize = end - start
let midPoint = start + (rangeSize / 2)
try self.join({ try executeParallelFor(start, midPoint) }, { try executeParallelFor(midPoint, end) })
Collaborator:

? Your change description gives the impression you are removing the implementation of parallelFor in terms of join, yet here it is.

Owner Author:

Ah, good point. That description is getting ahead of the actual implementation in this patch set. I'll update the description in the PR shortly.


try executeParallelFor(0, n)
}
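The recursive split shown in the diff can be sketched as a self-contained function (a serial `join` stands in here; the real pool may run the two halves on different threads):

```swift
// Serial stand-in for the pool's join: runs both closures to completion.
func join(_ a: () -> Void, _ b: () -> Void) {
  a()
  b()
}

// Recursively halve [start, end) until chunks are small, then run the body.
func executeParallelFor(_ start: Int, _ end: Int, grainSize: Int = 2,
                        _ fn: (Int, Int) -> Void) {
  if end - start <= grainSize {
    fn(start, end)
    return
  }
  // Divide into 2 & recurse, as in the diff above.
  let midPoint = start + (end - start) / 2
  join({ executeParallelFor(start, midPoint, grainSize: grainSize, fn) },
       { executeParallelFor(midPoint, end, grainSize: grainSize, fn) })
}

var total = 0
executeParallelFor(0, 8) { start, end in
  for i in start..<end { total += i }
}
// total == 28
```

The `grainSize` cutoff is an illustrative assumption; the point is that each `join` exposes the two halves as potentially parallel work.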

/// Shuts down the thread pool.
Collaborator:

Meaningful summary please. What does it mean to shut a thread pool down?

Sources/PenguinParallel/ThreadPool.swift (outdated; thread resolved)
// func parallelFor(blockingUpTo n: Int, blocksPerThread: Int, _ fn: ParallelForFunc)
// func parallelFor(blockingUpTo n: Int, _ fn: ParallelForFunc)
// func parallelFor(blockingUpTo n: Int, blocksPerThread: Int, _ fn: ParallelForFunction)
// func parallelFor(blockingUpTo n: Int, _ fn: ParallelForFunction)

/// The maximum amount of parallelism possible within this thread pool.
Collaborator:

Kind of redundant with the name, yeah?
Please find a description that doesn't beg the question, "what does it mean to have a maximum parallelism of N?" from the point-of-view of someone who just has the threadpool abstraction to work with.

Owner Author:

Took a quick pass, although this can probably be refined further.

var holder = ParallelForFunctionHolder(fn: fn)
try withUnsafePointer(to: &holder) { holder in
try runParallelFor(pool: self, start: 0, end: n, total: n, fn: holder)
/// Convert a non-vectorized operation to a vectorized operation.
Collaborator:

This summary makes no sense to me. I'd want it to read “Converts…” but then a function that converts one thing to another thing returns that other thing. Looking further, the doc comment appears to be a description of the implementation technique, not of what the function does.

Owner Author:

Hmmm, I thought that comments on extension methods that implement protocol requirements don't show up in typical doc generation, so I tried to write something different & more specific here. I can certainly just copy-pasta the doc comment from the protocol method itself if you think that's more appropriate... :-)

That said, I've attempted to refine this a bit (in the same direction, however).

Collaborator:

Hmmm, I thought that comments on extension methods that are implementations of methods on the protocols themselves don't show up in typical doc-generation

You are mistaken:

[screenshot: generated documentation showing the extension method's doc comment]

Copy 🍝 is always delicious though somewhat unoriginal. Doc comments are for the user of the API. If you want to write something for maintainers, use //.
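The holder/pointer pattern visible in the `ParallelForFunctionHolder` snippet above can be sketched in a self-contained way (simplified names; the real code differs — the assumed intent is to keep the closure in a stack-allocated holder and hand workers an `UnsafePointer` to it for the duration of the call):

```swift
// Stack-allocated wrapper around the user's body closure.
struct FunctionHolder {
  let fn: (Int, Int) -> Void
}

// Stand-in for the pool's internal runner; a real one would split the range
// across worker threads that all read the body through the same pointer.
func runParallelFor(start: Int, end: Int, fn: UnsafePointer<FunctionHolder>) {
  fn.pointee.fn(start, end)
}

func parallelFor(n: Int, _ fn: @escaping (Int, Int) -> Void) {
  var holder = FunctionHolder(fn: fn)
  // The pointer is only valid inside this scope, which bounds how long the
  // workers may hold onto the closure.
  withUnsafePointer(to: &holder) { p in
    runParallelFor(start: 0, end: n, fn: p)
  }
}
```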

/// A `ComputeThreadPool` that executes everything immediately on the current thread.
///
/// This threadpool implementation is useful for testing correctness, as well as avoiding context
/// switches when a computation is designed to be parallelized at a coarser level.
public struct InlineComputeThreadPool: TypedComputeThreadPool {
public struct InlineComputeThreadPool: ComputeThreadPool {
Collaborator:

Suggested change
public struct InlineComputeThreadPool: ComputeThreadPool {
public struct SerialExecution: ParallelizableExecutionStrategy {

Consider the protocol name an initial suggestion. ComputeThreadPool seems wrong to me, as models are not necessarily thread pools.

When we do mention Threads, is there any point in adding Compute?

Owner Author:

In terms of thread-pools, there can be a number of different designs with different properties. In the same way that you can implement a random access collection in terms of a collection (just really inefficiently), I wanted to clearly distinguish what properties the thread-pool has. Concretely, there are I/O-focused thread-pools, where you can perform blocking and/or non-blocking I/O. This thread pool abstraction is focused on compute-bound tasks, and is tuned / structured with APIs focused on that domain. Does that make sense?

Happy to ponder the names further... related work also uses ConcurrentWorkQueue.

Collaborator:

In terms of thread-pools, there can be a number of different designs with different properties.

Naturally.

In the same way that you can implement a random access collection in terms of a collection (just really inefficiently),

You mean like Sampling<SomeCollection, [SomeCollection.Index]>? (Phew, glad I looked at that just now!)

I wanted to clearly distinguish what properties the thread-pool has.

Also naturally. I don't get the connection to random access, though.

Concretely, there are I/O-focused thread-pools, where you can perform blocking and/or non-blocking I/O. This thread pool abstraction is focused on compute-bound tasks, and is tuned / structured with APIs focused on that domain. Does that make sense?

In principle. Are they really separate abstractions though, or at least, isn't the IO one a refinement of this one? Wouldn't you want to write most algorithms once rather than replicate them for different kinds of pools?

Sources/PenguinParallel/ThreadPool.swift (thread resolved)
/// Executes `a` and `b` optionally in parallel, and returns when both are complete.
///
/// Note: this implementation simply executes them serially.
public func join(_ a: () -> Void, _ b: () -> Void) {
Collaborator:

I think of join as an operation that waits on the completion of one or more already concurrently-executing things.

Suggested change
public func join(_ a: () -> Void, _ b: () -> Void) {
public func concurrently(_ a: () -> Void, _ b: () -> Void) {

Just an idea.

Owner Author:

For context: I picked join as the typical term-of-art in this space. I'm not fully sold on concurrently yet, because join represents optional concurrency, which is important for performance at scale.

I think that it would be good to go over this API and think hard about naming & how the abstractions compose, but only once we understand the performance limitations & constraints. (Concretely, some of the (internal) abstractions are being re-written due to performance limitations in the current structure of things.)

Collaborator:

It is an established term-of-art, which means we should use it in the way that has been established. You join stuff after you've forked it, in my experience, Rayon notwithstanding. And IME when you fork, you're running the same closure in both threads of execution, with a parameter passed to indicate which invocation of the closure you got, similar to what you did with parallelFor, which I think might be better called forkJoin ersump'n.

I'm not sure the concurrency is optional from the programming model P.O.V., which is what matters here. There may or may not be any actual parallelism between a and b's execution, but that would be true even if you unconditionally launched separate threads for a and b, so surfacing that distinction as though it's significant seems like a mistake to me.

Also the word “optional” tends to imply it's up to the user, but it's not; this is up to the library.

As for putting off talking about naming and abstractions of "this API" until we know its performance, I think if we don't do both at once, we don't know what "this API" is. You don't want to design yourself into performance constraints based on assumptions about the programming model that don't actually apply.

@saeta saeta merged commit 64d2335 into master May 24, 2020
@saeta saeta deleted the threadpool-cleanup branch May 24, 2020 19:37
Comment on lines +292 to +293
/// Executes `fn`, optionally in parallel, spanning the range `0..<n`.
public func parallelFor(n: Int, _ fn: VectorizedParallelForBody) {
Collaborator:

Not a fan of this doc. When you say “optionally” it implies the caller gets to pass an option, but that's not the case. Also, I have no idea what it means to “execute a function spanning a range.” Was going to make some suggestions but you've already merged.
