SceneWriter : Improve performance #6232
base: 1.5_maintenance
Conversation
Force-pushed from 684014c to 826863e.
I guess the good news is that this all looks good to me. Which means the bad news is that I haven't spotted whatever is causing the test failures.
Good hunting! If you haven't found the problem by tomorrow I might stare at this some more and see if I think of anything.
		break;
	}
	gatherFunctor( *locationResult );
}
I'm staring at this function pretty closely, since I know there are weird test failures, and threading stuff that I haven't dealt with before is more suspicious than the rest of this code ... but I'm not really seeing any obvious problems.
It's too bad parallel_pipeline doesn't work efficiently for this case, since it seems like exactly what it was designed for ... I wonder what they're doing wrong.
Because I know there's a failure, I'm thinking about task_arena::attach vs our usual caution in having to use isolated task_group_contexts, but I can't think of anything actually wrong here (it's honestly been a bit since I've carefully paid attention to what tbb's arenas and contexts are ... but all of this looks fine).
One thing that shouldn't affect correctness, but I'm curious about : how did you decide on queue.set_capacity( tbb::this_task_arena::max_concurrency() );? I guess it seems reasonable, but there's nothing to make it any more special than any other value, is there? In any case, one thread is likely to rip through the available capacity by writing a string of something that is quicker to traverse than write, and for the remainder of the runtime, everything will spend most of its time blocked on queue.push, and it's luck of the draw who gets unblocked?
> It's too bad parallel_pipeline doesn't work efficiently for this case...I wonder what they're doing wrong.

I think part of it is the way you have to feed tokens (paths in our case) in one-by-one. When a location has many children, parallel_for() in contrast lets multiple threads "discover" the children in parallel and without contention as they subdivide their allotted portion of the range. Having to pass tokens in also meant making a new ScenePath for each location, whereas parallel_for() can keep just updating a single instance in place. That part could have been solved by reusing paths from a circular buffer, but wouldn't have helped the serial one-by-one nature of token generation.
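To make the contrast concrete, here's a minimal sketch (with hypothetical path and scene types, not Gaffer's actual traversal code) of how parallel_for hands each task a contiguous slice of children that it can process with a single reused path :

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

#include <string>
#include <vector>

using Path = std::vector<std::string>; // stand-in for ScenePath

void visitChildren( const Path &parent, const std::vector<std::string> &childNames )
{
	tbb::parallel_for(
		tbb::blocked_range<size_t>( 0, childNames.size() ),
		[&] ( const tbb::blocked_range<size_t> &range ) {
			// Each task receives its own sub-range of children with no
			// contention, and updates a single path in place rather than
			// allocating a new one per location.
			Path childPath = parent;
			childPath.push_back( "" );
			for( size_t i = range.begin(); i != range.end(); ++i )
			{
				childPath.back() = childNames[i];
				// ... process childPath, recursing into its own children ...
			}
		}
	);
}

int main()
{
	visitChildren( { "root" }, { "sphere1", "sphere2", "camera" } );
}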
> Because I know there's a failure, I'm thinking about task_arena::attach vs our usual caution in having to use isolated task_group_contexts, but I can't think of anything actually wrong here

I think that side of things was OK, and the real problems were down to the lack of exception handling. This made two things possible :

- An exception in the parallelTraverse() could mean we never pushed monostate into the queue, so the main thread would loop forever.
- An exception in the gather functor could mean we exited the function while the parallelTraverse() was still running, which meant it was pushing into a queue that had already been destructed.

Hopefully I've fixed that in e1e66c8.
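For reference, here's a minimal self-contained sketch of that exception-safe structure, reconstructed from the description above rather than taken from e1e66c8 ; all the types and names are stand-ins :

#include <tbb/concurrent_queue.h>
#include <tbb/task_arena.h>
#include <tbb/task_group.h>

#include <exception>
#include <functional>
#include <variant>

struct LocationResult {}; // stand-in for the real per-location data

using QueueItem = std::variant<std::monostate, LocationResult, std::exception_ptr>;
using Queue = tbb::concurrent_bounded_queue<QueueItem>;

void parallelGather(
	const std::function<void ( Queue & )> &parallelTraverse,
	const std::function<void ( LocationResult & )> &gatherFunctor
)
{
	Queue queue;
	queue.set_capacity( tbb::this_task_arena::max_concurrency() );

	tbb::task_group producer;
	producer.run(
		[&] {
			try
			{
				parallelTraverse( queue ); // pushes LocationResults
			}
			catch( ... )
			{
				// Fix 1 : an exception no longer means the sentinel is
				// never pushed, leaving the main thread looping forever.
				queue.push( std::current_exception() );
				return;
			}
			queue.push( std::monostate() ); // "done" sentinel
		}
	);

	std::exception_ptr traverseException;
	QueueItem item;
	for( ;; )
	{
		queue.pop( item );
		if( std::holds_alternative<std::monostate>( item ) )
		{
			break;
		}
		else if( auto e = std::get_if<std::exception_ptr>( &item ) )
		{
			traverseException = *e; // producer has finished
			break;
		}

		try
		{
			gatherFunctor( std::get<LocationResult>( item ) );
		}
		catch( ... )
		{
			// Fix 2 : the queue must outlive the traversal, so drain it
			// (discarding results) until the producer's terminal item
			// arrives, rather than returning while pushes are in flight.
			do
			{
				queue.pop( item );
			} while( std::holds_alternative<LocationResult>( item ) );
			producer.wait();
			throw;
		}
	}

	producer.wait();
	if( traverseException )
	{
		std::rethrow_exception( traverseException );
	}
}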
> how did you decide on queue.set_capacity( tbb::this_task_arena::max_concurrency() );?

I didn't give this a huge amount of thought, so other ideas are welcome. But the danger of a too-large queue is runaway memory usage, and setting the capacity to the number of workers means we won't be using more memory than we were previously.
Glad you figured out the exception thing.
On capacity: seems pretty reasonable. Might be worth doing a test where you set it 4X higher, and see if it affects any of your benchmarks? If that doesn't make any difference, it would be a quick way to confirm that it isn't worth thinking about further. I can imagine a scenario where a bit more capacity helps smooth out some unevenness in how long things take to compute/write, but no idea how likely that is in practice.
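If it helps, that experiment is a one-line change (the multiplier here is hypothetical, purely for the test) :

const int capacityMultiplier = 4; // 1 reproduces the current behaviour
queue.set_capacity( capacityMultiplier * tbb::this_task_arena::max_concurrency() );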
I tried the 4x capacity, with no discernible impact on speed. At least in these benchmarks I can actually reduce the capacity without taking a hit, which kind of makes sense since I'm just doing SceneReader direct to SceneWriter, so there's not much opportunity to spend time on actual processing.
		locationData.write( output.get(), scope.context()->getTime() );
	}

);
I feel like I'm grasping at straws a little trying to explain how this is failing. I did think it could be interesting that LocationData is getting an implicit assignment operator called - it's being directly stored in the queue, and it seems like there's an assignment happening in queue.pop()? It might be a performance concern that a bunch of smart pointers are getting pointlessly incremented/decremented, but I don't see any way it would affect correctness if LocationData does get copied, as long as write() gets called exactly once. ( Though I guess I don't know what a NameList is, and whether that is safe to copy ).
I think pop() is using move assignment, and I've just pushed a commit to use that for push() too. So there shouldn't be any pointless increment/decrement now. NameList is just vector<InternedString>.
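A tiny self-contained illustration of the difference (LocationData here is a hypothetical stand-in with a single smart-pointer member, and this assumes TBB's rvalue push() overload) :

#include <tbb/concurrent_queue.h>

#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the real LocationData : copying it bumps the refcount on
// every smart-pointer member, moving it does not.
struct LocationData
{
	std::shared_ptr<const std::vector<std::string>> childNames; // "NameList"
};

int main()
{
	tbb::concurrent_bounded_queue<LocationData> queue;
	queue.set_capacity( 1 );

	LocationData locationData;
	locationData.childNames = std::make_shared<const std::vector<std::string>>();

	// Move on push, so no pointless increment/decrement...
	queue.push( std::move( locationData ) );

	// ...and pop() moves into the destination too.
	LocationData popped;
	queue.pop( popped );
}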
Force-pushed from 8ec092f to 19fd82e.
There was a heavy bias towards testing with `.scc` files, but those are not particularly relevant to the majority of Gaffer users.
I do not know why this feature exists. It was introduced in 269cd4c, which purported to be about something completely different. I believe it can only possibly work with the `.scc` format. It isn't matched by similar functionality in the SceneReader, which doesn't make any attempt to load globals. So I can only assume it was needed internally at Image Engine at some point in time - perhaps for Caribou? I have no idea if it is still needed or not. For now, adding a unit test so I don't break the feature during refactoring. But I suspect it should be removed at some point in the future.
This is more efficient than the previous approach of using a mutex to protect the output SceneInterface from concurrent writes. This gives the following reductions in runtime for a direct SceneReader->SceneWriter process :

- ALab : 60%
- Moana : 50%
- Market : 15%

Improvements would be expected to be even greater when heavy processing is done in between the read and the write, as it provides more opportunities for parallelism.
Force-pushed from 19fd82e to e2a0237.
I've pushed a rebase with two new commits on the end. The first exposes the fact that the original code was failing for …
This uses a new SceneAlgo::parallelGatherLocations() method which processes the scene in parallel, passing per-location results to a single thread for subsequent processing. This is more efficient than the previous approach of using a mutex to protect the output SceneInterface from concurrent writes, and gives the following reductions in runtime for a direct SceneReader->SceneWriter process :

- ALab : 60%
- Moana : 50%
- Market : 15%

Improvements would be expected to be even greater when heavy processing is done in between the read and the write, as it provides more opportunities for parallelism.
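As a rough end-to-end illustration of the pattern, here's a generic, hypothetical parallelGather() stand-in (not Gaffer's actual SceneAlgo API), with the exception handling discussed in the review omitted for brevity :

#include <tbb/concurrent_queue.h>
#include <tbb/parallel_for_each.h>
#include <tbb/task_arena.h>
#include <tbb/task_group.h>

#include <iostream>
#include <optional>
#include <string>
#include <vector>

// Parallel "compute" feeding a serial "write" : work() runs concurrently
// across items, gather() runs on a single thread, so the writer needs no mutex.
template<typename Item, typename Work, typename Gather>
void parallelGather( const std::vector<Item> &items, Work work, Gather gather )
{
	using Result = decltype( work( items.front() ) );
	tbb::concurrent_bounded_queue<std::optional<Result>> queue;
	queue.set_capacity( tbb::this_task_arena::max_concurrency() );

	tbb::task_group producer;
	producer.run(
		[&] {
			tbb::parallel_for_each(
				items.begin(), items.end(),
				[&] ( const Item &item ) { queue.push( work( item ) ); }
			);
			queue.push( std::nullopt ); // "done" sentinel
		}
	);

	for( ;; )
	{
		std::optional<Result> result;
		queue.pop( result );
		if( !result )
		{
			break;
		}
		gather( *result ); // single thread : safe to write without locking
	}
	producer.wait();
}

int main()
{
	const std::vector<std::string> paths = { "/a", "/a/b", "/c" };
	parallelGather(
		paths,
		[] ( const std::string &p ) { return p.size(); }, // parallel compute
		[] ( size_t n ) { std::cout << n << "\n"; }       // serial write
	);
}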