SceneWriter : Improve performance #6232
base: 1.5_maintenance
Conversation
Force-pushed from 684014c to 826863e.
I guess the good news is that this all looks good to me. Which means the bad news is that I haven't spotted whatever is causing the test failures.
Good hunting! If you haven't found the problem by tomorrow I might stare at this some more and see if I think of anything.
		break;
	}
	gatherFunctor( *locationResult );
}
I'm staring at this function pretty closely, since I know there are weird test failures, and threading stuff that I haven't dealt with before is more suspicious than the rest of this code ... but I'm not really seeing any obvious problems.
It's too bad parallel_pipeline doesn't work efficiently for this case, since it seems like exactly what it was designed for ... I wonder what they're doing wrong.
Because I know there's a failure, I'm thinking about task_arena::attach vs our usual caution in having to use isolated task_group_contexts, but I can't think of anything actually wrong here (it's honestly been a bit since I've carefully paid attention to what tbb's arenas and contexts are ... but all of this looks fine).
One thing that shouldn't affect correctness, but I'm curious about : how did you decide on queue.set_capacity( tbb::this_task_arena::max_concurrency() );? I guess it seems reasonable, but there's nothing to make it any more special than any other value, is there? In any case, one thread is likely to rip through the available capacity by writing a string of something that is quicker to traverse than write, and for the remainder of the runtime, everything will spend most of its time blocked on queue.push, and it's luck of the draw who gets unblocked?
> It's too bad parallel_pipeline doesn't work efficiently for this case...I wonder what they're doing wrong.

I think part of it is the way you have to feed tokens (paths in our case) in one-by-one. When a location has many children, parallel_for() in contrast lets multiple threads "discover" the children in parallel and without contention as they subdivide their allotted portion of the range. Having to pass tokens in also meant making a new ScenePath for each location, whereas parallel_for() can keep just updating a single instance in place. That part could have been solved by reusing paths from a circular buffer, but wouldn't have helped the serial one-by-one nature of token generation.
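To make the contrast concrete, here's a minimal sketch (with hypothetical path and scene types, not Gaffer's actual traversal code) of how parallel_for hands each task a contiguous slice of children that it can process with a single reused path :

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

#include <string>
#include <vector>

using Path = std::vector<std::string>; // stand-in for ScenePath

void visitChildren( const Path &parent, const std::vector<std::string> &childNames )
{
	tbb::parallel_for(
		tbb::blocked_range<size_t>( 0, childNames.size() ),
		[&] ( const tbb::blocked_range<size_t> &range ) {
			// Each task receives its own sub-range of children with no
			// contention, and updates a single path in place rather than
			// allocating a new one per location.
			Path childPath = parent;
			childPath.push_back( "" );
			for( size_t i = range.begin(); i != range.end(); ++i )
			{
				childPath.back() = childNames[i];
				// ... process childPath, recursing into its own children ...
			}
		}
	);
}

int main()
{
	visitChildren( { "root" }, { "sphere1", "sphere2", "camera" } );
}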
> Because I know there's a failure, I'm thinking about task_arena::attach vs our usual caution in having to use isolated task_group_contexts, but I can't think of anything actually wrong here

I think that side of things was OK, and the real problems were down to the lack of exception handling. This made two things possible :

- An exception in the parallelTraverse() could mean we never pushed monostate into the queue, so the main thread would loop forever.
- An exception in the gather functor could mean we exited the function while the parallelTraverse() was still running, which meant it was pushing into a queue that had already been destructed.

Hopefully I've fixed that in e1e66c8.
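For reference, here's a minimal self-contained sketch of that exception-safe structure, reconstructed from the description above rather than taken from e1e66c8 ; all the types and names are stand-ins :

#include <tbb/concurrent_queue.h>
#include <tbb/task_arena.h>
#include <tbb/task_group.h>

#include <exception>
#include <functional>
#include <variant>

struct LocationResult {}; // stand-in for the real per-location data

using QueueItem = std::variant<std::monostate, LocationResult, std::exception_ptr>;
using Queue = tbb::concurrent_bounded_queue<QueueItem>;

void parallelGather(
	const std::function<void ( Queue & )> &parallelTraverse,
	const std::function<void ( LocationResult & )> &gatherFunctor
)
{
	Queue queue;
	queue.set_capacity( tbb::this_task_arena::max_concurrency() );

	tbb::task_group producer;
	producer.run(
		[&] {
			try
			{
				parallelTraverse( queue ); // pushes LocationResults
			}
			catch( ... )
			{
				// Fix 1 : an exception no longer means the sentinel is
				// never pushed, leaving the main thread looping forever.
				queue.push( std::current_exception() );
				return;
			}
			queue.push( std::monostate() ); // "done" sentinel
		}
	);

	std::exception_ptr traverseException;
	QueueItem item;
	for( ;; )
	{
		queue.pop( item );
		if( std::holds_alternative<std::monostate>( item ) )
		{
			break;
		}
		else if( auto e = std::get_if<std::exception_ptr>( &item ) )
		{
			traverseException = *e; // producer has finished
			break;
		}

		try
		{
			gatherFunctor( std::get<LocationResult>( item ) );
		}
		catch( ... )
		{
			// Fix 2 : the queue must outlive the traversal, so drain it
			// (discarding results) until the producer's terminal item
			// arrives, rather than returning while pushes are in flight.
			do
			{
				queue.pop( item );
			} while( std::holds_alternative<LocationResult>( item ) );
			producer.wait();
			throw;
		}
	}

	producer.wait();
	if( traverseException )
	{
		std::rethrow_exception( traverseException );
	}
}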
> how did you decide on queue.set_capacity( tbb::this_task_arena::max_concurrency() );?

I didn't give this a huge amount of thought, so other ideas are welcome. But the danger of a too-large queue is runaway memory usage, and setting the capacity to the number of workers means we won't be using more memory than we were previously.
Glad you figured out the exception thing.
On capacity: seems pretty reasonable. Might be worth doing a test where you set it 4X higher, and see if it affects any of your benchmarks? If that doesn't make any difference, it would be a quick way to confirm that it isn't worth thinking about further. I can imagine a scenario where a bit more capacity helps smooth out some unevenness in how long things take to compute/write, but no idea how likely that is in practice.
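If it helps, that experiment is a one-line change (the multiplier here is hypothetical, purely for the test) :

const int capacityMultiplier = 4; // 1 reproduces the current behaviour
queue.set_capacity( capacityMultiplier * tbb::this_task_arena::max_concurrency() );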
I tried the 4x capacity, with no discernible impact on speed. At least in these benchmarks I can actually reduce the capacity without taking a hit, which kind of makes sense since I'm just doing SceneReader direct to SceneWriter, so there's not much opportunity to spend time on actual processing.
		locationData.write( output.get(), scope.context()->getTime() );
	}

);
I feel like I'm grasping at straws a little trying to explain how this is failing. I did think it could be interesting that LocationData is getting an implicit assignment operator called - it's being directly stored in the queue, and it seems like there's an assignment happening in queue.pop()? It might be a performance concern that a bunch of smart pointers are getting pointlessly incremented/decremented, but I don't see any way it would affect correctness if LocationData does get copied, as long as write() gets called exactly once. ( Though I guess I don't know what a NameList is, and whether that is safe to copy ).
I think pop() is using move assignment, and I've just pushed a commit to use that for push() too. So there shouldn't be any pointless increment/decrement now. NameList is just vector<InternedString>.
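A tiny self-contained illustration of the difference (LocationData here is a hypothetical stand-in with a single smart-pointer member, and this assumes TBB's rvalue push() overload) :

#include <tbb/concurrent_queue.h>

#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the real LocationData : copying it bumps the refcount on
// every smart-pointer member, moving it does not.
struct LocationData
{
	std::shared_ptr<const std::vector<std::string>> childNames; // "NameList"
};

int main()
{
	tbb::concurrent_bounded_queue<LocationData> queue;
	queue.set_capacity( 1 );

	LocationData locationData;
	locationData.childNames = std::make_shared<const std::vector<std::string>>();

	// Move on push, so no pointless increment/decrement...
	queue.push( std::move( locationData ) );

	// ...and pop() moves into the destination too.
	LocationData popped;
	queue.pop( popped );
}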
Force-pushed from 8ec092f to 19fd82e.
There was a heavy bias towards testing with `.scc` files, but those are not particularly relevant to the majority of Gaffer users.
I do not know why this feature exists. It was introduced in 269cd4c, which purported to be about something completely different. I believe it can only possibly work with the `.scc` format. It isn't matched by similar functionality in the SceneReader, which doesn't make any attempt to load globals. So I can only assume it was needed internally at Image Engine at some point in time - perhaps for Caribou? I have no idea if it is still needed or not. For now, adding a unit test so I don't break the feature during refactoring. But I suspect it should be removed at some point in the future.
This is more efficient than the previous approach of using a mutex to protect the output SceneInterface from concurrent writes. This gives the following reductions in runtime for a direct SceneReader->SceneWriter process :

- ALab : 60%
- Moana : 50%
- Market : 15%

Improvements would be expected to be even greater when heavy processing is done in between the read and the write, as it provides more opportunities for parallelism.
Force-pushed from 19fd82e to e2a0237.
I've pushed a rebase with two new commits on the end. The first exposes the fact that the original code was failing for …
This uses a new SceneAlgo::parallelGatherLocations() method which processes the scene in parallel, passing per-location results to a single thread for subsequent processing. This is more efficient than the previous approach of using a mutex to protect the output SceneInterface from concurrent writes, and gives the following reductions in runtime for a direct SceneReader->SceneWriter process :

- ALab : 60%
- Moana : 50%
- Market : 15%

Improvements would be expected to be even greater when heavy processing is done in between the read and the write, as it provides more opportunities for parallelism.
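As a rough end-to-end illustration of the pattern, here's a generic, hypothetical parallelGather() stand-in (not Gaffer's actual SceneAlgo API), with the exception handling discussed in the review omitted for brevity :

#include <tbb/concurrent_queue.h>
#include <tbb/parallel_for_each.h>
#include <tbb/task_arena.h>
#include <tbb/task_group.h>

#include <iostream>
#include <optional>
#include <string>
#include <vector>

// Parallel "compute" feeding a serial "write" : work() runs concurrently
// across items, gather() runs on a single thread, so the writer needs no mutex.
template<typename Item, typename Work, typename Gather>
void parallelGather( const std::vector<Item> &items, Work work, Gather gather )
{
	using Result = decltype( work( items.front() ) );
	tbb::concurrent_bounded_queue<std::optional<Result>> queue;
	queue.set_capacity( tbb::this_task_arena::max_concurrency() );

	tbb::task_group producer;
	producer.run(
		[&] {
			tbb::parallel_for_each(
				items.begin(), items.end(),
				[&] ( const Item &item ) { queue.push( work( item ) ); }
			);
			queue.push( std::nullopt ); // "done" sentinel
		}
	);

	for( ;; )
	{
		std::optional<Result> result;
		queue.pop( result );
		if( !result )
		{
			break;
		}
		gather( *result ); // single thread : safe to write without locking
	}
	producer.wait();
}

int main()
{
	const std::vector<std::string> paths = { "/a", "/a/b", "/c" };
	parallelGather(
		paths,
		[] ( const std::string &p ) { return p.size(); }, // parallel compute
		[] ( size_t n ) { std::cout << n << "\n"; }       // serial write
	);
}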