Skip to content

Conversation

@AnyOldName3
Copy link
Contributor

Pull Request Template

Description

Avoids disturbing necessary descriptor sets while still avoiding the need for full layout compatibility checks.

This is achieved with a basic vector<bool> saying which sets a pipeline layout has and doesn't have, so layouts it doesn't have won't be bound. This could mean different sets are bound than if a full compatibility check was done, but that should only happen when the scenegraph requested invalid API usage - there should only be one BindDescriptorSet for each slot at the top of the state stacks, and if it's for a slot the current pipeline statically uses, it must be compatible. Therefore, we only need to worry about compatibility checks for slots the current pipeline layout doesn't use, which the change here now does.

There's a slight caveat here - if a pipeline doesn't statically use any bindings in a descriptor set, it's legal to bind nothing to it, so someone might make a scenegraph that leaves its BindDescriptorSet out of a stategroup and inherets something incompatible. If a descriptor set is empty, it can't be statically used, so that's treated as if there's no descriptor set. Any handling beyond that would have a larger performance overhead, so I think use cases other than this aren't worth supporting directly. If someone wants to avoid binding useless descriptor sets, there are other means, e.g. adding a dummy statecommand or not having a pointless descriptor set in the first place.

This is based on #1438, even though technically it could be separated and used on its own. This is because my earlier attempts to fix the problem relied on that PR, and using the same baseline made comparing them more straightforward.

This fix is by far the fastest that helps every failure case I identified (before yesterday), with the performance impact of a worst-case scenario ranging from below the noise floor of my measurements to just 0.25% overhead depending on whether I cherry-pick a good run or my worst run. For a scenegraph that isn't designed specifically to make this have as large an impact as possible, it should be free - it does a few instructions extra work for each pipeline layout change and descriptor set binding, both of which take much longer than that already.

Fixes #1394

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

The main interesting test is the incompatible layout example here: vsg-dev/vsgExamples@master...AnyOldName3:vsgExamples:incompatible-layouts. As discussed in #1394, it replicates the problem I initially discovered in a client app and several related situations that I discovered and thought were likely to have problems. There's lots of detail in this post #1394 (comment). With the master branch, lots of validation errors are emitted in most test cases, but with this branch, it causes no errors at all.

Obviously, I've confirmed that this fixes the problem in the original client app.

As well as this, I've done performance testing. On my machine, the overhead is much smaller than the variation in framerate or execution time from run to run, so simply timing things didn't provide meaningful results. Instead, I found that running for a couple of minutes with a sampling-based profiler (I used the one built into Visual Studio) and comparing the percentage of samples in the modified functions was much more consistent. To get a worst-case scenario and maximise the observable overhead, I generated a scenegraph where the same basic cube is drawn thousands of times cycling between 32 different materials, with the number of draw commands chosen to give around a hundred frames per second. Trying fewer draws and more frames just made the noise from slow functions (particularly vkCmdBeginRenderPass and vkBeginCommandBuffer) overwhelm the overhead I was trying to measure, but trying a smaller number of slower frames meant they were called less and countered this out. Other variations on this style of scenegraph produced equivalent results. As mentioned above, if I cherry-pick the fastest baseline result and slowest result from this branch, and compare the number of samples in modified functions, that only accounts for 0.25% of all samples. If I don't artificially try and make things look as bad as possible, then the difference is too small to measure.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works. They're on a vsgExamples branch - now the problem's fixed, it's not necessarily worth keeping the test app around.
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

Otherwise you can't have nice things like the same view descriptor set for all shadersets.
Avoids disturbing necessary descriptor sets while still avoiding the need for full layout compatibility checks.

This is achieved with a basic vector<bool> saying which sets a pipeline layout has and doesn't have, so layouts it doesn't have won't be bound.
This could mean different sets are bound than if a full compatibility check was done, but that should only happen when the scenegraph requested invalid API usage - there should only be one BindDescriptorSet for each slot at the top of the state stacks, and if it's for a slot the current pipeline layout needs, it must be compatible.
Therefore, we only need to worry about compatibility checks for slots the current pipeline layout doesn't use.
The VSG generates layouts like this when descriptor bindings are dependent on defines, and doesn't generate matching empty BindDescriptorSet instances to go with them.
That disrupts the fix in the previous commit, so they must be handled as if they don't exist.

Even though the pipeline layout says these sets exist, there's no need to bind anything to them as they inherently can't be statically used by the pipeline.
@AnyOldName3 AnyOldName3 changed the title Vague layout compatibility checks Cheap layout compatibility checks May 14, 2025
@robertosfield
Copy link
Collaborator

I have just done a first pass review and looks good to me so far. When I'm ready for a second pass I'll create a branch and do performance testing, I'm not expecting any issues based on my initial code reviewer :-)

@AnyOldName3
Copy link
Contributor Author

What kind of timeline can be expected for this getting the full review? It's been a month since it was opened, so it's already been quite a while.

@robertosfield
Copy link
Collaborator

I am hoping to get most of the outstanding PRs reviewed this week.

@robertosfield
Copy link
Collaborator

@AnyOldName3 could you give me a heads up on which of your PRs are ready to review and merge with VSG master and the priority/order that they would be best merged, and which ones are lower priority/for discussion ones.

I ask this as I have a whole bunch of work to do on many different development threads so it's hard for me to keep track of everything.

@AnyOldName3
Copy link
Contributor Author

I'll put a list of everything here to avoid having to figure out a new place.

Soon

#1464 (i.e. this one) is the highest-priority PR from my perspective. It also includes the changes from #1438 , so it might be easier to review that one first.

vsg-dev/vsgXchange#224 doesn't change the API or fix a bug, so nothing has an absolute dependency on it, but it makes a significant difference to text atlas generation time, so it would be nice to have sooner rather than later.

Less soon

AnyOldName3/vsgXchange#1 is a minor docs fix that I apparently opened to the wrong repo ages ago. I think it's slightly more useful than the equivalent thing someone else submitted later, but it's not particularly important now it's not replacing a placeholder.

#1296 is only a proof-of-concept, and doesn't need immediate attention.

#1419 and vsg-dev/vsgExamples#350 are drafts. I've used them to investigate problems by rebasing them and fixing up merge conflicts each time I've needed to, which is a hassle, so I'd rather we figured out a way to make them mergeable but disabled by default eventually, but it's not a priority.

I think we've given up on cgitest, so robertosfield/cgitest#1 can be ignored.

@robertosfield
Copy link
Collaborator

As a first step in review of this PR I have merged as a branch:

https://github.com/vsg-dev/VulkanSceneGraph/tree/AnyOldName3-vague-layout-compatibility-checks

I will track down the old performance tests I was doing to make sure everything is performing well with this branch. On code review it looks reasonable. I might make some small tweaks for readability, but will wait till I'm more familiar with the code.

@robertosfield
Copy link
Collaborator

I have done the bench marking tests I used earlier in the year and see a 1.2% slow down with this PR. This probably the max end of the type of slow down end users might experience.

The other improvements prior to the PR still put this PR ahead of v1.1.10 and v1.0.9 results, so while in the wrong direction don't rob us of all the gains my previous optimization work achieved.

I will do another code review this time looking for opportunities to lower the overhead.

@AnyOldName3
Copy link
Contributor Author

That's a big surprise to me as the simulated worst-case test I did should genuinely have been a worst-case but didn't have nearly that much overhead, and I compared the emitted assembly for multiple compilers to ensure that I wasn't relying on vendor-specific optimisations.

Do you have a test scene you can share so I can see what's different between it and the ones I generated? The only guess I've got is that maybe yours rebinds one material descriptor over and over and it remains in L1 (which would mean doing something explicitly to dirty it), so this branch would make it spill, whereas mine switched between a few materials so they'd always need binding, so they'd be less likely to remain in L1. Either that, or your Vulkan driver has much less overhead than mine, which would be a good thing to know about for future performance testing.

@robertosfield
Copy link
Collaborator

I don't own the test models I am using so can't share. They were imported from old OSG models so probably stress different bits of the VSG than more modern PBR etc. models.

Longer term I'd like to have a set of scene generator utilities to create stress testing models programmatically so won't can create a range of tests and scale them up/down as appropriate.

The code looks like it can be optimized - there more if statements and variable sized containers than I think is required or optimal. The first step I've taken on this path is to reorgnize the checks into two CommandBuffer::enabled(..) methods, this makes the code more readable and centralizes the checks so rewriting them won't change the code that depends upon them.

I have put my changes so far into a optimized-layout-compatibility-checks branch:

https://github.com/vsg-dev/VulkanSceneGraph/tree/optimized-layout-compatibility-checks

The changes so far are just to introduce the CommandBuffer::enabled() method: 334b98d

Next up I'm curious about using a fixed size array for the enabled slots so we can avoid the checks against the container CommandBuffer::_currentDescriptorSetSlots.size().

I'm also thinking that summing the results when iterating over the _currentDescriptorSetSlots and then comparing to the size required might avoid putting if statements in a for loop. I don't know if the test models I'm using would have BindDescriptorSet's with multiple DescriptorSet's so this might be a optimization that might not be picked up in results.

@robertosfield
Copy link
Collaborator

Now I have put the checks all into two CommandBuffer methods we focus on where the optimizations need to be made.

First up is the simple single set check:

        inline bool enabled(uint32_t firstSet) const
        {
            return static_cast<uint32_t>(_currentDescriptorSetSlots.size()) > firstSet && _currentDescriptorSetSlots[firstSet];
        }

Getting the container size and doing the comparison could be replaced by just having a fixed sized _currentDescriptorSetSlots that is big enough for all possibilities. A first step would be to have a std::array or just use a 32 bit mask.

The multiple set variant is more complex:

       inline bool enabled(uint32_t firstSet, uint32_t numSets) const
        {
            if (static_cast<uint32_t>(_currentDescriptorSetSlots.size()) < firstSet + numSets)
                return false;
            for (uint32_t slot = firstSet; slot < firstSet + numSets; ++slot)
            {
                if (!_currentDescriptorSetSlots[slot])
                    return false;
            }
            return true;
        }

It has a three separate conditionals that affect control flow. We could avoid the for loop and all the conditionals by replacing it all with a set of bit mask operations.

@AnyOldName3
Copy link
Contributor Author

The implementation here's all vector of bool stuff, so it's already all bitmask operations because std::vector is specialised for bool to be a bitset. E.g. commandBuffer.getCurrentDescriptorSetSlots()[firstSet] should just turn into a couple of instructions to select the right address (which there are a lot more of for the other variables before the vkCmdBindDescriptorSets call) and then a single bt instruction to extract the right bit.

I'd expected the loop in BindDescriptorSets::record to be unrolled into a bitmask operation that would at least check a whole word worth of slots at once (which would mean only one loop iteration unless an application used more than 32 slots), and I thought I'd determined that that was happening when I looked at Godbolt, but either I changed the code since I last looked and it's disabled the optimisation or I misread the assembly as that doesn't look like it's happening for any of the major compilers with the code as it exists on my branch right now.

If we're happy to limit the VSG to 32 or 64 descriptor slots instead of the uint32_t that the Vulkan spec permits, then the std::vector<bool> could be swapped for a std::bitset<32>, which shaves a few instructions off the assembly.

Annoyingly, that's still not enough to convince the big three compilers to fully unroll the loop in BindDescriptorSets::record, but I expect I can have that working in a few minutes now I've seen it's a problem.

@robertosfield
Copy link
Collaborator

I have replaced the use of std::vector with a uin32_t and bitmask maths and performance on my test models is back to where it was, at 1.4% faster than the original PR. The changes require 2 less lines of code as well. The changes are in commit: 32eb8c4.

So the new code works faster and in my preliminary testing I can't see any regressions, but I'm just testing with standard models that worked prior to this code anyway. So my next step is check the layout tests. I don't recall them off the top of my head so will have to look back over past communications.

@AnyOldName3 could you do a code review of my https://github.com/vsg-dev/VulkanSceneGraph/tree/optimized-layout-compatibility-checks branch? Thanks.

@AnyOldName3
Copy link
Contributor Author

At a first glance, I'd worry it might be slower as moving the checks to inline functions in the CommandBuffer definition stopped them being inlined with MSVC. inline in C++ just means that duplicate identical definitions are legal, not that any inlining will happen.

@robertosfield
Copy link
Collaborator

inlining is hint to the compiler to inline to code where possible, if it's a f*cking awful compiler that can't inline very simple code, that this code is, then it's utter shameful mess that everyone should steer clear of.

This new code is written to be VERY easy for the compiler to optimize. No if statements, no loops, just simple bit math. This is how you should have written the code in the first place if you had wanted it be efficient.

I have merged your vsgincompatiblepipelinelayouts example as branch of vsgExamples and updated it to compile against the recent changes to the built-in ShaderSets. Further details on: vsg-dev/vsgExamples#359

I've now done as much as I can do to get this work ready to be merged with VSG master. I now need you to provide some guidance on how to test using vsgincompatiblepipelinelayouts and to testing yourself.

@AnyOldName3
Copy link
Contributor Author

It turns out that Godbolt had changed some of my compiler flags at some point, so inlining of non-templates was disabled. Either way, now that's sorted out, I get nearly identical assembly with all three major compilers with your uint32_t-based implementation as the std::bitset-based implementation I was working on myself at the same time. That implementation's available at https://github.com/AnyOldName3/VulkanSceneGraph/tree/optimized-layout-compatibility-checks-bitset.

The std::bitset-based implementation currently works, but the uitn32_t-based one doesn't. I've isolated the problem to PipelineLayout::compile. It's not setting the $n$th bit when dealing with descriptor set $n$, it's setting the bit that's $total - n$ from the LSB. I've pushed a fixed implementation to https://github.com/AnyOldName3/VulkanSceneGraph/tree/optimized-layout-compatibility-checks-fixed.

I'll do performance testing with the scenegraphs I tested before, but I'm not anticipating any problems. On all the platforms I tried, the vector<bool> implementation was already a bitset and the final bit extraction was done with the bt instruction, so all that's changed is how it decides which address to extract a bit from, and that the size check has been removed.

@robertosfield
Copy link
Collaborator

I have made a similar change to PipelineLayout.cpp as (your](AnyOldName3@1917a35) fix to the ordering of slot assignment.

I chose to stick with a accumulating the result in a temporary variable then assigning the completed result to avoid any potential threading issues as CPUs guarantee consistent results on assignment at the word level. The commit is: 3a97706

I've reviewed your changes to use bitset<32> and this is close my uin32_t implementation, it's a bit cleaner to read but adds an extra header include and templates for the compiler to parse. Performance wise I expect it should end up the same as uint32_t as it won't pay the penalty for dynamic allocations, copying overhead, pointer references and extra conditions that the variable sized vector will have had.

@AnyOldName3
Copy link
Contributor Author

I can confirm that that branch works, and using the same tests as before, the overhead is around 0.14%, so a bit better than the branch from this PR.

@AnyOldName3
Copy link
Contributor Author

It's been a couple of weeks since I confirmed that the optimized-layout-compatibility-checks branch works. As far as I'm aware, there isn't anything else I'm supposed to be doing to move it forward, so I'm still waiting for you to finish your performance tests and merge it.

@robertosfield
Copy link
Collaborator

robertosfield commented Jul 3, 2025 via email

@robertosfield
Copy link
Collaborator

I have done more testing comparing master with AnyOldName3-vague-layout-compatibility-checks (vector) and optimized-layout-compatibility-checks (uint32_t based), and depending upon the test model I see between 0.5 to 3% slow down.

The 3% slow down for an OpenFlight city model that contains billboarded trees that will be dropped into the transparent bin, so I think this is the sticking point. I haven't dug into specific causes of the slow down so can't suggest possible optimizations to resolve it yet.

What I am thinking is that we need a couple of tools to help with testing:

  1. Analysis of datasets to provide a guideline on the make up of taxing datasets
  2. Synthetic scene creation to create taxing datasets
  3. Benchmarking tool to provide a form of unit test, as well as report views with max frame loads

Such capabilities are open ended task that one could easily spend months working, but for the purposes of catching slow downs and helping figure out bottlenecks we need something sooner so will need to start very simple.

I'm thinking of starting simple terrain grid with trees distributed over it and camera path to do the unit tests with. If it's synthetic then there will be no issues with copyright or hosting databases.

@AnyOldName3
Copy link
Contributor Author

AnyOldName3 commented Jul 7, 2025

When I've been doing performance testing, I've generally found that using any real data has given me much, much smaller impacts than you've had from the same changes, typically far too small to measure against the variance I see from run to run, so to try and get results that indicate what you'll see, I've been designing scenegraphs that hammer the modified code paths as often as possible, and then altering them so different iterations read different data to see how much impact that has - obviously running a hundred instructions in a loop will go at different speeds if it's only reading from L1 cache. I've got a patch for vsgviewer that adds flags to duplicate either the whole scenegraph or nodes within it.

If I don't add lots of duplicates of the same data, then the frametime is nearly entirely spent in vkBeginCommandBuffer and vkBeginRenderpass, which take nearly a millisecond for each call on my driver. I see that they're spending most of their runtime in NtGdiDdDDILock2, which is an undocumented NT function, but it's got lock in the name, so I think it's a safe bet that there's some blocking synchronisation between the usermode and kernel mode parts of the driver. One thought is that the VSG could be feeding commands to the driver faster than it can deal with them - if the whole scenegraph is in cache, then we can generate commands quickly, but it doesn't eliminate the need for the driver to build the command buffer in RAM and then copy it to VRAM, so we're always going to be constrained by PCIe bandwidth with dedicated GPUs even if the record traversal is instant. It could also be a more tractable problem with how we're setting up queues or command pools etc., so I've been intending to investigate it for a while.

Either way, 3% seems like a very large impact given what I've been able to measure. The transparent bin shouldn't mean the modified code was running more often than in my tests, as at least one of them included rebinding the pipeline and every descriptor set between each draw, just like a renderbin would. I don't think as much as 3% of instructions executed or memory accessed would be things that weren't happening before based on what I saw when looking at the assembly etc., but there are confounding factors like whether we're pushing things out of the cache that the driver was using, and that's not going to be consistent between hardware vendors.

I'll investigate further with translucent drawables and see if I can get different results from my previous testing.

@robertosfield
Copy link
Collaborator

I'm seeing 3% slow down on a real-world model, it's a consistent enough result to know it's not just noise. I don't yet know the specific cause of the bottleneck and until I have a performance benchmark written that reproduces it reliably there is no point speculating on it's cause or how to resolve it.

It really sucks that I'm seeing this performance regression because I really could without yet another unpaid task to follow up on before I can get on with tagging the next point release. However, a 3% regression for resolving a niche issue isn't something I can just let by. Good performance can die from a thousand paper cuts.

@AnyOldName3
Copy link
Contributor Author

I've redone my testing of this branch, the optimised version based on bitmasks and the original branch from #1438 both are based on, but with everything rebased on master so my results aren't thrown off by the command pool reset issue anymore. However, I'm still only seeing sub-half-a-percent overhead from either approach (I can't get a more precise figure as the variance between runs is more than that), so it means the command pool issue can't have been what was causing me to be unable to replicate your results. That's with and without alpha blending forcing objects into a bin, too, so it can't just be that.

One more possibility is that maybe some of your test scenes use vsg::BindDescriptorSets instead of just vsg::BindDescriptorSet. The check, particularly without the bitmask optimisation, is more complicated than the one for binding a single descriptor set, so would be expected to be more expensive. I've not included this in my testing as binding multiple descriptor sets at once is generally a sign that they should be one larger descriptor set, so it's not something I'd expect performance-sensitive apps to do.

Obviously, it'll be less hassle for both of us going forward if we can figure out why your measurements are typically massively different to mine, and I was hoping that fixing the command pool reset issue would be the silver bullet that made things consistent.

By the way, you mentioned in #1538 that you'd merged the version of this with the bitmask optimisation, but it's not clear from your phrasing whether you just meant into the https://github.com/vsg-dev/VulkanSceneGraph/compare/optimized-layout-compatibility-checks branch or whether that branch was supposed to be merged into master a while ago and you forgot it hadn't been. I'm not sure whether the 3% figure was only for the vector-of-bool implementation and the uint32_t implementation was sufficiently fast or whether they're both too slow and I'm still supposed to be investigating further optimisation possibilities.

@robertosfield
Copy link
Collaborator

As a quick note, I'm really busy with client work right now so can't look at this again right away. When I get back to it I'll complete by vsgperformance testbed to reproduce the performance issues with procedural generated scenes, after than I can start comparing the different branches to acess the performance impact and if any other optimizations are warranted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Descriptor sets aren't rebound when pipeline layout change 'disturbs' them

2 participants