Material system: Implement GPU occlusion culling #1180
Conversation
The sphere projection function is apparently giving incorrect results, since I'm pretty sure the world->view space conversion is correct. I'm also not seeing much of a difference in frame time between this change and frustum culling only: the frame time I gain from issuing fewer draws is nullified by the hi-z pass. It could be improved a bit by doing the depth buffer downsampling in one compute dispatch, but it might just not be worth it. On the other hand, the hi-z buffer could be used for some other effects too.
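The downsampling mentioned here can be sketched on the CPU as a max-reduction mip chain (assuming conventional depth where larger values are farther; a reversed-Z buffer would use min instead). This is an illustrative sketch, not the engine's actual GPU implementation, which runs as compute dispatches:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Build a max-reduction mip chain from a depth buffer. Each mip texel takes
// the farthest depth of its 2x2 footprint so the occlusion test stays
// conservative; edge texels are clamped to handle odd dimensions.
std::vector<std::vector<float>> BuildDepthMips( const std::vector<float>& depth, int w, int h ) {
	std::vector<std::vector<float>> mips{ depth };
	while ( w > 1 || h > 1 ) {
		int nw = std::max( w / 2, 1 );
		int nh = std::max( h / 2, 1 );
		std::vector<float> next( nw * nh );
		for ( int y = 0; y < nh; y++ ) {
			for ( int x = 0; x < nw; x++ ) {
				float d = 0.0f;
				for ( int dy = 0; dy < 2; dy++ ) {
					for ( int dx = 0; dx < 2; dx++ ) {
						// Clamp so the last row/column is still covered when
						// the previous level has an odd dimension.
						int sx = std::min( 2 * x + dx, w - 1 );
						int sy = std::min( 2 * y + dy, h - 1 );
						d = std::max( d, mips.back()[sy * w + sx] );
					}
				}
				next[y * nw + x] = d;
			}
		}
		mips.push_back( next );
		w = nw;
		h = nh;
	}
	return mips;
}
```

Doing this in one compute dispatch instead of one per level mainly saves the barrier/dispatch overhead between levels; the arithmetic stays the same.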
Well, I'm seeing some improvements in certain cases, especially with high graphics settings. The whole thing also isn't taking too much frame time in itself anyway (the indirect draws + cull compute).
So the lower framerate that illwieckz was getting in #1137 might have to do with something else. Ideally we could also use better synchronization, but I'm not sure that would be possible.
Possibly rasterizing the bounding boxes with colour/depth writes disabled to determine object visibility would be faster. The depth pre-pass is super fast wherever I looked at it here, so this should be fast too. It does, however, mean that object visibility would lag behind by a frame, but that's already happening with this hi-z algorithm. It would also require a bunch of writes to a buffer from every rasterized fragment that passed the depth test, which is not ideal. Perhaps a custom rasterizer in a compute shader that writes the visible object ids once per triangle or batch of triangles would be better.
There are a few more things that could be improved independent of occlusion culling that might make this work faster in every case anyway:
Another potential advantage of a compute rasterizer is assigning light sources to a tile/cluster buffer, though with low light counts it might not make much of a difference.
Doing hardware rasterization of BBoxes and writing to a buffer from the fragment shader with an early discard doesn't work well for some reason. |
I believe this fixes the incorrect projection. However, there is still some incorrect culling, which might be due to incorrect depth downsampling with even/odd dimensions; I haven't checked yet.
This mostly works now; however, there's still an error with AABB depth projection, and some things need to be un-hardcoded.
Force-pushed from e76f9b5 to 418997a.
Force-pushed from 12b74aa to bac765b.
Force-pushed from bac765b to 7c49392.
Force-pushed from 7c49392 to cff409a.
Added some comments and made …
Force-pushed from cff409a to ad7922d.
Rebased.
Force-pushed from ad7922d to 7500d84.
Cleaned up a bit, and added a toggle through …
Force-pushed from 7500d84 to 8a4fd8f.
Lol I wanted to approve #1137 instead… 🤦‍♀️
How can data generated on the previous frame be valid? Everything could be in a different position or viewed from a different angle.
That's a mistake in the PR description; what's meant is that the results of the culling are used on the next frame, i.e. double-buffered (the depth buffer from the current frame is used).
Fixed the PR description.
Force-pushed from 8a4fd8f to bd6236d.
surfaceCoords *= ( boundingBox.xy + boundingBox.zw ) * 0.5;
const ivec2 surfaceCoordsFloor = ivec2( clamp( surfaceCoords.x, 0.0, float( u_ViewWidth >> level ) ), clamp( surfaceCoords.y, 0.0, float( u_ViewHeight >> level ) ) );
ivec4 depthCoords = ivec4( surfaceCoordsFloor.xy,
	surfaceCoordsFloor.x + ( surfaceCoords.x - surfaceCoordsFloor.x >= 0.5 ? 1 : -1 ),
The object won't always occupy 2 pixels in both dimensions, right? It could have some oblong shape and be 1 pixel wide in the x axis and 2 in the y axis for example. I understand this is not a bug as the worst thing that could happen here is a surface not being culled, but I wonder if there isn't another way to do the calculation.
That can happen sometimes, yes. I've tried wrapping the texel fetches in if statements so only the first one is guaranteed and the next ones are only done if the new coordinates don't match any of the previous ones, but some surfaces became invisible when they shouldn't be. Not sure why as I couldn't spot what the issue was. This can be improved later though.
There's also an extension that would allow doing this with just one regular texture sample, but the availability of that extension is not high enough to be the only option.
Ah wait, it wasn't working properly because I accidentally used min instead of max. Anyway though, with that change it seems to be a tiny bit slower (by <0.1 ms). Any consecutive texel fetches with the same coordinates will likely just come from the same texture cache anyway.
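The neighbour-selection logic under discussion can be sketched in C++ as follows (hypothetical names; the actual shader fetches from the hi-z image): on each axis the second texel is the one the fractional part rounds towards, and the four depths are combined with max so the visibility test stays conservative.

```cpp
#include <algorithm>
#include <cassert>

// Fetch the conservative (farthest) depth of a 2x2 texel footprint around
// fractional coordinates (sx, sy) in a w x h mip level stored row-major.
float SampleDepthFootprint( const float* mip, int w, int h, float sx, float sy ) {
	// Base texel: the one containing the sample point, clamped to the image.
	int x0 = std::clamp( static_cast<int>( sx ), 0, w - 1 );
	int y0 = std::clamp( static_cast<int>( sy ), 0, h - 1 );
	// Neighbour on each axis: the texel the fractional part rounds towards.
	int x1 = std::clamp( x0 + ( sx - x0 >= 0.5f ? 1 : -1 ), 0, w - 1 );
	int y1 = std::clamp( y0 + ( sy - y0 >= 0.5f ? 1 : -1 ), 0, h - 1 );
	// Max of the four depths keeps the test conservative: a surface is only
	// culled if it is behind the farthest depth it might touch.
	float d = mip[y0 * w + x0];
	d = std::max( d, mip[y0 * w + x1] );
	d = std::max( d, mip[y1 * w + x0] );
	d = std::max( d, mip[y1 * w + x1] );
	return d;
}
```

When the clamped neighbour collapses onto the base texel, the redundant fetches hit the same texture cache line, which matches the observation above that skipping them buys almost nothing.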
Force-pushed from 94b0116 to 164072d.
Ah, yep, that should be fixed now.
I've added a portal readback as well: now instead of just calling …
Force-pushed from 164072d to 0bfd7d2.
Adding …
Could we get a comment somewhere with an overview of how the double buffering works and what its benefit is? Like, is there an async API for the compute shaders? If world surfaces are double-buffered but models are not, how do we keep them in sync when rendering?
struct PortalView {
	uint32_t count;
	drawSurf_t* drawSurf;
	uint32_t views[MAX_VIEWS];
What do these represent? Portals directly visible from this one?
Yep. Some of it can probably be made simpler.
Force-pushed from 0bfd7d2 to 2dad681.
There's this graph from #1137 which holds true here as well:
LGTM, but I would still like to understand the double buffering thing. I guess I never really understood the answer to this previous conversation:
I don't get how surfaces can be double buffered without also buffering everything else that may be rendered (particle effects, 2D drawing, etc.). If you used the surface visibility from an earlier time then the player could have moved to a point that makes some of them newly visible.
Add occlusion culling on the GPU to the material system. This works by creating a depth-reduction image from the depth buffer, then using it on the next frame. The bounding sphere of each surface is projected into a screen-space AABB, then its depth is tested against the depth of 4 texels in the depth image at the appropriate mip level, determined by the size of the AABB on screen. This doesn't include portal views, because they use the stencil buffer, so doing depth reduction and checking against the depth image there would be complicated and would likely require some sort of visibility buffer. Also added portal visibility readback: cull_cp will now cull portal surfaces (if there are any) in addition to regular surfaces, and the CPU will read the distance data (-1 if culled) on the next frame to better determine potentially visible views.
Force-pushed from 2dad681 to 171c694.
(rebased)
Technically yes, but practically the fact that it's only one frame of difference makes it more or less impossible to notice, even when moving around fast. So far I've not seen any discrepancies when using the engine with this change. It's also aided by the fact that the cull uses the bounding spheres of the surfaces (either directly or by projecting an AABB from them), so the culling is somewhat coarse.
Implement occlusion culling in compute shaders using a Hierarchical Depth-Buffer.
The depth buffer is processed before the post-processing pass, reducing it mip level by mip level until a 1x1 resolution is reached.
It's then used to find out whether a surface that passed frustum culling is behind opaque surfaces. This is done by computing the screen-space AABB from the bounding sphere of the surface, then comparing its depth to the depth in the appropriate level of the depth-reduction image. The level is chosen based on how much screen space the AABB takes up.
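The level choice described above can be sketched with a hypothetical helper (not the engine's exact shader code): pick the mip level at which the larger screen extent of the AABB fits within two texels, so a 2x2 footprint of four fetches covers it.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Choose the depth-reduction mip level for a screen-space AABB given in
// pixels. At level L one texel spans 2^L pixels, so this returns the
// smallest L where the AABB's larger extent covers at most two texels.
int ChooseDepthMipLevel( float widthPx, float heightPx ) {
	float extent = std::max( { widthPx, heightPx, 1.0f } );
	// extent / 2^level <= 2  =>  level >= log2( extent / 2 )
	return std::max( 0, static_cast<int>( std::ceil( std::log2( extent * 0.5f ) ) ) );
}
```

For example, an AABB spanning 8 pixels lands at level 2 (4-pixel texels, so the box covers exactly two of them), while sub-pixel surfaces clamp to level 0.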