
Material system: Implement GPU occlusion culling #1180

Merged — 1 commit merged into DaemonEngine:master on Aug 11, 2024

Conversation

@VReaperV (Contributor) commented Jun 4, 2024

Implement occlusion culling in compute shaders using a Hierarchical Depth-Buffer.

The depth buffer is processed before the post-processing pass, reducing it mip level by mip level until a 1x1 resolution is reached.
The result is then used to determine whether a surface that passed frustum culling is hidden behind opaque surfaces. This is done by computing a screen-space AABB from the surface's bounding sphere and comparing its depth to the depth stored in the appropriate level of the depth-reduction image. The level is chosen based on how much screen space the AABB covers.
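
A minimal GLSL sketch of that test, assuming a hypothetical depth-reduction sampler named u_DepthMap and a standard 0-near/1-far depth convention; the actual uniforms and clamping in cull_cp.glsl may differ:

uniform sampler2D u_DepthMap; // depth-reduction (hi-z) image with a full mip chain

bool SurfaceOccluded( vec4 boundingBox, float surfaceMinDepth ) {
	// boundingBox holds the screen-space AABB as ( min.xy, max.xy ) in pixels.
	vec2 extent = boundingBox.zw - boundingBox.xy;

	// Pick the mip level where the AABB spans roughly one texel, so a small
	// fixed number of fetches covers the whole box.
	int level = max( int( ceil( log2( max( extent.x, extent.y ) ) ) ), 0 );

	// Sample the 2x2 footprint around the AABB centre and keep the farthest depth.
	ivec2 coords = ivec2( ( boundingBox.xy + boundingBox.zw ) * 0.5 ) >> level;
	float occluderDepth = 0.0;
	for ( int y = 0; y <= 1; y++ ) {
		for ( int x = 0; x <= 1; x++ ) {
			occluderDepth = max( occluderDepth, texelFetch( u_DepthMap, coords + ivec2( x, y ), level ).r );
		}
	}

	// Occluded only if the surface's nearest point is behind the farthest occluder.
	return surfaceMinDepth > occluderDepth;
}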

@VReaperV (Contributor Author)

The sphere projection function is apparently giving incorrect results, since I'm fairly sure the world-to-view-space conversion is correct. I'm also not seeing much difference in frame time between this change and frustum culling only: the frame time won by issuing fewer draws is cancelled out by the hi-z pass. It could be improved a bit by doing the depth-buffer downsampling in a single compute dispatch, but that might not really be worth it. On the other hand, the hi-z buffer could also be used for some other effects.

@VReaperV (Contributor Author)

Well, I'm seeing some improvements in certain cases, especially with high graphics settings. The whole thing (the indirect draws + cull compute) also isn't taking much frame time by itself anyway.

@VReaperV (Contributor Author)

So the lower framerate that illwieckz was getting in #1137 might have to do with something else. Ideally we could also use better synchronization, but I'm not sure that would be possible.

@VReaperV (Contributor Author)

Rasterizing the bounding boxes with colour/depth writes disabled to determine object visibility might possibly be faster. The depth pre-pass is very fast everywhere I've measured it here, so this should be fast too. It does, however, mean that object visibility would lag behind by a frame, but that's already the case with this hi-z algorithm.

It would also require writes to a buffer from every rasterized fragment that passes the depth test, which is not ideal. A custom rasterizer in a compute shader that instead writes visible object IDs once per triangle, or per batch of triangles, might be better.
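
For illustration, a minimal fragment-shader sketch of that bounding-box approach; not part of this PR, and the buffer name, binding and in_ObjectID input are hypothetical:

layout(early_fragment_tests) in; // run the depth test before this shader executes

layout(std430, binding = 0) buffer visibilitySSBO {
	uint visible[]; // one flag per object, cleared to 0 at the start of the frame
};

flat in uint in_ObjectID; // passed through from the bounding-box vertex shader

void main() {
	// Any bounding-box fragment that survives the depth test marks the whole
	// object as visible; colour writes stay disabled for this pass.
	visible[in_ObjectID] = 1u;
}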

@VReaperV (Contributor Author)

There are a few more things that could be improved independently of occlusion culling, and that might make this work faster in every case anyway:

  1. Use batches of e.g. 64 triangles instead of specifying whole surfaces, which would allow dropping the processSurfaces_cp shader and doing the writes in the cull_cp shader instead.
  2. That would also allow creating an index buffer instead of an indirect buffer, for the most part. There would still be one draw per material, but that draw would contain all the triangles that weren't culled.
  3. With triangle batches, culling at the batch level might speed things up even more, e.g. with cone culling (see the sketch after this list).
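
For the third point, a minimal GLSL sketch of per-batch cone culling as it is commonly formulated for small triangle batches; the TriangleBatch struct and its fields are hypothetical, not code from this PR:

struct TriangleBatch {
	vec3 center;      // bounding-sphere centre of the batch
	float radius;     // bounding-sphere radius
	vec3 coneAxis;    // average facing direction of the batch's triangles
	float coneCutoff; // cosine of the normal cone's half-angle
};

bool BatchBackfaceCulled( TriangleBatch batch, vec3 cameraPos ) {
	// All triangles face away from the camera when the view vector lies inside
	// the cone opposite the normal cone; the sphere radius keeps the test
	// conservative for triangles offset from the batch centre.
	vec3 toBatch = batch.center - cameraPos;
	return dot( toBatch, batch.coneAxis ) >= batch.coneCutoff * length( toBatch ) + batch.radius;
}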

@VReaperV (Contributor Author)

Another potential advantage of a compute rasterizer is assigning light sources to a tile/cluster buffer, though with low light counts it might not make much of a difference.

@VReaperV (Contributor Author)

Doing hardware rasterization of BBoxes and writing to a buffer from the fragment shader with an early discard doesn't work well for some reason.

@VReaperV (Contributor Author)

I believe this fixes the incorrect projection. However, there is still some incorrect culling, which might be due to incorrect depth downsampling with even/odd dimensions; I haven't checked yet.

@VReaperV (Contributor Author)

This mostly works now; however, there's still an error in the AABB depth projection, and some things need to be un-hardcoded.

@VReaperV VReaperV force-pushed the gpu-occlusion-culling branch 4 times, most recently from e76f9b5 to 418997a on July 27, 2024 12:19
@VReaperV VReaperV marked this pull request as ready for review July 27, 2024 12:20
@VReaperV VReaperV force-pushed the gpu-occlusion-culling branch 2 times, most recently from 12b74aa to bac765b on July 27, 2024 13:20
@VReaperV VReaperV changed the title WIP: Material system: Implement GPU occlusion culling Material system: Implement GPU occlusion culling Jul 27, 2024
src/engine/renderer/Material.cpp (review comment: fixed, resolved)
@VReaperV (Contributor Author)

Added some comments and made r_lockPVS 1 work with occlusion culling.

@VReaperV (Contributor Author) commented Jul 29, 2024

Rebased.

@VReaperV (Contributor Author)

Cleaned up a bit, and added a toggle through r_gpuOcclusionCulling.

@illwieckz (Member) previously approved these changes on Jul 30, 2024 and left a comment


I'm not sure I understand everything, but overall it looks good to me, and I confirm it fixes many bugs.

I would appreciate some better spacing:

(screenshot: 20240730-215050-000 spaces)

I also noticed that some new GLSL files use spaces or seem to mix tabs and spaces; I would prefer tabs.

@illwieckz illwieckz added the T-Improvement (Improvement for an existing feature) and A-Renderer labels Jul 30, 2024
@illwieckz (Member)

Lol I wanted to approve #1137 instead… 🤦‍♀️️

@slipher (Member) commented Aug 2, 2024

This works by creating a depth-reduction image from the depth buffer, then using it on the next frame.

How can data generated on the previous frame be valid? Everything could be in a different position or viewed from a different angle.

@VReaperV (Contributor Author) commented Aug 2, 2024

How can data generated on the previous frame be valid? Everything could be in a different position or viewed from a different angle.

That's a mistake in the PR description; what is meant is that the results of the culling are used on the next frame, i.e. double-buffered (the depth buffer itself is from the current frame).

@VReaperV (Contributor Author) commented Aug 2, 2024

Fixed the PR description.

src/engine/renderer/tr_image.cpp (review comment: outdated, resolved)
src/engine/renderer/gl_shader.h (review comment: resolved)
src/engine/renderer/glsl_source/cull_cp.glsl (review comment: outdated, resolved)
surfaceCoords *= ( boundingBox.xy + boundingBox.zw ) * 0.5;
const ivec2 surfaceCoordsFloor = ivec2( clamp( surfaceCoords.x, 0.0, float( u_ViewWidth >> level ) ), clamp( surfaceCoords.y, 0.0, float( u_ViewHeight >> level ) ) );
ivec4 depthCoords = ivec4( surfaceCoordsFloor.xy,
surfaceCoordsFloor.x + ( surfaceCoords.x - surfaceCoordsFloor.x >= 0.5 ? 1 : -1 ),
Member


The object won't always occupy 2 pixels in both dimensions, right? It could have some oblong shape and be 1 pixel wide in the x axis and 2 in the y axis for example. I understand this is not a bug as the worst thing that could happen here is a surface not being culled, but I wonder if there isn't another way to do the calculation.

@VReaperV (Contributor Author)


That can happen sometimes, yes. I tried wrapping the texel fetches in if statements so that only the first one is always done and the later ones are only done if the new coordinates don't match any of the previous ones, but some surfaces then became invisible when they shouldn't have. Not sure why, as I couldn't spot the issue. This can be improved later, though.

@VReaperV (Contributor Author) commented Aug 6, 2024


There's also an extension that would allow doing this with just a single regular texture sample, but that extension isn't widely available enough to be the only option.

@VReaperV (Contributor Author) commented Aug 6, 2024


Ah wait, it wasn't working properly because I accidentally used min instead of max. Even with that fixed, though, the variant seems to be a tiny bit slower (by <0.1 ms). Consecutive texel fetches with the same coordinates will likely just hit the texture cache anyway.
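
For reference, a sketch of the variant discussed here, reusing the depthCoords and level values from the snippet above but with a hypothetical u_DepthMap sampler: later fetches are skipped when their coordinates duplicate an earlier one, and the results are combined with max().

// Note: in GLSL, != on ivec2 compares the vectors as a whole and yields a bool.
float occluderDepth = texelFetch( u_DepthMap, depthCoords.xy, level ).r;
if ( depthCoords.zy != depthCoords.xy ) {
	occluderDepth = max( occluderDepth, texelFetch( u_DepthMap, depthCoords.zy, level ).r );
}
if ( depthCoords.xw != depthCoords.xy ) {
	occluderDepth = max( occluderDepth, texelFetch( u_DepthMap, depthCoords.xw, level ).r );
}
if ( depthCoords.zw != depthCoords.xy && depthCoords.zw != depthCoords.zy && depthCoords.zw != depthCoords.xw ) {
	occluderDepth = max( occluderDepth, texelFetch( u_DepthMap, depthCoords.zw, level ).r );
}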

src/engine/renderer/Material.h (review comment: outdated, resolved)
src/engine/renderer/Material.cpp (review comment: outdated, resolved)
src/engine/renderer/Material.cpp (review comment: resolved)
src/engine/renderer/Material.cpp (review comment: outdated, resolved)
@VReaperV VReaperV force-pushed the gpu-occlusion-culling branch 3 times, most recently from 94b0116 to 164072d on August 6, 2024 21:24
@VReaperV (Contributor Author) commented Aug 6, 2024

I also noticed some new GLSL files use spaces or seem to mix tabs and spaces, I would prefer tabs.

Ah, yep, that should be fixed now.

@VReaperV (Contributor Author) commented Aug 6, 2024

I've added a portal readback as well: now, instead of just calling R_MirrorViewBySurface() for every portal surface after sorting them by distance each frame, the cull_cp shader culls portal surfaces in addition to regular ones, and the CPU reads the results back on the next frame and uses them to determine which portals might be visible.
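
A minimal sketch of what the shader side of that readback could look like, assuming a hypothetical portalSurfacesSSBO layout with one distance per portal; the actual buffer in cull_cp.glsl may be organised differently:

layout(std430, binding = 1) buffer portalSurfacesSSBO {
	float portalDistances[]; // one entry per portal surface, read back by the CPU next frame
};

void WritePortalResult( uint portalID, bool culled, float distanceToCamera ) {
	// -1 marks a culled portal; otherwise store the distance so the CPU can
	// decide which portal views might be visible.
	portalDistances[portalID] = culled ? -1.0 : distanceToCamera;
}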

src/engine/renderer/Material.cpp (review comment: dismissed, resolved)
@VReaperV (Contributor Author) commented Aug 6, 2024

Adding R_SyncRenderThread() to GenerateWorldMaterials() also seems to have fixed crashes with r_smp 1, so I have disabled that restriction.

@slipher (Member) left a comment


Could we get a comment somewhere with an overview of how the double buffering works and what is the benefit of it? Like is there an async API for the compute shaders? If world surfaces are double-buffered but models are not, how do we keep them in sync when rendering?

struct PortalView {
	uint32_t count;
	drawSurf_t* drawSurf;
	uint32_t views[MAX_VIEWS];
Member


What do these represent? Portals directly visible from this one?

@VReaperV (Contributor Author)


Yep. Some of it can probably be made simpler.

src/engine/renderer/glsl_source/cull_cp.glsl (review comment: resolved)
src/engine/renderer/tr_main.cpp (review comment: resolved)
@VReaperV (Contributor Author) commented Aug 7, 2024

Could we get a comment somewhere with an overview of how the double buffering works and what is the benefit of it? Like is there an async API for the compute shaders? If world surfaces are double-buffered but models are not, how do we keep them in sync when rendering?

There's this graph from #1137, which holds true here as well:
(graph omitted)
The benefit is not introducing stalls with extra synchronisation, so the compute shaders can keep executing alongside the regular pipeline (including swapBuffers()). Compute shaders are asynchronous by nature, so nothing extra needs to be done for that.
Double-buffering only applies to which surfaces are rendered, not how they are rendered. And since world and model surfaces are culled in completely different ways, synchronisation between them does not make sense.

@slipher (Member) commented Aug 9, 2024

LGTM, but I would still like to understand the double buffering thing. I guess I never really understood the answer to this previous conversation:

How can data generated on the previous frame be valid? Everything could be in a different position or viewed from a different angle.

That's a mistake in the PR description; what is meant is that the results of the culling are used on the next frame, i.e. double-buffered (the depth buffer itself is from the current frame).

I don't get how surfaces can be double buffered without also buffering everything else that may be rendered (particle effects, 2D drawing, etc.). If you used the surface visibility from an earlier time then the player could have moved to a point that makes some of them newly visible.

Add occlusion culling on the GPU to the material system. This works by creating a depth-reduction image from the depth buffer and using it to cull surfaces, with the culling results consumed on the next frame. The bounding sphere of each surface is projected into a screen-space AABB, and its depth is tested against the depth of 4 texels in the depth image at the appropriate mip level, determined by the size of the AABB on screen. This doesn't include portal views, because they use the stencil buffer, so doing depth reduction and checking against the depth image there would be complicated and would likely require some sort of visibility buffer.

Also added portal visibility readback: cull_cp will now cull portal surfaces (if there are any) in addition to regular surfaces, and the CPU will read the distance data (-1 if culled) on the next frame to better determine potentially visible views.
@VReaperV (Contributor Author)

(rebased)

@VReaperV (Contributor Author)

If you used the surface visibility from an earlier time then the player could have moved to a point that makes some of them newly visible.

Technically yes, but in practice the fact that it's only one frame of difference makes that more or less impossible, even when moving around fast. So far I haven't seen any discrepancies when using the engine with this change.

It's also helped by the fact that the cull uses the surfaces' bounding spheres (either directly or by projecting an AABB from them), so the culling is somewhat coarse.

@VReaperV VReaperV merged commit 4f3468d into DaemonEngine:master Aug 11, 2024
9 checks passed
@VReaperV VReaperV deleted the gpu-occlusion-culling branch August 11, 2024 19:36
Labels
A-Renderer, T-Improvement (Improvement for an existing feature)
Projects
Status: Done
Development
No issues linked.

3 participants