
wip: 16bit shader conversions #1581

Draft · wants to merge 50 commits into master

Conversation

@Julusian (Member) commented Oct 7, 2024

This is something I started late last year but haven't had the motivation to finish, so I am pushing it here in case someone wants to use it as inspiration or to copy pieces from.

The work here was focused on SDR 16bit compositing. At some point that would have evolved to HDR, but that hasn't been considered yet. The intention was to get lossless SDR 10bit YUV through the system, rather than the slightly lossy flow we have today.

The basic design was, on the producer side, to replace the point where we tell OpenGL to copy a buffer into a texture with an OpenGL compute shader. This would allow us to do YUV->RGB conversion, and even to unpack certain common packed formats, such as the DeckLink yuv10 packing. This was not implemented yet.
It would also mean that the existing colour format handling code could be removed from the composite shader.
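To make that concrete, here is a minimal sketch of what such an upload shader could look like. This is illustrative only: the names, bindings and matrix coefficients are my assumptions (this code is not in the branch), and the format-specific unpack step is left as a placeholder.

```cpp
// A minimal sketch of the producer-side idea, assuming a packed 10-bit
// YUV source in an SSBO and an RGBA16F destination texture. All names
// and bindings here are hypothetical.
static const char* upload_compute_src = R"glsl(
    #version 430
    layout(local_size_x = 8, local_size_y = 8) in;
    layout(std430, binding = 0) readonly buffer src_buffer { uint words[]; };
    layout(rgba16f, binding = 0) uniform writeonly image2D dst;

    void main()
    {
        ivec2 pos = ivec2(gl_GlobalInvocationID.xy);
        // Format-specific unpack of the sample at `pos` goes here
        // (e.g. the DeckLink yuv10 packing). Placeholder value:
        vec3 ycbcr = vec3(0.0625, 0.5, 0.5);
        // Limited-range BT.709 YCbCr -> RGB (approximate coefficients),
        // done once here so the composite shader no longer needs to.
        vec3 rgb = mat3(1.1644,  1.1644, 1.1644,
                        0.0,    -0.2132, 2.1124,
                        1.7927, -0.5329, 0.0)
                   * (ycbcr - vec3(0.0625, 0.5, 0.5));
        imageStore(dst, pos, vec4(rgb, 1.0));
    }
)glsl";
```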

The hope was that doing the conversion here (where OpenGL is likely already copying the buffer and rearranging bytes) would have minimal cost in memory and GPU power. I was trying to avoid doing this on the CPU, as in my experience the CPU is typically under higher pressure (decoding video and deinterlacing). Compute shaders are supported in our current minimum OpenGL version.

On the consumer side, the intention was to do something similar, using a compute shader to do the final copy from the composited texture into the buffer that is copied into CPU memory.
The intention is that the key_only and subregion options in the DeckLink consumer would move into this converter, so that only the subregion needs to be converted and downloaded from the GPU, while at the same time allowing other consumers to support the same flows with very little additional code.
This does carry a risk of doing more downloads from the GPU than before, but I don't expect that to be a bottleneck for anyone. Some googling suggests that PCIe has separate upload and download bandwidth, and we are much more likely to hit the limits of upload before download.
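As a rough illustration of that consumer-side flow (hypothetical names; the actual code in the branch differs), the host side could look something like this:

```cpp
// Hypothetical host-side flow for the download path: pack the composited
// texture into a v210-layout SSBO on the GPU, then map that (smaller)
// buffer back to the CPU instead of downloading the full RGBA texture.
glUseProgram(v210_pack_program);
glBindImageTexture(0, composited_texture, 0, GL_FALSE, 0, GL_READ_ONLY, GL_RGBA16F);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, v210_buffer);
// One invocation per 6-pixel group; 48 pixels per workgroup assuming
// local_size_x = 8 in the pack shader.
glDispatchCompute((width + 47) / 48, height, 1);
glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT);

glBindBuffer(GL_SHADER_STORAGE_BUFFER, v210_buffer);
auto* packed = glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, v210_size, GL_MAP_READ_BIT);
// ... hand `packed` to the DeckLink driver, then unmap ...
glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
```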

Whether the bandwidth cost will be greater than CPU conversions is easily calculable.
For 2 DeckLinks doing 10bit YUV, that equates to roughly 2x21bpp, which is less than downloading a single frame as 16bit RGBA (64bpp); this would be the situation for a channel doing fill+key. Adding any more outputs would use more bandwidth.
And perhaps more importantly, for progressive channels those buffers could be handed straight to the DeckLink driver without a copy, which would relieve CPU memory pressure compared to a CPU conversion, as well as saving cycles.
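For reference, the back-of-envelope arithmetic behind that claim (my numbers, assuming v210's packing of 6 pixels into 16 bytes):

```cpp
// v210 packs 6 pixels into 4 x 32-bit words = 128 bits, i.e. ~21.3 bpp.
constexpr double v210_bpp   = 128.0 / 6.0; // ~21.33
constexpr double rgba16_bpp = 4 * 16.0;    // 64
// Two v210 downloads (fill + key) still cost less download bandwidth
// than one 16-bit RGBA download of the same frame:
static_assert(2 * v210_bpp < rgba16_bpp, "~42.7 bpp < 64 bpp");
```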

And by using compute shaders for all downloads, we can use a texture format that GLSL prefers, i.e. a float based format instead of int.
Another aim was to change the compositing to linear RGB and handle sRGB in the conversion shaders. That has not been investigated at all.

The consumer portion is fairly complete, with a working (but not verified for accuracy) DeckLink v210 implementation.
To support this, each consumer is passed a frame_converter when it is constructed, which it can use to convert the const_frame into whatever format it prefers. As part of this, the intention is to remove the 8bit RGBA buffer from const_frame, so that it also has to be fetched through the frame_converter; this has not been done in this POC, to avoid breaking every consumer.
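Roughly, the shape I have in mind for that interface is something like the following (hypothetical signatures based on the description above, not the exact code in the branch):

```cpp
#include <cstdint>
#include <future>
#include <vector>

// Hypothetical sketch of the frame_converter handed to each consumer.
// The real types in the branch differ; this only illustrates the flow.
struct frame_conversion_format
{
    enum class pixel_format { bgra8, rgba16, v210 } format = pixel_format::bgra8;
    bool key_only = false;
    // Optional subregion so only the needed pixels are converted and
    // downloaded from the GPU. (0,0,0,0) means the full frame.
    int region_x = 0, region_y = 0, region_w = 0, region_h = 0;
};

class frame_converter
{
  public:
    virtual ~frame_converter() = default;

    // Convert a composited frame into the consumer's preferred format,
    // returning the downloaded bytes once the GPU work completes.
    virtual std::shared_future<std::vector<std::uint8_t>>
    convert_frame(const class const_frame& frame, const frame_conversion_format& format) = 0;
};
```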

As for the current status: it is possible to play high bit depth ffmpeg clips, or 16bit PNGs, and output them as GPU generated yuv10 out of a DeckLink. The DeckLink consumer doesn't support key+fill when fed yuv10 frames, but this can be achieved with a second port set to key-only using the sync-group added previously. (I wanted to explore using the 3D API to support key+fill on the 4K Extreme cards.)

A lot of things are hardcoded in testing setups, as this didn't progress beyond a POC.

@Julusian (Member, Author) commented:

@niklaspandersson I'm curious what your thoughts (and those of anyone else at nxtedition who might care) are on this approach, or whether you disagree with any of my assumptions or my preference for GPU work. The code is a bit of a mess, so don't look at it in too much detail, and it needs a rebase following the merging of your HDR work.
I am considering picking some of this up, but want to make sure the design/approach won't get complained about later.
It needs some thought about handling HDR, but I suspect that would simply be another thing for the conversion shaders to consider, depending on the format (or perhaps a different set of conversion shaders for HDR?).

Long term, the conversion shaders may want to handle converting between SDR and HDR, but from what I have read about tone-mapping, that doesn't sound fun.
Doing SDR to HDR conversion for producers may be needed to allow non-HDR producers (html, or clips) to work, but it sounds like that can be done with simple maths rather than tone mapping.


On an unrelated note, I had a quick dig into CEF following their new implementation of shared-texture support. That looks a lot more likely to be stable, so it might be worth bringing back.
It also looks like it might be possible to modify CEF to change the pixel format of the textures to 16bit. I don't know if it will composite internally at that depth, but I would hope so. I didn't think to check about HDR, but it must be possible to enable that somehow in Chromium too (it's been supported in Chrome for years).
