wip: 16bit shader conversions #1581
base: master
Conversation
This reverts commit f22f6a2.
@niklaspandersson I'm curious what your thoughts (and those of anyone else at nxtedition who might care) are on this approach, or whether you disagree with some of my assumptions or my preference for gpu work. The code is a bit of a mess so don't look at it in too much detail, and it needs a rebase following the merging of your hdr work. Long term, the conversion shaders may want to handle converting between sdr and hdr, but from what I have read about tone-mapping, that doesn't sound fun. On an unrelated note, I had a quick dig into cef following their new implementation of shared-texture support. That looks a lot more likely to be stable, so it might be worth bringing back.
This is something I started late last year, but haven't had the motivation to finish. So I am pushing it here, in case someone wants to use it as inspiration or to copy pieces from.
The work here was focussed on SDR 16bit compositing. At some point that would have evolved to HDR, but that wasn't considered yet. The intention was to get lossless SDR 10bit yuv through the system, rather than the slightly lossy flow that we have today.
The basic design was, on the producer side, to replace the point where we tell opengl to copy a buffer into a texture with an opengl compute shader. This would allow us to do yuv->rgb conversion, and even to unpack certain common packed formats, such as the decklink yuv10 packing. This was not yet implemented.
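As an illustration (not code from this PR), the bit-slicing such a compute shader would perform for the decklink v210 packing can be sketched on the CPU. The layout below (6 pixels of 10-bit 4:2:2 in four little-endian 32-bit words) follows the standard v210 definition:

```cpp
#include <cstdint>
#include <vector>

// One v210 group: four 32-bit words carrying 6 pixels of 10-bit 4:2:2 YCbCr.
// Word layout (bits 0-9 | 10-19 | 20-29):
//   w0: Cb0 | Y0  | Cr0
//   w1: Y1  | Cb2 | Y2
//   w2: Cr2 | Y3  | Cb4
//   w3: Y4  | Cr4 | Y5
struct ycbcr_422_group
{
    std::vector<uint16_t> y;  // 6 luma samples
    std::vector<uint16_t> cb; // 3 chroma samples
    std::vector<uint16_t> cr; // 3 chroma samples
};

ycbcr_422_group unpack_v210_group(const uint32_t w[4])
{
    // Extract the 10-bit component in the given slot (0, 1 or 2) of a word.
    auto c = [](uint32_t word, int slot) -> uint16_t {
        return static_cast<uint16_t>((word >> (slot * 10)) & 0x3ffu);
    };
    ycbcr_422_group out;
    out.cb = {c(w[0], 0), c(w[1], 1), c(w[2], 2)};
    out.y  = {c(w[0], 1), c(w[1], 0), c(w[1], 2), c(w[2], 1), c(w[3], 0), c(w[3], 2)};
    out.cr = {c(w[0], 2), c(w[2], 0), c(w[3], 1)};
    return out;
}
```

A compute shader would do the same per-group slicing in parallel, writing the results into an rgba16 (or similar) texture after the yuv->rgb matrix.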
This would also mean that the composite shader could have the existing colour format handling code removed.
The hope was that doing it here (where opengl is likely already doing a copy and rearranging the bytes) would have minimal cost in memory and gpu power. I was trying to avoid doing this on the cpu, as in my experience the cpu is typically under higher pressure (decoding video and deinterlacing). Compute shaders are supported in our current minimum opengl version.
On the consumer side, the intention was to do something similar: using a compute shader to do the final copy from the composited texture into the buffer that is copied into cpu memory.
The intention is that the `key_only` and `subregion` options in the decklink consumer would make their way into this converter, so that only the subregion needs to be converted and downloaded from the gpu, while at the same time allowing other consumers to support the same flows with very little additional code.

This does carry a risk of doing more downloads from the gpu than before, but I don't expect this to be a bottleneck for anyone. Some googling suggests that pcie has separate upload and download bandwidth, and we are much more likely to hit the limits of upload before download.
Whether the bandwidth cost will be greater than cpu conversions is easy to calculate.
For 2 decklinks doing 10bit yuv, that equates to ~2x21bpp, which is less than downloading as 16bit rgba; this would be the situation for a channel doing f+k. Adding anything more would require more bandwidth.
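The arithmetic behind that claim can be written out. The figures below assume 1080p50 purely as an example; the ~21bpp comes from v210 packing 6 pixels into 16 bytes:

```cpp
// v210 packs 6 pixels into 16 bytes -> 128 bits / 6 px ≈ 21.33 bits per pixel.
constexpr double v210_bpp   = 128.0 / 6.0;
// 16-bit rgba is 4 components x 16 bits = 64 bits per pixel.
constexpr double rgba16_bpp = 4 * 16.0;

// PCIe download rate (bytes/sec) for one stream at a given resolution/rate.
constexpr double bytes_per_sec(double bpp, int width, int height, double fps)
{
    return bpp * width * height * fps / 8.0;
}
```

Two v210 downloads (fill + key) cost ~42.7bpp, still under the 64bpp of a single 16bit rgba download; at 1080p50 each v210 stream is roughly 275 MB/s.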
And perhaps more importantly, for progressive channels, those buffers could be handed straight to the decklink driver without a copy, which would relieve cpu memory pressure as well as cycles compared to a cpu conversion.
And by using compute shaders on all downloads, it lets us use a texture format that glsl will prefer, using a float based format instead of int.
Another aim was to change the compositing to linear-rgb, and handle srgb in the conversion shaders. That has not been investigated at all.
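For reference, if compositing moved to linear light, the conversions the shaders would need at the edges are the standard sRGB transfer functions (IEC 61966-2-1). A CPU sketch, not code from this PR:

```cpp
#include <cmath>

// sRGB EOTF: decode an sRGB-encoded value (0..1) to linear light.
// Inputs would be decoded before compositing in linear rgb.
double srgb_to_linear(double s)
{
    return s <= 0.04045 ? s / 12.92
                        : std::pow((s + 0.055) / 1.055, 2.4);
}

// Inverse: re-encode linear light to sRGB for output.
double linear_to_srgb(double l)
{
    return l <= 0.0031308 ? l * 12.92
                          : 1.055 * std::pow(l, 1.0 / 2.4) - 0.055;
}
```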
The consumer portion is mostly implemented, with a working (but not verified for accuracy) decklink v210 implementation.
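The packing direction of that v210 conversion looks roughly like this on the CPU (a sketch for illustration; the actual consumer-side shader would do the equivalent per group of 6 pixels):

```cpp
#include <cstdint>

// Pack 6 pixels of 10-bit 4:2:2 YCbCr into one v210 group
// (four little-endian 32-bit words, three 10-bit components per word).
void pack_v210_group(const uint16_t y[6], const uint16_t cb[3],
                     const uint16_t cr[3], uint32_t w[4])
{
    // Place three 10-bit components into bits 0-9, 10-19 and 20-29.
    auto p = [](uint32_t a, uint32_t b, uint32_t c) -> uint32_t {
        return (a & 0x3ffu) | ((b & 0x3ffu) << 10) | ((c & 0x3ffu) << 20);
    };
    w[0] = p(cb[0], y[0], cr[0]);
    w[1] = p(y[1], cb[1], y[2]);
    w[2] = p(cr[1], y[3], cb[2]);
    w[3] = p(y[4], cr[2], y[5]);
}
```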
To support this, when constructing a consumer, it is passed a `frame_converter`, which it can use to convert the const_frame into whatever format it prefers. As part of this, the intention is to remove the 8bit rgba buffer from const_frame, so that it also has to be fetched through the `frame_converter`; this has not been done in this POC, to avoid breaking every consumer.

As for the status of this: it is possible to play high bit depth ffmpeg clips, or 16bit pngs, and output them as gpu-generated yuv10 out of a decklink. The decklink consumer doesn't support k+f when fed yuv10 frames, but it can be done with a second port set to key-only using the sync-group added previously. (I wanted to explore using the 3D api to support k+f on the 4k extreme cards.)
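A hypothetical sketch of what such a `frame_converter` interface might look like — every name here (`frame_converter`, `convert_params`, `pixel_format`, `null_converter`) is illustrative, not the actual API in this PR:

```cpp
#include <cstdint>
#include <future>
#include <vector>

struct subregion { int x = 0, y = 0, width = 0, height = 0; };

enum class pixel_format { bgra8, rgba16, v210 };

struct convert_params
{
    pixel_format format = pixel_format::bgra8;
    bool key_only = false; // download only the key (alpha) signal
    subregion region;      // convert/download just this rectangle
};

// Handed to each consumer at construction; runs a compute shader over the
// composited texture and downloads only the requested bytes over pcie.
class frame_converter
{
public:
    virtual ~frame_converter() = default;
    virtual std::future<std::vector<uint8_t>> convert(const convert_params& params) = 0;
};

// Trivial stub for illustration: returns a zeroed bgra8-sized buffer.
class null_converter : public frame_converter
{
public:
    std::future<std::vector<uint8_t>> convert(const convert_params& params) override
    {
        std::vector<uint8_t> buf(static_cast<size_t>(params.region.width) *
                                 params.region.height * 4);
        std::promise<std::vector<uint8_t>> p;
        p.set_value(std::move(buf));
        return p.get_future();
    }
};
```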
A lot of things are hardcoded in testing setups, as this didn't progress beyond a POC.