GPU 3D lut transform causes banding #1763

Open
cessen opened this issue Jan 21, 2023 · 10 comments

cessen commented Jan 21, 2023

I'm using 3D LUTs that span a large dynamic range and need trilinear interpolation for correct results (I'm happy to provide details of my use case if that's helpful).

Due to the nature of these LUTs, they should provide sufficiently accurate results even at relatively low LUT resolutions. However, much to my surprise, there is significant banding when they are applied by OCIO on the GPU:

[Image: GPU output, posterized]

The only way to mitigate this is to crank the LUT resolution extremely high, and even then especially dark colors exhibit some subtle banding. The results from the CPU path, on the other hand, are correct even with the low-resolution LUT:

[Image: CPU output, correct]

Looking into OCIO's source, this seems to be the culprit:

// Trilinear interpolation
// Use texture3d and GL_LINEAR and the GPU's built-in trilinear algorithm.
// Note that the fractional components are quantized to 8-bits on some
// hardware, which introduces significant error with small grid sizes.
ss.newLine() << ss.float3Decl(name + "_coords")
             << " = (" << shaderCreator->getPixelName() << ".zyx * "
             << ss.float3Const(dim - 1) << " + "
             << ss.float3Const(0.5f) + ") / "
             << ss.float3Const(dim) << ";";
ss.newLine() << shaderCreator->getPixelName() << ".rgb = "
             << ss.sampleTex3D(name, name + "_coords") << ".rgb;";
}
shaderCreator->addToFunctionShaderCode(ss.string().c_str());
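
For reference, with a LUT named lut3d and dim = 33, and assuming the pixel variable is called outColor, this emits GLSL roughly along these lines:

vec3 lut3d_coords = (outColor.zyx * vec3(32.0) + vec3(0.5)) / vec3(33.0);
outColor.rgb = texture(lut3d, lut3d_coords).rgb;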

It's even noted in the comment that this can be an issue on some GPUs. For the record, I'm using an nVidia RTX 3060. So not exactly niche.
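
To put rough numbers on that: with an 8-bit fractional weight, the hardware filter can only place a sample at one of 2^8 = 256 positions between two adjacent grid points, so a smooth ramp crossing a single LUT cell is flattened into at most 256 steps. When the LUT is small and each cell spans a wide slice of the output range, those steps become clearly visible as banding.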

I think it would make sense to replace the sampleTex3D() call with hand-written trilinear interpolation code, to ensure that it executes at full floating-point precision and gives correct results on all platforms. It might be a small perf hit, but for something like OCIO I suspect correctness and consistency are more important here.
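
As a rough illustration, here is a minimal, untested sketch of what I have in mind, assuming a sampler3D named lut of size dim, and keeping the same .zyx ordering as the generated code above:

vec3 lut3d_trilinear(sampler3D lut, float dim, vec3 col)
{
    // Map the [0,1] input to continuous texel space [0, dim-1], then split into
    // the integer cell index and the fractional position within the cell.
    vec3 coord = clamp(col.zyx, 0.0, 1.0) * (dim - 1.0);
    vec3 base  = floor(coord);
    vec3 f     = coord - base;            // full float precision, not 8-bit
    ivec3 i0   = ivec3(base);
    ivec3 i1   = min(i0 + 1, ivec3(int(dim) - 1));

    // Fetch the 8 surrounding texels exactly, with no hardware filtering.
    vec3 c000 = texelFetch(lut, ivec3(i0.x, i0.y, i0.z), 0).rgb;
    vec3 c100 = texelFetch(lut, ivec3(i1.x, i0.y, i0.z), 0).rgb;
    vec3 c010 = texelFetch(lut, ivec3(i0.x, i1.y, i0.z), 0).rgb;
    vec3 c110 = texelFetch(lut, ivec3(i1.x, i1.y, i0.z), 0).rgb;
    vec3 c001 = texelFetch(lut, ivec3(i0.x, i0.y, i1.z), 0).rgb;
    vec3 c101 = texelFetch(lut, ivec3(i1.x, i0.y, i1.z), 0).rgb;
    vec3 c011 = texelFetch(lut, ivec3(i0.x, i1.y, i1.z), 0).rgb;
    vec3 c111 = texelFetch(lut, ivec3(i1.x, i1.y, i1.z), 0).rgb;

    // Blend along x, then y, then z, all with full float weights.
    vec3 c00 = mix(c000, c100, f.x);
    vec3 c10 = mix(c010, c110, f.x);
    vec3 c01 = mix(c001, c101, f.x);
    vec3 c11 = mix(c011, c111, f.x);
    return mix(mix(c00, c10, f.y), mix(c01, c11, f.y), f.z);
}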

Would a PR along those lines be accepted?

cessen commented Apr 5, 2023

Any thoughts/feedback about this? This is a significant problem for the work that I'm doing with OCIO, and I'd like to see it fixed.

I'm happy to do the leg work, but just want to double-check that the proposed solution is acceptable first.

doug-walker (Collaborator) commented:

Hi @cessen , apologies for the delayed reply. Yes, that's a pretty shocking amount of banding in the first image. It looks like maybe you're applying a 3d-LUT to a linear color space encoding? If so, the banding is probably only one of several problems you will encounter. The best practice is to convert to a log-encoded (or at least gamma-encoded) color space before applying a 3d-LUT.

In practice, OCIO users tend to be using 3d-LUTs that are a minimum of 16x16x16 (and typically larger) and being fed with log-encoded data. So the quantization of the fractional component has not been an issue.

I agree that your proposal of writing a trilinear interpolation shader would work around the problem. (This is what we've done for tetrahedral interpolation already.) However, it seems that there is a risk of introducing performance regressions by not using the built-in GPU trilinear interpolation. Tests would ideally need to be done on a variety of graphics cards (old and new) and a range of 3d-LUT sizes to validate that performance was not impacted.

While we thank you for volunteering, we won't agree in advance to accepting a PR for this. We would need to see experimental results covering a sufficiently broad set of use-cases. Almost certainly there will be some speed penalty and at that point it will be a judgment call about whether the improvement in quality for this edge case warrants the overall performance reduction. This is something that would likely be decided by the project's steering committee.

cessen commented Apr 20, 2023

@doug-walker Thanks for the thorough reply!

While we thank you for volunteering, we won't agree in advance to accepting a PR for this.

Yes, of course. I just wanted to confirm that it wasn't out of the question for some reason. I manage some open source projects myself, and have received PRs that I had to reject due to the scope and goals of the project, and I often wish people would ask ahead of time before putting in the work.

(Edit: I also meant to acknowledge that my phrasing made my intent there very unclear. Sorry about that.)

However, it seems that there is a risk of introducing performance regressions by not using the built-in GPU trilinear interpolation. Tests would ideally need to be done on a variety of graphics cards (old and new) and a range of 3d-LUT sizes to validate that performance was not impacted.

Yeah, I suspected that might be the case.

It occurs to me that an alternative might be to introduce another interpolation mode. Something like INTERP_LINEAR_PRECISE, for use cases where guaranteed full precision is needed.

It looks like maybe you're applying a 3d-LUT to a linear color space encoding?

Although there are certainly cases where we would like to apply smaller LUTs in a linear color space, we've been able to work around that by using higher resolution LUTs and applying them in non-linear color space encodings, as you mentioned. I don't want to completely downplay this, however, because it does mean we have notably higher-res LUTs than needed for the actual transforms we want to do (which is annoying, as the size grows with the cube of the resolution) and it forces those transforms to have otherwise unnecessary log encoding/decoding steps (which performance-wise aren't free either). So this is certainly one of the use cases we'd like full-precision interpolation for. But it's not the main reason that I opened this issue.

An example of the kind of use case where this is more of a problem is working with HDRIs. Unlike footage from principal photography, HDRIs can have a (potentially huge) dynamic range that isn't bounded in advance. So to be conservative and avoid accidental clipping, a 3D LUT intended for HDRIs needs to have a huge extent. Even using a nonlinear color space encoding, LUTs with such extents can still exhibit banding with the current GPU interpolator.

To further exacerbate things, a standard RGB LUT with such a large extent doesn't have enough chroma resolution at the low end to do meaningful color transformations. We've solved this by constructing our LUTs to be applied in OCIO's variant of HSV space, so that they have the same chroma resolution at all luminance levels. And that's why tetrahedral interpolation doesn't solve the issue for us: tetrahedral isn't appropriate for HSV-like spaces and produces strange results. Additionally, you only want the V channel to be log encoded in that case, but the various log transforms in OCIO don't support selective application to single channels. So we're stuck with ExponentTransform and friends for our non-linear encoding, which makes the problem even worse.

We could work around this by e.g. applying the color transform ahead of time via the CPU path and saving to an EXR. But we have other use cases with similar issues, where that kind of thing is harder/more obnoxious to do.

So that's the practical thing we're running into.

From a user-expectations standpoint, it also just feels strange that the CPU and GPU paths can diverge in their results by that much. In something like OCIO I would expect the GPU path to be faster, but not lower quality.

doug-walker (Collaborator) commented:

Thanks for elaborating on the use-case.

I don't think adding INTERP_LINEAR_PRECISE is the best solution. It seems more like an optimization setting than an interpolation method. Take a look at the OptimizationFlags in OpenColorTypes.h. What do you think about the approach of adding a new setting, OPTIMIZATION_NATIVE_GPU_TRILINEAR, to turn this on and off? It should be off for OPTIMIZATION_LOSSLESS but on for OPTIMIZATION_VERY_GOOD or higher. It would be similar to OPTIMIZATION_FAST_LOG_EXP_POW in the sense that it controls the evaluation method rather than the steps used to reduce the number of ops.

That way, by default people will continue to get the current behavior but could opt in to your shader if desired. I think that would lower the amount of performance testing required, since it would not be used by default. However, if we could get enough performance testing done, it could potentially be deployed beyond the LOSSLESS setting.

cessen commented Apr 21, 2023

What do you think about the approach of adding a new setting, OPTIMIZATION_NATIVE_GPU_TRILINEAR, to turn this on and off? It should be off for OPTIMIZATION_LOSSLESS but on for OPTIMIZATION_VERY_GOOD or higher.

I think that's an excellent way forward. That will also make it a little easier to do the performance testing, and for other people to join in if desired, since it won't need a custom branch.

I'll work on putting together a PR for that the next chance I get.

Thanks so much!

cessen commented Apr 26, 2023

I've started a WIP PR in #1794. It's functional, but I have some questions about how best to proceed in terms of API impact. Despite several different attempts, I couldn't find a way to get the optimization flags passed where they needed to be without affecting public APIs, which I'd like to avoid.

cessen commented Oct 11, 2023

This is still an outstanding issue for the work that I'm doing. The PR I previously submitted immediately stalled (I had outstanding questions but failed to make it clear I was looking for feedback, and then got busy with other things).

In any case, it's perhaps for the best that it stalled, because I'm re-thinking whether the optimization level approach is really best here, and I'm again thinking that adding something like a INTERP_LINEAR_PRECISE interpolation mode might be a more appropriate solution.

I'm returning to that because, as far as I can tell, the optimization flags are largely under the control of the software that's integrating OCIO, and there's no way to control them from an OCIO config. Is that right? If so, that means any configuration that depends on full-precision LUT evaluation in order to function properly won't be portable across different software applications. And this is certainly relevant for our use case: we use our OCIO configurations across multiple pieces of software, and we don't have control over how they integrate OCIO.

Additionally, if the current reduced-precision LUT behavior is considered an acceptable optimization, it also seems reasonable for a configuration to use that optimization where the reduced precision is sufficient, and only selectively use the slower full-precision LUT evaluation where it's actually needed.

And as a side-benefit, simply adding a new interpolation mode looks to be less invasive than the changes needed to thread the optimization flag through all the needed APIs for the optimization level approach to work.

Thoughts?

carolalynn added the "Needs Discussion" label (needs discussion before implementation, which could result in changes, or a decision not to proceed) on Feb 7, 2024
clshortfuse commented:

The trilinear sampling is correct, though Nvidia doesn't construct it the same way:

void main(in float2 sUV : TEXCOORD0,
          out half4 cOut : COLOR0,
          const uniform samplerRECT imagePlane,
          const uniform sampler3D lut,
          const uniform float3 lutSize)
{
  // Get the image color
  half3 rawColor = texRECT(imagePlane, sUV).rgb;
  // Compute the 3D LUT lookup scale/offset factor
  half3 scale = (lutSize - 1.0) / lutSize;
  half3 offset = 1.0 / (2.0 * lutSize);
  // ****** Apply 3D LUT color transform! **************
  // This is our dependent texture read; the 3D texture's
  // lookup coordinates are dependent on the
  // previous texture read's result
  cOut.rgb = tex3D(lut, scale * rawColor + offset);
}

https://developer.nvidia.com/gpugems/gpugems2/part-iii-high-quality-rendering/chapter-24-using-lookup-tables-accelerate-color

WolframAlpha: https://www.wolframalpha.com/input?i=x+*+%28%28y+-+1%29+%2F+y%29+%2B+%281+%2F+%282+*+y%29%29

By comparison, the one in the OCIO code is:

(color * (size - 1) + 0.5) / size

https://www.wolframalpha.com/input?i=%28x+*+%28y+-+1%29+%2B+0.5%29+%2F+y

They are the same, only I suspect Nvidia's is optimized to end up being a single multiply+add (mad) operation if the size is precomputed.
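
Writing it out (with x the input and N the LUT size): Nvidia's x * ((N - 1) / N) + 1 / (2 * N), put over the common denominator N, is (x * (N - 1) + 0.5) / N (since 1 / (2N) = 0.5 / N), which is exactly the expression the OCIO code generates.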

cessen commented Jan 14, 2025

The trilinear sampling is correct, though not constructed the same way by Nvidia

Yes, I don't think there is any disagreement about whether it's correct in terms of the underlying math.

The issue is that the built-in 3D texture lookup functionality of the hardware is very low precision (at least on common GPUs), which for certain use cases can cause severe banding artifacts. For those use cases it would be valuable to be able to opt-in to full precision interpolation, even if it's slower.

clshortfuse commented Jan 14, 2025

Thanks. The first image didn't look like banding to me, but like fully clipped values or NaNs, which is what made me think the algorithm itself was in question. I found the math strange and had to double-check why the method doesn't exactly match other sampling methods, such as Nvidia's. It is the same, save for some possible precision loss (maybe?) from using three operations on the float color instead of a single mad, which would likely maintain its precision because the dimensions are integers. I wouldn't be surprised if the first half, (color * (size - 1) + 0.5), loses some precision, only to lose precision again when dividing by size at the end.

As for grabbing the value, I'd say you should consider loading the texel itself instead of going through a sampler. Then you'd have the full-precision value straight from the texture, and you could decide how to lerp/mix between the surrounding points in a 32-bit float context (similar to manually unrolling 2D textures). But that's about the extent of my knowledge on this, since I'm used to HLSL's distinction between load and sample, and I'm not sure about the GL operations and optimizations.
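
If it helps, I believe the GL analogue of HLSL's Load is texelFetch, which returns the texel unfiltered, something like (illustrative; I'm not certain of the exact GL incantation):

vec3 texel = texelFetch(lut, ivec3(i, j, k), 0).rgb;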
