Average luminance calculation using a compute shader

A common part of most HDR rendering pipelines is some form of average luminance calculation. Typically it’s used to implement Reinhard’s method of image calibration, which is to map the geometric mean of luminance (log average) to some “key value”. This, combined with some time-based adaptation, allows for a reasonable approximation of auto-exposure or human eye adaptation.
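
In HLSL, that calibration step ends up looking roughly like this (a minimal sketch; the function and parameter names are illustrative rather than taken from the sample):

```hlsl
// Minimal sketch of mapping the average luminance to a key value (Reinhard-style calibration).
// Names are illustrative, not the exact ones used in the sample.
float3 CalcExposedColor(float3 color, float avgLuminance, float keyValue)
{
    // avgLuminance is the geometric mean: exp(avg(log(luminance)))
    avgLuminance = max(avgLuminance, 0.0001f);

    // Scale the scene so that the average luminance maps to the key value
    float linearExposure = keyValue / avgLuminance;
    return color * linearExposure;
}
```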

In the old days of DX9, the average luminance calculation was usually done by repeatedly downscaling a luminance texture as if generating mipmaps. Technically DX9 had support for automatic mipmap generation on the GPU, but support wasn’t guaranteed, so to be safe you had to do it yourself. DX10 brought guaranteed support for mipmap generation for a variety of texture formats, making it viable for use with average luminance calculations. It’s obviously still there in DX11, and it’s still a very easy and pretty quick way to do it. On my AMD 6950 it only takes about 0.2ms to generate the full mip chain for a 1024x1024 luminance map, which is pretty quick. But with DX11 it’s not sexy unless you’re doing it with compute shaders, which means we need to ditch that one-line call to GenerateMips and replace it with some parallel reductions. Technically a parallel reduction should require far fewer memory reads/writes compared to generating successive mip levels, so there’s also some actual sound reasoning behind exploring that approach.

The DirectX SDK actually comes with a sample that implements average luminance calculation with a compute shader parallel reduction (HDRToneMappingCS11), but unfortunately their CS implementation actually performs a fair bit worse than their pixel shader implementation. A few people on gamedev.net had asked about this and I had said that it should definitely be possible to beat successive downscaling with a compute shader if you did it right, and used cs_5_0 rather than cs_4_0 like the sample. When it came up again today I decided to put my money where my mouth is and make a working example.

The implementation is really simple: render the scene in HDR, render log(luminance) to a 1024x1024 texture, downscale to 1x1 using either GenerateMips or a compute shader reduction, apply adaptation, then tone map (and add bloom). My first try was to do the reduction in 32x32 thread groups (giving the maximum of 1024 threads per thread group), where each thread sampled a single float from the input texture and stored it in shared memory. Then the reduction was done in shared memory using the techniques outlined in Nvidia’s CUDA parallel reduction whitepaper, which help avoid shared memory bank conflicts. The first pass used a 32x32 dispatch which resulted in a 32x32 output texture, which was then reduced to 1x1 with one more 1x1 dispatch. Unfortunately this approach took about 0.3ms to complete, which was slower than the 0.2ms taken for generating mips.
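
As a rough sketch, that first-pass shader looked something like this (resource names and details are illustrative rather than the exact code from the sample):

```hlsl
// First-pass scalar reduction: 32x32 threads per group, one sample per thread.
// Resource names here are illustrative, not the exact ones from the sample.
Texture2D<float> InputMap : register(t0);
RWTexture2D<float> OutputMap : register(u0);

groupshared float SharedMem[1024];

[numthreads(32, 32, 1)]
void ReductionCS(uint GroupIndex : SV_GroupIndex,
                 uint3 GroupID : SV_GroupID,
                 uint3 DispatchID : SV_DispatchThreadID)
{
    // Each thread loads one log(luminance) value into shared memory
    SharedMem[GroupIndex] = InputMap[DispatchID.xy];
    GroupMemoryBarrierWithGroupSync();

    // Sequential-addressing reduction (per the CUDA whitepaper) to avoid bank conflicts
    [unroll]
    for (uint s = 512; s > 0; s >>= 1)
    {
        if (GroupIndex < s)
            SharedMem[GroupIndex] += SharedMem[GroupIndex + s];
        GroupMemoryBarrierWithGroupSync();
    }

    // Thread 0 writes the average for this group; the second 1x1 dispatch
    // averages the 32x32 per-group results down to the final value.
    if (GroupIndex == 0)
        OutputMap[GroupID.xy] = SharedMem[0] / 1024.0f;
}
```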

For my second try, I decided to explicitly vectorize so that I could take better advantage of the vector ALUs in my AMD GPU. I reconfigured the reduction compute shader to use 16x16 thread groups, and had each thread take 4 samples (forming a 2x2 grid) from the input texture and store them in a float4 in shared memory. Then the float4s were summed in a parallel reduction, with the last step being to sum the 4 components of the final result. This approach required only 0.08ms for the reduction, meaning I hit my goal of beating out the mipmap generation. After all that work, saving 0.1ms doesn’t seem like a whole lot, but it’s worth it for the cool factor. The performance differential may also become more pronounced at higher resolutions, or on hardware with less bandwidth available. I’m not sure how the compute shader will fare on Nvidia hardware, since their GPUs don’t use vector ALUs, so it should be interesting to get some numbers. I’d suspect that shared memory access patterns are going to dominate over ALU cost anyway, so it could go either way.
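
A sketch of the vectorized version, again with illustrative names rather than the exact sample code:

```hlsl
// Vectorized reduction: 16x16 threads per group, each thread loads a 2x2 block
// of samples into a float4. Names are illustrative, not the exact sample code.
Texture2D<float> InputMap : register(t0);
RWTexture2D<float> OutputMap : register(u0);

groupshared float4 SharedMem[256];

[numthreads(16, 16, 1)]
void ReductionCS(uint GroupIndex : SV_GroupIndex,
                 uint3 GroupID : SV_GroupID,
                 uint3 DispatchID : SV_DispatchThreadID)
{
    // Each thread loads a 2x2 block of samples into a float4
    uint2 samplePos = DispatchID.xy * 2;
    float4 samples;
    samples.x = InputMap[samplePos + uint2(0, 0)];
    samples.y = InputMap[samplePos + uint2(1, 0)];
    samples.z = InputMap[samplePos + uint2(0, 1)];
    samples.w = InputMap[samplePos + uint2(1, 1)];
    SharedMem[GroupIndex] = samples;
    GroupMemoryBarrierWithGroupSync();

    // Reduce the float4s in shared memory
    [unroll]
    for (uint s = 128; s > 0; s >>= 1)
    {
        if (GroupIndex < s)
            SharedMem[GroupIndex] += SharedMem[GroupIndex + s];
        GroupMemoryBarrierWithGroupSync();
    }

    // Sum the 4 components of the final result and write the group average
    // (each group covers 256 * 4 = 1024 samples)
    if (GroupIndex == 0)
    {
        float4 total = SharedMem[0];
        OutputMap[GroupID.xy] = (total.x + total.y + total.z + total.w) / 1024.0f;
    }
}
```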

The sample project is up for download here: https://github.com/TheRealMJP/DX11Samples/releases/tag/v1.3


Comments:

#### [Daan Nijs (@ElMarcel)](http://twitter.com/ElMarcel "ElMarcel@twitter.example.com") -

OK, I get it now, thanks for the clarification! That’s actually pretty clever, I guess it makes very bright pixels have less of an impact on the average, as well as giving better temporal stability.


#### [Daan Nijs (@ElMarcel)](http://twitter.com/ElMarcel "ElMarcel@twitter.example.com") -

this might be nitpicking, but you’re saying you’re rendering log(luminance). Wouldn’t GenerateMips give you an incorrect result, as you can’t simply lerp between log values?


#### [MJP](http://mynameismjp.wordpress.com/ "mpettineo@gmail.com") -

Mykhailo: indeed it might! But as I said earlier, I suspect that shared memory access is the primary bottleneck, so I wouldn’t expect too much of a gain.

Daan: the formula for the geometric mean is exp(avg(log(luminance))). So if you render out log(luminance) and generate mips, the lowest mip level is equal to avg(log(luminance)). Then you just take the exp() of the value you get when sampling that lowest mip level and you have the geometric mean.
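
Something like this, as a minimal sketch (the names here are just illustrative, not code from the sample):

```hlsl
// Minimal sketch: recover the geometric mean from the lowest mip of the
// log(luminance) texture. Names are illustrative.
Texture2D LogLuminanceMap;
SamplerState PointSampler;

float GetGeometricMeanLuminance(uint numMipLevels)
{
    // The 1x1 mip contains avg(log(luminance))
    float avgLogLum = LogLuminanceMap.SampleLevel(PointSampler, float2(0.5f, 0.5f), numMipLevels - 1).x;
    return exp(avgLogLum);
}
```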


#### [Tiago Costa]( "tiago.costav@gmail.com") -

Nice post. Every 0.1ms counts =D Especially when using VSync.


#### [Mykhailo Parfeniuk (@sopyer)](http://twitter.com/sopyer "sopyer@twitter.example.com") -

Just a quick thought: wouldn’t the gather4 instruction improve the performance of your code?


#### [Tiago Costa]( "tiago.costav@gmail.com") -

@Daan I use GenerateMips to calculate average luminance in DX 10 and it works correctly…


#### []( "") -

I ran this on my personal computer and got about 312 FPS at most (single GTX 460, i7 920 @ 4.1GHz, 3GB of RAM). Not bad, but it stays around 4.2ms. Sadly it doesn’t change much between the methods; in fact it goes up with your implementation, by about 0.2ms.


#### [Peter Kristof]( "peter.kristof@gmail.com") -

Actually, I take it back. You do need the GroupMemoryBarrierWithGroupSync() to avoid potential R/W hazards among different warps.


#### [Implementing a Physically Based Camera: Automatic Exposure | Placeholder Art](http://placeholderart.wordpress.com/2014/12/15/implementing-a-physically-based-camera-automatic-exposure/ "") -

[…] exponential feedback loop to smooth the results of scene metering. I first saw this used in the Average luminance calculation using a compute shader post by MJP, which was originally from a SIGGRAPH paper about visual adaption […]


#### [Anoop Thomas](http://anooprthomas.blogspot.com "anoop.r.thomas@gmail.com") -

I recently implemented both the pixel shader and compute shader versions for calculating luminance. For the pixel shader version, I use the downscale-by-2x2 method with a single bilinear tap all the way down to a single pixel (GenerateMips might be doing the same process). For the compute shader, I use the parallel sum method with 2 passes and then take the average using the resolution. If I use a resolution that is not a power of 2, I end up with an incorrect average (when compared to the compute shader). I have verified that the compute shader version returns the right average, and the pixel shader version starts to introduce errors when downscaling from an odd height or odd width (e.g. downscaling from 5x3, 5x2, or 4x3). Can you suggest a method to get the right average from the pixel shader version?


#### [Peter Kristof]( "peter.kristof@gmail.com") -

You can use GroupMemoryBarrier() inside the parallel reduction’s for loop in ReductionCS(). I see 2-35% speedups on different AMD architectures with that, for a slightly different implementation with no explicit vectorization. Thanks for the articles and samples, much appreciated!


#### [CryZe]( "cryze92@gmail.com") -

Wouldn’t the float4s cause bank conflicts on Nvidia hardware? Each subsequent 32-bit value is stored in the next memory bank, so the 4 floats of a float4 would be stored in 4 different banks. So when the hardware wants to load all of the gsm[i].x values, only 4 of them can be directly accessed by the half warp, which would cause a 4-way bank conflict. I don’t think the hardware coalesces the memory automatically.


#### [MJP](http://mynameismjp.wordpress.com/ "mpettineo@gmail.com") -

Yes, it would definitely have bank conflicts on both Nvidia and AMD hardware. The first (scalar) version of the reduction that I wrote avoided bank conflicts, since it used the techniques illustrated in the CUDA parallel reduction whitepaper. But then it appeared that on my AMD hardware the benefit of vectorization outweighed the cost of bank conflicts, so I just stopped there. In hindsight it probably would have been best to leave in the scalar version as an option, since I would suspect that it would perform better on Nvidia hardware and the new GCN-based AMD hardware.


#### [Automatic Exposure | Krzysztof Narkowicz](https://knarkowicz.wordpress.com/2016/01/09/automatic-exposure/ "") -

[…] of the screen [Hen14]. Additionally we could also use a compute shader for computing averages [Pet11]. This is usually simpler and more efficient than repeated texture […]