Light Indexed Deferred Rendering

There’s been a bit of a stir on the Internet lately due to AMD’s recent Leo demo, which was revealed to be using a modern twist on Light Indexed Deferred Rendering. The idea of light indexed deferred has always been pretty appealing, since it gives you some of the advantages of deferred rendering (namely using the GPU to decide which lights affect each pixel) while still letting you use forward rendering to actually apply the lighting to each surface. While there’s little doubt at this point that deferred rendering has proven itself as an effective and practical technique, I’m sure that plenty of programmers currently maintaining such a renderer have dreamed of a day where they don’t have to figure out how to cram every attribute into their G-Buffer using as few bits as possible, or consume hundreds of megabytes for MSAA G-Buffer textures.

While the benefits of light indexed deferred were pretty obvious to me, I was pretty sure that the performance wouldn’t hold up when compared to the state of the art in traditional deferred rendering. So I decided to make a simple test app where I could toggle between the two techniques for the same scene. For the deferred renderer, I based my implementation very closely on Andrew Lauritzen’s work since he had done quite a bit of work in terms of optimizing it for modern GPU architectures. The only differences were that I used a different G-Buffer layout (normals, specular albedo + roughness, diffuse albedo, and ambient lighting, all 32bpp) and I used an oversized texture instead of a structured buffer for writing out the individual MSAA subsamples from the compute shader.
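To put a rough number on that G-Buffer memory cost, here’s a quick back-of-the-envelope sketch in Python. The post doesn’t specify the exact DXGI formats, so the formats listed below are plausible guesses of my own, not necessarily what the sample actually uses; the sizing math only depends on each target being 32bpp.

```python
# Hypothetical mapping of the four 32bpp G-Buffer targets described
# above onto DXGI formats. These format choices are my assumptions,
# not the sample's actual layout.
GBUFFER_LAYOUT = [
    ("Normals",                    "R10G10B10A2_UNORM"),
    ("SpecularAlbedo + Roughness", "R8G8B8A8_UNORM"),
    ("DiffuseAlbedo",              "R8G8B8A8_UNORM"),
    ("AmbientLighting",            "R11G11B10_FLOAT"),
]

def gbuffer_bytes(width, height, msaa_samples):
    # Each target is 32 bits (4 bytes) per sample, so total G-Buffer
    # memory scales linearly with the MSAA sample count.
    return len(GBUFFER_LAYOUT) * 4 * width * height * msaa_samples
```

At 1920x1080 with 4xMSAA this comes out to roughly 127 MB for the four targets, which is where the "hundreds of megabytes" pain comes from once you add depth and any extra attributes.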

For the light indexed deferred renderer implementation I used a depth-only prepass to fill the depth buffer, which was then used by a compute shader to compute the list of intersecting lights per-tile. This list was stored in either an R8_UINT or R16_UINT typed buffer (8-bit for < 255 lights, 16-bit otherwise), with enough space pre-allocated in the buffer to store a full light list for each tile. So no bitfields or linked lists or anything fancy like that, just a simple per-tile list terminated by a sentinel value. I found that this worked best since it resulted in the least amount of overhead for reading the list in the forward rendering pass, although there might be better ways to do it. The forward rendering pass then figures out which tile each pixel is in, and applies the list of lights one by one.
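To make the buffer layout concrete, here’s a small CPU-side sketch in Python of the scheme described above: each tile owns a fixed-size slot in one linear buffer, the culling pass writes light indices into the slot, and the forward pass walks the slot until it hits the sentinel. The function names and the emulation itself are mine, not the sample’s actual HLSL.

```python
# Sentinel that terminates each tile's list. 0xFFFF fits R16_UINT;
# the R8_UINT variant would use 0xFF instead.
SENTINEL = 0xFFFF

def write_tile_lists(tile_lights, max_lights):
    """Flatten per-tile light index lists into one linear buffer,
    mirroring what the culling compute shader writes: each tile gets
    a pre-allocated slot of (max_lights + 1) entries."""
    buf = []
    for lights in tile_lights:
        slot = list(lights[:max_lights]) + [SENTINEL]
        slot += [SENTINEL] * (max_lights + 1 - len(slot))  # pad the slot
        buf.extend(slot)
    return buf

def read_tile_list(buf, tile_index, max_lights):
    """What the forward pixel shader does: index to the tile's slot,
    then read light indices until the sentinel terminates the list."""
    base = tile_index * (max_lights + 1)
    lights = []
    for i in range(max_lights + 1):
        idx = buf[base + i]
        if idx == SENTINEL:
            break
        lights.append(idx)
    return lights
```

The fixed stride is what keeps reads cheap in the forward pass: one multiply to find the slot, then a linear walk with no pointer chasing.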

In both cases I used normalized Blinn-Phong with fresnel approximation for the lights, so nothing fancy there. I did use a terrible linear falloff for the point lights just so that I could artificially restrict the radius, so please don’t judge me for that. I also used the depth-only prepass for both implementations, since it actually resulted in a speed up of around 0.5ms for the G-Buffer pass. For a test scene, I used the ol’ Sponza atrium.
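For reference, here’s a rough Python sketch of the two shading terms mentioned above (normalized Blinn-Phong and the linear falloff). The post doesn’t show the actual shader code, so the function names and the exact form are my own reconstruction of the standard formulas, not the sample’s HLSL.

```python
import math

def linear_falloff(dist, radius):
    # The "terrible" linear falloff: attenuation reaches exactly zero
    # at the light radius, which is what makes the artificial radius
    # restriction (and therefore tile culling) work.
    return max(0.0, 1.0 - dist / radius)

def normalized_blinn_phong(n_dot_h, n_dot_l, spec_power):
    # Normalized Blinn-Phong specular term. The (n + 8) / (8 * pi)
    # factor keeps the total reflected energy roughly constant as the
    # specular exponent changes.
    norm = (spec_power + 8.0) / (8.0 * math.pi)
    return norm * (max(0.0, n_dot_h) ** spec_power) * max(0.0, n_dot_l)
```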

I gathered some performance numbers for the hardware I have access to, which is an AMD 6970 and an Nvidia GTX 570. For both GPUs I ran at 1920x1080 resolution with VSYNC disabled, and the timings represent total frame time. The Nvidia numbers were pretty much in line with my expectations:

Nvidia GTX 570

128 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 6.94ms | 6.41ms |
| 2x MSAA | 7.81ms | 7.51ms |
| 4x MSAA | 8.47ms | 9.17ms |

256 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 11.67ms | 9.43ms |
| 2x MSAA | 12.98ms | 10.75ms |
| 4x MSAA | 13.88ms | 12.34ms |

512 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 18.18ms | 14.08ms |
| 2x MSAA | 20.00ms | 15.63ms |
| 4x MSAA | 21.27ms | 17.24ms |

1024 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | N/A | 27.03ms |
| 2x MSAA | N/A | 29.41ms |
| 4x MSAA | N/A | 31.25ms |

Tile-based deferred rendering wins out in nearly every case, and it only gets worse as you add in more lights. Light indexed seems to scale a bit better with MSAA, but even that is only enough to overcome the overall disadvantage in the 128 light case. For 1024 lights it seemed as though the Nvidia driver or hardware couldn’t handle the large buffer I was using for storing the light indices, as I was getting very strange artifacts on the lower half of the screen. However I can only imagine that the trend would continue, with light indexed lagging even further behind the tile-based deferred renderer.

For the AMD 6970, the results were much more interesting:

AMD 6970

128 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 5.26ms | 5.71ms |
| 2x MSAA | 5.98ms | 9.43ms |
| 4x MSAA | 6.49ms | 10.75ms |

256 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 7.87ms | 7.87ms |
| 2x MSAA | 8.77ms | 11.11ms |
| 4x MSAA | 9.73ms | 13.15ms |

512 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 11.67ms | 11.36ms |
| 2x MSAA | 12.98ms | 14.93ms |
| 4x MSAA | 13.89ms | 16.94ms |

1024 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 22.22ms | 20.00ms |
| 2x MSAA | 24.39ms | 25.64ms |
| 4x MSAA | 25.64ms | 33.33ms |

These results really surprised me. The light indexed renderer actually starts out faster than the deferred renderer, and doesn’t really start to fall behind until you hit 1024 lights. However with either 2xMSAA or 4xMSAA the light indexed renderer absolutely blows away the competition. I actually suspected that I did something wrong in my MSAA implementation, until I verified that I got similar results from the Intel sample. Perhaps there’s a better way to handle MSAA in a compute shader for AMD hardware? I didn’t spend a lot of time experimenting, so perhaps someone else has a few bright ideas. Either way it’s clear that forward rendering scales really well with MSAA on this hardware. Even the G-Buffer pass fares pretty well, as it goes from 1ms to 1.2ms to 1.3ms as the MSAA level increases (1.5ms to 1.9ms to 2.1ms without a z prepass).

So, where does this leave us? Even with these numbers we really don’t have a complete picture. Really we need some tests run with…

1. Different scenes, preferably some with even higher poly counts and/or some tessellation
2. More realistic material variety, including different texture configurations, layer blending, decals
3. A variety of complex BRDFs
4. A few different ambient/bounce lighting configurations
5. More lighting types, with different shadowing configurations
6. More hardware to test on

These things have some big implications for what you store in the G-Buffer, forward shading efficiency, and the cost of a z prepass. That last one is important, since it’s mandatory for light indexed deferred but optional for traditional deferred. While it can still be cheaper overall to have a z prepass before your G-Buffer pass (as it was in my case), that could change depending on your vertex processing costs.

So for now, my conclusion is that Light Indexed Deferred is at least in the realm of practical for most cases. Personally I consider even 256 to be a LOT of lights, so I’m not too worried about scaling up to thousands of lights anytime soon. But if anyone has access to different GPUs, I would love to get some more numbers so that I can post them here. So if you happen to have a 7970 or GTX 680 lying around, feel free to download my sample and take down some numbers. Originally the number of lights was hard-coded to 128 in the binary, but I uploaded a new version that lets you toggle through the number of lights that I used for my test runs.

You can find the code and binary on GitHub: https://github.com/TheRealMJP/DX11Samples/releases/tag/v1.0

Here are a few numbers for a GTX 680 contributed by Sander van Rossen:

128 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 2.30ms | 2.60ms |
| 2x MSAA | 2.62ms | 3.86ms |
| 4x MSAA | 2.85ms | 4.95ms |

And some more numbers for the AMD 7970 courtesy of phantom, gathered at 1280x720:

128 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 1.80ms | 1.90ms |
| 2x MSAA | 2.00ms | 2.72ms |
| 4x MSAA | 2.30ms | 3.60ms |

256 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 2.50ms | 2.30ms |
| 2x MSAA | 2.70ms | 3.30ms |
| 4x MSAA | 3.00ms | 4.20ms |

512 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 3.30ms | 2.90ms |
| 2x MSAA | 3.80ms | 4.20ms |
| 4x MSAA | 4.20ms | 5.20ms |

1024 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 5.90ms | 4.50ms |
| 2x MSAA | 6.70ms | 6.40ms |
| 4x MSAA | 7.40ms | 7.80ms |

Radeon 7970 @ 1920x1080, from 3dcgi:

128 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 3.03ms | 3.34ms |
| 2x MSAA | 3.52ms | 5.12ms |
| 4x MSAA | 3.96ms | 6.84ms |

256 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 4.18ms | 4.20ms |
| 2x MSAA | 4.76ms | 6.25ms |
| 4x MSAA | 5.32ms | 8.13ms |

512 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 5.85ms | 5.46ms |
| 2x MSAA | 6.62ms | 8.00ms |
| 4x MSAA | 7.19ms | 10.00ms |

1024 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 10.42ms | 8.92ms |
| 2x MSAA | 11.63ms | 12.66ms |
| 4x MSAA | 12.82ms | 15.63ms |

Comments:

#### [zxaida]( "") -

Radeon 7970 at 1200MHz clocks, 1280x720, default settings:

128 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 1.70ms | 1.80ms |
| 2x MSAA | 1.91ms | 2.55ms |
| 4x MSAA | 6.25ms | 3.27ms |

256 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 2.15ms | 2.08ms |
| 2x MSAA | 2.42ms | 2.94ms |
| 4x MSAA | 7.35ms | 3.74ms |

512 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 2.84ms | 2.51ms |
| 2x MSAA | 3.21ms | 3.57ms |
| 4x MSAA | 8.77ms | 4.46ms |

1024 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 4.78ms | 3.74ms |
| 2x MSAA | 5.43ms | 5.26ms |
| 4x MSAA | 11.9ms | 6.53ms |


#### [Cyrus Rohani]( "crohani@gmx.com") -

Just got my first results after switching from light-prepass to tiled deferred :) I’m quite happy with it. Have you had time to investigate arbitrary volume or spot light implementations yet? I haven’t had time yet since I’m dealing with cascade shadow map performance :( Off topic, but from the latest GDC papers it seems DICE is using VS instancing and the GS to select a rendertarget array index. This is to avoid multiple draws per cascade. But you mention texture arrays being lower performance than an atlas in your tests. Any idea why? Also, I have not heard anyone talk about using an atlas with VS instancing and selecting a viewport index. That would eliminate the array but keep the single draw for all cascades.


#### [Andrew Lauritzen]( "andrew.lauritzen@gmail.com") -

Your demo could be doing something different than mine of course, but if you hit “F8” in mine you can disable G-buffer rendering/updating and last I checked that was the biggest part of the bottleneck on ATI. Of course it would make more sense if it was something in the significantly-more-complex light/shading pass, but that wasn’t what I experienced at least in the past :)


#### [Andrew Lauritzen]( "andrew.lauritzen@gmail.com") -

I will note too that I typically prefer to “disable parts of the rendering”, etc. rather than use queries. Queries are a bit finicky in that they don’t necessarily interact with the pipelining in the GPU in a natural way (i.e. are you measuring end-to-end latency of a submitted command? stalling between each command instead? Neither is a good solution). Of course there’s no perfect solution but I find it a somewhat more consistent and predictable way to profile than queries.


#### [Nathan Reed](http://reedbeta.com/ "nathaniel.reed@gmail.com") -

On my GTX 580 at 1280x720:

| Lights | MSAA Level | Light Indexed | Tiled Deferred |
| --- | --- | --- | --- |
| 128 | 1x | 3.77 | 3.17 |
| 128 | 2x | 4.14 | 3.58 |
| 128 | 4x | 4.39 | 4.17 |
| 256 | 1x | 5.68 | 4.37 |
| 256 | 2x | 6.33 | 4.95 |
| 256 | 4x | 6.80 | 5.52 |
| 512 | 1x | 8.54 | 6.10 |
| 512 | 2x | 9.62 | 6.80 |
| 512 | 4x | 10.42 | 7.35 |
| 1024 | 1x | 16.67 | 10.99 |
| 1024 | 2x | 18.87 | 12.05 |
| 1024 | 4x | 20.41 | 12.82 |

Similar pattern to your GTX 570. I should also note I disabled the Z prepass for the tiled deferred cases since it was slowing it down a bit. By the way, in multisampling mode with light-indexed are you running lighting per MSAA sample or just per pixel? And have you looked at detecting edges and running the per-sample lighting only for the tiles (or pixels) containing edges? That can be a big optimization for tiled-deferred, maybe less so for light-indexed deferred as it seems you’d have to branch in the pixel shader to implement it. Finally, I wonder how CSAA (NVIDIA) or EQAA (AMD) would affect things. I’m not sure how you actually turn those on in D3D, though.


#### [MJP](http://mynameismjp.wordpress.com/ "mpettineo@gmail.com") -

Thanks for those cool graphs Cyrus! I realized a few days ago that I was running an older driver on the machine I did the 570 test on, so it was probably just a driver bug that was resolved at some point. Spot lights are pretty tricky. I’ve been meaning to dedicate some time investigating efficient ways to cull them per-tile, but haven’t gotten around to it yet. A full frustum-frustum test with SAT seems too heavyweight to be done in a single thread (IIRC it’s something like 6*8 + 6*8 + 6*6*8 dot products for the full test), so I’m thinking a cheaper approximation might be the way to go. I’ve been kicking around something I came up with based on plane/cone intersection tests that’s a lot cheaper, but gives false positives for a few cases. Rasterizing the volume might be another viable option for expensive lights. I can let you know how it goes once I get some time to work on it more.


#### [MJP](http://mynameismjp.wordpress.com/ "mpettineo@gmail.com") -

@directtovideo I suspected the same thing regarding shared memory pressure, so I ran a few experiments where I varied the thread group size, but I wasn’t able to improve the performance. And you’re absolutely right about the scheduling…it’s really the key difference between the two techniques. Ultimately it comes down to whether the flexibility you get from shading in a compute shader ends up winning out over the efficiency of hardware scheduling, taking into account having to render out a G-Buffer (for tiled deferred) or requiring geometry to be rasterized twice (for indexed deferred). @Anonymous For a large number of lights in a scene, having a per-pixel list doesn’t seem very compelling to me. The problems to me are:

A. You’d have to compute light intersections per-list rather than per-tile, which means that you can’t compute the intersections for many lights in parallel like you can with per-tile lists. You could rasterize the light volumes and append the index to the per-pixel linked lists (like in the AMD demo, as you suggested) but I’d imagine that would still be much slower.
B. Your granularity during the forward lighting phase is limited by branching coherency, so it doesn’t seem worth it to do fine-grained light intersection.
C. You’ll consume a lot more memory with per-pixel lists.
D. If you use a linked list, just reading the light indices in the forward lighting phase is going to be slower. One thing I discovered early on was that just reading indices can be a serious performance drain, so I tried to make it as cheap as possible.

For a smaller number of lights it might make sense though, especially if going fine-grained allows you to do a better job culling non-spherical light sources.


#### []( "") -

MJP - great read. Just wondering, why didn’t you try using a per-pixel list (like the AMD order-independent-transparency demo) for the light indices? Do you think that would have worse performance? Thanks.


#### [MJP](http://mynameismjp.wordpress.com/ "mpettineo@gmail.com") -

You definitely have a good point regarding the queries…I try to only use them to get a rough idea of timings but even then they can be quite off from the delta you get in overall frame time. I put it in a setting to disable G-Buffer rendering, and that shows a delta of about 4-4.5ms with 4xMSAA which is definitely pretty significant and more in line with your findings. I can actually get a similar result from my queries if try to force a sync point with a simple compute shader that reads from the MSAA G-Buffer textures. I would suspect that there might be something else expensive going on here, perhaps an expensive decompression step to allow the shader to sample the MSAA textures. Thank you for your input!


#### [Cyrus Rohani]( "") -

Not sure why you had issues with the GTX 570 at 1920x1080, it worked fine with mine. Here are my results (Windows 7, Intel Q6600 @ 2.40GHz, NVIDIA GeForce GTX 570, 296.10 drivers):

1280x720:

| Lights | MSAA Level | LIDR | TBDR |
| --- | --- | --- | --- |
| 128 | No MSAA | 4.34 | 4.03 |
| 128 | 2x MSAA | 4.78 | 4.54 |
| 128 | 4x MSAA | 5.18 | 5.05 |
| 256 | No MSAA | 6.53 | 5.40 |
| 256 | 2x MSAA | 7.35 | 6.05 |
| 256 | 4x MSAA | 7.87 | 6.66 |
| 512 | No MSAA | 9.90 | 7.46 |
| 512 | 2x MSAA | 11.23 | 8.19 |
| 512 | 4x MSAA | 12.04 | 8.84 |
| 1024 | No MSAA | 19.23 | 13.15 |
| 1024 | 2x MSAA | 21.73 | 14.49 |
| 1024 | 4x MSAA | 23.80 | 15.38 |

1920x1080:

| Lights | MSAA Level | LIDR | TBDR |
| --- | --- | --- | --- |
| 128 | No MSAA | 7.24 | 6.80 |
| 128 | 2x MSAA | 8.00 | 7.69 |
| 128 | 4x MSAA | 8.47 | 9.17 |
| 256 | No MSAA | 11.62 | 9.90 |
| 256 | 2x MSAA | 12.98 | 10.98 |
| 256 | 4x MSAA | 13.88 | 12.19 |
| 512 | No MSAA | 18.51 | 14.49 |
| 512 | 2x MSAA | 20.40 | 15.87 |
| 512 | 4x MSAA | 21.73 | 17.24 |
| 1024 | No MSAA | 37.03 | 27.02 |
| 1024 | 2x MSAA | 40.00 | 28.57 |
| 1024 | 4x MSAA | 43.47 | 30.30 |

I uploaded a graph of results from this page, substituting my GTX 570 results: http://img543.imageshack.us/img543/4589/lidrvstidr.png Do you have any idea about the performance difference if using arbitrary light volumes? Or frustum volumes for spot lights? Thanks.


#### [Sander van Rossen](http://gravatar.com/logicalerror "sander.vanrossen@gmail.com") -

It was at the default resolution, I don’t know if that’s 1280×720..

| Lights | MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- | --- |
| 128 | No MSAA | 2.3ms | 2.6ms |
| 128 | 2x MSAA | 2.62ms | 3.86ms |
| 128 | 4x MSAA | 2.85ms | 4.95ms |
| 256 | No MSAA | 3.5ms | 3.95ms |
| 256 | 2x MSAA | 3.95ms | 4.76ms |
| 256 | 4x MSAA | 4.3ms | 6.28ms |
| 512 | No MSAA | 5.26ms | 5.71ms |
| 512 | 2x MSAA | 5.95ms | 7.87ms |
| 512 | 4x MSAA | 6.45ms | 9.61ms |
| 1024 | No MSAA | 10.2ms | 12.6ms |
| 1024 | 2x MSAA | 11.62ms | 15.15ms |
| 1024 | 4x MSAA | 12.65ms | 16.39ms |

And it’s dual GTX 680 (so 2x with SLI, not a single card). These results make me wonder if SLI is configured correctly … or if something in the app makes it impossible for the driver to use SLI effectively. It’s just hard to believe that dual GTX 680s can be beaten so easily heh


#### [metatronico]( "niels@frohling.biz") -

AMD 5870, 128 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 3.48ms | 3.59ms |
| 2x MSAA | 4.03ms | 4.60ms |
| 4x MSAA | 4.44ms | 5.49ms |


#### [ethatron]( "niels@paradice-insight.us") -

Continued … (above the “No MSAA” is swapped, sorry - yes on the 5870 “No MSAA” and “Some MSAA” swaps the rank = 512 indexed can’t manage competing)

256 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 5.18ms | 4.42ms |
| 2x MSAA | 5.78ms | 5.95ms |
| 4x MSAA | 6.32ms | 6.89ms |

512 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 7.51ms | 5.95ms |
| 2x MSAA | 8.40ms | 7.63ms |
| 4x MSAA | 9.09ms | 8.69ms |

1024 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 14.28ms | 10.20ms |
| 2x MSAA | 15.87ms | 12.98ms |
| 4x MSAA | 17.24ms | 14.28ms |

All default resolution.


#### [ethatron]( "niels@paradice-insight.us") -

LI seems ALU-bound and TB seems at least memory-related on the 5870; overclocking the core yields different speedups for LI and TB respectively (looking at the two extremes). LI speeds up more or less linearly on all accounts, that is 850 to 990 (MHz) ^= 3.59 to 3.19 (linear would be 3.08) ^= 17.24 to 14.92 (linear would be 14.80). TB stalls out in the “No MSAA” case at 990, not getting any faster, that is 850 to 990 (MHz) ^= 3.48 to 3.14 (linear would be 2.98) ^= 14.28 to 12.65 (linear would be 12.26). On the slow extreme it’s 13.5% vs. 11.5% speedup from a 14.2% overclock, that is TB gains 85% of what LI gains. To me it seems that on the GKs TB is only faster because of the large sustainable memory bandwidth. And it’s apparent that if I clocked my 5870 at say 2GHz, then TB would never win. LI vs. TB seems to be an ALU vs. memory tradeoff, or not that relevant if the architecture is somewhere in the middle. But as memory speeds are unlikely to rise much further, and are often below our GDDR5 speeds on medium-class cards, while core clocks still keep rising even on medium-class cards, I’d say LI has the rosier prognosis.


#### [MJP](http://mynameismjp.wordpress.com/ "mpettineo@gmail.com") -

@Nathan, For Light Indexed Deferred it’s really just forward lighting, so I just turn on MSAA for the render target and let the hardware do its thing. This means that you only shade multiple times per pixel along triangle edges where the triangle doesn’t full cover all subsamples of a pixel. You could certainly use CSAA if you wanted, you just turn it on by using a different quality level. I’m not sure about EQAA. @ethatron Thank you for sharing such a detailed analysis! Your findings make sense though, since light indexed deferred tends to be VERY heavy on ALU in the pixel shader.


#### [3dcgi](http://3dcgi.com "tmartin@ieee.org") -

Radeon 7970 at stock clocks, full screen on a 1080p monitor with the taskbar hidden, so the rendering window is a title bar short of 1080p.

128 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 3.03ms | 3.34ms |
| 2x MSAA | 3.52ms | 5.12ms |
| 4x MSAA | 3.96ms | 6.84ms |

256 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 4.18ms | 4.20ms |
| 2x MSAA | 4.76ms | 6.25ms |
| 4x MSAA | 5.32ms | 8.13ms |

512 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 5.85ms | 5.46ms |
| 2x MSAA | 6.62ms | 8.00ms |
| 4x MSAA | 7.19ms | 10.00ms |

1024 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 10.42ms | 8.92ms |
| 2x MSAA | 11.63ms | 12.66ms |
| 4x MSAA | 12.82ms | 15.63ms |


#### [ethatron]( "niels@paradice-insight.us") -

@mjp Here: “No MSAA 3.48ms 3.59ms” I accidentally flipped the number, it should be “No MSAA 3.59ms 3.48ms”. :^)


#### [MJP](http://mynameismjp.wordpress.com/ "mpettineo@gmail.com") -

Thanks for posting those, Sander! The original binary was running at 128 lights. I just uploaded a new one that lets you switch the number of lights. I would assume that you ran at the default resolution of 1280x720?


#### [Nathan Reed](http://reedbeta.wordpress.com/ "nathaniel.reed@gmail.com") -

@MJP Hah, using the hardware to do what it’s designed for - who does that? :) But anyway, it seems that this is a bit of an unfair comparison because light-indexed is mostly shading per-pixel while tiled-deferred is (I presume) shading per sample in all cases. Tiled-deferred with MSAA edge detection could turn things around on the AMD cards. (Of course, the fact that MSAA ‘just works’ with light-indexed is itself an argument in its favor…)


#### [Nathan Reed](http://reedbeta.wordpress.com/ "nathaniel.reed@gmail.com") -

Aha, that’s great! In that case, yeah, the different MSAA scaling between the two techniques is very interesting.


#### [MJP](http://mynameismjp.wordpress.com/ "mpettineo@gmail.com") -

@Nathan The tile-based deferred renderer does use edge detection. It compares the normal + depth of all subsamples in a pixel, and appends the coordinate of those pixels to a list in shared memory. Then all of the subsamples from those pixels are distributed evenly among threads in the thread group so they can be shaded. The comparison is actually pretty conservative in my sample, so you end up doing per-sample shading on significantly fewer pixels than in the forward-rendered case. But even with that optimization the AMD cards take a huge hit from MSAA, which is a bit puzzling to me.
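To sketch that classification outside of HLSL: here’s a rough CPU-side Python version of the edge-detection idea described in the comment above. The thresholds and function names are my own illustrative choices, not the values the sample actually uses.

```python
# Classify a pixel as needing per-sample shading by comparing its
# MSAA subsamples against the first subsample. Thresholds here are
# arbitrary placeholder values, not the sample's actual tuning.
def is_edge_pixel(samples, depth_eps=0.01, normal_eps=0.99):
    """samples: list of (depth, normal) tuples, one per subsample."""
    d0, n0 = samples[0]
    for d, n in samples[1:]:
        # Relative depth discontinuity
        if abs(d - d0) > depth_eps * max(abs(d0), 1e-6):
            return True
        # Normal divergence: dot product below the cosine threshold
        if sum(a * b for a, b in zip(n, n0)) < normal_eps:
            return True
    return False

def classify_tile(tile_pixels):
    """Split a tile's pixels into per-pixel and per-sample work,
    mirroring the shared-memory edge list built by the compute
    shader before redistributing edge subsamples across threads."""
    edges = [i for i, s in enumerate(tile_pixels) if is_edge_pixel(s)]
    edge_set = set(edges)
    flat = [i for i in range(len(tile_pixels)) if i not in edge_set]
    return flat, edges
```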


#### [MJP](http://mynameismjp.wordpress.com/ "mpettineo@gmail.com") -

Hi Matt, I’ve been using a simple frustum/cone test that checks to ensure that some part of the cone is on the positive side of all 6 frustum planes (using the cone/plane test from Real-Time Collision Detection). It certainly works when the cone is entirely on the negative side of one of the 6 planes, but for cones that are rather large relative to the frustum you can get cases where the cone is on the positive side of all 6 planes but still doesn’t intersect the actual frustum (you can actually get the same problem with a sphere/frustum test if you do it the same way). Constructing additional planes to test against will help, but doesn’t solve the problem entirely. If you’re not running into the same issues, then perhaps you’re doing something that’s a bit more sophisticated?
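For anyone following along, here’s a Python sketch of that kind of conservative cone/frustum test (not the sample’s actual code; names and the exact formulation are mine). It culls the cone only when it lies entirely behind some frustum plane, which is exactly why it can report false positives for large cones, as described above.

```python
import math

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cone_behind_plane(apex, axis, height, radius, n, d):
    """True if a cone (unit-length axis, given height and base radius)
    lies fully on the negative side of the plane dot(n, x) = d.
    Only two points need checking: the apex, and the base-rim point
    pushed farthest toward +n (cone/plane test in the spirit of
    Real-Time Collision Detection)."""
    # Direction perpendicular to the axis, in the plane of n and axis,
    # pointing toward the plane normal.
    ndotv = dot(n, axis)
    m = tuple(n[i] - ndotv * axis[i] for i in range(3))
    mlen = math.sqrt(dot(m, m))
    m = tuple(c / mlen for c in m) if mlen > 1e-8 else (0.0, 0.0, 0.0)
    rim = tuple(apex[i] + height * axis[i] + radius * m[i] for i in range(3))
    return dot(n, apex) - d < 0.0 and dot(n, rim) - d < 0.0

def cone_intersects_frustum(apex, axis, height, radius, planes):
    """Conservative per-tile test: report intersection unless some
    plane fully culls the cone. False positives are possible when the
    cone is on the positive side of every plane but misses the
    frustum itself."""
    return not any(cone_behind_plane(apex, axis, height, radius, n, d)
                   for n, d in planes)
```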


#### [3dcgi](http://3dcgi.com "tmartin@ieee.org") -

I don’t have time to perform a full run, but here are a few 1280x720 numbers for comparison with a stock Radeon 7970.

1024 Lights

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 6.02ms | 4.63ms |
| 2x MSAA | 6.85ms | 6.58ms |
| 4x MSAA | 7.52ms | 8.00ms |

One thing I noticed from the other reported numbers is the GTX 580 is faster than the GTX 680 at tiled deferred, yet the situation changes for indexed deferred. I’m surprised at how much faster the Radeon 7970 is than the GTX 680. At least with 1024 lights.


#### [MJP](http://mynameismjp.wordpress.com/ "mpettineo@gmail.com") -

Yeah, the default resolution is 1280x720. SLI won’t kick in unless the driver has a profile for the app (or you use NVAPI to manually select a profile), so I’m sure that it’s just running on 1 GPU.


#### [Sander van Rossen](http://gravatar.com/logicalerror "sander.vanrossen@gmail.com") -

I couldn’t see how many lights your binary was displaying, but these are the results for my ridiculously overpowered dual GTX 680s:

| MSAA Level | Light Indexed Deferred | Tile-Based Deferred |
| --- | --- | --- |
| No MSAA | 2.3ms | 2.6ms |
| 2x MSAA | 2.62ms | 3.86ms |
| 4x MSAA | 2.85ms | 4.95ms |


#### [Andrew Lauritzen]( "andrew.lauritzen@gmail.com") -

The big hit on AMD cards with MSAA seems to be in the G-buffer rendering phase in my brief testing. Never tracked down why, as the cards have plenty of bandwidth available. Perhaps a ROP throughput bottleneck, I’m not sure. Ideally if MSAA compression was “perfect”, it should be about the same overhead as MSAA with forward as it (roughly) is on NVIDIA.


#### [MJP](http://mynameismjp.wordpress.com/ "mpettineo@gmail.com") -

Hi Andrew, According to my profiling (performed via queries) filling the G-Buffer on my AMD 6970 only accounts for 2.09ms at 1920x1080, 4xMSAA with no z prepass. With a z prepass it takes 1.31ms, with 0.4ms for the z prepass. The increase in frame time mostly comes from the lighting compute shader, which goes from 2.7ms with no MSAA to 6.8ms with 4xMSAA for the 128 light case.


#### [anteru]( "wordpress@catchall.shelter13.net") -

GTX 680, 1280x720 (numbers are light prepass, tiled deferred):

| Lights | MSAA Level | LIDR | TBDR |
| --- | --- | --- | --- |
| 256 | 1x | 3.58 | 4.15 |
| 256 | 2x | 4.07 | 5.32 |
| 256 | 4x | 4.4 | 6.25 |
| 512 | 1x | 5.46 | 5.6 |
| 512 | 2x | 6.2 | 7.75 |
| 512 | 4x | 6.7 | 9.4 |
| 1024 | 1x | 10.6 | 12.2 |
| 1024 | 2x | 12.04 | 14.7 |
| 1024 | 4x | 13.1 | 16.1 |


#### [directtovideo](http://directtovideo.wordpress.com "smaash@yahoo.com") -

MJP: this is all speculation, but one problem I’ve found with the tile-based deferred and splitting the samples across threads is the amount of shared memory. I’ve had real problems with this especially on geforce - it seems really sensitive to shared memory usage (and its effect on occupancy). The other thing is, with the deferred version (with quite large threadgroups running 16x16 tiles = 256 threads - I actually went for 8x8 here) you’re making quite a big statement about your own scheduling / work balancing - the light indexed version is running a pixel shader to do the lighting, so the hardware is scheduling the work in its own, probably smart, way. Wonder if that’s a part of the difference. Nice comparison though! Very useful to see.


#### [Matías N. Goldberg (@matiasgoldberg)](http://twitter.com/matiasgoldberg "matiasgoldberg@twitter.example.com") -

At least on an AMD Radeon HD 7770, there are artifacts when using 1024 lights (all MSAA settings) with Forward rendering (G-Buffer works fine) Here’s an image highlighting the artifacts: http://i.imgur.com/5Gvl2.jpg IT ONLY APPEARS WHEN LOOKING FROM THAT ANGLE Although it looks very small, it’s actually *very* noticeable because it flickers in blocks (tiles) across ALL the roof border; even when the camera is completely still. It works fine with 512 lights. My theory from a quick glance is that those tiles have more lights than what the card allows to hold in the linked list buffer (is there a hard limit? or maybe there’s a hard limit in the forward rendering loop…?) and race conditions cause different lights to be dropped each frame; therefore the tile always has a light list that doesn’t hold all the needed lights; being always different. Each tile flickers going lighter & darker. I don’t have an NVIDIA DX11 card to compare with, unfortunately.


#### [HPG 2012 | dickyjim](http://dickyjim.wordpress.com/2012/07/04/hpg-2012/ "") -

[…] Clustered Deferred and Forward Shading This was another paper I had read before attending. The value of the clustering for the samples is based on the spatial distribution of a large group of lights and how that interacts with tile based deferred shading where each tile contains a large range of depths. As an additional clustering key, they’ve also looked at normal cone clustering. Although this was expensive in their scenes, the normal cone clustering looked like it would have value in tiles with less depth and normal variance which is more likely in a game environment. One other thing of note, they mentioned during the managing clusters that their code was 2 passes on Fermi but could be 1 on Kepler due to the improved atomic performance. Overall, I think that a large part of the positive results for the test was due to the selective nature of the scenes used to test the technique. I’d like to see the performance results in a wider range of scenes (for example as in Matt Pettineo’s light indexed work). […]


#### [directtovideo](http://directtovideo.wordpress.com "mattswoboda@yahoo.co.uk") -

MJP: Looks like I’m doing the same as you then. Clearly I just haven’t managed to generate a case that breaks it yet, so will look out for that.


#### [WIP: Deferred Rendering | Chetan Jags](http://chetanjags.wordpress.com/2014/09/25/wip-deferred-rendering/ "") -

[…] mentioned in Battlefield3 presentation and in this demo from Intel. Thinking about ForwardPlus or Light Indexed Deferred for transparent […]


#### [MJP](http://mynameismjp.wordpress.com/ "mpettineo@gmail.com") -

Hi Matias, I haven’t seen any similar artifacts myself, but I’m not terribly surprised. It certainly wouldn’t be the first time that I encountered quirky behavior with compute shaders that use atomics on shared memory variables. There’s actually no linked list, each tile has enough room in a buffer to store indices for N lights (where N is the maximum number of lights in the scene). So there *should* be enough room in the buffer to store 1024 lights, as well as in shared memory.


#### [Lukas M]( "mjp@lukasmeindl.at") -

@matiasgoldberg I got the same problem. I have a Radeon HD 7950 - it flickers from that angle with 1024 lights activated. However it works fine with fewer lights.


#### [ozlael](http://ozlael.egloos,com "ozjjangozjjang@gmail.com") -

thanks for the good article, as always :-)


#### [Matías N. Goldberg (@matiasgoldberg)](http://twitter.com/matiasgoldberg "matiasgoldberg@twitter.example.com") -

Hi, thanks for the answer. Yeah, when I was referring to the linked list buffer, I was thinking you probably just used a big per-tile array. This isn’t surprising to me either; CS puts more responsibility on developers than pixel shaders do, but it’s a new tech where driver, compiler & even HW bugs can’t be ruled out yet. So, either the driver clamps the buffer size, the HW’s atomic operation is malfunctioning, there’s a rare race condition somewhere, the tile is somehow overflowing, or the pixel shader in the forward pass is just refusing to read the entire buffer and is only parsing it partially. So many possibilities…. I just wanted to know if someone else was able to reproduce the artifacts (only shows when lightcount = 1024; while looking from that particular angle as in the screenshot)


#### [directtovideo](http://directtovideo.wordpress.com "mattswoboda@yahoo.co.uk") -

MJP: Did you have any joy with spot culling? I’m running with the cone/plane test per frustum plane I had on PS3/SPU. Haven’t seen any false positives ..