Pictured above: Paul Atreides visualizing his full tree of shader permutations and regretting his decisions

If you've read the previous article then you hopefully have a decent understanding of how shader permutations ended up being such a common issue. The good news is that there is some hope for the future: if we look across recent game releases as well as the features available in the latest GPUs and APIs, we do see some promising avenues for digging ourselves out of our self-imposed avalanche of compiled bytecode. In my opinion, nothing I'm going to mention here is a silver bullet on its own: each technique comes with a set of trade-offs to be carefully evaluated in the context of an engine and the games that run on it. Regardless, it's inspiring to see smart and resourceful people come up with clever approaches that help to sidestep some of the issues that I brought up in the previous article.

### Only Compile What You Need

This is the simplest, oldest, and perhaps least-effective way to reduce permutation counts. The general idea is that out of your 2^N possible permutations, some subset of them are either redundant, invalid, or will never actually get used. Therefore you will reduce the set of shaders that you need to compile and load if you can strip out the unnecessary permutations. In many cases the reduction in shader permutation count can be substantial, and can be the difference between "completely untenable" and "we can ship this". Ideally this process is something you would do offline, perhaps as part of a content processing pipeline that has knowledge of what meshes and materials are going to be used in each scene. But there have also been games/engines that have done it at runtime, essentially deferring compilation of a permutation until the scene is loaded. Either way there are some pretty obvious downsides:

• Determining your set of shaders offline requires your offline process to have a pretty complete understanding of both the content and how it interacts with your shader pipeline. Making changes to how that works may also require you to recompile and invoke the content processing pipeline again, as opposed to just recompiling the runtime code and running the app again.
• Offline approaches may make editors and other tooling more complicated, since you now have to deal with an on-the-fly combination of mesh + material
• On a related note, some engines are set up to treat their shaders more like code and less like content. For instance, they might want to compile shaders using the same build system used for C++ code since it already handles dependencies and includes correctly. Moving to a system where the shaders are compiled as part of the content pipeline can potentially be a large shift.
• If you wait until runtime to compile your shaders, you now have to either make the user wait for compilation to complete or do something else to hide the compilation time. This might involve having QA generate a cache of shaders to ship with the game, or it might even involve using a slow and generic shader until your specialized permutation finishes compiling in the background. Platforms that don't allow you to invoke the shader compiler at runtime can also make this approach a non-starter, at least without some kind of two-step process that discovers the shaders and then compiles an offline cache to ship with the app.
• The number of shaders that you generate is content-dependent, and the count could vary wildly depending on the scene/game/material setup/etc.
• In the worst case this degenerates to compiling your full set of permutations, except that you may end up with a more complicated pipeline for generating those permutations.
• It doesn't do anything to reduce your loaded shader/PSO count at runtime. If you need to rely on a small number of PSOs to implement some technique (for example, GPU-driven rendering) then this approach won't help you.
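As a rough sketch of the offline flavor of this idea, a content-processing step can walk a scene's materials, collect the feature combinations that are actually requested, and only compile those. Everything below (the feature names, the material representation) is made up purely for illustration:

```python
from itertools import product

# Hypothetical feature axes for a pixel shader: 2^N grows fast, but
# scene content typically exercises only a fraction of the full set.
FEATURES = ["NORMAL_MAP", "EMISSIVE", "ALPHA_TEST", "DETAIL_MAP"]

def full_permutation_set():
    # Every on/off combination of every feature: 2^N keys.
    return {frozenset(f for f, on in zip(FEATURES, bits) if on)
            for bits in product([0, 1], repeat=len(FEATURES))}

def used_permutations(materials):
    # Only the combinations that some material actually requests.
    return {frozenset(m) for m in materials}

scene_materials = [
    {"NORMAL_MAP"},
    {"NORMAL_MAP", "ALPHA_TEST"},
    {"EMISSIVE"},
    {"NORMAL_MAP"},  # duplicates collapse into a single permutation
]

full = full_permutation_set()              # 2^4 = 16 possible permutations
used = used_permutations(scene_materials)  # only 3 needed for this scene
```

Note how the downsides from the list above show up even in this toy version: `used` is only valid for this particular set of materials, so changing the content means re-running the pipeline.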

### Run-Time Specialization

Vulkan and Metal both support an interesting feature called specialization constants, or "spec constants" for short (Metal calls the feature "function specialization"). The basic idea goes like this:

• You compile your shader with a global uniform value (basically like a value in a constant buffer) that's used in your shader code
• When creating the PSO, you pass a value for that uniform that will be constant for all draws and dispatches using the PSO
• The driver somehow ensures that the value you passed is used in the shader program. This might include:
  • Treating the value as a "push constant", basically a small uniform/constant buffer that gets set from the command buffer
  • Patching the value into the compiled intermediate bytecode or vendor-specific ISA
  • Treating the value as a compile-time constant when the driver does its JIT compile, and performing full optimizations (including constant folding and dead code elimination) based on that value

It's a pretty neat concept, since it potentially lets you avoid having to do a lot of your own compiling and instead rely on the driver's backend compiler to do a lot of the work for you. If you have a lot of specializations it won't necessarily allow you to reduce your PSO count, but it can be pretty nice in a similar way to the "Compile What You Need" strategy if you don't use your full set of possible permutations at runtime. A good example is using them to implement quality levels based on user settings: you're not going to need both the low-quality and high-quality PSOs at once, so you can use a spec constant to choose when you create the PSO. It should be noted though that spec constants also share one of the main issues with "Compile What You Need": they won't reduce your runtime PSO count. You can potentially ship and load fewer shader binaries, which is an improvement, but you can still run into the same problems that PSOs can cause with batching and ray tracing.
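As a loose analogy for the best-case driver behavior, here's a Python sketch where the branch on the "constant" is resolved once at "PSO creation" time, so the specialized function simply never contains the dead path. The quality levels and the shading math are invented for illustration:

```python
def make_specialized_shader(high_quality):
    # Analogy for the JIT-compile case: the spec constant is known at
    # "PSO creation", so the branch is resolved once and the dead code
    # path never exists in the specialized program.
    if high_quality:
        def shade(x):
            # Stand-in for the expensive, fully-featured path.
            return x * x * 0.8 + 0.2 * x
    else:
        def shade(x):
            # Stand-in for the cheap path; the code above is gone.
            return 0.5 * x
    return shade

low = make_specialized_shader(False)   # "PSO" for the low setting
high = make_specialized_shader(True)   # "PSO" for the high setting
```

The weaker fallback behaviors from the bullet list (push constant, bytecode patching) would correspond to keeping the `if` inside `shade` and testing it per invocation instead.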

One interesting variant on this approach could be to have the app manually invoke the shader compiler (or some slimmed-down version of it) to patch in the spec constant and optimize based on its value. I'm not sure if this would be significantly faster than invoking the full compiler toolchain again, but perhaps it could be speedier since you wouldn't need to parse anything. This approach would work on any API as long as the compiler can run on-target, and you would have the peace of mind of knowing that the optimization and dead-stripping is for sure happening. It would also put any trade-offs regarding background compilation in the hands of the app developers, which would certainly be more consistent with the overall spirit of the "explicit" APIs.

### Cached Material Evaluation

The general idea here is to cache the results of material evaluation ahead of time, so that the runtime pixel shader no longer needs a permutation for every combination of material features. There are a few ways to go about it:

• Use Substance Designer or a custom material pipeline to do offline compositing and generation of your textures
  • Even with this you generally still want to combine tiling maps together at runtime, since "flattening" them would consume considerably more memory
• Use an offline system combined with runtime virtual texturing to fully generate a unique set of composited textures for all surfaces in the game (basically the Megatexture approach)
  • There are plenty of well-documented issues with this approach, such as disk space, streaming/transcoding cost, lack of fine texture detail, etc.
• Use a runtime virtual texturing system that composites/generates pages on-the-fly
  • This is more complex, and you may still need permutations for the process that does the VT page generation

The end goal of these techniques is to end up with simpler pixel shaders that can just sample one set of textures without needing to do anything fancy with them. This won't help you for the portions of the shader that don't deal with generating shading parameters, but it can potentially cut things down quite a bit.
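A toy sketch of the baking step, using 1-D lists of floats as stand-in "textures" (the blend formula and texture contents are made up): the layer blend that would otherwise need its own shader permutation is evaluated once, offline, and the runtime shader just samples the result.

```python
def composite_layers(base, detail, blend):
    # Offline "flattening": bake the per-texel layer blend into a single
    # texture, so no blend logic (or permutation for it) is needed at
    # runtime. Each list is a toy 1-D texture, one float per texel.
    return [b * (1.0 - w) + d * w for b, d, w in zip(base, detail, blend)]

base   = [0.2, 0.4, 0.6]
detail = [1.0, 1.0, 1.0]
blend  = [0.0, 0.5, 1.0]

baked = composite_layers(base, detail, blend)
# Runtime pixel shader: just sample `baked` — no layering features left.
```

The memory trade-off from the first bullet is also visible here: `baked` is a full unique texture, where `base`/`detail` might have been small tiling maps.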

### Replace Permutations with Branching and Looping

On modern GPUs flow control is no longer the problem that it used to be. Having divergent branching and looping within a wave is still a problem due to the SIMD execution model used on GPUs, but if all threads go down the same path it can be quite efficient. Therefore if you've got a material flag that enables a feature for an entire draw call, it could make a lot of sense to stick that value in a uniform/constant buffer and branch on it in the shader at runtime. This kind of branching is usually referred to as "uniform branching" or "static branching" since it guarantees that you're not going to have any divergence. The idea here is that we can rethink a lot of the permutation decisions that we made 10 years ago and (hopefully) reduce our total shader count after sprinkling in a lot of these static branches. If that branch is totally uniform then the shader processor will legitimately skip over the instructions in the branch when the flag is disabled, and it may not even cost very many cycles to do this if the hardware has a dedicated scalar unit for performing those operations.
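As a minimal sketch of the idea, here the per-draw flag lives in a stand-in constant buffer shared by every pixel in the draw, so the branch is guaranteed to be uniform across the wave. The feature flag and its math are hypothetical:

```python
def ubershader_pixel(base_color, draw_constants):
    # draw_constants stands in for a constant buffer: it holds the same
    # values for every pixel in the draw, so this branch is uniform
    # across the whole wave and the disabled path is genuinely skipped,
    # not executed-and-masked like a divergent branch would be.
    color = base_color
    if draw_constants["emissive_enabled"]:
        color += draw_constants["emissive_strength"]
    return color

# Two draws sharing one shader, instead of two compiled permutations:
glowing = {"emissive_enabled": True,  "emissive_strength": 0.5}
plain   = {"emissive_enabled": False, "emissive_strength": 0.5}
```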

The poster child for this approach is the past two Doom games from id: Doom 2016 and Doom Eternal. In their presentations they mention that they use a forward-rendering ubershader that results in only ~100 shaders and ~350 PSOs. This is of course more than a handful, but a distributed compilation system can probably tear through that many shaders in no time, making it very attractive from an iteration point of view. Personally I get quite jealous when thinking of how much simpler my life would be with only that many shaders to deal with, and I really doubt I'm the only one! Their games run really well and look great, so it's a pretty strong proof-of-concept that this approach can work.

With that said, I personally still think it can be quite difficult to achieve a low permutation count in the general case. In part 1 we discussed why this is tricky to get right, so I won't go into detail again. But the main gist of it is that it can be difficult to know where you should leave a branch or loop and where you should permute/unroll. On consoles you can make pretty informed decisions by looking at a single GPU in isolation (and make use of the excellent tooling), but on PC and mobile you are in a much, much tougher spot. On the positive side, having fewer permutations means fewer total programs to optimize by hand, which can potentially pay off in big ways. In particular it's a lot more reasonable to try to make sure that all of your permutations have a reasonable register count and occupancy...at least assuming that you have the tools available to determine these things. There are also some things that really do require permutations since they are static properties of the shader itself, and thus can't be replaced with branching. In particular this includes the set of inputs and outputs of a shader (interpolants, render targets), usage of discard, per-sample execution, and forced early-z.

I think what would truly help in using this approach is to have a lot of discipline and focus in your material/shader feature set. The fewer features you need to support, the easier it's going to be to make sensible decisions about where you should permute. I would imagine you would have to get used to saying "no" to a lot of feature requests, since every feature added can potentially affect the performance of your ordinary baseline materials unless it's gated behind a permutation. For those of us with very broad material feature sets it seems difficult to get to that point without being very disruptive to existing workflows, and/or without giving up things that make a significant contribution to the look of a game.

Another thing that could really help is good automated processes for helping to figure out which permutations are truly worth it vs. just leaving in a branch. Unfortunately the testing space for this is quite large: you can have N different materials and M different video cards with a huge amount of permutation options, so it's not feasible to test exhaustively. But perhaps reasonable subsets can be derived that give a decent level of confidence.

One last thing I would like to note is that you have to watch out for drivers taking your handful of shaders and permuting them behind your back. Drivers have been known to peek at values from constant/uniform buffers and use them to silently flatten branches and unroll loops. This can lead you to think that your shaders are well-optimized...until you run them on a driver that doesn't do this and the performance is worse. It's tougher for drivers to do this in D3D12 and Vulkan since there are more avenues to load data into shaders, but it's still feasible if the driver is sufficiently motivated.

### Deferred Rendering, and Other Techniques For Splitting Up Shading

When people first learn about deferred rendering, there's usually a lot of focus on how it helps to support a greater number of dynamic lights by decoupling the lighting process from your geometry. While this was true at the time it became popular, IMO the real secret weapon of deferred is that it's a way of chopping up a mesh's rendering process into at least two distinct phases. In effect this is really a way of modularizing your shader code, despite the fact that individual shader programs are still monolithic. You can conceptually think of most deferred techniques as splitting up a single gigantic pixel shader function into two separate functions: one that calculates the parameters that feed into the lighting phase (surface normals, albedo, roughness, etc.) and another that takes those parameters and computes how much light reflects off the surface given a collection of lighting sources. With this mental model you can even think of the G-Buffer as being a sort of "ABI" that allows your first phase to pass its arguments to the second, except in this case it does so by writing out many megabytes of per-pixel data to a collection of render target textures.
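The two-phase mental model can be sketched like this, with a dict standing in for the G-Buffer "ABI". The parameter set and the simple N·L lighting below are illustrative, not any particular engine's layout:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gbuffer_pass(material):
    # Phase 1: evaluate material features and write shading parameters
    # out. The returned dict plays the role of the G-Buffer "ABI".
    return {"albedo": material["albedo"], "normal": material["normal"]}

def lighting_pass(texel, light_dir, light_intensity):
    # Phase 2: pure lighting math. It only sees the G-Buffer parameters
    # and knows nothing about how the material produced them.
    n_dot_l = max(dot(texel["normal"], light_dir), 0.0)
    return tuple(c * n_dot_l * light_intensity for c in texel["albedo"])

texel = gbuffer_pass({"albedo": (1.0, 0.5, 0.25), "normal": (0.0, 0.0, 1.0)})
lit = lighting_pass(texel, (0.0, 0.0, 1.0), 2.0)
```

In the real thing the dict is many megabytes of render target data, which is exactly the cost the "ABI" framing makes visible.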

The effect on permutation count can be significant, since it effectively causes us to split our pixel shader into two feature groups instead of one. Let's see what our permuted pixel shader from part 1 looks like before and after converting to deferred:

$PShader(NormalMap[Off, On] * LightType[Point, Spot, Area] * Shadows[Off, On])$

$GBuffer(NormalMap[Off, On]) + Lighting(LightType[Point, Spot, Area] * Shadows[Off, On])$

Splitting up our shader turned one of those multiplies into an add, causing us to drop from 12 pixel shader permutations down to 8. Saving 4 shaders might not sound like a lot, but keep in mind the delta can be significantly larger as the number of features grows. In practice this can be a huge win for permutation counts, and the modularization can have all kinds of other halo effects. For example you can potentially optimize the code for your G-Buffer and Lighting phases separately without having to consider interactions, and the peak register usage will often be lower as a result of splitting things into smaller shaders with fewer steps. Shader compilers may also be able to chew through your shaders more quickly since they are smaller, as optimization time can grow non-linearly with code length.
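We can sanity-check that arithmetic with a couple of lines of Python, where each feature axis contributes its option count:

```python
def permutation_count(*axes):
    # Permutations multiply across the feature axes of a single shader.
    n = 1
    for options in axes:
        n *= options
    return n

# Forward: one monolithic pixel shader, so every axis multiplies.
forward = permutation_count(2, 3, 2)   # NormalMap * LightType * Shadows

# Deferred: the G-Buffer and Lighting shaders permute independently,
# so their counts add instead of multiplying.
deferred = permutation_count(2) + permutation_count(3, 2)
```

With more axes on each side the gap widens quickly, since each split replaces a multiplication by the other group's whole count with a simple addition.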

In practice there are all kinds of flavors of deferred, and even more ways of splitting the final rendering pipeline into separate phases with distinct shader programs. For instance, it's completely possible to pull out only shadow map evaluation into its own deferred pass that requires nothing more than a depth buffer, while keeping everything else in a combined forward pixel shader. Or going in the other direction: with deferred you have the option to either do all lighting in a single combined draw or dispatch, or to further split the lighting into multiple sub-steps. Doing 1 pass per light source can allow for trivial shader specialization based on the properties of the light source while avoiding combinatorial issues, while doing separate passes for things like sampling ambient diffuse lighting can also provide opportunities for using async compute.

In part 1 we discussed how the monolithic shader compilation model contributed to permutation explosions. If you can't compile and optimize portions of a shader program individually, then permutations that only affect one portion end up requiring the entire program to be recompiled. What if there was some hot new technology from the 1970's that would let us compile separate binaries and then "link" them together into one combined shader program? That would be pretty neat, right? 😉

The good news is that DXIL and SPIR-V support this sort of thing! For D3D12/DXIL all of the functionality to compile and link libraries is present in dxc, either via the command line or through its API. See my recent post about it for more details on how it works in practice. The story for SPIR-V though is a little more complicated. SPIRV-Tools has a utility for doing the linking, provided you have some compiled SPIR-V that's been generated with the appropriate linkage type. At the time of writing, however, neither dxc nor glslang is capable of generating SPIR-V with the appropriate declarations. Therefore if you want to write HLSL or GLSL and link the results together, there doesn't seem to be a working option for doing that. However, I've been told that projects that generate SPIR-V through other means (such as rust-gpu) have been able to successfully link using SPIRV-Tools. Either way it's cool that in both cases the linking happens in open-source toolchains using documented IL representations. 2

There's a big question mark when it comes to linking: for compiling a zillion permutations, would it actually reduce the compile times? I don't have any data to draw even some basic conclusions from, so I think we'll have to wait for someone to do some more comprehensive testing. Ultimately the situation might not be straightforward depending on the inputs to compiling and linking. There's also the potential for a linking step to make the compilation pipeline more complicated, just like it does for C/C++. In particular it might convert what was once a straightforward parallel operation into a tree with dependencies and sync points.
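To illustrate that last point, here's a toy sketch of the pipeline shape (the module names and the "compile"/"link" stand-ins are invented): the per-module compiles stay embarrassingly parallel, but the link step becomes a sync point that depends on all of them, just like object files and a linker in C/C++.

```python
from concurrent.futures import ThreadPoolExecutor

def compile_module(name):
    # Stand-in for compiling one shader module to a library (e.g. a
    # DXIL "lib" target). Each module is independent of the others.
    return name + ".lib"

def link_program(libs):
    # Stand-in for the link step: it cannot start until every library
    # it depends on has finished compiling.
    return "pso(" + "+".join(sorted(libs)) + ")"

modules = ["gbuffer_common", "lighting", "normal_mapping"]

with ThreadPoolExecutor() as pool:
    libs = list(pool.map(compile_module, modules))  # fully parallel

program = link_program(libs)  # sync point: waits on all three compiles
```

What was a flat list of independent compile jobs is now a small dependency tree, which is exactly the extra build-pipeline complexity the paragraph above is worried about.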

### True Function Calls and Dynamic Dispatch

Having a runtime function call combined with dynamic dispatch is potentially more interesting than a linking step, but it's also much more of a radical change. While linking can happen offline with no driver input, dynamic dispatch for sure requires the driver and the hardware to be on board. The "stuff everything in a statically-allocated chunk of registers" model used by most GPUs certainly doesn't lend itself to true dynamic dispatch, and it's easy to imagine that various constraints on the programming model might be necessary to keep performance from falling off a cliff.

The good news is that on PC we sort-of have this right now! The bad news is that it's very constrained, and only works for ray tracing. I'm less familiar with the specifics for Vulkan, but in D3D12/DXR it works by essentially having a sort of "run-time linker" step. Basically you compile a bunch of functions into a DXIL binary using a "lib" target (just like you would for offline linking), and then at runtime you assemble all of your pieces together into a combined state object. Later on when you call DispatchRays the driver is able to dynamically execute the right hit/miss/anyhit/etc. shader since it was linked into the state object. There is a callable shaders feature that can be used without any actual ray tracing, however it still needs to be used from within a ray generation shader that was kicked off from DispatchRays. In other words: it's usable right now for compute-like shaders, but currently can't be used within the graphics pipeline. With any luck this could change in the future! 3

Currently I'm not aware of anyone that has tried using DXR callable shaders to replace permutations, or at least that has published their findings. Personally I have not experimented with them, so I'm not yet sure how they might work out in practice. With that said, we can still probably make some reasonable assumptions based on what we know about how they work and what kind of implications they might have on usage and performance:

• They currently require a runtime "link" step where all possible callables are assembled into a single state object and associated shader table, either by creating a new one or by calling AddToStateObject to append additional shader binaries. This has implications on streaming, since loading in new callable shaders may require modifying a global state object being used for rendering. Either way you need to be careful where you interact with state objects since it can involve a JIT compile, just like creating a normal PSO.
• We probably should not expect any magic workarounds for static register allocation: if a callable shader requires many registers, we can likely expect for the occupancy of the entire batch to suffer. It's possible that GPUs could diverge from this model in the future, but that could come with all kinds of potential pitfalls (it's not like they're going to start spilling to a stack when executing thousands of pixel shader waves).
• The way the state object spec is set up gives drivers some room to do transformations if necessary. Since the full set of call targets is known when creating or modifying a state object, it's feasible that the driver might make a decision to "flatten" a CallShader into a branch with inlined function calls.
• There are some subtle implications depending on whether this kind of functionality is implemented as a function call from the same wave, or whether it implies that a wave is launched that runs the function being called. In particular the DXR model implies the latter, since hit shaders can set their own resource context and bindings via a local root signature. There can also be implications around how divergence is handled (different threads in the same wave calling different functions), and whether or not threads are allowed to be "re-packed" into a full wave. For replacing existing permutations I would expect the function calls to be completely uniform, and thus a simpler "run the function in the same thread" approach is sufficient and probably more desirable. Having a resource context is also not necessary as long as bindless is being used.

We'll have to watch this area closely in the future. With any luck this kind of functionality can be extended to graphics pipelines and standard compute shaders, which could be a very compelling option for those of us that are juggling too many permutations. Metal is also currently ahead of the curve on this front by offering function pointers that can be called from any shader stage. Perhaps this can serve as inspiration for PC and Android!

### Summary

| Technique | Pros | Cons | Platforms |
| --- | --- | --- | --- |
| Compile What You Need | Less offline compiling, no changes needed to renderer | No reduction in PSOs, might not reduce shader count enough, may need offline analysis | Any |
| Run-Time Specialization | Significantly less offline compiling, no changes needed to renderer | May add to PSO creation time, driver might not optimize as much as you want | Vulkan, Metal |
| Cached Material Evaluation | Less offline compiling and lower PSO counts, could speed up GPU performance | Complex and potentially invasive to the renderer, only affects shading parameter generation | Any |
| Runtime Branching And Looping | Significantly less offline compiling and lower PSO counts | Potential performance implications, may put serious constraints on material features, no shader graphs | Any |
| Deferred Rendering | Less offline compiling and PSOs, more modular code, could be better for GPU performance | Very invasive to the renderer, may not be a good fit for certain hardware and use cases | Any |
| Offline Linking | Potentially quicker offline compiling, no changes needed to renderer | May not be faster to compile, doesn't reduce PSOs, may reduce GPU performance if not optimized post-link | D3D12, Vulkan, Metal |
| Dynamic Function Calls | Significantly less offline compiling and PSO counts, opens the door to other new techniques | Likely worse for GPU performance, requires changes to how your engine handles shaders and PSOs | D3D12 + Vulkan (compute-only), Metal (full support) |

### Final Thoughts

That's it for the series! I hope it was an enjoyable read, and that it helped to give some additional context regarding the ongoing battle against shader combinatorial explosion. Please leave a comment or reach out on Twitter if you have any additional thoughts or insights, or if you take issue with anything I've written here.

I'd also like to quickly thank everyone who read these posts early and provided valuable feedback and insight:

1. Many mobile GPUs have fast on-chip framebuffer memory that can be accessed through special mechanisms. This can allow deferred renderers to keep their G-Buffer entirely in on-chip memory rather than slower off-chip memory, potentially changing the trade-offs that are mentioned here.
2. D3D11 and fxc actually added support for offline linking in the Win 8 era. Unfortunately it was rather complicated, and I'm not sure if it ever got heavy usage.