Oct 11, 2021

The Shader Permutation Problem - Part 2: How Do We Fix It?

Dune screenshot showing Paul’s prescience abilities

Pictured above: Paul Atreides visualizing his full tree of shader permutations and regretting his decisions

If you’ve read the previous article then you hopefully have a decent understanding of how shader permutations ended up being such a common issue. The good news is that there is some hope for the future: if we look across recent game releases as well as the features available in the latest GPUs and APIs, we do see some promising avenues for digging ourselves out of our self-imposed avalanche of compiled bytecode. In my opinion nothing I’m going to mention here is a silver bullet on its own: each technique comes with a set of trade-offs to be carefully evaluated in the context of an engine and the games that run on it. Regardless, it’s inspiring to see smart and resourceful people come up with clever approaches that help to sidestep some of the issues that I’ve brought up in the previous article.

Only Compile What You Need

This is the simplest, oldest, and perhaps least-effective way to reduce permutation counts. The general idea is that out of your 2^N possible permutations, some subset of them are either redundant, invalid, or will never actually get used. Therefore you will reduce the set of shaders that you need to compile and load if you can strip out the unnecessary permutations. In many cases the reduction in shader permutation count can be substantial, and can be the difference between “completely untenable” and “we can ship this”. Ideally this process is something you would do offline, perhaps as part of a content processing pipeline that has knowledge of what meshes and materials are going to be used in each scene. But there have also been games/engines that have done it at runtime, essentially deferring compilation of a permutation until the scene is loaded. Either way there are some pretty obvious downsides:

Determining your set of shaders offline requires your offline process to have a pretty complete understanding of both the content and how it interacts with your shader pipeline. Making changes to how that works may also require you to recompile and invoke the content processing pipeline again, as opposed to recompiling the runtime code and running the app again.
Offline approaches may make editors and other tooling more complicated, since you now have to deal with an on-the-fly combination of mesh + material
On a related note, some engines are set up so that they treat their shaders more like code and less like content. For instance, they might want to compile shaders using the same build system used for C++ code since it already handles dependencies and includes correctly. Moving to a system where the shaders are compiled as part of the content pipeline can potentially be a large shift.
If you wait until runtime to compile your shaders, you now have to either make the user wait for compilation to complete or do something else to hide the compilation time. This might involve having QA generate a cache of shaders to ship with the game, or it might even involve using a slow and generic shader until your specialized permutation finishes compiling in the background. Platforms that don’t allow you to invoke the shader compiler at runtime can also make this approach a non-starter, at least without some kind of two-step process that discovers the shaders and then compiles an offline cache to ship with the app.
The number of shaders that you generate is content-dependent, and the count could vary wildly depending on the scene/game/material setup/etc.
In the worst case this degenerates to compiling your full set of permutations, except that you may end up with a more complicated pipeline for generating those permutations.
It doesn’t do anything to reduce your loaded shader/PSO count at runtime. If you need to rely on a small number of PSOs to implement some technique (for example, GPU-driven rendering) then this approach won’t help you.

For our engine at Ready At Dawn, we have used the offline variant of this approach with some success. When our content processing pipeline encounters a mesh, it looks at the assigned material as well as properties of the mesh itself to figure out the final shader permutation that will be required. A request to compile that shader is then kicked off, and the final bytecode is ready to load in the game. In our case the reduction is quite substantial: just our “generic” base material + shader set supports an unfathomable number of possible permutations, and the vast majority go unused. For the entirety of Lone Echo 2 we end up compiling around 5000 pixel shaders out of a possible set of millions, and typically have around 900-1000 PSOs loaded at runtime (with about 50% of them being used in any particular frame). Unfortunately the full set of shaders to compile is still quite large, so a global change can take quite some time to churn through before the full game is rebuilt. To mitigate this we make use of distributed shader compilation that is run on idle machines. One silver lining is that it’s quite simple to build only the set of shaders required for a particular area of the game, since the shaders are compiled as part of processing a particular chunk of the game world. This can make iteration times lower for global shader changes, but still quite far from instant. To handle tooling issues where the mesh + material combination is not known at runtime, we use “generic” un-permuted versions of our shaders that branch on constant buffer values for all features. As a bonus these shaders also support full real-time parameter editing, and have additional dev-only features for debugging. These shaders are huge though since they support every possible feature, and can be quite slow!
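To make the offline variant a bit more concrete, here is a minimal Python sketch of the discovery step: walk the mesh + material pairs in a chunk of content, derive a permutation key for each pair, and only compile the unique keys. The feature names, the `Material` and `Mesh` types, and the `required_permutations` helper are all hypothetical and chosen purely for illustration; a real pipeline would have far more inputs.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set, Tuple

# Hypothetical feature flags; a real engine would have many more.
MATERIAL_FEATURES = ("NORMAL_MAP", "EMISSIVE", "ALPHA_TEST", "DETAIL_MAP")
MESH_FEATURES = ("SKINNED", "VERTEX_COLOR")


@dataclass(frozen=True)
class Material:
    name: str
    features: FrozenSet[str]


@dataclass(frozen=True)
class Mesh:
    name: str
    features: FrozenSet[str]


def permutation_key(mesh: Mesh, material: Material) -> FrozenSet[str]:
    """The permutation is defined by the enabled features of both inputs."""
    return mesh.features | material.features


def required_permutations(
    draws: Iterable[Tuple[Mesh, Material]]
) -> Set[FrozenSet[str]]:
    """Collect the unique permutation keys actually referenced by content."""
    return {permutation_key(mesh, material) for mesh, material in draws}


def total_possible() -> int:
    """Every feature is an independent on/off toggle, so 2^N in total."""
    return 2 ** (len(MATERIAL_FEATURES) + len(MESH_FEATURES))


# Made-up content: two materials and two meshes, with one repeated pairing.
rock = Material("rock", frozenset({"NORMAL_MAP", "DETAIL_MAP"}))
glass = Material("glass", frozenset({"NORMAL_MAP", "ALPHA_TEST"}))
boulder = Mesh("boulder", frozenset())
character = Mesh("character", frozenset({"SKINNED"}))

draws = [(boulder, rock), (boulder, rock), (character, rock), (boulder, glass)]
needed = required_permutations(draws)
print(f"{len(needed)} of {total_possible()} permutations needed")
```

With this toy content only 3 of the 64 possible permutations get requested. The worst case mentioned earlier is also visible here: if the content happens to touch every combination, the set simply grows back to the full 2^N.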

Run-Time Specialization

Vulkan and Metal both support an interesting feature called specialization constants, or “spec constants” for short (Metal calls the feature “function specialization”). The basic idea goes like this:

You compile your shader with a global uniform value (basically like a value in a constant buffer) that’s used in your shader code
When creating the PSO, you pass a value for that uniform that will be constant for all draws and dispatches using the PSO
The driver somehow ensures that the value you passed is used in the shader program. This might include:
- Treating the value as a “push constant”, basically a small uniform/constant buffer that gets set from the command buffer
- Patching a value into the compiled intermediate bytecode or vendor-specific ISA
- Treating the value as a compile-time constant when the driver does its JIT compile and performing full optimizations (including constant folding and dead code elimination) based on that value

It’s a pretty neat concept, since it potentially lets you avoid having to do a lot of your own compiling and instead rely on the driver’s backend compiler to do a lot of the work for you. If you have a lot of specializations it won’t necessarily allow you to reduce your PSO count, but it can be pretty nice in a similar way to the “Compile What You Need” strategy if you don’t use your full set of possible permutations at runtime. A good example is using them to implement quality levels based on user settings: you’re not going to need both the low-quality and high-quality PSOs at once, so you can use a spec constant to choose when you create the PSO. It should be noted though that spec constants also share one of the main issues with “Compile What You Need”: it won’t reduce your runtime PSO count. You can potentially ship and load fewer shader binaries which is an improvement, but you can still run into the same problems that PSOs can cause with batching and ray tracing.
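As a rough illustration of the quality-level example, the Python sketch below models a PSO cache that is keyed on both the shader binary and the spec constant values. It is a toy model and not a real graphics API: `PsoCache`, `create_pso`, the `lighting.spv` file name, and the `SHADOW_QUALITY` constant are all made-up names. What it tries to show is that one shipped binary can serve every quality level, while each distinct set of constant values still ends up as its own PSO.

```python
from typing import Dict, Tuple

SpecConstants = Tuple[Tuple[str, int], ...]


class PsoCache:
    """Toy stand-in for pipeline creation; no real API is being called."""

    def __init__(self) -> None:
        self._psos: Dict[Tuple[str, SpecConstants], str] = {}

    def create_pso(self, shader_binary: str, constants: Dict[str, int]) -> str:
        # Sort so that the same constants always produce the same key.
        key = (shader_binary, tuple(sorted(constants.items())))
        if key not in self._psos:
            # A real driver would run its backend compile at this point.
            self._psos[key] = f"pso#{len(self._psos)}"
        return self._psos[key]

    @property
    def pso_count(self) -> int:
        return len(self._psos)

    @property
    def binary_count(self) -> int:
        return len({binary for binary, _ in self._psos})


cache = PsoCache()
low = cache.create_pso("lighting.spv", {"SHADOW_QUALITY": 0})
high = cache.create_pso("lighting.spv", {"SHADOW_QUALITY": 2})
again = cache.create_pso("lighting.spv", {"SHADOW_QUALITY": 2})
print(cache.binary_count, cache.pso_count)
```

Here a single binary produces two PSOs, and asking for the same constants a second time hits the cache. If the user only ever picks one quality level then only one of those PSOs needs to be created, which is the part that resembles “Compile What You Need”.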

The other main issue with this approach (aside from lack of support in D3D12 and consoles) is that you don’t really know which approach the driver is going to take when it handles your spec constant. Sure it makes sense that it will dead-strip as much as it can since it’s doing that as part of its JIT compile anyway, but drivers have complicated trade-offs to deal with when it comes to doing their back-end compilation. If they take too long when initially creating your PSO then games can have slow loading times, and both devs and customers will complain. Doing further optimization on a background thread is possible, but drivers are limited in how much of the CPU they can use for this (particularly in D3D12) and it surely makes things more complicated for both the driver and developer. The worst part is that you could end up having wildly varying results (and therefore performance) based on the vendor and driver version. Sure in some ways you just have to be zen about hoping that PC drivers will do whatever they need to in the background to make things fast, but do you really want to set up your entire shader pipeline on the assumption that you’ll get as good results as if you compiled the shader permutations beforehand? Do you really want to do that on mobile where drivers are even more constrained in their CPU usage?

One interesting variant on this approach could be to have the app manually invoke the shader compiler (or some slimmed-down version of it) to patch in the spec constant and optimize based on its value. I’m not sure if this would be significantly faster than invoking the full compiler toolchain again, but perhaps it could be speedier since you wouldn’t need to parse anything. This approach would work on any API as long as the compiler can run on-target, and you would have the peace of mind of knowing that the optimization and dead-stripping is for sure happening. It would also put any trade-offs regarding background compilation in the hands of the app developers, which would certainly be more consistent with the overall spirit of the “explicit” APIs.

Cached Material Evaluation

A lot of what goes into our shader permutations is typically related to combining and building shading parameters in various ways. It’s quite common to build the “look” of one material by sampling multiple textures with different UV values, either to add varying levels of detail or to create meta-materials from simpler “building blocks”. A broad feature set for doing these operations can naturally result in a lot of shader code, which can in turn lead to shader permutations. Or alternatively an engine might allow technical artists to completely define their own logic by authoring shader graphs, in which case there are potentially unlimited numbers of compiled shaders that could be generated. If you can somehow pull this process out of your pixel shader, then you can perhaps reduce the number of shader variants that are needed for your draws. There are a few ways you might do this:

Use Substance Designer or a custom material pipeline to do offline compositing and generating of your textures
- Even with this you generally still want to combine tiling maps together at runtime since “flattening” them would consume considerably more memory
Use an offline system combined with runtime virtual texturing to fully generate a unique set of composited textures for all surfaces in the game (basically the Megatexture approach)
- There are plenty of well-documented issues with this approach, such as disk space, streaming/transcoding cost, lack of fine texture detail, etc.
Use a runtime virtual texturing system that composites/generates pages on-the-fly
- More complex, you may still need permutations for the process that does the VT page generation

The end goal of these techniques is to end up with simpler pixel shaders that can just sample one set of textures without needing to do anything fancy with them. This won’t help you for the portions of the shader that don’t deal with generating shading parameters, but it can potentially cut things down quite a bit.

Replace Permutations with Branching and Looping

On modern GPUs flow control is no longer the problem that it used to be. Having divergent branching and looping within a wave is still a problem due to the SIMD execution model used on GPUs, but if all threads go down the same path it can be quite efficient. Therefore if you’ve got a material flag that enables a feature for an entire draw call, it could make a lot of sense to stick that value in a uniform/constant buffer and branch on it in the shader at runtime. This kind of branching is usually referred to as “uniform branching” or “static branching” since it guarantees that you’re not going to have any divergence. The idea here is that we can rethink a lot of the permutation decisions that we made 10 years ago and (hopefully) reduce our total shader count after sprinkling in a lot of these static branches. If that branch is totally uniform then the shader processor will legitimately skip over the instructions in the branch when the flag is disabled, and it may not even cost very many cycles to do this if the hardware has a dedicated scalar unit for performing those operations.
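The difference between uniform and divergent branching can be sketched with a very simplified cost model in Python. The assumption here is that a wave pays for every side of a branch that at least one of its threads takes, and nothing else; the `wave_branch_cost` name and the instruction counts are invented for the example, and real hardware also has branch overhead, latency hiding, and register pressure effects that this ignores.

```python
from typing import Sequence


def wave_branch_cost(
    conditions: Sequence[bool], cost_if_true: int, cost_if_false: int
) -> int:
    """Simplified SIMD model: the wave executes each path taken by any thread."""
    cost = 0
    if any(conditions):
        cost += cost_if_true   # at least one thread needs the 'true' side
    if not all(conditions):
        cost += cost_if_false  # at least one thread needs the 'false' side
    return cost


WAVE_SIZE = 32
FEATURE_COST = 40   # hypothetical instruction count with the feature enabled
FALLBACK_COST = 10  # hypothetical instruction count with the feature disabled

uniform_on = wave_branch_cost([True] * WAVE_SIZE, FEATURE_COST, FALLBACK_COST)
uniform_off = wave_branch_cost([False] * WAVE_SIZE, FEATURE_COST, FALLBACK_COST)
divergent = wave_branch_cost(
    [i % 2 == 0 for i in range(WAVE_SIZE)], FEATURE_COST, FALLBACK_COST
)
print(uniform_on, uniform_off, divergent)
```

In this model a flag coming from a constant buffer always lands in one of the two uniform cases, so the wave only pays for the side it actually needs. Only the divergent case pays for both sides.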

The poster child for this approach is the past two Doom games from id: Doom 2016 and Doom: Eternal. In their presentations they mention that they use a forward-rendering ubershader that only results in ~100 shaders and ~350 PSOs. This is of course more than a handful but a distributed compilation system can probably tear through that many shaders in no time, making it very attractive from an iteration point of view. Personally I get quite jealous when thinking of how much simpler my life would be with only that many shaders to deal with, and I really doubt I’m the only one! Their games run really well and look great so it’s a pretty strong proof-of-concept that this approach can work.

With that said, I personally still think it can be quite difficult to achieve a low permutation count in the general case. In part 1 we discussed why this is tricky to get right, so I won’t go into detail again. But the main gist of it is that it can be difficult to know where you should leave a branch or loop and where you should permute/unroll. On consoles you can make pretty informed decisions by looking at a single GPU in isolation (and make use of the excellent tooling), but on PC and mobile you are in a much much tougher spot. But on the positive side having fewer permutations means fewer total programs to optimize by hand, which can potentially pay off in big ways. In particular it’s a lot more reasonable to try to make sure that all of your permutations have a reasonable register count and occupancy…at least assuming that you have the tools available to determine these things. There are also some things that really do require permutations since they are static properties of the shader itself, and thus can’t be replaced with branching. In particular this includes the set of inputs or outputs of a shader (interpolants, render targets), usage of discard, per-sample execution, and forced early-z.

I think what would truly help in using this approach is to have a lot of discipline and focus in your material/shader feature set. The fewer features you need to support, the easier it’s going to be to make sensible decisions about where you should permute. I would imagine you would have to get used to saying “no” to a lot of feature requests, since every feature added can potentially affect the performance of your ordinary baseline materials unless it’s gated behind a permutation. For those of us with very broad material feature sets it seems difficult to get to that point without being very disruptive to existing workflows, and/or without giving up things that make a significant contribution to the look of a game.

Another thing that could really help is good automated processes for helping to figure out which permutations are truly worth it vs. just leaving in a branch. Unfortunately the testing space for this is quite large: you can have N different materials and M different video cards with a huge amount of permutation options, so it’s not feasible to test exhaustively. But perhaps reasonable subsets can be derived that give a decent level of confidence.

One last thing I would like to note is that you have to watch out for drivers taking your handful of shaders and permuting them behind your back. Drivers have been known to peek at values from constant/uniform buffers and use them to silently flatten branches and unroll loops. This can lead you to think that your shaders are well-optimized…until you run them on a driver that doesn’t do this and the performance is worse. It’s tougher for drivers to do this in D3D12 and Vulkan since there are more avenues to load data into shaders, but it’s still feasible if the driver is sufficiently motivated.

Deferred Rendering, and Other Techniques For Splitting Up Shading

When people first learn about deferred rendering, there’s usually a lot of focus on how it helps to achieve greater dynamic lighting counts by decoupling the lighting process from your geometry. While this was true at the time it became popular, IMO the real secret weapon of deferred is that it’s a way of chopping up a mesh’s rendering process into at least two different distinct phases. In effect this is really a way of modularizing your shader code, despite the fact that individual shader programs are still monolithic. You can conceptually think of most deferred techniques as splitting up a single gigantic pixel shader function into two separate functions: one that calculates the parameters that feed into the lighting phase (surface normals, albedo, roughness, etc.) and another that takes those parameters and computes how much light reflects off the surface given a collection of lighting sources. With this mental model you can even think of the G-Buffer as being a sort of “ABI” that allows your first phase to pass its arguments to the second, except in this case it does so by writing out many megabytes of per-pixel data to a collection of render target textures.

The effect on permutation count can be significant, since it effectively causes us to split our pixel shader into two feature groups instead of one. Let’s see what our permuted pixel shader from part 1 looks like before and after converting to deferred:

$$ PShader(NormalMap[Off, On] * LightType[Point, Spot, Area] * Shadows[Off, On]) $$

$$ GBuffer(NormalMap[Off, On]) + Lighting(LightType[Point, Spot, Area] * Shadows[Off, On]) $$

Splitting up our shader turned one of those multiplies into an add, causing us to drop from 12 pixel shader permutations down to 8. Saving 4 shaders might not sound like a lot, but keep in mind the delta can be significantly larger as the number of features grows. In practice this can be a huge win for permutation counts, and the modularization can have all kinds of other halo effects. For example you can potentially optimize the code for your G-Buffer and Lighting phases separately without having to consider interactions, and the peak register usage will often be lower as a result of splitting things into smaller shaders with fewer steps. Shader compilers may also be able to chew through your shaders more quickly when they are smaller, since optimization time can grow non-linearly with code length.
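The multiply-versus-add arithmetic is easy to check with a few lines of Python. In this sketch each shader is a list of option counts, one per feature, and the `permutation_count` helper is just a name picked for the example. The second comparison adds four extra on/off material features, which is an invented scenario meant to show how the gap widens.

```python
from math import prod
from typing import List


def permutation_count(shaders: List[List[int]]) -> int:
    """Features within one shader multiply; separate shaders add."""
    return sum(prod(option_counts) for option_counts in shaders)


# NormalMap[2] * LightType[3] * Shadows[2] in a single forward pixel shader.
forward = [[2, 3, 2]]
# GBuffer(NormalMap[2]) + Lighting(LightType[3] * Shadows[2]).
deferred = [[2], [3, 2]]
print(permutation_count(forward), permutation_count(deferred))

# Hypothetical growth: four more on/off material features on the G-Buffer side.
forward_big = [[2, 2, 2, 2, 2, 3, 2]]
deferred_big = [[2, 2, 2, 2, 2], [3, 2]]
print(permutation_count(forward_big), permutation_count(deferred_big))
```

The first line prints 12 and 8, matching the counts above. With the four extra material features the forward shader needs 192 permutations while the split version needs 38, since the lighting side no longer gets multiplied by every material option.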

In practice there are all kinds of flavors of deferred, and even more ways of splitting up the final rendering pipeline into separate phases with distinct shader programs. For instance, it’s completely possible to only pull out shadow map evaluation into its own deferred pass that only requires a depth buffer, while keeping everything else in a combined forward pixel shader. Or going in the other direction, with deferred you have the option to either do all lighting in a single combined draw or dispatch, vs. further splitting the lighting passes into multiple sub-steps. Doing 1 pass per light source can allow for trivial shader specialization based on the properties of the light source while avoiding combinatorial issues, while doing separate passes for things like sampling ambient diffuse lighting can also provide opportunities for using async compute.

But of course, everything is a trade-off. One thing that most deferred renderers need to do at some point in the frame is write out results to memory and then read them back into registers¹. In the more classic deferred setups this happens when the G-Buffer is written out to render target textures, and then gets read back in during a lighting pass. If you were to compare that to a modern forward renderer you would likely see that the “G-Buffer” still exists in the forward pipeline, it just ends up sitting in registers that store the values passed between two phases of the same shader program. When you think about things in those terms deferred might not sound so appealing! After all, isn’t it better to keep data in ultra-fast registers rather than spilling them out to memory and then reading them back in again? But of course things are much more complicated than that when it comes to performance and renderer design, as we discussed in part 1 when we talked about taking a global approach to GPU performance. Perhaps using more bandwidth and memory ends up being a net negative, or perhaps the shorter/higher-occupancy shader programs enabled by modularization win out in the end. Perhaps letting the compiler optimize across a giant forward pixel shader enables some serious wins, or perhaps the I$ misses add up and limit your performance. Or perhaps you end up designing your deferred renderer in a way such that you can analyze the properties of your G-Buffer on a per-tile basis, and use that analysis to manually propagate optimizations into the deferred shader, thus gaining back many of the benefits that the compiler’s optimizer might have afforded you. Or maybe your forward renderer completely dies due to pixel shader quad overdraw caused by tiny triangles, whereas the deferred renderer handles it more gracefully. It’s clearly not a straightforward proposition, and the continued success of deferred certainly suggests that it can win out in many cases.

Offline Linking

In part 1 we discussed how the monolithic shader compilation model contributed to permutation explosions. If you can’t compile and optimize portions of a shader program individually, then permutations that only affect that portion end up requiring the entire program to be recompiled. What if there was some hot new technology from the 1970s that would let us compile separate binaries and then “link” them together into one combined shader program? That would be pretty neat, right? 😉

The good news is that DXIL and SPIR-V support this sort of thing! For D3D12/DXIL all of the functionality to compile and link libraries is present in dxc, either via the command line or through its API. See my recent post about it for more details on how it works in practice. The story for SPIR-V though is a little more complicated. SPIR-V Tools has a utility for doing the linking, provided you have some compiled SPIR-V that’s been generated with the appropriate linkage type. At the time of writing however, neither dxc nor glslang is capable of generating SPIR-V with the appropriate declarations. Therefore if you want to write HLSL or GLSL and link the results together, there doesn’t seem to be a working option for doing that. However I’ve been told that other projects that generate SPIR-V through other means (such as rust-gpu) have been able to successfully link using SPIR-V Tools. Either way it’s cool that in both cases the linking happens in open-source toolchains using documented IL representations. ²

There’s a big question mark when it comes to linking: for compiling a zillion permutations, would it actually reduce the compile times? I don’t have any data to draw even some basic conclusions from, so I think we’ll have to wait for someone to do some more comprehensive testing. Ultimately the situation might not be straightforward depending on the inputs to compiling and linking. There’s also the potential for a linking step to make the compilation pipeline more complicated, just like it does for C/C++. In particular it might convert what was once a straightforward parallel operation into a tree with dependencies and sync points.
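Without real measurements, the best I can offer is a back-of-the-envelope model of why the answer isn’t obvious. The Python sketch below assumes that a monolithic build compiles every module for every combination, while a linked build compiles each module variant once and then pays a per-program link cost (which would include any post-link optimization). The function names, the variant counts, and the cost units are all invented for illustration and are not based on any profiling.

```python
from math import prod
from typing import Sequence


def monolithic_cost(module_variants: Sequence[int], module_compile_cost: float) -> float:
    """Every combination recompiles all of its modules from scratch."""
    programs = prod(module_variants)
    return programs * len(module_variants) * module_compile_cost


def linked_cost(
    module_variants: Sequence[int], module_compile_cost: float, link_cost: float
) -> float:
    """Each module variant compiles once, then every combination is linked."""
    compiles = sum(module_variants)
    links = prod(module_variants)
    return compiles * module_compile_cost + links * link_cost


# Hypothetical split: 8 material-evaluation variants and 6 lighting variants.
variants = [8, 6]
print(monolithic_cost(variants, 1.0))    # 48 programs * 2 modules each
print(linked_cost(variants, 1.0, 0.25))  # cheap link step
print(linked_cost(variants, 1.0, 2.0))   # expensive link + optimize step
```

Under these made-up numbers the cheap link case comes out well ahead (26 units vs. 96), but once linking plus post-link optimization costs about as much as a full compile the advantage disappears (110 units vs. 96). The model also hints at the dependency issue: none of the 48 links can start until the module compiles they depend on have finished.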

The other thing that’s both good and bad is that linking doesn’t really change the runtime situation at all. Since the linking can happen fully offline and outside of the API, the API and driver don’t really have to know or care that things were linked together (although there’s always some possibility that linking generates new bytecode patterns that can cause drivers to choke). This means there’s no need to wait for new API or driver support, which is a good thing. It also means there’s no reason to expect a significant delta in performance between a linked and fully-compiled program, at least assuming there’s a post-link optimization step (not doing one would introduce some interesting trade-offs). But it also means that it won’t do anything to reduce the number of shaders and PSOs that have to be loaded at runtime, which is not-so-great. Therefore it won’t be a cure if your engine has a case of PSO-itis.

True Function Calls and Dynamic Dispatch

Having a runtime function call combined with dynamic dispatch is potentially more interesting than a linking step, but it’s also much more of a radical change. While linking can happen offline with no driver input, dynamic dispatch for sure requires the driver and the hardware to be on board. The “stuff everything in a statically-allocated chunk of registers” model used by most GPUs certainly doesn’t lend itself to true dynamic dispatch, and it’s easy to imagine that various constraints on the programming model might be necessary to keep performance from falling off a cliff.

The good news is that on PC we sort-of have this right now! The bad news is that it’s very constrained, and only works for ray tracing. I’m less familiar with the specifics for Vulkan, but in D3D12/DXR it works by essentially having a sort of “run-time linker” step. Basically you compile a bunch of functions into a DXIL binary using a “lib” target (just like you would for offline linking), and then at runtime you assemble all of your pieces together into a combined state object. Later on when you call DispatchRays the driver is able to dynamically execute the right hit/miss/anyhit/etc. shader since it was linked into the state object. There is a callable shaders feature that can be used without any actual ray tracing, however it still needs to be used from within a ray generation shader that was kicked off from DispatchRays. In other words: it’s usable right now for compute-like shaders, but currently can’t be used within the graphics pipeline. With any luck this could change in the future! ³

Currently I’m not aware of anyone that has tried using DXR callable shaders to replace permutations, or at least that has published their findings. Personally I have not experimented with them, so I’m not yet sure how they might work out in practice. With that said, we can still probably make some reasonable assumptions based on what we know about how they work and what kind of implications they might have on usage and performance:

They currently require a runtime “link” step where all possible callables are assembled into a single state object and associated shader table, either by creating a new one or by calling AddToStateObject to append additional shader binaries. This has implications on streaming, since loading in new callable shaders may require modifying a global state object being used for rendering. Either way you need to be careful where you interact with state objects since it can involve a JIT compile, just like creating a normal PSO.
We probably should not expect any magic workarounds for static register allocation: if a callable shader requires many registers, we can likely expect the occupancy of the entire batch to suffer. It’s possible that GPUs could diverge from this model in the future, but that could come with all kinds of potential pitfalls (it’s not like they’re going to start spilling to a stack when executing thousands of pixel shader waves).
The way the state object spec is set up gives drivers some room to do transformations if necessary. Since the full set of call targets is known when creating or modifying a state object, it’s feasible that the driver might make a decision to “flatten” a CallShader into a branch with inlined function calls.
There are some subtle implications depending on whether this kind of functionality is implemented as a function call from the same wave, or whether it implies that a wave is launched that runs the function being called. In particular the DXR model implies the latter, since hit shaders can set their own resource context and bindings via a local root signature. There can also be implications around how divergence is handled (different threads in the same wave calling different functions), and whether or not threads are allowed to be “re-packed” into a full wave. For replacing existing permutations I would expect the function calls to be completely uniform, and thus a simpler “run the function in the same thread” approach is sufficient and probably more desirable. Having a resource context is also not necessary as long as bindless is being used.
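To give a feel for the programming model being described, here is a deliberately tiny Python analogy for a state object with a table of callables. It is not the D3D12 API and it says nothing about performance: `ShaderTable`, `add`, `call_shader`, and the two material functions are made-up names that only loosely mirror AddToStateObject and CallShader.

```python
from typing import Callable, Dict, List

Payload = Dict[str, float]


class ShaderTable:
    """Toy model of a state object plus shader table; not a real API."""

    def __init__(self) -> None:
        self._entries: List[Callable[[Payload], None]] = []

    def add(self, fn: Callable[[Payload], None]) -> int:
        """Append a callable and return its index (the runtime 'link' step)."""
        self._entries.append(fn)
        return len(self._entries) - 1

    def call_shader(self, index: int, payload: Payload) -> None:
        """Pick the callee by index at run time instead of at compile time."""
        self._entries[index](payload)


def plain_material(payload: Payload) -> None:
    payload["albedo"] = 0.5


def detail_material(payload: Payload) -> None:
    payload["albedo"] = 0.8


def shade(table: ShaderTable, material_index: int) -> float:
    """One 'shader' serves every material; only the index differs per draw."""
    payload: Payload = {"albedo": 0.0}
    table.call_shader(material_index, payload)
    return payload["albedo"]


table = ShaderTable()
plain_id = table.add(plain_material)
# Streaming in a new material later appends to the same table.
detail_id = table.add(detail_material)
print(shade(table, plain_id), shade(table, detail_id))
```

The `shade` function never changes as materials are added, which is the appeal when it comes to permutation counts. The concerns listed above (register allocation across all possible callees, the cost of modifying the table, and how divergence is handled) are exactly the things this analogy leaves out.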

We’ll have to watch this area closely in the future. With any luck this kind of functionality can be extended to graphics pipelines and standard compute shaders, which could be a very compelling option for those of us that are juggling too many permutations. Metal is also currently ahead of the curve on this front by offering function pointers that can be called from any shader stage. Perhaps this can serve as inspiration for PC and Android!

Summary

Technique	Pros	Cons	Platforms
Compile What You Need	Less offline compiling, no changes needed to renderer	No reduction in PSOs, might not reduce shader count enough, may need offline analysis	Any
Run-Time Specialization	Significantly less offline compiling, no changes needed to renderer	May add to PSO creation time, driver might not optimize as much as you want	Vulkan, Metal
Cached Material Evaluation	Less offline compiling and lower PSO counts, could speed up GPU performance	Complex and potentially invasive to the renderer, only affects shading parameter generation	Any
Runtime Branching And Looping	Significantly less offline compiling and lower PSO counts	Potential performance implications, may put serious constraints on material features, no shader graphs	Any
Deferred Rendering	Less offline compiling and PSOs, more modular code, could be better for GPU performance	Very invasive to the renderer, may not be a good fit for certain hardware and use cases	Any
Offline Linking	Potentially quicker offline compiling, no changes needed to renderer	May not be faster to compile, doesn’t reduce PSOs, may reduce GPU performance if not optimized post-link	D3D12, Vulkan, Metal
Dynamic Function Calls	Significantly less offline compiling and PSO counts, opens the door to other new techniques	Likely worse for GPU performance, requires changes to how your engine handles shaders and PSOs	D3D12 + Vulkan (Compute-Only), Metal (Full Support)

Final Thoughts

That’s it for the series! I hope it was an enjoyable read, and that it helped to give some additional context regarding the ongoing battle against shader combinatorial explosion. Please leave a comment or reach out on Twitter if you have any additional thoughts or insights, or if you take issue with anything I’ve written here.

I’d also like to quickly thank everyone that read these posts early and provided valuable feedback and insight:

¹ Many mobile GPUs have fast on-chip framebuffer memory that can be accessed through special mechanisms. This can allow deferred renderers to keep their G-Buffer entirely in on-chip memory rather than slower off-chip memory, potentially changing the trade-offs that are mentioned here.
² D3D11 and fxc actually added support for offline linking in the Win 8 era. Unfortunately it was rather complicated, and I’m not sure if it ever got heavy usage.
³ D3D11 actually launched with a pseudo-dynamic-linking feature that was called Dynamic Shader Linkage. It was a noble effort, and had some really interesting and forward-looking ideas. In particular it tried to save the developer from having to somehow know the right magic conditions for when to permute and when not to, and it did this by essentially kicking that decision up to the driver. In order to do this though it had some pretty wacky syntax. Basically you would define interfaces with methods, and then declare N classes that implemented those interfaces and their methods. This let the offline compiler (fxc) figure out all the possibilities for what implementation might be invoked at each call site into an interface, and then it could present the driver’s backend compiler with a precompiled set of possibilities for the callee with full optimization applied across the function boundary. At runtime you would then use some rather complicated APIs to choose which class implementation would ultimately get used when you bound a shader. In hindsight it’s not too hard to see why this failed. The class/interface HLSL additions were quite a departure from previous shader code, and rendered it incompatible when trying to use it across platforms. It also doesn’t really solve the issue of needing to compile N shaders, since that permutation loop was really just getting moved inside of the shader compiler (it wasn’t actually compiling the class implementation methods separately). On top of that, it doesn’t really guarantee that the driver won’t just end up creating N full permutations of your shader program at runtime. This certainly would have been the easiest path forward for them, and may have been the best option for various reasons.