Half The Precision, Twice The Fun: Working With FP16 In HLSL

Those of you who have been working on desktop and console graphics long enough will remember working with fp16 math in shaders during the D3D9 era. Back then HLSL supported the half scalar type, which corresponded to a floating-point value using 16 bits of precision. Using it was crucial for extracting the best performance from Nvidia's FX series, 6-series, and 7-series hardware, since it could perform many fp16 operations at a faster rate than it could for full-precision 32-bit values. But then the D3D10 era came along with its unified shader cores, and suddenly fp16 math was no more. None of the desktop hardware supported it anymore, and so HLSL went ahead and mapped the half type to float and called it a day. And that's the end of the story when it comes to fp16 in shaders. The End.

...or not. It turns out that fp16 is still useful for the reasons it was originally useful back in the days of D3D9: it's a good way to improve throughput on a limited transistor/power budget, and the smaller storage size means that you can store more values in general purpose registers without having your thread occupancy suffer due to register pressure. As of Nvidia's new Turing architecture (AKA the RTX 2000 series), AMD's Vega (AKA gfx900, AKA GCN 5) series [1], and Intel's Gen8 architecture (used in Broadwell), fp16 is now back in the desktop world. Which means that us desktop graphics programmers now have to deal with it again. And of course if you're a mobile developer, it never really left in the first place. But how do you actually use fp16 in your shader code? That's exactly what this blog post will explain!

Before we get into the exact details, you should know that there are actually two parallel paths to using fp16 math in your shaders. This article will cover both, so you can decide which works best for your situation.

Flexible Precision (AKA Minimum Precision, AKA Relaxed Precision)

This is the older of the two fp16 paths, and has actually been around in D3D11 since Windows 8. OpenGL ES has also had a variant of this available for a long time. On this path, what you're basically doing is giving hints to the driver that say "it's okay if you do these operations at less than full 32-bit precision, but it's also okay if you do them at full precision". This basically gives you runtime polymorphic shaders: your final shader only exists as 1 blob of compiled bytecode, but depending on the hardware and driver you may or may not get fp16 ops at runtime. The nice part is that this spares you the pain of having to compile your shader twice, which is great! The downside is that you're not really sure what's going on with the hardware unless you check device caps, and/or check the actual ISA generated by the driver.

When working with HLSL, the way you get this behavior is by using the min16float type and its variants (min16float2, min16float3x3, etc.) [2]. By declaring this type for a variable you're providing a hint specifying that it's okay for the driver to store it with less precision, as long as it's greater than or equal to 16 bits. In general it works exactly as you'd expect: you can downcast from fp32 to fp16 by casting with min16float(), and you'll get warnings if an assignment causes an implicit truncation:

float fp32 = DoSomething();
min16float fp16_x = fp32;               // truncation warning
min16float fp16_y = min16float(fp32);
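To get a feel for what that truncation can throw away, here's a small CPU-side C++ sketch. It's an approximation under stated assumptions, not what any particular driver does: RoundToFp16Precision is a hypothetical helper that rounds a float to 11 significant bits (fp16's 10 stored mantissa bits plus the implicit leading one), assumes the default round-to-nearest floating-point mode, and ignores fp16's limited exponent range and subnormals. Remember that with min16float the hardware is also free to keep the value at full fp32 precision.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of HLSL or D3D): rounds x to 11 significant
// bits, which is the precision an fp16 value can hold. Assumes the default
// round-to-nearest-even mode and ignores fp16's exponent range and subnormals.
float RoundToFp16Precision(float x)
{
    if (x == 0.0f || !std::isfinite(x))
        return x;

    int exponent = 0;
    // x == mantissa * 2^exponent, with |mantissa| in [0.5, 1)
    const float mantissa = std::frexp(x, &exponent);
    // Shift 11 significant bits into the integer part, round, and shift back
    const float scaled = std::ldexp(mantissa, 11);
    return std::ldexp(std::nearbyint(scaled), exponent - 11);
}

// Small self-check showing the kind of error to expect
void CheckFp16PrecisionExamples()
{
    assert(RoundToFp16Precision(1.0f) == 1.0f);               // exactly representable
    assert(RoundToFp16Precision(2049.0f) == 2048.0f);         // odd integers above 2048 are lost
    assert(RoundToFp16Precision(0.1f) == 0.0999755859375f);   // nearest fp16 value to 0.1
}
```

Between 2048 and 4096 the representable fp16 values are 2 apart, which is why 2049 can't survive the trip. That's the sort of thing to keep in mind before converting values like large texel coordinates or world-space positions.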

HLSL allows you to overload functions based on the input type being float or min16float, which means you can create dual versions of your utility functions for fp16 or fp32 when necessary without having to rename them:

float DoSomething(in float x)
{
    // do some stuff in fp32
}

min16float DoSomething(in min16float x)
{
    // do some stuff in fp16
}

If you enable warnings as errors (which you probably should!) you can start out by converting a few high-level values to fp16 and then letting the compiler point out all of the places where you need to explicitly cast to min16float. Unfortunately it won't catch places where you go from fp16 to fp32, since that doesn't truncate. So you'll need to watch out for those on your own. You'll also need to make sure that you're not inadvertently using the half datatype, since by default this is still mapped to fp32 in HLSL! I would recommend making some defines in a common header file that map half to min16float, since that lets you avoid that potential issue. It also makes your code cleaner to look at IMO, and makes it easy to globally disable fp16 should you need to:

#define half min16float
#define half2 min16float2
#define half3 min16float3
#define half4 min16float4
#define half3x3 min16float3x3
#define half3x4 min16float3x4
// keep going if you need more matrix types

You'll also need to be careful with literals, which are a little weird when working with the min16float types. In HLSL you have 3 floating point suffixes: f, h, and l. These map to the float, half, and double types respectively. Sadly there's no literal suffix for min16float, and the h suffix maps back to fp32 (since half is mapped to float internally). So for literals you'll instead want to use an unadorned literal (leave off the suffix entirely), and let the compiler sort it out on assignment. The unfortunate side effect of doing this is that calls to overloaded functions can resolve to the fp32 version if you pass a literal:

float DoSomething(in float x, in float y)
{
    // do some stuff in fp32
}

min16float DoSomething(in min16float x, in min16float y)
{
    // do some stuff in fp16
}

min16float x = 2.0;
min16float y = DoSomething(1.0, x);     // truncation warning!

Fortunately you can work around this by explicitly casting the literal to a 16-bit type:

min16float y = DoSomething(min16float(1.0), x);     // no warning

There's one more big gotcha with flexible precision: you can't store these types in buffer resources like constant buffers and structured buffers. It makes sense if you think about it: it's okay for temporary values stored in registers to be runtime polymorphic, but it would be really bad if the size and packing of your constant buffer changed depending on the video card and driver you were using! So you're stuck using 32-bit types for those cases. Ideally you would want to store fp16 values in your buffers, since that gives you more compact data and also lets you avoid any cost there might be from converting fp32 to fp16 in the shader core. But your only real option for doing that is to pack the fp16 values in a 32-bit type, and then convert back and forth in the shader:

struct CBLayout
{
    uint xy;    // two fp16 values: x in the lower 16 bits, y in the upper 16 bits
};

ConstantBuffer<CBLayout> CB;

min16float x = min16float(f16tof32(CB.xy & 0xFFFF));
min16float y = min16float(f16tof32(CB.xy >> 16));

Really you're asking the compiler to convert from fp16 -> fp32 and then truncate back down to fp16, and hoping that the driver's JIT compiler is smart enough to remove all of that when it generates the final ISA. Tom Hammersley's post on GPUOpen suggests that AMD's drivers are capable of recognizing this pattern, but I have no idea how reliable that is in practice across all vendors.
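Of course something on the CPU has to produce that packed uint in the first place. Here's a minimal C++ sketch of what that might look like, assuming x goes in the lower 16 bits and y in the upper 16 bits to match the unpacking in the shader. FloatToHalfBits, HalfBitsToFloat, and PackHalf2 are hypothetical helper names (they're not part of D3D or any library), and to keep things short the conversion flushes values too small for a normal fp16 to zero instead of producing subnormals:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: converts an fp32 value to IEEE 754 binary16 bits using
// round-to-nearest-even. Simplification: values below 2^-14 are flushed to zero.
std::uint16_t FloatToHalfBits(float value)
{
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof(bits));
    const std::uint32_t sign = (bits >> 16) & 0x8000u;
    const std::uint32_t magnitude = bits & 0x7FFFFFFFu;

    if (magnitude > 0x7F800000u)        // NaN -> quiet NaN
        return static_cast<std::uint16_t>(sign | 0x7E00u);
    if (magnitude >= 0x47800000u)       // Inf, or too large for fp16 -> Inf
        return static_cast<std::uint16_t>(sign | 0x7C00u);
    if (magnitude < 0x38800000u)        // smaller than the smallest normal fp16
        return static_cast<std::uint16_t>(sign);

    // Rebias the exponent (127 -> 15), then round the 13 dropped mantissa bits
    const std::uint32_t rebased = magnitude - 0x38000000u;
    const std::uint32_t rounded = rebased + 0x0FFFu + ((rebased >> 13) & 1u);
    return static_cast<std::uint16_t>(sign | (rounded >> 13));
}

// Hypothetical helper: CPU-side equivalent of what f16tof32 does in the shader
float HalfBitsToFloat(std::uint16_t half)
{
    const std::uint32_t sign = static_cast<std::uint32_t>(half & 0x8000u) << 16;
    const std::uint32_t exponent = (half >> 10) & 0x1Fu;
    const std::uint32_t mantissa = half & 0x3FFu;

    if (exponent == 0)
    {
        // Zero or subnormal: value is mantissa * 2^-24
        const float magnitude = std::ldexp(static_cast<float>(mantissa), -24);
        return sign ? -magnitude : magnitude;
    }

    const std::uint32_t bits = (exponent == 0x1Fu)
        ? (sign | 0x7F800000u | (mantissa << 13))              // Inf or NaN
        : (sign | ((exponent + 112u) << 23) | (mantissa << 13));
    float result = 0.0f;
    std::memcpy(&result, &bits, sizeof(result));
    return result;
}

// Packs two values into one 32-bit constant: x in the low half, y in the high half
std::uint32_t PackHalf2(float x, float y)
{
    return static_cast<std::uint32_t>(FloatToHalfBits(x)) |
           (static_cast<std::uint32_t>(FloatToHalfBits(y)) << 16);
}

// Small self-check: pack on the "CPU", then unpack the way the shader does
void CheckPackingExamples()
{
    assert(PackHalf2(1.0f, -2.0f) == 0xC0003C00u);
    const std::uint32_t packed = PackHalf2(0.5f, 0.25f);
    assert(HalfBitsToFloat(static_cast<std::uint16_t>(packed & 0xFFFFu)) == 0.5f);
    assert(HalfBitsToFloat(static_cast<std::uint16_t>(packed >> 16)) == 0.25f);
}
```

If you're already using DirectXMath, XMConvertFloatToHalf from DirectXPackedVector.h will do the conversion for you, so you'd only need the packing step.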

If you're using the open source DirectX Shader Compiler (DXC) to compile your HLSL, then all of this works in both D3D12 and Vulkan. When compiling to DXIL, you'll see the compiler emit code that uses the half data type. It will also mark the shader as requiring the "minimum-precision" feature. When compiling to SPIR-V, you'll see that ops involving min16float are decorated with RelaxedPrecision, which is the SPIR-V version of minimum precision. FXC also supports the minimum precision types for D3D11 and D3D12, if you're unfortunate enough to still be using that (which probably applies to most of us).

If you want to check what the hardware supports at runtime, in D3D11 you can query the device's minimum precision support by calling ID3D11Device::CheckFeatureSupport and passing D3D11_FEATURE_SHADER_MIN_PRECISION_SUPPORT to get back a D3D11_FEATURE_DATA_SHADER_MIN_PRECISION_SUPPORT struct with the caps. In D3D12 it's similar, except you want to ask for D3D12_FEATURE_D3D12_OPTIONS and look at the MinPrecisionSupport member of the returned D3D12_FEATURE_DATA_D3D12_OPTIONS structure. Unfortunately these values aren't really a guarantee: the driver is free to choose what precision to use for any particular operation even if it reports fp16 support. So you really need to use an IHV disassembly tool if you want to be 100% sure of what your GPU is doing. As of right now this value reports 16-bit support for Vega or later on AMD for both D3D11 and D3D12. Intel also reports that it supports fp16 on my Gen9 integrated GPU. Meanwhile for my Turing-based RTX 2080 the driver reports full fp16 support through D3D11, but not through D3D12. Strange!

Unfortunately for Vulkan there are no caps or extensions to indicate how the hardware will interpret RelaxedPrecision operations. This means that IHV tools or documentation are your only means of determining whether or not your operations will actually execute at fp16 precision.

When targeting Vulkan you also have an additional path for targeting fp16 through spirv-opt, which is the standard SPIR-V optimizer used by both glslc as well as DXC. Very recently a new --relax-float-ops pass was added to spirv-opt, which automatically tags everything with RelaxedPrecision. More details are provided in this whitepaper. This pass can also be enabled when using DXC by using the -Oconfig command line argument to invoke additional spirv-opt passes. I'm not sure how useful this actually is in practice, since you're generally going to want to be doing certain things at fp32 when they actually require that amount of precision.

Explicit FP16

Like the name suggests, with this path you'll be writing code that explicitly works with fp16 data types instead of polymorphic types. Being explicit has its advantages: you know for sure that it's going to be run at exactly 16-bit precision without having to query device caps. But the major downside is that your shader will now only work on devices that support the related features and extensions. This means you'll most likely need to compile shaders with and without fp16 types, at least until fp16-capable hardware is ubiquitous on desktop. The other major advantage of being explicit is that you can actually use fp16 data types in your resources, which means you can pack fp16/uint16 data in your constant buffers and structured buffers without needing a pile of ugly code to unpack and convert from 32-bit types. That's not only convenient, it also makes it easier for you and the driver to avoid unnecessary conversions when performing fp16 math.

Explicit fp16 is only supported in DXC with Shader Model 6.2 or higher, which means there's no support for FXC or D3D11. To compile your shader for explicit fp16 you'll want to pass -enable-16bit-types as an argument and make sure that you're using one of the *_6_2 profiles (or higher). Once you flip this switch, the half type stops behaving as a float and instead acts as a true fp16 type. They've also added a new float16_t type that you can use as well, along with matching float32_t and float64_t types. Here's a simplified version of the table from their wiki showing how each type behaves with and without the switch:

| HLSL Type  | Without -enable-16bit-types | With -enable-16bit-types |
|------------|-----------------------------|--------------------------|
| float      | float32_t                   | float32_t                |
| float32_t  | float32_t                   | float32_t                |
| min10float | min16float (warning)        | float16_t (warning)      |
| min16float | min16float                  | float16_t (warning)      |
| half       | float32_t                   | float16_t                |
| float16_t  | N/A                         | float16_t                |
| double     | float64_t                   | float64_t                |
| float64_t  | float64_t                   | float64_t                |
| int        | int32_t                     | int32_t                  |
| int32_t    | int32_t                     | int32_t                  |
| uint       | uint32_t                    | uint32_t                 |
| uint32_t   | uint32_t                    | uint32_t                 |
| min12int   | min16int (warning)          | int16_t (warning)        |
| min16int   | min16int                    | int16_t (warning)        |
| int16_t    | N/A                         | int16_t                  |
| min12uint  | min16uint (warning)         | uint16_t (warning)       |
| min16uint  | min16uint                   | uint16_t (warning)       |
| uint16_t   | N/A                         | uint16_t                 |
| int64_t    | int64_t                     | int64_t                  |
| uint64_t   | uint64_t                    | uint64_t                 |

As you can see you also get 16-bit signed/unsigned integers with this flag, which is great for packing more data into your buffers. The compiler also conveniently maps the min16float/min16int/min16uint minimum precision types to their native 16-bit counterparts, which can simplify porting older code. Just be aware that the compiler will emit a warning in this case, which is meant to remind you that you're no longer getting the "flexible precision" behavior that those types normally provide.

Like I mentioned earlier, in this mode half is back to representing true fp16 values instead of being silently mapped to the float type under the hood. This also means that the h suffix for literals actually works the way you want it to, which lets you avoid the ambiguities that unadorned literals cause with overload resolution. In my opinion this results in cleaner and easier to understand code with less surprising behavior. Here's the example that I showed earlier with functions overloaded by return and parameter types, except this time we'll use explicit fp16:

float DoSomething(in float x, in float y)
{
    // do some stuff in fp32
}

half DoSomething(in half x, in half y)
{
    // do some stuff in fp16
}

half x = 2.0h;
half y = DoSomething(1.0h, x);  // this is fine!

With explicit fp16 we know that we're giving up the convenience of polymorphic data types, which means our compiled shader will only run on hardware that actually supports fp16 operations. So how do we check this in our graphics APIs? In D3D12, we do this by calling ID3D12Device::CheckFeatureSupport and passing D3D12_FEATURE_D3D12_OPTIONS4 to obtain a D3D12_FEATURE_DATA_D3D12_OPTIONS4 structure, and checking the value of the Native16BitShaderOpsSupported member. Note that the DXIL produced by the compiler will look very similar to what you would get when using the minimum precision types. The major difference is that it will be marked with the UseNativeLowPrecision flag in the metadata, which tells the runtime and driver that the fp16 ops need to be natively supported.

On Vulkan things are unfortunately a bit more complicated, since you need to deal with extensions. First, you'll want to check if your GPU and driver support the VK_KHR_shader_float16_int8 extension. If they do, you'll have two extended device feature flags that you need to check: shaderFloat16 for fp16 support, and shaderInt8 for 8-bit integer support. If shaderFloat16 is set, then you can use native fp16 math operations in your shaders. However this only applies to math operations, and not anything involving I/O with resources! For that you need to check the VK_KHR_16bit_storage extension, which adds 4 more feature flags that you need to check: storageBuffer16BitAccess, uniformAndStorageBuffer16BitAccess, storagePushConstant16, and storageInputOutput16. These 4 flags correspond to the SPIR-V capabilities outlined in the SPV_KHR_16bit_storage extension, and basically tell you what classes of resources can use 16-bit types in them. Nvidia Turing-based hardware currently reports support for shaderFloat16, shaderInt8, storageBuffer16BitAccess, uniformAndStorageBuffer16BitAccess, and storagePushConstant16, but not storageInputOutput16. Meanwhile AMD Vega and Navi-based hardware reports support for shaderFloat16, shaderInt8, storageBuffer16BitAccess, uniformAndStorageBuffer16BitAccess, and storageInputOutput16, but not storagePushConstant16. So basically you'll want to avoid using half types as inputs or outputs from your shader entry points since storageInputOutput16 isn't universally supported, and you'll also want to avoid using 16-bit push constants since storagePushConstant16 isn't universally supported.
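To make that decision process a bit more concrete, here's a rough C++ sketch of how an engine might pick which shader permutation to load. Everything here is hypothetical and only meant as an illustration: Fp16Features is a plain struct that mirrors the Vulkan feature flags by name (in a real application you'd fill it from the feature structs returned by vkGetPhysicalDeviceFeatures2), and ShaderVariant and ChooseShaderVariant are made-up names. It assumes you follow the advice above and never rely on 16-bit push constants or 16-bit shader inputs/outputs, so those two flags are carried along but deliberately ignored:

```cpp
#include <cassert>

// Hypothetical mirror of the feature flags discussed above. In a real
// application these would be copied from the structs filled in by
// vkGetPhysicalDeviceFeatures2.
struct Fp16Features
{
    bool shaderFloat16;
    bool storageBuffer16BitAccess;
    bool uniformAndStorageBuffer16BitAccess;
    bool storagePushConstant16;     // not relied upon: missing on some hardware
    bool storageInputOutput16;      // not relied upon: missing on some hardware
};

// Hypothetical set of shader permutations an engine might compile offline
enum class ShaderVariant
{
    Fp32Only,               // compiled without -enable-16bit-types
    Fp16Math,               // fp16 math ops, but buffers keep 32-bit types
    Fp16MathAndBuffers,     // fp16 math ops plus 16-bit types in buffers
};

ShaderVariant ChooseShaderVariant(const Fp16Features& features)
{
    // Without native fp16 math nothing else matters
    if (!features.shaderFloat16)
        return ShaderVariant::Fp32Only;

    // 16-bit types in both storage and uniform buffers need both flags
    if (features.storageBuffer16BitAccess && features.uniformAndStorageBuffer16BitAccess)
        return ShaderVariant::Fp16MathAndBuffers;

    return ShaderVariant::Fp16Math;
}

// Small self-check using the flag combinations described in the text
void CheckShaderVariantExamples()
{
    const Fp16Features turingLike = { true, true, true, true, false };
    const Fp16Features vegaLike = { true, true, true, false, true };
    const Fp16Features noFp16 = { false, false, false, false, false };

    assert(ChooseShaderVariant(turingLike) == ShaderVariant::Fp16MathAndBuffers);
    assert(ChooseShaderVariant(vegaLike) == ShaderVariant::Fp16MathAndBuffers);
    assert(ChooseShaderVariant(noFp16) == ShaderVariant::Fp32Only);
}
```

Collapsing the flags into a small enum like this keeps the number of shader permutations manageable, at the cost of leaving some capabilities unused on hardware that supports more.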

That's a lot of details, but we're not quite done yet! It turns out that the original SPIR-V spec had a bunch of instructions included as "extension instructions", which are documented here. These are mostly transcendental functions like Pow and Sin, as well as the FClamp instruction that's commonly used for implementing the saturate() intrinsic. The original spec for these defined them as only taking 32-bit floating point inputs, which meant it was illegal to use them with fp16 values. The SPV_AMD_gpu_shader_half_float extension from AMD lifted this restriction, allowing AMD hardware to support these instructions with fp16 values. Fortunately the SPIR-V spec was revised to add full fp16 support sometime after that AMD extension was released, which means the extension is no longer necessary. Or rather I should say it's almost unnecessary, since the interpolation instructions still only support fp32 (the AMD extension registry has been updated to reflect this). [3]

When using DXC to compile HLSL to SPIR-V with -enable-16bit-types you should see ops generated that use types declared like this:

%half = OpTypeFloat 16
%v3half = OpTypeVector %half 3

...as opposed to seeing ops with the RelaxedPrecision tag. You'll also see ops indicating which extension features are required, such as OpCapability Float16 for general fp16 math support and OpCapability UniformAndStorageBuffer16BitAccess for using 16-bit types in uniform or storage buffers.


Hopefully this gives you enough information to decide on how to move forward with fp16 in your HLSL codebase. As of right now using the flexible precision path seems to be the only reasonable choice for targeting PC hardware, since fp16 math is only supported on very recent video cards. However it seems clear to me that explicit fp16 is going to be the future, since it's much nicer to work with once you can use it. The only question is how long it will take until we can safely ignore all hardware that can't do it. I'm not the kind of person to predict the future, so I'll leave that part up to someone else. 😄

  1. AMD's Polaris series (gfx803) supports fp16 at full rate (non-packed, unlike Vega) but it doesn't seem to be exposed in any API. Update 10/7/2019: Allan MacKinnon has helpfully pointed me to this GitHub issue where AMD engineers explained why they never enabled fp16 support for pre-gfx900 hardware.
  2. There's actually a few other types such as min10float and min16uint to go along with the 16-bit float types, but this article is just going to focus on fp16.
  3. The current version of DXC is erroneously marking the SPIR-V as requiring the SPV_AMD_gpu_shader_half_float extension when any of the extended instructions are emitted for fp16. I've filed an issue, and I assume it will be fixed very soon.