Saturday, September 03, 2016

Exec masks in LLVM

As is usual in GPUs, Radeon executes shaders in waves that run the same program for many threads or work-items simultaneously in lock-step. Given a single program counter for up to 64 items (e.g. pixels being processed by a pixel shader), branch statements must be lowered to manipulation of the exec mask (unless the compiler can prove the branch condition to be uniform across all items). The exec mask is simply a bit-field that contains a 1 for every thread that is currently active, so code like this:
    if (i != 0) {
        ... some code ...
    }
gets lowered to something like this:
    v_cmp_ne_i32_e32 vcc, 0, v1
    s_and_saveexec_b64 s[0:1], vcc
    s_xor_b64 s[0:1], exec, s[0:1]

if_block:
    ... some code ...

join:
    s_or_b64 exec, exec, s[0:1]
(The saveexec assembly instructions apply a bit-wise operation to the exec register, storing the original value of exec in their destination register. Also, we can introduce branches to skip the if-block entirely if the condition happens to be uniformly false.)
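To make the sequence above concrete, here is a toy simulation of what the three instructions compute. Real waves have 64 lanes; four suffice to illustrate, and all names here are illustrative Python, not actual compiler or ISA identifiers:

```python
def s_and_saveexec(exec_mask, vcc):
    # Returns (new exec, saved copy of the old exec), like
    # s_and_saveexec_b64: AND the condition into exec, keep the original.
    return exec_mask & vcc, exec_mask

exec_mask = 0b1111            # all four lanes active
vcc       = 0b0101            # lanes where the condition i != 0 holds

exec_mask, s01 = s_and_saveexec(exec_mask, vcc)
s01 = exec_mask ^ s01         # s_xor_b64: lanes deactivated by the branch

# ... the if-block runs here with exec == 0b0101 ...

exec_mask = exec_mask | s01   # s_or_b64 at the join point

assert exec_mask == 0b1111    # the original mask is restored
```

The XOR leaves exactly the lanes that failed the condition in s[0:1], which is why OR-ing that register back into exec at the join reactivates them.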

This is quite different from CPUs, and so a generic compiler framework like LLVM tends to get confused. For example, the fast register allocator in LLVM is a very simple allocator that just spills all live registers at the end of a basic block before the so-called terminators. Usually, those are just branch instructions, so in the example above it would spill registers after the s_xor_b64.

This is bad because the exec mask has already been reduced by the if-condition at that point, and so vector registers end up being spilled only partially.
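To see why this loses data, consider a toy model where a vector register holds one value per lane and a spill store only writes the lanes that are active in exec (illustrative Python, not compiler code):

```python
def masked_store(dst, src, exec_mask):
    # Only lanes whose exec bit is set are written to memory,
    # which is how vector stores behave on the hardware.
    for lane in range(len(src)):
        if exec_mask & (1 << lane):
            dst[lane] = src[lane]

vreg       = [10, 20, 30, 40]   # live values, one per lane
spill_slot = [0, 0, 0, 0]       # spill memory, initially stale

# Spilling after the s_xor_b64 means exec has already been reduced
# by the if-condition:
masked_store(spill_slot, vreg, 0b0101)

print(spill_slot)   # [10, 0, 30, 0] -- lanes 1 and 3 were never saved
```

Reloading this slot after the join, when the full mask is active again, brings back stale values for the lanes that were inactive at the time of the spill.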

Until recently, these issues were hidden by the fact that we lowered the control flow instructions into their final form only at the very end of the compilation process. However, earlier passes, including register allocation, can benefit from seeing the precise shape of the GPU-style control flow sooner. But then, some of the subtleties of the exec masks need to be taken into account by those earlier optimization passes as well.

A related problem arises with another GPU-specific specialty, the "whole quad mode". We want to be able to compute screen-space derivatives in pixel shaders - mip-mapping would not be possible without it - and the way this is done in GPUs is to always run pixel shaders on 2x2 blocks of pixels at once and approximate the derivatives by taking differences between the values for neighboring pixels. This means that the exec mask needs to be turned on for pixels that are not really covered by whatever primitive is currently being rendered. Those are called helper pixels.
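The derivative approximation itself is just a neighbor difference within the 2x2 quad. As a sketch (the quad layout is standard, but the function names are mine):

```python
# Values of some shaded quantity at the four pixels of a quad:
#   (x, y)    (x+1, y)
#   (x, y+1)  (x+1, y+1)
def quad_derivatives(v00, v10, v01, v11):
    ddx = v10 - v00   # difference along x within the top row
    ddy = v01 - v00   # difference along y within the left column
    return ddx, ddy

# For f(x, y) = 3x + 5y the exact derivatives are 3 and 5, and the
# quad differences recover them exactly:
f = lambda x, y: 3 * x + 5 * y
ddx, ddy = quad_derivatives(f(0, 0), f(1, 0), f(0, 1), f(1, 1))
print(ddx, ddy)   # 3 5
```

If only one pixel of the quad is actually covered by the primitive, the other three values still have to exist for these differences to be computable, which is precisely why helper pixels are kept running.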

However, there are times when helper pixels absolutely must be disabled in the exec mask, for example when storing to an image. A separate pass deals with the enabling and disabling of helper pixels. Ideally, this pass should run after instruction scheduling, since we want to be able to rearrange memory loads and stores freely, which can only be done before adding the corresponding exec-instructions. The instructions added by this pass look like this:
    s_mov_b64 s[2:3], exec
    s_wqm_b64 exec, exec

    ... code with helper pixels enabled goes here ...

    s_and_b64 exec, exec, s[2:3]

    ... code with helper pixels disabled goes here ...
Naturally, adding the bit-wise AND of the exec mask must happen in a way that doesn't conflict with any of the exec manipulations for control flow. So some careful coordination needs to take place.
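The effect of s_wqm_b64 can be modeled as: for each group of four consecutive lanes (one quad), activate the whole quad if any of its lanes is active. A toy model (the helper names and lane grouping convention here are mine):

```python
def wqm(mask, lanes=64):
    # Whole quad mode: for every quad (4 consecutive lanes), set all
    # four bits if any one of them is set.
    out = 0
    for q in range(0, lanes, 4):
        if (mask >> q) & 0b1111:
            out |= 0b1111 << q
    return out

exec_mask = 0b0001_0000_0110   # two quads partially covered, one empty
saved     = exec_mask          # s_mov_b64 s[2:3], exec
exec_mask = wqm(exec_mask)     # s_wqm_b64 exec, exec

# ... code needing derivatives runs with helper pixels enabled ...

exec_mask &= saved             # s_and_b64 exec, exec, s[2:3]
assert exec_mask == saved      # helper pixels are off again
```

The final AND works only because s[2:3] still holds the pre-WQM mask, which is why these saves must not collide with the registers used by the control-flow lowering.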

My suggestion is to allow arbitrary instructions at the beginning and end of basic blocks to be marked as "initiators" and "terminators", as opposed to the current situation, where there is no notion of initiators, and whether an instruction is a terminator is a property of the opcode. An alternative that Matt Arsenault is working on adds aliases for certain exec-instructions which act as terminators. This may well be sufficient; I'm looking forward to seeing the result.