Posts mit dem Label Radeon werden angezeigt. Alle Posts anzeigen
Posts mit dem Label Radeon werden angezeigt. Alle Posts anzeigen

Montag, November 10, 2025

An alias analysis shower thought

Alias analysis allows a compiler to understand whether two memory accesses may conflict in the sense that they both touch the same memory location and at least one of them is a write. This in turn enables certain compiler optimizations. For example, memory accesses that do not conflict can be rearranged as part of instruction scheduling. Using this freedom where it exists is especially important in GPU programs, where memory accesses frequently have latencies of many hundreds of cycles, and there is no out-of-order scheduling in hardware to bail us out.

Instruction scheduling typically happens after instruction selection. Instruction selection makes a program harder to reason about because it replaces an internal representation that uses fairly generic instructions (such as pointer addition and pointer-sized multiplications) with target-specific instructions that can be more complex to reason about (such as shift and an add combined into a single instruction, or a 64-bit addition decomposed into two 32-bit additions with a carry between them). This in turn makes alias analysis a lot harder.

LLVM addresses this issue by keeping references to pointer values in the pre-isel instruction intermediate representation around. Alias analysis is then performed based on those pointer values.

To some extent this seems a bit arbitrary and merely a curious historical artifact. There are many "isel-related" transforms that happen on LLVM IR before the "actual" instruction selection pass that generates MachineIR. Especially with the unfortunate co-existence of SelectionDAG and GlobalISel, there is a strong incentive to extract certain common lowerings into earlier passes in LLVM IR, to avoid code duplication. However, pulling complex lowerings on address calculations earlier in the pass pipeline currently means likely losing important information about pointers and therefore weakening alias analysis. It would be great if we could explicitly preserve the pointers from an earlier point in compilation somehow.

There really doesn't seem to be a good a priori argument against it. A compiler written from scratch could easily use a unified IR substrate throughout and freely choose a point at which pointers for alias analysis become "frozen". It'd just be a massive undertaking to move LLVM to such a model.

Mittwoch, Februar 07, 2024

Building a HIP environment from scratch

HIP is a C++-based, single-source programming language for writing GPU code. "Single-source" means that a single source file can contain both the "host code" which runs on the CPU and the "device code" which runs on the GPU. In a sense, HIP is "CUDA for AMD", except that HIP can actually target both AMD and Nvidia GPUs.

If you merely want to use HIP, your best bet is to look at the documentation and download pre-built packages. (By the way, the documentation calls itself "ROCm" because that's what AMD calls its overall compute platform. It includes HIP, OpenCL, and more.)

I like to dig deep, though, so I decided I want to build at least the user space parts myself to the point where I can build a simple HelloWorld using a Clang from upstream LLVM. It's all open-source, after all!

It's a bit tricky, though, in part because of the kind of bootstrapping problems you usually get when building toolchains: Running the compiler requires runtime libraries, at least by default, but building the runtime libraries requires a compiler. Luckily, it's not quite that difficult, though, because compiling the host libraries doesn't require a HIP-enabled compiler - any C++ compiler will do. And while the device libraries do require a HIP- (and OpenCL-)enabled compiler, it is possible to build code in a "freestanding" environment where runtime libraries aren't available.

What follows is pretty much just a list of steps with running commentary on what the individual pieces do, since I didn't find an equivalent recipe in the official documentation. Of course, by the time you read this, it may well be outdated. Good luck!

Components need to be installed, but installing into some arbitrary prefix inside your $HOME works just fine. Let's call it $HOME/prefix. All packages use CMake and can be built using invocations along the lines of:

cmake -S . -B build -GNinja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_INSTALL_PREFIX=$HOME/prefix -DCMAKE_PREFIX_PATH=$HOME/prefix
ninja -C build install

In some cases, additional variables need to be set.

Step 1: clang and lld

We're going to need a compiler and linker, so let's get llvm/llvm-project and build it with Clang and LLD enabled: -DLLVM_ENABLE_PROJECTS='clang;lld' -DLLVM_TARGETS_TO_BUILD='X86;AMDGPU'

Building LLVM is an art of its own which is luckily reasonably well documented, so I'm going to leave it at that.

Step 2: Those pesky cmake files

Build and install ROCm/rocm-cmake to avoid cryptic error messages down the road when building other components that use those CMake files without documenting the dependency clearly. Not rocket science, but man am I glad for GitHub's search function.

Step 3: libhsa-runtime64.so

This is the lowest level user space host-side library in the ROCm stack. Its services, as far as I understand them, include setting up device queues and loading "code objects" (device ELF files). All communication with the kernel driver goes through here.

Notably though, this library does not know how to dispatch a kernel! In the ROCm world, the so-called Architected Queueing Language is used for that. An AQL queue is setup with the help of the kernel driver (and that does go through libhsa-runtime64.so), and then a small ring buffer and a "door bell" associated with the queue are mapped into the application's virtual memory space. When the application wants to dispatch a kernel, it (or rather, a higher-level library like libamdhip64.so that it links against) writes an AQL packet into the ring buffer and "rings the door bell", which basically just means writing a new ring buffer head pointer to the door bell's address. The door bell virtual memory page is mapped to the device, so ringing the door bell causes a PCIe transaction (for us peasants; MI300A has slightly different details under the hood) which wakes up the GPU.

Anyway, libhsa-runtime64.so comes in two parts for what I am being told are largely historical reasons:

The former is statically linked into the latter...

Step 4: It which must not be named

For Reasons(tm), there is a fork of LLVM in the ROCm ecosystem, ROCm/llvm-project. Using upstream LLVM for the compiler seems to be fine and is what I as a compiler developer obviously want to do. However, this fork has an amd directory with a bunch of pieces that we'll need. I believe there is a desire to upstream them, but also an unfortunate hesitation from the LLVM community to accept something so AMD-specific.

In any case, the required components can each be built individually against the upstream LLVM from step 1:

  • hipcc; this is a frontend for Clang which is supposed to be user-friendly, but at the cost of adding an abstraction layer. I want to look at the details under the hood, so I don't want to and don't have to use it; but some of the later components want it
  • device-libs; as the name says, these are libraries of device code. I'm actually not quite sure what the intended abstraction boundary is between this one and the HIP libraries from the next step. I think these ones are meant to be tied more closely to the compiler so that other libraries, like the HIP library below, don't have to use __builtin_amdgcn_* directly? Anyway, just keep on building...
  • comgr; the "code object manager". Provides a stable interface to LLVM, Clang, and LLD services, up to (as far as I understand it) invoking Clang to compile kernels at runtime. But it seems to have no direct connection to the code-related services in libhsa-runtime64.so.

That last one is annoying. It needs a -DBUILD_TESTING=OFF

Worse, it has a fairly large interface with the C++ code of LLVM, which is famously not stable. In fact, at least during my little adventure, comgr wouldn't build as-is against the LLVM (and Clang and LLD) build that I got from step 1. I had to hack out a little bit of code in its symbolizer. I'm sure it's fine.

Step 5: libamdhip64.so

Finally, here comes the library that implements the host-side HIP API. It also provides a bunch of HIP-specific device-side functionality, mostly by leaning on the device-libs from the previous step.

It lives in ROCm/clr, which stands for either Compute Language Runtimes or Common Language Runtime. Who knows. Either one works for me. It's obviously for compute, and it's common because it also contains OpenCL support.

You also need ROCm/HIP at this point. I'm not quite sure why stuff is split up into so many repositories. Maybe ROCm/HIP is also used when targeting Nvidia GPUs with HIP, but ROCm/CLR isn't? Not a great justification in my opinion, but at least this is documented in the README.

CLR also needs a bunch of additional CMake options: -DCLR_BUILD_HIP=ON -DHIP_COMMON_DIR=${checkout of ROCm/HIP} -DHIPCC_BIN_DIR=$HOME/prefix/bin

Step 6: Compiling with Clang

We can now build simple HIP programs with our own Clang against our own HIP and ROCm libraries:

clang -x hip --offload-arch=gfx1100 --rocm-path=$HOME/prefix -rpath $HOME/prefix/lib -lstdc++ HelloWorld.cpp
LD_LIBRARY_PATH=$HOME/prefix/lib ./a.out

Neat, huh?

Sonntag, Dezember 31, 2023

Vulkan driver debugging stories

Recently, I found myself wanting to play some Cyberpunk 2077. Thanks to Proton, that's super easy and basically just works on Linux. Except that I couldn't enable raytracing, which annoyed me given that I have an RDNA3-based GPU that should be perfectly capable. Part of it may have been that I'm (obviously) using a version of the AMDVLK driver.

The first issue was that Proton simply wouldn't advertise raytracing (DXR) capabilities on my setup. That is easily worked around by setting VKD3D_CONFIG=dxr in the environment (in Steam launch options, set the command to VKD3D_CONFIG=dxr %command%).

This allowed me to enable raytracing in the game's graphics settings which unfortunately promptly caused a GPU hang and a GPUVM fault report in dmesg. Oh well, time for some debugging. That is (part of) my job, after all.

The fault originated from TCP, which means it's a shader vector memory access to a bad address. There's a virtually limitless number of potential root causes, so I told the amdgpu kernel module to take it easy on the reset attempts (by setting the lockup_timeout module parameter to a rather large value - that can be done on the Grub command line, but I chose to add a setting in /etc/modprobe.d/ instead) broke out good old trusty UMR in client/server mode (run with --server on the system under debug, and with --gui tcp://${address}:1234 on another system) to look at the waves that were hung. Sure enough, they had the fatal_halt bit set, were stuck a few instructions past a global_load_b64, and looking at VGPRs did suggest a suspicious address.

Tooling for shader debugging is stuck in the earlier parts of the 20th century (which may seem like an impressive feat of time travel given that programmable shading didn't even exist back then, but trust me it's genuinely and inherently way more difficult than CPU debug), so the next step was to get some pipeline dumps to correlate against the disassembly shown in UMR. Easy peasy, point the Vulkan driver at a custom amdVulkanSettings.cfg by way of the AMD_CONFIG_DIR environment variable and enable pipeline dumping by adding EnablePipelineDump,1 to the config file. Oh, and setting the AMD_DEBUG_DIR environment variable is helpful, too. Except now the game crashed before it even reached the main menu. Oops.

Well, that's a CPU code problem, and CPU debugging has left the 1970s firmly behind for somewhere in the 1990s or early 2000s. So let's get ourselves a debug build of the driver and attach gdb. Easy, right? Right?!? No. Cyberpunk 2077 is a Windows game, run in Proton, which is really Wine, which is really an emulator that likes to think of itself as not an emulator, run in some kind of container called a "pressure vessel" to fit the Steam theme. Fun.

To its credit, Proton tries to be helpful. You can set PROTON_DUMP_DEBUG_COMMANDS=1 in the environment which dumps some shell scripts to /tmp/proton-$user/ which allowed me to comparatively easily launch Cyberpunk 2077 from the terminal without going through the Steam client each time. But Wine seems to hate debugging, and it seems to hate debugging of native Linux code even more, and obviously the Vulkan driver is native Linux code. All my attempts to launch the game in some form of debugger in order to catch it red-handed were in vain.

At this point, I temporarily resigned myself to more debugging time travel of the bad kind, i.e. backwards in time to worse tooling. printf() still works, after all, and since the crash was triggered by enabling pipeline dumps, I had a fairly good idea about the general area in the driver that must have contained the problem.

So I went on a spree of sprinkling printf()s everywhere, which led to some extremely confusing and non-determinstic results. Confusing and non-deterministic is a really great hint, though, because it points at multi-threading. Indeed, Cyberpunk 2077 is a good citizen and does multi-threaded pipeline compilation. Or perhaps VKD3D is being helpful. Either way, it's a good thing except it exposed a bug in the driver. So I started sprinkling std::lock_guards everywhere. That helped narrow down the problem area. Add some good old staring at code and behold: somebody had very recently added a use of strtok() to the pipeline dumping logic. Very bad idea, very easy fix.

Okay, so I can dump some pipelines now, but I still don't get to the main menu because the game now crashes with an assertion somewhere in PAL. I could start staring at pipeline dumps, but this is an assertion that (1) suggests a legitimate problem, which means prioritizing it might actually be helpful, and (2) is in the kind of function that is called from just about everywhere, which means I really, really need to be able to look at a stacktrace now. It's time to revisit debuggers.

One of the key challenges with my earlier attempts at using gdb was that (1) Wine likes to fork off tons of processes, which means getting gdb to follow the correct one is basically impossible, and (2) the crash happens very quickly, so manually attaching gdb after the fact is basically impossible. But the whole point of software development is to make the impossible possible, so I tweaked the implementation of PAL_ASSERT to poke at /proc/self to figure out whether a debugger is already attached and if one isn't, optionally print out a helpful message including the PID and then sleep instead of calling abort() immediately. This meant that I could now attach gdb at my leisure, which I did.

And was greeted with an absolutely useless gdb session because Wine is apparently being sufficiently creative with the dynamic linker structures that gdb can't make sense of what's happening on its own and doesn't find any symbols, let alone further debug info, and so there was no useful backtrace. Remember how I mentioned that Wine hates debugging?

Luckily, a helpful soul pointed out that /proc/$pid/maps exists and tells us where .so's are mapped into a process address space, and there's absolutely nothing Wine can do about that. Even better, gdb allows the user to manually tell it about shared libraries that have been loaded. Even even better, gdb can be scripted with Python. So, I wrote a gdb script that walks the backtrace and figures out which shared libraries to tell gdb about to make sense of the backtraces. (Update: Friedrich Vock helpfully pointed out that attaching with gdb -p $pid /path/to/the/correct/bin/wine64 also allows gdb to find shared libraries.)

At this point, another helpful soul pointed out that Fossilize exists and can play back pipeline creation in a saner environment than a Windows game running on VKD3D in Wine in a pressure vessel. That would surely have reduced my debugging woes somewhat. Oh well, at least I learned something.

From there, fixing all the bugs was almost a walk in the park. The assertion I had run into in PAL was easy to fix, and finally I could get back to the original problem: that GPU hang. That turned out to be a fairly mundane problem in LLPC's raytracing implementation, for which I have a fix. It's still going to take a while to trickle out, in part because this whole debugging odyssey has a corresponding complex chain of patches that just take a while to ferment, and in part because pre-Christmas is very far from a quiet time and things have just been generally crazy. Still: very soon you, too, will be able to play Cyberpunk 2077 with raytracing using the AMDVLK driver.

Freitag, März 09, 2018

TableGen #5: DAGs

This is the fifth part of a series; see the first part for a table of contents.

With bit sequences, we have already seen one unusual feature of TableGen that is geared towards its specific purpose. DAG nodes are another; they look a bit like S-expressions:
def op1;
def op2;
def i32:

def Example {
  dag x = (op1 $foo, (op2 i32:$bar, "Hi"));
}
In the example, there are two DAG nodes, represented by a DagInit object in the code. The first node has as its operation the record op1. The operation of a DAG node must be a record, but there are no other restrictions. This node has two children or arguments: the first argument is named foo but has no value. The second argument has no name, but it does have another DAG node as its value.

This second DAG node has the operation op2 and two arguments. The first argument is named bar and has value i32, the second has no name and value "Hi".

DAG nodes can have any number of arguments, and they can be nested arbitrarily. The values of arguments can have any type, at least as far as the TableGen frontend is concerned. So DAGs are an extremely free-form way of representing data, and they are really only given meaning by TableGen backends.

There are three main uses of DAGs:
  1. Describing the operands on machine instructions.
  2. Describing patterns for instruction selection.
  3. Describing register files with something called "set theory".
I have not yet had the opportunity to explore the last point in detail, so I will only give an overview of the first two uses here.

Describing the operands of machine instructions is fairly straightforward at its core, but the details can become quite elaborate.

I will illustrate some of this with the example of the V_ADD_F32 instruction from the AMDGPU backend. V_ADD_F32 is a standard 32-bit floating point addition, at least in its 32-bit-encoded variant, which the backend represents as V_ADD_F32_e32.

Let's take a look at some of the fully resolved records produced by the TableGen frontend:
def V_ADD_F32_e32 {    // Instruction AMDGPUInst ...
  dag OutOperandList = (outs anonymous_503:$vdst);
  dag InOperandList = (ins VSrc_f32:$src0, VGPR_32:$src1);
  string AsmOperands = "$vdst, $src0, $src1";
  ...
}


def anonymous_503 {    // DAGOperand RegisterOperand VOPDstOperand
  RegisterClass RegClass = VGPR_32;
  string PrintMethod = "printVOPDst";
  ...
}
As you'd expect, there is one out operand. It is named vdst and an anonymous record is used to describe more detailed information such as its register class (a 32-bit general purpose vector register) and the name of a special method for printing the operand in textual assembly output. (The string "printVOPDst" will be used by the backend that generates the bulk of the instruction printer code, and refers to the method AMDGPUInstPrinter::printVOPDst that is implemented manually.)

There are two in operands. src1 is a 32-bit general purpose vector register and requires no special handling, but src0 supports more complex operands as described in the record VSrc_f32 elsewhere.

Also note the string AsmOperands, which is used as a template for the automatically generated instruction printer code. The operand names in that string refer to the names of the operands as defined in the DAG nodes.

This was a nice warmup, but didn't really demonstrate the full power and flexibility of DAG nodes. Let's look at V_ADD_F32_e64, the 64-bit encoded version, which has some additional features: the sign bits of the inputs can be reset or inverted, and the result (output) can be clamped and/or scaled by some fixed constants (0.5, 2, and 4). This will seem familiar to anybody who has worked with the old OpenGL assembly program extensions or with DirectX shader assembly.

The fully resolved records produced by the TableGen frontend are quite a bit more involved:
def V_ADD_F32_e64 {    // Instruction AMDGPUInst ...
  dag OutOperandList = (outs anonymous_503:$vdst);
  dag InOperandList =
    (ins FP32InputMods:$src0_modifiers, VCSrc_f32:$src0,
         FP32InputMods:$src1_modifiers, VCSrc_f32:$src1,
         clampmod:$clamp, omod:$omod);
  string AsmOperands = "$vdst, $src0_modifiers, $src1_modifiers$clamp$omod";
  list<dag> Pattern =
    [(set f32:$vdst, (fadd
      (f32 (VOP3Mods0 f32:$src0, i32:$src0_modifiers,
                      i1:$clamp, i32:$omod)),
      (f32 (VOP3Mods f32:$src1, i32:$src1_modifiers))))];
  ...
}

def FP32InputMods {     // DAGOperand Operand InputMods FPInputMods
  ValueType Type = i32;
 
string PrintMethod = "printOperandAndFPInputMods";
 
AsmOperandClass ParserMatchClass = FP32InputModsMatchClass;
  ...
}


def FP32InputModsMatchClass {   // AsmOperandClass FPInputModsMatchClass
  string Name = "RegOrImmWithFP32InputMods";
  string PredicateMethod = "isRegOrImmWithFP32InputMods";
  string ParserMethod = "parseRegOrImmWithFPInputMods";
  ...
}
The out operand hasn't changed, but there are now many more special in operands that describe whether those additional features of the instruction are used.

You can again see how records such as FP32InputMods refer to manually implemented methods. Also note that the AsmOperands string no longer refers to src0 or src1. Instead, the printOperandAndFPInputMods method on src0_modifiers and src1_modifiers will print the source operand together with its sign modifiers. Similarly, the special ParserMethod parseRegOrImmWithFPInputMods will be used by the assembly parser.

This kind of extensibility by combining generic automatically generated code with manually implemented methods is used throughout the TableGen backends for code generation.

Something else is new here: the Pattern. This pattern, together will all the other patterns defined elsewhere, is compiled into a giant domain-specific bytecode that executes during instruction selection to turn the SelectionDAG into machine instructions. Let's take this particular pattern apart:
(set f32:$vdst, (fadd ...))
We will match an fadd selection DAG node that outputs a 32-bit floating point value, and this output will be linked to the out operand vdst. (set, fadd and many others are defined in the target-independent include/llvm/Target/TargetSelectionDAG.td.)
(fadd (f32 (VOP3Mods0 f32:$src0, i32:$src0_modifiers,
                      i1:$clamp, i32:$omod)),
      (f32 (VOP3Mods f32:$src1, i32:$src1_modifiers)))
Both input operands of the fadd node must be 32-bit floating point values, and they will be handled by complex patterns. Here's one of them:
def VOP3Mods { // ComplexPattern
  string SelectFunc = "SelectVOP3Mods";
  int NumOperands = 2;
  ...
}
As you'd expect, there's a manually implemented SelectVOP3Mods method. Its signature is
bool SelectVOP3Mods(SDValue In, SDValue &Src,
                    SDValue &SrcMods) const;
It can reject the match by returning false, otherwise it pattern matches a single input SelectionDAG node into nodes that will be placed into src1 and src1_modifiers in the particular pattern we were studying.

Patterns can be arbitrarily complex, and they can be defined outside of instructions as well. For example, here's a pattern for generating the S_BFM_B32 instruction, which generates a bitfield mask:
def anonymous_2373anonymous_2371 {    // Pattern Pat ...
  dag PatternToMatch =
    (i32 (shl (i32 (add (i32 (shl 1, i32:$a)), -1)), i32:$b));
  list<dag> ResultInstrs = [(S_BFM_B32 ?:$a, ?:$b)];
  ...
}
The name of this record doesn't matter. The instruction selection TableGen backend simply looks for all records that have Pattern as a superclass. In this case, we match an expression of the form ((1 << a) - 1) << b on 32-bit integers into a single machine instruction.

So far, we've mostly looked at how DAGs are interpreted by some of the key backends of TableGen. As it turns out, most backends generate their DAGs in a fairly static way, but there are some fancier techniques that can be used as well. This post is already quite long though, so we'll look at those in the next post.

Dienstag, März 06, 2018

TableGen #4: Resolving variables

This is the fourth part of a series; see the first part for a table of contents.

It's time to look at some of the guts of TableGen itself. TableGen is split into a frontend, which parses the TableGen input, instantiates all the records, resolves variable references, and so on, and many different backends that generate code based on the instantiated records. In this series I'll be mainly focusing on the frontend, which lives in lib/TableGen/ inside the LLVM repository, e.g. here on the GitHub mirror. The backends for LLVM itself live in utils/TableGen/, together with the command line tool's main() function. Clang also has its own backends.

Let's revisit what kind of variable references there are and what kind of resolving needs to be done with an example:
class Foo<int src> {
  int Src = src;
  int Offset = 1;
  int Dst = !add(Src, Offset);
}

multiclass Foos<int src> {
  def a : Foo<src>;
  let Offset = 2 in
  def b : Foo<src>;
}

foreach i = 0-3 in
defm F#i : Foos<i>;
This is actually broken in older LLVM by one of the many bugs, but clearly it should work based on what kind of features are generally available, and with my patch series it certainly does work in the natural way. We see four kinds of variable references:
  • internally within a record, such as the initializer of Dst referencing Src and Offset
  • to a class template variable, such as Src being initialized by src
  • to a multiclass template variable, such as src being passed as a template argument for Foo
  • to a foreach iteration variable
As an aside, keep in mind that let in TableGen does not mean the same thing as in the many functional programming languages that have a similar construct. In those languages let introduces a new variable, but TableGen's let instead overrides the value of a variable that has already been defined elsewhere. In the example above, the let-statement causes the value of Offset to be changed in the record that was instantiated from the Foo class to create the b prototype inside multiclass Foos.

TableGen internally represents variable references as instances of the VarInit class, and the variables themselves are simply referenced by name. This causes some embarrassing issues around template arguments which are papered over by qualifying the variable name with the template name. If you pass the above example through a sufficiently fixed version of llvm-tblgen, one of the outputs will be the description of the Foo class:
class Foo<int Foo:src = ?> {
  int Src = Foo:src;
  int Offset = 1;
  int Dst = !add(Src, Offset);
  string NAME = ?;
}
As you can see, Foo:src is used to refer to the template argument. In fact, the template arguments of both classes and multiclasses are temporarily added as variables to their respective prototype records. When the class or prototype in a multiclass is instantiated, all references to the template argument variables are resolved fully, and the variables are removed (or rather, some of them are removed, and making that consistent is one of the many things I set out to clean up).

Similarly, references to foreach iteration variables are resolved when records are instantiated, although those variables aren't similarly qualified. If you want to learn more about how variable names are looked up, TGParser::ParseIDValue is a good place to start.

The order in which variables are resolved is important. In order to achieve the flexibility of overriding defaults with let-statements, internal references among record variables must be resolved after template arguments.

Actually resolving variable references used to be done by the implementations of the following virtual method of the Init class hierarchy (which represents initializers, i.e. values and expressions):
virtual Init *resolveReferences(Record &R, const RecordVal *RV) const;
This method recursively resolves references in the constituent parts of the expression and then performs constant folding, and returns the resulting value (or the original value if nothing could be resolved). Its interface is somewhat magical: R represents the "current" record which is used as a frame of reference for magical lookups in the implementation of !cast; this is a topic for another time, though. At the same time, variables referencing R are supposed to be resolved, but only if RV is null. If RV is non-null, then only references to that specific variable are supposed to be resolved. Additionally, some behaviors around unset depend on this.

This is replaced in my changes with
virtual Init *resolveReferences(Resolver &R) const;
where Resolver is an abstract base class / interface which can lookup values based on their variable names:
class Resolver {
  Record *CurRec;

public:
  explicit Resolver(Record *CurRec) : CurRec(CurRec) {}
  virtual ~Resolver() {}

  Record *getCurrentRecord() const { return CurRec; }
  virtual Init *resolve(Init *VarName) = 0;
  virtual bool keepUnsetBits() const { return false; }
};
The "current record" is used as a reference for the aforementioned magical !casts, and keepUnsetBits instructs the implementation of bit sequences in BitsInit not to resolve to ? (as was explained in the third part of the series). resolve itself is implemented by one of the subclasses, most notably:
  1. MapResolver: Resolve based on a dictionary of name-value pairs.
  2. RecordResolver: Resolve variable names that appear in the current record.
  3. ShadowResolver: Delegate requests to an underlying resolver, but filter out some names.
 This last type of resolver is used by the implementations of !foreach and !foldl to avoid mistakes with nesting. Consider, for example:
class Exclamation<list<string> messages> {
  list Messages = !foreach(s, messages, s # "!");
}

class Greetings<list<string> names>
    : Exclamation&lt!foreach(s, names, "Hello, " # s)>;

def : Greetings<["Alice", "Bob"]>;
This effectively becomes a nested !foreach. The iteration variable is named s in both, so when substituting s for the outer !foreach, we must ensure that we don't also accidentally substitute s in the inner !foreach. We achieve this by having !foreach wrap the given resolver with a ShadowResolver. The same principle applies to !foldl as well, of course.

Freitag, Februar 23, 2018

TableGen #3: Bits

This is the third part of a series; see the first part for a table of contents.

One of the main backend uses of TableGen is describing target machine instructions, and that includes describing the binary encoding of instructions and their constituents parts. This requires a certain level of bit twiddling, and TableGen supports this with explicit bit (single bit) and bits (fixed-length sequence of bits) types:
class Enc<bits<7> op> {
  bits<10> Encoding;

  let Encoding{9-7} = 5;
  let Encoding{6-0} = op;
}

def InstA : Enc<0x35>;
def InstB : Enc<0x08>;
... will produce records:
def InstA {     // Enc
  bits<10> Encoding = { 1, 0, 1, 0, 1, 1, 0, 1, 0, 1 };
  string NAME = ?;
}
def InstB {     // Enc
  bits<10> Encoding = { 1, 0, 1, 0, 0, 0, 1, 0, 0, 0 };
  string NAME = ?;
}
So you can quite easily slice and dice bit sequences with curly braces, as long as the indices themselves are constants.

But the real killer feature is that so-called unset initializers, represented by a question mark, aren't fully resolved in bit sequences:
class Enc<bits<3> opcode> {
  bits<8> Encoding;
  bits<3> Operand;

  let Encoding{0} = opcode{2};
  let Encoding{3-1} = Operand;
  let Encoding{5-4} = opcode{1-0};
  let Encoding{7-6} = { 1, 0 };
}

def InstA : Enc<5>;
... produces a  record:
def InstA {     // Enc
  bits<8> Encoding = { 1, 0, 0, 1, Operand{2}, Operand{1}, Operand{0}, 1 };
  bits<3> Operand = { ?, ?, ? };
  string NAME = ?;
}
So instead of going ahead and saying, hey, Operand{2} is ?, let's resolve that and plug it into Encoding, TableGen instead keeps the fact that bit 3 of Encoding refers to Operand{2} as part of its data structures.

Together with some additional data, this allows a backend of TableGen to automatically generate code for instruction encoding and decoding (i.e., disassembling). For example, it will create the source for a giant C++ method with signature
uint64_t getBinaryCodeForInstr(const MCInst &MI, /* ... */) const;
which contains a giant constant array with all the fixed bits of each instruction followed by a giant switch statement with cases of the form:
case AMDGPU::S_CMP_EQ_I32:
case AMDGPU::S_CMP_EQ_U32:
case AMDGPU::S_CMP_EQ_U64:
// more cases...
case AMDGPU::S_SET_GPR_IDX_ON: {
  // op: src0
  op = getMachineOpValue(MI, MI.getOperand(0), Fixups, STI);
  Value |= op & UINT64_C(255);
  // op: src1
  op = getMachineOpValue(MI, MI.getOperand(1), Fixups, STI);
  Value |= (op & UINT64_C(255)) << 8;
  break;
}
The bitmasks and shift values are all derived from the structure of unset bits as in the example above, and some additional data (the operand DAGs) are used to identify the operand index corresponding to TableGen variables like Operand based on their name. For example, here are the relevant parts of the S_CMP_EQ_I32 record generated by the AMDGPU backend's TableGen files:
 def S_CMP_EQ_I32 {      // Instruction (+ other superclasses)
  field bits<32> Inst = { 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, src1{7}, src1{6}, src1{5}, src1{4}, src1{3}, src1{2}, src1{1}, src1{0}, src0{7}, src0{6}, src0{5}, src0{4}, src0{3}, src0{2}, src0{1}, src0{0} };
  dag OutOperandList = (outs);
  dag InOperandList = (ins SSrc_b32:$src0, SSrc_b32:$src1);
  bits
<8> src0 = { ?, ?, ?, ?, ?, ?, ?, ? };
  bits
<8> src1 = { ?, ?, ?, ?, ?, ?, ?, ? };
  // many more variables...
}
Note how Inst, which describes the 32-bit encoding as a whole, refers to the TableGen variables src0 and src1. The operand indices used in the calls to MI.getOperand() above are derived from the InOperandList, which contains nodes with the corresponding names. (SSrc_b32 is the name of a record that subclasses RegisterOperand and describes the acceptable operands, such as registers and inline constants.)

Hopefully this helped you appreciate just how convenient TableGen can be. Not resolving the ? in bit sequences is an odd little exception to an otherwise fairly regular language, but the resulting expressive power is clearly worth it. It's something to keep in mind when we discuss how variable references are resolved.

Mittwoch, Februar 21, 2018

TableGen #2: Functional Programming

This is the second part of a series; see the first part for a table of contents.

When the basic pattern of having classes with variables that are filled in via template arguments or let-statements reaches the limits of its expressiveness, it can become useful to calculate values on the fly. TableGen provides string concatenation out of the box with the paste operator ('#'), and there are built-in functions which can be easily recognized since they start with an exclamation mark, such as !add, !srl, !eq, and !listconcat. There is even an !if-builtin and a somewhat broken and limited !foreach.

There is no way of defining new functions, but there is a pattern that can be used to make up for it: classes with ret-values:
class extractBit<int val, int bitnum> {
  bit ret = !and(!srl(val, bitnum), 1);
}

class Foo<int val> {
  bit bitFour = extractBit<val, 4>.ret;
}

def Foo1 : Foo<5>;
def Foo2 : Foo<17>;
This doesn't actually work in LLVM trunk right now because of the deficiencies around anonymous record instantiations that I mentioned in the first part of the series, but after a lot of refactoring and cleanups, I got it to work reliably. It turns out to be an extremely useful tool.

In case you're wondering, this does not support recursion and it's probably better that way. It's possible that TableGen is already accidentally Turing complete, but giving it that power on purpose seems unnecessary and might lead to abuse.

Without recursion, a number of builtin functions are required. There has been a !foreach for a long time, and it is a very odd duck:
def Defs {
  int num;
}

class Example<list<int> nums> {
  list<int> doubled = !foreach(Defs.num, nums, !add(Defs.num, Defs.num));
}

def MyNums : Example<[4, 1, 9, -3]>;
In many ways it does what you'd expect, except that having to define a dummy record with a dummy variable in this way is clearly odd and fragile. Until very recently it did not actually support everything you'd think even then, and even with the recent fixes there are plenty of bugs. Clearly, this is how !foreach should look instead:
class Example<list<int> nums> {
  list<int> doubled =
      !foreach(x, nums, !add(x, x));
}

def MyNums : Example<[4, 1, 9, -3]>;
... and that's what I've implemented.

This ends up being a breaking change (the only one in the whole series, hopefully), but !foreach isn't actually used in upstream LLVM proper anyway, and external projects can easily adapt.

A new feature that I have found very helpful is a fold-left operation:
class Enumeration<list<string> items> {
  list<string> ret = !foldl([], items, lhs, item,
      !listconcat(lhs, [!size(lhs) # ": " # item]));
}

def MyList : Enumeration<["foo", "bar", "baz"]>;
This produces the following record:
def MyList {    // Enumeration
  list<string> ret = ["0: foo", "1: bar", "2: baz"];
  string NAME = ?;
}
Needless to say, it was necessary to refactor the TableGen tool very deeply to enable this kind of feature, but I am quite happy with how it ended up.

The title of this entry is "Functional Programming", and in a sense I lied. Functions are not first-class values in TableGen even with my changes, so one of the core features of functional programming is missing. But that's okay: most of what you'd expect to have and actually need is now available in a consistent manner, even if it's still clunkier than in a "real" programming language. And again: making functions first-class would immediately make TableGen Turing complete. Do we really want that?

Montag, Februar 19, 2018

TableGen #1: What has TableGen ever done for us?

This is the first entry in an on-going series. Here's a list of all entries:
  1. What has TableGen ever done for us?
  2. Functional Programming
  3. Bits
  4. Resolving variables
  5. DAGs
  6. to be continued 
Also: here is a talk (slides + video) I gave in the FOSDEM 2019 LLVM devroom on TableGen.

Anybody who has ever done serious backend work in LLVM has probably developed a love-hate relationship with TableGen. At its best it can be an extremely useful tool that saves a lot of manual work. At its worst, it will drive you mad with bizarre crashes, indecipherable error messages, and generally inscrutable failures to understand what you want from it.

TableGen is an internal tool of the LLVM compiler framework. It implements a domain-specific language that is used to describe many different kinds of structures. These descriptions are translated to read-only data tables that are used by LLVM during compilation.

For example, all of LLVM's intrinsics are described in TableGen files. Additionally, each backend describes its target machine's instructions, register file(s), and more in TableGen files.

The unit of description is the record. At its core, a record is a dictionary of key-value pairs. Additionally, records are typed by their superclass(es), and each record can have a name. So for example, the target machine descriptions typically contain one record for each supported instruction. The name of this record is the name of the enum value which is used to refer to the instruction. A specialized backend in the TableGen tool collects all records that subclass the Instruction class and generates instruction information tables that is used by the C++ code in the backend and the shared codegen infrastructure.

The main point of the TableGen DSL is to provide an ostensibly convenient way to generate a large set of records in a structured fashion that exploits regularities in the target machine architecture. To get an idea of the scope, the X86 backend description contains ~47k records generated by ~62k lines of TableGen. The AMDGPU backend description contains ~39k records generated by ~24k lines of TableGen.

To get an idea of what TableGen looks like, consider this simple example:
def Plain {
  int x = 5;
}

class Room<string name> {
  string Name = name;
  string WallColor = "white";
}

def lobby : Room<"Lobby">;

multiclass Floor<int num, string color> {
  let WallColor = color in {
    def _left : Room<num # "_left">;
    def _right : Room<num # "_right">;
  }
}

defm first_floor : Floor<1, "yellow">;
defm second_floor : Floor
<2, "gray">;
This example defines 6 records in total. If you have an LLVM build around, just run the above through llvm-tblgen to see them for yourself. The first one has name Plain and contains a single value named x of value 5. The other 5 records have Room as a superclass and contain different values for Name and WallColor.

The first of those is the record of name lobby, whose Name value is "Lobby" (note the difference in capitalization) and whose WallColor is "white".

Then there are four records with the names first_floor_left, first_floor_right, second_floor_left, and second_floor_right. Each of those has Room as a superclass, but not Floor. Floor is a multiclass, and multiclasses are not classes (go figure!). Instead, they are simply collections of record prototypes. In this case, Floor has two record prototypes, _left and _right. They are instantiated by each of the defm directives. Note how even though def and defm look quite similar, they are conceptually different: one instantiates the prototypes in a multiclass (or several multiclasses), the other creates a record that may or may not have one or more superclasses.

The Name value of first_floor_left is "1_left" and its WallColor is "yellow", overriding the default. This demonstrates the late-binding nature of TableGen, which is quite useful for modeling exceptions to an otherwise regular structure:
class Foo {
  string salutation = "Hi";
  string message = salutation#", world!";
}

def : Foo {
  let
salutation = "Hello";
}
The message of the anonymous record defined by the def-statement is "Hello, world!".

There is much more to TableGen. For example, a particularly surprising but extremely useful feature are the bit sets that are used to describe instruction encodings. But that's for another time.

For now, let me leave you with just one of the many ridiculous inconsistencies in TableGen:
class Tag<int num> {
  int Number = num;
}

class Test<int num> {
  int Number1 = Tag<5>.Number;
  int Number2 = Tag<num>.Number;
  Tag Tag1 = Tag<5>;
  Tag Tag2 = Tag<num>;
}

def : Test<5>;
What are the values in the anonymous record? It turns out that Number1 and Number2 are both 5, but Tag1 and Tag2 refer to different records. Tag1 refers to an anonymous record with superclass Tag and Number equal to 5, while Tag2 also refers to an anonymous record, but with the Number equal to an unresolved variable reference.

This clearly doesn't make sense at all and is the kind of thing that sometimes makes you want to just throw it all out of the window and build your own DSL with blackjack and Python hooks. The problem with that kind of approach is that even if the new thing looks nicer initially, it'd probably end up in a similarly messy state after another five years.

So when I ran into several problems like the above recently, I decided to take a deep dive into the internals of TableGen with the hope of just fixing a lot of the mess without reinventing the wheel. Over the next weeks, I plan to write a couple of focused entries on what I've learned and changed, starting with how a simple form of functional programming should be possible in TableGen.

Samstag, September 09, 2017

radeonsi: out-of-order rasterization on VI+

I've been polishing a patch of Marek to enable out-of-order rasterization on VI+. Assuming it goes through as planned, this will be the first time we're adding driver-specific drirc configuration options that are unfamiliar to the enthusiast community (there's radeonsi_enable_sisched already, but Phoronix has reported on the sisched option often enough). So I thought it makes sense to explain what those options are about.

Background: Out-of-order rasterization

Out-of-order rasterization is an optimization that can be enabled in some cases. Understanding it properly requires some background on how tasks are spread across shader engines (SEs) on Radeon GPUs.

The frontends (vertex processing, including tessellation and geometry shaders) and backends (fragment processing, including rasterization and depth and color buffers) are spread across SEs roughly like this:


(Not shown are the compute units (CUs) in each SE, which is where all shaders are actually executed.)

The input assembler distributes primitives (i.e., triangles) and their vertices across SEs in a mostly round-robin fashion for vertex processing. In the backend, work is distributed across SEs by on-screen location, because that improves cache locality.

This means that once the data of a triangle (vertex position and attributes) is complete, most likely the corresponding rasterization work needs to be distributed to other SEs. This is done by what I'm simplifying as the "crossbar" in the diagram.

OpenGL is very precise about the order in which the fixed-function parts of fragment processing should happen. If one triangle comes after another in a vertex buffer and they overlap, then the fragments of the second triangle better overwrite the corresponding fragments of the first triangle (if they weren't rejected by the depth test, of course). This means that the "crossbar" may have to delay forwarding primitives from a shader engine until all earlier primitives (which were processed in another shader engine) have been forwarded. This only happens rarely, but it's still sad when it does.

There are some cases in which the order of fragments doesn't matter. Depth pre-passes are a typical example: the order in which triangles are written to the depth buffer doesn't matter as long as the "front-most" fragments win in the end. Another example are some operations involved in stencil shadows.

Out-of-order rasterization simply means that the "crossbar" does not delay forwarding triangles. Triangles are instead forwarded immediately, which means that they can be rasterized out-of-order. With the in-progress patches, the driver recognizes cases where this optimization can be enabled safely.

By the way #1: From this explanation, you can immediately deduce that this feature only affects GPUs with multiple SEs. So integrated GPUs are not affected, for example.

By the way #2: Out-of-order rasterization is entirely disabled by setting R600_DEBUG=nooutoforder.


Why the configuration options?

There are some cases where the order of fragments almost doesn't matter. It turns out that the most common and basic type of rendering is one of these cases. This is when you're drawing triangles without blending and with a standard depth function like LEQUAL with depth writes enabled. Basically, this is what you learn to do in every first 3D programming tutorial.

In this case, the order of fragments is mostly irrelevant because of the depth test. However, it might happen that two triangles have the exact same depth value, and then the order matters. This is very unlikely in common scenes though. Setting the option radeonsi_assume_no_z_fights=true makes the driver assume that it indeed never happens, which means out-of-order rasterization can be enabled in the most common rendering mode!

Some other cases occur with blending. Some blending modes (though not the most common ones) are commutative in the sense that from a purely mathematical point of view, the end result of blending two triangles together is the same no matter which order they're blended in. Unfortunately, additive blending (which is one of those modes) involves floating point numbers in a way where changing the order of operations can lead to different rounding, which leads to subtly different results. Using out-of-order rasterization would break some of the guarantees the driver has to give for OpenGL conformance.

The option radeonsi_commutative_blend_add=true tells the driver that you don't care about these subtle errors and will lead to out-of-order rasterization being used in some additional cases (though again, those cases are rarer, and many games probably don't encounter them at all).

tl;dr

Out-of-order rasterization can give a very minor boost on multi-shader engine VI+ GPUs (meaning dGPUs, basically) in many games by default. In most games, you should be able to set radeonsi_assume_no_z_fights=true and radeonsi_commutative_blend_add=true to get an additional very minor boost. Those options aren't enabled by default because they can lead to incorrect results.

Sonntag, Juni 25, 2017

ARB_gl_spirv, NIR linking, and a NIR backend for radeonsi

SPIR-V is the binary shader code representation used by Vulkan, and GL_ARB_gl_spirv is a recent extension that allows it to be used for OpenGL as well. Over the last weeks, I've been exploring how to add support for it in radeonsi.

As a bit of background, here's an overview of the various relevant shader representations that Mesa knows about. There are some others for really old legacy OpenGL features, but we don't care about those. On the left, you see the SPIR-V to LLVM IR path used by radv for Vulkan. On the right is the path from GLSL to LLVM IR, plus a mention of the conversion from GLSL IR to NIR that some other drivers are using (i965, freedreno, and vc4).

For GL_ARB_gl_spirv, we ultimately need to translate SPIR-V to LLVM IR. A path for this exists, but it's in the context of radv, not radeonsi. Still, the idea is to reuse this path.

Most of the differences between radv and radeonsi are in the ABI used by the shaders: the conventions by which the shaders on the GPU know where to load constants and image descriptors from, for example. The existing NIR-to-LLVM code needs to be adjusted to be compatible with radeonsi's ABI. I have mostly completed this work for simple VS-PS shader pipelines, which has the interesting side effect of allowing the GLSL-to-NIR conversion in radeonsi as well. We don't plan to use it soon, but it's nice to be able to compare.

Then there's adding SPIR-V support to the driver-independent mesa/main code.  This is non-trivial, because while GL_ARB_gl_spirv has been designed to remove a lot of the cruft of the old GLSL paths, we still need more supporting code than a Vulkan driver. This still needs to be explored a bit; the main issue is that GL_ARB_gl_spirv allows using default-block uniforms, so the whole machinery around glUniform*() calls has to work, which requires setting up all the same internal data structures that are setup for GLSL programs. Oh, and it looks like assigning locations is required, too.

My current plan is to achieve all this by re-using the GLSL linker, giving a final picture that looks like this:

So the canonical path in radeonsi for GLSL remains GLSL -> AST -> IR -> TGSI -> LLVM (with an optional deviation along the IR -> NIR -> LLVM path for testing), while the path for GL_ARB_gl_spirv is SPIR-V -> NIR -> LLVM, with NIR-based linking in between. In radv, the path remains as it is today.

Now, you may rightfully say that the GLSL linker is a huge chunk of subtle code, and quite thoroughly invested in GLSL IR. How could it possibly be used with NIR?

The answer is that huge parts of the linker don't really that much about the code in the shaders that are being linked. They only really care about the variables: uniforms and shader inputs and outputs. True, there are a bunch of linking steps that touch code, but most of them aren't actually needed for SPIR-V. Most notably, GL_ARB_gl_spirv doesn't require intrastage linking, and it explicitly disallows the use of features that only exist in compatibility profiles.

So most of the linker functionality can be preserved simply by converting the relevant variables (shader inputs/outputs, uniforms) from NIR to IR, then performing the linking on those, and finally extracting the linker results and writing them back into NIR. This isn't too much work. Luckily, NIR reuses the GLSL IR type system.

There are still parts that might need to look at the actual shader code, but my hope is that they are few enough that they don't matter.

And by the way, some people might want to move the IR -> NIR translation to before linking, so this work would set a foundation for that as well.

Anyway, I got a ridiculously simple toy VS-PS pipeline working correctly this weekend. The real challenge now is to find actual test cases...

Montag, Oktober 24, 2016

Compiling shaders: dynamically uniform variables and "convergent" intrinsics

There are some program transformations that are obviously correct when compiling regular single-threaded or even multi-threaded code, but that cannot be used for shader code. For example:

 v = texture(u_sampler, texcoord);  
 if (cond) {  
   gl_FragColor = v;  
 } else {  
   gl_FragColor = vec4(0.);  
 }  
   
 ... cannot be transformed to ...
   
 if (cond) {
   // The implicitly computed derivate of texcoord
   // may be wrong here if neighbouring pixels don't
   // take the same code path.
   gl_FragColor = texture(u_sampler, texcoord);
 } else {
   gl_FragColor = vec4(0.);  
 }

 ... but the reverse transformation is allowed.

Another example is:

 if (cond) {
   v = texelFetch(u_sampler[1], texcoord, 0);
 } else {
   v = texelFetch(u_sampler[2], texcoord, 0);
 }

 ... cannot be transformed to ...

 v = texelFetch(u_sampler[cond ? 1 : 2], texcoord, 0);
 // Incorrect, unless cond happens to be dynamically uniform.

 ... but the reverse transformation is allowed.

Using GL_ARB_shader_ballot, yet another example is:

 bool cond = ...;
 uint64_t v = ballotARB(cond);
 if (other_cond) {
   use(v);
 }

 ... cannot be transformed to ...

 bool cond = ...;
 if (other_cond) {
   use(ballotARB(cond));
   // Here, ballotARB returns 1-bits only for threads/work items
   // that take the if-branch.
 }

 ... and the reverse transformation is also forbidden.

These restrictions are all related to the GPU-specific SPMD/SIMT execution model, and they need to be taught to the compiler. Unfortunately, we partially fail at that today.

Here are some types of restrictions to think about (each of these restrictions should apply on top of any other restrictions that are expressible in the usual, non-SIMT-specific ways, of course):

  1. An instruction can be moved from location A to location B only if B dominates or post-dominates A.

    This restriction applies e.g. to instructions that take derivatives (like in the first example) or that explicitly take values from neighbouring threads (like in the third example). It also applies to barrier instructions.

    This is LLVM's convergent function attribute as I understand it.

  2. An instruction can be moved from location A to location B only if A dominates or post-dominates B.

    This restriction applies to the ballot instruction above, but it is not required for derivative computations or barrier instructions.

    This is in a sense dual to LLVM's convergent attribute, so it's co-convergence? Divergence? Not sure what to call this.

  3. Something vague about not introducing additional non-uniformity in the arguments of instructions / intrinsic calls.

    This last one applies to the sampler parameter of texture intrinsics (for the second example), to the ballot instruction, and also to the texture coordinates on sampling instructions that implicitly compute derivatives.

For the last type of restriction, consider the following example:

 uint idx = ...;
 if (idx == 1u) {
   v = texture(u_sampler[idx], texcoord);
 } else if (idx == 2u) {
   v = texture(u_sampler[idx], texcoord);
 }

 ... cannot be transformed to ...

 uint idx = ...;
 if (idx == 1u || idx == 2u) {
   v = texture(u_sampler[idx], texcoord);
 }

In general, whenever an operation has this mysterious restriction on its arguments, then the second restriction above must apply: we can move it from A to B only if A dominates or post-dominates B, because only then can we be certain that the move introduces no non-uniformity. (At least, this rule applies to transformations that are not SIMT-aware. A SIMT-aware transformation might be able to prove that idx is dynamically uniform even without the predication on idx == 1u or idx == 2u.)

However, the control flow rule is not enough:

 v1 = texture(u_sampler[0], texcoord);
 v2 = texture(u_sampler[1], texcoord);
 v = cond ? v1 : v2;

 ... cannot be transformed to ...

 v = texture(u_sampler[cond ? 0 : 1], texcoord);

The transformation does not break any of the CFG-related rules, and it would clearly be correct for a single-threaded program (given the knowledge that texture(...) is an operation without side effects). So the CFG-based restrictions really aren't sufficient to model the real set of restrictions that apply to the texture instruction. And it gets worse:

 v1 = texelFetch(u_sampler, texcoord[0], 0);
 v2 = texelFetch(u_sampler, texcoord[1], 0);
 v = cond ? v1 : v2;

 ... is equivalent to ...

 v = texelFetch(u_sampler, texcoord[cond ? 0 : 1], 0);

After all, texelFetch computes no implicit derivatives.

Calling the three kinds of restrictions 'convergent', 'co-convergent', and 'uniform', we get:

 texture(uniform sampler, uniform texcoord) convergent (co-convergent)
 texelFetch(uniform sampler, texcoord, lod) (co-convergent)
 ballotARB(uniform cond) convergent co-convergent
 barrier() convergent

For the texturing instructions, I put 'co-convergent' in parentheses because these instructions aren't inherently 'co-convergent'. The attribute is only there because of the 'uniform' function argument.

Actually, looking at the examples, it seems that co-convergent only appears when a function has a uniform argument. Then again, the texelFetch function can be moved freely in the CFG by a SIMT-aware pass that can prove that the move doesn't introduce non-uniformity to the sampler argument, so being able to distinguish functions that are inherently co-convergent (like ballotARB) from those that are only implicitly co-convergent (like texture and texelFetch) is still useful.

For added fun, things get muddier when you notice that in practice, AMDGPU doesn't even flag texturing intrinsics as 'convergent' today. Conceptually, the derivative-computing intrinsics need to be convergent to ensure that the texture coordinates for neighbouring pixels are preserved (as in the very first example). However, the AMDGPU backend does register allocation after the CFG has been transformed into the wave-level control-flow graph. So register allocation automatically preserves neighbouring pixels even when a texture instruction is sunk into a location with additional control-flow dependencies.

When we reach a point where vector register allocation happens with respect to the thread-level control-flow graph, then texture instructions really need to be marked as convergent for correctness. (This change would be beneficial overall, but is tricky because scalar register allocation must happen with respect to the wave-level control flow graph. LLVM currently wants to allocate all registers in one pass.)

Samstag, September 03, 2016

Exec masks in LLVM

Like is usual in GPUs, Radeon executes shaders in waves that execute the same program for many threads or work-items simultaneously in lock-step. Given a single program counter for up to 64 items (e.g. pixels being processed by a pixel shader), branch statements must be lowered to manipulation of the exec mask (unless the compiler can prove the branch condition to be uniform across all items). The exec mask is simply a bit-field that contains a 1 for every thread that is currently active, so code like this:
    if (i != 0) {
        ... some code ...
    }
gets lowered to something like this:
    v_cmp_ne_i32_e32 vcc, 0, v1
    s_and_saveexec_b64 s[0:1], vcc
    s_xor_b64 s[0:1], exec, s[0:1]

if_block:
    ... some code ...

join:
    s_or_b64 exec, exec, s[0:1]
(The saveexec assembly instructions apply a bit-wise operation to the exec register, storing the original value of exec in their destination register. Also, we can introduce branches to skip the if-block entirely if the condition happens to be uniformly false, )

This is quite different from CPUs, and so a generic compiler framework like LLVM tends to get confused. For example, the fast register allocator in LLVM is a very simple allocator that just spills all live registers at the end of a basic block before the so-called terminators. Usually, those are just branch instructions, so in the example above it would spill registers after the s_xor_b64.

This is bad because the exec mask has already been reduced by the if-condition at that point, and so vector registers end up being spilled only partially.

Until recently, these issues were hidden by the fact that we lowered the control flow instructions into their final form only at the very end of the compilation process. However, previous optimization passes including register allocation can benefit from seeing the precise shape of the GPU-style control flow earlier. But then, some of the subtleties of the exec masks need to be taken account by those earlier optimization passes as well.

A related problem arises with another GPU-specific specialty, the "whole quad mode". We want to be able to compute screen-space derivatives in pixel shaders - mip-mapping would not be possible without it - and the way this is done in GPUs is to always run pixel shaders on 2x2 blocks of pixels at once and approximate the derivatives by taking differences between the values for neighboring pixels. This means that the exec mask needs to be turned on for pixels that are not really covered by whatever primitive is currently being rendered. Those are called helper pixels.

However, there are times when helper pixels absolutely must be disabled in the exec mask, for example when storing to an image. A separate pass deals with the enabling and disabling of helper pixels. Ideally, this pass should run after instruction scheduling, since we want to be able to rearrange memory loads and stores freely, which can only be done before adding the corresponding exec-instructions. The instructions added by this pass look like this:
    s_mov_b64 s[2:3], exec
    s_wqm_b64 exec, exec

    ... code with helper pixels enabled goes here ...

    s_and_b64 exec, exec, s[2:3]

    ... code with helper pixels disabled goes here ...
Naturally, adding the bit-wise AND of the exec mask must happen in a way that doesn't conflict with any of the exec manipulations for control flow. So some careful coordination needs to take place.

My suggestion is to allow arbitrary instructions at the beginning and end of basic blocks to be marked as "initiators" and "terminators", as opposed to the current situation, where there is no notion of initiators, and whether an instruction is a terminator is a property of the opcode. An alternative, that Matt Arsenault is working on, adds aliases for certain exec-instructions which act as terminators. This may well be sufficient, I'm looking forward to seeing the result.

Donnerstag, Mai 19, 2016

A little 5-to-8-bit mystery

Writing the accelerated glReadPixels path for reads to PBOs for Gallium, I wanted to make sure the various possible format conversions are working correctly. They do, but I noticed something strange: when reading from a GL_RGB565 framebuffer to GL_UNSIGNED_BYTE, I was getting tiny differences in the results depending on the code path that was taken. What was going on?

Color values are conceptually floating point values, but most of the time, so-called normalized formats are used to store the values in memory. In fact, many probably think of color values as 8-bit normalized values by default, because of the way many graphics programs present color values and because of the #cccccc color format of HTML.

Normalized formats generalize this well-known notion to an arbitrary number of bits. Given a normalized integer value x in N bits, the corresponding floating point value is x / (2**N - 1) - for example, x / 255 for 8 bits and x / 31 for 5 bits. When converting between normalized formats with different bit depths, the values cannot be mapped perfectly. For example, since 255 and 31 are coprime, the only floating point values representable exactly in both 5- and 8-bit channels are 0.0 and 1.0.

So some imprecision is unavoidable, but why was I getting different values in different code paths?

It turns out that the non-PBO path first blits the requested framebuffer region to a staging texture, from where the result is then memcpy()d to the user's buffer. It is the GPU that takes care of the copy from VRAM, the de-tiling of the framebuffer, and the format conversion. The blit uses the normal 3D pipeline with a simple fragment shader that reads from the "framebuffer" (which is really bound as a texture during the blit) and writes to the staging texture (which is bound as the framebuffer).

Normally, fragment shaders operate on 32-bit floating point numbers. However, Radeon hardware allows an optimization where color values are exported from the shader to the CB hardware unit as 16-bit half-precision floating point numbers when the framebuffer does not require the full floating point precision. This is useful because it reduces the bandwidth required for shader exports and allows more shader waves to be in flight simultaneously, because less memory is reserved for the exports.

And it turns out that the value 20 in a 5-bit color channel, when first converted into half-float (fp16) format, becomes 164 in an 8-bit color channel, even though the 8-bit color value that is closest to the floating point number represented by 20 in 5-bit is actually 165. The temporary conversion to fp16 cuts off a bit that would make the difference.

Intrigued, I wrote a little script to see how often this happens. It turns out that 20 in a 5-bit channel and 32 in a 6-bit channel are the only cases where the temporary conversion to fp16 leads to the resulting 8-bit value to be off by one. Luckily, people don't usually use GL_RGB565 framebuffers... and as a general rule, taking a value from an N-bit channel, converting it to fp16, and then storing the value again in an N-bit value (of the same bit depth!) will always result in what we started out with, as long as N <= 11 (figuring out why is an exercise left to the reader ;-)) - so the use cases we really care about are fine.