tag:blogger.com,1999:blog-361375062024-02-20T20:02:25.380+01:00Tagebuch eines Interplanetaren BotschaftersLerne, wie die Welt wirklich ist, aber vergiss niemals, wie sie sein sollte.Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.comBlogger217125tag:blogger.com,1999:blog-36137506.post-45332543258488551692024-02-07T12:30:00.001+01:002024-02-07T12:30:00.138+01:00Building a HIP environment from scratch<p>HIP is a C++-based, single-source programming language for writing GPU code. "Single-source" means that a single source file can contain both the "host code" which runs on the CPU and the "device code" which runs on the GPU. In a sense, HIP is "CUDA for AMD", except that HIP can actually target both AMD and Nvidia GPUs.</p><p>If you merely want to <i>use</i> HIP, your best bet is to look at <a href="https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html">the documentation</a> and download pre-built packages. (By the way, the documentation calls itself "ROCm" because that's what AMD calls its overall compute platform. It includes HIP, OpenCL, and more.)</p><p>I like to dig deep, though, so I decided I want to build at least the user space parts myself to the point where I can build a simple <a href="https://github.com/ROCm/HIP-Examples/tree/master/HIP-Examples-Applications/HelloWorld">HelloWorld</a> using a Clang from <a href="https://github.com/llvm/llvm-project">upstream LLVM</a>. It's all open-source, after all!</p><p>It's a bit tricky, though, in part because of the kind of bootstrapping problems you usually get when building toolchains: Running the compiler requires runtime libraries, at least by default, but building the runtime libraries requires a compiler. Luckily, it's not quite <i>that</i> difficult, because compiling the host libraries doesn't require a HIP-enabled compiler - any C++ compiler will do. 
And while the device libraries do require a HIP- (and OpenCL-)enabled compiler, it is possible to build code in a "freestanding" environment where runtime libraries aren't available.<br /></p><p>What follows is pretty much just a list of steps with running commentary on what the individual pieces do, since I didn't find an equivalent recipe in the official documentation. Of course, by the time you read this, it may well be outdated. Good luck!</p><p>Components need to be installed, but installing into some arbitrary prefix inside your <span style="font-family: courier;">$HOME</span> works just fine. Let's call it <span style="font-family: courier;">$HOME/prefix</span>. All packages use CMake and can be built using invocations along the lines of:</p><p><span style="font-family: courier;">cmake -S . -B build -GNinja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_INSTALL_PREFIX=$HOME/prefix -DCMAKE_PREFIX_PATH=$HOME/prefix<br />ninja -C build install</span></p><p><span style="font-family: inherit;">In some cases, additional variables need to be set.</span><br /></p><h3 style="text-align: left;">Step 1: clang and lld</h3><p>We're going to need a compiler and linker, so let's get <a href="https://github.com/llvm/llvm-project">llvm/llvm-project</a> and build it with Clang and LLD enabled: <span style="font-family: courier;">-DLLVM_ENABLE_PROJECTS='clang;lld' -DLLVM_TARGETS_TO_BUILD='X86;AMDGPU'</span></p><p>Building LLVM is an art of its own which is luckily <a href="https://llvm.org/docs/GettingStarted.html#local-llvm-configuration">reasonably well documented</a>, so I'm going to leave it at that.</p><h3 style="text-align: left;">Step 2: Those pesky cmake files</h3><p>Build and install <a href="https://github.com/ROCm/rocm-cmake">ROCm/rocm-cmake</a> to avoid cryptic error messages down the road when building other components that use those CMake files without documenting the dependency clearly. 
Not rocket science, but man am I glad for GitHub's search function.<br /></p><h3 style="text-align: left;">Step 3: libhsa-runtime64.so</h3><p>This is the lowest-level user-space host-side library in the ROCm stack. Its services, as far as I understand them, include setting up device queues and loading "code objects" (device ELF files). All communication with the kernel driver goes through here.</p><p>Notably though, this library does <i>not</i> know how to dispatch a kernel! In the ROCm world, the so-called Architected Queueing Language is used for that. An AQL queue is set up with the help of the kernel driver (and <i>that</i> does go through libhsa-runtime64.so), and then a small ring buffer and a "door bell" associated with the queue are mapped into the application's virtual memory space. When the application wants to dispatch a kernel, it (or rather, a higher-level library like libamdhip64.so that it links against) writes an AQL packet into the ring buffer and "rings the door bell", which basically just means writing a new ring buffer head pointer to the door bell's address. 
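To make that concrete, here is a sketch following the packet layout described in the HSA specification - not code from any of the libraries above, and the struct and function names are my own:

```cpp
#include <atomic>
#include <cstdint>

// Layout of an AQL kernel dispatch packet per the HSA specification:
// exactly 64 bytes, so a queue's ring buffer is just an array of these.
struct KernelDispatchPacket {
  uint16_t header;          // packet type, acquire/release fences, barrier bit
  uint16_t setup;           // number of grid dimensions
  uint16_t workgroup_size_x, workgroup_size_y, workgroup_size_z;
  uint16_t reserved0;
  uint32_t grid_size_x, grid_size_y, grid_size_z;
  uint32_t private_segment_size;
  uint32_t group_segment_size;
  uint64_t kernel_object;   // address of the kernel's code descriptor
  uint64_t kernarg_address; // address of the kernel argument buffer
  uint64_t reserved2;
  uint64_t completion_signal;
};
static_assert(sizeof(KernelDispatchPacket) == 64,
              "AQL packets are 64 bytes by definition");

// Toy model of a user-space queue. On real hardware the doorbell is a
// device page mapped into the process; here it is a plain atomic so the
// ordering requirement is still visible.
struct Queue {
  KernelDispatchPacket ring[16];
  std::atomic<uint64_t> doorbell{0};
  uint64_t write_index = 0;
};

void dispatch(Queue &q, const KernelDispatchPacket &pkt) {
  q.ring[q.write_index % 16] = pkt;
  ++q.write_index;
  // "Ring the door bell": a release store, so the packet contents are
  // guaranteed visible before the packet processor sees the new index.
  q.doorbell.store(q.write_index, std::memory_order_release);
}
```

A real queue involves the HSA runtime's queue creation and signal APIs, but the 64-byte packet and the release-ordered doorbell store are the essential ingredients.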
The door bell virtual memory page is mapped to the device, so ringing the door bell causes a PCIe transaction (for us peasants; <a href="https://www.amd.com/en/products/accelerators/instinct/mi300/mi300a.html">MI300A</a> has slightly different details under the hood) which wakes up the GPU.</p><p>Anyway, libhsa-runtime64.so comes in two parts for what I am being told are largely historical reasons:</p><ul style="text-align: left;"><li><a href="https://github.com/ROCm/ROCT-Thunk-Interface">ROCm/ROCT-Thunk-Interface</a></li><li><a href="https://github.com/ROCm/ROCR-Runtime">ROCm/ROCR-Runtime</a>; this one has one of those bootstrap issues and needs a <span style="font-family: courier;">-DIMAGE_SUPPORT=OFF</span></li></ul><p>The former is statically linked into the latter...</p><h3 style="text-align: left;">Step 4: It which must not be named</h3><p>For Reasons(tm), there is a fork of LLVM in the ROCm ecosystem, <a href="https://github.com/ROCm/llvm-project">ROCm/llvm-project</a>. Using upstream LLVM for the compiler seems to be fine and is what I as a compiler developer obviously want to do. However, this fork has an <a href="https://github.com/ROCm/llvm-project/tree/amd-staging/amd"><span style="font-family: courier;">amd</span></a> directory with a bunch of pieces that we'll need. I believe there is a desire to upstream them, but also an unfortunate hesitation from the LLVM community to accept something so AMD-specific.<br /></p><p>In any case, the required components can each be built individually against the upstream LLVM from step 1:<br /></p><ul style="text-align: left;"><li><a href="https://github.com/ROCm/llvm-project/tree/amd-staging/amd/hipcc">hipcc</a>; this is a frontend for Clang which is supposed to be user-friendly, but at the cost of adding an abstraction layer. 
I want to look at the details under the hood, so I don't want to and don't have to use it; but some of the later components want it</li><li><a href="https://github.com/ROCm/llvm-project/tree/amd-staging/amd/device-libs">device-libs</a>; as the name says, these are libraries of device code. I'm actually not quite sure what the intended abstraction boundary is between this one and the HIP libraries from the next step. I think these ones are meant to be tied more closely to the compiler so that other libraries, like the HIP library below, don't have to use <span style="font-family: courier;">__builtin_amdgcn_*</span> directly? Anyway, just keep on building...</li><li><a href="https://github.com/ROCm/llvm-project/tree/amd-staging/amd/comgr">comgr</a>; the "code object manager". Provides a stable interface to LLVM, Clang, and LLD services, up to (as far as I understand it) invoking Clang to compile kernels at runtime. But it seems to have no direct connection to the code-related services in libhsa-runtime64.so.<br /></li></ul><p>That last one is annoying. It needs a <span style="font-family: courier;">-DBUILD_TESTING=OFF</span></p><p>Worse, it has a fairly large interface with the C++ code of LLVM, which is famously not stable. In fact, at least during my little adventure, comgr wouldn't build as-is against the LLVM (and Clang and LLD) build that I got from step 1. I had to hack out a little bit of code in its symbolizer. I'm sure it's fine.</p><h3 style="text-align: left;">Step 5: libamdhip64.so</h3><p>Finally, here comes the library that implements the host-side HIP API. It also provides a bunch of HIP-specific device-side functionality, mostly by leaning on the device-libs from the previous step.</p><p>It lives in <a href="https://github.com/ROCm/clr">ROCm/clr</a>, which stands for either Compute Language Runtimes or Common Language Runtime. Who knows. Either one works for me. 
It's obviously for compute, and it's common because it also contains OpenCL support.<br /></p><p>You also need <a href="https://github.com/ROCm/HIP/">ROCm/HIP</a> at this point. I'm not quite sure why stuff is split up into so many repositories. Maybe ROCm/HIP is also used when targeting Nvidia GPUs with HIP, but ROCm/CLR isn't? Not a great justification in my opinion, but at least this <i>is</i> documented in the <a href="https://github.com/ROCm/clr/blob/develop/README.md">README</a>.<br /></p><p>CLR also needs a bunch of additional CMake options: <span style="font-family: courier;">-DCLR_BUILD_HIP=ON -DHIP_COMMON_DIR=${checkout of ROCm/HIP} -DHIPCC_BIN_DIR=$HOME/prefix/bin</span></p><h3 style="text-align: left;">Step 6: Compiling with Clang</h3><p>We can now build simple HIP programs with our own Clang against our own HIP and ROCm libraries:</p><p><span style="font-family: courier;">clang -x hip --offload-arch=gfx1100 --rocm-path=$HOME/prefix -rpath $HOME/prefix/lib -lstdc++ <a href="https://github.com/ROCm/HIP-Examples/blob/master/HIP-Examples-Applications/HelloWorld/HelloWorld.cpp">HelloWorld.cpp</a><br />LD_LIBRARY_PATH=$HOME/prefix/lib ./a.out</span></p><p>Neat, huh?<br /></p>Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com2tag:blogger.com,1999:blog-36137506.post-10333132434875491872023-12-31T12:52:00.003+01:002023-12-31T14:25:10.471+01:00Vulkan driver debugging stories<p>Recently, I found myself wanting to play some Cyberpunk 2077. Thanks to Proton, that's super easy and basically just works on Linux. Except that I couldn't enable raytracing, which annoyed me given that I have an RDNA3-based GPU that should be perfectly capable. Part of it may have been that I'm (obviously) using a version of the <a href="https://github.com/GPUOpen-Drivers/amdvlk">AMDVLK</a> driver.</p><p>The first issue was that Proton simply wouldn't advertise raytracing (DXR) capabilities on my setup. 
That is easily worked around by setting <span style="color: #990000; font-family: courier;">VKD3D_CONFIG=dxr</span> in the environment (in Steam launch options, set the command to <span style="color: #990000; font-family: courier;">VKD3D_CONFIG=dxr %command%</span>).</p><p>This allowed me to enable raytracing in the game's graphics settings, which unfortunately promptly caused a GPU hang and a GPUVM fault report in <span style="color: #990000; font-family: courier;">dmesg</span>. Oh well, time for some debugging. That is (part of) my job, after all.</p><p>The fault originated from TCP, which means it's a shader vector memory access to a bad address. There's a virtually limitless number of potential root causes, so I told the <span style="color: #990000; font-family: courier;">amdgpu</span> kernel module to take it easy on the reset attempts (by setting the <span style="color: #990000; font-family: courier;">lockup_timeout</span> module parameter to a rather large value - that can be done on the Grub command line, but I chose to add a setting in <span style="color: #990000; font-family: courier;">/etc/modprobe.d/</span> instead) and broke out good old trusty <a href="https://gitlab.freedesktop.org/tomstdenis/umr/">UMR</a> in client/server mode (run with <span style="color: #990000; font-family: courier;">--server</span> on the system under debug, and with <span style="color: #990000; font-family: courier;">--gui tcp://${address}:1234</span> on another system) to look at the waves that were hung. 
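For reference, the /etc/modprobe.d/ variant is a one-line file; this is a sketch, where the file name is arbitrary and the value is a timeout in milliseconds chosen to be effectively forever:

```
# /etc/modprobe.d/amdgpu-debug.conf (file name is arbitrary)
options amdgpu lockup_timeout=600000
```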
Sure enough, they had the <span style="color: #990000; font-family: courier;">fatal_halt</span> bit set, were stuck a few instructions past a <span style="color: #990000; font-family: courier;">global_load_b64</span>, and looking at VGPRs did suggest a suspicious address.</p><p>Tooling for shader debugging is stuck in the earlier parts of the 20th century (which may seem like an impressive feat of time travel given that programmable shading didn't even exist back then, but trust me it's genuinely and inherently way more difficult than CPU debug), so the next step was to get some pipeline dumps to correlate against the disassembly shown in UMR. Easy peasy, point the Vulkan driver at a custom <span style="color: #990000; font-family: courier;">amdVulkanSettings.cfg</span> by way of the <span style="color: #990000; font-family: courier;">AMD_CONFIG_DIR </span>environment variable and enable pipeline dumping by adding <span style="color: #990000; font-family: courier;">EnablePipelineDump,1</span> to the config file. Oh, and setting the <span style="color: #990000; font-family: courier;">AMD_DEBUG_DIR</span> environment variable is helpful, too. Except now the game crashed before it even reached the main menu. Oops.</p><p>Well, that's a CPU code problem, and CPU debugging has left the 1970s firmly behind for somewhere in the 1990s or early 2000s. So let's get ourselves a debug build of the driver and attach gdb. Easy, right? Right?!? No. Cyberpunk 2077 is a Windows game, run in Proton, which is really Wine, which is really an emulator that likes to think of itself as not an emulator, run in some kind of container called a "pressure vessel" to fit the Steam theme. Fun.</p><p>To its credit, Proton tries to be helpful. 
You can set <span style="color: #990000; font-family: courier;">PROTON_DUMP_DEBUG_COMMANDS=1</span> in the environment, which dumps some shell scripts to <span style="color: #990000; font-family: courier;">/tmp/proton-$user/</span>, which allowed me to comparatively easily launch Cyberpunk 2077 from the terminal without going through the Steam client each time. But Wine seems to hate debugging, and it seems to hate debugging of native Linux code even more, and obviously the Vulkan driver is native Linux code. All my attempts to launch the game in some form of debugger in order to catch it red-handed were in vain.</p><p>At this point, I temporarily resigned myself to more debugging time travel of the bad kind, i.e. backwards in time to worse tooling. <span style="color: #990000; font-family: courier;">printf()</span> still works, after all, and since the crash was triggered by enabling pipeline dumps, I had a fairly good idea about the general area in the driver that must have contained the problem.</p><p>So I went on a spree of sprinkling <span style="color: #990000; font-family: courier;">printf()</span>s everywhere, which led to some extremely confusing and non-deterministic results. Confusing and non-deterministic is a really great hint, though, because it points at multi-threading. Indeed, Cyberpunk 2077 is a good citizen and does multi-threaded pipeline compilation. Or perhaps VKD3D is being helpful. Either way, it's a good thing, except it exposed a bug in the driver. So I started sprinkling <span style="color: #990000; font-family: courier;">std::lock_guard</span>s everywhere. That helped narrow down the problem area. Add some good old staring at code and behold: somebody had very recently added a use of <span style="color: #990000; font-family: courier;">strtok()</span> to the pipeline dumping logic. 
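The hazard with strtok() is that it keeps its scan position in hidden global state, so two threads tokenizing at the same time silently corrupt each other. The reentrant fix makes the caller own that state; a sketch (not the actual driver code):

```cpp
#include <cstring>
#include <string>
#include <vector>

// Tokenize a copy of `text`. strtok_r() stores its position in the
// caller-provided `save` pointer instead of a hidden static variable,
// so concurrent calls from multiple threads cannot interfere.
std::vector<std::string> tokenize(std::string text, const char *delims) {
  std::vector<std::string> tokens;
  char *save = nullptr;
  for (char *tok = strtok_r(text.data(), delims, &save); tok;
       tok = strtok_r(nullptr, delims, &save))
    tokens.push_back(tok);
  return tokens;
}
```

In C++ code, splitting on a std::string_view avoids the C tokenizer family entirely, which is the more robust fix.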
Very bad idea, very easy fix.</p><p>Okay, so I can dump some pipelines now, but I still don't get to the main menu because the game now crashes with an assertion somewhere in PAL. I could start staring at pipeline dumps, but this is an assertion that (1) suggests a legitimate problem, which means prioritizing it might actually be helpful, and (2) is in the kind of function that is called from just about everywhere, which means I really, really need to be able to look at a stacktrace now. It's time to revisit debuggers.</p><p>One of the key challenges with my earlier attempts at using gdb was that (1) Wine likes to fork off tons of processes, which means getting gdb to follow the correct one is basically impossible, and (2) the crash happens very quickly, so manually attaching gdb after the fact is basically impossible. But the whole point of software development is to make the impossible possible, so I tweaked the implementation of <span style="color: #990000; font-family: courier;">PAL_ASSERT</span> to poke at <span style="color: #990000; font-family: courier;">/proc/self</span> to figure out whether a debugger is already attached and if one isn't, optionally print out a helpful message including the PID and then sleep instead of calling abort() immediately. This meant that I could now attach gdb at my leisure, which I did.</p><p>And was greeted with an absolutely useless gdb session because Wine is apparently being sufficiently creative with the dynamic linker structures that gdb can't make sense of what's happening on its own and doesn't find any symbols, let alone further debug info, and so there was no useful backtrace. Remember how I mentioned that Wine hates debugging?</p><p>Luckily, a helpful soul pointed out that <span style="color: #990000; font-family: courier;">/proc/$pid/maps</span> exists and tells us where .so's are mapped into a process address space, and there's absolutely nothing Wine can do about that. 
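The format of /proc/$pid/maps is simple enough to parse by hand. This sketch mirrors what such a gdb helper has to do (it is an illustration, not the script itself): find the lowest mapped address of each shared object.

```cpp
#include <cstdint>
#include <fstream>
#include <map>
#include <sstream>
#include <string>

// Parse /proc/<pid>/maps and return the lowest mapped address of every
// shared object, i.e. the base address needed to relocate its symbols.
// Each line looks like:
//   7f1234560000-7f1234580000 r-xp 00000000 08:01 1234  /usr/lib/libfoo.so
std::map<std::string, uint64_t> sharedLibraryBases(const std::string &pid) {
  std::map<std::string, uint64_t> bases;
  std::ifstream maps("/proc/" + pid + "/maps");
  std::string line;
  while (std::getline(maps, line)) {
    std::istringstream fields(line);
    std::string range, perms, offset, dev, inode, path;
    fields >> range >> perms >> offset >> dev >> inode >> path;
    if (path.find(".so") == std::string::npos)
      continue; // anonymous mappings, [heap], the main binary, etc.
    uint64_t start = std::stoull(range.substr(0, range.find('-')), nullptr, 16);
    auto it = bases.find(path);
    if (it == bases.end() || start < it->second)
      bases[path] = start;
  }
  return bases;
}
```

That base address is roughly what you need to hand back to gdb (via add-symbol-file, adjusted for section offsets) to get symbols again.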
Even better, gdb allows the user to manually tell it about shared libraries that have been loaded. Even even better, gdb can be scripted with Python. So, I wrote <a href="https://gist.github.com/nhaehnle/754a914587f5da52c37c3481eb806996">a gdb script that walks the backtrace and figures out which shared libraries to tell gdb about to make sense of the backtraces</a>. (Update: Friedrich Vock <a href="https://mastodon.gamedev.place/@pixelcluster/111675112004784516">helpfully pointed out</a> that attaching with <span style="color: #990000; font-family: courier;">gdb -p $pid /path/to/the/correct/bin/wine64</span> also allows gdb to find shared libraries.)<br /></p><p>At this point, another helpful soul pointed out that <a href="https://github.com/ValveSoftware/Fossilize">Fossilize</a> exists and can play back pipeline creation in a saner environment than a Windows game running on VKD3D in Wine in a pressure vessel. That would surely have reduced my debugging woes somewhat. Oh well, at least I learned something.</p><p>From there, fixing all the bugs was almost a walk in the park. The assertion I had run into in PAL was easy to fix, and finally I could get back to the original problem: that GPU hang. That turned out to be a fairly mundane problem in <a href="https://github.com/GPUOpen-Drivers/llpc">LLPC</a>'s raytracing implementation, for which I have a fix. It's still going to take a while to trickle out, in part because this whole debugging odyssey has a corresponding complex chain of patches that just take a while to ferment, and in part because pre-Christmas is very far from a quiet time and things have just been generally crazy. 
Still: very soon you, too, will be able to play Cyberpunk 2077 with raytracing using the AMDVLK driver.<br /></p>Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-74592187777010350922023-05-12T10:40:00.001+02:002023-05-12T10:40:28.742+02:00An Update on Dialects in LLVM<p>EuroLLVM took place in Glasgow this week. I wasn't there, but it's a good opportunity to check in with what's been happening in dialects for LLVM in the ~half year since my <a href="https://www.youtube.com/watch?v=VbFqA9rvxPs">keynote</a> at the LLVM developer meeting.</p><h2 style="text-align: left;">Where we came from <br /></h2><p>To give an ultra-condensed recap: The excellent idea that MLIR brought to the world of compilers is to explicitly separate the <i>substrate</i> in which a compiler intermediate representation is implemented (the class hierarchy and basic structures that are used to represent and manipulate the program representation at compiler runtime) from the semantic definition of a <i>dialect</i> (the types and operations that are available in the IR and their meaning). Multiple dialects can co-exist on the same substrate, and in fact the phases of compilation can be identified with the set of dialects that are used within each phase.</p><p>Unfortunately for AMD's shader compiler, while MLIR is part of the LLVM project and shares some foundational support libraries with LLVM, its IR substrate is entirely disjoint from LLVM's IR substrate. If you have an existing compiler built on LLVM IR, you could bolt on an MLIR-based frontend, but what we really need is a way to gradually introduce some of the capabilities offered by MLIR throughout an existing LLVM-based compilation pipeline.<br /></p><p>That's why I started <a href="https://github.com/GPUOpen-Drivers/llvm-dialects">llvm-dialects</a> last year. 
We published its initial release a bit more than half a year ago, and have greatly expanded its capabilities since then.<br /></p><h2 style="text-align: left;">Where we are now</h2><p>We have been using llvm-dialects in production for a while now. Some of its highlights so far are:</p><ul style="text-align: left;"><li>Almost feature-complete for defining custom operations (aka intrinsics or instructions). The main thing that's missing is varargs support - we just haven't needed that yet.<br /></li><li>Most of the way there for defining custom types: custom types can be defined, but they can't be used everywhere. I'm <a href="https://discourse.llvm.org/t/rfc-target-type-classes-for-extensibility-of-llvm-ir/69813/">working on closing the gaps</a> as we speak - some upstream changes in LLVM itself are required.</li><li>Expressive <a href="https://github.com/GPUOpen-Drivers/llvm-dialects/blob/dev/docs/Constraints.md">language for describing constraints</a> on operation and type arguments and operation results - see examples <a href="https://github.com/GPUOpen-Drivers/llvm-dialects/blob/dev/example/ExampleDialect.td">here</a> and <a href="https://github.com/GPUOpen-Drivers/llpc/blob/dev/lgc/interface/lgc/LgcDialect.td">here</a>.<br /></li><li>Thorough, automatically generated IR verifier routines.<br /></li><li>A flexible and efficient <a href="https://github.com/GPUOpen-Drivers/llvm-dialects/blob/dev/include/llvm-dialects/Dialect/Visitor.h">visitor mechanism</a> that is inspired by but beats LLVM's TypeSwitch in some important ways.</li></ul><p>Transitioning to the use of llvm-dialects is a gradual process for us and far from complete. We have always had custom operations, but we used to implement them in a rather ad-hoc manner. 
The old way of doing it consisted of hand-writing code like this:</p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">SmallVector<Value *, 4> args;<br />std::string instName = lgcName::OutputExportXfb;<br />args.push_back(getInt32(xfbBuffer));<br />args.push_back(xfbOffset);<br />args.push_back(getInt32(streamId));<br />args.push_back(valueToWrite);<br />addTypeMangling(nullptr, args, instName);<br />return CreateNamedCall(instName, getVoidTy(), args, {});</span><br /></p><p>With llvm-dialects, we can use a much cleaner builder pattern:</p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">return create<InputImportGenericOp>(<br /> resultTy, false, location, getInt32(0), elemIdx,<br /> PoisonValue::get(getInt32Ty()));</span><br /></p><p>Accessing the operands of a custom operation used to be a matter of code with magic numbers everywhere:</p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">if (callInst.arg_size() > 2)<br /> vertexIdx = isDontCareValue(callInst.getOperand(2))<br /> ? 
nullptr : callInst.getOperand(2);</span><br /></p>With llvm-dialects, we get far more readable code:<p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">Value *vertexIdx = nullptr;<br />if (!inputOp.getPerPrimitive())<br /> vertexIdx = inputOp.getArrayIndex();</span><br /></p><p></p><p>Following the example set by MLIR, these accessor methods as well as the machinery required to make the <span style="font-family: courier;">create<FooOp>(...)</span> builder call work are automatically generated from a dialect definition written in a TableGen DSL.<br /></p><p>An important lesson from the transition so far is that the biggest effort, but also one of the biggest benefits, has to do with getting to a properly defined IR in the first place.</p><p>I firmly believe that understanding a piece of software starts not with the code that is executed but with the interfaces and data structures that the code implements and interacts with. In a compiler, the most important data structure is the IR. You should think of the IR as the bulk of the interface for almost all compiler code.<br /></p><p>When defining custom operations in the ad-hoc manner that we used to use, there isn't one place in which the operations themselves are defined. Instead, the definition is implicit in the scattered locations where the operations are created and consumed. More often than is comfortable, this leads to definitions that are fuzzy or confused, which leads to code that is fuzzy and confused, which leads to bugs and a high maintenance cost, which leads to the dark side (or something).</p><p>By having a designated location where the custom operations are explicitly defined - the TableGen file - there is a significant force pushing towards proper definitions. As the experience of MLIR shows, this isn't <i>automatic</i> (witness the rather thin documentation of many of the dialects in upstream MLIR), but without this designated location, it's bound to be worse. 
And so a large part of transitioning to a systematically defined dialect is cleaning up those instances of confusion and fuzziness. It pays off: I have found hidden bugs this way, and the code becomes noticeably more maintainable.<br /></p><h2 style="text-align: left;">Where we want to go<br /></h2><p>llvm-dialects is already a valuable tool for us. I'm obviously biased, but if you're in a similar situation to us, or you're thinking of starting a new LLVM-based compiler, I recommend it.</p><p>There is more that can be done, though, and I'm optimistic we'll get around to further improvements over time as we gradually convert parts of our compiler that are being worked on anyway. My personal list of items on the radar:</p><ul style="text-align: left;"><li>As mentioned already, closing the remaining gaps in custom type support.</li><li>Our compiler uses quite complex metadata in a bunch of places. It's hard to read for humans, doesn't have a good compatibility story for lit tests, and accessing it at compile-time isn't particularly efficient. I have some ideas for how to address all these issues with an extension mechanism that could also benefit upstream LLVM. <br /></li><li>Compile-time optimizations. At the moment, casting custom operations is still based on string comparison, which is clearly not ideal. There are a bunch of other things in this general area as well.<br /></li><li>I really want to see some equivalent of MLIR regions in LLVM. But that's a non-trivial amount of work and will require patience.<br /></li></ul><p>There's also the question of if or when llvm-dialects will eventually be integrated into LLVM upstream. There are lots of good arguments in favor. Its DSL for defining operations is a lot friendlier than what is used for intrinsics at the moment. Getting nice, auto-generated accessor methods and thorough verification for intrinsics would clearly be a plus. But it's not a topic that I'm personally going to push in the near future. 
I imagine we'll eventually get there once we've collected even more experience.</p><p>Of course, if llvm-dialects is useful to you and you feel like
contributing in these or other areas, I'd be more than happy about
that!<br /></p><p></p>Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-83520821768339677802023-01-21T05:20:00.001+01:002023-01-21T17:19:18.945+01:00Diff modulo base, a CLI tool to assist with incremental code reviews<p>One of the challenges of reviewing a lot of code is that many reviews require multiple iterations. I really don't want to do a full review from scratch on the second and subsequent rounds. I need to be able to see what has changed since last time.</p><p>I happen to work on projects that care about having a useful Git history. This means that authors of (without loss of generality) pull requests use amend and rebase to change commits and force-push the result. I would like to see only the changes they made since my last review pass. Especially when the author also rebased onto a new version of the main branch, existing code review tools tend to break down.</p><p>Git has a little-known built-in subcommand, <span style="font-family: courier;">git range-diff</span>, which I had been using for a while. It's pretty cool, really: It takes two ranges of commits, old and new, matches old and new commits, and then shows how they changed. The rather huge problem is that its output is a diff of diffs. Trying to make sense of those quickly becomes headache-inducing.</p><p>I finally broke down at some point late last year and wrote my own tool, which I'm calling <span style="font-family: courier;">diff-modulo-base</span>. 
It allows you to look at the difference of the repository contents between <span style="font-family: courier;">old</span> and <span style="font-family: courier;">new</span> in the history below, while ignoring all the changes that are due to differences in the respective base versions <span style="font-family: courier;">A</span> and <span style="font-family: courier;">B</span>.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieoGKnet10s5KErjwMpNxgO6_IFNxeCbMb-pEFgXg4AcKnD1LUioYqiWW3GcXPNx-_oW4uCKUvWAYXdvcK2ue4CqEsJeQ0jY687uBoGhZvbUTtKVVGF6CGAuLDo90Rwa6C-vQZVUjhwmlw4zbg5_cvEzyIEJ-FY2OKbAyBP7LiHSVgDyhutYI/s135/path31.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="57" data-original-width="135" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieoGKnet10s5KErjwMpNxgO6_IFNxeCbMb-pEFgXg4AcKnD1LUioYqiWW3GcXPNx-_oW4uCKUvWAYXdvcK2ue4CqEsJeQ0jY687uBoGhZvbUTtKVVGF6CGAuLDo90Rwa6C-vQZVUjhwmlw4zbg5_cvEzyIEJ-FY2OKbAyBP7LiHSVgDyhutYI/s16000/path31.png" /></a></div> <p></p>
workflow. That workflow is surprisingly usable in that you can
essentially treat the notifications as an inbox in which you can mark
notifications as unread or completed, and can "mute" issues and pull
requests, all with keyboard shortcuts.</p><p>What's missing in my workflow is a reliable way to remember the most recent version of a pull request that I have reviewed. My somewhat passable workaround for now is to <span style="font-family: courier;">git fetch</span> before I do a round of reviews, and rely on the local reflog of remote refs. A Git alias allows me to say</p><p><span style="font-family: courier;">git dmb-origin $pull_request_id</span></p><p>and have that become</p><p><span style="font-family: courier;">git diff-modulo-base origin/main origin/pull/$pull_request_id/head@{1} origin/pull/$pull_request_id/head<br /></span></p><p>which is usually what I want.</p><p>Ideally, I'd have a fully local way of interacting with GitHub notifications, which could then remember the reviewed version in a more reliable way. This ought to also fix the terrible lagginess of the web interface. But that's a rant for another time.</p><h3>Rust<br /></h3><p>This is the first serious piece of code I've written in Rust. I have to say that experience has really been quite pleasant so far. Rust's tooling is pretty great, mostly thanks to the rust-analyzer LSP server.<br /></p><p>The one thing I'd wish is that the borrow checker was able to better understand "partial" borrows. I find it occasionally convenient to tie a bunch of data structures together in a general context structure, and helper functions on such aggregates can't express that they only borrow part of the structure. This can usually be worked around by changing data types, but the fact that I have to do that is annoying. 
It feels like having to solve a puzzle that isn't part of the inherent complexity of the underlying problem that the code is trying to solve.</p><p>And unlike, say, circular references or graph structures in general, where it's clear that expressing and proving the sort of useful lifetime facts that developers might intuitively reason about quickly becomes intractable, improving the support for partial borrows feels like it should be a tractable problem.</p>Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-59818296350733481742023-01-14T23:30:00.001+01:002023-01-14T23:35:03.409+01:00Software testing, and why I'm unhappy about it<p>Automated testing of software is great. Unfortunately, what's commonly considered best practice for how to integrate testing into the development flow is a bad fit for a lot of software. You know what I mean: You submit a pull request, some automated testing process is kicked off in the background, and some time later you get back a result that's either Green (all tests passed) or Red (at least one test failed). Don't submit the PR if the tests are red. Sounds good, doesn't quite work.</p><p>There is Software Under Test (SUT), the software whose development is the goal of what you're doing, and there's the Test Bench (TB): the tests themselves, but also additional relevant parts of the environment like perhaps some hardware device that the SUT is run against.</p><p>The above development practice works well when the SUT and TB are both defined by the same code repository and are developed together. And admittedly, that is the case for a lot of useful software. But it just so happens that I mostly tend to work on software where the SUT and TB are inherently split. 
Graphics drivers and shader compilers implement some spec (like Vulkan or Direct3D), and an important part of the TB is a conformance test suite and other tests, the bulk of which are developed separately from the driver itself. Not to mention the GPU itself and other software like the kernel mode driver. The point is, TB development is split from SUT development, and it is infeasible to make changes to the two in lockstep.</p><h2 style="text-align: left;">Down with No Failures, long live No Regressions</h2><p>Problem #1 with keeping all tests passing all the time is that tests can fail for reasons whose root cause is not an SUT change.</p><p>For example, a new test case is added to the conformance test suite, but that test happens to fail. Suddenly nobody can submit any changes anymore.</p><p>That clearly makes no sense, and because Tooling Sucks(tm), what folks typically do is maintain a manual list of test cases that are excluded from automated testing. This unblocks development, but are you going to remember to update that exclusion list? Bonus points if the exclusion list isn't even maintained in the same repository as the SUT, which just compounds the problem.</p><p>The situation is worse when you bring up a large new feature or perhaps a new version of the hardware supported by your driver (which is really just a super duper large new feature), where there is already a large body of tests written by somebody else. Development of the new feature may take months and typically is merged bit by bit over time. For most of that time, there are going to be some test failures. And that's <i>fine</i>!</p><p>Unfortunately, a typical coping mechanism is that automated testing for the feature is entirely disabled until the development process is complete. The consequences are dire, as regressions in relatively basic functionality can go unnoticed for a fairly long time.</p><p>And sometimes there are simply changes in the TB that are hard to control.
Maybe you upgraded the kernel mode driver for your GPU on the test systems, and suddenly some weird corner case tests fail. Yes, you have to fix it somehow, but removing the test case from your automated testing process is almost always the wrong response. <br /></p><p>In fact, failing tests are, given the right context, a good thing! Let's say a bug is discovered in a real application in the field. Somebody root causes the problem and writes a simplified reproducer. This reproducer should be added to the TB as soon as possible, even if it is going to fail initially!</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjL2uq30HckWSXTD2FVakr8tW3SZQm16JpFpBtfFGe_YztdPMAg_Ks9ZCSN6on_FrglbxpWdY6BQYLbEta4ZL9sM1esXIz8kHsBLFsd-FtipULcmozF2vCE-vrDJJ87AOPQdbNtBAuvzImdlBDFdA6QC61cc-1N0ELsReAcdtOyrVeF0msRtEo/s886/77htb8.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="499" data-original-width="886" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjL2uq30HckWSXTD2FVakr8tW3SZQm16JpFpBtfFGe_YztdPMAg_Ks9ZCSN6on_FrglbxpWdY6BQYLbEta4ZL9sM1esXIz8kHsBLFsd-FtipULcmozF2vCE-vrDJJ87AOPQdbNtBAuvzImdlBDFdA6QC61cc-1N0ELsReAcdtOyrVeF0msRtEo/s320/77htb8.jpg" width="320" /></a></div><p>To be fair, many of the common testing frameworks recognize this by allowing tests to be marked as "expected to fail". But they typically also assume that the TB can be changed in lockstep with the SUT and fall on their face when that isn't the case.</p><p>What is needed here is to treat testing as a truly continuous exercise, with some awareness by the automation of how test runs relate to the development history.</p><p>During day-to-day development, the important bit isn't that there are no failures. 
The important bit is that there are no regressions.</p><p>Automation ought to track which tests pass on the main development branch and provide pre-commit reports for pull requests relative to those results: Have there been any regressions? Have any tests been fixed? Block code submissions when they cause regressions, but don't block them for pre-existing failures, <i>especially</i> when those failures are caused by changes in the TB.</p><p>Changes to the TB should also be tested where possible, and when they cause regressions those should be investigated. But it is quite common that regressions caused by a TB change are legitimate and shouldn't block the TB change.<br /></p><p></p><h2 style="text-align: left;">Sparse testing</h2><p>Problem #2 is that good test coverage means that tests take a very long time to run.</p><p>Your first solution to this problem should be to parallelize and throw more hardware at it. Let's hope the people who control the purse care enough about quality.</p><p>There is sometimes also low-hanging fruit you should pick, like wasting lots of time in process (or other) startup and teardown overhead. Addressing that can be a double-edged sword. Changing a test suite from running every test case in a separate process to running multiple test cases sequentially in the same process reduces isolation between the tests and can therefore make the tests flakier. It can also expose genuine bugs, though, and so the effort is usually worth it.</p><p>But all these techniques have their limits.</p><p>Let me give you an example. Compilers tend to have <i>lots</i> of switches that subtly change the compiler's behavior without (intentionally) affecting correctness. How good is your test coverage of these?</p><p>Chances are, most of them don't see any real testing at all. You probably have a few hand-written test cases that exercise the options. 
But what you really should be doing is to run an entire suite of end-to-end tests with each of the switches applied to make sure you aren't missing some interaction. And you really should be testing all combinations of switches as well.<br /></p><p>The combinatorial explosion is intense. Even if you only have 10 boolean switches, testing them each individually without regressing the turn-around time of the test suite requires 11x more test system hardware. Testing all possible combinations requires 1024x more. Nobody has that kind of money.</p><p>The good news is that having extremely high confidence in the quality of your software doesn't require that kind of money. If we run the entire test suite a small number of times (maybe even just once!) and independently choose a random combination of switches for each test, then not seeing any regressions there is a great indication that there really aren't any regressions.</p><p>Why is that? Because failures are correlated! Test T failing with a default setting of switches is highly correlated with test T failing with some non-default switch S enabled.</p><p>This effect isn't restricted to taking the cross product of a test suite with a bunch of configuration switches. By design, an exhaustive conformance test suite is going to have many sets of tests with high failure correlation. For example, in the Vulkan test suite you might have a bunch of test cases that all do the same thing, but with a different combination of framebuffer format and blend function. When there is a regression affecting such tests, the specific framebuffer format or blend function might not matter at all, and all of the tests will regress.
Or perhaps the regression is related to a specific framebuffer format, and so all tests using that format will regress regardless of the blend function that is used, and so on.</p><p>A good automated testing system would leverage these observations using statistical methods (aka machine learning).</p><p>Combinatorial explosion causes your full test suite to take months to run? No problem, treat testing as a genuinely continuous task. Have test systems continuously run random samplings of test cases on the latest version of the main development branch. Whenever a change is made to either the SUT or TB, switch over to testing that new version instead. When a failure is encountered, automatically determine if it is a regression by referring to earlier test results (of the exact same test if it has been run previously, or related tests otherwise) combined with a bisection over the code history.</p><p>Pre-commit testing becomes an interesting fuzzy problem. By all means have a small traditional test suite that is manually curated to run within a few minutes. But we can also apply the approach of running randomly sampled tests to pre-commit testing.</p><p>A good automated testing system would learn a statistical model of regressions and combine that with the test results obtained so far to provide an estimate of the likelihood of regression. As long as no regression is actually found, this likelihood will keep dropping as more tests are run, though it will not drop to 0 unless all tests are run (and in the setup described here, that would take months). The team can define a likelihood threshold that a change must reach before it can be committed, based on their appetite for risk and rate of development.</p><p>The statistical model should be augmented with source-level information about the change, such as keywords that appear in the diff and commit message and the set of files that was changed.
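</p>

<p>As a toy illustration (all names here are made up, and a real system would use such scores as sampling weights in a learned model rather than as a hard ordering), tests can be scored by keyword overlap with the changed files:</p>

```rust
// Toy prior: tests whose names share keywords with the changed files are
// more likely to catch a regression from this change, so they should be
// sampled with higher probability. All names here are hypothetical.
fn score_test(test_name: &str, changed_files: &[&str]) -> usize {
    changed_files
        .iter()
        .filter(|file| {
            // Split the test name into crude keywords and check whether
            // any of them appears in the changed file's path.
            test_name
                .split(|c: char| !c.is_alphanumeric())
                .any(|kw| !kw.is_empty() && file.contains(kw))
        })
        .count()
}

// Order tests so the most "suspicious" ones run first.
fn prioritize<'a>(tests: &[&'a str], changed_files: &[&str]) -> Vec<&'a str> {
    let mut ranked = tests.to_vec();
    // Stable sort: ties keep their original order.
    ranked.sort_by_key(|t| std::cmp::Reverse(score_test(t, changed_files)));
    ranked
}
```

<p>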
After all, there ought to be some meaningful correlation between regressions in a raytracing test case and the fact that the regressing change affected a file with "raytracing" in its name. The model should then also be used to bias the random sampling of tests to be run to maximize the information extracted per effort spent on running test cases.<br /></p><h2 style="text-align: left;">Some caveats</h2><p>What I've described is largely motivated by the fact that the world is messier than commonly accepted testing "wisdom" allows. However, the world is too messy even for what I've described.</p><p>I haven't talked about flaky (randomly failing) tests at all, though a good automated testing system should be able to cope with them. Re-running a test in the same configuration is not black magic and can be used to confirm that a test is flaky. If we wanted to get fancy, we could even estimate the failure probability and treat a significant increase of the failure rate as a regression!</p><p>Along similar lines, there can be state leakage between test cases that causes failures only when test cases are run in a specific order, or when specific test cases are run in parallel. This would manifest as flaky tests, and so flaky test detection ought to try to help tease out these scenarios. That is admittedly difficult and will probably never be entirely reliable. Luckily, it doesn't happen often.<br /></p><p>Sometimes, there are test cases that can leave a test system in such a broken state that it has to be rebooted. This is not entirely unusual in very early bringup of a driver for new hardware, when even the device's firmware may still be unstable. An automated test system can and should treat this case just like one would treat a crashing test process: Detect the failure, perhaps using some timer-based watchdog, force a reboot, possibly using a remote-controlled power switch, and resume with the next test case. 
But if a decent fraction of your test suite is affected, the resulting experience isn't fun and there may not be anything your team can do about it in the short term. So that's an edge case where manual exclusion of tests seems legitimate.<br /></p><p>So no, testing perfection isn't attainable for many kinds of software projects. But even relative to what feels like it should be realistically attainable, the state of the art is depressing.<br /></p>Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-71352250076900098802022-03-09T17:00:00.001+01:002022-03-09T17:00:00.226+01:00A New Type of Convergence Control Intrinsic?<p><i>Subgroup operations</i> or <i>wave intrinsics</i>, such as reducing a
value across the threads of a shader subgroup or wave, were introduced in GPU programming languages a while ago. They communicate
with other threads of the same wave, for example to exchange the input values of
a reduction, but not necessarily with all of them if there is divergent control
flow.</p><p>In LLVM, we call such operations <i>convergent</i>. Unfortunately, LLVM does not define how the set of communicating threads in convergent
operations -- the set of <i>converged</i> threads -- is affected by control
flow.</p><p>If you're used to thinking in terms of structured control flow, this may
seem trivial. Obviously, there is a tree of control flow constructs: loops,
if-statements, and perhaps a few others depending on the language. Two threads
are converged in the body of a child construct if and only if both execute that
body and they are converged in the parent. Throw in some simple and
intuitive rules about loop counters and early exits (nested return, break and
continue, that sort of thing) and
you're done.</p><p>In an unstructured control flow graph, the answer is not obvious at all. I gave
<a href="https://youtu.be/_Z5DuiVCFAw">a presentation</a> at the 2020 LLVM
Developers' Meeting that explains some of the challenges as well as a solution
proposal that involves adding <i>convergence control tokens</i> to the IR.</p><p>Very briefly, convergent operations in the proposal use a token variable that is defined by a
<i>convergence control intrinsic</i>. Two dynamic instances of the same
static convergent operation from two different threads are converged if and only
if the dynamic instances of the control intrinsic producing the used token
values were converged.</p><p>(The <a href="https://github.com/nhaehnle/llvm-project/blob/f1c9f9ff14c0c436259a47303c81f53f7304d7d6/llvm/docs/ConvergentOperations.rst">published draft</a> of the proposal
talks of multiple threads executing the same dynamic instance. I have since been
convinced that it's easier to teach this matter if we instead always give every
thread its own dynamic instances and talk about a convergence equivalence
relation between dynamic instances. This doesn't change the resulting
semantics.)</p><p>The draft has three such control intrinsics: anchor, entry, and (loop) heart.
Of particular interest here is the heart. For the most common and intuitive use cases, a
heart intrinsic is placed in the header of natural loops. The token it defines
is used by convergent operations in the loop. The heart intrinsic itself also uses
a token that is defined outside the loop: either by another heart in the case of
nested loops, or by an anchor or entry. The heart combines two intuitive behaviors:
</p><ul>
<li>It uses a token in much the same way that convergent operations do: two
threads are converged for their first execution of the heart if and only if they
were converged at the intrinsic that defined the used token.
</li><li>Two threads are converged at subsequent executions of the heart if and only if they were converged for the first execution and they are currently at the same loop iteration, where iterations are counted by a virtual loop counter that is incremented at the heart.
</li></ul><p>
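In the notation of the published draft, this typical pattern (a heart in the header of a natural loop, using a token defined outside the loop) looks roughly as follows. This is only a sketch: the intrinsic and operand bundle names are taken from the draft and may have changed since, and <span style="font-family: courier;">@subgroup.reduce.add</span> stands in for an arbitrary convergent operation.</p>

```llvm
declare token @llvm.experimental.convergence.anchor()
declare token @llvm.experimental.convergence.loop()
declare i32 @subgroup.reduce.add(i32) convergent

define void @example(i32 %n, i32 %v) convergent {
entry:
  %outer = call token @llvm.experimental.convergence.anchor()
  br label %loop

loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  ; The heart: inherits convergence from %outer on the first iteration
  ; and counts iterations after that.
  %heart = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
  ; Convergent operation: communicates among the threads that are
  ; converged at this heart, i.e. threads on the same iteration.
  %sum = call i32 @subgroup.reduce.add(i32 %v) [ "convergencectrl"(token %heart) ]
  %i.next = add i32 %i, 1
  %cond = icmp slt i32 %i.next, %n
  br i1 %cond, label %loop, label %exit

exit:
  ret void
}
```

<p>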
Viewed from this angle, how about we define a weaker version of these rules
that lies somewhere between an anchor and a loop heart? We could call it a
<i>"light heart"</i>, though I will stick with <i>"iterating anchor"</i>.
The iterating anchor defines a token but has no arguments. Like for the anchor,
the set of converged threads is implementation-defined -- when the iterating
anchor is first encountered. When threads encounter the iterating anchor again
<i>without leaving the dominance region of its containing basic block</i>,
they are converged if and only if they were converged during their previous
encounter of the iterating anchor.</p><p>The notion of an iterating anchor came up when discussing the
convergence behaviors that can be guaranteed for natural loops. Is it possible to
guarantee that natural loops always behave in the natural way -- according to
their loop counter -- when it comes to convergence?</p><p>Naively, this should be possible: just put hearts into loop headers!
Unfortunately, that's not so straightforward when multiple natural loops are
contained in an irreducible loop:<example picture=""> </example></p><p><example picture=""><img alt="" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAARkAAAC5CAYAAAAce0tNAAAgAElEQVR4Xu1dC3RV1Zn+sZ3BdonQUQSERYKlIHZBEosQOoUQSwVaIKAMCfIIEUhAigFTHhEJCaghtEgAMUShIVKEoBbCoyEIQvDBY5yB6KwKnbYkDlToWCtDB3XqWpn9Hdzh3pube97nnnPuf9bK4pH9Ot/e+zv/Y+//b9MkHuKHEWAEGAGbEGjDJGMTsh5v9ujRo3T27Fm6dOlS85vU1dUpf+/QoQN17txZ+X3ok5ycTCdOnGgul5CQ0Fzk7rvvJvwkJiZ6HB0evh4EmGT0oOXDsiARkAVI5dy5c/T5558rbwlyAGHcfPPNyr8D/64HBtk+6si/g6xAVPjp3bu30jaIB//mx38IMMn4b04jvtGZM2eoqqpKkTakRIKNjk1ulEiMQhiO4EByKSkpNHbsWEXq4cf7CDDJeH8OVd8AUgqIpaGhQdm4mZmZrlZZDhw4QNXV1YqEBcLJyMhgwlGdZfcWYJJx79yYGplfNioIEoQDCQzSltsJ0tSk+bQyk4yPJhZf/pKSEkViSUtLoxEjRvhKAgDRVFZWKoQD1S43N1cxQPPjbgSYZNw9P5pGh699eXm5suGw8eLj4zXV83IhEM3atWuVV1i2bFlMvLNX54tJxqszJ8a9e/du5csOT9C8efNi0jsDqQ3S26effkqLFi1yta3Jw0vN1NCZZEzB53xleF927NihkEt6ejpNmzat2c3s/Gjc0yM8VRs3bqT6+nrKyclRVEV+3IEAk4w75kHTKLZs2aJ4ibCJ4OLlpyUCIGGQTW1traJGwXbDT3QRYJKJLv6aeseZFqgEMOZCcuFHHQGQTVFRkXIAsLi4mA3E6pDZVoJJxjZozTeMjTJ//nzF1gJ7A5+I1Y8pbDb5+fk0cOBAxW7Fj/MIMMk4j7mmHiG9YHPgK8wivybIIhaSKlRZWRlLNebh1NUCk4wuuJwpDHLBFxgbgqUX6zDHOaLZs2crbn62aVmHq1pLTDJqCDn4e6hHWVlZylH6WbNmOdhz7HQFjEE00u0fO28evTdlkoke9kE945zHxIkTFdvL0KFDXTIq/w6jtLRUuXUOaZEfexFgkrEXX02tQzWCBIMFzzePNUFmSSEcZoSrm4nGEjhbbYRJxl58VVuXBFNRUcFH41XRsr6AvEjKRGM9trJFJhn7sFVtGWc4oCIxwahCZWsBJhpb4SUmGXvxjdg6CAaeDnZRR3ESvuoaNho8fJbG+rlgkrEe07AtwrAb6I7GuQ14OnhROzQBGrqRhneOQawBLB1FmGR0gGWm6KBBg5TquNSIRYwwBbt27TLTJNe1GAF8CGCA3759O186tRBbJhkLwYzUFM5mQHqRD7xIOBAmScehYXA3KgjA4wRjPEuY1i0VJhnrsIzYEnR+3EMK96xZs4YXtUPzoKWbcePGsTSjBSiNZZhkNAJlthg8GCNHjmzRDMIRFBYWKv+PCHf4aTx/XnxNzweVjY/vQXE9eiiqFh+JjzwbZnHEXEGa4VPXZlf99fpMMtbgqNoK3NVdunQJKodFjMDY5eIQ3m4RLDvhu30oZUASxXXrSvHd7gwq23DhT9R44SLVn/091R49RmPHpFG68E4x4VyHCRdKrcIRBnlIMzU1NarzygXUEWCSUcfIshLf+ta3lDCReMaPH09f/t8XdOniBZo58SFKe+B+6nBrO019ff7FF1R98A3auf91unLtc1qzdl3Mhp2ExDE/9zHLcYSnCWosByrXtCQjFmKSMY+h5hbgYcIXN1HE5P382v/S0wvmKuRi5qk78a+U99QvKPHe71FZ+Qsx5RUpXFZAVcITZAeOMNJDNeV7ZGZWJ6tL5t
HT2UJ2djYdev11mjLuJ7RwVhbd3LatzhZaL175ajVtrT5A23e+4vuvr3JbPXMq9ep6h204Zj4yXQGbIxGaX6IsyZjHUFMLEOvHjB5Fi3OmUfpoe4Jcnzz9Hs0pKKaKl7b6Vn0CjhPTJ9DcKRNsxfGRhcvo4clTqKCgQNP8cqHWEWCScWB14MubmjKENhQupIR77M3vfOm/P6ZRj/yUdu3Z57sLl4yjA4vVhi6YZGwANbTJcWljaMqoYabtL1qHWv/bszSncBUdqTvmKxsN46h1BbirHJOMzfMB42TT1U+oINfZSHfwPm3dd4h2Ve+x+Q2daZ5xdAZnO3phkrED1a/aRCrV2TOn06FtwutjoZFX65AfmvU4Tct51PNnaRhHrTPuznJMMjbOi9PifeirQG2avng5na5/z8a3tL9pxtF+jO3sgUnGJnRxtL1wST4dfvlFm3rQ1uycpc9Q4qDBNEtc0PTiwzh6cdaCx8wkY9McThQhHUYP7i/crC3vK9nUZdhm4dbOKy6l4ydPOdmtZX0xjpZBGbWGmGRsgB6u1i6dO9OFU4dN2WLGzphL+w4dpS8b3jc1yp5Dfqx4muLj402143Rlszh+Pb5v0JDf+vWvKPneBMOv4VUcDb+wRRWZZCwCMrAZ3OItXVVM+ys2GG79w4sf0WPLnqGune6g7MkTKKFPb8NtzVn6NCV+3325nHCwLhLxmcURJCMJ+s8ff0J39k8xRdhuxdHwwnCoIpOMDUDPnpVDfbt3opxJEwy3XrX3+g3gO277J/pD4wWaIS5RGn3c6s7u8VXoCoS7CBfy0iyOgSQD7EL/rRdPt+Ko9z2cLs8kYwPiVnhDoCqteiJPUbfu+ucHTH2B3WqXAclAmsGDkBWhZGMWx0BSgWToVxxtWMKWNskkYymc1xsbNHAArc6fRwOT+hlqPXRDgHCK8n5qWGVCLJphD8+k842NhsZjV6VAkpF9BJKNWRxDbTKjhg2ldUVPUPeuwXF9tL6fW3HUOv5olWOSsRh5xIvp07s3/ealMup3dy9Dre87fJQu/fkvzSoSVKerf7tmSmXChnNb2AKEvYBxN9yDgF41+/fT4e2bWgTw0gpqqHr0uz820MJnVtPuTeu1NtGiHNo8LyIXes2IbviFLajIJGMQxLNnzxKi3eEcx+XLlwn/lpvmpptuov98s4biugZHt9Pa1ZwlK6h8286g4vgKm90cTU1NWofgSLlwkgyCRCEfOEimR3wcvbuvijp3vN3QeMLZYMzaZSRZI6AVp07RNi1MMtpwCioFO0KfPn1a/Qp3vbML7dm0ztCNa+kF+cv7x6l9u1ua+8Xi/uPbBw2J+riZ3X9UOn106bKBt7WvSiDJBJLLzTffrHSalNCPNq8sMIQj6ocSivTYGSVrieMHZ88pmT9zcnI8f2XDvtm90TKTjEGUt2zZouToCX3wdet8R0eaO+lBGp7yA92th6pKsoFN21+jdrd809DhPrdeLwDJQF2SkoskF/nOI4c/YBhHSTKhE2DmrEwgjhg30twkiCiHnD4l8jJnktFNAzcqINg08vQEPkjYVi1+BvfrRZnj00y0bl3V2rq3aP22X1NN7UHrGrWgpR07diiSQCi5yKazRJB1t+O4cuVKahQG9TIRDJ6f8AgwyRhcGbgZjDxK0jaDZiDFnD59miDlHK3ZS5tXFRls3dpqy0vLqE37js2pV6xt3b7WvIIjyLKqqooqKiqCUhHbh4y3WmaSMTBf+HrV19crXy+QTWpqqtIKpBh8meFhgtHy4/q3DbRufRXYYyq2bvOcodJLOMLon5+frxANe56C1zCTjI49DW8SpJfhw4cHBZjG/4Fsjhw50twawm0ufTSLUpLv09GD9UW9frbDSzhCqoWdDh8f9jzdWMtMMhr3NWwv5eXlSi4e5LEOfGAEhMcp8P+RUuPMO3W0YcUSjT3YU6ykbDNdpbZULKQvLz5ewxHSFzxPubm5NGKEPQHjvTaPTDIqMya9CHCxFh
cX65pfsy5YXZ2FKQyX67BJ2UqYhw4dOphtLmr1vYajkrJFSDQpKe67lBqNSWSSiYA69OyioiLlTk1ycrLu+YH0s6X8eXpt47O661pRIe+pn1P8PYk0T6hzXn68iiNsNCAcSL+x/DDJtDL7MO6eO3dOWSBmpIBo2RRgi0FqFITebM1F7KWF71Uc4SGrra1VDMJ+mAcja4ZJJgQ1GHch6maKMxoZGRlGMA2qAx09dchgerVsteE7OHoH8en/XFUIBh6lUPuR3rbcUt7LOOLqCSTi7SKlbizm1maSCdhF+OrI8w5WLgbF6zBlEu375XPU4dZ2tu9bJUtB9mwaKw4L+unxMo4YO04Ix+KdJyYZsQvxlYT+HBcXR4sXL7ZlXx6oqaGigifpV6XFtkk0kGCmi/SqYydk0LSsR2x5j2g36mUcsc5wShzXKGLJ8xTzJGPWuKtn0yn5g7Jn0OonHjcca6a1/mCDmbGokHLzFvhOggl9Zy/jCEMwXNxpaWlBZ630rCOvlY1pkoH0AhsMDk85ZZRDf4j4ltC7Jy19LNtwGAO50CC9rKvYRnsO1VHFS1tj5hCY13HEAU6sOb3HIrxGMBhvTJIM9GNMcjQPTG0UxFayspimPjRG+Ynvpi/2DM7AVO09QOu3vEyZ07IUN7UZL5gXFy/G7GUccdCwrk58HHzueYo5kpEuRRjgrDTuGtmkEJ1XigN+lcLg3OHWW2iMCEyVktyf4rp1bUE6UIcaL1ykMyIr5Cv7X6cG8Xd4v5YVFsUkuQTi7WUckZFh7dq1iufJrx+JmCEZGN1g3R84cKAr43/AzoBDZ3VH3hBXFBqp4cMPg3grvnt3cfEujhKT7qV0QS5GDgcaIUKv1fEijvJGP9R2vxw5CFw3MUEy+FqUlJQothevTSK+0gjuBBsExo4fBErCBTz5b68RgVPjBXZQizt16uT6MBeYXxiEcbrcbbGYzc6Xr0kGiwyHoPBg8pwy7pqdlND6paWlymYJ9+CdPvroI9+K2kaxlHY3uIu9smn96nnyLcm4wbhrdIOE1guUZkJ/B+kMQbf5uYEA7G7V1dWK5Bptu5uReYFaD/uMXzxPviQZfPml1d4vxrRw0gwCZCFQFj/XEZA35nuLlDR2Hap0CmvMtwyM5lUJXGLlK5KBXouvgB8POoVKM/hCg2TgJfP6IrRi48pj+/j6+8UoLmMYed3z5BuSgXEXQaWwyLxm3NW6yaQ0A1KRUfhg0PbzO2vBRqpHfoyxKz1PXg7r6XmSkR4EP+mwrW0sKc3AmCnTcMA1D6MwAiRNmzZNy570TRk/qUeRJgVRFxEZwKtSmqdJRrK8V8E3sttx1yqcOoDTo9DhY0V98qN6FGk94GMCoklPT7ckBImRtWe0jmdJRua7AcH4xbhrdBJlPZCujOTn50DWflaP1NYAbI52RgtQ69/I7z1HMn427hqZwNA6MmwFDuz5zbUtVUM/eI/MzLWM2ujkxV4z4/UUyUhrO8Dl3DaRpx1fe7jxzYYPNbO4rKwLKQ235o3GW7ZyLG5oCwnlKisrPXHnyRMkw18wY8sadgtsTBiKvezWlbeV8XFh1fjGWmgtoRwCY7npIKLrSQZAwk3r9Y1ijCbM15LeN6gYXksMLz8unNS+9XUgPU8yrCdsNiBl2CrdciDRMMkgODJ+Gs+fF7eGzwehEB/fg+LEpT4YH3FgzOhjVcYAo/37qZ7X1CdWj7SvPplQDlIe1Cg8OKyJO21ueHSRDKSKciGy7hb3QhK+24dSBiRFjH1Sf/b3VHv0GI0dk0bp4oapVsIBO4ORc3JyNNdxA5huH4NX1CdWj/SvJJALbnEHPjI3u1prdgsMmkgGm35+7mN06eIFmjnxIUp74H7NUfc//+ILqj74Bu0UgZauXBOJrtauixgi0k1BpdQmx4u/l+qTG92gbo/549b5xsc/NTVVubsV+CBYeY0IYB/uAbFUioh8dgsM6FuVZAqXFVCViNr19IK5CrmYeepO/CvlPf
ULSrz3e1RW/kLQnRs27ppBVn9dfPnkTWU3GFNlQHfYEvx8xkf/TEWuAQEgKSlJybgR7vnggw+CrtkoB1iFwND05d9p6oOjDAsMjX+6RMuKlmvSNFolGSWfb+ZU6tX1Dlo4K4tubtvWMnwqX62mrdUHaPvOVxTdUVrJYzEnjWWgGmhIkVDFlYRoG9XljWO/uNsNTIWpKiAOSCY4soA/AwkHxn7gqkiwglxOvPM2rX7yZyLM632m+qwXYWCXr3tB0U52Ve+J6PULSzJYfBPTJ9DcKRMoffQIU4NprfLJ0+/RnIJiypg0WYn6hi8Y3ya2BeqIjWLxwc3dvn17x6PHSfWIE9NbO++BpCP/Pm3KZPqXkT+knIfHW9qZop08/WzETBktSAaLDnmHNxQupIR77rZ0QKGNIeI+0qnu2rOPD9fZirR64zjoiMNdTp2vYPVIfU6sKAGSyZo6hTYsz7c815ccH4LcT54nDkouf4pGjBzZYtgtSAY5gaaMGmba/qIVIIhdcwpX0ZG6YyzJaAXNpnJSfUKqGDtDVrJ6ZNMEhjSraCT/Ml5kLX1Gd8odvSOEg+eh2Xm0eGlhi7UTRDIw8jZd/YQKcp0N5wjv09Z9hxTdjp/oIwD1qa2wwRUWFlo6GHmTePjw4b67V2UpUBY05qRGIoeLRIPDJs1soZk0k4yS+nPmdDq0TXh9LDTyasVLSRKf86gma7XWNrmccQRkPiAES7IiTq407nsxY4RxFKNXEzbV0YMH2GZTbe3Nzv7hPE2ev4SOnzzVrJk0k4zTalLoIKE2TV+8nE7Xvxe9meGegxCAQR4xTMxm2vTarWGvLwPY17aUP0+vbXw2Kq+S99TPKf6eRCWrKR6FZOD2KlyST4dffjEqg5Kdzln6DCUOGkyzxGlfftyDANQnPHqj50v1yI8xl90zO8EjiYaaFIrFdYfOXKo5+LoiBSskM1FE2xo9uL8QrVpahp0EE27tvOJSRdTix10I4EOEdKo4c6ElzAarR9GZv1IxPw2/PSPOwiyIzgC+6nXtL39F/3XlM1pTupbafPbZZ01dBNtcOHU4KraYUCR6Dvmx4mnSspCjimIMdi4DhmVmZka0nbF6FL3FMWjgAFqdP882d7XWN4M0039UOn106TK1EXcbmkpXFdP+ig1a6weV+3p83xb1vmx431BbqDRn6dOU+P0U9j4YRtD+ivA6XblypcUBSo5aaD/2kXoA/kkiIiIEBj3P7/7YQPfcP5r+8v5xat/uFqVq/QfnaNnq56hyTXHz/+lpE2V/8NBUWvPc89RmVk52U9/unShn0gS9bSjlQTJmSCW0U3ZnG5oGxytJ9UmmY8G/EfcH6pRfU9I4DrLODnF7/cw7dbRhxRKdNYk2bX+N2t3yTcVkcuXq3yhzfj6teiKPet0Vr7stWaGkbDNdpbbUZuyY0U1mDt9ZTTJslzE8p45XlNcC/v73v1PHjh1jJlOC40Br7DBLqLGD+/WizPFpGmvcKAZiua3vIPrj2wfp+L+foat/u0YzRMQFMw+uHKx4voLaJA+4r8mMDmc1yeCI8rCHZ9L5xkYz78d1HUBAqke4c4ab3LGSjsUBaA11MXL4AzR30oM0POUHhuqf+Pd62vraHirftjNIdTLUmKiEMzPjZ/+M2sR37950SLiu47vdaaitUJvME3OzaXneXENtyUpoUzi9TLXBle1FIFQ9kneRWF2yF/dIrScl9KPNKwtM3TnE3ttYXGhaisE4cQK45+CR1KZzpzua3t1XRZ073m4IHaslGQyCScbQVDhWCYbfy5cvt5BcpPqEawOxls3SMfAjdNSlcycys5f3HT5KNW+8qUgyf3q3ju64/Z9Mvxb2cpvEfn2bzLCf1SQT6Poy/YbcgKUISPVILYshLkCeO3eO1SdL0VdvrI8IFv9q2S/o7m/3UC8cUuLPH39Cd/ZPUcjllDh1f+rM+6Y1ElyavD3hn6nNiAd+1GRGj7OaZPh6ge714UgFeZdJqzrE6p
Mj0xLUCUK0LH00y1BAqjUvVtJ37oqjUT8cqrQ5Z8kKyp48gRL69Db8ItK+2mba1KlNRi3SUrUJHYUZl3Zt3Vu0ftuvqab2oOGX44rWIoBrBTiurjewGAelsnYe1Foz6l2SZ2J2b1rf3EW4szNq/Yf+vtm7JG7ZNh2t2UubVxXpbcOW8stLy6hN+46WhxmwZbA+b1RekDSbNQLnN06ePOmbbJZunXbYypqu/DcVzHPH3T+E2X3zvd9Rm7/+9a9NPeLj6OP6t12BHY4iV2zdxsGkozwbUj2yKiUwQokUFRUpaWY5ULg9k4uUN+PGjKL/eH23PR3obPUnWXNo3sL86xckzehyOvuNWJzPyFiJpvG2jKpHaj1yRkg1hMz/vkdcHJk5kmJ+BNdbgNG324AfirtLl66TjJnjyFYNCu3IY8jFK1da2Sy3pREBq9Qjte44eZsaQsZ/n794MbWjL2jR7OnGG7GgZtXeGtr75ru0varqRt4lKw7ymBkbXNfDJmUrYR7ckAfIzLt4sa7V6pEaBhDtkY4F6lNycrJacf69RgTwoUgdMpjeeu0lzQkYNTatq1ig2aM5Mp7bomnpeiMubBgBmRIFVwP0BqUy3OlXFd2czdLsu0WzfrRjypRve4Xeb7wkEjiWKzAEBRKPlm0GthikRkHoTc695NzylDnHzYbXNDtipCZGNkvEE2Yp1iyawh7iYFqj0NHiKgFCPBw59mZzbOggkoFhDqLWq2WrDd9l0gsRBgWCgUeJQwToRc94eUiu5eJLY1WgcOMjuV6T1SezCAbXxwcEnqZD2150VG36idjL8362KCj/Uou8S5jsrCmTaN8vn3NkcEqWguzZNHbcOGtR5tbCIhBN9UhtSlh9UkNI3+9xiXXlikJ6TQgNTmQgQZ77+Hv6iQDijwcNNGya2gM1NVRU8KRIClVsm0QDCWb6wmXU93sDaNLkySzF6Fs/hkq7RT1SG/yOHTuoSnglnMpmqTYer/4epF0oDOvvHDtKu14otU1ogLs676nV1Dn+21S4fEULuMKSDEopeZiyZ9DqJx63PF4obDAzFhVSbt4CGpqaquRihi4OTwPbZOxZ0m5Tj9TeUqpPixYtsjWbpdo4vPJ77Fdghp+6ujrlT3iacPAR983mP/ZT2lxSaCoMRDgs4BWeJFLUZkyaIrKMPBoWrlZJBqUxSORjSujdk5Y+lm04HITsGdLLuopttOdQXYsE3TK6PRbViBEjvDK3rh+nVEE6derkuasaUrVr376958bu1MLAvhk0aFCr3e3atUsJ+q6krBUJ31LuS6QFOVmmpRpIL/Aibd21n8peeDHiMYSIJCNHvrGsjEpWFtPUh8YoP3oDXIHtqvYeoPVbXqbMaVlK0qfWvAiIdF9fX6+wrxWZC52abDf24xdpwGtSmNNrYbbIU4YDjqEPpJjTp083/zdIu3TNs1Quys6d9rASprPDre10DRfkgr28Yl25YkddtDhfdZ9qIhmMAgNcWVxMlcLd2OHWW2jMsKHiSnl/iuvWtQXpQB1qvHCRzoiskK/sf50axN8zMjJoWWGRJhclWBcq1MCBA2nevHm6QODC1xHwm13DK/akaKw/7M2kpCRFRQp8pBQTOiZ4kYsKl9GWLZV0d8+7aPSwFEpO6qe6lw++eYLqjp+kjAnpVCyCxmsVAjSTTOBAof/h61J35A0hhjVSw4cfBr2HCOkp8ibFUWLSvZQuyMXoiU70UVlZSVChjLYRjUmPZp9eVo/UcHOzZ0xt7Hb+XhIwvEnACE+oFNNa/1C3cEbpxDtvq+7l4cKMYcSUYYhk7AQsLOuK27swCINs+LBW6+j7RT1SW19OX4FQG080fw+JFR9inHcCLshdjqc1KSYaY3U9yUhQwLjI64PYJkbYNBrg2tEnpMhwoRL8ph6pYefUZU61cUTr95BYYIuJE7euEUdGPiAZrJFAW0y0xij79QzJyAHHsmEYC6tHjx7KHSMZqFsutt4ivuticQM31h67wl
K4GUdIrCAYHPkYOnRo0FCxHkAybjIveI5kgKgMaA3DcCxtLAToxs1lqI7yS4XFBtJx06JyeoNCTcAVCXgk/Z5DHV6k2tpaT93z8iTJyEUsdfNYCBcgpRgQLB5sJqhNfKnw+mrQmknBaQK0qj8Z8MuLEqunSQYTiM2HsI74E2TjV8OwlGICFy1UJpAMPzcQaC0nlJcxkgdVvSqxep5k5OKRMWTVcgJ5cbGFSjGB7wCS4URqwbMamt3Si3MeaIOUOay8+gH1DcnIScEXH5Hxwfp+0c/DSTHyfaV9hsNkBFMJ1CfYr7yazRLqETxFGP+sWbO8zJPBQas8/SYBg8cCg9fBi/pr6ByESjHwJsAWg3cDsYR6F/wyh2beA5IMfhrPn6djbx6ja9c+o17f6Uk33XST+PD0oDjhoQOGuNPjxgdjhwkAt9D98PHwnSQTuGi8ZhgO3BwNDeeVV7l27Rpd++wz+k7P71A/sTHcvDmiuWFhtygXm3K3OL2a8N0+lDIgqfmY/G//8w/UM747/eM//IO44nL9ykv92d9T7dFjNHZMGqVPnOgawoFNqbGxUSEYv0Qk8DXJBBqGId3Axek2vTbS5gjctG7fHNEiGBypn5/7GF26eIFmTnyI0h64X/OlP1z2qz74Bu0U9+uuXPuc1qxdF7WcUPJwYWZmpnLPz0+P70lGThYOMEGFSktLc4Wh1C+bI5qboXBZAVVt305PL5irkIuZBylVEdkt8d7viQDYLzgqRejNM27mPaNRN2ZIRoKLoNU4zAR3d7T0Xb9sjmgsWCmdZmVOpV5d76CFs7IsDS2J1Kpbqw/Q9p2vaL5lbBQHeeET9fXmGTfaZzTqxRzJAGSIpjCsIZATTgw7pftiUflhc0Rjoco+ZfCluVMmUPpoe4KbnTz9Hs0pKG4RWM3K95aXWaOdKcLKd2qtrZgkGQmGPI6OibbbS+OXzeHEomytDydTfSDQGrJo7Nqzz/KjEJCmEcPYLZki7J7TmCYZKXrj0uXly5cVkdUOw7BfNofdi1GtfYSCnTJqmGn7i1o/8vf1IujanMJVdKTumCXSbms3p7WOx6vlYp5knDAMe31zuGFxw47VdPUTKsh19oyJGO4AAArESURBVGAavE9b9x2iXdV7TMGAE+kyLa/dUrOpgdpQmUkmBFQpysLdbYVh2Oubw4Y1p7tJJXPGzOkiUZnw+rRtq7u+2QpKbrCcRw2fpfHizWmzmAXWZ5IJgyaOdFuRpsXrm8PKhWamLaclwdCxQm2avni5kkZZz4N1hFAcCQkJMRWSJBQjJpkIqwYH5eCFMuoB8Orm0LOR7C6LU9CFS/Lp8Msv2t1VxPbnLH2GEgcNFrmFZmsah9dvTmt6SY2FmGRUgFLSSIhLl7gJC8NwuAjt8ByFXsb06ubQuG4cKzYxPZ1GD+4v3NUjHeszXEdwa+cVl9Lxk6dUxyGjN+JqgB2OBNUBuKwAk4zGCcG5Bkg1KSkpQbdi5QVGLKjAC3de3BwaoXCsGLDt0rkzXTh1OCq2mNAX7Tnkx4qnqbXb/TJwlh9uTls5yUwyOtGUAbtxYhiXFeExgKSDhffBBx8ork4rNseHFz+ikuc3iSx9O5URjhJ5rnZvWq9ztNeLq20OQ406UAnnmEpXFdP+ig26e/t6fN+gOjmTJtCiR2dQ965ddLclK8xZ+jQlfj/4IyN/J29Ow2EQLtC74U59UJFJxsAkSsMwyATeKPng9DBUKjObo3lBL1lBU0S2zuR7E5T/unL1b9S+3S0GRksUaXMYatCiSvjyR0oQNntWDvXt3olAEHofkMyXDe83Vzvyzkk6cvwULc+bq7ep5vKtubPhJMC7+OnmtGGQwlRkkjGIJggGOYjhQZIPpJjjx4+LNKBlhjeHbCt0kxgcplLNqrMeZsYQrm5qaqpis5BSYWgZM4bzcPiZxTTULiNvTiMaI0cnbH11MMkY3Dkw7uELFvrgoBXyC5s9mWp2QwSOS4/R0iAchqqBZKBm4I
E9K5RsBg0cQKvz59FAkUJV72MHySDcxrCHZ9J5Ee8F2U3Xrl3rm8BSevHVU55JRg9aX5WFN6lPnz7NKUFDm/j2XT3opdVPGdocdkgygZvDwOvaViWQZGQngWTTQyQuOyRc1/Hd7tQ9BjtIBoNAuzI/u59vTusGPEIFJhkDaIJkoCZJVam+vp5gp4EKhfMRX/va1+itXdvovn7fNdD69SpWSjKyPUgKbnqQXhVYhnugftT8Zj/92/6d1Lnj7bqHbSfJuCkFrG5golCBScYG0Lt07kTv7qsytDnskGQkyRw5csSSt4UnzYog7eEkGdhocPgR0kJqyhDavLKAEu65W/e47SAZ3MzuPyqdPrp0Wfd4YrkCk4wNs5+U0M/w5pDDmWOhd8mtmyOQZALJRR5gGzn8AZo76UEanvID3bNkh3fJ6PUC3YP3WQUmGRsm1MzmkMOx8pyMWzcHSAbqJVJ+hEvMlyXi3Q7u14syx6fpniU7zsnU1r1F67f9mmpqD+oeTyxXYJKxYfbNbA4bhkNu3Rw4YzRixIhWz8rg90dr9tLmVUV2wKK7zeWlZdSmfUdCRgF+tCPAJKMdK80leXNohipiQRjTe8TH0cf1b1vToMlWYI+p2LqNT/TqxJFJRidgWorz5tCCkrYyMP4ufTSLUpLv01bBplJuPQZg0+ta2iyTjKVw3miMN4c1wCLg05l36mjDiiXWNGiwlZKyzXSV2lKxOITJjz4EmGT04aW5NG8OzVCpFrTCW6faSYQC8M4Nm5SthHng0A36kWSS0Y+Z5hq8OTRDFbEgjvBvKX+eXtv4rDUN6mwl76mfU/w9iTRP3LjnRz8CTDL6MdNcgzeHZqhUC0ZL/YQtBqlREHrTqfxcqmB4rACTjM0TxpvDGoBhTE8dMpheLVtt6C6TkVF8+j9XFYKBR8mKoPJGxuCHOkwyNs8ibw7rAEZ0wqwpk2jfL59Tbrrb/ShZCrJn09hx4+zuytftM8k4ML28OawD+UBNDRUVPEm/Ki22TaKBBDN1/hJKHT6CfrZgoXWDj9GWmGQcmnhsjpzsGcIV+ySNTB1sS6/YHNMXLqOxEzJoWtYjtvThhkaVVDMCy9VPPG4qnEa4d4ENZsaiQhpy/zDqcuedQfGc3fDuXhwDk4yDs3b//ffTm8eO0aafr6DJD462tGe5OXLzFsSEeI+odIicl9C7Jy19LNvUjXdMBAh6XcU22nOojipe2qrcqUJo0MDg8JZOWAw1xiTj4GQj0BVUp6+LeDNjR/yISgsXWb45Yi2I9UaRJaJkZTFNFfGQ8aM3wBXOwFTtPUDrt7xMmdOyFDc1zsIgQHymuKAZa3jasR2YZOxANUybCGj1jW98o/k3t912G32j7T/SI+kPWro5HHodV3UDbFeKAO6V4kJlh1tvoTEis0NKcn+K69a1BelA4mu8cJHOiKyQr+x/nRrE3zMyMmhZYVHQQTvcELcq/o6rwIrCYJhkHAIddoSkpKSg3uJEeMmxaWlULQ6bWbU5HHod13YDnHE+qe7IGyLqXiM1fPhh0Fjju3cXAbfiKDHpXkoX5JKcnNziXSBtImofwmvyYx4BJhnzGGpqAfmaJk6c2KIsIszhiwlXt9nNoWkgXEgVAVaVVCHSVYBJRhdcxgsjswEyHIQ+0PkXLVqkiOz8RB8BxBwuKSlRshDwYw0CTDLW4KjayjhxoAuSCh6I6DhB2lpubdXGuIBtCGCeQDCRks7Z1rlPG2aScWhiEU0NxAKXKO7A8GJ2CHgd3SDYGNRWmfJER1UuGgEBJpkoLQ8YF6H714hDevxEHwGeD/vmgEnGPmxVW8aXE4fKkEObn+ghABf4yJEjafv27awm2TANTDI2gKqnSahR8DBxLmU9qFlXFgSTlZWlGN/54J11uAa2xCRjD666WmWi0QWXZYWZYCyDMmJDTDLO4KzaC9zbV65c4QNgqkhZUwAGXpxbgoePJRhrMG2tFSYZe/HV1TpsNLW1tVRRUcFR2HQhp6
8wTgXD6A5XNQej0oedkdJMMkZQs7HO0aNHqaioiDeATRjj5DWuDMDIy0HBbQI5pFkmGWdw1tULPE4wRqanp7NBWBdyrReGegTpBYfskBKX4/VaBKyGZphkNIAUrSKw05w7d47WrFnDX10Tk4DYMJAOQS7hLkSaaJqrakCASUYDSNEsgkNiuPeUJm5rs5tb30xAegG54AHBsHqkDz+rSjPJWIWkze3AKFxdXa14Q9hYqQ42kuvBiA5yYe+ROl52lmCSsRNdi9uWX2b8iahtQ4cOtbgHbzeHcy+SjIEP32x3x3wyybhjHnSNAiSDL/XJkycVson1OLTAo7S0VLFfwVge63joWkwOFGaScQBkO7vAl7uqqqrZZhNLXhPEflm7dq1yczo3N5fVIjsXmom2mWRMgOemqgcOHFDOf+AeFKQbP9ttEJcH9hY8uHOEd+bHvQgwybh3bgyNDN4oHDiDKgWigfrgdbctbC0gFhi+IamlpKQoKhF7iwwtEccrMck4DrlzHQYSDr72w4cP94y9AioQpDMQCx45diYW59aPVT0xyViFpMvbgf1CSgMgnIEDByqSjls8VCAV3CnCwTlIYXhwNoglFpcvLA3DY5LRAJLfioBwIOXIDQ0VBBJC7969FdUKP3YakHFtIpRQZP8gPZxrYYnFP6uOScY/c2nqTQIlifr6eoIdBP8HaadTp07NbQdKPrgHFGhgBnGgDh7UB4nJp7FR5EAS5AbygiQFQgOZuEWSMgUeV46IAJMMLxBVBLSQBwijffv2mshItUMu4CsE/h+B+/O/Kq7K2QAAAABJRU5ErkJggg==" /></example></p><p><example picture="">Hearts in A and C must refer to a token defined outside the loops; that is, a
token defined in E. The resulting program is ill-formed because it has a closed
path that goes through two hearts that use the same token, but the path does
not go through the definition of that token. This well-formedness rule exists
because the rules about heart semantics are unsatisfiable if the rule is broken.</example></p><p><example picture="">The underlying intuitive issue is that if the branch at E is divergent in a typical implementation, the
wave (or subgroup) must choose whether A or C is executed first. Neither choice works.
The heart in A indicates that (among the threads that are converged in E) all
threads that visit A (whether immediately or via C) must be converged during
their first visit of A. But if the wave executes A first, then threads which
branch directly from E to A cannot be converged with those that first branch to
C. The opposite conflict exists if the wave executes C first.</example></p><p><example picture="">If we replace the hearts in A and C by iterating anchors, this problem goes
away because the convergence during the initial visit of each block is
implementation-defined. In practice, it should fall out of which of the blocks
the implementation decides to execute first.</example></p><p><example picture="">So it seems that iterating anchors can fill a gap in the expressiveness of the
convergence control design. But are they really a sound addition? There are two
main questions:
</example></p><ul>
<li>Satisfiability: Can the constraints imposed by iterating anchors be
satisfied, or can they cause the sort of logical contradiction discussed for the example above? And if so, is there a simple static rule that prevents such
contradictions?
</li><li>Spooky action at a distance: Are there generic code transforms which change
semantics while changing a part of the code that is distant from the iterating
anchor?
</li></ul>
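As background for the satisfiability question, the well-formedness rule from the first example (a closed path through two hearts that use the same token must also pass through the token's definition) can be checked mechanically on a CFG. Here is a minimal Python sketch; the graph encoding and function names are mine, not part of any LLVM API:

```python
from itertools import combinations

def reachable(cfg, src, dst, banned):
    """True if a non-empty path src -> dst exists that avoids 'banned'."""
    stack, seen = [src], set()
    while stack:
        n = stack.pop()
        if n == banned or n in seen:
            continue
        seen.add(n)
        for succ in cfg.get(n, []):
            if succ == dst:
                return True
            stack.append(succ)
    return False

def ill_formed_heart_pairs(cfg, token_defs):
    """token_defs maps the block that defines a token to the blocks whose
    hearts use that token. A pair of hearts is ill-formed if some closed
    path runs through both of them but not through the token's definition."""
    bad = []
    for def_block, heart_blocks in token_defs.items():
        for a, b in combinations(heart_blocks, 2):
            if reachable(cfg, a, b, def_block) and reachable(cfg, b, a, def_block):
                bad.append((a, b))
    return bad

# The first example: E branches into an irreducible loop over A and C, each
# of which is also a natural (self-)loop; X is the exit block.
cfg = {"E": ["A", "C"], "A": ["A", "C", "X"], "C": ["C", "A", "X"], "X": []}
print(ill_formed_heart_pairs(cfg, {"E": ["A", "C"]}))  # [('A', 'C')]
```

If the only closed paths through both hearts also pass through E (say, every loop iteration returns to E), the check reports no violation.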
The second question is important because we want to add convergence control to
LLVM without having to audit and change the existing generic transforms. We certainly don't want to hurt compile-time performance by increasing the amount of code that generic transforms have to examine when making their
decisions.<p></p><h2 style="text-align: left;"><example picture="">Satisfiability </example></h2><p><example picture="">Consider the following simple CFG with an iterating anchor in A and a heart in B
that refers back to a token defined in E:</example></p><p><example picture=""><img alt="" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAASIAAABjCAYAAAA/zQ3LAAAYeklEQVR4Xu1dDXRWxZl+v7PrEbeAoJSf4iHBbeXHhQS1QuQnUK1Iy0+QlRiymkQkCVJMEApECkmskoQthSg2xGohUhqRUgjCBpQIafkL9RSiVciebSEeKGzXVru4Fc7pOdn7DJ1wc3O/L9+dO/d+93535pwcNLl3Zu7zzvvM+74z806oTSukSswQOHfuHOEnMTGR/Rw6dIj9oGRnZ7Pfbd68mT3Tq1cvKiwspEuXLlFGRgYlJyfTjBkzaOLEiR3ewzN49vjx43TlyhXq1q0bjRkzhv03foeCd/GMKgoBLyAQUkTknBg4EXClnz9/PiOR1NRURiifffYZrV+/nnUAZIIfTkxWyUL/HkgH5GMkIrP2QHp1dXV06tQpqq2tpf79+7M+gaSGDh3KCEwVhYDTCCgikoAwLI3y8nJqaWmh0aNHM5KBYu/atYsp80MPPeQr64MT2JkzZyg/P5+RY05ODiO3ZcuWdbDAQLL4PvxNFYWAKAKKiCwgB4sClgJcpcbGRqagBw8eZDVAeYNkPXA3DxYeJ1q4iyiwqtatW8csPmAEslJFIRAJAUVEYdCBgkGhoEhwqbg1ECSysas63CpsampixATr8I033mC44r/xryoKASCgiMgwDkpKSgiKg1m8rKxMjRLJCICcEJdqbW1tt5o2btzIrEkVQJcMto+qCzQRweqpqKggxEJ47MNHsoubrsKtxQ+ICAH7ffv2saA+XD5lNcWNmCN+SKCICG4WAsgIKldVVbHlbPwOboIq3kEAJASrCXE4uHRwi2E1gaSUrLwjJ5k9iXsiAtFgIGNmRXwC/52WliYTQ1WXwwhgwsAEAmLCqiT2V/G9Vw43rap3CYG4JSKs2GAVBxsCCwoK1Ezq0oByqxm4bzU1Nay5TZs2qe0DbgHvUDtxRUSwePbv38/cLrWvxaER49FqYTVh4sFOc1i8ate4RwUVplu+JyK+twckhMH46KOPKhLy1xiU1luMBb6JFKtwyn2TBq3jFfmWiEA8OJqA4xLY/auKQsCIADaewn3LyspicSVVvIuA74iIW0CY+dTRAu8OLC/1DNszsNqmX7jwUv9UX3yyoREuFw5iYqNhcXFxYI4M8M1/WC2CIiH+oXZ2W1NbPYYDBgxg+5MSEhLYZlUVR7SGpZNPe9oiAgGh8JPkQVFCfDeOlSDgytN8QKGwUgRSwiqR2ugXWS0iYQgc4doDXzynAttOUkx0dXuWiLCBDTEgfkYpus/x/1P8bBssPzPi5X/HjK4Ok5rLO1oMlyxZQhs2bKC8vDzm5qsSOwQ8SUQwnzFrBS3AiPjXzJkzaefOnRFnaTyHtBwgaeyTUuU6AiIYIt6IBQ/lqsVuJHmGiEA+RUVFzHcPqtsBEsL3R3OMAXiBjEBaSoGuK5AdDHGshGfKjJ1KBrNlTxARN6WD5obphxxO/cPVsnL8BAdFsTyNDZyqENnFkJO7cnvdH00xJSIIHoHCoM/o3BpEENpqQWZIWFBWCMxqG354XhaGCF4jLBB0PN2WecyICH55dXW1WgHSJI4VMgSnRVxSKA7ckfr6erfHjqfaUxh6ShyWOxMzIoIZvXz58sBbQ1iWh3sFt1S0YI8VYhtBncWdwBCWJvCMJl4nKjf13nUEXCUizN4YNEHZDxTNQMNBTZCQiDXE6w+6VeQEhnDPsCMbFyGo4jwCrhERT8sRbn+M85/qvRZ4hkg71hD/Kqw4BnHntVMYYrzirJoiInf0xjUiwk5WWEJq38t1wcI9hfkvY2MiZu/KysrAraApDN0hCqdbcZyIMLNgVSzoK2NmgpwyZYrUIDPqC9q+IoWh0xThTv2OEpHaARxeiE7
EIHAsBoQflB3pCkN3SMKNVhwlokWLFrFcMDJcDzfAcLMNLDfj5hCZriqIH/Xi6uggFLsYIsiPkAG/QIGfNzNiCMLD5lEsKMCVtrOwEAS5iHyjNCKCUI3ul9nvRDoZj+9MmjSp/ZZYWd+HFcnMzEzq2bMnUxx+iSHIDosE+BdKB8KKh8nBDoYgFxyRQbCbF+ADEscVU3Bx8TesyOHoBy8Y49jJHhSrU9bY7KoeaUSETXUQXHp6Ops14mWwdwWgyN9BBqWlpWwzp6wC6xOze58+feiRRx6hCRMmsKohE+Txxt94JkvIRmbbsr7BSj12MARhp6SksBQgxgKiwWn8OXPmsH/xrFkBYSEtsSpyEJBGRFAEbKzjBeYrBAlhqU1hHYUlcwMiP22OfzGLQ0GhPMZlZ70FgNnc70RkB0NMmtjZH67AWuzXrx8j8HAF4/vs2bNqEUYOD8m7chqBUvjWZgWHCLGLWpVrCMDch3kvIyEXCIbnbcJsjlkeLocxToTfYV8MCtw0LHv7uYhiCHxuuukmKZ9+8uTJuHBxpYBhsxJpFhFiEjB3jSUeZl+bGHd6necckl0vr0/2krZT/bRTryiGsBx79+5tp+n2d48dO6ZOCUhBUmLOajMBKxK6JiVYInCZUGDS42wZzzskugID1wIbGGHd4CpmfYH1EymvE+JGcKX9tLomE0Oeu9quDn366adSrFq7/YiH96VZRAADMw0ICUWR0PXhgZjNqFGjTMcL3LPTp09bWhKGWwIigpuBeAZcBH3BgU2QU7gzfdw9uXjxoqV2YzngZWII/BC8t1PMcLdTX9DflUpEcM3goikS6jysODbGvyBWZPVeNiwn8yMdfAVMn3MZJAWrJ1K9w4YNY4dt/ZSrWRaGkUgtWkLAwkBQsx1Ei5GV56QSEYLVmG39viJjBcBonzUL5sNiQZyBFxAMflq11Zhz5852qDoxcTAlDB7cnsWRXx7Ib/TQ7+ECSXH3L1z/EGPBIVk/7YeJBkN8bzQ4NjQ00OHDh6MVX4fnlDUkBFvEl4SJyEzYn3/+f9S9+5e0jXMdlUZ+t/1XIywXxCb0VySBhPD/1ZpVtEu7sSTpzmGUeu8oSrhtICXe9pUOH3nu/B+o9fwFaj7zX7T/0C9p/Nhx9JFGOOfPn+8EBr8HHrN2uAJXEXEkP1lE4TAEMcASt4LjieaPGI5tbW2WBhMI/+DBg4ELUkdD7lZTHeuBt0REVoXNlSZt+gxK1/zyoJuy+iV0WI8XL5ynS9rPvIxZNOPBb1Cvnj2iUoorV69Sbd1emre0mCaMG0uVL23otIwcaeWMLywgNuW3PV56DLElBHvVFhU8LYTj9r37KWPBkqgw5w/Fw9aHaD/YTX2Pioiw4iMqbChN3dvv0pt736G//PUKrat8MbB7LzCr4FgCAtT9+txKq5c+zQhItIydmUl3j7iTjv6mmZLvupuqql9p32CHgGy4VTG4dVhxMwa5Rfvh5nscQxzHyJyTQTu2b6cXvrtQGMeKqtdoRcX1jbiRvgUTaSQr000cnGwrFvreJRGVFK+ibdp2djvC5qA1Hv81LX7+B52UxklQvVQ3XCYcwQD5vFJeTN1uvNFW95pOvk/fzsqn3zbspv2NR2hL3T6qfXM7WwkLZxEhiI1ANWJLfrVQ2ZU/CYMo9e6RtDQ/xzaOC1Z8n6q3vhlRFnA74ErHezqbWOl7WCJiO3SzHqc7BvaVImy9lGt+XtdBaWxpo09exiyTkT6bpk8aR0vnPyGt17NyC+iz/71MO16ppJbfnaUFq8po0+tbmMXDFw1gRWA5HzIFQUGZ/JpsHzhOnzaVludlU/o0ebezzl3yPcK4NCsgIcSFZOyElyZ4yRXFWt9NiYgrzcLHZksVth47zOZcaeLhJHikcQEhT0qdQC+XLKWk4UOlDiG4vnOXrKSmk8302g+epyH/PJimPvEd6valHmwGx1I+4lHYJoB/gbVfk6c5iSOEYkZGw4cPpyNHjsQ1CXlB3zs
RkdPC1mvhpf/5hCnNzt17pOblkarpEiqbOWM6PTb1AeE4RjRdeG79j2hN1U/Yats3x99Hxz44Te8caGDHbjDQYAUVFBT4+uYUt3B8bv21CyvvuzuZ/hb6B2r81eG4dcm8ou+diMgNYesVq/mjM7SgZA0dbPxlXAobPnfb5T/TqoL8aPjE1jPtCwNv7aP7x42h/zjyHp1qfp+5ZAhc+2mp3giEmzhue6ue3v+ohV5YVsgWWrbsOUA763bbko1XX/aKvncgIjeFrRdMvAobO3jnz5tLB7Zqq1k2A9MiA3lW/jOU+cQ8+vDDD6mlpcVXZ8v03+sFHLPznvJtcD/c2PGSvrcTkRK2iKpHfsft2cbYG1ibc5c/Ryc1q8jPReEoX3pe0/d2IlLClitsrFSVrCiihp/9WG7FFmtbsHI1JaeMp/wwuaIsVuf64wpHZyD3mr4zIlLCli/sDC1l7rTx92irjlPkV26hRqxOLi5bT8eaTlh4yzuPKhzly8KL+s6ISAlbrrARHB6gbSo8f6LBVmwo7cmFtOfAIfrbuQ9sdfCrE77FFgNk3hhiq0NRvmwXx39MHNGhpcO/+CmNuSspytY7P+ZXHI1f4kV9D33xxRdtdpTGKGx8tB3F8YuwI91QgvSt69eU0d5NLwsP+o8vXKSni1fTwH59KfffZlPSsCHCdS1Y+QIl35dqOd2IcINRvohtBZHI0S6OGJt8LP7xkz/TV+5JtTU2vYpjlHCzx0TJ/T9/f46Gf2Ma/emDY3Rzj+6srubTLVS8dgPVrCtr/52VvuBZru8hbYdtmx2l0QvbaifMnveLsHGSHveSIeePcdv//Pw8GjGoH+VlzhaGBEvIKH1vvYV+13qentQOxooWr65KDv57WhMcJDXb1GoXR+PYtDtWvYqjcVzgYgGzcYnn7JD7q7U7qEf3f2Lhhr9c/pyyFhXRmmcX0x23J4oOTeL6HsrPy22zozR2hWv8Ar8IOxQKsa5j2z82CuLWDH4EQEYgEG4ZhIxl/9vHPmhrJvdqnAhExFPo4tybkZDs4qgfm7Aw4xVHow5hbOK8odlEaYfcQT63jkih3x95m4795hRd/vyvtiZI9Jvreyht+rQ2O7t+ZRORV5XGTNj638EqwiwE4UOB1hYV0uhRI4VmCqPSgJRKF39H2D1DLqMH5syjs62tQv1x6iU9EfE29ISUMvpeWzgawwZTH5hIL5Y+S4MGDhD6JK/iGGls8ss1eQI8u+R+XMv0sGXHbnZIWO+mCQGqvcT1PTTm3q+32VEao7CfXZhLzy1eKNovsits5NoJdyleV52C/4wcLNEUXJAYrvS6+WZ6b++bnZKbRVMvntnTcIgu/fFP7bMN3DS7sw/kZEyyH21/nHoOWJtdcoj2QOr1e/dSQ+2rwjgaJ0nEOZauXku7Xn1J+JOMOCKfE+5AEymwWkTzQcH6DndGk1vr+j6BkJAauKJstS1yR53AYGNZiW1rCHVxfQ8lDhrUdkDb62LMCBgtsLItIv6hVpQmNTU12u52eg7pWkVSOyCvkLFwc7iivIze27ON+n+5j1C/zNJSYDa3q0BWsxEKdd7CS2YWkd6lGJyYYAtHs7Fpd7zi/Ug4RiLXrqBBil/9FdhdPa//e6tm7XI3V39FtrEOGZNk/bu/YhbRH95rpL59brHSTdNngWmof7++bXaUxq5gzXrWlbBtf7mECvSzDmYnuGSIE4HURiWNpNfKVwmdtOerO0azF5jANxdxK3C4+J6p6XTx0n9L+HJ5VeiJyCymYQdHPqHpV3D5SqQooXsVR6NEzCwiuGYYo5MmpgqTOx+bIKAT2m79E6c+sOX98H4zIkoeOaJNVGnMhG13mPpJ2CAdkA8ErM9VM2Xyg7Qw82GanDrOMhxGt4xXoF+xsFqpV496gIjgmoVbfbSDIx+bRqzs7CXyKo7hiAjjE1e+YxGAb5OwQ+7rflxDX7s9gabeP5E1Ccvd7tYSru+hhx78Zpuo0oQTtp19RH4RNi4
ohAKZXZCYk5VF40feQVn/OsMqZzjy/P7Gw/TS1l9Q/f63HalftFLcLYbgdDjXWOEohiyu1OYLJ8bxKUrufM+Q3po021tktcdc30PZjz/eppTGKnyRn0dO6EP1b9Fra8IHtOW2GLk25NcJ3fxl3913r3AUGyWIM4W7Qdir5B7S0om2KaURE3i4t7Byh0DrJ81H5FYsWBviQ5u2bPXdpQUKR0GBR3jNq+Qe0u7vblNKI1/gSA278qkcSh3zdfmVW6jR7nYIC0058qjCUS6sXiV3duhVCVuusFEbbiU9dbSRXv7+CvmVW6gR1+VcphuprLzcwlveeVThKF8WXtR3RkRK2PKFjRrtrFDI6BFWJB7IzGUpQPx8A4XCUcZouF6HF/W9PTGaErZcYaM23KCxufpHtGPjD+VXHkWNi5//d0ocnkyF2gqfn4vCUb70vKbv7USkhC1f2KgxVmYwYkO4IQVpYkV2jjuDhnitCkdx7Mze9Jq+d0ier4QtV9ioDcHBSRPG08+r1gofo7HaK1y4CBLCSpnoWSarbTr9vMJRPsJe0vcORKSELV/YqBFniHIey6StleWUIHjy20rPcHtHdu58Sps508prnn+W47jnJxuoV88ejvYXZJ6xcCk9VfBM3OHIgfOSvne618xNYQOQeFUao5bsq6+nJ3KyqXDuY7Q4N9sRJYLyzF1aTGmzH6XsHHnXWjvSWcFKgWPpqu/RT9eXOWZhAsdpmkX5teF3Uk3N64I99cdrXtF30yun3RJ2vCuNcSiOHTuWjh49SisW5rH8QjILYkJPLiuhgsXfjdsZnOPFrsLJfZLWPvuMcM6ncNhzHIf+y0iaoR0/8fOllNGOLy/ouykR4QPcEHYQlEY/GJBelqd5uF+7Frrmhy8IpwppN6+12fvFTVtp94FG2vT6Ft/tno5WWYzPAUck+Uoa8lVa+XSudBxramooLy8vbmJsXeEca30PS0TouNPCDpfYqSvQ/Ph3+OO9e/du7/oNN9xAt/TuxfJaPz5rumU3A3uEtmlXS7+0+WeUlZ3Dluj9vFdIVKYbq6oI+Z+AoUwcp0yZQlo+d9Fu+fK9WOp7RCLiaDolbF9KS7DTSJiVkpLS4W0sq896+GE6cviwFnztTtO15GepY+6hhNsGdiImuAyt5y/QKe321u1739Ey2124luKhpDSQBKQHEqlEysvKqEY7bCwDR8gKSeZLSkoEpe3v12Kh71EREWCVLWx/i8p677Gbdb7Jbasgo507d7LT0tjb0XjwXS3TnpZt7+OPOzSiZdLUcsokUPKouyhdIyBkllSlMwJwMeziWFRURFlaKpd42fogMk7c1veoiUj/MTKELQKOn99B/iJc86IvGOgIhmLQB8lN9bIcMbYRH0J+Z1WuIeCGvgsRkRKQdQQQc4C5D0sG+bgx46jBbh1Hp9+Yqe29qtLiTuHy+TjdflDrV0TkkuThLoCE+ACHmwZLSLlYLgkgimZgsSLgz6/eieIV9YgkBBQRSQLSajVYRcPsi/hQEFe7rOLl9PMIUFdWVlJtba3TTan6TRBQRBTDYYHBX11dTVqWzBj2QjWNSSEjI4ORkJoUYjMeFBHFBvf2VhE3qqurY3EJVdxHACSUk5PD4nX8pgv3e6FaVETkgTGgyCg2QlAkFBvczVpVROQRWYCMuJum3APnhYLDnlgwgFusLCHn8e6qBUVEXSHk4t+xXwP7jZRyOAs6rmQuLS1VCwXOwmypdkVEluBy/mHcX46YBS5vDMLJb+cR7dhCuXaJQHNzMyP7eMhc6TZ+TrWniMgpZG3Ui82OOGaAUqadoVIKYwPMv7+KA50g+PT0dLVPyD6c0mtQRCQdUnkVIm5UUVHBVnTUERBxXHG1NT+2EeTzY+IIOv+mIiLnMbbVAlZ2EM9AKS4uVvtcLKCJgDSwGz16NBUWFlp4Uz3qNgKKiNxGXLA9BLKhVJMnT2auhXLXwgMJNwyWJFxcuLZqFVJw0Ln4miIiF8GW0RTOrG3
bto2GDBnCZnmlZNdRRaAfZA2SRnZF5c7KGHHu1KGIyB2cpbfCj4eAiAoKCgK9F4afE8OB4qBjIX2guVShIiKXgHaqGVgB3A2BEgbJCuDWYVJSEuXn5yvr0KlB5kK9iohcANmNJhDURhoL7JFBHClNu4EiHnPqIACNVbCmpia2FI90uSpe5sYIc7YNRUTO4huT2rFzGHEkWEuclPx8jIFnCAT5YPldZbSMybBytFFFRI7CG/vK9aSUmprKLCU/7KXhKVvxL9xNRT6xH0tO9kARkZPoeqxuTkpwb0BGWHmDkiN1bSwL3EoQDvrX0tLCrrFS5BNLibjftiIi9zH3RItQdig/VpwQV+IFgV+krwUROBFjAgmiXfzb2NjIAsz4ASnydtWWBE8MEVc7oYjIVbi93xisEpAELBMQBiwnkBasFhS4d7yAqPRuHp7nz129epWRHAqeQ0AZMSuQDUgH76l83d4fD271UBGRW0jHSTvYrcwJBp8EcsEPLCi9JRNrdy9O4A7MZ/w/1tm1ttKLNPsAAAAASUVORK5CYII=" /></example></p><p><example picture=""><picture -="" a="" b="" e="" x="">Now consider two threads that are initially converged with execution traces:
</picture></example></p><ol>
<li>E - A - A - B - X
</li><li>E - A - B - A - X
</li></ol>
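One way to make the conflict between the rules concrete: converged dynamic instances must line up as an order-preserving matching between the two traces, because converged instances execute together. The heart rule pairs up the B instances, and the iterating-anchor rule pairs up the A instances positionally, which forces crossing pairs. A small Python sketch (the trace and pair encoding is mine):

```python
def valid_convergence(trace1, trace2, pairs):
    """Check whether 'pairs' of (index-in-trace1, index-in-trace2) could be
    converged dynamic instances: paired instances must be of the same block,
    and the matching must be order-preserving (no two pairs may cross)."""
    if any(trace1[i] != trace2[j] for i, j in pairs):
        return False
    return all(
        (i1 < i2) == (j1 < j2)
        for i1, j1 in pairs
        for i2, j2 in pairs
        if (i1, j1) != (i2, j2)
    )

t1 = ["E", "A", "A", "B", "X"]  # thread 1
t2 = ["E", "A", "B", "A", "X"]  # thread 2

# Heart rule: the B instances (indices 3 and 2) are converged. Iterating
# anchor rule: first A pairs with first A, hence second A with second A.
pairs = [(0, 0), (1, 1), (2, 3), (3, 2)]
print(valid_convergence(t1, t2, pairs))  # False: (2, 3) and (3, 2) cross

# Dropping the pairing of the second A instances removes the conflict.
print(valid_convergence(t1, t2, [(0, 0), (1, 1), (3, 2)]))  # True
```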
The heart rule implies that the threads must be converged in B. The iterating
anchor rule implies that if the threads are converged in their first dynamic
instances of A, then they must also be converged in their second dynamic
instances of A, which leads to a temporal paradox.<p></p><p><example picture=""><picture -="" a="" b="" e="" x="">One could try to resolve the paradox by saying that the threads cannot be
converged in A at all, but this would mean that the threads <i>must</i>
diverge before a divergent branch occurs. That seems unreasonable, since
typical implementations want to avoid divergence as long as control flow is
uniform.</picture></example></p><p><example picture=""><picture -="" a="" b="" e="" x="">The example arguably breaks the spirit of the rule about <i>convergence
regions</i> from the draft proposal linked above, and so a minor change to the
definition of convergence region may be used to exclude it.</picture></example></p><p><example picture=""><picture -="" a="" b="" e="" x="">What if the CFG instead looks as follows, which does not break any rules
about convergence regions:</picture></example></p><p><example picture=""><picture -="" a="" b="" e="" x=""><img alt="" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAASEAAADFCAYAAAAbv0Q7AAAgAElEQVR4Xu19C3RV1bnuT29PtaeiYBVEuCSxyEOPhFgUqIWAUh4tEBAlRBQSEJLIIwHkERSSgBiC0hCphIgYIiKCIITHCUEUAiKBdgjBjkJbjyReELAqWnoUz+04ueubnpW7s7OTvR5zrrX23v8cgyGYNf851zfX/PLPf/6PFnVaI26MACPACLiEQAsmIZeQDzLsyZMn6eDBg1RZWUldu3alhIQE6t27tzcny7NiBGwgwCRkAzwVXa9evUrp6enUqlUrQTz9+/cnENLevXsFIZWUlNAtt9yiYmiWyQi4ggCTkCuwBx60pqZGEFB2dnZArUf/eV5eHvXo0cNDM+epMALWEWASso6d1J5fffUVjRo1irZv3y60oKYanktJSaGCggKKjo6WOgcWxgi4gQCTkBuoBxgTBAQNB/afYO3ixYuCiEBY1157bbDH+eeMgKcRYBLywPLk5OSI49XIkSMNz6aqqopKS0upqKjIcB9+kBHwIgJMQi6vCrSarKwsYXA225YtWyY0JzPkZXYMfp4RUI0Ak5BqhIPI1w3RVm68cJOGY1x5ebnLb8HDMwLWEWASso6d7Z64eseRCkZmq23lypXCQM3akFUEuZ/bCDAJubgCSUlJgoCsaEH6tFkbcnEBeWgpCDAJSYHRvBDYgvLz821pQfqosCmxR7X5NeAe3kCAScildcCNGI5QMpwOz5w5Q4WFhXxT5tJa8rD2EGASsoef5d5Dhw6ValCGPPYbsrwc3NFFBJiEXAAfcWDQXjIzM6WNvmbNGuG4mJycLE0mC2IEnECAScgJlP3GwLX8vHnzpIZdIJwDcjdt2uTCG/GQjIB1BJiErGNnueeAAQPowIEDlvsH6ohbMkTcx8XFiYh7GL711B8ZGRnC/oRnkB5kyJAhUsdmYYyAHQSYhOygZ6EvIuFzc3MteUg3NRwIB06LkA3b0Pjx4+sfraiooPXr19fnJIIB++zZsxZmzl0YATUIMAmpwbVJqbKdCxFDBgIaO3asuKaHA6S/rQlHNZATnoVjI5OQw4vOwzWLAJOQwx8IHBQRdNpcug4zU0L8GJwdYZDGcQvR9f52IRzBcAREw5FN9lHQzHz5WUbAHwEmIYe/CT1nkKphZV/9q5ony2UEdASYhBR/C7DHwFaDBo0FsWJ63iA74Rq4ki8rK2vkGwRNCPKbkw0Paxir7YyvGDYWH0EIMAkpXmzYaHBjFajhSHb69GnTZNCnTx9h+8Hxa/78+YJ09IbjGY5czSXFR3+4CHDQq+LFZ/GGEGASMgSTvYew6WEU9m+wDaWlpZkSDuKBLGhBICL8/dtvv62XsWPHDnE935xcaEtdunQRBMaNEXAbASYhB1YARyc4Evo2aCpHjx5t8P9gQMafWu0Kvaam4TV6dHQMRcXE1GdghLy2bduKWzHflLDwxNaPfE29GuLWLl26xLFmDqw9DxEcASah4BjZfgJX5O3atRPHJzSEV4CAELwKTaZY04h2aJpN7J3dKP7eOIrq0J6iO9zaYNyac59S7bnzVH3mI6o4eIi6de1GbTQSgmOib8MYuIFDHFlTjTUh20vKAiQiwCQkEczmRGHjw0iNhmNQamoqzcyYQRfPn6PJSaMpYdD91Or6loZmc/W772jxyiJau2krxXbvTgWFLzSIxg92QwYbFcoKsU3IENz8kGIEmIQUA6yL13114Cw47pEk2vbmm7R0znRBPlbaV3+/Qh3uGUCFuVlU9NoW6nH3z6mo+CWhZUETaiqGDDd13bp1Ew6LfDtmBXnuIxsBJiHZiDYjDwQUHdWR4n/eneampdC111
xja/T8onV0qOoPtKe0iEq3ltGGsr20acubwmGxqbzTcFrEMdBOSllbk+bOjIAfAkxCDn0S0EBGDB9G81OTKXG4nABSHMs6/XIITXgogZbOy6RjJ07R1EV51DHmNnF7hgZDNVwBoPXAIA2jNexRrAU5tPA8TFAEmISCQmT/ARiLB8T3oxdz5lLsHcGLG5oZEQbrYRPShNwVi+aKrl37D6MP//hHQTQx2o0a/Ig2b94sjOAnTpyQmkLEzFz5WUYgEAJMQg58F6MSRtBjwwZatv8EmyLsQ6OnZFBl1e/FGL+5vx+98tYeGv3wGJozZ47QhODAaLTCa7Dx+OeMgEwEmIRkohlAVk72Iqq78iUtyjDnlGhlWtCK1m7cQm1u/ql2xd+esp5bRX/560fCcZErtVpBlPs4gQCTkEKU4dGcPnkS7d+o3VrZNEJbmebotFkUe28fOn78uCAhGMa5MQJeQ4BJSOGKqD6GBZt69Z/O0KT5i+lE9algj/LPGQHXEGASUgQ9/IJynsqid15fq2gEY2KnLnyWevTpS2l+YSPGevNTjIB6BJiEFGGclJhIw/v21K7jhyoawZhYXNvPzltJR48dN9aBn2IEHEaASUgB4LiSb6ddj587/o4rtiD/V+rU79d0oPIQ24QUrDWLtI8Ak5B9DBtJQFDpyuV5tKfkRcvSRz4+nXbvP0j/rPnQsgy949SFS6nHL+JNpw2xPTALYAQMIMAkZAAks4+kp6XSXR3bUuq4MWa7iuc/OX+BZmQ/S+3btqEpj46h2G5dLMnRO5Xte5c27N5P28t22pLDnRkBFQgwCSlA1e6t2OZd5WJWbX56I/1H7Tl6XIuyt9PYLmQHPe6rGgEmIQUI9+l1L63IyqRecd0tScdRbPmC2cKedNt9g2wfyeDEOPCRyXS2ttbSfLgTI6ASASYhBejGREXRfu1q3j8xmZGhcBTzJR4QUu7sabaPZD+Mvovq6uqMTIGfYQQcRYBJyCbces5nXzGJiWPogz1b6JabbzItffc7B+niZ1/UH8FwNLvyj29sH8mYhEwvBXdwCIGIJSGk1tBL8SDfM5KBoToqUrEi4BNVTPF3/D/8DLmcEfbwxhtviKRhTTXIuvrNf9K6ZYssRcxPfWoJFWvxX75t2MD+tOPlVZY/iYt/+5x6DkukCxcvWZbBHRkBVQiELQmBYFB1AqSAv6PWFrQWpFUdMmSISCiP1Bb4uU5C+DcSfoF00EBCiP9CEwnJtD9I0Tpr1iy6fPlyozVBP6TKQNrW6eMepMHxvzS1bp99/iXd2jOevvjwKN3Q8rr6vtBiPj6yjzq2b2dKnv4wh29Ygo07OYRAWJAQSAakAjJB5YmZM2fSDTfcIP4ODUZ2O3z4MPXr16+RWKTKQP7olAkTqG/3ziLZmJnmfxTT+768aRu1vO5fLXtfV1S+R6s2vkXlFfvMTIefZQQcQSAkSQgaCohHr1aBhF3x8fFCw9G1GJXoobYXUqhiHnq78cYb6ciRI4L4oC0dLN9F65bnqpyGYdlIit/ihptFZkVujIDXEAg5EoKWAwJKSEig5ORkR/HE0Sw3N5d69eolNCwkjMcRD8SHSqp6qozVq1fTUwuy6PPqI47Or6nBYA8q2bCxQUUOT0yMJ8EIaAh4moRg04HWAU0HR53mShurXE3MA+SHdKkon6znZ4adCWWX/SupYs6TJk2kvLkZNGmsPUdDu+/FPkJ2EeT+qhHwJAlB48BRC5sff8cxy40GLQckU11dHTA1Kn6OY1mg8jqrVq2i3x96l0qeX+LG1OvHREWOK3QN5WnvwY0R8CICniIhaBDFxcXCvuN2nXTYdaCBZWRkWCbBuNju4qoeNpm56ZMse1Bb/XBwNT9w3BSRxgNuB9wYAS8i4DoJQdvB1Tiqger2FTeBwi1bfn6+sDkhN7OdBlJdX7xaGKhnL86nGRMfteQ7ZHUOs595jqLv6EGZ2lGSGyPgVQRcJSHc1ly6dEnYWdzOf6z7EmEemI
8szQGlfhY+kULxve+p/wZQL0x1zmlRCmjiNJHa1YkbQ69+4Dwv7yPgCglhw2Oze0HzwTU7brygkcH4LZsMIX9Av760tWiFiCWD4+CC/ELauGq54drzZj8jlAACAeFGDC4D3BgBLyPgKAmhGihulCZoznw4frndEJJRUVFB2dnZSm/e8N4pj42j3a/8ThDPmf84S89pBuN1zz+jBAJU2Uiekk4jR41SIp+FMgIyEXCUhGDsBfnIOupYBQK2msLCQhHCocKjOtC8ZmuhHu8frqTXVmralqYRqWjQgMbPfIoeHvcoJadMVDEEy2QEpCOgnIRwHEHQp10jr4w393U2RICqk7aSAQMGUEFBAaVPeZxWLJhVf1OGyqkrFmk2MZvEBBvQ4/Ny6NO/fUGVlZVca17GB8MyHEFAKQnB9gMnPxx34PfjVoO9B3YfNMxFdzZ0cj4goQMHDgjbEzIvxnbpRAtnTKGr3/0XTXryaWEjspL6A9rPCyUbaef+ShqvaT8IrgXBgvC4MQKhgIA0EtJTYPi+NDQPHL1kG3uNAqs7Gx47dkwYnd0kQp2E9Lmv0Sqi5i/Lo/GjR9DQAX3pnu53Gn0t8Rx8gDbv2kur1r9OE5JTxDU8xgDm0PDOnj3rCtmaegl+mBHQEJBGQjExMdS/f39hZ9HTZ7hFPlhZHAHh+Ijrdrc8rn2/MH8Sws8ESWrkWKrZylpdfx11jomilDEj6fYYLW2I3/EMx63ac+fppHa79uaet6lG+zvsWdk5uYLoYeca5WOIZm2I93eoICCVhPQkYSAfEBJiqpy0uwB0OBvi6AVnQ2xEr7RAJOSvNYrbOq1cEHyIaj75pMHUozt21DTKKOoRdzclauTjH0cXFxdXn/sIHVkb8srK8zyCISCNhLDJQAC+DbYXRLoj9EG1HUaVs2EwAI3+PBgJ6XJwg4hjo5mjo78WpMtibcjo6vBzbiKglITwYjgylJSUKNOIYItCmAVISIWzoazFMUpCVsbz14J0GawNWUGT+ziNgDQSQt5l2GF8G26iVCbSwvGlrKxM3Hjh+OflZpaEcKsIDTKYXQ2GaDyrNz0Dgf5vOIY6nXfJy+vAc/MeAtJICCktcJRAw29gaD+qHAFRZhnaj5POhnaXziwJQcPT04SYsauZHcfue3F/RsAuAtJICL+NoZnA9lNeXm7KpmH0JfBbHmEfyGyIVB9mNqfRMVQ9Z4Uc8L7QhMx4mFsZR9U7s1xGwAgC0kgIxy4cjUBAso3QurMhrrRh95Et3whQdp+xQw6BfLCamo+dcey+I/dnBKwgYImEsClwIwNj8LGj7wt/l79rnrvXXfcT+sEPfkDxA+4XRIE4MTuEAbnQruBs6LbXtRVwfftYJQdgMHToUNq+fbshjcjqOHbfj/uHLwK49cafWs0BtqbmbIMXjY6OoSjNRxC3uVaD0k2RkMg2uOl1qtJIIWHwQIq6ta2IgfLPjVNZ9Xu69MVlKqt4VyOhtjRBCyeAcdTMsQJGbgSZwtnQ6st56bOwQw6IwocNDHa2YM3OOMFk888jBwGQTqn2ve3QTjexd3aj+HvjKKpD+yadaKvPfEQVBw/RyBEJlKhdUpnZs4ZICISQpZFBfO+eNOY3vzJV1A/5c17dtosqDh+l1LT0oFn+kGURzoaDBw8WQa+hZPdp7hO1Sw44khrRKu2OEznbjN80EALitlUr3ln3z/9L4x8cRgmD7jec9wrJ+sr2vUtbNI/+2k8vUnbuYkNk1CwJ6cGWUe3a0NI5M2xFeiPW6bniEqo4VEXbNXb1T7alB7vCEIujlxmtKRQ+JxnkYISIZIwTCnjyHOUigGM/yKfq/SO04uknG2QCtTISlI/FL7xEX39zVdvvO5vdz02SEDSS9CmT6cXF2m2UduSS1ZDQCykn5i14WiTd0p0NceSA0TlcMwHKIAfY4YBTc0UAZIwja61ZTmgggF9uSWMepoeHPkCpjzwkddIwzcxe+lsqeXVDkzfmAUlofckrtH
7dWtqoJeCykl4i2Fsg/cSkudnUsVMX+utHHwm7j9edDYO9U7CfyyIHBKkiTUdTToyyxgn2Pvzz8EAAx6+U8Y9JVzZ80UHw9aOZWZS9+Bkaol2y+LdGJAQNaP6TM2mPlopUdTL2YSnTKHV6RkSkIZVFDsGOZLLGCY8txm/RHAIwgSQ9/JCW7fNZW6YWIyjDXjQ6fTbNX5jTSOFoQEL6pLYWPa9EA/KfLCaGulhFa9cpcW40Ao5Tz8gmh6bISPY4TuHD4ziLAGxAqATzYs5cx8pQ4QQ0cNxk2r5zdwNNvgEJ9el1L63IypRqAwoGbaSUppFJDnpZ6kCVX2WOE2zt+Oehi0BS4hga3vdeShzubHVj2IQf1fKgoyCnfvNdT0LrNZ+Ag3t3i0J9Trf8Na/Qlbp/0UoV5zs9tGPjySYHhMnAjcE/YZvscRwDiAdyDAG9KOe2Nb91bEzfgfyLcgoSgmrWrUtnem/bq44cwwIdy/7tV6MEOxrxhXEFOZuDyiYH3Cpi3fzxkj2Ozdfm7h5DwI1jmD8EcNcZNnE6le97W3y/goRWarctNX86qfkHzHENssJXXqP/8/W3VLCy0LU5qBzYKXJwahyVWLFsdQh4Ya/j7Xz3uyAhN2xB/jDDaNWp71C6rP2GD8emihwQV4agYb2pGicc1yQS38kLex24QxvqOSyRLly8RC0uXLhQFxcbS+eOv2NpTX4YfVejfv+s+dCSrAcemUw5S/PC0mdIFTkgwBfBg7qflapxLC0od/IUArjQsLrX//JxDd1x/3D64sOjdEPL68R7VZ/+M2Wv+B2VFuTV/z8zL/zL0eOp4HerqYWWjL7u5PuV9OKSp8z0r38WJGSVdPwHzNdKI1+hazQD9TJLc/FyJ1XkgA8LXtRMQl5efW/Mbc2aNWRnr7+8aRu1vO5ftRu1ofT1lX/QhJlZtHzBbOp8W7SlF9T3e4vk8ePr+nbvTBMeSrAkSCYJwcV7yeoSOlB5yNJcvNxJFQn5v7NT43gZa55bYARStFS/dvY6iOend/Whj4/so6MfnKQr//iGHk8abRlufb+3GDLoV3XTxz1oKjLed1SZJAQfgofSn6TTf/6z5RfzakeV5ICbMvyWQ0yZynG8ii3PyxgCQwcPIjt7HaNUfVBNG7btpOKNWxoczYzNoOFT+n5v0aP7XXXrli2y7DXpbxNaMH0KLZ493cqcKJyN06rJQU98hv+i3DQ3RsAfgbjY7mRnr+vysOfX5OXY0oIgS9/vLW5p26buD7s3W/YPkqkJYWKQp13Yhd0XpJqEkPwfHqjIxcQkFHafj6EX0u2DTT08+sFR9M6mdXRX186G5AV6aPc7B6n83cNCE/r0D5XU5qYbLcvS93uLrp071yFWrOvPYiwJk0lCiCW7KfY++vbbq5bm4uVOqklIf3enxvEy1pE6N5AQyrHDITFQ+9GPfkRHy16n2G5dLEH02edf0q094wX5HK8+RcdPfmj51IMJ6Pu9Rf9+fesWPpFiOYmRTBJCHNlA7Zr+bG2tJZC83MkJcsAt2aRJk+jIkSNehoLnphABveqN/xDQkrv/252U9+RUy3u9YG0p3X5bFA17oL8QP/WpJTTl0TGWSU3f71Jux/xf2OqVPd+O2fs6kYYFRSjPagnJuUUmAk1pQ8hBVX3ihOXbMd0naMfLq+qBDeQ7ZAb1+tsxLZVqXd3Xf6NFmelm+it5tnRrGR0+9RcqKS1VIt9NoU5oQni/1q1b0+XLl918VR7bJQRAQCiIAPsgtGK9Iek8qrWgLJdX9jrmpu/3FqdPn64bNWIY/fHtHS5B9/+HHZ02i5JTnzCUHNv1yZqcgFMkdN999/FxzOTahPrjcNGA53ytZsZA6XCkSIZtSM85pe1xkeMZxOSVvQ7Mf5MylTLnZn0fwBoTFUX7X1+rPLtac4sNI1Wnvr8WPkLhluQe7+0UCTk1Tqhv3HCYv29dPqRI7t27d/1rgZRgH8JNqW/qZC/sdUwS+73DvQ9osW
MXvyehLM3JrSV9R/PSJ7m2NuFsD3KahDAeX9O79ikrHxjkA+dUFAVNTU0NGGuJZ1AnEGWzfJsX9jrms3lXOe06/AfatHnz9yQEtQ3RtX98e7vyvNJNrZAezObL5spX08EBnNJQME58fLz4MMO9eICDy+eZoUQBUm3jgnzMFBjUXwB7fUC/viJ3WKvrW7r2XoigL9mwUQRf12dWnJmZQf/7hh9TxsRHHZ8YCqZt2L1f1CcK1+YkCWlByaKUUrgSerh+I829FwqQFhcXC5uPFfLxle12TqHijW/Sh7UXqUh7H7R6EnIruyLOhj2HjRVZ1poqYxMOH52TJMRHsXD4Yr5/B6RiLdVuixMSEkQpdRnNzeyKCNXAqefAocP1WUEbJLpH/ellS3JoW9EKx45l4zKyaGTiIzRW828J5+Y0CeGaFppQOBr5w/k70d8Ne7GwsFAcrVWUQ0dlHdyU7d+41tFj2W8mTqPMJ+c1qD/WqO7YmqLVdPLoe1p+oQXK1zp/TQld+e//RXn5y5WP5fYATpMQbAcgILuqu9u4Rdr4qsnHF0+nlY7ZzzxP0Xd0p8yZsxosa8AKrOmpU+ia//4vUZNaVSss2UhvH/1AWMcj4be10ySE33T4TQpPWW7eRwCVULFeUZq7TGZmpmN7QlRbfnktofKGKkM1TC6zn1lBt0T/jHIWL2m0GE3Wol9Z8FvasfVN6ZPDhCbNzaF2UTGUoTEifBlKtHJD4U5ETpOQ97cdzxAI4JdFVlaWsI9o0Quu7ANoRDNnTKN1+TmWU/o0tZrIJT1OKwE9dtxjlJb+RMDHmiQhPI3JpU+ZTNOTH6HUcQ/b/mpwC7bkhZdo3oKn6m1AWIRIICImIdufT1gJwHePtCv45QtHQ7dLXYnqy1pBxPh7etCc1BTbWhGUDdyCbdi+h4peWtvsTW2zJIRVh19BrsbQVe+/p00uWeSXNduOnThFC55bRa1vulnYf+BW7tsigYjcICF4zeLGke1CZr9Ydc/r8V24oQL5eOlGWHhgayegYs0REooHUj6bPaKBfDbv2qspG8U0ctQomjf/ey2vuRaUhPTOOLPm5+XRjrIyShgykAb17SPCPHrFdW90kwbvZ6hh+w5XUcXB9zSgowT5NOc8ByJKT08XgXZ6eVh1n4Lzkt0gIUTVQ5tF2ldu7iIAvy0El+I7z9P2kZfIxx8ZzDU3J1vzuC6lrp1uo+ED46m3ts+jOrRvFNqFdBy1587TyT+dEfu98ugxGjsmUdvv+UHJRx/XMAnpHcCW8F2o0Gpd1dScparjx7UkSt81eA8tR5E2gXY0WEs1ijLFwZiwAdFpk4eNKNyIyA0Sasp1393tGFmj+weXwkM4lBp+kZVpikfV+0e0/V5LNZ980mD60R07CiWjR9zdNFjb6/5lyY28q2kSMiLUzjNC4wpDInKDhOysA/e1h4Ae31VZWSm8nDmEpmk8PUdCmGo4EpFbJAStFRsg3G8f7VGG3N6wxVVUVAibD5NPcGw9SULhSERukZB/hdbgnwQ/YRUBOIji6DJBq+/FlwHGUfQsCYUbEblFQnqWPTjAcVODALRNBJcmJiZKi+9SM1NvSvU0CYUTEblFQjCM4o+Xb2O8uTWCz4rJJzhGRp7wPAnhJWChh7F606ZNIXtr5hYJGfkI+BlzCMDtAZpPr169lASXmptN6D8dEiQEmHGswMKHKhG5SUJIcI4/3OwhgAsTeDnHxsYK36twcyOxh4713iFDQjoRwfCHpF2h1twkoVGa5yqcQLlZQ0C/re3SpYujwaXWZht6vUKKhEKZiNwkIdaErG1MeDfDDACNxwvxXdbewvu9Qo6EQpWI3CQh73+G3poh4rsQ2Q7yQWS7UY9/b71F6MwmJEkoFInITRLCpkLjzdT8xvRycGnoUIr5mYYsCYUaEblJQnCiQ5OVo9j8Z+btHnp815+1mnc4doVafJe30Q0+u5AmIZ2IUAIFQa9ebm6SEG4W8VueSajhF4L4rmXLll
F1dbU4djH5uLODQp6EABt+0yNQ0MtE5CYJwcCK3/a8yb7fZEaKB7qzHSNz1LAgoVAgIjdJKDI/7cBvrRcPRGS7lbQTjKV8BMKGhHQiwrkeSaO81twkIWhB8HWJ5Ihuu5VLvfY9hdN8woqEdCLC8cNrHsJukhDwgLe5F8lZ9WZSUTxQ9ZwjTX7YkZBXichtEkK4gZdtZrI3HuK74Gg4ePBg4eXMzbsIhCUJeZGI3CQhGGKhDfkXGPDuZ2l9Zk4WD7Q+S+7pi0DYkhBeco1WNQBX0144mrlJQpHwybtVPDASsFX9jmFNQgBPJyC3iYhJSM2nfObMGRHZjnxJcDTkNLZqcFYpNexJyCtEpJqEcOR64403xIaEARrXz4E2JBwXkZ8JIRxIQRqqoRwcXKqSFpyVHREk5AUiUklCIJaUlBRx9NQbNAPkXurdu7f4X/hZUlKSqEOmNwRoIi1KKHlS6/FdcDuAlzNnjHSWMFSMFjEkpBMRtAM3bktUkRBsIX369BFewP4NJHP69Gmh7eAZPBuogazGjh2r4vuSJlMU5NO0PJCQ14sHSnvpCBEUUSSkExF+ezr9218VCSFhGXxhmmpwUIyPjxcbuKkGkjp79qwnMwWGevHACOERW68ZcSTkFhGpICFoPz/+8Y9tfQB65xMnTngqtkzURdfqdx07dkwYnPVjpZSXZSGeQiAiScgNIlJBQtASWrduLeWDOnr0qCc2OlculbKcISUkYkkIqwRjLo4qThzNVJAQ3qFdu3YNDNJWv77Lly+7fr3NxQOtrl5o94toElJJRPBfwZW53kpLS0VlTr3heCEjihs3Xr7jWPkckeIDxzG3GmxahYWFlJqa6nkDuVsYhfO4EU9CqogIx4qYmJgmtRRZNhjceMXFxdn6RlGJw42yxRxcamvZwqYzk9D/LKWKoxkMqzNnzmz0sWDDByrBAx8e/KnVbqpqas426BcdHUNRGqlBa/EnDDvakBtaEBcPDBv+kPIiTEI+MIKIEHUty2emKW3IVwuC93Kx5jC4Q6unFntnN4q/N46iOrSn6A63NljgmnOfUu2581R95iOqOHiIRo5IoETtKAZCgu8MtCFfZ0UjXwf8iA4cOOCYQVqvpIvKper05tAAAAo+SURBVPDV4uKBRlYp/J9hEvJb4/T0dEpISJBir4Fof20IRnCk1EDYwcyMGXTx/DmanDSaEgbdT62ub2noi7v63XdUtu9d2rLnbfr6m6tUUPiCkAefITMNHsdOxNRxcKmZVYm8Z5mEAqy5TCLy1Ybwmx9OgWuKVtNmzUt56ZzpgnzstMqq39PsZ56nHnf/nH52e2dauHChIXFNHQkNdTb4EIgRTpLwUufigQZBi8DHmIQCLDqIA3YW3NbIuMHStaFp06bR559dos7t29DctBS69pprpH1ypVvLaEPZXjrz14+CHstgB4JfkKrjEBcPlLasESGISaiJZZZJRJB1++23U9ubb6JZE8dR4vAhSj6uYydO0eBHp9D9DzxAO3fuDDgGCAh2IBUpL7h4oJJlDXuhTELNLLEsIoKcX/TuRS/nLaTYO7oq/aheKHmNXn1rN3Xq0o22bt3aYCxVBATPbaRSxfELdqZIyOCodBEjTDiTUJAFl0FEoxJG0GPDBtq2/xj9Nqv/dIam5iynB341iJYuXSq64VgJtwCZRzCuXGp0Rfi55hBgEjLwfdghopzsRVR35UtalJFmYCR5j+D2bMPu/do1/iOiwqjMShtcPFDeOrEkIiYhg1+BFSLC1XT65Em0f+NLUo3QBqdMo9NmUXLqE1K9oZG3u0zzaeLigUZXgZ8LhgCTUDCEfH4OIoJDI66b/Usq42jib+x1+hjm/yo4lk2av5hOVJ8y8ZaBH+XgUtsQsoAmEGASMvlpgIiGDh1KBQUF9USE9KowzOLWSW8ITch5KoveeX2tyRHkPj514bPUo09fStOcMK00xHehcGJiYqIj2QaszJH7hDYCTEIW1g9aD7yTQUS4ls
bfQU5wRNRzHidpm3Z4357adfxQCyPI64Jr+9l5K+noseOmhIJYEdkO7/G0NGftWaYmyg+HPAJMQhaXEEQEjQh2Hz2/M4y/8+fPF/9up6VMPXf8HVdsQf6v1Knfr+lA5SFDSeE5uNTiB8HdLCPAJGQROmgKugaki0COIHgi42crl+fRnpIXLUn/YfRdDfq999Zr1PvuWEuy0GnqwqXU4xfxzWo0IFMcKbt06SKCS1U4M1p+Ae4Y1ggwCVlY3kAEpIu5cOEC5eZk010d21LquDEWpBOBhP5Z86Ho+9nnX9KtPePr/21FoH5dv72ssRe1XjyQyccKstxHBgJMQiZRxDEsKytLZDPE3/0b6nhVlP+7LedEXxKCfP9/m5wyBbILcfFAsyjy86oQYBKygSxujuAzg//qhIQSO1e/+U9akZVJveK6W5LuSzqfnL9At903yJYmhFxEAx+ZTGdra4UhHZHtsFvBhhWqFVgtAcudPIkAk5CkZfElpJY/+Qkd2PxKo8RkRofytwkNG9ifXshdQB3btzMqotFzkAlbD1cutQwhd1SEAJOQAmDbtm1DH+zZQrdoUfNWmv/x6y8f19DcZ1fQjpdXWREn+kCmrLzWlifBHRmBAAgwCSn4LOJiu9O6ZYssR8wHsgHZsQtd/Nvn1HNYIl24eEnB27JIRsAeAkxC9vAL2Hvo4EE0fdyDNDj+l5ak+xMO7EIzsp+1rAnJDN+w9ELciRFoBgEmIQWfR4pWX6xv98404aEES9L9bUIQYsdXqKLyPVq18S0qr9hnaT7ciRFQiQCTkAJ0Eex5sHwXrVueq0C6eZGLVxZRixtudiSpvfnZcY9IR4BJSMEXgBuomOgo+rz6iALp5kXCHlSyYWOjyH/zkrgHIyAfASYh+ZgKiQPi+9HCJ7Ra973vUTSCMbG+PkLGevBTjICzCDAJKcIbyb9Ovl9JLy55StEIxsTmF62jK3QN5S1bZqwDP8UIOIwAk5BCwO1e1dudGq7mB46bItJ4cECqXTS5vyoEmIRUIavJhRf1+uLVtG3NbxWO0rTo2c88R9F39KDMmTNdGZ8HZQSMIMAkZAQlG8+4ZRuCLWjYxGkitavMChs2oOCujEBABJiEFH8YuCkb0K8vbS1aYTmWzOwUv/r7FUFAuBHjGmBm0ePnnUaAScgBxJGzJ+WxcbT7ld9Rq+tbKh8REfMZT86lkVraWW6MgNcRYBJyaIX2lpdT7qKn6bWVeco0ImhAk+Zm08damMepU98nRePGCHgdASYhB1cItbqqjhymgqeftJxrqKnpwgb0+Lwcypg9hzZv2UKbNm1y8M14KEbAOgJMQtaxM9UTScRat25NU6dOpSOHD1Fsl060cMYUy+k+9MGh/bxQspF27q+kklc3CK/oAQMGNCg/ZGqi/DAj4DACTEIOAY54MhROxE0V8voc1GqU5S/Lo/GjR4g/0R1uNTUT+ABt3rWXVq1/nSYkp4hreN0XCOOUlJSYkscPMwJuIcAk5BDyffr0oaqqKjEaUsCiUCK0o2VaitVSjaBaXX8djdAyKMb37klRHdo3IiUct2rPnaeTWlXVN/e8TTXa38eOHUvZObmNHBFZE3JoUXkYKQgwCUmBsXkhuB3r1q1bg4egqSQnJ9f/P5TcgXNj5YF3qaamlmo++aTB89EdO2p1w6KoR9zdlKiRD8oLBWpwCUjXqq2yTciBheUhpCDAJCQFxuaFoDrHMr/YLSSYx7FMdqJ5aFsoYIgijNwYgVBAgEnIgVVq166dqHLh36AJybbdoBQRGo5q3BiBUECASUjxKqFQIspF+zZoP7onM45NMrUhaF2JiYmcO0jxurJ4eQgwCcnDMqAkFBnEHzQYpFE6evv27cpGhXwQG8eLKYOYBUtGgElIMqDBxKkmIWhd5Zp3NjdGIFQQYBJyeKVwczVv3jztpita+sjQuFBdVbadSfpEWSAj4IMAk5DDnwOu4WGkTktLkz4yHCLRfK/+pQ/CAhkByQgwCU
kGNJg4+PHAo1mFXQhys7OzlWhZwd6Lf84IWEWAScgqcjb6wW4DEpJtPGZPaRuLwl1dQ4BJyAXoc3JyxE0Z/shqbA+ShSTLcRoBJiGnEdfG00M0QEaymgpikzU3lsMINIcAk5BL34fsq3o+irm0kDysbQSYhGxDaE0AbrKQemPkyJHWBPj0QqwYYsY4Xsw2lCzABQSYhFwAHUMijQd8hmT49OBWLE9LCSIz/MMlWHjYCESAScjFRZ+pJSKbMGGCrTgv+BxBDqfucHEheWhbCDAJ2YLPXmcZBCKDyOy9BfdmBOwhwCRkDz/bvVeuXCki6ocMGWJaFkissLBQHMW4MQKhigCTkMsrB9uQflNm1nmRbUEuLx4PLwUBJiEpMNoTgpxDFRUVVFBQYFgQNCiQlooYNMOT4AcZAQkIMAlJAFGGCKR/xe2WkeBTXMeXlpZSUVGRjKFZBiPgKgJMQq7C33DwpKQkkRWxOd8heFsjXQcnLvPQwvFUbCHAJGQLPvmdcduFY1YgYzPyR5eVlQkNSK8xJn8GLJERcBYBJiFn8TY0GnIOFRcXi1uz+Ph4OnbsmIg369WrF8mMNzM0GX6IEVCMAJOQYoDtiEdkPOw/ICOUd+bGCIQjAv8P1Hg9fZm730oAAAAASUVORK5CYII=" /></picture></example></p><p><example picture=""><picture -="" a="" b="" e="" x=""><picture irreducible="" version="">For the same execution traces, the heart rule again implies that the threads
must be converged in B. The convergence of the first dynamic instances of A is
technically implementation-defined, but we'd expect most implementations to be
converged there.</picture></picture></example></p><p><example picture=""><picture -="" a="" b="" e="" x=""><picture irreducible="" version="">The second dynamic instances of A cannot be converged due to the convergence of
the dynamic instances of B. That's okay: the second dynamic instance of A in
thread 2 is a re-entry into the dominance region of A, and so its convergence
is unrelated to any convergence of earlier dynamic instances of A.</picture></picture></example></p><h2 style="text-align: left;"><example picture=""><picture -="" a="" b="" e="" x=""><picture irreducible="" version="">Spooky action at a distance </picture></picture></example></h2><p><example picture=""><picture -="" a="" b="" e="" x=""><picture irreducible="" version="">Unfortunately, we still cannot allow this second example. A program transform
may find that the conditional branch in E is constant and the edge
from E to B is dead. Removing that edge brings us back to the previous example,
which is ill-formed. However, a transform that removes the dead edge would not
normally inspect the blocks A and B or their dominance relation in detail. The program
becomes ill-formed by spooky action at a distance.</picture></picture></example></p><p><example picture=""><picture -="" a="" b="" e="" x=""><picture irreducible="" version="">The following static rule forbids both example CFGs: if there is a closed path
through a heart and an iterating anchor, but not through the definition of the
token that the heart uses, then the heart must dominate the iterating anchor.</picture></picture></example></p><p><example picture=""><picture -="" a="" b="" e="" x=""><picture irreducible="" version="">There is at least one other issue of spooky action at a distance. If the
iterating anchor is not the first (non-phi) instruction of its basic block,
then it may be preceded by a function call in the same block. The callee may
contain control flow that ends up being inlined. Back edges that previously pointed at the
block containing the iterating anchor will then point to a different block,
which changes the behavior quite drastically. Essentially, the iterating anchor
is reduced to a plain anchor.</picture></picture></example></p><p><example picture=""><picture -="" a="" b="" e="" x=""><picture irreducible="" version="">What can we do about that? It's tempting to decree that an iterating anchor must
always be the first (non-phi) instruction of a basic block. Unfortunately, this
is not easily done in LLVM in the face of general transforms that might sink
instructions or merge basic blocks.</picture></picture></example></p><h2 style="text-align: left;"><example picture=""><picture -="" a="" b="" e="" x=""><picture irreducible="" version="">Preheaders to the rescue </picture></picture></example></h2><p><example picture=""><picture -="" a="" b="" e="" x=""><picture irreducible="" version="">We could chew through some other ideas for making iterating anchors work, but
that turns out to be unnecessary. The desired behavior of iterating anchors can
be obtained by inserting preheader blocks. The initial example of two natural loops contained in an irreducible loop becomes:<picture preheaders="" with=""> </picture></picture></picture></example></p><p><example picture=""><picture -="" a="" b="" e="" x=""><picture irreducible="" version=""><picture preheaders="" with=""><img alt="" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAW0AAAC4CAYAAAAohb0KAAAgAElEQVR4Xu19C3RW1bXupKP3gpZXqwgKlwTr4aEHkihCaIuJlvJogYBYAvJIIq8gYoJUICJ5IBpCiyQoJlExRIQQXyGINwRBiE+gnkK0Q6HHI8QDBeujUnzguY6Ru7+tK+78+X/+/e+13/9cY2TwyHp+a+1vzTXXXHO2a1YScWIEGAFGgBHwBALtmLQ9MU/cSY8jsH//fjp69CidOXOmZSQNDQ3q37t27Uo9evRQfx+YEhMT6cCBAy354uLiWrL079+f8BMfH+9xdLj7kSDApB0JWpyXEQiDAEgZ5AuSPnbsGJ0/f14tAbIFAXfo0EH9t/bvkYAq6kcZ8XeQP4gfP/369VPrBpHj35z8hwCTtv/mlEdkIwJHjhyh6upqVRoWEjOIE6RplJiNdj/YhoFNIykpiSZMmKBK5Zy8jwCTtvfnkEdgMwKQokHUJ06cUIkwLS3N1SqKXbt2UW1trXoCAIFPmTKFCdzmNWNmc0zaZqLJdfkWAb8QHzYcEDhOCDgNuH3D8e2CkhgYk7YEeFzU3whAMi0qKlIl6pSUFBo9erSvJFQQd2VlpUrgUOVkZWWpF6Kc3I0Ak7a754d75wACkEbLy8tVAgORxcbGOtALe5sEcZeUlKiN5uXlRcWY7UXYvNaYtM3DkmvyOALbt29XJU9YemRnZ0el9QVOFThdfP7557R06VJX6+o9vtwMd59J2zB0XNAPCMC6Ytu2bSpZp6amUnp6eotZnh/GZ3QMsEQpKyujxsZGmjdvnqoa4uQOBJi03TEP3AsHENi0aZNqBQJSgkkcp7YIYFMDedfX16tqE+i+OTmLAJO2s/hz6w4gAJtqqABwuQjJmlN4BEDeBQUF6oOewsJCvrAMD5llOZi0LYOWK3YbAiCeRYsWqbpq6Gv5xWDkMwSdd05ODg0dOlTV+3OyHwEmbfsx5xYdQADSNcgGUiIf8eUnQKhMSktLWeqWhzOiGpi0I4KLM3sRAZA1JEQQDEvX5s0g7Njnz5+vmkXynYB5uIariUk7HEL8e88iAHVIRkaG+nQ7MzPTs+Nwc8eBMYhbmEm6ua9+6RuTtl9mksfRCgHYGU+dOlXVXScnJzM6FiNQXFysejXEaYaTtQgwaVuLL9fuAAJQhUDCBoGwZzv7JgCPk2AayMRtLeZM2tbiy7XbjIAg7IqKCn6KbTP2aE441mLitg58Jm3rsOWabUYANsRQiTBh2wx8QHNM3Nbiz6RtLb5cu40IgLBhycAmfTaCHqIp6LiR2Jbb/Llg0jYfU67RBgRw0ag134PdMCwZmCRsAF9nE+IimGNY6gRMZzYmbZ1AcTZ3ITBs2DC1Q3DyBFKAW9Gamhp3dTLKe4ONFRfCVVVV7ITLxLXApG0imFyVfQjANhjStUiwEsEDD0Hi9vWEW7oQArAoweUwn4DMWydM2uZhaUpNcMCPn6bjx5XFfrxVnbGxfSimTx9Vsoz2F2jQmcKPSLC0bt26FpJgPOWXpSyGEydOZGlbfhpaamDSNhFMo1XBL0a5YlO8XYndF3fNAEoakkAxvXpSbK8rWlV54uTfqenkKWo8+j7V73+FJoxPoVTl8i0aCRwWCmPGj
GkDOdyHwvcz42l0NX5Xzsw1ibmCtM2vUuXmRJRm0jYHR0O1YCEvyrqTzpw6SXOmTqKUkTdR186ddNV1/ptvqHb3y/T0iy/R2a/O07qS9VEVZQTmfZdffnkrrKZNm0ZfnvsX46lrBQXPZMWaxAUxpO26ujqJnnFRJm2H10B+Xi5VKxc099+9UCVrmdRw4M+0eNWfKP7a66i0/NGoufT56U9/qobFQopXQoSd/+pLxlNiIVm5JnEhCbUVBw6WmKDvi7KkLY9hRDWoTozSZlLfnpfRkswM6tC+fUTlL5S58tla2ly7i6qefiYqPg5YkOAY30cJvDtj4u8YT4MryY41edNvRtLw4cPZD4zBOdIWY9I2AUS9VeDoOTV1Mi2cMZlSx1kTc+/g4bdpQW4hVTy52ffqkunTp9NuxddFcd5SxlPvIgzIZ9eanJ6dQ+m3zaLc3FyDPeVirB6xeQ1Amrkx6QbakL+E4q7ub2nrZz7+hMbedgfV7NjpW/8bwHPI9YNp05p8xtPgarJzTe59/QDNWZpPDa++5ts1aXAaIi7GknbEkBkrMDFlPM0YO0Jaf6239cZ3j9KC/DW0r+EVX+q4GU+9KyF0PsZQHkMnamDStgF1XPA0n/uMcrPsdcQP65LNO/dQTe0OG0ZpXxOMpzzWjKE8hk7VwKRtMfJHjhyh+XNm0Z4tilWHiZeOers9KfMuSp93u29suRlPvTMfOh9jKI+hkzUwaVuMvt1H0MDhQE0ya9lKOtz4tsUjtad6xlMeZ8ZQHkMna2DSthB9PP/NX55De7c+ZmEr4atesOIBih82nDIVfx1eToyn/OwxhvIYOl0Dk7aFMzBV8UA3bvhgxRyt7XNrC5ttUzXMABcXFtObBw/Z2azpbTGe8pAyhvIYOl0Dk7ZFMwBzqst79KCTh/Y6ossOHNZVN/xWtSSJVR6ieDExnvKzxhjKY+iGGpi0LZoFOMkpXlNIL1ZsiKiFCbMX0s49++nbE+9EVC5c5gUr7qf4XyR51mmPUTy1uJiJrRfxNAPDH8cObLXUXnv+KUq8Ni7c8gv6ey9iaGigJhdi0jYZUFHd/Mx5NLB3d5o3bbLuFj48dZruzHuAena/jOZOn0xxA/rpLhsuo9vN/4QPEW00Gu2YjOCpLW82tm7FEy8cQ52mZDEEniBtIVD845PP6IrBSYYFDLdiGO5bcvr3TNoWzYCRG/rqF77zgnbZJT+j/2o6SbMVz39mJbfrtXFBBk9wiPEIh/mB5G0ETy12ZmPrVjz7fO9vHS5qA8N8yWIYSNrB/h3JenUrhpGMwYm8TNoWoT5s6BBam5NNQxMG6W4Bx/c19yxWdeBX/nJkKwkGEk5ZYT5l5uSr9UV6LIUv7hG3zqHjTU26+2NnRpD2jTfeqDYJwg4kbyN4avt/IWyNjNOteIK0IW0jwc+6lrxlMQwkaZxeAtdpJFi6FcNIxuBEXiZti1DvExNDexRTv8BABqGaC/wAQDIFi+9oUZGAtP+j7ln132fPfUGXDBwW8bEUdTQ3N1s0YrlqtaQtatKSd4LiejUSPLW9CYet0Z67EU8taYtxCfKemJJiGENRV6BOe+yIZFpfcA/17tnat7leTN2Iod6+O5WPSdsC5OEuNGX8OPrL/1VcpHa7VFcLO/fupzP/+LRFJYLj/Lkvvmr5t1aXGCjx6GpAyYQ6kpOT9Wa3NR902nipFyx9pypppr++tF03ntp6wmFrdKBuxBNrD1YiwdLFF11E7+zZTjE9W0dEimT8gevwbx+coCUPrKXtjz8USTUteVHfcSW0nletmgwNWrIQk7ZBABE55ejRoyrRfPTRR6pfZxxLxdH03676OW1bv1q3B7oFy++j8i1Pt+oNpBjxMciSNjz/DR6bSqfPfGRwxNYWCyZpo0VsMjjiI8LPxtW5uvHU9jYctkZG5lY8g0naCDywdOlSqnhiI1UU5RnCUCtpB1o2Ba5NvXgCw4Qxt9A1/z5QDZAQqIPXW0+05WPSNjjj0
L+CaIIlEE2H//2/aOG0m2lU0q/CtiBu4T99503q0qljKynkg9d3q0dP7YchLCEikW7c/pw9kLQFWYuTwZhRI3XjqQVcD7ZhJyhIBrfiqSVtQdaIzdihQwcyiqF2+IEEbWQtivoEhng/MFWJdTpv3jzf+Mgxsqb0lmHS1otUQD5I1AkJCS3hrrS/Pnz4MJUoksPwQX0p7ZaUsC0EHt9FgcernqNOHS9WX1TiY4HkDRtupEgvIusbXqOHtjxPdfW7w/bHiQyCtAPJWvQlIy1NN57a/uvB1sh43YonSBvqEUjWgqxlMQwk7UC8Il2LorwWQ/R5vuJmIU65u4D1EKfQCDBpS6yOTZs2EWLfaRMufWpqagi/21/3Am1cUyDRwg9FjR5BRQ0ri0upXZdulJ+fb0p/zK5EqJZC6dzNxlO2/27Fc9u2baq0Csk6MHkBw9WrV1OTYuFUWloqO0W+Lc+kbXBqIRksWrRI1WVrL9AgZUM3h4u1PrEx9Enj6wZbaF1MlrShz67YvMWzekOz8ZSdFC/i6RUMsfFUV1dTRUVFG3t92XnzQ3kmbQOzCKIuKipSL8hw6z1gwADCxSSOdbhQEQnhxVbcnkFJidcbaMU80vaLPayZeMpMiJfx9AqG+MZycnJU4mbLktarlUk7wq8X6oWzZ89SYWFhyxEU+tgxY8aopku4/BGprKyMjrzRQBvuWx5hK+ZmLyrdSOeoPRUqR08vJ8ZTfva8hCGss6B+hKqELUt+mHsmbZ3fARYQdn7ccI8e3TaSOlQkwRZWQtwgw6ZqOrt2wWwwqxoxba7qljWUXw8z2rGrDsZTHmkvYQiVDixL8EI22Hcnj4b3amDS1jFnuMBpaGhQVR+REt/27dtpU/kj9FzZgzpaMj/L4lV/pNir4ylb0b/7ITGe8rPoNQxxfwSJOynJu14q5WeNJW1dGGKXx2UjFkt6erquMsEyOaVHhO517G13qKHGglkTGB6QwwUZT/kJ8CKGOOmCwLX3RvJIeK8GlrRDzBl8D5eXl6u66/79+0vNLMj/xhuG07Ola3X7IpFqUCn8+b/OqYQNixHZ/sv2xezyjKc8ol7FEKfe+vp69YLST4JIJDPKpB2AFnZy7OhdunQx1aZZvVSZMY12PvEwde3cKZI5MpRXjcI+dz5NUNyd+jExnvKz6lUMcfFfUFBAVVVVrS7+5RHxRg1M2pp5wmUiFgNekyUmJpo+g7vq6qgg9156qrjQMokbEvasJXk0YfIUSs+4zfQxuKlCxlN+NryKITYcvKCMRp8lTNrfr3vxEgvqkEgvGyP5dLAxzJ87m9bec1dEvrb1tAEd9uyl+ZS1+G7fStiBODCeelbGhfN4FUOoeBA4A0JWNFmWRD1p41EMduw0xbcFnv/akdAmoojE9buKVtw515C7UW0/IV2vr9hCO/Y0UMWTm6POppXxlF+1XsUQ6kyYBKYovsJljAXkEbSvhqgmbTyXra2tVY9Y2kcxdsFfpjwaKFpdSDMnjVd/9AZMEP2DDXb1C7vooU1bKS09QzXrs/KUYBcuRtthPI0i90M5r2IIKy9cTOKk7PcUlaSNYxUuG2OU6DLLli1zdI4hKaxWFlqlcivetXNHGq948ktKHEwxvXq2IXGoP5pOnqIj7x6lZ158SYkj+d90k+IiduWqVdS3b19Hx+GWxmXwPKFgO2XKFMrLL4jqzc+rGOK1J95T+N2yJOpIGz4NcNkI6dptpnDQLeLhQ8O+l5VgCk104sMPW3HhT7t2oZ9cfDF9/Mmn9M3//E/L74STKrcQp1v6EQ7P2N69Fb8WMRSfcC2lKmRtxeWzW7Aw2g+vYQhT3ZKSEtWyxK+nzqghbUgPIGskOHryko0nXoPBPjVYwtN5kDYnRsBKBCBMQJWIBP2xXfc/RsaEjQbqEvgscZtgZmQ8gWWigrSF3xD4L3BrjMRwk4nLFujgAxN0eE6reML1nX/vTQRAfnhghktKQdRekV7RZ3wzENC8+s2HWjW+J
23ouQ4ePGjIb4ibPjWcFGDehOOfNkHiGTVqlCr5OHGZ6iaMuC/yCCAYBYQD6IaHDh2qWmR41TWqXy1LfEva2GlxRAKh+cUUKJC4YZtapzzYwQuxyspK1S+DkIi8pP6RpxquQQYBXMwL9Qc2fpi/+km/D5NenBD8YlniS9IWlxHQaXlVSgj1Eao+IxSLERxdcUuu3ZBA2pCS4JsBpI2Pz29HQxly4rKtEQBRI0IM1gqEG1jO+DUVFxdTY2Ojquf2ukDjK9IWwUH79evnaz0viBtBF/bt2xdyAeKkIQjc68dcvxKJE+PC/Q701FCDCKL2ip5aFi9sUhi71y1LfEPaIgSYVX5DZBeMk+Wj+UN1Ene3tM0b+A8zISxLvBzGzBekDb8hx44d88XRx+oPPZqOxFZj6eb6taoySNJ+01PLYI9TBsxooeP2ou7e06QdLgSYzMT6vWzg5RPCqHEcPu/POu5zoKdG8rueWma2sP5B3KmpqZ7T5XuWtIUzdKf8hsgsGLeVFWZeMI2E/hsXUn67wHUb5mb2R6i/8KewHmLzT30Iw7LEDe4s9PX2u1yeI23skACa48VFMs368+JuAOaDXnxQoX+U3s+p1VPjhAT1hx9f/9kxU15Tr3qKtM0MAWbHYvB6G8L6BPpRkEI0+Sx249yxSad1s4K1DmHFC5YlniBtL/sNsW6Z2VczTjeCwKE2gf6bpTr78Bd6an48ZS3mOGXC+2egZQleIsO+2y0qJ8OkjVd4+Gk6flyx+TzeCs3Y2D4U06ePerEl61hGmOh49abX2mVmf+3Qf8PWFfMiLrrcspjtR8O6FqGfhuQHnFlPbR3OgTULyxIRxgyqWLjCcJOPn4hIGztRubLjbFe8fcVdM4CShiRc0O9z49H3qX7/KzRhfAqlKs5bIiVwu0KA2bck/NWS9vm8IHCvvzZzcoagp4ZJJl604iTDempnZgMnSzibgqmkcNIGweT06dO6OmS1QKuLtLH7LMq6k86cOklzpk6ilJE36Y4ofv6bb6h298v0tOK0/+xX52ldyfqwpmUiBFg0hRDStRpcmgnHdkE26CI/n9c/UQI7uD1l1wP6cbM6J8gaxK1NNTU1IQVPOwXasKSdn5dL1YpD8fvvXqiStUxqOPBnWrzqTxR/7XVUWv5o0CfY4kIAeiU+dsug7UxZtmrQhzufUvTh5EQuEDD8+2BD1SbhoE37f3YLtGg7JGmjwxlpM6lvz8toSWYGdWjf3jT8Kp+tpc21u6jq6WdaiNlNIcBMG2iUV8T2w60XgFZPzfcB7vw4QMIJCQkEPgqW3nvvvZZLeLsFWtGfoKSNjk9NnUwLZ0ym1HGjLUH34OG3aUFuoRo9HBsEbm2F8t+SBrlSRxGI1pd6Qk8N9Qf01Gx54+gy1NU4Ln9xEoJPcfypJfDs7Gz1UtJOgTaw021IGwR6Y9INtCF/CcVd3V/XII1mQjTxX948g379m5G0YcMGz7tMNIpDNJUTtsYgMT/6bsZcBuqp8VSabdy9u8q1JP7WW2/RFT26U1baFFsE2mCuJdqQ9sSU8TRj7Ahp/bXeKWpUIosvyF9D+xpeYdLWC5pP8vkpSgqmhPXUPlmYIYZht0A79rY7qGbHzjYuJVqRNnQ0zec+o9ysTFvRh3XJ5p17qKZ2h63tcmPuQcCr8QjZbt09a8jqnrhFoG0hbXw08+fMoj1bFKsOEy8d9QI5KfMuSp93e8S23Hrr53zeQUCEvoJkA9VCpPb9Vo+UX4hajbD76neTQNtC2nbvIoHTAjXJrGUr6XDj2+6bMe6RIwhoyRH6b6fdx2p9sbhxM3FkkqKgUbcJtCppQxeXvzyH9m59zNEpWLDiAYofNpwylaejnBgBLQJQQ8AdL9zHwsMjYmPaYcfPXg95HbpNoFVJe6pyBB03fLByGzrG0RmCGeDiwmJ68+AhR/vBjbsbAfX1meL/BOoTK57PB24Q7F/c3evByt65UaBt9/XXXzdfrhw9Tx7a64guOxDwq
274rWpJwk74rVyK/qjbTNM6t6li/DFD3h+FGwXadnV1dc3FawrpxYoNhhD+cezANuW+PfGOobpQaMGK+yn+F0mUmWmvBYvhDnNBVyBg9BEL66ldMX2u7ASEAqMC7d8+OEFX3zSOPn3nTerSqaM6vsb3jlHe2oepcl1hy/9FMnAh0LbLnDe3eWDv7jRv2uRIyrfkBWnLkHRgo2z+Z2gauJAGgXBuTVlPzctFDwJ4xSsj0D5e9Rx16nixqnY+e+4LSluUQ2vuWUx9r4zV03ybPEKgbTdh/Lhmmcc0ZpM267UNzScXCoGAeD7/6aef0k9+8hP67LPPVGdArKfmJRMOgfmZ80hGoAVRXzJwGH3w+m568y9H6NwXX9FsxUuq0SQE2naJQ65vXpuTTUMTBhmqy2zSPnHy7zTi1jl0vKnJUH+4ECMgENBGnO/WrRt1796d3n//fXaByktEFwJmWI0c+EsjbX5uB5VvebqVqkRXBwIyCYG2XWzv3s17FFO/2F5XGKmHAnXa9yycSysXLzRUlyiEOhWjFqk6uHD0IoDHOdXV1So5C+sSLRpa97GIPg/zQb74jt71Emrkw4YOIRmBVstnZYX5UlI26hICbbse3S9rfmtnNfXodqmhWTNb0kYnmLQNTUVUF9I+gxdEjcgj4ZLRcuHq5d97H4E+MTEkI9ACgZ1791Pdy6+qkvbf32qgyy79mRQw4MZ28YMGNm9cnWvYo5/ZpA3Pf4PHptLpMx9JDY4L+x8Bsx1OhZPQ/Y8oj1CLwOWKNz8ZgfYfn3xGVwxOUsn6kPLS+9CRd0zRQrQbPfI3zQun3Uyjkn5laMbMJm1+zm5oGqKmkFZPbZVr18A2nH4+HzWT67KBJsQNIhmBdt1jlfRvV8bQ2F8nqyNbsPw+mjt9MsUN6GdopEKgbZc+c2bz8EF9Ke2WFEMVmW2nXd/wGj205Xmqq99tqD9cyJ8IOCUF8+tIf64nPaMaM2okGRVohU329scfamkqmO22nn6IPEKgbafEYmzeX/cCbVxTEEl5y/KuLC6ldl26UX5+vmVtcMXeQMBt+ma27/bGujGrlxlpaSQj0JrVD1GPEGjb/fOf/2zuExtDnzS+bnYbhuqDPrti85awEdsNVc6FXI8ALDvgGAqhnhC1A5HdEabLbUn7khJ95Mg0bpsh+f5gHbpRoFUdRiG82IrbMygp8Xr5kUrUwDbaEuB5uKgIQVZfX+85G2qjz+c9PF1R03XcbbhRoFVJu6ysjI680UAb7lvu6IQUlW6kc9SeClevdrQf3Lg9CIjXiiDtlJQUNdgBbKu9mjjauldnLnS/3SjQtgRBkL0plZ0u3IyOmDZXdcuqx75Wtj0u7wwCIDa4VcWfgqjt8Itt92g5XqTdiFvTnhsF2hbSxu38pvJH6LmyB60ZfZhaF6/6I8VeHU/ZixY50j43ah0C2heIbtZTW4GAl1U/VuDhxTrdJtC2Cuzr1FEAumxEHkaoMS8fj724IK3qM5NVW2SjefOyap3ZUa/bBNpWpA3F+403DKdnS9ca9kUSKYif/+ucStiwGHGjlUCk44n2/H7TU1s1n9GiJrIKP7vrdZNA24q0AQQWU8aMabTziYepa+dOlmOjRmGfO58mTJxoeVvcgDUICAKCXTX01HB76kc9tTXoEWGjq6ysDOngyqp2uV79CLhJoG1D2hjGrro6Ksi9l54qLrRM4oaEPWtJHg28bghNmz6dpWz968cVOYWpG7zpQU+Np958UpKbGjue6Mv1MLpLNzY20q2pv6eXnnqMune7xHIwQgm0QUkbvVHDxs+dTWvvucuwr+1Qo4IOe/bSfMpafDclKw7pc3JyVIuRvLw81mlbvhSMN2BmTEbjvYiOkmY7w4oO1MwbJfgPJ0j84KEX/oSgcuWVV9JlP+tqi0A7YfIUSs+4rc2gQpI2cqKTcAQe1+8qWnHnXMPuW0WrkK7XV2yhHXsaqOLJza1ePeKJcEFBAWVlZfHrMvPWnik1sfmaK
TAarkQ8nweRp6amqvbsbBZrGM4LFgTWw4YNC5mnpqZG9b1uh0AbSmV8QdIWPS8rLaWi1YU0c9J49SfSgAmwwa5+YRc9tGkrpaVnqGZ9oRbdauVhDY4h69atY72oNetSV638UEQXTLZn4kDE1kM+f/58gn12YIIa8PDhw+p/2ynQBvZDF2mjEI7GqwsLqVJ5j9+1c0caPyJZefY+mGJ69WxD4lB/NJ08RUfePUrPvPiSEnHhlHo5lZdfoEtCACCLFGKPi4ujZcuWWT9L3ELLQoR5U21traqfZj21excG9N+CwHHpy+5jzZsrcF1CQoKqEtEmSNk45WiTnQKtaFc3aWs7Cn0PPu6GfS/TiRNNdOLDD1sNRAlhphwhYig+4VpKVcg6MTHREKK4VS8pKVF13UbrMNRwFBViPbX3JxtqE7wyxXeZlJTEQYslpxR4QtqGWhDfB5JWyg6s3k6BFm0bIm1JTCIqDkCg64ZkUahI+qzLiwi+kJmFnhq4iufkjK052DpZC98/yKGP0wvMLxWX1aopZkZGhlphMCk7WEt2CLSuJ20BDI4qsDIBwSAQK6fIEdDqqVkiixw/L5XgE1RkswW8IF3HKHEhtb78QdogYqHLjqxWa3J7hrTF8OHjFjpXSN3RbBcMvb+eByxC9wnMcOvNuk9rPiQ318o29ReeHQgzIGyoYZOTk1tlBpmDtN2knvUcaQNREBEuKkFa0WjbjUU0ZswYdfcPRdxsZeBmGnWub9rXq8J8UM/m71yPrW0ZViLw4w51iFfUg54kbTGNsKmEymTp0qVRY9uNS5IblQdJ+BNmkdnZ2S2rmsNhWfuB+632aPYTIwS/fv36ec5CzdOkLT4i2HYfPHiQShV7cj9LDVrCxthxo40LEqiMMH7WU/uNVu0ZT7R5ZBTCHlSsblJ76J1tX5A2Bgu9HfRSQ4cO9dzOqWeyIBnAdhTErU2wf8dJAwTOiRGQRUD7fF74lMFdiF8SBLxjx46pp1SvqEMCsfcNaYuB4chXVFSkXlR6cRcN9nGoHsYUlQh02YEJj48wVk6MgNkIYL3B/hsC0ahRo1T7b68SHb4hWIJgHJmZmWZDZWt9viNtoCdsu7HYvLSjwsYWP03HjysS9XF1IXz77bd05O236Ysvvgy6MKAOOn36tK2LhhuLPgTwmA4eHZGE+12voIBvCm89oD71g8WZL0lbLCbclMPKBLfkbrXthn6tXFlM2xWTvLhrBlDSkIRWrgHgt+Uj5eeDD0/RO0eP0ZlPPqP//OAEXXrppfTll1+qP3WKK93Ro0d75RuytZ/BNkLRgdjYPhTTp4+qWgp8nmxrJ13emBbD99//TxtdbT8AABBTSURBVPrHxx9Tj+7d6Uc/+pFiRupuDGFz3dTUpBK2X6Ji+Zq0xbeAizpICZC63bLTQne4KOtOOnPqJM2ZOolSRt6kO+jE+W++odrdL9PTil+Xs1+dp3Ul61mnrSG+cBuhyCp85DQefZ/q979CE8anUOrUqUzgCkDqy0rFDC6UMOF2DHHKhjokLS1NVev4KUUFaWPC3GTbnZ+XS9VVVXT/3QtVspZJDQf+TItX/Ynir72OSssf9Y00YQQT3giNoNa6DPTYECaav/1/NPPmsYaFiaa/n6G8gpWObIDCZ5GbhDT5mfmhhqghbTFkJ227oWvPSJtJfXteRksyM6hD+/amzWXls7W0uXYXVT39jK/NHkMBxhuh3FLC2gRZH3jjdVp77x8UD57XS1XYqHj4XLn+UfUkWFO7w5YLTIwB7zaQcDnvF3VI4EREHWkLAOy27YYUODV1Mi2cMZlSx1mjfz54+G1akFvYJsCE1Nfn8sK8EcpPEFQJUyf/nn4/5tc079Zb5CvU1KCeBO9/0PI1Ke6voiGIStSSNtaVXbbdIBZEc96Qv4Tiru5v6kcRWBkuLhHdvmbHTtXXiJ8Tb4Tyswt1SMbMGbRhZY7pYQVF73B3MD07h/JWrqLRivsFs5O4s8JTdD8/rhO4R
TVpCxCstu1GyLYZY0dI66/1LnYcTRfkr6F9Da/49ojIG6He1RA6n7rp/f4WJd7hAxFHo4q0dVyeT5q/mJatyG/jlCnSukT+UJ75jNbnlXJM2t/PFC4q8SjHbNtu6Fqbz31GuVn2GvTDumTzzj2qPtGPiTdCuVm1c9MTPUWM2BHT5phyClQvTBVz3mCe+eSQcX9pJu2AOcJiwGWGGbbdakT7ObNozxbFqsPES0e9y2pS5l2UPu92R27w9fbRSD7eCI2g1roM7lfGDR9i2f1KqB4e/a/jNH3Rcnrz4CHDp0AveuaTn7EfamDSDoEmFgZ8UMuYDdktDQYOBWqSWctW0uHGt81cM47WxRuhPPx43bip/BF6ruxB+coM1LB41R8p9up4NcB3JAmnYfgXivbYsUzaF1g1UJVA6jbitxuPE/KX59DerY9Fsi5Nz7tgxQMUP2w4ZSqL3Q+JN0K5WXRCLRLY4+8uyxdS3e6XdF8cet0zn9ystS7NpK0DTeG7IJTfbkjleCavtQudqjydHzd8sHL8NP+2XEeXW7LADHBxYbF6HPV64o1QfgaLlVfBJ949othi3y1fmUQNJU88Rf999mtaV1wSthaY5zY2NqpP0b3qsCrsICPIwKStEyxIKMXFxW38dkMCGDZsmGrMD497SMh7ueLI6eShvY7osgOHdNUNv1UtSbxuAsgboc7FeoFsw4YOobU52ZaZ9+ntIaTtwWNT6fSZj0IWESa5fvDMpxcXPfmYtPWgpMmjPpVWdHHw242oMSBs6FkhZb/33nsqMcKEsHhNIb1YsUF37R+eOk1FjzxO5VueVsuMHZFM2x9/SHf5C2VcsOJ+iv9FkqddUspuhGbj68WNECSYEBenChNG0o9jB7YqNm/aZFp6+2zq3fNyI9XRrybNpHUPPxLUhbI43eJOiX3Ft4aXSdvQciPCZQ703XiJJRI87cHj3vzMeTSwd3fCotabFiy/j2ZMGk+J18apRc6e+4K6dOqot/gF8/nB/M/IRqgFxWx83boRXijgM9R4R95ooA33LTe0rkDa3554p6XsvjcO0r43D9HKxQsN1VdUupHOUXsqVNQf2oTvCuPwk2c+QwCFKMSkbRBNLKo+iltPSIDaVKU4gqqu2hrxY5rAD8Jgt4IW84JeG0Ee4JEtlAtdIxuhFgyz8XXrRggcofeF/XKghJqh4Dt8UF9KuyXF0PIKhqEMrnjift8jFarqDkl45jPD3NbQAD1SiEnb4ERNnDhRlbYDEyxNevW8gkruXRyR3vBCix+/KyvMp8ycfLW5155/qkUi19N9PCMecescOq74FXZrwgYI1RPUSyCcQPKWtRqRIZdgmLl1IwRpQ7WABB/hWvIeM2okLZx2M41K+pWhZWA2acNm+5b5f6D3lPBf+JZKSkp8E6jAEMA6CzFp6wRKmw0LDKQdKnXq2JEO73ouoqfB4Uj7P+qepbgB/VS1ySUDh7U6puoZAupvbm7Wk9WRPIK0ReOQEkE4IjiB7AWa2aTt1o1QS9oCS0He8DC5cXWuYf83ZpM2Xkj+fPgY1e81kp8985n5UTFpS6AJyVAE2sVlJIz/xeI7uKOKBvbvq7v2cKSt1SUaISCUSU5ODtkfHKnxaMGpBClL4KftgyDvRVlZtEexeY/tdYWhLhrBLFxDqBMbi5tSZWVlm+DPon8XXdSB/rqnlmKUk6CRZDZpow+os6amxnevdo3gq7cMk7ZepCLIlxA3KGKJxkrS1mNeFW540N3DvNGqNFWJGAOdZmASpA2d9ls7q6lHt0sNdcEq0t63b5+h/gQWglrIDJPMYJI2NmS4LN26ZQvVlD9I/X/ex1CfzSZtOJG6NO6X9PXXre+FDHUuigoxaVsw2UZ0hxeybtB+LDBduzPvgYjMAb3wnD1QPYJTgfYxk5GNUDu1ZluPmLERWrD0SEvagqxhmoq/wz3witszDAc4MNt6xK0qJivmxcw6mbTNRPP7uozc0l/IjhgfC+y2d+7Zr7YQ6
UVkfcNr9NCW56mufrcFozWnSkHaIOtgntuMbITanpltp+3WjRCkjRNRZmamiqP2BaGRdanF0Gw77UDrEXNWkv9rYdK2YI7hlH1/3Qu0cU2BKbXLHu1XFpdSuy7dCJGp3ZpEENZQendZwjF73G7dCLH28F4gWDAAzH/z2Y8pN9sdfmgQIu/Vt/9GFYoenpN+BJi09WOlOycu1PrExtAnja/rLnOhjLKkjefCFZu3ePplmdkboezEeGEjDBwjHoJNHD+W/vpSW1NVWTyMlP9dxgLKXpKjbjKc9CPApK0fq4hyyuoPA4+lWuuRSDriF72h2RthJBgGy+vVjbBPTIyUFY4sbqI8LiF7Dfm14nvkjGG/2mb1xWv1MGlbNGOyT4bN6laop8Jm1W9nPWZuhDL99vJGmKM4NetE39DS+bNkIJAuW/1CHb3w6ltUVV0tXVe0VcCkbeGMy1o8yHYNFg4jps1V3bL6waUlb4SyK+K7p+I33jCcXnvuSerauZN8hQZr8OpJxeBwTS3GpG0qnK0r82qEEAshka6aN0JpCMlpn9rlW56hd5oUh1Dl5fKDicIamLQtnnSnjvQ4wo+97Q411Jg2OIPFw7W8et4I5SF2MnoNnq7DJeu+V17VHbVGfsT+qoFJ2+L5xAUajqPPlq41/AQ70i7iwwBhw2Kkf//+kRZ3fX7eCOWnCO4XYEmyZ8tjtqpJfqesy+w/LKXRY5yN6CSPoHM1MGnbgD1MrTJmTKOdTzxsyweiRmGfO58mXMCplQ3DtqwJ3gjNgRbeAFffl0/PKQJFh/btzan0ArUsXvUnJaDvICWg712Wt+XnBpi0bZrdXUpwhILce+mp4kLLJG5I2LOWKJ7xJk+h9IzbbBqZM83wRmgO7psqnqBNjz+mRma36mIS5n0Lcwup82WXU/F6c6IxmTN6b9bCpG3jvMET4Py5s2ntPXdF5GtbTxehw569NJ+yFt/tWwk7EIfbFJeebx06SDWPFvNGqGeRhMgDiXvRnXfQxqJ8w25bQzUPC6Zp2Tk0MOE6uuaaf/d0yDsJiE0tyqRtKpzhK4PJFRz6x/W7ilbcOdew1zrREqTr9RVbaMeeBqp4crOnXz2GR691Dvg037t3L8X27kVlq+7ljTBSADX5oeOemjqZkq6Pp7vnZUhL3ZCuYSWyueZFKn30MTWOKp7WC//oEl2N+qJM2g4tgbLSUipaXUgzlbiQ+InUTzQkmOoXdtFDm7ZSWnqGoidc5Atb7EimY8CAAWqMzs6dO1NM7/9Dv0gYxBthJAAG5IVVSfG6B6lciSW5MP1WNSxZpCoTkDXW5X3ry9UT39JlOSpZIxg2wslxkF6JCfq+KJO2PIaGa8BHsrqwkCoVJz9dO3ek8Yonv6TEwRTTq2cbEof6o+nkKTry7lF65sWX6ITy9ylTplBefkHUkTUAB3YXXXRRC/bwRT171ix6/LFHeSM0vCK/K4iL3oL8PNq0qZL6X3UljRuRRInKhhhuXe5+9QA1vHmQpkxOpcKiolYmffA+aJbvccnheb44k7ZLphDHR9ggN+x7WYk80kQnPvywVc9ie/dWnOTHUHzCtZSqkHViYqJLeu5MN4BXQkJCa4wU4kbU9qqtW3kjNGla4Oa1traWDrzxeth1OUpx/BTM+RNOQ4iog3BinOQRYNKWx5BrcACBbdu2EaLdBCZI3JDo8CdvhA5MTJAmWTVi7jwwaZuLJ9dmEwI5OTm0evXqNq1BZ4qIN1AdcXIeAVxwFimqklLlDoeTOQgwaZuDI9diMwKwHIE6CQmqIrz8xPE7mPN/m7vGzWkQwDyBsHlezFsWTNrmYck12YgAorCAqGFCBt8qTA42gq+zKQSuwKUmYlRyMg8BJm3zsOSaHEQAl13QndYpL085OY8Az4d1c8CkbR22XLPNCECyw+OlZYqjf07OIQBzzDGKQ6iqqipWi1gwDUzaFoDKVTqHANQmsBxJT093rhNR3DIIG0GacRnMD2msWQhM2tbgy
rU6iAATtzPgM2HbgzuTtj04cys2IwBzwLNnz/KDDptwx4Uj7OZhwcMStrWgM2lbiy/X7iAC0HHX19dTRUWFr6L3OAhp0KbxiAmXwDDt82PQDbfhzaTtthnh/piKANyOFhQUMKGYiuoPleFlKp6o49LRD8GjLYLJ1GqZtE2FkytzIwKwKMHlWGpqKl9QmjRBUIdAusajmby8PD7JmISrnmqYtPWgxHl8gQD03MeOHaN169axVCgxo3AihdMLyDraHZdJwGi4KJO2Yei4oBcRwKMP+C1JSUlhqTvCCVRdtipkjQTCZnVIhACalJ1J2yQguRpvIYBLSrgchbUDX56Fn7syJTACLnVB1mwdEh4vK3MwaVuJLtftagSE5Ig/EVUlOTnZ1f21u3OwuxabG/Bhz4l2z0Dw9pi03TEP3AsHEQBpQ5I8ePCgSt7RHscQeBQXF6v6f1zeRjseDi7NoE0zabttRrg/jiIAybK6urpF5w0PgtGS4Pu6pKRE9cyXlZXFahCXTjyTtksnhrvlLAIIWwb7Y/gxgfTtZ703/JJDX40EnyEYMyf3IsCk7d654Z65AAFYm+ABCVQnIG6oC7xu5gZdNYgaF7E4SSQlJakqELYGccGC09EFJm0dIHEWRgAIaAkc0uioUaM8o++FygOnBxA1kug7E7X31jaTtvfmjHvsAgSg/xXSKgh86NChqiTuFgsUkDR8guAhDE4JSLBNZ4naBYtHsgtM2pIAcnFGAAQOKVwQJFQOkGD79eunqlLwY+WFJp7pBxK0aB+bCOyqWaL2zzpl0vbPXPJIXISAVtJtbGwk6JHxf5DGu3fv3tJTrWQOPx7aC08QMcogoTw2BZGampoImwU2A0j62CBAzm6R9F00Fb7rCpO276aUB+R2BPSQMQi4S5cuusjd7ePl/pmLwP8HbQenicLENqEAAAAASUVORK5CYII=" /></picture></picture></picture></example></p><p><example picture=""><picture -="" a="" b="" e="" x=""><picture irreducible="" version=""><picture preheaders="" with="">Place anchors in Ap and Cp and hearts in A and C that use the token defined by
their respective dominating anchor. Convergence at the anchors is
implementation-defined, but relative to this initial convergence at the anchor,
convergence inside the natural loops headed by A and C behaves in the natural
way, based on a virtual loop counter. The transform of inserting an anchor in the preheader is easily generalized.<br /></picture></picture></picture></example></p><p><example picture=""><picture -="" a="" b="" e="" x=""><picture irreducible="" version=""><picture preheaders="" with="">To sum it up: We've concluded that defining an "iterating anchor"
convergence control intrinsic is problematic, but luckily also unnecessary. The
control intrinsics defined in the original proposal are sufficient. I hope that the
discussion that led to those conclusions helps illustrate some aspects of the convergence control proposal for LLVM as well as the
goals and principles that drove it.</picture></picture></picture></example></p>Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-76859028257648300122021-06-11T07:40:00.001+02:002021-06-14T11:26:26.064+02:00Can memcpy be implemented in LLVM IR?<p><i>(Added a section on load/store pairs on June 14th)</i> <br /></p><p>This question probably seems absurd. An unoptimized <span style="font-family: courier;">memcpy</span> is a simple loop that copies bytes. How hard can that be? Well...<br /><br />There's a <a href="https://lists.llvm.org/pipermail/llvm-dev/2021-June/150883.html" target="">fascinating thread on llvm-dev</a> started by George Mitenkov proposing a new family of "byte" types. I found the proposal and discussion difficult to follow. In my humble opinion, this is because the proposal touches some rather subtle and underspecified aspects of LLVM IR semantics, and rather than address those fundamentals systematically, it jumps right into the minutiae of the instruction set. I look forward to seeing how the proposal evolves. 
In the meantime, this article is a byproduct of me attempting to digest the problem space.<br /><br />Here is a fairly natural way to (attempt to) implement <span style="font-family: courier;">memcpy</span> in LLVM IR:<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">define void @memcpy(i8* %dst, i8* %src, i64 %n) {</span><br /><span style="font-family: courier;">entry:</span><br /><span style="font-family: courier;"> %dst.end = getelementptr i8, i8* %dst, i64 %n</span><br /><span style="font-family: courier;"> %isempty = icmp eq i64 %n, 0</span><br /><span style="font-family: courier;"> br i1 %isempty, label %out, label %loop</span><br /><span style="font-family: courier;"></span><br /><span style="font-family: courier;">loop:</span><br /><span style="font-family: courier;"> %src.loop = phi i8* [ %src, %entry ], [ %src.next, %loop ]</span><br /><span style="font-family: courier;"> %dst.loop = phi i8* [ %dst, %entry ], [ %dst.next, %loop ]</span><br /><span style="font-family: courier;"> %ch = load i8, i8* %src.loop</span><br /><span style="font-family: courier;"> store i8 %ch, i8* %dst.loop</span><br /><span style="font-family: courier;"> %src.next = getelementptr i8, i8* %src.loop, i64 1</span><br /><span style="font-family: courier;"> %dst.next = getelementptr i8, i8* %dst.loop, i64 1</span><br /><span style="font-family: courier;"> %done = icmp eq i8* %dst.next, %dst.end</span><br /><span style="font-family: courier;"> br i1 %done, label %out, label %loop</span><br /><span style="font-family: courier;"></span><br /><span style="font-family: courier;">out:</span><br /><span style="font-family: courier;"> ret void</span><br /><span style="font-family: courier;">}</span><br /></p><p><span style="font-family: courier;"></span>Unfortunately, the copy that is written to the destination is not a perfect copy of the source.<br /></p><p>Hold on, I hear you think, each byte of memory holds one of 256 possible bit patterns, and 
 this bit pattern is perfectly copied by the <span style="font-family: courier;">load</span>/<span style="font-family: courier;">store</span> sequence! The catch is that in LLVM's model of execution, a byte of memory can in fact hold more than just one of those 256 values. For example, a byte of memory can be <a href="https://llvm.org/docs/LangRef.html#poison-values">poison</a>, which means that there are at least 257 possible values. Poison is forwarded perfectly by the code above, so that's fine. The trouble starts because of pointer provenance.<br /><br /><br /></p><h2 style="text-align: left;">What and why is pointer provenance?</h2><p>From a machine perspective, a pointer is just an integer that is interpreted as a memory address.<br /><br />For the compiler, alias analysis -- that is, the ability to prove that different pointers point at different memory addresses -- is crucial for optimization. One basic tool in the alias analysis toolbox is to recognize that if pointers point into different "memory objects" -- different stack or heap allocations -- then they cannot alias.<br /><br />Unfortunately, many pointers are obtained via <a href="https://llvm.org/docs/LangRef.html#getelementptr-instruction"><span style="font-family: courier;">getelementptr</span> (GEP)</a> using dynamic (non-constant) indices. These dynamic indices could be such that the resulting pointer points into a different memory object than the base pointer. 
This makes it nearly impossible to determine at compile time whether two pointers point into the same memory object or not.<br /><br />Which is why there is a <a href="https://llvm.org/docs/LangRef.html#pointer-aliasing-rules">rule</a> which says (among other things) that if a pointer P obtained via GEP ends up going out-of-bounds and pointing into a different memory object than the pointer on which the GEP was based, then dereferencing P is undefined behavior even though the pointer's memory address is valid from the machine perspective.<br /><br />As a corollary, a situation is possible in which there are two pointers whose underlying memory address is identical but whose <i>provenance</i> is different. In that case, it's possible that one of them can be dereferenced while dereferencing the other is undefined behavior.<br /><br />This only makes sense if, in the formal semantics of LLVM IR, pointer values carry more information than just an integer interpreted as a memory address. They also carry <i>provenance information</i>, which is essentially the set of memory objects that can be accessed via this pointer and any pointers derived from it.<br /><br /><br /></p><h2 style="text-align: left;">Bytes in memory carry provenance information</h2><p>What is the provenance of a pointer that results from a <span style="font-family: courier;">load</span> instruction? In a clean operational semantics, the <span style="font-family: courier;">load</span> must derive this provenance from the values stored in memory.<br /><br />If bytes of memory can only hold one of 256 bit patterns (or poison), that doesn't give us much to work with. We could say that the provenance of the pointer is "empty", meaning the pointer cannot be used to access any memory objects -- but that's clearly useless. 
Or we could say that the provenance of the pointer is "all", meaning the pointer (or pointers derived from it) can be freely used to access <i>all</i> memory objects, assuming the underlying address is adjusted appropriately. That isn't much better.[0]<br /><br />Instead, we must say that -- as far as LLVM IR semantics are concerned -- each byte of memory holds pointer provenance information in addition to its <span style="font-family: courier;">i8</span> content. The provenance information in memory is written by pointer <span style="font-family: courier;">store</span>, and pointer <span style="font-family: courier;">load</span> uses it to reconstruct the original provenance of the loaded pointer.<br /><br />What happens to provenance information in non-pointer <span style="font-family: courier;">load</span>/<span style="font-family: courier;">store</span>? A <span style="font-family: courier;">load</span> can simply ignore the additional information in memory. For <span style="font-family: courier;">store</span>, I see 3 possible choices:<br /><br />1. Leave the provenance information that already happens to be in memory unmodified.<br />2. Set the provenance to "empty".<br />3. Set the provenance to "all".<br /><br />Looking back at our attempt to implement <span style="font-family: courier;">memcpy</span>, there is no choice which results in a perfect copy. All of the choices lose provenance information.<br /><br />Without major changes to LLVM IR, only the last choice is potentially viable because it is the only choice that allows dereferencing pointers that are loaded from the <span style="font-family: courier;">memcpy</span> destination.<br /><br /></p><h2 style="text-align: left;">Should we care about losing provenance information?</h2><p style="text-align: left;">Without major changes to LLVM IR, we can only implement a <span style="font-family: courier;">memcpy</span> that loses provenance information during the copy.<br /><br />So what? 
Alias analysis around <span style="font-family: courier;">memcpy</span> and code like it ends up being conservative, but reasonable people can argue that this doesn't matter. The burden of evidence lies on whoever wants to make a large change here in order to improve alias analysis.<br /><br />That said, we cannot just call it a day and go (or stay) home either, because there are related correctness issues in LLVM today, e.g. <a href="https://bugs.llvm.org/show_bug.cgi?id=37469">bug 37469</a> mentioned in the initial email of that llvm-dev thread.<br /><br />Here's a simpler example of a correctness issue using our hand-coded <span style="font-family: courier;">memcpy</span>:<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">define i32 @sample(i32** %pp) {</span><span style="font-family: courier;"></span><br /><span style="font-family: courier;"> %tmp = alloca i32*</span><br /><span style="font-family: courier;"> %pp.8 = bitcast i32** %pp to i8*</span><br /><span style="font-family: courier;"> %tmp.8 = bitcast i32** %tmp to i8*</span><span style="font-family: courier;"></span><br /><span style="font-family: courier;"> call void @memcpy(i8* %tmp.8, i8* %pp.8, i64 8)</span><br /><span style="font-family: courier;"></span><span style="font-family: courier;"> %p = load i32*, i32** %tmp</span><span style="font-family: courier;"></span><br /><span style="font-family: courier;"> %x = load i32, i32* %p</span><span style="font-family: courier;"></span><br /><span style="font-family: courier;"> ret i32 %x</span><span style="font-family: courier;"></span><br /><span style="font-family: courier;">}</span><br /></p><p style="text-align: left;">A transform that <i>should be</i> possible is to eliminate the <span style="font-family: courier;">memcpy</span> and temporary allocation:<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">define i32 @sample(i32** %pp) {</span><br /><span 
style="font-family: courier;"> %p = load i32*, i32** %pp</span><br /><span style="font-family: courier;"> %x = load i32, i32* %p</span><br /><span style="font-family: courier;"> ret i32 %x</span><br /><span style="font-family: courier;">}</span><br /></p><p style="text-align: left;"><span style="font-family: courier;"></span>This transform is incorrect because it introduces undefined behavior.<br /><br />To see why, remember that this is the world where we agree that integer stores write an "all" provenance to memory, so <span style="font-family: courier;">%p</span> in the original program has "all" provenance. In the transformed program, this may no longer be the case. If <span style="font-family: courier;">@sample</span> is called with a pointer that was obtained through an out-of-bounds GEP whose resulting address just happens to fall into a different memory object, then the transformed program has undefined behavior where the original program didn't.<br /><br />We could fix this correctness issue by introducing an <span style="font-family: courier;">unrestrict</span> instruction which elevates a pointer's provenance to the "all" provenance:<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">define i32 @sample(i32** %pp) {</span><br /><span style="font-family: courier;"> %p = load i32*, i32** %pp</span><br /><span style="font-family: courier;"> %q = unrestrict i32* %p</span><br /><span style="font-family: courier;"> %x = load i32, i32* %q</span><br /><span style="font-family: courier;"> ret i32 %x</span><br /><span style="font-family: courier;">}</span><br /></p><p style="text-align: left;"><span style="font-family: courier;"></span>Here, <span style="font-family: courier;">%q</span> has "all" provenance and therefore no undefined behavior is introduced.<br /><br />I believe that (at least for address spaces that are well-behaved?) 
it would be correct to fold <span style="font-family: courier;">inttoptr(ptrtoint(x))</span> to <span style="font-family: courier;">unrestrict(x)</span>. The two are really the same.<br /><br />For that reason, <span style="font-family: courier;">unrestrict</span> could also be used to fix the above-mentioned bug 37469. Several folks in the bug's discussion stated the opinion that the bug is caused by incorrect store forwarding that should be weakened via <span style="font-family: courier;">inttoptr(ptrtoint(x))</span>. <span style="font-family: courier;">unrestrict(x)</span> is simply a clearer spelling of the same idea.<br /><br /><br /></p><h2 style="text-align: left;">A dead end: integers cannot have provenance information</h2><p style="text-align: left;">A natural thought at this point is that the situation could be improved by adding provenance information to integers. This is technically correct: our hand-coded <span style="font-family: courier;">memcpy</span> would then produce a perfect copy of the memory contents.<br /><br />However, we would get into serious trouble elsewhere because global value numbering (GVN) and similar transforms become incorrect: two integers could compare equal using the <span style="font-family: courier;">icmp</span> instruction, but still be different because of different provenance. Replacing one by the other could result in miscompilation.<br /><br />GVN is important enough that adding provenance information to integers is a no-go.<br /><br />I suspect that the <span style="font-family: courier;">unrestrict</span> instruction would allow us to apply GVN to pointers, at the cost of making later alias analysis more conservative and sprinkling <span style="font-family: courier;">unrestrict</span> instructions that may inhibit other transforms. 
I have no idea what the trade-off is on that.<br /><br /><br /></p><h2 style="text-align: left;">The "byte" types: accurate representation of memory contents</h2><p style="text-align: left;">With all the above in mind, I can see the first-principles appeal of the proposed "byte" types. They allow us to represent the contents of memory accurately in SSA values, and so they fill a real gap in the expressiveness of LLVM IR.<br /><br />That said, the software development cost of adding a whole new family of types to LLVM is very high, so it better be justified by more than just aesthetics.<br /><br />Our hand-coded <span style="font-family: courier;">memcpy</span> can be turned into a perfect copier with straightforward replacement of <span style="font-family: courier;">i8</span> by <span style="font-family: courier;">b8</span>:<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">define void @memcpy(b8* %dst, b8* %src, i64 %n) {</span><br /><span style="font-family: courier;">entry:</span><br /><span style="font-family: courier;"> %dst.end = getelementptr b8, b8* %dst, i64 %n</span><br /><span style="font-family: courier;"> %isempty = icmp eq i64 %n, 0</span><br /><span style="font-family: courier;"> br i1 %isempty, label %out, label %loop</span><br /><span style="font-family: courier;"></span><br /><span style="font-family: courier;">loop:</span><br /><span style="font-family: courier;"> %src.loop = phi b8* [ %src, %entry ], [ %src.next, %loop ]</span><br /><span style="font-family: courier;"> %dst.loop = phi b8* [ %dst, %entry ], [ %dst.next, %loop ]</span><br /><span style="font-family: courier;"> %ch = load b8, b8* %src.loop</span><br /><span style="font-family: courier;"> store b8 %ch, b8* %dst.loop</span><br /><span style="font-family: courier;"> %src.next = getelementptr b8, b8* %src.loop, i64 1</span><br /><span style="font-family: courier;"> %dst.next = getelementptr b8, b8* %dst.loop, i64 1</span><br /><span 
style="font-family: courier;"> %done = icmp eq b8* %dst.next, %dst.end</span><br /><span style="font-family: courier;"> br i1 %done, label %out, label %loop</span><br /><span style="font-family: courier;"></span><br /><span style="font-family: courier;">out:</span><br /><span style="font-family: courier;"> ret void</span><br /><span style="font-family: courier;">}</span><br /></p><p style="text-align: left;"><span style="font-family: courier;"></span>Looking at the concrete choices made in the proposal, I disagree with some of them.<br /><br /><b>Memory should not be typed.</b> In the proposal, storing an integer always results in different memory contents than storing a pointer (regardless of its provenance), and implicitly trying to mix pointers and integers is declared to be undefined behavior. In other words, a sequence such as:<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">store i64 %x, i64* %p</span><br /><span style="font-family: courier;">%q = bitcast i64* %p to i8**</span><br /><span style="font-family: courier;">%y = load i8*, i8** %q</span><br /></p><p style="text-align: left;"><span style="font-family: courier;"></span>... is undefined behavior under the proposal instead of being effectively <span style="font-family: courier;">inttoptr(%x)</span>. That seems fine for C/C++, but is it going to be fine for other frontends?<br /><br />The corresponding distinction between bytes-as-integers and bytes-as-pointers complicates the proposal overall, e.g. it forces them to add a <span style="font-family: courier;">bytecast</span> instruction.<br /><br />Conversely, the benefits of the distinction are unclear to me. One benefit appears to be guaranteed non-aliasing between pointer and non-pointer memory accesses, but that is a form of type-based alias analysis which in LLVM should idiomatically be done via TBAA metadata. 
(Update: see the addendum for another potential argument in favor of typed memory.)<br /><br />So let's keep memory untyped, please.<br /><br /><b>Bitwise poison</b> in byte values makes me <i>really</i> nervous due to the arbitrary deviation from how poison works in other types. I don't see any justification for it in the proposal. I can kind of see how one could be motivated by implementing memcpy with vector intrinsics operating on, for example, <span style="font-family: courier;"><8 x b32></span>, but a simpler solution would be to just use <span style="font-family: courier;"><32 x b8></span> instead. And if poison is indeed bitwise, then certainly pointer provenance would also have to be bitwise!<br /><br />Finally, no design discussion is complete without a little bit of bike-shedding. I believe <b>the name "byte"</b> is inspired by C++'s <span style="font-family: courier;">std::byte</span>, but given that types such as <span style="font-family: courier;">b256</span> are possible, this name would forever be a source of confusion. Naming is hard, and I think we should at least try to look for a better one. Let me kick off the brainstorming by suggesting we think of them as "memory content" values, because that's what they are. The types could be spelled <span style="font-family: courier;">m8</span>, <span style="font-family: courier;">m32</span>, etc. in IR assembly.<br /><br /></p><h2 style="text-align: left;">A variation: adding a pointer provenance type</h2><p style="text-align: left;">In the llvm-dev thread, Jeroen Dobbelaere points out <a href="https://reviews.llvm.org/D68488">work being done to introduce explicit `ptr_provenance` operands on certain instructions</a>, in service of C99's <span style="font-family: courier;">restrict</span> keyword. I haven't properly digested this work, but it inspired the thoughts of this section.<br /><br />Values of the proposed byte types have both a bit pattern and a pointer provenance. 
Do we really need to have both pieces of information in the same SSA value? We could instead split them up into an integer bit pattern value and a pointer provenance value with an explicit <span style="font-family: courier;">provenance</span> type. Loads of integers could read out the provenance information stored in memory and provide it as a secondary result. Similarly, stores of integers could accept the desired provenance to be stored in memory as a secondary data operand. This would allow us to write a perfect memcpy by replacing the core <span style="font-family: courier;">load</span>/<span style="font-family: courier;">store</span> sequence with something like:<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">%ch, %provenance = load_with_provenance i8, i8* %src</span><br /><span style="font-family: courier;">store_with_provenance i8 %ch, provenance %provenance, i8* %dst</span><br /></p><p style="text-align: left;"><span style="font-family: courier;"></span>The syntax and instruction names in the example are very much straw men. Don't take them too seriously, especially because LLVM IR doesn't currently allow multiple result values.<br /><br />Interestingly, this split allows the derivation of pointer provenance to follow a different path than the calculation of the pointer's bit pattern. This in turn allows us in principle to perform GVN on pointers without being conservative for alias analysis.<br /><br />One of the steps in bug 37469 is not quite GVN, but morally similar. 
Simplifying a lot, the original program sequence:<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">%ch1 = load i8, i8* %p1</span><br /><span style="font-family: courier;">%ch2 = load i8, i8* %p2</span><br /><span style="font-family: courier;">%eq = icmp eq i8 %ch1, %ch2</span><br /><span style="font-family: courier;">%ch = select i1 %eq, i8 %ch1, i8 %ch2</span><br /><span style="font-family: courier;">store i8 %ch, i8* %p3</span><br /></p><p style="text-align: left;"><span style="font-family: courier;"></span>... is transformed into:<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">%ch2 = load i8, i8* %p2</span><br /><span style="font-family: courier;">store i8 %ch2, i8* %p3</span><br /></p><p style="text-align: left;"><span style="font-family: courier;"></span>This is correct for the bit patterns being loaded and stored, but the program also indirectly relies on pointer provenance of the data. Of course, there is no pointer provenance information being copied here because <span style="font-family: courier;">i8</span> only holds a bit pattern. 
However, with the "byte" proposal, all the <span style="font-family: courier;">i8</span>s would be replaced by <span style="font-family: courier;">b8</span>s, and then the transform becomes incorrect because it changes the provenance information.<br /><br />If we split the proposed use of <span style="font-family: courier;">b8</span> into a use of <span style="font-family: courier;">i8</span> and explicit provenance values, the original program becomes:<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">%ch1, %prov1 = load_with_provenance i8, i8* %p1</span><br /><span style="font-family: courier;">%ch2, %prov2 = load_with_provenance i8, i8* %p2</span><br /><span style="font-family: courier;">%eq = icmp eq i8 %ch1, %ch2</span><br /><span style="font-family: courier;">%ch = select i1 %eq, i8 %ch1, i8 %ch2</span><br /><span style="font-family: courier;">%prov = select i1 %eq, provenance %prov1, provenance %prov2</span><br /><span style="font-family: courier;">store_with_provenance i8 %ch, provenance %prov, i8* %p3</span><br /></p><p style="text-align: left;"><span style="font-family: courier;"></span>This could be transformed into something like:<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">%prov1 = load_only_provenance i8* %p1</span><br /><span style="font-family: courier;">%ch2, %prov2 = load_with_provenance i8, i8* %p2</span><br /><span style="font-family: courier;">%prov = merge provenance %prov1, %prov2</span><br /><span style="font-family: courier;">store_with_provenance i8 %ch2, provenance %prov, i8* %p3</span><br /></p><p style="text-align: left;"><span style="font-family: courier;"></span>... 
which is just as good for code generation but loses only very little provenance information.<br /><br /></p><h2 style="text-align: left;">Aside: loop idioms</h2><p style="text-align: left;">Without major changes to LLVM IR, a perfect <span style="font-family: courier;">memcpy</span> cannot be implemented because pointer provenance information is lost.<br /><br />Nevertheless, one could still define the <span style="font-family: courier;">@llvm.memcpy</span> intrinsic to be a perfect copy. This allows alias analysis to be less conservative around <span style="font-family: courier;">memcpy</span>s that appear in the original source program. However, it also makes it incorrect to replace a memcpy loop idiom with a use of <span style="font-family: courier;">@llvm.memcpy</span>: without adding <span style="font-family: courier;">unrestrict</span> instructions, the replacement may introduce undefined behavior; and there is no way to bound the locations where such <span style="font-family: courier;">unrestrict</span>s may be needed.<br /><br />We could augment <span style="font-family: courier;">@llvm.memcpy</span> with an immediate argument that selects its provenance behavior.<br /><br />In any case, one can argue that bug 37469 is really a bug in the loop idiom recognizer. It boils down to the details of how everything is defined, and unfortunately, these weird corner cases are currently underspecified in the LangRef.<br /><br /></p><h2 style="text-align: left;">Conclusion</h2><p style="text-align: left;">We started with the question of whether <span style="font-family: courier;">memcpy</span> can be implemented in LLVM IR. The answer is a qualified Yes. It is possible, but the resulting copy is imperfect because pointer provenance information is lost. 
This has surprising implications which in turn happen to cause real miscompilation bugs -- although those bugs could be fixed even without a perfect <span style="font-family: courier;">memcpy</span>.<br /><br />The "byte" proposal has a certain aesthetic appeal because it fixes a real gap in the expressiveness of LLVM IR, but its software engineering cost is large and I object to some of its details. There are also alternatives to consider.<br /><br />The miscompilation bugs obviously need to be fixed, but they can be fixed much less intrusively, albeit at the cost of more conservative alias analysis in the affected places. It is not clear to me whether improving alias analysis justifies the more complex solutions.<br /><br />I would like to understand better how all of this interacts with the C99 <span style="font-family: courier;">restrict</span> work. That work introduces mechanisms for explicitly talking about pointer provenance in the IR, which may allow us to kill two birds with one stone.<br /><br />In any case, this is a fascinating topic and discussion, and I feel like we're only at the beginning.<br /></p><p style="text-align: left;"><br /></p><h2 style="text-align: left;">Addendum: storing back previously loaded integers<br /></h2><p style="text-align: left;"><i>(Added this section on June 14th)</i> <br /></p><p style="text-align: left;">Harald van Dijk <a href="https://reviews.llvm.org/D104013#inline-987423">on Phabricator</a> and Ralf Jung on llvm-dev, referring to a <a href="https://github.com/rust-lang/unsafe-code-guidelines/issues/286#issuecomment-860189806">Rust issue</a>, explicitly and implicitly point out a curious issue with loading and storing integers.</p><p style="text-align: left;">Here is Harald's example:</p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">define i8* @f(i8* %p) {</span><br /><span style="font-family: courier;"> %buf = alloca i8*</span><br /><span style="font-family: courier;"> %buf.i32 
= bitcast i8** %buf to i32*</span><br /><span style="font-family: courier;"> store i8* %p, i8** %buf</span><br /><span style="font-family: courier;"> %i = load i32, i32* %buf.i32</span><br /><span style="font-family: courier;"> store i32 %i, i32* %buf.i32</span><br /><span style="font-family: courier;"> %q = load i8*, i8** %buf</span><br /><span style="font-family: courier;"> ret i8* %q</span><br /><span style="font-family: courier;">}</span><br /></p><p><span style="font-family: courier;"></span>There is a pair of <span style="font-family: courier;">load</span>/<span style="font-family: courier;">store</span> of <span style="font-family: courier;">i32</span> which is fully redundant from a machine perspective and so we'd like to optimize that away, after which it becomes obvious that the function really just returns <span style="font-family: courier;">%p</span> -- at least as far as bit patterns are concerned.</p><p class="remarkup-code" style="text-align: left;">However, in a world where memory is untyped but has provenance information, this optimization is incorrect because it can introduce undefined behavior: the <span style="font-family: courier;">load</span>/<span style="font-family: courier;">store</span> of <span style="font-family: courier;">i32</span> resets the provenance information in memory to "all", so that the original function returns an unrestricted version of <span style="font-family: courier;">%p</span>. This is no longer the case after the optimization.</p><p class="remarkup-code" style="text-align: left;">There are at least two possible ways of resolving this conflict.</p><p class="remarkup-code" style="text-align: left;">We could define memory to be typed, in the sense that each byte of memory remembers whether it was most recently stored as a pointer or a non-po<span>inter. A load with the wrong type returns poison. 
In that case, the example above returns poison before the optimization (because <span style="font-family: courier;">%i</span> is guaranteed to be poison). After the optimization it returns non-poison, which is an acceptable refinement, so the optimization is correct.</span></p><p class="remarkup-code" style="text-align: left;"><span>The alternative is to keep memory untyped and say that directly eliminating the <span style="font-family: courier;">i32</span> <span style="font-family: courier;">store</span> in the example is incorrect.</span></p><p class="remarkup-code" style="text-align: left;"><span>We are facing a tradeoff that depends on how important that optimization is for performance.</span></p><p class="remarkup-code" style="text-align: left;"><span>Two observations to that end. First, the more common case of dead store elimination is one where there are multiple stores to the same address in a row, and we remove all but the last one of them. That more common optimization is unaffected by provenance issues either way.</span></p><p class="remarkup-code" style="text-align: left;"><span>Second, we can still perform store forwarding / peephole optimization across such <span style="font-family: courier;">load</span>/<span style="font-family: courier;">store</span> pairs, as long as we are careful to introduce <span style="font-family: courier;">unrestrict</span> where needed. 
The example above can be optimized via store forwarding to:</span></p><p style="margin-left: 40px; text-align: left;"><span><span style="font-family: courier;">define i8* @f(i8* %p) {</span><br /><span style="font-family: courier;"> %buf = alloca i8*</span><br /><span style="font-family: courier;"> %buf.i32 = bitcast i8** %buf to i32*</span><br /><span style="font-family: courier;"> store i8* %p, i8** %buf</span><br /><span style="font-family: courier;"> %i = load i32, i32* %buf.i32</span><br /><span style="font-family: courier;"> store i32 %i, i32* %buf.i32</span><br /><span style="font-family: courier;"> %q = <b>unrestrict i8* %p</b></span><br /><span style="font-family: courier;"> ret i8* %q</span><br /><span style="font-family: courier;">}</span></span></p><p style="text-align: left;">We can then dead-code eliminate the bulk of the function and obtain:</p><p style="margin-left: 40px; text-align: left;"><span><span style="font-family: courier;">define i8* @f(i8* %p) {<br /></span><span style="font-family: courier;"> %q = <b>unrestrict i8* %p</b></span><br /><span style="font-family: courier;"> ret i8* %q</span><br /><span style="font-family: courier;">}</span></span></p><p style="text-align: left;">... which is as good as it can possibly get. </p><p style="text-align: left;">So, there is a good chance that preventing this particular optimization is relatively cheap in terms of code quality, and the gain in overall design simplicity may well be worth it.<br /></p><p style="text-align: left;"><br /></p><p style="text-align: left;"><br /></p><p style="text-align: left;"><br /></p><p style="text-align: left;">[0] We could also say that the loaded pointer's provenance is magically the memory object that happens to be at the referenced memory address. Either way, provenance would become a useless no-op in most cases. 
For example, mem2reg would have to insert <span style="font-family: courier;">unrestrict</span> instructions (defined later) everywhere because pointers become effectively "unrestricted" when loaded from alloca'd memory.<br /><br /></p><p></p>Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com1tag:blogger.com,1999:blog-36137506.post-82024415261336204532020-06-25T00:34:00.000+02:002020-06-25T00:34:24.405+02:00They want to be small, they want to be big: thoughts on code reviews and the power of patch seriesCode reviews are a central fact of life in software development. It's important to do them well, and developer quality of life depends on a good review workflow.<br />
<br />
Unfortunately, code reviews also appear to be a difficult problem. Many projects are bottlenecked by code reviews, in that reviewers are hard to find and progress gets slowed down by having to wait a long time for reviews.<br />
<br />
The "solution" that I've often seen applied in practice is to have lower-quality code reviews. Reviewers don't attempt to gain a proper understanding of a change, so reviews become shallower and therefore easier. This is convenient on the surface, but more likely to allow bad code through: a subtle corner case that isn't covered by tests (yet?) may be missed, a relevant underlying spec may be misunderstood, a bad design decision may slip through, and so on. This is bound to cause pain later on.<br />
<br />
I've experienced a number of different code review workflows in practice, based on a number of tools: GitHub PRs and GitLab MRs, Phabricator, and the e-mail review of patch series that is the <i>original</i> workflow for which Git was designed. Of those, the e-mail review flow produced the highest quality. There may be confounding factors, such as the nature of the projects and the selection of developers working on them, but quality issues aside I certainly feel that the e-mail review flow was the <i>most pleasant</i> to work with. Over time I've been thinking and having discussions a lot about just why that is. I feel that I have distilled it to two key factors, which I've decided to write down here so I can just link to this blog post in the future.<br />
<br />
First, the UI experience of e-mail is a lot nicer. All of the alternatives are ultimately web-based, and their UI latency is universally terrible. Perhaps I'm particularly sensitive, but I just cannot stand web UIs for serious work. Give me something that reacts to <i>all</i> input reliably in under 50ms and I will be much happier. E-mail achieves that, web UIs don't. Okay, to be fair, e-mail is probably more in the 100ms range given the general state of the desktop. The point is, web UIs are about an order of magnitude worse. It's incredibly painful. (All of this assumes that you're using a decent native e-mail client. Do yourself a favor and give that a try if you haven't. The CLI warriors all have their favorites, but frankly Thunderbird works just fine. Outlook doesn't.)<br />
<br />
Second, I've come to realize that there are conflicting goals in review granularity that e-mail happens to address pretty well, but none of the alternatives do a good job of it. Most of the alternatives don't even seem to understand that there is a problem to begin with! Here's the thing:<br />
<br />
<b>Reviews want to be small.</b> The smaller and the more self-contained a change is, the easier it is to wrap your head around and judge. If you do a big refactor that is supposed to have no functional impact, followed by a separate small functional change that is enabled by the refactor, then each change individually is much easier to review. Approving changes at a fine granularity also helps ensure that you've really thought through each individual change and that each change has a reasonable justification. Important details don't get lost in something larger.<br />
<br />
<b>Reviews want to be big.</b> A small, self-contained change can be difficult to understand and judge in isolation. You're doing a refactor that moves a function somewhere else? Fine, it's easy to tell that the change is correct, but is it a <i>good</i> change? To judge that, you often need to understand how the refactored result ends up being used in later changes, so it's good to see all those changes at once. Keep in mind though that you don't necessarily have to <i>approve</i> them at the same time. It's entirely possible to say, yes, that refactor looks good, we can go ahead with that, but please fix the way it's being used in a subsequent change.<br />
<br />
There is another reason why reviews want to be big. Code reviews have a mental context-switching overhead. As a reviewer, you need to think yourself into the affected code in order to judge it well. If you do many reviews, you typically need to context-switch between each review. This can be very taxing mentally and ultimately unpleasant. A similar, though generally smaller, context-switching overhead applies to the author of the change as well: let's say you send out some changes for review, then go off and do another thing, and come back a day or two later to some asynchronously written reviews. In order to respond to the review, you may now have to page the context of that change back in. The point of all this is that when reviews are big, the context-switching overhead gets amortized better, i.e. the cost per change drops.<br />
<br />
<b>Reviews want to be both small and big.</b> Guess what, patch series solve that problem! You get to review an entire feature implementation in the form of a patch series at once, so your context-switching overhead is reduced and you can understand how the different parts of the change play together. At the same time, you can drill down into individual patches and review those. Two levels of detail are available simultaneously.<br />
<br />
So why e-mail? Honestly, I don't know. Given that the original use case for Git is based on patch series review, it's mind-boggling in a bad way that web-based Git hosting and code review systems do such a poor job of it, if they handle it at all.<br />
<br />
<b>Gerrit</b> is the only system I know of that really takes patch series as an idea seriously, but while I am using it occasionally, I haven't had the opportunity to really stress it. Unfortunately, most people don't even want to consider Gerrit as an option because it's ugly.<br />
<br />
Phabricator's stacks are a pretty decent attempt and I've made good use of them in the context of LLVM. However, they're too hidden and clumsy to navigate. Both Phabricator and Gerrit lack a mechanism for discussing a patch series as a whole.<br />
<br />
GitHub and GitLab? They're basically unusable. Yes, you can look at individual commits, but then GitHub doesn't even display the commits in the correct order: they're sorted by commit or author date, not by the Git DAG order, which is an obviously and astonishingly bad idea. Comments tend to get lost when authors rebase, which is what authors <i>should</i> do in order to ensure a clean history, and actually <i>reviewing</i> an individual commit is impossible. Part of the power of patch series is the ability to say: "Patches 1-6, 9, and 11 are good to go, the rest needs work."<br />
<br />
Oh, and by the way: Commit messages? They're important! <b>Gerrit</b> again is the only system I know of that allows review comments on commit messages. It's as if the authors of Gerrit are the only ones who really understood the problem space. Unfortunately, they seem to lack business and web design skills, and so we ended up in the mess we're in right now.<br />
<br />
Mind you, even if the other players got their act together and supported the workflow properly, there'd still be the problem of UI latency. One can dream...Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com5tag:blogger.com,1999:blog-36137506.post-67831708975966198352019-02-05T11:03:00.000+01:002019-02-05T11:03:00.108+01:00FOSDEM talk on TableGenVideo and slides for my talk in the LLVM devroom on TableGen are now available <a href="https://fosdem.org/2019/schedule/event/llvm_tablegen/">here</a>.<br />
<br />
Now I only need the time and energy to continue my <a href="http://nhaehnle.blogspot.com/2018/02/tablegen-1-what-has-tablegen-ever-done.html">blog series</a> on the topic...Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-15366823612704181982018-03-09T11:25:00.000+01:002018-03-09T11:25:45.304+01:00TableGen #5: DAGs<i>This is the fifth part of a series; see the <a href="http://nhaehnle.blogspot.de/2018/02/tablegen-1-what-has-tablegen-ever-done.html">first part</a> for a table of contents.</i><br />
<br />
With <a href="http://nhaehnle.blogspot.de/2018/02/tablegen-3-bits.html">bit sequences</a>, we have already seen one unusual feature of TableGen that is geared towards its specific purpose. <i>DAG nodes</i> are another; they look a bit like S-expressions:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">def op1;<br />def op2;<br />def i32;<br /><br />def Example {<br /> dag x = (op1 $foo, (op2 i32:$bar, "Hi"));<br />}</span></blockquote>
In the example, there are two <i>DAG nodes</i>, represented by a <span style="font-family: "courier new" , "courier" , monospace;">DagInit</span> object in the code. The first node has as its <i>operation</i> the record <span style="font-family: "courier new" , "courier" , monospace;">op1</span>. The operation of a DAG node must be a record, but there are no other restrictions. This node has two children or <i>arguments</i>: the first argument is named <span style="font-family: "courier new" , "courier" , monospace;">foo</span> but has no value. The second argument has no name, but it does have another DAG node as its value.<br />
<br />
This second DAG node has the operation <span style="font-family: "courier new" , "courier" , monospace;">op2</span> and two arguments. The first argument is named <span style="font-family: "courier new" , "courier" , monospace;">bar</span> and has value <span style="font-family: "courier new" , "courier" , monospace;">i32</span>, the second has no name and value <span style="font-family: "courier new" , "courier" , monospace;">"Hi"</span>.<br />
<br />
DAG nodes can have any number of arguments, and they can be nested arbitrarily. The values of arguments can have any type, at least as far as the TableGen frontend is concerned. So DAGs are an extremely free-form way of representing data, and they are really only given meaning by TableGen backends.<br />
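For example, the following is accepted by the TableGen frontend even though no backend would give it any meaning (hypothetical records, purely to illustrate how free-form the syntax is):

```tablegen
def a;
def b;

def FreeForm {
  // Arguments can be nested dags, named or unnamed, and even name-only;
  // the frontend checks only the syntax, not any semantics.
  dag y = (a (b "leaf", 5):$inner, $unset);
}
```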
<br />
There are three main uses of DAGs:<br />
<ol>
<li>Describing the operands on machine instructions.</li>
<li>Describing patterns for instruction selection.</li>
<li>Describing register files with something called "set theory".</li>
</ol>
I have not yet had the opportunity to explore the last point in detail, so I will only give an overview of the first two uses here.<br />
<br />
Describing the operands of machine instructions is fairly straightforward at its core, but the details can become quite elaborate.<br />
<br />
I will illustrate some of this with the example of the <span style="font-family: "courier new" , "courier" , monospace;">V_ADD_F32</span> instruction from the AMDGPU backend. <span style="font-family: "courier new" , "courier" , monospace;">V_ADD_F32</span> is a standard 32-bit floating point addition, at least in its 32-bit-encoded variant, which the backend represents as <span style="font-family: "courier new" , "courier" , monospace;">V_ADD_F32_e32</span>.<br />
<br />
Let's take a look at some of the fully resolved records produced by the TableGen frontend:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">def V_ADD_F32_e32 { // Instruction AMDGPUInst ...<br /> dag OutOperandList = (outs anonymous_503:$vdst);<br /> dag InOperandList = (ins VSrc_f32:$src0, VGPR_32:$src1);<br /> string AsmOperands = "$vdst, $src0, $src1";<br /> ...<br />}</span><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;"><br /><br />def anonymous_503 { // DAGOperand RegisterOperand VOPDstOperand<br /> RegisterClass RegClass = VGPR_32;<br /> string PrintMethod = "printVOPDst";<br /> ...<br />}</span></span></blockquote>
As you'd expect, there is one <i>out</i> operand. It is named <i>vdst</i> and an anonymous record is used to describe more detailed information such as its register class (a 32-bit general purpose vector register) and the name of a special method for printing the operand in textual assembly output. (The string <span style="font-family: "courier new" , "courier" , monospace;">"printVOPDst"</span> will be used by the backend that generates the bulk of the instruction printer code, and refers to the method <span style="font-family: "courier new" , "courier" , monospace;">AMDGPUInstPrinter::printVOPDst</span> that is implemented manually.)<br />
<br />
There are two <i>in</i> operands. <i>src1</i> is a 32-bit general purpose vector register and requires no special handling, but <i>src0</i> supports more complex operands as described in the record <span style="font-family: "courier new" , "courier" , monospace;">VSrc_f32</span> elsewhere.<br />
<br />
Also note the string <span style="font-family: "courier new" , "courier" , monospace;">AsmOperands</span>, which is used as a template for the automatically generated instruction printer code. The operand names in that string refer to the names of the operands as defined in the DAG nodes.<br />
<br />
This was a nice warmup, but didn't really demonstrate the full power and flexibility of DAG nodes. Let's look at <span style="font-family: "courier new" , "courier" , monospace;">V_ADD_F32_e64</span>, the 64-bit encoded version,<span style="font-family: "courier new" , "courier" , monospace;"></span> which has some additional features: the sign bits of the inputs can be reset or inverted, and the result (output) can be clamped and/or scaled by some fixed constants (0.5, 2, and 4). This will seem familiar to anybody who has worked with the old <a href="https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_fragment_program.txt">OpenGL assembly program extensions</a> or with DirectX shader assembly.<br />
<br />
The fully resolved records produced by the TableGen frontend are quite a bit more involved:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">def V_ADD_F32_e64 { // Instruction AMDGPUInst ...<br /> dag OutOperandList = (outs anonymous_503:$vdst);<br /> dag InOperandList =<br /> (ins FP32InputMods:$src0_modifiers, VCSrc_f32:$src0,<br /> FP32InputMods:$src1_modifiers, VCSrc_f32:$src1,<br /> clampmod:$clamp, omod:$omod);<br /> string AsmOperands = "$vdst, $src0_modifiers, $src1_modifiers$clamp$omod";<br /> list<dag> Pattern =<br /> [(set f32:$vdst, (fadd<br /> (f32 (VOP3Mods0 f32:$src0, i32:$src0_modifiers,<br /> i1:$clamp, i32:$omod)),<br /> (f32 (VOP3Mods f32:$src1, i32:$src1_modifiers))))];<br /> ...<br />}<br /><br />def FP32InputMods { // DAGOperand Operand InputMods FPInputMods<br /> ValueType Type = i32;<br /> </span><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;">string PrintMethod = "printOperandAndFPInputMods";<br /> </span></span><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;"> AsmOperandClass ParserMatchClass = FP32InputModsMatchClass;<br /> ...<br />}</span><br /><br />def FP32InputModsMatchClass { // AsmOperandClass FPInputModsMatchClass<br /> string Name = "RegOrImmWithFP32InputMods";<br /> string PredicateMethod = "isRegOrImmWithFP32InputMods";<br /> string ParserMethod = "parseRegOrImmWithFPInputMods";<br /> ...<br />}</span></blockquote>
The <i>out</i> operand hasn't changed, but there are now many more special <i>in</i> operands that describe whether those additional features of the instruction are used.<br />
<br />
You can again see how records such as <span style="font-family: "courier new" , "courier" , monospace;">FP32InputMods</span> refer to manually implemented methods. Also note that the <span style="font-family: "courier new" , "courier" , monospace;">AsmOperands</span> string no longer refers to <i>src0</i> or <i>src1</i>. Instead, the <span style="font-family: "courier new" , "courier" , monospace;">printOperandAndFPInputMods</span> method on <i>src0_modifiers</i> and <i>src1_modifiers</i> will print the source operand together with its sign modifiers. Similarly, the special <span style="font-family: "courier new" , "courier" , monospace;">ParserMethod</span> <span style="font-family: "courier new" , "courier" , monospace;">parseRegOrImmWithFPInputMods</span> will be used by the assembly parser.<br />
<br />
This kind of extensibility by combining generic automatically generated code with manually implemented methods is used throughout the TableGen backends for code generation.<br />
<br />
Something else is new here: the <span style="font-family: "courier new" , "courier" , monospace;">Pattern</span>. This pattern, together with all the other patterns defined elsewhere, is compiled into a giant domain-specific bytecode that executes during instruction selection to turn the <a href="https://llvm.org/docs/CodeGenerator.html#instruction-selection-section">SelectionDAG</a> into machine instructions. Let's take this particular pattern apart:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">(set f32:$vdst, (fadd ...))</span></blockquote>
We will match an <span style="font-family: "courier new" , "courier" , monospace;">fadd</span> selection DAG node that outputs a 32-bit floating point value, and this output will be linked to the out operand <i>vdst</i>. (<span style="font-family: "courier new" , "courier" , monospace;">set</span>, <span style="font-family: "courier new" , "courier" , monospace;">fadd</span> and many others are defined in the target-independent <a href="https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Target/TargetSelectionDAG.td"><span style="font-family: "courier new" , "courier" , monospace;">include/llvm/Target/TargetSelectionDAG.td</span></a>.)<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">(fadd (f32 (VOP3Mods0 f32:$src0, i32:$src0_modifiers,<br /> i1:$clamp, i32:$omod)),<br /> (f32 (VOP3Mods f32:$src1, i32:$src1_modifiers)))</span></blockquote>
Both input operands of the fadd node must be 32-bit floating point values, and they will be handled by <i>complex</i> patterns. Here's one of them:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">def VOP3Mods { // ComplexPattern<br /> string SelectFunc = "SelectVOP3Mods";<br /> int NumOperands = 2;<br /> ...<br />} </span></blockquote>
As you'd expect, there's a manually implemented <span style="font-family: "courier new" , "courier" , monospace;">SelectVOP3Mods</span> method. Its signature is<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">bool SelectVOP3Mods(SDValue In, SDValue &Src,<br /> SDValue &SrcMods) const;</span></blockquote>
It can reject the match by returning <i>false</i>; otherwise, it pattern-matches a single input SelectionDAG node into nodes that will be placed into <i>src1</i> and <i>src1_modifiers</i> in the particular pattern we were studying.<br />
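To give a concrete feel for what such a method does, here is a toy sketch in the same spirit (the node type, the modifier constant, and the function are all simplified stand-ins, not the real AMDGPU implementation): it peels a modifier operation like <span style="font-family: "courier new" , "courier" , monospace;">fneg</span> off the input and records it as a bit in the modifiers operand.

```cpp
#include <string>

// Toy stand-in for a SelectionDAG node: just an opcode name and one operand.
struct Node {
  std::string op;
  const Node *operand;
};

// Hypothetical modifier bit; the real backend uses a dedicated bitfield.
constexpr unsigned kNegBit = 1;

// Sketch of a SelectVOP3Mods-style matcher: strip an fneg off the input,
// remember it in the modifiers word, and report the bare source operand.
bool selectMods(const Node &in, const Node *&src, unsigned &mods) {
  mods = 0;
  src = &in;
  if (in.op == "fneg" && in.operand != nullptr) {
    mods |= kNegBit;
    src = in.operand;
  }
  return true; // this toy matcher accepts every input
}
```

The real method also has to deal with the "reset sign bit" (absolute value) modifier mentioned above and with the different kinds of acceptable operands, but the peel-and-record structure is the same.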
<br />
Patterns can be arbitrarily complex, and they can be defined outside of instructions as well. For example, here's a pattern for generating the <span style="font-family: "courier new" , "courier" , monospace;">S_BFM_B32</span> instruction, which generates a bitfield mask:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;"> def anonymous_2373anonymous_2371 { // Pattern Pat ...<br /> dag PatternToMatch =<br /> (i32 (shl (i32 (add (i32 (shl 1, i32:$a)), -1)), i32:$b));<br /> list<dag> ResultInstrs = [(S_BFM_B32 ?:$a, ?:$b)];<br /> ...<br />}</span></blockquote>
The name of this record doesn't matter. The instruction selection TableGen backend simply looks for all records that have <span style="font-family: "courier new" , "courier" , monospace;">Pattern</span> as a superclass. In this case, we match an expression of the form <span style="font-family: "courier new" , "courier" , monospace;">((1 << a) - 1) << b</span> on 32-bit integers into a single machine instruction.<br />
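As a quick sanity check of the arithmetic this pattern encodes, here is the matched expression in plain C++ (the helper name is mine): <span style="font-family: "courier new" , "courier" , monospace;">((1 << a) - 1) << b</span> produces a mask of <i>a</i> consecutive one-bits starting at bit <i>b</i>, assuming the shifts stay within the 32-bit width.

```cpp
#include <cstdint>

// ((1 << a) - 1) << b: 'a' consecutive one-bits, shifted up by 'b' places.
// This is exactly the expression the S_BFM_B32 pattern above matches.
uint32_t bitfieldMask(uint32_t a, uint32_t b) {
  return ((UINT32_C(1) << a) - 1) << b;
}
```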
<br />
So far, we've mostly looked at how DAGs are interpreted by some of the key backends of TableGen. As it turns out, most backends generate their DAGs in a fairly static way, but there are some fancier techniques that can be used as well. This post is already quite long though, so we'll look at those in the next post.Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-7801569941191880772018-03-06T14:17:00.000+01:002018-03-06T14:17:33.270+01:00TableGen #4: Resolving variables<i>This is the fourth part of a series; see the <a href="http://nhaehnle.blogspot.de/2018/02/tablegen-1-what-has-tablegen-ever-done.html">first part</a> for a table of contents.</i><br />
<br />
It's time to look at some of the guts of TableGen itself. TableGen is split into a frontend, which parses the TableGen input, instantiates all the records, resolves variable references, and so on, and many different backends that generate code based on the instantiated records. In this series I'll be mainly focusing on the frontend, which lives in <span style="font-family: "courier new" , "courier" , monospace;">lib/TableGen/</span> inside the LLVM repository, e.g. <a href="https://github.com/llvm-mirror/llvm/tree/master/lib/TableGen">here on the GitHub mirror</a>. The backends for LLVM itself live in <a href="https://github.com/llvm-mirror/llvm/tree/master/utils/TableGen"><span style="font-family: "courier new" , "courier" , monospace;">utils/TableGen/</span></a>, together with the command line tool's main() function. Clang also has <a href="https://github.com/llvm-mirror/clang/tree/master/utils/TableGen">its own backends</a>.<br />
<br />
<i></i>
Let's revisit what kind of variable references there are and what kind of resolving needs to be done with an example:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">class Foo<int src> {<br /> int Src = src;<br /> int Offset = 1;<br /> int Dst = !add(Src, Offset);<br />}<br /><br />multiclass Foos<int src> {<br /> def a : Foo<src>;<br /> let Offset = 2 in<br /> def b : Foo<src>;<br />}<br /><br />foreach i = 0-3 in<br />defm F#i : Foos<i>;</span></blockquote>
This is actually broken in older versions of LLVM by one of its many bugs, but it clearly should work given the features that are generally available, and <a href="https://reviews.llvm.org/D44108">with my patch series</a> it does work in the natural way. We see four kinds of variable references:<br />
<ul>
<li>internally within a record, such as the initializer of <span style="font-family: "courier new" , "courier" , monospace;">Dst</span> referencing <span style="font-family: "courier new" , "courier" , monospace;">Src</span> and <span style="font-family: "courier new" , "courier" , monospace;">Offset</span></li>
<li>to a class template variable, such as <span style="font-family: "courier new" , "courier" , monospace;">Src</span> being initialized by <span style="font-family: "courier new" , "courier" , monospace;">src</span></li>
<li>to a multiclass template variable, such as <span style="font-family: "courier new" , "courier" , monospace;">src</span> being passed as a template argument for <span style="font-family: "courier new" , "courier" , monospace;">Foo</span></li>
<li>to a <span style="font-family: "courier new" , "courier" , monospace;">foreach</span> iteration variable</li>
</ul>
As an aside, keep in mind that <span style="font-family: "courier new" , "courier" , monospace;">let</span> in TableGen does not mean the same thing as in the many functional programming languages that have a similar construct. In those languages <span style="font-family: "courier new" , "courier" , monospace;">let</span> introduces a new variable, but TableGen's <span style="font-family: "Courier New", Courier, monospace;">let</span> instead overrides the value of a variable that has already been defined elsewhere. In the example above, the <span style="font-family: "courier new" , "courier" , monospace;">let</span>-statement causes the value of <span style="font-family: "courier new" , "courier" , monospace;">Offset</span> to be changed in the record that was instantiated from the <span style="font-family: "courier new" , "courier" , monospace;">Foo</span> class to create the <span style="font-family: "courier new" , "courier" , monospace;">b</span> prototype inside multiclass <span style="font-family: "courier new" , "courier" , monospace;">Foos</span>.<br />
<br />
TableGen internally represents variable references as instances of the <span style="font-family: "courier new" , "courier" , monospace;">VarInit</span> class, and the variables themselves are simply referenced by name. This causes some embarrassing issues around template arguments which are papered over by qualifying the variable name with the template name. If you pass the above example through a sufficiently fixed version of llvm-tblgen, one of the outputs will be the description of the Foo class:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">class Foo<int Foo:src = ?> {<br /> int Src = Foo:src;<br /> int Offset = 1;<br /> int Dst = !add(Src, Offset);<br /> string NAME = ?;<br />}</span></blockquote>
As you can see, <span style="font-family: "courier new" , "courier" , monospace;">Foo:src</span> is used to refer to the template argument. In fact, the template arguments of both classes and multiclasses are temporarily added as variables to their respective prototype records. When the class or prototype in a multiclass is instantiated, all references to the template argument variables are resolved fully, and the variables are removed (or rather, some of them are removed, and making that consistent is one of the many things I set out to clean up).<br />
<br />
Similarly, references to <span style="font-family: "courier new" , "courier" , monospace;">foreach</span> iteration variables are resolved when records are instantiated, although those variables aren't similarly qualified. If you want to learn more about how variable names are looked up, <span style="font-family: "courier new" , "courier" , monospace;">TGParser::ParseIDValue</span> is a good place to start.<br />
<br />
The order in which variables are resolved is important. In order to achieve the flexibility of overriding defaults with <span style="font-family: "courier new" , "courier" , monospace;">let</span>-statements, internal references among record variables must be resolved after template arguments.<br />
<br />
Actually resolving variable references used to be done by the implementations of the following virtual method of the <span style="font-family: "courier new" , "courier" , monospace;">Init</span> class hierarchy (which represents initializers, i.e. values and expressions):<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">virtual Init *resolveReferences(Record &R, const RecordVal *RV) const;</span></blockquote>
This method recursively resolves references in the constituent parts of the expression, performs constant folding, and returns the resulting value (or the original value if nothing could be resolved). Its interface is somewhat magical: <span style="font-family: "courier new" , "courier" , monospace;">R</span> represents the "current" record, which is used as a frame of reference for magical lookups in the implementation of <span style="font-family: "courier new" , "courier" , monospace;">!cast</span>; this is a topic for another time, though. At the same time, references to the variables of <span style="font-family: "courier new" , "courier" , monospace;">R</span> are supposed to be resolved, but only if <span style="font-family: "courier new" , "courier" , monospace;">RV</span> is null. If <span style="font-family: "courier new" , "courier" , monospace;">RV</span> is non-null, then only references to that specific variable are supposed to be resolved. Additionally, some behaviors around the unset value (<span style="font-family: "courier new" , "courier" , monospace;">?</span>) depend on this.<br />
<br />
This is replaced in my changes with<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">virtual Init *resolveReferences(Resolver &R) const;</span></blockquote>
where <span style="font-family: "courier new" , "courier" , monospace;">Resolver</span> is an abstract base class / interface which can lookup values based on their variable names:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">class Resolver {<br /> Record *CurRec;<br /><br />public:<br /> explicit Resolver(Record *CurRec) : CurRec(CurRec) {}<br /> virtual ~Resolver() {}<br /><br /> Record *getCurrentRecord() const { return CurRec; }<br /> virtual Init *resolve(Init *VarName) = 0;<br /> virtual bool keepUnsetBits() const { return false; }<br />}; </span></blockquote>
The "current record" is used as a reference for the aforementioned magical <span style="font-family: "courier new" , "courier" , monospace;">!cast</span>s, and <span style="font-family: "courier new" , "courier" , monospace;">keepUnsetBits</span> instructs the implementation of bit sequences in <span style="font-family: "Courier New", Courier, monospace;">BitsInit</span> not to resolve to ? (as was explained in <a href="https://nhaehnle.blogspot.de/2018/02/tablegen-3-bits.html">the third part of the series</a>). <span style="font-family: "courier new" , "courier" , monospace;">resolve</span> itself is implemented by one of the subclasses, most notably:<br />
<ol>
<li><span style="font-family: "courier new" , "courier" , monospace;">MapResolver</span>: Resolve based on a dictionary of name-value pairs.</li>
<li><span style="font-family: "courier new" , "courier" , monospace;">RecordResolver</span>: Resolve variable names that appear in the current record.</li>
<li><span style="font-family: "courier new" , "courier" , monospace;">ShadowResolver</span>: Delegate requests to an underlying resolver, but filter out some names.</li>
</ol>
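Here is a minimal sketch of that design (heavily simplified: the real resolvers operate on <span style="font-family: "courier new" , "courier" , monospace;">Init</span> values and record fields rather than plain strings, and the class shapes below are illustrative only):

```cpp
#include <map>
#include <set>
#include <string>

// Simplified Resolver interface: map a variable name to its value,
// returning the empty string when the name stays unresolved.
struct Resolver {
  virtual ~Resolver() {}
  virtual std::string resolve(const std::string &name) = 0;
};

// MapResolver: resolve from a dictionary of name-value pairs.
struct MapResolver : Resolver {
  std::map<std::string, std::string> values;
  std::string resolve(const std::string &name) override {
    auto it = values.find(name);
    return it != values.end() ? it->second : "";
  }
};

// ShadowResolver: delegate to an underlying resolver, but filter out
// shadowed names, so that an outer substitution cannot leak into a
// nested scope that reuses the same variable name.
struct ShadowResolver : Resolver {
  Resolver &inner;
  std::set<std::string> shadowed;
  explicit ShadowResolver(Resolver &r) : inner(r) {}
  std::string resolve(const std::string &name) override {
    return shadowed.count(name) ? "" : inner.resolve(name);
  }
};
```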
This last type of resolver is used by the implementations of <span style="font-family: "courier new" , "courier" , monospace;">!foreach</span> and <span style="font-family: "courier new" , "courier" , monospace;">!foldl</span> to avoid mistakes with nesting. Consider, for example:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">class Exclamation<list string=""><list<string> messages> {<br /> list<string> Messages = !foreach(s, messages, s # "!");<br />}<br /><br />class Greetings<list string=""><list<string> names><br /> : Exclamation&lt!foreach(s, names, "Hello, " # s)>;<br /><br />def : Greetings<["Alice", "Bob"]>;</list></string></list></span></blockquote>
This effectively becomes a nested <span style="font-family: "courier new" , "courier" , monospace;">!foreach</span>. The iteration variable is named <span style="font-family: "courier new" , "courier" , monospace;">s</span> in both, so when substituting <span style="font-family: "courier new" , "courier" , monospace;">s</span> for the outer <span style="font-family: "courier new" , "courier" , monospace;">!foreach</span>, we must ensure that we don't also accidentally substitute <span style="font-family: "courier new" , "courier" , monospace;">s</span> in the inner <span style="font-family: "courier new" , "courier" , monospace;">!foreach</span>. We achieve this by having <span style="font-family: "courier new" , "courier" , monospace;">!foreach</span> wrap the given resolver with a <span style="font-family: "courier new" , "courier" , monospace;">ShadowResolver</span>. The same principle applies to <span style="font-family: "courier new" , "courier" , monospace;">!foldl</span> as well, of course.Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-65537758253911904972018-02-23T01:31:00.000+01:002018-02-23T01:31:34.724+01:00TableGen #3: Bits<i>This is the third part of a series; see the <a href="http://nhaehnle.blogspot.de/2018/02/tablegen-1-what-has-tablegen-ever-done.html">first part</a> for a table of contents.</i><br />
<br />
One of the main backend uses of TableGen is describing target machine instructions, and that includes describing the binary encoding of instructions and their constituent parts. This requires a certain level of bit twiddling, and TableGen supports this with explicit <span style="font-family: "courier new" , "courier" , monospace;">bit</span> (single bit) and <span style="font-family: "courier new" , "courier" , monospace;">bits</span> (fixed-length sequence of bits) types:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">class Enc<bits<7> op> {<br /> bits<10> Encoding;<br /><br /> let Encoding{9-7} = 5;<br /> let Encoding{6-0} = op;<br />}<br /><br />def InstA : Enc<0x35>;<br />def InstB : Enc<0x08>;</span></blockquote>
... will produce records:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">def InstA { // Enc<br /> bits<10> Encoding = { 1, 0, 1, 0, 1, 1, 0, 1, 0, 1 };<br /> string NAME = ?;<br />}<br />def InstB { // Enc<br /> bits<10> Encoding = { 1, 0, 1, 0, 0, 0, 1, 0, 0, 0 };<br /> string NAME = ?;<br />}</span></blockquote>
So you can quite easily slice and dice bit sequences with curly braces, as long as the indices themselves are constants.<br />
<br />
But the real killer feature is that so-called <i>unset</i> initializers, represented by a question mark, aren't fully resolved in bit sequences:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">class Enc<bits<3> opcode> {<br /> bits<8> Encoding;<br /> bits<3> Operand;<br /><br /> let Encoding{0} = opcode{2};<br /> let Encoding{3-1} = Operand;<br /> let Encoding{5-4} = opcode{1-0};<br /> let Encoding{7-6} = { 1, 0 };<br />}<br /><br />def InstA : Enc<5>;</span></blockquote>
... produces a record:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">def InstA { // Enc<br /> bits<8> Encoding = { 1, 0, 0, 1, Operand{2}, Operand{1}, Operand{0}, 1 };<br /> bits<3> Operand = { ?, ?, ? };<br /> string NAME = ?;<br />}</span></blockquote>
So instead of going ahead and saying, hey, <span style="font-family: "courier new" , "courier" , monospace;">Operand{2}</span> is <span style="font-family: "courier new" , "courier" , monospace;">?</span>, let's resolve that and plug it into <span style="font-family: "courier new" , "courier" , monospace;">Encoding</span>, TableGen instead keeps the fact that bit 3 of <span style="font-family: "courier new" , "courier" , monospace;">Encoding</span> refers to <span style="font-family: "courier new" , "courier" , monospace;">Operand{2}</span> as part of its data structures. <br />
<br />
Together with some additional data, this allows a backend of TableGen to automatically generate code for instruction encoding and decoding (i.e., disassembling). For example, it will create the source for a giant C++ method with signature<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">uint64_t getBinaryCodeForInstr(const MCInst &MI, /* ... */) const;</span></blockquote>
which contains a giant constant array with all the fixed bits of each instruction followed by a giant switch statement with cases of the form:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">case AMDGPU::S_CMP_EQ_I32:<br />case AMDGPU::S_CMP_EQ_U32:<br />case AMDGPU::S_CMP_EQ_U64:<br /> // more cases...<br />case AMDGPU::S_SET_GPR_IDX_ON: {<br /> // op: src0<br /> op = getMachineOpValue(MI, MI.getOperand(0), Fixups, STI);<br /> Value |= op & UINT64_C(255);<br /> // op: src1<br /> op = getMachineOpValue(MI, MI.getOperand(1), Fixups, STI);<br /> Value |= (op & UINT64_C(255)) << 8;<br /> break;<br />}</span></blockquote>
The bitmasks and shift values are all derived from the structure of unset bits as in the example above, and some additional data (the operand DAGs) are used to identify the operand index corresponding to TableGen variables like <span style="font-family: "courier new" , "courier" , monospace;">Operand</span> based on their name. For example, here are the relevant parts of the <span style="font-family: "courier new" , "courier" , monospace;">S_CMP_EQ_I32</span> record generated by the AMDGPU backend's TableGen files:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;"> def S_CMP_EQ_I32 { // Instruction (+ other superclasses)<br /> field bits<32> Inst = { 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, src1{7}, src1{6}, src1{5}, src1{4}, src1{3}, src1{2}, src1{1}, src1{0}, src0{7}, src0{6}, src0{5}, src0{4}, src0{3}, src0{2}, src0{1}, src0{0} };<br /> dag OutOperandList = (outs);<br /> dag InOperandList = (ins SSrc_b32:$src0, SSrc_b32:$src1);<br /> bits</span><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;"><</span>8> src0 = { ?, ?, ?, ?, ?, ?, ?, ? };<br /> bits</span><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;"><</span>8> src1 = { ?, ?, ?, ?, ?, ?, ?, ? };<br /> // many more variables...<br />}</span></blockquote>
Note how <span style="font-family: "courier new" , "courier" , monospace;">Inst</span>, which describes the 32-bit encoding as a whole, refers to the TableGen variables <span style="font-family: "courier new" , "courier" , monospace;">src0</span> and <span style="font-family: "courier new" , "courier" , monospace;">src1</span>. The operand indices used in the calls to <span style="font-family: "courier new" , "courier" , monospace;">MI.getOperand()</span> above are derived from the <span style="font-family: "courier new" , "courier" , monospace;">InOperandList</span>, which contains nodes with the corresponding names. (<span style="font-family: "courier new" , "courier" , monospace;">SSrc_b32</span> is the name of a record that subclasses <span style="font-family: "courier new" , "courier" , monospace;">RegisterOperand</span> and describes the acceptable operands, such as registers and inline constants.)<br />
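Reading the fixed bits off <span style="font-family: "courier new" , "courier" , monospace;">Inst</span> above, the generated encoder case for <span style="font-family: "courier new" , "courier" , monospace;">S_CMP_EQ_I32</span> amounts to the following (a hand-written sketch with a made-up function name; the real generated code obtains the operand values through <span style="font-family: "courier new" , "courier" , monospace;">getMachineOpValue</span> rather than taking them directly):

```cpp
#include <cstdint>

// Per the record above: bits 31-24 hold the fixed pattern 10111111 (0xBF),
// bits 23-16 are fixed zeros, bits 15-8 hold src1, and bits 7-0 hold src0.
uint32_t encodeSCmpEqI32(uint8_t src0, uint8_t src1) {
  uint32_t value = UINT32_C(0xBF) << 24;
  value |= uint32_t(src1) << 8;
  value |= uint32_t(src0);
  return value;
}
```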
<br />
Hopefully this helped you appreciate just how convenient TableGen can be. Not resolving the <span style="font-family: "courier new" , "courier" , monospace;">?</span> in bit sequences is an odd little exception to an otherwise fairly regular language, but the resulting expressive power is clearly worth it. It's something to keep in mind when we discuss how variable references are resolved.Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-17782561810397563062018-02-21T11:22:00.000+01:002018-02-21T11:25:26.954+01:00TableGen #2: Functional Programming<i>This is the second part of a series; see the <a href="http://nhaehnle.blogspot.de/2018/02/tablegen-1-what-has-tablegen-ever-done.html">first part</a> for a table of contents.</i><br />
<br />
When the basic pattern of having classes with variables that are filled in via template arguments or <span style="font-family: "courier new" , "courier" , monospace;">let</span>-statements reaches the limits of its expressiveness, it can become useful to calculate values on the fly. TableGen provides string concatenation out of the box with the paste operator ('<span style="font-family: "courier new" , "courier" , monospace;">#</span>'), and there are built-in functions which can be easily recognized since they start with an exclamation mark, such as <span style="font-family: "courier new" , "courier" , monospace;">!add</span>, <span style="font-family: "courier new" , "courier" , monospace;">!srl</span>, <span style="font-family: "courier new" , "courier" , monospace;">!eq</span>, and <span style="font-family: "courier new" , "courier" , monospace;">!listconcat</span>. There is even an <span style="font-family: "courier new" , "courier" , monospace;">!if</span>-builtin and a somewhat broken and limited <span style="font-family: "courier new" , "courier" , monospace;">!foreach</span>.<br />
<br />
There is no way of defining new functions, but there is a pattern that can be used to make up for it: classes with <span style="font-family: "courier new" , "courier" , monospace;">ret</span>-values:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">class extractBit<int val, int bitnum> {<br /> bit ret = !and(!srl(val, bitnum), 1);<br />}<br /><br />class Foo<int val> {<br /> bit bitFour = extractBit<val, 4>.ret;<br />}<br /><br />def Foo1 : Foo<5>;<br />def Foo2 : Foo<17>;</span></blockquote>
This doesn't actually work in LLVM trunk right now because of the deficiencies around anonymous record instantiations that I mentioned in the first part of the series, but after a lot of refactoring and cleanups, I got it to work reliably. It turns out to be an extremely useful tool.<br />
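For comparison, here is what <span style="font-family: "courier new" , "courier" , monospace;">extractBit</span> computes, written as ordinary C++. For the two instantiations above, bit 4 of 5 is 0 and bit 4 of 17 is 1.

```cpp
// What extractBit<val, bitnum>.ret computes: shift right, mask off one bit.
// !srl is a logical shift right, !and a bitwise and.
int extractBit(int val, int bitnum) {
  return (val >> bitnum) & 1;
}
```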
<br />
In case you're wondering, this does not support recursion and it's probably better that way. It's possible that TableGen is already accidentally Turing complete, but giving it that power on purpose seems unnecessary and might lead to abuse.<br />
<br />
Without recursion, a number of builtin functions are required. There has been a <span style="font-family: "courier new" , "courier" , monospace;">!foreach</span> for a long time, and it is a very odd duck:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">def Defs {<br /> int num;<br />}<br /><br />class Example<list<int> nums> {<br /> list<int> doubled = !foreach(Defs.num, nums, !add(Defs.num, Defs.num));<br />}<br /><br />def MyNums : Example<[4, 1, 9, -3]>;</span></blockquote>
In many ways it does what you'd expect, except that having to define a dummy record with a dummy variable in this way is clearly odd and fragile. Until very recently it did not actually support everything you'd expect even then, and even with the recent fixes there are plenty of bugs. Clearly, this is how <span style="font-family: "courier new" , "courier" , monospace;">!foreach</span> should look instead:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">class Example<list<int> nums> {<br /> list<int> doubled =<br /> !foreach(x, nums, !add(x, x));<br />}<br /><br />def MyNums : Example<[4, 1, 9, -3]>;</span></blockquote>
... and that's what I've implemented.<br />
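In conventional programming terms, the fixed <span style="font-family: "courier new" , "courier" , monospace;">!foreach</span> is simply a map over a list; the <span style="font-family: "courier new" , "courier" , monospace;">MyNums</span> example corresponds to this C++ (the helper name is mine):

```cpp
#include <vector>

// Equivalent of: !foreach(x, nums, !add(x, x)) applied to [4, 1, 9, -3].
std::vector<int> doubled(const std::vector<int> &nums) {
  std::vector<int> out;
  out.reserve(nums.size());
  for (int x : nums)
    out.push_back(x + x); // !add(x, x)
  return out;
}
```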
<br />
This ends up being a breaking change (the only one in the whole series, hopefully), but <span style="font-family: "courier new" , "courier" , monospace;">!foreach</span> isn't actually used in upstream LLVM proper anyway, and external projects can easily adapt.<br />
<br />
A new feature that I have found very helpful is a fold-left operation:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">class Enumeration<list<string> items> {<br /> list<string> ret = !foldl([], items, lhs, item,<br /> !listconcat(lhs, [!size(lhs) # ": " # item]));<br />}<br /><br />def MyList : Enumeration<["foo", "bar", "baz"]>;</span></blockquote>
This produces the following record:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">def MyList { // Enumeration<br /> list<string> ret = ["0: foo", "1: bar", "2: baz"];<br /> string NAME = ?;<br />}</span></blockquote>
Needless to say, it was necessary to refactor the TableGen tool very deeply to enable this kind of feature, but I am quite happy with how it ended up.<br />
<br />
The title of this entry is "Functional Programming", and in a sense I lied. Functions are not first-class values in TableGen even with my changes, so one of the core features of functional programming is missing. But that's okay: most of what you'd expect to have and actually need is now available in a consistent manner, even if it's still clunkier than in a "real" programming language. And again: making functions first-class would immediately make TableGen Turing complete. Do we really want that?Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-87213102693909357742018-02-19T19:10:00.000+01:002019-02-05T11:04:32.443+01:00TableGen #1: What has TableGen ever done for us?<i>This is the first entry in an on-going series. Here's a list of all entries:</i><br />
<ol>
<li><i><a href="http://nhaehnle.blogspot.de/2018/02/tablegen-1-what-has-tablegen-ever-done.html">What has TableGen ever done for us?</a></i></li>
<li><i><a href="http://nhaehnle.blogspot.de/2018/02/tablegen-2-functional-programming.html">Functional Programming</a></i></li>
<li><i><a href="http://nhaehnle.blogspot.de/2018/02/tablegen-3-bits.html">Bits</a></i></li>
<li><i><a href="http://nhaehnle.blogspot.de/2018/03/tablegen-4-resolving-variables.html">Resolving variables</a></i></li>
<li><i><a href="http://nhaehnle.blogspot.de/2018/03/tablegen-5-dags.html">DAGs</a> </i></li>
<li><i>to be continued </i> </li>
</ol>
<i>Also: <a href="https://fosdem.org/2019/schedule/event/llvm_tablegen/">here</a> is a talk (slides + video) I gave in the FOSDEM 2019 LLVM devroom on TableGen.</i><br />
<br />
Anybody who has ever done serious backend work in LLVM has probably developed a love-hate relationship with TableGen. At its best it can be an extremely useful tool that saves a lot of manual work. At its worst, it will drive you mad with bizarre crashes, indecipherable error messages, and generally inscrutable failures to understand what you want from it.<br />
<br />
TableGen is an internal tool of the LLVM compiler framework. It implements a domain-specific language that is used to describe many different kinds of structures. These descriptions are translated to read-only data tables that are used by LLVM during compilation.<br />
<br />
For example, all of LLVM's intrinsics are described in TableGen files. Additionally, each backend describes its target machine's instructions, register file(s), and more in TableGen files.<br />
<br />
The unit of description is the <i>record</i>. At its core, a record is a dictionary of key-value pairs. Additionally, records are typed by their superclass(es), and each record can have a name. So for example, the target machine descriptions typically contain one record for each supported instruction. The name of this record is the name of the enum value which is used to refer to the instruction. A specialized backend in the TableGen tool collects all records that subclass the <span style="font-family: "courier new" , "courier" , monospace;">Instruction</span> class and generates instruction information tables that are used by the C++ code in the backend and the shared codegen infrastructure.<br />
<br />
The main point of the TableGen DSL is to provide an ostensibly convenient way to generate a large set of records in a structured fashion that exploits regularities in the target machine architecture. To get an idea of the scope, the X86 backend description contains ~47k records generated by ~62k lines of TableGen. The AMDGPU backend description contains ~39k records generated by ~24k lines of TableGen.<br />
<br />
To get an idea of what TableGen looks like, consider this simple example:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">def Plain {<br /> int x = 5;<br />}<br /><br />class Room<string name> {<br /> string Name = name;<br /> string WallColor = "white";<br />}<br /><br />def lobby : Room<"Lobby">;<br /><br />multiclass Floor<int num, string color> {<br /> let WallColor = color in {<br /> def _left : Room<num # "_left">;<br /> def _right : Room<num # "_right">;<br /> }<br />}<br /><br />defm first_floor : Floor<1, "yellow">;<br />defm second_floor : Floor</span><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;"><</span>2, "gray">;</span></blockquote>
This example defines 6 records in total. If you have an LLVM build around, just run the above through <span style="font-family: "courier new" , "courier" , monospace;">llvm-tblgen</span> to see them for yourself. The first one has name <span style="font-family: "courier new" , "courier" , monospace;">Plain</span> and contains a single value named <span style="font-family: "courier new" , "courier" , monospace;">x</span> of value <span style="font-family: "courier new" , "courier" , monospace;">5</span>. The other 5 records have <span style="font-family: "courier new" , "courier" , monospace;">Room</span> as a superclass and contain different values for <span style="font-family: "courier new" , "courier" , monospace;">Name</span> and <span style="font-family: "courier new" , "courier" , monospace;">WallColor</span>.<br />
<br />
The first of those is the record of <span style="font-family: inherit;">name</span> <span style="font-family: "courier new" , "courier" , monospace;">lobby</span>, whose <span style="font-family: "courier new" , "courier" , monospace;">Name</span> value is "<span style="font-family: "courier new" , "courier" , monospace;">Lobby</span>" (note the difference in capitalization) and whose <span style="font-family: "courier new" , "courier" , monospace;">WallColor</span> is "<span style="font-family: "courier new" , "courier" , monospace;">white</span>".<br />
<br />
Then there are four records with the names <span style="font-family: "courier new" , "courier" , monospace;">first_floor_left</span>, <span style="font-family: "courier new" , "courier" , monospace;">first_floor_right</span>, <span style="font-family: "courier new" , "courier" , monospace;">second_floor_left</span>, and <span style="font-family: "courier new" , "courier" , monospace;">second_floor_right</span>. Each of those has <span style="font-family: "courier new" , "courier" , monospace;">Room</span> as a superclass, but not <span style="font-family: "courier new" , "courier" , monospace;">Floor</span>. <span style="font-family: "courier new" , "courier" , monospace;">Floor</span> is a multiclass, and multiclasses are not classes (go figure!). Instead, they are simply collections of record prototypes. In this case, <span style="font-family: "courier new" , "courier" , monospace;">Floor</span> has two record prototypes, <span style="font-family: "courier new" , "courier" , monospace;">_left</span> and <span style="font-family: "courier new" , "courier" , monospace;">_right</span>. They are instantiated by each of the <span style="font-family: "courier new" , "courier" , monospace;">defm</span> directives. Note how even though <span style="font-family: "courier new" , "courier" , monospace;">def</span> and <span style="font-family: "courier new" , "courier" , monospace;">defm</span> look quite similar, they are conceptually different: one instantiates the prototypes in a multiclass (or several multiclasses), while the other creates a record that may or may not have one or more superclasses.<br />
<br />
The <span style="font-family: "courier new" , "courier" , monospace;">Name</span> value of <span style="font-family: "courier new" , "courier" , monospace;">first_floor_left</span> is <span style="font-family: "courier new" , "courier" , monospace;">"1_left</span>" and its <span style="font-family: "courier new" , "courier" , monospace;">WallColor</span> is "<span style="font-family: "courier new" , "courier" , monospace;">yellow</span>", overriding the default. This demonstrates the late-binding nature of TableGen, which is quite useful for modeling exceptions to an otherwise regular structure:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">class Foo {<br /> string salutation = "Hi";<br /> string message = salutation#", world!";<br />}<br /><br />def : Foo {<br /> let </span><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;">salutation</span> = "Hello";<br />}</span></blockquote>
The <span style="font-family: "courier new" , "courier" , monospace;">message</span> of the anonymous record defined by the <span style="font-family: "courier new" , "courier" , monospace;">def</span>-statement is <span style="font-family: "courier new" , "courier" , monospace;">"Hello, world!"</span>.<br />
<br />
There is much more to TableGen. For example, a particularly surprising but extremely useful feature is the bit sets that are used to describe instruction encodings. But that's for another time.<br />
<br />
For now, let me leave you with just one of the many ridiculous inconsistencies in TableGen:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">class Tag<int num> {<br /> int Number = num;<br />}<br /><br />class Test<int num> {<br /> int Number1 = Tag<5>.Number;<br /> int Number2 = Tag<num>.Number;<br /> Tag Tag1 = Tag<5>;<br /> Tag Tag2 = Tag<num>;<br />}<br /><br />def : Test<5>;</span></blockquote>
What are the values in the anonymous record? It turns out that <span style="font-family: "courier new" , "courier" , monospace;">Number1</span> and <span style="font-family: "courier new" , "courier" , monospace;">Number2</span> are both <span style="font-family: "courier new" , "courier" , monospace;">5</span>, but <span style="font-family: "courier new" , "courier" , monospace;">Tag1</span> and <span style="font-family: "courier new" , "courier" , monospace;">Tag2</span> refer to different records. <span style="font-family: "courier new" , "courier" , monospace;">Tag1</span> refers to an anonymous record with superclass <span style="font-family: "courier new" , "courier" , monospace;">Tag</span> and <span style="font-family: "courier new" , "courier" , monospace;">Number</span> equal to <span style="font-family: "courier new" , "courier" , monospace;">5</span>, while <span style="font-family: "courier new" , "courier" , monospace;">Tag2</span> also refers to an anonymous record, but with the <span style="font-family: "courier new" , "courier" , monospace;">Number</span> equal to an unresolved variable reference.<br />
<br />
This clearly doesn't make sense at all and is the kind of thing that sometimes makes you want to just throw it all out of the window and build your own DSL with blackjack and Python hooks. The problem with that kind of approach is that even if the new thing looks nicer initially, it'd probably end up in a similarly messy state after another five years.<br />
<br />
So when I ran into several problems like the above recently, I decided to take a deep dive into the internals of TableGen with the hope of just fixing a lot of the mess without reinventing the wheel. Over the next weeks, I plan to write a couple of focused entries on what I've learned and changed, starting with how a simple form of functional programming should be possible in TableGen.<br />
<br />
<h3>
Posits (2017-09-30)</h3>
The <a href="https://www.slideshare.net/insideHPC/beyond-floating-point-next-generation-computer-arithmetic">posit number system</a> is a proposed alternative to floating point numbers. Having heard of posits a couple of times now, I'd like to take the time to digest them and, in the second half, write a bit about their implementation in hardware. Their creator makes some bold claims about posits being simpler to implement, and - spoiler alert! - I believe he's mistaken. Posits are still a clever idea and may indeed be a good candidate for replacing floating point in the long run. But trade-offs are an inescapable fact of life.<br />
<br />
<h3>
Floating point revisited</h3>
In floating point, numbers are represented by a sign bit, an exponent, and a mantissa:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiN0Z10kXYRehI7uGHTf_gZaf-fjTW6q2HBB7IwTGnrHaip0dSafmuv1B0H2xpgn1LqGZFvKy036yIix7PTpgHfPAJy49gVVqPdTmxuRT58QSq6oY8DtWiZcGjvxq9IAjciz8kBSg/s1600/floatingpoint.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="24" data-original-width="336" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiN0Z10kXYRehI7uGHTf_gZaf-fjTW6q2HBB7IwTGnrHaip0dSafmuv1B0H2xpgn1LqGZFvKy036yIix7PTpgHfPAJy49gVVqPdTmxuRT58QSq6oY8DtWiZcGjvxq9IAjciz8kBSg/s1600/floatingpoint.png" /></a></div>
<br />
The value of a normal floating point number is <i>±1.m<sub>2</sub>*2<sup>e</sup></i> (actually, <i>e</i> is stored with a bias in order to be able to treat it like an unsigned number most of the time, but let's not get distracted by that kind of detail). By using an exponent, a wide range of numbers can be represented at a constant relative accuracy.<br />
<br />
There are some non-normal floating point numbers. When <i>e</i> is maximal, the number is either considered infinity or "not a number", depending on <i>m</i>. When <i>e</i> is minimal, it represents a <i>sub-normal</i> number: either a <i>denormal</i> or zero.<br />
<br />
Denormals can be confusing at first, but their justification is actually quite simple. Let's take single-precision floating point as an example, where there are 8 exponent bits and 23 mantissa bits. The smallest positive normal single-precision floating point number is <i>1.00000000000000000000000<sub>2</sub>*2<sup>-126</sup></i>. The next larger representable number is <i>1.00000000000000000000001<sub>2</sub>*2<sup>-126</sup></i>. Those numbers are not equal, but their difference is not representable as a normal single-precision floating point number. It would be rather odd if the difference between non-equal numbers were equal to zero, as it would be if we had to round the difference to zero!<br />
<br />
When <i>e</i> is minimal, the represented number is (in the case of single-precision floating point) <i>±0.m<sub>2</sub>*2<sup>-126</sup></i>, which means that the difference between the smallest normal numbers, <i>0.00000000000000000000001<sub>2</sub>*2<sup>-126</sup></i>, can still be represented.<br />
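To make the field layout concrete, here is a small Python sketch that decodes single-precision bit patterns by hand (the function name <span style="font-family: "courier new" , "courier" , monospace;">decode_float32</span> is mine, and infinity/NaN are left out for brevity):

```python
def decode_float32(bits):
    """Decode an IEEE-754 single-precision bit pattern: 1 sign bit,
    8 exponent bits (bias 127), 23 mantissa bits. Handles normals,
    denormals, and zero; e == 255 (infinity/NaN) is omitted."""
    sign = -1.0 if bits >> 31 else 1.0
    e = (bits >> 23) & 0xFF
    m = bits & 0x7FFFFF
    if e == 0:
        # Sub-normal (or zero): 0.m * 2^-126 -- no implicit leading 1.
        return sign * (m / 2**23) * 2.0**-126
    return sign * (1 + m / 2**23) * 2.0**(e - 127)

# The difference between the two smallest positive normal numbers is
# representable -- but only as a denormal (2^-149):
smallest_normal = decode_float32(0x00800000)   # 1.0...0 * 2^-126
next_normal = decode_float32(0x00800001)       # 1.0...1 * 2^-126
assert next_normal - smallest_normal == decode_float32(0x00000001)
```

Without denormals, that difference would have to round to zero even though the two numbers compare unequal, which is exactly the oddity described above.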
<br />
Note how with floating point numbers, the relative accuracy with which numbers can be represented is constant for almost the entire range of representable numbers. Once you get to sub-normal numbers, the accuracy drops very quickly. At the other end, the drop is even more extreme with a sudden jump to infinity.<br />
<br />
<h3>
Posits</h3>
The basic idea of posits is to vary the size of the mantissa and to use a variable-length hybrid encoding of the exponent that mixes unary with binary encodings. The variable-length exponent encoding is shorter for exponents close to zero, so that more bits of mantissa are available for numbers close to one.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNQ2qh4FrC5zAJt9i9xW4c1VQYomVq2TOrTbLPwOsdcpXiq_lwoN0foyYQalxo0bckHPWq6G5rYdhkxXkWgXnggktYbarUl8NdPzqkT96laO1REQkFqQKODr4hLpLHix1i6rEFKA/s1600/posit.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="24" data-original-width="336" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNQ2qh4FrC5zAJt9i9xW4c1VQYomVq2TOrTbLPwOsdcpXiq_lwoN0foyYQalxo0bckHPWq6G5rYdhkxXkWgXnggktYbarUl8NdPzqkT96laO1REQkFqQKODr4hLpLHix1i6rEFKA/s1600/posit.png" /></a></div>
<br />
Posits have a fixed number of binary exponent bits <i>e</i> (except in the extreme ranges), and a posit system is characterized by that number. A typical choice appears to be <i>es = 3</i>. The unary part of the exponent is encoded by the <i>r</i> bits. For positive posits, <i>10<sub>2</sub></i> encodes <i>0</i>, <i>110<sub>2</sub></i> encodes <i>1</i>, <i>01<sub>2</sub></i> encodes <i>-1</i>, <i>001<sub>2</sub></i> encodes <i>-2</i>, and so on. The overall encoded number is then <i>±1.m<sub>2</sub>*2<sup>r*2<sup>es</sup> + e</sup></i>.<br />
<br />
Let's look at some examples of 16-bit posits with <i>es = 3</i>.<br />
<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">10</span> <span style="color: blue;">000</span> 0000000000 is <i>1.0<sub>2</sub>*2<sup>0*2<sup>3</sup>+0</sup> = 1</i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">10</span> <span style="color: blue;">000</span> 1000000000 is <i>1.1<sub>2</sub>*2<sup>0*2<sup>3</sup>+0</sup> = 1.5</i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">10</span> <span style="color: blue;">001</span> 0000000000 is <i>1.0<sub>2</sub>*2<sup>0*2<sup>3</sup>+1</sup> = 2</i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">10</span> <span style="color: blue;">111</span> 0000000000 is <i>1.0<sub>2</sub>*2<sup>0*2<sup>3</sup>+7</sup> = 128</i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">110</span> <span style="color: blue;">000</span> 000000000 is <i>1.0<sub>2</sub>*2<sup>1*2<sup>3</sup>+0</sup> = 256</i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">1110</span> <span style="color: blue;">000</span> 00000000 is <i>1.0<sub>2</sub>*2<sup>2*2<sup>3</sup>+0</sup> = 65536</i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">111111111110</span> <span style="color: blue;">000</span> is <i>1.0<sub>2</sub>*2<sup>10*2<sup>3</sup>+0</sup> = 2<sup>80</sup></i>. Note that there is no mantissa anymore! The next larger number is:<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">111111111110</span> <span style="color: blue;">001</span> is <i>1.0<sub>2</sub>*2<sup>10*2<sup>3</sup>+1</sup> = 2<sup>81</sup></i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">111111111110</span> <span style="color: blue;">111</span> is <i>1.0<sub>2</sub>*2<sup>10*2<sup>3</sup>+7</sup> = 2<sup>87</sup></i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">1111111111110</span> <span style="color: blue;">00</span> is <i>1.0<sub>2</sub>*2<sup>11*2<sup>3</sup>+0</sup> = 2<sup>88</sup></i>. Now the number of binary exponent bits starts shrinking. The missing bits are implicitly zero, so the next larger number is:<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">1111111111110</span> <span style="color: blue;">01</span> is <i>1.0<sub>2</sub>*2<sup>11*2<sup>3</sup>+2</sup> = 2<sup>90</sup></i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">1111111111110</span> <span style="color: blue;">11</span> is <i>1.0<sub>2</sub>*2<sup>11*2<sup>3</sup>+6</sup> = 2<sup>94</sup></i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">11111111111110</span> <span style="color: blue;">0</span> is <i>1.0<sub>2</sub>*2<sup>12*2<sup>3</sup>+0</sup> = 2<sup>96</sup></i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">11111111111110</span> <span style="color: blue;">1</span> is <i>1.0<sub>2</sub>*2<sup>12*2<sup>3</sup>+4</sup> = 2<sup>100</sup></i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">111111111111110</span><span style="color: blue;"></span> is <i>1.0<sub>2</sub>*2<sup>13*2<sup>3</sup>+0</sup> = 2<sup>104</sup></i>. There are no binary exponent bits left, but the presentation in the slides linked above still allows for one larger normal number:<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">111111111111111</span><span style="color: blue;"></span> is <i>1.0<sub>2</sub>*2<sup>14*2<sup>3</sup>+0</sup> = 2<sup>112</sup></i>.<br />
Going in the other direction, we get:<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">01</span> <span style="color: blue;">111</span> 0000000000 is <i>1.0<sub>2</sub>*2<sup>-1*2<sup>3</sup>+7</sup> = 1/2 = 0.5</i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">01</span> <span style="color: blue;">000</span> 0000000000 is <i>1.0<sub>2</sub>*2<sup>-1*2<sup>3</sup>+0</sup> = 1/256 = 0.00390625</i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">001</span> <span style="color: blue;">000</span> 000000000 is <i>1.0<sub>2</sub>*2<sup>-2*2<sup>3</sup>+0</sup> = 1/65536 = 0.0000152587890625</i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">000000000001</span> <span style="color: blue;">111</span> is <i>1.0<sub>2</sub>*2<sup>-11*2<sup>3</sup>+7</sup> = 2<sup>-81</sup></i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">000000000001</span> <span style="color: blue;">000</span> is <i>1.0<sub>2</sub>*2<sup>-11*2<sup>3</sup>+0</sup> = 2<sup>-88</sup></i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">0000000000001</span> <span style="color: blue;">11</span> is <i>1.0<sub>2</sub>*2<sup>-12*2<sup>3</sup>+6</sup> = 2<sup>-90</sup></i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">0000000000001</span> <span style="color: blue;">00</span> is <i>1.0<sub>2</sub>*2<sup>-12*2<sup>3</sup>+0</sup> = 2<sup>-96</sup></i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">00000000000001</span> <span style="color: blue;">0</span> is <i>1.0<sub>2</sub>*2<sup>-13*2<sup>3</sup>+0</sup> = 2<sup>-104</sup></i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">000000000000001</span> is <i>1.0<sub>2</sub>*2<sup>-14*2<sup>3</sup>+0</sup> = 2<sup>-112</sup></i>. This is the smallest positive normal number, since we have no choice but to treat 0 specially:<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">000000000000000</span> is 0.<br />
<br />
For values close to 1, the accuracy is the same as for half-precision floating point numbers (which have 5 exponent and 10 mantissa bits). Half-precision floating point numbers do have slightly higher accuracy at the extreme ends of their dynamic range, but the dynamic range of posits is <i>much</i> higher. This is a very tempting trade-off for many applications.<br />
<br />
By the way: if we had set <i>es = 2</i>, we would have higher accuracy for values close to 1, while still having a larger dynamic range than half-precision floating point.<br />
<br />
You'll note that we have not encountered an infinity. Gustafson's proposal here is to do away with the distinction between positive and negative zero and infinity. Instead, he proposes to think of the real numbers projectively and to use a two's complement representation, meaning that negating a posit is the same operation at the bit level as negating an integer. For example:<br />
<br />
<span style="color: red;">1</span> <span style="color: #b45f06;">111111111111111</span> is <i>-1.0<sub>2</sub>*2<sup>-14*2<sup>3</sup>+0</sup> = -2<sup>-112</sup></i>.<br />
<span style="color: red;">1</span> <span style="color: #b45f06;">10</span> <span style="color: blue;">000</span> 0000000000 is <i>-1.0<sub>2</sub>*2<sup>0*2<sup>3</sup>+0</sup> = -1</i>. The next smaller number (larger in absolute magnitude) is:<br />
<span style="color: red;">1</span> <span style="color: #b45f06;">01</span> <span style="color: blue;">111</span> 1111111111 is <i>-1.0000000001<sub>2</sub>*2<sup>0*2<sup>3</sup>+0</sup></i>.<br />
<span style="color: red;">1</span> <span style="color: #b45f06;">01</span> <span style="color: blue;">111</span> 1000000000 is <i>-1.1<sub>2</sub>*2<sup>0*2<sup>3</sup>+0</sup> = -1</i>.5<br />
<span style="color: red;">1</span> <span style="color: #b45f06;">000000000000001</span> is <i>-1.0<sub>2</sub>*2<sup>14*2<sup>3</sup>+0</sup> = -</i><i><i>2<sup>112</sup></i></i>.<br />
<br />
The bit pattern <span style="color: red;">1</span> <span style="color: #b45f06;">000000000000000</span> (which, like 0, is its own inverse in two's complement negation) would then represent infinity.<br />
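To make these decoding rules concrete, here is a Python sketch of a decoder for 16-bit posits with <i>es = 3</i> (my own illustration based on the description above; <span style="font-family: "courier new" , "courier" , monospace;">decode_posit16</span> is a made-up name, not reference code):

```python
def decode_posit16(bits, es=3, nbits=16):
    """Decode a 16-bit posit (es binary exponent bits) to a float."""
    mask = (1 << nbits) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (nbits - 1):
        return float("inf")          # the single infinity pattern
    sign = -1.0 if bits >> (nbits - 1) else 1.0
    if sign < 0:
        bits = -bits & mask          # two's complement negation
    # Unary part: a run of identical bits after the sign bit,
    # terminated by the opposite bit (or by the end of the posit).
    pos = nbits - 2                  # position of the next bit to read
    first = (bits >> pos) & 1
    run = 0
    while pos >= 0 and ((bits >> pos) & 1) == first:
        run += 1
        pos -= 1
    pos -= 1                         # skip the terminating bit
    r = run - 1 if first else -run
    # Binary exponent: up to es bits; missing bits are implicitly zero.
    e = 0
    for _ in range(es):
        e <<= 1
        if pos >= 0:
            e |= (bits >> pos) & 1
            pos -= 1
    # Mantissa: whatever bits remain form the fractional part of 1.m.
    mbits = pos + 1
    frac = 1.0
    if mbits > 0:
        frac += (bits & ((1 << mbits) - 1)) / 2**mbits
    return sign * frac * 2.0 ** (r * 2**es + e)

# A few of the worked examples from above:
assert decode_posit16(0b0100000000000000) == 1.0        # 0 10 000 0000000000
assert decode_posit16(0b0100001000000000) == 1.5        # 0 10 000 1000000000
assert decode_posit16(0b0110000000000000) == 256.0      # 0 110 000 000000000
assert decode_posit16(0b0000000000000001) == 2.0**-112  # smallest positive
assert decode_posit16(0b1100000000000000) == -1.0       # two's complement
```

Note how the shrinking exponent and mantissa fields at the extreme regimes fall out naturally from "pad missing bits with zeros".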
<br />
There's an elegance to thinking projectively in this way. Comparison of posits is the same as comparison of signed integers at the bit level (except for infinity, which is unordered). Even better, it's great that the smallest and largest normal numbers are multiplicative inverses of each other.<br />
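That bit-level correspondence is easy to check with the example patterns from above (a small Python demonstration of my own):

```python
def as_int16(bits):
    """Reinterpret a 16-bit pattern as a two's complement signed integer."""
    return bits - 0x10000 if bits & 0x8000 else bits

# Posit bit patterns from the examples above, listed in increasing
# numeric value: -1.5, -1, -2^-112, 0, 2^-112, 0.5, 1, 1.5, 2, 256, 2^112.
patterns = [
    0b1011111000000000,  # -1.5
    0b1100000000000000,  # -1
    0b1111111111111111,  # -2^-112
    0b0000000000000000,  # 0
    0b0000000000000001,  # 2^-112
    0b0011110000000000,  # 0.5
    0b0100000000000000,  # 1
    0b0100001000000000,  # 1.5
    0b0100010000000000,  # 2
    0b0110000000000000,  # 256
    0b0111111111111111,  # 2^112
]

# The signed-integer interpretations are in the same order as the
# posit values they encode.
ints = [as_int16(p) for p in patterns]
assert ints == sorted(ints)
```

Infinity (the pattern <span style="font-family: "courier new" , "courier" , monospace;">0x8000</span>, i.e. the most negative integer) is the one exception, matching the caveat that it is unordered.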
<br />
But to people used to floating point, not having a "sign + magnitude" representation is surprising. I also imagine that it could be annoying for a hardware implementation, so let's look into that.<br />
<br />
<h3>
Hardware implementations</h3>
In his presentations, Gustafson claims that by reducing the number of special cases, posits are easier to implement than floating point. No doubt there are fewer special cases (no denorms, no NaNs), but at the cost of a more complicated normal case.<br />
<br />
Let's take a look at a floating point multiply. The basic structure is conceptually quite simple, since all parts of a floating point number can be treated separately:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdwRMb9do2R6KOKC1BvSh0ieDidICQD1hjHc5UrqmgJucvRluZmqZfT4w06T-1mdJXcPI8450hnIWoqSlCWQySelxxijOHUnFwCGI9fvvfzY7hfWqTr8Bwj7X7IYK8Ks8CJ2DA4Q/s1600/fp-multiply.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="232" data-original-width="693" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdwRMb9do2R6KOKC1BvSh0ieDidICQD1hjHc5UrqmgJucvRluZmqZfT4w06T-1mdJXcPI8450hnIWoqSlCWQySelxxijOHUnFwCGI9fvvfzY7hfWqTr8Bwj7X7IYK8Ks8CJ2DA4Q/s1600/fp-multiply.png" /></a></div>
<br />
By far the most expensive part here is the multiplication of the mantissas. There are of course a bunch of special cases that need to be accounted for: the inputs could be zero, infinity, or NaN, and the multiplication could overflow. Each of these cases is easily detected and handled with a little bit of comparatively inexpensive boolean logic.<br />
<br />
Where it starts to get complicated is when handling the possibility that an input is denormal, or when the multiplication of two normal numbers results in a denormal.<br />
<br />
When an input is denormal, the corresponding input for the multiply is <i>0.m</i> instead of <i>1.m</i>. Some logic has to decide whether the most significant input bit to the multiply is 0 or 1. This could potentially add to the latency of the computation. Luckily, deciding whether the input is denormal is fairly simple, and only the most significant input bit is affected. Because of carries, the less significant input bits tend to be more critical for latency. Conversely, this means that the latency of determining the most significant input bit can be hidden well.<br />
<br />
On the output side, the cost is higher, both in terms of the required logic and in terms of the added latency, because a shifter is needed to shift the output into the correct position. Two cases need to be considered: When a multiplication of two normal numbers results in a denormal, the output has to be shifted to the right an appropriate number of places.<br />
<br />
When a denormal is multiplied by a normal number, the output needs to be shifted to the left or the right, depending on the exponent of the normal number. Additionally, the number of leading zeros of either the denormal input or of the multiplication output is required to determine the exponent of the final result. Since the area cost is the same either way, I would expect implementations to determine the leading zero of the denormal input, since that allows for better latency hiding.<br />
<br />
(The design space for floating point multipliers is larger than I've shown here. For example, you could deal with denormals by shifting their mantissa into place <i>before</i> the multiply. That seems like a waste of hardware considering that you cannot avoid the shifter after the multiply, but my understanding of hardware design is limited, so who knows.)<br />
<br />
So there is a bit more hardware required than just what is shown in the diagram above: a leading-zero-count and a shifter, plus a bit more random logic. But now compare to the effort required for a posit multiply:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi21AFMjROG7AvHGMD6VkM0NRlBMTzcNcnMXqlRg7b4QknylJr3eM2XJYvn-zTTJIGWJ1VSIeCF8S0Y7cEzUuEwoYHkZQy2aKb9u6BwA_wSG_PdZgvdF75kFa80c8eDsM2voPaEXg/s1600/posit-multiply.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="298" data-original-width="689" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi21AFMjROG7AvHGMD6VkM0NRlBMTzcNcnMXqlRg7b4QknylJr3eM2XJYvn-zTTJIGWJ1VSIeCF8S0Y7cEzUuEwoYHkZQy2aKb9u6BwA_wSG_PdZgvdF75kFa80c8eDsM2voPaEXg/s1600/posit-multiply.png" /></a></div>
<br />
First of all, there is unavoidable latency in front of the multiplier. Every single bit of mantissa input may be masked off, depending on the variable size of the exponent's unary part. The exponents themselves need to be decoded in order to add them, and then the resulting exponent needs to be encoded again. Finally, the multiplication result needs to be shifted into place; this was already required for floating point multiplication, but the determination of the shift becomes more complicated since it depends on the exponent size. Also, each output bit needs a multiplexer since it can originate from either the exponent or the mantissa.<br />
<br />
From my non-expert glance, here's the hardware you need in addition to the multiplier and exponent addition:<br />
<ul>
<li>two leading-bit counts to decode the unary exponent parts (floating-point multiply only needs a single leading-zero count for a denormal input)</li>
<li>two shifters to shift the binary input exponent parts into place</li>
<li>logic for masking the input mantissas</li>
<li>one leading bit encoder</li>
<li>one shifter to shift the binary output exponent part into place</li>
<li>one shifter to shift the mantissa into place (floating-point also needs this)</li>
<li>multiplexer logic to combine the variable-length output parts </li>
</ul>
Also note that the multiplier and mantissa shifter may have to be larger, since - depending on the value of <i>es</i> - the mantissa of posits close to 1 can be larger than the mantissa of floating point numbers.<br />
<br />
On the other hand, the additional shifters don't have to be large, since they only need to shift <i>es</i> bits. The additional hardware is almost certainly dominated by the cost of the mantissa multiplier. Still, the additional latency could be a problem - though obviously, I have no actual experience designing floating point multipliers.<br />
<br />
There's also the issue of the proposed two's complement representation for negative posits. This may not be too bad for the mantissa multiplication, since one can probably treat it as a signed integer multiplication and automatically get the correct signs for the resulting mantissa. However, I would expect some more overhead for the decoding and encoding of the exponent.<br />
<br />
The story should be similar for posit vs. floating point addition. When building a multiply-accumulate unit, the latency that is added for masking the input based on the variable exponent length can likely be hidden quite well, but there does not appear to be a way around the decoding and encoding of exponents.<br />
<br />
<h3>
Closing thoughts</h3>
As explained above, I expect posit hardware to be more expensive than floating point hardware. However, the gain in dynamic range and accuracy is nothing to sneeze at. It's worth giving posits a fair shot, since the trade-off may be worth it.<br />
<br />
There is a lot of legacy software that relies on floating point behavior. Luckily, a posit ALU contains all the pieces of a floating point ALU, so it should be possible to build an ALU that can do both at pretty much the cost of a posit-only ALU. This makes a painless transition feasible.<br />
<br />
Posits have an elegant design based on thinking about numbers projectively, but the lack of NaNs, the two's complement representation, and not having signed zeros and infinities may be alien to some floating point practitioners. I don't know how much of an issue this really is, but it's worth pointing out that a simple modification to posits could accommodate all these concerns. Using again the example of 16-bit posits with <i>es = 3</i>, we could designate bit patterns at the extreme ends as NaN and infinity:<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">111111111111111</span> is <i>+inf</i> (instead of <i>2<sup>112</sup></i>).<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">000000000000001</span> is <i>+NaN</i> (instead of <i>2<sup>-112</sup></i>).<br />
We could then treat the sign bit independently, like in floating point, giving us <i>±0</i>, <i>±inf</i>, and <i>±NaN</i>. The neat properties related to thinking projectively would be lost, but the smallest and largest positive normal numbers would still be multiplicative inverses of each other. The hardware implementation may even be smaller, thanks to not having to deal with two's complement exponents.<br />
<br />
The inertia of floating point is massive, and I don't expect it to be unseated anytime soon. But it's awesome to see people rethinking such fundamental building blocks of computing and coming up with solid new ideas. Posits aren't going to happen quickly, if at all, but it's worth taking them seriously.<br />
<br />
<h3>
radeonsi: out-of-order rasterization on VI+ (2017-09-09)</h3>
I've been polishing a patch of Marek's to <a href="https://cgit.freedesktop.org/~nh/mesa/log/?h=out-of-order">enable out-of-order rasterization on VI+</a>. Assuming it goes through as planned, this will be the first time we're adding driver-specific drirc configuration options that are unfamiliar to the enthusiast community (there's <span style="font-family: "courier new" , "courier" , monospace;">radeonsi_enable_sisched</span> already, but Phoronix has reported on the sisched option often enough). So I thought it makes sense to explain what those options are about.<br />
<br />
<b>Background: Out-of-order rasterization</b><br />
<br />
Out-of-order rasterization is an optimization that can be enabled in some cases. Understanding it properly requires some background on how tasks are spread across <i>shader engines (SEs)</i> on Radeon GPUs.<br />
<br />
The frontends (vertex processing, including tessellation and geometry shaders) and backends (fragment processing, including rasterization and depth and color buffers) are spread across SEs roughly like this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVVY6jkzxBV2vPChmSChmNC1moFFsJ0x5-35zeIZ4_nJHopuj9kQnzUVV_R8bZ3qMuoU1RVgjY1M5F3Q7UDfeFXQ1px0DcESVLIAAJkU7ajp7eFrWn3HBHn5NOJXKOuBcBLH6s3w/s1600/ia-se-vp-fp.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="347" data-original-width="597" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVVY6jkzxBV2vPChmSChmNC1moFFsJ0x5-35zeIZ4_nJHopuj9kQnzUVV_R8bZ3qMuoU1RVgjY1M5F3Q7UDfeFXQ1px0DcESVLIAAJkU7ajp7eFrWn3HBHn5NOJXKOuBcBLH6s3w/s1600/ia-se-vp-fp.png" /></a></div>
<br />
(Not shown are the <i>compute units (CUs)</i> in each SE, which is where all shaders are actually executed.)<br />
<br />
The input assembler distributes primitives (i.e., triangles) and their vertices across SEs in a mostly round-robin fashion for vertex processing. In the backend, work is distributed across SEs by on-screen location, because that improves cache locality.<br />
<br />
This means that once the data of a triangle (vertex position and attributes) is complete, most likely the corresponding rasterization work needs to be distributed to other SEs. This is done by what I'm simplifying as the "crossbar" in the diagram.<br />
<br />
OpenGL is very precise about the order in which the fixed-function parts of fragment processing should happen. If one triangle comes after another in a vertex buffer and they overlap, then the fragments of the second triangle better overwrite the corresponding fragments of the first triangle (if they weren't rejected by the depth test, of course). This means that the "crossbar" may have to delay forwarding primitives from a shader engine until all earlier primitives (which were processed in another shader engine) have been forwarded. This only happens rarely, but it's still sad when it does.<br />
<br />
There are some cases in which the order of fragments doesn't matter. Depth pre-passes are a typical example: the order in which triangles are written to the depth buffer doesn't matter as long as the "front-most" fragments win in the end. Another example are some operations involved in stencil shadows.<br />
<br />
Out-of-order rasterization simply means that the "crossbar" does not delay forwarding triangles. Triangles are instead forwarded
immediately, which means that they can be rasterized out-of-order. With the in-progress patches, the driver recognizes cases where this optimization can be enabled safely.<br />
<br />
By the way #1: From this explanation, you can immediately deduce that this feature only affects GPUs with multiple SEs. So integrated GPUs are not affected, for example.<br />
<br />
By the way #2: Out-of-order rasterization is entirely disabled by setting <span style="font-family: "courier new" , "courier" , monospace;">R600_DEBUG=nooutoforder</span>.<br />
<br />
<br />
<b>Why the configuration options?</b><br />
<br />
There are some cases where the order of fragments <i>almost</i> doesn't matter. It turns out that the most common and basic type of rendering is one of these cases. This is when you're drawing triangles without blending and with a standard depth function like LEQUAL with depth writes enabled. Basically, this is what you learn to do in every first 3D programming tutorial.<br />
<br />
In this case, the order of fragments is mostly irrelevant because of the depth test. However, it <i>might</i> happen that two triangles have the exact same depth value, and then the order matters. This is very unlikely in common scenes though. Setting the option <span style="font-family: "courier new" , "courier" , monospace;">radeonsi_assume_no_z_fights=true</span> makes the driver assume that it indeed <i>never</i> happens, which means out-of-order rasterization can be enabled in the most common rendering mode!<br />
<br />
Some other cases occur with blending. Some blending modes (though not the most common ones) are <i>commutative</i> in the sense that from a purely mathematical point of view, the end result of blending two triangles together is the same no matter which order they're blended in. Unfortunately, <i>additive</i> blending (which is one of those modes) involves floating point numbers in a way where changing the order of operations can lead to different rounding, which leads to subtly different results. Using out-of-order rasterization would break some of the guarantees the driver has to give for OpenGL conformance.<br />
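The rounding problem with additive blending comes down to floating-point addition being commutative but not associative. A quick standalone illustration (plain Python, not driver code):

```python
# The order in which floating-point values are summed can change the
# rounding, and hence the final result, even though each individual
# addition is correctly rounded.
a, b, c = 1e20, -1e20, 1.0

left = (a + b) + c   # 0.0 + 1.0 == 1.0
right = a + (b + c)  # b + c rounds back to -1e20, so the total is 0.0

assert left != right
```

This is exactly why blending two triangles in a different order can produce subtly different pixels, even though additive blending is mathematically commutative.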
<br />
The option <span style="font-family: "courier new" , "courier" , monospace;">radeonsi_commutative_blend_add=true</span> tells the driver that you don't care about these subtle errors and will lead to out-of-order rasterization being used in some additional cases (though again, those cases are rarer, and many games probably don't encounter them at all).<br />
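For reference, drirc options like these typically live in <span style="font-family: "courier new" , "courier" , monospace;">~/.drirc</span>. A sketch of what a per-application entry enabling both options might look like — the game name and executable here are placeholders, and the exact schema may differ between Mesa versions:

```xml
<driconf>
    <device driver="radeonsi">
        <!-- Hypothetical entry; applications are matched by executable name. -->
        <application name="Some Game" executable="somegame">
            <option name="radeonsi_assume_no_z_fights" value="true"/>
            <option name="radeonsi_commutative_blend_add" value="true"/>
        </application>
    </device>
</driconf>
```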
<br />
<b>tl;dr</b><br />
<br />
Out-of-order rasterization can give a very minor boost on multi-shader engine VI+ GPUs (meaning dGPUs, basically) in many games by default. In most games, you should be able to set <span style="font-family: "courier new" , "courier" , monospace;">radeonsi_assume_no_z_fights=true</span> and <span style="font-family: "courier new" , "courier" , monospace;">radeonsi_commutative_blend_add=true</span> to get an additional very minor boost. Those options aren't enabled by default because they can lead to incorrect results. Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-43910725016150647892017-06-25T23:10:00.000+02:002017-06-25T23:10:11.814+02:00ARB_gl_spirv, NIR linking, and a NIR backend for radeonsi<a href="https://www.khronos.org/registry/spir-v/">SPIR-V</a> is the binary shader code representation used by Vulkan, and <a href="https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_gl_spirv.txt">GL_ARB_gl_spirv</a> is a recent extension that allows it to be used for OpenGL as well. Over the last weeks, I've been exploring how to add support for it in radeonsi.<br />
<br />
As a bit of background, here's an overview of the various relevant shader representations that Mesa knows about. There are some others for really old legacy OpenGL features, but we don't care about those. On the left, you see the SPIR-V to LLVM IR path used by radv for Vulkan. On the right is the path from GLSL to LLVM IR, plus a mention of the conversion from GLSL IR to NIR that some other drivers are using (i965, freedreno, and vc4).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjpoKJVaqQxc7_bPYnRXzTVBkZz28NEhbcMqw09mNvN_2Kzktg-vwVp50uczHaacONmoP93BanfTC3I_lFroKVYlmxkY4Qvx4Jrs0WFV9ZDWeL3JypAB5khgVhcJ1-bLYn5F8Y8PA/s1600/slang-diagram.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="423" data-original-width="454" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjpoKJVaqQxc7_bPYnRXzTVBkZz28NEhbcMqw09mNvN_2Kzktg-vwVp50uczHaacONmoP93BanfTC3I_lFroKVYlmxkY4Qvx4Jrs0WFV9ZDWeL3JypAB5khgVhcJ1-bLYn5F8Y8PA/s1600/slang-diagram.png" /></a></div>
For GL_ARB_gl_spirv, we ultimately need to translate SPIR-V to LLVM IR. A path for this exists, but it's in the context of radv, not radeonsi. Still, the idea is to reuse this path.<br />
<br />
Most of the differences between radv and radeonsi are in the ABI used by the shaders: the conventions by which the shaders on the GPU know where to load constants and image descriptors from, for example. The existing NIR-to-LLVM code needs to be adjusted to be compatible with radeonsi's ABI. I have mostly completed this work for simple VS-PS shader pipelines, which has the interesting side effect of allowing the GLSL-to-NIR conversion in radeonsi as well. We don't plan to use it soon, but it's nice to be able to compare.<br />
<br />
Then there's adding SPIR-V support to the driver-independent mesa/main code. This is non-trivial, because while GL_ARB_gl_spirv has been designed to remove a lot of the cruft of the old GLSL paths, we still need more supporting code than a Vulkan driver does. This still needs to be explored a bit; the main issue is that GL_ARB_gl_spirv allows using default-block uniforms, so the whole machinery around glUniform*() calls has to work, which requires setting up all the same internal data structures that are set up for GLSL programs. Oh, and it looks like assigning locations is required, too.<br />
<br />
My current plan is to achieve all this by re-using the GLSL linker, giving a final picture that looks like this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0WjbkHpCFu8pxzSSEgTpYAeKXqBdtg7jS5I-Im0hInv_LQv-fNjh1x-d_rA3lYuFt-wysmJm8H9adm4BQDO2GZPe1BMV8NIi8zfXytVK_DdfqwqGqmLJJno0EpdELUlONFxCYFA/s1600/slang-diagram-new.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="425" data-original-width="288" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0WjbkHpCFu8pxzSSEgTpYAeKXqBdtg7jS5I-Im0hInv_LQv-fNjh1x-d_rA3lYuFt-wysmJm8H9adm4BQDO2GZPe1BMV8NIi8zfXytVK_DdfqwqGqmLJJno0EpdELUlONFxCYFA/s1600/slang-diagram-new.png" /></a></div>
So the canonical path in radeonsi for GLSL remains GLSL -> AST -> IR -> TGSI -> LLVM (with an optional deviation along the IR -> NIR -> LLVM path for testing), while the path for GL_ARB_gl_spirv is SPIR-V -> NIR -> LLVM, with NIR-based linking in between. In radv, the path remains as it is today.<br />
<br />
Now, you may rightfully say that the GLSL linker is a huge chunk of subtle code, and quite thoroughly invested in GLSL IR. How could it possibly be used with NIR?<br />
<br />
The answer is that huge parts of the linker don't really care that much about the <i>code</i> in the shaders that are being linked. They only really care about the <i>variables</i>: uniforms and shader inputs and outputs. True, there are a bunch of linking steps that touch code, but most of them aren't actually needed for SPIR-V. Most notably, GL_ARB_gl_spirv doesn't require intrastage linking, and it explicitly disallows the use of features that only exist in compatibility profiles.<br />
<br />
So most of the linker functionality can be preserved simply by converting the relevant variables (shader inputs/outputs, uniforms) from NIR to IR, then performing the linking on those, and finally extracting the linker results and writing them back into NIR. This isn't too much work. Luckily, NIR reuses the GLSL IR type system.<br />
<br />
There are still parts that might need to look at the actual shader code, but my hope is that they are few enough that they don't matter.<br />
<br />
And by the way, some people might want to move the IR -> NIR translation to before linking, so this work would set a foundation for that as well.<br />
<br />
Anyway, I got a ridiculously simple toy VS-PS pipeline working correctly this weekend. The real challenge now is to find actual test cases...Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com2tag:blogger.com,1999:blog-36137506.post-4675803809125601372017-01-22T12:52:00.001+01:002017-01-22T12:52:48.994+01:00Our prosperity rests on inherited knowledge and active brains<br />
Stefan Pietsch recently wrote <a href="http://www.deliberationdaily.de/2017/01/die-maer-vom-recht-auf-arbeit/">an essay</a> in which he comments, in a somewhat free-associative way, on a range of topics around the theme of "work". The text is a bit too unfocused for a comprehensive reply, but I'd like to address a few points on which we don't quite agree.<br />
<br />
Let me start at the end, with the question of what our prosperity is actually based on. Looking at it in historical comparison, Pietsch overlooks a crucial factor: our prosperity rests essentially on the scientific and technological achievements that we inherited from our ancestors and continue to develop today.<br />
<br />
This has interesting consequences. Since that knowledge and those technologies were inherited by humanity as a whole, no individual can simply claim the resulting profits for themselves. True, by turning the inherited knowledge into practical applications, the individual has contributed their part. But that line of reasoning can only justify relative prosperity compared to others who contributed less to the practical application of the inherited knowledge. It can <i>not</i> justify <i>absolute</i> prosperity, because absolute prosperity does not derive from the personal contribution. This tension can be resolved by decidedly "left-wing" legislation that ensures everyone enjoys high absolute prosperity, even if relative differences remain (which will then naturally be smaller).<br />
<br />
Of course, our prosperity doesn't rest on inherited knowledge alone. Pietsch names a few further points, but they don't really get to the root of it. To a very large extent, our prosperity depends on including and using as large a share as possible of the human brains walking this planet, as well as possible, in society's productive processes.<br />
<br />
That naturally includes the freedom Pietsch mentions: brains that want to accomplish constructive things of their own accord must be allowed to do so.<br />
<br />
It also means encouraging brains that might not accomplish constructive things entirely of their own accord. The respect for property that Pietsch mentions is one such incentive system. But looking at property from this perspective, one immediately sees that tradeoffs arise quickly. Cutting top tax rates, for example, is often justified as an incentive. Yet one has to press the point: how strong is the incentive effect of those tax cuts, really? And wouldn't it be better to mobilize the brains of the broad majority instead of demoralizing them with obvious and growing inequality?<br />
<br />
Focusing on active brains casts a different light on many political topics. Pietsch cites, for example, a self-employment rate of around 23% in the Germany of the early 1960s and sees laudable personal initiative in it. (He neglects to mention, by the way, the fact, visible in <a href="http://www2.soziologie.uni-halle.de/publikationen/pdf/0504.pdf">his own source</a>, that the rate in Germany remained higher than in the USA until the year 2000, even though the latter is commonly perceived as far more entrepreneurial. Never trust a statistic you didn't cherry-pick yourself.)<br />
<br />
But in that high self-employment rate I also see companies averaging four people each. No deep specialization is possible there, which in turn puts a certain limit on the optimal use of brains.<br />
<br />
I also see the potential for bogus self-employment, or self-employment out of necessity. A large share of mental energy is wasted there, because people have to wrestle with worries that good left-wing policy could take off their shoulders.<br />
<br />
In the end, it is the inclusion of as many people as possible in productive, creative processes at a high level that is essential for our prosperity. That inclusion can take the form of entrepreneurship, but often that is simply the wrong path.
<br />
Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-28897622540355191702016-12-18T22:27:00.000+01:002016-12-18T22:29:56.699+01:00Consciousness is not a quantum phenomenonConsciousness is weird. Quantum physics is weird. There's this temptation to conclude that therefore, they must be related, but <a href="http://www.smbc-comics.com/comic/the-talk-4">they're almost certainly not</a>. In fact, I want to explore an argument that consciousness <b>cannot</b> be a quantum phenomenon but must necessarily be classical.<br />
<br />
We don't really understand consciousness, but many people have tried. A compelling argument pushed by <a href="https://en.wikipedia.org/wiki/Douglas_Hofstadter">Douglas Hofstadter</a> is that consciousness is a product of certain sufficiently complex "strange loops", self-reflexive processes. Put simply, the mind observes the body, but it also observes itself observing the body, and it observes itself observing itself observing the body, and so on. On some level, this endlessly reflexive process of self-observation <b>is</b> consciousness.<br />
<br />
Part of this self-reflexive process is the simulation of simplified models of the world, including ourselves and other people. This allows us to anticipate the effects of possible actions, which helps us choose the actions that we ultimately take. (Much or even most of our daily decision making doesn't use this complicated process and simply uses subconscious pattern matching. But we're focusing on the conscious part of the mind here.) Running multiple simulations to choose from a set of possible actions requires starting the simulation with the same initial state multiple times. In other words, it requires <b>copying</b> the initial state.<br />
<br />
If consciousness were a quantum phenomenon, this would imply copying some quantum states. However, it is physically impossible to copy a quantum state due to the <a href="https://en.wikipedia.org/wiki/No-cloning_theorem">no-cloning theorem</a>, just like it is impossible to change the amount of energy in the universe.<br />
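For reference, the no-cloning theorem follows from nothing more than the linearity of quantum mechanics. A standard sketch of the argument:

```latex
\text{Assume a unitary } U \text{ clones arbitrary states: }
U\bigl(|\psi\rangle|0\rangle\bigr) = |\psi\rangle|\psi\rangle.

\text{By linearity:}\quad
U\!\left(\tfrac{1}{\sqrt{2}}\bigl(|\psi\rangle+|\phi\rangle\bigr)|0\rangle\right)
= \tfrac{1}{\sqrt{2}}\bigl(|\psi\rangle|\psi\rangle+|\phi\rangle|\phi\rangle\bigr).

\text{But cloning the superposition directly would have to give:}\quad
\tfrac{1}{2}\bigl(|\psi\rangle+|\phi\rangle\bigr)\bigl(|\psi\rangle+|\phi\rangle\bigr).

\text{These two states differ unless } \langle\psi|\phi\rangle \in \{0,1\},
\text{ so no single } U \text{ can clone all states.}
```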
<br />
This immediately gives us a very nice explanation for why we are unable to perceive quantum states. Nothing can copy quantum states, and so a conscious mind that perceives quantum states cannot develop, since perception and memory of states of the world certainly requires making copies of those states (at least partial ones).<br />
<br />
There is some wiggle-room in this argument. Even a mind that only perceives, remembers, and computes on internal representations of classical states might still use quantum physical processes for this computation. And at a certain level, it is actually obviously true that our brains do this: our brains are full of chemistry, and chemistry is full of quantum physics. The point is that this is just an irrelevant implementation detail. Those quantum processes could just as well be simulated by something classical.<br />
<br />
And keep in mind that if we believe in the importance of "strange loops", then the mind itself is one of the things that the mind perceives. The mind cannot perceive quantum states, and so any kind of quantum-ness in the operation of the mind must be purely incidental. It cannot be a part of the "strange loop", and so it must ultimately be of limited importance.<br />
<br />
There is yet more wiggle-room. We know that <a href="https://en.wikipedia.org/wiki/Integer_factorization">some</a> <a href="https://en.wikipedia.org/wiki/Discrete_logarithm">computational tasks</a> can be implemented much more efficiently on a quantum computer than on a classical computer - at least in theory. Nobody has ever built a true quantum computer that is big enough to demonstrate a speed-up over classical computers and<strike> lived to tell the tale</strike> published it. Perhaps consciousness requires the solution of computational tasks that can only be solved efficiently using quantum computing?<br />
<br />
I doubt it. It looks increasingly like even in theory, quantum computers can only speed up a few very specialized mathematical problems, and that's not what our minds are particularly good at.<br />
<br />
It's always possible that this whole argument will turn out to be wrong, once we learn more about consciousness in the future (and wouldn't that be interesting!), but for now, the bottom line for me is: Strange information processing loops most likely play an important role in consciousness. Such loops cannot involve quantum states due to the no-cloning theorem. Hence, consciousness is not a quantum phenomenon.Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-80530610772512903532016-12-07T22:38:00.000+01:002016-12-07T22:38:12.749+01:00Unconditional basic income: financing is not the problemYesterday, Joerg Wellbrock wrote <a href="http://www.spiegelfechter.com/wordpress/134124/bedingungsloses-grundeinkommen-der-grosse-coup-des-neoliberalismus">a piece</a> on the Spiegelfechter blog in which he levels some criticism at the concept of an unconditional basic income (Bedingungsloses Grundeinkommen, BGE). I share much of that criticism, at least as skepticism.<br />
<br />
But when he questions how the BGE is to be financed, he loses me. He sees a financing problem where in reality there is none. The state, at least the <a href="http://nhaehnle.blogspot.de/2011/10/def-souveran-monetarer.html">monetarily sovereign state</a>, can finance whatever it wants, in whatever amount it wants. Believing that a monetarily sovereign state could ever have a financing problem is roughly like believing that the bank can go bankrupt in a game of Monopoly. It can't - and if the printed money runs out, you simply improvise with something else. That's what the rules say.<br />
<br />
Wait! the gentle reader is now exclaiming inwardly, isn't that printing money, and doesn't inflation follow? No, that is <a href="http://nhaehnle.blogspot.de/2011/10/geld-drucken-ist-nicht-inflationar.html">not money <i>printing</i></a>. In Monopoly, too, you wouldn't solve the "problem" by printing money. But yes, of course <i>spending</i> money can always lead to inflation in principle, depending on the circumstances. Wellbrock himself already cites this possibility as a potential problem of the BGE.<br />
<br />
But the important point is that the potential for inflation is the <i>only</i> problem. There is no financing problem. <i>Maybe</i> there is an inflation problem. The distinction matters, for two reasons.<br />
<br />
First, we hand too much power to the so-called creditors and financial markets when we believe in a financing problem. That belief leads our politicians to live in fear of the so-called financial markets. And because those markets are so diffuse and ill-defined, that fear hands de facto power to the high priests at banks and think tanks who pass for interpreters of the market's will. They are by no means neutral observers, and most of them don't have the well-being of the broad population in view either. But we live in a democracy. Its politics must of course acknowledge the constraints of reality, but it must not hand power to unlegitimized groups without cause.<br />
<br />
Second, thinking in terms of financing and credit automatically implies that the money should be paid back, possibly even at a profit. <i>Efficiency</i> is of course a sensible goal for democratic politics too, but running a profit must not be one. <a href="http://nhaehnle.blogspot.de/2012/04/warum-wir-ein-langfristiges.html">Moreover, there are good reasons why the government budget must not be balanced on long-term average at all.</a> There can be occasional periods in which a government surplus makes sense. But on long-term average, an inflation-neutral government budget of a normal, monetarily sovereign state must be in deficit.<br />
<br />
Better, then, to recognize that the emperor has no clothes. Monetarily sovereign states can have inflation problems, but not financing problems.<br />
<br />
Back to the BGE. I'm sympathetic to the concept, but I share the concern that, because of the inflation issue, a BGE probably cannot be realized the way its proponents would like. The closer the robot utopia comes, the more realistic a BGE becomes, of course. I doubt we're there yet, but I have to admit that ultimately it's an empirical question that can only be answered experimentally.<br />
<br />
And as a small postscript: the considerations about the government budget above assume a monetarily sovereign state. Germany is no longer monetarily sovereign but a member of the eurozone. The eurozone has no sovereign, and I would say that this is precisely its core problem. To stabilize the eurozone politically in the long run, it would need a budget at the euro level that takes on important stabilizer functions. That includes elements of the welfare state, and BGE proponents would do well to campaign for a BGE at the euro level. But all of that is a <a href="http://nhaehnle.blogspot.de/2011/09/die-schock-strategie-und-wie-es-anders.html">different topic</a>.Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-29619838798926806012016-11-11T11:09:00.000+01:002016-11-11T11:09:00.199+01:00What Trump and Hitler (may) have in commonComparing the future US president to fascists is nothing new, but there is one facet of the comparison that I have rarely seen so far. In his victory speech, Trump spoke of large-scale investment programs. The USA can in fact make very good use of that. Civil engineers there have been lamenting the poor state of the infrastructure for years, and of course such an investment program would be good for the economy and could strengthen the demand for labor across the board.<br />
<br />
This matters, because one has to realize <a href="http://www.bloomberg.com/news/articles/2016-06-30/americans-with-more-education-have-taken-almost-every-job-created-in-the-recovery">that there are entire strata of US society that the recovery from the economic crisis has passed by completely</a>. These people are right to complain about the status quo.<br />
<br />
That often gets mixed up with what I consider the illegitimate complaints of whites who are simply afraid of losing the perceived privilege of being the majority. Grow up already!, I want to shout at these people. But anyone who therefore brands them racists and simply ignores the legitimate part of their grievances is also making it too easy for themselves. That is exactly how you risk the rise of dangerous figures like Trump.<br />
<br />
The parallel to Hitler - whose government program, despite all its evils, did at first genuinely improve many people's lives through massive investment - matters so much to me because it is so incredibly sad. Why could the democrats of the Weimar Republic not pull themselves together for a large investment program? Why did the USA fail to do so in recent years? And why is Germany failing at it again today, thereby providing fertile ground for the AfD? Why do liberal, progressive parties and politicians leave their flanks so unprotected? One could perfectly well adopt the positive sides of an expansive and inclusive economic program without thereby opening the door to global climate change, exterminating all Jews, or starting a world war.<br />
<br />
When I talk to people about this today, I unfortunately keep running into misinformation about our economic system. With Hitler, the immediate response is that it was all actually bad because of the borrowing (the second popular argument about Hitler, the criticism of the focus on armament, I agree with - but there are so many sensible things to invest in, it really doesn't have to be the military!). In today's Germany, too, the ideology of the <a href="https://de.wikipedia.org/wiki/Wolfgang_Sch%C3%A4uble">schwarze Null</a> (the "black zero" of a balanced budget) is the big obstacle. Likewise, in the USA in recent years the Republicans have been happy to obstruct by pointing at the budget deficit. The latter may raise doubts about whether there will be large investment programs under Trump, but the Republicans have also proven again and again that budget deficits matter to them only as long as they can be used to block Democratic proposals. Paul Ryan's (Republican Speaker of the House) own budget drafts, when analyzed soberly, always contain gigantic budget deficits. Trump may still fail to get his investment program through Congress, but the odds aren't bad.<br />
<br />
In reality, debt and deficits are simply not a problem for <a href="http://nhaehnle.blogspot.de/2011/10/def-souveran-monetarer.html">monetarily sovereign states</a> - at least <a href="http://nhaehnle.blogspot.de/2012/04/warum-wir-ein-langfristiges.html">not in the way most people believe they know</a>. <a href="http://nhaehnle.blogspot.de/2011/09/die-schock-strategie-und-wie-es-anders.html">In the eurozone, the lack of sovereignty complicates the situation</a>, but that is a topic for another time.<br />
<br />
Of course, an investment program alone is not enough. Education needs work too, as do regional structures, and surely more besides. But an investment program is a good, clearly visible, and media-friendly start that makes it easy to move in the right direction.<br />
<br />
Our democracy is too important to sacrifice to an economic ideology.<br />
Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-1672731847984928272016-10-24T21:36:00.000+02:002016-10-24T21:36:23.521+02:00Compiling shaders: dynamically uniform variables and "convergent" intrinsics<p>There are some program transformations that are obviously correct when compiling regular single-threaded or even multi-threaded code, but that cannot be used for shader code. For example:</p>
<pre style="font-family:arial;font-size:12px;border:1px dashed #CCCCCC;width:99%;height:auto;overflow:auto;background:#f0f0f0;;padding:0px;color:#000000;text-align:left;line-height:20px;"><code style="color:#000000;word-wrap:normal;"> v = texture(u_sampler, texcoord);
if (cond) {
gl_FragColor = v;
} else {
gl_FragColor = vec4(0.);
}
... cannot be transformed to ...
if (cond) {
// The implicitly computed derivative of texcoord
// may be wrong here if neighbouring pixels don't
// take the same code path.
gl_FragColor = texture(u_sampler, texcoord);
} else {
gl_FragColor = vec4(0.);
}
... but the reverse transformation is allowed.
</code></pre>
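The comment in the transformed code can be made concrete with a toy model. Implicit derivatives are computed (roughly) by differencing values across the lanes of a 2×2 pixel quad; the lane layout and values below are made up purely for illustration, not how any real GPU lays things out:

```python
def ddx(quad):
    # Coarse horizontal derivative across a 2x2 quad:
    # lane layout [top-left, top-right, bottom-left, bottom-right].
    return quad[1] - quad[0]

texcoord = [0.10, 0.35, 0.12, 0.37]  # per-lane texcoord.x (values made up)

# Outside the branch, all four lanes ran the code that produced texcoord,
# so the implicit derivative is well-defined:
dx_full = ddx(texcoord)              # ~0.25

# If the texture() call is sunk into `if (cond)` and a neighbouring lane
# doesn't take that path, there is no guarantee the helper lane still holds
# a meaningful texcoord; model that as garbage:
texcoord_in_branch = [0.10, float('nan'), 0.12, 0.37]
dx_branch = ddx(texcoord_in_branch)  # the implicit derivative is now bogus
```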
<p>Another example is:</p>
<pre style="font-family:arial;font-size:12px;border:1px dashed #CCCCCC;width:99%;height:auto;overflow:auto;background:#f0f0f0;;padding:0px;color:#000000;text-align:left;line-height:20px;"><code style="color:#000000;word-wrap:normal;"> if (cond) {
v = texelFetch(u_sampler[1], texcoord, 0);
} else {
v = texelFetch(u_sampler[2], texcoord, 0);
}
... cannot be transformed to ...
v = texelFetch(u_sampler[cond ? 1 : 2], texcoord, 0);
// Incorrect, unless cond happens to be dynamically uniform.
... but the reverse transformation is allowed.
</code></pre>
<p>Using <a href="https://www.opengl.org/registry/specs/ARB/shader_ballot.txt">GL_ARB_shader_ballot</a>, yet another example is:</p>
<pre style="font-family:arial;font-size:12px;border:1px dashed #CCCCCC;width:99%;height:auto;overflow:auto;background:#f0f0f0;;padding:0px;color:#000000;text-align:left;line-height:20px;"><code style="color:#000000;word-wrap:normal;"> bool cond = ...;
uint64_t v = ballotARB(cond);
if (other_cond) {
use(v);
}
... cannot be transformed to ...
bool cond = ...;
if (other_cond) {
use(ballotARB(cond));
// Here, ballotARB returns 1-bits only for threads/work items
// that take the if-branch.
}
... and the reverse transformation is also forbidden.
</code></pre>
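A toy SIMT interpreter makes the ballot example concrete: ballotARB returns one bit per lane that is currently active <em>and</em> has the condition set, so sinking the call into a divergent branch changes which lanes participate. This is a sketch of the semantics, not real driver code:

```python
def ballot(active, cond):
    # Toy ballotARB: one bit per lane that is both active and has cond set.
    return sum(1 << i for i, a in enumerate(active) if a and cond[i])

cond       = [True, True, False, True]  # per-lane values (made up)
other_cond = [True, False, True, True]

# Original program: ballotARB(cond) executes with all lanes active,
# and only the *use* of v sits inside `if (other_cond)`:
v = ballot([True] * 4, cond)            # lanes 0, 1, 3 set

# Invalid transformation: sinking the call into the branch means it runs
# with only the lanes where other_cond holds:
v_sunk = ballot(other_cond, cond)       # only lanes 0 and 3 set

assert v != v_sunk
```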
<p>These restrictions are all related to the GPU-specific SPMD/SIMT execution model, and they need to be taught to the compiler. Unfortunately, we partially fail at that today.</p>
<p>Here are some types of restrictions to think about (each of these restrictions should apply on top of any other restrictions that are expressible in the usual, non-SIMT-specific ways, of course):</p>
<ol>
<li><p><em>An instruction can be moved from location A to location B only if B <a href="https://en.wikipedia.org/wiki/Dominator_(graph_theory)">dominates</a> or post-dominates A.</em></p>
<p>This restriction applies e.g. to instructions that take derivatives (like in the first example) or that explicitly take values from neighbouring threads (like in the third example). It also applies to barrier instructions.</p>
<p>This is LLVM's <a href="http://llvm.org/docs/LangRef.html#id700">convergent</a> function attribute as I understand it.</p>
<li><p><em>An instruction can be moved from location A to location B only if A dominates or post-dominates B.</em></p>
<p>This restriction applies to the ballot instruction above, but it is not required for derivative computations or barrier instructions.</p>
<p>This is in a sense dual to LLVM's convergent attribute, so it's co-convergence? Divergence? Not sure what to call this.</p>
<li><p><em>Something vague about not introducing additional non-uniformity in the arguments of instructions / intrinsic calls.</em></p>
<p>This last one applies to the sampler parameter of texture intrinsics (for the second example), to the ballot instruction, and also to the texture coordinates on sampling instructions that implicitly compute derivatives.</p>
</ol>
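To make the first two kinds of restriction concrete, here is a small illustrative sketch (plain Python, not LLVM code; the function names `dominators` and `may_move_convergent` are mine) that checks the CFG-side legality of moving a convergent instruction from block A to block B on a toy control-flow graph:

```python
def dominators(succ, entry):
    """Dominator sets via the naive iterative data-flow algorithm.

    succ maps each block to its list of successors; the result maps
    each block n to the set of blocks that dominate n.
    """
    nodes = set(succ)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    pred = {n: [p for p in nodes if n in succ[p]] for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            sets = [dom[p] for p in pred[n]]
            new = {n} | (set.intersection(*sets) if sets else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

def may_move_convergent(cfg, entry, exit_block, a, b):
    # Rule 1: a convergent instruction may move from A to B only if
    # B dominates or post-dominates A. Post-dominance is computed as
    # dominance on the reversed CFG, rooted at the exit block.
    dom = dominators(cfg, entry)
    rcfg = {n: [p for p in cfg if n in cfg[p]] for n in cfg}
    postdom = dominators(rcfg, exit_block)
    return b in dom[a] or b in postdom[a]

# Classic if/else diamond: entry -> then/else -> join.
cfg = {'entry': ['then', 'else'], 'then': ['join'], 'else': ['join'], 'join': []}
```

On this diamond, sinking a convergent instruction from `entry` into `then` is rejected (the branch neither dominates nor post-dominates the entry), while hoisting from `then` to `entry` or sinking from `then` to `join` is allowed. Rule 2 is the same check with A and B swapped.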
<p>For the last type of restriction, consider the following example:</p>
<pre style="font-family:arial;font-size:12px;border:1px dashed #CCCCCC;width:99%;height:auto;overflow:auto;background:#f0f0f0;;padding:0px;color:#000000;text-align:left;line-height:20px;"><code style="color:#000000;word-wrap:normal;"> uint idx = ...;
 if (idx == 1u) {
   v = texture(u_sampler[idx], texcoord);
 } else if (idx == 2u) {
   v = texture(u_sampler[idx], texcoord);
 }
 ... cannot be transformed to ...
 uint idx = ...;
 if (idx == 1u || idx == 2u) {
   v = texture(u_sampler[idx], texcoord);
 }
</code></pre>
<p>In general, whenever an operation has this mysterious restriction on its arguments, then the second restriction above <em>must</em> apply: we can move it from A to B only if A dominates or post-dominates B, because only then can we be certain that the move introduces no non-uniformity. (At least, this rule applies to transformations that are not SIMT-aware. A SIMT-aware transformation might be able to prove that idx is dynamically uniform even without the predication on idx == 1u or idx == 2u.)</p>
<p>However, the control flow rule is not enough:</p>
<pre style="font-family:arial;font-size:12px;border:1px dashed #CCCCCC;width:99%;height:auto;overflow:auto;background:#f0f0f0;;padding:0px;color:#000000;text-align:left;line-height:20px;"><code style="color:#000000;word-wrap:normal;"> v1 = texture(u_sampler[0], texcoord);
v2 = texture(u_sampler[1], texcoord);
v = cond ? v1 : v2;
... cannot be transformed to ...
v = texture(u_sampler[cond ? 0 : 1], texcoord);
</code></pre>
<p>The transformation does not break any of the CFG-related rules, and it would clearly be correct for a single-threaded program (given the knowledge that texture(...) is an operation without side effects). So the CFG-based restrictions really aren't sufficient to model the real set of restrictions that apply to the texture instruction. And it gets worse:</p>
<pre style="font-family:arial;font-size:12px;border:1px dashed #CCCCCC;width:99%;height:auto;overflow:auto;background:#f0f0f0;;padding:0px;color:#000000;text-align:left;line-height:20px;"><code style="color:#000000;word-wrap:normal;"> v1 = texelFetch(u_sampler, texcoord[0], 0);
v2 = texelFetch(u_sampler, texcoord[1], 0);
v = cond ? v1 : v2;
... is equivalent to ...
v = texelFetch(u_sampler, texcoord[cond ? 0 : 1], 0);
</code></pre>
<p>After all, texelFetch computes no implicit derivatives.</p>
<p>Calling the three kinds of restrictions 'convergent', 'co-convergent', and 'uniform', we get:</p>
<pre style="font-family:arial;font-size:12px;border:1px dashed #CCCCCC;width:99%;height:auto;overflow:auto;background:#f0f0f0;;padding:0px;color:#000000;text-align:left;line-height:20px;"><code style="color:#000000;word-wrap:normal;"> texture(uniform sampler, uniform texcoord)    convergent (co-convergent)
 texelFetch(uniform sampler, texcoord, lod)    (co-convergent)
 ballotARB(uniform cond)                       convergent co-convergent
 barrier()                                     convergent
</code></pre>
<p>For the texturing instructions, I put 'co-convergent' in parentheses because these instructions aren't <em>inherently</em> 'co-convergent'. The attribute is only there because of the 'uniform' function argument.</p>
<p>Actually, looking at the examples, it seems that co-convergent only appears when a function has a uniform argument. Then again, the texelFetch function <em>can</em> be moved freely in the CFG by a SIMT-aware pass that can prove that the move doesn't introduce non-uniformity to the sampler argument, so being able to distinguish functions that are inherently co-convergent (like ballotARB) from those that are only implicitly co-convergent (like texture and texelFetch) is still useful.</p>
<p>For added fun, things get muddier when you notice that in practice, AMDGPU doesn't even flag texturing intrinsics as 'convergent' today. Conceptually, the derivative-computing intrinsics need to be convergent to ensure that the texture coordinates for neighbouring pixels are preserved (as in the very first example). However, the AMDGPU backend does register allocation <em>after</em> the CFG has been transformed into the wave-level control-flow graph. So register allocation automatically preserves neighbouring pixels even when a texture instruction is sunk into a location with additional control-flow dependencies.</p>
<p>When we reach a point where vector register allocation happens with respect to the thread-level control-flow graph, then texture instructions really need to be marked as convergent for correctness. (This change would be beneficial overall, but is tricky because scalar register allocation must happen with respect to the wave-level control flow graph. LLVM currently wants to allocate all registers in one pass.)</p>Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-11397740006302908772016-09-03T20:18:00.000+02:002016-09-03T20:18:00.224+02:00Exec masks in LLVMAs is usual on GPUs, Radeon executes shaders in waves that run the same program for many threads or work-items simultaneously in lock-step. Given a single program counter for up to 64 items (e.g. pixels being processed by a pixel shader), branch statements must be lowered to manipulation of the <i>exec mask</i> (unless the compiler can prove the branch condition to be uniform across all items). The exec mask is simply a bit-field that contains a 1 for every thread that is currently active, so code like this:
<pre>
if (i != 0) {
  ... some code ...
}
</pre>
gets lowered to something like this:
<pre>
  v_cmp_ne_i32_e32 vcc, 0, v1
  s_and_saveexec_b64 s[0:1], vcc
  s_xor_b64 s[0:1], exec, s[0:1]
if_block:
  ... some code ...
join:
  s_or_b64 exec, exec, s[0:1]
</pre>
(The <code>saveexec</code> assembly instructions apply a bit-wise operation to the exec register, storing the original value of exec in their destination register. Also, we can introduce branches to skip the if-block entirely if the condition happens to be uniformly false.)
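The exec manipulation in this lowering can be sketched in plain Python (an illustration only; `run_if` is my name for it, and the real hardware tracks exec in a pair of scalar registers):

```python
WAVE_SIZE = 64
FULL_MASK = (1 << WAVE_SIZE) - 1

def run_if(exec_mask, i_values):
    """Simulate the exec-mask lowering of "if (i != 0)" for one wave."""
    # v_cmp_ne_i32_e32 vcc, 0, v1: per-lane compare writes a bitmask into vcc.
    vcc = 0
    for lane in range(WAVE_SIZE):
        if (exec_mask >> lane) & 1 and i_values[lane] != 0:
            vcc |= 1 << lane
    # s_and_saveexec_b64 s[0:1], vcc: save exec, then exec &= vcc.
    saved = exec_mask
    exec_mask &= vcc
    # s_xor_b64 s[0:1], exec, s[0:1]: s[0:1] now holds exactly the lanes
    # that were active before but skip the if-block.
    saved ^= exec_mask
    # "... some code ..." runs here with only the if-lanes enabled.
    inside_mask = exec_mask
    # s_or_b64 exec, exec, s[0:1]: re-enable the skipped lanes at the join.
    exec_mask |= saved
    return inside_mask, exec_mask

# All 64 lanes active; even lanes have i == 0, odd lanes have i == 1.
i_values = [lane % 2 for lane in range(WAVE_SIZE)]
```

Running this with a fully active wave, only the odd lanes are active inside the if-block, and the full mask is restored at the join.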
<br><br>
This is quite different from CPUs, and so a generic compiler framework like LLVM tends to get confused. For example, the fast register allocator in LLVM is a very simple allocator that just spills all live registers at the end of a basic block before the so-called <em>terminators</em>. Usually, those are just branch instructions, so in the example above it would spill registers after the s_xor_b64.
<br><br>
This is bad because the exec mask has already been reduced by the if-condition at that point, and so vector registers end up being spilled only partially.
<br><br>
Until recently, these issues were hidden by the fact that we lowered the control flow instructions into their final form only at the very end of the compilation process. However, previous optimization passes including register allocation can benefit from seeing the precise shape of the GPU-style control flow earlier. But then, some of the subtleties of the exec masks need to be taken into account by those earlier optimization passes as well.
<br><br>
A related problem arises with another GPU-specific specialty, the "whole quad mode". We want to be able to compute screen-space derivatives in pixel shaders - <a href="https://en.wikipedia.org/wiki/Mipmap">mip-mapping</a> would not be possible without it - and the way this is done in GPUs is to always run pixel shaders on 2x2 blocks of pixels at once and approximate the derivatives by taking differences between the values for neighboring pixels. This means that the exec mask needs to be turned <em>on</em> for pixels that are not really covered by whatever primitive is currently being rendered. Those are called <a href="https://www.opengl.org/sdk/docs/man/html/gl_HelperInvocation.xhtml">helper pixels</a>.
<br><br>
However, there are times when helper pixels absolutely must be disabled in the exec mask, for example when <a href="https://www.opengl.org/registry/specs/ARB/shader_image_load_store.txt">storing to an image</a>. A separate pass deals with the enabling and disabling of helper pixels. Ideally, this pass should run after instruction scheduling, since we want to be able to rearrange memory loads and stores freely, which can only be done <em>before</em> adding the corresponding exec-instructions. The instructions added by this pass look like this:
<pre>
s_mov_b64 s[2:3], exec
s_wqm_b64 exec, exec
... code with helper pixels enabled goes here ...
s_and_b64 exec, exec, s[2:3]
... code with helper pixels disabled goes here ...
</pre>
Naturally, adding the bit-wise AND of the exec mask must happen in a way that doesn't conflict with any of the exec manipulations for control flow. So some careful coordination needs to take place.
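The effect of the <code>s_wqm_b64</code> instruction itself is simple to sketch (illustrative Python, not driver code; `wqm` is my name): for every quad of four consecutive lanes, if any lane in the quad is live, all four lanes of that quad are enabled.

```python
def wqm(exec_mask, wave_size=64):
    """Whole quad mode: enable all 4 lanes of any quad with a live lane."""
    out = 0
    for quad in range(0, wave_size, 4):
        if (exec_mask >> quad) & 0xF:
            out |= 0xF << quad
    return out

# A single covered pixel in the second quad enables its whole quad,
# turning the other three lanes into helper pixels.
```

Since `wqm(m)` only ever adds bits, the later <code>s_and_b64 exec, exec, s[2:3]</code> with the saved mask exactly removes the helper pixels again.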
<br><br>
<a href="https://reviews.llvm.org/D24215">My suggestion</a> is to allow arbitrary instructions at the beginning and end of basic blocks to be marked as "initiators" and "terminators", as opposed to the current situation, where there is no notion of initiators, and whether an instruction is a terminator is a property of the opcode. An alternative, which Matt Arsenault is working on, adds aliases for certain exec-instructions which act as terminators. This may well be sufficient; I'm looking forward to seeing the result.Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-17116729239834511912016-06-18T12:10:00.000+02:002016-06-18T12:10:06.234+02:00How to evaluate a privatization of the AutobahnToday I learn, probably somewhat belatedly, that Germany is once again discussing the privatization of its Autobahn network. What is this privatization supposed to be good for?
<br><br>
As far as I can see, the current debate offers, roughly sketched, two arguments. First, interest rates are low and private investors are longing for opportunities to make a profit. Private highways would be such an opportunity. I find this argument rather shameless. Private investors - above all the individual investors who typically profit the most from privatizations - are, by definition, among the people who are the very last to need help from the state. Letting policy be guided by their interests is simply unacceptable.
<br><br>
So let's turn to the second argument. Germans do see that investment in the Autobahn network is necessary, but they also <a href="https://de.wikipedia.org/wiki/Wolfgang_Sch%C3%A4uble">love the "schwarze Null"</a> (the balanced federal budget). But this argument doesn't hold up either. To evaluate the privatization of the Autobahn sensibly, it helps to look at the net money flows:
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDyhPr3X67VXrT8C0YwL6yv476C4beoCU6oHN9RkLJWYd1OB3m8eCOo0SjPBzNsNhvxtIY-SLa_LclLDp-EZFMJdqkWb5ThlGfjba60LArjQjrFzr_uC-lHjC-y0mDKiXm1_xHPQ/s1600/autobahn-finanzen.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDyhPr3X67VXrT8C0YwL6yv476C4beoCU6oHN9RkLJWYd1OB3m8eCOo0SjPBzNsNhvxtIY-SLa_LclLDp-EZFMJdqkWb5ThlGfjba60LArjQjrFzr_uC-lHjC-y0mDKiXm1_xHPQ/s1600/autobahn-finanzen.png" /></a>
<br>
No matter how the Autobahn is organized financially, in the end the system's revenue comes partly from its users and partly from general taxes (some revenue sources can lie somewhere in between, for example fuel taxes). And in the end, the spending goes partly to the actual construction and operation of the highways - that much is obvious - and partly to private investors. The latter is not quite so obvious: even a highway that is entirely in state hands makes payments to private investors when the state <a href="http://nhaehnle.blogspot.de/2011/09/modern-monetary-theory-eine.html">"borrows"</a> to finance construction and operation.
<br><br>
From the perspective of the common good, the money flows raise two main questions: how much has to be paid in on the left-hand side, and how is the bill split between highway users and citizens in general? Exactly as much money necessarily comes in on the left as flows out again on the right. So it makes sense for policy to focus on the right-hand side. Does privatization help reduce the spending on the right-hand side?
<br><br>
My short assessment: privatization can achieve very little in construction and operation. The biggest chunk is construction, and that has long been put out to tender to private companies. If private enterprise brings any efficiency gains here, they have long since been realized.
<br><br>
Things get more interesting with the payments to private investors. The expected return on private ventures is significantly higher than the interest on federal bonds. We know this from other privatizations: if private investors cannot expect a higher profit than from federal bonds, they will simply buy federal bonds instead. So we should expect that privatizing the Autobahn will tend to increase total spending, making the whole system more expensive for citizens and users. Privatization is a losing proposition.
<br><br>
The second question - whether revenue should come more from general taxes or more directly from users - is quite obviously independent of the question of privatization. One only needs to keep in mind (however little of a fan of driving I personally am) that financing the system more heavily through its users tends to increase the spending side and thus makes the whole system more expensive - collecting tolls is a bureaucratic effort. Whether the efficiency lost to tolls is an appropriate price for greater fairness is something everyone has to decide for themselves. Note also the analogy to the <a href="http://www.taz.de/!5109537/">demand for free public transit</a>!
<br><br>
Incidentally: even if spending were, surprisingly, to fall with privatization, there are additional considerations that tend to speak against it. For one, there is the question of control: even if the state remains the majority owner of the Autobahn network, the influence of forces without democratic legitimacy over the operation of the country's critical infrastructure still grows. How large the financial gain would have to be to justify this loss of control is again something everyone must decide for themselves.
<br><br>
For another, the composition of the private investors also matters - how widely are the payments to private investors effectively spread across the population? With standardized, "boring" federal bonds, the private investors are often boring funds in which more "ordinary", less wealthy citizens participate, for example for their retirement savings. The payments to private investors thus do not reach all citizens equally, but they are spread relatively widely. A privatization, however, does not work with such "boring" financial instruments, so the returns tend to accumulate in the hands of people who are already rich. The payments to private investors therefore tend to increase inequality in the country - another argument against privatization.
<br><br>
Does that mean every privatization is bad? After all, the arguments above transfer to other privatizations as well. Indeed, looking at the net money flows is always helpful, and yes, the conclusion looks similar for all the privatization projects being discussed these days. But that is mainly because our state has already largely withdrawn from the business of actually running things. If the steel industry or the production of industrial robots were in state hands today, the analysis of privatizing those companies would surely look different, because there are credible arguments for efficiency gains from opening such markets. In the case of infrastructure, however, there is usually no meaningful market to open, and hence hardly any argument for privatization.
Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-15579298520900470062016-05-19T06:18:00.000+02:002016-05-19T06:18:53.861+02:00A little 5-to-8-bit mysteryWriting the accelerated glReadPixels path for reads to <a href="https://www.opengl.org/wiki/Pixel_Buffer_Object">PBOs</a> for Gallium, I wanted to make sure the various possible format conversions are working correctly. They do, but I noticed something strange: when reading from a GL_RGB565 framebuffer to GL_UNSIGNED_BYTE, I was getting tiny differences in the results depending on the code path that was taken. What was going on?
<br><br>
Color values are conceptually floating point values, but most of the time, so-called normalized formats are used to store the values in memory. In fact, many probably think of color values as 8-bit normalized values by default, because of the way many graphics programs present color values and because of the #cccccc color format of HTML.
<br><br>
Normalized formats generalize this well-known notion to an arbitrary number of bits. Given a normalized integer value <i>x</i> in N bits, the corresponding floating point value is <i>x / (2**N - 1)</i> - for example, <i>x / 255</i> for 8 bits and <i>x / 31</i> for 5 bits. When converting between normalized formats with different bit depths, the values cannot be mapped perfectly. For example, since 255 and 31 are <a href="https://en.wikipedia.org/wiki/Coprime_integers">coprime</a>, the only floating point values representable exactly in both 5- and 8-bit channels are 0.0 and 1.0.
<br><br>
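This conversion is easy to sketch in Python; `unorm_convert` below (the names are mine, not Mesa's) re-encodes an N-bit normalized value into M bits by rounding to the nearest representable value:

```python
def unorm_to_float(x, bits):
    # A normalized integer x in N bits represents x / (2**N - 1).
    return x / ((1 << bits) - 1)

def unorm_convert(x, from_bits, to_bits):
    # Re-encode by rounding to the nearest representable value
    # in the destination bit depth.
    return round(unorm_to_float(x, from_bits) * ((1 << to_bits) - 1))

# 0 and the maximum value map exactly; everything in between is rounded,
# e.g. 20 in 5 bits (20/31 = 0.6452...) becomes 165 in 8 bits.
```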
So some imprecision is unavoidable, but why was I getting different values in different code paths?
<br><br>
It turns out that the non-PBO path first blits the requested framebuffer region to a staging texture, from where the result is then memcpy()d to the user's buffer. It is the GPU that takes care of the copy from VRAM, the <a href="https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-swizzling/">de-tiling</a> of the framebuffer, and the format conversion. The blit uses the normal 3D pipeline with a simple fragment shader that reads from the "framebuffer" (which is really bound as a texture during the blit) and writes to the staging texture (which is bound as the framebuffer).
<br><br>
Normally, fragment shaders operate on 32-bit floating point numbers. However, Radeon hardware allows an optimization where color values are exported from the shader to the CB hardware unit as <a href="https://en.wikipedia.org/wiki/Half-precision_floating-point_format">16-bit half-precision floating point</a> numbers when the framebuffer does not require the full floating point precision. This is useful because it reduces the bandwidth required for shader exports and allows more shader waves to be in flight simultaneously, because less memory is reserved for the exports.
<br><br>
And it turns out that the value 20 in a 5-bit color channel, when first converted into half-float (fp16) format, becomes 164 in an 8-bit color channel, even though the 8-bit color value that is closest to the floating point number represented by 20 in 5-bit is actually 165. The temporary conversion to fp16 cuts off a bit that would make the difference.
<br><br>
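The effect is easy to reproduce with Python's standard struct module, which supports the IEEE half-precision format via the 'e' format character (a sketch of the arithmetic only, not the actual driver code path):

```python
import struct

def to_fp16(x):
    # Round a Python double to the nearest representable fp16 value
    # by packing and unpacking it as a half-precision float.
    return struct.unpack('<e', struct.pack('<e', x))[0]

def convert_5_to_8_direct(x):
    # 5-bit normalized -> 8-bit normalized, full precision throughout.
    return round(x / 31 * 255)

def convert_5_to_8_via_fp16(x):
    # Same conversion, but with the intermediate value squeezed through fp16,
    # as happens with the half-float shader export path.
    return round(to_fp16(x / 31) * 255)

# 20/31 = 0.64516...; as fp16 this becomes 0.64501953125, and
# 0.64501953125 * 255 = 164.48, which rounds to 164 instead of 165.
```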
Intrigued, I wrote a little script to see how often this happens. It turns out that 20 in a 5-bit channel and 32 in a 6-bit channel are the <em>only</em> cases where the temporary conversion to fp16 leads to the resulting 8-bit value being off by one. Luckily, people don't usually use GL_RGB565 framebuffers... and as a general rule, taking a value from an N-bit channel, converting it to fp16, and then storing the value again in an N-bit channel (of the same bit depth!) will always result in what we started out with, as long as N <= 11 (figuring out why is an exercise left to the reader ;-)) - so the use cases we really care about are fine.Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com1