Tagebuch eines Interplanetaren Botschafters

Monday, November 10, 2025

An alias analysis shower thought

Alias analysis allows a compiler to understand whether two memory accesses may conflict in the sense that they both touch the same memory location and at least one of them is a write. This in turn enables certain compiler optimizations. For example, memory accesses that do not conflict can be rearranged as part of instruction scheduling. Using this freedom where it exists is especially important in GPU programs, where memory accesses frequently have latencies of many hundreds of cycles, and there is no out-of-order scheduling in hardware to bail us out.
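To make the definition of "conflict" concrete, here is a minimal sketch. It assumes the exact byte ranges of both accesses are already known, which is the easy case; a real alias analysis reasons about symbolic pointers and has to answer "may alias" conservatively. The names Access, conflicts and can_reorder are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """One memory access: a half-open byte range plus a read/write flag."""
    start: int      # address of the first byte touched
    size: int       # number of bytes touched
    is_write: bool


def overlaps(a: Access, b: Access) -> bool:
    """True if the two byte ranges share at least one byte."""
    return a.start < b.start + b.size and b.start < a.start + a.size


def conflicts(a: Access, b: Access) -> bool:
    """Two accesses conflict if they overlap and at least one is a write."""
    return overlaps(a, b) and (a.is_write or b.is_write)


def can_reorder(a: Access, b: Access) -> bool:
    """A scheduler may only swap two accesses that do not conflict."""
    return not conflicts(a, b)
```

Two loads from the same address never conflict under this definition, which is why a scheduler is free to issue them in either order.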

Instruction scheduling typically happens after instruction selection. Instruction selection makes a program harder to reason about because it replaces an internal representation that uses fairly generic instructions (such as pointer addition and pointer-sized multiplications) with target-specific instructions that can be more complex to reason about (such as a shift and an add combined into a single instruction, or a 64-bit addition decomposed into two 32-bit additions with a carry between them). This in turn makes alias analysis a lot harder.
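As an illustration of the second example, here is a small sketch of a 64-bit addition split into two 32-bit additions linked by a carry. It only models the arithmetic; the function names are invented and do not correspond to any particular target's instructions.

```python
MASK32 = 0xFFFFFFFF


def add32_with_carry(a: int, b: int, carry_in: int = 0):
    """Add two 32-bit values plus a carry; return (32-bit result, carry out)."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32


def add64_split(x: int, y: int) -> int:
    """64-bit wrapping addition expressed as two 32-bit additions."""
    # Low halves first; this addition produces the carry.
    lo, carry = add32_with_carry(x & MASK32, y & MASK32)
    # High halves consume the carry; the final carry out is dropped (wraparound).
    hi, _ = add32_with_carry((x >> 32) & MASK32, (y >> 32) & MASK32, carry)
    return (hi << 32) | lo
```

After this decomposition there is no longer a single "pointer plus offset" operation to look at: the address exists only as two halves connected by a carry, which is the kind of pattern an analysis would have to see through.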

LLVM addresses this issue by keeping references to pointer values from the pre-isel intermediate representation around. Alias analysis is then performed based on those pointer values.
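The idea can be sketched roughly as follows. This is a toy model and not LLVM's actual data structures: IRPointer and MachineMemOp are hypothetical names, and the sketch assumes that two different base objects never overlap, which a real analysis would have to prove.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IRPointer:
    """Pre-isel view of an address: a named base object plus a byte offset."""
    base: str
    offset: int


@dataclass(frozen=True)
class MachineMemOp:
    """A target-specific memory instruction after instruction selection."""
    opcode: str
    size: int
    is_store: bool
    # Reference back to the pre-isel pointer; None if it was lost on the way.
    ir_pointer: Optional[IRPointer] = None


def may_conflict(a: MachineMemOp, b: MachineMemOp) -> bool:
    """Conservative conflict query based only on the preserved pointers."""
    if not (a.is_store or b.is_store):
        return False  # two loads never conflict
    if a.ir_pointer is None or b.ir_pointer is None:
        return True   # no pointer information: assume the worst
    pa, pb = a.ir_pointer, b.ir_pointer
    if pa.base != pb.base:
        return False  # assumption of this sketch: distinct bases are disjoint
    return pa.offset < pb.offset + b.size and pb.offset < pa.offset + a.size
```

The fallback branch is the interesting one: once the reference to the original pointer is gone, the only safe answer is "may conflict", and the scheduler loses its freedom.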

To some extent this seems a bit arbitrary and merely a curious historical artifact. There are many "isel-related" transforms that happen on LLVM IR before the "actual" instruction selection pass that generates MachineIR. Especially with the unfortunate co-existence of SelectionDAG and GlobalISel, there is a strong incentive to extract certain common lowerings into earlier passes in LLVM IR, to avoid code duplication. However, pulling complex lowerings on address calculations earlier in the pass pipeline currently means likely losing important information about pointers and therefore weakening alias analysis. It would be great if we could explicitly preserve the pointers from an earlier point in compilation somehow.

There really doesn't seem to be a good a priori argument against it. A compiler written from scratch could easily use a unified IR substrate throughout and freely choose a point at which pointers for alias analysis become "frozen". It'd just be a massive undertaking to move LLVM to such a model.

Saturday, May 04, 2024

A new kind of git history

Discussions about rebase vs. merge are familiar territory for anybody with an interest in version control in general and git in particular. I want to finally give a more permanent home to an idea that I have expressed in the past and that I've occasionally seen others hint at in those discussions as well.

There are multiple camps in these discussions that have slightly different ideas about how and for what purposes git should be used.

The first major axis of disagreement is whether history needs to be git bisect-able. Outside of my own little hobby projects, I've always worked on projects for which bisectability was important. This has generally been because their scope was such that CI simply had no chance to cover all uses of the software. Bug reports that can be traced to regressions from weeks or even months ago are not frequent per se, but they have always been frequent enough to matter. git bisect is an essential tool for finding those regression points when they happen. Not all projects are like that, but for projects which are, the notion of an "atomic" change to the project's main development branch (or branches) is important.

The second major axis of disagreement is whether the development history of those "atomic" changes is important enough to preserve. The original git development workflow does not consider this to be important: developers send around and review multiple iterations of a change, but only the final version of the change goes into the permanent record of the git repository. I tend to agree with that view. I have very occasionally found it useful to go back and read through the comments on a pull request that led to a change months ago (or the email thread in projects that use an email workflow), but I have never found it useful to look at older versions of a change.

Some people seem to really care about this kind of history, though. They're the people who argue for a merge-based workflow for pull requests on GitHub (but against force-pushes to the same) and who have built hacks for bisectability and readability of history like --first-parent. I'm calling that a hack because it does not compose well. It works for projects whose atomic change history is essentially linear, but it breaks down once the history becomes more complex. What if the project occasionally has a genuine merge? Now you'd want to apply --first-parent for most merge commits but not all. Things get messy.

One final observation. Even "my" camp, which generally prefers to discard development history leading up to the atomic change in a main development branch, does want to preserve a kind of history that is currently not captured by git's graph. git revert inserts the hash of the commit that was reverted into the commit message. Similarly, git cherry-pick optionally inserts the hash of the commit that was cherry-picked into the commit message.

In other words, there is a kind of history for whose preservation at least in some cases there seems to be a broad consensus. This kind of history is distinct from the history that is captured by commit parent links. Looked at in this light, the idea is almost obvious: make this history an explicit part of git commit metadata.

The gist of it would be this. Every commit has an (often empty) list of historical commit references explaining the origins of the diff that is implicitly represented by the commit; let's call them diff-parents. The diff-parents are an ordered list of references to commits, each of them with a "reverted" bit that can optionally be set.

The history of a revert can be encoded by making the reverted commit a diff-parent with the "reverted" bit set. The history of a cherry-pick can be encoded similarly, with the "reverted" bit clear. When we perform a simple rebase, each new commit has an obvious diff-parent. When commits are squashed during a rebase, the sequence of squashed commits becomes the list of diff-parents of the newly formed commit. GitHub users who like to preserve all development history can use the "squash" option when landing pull requests and have the history be preserved via the list of diff-parents. git commit --amend can similarly record the original commit as diff-parent.
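A minimal sketch of what this metadata could look like as a data model. Nothing here exists in git; the class names, field names and helper functions are all hypothetical, and commit ids are plain strings standing in for hashes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class DiffParent:
    """Reference to a commit whose diff this commit derives from."""
    commit: str
    reverted: bool = False


@dataclass
class Commit:
    commit_id: str
    parents: List[str]                       # ordinary parent links
    diff_parents: List[DiffParent] = field(default_factory=list)


def revert(new_id: str, head: str, target: str) -> Commit:
    """A revert records the reverted commit with the 'reverted' bit set."""
    return Commit(new_id, [head], [DiffParent(target, reverted=True)])


def cherry_pick(new_id: str, head: str, source: str) -> Commit:
    """A cherry-pick records its source with the 'reverted' bit clear."""
    return Commit(new_id, [head], [DiffParent(source)])


def squash(new_id: str, base: str, squashed: List[str]) -> Commit:
    """Squashing keeps the squashed commits, in order, as diff-parents."""
    return Commit(new_id, [base], [DiffParent(c) for c in squashed])


def amend(new_id: str, original: Commit) -> Commit:
    """Amending keeps the parents and records the original as diff-parent."""
    return Commit(new_id, list(original.parents), [DiffParent(original.commit_id)])
```

Note that the ordinary parent list is untouched in all four cases, so tools that only understand today's commit graph would see exactly the history they see now.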

This is an idea and not a fully fleshed-out plan. There are obviously a whole bunch of tricky questions to answer. For example: How does this all fit into git's admittedly often byzantine CLI? Can merge commits be diff-parents, and how would that work? Can we visualize the difference between a commit and its diff-parents? (Hint: Here's an idea)

Diff-parents are a source of potential information leaks. This is not a problem specific to the idea of diff-parents; it is a general problem with the idea of preserving all history. Imagine some developer accidentally commits some credentials in their local clone of a repository and then uses git commit --amend to remove them again. Whoops, the commit that contains the credentials is still referenced as a diff-parent. Will it (and therefore the credentials) be published to the world for all to see when the developer pushes their branch to GitHub? This needs to be taken seriously.
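The leak scenario boils down to a reachability question: which commits become reachable from a branch tip once diff-parent links are followed as well? A small sketch, with the commit graph modeled as plain dictionaries and all names invented for illustration:

```python
def reachable(tip, parents, diff_parents=None):
    """Return the set of commits reachable from tip.

    parents and diff_parents map a commit id to a list of commit ids.
    If diff_parents is None, only ordinary parent links are followed.
    """
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
        if diff_parents is not None:
            stack.extend(diff_parents.get(commit, []))
    return seen


def newly_exposed(tip, parents, diff_parents):
    """Commits that would be published only because of diff-parent links."""
    return reachable(tip, parents, diff_parents) - reachable(tip, parents)
```

With parents = {"fixed": ["base"], "leaky": ["base"]} and diff_parents = {"fixed": ["leaky"]}, newly_exposed("fixed", parents, diff_parents) contains exactly the amended-away commit. A push that transfers everything reachable would therefore need either an explicit opt-in or a way to prune such links.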

So there are a whole bunch of issues that would have to be addressed for this idea to work well. I believe those issues to be quite surmountable in principle, but given the state of git development (where GitHub, which to many is almost synonymous with git, doesn't even seem to be able to understand how git was originally meant to be used) I am not particularly optimistic. Still, I think it's a good idea, and I'd love to see it or something like it in git.

Wednesday, February 07, 2024

Building a HIP environment from scratch

HIP is a C++-based, single-source programming language for writing GPU code. "Single-source" means that a single source file can contain both the "host code" which runs on the CPU and the "device code" which runs on the GPU. In a sense, HIP is "CUDA for AMD", except that HIP can actually target both AMD and Nvidia GPUs.

If you merely want to use HIP, your best bet is to look at the documentation and download pre-built packages. (By the way, the documentation calls itself "ROCm" because that's what AMD calls its overall compute platform. It includes HIP, OpenCL, and more.)

I like to dig deep, though, so I decided I want to build at least the user space parts myself to the point where I can build a simple HelloWorld using a Clang from upstream LLVM. It's all open-source, after all!

It's a bit tricky, though, in part because of the kind of bootstrapping problems you usually get when building toolchains: Running the compiler requires runtime libraries, at least by default, but building the runtime libraries requires a compiler. Luckily, it's not quite that difficult, because compiling the host libraries doesn't require a HIP-enabled compiler - any C++ compiler will do. And while the device libraries do require a HIP- (and OpenCL-)enabled compiler, it is possible to build code in a "freestanding" environment where runtime libraries aren't available.

What follows is pretty much just a list of steps with running commentary on what the individual pieces do, since I didn't find an equivalent recipe in the official documentation. Of course, by the time you read this, it may well be outdated. Good luck!

Components need to be installed, but installing into some arbitrary prefix inside your $HOME works just fine. Let's call it $HOME/prefix. All packages use CMake and can be built using invocations along the lines of:

cmake -S . -B build -GNinja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_INSTALL_PREFIX=$HOME/prefix -DCMAKE_PREFIX_PATH=$HOME/prefix
ninja -C build install

In some cases, additional variables need to be set.
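Since the same two commands are repeated for every component, a small helper can generate them. This is only a convenience sketch: the function names are made up, the source directory is whatever checkout path you use, and the extra defines are the per-step variables mentioned in the steps that follow.

```python
import os
import subprocess


def cmake_commands(source_dir, prefix, extra_defines=()):
    """Return the configure and install command lines as argument lists."""
    build_dir = os.path.join(source_dir, "build")
    configure = [
        "cmake", "-S", source_dir, "-B", build_dir, "-GNinja",
        "-DCMAKE_BUILD_TYPE=RelWithDebInfo",
        f"-DCMAKE_INSTALL_PREFIX={prefix}",
        f"-DCMAKE_PREFIX_PATH={prefix}",
        *extra_defines,
    ]
    install = ["ninja", "-C", build_dir, "install"]
    return [configure, install]


def build_and_install(source_dir, prefix, extra_defines=()):
    """Run configure and install, stopping at the first failing command."""
    for command in cmake_commands(source_dir, prefix, extra_defines):
        subprocess.run(command, check=True)
```

For example, build_and_install("path/to/checkout", os.path.expanduser("~/prefix"), ["-DBUILD_TESTING=OFF"]) would configure and install one component with one additional variable set.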

Step 1: clang and lld

We're going to need a compiler and linker, so let's get llvm/llvm-project and build it with Clang and LLD enabled: -DLLVM_ENABLE_PROJECTS='clang;lld' -DLLVM_TARGETS_TO_BUILD='X86;AMDGPU'

Building LLVM is an art of its own which is luckily reasonably well documented, so I'm going to leave it at that.

Step 2: Those pesky cmake files

Build and install ROCm/rocm-cmake to avoid cryptic error messages down the road when building other components that use those CMake files without documenting the dependency clearly. Not rocket science, but man am I glad for GitHub's search function.

Step 3: libhsa-runtime64.so

This is the lowest level user space host-side library in the ROCm stack. Its services, as far as I understand them, include setting up device queues and loading "code objects" (device ELF files). All communication with the kernel driver goes through here.

Notably though, this library does not know how to dispatch a kernel! In the ROCm world, the so-called Architected Queueing Language is used for that. An AQL queue is set up with the help of the kernel driver (and that does go through libhsa-runtime64.so), and then a small ring buffer and a "door bell" associated with the queue are mapped into the application's virtual memory space. When the application wants to dispatch a kernel, it (or rather, a higher-level library like libamdhip64.so that it links against) writes an AQL packet into the ring buffer and "rings the door bell", which basically just means writing a new ring buffer head pointer to the door bell's address. The door bell virtual memory page is mapped to the device, so ringing the door bell causes a PCIe transaction (for us peasants; MI300A has slightly different details under the hood) which wakes up the GPU.
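The ring buffer plus door bell protocol can be illustrated with a toy model. To be clear about the assumptions: this is not the real AQL packet format or queue layout, packets are arbitrary Python objects, the "device" is just a method call, and all names (ToyQueue, submit, device_poll) are invented. The door bell write, which on real hardware is a store to a device-mapped page, is modeled as a plain attribute assignment.

```python
class ToyQueue:
    """Toy single-producer queue: ring buffer plus door bell."""

    def __init__(self, size):
        self.ring = [None] * size
        self.write_index = 0   # next slot the host will fill
        self.read_index = 0    # next slot the device will consume
        self.doorbell = 0      # last position announced to the device

    def submit(self, packet):
        """Host side: write a packet, then ring the door bell."""
        if self.write_index - self.read_index == len(self.ring):
            raise BufferError("ring buffer full")
        self.ring[self.write_index % len(self.ring)] = packet
        self.write_index += 1
        # Ringing the door bell: publish the new position to the device.
        self.doorbell = self.write_index

    def device_poll(self):
        """Device side: consume everything announced via the door bell."""
        processed = []
        while self.read_index < self.doorbell:
            processed.append(self.ring[self.read_index % len(self.ring)])
            self.read_index += 1
        return processed
```

The important property is the ordering in submit: the packet is written to the ring first and the door bell second, so the device never sees a position that points at a slot which has not been filled yet.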

Anyway, libhsa-runtime64.so comes in two parts for what I am told are largely historical reasons:

ROCm/ROCT-Thunk-Interface
ROCm/ROCR-Runtime; this one has one of those bootstrap issues and needs a -DIMAGE_SUPPORT=OFF

The former is statically linked into the latter...

Step 4: It which must not be named

For Reasons(tm), there is a fork of LLVM in the ROCm ecosystem, ROCm/llvm-project. Using upstream LLVM for the compiler seems to be fine and is what I as a compiler developer obviously want to do. However, this fork has an amd directory with a bunch of pieces that we'll need. I believe there is a desire to upstream them, but also an unfortunate hesitation from the LLVM community to accept something so AMD-specific.

In any case, the required components can each be built individually against the upstream LLVM from step 1:

hipcc; this is a frontend for Clang which is supposed to be user-friendly, but at the cost of adding an abstraction layer. I want to look at the details under the hood, so I don't want to and don't have to use it; but some of the later components want it
device-libs; as the name says, these are libraries of device code. I'm actually not quite sure what the intended abstraction boundary is between this one and the HIP libraries from the next step. I think these ones are meant to be tied more closely to the compiler so that other libraries, like the HIP library below, don't have to use __builtin_amdgcn_* directly? Anyway, just keep on building...
comgr; the "code object manager". Provides a stable interface to LLVM, Clang, and LLD services, up to (as far as I understand it) invoking Clang to compile kernels at runtime. But it seems to have no direct connection to the code-related services in libhsa-runtime64.so.

That last one is annoying. It needs a -DBUILD_TESTING=OFF

Worse, it has a fairly large interface with the C++ code of LLVM, which is famously not stable. In fact, at least during my little adventure, comgr wouldn't build as-is against the LLVM (and Clang and LLD) build that I got from step 1. I had to hack out a little bit of code in its symbolizer. I'm sure it's fine.

Step 5: libamdhip64.so

Finally, here comes the library that implements the host-side HIP API. It also provides a bunch of HIP-specific device-side functionality, mostly by leaning on the device-libs from the previous step.

It lives in ROCm/clr, which stands for either Compute Language Runtimes or Common Language Runtime. Who knows. Either one works for me. It's obviously for compute, and it's common because it also contains OpenCL support.

You also need ROCm/HIP at this point. I'm not quite sure why stuff is split up into so many repositories. Maybe ROCm/HIP is also used when targeting Nvidia GPUs with HIP, but ROCm/clr isn't? Not a great justification in my opinion, but at least this is documented in the README.

CLR also needs a bunch of additional CMake options: -DCLR_BUILD_HIP=ON -DHIP_COMMON_DIR=${checkout of ROCm/HIP} -DHIPCC_BIN_DIR=$HOME/prefix/bin

Step 6: Compiling with Clang

We can now build simple HIP programs with our own Clang against our own HIP and ROCm libraries:

clang -x hip --offload-arch=gfx1100 --rocm-path=$HOME/prefix -rpath $HOME/prefix/lib -lstdc++ HelloWorld.cpp
LD_LIBRARY_PATH=$HOME/prefix/lib ./a.out

Neat, huh?

Sunday, December 31, 2023

Vulkan driver debugging stories

Recently, I found myself wanting to play some Cyberpunk 2077. Thanks to Proton, that's super easy and basically just works on Linux. Except that I couldn't enable raytracing, which annoyed me given that I have an RDNA3-based GPU that should be perfectly capable. Part of it may have been that I'm (obviously) using a version of the AMDVLK driver.

The first issue was that Proton simply wouldn't advertise raytracing (DXR) capabilities on my setup. That is easily worked around by setting VKD3D_CONFIG=dxr in the environment (in Steam launch options, set the command to VKD3D_CONFIG=dxr %command%).

This allowed me to enable raytracing in the game's graphics settings which unfortunately promptly caused a GPU hang and a GPUVM fault report in dmesg. Oh well, time for some debugging. That is (part of) my job, after all.

The fault originated from TCP, which means it's a shader vector memory access to a bad address. There's a virtually limitless number of potential root causes, so I told the amdgpu kernel module to take it easy on the reset attempts (by setting the lockup_timeout module parameter to a rather large value - that can be done on the GRUB command line, but I chose to add a setting in /etc/modprobe.d/ instead) and broke out good old trusty UMR in client/server mode (run with --server on the system under debug, and with --gui tcp://${address}:1234 on another system) to look at the waves that were hung. Sure enough, they had the fatal_halt bit set, were stuck a few instructions past a global_load_b64, and looking at VGPRs did suggest a suspicious address.

Tooling for shader debugging is stuck in the earlier parts of the 20th century (which may seem like an impressive feat of time travel given that programmable shading didn't even exist back then, but trust me it's genuinely and inherently way more difficult than CPU debug), so the next step was to get some pipeline dumps to correlate against the disassembly shown in UMR. Easy peasy, point the Vulkan driver at a custom amdVulkanSettings.cfg by way of the AMD_CONFIG_DIR environment variable and enable pipeline dumping by adding EnablePipelineDump,1 to the config file. Oh, and setting the AMD_DEBUG_DIR environment variable is helpful, too. Except now the game crashed before it even reached the main menu. Oops.

Well, that's a CPU code problem, and CPU debugging has left the 1970s firmly behind for somewhere in the 1990s or early 2000s. So let's get ourselves a debug build of the driver and attach gdb. Easy, right? Right?!? No. Cyberpunk 2077 is a Windows game, run in Proton, which is really Wine, which is really an emulator that likes to think of itself as not an emulator, run in some kind of container called a "pressure vessel" to fit the Steam theme. Fun.

To its credit, Proton tries to be helpful. You can set PROTON_DUMP_DEBUG_COMMANDS=1 in the environment which dumps some shell scripts to /tmp/proton-$user/ which allowed me to comparatively easily launch Cyberpunk 2077 from the terminal without going through the Steam client each time. But Wine seems to hate debugging, and it seems to hate debugging of native Linux code even more, and obviously the Vulkan driver is native Linux code. All my attempts to launch the game in some form of debugger in order to catch it red-handed were in vain.

At this point, I temporarily resigned myself to more debugging time travel of the bad kind, i.e. backwards in time to worse tooling. printf() still works, after all, and since the crash was triggered by enabling pipeline dumps, I had a fairly good idea about the general area in the driver that must have contained the problem.

So I went on a spree of sprinkling printf()s everywhere, which led to some extremely confusing and non-deterministic results. Confusing and non-deterministic is a really great hint, though, because it points at multi-threading. Indeed, Cyberpunk 2077 is a good citizen and does multi-threaded pipeline compilation. Or perhaps VKD3D is being helpful. Either way, it's a good thing except it exposed a bug in the driver. So I started sprinkling std::lock_guards everywhere. That helped narrow down the problem area. Add some good old staring at code and behold: somebody had very recently added a use of strtok() to the pipeline dumping logic. Very bad idea, very easy fix.
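Why strtok() is a bad idea in multi-threaded code is easiest to see in a model. The sketch below is a Python imitation, not the C function: it only supports a single delimiter character, and the interleaving of two "threads" is written out by hand so that the result is deterministic. toy_strtok keeps its position in hidden global state like strtok() does; toy_strtok_r takes caller-owned state in the spirit of POSIX strtok_r().

```python
def _next_token(saved, delim):
    """Split the next token off saved; return (token, remaining text)."""
    if saved is None:
        return None, None
    rest = saved.lstrip(delim)          # skip leading delimiters
    if not rest:
        return None, None
    head, sep, tail = rest.partition(delim)
    return head, (tail if sep else None)


_saved = None  # hidden global state shared by every caller


def toy_strtok(text, delim):
    """Non-reentrant: pass text to start, None to continue."""
    global _saved
    if text is not None:
        _saved = text
    token, _saved = _next_token(_saved, delim)
    return token


def toy_strtok_r(text, delim, state):
    """Reentrant: state is a one-element list owned by the caller."""
    if text is not None:
        state[0] = text
    token, state[0] = _next_token(state[0], delim)
    return token
```

If thread A calls toy_strtok("a,b,c", ","), thread B then calls toy_strtok("x,y", ","), and thread A continues with toy_strtok(None, ","), thread A receives a token from B's string. With toy_strtok_r and one state list per thread, the same interleaving is harmless.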

Okay, so I can dump some pipelines now, but I still don't get to the main menu because the game now crashes with an assertion somewhere in PAL. I could start staring at pipeline dumps, but this is an assertion that (1) suggests a legitimate problem, which means prioritizing it might actually be helpful, and (2) is in the kind of function that is called from just about everywhere, which means I really, really need to be able to look at a stacktrace now. It's time to revisit debuggers.

One of the key challenges with my earlier attempts at using gdb was that (1) Wine likes to fork off tons of processes, which means getting gdb to follow the correct one is basically impossible, and (2) the crash happens very quickly, so manually attaching gdb after the fact is basically impossible. But the whole point of software development is to make the impossible possible, so I tweaked the implementation of PAL_ASSERT to poke at /proc/self to figure out whether a debugger is already attached and if one isn't, optionally print out a helpful message including the PID and then sleep instead of calling abort() immediately. This meant that I could now attach gdb at my leisure, which I did.
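The actual change was made in the driver's C++ code. As a rough illustration of the same idea, here is a sketch that assumes the check is done via the TracerPid field in /proc/self/status, which is one common way to detect an attached debugger on Linux but not necessarily what the tweaked PAL_ASSERT does. All function names are hypothetical.

```python
import os
import time


def tracer_pid(status_text):
    """Extract the TracerPid value from the contents of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("TracerPid:"):
            return int(line.split(":", 1)[1])
    return 0  # field missing: treat as "nobody is tracing us"


def debugger_attached(status_path="/proc/self/status"):
    """True if some process (such as gdb) is currently tracing this one."""
    with open(status_path) as status_file:
        return tracer_pid(status_file.read()) != 0


def wait_for_debugger(poll_seconds=1.0, max_polls=None):
    """Print the PID, then sleep until a debugger attaches or polls run out."""
    print(f"assertion failed; attach with: gdb -p {os.getpid()}")
    polls = 0
    while not debugger_attached():
        if max_polls is not None and polls >= max_polls:
            return False
        time.sleep(poll_seconds)
        polls += 1
    return True
```

An assertion handler built on this would call wait_for_debugger() instead of aborting right away, which leaves time to find the right process among the many that Wine forks off.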

And was greeted with an absolutely useless gdb session because Wine is apparently being sufficiently creative with the dynamic linker structures that gdb can't make sense of what's happening on its own and doesn't find any symbols, let alone further debug info, and so there was no useful backtrace. Remember how I mentioned that Wine hates debugging?

Luckily, a helpful soul pointed out that /proc/$pid/maps exists and tells us where .so's are mapped into a process address space, and there's absolutely nothing Wine can do about that. Even better, gdb allows the user to manually tell it about shared libraries that have been loaded. Even even better, gdb can be scripted with Python. So, I wrote a gdb script that walks the backtrace and figures out which shared libraries to tell gdb about to make sense of the backtraces. (Update: Friedrich Vock helpfully pointed out that attaching with gdb -p $pid /path/to/the/correct/bin/wine64 also allows gdb to find shared libraries.)
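The lookup at the heart of such a script can be sketched without gdb. The following assumes the usual /proc/$pid/maps line layout (address range, permissions, file offset, device, inode, optional path) and only finds which shared library an address belongs to; feeding the result to gdb is left out, and the function names are made up.

```python
def parse_maps_line(line):
    """Parse one /proc/<pid>/maps line; return None for unnamed mappings."""
    fields = line.split(None, 5)
    if len(fields) < 6:
        return None  # anonymous mapping without a path
    addr_range, perms, offset, _dev, _inode, path = fields
    start, end = (int(part, 16) for part in addr_range.split("-"))
    return start, end, perms, int(offset, 16), path.strip()


def library_for_address(maps_text, address):
    """Return (path, mapping start, file offset) of the .so containing address."""
    for line in maps_text.splitlines():
        entry = parse_maps_line(line)
        if entry is None:
            continue
        start, end, _perms, offset, path = entry
        if start <= address < end and ".so" in path:
            return path, start, offset
    return None
```

Running library_for_address(open(f"/proc/{pid}/maps").read(), pc) for each program counter in a backtrace yields the set of libraries that gdb needs to be told about.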

At this point, another helpful soul pointed out that Fossilize exists and can play back pipeline creation in a saner environment than a Windows game running on VKD3D in Wine in a pressure vessel. That would surely have reduced my debugging woes somewhat. Oh well, at least I learned something.

From there, fixing all the bugs was almost a walk in the park. The assertion I had run into in PAL was easy to fix, and finally I could get back to the original problem: that GPU hang. That turned out to be a fairly mundane problem in LLPC's raytracing implementation, for which I have a fix. It's still going to take a while to trickle out, in part because this whole debugging odyssey has a corresponding complex chain of patches that just take a while to ferment, and in part because pre-Christmas is very far from a quiet time and things have just been generally crazy. Still: very soon you, too, will be able to play Cyberpunk 2077 with raytracing using the AMDVLK driver.