Sonntag, Dezember 31, 2023

Vulkan driver debugging stories

Recently, I found myself wanting to play some Cyberpunk 2077. Thanks to Proton, that's super easy and basically just works on Linux. Except that I couldn't enable raytracing, which annoyed me given that I have an RDNA3-based GPU that should be perfectly capable. Part of it may have been that I'm (obviously) using a version of the AMDVLK driver.

The first issue was that Proton simply wouldn't advertise raytracing (DXR) capabilities on my setup. That is easily worked around by setting VKD3D_CONFIG=dxr in the environment (in Steam launch options, set the command to VKD3D_CONFIG=dxr %command%).

This allowed me to enable raytracing in the game's graphics settings which unfortunately promptly caused a GPU hang and a GPUVM fault report in dmesg. Oh well, time for some debugging. That is (part of) my job, after all.

The fault originated from TCP, which means it's a shader vector memory access to a bad address. There's a virtually limitless number of potential root causes, so I told the amdgpu kernel module to take it easy on the reset attempts (by setting the lockup_timeout module parameter to a rather large value - that can be done on the Grub command line, but I chose to add a setting in /etc/modprobe.d/ instead) broke out good old trusty UMR in client/server mode (run with --server on the system under debug, and with --gui tcp://${address}:1234 on another system) to look at the waves that were hung. Sure enough, they had the fatal_halt bit set, were stuck a few instructions past a global_load_b64, and looking at VGPRs did suggest a suspicious address.

Tooling for shader debugging is stuck in the earlier parts of the 20th century (which may seem like an impressive feat of time travel given that programmable shading didn't even exist back then, but trust me it's genuinely and inherently way more difficult than CPU debug), so the next step was to get some pipeline dumps to correlate against the disassembly shown in UMR. Easy peasy, point the Vulkan driver at a custom amdVulkanSettings.cfg by way of the AMD_CONFIG_DIR environment variable and enable pipeline dumping by adding EnablePipelineDump,1 to the config file. Oh, and setting the AMD_DEBUG_DIR environment variable is helpful, too. Except now the game crashed before it even reached the main menu. Oops.

Well, that's a CPU code problem, and CPU debugging has left the 1970s firmly behind for somewhere in the 1990s or early 2000s. So let's get ourselves a debug build of the driver and attach gdb. Easy, right? Right?!? No. Cyberpunk 2077 is a Windows game, run in Proton, which is really Wine, which is really an emulator that likes to think of itself as not an emulator, run in some kind of container called a "pressure vessel" to fit the Steam theme. Fun.

To its credit, Proton tries to be helpful. You can set PROTON_DUMP_DEBUG_COMMANDS=1 in the environment which dumps some shell scripts to /tmp/proton-$user/ which allowed me to comparatively easily launch Cyberpunk 2077 from the terminal without going through the Steam client each time. But Wine seems to hate debugging, and it seems to hate debugging of native Linux code even more, and obviously the Vulkan driver is native Linux code. All my attempts to launch the game in some form of debugger in order to catch it red-handed were in vain.

At this point, I temporarily resigned myself to more debugging time travel of the bad kind, i.e. backwards in time to worse tooling. printf() still works, after all, and since the crash was triggered by enabling pipeline dumps, I had a fairly good idea about the general area in the driver that must have contained the problem.

So I went on a spree of sprinkling printf()s everywhere, which led to some extremely confusing and non-determinstic results. Confusing and non-deterministic is a really great hint, though, because it points at multi-threading. Indeed, Cyberpunk 2077 is a good citizen and does multi-threaded pipeline compilation. Or perhaps VKD3D is being helpful. Either way, it's a good thing except it exposed a bug in the driver. So I started sprinkling std::lock_guards everywhere. That helped narrow down the problem area. Add some good old staring at code and behold: somebody had very recently added a use of strtok() to the pipeline dumping logic. Very bad idea, very easy fix.

Okay, so I can dump some pipelines now, but I still don't get to the main menu because the game now crashes with an assertion somewhere in PAL. I could start staring at pipeline dumps, but this is an assertion that (1) suggests a legitimate problem, which means prioritizing it might actually be helpful, and (2) is in the kind of function that is called from just about everywhere, which means I really, really need to be able to look at a stacktrace now. It's time to revisit debuggers.

One of the key challenges with my earlier attempts at using gdb was that (1) Wine likes to fork off tons of processes, which means getting gdb to follow the correct one is basically impossible, and (2) the crash happens very quickly, so manually attaching gdb after the fact is basically impossible. But the whole point of software development is to make the impossible possible, so I tweaked the implementation of PAL_ASSERT to poke at /proc/self to figure out whether a debugger is already attached and if one isn't, optionally print out a helpful message including the PID and then sleep instead of calling abort() immediately. This meant that I could now attach gdb at my leisure, which I did.

And was greeted with an absolutely useless gdb session because Wine is apparently being sufficiently creative with the dynamic linker structures that gdb can't make sense of what's happening on its own and doesn't find any symbols, let alone further debug info, and so there was no useful backtrace. Remember how I mentioned that Wine hates debugging?

Luckily, a helpful soul pointed out that /proc/$pid/maps exists and tells us where .so's are mapped into a process address space, and there's absolutely nothing Wine can do about that. Even better, gdb allows the user to manually tell it about shared libraries that have been loaded. Even even better, gdb can be scripted with Python. So, I wrote a gdb script that walks the backtrace and figures out which shared libraries to tell gdb about to make sense of the backtraces. (Update: Friedrich Vock helpfully pointed out that attaching with gdb -p $pid /path/to/the/correct/bin/wine64 also allows gdb to find shared libraries.)

At this point, another helpful soul pointed out that Fossilize exists and can play back pipeline creation in a saner environment than a Windows game running on VKD3D in Wine in a pressure vessel. That would surely have reduced my debugging woes somewhat. Oh well, at least I learned something.

From there, fixing all the bugs was almost a walk in the park. The assertion I had run into in PAL was easy to fix, and finally I could get back to the original problem: that GPU hang. That turned out to be a fairly mundane problem in LLPC's raytracing implementation, for which I have a fix. It's still going to take a while to trickle out, in part because this whole debugging odyssey has a corresponding complex chain of patches that just take a while to ferment, and in part because pre-Christmas is very far from a quiet time and things have just been generally crazy. Still: very soon you, too, will be able to play Cyberpunk 2077 with raytracing using the AMDVLK driver.