Sunday, December 31, 2023

Vulkan driver debugging stories

Recently, I found myself wanting to play some Cyberpunk 2077. Thanks to Proton, that's super easy and basically just works on Linux. Except that I couldn't enable raytracing, which annoyed me given that I have an RDNA3-based GPU that should be perfectly capable. Part of it may have been that I'm (obviously) using a version of the AMDVLK driver.

The first issue was that Proton simply wouldn't advertise raytracing (DXR) capabilities on my setup. That is easily worked around by setting VKD3D_CONFIG=dxr in the environment (in Steam launch options, set the command to VKD3D_CONFIG=dxr %command%).

This allowed me to enable raytracing in the game's graphics settings, which unfortunately promptly caused a GPU hang and a GPUVM fault report in dmesg. Oh well, time for some debugging. That is (part of) my job, after all.

The fault originated from TCP, which means it's a shader vector memory access to a bad address. There's a virtually limitless number of potential root causes, so I told the amdgpu kernel module to take it easy on the reset attempts (by setting the lockup_timeout module parameter to a rather large value - that can be done on the kernel command line via GRUB, but I chose to add a setting in /etc/modprobe.d/ instead) and broke out good old trusty UMR in client/server mode (run with --server on the system under debug, and with --gui tcp://${address}:1234 on another system) to look at the waves that were hung. Sure enough, they had the fatal_halt bit set and were stuck a few instructions past a global_load_b64, and the VGPRs did suggest a suspicious address.
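For reference, the modprobe.d route is a single line in a config file (the file name is arbitrary, and the value - in milliseconds, if I remember the units correctly - just needs to be generous enough that the kernel leaves the hung waves alone while you poke at them):

# /etc/modprobe.d/amdgpu-debug.conf
options amdgpu lockup_timeout=600000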

Tooling for shader debugging is stuck in the earlier parts of the 20th century (which may seem like an impressive feat of time travel given that programmable shading didn't even exist back then, but trust me it's genuinely and inherently way more difficult than CPU debug), so the next step was to get some pipeline dumps to correlate against the disassembly shown in UMR. Easy peasy, point the Vulkan driver at a custom amdVulkanSettings.cfg by way of the AMD_CONFIG_DIR environment variable and enable pipeline dumping by adding EnablePipelineDump,1 to the config file. Oh, and setting the AMD_DEBUG_DIR environment variable is helpful, too. Except now the game crashed before it even reached the main menu. Oops.
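Spelled out, that setup amounts to something like this (the paths are just placeholders):

export AMD_CONFIG_DIR=$HOME/amdvlk-debug
export AMD_DEBUG_DIR=$HOME/amdvlk-dumps

# contents of $HOME/amdvlk-debug/amdVulkanSettings.cfg:
EnablePipelineDump,1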

Well, that's a CPU code problem, and CPU debugging has left the 1970s firmly behind for somewhere in the 1990s or early 2000s. So let's get ourselves a debug build of the driver and attach gdb. Easy, right? Right?!? No. Cyberpunk 2077 is a Windows game, run in Proton, which is really Wine, which is really an emulator that likes to think of itself as not an emulator, run in some kind of container called a "pressure vessel" to fit the Steam theme. Fun.

To its credit, Proton tries to be helpful. You can set PROTON_DUMP_DEBUG_COMMANDS=1 in the environment, which dumps some shell scripts to /tmp/proton-$user/ and allowed me to launch Cyberpunk 2077 from the terminal comparatively easily, without going through the Steam client each time. But Wine seems to hate debugging, and it seems to hate debugging of native Linux code even more, and obviously the Vulkan driver is native Linux code. All my attempts to launch the game in some form of debugger in order to catch it red-handed were in vain.

At this point, I temporarily resigned myself to more debugging time travel of the bad kind, i.e. backwards in time to worse tooling. printf() still works, after all, and since the crash was triggered by enabling pipeline dumps, I had a fairly good idea about the general area in the driver that must have contained the problem.

So I went on a spree of sprinkling printf()s everywhere, which led to some extremely confusing and non-deterministic results. Confusing and non-deterministic is a really great hint, though, because it points at multi-threading. Indeed, Cyberpunk 2077 is a good citizen and does multi-threaded pipeline compilation. Or perhaps VKD3D is being helpful. Either way, it's a good thing except it exposed a bug in the driver. So I started sprinkling std::lock_guards everywhere. That helped narrow down the problem area. Add some good old staring at code and behold: somebody had very recently added a use of strtok() to the pipeline dumping logic. Very bad idea, very easy fix.
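For illustration (this is made-up code, not the actual AMDVLK source): strtok() stashes its current position in hidden static state, so two threads tokenizing at the same time trample on each other, whereas strtok_r() makes that state explicit and therefore per-caller:

#include <cstring>

// Thread-unsafe pattern - the position between calls lives in a hidden
// static variable inside strtok():
//   for (char *tok = strtok(buf, ","); tok; tok = strtok(nullptr, ","))
//     handleToken(tok);

// Thread-safe equivalent - the position is threaded through an explicit
// save pointer owned by the caller:
void parseList(char *buf) {
  char *save = nullptr;
  for (char *tok = strtok_r(buf, ",", &save); tok != nullptr;
       tok = strtok_r(nullptr, ",", &save)) {
    // handle one comma-separated token
  }
}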

Okay, so I can dump some pipelines now, but I still don't get to the main menu because the game now crashes with an assertion somewhere in PAL. I could start staring at pipeline dumps, but this is an assertion that (1) suggests a legitimate problem, which means prioritizing it might actually be helpful, and (2) is in the kind of function that is called from just about everywhere, which means I really, really need to be able to look at a stacktrace now. It's time to revisit debuggers.

One of the key challenges with my earlier attempts at using gdb was that (1) Wine likes to fork off tons of processes, which means getting gdb to follow the correct one is basically impossible, and (2) the crash happens very quickly, so manually attaching gdb after the fact is basically impossible. But the whole point of software development is to make the impossible possible, so I tweaked the implementation of PAL_ASSERT to poke at /proc/self to figure out whether a debugger is already attached and if one isn't, optionally print out a helpful message including the PID and then sleep instead of calling abort() immediately. This meant that I could now attach gdb at my leisure, which I did.
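A minimal sketch of that kind of hook, using the usual TracerPid trick (this is not the actual PAL code, and the helper names are made up):

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <unistd.h>

// Returns true if a debugger (or any other tracer) is ptrace-attached,
// judging by the TracerPid field in /proc/self/status.
static bool debuggerAttached() {
  FILE *f = fopen("/proc/self/status", "r");
  if (!f)
    return false;
  char line[256];
  bool attached = false;
  while (fgets(line, sizeof(line), f)) {
    if (strncmp(line, "TracerPid:", 10) == 0) {
      attached = atoi(line + 10) != 0;
      break;
    }
  }
  fclose(f);
  return attached;
}

// Called from the assert macro instead of aborting right away.
static void waitForDebugger() {
  fprintf(stderr, "Assertion failed in pid %d, waiting for a debugger...\n",
          (int)getpid());
  while (!debuggerAttached())
    sleep(1);
}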

And was greeted with an absolutely useless gdb session because Wine is apparently being sufficiently creative with the dynamic linker structures that gdb can't make sense of what's happening on its own and doesn't find any symbols, let alone further debug info, and so there was no useful backtrace. Remember how I mentioned that Wine hates debugging?

Luckily, a helpful soul pointed out that /proc/$pid/maps exists and tells us where .so's are mapped into a process address space, and there's absolutely nothing Wine can do about that. Even better, gdb allows the user to manually tell it about shared libraries that have been loaded. Even even better, gdb can be scripted with Python. So, I wrote a gdb script that walks the backtrace and figures out which shared libraries to tell gdb about to make sense of the backtraces. (Update: Friedrich Vock helpfully pointed out that attaching with gdb -p $pid /path/to/the/correct/bin/wine64 also allows gdb to find shared libraries.)

At this point, another helpful soul pointed out that Fossilize exists and can play back pipeline creation in a saner environment than a Windows game running on VKD3D in Wine in a pressure vessel. That would surely have reduced my debugging woes somewhat. Oh well, at least I learned something.

From there, fixing all the bugs was almost a walk in the park. The assertion I had run into in PAL was easy to fix, and finally I could get back to the original problem: that GPU hang. It turned out to be a fairly mundane problem in LLPC's raytracing implementation, for which I have a fix. It's still going to take a while to trickle out, in part because this whole debugging odyssey comes with a correspondingly complex chain of patches that just takes a while to ferment, and in part because the pre-Christmas period is very far from a quiet time and things have just been generally crazy. Still: very soon you, too, will be able to play Cyberpunk 2077 with raytracing using the AMDVLK driver.

Friday, May 12, 2023

An Update on Dialects in LLVM

EuroLLVM took place in Glasgow this week. I wasn't there, but it's a good opportunity to check in with what's been happening in dialects for LLVM in the ~half year since my keynote at the LLVM developer meeting.

Where we came from

To give an ultra-condensed recap: The excellent idea that MLIR brought to the world of compilers is to explicitly separate the substrate in which a compiler intermediate representation is implemented (the class hierarchy and basic structures that are used to represent and manipulate the program representation at compiler runtime) from the semantic definition of a dialect (the types and operations that are available in the IR and their meaning). Multiple dialects can co-exist on the same substrate, and in fact the phases of compilation can be identified with the set of dialects that are used within each phase.

Unfortunately for AMD's shader compiler, while MLIR is part of the LLVM project and shares some foundational support libraries with LLVM, its IR substrate is entirely disjoint from LLVM's IR substrate. If you have an existing compiler built on LLVM IR, you could bolt on an MLIR-based frontend, but what we really need is a way to gradually introduce some of the capabilities offered by MLIR throughout an existing LLVM-based compilation pipeline.

That's why I started llvm-dialects last year. We published its initial release a bit more than half a year ago, and have greatly expanded its capabilities since then.

Where we are now

We have been using llvm-dialects in production for a while now. Some of its highlights so far are:

  • Almost feature-complete for defining custom operations (aka intrinsics or instructions). The main thing that's missing is varargs support - we just haven't needed that yet.
  • Most of the way there for defining custom types: custom types can be defined, but they can't be used everywhere. I'm working on closing the gaps as we speak - some upstream changes in LLVM itself are required.
  • Expressive language for describing constraints on operation and type arguments and operation results - see examples here and here.
  • Thorough, automatically generated IR verifier routines.
  • A flexible and efficient visitor mechanism that is inspired by LLVM's TypeSwitch but beats it in some important ways.

Transitioning to the use of llvm-dialects is a gradual process for us and far from complete. We have always had custom operations, but we used to implement them in a rather ad-hoc manner. The old way of doing it consisted of hand-writing code like this:

SmallVector<Value *, 4> args;
std::string instName = lgcName::OutputExportXfb;
args.push_back(getInt32(xfbBuffer));
args.push_back(xfbOffset);
args.push_back(getInt32(streamId));
args.push_back(valueToWrite);
addTypeMangling(nullptr, args, instName);
return CreateNamedCall(instName, getVoidTy(), args, {});

With llvm-dialects, we can use a much cleaner builder pattern:

return create<InputImportGenericOp>(
    resultTy, false, location, getInt32(0), elemIdx,
    PoisonValue::get(getInt32Ty()));

Accessing the operands of a custom operation used to be a matter of code with magic numbers everywhere:

if (callInst.arg_size() > 2)
  vertexIdx = isDontCareValue(callInst.getOperand(2))
                  ? nullptr : callInst.getOperand(2);

With llvm-dialects, we get far more readable code:

Value *vertexIdx = nullptr;
if (!inputOp.getPerPrimitive())
  vertexIdx = inputOp.getArrayIndex();

Following the example set by MLIR, these accessor methods as well as the machinery required to make the create<FooOp>(...) builder call work are automatically generated from a dialect definition written in a TableGen DSL.

An important lesson from the transition so far is that the biggest effort, but also one of the biggest benefits, has to do with getting to a properly defined IR in the first place.

I firmly believe that understanding a piece of software starts not with the code that is executed but with the interfaces and data structures that the code implements and interacts with. In a compiler, the most important data structure is the IR. You should think of the IR as the bulk of the interface for almost all compiler code.

When defining custom operations in the ad-hoc manner that we used to use, there isn't one place in which the operations themselves are defined. Instead, the definition is implicit in the scattered locations where the operations are created and consumed. More often than is comfortable, this leads to definitions that are fuzzy or confused, which leads to code that is fuzzy and confused, which leads to bugs and a high maintenance cost, which leads to the dark side (or something).

By having a designated location where the custom operations are explicitly defined - the TableGen file - there is a significant force pushing towards proper definitions. As the experience of MLIR shows, this isn't automatic (witness the rather thin documentation of many of the dialects in upstream MLIR), but without this designated location, it's bound to be worse. And so a large part of transitioning to a systematically defined dialect is cleaning up those instances of confusion and fuzziness. It pays off: I have found hidden bugs this way, and the code becomes noticeably more maintainable.

Where we want to go

llvm-dialects is already a valuable tool for us. I'm obviously biased, but if you're in a similar situation to us, or you're thinking of starting a new LLVM-based compiler, I recommend it.

There is more that can be done, though, and I'm optimistic we'll get around to further improvements over time as we gradually convert parts of our compiler that are being worked on anyway. My personal list of items on the radar:

  • As mentioned already, closing the remaining gaps in custom type support.
  • Our compiler uses quite complex metadata in a bunch of places. It's hard to read for humans, doesn't have a good compatibility story for lit tests, and accessing it at compile-time isn't particularly efficient. I have some ideas for how to address all these issues with an extension mechanism that could also benefit upstream LLVM.
  • Compile-time optimizations. At the moment, casting custom operations is still based on string comparison, which is clearly not ideal. There are a bunch of other things in this general area as well.
  • I really want to see some equivalent of MLIR regions in LLVM. But that's a non-trivial amount of work and will require patience.

There's also the question of whether and when llvm-dialects will eventually be integrated into LLVM upstream. There are lots of good arguments in favor. Its DSL for defining operations is a lot friendlier than what is used for intrinsics at the moment. Getting nice, auto-generated accessor methods and thorough verification for intrinsics would clearly be a plus. But it's not a topic that I'm personally going to push in the near future. I imagine we'll eventually get there once we've collected even more experience.

Of course, if llvm-dialects is useful to you and you feel like contributing in these or other areas, I'd be more than happy about that!

Saturday, January 21, 2023

Diff modulo base, a CLI tool to assist with incremental code reviews

One of the challenges of reviewing a lot of code is that many reviews require multiple iterations. I really don't want to do a full review from scratch on the second and subsequent rounds. I need to be able to see what has changed since last time.

I happen to work on projects that care about having a useful Git history. This means that authors of (without loss of generality) pull requests use amend and rebase to change commits and force-push the result. I would like to see only the changes they made since my last review pass. Especially when the author also rebased onto a new version of the main branch, existing code review tools tend to break down.

Git has a little-known built-in subcommand, git range-diff, which I had been using for a while. It's pretty cool, really: It takes two ranges of commits, old and new, matches old and new commits, and then shows how they changed. The rather huge problem is that its output is a diff of diffs. Trying to make sense of those quickly becomes headache-inducing.

I finally broke down at some point late last year and wrote my own tool, which I'm calling diff-modulo-base. It allows you to look at the difference of the repository contents between old and new in the history below, while ignoring all the changes that are due to differences in the respective base versions A and B.

 ---A---------B---        (main branch)
     \         \
      old       new       (old and new versions of the pull request)

As a bonus, it actually does explicitly show differences between A and B that would have caused merge conflicts during rebase. This allows a fairly comfortable view of how merge conflicts were resolved.

I've been using this tool for a while now. While there are certainly still some rough edges and to-dos, I did put a bunch more effort into it over the winter holidays and am now quite happy with it. I'm making it available for all to try at https://git.sr.ht/~nhaehnle/diff-modulo-base. Let me know if you find it useful!

Better integration with the larger code review flow?

One of the rough edges is that it would be great to integrate tightly with the GitHub notifications workflow. That workflow is surprisingly usable in that you can essentially treat notifications as an inbox, mark them as unread or completed, and "mute" issues and pull requests, all with keyboard shortcuts.

What's missing in my workflow is a reliable way to remember the most recent version of a pull request that I have reviewed. My somewhat passable workaround for now is to git fetch before I do a round of reviews, and rely on the local reflog of remote refs. A Git alias allows me to say

git dmb-origin $pull_request_id

and have that become

git diff-modulo-base origin/main origin/pull/$pull_request_id/head@{1} origin/pull/$pull_request_id/head

which is usually what I want.
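The alias itself can be a small shell wrapper in ~/.gitconfig, along these lines (an untested sketch):

[alias]
    dmb-origin = "!f() { git diff-modulo-base origin/main origin/pull/$1/head@{1} origin/pull/$1/head; }; f"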

Ideally, I'd have a fully local way of interacting with GitHub notifications, which could then remember the reviewed version in a more reliable way. This ought to also fix the terrible lagginess of the web interface. But that's a rant for another time.

Rust

This is the first serious piece of code I've written in Rust. I have to say that experience has really been quite pleasant so far. Rust's tooling is pretty great, mostly thanks to the rust-analyzer LSP server.

The one thing I'd wish for is that the borrow checker were able to better understand "partial" borrows. I find it occasionally convenient to tie a bunch of data structures together in a general context structure, and helper functions on such aggregates can't express that they only borrow part of the structure. This can usually be worked around by changing data types, but the fact that I have to do that is annoying. It feels like having to solve a puzzle that isn't part of the inherent complexity of the underlying problem that the code is trying to solve.

And unlike, say, circular references or graph structures in general, where it's clear that expressing and proving the sort of useful lifetime facts that developers might intuitively reason about quickly becomes intractable, improving the support for partial borrows feels like it should be a tractable problem.

Saturday, January 14, 2023

Software testing, and why I'm unhappy about it

Automated testing of software is great. Unfortunately, what's commonly considered best practice for how to integrate testing into the development flow is a bad fit for a lot of software. You know what I mean: You submit a pull request, some automated testing process is kicked off in the background, and some time later you get back a result that's either Green (all tests passed) or Red (at least one test failed). Don't submit the PR if the tests are red. Sounds good, doesn't quite work.

There is Software Under Test (SUT), the software whose development is the goal of what you're doing, and there's the Test Bench (TB): the tests themselves, but also additional relevant parts of the environment like perhaps some hardware device that the SUT is run against.

The above development practice works well when the SUT and TB are both defined by the same code repository and are developed together. And admittedly, that is the case for a lot of useful software. But it just so happens that I mostly tend to work on software where the SUT and TB are inherently split. Graphics drivers and shader compilers implement some spec (like Vulkan or Direct3D), and an important part of the TB are a conformance test suite and other tests, the bulk of which are developed separately from the driver itself. Not to mention the GPU itself and other software like the kernel mode driver. The point is, TB development is split from the SUT development and it is infeasible to make changes to them in lockstep.

Down with No Failures, long live No Regressions

Problem #1 with keeping all tests passing all the time is that tests can fail for reasons whose root cause is not an SUT change.

For example, a new test case is added to the conformance test suite, but that test happens to fail. Suddenly nobody can submit any changes anymore.

That clearly makes no sense, and because Tooling Sucks(tm), what folks typically do is maintain a manual list of test cases that are excluded from automated testing. This unblocks development, but are you going to remember to update that exclusion list? Bonus points if the exclusion list isn't even maintained in the same repository as the SUT, which just compounds the problem.

The situation is worse when you bring up a large new feature or perhaps a new version of the hardware supported by your driver (which is really just a super duper large new feature), where there is already a large body of tests written by somebody else. Development of the new feature may take months and typically is merged bit by bit over time. For most of that time, there are going to be some test failures. And that's fine!

Unfortunately, a typical coping mechanism is that automated testing for the feature is entirely disabled until the development process is complete. The consequences are dire, as regressions in relatively basic functionality can go unnoticed for a fairly long time.

And sometimes there are simply changes in the TB that are hard to control. Maybe you upgraded the kernel mode driver for your GPU on the test systems, and suddenly some weird corner case tests fail. Yes, you have to fix it somehow, but removing the test case from your automated testing process is almost always the wrong response.

In fact, failing tests are, given the right context, a good thing! Let's say a bug is discovered in a real application in the field. Somebody root causes the problem and writes a simplified reproducer. This reproducer should be added to the TB as soon as possible, even if it is going to fail initially!

To be fair, many of the common testing frameworks recognize this by allowing tests to be marked as "expected to fail". But they typically also assume that the TB can be changed in lockstep with the SUT and fall on their face when that isn't the case.

What is needed here is to treat testing as a truly continuous exercise, with some awareness by the automation of how test runs relate to the development history.

During day-to-day development, the important bit isn't that there are no failures. The important bit is that there are no regressions.

Automation ought to track which tests pass on the main development branch and provide pre-commit reports for pull requests relative to those results: Have there been any regressions? Have any tests been fixed? Block code submissions when they cause regressions, but don't block them for pre-existing failures, especially when those failures are caused by changes in the TB.
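At its core, the comparison the automation needs to make is a per-test diff against the baseline rather than a single global pass/fail verdict. A sketch, assuming results keyed by test name:

#include <map>
#include <string>
#include <vector>

enum class Outcome { Pass, Fail };

// A change is blocked only by tests that pass on the main-branch baseline
// but fail with the change applied. Pre-existing failures and tests that
// have no baseline result (e.g. brand-new TB tests) don't block.
std::vector<std::string>
findRegressions(const std::map<std::string, Outcome> &baseline,
                const std::map<std::string, Outcome> &withChange) {
  std::vector<std::string> regressions;
  for (const auto &[test, outcome] : withChange) {
    if (outcome != Outcome::Fail)
      continue;
    auto it = baseline.find(test);
    if (it != baseline.end() && it->second == Outcome::Pass)
      regressions.push_back(test);
  }
  return regressions;
}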

Changes to the TB should also be tested where possible, and when they cause regressions those should be investigated. But it is quite common that regressions caused by a TB change are legitimate and shouldn't block the TB change.

Sparse testing

Problem #2 is that good test coverage means that tests take a very long time to run.

Your first solution to this problem should be to parallelize and throw more hardware at it. Let's hope the people who control the purse care enough about quality.

There is sometimes also low-hanging fruit you should pick, like wasting lots of time in process (or other) startup and teardown overhead. Addressing that can be a double-edged sword. Changing a test suite from running every test case in a separate process to running multiple test cases sequentially in the same process reduces isolation between the tests and can therefore make the tests flakier. It can also expose genuine bugs, though, and so the effort is usually worth it.

But all these techniques have their limits.

Let me give you an example. Compilers tend to have lots of switches that subtly change the compiler's behavior without (intentionally) affecting correctness. How good is your test coverage of these?

Chances are, most of them don't see any real testing at all. You probably have a few hand-written test cases that exercise the options. But what you really should be doing is running an entire suite of end-to-end tests with each of the switches applied to make sure you aren't missing some interaction. And you really should be testing all combinations of switches as well.

The combinatorial explosion is intense. Even if you only have 10 boolean switches, testing them each individually without regressing the turn-around time of the test suite requires 11x more test system hardware. Testing all possible combinations requires 1024x more. Nobody has that kind of money.

The good news is that having extremely high confidence in the quality of your software doesn't require that kind of money. If we run the entire test suite a small number of times (maybe even just once!) and independently choose a random combination of switches for each test, then not seeing any regressions there is a great indication that there really aren't any regressions.
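A sketch of that per-test sampling (hypothetical code; in practice you'd want the chosen combination to be reproducible from a logged seed so failures can be re-run):

#include <cstddef>
#include <cstdint>
#include <functional>
#include <random>
#include <string>
#include <vector>

// For each test case, flip an independent coin per switch instead of
// enumerating all 2^N combinations across the whole suite.
std::vector<bool> sampleSwitchesForTest(const std::string &testName,
                                        size_t numSwitches, uint64_t runSeed) {
  // Derive a per-test seed so the combination can be reproduced when a
  // failure needs to be investigated.
  std::seed_seq seq{runSeed, (uint64_t)std::hash<std::string>{}(testName)};
  std::mt19937_64 rng(seq);
  std::bernoulli_distribution coin(0.5);
  std::vector<bool> enabled(numSwitches);
  for (size_t i = 0; i < numSwitches; ++i)
    enabled[i] = coin(rng);
  return enabled;
}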

Why is that? Because failures are correlated! Test T failing with a default setting of switches is highly correlated with test T failing with some non-default switch S enabled.

This effect isn't restricted to taking the cross product of a test suite with a bunch of configuration switches. By design, an exhaustive conformance test suite is going to have many sets of tests with high failure correlation. For example, in the Vulkan test suite you might have a bunch of test cases that all do the same thing, but with a different combination of framebuffer format and blend function. When there is a regression affecting such tests, the specific framebuffer format or blend function might not matter at all, and all of the tests will regress. Or perhaps the regression is related to a specific framebuffer format, and so all tests using that format will regress regardless of the blend function that is used, and so on.

A good automated testing system would leverage these observations using statistical methods (aka machine learning).

Combinatorial explosion causes your full test suite to take months to run? No problem, treat testing as a genuinely continuous task. Have test systems continuously run random samplings of test cases on the latest version of the main development branch. Whenever a change is made to either the SUT or TB, switch over to testing that new version instead. When a failure is encountered, automatically determine if it is a regression by referring to earlier test results (of the exact same test if it has been run previously, or related tests otherwise) combined with a bisection over the code history.
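The bisection part, at least, is conceptually simple (a sketch; testPassesAt stands in for checking out, building, and running the test at a given commit):

#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Given a linearized slice of history where the test is known to pass at
// commits.front() and to fail at commits.back(), find the first failing
// commit in O(log n) build-and-test cycles.
std::string firstBadCommit(
    const std::vector<std::string> &commits,
    const std::function<bool(const std::string &)> &testPassesAt) {
  size_t good = 0;                  // known-passing index
  size_t bad = commits.size() - 1;  // known-failing index
  while (bad - good > 1) {
    size_t mid = good + (bad - good) / 2;
    if (testPassesAt(commits[mid]))
      good = mid;
    else
      bad = mid;
  }
  return commits[bad];
}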

Pre-commit testing becomes an interesting fuzzy problem. By all means have a small traditional test suite that is manually curated to run within a few minutes. But we can also apply the approach of running randomly sampled tests to pre-commit testing.

A good automated testing system would learn a statistical model of regressions and combine it with the test results obtained so far to provide an estimate of the likelihood that the change introduces a regression. As long as no regression is actually found, this likelihood keeps dropping as more tests are run, though it will not drop to 0 unless all tests are run (and the setup here was that this would take months). The team can define a threshold below which the estimated likelihood must fall before a change can be committed, based on their appetite for risk and their rate of development.

The statistical model should be augmented with source-level information about the change, such as keywords that appear in the diff and commit message and the set of files that was changed. After all, there ought to be some meaningful correlation between regressions in a raytracing test case and the fact that the regressing change affected a file with "raytracing" in its name. The model should then also be used to bias the random sampling of tests to be run to maximize the information extracted per effort spent on running test cases.

Some caveats

What I've described is largely motivated by the fact that the world is messier than commonly accepted testing "wisdom" allows. However, the world is too messy even for what I've described.

I haven't talked about flaky (randomly failing) tests at all, though a good automated testing system should be able to cope with them. Re-running a test in the same configuration is not black magic and can be used to confirm that a test is flaky. If we wanted to get fancy, we could even estimate the failure probability and treat a significant increase of the failure rate as a regression!

Along similar lines, there can be state leakage between test cases that causes failures only when test cases are run in a specific order, or when specific test cases are run in parallel. This would manifest as flaky tests, and so flaky test detection ought to try to help tease out these scenarios. That is admittedly difficult and will probably never be entirely reliable. Luckily, it doesn't happen often.

Sometimes, there are test cases that can leave a test system in such a broken state that it has to be rebooted. This is not entirely unusual in very early bringup of a driver for new hardware, when even the device's firmware may still be unstable. An automated test system can and should treat this case just like one would treat a crashing test process: Detect the failure, perhaps using some timer-based watchdog, force a reboot, possibly using a remote-controlled power switch, and resume with the next test case. But if a decent fraction of your test suite is affected, the resulting experience isn't fun and there may not be anything your team can do about it in the short term. So that's an edge case where manual exclusion of tests seems legitimate.

So no, testing perfection isn't attainable for many kinds of software projects. But even relative to what feels like it should be realistically attainable, the state of the art is depressing.