Samstag, Dezember 19, 2015

Debugging a GPU VM fault

Having finished debugging and fixing a rather tricky GPU VM fault bug in the radeonsi driver, I thought I'd document the bug chasing process I went through (some parts are cleaned up with dead ends removed). May it help myself and others in the future.

Fortunately, the original submitter of the bug had already bisected the cause of the VM faults to a change in LLVM, so the fault was clearly due to some shader. Unfortunately, the triggering commit in LLVM was completely unrelated to Radeon, so it was very unclear what was going on. Still, the bug occured in the publically available Unreal Elemental Demo and was easily reproducible, so off I went.

Since some shader was to blame, the first thing to do after reproduction was to collect all shaders before and after the bad commit, using RADEON_DUMP_SHADERS=y (R600_DEBUG=ps,vs,gs also does this). This resulted in a lot of output with a large diff between the good and the bad run. Clearly, the change in LLVM subtly affected register allocation and/or instruction scheduling in the compiler in a way that affected many shaders and exposed some pre-existing, underlying bug. I needed to find the exact shader that caused problems.

The next step, to ensure even more reliable and deterministic reproduction, was to record an apitrace. This allows us to replay the exact same sequence of OpenGL calls that leads to the VM faults over and over again, to learn ever more about what's going on. (The Unreal Elemental Demo always plays the same scene, but it is affected by timing.)

Now it was time to find the exact draw call that caused the problems. The driver has some tools to help with that: the GALLIUM_DDEBUG feature is meant to detect lockups, but it conveniently causes a command stream flush after every draw call, so I used it by setting GALLIUM_DDEBUG=800. This makes the replay terribly slow (there's a reason we batch many draw calls into a single CS, also called IB in the kernel). So I implemented a GALLIUM_DDEBUG_SKIP feature in the driver that let me skip the additional flushes and lockup checks for the initial, known-good segment of the trace.

In addition, the driver comes with a debug feature that detects and aborts on VM faults, which is enabled via R600_DEBUG=check_vm. Since the fault comes from a shader, we also need a way to cross-reference the dected fault to the currently bound shader. This is achieved by dumping shaders and enabling the vm debug option, for a full command line of something like
RADEON_DUMP_SHADERS=y R600_DEBUG=vm,check_vm \
glretrace -v ElementalDemo.trace > runXXX.log 2>&1
The option -v for glretrace dumps all OpenGL calls as they are executed, which also turned out to be useful.

How to find the faulty shader from all that? Well, the check_vm logic not only detects VM faults, but also writes helpful logging dumps to a file in ~/ddebug_dumps/ (use less -R to make sense of the coloring escape codes that are written to the file). Most crucially, this dump contains a list of all buffers that were mapped, obviously including the buffers that contain the shader binaries. In one example run:
        Size    VM start page         VM end page           Usage
         245    -- hole --
           1    0x0000000113a9b       0x0000000113a9c       USER_SHADER
         268    -- hole --
         564    -- hole --
           2    0x000000016f00d       0x000000016f00f       USER_SHADER
         145    -- hole --
Remember that we enabled the vm debug option together with shader dumping? This means our log contains lots of lines of the form
VM start=0x105249000  end=0x10524A000 | Buffer 4096 bytes
(Note that the log contains byte addresses, while the check_vm dump contains page numbers. Pages are 4KB, so you just need to add or remove three 0s at the end to go from bytes to pages and vice versa.) All we need to do is grep the log file for "VM start=0x113A9B" and "VM start=0x16F00D" (mmh, I'm getting hungry...). And while those might appear multiple times if a buffer is reused or destroyed, the last occurence will be where the shader binary was created.

The shader dump contains three versions of the shader: the initial TGSI, which is what Gallium's state tracker provides to the hardware-dependent driver, the LLVM IR after an initial optimization pass, and the disassembly of the final shader binary as provided by LLVM. I extracted the IR of the two shaders (vertex and fragment), and compiled them with LLVM's standalone compiler, once with the "good" version of LLVM and once with the "bad" (in both cases using the command line llc -march=amdgcn -mcpu=tonga < shader.ll). It turned out that both shaders were affected by the change, so I needed to figure out whether it was the vertex or the fragment shader.

The GUI of apitrace has the wonderful ability of allowing you to edit a trace. Remember that I used the -v option of glretrace? That produces lots of lines like
2511163 @2 glDrawRangeElements(mode = GL_TRIANGLES, start = 0, end = 9212, ...
Indeed, that was the final reported draw call, i.e. the one causing the fault. So let's open the trace with qapitrace and jump to that exact call using its number. Sure enough, not long before the call we find
We need to find where this program is linked, which is typically much earlier in the program. So we search for glLinkProgram(program = 505) and find exactly one such call. A short bit above, we find the calls
glAttachShader(505, 69)
glAttachShader(505, 504)
where the fragment and vertex shaders are attached to the program. Finally, we search for glShaderSource(shader = 69 and 504 to find the call where the source is loaded into the shader. Now, we can edit those calls to replace the shaders by dummy versions. Be careful, though: the length of the shader source is passed as a separate argument, and you must adjust it manually or you will get surprising error messages when running the modified trace.

By doing so, I determined that the fragment shader was at fault: even minor modifications such as re-ordering statements without changing any effects removed VM faults when applied to the fragment shader, but not when applied to the vertex shader.

So... time to stare at the disassembly of the fragment shader. Unfortunately, there was nothing that caught my eye. It was a long shader with more than 700 instructions. So what next? Since even minor changes at the source level fixed the fault, no matter what kind of change, I needed to go deeper and modify the binary directly. I wrote a new feature for radeonsi that would help me do just that, by allowing me to tell the driver to replace the binary created by LLVM on the N'th compile by a binary that is supplied as an environment variable.

I would have loved to be able to edit assembly at this point. Unfortunately, the llvm-mc tool, which can theoretically act as an assembler, is not able to parse all the assembly constructs that llc generates for the AMDGPU backend. So I went with the next best option, creating an ELF object file using llc -march=amdgcn -mcpu=tonga -filetype=obj and editing the binary directly with a hex editor.

That wasn't too bad though: since VM faults are generated by memory instructions, I could just replace those memory instructions by NOPs. Since the shader dumps collected above helpfully include the binary representation of instructions in addition to the assembly, the instructions aren't too hard to find, either. I only needed to take care not to NOP out memory instructions whose output was then later used as addresses or resource descriptors for other memory instructions, or else I would have introduced new sources for VM faults!

At that point, I ran into a tough problem. My plan was to NOP out large groups of memory instructions initially, and then do a kind of binary search to isolate the bad access. Unfortunately, what happened was, roughly speaking, that when I NOP'ed out group A of instructions or group B of instructions, the VM faults remained, but when I NOP'ed out both groups at the same time, the VM faults disappeared. This can happen when there are really two underlying bugs, but unfortunately I did not see a plausible culprit in either group (in fact, the first bug which I found was actually outside both groups, but - as far as I understand it - depended subtly on timing and on how that affected the scheduling of shader waves by the hardware).

Luckily, at that point I had long suspected the problem to be in wait state handling. You see, in order to save on complicated circuitry in the hardware, there are some rarely occuring data hazards which the compiler must avoid by inserting NOPs explicitly (there is also the s_waitcnt instruction which is part of the strategy for hiding memory latency without complex out-of-order circuitry). So I read the documentation of those hazards again and noticed that the compiler in fact didn't insert enough wait states for a sequence involving register spills (i.e., something affecting only very large shaders). I fixed that, and while it didn't eliminate the VM faults, it changed their pattern sufficiently to give me new hope.

Indeed, with the additional wait states, my original idea of finding the bad instructions by binary search was successful. I narrowed the problem down to accesses to one single texture. At that point, my brain was too exhausted to see the bug that was rather obvious in hindsight, but a colleague pointed it out to me: there was a multi-word register copy which attempted to copy a resource descriptor (consisting of 8 32-bit words) between overlapping register ranges, and it was doing that copy in the wrong direction - kind of like using memcpy when you should be using memmove.

Once this second bug was found, coming up with a fix was relatively straightforward. And yes, I checked: both bug fixes were actually required to completely fix the original bug.

That was my story. Hopefully you've learned something if you've come this far, but there is not really much of a moral to it. Perhaps it is this: pray you have deterministically reproducible bugs. If you do, patiently collecting more and more information will lead you to a solution. If you don't, well, sometimes you just have to be lucky.


Gustav B hat gesagt…

I didn't understand much but at least i appreciated to read about how it's done and how much that goes into fixing a bug like this.
I need to learn more about graphic stacks.

Rob hat gesagt…

fwiw, as far as tracing down which draw was trigging a hang, w/ freedreno I have used scratch register writes, writing incrementing counter value to a scratch reg before draws, etc.. dumping the scratch register values on hang.. I do insert extra WFI's to kill some of the pipelining in the CP, but doing a flush after every draw completely changes the behaviour of things with a tiler so frequently not a good option for me. And then to map back to draw # in apitrace I'd hacked up:

I need to revisit that and convert to khr_debug still... there are patches on mesa-devel which plumb that through to the gallium driver so it can insert the frame # msgs from apitrace in the cmdstream embedded in a CP_NOP packet.