Thursday, May 19, 2016

A little 5-to-8-bit mystery

While writing the accelerated glReadPixels path for reads to PBOs for Gallium, I wanted to make sure that the various possible format conversions work correctly. They do, but I noticed something strange: when reading from a GL_RGB565 framebuffer to GL_UNSIGNED_BYTE, I was getting tiny differences in the results depending on the code path that was taken. What was going on?

Color values are conceptually floating point values, but most of the time, so-called normalized formats are used to store the values in memory. In fact, many probably think of color values as 8-bit normalized values by default, because of the way many graphics programs present color values and because of the #cccccc color format of HTML.

Normalized formats generalize this well-known notion to an arbitrary number of bits. Given a normalized integer value x in N bits, the corresponding floating point value is x / (2**N - 1) - for example, x / 255 for 8 bits and x / 31 for 5 bits. When converting between normalized formats with different bit depths, the values cannot be mapped perfectly. For example, since 255 and 31 are coprime, the only floating point values representable exactly in both 5- and 8-bit channels are 0.0 and 1.0.
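
The coprimality argument can be checked with exact rational arithmetic. The following is a minimal sketch; the helper name unorm_values is made up for this illustration and does not come from any driver code.

```python
from fractions import Fraction
from math import gcd


def unorm_values(bits):
    """Exact values representable by a normalized integer channel of this width."""
    max_int = (1 << bits) - 1
    return {Fraction(x, max_int) for x in range(max_int + 1)}


# Values that a 5-bit and an 8-bit channel can both represent exactly.
common = sorted(unorm_values(5) & unorm_values(8))

if __name__ == "__main__":
    print("gcd(31, 255) =", gcd(31, 255))
    print("shared values:", common)
```

By contrast, every 4-bit value survives in 8 bits, because 15 divides 255 (x / 15 equals 17x / 255).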

So some imprecision is unavoidable, but why was I getting different values in different code paths?

It turns out that the non-PBO path first blits the requested framebuffer region to a staging texture, from where the result is then memcpy()d to the user's buffer. It is the GPU that takes care of the copy from VRAM, the de-tiling of the framebuffer, and the format conversion. The blit uses the normal 3D pipeline with a simple fragment shader that reads from the "framebuffer" (which is really bound as a texture during the blit) and writes to the staging texture (which is bound as the framebuffer).

Normally, fragment shaders operate on 32-bit floating point numbers. However, Radeon hardware allows an optimization where color values are exported from the shader to the CB hardware unit as 16-bit half-precision floating point numbers when the framebuffer does not require the full floating point precision. This is useful because it reduces the bandwidth required for shader exports and allows more shader waves to be in flight simultaneously, because less memory is reserved for the exports.

And it turns out that the value 20 in a 5-bit color channel, when first converted into half-float (fp16) format, becomes 164 in an 8-bit color channel, even though the 8-bit color value that is closest to the floating point number represented by 20 in 5-bit is actually 165. The temporary conversion to fp16 cuts off a bit that would make the difference.

Intrigued, I wrote a little script to see how often this happens. It turns out that 20 in a 5-bit channel and 32 in a 6-bit channel are the only cases where the temporary conversion to fp16 causes the resulting 8-bit value to be off by one. Luckily, people don't usually use GL_RGB565 framebuffers... and as a general rule, taking a value from an N-bit channel, converting it to fp16, and then storing the value again in an N-bit channel (of the same bit depth!) will always result in what we started out with, as long as N <= 11 (figuring out why is an exercise left to the reader ;-)) - so the use cases we really care about are fine.
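
The original script is not reproduced here, but the experiment can be sketched in a few lines of Python. This version assumes that the fp16 conversion rounds to nearest (which is what Python's struct module does for the "e" format) and that the final store to the integer channel also rounds to nearest; real hardware may behave differently. All function names are invented for this sketch.

```python
import struct


def to_fp16(value):
    """Round a float to the nearest half-float and convert it back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def convert_direct(x, src_bits, dst_bits):
    """Exact integer conversion between normalized formats, rounding to nearest."""
    src_max = (1 << src_bits) - 1
    dst_max = (1 << dst_bits) - 1
    return (2 * x * dst_max + src_max) // (2 * src_max)


def convert_via_fp16(x, src_bits, dst_bits):
    """Same conversion, with the intermediate value squeezed through fp16."""
    src_max = (1 << src_bits) - 1
    dst_max = (1 << dst_bits) - 1
    return int(to_fp16(x / src_max) * dst_max + 0.5)


def mismatches(src_bits, dst_bits=8):
    """All (x, direct, via_fp16) triples where the two paths disagree."""
    result = []
    for x in range(1 << src_bits):
        direct = convert_direct(x, src_bits, dst_bits)
        via = convert_via_fp16(x, src_bits, dst_bits)
        if direct != via:
            result.append((x, direct, via))
    return result


if __name__ == "__main__":
    for bits in (5, 6):
        print(bits, "bits:", mismatches(bits))
```

For the value 20 in 5 bits, the arithmetic goes like this: 20 / 31 is about 0.645161, the nearest fp16 value is 1321 / 2048 = 0.645020, and multiplying by 255 gives about 164.48 instead of 164.52, which lands on the other side of the rounding boundary.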

Thursday, April 28, 2016

Using LLVM bugpoint to hunt radeonsi compiler crashes

Shaders can be huge, and tracking down compiler crashes (or asserts) in LLVM with a giant shader isn't a lot of fun. Luckily, LLVM has a tool called Bugpoint. It takes a given piece of LLVM IR and tries a bunch of simplifications such as removing instructions or basic blocks, while checking that a given condition is still satisfied. Make the given condition something like "llc asserts with message X", and you have a very useful tool for reducing test cases. Unfortunately, its documentation isn't the greatest, so let me briefly dump how I have used it in the past.

I have a little script that looks like this:

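#!/bin/bash
if ! llc -mtriple=amdgcn-- -verify-machineinstrs "$@" 2>&1 | grep "error message here"; then
  exit 0  # message not found: this input is not interesting
else
  exit 1  # message found: a failure exit code tells bugpoint the bug still reproduces
fi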
When I encounter a compiler assertion, I first make sure to collect the offending shader from our driver using R600_DEBUG=ps,vs,gs,tcs,tes and extract it into a file like bug.ll. (In very rare cases, one may need the preoptir option in R600_DEBUG.) Then I edit the script with the correct error message and run
bugpoint -compile-custom -compile-command ./ bug.ll
It'll churn for some time and produce a hopefully much smaller .bc file that one can use the usual tools on, such as llc, opt, and llvm-dis.

Occasionally, it can be useful to run the result through opt -instnamer or to simplify it further by hand, but usually, bugpoint provides a good starting point.

Sunday, January 24, 2016

History of internet culture in acronyms

I've been on the internet for a pretty long time now. At least for me, getting on the internet back in the 90s meant making myself familiar with its "local" culture, customs and traditions. I'm sure that is still the case for many today. Curiously, I started to notice the appearance of acronyms that I wasn't familiar with, and I thought it interesting to consider how changing acronyms reflect broader changes of internet culture.

Back in the old days, a large fraction of what the internet was about was more or less serious, text-based discussion among relatively small groups. This shows in acronyms that facilitate such discussion, like AFAIK (as far as I know) or IM(H)O (in my (humble) opinion). Of course, social interactions could be rough back then as well: for every IMHO there's an IMNSHO (in my not so humble opinion), and of course there is the tradition of telling people to RTFM (read the f{ine, ucking} manual).

The acronyms that I did not recognize at first reflect an internet that has changed. MFW (my face when) reflects a culture of image reaction macros that were made possible by higher bandwidth. ICYMI (in case you missed it) certainly developed as the internet became a mass communication medium where shared stories and memes roll around. Similarly, ITT (in this thread) is a pattern where the author of a posting tries to set themselves apart from the "herd" of others in the thread, clearly a reaction to the increased size of groups and increasing anonymity in which the interactions often take place.

Saturday, December 19, 2015

Debugging a GPU VM fault

Having finished debugging and fixing a rather tricky GPU VM fault bug in the radeonsi driver, I thought I'd document the bug chasing process I went through (some parts are cleaned up with dead ends removed). May it help myself and others in the future.

Fortunately, the original submitter of the bug had already bisected the cause of the VM faults to a change in LLVM, so the fault was clearly due to some shader. Unfortunately, the triggering commit in LLVM was completely unrelated to Radeon, so it was very unclear what was going on. Still, the bug occurred in the publicly available Unreal Elemental Demo and was easily reproducible, so off I went.

Since some shader was to blame, the first thing to do after reproduction was to collect all shaders before and after the bad commit, using RADEON_DUMP_SHADERS=y (R600_DEBUG=ps,vs,gs also does this). This resulted in a lot of output with a large diff between the good and the bad run. Clearly, the change in LLVM subtly affected register allocation and/or instruction scheduling in the compiler in a way that affected many shaders and exposed some pre-existing, underlying bug. I needed to find the exact shader that caused problems.

The next step, to ensure even more reliable and deterministic reproduction, was to record an apitrace. This allows us to replay the exact same sequence of OpenGL calls that leads to the VM faults over and over again, to learn ever more about what's going on. (The Unreal Elemental Demo always plays the same scene, but it is affected by timing.)

Now it was time to find the exact draw call that caused the problems. The driver has some tools to help with that: the GALLIUM_DDEBUG feature is meant to detect lockups, but it conveniently causes a command stream flush after every draw call, so I used it by setting GALLIUM_DDEBUG=800. This makes the replay terribly slow (there's a reason we batch many draw calls into a single CS, also called IB in the kernel). So I implemented a GALLIUM_DDEBUG_SKIP feature in the driver that let me skip the additional flushes and lockup checks for the initial, known-good segment of the trace.

In addition, the driver comes with a debug feature that detects and aborts on VM faults, which is enabled via R600_DEBUG=check_vm. Since the fault comes from a shader, we also need a way to cross-reference the detected fault to the currently bound shader. This is achieved by dumping shaders and enabling the vm debug option, for a full command line of something like
RADEON_DUMP_SHADERS=y R600_DEBUG=vm,check_vm \
glretrace -v ElementalDemo.trace > runXXX.log 2>&1
The option -v for glretrace dumps all OpenGL calls as they are executed, which also turned out to be useful.

How to find the faulty shader from all that? Well, the check_vm logic not only detects VM faults, but also writes helpful logging dumps to a file in ~/ddebug_dumps/ (use less -R to make sense of the coloring escape codes that are written to the file). Most crucially, this dump contains a list of all buffers that were mapped, obviously including the buffers that contain the shader binaries. In one example run:
        Size    VM start page         VM end page           Usage
         245    -- hole --
           1    0x0000000113a9b       0x0000000113a9c       USER_SHADER
         268    -- hole --
         564    -- hole --
           2    0x000000016f00d       0x000000016f00f       USER_SHADER
         145    -- hole --
Remember that we enabled the vm debug option together with shader dumping? This means our log contains lots of lines of the form
VM start=0x105249000  end=0x10524A000 | Buffer 4096 bytes
(Note that the log contains byte addresses, while the check_vm dump contains page numbers. Pages are 4KB, so you just need to add or remove three hex 0s at the end to go from pages to bytes and vice versa.) All we need to do is grep the log file for "VM start=0x113A9B" and "VM start=0x16F00D" (mmh, I'm getting hungry...). And while those might appear multiple times if a buffer is reused or destroyed, the last occurrence will be where the shader binary was created.
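
The cross-referencing step can also be scripted instead of grepping by hand. The sketch below is only an illustration: it assumes the color escape codes have already been stripped from the check_vm dump, that the table and log lines look exactly like the excerpts above, and the function names are hypothetical.

```python
import re

PAGE_SHIFT = 12  # 4 KB pages, i.e. three hex digits


def shader_pages(dump_text):
    """Start pages of all USER_SHADER buffers listed in a check_vm dump."""
    pages = []
    for line in dump_text.splitlines():
        fields = line.split()
        # Buffer rows have four columns: size, start page, end page, usage.
        if len(fields) == 4 and fields[3] == "USER_SHADER":
            pages.append(int(fields[1], 16))
    return pages


def last_mapping(log_lines, page):
    """Index of the last 'VM start=' log line for this page, or None."""
    byte_address = page << PAGE_SHIFT
    pattern = re.compile(r"VM start=0x0*%X\b" % byte_address, re.IGNORECASE)
    found = None
    for index, line in enumerate(log_lines):
        if pattern.search(line):
            found = index
    return found
```

Calling last_mapping on the lines of the log for each page returned by shader_pages gives the position in the log where the corresponding shader binary was last created; the shader dump printed around that position is the one to look at.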

The shader dump contains three versions of the shader: the initial TGSI, which is what Gallium's state tracker provides to the hardware-dependent driver, the LLVM IR after an initial optimization pass, and the disassembly of the final shader binary as provided by LLVM. I extracted the IR of the two shaders (vertex and fragment), and compiled them with LLVM's standalone compiler, once with the "good" version of LLVM and once with the "bad" (in both cases using the command line llc -march=amdgcn -mcpu=tonga < shader.ll). It turned out that both shaders were affected by the change, so I needed to figure out whether it was the vertex or the fragment shader.

The GUI of apitrace has the wonderful ability of allowing you to edit a trace. Remember that I used the -v option of glretrace? That produces lots of lines like
2511163 @2 glDrawRangeElements(mode = GL_TRIANGLES, start = 0, end = 9212, ...
Indeed, that was the final reported draw call, i.e. the one causing the fault. So let's open the trace with qapitrace and jump to that exact call using its number. Sure enough, not long before the call we find
We need to find where this program is linked, which is typically much earlier in the program. So we search for glLinkProgram(program = 505) and find exactly one such call. A short bit above, we find the calls
glAttachShader(505, 69)
glAttachShader(505, 504)
where the fragment and vertex shaders are attached to the program. Finally, we search for glShaderSource(shader = 69 and 504 to find the call where the source is loaded into the shader. Now, we can edit those calls to replace the shaders by dummy versions. Be careful, though: the length of the shader source is passed as a separate argument, and you must adjust it manually or you will get surprising error messages when running the modified trace.

By doing so, I determined that the fragment shader was at fault: even minor modifications such as re-ordering statements without changing any effects removed VM faults when applied to the fragment shader, but not when applied to the vertex shader.

So... time to stare at the disassembly of the fragment shader. Unfortunately, there was nothing that caught my eye. It was a long shader with more than 700 instructions. So what next? Since even minor changes at the source level fixed the fault, no matter what kind of change, I needed to go deeper and modify the binary directly. I wrote a new feature for radeonsi that would help me do just that, by allowing me to tell the driver to replace the binary created by LLVM on the N'th compile by a binary that is supplied as an environment variable.

I would have loved to be able to edit assembly at this point. Unfortunately, the llvm-mc tool, which can theoretically act as an assembler, is not able to parse all the assembly constructs that llc generates for the AMDGPU backend. So I went with the next best option, creating an ELF object file using llc -march=amdgcn -mcpu=tonga -filetype=obj and editing the binary directly with a hex editor.

That wasn't too bad though: since VM faults are generated by memory instructions, I could just replace those memory instructions by NOPs. Since the shader dumps collected above helpfully include the binary representation of instructions in addition to the assembly, the instructions aren't too hard to find, either. I only needed to take care not to NOP out memory instructions whose output was then later used as addresses or resource descriptors for other memory instructions, or else I would have introduced new sources for VM faults!
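
The same patching can be done with a few lines of Python instead of a hex editor. This is only a sketch of the idea, not the method used above: it assumes that the code is a sequence of little-endian 32-bit words, that 0xBF800000 is the encoding of "s_nop 0" (this should be checked against the binary column of the shader dump), and that the byte offset and length of the instruction to replace are already known. The function name nop_out is made up.

```python
import struct

# Assumed encoding of "s_nop 0"; verify against the disassembly in the shader dump.
S_NOP_0 = 0xBF800000


def nop_out(code, offset, num_dwords):
    """Return a copy of code with num_dwords 32-bit words at offset replaced by NOPs."""
    end = offset + 4 * num_dwords
    if offset < 0 or offset % 4 != 0 or end > len(code):
        raise ValueError("patch range must be dword-aligned and inside the code")
    patched = bytearray(code)
    # A 64-bit memory instruction occupies two dwords, so it needs two NOPs.
    patched[offset:end] = struct.pack("<%dI" % num_dwords, *([S_NOP_0] * num_dwords))
    return bytes(patched)
```

Keeping the total size unchanged matters here, because branch offsets elsewhere in the shader are not adjusted.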

At that point, I ran into a tough problem. My plan was to NOP out large groups of memory instructions initially, and then do a kind of binary search to isolate the bad access. Unfortunately, what happened was, roughly speaking, that when I NOP'ed out group A of instructions or group B of instructions, the VM faults remained, but when I NOP'ed out both groups at the same time, the VM faults disappeared. This can happen when there are really two underlying bugs, but unfortunately I did not see a plausible culprit in either group (in fact, the first bug which I found was actually outside both groups, but - as far as I understand it - depended subtly on timing and on how that affected the scheduling of shader waves by the hardware).

Luckily, at that point I had long suspected the problem to be in wait state handling. You see, in order to save on complicated circuitry in the hardware, there are some rarely occurring data hazards which the compiler must avoid by inserting NOPs explicitly (there is also the s_waitcnt instruction which is part of the strategy for hiding memory latency without complex out-of-order circuitry). So I read the documentation of those hazards again and noticed that the compiler in fact didn't insert enough wait states for a sequence involving register spills (i.e., something affecting only very large shaders). I fixed that, and while it didn't eliminate the VM faults, it changed their pattern sufficiently to give me new hope.

Indeed, with the additional wait states, my original idea of finding the bad instructions by binary search was successful. I narrowed the problem down to accesses to one single texture. At that point, my brain was too exhausted to see the bug that was rather obvious in hindsight, but a colleague pointed it out to me: there was a multi-word register copy which attempted to copy a resource descriptor (consisting of 8 32-bit words) between overlapping register ranges, and it was doing that copy in the wrong direction - kind of like using memcpy when you should be using memmove.
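
Why the direction matters can be shown with a toy model in which a register file is a plain list and a multi-word copy is a sequence of single-word moves. The register numbers and the direction chosen below are made up for illustration; the actual shader used different registers.

```python
def copy_words(regs, dst, src, count, ascending=True):
    """Copy count words from regs[src:] to regs[dst:], one word at a time."""
    order = range(count) if ascending else reversed(range(count))
    for i in order:
        regs[dst + i] = regs[src + i]


def safe_copy_words(regs, dst, src, count):
    """Pick the copy direction that never overwrites unread source words."""
    copy_words(regs, dst, src, count, ascending=dst < src)


if __name__ == "__main__":
    regs = list(range(16))
    # 8-word "descriptor" in r0..r7, copied to the overlapping range r4..r11.
    copy_words(regs, dst=4, src=0, count=8, ascending=True)
    print("wrong direction:", regs[4:12])

    regs = list(range(16))
    safe_copy_words(regs, dst=4, src=0, count=8)
    print("right direction:", regs[4:12])
```

With the wrong direction, the first four moves overwrite the upper half of the source before it has been read, so the destination ends up with the lower half repeated twice.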

Once this second bug was found, coming up with a fix was relatively straightforward. And yes, I checked: both bug fixes were actually required to completely fix the original bug.

That was my story. Hopefully you've learned something if you've come this far, but there is not really much of a moral to it. Perhaps it is this: pray you have deterministically reproducible bugs. If you do, patiently collecting more and more information will lead you to a solution. If you don't, well, sometimes you just have to be lucky.