Tuesday, September 01, 2009

The shader optimization challenge

During my vacation - a great trip through beautiful Iceland - a lot of important improvements landed in the r300 driver, the Mesa driver that provides hardware accelerated OpenGL for Radeon R300 to R500 chipsets.

The biggest noticeable improvement is that, mostly thanks to Maciej's (osiris) push, we finally have real support for ARB_vertex_buffer_object (short: VBO) and ARB_occlusion_query (short: OQ).

What does this mean? First of all, it means that Sauerbraten finally approaches good framerates on my Athlon XP 2400 + Radeon X1650 Pro setup (unfortunately still in PCI mode due to a crappy AGP bridge). The performance difference is impressive: the CPU performance profile now looks entirely different, because the previously most CPU intensive tasks have simply disappeared - we no longer have to constantly reupload vertex data. You can expect similar performance improvements in essentially all 3D games. It also means that the driver can finally support OpenGL 1.5. It's about damn time.

In the meantime, I have been experimenting with support for the OpenGL Shading Language (short: GLSL). This is still a long way off, but today I would like to give you a taste of the kind of challenges waiting for us.

The glsl/trirast test that comes with Mesa implements a very simple and stupid triangle rasterizer within a fragment shader. Said shader looks like this:
uniform vec2 v0, v1, v2;

float crs(const vec2 u, const vec2 v)
{
   return u.x * v.y - u.y * v.x;
}

void main() {
   vec2 p = gl_FragCoord.xy;
   if (crs(v1 - v0, p - v0) >= 0.0 &&
       crs(v2 - v1, p - v1) >= 0.0 &&
       crs(v0 - v2, p - v2) >= 0.0)
      gl_FragColor = vec4(1.0);
   else
      gl_FragColor = vec4(0.5);
}
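
For readers who prefer to see the logic outside of GLSL, here is a rough CPU-side sketch of the same half-space test in Python. The helper names mirror the shader; this is purely illustrative, not driver code:

```python
# CPU sketch of the half-space rasterizer the shader implements.

def crs(u, v):
    """2D cross product: the z component of cross((u, 0), (v, 0))."""
    return u[0] * v[1] - u[1] * v[0]

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1])

def inside(p, v0, v1, v2):
    """True if p lies inside (or on the edge of) the CCW triangle v0 v1 v2."""
    return (crs(sub(v1, v0), sub(p, v0)) >= 0 and
            crs(sub(v2, v1), sub(p, v1)) >= 0 and
            crs(sub(v0, v2), sub(p, v2)) >= 0)

# The fragment shader runs this test once per pixel and writes
# white (1.0) inside the triangle, grey (0.5) outside.
v0, v1, v2 = (0.0, 0.0), (10.0, 0.0), (0.0, 10.0)
print(inside((2.0, 2.0), v0, v1, v2))   # True  (inside)
print(inside((9.0, 9.0), v0, v1, v2))   # False (outside)
```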

Mesa's GLSL compiler turns this shader into an assembly program that looks like this (the style is that of ARB_fragment_program, but the control flow instructions are a Mesa invention):
0: MOV TEMP[0].xy, INPUT[0];
1: SUB TEMP[2].xy, UNIFORM[1], UNIFORM[0];
2: SUB TEMP[4].xy, TEMP[0], UNIFORM[0];
3: MUL TEMP[1].y, TEMP[2].xxxx, TEMP[4].yyyy;
4: MUL TEMP[1].z, TEMP[2].yyyy, TEMP[4].xxxx;
5: SUB TEMP[1].x, TEMP[1].yyyy, TEMP[1].zzzz;
6: SGE TEMP[1].y, TEMP[1].xxxx, CONST[3].xxxx;
7: IF TEMP[1].yyyy; # (if false, goto 15);
8: SUB TEMP[2].xy, UNIFORM[2], UNIFORM[1];
9: SUB TEMP[4].xy, TEMP[0], UNIFORM[1];
10: MUL TEMP[1].z, TEMP[2].xxxx, TEMP[4].yyyy;
11: MUL TEMP[1].w, TEMP[2].yyyy, TEMP[4].xxxx;
12: SUB TEMP[1].x, TEMP[1].zzzz, TEMP[1].wwww;
13: SGE TEMP[0].w, TEMP[1].xxxx, CONST[3].xxxx;
14: ELSE; # (goto 17)
15: MOV TEMP[0].w, CONST[3].xxxx;
16: ENDIF;
17: IF TEMP[0].wwww; # (if false, goto 25);
18: SUB TEMP[2].xy, UNIFORM[0], UNIFORM[2];
19: SUB TEMP[4].xy, TEMP[0], UNIFORM[2];
20: MUL TEMP[1].w, TEMP[2].xxxx, TEMP[4].yyyy;
21: MUL TEMP[2].z, TEMP[2].yyyy, TEMP[4].xxxx;
22: SUB TEMP[1].x, TEMP[1].wwww, TEMP[2].zzzz;
23: SGE TEMP[0].z, TEMP[1].xxxx, CONST[3].xxxx;
24: ELSE; # (goto 27)
25: MOV TEMP[0].z, CONST[3].xxxx;
26: ENDIF;
27: IF TEMP[0].zzzz; # (if false, goto 30);
28: MOV OUTPUT[1], CONST[4];
29: ELSE; # (goto 32)
30: MOV OUTPUT[1], CONST[5];
31: ENDIF;
32: END
Observe how the subroutine crs was inlined three times.

There are a lot of instructions here that operate on a single component or on two components. For a chip like the Intel i965 this is fine, because every shader instruction at the hardware level conceptually only operates on a single floating point value. This is in contrast to the Radeon chips, where the hardware level instructions still conceptually operate on four-component vectors.

The important point is that on Intel chips, one could emit the 32 instructions seen above more or less as is, without wasting too many resources. On a Radeon chip - let's use the R500 fragment processor to make it concrete - one could also emit those 32 instructions as is. The problem is that we would take 32 instruction slots, each capable of operating on a full 4-component vector, and use them for operations on only one or two components. In every cycle we waste two or three of the four available computation channels; roughly two thirds of the available computational resources go unused.
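
The waste can be quantified with a quick count over the ALU instructions of the listing above (IF/ELSE/ENDIF/END carry no writemask and are skipped). The writemask widths below are transcribed from the compiler output; the exact figure depends on what you choose to count, but it lands near the two-thirds estimate:

```python
# Components written per ALU instruction in the 32-instruction listing.
writemask_widths = [
    2, 2, 2, 1, 1, 1, 1,   # instructions 0-6
    2, 2, 1, 1, 1, 1,      # instructions 8-13
    1,                     # instruction 15
    2, 2, 1, 1, 1, 1,      # instructions 18-23
    1,                     # instruction 25
    4, 4,                  # instructions 28, 30 (MOV OUTPUT, all four components)
]

used = sum(writemask_widths)
available = 4 * len(writemask_widths)
print(f"{used}/{available} channels used, "
      f"{100 * (1 - used / available):.0f}% wasted")
```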

With a little bit of thought, one finds a better program to emit on an R500, particularly by rearranging the register usage a bit:
0: MAD r0, v1.xy11, p.11xy, -v0.xyxy;
1: DP4 r0.x, r0.x00-y, r0.w00z;
2: IF ALU_result >= 0; (if false, goto 11);
3: MAD r0, v2.xy11, p.11xy, -v1.xyxy;
4: DP4 r0.x, r0.x00-y, r0.w00z;
5: IF ALU_result >= 0; (if false, goto 11);
6: MAD r0, v0.xy11, p.11xy, -v2.xyxy;
7: DP4 r0.x, r0.x00-y, r0.w00z;
8: IF ALU_result >= 0; (if false, goto 11);
9: MOV out, .1111;
10: ELSE;
11: MOV out, .HHHH;
12: ENDIF;
13: END
Yes, the IF and ENDIF are unbalanced; the R500 flow control model permits this in some cases. Through some clever tricks - including exploiting the fact that the R500 fragment shader can negate the w part of a source operand independently from the xyz part - we more than halved the number of instructions, down to 13. Compare this to the estimate that the previous version wasted about two thirds of the available computational power.
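
That the MAD/DP4 pair really computes the same edge function as the original crs call can be checked with a small emulation of the swizzles. The helper names are made up for illustration; only the arithmetic matters:

```python
# Emulate: MAD r0, v1.xy11, p.11xy, -v0.xyxy
# followed by: DP4 r0.x, r0.x00-y, r0.w00z

def edge_mad_dp4(v0, v1, p):
    # MAD: r0 = (v1.x, v1.y, 1, 1) * (1, 1, p.x, p.y) + (-v0.x, -v0.y, -v0.x, -v0.y)
    r0 = (v1[0] - v0[0], v1[1] - v0[1], p[0] - v0[0], p[1] - v0[1])
    # DP4: (r0.x, 0, 0, -r0.y) . (r0.w, 0, 0, r0.z) = r0.x*r0.w - r0.y*r0.z
    return r0[0] * r0[3] - r0[1] * r0[2]

def edge_crs(v0, v1, p):
    # Reference: crs(v1 - v0, p - v0) from the GLSL source.
    u = (v1[0] - v0[0], v1[1] - v0[1])
    w = (p[0] - v0[0], p[1] - v0[1])
    return u[0] * w[1] - u[1] * w[0]

v0, v1, p = (1.0, 2.0), (5.0, 3.0), (4.0, 7.0)
print(edge_mad_dp4(v0, v1, p) == edge_crs(v0, v1, p))  # True
```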

The real challenge is recognizing these opportunities for optimizations automatically and applying them in our driver. The field is wide open here. Incidentally, the example above illustrates why I don't believe LLVM is of too much use for us. Somehow I doubt that a compiler project that has its roots in normal CPUs has useful knowledge about these kinds of optimization problems.

Of course, the first step is to support GLSL at all. Afterwards, we can talk again about such optimizations.

P.S.: Can you reduce the length of the program above even further? I have a version that uses only 8 instructions, though it involves quite significant changes to the flow control logic. Can you get there, too?

1 comment:

Anonymous said…

Rest assured that us Intel guys hate the Mesa GLSL output nearly as much as you do. Granted, our fragment shader is SOA so some of the pain goes away, but the vertex shader is AOS, so we've got the same problems. And if you look at the code, there's some terrible stuff -- MOV temp.xy, uniform.xy, SUB temp2.xy temp.xy uniform.xy?

There's a make-SSA-and-optimize thing in brw_wm_pass*.c, but it can't recover from flow control being present, so we end up in the GLSL path with no optimization. In tests I've done, having an empty if statement present (if uniform.x > 0 {}) costs about 30% performance.

Clearly we need some SSA coming out of the GLSL and ARB program parsers, and some general optimization passes we can do on them to get to something that we actually want to peephole optimize in our drivers. It would be fun to do things like filling my pipeline latency slots to improve performance, but that seems like a waste before I've even eliminated dead code!

Also, 100% agreement on llvm for GPU shaders. With just an afternoon of looking at the docs, I'm baffled that anyone ever thought it was a viable idea. LLVM requires that you break down loops and if statements to conditional branches. I can't do that on my GPU -- I'd have to take the LLVM output and recognize loops versus if statements with 100% accuracy to recover the high-level constructs and avoid blowing out my mask stack. I doubt that's possible, and even if it is, it sounds like a fragile mess. That doesn't even bring up how to support AOS in my vertex shaders, which on several apps I've been looking at appears to be my bottleneck these days.