Samstag, September 30, 2017

Posits

The posit number system is a proposed alternative to floating point numbers. Having heard of posits a couple of times now, I'd like to take the time to digest them and, in the second half, write a bit about their implementation in hardware. Their creator makes some bold claims about posits being simpler to implement, and - spoiler alert! - I believe he's mistaken. Posits are still a clever idea and may indeed be a good candidate for replacing floating point in the long run. But trade-offs are an inescapable fact of life.

Floating point revisited

In floating point, numbers are represented by a sign bit, an exponent, and a mantissa:

The value of a normal floating point number is ±1.m2*2e (actually, e is stored with a bias in order to be able to treat it like an unsigned number most of the time, but let's not get distracted by that kind of detail). By using an exponent, a wide range of numbers can be represented at a constant relative accuracy.

There are some non-normal floating point numbers. When e is maximal, the number is either considered infinity or "not a number", depending on m. When e is minimal, it represents a sub-normal number: either a denormal or zero.

Denormals can be confusing at first, but their justification is actually quite simple. Let's take single-precision floating point as an example, where there are 8 exponent bits and 23 mantissa bits. The smallest positive normal single-precision floating point number is 1.000000000000000000000002*2-126. The next larger representable number is 1.000000000000000000000012*2-126. Those numbers are not equal, but their difference is not representable as a normal single-precision floating point number. It would be rather odd if the difference between non-equal numbers were equal to zero, as it would be if we had to round the difference to zero!

When e is minimal, the represented number is (in the case of single-precision floating point) ±0.m2*2-126, which means that the difference between the smallest normal numbers, 0.000000000000000000000012*2-126, can still be represented.

Note how with floating point numbers, the relative accuracy with which numbers can be represented is constant for almost the entire range of representable numbers. Once you get to sub-normal numbers, the accuracy drops very quickly. At the other end, the drop is even more extreme with a sudden jump to infinity.

Posits

The basic idea of posits is to vary the size of the mantissa and to use a variable-length hybrid encoding of the exponent that mixes unary with binary encodings. The variable-length exponent encoding is shorter for exponents close to zero, so that more bits of mantissa are available for numbers close to one.

Posits have a fixed number of binary exponent bits e (except in the extreme ranges), and a posit system is characterized by that number. A typical choice appears to be es = 3. The unary part of the exponent is encoded by the r bits. For positive posits, 101 encodes 0, 1101 encodes 1, 011 encodes -1, 0011 encodes -2, and so on. The overall encoded number is then ±1.m2*2r*2es + e.

Let's look at some examples of 16-bit posits with es = 3.

0 10 000 0000000000 is 1.02*20*23+0 = 1.
0 10 000 1000000000 is 1.12*20*23+0 = 1.5.
0 10 001 0000000000 is 1.02*20*23+1 = 2.
0 10 111 0000000000 is 1.02*20*23+7 = 128.
0 110 000 000000000 is 1.02*21*23+0 = 256.
0 1110 000 00000000 is 1.02*22*23+0 = 65536.
0 111111111110 000 is 1.02*210*23+0 = 280. Note that there is no mantissa anymore! The next larger number is:
0 111111111110 001 is 1.02*210*23+1 = 281.
0 111111111110 111 is 1.02*210*23+7 = 287.
0 1111111111110 00 is 1.02*211*23+0 = 288. Now the number of binary exponent bits starts shrinking. The missing bits are implicitly zero, so the next larger number is:
0 1111111111110 01 is 1.02*211*23+2 = 290.
0 1111111111110 11 is 1.02*211*23+6 = 294.
0 11111111111110 0 is 1.02*212*23+0 = 296.
0 11111111111110 1 is 1.02*212*23+4 = 2100.
0 111111111111110 is 1.02*213*23+0 = 2104. There are no binary exponent bits left, but the presentation in the slides linked above still allows for one larger normal number:
0 111111111111111 is 1.02*214*23+0 = 2112.
Going in the other direction, we get:
0 01 111 0000000000 is 1.02*2-1*23+7 = 1/2 = 0.5.
0 01 000 0000000000 is 1.02*2-1*23+0 = 1/256 = 0.00390625.
0 001 000 000000000 is 1.02*2-2*23+0 = 1/65536 = 0.0000152587890625.
0 000000000001 111 is 1.02*2-11*23+7 = 2-81.
0 000000000001 000 is 1.02*2-11*23+0 = 2-88.
0 0000000000001 11 is 1.02*2-12*23+6 = 2-90.
0 0000000000001 00 is 1.02*2-12*23+0 = 2-96.
0 00000000000001 0 is 1.02*2-13*23+0 = 2-104.
0 000000000000001 is 1.02*2-14*23+0 = 2-112. This is the smallest positive normal number, since we have no choice but to treat 0 specially:
0 000000000000000 is 0.

For values close to 1, the accuracy is the same as for half-precision floating point numbers (which have 5 exponent and 10 mantissa bits). Half-precision floating point numbers do have slightly higher accuracy at the extreme ends of their dynamic range, but the dynamic range of posits is much higher. This is a very tempting trade-off for many applications.

By the way: if we had set es = 2, we could have larger accuracy for values close to 1, while still having a higher dynamic range than half-precision floating point.

You'll note that we have not encountered an infinity. Gustafson's proposal here is to do away with the distinction between positive and negative zero and infinity. Instead, his proposal is to think of the real numbers projectively, and use a two's complement representation, meaning that negating a posit is the same operation at the bit level as negating an integer. For example:

1 111111111111111 is -1.02*2-14*23+0 = -2-112.
1 10 000 0000000000 is -1.02*20*23+0 = -1. The next smaller number (larger in absolute magnitude) is:
1 01 111 1111111111 is -1.00000000012*20*23+0.
1 01 111 1000000000 is -1.12*20*23+0 = -1.5
1 000000000000001 is -1.02*214*23+0 = -2112.

The bit pattern 1 000000000000000 (which, like 0, is its own inverse in two's complement negation) would then represent infinity.

There's an elegance to thinking projectively in this way. Comparison of posits is the same as comparison of signed integers at the bit level (except for infinity, which is unordered). Even better, it's great that the smallest and largest normal numbers are multiplicative inverses of each other.

But to people used to floating point, not having a "sign + magnitude" representation is surprising. I also imagine that it could be annoying for a hardware implementation, so let's look into that.

Hardware implementations

In his presentations, Gustafson claims that by reducing the number of special cases, posits are easier to implement than floating point. No doubt there are fewer special cases (no denorms, no NaNs), but at the cost of a more complicated normal case.

Let's take a look at a floating point multiply. The basic structure is conceptually quite simple, since all parts of a floating point number can be treated separately:

By far the most expensive part here is the multiplication of the mantissas. There are of course a bunch of special cases that need to be accounted for: the inputs could be zero, infinity, or NaN, and the multiplication could overflow. Each of these cases are easily detected and handled with a little bit of comparatively inexpensive boolean logic.

Where it starts to get complicated is when handling the possibility that an input is denormal, or when the multiplication of two normal numbers results in a denormal.

When an input is denormal, the corresponding input for the multiply is 0.m instead of 1.m. Some logic has to decide whether the most significant input bit to the multiply is 0 or 1. This could potentially add to the latency of the computation. Luckily, deciding whether the input is denormal is fairly simple, and only the most significant input bit is affected. Because of carries, the less significant input bits tend to be more critical for latency. Conversely, this means that the latency of determining the most significant input bit can be hidden well.

On the output side, the cost is higher, both in terms of the required logic and in terms of the added latency, because a shifter is needed to shift the output into the correct position. Two cases need to be considered: When a multiplication of two normal numbers results in a denormal, the output has to be shifted to the right an appropriate number of places.

When a denormal is multiplied by a normal number, the output needs to be shifted to the left or the right, depending on the exponent of the normal number. Additionally, the number of leading zeros of either the denormal input or of the multiplication output is required to determine the exponent of the final result. Since the area cost is the same either way, I would expect implementations to determine the leading zero of the denormal input, since that allows for better latency hiding.

(The design space for floating point multipliers is larger than I've shown here. For example, you could deal with denormals by shifting their mantissa into place before the multiply. That seems like a waste of hardware considering that you cannot avoid the shifter after the multiply, but my understanding of hardware design is limited, so who knows.)

So there is a bit more hardware required than just what is shown in the diagram above: a leading-zero-count and a shifter, plus a bit more random logic. But now compare to the effort required for a posit multiply:

First of all, there is unavoidable latency in front of the multiplier. Every single bit of mantissa input may be masked off, depending on the variable size of the exponent's unary part. The exponents themselves need to be decoded in order to add them, and then the resulting exponent needs to be encoded again. Finally, the multiplication result needs to be shifted into place; this was already required for floating point multiplication, but the determination of the shift becomes more complicated since it depends on the exponent size. Also, each output bit needs a multiplexer since it can originate from either the exponent or the mantissa.

From my non-expert glance, here's the hardware you need in addition to the multiplier and exponent addition:
• two leading-bit counts to decode the unary exponent parts (floating-point multiply only needs a single leading-zero count for a denormal input)
• two shifters to shift the binary input exponent parts into place
• logic for masking the input mantissas
• one leading bit encoder
• one shifter to shift the binary output exponent part into place
• one shifter to shift the mantissa into place (floating-point also needs this)
• multiplexer logic to combine the variable-length output parts
Also note that the multiplier and mantissa shifter may have to be larger, since - depending on the value of es - the mantissa of posits close to 1 can be larger than the mantissa of floating point numbers.

On the other hand, the additional shifters don't have to be large, since they only need to shift es bits. The additional hardware is almost certainly dominated by the cost of the mantissa multiplier. Still, the additional latency could be a problem - though obviously, I have no actual experience designing floating point multipliers.

There's also the issue of the proposed two's complement representation for negative posits. This may not be too bad for the mantissa multiplication, since one can probably treat it as a signed integer multiplication and automatically get the correct signs for the resulting mantissa. However, I would expect some more overhead for the decoding and encoding of the exponent.

The story should be similar for posit vs. floating point addition. When building a multiply-accumulate unit, the latency that is added for masking the input based on the variable exponent length can likely be hidden quite well, but there does not appear a way around the decoding and encoding of exponents.

Closing thoughts

As explained above, I expect posit hardware to be more expensive than floating point hardware. However, the gain in dynamic range and accuracy is nothing to sneeze at. It's worth giving posits a fair shot, since the trade-off may be worth it.

There is a lot of legacy software that relies on floating point behavior. Luckily, a posit ALU contains all the pieces of a floating point ALU, so it should be possible to build an ALU that can do both at pretty much the cost of a posit-only ALU. This makes a painless transition feasible.

Posits have an elegant design based on thinking about numbers projectively, but the lack of NaNs, the two's complement representation, and not having signed zeros and infinities may be alien to some floating point practicioners. I don't know how much of an issue this really is, but it's worth pointing out that a simple modification to posits could accommodate all these concerns. Using again the example of 16-bit posits with es = 3, we could designate bit patterns at the extreme ends as NaN and infinity:
0 111111111111111 is +inf (instead of 2112).
0 000000000000001 is +NaN (instead of 2-112).
We could then treat the sign bit independently, like in floating point, giving us ±0, ±inf, and ±NaN. The neat properties related to thinking projectively would be lost, but the smallest and largest positive normal numbers would still be multiplicative inverses of each other. The hardware implementation may even be smaller, thanks to not having to deal with two's complement exponents.

The inertia of floating point is massive, and I don't expect it to be unseated anytime soon. But it's awesome to see people rethinking such fundamental building blocks of computing and coming up with solid new ideas. Posits aren't going to happen quickly, if at all, but it's worth taking them seriously.

Samstag, September 09, 2017

radeonsi: out-of-order rasterization on VI+

I've been polishing a patch of Marek to enable out-of-order rasterization on VI+. Assuming it goes through as planned, this will be the first time we're adding driver-specific drirc configuration options that are unfamiliar to the enthusiast community (there's radeonsi_enable_sisched already, but Phoronix has reported on the sisched option often enough). So I thought it makes sense to explain what those options are about.

Background: Out-of-order rasterization

Out-of-order rasterization is an optimization that can be enabled in some cases. Understanding it properly requires some background on how tasks are spread across shader engines (SEs) on Radeon GPUs.

The frontends (vertex processing, including tessellation and geometry shaders) and backends (fragment processing, including rasterization and depth and color buffers) are spread across SEs roughly like this:

(Not shown are the compute units (CUs) in each SE, which is where all shaders are actually executed.)

The input assembler distributes primitives (i.e., triangles) and their vertices across SEs in a mostly round-robin fashion for vertex processing. In the backend, work is distributed across SEs by on-screen location, because that improves cache locality.

This means that once the data of a triangle (vertex position and attributes) is complete, most likely the corresponding rasterization work needs to be distributed to other SEs. This is done by what I'm simplifying as the "crossbar" in the diagram.

OpenGL is very precise about the order in which the fixed-function parts of fragment processing should happen. If one triangle comes after another in a vertex buffer and they overlap, then the fragments of the second triangle better overwrite the corresponding fragments of the first triangle (if they weren't rejected by the depth test, of course). This means that the "crossbar" may have to delay forwarding primitives from a shader engine until all earlier primitives (which were processed in another shader engine) have been forwarded. This only happens rarely, but it's still sad when it does.

There are some cases in which the order of fragments doesn't matter. Depth pre-passes are a typical example: the order in which triangles are written to the depth buffer doesn't matter as long as the "front-most" fragments win in the end. Another example are some operations involved in stencil shadows.

Out-of-order rasterization simply means that the "crossbar" does not delay forwarding triangles. Triangles are instead forwarded immediately, which means that they can be rasterized out-of-order. With the in-progress patches, the driver recognizes cases where this optimization can be enabled safely.

By the way #1: From this explanation, you can immediately deduce that this feature only affects GPUs with multiple SEs. So integrated GPUs are not affected, for example.

By the way #2: Out-of-order rasterization is entirely disabled by setting R600_DEBUG=nooutoforder.

Why the configuration options?

There are some cases where the order of fragments almost doesn't matter. It turns out that the most common and basic type of rendering is one of these cases. This is when you're drawing triangles without blending and with a standard depth function like LEQUAL with depth writes enabled. Basically, this is what you learn to do in every first 3D programming tutorial.

In this case, the order of fragments is mostly irrelevant because of the depth test. However, it might happen that two triangles have the exact same depth value, and then the order matters. This is very unlikely in common scenes though. Setting the option radeonsi_assume_no_z_fights=true makes the driver assume that it indeed never happens, which means out-of-order rasterization can be enabled in the most common rendering mode!

Some other cases occur with blending. Some blending modes (though not the most common ones) are commutative in the sense that from a purely mathematical point of view, the end result of blending two triangles together is the same no matter which order they're blended in. Unfortunately, additive blending (which is one of those modes) involves floating point numbers in a way where changing the order of operations can lead to different rounding, which leads to subtly different results. Using out-of-order rasterization would break some of the guarantees the driver has to give for OpenGL conformance.

The option radeonsi_commutative_blend_add=true tells the driver that you don't care about these subtle errors and will lead to out-of-order rasterization being used in some additional cases (though again, those cases are rarer, and many games probably don't encounter them at all).

tl;dr

Out-of-order rasterization can give a very minor boost on multi-shader engine VI+ GPUs (meaning dGPUs, basically) in many games by default. In most games, you should be able to set radeonsi_assume_no_z_fights=true and radeonsi_commutative_blend_add=true to get an additional very minor boost. Those options aren't enabled by default because they can lead to incorrect results.