tag:blogger.com,1999:blog-361375062024-02-20T20:02:25.380+01:00Tagebuch eines Interplanetaren BotschaftersLerne, wie die Welt wirklich ist, aber vergiss niemals, wie sie sein sollte.Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.comBlogger217125tag:blogger.com,1999:blog-36137506.post-45332543258488551692024-02-07T12:30:00.001+01:002024-02-07T12:30:00.138+01:00Building a HIP environment from scratch<p>HIP is a C++-based, single-source programming language for writing GPU code. "Single-source" means that a single source file can contain both the "host code" which runs on the CPU and the "device code" which runs on the GPU. In a sense, HIP is "CUDA for AMD", except that HIP can actually target both AMD and Nvidia GPUs.</p><p>If you merely want to <i>use</i> HIP, your best bet is to look at <a href="https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html">the documentation</a> and download pre-built packages. (By the way, the documentation calls itself "ROCm" because that's what AMD calls its overall compute platform. It includes HIP, OpenCL, and more.)</p><p>I like to dig deep, though, so I decided I want to build at least the user space parts myself to the point where I can build a simple <a href="https://github.com/ROCm/HIP-Examples/tree/master/HIP-Examples-Applications/HelloWorld">HelloWorld</a> using a Clang from <a href="https://github.com/llvm/llvm-project">upstream LLVM</a>. It's all open-source, after all!</p><p>It's a bit tricky, though, in part because of the kind of bootstrapping problems you usually get when building toolchains: Running the compiler requires runtime libraries, at least by default, but building the runtime libraries requires a compiler. Luckily, it's not quite <i>that</i> difficult, because compiling the host libraries doesn't require a HIP-enabled compiler - any C++ compiler will do. 
And while the device libraries do require a HIP- (and OpenCL-)enabled compiler, it is possible to build code in a "freestanding" environment where runtime libraries aren't available.<br /></p><p>What follows is pretty much just a list of steps with running commentary on what the individual pieces do, since I didn't find an equivalent recipe in the official documentation. Of course, by the time you read this, it may well be outdated. Good luck!</p><p>Components need to be installed, but installing into some arbitrary prefix inside your <span style="font-family: courier;">$HOME</span> works just fine. Let's call it <span style="font-family: courier;">$HOME/prefix</span>. All packages use CMake and can be built using invocations along the lines of:</p><p><span style="font-family: courier;">cmake -S . -B build -GNinja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_INSTALL_PREFIX=$HOME/prefix -DCMAKE_PREFIX_PATH=$HOME/prefix<br />ninja -C build install</span></p><p><span style="font-family: inherit;">In some cases, additional variables need to be set.</span><br /></p><h3 style="text-align: left;">Step 1: clang and lld</h3><p>We're going to need a compiler and linker, so let's get <a href="https://github.com/llvm/llvm-project">llvm/llvm-project</a> and build it with Clang and LLD enabled: <span style="font-family: courier;">-DLLVM_ENABLE_PROJECTS='clang;lld' -DLLVM_TARGETS_TO_BUILD='X86;AMDGPU'</span></p><p>Building LLVM is an art of its own which is luckily <a href="https://llvm.org/docs/GettingStarted.html#local-llvm-configuration">reasonably well documented</a>, so I'm going to leave it at that.</p><h3 style="text-align: left;">Step 2: Those pesky cmake files</h3><p>Build and install <a href="https://github.com/ROCm/rocm-cmake">ROCm/rocm-cmake</a> to avoid cryptic error messages down the road when building other components that use those CMake files without documenting the dependency clearly. 
Not rocket science, but man am I glad for GitHub's search function.<br /></p><h3 style="text-align: left;">Step 3: libhsa-runtime64.so</h3><p>This is the lowest-level user-space host-side library in the ROCm stack. Its services, as far as I understand them, include setting up device queues and loading "code objects" (device ELF files). All communication with the kernel driver goes through here.</p><p>Notably though, this library does <i>not</i> know how to dispatch a kernel! In the ROCm world, the so-called Architected Queueing Language is used for that. An AQL queue is set up with the help of the kernel driver (and <i>that</i> does go through libhsa-runtime64.so), and then a small ring buffer and a "door bell" associated with the queue are mapped into the application's virtual memory space. When the application wants to dispatch a kernel, it (or rather, a higher-level library like libamdhip64.so that it links against) writes an AQL packet into the ring buffer and "rings the door bell", which basically just means writing a new ring buffer head pointer to the door bell's address. 
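To make that concrete, here is a sketch following the packet layout described in the HSA specification - not code from any of the libraries above, and the struct and function names are my own:

```cpp
#include <atomic>
#include <cstdint>

// Layout of an AQL kernel dispatch packet per the HSA specification:
// exactly 64 bytes, so a queue's ring buffer is just an array of these.
struct KernelDispatchPacket {
  uint16_t header;          // packet type, acquire/release fences, barrier bit
  uint16_t setup;           // number of grid dimensions
  uint16_t workgroup_size_x, workgroup_size_y, workgroup_size_z;
  uint16_t reserved0;
  uint32_t grid_size_x, grid_size_y, grid_size_z;
  uint32_t private_segment_size;
  uint32_t group_segment_size;
  uint64_t kernel_object;   // address of the kernel's code descriptor
  uint64_t kernarg_address; // address of the kernel argument buffer
  uint64_t reserved2;
  uint64_t completion_signal;
};
static_assert(sizeof(KernelDispatchPacket) == 64,
              "AQL packets are 64 bytes by definition");

// Toy model of a user-space queue. On real hardware the doorbell is a
// device page mapped into the process; here it is a plain atomic so the
// ordering requirement is still visible.
struct Queue {
  KernelDispatchPacket ring[16];
  std::atomic<uint64_t> doorbell{0};
  uint64_t write_index = 0;
};

void dispatch(Queue &q, const KernelDispatchPacket &pkt) {
  q.ring[q.write_index % 16] = pkt;
  ++q.write_index;
  // "Ring the door bell": a release store, so the packet contents are
  // guaranteed visible before the packet processor sees the new index.
  q.doorbell.store(q.write_index, std::memory_order_release);
}
```

A real queue involves the HSA runtime's queue creation and signal APIs, but the 64-byte packet and the release-ordered doorbell store are the essential ingredients.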
The door bell virtual memory page is mapped to the device, so ringing the door bell causes a PCIe transaction (for us peasants; <a href="https://www.amd.com/en/products/accelerators/instinct/mi300/mi300a.html">MI300A</a> has slightly different details under the hood) which wakes up the GPU.</p><p>Anyway, libhsa-runtime64.so comes in two parts for what I am being told are largely historical reasons:</p><ul style="text-align: left;"><li><a href="https://github.com/ROCm/ROCT-Thunk-Interface">ROCm/ROCT-Thunk-Interface</a></li><li><a href="https://github.com/ROCm/ROCR-Runtime">ROCm/ROCR-Runtime</a>; this one has one of those bootstrap issues and needs a <span style="font-family: courier;">-DIMAGE_SUPPORT=OFF</span></li></ul><p>The former is statically linked into the latter...</p><h3 style="text-align: left;">Step 4: It which must not be named</h3><p>For Reasons(tm), there is a fork of LLVM in the ROCm ecosystem, <a href="https://github.com/ROCm/llvm-project">ROCm/llvm-project</a>. Using upstream LLVM for the compiler seems to be fine and is what I as a compiler developer obviously want to do. However, this fork has an <a href="https://github.com/ROCm/llvm-project/tree/amd-staging/amd"><span style="font-family: courier;">amd</span></a> directory with a bunch of pieces that we'll need. I believe there is a desire to upstream them, but also an unfortunate hesitation from the LLVM community to accept something so AMD-specific.<br /></p><p>In any case, the required components can each be built individually against the upstream LLVM from step 1:<br /></p><ul style="text-align: left;"><li><a href="https://github.com/ROCm/llvm-project/tree/amd-staging/amd/hipcc">hipcc</a>; this is a frontend for Clang which is supposed to be user-friendly, but at the cost of adding an abstraction layer. 
I want to look at the details under the hood, so I don't want to and don't have to use it; but some of the later components want it</li><li><a href="https://github.com/ROCm/llvm-project/tree/amd-staging/amd/device-libs">device-libs</a>; as the name says, these are libraries of device code. I'm actually not quite sure what the intended abstraction boundary is between this one and the HIP libraries from the next step. I think these ones are meant to be tied more closely to the compiler so that other libraries, like the HIP library below, don't have to use <span style="font-family: courier;">__builtin_amdgcn_*</span> directly? Anyway, just keep on building...</li><li><a href="https://github.com/ROCm/llvm-project/tree/amd-staging/amd/comgr">comgr</a>; the "code object manager". Provides a stable interface to LLVM, Clang, and LLD services, up to (as far as I understand it) invoking Clang to compile kernels at runtime. But it seems to have no direct connection to the code-related services in libhsa-runtime64.so.<br /></li></ul><p>That last one is annoying. It needs a <span style="font-family: courier;">-DBUILD_TESTING=OFF</span></p><p>Worse, it has a fairly large interface with the C++ code of LLVM, which is famously not stable. In fact, at least during my little adventure, comgr wouldn't build as-is against the LLVM (and Clang and LLD) build that I got from step 1. I had to hack out a little bit of code in its symbolizer. I'm sure it's fine.</p><h3 style="text-align: left;">Step 5: libamdhip64.so</h3><p>Finally, here comes the library that implements the host-side HIP API. It also provides a bunch of HIP-specific device-side functionality, mostly by leaning on the device-libs from the previous step.</p><p>It lives in <a href="https://github.com/ROCm/clr">ROCm/clr</a>, which stands for either Compute Language Runtimes or Common Language Runtime. Who knows. Either one works for me. 
It's obviously for compute, and it's common because it also contains OpenCL support.<br /></p><p>You also need <a href="https://github.com/ROCm/HIP/">ROCm/HIP</a> at this point. I'm not quite sure why stuff is split up into so many repositories. Maybe ROCm/HIP is also used when targeting Nvidia GPUs with HIP, but ROCm/CLR isn't? Not a great justification in my opinion, but at least this <i>is</i> documented in the <a href="https://github.com/ROCm/clr/blob/develop/README.md">README</a>.<br /></p><p>CLR also needs a bunch of additional CMake options: <span style="font-family: courier;">-DCLR_BUILD_HIP=ON -DHIP_COMMON_DIR=${checkout of ROCm/HIP} -DHIPCC_BIN_DIR=$HOME/prefix/bin</span></p><h3 style="text-align: left;">Step 6: Compiling with Clang</h3><p>We can now build simple HIP programs with our own Clang against our own HIP and ROCm libraries:</p><p><span style="font-family: courier;">clang -x hip --offload-arch=gfx1100 --rocm-path=$HOME/prefix -rpath $HOME/prefix/lib -lstdc++ <a href="https://github.com/ROCm/HIP-Examples/blob/master/HIP-Examples-Applications/HelloWorld/HelloWorld.cpp">HelloWorld.cpp</a><br />LD_LIBRARY_PATH=$HOME/prefix/lib ./a.out</span></p><p>Neat, huh?<br /></p>Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com2tag:blogger.com,1999:blog-36137506.post-10333132434875491872023-12-31T12:52:00.003+01:002023-12-31T14:25:10.471+01:00Vulkan driver debugging stories<p>Recently, I found myself wanting to play some Cyberpunk 2077. Thanks to Proton, that's super easy and basically just works on Linux. Except that I couldn't enable raytracing, which annoyed me given that I have an RDNA3-based GPU that should be perfectly capable. Part of it may have been that I'm (obviously) using a version of the <a href="https://github.com/GPUOpen-Drivers/amdvlk">AMDVLK</a> driver.</p><p>The first issue was that Proton simply wouldn't advertise raytracing (DXR) capabilities on my setup. 
That is easily worked around by setting <span style="color: #990000; font-family: courier;">VKD3D_CONFIG=dxr</span> in the environment (in Steam launch options, set the command to <span style="color: #990000; font-family: courier;">VKD3D_CONFIG=dxr %command%</span>).</p><p>This allowed me to enable raytracing in the game's graphics settings, which unfortunately promptly caused a GPU hang and a GPUVM fault report in <span style="color: #990000; font-family: courier;">dmesg</span>. Oh well, time for some debugging. That is (part of) my job, after all.</p><p>The fault originated from TCP, which means it's a shader vector memory access to a bad address. There's a virtually limitless number of potential root causes, so I told the <span style="color: #990000; font-family: courier;">amdgpu</span> kernel module to take it easy on the reset attempts (by setting the <span style="color: #990000; font-family: courier;">lockup_timeout</span> module parameter to a rather large value - that can be done on the Grub command line, but I chose to add a setting in <span style="color: #990000; font-family: courier;">/etc/modprobe.d/</span> instead) and broke out good old trusty <a href="https://gitlab.freedesktop.org/tomstdenis/umr/">UMR</a> in client/server mode (run with <span style="color: #990000; font-family: courier;">--server</span> on the system under debug, and with <span style="color: #990000; font-family: courier;">--gui tcp://${address}:1234</span> on another system) to look at the waves that were hung. 
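For reference, the /etc/modprobe.d/ variant is a one-line file; this is a sketch, where the file name is arbitrary and the value is a timeout in milliseconds chosen to be effectively forever:

```
# /etc/modprobe.d/amdgpu-debug.conf (file name is arbitrary)
options amdgpu lockup_timeout=600000
```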
Sure enough, they had the <span style="color: #990000; font-family: courier;">fatal_halt</span> bit set, were stuck a few instructions past a <span style="color: #990000; font-family: courier;">global_load_b64</span>, and looking at VGPRs did suggest a suspicious address.</p><p>Tooling for shader debugging is stuck in the earlier parts of the 20th century (which may seem like an impressive feat of time travel given that programmable shading didn't even exist back then, but trust me it's genuinely and inherently way more difficult than CPU debug), so the next step was to get some pipeline dumps to correlate against the disassembly shown in UMR. Easy peasy, point the Vulkan driver at a custom <span style="color: #990000; font-family: courier;">amdVulkanSettings.cfg</span> by way of the <span style="color: #990000; font-family: courier;">AMD_CONFIG_DIR </span>environment variable and enable pipeline dumping by adding <span style="color: #990000; font-family: courier;">EnablePipelineDump,1</span> to the config file. Oh, and setting the <span style="color: #990000; font-family: courier;">AMD_DEBUG_DIR</span> environment variable is helpful, too. Except now the game crashed before it even reached the main menu. Oops.</p><p>Well, that's a CPU code problem, and CPU debugging has left the 1970s firmly behind for somewhere in the 1990s or early 2000s. So let's get ourselves a debug build of the driver and attach gdb. Easy, right? Right?!? No. Cyberpunk 2077 is a Windows game, run in Proton, which is really Wine, which is really an emulator that likes to think of itself as not an emulator, run in some kind of container called a "pressure vessel" to fit the Steam theme. Fun.</p><p>To its credit, Proton tries to be helpful. 
You can set <span style="color: #990000; font-family: courier;">PROTON_DUMP_DEBUG_COMMANDS=1</span> in the environment, which dumps some shell scripts to <span style="color: #990000; font-family: courier;">/tmp/proton-$user/</span>, which allowed me to comparatively easily launch Cyberpunk 2077 from the terminal without going through the Steam client each time. But Wine seems to hate debugging, and it seems to hate debugging of native Linux code even more, and obviously the Vulkan driver is native Linux code. All my attempts to launch the game in some form of debugger in order to catch it red-handed were in vain.</p><p>At this point, I temporarily resigned myself to more debugging time travel of the bad kind, i.e. backwards in time to worse tooling. <span style="color: #990000; font-family: courier;">printf()</span> still works, after all, and since the crash was triggered by enabling pipeline dumps, I had a fairly good idea about the general area in the driver that must have contained the problem.</p><p>So I went on a spree of sprinkling <span style="color: #990000; font-family: courier;">printf()</span>s everywhere, which led to some extremely confusing and non-deterministic results. Confusing and non-deterministic is a really great hint, though, because it points at multi-threading. Indeed, Cyberpunk 2077 is a good citizen and does multi-threaded pipeline compilation. Or perhaps VKD3D is being helpful. Either way, it's a good thing, except it exposed a bug in the driver. So I started sprinkling <span style="color: #990000; font-family: courier;">std::lock_guard</span>s everywhere. That helped narrow down the problem area. Add some good old staring at code and behold: somebody had very recently added a use of <span style="color: #990000; font-family: courier;">strtok()</span> to the pipeline dumping logic. 
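The hazard with strtok() is that it keeps its scan position in hidden global state, so two threads tokenizing at the same time silently corrupt each other. The reentrant fix makes the caller own that state; a sketch (not the actual driver code):

```cpp
#include <cstring>
#include <string>
#include <vector>

// Tokenize a copy of `text`. strtok_r() stores its position in the
// caller-provided `save` pointer instead of a hidden static variable,
// so concurrent calls from multiple threads cannot interfere.
std::vector<std::string> tokenize(std::string text, const char *delims) {
  std::vector<std::string> tokens;
  char *save = nullptr;
  for (char *tok = strtok_r(text.data(), delims, &save); tok;
       tok = strtok_r(nullptr, delims, &save))
    tokens.push_back(tok);
  return tokens;
}
```

In C++ code, splitting on a std::string_view avoids the C tokenizer family entirely, which is the more robust fix.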
Very bad idea, very easy fix.</p><p>Okay, so I can dump some pipelines now, but I still don't get to the main menu because the game now crashes with an assertion somewhere in PAL. I could start staring at pipeline dumps, but this is an assertion that (1) suggests a legitimate problem, which means prioritizing it might actually be helpful, and (2) is in the kind of function that is called from just about everywhere, which means I really, really need to be able to look at a stacktrace now. It's time to revisit debuggers.</p><p>One of the key challenges with my earlier attempts at using gdb was that (1) Wine likes to fork off tons of processes, which means getting gdb to follow the correct one is basically impossible, and (2) the crash happens very quickly, so manually attaching gdb after the fact is basically impossible. But the whole point of software development is to make the impossible possible, so I tweaked the implementation of <span style="color: #990000; font-family: courier;">PAL_ASSERT</span> to poke at <span style="color: #990000; font-family: courier;">/proc/self</span> to figure out whether a debugger is already attached and if one isn't, optionally print out a helpful message including the PID and then sleep instead of calling abort() immediately. This meant that I could now attach gdb at my leisure, which I did.</p><p>And was greeted with an absolutely useless gdb session because Wine is apparently being sufficiently creative with the dynamic linker structures that gdb can't make sense of what's happening on its own and doesn't find any symbols, let alone further debug info, and so there was no useful backtrace. Remember how I mentioned that Wine hates debugging?</p><p>Luckily, a helpful soul pointed out that <span style="color: #990000; font-family: courier;">/proc/$pid/maps</span> exists and tells us where .so's are mapped into a process address space, and there's absolutely nothing Wine can do about that. 
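The format of /proc/$pid/maps is simple enough to parse by hand. This sketch mirrors what such a gdb helper has to do (it is an illustration, not the script itself): find the lowest mapped address of each shared object.

```cpp
#include <cstdint>
#include <fstream>
#include <map>
#include <sstream>
#include <string>

// Parse /proc/<pid>/maps and return the lowest mapped address of every
// shared object, i.e. the base address needed to relocate its symbols.
// Each line looks like:
//   7f1234560000-7f1234580000 r-xp 00000000 08:01 1234  /usr/lib/libfoo.so
std::map<std::string, uint64_t> sharedLibraryBases(const std::string &pid) {
  std::map<std::string, uint64_t> bases;
  std::ifstream maps("/proc/" + pid + "/maps");
  std::string line;
  while (std::getline(maps, line)) {
    std::istringstream fields(line);
    std::string range, perms, offset, dev, inode, path;
    fields >> range >> perms >> offset >> dev >> inode >> path;
    if (path.find(".so") == std::string::npos)
      continue; // anonymous mappings, [heap], the main binary, etc.
    uint64_t start = std::stoull(range.substr(0, range.find('-')), nullptr, 16);
    auto it = bases.find(path);
    if (it == bases.end() || start < it->second)
      bases[path] = start;
  }
  return bases;
}
```

That base address is roughly what you need to hand back to gdb (via add-symbol-file, adjusted for section offsets) to get symbols again.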
Even better, gdb allows the user to manually tell it about shared libraries that have been loaded. Even even better, gdb can be scripted with Python. So, I wrote <a href="https://gist.github.com/nhaehnle/754a914587f5da52c37c3481eb806996">a gdb script that walks the backtrace and figures out which shared libraries to tell gdb about to make sense of the backtraces</a>. (Update: Friedrich Vock <a href="https://mastodon.gamedev.place/@pixelcluster/111675112004784516">helpfully pointed out</a> that attaching with <span style="color: #990000; font-family: courier;">gdb -p $pid /path/to/the/correct/bin/wine64</span> also allows gdb to find shared libraries.)<br /></p><p>At this point, another helpful soul pointed out that <a href="https://github.com/ValveSoftware/Fossilize">Fossilize</a> exists and can play back pipeline creation in a saner environment than a Windows game running on VKD3D in Wine in a pressure vessel. That would surely have reduced my debugging woes somewhat. Oh well, at least I learned something.</p><p>From there, fixing all the bugs was almost a walk in the park. The assertion I had run into in PAL was easy to fix, and finally I could get back to the original problem: that GPU hang. That turned out to be a fairly mundane problem in <a href="https://github.com/GPUOpen-Drivers/llpc">LLPC</a>'s raytracing implementation, for which I have a fix. It's still going to take a while to trickle out, in part because this whole debugging odyssey has a corresponding complex chain of patches that just take a while to ferment, and in part because pre-Christmas is very far from a quiet time and things have just been generally crazy. 
Still: very soon you, too, will be able to play Cyberpunk 2077 with raytracing using the AMDVLK driver.<br /></p>Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-74592187777010350922023-05-12T10:40:00.001+02:002023-05-12T10:40:28.742+02:00An Update on Dialects in LLVM<p>EuroLLVM took place in Glasgow this week. I wasn't there, but it's a good opportunity to check in with what's been happening in dialects for LLVM in the ~half year since my <a href="https://www.youtube.com/watch?v=VbFqA9rvxPs">keynote</a> at the LLVM developer meeting.</p><h2 style="text-align: left;">Where we came from <br /></h2><p>To give an ultra-condensed recap: The excellent idea that MLIR brought to the world of compilers is to explicitly separate the <i>substrate</i> in which a compiler intermediate representation is implemented (the class hierarchy and basic structures that are used to represent and manipulate the program representation at compiler runtime) from the semantic definition of a <i>dialect</i> (the types and operations that are available in the IR and their meaning). Multiple dialects can co-exist on the same substrate, and in fact the phases of compilation can be identified with the set of dialects that are used within each phase.</p><p>Unfortunately for AMD's shader compiler, while MLIR is part of the LLVM project and shares some foundational support libraries with LLVM, its IR substrate is entirely disjoint from LLVM's IR substrate. If you have an existing compiler built on LLVM IR, you could bolt on an MLIR-based frontend, but what we really need is a way to gradually introduce some of the capabilities offered by MLIR throughout an existing LLVM-based compilation pipeline.<br /></p><p>That's why I started <a href="https://github.com/GPUOpen-Drivers/llvm-dialects">llvm-dialects</a> last year. 
We published its initial release a bit more than half a year ago, and have greatly expanded its capabilities since then.<br /></p><h2 style="text-align: left;">Where we are now</h2><p>We have been using llvm-dialects in production for a while now. Some of its highlights so far are:</p><ul style="text-align: left;"><li>Almost feature-complete for defining custom operations (aka intrinsics or instructions). The main thing that's missing is varargs support - we just haven't needed that yet.<br /></li><li>Most of the way there for defining custom types: custom types can be defined, but they can't be used everywhere. I'm <a href="https://discourse.llvm.org/t/rfc-target-type-classes-for-extensibility-of-llvm-ir/69813/">working on closing the gaps</a> as we speak - some upstream changes in LLVM itself are required.</li><li>Expressive <a href="https://github.com/GPUOpen-Drivers/llvm-dialects/blob/dev/docs/Constraints.md">language for describing constraints</a> on operation and type arguments and operation results - see examples <a href="https://github.com/GPUOpen-Drivers/llvm-dialects/blob/dev/example/ExampleDialect.td">here</a> and <a href="https://github.com/GPUOpen-Drivers/llpc/blob/dev/lgc/interface/lgc/LgcDialect.td">here</a>.<br /></li><li>Thorough, automatically generated IR verifier routines.<br /></li><li>A flexible and efficient <a href="https://github.com/GPUOpen-Drivers/llvm-dialects/blob/dev/include/llvm-dialects/Dialect/Visitor.h">visitor mechanism</a> that is inspired by but beats LLVM's TypeSwitch in some important ways.</li></ul><p>Transitioning to the use of llvm-dialects is a gradual process for us and far from complete. We have always had custom operations, but we used to implement them in a rather ad-hoc manner. 
The old way of doing it consisted of hand-writing code like this:</p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">SmallVector<Value *, 4> args;<br />std::string instName = lgcName::OutputExportXfb;<br />args.push_back(getInt32(xfbBuffer));<br />args.push_back(xfbOffset);<br />args.push_back(getInt32(streamId));<br />args.push_back(valueToWrite);<br />addTypeMangling(nullptr, args, instName);<br />return CreateNamedCall(instName, getVoidTy(), args, {});</span><br /></p><p>With llvm-dialects, we can use a much cleaner builder pattern:</p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">return create<InputImportGenericOp>(<br /> resultTy, false, location, getInt32(0), elemIdx,<br /> PoisonValue::get(getInt32Ty()));</span><br /></p><p>Accessing the operands of a custom operation used to be a matter of code with magic numbers everywhere:</p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">if (callInst.arg_size() > 2)<br /> vertexIdx = isDontCareValue(callInst.getOperand(2))<br /> ? 
nullptr : callInst.getOperand(2);</span><br /></p>With llvm-dialects, we get far more readable code:<p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">Value *vertexIdx = nullptr;<br />if (!inputOp.getPerPrimitive())<br /> vertexIdx = inputOp.getArrayIndex();</span><br /></p><p></p><p>Following the example set by MLIR, these accessor methods as well as the machinery required to make the <span style="font-family: courier;">create<FooOp>(...)</span> builder call work are automatically generated from a dialect definition written in a TableGen DSL.<br /></p><p>An important lesson from the transition so far is that the biggest effort, but also one of the biggest benefits, has to do with getting to a properly defined IR in the first place.</p><p>I firmly believe that understanding a piece of software starts not with the code that is executed but with the interfaces and data structures that the code implements and interacts with. In a compiler, the most important data structure is the IR. You should think of the IR as the bulk of the interface for almost all compiler code.<br /></p><p>When defining custom operations in the ad-hoc manner that we used to use, there isn't one place in which the operations themselves are defined. Instead, the definition is implicit in the scattered locations where the operations are created and consumed. More often than is comfortable, this leads to definitions that are fuzzy or confused, which leads to code that is fuzzy and confused, which leads to bugs and a high maintenance cost, which leads to the dark side (or something).</p><p>By having a designated location where the custom operations are explicitly defined - the TableGen file - there is a significant force pushing towards proper definitions. As the experience of MLIR shows, this isn't <i>automatic</i> (witness the rather thin documentation of many of the dialects in upstream MLIR), but without this designated location, it's bound to be worse. 
And so a large part of transitioning to a systematically defined dialect is cleaning up those instances of confusion and fuzziness. It pays off: I have found hidden bugs this way, and the code becomes noticeably more maintainable.<br /></p><h2 style="text-align: left;">Where we want to go<br /></h2><p>llvm-dialects is already a valuable tool for us. I'm obviously biased, but if you're in a similar situation to us, or you're thinking of starting a new LLVM-based compiler, I recommend it.</p><p>There is more that can be done, though, and I'm optimistic we'll get around to further improvements over time as we gradually convert parts of our compiler that are being worked on anyway. My personal list of items on the radar:</p><ul style="text-align: left;"><li>As mentioned already, closing the remaining gaps in custom type support.</li><li>Our compiler uses quite complex metadata in a bunch of places. It's hard to read for humans, doesn't have a good compatibility story for lit tests, and accessing it at compile-time isn't particularly efficient. I have some ideas for how to address all these issues with an extension mechanism that could also benefit upstream LLVM. <br /></li><li>Compile-time optimizations. At the moment, casting custom operations is still based on string comparison, which is clearly not ideal. There are a bunch of other things in this general area as well.<br /></li><li>I really want to see some equivalent of MLIR regions in LLVM. But that's a non-trivial amount of work and will require patience.<br /></li></ul><p>There's also the question of if or when llvm-dialects will eventually be integrated into LLVM upstream. There are lots of good arguments in favor. Its DSL for defining operations is a lot friendlier than what is used for intrinsics at the moment. Getting nice, auto-generated accessor methods and thorough verification for intrinsics would clearly be a plus. But it's not a topic that I'm personally going to push in the near future. 
I imagine we'll eventually get there once we've collected even more experience.</p><p>Of course, if llvm-dialects is useful to you and you feel like
contributing in these or other areas, I'd be more than happy about
that!<br /></p><p></p>Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-83520821768339677802023-01-21T05:20:00.001+01:002023-01-21T17:19:18.945+01:00Diff modulo base, a CLI tool to assist with incremental code reviews<p>One of the challenges of reviewing a lot of code is that many reviews require multiple iterations. I really don't want to do a full review from scratch on the second and subsequent rounds. I need to be able to see what has changed since last time.</p><p>I happen to work on projects that care about having a useful Git history. This means that authors of (without loss of generality) pull requests use amend and rebase to change commits and force-push the result. I would like to see only the changes they made since my last review pass. Especially when the author also rebased onto a new version of the main branch, existing code review tools tend to break down.</p><p>Git has a little-known built-in subcommand, <span style="font-family: courier;">git range-diff</span>, which I had been using for a while. It's pretty cool, really: It takes two ranges of commits, old and new, matches old and new commits, and then shows how they changed. The rather huge problem is that its output is a diff of diffs. Trying to make sense of those quickly becomes headache-inducing.</p><p>I finally broke down at some point late last year and wrote my own tool, which I'm calling <span style="font-family: courier;">diff-modulo-base</span>. 
It allows you to look at the difference of the repository contents between <span style="font-family: courier;">old</span> and <span style="font-family: courier;">new</span> in the history below, while ignoring all the changes that are due to differences in the respective base versions <span style="font-family: courier;">A</span> and <span style="font-family: courier;">B</span>.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieoGKnet10s5KErjwMpNxgO6_IFNxeCbMb-pEFgXg4AcKnD1LUioYqiWW3GcXPNx-_oW4uCKUvWAYXdvcK2ue4CqEsJeQ0jY687uBoGhZvbUTtKVVGF6CGAuLDo90Rwa6C-vQZVUjhwmlw4zbg5_cvEzyIEJ-FY2OKbAyBP7LiHSVgDyhutYI/s135/path31.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="57" data-original-width="135" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieoGKnet10s5KErjwMpNxgO6_IFNxeCbMb-pEFgXg4AcKnD1LUioYqiWW3GcXPNx-_oW4uCKUvWAYXdvcK2ue4CqEsJeQ0jY687uBoGhZvbUTtKVVGF6CGAuLDo90Rwa6C-vQZVUjhwmlw4zbg5_cvEzyIEJ-FY2OKbAyBP7LiHSVgDyhutYI/s16000/path31.png" /></a></div> <p></p>
workflow. That workflow is surprisingly usable in that you can
essentially treat the notifications as an inbox in which you can mark
notifications as unread or completed, and can "mute" issues and pull
requests, all with keyboard shortcuts.</p><p>What's missing in my workflow is a reliable way to remember the most recent version of a pull request that I have reviewed. My somewhat passable workaround for now is to <span style="font-family: courier;">git fetch</span> before I do a round of reviews, and rely on the local reflog of remote refs. A Git alias allows me to say</p><p><span style="font-family: courier;">git dmb-origin $pull_request_id</span></p><p>and have that become</p><p><span style="font-family: courier;">git diff-modulo-base origin/main origin/pull/$pull_request_id/head@{1} origin/pull/$pull_request_id/head<br /></span></p><p>which is usually what I want.</p><p>Ideally, I'd have a fully local way of interacting with GitHub notifications, which could then remember the reviewed version in a more reliable way. This ought to also fix the terrible lagginess of the web interface. But that's a rant for another time.</p><h3>Rust<br /></h3><p>This is the first serious piece of code I've written in Rust. I have to say that experience has really been quite pleasant so far. Rust's tooling is pretty great, mostly thanks to the rust-analyzer LSP server.<br /></p><p>The one thing I'd wish is that the borrow checker was able to better understand "partial" borrows. I find it occasionally convenient to tie a bunch of data structures together in a general context structure, and helper functions on such aggregates can't express that they only borrow part of the structure. This can usually be worked around by changing data types, but the fact that I have to do that is annoying. 
It feels like having to solve a puzzle that isn't part of the inherent complexity of the underlying problem that the code is trying to solve.</p><p>And unlike, say, circular references or graph structures in general, where it's clear that expressing and proving the sort of useful lifetime facts that developers might intuitively reason about quickly becomes intractable, improving the support for partial borrows feels like it should be a tractable problem.</p>Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-59818296350733481742023-01-14T23:30:00.001+01:002023-01-14T23:35:03.409+01:00Software testing, and why I'm unhappy about it<p>Automated testing of software is great. Unfortunately, what's commonly considered best practice for how to integrate testing into the development flow is a bad fit for a lot of software. You know what I mean: You submit a pull request, some automated testing process is kicked off in the background, and some time later you get back a result that's either Green (all tests passed) or Red (at least one test failed). Don't submit the PR if the tests are red. Sounds good, doesn't quite work.</p><p>There is Software Under Test (SUT), the software whose development is the goal of what you're doing, and there's the Test Bench (TB): the tests themselves, but also additional relevant parts of the environment like perhaps some hardware device that the SUT is run against.</p><p>The above development practice works well when the SUT and TB are both defined by the same code repository and are developed together. And admittedly, that is the case for a lot of useful software. But it just so happens that I mostly tend to work on software where the SUT and TB are inherently split. 
Graphics drivers and shader compilers implement some spec (like Vulkan or Direct3D), and an important part of the TB is a conformance test suite and other tests, the bulk of which are developed separately from the driver itself. Not to mention the GPU itself and other software like the kernel mode driver. The point is, TB development is split from SUT development, and it is infeasible to make changes to the two in lockstep.</p><h2 style="text-align: left;">Down with No Failures, long live No Regressions</h2><p>Problem #1 with keeping all tests passing all the time is that tests can fail for reasons whose root cause is not an SUT change.</p><p>For example, a new test case is added to the conformance test suite, but that test happens to fail. Suddenly nobody can submit any changes anymore.</p><p>That clearly makes no sense, and because Tooling Sucks(tm), what folks typically do is maintain a manual list of test cases that are excluded from automated testing. This unblocks development, but are you going to remember to update that exclusion list? Bonus points if the exclusion list isn't even maintained in the same repository as the SUT, which just compounds the problem.</p><p>The situation is worse when you bring up a large new feature or perhaps a new version of the hardware supported by your driver (which is really just a super duper large new feature), where there is already a large body of tests written by somebody else. Development of the new feature may take months and typically is merged bit by bit over time. For most of that time, there are going to be some test failures. And that's <i>fine</i>!</p><p>Unfortunately, a typical coping mechanism is that automated testing for the feature is entirely disabled until the development process is complete. The consequences are dire, as regressions in relatively basic functionality can go unnoticed for a fairly long time.</p><p>And sometimes there are simply changes in the TB that are hard to control.
Maybe you upgraded the kernel mode driver for your GPU on the test systems, and suddenly some weird corner case tests fail. Yes, you have to fix it somehow, but removing the test case from your automated testing process is almost always the wrong response. <br /></p><p>In fact, failing tests are, given the right context, a good thing! Let's say a bug is discovered in a real application in the field. Somebody root causes the problem and writes a simplified reproducer. This reproducer should be added to the TB as soon as possible, even if it is going to fail initially!</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjL2uq30HckWSXTD2FVakr8tW3SZQm16JpFpBtfFGe_YztdPMAg_Ks9ZCSN6on_FrglbxpWdY6BQYLbEta4ZL9sM1esXIz8kHsBLFsd-FtipULcmozF2vCE-vrDJJ87AOPQdbNtBAuvzImdlBDFdA6QC61cc-1N0ELsReAcdtOyrVeF0msRtEo/s886/77htb8.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="499" data-original-width="886" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjL2uq30HckWSXTD2FVakr8tW3SZQm16JpFpBtfFGe_YztdPMAg_Ks9ZCSN6on_FrglbxpWdY6BQYLbEta4ZL9sM1esXIz8kHsBLFsd-FtipULcmozF2vCE-vrDJJ87AOPQdbNtBAuvzImdlBDFdA6QC61cc-1N0ELsReAcdtOyrVeF0msRtEo/s320/77htb8.jpg" width="320" /></a></div><p>To be fair, many of the common testing frameworks recognize this by allowing tests to be marked as "expected to fail". But they typically also assume that the TB can be changed in lockstep with the SUT and fall on their face when that isn't the case.</p><p>What is needed here is to treat testing as a truly continuous exercise, with some awareness by the automation of how test runs relate to the development history.</p><p>During day-to-day development, the important bit isn't that there are no failures. 
The important bit is that there are no regressions.</p><p>Automation ought to track which tests pass on the main development branch and provide pre-commit reports for pull requests relative to those results: Have there been any regressions? Have any tests been fixed? Block code submissions when they cause regressions, but don't block them for pre-existing failures, <i>especially</i> when those failures are caused by changes in the TB.</p><p>Changes to the TB should also be tested where possible, and when they cause regressions those should be investigated. But it is quite common that regressions caused by a TB change are legitimate and shouldn't block the TB change.<br /></p><p></p><h2 style="text-align: left;">Sparse testing</h2><p>Problem #2 is that good test coverage means that tests take a very long time to run.</p><p>Your first solution to this problem should be to parallelize and throw more hardware at it. Let's hope the people who control the purse care enough about quality.</p><p>There is sometimes also low-hanging fruit you should pick, like wasting lots of time in process (or other) startup and teardown overhead. Addressing that can be a double-edged sword. Changing a test suite from running every test case in a separate process to running multiple test cases sequentially in the same process reduces isolation between the tests and can therefore make the tests flakier. It can also expose genuine bugs, though, and so the effort is usually worth it.</p><p>But all these techniques have their limits.</p><p>Let me give you an example. Compilers tend to have <i>lots</i> of switches that subtly change the compiler's behavior without (intentionally) affecting correctness. How good is your test coverage of these?</p><p>Chances are, most of them don't see any real testing at all. You probably have a few hand-written test cases that exercise the options. 
But what you really should be doing is to run an entire suite of end-to-end tests with each of the switches applied to make sure you aren't missing some interaction. And you really should be testing all combinations of switches as well.<br /></p><p>The combinatorial explosion is intense. Even if you only have 10 boolean switches, testing them each individually without regressing the turn-around time of the test suite requires 11x more test system hardware. Testing all possible combinations requires 1024x more. Nobody has that kind of money.</p><p>The good news is that having extremely high confidence in the quality of your software doesn't require that kind of money. If we run the entire test suite a small number of times (maybe even just once!) and independently choose a random combination of switches for each test, then not seeing any regressions there is a great indication that there really aren't any regressions.</p><p>Why is that? Because failures are correlated! Test T failing with a default setting of switches is highly correlated with test T failing with some non-default switch S enabled.</p><p>This effect isn't restricted to taking the cross product of a test suite with a bunch of configuration switches. By design, an exhaustive conformance test suite is going to have many sets of tests with high failure correlation. For example, in the Vulkan test suite you might have a bunch of test cases that all do the same thing, but with a different combination of framebuffer format and blend function. When there is a regression affecting such tests, the specific framebuffer format or blend function might not matter at all, and all of the tests will regress.
Or perhaps the regression is related to a specific framebuffer format, and so all tests using that format will regress regardless of the blend function that is used, and so on.</p><p>A good automated testing system would leverage these observations using statistical methods (aka machine learning).</p><p>Combinatorial explosion causes your full test suite to take months to run? No problem, treat testing as a genuinely continuous task. Have test systems continuously run random samplings of test cases on the latest version of the main development branch. Whenever a change is made to either the SUT or TB, switch over to testing that new version instead. When a failure is encountered, automatically determine if it is a regression by referring to earlier test results (of the exact same test if it has been run previously, or related tests otherwise) combined with a bisection over the code history.</p><p>Pre-commit testing becomes an interesting fuzzy problem. By all means have a small traditional test suite that is manually curated to run within a few minutes. But we can also apply the approach of running randomly sampled tests to pre-commit testing.</p><p>A good automated testing system would learn a statistical model of regressions and combine that with the test results obtained so far to provide an estimate of the likelihood of regression. As long as no regression is actually found, this likelihood will keep dropping as more tests are run, though it will not drop to 0 unless all tests are run (and in the setup described here, that would take months). The team can define a likelihood threshold that a change must reach before it can be committed, based on their appetite for risk and rate of development.</p><p>The statistical model should be augmented with source-level information about the change, such as keywords that appear in the diff and commit message and the set of files that was changed.
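</p>

<p>As a toy illustration (all names here are made up, and a real system would use such scores as sampling weights in a learned model rather than as a hard ordering), tests can be scored by keyword overlap with the changed files:</p>

```rust
// Toy prior: tests whose names share keywords with the changed files are
// more likely to catch a regression from this change, so they should be
// sampled with higher probability. All names here are hypothetical.
fn score_test(test_name: &str, changed_files: &[&str]) -> usize {
    changed_files
        .iter()
        .filter(|file| {
            // Split the test name into crude keywords and check whether
            // any of them appears in the changed file's path.
            test_name
                .split(|c: char| !c.is_alphanumeric())
                .any(|kw| !kw.is_empty() && file.contains(kw))
        })
        .count()
}

// Order tests so the most "suspicious" ones run first.
fn prioritize<'a>(tests: &[&'a str], changed_files: &[&str]) -> Vec<&'a str> {
    let mut ranked = tests.to_vec();
    // Stable sort: ties keep their original order.
    ranked.sort_by_key(|t| std::cmp::Reverse(score_test(t, changed_files)));
    ranked
}
```

<p>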
After all, there ought to be some meaningful correlation between regressions in a raytracing test case and the fact that the regressing change affected a file with "raytracing" in its name. The model should then also be used to bias the random sampling of tests to be run to maximize the information extracted per effort spent on running test cases.<br /></p><h2 style="text-align: left;">Some caveats</h2><p>What I've described is largely motivated by the fact that the world is messier than commonly accepted testing "wisdom" allows. However, the world is too messy even for what I've described.</p><p>I haven't talked about flaky (randomly failing) tests at all, though a good automated testing system should be able to cope with them. Re-running a test in the same configuration is not black magic and can be used to confirm that a test is flaky. If we wanted to get fancy, we could even estimate the failure probability and treat a significant increase of the failure rate as a regression!</p><p>Along similar lines, there can be state leakage between test cases that causes failures only when test cases are run in a specific order, or when specific test cases are run in parallel. This would manifest as flaky tests, and so flaky test detection ought to try to help tease out these scenarios. That is admittedly difficult and will probably never be entirely reliable. Luckily, it doesn't happen often.<br /></p><p>Sometimes, there are test cases that can leave a test system in such a broken state that it has to be rebooted. This is not entirely unusual in very early bringup of a driver for new hardware, when even the device's firmware may still be unstable. An automated test system can and should treat this case just like one would treat a crashing test process: Detect the failure, perhaps using some timer-based watchdog, force a reboot, possibly using a remote-controlled power switch, and resume with the next test case. 
But if a decent fraction of your test suite is affected, the resulting experience isn't fun and there may not be anything your team can do about it in the short term. So that's an edge case where manual exclusion of tests seems legitimate.<br /></p><p>So no, testing perfection isn't attainable for many kinds of software projects. But even relative to what feels like it should be realistically attainable, the state of the art is depressing.<br /></p>Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-71352250076900098802022-03-09T17:00:00.001+01:002022-03-09T17:00:00.226+01:00A New Type of Convergence Control Intrinsic?<p><i>Subgroup operations</i> or <i>wave intrinsics</i>, such as reducing a
value across the threads of a shader subgroup or wave, were introduced in GPU programming languages a while ago. They communicate
with other threads of the same wave, for example to exchange the input values of
a reduction, but not necessarily with all of them if there is divergent control
flow.</p><p>In LLVM, we call such operations <i>convergent</i>. Unfortunately, LLVM does not define how the set of communicating threads in convergent
operations -- the set of <i>converged</i> threads -- is affected by control
flow.</p><p>If you're used to thinking in terms of structured control flow, this may
seem trivial. Obviously, there is a tree of control flow constructs: loops,
if-statements, and perhaps a few others depending on the language. Two threads
are converged in the body of a child construct if and only if both execute that
body and they are converged in the parent. Throw in some simple and
intuitive rules about loop counters and early exits (nested return, break and
continue, that sort of thing) and
you're done.</p><p>In an unstructured control flow graph, the answer is not obvious at all. I gave
<a href="https://youtu.be/_Z5DuiVCFAw">a presentation</a> at the 2020 LLVM
Developers' Meeting that explains some of the challenges as well as a solution
proposal that involves adding <i>convergence control tokens</i> to the IR.</p><p>Very briefly, convergent operations in the proposal use a token variable that is defined by a
<i>convergence control intrinsic</i>. Two dynamic instances of the same
static convergent operation from two different threads are converged if and only
if the dynamic instances of the control intrinsic producing the used token
values were converged.</p><p>(The <a href="https://github.com/nhaehnle/llvm-project/blob/f1c9f9ff14c0c436259a47303c81f53f7304d7d6/llvm/docs/ConvergentOperations.rst">published draft</a> of the proposal
talks of multiple threads executing the same dynamic instance. I have since been
convinced that it's easier to teach this matter if we instead always give every
thread its own dynamic instances and talk about a convergence equivalence
relation between dynamic instances. This doesn't change the resulting
semantics.)</p><p>The draft has three such control intrinsics: anchor, entry, and (loop) heart.
Of particular interest here is the heart. For the most common and intuitive use cases, a
heart intrinsic is placed in the header of natural loops. The token it defines
is used by convergent operations in the loop. The heart intrinsic itself also uses
a token that is defined outside the loop: either by another heart in the case of
nested loops, or by an anchor or entry. The heart combines two intuitive behaviors:
</p><ul>
<li>It uses a token in much the same way that convergent operations do: two
threads are converged for their first execution of the heart if and only if they
were converged at the intrinsic that defined the used token.
</li><li>Two threads are converged at subsequent executions of the heart if and only if they were converged for the first execution and they are currently at the same loop iteration, where iterations are counted by a virtual loop counter that is incremented at the heart.
</li></ul><p>
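In the notation of the published draft, this typical pattern (a heart in the header of a natural loop, using a token defined outside the loop) looks roughly as follows. This is only a sketch: the intrinsic and operand bundle names are taken from the draft and may have changed since, and <span style="font-family: courier;">@subgroup.reduce.add</span> stands in for an arbitrary convergent operation.</p>

```llvm
declare token @llvm.experimental.convergence.anchor()
declare token @llvm.experimental.convergence.loop()
declare i32 @subgroup.reduce.add(i32) convergent

define void @example(i32 %n, i32 %v) convergent {
entry:
  %outer = call token @llvm.experimental.convergence.anchor()
  br label %loop

loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  ; The heart: inherits convergence from %outer on the first iteration
  ; and counts iterations after that.
  %heart = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
  ; Convergent operation: communicates among the threads that are
  ; converged at this heart, i.e. threads on the same iteration.
  %sum = call i32 @subgroup.reduce.add(i32 %v) [ "convergencectrl"(token %heart) ]
  %i.next = add i32 %i, 1
  %cond = icmp slt i32 %i.next, %n
  br i1 %cond, label %loop, label %exit

exit:
  ret void
}
```

<p>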
Viewed from this angle, how about we define a weaker version of these rules
that lies somewhere between an anchor and a loop heart? We could call it a
<i>"light heart"</i>, though I will stick with <i>"iterating anchor"</i>.
The iterating anchor defines a token but has no arguments. Like for the anchor,
the set of converged threads is implementation-defined -- when the iterating
anchor is first encountered. When threads encounter the iterating anchor again
<i>without leaving the dominance region of its containing basic block</i>,
they are converged if and only if they were converged during their previous
encounter of the iterating anchor.</p><p>The notion of an iterating anchor came up when discussing the
convergence behaviors that can be guaranteed for natural loops. Is it possible to
guarantee that natural loops always behave in the natural way -- according to
their loop counter -- when it comes to convergence?</p><p>Naively, this should be possible: just put hearts into loop headers!
Unfortunately, that's not so straightforward when multiple natural loops are
contained in an irreducible loop:<example picture=""> </example></p><p><example picture=""><img alt="" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAARkAAAC5CAYAAAAce0tNAAAgAElEQVR4Xu1dC3RV1Zn+sZ3BdonQUQSERYKlIHZBEosQOoUQSwVaIKAMCfIIEUhAigFTHhEJCaghtEgAMUShIVKEoBbCoyEIQvDBY5yB6KwKnbYkDlToWCtDB3XqWpn9Hdzh3pube97nnnPuf9bK4pH9Ot/e+zv/Y+//b9MkHuKHEWAEGAGbEGjDJGMTsh5v9ujRo3T27Fm6dOlS85vU1dUpf+/QoQN17txZ+X3ok5ycTCdOnGgul5CQ0Fzk7rvvJvwkJiZ6HB0evh4EmGT0oOXDsiARkAVI5dy5c/T5558rbwlyAGHcfPPNyr8D/64HBtk+6si/g6xAVPjp3bu30jaIB//mx38IMMn4b04jvtGZM2eoqqpKkTakRIKNjk1ulEiMQhiO4EByKSkpNHbsWEXq4cf7CDDJeH8OVd8AUgqIpaGhQdm4mZmZrlZZDhw4QNXV1YqEBcLJyMhgwlGdZfcWYJJx79yYGplfNioIEoQDCQzSltsJ0tSk+bQyk4yPJhZf/pKSEkViSUtLoxEjRvhKAgDRVFZWKoQD1S43N1cxQPPjbgSYZNw9P5pGh699eXm5suGw8eLj4zXV83IhEM3atWuVV1i2bFlMvLNX54tJxqszJ8a9e/du5csOT9C8efNi0jsDqQ3S26effkqLFi1yta3Jw0vN1NCZZEzB53xleF927NihkEt6ejpNmzat2c3s/Gjc0yM8VRs3bqT6+nrKyclRVEV+3IEAk4w75kHTKLZs2aJ4ibCJ4OLlpyUCIGGQTW1traJGwXbDT3QRYJKJLv6aeseZFqgEMOZCcuFHHQGQTVFRkXIAsLi4mA3E6pDZVoJJxjZozTeMjTJ//nzF1gJ7A5+I1Y8pbDb5+fk0cOBAxW7Fj/MIMMk4j7mmHiG9YHPgK8wivybIIhaSKlRZWRlLNebh1NUCk4wuuJwpDHLBFxgbgqUX6zDHOaLZs2crbn62aVmHq1pLTDJqCDn4e6hHWVlZylH6WbNmOdhz7HQFjEE00u0fO28evTdlkoke9kE945zHxIkTFdvL0KFDXTIq/w6jtLRUuXUOaZEfexFgkrEXX02tQzWCBIMFzzePNUFmSSEcZoSrm4nGEjhbbYRJxl58VVuXBFNRUcFH41XRsr6AvEjKRGM9trJFJhn7sFVtGWc4oCIxwahCZWsBJhpb4SUmGXvxjdg6CAaeDnZRR3ESvuoaNho8fJbG+rlgkrEe07AtwrAb6I7GuQ14OnhROzQBGrqRhneOQawBLB1FmGR0gGWm6KBBg5TquNSIRYwwBbt27TLTJNe1GAF8CGCA3759O186tRBbJhkLwYzUFM5mQHqRD7xIOBAmScehYXA3KgjA4wRjPEuY1i0VJhnrsIzYEnR+3EMK96xZs4YXtUPzoKWbcePGsTSjBSiNZZhkNAJlthg8GCNHjmzRDMIRFBYWKv+PCHf4aTx/XnxNzweVjY/vQXE9eiiqFh+JjzwbZnHEXEGa4VPXZlf99fpMMtbgqNoK3NVdunQJKodFjMDY5eIQ3m4RLDvhu30oZUASxXXrSvHd7gwq23DhT9R44SLVn/091R49RmPHpFG68E4x4VyHCRdKrcIRBnlIMzU1NarzygXUEWCSUcfIshLf+ta3lDCReMaPH09f/t8XdOniBZo58SFKe+B+6nBrO019ff7FF1R98A3auf91unLtc1qzdl3Mhp2ExDE/9zHLcYSnCWosByrXtCQjFmKSMY+h5hbgYcIXN1HE5P382v/S0wvmKuRi5qk78a+U99QvKPHe71FZ+Qsx5RUpXFZAVcITZAeOMNJDNeV7ZGZWJ6tL5t
HT2UJ2djYdev11mjLuJ7RwVhbd3LatzhZaL175ajVtrT5A23e+4vuvr3JbPXMq9ep6h204Zj4yXQGbIxGaX6IsyZjHUFMLEOvHjB5Fi3OmUfpoe4Jcnzz9Hs0pKKaKl7b6Vn0CjhPTJ9DcKRNsxfGRhcvo4clTqKCgQNP8cqHWEWCScWB14MubmjKENhQupIR77M3vfOm/P6ZRj/yUdu3Z57sLl4yjA4vVhi6YZGwANbTJcWljaMqoYabtL1qHWv/bszSncBUdqTvmKxsN46h1BbirHJOMzfMB42TT1U+oINfZSHfwPm3dd4h2Ve+x+Q2daZ5xdAZnO3phkrED1a/aRCrV2TOn06FtwutjoZFX65AfmvU4Tct51PNnaRhHrTPuznJMMjbOi9PifeirQG2avng5na5/z8a3tL9pxtF+jO3sgUnGJnRxtL1wST4dfvlFm3rQ1uycpc9Q4qDBNEtc0PTiwzh6cdaCx8wkY9McThQhHUYP7i/crC3vK9nUZdhm4dbOKy6l4ydPOdmtZX0xjpZBGbWGmGRsgB6u1i6dO9OFU4dN2WLGzphL+w4dpS8b3jc1yp5Dfqx4muLj402143Rlszh+Pb5v0JDf+vWvKPneBMOv4VUcDb+wRRWZZCwCMrAZ3OItXVVM+ys2GG79w4sf0WPLnqGune6g7MkTKKFPb8NtzVn6NCV+3325nHCwLhLxmcURJCMJ+s8ff0J39k8xRdhuxdHwwnCoIpOMDUDPnpVDfbt3opxJEwy3XrX3+g3gO277J/pD4wWaIS5RGn3c6s7u8VXoCoS7CBfy0iyOgSQD7EL/rRdPt+Ko9z2cLs8kYwPiVnhDoCqteiJPUbfu+ucHTH2B3WqXAclAmsGDkBWhZGMWx0BSgWToVxxtWMKWNskkYymc1xsbNHAArc6fRwOT+hlqPXRDgHCK8n5qWGVCLJphD8+k842NhsZjV6VAkpF9BJKNWRxDbTKjhg2ldUVPUPeuwXF9tL6fW3HUOv5olWOSsRh5xIvp07s3/ealMup3dy9Dre87fJQu/fkvzSoSVKerf7tmSmXChnNb2AKEvYBxN9yDgF41+/fT4e2bWgTw0gpqqHr0uz820MJnVtPuTeu1NtGiHNo8LyIXes2IbviFLajIJGMQxLNnzxKi3eEcx+XLlwn/lpvmpptuov98s4biugZHt9Pa1ZwlK6h8286g4vgKm90cTU1NWofgSLlwkgyCRCEfOEimR3wcvbuvijp3vN3QeMLZYMzaZSRZI6AVp07RNi1MMtpwCioFO0KfPn1a/Qp3vbML7dm0ztCNa+kF+cv7x6l9u1ua+8Xi/uPbBw2J+riZ3X9UOn106bKBt7WvSiDJBJLLzTffrHSalNCPNq8sMIQj6ocSivTYGSVrieMHZ88pmT9zcnI8f2XDvtm90TKTjEGUt2zZouToCX3wdet8R0eaO+lBGp7yA92th6pKsoFN21+jdrd809DhPrdeLwDJQF2SkoskF/nOI4c/YBhHSTKhE2DmrEwgjhg30twkiCiHnD4l8jJnktFNAzcqINg08vQEPkjYVi1+BvfrRZnj00y0bl3V2rq3aP22X1NN7UHrGrWgpR07diiSQCi5yKazRJB1t+O4cuVKahQG9TIRDJ6f8AgwyRhcGbgZjDxK0jaDZiDFnD59miDlHK3ZS5tXFRls3dpqy0vLqE37js2pV6xt3b7WvIIjyLKqqooqKiqCUhHbh4y3WmaSMTBf+HrV19crXy+QTWpqqtIKpBh8meFhgtHy4/q3DbRufRXYYyq2bvOcodJLOMLon5+frxANe56C1zCTjI49DW8SpJfhw4cHBZjG/4Fsjhw50twawm0ufTSLUpLv09GD9UW9frbDSzhCqoWdDh8f9jzdWMtMMhr3NWwv5eXlSi4e5LEOfGAEhMcp8P+RUuPMO3W0YcUSjT3YU6ykbDNdpbZULKQvLz5ewxHSFzxPubm5NGKEPQHjvTaPTDIqMya9CHCxFh
cX65pfsy5YXZ2FKQyX67BJ2UqYhw4dOphtLmr1vYajkrJFSDQpKe67lBqNSWSSiYA69OyioiLlTk1ycrLu+YH0s6X8eXpt47O661pRIe+pn1P8PYk0T6hzXn68iiNsNCAcSL+x/DDJtDL7MO6eO3dOWSBmpIBo2RRgi0FqFITebM1F7KWF71Uc4SGrra1VDMJ+mAcja4ZJJgQ1GHch6maKMxoZGRlGMA2qAx09dchgerVsteE7OHoH8en/XFUIBh6lUPuR3rbcUt7LOOLqCSTi7SKlbizm1maSCdhF+OrI8w5WLgbF6zBlEu375XPU4dZ2tu9bJUtB9mwaKw4L+unxMo4YO04Ix+KdJyYZsQvxlYT+HBcXR4sXL7ZlXx6oqaGigifpV6XFtkk0kGCmi/SqYydk0LSsR2x5j2g36mUcsc5wShzXKGLJ8xTzJGPWuKtn0yn5g7Jn0OonHjcca6a1/mCDmbGokHLzFvhOggl9Zy/jCEMwXNxpaWlBZ630rCOvlY1pkoH0AhsMDk85ZZRDf4j4ltC7Jy19LNtwGAO50CC9rKvYRnsO1VHFS1tj5hCY13HEAU6sOb3HIrxGMBhvTJIM9GNMcjQPTG0UxFayspimPjRG+Ynvpi/2DM7AVO09QOu3vEyZ07IUN7UZL5gXFy/G7GUccdCwrk58HHzueYo5kpEuRRjgrDTuGtmkEJ1XigN+lcLg3OHWW2iMCEyVktyf4rp1bUE6UIcaL1ykMyIr5Cv7X6cG8Xd4v5YVFsUkuQTi7WUckZFh7dq1iufJrx+JmCEZGN1g3R84cKAr43/AzoBDZ3VH3hBXFBqp4cMPg3grvnt3cfEujhKT7qV0QS5GDgcaIUKv1fEijvJGP9R2vxw5CFw3MUEy+FqUlJQothevTSK+0gjuBBsExo4fBErCBTz5b68RgVPjBXZQizt16uT6MBeYXxiEcbrcbbGYzc6Xr0kGiwyHoPBg8pwy7pqdlND6paWlymYJ9+CdPvroI9+K2kaxlHY3uIu9smn96nnyLcm4wbhrdIOE1guUZkJ/B+kMQbf5uYEA7G7V1dWK5Bptu5uReYFaD/uMXzxPviQZfPml1d4vxrRw0gwCZCFQFj/XEZA35nuLlDR2Hap0CmvMtwyM5lUJXGLlK5KBXouvgB8POoVKM/hCg2TgJfP6IrRi48pj+/j6+8UoLmMYed3z5BuSgXEXQaWwyLxm3NW6yaQ0A1KRUfhg0PbzO2vBRqpHfoyxKz1PXg7r6XmSkR4EP+mwrW0sKc3AmCnTcMA1D6MwAiRNmzZNy570TRk/qUeRJgVRFxEZwKtSmqdJRrK8V8E3sttx1yqcOoDTo9DhY0V98qN6FGk94GMCoklPT7ckBImRtWe0jmdJRua7AcH4xbhrdBJlPZCujOTn50DWflaP1NYAbI52RgtQ69/I7z1HMn427hqZwNA6MmwFDuz5zbUtVUM/eI/MzLWM2ujkxV4z4/UUyUhrO8Dl3DaRpx1fe7jxzYYPNbO4rKwLKQ235o3GW7ZyLG5oCwnlKisrPXHnyRMkw18wY8sadgtsTBiKvezWlbeV8XFh1fjGWmgtoRwCY7npIKLrSQZAwk3r9Y1ijCbM15LeN6gYXksMLz8unNS+9XUgPU8yrCdsNiBl2CrdciDRMMkgODJ+Gs+fF7eGzwehEB/fg+LEpT4YH3FgzOhjVcYAo/37qZ7X1CdWj7SvPplQDlIe1Cg8OKyJO21ueHSRDKSKciGy7hb3QhK+24dSBiRFjH1Sf/b3VHv0GI0dk0bp4oapVsIBO4ORc3JyNNdxA5huH4NX1CdWj/SvJJALbnEHPjI3u1prdgsMmkgGm35+7mN06eIFmjnxIUp74H7NUfc//+ILqj74Bu0UgZauXBOJrtauixgi0k1BpdQmx4u/l+qTG92gbo/549b5xsc/NTVVubsV+CBYeY0IYB/uAbFUioh8dgsM6FuVZAqXFVCViNr19IK5CrmYeepO/CvlPf
ULSrz3e1RW/kLQnRs27ppBVn9dfPnkTWU3GFNlQHfYEvx8xkf/TEWuAQEgKSlJybgR7vnggw+CrtkoB1iFwND05d9p6oOjDAsMjX+6RMuKlmvSNFolGSWfb+ZU6tX1Dlo4K4tubtvWMnwqX62mrdUHaPvOVxTdUVrJYzEnjWWgGmhIkVDFlYRoG9XljWO/uNsNTIWpKiAOSCY4soA/AwkHxn7gqkiwglxOvPM2rX7yZyLM632m+qwXYWCXr3tB0U52Ve+J6PULSzJYfBPTJ9DcKRMoffQIU4NprfLJ0+/RnIJiypg0WYn6hi8Y3ya2BeqIjWLxwc3dvn17x6PHSfWIE9NbO++BpCP/Pm3KZPqXkT+knIfHW9qZop08/WzETBktSAaLDnmHNxQupIR77rZ0QKGNIeI+0qnu2rOPD9fZirR64zjoiMNdTp2vYPVIfU6sKAGSyZo6hTYsz7c815ccH4LcT54nDkouf4pGjBzZYtgtSAY5gaaMGmba/qIVIIhdcwpX0ZG6YyzJaAXNpnJSfUKqGDtDVrJ6ZNMEhjSraCT/Ml5kLX1Gd8odvSOEg+eh2Xm0eGlhi7UTRDIw8jZd/YQKcp0N5wjv09Z9hxTdjp/oIwD1qa2wwRUWFlo6GHmTePjw4b67V2UpUBY05qRGIoeLRIPDJs1soZk0k4yS+nPmdDq0TXh9LDTyasVLSRKf86gma7XWNrmccQRkPiAES7IiTq407nsxY4RxFKNXEzbV0YMH2GZTbe3Nzv7hPE2ev4SOnzzVrJk0k4zTalLoIKE2TV+8nE7Xvxe9meGegxCAQR4xTMxm2vTarWGvLwPY17aUP0+vbXw2Kq+S99TPKf6eRCWrKR6FZOD2KlyST4dffjEqg5Kdzln6DCUOGkyzxGlfftyDANQnPHqj50v1yI8xl90zO8EjiYaaFIrFdYfOXKo5+LoiBSskM1FE2xo9uL8QrVpahp0EE27tvOJSRdTix10I4EOEdKo4c6ElzAarR9GZv1IxPw2/PSPOwiyIzgC+6nXtL39F/3XlM1pTupbafPbZZ01dBNtcOHU4KraYUCR6Dvmx4mnSspCjimIMdi4DhmVmZka0nbF6FL3FMWjgAFqdP882d7XWN4M0039UOn106TK1EXcbmkpXFdP+ig1a6weV+3p83xb1vmx431BbqDRn6dOU+P0U9j4YRtD+ivA6XblypcUBSo5aaD/2kXoA/kkiIiIEBj3P7/7YQPfcP5r+8v5xat/uFqVq/QfnaNnq56hyTXHz/+lpE2V/8NBUWvPc89RmVk52U9/unShn0gS9bSjlQTJmSCW0U3ZnG5oGxytJ9UmmY8G/EfcH6pRfU9I4DrLODnF7/cw7dbRhxRKdNYk2bX+N2t3yTcVkcuXq3yhzfj6teiKPet0Vr7stWaGkbDNdpbbUZuyY0U1mDt9ZTTJslzE8p45XlNcC/v73v1PHjh1jJlOC40Br7DBLqLGD+/WizPFpGmvcKAZiua3vIPrj2wfp+L+foat/u0YzRMQFMw+uHKx4voLaJA+4r8mMDmc1yeCI8rCHZ9L5xkYz78d1HUBAqke4c4ab3LGSjsUBaA11MXL4AzR30oM0POUHhuqf+Pd62vraHirftjNIdTLUmKiEMzPjZ/+M2sR37950SLiu47vdaaitUJvME3OzaXneXENtyUpoUzi9TLXBle1FIFQ9kneRWF2yF/dIrScl9KPNKwtM3TnE3ttYXGhaisE4cQK45+CR1KZzpzua3t1XRZ073m4IHaslGQyCScbQVDhWCYbfy5cvt5BcpPqEawOxls3SMfAjdNSlcycys5f3HT5KNW+8qUgyf3q3ju64/Z9Mvxb2cpvEfn2bzLCf1SQT6Poy/YbcgKUISPVILYshLkCeO3eO1SdL0VdvrI8IFv9q2S/o7m/3UC8cUuLPH39Cd/ZPUcjllDh1f+rM+6Y1ElyavD3hn6nNiAd+1GRGj7OaZPh6ge714UgFeZdJqzrE6p
Mj0xLUCUK0LH00y1BAqjUvVtJ37oqjUT8cqrQ5Z8kKyp48gRL69Db8ItK+2mba1KlNRi3SUrUJHYUZl3Zt3Vu0ftuvqab2oOGX44rWIoBrBTiurjewGAelsnYe1Foz6l2SZ2J2b1rf3EW4szNq/Yf+vtm7JG7ZNh2t2UubVxXpbcOW8stLy6hN+46WhxmwZbA+b1RekDSbNQLnN06ePOmbbJZunXbYypqu/DcVzHPH3T+E2X3zvd9Rm7/+9a9NPeLj6OP6t12BHY4iV2zdxsGkozwbUj2yKiUwQokUFRUpaWY5ULg9k4uUN+PGjKL/eH23PR3obPUnWXNo3sL86xckzehyOvuNWJzPyFiJpvG2jKpHaj1yRkg1hMz/vkdcHJk5kmJ+BNdbgNG324AfirtLl66TjJnjyFYNCu3IY8jFK1da2Sy3pREBq9Qjte44eZsaQsZ/n794MbWjL2jR7OnGG7GgZtXeGtr75ru0varqRt4lKw7ymBkbXNfDJmUrYR7ckAfIzLt4sa7V6pEaBhDtkY4F6lNycrJacf69RgTwoUgdMpjeeu0lzQkYNTatq1ig2aM5Mp7bomnpeiMubBgBmRIFVwP0BqUy3OlXFd2czdLsu0WzfrRjypRve4Xeb7wkEjiWKzAEBRKPlm0GthikRkHoTc695NzylDnHzYbXNDtipCZGNkvEE2Yp1iyawh7iYFqj0NHiKgFCPBw59mZzbOggkoFhDqLWq2WrDd9l0gsRBgWCgUeJQwToRc94eUiu5eJLY1WgcOMjuV6T1SezCAbXxwcEnqZD2150VG36idjL8362KCj/Uou8S5jsrCmTaN8vn3NkcEqWguzZNHbcOGtR5tbCIhBN9UhtSlh9UkNI3+9xiXXlikJ6TQgNTmQgQZ77+Hv6iQDijwcNNGya2gM1NVRU8KRIClVsm0QDCWb6wmXU93sDaNLkySzF6Fs/hkq7RT1SG/yOHTuoSnglnMpmqTYer/4epF0oDOvvHDtKu14otU1ogLs676nV1Dn+21S4fEULuMKSDEopeZiyZ9DqJx63PF4obDAzFhVSbt4CGpqaquRihi4OTwPbZOxZ0m5Tj9TeUqpPixYtsjWbpdo4vPJ77Fdghp+6ujrlT3iacPAR983mP/ZT2lxSaCoMRDgs4BWeJFLUZkyaIrKMPBoWrlZJBqUxSORjSujdk5Y+lm04HITsGdLLuopttOdQXYsE3TK6PRbViBEjvDK3rh+nVEE6derkuasaUrVr376958bu1MLAvhk0aFCr3e3atUsJ+q6krBUJ31LuS6QFOVmmpRpIL/Aibd21n8peeDHiMYSIJCNHvrGsjEpWFtPUh8YoP3oDXIHtqvYeoPVbXqbMaVlK0qfWvAiIdF9fX6+wrxWZC52abDf24xdpwGtSmNNrYbbIU4YDjqEPpJjTp083/zdIu3TNs1Quys6d9rASprPDre10DRfkgr28Yl25YkddtDhfdZ9qIhmMAgNcWVxMlcLd2OHWW2jMsKHiSnl/iuvWtQXpQB1qvHCRzoiskK/sf50axN8zMjJoWWGRJhclWBcq1MCBA2nevHm6QODC1xHwm13DK/akaKw/7M2kpCRFRQp8pBQTOiZ4kYsKl9GWLZV0d8+7aPSwFEpO6qe6lw++eYLqjp+kjAnpVCyCxmsVAjSTTOBAof/h61J35A0hhjVSw4cfBr2HCOkp8ibFUWLSvZQuyMXoiU70UVlZSVChjLYRjUmPZp9eVo/UcHOzZ0xt7Hb+XhIwvEnACE+oFNNa/1C3cEbpxDtvq+7l4cKMYcSUYYhk7AQsLOuK27swCINs+LBW6+j7RT1SW19OX4FQG080fw+JFR9inHcCLshdjqc1KSYaY3U9yUhQwLjI64PYJkbYNBrg2tEnpMhwoRL8ph6pYefUZU61cUTr95BYYIuJE7euEUdGPiAZrJFAW0y0xij79QzJyAHHsmEYC6tHjx7KHSMZqFsutt4ivuticQM31h67wl
K4GUdIrCAYHPkYOnRo0FCxHkAybjIveI5kgKgMaA3DcCxtLAToxs1lqI7yS4XFBtJx06JyeoNCTcAVCXgk/Z5DHV6k2tpaT93z8iTJyEUsdfNYCBcgpRgQLB5sJqhNfKnw+mrQmknBaQK0qj8Z8MuLEqunSQYTiM2HsI74E2TjV8OwlGICFy1UJpAMPzcQaC0nlJcxkgdVvSqxep5k5OKRMWTVcgJ5cbGFSjGB7wCS4URqwbMamt3Si3MeaIOUOay8+gH1DcnIScEXH5Hxwfp+0c/DSTHyfaV9hsNkBFMJ1CfYr7yazRLqETxFGP+sWbO8zJPBQas8/SYBg8cCg9fBi/pr6ByESjHwJsAWg3cDsYR6F/wyh2beA5IMfhrPn6djbx6ja9c+o17f6Uk33XST+PD0oDjhoQOGuNPjxgdjhwkAt9D98PHwnSQTuGi8ZhgO3BwNDeeVV7l27Rpd++wz+k7P71A/sTHcvDmiuWFhtygXm3K3OL2a8N0+lDIgqfmY/G//8w/UM747/eM//IO44nL9ykv92d9T7dFjNHZMGqVPnOgawoFNqbGxUSEYv0Qk8DXJBBqGId3Axek2vTbS5gjctG7fHNEiGBypn5/7GF26eIFmTnyI0h64X/OlP1z2qz74Bu0U9+uuXPuc1qxdF7WcUPJwYWZmpnLPz0+P70lGThYOMEGFSktLc4Wh1C+bI5qboXBZAVVt305PL5irkIuZBylVEdkt8d7viQDYLzgqRejNM27mPaNRN2ZIRoKLoNU4zAR3d7T0Xb9sjmgsWCmdZmVOpV5d76CFs7IsDS2J1Kpbqw/Q9p2vaL5lbBQHeeET9fXmGTfaZzTqxRzJAGSIpjCsIZATTgw7pftiUflhc0Rjoco+ZfCluVMmUPpoe4KbnTz9Hs0pKG4RWM3K95aXWaOdKcLKd2qtrZgkGQmGPI6OibbbS+OXzeHEomytDydTfSDQGrJo7Nqzz/KjEJCmEcPYLZki7J7TmCYZKXrj0uXly5cVkdUOw7BfNofdi1GtfYSCnTJqmGn7i1o/8vf1IujanMJVdKTumCXSbms3p7WOx6vlYp5knDAMe31zuGFxw47VdPUTKsh19oyJGO4AAArESURBVGAavE9b9x2iXdV7TMGAE+kyLa/dUrOpgdpQmUkmBFQpysLdbYVh2Oubw4Y1p7tJJXPGzOkiUZnw+rRtq7u+2QpKbrCcRw2fpfHizWmzmAXWZ5IJgyaOdFuRpsXrm8PKhWamLaclwdCxQm2avni5kkZZz4N1hFAcCQkJMRWSJBQjJpkIqwYH5eCFMuoB8Orm0LOR7C6LU9CFS/Lp8Msv2t1VxPbnLH2GEgcNFrmFZmsah9dvTmt6SY2FmGRUgFLSSIhLl7gJC8NwuAjt8ByFXsb06ubQuG4cKzYxPZ1GD+4v3NUjHeszXEdwa+cVl9Lxk6dUxyGjN+JqgB2OBNUBuKwAk4zGCcG5Bkg1KSkpQbdi5QVGLKjAC3de3BwaoXCsGLDt0rkzXTh1OCq2mNAX7Tnkx4qnqbXb/TJwlh9uTls5yUwyOtGUAbtxYhiXFeExgKSDhffBBx8ork4rNseHFz+ikuc3iSx9O5URjhJ5rnZvWq9ztNeLq20OQ406UAnnmEpXFdP+ig26e/t6fN+gOjmTJtCiR2dQ965ddLclK8xZ+jQlfj/4IyN/J29Ow2EQLtC74U59UJFJxsAkSsMwyATeKPng9DBUKjObo3lBL1lBU0S2zuR7E5T/unL1b9S+3S0GRksUaXMYatCiSvjyR0oQNntWDvXt3olAEHofkMyXDe83Vzvyzkk6cvwULc+bq7ep5vKtubPhJMC7+OnmtGGQwlRkkjGIJggGOYjhQZIPpJjjx4+LNKBlhjeHbCt0kxgcplLNqrMeZsYQrm5qaqpis5BSYWgZM4bzcPiZxTTULiNvTiMaI0cnbH11MMkY3Dkw7uELFvrgoBXyC5s9mWp2QwSOS4/R0iAchqqBZKBm4I
E9K5RsBg0cQKvz59FAkUJV72MHySDcxrCHZ9J5Ee8F2U3Xrl3rm8BSevHVU55JRg9aX5WFN6lPnz7NKUFDm/j2XT3opdVPGdocdkgygZvDwOvaViWQZGQngWTTQyQuOyRc1/Hd7tQ9BjtIBoNAuzI/u59vTusGPEIFJhkDaIJkoCZJVam+vp5gp4EKhfMRX/va1+itXdvovn7fNdD69SpWSjKyPUgKbnqQXhVYhnugftT8Zj/92/6d1Lnj7bqHbSfJuCkFrG5golCBScYG0Lt07kTv7qsytDnskGQkyRw5csSSt4UnzYog7eEkGdhocPgR0kJqyhDavLKAEu65W/e47SAZ3MzuPyqdPrp0Wfd4YrkCk4wNs5+U0M/w5pDDmWOhd8mtmyOQZALJRR5gGzn8AZo76UEanvID3bNkh3fJ6PUC3YP3WQUmGRsm1MzmkMOx8pyMWzcHSAbqJVJ+hEvMlyXi3Q7u14syx6fpniU7zsnU1r1F67f9mmpqD+oeTyxXYJKxYfbNbA4bhkNu3Rw4YzRixIhWz8rg90dr9tLmVUV2wKK7zeWlZdSmfUdCRgF+tCPAJKMdK80leXNohipiQRjTe8TH0cf1b1vToMlWYI+p2LqNT/TqxJFJRidgWorz5tCCkrYyMP4ufTSLUpLv01bBplJuPQZg0+ta2iyTjKVw3miMN4c1wCLg05l36mjDiiXWNGiwlZKyzXSV2lKxOITJjz4EmGT04aW5NG8OzVCpFrTCW6faSYQC8M4Nm5SthHng0A36kWSS0Y+Z5hq8OTRDFbEgjvBvKX+eXtv4rDUN6mwl76mfU/w9iTRP3LjnRz8CTDL6MdNcgzeHZqhUC0ZL/YQtBqlREHrTqfxcqmB4rACTjM0TxpvDGoBhTE8dMpheLVtt6C6TkVF8+j9XFYKBR8mKoPJGxuCHOkwyNs8ibw7rAEZ0wqwpk2jfL59Tbrrb/ShZCrJn09hx4+zuytftM8k4ML28OawD+UBNDRUVPEm/Ki22TaKBBDN1/hJKHT6CfrZgoXWDj9GWmGQcmnhsjpzsGcIV+ySNTB1sS6/YHNMXLqOxEzJoWtYjtvThhkaVVDMCy9VPPG4qnEa4d4ENZsaiQhpy/zDqcuedQfGc3fDuXhwDk4yDs3b//ffTm8eO0aafr6DJD462tGe5OXLzFsSEeI+odIicl9C7Jy19LNvUjXdMBAh6XcU22nOojipe2qrcqUJo0MDg8JZOWAw1xiTj4GQj0BVUp6+LeDNjR/yISgsXWb45Yi2I9UaRJaJkZTFNFfGQ8aM3wBXOwFTtPUDrt7xMmdOyFDc1zsIgQHymuKAZa3jasR2YZOxANUybCGj1jW98o/k3t912G32j7T/SI+kPWro5HHodV3UDbFeKAO6V4kJlh1tvoTEis0NKcn+K69a1BelA4mu8cJHOiKyQr+x/nRrE3zMyMmhZYVHQQTvcELcq/o6rwIrCYJhkHAIddoSkpKSg3uJEeMmxaWlULQ6bWbU5HHod13YDnHE+qe7IGyLqXiM1fPhh0Fjju3cXAbfiKDHpXkoX5JKcnNziXSBtImofwmvyYx4BJhnzGGpqAfmaJk6c2KIsIszhiwlXt9nNoWkgXEgVAVaVVCHSVYBJRhdcxgsjswEyHIQ+0PkXLVqkiOz8RB8BxBwuKSlRshDwYw0CTDLW4KjayjhxoAuSCh6I6DhB2lpubdXGuIBtCGCeQDCRks7Z1rlPG2aScWhiEU0NxAKXKO7A8GJ2CHgd3SDYGNRWmfJER1UuGgEBJpkoLQ8YF6H714hDevxEHwGeD/vmgEnGPmxVW8aXE4fKkEObn+ghABf4yJEjafv27awm2TANTDI2gKqnSahR8DBxLmU9qFlXFgSTlZWlGN/54J11uAa2xCRjD666WmWi0QWXZYWZYCyDMmJDTDLO4KzaC9zbV65c4QNgqkhZUwAGXpxbgoePJRhrMG2tFSYZe/HV1TpsNLW1tVRRUcFR2HQhp6
8wTgXD6A5XNQej0oedkdJMMkZQs7HO0aNHqaioiDeATRjj5DWuDMDIy0HBbQI5pFkmGWdw1tULPE4wRqanp7NBWBdyrReGegTpBYfskBKX4/VaBKyGZphkNIAUrSKw05w7d47WrFnDX10Tk4DYMJAOQS7hLkSaaJqrakCASUYDSNEsgkNiuPeUJm5rs5tb30xAegG54AHBsHqkDz+rSjPJWIWkze3AKFxdXa14Q9hYqQ42kuvBiA5yYe+ROl52lmCSsRNdi9uWX2b8iahtQ4cOtbgHbzeHcy+SjIEP32x3x3wyybhjHnSNAiSDL/XJkycVson1OLTAo7S0VLFfwVge63joWkwOFGaScQBkO7vAl7uqqqrZZhNLXhPEflm7dq1yczo3N5fVIjsXmom2mWRMgOemqgcOHFDOf+AeFKQbP9ttEJcH9hY8uHOEd+bHvQgwybh3bgyNDN4oHDiDKgWigfrgdbctbC0gFhi+IamlpKQoKhF7iwwtEccrMck4DrlzHQYSDr72w4cP94y9AioQpDMQCx45diYW59aPVT0xyViFpMvbgf1CSgMgnIEDByqSjls8VCAV3CnCwTlIYXhwNoglFpcvLA3DY5LRAJLfioBwIOXIDQ0VBBJC7969FdUKP3YakHFtIpRQZP8gPZxrYYnFP6uOScY/c2nqTQIlifr6eoIdBP8HaadTp07NbQdKPrgHFGhgBnGgDh7UB4nJp7FR5EAS5AbygiQFQgOZuEWSMgUeV46IAJMMLxBVBLSQBwijffv2mshItUMu4CsE/h+B+/O/Kq7K2QAAAABJRU5ErkJggg==" /></example></p><p><example picture="">Hearts in A and C must refer to a token defined outside the loops; that is, a
token defined in E. The resulting program is ill-formed because it has a closed
path that goes through two hearts that use the same token, but the path does
not go through the definition of that token. This well-formedness rule exists
because the rules about heart semantics are unsatisfiable if the rule is broken.</example></p><p><example picture="">The underlying intuitive issue is that if the branch at E is divergent in a typical implementation, the
wave (or subgroup) must choose whether A or C is executed first. Neither choice works.
The heart in A indicates that (among the threads that are converged in E) all
threads that visit A (whether immediately or via C) must be converged during
their first visit of A. But if the wave executes A first, then threads which
branch directly from E to A cannot be converged with those that first branch to
C. The opposite conflict exists if the wave executes C first.</example></p><p><example picture="">If we replace the hearts in A and C by iterating anchors, this problem goes
away because the convergence during the initial visit of each block is
implementation-defined. In practice, it should fall out of which of the blocks
the implementation decides to execute first.</example></p><p><example picture="">So it seems that iterating anchors can fill a gap in the expressiveness of the
convergence control design. But are they really a sound addition? There are two
main questions:
</example></p><ul>
<li>Satisfiability: Can the constraints imposed by iterating anchors be
satisfied, or can they cause the sort of logical contradiction discussed for the example above? And if so, is there a simple static rule that prevents such
contradictions?
</li><li>Spooky action at a distance: Are there generic code transforms which change
semantics while changing a part of the code that is distant from the iterating
anchor?
</li></ul>
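As background for the satisfiability question, the well-formedness rule from the first example (a closed path through two hearts that use the same token must also pass through the token's definition) can be checked mechanically on a CFG. Here is a minimal Python sketch; the graph encoding and function names are mine, not part of any LLVM API:

```python
from itertools import combinations

def reachable(cfg, src, dst, banned):
    """True if a non-empty path src -> dst exists that avoids 'banned'."""
    stack, seen = [src], set()
    while stack:
        n = stack.pop()
        if n == banned or n in seen:
            continue
        seen.add(n)
        for succ in cfg.get(n, []):
            if succ == dst:
                return True
            stack.append(succ)
    return False

def ill_formed_heart_pairs(cfg, token_defs):
    """token_defs maps the block that defines a token to the blocks whose
    hearts use that token. A pair of hearts is ill-formed if some closed
    path runs through both of them but not through the token's definition."""
    bad = []
    for def_block, heart_blocks in token_defs.items():
        for a, b in combinations(heart_blocks, 2):
            if reachable(cfg, a, b, def_block) and reachable(cfg, b, a, def_block):
                bad.append((a, b))
    return bad

# The first example: E branches into an irreducible loop over A and C, each
# of which is also a natural (self-)loop; X is the exit block.
cfg = {"E": ["A", "C"], "A": ["A", "C", "X"], "C": ["C", "A", "X"], "X": []}
print(ill_formed_heart_pairs(cfg, {"E": ["A", "C"]}))  # [('A', 'C')]
```

If the only closed paths through both hearts also pass through E (say, every loop iteration returns to E), the check reports no violation.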
The second question is important because we want to add convergence control to
LLVM without having to audit and change the existing generic transforms. We certainly don't want to hurt compile-time performance by increasing the amount of code that generic transforms have to examine when making their
decisions.<p></p><h2 style="text-align: left;"><example picture="">Satisfiability </example></h2><p><example picture="">Consider the following simple CFG with an iterating anchor in A and a heart in B
that refers back to a token defined in E:</example></p><p><example picture=""><img alt="" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAASIAAABjCAYAAAA/zQ3LAAAYeklEQVR4Xu1dDXRWxZl+v7PrEbeAoJSf4iHBbeXHhQS1QuQnUK1Iy0+QlRiymkQkCVJMEApECkmskoQthSg2xGohUhqRUgjCBpQIafkL9RSiVciebSEeKGzXVru4Fc7pOdn7DJ1wc3O/L9+dO/d+93535pwcNLl3Zu7zzvvM+74z806oTSukSswQOHfuHOEnMTGR/Rw6dIj9oGRnZ7Pfbd68mT3Tq1cvKiwspEuXLlFGRgYlJyfTjBkzaOLEiR3ewzN49vjx43TlyhXq1q0bjRkzhv03foeCd/GMKgoBLyAQUkTknBg4EXClnz9/PiOR1NRURiifffYZrV+/nnUAZIIfTkxWyUL/HkgH5GMkIrP2QHp1dXV06tQpqq2tpf79+7M+gaSGDh3KCEwVhYDTCCgikoAwLI3y8nJqaWmh0aNHM5KBYu/atYsp80MPPeQr64MT2JkzZyg/P5+RY05ODiO3ZcuWdbDAQLL4PvxNFYWAKAKKiCwgB4sClgJcpcbGRqagBw8eZDVAeYNkPXA3DxYeJ1q4iyiwqtatW8csPmAEslJFIRAJAUVEYdCBgkGhoEhwqbg1ECSysas63CpsampixATr8I033mC44r/xryoKASCgiMgwDkpKSgiKg1m8rKxMjRLJCICcEJdqbW1tt5o2btzIrEkVQJcMto+qCzQRweqpqKggxEJ47MNHsoubrsKtxQ+ICAH7ffv2saA+XD5lNcWNmCN+SKCICG4WAsgIKldVVbHlbPwOboIq3kEAJASrCXE4uHRwi2E1gaSUrLwjJ5k9iXsiAtFgIGNmRXwC/52WliYTQ1WXwwhgwsAEAmLCqiT2V/G9Vw43rap3CYG4JSKs2GAVBxsCCwoK1Ezq0oByqxm4bzU1Nay5TZs2qe0DbgHvUDtxRUSwePbv38/cLrWvxaER49FqYTVh4sFOc1i8ate4RwUVplu+JyK+twckhMH46KOPKhLy1xiU1luMBb6JFKtwyn2TBq3jFfmWiEA8OJqA4xLY/auKQsCIADaewn3LyspicSVVvIuA74iIW0CY+dTRAu8OLC/1DNszsNqmX7jwUv9UX3yyoREuFw5iYqNhcXFxYI4M8M1/WC2CIiH+oXZ2W1NbPYYDBgxg+5MSEhLYZlUVR7SGpZNPe9oiAgGh8JPkQVFCfDeOlSDgytN8QKGwUgRSwiqR2ugXWS0iYQgc4doDXzynAttOUkx0dXuWiLCBDTEgfkYpus/x/1P8bBssPzPi5X/HjK4Ok5rLO1oMlyxZQhs2bKC8vDzm5qsSOwQ8SUQwnzFrBS3AiPjXzJkzaefOnRFnaTyHtBwgaeyTUuU6AiIYIt6IBQ/lqsVuJHmGiEA+RUVFzHcPqtsBEsL3R3OMAXiBjEBaSoGuK5AdDHGshGfKjJ1KBrNlTxARN6WD5obphxxO/cPVsnL8BAdFsTyNDZyqENnFkJO7cnvdH00xJSIIHoHCoM/o3BpEENpqQWZIWFBWCMxqG354XhaGCF4jLBB0PN2WecyICH55dXW1WgHSJI4VMgSnRVxSKA7ckfr6erfHjqfaUxh6ShyWOxMzIoIZvXz58sBbQ1iWh3sFt1S0YI8VYhtBncWdwBCWJvCMJl4nKjf13nUEXCUizN4YNEHZDxTNQMNBTZCQiDXE6w+6VeQEhnDPsCMbFyGo4jwCrhERT8sRbn+M85/qvRZ4hkg71hD/Kqw4BnHntVMYYrzirJoiInf0xjUiwk5WWEJq38t1wcI9hfkvY2MiZu/KysrAraApDN0hCqdbcZyIMLNgVSzoK2NmgpwyZYrUIDPqC9q+IoWh0xThTv2OEpHaARxeiE7
EIHAsBoQflB3pCkN3SMKNVhwlokWLFrFcMDJcDzfAcLMNLDfj5hCZriqIH/Xi6uggFLsYIsiPkAG/QIGfNzNiCMLD5lEsKMCVtrOwEAS5iHyjNCKCUI3ul9nvRDoZj+9MmjSp/ZZYWd+HFcnMzEzq2bMnUxx+iSHIDosE+BdKB8KKh8nBDoYgFxyRQbCbF+ADEscVU3Bx8TesyOHoBy8Y49jJHhSrU9bY7KoeaUSETXUQXHp6Ops14mWwdwWgyN9BBqWlpWwzp6wC6xOze58+feiRRx6hCRMmsKohE+Txxt94JkvIRmbbsr7BSj12MARhp6SksBQgxgKiwWn8OXPmsH/xrFkBYSEtsSpyEJBGRFAEbKzjBeYrBAlhqU1hHYUlcwMiP22OfzGLQ0GhPMZlZ70FgNnc70RkB0NMmtjZH67AWuzXrx8j8HAF4/vs2bNqEUYOD8m7chqBUvjWZgWHCLGLWpVrCMDch3kvIyEXCIbnbcJsjlkeLocxToTfYV8MCtw0LHv7uYhiCHxuuukmKZ9+8uTJuHBxpYBhsxJpFhFiEjB3jSUeZl+bGHd6necckl0vr0/2krZT/bRTryiGsBx79+5tp+n2d48dO6ZOCUhBUmLOajMBKxK6JiVYInCZUGDS42wZzzskugID1wIbGGHd4CpmfYH1EymvE+JGcKX9tLomE0Oeu9quDn366adSrFq7/YiH96VZRAADMw0ICUWR0PXhgZjNqFGjTMcL3LPTp09bWhKGWwIigpuBeAZcBH3BgU2QU7gzfdw9uXjxoqV2YzngZWII/BC8t1PMcLdTX9DflUpEcM3goikS6jysODbGvyBWZPVeNiwn8yMdfAVMn3MZJAWrJ1K9w4YNY4dt/ZSrWRaGkUgtWkLAwkBQsx1Ei5GV56QSEYLVmG39viJjBcBonzUL5sNiQZyBFxAMflq11Zhz5852qDoxcTAlDB7cnsWRXx7Ib/TQ7+ECSXH3L1z/EGPBIVk/7YeJBkN8bzQ4NjQ00OHDh6MVX4fnlDUkBFvEl4SJyEzYn3/+f9S9+5e0jXMdlUZ+t/1XIywXxCb0VySBhPD/1ZpVtEu7sSTpzmGUeu8oSrhtICXe9pUOH3nu/B+o9fwFaj7zX7T/0C9p/Nhx9JFGOOfPn+8EBr8HHrN2uAJXEXEkP1lE4TAEMcASt4LjieaPGI5tbW2WBhMI/+DBg4ELUkdD7lZTHeuBt0REVoXNlSZt+gxK1/zyoJuy+iV0WI8XL5ynS9rPvIxZNOPBb1Cvnj2iUoorV69Sbd1emre0mCaMG0uVL23otIwcaeWMLywgNuW3PV56DLElBHvVFhU8LYTj9r37KWPBkqgw5w/Fw9aHaD/YTX2Pioiw4iMqbChN3dvv0pt736G//PUKrat8MbB7LzCr4FgCAtT9+txKq5c+zQhItIydmUl3j7iTjv6mmZLvupuqql9p32CHgGy4VTG4dVhxMwa5Rfvh5nscQxzHyJyTQTu2b6cXvrtQGMeKqtdoRcX1jbiRvgUTaSQr000cnGwrFvreJRGVFK+ibdp2djvC5qA1Hv81LX7+B52UxklQvVQ3XCYcwQD5vFJeTN1uvNFW95pOvk/fzsqn3zbspv2NR2hL3T6qfXM7WwkLZxEhiI1ANWJLfrVQ2ZU/CYMo9e6RtDQ/xzaOC1Z8n6q3vhlRFnA74ErHezqbWOl7WCJiO3SzHqc7BvaVImy9lGt+XtdBaWxpo09exiyTkT6bpk8aR0vnPyGt17NyC+iz/71MO16ppJbfnaUFq8po0+tbmMXDFw1gRWA5HzIFQUGZ/JpsHzhOnzaVludlU/o0ebezzl3yPcK4NCsgIcSFZOyElyZ4yRXFWt9NiYgrzcLHZksVth47zOZcaeLhJHikcQEhT0qdQC+XLKWk4UOlDiG4vnOXrKSmk8302g+epyH/PJimPvEd6valHmwGx1I+4lHYJoB/gbVfk6c5iSOEYkZGw4cPpyNHjsQ1CXlB3zs
RkdPC1mvhpf/5hCnNzt17pOblkarpEiqbOWM6PTb1AeE4RjRdeG79j2hN1U/Yats3x99Hxz44Te8caGDHbjDQYAUVFBT4+uYUt3B8bv21CyvvuzuZ/hb6B2r81eG4dcm8ou+diMgNYesVq/mjM7SgZA0dbPxlXAobPnfb5T/TqoL8aPjE1jPtCwNv7aP7x42h/zjyHp1qfp+5ZAhc+2mp3giEmzhue6ue3v+ohV5YVsgWWrbsOUA763bbko1XX/aKvncgIjeFrRdMvAobO3jnz5tLB7Zqq1k2A9MiA3lW/jOU+cQ8+vDDD6mlpcVXZ8v03+sFHLPznvJtcD/c2PGSvrcTkRK2iKpHfsft2cbYG1ibc5c/Ryc1q8jPReEoX3pe0/d2IlLClitsrFSVrCiihp/9WG7FFmtbsHI1JaeMp/wwuaIsVuf64wpHZyD3mr4zIlLCli/sDC1l7rTx92irjlPkV26hRqxOLi5bT8eaTlh4yzuPKhzly8KL+s6ISAlbrrARHB6gbSo8f6LBVmwo7cmFtOfAIfrbuQ9sdfCrE77FFgNk3hhiq0NRvmwXx39MHNGhpcO/+CmNuSspytY7P+ZXHI1f4kV9D33xxRdtdpTGKGx8tB3F8YuwI91QgvSt69eU0d5NLwsP+o8vXKSni1fTwH59KfffZlPSsCHCdS1Y+QIl35dqOd2IcINRvohtBZHI0S6OGJt8LP7xkz/TV+5JtTU2vYpjlHCzx0TJ/T9/f46Gf2Ma/emDY3Rzj+6srubTLVS8dgPVrCtr/52VvuBZru8hbYdtmx2l0QvbaifMnveLsHGSHveSIeePcdv//Pw8GjGoH+VlzhaGBEvIKH1vvYV+13qentQOxooWr65KDv57WhMcJDXb1GoXR+PYtDtWvYqjcVzgYgGzcYnn7JD7q7U7qEf3f2Lhhr9c/pyyFhXRmmcX0x23J4oOTeL6HsrPy22zozR2hWv8Ar8IOxQKsa5j2z82CuLWDH4EQEYgEG4ZhIxl/9vHPmhrJvdqnAhExFPo4tybkZDs4qgfm7Aw4xVHow5hbOK8odlEaYfcQT63jkih3x95m4795hRd/vyvtiZI9Jvreyht+rQ2O7t+ZRORV5XGTNj638EqwiwE4UOB1hYV0uhRI4VmCqPSgJRKF39H2D1DLqMH5syjs62tQv1x6iU9EfE29ISUMvpeWzgawwZTH5hIL5Y+S4MGDhD6JK/iGGls8ss1eQI8u+R+XMv0sGXHbnZIWO+mCQGqvcT1PTTm3q+32VEao7CfXZhLzy1eKNovsits5NoJdyleV52C/4wcLNEUXJAYrvS6+WZ6b++bnZKbRVMvntnTcIgu/fFP7bMN3DS7sw/kZEyyH21/nHoOWJtdcoj2QOr1e/dSQ+2rwjgaJ0nEOZauXku7Xn1J+JOMOCKfE+5AEymwWkTzQcH6DndGk1vr+j6BkJAauKJstS1yR53AYGNZiW1rCHVxfQ8lDhrUdkDb62LMCBgtsLItIv6hVpQmNTU12u52eg7pWkVSOyCvkLFwc7iivIze27ON+n+5j1C/zNJSYDa3q0BWsxEKdd7CS2YWkd6lGJyYYAtHs7Fpd7zi/Ug4RiLXrqBBil/9FdhdPa//e6tm7XI3V39FtrEOGZNk/bu/YhbRH95rpL59brHSTdNngWmof7++bXaUxq5gzXrWlbBtf7mECvSzDmYnuGSIE4HURiWNpNfKVwmdtOerO0azF5jANxdxK3C4+J6p6XTx0n9L+HJ5VeiJyCymYQdHPqHpV3D5SqQooXsVR6NEzCwiuGYYo5MmpgqTOx+bIKAT2m79E6c+sOX98H4zIkoeOaJNVGnMhG13mPpJ2CAdkA8ErM9VM2Xyg7Qw82GanDrOMhxGt4xXoF+xsFqpV496gIjgmoVbfbSDIx+bRqzs7CXyKo7hiAjjE1e+YxGAb5OwQ+7rflxDX7s9gabeP5E1Ccvd7tYSru+hhx78Zpuo0oQTtp19RH4RNi4
ohAKZXZCYk5VF40feQVn/OsMqZzjy/P7Gw/TS1l9Q/f63HalftFLcLYbgdDjXWOEohiyu1OYLJ8bxKUrufM+Q3po021tktcdc30PZjz/eppTGKnyRn0dO6EP1b9Fra8IHtOW2GLk25NcJ3fxl3913r3AUGyWIM4W7Qdir5B7S0om2KaURE3i4t7Byh0DrJ81H5FYsWBviQ5u2bPXdpQUKR0GBR3jNq+Qe0u7vblNKI1/gSA278qkcSh3zdfmVW6jR7nYIC0058qjCUS6sXiV3duhVCVuusFEbbiU9dbSRXv7+CvmVW6gR1+VcphuprLzcwlveeVThKF8WXtR3RkRK2PKFjRrtrFDI6BFWJB7IzGUpQPx8A4XCUcZouF6HF/W9PTGaErZcYaM23KCxufpHtGPjD+VXHkWNi5//d0ocnkyF2gqfn4vCUb70vKbv7USkhC1f2KgxVmYwYkO4IQVpYkV2jjuDhnitCkdx7Mze9Jq+d0ier4QtV9ioDcHBSRPG08+r1gofo7HaK1y4CBLCSpnoWSarbTr9vMJRPsJe0vcORKSELV/YqBFniHIey6StleWUIHjy20rPcHtHdu58Sps508prnn+W47jnJxuoV88ejvYXZJ6xcCk9VfBM3OHIgfOSvne618xNYQOQeFUao5bsq6+nJ3KyqXDuY7Q4N9sRJYLyzF1aTGmzH6XsHHnXWjvSWcFKgWPpqu/RT9eXOWZhAsdpmkX5teF3Uk3N64I99cdrXtF30yun3RJ2vCuNcSiOHTuWjh49SisW5rH8QjILYkJPLiuhgsXfjdsZnOPFrsLJfZLWPvuMcM6ncNhzHIf+y0iaoR0/8fOllNGOLy/ouykR4QPcEHYQlEY/GJBelqd5uF+7Frrmhy8IpwppN6+12fvFTVtp94FG2vT6Ft/tno5WWYzPAUck+Uoa8lVa+XSudBxramooLy8vbmJsXeEca30PS0TouNPCDpfYqSvQ/Ph3+OO9e/du7/oNN9xAt/TuxfJaPz5rumU3A3uEtmlXS7+0+WeUlZ3Dluj9vFdIVKYbq6oI+Z+AoUwcp0yZQlo+d9Fu+fK9WOp7RCLiaDolbF9KS7DTSJiVkpLS4W0sq896+GE6cviwFnztTtO15GepY+6hhNsGdiImuAyt5y/QKe321u1739Ey2124luKhpDSQBKQHEqlEysvKqEY7bCwDR8gKSeZLSkoEpe3v12Kh71EREWCVLWx/i8p677Gbdb7Jbasgo507d7LT0tjb0XjwXS3TnpZt7+OPOzSiZdLUcsokUPKouyhdIyBkllSlMwJwMeziWFRURFlaKpd42fogMk7c1veoiUj/MTKELQKOn99B/iJc86IvGOgIhmLQB8lN9bIcMbYRH0J+Z1WuIeCGvgsRkRKQdQQQc4C5D0sG+bgx46jBbh1Hp9+Yqe29qtLiTuHy+TjdflDrV0TkkuThLoCE+ACHmwZLSLlYLgkgimZgsSLgz6/eieIV9YgkBBQRSQLSajVYRcPsi/hQEFe7rOLl9PMIUFdWVlJtba3TTan6TRBQRBTDYYHBX11dTVqWzBj2QjWNSSEjI4ORkJoUYjMeFBHFBvf2VhE3qqurY3EJVdxHACSUk5PD4nX8pgv3e6FaVETkgTGgyCg2QlAkFBvczVpVROQRWYCMuJum3APnhYLDnlgwgFusLCHn8e6qBUVEXSHk4t+xXwP7jZRyOAs6rmQuLS1VCwXOwmypdkVEluBy/mHcX46YBS5vDMLJb+cR7dhCuXaJQHNzMyP7eMhc6TZ+TrWniMgpZG3Ui82OOGaAUqadoVIKYwPMv7+KA50g+PT0dLVPyD6c0mtQRCQdUnkVIm5UUVHBVnTUERBxXHG1NT+2EeTzY+IIOv+mIiLnMbbVAlZ2EM9AKS4uVvtcLKCJgDSwGz16NBUWFlp4Uz3qNgKKiNxGXLA9BLKhVJMnT2auhXLXwgMJNwyWJFxcuLZqFVJw0Ln4miIiF8GW0RTOrG3
bto2GDBnCZnmlZNdRRaAfZA2SRnZF5c7KGHHu1KGIyB2cpbfCj4eAiAoKCgK9F4afE8OB4qBjIX2guVShIiKXgHaqGVgB3A2BEgbJCuDWYVJSEuXn5yvr0KlB5kK9iohcANmNJhDURhoL7JFBHClNu4EiHnPqIACNVbCmpia2FI90uSpe5sYIc7YNRUTO4huT2rFzGHEkWEuclPx8jIFnCAT5YPldZbSMybBytFFFRI7CG/vK9aSUmprKLCU/7KXhKVvxL9xNRT6xH0tO9kARkZPoeqxuTkpwb0BGWHmDkiN1bSwL3EoQDvrX0tLCrrFS5BNLibjftiIi9zH3RItQdig/VpwQV+IFgV+krwUROBFjAgmiXfzb2NjIAsz4ASnydtWWBE8MEVc7oYjIVbi93xisEpAELBMQBiwnkBasFhS4d7yAqPRuHp7nz129epWRHAqeQ0AZMSuQDUgH76l83d4fD271UBGRW0jHSTvYrcwJBp8EcsEPLCi9JRNrdy9O4A7MZ/w/1tm1ttKLNPsAAAAASUVORK5CYII=" /></example></p><p><example picture=""><picture -="" a="" b="" e="" x="">Now consider two threads that are initially converged with execution traces:
</picture></example></p><ol>
<li>E - A - A - B - X
</li><li>E - A - B - A - X
</li></ol>
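One way to make the conflict between the rules concrete: converged dynamic instances must line up as an order-preserving matching between the two traces, because converged instances execute together. The heart rule pairs up the B instances, and the iterating-anchor rule pairs up the A instances positionally, which forces crossing pairs. A small Python sketch (the trace and pair encoding is mine):

```python
def valid_convergence(trace1, trace2, pairs):
    """Check whether 'pairs' of (index-in-trace1, index-in-trace2) could be
    converged dynamic instances: paired instances must be of the same block,
    and the matching must be order-preserving (no two pairs may cross)."""
    if any(trace1[i] != trace2[j] for i, j in pairs):
        return False
    return all(
        (i1 < i2) == (j1 < j2)
        for i1, j1 in pairs
        for i2, j2 in pairs
        if (i1, j1) != (i2, j2)
    )

t1 = ["E", "A", "A", "B", "X"]  # thread 1
t2 = ["E", "A", "B", "A", "X"]  # thread 2

# Heart rule: the B instances (indices 3 and 2) are converged. Iterating
# anchor rule: first A pairs with first A, hence second A with second A.
pairs = [(0, 0), (1, 1), (2, 3), (3, 2)]
print(valid_convergence(t1, t2, pairs))  # False: (2, 3) and (3, 2) cross

# Dropping the pairing of the second A instances removes the conflict.
print(valid_convergence(t1, t2, [(0, 0), (1, 1), (3, 2)]))  # True
```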
The heart rule implies that the threads must be converged in B. The iterating
anchor rule implies that if the threads are converged in their first dynamic
instances of A, then they must also be converged in their second dynamic
instances of A, which leads to a temporal paradox.<p></p><p><example picture=""><picture -="" a="" b="" e="" x="">One could try to resolve the paradox by saying that the threads cannot be
converged in A at all, but this would mean that the threads <i>must</i>
diverge before a divergent branch occurs. That seems unreasonable, since
typical implementations want to avoid divergence as long as control flow is
uniform.</picture></example></p><p><example picture=""><picture -="" a="" b="" e="" x="">The example arguably breaks the spirit of the rule about <i>convergence
regions</i> from the draft proposal linked above, and so a minor change to the
definition of convergence region may be used to exclude it.</picture></example></p><p><example picture=""><picture -="" a="" b="" e="" x="">What if the CFG instead looks as follows, which does not break any rules
about convergence regions:</picture></example></p><p><example picture=""><picture -="" a="" b="" e="" x=""><img alt="" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAASEAAADFCAYAAAAbv0Q7AAAgAElEQVR4Xu19C3RV1bnuT29PtaeiYBVEuCSxyEOPhFgUqIWAUh4tEBAlRBQSEJLIIwHkERSSgBiC0hCphIgYIiKCIITHCUEUAiKBdgjBjkJbjyReELAqWnoUz+04ueubnpW7s7OTvR5zrrX23v8cgyGYNf851zfX/PLPf/6PFnVaI26MACPACLiEQAsmIZeQDzLsyZMn6eDBg1RZWUldu3alhIQE6t27tzcny7NiBGwgwCRkAzwVXa9evUrp6enUqlUrQTz9+/cnENLevXsFIZWUlNAtt9yiYmiWyQi4ggCTkCuwBx60pqZGEFB2dnZArUf/eV5eHvXo0cNDM+epMALWEWASso6d1J5fffUVjRo1irZv3y60oKYanktJSaGCggKKjo6WOgcWxgi4gQCTkBuoBxgTBAQNB/afYO3ixYuCiEBY1157bbDH+eeMgKcRYBLywPLk5OSI49XIkSMNz6aqqopKS0upqKjIcB9+kBHwIgJMQi6vCrSarKwsYXA225YtWyY0JzPkZXYMfp4RUI0Ak5BqhIPI1w3RVm68cJOGY1x5ebnLb8HDMwLWEWASso6d7Z64eseRCkZmq23lypXCQM3akFUEuZ/bCDAJubgCSUlJgoCsaEH6tFkbcnEBeWgpCDAJSYHRvBDYgvLz821pQfqosCmxR7X5NeAe3kCAScildcCNGI5QMpwOz5w5Q4WFhXxT5tJa8rD2EGASsoef5d5Dhw6ValCGPPYbsrwc3NFFBJiEXAAfcWDQXjIzM6WNvmbNGuG4mJycLE0mC2IEnECAScgJlP3GwLX8vHnzpIZdIJwDcjdt2uTCG/GQjIB1BJiErGNnueeAAQPowIEDlvsH6ohbMkTcx8XFiYh7GL711B8ZGRnC/oRnkB5kyJAhUsdmYYyAHQSYhOygZ6EvIuFzc3MteUg3NRwIB06LkA3b0Pjx4+sfraiooPXr19fnJIIB++zZsxZmzl0YATUIMAmpwbVJqbKdCxFDBgIaO3asuKaHA6S/rQlHNZATnoVjI5OQw4vOwzWLAJOQwx8IHBQRdNpcug4zU0L8GJwdYZDGcQvR9f52IRzBcAREw5FN9lHQzHz5WUbAHwEmIYe/CT1nkKphZV/9q5ony2UEdASYhBR/C7DHwFaDBo0FsWJ63iA74Rq4ki8rK2vkGwRNCPKbkw0Paxir7YyvGDYWH0EIMAkpXmzYaHBjFajhSHb69GnTZNCnTx9h+8Hxa/78+YJ09IbjGY5czSXFR3+4CHDQq+LFZ/GGEGASMgSTvYew6WEU9m+wDaWlpZkSDuKBLGhBICL8/dtvv62XsWPHDnE935xcaEtdunQRBMaNEXAbASYhB1YARyc4Evo2aCpHjx5t8P9gQMafWu0Kvaam4TV6dHQMRcXE1GdghLy2bduKWzHflLDwxNaPfE29GuLWLl26xLFmDqw9DxEcASah4BjZfgJX5O3atRPHJzSEV4CAELwKTaZY04h2aJpN7J3dKP7eOIrq0J6iO9zaYNyac59S7bnzVH3mI6o4eIi6de1GbTQSgmOib8MYuIFDHFlTjTUh20vKAiQiwCQkEczmRGHjw0iNhmNQamoqzcyYQRfPn6PJSaMpYdD91Or6loZmc/W772jxyiJau2krxXbvTgWFLzSIxg92QwYbFcoKsU3IENz8kGIEmIQUA6yL13114Cw47pEk2vbmm7R0znRBPlbaV3+/Qh3uGUCFuVlU9NoW6nH3z6mo+CWhZUETaiqGDDd13bp1Ew6LfDtmBXnuIxsBJiHZiDYjDwQUHdWR4n/eneampdC111
xja/T8onV0qOoPtKe0iEq3ltGGsr20acubwmGxqbzTcFrEMdBOSllbk+bOjIAfAkxCDn0S0EBGDB9G81OTKXG4nABSHMs6/XIITXgogZbOy6RjJ07R1EV51DHmNnF7hgZDNVwBoPXAIA2jNexRrAU5tPA8TFAEmISCQmT/ARiLB8T3oxdz5lLsHcGLG5oZEQbrYRPShNwVi+aKrl37D6MP//hHQTQx2o0a/Ig2b94sjOAnTpyQmkLEzFz5WUYgEAJMQg58F6MSRtBjwwZatv8EmyLsQ6OnZFBl1e/FGL+5vx+98tYeGv3wGJozZ47QhODAaLTCa7Dx+OeMgEwEmIRkohlAVk72Iqq78iUtyjDnlGhlWtCK1m7cQm1u/ql2xd+esp5bRX/560fCcZErtVpBlPs4gQCTkEKU4dGcPnkS7d+o3VrZNEJbmebotFkUe28fOn78uCAhGMa5MQJeQ4BJSOGKqD6GBZt69Z/O0KT5i+lE9algj/LPGQHXEGASUgQ9/IJynsqid15fq2gEY2KnLnyWevTpS2l+YSPGevNTjIB6BJiEFGGclJhIw/v21K7jhyoawZhYXNvPzltJR48dN9aBn2IEHEaASUgB4LiSb6ddj587/o4rtiD/V+rU79d0oPIQ24QUrDWLtI8Ak5B9DBtJQFDpyuV5tKfkRcvSRz4+nXbvP0j/rPnQsgy949SFS6nHL+JNpw2xPTALYAQMIMAkZAAks4+kp6XSXR3bUuq4MWa7iuc/OX+BZmQ/S+3btqEpj46h2G5dLMnRO5Xte5c27N5P28t22pLDnRkBFQgwCSlA1e6t2OZd5WJWbX56I/1H7Tl6XIuyt9PYLmQHPe6rGgEmIQUI9+l1L63IyqRecd0tScdRbPmC2cKedNt9g2wfyeDEOPCRyXS2ttbSfLgTI6ASASYhBejGREXRfu1q3j8xmZGhcBTzJR4QUu7sabaPZD+Mvovq6uqMTIGfYQQcRYBJyCbces5nXzGJiWPogz1b6JabbzItffc7B+niZ1/UH8FwNLvyj29sH8mYhEwvBXdwCIGIJSGk1tBL8SDfM5KBoToqUrEi4BNVTPF3/D/8DLmcEfbwxhtviKRhTTXIuvrNf9K6ZYssRcxPfWoJFWvxX75t2MD+tOPlVZY/iYt/+5x6DkukCxcvWZbBHRkBVQiELQmBYFB1AqSAv6PWFrQWpFUdMmSISCiP1Bb4uU5C+DcSfoF00EBCiP9CEwnJtD9I0Tpr1iy6fPlyozVBP6TKQNrW6eMepMHxvzS1bp99/iXd2jOevvjwKN3Q8rr6vtBiPj6yjzq2b2dKnv4wh29Ygo07OYRAWJAQSAakAjJB5YmZM2fSDTfcIP4ODUZ2O3z4MPXr16+RWKTKQP7olAkTqG/3ziLZmJnmfxTT+768aRu1vO5fLXtfV1S+R6s2vkXlFfvMTIefZQQcQSAkSQgaCohHr1aBhF3x8fFCw9G1GJXoobYXUqhiHnq78cYb6ciRI4L4oC0dLN9F65bnqpyGYdlIit/ihptFZkVujIDXEAg5EoKWAwJKSEig5ORkR/HE0Sw3N5d69eolNCwkjMcRD8SHSqp6qozVq1fTUwuy6PPqI47Or6nBYA8q2bCxQUUOT0yMJ8EIaAh4moRg04HWAU0HR53mShurXE3MA+SHdKkon6znZ4adCWWX/SupYs6TJk2kvLkZNGmsPUdDu+/FPkJ2EeT+qhHwJAlB48BRC5sff8cxy40GLQckU11dHTA1Kn6OY1mg8jqrVq2i3x96l0qeX+LG1OvHREWOK3QN5WnvwY0R8CICniIhaBDFxcXCvuN2nXTYdaCBZWRkWCbBuNju4qoeNpm56ZMse1Bb/XBwNT9w3BSRxgNuB9wYAS8i4DoJQdvB1Tiqger2FTeBwi1bfn6+sDkhN7OdBlJdX7xaGKhnL86nGRMfteQ7ZHUOs595jqLv6EGZ2lGSGyPgVQRcJSHc1ly6dEnYWdzOf6z7EmEemI
8szQGlfhY+kULxve+p/wZQL0x1zmlRCmjiNJHa1YkbQ69+4Dwv7yPgCglhw2Oze0HzwTU7brygkcH4LZsMIX9Av760tWiFiCWD4+CC/ELauGq54drzZj8jlAACAeFGDC4D3BgBLyPgKAmhGihulCZoznw4frndEJJRUVFB2dnZSm/e8N4pj42j3a/8ThDPmf84S89pBuN1zz+jBAJU2Uiekk4jR41SIp+FMgIyEXCUhGDsBfnIOupYBQK2msLCQhHCocKjOtC8ZmuhHu8frqTXVmralqYRqWjQgMbPfIoeHvcoJadMVDEEy2QEpCOgnIRwHEHQp10jr4w393U2RICqk7aSAQMGUEFBAaVPeZxWLJhVf1OGyqkrFmk2MZvEBBvQ4/Ny6NO/fUGVlZVca17GB8MyHEFAKQnB9gMnPxx34PfjVoO9B3YfNMxFdzZ0cj4goQMHDgjbEzIvxnbpRAtnTKGr3/0XTXryaWEjspL6A9rPCyUbaef+ShqvaT8IrgXBgvC4MQKhgIA0EtJTYPi+NDQPHL1kG3uNAqs7Gx47dkwYnd0kQp2E9Lmv0Sqi5i/Lo/GjR9DQAX3pnu53Gn0t8Rx8gDbv2kur1r9OE5JTxDU8xgDm0PDOnj3rCtmaegl+mBHQEJBGQjExMdS/f39hZ9HTZ7hFPlhZHAHh+Ijrdrc8rn2/MH8Sws8ESWrkWKrZylpdfx11jomilDEj6fYYLW2I3/EMx63ac+fppHa79uaet6lG+zvsWdk5uYLoYeca5WOIZm2I93eoICCVhPQkYSAfEBJiqpy0uwB0OBvi6AVnQ2xEr7RAJOSvNYrbOq1cEHyIaj75pMHUozt21DTKKOoRdzclauTjH0cXFxdXn/sIHVkb8srK8zyCISCNhLDJQAC+DbYXRLoj9EG1HUaVs2EwAI3+PBgJ6XJwg4hjo5mjo78WpMtibcjo6vBzbiKglITwYjgylJSUKNOIYItCmAVISIWzoazFMUpCVsbz14J0GawNWUGT+ziNgDQSQt5l2GF8G26iVCbSwvGlrKxM3Hjh+OflZpaEcKsIDTKYXQ2GaDyrNz0Dgf5vOIY6nXfJy+vAc/MeAtJICCktcJRAw29gaD+qHAFRZhnaj5POhnaXziwJQcPT04SYsauZHcfue3F/RsAuAtJICL+NoZnA9lNeXm7KpmH0JfBbHmEfyGyIVB9mNqfRMVQ9Z4Uc8L7QhMx4mFsZR9U7s1xGwAgC0kgIxy4cjUBAso3QurMhrrRh95Et3whQdp+xQw6BfLCamo+dcey+I/dnBKwgYImEsClwIwNj8LGj7wt/l79rnrvXXfcT+sEPfkDxA+4XRIE4MTuEAbnQruBs6LbXtRVwfftYJQdgMHToUNq+fbshjcjqOHbfj/uHLwK49cafWs0BtqbmbIMXjY6OoSjNRxC3uVaD0k2RkMg2uOl1qtJIIWHwQIq6ta2IgfLPjVNZ9Xu69MVlKqt4VyOhtjRBCyeAcdTMsQJGbgSZwtnQ6st56bOwQw6IwocNDHa2YM3OOMFk888jBwGQTqn2ve3QTjexd3aj+HvjKKpD+yadaKvPfEQVBw/RyBEJlKhdUpnZs4ZICISQpZFBfO+eNOY3vzJV1A/5c17dtosqDh+l1LT0oFn+kGURzoaDBw8WQa+hZPdp7hO1Sw44khrRKu2OEznbjN80EALitlUr3ln3z/9L4x8cRgmD7jec9wrJ+sr2vUtbNI/+2k8vUnbuYkNk1CwJ6cGWUe3a0NI5M2xFeiPW6bniEqo4VEXbNXb1T7alB7vCEIujlxmtKRQ+JxnkYISIZIwTCnjyHOUigGM/yKfq/SO04uknG2QCtTISlI/FL7xEX39zVdvvO5vdz02SEDSS9CmT6cXF2m2UduSS1ZDQCykn5i14WiTd0p0NceSA0TlcMwHKIAfY4YBTc0UAZIwja61ZTmgggF9uSWMepoeHPkCpjzwkddIwzcxe+lsqeXVDkzfmAUlofckrtH
7dWtqoJeCykl4i2Fsg/cSkudnUsVMX+utHHwm7j9edDYO9U7CfyyIHBKkiTUdTToyyxgn2Pvzz8EAAx6+U8Y9JVzZ80UHw9aOZWZS9+Bkaol2y+LdGJAQNaP6TM2mPlopUdTL2YSnTKHV6RkSkIZVFDsGOZLLGCY8txm/RHAIwgSQ9/JCW7fNZW6YWIyjDXjQ6fTbNX5jTSOFoQEL6pLYWPa9EA/KfLCaGulhFa9cpcW40Ao5Tz8gmh6bISPY4TuHD4ziLAGxAqATzYs5cx8pQ4QQ0cNxk2r5zdwNNvgEJ9el1L63IypRqAwoGbaSUppFJDnpZ6kCVX2WOE2zt+Oehi0BS4hga3vdeShzubHVj2IQf1fKgoyCnfvNdT0LrNZ+Ag3t3i0J9Trf8Na/Qlbp/0UoV5zs9tGPjySYHhMnAjcE/YZvscRwDiAdyDAG9KOe2Nb91bEzfgfyLcgoSgmrWrUtnem/bq44cwwIdy/7tV6MEOxrxhXEFOZuDyiYH3Cpi3fzxkj2Ozdfm7h5DwI1jmD8EcNcZNnE6le97W3y/goRWarctNX86qfkHzHENssJXXqP/8/W3VLCy0LU5qBzYKXJwahyVWLFsdQh4Ya/j7Xz3uyAhN2xB/jDDaNWp71C6rP2GD8emihwQV4agYb2pGicc1yQS38kLex24QxvqOSyRLly8RC0uXLhQFxcbS+eOv2NpTX4YfVejfv+s+dCSrAcemUw5S/PC0mdIFTkgwBfBg7qflapxLC0od/IUArjQsLrX//JxDd1x/3D64sOjdEPL68R7VZ/+M2Wv+B2VFuTV/z8zL/zL0eOp4HerqYWWjL7u5PuV9OKSp8z0r38WJGSVdPwHzNdKI1+hazQD9TJLc/FyJ1XkgA8LXtRMQl5efW/Mbc2aNWRnr7+8aRu1vO5ftRu1ofT1lX/QhJlZtHzBbOp8W7SlF9T3e4vk8ePr+nbvTBMeSrAkSCYJwcV7yeoSOlB5yNJcvNxJFQn5v7NT43gZa55bYARStFS/dvY6iOend/Whj4/so6MfnKQr//iGHk8abRlufb+3GDLoV3XTxz1oKjLed1SZJAQfgofSn6TTf/6z5RfzakeV5ICbMvyWQ0yZynG8ii3PyxgCQwcPIjt7HaNUfVBNG7btpOKNWxoczYzNoOFT+n5v0aP7XXXrli2y7DXpbxNaMH0KLZ493cqcKJyN06rJQU98hv+i3DQ3RsAfgbjY7mRnr+vysOfX5OXY0oIgS9/vLW5p26buD7s3W/YPkqkJYWKQp13Yhd0XpJqEkPwfHqjIxcQkFHafj6EX0u2DTT08+sFR9M6mdXRX186G5AV6aPc7B6n83cNCE/r0D5XU5qYbLcvS93uLrp071yFWrOvPYiwJk0lCiCW7KfY++vbbq5bm4uVOqklIf3enxvEy1pE6N5AQyrHDITFQ+9GPfkRHy16n2G5dLEH02edf0q094wX5HK8+RcdPfmj51IMJ6Pu9Rf9+fesWPpFiOYmRTBJCHNlA7Zr+bG2tJZC83MkJcsAt2aRJk+jIkSNehoLnphABveqN/xDQkrv/252U9+RUy3u9YG0p3X5bFA17oL8QP/WpJTTl0TGWSU3f71Jux/xf2OqVPd+O2fs6kYYFRSjPagnJuUUmAk1pQ8hBVX3ihOXbMd0naMfLq+qBDeQ7ZAb1+tsxLZVqXd3Xf6NFmelm+it5tnRrGR0+9RcqKS1VIt9NoU5oQni/1q1b0+XLl918VR7bJQRAQCiIAPsgtGK9Iek8qrWgLJdX9jrmpu/3FqdPn64bNWIY/fHtHS5B9/+HHZ02i5JTnzCUHNv1yZqcgFMkdN999/FxzOTahPrjcNGA53ytZsZA6XCkSIZtSM85pe1xkeMZxOSVvQ7Mf5MylTLnZn0fwBoTFUX7X1+rPLtac4sNI1Wnvr8WPkLhluQe7+0UCTk1Tqhv3HCYv29dPqRI7t27d/1rgZRgH8JNqW/qZC/sdUwS+73DvQ9osW
MXvyehLM3JrSV9R/PSJ7m2NuFsD3KahDAeX9O79ikrHxjkA+dUFAVNTU0NGGuJZ1AnEGWzfJsX9jrms3lXOe06/AfatHnz9yQEtQ3RtX98e7vyvNJNrZAezObL5spX08EBnNJQME58fLz4MMO9eICDy+eZoUQBUm3jgnzMFBjUXwB7fUC/viJ3WKvrW7r2XoigL9mwUQRf12dWnJmZQf/7hh9TxsRHHZ8YCqZt2L1f1CcK1+YkCWlByaKUUrgSerh+I829FwqQFhcXC5uPFfLxle12TqHijW/Sh7UXqUh7H7R6EnIruyLOhj2HjRVZ1poqYxMOH52TJMRHsXD4Yr5/B6RiLdVuixMSEkQpdRnNzeyKCNXAqefAocP1WUEbJLpH/ellS3JoW9EKx45l4zKyaGTiIzRW828J5+Y0CeGaFppQOBr5w/k70d8Ne7GwsFAcrVWUQ0dlHdyU7d+41tFj2W8mTqPMJ+c1qD/WqO7YmqLVdPLoe1p+oQXK1zp/TQld+e//RXn5y5WP5fYATpMQbAcgILuqu9u4Rdr4qsnHF0+nlY7ZzzxP0Xd0p8yZsxosa8AKrOmpU+ia//4vUZNaVSss2UhvH/1AWMcj4be10ySE33T4TQpPWW7eRwCVULFeUZq7TGZmpmN7QlRbfnktofKGKkM1TC6zn1lBt0T/jHIWL2m0GE3Wol9Z8FvasfVN6ZPDhCbNzaF2UTGUoTEifBlKtHJD4U5ETpOQ97cdzxAI4JdFVlaWsI9o0Quu7ANoRDNnTKN1+TmWU/o0tZrIJT1OKwE9dtxjlJb+RMDHmiQhPI3JpU+ZTNOTH6HUcQ/b/mpwC7bkhZdo3oKn6m1AWIRIICImIdufT1gJwHePtCv45QtHQ7dLXYnqy1pBxPh7etCc1BTbWhGUDdyCbdi+h4peWtvsTW2zJIRVh19BrsbQVe+/p00uWeSXNduOnThFC55bRa1vulnYf+BW7tsigYjcICF4zeLGke1CZr9Ydc/r8V24oQL5eOlGWHhgayegYs0REooHUj6bPaKBfDbv2qspG8U0ctQomjf/ey2vuRaUhPTOOLPm5+XRjrIyShgykAb17SPCPHrFdW90kwbvZ6hh+w5XUcXB9zSgowT5NOc8ByJKT08XgXZ6eVh1n4Lzkt0gIUTVQ5tF2ldu7iIAvy0El+I7z9P2kZfIxx8ZzDU3J1vzuC6lrp1uo+ED46m3ts+jOrRvFNqFdBy1587TyT+dEfu98ugxGjsmUdvv+UHJRx/XMAnpHcCW8F2o0Gpd1dScparjx7UkSt81eA8tR5E2gXY0WEs1ijLFwZiwAdFpk4eNKNyIyA0Sasp1393tGFmj+weXwkM4lBp+kZVpikfV+0e0/V5LNZ980mD60R07CiWjR9zdNFjb6/5lyY28q2kSMiLUzjNC4wpDInKDhOysA/e1h4Ae31VZWSm8nDmEpmk8PUdCmGo4EpFbJAStFRsg3G8f7VGG3N6wxVVUVAibD5NPcGw9SULhSERukZB/hdbgnwQ/YRUBOIji6DJBq+/FlwHGUfQsCYUbEblFQnqWPTjAcVODALRNBJcmJiZKi+9SM1NvSvU0CYUTEblFQjCM4o+Xb2O8uTWCz4rJJzhGRp7wPAnhJWChh7F606ZNIXtr5hYJGfkI+BlzCMDtAZpPr169lASXmptN6D8dEiQEmHGswMKHKhG5SUJIcI4/3OwhgAsTeDnHxsYK36twcyOxh4713iFDQjoRwfCHpF2h1twkoVGa5yqcQLlZQ0C/re3SpYujwaXWZht6vUKKhEKZiNwkIdaErG1MeDfDDACNxwvxXdbewvu9Qo6EQpWI3CQh73+G3poh4rsQ2Q7yQWS7UY9/b71F6MwmJEkoFInITRLCpkLjzdT8xvRycGnoUIr5mYYsCYUaEblJQnCiQ5OVo9j8Z+btHnp815+1mnc4doVafJe30Q0+u5AmIZ2IUAIFQa9ebm6SEG4W8VueSajhF4L4rmXLll
F1dbU4djH5uLODQp6EABt+0yNQ0MtE5CYJwcCK3/a8yb7fZEaKB7qzHSNz1LAgoVAgIjdJKDI/7cBvrRcPRGS7lbQTjKV8BMKGhHQiwrkeSaO81twkIWhB8HWJ5Ihuu5VLvfY9hdN8woqEdCLC8cNrHsJukhDwgLe5F8lZ9WZSUTxQ9ZwjTX7YkZBXichtEkK4gZdtZrI3HuK74Gg4ePBg4eXMzbsIhCUJeZGI3CQhGGKhDfkXGPDuZ2l9Zk4WD7Q+S+7pi0DYkhBeco1WNQBX0144mrlJQpHwybtVPDASsFX9jmFNQgBPJyC3iYhJSM2nfObMGRHZjnxJcDTkNLZqcFYpNexJyCtEpJqEcOR64403xIaEARrXz4E2JBwXkZ8JIRxIQRqqoRwcXKqSFpyVHREk5AUiUklCIJaUlBRx9NQbNAPkXurdu7f4X/hZUlKSqEOmNwRoIi1KKHlS6/FdcDuAlzNnjHSWMFSMFjEkpBMRtAM3bktUkRBsIX369BFewP4NJHP69Gmh7eAZPBuogazGjh2r4vuSJlMU5NO0PJCQ14sHSnvpCBEUUSSkExF+ezr9218VCSFhGXxhmmpwUIyPjxcbuKkGkjp79qwnMwWGevHACOERW68ZcSTkFhGpICFoPz/+8Y9tfQB65xMnTngqtkzURdfqdx07dkwYnPVjpZSXZSGeQiAiScgNIlJBQtASWrduLeWDOnr0qCc2OlculbKcISUkYkkIqwRjLo4qThzNVJAQ3qFdu3YNDNJWv77Lly+7fr3NxQOtrl5o94toElJJRPBfwZW53kpLS0VlTr3heCEjihs3Xr7jWPkckeIDxzG3GmxahYWFlJqa6nkDuVsYhfO4EU9CqogIx4qYmJgmtRRZNhjceMXFxdn6RlGJw42yxRxcamvZwqYzk9D/LKWKoxkMqzNnzmz0sWDDByrBAx8e/KnVbqpqas426BcdHUNRGqlBa/EnDDvakBtaEBcPDBv+kPIiTEI+MIKIEHUty2emKW3IVwuC93Kx5jC4Q6unFntnN4q/N46iOrSn6A63NljgmnOfUu2581R95iOqOHiIRo5IoETtKAZCgu8MtCFfZ0UjXwf8iA4cOOCYQVqvpIvKper05tAAAAo+SURBVPDV4uKBRlYp/J9hEvJb4/T0dEpISJBir4Fof20IRnCk1EDYwcyMGXTx/DmanDSaEgbdT62ub2noi7v63XdUtu9d2rLnbfr6m6tUUPiCkAefITMNHsdOxNRxcKmZVYm8Z5mEAqy5TCLy1Ybwmx9OgWuKVtNmzUt56ZzpgnzstMqq39PsZ56nHnf/nH52e2dauHChIXFNHQkNdTb4EIgRTpLwUufigQZBi8DHmIQCLDqIA3YW3NbIuMHStaFp06bR559dos7t29DctBS69pprpH1ypVvLaEPZXjrz14+CHstgB4JfkKrjEBcPlLasESGISaiJZZZJRJB1++23U9ubb6JZE8dR4vAhSj6uYydO0eBHp9D9DzxAO3fuDDgGCAh2IBUpL7h4oJJlDXuhTELNLLEsIoKcX/TuRS/nLaTYO7oq/aheKHmNXn1rN3Xq0o22bt3aYCxVBATPbaRSxfELdqZIyOCodBEjTDiTUJAFl0FEoxJG0GPDBtq2/xj9Nqv/dIam5iynB341iJYuXSq64VgJtwCZRzCuXGp0Rfi55hBgEjLwfdghopzsRVR35UtalJFmYCR5j+D2bMPu/do1/iOiwqjMShtcPFDeOrEkIiYhg1+BFSLC1XT65Em0f+NLUo3QBqdMo9NmUXLqE1K9oZG3u0zzaeLigUZXgZ8LhgCTUDCEfH4OIoJDI66b/Usq42jib+x1+hjm/yo4lk2av5hOVJ8y8ZaBH+XgUtsQsoAmEGASMvlpgIiGDh1KBQUF9USE9KowzOLWSW8ITch5KoveeX2tyRHkPj514bPUo09fStOcMK00xHehcGJiYqIj2QaszJH7hDYCTEIW1g9aD7yTQUS4ls
bfQU5wRNRzHidpm3Z4357adfxQCyPI64Jr+9l5K+noseOmhIJYEdkO7/G0NGftWaYmyg+HPAJMQhaXEEQEjQh2Hz2/M4y/8+fPF/9up6VMPXf8HVdsQf6v1Knfr+lA5SFDSeE5uNTiB8HdLCPAJGQROmgKugaki0COIHgi42crl+fRnpIXLUn/YfRdDfq999Zr1PvuWEuy0GnqwqXU4xfxzWo0IFMcKbt06SKCS1U4M1p+Ae4Y1ggwCVlY3kAEpIu5cOEC5eZk010d21LquDEWpBOBhP5Z86Ho+9nnX9KtPePr/21FoH5dv72ssRe1XjyQyccKstxHBgJMQiZRxDEsKytLZDPE3/0b6nhVlP+7LedEXxKCfP9/m5wyBbILcfFAsyjy86oQYBKygSxujuAzg//qhIQSO1e/+U9akZVJveK6W5LuSzqfnL9At903yJYmhFxEAx+ZTGdra4UhHZHtsFvBhhWqFVgtAcudPIkAk5CkZfElpJY/+Qkd2PxKo8RkRofytwkNG9ifXshdQB3btzMqotFzkAlbD1cutQwhd1SEAJOQAmDbtm1DH+zZQrdoUfNWmv/x6y8f19DcZ1fQjpdXWREn+kCmrLzWlifBHRmBAAgwCSn4LOJiu9O6ZYssR8wHsgHZsQtd/Nvn1HNYIl24eEnB27JIRsAeAkxC9vAL2Hvo4EE0fdyDNDj+l5ak+xMO7EIzsp+1rAnJDN+w9ELciRFoBgEmIQWfR4pWX6xv98404aEES9L9bUIQYsdXqKLyPVq18S0qr9hnaT7ciRFQiQCTkAJ0Eex5sHwXrVueq0C6eZGLVxZRixtudiSpvfnZcY9IR4BJSMEXgBuomOgo+rz6iALp5kXCHlSyYWOjyH/zkrgHIyAfASYh+ZgKiQPi+9HCJ7Ra973vUTSCMbG+PkLGevBTjICzCDAJKcIbyb9Ovl9JLy55StEIxsTmF62jK3QN5S1bZqwDP8UIOIwAk5BCwO1e1dudGq7mB46bItJ4cECqXTS5vyoEmIRUIavJhRf1+uLVtG3NbxWO0rTo2c88R9F39KDMmTNdGZ8HZQSMIMAkZAQlG8+4ZRuCLWjYxGkitavMChs2oOCujEBABJiEFH8YuCkb0K8vbS1aYTmWzOwUv/r7FUFAuBHjGmBm0ePnnUaAScgBxJGzJ+WxcbT7ld9Rq+tbKh8REfMZT86lkVraWW6MgNcRYBJyaIX2lpdT7qKn6bWVeco0ImhAk+Zm08damMepU98nRePGCHgdASYhB1cItbqqjhymgqeftJxrqKnpwgb0+Lwcypg9hzZv2UKbNm1y8M14KEbAOgJMQtaxM9UTScRat25NU6dOpSOHD1Fsl060cMYUy+k+9MGh/bxQspF27q+kklc3CK/oAQMGNCg/ZGqi/DAj4DACTEIOAY54MhROxE0V8voc1GqU5S/Lo/GjR4g/0R1uNTUT+ABt3rWXVq1/nSYkp4hreN0XCOOUlJSYkscPMwJuIcAk5BDyffr0oaqqKjEaUsCiUCK0o2VaitVSjaBaXX8djdAyKMb37klRHdo3IiUct2rPnaeTWlXVN/e8TTXa38eOHUvZObmNHBFZE3JoUXkYKQgwCUmBsXkhuB3r1q1bg4egqSQnJ9f/P5TcgXNj5YF3qaamlmo++aTB89EdO2p1w6KoR9zdlKiRD8oLBWpwCUjXqq2yTciBheUhpCDAJCQFxuaFoDrHMr/YLSSYx7FMdqJ5aFsoYIgijNwYgVBAgEnIgVVq166dqHLh36AJybbdoBQRGo5q3BiBUECASUjxKqFQIspF+zZoP7onM45NMrUhaF2JiYmcO0jxurJ4eQgwCcnDMqAkFBnEHzQYpFE6evv27cpGhXwQG8eLKYOYBUtGgElIMqDBxKkmIWhd5Zp3NjdGIFQQYBJyeKVwczVv3jztpita+sjQuFBdVbadSfpEWSAj4IMAk5DDnwOu4WGkTktLkz4yHCLRfK/+pQ/CAhkByQgwCU
kGNJg4+PHAo1mFXQhys7OzlWhZwd6Lf84IWEWAScgqcjb6wW4DEpJtPGZPaRuLwl1dQ4BJyAXoc3JyxE0Z/shqbA+ShSTLcRoBJiGnEdfG00M0QEaymgpikzU3lsMINIcAk5BL34fsq3o+irm0kDysbQSYhGxDaE0AbrKQemPkyJHWBPj0QqwYYsY4Xsw2lCzABQSYhFwAHUMijQd8hmT49OBWLE9LCSIz/MMlWHjYCESAScjFRZ+pJSKbMGGCrTgv+BxBDqfucHEheWhbCDAJ2YLPXmcZBCKDyOy9BfdmBOwhwCRkDz/bvVeuXCki6ocMGWJaFkissLBQHMW4MQKhigCTkMsrB9uQflNm1nmRbUEuLx4PLwUBJiEpMNoTgpxDFRUVVFBQYFgQNCiQlooYNMOT4AcZAQkIMAlJAFGGCKR/xe2WkeBTXMeXlpZSUVGRjKFZBiPgKgJMQq7C33DwpKQkkRWxOd8heFsjXQcnLvPQwvFUbCHAJGQLPvmdcduFY1YgYzPyR5eVlQkNSK8xJn8GLJERcBYBJiFn8TY0GnIOFRcXi1uz+Ph4OnbsmIg369WrF8mMNzM0GX6IEVCMAJOQYoDtiEdkPOw/ICOUd+bGCIQjAv8P1Hg9fZm730oAAAAASUVORK5CYII=" /></picture></example></p><p><example picture=""><picture -="" a="" b="" e="" x=""><picture irreducible="" version="">For the same execution traces, the heart rule again implies that the threads
must be converged in B. The convergence of the first dynamic instances of A is
technically implementation-defined, but we'd expect most implementations to be
converged there.</picture></picture></example></p><p><example picture=""><picture -="" a="" b="" e="" x=""><picture irreducible="" version="">The second dynamic instances of A cannot be converged due to the convergence of
the dynamic instances of B. That's okay: the second dynamic instance of A in
thread 2 is a re-entry into the dominance region of A, and so its convergence
is unrelated to any convergence of earlier dynamic instances of A.</picture></picture></example></p><h2 style="text-align: left;"><example picture=""><picture -="" a="" b="" e="" x=""><picture irreducible="" version="">Spooky action at a distance </picture></picture></example></h2><p><example picture=""><picture -="" a="" b="" e="" x=""><picture irreducible="" version="">Unfortunately, we still cannot allow this second example. A program transform
may find that the conditional branch in E is constant and the edge
from E to B is dead. Removing that edge brings us back to the previous example,
which is ill-formed. However, a transform that removes the dead edge would not
normally inspect the blocks A and B or their dominance relation in detail. The program
becomes ill-formed by spooky action at a distance.</picture></picture></example></p><p><example picture=""><picture -="" a="" b="" e="" x=""><picture irreducible="" version="">The following static rule forbids both example CFGs: if there is a closed path
through a heart and an iterating anchor, but not through the definition of the
token that the heart uses, then the heart must dominate the iterating anchor.</picture></picture></example></p><p><example picture=""><picture -="" a="" b="" e="" x=""><picture irreducible="" version="">There is at least one other issue of spooky action at a distance. If the
iterating anchor is not the first (non-phi) instruction of its basic block,
then it may be preceded by a function call in the same block. The callee may
contain control flow that ends up being inlined. Back edges that previously pointed at the
block containing the iterating anchor will then point to a different block,
which changes the behavior quite drastically. Essentially, the iterating anchor
is reduced to a plain anchor.</picture></picture></example></p><p><example picture=""><picture -="" a="" b="" e="" x=""><picture irreducible="" version="">What can we do about that? It's tempting to decree that an iterating anchor must
always be the first (non-phi) instruction of a basic block. Unfortunately, this
is not easily done in LLVM in the face of general transforms that might sink
instructions or merge basic blocks.</picture></picture></example></p><h2 style="text-align: left;"><example picture=""><picture -="" a="" b="" e="" x=""><picture irreducible="" version="">Preheaders to the rescue </picture></picture></example></h2><p><example picture=""><picture -="" a="" b="" e="" x=""><picture irreducible="" version="">We could chew through some other ideas for making iterating anchors work, but
that turns out to be unnecessary. The desired behavior of iterating anchors can
be obtained by inserting preheader blocks. The initial example of two natural loops contained in an irreducible loop becomes:<picture preheaders="" with=""> </picture></picture></picture></example></p><p><example picture=""><picture -="" a="" b="" e="" x=""><picture irreducible="" version=""><picture preheaders="" with=""><img alt="" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAW0AAAC4CAYAAAAohb0KAAAgAElEQVR4Xu19C3RW1bXupKP3gpZXqwgKlwTr4aEHkihCaIuJlvJogYBYAvJIIq8gYoJUICJ5IBpCiyQoJlExRIQQXyGINwRBiE+gnkK0Q6HHI8QDBeujUnzguY6Ru7+tK+78+X/+/e+13/9cY2TwyHp+a+1vzTXXXHO2a1YScWIEGAFGgBHwBALtmLQ9MU/cSY8jsH//fjp69CidOXOmZSQNDQ3q37t27Uo9evRQfx+YEhMT6cCBAy354uLiWrL079+f8BMfH+9xdLj7kSDApB0JWpyXEQiDAEgZ5AuSPnbsGJ0/f14tAbIFAXfo0EH9t/bvkYAq6kcZ8XeQP4gfP/369VPrBpHj35z8hwCTtv/mlEdkIwJHjhyh6upqVRoWEjOIE6RplJiNdj/YhoFNIykpiSZMmKBK5Zy8jwCTtvfnkEdgMwKQokHUJ06cUIkwLS3N1SqKXbt2UW1trXoCAIFPmTKFCdzmNWNmc0zaZqLJdfkWAb8QHzYcEDhOCDgNuH3D8e2CkhgYk7YEeFzU3whAMi0qKlIl6pSUFBo9erSvJFQQd2VlpUrgUOVkZWWpF6Kc3I0Ak7a754d75wACkEbLy8tVAgORxcbGOtALe5sEcZeUlKiN5uXlRcWY7UXYvNaYtM3DkmvyOALbt29XJU9YemRnZ0el9QVOFThdfP7557R06VJX6+o9vtwMd59J2zB0XNAPCMC6Ytu2bSpZp6amUnp6eotZnh/GZ3QMsEQpKyujxsZGmjdvnqoa4uQOBJi03TEP3AsHENi0aZNqBQJSgkkcp7YIYFMDedfX16tqE+i+OTmLAJO2s/hz6w4gAJtqqABwuQjJmlN4BEDeBQUF6oOewsJCvrAMD5llOZi0LYOWK3YbAiCeRYsWqbpq6Gv5xWDkMwSdd05ODg0dOlTV+3OyHwEmbfsx5xYdQADSNcgGUiIf8eUnQKhMSktLWeqWhzOiGpi0I4KLM3sRAZA1JEQQDEvX5s0g7Njnz5+vmkXynYB5uIariUk7HEL8e88iAHVIRkaG+nQ7MzPTs+Nwc8eBMYhbmEm6ua9+6RuTtl9mksfRCgHYGU+dOlXVXScnJzM6FiNQXFysejXEaYaTtQgwaVuLL9fuAAJQhUDCBoGwZzv7JgCPk2AayMRtLeZM2tbiy7XbjIAg7IqKCn6KbTP2aE441mLitg58Jm3rsOWabUYANsRQiTBh2wx8QHNM3Nbiz6RtLb5cu40IgLBhycAmfTaCHqIp6LiR2Jbb/Llg0jYfU67RBgRw0ag134PdMCwZmCRsAF9nE+IimGNY6gRMZzYmbZ1AcTZ3ITBs2DC1Q3DyBFKAW9Gamhp3dTLKe4ONFRfCVVVV7ITLxLXApG0imFyVfQjANhjStUiwEsEDD0Hi9vWEW7oQArAoweUwn4DMWydM2uZhaUpNcMCPn6bjx5XFfrxVnbGxfSimTx9Vsoz2F2jQmcKPSLC0bt26FpJgPOWXpSyGEydOZGlbfhpaamDSNhFMo1XBL0a5YlO8XYndF3fNAEoakkAxvXpSbK8rWlV54uTfqenkKWo8+j7V73+FJoxPoVTl8i0aCRwWCmPGj
GkDOdyHwvcz42l0NX5Xzsw1ibmCtM2vUuXmRJRm0jYHR0O1YCEvyrqTzpw6SXOmTqKUkTdR186ddNV1/ptvqHb3y/T0iy/R2a/O07qS9VEVZQTmfZdffnkrrKZNm0ZfnvsX46lrBQXPZMWaxAUxpO26ujqJnnFRJm2H10B+Xi5VKxc099+9UCVrmdRw4M+0eNWfKP7a66i0/NGoufT56U9/qobFQopXQoSd/+pLxlNiIVm5JnEhCbUVBw6WmKDvi7KkLY9hRDWoTozSZlLfnpfRkswM6tC+fUTlL5S58tla2ly7i6qefiYqPg5YkOAY30cJvDtj4u8YT4MryY41edNvRtLw4cPZD4zBOdIWY9I2AUS9VeDoOTV1Mi2cMZlSx1kTc+/g4bdpQW4hVTy52ffqkunTp9NuxddFcd5SxlPvIgzIZ9eanJ6dQ+m3zaLc3FyDPeVirB6xeQ1Amrkx6QbakL+E4q7ub2nrZz7+hMbedgfV7NjpW/8bwHPI9YNp05p8xtPgarJzTe59/QDNWZpPDa++5ts1aXAaIi7GknbEkBkrMDFlPM0YO0Jaf6239cZ3j9KC/DW0r+EVX+q4GU+9KyF0PsZQHkMnamDStgF1XPA0n/uMcrPsdcQP65LNO/dQTe0OG0ZpXxOMpzzWjKE8hk7VwKRtMfJHjhyh+XNm0Z4tilWHiZeOers9KfMuSp93u29suRlPvTMfOh9jKI+hkzUwaVuMvt1H0MDhQE0ya9lKOtz4tsUjtad6xlMeZ8ZQHkMna2DSthB9PP/NX55De7c+ZmEr4atesOIBih82nDIVfx1eToyn/OwxhvIYOl0Dk7aFMzBV8UA3bvhgxRyt7XNrC5ttUzXMABcXFtObBw/Z2azpbTGe8pAyhvIYOl0Dk7ZFMwBzqst79KCTh/Y6ossOHNZVN/xWtSSJVR6ieDExnvKzxhjKY+iGGpi0LZoFOMkpXlNIL1ZsiKiFCbMX0s49++nbE+9EVC5c5gUr7qf4XyR51mmPUTy1uJiJrRfxNAPDH8cObLXUXnv+KUq8Ni7c8gv6ey9iaGigJhdi0jYZUFHd/Mx5NLB3d5o3bbLuFj48dZruzHuAena/jOZOn0xxA/rpLhsuo9vN/4QPEW00Gu2YjOCpLW82tm7FEy8cQ52mZDEEniBtIVD845PP6IrBSYYFDLdiGO5bcvr3TNoWzYCRG/rqF77zgnbZJT+j/2o6SbMVz39mJbfrtXFBBk9wiPEIh/mB5G0ETy12ZmPrVjz7fO9vHS5qA8N8yWIYSNrB/h3JenUrhpGMwYm8TNoWoT5s6BBam5NNQxMG6W4Bx/c19yxWdeBX/nJkKwkGEk5ZYT5l5uSr9UV6LIUv7hG3zqHjTU26+2NnRpD2jTfeqDYJwg4kbyN4avt/IWyNjNOteIK0IW0jwc+6lrxlMQwkaZxeAtdpJFi6FcNIxuBEXiZti1DvExNDexRTv8BABqGaC/wAQDIFi+9oUZGAtP+j7ln132fPfUGXDBwW8bEUdTQ3N1s0YrlqtaQtatKSd4LiejUSPLW9CYet0Z67EU8taYtxCfKemJJiGENRV6BOe+yIZFpfcA/17tnat7leTN2Iod6+O5WPSdsC5OEuNGX8OPrL/1VcpHa7VFcLO/fupzP/+LRFJYLj/Lkvvmr5t1aXGCjx6GpAyYQ6kpOT9Wa3NR902nipFyx9pypppr++tF03ntp6wmFrdKBuxBNrD1YiwdLFF11E7+zZTjE9W0dEimT8gevwbx+coCUPrKXtjz8USTUteVHfcSW0nletmgwNWrIQk7ZBABE55ejRoyrRfPTRR6pfZxxLxdH03676OW1bv1q3B7oFy++j8i1Pt+oNpBjxMciSNjz/DR6bSqfPfGRwxNYWCyZpo0VsMjjiI8LPxtW5uvHU9jYctkZG5lY8g0naCDywdOlSqnhiI1UU5RnCUCtpB1o2Ba5NvXgCw4Qxt9A1/z5QDZAQqIPXW0+05WPSNjjj0
L+CaIIlEE2H//2/aOG0m2lU0q/CtiBu4T99503q0qljKynkg9d3q0dP7YchLCEikW7c/pw9kLQFWYuTwZhRI3XjqQVcD7ZhJyhIBrfiqSVtQdaIzdihQwcyiqF2+IEEbWQtivoEhng/MFWJdTpv3jzf+Mgxsqb0lmHS1otUQD5I1AkJCS3hrrS/Pnz4MJUoksPwQX0p7ZaUsC0EHt9FgcernqNOHS9WX1TiY4HkDRtupEgvIusbXqOHtjxPdfW7w/bHiQyCtAPJWvQlIy1NN57a/uvB1sh43YonSBvqEUjWgqxlMQwk7UC8Il2LorwWQ/R5vuJmIU65u4D1EKfQCDBpS6yOTZs2EWLfaRMufWpqagi/21/3Am1cUyDRwg9FjR5BRQ0ri0upXZdulJ+fb0p/zK5EqJZC6dzNxlO2/27Fc9u2baq0Csk6MHkBw9WrV1OTYuFUWloqO0W+Lc+kbXBqIRksWrRI1WVrL9AgZUM3h4u1PrEx9Enj6wZbaF1MlrShz67YvMWzekOz8ZSdFC/i6RUMsfFUV1dTRUVFG3t92XnzQ3kmbQOzCKIuKipSL8hw6z1gwADCxSSOdbhQEQnhxVbcnkFJidcbaMU80vaLPayZeMpMiJfx9AqG+MZycnJU4mbLktarlUk7wq8X6oWzZ89SYWFhyxEU+tgxY8aopku4/BGprKyMjrzRQBvuWx5hK+ZmLyrdSOeoPRUqR08vJ8ZTfva8hCGss6B+hKqELUt+mHsmbZ3fARYQdn7ccI8e3TaSOlQkwRZWQtwgw6ZqOrt2wWwwqxoxba7qljWUXw8z2rGrDsZTHmkvYQiVDixL8EI22Hcnj4b3amDS1jFnuMBpaGhQVR+REt/27dtpU/kj9FzZgzpaMj/L4lV/pNir4ylb0b/7ITGe8rPoNQxxfwSJOynJu14q5WeNJW1dGGKXx2UjFkt6erquMsEyOaVHhO517G13qKHGglkTGB6QwwUZT/kJ8CKGOOmCwLX3RvJIeK8GlrRDzBl8D5eXl6u66/79+0vNLMj/xhuG07Ola3X7IpFqUCn8+b/OqYQNixHZ/sv2xezyjKc8ol7FEKfe+vp69YLST4JIJDPKpB2AFnZy7OhdunQx1aZZvVSZMY12PvEwde3cKZI5MpRXjcI+dz5NUNyd+jExnvKz6lUMcfFfUFBAVVVVrS7+5RHxRg1M2pp5wmUiFgNekyUmJpo+g7vq6qgg9156qrjQMokbEvasJXk0YfIUSs+4zfQxuKlCxlN+NryKITYcvKCMRp8lTNrfr3vxEgvqkEgvGyP5dLAxzJ87m9bec1dEvrb1tAEd9uyl+ZS1+G7fStiBODCeelbGhfN4FUOoeBA4A0JWNFmWRD1p41EMduw0xbcFnv/akdAmoojE9buKVtw515C7UW0/IV2vr9hCO/Y0UMWTm6POppXxlF+1XsUQ6kyYBKYovsJljAXkEbSvhqgmbTyXra2tVY9Y2kcxdsFfpjwaKFpdSDMnjVd/9AZMEP2DDXb1C7vooU1bKS09QzXrs/KUYBcuRtthPI0i90M5r2IIKy9cTOKk7PcUlaSNYxUuG2OU6DLLli1zdI4hKaxWFlqlcivetXNHGq948ktKHEwxvXq2IXGoP5pOnqIj7x6lZ158SYkj+d90k+IiduWqVdS3b19Hx+GWxmXwPKFgO2XKFMrLL4jqzc+rGOK1J95T+N2yJOpIGz4NcNkI6dptpnDQLeLhQ8O+l5VgCk104sMPW3HhT7t2oZ9cfDF9/Mmn9M3//E/L74STKrcQp1v6EQ7P2N69Fb8WMRSfcC2lKmRtxeWzW7Aw2g+vYQhT3ZKSEtWyxK+nzqghbUgPIGskOHryko0nXoPBPjVYwtN5kDYnRsBKBCBMQJWIBP2xXfc/RsaEjQbqEvgscZtgZmQ8gWWigrSF3xD4L3BrjMRwk4nLFujgAxN0eE6reML1nX/vTQRAfnhghktKQdRekV7RZ3wzENC8+s2HWjW+J
23ouQ4ePGjIb4ibPjWcFGDehOOfNkHiGTVqlCr5OHGZ6iaMuC/yCCAYBYQD6IaHDh2qWmR41TWqXy1LfEva2GlxRAKh+cUUKJC4YZtapzzYwQuxyspK1S+DkIi8pP6RpxquQQYBXMwL9Qc2fpi/+km/D5NenBD8YlniS9IWlxHQaXlVSgj1Eao+IxSLERxdcUuu3ZBA2pCS4JsBpI2Pz29HQxly4rKtEQBRI0IM1gqEG1jO+DUVFxdTY2Ojquf2ukDjK9IWwUH79evnaz0viBtBF/bt2xdyAeKkIQjc68dcvxKJE+PC/Q701FCDCKL2ip5aFi9sUhi71y1LfEPaIgSYVX5DZBeMk+Wj+UN1Ene3tM0b+A8zISxLvBzGzBekDb8hx44d88XRx+oPPZqOxFZj6eb6taoySNJ+01PLYI9TBsxooeP2ou7e06QdLgSYzMT6vWzg5RPCqHEcPu/POu5zoKdG8rueWma2sP5B3KmpqZ7T5XuWtIUzdKf8hsgsGLeVFWZeMI2E/hsXUn67wHUb5mb2R6i/8KewHmLzT30Iw7LEDe4s9PX2u1yeI23skACa48VFMs368+JuAOaDXnxQoX+U3s+p1VPjhAT1hx9f/9kxU15Tr3qKtM0MAWbHYvB6G8L6BPpRkEI0+Sx249yxSad1s4K1DmHFC5YlniBtL/sNsW6Z2VczTjeCwKE2gf6bpTr78Bd6an48ZS3mOGXC+2egZQleIsO+2y0qJ8OkjVd4+Gk6flyx+TzeCs3Y2D4U06ePerEl61hGmOh49abX2mVmf+3Qf8PWFfMiLrrcspjtR8O6FqGfhuQHnFlPbR3OgTULyxIRxgyqWLjCcJOPn4hIGztRubLjbFe8fcVdM4CShiRc0O9z49H3qX7/KzRhfAqlKs5bIiVwu0KA2bck/NWS9vm8IHCvvzZzcoagp4ZJJl604iTDempnZgMnSzibgqmkcNIGweT06dO6OmS1QKuLtLH7LMq6k86cOklzpk6ilJE36Y4ofv6bb6h298v0tOK0/+xX52ldyfqwpmUiBFg0hRDStRpcmgnHdkE26CI/n9c/UQI7uD1l1wP6cbM6J8gaxK1NNTU1IQVPOwXasKSdn5dL1YpD8fvvXqiStUxqOPBnWrzqTxR/7XVUWv5o0CfY4kIAeiU+dsug7UxZtmrQhzufUvTh5EQuEDD8+2BD1SbhoE37f3YLtGg7JGmjwxlpM6lvz8toSWYGdWjf3jT8Kp+tpc21u6jq6WdaiNlNIcBMG2iUV8T2w60XgFZPzfcB7vw4QMIJCQkEPgqW3nvvvZZLeLsFWtGfoKSNjk9NnUwLZ0ym1HGjLUH34OG3aUFuoRo9HBsEbm2F8t+SBrlSRxGI1pd6Qk8N9Qf01Gx54+gy1NU4Ln9xEoJPcfypJfDs7Gz1UtJOgTaw021IGwR6Y9INtCF/CcVd3V/XII1mQjTxX948g379m5G0YcMGz7tMNIpDNJUTtsYgMT/6bsZcBuqp8VSabdy9u8q1JP7WW2/RFT26U1baFFsE2mCuJdqQ9sSU8TRj7Ahp/bXeKWpUIosvyF9D+xpeYdLWC5pP8vkpSgqmhPXUPlmYIYZht0A79rY7qGbHzjYuJVqRNnQ0zec+o9ysTFvRh3XJ5p17qKZ2h63tcmPuQcCr8QjZbt09a8jqnrhFoG0hbXw08+fMoj1bFKsOEy8d9QI5KfMuSp93e8S23Hrr53zeQUCEvoJkA9VCpPb9Vo+UX4hajbD76neTQNtC2nbvIoHTAjXJrGUr6XDj2+6bMe6RIwhoyRH6b6fdx2p9sbhxM3FkkqKgUbcJtCppQxeXvzyH9m59zNEpWLDiAYofNpwylaejnBgBLQJQQ8AdL9zHwsMjYmPaYcfPXg95HbpNoFVJe6pyBB03fLByGzrG0RmCGeDiwmJ68+AhR/vBjbsbAfX1meL/BOoTK57PB24Q7F/c3evByt65UaBt9/XXXzdfrhw9Tx7a64guOxDwq
274rWpJwk74rVyK/qjbTNM6t6li/DFD3h+FGwXadnV1dc3FawrpxYoNhhD+cezANuW+PfGOobpQaMGK+yn+F0mUmWmvBYvhDnNBVyBg9BEL66ldMX2u7ASEAqMC7d8+OEFX3zSOPn3nTerSqaM6vsb3jlHe2oepcl1hy/9FMnAh0LbLnDe3eWDv7jRv2uRIyrfkBWnLkHRgo2z+Z2gauJAGgXBuTVlPzctFDwJ4xSsj0D5e9Rx16nixqnY+e+4LSluUQ2vuWUx9r4zV03ybPEKgbTdh/Lhmmcc0ZpM267UNzScXCoGAeD7/6aef0k9+8hP67LPPVGdArKfmJRMOgfmZ80hGoAVRXzJwGH3w+m568y9H6NwXX9FsxUuq0SQE2naJQ65vXpuTTUMTBhmqy2zSPnHy7zTi1jl0vKnJUH+4ECMgENBGnO/WrRt1796d3n//fXaByktEFwJmWI0c+EsjbX5uB5VvebqVqkRXBwIyCYG2XWzv3s17FFO/2F5XGKmHAnXa9yycSysXLzRUlyiEOhWjFqk6uHD0IoDHOdXV1So5C+sSLRpa97GIPg/zQb74jt71Emrkw4YOIRmBVstnZYX5UlI26hICbbse3S9rfmtnNfXodqmhWTNb0kYnmLQNTUVUF9I+gxdEjcgj4ZLRcuHq5d97H4E+MTEkI9ACgZ1791Pdy6+qkvbf32qgyy79mRQw4MZ28YMGNm9cnWvYo5/ZpA3Pf4PHptLpMx9JDY4L+x8Bsx1OhZPQ/Y8oj1CLwOWKNz8ZgfYfn3xGVwxOUsn6kPLS+9CRd0zRQrQbPfI3zQun3Uyjkn5laMbMJm1+zm5oGqKmkFZPbZVr18A2nH4+HzWT67KBJsQNIhmBdt1jlfRvV8bQ2F8nqyNbsPw+mjt9MsUN6GdopEKgbZc+c2bz8EF9Ke2WFEMVmW2nXd/wGj205Xmqq99tqD9cyJ8IOCUF8+tIf64nPaMaM2okGRVohU329scfamkqmO22nn6IPEKgbafEYmzeX/cCbVxTEEl5y/KuLC6ldl26UX5+vmVtcMXeQMBt+ma27/bGujGrlxlpaSQj0JrVD1GPEGjb/fOf/2zuExtDnzS+bnYbhuqDPrti85awEdsNVc6FXI8ALDvgGAqhnhC1A5HdEabLbUn7khJ95Mg0bpsh+f5gHbpRoFUdRiG82IrbMygp8Xr5kUrUwDbaEuB5uKgIQVZfX+85G2qjz+c9PF1R03XcbbhRoFVJu6ysjI680UAb7lvu6IQUlW6kc9SeClevdrQf3Lg9CIjXiiDtlJQUNdgBbKu9mjjauldnLnS/3SjQtgRBkL0plZ0u3IyOmDZXdcuqx75Wtj0u7wwCIDa4VcWfgqjt8Itt92g5XqTdiFvTnhsF2hbSxu38pvJH6LmyB60ZfZhaF6/6I8VeHU/ZixY50j43ah0C2heIbtZTW4GAl1U/VuDhxTrdJtC2Cuzr1FEAumxEHkaoMS8fj724IK3qM5NVW2SjefOyap3ZUa/bBNpWpA3F+403DKdnS9ca9kUSKYif/+ucStiwGHGjlUCk44n2/H7TU1s1n9GiJrIKP7vrdZNA24q0AQQWU8aMabTziYepa+dOlmOjRmGfO58mTJxoeVvcgDUICAKCXTX01HB76kc9tTXoEWGjq6ysDOngyqp2uV79CLhJoG1D2hjGrro6Ksi9l54qLrRM4oaEPWtJHg28bghNmz6dpWz968cVOYWpG7zpQU+Np958UpKbGjue6Mv1MLpLNzY20q2pv6eXnnqMune7xHIwQgm0QUkbvVHDxs+dTWvvucuwr+1Qo4IOe/bSfMpafDclKw7pc3JyVIuRvLw81mlbvhSMN2BmTEbjvYiOkmY7w4oO1MwbJfgPJ0j84KEX/oSgcuWVV9JlP+tqi0A7YfIUSs+4rc2gQpI2cqKTcAQe1+8qWnHnXMPuW0WrkK7XV2yhHXsaqOLJza1ePeKJcEFBAWVlZfHrMvPWnik1sfmaK
TAarkQ8nweRp6amqvbsbBZrGM4LFgTWw4YNC5mnpqZG9b1uh0AbSmV8QdIWPS8rLaWi1YU0c9J49SfSgAmwwa5+YRc9tGkrpaVnqGZ9oRbdauVhDY4h69atY72oNetSV638UEQXTLZn4kDE1kM+f/58gn12YIIa8PDhw+p/2ynQBvZDF2mjEI7GqwsLqVJ5j9+1c0caPyJZefY+mGJ69WxD4lB/NJ08RUfePUrPvPiSEnHhlHo5lZdfoEtCACCLFGKPi4ujZcuWWT9L3ELLQoR5U21traqfZj21excG9N+CwHHpy+5jzZsrcF1CQoKqEtEmSNk45WiTnQKtaFc3aWs7Cn0PPu6GfS/TiRNNdOLDD1sNRAlhphwhYig+4VpKVcg6MTHREKK4VS8pKVF13UbrMNRwFBViPbX3JxtqE7wyxXeZlJTEQYslpxR4QtqGWhDfB5JWyg6s3k6BFm0bIm1JTCIqDkCg64ZkUahI+qzLiwi+kJmFnhq4iufkjK052DpZC98/yKGP0wvMLxWX1aopZkZGhlphMCk7WEt2CLSuJ20BDI4qsDIBwSAQK6fIEdDqqVkiixw/L5XgE1RkswW8IF3HKHEhtb78QdogYqHLjqxWa3J7hrTF8OHjFjpXSN3RbBcMvb+eByxC9wnMcOvNuk9rPiQ318o29ReeHQgzIGyoYZOTk1tlBpmDtN2knvUcaQNREBEuKkFa0WjbjUU0ZswYdfcPRdxsZeBmGnWub9rXq8J8UM/m71yPrW0ZViLw4w51iFfUg54kbTGNsKmEymTp0qVRY9uNS5IblQdJ+BNmkdnZ2S2rmsNhWfuB+632aPYTIwS/fv36ec5CzdOkLT4i2HYfPHiQShV7cj9LDVrCxthxo40LEqiMMH7WU/uNVu0ZT7R5ZBTCHlSsblJ76J1tX5A2Bgu9HfRSQ4cO9dzOqWeyIBnAdhTErU2wf8dJAwTOiRGQRUD7fF74lMFdiF8SBLxjx46pp1SvqEMCsfcNaYuB4chXVFSkXlR6cRcN9nGoHsYUlQh02YEJj48wVk6MgNkIYL3B/hsC0ahRo1T7b68SHb4hWIJgHJmZmWZDZWt9viNtoCdsu7HYvLSjwsYWP03HjysS9XF1IXz77bd05O236Ysvvgy6MKAOOn36tK2LhhuLPgTwmA4eHZGE+12voIBvCm89oD71g8WZL0lbLCbclMPKBLfkbrXthn6tXFlM2xWTvLhrBlDSkIRWrgHgt+Uj5eeDD0/RO0eP0ZlPPqP//OAEXXrppfTll1+qP3WKK93Ro0d75RuytZ/BNkLRgdjYPhTTp4+qWgp8nmxrJ13emBbD99//TxtdbT8AABBTSURBVPrHxx9Tj+7d6Uc/+pFiRupuDGFz3dTUpBK2X6Ji+Zq0xbeAizpICZC63bLTQne4KOtOOnPqJM2ZOolSRt6kO+jE+W++odrdL9PTil+Xs1+dp3Ul61mnrSG+cBuhyCp85DQefZ/q979CE8anUOrUqUzgCkDqy0rFDC6UMOF2DHHKhjokLS1NVev4KUUFaWPC3GTbnZ+XS9VVVXT/3QtVspZJDQf+TItX/Ynir72OSssf9Y00YQQT3giNoNa6DPTYECaav/1/NPPmsYaFiaa/n6G8gpWObIDCZ5GbhDT5mfmhhqghbTFkJ227oWvPSJtJfXteRksyM6hD+/amzWXls7W0uXYXVT39jK/NHkMBxhuh3FLC2gRZH3jjdVp77x8UD57XS1XYqHj4XLn+UfUkWFO7w5YLTIwB7zaQcDnvF3VI4EREHWkLAOy27YYUODV1Mi2cMZlSx1mjfz54+G1akFvYJsCE1Nfn8sK8EcpPEFQJUyf/nn4/5tc079Zb5CvU1KCeBO9/0PI1Ke6voiGIStSSNtaVXbbdIBZEc96Qv4Tiru5v6kcRWBkuLhHdvmbHTtXXiJ8Tb4Tyswt1SMbMGbRhZY7pYQVF73B3MD07h/JWrqLRivsFs5O4s8JTdD8/rhO4R
TVpCxCstu1GyLYZY0dI66/1LnYcTRfkr6F9Da/49ojIG6He1RA6n7rp/f4WJd7hAxFHo4q0dVyeT5q/mJatyG/jlCnSukT+UJ75jNbnlXJM2t/PFC4q8SjHbNtu6Fqbz31GuVn2GvTDumTzzj2qPtGPiTdCuVm1c9MTPUWM2BHT5phyClQvTBVz3mCe+eSQcX9pJu2AOcJiwGWGGbbdakT7ObNozxbFqsPES0e9y2pS5l2UPu92R27w9fbRSD7eCI2g1roM7lfGDR9i2f1KqB4e/a/jNH3Rcnrz4CHDp0AveuaTn7EfamDSDoEmFgZ8UMuYDdktDQYOBWqSWctW0uHGt81cM47WxRuhPPx43bip/BF6ruxB+coM1LB41R8p9up4NcB3JAmnYfgXivbYsUzaF1g1UJVA6jbitxuPE/KX59DerY9Fsi5Nz7tgxQMUP2w4ZSqL3Q+JN0K5WXRCLRLY4+8uyxdS3e6XdF8cet0zn9ystS7NpK0DTeG7IJTfbkjleCavtQudqjydHzd8sHL8NP+2XEeXW7LADHBxYbF6HPV64o1QfgaLlVfBJ949othi3y1fmUQNJU88Rf999mtaV1wSthaY5zY2NqpP0b3qsCrsICPIwKStEyxIKMXFxW38dkMCGDZsmGrMD497SMh7ueLI6eShvY7osgOHdNUNv1UtSbxuAsgboc7FeoFsw4YOobU52ZaZ9+ntIaTtwWNT6fSZj0IWESa5fvDMpxcXPfmYtPWgpMmjPpVWdHHw242oMSBs6FkhZb/33nsqMcKEsHhNIb1YsUF37R+eOk1FjzxO5VueVsuMHZFM2x9/SHf5C2VcsOJ+iv9FkqddUspuhGbj68WNECSYEBenChNG0o9jB7YqNm/aZFp6+2zq3fNyI9XRrybNpHUPPxLUhbI43eJOiX3Ft4aXSdvQciPCZQ703XiJJRI87cHj3vzMeTSwd3fCotabFiy/j2ZMGk+J18apRc6e+4K6dOqot/gF8/nB/M/IRqgFxWx83boRXijgM9R4R95ooA33LTe0rkDa3554p6XsvjcO0r43D9HKxQsN1VdUupHOUXsqVNQf2oTvCuPwk2c+QwCFKMSkbRBNLKo+iltPSIDaVKU4gqqu2hrxY5rAD8Jgt4IW84JeG0Ee4JEtlAtdIxuhFgyz8XXrRggcofeF/XKghJqh4Dt8UF9KuyXF0PIKhqEMrnjift8jFarqDkl45jPD3NbQAD1SiEnb4ERNnDhRlbYDEyxNevW8gkruXRyR3vBCix+/KyvMp8ycfLW5155/qkUi19N9PCMecescOq74FXZrwgYI1RPUSyCcQPKWtRqRIZdgmLl1IwRpQ7WABB/hWvIeM2okLZx2M41K+pWhZWA2acNm+5b5f6D3lPBf+JZKSkp8E6jAEMA6CzFp6wRKmw0LDKQdKnXq2JEO73ouoqfB4Uj7P+qepbgB/VS1ySUDh7U6puoZAupvbm7Wk9WRPIK0ReOQEkE4IjiB7AWa2aTt1o1QS9oCS0He8DC5cXWuYf83ZpM2Xkj+fPgY1e81kp8985n5UTFpS6AJyVAE2sVlJIz/xeI7uKOKBvbvq7v2cKSt1SUaISCUSU5ODtkfHKnxaMGpBClL4KftgyDvRVlZtEexeY/tdYWhLhrBLFxDqBMbi5tSZWVlm+DPon8XXdSB/rqnlmKUk6CRZDZpow+os6amxnevdo3gq7cMk7ZepCLIlxA3KGKJxkrS1mNeFW540N3DvNGqNFWJGAOdZmASpA2d9ls7q6lHt0sNdcEq0t63b5+h/gQWglrIDJPMYJI2NmS4LN26ZQvVlD9I/X/ex1CfzSZtOJG6NO6X9PXXre+FDHUuigoxaVsw2UZ0hxeybtB+LDBduzPvgYjMAb3wnD1QPYJTgfYxk5GNUDu1ZluPmLERWrD0SEvagqxhmoq/wz3witszDAc4MNt6xK0qJivmxcw6mbTNRPP7uozc0l/IjhgfC+y2d+7Zr7YQ6
UVkfcNr9NCW56mufrcFozWnSkHaIOtgntuMbITanpltp+3WjRCkjRNRZmamiqP2BaGRdanF0Gw77UDrEXNWkv9rYdK2YI7hlH1/3Qu0cU2BKbXLHu1XFpdSuy7dCJGp3ZpEENZQendZwjF73G7dCLH28F4gWDAAzH/z2Y8pN9sdfmgQIu/Vt/9GFYoenpN+BJi09WOlOycu1PrExtAnja/rLnOhjLKkjefCFZu3ePplmdkboezEeGEjDBwjHoJNHD+W/vpSW1NVWTyMlP9dxgLKXpKjbjKc9CPApK0fq4hyyuoPA4+lWuuRSDriF72h2RthJBgGy+vVjbBPTIyUFY4sbqI8LiF7Dfm14nvkjGG/2mb1xWv1MGlbNGOyT4bN6laop8Jm1W9nPWZuhDL99vJGmKM4NetE39DS+bNkIJAuW/1CHb3w6ltUVV0tXVe0VcCkbeGMy1o8yHYNFg4jps1V3bL6waUlb4SyK+K7p+I33jCcXnvuSerauZN8hQZr8OpJxeBwTS3GpG0qnK0r82qEEAshka6aN0JpCMlpn9rlW56hd5oUh1Dl5fKDicIamLQtnnSnjvQ4wo+97Q411Jg2OIPFw7W8et4I5SF2MnoNnq7DJeu+V17VHbVGfsT+qoFJ2+L5xAUajqPPlq41/AQ70i7iwwBhw2Kkf//+kRZ3fX7eCOWnCO4XYEmyZ8tjtqpJfqesy+w/LKXRY5yN6CSPoHM1MGnbgD1MrTJmTKOdTzxsyweiRmGfO58mXMCplQ3DtqwJ3gjNgRbeAFffl0/PKQJFh/btzan0ArUsXvUnJaDvICWg712Wt+XnBpi0bZrdXUpwhILce+mp4kLLJG5I2LOWKJ7xJk+h9IzbbBqZM83wRmgO7psqnqBNjz+mRma36mIS5n0Lcwup82WXU/F6c6IxmTN6b9bCpG3jvMET4Py5s2ntPXdF5GtbTxehw569NJ+yFt/tWwk7EIfbFJeebx06SDWPFvNGqGeRhMgDiXvRnXfQxqJ8w25bQzUPC6Zp2Tk0MOE6uuaaf/d0yDsJiE0tyqRtKpzhK4PJFRz6x/W7ilbcOdew1zrREqTr9RVbaMeeBqp4crOnXz2GR691Dvg037t3L8X27kVlq+7ljTBSADX5oeOemjqZkq6Pp7vnZUhL3ZCuYSWyueZFKn30MTWOKp7WC//oEl2N+qJM2g4tgbLSUipaXUgzlbiQ+InUTzQkmOoXdtFDm7ZSWnqGoidc5Atb7EimY8CAAWqMzs6dO1NM7/9Dv0gYxBthJAAG5IVVSfG6B6lciSW5MP1WNSxZpCoTkDXW5X3ry9UT39JlOSpZIxg2wslxkF6JCfq+KJO2PIaGa8BHsrqwkCoVJz9dO3ek8Yonv6TEwRTTq2cbEof6o+nkKTry7lF65sWX6ITy9ylTplBefkHUkTUAB3YXXXRRC/bwRT171ix6/LFHeSM0vCK/K4iL3oL8PNq0qZL6X3UljRuRRInKhhhuXe5+9QA1vHmQpkxOpcKiolYmffA+aJbvccnheb44k7ZLphDHR9ggN+x7WYk80kQnPvywVc9ie/dWnOTHUHzCtZSqkHViYqJLeu5MN4BXQkJCa4wU4kbU9qqtW3kjNGla4Oa1traWDrzxeth1OUpx/BTM+RNOQ4iog3BinOQRYNKWx5BrcACBbdu2EaLdBCZI3JDo8CdvhA5MTJAmWTVi7jwwaZuLJ9dmEwI5OTm0evXqNq1BZ4qIN1AdcXIeAVxwFimqklLlDoeTOQgwaZuDI9diMwKwHIE6CQmqIrz8xPE7mPN/m7vGzWkQwDyBsHlezFsWTNrmYck12YgAorCAqGFCBt8qTA42gq+zKQSuwKUmYlRyMg8BJm3zsOSaHEQAl13QndYpL085OY8Az4d1c8CkbR22XLPNCECyw+OlZYqjf07OIQBzzDGKQ6iqqipWi1gwDUzaFoDKVTqHANQmsBxJT093rhNR3DIIG0GacRnMD2msWQhM2tbgy
rU6iAATtzPgM2HbgzuTtj04cys2IwBzwLNnz/KDDptwx4Uj7OZhwcMStrWgM2lbiy/X7iAC0HHX19dTRUWFr6L3OAhp0KbxiAmXwDDt82PQDbfhzaTtthnh/piKANyOFhQUMKGYiuoPleFlKp6o49LRD8GjLYLJ1GqZtE2FkytzIwKwKMHlWGpqKl9QmjRBUIdAusajmby8PD7JmISrnmqYtPWgxHl8gQD03MeOHaN169axVCgxo3AihdMLyDraHZdJwGi4KJO2Yei4oBcRwKMP+C1JSUlhqTvCCVRdtipkjQTCZnVIhACalJ1J2yQguRpvIYBLSrgchbUDX56Fn7syJTACLnVB1mwdEh4vK3MwaVuJLtftagSE5Ig/EVUlOTnZ1f21u3OwuxabG/Bhz4l2z0Dw9pi03TEP3AsHEQBpQ5I8ePCgSt7RHscQeBQXF6v6f1zeRjseDi7NoE0zabttRrg/jiIAybK6urpF5w0PgtGS4Pu6pKRE9cyXlZXFahCXTjyTtksnhrvlLAIIWwb7Y/gxgfTtZ703/JJDX40EnyEYMyf3IsCk7d654Z65AAFYm+ABCVQnIG6oC7xu5gZdNYgaF7E4SSQlJakqELYGccGC09EFJm0dIHEWRgAIaAkc0uioUaM8o++FygOnBxA1kug7E7X31jaTtvfmjHvsAgSg/xXSKgh86NChqiTuFgsUkDR8guAhDE4JSLBNZ4naBYtHsgtM2pIAcnFGAAQOKVwQJFQOkGD79eunqlLwY+WFJp7pBxK0aB+bCOyqWaL2zzpl0vbPXPJIXISAVtJtbGwk6JHxf5DGu3fv3tJTrWQOPx7aC08QMcogoTw2BZGampoImwU2A0j62CBAzm6R9F00Fb7rCpO276aUB+R2BPSQMQi4S5cuusjd7ePl/pmLwP8HbQenicLENqEAAAAASUVORK5CYII=" /></picture></picture></picture></example></p><p><example picture=""><picture -="" a="" b="" e="" x=""><picture irreducible="" version=""><picture preheaders="" with="">Place anchors in Ap and Cp and hearts in A and C that use the token defined by
their respective dominating anchor. Convergence at the anchors is
implementation-defined, but relative to this initial convergence at the anchor,
convergence inside the natural loops headed by A and C behaves in the natural
way, based on a virtual loop counter. The transform of inserting an anchor in the preheader is easily generalized.<br /></picture></picture></picture></example></p><p><example picture=""><picture -="" a="" b="" e="" x=""><picture irreducible="" version=""><picture preheaders="" with="">To sum it up: We've concluded that defining an "iterating anchor"
convergence control intrinsic is problematic, but luckily also unnecessary. The
control intrinsics defined in the original proposal are sufficient. I hope that the
discussion that led to those conclusions helps illustrate some aspects of the convergence control proposal for LLVM as well as the
goals and principles that drove it.</picture></picture></picture></example></p>Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-76859028257648300122021-06-11T07:40:00.001+02:002021-06-14T11:26:26.064+02:00Can memcpy be implemented in LLVM IR?<p><i>(Added a section on load/store pairs on June 14th)</i> <br /></p><p>This question probably seems absurd. An unoptimized <span style="font-family: courier;">memcpy</span> is a simple loop that copies bytes. How hard can that be? Well...<br /><br />There's a <a href="https://lists.llvm.org/pipermail/llvm-dev/2021-June/150883.html" target="">fascinating thread on llvm-dev</a> started by George Mitenkov proposing a new family of "byte" types. I found the proposal and discussion difficult to follow. In my humble opinion, this is because the proposal touches some rather subtle and underspecified aspects of LLVM IR semantics, and rather than address those fundamentals systematically, it jumps right into the minutiae of the instruction set. I look forward to seeing how the proposal evolves. 
In the meantime, this article is a byproduct of me attempting to digest the problem space.<br /><br />Here is a fairly natural way to (attempt to) implement <span style="font-family: courier;">memcpy</span> in LLVM IR:<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">define void @memcpy(i8* %dst, i8* %src, i64 %n) {</span><br /><span style="font-family: courier;">entry:</span><br /><span style="font-family: courier;"> %dst.end = getelementptr i8, i8* %dst, i64 %n</span><br /><span style="font-family: courier;"> %isempty = icmp eq i64 %n, 0</span><br /><span style="font-family: courier;"> br i1 %isempty, label %out, label %loop</span><br /><span style="font-family: courier;"></span><br /><span style="font-family: courier;">loop:</span><br /><span style="font-family: courier;"> %src.loop = phi i8* [ %src, %entry ], [ %src.next, %loop ]</span><br /><span style="font-family: courier;"> %dst.loop = phi i8* [ %dst, %entry ], [ %dst.next, %loop ]</span><br /><span style="font-family: courier;"> %ch = load i8, i8* %src.loop</span><br /><span style="font-family: courier;"> store i8 %ch, i8* %dst.loop</span><br /><span style="font-family: courier;"> %src.next = getelementptr i8, i8* %src.loop, i64 1</span><br /><span style="font-family: courier;"> %dst.next = getelementptr i8, i8* %dst.loop, i64 1</span><br /><span style="font-family: courier;"> %done = icmp eq i8* %dst.next, %dst.end</span><br /><span style="font-family: courier;"> br i1 %done, label %out, label %loop</span><br /><span style="font-family: courier;"></span><br /><span style="font-family: courier;">out:</span><br /><span style="font-family: courier;"> ret void</span><br /><span style="font-family: courier;">}</span><br /></p><p><span style="font-family: courier;"></span>Unfortunately, the copy that is written to the destination is not a perfect copy of the source.<br /></p><p>Hold on, I hear you think, each byte of memory holds one of 256 possible bit patterns, and 
 this bit pattern is perfectly copied by the <span style="font-family: courier;">load</span>/<span style="font-family: courier;">store</span> sequence! The catch is that in LLVM's model of execution, a byte of memory can in fact hold more than just one of those 256 values. For example, a byte of memory can be <a href="https://llvm.org/docs/LangRef.html#poison-values">poison</a>, which means that there are at least 257 possible values. Poison is forwarded perfectly by the code above, so that's fine. The trouble starts because of pointer provenance.<br /><br /><br /></p><h2 style="text-align: left;">What and why is pointer provenance?</h2><p>From a machine perspective, a pointer is just an integer that is interpreted as a memory address.<br /><br />For the compiler, alias analysis -- that is, the ability to prove that different pointers point at different memory addresses -- is crucial for optimization. One basic tool in the alias analysis toolbox is to recognize that if pointers point into different "memory objects" -- different stack or heap allocations -- then they cannot alias.<br /><br />Unfortunately, many pointers are obtained via <a href="https://llvm.org/docs/LangRef.html#getelementptr-instruction"><span style="font-family: courier;">getelementptr</span> (GEP)</a> using dynamic (non-constant) indices. These dynamic indices could be such that the resulting pointer points into a different memory object than the base pointer. 
This makes it nearly impossible to determine at compile time whether two pointers point into the same memory object or not.<br /><br />Which is why there is a <a href="https://llvm.org/docs/LangRef.html#pointer-aliasing-rules">rule</a> which says (among other things) that if a pointer P obtained via GEP ends up going out-of-bounds and pointing into a different memory object than the pointer on which the GEP was based, then dereferencing P is undefined behavior even though the pointer's memory address is valid from the machine perspective.<br /><br />As a corollary, a situation is possible in which there are two pointers whose underlying memory address is identical but whose <i>provenance</i> is different. In that case, it's possible that one of them can be dereferenced while dereferencing the other is undefined behavior.<br /><br />This only makes sense if, in the formal semantics of LLVM IR, pointer values carry more information than just an integer interpreted as a memory address. They also carry <i>provenance information</i>, which is essentially the set of memory objects that can be accessed via this pointer and any pointers derived from it.<br /><br /><br /></p><h2 style="text-align: left;">Bytes in memory carry provenance information</h2><p>What is the provenance of a pointer that results from a <span style="font-family: courier;">load</span> instruction? In a clean operational semantics, the <span style="font-family: courier;">load</span> must derive this provenance from the values stored in memory.<br /><br />If bytes of memory can only hold one of 256 bit patterns (or poison), that doesn't give us much to work with. We could say that the provenance of the pointer is "empty", meaning the pointer cannot be used to access any memory objects -- but that's clearly useless. 
Or we could say that the provenance of the pointer is "all", meaning the pointer (or pointers derived from it) can be freely used to access <i>all</i> memory objects, assuming the underlying address is adjusted appropriately. That isn't much better.[0]<br /><br />Instead, we must say that -- as far as LLVM IR semantics are concerned -- each byte of memory holds pointer provenance information in addition to its <span style="font-family: courier;">i8</span> content. The provenance information in memory is written by pointer <span style="font-family: courier;">store</span>, and pointer <span style="font-family: courier;">load</span> uses it to reconstruct the original provenance of the loaded pointer.<br /><br />What happens to provenance information in non-pointer <span style="font-family: courier;">load</span>/<span style="font-family: courier;">store</span>? A <span style="font-family: courier;">load</span> can simply ignore the additional information in memory. For <span style="font-family: courier;">store</span>, I see 3 possible choices:<br /><br />1. Leave the provenance information that already happens to be in memory unmodified.<br />2. Set the provenance to "empty".<br />3. Set the provenance to "all".<br /><br />Looking back at our attempt to implement <span style="font-family: courier;">memcpy</span>, there is no choice which results in a perfect copy. All of the choices lose provenance information.<br /><br />Without major changes to LLVM IR, only the last choice is potentially viable because it is the only choice that allows dereferencing pointers that are loaded from the <span style="font-family: courier;">memcpy</span> destination.<br /><br /></p><h2 style="text-align: left;">Should we care about losing provenance information?</h2><p style="text-align: left;">Without major changes to LLVM IR, we can only implement a <span style="font-family: courier;">memcpy</span> that loses provenance information during the copy.<br /><br />So what? 
Alias analysis around <span style="font-family: courier;">memcpy</span> and code like it ends up being conservative, but reasonable people can argue that this doesn't matter. The burden of evidence lies on whoever wants to make a large change here in order to improve alias analysis.<br /><br />That said, we cannot just call it a day and go (or stay) home either, because there are related correctness issues in LLVM today, e.g. <a href="https://bugs.llvm.org/show_bug.cgi?id=37469">bug 37469</a> mentioned in the initial email of that llvm-dev thread.<br /><br />Here's a simpler example of a correctness issue using our hand-coded <span style="font-family: courier;">memcpy</span>:<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">define i32 @sample(i32** %pp) {</span><span style="font-family: courier;"></span><br /><span style="font-family: courier;"> %tmp = alloca i32*</span><br /><span style="font-family: courier;"> %pp.8 = bitcast i32** %pp to i8*</span><br /><span style="font-family: courier;"> %tmp.8 = bitcast i32** %tmp to i8*</span><span style="font-family: courier;"></span><br /><span style="font-family: courier;"> call void @memcpy(i8* %tmp.8, i8* %pp.8, i64 8)</span><br /><span style="font-family: courier;"></span><span style="font-family: courier;"> %p = load i32*, i32** %tmp</span><span style="font-family: courier;"></span><br /><span style="font-family: courier;"> %x = load i32, i32* %p</span><span style="font-family: courier;"></span><br /><span style="font-family: courier;"> ret i32 %x</span><span style="font-family: courier;"></span><br /><span style="font-family: courier;">}</span><br /></p><p style="text-align: left;">A transform that <i>should be</i> possible is to eliminate the <span style="font-family: courier;">memcpy</span> and temporary allocation:<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">define i32 @sample(i32** %pp) {</span><br /><span 
style="font-family: courier;"> %p = load i32*, i32** %pp</span><br /><span style="font-family: courier;"> %x = load i32, i32* %p</span><br /><span style="font-family: courier;"> ret i32 %x</span><br /><span style="font-family: courier;">}</span><br /></p><p style="text-align: left;"><span style="font-family: courier;"></span>This transform is incorrect because it introduces undefined behavior.<br /><br />To see why, remember that this is the world where we agree that integer stores write an "all" provenance to memory, so <span style="font-family: courier;">%p</span> in the original program has "all" provenance. In the transformed program, this may no longer be the case. If <span style="font-family: courier;">@sample</span> is called with a pointer that was obtained through an out-of-bounds GEP whose resulting address just happens to fall into a different memory object, then the transformed program has undefined behavior where the original program didn't.<br /><br />We could fix this correctness issue by introducing an <span style="font-family: courier;">unrestrict</span> instruction which elevates a pointer's provenance to the "all" provenance:<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">define i32 @sample(i32** %pp) {</span><br /><span style="font-family: courier;"> %p = load i32*, i32** %pp</span><br /><span style="font-family: courier;"> %q = unrestrict i32* %p</span><br /><span style="font-family: courier;"> %x = load i32, i32* %q</span><br /><span style="font-family: courier;"> ret i32 %x</span><br /><span style="font-family: courier;">}</span><br /></p><p style="text-align: left;"><span style="font-family: courier;"></span>Here, <span style="font-family: courier;">%q</span> has "all" provenance and therefore no undefined behavior is introduced.<br /><br />I believe that (at least for address spaces that are well-behaved?) 
it would be correct to fold <span style="font-family: courier;">inttoptr(ptrtoint(x))</span> to <span style="font-family: courier;">unrestrict(x)</span>. The two are really the same.<br /><br />For that reason, <span style="font-family: courier;">unrestrict</span> could also be used to fix the above-mentioned bug 37469. Several folks in the bug's discussion stated the opinion that the bug is caused by incorrect store forwarding that should be weakened via <span style="font-family: courier;">inttoptr(ptrtoint(x))</span>. <span style="font-family: courier;">unrestrict(x)</span> is simply a clearer spelling of the same idea.<br /><br /><br /></p><h2 style="text-align: left;">A dead end: integers cannot have provenance information</h2><p style="text-align: left;">A natural thought at this point is that the situation could be improved by adding provenance information to integers. This is technically correct: our hand-coded <span style="font-family: courier;">memcpy</span> would then produce a perfect copy of the memory contents.<br /><br />However, we would get into serious trouble elsewhere because global value numbering (GVN) and similar transforms become incorrect: two integers could compare equal using the <span style="font-family: courier;">icmp</span> instruction, but still be different because of different provenance. Replacing one by the other could result in miscompilation.<br /><br />GVN is important enough that adding provenance information to integers is a no-go.<br /><br />I suspect that the <span style="font-family: courier;">unrestrict</span> instruction would allow us to apply GVN to pointers, at the cost of making later alias analysis more conservative and sprinkling <span style="font-family: courier;">unrestrict</span> instructions that may inhibit other transforms. 
I have no idea what the trade-off is on that.<br /><br /><br /></p><h2 style="text-align: left;">The "byte" types: accurate representation of memory contents</h2><p style="text-align: left;">With all the above in mind, I can see the first-principles appeal of the proposed "byte" types. They allow us to represent the contents of memory accurately in SSA values, and so they fill a real gap in the expressiveness of LLVM IR.<br /><br />That said, the software development cost of adding a whole new family of types to LLVM is very high, so it better be justified by more than just aesthetics.<br /><br />Our hand-coded <span style="font-family: courier;">memcpy</span> can be turned into a perfect copier with straightforward replacement of <span style="font-family: courier;">i8</span> by <span style="font-family: courier;">b8</span>:<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">define void @memcpy(b8* %dst, b8* %src, i64 %n) {</span><br /><span style="font-family: courier;">entry:</span><br /><span style="font-family: courier;"> %dst.end = getelementptr b8, b8* %dst, i64 %n</span><br /><span style="font-family: courier;"> %isempty = icmp eq i64 %n, 0</span><br /><span style="font-family: courier;"> br i1 %isempty, label %out, label %loop</span><br /><span style="font-family: courier;"></span><br /><span style="font-family: courier;">loop:</span><br /><span style="font-family: courier;"> %src.loop = phi b8* [ %src, %entry ], [ %src.next, %loop ]</span><br /><span style="font-family: courier;"> %dst.loop = phi b8* [ %dst, %entry ], [ %dst.next, %loop ]</span><br /><span style="font-family: courier;"> %ch = load b8, b8* %src.loop</span><br /><span style="font-family: courier;"> store b8 %ch, b8* %dst.loop</span><br /><span style="font-family: courier;"> %src.next = getelementptr b8, b8* %src.loop, i64 1</span><br /><span style="font-family: courier;"> %dst.next = getelementptr b8, b8* %dst.loop, i64 1</span><br /><span 
style="font-family: courier;"> %done = icmp eq b8* %dst.next, %dst.end</span><br /><span style="font-family: courier;"> br i1 %done, label %out, label %loop</span><br /><span style="font-family: courier;"></span><br /><span style="font-family: courier;">out:</span><br /><span style="font-family: courier;"> ret void</span><br /><span style="font-family: courier;">}</span><br /></p><p style="text-align: left;"><span style="font-family: courier;"></span>Looking at the concrete choices made in the proposal, I disagree with some of them.<br /><br /><b>Memory should not be typed.</b> In the proposal, storing an integer always results in different memory contents than storing a pointer (regardless of its provenance), and implicitly trying to mix pointers and integers is declared to be undefined behavior. In other words, a sequence such as:<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">store i64 %x, i64* %p</span><br /><span style="font-family: courier;">%q = bitcast i64* %p to i8**</span><br /><span style="font-family: courier;">%y = load i8*, i8** %q</span><br /></p><p style="text-align: left;"><span style="font-family: courier;"></span>... is undefined behavior under the proposal instead of being effectively <span style="font-family: courier;">inttoptr(%x)</span>. That seems fine for C/C++, but is it going to be fine for other frontends?<br /><br />The corresponding distinction between bytes-as-integers and bytes-as-pointers complicates the proposal overall, e.g. it forces them to add a <span style="font-family: courier;">bytecast</span> instruction.<br /><br />Conversely, the benefits of the distinction are unclear to me. One benefit appears to be guaranteed non-aliasing between pointer and non-pointer memory accesses, but that is a form of type-based alias analysis which in LLVM should idiomatically be done via TBAA metadata. 
(Update: see the addendum for another potential argument in favor of typed memory.)<br /><br />So let's keep memory untyped, please.<br /><br /><b>Bitwise poison</b> in byte values makes me <i>really</i> nervous due to the arbitrary deviation from how poison works in other types. I don't see any justification for it in the proposal. I can kind of see how one could be motivated by implementing memcpy with vector intrinsics operating on, for example, <span style="font-family: courier;"><8 x b32></span>, but a simpler solution would be to just use <span style="font-family: courier;"><32 x b8></span> instead. And if poison is indeed bitwise, then certainly pointer provenance would also have to be bitwise!<br /><br />Finally, no design discussion is complete without a little bit of bike-shedding. I believe <b>the name "byte"</b> is inspired by C++'s <span style="font-family: courier;">std::byte</span>, but given that types such as <span style="font-family: courier;">b256</span> are possible, this name would forever be a source of confusion. Naming is hard, and I think we should at least try to look for a better one. Let me kick off the brainstorming by suggesting we think of them as "memory content" values, because that's what they are. The types could be spelled <span style="font-family: courier;">m8</span>, <span style="font-family: courier;">m32</span>, etc. in IR assembly.<br /><br /></p><h2 style="text-align: left;">A variation: adding a pointer provenance type</h2><p style="text-align: left;">In the llvm-dev thread, Jeroen Dobbelaere points out <a href="https://reviews.llvm.org/D68488">work being done to introduce explicit `ptr_provenance` operands on certain instructions</a>, in service of C99's <span style="font-family: courier;">restrict</span> keyword. I haven't properly digested this work, but it inspired the thoughts of this section.<br /><br />Values of the proposed byte types have both a bit pattern and a pointer provenance. 
Do we really need to have both pieces of information in the same SSA value? We could instead split them up into an integer bit pattern value and a pointer provenance value with an explicit <span style="font-family: courier;">provenance</span> type. Loads of integers could read out the provenance information stored in memory and provide it as a secondary result. Similarly, stores of integers could accept the desired provenance to be stored in memory as a secondary data operand. This would allow us to write a perfect memcpy by replacing the core <span style="font-family: courier;">load</span>/<span style="font-family: courier;">store</span> sequence with something like:<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">%ch, %provenance = load_with_provenance i8, i8* %src</span><br /><span style="font-family: courier;">store_with_provenance i8 %ch, provenance %provenance, i8* %dst</span><br /></p><p style="text-align: left;"><span style="font-family: courier;"></span>The syntax and instruction names in the example are very much straw men. Don't take them too seriously, especially because LLVM IR doesn't currently allow multiple result values.<br /><br />Interestingly, this split allows the derivation of pointer provenance to follow a different path than the calculation of the pointer's bit pattern. This in turn allows us in principle to perform GVN on pointers without being conservative for alias analysis.<br /><br />One of the steps in bug 37469 is not quite GVN, but morally similar. 
Simplifying a lot, the original program sequence:<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">%ch1 = load i8, i8* %p1</span><br /><span style="font-family: courier;">%ch2 = load i8, i8* %p2</span><br /><span style="font-family: courier;">%eq = icmp eq i8 %ch1, %ch2</span><br /><span style="font-family: courier;">%ch = select i1 %eq, i8 %ch1, i8 %ch2</span><br /><span style="font-family: courier;">store i8 %ch, i8* %p3</span><br /></p><p style="text-align: left;"><span style="font-family: courier;"></span>... is transformed into:<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">%ch2 = load i8, i8* %p2</span><br /><span style="font-family: courier;">store i8 %ch2, i8* %p3</span><br /></p><p style="text-align: left;"><span style="font-family: courier;"></span>This is correct for the bit patterns being loaded and stored, but the program also indirectly relies on pointer provenance of the data. Of course, there is no pointer provenance information being copied here because <span style="font-family: courier;">i8</span> only holds a bit pattern. 
However, with the "byte" proposal, all the <span style="font-family: courier;">i8</span>s would be replaced by <span style="font-family: courier;">b8</span>s, and then the transform becomes incorrect because it changes the provenance information.<br /><br />If we split the proposed use of <span style="font-family: courier;">b8</span> into a use of <span style="font-family: courier;">i8</span> and explicit provenance values, the original program becomes:<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">%ch1, %prov1 = load_with_provenance i8, i8* %p1</span><br /><span style="font-family: courier;">%ch2, %prov2 = load_with_provenance i8, i8* %p2</span><br /><span style="font-family: courier;">%eq = icmp eq i8 %ch1, %ch2</span><br /><span style="font-family: courier;">%ch = select i1 %eq, i8 %ch1, i8 %ch2</span><br /><span style="font-family: courier;">%prov = select i1 %eq, provenance %prov1, provenance %prov2</span><br /><span style="font-family: courier;">store_with_provenance i8 %ch, provenance %prov, i8* %p3</span><br /></p><p style="text-align: left;"><span style="font-family: courier;"></span>This could be transformed into something like:<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">%prov1 = load_only_provenance i8* %p1</span><br /><span style="font-family: courier;">%ch2, %prov2 = load_with_provenance i8, i8* %p2</span><br /><span style="font-family: courier;">%prov = merge provenance %prov1, %prov2</span><br /><span style="font-family: courier;">store_with_provenance i8 %ch2, provenance %prov, i8* %p3</span><br /></p><p style="text-align: left;"><span style="font-family: courier;"></span>... 
which is just as good for code generation but loses only very little provenance information.<br /><br /></p><h2 style="text-align: left;">Aside: loop idioms</h2><p style="text-align: left;">Without major changes to LLVM IR, a perfect <span style="font-family: courier;">memcpy</span> cannot be implemented because pointer provenance information is lost.<br /><br />Nevertheless, one could still define the <span style="font-family: courier;">@llvm.memcpy</span> intrinsic to be a perfect copy. This allows alias analysis to be less conservative around <span style="font-family: courier;">memcpy</span>s that appear in the original source program. However, it also makes it incorrect to replace a memcpy loop idiom with a use of <span style="font-family: courier;">@llvm.memcpy</span>: without adding <span style="font-family: courier;">unrestrict</span> instructions, the replacement may introduce undefined behavior; and there is no way to bound the locations where such <span style="font-family: courier;">unrestrict</span>s may be needed.<br /><br />We could augment <span style="font-family: courier;">@llvm.memcpy</span> with an immediate argument that selects its provenance behavior.<br /><br />In any case, one can argue that bug 37469 is really a bug in the loop idiom recognizer. It boils down to the details of how everything is defined, and unfortunately, these weird corner cases are currently underspecified in the LangRef.<br /><br /></p><h2 style="text-align: left;">Conclusion</h2><p style="text-align: left;">We started with the question of whether <span style="font-family: courier;">memcpy</span> can be implemented in LLVM IR. The answer is a qualified Yes. It is possible, but the resulting copy is imperfect because pointer provenance information is lost. 
This has surprising implications which in turn happen to cause real miscompilation bugs -- although those bugs could be fixed even without a perfect <span style="font-family: courier;">memcpy</span>.<br /><br />The "byte" proposal has a certain aesthetic appeal because it fixes a real gap in the expressiveness of LLVM IR, but its software engineering cost is large and I object to some of its details. There are also alternatives to consider.<br /><br />The miscompilation bugs obviously need to be fixed, but they can be fixed much less intrusively, albeit at the cost of more conservative alias analysis in the affected places. It is not clear to me whether improving alias analysis justifies the more complex solutions.<br /><br />I would like to understand better how all of this interacts with the C99 <span style="font-family: courier;">restrict</span> work. That work introduces mechanisms for explicitly talking about pointer provenance in the IR, which may allow us to kill two birds with one stone.<br /><br />In any case, this is a fascinating topic and discussion, and I feel like we're only at the beginning.<br /></p><p style="text-align: left;"><br /></p><h2 style="text-align: left;">Addendum: storing back previously loaded integers<br /></h2><p style="text-align: left;"><i>(Added this section on June 14th)</i> <br /></p><p style="text-align: left;">Harald van Dijk <a href="https://reviews.llvm.org/D104013#inline-987423">on Phabricator</a> and Ralf Jung on llvm-dev, referring to a <a href="https://github.com/rust-lang/unsafe-code-guidelines/issues/286#issuecomment-860189806">Rust issue</a>, explicitly and implicitly point out a curious issue with loading and storing integers.</p><p style="text-align: left;">Here is Harald's example:</p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">define i8* @f(i8* %p) {</span><br /><span style="font-family: courier;"> %buf = alloca i8*</span><br /><span style="font-family: courier;"> %buf.i32 
= bitcast i8** %buf to i32*</span><br /><span style="font-family: courier;"> store i8* %p, i8** %buf</span><br /><span style="font-family: courier;"> %i = load i32, i32* %buf.i32</span><br /><span style="font-family: courier;"> store i32 %i, i32* %buf.i32</span><br /><span style="font-family: courier;"> %q = load i8*, i8** %buf</span><br /><span style="font-family: courier;"> ret i8* %q</span><br /><span style="font-family: courier;">}</span><br /></p><p><span style="font-family: courier;"></span>There is a pair of <span style="font-family: courier;">load</span>/<span style="font-family: courier;">store</span> of <span style="font-family: courier;">i32</span> which is fully redundant from a machine perspective and so we'd like to optimize that away, after which it becomes obvious that the function really just returns <span style="font-family: courier;">%p</span> -- at least as far as bit patterns are concerned.</p><p class="remarkup-code" style="text-align: left;">However, in a world where memory is untyped but has provenance information, this optimization is incorrect because it can introduce undefined behavior: the <span style="font-family: courier;">load</span>/<span style="font-family: courier;">store</span> of <span style="font-family: courier;">i32</span> resets the provenance information in memory to "all", so that the original function returns an unrestricted version of <span style="font-family: courier;">%p</span>. This is no longer the case after the optimization.</p><p class="remarkup-code" style="text-align: left;">There are at least two possible ways of resolving this conflict.</p><p class="remarkup-code" style="text-align: left;">We could define memory to be typed, in the sense that each byte of memory remembers whether it was most recently stored as a pointer or a non-po<span>inter. A load with the wrong type returns poison. 
In that case, the example above returns poison before the optimization (because <span style="font-family: courier;">%i</span> is guaranteed to be poison). After the optimization it returns non-poison, which is an acceptable refinement, so the optimization is correct.</span></p><p class="remarkup-code" style="text-align: left;"><span>The alternative is to keep memory untyped and say that directly eliminating the <span style="font-family: courier;">i32</span> <span style="font-family: courier;">store</span> in the example is incorrect.</span></p><p class="remarkup-code" style="text-align: left;"><span>We are facing a tradeoff that depends on how important that optimization is for performance.</span></p><p class="remarkup-code" style="text-align: left;"><span>Two observations to that end. First, the more common case of dead store elimination is one where there are multiple stores to the same address in a row, and we remove all but the last one of them. That more common optimization is unaffected by provenance issues either way.</span></p><p class="remarkup-code" style="text-align: left;"><span>Second, we can still perform store forwarding / peephole optimization across such <span style="font-family: courier;">load</span>/<span style="font-family: courier;">store</span> pairs, as long as we are careful to introduce <span style="font-family: courier;">unrestrict</span> where needed. 
The example above can be optimized via store forwarding to:</span></p><p style="margin-left: 40px; text-align: left;"><span><span style="font-family: courier;">define i8* @f(i8* %p) {</span><br /><span style="font-family: courier;"> %buf = alloca i8*</span><br /><span style="font-family: courier;"> %buf.i32 = bitcast i8** %buf to i32*</span><br /><span style="font-family: courier;"> store i8* %p, i8** %buf</span><br /><span style="font-family: courier;"> %i = load i32, i32* %buf.i32</span><br /><span style="font-family: courier;"> store i32 %i, i32* %buf.i32</span><br /><span style="font-family: courier;"> %q = <b>unrestrict i8* %p</b></span><br /><span style="font-family: courier;"> ret i8* %q</span><br /><span style="font-family: courier;">}</span></span></p><p style="text-align: left;">We can then dead-code eliminate the bulk of the function and obtain:</p><p style="margin-left: 40px; text-align: left;"><span><span style="font-family: courier;">define i8* @f(i8* %p) {<br /></span><span style="font-family: courier;"> %q = <b>unrestrict i8* %p</b></span><br /><span style="font-family: courier;"> ret i8* %q</span><br /><span style="font-family: courier;">}</span></span></p><p style="text-align: left;">... which is as good as it can possibly get. </p><p style="text-align: left;">So, there is a good chance that preventing this particular optimization is relatively cheap in terms of code quality, and the gain in overall design simplicity may well be worth it.<br /></p><p style="text-align: left;"><br /></p><p style="text-align: left;"><br /></p><p style="text-align: left;"><br /></p><p style="text-align: left;">[0] We could also say that the loaded pointer's provenance is magically the memory object that happens to be at the referenced memory address. Either way, provenance would become a useless no-op in most cases. 
For example, mem2reg would have to insert <span style="font-family: courier;">unrestrict</span> instructions (defined later) everywhere because pointers become effectively "unrestricted" when loaded from alloca'd memory.<br /><br /></p><p></p>Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com1tag:blogger.com,1999:blog-36137506.post-82024415261336204532020-06-25T00:34:00.000+02:002020-06-25T00:34:24.405+02:00They want to be small, they want to be big: thoughts on code reviews and the power of patch seriesCode reviews are a central fact of life in software development. It's important to do them well, and developer quality of life depends on a good review workflow.<br />
<br />
Unfortunately, code reviews also appear to be a difficult problem. Many projects are bottlenecked by code reviews, in that reviewers are hard to find and progress gets slowed down by having to wait a long time for reviews.<br />
<br />
The "solution" that I've often seen applied in practice is to have lower-quality code reviews. Reviewers don't attempt to gain a proper understanding of a change, so reviews become shallower and therefore easier. This is convenient on the surface, but more likely to allow bad code through: a subtle corner case that isn't covered by tests (yet?) may be missed, a relevant underlying spec may be misunderstood, a bad design decision may slip through, and so on. This is bound to cause pain later on.<br />
<br />
I've experienced a number of different code review workflows in practice, based on a number of tools: GitHub PRs and GitLab MRs, Phabricator, and the e-mail review of patch series that is the <i>original</i> workflow for which Git was designed. Of those, the e-mail review flow produced the highest quality. There may be confounding factors, such as the nature of the projects and the selection of developers working on them, but quality issues aside I certainly feel that the e-mail review flow was the <i>most pleasant</i> to work with. Over time I've been thinking and having discussions a lot about just why that is. I feel that I have distilled it to two key factors, which I've decided to write down here so I can just link to this blog post in the future.<br />
<br />
First, the UI experience of e-mail is a lot nicer. All of the alternatives are ultimately web-based, and their UI latency is universally terrible. Perhaps I'm particularly sensitive, but I just cannot stand web UIs for serious work. Give me something that reacts to <i>all</i> input reliably in under 50ms and I will be much happier. E-mail achieves that, web UIs don't. Okay, to be fair, e-mail is probably more in the 100ms range given the general state of the desktop. The point is, web UIs are about an order of magnitude worse. It's incredibly painful. (All of this assumes that you're using a decent native e-mail client. Do yourself a favor and give that a try if you haven't. The CLI warriors all have their favorites, but frankly Thunderbird works just fine. Outlook doesn't.)<br />
<br />
Second, I've come to realize that there are conflicting goals in review granularity that e-mail happens to address pretty well, but none of the alternatives do a good job of it. Most of the alternatives don't even seem to understand that there is a problem to begin with! Here's the thing:<br />
<br />
<b>Reviews want to be small.</b> The smaller and the more self-contained a change is, the easier it is to wrap your head around and judge. If you do a big refactor that is supposed to have no functional impact, followed by a separate small functional change that is enabled by the refactor, then each change individually is much easier to review. Approving changes at a fine granularity also helps ensure that you've really thought through each individual change and that each change has a reasonable justification. Important details don't get lost in something larger.<br />
<br />
<b>Reviews want to be big.</b> A small, self-contained change can be difficult to understand and judge in isolation. You're doing a refactor that moves a function somewhere else? Fine, it's easy to tell that the change is correct, but is it a <i>good</i> change? To judge that, you often need to understand how the refactored result ends up being used in later changes, so it's good to see all those changes at once. Keep in mind though that you don't necessarily have to <i>approve</i> them at the same time. It's entirely possible to say, yes, that refactor looks good, we can go ahead with that, but please fix the way it's being used in a subsequent change.<br />
<br />
There is another reason why reviews want to be big. Code reviews have a mental context-switching overhead. As a reviewer, you need to think yourself into the affected code in order to judge it well. If you do many reviews, you typically need to context-switch between each review. This can be very taxing mentally and ultimately unpleasant. A similar, though generally smaller, context-switching overhead applies to the author of the change as well: let's say you send out some changes for review, then go off and do another thing, and come back a day or two later to some asynchronously written reviews. In order to respond to the review, you may now have to page the context of that change back in. The point of all this is that when reviews are big, the context-switching overhead gets amortized better, i.e. the cost per change drops.<br />
<br />
<b>Reviews want to be both small and big.</b> Guess what, patch series solve that problem! You get to review an entire feature implementation in the form of a patch series at once, so your context-switching overhead is reduced and you can understand how the different parts of the change play together. At the same time, you can drill down into individual patches and review those. Two levels of detail are available simultaneously.<br />
<br />
So why e-mail? Honestly, I don't know. Given that the original use case for Git is based on patch series review, it's mind-boggling in a bad way that web-based Git hosting and code review systems do such a poor job of it, if they handle it at all.<br />
<br />
<b>Gerrit</b> is the only system I know of that really takes patch series as an idea seriously, but while I am using it occasionally, I haven't had the opportunity to really stress it. Unfortunately, most people don't even want to consider Gerrit as an option because it's ugly.<br />
<br />
Phabricator's stacks are a pretty decent attempt and I've made good use of them in the context of LLVM. However, they're too hidden and clumsy to navigate. Both Phabricator and Gerrit lack a mechanism for discussing a patch series as a whole.<br />
<br />
GitHub and GitLab? They're basically unusable. Yes, you can look at individual commits, but then GitHub doesn't even display the commits in the correct order: they're sorted by commit or author date, not by the Git DAG order, which is an obviously and astonishingly bad idea. Comments tend to get lost when authors rebase, which is what authors <i>should</i> do in order to ensure a clean history, and actually <i>reviewing</i> an individual commit is impossible. Part of the power of patch series is the ability to say: "Patches 1-6, 9, and 11 are good to go, the rest needs work."<br />
<br />
Oh, and by the way: Commit messages? They're important! <b>Gerrit</b> again is the only system I know of that allows review comments on commit messages. It's as if the authors of Gerrit are the only ones who really understood the problem space. Unfortunately, they seem to lack business and web design skills, and so we ended up in the mess we're in right now.<br />
<br />
Mind you, even if the other players got their act together and supported the workflow properly, there'd still be the problem of UI latency. One can dream...Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com5tag:blogger.com,1999:blog-36137506.post-67831708975966198352019-02-05T11:03:00.000+01:002019-02-05T11:03:00.108+01:00FOSDEM talk on TableGenVideo and slides for my talk in the LLVM devroom on TableGen are now available <a href="https://fosdem.org/2019/schedule/event/llvm_tablegen/">here</a>.<br />
<br />
Now I only need the time and energy to continue my <a href="http://nhaehnle.blogspot.com/2018/02/tablegen-1-what-has-tablegen-ever-done.html">blog series</a> on the topic...Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-15366823612704181982018-03-09T11:25:00.000+01:002018-03-09T11:25:45.304+01:00TableGen #5: DAGs<i>This is the fifth part of a series; see the <a href="http://nhaehnle.blogspot.de/2018/02/tablegen-1-what-has-tablegen-ever-done.html">first part</a> for a table of contents.</i><br />
<br />
With <a href="http://nhaehnle.blogspot.de/2018/02/tablegen-3-bits.html">bit sequences</a>, we have already seen one unusual feature of TableGen that is geared towards its specific purpose. <i>DAG nodes</i> are another; they look a bit like S-expressions:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">def op1;<br />def op2;<br />def i32;<br /><br />def Example {<br /> dag x = (op1 $foo, (op2 i32:$bar, "Hi"));<br />}</span></blockquote>
In the example, there are two <i>DAG nodes</i>, represented by a <span style="font-family: "courier new" , "courier" , monospace;">DagInit</span> object in the code. The first node has as its <i>operation</i> the record <span style="font-family: "courier new" , "courier" , monospace;">op1</span>. The operation of a DAG node must be a record, but there are no other restrictions. This node has two children or <i>arguments</i>: the first argument is named <span style="font-family: "courier new" , "courier" , monospace;">foo</span> but has no value. The second argument has no name, but it does have another DAG node as its value.<br />
<br />
This second DAG node has the operation <span style="font-family: "courier new" , "courier" , monospace;">op2</span> and two arguments. The first argument is named <span style="font-family: "courier new" , "courier" , monospace;">bar</span> and has value <span style="font-family: "courier new" , "courier" , monospace;">i32</span>, the second has no name and value <span style="font-family: "courier new" , "courier" , monospace;">"Hi"</span>.<br />
<br />
DAG nodes can have any number of arguments, and they can be nested arbitrarily. The values of arguments can have any type, at least as far as the TableGen frontend is concerned. So DAGs are an extremely free-form way of representing data, and they are really only given meaning by TableGen backends.<br />
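For example, the following is accepted by the TableGen frontend even though no backend would give it any meaning (hypothetical records, purely to illustrate how free-form the syntax is):

```tablegen
def a;
def b;

def FreeForm {
  // Arguments can be nested dags, named or unnamed, and even name-only;
  // the frontend checks only the syntax, not any semantics.
  dag y = (a (b "leaf", 5):$inner, $unset);
}
```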
<br />
There are three main uses of DAGs:<br />
<ol>
<li>Describing the operands on machine instructions.</li>
<li>Describing patterns for instruction selection.</li>
<li>Describing register files with something called "set theory".</li>
</ol>
I have not yet had the opportunity to explore the last point in detail, so I will only give an overview of the first two uses here.<br />
<br />
Describing the operands of machine instructions is fairly straightforward at its core, but the details can become quite elaborate.<br />
<br />
I will illustrate some of this with the example of the <span style="font-family: "courier new" , "courier" , monospace;">V_ADD_F32</span> instruction from the AMDGPU backend. <span style="font-family: "courier new" , "courier" , monospace;">V_ADD_F32</span> is a standard 32-bit floating point addition, at least in its 32-bit-encoded variant, which the backend represents as <span style="font-family: "courier new" , "courier" , monospace;">V_ADD_F32_e32</span>.<br />
<br />
Let's take a look at some of the fully resolved records produced by the TableGen frontend:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">def V_ADD_F32_e32 { // Instruction AMDGPUInst ...<br /> dag OutOperandList = (outs anonymous_503:$vdst);<br /> dag InOperandList = (ins VSrc_f32:$src0, VGPR_32:$src1);<br /> string AsmOperands = "$vdst, $src0, $src1";<br /> ...<br />}</span><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;"><br /><br />def anonymous_503 { // DAGOperand RegisterOperand VOPDstOperand<br /> RegisterClass RegClass = VGPR_32;<br /> string PrintMethod = "printVOPDst";<br /> ...<br />}</span></span></blockquote>
As you'd expect, there is one <i>out</i> operand. It is named <i>vdst</i> and an anonymous record is used to describe more detailed information such as its register class (a 32-bit general purpose vector register) and the name of a special method for printing the operand in textual assembly output. (The string <span style="font-family: "courier new" , "courier" , monospace;">"printVOPDst"</span> will be used by the backend that generates the bulk of the instruction printer code, and refers to the method <span style="font-family: "courier new" , "courier" , monospace;">AMDGPUInstPrinter::printVOPDst</span> that is implemented manually.)<br />
<br />
There are two <i>in</i> operands. <i>src1</i> is a 32-bit general purpose vector register and requires no special handling, but <i>src0</i> supports more complex operands as described in the record <span style="font-family: "courier new" , "courier" , monospace;">VSrc_f32</span> elsewhere.<br />
<br />
Also note the string <span style="font-family: "courier new" , "courier" , monospace;">AsmOperands</span>, which is used as a template for the automatically generated instruction printer code. The operand names in that string refer to the names of the operands as defined in the DAG nodes.<br />
<br />
This was a nice warmup, but didn't really demonstrate the full power and flexibility of DAG nodes. Let's look at <span style="font-family: "courier new" , "courier" , monospace;">V_ADD_F32_e64</span>, the 64-bit encoded version,<span style="font-family: "courier new" , "courier" , monospace;"></span> which has some additional features: the sign bits of the inputs can be reset or inverted, and the result (output) can be clamped and/or scaled by some fixed constants (0.5, 2, and 4). This will seem familiar to anybody who has worked with the old <a href="https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_fragment_program.txt">OpenGL assembly program extensions</a> or with DirectX shader assembly.<br />
<br />
The fully resolved records produced by the TableGen frontend are quite a bit more involved:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">def V_ADD_F32_e64 { // Instruction AMDGPUInst ...<br /> dag OutOperandList = (outs anonymous_503:$vdst);<br /> dag InOperandList =<br /> (ins FP32InputMods:$src0_modifiers, VCSrc_f32:$src0,<br /> FP32InputMods:$src1_modifiers, VCSrc_f32:$src1,<br /> clampmod:$clamp, omod:$omod);<br /> string AsmOperands = "$vdst, $src0_modifiers, $src1_modifiers$clamp$omod";<br /> list<dag> Pattern =<br /> [(set f32:$vdst, (fadd<br /> (f32 (VOP3Mods0 f32:$src0, i32:$src0_modifiers,<br /> i1:$clamp, i32:$omod)),<br /> (f32 (VOP3Mods f32:$src1, i32:$src1_modifiers))))];<br /> ...<br />}<br /><br />def FP32InputMods { // DAGOperand Operand InputMods FPInputMods<br /> ValueType Type = i32;<br /> </span><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;">string PrintMethod = "printOperandAndFPInputMods";<br /> </span></span><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;"> AsmOperandClass ParserMatchClass = FP32InputModsMatchClass;<br /> ...<br />}</span><br /><br />def FP32InputModsMatchClass { // AsmOperandClass FPInputModsMatchClass<br /> string Name = "RegOrImmWithFP32InputMods";<br /> string PredicateMethod = "isRegOrImmWithFP32InputMods";<br /> string ParserMethod = "parseRegOrImmWithFPInputMods";<br /> ...<br />}</span></blockquote>
The <i>out</i> operand hasn't changed, but there are now many more special <i>in</i> operands that describe whether those additional features of the instruction are used.<br />
<br />
You can again see how records such as <span style="font-family: "courier new" , "courier" , monospace;">FP32InputMods</span> refer to manually implemented methods. Also note that the <span style="font-family: "courier new" , "courier" , monospace;">AsmOperands</span> string no longer refers to <i>src0</i> or <i>src1</i>. Instead, the <span style="font-family: "courier new" , "courier" , monospace;">printOperandAndFPInputMods</span> method on <i>src0_modifiers</i> and <i>src1_modifiers</i> will print the source operand together with its sign modifiers. Similarly, the special <span style="font-family: "courier new" , "courier" , monospace;">ParserMethod</span> <span style="font-family: "courier new" , "courier" , monospace;">parseRegOrImmWithFPInputMods</span> will be used by the assembly parser.<br />
<br />
This kind of extensibility by combining generic automatically generated code with manually implemented methods is used throughout the TableGen backends for code generation.<br />
<br />
Something else is new here: the <span style="font-family: "courier new" , "courier" , monospace;">Pattern</span>. This pattern, together with all the other patterns defined elsewhere, is compiled into a giant domain-specific bytecode that executes during instruction selection to turn the <a href="https://llvm.org/docs/CodeGenerator.html#instruction-selection-section">SelectionDAG</a> into machine instructions. Let's take this particular pattern apart:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">(set f32:$vdst, (fadd ...))</span></blockquote>
We will match an <span style="font-family: "courier new" , "courier" , monospace;">fadd</span> selection DAG node that outputs a 32-bit floating point value, and this output will be linked to the out operand <i>vdst</i>. (<span style="font-family: "courier new" , "courier" , monospace;">set</span>, <span style="font-family: "courier new" , "courier" , monospace;">fadd</span> and many others are defined in the target-independent <a href="https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Target/TargetSelectionDAG.td"><span style="font-family: "courier new" , "courier" , monospace;">include/llvm/Target/TargetSelectionDAG.td</span></a>.)<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">(fadd (f32 (VOP3Mods0 f32:$src0, i32:$src0_modifiers,<br /> i1:$clamp, i32:$omod)),<br /> (f32 (VOP3Mods f32:$src1, i32:$src1_modifiers)))</span></blockquote>
Both input operands of the fadd node must be 32-bit floating point values, and they will be handled by <i>complex</i> patterns. Here's one of them:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">def VOP3Mods { // ComplexPattern<br /> string SelectFunc = "SelectVOP3Mods";<br /> int NumOperands = 2;<br /> ...<br />} </span></blockquote>
As you'd expect, there's a manually implemented <span style="font-family: "courier new" , "courier" , monospace;">SelectVOP3Mods</span> method. Its signature is<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">bool SelectVOP3Mods(SDValue In, SDValue &Src,<br /> SDValue &SrcMods) const;</span></blockquote>
It can reject the match by returning <i>false</i>; otherwise, it pattern-matches a single input SelectionDAG node into nodes that will be placed into <i>src1</i> and <i>src1_modifiers</i> in the particular pattern we were studying.<br />
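To give a concrete feel for what such a method does, here is a toy sketch in the same spirit (the node type, the modifier constant, and the function are all simplified stand-ins, not the real AMDGPU implementation): it peels a modifier operation like <span style="font-family: "courier new" , "courier" , monospace;">fneg</span> off the input and records it as a bit in the modifiers operand.

```cpp
#include <string>

// Toy stand-in for a SelectionDAG node: just an opcode name and one operand.
struct Node {
  std::string op;
  const Node *operand;
};

// Hypothetical modifier bit; the real backend uses a dedicated bitfield.
constexpr unsigned kNegBit = 1;

// Sketch of a SelectVOP3Mods-style matcher: strip an fneg off the input,
// remember it in the modifiers word, and report the bare source operand.
bool selectMods(const Node &in, const Node *&src, unsigned &mods) {
  mods = 0;
  src = &in;
  if (in.op == "fneg" && in.operand != nullptr) {
    mods |= kNegBit;
    src = in.operand;
  }
  return true; // this toy matcher accepts every input
}
```

The real method also has to deal with the "reset sign bit" (absolute value) modifier mentioned above and with the different kinds of acceptable operands, but the peel-and-record structure is the same.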
<br />
Patterns can be arbitrarily complex, and they can be defined outside of instructions as well. For example, here's a pattern for generating the <span style="font-family: "courier new" , "courier" , monospace;">S_BFM_B32</span> instruction, which generates a bitfield mask:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;"> def anonymous_2373anonymous_2371 { // Pattern Pat ...<br /> dag PatternToMatch =<br /> (i32 (shl (i32 (add (i32 (shl 1, i32:$a)), -1)), i32:$b));<br /> list<dag> ResultInstrs = [(S_BFM_B32 ?:$a, ?:$b)];<br /> ...<br />}</span></blockquote>
The name of this record doesn't matter. The instruction selection TableGen backend simply looks for all records that have <span style="font-family: "courier new" , "courier" , monospace;">Pattern</span> as a superclass. In this case, we match an expression of the form <span style="font-family: "courier new" , "courier" , monospace;">((1 << a) - 1) << b</span> on 32-bit integers into a single machine instruction.<br />
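As a quick sanity check of the arithmetic this pattern encodes, here is the matched expression in plain C++ (the helper name is mine): <span style="font-family: "courier new" , "courier" , monospace;">((1 << a) - 1) << b</span> produces a mask of <i>a</i> consecutive one-bits starting at bit <i>b</i>, assuming the shifts stay within the 32-bit width.

```cpp
#include <cstdint>

// ((1 << a) - 1) << b: 'a' consecutive one-bits, shifted up by 'b' places.
// This is exactly the expression the S_BFM_B32 pattern above matches.
uint32_t bitfieldMask(uint32_t a, uint32_t b) {
  return ((UINT32_C(1) << a) - 1) << b;
}
```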
<br />
So far, we've mostly looked at how DAGs are interpreted by some of the key backends of TableGen. As it turns out, most backends generate their DAGs in a fairly static way, but there are some fancier techniques that can be used as well. This post is already quite long though, so we'll look at those in the next post.Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-7801569941191880772018-03-06T14:17:00.000+01:002018-03-06T14:17:33.270+01:00TableGen #4: Resolving variables<i>This is the fourth part of a series; see the <a href="http://nhaehnle.blogspot.de/2018/02/tablegen-1-what-has-tablegen-ever-done.html">first part</a> for a table of contents.</i><br />
<br />
It's time to look at some of the guts of TableGen itself. TableGen is split into a frontend, which parses the TableGen input, instantiates all the records, resolves variable references, and so on, and many different backends that generate code based on the instantiated records. In this series I'll be mainly focusing on the frontend, which lives in <span style="font-family: "courier new" , "courier" , monospace;">lib/TableGen/</span> inside the LLVM repository, e.g. <a href="https://github.com/llvm-mirror/llvm/tree/master/lib/TableGen">here on the GitHub mirror</a>. The backends for LLVM itself live in <a href="https://github.com/llvm-mirror/llvm/tree/master/utils/TableGen"><span style="font-family: "courier new" , "courier" , monospace;">utils/TableGen/</span></a>, together with the command line tool's main() function. Clang also has <a href="https://github.com/llvm-mirror/clang/tree/master/utils/TableGen">its own backends</a>.<br />
<br />
<i></i>
Let's revisit what kind of variable references there are and what kind of resolving needs to be done with an example:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">class Foo<int src> {<br /> int Src = src;<br /> int Offset = 1;<br /> int Dst = !add(Src, Offset);<br />}<br /><br />multiclass Foos<int src> {<br /> def a : Foo<src>;<br /> let Offset = 2 in<br /> def b : Foo<src>;<br />}<br /><br />foreach i = 0-3 in<br />defm F#i : Foos<i>;</span></blockquote>
This is actually broken in older versions of LLVM by one of its many bugs, but it clearly should work given the features that are generally available, and <a href="https://reviews.llvm.org/D44108">with my patch series</a> it does work in the natural way. We see four kinds of variable references:<br />
<ul>
<li>internally within a record, such as the initializer of <span style="font-family: "courier new" , "courier" , monospace;">Dst</span> referencing <span style="font-family: "courier new" , "courier" , monospace;">Src</span> and <span style="font-family: "courier new" , "courier" , monospace;">Offset</span></li>
<li>to a class template variable, such as <span style="font-family: "courier new" , "courier" , monospace;">Src</span> being initialized by <span style="font-family: "courier new" , "courier" , monospace;">src</span></li>
<li>to a multiclass template variable, such as <span style="font-family: "courier new" , "courier" , monospace;">src</span> being passed as a template argument for <span style="font-family: "courier new" , "courier" , monospace;">Foo</span></li>
<li>to a <span style="font-family: "courier new" , "courier" , monospace;">foreach</span> iteration variable</li>
</ul>
As an aside, keep in mind that <span style="font-family: "courier new" , "courier" , monospace;">let</span> in TableGen does not mean the same thing as in the many functional programming languages that have a similar construct. In those languages <span style="font-family: "courier new" , "courier" , monospace;">let</span> introduces a new variable, but TableGen's <span style="font-family: "Courier New", Courier, monospace;">let</span> instead overrides the value of a variable that has already been defined elsewhere. In the example above, the <span style="font-family: "courier new" , "courier" , monospace;">let</span>-statement causes the value of <span style="font-family: "courier new" , "courier" , monospace;">Offset</span> to be changed in the record that was instantiated from the <span style="font-family: "courier new" , "courier" , monospace;">Foo</span> class to create the <span style="font-family: "courier new" , "courier" , monospace;">b</span> prototype inside multiclass <span style="font-family: "courier new" , "courier" , monospace;">Foos</span>.<br />
<br />
TableGen internally represents variable references as instances of the <span style="font-family: "courier new" , "courier" , monospace;">VarInit</span> class, and the variables themselves are simply referenced by name. This causes some embarrassing issues around template arguments which are papered over by qualifying the variable name with the template name. If you pass the above example through a sufficiently fixed version of llvm-tblgen, one of the outputs will be the description of the Foo class:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">class Foo<int Foo:src = ?> {<br /> int Src = Foo:src;<br /> int Offset = 1;<br /> int Dst = !add(Src, Offset);<br /> string NAME = ?;<br />}</span></blockquote>
As you can see, <span style="font-family: "courier new" , "courier" , monospace;">Foo:src</span> is used to refer to the template argument. In fact, the template arguments of both classes and multiclasses are temporarily added as variables to their respective prototype records. When the class or prototype in a multiclass is instantiated, all references to the template argument variables are resolved fully, and the variables are removed (or rather, some of them are removed, and making that consistent is one of the many things I set out to clean up).<br />
<br />
Similarly, references to <span style="font-family: "courier new" , "courier" , monospace;">foreach</span> iteration variables are resolved when records are instantiated, although those variables aren't similarly qualified. If you want to learn more about how variable names are looked up, <span style="font-family: "courier new" , "courier" , monospace;">TGParser::ParseIDValue</span> is a good place to start.<br />
<br />
The order in which variables are resolved is important. In order to achieve the flexibility of overriding defaults with <span style="font-family: "courier new" , "courier" , monospace;">let</span>-statements, internal references among record variables must be resolved after template arguments.<br />
<br />
Actually resolving variable references used to be done by the implementations of the following virtual method of the <span style="font-family: "courier new" , "courier" , monospace;">Init</span> class hierarchy (which represents initializers, i.e. values and expressions):<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">virtual Init *resolveReferences(Record &R, const RecordVal *RV) const;</span></blockquote>
This method recursively resolves references in the constituent parts of the expression, performs constant folding, and returns the resulting value (or the original value if nothing could be resolved). Its interface is somewhat magical: <span style="font-family: "courier new" , "courier" , monospace;">R</span> represents the "current" record, which is used as a frame of reference for magical lookups in the implementation of <span style="font-family: "courier new" , "courier" , monospace;">!cast</span>; this is a topic for another time, though. At the same time, references to the variables of <span style="font-family: "courier new" , "courier" , monospace;">R</span> are supposed to be resolved, but only if <span style="font-family: "courier new" , "courier" , monospace;">RV</span> is null. If <span style="font-family: "courier new" , "courier" , monospace;">RV</span> is non-null, then only references to that specific variable are supposed to be resolved. Additionally, some behaviors around the unset value (<span style="font-family: "courier new" , "courier" , monospace;">?</span>) depend on this.<br />
<br />
This is replaced in my changes with<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">virtual Init *resolveReferences(Resolver &R) const;</span></blockquote>
where <span style="font-family: "courier new" , "courier" , monospace;">Resolver</span> is an abstract base class / interface which can lookup values based on their variable names:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">class Resolver {<br /> Record *CurRec;<br /><br />public:<br /> explicit Resolver(Record *CurRec) : CurRec(CurRec) {}<br /> virtual ~Resolver() {}<br /><br /> Record *getCurrentRecord() const { return CurRec; }<br /> virtual Init *resolve(Init *VarName) = 0;<br /> virtual bool keepUnsetBits() const { return false; }<br />}; </span></blockquote>
The "current record" is used as a reference for the aforementioned magical <span style="font-family: "courier new" , "courier" , monospace;">!cast</span>s, and <span style="font-family: "courier new" , "courier" , monospace;">keepUnsetBits</span> instructs the implementation of bit sequences in <span style="font-family: "Courier New", Courier, monospace;">BitsInit</span> not to resolve to ? (as was explained in <a href="https://nhaehnle.blogspot.de/2018/02/tablegen-3-bits.html">the third part of the series</a>). <span style="font-family: "courier new" , "courier" , monospace;">resolve</span> itself is implemented by one of the subclasses, most notably:<br />
<ol>
<li><span style="font-family: "courier new" , "courier" , monospace;">MapResolver</span>: Resolve based on a dictionary of name-value pairs.</li>
<li><span style="font-family: "courier new" , "courier" , monospace;">RecordResolver</span>: Resolve variable names that appear in the current record.</li>
<li><span style="font-family: "courier new" , "courier" , monospace;">ShadowResolver</span>: Delegate requests to an underlying resolver, but filter out some names.</li>
</ol>
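Here is a minimal sketch of that design (heavily simplified: the real resolvers operate on <span style="font-family: "courier new" , "courier" , monospace;">Init</span> values and record fields rather than plain strings, and the class shapes below are illustrative only):

```cpp
#include <map>
#include <set>
#include <string>

// Simplified Resolver interface: map a variable name to its value,
// returning the empty string when the name stays unresolved.
struct Resolver {
  virtual ~Resolver() {}
  virtual std::string resolve(const std::string &name) = 0;
};

// MapResolver: resolve from a dictionary of name-value pairs.
struct MapResolver : Resolver {
  std::map<std::string, std::string> values;
  std::string resolve(const std::string &name) override {
    auto it = values.find(name);
    return it != values.end() ? it->second : "";
  }
};

// ShadowResolver: delegate to an underlying resolver, but filter out
// shadowed names, so that an outer substitution cannot leak into a
// nested scope that reuses the same variable name.
struct ShadowResolver : Resolver {
  Resolver &inner;
  std::set<std::string> shadowed;
  explicit ShadowResolver(Resolver &r) : inner(r) {}
  std::string resolve(const std::string &name) override {
    return shadowed.count(name) ? "" : inner.resolve(name);
  }
};
```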
This last type of resolver is used by the implementations of <span style="font-family: "courier new" , "courier" , monospace;">!foreach</span> and <span style="font-family: "courier new" , "courier" , monospace;">!foldl</span> to avoid mistakes with nesting. Consider, for example:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">class Exclamation<list string=""><list<string> messages> {<br /> list<string> Messages = !foreach(s, messages, s # "!");<br />}<br /><br />class Greetings<list string=""><list<string> names><br /> : Exclamation&lt!foreach(s, names, "Hello, " # s)>;<br /><br />def : Greetings<["Alice", "Bob"]>;</list></string></list></span></blockquote>
This effectively becomes a nested <span style="font-family: "courier new" , "courier" , monospace;">!foreach</span>. The iteration variable is named <span style="font-family: "courier new" , "courier" , monospace;">s</span> in both, so when substituting <span style="font-family: "courier new" , "courier" , monospace;">s</span> for the outer <span style="font-family: "courier new" , "courier" , monospace;">!foreach</span>, we must ensure that we don't also accidentally substitute <span style="font-family: "courier new" , "courier" , monospace;">s</span> in the inner <span style="font-family: "courier new" , "courier" , monospace;">!foreach</span>. We achieve this by having <span style="font-family: "courier new" , "courier" , monospace;">!foreach</span> wrap the given resolver with a <span style="font-family: "courier new" , "courier" , monospace;">ShadowResolver</span>. The same principle applies to <span style="font-family: "courier new" , "courier" , monospace;">!foldl</span> as well, of course.Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-65537758253911904972018-02-23T01:31:00.000+01:002018-02-23T01:31:34.724+01:00TableGen #3: Bits<i>This is the third part of a series; see the <a href="http://nhaehnle.blogspot.de/2018/02/tablegen-1-what-has-tablegen-ever-done.html">first part</a> for a table of contents.</i><br />
<br />
One of the main backend uses of TableGen is describing target machine instructions, and that includes describing the binary encoding of instructions and their constituent parts. This requires a certain level of bit twiddling, and TableGen supports this with explicit <span style="font-family: "courier new" , "courier" , monospace;">bit</span> (single bit) and <span style="font-family: "courier new" , "courier" , monospace;">bits</span> (fixed-length sequence of bits) types:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">class Enc<bits<7> op> {<br /> bits<10> Encoding;<br /><br /> let Encoding{9-7} = 5;<br /> let Encoding{6-0} = op;<br />}<br /><br />def InstA : Enc<0x35>;<br />def InstB : Enc<0x08>;</span></blockquote>
... will produce records:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">def InstA { // Enc<br /> bits<10> Encoding = { 1, 0, 1, 0, 1, 1, 0, 1, 0, 1 };<br /> string NAME = ?;<br />}<br />def InstB { // Enc<br /> bits<10> Encoding = { 1, 0, 1, 0, 0, 0, 1, 0, 0, 0 };<br /> string NAME = ?;<br />}</span></blockquote>
So you can quite easily slice and dice bit sequences with curly braces, as long as the indices themselves are constants.<br />
<br />
But the real killer feature is that so-called <i>unset</i> initializers, represented by a question mark, aren't fully resolved in bit sequences:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">class Enc<bits<3> opcode> {<br /> bits<8> Encoding;<br /> bits<3> Operand;<br /><br /> let Encoding{0} = opcode{2};<br /> let Encoding{3-1} = Operand;<br /> let Encoding{5-4} = opcode{1-0};<br /> let Encoding{7-6} = { 1, 0 };<br />}<br /><br />def InstA : Enc<5>;</span></blockquote>
... produces a record:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">def InstA { // Enc<br /> bits<8> Encoding = { 1, 0, 0, 1, Operand{2}, Operand{1}, Operand{0}, 1 };<br /> bits<3> Operand = { ?, ?, ? };<br /> string NAME = ?;<br />}</span></blockquote>
So instead of going ahead and saying, hey, <span style="font-family: "courier new" , "courier" , monospace;">Operand{2}</span> is <span style="font-family: "courier new" , "courier" , monospace;">?</span>, let's resolve that and plug it into <span style="font-family: "courier new" , "courier" , monospace;">Encoding</span>, TableGen instead keeps the fact that bit 3 of <span style="font-family: "courier new" , "courier" , monospace;">Encoding</span> refers to <span style="font-family: "courier new" , "courier" , monospace;">Operand{2}</span> as part of its data structures. <br />
<br />
Together with some additional data, this allows a backend of TableGen to automatically generate code for instruction encoding and decoding (i.e., disassembling). For example, it will create the source for a giant C++ method with signature<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">uint64_t getBinaryCodeForInstr(const MCInst &MI, /* ... */) const;</span></blockquote>
which contains a giant constant array with all the fixed bits of each instruction followed by a giant switch statement with cases of the form:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">case AMDGPU::S_CMP_EQ_I32:<br />case AMDGPU::S_CMP_EQ_U32:<br />case AMDGPU::S_CMP_EQ_U64:<br /> // more cases...<br />case AMDGPU::S_SET_GPR_IDX_ON: {<br /> // op: src0<br /> op = getMachineOpValue(MI, MI.getOperand(0), Fixups, STI);<br /> Value |= op & UINT64_C(255);<br /> // op: src1<br /> op = getMachineOpValue(MI, MI.getOperand(1), Fixups, STI);<br /> Value |= (op & UINT64_C(255)) << 8;<br /> break;<br />}</span></blockquote>
The bitmasks and shift values are all derived from the structure of unset bits as in the example above, and some additional data (the operand DAGs) are used to identify the operand index corresponding to TableGen variables like <span style="font-family: "courier new" , "courier" , monospace;">Operand</span> based on their name. For example, here are the relevant parts of the <span style="font-family: "courier new" , "courier" , monospace;">S_CMP_EQ_I32</span> record generated by the AMDGPU backend's TableGen files:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;"> def S_CMP_EQ_I32 { // Instruction (+ other superclasses)<br /> field bits<32> Inst = { 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, src1{7}, src1{6}, src1{5}, src1{4}, src1{3}, src1{2}, src1{1}, src1{0}, src0{7}, src0{6}, src0{5}, src0{4}, src0{3}, src0{2}, src0{1}, src0{0} };<br /> dag OutOperandList = (outs);<br /> dag InOperandList = (ins SSrc_b32:$src0, SSrc_b32:$src1);<br /> bits</span><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;"><</span>8> src0 = { ?, ?, ?, ?, ?, ?, ?, ? };<br /> bits</span><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;"><</span>8> src1 = { ?, ?, ?, ?, ?, ?, ?, ? };<br /> // many more variables...<br />}</span></blockquote>
Note how <span style="font-family: "courier new" , "courier" , monospace;">Inst</span>, which describes the 32-bit encoding as a whole, refers to the TableGen variables <span style="font-family: "courier new" , "courier" , monospace;">src0</span> and <span style="font-family: "courier new" , "courier" , monospace;">src1</span>. The operand indices used in the calls to <span style="font-family: "courier new" , "courier" , monospace;">MI.getOperand()</span> above are derived from the <span style="font-family: "courier new" , "courier" , monospace;">InOperandList</span>, which contains nodes with the corresponding names. (<span style="font-family: "courier new" , "courier" , monospace;">SSrc_b32</span> is the name of a record that subclasses <span style="font-family: "courier new" , "courier" , monospace;">RegisterOperand</span> and describes the acceptable operands, such as registers and inline constants.)<br />
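Reading the fixed bits off <span style="font-family: "courier new" , "courier" , monospace;">Inst</span> above, the generated encoder case for <span style="font-family: "courier new" , "courier" , monospace;">S_CMP_EQ_I32</span> amounts to the following (a hand-written sketch with a made-up function name; the real generated code obtains the operand values through <span style="font-family: "courier new" , "courier" , monospace;">getMachineOpValue</span> rather than taking them directly):

```cpp
#include <cstdint>

// Per the record above: bits 31-24 hold the fixed pattern 10111111 (0xBF),
// bits 23-16 are fixed zeros, bits 15-8 hold src1, and bits 7-0 hold src0.
uint32_t encodeSCmpEqI32(uint8_t src0, uint8_t src1) {
  uint32_t value = UINT32_C(0xBF) << 24;
  value |= uint32_t(src1) << 8;
  value |= uint32_t(src0);
  return value;
}
```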
<br />
Hopefully this helped you appreciate just how convenient TableGen can be. Not resolving the <span style="font-family: "courier new" , "courier" , monospace;">?</span> in bit sequences is an odd little exception to an otherwise fairly regular language, but the resulting expressive power is clearly worth it. It's something to keep in mind when we discuss how variable references are resolved.Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-17782561810397563062018-02-21T11:22:00.000+01:002018-02-21T11:25:26.954+01:00TableGen #2: Functional Programming<i>This is the second part of a series; see the <a href="http://nhaehnle.blogspot.de/2018/02/tablegen-1-what-has-tablegen-ever-done.html">first part</a> for a table of contents.</i><br />
<br />
When the basic pattern of having classes with variables that are filled in via template arguments or <span style="font-family: "courier new" , "courier" , monospace;">let</span>-statements reaches the limits of its expressiveness, it can become useful to calculate values on the fly. TableGen provides string concatenation out of the box with the paste operator ('<span style="font-family: "courier new" , "courier" , monospace;">#</span>'), and there are built-in functions which can be easily recognized since they start with an exclamation mark, such as <span style="font-family: "courier new" , "courier" , monospace;">!add</span>, <span style="font-family: "courier new" , "courier" , monospace;">!srl</span>, <span style="font-family: "courier new" , "courier" , monospace;">!eq</span>, and <span style="font-family: "courier new" , "courier" , monospace;">!listconcat</span>. There is even an <span style="font-family: "courier new" , "courier" , monospace;">!if</span>-builtin and a somewhat broken and limited <span style="font-family: "courier new" , "courier" , monospace;">!foreach</span>.<br />
<br />
There is no way of defining new functions, but there is a pattern that can be used to make up for it: classes with <span style="font-family: "courier new" , "courier" , monospace;">ret</span>-values:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">class extractBit<int val, int bitnum> {<br /> bit ret = !and(!srl(val, bitnum), 1);<br />}<br /><br />class Foo<int val> {<br /> bit bitFour = extractBit<val, 4>.ret;<br />}<br /><br />def Foo1 : Foo<5>;<br />def Foo2 : Foo<17>;</span></blockquote>
This doesn't actually work in LLVM trunk right now because of the deficiencies around anonymous record instantiations that I mentioned in the first part of the series, but after a lot of refactoring and cleanups, I got it to work reliably. It turns out to be an extremely useful tool.<br />
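For comparison, here is what <span style="font-family: "courier new" , "courier" , monospace;">extractBit</span> computes, written as ordinary C++. For the two instantiations above, bit 4 of 5 is 0 and bit 4 of 17 is 1.

```cpp
// What extractBit<val, bitnum>.ret computes: shift right, mask off one bit.
// !srl is a logical shift right, !and a bitwise and.
int extractBit(int val, int bitnum) {
  return (val >> bitnum) & 1;
}
```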
<br />
In case you're wondering, this does not support recursion and it's probably better that way. It's possible that TableGen is already accidentally Turing complete, but giving it that power on purpose seems unnecessary and might lead to abuse.<br />
<br />
Without recursion, a number of builtin functions are required. There has been a <span style="font-family: "courier new" , "courier" , monospace;">!foreach</span> for a long time, and it is a very odd duck:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">def Defs {<br /> int num;<br />}<br /><br />class Example<list<int> nums> {<br /> list<int> doubled = !foreach(Defs.num, nums, !add(Defs.num, Defs.num));<br />}<br /><br />def MyNums : Example<[4, 1, 9, -3]>;</span></blockquote>
In many ways it does what you'd expect, except that having to define a dummy record with a dummy variable in this way is clearly odd and fragile. Until very recently it did not actually support everything you'd expect even then, and even with the recent fixes there are plenty of bugs. Clearly, this is how <span style="font-family: "courier new" , "courier" , monospace;">!foreach</span> should look instead:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">class Example<list<int> nums> {<br /> list<int> doubled =<br /> !foreach(x, nums, !add(x, x));<br />}<br /><br />def MyNums : Example<[4, 1, 9, -3]>;</span></blockquote>
... and that's what I've implemented.<br />
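In conventional programming terms, the fixed <span style="font-family: "courier new" , "courier" , monospace;">!foreach</span> is simply a map over a list; the <span style="font-family: "courier new" , "courier" , monospace;">MyNums</span> example corresponds to this C++ (the helper name is mine):

```cpp
#include <vector>

// Equivalent of: !foreach(x, nums, !add(x, x)) applied to [4, 1, 9, -3].
std::vector<int> doubled(const std::vector<int> &nums) {
  std::vector<int> out;
  out.reserve(nums.size());
  for (int x : nums)
    out.push_back(x + x); // !add(x, x)
  return out;
}
```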
<br />
This ends up being a breaking change (the only one in the whole series, hopefully), but <span style="font-family: "courier new" , "courier" , monospace;">!foreach</span> isn't actually used in upstream LLVM proper anyway, and external projects can easily adapt.<br />
<br />
A new feature that I have found very helpful is a fold-left operation:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">class Enumeration<list<string> items> {<br /> list<string> ret = !foldl([], items, lhs, item,<br /> !listconcat(lhs, [!size(lhs) # ": " # item]));<br />}<br /><br />def MyList : Enumeration<["foo", "bar", "baz"]>;</span></blockquote>
This produces the following record:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">def MyList { // Enumeration<br /> list<string> ret = ["0: foo", "1: bar", "2: baz"];<br /> string NAME = ?;<br />}</span></blockquote>
Needless to say, it was necessary to refactor the TableGen tool very deeply to enable this kind of feature, but I am quite happy with how it ended up.<br />
<br />
The title of this entry is "Functional Programming", and in a sense I lied. Functions are not first-class values in TableGen even with my changes, so one of the core features of functional programming is missing. But that's okay: most of what you'd expect to have and actually need is now available in a consistent manner, even if it's still clunkier than in a "real" programming language. And again: making functions first-class would immediately make TableGen Turing complete. Do we really want that?Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-87213102693909357742018-02-19T19:10:00.000+01:002019-02-05T11:04:32.443+01:00TableGen #1: What has TableGen ever done for us?<i>This is the first entry in an on-going series. Here's a list of all entries:</i><br />
<ol>
<li><i><a href="http://nhaehnle.blogspot.de/2018/02/tablegen-1-what-has-tablegen-ever-done.html">What has TableGen ever done for us?</a></i></li>
<li><i><a href="http://nhaehnle.blogspot.de/2018/02/tablegen-2-functional-programming.html">Functional Programming</a></i></li>
<li><i><a href="http://nhaehnle.blogspot.de/2018/02/tablegen-3-bits.html">Bits</a></i></li>
<li><i><a href="http://nhaehnle.blogspot.de/2018/03/tablegen-4-resolving-variables.html">Resolving variables</a></i></li>
<li><i><a href="http://nhaehnle.blogspot.de/2018/03/tablegen-5-dags.html">DAGs</a> </i></li>
<li><i>to be continued </i> </li>
</ol>
<i>Also: <a href="https://fosdem.org/2019/schedule/event/llvm_tablegen/">here</a> is a talk (slides + video) I gave in the FOSDEM 2019 LLVM devroom on TableGen.</i><br />
<br />
Anybody who has ever done serious backend work in LLVM has probably developed a love-hate relationship with TableGen. At its best it can be an extremely useful tool that saves a lot of manual work. At its worst, it will drive you mad with bizarre crashes, indecipherable error messages, and generally inscrutable failures to understand what you want from it.<br />
<br />
TableGen is an internal tool of the LLVM compiler framework. It implements a domain-specific language that is used to describe many different kinds of structures. These descriptions are translated to read-only data tables that are used by LLVM during compilation.<br />
<br />
For example, all of LLVM's intrinsics are described in TableGen files. Additionally, each backend describes its target machine's instructions, register file(s), and more in TableGen files.<br />
<br />
The unit of description is the <i>record</i>. At its core, a record is a dictionary of key-value pairs. Additionally, records are typed by their superclass(es), and each record can have a name. So for example, the target machine descriptions typically contain one record for each supported instruction. The name of this record is the name of the enum value which is used to refer to the instruction. A specialized backend in the TableGen tool collects all records that subclass the <span style="font-family: "courier new" , "courier" , monospace;">Instruction</span> class and generates instruction information tables that are used by the C++ code in the backend and the shared codegen infrastructure.<br />
<br />
The main point of the TableGen DSL is to provide an ostensibly convenient way to generate a large set of records in a structured fashion that exploits regularities in the target machine architecture. To get an idea of the scope, the X86 backend description contains ~47k records generated by ~62k lines of TableGen. The AMDGPU backend description contains ~39k records generated by ~24k lines of TableGen.<br />
<br />
To get an idea of what TableGen looks like, consider this simple example:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">def Plain {<br /> int x = 5;<br />}<br /><br />class Room<string name> {<br /> string Name = name;<br /> string WallColor = "white";<br />}<br /><br />def lobby : Room<"Lobby">;<br /><br />multiclass Floor<int num, string color> {<br /> let WallColor = color in {<br /> def _left : Room<num # "_left">;<br /> def _right : Room<num # "_right">;<br /> }<br />}<br /><br />defm first_floor : Floor<1, "yellow">;<br />defm second_floor : Floor</span><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;"><</span>2, "gray">;</span></blockquote>
This example defines 6 records in total. If you have an LLVM build around, just run the above through <span style="font-family: "courier new" , "courier" , monospace;">llvm-tblgen</span> to see them for yourself. The first one has name <span style="font-family: "courier new" , "courier" , monospace;">Plain</span> and contains a single value named <span style="font-family: "courier new" , "courier" , monospace;">x</span> of value <span style="font-family: "courier new" , "courier" , monospace;">5</span>. The other 5 records have <span style="font-family: "courier new" , "courier" , monospace;">Room</span> as a superclass and contain different values for <span style="font-family: "courier new" , "courier" , monospace;">Name</span> and <span style="font-family: "courier new" , "courier" , monospace;">WallColor</span>.<br />
<br />
The first of those is the record of <span style="font-family: inherit;">name</span> <span style="font-family: "courier new" , "courier" , monospace;">lobby</span>, whose <span style="font-family: "courier new" , "courier" , monospace;">Name</span> value is "<span style="font-family: "courier new" , "courier" , monospace;">Lobby</span>" (note the difference in capitalization) and whose <span style="font-family: "courier new" , "courier" , monospace;">WallColor</span> is "<span style="font-family: "courier new" , "courier" , monospace;">white</span>".<br />
<br />
Then there are four records with the names <span style="font-family: "courier new" , "courier" , monospace;">first_floor_left</span>, <span style="font-family: "courier new" , "courier" , monospace;">first_floor_right</span>, <span style="font-family: "courier new" , "courier" , monospace;">second_floor_left</span>, and <span style="font-family: "courier new" , "courier" , monospace;">second_floor_right</span>. Each of those has <span style="font-family: "courier new" , "courier" , monospace;">Room</span> as a superclass, but not <span style="font-family: "courier new" , "courier" , monospace;">Floor</span>. <span style="font-family: "courier new" , "courier" , monospace;">Floor</span> is a multiclass, and multiclasses are not classes (go figure!). Instead, they are simply collections of record prototypes. In this case, <span style="font-family: "courier new" , "courier" , monospace;">Floor</span> has two record prototypes, <span style="font-family: "courier new" , "courier" , monospace;">_left</span> and <span style="font-family: "courier new" , "courier" , monospace;">_right</span>. They are instantiated by each of the <span style="font-family: "courier new" , "courier" , monospace;">defm</span> directives. Note how even though <span style="font-family: "courier new" , "courier" , monospace;">def</span> and <span style="font-family: "courier new" , "courier" , monospace;">defm</span> look quite similar, they are conceptually different: one instantiates the prototypes in a multiclass (or several multiclasses), while the other creates a record that may or may not have one or more superclasses.<br />
<br />
The <span style="font-family: "courier new" , "courier" , monospace;">Name</span> value of <span style="font-family: "courier new" , "courier" , monospace;">first_floor_left</span> is <span style="font-family: "courier new" , "courier" , monospace;">"1_left</span>" and its <span style="font-family: "courier new" , "courier" , monospace;">WallColor</span> is "<span style="font-family: "courier new" , "courier" , monospace;">yellow</span>", overriding the default. This demonstrates the late-binding nature of TableGen, which is quite useful for modeling exceptions to an otherwise regular structure:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">class Foo {<br /> string salutation = "Hi";<br /> string message = salutation#", world!";<br />}<br /><br />def : Foo {<br /> let </span><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;">salutation</span> = "Hello";<br />}</span></blockquote>
The <span style="font-family: "courier new" , "courier" , monospace;">message</span> of the anonymous record defined by the <span style="font-family: "courier new" , "courier" , monospace;">def</span>-statement is <span style="font-family: "courier new" , "courier" , monospace;">"Hello, world!"</span>.<br />
<br />
There is much more to TableGen. For example, a particularly surprising but extremely useful feature is the bit sets that are used to describe instruction encodings. But that's for another time.<br />
<br />
For now, let me leave you with just one of the many ridiculous inconsistencies in TableGen:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">class Tag<int num> {<br /> int Number = num;<br />}<br /><br />class Test<int num> {<br /> int Number1 = Tag<5>.Number;<br /> int Number2 = Tag<num>.Number;<br /> Tag Tag1 = Tag<5>;<br /> Tag Tag2 = Tag<num>;<br />}<br /><br />def : Test<5>;</span></blockquote>
What are the values in the anonymous record? It turns out that <span style="font-family: "courier new" , "courier" , monospace;">Number1</span> and <span style="font-family: "courier new" , "courier" , monospace;">Number2</span> are both <span style="font-family: "courier new" , "courier" , monospace;">5</span>, but <span style="font-family: "courier new" , "courier" , monospace;">Tag1</span> and <span style="font-family: "courier new" , "courier" , monospace;">Tag2</span> refer to different records. <span style="font-family: "courier new" , "courier" , monospace;">Tag1</span> refers to an anonymous record with superclass <span style="font-family: "courier new" , "courier" , monospace;">Tag</span> and <span style="font-family: "courier new" , "courier" , monospace;">Number</span> equal to <span style="font-family: "courier new" , "courier" , monospace;">5</span>, while <span style="font-family: "courier new" , "courier" , monospace;">Tag2</span> also refers to an anonymous record, but with the <span style="font-family: "courier new" , "courier" , monospace;">Number</span> equal to an unresolved variable reference.<br />
<br />
This clearly doesn't make sense at all and is the kind of thing that sometimes makes you want to just throw it all out of the window and build your own DSL with blackjack and Python hooks. The problem with that kind of approach is that even if the new thing looks nicer initially, it'd probably end up in a similarly messy state after another five years.<br />
<br />
So when I ran into several problems like the above recently, I decided to take a deep dive into the internals of TableGen with the hope of just fixing a lot of the mess without reinventing the wheel. Over the next weeks, I plan to write a couple of focused entries on what I've learned and changed, starting with how a simple form of functional programming should be possible in TableGen.<br />
<br />
<h3>
Posits (2017-09-30)</h3>
The <a href="https://www.slideshare.net/insideHPC/beyond-floating-point-next-generation-computer-arithmetic">posit number system</a> is a proposed alternative to floating point numbers. Having heard of posits a couple of times now, I'd like to take the time to digest them and, in the second half, write a bit about their implementation in hardware. Their creator makes some bold claims about posits being simpler to implement, and - spoiler alert! - I believe he's mistaken. Posits are still a clever idea and may indeed be a good candidate for replacing floating point in the long run. But trade-offs are an inescapable fact of life.<br />
<br />
<h3>
Floating point revisited</h3>
In floating point, numbers are represented by a sign bit, an exponent, and a mantissa:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiN0Z10kXYRehI7uGHTf_gZaf-fjTW6q2HBB7IwTGnrHaip0dSafmuv1B0H2xpgn1LqGZFvKy036yIix7PTpgHfPAJy49gVVqPdTmxuRT58QSq6oY8DtWiZcGjvxq9IAjciz8kBSg/s1600/floatingpoint.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="24" data-original-width="336" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiN0Z10kXYRehI7uGHTf_gZaf-fjTW6q2HBB7IwTGnrHaip0dSafmuv1B0H2xpgn1LqGZFvKy036yIix7PTpgHfPAJy49gVVqPdTmxuRT58QSq6oY8DtWiZcGjvxq9IAjciz8kBSg/s1600/floatingpoint.png" /></a></div>
<br />
The value of a normal floating point number is <i>±1.m<sub>2</sub>*2<sup>e</sup></i> (actually, <i>e</i> is stored with a bias in order to be able to treat it like an unsigned number most of the time, but let's not get distracted by that kind of detail). By using an exponent, a wide range of numbers can be represented at a constant relative accuracy.<br />
<br />
There are some non-normal floating point numbers. When <i>e</i> is maximal, the number is either considered infinity or "not a number", depending on <i>m</i>. When <i>e</i> is minimal, it represents a <i>sub-normal</i> number: either a <i>denormal</i> or zero.<br />
<br />
Denormals can be confusing at first, but their justification is actually quite simple. Let's take single-precision floating point as an example, where there are 8 exponent bits and 23 mantissa bits. The smallest positive normal single-precision floating point number is <i>1.00000000000000000000000<sub>2</sub>*2<sup>-126</sup></i>. The next larger representable number is <i>1.00000000000000000000001<sub>2</sub>*2<sup>-126</sup></i>. Those numbers are not equal, but their difference is not representable as a normal single-precision floating point number. It would be rather odd if the difference between non-equal numbers were equal to zero, as it would be if we had to round the difference to zero!<br />
<br />
When <i>e</i> is minimal, the represented number is (in the case of single-precision floating point) <i>±0.m<sub>2</sub>*2<sup>-126</sup></i>, which means that the difference between the smallest normal numbers, <i>0.00000000000000000000001<sub>2</sub>*2<sup>-126</sup></i>, can still be represented.<br />
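To make the field layout concrete, here is a small Python sketch that decodes single-precision bit patterns by hand (the function name <span style="font-family: "courier new" , "courier" , monospace;">decode_float32</span> is mine, and infinity/NaN are left out for brevity):

```python
def decode_float32(bits):
    """Decode an IEEE-754 single-precision bit pattern: 1 sign bit,
    8 exponent bits (bias 127), 23 mantissa bits. Handles normals,
    denormals, and zero; e == 255 (infinity/NaN) is omitted."""
    sign = -1.0 if bits >> 31 else 1.0
    e = (bits >> 23) & 0xFF
    m = bits & 0x7FFFFF
    if e == 0:
        # Sub-normal (or zero): 0.m * 2^-126 -- no implicit leading 1.
        return sign * (m / 2**23) * 2.0**-126
    return sign * (1 + m / 2**23) * 2.0**(e - 127)

# The difference between the two smallest positive normal numbers is
# representable -- but only as a denormal (2^-149):
smallest_normal = decode_float32(0x00800000)   # 1.0...0 * 2^-126
next_normal = decode_float32(0x00800001)       # 1.0...1 * 2^-126
assert next_normal - smallest_normal == decode_float32(0x00000001)
```

Without denormals, that difference would have to round to zero even though the two numbers compare unequal, which is exactly the oddity described above.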
<br />
Note how with floating point numbers, the relative accuracy with which numbers can be represented is constant for almost the entire range of representable numbers. Once you get to sub-normal numbers, the accuracy drops very quickly. At the other end, the drop is even more extreme with a sudden jump to infinity.<br />
<br />
<h3>
Posits</h3>
The basic idea of posits is to vary the size of the mantissa and to use a variable-length hybrid encoding of the exponent that mixes unary with binary encodings. The variable-length exponent encoding is shorter for exponents close to zero, so that more bits of mantissa are available for numbers close to one.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNQ2qh4FrC5zAJt9i9xW4c1VQYomVq2TOrTbLPwOsdcpXiq_lwoN0foyYQalxo0bckHPWq6G5rYdhkxXkWgXnggktYbarUl8NdPzqkT96laO1REQkFqQKODr4hLpLHix1i6rEFKA/s1600/posit.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="24" data-original-width="336" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNQ2qh4FrC5zAJt9i9xW4c1VQYomVq2TOrTbLPwOsdcpXiq_lwoN0foyYQalxo0bckHPWq6G5rYdhkxXkWgXnggktYbarUl8NdPzqkT96laO1REQkFqQKODr4hLpLHix1i6rEFKA/s1600/posit.png" /></a></div>
<br />
Posits have a fixed number of binary exponent bits <i>e</i> (except in the extreme ranges), and a posit system is characterized by that number. A typical choice appears to be <i>es = 3</i>. The unary part of the exponent is encoded by the <i>r</i> bits. For positive posits, <i>10<sub>2</sub></i> encodes <i>0</i>, <i>110<sub>2</sub></i> encodes <i>1</i>, <i>01<sub>2</sub></i> encodes <i>-1</i>, <i>001<sub>2</sub></i> encodes <i>-2</i>, and so on. The overall encoded number is then <i>±1.m<sub>2</sub>*2<sup>r*2<sup>es</sup> + e</sup></i>.<br />
<br />
Let's look at some examples of 16-bit posits with <i>es = 3</i>.<br />
<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">10</span> <span style="color: blue;">000</span> 0000000000 is <i>1.0<sub>2</sub>*2<sup>0*2<sup>3</sup>+0</sup> = 1</i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">10</span> <span style="color: blue;">000</span> 1000000000 is <i>1.1<sub>2</sub>*2<sup>0*2<sup>3</sup>+0</sup> = 1.5</i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">10</span> <span style="color: blue;">001</span> 0000000000 is <i>1.0<sub>2</sub>*2<sup>0*2<sup>3</sup>+1</sup> = 2</i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">10</span> <span style="color: blue;">111</span> 0000000000 is <i>1.0<sub>2</sub>*2<sup>0*2<sup>3</sup>+7</sup> = 128</i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">110</span> <span style="color: blue;">000</span> 000000000 is <i>1.0<sub>2</sub>*2<sup>1*2<sup>3</sup>+0</sup> = 256</i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">1110</span> <span style="color: blue;">000</span> 00000000 is <i>1.0<sub>2</sub>*2<sup>2*2<sup>3</sup>+0</sup> = 65536</i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">111111111110</span> <span style="color: blue;">000</span> is <i>1.0<sub>2</sub>*2<sup>10*2<sup>3</sup>+0</sup> = 2<sup>80</sup></i>. Note that there is no mantissa anymore! The next larger number is:<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">111111111110</span> <span style="color: blue;">001</span> is <i>1.0<sub>2</sub>*2<sup>10*2<sup>3</sup>+1</sup> = 2<sup>81</sup></i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">111111111110</span> <span style="color: blue;">111</span> is <i>1.0<sub>2</sub>*2<sup>10*2<sup>3</sup>+7</sup> = 2<sup>87</sup></i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">1111111111110</span> <span style="color: blue;">00</span> is <i>1.0<sub>2</sub>*2<sup>11*2<sup>3</sup>+0</sup> = 2<sup>88</sup></i>. Now the number of binary exponent bits starts shrinking. The missing bits are implicitly zero, so the next larger number is:<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">1111111111110</span> <span style="color: blue;">01</span> is <i>1.0<sub>2</sub>*2<sup>11*2<sup>3</sup>+2</sup> = 2<sup>90</sup></i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">1111111111110</span> <span style="color: blue;">11</span> is <i>1.0<sub>2</sub>*2<sup>11*2<sup>3</sup>+6</sup> = 2<sup>94</sup></i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">11111111111110</span> <span style="color: blue;">0</span> is <i>1.0<sub>2</sub>*2<sup>12*2<sup>3</sup>+0</sup> = 2<sup>96</sup></i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">11111111111110</span> <span style="color: blue;">1</span> is <i>1.0<sub>2</sub>*2<sup>12*2<sup>3</sup>+4</sup> = 2<sup>100</sup></i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">111111111111110</span><span style="color: blue;"></span> is <i>1.0<sub>2</sub>*2<sup>13*2<sup>3</sup>+0</sup> = 2<sup>104</sup></i>. There are no binary exponent bits left, but the presentation in the slides linked above still allows for one larger normal number:<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">111111111111111</span><span style="color: blue;"></span> is <i>1.0<sub>2</sub>*2<sup>14*2<sup>3</sup>+0</sup> = 2<sup>112</sup></i>.<br />
Going in the other direction, we get:<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">01</span> <span style="color: blue;">111</span> 0000000000 is <i>1.0<sub>2</sub>*2<sup>-1*2<sup>3</sup>+7</sup> = 1/2 = 0.5</i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">01</span> <span style="color: blue;">000</span> 0000000000 is <i>1.0<sub>2</sub>*2<sup>-1*2<sup>3</sup>+0</sup> = 1/256 = 0.00390625</i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">001</span> <span style="color: blue;">000</span> 000000000 is <i>1.0<sub>2</sub>*2<sup>-2*2<sup>3</sup>+0</sup> = 1/65536 = 0.0000152587890625</i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">000000000001</span> <span style="color: blue;">111</span> is <i>1.0<sub>2</sub>*2<sup>-11*2<sup>3</sup>+7</sup> = 2<sup>-81</sup></i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">000000000001</span> <span style="color: blue;">000</span> is <i>1.0<sub>2</sub>*2<sup>-11*2<sup>3</sup>+0</sup> = 2<sup>-88</sup></i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">0000000000001</span> <span style="color: blue;">11</span> is <i>1.0<sub>2</sub>*2<sup>-12*2<sup>3</sup>+6</sup> = 2<sup>-90</sup></i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">0000000000001</span> <span style="color: blue;">00</span> is <i>1.0<sub>2</sub>*2<sup>-12*2<sup>3</sup>+0</sup> = 2<sup>-96</sup></i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">00000000000001</span> <span style="color: blue;">0</span> is <i>1.0<sub>2</sub>*2<sup>-13*2<sup>3</sup>+0</sup> = 2<sup>-104</sup></i>.<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">000000000000001</span> is <i>1.0<sub>2</sub>*2<sup>-14*2<sup>3</sup>+0</sup> = 2<sup>-112</sup></i>. This is the smallest positive normal number, since we have no choice but to treat 0 specially:<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">000000000000000</span> is 0.<br />
<br />
For values close to 1, the accuracy is the same as for half-precision floating point numbers (which have 5 exponent and 10 mantissa bits). Half-precision floating point numbers do have slightly higher accuracy at the extreme ends of their dynamic range, but the dynamic range of posits is <i>much</i> higher. This is a very tempting trade-off for many applications.<br />
<br />
By the way: if we had set <i>es = 2</i>, we would have higher accuracy for values close to 1, while still having a larger dynamic range than half-precision floating point.<br />
<br />
You'll note that we have not encountered an infinity. Gustafson's proposal here is to do away with the distinction between positive and negative zero and infinity. Instead, he proposes to think of the real numbers projectively and to use a two's complement representation, meaning that negating a posit is the same operation at the bit level as negating an integer. For example:<br />
<br />
<span style="color: red;">1</span> <span style="color: #b45f06;">111111111111111</span> is <i>-1.0<sub>2</sub>*2<sup>-14*2<sup>3</sup>+0</sup> = -2<sup>-112</sup></i>.<br />
<span style="color: red;">1</span> <span style="color: #b45f06;">10</span> <span style="color: blue;">000</span> 0000000000 is <i>-1.0<sub>2</sub>*2<sup>0*2<sup>3</sup>+0</sup> = -1</i>. The next smaller number (larger in absolute magnitude) is:<br />
<span style="color: red;">1</span> <span style="color: #b45f06;">01</span> <span style="color: blue;">111</span> 1111111111 is <i>-1.0000000001<sub>2</sub>*2<sup>0*2<sup>3</sup>+0</sup></i>.<br />
<span style="color: red;">1</span> <span style="color: #b45f06;">01</span> <span style="color: blue;">111</span> 1000000000 is <i>-1.1<sub>2</sub>*2<sup>0*2<sup>3</sup>+0</sup> = -1</i>.5<br />
<span style="color: red;">1</span> <span style="color: #b45f06;">000000000000001</span> is <i>-1.0<sub>2</sub>*2<sup>14*2<sup>3</sup>+0</sup> = -</i><i><i>2<sup>112</sup></i></i>.<br />
<br />
The bit pattern <span style="color: red;">1</span> <span style="color: #b45f06;">000000000000000</span> (which, like 0, is its own inverse in two's complement negation) would then represent infinity.<br />
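To make these decoding rules concrete, here is a Python sketch of a decoder for 16-bit posits with <i>es = 3</i> (my own illustration based on the description above; <span style="font-family: "courier new" , "courier" , monospace;">decode_posit16</span> is a made-up name, not reference code):

```python
def decode_posit16(bits, es=3, nbits=16):
    """Decode a 16-bit posit (es binary exponent bits) to a float."""
    mask = (1 << nbits) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (nbits - 1):
        return float("inf")          # the single infinity pattern
    sign = -1.0 if bits >> (nbits - 1) else 1.0
    if sign < 0:
        bits = -bits & mask          # two's complement negation
    # Unary part: a run of identical bits after the sign bit,
    # terminated by the opposite bit (or by the end of the posit).
    pos = nbits - 2                  # position of the next bit to read
    first = (bits >> pos) & 1
    run = 0
    while pos >= 0 and ((bits >> pos) & 1) == first:
        run += 1
        pos -= 1
    pos -= 1                         # skip the terminating bit
    r = run - 1 if first else -run
    # Binary exponent: up to es bits; missing bits are implicitly zero.
    e = 0
    for _ in range(es):
        e <<= 1
        if pos >= 0:
            e |= (bits >> pos) & 1
            pos -= 1
    # Mantissa: whatever bits remain form the fractional part of 1.m.
    mbits = pos + 1
    frac = 1.0
    if mbits > 0:
        frac += (bits & ((1 << mbits) - 1)) / 2**mbits
    return sign * frac * 2.0 ** (r * 2**es + e)

# A few of the worked examples from above:
assert decode_posit16(0b0100000000000000) == 1.0        # 0 10 000 0000000000
assert decode_posit16(0b0100001000000000) == 1.5        # 0 10 000 1000000000
assert decode_posit16(0b0110000000000000) == 256.0      # 0 110 000 000000000
assert decode_posit16(0b0000000000000001) == 2.0**-112  # smallest positive
assert decode_posit16(0b1100000000000000) == -1.0       # two's complement
```

Note how the shrinking exponent and mantissa fields at the extreme regimes fall out naturally from "pad missing bits with zeros".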
<br />
There's an elegance to thinking projectively in this way. Comparison of posits is the same as comparison of signed integers at the bit level (except for infinity, which is unordered). Even better, it's great that the smallest and largest normal numbers are multiplicative inverses of each other.<br />
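That bit-level correspondence is easy to check with the example patterns from above (a small Python demonstration of my own):

```python
def as_int16(bits):
    """Reinterpret a 16-bit pattern as a two's complement signed integer."""
    return bits - 0x10000 if bits & 0x8000 else bits

# Posit bit patterns from the examples above, listed in increasing
# numeric value: -1.5, -1, -2^-112, 0, 2^-112, 0.5, 1, 1.5, 2, 256, 2^112.
patterns = [
    0b1011111000000000,  # -1.5
    0b1100000000000000,  # -1
    0b1111111111111111,  # -2^-112
    0b0000000000000000,  # 0
    0b0000000000000001,  # 2^-112
    0b0011110000000000,  # 0.5
    0b0100000000000000,  # 1
    0b0100001000000000,  # 1.5
    0b0100010000000000,  # 2
    0b0110000000000000,  # 256
    0b0111111111111111,  # 2^112
]

# The signed-integer interpretations are in the same order as the
# posit values they encode.
ints = [as_int16(p) for p in patterns]
assert ints == sorted(ints)
```

Infinity (the pattern <span style="font-family: "courier new" , "courier" , monospace;">0x8000</span>, i.e. the most negative integer) is the one exception, matching the caveat that it is unordered.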
<br />
But to people used to floating point, not having a "sign + magnitude" representation is surprising. I also imagine that it could be annoying for a hardware implementation, so let's look into that.<br />
<br />
<h3>
Hardware implementations</h3>
In his presentations, Gustafson claims that by reducing the number of special cases, posits are easier to implement than floating point. No doubt there are fewer special cases (no denorms, no NaNs), but at the cost of a more complicated normal case.<br />
<br />
Let's take a look at a floating point multiply. The basic structure is conceptually quite simple, since all parts of a floating point number can be treated separately:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdwRMb9do2R6KOKC1BvSh0ieDidICQD1hjHc5UrqmgJucvRluZmqZfT4w06T-1mdJXcPI8450hnIWoqSlCWQySelxxijOHUnFwCGI9fvvfzY7hfWqTr8Bwj7X7IYK8Ks8CJ2DA4Q/s1600/fp-multiply.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="232" data-original-width="693" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdwRMb9do2R6KOKC1BvSh0ieDidICQD1hjHc5UrqmgJucvRluZmqZfT4w06T-1mdJXcPI8450hnIWoqSlCWQySelxxijOHUnFwCGI9fvvfzY7hfWqTr8Bwj7X7IYK8Ks8CJ2DA4Q/s1600/fp-multiply.png" /></a></div>
<br />
By far the most expensive part here is the multiplication of the mantissas. There are of course a bunch of special cases that need to be accounted for: the inputs could be zero, infinity, or NaN, and the multiplication could overflow. Each of these cases is easily detected and handled with a little bit of comparatively inexpensive boolean logic.<br />
<br />
Where it starts to get complicated is when handling the possibility that an input is denormal, or when the multiplication of two normal numbers results in a denormal.<br />
<br />
When an input is denormal, the corresponding input for the multiply is <i>0.m</i> instead of <i>1.m</i>. Some logic has to decide whether the most significant input bit to the multiply is 0 or 1. This could potentially add to the latency of the computation. Luckily, deciding whether the input is denormal is fairly simple, and only the most significant input bit is affected. Because of carries, the less significant input bits tend to be more critical for latency. Conversely, this means that the latency of determining the most significant input bit can be hidden well.<br />
<br />
On the output side, the cost is higher, both in terms of the required logic and in terms of the added latency, because a shifter is needed to shift the output into the correct position. Two cases need to be considered: When a multiplication of two normal numbers results in a denormal, the output has to be shifted to the right an appropriate number of places.<br />
<br />
When a denormal is multiplied by a normal number, the output needs to be shifted to the left or the right, depending on the exponent of the normal number. Additionally, the number of leading zeros of either the denormal input or of the multiplication output is required to determine the exponent of the final result. Since the area cost is the same either way, I would expect implementations to determine the leading zero of the denormal input, since that allows for better latency hiding.<br />
<br />
(The design space for floating point multipliers is larger than I've shown here. For example, you could deal with denormals by shifting their mantissa into place <i>before</i> the multiply. That seems like a waste of hardware considering that you cannot avoid the shifter after the multiply, but my understanding of hardware design is limited, so who knows.)<br />
<br />
So there is a bit more hardware required than just what is shown in the diagram above: a leading-zero-count and a shifter, plus a bit more random logic. But now compare to the effort required for a posit multiply:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi21AFMjROG7AvHGMD6VkM0NRlBMTzcNcnMXqlRg7b4QknylJr3eM2XJYvn-zTTJIGWJ1VSIeCF8S0Y7cEzUuEwoYHkZQy2aKb9u6BwA_wSG_PdZgvdF75kFa80c8eDsM2voPaEXg/s1600/posit-multiply.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="298" data-original-width="689" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi21AFMjROG7AvHGMD6VkM0NRlBMTzcNcnMXqlRg7b4QknylJr3eM2XJYvn-zTTJIGWJ1VSIeCF8S0Y7cEzUuEwoYHkZQy2aKb9u6BwA_wSG_PdZgvdF75kFa80c8eDsM2voPaEXg/s1600/posit-multiply.png" /></a></div>
<br />
First of all, there is unavoidable latency in front of the multiplier. Every single bit of mantissa input may be masked off, depending on the variable size of the exponent's unary part. The exponents themselves need to be decoded in order to add them, and then the resulting exponent needs to be encoded again. Finally, the multiplication result needs to be shifted into place; this was already required for floating point multiplication, but the determination of the shift becomes more complicated since it depends on the exponent size. Also, each output bit needs a multiplexer since it can originate from either the exponent or the mantissa.<br />
<br />
From my non-expert glance, here's the hardware you need in addition to the multiplier and exponent addition:<br />
<ul>
<li>two leading-bit counts to decode the unary exponent parts (floating-point multiply only needs a single leading-zero count for a denormal input)</li>
<li>two shifters to shift the binary input exponent parts into place</li>
<li>logic for masking the input mantissas</li>
<li>one leading bit encoder</li>
<li>one shifter to shift the binary output exponent part into place</li>
<li>one shifter to shift the mantissa into place (floating-point also needs this)</li>
<li>multiplexer logic to combine the variable-length output parts </li>
</ul>
Also note that the multiplier and mantissa shifter may have to be larger, since - depending on the value of <i>es</i> - the mantissa of posits close to 1 can be larger than the mantissa of floating point numbers.<br />
<br />
On the other hand, the additional shifters don't have to be large, since they only need to shift <i>es</i> bits. The additional hardware is almost certainly dominated by the cost of the mantissa multiplier. Still, the additional latency could be a problem - though obviously, I have no actual experience designing floating point multipliers.<br />
<br />
There's also the issue of the proposed two's complement representation for negative posits. This may not be too bad for the mantissa multiplication, since one can probably treat it as a signed integer multiplication and automatically get the correct signs for the resulting mantissa. However, I would expect some more overhead for the decoding and encoding of the exponent.<br />
<br />
The story should be similar for posit vs. floating point addition. When building a multiply-accumulate unit, the latency that is added for masking the input based on the variable exponent length can likely be hidden quite well, but there does not appear to be a way around the decoding and encoding of exponents.<br />
<br />
<h3>
Closing thoughts</h3>
As explained above, I expect posit hardware to be more expensive than floating point hardware. However, the gain in dynamic range and accuracy is nothing to sneeze at. It's worth giving posits a fair shot, since the trade-off may be worth it.<br />
<br />
There is a lot of legacy software that relies on floating point behavior. Luckily, a posit ALU contains all the pieces of a floating point ALU, so it should be possible to build an ALU that can do both at pretty much the cost of a posit-only ALU. This makes a painless transition feasible.<br />
<br />
Posits have an elegant design based on thinking about numbers projectively, but the lack of NaNs, the two's complement representation, and not having signed zeros and infinities may be alien to some floating point practitioners. I don't know how much of an issue this really is, but it's worth pointing out that a simple modification to posits could accommodate all these concerns. Using again the example of 16-bit posits with <i>es = 3</i>, we could designate bit patterns at the extreme ends as NaN and infinity:<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">111111111111111</span> is <i>+inf</i> (instead of <i>2<sup>112</sup></i>).<br />
<span style="color: red;">0</span> <span style="color: #b45f06;">000000000000001</span> is <i>+NaN</i> (instead of <i>2<sup>-112</sup></i>).<br />
We could then treat the sign bit independently, like in floating point, giving us <i>±0</i>, <i>±inf</i>, and <i>±NaN</i>. The neat properties related to thinking projectively would be lost, but the smallest and largest positive normal numbers would still be multiplicative inverses of each other. The hardware implementation may even be smaller, thanks to not having to deal with two's complement exponents.<br />
<br />
The inertia of floating point is massive, and I don't expect it to be unseated anytime soon. But it's awesome to see people rethinking such fundamental building blocks of computing and coming up with solid new ideas. Posits aren't going to happen quickly, if at all, but it's worth taking them seriously.<br />
<br />
<h3>
radeonsi: out-of-order rasterization on VI+ (2017-09-09)</h3>
I've been polishing a patch of Marek's to <a href="https://cgit.freedesktop.org/~nh/mesa/log/?h=out-of-order">enable out-of-order rasterization on VI+</a>. Assuming it goes through as planned, this will be the first time we're adding driver-specific drirc configuration options that are unfamiliar to the enthusiast community (there's <span style="font-family: "courier new" , "courier" , monospace;">radeonsi_enable_sisched</span> already, but Phoronix has reported on the sisched option often enough). So I thought it makes sense to explain what those options are about.<br />
<br />
<b>Background: Out-of-order rasterization</b><br />
<br />
Out-of-order rasterization is an optimization that can be enabled in some cases. Understanding it properly requires some background on how tasks are spread across <i>shader engines (SEs)</i> on Radeon GPUs.<br />
<br />
The frontends (vertex processing, including tessellation and geometry shaders) and backends (fragment processing, including rasterization and depth and color buffers) are spread across SEs roughly like this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVVY6jkzxBV2vPChmSChmNC1moFFsJ0x5-35zeIZ4_nJHopuj9kQnzUVV_R8bZ3qMuoU1RVgjY1M5F3Q7UDfeFXQ1px0DcESVLIAAJkU7ajp7eFrWn3HBHn5NOJXKOuBcBLH6s3w/s1600/ia-se-vp-fp.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="347" data-original-width="597" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVVY6jkzxBV2vPChmSChmNC1moFFsJ0x5-35zeIZ4_nJHopuj9kQnzUVV_R8bZ3qMuoU1RVgjY1M5F3Q7UDfeFXQ1px0DcESVLIAAJkU7ajp7eFrWn3HBHn5NOJXKOuBcBLH6s3w/s1600/ia-se-vp-fp.png" /></a></div>
<br />
(Not shown are the <i>compute units (CUs)</i> in each SE, which is where all shaders are actually executed.)<br />
<br />
The input assembler distributes primitives (i.e., triangles) and their vertices across SEs in a mostly round-robin fashion for vertex processing. In the backend, work is distributed across SEs by on-screen location, because that improves cache locality.<br />
<br />
This means that once the data of a triangle (vertex position and attributes) is complete, most likely the corresponding rasterization work needs to be distributed to other SEs. This is done by what I'm simplifying as the "crossbar" in the diagram.<br />
<br />
OpenGL is very precise about the order in which the fixed-function parts of fragment processing should happen. If one triangle comes after another in a vertex buffer and they overlap, then the fragments of the second triangle better overwrite the corresponding fragments of the first triangle (if they weren't rejected by the depth test, of course). This means that the "crossbar" may have to delay forwarding primitives from a shader engine until all earlier primitives (which were processed in another shader engine) have been forwarded. This only happens rarely, but it's still sad when it does.<br />
<br />
There are some cases in which the order of fragments doesn't matter. Depth pre-passes are a typical example: the order in which triangles are written to the depth buffer doesn't matter as long as the "front-most" fragments win in the end. Another example are some operations involved in stencil shadows.<br />
<br />
Out-of-order rasterization simply means that the "crossbar" does not delay forwarding triangles. Triangles are instead forwarded
immediately, which means that they can be rasterized out-of-order. With the in-progress patches, the driver recognizes cases where this optimization can be enabled safely.<br />
<br />
By the way #1: From this explanation, you can immediately deduce that this feature only affects GPUs with multiple SEs. So integrated GPUs are not affected, for example.<br />
<br />
By the way #2: Out-of-order rasterization is entirely disabled by setting <span style="font-family: "courier new" , "courier" , monospace;">R600_DEBUG=nooutoforder</span>.<br />
<br />
<br />
<b>Why the configuration options?</b><br />
<br />
There are some cases where the order of fragments <i>almost</i> doesn't matter. It turns out that the most common and basic type of rendering is one of these cases. This is when you're drawing triangles without blending and with a standard depth function like LEQUAL with depth writes enabled. Basically, this is what you learn to do in every first 3D programming tutorial.<br />
<br />
In this case, the order of fragments is mostly irrelevant because of the depth test. However, it <i>might</i> happen that two triangles have the exact same depth value, and then the order matters. This is very unlikely in common scenes though. Setting the option <span style="font-family: "courier new" , "courier" , monospace;">radeonsi_assume_no_z_fights=true</span> makes the driver assume that it indeed <i>never</i> happens, which means out-of-order rasterization can be enabled in the most common rendering mode!<br />
<br />
Some other cases occur with blending. Some blending modes (though not the most common ones) are <i>commutative</i> in the sense that from a purely mathematical point of view, the end result of blending two triangles together is the same no matter which order they're blended in. Unfortunately, <i>additive</i> blending (which is one of those modes) involves floating point numbers in a way where changing the order of operations can lead to different rounding, which leads to subtly different results. Using out-of-order rasterization would break some of the guarantees the driver has to give for OpenGL conformance.<br />
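The rounding problem with additive blending comes down to floating-point addition being commutative but not associative. A quick standalone illustration (plain Python, not driver code):

```python
# The order in which floating-point values are summed can change the
# rounding, and hence the final result, even though each individual
# addition is correctly rounded.
a, b, c = 1e20, -1e20, 1.0

left = (a + b) + c   # 0.0 + 1.0 == 1.0
right = a + (b + c)  # b + c rounds back to -1e20, so the total is 0.0

assert left != right
```

This is exactly why blending two triangles in a different order can produce subtly different pixels, even though additive blending is mathematically commutative.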
<br />
The option <span style="font-family: "courier new" , "courier" , monospace;">radeonsi_commutative_blend_add=true</span> tells the driver that you don't care about these subtle errors and will lead to out-of-order rasterization being used in some additional cases (though again, those cases are rarer, and many games probably don't encounter them at all).<br />
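For reference, drirc options like these typically live in <span style="font-family: "courier new" , "courier" , monospace;">~/.drirc</span>. A sketch of what a per-application entry enabling both options might look like — the game name and executable here are placeholders, and the exact schema may differ between Mesa versions:

```xml
<driconf>
    <device driver="radeonsi">
        <!-- Hypothetical entry; applications are matched by executable name. -->
        <application name="Some Game" executable="somegame">
            <option name="radeonsi_assume_no_z_fights" value="true"/>
            <option name="radeonsi_commutative_blend_add" value="true"/>
        </application>
    </device>
</driconf>
```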
<br />
<b>tl;dr</b><br />
<br />
Out-of-order rasterization can give a very minor boost on multi-shader engine VI+ GPUs (meaning dGPUs, basically) in many games by default. In most games, you should be able to set <span style="font-family: "courier new" , "courier" , monospace;">radeonsi_assume_no_z_fights=true</span> and <span style="font-family: "courier new" , "courier" , monospace;">radeonsi_commutative_blend_add=true</span> to get an additional very minor boost. Those options aren't enabled by default because they can lead to incorrect results. Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-43910725016150647892017-06-25T23:10:00.000+02:002017-06-25T23:10:11.814+02:00ARB_gl_spirv, NIR linking, and a NIR backend for radeonsi<a href="https://www.khronos.org/registry/spir-v/">SPIR-V</a> is the binary shader code representation used by Vulkan, and <a href="https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_gl_spirv.txt">GL_ARB_gl_spirv</a> is a recent extension that allows it to be used for OpenGL as well. Over the last weeks, I've been exploring how to add support for it in radeonsi.<br />
<br />
As a bit of background, here's an overview of the various relevant shader representations that Mesa knows about. There are some others for really old legacy OpenGL features, but we don't care about those. On the left, you see the SPIR-V to LLVM IR path used by radv for Vulkan. On the right is the path from GLSL to LLVM IR, plus a mention of the conversion from GLSL IR to NIR that some other drivers are using (i965, freedreno, and vc4).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjpoKJVaqQxc7_bPYnRXzTVBkZz28NEhbcMqw09mNvN_2Kzktg-vwVp50uczHaacONmoP93BanfTC3I_lFroKVYlmxkY4Qvx4Jrs0WFV9ZDWeL3JypAB5khgVhcJ1-bLYn5F8Y8PA/s1600/slang-diagram.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="423" data-original-width="454" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjpoKJVaqQxc7_bPYnRXzTVBkZz28NEhbcMqw09mNvN_2Kzktg-vwVp50uczHaacONmoP93BanfTC3I_lFroKVYlmxkY4Qvx4Jrs0WFV9ZDWeL3JypAB5khgVhcJ1-bLYn5F8Y8PA/s1600/slang-diagram.png" /></a></div>
For GL_ARB_gl_spirv, we ultimately need to translate SPIR-V to LLVM IR. A path for this exists, but it's in the context of radv, not radeonsi. Still, the idea is to reuse this path.<br />
<br />
Most of the differences between radv and radeonsi are in the ABI used by the shaders: the conventions by which the shaders on the GPU know where to load constants and image descriptors from, for example. The existing NIR-to-LLVM code needs to be adjusted to be compatible with radeonsi's ABI. I have mostly completed this work for simple VS-PS shader pipelines, which has the interesting side effect of allowing the GLSL-to-NIR conversion in radeonsi as well. We don't plan to use it soon, but it's nice to be able to compare.<br />
<br />
Then there's adding SPIR-V support to the driver-independent mesa/main code. This is non-trivial, because while GL_ARB_gl_spirv has been designed to remove a lot of the cruft of the old GLSL paths, we still need more supporting code than a Vulkan driver does. This still needs to be explored a bit; the main issue is that GL_ARB_gl_spirv allows using default-block uniforms, so the whole machinery around glUniform*() calls has to work, which requires setting up all the same internal data structures that are set up for GLSL programs. Oh, and it looks like assigning locations is required, too.<br />
<br />
My current plan is to achieve all this by re-using the GLSL linker, giving a final picture that looks like this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0WjbkHpCFu8pxzSSEgTpYAeKXqBdtg7jS5I-Im0hInv_LQv-fNjh1x-d_rA3lYuFt-wysmJm8H9adm4BQDO2GZPe1BMV8NIi8zfXytVK_DdfqwqGqmLJJno0EpdELUlONFxCYFA/s1600/slang-diagram-new.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="425" data-original-width="288" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0WjbkHpCFu8pxzSSEgTpYAeKXqBdtg7jS5I-Im0hInv_LQv-fNjh1x-d_rA3lYuFt-wysmJm8H9adm4BQDO2GZPe1BMV8NIi8zfXytVK_DdfqwqGqmLJJno0EpdELUlONFxCYFA/s1600/slang-diagram-new.png" /></a></div>
So the canonical path in radeonsi for GLSL remains GLSL -> AST -> IR -> TGSI -> LLVM (with an optional deviation along the IR -> NIR -> LLVM path for testing), while the path for GL_ARB_gl_spirv is SPIR-V -> NIR -> LLVM, with NIR-based linking in between. In radv, the path remains as it is today.<br />
<br />
Now, you may rightfully say that the GLSL linker is a huge chunk of subtle code, and quite thoroughly invested in GLSL IR. How could it possibly be used with NIR?<br />
<br />
The answer is that huge parts of the linker don't really care that much about the <i>code</i> in the shaders that are being linked. They only really care about the <i>variables</i>: uniforms and shader inputs and outputs. True, there are a bunch of linking steps that touch code, but most of them aren't actually needed for SPIR-V. Most notably, GL_ARB_gl_spirv doesn't require intrastage linking, and it explicitly disallows the use of features that only exist in compatibility profiles.<br />
<br />
So most of the linker functionality can be preserved simply by converting the relevant variables (shader inputs/outputs, uniforms) from NIR to IR, then performing the linking on those, and finally extracting the linker results and writing them back into NIR. This isn't too much work. Luckily, NIR reuses the GLSL IR type system.<br />
<br />
There are still parts that might need to look at the actual shader code, but my hope is that they are few enough that they don't matter.<br />
<br />
And by the way, some people might want to move the IR -> NIR translation to before linking, so this work would set a foundation for that as well.<br />
<br />
Anyway, I got a ridiculously simple toy VS-PS pipeline working correctly this weekend. The real challenge now is to find actual test cases...Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com2tag:blogger.com,1999:blog-36137506.post-4675803809125601372017-01-22T12:52:00.001+01:002017-01-22T12:52:48.994+01:00Our prosperity rests on inherited knowledge and active brains<br />
Stefan Pietsch recently wrote <a href="http://www.deliberationdaily.de/2017/01/die-maer-vom-recht-auf-arbeit/">an essay</a> in which he comments, in a somewhat free-associative way, on a range of topics around the theme of "work". The text is a bit too unfocused for a comprehensive reply, but I'd like to address a few points on which we don't quite agree.<br />
<br />
Let me start at the end, with the question of what our prosperity is actually based on. Looking at it in historical comparison, Pietsch overlooks a crucial factor: our prosperity rests essentially on the scientific and technological achievements that we inherited from our ancestors and continue to develop today.<br />
<br />
This has interesting consequences. Since that knowledge and those technologies were inherited by humanity as a whole, no individual can simply claim the resulting profits for themselves. True, by turning the inherited knowledge into practical applications, the individual has contributed their part. But that line of reasoning can only justify relative prosperity compared to others who contributed less to the practical application of the inherited knowledge. It can <i>not</i> justify <i>absolute</i> prosperity, because absolute prosperity does not derive from the personal contribution. This tension can be resolved by decidedly "left-wing" legislation that ensures everyone enjoys high absolute prosperity, even if relative differences remain (which will then naturally be smaller).<br />
<br />
Of course, our prosperity doesn't rest on inherited knowledge alone. Pietsch names a few further points, but they don't really get to the root of it. To a very large extent, our prosperity depends on including and using as large a share as possible of the human brains walking this planet, as well as possible, in society's productive processes.<br />
<br />
That naturally includes the freedom Pietsch mentions: brains that want to accomplish constructive things of their own accord must be allowed to do so.<br />
<br />
It also means encouraging brains that might not accomplish constructive things entirely of their own accord. The respect for property that Pietsch mentions is one such incentive system. But looking at property from this perspective, one immediately sees that tradeoffs arise quickly. Cutting top tax rates, for example, is often justified as an incentive. Yet one has to press the point: how strong is the incentive effect of those tax cuts, really? And wouldn't it be better to mobilize the brains of the broad majority instead of demoralizing them with obvious and growing inequality?<br />
<br />
Focusing on active brains casts a different light on many political topics. Pietsch cites, for example, a self-employment rate of around 23% in the Germany of the early 1960s and sees laudable personal initiative in it. (He neglects to mention, by the way, the fact, visible in <a href="http://www2.soziologie.uni-halle.de/publikationen/pdf/0504.pdf">his own source</a>, that the rate in Germany remained higher than in the USA until the year 2000, even though the latter is commonly perceived as far more entrepreneurial. Never trust a statistic you didn't cherry-pick yourself.)<br />
<br />
But in that high self-employment rate I also see companies averaging four people each. No deep specialization is possible there, which in turn puts a certain limit on the optimal use of brains.<br />
<br />
I also see the potential for bogus self-employment, or self-employment out of necessity. A large share of mental energy is wasted there, because people have to wrestle with worries that good left-wing policy could take off their shoulders.<br />
<br />
In the end, it is the inclusion of as many people as possible in productive, creative processes at a high level that is essential for our prosperity. That inclusion can take the form of entrepreneurship, but often that is simply the wrong path.
<br />
Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-28897622540355191702016-12-18T22:27:00.000+01:002016-12-18T22:29:56.699+01:00Consciousness is not a quantum phenomenonConsciousness is weird. Quantum physics is weird. There's this temptation to conclude that therefore, they must be related, but <a href="http://www.smbc-comics.com/comic/the-talk-4">they're almost certainly not</a>. In fact, I want to explore an argument that consciousness <b>cannot</b> be a quantum phenomenon but must necessarily be classical.<br />
<br />
We don't really understand consciousness, but many people have tried. A compelling argument pushed by <a href="https://en.wikipedia.org/wiki/Douglas_Hofstadter">Douglas Hofstadter</a> is that consciousness is a product of certain sufficiently complex "strange loops", self-reflexive processes. Put simply, the mind observes the body, but it also observes itself observing the body, and it observes itself observing itself observing the body, and so on. On some level, this endlessly reflexive process of self-observation <b>is</b> consciousness.<br />
<br />
Part of this self-reflexive process is the simulation of simplified models of the world, including ourselves and other people. This allows us to anticipate the effects of possible actions, which helps us choose the actions that we ultimately take. (Much or even most of our daily decision making doesn't use this complicated process and simply uses subconscious pattern matching. But we're focusing on the conscious part of the mind here.) Running multiple simulations to choose from a set of possible actions requires starting the simulation with the same initial state multiple times. In other words, it requires <b>copying</b> the initial state.<br />
<br />
If consciousness were a quantum phenomenon, this would imply copying some quantum states. However, it is physically impossible to copy a quantum state due to the <a href="https://en.wikipedia.org/wiki/No-cloning_theorem">no-cloning theorem</a>, just like it is impossible to change the amount of energy in the universe.<br />
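For reference, the no-cloning theorem follows from nothing more than the linearity of quantum mechanics. A standard sketch of the argument:

```latex
\text{Assume a unitary } U \text{ clones arbitrary states: }
U\bigl(|\psi\rangle|0\rangle\bigr) = |\psi\rangle|\psi\rangle.

\text{By linearity:}\quad
U\!\left(\tfrac{1}{\sqrt{2}}\bigl(|\psi\rangle+|\phi\rangle\bigr)|0\rangle\right)
= \tfrac{1}{\sqrt{2}}\bigl(|\psi\rangle|\psi\rangle+|\phi\rangle|\phi\rangle\bigr).

\text{But cloning the superposition directly would have to give:}\quad
\tfrac{1}{2}\bigl(|\psi\rangle+|\phi\rangle\bigr)\bigl(|\psi\rangle+|\phi\rangle\bigr).

\text{These two states differ unless } \langle\psi|\phi\rangle \in \{0,1\},
\text{ so no single } U \text{ can clone all states.}
```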
<br />
This immediately gives us a very nice explanation for why we are unable to perceive quantum states. Nothing can copy quantum states, and so a conscious mind that perceives quantum states cannot develop, since perception and memory of states of the world certainly requires making copies of those states (at least partial ones).<br />
<br />
There is some wiggle-room in this argument. Even a mind that only perceives, remembers, and computes on internal representations of classical states might still use quantum physical processes for this computation. And at a certain level, it is actually obviously true that our brains do this: our brains are full of chemistry, and chemistry is full of quantum physics. The point is that this is just an irrelevant implementation detail. Those quantum processes could just as well be simulated by something classical.<br />
<br />
And keep in mind that if we believe in the importance of "strange loops", then the mind itself is one of the things that the mind perceives. The mind cannot perceive quantum states, and so any kind of quantum-ness in the operation of the mind must be purely incidental. It cannot be a part of the "strange loop", and so it must ultimately be of limited importance.<br />
<br />
There is yet more wiggle-room. We know that <a href="https://en.wikipedia.org/wiki/Integer_factorization">some</a> <a href="https://en.wikipedia.org/wiki/Discrete_logarithm">computational tasks</a> can be implemented much more efficiently on a quantum computer than on a classical computer - at least in theory. Nobody has ever built a true quantum computer that is big enough to demonstrate a speed-up over classical computers and<strike> lived to tell the tale</strike> published it. Perhaps consciousness requires the solution of computational tasks that can only be solved efficiently using quantum computing?<br />
<br />
I doubt it. It looks increasingly like even in theory, quantum computers can only speed up a few very specialized mathematical problems, and that's not what our minds are particularly good at.<br />
<br />
It's always possible that this whole argument will turn out to be wrong, once we learn more about consciousness in the future (and wouldn't that be interesting!), but for now, the bottom line for me is: Strange information processing loops most likely play an important role in consciousness. Such loops cannot involve quantum states due to the no-cloning theorem. Hence, consciousness is not a quantum phenomenon.Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-80530610772512903532016-12-07T22:38:00.000+01:002016-12-07T22:38:12.749+01:00Unconditional basic income: financing is not the problemYesterday, Joerg Wellbrock wrote <a href="http://www.spiegelfechter.com/wordpress/134124/bedingungsloses-grundeinkommen-der-grosse-coup-des-neoliberalismus">a piece</a> on the Spiegelfechter blog in which he levels some criticism at the concept of an unconditional basic income (Bedingungsloses Grundeinkommen, BGE). I share much of that criticism, at least as skepticism.<br />
<br />
But when he questions how the BGE is to be financed, he loses me. He sees a financing problem where in reality there is none. The state, at least the <a href="http://nhaehnle.blogspot.de/2011/10/def-souveran-monetarer.html">monetarily sovereign state</a>, can finance whatever it wants, in whatever amount it wants. Believing that a monetarily sovereign state could ever have a financing problem is roughly like believing that the bank can go bankrupt in a game of Monopoly. It can't - and if the printed money runs out, you simply improvise with something else. That's what the rules say.<br />
<br />
Wait! the gentle reader is now exclaiming inwardly, isn't that printing money, and doesn't inflation follow? No, that is <a href="http://nhaehnle.blogspot.de/2011/10/geld-drucken-ist-nicht-inflationar.html">not money <i>printing</i></a>. In Monopoly, too, you wouldn't solve the "problem" by printing money. But yes, of course <i>spending</i> money can always lead to inflation in principle, depending on the circumstances. Wellbrock himself already cites this possibility as a potential problem of the BGE.<br />
<br />
But the important point is that the potential for inflation is the <i>only</i> problem. There is no financing problem. <i>Maybe</i> there is an inflation problem. The distinction matters, for two reasons.<br />
<br />
First, we hand too much power to the so-called creditors and financial markets when we believe in a financing problem. That belief leads our politicians to live in fear of the so-called financial markets. And because those markets are so diffuse and ill-defined, that fear hands de facto power to the high priests at banks and think tanks who pass for interpreters of the market's will. They are by no means neutral observers, and most of them don't have the well-being of the broad population in view either. But we live in a democracy. Its politics must of course acknowledge the constraints of reality, but it must not hand power to unlegitimized groups without cause.<br />
<br />
Second, thinking in terms of financing and credit automatically implies that the money should be paid back, possibly even at a profit. <i>Efficiency</i> is of course a sensible goal for democratic politics too, but running a profit must not be one. <a href="http://nhaehnle.blogspot.de/2012/04/warum-wir-ein-langfristiges.html">Moreover, there are good reasons why the government budget must not be balanced on long-term average at all.</a> There can be occasional periods in which a government surplus makes sense. But on long-term average, an inflation-neutral government budget of a normal, monetarily sovereign state must be in deficit.<br />
<br />
Better, then, to recognize that the emperor has no clothes. Monetarily sovereign states can have inflation problems, but not financing problems.<br />
<br />
Back to the BGE. I'm sympathetic to the concept, but I share the concern that, because of the inflation issue, a BGE probably cannot be realized the way its proponents would like. The closer the robot utopia comes, the more realistic a BGE becomes, of course. I doubt we're there yet, but I have to admit that ultimately it's an empirical question that can only be answered experimentally.<br />
<br />
And as a small postscript: the considerations about the government budget above assume a monetarily sovereign state. Germany is no longer monetarily sovereign but a member of the eurozone. The eurozone has no sovereign, and I would say that this is precisely its core problem. To stabilize the eurozone politically in the long run, it would need a budget at the euro level that takes on important stabilizer functions. That includes elements of the welfare state, and BGE proponents would do well to campaign for a BGE at the euro level. But all of that is a <a href="http://nhaehnle.blogspot.de/2011/09/die-schock-strategie-und-wie-es-anders.html">different topic</a>.Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-29619838798926806012016-11-11T11:09:00.000+01:002016-11-11T11:09:00.199+01:00What Trump and Hitler (may) have in commonComparing the future US president to fascists is nothing new, but there is one facet of the comparison that I have rarely seen so far. In his victory speech, Trump spoke of large-scale investment programs. The USA can in fact make very good use of that. Civil engineers there have been lamenting the poor state of the infrastructure for years, and of course such an investment program would be good for the economy and could strengthen the demand for labor across the board.<br />
<br />
This matters, because one has to realize <a href="http://www.bloomberg.com/news/articles/2016-06-30/americans-with-more-education-have-taken-almost-every-job-created-in-the-recovery">that there are entire strata of US society that the recovery from the economic crisis has passed by completely</a>. These people are right to complain about the status quo.<br />
<br />
That often gets mixed up with what I consider the illegitimate complaints of whites who are simply afraid of losing the perceived privilege of being the majority. Grow up already!, I want to shout at these people. But anyone who therefore brands them racists and simply ignores the legitimate part of their grievances is also making it too easy for themselves. That is exactly how you risk the rise of dangerous figures like Trump.<br />
<br />
The parallel to Hitler - whose government program, despite all its evils, did at first genuinely improve many people's lives through massive investment - matters so much to me because it is so incredibly sad. Why could the democrats of the Weimar Republic not pull themselves together for a large investment program? Why did the USA fail to do so in recent years? And why is Germany failing at it again today, thereby providing fertile ground for the AfD? Why do liberal, progressive parties and politicians leave their flanks so unprotected? One could perfectly well adopt the positive sides of an expansive and inclusive economic program without thereby opening the door to global climate change, exterminating all Jews, or starting a world war.<br />
<br />
When I talk to people about this today, I unfortunately keep running into misinformation about our economic system. With Hitler, the immediate response is that it was all actually bad because of the borrowing (the second popular argument about Hitler, the criticism of the focus on armament, I agree with - but there are so many sensible things to invest in, it really doesn't have to be the military!). In today's Germany, too, the ideology of the <a href="https://de.wikipedia.org/wiki/Wolfgang_Sch%C3%A4uble">schwarze Null</a> (the "black zero" of a balanced budget) is the big obstacle. Likewise, in the USA in recent years the Republicans have been happy to obstruct by pointing at the budget deficit. The latter may raise doubts about whether there will be large investment programs under Trump, but the Republicans have also proven again and again that budget deficits matter to them only as long as they can be used to block Democratic proposals. Paul Ryan's (Republican Speaker of the House) own budget drafts, when analyzed soberly, always contain gigantic budget deficits. Trump may still fail to get his investment program through Congress, but the odds aren't bad.<br />
<br />
In reality, debt and deficits are simply not a problem for <a href="http://nhaehnle.blogspot.de/2011/10/def-souveran-monetarer.html">monetarily sovereign states</a> - at least <a href="http://nhaehnle.blogspot.de/2012/04/warum-wir-ein-langfristiges.html">not in the way most people believe they know</a>. <a href="http://nhaehnle.blogspot.de/2011/09/die-schock-strategie-und-wie-es-anders.html">In the eurozone, the lack of sovereignty complicates the situation</a>, but that is a topic for another time.<br />
<br />
Of course, an investment program alone is not enough. Education needs work too, as do regional structures, and surely more besides. But an investment program is a good, clearly visible, and media-friendly start that makes it easy to move in the right direction.<br />
<br />
Our democracy is too important to sacrifice to an economic ideology.<br />
Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-1672731847984928272016-10-24T21:36:00.000+02:002016-10-24T21:36:23.521+02:00Compiling shaders: dynamically uniform variables and "convergent" intrinsics<p>There are some program transformations that are obviously correct when compiling regular single-threaded or even multi-threaded code, but that cannot be used for shader code. For example:</p>
<pre style="font-family:arial;font-size:12px;border:1px dashed #CCCCCC;width:99%;height:auto;overflow:auto;background:#f0f0f0;;padding:0px;color:#000000;text-align:left;line-height:20px;"><code style="color:#000000;word-wrap:normal;"> v = texture(u_sampler, texcoord);
if (cond) {
gl_FragColor = v;
} else {
gl_FragColor = vec4(0.);
}
... cannot be transformed to ...
if (cond) {
// The implicitly computed derivative of texcoord
// may be wrong here if neighbouring pixels don't
// take the same code path.
gl_FragColor = texture(u_sampler, texcoord);
} else {
gl_FragColor = vec4(0.);
}
... but the reverse transformation is allowed.
</code></pre>
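The comment in the transformed code can be made concrete with a toy model. Implicit derivatives are computed (roughly) by differencing values across the lanes of a 2×2 pixel quad; the lane layout and values below are made up purely for illustration, not how any real GPU lays things out:

```python
def ddx(quad):
    # Coarse horizontal derivative across a 2x2 quad:
    # lane layout [top-left, top-right, bottom-left, bottom-right].
    return quad[1] - quad[0]

texcoord = [0.10, 0.35, 0.12, 0.37]  # per-lane texcoord.x (values made up)

# Outside the branch, all four lanes ran the code that produced texcoord,
# so the implicit derivative is well-defined:
dx_full = ddx(texcoord)              # ~0.25

# If the texture() call is sunk into `if (cond)` and a neighbouring lane
# doesn't take that path, there is no guarantee the helper lane still holds
# a meaningful texcoord; model that as garbage:
texcoord_in_branch = [0.10, float('nan'), 0.12, 0.37]
dx_branch = ddx(texcoord_in_branch)  # the implicit derivative is now bogus
```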
<p>Another example is:</p>
<pre style="font-family:arial;font-size:12px;border:1px dashed #CCCCCC;width:99%;height:auto;overflow:auto;background:#f0f0f0;;padding:0px;color:#000000;text-align:left;line-height:20px;"><code style="color:#000000;word-wrap:normal;"> if (cond) {
v = texelFetch(u_sampler[1], texcoord, 0);
} else {
v = texelFetch(u_sampler[2], texcoord, 0);
}
... cannot be transformed to ...
v = texelFetch(u_sampler[cond ? 1 : 2], texcoord, 0);
// Incorrect, unless cond happens to be dynamically uniform.
... but the reverse transformation is allowed.
</code></pre>
<p>Using <a href="https://www.opengl.org/registry/specs/ARB/shader_ballot.txt">GL_ARB_shader_ballot</a>, yet another example is:</p>
<pre style="font-family:arial;font-size:12px;border:1px dashed #CCCCCC;width:99%;height:auto;overflow:auto;background:#f0f0f0;;padding:0px;color:#000000;text-align:left;line-height:20px;"><code style="color:#000000;word-wrap:normal;"> bool cond = ...;
uint64_t v = ballotARB(cond);
if (other_cond) {
use(v);
}
... cannot be transformed to ...
bool cond = ...;
if (other_cond) {
use(ballotARB(cond));
// Here, ballotARB returns 1-bits only for threads/work items
// that take the if-branch.
}
... and the reverse transformation is also forbidden.
</code></pre>
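A toy SIMT interpreter makes the ballot example concrete: ballotARB returns one bit per lane that is currently active <em>and</em> has the condition set, so sinking the call into a divergent branch changes which lanes participate. This is a sketch of the semantics, not real driver code:

```python
def ballot(active, cond):
    # Toy ballotARB: one bit per lane that is both active and has cond set.
    return sum(1 << i for i, a in enumerate(active) if a and cond[i])

cond       = [True, True, False, True]  # per-lane values (made up)
other_cond = [True, False, True, True]

# Original program: ballotARB(cond) executes with all lanes active,
# and only the *use* of v sits inside `if (other_cond)`:
v = ballot([True] * 4, cond)            # lanes 0, 1, 3 set

# Invalid transformation: sinking the call into the branch means it runs
# with only the lanes where other_cond holds:
v_sunk = ballot(other_cond, cond)       # only lanes 0 and 3 set

assert v != v_sunk
```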
<p>These restrictions are all related to the GPU-specific SPMD/SIMT execution model, and they need to be taught to the compiler. Unfortunately, we partially fail at that today.</p>
<p>Here are some types of restrictions to think about (each of these restrictions should apply on top of any other restrictions that are expressible in the usual, non-SIMT-specific ways, of course):</p>
<ol>
<li><p><em>An instruction can be moved from location A to location B only if B <a href="https://en.wikipedia.org/wiki/Dominator_(graph_theory)">dominates</a> or post-dominates A.</em></p>
<p>This restriction applies e.g. to instructions that take derivatives (like in the first example) or that explicitly take values from neighbouring threads (like in the third example). It also applies to barrier instructions.</p>
<p>This is LLVM's <a href="http://llvm.org/docs/LangRef.html#id700">convergent</a> function attribute as I understand it.</p>
<li><p><em>An instruction can be moved from location A to location B only if A dominates or post-dominates B.</em></p>
<p>This restriction applies to the ballot instruction above, but it is not required for derivative computations or barrier instructions.</p>
<p>This is in a sense dual to LLVM's convergent attribute, so it's co-convergence? Divergence? Not sure what to call this.</p>
<li><p><em>Something vague about not introducing additional non-uniformity in the arguments of instructions / intrinsic calls.</em></p>
<p>This last one applies to the sampler parameter of texture intrinsics (for the second example), to the ballot instruction, and also to the texture coordinates on sampling instructions that implicitly compute derivatives.</p>
</ol>
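To make the first two kinds of restriction concrete, here is a small illustrative sketch (plain Python, not LLVM code; the function names `dominators` and `may_move_convergent` are mine) that checks the CFG-side legality of moving a convergent instruction from block A to block B on a toy control-flow graph:

```python
def dominators(succ, entry):
    """Dominator sets via the naive iterative data-flow algorithm.

    succ maps each block to its list of successors; the result maps
    each block n to the set of blocks that dominate n.
    """
    nodes = set(succ)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    pred = {n: [p for p in nodes if n in succ[p]] for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            sets = [dom[p] for p in pred[n]]
            new = {n} | (set.intersection(*sets) if sets else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

def may_move_convergent(cfg, entry, exit_block, a, b):
    # Rule 1: a convergent instruction may move from A to B only if
    # B dominates or post-dominates A. Post-dominance is computed as
    # dominance on the reversed CFG, rooted at the exit block.
    dom = dominators(cfg, entry)
    rcfg = {n: [p for p in cfg if n in cfg[p]] for n in cfg}
    postdom = dominators(rcfg, exit_block)
    return b in dom[a] or b in postdom[a]

# Classic if/else diamond: entry -> then/else -> join.
cfg = {'entry': ['then', 'else'], 'then': ['join'], 'else': ['join'], 'join': []}
```

On this diamond, sinking a convergent instruction from `entry` into `then` is rejected (the branch neither dominates nor post-dominates the entry), while hoisting from `then` to `entry` or sinking from `then` to `join` is allowed. Rule 2 is the same check with A and B swapped.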
<p>For the last type of restriction, consider the following example:</p>
<pre style="font-family:arial;font-size:12px;border:1px dashed #CCCCCC;width:99%;height:auto;overflow:auto;background:#f0f0f0;;padding:0px;color:#000000;text-align:left;line-height:20px;"><code style="color:#000000;word-wrap:normal;"> uint idx = ...;
 if (idx == 1u) {
   v = texture(u_sampler[idx], texcoord);
 } else if (idx == 2u) {
   v = texture(u_sampler[idx], texcoord);
 }
 ... cannot be transformed to ...
 uint idx = ...;
 if (idx == 1u || idx == 2u) {
   v = texture(u_sampler[idx], texcoord);
 }
</code></pre>
<p>In general, whenever an operation has this mysterious restriction on its arguments, then the second restriction above <em>must</em> apply: we can move it from A to B only if A dominates or post-dominates B, because only then can we be certain that the move introduces no non-uniformity. (At least, this rule applies to transformations that are not SIMT-aware. A SIMT-aware transformation might be able to prove that idx is dynamically uniform even without the predication on idx == 1u or idx == 2u.)</p>
<p>However, the control flow rule is not enough:</p>
<pre style="font-family:arial;font-size:12px;border:1px dashed #CCCCCC;width:99%;height:auto;overflow:auto;background:#f0f0f0;;padding:0px;color:#000000;text-align:left;line-height:20px;"><code style="color:#000000;word-wrap:normal;"> v1 = texture(u_sampler[0], texcoord);
v2 = texture(u_sampler[1], texcoord);
v = cond ? v1 : v2;
... cannot be transformed to ...
v = texture(u_sampler[cond ? 0 : 1], texcoord);
</code></pre>
<p>The transformation does not break any of the CFG-related rules, and it would clearly be correct for a single-threaded program (given the knowledge that texture(...) is an operation without side effects). So the CFG-based restrictions really aren't sufficient to model the real set of restrictions that apply to the texture instruction. And it gets worse:</p>
<pre style="font-family:arial;font-size:12px;border:1px dashed #CCCCCC;width:99%;height:auto;overflow:auto;background:#f0f0f0;;padding:0px;color:#000000;text-align:left;line-height:20px;"><code style="color:#000000;word-wrap:normal;"> v1 = texelFetch(u_sampler, texcoord[0], 0);
v2 = texelFetch(u_sampler, texcoord[1], 0);
v = cond ? v1 : v2;
... is equivalent to ...
v = texelFetch(u_sampler, texcoord[cond ? 0 : 1], 0);
</code></pre>
<p>After all, texelFetch computes no implicit derivatives.</p>
<p>Calling the three kinds of restrictions 'convergent', 'co-convergent', and 'uniform', we get:</p>
<pre style="font-family:arial;font-size:12px;border:1px dashed #CCCCCC;width:99%;height:auto;overflow:auto;background:#f0f0f0;;padding:0px;color:#000000;text-align:left;line-height:20px;"><code style="color:#000000;word-wrap:normal;"> texture(uniform sampler, uniform texcoord)    convergent (co-convergent)
 texelFetch(uniform sampler, texcoord, lod)    (co-convergent)
 ballotARB(uniform cond)                       convergent co-convergent
 barrier()                                     convergent
</code></pre>
<p>For the texturing instructions, I put 'co-convergent' in parentheses because these instructions aren't <em>inherently</em> 'co-convergent'. The attribute is only there because of the 'uniform' function argument.</p>
<p>Actually, looking at the examples, it seems that co-convergent only appears when a function has a uniform argument. Then again, the texelFetch function <em>can</em> be moved freely in the CFG by a SIMT-aware pass that can prove that the move doesn't introduce non-uniformity to the sampler argument, so being able to distinguish functions that are inherently co-convergent (like ballotARB) from those that are only implicitly co-convergent (like texture and texelFetch) is still useful.</p>
<p>For added fun, things get muddier when you notice that in practice, AMDGPU doesn't even flag texturing intrinsics as 'convergent' today. Conceptually, the derivative-computing intrinsics need to be convergent to ensure that the texture coordinates for neighbouring pixels are preserved (as in the very first example). However, the AMDGPU backend does register allocation <em>after</em> the CFG has been transformed into the wave-level control-flow graph. So register allocation automatically preserves neighbouring pixels even when a texture instruction is sunk into a location with additional control-flow dependencies.</p>
<p>When we reach a point where vector register allocation happens with respect to the thread-level control-flow graph, then texture instructions really need to be marked as convergent for correctness. (This change would be beneficial overall, but is tricky because scalar register allocation must happen with respect to the wave-level control flow graph. LLVM currently wants to allocate all registers in one pass.)</p>Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-11397740006302908772016-09-03T20:18:00.000+02:002016-09-03T20:18:00.224+02:00Exec masks in LLVMAs is usual on GPUs, Radeon executes shaders in waves that run the same program for many threads or work-items simultaneously in lock-step. Given a single program counter for up to 64 items (e.g. pixels being processed by a pixel shader), branch statements must be lowered to manipulation of the <i>exec mask</i> (unless the compiler can prove the branch condition to be uniform across all items). The exec mask is simply a bit-field that contains a 1 for every thread that is currently active, so code like this:
<pre>
if (i != 0) {
  ... some code ...
}
</pre>
gets lowered to something like this:
<pre>
  v_cmp_ne_i32_e32 vcc, 0, v1
  s_and_saveexec_b64 s[0:1], vcc
  s_xor_b64 s[0:1], exec, s[0:1]
if_block:
  ... some code ...
join:
  s_or_b64 exec, exec, s[0:1]
</pre>
(The <code>saveexec</code> assembly instructions apply a bit-wise operation to the exec register, storing the original value of exec in their destination register. Also, we can introduce branches to skip the if-block entirely if the condition happens to be uniformly false.)
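The exec manipulation in this lowering can be sketched in plain Python (an illustration only; `run_if` is my name for it, and the real hardware tracks exec in a pair of scalar registers):

```python
WAVE_SIZE = 64
FULL_MASK = (1 << WAVE_SIZE) - 1

def run_if(exec_mask, i_values):
    """Simulate the exec-mask lowering of "if (i != 0)" for one wave."""
    # v_cmp_ne_i32_e32 vcc, 0, v1: per-lane compare writes a bitmask into vcc.
    vcc = 0
    for lane in range(WAVE_SIZE):
        if (exec_mask >> lane) & 1 and i_values[lane] != 0:
            vcc |= 1 << lane
    # s_and_saveexec_b64 s[0:1], vcc: save exec, then exec &= vcc.
    saved = exec_mask
    exec_mask &= vcc
    # s_xor_b64 s[0:1], exec, s[0:1]: s[0:1] now holds exactly the lanes
    # that were active before but skip the if-block.
    saved ^= exec_mask
    # "... some code ..." runs here with only the if-lanes enabled.
    inside_mask = exec_mask
    # s_or_b64 exec, exec, s[0:1]: re-enable the skipped lanes at the join.
    exec_mask |= saved
    return inside_mask, exec_mask

# All 64 lanes active; even lanes have i == 0, odd lanes have i == 1.
i_values = [lane % 2 for lane in range(WAVE_SIZE)]
```

Running this with a fully active wave, only the odd lanes are active inside the if-block, and the full mask is restored at the join.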
<br><br>
This is quite different from CPUs, and so a generic compiler framework like LLVM tends to get confused. For example, the fast register allocator in LLVM is a very simple allocator that just spills all live registers at the end of a basic block before the so-called <em>terminators</em>. Usually, those are just branch instructions, so in the example above it would spill registers after the s_xor_b64.
<br><br>
This is bad because the exec mask has already been reduced by the if-condition at that point, and so vector registers end up being spilled only partially.
<br><br>
Until recently, these issues were hidden by the fact that we lowered the control flow instructions into their final form only at the very end of the compilation process. However, previous optimization passes including register allocation can benefit from seeing the precise shape of the GPU-style control flow earlier. But then, some of the subtleties of the exec masks need to be taken into account by those earlier optimization passes as well.
<br><br>
A related problem arises with another GPU-specific specialty, the "whole quad mode". We want to be able to compute screen-space derivatives in pixel shaders - <a href="https://en.wikipedia.org/wiki/Mipmap">mip-mapping</a> would not be possible without it - and the way this is done in GPUs is to always run pixel shaders on 2x2 blocks of pixels at once and approximate the derivatives by taking differences between the values for neighboring pixels. This means that the exec mask needs to be turned <em>on</em> for pixels that are not really covered by whatever primitive is currently being rendered. Those are called <a href="https://www.opengl.org/sdk/docs/man/html/gl_HelperInvocation.xhtml">helper pixels</a>.
<br><br>
However, there are times when helper pixels absolutely must be disabled in the exec mask, for example when <a href="https://www.opengl.org/registry/specs/ARB/shader_image_load_store.txt">storing to an image</a>. A separate pass deals with the enabling and disabling of helper pixels. Ideally, this pass should run after instruction scheduling, since we want to be able to rearrange memory loads and stores freely, which can only be done <em>before</em> adding the corresponding exec-instructions. The instructions added by this pass look like this:
<pre>
s_mov_b64 s[2:3], exec
s_wqm_b64 exec, exec
... code with helper pixels enabled goes here ...
s_and_b64 exec, exec, s[2:3]
... code with helper pixels disabled goes here ...
</pre>
Naturally, adding the bit-wise AND of the exec mask must happen in a way that doesn't conflict with any of the exec manipulations for control flow. So some careful coordination needs to take place.
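The effect of the <code>s_wqm_b64</code> instruction itself is simple to sketch (illustrative Python, not driver code; `wqm` is my name): for every quad of four consecutive lanes, if any lane in the quad is live, all four lanes of that quad are enabled.

```python
def wqm(exec_mask, wave_size=64):
    """Whole quad mode: enable all 4 lanes of any quad with a live lane."""
    out = 0
    for quad in range(0, wave_size, 4):
        if (exec_mask >> quad) & 0xF:
            out |= 0xF << quad
    return out

# A single covered pixel in the second quad enables its whole quad,
# turning the other three lanes into helper pixels.
```

Since `wqm(m)` only ever adds bits, the later <code>s_and_b64 exec, exec, s[2:3]</code> with the saved mask exactly removes the helper pixels again.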
<br><br>
<a href="https://reviews.llvm.org/D24215">My suggestion</a> is to allow arbitrary instructions at the beginning and end of basic blocks to be marked as "initiators" and "terminators", as opposed to the current situation, where there is no notion of initiators, and whether an instruction is a terminator is a property of the opcode. An alternative, which Matt Arsenault is working on, adds aliases for certain exec-instructions which act as terminators. This may well be sufficient; I'm looking forward to seeing the result.Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-17116729239834511912016-06-18T12:10:00.000+02:002016-06-18T12:10:06.234+02:00How to evaluate a privatization of the AutobahnToday I learn, probably somewhat belatedly, that Germany is once again discussing the privatization of its Autobahn network. What is this privatization supposed to be good for?
<br><br>
As far as I can see, the current debate offers, roughly sketched, two arguments. First, interest rates are low and private investors are longing for opportunities to make a profit. Private highways would be such an opportunity. I find this argument rather shameless. Private investors - above all the individual investors who typically profit the most from privatizations - are, by definition, among the people who are the very last to need help from the state. Letting policy be guided by their interests is simply unacceptable.
<br><br>
So let's turn to the second argument. Germans do see that investment in the Autobahn network is necessary, but they also <a href="https://de.wikipedia.org/wiki/Wolfgang_Sch%C3%A4uble">love the "schwarze Null"</a> (the balanced federal budget). But this argument doesn't hold up either. To evaluate the privatization of the Autobahn sensibly, it helps to look at the net money flows:
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDyhPr3X67VXrT8C0YwL6yv476C4beoCU6oHN9RkLJWYd1OB3m8eCOo0SjPBzNsNhvxtIY-SLa_LclLDp-EZFMJdqkWb5ThlGfjba60LArjQjrFzr_uC-lHjC-y0mDKiXm1_xHPQ/s1600/autobahn-finanzen.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDyhPr3X67VXrT8C0YwL6yv476C4beoCU6oHN9RkLJWYd1OB3m8eCOo0SjPBzNsNhvxtIY-SLa_LclLDp-EZFMJdqkWb5ThlGfjba60LArjQjrFzr_uC-lHjC-y0mDKiXm1_xHPQ/s1600/autobahn-finanzen.png" /></a>
<br>
No matter how the Autobahn is organized financially, in the end the system's revenue comes partly from its users and partly from general taxes (some revenue sources can lie somewhere in between, for example fuel taxes). And in the end, the spending goes partly to the actual construction and operation of the highways - that much is obvious - and partly to private investors. The latter is not quite so obvious: even a highway that is entirely in state hands makes payments to private investors when the state <a href="http://nhaehnle.blogspot.de/2011/09/modern-monetary-theory-eine.html">"borrows"</a> to finance construction and operation.
<br><br>
From the perspective of the common good, the money flows raise two main questions: how much has to be paid in on the left-hand side, and how is the bill split between highway users and citizens in general? Exactly as much money necessarily comes in on the left as flows out again on the right. So it makes sense for policy to focus on the right-hand side. Does privatization help reduce the spending on the right-hand side?
<br><br>
My short assessment: privatization can achieve very little in construction and operation. The biggest chunk is construction, and that has long been put out to tender to private companies. If private enterprise brings any efficiency gains here, they have long since been realized.
<br><br>
Things get more interesting with the payments to private investors. The expected return on private ventures is significantly higher than the interest on federal bonds. We know this from other privatizations: if private investors cannot expect a higher profit than from federal bonds, they will simply buy federal bonds instead. So we should expect that privatizing the Autobahn will tend to increase total spending, making the whole system more expensive for citizens and users. Privatization is a losing proposition.
<br><br>
The second question - whether revenue should come more from general taxes or more directly from users - is quite obviously independent of the question of privatization. One only needs to keep in mind (however little of a fan of driving I personally am) that financing the system more heavily through its users tends to increase the spending side and thus makes the whole system more expensive - collecting tolls is a bureaucratic effort. Whether the efficiency lost to tolls is an appropriate price for greater fairness is something everyone has to decide for themselves. Note also the analogy to the <a href="http://www.taz.de/!5109537/">demand for free public transit</a>!
<br><br>
Incidentally: even if spending were, surprisingly, to fall with privatization, there are additional considerations that tend to speak against it. For one, there is the question of control: even if the state remains the majority owner of the Autobahn network, the influence of forces without democratic legitimacy over the operation of the country's critical infrastructure still grows. How large the financial gain would have to be to justify this loss of control is again something everyone must decide for themselves.
<br><br>
For another, the composition of the private investors also matters - how widely are the payments to private investors effectively spread across the population? With standardized, "boring" federal bonds, the private investors are often boring funds in which more "ordinary", less wealthy citizens participate, for example for their retirement savings. The payments to private investors thus do not reach all citizens equally, but they are spread relatively widely. A privatization, however, does not work with such "boring" financial instruments, so the returns tend to accumulate in the hands of people who are already rich. The payments to private investors therefore tend to increase inequality in the country - another argument against privatization.
<br><br>
Does that mean every privatization is bad? After all, the arguments above transfer to other privatizations as well. Indeed, looking at the net money flows is always helpful, and yes, the conclusion looks similar for all the privatization projects being discussed these days. But that is mainly because our state has already largely withdrawn from the business of actually running things. If the steel industry or the production of industrial robots were in state hands today, the analysis of privatizing those companies would surely look different, because there are credible arguments for efficiency gains from opening such markets. In the case of infrastructure, however, there is usually no meaningful market to open, and hence hardly any argument for privatization.
Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com0tag:blogger.com,1999:blog-36137506.post-15579298520900470062016-05-19T06:18:00.000+02:002016-05-19T06:18:53.861+02:00A little 5-to-8-bit mysteryWriting the accelerated glReadPixels path for reads to <a href="https://www.opengl.org/wiki/Pixel_Buffer_Object">PBOs</a> for Gallium, I wanted to make sure the various possible format conversions are working correctly. They do, but I noticed something strange: when reading from a GL_RGB565 framebuffer to GL_UNSIGNED_BYTE, I was getting tiny differences in the results depending on the code path that was taken. What was going on?
<br><br>
Color values are conceptually floating point values, but most of the time, so-called normalized formats are used to store the values in memory. In fact, many probably think of color values as 8-bit normalized values by default, because of the way many graphics programs present color values and because of the #cccccc color format of HTML.
<br><br>
Normalized formats generalize this well-known notion to an arbitrary number of bits. Given a normalized integer value <i>x</i> in N bits, the corresponding floating point value is <i>x / (2**N - 1)</i> - for example, <i>x / 255</i> for 8 bits and <i>x / 31</i> for 5 bits. When converting between normalized formats with different bit depths, the values cannot be mapped perfectly. For example, since 255 and 31 are <a href="https://en.wikipedia.org/wiki/Coprime_integers">coprime</a>, the only floating point values representable exactly in both 5- and 8-bit channels are 0.0 and 1.0.
<br><br>
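This conversion is easy to sketch in Python; `unorm_convert` below (the names are mine, not Mesa's) re-encodes an N-bit normalized value into M bits by rounding to the nearest representable value:

```python
def unorm_to_float(x, bits):
    # A normalized integer x in N bits represents x / (2**N - 1).
    return x / ((1 << bits) - 1)

def unorm_convert(x, from_bits, to_bits):
    # Re-encode by rounding to the nearest representable value
    # in the destination bit depth.
    return round(unorm_to_float(x, from_bits) * ((1 << to_bits) - 1))

# 0 and the maximum value map exactly; everything in between is rounded,
# e.g. 20 in 5 bits (20/31 = 0.6452...) becomes 165 in 8 bits.
```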
So some imprecision is unavoidable, but why was I getting different values in different code paths?
<br><br>
It turns out that the non-PBO path first blits the requested framebuffer region to a staging texture, from where the result is then memcpy()d to the user's buffer. It is the GPU that takes care of the copy from VRAM, the <a href="https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-swizzling/">de-tiling</a> of the framebuffer, and the format conversion. The blit uses the normal 3D pipeline with a simple fragment shader that reads from the "framebuffer" (which is really bound as a texture during the blit) and writes to the staging texture (which is bound as the framebuffer).
<br><br>
Normally, fragment shaders operate on 32-bit floating point numbers. However, Radeon hardware allows an optimization where color values are exported from the shader to the CB hardware unit as <a href="https://en.wikipedia.org/wiki/Half-precision_floating-point_format">16-bit half-precision floating point</a> numbers when the framebuffer does not require the full floating point precision. This is useful because it reduces the bandwidth required for shader exports and allows more shader waves to be in flight simultaneously, because less memory is reserved for the exports.
<br><br>
And it turns out that the value 20 in a 5-bit color channel, when first converted into half-float (fp16) format, becomes 164 in an 8-bit color channel, even though the 8-bit color value that is closest to the floating point number represented by 20 in 5-bit is actually 165. The temporary conversion to fp16 cuts off a bit that would make the difference.
<br><br>
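The effect is easy to reproduce with Python's standard struct module, which supports the IEEE half-precision format via the 'e' format character (a sketch of the arithmetic only, not the actual driver code path):

```python
import struct

def to_fp16(x):
    # Round a Python double to the nearest representable fp16 value
    # by packing and unpacking it as a half-precision float.
    return struct.unpack('<e', struct.pack('<e', x))[0]

def convert_5_to_8_direct(x):
    # 5-bit normalized -> 8-bit normalized, full precision throughout.
    return round(x / 31 * 255)

def convert_5_to_8_via_fp16(x):
    # Same conversion, but with the intermediate value squeezed through fp16,
    # as happens with the half-float shader export path.
    return round(to_fp16(x / 31) * 255)

# 20/31 = 0.64516...; as fp16 this becomes 0.64501953125, and
# 0.64501953125 * 255 = 164.48, which rounds to 164 instead of 165.
```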
Intrigued, I wrote a little script to see how often this happens. It turns out that 20 in a 5-bit channel and 32 in a 6-bit channel are the <em>only</em> cases where the temporary conversion to fp16 leads to the resulting 8-bit value being off by one. Luckily, people don't usually use GL_RGB565 framebuffers... and as a general rule, taking a value from an N-bit channel, converting it to fp16, and then storing the value again in an N-bit channel (of the same bit depth!) will always result in what we started out with, as long as N <= 11 (figuring out why is an exercise left to the reader ;-)) - so the use cases we really care about are fine.Nicolai Hähnlehttp://www.blogger.com/profile/18235566517992076346noreply@blogger.com1