Tagebuch eines Interplanetaren Botschafters: LLVM

Monday, May 25, 2026

AI and Me

Lots is being written about AI these days, specifically LLM-based AI. I am not going to add anything novel here, but I thought it'd be interesting to document some of what I think about these tools and how I use them today so that I will be able to look back and compare in the future.

LLMs and the systems built around them are clearly impressive. They can also be really dumb in some ways; what some people refer to as jagged intelligence. In part because of that, and in part because I'm from the last millennium and hesitant about the privacy implications of everything "cloud", I don't use them very much in my personal life. I will very occasionally poke at ChatGPT about something or go into Google search's AI mode to drill down on a search. I have also dabbled in local models. But that's pretty much it. I regularly encounter people who seem to use these tools much more than me in their personal lives despite being generally less technical. To put this in context, I have always been hesitant and deliberate about the "smart" devices that I let into my life.

I do have free GitHub Copilot access on my personal GitHub account, and I do use that occasionally on personal projects, but not enough to have really spent a whole lot of time exploring what I can do with it. So, this is mostly going to be about using AI at work, specifically in software development and adjacent activities.

Outside of Software Development

My software development work is centered on LLVM and adjacent compiler-y projects. However, I also spend a large fraction of my working hours collaborating with our hardware architects. This second kind of work very much requires that I, as a human, deeply understand the systems that we are planning to build. I need to understand. By definition, an AI tool cannot directly help with this. There are sub-problems that AI tools can help with, but they are fairly limited in scope.

For example, I may want to make some simulation results interpretable in a way that requires ad hoc number crunching and visualization. I have written Python scripts using matplotlib by hand in the past, but an AI agent can do it faster. I sanity check the results as part of trying to understand them and the underlying processes and implications, just like I would sanity check a similar report produced by a colleague. If in doubt, I will dig into the script that the agent produced. But I don't feel the need to understand all the details of such scripts all the time. They are throwaway code.

Some people suggest that AI tools be used for writing text. I strongly disagree for the most part.

Writing is a form of thinking. In the process of writing and editing a text, such as a design document, a specification, this post, or even just an email or code review, I transfer my thoughts from their fuzzy state in my brain into the less forgiving shape of concrete words. In doing so, I implicitly review and refine them. Am I making a fully logical argument? Is there some corner case or a precondition that I missed? If I were to let an LLM write the text, I would skip this process and diminish my understanding. I would not be doing my job right. In fact, while I have gotten completely used to and have become a fan of LLM-based auto-complete for code, I specifically turn it off for text documents and also for comments in code. Getting those auto-completions in text always reminds me of the ST:TNG image macro where Q looks like he is about to whisper something into the ear of a visibly annoyed Picard.

That said, I only mostly disagree. For example, I have noticed colleagues use LLMs to help with a language barrier. I don't begrudge them that, though I sometimes wish they made more of an effort to find their own voice. If you do this, please be sure to make your own thorough editing pass on the resulting text! As another example, I have occasionally used LLMs to brainstorm synonyms and antonyms, metaphors, and names.

Editing source code

With that out of the way, let's move on to the meat that is all the rage these days: trying to use LLM-based tools for serious software development beyond the one-off, throwaway scripts I mentioned above.

Working on a large and subtle project like LLVM obviously biases me here, and so I'm going to start by rejecting the term "code generation". Most serious software development in the kind of project that I'm interested in is, or should be, code editing instead. It is true that empirically, software always tends toward net growth in terms of lines of code, but every line of code creates a liability. It is unfortunate that current AI tools are clearly more biased towards code generation.

A key aspect of all types of tooling, not just LLM-based tools, is interactivity. Does the tool give a response in seconds, minutes, hours? As a general rule, tools that are in the 5 seconds to minutes range are a giant drag on productivity and are bad for mental health. Sitting around idly waiting for the tool is boring. The temptation to context switch is high, but context switches are mentally draining. Current LLM-based tools are generally unable to give useful responses in under 5 seconds, and that puts a hard ceiling on where their use is desirable. Auto-complete is an obvious exception to this, in large part because it uses a model that isn't actually all that large by today's standards.

Given that the latency of LLMs is more than just a few seconds, the way to use them is to kick off tasks that can run in the background without creating frequent and/or costly context switches.

One example of such a use is that I occasionally have a question about a part of a large code base that I am not very familiar with. In that case, I might put the question to an AI agent integrated into my IDE while then also researching the question myself in parallel. Whether the agent response is still helpful by the time it shows up, or even at all, is hit and miss. There is also a danger here. The agent will almost always create a confident-sounding response, but it may just be incorrect. You must treat it not as an oracle but as a "rubber duck" that can talk back to you. Still, this has been helpful often enough that I feel almost, but not quite, entirely comfortable calling this a habit by now. I expect it will become one over time.

Of course, the big use case is agentic coding, but only if the agent does not have to ask for permission before common actions. We must give the agent a sandbox in which it can run without having to ask for permission.

Sandboxing

My current approach to sandboxing uses defense in depth. I use Claude Code with its integrated sandbox mechanism, into which I unfortunately had to poke some minor holes. Because of that, but more importantly because Claude Code seems rather vibe coded and not that robust, and I am the kind of person who has used Linux since before the kernel reached version 1.0, I created a separate "agents" user account on my workstation that the agent runs under. This user account cannot access my main account's home directory, and it has no credentials at all except for the key to the LLM gateway provided by my employer. There is a clone of LLVM in the agents home directory that has no remotes. I use git push and pull from my main account to locally exchange Git trees between the agent work space and my main work environment. I also have a Visual Studio Code workspace that opens the agent's clone so that I can easily make manual edits to the version of the code that the agent is working on. I use the SetGID bit on directories so that files inherit the agents group of which my main user account is a member.

I know of people who use Docker instead of user accounts to achieve similar defense in depth. That might work better in some "cloud" environments. I generally prefer to avoid unnecessary complexity in my software stacks, and the separate user account works well for me.

Workflow

I use coding agents almost exclusively in the background. I set them a task and do something else in parallel, checking in with the agent every once in a while. I do not use agents for coding on whatever my current main focus is. This requires some mental discipline and task switching that I am still getting used to. There are many days on which my main focus is so intense that I do not use agents at all. I simply do not have the spare mental bandwidth to set useful tasks for an agent and to adequately handle the output.

Coding agents do not currently have good taste for software architecture. They cannot be trusted to make good choices on tasks that have architectural freedom, but I still use them for coding tasks that are intended for production use.

I tend to use coding agents for tasks that are relatively narrow and well-defined steps of a larger project. Sometimes, that project is my main focus and I identify a future step that is largely independent of what I am currently working on. If I can get the agent to take care of that step already, I won't have to do it myself later. Sometimes, there is a larger project that is not my main focus but that is simple enough that I already have a good understanding of how to break it down into smaller steps that an agent can work through over time.

Sometimes, I prompt the agent for the task directly in its chat interface. More often, I prepare a description of the task in a markdown file. This file typically ends up being somewhere between 10 and 20 lines long. This line count includes whitespace and bullet lists. Initially, I made these files read-only for the agent. More recently, I ask the agent to append a progress log to this file as it works through the task. This works reasonably well, but it is certainly a part of my workflow that I can see changing further.

I have two reasons for putting what is essentially the prompt in a separate file. First, I like having these files as a record of the tasks I have given to agents. Second, as I noted above, writing is thinking. I end up implicitly checking myself. Being just a little more thorough and precise also helps keep the agent aligned with my intentions.

When the agent is done, I read its final report to see if anything suspicious stands out to me. For example, the agent may report that it ran into some issue and worked around it in a certain way. I may ask the agent to change something, though I have found this kind of re-prompting to not be successful often enough for my tastes. Unless I find anything obviously problematic quickly, the next step is to thoroughly review all changes with an editor and the diff open side by side. I make changes very liberally as I go. Occasionally, I end up breaking something accidentally in doing so, which often ends up being an important learning experience. I also completely rewrite any commit messages from scratch. Communicating human intent in the commit message is the right thing to do, and not just because that's LLVM policy.

Speaking of LLVM specifically, there are a few common issues that I end up editing almost always.

LLMs are ridiculously comment-happy. The LLVM code base is very sparsely commented. While I personally think that a few more comments often wouldn't hurt, the comments produced by LLMs are of incredibly low value on average. Most of the time, they restate the obvious.

As of right now, I am hesitant to try to guide the LLM away from generating so many comments for three reasons. First, anything done to push the model away from its trained preference seems likely to have a negative effect elsewhere. Second, while the comments would generally hurt when trying to read the code from scratch, they do often help to review the specific changes that the model was trying to achieve. Third, deleting the comments is quick and easy and prevents me from becoming lazy.

LLMs are trained for the wrong idea of defensive programming by LLVM's standards. In LLVM, we want bad internal state to fail hard and fast. If a pointer is unexpectedly null, we'd rather abort immediately. This biases us strongly towards using assertions. One could call this defensive programming, but it is very different from the more traditional form of defensive programming where you try to limp along when faced with bad internal state, which is what LLMs are biased towards.

I have tried to guide the LLMs away from this behavior with mixed results. I do wonder if the situation could be improved with additional deterministic tooling that would help purely human development as well. Roughly speaking, the idea would be to observe the coverage of conditions in the code while running the test suite. If a condition is never exercised, that's a good sign that there is either a missing test case or the condition should be converted into an assertion. However, there are also many exceptions to this rule.
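
To make the contrast between the two styles concrete, here is a minimal sketch in plain C++ without any LLVM headers. The `Node` type and both function names are made up for illustration and are not taken from the LLVM code base. The first function follows the "fail hard and fast" style, the second is the limp-along style that agents tend to produce.

```cpp
#include <cassert>

// Hypothetical stand-in for some IR object; not an LLVM type.
struct Node {
  int numOperands;
};

// Fail-fast style: a null pointer here is a bug in the caller, so the
// invariant is stated as an assertion. With assertions enabled, the
// program aborts right where the invariant is broken.
int getNumOperandsStrict(const Node *node) {
  assert(node && "caller must pass a valid node");
  return node->numOperands;
}

// Limp-along style: the bad state is silently turned into a plausible
// looking result, so the actual bug surfaces much later, if at all.
int getNumOperandsLenient(const Node *node) {
  if (!node)
    return 0;
  return node->numOperands;
}
```

Keep in mind that `assert` compiles to nothing when `NDEBUG` is defined, so the strict variant only aborts in builds that have assertions enabled.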

Other common issues that are not as specific to LLVM:

LLMs produce overly verbose code. If a piece of domain logic is already implemented in one place, but they need it in a second, they tend to duplicate the logic instead of refactoring it out into a common location.

It is unsurprising that LLMs are bad at making judgments about this. Even humans have a hard time and come up with pithy heuristics like DRY and WET that fail to grasp the issue properly. What you really need is a proper understanding of the problem domain and its mapping to the code. This allows you to decide whether two pieces of logic are inherently identical because they map to the same aspect of the domain, or whether they only incidentally look the same but are in fact conceptually separate.

Confused and convoluted logic. This is more likely to happen when the agent fails to one-shot a solution and ends up iterating in a debug-retry loop. Similar to the previous bullet, the resulting logic does not map properly to the underlying problem domain. It is not uncommon for the agent to produce a chunk of code that I can fairly easily reduce down to half the length by properly mapping the logic to the problem domain.

So far, I have never used the code changes produced by an agent as-is. I have always done at least some minimal amount of editing, and not just for the sake of it.

Is agentic coding worth it?

That question has a lot of facets.

At its most myopic, one can ask whether the use of agentic coding saves enough time to be worth the money that is paid for the evaluation and generation of tokens. I currently believe that to be the case.

Getting to quality with agentic coding still requires a lot of human involvement. That said, making robust changes requires more than just editing code. In all but the most trivial changes, your first edit will have bugs. For LLVM specifically, rebuilding the project and running even basic offline tests takes more than a minute and falls into that awkward window of non-interactivity that I described earlier. When a test fails, it needs to be debugged, understood, and a fix applied. And even for the kinds of relatively simple tasks that I give the agents, I may easily end up in multiple iterations of this loop if I do it myself. Having an LLM-based agent drive this tedious loop is genuinely useful. By the time the agent has finished and I review its output, it will usually have completed the loop. And if the agent addressed a bug incorrectly, at least it has already done some initial root-cause analysis. The analysis is not always reliable, but it does allow me to bring my own understanding to bear more quickly. And so that's the biggest part of how agentic coding saves time in my experience.

Over the course of a day, I might spend something like 20 minutes prompting the agent every once in a while, and another half hour to 45 minutes editing and wrangling the result of something that may well have taken me an afternoon without agentic coding. That's far from the 10x that is promised by some, and it only speeds up a fraction of my work. Amdahl's law applies. But that doesn't matter for the question of whether the cost of the tokens is justified on some accountant's spreadsheet. By the statistics I can see on my token usage, it does seem justified by a healthy margin. Probably not by quite as much as the loudest AI advocates would like, but I'm going to bet neither against further improvements in the models and agent harnesses nor against further cost optimizations in the underlying inference technology (making the latter happen is part of my job).
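
As a rough illustration of how those numbers combine, suppose that "an afternoon" means about four hours (240 minutes), and that the delegated tasks make up a quarter of the total work. Both figures are assumptions picked for the sake of the arithmetic, not measurements. The speedup on the delegated task alone, and the overall speedup that Amdahl's law gives for a task speedup of about 4, are then:

```latex
\begin{align*}
s_{\text{task}} &= \frac{240}{20 + 45} \approx 3.7
  \quad\text{to}\quad \frac{240}{20 + 30} = 4.8 \\
S_{\text{overall}} &= \frac{1}{(1 - p) + \dfrac{p}{s_{\text{task}}}}
  = \frac{1}{0.75 + \dfrac{0.25}{4}}
  = \frac{1}{0.8125} \approx 1.23
\end{align*}
```

Under these assumptions, a fourfold speedup on a quarter of the work yields an overall gain of roughly 23%.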

What about other facets of the question? Is agentic coding worth enough to justify the current investment bonanza? What about the circular financial shenanigans that are being played? Are the externalities of data centers properly priced in? Not to mention the hypocrisy around copyright. And what does the cocktail of social media and generative AI do to the fabric of our society? There are good reasons for concern on each of those points and more, but those are big topics, and this post is already quite long, so please forgive my leaving it at that.

Agentic coding for better code quality

One facet that I do want to touch on is that the momentum around agentic coding at large is very clearly moving towards a reduction in software quality. This is inherent to the idea of "vibe coding", and the term "slop" is thrown around for good reasons. The influx of crap in software projects is an urgent problem, especially where it comes to extractive contributions. It is no wonder that we're seeing many projects adopt new policies around AI tooling.

However, agentic coding can be used to improve code quality. I know that because I have already done so, and I believe that this experience could apply more widely. Every sufficiently senior software developer knows the struggle with tech debt. There is a general pattern where cleaning up tech debt requires a significant amount of refactoring work that is fairly tedious and wasn't traditionally viable to automate. Simple search-and-replace is usually not flexible enough. Writing a rule for a refactoring tool like clang-tidy or Coccinelle requires a lot of expertise. LLM-based agents aren't deterministic and are therefore not as reliable as those tools, but they do have a lot of flexibility. They can be used to make the refactors directly, and that's what I have already done. Or perhaps we can get them to write rules for those deterministic tools. That is something I still want to explore.

I also have the impression that by using both my brain and an agent, I get some of the same quality benefits that talking to another human would give. For example, the agent sometimes finds an approach that is better than what I had thought of.

At least in large companies, quality is fundamentally never going to be the path of least resistance because the cost of tech debt is so difficult to quantify. But if the cost of reducing tech debt goes down, it becomes easier to justify. That gives me some hope. I encourage others to explore this direction as well. Of course, for this to become a net benefit, we do also need to figure out how to cap the rate at which tech debt accumulates through the careless use of agentic coding.

Closing thoughts

I do not believe the maximalist AI hype.

There are those who publish pseudo-scientific mathematical models by which they predict massive upheaval within just a few years. Those models are so dubious and sensitive to chosen parameters that one may as well read tea leaves. Aside from this negative reason for disbelieving the hype, there are positive reasons to believe that change is going to happen more slowly. One of them is that perhaps the closest analogy we have today is self-driving cars. It's not that those aren't coming, but the timelines were massively exaggerated. Another reason is related to the observations around the lack of taste shown by coding agents today. Lack of taste manifests as compounding tech debt when coding agents are left to their own devices. This happens on time horizons that are much longer than even the longest contexts supported by LLMs today, so (1) it seems likely that some black swan innovations are required before the situation improves significantly and (2) even if some such innovations are made, there is no fundamental reason why they would end up scaling in the same way that LLMs have scaled so far.

I also do not believe the maximalist AI denial.

As laid out in this post, I believe that inference for agentic coding is economically viable today. I suspect that generative AI inference is economically viable for a whole bunch of other use cases as well, though I cannot really speak to those. We may well be in a bubble; but the models don't disappear when the bubble pops, and agentic coding is here to stay in some form.

In other words, there is a lot of uncertainty around timelines, but agentic coding is absolutely changing software development. And I do not feel entirely comfortable about that.

Monday, November 10, 2025

An alias analysis shower thought

Alias analysis allows a compiler to understand whether two memory accesses may conflict in the sense that they both touch the same memory location and at least one of them is a write. This in turn enables certain compiler optimizations. For example, memory accesses that do not conflict can be rearranged as part of instruction scheduling. Using this freedom where it exists is especially important in GPU programs, where memory accesses frequently have latencies of many hundreds of cycles, and there is no out-of-order scheduling in hardware to bail us out.
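
The definition of a conflict fits in a few lines of code. The following is a minimal sketch, assuming that each access is described by an exactly known byte range. The `Access` struct and `mayConflict` are hypothetical names and not LLVM's alias analysis API. Real alias analysis has to work with symbolic pointers and must answer conservatively when it cannot prove anything.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a memory access with a known address range.
struct Access {
  std::uint64_t begin; // address of the first byte touched
  std::uint64_t size;  // number of bytes touched
  bool isWrite;
};

// Two accesses conflict if their byte ranges overlap and at least one of
// them is a write. Two reads never conflict.
bool mayConflict(const Access &a, const Access &b) {
  // Precondition of this sketch: ranges do not wrap around the address space.
  assert(a.begin + a.size >= a.begin && b.begin + b.size >= b.begin);
  bool overlap = a.begin < b.begin + b.size && b.begin < a.begin + a.size;
  return overlap && (a.isWrite || b.isWrite);
}
```

A scheduler may only swap two memory instructions if a check of this kind proves that they do not conflict.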

Instruction scheduling typically happens after instruction selection. Instruction selection makes a program harder to reason about because it replaces an internal representation that uses fairly generic instructions (such as pointer addition and pointer-sized multiplications) with target-specific instructions that can be more complex to reason about (such as shift and an add combined into a single instruction, or a 64-bit addition decomposed into two 32-bit additions with a carry between them). This in turn makes alias analysis a lot harder.
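
To illustrate the last of those examples, here is a sketch of what a 64-bit addition looks like once it has been split into 32-bit halves. The function name `add64ViaHalves` is made up. It mimics in plain C++ what a target with 32-bit registers does with an add followed by an add-with-carry.

```cpp
#include <cstdint>

// Adds two 64-bit values using only 32-bit additions plus a carry bit.
std::uint64_t add64ViaHalves(std::uint64_t a, std::uint64_t b) {
  std::uint32_t aLo = static_cast<std::uint32_t>(a);
  std::uint32_t aHi = static_cast<std::uint32_t>(a >> 32);
  std::uint32_t bLo = static_cast<std::uint32_t>(b);
  std::uint32_t bHi = static_cast<std::uint32_t>(b >> 32);

  std::uint32_t lo = aLo + bLo;           // wraps around modulo 2^32
  std::uint32_t carry = lo < aLo ? 1 : 0; // a wrap-around means carry out
  std::uint32_t hi = aHi + bHi + carry;   // add with carry

  return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
```

Once an address only exists as two separate halves connected by a carry, recognizing that it is still "base plus offset" takes considerably more work than looking at a single pointer addition.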

LLVM addresses this issue by keeping around references to the pointer values of the pre-isel intermediate representation. Alias analysis is then performed based on those pointer values.

To some extent this seems a bit arbitrary and merely a curious historical artifact. There are many "isel-related" transforms that happen on LLVM IR before the "actual" instruction selection pass that generates MachineIR. Especially with the unfortunate co-existence of SelectionDAG and GlobalISel, there is a strong incentive to extract certain common lowerings into earlier passes in LLVM IR, to avoid code duplication. However, pulling complex lowerings on address calculations earlier in the pass pipeline currently means likely losing important information about pointers and therefore weakening alias analysis. It would be great if we could explicitly preserve the pointers from an earlier point in compilation somehow.

There really doesn't seem to be a good a priori argument against it. A compiler written from scratch could easily use a unified IR substrate throughout and freely choose a point at which pointers for alias analysis become "frozen". It'd just be a massive undertaking to move LLVM to such a model.

Wednesday, February 07, 2024

Building a HIP environment from scratch

HIP is a C++-based, single-source programming language for writing GPU code. "Single-source" means that a single source file can contain both the "host code" which runs on the CPU and the "device code" which runs on the GPU. In a sense, HIP is "CUDA for AMD", except that HIP can actually target both AMD and Nvidia GPUs.

If you merely want to use HIP, your best bet is to look at the documentation and download pre-built packages. (By the way, the documentation calls itself "ROCm" because that's what AMD calls its overall compute platform. It includes HIP, OpenCL, and more.)

I like to dig deep, though, so I decided I want to build at least the user space parts myself to the point where I can build a simple HelloWorld using a Clang from upstream LLVM. It's all open-source, after all!

It's a bit tricky, though, in part because of the kind of bootstrapping problems you usually get when building toolchains: Running the compiler requires runtime libraries, at least by default, but building the runtime libraries requires a compiler. Luckily, it's not quite that difficult, though, because compiling the host libraries doesn't require a HIP-enabled compiler - any C++ compiler will do. And while the device libraries do require a HIP- (and OpenCL-)enabled compiler, it is possible to build code in a "freestanding" environment where runtime libraries aren't available.

What follows is pretty much just a list of steps with running commentary on what the individual pieces do, since I didn't find an equivalent recipe in the official documentation. Of course, by the time you read this, it may well be outdated. Good luck!

Components need to be installed, but installing into some arbitrary prefix inside your $HOME works just fine. Let's call it $HOME/prefix. All packages use CMake and can be built using invocations along the lines of:

cmake -S . -B build -GNinja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_INSTALL_PREFIX=$HOME/prefix -DCMAKE_PREFIX_PATH=$HOME/prefix
ninja -C build install

In some cases, additional variables need to be set.

Step 1: clang and lld

We're going to need a compiler and linker, so let's get llvm/llvm-project and build it with Clang and LLD enabled: -DLLVM_ENABLE_PROJECTS='clang;lld' -DLLVM_TARGETS_TO_BUILD='X86;AMDGPU'

Building LLVM is an art of its own which is luckily reasonably well documented, so I'm going to leave it at that.

Step 2: Those pesky cmake files

Build and install ROCm/rocm-cmake to avoid cryptic error messages down the road when building other components that use those CMake files without documenting the dependency clearly. Not rocket science, but man am I glad for GitHub's search function.

Step 3: libhsa-runtime64.so

This is the lowest level user space host-side library in the ROCm stack. Its services, as far as I understand them, include setting up device queues and loading "code objects" (device ELF files). All communication with the kernel driver goes through here.

Notably though, this library does not know how to dispatch a kernel! In the ROCm world, the so-called Architected Queueing Language is used for that. An AQL queue is set up with the help of the kernel driver (and that does go through libhsa-runtime64.so), and then a small ring buffer and a "door bell" associated with the queue are mapped into the application's virtual memory space. When the application wants to dispatch a kernel, it (or rather, a higher-level library like libamdhip64.so that it links against) writes an AQL packet into the ring buffer and "rings the door bell", which basically just means writing a new ring buffer head pointer to the door bell's address. The door bell virtual memory page is mapped to the device, so ringing the door bell causes a PCIe transaction (for us peasants; MI300A has slightly different details under the hood) which wakes up the GPU.
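
The dispatch mechanism can be mimicked in plain C++. What follows is only a toy model under loud assumptions: `ToyQueue`, `ToyPacket`, and `submit` are invented names, the packet layout is not the real AQL packet format, and the "door bell" is an ordinary atomic variable rather than a device-mapped page. It only shows the order of operations: write the packet into the ring buffer first, then publish the new index through the door bell.

```cpp
#include <array>
#include <atomic>
#include <cstdint>

// Invented stand-in for a dispatch packet; not the real AQL layout.
struct ToyPacket {
  std::uint64_t kernelAddress;
  std::uint32_t gridSize;
};

// Invented stand-in for a user-mode queue.
struct ToyQueue {
  static constexpr std::uint64_t Capacity = 8;
  std::array<ToyPacket, Capacity> ring{};
  std::uint64_t writeIndex = 0;           // next slot the producer fills
  std::atomic<std::uint64_t> doorbell{0}; // a real one is mapped to the device
};

// Writes one packet and then "rings the door bell". The release store makes
// the packet contents visible before a consumer can observe the new index.
// This sketch does not check whether the queue is full; a real producer
// also has to make sure it does not overwrite unprocessed packets.
void submit(ToyQueue &queue, const ToyPacket &packet) {
  queue.ring[queue.writeIndex % ToyQueue::Capacity] = packet;
  ++queue.writeIndex;
  queue.doorbell.store(queue.writeIndex, std::memory_order_release);
}
```

In this model a consumer thread would poll `doorbell` with an acquire load. On real hardware, the store to the door bell address is what turns into the bus transaction described above.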

Anyway, libhsa-runtime64.so comes in two parts for what I am being told are largely historical reasons:

ROCm/ROCT-Thunk-Interface
ROCm/ROCR-Runtime; this one has one of those bootstrap issues and needs a -DIMAGE_SUPPORT=OFF

The former is statically linked into the latter...

Step 4: It which must not be named

For Reasons(tm), there is a fork of LLVM in the ROCm ecosystem, ROCm/llvm-project. Using upstream LLVM for the compiler seems to be fine and is what I as a compiler developer obviously want to do. However, this fork has an amd directory with a bunch of pieces that we'll need. I believe there is a desire to upstream them, but also an unfortunate hesitation from the LLVM community to accept something so AMD-specific.

In any case, the required components can each be built individually against the upstream LLVM from step 1:

hipcc; this is a frontend for Clang which is supposed to be user-friendly, but at the cost of adding an abstraction layer. I want to look at the details under the hood, so I don't want to and don't have to use it; but some of the later components want it
device-libs; as the name says, these are libraries of device code. I'm actually not quite sure what the intended abstraction boundary is between this one and the HIP libraries from the next step. I think these ones are meant to be tied more closely to the compiler so that other libraries, like the HIP library below, don't have to use __builtin_amdgcn_* directly? Anyway, just keep on building...
comgr; the "code object manager". Provides a stable interface to LLVM, Clang, and LLD services, up to (as far as I understand it) invoking Clang to compile kernels at runtime. But it seems to have no direct connection to the code-related services in libhsa-runtime64.so.

That last one is annoying. It needs a -DBUILD_TESTING=OFF

Worse, it has a fairly large interface with the C++ code of LLVM, which is famously not stable. In fact, at least during my little adventure, comgr wouldn't build as-is against the LLVM (and Clang and LLD) build that I got from step 1. I had to hack out a little bit of code in its symbolizer. I'm sure it's fine.

Step 5: libamdhip64.so

Finally, here comes the library that implements the host-side HIP API. It also provides a bunch of HIP-specific device-side functionality, mostly by leaning on the device-libs from the previous step.

It lives in ROCm/clr, which stands for either Compute Language Runtimes or Common Language Runtime. Who knows. Either one works for me. It's obviously for compute, and it's common because it also contains OpenCL support.

You also need ROCm/HIP at this point. I'm not quite sure why stuff is split up into so many repositories. Maybe ROCm/HIP is also used when targeting Nvidia GPUs with HIP, but ROCm/clr isn't? Not a great justification in my opinion, but at least this is documented in the README.

CLR also needs a bunch of additional CMake options: -DCLR_BUILD_HIP=ON -DHIP_COMMON_DIR=${checkout of ROCm/HIP} -DHIPCC_BIN_DIR=$HOME/prefix/bin

Step 6: Compiling with Clang

We can now build simple HIP programs with our own Clang against our own HIP and ROCm libraries:

clang -x hip --offload-arch=gfx1100 --rocm-path=$HOME/prefix -rpath $HOME/prefix/lib -lstdc++ HelloWorld.cpp
LD_LIBRARY_PATH=$HOME/prefix/lib ./a.out

Neat, huh?

Friday, May 12, 2023

An Update on Dialects in LLVM

EuroLLVM took place in Glasgow this week. I wasn't there, but it's a good opportunity to check in with what's been happening in dialects for LLVM in the ~half year since my keynote at the LLVM developer meeting.

Where we came from

To give an ultra-condensed recap: The excellent idea that MLIR brought to the world of compilers is to explicitly separate the substrate in which a compiler intermediate representation is implemented (the class hierarchy and basic structures that are used to represent and manipulate the program representation at compiler runtime) from the semantic definition of a dialect (the types and operations that are available in the IR and their meaning). Multiple dialects can co-exist on the same substrate, and in fact the phases of compilation can be identified with the set of dialects that are used within each phase.

Unfortunately for AMD's shader compiler, while MLIR is part of the LLVM project and shares some foundational support libraries with LLVM, its IR substrate is entirely disjoint from LLVM's IR substrate. If you have an existing compiler built on LLVM IR, you could bolt on an MLIR-based frontend, but what we really need is a way to gradually introduce some of the capabilities offered by MLIR throughout an existing LLVM-based compilation pipeline.

That's why I started llvm-dialects last year. We published its initial release a bit more than half a year ago, and have greatly expanded its capabilities since then.

Where we are now

We have been using llvm-dialects in production for a while now. Some of its highlights so far are:

Almost feature-complete for defining custom operations (aka intrinsics or instructions). The main thing that's missing is varargs support - we just haven't needed that yet.
Most of the way there for defining custom types: custom types can be defined, but they can't be used everywhere. I'm working on closing the gaps as we speak - some upstream changes in LLVM itself are required.
Expressive language for describing constraints on operation and type arguments and operation results - see examples here and here.
Thorough, automatically generated IR verifier routines.
A flexible and efficient visitor mechanism that is inspired by but beats LLVM's TypeSwitch in some important ways.

Transitioning to the use of llvm-dialects is a gradual process for us and far from complete. We have always had custom operations, but we used to implement them in a rather ad-hoc manner. The old way of doing it consisted of hand-writing code like this:

SmallVector<Value *, 4> args;
std::string instName = lgcName::OutputExportXfb;
args.push_back(getInt32(xfbBuffer));
args.push_back(xfbOffset);
args.push_back(getInt32(streamId));
args.push_back(valueToWrite);
addTypeMangling(nullptr, args, instName);
return CreateNamedCall(instName, getVoidTy(), args, {});

With llvm-dialects, we can use a much cleaner builder pattern:

return create<InputImportGenericOp>(
    resultTy, false, location, getInt32(0), elemIdx,
    PoisonValue::get(getInt32Ty()));

Accessing the operands of a custom operation used to be a matter of code with magic numbers everywhere:

if (callInst.arg_size() > 2)
  vertexIdx = isDontCareValue(callInst.getOperand(2))
                  ? nullptr : callInst.getOperand(2);

With llvm-dialects, we get far more readable code:

Value *vertexIdx = nullptr;
if (!inputOp.getPerPrimitive())
  vertexIdx = inputOp.getArrayIndex();

Following the example set by MLIR, these accessor methods as well as the machinery required to make the create<FooOp>(...) builder call work are automatically generated from a dialect definition written in a TableGen DSL.
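To make the idea of generated accessors a bit more concrete, here is a minimal standalone sketch of the kind of wrapper a generator could emit. It deliberately does not use LLVM or llvm-dialects: `Value`, `Call`, `InputOp`, the name `example.input` and the operand layout are all hypothetical stand-ins chosen for illustration, not the actual generated code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for llvm::Value and llvm::CallInst.
struct Value {
  std::string name;
};

struct Call {
  std::string callee;
  std::vector<Value *> operands;
};

// The kind of thin wrapper a generator could emit from a dialect definition.
// The operand order is written down once, here, instead of being repeated
// as magic numbers at every use site.
class InputOp {
  enum Operand : std::size_t {
    Location = 0,
    ElemIdx = 1,
    ArrayIndex = 2,
    NumOperands = 3
  };

  const Call *call;
  explicit InputOp(const Call &c) : call(&c) {}

public:
  // Returns true if `c` is an instance of this (hypothetical) operation.
  static bool classof(const Call &c) {
    return c.callee == "example.input" && c.operands.size() == NumOperands;
  }

  // Checked cast: the caller must have established classof(c).
  static InputOp cast(const Call &c) {
    assert(classof(c) && "not an example.input operation");
    return InputOp(c);
  }

  // Named accessors replace getOperand(<magic number>).
  Value *getLocation() const { return call->operands[Location]; }
  Value *getElemIdx() const { return call->operands[ElemIdx]; }
  Value *getArrayIndex() const { return call->operands[ArrayIndex]; }
};
```

A consumer would first test `InputOp::classof(call)`, then work with `InputOp::cast(call).getArrayIndex()` and friends. If the operand order ever changes, only the wrapper needs to be regenerated.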

An important lesson from the transition so far is that the biggest effort, but also one of the biggest benefits, has to do with getting to a properly defined IR in the first place.

I firmly believe that understanding a piece of software starts not with the code that is executed but with the interfaces and data structures that the code implements and interacts with. In a compiler, the most important data structure is the IR. You should think of the IR as the bulk of the interface for almost all compiler code.

When defining custom operations in the ad-hoc manner that we used to use, there isn't one place in which the operations themselves are defined. Instead, the definition is implicit in the scattered locations where the operations are created and consumed. More often than is comfortable, this leads to definitions that are fuzzy or confused, which leads to code that is fuzzy and confused, which leads to bugs and a high maintenance cost, which leads to the dark side (or something).

By having a designated location where the custom operations are explicitly defined - the TableGen file - there is a significant force pushing towards proper definitions. As the experience of MLIR shows, this isn't automatic (witness the rather thin documentation of many of the dialects in upstream MLIR), but without this designated location, it's bound to be worse. And so a large part of transitioning to a systematically defined dialect is cleaning up those instances of confusion and fuzziness. It pays off: I have found hidden bugs this way, and the code becomes noticeably more maintainable.

Where we want to go

llvm-dialects is already a valuable tool for us. I'm obviously biased, but if you're in a similar situation to us, or you're thinking of starting a new LLVM-based compiler, I recommend it.

There is more that can be done, though, and I'm optimistic we'll get around to further improvements over time as we gradually convert parts of our compiler that are being worked on anyway. My personal list of items on the radar:

As mentioned already, closing the remaining gaps in custom type support.
Our compiler uses quite complex metadata in a bunch of places. It's hard to read for humans, doesn't have a good compatibility story for lit tests, and accessing it at compile-time isn't particularly efficient. I have some ideas for how to address all these issues with an extension mechanism that could also benefit upstream LLVM.
Compile-time optimizations. At the moment, casting custom operations is still based on string comparison, which is clearly not ideal. There are a bunch of other things in this general area as well.
I really want to see some equivalent of MLIR regions in LLVM. But that's a non-trivial amount of work and will require patience.

There's also the question of if or when llvm-dialects will eventually be integrated into LLVM upstream. There are lots of good arguments in favor. Its DSL for defining operations is a lot friendlier than what is used for intrinsics at the moment. Getting nice, auto-generated accessor methods and thorough verification for intrinsics would clearly be a plus. But it's not a topic that I'm personally going to push in the near future. I imagine we'll eventually get there once we've collected even more experience.

Of course, if llvm-dialects is useful to you and you feel like contributing in these or other areas, I'd be more than happy about that!

Wednesday, March 09, 2022

A New Type of Convergence Control Intrinsic?

Subgroup operations or wave intrinsics, such as reducing a value across the threads of a shader subgroup or wave, were introduced in GPU programming languages a while ago. They communicate with other threads of the same wave, for example to exchange the input values of a reduction, but not necessarily with all of them if there is divergent control flow.

In LLVM, we call such operations convergent. Unfortunately, LLVM does not define how the set of communicating threads in convergent operations -- the set of converged threads -- is affected by control flow.

If you're used to thinking in terms of structured control flow, this may seem trivial. Obviously, there is a tree of control flow constructs: loops, if-statements, and perhaps a few others depending on the language. Two threads are converged in the body of a child construct if and only if both execute that body and they are converged in the parent. Throw in some simple and intuitive rules about loop counters and early exits (nested return, break and continue, that sort of thing) and you're done.

In an unstructured control flow graph, the answer is not obvious at all. I gave a presentation at the 2020 LLVM Developers' Meeting that explains some of the challenges as well as a solution proposal that involves adding convergence control tokens to the IR.

Very briefly, convergent operations in the proposal use a token variable that is defined by a convergence control intrinsic. Two dynamic instances of the same static convergent operation from two different threads are converged if and only if the dynamic instances of the control intrinsic producing the used token values were converged.

(The published draft of the proposal talks of multiple threads executing the same dynamic instance. I have since been convinced that it's easier to teach this matter if we instead always give every thread its own dynamic instances and talk about a convergence equivalence relation between dynamic instances. This doesn't change the resulting semantics.)

The draft has three such control intrinsics: anchor, entry, and (loop) heart. Of particular interest here is the heart. For the most common and intuitive use cases, a heart intrinsic is placed in the header of natural loops. The token it defines is used by convergent operations in the loop. The heart intrinsic itself also uses a token that is defined outside the loop: either by another heart in the case of nested loops, or by an anchor or entry. The heart combines two intuitive behaviors:

It uses a token in much the same way that convergent operations do: two threads are converged for their first execution of the heart if and only if they were converged at the intrinsic that defined the used token.
Two threads are converged at subsequent executions of the heart if and only if they were converged for the first execution and they are currently at the same loop iteration, where iterations are counted by a virtual loop counter that is incremented at the heart.
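The two heart rules can be illustrated with a small standalone sketch over execution traces. This is a toy model, not part of the proposal: a trace is just the list of basic block names a thread executes, every execution of the token-defining block starts a fresh virtual loop counter, and the names `Trace`, `HeartLabel`, `labelHearts` and `heartsConverged` are made up for this example.

```cpp
#include <string>
#include <utility>
#include <vector>

// A trace is the sequence of basic blocks that one thread executes.
using Trace = std::vector<std::string>;

// Label of one dynamic instance of the heart:
// (index of the token definition it refers to, virtual loop counter).
using HeartLabel = std::pair<int, int>;

// Walk a trace and label every execution of the heart block.
// Toy-model assumption: each execution of tokenBlock defines a fresh token
// and resets the virtual loop counter.
std::vector<HeartLabel> labelHearts(const Trace &trace,
                                    const std::string &tokenBlock,
                                    const std::string &heartBlock) {
  std::vector<HeartLabel> labels;
  int entry = -1;  // no token has been defined yet
  int counter = 0; // virtual loop counter, incremented at the heart
  for (const std::string &block : trace) {
    if (block == tokenBlock) {
      ++entry;
      counter = 0;
    } else if (block == heartBlock && entry >= 0) {
      labels.push_back({entry, counter});
      ++counter;
    }
  }
  return labels;
}

// Two dynamic instances of the heart are converged if and only if the
// threads were converged at the token definition and the instances refer to
// the same token definition and the same loop iteration.
bool heartsConverged(const HeartLabel &a, const HeartLabel &b,
                     bool convergedAtToken) {
  return convergedAtToken && a == b;
}
```

For example, with a token defined in a block `E` and a heart in a loop header `H`, a thread that runs three iterations and one that runs two are converged at the heart for iterations 0 and 1 only. The third dynamic instance in the first thread has no converged partner.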

Viewed from this angle, how about we define a weaker version of these rules that lies somewhere between an anchor and a loop heart? We could call it a "light heart", though I will stick with "iterating anchor". The iterating anchor defines a token but has no arguments. As with the anchor, the set of converged threads is implementation-defined, but only when the iterating anchor is first encountered. When threads encounter the iterating anchor again without leaving the dominance region of its containing basic block, they are converged if and only if they were converged during their previous encounter of the iterating anchor.

The notion of an iterating anchor came up when discussing the convergence behaviors that can be guaranteed for natural loops. Is it possible to guarantee that natural loops always behave in the natural way -- according to their loop counter -- when it comes to convergence?

Naively, this should be possible: just put hearts into loop headers! Unfortunately, that's not so straightforward when multiple natural loops are contained in an irreducible loop:

Hearts in A and C must refer to a token defined outside the loops; that is, a token defined in E. The resulting program is ill-formed because it has a closed path that goes through two hearts that use the same token, but the path does not go through the definition of that token. This well-formedness rule exists because the rules about heart semantics are unsatisfiable if the rule is broken.

The underlying intuitive issue is that if the branch at E is divergent in a typical implementation, the wave (or subgroup) must choose whether A or C is executed first. Neither choice works. The heart in A indicates that (among the threads that are converged in E) all threads that visit A (whether immediately or via C) must be converged during their first visit of A. But if the wave executes A first, then threads which branch directly from E to A cannot be converged with those that first branch to C. The opposite conflict exists if the wave executes C first.

If we replace the hearts in A and C by iterating anchors, this problem goes away because the convergence during the initial visit of each block is implementation-defined. In practice, it should fall out of which of the blocks the implementation decides to execute first.

So it seems that iterating anchors can fill a gap in the expressiveness of the convergence control design. But are they really a sound addition? There are two main questions:

Satisfiability: Can the constraints imposed by iterating anchors be satisfied, or can they cause the sort of logical contradiction discussed for the example above? And if so, is there a simple static rule that prevents such contradictions?
Spooky action at a distance: Are there generic code transforms which change semantics while changing a part of the code that is distant from the iterating anchor?

The second question is important because we want to add convergence control to LLVM without having to audit and change the existing generic transforms. We certainly don't want to hurt compile-time performance by increasing the amount of code that generic transforms have to examine for making their decisions.

Satisfiability

Consider the following simple CFG with an iterating anchor in A and a heart in B that refers back to a token defined in E:

Now consider two threads that are initially converged with execution traces:

E - A - A - B - X
E - A - B - A - X

The heart rule implies that the threads must be converged in B. The iterating anchor rule implies that if the threads are converged in their first dynamic instances of A, then they must also be converged in their second dynamic instances of A. But the first thread executes its second A before B, while the second thread executes it after B, so both requirements cannot hold at once: a temporal paradox.
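One way to check the conflict mechanically is sketched below. It rests on an assumption that is my reading of the paradox rather than a rule from the proposal: if a wave executes converged dynamic instances together, then the converged pairs must appear in the same relative order in both threads. The names `ConvergedPair` and `isOrderConsistent` are made up for this example.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// One required convergence between two threads:
// (position in thread 1's trace, position in thread 2's trace).
using ConvergedPair = std::pair<std::size_t, std::size_t>;

// Assumption: converged dynamic instances execute together, so any two
// converged pairs must be ordered the same way in both threads.
// Returns false if two pairs "cross".
bool isOrderConsistent(const std::vector<ConvergedPair> &pairs) {
  for (std::size_t a = 0; a < pairs.size(); ++a) {
    for (std::size_t b = a + 1; b < pairs.size(); ++b) {
      bool beforeInThread1 = pairs[a].first < pairs[b].first;
      bool beforeInThread2 = pairs[a].second < pairs[b].second;
      if (beforeInThread1 != beforeInThread2)
        return false;
    }
  }
  return true;
}
```

Numbering the trace positions from 0, thread 1 is E(0) A(1) A(2) B(3) X(4) and thread 2 is E(0) A(1) B(2) A(3) X(4). The heart rule demands the pair (3, 2) for B, and the iterating anchor rule demands (1, 1) and (2, 3) for the two instances of A. The pairs (3, 2) and (2, 3) cross, so the check fails. Dropping the requirement on the second instances of A makes it pass.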

One could try to resolve the paradox by saying that the threads cannot be converged in A at all, but this would mean that the threads must diverge before a divergent branch occurs. That seems unreasonable, since typical implementations want to avoid divergence as long as control flow is uniform.

The example arguably breaks the spirit of the rule about convergence regions from the draft proposal linked above, and so a minor change to the definition of convergence region may be used to exclude it.

What if the CFG instead looks as follows, which does not break any rules about convergence regions:

For the same execution traces, the heart rule again implies that the threads must be converged in B. The convergence of the first dynamic instances of A is technically implementation-defined, but we'd expect most implementations to treat them as converged.

The second dynamic instances of A cannot be converged due to the convergence of the dynamic instances of B. That's okay: the second dynamic instance of A in thread 2 is a re-entry into the dominance region of A, and so its convergence is unrelated to any convergence of earlier dynamic instances of A.

Spooky action at a distance

Unfortunately, we still cannot allow this second example. A program transform may find that the conditional branch in E is constant and the edge from E to B is dead. Removing that edge brings us back to the previous example, which is ill-formed. However, a transform which removes the dead edge would not normally inspect the blocks A and B or their dominance relation in detail. The program becomes ill-formed by spooky action at a distance.

The following static rule forbids both example CFGs: if there is a closed path through a heart and an iterating anchor, but not through the definition of the token that the heart uses, then the heart must dominate the iterating anchor.

There is at least one other issue of spooky action at a distance. If the iterating anchor is not the first (non-phi) instruction of its basic block, then it may be preceded by a function call in the same block. The callee may contain control flow that ends up being inlined. Back edges that previously pointed at the block containing the iterating anchor will then point to a different block, which changes the behavior quite drastically. Essentially, the iterating anchor is reduced to a plain anchor.

What can we do about that? It's tempting to decree that an iterating anchor must always be the first (non-phi) instruction of a basic block. Unfortunately, this is not easily done in LLVM in the face of general transforms that might sink instructions or merge basic blocks.

Preheaders to the rescue

We could chew through some other ideas for making iterating anchors work, but that turns out to be unnecessary. The desired behavior of iterating anchors can be obtained by inserting preheader blocks. The initial example of two natural loops contained in an irreducible loop becomes:

Place anchors in Ap and Cp and hearts in A and C that use the token defined by their respective dominating anchor. Convergence at the anchors is implementation-defined, but relative to this initial convergence at the anchor, convergence inside the natural loops headed by A and C behaves in the natural way, based on a virtual loop counter. The transform of inserting an anchor in the preheader is easily generalized.

To sum it up: We've concluded that defining an "iterating anchor" convergence control intrinsic is problematic, but luckily also unnecessary. The control intrinsics defined in the original proposal are sufficient. I hope that the discussion that led to those conclusions helps illustrate some aspects of the convergence control proposal for LLVM as well as the goals and principles that drove it.
