Saturday, May 04, 2024

A new kind of git history

Discussions about rebase vs. merge are familiar territory for anybody with an interest in version control in general and git in particular. I want to finally give a more permanent home to an idea that I have expressed in the past and that I've occasionally seen others hint at in those discussions as well.

There are multiple camps in these discussions that have slightly different ideas about how and for what purposes git should be used.

The first major axis of disagreement is whether history needs to be git bisect-able. Outside of my own little hobby projects, I've always worked on projects for which bisectability was important. This has generally been because their scope was such that CI simply had no chance to cover all uses of the software. Bug reports that can be traced to regressions from weeks or even months ago are not frequent per se, but they have always been frequent enough to matter. git bisect is an essential tool for finding those regression points when they happen. Not all projects are like that, but for projects which are, the notion of an "atomic" change to the project's main development branch (or branches) is important.

The second major axis of disagreement is whether the development history of those "atomic" changes is important enough to preserve. The original git development workflow does not consider this to be important: developers send around and review multiple iterations of a change, but only the final version of the change goes into the permanent record of the git repository. I tend to agree with that view. I have very occasionally found it useful to go back and read through the comments on a pull request that led to a change months ago (or the email thread in projects that use an email workflow), but I have never found it useful to look at older versions of a change.

Some people seem to really care about this kind of history, though. They're the people who argue for a merge-based workflow for pull requests on GitHub (but against force-pushes to the same) and who have built hacks for bisectability and readability of history like --first-parent. I'm calling that a hack because it does not compose well. It works for projects whose atomic change history is essentially linear, but it breaks down once the history becomes more complex. What if the project occasionally has a genuine merge? Now you'd want to apply --first-parent for most merge commits but not all. Things get messy.

One final observation. Even "my" camp, which generally prefers to discard development history leading up to the atomic change in a main development branch, does want to preserve a kind of history that is currently not captured by git's graph. git revert inserts the hash of the commit that was reverted into the commit message. Similarly, git cherry-pick optionally inserts the hash of the commit that was cherry-picked into the commit message.
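
For reference, this is where that history lives today: as free-form text in commit messages. A revert and a cherry-pick (the latter only when run as git cherry-pick -x) produce messages along these lines; the subject lines and hashes here are made up for illustration:

Revert "Frobnicate the widget"

This reverts commit 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b.

Frobnicate the widget

(cherry picked from commit 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b)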

In other words, there is a kind of history for whose preservation at least in some cases there seems to be a broad consensus. This kind of history is distinct from the history that is captured by commit parent links. Looked at in this light, the idea is almost obvious: make this history an explicit part of git commit metadata.

The gist of it would be this. Every commit has an (often empty) list of historical commit references explaining the origins of the diff that is implicitly represented by the commit; let's call them diff-parents. The diff-parents are an ordered list of references to commits, each of them with a "reverted" bit that can optionally be set.

The history of a revert can be encoded by making the reverted commit a diff-parent with the "reverted" bit set. The history of a cherry-pick can be encoded similarly, with the "reverted" bit clear. When we perform a simple rebase, each new commit has an obvious diff-parent. When commits are squashed during a rebase, the sequence of squashed commits becomes the list of diff-parents of the newly formed commit. GitHub users who like to preserve all development history can use the "squash" option when landing pull requests and have the history be preserved via the list of diff-parents. git commit --amend can similarly record the original commit as diff-parent.
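
To illustrate, a squashed pull request might end up as a commit object that carries its diff-parents directly, something like this. The diff-parent header is purely hypothetical syntax, and all names and hashes are made up; today only the tree, parent, author, and committer headers exist:

tree 83b1d0f2c4a6e8b0d2f4a6c8e0b2d4f6a8c0e2b4
parent 4fa2b1c3d5e7f9a1b3c5d7e9f1a3b5c7d9e1f3a5
diff-parent 77ab34cd56ef78ab90cd12ef34ab56cd78ef90ab
diff-parent d0e5c2f4a6b8d0e2c4f6a8b0d2e4c6f8a0b2d4e6
diff-parent 113f9e2d4c6b8a0f2e4d6c8b0a2f4e6d8c0b2a4f
author Jane Developer <jane@example.com> 1714800000 +0200
committer Jane Developer <jane@example.com> 1714800000 +0200

Add frobnication support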

This is an idea and not a fully fleshed-out plan. There are obviously a whole bunch of tricky questions to answer. For example: How does this all fit into git's admittedly often byzantine CLI? Can merge commits be diff-parents, and how would that work? Can we visualize the difference between a commit and its diff-parents? (Hint: Here's an idea)

Diff-parents are a source of potential information leaks. This is not a problem specific to the idea of diff-parents; it is a general problem with the idea of preserving all history. Imagine some developer accidentally commits some credentials in their local clone of a repository and then uses git commit --amend to remove them again. Whoops, the commit that contains the credentials is still referenced as a diff-parent. Will it (and therefore the credentials) be published to the world for all to see when the developer pushes their branch to GitHub? This needs to be taken seriously.

So there are a whole bunch of issues that would have to be addressed for this idea to work well. I believe those issues to be quite surmountable in principle, but given the state of git development (where GitHub, which to many is almost synonymous with git, doesn't even seem to be able to understand how git was originally meant to be used) I am not particularly optimistic. Still, I think it's a good idea, and I'd love to see it or something like it in git.

Friday, May 12, 2023

An Update on Dialects in LLVM

EuroLLVM took place in Glasgow this week. I wasn't there, but it's a good opportunity to check in with what's been happening in dialects for LLVM in the ~half year since my keynote at the LLVM developer meeting.

Where we came from

To give an ultra-condensed recap: The excellent idea that MLIR brought to the world of compilers is to explicitly separate the substrate in which a compiler intermediate representation is implemented (the class hierarchy and basic structures that are used to represent and manipulate the program representation at compiler runtime) from the semantic definition of a dialect (the types and operations that are available in the IR and their meaning). Multiple dialects can co-exist on the same substrate, and in fact the phases of compilation can be identified with the set of dialects that are used within each phase.

Unfortunately for AMD's shader compiler, while MLIR is part of the LLVM project and shares some foundational support libraries with LLVM, its IR substrate is entirely disjoint from LLVM's IR substrate. If you have an existing compiler built on LLVM IR, you could bolt on an MLIR-based frontend, but what we really need is a way to gradually introduce some of the capabilities offered by MLIR throughout an existing LLVM-based compilation pipeline.

That's why I started llvm-dialects last year. We published its initial release a bit more than half a year ago, and have greatly expanded its capabilities since then.

Where we are now

We have been using llvm-dialects in production for a while now. Some of its highlights so far are:

  • Almost feature-complete for defining custom operations (aka intrinsics or instructions). The main thing that's missing is varargs support - we just haven't needed that yet.
  • Most of the way there for defining custom types: custom types can be defined, but they can't be used everywhere. I'm working on closing the gaps as we speak  - some upstream changes in LLVM itself are required.
  • Expressive language for describing constraints on operation and type arguments and operation results - see examples here and here.
  • Thorough, automatically generated IR verifier routines.
  • A flexible and efficient visitor mechanism that is inspired by but beats LLVM's TypeSwitch in some important ways.

Transitioning to the use of llvm-dialects is a gradual process for us and far from complete. We have always had custom operations, but we used to implement them in a rather ad-hoc manner. The old way of doing it consisted of hand-writing code like this:

SmallVector<Value *, 4> args;
std::string instName = lgcName::OutputExportXfb;
args.push_back(getInt32(xfbBuffer));
args.push_back(xfbOffset);
args.push_back(getInt32(streamId));
args.push_back(valueToWrite);
addTypeMangling(nullptr, args, instName);
return CreateNamedCall(instName, getVoidTy(), args, {});

With llvm-dialects, we can use a much cleaner builder pattern:

return create<InputImportGenericOp>(
    resultTy, false, location, getInt32(0), elemIdx,
    PoisonValue::get(getInt32Ty()));

Accessing the operands of a custom operation used to be a matter of code with magic numbers everywhere:

if (callInst.arg_size() > 2)
  vertexIdx = isDontCareValue(callInst.getOperand(2))
                  ? nullptr : callInst.getOperand(2);

With llvm-dialects, we get far more readable code:

Value *vertexIdx = nullptr;
if (!inputOp.getPerPrimitive())
  vertexIdx = inputOp.getArrayIndex();

Following the example set by MLIR, these accessor methods as well as the machinery required to make the create<FooOp>(...) builder call work are automatically generated from a dialect definition written in a TableGen DSL.

An important lesson from the transition so far is that the biggest effort, but also one of the biggest benefits, has to do with getting to a properly defined IR in the first place.

I firmly believe that understanding a piece of software starts not with the code that is executed but with the interfaces and data structures that the code implements and interacts with. In a compiler, the most important data structure is the IR. You should think of the IR as the bulk of the interface for almost all compiler code.

When defining custom operations in the ad-hoc manner that we used to use, there isn't one place in which the operations themselves are defined. Instead, the definition is implicit in the scattered locations where the operations are created and consumed. More often than is comfortable, this leads to definitions that are fuzzy or confused, which leads to code that is fuzzy and confused, which leads to bugs and a high maintenance cost, which leads to the dark side (or something).

By having a designated location where the custom operations are explicitly defined - the TableGen file - there is a significant force pushing towards proper definitions. As the experience of MLIR shows, this isn't automatic (witness the rather thin documentation of many of the dialects in upstream MLIR), but without this designated location, it's bound to be worse. And so a large part of transitioning to a systematically defined dialect is cleaning up those instances of confusion and fuzziness. It pays off: I have found hidden bugs this way, and the code becomes noticeably more maintainable.

Where we want to go

llvm-dialects is already a valuable tool for us. I'm obviously biased, but if you're in a similar situation to us, or you're thinking of starting a new LLVM-based compiler, I recommend it.

There is more that can be done, though, and I'm optimistic we'll get around to further improvements over time as we gradually convert parts of our compiler that are being worked on anyway. My personal list of items on the radar:

  • As mentioned already, closing the remaining gaps in custom type support.
  • Our compiler uses quite complex metadata in a bunch of places. It's hard to read for humans, doesn't have a good compatibility story for lit tests, and accessing it at compile-time isn't particularly efficient. I have some ideas for how to address all these issues with an extension mechanism that could also benefit upstream LLVM.
  • Compile-time optimizations. At the moment, casting custom operations is still based on string comparison, which is clearly not ideal. There are a bunch of other things in this general area as well.
  • I really want to see some equivalent of MLIR regions in LLVM. But that's a non-trivial amount of work and will require patience.

There's also the question of if or when llvm-dialects will eventually be integrated into LLVM upstream. There are lots of good arguments in favor. Its DSL for defining operations is a lot friendlier than what is used for intrinsics at the moment. Getting nice, auto-generated accessor methods and thorough verification for intrinsics would clearly be a plus. But it's not a topic that I'm personally going to push in the near future. I imagine we'll eventually get there once we've collected even more experience.

Of course, if llvm-dialects is useful to you and you feel like contributing in these or other areas, I'd be more than happy about that!

Friday, June 11, 2021

Can memcpy be implemented in LLVM IR?

(Added a section on load/store pairs on June 14th)

This question probably seems absurd. An unoptimized memcpy is a simple loop that copies bytes. How hard can that be? Well...

There's a fascinating thread on llvm-dev started by George Mitenkov proposing a new family of "byte" types. I found the proposal and discussion difficult to follow. In my humble opinion, this is because the proposal touches some rather subtle and underspecified aspects of LLVM IR semantics, and rather than address those fundamentals systematically, it jumps right into the minutiae of the instruction set. I look forward to seeing how the proposal evolves. In the meantime, this article is a byproduct of me attempting to digest the problem space.

Here is a fairly natural way to (attempt to) implement memcpy in LLVM IR:

define void @memcpy(i8* %dst, i8* %src, i64 %n) {
entry:
  %dst.end = getelementptr i8, i8* %dst, i64 %n
  %isempty = icmp eq i64 %n, 0
  br i1 %isempty, label %out, label %loop

loop:
  %src.loop = phi i8* [ %src, %entry ], [ %src.next, %loop ]
  %dst.loop = phi i8* [ %dst, %entry ], [ %dst.next, %loop ]
  %ch = load i8, i8* %src.loop
  store i8 %ch, i8* %dst.loop
  %src.next = getelementptr i8, i8* %src.loop, i64 1
  %dst.next = getelementptr i8, i8* %dst.loop, i64 1
  %done = icmp eq i8* %dst.next, %dst.end
  br i1 %done, label %out, label %loop

out:
  ret void
}

Unfortunately, the copy that is written to the destination is not a perfect copy of the source.

Hold on, I hear you think, each byte of memory holds one of 256 possible bit patterns, and this bit pattern is perfectly copied by the `load`/`store` sequence! The catch is that in LLVM's model of execution, a byte of memory can in fact hold more than just one of those 256 values. For example, a byte of memory can be poison, which means that there are at least 257 possible values. Poison is forwarded perfectly by the code above, so that's fine. The trouble starts because of pointer provenance.


What and why is pointer provenance?

From a machine perspective, a pointer is just an integer that is interpreted as a memory address.

For the compiler, alias analysis -- that is, the ability to prove that different pointers point at different memory addresses -- is crucial for optimization. One basic tool in the alias analysis toolbox is to recognize that if pointers point into different "memory objects" -- different stack or heap allocations -- then they cannot alias.

Unfortunately, many pointers are obtained via getelementptr (GEP) using dynamic (non-constant) indices. These dynamic indices could be such that the resulting pointer points into a different memory object than the base pointer. This makes it nearly impossible to determine at compile time whether two pointers point into the same memory object or not.

Which is why there is a rule which says (among other things) that if a pointer P obtained via GEP ends up going out-of-bounds and pointing into a different memory object than the pointer on which the GEP was based, then dereferencing P is undefined behavior even though the pointer's memory address is valid from the machine perspective.

As a corollary, a situation is possible in which there are two pointers whose underlying memory address is identical but whose provenance is different. In that case, it's possible that one of them can be dereferenced while dereferencing the other is undefined behavior.
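
A small sketch of that situation, using the same typed-pointer IR syntax as the rest of this post:

%a = alloca i32
%b = alloca i32
; one-past-the-end pointer based on %a
%p = getelementptr i32, i32* %a, i64 1
; even if %p's address happens to equal %b's address, dereferencing %p is
; undefined behavior: its provenance is %a, not %b
store i32 0, i32* %p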

This only makes sense if, in the formal semantics of LLVM IR, pointer values carry more information than just an integer interpreted as a memory address. They also carry provenance information, which is essentially the set of memory objects that can be accessed via this pointer and any pointers derived from it.


Bytes in memory carry provenance information

What is the provenance of a pointer that results from a load instruction? In a clean operational semantics, the load must derive this provenance from the values stored in memory.

If bytes of memory can only hold one of 256 bit patterns (or poison), that doesn't give us much to work with. We could say that the provenance of the pointer is "empty", meaning the pointer cannot be used to access any memory objects -- but that's clearly useless. Or we could say that the provenance of the pointer is "all", meaning the pointer (or pointers derived from it) can be freely used to access all memory objects, assuming the underlying address is adjusted appropriately. That isn't much better.[0]

Instead, we must say that -- as far as LLVM IR semantics are concerned -- each byte of memory holds pointer provenance information in addition to its i8 content. The provenance information in memory is written by pointer store, and pointer load uses it to reconstruct the original provenance of the loaded pointer.

What happens to provenance information in non-pointer load/store? A load can simply ignore the additional information in memory. For store, I see 3 possible choices:

1. Leave the provenance information that already happens to be in memory unmodified.
2. Set the provenance to "empty".
3. Set the provenance to "all".

Looking back at our attempt to implement memcpy, there is no choice which results in a perfect copy. All of the choices lose provenance information.

Without major changes to LLVM IR, only the last choice is potentially viable because it is the only choice that allows dereferencing pointers that are loaded from the memcpy destination.

Should we care about losing provenance information?

Without major changes to LLVM IR, we can only implement a memcpy that loses provenance information during the copy.

So what? Alias analysis around memcpy and code like it ends up being conservative, but reasonable people can argue that this doesn't matter. The burden of evidence lies on whoever wants to make a large change here in order to improve alias analysis.

That said, we cannot just call it a day and go (or stay) home either, because there are related correctness issues in LLVM today, e.g. bug 37469 mentioned in the initial email of that llvm-dev thread.

Here's a simpler example of a correctness issue using our hand-coded memcpy:

define i32 @sample(i32** %pp) {
  %tmp = alloca i32*
  %pp.8 = bitcast i32** %pp to i8*
  %tmp.8 = bitcast i32** %tmp to i8*
  call void @memcpy(i8* %tmp.8, i8* %pp.8, i64 8)
  %p = load i32*, i32** %tmp
  %x = load i32, i32* %p
  ret i32 %x
}

A transform that should be possible is to eliminate the memcpy and temporary allocation:

define i32 @sample(i32** %pp) {
  %p = load i32*, i32** %pp
  %x = load i32, i32* %p
  ret i32 %x
}

This transform is incorrect because it introduces undefined behavior.

To see why, remember that this is the world where we agree that integer stores write an "all" provenance to memory, so %p in the original program has "all" provenance. In the transformed program, this may no longer be the case. If @sample is called with a pointer that was obtained through an out-of-bounds GEP whose resulting address just happens to fall into a different memory object, then the transformed program has undefined behavior where the original program didn't.

We could fix this correctness issue by introducing an unrestrict instruction which elevates a pointer's provenance to the "all" provenance:

define i32 @sample(i32** %pp) {
  %p = load i32*, i32** %pp
  %q = unrestrict i32* %p
  %x = load i32, i32* %q
  ret i32 %x
}

Here, %q has "all" provenance and therefore no undefined behavior is introduced.

I believe that (at least for address spaces that are well-behaved?) it would be correct to fold inttoptr(ptrtoint(x)) to unrestrict(x). The two are really the same.
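
Concretely, the claim is that a round-trip like this:

%i = ptrtoint i8* %x to i64
%y = inttoptr i64 %i to i8*

... could be folded to a single use of the hypothetical instruction:

%y = unrestrict i8* %x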

For that reason, unrestrict could also be used to fix the above-mentioned bug 37469. Several folks in the bug's discussion stated the opinion that the bug is caused by incorrect store forwarding that should be weakened via inttoptr(ptrtoint(x)). unrestrict(x) is simply a clearer spelling of the same idea.


A dead end: integers cannot have provenance information

A natural thought at this point is that the situation could be improved by adding provenance information to integers. This is technically correct: our hand-coded memcpy would then produce a perfect copy of the memory contents.

However, we would get into serious trouble elsewhere because global value numbering (GVN) and similar transforms become incorrect: two integers could compare equal using the icmp instruction, but still be different because of different provenance. Replacing one by the other could result in miscompilation.
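
To make the failure mode concrete, here is a minimal sketch under the hypothetical assumption that integers carried provenance:

define i32 @gvn_example(i32* %a, i32* %b) {
  %ia = ptrtoint i32* %a to i64
  %ib = ptrtoint i32* %b to i64
  %eq = icmp eq i64 %ia, %ib
  br i1 %eq, label %then, label %out

then:
  ; %ia and %ib compare equal here, so GVN wants to replace %ib with %ia.
  ; If integers carried provenance, the two values could still differ in
  ; provenance, and the replacement could turn this well-defined load into
  ; undefined behavior.
  %q = inttoptr i64 %ib to i32*
  %v = load i32, i32* %q
  ret i32 %v

out:
  ret i32 0
}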

GVN is important enough that adding provenance information to integers is a no-go.

I suspect that the unrestrict instruction would allow us to apply GVN to pointers, at the cost of making later alias analysis more conservative and sprinkling unrestrict instructions that may inhibit other transforms. I have no idea what the trade-off is on that.


The "byte" types: accurate representation of memory contents

With all the above in mind, I can see the first-principles appeal of the proposed "byte" types. They allow us to represent the contents of memory accurately in SSA values, and so they fill a real gap in the expressiveness of LLVM IR.

That said, the software development cost of adding a whole new family of types to LLVM is very high, so it better be justified by more than just aesthetics.

Our hand-coded memcpy can be turned into a perfect copier with straightforward replacement of i8 by b8:

define void @memcpy(b8* %dst, b8* %src, i64 %n) {
entry:
  %dst.end = getelementptr b8, b8* %dst, i64 %n
  %isempty = icmp eq i64 %n, 0
  br i1 %isempty, label %out, label %loop

loop:
  %src.loop = phi b8* [ %src, %entry ], [ %src.next, %loop ]
  %dst.loop = phi b8* [ %dst, %entry ], [ %dst.next, %loop ]
  %ch = load b8, b8* %src.loop
  store b8 %ch, b8* %dst.loop
  %src.next = getelementptr b8, b8* %src.loop, i64 1
  %dst.next = getelementptr b8, b8* %dst.loop, i64 1
  %done = icmp eq b8* %dst.next, %dst.end
  br i1 %done, label %out, label %loop

out:
  ret void
}

Looking at the concrete choices made in the proposal, I disagree with some of them.

Memory should not be typed. In the proposal, storing an integer always results in different memory contents than storing a pointer (regardless of its provenance), and implicitly trying to mix pointers and integers is declared to be undefined behavior. In other words, a sequence such as:

store i64 %x, i64* %p
%q = bitcast i64* %p to i8**
%y = load i8*, i8** %q

... is undefined behavior under the proposal instead of being effectively inttoptr(%x). That seems fine for C/C++, but is it going to be fine for other frontends?

The corresponding distinction between bytes-as-integers and bytes-as-pointers complicates the proposal overall, e.g. it forces them to add a bytecast instruction.

Conversely, the benefits of the distinction are unclear to me. One benefit appears to be guaranteed non-aliasing between pointer and non-pointer memory accesses, but that is a form of type-based alias analysis which in LLVM should idiomatically be done via TBAA metadata. (Update: see the addendum for another potential argument in favor of typed memory.)

So let's keep memory untyped, please.

Bitwise poison in byte values makes me really nervous due to the arbitrary deviation from how poison works in other types. I don't see any justification for it in the proposal. I can kind of see how one could be motivated by implementing memcpy with vector intrinsics operating on, for example, <8 x b32>, but a simpler solution would be to just use <32 x b8> instead. And if poison is indeed bitwise, then certainly pointer provenance would also have to be bitwise!

Finally, no design discussion is complete without a little bit of bike-shedding. I believe the name "byte" is inspired by C++'s std::byte, but given that types such as b256 are possible, this name would forever be a source of confusion. Naming is hard, and I think we should at least try to look for a better one. Let me kick off the brainstorming by suggesting we think of them as "memory content" values, because that's what they are. The types could be spelled m8, m32, etc. in IR assembly.

A variation: adding a pointer provenance type

In the llvm-dev thread, Jeroen Dobbelaere points out work being done to introduce explicit `ptr_provenance` operands on certain instructions, in service of C99's restrict keyword. I haven't properly digested this work, but it inspired the thoughts of this section.

Values of the proposed byte types have both a bit pattern and a pointer provenance. Do we really need to have both pieces of information in the same SSA value? We could instead split them up into an integer bit pattern value and a pointer provenance value with an explicit provenance type. Loads of integers could read out the provenance information stored in memory and provide it as a secondary result. Similarly, stores of integers could accept the desired provenance to be stored in memory as a secondary data operand. This would allow us to write a perfect memcpy by replacing the core load/store sequence with something like:

%ch, %provenance = load_with_provenance i8, i8* %src
store_with_provenance i8 %ch, provenance %provenance, i8* %dst

The syntax and instruction names in the example are very much straw men. Don't take them too seriously, especially because LLVM IR doesn't currently allow multiple result values.

Interestingly, this split allows the derivation of pointer provenance to follow a different path than the calculation of the pointer's bit pattern. This in turn allows us in principle to perform GVN on pointers without being conservative for alias analysis.

One of the steps in bug 37469 is not quite GVN, but morally similar. Simplifying a lot, the original program sequence:

%ch1 = load i8, i8* %p1
%ch2 = load i8, i8* %p2
%eq = icmp eq i8 %ch1, %ch2
%ch = select i1 %eq, i8 %ch1, i8 %ch2
store i8 %ch, i8* %p3

... is transformed into:

%ch2 = load i8, i8* %p2
store i8 %ch2, i8* %p3

This is correct for the bit patterns being loaded and stored, but the program also indirectly relies on pointer provenance of the data. Of course, there is no pointer provenance information being copied here because i8 only holds a bit pattern. However, with the "byte" proposal, all the i8s would be replaced by b8s, and then the transform becomes incorrect because it changes the provenance information.

If we split the proposed use of b8 into a use of i8 and explicit provenance values, the original program becomes:

%ch1, %prov1 = load_with_provenance i8, i8* %p1
%ch2, %prov2 = load_with_provenance i8, i8* %p2
%eq = icmp eq i8 %ch1, %ch2
%ch = select i1 %eq, i8 %ch1, i8 %ch2
%prov = select i1 %eq, provenance %prov1, provenance %prov2
store_with_provenance i8 %ch, provenance %prov, i8* %p3

This could be transformed into something like:

%prov1 = load_only_provenance i8* %p1
%ch2, %prov2 = load_with_provenance i8, i8* %p2
%prov = merge provenance %prov1, %prov2
store_with_provenance i8 %ch2, provenance %prov, i8* %p3

... which is just as good for code generation but loses only very little provenance information.

Aside: loop idioms

Without major changes to LLVM IR, a perfect memcpy cannot be implemented because pointer provenance information is lost.

Nevertheless, one could still define the @llvm.memcpy intrinsic to be a perfect copy. This helps memcpys in the original source program be less conservative in terms of alias analysis. However, it also makes it incorrect to replace a memcpy loop idiom with a use of @llvm.memcpy: without adding unrestrict instructions, the replacement may introduce undefined behavior; and there is no way to bound the locations where such unrestricts may be needed.
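
For concreteness, this is the kind of replacement the loop idiom recognizer performs: the hand-coded byte-copy loop from the beginning of this post becomes a single intrinsic call, roughly like this (the exact signature depends on the LLVM version):

call void @llvm.memcpy.p0i8.p0i8.i64(i8* %dst, i8* %src, i64 %n, i1 false)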

We could augment @llvm.memcpy with an immediate argument that selects its provenance behavior.

In any case, one can argue that bug 37469 is really a bug in the loop idiom recognizer. It boils down to the details of how everything is defined, and unfortunately, these weird corner cases are currently underspecified in the LangRef.

Conclusion

We started with the question of whether memcpy can be implemented in LLVM IR. The answer is a qualified Yes. It is possible, but the resulting copy is imperfect because pointer provenance information is lost. This has surprising implications which in turn happen to cause real miscompilation bugs -- although those bugs could be fixed even without a perfect memcpy.

The "byte" proposal has a certain aesthetic appeal because it fixes a real gap in the expressiveness of LLVM IR, but its software engineering cost is large and I object to some of its details. There are also alternatives to consider.

The miscompilation bugs obviously need to be fixed, but they can be fixed much less intrusively, albeit at the cost of more conservative alias analysis in the affected places. It is not clear to me whether improving alias analysis justifies the more complex solutions.

I would like to understand better how all of this interacts with the C99 restrict work. That work introduces mechanisms for explicitly talking about pointer provenance in the IR, which may allow us to kill two birds with one stone.

In any case, this is a fascinating topic and discussion, and I feel like we're only at the beginning.


Addendum: storing back previously loaded integers

(Added this section on June 14th)

Harald van Dijk on Phabricator and Ralf Jung on llvm-dev, referring to a Rust issue, explicitly and implicitly point out a curious issue with loading and storing integers.

Here is Harald's example:

define i8* @f(i8* %p) {
  %buf = alloca i8*
  %buf.i32 = bitcast i8** %buf to i32*
  store i8* %p, i8** %buf
  %i = load i32, i32* %buf.i32
  store i32 %i, i32* %buf.i32
  %q = load i8*, i8** %buf
  ret i8* %q
}

There is a pair of load/store of i32 which is fully redundant from a machine perspective and so we'd like to optimize that away, after which it becomes obvious that the function really just returns %p -- at least as far as bit patterns are concerned.

However, in a world where memory is untyped but has provenance information, this optimization is incorrect because it can introduce undefined behavior: the load/store of i32 resets the provenance information in memory to "all", so that the original function returns an unrestricted version of %p. This is no longer the case after the optimization.

There are at least two possible ways of resolving this conflict.

We could define memory to be typed, in the sense that each byte of memory remembers whether it was most recently stored as a pointer or a non-pointer. A load with the wrong type returns poison. In that case, the example above returns poison before the optimization (because %i is guaranteed to be poison). After the optimization it returns non-poison, which is an acceptable refinement, so the optimization is correct.

The alternative is to keep memory untyped and say that directly eliminating the i32 store in the example is incorrect.

We are facing a tradeoff that depends on how important that optimization is for performance.

Two observations to that end. First, the more common case of dead store elimination is one where there are multiple stores to the same address in a row, and we remove all but the last one of them. That more common optimization is unaffected by provenance issues either way.

Second, we can still perform store forwarding / peephole optimization across such load/store pairs, as long as we are careful to introduce unrestrict where needed. The example above can be optimized via store forwarding to:

define i8* @f(i8* %p) {
  %buf = alloca i8*
  %buf.i32 = bitcast i8** %buf to i32*
  store i8* %p, i8** %buf
  %i = load i32, i32* %buf.i32
  store i32 %i, i32* %buf.i32
  %q = unrestrict i8* %p
  ret i8* %q
}

We can then dead-code eliminate the bulk of the function and obtain:

define i8* @f(i8* %p) {
  %q = unrestrict i8* %p
  ret i8* %q
}

... which is as good as it can possibly get.

So, there is a good chance that preventing this particular optimization is relatively cheap in terms of code quality, and the gain in overall design simplicity may well be worth it.




[0] We could also say that the loaded pointer's provenance is magically the memory object that happens to be at the referenced memory address. Either way, provenance would become a useless no-op in most cases. For example, mem2reg would have to insert unrestrict instructions (defined later) everywhere because pointers become effectively "unrestricted" when loaded from alloca'd memory.

Thursday, June 25, 2020

They want to be small, they want to be big: thoughts on code reviews and the power of patch series

Code reviews are a central fact of life in software development. It's important to do them well, and developer quality of life depends on a good review workflow.

Unfortunately, code reviews also appear to be a difficult problem. Many projects are bottlenecked by code reviews, in that reviewers are hard to find and progress gets slowed down by having to wait a long time for reviews.

The "solution" that I've often seen applied in practice is to have lower quality code reviews. Reviewers don't attempt to gain a proper understanding of a change, so reviews become shallower and therefore easier. This is convenient on the surface, but more likely to allow bad code to go through: a subtle corner case that isn't covered by tests (yet?) may be missed, there may be a misunderstanding of a relevant underlying spec, a bad design decision slips through, and so on. This is bound to cause pains later on.

I've experienced a number of different code review workflows in practice, based on a number of tools: GitHub PRs and GitLab MRs, Phabricator, and the e-mail review of patch series that is the original workflow for which Git was designed. Of those, the e-mail review flow produced the highest quality. There may be confounding factors, such as the nature of the projects and the selection of developers working on them, but quality issues aside I certainly feel that the e-mail review flow was the most pleasant to work with. Over time I've been thinking and having discussions a lot about just why that is. I feel that I have distilled it to two key factors, which I've decided to write down here so I can just link to this blog post in the future.

First, the UI experience of e-mail is a lot nicer. All of the alternatives are ultimately web-based, and their UI latency is universally terrible. Perhaps I'm particularly sensitive, but I just cannot stand web UIs for serious work. Give me something that reacts to all input reliably in under 50ms and I will be much happier. E-mail achieves that, web UIs don't. Okay, to be fair, e-mail is probably more in the 100ms range given the general state of the desktop. The point is, web UIs are about an order of magnitude worse. It's incredibly painful. (All of this assumes that you're using a decent native e-mail client. Do yourself a favor and give that a try if you haven't. The CLI warriors all have their favorites, but frankly Thunderbird works just fine. Outlook doesn't.)

Second, I've come to realize that there are conflicting goals in review granularity that e-mail happens to address pretty well, but none of the alternatives do a good job of it. Most of the alternatives don't even seem to understand that there is a problem to begin with! Here's the thing:

Reviews want to be small. The smaller and the more self-contained a change is, the easier it is to wrap your head around and judge. If you do a big refactor that is supposed to have no functional impact, followed by a separate small functional change that is enabled by the refactor, then each change individually is much easier to review. Approving changes at a fine granularity also helps ensure that you've really thought through each individual change and that each change has a reasonable justification. Important details don't get lost in something larger.

Reviews want to be big. A small, self-contained change can be difficult to understand and judge in isolation. You're doing a refactor that moves a function somewhere else? Fine, it's easy to tell that the change is correct, but is it a good change? To judge that, you often need to understand how the refactored result ends up being used in later changes, so it's good to see all those changes at once. Keep in mind though that you don't necessarily have to approve them at the same time. It's entirely possible to say, yes, that refactor looks good, we can go ahead with that, but please fix the way it's being used in a subsequent change.

There is another reason why reviews want to be big. Code reviews have a mental context-switching overhead. As a reviewer, you need to think yourself into the affected code in order to judge it well. If you do many reviews, you typically need to context-switch between each review. This can be very taxing mentally and ultimately unpleasant. A similar, though generally smaller, context-switching overhead applies to the author of the change as well: let's say you send out some changes for review, then go off and do another thing, and come back a day or two later to some asynchronously written reviews. In order to respond to the review, you may now have to page the context of that change back in. The point of all this is that when reviews are big, the context-switching overhead gets amortized better, i.e. the cost per change drops.

Reviews want to be both small and big. Guess what, patch series solve that problem! You get to review an entire feature implementation in the form of a patch series at once, so your context-switching overhead is reduced and you can understand how the different parts of the change play together. At the same time, you can drill down into individual patches and review those. Two levels of detail are available simultaneously.

So why e-mail? Honestly, I don't know. Given that the original use case for Git is based on patch series review, it's mind-boggling in a bad way that web-based Git hosting and code review systems do such a poor job of it, if they handle it at all.

Gerrit is the only system I know of that really takes patch series as an idea seriously, but while I am using it occasionally, I haven't had the opportunity to really stress it. Unfortunately, most people don't even want to consider Gerrit as an option because it's ugly.

Phabricator's stacks are a pretty decent attempt and I've made good use of them in the context of LLVM. However, they're too hidden and clumsy to navigate. Both Phabricator and Gerrit lack a mechanism for discussing a patch series as a whole.

GitHub and GitLab? They're basically unusable. Yes, you can look at individual commits, but then GitHub doesn't even display the commits in the correct order: they're sorted by commit or author date, not by the Git DAG order, which is an obviously and astonishingly bad idea. Comments tend to get lost when authors rebase, which is what authors should do in order to ensure a clean history, and actually reviewing an individual commit is impossible. Part of the power of patch series is the ability to say: "Patches 1-6, 9, and 11 are good to go, the rest needs work."

Oh, and by the way: Commit messages? They're important! Gerrit again is the only system I know of that allows review comments on commit messages. It's as if the authors of Gerrit are the only ones who really understood the problem space. Unfortunately, they seem to lack business and web design skills, and so we ended up in the mess we're in right now.

Mind you, even if the other players got their act together and supported the workflow properly, there'd still be the problem of UI latency. One can dream...

Tuesday, February 05, 2019

FOSDEM talk on TableGen

Video and slides for my talk in the LLVM devroom on TableGen are now available here.

Now I only need the time and energy to continue my blog series on the topic...

Friday, March 09, 2018

TableGen #5: DAGs

This is the fifth part of a series; see the first part for a table of contents.

With bit sequences, we have already seen one unusual feature of TableGen that is geared towards its specific purpose. DAG nodes are another; they look a bit like S-expressions:
def op1;
def op2;
def i32;

def Example {
  dag x = (op1 $foo, (op2 i32:$bar, "Hi"));
}
In the example, there are two DAG nodes, represented by a DagInit object in the code. The first node has as its operation the record op1. The operation of a DAG node must be a record, but there are no other restrictions. This node has two children or arguments: the first argument is named foo but has no value. The second argument has no name, but it does have another DAG node as its value.

This second DAG node has the operation op2 and two arguments. The first argument is named bar and has value i32, the second has no name and value "Hi".

DAG nodes can have any number of arguments, and they can be nested arbitrarily. The values of arguments can have any type, at least as far as the TableGen frontend is concerned. So DAGs are an extremely free-form way of representing data, and they are really only given meaning by TableGen backends.

There are three main uses of DAGs:
  1. Describing the operands on machine instructions.
  2. Describing patterns for instruction selection.
  3. Describing register files with something called "set theory".
I have not yet had the opportunity to explore the last point in detail, so I will only give an overview of the first two uses here.

Describing the operands of machine instructions is fairly straightforward at its core, but the details can become quite elaborate.

I will illustrate some of this with the example of the V_ADD_F32 instruction from the AMDGPU backend. V_ADD_F32 is a standard 32-bit floating point addition, at least in its 32-bit-encoded variant, which the backend represents as V_ADD_F32_e32.

Let's take a look at some of the fully resolved records produced by the TableGen frontend:
def V_ADD_F32_e32 {    // Instruction AMDGPUInst ...
  dag OutOperandList = (outs anonymous_503:$vdst);
  dag InOperandList = (ins VSrc_f32:$src0, VGPR_32:$src1);
  string AsmOperands = "$vdst, $src0, $src1";
  ...
}


def anonymous_503 {    // DAGOperand RegisterOperand VOPDstOperand
  RegisterClass RegClass = VGPR_32;
  string PrintMethod = "printVOPDst";
  ...
}
As you'd expect, there is one out operand. It is named vdst and an anonymous record is used to describe more detailed information such as its register class (a 32-bit general purpose vector register) and the name of a special method for printing the operand in textual assembly output. (The string "printVOPDst" will be used by the backend that generates the bulk of the instruction printer code, and refers to the method AMDGPUInstPrinter::printVOPDst that is implemented manually.)

There are two in operands. src1 is a 32-bit general purpose vector register and requires no special handling, but src0 supports more complex operands as described in the record VSrc_f32 elsewhere.

Also note the string AsmOperands, which is used as a template for the automatically generated instruction printer code. The operand names in that string refer to the names of the operands as defined in the DAG nodes.

This was a nice warmup, but didn't really demonstrate the full power and flexibility of DAG nodes. Let's look at V_ADD_F32_e64, the 64-bit encoded version, which has some additional features: the sign bits of the inputs can be reset or inverted, and the result (output) can be clamped and/or scaled by some fixed constants (0.5, 2, and 4). This will seem familiar to anybody who has worked with the old OpenGL assembly program extensions or with DirectX shader assembly.

The fully resolved records produced by the TableGen frontend are quite a bit more involved:
def V_ADD_F32_e64 {    // Instruction AMDGPUInst ...
  dag OutOperandList = (outs anonymous_503:$vdst);
  dag InOperandList =
    (ins FP32InputMods:$src0_modifiers, VCSrc_f32:$src0,
         FP32InputMods:$src1_modifiers, VCSrc_f32:$src1,
         clampmod:$clamp, omod:$omod);
  string AsmOperands = "$vdst, $src0_modifiers, $src1_modifiers$clamp$omod";
  list<dag> Pattern =
    [(set f32:$vdst, (fadd
      (f32 (VOP3Mods0 f32:$src0, i32:$src0_modifiers,
                      i1:$clamp, i32:$omod)),
      (f32 (VOP3Mods f32:$src1, i32:$src1_modifiers))))];
  ...
}

def FP32InputMods {     // DAGOperand Operand InputMods FPInputMods
  ValueType Type = i32;
  string PrintMethod = "printOperandAndFPInputMods";
  AsmOperandClass ParserMatchClass = FP32InputModsMatchClass;
  ...
}


def FP32InputModsMatchClass {   // AsmOperandClass FPInputModsMatchClass
  string Name = "RegOrImmWithFP32InputMods";
  string PredicateMethod = "isRegOrImmWithFP32InputMods";
  string ParserMethod = "parseRegOrImmWithFPInputMods";
  ...
}
The out operand hasn't changed, but there are now many more special in operands that describe whether those additional features of the instruction are used.

You can again see how records such as FP32InputMods refer to manually implemented methods. Also note that the AsmOperands string no longer refers to src0 or src1. Instead, the printOperandAndFPInputMods method on src0_modifiers and src1_modifiers will print the source operand together with its sign modifiers. Similarly, the special ParserMethod parseRegOrImmWithFPInputMods will be used by the assembly parser.

This kind of extensibility by combining generic automatically generated code with manually implemented methods is used throughout the TableGen backends for code generation.

Something else is new here: the Pattern. This pattern, together with all the other patterns defined elsewhere, is compiled into a giant domain-specific bytecode that executes during instruction selection to turn the SelectionDAG into machine instructions. Let's take this particular pattern apart:
(set f32:$vdst, (fadd ...))
We will match an fadd selection DAG node that outputs a 32-bit floating point value, and this output will be linked to the out operand vdst. (set, fadd and many others are defined in the target-independent include/llvm/Target/TargetSelectionDAG.td.)
(fadd (f32 (VOP3Mods0 f32:$src0, i32:$src0_modifiers,
                      i1:$clamp, i32:$omod)),
      (f32 (VOP3Mods f32:$src1, i32:$src1_modifiers)))
Both input operands of the fadd node must be 32-bit floating point values, and they will be handled by complex patterns. Here's one of them:
def VOP3Mods { // ComplexPattern
  string SelectFunc = "SelectVOP3Mods";
  int NumOperands = 2;
  ...
}
As you'd expect, there's a manually implemented SelectVOP3Mods method. Its signature is
bool SelectVOP3Mods(SDValue In, SDValue &Src,
                    SDValue &SrcMods) const;
It can reject the match by returning false, otherwise it pattern matches a single input SelectionDAG node into nodes that will be placed into src1 and src1_modifiers in the particular pattern we were studying.

Patterns can be arbitrarily complex, and they can be defined outside of instructions as well. For example, here's a pattern for generating the S_BFM_B32 instruction, which generates a bitfield mask:
def anonymous_2373 {    // Pattern Pat ...
  dag PatternToMatch =
    (i32 (shl (i32 (add (i32 (shl 1, i32:$a)), -1)), i32:$b));
  list<dag> ResultInstrs = [(S_BFM_B32 ?:$a, ?:$b)];
  ...
}
The name of this record doesn't matter. The instruction selection TableGen backend simply looks for all records that have Pattern as a superclass. In this case, we match an expression of the form ((1 << a) - 1) << b on 32-bit integers into a single machine instruction.

So far, we've mostly looked at how DAGs are interpreted by some of the key backends of TableGen. As it turns out, most backends generate their DAGs in a fairly static way, but there are some fancier techniques that can be used as well. This post is already quite long though, so we'll look at those in the next post.

Tuesday, March 06, 2018

TableGen #4: Resolving variables

This is the fourth part of a series; see the first part for a table of contents.

It's time to look at some of the guts of TableGen itself. TableGen is split into a frontend, which parses the TableGen input, instantiates all the records, resolves variable references, and so on, and many different backends that generate code based on the instantiated records. In this series I'll be mainly focusing on the frontend, which lives in lib/TableGen/ inside the LLVM repository, e.g. here on the GitHub mirror. The backends for LLVM itself live in utils/TableGen/, together with the command line tool's main() function. Clang also has its own backends.

Let's revisit what kind of variable references there are and what kind of resolving needs to be done with an example:
class Foo<int src> {
  int Src = src;
  int Offset = 1;
  int Dst = !add(Src, Offset);
}

multiclass Foos<int src> {
  def a : Foo<src>;
  let Offset = 2 in
  def b : Foo<src>;
}

foreach i = 0-3 in
defm F#i : Foos<i>;
This is actually broken in older LLVM by one of the many bugs, but clearly it should work based on what kind of features are generally available, and with my patch series it certainly does work in the natural way. We see four kinds of variable references:
  • internally within a record, such as the initializer of Dst referencing Src and Offset
  • to a class template variable, such as Src being initialized by src
  • to a multiclass template variable, such as src being passed as a template argument for Foo
  • to a foreach iteration variable
As an aside, keep in mind that let in TableGen does not mean the same thing as in the many functional programming languages that have a similar construct. In those languages let introduces a new variable, but TableGen's let instead overrides the value of a variable that has already been defined elsewhere. In the example above, the let-statement causes the value of Offset to be changed in the record that was instantiated from the Foo class to create the b prototype inside multiclass Foos.

TableGen internally represents variable references as instances of the VarInit class, and the variables themselves are simply referenced by name. This causes some embarrassing issues around template arguments which are papered over by qualifying the variable name with the template name. If you pass the above example through a sufficiently fixed version of llvm-tblgen, one of the outputs will be the description of the Foo class:
class Foo<int Foo:src = ?> {
  int Src = Foo:src;
  int Offset = 1;
  int Dst = !add(Src, Offset);
  string NAME = ?;
}
As you can see, Foo:src is used to refer to the template argument. In fact, the template arguments of both classes and multiclasses are temporarily added as variables to their respective prototype records. When the class or prototype in a multiclass is instantiated, all references to the template argument variables are resolved fully, and the variables are removed (or rather, some of them are removed, and making that consistent is one of the many things I set out to clean up).

Similarly, references to foreach iteration variables are resolved when records are instantiated, although those variables aren't similarly qualified. If you want to learn more about how variable names are looked up, TGParser::ParseIDValue is a good place to start.

The order in which variables are resolved is important. In order to achieve the flexibility of overriding defaults with let-statements, internal references among record variables must be resolved after template arguments.

Actually resolving variable references used to be done by the implementations of the following virtual method of the Init class hierarchy (which represents initializers, i.e. values and expressions):
virtual Init *resolveReferences(Record &R, const RecordVal *RV) const;
This method recursively resolves references in the constituent parts of the expression and then performs constant folding, and returns the resulting value (or the original value if nothing could be resolved). Its interface is somewhat magical: R represents the "current" record which is used as a frame of reference for magical lookups in the implementation of !cast; this is a topic for another time, though. At the same time, variables referencing R are supposed to be resolved, but only if RV is null. If RV is non-null, then only references to that specific variable are supposed to be resolved. Additionally, some behaviors around unset depend on this.

This is replaced in my changes with
virtual Init *resolveReferences(Resolver &R) const;
where Resolver is an abstract base class / interface which can lookup values based on their variable names:
class Resolver {
  Record *CurRec;

public:
  explicit Resolver(Record *CurRec) : CurRec(CurRec) {}
  virtual ~Resolver() {}

  Record *getCurrentRecord() const { return CurRec; }
  virtual Init *resolve(Init *VarName) = 0;
  virtual bool keepUnsetBits() const { return false; }
};
The "current record" is used as a reference for the aforementioned magical !casts, and keepUnsetBits instructs the implementation of bit sequences in BitsInit not to resolve to ? (as was explained in the third part of the series). resolve itself is implemented by one of the subclasses, most notably:
  1. MapResolver: Resolve based on a dictionary of name-value pairs.
  2. RecordResolver: Resolve variable names that appear in the current record.
  3. ShadowResolver: Delegate requests to an underlying resolver, but filter out some names.
This last type of resolver is used by the implementations of !foreach and !foldl to avoid mistakes with nesting. Consider, for example:
class Exclamation<list<string> messages> {
  list<string> Messages = !foreach(s, messages, s # "!");
}

class Greetings<list<string> names>
    : Exclamation<!foreach(s, names, "Hello, " # s)>;

def : Greetings<["Alice", "Bob"]>;
This effectively becomes a nested !foreach. The iteration variable is named s in both, so when substituting s for the outer !foreach, we must ensure that we don't also accidentally substitute s in the inner !foreach. We achieve this by having !foreach wrap the given resolver with a ShadowResolver. The same principle applies to !foldl as well, of course.

Friday, February 23, 2018

TableGen #3: Bits

This is the third part of a series; see the first part for a table of contents.

One of the main backend uses of TableGen is describing target machine instructions, and that includes describing the binary encoding of instructions and their constituent parts. This requires a certain level of bit twiddling, and TableGen supports this with explicit bit (single bit) and bits (fixed-length sequence of bits) types:
class Enc<bits<7> op> {
  bits<10> Encoding;

  let Encoding{9-7} = 5;
  let Encoding{6-0} = op;
}

def InstA : Enc<0x35>;
def InstB : Enc<0x08>;
... will produce records:
def InstA {     // Enc
  bits<10> Encoding = { 1, 0, 1, 0, 1, 1, 0, 1, 0, 1 };
  string NAME = ?;
}
def InstB {     // Enc
  bits<10> Encoding = { 1, 0, 1, 0, 0, 0, 1, 0, 0, 0 };
  string NAME = ?;
}
So you can quite easily slice and dice bit sequences with curly braces, as long as the indices themselves are constants.

But the real killer feature is that so-called unset initializers, represented by a question mark, aren't fully resolved in bit sequences:
class Enc<bits<3> opcode> {
  bits<8> Encoding;
  bits<3> Operand;

  let Encoding{0} = opcode{2};
  let Encoding{3-1} = Operand;
  let Encoding{5-4} = opcode{1-0};
  let Encoding{7-6} = { 1, 0 };
}

def InstA : Enc<5>;
... produces a record:
def InstA {     // Enc
  bits<8> Encoding = { 1, 0, 0, 1, Operand{2}, Operand{1}, Operand{0}, 1 };
  bits<3> Operand = { ?, ?, ? };
  string NAME = ?;
}
So instead of going ahead and saying, hey, Operand{2} is ?, let's resolve that and plug it into Encoding, TableGen instead keeps the fact that bit 3 of Encoding refers to Operand{2} as part of its data structures.

Together with some additional data, this allows a backend of TableGen to automatically generate code for instruction encoding and decoding (i.e., disassembling). For example, it will create the source for a giant C++ method with signature
uint64_t getBinaryCodeForInstr(const MCInst &MI, /* ... */) const;
which contains a giant constant array with all the fixed bits of each instruction followed by a giant switch statement with cases of the form:
case AMDGPU::S_CMP_EQ_I32:
case AMDGPU::S_CMP_EQ_U32:
case AMDGPU::S_CMP_EQ_U64:
// more cases...
case AMDGPU::S_SET_GPR_IDX_ON: {
  // op: src0
  op = getMachineOpValue(MI, MI.getOperand(0), Fixups, STI);
  Value |= op & UINT64_C(255);
  // op: src1
  op = getMachineOpValue(MI, MI.getOperand(1), Fixups, STI);
  Value |= (op & UINT64_C(255)) << 8;
  break;
}
The bitmasks and shift values are all derived from the structure of unset bits as in the example above, and some additional data (the operand DAGs) are used to identify the operand index corresponding to TableGen variables like Operand based on their name. For example, here are the relevant parts of the S_CMP_EQ_I32 record generated by the AMDGPU backend's TableGen files:
 def S_CMP_EQ_I32 {      // Instruction (+ other superclasses)
  field bits<32> Inst = { 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, src1{7}, src1{6}, src1{5}, src1{4}, src1{3}, src1{2}, src1{1}, src1{0}, src0{7}, src0{6}, src0{5}, src0{4}, src0{3}, src0{2}, src0{1}, src0{0} };
  dag OutOperandList = (outs);
  dag InOperandList = (ins SSrc_b32:$src0, SSrc_b32:$src1);
  bits<8> src0 = { ?, ?, ?, ?, ?, ?, ?, ? };
  bits<8> src1 = { ?, ?, ?, ?, ?, ?, ?, ? };
  // many more variables...
}
Note how Inst, which describes the 32-bit encoding as a whole, refers to the TableGen variables src0 and src1. The operand indices used in the calls to MI.getOperand() above are derived from the InOperandList, which contains nodes with the corresponding names. (SSrc_b32 is the name of a record that subclasses RegisterOperand and describes the acceptable operands, such as registers and inline constants.)

Hopefully this helped you appreciate just how convenient TableGen can be. Not resolving the ? in bit sequences is an odd little exception to an otherwise fairly regular language, but the resulting expressive power is clearly worth it. It's something to keep in mind when we discuss how variable references are resolved.

Wednesday, February 21, 2018

TableGen #2: Functional Programming

This is the second part of a series; see the first part for a table of contents.

When the basic pattern of having classes with variables that are filled in via template arguments or let-statements reaches the limits of its expressiveness, it can become useful to calculate values on the fly. TableGen provides string concatenation out of the box with the paste operator ('#'), and there are built-in functions which can be easily recognized since they start with an exclamation mark, such as !add, !srl, !eq, and !listconcat. There is even an !if-builtin and a somewhat broken and limited !foreach.
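To give a flavor of how these compose, here is a small made-up example (the class and field names are purely illustrative):
class Scaled<int x> {
  int Doubled = !add(x, x);                          // integer arithmetic
  int Shifted = !srl(x, 1);                          // logical shift right
  string Label = "value_" # x;                       // paste operator
  string Kind = !if(!eq(x, 0), "zero", "nonzero");   // conditional
  list<int> Pair = !listconcat([x], [!add(x, 1)]);   // list concatenation
}

def Eight : Scaled<8>;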

There is no way of defining new functions, but there is a pattern that can be used to make up for it: classes with ret-values:
class extractBit<int val, int bitnum> {
  bit ret = !and(!srl(val, bitnum), 1);
}

class Foo<int val> {
  bit bitFour = extractBit<val, 4>.ret;
}

def Foo1 : Foo<5>;
def Foo2 : Foo<17>;
This doesn't actually work in LLVM trunk right now because of the deficiencies around anonymous record instantiations that I mentioned in the first part of the series, but after a lot of refactoring and cleanups, I got it to work reliably. It turns out to be an extremely useful tool.
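With those fixes in place, the two defs above resolve to the expected records, roughly:
def Foo1 {      // Foo
  bit bitFour = 0;
  string NAME = ?;
}
def Foo2 {      // Foo
  bit bitFour = 1;
  string NAME = ?;
}
(5 is 0b00101, so bit four is 0; 17 is 0b10001, so bit four is 1.)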

In case you're wondering, this does not support recursion and it's probably better that way. It's possible that TableGen is already accidentally Turing complete, but giving it that power on purpose seems unnecessary and might lead to abuse.

Without recursion, a number of builtin functions are required. There has been a !foreach for a long time, and it is a very odd duck:
def Defs {
  int num;
}

class Example<list<int> nums> {
  list<int> doubled = !foreach(Defs.num, nums, !add(Defs.num, Defs.num));
}

def MyNums : Example<[4, 1, 9, -3]>;
In many ways it does what you'd expect, except that having to define a dummy record with a dummy variable in this way is clearly odd and fragile. Until very recently it did not even support everything you'd expect within those constraints, and even with the recent fixes there are plenty of bugs. Clearly, this is how !foreach should look instead:
class Example<list<int> nums> {
  list<int> doubled =
      !foreach(x, nums, !add(x, x));
}

def MyNums : Example<[4, 1, 9, -3]>;
... and that's what I've implemented.
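For the example above, MyNums then comes out roughly as:
def MyNums {    // Example
  list<int> doubled = [8, 2, 18, -6];
  string NAME = ?;
}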

This ends up being a breaking change (the only one in the whole series, hopefully), but !foreach isn't actually used in upstream LLVM proper anyway, and external projects can easily adapt.

A new feature that I have found very helpful is a fold-left operation:
class Enumeration<list<string> items> {
  list<string> ret = !foldl([], items, lhs, item,
      !listconcat(lhs, [!size(lhs) # ": " # item]));
}

def MyList : Enumeration<["foo", "bar", "baz"]>;
This produces the following record:
def MyList {    // Enumeration
  list<string> ret = ["0: foo", "1: bar", "2: baz"];
  string NAME = ?;
}
Needless to say, it was necessary to refactor the TableGen tool very deeply to enable this kind of feature, but I am quite happy with how it ended up.

The title of this entry is "Functional Programming", and in a sense I lied. Functions are not first-class values in TableGen even with my changes, so one of the core features of functional programming is missing. But that's okay: most of what you'd expect to have and actually need is now available in a consistent manner, even if it's still clunkier than in a "real" programming language. And again: making functions first-class would immediately make TableGen Turing complete. Do we really want that?

Monday, February 19, 2018

TableGen #1: What has TableGen ever done for us?

This is the first entry in an on-going series. Here's a list of all entries:
  1. What has TableGen ever done for us?
  2. Functional Programming
  3. Bits
  4. Resolving variables
  5. DAGs
  6. to be continued 
Also: here is a talk (slides + video) I gave in the FOSDEM 2019 LLVM devroom on TableGen.

Anybody who has ever done serious backend work in LLVM has probably developed a love-hate relationship with TableGen. At its best it can be an extremely useful tool that saves a lot of manual work. At its worst, it will drive you mad with bizarre crashes, indecipherable error messages, and generally inscrutable failures to understand what you want from it.

TableGen is an internal tool of the LLVM compiler framework. It implements a domain-specific language that is used to describe many different kinds of structures. These descriptions are translated to read-only data tables that are used by LLVM during compilation.

For example, all of LLVM's intrinsics are described in TableGen files. Additionally, each backend describes its target machine's instructions, register file(s), and more in TableGen files.

The unit of description is the record. At its core, a record is a dictionary of key-value pairs. Additionally, records are typed by their superclass(es), and each record can have a name. So for example, the target machine descriptions typically contain one record for each supported instruction. The name of this record is the name of the enum value which is used to refer to the instruction. A specialized backend in the TableGen tool collects all records that subclass the Instruction class and generates instruction information tables that are used by the C++ code in the backend and the shared codegen infrastructure.

The main point of the TableGen DSL is to provide an ostensibly convenient way to generate a large set of records in a structured fashion that exploits regularities in the target machine architecture. To get an idea of the scope, the X86 backend description contains ~47k records generated by ~62k lines of TableGen. The AMDGPU backend description contains ~39k records generated by ~24k lines of TableGen.

To get an idea of what TableGen looks like, consider this simple example:
def Plain {
  int x = 5;
}

class Room<string name> {
  string Name = name;
  string WallColor = "white";
}

def lobby : Room<"Lobby">;

multiclass Floor<int num, string color> {
  let WallColor = color in {
    def _left : Room<num # "_left">;
    def _right : Room<num # "_right">;
  }
}

defm first_floor : Floor<1, "yellow">;
defm second_floor : Floor<2, "gray">;
This example defines 6 records in total. If you have an LLVM build around, just run the above through llvm-tblgen to see them for yourself. The first one has the name Plain and contains a single value named x whose value is 5. The other 5 records have Room as a superclass and contain different values for Name and WallColor.

The first of those is the record named lobby, whose Name value is "Lobby" (note the difference in capitalization) and whose WallColor is "white".

Then there are four records with the names first_floor_left, first_floor_right, second_floor_left, and second_floor_right. Each of those has Room as a superclass, but not Floor. Floor is a multiclass, and multiclasses are not classes (go figure!). Instead, they are simply collections of record prototypes. In this case, Floor has two record prototypes, _left and _right. They are instantiated by each of the defm directives. Note how even though def and defm look quite similar, they are conceptually different: one instantiates the prototypes in a multiclass (or several multiclasses), the other creates a record that may or may not have one or more superclasses.
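For instance, first_floor_left dumps roughly as follows (I'm omitting implicit fields such as NAME here):
def first_floor_left {  // Room
  string Name = "1_left";
  string WallColor = "yellow";
}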

The Name value of first_floor_left is "1_left" and its WallColor is "yellow", overriding the default. This demonstrates the late-binding nature of TableGen, which is quite useful for modeling exceptions to an otherwise regular structure:
class Foo {
  string salutation = "Hi";
  string message = salutation#", world!";
}

def : Foo {
  let salutation = "Hello";
}
The message of the anonymous record defined by the def-statement is "Hello, world!".

There is much more to TableGen. For example, a particularly surprising but extremely useful feature is the way bit sequences are used to describe instruction encodings. But that's for another time.

For now, let me leave you with just one of the many ridiculous inconsistencies in TableGen:
class Tag<int num> {
  int Number = num;
}

class Test<int num> {
  int Number1 = Tag<5>.Number;
  int Number2 = Tag<num>.Number;
  Tag Tag1 = Tag<5>;
  Tag Tag2 = Tag<num>;
}

def : Test<5>;
What are the values in the anonymous record? It turns out that Number1 and Number2 are both 5, but Tag1 and Tag2 refer to different records. Tag1 refers to an anonymous record with superclass Tag and Number equal to 5, while Tag2 also refers to an anonymous record, but with Number equal to an unresolved variable reference.

This clearly doesn't make sense at all and is the kind of thing that sometimes makes you want to just throw it all out of the window and build your own DSL with blackjack and Python hooks. The problem with that kind of approach is that even if the new thing looks nicer initially, it'd probably end up in a similarly messy state after another five years.

So when I ran into several problems like the above recently, I decided to take a deep dive into the internals of TableGen with the hope of just fixing a lot of the mess without reinventing the wheel. Over the next weeks, I plan to write a couple of focused entries on what I've learned and changed, starting with how a simple form of functional programming should be possible in TableGen.