Subgroup operations or wave intrinsics, such as reducing a value across the
threads of a shader subgroup or wave, were introduced in GPU programming
languages a while ago. Through them, a thread communicates with other threads
of the same wave, for example to exchange the input values of a reduction --
but not necessarily with all of them when control flow is divergent.
In LLVM, we call such operations convergent. Unfortunately, LLVM does not define how the set of communicating threads in convergent
operations -- the set of converged threads -- is affected by control
flow.
If you're used to thinking in terms of structured control flow, this may
seem trivial. Obviously, there is a tree of control flow constructs: loops,
if-statements, and perhaps a few others depending on the language. Two threads
are converged in the body of a child construct if and only if both execute that
body and they are converged in the parent. Throw in some simple and
intuitive rules about loop counters and early exits (nested return, break and
continue, that sort of thing) and
you're done.
In an unstructured control flow graph, the answer is not obvious at all. I gave
a presentation at the 2020 LLVM
Developers' Meeting that explains some of the challenges as well as a solution
proposal that involves adding convergence control tokens to the IR.
Very briefly, convergent operations in the proposal use a token variable that is defined by a
convergence control intrinsic. Two dynamic instances of the same
static convergent operation from two different threads are converged if and only
if the dynamic instances of the control intrinsic producing the used token
values were converged.
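As a concrete sketch of what this looks like in IR (the intrinsic and operand-bundle names below are the ones used by LLVM's convergent-operations documentation; `@wave.reduce.add` is a made-up stand-in for any convergent operation):

```llvm
define i32 @reduce_example(i32 %x) convergent {
  ; Defines a token. For an anchor, which threads have converged dynamic
  ; instances of this call is implementation-defined.
  %tok = call token @llvm.experimental.convergence.anchor()
  ; Convergent operations refer to the token via an operand bundle. Two
  ; threads communicate in this reduction if and only if their dynamic
  ; instances of the anchor above were converged.
  %r = call i32 @wave.reduce.add(i32 %x) [ "convergencectrl"(token %tok) ]
  ret i32 %r
}

declare token @llvm.experimental.convergence.anchor()
; Hypothetical convergent operation, e.g. a wave-wide reduction.
declare i32 @wave.reduce.add(i32) convergent
```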
(The published draft of the proposal
talks of multiple threads executing the same dynamic instance. I have since been
convinced that it's easier to teach this matter if we instead always give every
thread its own dynamic instances and talk about a convergence equivalence
relation between dynamic instances. This doesn't change the resulting
semantics.)
The draft has three such control intrinsics: anchor, entry, and (loop) heart.
Of particular interest here is the heart. For the most common and intuitive use cases, a
heart intrinsic is placed in the header of natural loops. The token it defines
is used by convergent operations in the loop. The heart intrinsic itself also uses
a token that is defined outside the loop: either by another heart in the case of
nested loops, or by an anchor or entry. The heart combines two intuitive behaviors:
- It uses a token in much the same way that convergent operations do: two
threads are converged for their first execution of the heart if and only if they
were converged at the intrinsic that defined the used token.
- Two threads are converged at subsequent executions of the heart if and only if they were converged for the first execution and they are currently at the same loop iteration, where iterations are counted by a virtual loop counter that is incremented at the heart.
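In IR terms, the common pattern for a natural loop might look roughly as follows (the heart is the `llvm.experimental.convergence.loop` intrinsic; `@convergent.op` is again a hypothetical stand-in):

```llvm
define void @natural_loop(i32 %n, i32 %x) convergent {
entry:
  ; Entry token: threads are converged here as they were in the caller.
  %outer = call token @llvm.experimental.convergence.entry()
  br label %header

header:
  %i = phi i32 [ 0, %entry ], [ %i.next, %header ]
  ; The heart. First execution: converged iff converged at the definition
  ; of %outer. Later executions: converged iff converged at the first one
  ; and at the same virtual iteration count.
  %heart = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
  call void @convergent.op(i32 %x) [ "convergencectrl"(token %heart) ]
  %i.next = add i32 %i, 1
  %cont = icmp ult i32 %i.next, %n
  br i1 %cont, label %header, label %exit

exit:
  ret void
}

declare token @llvm.experimental.convergence.entry()
declare token @llvm.experimental.convergence.loop()
declare void @convergent.op(i32) convergent ; hypothetical convergent op
```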
Viewed from this angle, how about we define a weaker version of these rules
that lies somewhere between an anchor and a loop heart? We could call it a
"light heart", though I will stick with "iterating anchor".
The iterating anchor defines a token but takes no token operand. As with the
plain anchor, the set of threads that are converged when the iterating anchor
is first encountered is implementation-defined. When threads encounter the
iterating anchor again without leaving the dominance region of its containing
basic block, they are converged if and only if they were converged during their
previous encounter of the iterating anchor.
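A sketch of how this could look in IR -- note that the intrinsic name here is invented for illustration; no such intrinsic exists in LLVM:

```llvm
loop:
  ; Hypothetical "iterating anchor": defines a token, takes no token
  ; operand. Convergence at the first encounter is implementation-defined;
  ; threads re-encountering it without leaving the dominance region of
  ; %loop are converged iff they were converged at the previous encounter.
  %it = call token @llvm.experimental.convergence.anchor.iterating()
  %v = call i32 @wave.reduce.add(i32 %x) [ "convergencectrl"(token %it) ]
  br i1 %cc, label %loop, label %exit
```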
The notion of an iterating anchor came up when discussing the
convergence behaviors that can be guaranteed for natural loops. Is it possible to
guarantee that natural loops always behave in the natural way -- according to
their loop counter -- when it comes to convergence?
Naively, this should be possible: just put hearts into loop headers!
Unfortunately, that's not so straightforward when multiple natural loops are
contained in an irreducible loop -- for example, when a block E branches into
an irreducible region in which the natural loop headers A and C are each
reachable from the other:
Hearts in A and C must refer to a token defined outside the loops; that is, a
token defined in E. The resulting program is ill-formed because it has a closed
path that goes through two hearts that use the same token, but the path does
not go through the definition of that token. This well-formedness rule exists
because the rules about heart semantics are unsatisfiable if the rule is broken.
The underlying intuitive issue is that, in a typical implementation, if the
branch at E is divergent, the wave (or subgroup) must choose whether to execute
A or C first -- and neither choice works.
The heart in A indicates that (among the threads that are converged in E) all
threads that visit A (whether immediately or via C) must be converged during
their first visit of A. But if the wave executes A first, then threads which
branch directly from E to A cannot be converged with those that first branch to
C. The opposite conflict exists if the wave executes C first.
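Assuming one concrete shape for this CFG (A and C each head a natural self-loop and branch to each other), the ill-formed program would look roughly like this:

```llvm
E:
  %t = call token @llvm.experimental.convergence.anchor()
  br i1 %c0, label %A, label %C

A:                                   ; header of a natural (self-)loop
  %ha = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %t) ]
  br i1 %c1, label %A, label %C

C:                                   ; header of a natural (self-)loop
  %hc = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %t) ]
  br i1 %c2, label %C, label %A
```

The closed path A -> C -> A goes through both hearts, which use the same token %t, but not through the definition of %t in E: exactly the well-formedness violation described above.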
If we replace the hearts in A and C by iterating anchors, this problem goes
away because the convergence during the initial visit of each block is
implementation-defined. In practice, it should fall out of which of the blocks
the implementation decides to execute first.
So it seems that iterating anchors can fill a gap in the expressiveness of the
convergence control design. But are they really a sound addition? There are two
main questions:
- Satisfiability: Can the constraints imposed by iterating anchors be
satisfied, or can they cause the sort of logical contradiction discussed for the example above? And if so, is there a simple static rule that prevents such
contradictions?
- Spooky action at a distance: Are there generic code transforms which change
semantics while changing a part of the code that is distant from the iterating
anchor?
The second question is important because we want to add convergence control to
LLVM without having to audit and change the existing generic transforms. We certainly don't want to hurt compile-time performance by increasing the amount of code that generic transforms have to examine for making their
decisions.
Satisfiability
Consider the following simple CFG with an iterating anchor in A and a heart in B
that refers back to a token defined in E, and suppose two threads take the
following paths through it:
- Thread 1: E, A, A, B, X
- Thread 2: E, A, B, A, X
The heart rule implies that the threads must be converged in B. The iterating
anchor rule seems to imply that if the threads are converged in their first
dynamic instances of A, then they must also be converged in their second
dynamic instances of A -- which looks like a temporal paradox, because the
second dynamic instances of A cannot be converged given the convergence of the
dynamic instances of B. But the paradox is only apparent: the second dynamic
instance of A in thread 2 is a re-entry into the dominance region of A (thread
2 reaches it via B, which A does not dominate), and so its convergence is
unrelated to any convergence of earlier dynamic instances of A.
Spooky action at a distance
Unfortunately, we still cannot allow this example. A program transform may
find that the conditional branch in E is constant and that the edge from E to B
is dead. Removing that edge makes A dominate B, so the re-entry argument no
longer applies and the convergence constraints become genuinely unsatisfiable:
the program is ill-formed. However, a transform which removes a dead edge would
not normally inspect the blocks A and B or their dominance relation in detail.
The program becomes ill-formed by spooky action at a distance.
The following static rule forbids both example CFGs: if there is a closed path
through a heart and an iterating anchor, but not through the definition of the
token that the heart uses, then the heart must dominate the iterating anchor.
There is at least one other issue of spooky action at a distance. If the
iterating anchor is not the first (non-phi) instruction of its basic block,
then it may be preceded by a function call in the same block. The callee may
contain control flow that ends up being inlined. Back edges that previously led
to the block containing the iterating anchor will then lead to a block that no
longer contains it, which changes the behavior quite drastically: every trip
around the loop now leaves the dominance region of the anchor's new, later
block, so the iterating anchor is effectively reduced to a plain anchor.
What can we do about that? It's tempting to decree that an iterating anchor must
always be the first (non-phi) instruction of a basic block. Unfortunately, this
is not easily done in LLVM in the face of general transforms that might sink
instructions or merge basic blocks.
Preheaders to the rescue
We could chew through some other ideas for making iterating anchors work, but
that turns out to be unnecessary. The desired behavior of iterating anchors can
be obtained by inserting preheader blocks. In the initial example of two
natural loops contained in an irreducible loop, we insert preheader blocks Ap
and Cp in front of the loop headers A and C.
Place anchors in Ap and Cp and hearts in A and C that use the token defined by
their respective dominating anchor. Convergence at the anchors is
implementation-defined, but relative to this initial convergence at the anchor,
convergence inside the natural loops headed by A and C behaves in the natural
way, based on a virtual loop counter. The transform of inserting an anchor in the preheader is easily generalized.
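In IR, the transformed example would look roughly like this (assuming, as a concrete shape, that A and C each head a natural self-loop and branch to each other):

```llvm
E:
  br i1 %c0, label %Ap, label %Cp

Ap:                                  ; preheader of the loop headed by A
  %ta = call token @llvm.experimental.convergence.anchor()
  br label %A

A:
  %ha = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %ta) ]
  br i1 %c1, label %A, label %Cp

Cp:                                  ; preheader of the loop headed by C
  %tc = call token @llvm.experimental.convergence.anchor()
  br label %C

C:
  %hc = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %tc) ]
  br i1 %c2, label %C, label %Ap
```

The two hearts now use different tokens, so no closed path goes through two hearts that use the same token, and the well-formedness rule is satisfied. Each time a thread enters one of the natural loops, the anchor in the corresponding preheader re-establishes an implementation-defined set of converged threads.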
To sum it up: We've concluded that defining an "iterating anchor"
convergence control intrinsic is problematic, but luckily also unnecessary. The
control intrinsics defined in the original proposal are sufficient. I hope that the
discussion that led to those conclusions helps illustrate some aspects of the convergence control proposal for LLVM as well as the
goals and principles that drove it.