Draft: This patch speeds up and corrects liveness computation for AGX.
What does this MR do and why?
This patch speeds up and corrects liveness computation for AGX.
The current liveness calculation does not handle phi nodes properly in all cases, and recomputes local properties on every iteration of a block. It also takes at least 2x the iterations it should because it computes live_in before live_out (since it's a backwards problem, live out should be computed first and then propagated backwards to livein)
This patch splits the computation into the traditional local/global sets, so that we only compute the local sets once, and then iterate the global ones, and corrects the computation along the way.
Since we are in SSA form, a variable is live from it's last use, upwards, until the definition. As such, the local properties are defs and upwards exposed uses. These are then propagated backwards through the CFG to generate liveness results.
This is fast enough for what we are trying to do/how big shaders are. It could be made faster by improving the iteration order or moving to a two-pass method or a sparse one. However, this was all done as an intro to the code, not to try to optimize the heck out of it for no reason. I think this version is simpler to understand than the previous one, with the advantage of being faster.
pressure_schedule relies on the existing per-instruction function so i moved the computation there for now.
In terms of correctness - the existing code is "mostly" correct, but overdoes liveness in some cases. This is not the fault of the author - some internet sources seem to just have it wrong and copied from each other. This was noticed a few decades ago - https://lists.llvm.org/pipermail/llvm-dev/2003-November/000542.html See the bottom: "(PHI uses occur in the predecessor blocks, not in the PHI node blocks): B3 should have an empty IN set."
As an example:
block0 {
stack_adjust #18
44h = get_sr #50
45h = get_sr #51
3 = convert #6, 44h, #1
4 = convert #6, 45h, #1
6 = fadd 3, #0.500000
7 = fadd 4, #0.500000
8 = convert #9, 6, #0
9 = convert #9, 7, #0
10 = iadd u4, 8, #0
11 = iadd u5, 9, #0
12 = imad 11, u6, 10, #0
15 = icmpsel 12, u7, 12, u7, #0
begin_cf
if_icmp u8l, #0, #aaaaaf8dfd10, n=1, inv
} -> block1 block2
block1 {
18 = iadd 15, 15, #0
19 = device_load u10:u11, 18.abs, xyz, #0, slot 0
46, 47, 48 = split 19
22 = mov_imm #3f800000
} -> block3 from block0
block2 {
else_fcmp #0.000000, #0.000000, #aaaaaf8dfd10, n=1
24h = mov_imm #0
26 = mov_imm #0
28 = mov_imm #3ff
29 = and 15, 28
31 = bfeil #0, 15, #10, #0
32 = collect 29, 31
49 = texture_load 32, 24h, u0:u1, 26, #1, _, xyzw, #6, 2d, slot 0
50, 51, 52, 53 = split 49
pop_exec #0, n=1
} -> block3 from block0
block3 {
38 = phi 46, 50
39 = phi 47, 51
40 = phi 48, 52
41 = phi 22, 53
wait_pix #c
54 = collect 38, 39, 40, 41
st_tile 54, #255, xyzw, #0
stop
} from block1 block2
Old liveness:
live_out[0]: { 15 }
live_in[1]: { 15 }
live_out[1]: { 22 46 47 48 }
live_in[2]: { 15 }
live_out[2]: { 50 51 52 53 }
live_in[3]: { 38 39 40 41 }
live_out[3]: { }
New Liveness:
live_in[0]: { }
live_out[0]: { 15 }
live_in[1]: { 15 }
live_out[1]: { 22 46 47 48 }
live_in[2]: { 15 }
live_out[2]: { 50 51 52 53 }
live_in[3]: { }
live_out[3]: { }
New is correct - 38/39/40/41 are not live into the block, as they are defined at the entry of the block. The phi uses are live out of the predecessor block, as the operation occurs on an edge. However, the phi defs occur at entry to the block, and are not exposed upwards. They can't be. If they were, they would violate the dominance property of SSA (defs dominate all uses).
There are some other degenerate cases this results in where liveness is overshot by the current algorithm.