Fragment shader interlock test possibly violates issue 1 resolution and fails on AMD

During a test run for the fragment shader interlock implementation in RADV (the comments related to this issue start with "spec@arb_fragment_shader_interlock@arb_fragment_shader_interlock-image-load-store,Fail on Navi10!"), the spec@arb_fragment_shader_interlock@arb_fragment_shader_interlock-image-load-store test failed when run on Zink on top of RADV with the new changes.

After some investigation, I started to suspect that the test assumed a synchronization guarantee that was not provided by the specification of GL_ARB_fragment_shader_interlock.

The test covers a somewhat extreme case: layout(pixel_interlock_ordered) with sample shading (gl_SampleID). While of course valid, this is a pretty scary combination. It's not exposed in Direct3D, Metal and GL_INTEL_fragment_shader_ordering (which force sample interlock for sample shading), and it's not very useful in real life, especially since gl_SampleMaskIn in OpenGL and Vulkan has only one bit set with sample shading (unlike in Direct3D) and thus can't be used for tie breaking.

From each per-MSAA-sample invocation, the test performs read–modify–writes of two memory regions at the granularity of the entire pixel. One of them is a per-pixel vec4 that each MSAA sample attempts to write to at the same address:

ivec3 result_coord = ivec3(gl_FragCoord.x, gl_FragCoord.y, sample_rate);
// …
imageStore(img_output, result_coord, result);

The other is an array of values for each sample. Each per-sample invocation first read–modify–writes the value for its own element:

ivec3 current_sample_coord = ivec3(gl_FragCoord.x, gl_FragCoord.y, gl_SampleID);
vec4 current_sample_color = imageLoad(img_output, current_sample_coord);
// …
imageStore(img_output, current_sample_coord, result);

But then, it reads values written by invocations for other samples in the same pixel:

for (i = 0; i < sample_rate; i++) {
  if (i != gl_SampleID) {
    ivec3 sample_coord = ivec3(gl_FragCoord.x, gl_FragCoord.y, i);
    vec4 sample_color = imageLoad(img_output, sample_coord);
    // …
  }
}

The test draws 6 large triangles, each covering half of a 50x100 or a 100x100 area, with no sample mask. Therefore, most of the time, each triangle covers more than one sample per pixel, so with sample shading a single primitive generates multiple fragment shader invocations for the same pixel (that is, for the same fragment).

Those invocations try to perform read–modify–writes to the same memory locations, and for that, they require interlocked execution. However, the specification of GL_ARB_fragment_shader_interlock contains an issue describing the expected behavior in this situation:

    (1) When using multisampling, the OpenGL specification permits
        multiple fragment shader invocations to be generated for a single
        fragment.  For example, per-sample shading using the "sample"
        auxiliary storage qualifier or the MinSampleShading() OpenGL API command
        can be used to force per-sample shading.  What execution ordering
        guarantees are provided between fragment shader invocations generated
        from the same fragment?

      RESOLVED:  We don't provide any ordering guarantees in this extension.
      This implies that when using multisampling, there is no guarantee that
      two fragment shader invocations for the same fragment won't be executing
      their critical sections concurrently.  This could cause problems for
      algorithms sharing data structures between all the samples of a pixel
      unless accesses to these data structures are performed atomically.

      When using per-sample shading, the interlock we provide *does* guarantee
      that no two invocations corresponding to the same sample execute the
      critical section concurrently.  If a separate set of data structures is
      provided for each sample, no conflicts should occur within the critical
      section.

As the invocations trying to access data for the same pixel don't correspond to the same sample, the "no two invocations corresponding to the same sample execute the critical section concurrently" rule does not apply to this case, and it's covered by the general expectation for multiple fragment shader invocations for the same fragment: "there is no guarantee that two fragment shader invocations for the same fragment won't be executing their critical sections concurrently".
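For contrast, the access pattern that the second paragraph of the resolution does cover can be sketched as follows. This is a minimal illustration, not the Piglit test's shader: the image name img_output is borrowed from the excerpts above, while the GLSL version, the binding, the rgba32f format, the image2DArray type and the increment are all assumptions. Each invocation touches only the layer selected by its own gl_SampleID, so two invocations for different samples of the same pixel never share an address inside the critical section.

```glsl
#version 450
#extension GL_ARB_fragment_shader_interlock : require

// Same interlock mode as the case discussed in this issue.
layout(pixel_interlock_ordered) in;

// Assumed layout: one array layer per sample (format and binding are illustrative).
layout(rgba32f, binding = 0) coherent uniform image2DArray img_output;

void main()
{
    // Referencing gl_SampleID forces per-sample shading.
    ivec3 own_coord = ivec3(gl_FragCoord.xy, gl_SampleID);

    beginInvocationInterlockARB();
    // Read-modify-write of this sample's own element only; no element
    // belonging to another sample of the pixel is read or written.
    vec4 value = imageLoad(img_output, own_coord);
    imageStore(img_output, own_coord, value + vec4(1.0));
    endInvocationInterlockARB();
}
```

With this layout, the only invocations that can conflict are ones for the same sample from different primitives, which is the case the resolution says the interlock serializes.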

In addition, Vulkan fragment shader interlock, which provides API support for GL_ARB_fragment_shader_interlock as specified in the VK_EXT_fragment_shader_interlock description, defines synchronization in the fragment interlock scope as:

If the PixelInterlockOrderedEXT execution mode is specified, any interlocked operations in a fragment shader must happen before interlocked operations in fragment shader invocations that execute later in rasterization order and cover at least one sample in the same pixel, and must happen after interlocked operations in a fragment shader that executes earlier in rasterization order and cover at least one sample in the same pixel.

However, fragment shader invocations generated from the same primitive are neither earlier nor later in rasterization order than each other, so no ordering is specified for them. Furthermore, if such ordering were intended, the specification would need to define the order in which sample-shading invocations (for one sample with 1.0 sample shading, or for multiple samples with fractional sample shading) enter the critical section, but it does not.

It appears that Intel actually does provide the synchronization guarantee in this situation. But in general, resolving overlap within a primitive may require hardware logic that's significantly different from handling overlap between waves (and that's also not needed for any API other than OpenGL and Vulkan), and AMD doesn't implement it. Providing this guarantee would also mean effectively running all sample-rate shader invocations for a single pixel in order, even when different primitives don't overlap.

(As a side note, I think it would be nice if Vulkan and OpenGL had a mode with interlock granularity corresponding to the actual fragment shader execution granularity. That may be complicated to define with fractional sample shading, but this case may be resolved conservatively by using SampleInterlock if sample shading is enabled at all, regardless of its amount. Without such a mode, it's impossible to implement PixelInterlock in translation drivers such as the D3D12-based Dozen and the Metal-based MoltenVK, and potentially on some hardware initially developed with only Direct3D or Metal in mind with regard to fragment shader interlock, as those implementations would not be able to provide PixelInterlock with sample shading. Specifically, this also makes VKD3D–Dozen–VKD3D round trips impossible: VKD3D requires PixelInterlock with MSAA in general to implement Direct3D rasterizer-ordered views, but doesn't require PixelInterlock with sample shading, yet because Dozen can't provide the latter, PixelInterlock has to be disabled entirely. But this is of course a topic for a KhronosGroup/Vulkan-Docs issue rather than Piglit.)

According to my experiments, however, AMD does support the general case of pixel_interlock with sample shading as required. Hackily modified usage examples of pixel interlock become stable at common edges of adjacent primitives, and they take a large speed hit with pixel_interlock_ordered (POPS_OVERLAP_NUM_SAMPLES = log2(1) in the DB_SHADER_CONTROL hardware register) compared to sample_interlock_ordered (POPS_OVERLAP_NUM_SAMPLES = MSAA_EXPOSED_SAMPLES in hardware), regardless of whether sample shading is used.

There are possible alternative implementations of a test specifically for pixel_interlock with sample shading. However, because gl_SampleMaskIn contains only the samples corresponding to the current shader invocation (so shaders can't use gl_SampleID == findLSB(gl_SampleMaskIn[0]) to ensure that the memory accesses in the critical section are done only for one sample), the possibilities here are greatly limited.

The most straightforward way to ensure that only one sample per pixel accesses the shared memory in each draw would be performing multiple draws, each with only one sample enabled via glSampleMaski, switching between different sample masks. GL_SAMPLE_MASK is a part of early per-fragment tests, and it overwrites the coverage value of the fragment, so it should affect both which fragments require ordering and which fragment shader executions need to be performed. With pixel_interlock_ordered (unlike with sample_interlock), fragment shader invocations for a draw with glSampleMaski(0, 0b01) and a draw with glSampleMaski(0, 0b10) should be ordered relative to each other.
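A shader for such a test might look roughly like the sketch below. It is only an illustration of the idea, not a proposed patch: the image name img_order, its r32ui format and binding, and the update formula are hypothetical. The host side is assumed to enable GL_SAMPLE_MASK and issue one draw per sample with glSampleMaski(0, 1u << s), as described above. The update is deliberately non-commutative, so the final per-pixel value depends on the order in which the draws entered the critical section.

```glsl
#version 450
#extension GL_ARB_fragment_shader_interlock : require

layout(pixel_interlock_ordered) in;

// Hypothetical per-pixel state shared by all samples of the pixel.
layout(r32ui, binding = 0) coherent uniform uimage2D img_order;

void main()
{
    ivec2 pixel = ivec2(gl_FragCoord.xy);

    beginInvocationInterlockARB();
    // Per-pixel read-modify-write. Using gl_SampleID keeps sample shading
    // enabled; the host-side sample mask (an assumption of this sketch) is
    // what limits each draw to a single sample per pixel.
    uint previous = imageLoad(img_order, pixel).x;
    uint updated = previous * 4u + uint(gl_SampleID) + 1u;
    imageStore(img_order, pixel, uvec4(updated));
    endInvocationInterlockARB();
}
```

Because the draws are separate primitives, they are ordered relative to each other in rasterization order, so this variant relies only on ordering that the specifications quoted above actually define. Whether implementations skip invocations for samples masked out by GL_SAMPLE_MASK is the expectation stated in the text, not something verified here.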

Edited Jun 25, 2023 by Triang3l (Vitaliy Kuzmin)
