terakan: Handle 1D-thin-tiled mip pitch alignment difference between depth and stencil
For the 1D thin tiling mode, the alignment of the row pitch depends on the number of bytes per element. On GCN this issue isn't present, as the pitch is always aligned to 8 elements, but on R8xx/R9xx the alignment formula differs (see EgBasedLib::HwlGetPitchAlignmentMicroTiled in AddrLib). With the default pipe interleave of 256 bytes, stencil needs the pitch to be a multiple of 32 pixels, while for depth the alignment is only 16 pixels for D16 and 8 pixels for D24X8/D32.
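As a rough illustration of where those numbers come from, here is a minimal sketch. The helper name and the formula are my assumptions, picked only because they reproduce the 32/16/8 figures above for a 256-byte pipe interleave; the actual AddrLib expression may take further factors into account (sample count, tile thickness), so this is not a copy of it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not an AddrLib or Mesa function.
 * Returns the row pitch alignment, in elements, for 1D thin tiling. */
static inline uint32_t
pitch_align_1d_thin(uint32_t pipe_interleave_bytes, uint32_t bytes_per_element)
{
   assert(bytes_per_element != 0);
   /* Assumed formula, chosen to match the figures quoted in the text:
    * a row of micro tiles (8 elements tall) should cover a whole number of
    * pipe interleave units, and the pitch can never be narrower than one
    * 8-element-wide micro tile. */
   uint32_t align = pipe_interleave_bytes / (8 * bytes_per_element);
   return align > 8 ? align : 8;
}
```

With a 256-byte interleave this gives 32 for stencil (1 byte per element), 16 for D16 (2 bytes) and 8 for D24X8/D32 (4 bytes).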
However, for the depth/stencil attachment, there's only one register specifying the pitch for both depth and stencil: DB_DEPTH_SIZE::PITCH_TILE_MAX.
For the base level, it's possible to manually overalign the pitch of the depth aspect of combined depth/stencil images, as in all places — DB, TC, CB — the base level pitch is specified explicitly. But for mips of sampled images, TC hardware computes the pitch implicitly from the width, so it won't match the pitch used for the depth attachment pointing to a mip in this case. It's also not possible to force 2D tiling which doesn't suffer from this issue, as TC also automatically degrades the tiling to 1D for small mips.
It should be possible to work around this issue by rendering to a separate depth buffer with the overaligned pitch, and then copying the result to the actual texture read by the TC.
The overaligned depth buffer needs to be allocated in the image's memory if the image is a combined depth/stencil one with any 1D-thin-tiled mips for which the depth row pitch doesn't match that of the stencil. The needed amount of memory should be calculated for the largest mip meeting this condition. Because the width, height and depth of mips (unlike those of the base level) are padded to powers of two in surface calculations in the TC, the overhead should be small in practice, as the mismatch should occur only for mips not wider than 16 pixels. Note that the intermediate image should be allocated for all combined depth/stencil images with mips. There's no need to check the usage flags granularly: if the application is using mips with depth/stencil, it's likely going to access them via a sampled image or at least as a transfer source anyway, and both are TC resources.
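To make the "largest mip meeting this condition" part concrete, here is a sketch of how it could be found. All names are hypothetical, the alignment formula is the same assumption as for the figures above (not the AddrLib code), and it simplifies things by treating every mip as 1D-thin-tiled, which real surface calculations would have to check per level.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed alignment formula (elements), matching 32/16/8 at 256 bytes. */
static inline uint32_t
pitch_align_1d_thin(uint32_t pipe_interleave_bytes, uint32_t bytes_per_element)
{
   uint32_t align = pipe_interleave_bytes / (8 * bytes_per_element);
   return align > 8 ? align : 8;
}

/* Smallest power of two that is >= value (value >= 1). */
static inline uint32_t
next_pow2(uint32_t value)
{
   uint32_t pot = 1;
   while (pot < value)
      pot <<= 1;
   return pot;
}

/* Rounds value up to a power-of-two alignment. */
static inline uint32_t
align_up(uint32_t value, uint32_t alignment)
{
   return (value + alignment - 1) & ~(alignment - 1);
}

/* Returns the first (largest) non-base mip level whose depth pitch differs
 * from the stencil pitch, or -1 if there is none. */
static inline int
first_mismatching_mip(uint32_t base_width, uint32_t level_count,
                      uint32_t depth_bytes_per_element,
                      uint32_t pipe_interleave_bytes)
{
   assert(level_count <= 32);
   uint32_t depth_align =
      pitch_align_1d_thin(pipe_interleave_bytes, depth_bytes_per_element);
   uint32_t stencil_align = pitch_align_1d_thin(pipe_interleave_bytes, 1);

   for (uint32_t level = 1; level < level_count; level++) {
      uint32_t width = base_width >> level;
      /* Mip dimensions are padded to powers of two, unlike the base level. */
      width = next_pow2(width ? width : 1);
      if (align_up(width, depth_align) != align_up(width, stencil_align))
         return (int)level;
   }
   return -1;
}
```

For a 1024-pixel-wide D32 + S8 image this reports level 6 (16 pixels wide) as the first mismatching mip, which is in line with the expectation that only mips not wider than 16 pixels are affected.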
The workaround should be activated when vkCmdBeginRendering is called with both the depth and the stencil aspects in use and they point to a mip that requires different pitch alignments. In a dynamic rendering instance with this workaround:
- The intermediate depth buffer should be bound to the DB rather than the actual mip.
- VK_ATTACHMENT_LOAD_OP_LOAD for the depth attachment should run a meta pass copying from the actual mip to the overaligned intermediate depth buffer.
- A meta pass copying from the intermediate depth buffer to the actual mip should be done in two cases:
  - Subpass-local barrier for depth write source access.
  - vkCmdEndRendering done with VK_ATTACHMENT_STORE_OP_STORE for the depth attachment.
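The decisions in the list above could be sketched roughly as follows. The struct, enums and function names are hypothetical stand-ins (the real code would use the Vulkan load/store op enums and the driver's own rendering state), so this only shows the conditions, not an actual implementation.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the Vulkan attachment load/store ops. */
enum ds_load_op { DS_LOAD_OP_LOAD, DS_LOAD_OP_CLEAR, DS_LOAD_OP_DONT_CARE };
enum ds_store_op { DS_STORE_OP_STORE, DS_STORE_OP_DONT_CARE };

/* Hypothetical summary of the depth/stencil state at vkCmdBeginRendering. */
struct ds_rendering_state {
   bool depth_bound;
   bool stencil_bound;
   bool mip_pitch_mismatch; /* depth and stencil pitch differ for this mip */
   enum ds_load_op depth_load_op;
   enum ds_store_op depth_store_op;
};

/* The intermediate depth buffer is bound to the DB instead of the mip. */
static inline bool
workaround_active(const struct ds_rendering_state *state)
{
   return state->depth_bound && state->stencil_bound &&
          state->mip_pitch_mismatch;
}

/* Meta pass: actual mip -> intermediate buffer when rendering begins. */
static inline bool
needs_copy_to_intermediate(const struct ds_rendering_state *state)
{
   return workaround_active(state) &&
          state->depth_load_op == DS_LOAD_OP_LOAD;
}

/* Meta pass: intermediate buffer -> actual mip at vkCmdEndRendering. */
static inline bool
needs_copy_back_at_end(const struct ds_rendering_state *state)
{
   return workaround_active(state) &&
          state->depth_store_op == DS_STORE_OP_STORE;
}

/* Meta pass: intermediate buffer -> actual mip on a subpass-local barrier
 * with depth write as the source access. */
static inline bool
needs_copy_back_on_barrier(const struct ds_rendering_state *state,
                           bool src_access_is_depth_write)
{
   return workaround_active(state) && src_access_is_depth_write;
}
```

Keeping each condition in its own predicate means a depth-only or stencil-only rendering instance never touches the intermediate buffer.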
It's not clear what exactly to do with input attachments, however. The possible options are:
- Input attachment descriptor always pointing to the actual mip.
  - Can always be loaded from the descriptor set directly.
  - Requiring copying on a subpass-local barrier from depth write for both self-dependencies and feedback loops.
- Input attachment possibly being the intermediate resource (likely as the base address with explicit power-of-two padding if possible, since its dimensions apparently are not observable from the shader).
  - Difficulties with storing in descriptor sets:
    - If the intermediate surface is used regardless of whether the stencil attachment is needed, the descriptor can be loaded directly from the descriptor set.
    - If the intermediate surface is only used when both depth and stencil aspects are bound as attachments, but not when only the depth, the input attachment binding would need to be overridden dynamically.
  - Requiring copying on a subpass-local barrier from depth write only for feedback loops.
However, the combination of everything involved here should be considered an extremely rare edge case, for which the handling should preferably be as simple as possible. So I think input attachment descriptors should be created just as usual, without any consideration of this workaround, and barriers for subpass self-dependencies and feedback loops should be handled the same way.