Rework geometry shaders
Currently we implement vertex + geometry shaders as:
(VS + GS merged)/compute ---------RAM------> (copy)/vertex
This is bad for several reasons:
- merging VS/GS makes shader objects very challenging to implement
- heap memory usage is unpredictable
- excessive memory usage/bandwidth for amplifying GS
- poor parallelism in the massive merged shader, which can easily become latency-bound
The new approach looks like:
VS/compute -----RAM-------> (GS index buffer + XFB only shader)/compute
|
|-------------------------------------------------------> (GS output)/vertex
The GS is split into a compute prepass, which generates an index buffer but does not shade outputs, and a vertex shader, which shades a single output vertex. This is better because:
- no more merging, easier shader objects
- heap usage proportional to VS outputs, not GS outputs: much more predictable and (for an amplifying GS) much smaller
- reduced memory bandwidth/usage for amplifying GS
- better parallelism with the smaller shaders
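
To make the split concrete, here is a toy CPU model of the idea, not the driver code. Everything in it is an assumption for illustration: the names `toy_gs`, `count_prepass` and `shade_vertex`, the index encoding (`prim * MAX_VERTS + emit number`), and the restart value are all hypothetical, and XFB is not modelled. It only shows that the prepass can record topology without storing shaded vertices, and that the vertex stage can recover any single output vertex by re-running the GS, which is where the redundant work comes from.

```python
# Toy CPU model of the split described above. All names and the index
# encoding are hypothetical; this is not the driver's implementation.

RESTART = 0xFFFFFFFF  # assumed primitive-restart index
MAX_VERTS = 4         # assumed max output vertices per GS invocation


def toy_gs(point, emit):
    """Toy amplifying GS: expand one input point into a 4-vertex strip."""
    x, y = point
    for dx, dy in ((0, 0), (1, 0), (0, 1), (1, 1)):
        emit((x + dx, y + dy))


def count_prepass(vs_outputs):
    """Compute prepass: record topology only, never store shaded vertices."""
    index_buffer = []
    for prim, point in enumerate(vs_outputs):
        emitted = 0

        def emit(_vertex):
            nonlocal emitted
            # Encode (input primitive, emit number) as a single index.
            index_buffer.append(prim * MAX_VERTS + emitted)
            emitted += 1

        toy_gs(point, emit)
        index_buffer.append(RESTART)  # end of this strip
    return index_buffer


def shade_vertex(vs_outputs, index):
    """Vertex stage: re-run the GS for one input primitive, keep one emit."""
    prim, wanted = divmod(index, MAX_VERTS)
    result = []
    seen = 0

    def emit(vertex):
        nonlocal seen
        if seen == wanted:
            result.append(vertex)
        seen += 1

    toy_gs(vs_outputs[prim], emit)  # the other emits are redundant work
    return result[0]


vs_outputs = [(0, 0), (10, 10)]  # the only per-vertex data kept in memory
indices = count_prepass(vs_outputs)
shaded = [shade_vertex(vs_outputs, i) for i in indices if i != RESTART]
```

In this model the only buffers are `vs_outputs` (two entries) and the index buffer; the eight shaded vertices are produced on demand and never written back, which is the property the heap-usage bullet relies on.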
The output programs can do some redundant work, but in practice this ends up massively faster in Citra (my most GS-heavy apitrace). More importantly, it gets us much closer to shader objects. Along the way, we pick up disk caching and shader-db support for GS, and whittle down our GS key to almost nothing.
Frametime in a Citra trace decreased 27% across the series (fps increased 37%).
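
The two percentages describe the same measurement, since fps is the reciprocal of frametime. A quick check of the conversion, using only the 27% figure above (the variable names are just for illustration):

```python
# fps is 1 / frametime, so a 27% shorter frame gives 1 / 0.73 of the old fps.
frametime_drop = 0.27
fps_gain = 1 / (1 - frametime_drop) - 1
print(f"{fps_gain:.1%}")  # prints 37.0%
```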