Duplicated Shader Compilation and Shader Program Linking

@gerddie @gurchetansingh @lepton: I've scattered hints about this issue in a few different issues/MRs lately, but now is a good time to formalize and organize the discussion into one place. I welcome discussion about the feasibility of the solution presented herein. Hopefully we can come to consensus about the appropriate way to resolve this issue and potentially improve the maintainability of vrend_shader.c for the long term.

Overview

Lately we've noticed a lot more time spent on shader compilation than previously, and eventually bisected the issue to 0f5cb454. Concretely, I found that a user application that was expected to contain only one fragment and one vertex shader compilation, was actually triggering two calls to vrend_shader_create(), vrend_convert_shader(), and vrend_compile_shader() for each shader type. This lead to a doubling in the number of calls to add_shader_program() as a consequence.

Small Scale Investigation

Digging in, I observed the following sequence of events: In summary, the steps in 0f5cb454 leading to the performance regressions observed in automatic game trace replay are:

fragment shader (fs) is emitted by guest, then a variant is immediately created/converted/compiled on the host, which provides fs_info for the following vertex shader's key
vertex shader (vs) is emitted by guest, then a variant is immediately created/converted/patched/compiled on the host.
- vs key inherits the fs_info and compiled_fs_uid params from the fs in step 1
A new fs variant is converted/compiled during the next call to vrend_draw_vbo(), unprompted by the guest (no extra call to vrend_decode_create_shader())
- ... since vrend_shader_select() fails to match the original fs variant due to the new fs key (specifically the members: num_prev_generic_and_patch_outputs, and prev_stage_num_clip_out both changed due to the presence of vs in step 2).
shader program is linked from vs in step 2 and re-compiled fs in step 3.
A new vs variant is converted/patched/compiled during the next vrend_draw_vbo()
- ... since the new vertex shader key is different from the one in step 2 (compiled_fs_uid has changed due to the fragment shader recompile in step 3.
A new shader program is linked from vs in step 5 and fs in step 3
- ... since the bound vs variant uid was changed again by step 5

* Extra steps are shown in bold

Game Trace Replay Reproduction

I confirmed the occurence of this issue at larger scale using apitrace to repeatably render a game trace running under QEMU with virgl enabled. I found the following number of function calls for 0f5cb454 (issue first observed) and the previous commit (c488abed). I also ran with a patch (tracked in !425 (closed)) that removed some of the problematic members from struct vrend_shader_key, which was enough to get vs and fs shaders compiling reliably, and without duplication, for the limited scope of the test.

num. function calls (QEMU:L4D2)	pre (c488abed)	post (0f5cb454)	0f5cb454-patched
vrend_create_shader	402	402	402
vrend_convert_shader	602	805	402
vrend_patch_vertex_shader_interpolants	203	406	201
vrend_compile_shader	805	805	402
add_shader_program	203	406	201
Average framerate (fps; N=5 replays)	29.7	27.6	30.2

* Some noise in function call counts is attributed to the window manager

Some observations:

The guest is not complicit in this issue, since the number of calls to vrend_create_shader is constant
Even before 0f5cb454, there was repeated fs variant conversion/compilation (confirmed in testing, and also apparent by 1.5x increase in calls to vrend_convert_shader (observed ~600 vs expected ~400) but no change in calls to vrend_patch_vertex_shader_interpolants() (~200), which is only used for vs)

The State of vrend_shader.c

After studying the internals of vrend_renderer.c and vrend_shader.c, I believe the current state of the shader compilation and subsequent variant caching is quite brittle. Its brittleness is primarily due to the use of both struct vrend_shader_info and struct vrend_shader_key to encode the dependencies between various stages of the rendering pipeline (for example the fs/vs cyclical dependency documented above).

A Possible "Cheap" Solution:

It should be possible, with enough effort (and frustration), to add more conditionals to the tgsi->glsl conversion process and shader variant key-setting (vrend_fill_shader_key()) process. If this is done effectively, it could solve the current issue of multiple shader conversion/compilation. However, I consider this a crude bandage over the larger issue, because it doesn't reduce the brittleness of the current shader dependency resolution approach, and in fact, it adds even more complexity that will likely come back to haunt us in the future. Complexity here acts as a direct enemy to further performance optimization.

A Long Term Solution

This issue seems to be an ideal impetus for improving the shader conversion and caching lifecycle. I have been thinking about a long-term solution and I believe the answer lies in deferral of shader conversion until a call to vrend_draw_vbo() triggers shader program linking, at which time we are guaranteed to have all of the information necessary to resolve shader dependencies in a unified way.

My idea follows this outline:

Create a struct vrend_shader_selector whenever the guest emits a new shader (vrend_decode_create_shader()). However, we no longer kickoff vrend_convert_shader() at this time or even create a new variant yet.
When vrend_draw_vbo() is called to render a frame, follow one of two code-paths:
- a) If we can successfully lookup a bound shader program based on the set of currently bound shader variant uids, ensure the program is active on the host (glUseProgram()).
- b) Whenever necessary (shader program cache miss), lookup shader variants by key or create a new variant and kickoff (semi-)unified tgsi parsing, inter-stage dependency resolution, tgsi to glsl conversion, and shader compilation. Complete by linking into a new shader program and caching the new shader variants and program for later.

The fundamental design decisions I think should be enforced are:

Shader key should be treated as immutable and opaque. As such, it should not be referenced during tgsi parsing and conversion. This eliminates confusion over how shader inter-dependencies are defined, since it reduces the number of possibilities in contrast the the current code.
If possible, TGSI parsing (i.e. shader interpretation) should be separated from glsl string emission. The ideal scenario is explained by the well-defined series of events:
1. vrend_draw_vbo() is called and a new shader program needs to be linked before rendering
2. Each token-set (stored in the bound struct vrend_shader_selector set) is parsed for interpretation, however no glsl is emitted yet.
3. The interpreted shaders are analyzed for inter-dependencies, which are encoded into a new struct vrend_shader_depends object that contains data such as the IO layouts for each stage, and a description of the features to be emulated
4. struct vrend_shader_depends is used to guide glsl string generation for each shader stage. Since dependencies are resolved before any strings are emitted, this should allow us to eliminate the ugly interpolant patching step that currently happens after the main conversion process. I believe this patching step is the primary cause of this duplicate compilation issue.
5. shaders variants are compiled and linked into a new program. struct vrend_shader_depends is finally used to define the contents of each shader variant's key so it can be cached.
In support of the previous point, TGSI parsing and GLSL string emission code in vrend_shader.c should be refactored to clarify the access permissions of struct dump_ctx that are truly necessary at each step of the conversion process. I've begun this process already, and its progress will be tracked in #179 (closed).

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information

Admin message