gl-renderer: Optimize vertex clipping
This merge request proposes a series of commits aimed at reducing the CPU workload of vertex clipping.
It starts by changing the coordinate space in which vertices are clipped and dispatched. Clipping and dispatching in surface space removes a roundtrip to global space for each vertex. The projection matrix now combines the surface-to-global and global-to-output transformations so that the graphics hardware can handle the full transform.
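To illustrate the combined projection, here is a minimal sketch using hypothetical names and plain column-major 4x4 float matrices instead of the libweston types: vertices stay in surface space and a single matrix takes them all the way to output space.

```c
#include <string.h>

/* Column-major 4x4 matrix, as consumed by glUniformMatrix4fv(). */
struct mat4 { float d[16]; };

/* out = a * b (apply b first, then a). */
static void
mat4_multiply(struct mat4 *out, const struct mat4 *a, const struct mat4 *b)
{
	struct mat4 r;

	for (int col = 0; col < 4; col++)
		for (int row = 0; row < 4; row++) {
			float s = 0.0f;

			for (int k = 0; k < 4; k++)
				s += a->d[k * 4 + row] * b->d[col * 4 + k];
			r.d[col * 4 + row] = s;
		}
	memcpy(out, &r, sizeof r);
}

/* Vertices are dispatched in surface space; one combined matrix, uploaded as
 * the projection uniform, performs surface -> global -> output in one step. */
static void
build_projection(struct mat4 *proj,
		 const struct mat4 *global_to_output,
		 const struct mat4 *surface_to_global)
{
	mat4_multiply(proj, global_to_output, surface_to_global);
}
```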
`calculate_edges()` is then split so that dirty rects are transformed and their AABB computed once for each surface rect.
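As a rough sketch of that per-rect step (hypothetical names, plain float matrices rather than the libweston types), the four corners of a dirty rect are transformed once and reduced to an axis-aligned bounding box:

```c
#include <float.h>
#include <math.h>

struct rect { float x1, y1, x2, y2; };

/* Push the four corners of a rect through a column-major 4x4 transform
 * (ignoring perspective) and return their axis-aligned bounding box. */
static struct rect
transformed_aabb(const float m[16], struct rect r)
{
	const float corners[4][2] = {
		{ r.x1, r.y1 }, { r.x2, r.y1 },
		{ r.x2, r.y2 }, { r.x1, r.y2 },
	};
	struct rect aabb = { FLT_MAX, FLT_MAX, -FLT_MAX, -FLT_MAX };

	for (int i = 0; i < 4; i++) {
		float x = m[0] * corners[i][0] + m[4] * corners[i][1] + m[12];
		float y = m[1] * corners[i][0] + m[5] * corners[i][1] + m[13];

		aabb.x1 = fminf(aabb.x1, x);
		aabb.y1 = fminf(aabb.y1, y);
		aabb.x2 = fmaxf(aabb.x2, x);
		aabb.y2 = fmaxf(aabb.y2, y);
	}
	return aabb;
}
```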
The next three commits offload texture coordinate generation to the graphics hardware. The compositor exposes a new normalized surface-to-buffer matrix, which the OpenGL renderer passes to an additional shader variant so that texcoords are generated from the vertices.
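Conceptually, the shader variant derives each texcoord by multiplying the surface-space position with the normalized surface-to-buffer matrix. The equivalent CPU-side computation, as a sketch with hypothetical names:

```c
/* With a normalized surface-to-buffer matrix (column-major, mapping surface
 * coordinates into [0, 1] buffer space) uploaded as a uniform, texcoords no
 * longer need to be computed and streamed per vertex by the CPU. */
static void
surface_pos_to_texcoord(const float m[16], /* normalized surface-to-buffer */
			float sx, float sy, /* surface-space vertex */
			float *u, float *v) /* resulting texcoord */
{
	*u = m[0] * sx + m[4] * sy + m[12];
	*v = m[1] * sx + m[5] * sy + m[13];
}
```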
Since positions are sent to the graphics hardware as single-precision floating-point values, the clipping functions are then moved back to single precision in order to avoid type conversions and to improve code generation by letting compilers fit twice as much data into vector registers. This also lets the functions store generated vertices directly into the vertex buffer.
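For reference, a single-precision clip stage could look roughly like the following Sutherland-Hodgman-style sketch (hypothetical names, not the actual Weston functions), writing straight into the caller's float buffer:

```c
#include <stdbool.h>

struct vec2 { float x, y; };

/* Clip a convex polygon of n vertices against the x >= x_min edge, writing
 * the result directly into out (which must have room for n + 1 vertices).
 * Returns the number of vertices written. */
static int
clip_left(const struct vec2 *in, int n, float x_min, struct vec2 *out)
{
	int m = 0;

	for (int i = 0; i < n; i++) {
		const struct vec2 *cur = &in[i];
		const struct vec2 *nxt = &in[(i + 1) % n];
		bool cur_in = cur->x >= x_min;
		bool nxt_in = nxt->x >= x_min;

		if (cur_in)
			out[m++] = *cur;
		if (cur_in != nxt_in) {
			/* Intersection with the clip edge. */
			float t = (x_min - cur->x) / (nxt->x - cur->x);

			out[m].x = x_min;
			out[m].y = cur->y + t * (nxt->y - cur->y);
			m++;
		}
	}
	return m;
}
```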
The last commits check the node transform type so that the clipper takes the simple path when the transform is only a translation and/or scaling and, finally, replace the bounding box check in the simple clipping path with a more efficient check performed after clipping.
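As a sketch of the transform check (hypothetical; libweston may track the transform type differently than by inspecting the matrix), a transform that only translates and/or scales keeps rects axis-aligned, which is what makes the simple path valid:

```c
#include <stdbool.h>

/* A column-major 4x4 transform is a pure translation and/or scaling when all
 * rotation, shear and projection terms are zero; rects then stay axis-aligned
 * and the clipper can use the simple rect-against-rect path. */
static bool
transform_is_translation_or_scale(const float m[16])
{
	return m[1] == 0.0f && m[2] == 0.0f && m[3] == 0.0f &&
	       m[4] == 0.0f && m[6] == 0.0f && m[7] == 0.0f &&
	       m[8] == 0.0f && m[9] == 0.0f && m[11] == 0.0f &&
	       m[15] == 1.0f;
}
```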
On the CPU side, on a Core i5-7200U with a debug-optimized build and gcc-10.2.1, `texture_region()` is ~2x to ~5x faster. On the GPU side, on an integrated HD Graphics 620, using profiling patch !1113 (merged), the performance hit is around 5%, on the order of microseconds. The instrumented layouts were very basic: 1 to 4 `weston-simple-egl` and `weston-subsurfaces` instances randomly placed over a `weston-terminal`. More complex layouts would likely see bigger gains.