ts: implementations evaluation (!967) · Merge requests · GStreamer / gst-plugins-rs

François Laignel requested to merge fengalin/gst-plugins-rs:ts-udpsink-task-removal into main Nov 06, 2022

Testing threadshare variants

`ts-standalone`

ts-standalone is a tool designed to test the performance of the threadshare runtime and the implementation variants.

The tool instantiates a pipeline with S sub-pipelines, each consisting of a source and a sink. Each source pushes buffers at a fixed interval and the matching sink receives the buffers and logs statistics such as the actual interval between buffers and the duration between a buffer's generation and its reception. This last measure will be referred to as the latency in the rest of this description.

When compiled with the tuning feature, the statistics also include the total duration the threadshare scheduler has spent parked. Considered as a percent of the processing duration, the parked duration gives a good approximation of 100% - CPU usage.
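As a small illustration of that approximation, the sketch below derives a CPU usage estimate from a parked duration and a processing duration. The function name and its signature are hypothetical and are not taken from ts-standalone.

```rust
use std::time::Duration;

/// Approximates CPU usage (in percent) as `100% - parked%`.
///
/// `parked` is the total time the scheduler spent parked and `processing`
/// is the total duration over which statistics were gathered.
/// Returns `None` when the inputs cannot yield a meaningful ratio.
fn approx_cpu_usage_percent(parked: Duration, processing: Duration) -> Option<f64> {
    if processing.is_zero() || parked > processing {
        return None;
    }
    let parked_pct = 100.0 * parked.as_secs_f64() / processing.as_secs_f64();
    Some(100.0 - parked_pct)
}
```

For instance, a scheduler parked for 75 s out of 100 s of processing would give an estimated CPU usage of 25%.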

Statistics collection starts after a ramp-up period and stops early enough to avoid transient effects.

ts-standalone already implements a sink for which the PadSinkHandler passes buffers to a ts runtime Task via an asynchronous channel. This design is convenient from a developer's perspective as they can take advantage of the TaskImpl functions accepting &mut self, which means no synchronization primitives are required in the hot path.
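The ownership aspect of this design can be sketched with the standard library alone. This is only an analogy: a plain thread and `std::sync::mpsc` stand in for the ts runtime Task and its asynchronous channel, and `Stats` and `run_channel_sink` are hypothetical names. The point is that the consuming loop owns its state and can update it through `&mut self` without any lock.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical per-sink state, owned exclusively by the consuming loop.
#[derive(Default)]
struct Stats {
    buffers: u64,
    bytes: u64,
}

impl Stats {
    /// Takes `&mut self`: no lock is needed since only the loop touches it.
    fn handle_buffer(&mut self, buffer: &[u8]) {
        self.buffers += 1;
        self.bytes += buffer.len() as u64;
    }
}

/// Pushes `count` buffers of `size` bytes through a channel to a consumer loop
/// and returns the statistics gathered by that loop.
fn run_channel_sink(count: usize, size: usize) -> Stats {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    let consumer = thread::spawn(move || {
        let mut stats = Stats::default();
        // The loop ends once the sending side has been dropped.
        for buffer in rx {
            stats.handle_buffer(&buffer);
        }
        stats
    });

    for _ in 0..count {
        tx.send(vec![0u8; size]).expect("consumer stopped early");
    }
    // Closing the channel lets the consumer loop terminate.
    drop(tx);

    consumer.join().expect("consumer panicked")
}
```

Every buffer crosses the channel before reaching `handle_buffer`, which is the hop the rest of this description tries to evaluate.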

However, one drawback of using a Task is that fields which are necessary to the Task loop implementation are not visible to the element, unless they are shared, e.g. as an Arc<Mutex<_>>. Another drawback is that buffers need to go through a channel, which may impact performance. The Task loop implementation itself relies on trait functions. As of rustc 1.65.0, async trait functions are not available, forcing the functions to return boxed Futures, which comes with heap allocation overhead and increased cache misses.
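To make the boxed Future point concrete, here is a minimal sketch of a trait function which returns a boxed Future. `LoopStep`, `Counter`, `block_on` and `sum_items` are hypothetical and are not the runtime's actual API; the tiny executor only suits futures that never wait on anything.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical loop trait: each call returns a heap-allocated future.
trait LoopStep {
    fn handle_item<'a>(&'a mut self, item: u32) -> Pin<Box<dyn Future<Output = ()> + Send + 'a>>;
}

struct Counter {
    sum: u32,
}

impl LoopStep for Counter {
    fn handle_item<'a>(&'a mut self, item: u32) -> Pin<Box<dyn Future<Output = ()> + Send + 'a>> {
        // `Box::pin` allocates on every call.
        Box::pin(async move {
            self.sum += item;
        })
    }
}

/// Waker that does nothing, sufficient for futures that are always ready.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal executor: polls in a loop until the future completes.
fn block_on<F: Future + ?Sized>(mut fut: Pin<Box<F>>) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

/// Drives one boxed future per item and returns the accumulated sum.
fn sum_items(items: &[u32]) -> u32 {
    let mut counter = Counter { sum: 0 };
    for &item in items {
        block_on(counter.handle_item(item));
    }
    counter.sum
}
```

Here `sum_items(&[1, 2, 3])` performs three heap allocations, one per item handled.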

`ts-udpsink`

ts-udpsink used to be based on the Task design. In this element, the sockets' clients can be listed or modified via the element's properties. A command-oriented channel was used to update the clients list alongside the usual buffer handling. While it works in practice, this design adds a bit of complexity.

Extending `ts-standalone`

In this MR, I decided to evaluate the impact of using a Task in a ts-sink element. To this end, I implemented 2 additional sink elements which handle the buffers and log statistics in the PadSinkHandler:

One sink uses the sync Mutex from std. Note that this Mutex cannot be kept locked across await points, which is a no-go for some implementations.
The other uses the async Mutex from the futures crate.

I also tested the Mutex from the async-lock crate, but my tests didn't show significant improvements over the one from the futures crate.
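The sync Mutex variant can be pictured with the following sketch, where the handler only has access to `&self` and the statistics sit behind a `std::sync::Mutex`. `MutexSink` and `Stats` are hypothetical names and the statistics are reduced to the mean interval between buffers. With the async Mutex from the futures crate, the lock would be acquired with `lock().await` instead and could be held across await points.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

#[derive(Default)]
struct Stats {
    last_arrival: Option<Instant>,
    intervals: u32,
    interval_sum: Duration,
}

/// Hypothetical sink: the handler only gets `&self`, so state sits behind a Mutex.
#[derive(Default)]
struct MutexSink {
    stats: Mutex<Stats>,
}

impl MutexSink {
    /// Records the arrival of a buffer.
    fn handle_buffer(&self, arrival: Instant) {
        // The guard is dropped when this function returns, so the lock is
        // never held across an await point of the caller.
        let mut stats = self.stats.lock().unwrap();
        let last_arrival = stats.last_arrival;
        if let Some(last) = last_arrival {
            stats.interval_sum += arrival.duration_since(last);
            stats.intervals += 1;
        }
        stats.last_arrival = Some(arrival);
    }

    /// Mean interval between buffers, if at least two buffers were seen.
    fn mean_interval(&self) -> Option<Duration> {
        let stats = self.stats.lock().unwrap();
        if stats.intervals == 0 {
            None
        } else {
            Some(stats.interval_sum / stats.intervals)
        }
    }
}
```

Compared to a Task-based design, the buffer is handled directly in the handler and no channel is involved.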

The diagrams below compare the 2 new Mutex-based sinks to the Task sink.

According to these tests, the difference in parked duration between the sync and async Mutex is marginal.

As the number of streams increases, the parked duration for the Task variant drops significantly compared to the Mutex variants. Note however that the sink elements perform no async I/O or timer operations, and that the statistics are computed and logged by only one element regardless of the number of streams. See below for tests on a use case closer to reality. I believe this diagram is still valuable as an evaluation of the intrinsic impact of the implementation.

Regardless of the number of streams, the latency is around 1.5µs for both Mutex variants, against about 5ms for the Task variant.

There might be better mechanisms for passing buffers than the async channel, since we only need to pass one item at a time. ts-standalone should make it easy to experiment with alternative solutions to improve CPU usage and hopefully reduce latency. Meanwhile, using a Mutex variant when applicable seems like a better choice.

Migrating `ts-udpsink` back to a `Mutex` variant

Based on the above results, I decided to migrate ts-udpsink back to a Mutex variant. For this element, we have no other choice but to use an async Mutex since we need to hold the lock while awaiting the sockets sending the buffers.

I decided to create a model similar to ts-standalone using the existing udpsrc-benchmark-sender and the benchmark receiver under the ts examples. I also wanted to make sure the operating point was properly chosen for real-life data. For this reason, I created the ts-audiotestsrc element, so that we can listen to the buffers sent by ts-udpsink. I settled on a buffer duration which plays well with this model (10ms of mono raw audio, signed 16-bit samples @44100 samples/s).
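For reference, the payload size implied by this operating point can be worked out as follows. The helper is a hypothetical illustration and is not part of ts-audiotestsrc.

```rust
/// Number of bytes in a raw audio buffer of the given duration.
fn buffer_size_bytes(rate_hz: u64, channels: u64, bytes_per_sample: u64, duration_ms: u64) -> u64 {
    // Samples per channel in the buffer: rate * duration.
    let samples_per_channel = rate_hz * duration_ms / 1000;
    samples_per_channel * channels * bytes_per_sample
}
```

With 44100 samples/s, 1 channel, 2 bytes per sample and 10ms per buffer, this gives 441 samples, i.e. 882 bytes per buffer, at 100 buffers per second and per stream.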

Here, the overhead of the Task model still stands out, though it is not as pronounced as in the ts-standalone case. I believe this is due to the fact that the udp-sender has significant processing to perform, which dilutes the overhead of the runtime itself.

Additional details

For each test case, the tool is invoked 10 times so as to account for variations in memory layout and to even out other external influences. Each iteration runs about 2 min worth of buffers.

Specs

CPU: Core i5-7200U (2 physical cores)
Performance governor
Mem: 12 GB
Linux kernel: 5.19.16-301.fc37.x86_64
Airplane mode activated
ulimit -Sn 30000
rustc 1.65.0 (897e37553 2022-11-02)
gst-plugins-rs base @ 7ac29827
gstreamer-rs @ efb85f416e0edb44e6543e0123d5706de43c529b

Commands

`ts-standalone`

export GST_DEBUG=ts-standalone*:4
export GST_DEBUG_NO_COLOR=1
export BUFFERS=5000
export STREAMS=7000

export FILE=~/temp/ts-standalone.${STREAMS}x${BUFFERS}.log; mv ${FILE} ${FILE}.old; for SINK in sync-mutex async-mutex task; do echo "-----------"; echo $SINK; echo "-----------"; for i in {1..10}; do echo $i; target/release/examples/ts-standalone -n $BUFFERS -s $STREAMS --sink $SINK >>${FILE} 2>&1; done; done;

A log parser then extracts the statistics and generates a csv file.

udp benchmark

export GST_DEBUG=ts-audiotestsrc:4
export GST_DEBUG_NO_COLOR=1
export BUFFERS=10000

export FILE=~/temp/ts-udpsender.${BUFFERS}.log; mv ${FILE} ${FILE}.old; for STREAMS in 100 200 300 400 500; do echo "-----------"; echo $STREAMS; echo "-----------"; echo "-----------" >> ${FILE}; echo $STREAMS >> ${FILE}; echo "-----------" >> ${FILE}; for i in {1..10}; do echo $i; echo $i >> ${FILE}; target/release/examples/udpsrc-benchmark-sender ${STREAMS} test ${BUFFERS} >>${FILE} 2>&1; done; done;

Buffers are consumed in another terminal:

target/release/examples/benchmark 700 ts-udpsrc 1 20

Edited Nov 07, 2022 by François Laignel
