Performance of siprtmp: multitask vs gevent

Poor performance has been an issue in my RTMP server and SIP-RTMP gateway. Traditionally, I blamed the multitask framework for the poor performance. In this article I present my measurement results as well as introduce an alternative gevent-based implementation to improve the performance.

There are several performance aspects of this software, e.g., CPU utilization per call or session, memory usage, bandwidth requirement, etc. This article only focuses on the CPU performance. Moreover, I only consider the steady state CPU usage to measure the number of active simultaneous calls through the gateway. The CPU usage during call setup and termination is not considered.

The conclusion of my measurement is as follows. The SIP-RTMP gateway software using gevent takes about 2/3 the CPU cycles than using multitask, and the RTMP server software using gevent takes about 1/2 the CPU cycles than using multitask. After the improvements, on a dual-core 2.13 GHz CPU machine, a single audio call going though gevent-based siprtmp using Speex audio codec at 8Hz sampling takes about 3.1% CPU, and hence in theory can support about 60 active calls in steady state. Another way to look at it is that the software requires CPU cycles of about 66 MHz per audio call.

The gevent-based software is also available under the same license for you to try out. The next step to further improve the performance is to move part of the media processing of siprtmp to an external C/C++ extension module.

Background

Traditionally, I have used the multitask framework for co-operative multitasking in my Python software including p2p-sip, rtmplite and siprtmp. In the past, people have complained about high CPU utilization in siprtmp for a single call or even with no call. Part of the discussion is documented in issue 31. It turned out that the no-call CPU usage was a bug, and that we could optimize the multitask framework to improve the performance by approximately 2x. The optimization alters the way in which the multitask framework looks for io-events and more tasks. In particular, it gives more preference to tasks than to io-events, hence if a single io-event generates multiple tasks, all of them run before waiting for next io-events. These optimizations and fixes are in SVN r60 and r68. Unfortunately, these optimizations are not enough.

To further improve the performance, I looked at the built-in asyncore module of Python and re-implemented rtmp.py to use asyncore. There was significant improvement of approximately another 1.5x to 2x. Unfortunately, getting timers to work with asyncore is not trivial. Hence I couldn't implement siprtmp easily as the SIP/RTP library relies heavily on timers.

Then I looked at the gevent project, thanks to a co-worker for recommending it. It supports co-routine based co-operative multitasking by modifying the existing blocking modules such as socket. Compared to the multitask framework, the source code using gevent is more readable and easy to maintain because it works behind the scene. Unlike this, the multitask framework requires yield statements scattered everywhere and non-trivial StopIteration exception to return from a task. I re-implemented siprtmp.py, and related SIP/RTP modules, using gevent. Since siprtmp module includes all of rtmp module, this can also be used as an RTMP server in addition to being a SIP-RTMP gateway.

Test Setup

All my tests were done on my MacBook laptop, 2.13 GHz Intel core 2 duo, 2GB memory, and running Mac OS X 10.5.6. I used Python 2.7 for server side components and flash debug player version MAC 10,0,45,2 (how to find?). I used X-lite version 3 as a standard SIP client. The debug trace on the server was disabled, by not supplying any -d option. All my clients and server ran locally on my local host hence bandwidth was not an issue. I used Mac's Activity Monitor to measure the CPU usage.

Measurement Result
The main metric is the CPU usage in percentage as reported by the Activity Monitor. There are several parameters that were altered and the effects were measured.

The siprtmp performance was measured for an audio call between a web-based VideoPhone sample application available as part of the siprtmp software, and the third-party X-Lite application. The sampling rate of the Speex audio codec can be 8kHz or 16kHz. The larger the sampling rate, the larger the encoded packet is. The CPU usage increases with higher sampling rate. Note that there is no transcoding in siprtmp. The following table shows the percentage CPU usage for siprtmp using multitask and gevent, and for the two sampling rates.

Ratemultitaskgevent
8 kHz4.8-5.1%3.1-3.2%
16 kHz6.2-6.5%4.0-4.1%

Base on these, we can conclude that the gevent-based SIP-RTMP gateway takes about 2/3 the CPU compared to multitask-based gateway. Roughly, the gevent-based gateway takes about 66 MHz/audio-call of the CPU cycles in steady state.

The rtmp performance was measured using one publisher and zero or more players. The CPU usage increases with the number of players. Typically, audio only session gives less variance in the CPU usage, whereas if video is included then depending on the amount of movement or image details the packet size changes, and so does the CPU usage. I used the Flash VideoIO project's test page to perform the tests. If video is present, then Flash Player's camera capture uses these properties: cameraQuality=80, cameraWidth=320, cameraHeight=240, cameraFPS=12. Audio is always Speex 16 kHz with encodeQuality=6. The following tables shows the percentage CPU usage using multitask and gevent, with one publisher and different number of players, and with or without video. If the variance is small, only the average is reported, whereas if the variance is large the range is listed.

Media#playersmultitaskgevent
Audio02.2%1.3%
Audio13.5%1.8%
Audio24.5%2.1%
Audio35.5%2.5%
Audio+Video03.0-3.9%1.4-1.7%
Audio+Video14.2-4.7%2.1%
Audio+Video25.5-6.3%2.7%
Audio+Video37.0-7.6%3.1%

Based on these, we can conclude that gevent-based software takes less than 1/2 the CPU than the multitask-based software for RTMP streaming. Roughly, the gevent-based server takes 34 MHz/publisher and 12 MHz/player of the CPU cycles in steady state.