Understanding RTMFP Handshake

(Disclaimer: the protocol description here is not an official specification of RTMFP but just the protocol understanding based on the OpenRTMFP's Cumulus project as well as the IETF presentation slides.)

Introduction

What is RTMFP? RTMFP (Real-time Media Flow Protocol) provides a UDP-based low-latency end-to-end media path between two Flash Player instances. Compared to the earlier RTMP-based media path, which runs over TCP, this new protocol enables actual real-time communication on the web. Although an end-to-end media path is not always possible when certain types of NATs and firewalls are present, it is possible across most residential-type NATs. The end-to-end media path between two Flash Player instances reduces latency and improves the scalability of the service (or server infrastructure), since most of the heavy media traffic can be sent without going through the hosted server. The UDP transport reduces latency compared to TCP transport even when the media path is client-server.

Why is understanding RTMFP important? Unlike the earlier RTMP, the new protocol RTMFP is still closed, with no open specification available. There have been some attempts at reverse engineering the protocol for interoperability, and some official slides explain the core logic. Understanding the wire protocol is not important if you are building Flash-based applications that only talk to each other. However, for applications such as a Flash-to-SIP gateway or a Flash-to-RTSP translator, where you may need to interoperate between RTMFP and SIP/RTP, it is important to understand the wire protocol in detail. For a Flash-to-SIP gateway, incorporating RTMFP on the Flash side in addition to the existing RTMP enables a low-latency UDP media path between the web user and the translator service on the Internet.

The following description is reproduced from a contribution (see rtmfp.py) to my RTMP server project.

Session

An RTMFP session is an end-to-end bi-directional pipe between two UDP transport addresses. A transport address contains an IP address and port number, e.g., "192.1.2.3:1935". A session can have one or more flows where a flow is a logical path from one entity to another via zero or more intermediate entities. UDP packets containing encrypted RTMFP data are exchanged in a session. A packet contains one or more messages. A packet is always encrypted using AES with 128-bit keys.

In the protocol description below, all numbers are in network byte order (big-endian). The | operator indicates concatenation of data. The numbers are assumed to be unsigned unless mentioned explicitly.

Scrambled Session ID

The packet format is as follows. Each packet has the first 32 bits of scrambled session-id followed by the encrypted part. The scrambled (instead of raw) session-id makes it difficult, if not impossible, for middle boxes such as NATs and layer-4 packet inspectors to mangle packets. The bit-wise XOR operator is used to scramble the first 32-bit number with the subsequent two 32-bit numbers. Using XOR also makes it easy to unscramble.
packet := scrambled-session-id | encrypted-part

To scramble a session-id,
scrambled-session-id = a^b^c

where ^ is the bit-wise XOR operator, a is session-id, and b and c are two 32-bit numbers from the first 8 bytes of the encrypted-part.

To unscramble,
session-id = x^y^z

where x is the scrambled-session-id, and y and z are the two 32-bit numbers from the first 8 bytes of the encrypted-part.
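
For illustration, a small Python sketch of the scrambling and unscrambling using the struct module; the helper names are mine and not taken from rtmfp.py:

import struct

def scramble(session_id, encrypted_part):
    # XOR the raw session-id with the first two 32-bit words of the encrypted part.
    b, c = struct.unpack('>II', encrypted_part[:8])
    return struct.pack('>I', session_id ^ b ^ c) + encrypted_part

def unscramble(packet):
    # Recover the raw session-id from the first 4 bytes of the packet.
    x, = struct.unpack('>I', packet[:4])
    encrypted_part = packet[4:]
    y, z = struct.unpack('>II', encrypted_part[:8])
    return x ^ y ^ z, encrypted_part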

The session-id determines which session keys are used for encryption and decryption of the encrypted part. There is one exception: the fourth message of the handshake contains a non-zero session-id, but the handshake (symmetric) session key is still used for encryption/decryption. For the handshake messages, symmetric AES (Advanced Encryption Standard) with the 128-bit (16 byte) key "Adobe Systems 02" (without the quotes) is used. For subsequent in-session messages, the established asymmetric session keys are used, as described later.

Encryption

Assuming that the AES keys are known, encryption and decryption of the encrypted-part are done as follows. For decryption, an initialization vector of all zeros is used for every decryption operation. For encryption, the raw-part is assumed to be padded as described later, and an initialization vector of all zeros is used for every encryption operation. The decryption operation does not add any padding, and the byte sizes of the encrypted-part and the raw-part must be the same.

The decrypted raw-part format is as follows. It starts with a 16-bit checksum, followed by a variable number of bytes of network-layer data, followed by padding. Decoders of the network-layer data simply ignore the trailing padding.
raw-part := checksum | network-layer-data | padding


The padding is a sequence of zero or more bytes where each byte is \xff. Since AES uses a 128-bit (16 byte) key, the padding ensures that the size in bytes of the decrypted raw-part is a multiple of 16. Thus, the size of the padding is always less than 16 bytes and is calculated as follows:
len(padding) = 16*N - len(network-layer-data) - 2

where the 2 accounts for the 16-bit checksum and N is the smallest positive number that gives 0 <= len(padding) < 16.

For example, if the network-layer-data is 84 bytes, then the padding is 16*6-84-2=10 bytes. Adding 10 bytes of padding makes the decrypted raw-part (2-byte checksum + 84 + 10) 96 bytes, which is a multiple of 16 and hence works with AES using a 128-bit key.
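
A minimal Python sketch of this padding calculation, under the assumption above that the 2-byte checksum counts toward the 16-byte AES block alignment; the helper names are illustrative, not taken from rtmfp.py:

def padding_length(network_layer_data_size):
    # The raw part is checksum (2 bytes) + network-layer data + padding,
    # and must be a multiple of the 16-byte AES block size.
    return (16 - (2 + network_layer_data_size) % 16) % 16

def pad(network_layer_data):
    # Padding bytes are always \xff.
    return network_layer_data + b'\xff' * padding_length(len(network_layer_data))

# Example from the text: 84 bytes of network-layer data need 10 bytes of padding,
# giving a 96-byte raw part (2 + 84 + 10).
assert padding_length(84) == 10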

Checksum

The checksum is calculated over the concatenation of network-layer-data and padding. Thus, in the encoding direction you apply the padding, calculate the checksum, and then AES-encrypt; in the decoding direction you AES-decrypt, verify the checksum, and then remove the padding if needed. Usually removing the padding is not needed, because network-layer data decoders will ignore the trailing data anyway.

The 16-bit checksum is calculated as follows. The concatenation of network-layer-data and padding is treated as a sequence of 16-bit numbers. If the size in bytes is odd, i.e., not divisible by 2, then the last byte is used as the least-significant byte of the final 16-bit number (weird!). All the 16-bit numbers are added into a 32-bit sum. The most-significant 16 bits and the least-significant 16 bits of this sum are then added together, and any carry from that addition is added back in. The least-significant 16 bits of the result are used as the checksum.
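
A Python 3 sketch of this checksum under the interpretation above (sum the 16-bit words, then fold the carries back into 16 bits); the exact fold order is my reading of the description rather than a verified reimplementation:

import struct

def checksum(data):
    # Treat data as big-endian 16-bit words; an odd trailing byte is used
    # as the least-significant byte of the last word.
    if len(data) % 2:
        data, last = data[:-1], data[-1]
    else:
        last = None
    total = sum(struct.unpack('>%dH' % (len(data) // 2), data))
    if last is not None:
        total += last
    # Fold the 32-bit sum into 16 bits by adding the carries back in.
    total = (total >> 16) + (total & 0xffff)
    total += total >> 16
    return total & 0xffff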

Network Layer Data

The network-layer data contains flags, optional timestamp, optional timestamp echo and one or more chunks.
network-layer-data = flags | timestamp | timestamp-echo | chunks ...

The flags value is a single byte containing the following information: time-critical forward notification, time-critical reverse notification, whether a timestamp is present, whether a timestamp-echo is present, and an initiator/responder marker. The initiator/responder marker is useful when the symmetric (handshake) session keys are used for AES, because it protects against packets being looped back to the sender.

The bit format of the flags is not completely clear, but the following applies. For the handshake messages, the flags byte is \x0b. When the flags' least-significant 4 bits are 1101b, the timestamp-echo is present. The timestamp seems to be always present. For in-session messages from Flash Player to the server, the least-significant 4 bits are either 1101b or 1001b.
--------------------------------------------------------------------
flags meaning
--------------------------------------------------------------------
0000 1011 setup/handshake
0100 1010 in-session no timestamp-echo (server to Flash Player)
0100 1110 in-session with timestamp-echo (server to Flash Player)
xxxx 1001 in-session no timestamp-echo (Flash Player to server)
xxxx 1101 in-session with timestamp-echo (Flash Player to server)
--------------------------------------------------------------------

TODO: it looks like bit \x04 indicates whether the timestamp-echo is present, and probably \x80 indicates whether the timestamp is present. The last two bits being 11b indicates handshake, 10b indicates server to client, and 01b indicates client to server.

The timestamp is a 16-bit number that represents time with a 4-millisecond clock. The wall clock time can be used to generate this timestamp value. For example, if the current time in seconds is tm = 1319571285.9947701, then the timestamp is calculated as follows:
int(tm * 1000/4) & 0xffff = 46586
i.e., assuming a 4-millisecond clock, calculate the clock units and use the least-significant 16 bits.

The timestamp-echo is just the timestamp value that was received in the incoming packet, echoed back. The timestamp and its echo allow the system to calculate the round-trip time (RTT) and keep it up to date.
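
For illustration, a small Python sketch of generating the 4-millisecond timestamp and estimating the RTT from the echo (modulo 2^16); the helper names are illustrative:

import time

def current_ticks():
    # 16-bit timestamp in 4-millisecond units.
    return int(time.time() * 1000 / 4) & 0xffff

def rtt_ms(timestamp_echo):
    # Elapsed ticks since the echoed timestamp, modulo 2^16, converted to milliseconds.
    return ((current_ticks() - timestamp_echo) & 0xffff) * 4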

Each chunk starts with an 8-bit type, followed by the 16-bit size of the payload, followed by the payload of that many bytes. Note that \xff is reserved and not used as a chunk type. This is useful in detecting when the network-layer-data has finished and the padding has started, because the padding uses \xff. Alternatively, \x00 can also be used for padding, as that is a reserved type too!
chunk = type | size | payload
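
A Python sketch of splitting the network-layer data into chunks based on the layout above, stopping at the reserved \xff (or \x00) padding bytes; the helper name and the assumption that the flags, timestamp and timestamp-echo have already been stripped are mine:

import struct

def split_chunks(data):
    # data is the network-layer data after flags, timestamp and timestamp-echo.
    chunks, pos = [], 0
    while pos < len(data):
        chunk_type = data[pos]
        if chunk_type in (0xff, 0x00):   # reserved types mark the start of padding
            break
        size, = struct.unpack('>H', data[pos + 1:pos + 3])
        chunks.append((chunk_type, data[pos + 3:pos + 3 + size]))
        pos += 3 + size
    return chunks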


Message Flow

There are three types of session messages: session setup, control and flows. The session setup is part of the four-way handshake whereas control and flows are in-session messages. The session setup contains initiator hello, responder hello, initiator initial keying, responder initial keying, responder hello cookie change and responder redirect. The control messages are ping, ping reply, re-keying initiate, re-keying response, close, close acknowledge, forwarded initiator hello. The flow messages are user data, next user data, buffer probe, user data ack (bitmap), user data ack (ranges) and flow exception report.

A new session starts with a handshake for the session setup. In the normal client-server case, the message flow is as follows:
 initiator (client)                target (server)
|-------initiator hello---------->|
|<------responder hello-----------|


In the peer-to-peer session setup case for NAT traversal, the server acts as a forwarder and forwards the hello to the other connected client as follows:
 initiator (client)     forwarder (server)                   target (client)
|-------initiator hello--------->|                                          |
|                                |--------forwarded initiator hello-------->|
|                                |<-----------------ack---------------------|
|<----------------------------responder hello------------------------------|


Alternatively, the server could redirect to another target by supplying an alternative list of target addresses as follows:
 initiator (client)                redirector (server)                     target (client)
|-------initiator hello---------->|
|<------responder redirect--------|
|-------------initiator hello-------------------------------------------->|
|<------------responder hello---------------------------------------------|


Note that the initiator, target, forwarder and redirector are just roles for the session setup, whereas client and server are specific implementations such as Flash Player and Flash Media Server, respectively. Even a server may send an initiator hello to a client, in which case the server becomes the initiator and the client becomes the target for that session. This mechanism is used for the man-in-the-middle mode in the Cumulus project.

The initiator hello may be forwarded to another target, but the responder hello is sent directly. After that, the initiator initial keying and the responder initial keying are exchanged (directly between the initiator and the target) to establish the session keys for the session between them. The four-way handshake prevents denial-of-service (DoS) attacks via SYN flooding and port scanning.

As mentioned before, the handshake messages for session setup use the symmetric AES key "Adobe Systems 02" (without the quotes), whereas in-session messages use the established asymmetric AES keys. Intuitively, the session setup is sent over a pre-established AES cryptosystem and creates a new asymmetric AES cryptosystem for the new session. Note that a session-id is established for the new session during the initial keying process; hence the first three messages (initiator-hello, responder-hello and initiator-initial-keying) use a session-id of 0, and the last message, responder-initial-keying, uses the session-id sent by the initiator in the previous message. This is further explained later.

Message Types

The 8-bit type values and their meaning are shown below.
---------------------------------
type meaning
---------------------------------
\x30 initiator hello
\x70 responder hello
\x38 initiator initial keying
\x78 responder initial keying
\x0f forwarded initiator hello
\x71 forwarded hello response

\x10 normal user data
\x11 next user data
\x0c session failed on client side
\x4c session died
\x01 causes response with \x41, reset keep alive
\x41 reset times keep alive
\x5e negative ack
\x51 some ack
---------------------------------
TODO: most likely the bit \x01 indicates whether the transport-address is present or not.

The contents of the various message payloads are described below.

Variable Length Data

The protocol uses variable length data and variable length number. Any variable length data is usually prefixed by its size-in-bytes encoded as a variable length number. A variable length number is an unsigned 28-bit number that is encoded in 1 to 4 bytes depending on its value. To get the bit-representation, first assume the number to be composed of four 7-bit numbers as follows
number = 0000dddd dddccccc ccbbbbbb baaaaaaa (in binary)
where A=aaaaaaa, B=bbbbbbb, C=ccccccc, D=ddddddd are the four 7-bit numbers

The variable length number representation is as follows:
0aaaaaaa (1 byte)  if B = C = D = 0
1bbbbbbb 0aaaaaaa (2 bytes) if C = D = 0 and B != 0
1ccccccc 1bbbbbbb 0aaaaaaa (3 bytes) if D = 0 and C != 0
1ddddddd 1ccccccc 1bbbbbbb 0aaaaaaa (4 bytes) if D != 0


In each byte, the most-significant bit is 1 if more bytes follow and 0 in the last byte, and the remaining 7 bits carry the value. Thus a 28-bit number is represented as 1 to 4 bytes of variable length number. This mechanism saves bandwidth, since most numbers are small and fit in 1 or 2 bytes, while still allowing values up to 2^28-1.
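
A Python sketch of this 7-bits-per-byte encoding, assuming (as noted above) that every byte except the last has its most-significant bit set; the helper names are illustrative:

def encode_vlu(value):
    # Encode an unsigned number (up to 28 bits here) as a variable length number.
    bytes_out = [value & 0x7f]
    value >>= 7
    while value:
        bytes_out.append(0x80 | (value & 0x7f))   # continuation bytes have the MSB set
        value >>= 7
    return bytes(reversed(bytes_out))

def decode_vlu(data, pos=0):
    # Return (value, new position); stops at the first byte with the MSB clear.
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7f)
        if not byte & 0x80:
            return value, pos

assert encode_vlu(0x3fff) == b'\xff\x7f'
assert decode_vlu(b'\xff\x7f') == (0x3fff, 2)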

Handshake

The initiator-hello payload contains an endpoint discriminator (EPD) and a tag. The payload format is as follows:
initiator-hello payload = first | epd | tag

The first field (8 bits) is unknown. The next field, epd, is variable length data that contains an epd-type (8 bits) and an epd-value (the remaining bytes). Note that any variable length data is prefixed by its length as a variable length number. The epd is typically less than 127 bytes, so an 8-bit length is enough. The tag is a fixed 128-bit (16 byte) randomly generated value. The fixed-size tag does not encode its length.
epd = epd-type | epd-value

The epd-type is \x0a for a client-server session and \x0f for a peer-to-peer session. If the epd-type is peer-to-peer, then the epd-value is the peer-id, whereas if the epd-type is client-server, the epd-value is the RTMFP URL to which the client connects. The initiator sets the epd-value such that the responder can tell whether the initiator-hello is for it, but an eavesdropper cannot deduce the identity from the epd. This is done, for example, using a one-way hash function of the identity.

The tag is chosen randomly by the initiator, so that it can match the response against the pending session setup. Once the setup is complete the tag can be forgotten.

When the target receives the initiator-hello, it checks whether the epd is for this endpoint. If it is for another endpoint, the initiator-hello is silently discarded to avoid port scanning. If the target is an introducer (server), then it can respond with a responder-hello, or redirect/proxy the message with a forwarded-initiator-hello to the actual target. In the general case, the target responds with a responder-hello.

The responder-hello payload contains the tag echo, a new cookie and the responder certificate. The payload format is as follows:
responder-hello payload = tag-echo | cookie | responder-certificate

The tag echo is the same as the original tag from the initiator-hello, but encoded as variable length data with a variable length size. Since the tag is 16 bytes, its size fits in 8 bits.

The cookie is randomly and statelessly generated variable length data that the responder uses to accept the next message only if the previous message was actually received by the initiator. This eliminates "SYN flood" style attacks: if a server had to store the initial state, an attacker could exhaust the state memory by flooding it with bogus initiator-hello messages and prevent further legitimate initiator-hello messages. SYN flooding attacks are common against TCP servers. The length of the cookie is 64 bytes, but it is stored as variable length data.

The responder certificate is also a variable length data containing some opaque data that is understood by the higher level crypto system of the application. In this application, it uses the diffie-hellman (DH) secure key exchange as the crypto system.

Note that multiple EPDs might map to a single endpoint, and the endpoint has a single certificate. A server that does not care about man-in-the-middle attacks or does not create a secure EPD can generate a random certificate to be returned as the responder certificate.
certificate = \x01\x0A\x41\x0E | dh-public-num | \x02\x15\x02\x02\x15\x05\x02\x15\x0E

Here the dh-public-num is a 64-byte random number used for DH secure key exchange.

The initiator does not open another session to the same target identified by the responder certificate. If it detects that it already has an open session with the target, it moves the new flow requests to the existing open session and stops opening the new session. The responder has not stored any state, so it does not need to care. (In our implementation we do store the initial state for simplicity, which may change later.) This is one of the reasons why the API is flow-based rather than session-based, with the session handled implicitly at the lower layer.

If the initiator wants to continue opening the session, it sends the initiator-initial-keying message. The payload is as follows:
initiator-initial-keying payload = initiator-session-id | cookie-echo
| initiator-certificate | initiator-component | 'X'

Note that the payload is terminated by \x58 (or 'X' character).

The initiator picks a new session-id (a 32-bit number) to identify this new session and uses it to demultiplex subsequently received packets. The responder uses this initiator-session-id as the session-id when forming the scrambled session-id in packets it sends in this session.

The cookie-echo is the same variable length data that was received in the responder-hello message. This allows the responder to relate this message with the previous responder-hello message. The responder will process this message only if it thinks that the cookie-echo is valid. If the responder thinks that the cookie-echo is valid except that the source address has changed since the cookie was generated, it sends a cookie change message to the initiator.

In this DH cryptosystem, p and g are publicly known. In particular, g is 2 and p is a 1024-bit number. The initiator picks a new random 1024-bit DH private number (x1) and generates the 1024-bit DH public number (y1) as follows (here ^ denotes modular exponentiation, not the XOR used earlier).
y1 = g ^ x1 % p


The initiator-certificate is understood by the crypto system and contains the initiator's DH public number (y1) in the last 128 bytes.

The initiator-component is understood by the crypto system and contains an initiator-nonce to be used in DH algorithm as described later.

When the target receives this message, it generates a new random 1024-bit DH private number (x2) and generates 1024-bit DH public number (y2) as follows.
y2 = g ^ x2 % p


Now that the target knows the initiator's DH public number (y1), it generates the 1024-bit DH shared secret as follows.
shared-secret = y1 ^ x2 % p


The target generates a responder-nonce to be sent back to the initiator. The responder-nonce is as follows.
responder-nonce = \x03\x1A\x00\x00\x02\x1E\x00\x81\x02\x0D\x02 | responder's DH public number


The peer-id is the 256-bit SHA256 (hash) of the certificate. At this time the responder knows the peer-id of the initiator from the initiator-certificate.

The target picks a new 32-bit responder session-id to demultiplex subsequent packets for this session. At this time the server creates a new session context to identify the new session. It also generates the asymmetric AES keys to be used for this session from the shared secret and the initiator and responder nonces as follows.
decode key = HMAC-SHA256(shared-secret, HMAC-SHA256(responder nonce, initiator nonce))[:16]
encode key = HMAC-SHA256(shared-secret, HMAC-SHA256(initiator nonce, responder nonce))[:16]

The decode key is used by the target to AES-decode incoming packets containing this responder's session-id. The encode key is used by the target to AES-encode outgoing packets to the initiator's session-id. Only the first 16 bytes (128 bits) are used as the actual AES encode and decode keys.

The target sends the responder-initial-keying message back to the initiator. The payload is as follows.
responder-initial-keying payload = responder session-id | responder's nonce | 'X'

Note that the payload is terminated by \x58 (or 'X' character). Note also that this handshake response is encrypted using the symmetric (handshake) AES key instead of the newly generated asymmetric keys.

When the initiator receives this message it also calculates the AES keys for this session.
encode key = HMAC-SHA256(shared-secret, HMAC-SHA256(responder nonce, initiator nonce))[:16]
decode key = HMAC-SHA256(shared-secret, HMAC-SHA256(initiator nonce, responder nonce))[:16]

As before, only the first 16 bytes (128 bits) are used as the AES keys. The encode key of the initiator is the same as the decode key of the responder, and the decode key of the initiator is the same as the encode key of the responder.
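
A hedged Python sketch of the initiator-side computation using the standard hmac and hashlib modules; the 1024-bit prime p and the exact nonce byte layouts are as described above and not reproduced here, and the interpretation of HMAC-SHA256(a, b) as key a and message b is an assumption of this sketch:

import hmac, hashlib, os

def dh_keypair(p, g=2):
    # Pick a random private number x and compute the public number y = g^x mod p.
    x = int.from_bytes(os.urandom(128), 'big') % p
    return x, pow(g, x, p)

def shared_secret(their_public, my_private, p):
    # 1024-bit shared secret, serialized as 128 bytes.
    return pow(their_public, my_private, p).to_bytes(128, 'big')

def hmac_sha256(key, data):
    return hmac.new(key, data, hashlib.sha256).digest()

def initiator_keys(secret, initiator_nonce, responder_nonce):
    # The initiator's encode key mirrors the responder's decode key and vice versa;
    # only the first 16 bytes of each HMAC output are used as AES keys.
    encode_key = hmac_sha256(secret, hmac_sha256(responder_nonce, initiator_nonce))[:16]
    decode_key = hmac_sha256(secret, hmac_sha256(initiator_nonce, responder_nonce))[:16]
    return encode_key, decode_key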

When a server acts as a forwarder, it receives an incoming initiator-hello and sends a forwarded-initiator-hello in an existing session to the target. The payload is as follows.
forwarded initiator hello payload := first | epd | transport-address | tag


The first 8-bit value is \x22. The epd value is the same as that in the initiator-hello -- variable length data containing an epd-type and epd-value. The epd-type is \x0f for a peer-to-peer session. The epd-value is the target peer-id that was received as the epd-value in the initiator-hello.

The tag is echoed from the incoming initiator-hello and is a fixed 16 bytes value.

The transport address contains a flag indicating whether the address is private or public, the binary bits of the IP address, and an optional port number. The transport address is that of the initiator as known to the forwarder.
transport-address := flag | ip-address | port-number

The flag is an 8-bit number whose most-significant bit is 1 if the port-number is present and 0 otherwise. The least-significant two bits are 10b for a public IP address and 01b for a private IP address.

The ip-address is either 4-bytes (IPv4) or 16-bytes (IPv6) binary representation of the IP address.

The optional port-number is a 16-bit number and is present when the flag indicates so.
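
A Python sketch of parsing a single transport-address under the flag layout described above; inferring the address family from the remaining length, and the assumption that the buffer contains exactly one transport-address, are mine:

import socket, struct

def parse_transport_address(data):
    flag = data[0]
    has_port = bool(flag & 0x80)                     # most-significant bit: port present
    kind = 'public' if flag & 0x02 else 'private'    # least-significant two bits: 10b/01b
    # 4 address bytes for IPv4, 16 for IPv6; inferred from what remains.
    addr_len = len(data) - 1 - (2 if has_port else 0)
    family = socket.AF_INET if addr_len == 4 else socket.AF_INET6
    ip = socket.inet_ntop(family, data[1:1 + addr_len])
    port = struct.unpack('>H', data[-2:])[0] if has_port else None
    return kind, ip, port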

The server then sends a forwarded-hello-response message back to the initiator with the transport-address of the target.
forwarded-hello-response = transport-address | transport-address | ...

The payload is basically one or more transport addresses of the intended target, with the public address first.

After this the initiator client directly sends subsequent messages to the responder, and vice-versa.

A normal-user-data message type is used to deal with any user data in the flows. The payload is shown below.
normal-user-data payload := flags | flow-id | seq | forward-seq-offset | options | data

The flags field, an 8-bit number, indicates fragmentation, options-present, abandon and/or final. The following table lists the meaning of the bits from most significant to least significant (a small decoding sketch follows the table).
bit   meaning
0x80 options are present if set, otherwise absent
0x40
0x20 with beforepart
0x10 with afterpart
0x08
0x04
0x02 abandon
0x01 final
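
A small Python sketch decoding the documented bits of this flags byte; the undocumented bits (0x40, 0x08, 0x04) are simply ignored here:

def parse_user_data_flags(flags):
    # Bit names follow the table above; the meaning of the remaining bits is unknown.
    return {
        'options_present': bool(flags & 0x80),
        'with_beforepart': bool(flags & 0x20),
        'with_afterpart':  bool(flags & 0x10),
        'abandon':         bool(flags & 0x02),
        'final':           bool(flags & 0x01),
    }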

The flow-id, seq and forward-seq-offset are all variable length numbers. The flow-id is the flow identifier. The seq is the sequence number. The forward-seq-offset is used for partially reliable in-order delivery.

The options are present only when the flags indicate so using the most significant bit as 1. The options are as follows.

TODO: define options

The subsequent data in the fragment may be sent using next-user-data message with the payload as follows:
next-user-data := flags | data

This is just a compact form of the user data, used when multiple user data messages are sent in the same packet. The flow-id, seq and forward-seq-offset are implicit: the flow-id is the same, and subsequent next-user-data messages have incrementing seq and forward-seq-offset. Options are not present. A single packet never contains data from more than one flow, to avoid head-of-line blocking and priority inversion between flows in case of problems.

TODO

Will update this article in future:
- Fill in the description of the remaining message flows beyond handshake.
- Describe the man-in-middle mode that enables audio/video flowing through the server.

Three Problems in Interoperating with H.264 of Flash Player

H.264 decoding has been part of Flash Player since version 9, but H.264 encoding was only recently added in version 11. Once the Flash Player 11 beta was out, I started looking into integrating video translation in the SIP-RTMP gateway project. For a Flash-to-Flash video conference you do not need to understand the problems related to H.264 in Flash Player, because everything is taken care of behind the scenes by Flash Player. Adding H.264 support in the flash-videoio project was relatively straightforward. However, if you are building your own translator to interoperate video between Flash Player and some other application, you will need to understand these problems.

1) The first problem is that Flash Player doesn't enable H.264, even for decoding, if the RTMP connection does not use the new-style "secure" handshake. In older versions, handshaking with bytes containing zeros worked, but it does not when using H.264. Eventually I found out about this while reading an open-source-flash (osflash) forum post and incorporated it in my gateway.

2) The H.264 encoder generates sequence headers (called SPS and PPS) which are essential for decoding the rest of the video data packets. The same is true of the AAC audio codec. In particular, in live H.264 publish mode, Flash Player generates periodic SPS/PPS packets so that another Flash Player (or SIP phone) can join the call later and still be able to start decoding the stream. However, some existing SIP video phones generate the sequence packets only once at the beginning. The SIP-RTMP gateway needed to be modified to cache the sequence packets received from the non-Flash Player client and re-send them with the correct timestamp to any Flash Player client that joins the stream late.

3) It looks like Flash Player 11.0 changed something related to buffering of live streams, which causes problems if the SIP side generates multiple slice NALUs (primitive data units in H.264) per frame. Flash Player itself generates one NALU per frame; however, some existing SIP video phones (e.g., Bria 3) generate old-style multiple slices per frame with one NALU per slice, which cannot be decoded and displayed in Flash Player 11 in live mode. You can read more about the problem. This is not a problem for buffered playback though. (Update on 12/12/2011 -- I can verify that this bug has been fixed in Flash Player 11.2.202.96, and video calls work fine now between Bria 3 and Flash Player via my SIP-RTMP gateway.)

The Ekiga SIP phone uses the new-style RTP mechanism of fragmenting a full H.264 frame instead of using multiple slices in the H.264 encoding. This can easily be translated for Flash Player and works with my SIP-RTMP gateway. However, Ekiga has another problem: it incorrectly interprets the RTP timestamp of the received stream, which makes it play the stream much more slowly.

The Philosophy of Open Source

I recently read the book "Open Life: The Philosophy of Open Source" by Henrik Ingo. I strongly recommend that software developers as well as technical managers read it. Here I present some excerpts from the book that I found very interesting.

"If a buyer is willing to pay a lot for it, then a cheap product can be sold at a high price... It is not stupid to ask a high price, but it is to pay it."

"The law of supply and demand can lead to situations that seem strange when common sense is applied to them. ... The oil is no different to the oil that was on sale at a considerably cheaper price just the day before... When supply goes down, the price goes up - even if all else remains equal."

"Often, the kind of stuff branded a trade secret can also be absurdly insignificant, but the important thing is that they don't tell others about it. Today's companies are at least as interested in the things they don't do as the things they pretend to be doing and producing... There's an ominous sense that much of what we do is done with a logic of mean-spiritedness, whether it is in business or in our everyday lives!"

"In a word, Europe's farming policy is based on mean-spiritedness. The subsidies policy is based on farmers agreeing not to produce more food than their agreed quota (to keep the supply low and prices high)."

"The logic of mean-spiritedness that follows from the law of supply and demand, can also be found in all fields of commerce where there is any co called 'immaterial property', including IT, music, film, and other kinds of entertainment, but the most glaring examples of it occur within the world of computers."

"These three demands -- features, quality, and deadline -- would build certain tension into any project. If, for instance, the schedule is too tight, there may not be enough time to include all the features you want. But if a project manager leans on his team, demanding that all the features are included and the deadline be met, then they are compelled to do a rushed job and, inevitably, quality suffers. ... The Open Source community's no-deadlines principle makes excellent sense, and is probably one of the reasons Open Source programs are so successful... One of the most frequently asked questions at the time was, 'When will the next version of Linux be released?' Linus had a stock answer, which was always, 'When it's done.'"

"Why do people do things? The first reason is survival. The second reason is to have social life. And the third reason is that people do things for fun. In that order... Since we work to have fun, to enjoy it, then why do we drive ourselves into the ground trying to meet artificial deadlines?"

"Usually, the vision and business strategies which guide a company are created in the upper echelons of management, after which it's up to the employees to do whatever the boss requires of them...But the principle of 'do whatever you like' would suggest that the management should quit producing the whole vision and business strategies, and focus instead on making it possible for employees to realize their own vision as best they can. (Unfortunately) For many managers such a concept would seem totally alien."

"The lazier the programmer, the more code he writes. .. Typing is too arduous for him, so he writes the code for a word processing program... Because it's too much effort to print out a letter and take it to the postbox, he writes the code for e-mail... So, laziness is a programmer's prime virtue."

"It's not healthy for one's central motivation to be hatred and fear. And what if one day Linux did manage to bring down Microsoft? Would life then lose its meaning? In order to energize themselves, would the programmers then have to find some new and fearful threat to compete against?"

"Since the beginning of hacking, Open Source hackers have always made programs to suit their own needs. ... As a client, the Federal Republic of Germany accepted this logic, and they aren't likely to have any reason to complain. Not only did they get what they wanted, they got a high-quality solution, they got it cheap, and they got it fast. What could be unfair about that?"

"An interesting situation -- IBM had to keep developing Eclipse; yet, financially, investing in it was a bad idea. The solution, of course, was Open Source."

"A company that has calculated its tender openly is much easier to trust. If I were to receive an honest tender of 1,000,000 from a company that operated with open principles, and the tender from a closed company came in at 999,500, I am likely to laugh at the latter and accept the former."

"I once read somewhere about a study which showed that about 20 percent of ants in an anthill do totally stupid things, such as tear down walls the other ants have just built, or take recently gathered food and stow it where none of them will ever find it, or do other things to sabotage everything the other ants so diligently try to achieve. The theory is that these ants don't do these things out of malice but simply because they're stupid... Critics of Open Source projects claim that their non-existent hierarchy and lack of organization leads to inefficiency... If a number of people do some stupid things, we make a rule to say it mustn't be done. Then we need a rule that says everybody has to read the rules. Before long, we need managers and inspectors to make sure people read and follow the rules and that nobody does anything stupid, even by mistake. Finally, the organization has a large group of people who spend time thinking up and writing rules, and enforcing them. And those not involved in doing, are primarily concerned with not breaking the rules...However, Linux and Wikipedia prove the opposite is true... This is particularly true when you factor in that not all planners (managers) are all that smart. Which means organizations risk having their entire staff doing something really inane, because that's what somebody planned. So, it seems better to have a little overlapping and lack of planning, because at least you have better odds for some of the overlapping activities actually making sense..."

Internet Video Communication: Past, Present and Future


I gave a presentation last month titled Hello 1 2 3, can you hear see me now? highlighting my point of view on the origins of Internet video communication technologies we see today. The full text of the presentation can be found at Internet video communication: past, present and future.

Modern video communication systems have roots in several technologies: transporting video over phone lines, using multicast on Internet2's Mbone, adding video to voice-over-IP (VoIP), and adding interactivity in existing streaming applications. Although the Internet telephony and multimedia communication protocols have matured over the last fifteen years, they are largely being used for interconnectivity among closed networks of telecom services. Recently, the world wide web has evolved as a popular platform for everything we do on the Internet including email, text chat, voice calls, discussions, enterprise applications and multi-party collaboration. Unfortunately, there is a disconnect between the web and traditional Internet telephony protocols as they have ignored the constraints and requirements of each other. Consequently, Adobe's Flash Player is being used as a web browser plugin by many developers for voice and video calls over the web.

Learning from the mistakes of the past and knowing where we stand at present will help us build the Internet video communication systems of the future. I present my point of view on the evolution, challenges and mistakes of the past, and, moving forward, describe the challenges in bridging the gap between web and VoIP. I highlight my contributions at various stages in the journey of Internet audio/video communication protocols.

Flash-based audio and video communications in the cloud


Internet telephony and multimedia communication protocols have matured over the last fifteen years. Recently, the web is evolving as a popular platform for everything we do on the Internet including email, text chat, voice calls, discussions, enterprise apps and multi-party collaboration. Unfortunately, there is a disconnect between web and traditional Internet telephony protocols as they have ignored the constraints and requirements of each other. Consequently, the Flash Player is being used as a web browser plugin by many developers for web-based voice and video calls. We describe the challenges of video communication using a web browser, present a simple API using a Flash Player application, show how it supports a wide range of web communication scenarios in the cloud, and describe how it can interoperate with Session Initiation Protocol (SIP)-based systems. We describe both the advantages and challenges of Flash Player based communication applications. The presented API could guide future work on communication-related web protocol extensions.

More details are available in our white-paper. The associated software and example use cases are available as flash-videoio and siprtmp projects. The white-paper also serves as the architecture and design document of these projects.

Voice and Video Communications on Web


I co-authored and presented a paper on "SIP APIs for voice and video communications on the web" at IPTcomm 2011. The paper compares various alternative architectures, and presents the components of our ongoing project at IIT, Chicago. We are open to sponsorship of the project to further continue its R&D work. Please feel free to get in touch with me or Prof. Davids if you are interested in sponsoring student projects in her lab related to this technology.

The paper and the presentation slides are available. The project page, open source code, and free demonstration page are also available.

Abstract: Existing standard protocols for the web and Internet telephony fail to deliver real-time interactive communication from within a web browser. In particular, the client-server web protocol over reliable TCP is not always suitable for the end-to-end low latency media path needed for interactive voice and video communication. To solve this, we compare the available platform options using existing technologies such as modifying the web programming language and protocol, using an existing web browser plugin, and a separate host resident application that the web browser can talk to. We argue that using a separate application as an adaptor is a promising short-term as well as long-term strategy for voice and video communications on the web. Our project aims at developing the open technology and sample implementations for web-based real-time voice and video communication applications. We describe the architecture of our project including (1) a RESTful web communication API over HTTP inspired by SIP message flows, (2) a web-friendly set of metadata for session description, and (3) a UDP-based end-to-end media path. All other telephony functions reside in the web application itself and/or in web feature servers. The adaptor approach allows us to easily add new voice and video codecs and NAT traversal technologies such as Host Identity Protocol. We want to make web-based communication accessible to millions of web developers, maximize the end user experience and security, and preserve the huge global investment in and experience from SIP systems while adhering to web standards and development tools as much as possible. We have created an open source prototype that allows you to freely use the conference application by directing a browser to the conference URL.

A Proposal for Reference Implementation Repository of SIP-related RFCs

One of the root causes of non-interoperable implementations is misinterpretation of the specification. A number of people have claimed that SIP has become complicated and has failed to deliver its promise of mix-and-match interoperability. There are two main reasons: (a) the number of SIP-related RFCs and drafts is growing faster than a developer or product life-cycle can keep up with, and (b) many of the RFCs and drafts are not supported by an open implementation, which results in misinterpretation of some aspects of the specification by programmers. The job of a SIP programmer is to (1) read the RFC or draft for SIP or its extensions, (2) understand the logic and figure out how it fits in the big picture or how it relates to the existing SIP source code, (3) come up with an object-oriented class diagram, the classes' properties and pseudo-code for their methods, and finally (4) implement the classes and methods.

Clearly the text in RFCs and drafts cannot be as unambiguous as the real source code of a computer program. So many programmers may read and implement some features differently, resulting in non-interoperable implementations. Having readily available pseudo-code for SIP and many of its extensions relieves the programmer of the error-prone step (2) above, and resolves any misinterpretation at an early stage. There is a huge cost paid by the vendor or provider for a programmer's misinterpretation of the specification.

This project proposal is to keep an open and public repository of reference implementation of RFC 3261 and other SIP-related extensions. If this repository is maintained by public bodies such as SIPForum and open source community, it will enable easy access to developers and enable better interoperability of new extensions.

The goal of this effort will be to encourage submission of reference implementations by RFC and Internet Draft authors. In case of any ambiguity, the clarification will be applied not only to the specification but also to the reference implementation.

If we use a very high level language such as Python, then the reference implementation essentially also serves as pseudo-code, which can be ported to other programming languages. The goal is not to get involved in the syntax of a particular programming language, but just to express the ideas more formally to prevent misinterpretation of the specification. Perhaps if Python is not suitable, a similar high level language syntax can be defined.

This will greatly simplify the job of a programmer and, in the long term, will result in more interoperable and robust products seamlessly supporting new SIP extensions and features. Programmers will have fewer things to worry about and hence can write more accurate code in less time. From a specification author's point of view, it will encourage him/her to write a more solid and implementable specification without ambiguity, and to provide the pseudo-code in the draft. From a reviewer's point of view, one can easily judge the complexity of various alternatives or features, e.g., one can say that adding the extension 'foo' takes just 10 lines of pseudo-code on top of the base SIP reference implementation.

It will help RFC and draft authors see the complexity and implementation aspects of their proposals. Sometimes an Internet draft proposes multiple solutions without much detail on any of them. This is partially due to the lack of implementation and complexity evaluation of the various approaches. With a reference implementation and pseudo-code repository, the author can provide a patch against the existing code to evaluate the complexity of the proposal.

A few years ago I wrote a tool to annotate software source code with the RFC/draft text, so that when you are reading a class or method in a source code file, you can quickly tell which part of the RFC/draft it implements. Please see an example here and here. Such annotations in the reference implementation will help in correlating the RFC/draft with the actual implementation.

If there is wide support for this proposal, we can raise it with SIPForum or other bodies, and we can help get started and bootstrap the repository with reference implementations of a few SIP-related RFCs. Then we can invite contributions from the community and from RFC/draft authors towards completing the implementations. Please post your comment to let us know what you think.

RESTful communication over WebSocket

This article shows how to implement generic resource-oriented communication on top of a synchronous channel such as WebSocket. It is a continuation of my previous article on REST and SIP [1] and provides more concrete thoughts, because I now have an actual implementation of part of this in my web conferencing application. Other people have commented on the idea of REST on WebSocket [2]. (Using the term RESTful, which is inherently stateless, is confusing over a stateful WebSocket transport; a more appropriate title for this article would be "Resource oriented and event-based communication over WebSocket".)

Following are the basic principles of such a mechanism.
  1. Enable resource-oriented (instead of RPC-style) communication.
  2. Enable asynchronous notification when a resource (or its child resource) changes.
  3. Enable asynchronous event subscribe, publish, and notification.
  4. Enable Unix file system style access control on resources.
  5. Enable the concept of transient vs persistent resources.
Consider my favorite example of a web conferencing application. The logged-in users list is represented as the resource /login relative to the hosting domain, and the calls list as /call. If the provider supports the concept of registered subscribers, those can be at /user. For example, /user/kundan10@gmail.com can be my user profile.

Now let us look at how these five points apply in this example.

1) Enable resource-oriented (instead of RPC-style) communication.

Every resource has its own representation, e.g., in JSON or XML. For example, /login/bob@home.com can be {"name": "Bob Smith", "email": "bob@home.com", "has-video": true, "has-audio": true}. The client-server communication can be over HTTP using standard RESTful or over WebSocket to access these resources.

Over WebSocket, the client sends a request of the form '{"method":"PUT","resource":"/login/bob@home.com", "type":"application/json","entity":{"name":"Bob Smith", ...}}' to log in. The server treats this the same as a RESTful PUT request received over plain HTTP.

A resource-oriented (instead of RPC-style) communication allows us to keep all the business logic in the client, which uses the server only as a data store. The standard HTTP methods allow access to such data, e.g., POST to create, PUT to update, GET to read and DELETE to delete. POST is a special method that must return the resource identifier of the newly created resource. For example, when a client does POST /call to create a new call, the server returns {"id": "conf123"} to indicate that the new resource identifier is "conf123" relative to /call and can be accessed at "/call/conf123".
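
As an illustration (in Python rather than the browser-side JavaScript), the request framing over the WebSocket might look like the following; the make_request helper is hypothetical, and only the attributes shown above are assumed:

import json

def make_request(method, resource, entity=None, content_type='application/json'):
    # Frame a RESTful-style request for transport over the WebSocket.
    message = {'method': method, 'resource': resource}
    if entity is not None:
        message.update({'type': content_type, 'entity': entity})
    return json.dumps(message)

# Log in by PUT-ing the user's representation, then create a call with POST.
print(make_request('PUT', '/login/bob@home.com',
                   {'name': 'Bob Smith', 'email': 'bob@home.com'}))
print(make_request('POST', '/call'))
# The response to the POST carries the new resource id, e.g. {"id": "conf123"},
# so the call can then be accessed at /call/conf123.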

2) Enable asynchronous notification when a resource (or its child resource) changes.

Many web applications including web conferencing need the notion of asynchronous notifications, e.g., when a user is online, or a user joins/leaves a call. Traditionally, Internet communication has used protocols such as SIP and XMPP for asynchronous notifications. With the advent of WebSocket (and the popular socket.io project) it is possible to implement persistent client-server connection for asynchronous notifications and messages within the web browser.

In this mechanism, a generic notification architecture is applied to resources. A new method named "SUBSCRIBE" is used to subscribe to any resource. A subscriber receives notification whenever the resource or its immediate children are changed (created, updated or deleted). For example, a conference participant sends the following over WebSocket: '{"method":"SUBSCRIBE","resource":"/call/conf123"}'. Whenever the server detects that a new PUT is done for "/call/conf123/participant12" or a new POST is done for "/call/conf123" it sends a notification message to the subscriber over WebSocket: '{"notify":"UPDATE","resource":"/call/conf123","type":"application/json","entity":{ ... child resource}, "create":"participant12"}'. On the other hand, if the moderator does a PUT on "/call/conf123", then the server sends a notification as '{"notify":"PUT","resource":"/call/conf123","type":"application/json", "entity":{... parent resource}}'. In summary, the server generates the notification to both the modified resource "/call/conf123/participant12" as well as the parent resource, "/call/conf123".

The notification message contains a new "notify" attribute instead of re-using the "method" attribute to indicate the type of notification. For example, "PUT", "POST", "DELETE" means that the resource identified in "resource" attribute has been modified using that method by another client. In this case the "type" and "entity" attribute represent the "resource". Similarly, "UPDATE" means that a child resource has been modified and the details of the child resource identifier is in "create", "update" or "delete" attribute. In this case the "type" and "entity" attribute represent the child resource identified in "create", "update" or "delete".

The concept of notification when a resource changes is also available in the ActionScript programming language. For example, markup text can use width="{size}" to bind the "width" property of a user interface component to the "size" variable. Whenever "size" changes, "width" is updated; a property change event is dispatched to enable the notification. Similarly, in our mechanism, a client application can subscribe to a resource to detect changes in its value or the value of its child resources.

3) Enable asynchronous event subscribe, publish, and notification

The previous point enables a client to receive server-generated notifications when a resource changes. Additionally, we need a generic end-to-end publish-subscribe mechanism that allows a client to send a notification to all the subscribers without modifying a resource. This allows end-to-end notifications from one client to others, via the server.

When a client subscribes to a resource, it also receives generic notifications sent by another client on that resource. A new NOTIFY method is defined. For example, if a client sends '{"method":"NOTIFY","resource":"/login/bob@home.com","type":"application/json","data":{"invite-to":"/call/conf123","invite-id":"6253"}}', and another client is subscribed to /login/bob@home.com, then it receives a notification message as '{"notify":"NOTIFY", "resource":"/login/bob@home.com","type":"application/json","data":{...}}'. In summary, the server just passes the "data" attribute to all the subscribers. The "notify" value of "NOTIFY" means an end-to-end notification generated because another client sent a NOTIFY method.

In a web conferencing application, most of the notifications are associated with a resource, e.g., conference membership change, presence status change, etc. Some notifications such as call invitation or cancel can be independent of a resource, and the NOTIFY mechanism can be used. For example, sending a NOTIFY to /login/bob@home.com is received by all the subscribers of this resource.
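
A client-side dispatch sketch, again in Python for illustration, distinguishing the three kinds of incoming notifications described above; the handler is hypothetical and only prints what it receives:

import json

def handle_incoming(text):
    message = json.loads(text)
    kind = message.get('notify')
    if kind == 'UPDATE':
        # A child of the subscribed resource changed; the child id is in
        # "create", "update" or "delete", and "entity" holds the child resource.
        child = message.get('create') or message.get('update') or message.get('delete')
        print('child', child, 'of', message['resource'], 'changed:', message.get('entity'))
    elif kind in ('PUT', 'POST', 'DELETE'):
        # The subscribed resource itself was modified by another client.
        print(message['resource'], 'modified via', kind, ':', message.get('entity'))
    elif kind == 'NOTIFY':
        # End-to-end notification published by another client; the server passes "data" through.
        print('event on', message['resource'], ':', message.get('data'))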

4) Enable Unix file system style access control on resources.

Without an authentication and access control mechanism, the resource-oriented communication protocol described above becomes useless. Fortunately, it is possible to design a generic access control mechanism similar to a Unix file system. Essentially, each resource is treated as both a file and a directory: all the child resources of a resource belong to the directory, whereas the resource entity belongs to the file. The service provider can configure top-level directories with user permissions, e.g., anyone can add a child to "/user", and once added it will be owned by that user. Thus if user Bob creates /user/bob, then Bob owns the subtree of this resource. It is up to Bob to configure the permissions of its child resources. For example, he can configure /user/bob/inbox to be writable by anyone but readable only by himself, e.g., permissions "rwx-w--w-". This allows a web based voice and video mail application.

Unlike the traditional file system data model with create, read, update and write, we also need a permission bit for subscription. For example, only Bob should be allowed to subscribe to /user/bob so that other users cannot get notifications meant for Bob. The concept of a group is a little vague, but can be incorporated in this mechanism as well. Finally, a notion of anonymous user needs to be added so that a client which does not have an account with the service provider can also be granted permissions if needed.

In summary, the permission bits become five bits for each of the four categories: self, group, others-authenticated and others-anonymous. The five bits define permissions to allow create, read, update, write and subscribe. Existing authentication such as HTTP basic/digest, cookies or OAuth based sessions can be used to actually authenticate the client.
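
One possible encoding of these permission bits, as a hedged sketch; the actual bit layout and category names are not specified above, so the constants here are assumptions:

# Five permission bits (create, read, update, write, subscribe) for each of the
# four categories: self, group, others-authenticated, others-anonymous.
CREATE, READ, UPDATE, WRITE, SUBSCRIBE = 0x10, 0x08, 0x04, 0x02, 0x01
CATEGORIES = ('self', 'group', 'auth', 'anon')

def allowed(permissions, category, operation):
    # permissions maps each category name to its five-bit value, e.g.
    # {'self': 0x1f, 'auth': WRITE, 'anon': WRITE} lets anyone write (a voice/video
    # mail inbox) but only the owner read or subscribe.
    return bool(permissions.get(category, 0) & operation)

assert allowed({'self': 0x1f, 'auth': WRITE}, 'auth', WRITE)
assert not allowed({'self': 0x1f, 'auth': WRITE}, 'auth', READ)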

5) Enable the concept of transient vs persistent resources.

In software programming, application developers typically use local and global variables to represent transient and persistent data, respectively. A similar concept is needed in a generic resource-oriented communication application. So far we have seen how to read, write, update and create resources. Each resource can be transient, so that it is deleted when the client which created it disconnects, or persistent, so that it remains even after the client disconnects. For example, when a client POSTs to /call/conf123, it wants that resource to be transient and deleted when the client disconnects. This lets the resource be used as a conference membership resource, and the server notifies other participants when an existing participant disconnects. On the other hand, when a client POSTs to /user/bob@home.com, it wants it to be the persistent user profile which is available even after this user has disconnected.

The concept of transient and persistent in the resource-oriented system allows the client to easily create a variety of applications without having to write custom kludges. In general a new resource should be created as transient by default, unless the client requests a persistent resource. Whenever the client disconnects the WebSocket all the transient resources (or local variables) of that client are deleted, and appropriate notifications are sent to the subscribers as needed.

Implementation

I have implemented part of this concept in my web conferencing project. The server side (called the service provider) is a generic PHP application, "restserver.php", that accepts WebSocket connections and uses a back-end MySQL database to store and manage resources and subscriptions. Each connected client is assigned a unique client-id. There are two database tables: the resource table has fields resource-id (e.g., "/login/bob@home.com"), parent-resource-id (e.g., "/login"), type (e.g., "application/json"), entity (i.e., the actual JSON representation string) and client-id, whereas the subscribe table has fields resource-id of the target resource and client-id of the subscriber. The subscriptions are always transient, i.e., when the client disconnects, all subscribe rows for that client-id are removed. The resources can be transient or persistent. By default any new resource is created as transient and the client-id is stored in that row. When the client disconnects, all resources with the matching client-id are removed and appropriate notifications are generated. A client can create a persistent resource by supplying a "persistent": true attribute in the PUT or POST request, in which case the server stores an empty client-id for that new resource.

The generic "restserver.php" server application can be used in a variety of web communication applications, and we have shown it to work with web conferencing and slides presentation application.


WebRTC vs Flash Player

This year has been great for the world of IP communications so far -- with the Skype deal, Flash Player adding echo cancellation, and now Google open sourcing WebRTC (with source code) that includes the audio/video codecs and quality engines.

RTC-Web is an effort started in the IETF (and Web-RTC in the W3C) to standardize the way media streams are transported end-to-end between two browser instances for a real-time communication experience within the browser. It consists of a protocol for establishing the end-to-end media path, abstractions for audio/video codecs and devices, and the language elements to use this feature from within Javascript/HTML. Traditionally, browser communication has been done using plugins such as Flash Player. I have written a few open source software projects that use Flash based audio and video communication (flash-videoio, siprtmp, vvowproject). The WebRTC effort brings a completely new dimension, in a good way, because now we do not depend on external plugins for web based real time communications. Real-time communication becomes a first class construct for web developers.

This article summarizes some differences between the WebRTC and Flash Player approaches for real-time audio/video communication. It also mentions a separate application approach as described in the VVoW project.

WebRTC is in line with the evolution of web protocols, whereas using Flash Player is like patching an incomplete system. With WebRTC there is no external dependency beyond the basic web browser. However, given the ubiquitous availability of Flash Player compared to basic interoperating HTML5 features, the Flash Player approach is still promising, at least in the short term.

The number of web developers who understand Javascript/HTML is clearly much larger than the number who understand Actionscript/MXML, which benefits the WebRTC approach because many more new applications and use cases can be implemented in practice. However, the complexity of building a Javascript-based application by combining the various individual pieces of the communication elements may be overwhelming. On the other hand, existing IDE tools for Flash development take away a lot of complexity from the developer.

Many users are reluctant to change their browser, and hence getting ubiquitous user adoption may take a long time unless this gets added to Internet Explorer. Moreover, dealing with device interfaces in a portable manner is a challenge. It is also not clear how the devices should be accessed across multiple instances of the same browser or across different browsers.

In the past, incompatibility in HTML among browsers has been a nightmare for web developers, and extending HTML for yet another feature is bound to cause more interoperability problems. Two interoperability scenarios are significant: between browsers from different vendors running the same web page, and between two different web sites. The latter is tricky from a security point of view if open standards are used, because a web site owner would want to restrict its users from communicating with users of another web site, whereas the protocol will be capable of such communication.

On the other hand, Flash Player has shown more ubiquitous availability on users' desktops and laptops than any specific web browser. Flash Player allows implementing platform-agnostic software because all the incompatibilities between browsers and platforms are taken care of by the plugin vendor.

Flash Player has the ability to do group communication by building a scalable application-level multicast tree among Flash Player instances. This is useful for one-to-many broadcast-type communication scenarios. WebRTC is still in the initial phase of two-party communication. Obviously, multiparty communication can be built on top of the two-party communication elements, but it requires more effort to achieve efficiency.

In terms of video codecs, WebRTC provides an open source, high-quality video codec, whereas Flash Player's camera-captured video still uses the outdated Sorenson codec, which is difficult to interoperate with non-Flash products. The availability of source code enables a WebRTC-based project to add new codecs as needed, without depending on the vendor to provide new audio and video codec features.

The main problem with the Flash Player approach is that the protocol for the end-to-end media path is proprietary, so interoperating with existing VoIP gear is inefficient without buying server pieces from the plugin vendor. Although interoperability is possible using open RTMP and SIP-RTMP translators, it is not efficient because the browser-to-translator media path over TCP incurs unnecessary latency for some users. Secondly, for any new feature we depend on the vendor, for example, echo cancellation, new codecs, or portability to a new device. Luckily, Adobe has been releasing updates with new features periodically. For example, the echo cancellation feature released in Flash Player 10.3 solved a lot of problems for real-time communication. (Please see the public-chat demo on my flash-videoio project page to try out video conferencing with echo cancellation.)

Some problems common to both approaches are: (1) the lack of a listening TCP socket or a general-purpose UDP socket, which could be used to implement a peer-to-peer application protocol within the browser without relying on servers, and (2) the scope of an application is limited to a web page as defined by the Javascript or Flash elements, so if the user navigates to another web page the communication is lost. This is not a problem for the web communication use case, but people are generally not used to this model in traditional communication.

On the other hand, the separate application model used in the VVoW project allows you to have host-resident software for communication, which can be used by any application, including a web application running in your browser, by connecting to the resident software locally. The resident application can reuse existing research, e.g., Host Identity Protocol and P2P-SIP. This can save the initial setup time that WebRTC incurs for every connection. The main problem is that it involves yet another download and installation by the end user, which hampers wide adoption.

I will continue to explore the WebRTC software released by Google and try to include it in my open source projects. Some example projects could be: (1) add interoperability between WebRTC and Flash Player for communication in my siprtmp project, (2) add an option to detect WebRTC support in my flash-videoio project and use it if available, falling back to Flash Player otherwise, (3) use the WebRTC source code to implement a separate application with a high-quality end-to-end media path in the VVoW project, and (4) create a Python wrapper to use WebRTC from within any Python application.

Performance of siprtmp: multitask vs gevent

Poor performance has been an issue in my RTMP server and SIP-RTMP gateway. Traditionally, I blamed the multitask framework for the poor performance. In this article I present my measurement results and introduce an alternative gevent-based implementation that improves the performance.

There are several performance aspects of this software, e.g., CPU utilization per call or session, memory usage, bandwidth requirement, etc. This article focuses only on CPU performance. Moreover, I consider only the steady-state CPU usage to measure the number of simultaneous active calls through the gateway. The CPU usage during call setup and termination is not considered.

The conclusion of my measurements is as follows. The SIP-RTMP gateway software using gevent takes about 2/3 the CPU cycles of the multitask version, and the RTMP server software using gevent takes about 1/2 the CPU cycles of the multitask version. After the improvements, on a dual-core 2.13 GHz CPU machine, a single audio call going through the gevent-based siprtmp using the Speex audio codec at 8 kHz sampling takes about 3.1% CPU, and hence in theory the gateway can support about 60 active calls in steady state. Another way to look at it is that the software requires about 66 MHz of CPU cycles per audio call.

The gevent-based software is also available under the same license for you to try out. The next step in further improving the performance is to move part of the media processing in siprtmp to an external C/C++ extension module.

Background

Traditionally, I have used the multitask framework for co-operative multitasking in my Python software, including p2p-sip, rtmplite and siprtmp. In the past, people have complained about high CPU utilization in siprtmp for a single call, or even with no call. Part of the discussion is documented in issue 31. It turned out that the no-call CPU usage was a bug, and that we could optimize the multitask framework to improve the performance by approximately 2x. The optimization alters the way in which the multitask framework looks for io-events and pending tasks. In particular, it gives preference to tasks over io-events, hence if a single io-event generates multiple tasks, all of them run before the framework waits for the next io-events; a simplified sketch of this scheduling preference is shown below. These optimizations and fixes are in SVN r60 and r68. Unfortunately, these optimizations are not enough.
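
The actual change lives in the multitask framework itself, but the idea can be shown with a minimal, hypothetical scheduler loop (the names below are illustrative, not taken from the multitask source): drain every pending task before calling select() again, so that one io-event which wakes several tasks lets all of them run first.

import select
from collections import deque

def run(tasks, read_waiters):
    # tasks: deque of ready-to-run callables; read_waiters: {socket: callable}
    while tasks or read_waiters:
        while tasks:                          # prefer tasks over io-events
            job = tasks.popleft()
            job()                             # running a job may append more tasks
        if read_waiters:                      # only then block for io-events
            readable, _, _ = select.select(list(read_waiters), [], [])
            for sock in readable:
                tasks.append(read_waiters.pop(sock))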

To further improve the performance, I looked at the built-in asyncore module of Python and re-implemented rtmp.py to use asyncore. There was a significant improvement of approximately another 1.5x to 2x. Unfortunately, getting timers to work with asyncore is not trivial (a sketch of the issue is shown below), hence I couldn't easily implement siprtmp that way, as the SIP/RTP library relies heavily on timers.
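
As a rough sketch of why this is awkward (the helper names here are made up, not from siprtmp): asyncore only reacts to socket events, so timers have to be managed manually by running the dispatcher loop in short steps and checking a timer heap between steps.

import asyncore, heapq, time

timers = []                                    # heap of (fire_time, callback)

def call_later(delay, callback):
    heapq.heappush(timers, (time.time() + delay, callback))

def loop():
    while asyncore.socket_map or timers:
        timeout = max(0, timers[0][0] - time.time()) if timers else 1.0
        if asyncore.socket_map:
            asyncore.loop(timeout=timeout, count=1)   # one poll iteration only
        else:
            time.sleep(timeout)
        # fire any timers that are now due
        while timers and timers[0][0] <= time.time():
            _, callback = heapq.heappop(timers)
            callback()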

Then I looked at the gevent project, thanks to a co-worker for recommending it. It supports co-routine based co-operative multitasking by modifying the existing blocking modules such as socket. Compared to the multitask framework, source code using gevent is more readable and easier to maintain because gevent works behind the scenes. In contrast, the multitask framework requires yield statements scattered everywhere and a non-trivial StopIteration exception to return from a task; the sketch below illustrates the difference. I re-implemented siprtmp.py, and the related SIP/RTP modules, using gevent. Since the siprtmp module includes all of the rtmp module, it can also be used as an RTMP server in addition to being a SIP-RTMP gateway.
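
As a minimal illustration (not code taken from siprtmp), here is the same trivial UDP exchange in both styles; the multitask variant is shown in comments and its API calls are from memory, so treat the whole thing as a sketch.

# gevent style: monkey-patch the socket module, then write ordinary blocking code
from gevent import monkey; monkey.patch_socket()
import gevent, socket

def echo_once(host, port):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.sendto(b'hello', (host, port))
    data, addr = s.recvfrom(1500)       # blocks only this greenlet
    return data

job = gevent.spawn(echo_once, '127.0.0.1', 5060)   # run in a greenlet (hypothetical address)

# multitask style (sketch): every blocking step is a yield to the scheduler,
# and a value is "returned" by raising StopIteration
# import multitask
# def echo_once(host, port):
#     s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#     s.sendto(b'hello', (host, port))
#     data, addr = yield multitask.recvfrom(s, 1500)
#     raise StopIteration(data)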

Test Setup

All my tests were done on my MacBook laptop with a 2.13 GHz Intel Core 2 Duo, 2 GB memory, running Mac OS X 10.5.6. I used Python 2.7 for the server-side components and the Flash debug player version MAC 10,0,45,2 (how to find?). I used X-Lite version 3 as a standard SIP client. The debug trace on the server was disabled by not supplying the -d option. All my clients and the server ran on my local host, hence bandwidth was not an issue. I used Mac's Activity Monitor to measure the CPU usage.

Measurement Result

The main metric is the percentage CPU usage as reported by Activity Monitor. Several parameters were varied and their effects were measured.

The siprtmp performance was measured for an audio call between the web-based VideoPhone sample application, available as part of the siprtmp software, and the third-party X-Lite application. The sampling rate of the Speex audio codec can be 8 kHz or 16 kHz. The higher the sampling rate, the larger the encoded packets, and the higher the CPU usage. Note that there is no transcoding in siprtmp. The following table shows the percentage CPU usage for siprtmp using multitask and gevent, for the two sampling rates.

Rate      multitask    gevent
8 kHz     4.8-5.1%     3.1-3.2%
16 kHz    6.2-6.5%     4.0-4.1%

Based on these numbers, we can conclude that the gevent-based SIP-RTMP gateway takes about 2/3 the CPU of the multitask-based gateway. Roughly, the gevent-based gateway takes about 66 MHz of CPU cycles per audio call in steady state.
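
As a quick sanity check of these figures (my own arithmetic, using the 3.1% number from the table above and one 2.13 GHz core):

core_mhz = 2130.0                 # one 2.13 GHz core
per_call = 0.031                  # ~3.1% CPU for one gevent audio call at 8 kHz
print(per_call * core_mhz)        # ~66 MHz of CPU cycles per call
print(2 * core_mhz / (per_call * core_mhz))   # ~64, i.e., about 60 calls on a dual-core machine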

The rtmp performance was measured using one publisher and zero or more players. The CPU usage increases with the number of players. Typically, an audio-only session gives less variance in the CPU usage, whereas if video is included then the packet size changes with the amount of movement and image detail, and so does the CPU usage. I used the Flash VideoIO project's test page to perform the tests. If video is present, Flash Player's camera capture uses these properties: cameraQuality=80, cameraWidth=320, cameraHeight=240, cameraFPS=12. Audio is always Speex at 16 kHz with encodeQuality=6. The following table shows the percentage CPU usage using multitask and gevent, with one publisher and a varying number of players, with or without video. If the variance is small, only the average is reported; if the variance is large, the range is listed.

Media          #players    multitask    gevent
Audio          0           2.2%         1.3%
Audio          1           3.5%         1.8%
Audio          2           4.5%         2.1%
Audio          3           5.5%         2.5%
Audio+Video    0           3.0-3.9%     1.4-1.7%
Audio+Video    1           4.2-4.7%     2.1%
Audio+Video    2           5.5-6.3%     2.7%
Audio+Video    3           7.0-7.6%     3.1%

Based on these numbers, we can conclude that the gevent-based software takes less than 1/2 the CPU of the multitask-based software for RTMP streaming. Roughly, the gevent-based server takes about 34 MHz of CPU cycles per publisher and 12 MHz per player in steady state.
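
These per-publisher and per-player figures follow from the gevent Audio+Video column in the same way (again my own rough arithmetic, using mid-points of the measured ranges):

core_mhz = 2130.0
publisher = 0.015 * core_mhz                  # ~1.5% CPU with zero players -> roughly 32-34 MHz
per_player = (0.031 - 0.015) / 3 * core_mhz   # ~0.5% extra per player -> roughly 11-12 MHz
print(publisher, per_player)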