Understanding RTMFP Handshake

(Disclaimer: the protocol description here is not an official specification of RTMFP but just the protocol understanding based on the OpenRTMFP's Cumulus project as well as the IETF presentation slides.)

Introduction

What is RTMFP? RTMFP (Real-Time Media Flow Protocol) allows a UDP-based low-latency end-to-end media path between two Flash Player instances. Compared to the earlier RTMP-based media path, which runs over TCP, this new protocol enables actual real-time communication on the web. Although the end-to-end media path is not always possible when certain types of NATs and firewalls are present, it is possible to do end-to-end media across most residential-type NATs. The end-to-end media path between two Flash Player instances reduces latency and improves scalability of the service (or server infrastructure) since most heavy media traffic can be sent without going through the hosted server. The UDP transport reduces latency compared to TCP transport even if the media path is client-server.

Why is understanding RTMFP important? Unlike the earlier RTMP, the new protocol RTMFP is still closed with no open specification available. There have been some attempts at reverse engineering the protocol for interoperability and some official slides explaining the core logic. Understanding the wire protocol is not important if you are building Flash-based applications that only talk to each other. However, for applications such as a Flash-to-SIP gateway or a Flash-to-RTSP translator, where you may need to interoperate between RTMFP and SIP/RTP, it is important to understand the wire protocol in detail. For a Flash-to-SIP gateway, incorporating RTMFP on the Flash side in addition to the existing RTMP will enable a low-latency UDP media path between the web user and the translator service on the Internet.

The following description is reproduced from a contribution (see rtmfp.py) to my RTMP server project.

Session

An RTMFP session is an end-to-end bi-directional pipe between two UDP transport addresses. A transport address contains an IP address and port number, e.g., "192.1.2.3:1935". A session can have one or more flows where a flow is a logical path from one entity to another via zero or more intermediate entities. UDP packets containing encrypted RTMFP data are exchanged in a session. A packet contains one or more messages. A packet is always encrypted using AES with 128-bit keys.

In the protocol description below, all numbers are in network byte order (big-endian). The | operator indicates concatenation of data. The numbers are assumed to be unsigned unless mentioned explicitly.

Scrambled Session ID

The packet format is as follows. Each packet starts with a 32-bit scrambled session-id followed by the encrypted part. Scrambling the session-id (instead of sending it raw) makes it difficult, if not impossible, for middle boxes such as NATs and layer-4 packet inspectors to mangle packets. The bit-wise XOR operator is used to scramble the session-id with the two 32-bit numbers that follow it. Because XOR is its own inverse, unscrambling is just as easy.
packet := scrambled-session-id | encrypted-part

To scramble a session-id,
scrambled-session-id = a^b^c

where ^ is the bit-wise XOR operator, a is session-id, and b and c are two 32-bit numbers from the first 8 bytes of the encrypted-part.

To unscramble,
session-id = z^b^c

where z is the scrambled-session-id, and b and c are the same two 32-bit numbers from the first 8 bytes of the encrypted-part.
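The scrambling above can be sketched in a few lines of Python. This is only an illustration of the XOR arithmetic; the function names are mine and the 16-byte buffer stands in for a real AES-encrypted part.

```python
import struct

def scramble_session_id(session_id, encrypted_part):
    """Return the 4-byte scrambled session-id for a packet."""
    # b and c are the first two 32-bit big-endian words of the encrypted part.
    b, c = struct.unpack_from(">II", encrypted_part, 0)
    return struct.pack(">I", session_id ^ b ^ c)

def unscramble_session_id(packet):
    """Recover the raw session-id from a full packet (scrambled id | encrypted part)."""
    z, b, c = struct.unpack_from(">III", packet, 0)
    return z ^ b ^ c

encrypted = bytes(range(16))  # stand-in for a real AES-encrypted part
packet = scramble_session_id(0x12345678, encrypted) + encrypted
assert unscramble_session_id(packet) == 0x12345678
```

Because the same two words are XOR-ed in both directions, the receiver needs no key to find the session-id. It only needs the first 12 bytes of the packet.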

The session-id determines which session keys are used for encryption and decryption of the encrypted part. There is one exception for the fourth message in the handshake which contains the non-zero session-id but the handshake (symmetric) session keys are used for encryption/decryption. For the handshake messages, a symmetric AES (advanced encryption standard) with 128-bit (16 bytes) key of "Adobe Systems 02" (without quotes) is used. For subsequent in-session messages the established asymmetric session keys are used as described later.

Encryption

Assuming that the AES keys are known, the encryption and decryption of the encrypted-part is done as follows. For decryption, an initialization vector of all zeros (0's) is used for every decryption operation. For encryption, the raw-part is assumed to be padded as described later, and an initialization vector of all zeros (0's) is used for every encryption operation. The decryption operation does not add additional padding, and the byte-size of the encrypted-part and the raw-part must be same.

The decrypted raw-part format is as follows. It starts with a 16-bit checksum, followed by variable bytes of network-layer data, followed by padding. The network-layer data ignores the padding for convenience.
raw-part := checksum | network-layer-data | padding


The padding is a sequence of zero or more bytes where each byte is \xff. Since AES works on 128-bit (16-byte) blocks, padding ensures that the size in bytes of the decrypted part is a multiple of 16. Thus, the size of padding is always less than 16 bytes and, counting the 2-byte checksum, is calculated as follows:
len(padding) = 16*N - len(network-layer-data) - 2

where N is the smallest positive integer that makes 0 <= len(padding) < 16

For example, if network-layer-data is 84 bytes, then padding is 16*6-84-2=10 bytes. Adding a padding of 10 bytes makes the decrypted raw-part (2-byte checksum, 84 bytes of data and 10 bytes of padding) 96 bytes, which is a multiple of 16 and hence fits the AES block size.
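As a rough sketch, the padding length can be computed as below. It assumes that the 2-byte checksum counts toward the 16-byte multiple, since the checksum is part of the raw-part that gets encrypted. The constant and function names are my own.

```python
CHECKSUM_SIZE = 2   # the 16-bit checksum at the start of the raw-part
BLOCK_SIZE = 16     # AES block size in bytes

def padding_length(data_len):
    """Number of 0xff bytes needed so checksum + data + padding fills whole blocks."""
    return -(CHECKSUM_SIZE + data_len) % BLOCK_SIZE

def pad(network_layer_data):
    """Append the 0xff padding to the network-layer data."""
    return network_layer_data + b"\xff" * padding_length(len(network_layer_data))

padded = pad(b"\x00" * 84)
print(len(padded) - 84, CHECKSUM_SIZE + len(padded))
```

For 84 bytes of data this prints the padding length followed by the total raw-part size, which should come out as a multiple of 16.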

Checksum

The checksum is calculated over the concatenation of network-layer-data and padding. Thus for the encoding direction you should apply the padding followed by checksum calculation and then AES encrypt, and for the decoding direction you should AES decrypt, verify checksum and then remove the (optional) padding if needed. Usually padding removal is not needed because network-layer data decoders will ignore the remaining data anyway.

The 16-bit checksum number is calculated as follows. The concatenation of network-layer-data and padding is treated as a sequence of 16-bit numbers. If the size in bytes is not an even number, i.e., not divisible by 2, then the last 16-bit number used in the checksum calculation has that last byte in the least-significant position (weird!). All the 16-bit numbers are added into a 32-bit number. The upper 16 bits and the lower 16 bits of that sum are then added together, and any carry out of that addition (the upper 16 bits of the result) is added back in. Only the least-significant 16 bits of the resulting sum are used as the checksum.
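Here is a minimal sketch of the checksum exactly as described in the paragraph above. Internet-style checksums often finish with a one's complement of the sum. The description here does not mention one, so this sketch leaves it out. Treat that as an open point to verify against a real packet capture.

```python
def checksum(data):
    """16-bit checksum over network-layer-data + padding, as described in the text."""
    total = 0
    # Sum big-endian 16-bit words; a trailing odd byte is added as-is,
    # i.e. it sits in the least-significant position of its word.
    for i in range(0, len(data) - 1, 2):
        total += (data[i] << 8) | data[i + 1]
    if len(data) % 2:
        total += data[-1]
    total = (total >> 16) + (total & 0xFFFF)  # fold upper half into lower half
    total += total >> 16                      # add back any carry from the fold
    return total & 0xFFFF
```

On the encoding side this would be called on the already padded data, and the 2-byte result placed in front of it before AES encryption.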

Network Layer Data

The network-layer data contains flags, optional timestamp, optional timestamp echo and one or more chunks.
network-layer-data = flags | timestamp | timestamp-echo | chunks ...

The flags value is a single byte containing the following information: time-critical forward notification, time-critical reverse notification, whether the timestamp is present, whether the timestamp echo is present, and an initiator/responder marker. The initiator/responder marker is useful when the symmetric (handshake) session keys are used for AES, because it protects against a packet being looped back to its sender.

The bit format of the flags is not clear, but the following applies. For the handshake messages, the flags is \x0b. When the flags' least-significant 4-bits are 1101b then the timestamp-echo is present. The timestamp seems to be always present. For in-session messages, the last 4-bits are either 1101b or 1001b.
--------------------------------------------------------------------
flags      meaning
--------------------------------------------------------------------
0000 1011  setup/handshake
0100 1010  in-session no timestamp-echo (server to Flash Player)
0100 1110  in-session with timestamp-echo (server to Flash Player)
xxxx 1001  in-session no timestamp-echo (Flash Player to server)
xxxx 1101  in-session with timestamp-echo (Flash Player to server)
--------------------------------------------------------------------

TODO: looks like bit \x04 indicates whether timestamp-echo is present. Probably \x08 indicates whether timestamp is present, since that bit is set in every row of the table above. The last two bits of 11b indicate handshake, 10b indicates server to client and 01b indicates client to server.

The timestamp is a 16-bit number that represents the time on a 4-millisecond clock. The wall clock time can be used to generate this timestamp value. For example, if the current time in seconds is tm = 1319571285.9947701 then the timestamp is calculated as follows:
int(tm * 1000/4) & 0xffff = 63994
That is, assuming a 4-millisecond clock, calculate the clock units and use the least significant 16 bits.

The timestamp-echo is just the timestamp value that was received in the incoming request and is being echo'ed back. The timestamp and its echo allows the system to calculate the round-trip-time (RTT) and keep it up-to-date.
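The timestamp and the RTT estimate can be sketched as follows. The function names are hypothetical, and the RTT helper ignores any time the peer held the timestamp before echoing it, so it gives an upper bound rather than an exact value.

```python
import time

TICK_MS = 4  # one timestamp unit is 4 milliseconds

def rtmfp_timestamp(now=None):
    """Low 16 bits of the wall clock counted in 4 ms ticks."""
    if now is None:
        now = time.time()
    return int(now * 1000 / TICK_MS) & 0xFFFF

def rtt_ms(echoed_timestamp, now=None):
    """Round-trip time in ms from an echoed timestamp, tolerating 16-bit wrap-around."""
    elapsed_ticks = (rtmfp_timestamp(now) - echoed_timestamp) & 0xFFFF
    return elapsed_ticks * TICK_MS
```

Masking the difference with 0xffff keeps the result correct when the 16-bit counter wraps between sending a timestamp and receiving its echo, which happens roughly every 262 seconds with a 4 ms clock.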

Each chunk starts with an 8-bit type, followed by the 16-bit size of payload, followed by the payload of size bytes. Note that \xff is reserved and not used for chunk-type. This is useful in detecting when the network-layer-data has finished and padding has started because padding uses \xff. Alternatively, \x00 can also be used for padding as that is reserved type too!
chunk = type | size | payload
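To make the layout concrete, here is a hedged sketch of a parser for the decrypted network-layer data. It assumes that the timestamp is always present and that bit \x04 of the flags marks the timestamp-echo. Both are observations from this article, not confirmed facts. The function name is mine.

```python
import struct

def parse_network_layer(data):
    """Split decrypted network-layer-data into (flags, timestamp, echo, chunks)."""
    flags = data[0]
    (timestamp,) = struct.unpack_from(">H", data, 1)  # assumed always present
    pos = 3
    echo = None
    if flags & 0x04:  # assumption: bit 0x04 marks the timestamp-echo
        (echo,) = struct.unpack_from(">H", data, pos)
        pos += 2
    chunks = []
    # Stop at the padding: 0xff (and 0x00) are reserved and never valid chunk types.
    while pos + 3 <= len(data) and data[pos] not in (0xFF, 0x00):
        chunk_type, size = struct.unpack_from(">BH", data, pos)
        pos += 3
        chunks.append((chunk_type, data[pos:pos + size]))
        pos += size
    return flags, timestamp, echo, chunks
```

Since the loop stops at the first reserved type byte, the caller does not have to strip the padding before parsing.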


Message Flow

There are three types of session messages: session setup, control and flows. The session setup is part of the four-way handshake whereas control and flows are in-session messages. The session setup contains initiator hello, responder hello, initiator initial keying, responder initial keying, responder hello cookie change and responder redirect. The control messages are ping, ping reply, re-keying initiate, re-keying response, close, close acknowledge, forwarded initiator hello. The flow messages are user data, next user data, buffer probe, user data ack (bitmap), user data ack (ranges) and flow exception report.

A new session starts with a handshake for the session setup. In the normal client-server case, the message flow is as follows:
 initiator (client)                target (server)
|-------initiator hello---------->|
|<------responder hello-----------|


Under peer-to-peer session setup case for NAT traversal, the server acts as a forwarder and forwards the hello to another connected client as follows:
 initiator (client)                forwarder (server)                     target (client)
|-------initiator hello---------->|                                       |
|                                 |---------- forwarded initiator hello-->|
|                                 |<--------- ack ----------------------->|
|<------------responder hello---------------------------------------------|


Alternatively, the server could redirect to another target by supplying an alternative list of target addresses as follows:
 initiator (client)                redirector (server)                     target (client)
|-------initiator hello---------->|
|<------responder redirect--------|
|-------------initiator hello-------------------------------------------->|
|<------------responder hello---------------------------------------------|


Note that the initiator, target, forwarder and redirector are just roles for session setup whereas client and server are specific implementations such as Flash Player and Flash Media Server, respectively. Even a server may initiate an initiator hello to a client in which case the server becomes the initiator and client becomes the target for that session. This mechanism is used for the man-in-middle mode in the Cumulus project.

The initiator hello may be forwarded to another target but the responder hello is sent directly. After that the initiator initial keying and the responder initial keying are exchanged (directly between the initiator and the responding target) to establish the session keys for the session between the initiator and the target. The four-way handshake prevents denial-of-service (DoS) via SYN-flooding and port scanning.

As mentioned before the handshake messages for session-setup use the symmetric AES key "Adobe Systems 02" (without the quotes), whereas in-session messages use the established asymmetric AES keys. Intuitively, the session setup is sent over pre-established AES cryptosystem, and it creates new asymmetric AES cryptosystem for the new session. Note that a session-id is established for the new session during the initial keying process, hence the first three messages (initiator-hello, responder-hello and initiator-initial-keying) use session-id of 0, and the last responder-initial-keying uses the session-id sent by the initiator in the previous message. This is further explained later.

Message Types

The 8-bit type values and their meaning are shown below.
---------------------------------
type meaning
---------------------------------
\x30 initiator hello
\x70 responder hello
\x38 initiator initial keying
\x78 responder initial keying
\x0f forwarded initiator hello
\x71 forwarded hello response

\x10 normal user data
\x11 next user data
\x0c session failed on client side
\x4c session died
\x01 causes response with \x41, reset keep alive
\x41 reset times keep alive
\x5e negative ack
\x51 some ack
---------------------------------
TODO: most likely the bit \x01 indicates whether the transport-address is present or not.

The contents of the various message payloads are described below.

Variable Length Data

The protocol uses variable length data and variable length number. Any variable length data is usually prefixed by its size-in-bytes encoded as a variable length number. A variable length number is an unsigned 28-bit number that is encoded in 1 to 4 bytes depending on its value. To get the bit-representation, first assume the number to be composed of four 7-bit numbers as follows
number = 0000dddd dddccccc ccbbbbbb baaaaaaa (in binary)
where A=aaaaaaa, B=bbbbbbb, C=ccccccc, D=ddddddd are the four 7-bit numbers

The variable length number representation is as follows. Every byte except the last has its most significant bit set to 1 to indicate that more bytes follow:
0aaaaaaa (1 byte)  if B = C = D = 0
1bbbbbbb 0aaaaaaa (2 bytes) if C = D = 0 and B != 0
1ccccccc 1bbbbbbb 0aaaaaaa (3 bytes) if D = 0 and C != 0
1ddddddd 1ccccccc 1bbbbbbb 0aaaaaaa (4 bytes) if D != 0


Thus a 28-bit number is represented as 1 to 4 bytes of variable length number. This mechanism saves bandwidth since most numbers are small and can fit in 1 or 2 bytes, but still allows values up to 2^28-1 in some cases.
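A small encoder and decoder for the variable length number might look like the sketch below. It assumes that the most significant bit of every byte except the last is a "more bytes follow" marker, which is what lets a decoder know where the number ends. The function names are my own.

```python
def encode_vln(number):
    """Encode an unsigned 28-bit number in 1 to 4 bytes, most significant group first."""
    if not 0 <= number < (1 << 28):
        raise ValueError("variable length number must fit in 28 bits")
    groups = [number & 0x7F]                   # last byte: high bit clear
    number >>= 7
    while number:
        groups.append((number & 0x7F) | 0x80)  # continuation bit (assumed)
        number >>= 7
    return bytes(reversed(groups))

def decode_vln(data, pos=0):
    """Return (number, next_pos) for the variable length number starting at pos."""
    number = 0
    for _ in range(4):
        byte = data[pos]
        pos += 1
        number = (number << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return number, pos
    raise ValueError("variable length number longer than 4 bytes")
```

Returning the next position from decode_vln makes it easy to read a length prefix and then slice out the variable length data that follows it.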

Handshake

The initiator-hello payload contains an endpoint discriminator (EPD) and a tag. The payload format is as follows:
initiator-hello payload = first | epd | tag

The first (8-bit) is unknown. The epd that follows is a variable length data that contains an epd-type (8-bit) and epd-value (remaining). Note that any variable length data is prefixed by its length as a variable length number. The epd is typically shorter than 128 bytes, so a single-byte length is enough. The tag is a fixed 128-bit (16 bytes) randomly generated data. The fixed sized tag does not encode its length.
epd = epd-type | epd-value

The epd-type is \x0a for a client-server session and \x0f for a peer-to-peer session. If the epd-type is peer-to-peer, then the epd-value is the peer-id, whereas if the epd-type is client-server the epd-value is the RTMFP URL that the client connects to. The initiator sets the epd-value such that the responder can tell whether the initiator-hello is meant for it, but an eavesdropper cannot deduce the identity from that epd. This is done, for example, using a one-way hash function of the identity.

The tag is chosen randomly by the initiator, so that it can match the response against the pending session setup. Once the setup is complete the tag can be forgotten.
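Putting the pieces together, an initiator-hello payload could be assembled roughly as below. The value of the unknown first byte is a placeholder of mine, the URL is made up, and the sketch only handles an EPD short enough for a one-byte length.

```python
import os

def build_initiator_hello(epd_type, epd_value, first=0x00):
    """Return (payload, tag) for an initiator-hello chunk."""
    epd = bytes([epd_type]) + epd_value
    if len(epd) > 127:
        raise ValueError("sketch only handles a one-byte EPD length")
    tag = os.urandom(16)  # random 128-bit tag, remembered until setup completes
    # 'first' is the unknown leading byte; 0x00 here is only a placeholder.
    payload = bytes([first, len(epd)]) + epd + tag
    return payload, tag

# Client-server EPD (type 0x0a) carrying a hypothetical RTMFP URL.
payload, tag = build_initiator_hello(0x0A, b"rtmfp://example.com/app")
```

The tag is returned alongside the payload so that the caller can keep it in a table of pending setups and match it against the tag echo in the responder-hello.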

When the target receives the initiator-hello, it checks whether the epd is for this endpoint. If it is for "another" endpoint, the initiator-hello is silently discarded to avoid port scanning. If the target is an introducer (server) then it can respond with a responder message of its own, or redirect/proxy the message with forwarded-initiator-hello to the actual target. In the general case, the target responds with responder-hello.

The responder-hello payload contains the tag echo, a new cookie and the responder certificate. The payload format is as follows:
responder-hello payload = tag-echo | cookie | responder-certificate

The tag echo is same as the original tag from the initiator-hello but encoded as variable length data with variable length size. Since the tag is 16 bytes, size can fit in 8-bits.

The cookie is a randomly and statelessly generated variable length data that can be used by the responder to only accept the next message if this message was actually received by the initiator. This eliminates the "SYN flood" attacks, e.g., if a server had to store the initial state then an attacker can overload the state memory slots by flooding with bogus initiator-hello and prevent further legitimate initiator-hello messages. The SYN flooding attack is common in TCP servers. The length of the cookie is 64 bytes, but stored as a variable length data.

The responder certificate is also a variable length data containing some opaque data that is understood by the higher level crypto system of the application. In this application, it uses the Diffie-Hellman (DH) secure key exchange as the crypto system.

Note that multiple EPDs might map to a single endpoint, and the endpoint has a single certificate. A server that does not care about the man-in-middle attack or does not create secure EPDs can generate a random certificate to be returned as the responder certificate.
certificate = \x01\x0A\x41\x0E | dh-public-num | \x02\x15\x02\x02\x15\x05\x02\x15\x0E

Here the dh-public-num is a 64-byte random number used for DH secure key exchange.
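The certificate layout and the responder-hello payload can be sketched as follows. The helper names are mine, the cookie is simply passed in (how it is generated statelessly is not covered here), and the length prefix is limited to one byte, which is enough for the sizes mentioned above.

```python
import os

CERT_PREFIX = b"\x01\x0A\x41\x0E"
CERT_SUFFIX = b"\x02\x15\x02\x02\x15\x05\x02\x15\x0E"

def random_certificate():
    """Certificate with a random 64-byte value in the dh-public-num slot."""
    return CERT_PREFIX + os.urandom(64) + CERT_SUFFIX

def with_length(data):
    """Prefix data with a one-byte length (valid only for fewer than 128 bytes)."""
    if len(data) > 127:
        raise ValueError("needs a multi-byte variable length number")
    return bytes([len(data)]) + data

def build_responder_hello(tag, cookie, certificate):
    """responder-hello payload = tag-echo | cookie | responder-certificate."""
    return with_length(tag) + with_length(cookie) + with_length(certificate)
```

With a 16-byte tag, a 64-byte cookie and the 77-byte certificate, every length fits in a single byte, so the whole payload is 160 bytes.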

The initiator does not open another session to the same target identified by the responder certificate. If it detects that it already has an open session with the target it moves the new flow requests to the existing open session and stops opening the new session. The responder has not stored any state so does not need to care. (In our implementation we do store the initial state for simplicity, which may change later). This is one of the reasons why the API is flow-based rather than session-based, and the session is implicitly handled at the lower layer.

If the initiator wants to continue opening the session, it sends the initiator-initial-keying message. The payload is as follows:
initiator-initial-keying payload = initiator-session-id | cookie-echo
| initiator-certificate | initiator-component | 'X'

Note that the payload is terminated by \x58 (or 'X' character).

The initiator picks a new session-id (32-bit number) to identify this new session, and uses it to demultiplex subsequent received packets. The responder uses this initiator-session-id as the session-id to format the scrambled session-id in the packets sent in this session.

The cookie-echo is the same variable length data that was received in the responder-hello message. This allows the responder to relate this message with the previous responder-hello message. The responder will process this message only if it thinks that the cookie-echo is valid. If the responder thinks that the cookie-echo is valid except that the source address has changed since the cookie was generated it sends a cookie change message to the initiator.

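In this DH crypto system, p and g are publicly known. In particular, g is 2, and p is a 1024-bit number. In the DH formulas that follow, ^ denotes exponentiation (not XOR as in the session-id scrambling) and % is the modulo operator. The initiator picks a new random 1024-bit DH private number (x1) and generates the 1024-bit DH public number (y1) as follows.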
y1 = g ^ x1 % p


The initiator-certificate is understood by the crypto system and contains the initiator's DH public number (y1) in the last 128 bytes.

The initiator-component is understood by the crypto system and contains an initiator-nonce to be used in DH algorithm as described later.

When the target receives this message, it generates a new random 1024-bit DH private number (x2) and generates 1024-bit DH public number (y2) as follows.
y2 = g ^ x2 % p


Now that the target knows the initiator's DH public number (y1), it generates the 1024-bit DH shared secret as follows.
shared-secret = y1 ^ x2 % p
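The DH arithmetic can be tried out with Python's built-in three-argument pow. The modulus below is a toy value chosen only so that the example runs. It is not the 1024-bit p used by RTMFP, which is not given in this article.

```python
import secrets

# Toy parameters for illustration only: 2**127 - 1 is a small Mersenne prime,
# not the 1024-bit modulus p that RTMFP peers actually use.
P = 2**127 - 1
G = 2

def dh_keypair():
    """Return (private, public) where public = G ** private mod P."""
    private = secrets.randbelow(P - 2) + 1    # x, kept secret; 1 <= x <= P-2
    public = pow(G, private, P)               # y = g^x mod p
    return private, public

x1, y1 = dh_keypair()   # initiator
x2, y2 = dh_keypair()   # target
secret_at_target = pow(y1, x2, P)      # y1^x2 mod p
secret_at_initiator = pow(y2, x1, P)   # y2^x1 mod p
assert secret_at_target == secret_at_initiator
```

Both sides end up with the same number because (g^x1)^x2 and (g^x2)^x1 are equal modulo p, even though neither private number ever crosses the wire.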


The target generates a responder-nonce to be sent back to the initiator. The responder-nonce is as follows.
responder-nonce = \x03\x1A\x00\x00\x02\x1E\x00\x81\x02\x0D\x02 | responder's DH public number


The peer-id is the 256-bit SHA256 (hash) of the certificate. At this time the responder knows the peer-id of the initiator from the initiator-certificate.
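In Python the peer-id computation is a single call to hashlib. The sketch assumes the hash covers the complete certificate bytes as received, and the sample input is made up.

```python
import hashlib

def peer_id(certificate):
    """256-bit peer-id: the SHA-256 digest of the peer's certificate bytes."""
    return hashlib.sha256(certificate).digest()

# Peer-ids are usually shown in hex; the certificate bytes here are a placeholder.
print(peer_id(b"example certificate bytes").hex())
```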

The target picks a new 32-bit responder's session-id number to demultiplex subsequent packets for this session. At this time the server creates a new session context to identify the new session. It also generates asymmetric AES keys to be used for this session using the shared-secret and the initiator and responder nonces as follows.
decode key = HMAC-SHA256(shared-secret, HMAC-SHA256(responder nonce, initiator nonce))[:16]
encode key = HMAC-SHA256(shared-secret, HMAC-SHA256(initiator nonce, responder nonce))[:16]

The decode key is used by the target to AES decode incoming packets containing this responder's session-id. The encode key is used by the target to AES encode outgoing packets to the initiator's session-id. Only the first 16 bytes (128 bits) are used as the actual AES encode and decode keys.

The target sends the responder-initial-keying message back to the initiator. The payload is as follows.
responder-initial-keying payload = responder session-id | responder's nonce | 'X'

Note that the payload is terminated by \x58 (or 'X' character). Note also that this handshake response is encrypted using the symmetric (handshake) AES key instead of the newly generated asymmetric keys.

When the initiator receives this message it also calculates the AES keys for this session.
encode key = HMAC-SHA256(shared-secret, HMAC-SHA256(responder nonce, initiator nonce))[:16]
decode key = HMAC-SHA256(shared-secret, HMAC-SHA256(initiator nonce, responder nonce))[:16]

As before, only the first 16 bytes (128-bits) are used as the AES keys. The encode key of initiator is same as the decode key of the responder and the decode key of the initiator is same as the encode key of the responder.
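The key derivation on both sides can be sketched with the standard hmac module. I am assuming that the first argument of HMAC-SHA256 in the formulas is the HMAC key and the second is the message, and that the shared secret has already been converted from a number to bytes. Neither detail is spelled out above.

```python
import hashlib
import hmac

def hmac_sha256(key, message):
    """HMAC-SHA256 with the first argument as the key (assumed argument order)."""
    return hmac.new(key, message, hashlib.sha256).digest()

def target_keys(shared_secret, initiator_nonce, responder_nonce):
    """(decode_key, encode_key) as computed on the target (responder) side."""
    decode_key = hmac_sha256(shared_secret, hmac_sha256(responder_nonce, initiator_nonce))[:16]
    encode_key = hmac_sha256(shared_secret, hmac_sha256(initiator_nonce, responder_nonce))[:16]
    return decode_key, encode_key

def initiator_keys(shared_secret, initiator_nonce, responder_nonce):
    """(decode_key, encode_key) on the initiator side: the target's pair, swapped."""
    target_decode, target_encode = target_keys(shared_secret, initiator_nonce, responder_nonce)
    return target_encode, target_decode
```

Swapping the nonce order is what makes the two directions use different keys, so a packet reflected back to its sender will not decrypt correctly.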

When a server acts as a forwarder, it receives an incoming initiator-hello and sends a forwarded-initiator-hello in an existing session to the target. The payload is as follows.
forwarded initiator hello payload := first | epd | transport-address | tag


The first 8-bit value is \x22. The epd value is same as that in the initiator-hello -- a variable length data containing epd-type and epd-value. The epd-type is \x0f for a peer-to-peer session. The epd-value is the target peer-id that was received as epd-value in the initiator-hello.

The tag is echoed from the incoming initiator-hello and is a fixed 16 bytes value.

The transport address contains a flag for indicating whether the address is private or public, the binary bits of IP address and optional port number. The transport address is that of the initiator as known to the forwarder.
transport-address := flag | ip-address | port-number

The flag is an 8-bit number whose most significant bit is 1 if the port-number is present, otherwise 0. The least significant two bits are 10b for a public IP address and 01b for a private IP address.

The ip-address is either 4-bytes (IPv4) or 16-bytes (IPv6) binary representation of the IP address.

The optional port-number is 16-bit number and is present when the flag indicates so.
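Encoding a transport address as described above might look like this. The function name and its parameters are my own, and decoding is left out because the text does not say how a receiver tells a 4-byte address from a 16-byte one.

```python
import ipaddress
import struct

def encode_transport_address(host, port=None, public=True):
    """Encode flag | ip-address | optional port-number."""
    flag = 0x02 if public else 0x01        # low two bits: 10b public, 01b private
    if port is not None:
        flag |= 0x80                       # most significant bit: port present
    encoded = bytes([flag]) + ipaddress.ip_address(host).packed  # 4 or 16 bytes
    if port is not None:
        encoded += struct.pack(">H", port)
    return encoded

address = encode_transport_address("192.1.2.3", 1935, public=True)
```

The ipaddress module accepts both IPv4 and IPv6 strings, so the same function covers the 4-byte and the 16-byte form.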

The server then sends a forwarded-hello-response message back to the initiator with the transport-address of the target.
forwarded-hello-response = transport-address | transport-address | ...

The payload is basically one or more transport addresses of the intended target, with the public address first.

After this the initiator client directly sends subsequent messages to the responder, and vice-versa.

A normal-user-data message type is used to deal with any user data in the flows. The payload is shown below.
normal-user-data payload := flags | flow-id | seq | forward-seq-offset | options | data

The flags, an 8-bit number, indicate fragmentation, options-present, abandon and/or final. The following table gives the meaning of the bits from most significant to least significant.
bit   meaning
0x80  options are present if set, otherwise absent
0x40
0x20  with beforepart
0x10  with afterpart
0x08
0x04
0x02  abandon
0x01  final

The flow-id, seq and forward-seq-offset are all variable length numbers. The flow-id is the flow identifier. The seq is the sequence number. The forward-seq-offset is used for partially reliable in-order delivery.
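A parser for the fixed part of the normal-user-data payload could be sketched as below. Since the options format is still a TODO here, the sketch refuses payloads with the options bit set. It also assumes the variable length number uses the high bit of each byte as a "more bytes follow" marker. All names are hypothetical.

```python
def read_vln(data, pos):
    """Read a variable length number (high bit = more bytes follow, an assumption)."""
    number = 0
    while True:
        byte = data[pos]
        pos += 1
        number = (number << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return number, pos

def parse_user_data(payload):
    """Decode flags, flow-id, seq, forward-seq-offset and data (no options)."""
    flags = payload[0]
    if flags & 0x80:
        raise NotImplementedError("options format is not defined in this article")
    flow_id, pos = read_vln(payload, 1)
    seq, pos = read_vln(payload, pos)
    forward_seq_offset, pos = read_vln(payload, pos)
    return {
        "with_beforepart": bool(flags & 0x20),
        "with_afterpart": bool(flags & 0x10),
        "abandon": bool(flags & 0x02),
        "final": bool(flags & 0x01),
        "flow_id": flow_id,
        "seq": seq,
        "forward_seq_offset": forward_seq_offset,
        "data": payload[pos:],
    }
```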

The options are present only when the flags indicate so using the most significant bit as 1. The options are as follows.

TODO: define options

The subsequent data in the fragment may be sent using next-user-data message with the payload as follows:
next-user-data := flags | data

This is just a compact form of the user data when multiple user data messages are sent in the same packet. The flow-id, seq and forward-seq-offset are implicit, i.e., flow-id is same and subsequent next-user-data have incrementing seq and forward-seq-offset. Options are not present. A single packet never contains data from more than one flow to avoid head-of-line blocking and to enable priority inversion in case of problems.

TODO

Will update this article in future:
- Fill in the description of the remaining message flows beyond handshake.
- Describe the man-in-middle mode that enables audio/video flowing through the server.

Three Problems in Interoperating with H.264 of Flash Player

H.264 decoding has been part of Flash Player since version 9, but H.264 encoding was only recently added in version 11. Once the Flash Player 11 beta was out I started looking into integrating video translation in the SIP-RTMP gateway project. For a Flash-to-Flash video conference you do not need to understand the problems related to H.264 in Flash Player because everything is taken care of behind the scenes by Flash Player. Adding H.264 support in the flash-videoio project was relatively straightforward. However, if you are building your own translator to interoperate video between Flash Player and some other application you will need to understand these problems.

1) The first problem is that Flash Player doesn't enable H.264, even for decoding, if the RTMP connection does not use the new-style "secure" handshake. In older versions, handshaking with bytes containing zeros worked, but it does not work when using H.264. Eventually I found out about this from an open-source-flash (osflash) forum post and incorporated it in my gateway.

2) The H.264 encoder generates some sequence headers (called SPS and PPS) which are essential in decoding the rest of the video data packets. The same is true with AAC audio codec. In particular in live H.264 publish mode, Flash Player generates periodic SPS/PPS packets so the other Flash Player (or SIP phone) can join the call later and still be able to start decoding the stream. However, some existing SIP video phones generate the sequence packets only once at the beginning. The SIP-RTMP gateway needed to be modified to cache the sequence packets received from non-Flash Player client and re-send them with correct timestamp to the Flash Player client that joined the stream late.

3) Looks like Flash Player 11.0 changed something related to buffering of live stream, which causes problems if the SIP side generates multiple slice NALU (primitive data units in H.264) per frame. The Flash Player itself generates one NALU per frame, however some existing SIP video phones (e.g., Bria 3) generate old-style multiple slice per frame and one NALU per slice and cannot be decoded and displayed in Flash Player 11 in live mode. You can read more about the problem. This is not a problem for buffered playback though. (update on 12/12/2011 -- I can verify that this bug has been fixed in Flash Player 11.2.202.96 and video call works fine now between Bria 3 and Flash Player via my SIP-RTMP gateway)

Ekiga SIP phone uses the new-style RTP mechanism for fragmenting a full H.264 frame instead of using multiple slices in H.264 encoding. This can be easily translated to Flash Player and works with my SIP-RTMP gateway. However, Ekiga has another problem in incorrectly interpreting RTP timestamp of received stream which makes it play the stream much slower.

The Philosophy of Open Source

I recently read Henrik Ingo's book "Open Life: The Philosophy of Open Source". I strongly recommend it to software developers as well as technical managers. Here I present some excerpts from the book that I find very interesting.

"If a buyer is willing to pay a lot for it, then a cheap product can be sold at a high price... It is not stupid to ask a high price, but it is to pay it."

"The law of supply and demand can lead to situations that seem strange when common sense is applied to them. ... The oil is no different to the oil that was on sale at a considerably cheaper price just the day before... When supply goes down, the price goes up - even if all else remains equal."

"Often, the kind of stuff branded a trade secret can also be absurdly insignificant, but the important thing is that they don't tell others about it. Today's companies are at least as interested in the things they don't do as the things they pretend to be doing and producing... There's an ominous sense that much of what we do is done with a logic of mean-spiritedness, whether it is in business or in our everyday lives!"

"In a word, Europe's farming policy is based on mean-spiritedness. The subsidies policy is based on farmers agreeing not to produce more food than their agreed quota (to keep the supply low and prices high)."

"The logic of mean-spiritedness that follows from the law of supply and demand, can also be found in all fields of commerce where there is any so called 'immaterial property', including IT, music, film, and other kinds of entertainment, but the most glaring examples of it occur within the world of computers."

"These three demands -- features, quality, and deadline -- would build certain tension into any project. If, for instance, the schedule is too tight, there may not be enough time to include all the features you want. But if a project manager leans on his team, demanding that all the features are included and the deadline be met, then they are compelled to do a rushed job and, inevitably, quality suffers. ... The Open Source community's no-deadlines principle makes excellent sense, and is probably one of the reasons Open Source programs are so successful... One of the most frequently asked questions at the time was, 'When will the next version of Linux be released?' Linus had a stock answer, which was always, 'When it's done.'"

"Why do people do things? The first reason is survival. The second reason is to have social life. And the third reason is that people do things for fun. In that order... Since we work to have fun, to enjoy it, then why do we drive ourselves into the ground trying to meet artificial deadlines?"

"Usually, the vision and business strategies which guide a company are created in the upper echelons of management, after which it's up to the employees to do whatever the boss requires of them...But the principle of 'do whatever you like' would suggest that the management should quit producing the whole vision and business strategies, and focus instead on making it possible for employees to realize their own vision as best they can. (Unfortunately) For many managers such a concept would seem totally alien."

"The lazier the programmer, the more code he writes. .. Typing is too arduous for him, so he writes the code for a word processing program... Because it's too much effort to print out a letter and take it to the postbox, he writes the code for e-mail... So, laziness is a programmer's prime virtue."

"It's not healthy for one's central motivation to be hatred and fear. And what if one day Linux did manage to bring down Microsoft? Would life then lose its meaning? In order to energize themselves, would the programmers then have to find some new and fearful threat to compete against?"

"Since the beginning of hacking, Open Source hackers have always made programs to suit their own needs. ... As a client, the Federal Republic of Germany accepted this logic, and they aren't likely to have any reason to complain. Not only did they get what they wanted, they got a high-quality solution, they got it cheap, and they got it fast. What could be unfair about that?"

"An interesting situation -- IBM had to keep developing Eclipse; yet, financially, investing in it was a bad idea. The solution, of course, was Open Source."

"A company that has calculated its tender openly is much easier to trust. If I were to receive an honest tender of 1,000,000 from a company that operated with open principles, and the tender from a closed company came in at 999,500, I am likely to laugh at the latter and accept the former."

"I once read somewhere about a study which showed that about 20 percent of ants in an anthill do totally stupid things, such as tear down walls the other ants have just built, or take recently gathered food and stow it where none of them will ever find it, or do other things to sabotage everything the other ants so diligently try to achieve. The theory is that these ants don't do these things out of malice but simply because they're stupid... Critics of Open Source projects claim that their non-existent hierarchy and lack of organization leads to inefficiency... If a number of people do some stupid things, we make a rule to say it mustn't be done. Then we need a rule that says everybody has to read the rules. Before long, we need managers and inspectors to make sure people read and follow the rules and that nobody does anything stupid, even by mistake. Finally, the organization has a large group of people who spend time thinking up and writing rules, and enforcing them. And those not involved in doing, are primarily concerned with not breaking the rules...However, Linux and Wikipedia prove the opposite is true... This is particularly true when you factor in that not all planners (managers) are all that smart. Which means organizations risk having their entire staff doing something really inane, because that's what somebody planned. So, it seems better to have a little overlapping and lack of planning, because at least you have better odds for some of the overlapping activities actually making sense..."

Internet Video Communication: Past, Present and Future


I gave a presentation last month titled Hello 1 2 3, can you hear see me now? highlighting my point of view on the origins of Internet video communication technologies we see today. The full text of the presentation can be found at Internet video communication: past, present and future.

Modern video communication systems have roots in several technologies: transporting video over phone lines, using multicast on Internet2's Mbone, adding video to voice-over-IP (VoIP), and adding interactivity in existing streaming applications. Although the Internet telephony and multimedia communication protocols have matured over the last fifteen years, they are largely being used for interconnectivity among closed networks of telecom services. Recently, the world wide web has evolved as a popular platform for everything we do on the Internet including email, text chat, voice calls, discussions, enterprise applications and multi-party collaboration. Unfortunately, there is a disconnect between the web and traditional Internet telephony protocols as they have ignored the constraints and requirements of each other. Consequently, Adobe's Flash Player is being used as a web browser plugin by many developers for voice and video calls over the web.

Learning from the mistakes of the past and knowing where we stand at present will help us build the Internet video communication systems of the future. I present my point of view on the evolution, challenges and mistakes of the past, and, moving forward, describe the challenges in bridging the gap between web and VoIP. I highlight my contributions at various stages in the journey of Internet audio/video communication protocols.

Flash-based audio and video communications in the cloud


Internet telephony and multimedia communication protocols have matured over the last fifteen years. Recently, the web has been evolving as a popular platform for everything we do on the Internet including email, text chat, voice calls, discussions, enterprise apps and multi-party collaboration. Unfortunately, there is a disconnect between the web and traditional Internet telephony protocols as they have ignored the constraints and requirements of each other. Consequently, the Flash Player is being used as a web browser plugin by many developers for web-based voice and video calls. We describe the challenges of video communication using a web browser, present a simple API using a Flash Player application, show how it supports a wide range of web communication scenarios in the cloud, and describe how it can interoperate with Session Initiation Protocol (SIP)-based systems. We describe both the advantages and challenges of Flash Player based communication applications. The presented API could guide future work on communication-related web protocol extensions.

More details are available in our white-paper. The associated software and example use cases are available as flash-videoio and siprtmp projects. The white-paper also serves as the architecture and design document of these projects.

Voice and Video Communications on Web


I co-authored and presented a paper on "SIP APIs for voice and video communications on the web" at IPTcomm 2011. The paper compares various alternative architectures, and presents the components of our ongoing project at IIT, Chicago. We are open to sponsorship of the project to further continue its R&D work. Please feel free to get in touch with me or Prof. Davids if you are interested in sponsoring student projects in her lab related to this technology.

The paper and the presentation slides are available. The project page, open source code, and free demonstration page are also available.

Abstract: Existing standard protocols for the web and Internet telephony fail to deliver real-time interactive communication from within a web browser. In particular, the client-server web protocol over reliable TCP is not always suitable for end-to-end low latency media path needed for interactive voice and video communication. To solve this, we compare the available platform options using the existing technologies such as modifying the web programming language and protocol, using an existing web browser plugin, and a separate host resident application that the web browser can talk to. We argue that using a separate application as an adaptor is a promising short term as well as long-term strategy for voice and video communications on the web. Our project aims at developing the open technology and sample implementations for web-based real-time voice and video communication applications. We describe the architecture of our project including (1) a RESTful web communication API over HTTP inspired by SIP message flows, (2) a web-friendly set of metadata for session description, and (3) an UDP-based end-to-end media path. All other telephony functions reside in the web application itself and/or in web feature servers. The adaptor approach allows us to easily add new voice and video codecs and NAT traversal technologies such as Host Identity Protocol. We want to make web-based communication accessible to millions of web developers, maximize the end user experience and security, and preserve the huge global investment in and experience from SIP systems while adhering to web standards and development tools as much as possible. We have created an open source prototype that allows you to freely use the conference application by directing a browser to the conference URL.

A Proposal for Reference Implementation Repository of SIP-related RFCs

One of the root causes of non-interoperable implementations is the misinterpretation of the specification. A number of people have claimed that SIP has become complicated and has failed to deliver its promise of mix-and-match interoperability. There are two main reasons: (a) the number of SIP related RFCs and drafts is growing faster than what a developer or product life-cycle can catch up with, and (b) many of the RFCs and drafts are not supported by an open implementation which results in misinterpretation of some aspects of the specification by the programmers. The job of a SIP programmer is to (1) read the RFC and draft for SIP or its extensions, (2) understand the logic and figure out how it fits in the big picture or how it relates to the existing SIP source code, (3) come up with some object-oriented class diagram, classes' properties and pseudo-code for their methods, and finally (4) implement the classes and methods.

Clearly the text in RFCs and drafts cannot be as unambiguous as real source code of a computer program. So many programmers may read and implement some features differently, resulting in non-interoperable implementations. Having a readily available pseudo-code for SIP and many of its extensions relieves the programmer of error-prone step (2) above, and resolves any misinterpretation at an early stage. There is a huge cost paid by the vendor or provider for this programmer's misinterpretation of the specification.

This project proposal is to keep an open and public repository of reference implementation of RFC 3261 and other SIP-related extensions. If this repository is maintained by public bodies such as SIPForum and open source community, it will enable easy access to developers and enable better interoperability of new extensions.

The goal of this effort will be to encourage submission of reference implementations by RFC and Internet Draft authors. In case of any ambiguity, the clarification will be applied not only to the specification but also to the reference implementation.

If we use a very high level language such as Python then the reference implementation essentially also serves as a pseudo code, which can be ported to other programming languages. The goal is not to get involved in the syntax of a particular programming language, but just express the ideas more formally to prevent misinterpretation of the specification. Perhaps if Python is not suitable, then a similar high level language syntax can be defined.

This will greatly simplify the job of a programmer, and in the long term, will result in more interoperable and robust products seamlessly supporting new SIP extensions and features. The programmers will have fewer things to worry about, and hence can write more accurate code in a shorter time. From a specification author's point of view, it will encourage him/her to write a more solid and implementable specification without ambiguity, and to provide the pseudo-code in the draft. From a reviewer's point of view, one can easily judge the complexity of various alternatives or features, e.g., one can say that adding the extension 'foo' is just 10 lines of pseudo-code to the base SIP reference implementation.

It will help RFC and draft authors in seeing the complexity and implementation aspects of their proposal. Sometimes an internet-draft proposes multiple solutions without any details on them. This is partially due to the lack of implementation and complexity evaluation of the various approaches. With reference implementation and pseudo-code repository, the author can provide a patch to the existing code to evaluate the complexity of the proposal.

A few years ago I wrote a tool to annotate software source code with RFC/draft, so that when you are reading a class or method in a source code file, you can quickly know which part of the RFC/draft it implements. Please see an example here and here. Such annotations in a reference implementation will help in correlating the RFC/draft with the actual implementation.

If there is wide support for this proposal, we can raise it with SIPForum or other bodies, and we can help bootstrap the repository with reference implementations of a few SIP-related RFCs. Then we can invite contributions from the community and RFC/draft authors towards completing the implementations. Please post your comment to let us know what you think.

RESTful communication over WebSocket

This article shows how to implement generic resource oriented communication on top of synchronous channel such as WebSocket. This is a continuation of my previous article on REST and SIP [1] and provides more concrete thoughts because I now have an actual implementation of part of this in my web conferencing application. Other people have commented on the idea of REST on WebSocket [2]. (Using the term RESTful, which inherently is stateless, is confusing with a stateful WebSocket transport. Changing the title of this article to "Resource oriented and event-based communication over WebSocket" is more appropriate.)

Following are the basic principles of such a mechanism.
  1. Enable resource-oriented (instead of RPC-style) communication.
  2. Enable asynchronous notification when a resource (or its child resource) changes.
  3. Enable asynchronous event subscribe, publish, and notification.
  4. Enable Unix file system style access control on resources.
  5. Enable the concept of transient vs persistent resources.
Consider my favorite example, a web conferencing application. The list of logged-in users is represented as the resource /login relative to the hosting domain, and the list of calls as /call. If the provider supports the concept of registered subscribers, those can be at /user. For example, /user/kundan10@gmail.com can be my user profile.

Now let us look at how the five principles apply in this example.

1) Enable resource-oriented (instead of RPC-style) communication.

Every resource has its own representation, e.g., in JSON or XML. For example, /login/bob@home.com can be {"name": "Bob Smith", "email": "bob@home.com", "has-video": true, "has-audio": true}. The client-server communication can be over HTTP using standard RESTful or over WebSocket to access these resources.

Over WebSocket, the client sends a request of the form '{"method":"PUT","resource":"/login/bob@home.com", "type":"application/json","entity":{"name":"Bob Smith", ...}}' to login. The server treats this request the same as a RESTful PUT request received over plain HTTP.

A resource-oriented (instead of RPC-style) communication allows us to keep all the business logic in the client, which uses the server only as a data store. The standard HTTP methods allow access to such data, e.g., POST to create, PUT to update, GET to read and DELETE to delete. POST is a special method that must return the resource identifier of the newly created resource. For example, when a client does POST /call to create a new call, the server returns {"id": "conf123"} to indicate that the new resource identifier is "conf123" relative to /call and can be accessed at "/call/conf123".
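The request handling described above can be sketched as follows. This is only an illustration in Python, not the actual server code: the class name ResourceStore, the "code" and "reason" response fields, and the way identifiers are generated for POST are all assumptions made for this example. Only the "method", "resource", "type", "entity" and "id" attributes come from the description above.

```python
import itertools
import json


class ResourceStore:
    """In-memory stand-in for the server-side data store (hypothetical)."""

    def __init__(self):
        self.resources = {}              # resource path -> (type, entity)
        self._ids = itertools.count(1)   # assumed source of ids for POST

    def handle(self, message):
        """Process one JSON text request and return a JSON text response."""
        request = json.loads(message)
        method, path = request["method"], request["resource"]
        value = (request.get("type"), request.get("entity"))
        if method == "PUT":              # create or update at a known path
            self.resources[path] = value
            return json.dumps({"code": "success"})
        if method == "POST":             # create a child; the server picks the id
            child = "id%d" % next(self._ids)
            self.resources[path + "/" + child] = value
            return json.dumps({"code": "success", "id": child})
        if method == "GET":
            if path not in self.resources:
                return json.dumps({"code": "failed", "reason": "not found"})
            type_, entity = self.resources[path]
            return json.dumps({"code": "success", "type": type_, "entity": entity})
        if method == "DELETE":
            removed = self.resources.pop(path, None) is not None
            return json.dumps({"code": "success" if removed else "failed"})
        return json.dumps({"code": "failed", "reason": "unknown method"})
```

The same handle() function could sit behind either a WebSocket message callback or an HTTP request handler, which is the point of treating both transports alike.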

2) Enable asynchronous notification when a resource (or its child resource) changes.

Many web applications including web conferencing need the notion of asynchronous notifications, e.g., when a user is online, or a user joins/leaves a call. Traditionally, Internet communication has used protocols such as SIP and XMPP for asynchronous notifications. With the advent of WebSocket (and the popular socket.io project) it is possible to implement persistent client-server connection for asynchronous notifications and messages within the web browser.

In this mechanism, a generic notification architecture is applied to resources. A new method named "SUBSCRIBE" is used to subscribe to any resource. A subscriber receives a notification whenever the resource or its immediate children are changed (created, updated or deleted). For example, a conference participant sends the following over WebSocket: '{"method":"SUBSCRIBE","resource":"/call/conf123"}'. Whenever the server detects that a new PUT is done for "/call/conf123/participant12" or a new POST is done for "/call/conf123" it sends a notification message to the subscriber over WebSocket: '{"notify":"UPDATE","resource":"/call/conf123","type":"application/json","entity":{ ... child resource}, "create":"participant12"}'. On the other hand, if the moderator does a PUT on "/call/conf123", then the server sends a notification as '{"notify":"PUT","resource":"/call/conf123","type":"application/json", "entity":{... parent resource}}'. In summary, when a resource such as "/call/conf123/participant12" is modified, the server notifies the subscribers of both the modified resource and its parent resource, "/call/conf123".

The notification message contains a new "notify" attribute instead of re-using the "method" attribute to indicate the type of notification. For example, "PUT", "POST" or "DELETE" means that the resource identified in the "resource" attribute has been modified using that method by another client. In this case the "type" and "entity" attributes represent the "resource". Similarly, "UPDATE" means that a child resource has been modified and the child resource identifier is in the "create", "update" or "delete" attribute. In this case the "type" and "entity" attributes represent the child resource identified in "create", "update" or "delete".
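A minimal sketch of how a server might build these two kinds of notification is shown below. The Notifier class and its method names are hypothetical, and the caller is assumed to already know whether the change created, updated or deleted the resource. The message attributes follow the examples given above.

```python
import json

CHILD_KEYS = ("create", "update", "delete")


class Notifier:
    """Tracks subscriptions and builds notification messages (hypothetical)."""

    def __init__(self):
        self.subscribers = {}   # resource path -> set of client ids

    def subscribe(self, client_id, path):
        self.subscribers.setdefault(path, set()).add(client_id)

    def on_change(self, method, path, entity, change, type_="application/json"):
        """Return (client_id, message) pairs for one modification of `path`.

        method is the request method ("PUT", "POST" or "DELETE");
        change says what happened to the resource: one of CHILD_KEYS.
        """
        if change not in CHILD_KEYS:
            raise ValueError("change must be one of %r" % (CHILD_KEYS,))
        out = []
        # Subscribers of the modified resource see the method itself.
        direct = json.dumps({"notify": method, "resource": path,
                             "type": type_, "entity": entity})
        for client in sorted(self.subscribers.get(path, ())):
            out.append((client, direct))
        # Subscribers of the parent see "UPDATE" plus the child identifier.
        parent, _, child = path.rpartition("/")
        if parent:
            via_parent = json.dumps({"notify": "UPDATE", "resource": parent,
                                     "type": type_, "entity": entity,
                                     change: child})
            for client in sorted(self.subscribers.get(parent, ())):
                out.append((client, via_parent))
        return out
```

Returning the messages instead of sending them keeps the sketch independent of any particular WebSocket library.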

The concept of notifications when a resource changes is available in the ActionScript programming language. For example, a markup text can use width="{size}" to bind the "width" property of a user interface component to the "size" variable. Whenever the "size" changes the "width" is updated. A property change event is dispatched to enable the notification. Similarly, in our mechanism, the client application can subscribe to a resource to detect a change in its value or in the value of its child resources.

3) Enable asynchronous event subscribe, publish, and notification

The previous point enables a client to receive notification when a resource changes and these notifications are server generated notifications. Additionally, we need a generic end-to-end publish-subscribe mechanism to allow a client to send notification to all the subscribers without dealing with a resource modification. This allows end-to-end notifications from one client to others, via the server.

When a client subscribes to a resource, it also receives generic notifications sent by another client on that resource. A new NOTIFY method is defined. For example, if a client sends '{"method":"NOTIFY","resource":"/login/bob@home.com","type":"application/json","data":{"invite-to":"/call/conf123","invite-id":"6253"}}', and another client is subscribed to /login/bob@home.com, then it receives a notification message as '{"notify":"NOTIFY", "resource":"/login/bob@home.com","type":"application/json","data":{...}}'. In summary, the server just passes the "data" attribute to all the subscribers. The "notify" value of "NOTIFY" means an end-to-end notification generated because another client sent a NOTIFY method.

In a web conferencing application, most of the notifications are associated with a resource, e.g., conference membership change, presence status change, etc. Some notifications such as call invitation or cancel can be independent of a resource, and the NOTIFY mechanism can be used. For example, sending a NOTIFY to /login/bob@home.com is received by all the subscribers of this resource.
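The pass-through behavior of NOTIFY can be sketched in a few lines. The function name relay_notify and the shape of the subscribers argument are assumptions for this example. The server does not interpret the "data" attribute at all.

```python
import json


def relay_notify(request_text, subscribers):
    """Build the messages a server would fan out for one NOTIFY request.

    subscribers maps a resource path to an iterable of client ids
    (a hypothetical structure chosen for this sketch).
    """
    request = json.loads(request_text)
    if request.get("method") != "NOTIFY":
        raise ValueError("not a NOTIFY request")
    message = json.dumps({
        "notify": "NOTIFY",
        "resource": request["resource"],
        "type": request.get("type"),
        "data": request.get("data"),   # passed through untouched
    })
    targets = subscribers.get(request["resource"], ())
    return [(client, message) for client in targets]
```

For a call invitation, the caller sends NOTIFY to /login/bob@home.com and every client subscribed to that resource gets the same "data" object.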

4) Enable Unix file system style access control on resources.

Without an authentication and access control mechanism, the resource oriented communication protocol described earlier becomes useless. Fortunately, it is possible to design a generic access control mechanism similar to that of the Unix file system. Essentially, each resource is treated as both a file and a directory. In this analogy, all the child resources of this resource belong to the directory, whereas the resource entity belongs to the file. The service provider can configure top-level directories with user permissions, e.g., anyone can add a child to "/user", and once added it will be owned by that user. Thus if user Bob creates /user/bob, then Bob owns the subtree of this resource. It is up to Bob to configure the permissions of its child resources. For example, he can configure /user/bob/inbox to be writable by anyone but readable only by himself, e.g., permissions "rwx-w--w-". This allows a web based voice and video mail application.

Unlike the traditional file system data model with create, update, read and write, we also need a permission bit for subscription. For example, only Bob should be allowed to subscribe to /user/bob so that other users cannot get notifications sent to Bob. The concept of a group is a little vague but can be incorporated in this mechanism as well. Finally, a notion of anonymous user needs to be added so that any client which does not have an account with the service provider can also get permissions if needed.

In summary, the permissions become five bits for each of the four categories: self, group, others-authenticated, others-anonymous. The five bits define permissions to allow create, read, update, write and subscribe. Existing authentication such as HTTP basic/digest, cookies or OAuth based sessions can be used to actually authenticate the client.
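One possible encoding of these permissions is sketched below. The bit values, the names Perm, category_of and is_allowed, and the rule that a missing user means anonymous are all assumptions for illustration. Only the five permission names and the four categories come from the text above.

```python
from enum import IntFlag


class Perm(IntFlag):
    """Five permission bits per category (bit values are arbitrary)."""
    CREATE = 16
    READ = 8
    UPDATE = 4
    WRITE = 2
    SUBSCRIBE = 1


CATEGORIES = ("self", "group", "others-authenticated", "others-anonymous")


def category_of(user, owner, group=()):
    """Classify the requesting user relative to the resource owner."""
    if user is None:            # assumed convention: None means not logged in
        return "others-anonymous"
    if user == owner:
        return "self"
    if user in group:
        return "group"
    return "others-authenticated"


def is_allowed(permissions, user, owner, wanted, group=()):
    """Check that every bit in `wanted` is granted to the user's category.

    permissions maps a category name to a Perm value; a missing category
    grants nothing.
    """
    granted = permissions.get(category_of(user, owner, group), Perm(0))
    return (granted & wanted) == wanted
```

With this sketch, the inbox example becomes a mapping that grants READ, WRITE and SUBSCRIBE to "self" and only WRITE to the other categories.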

5) Enable the concept of transient vs persistent resources.

In software programming, application developers typically use local and global variables to represent transient and persistent data respectively. A similar concept is needed in the generic resource oriented communication application. So far we have seen how to read, write, update and create resources. Each resource can be transient, so that it is deleted when the client which created the resource is disconnected, or persistent which remains even after the client disconnects. For example, when a client POSTs to /call/conf123, it wants that resource to be transient which gets deleted when the client is disconnected. This causes the resource to be used as a conference membership resource, and the server notifies other participants when an existing participant is disconnected. On the other hand, when a client POSTs to /user/bob@home.com, it wants it to be the persistent user profile which is available even when this user has disconnected.

The concept of transient and persistent in the resource-oriented system allows the client to easily create a variety of applications without having to write custom kludges. In general a new resource should be created as transient by default, unless the client requests a persistent resource. Whenever the client disconnects the WebSocket all the transient resources (or local variables) of that client are deleted, and appropriate notifications are sent to the subscribers as needed.

Implementation

I have implemented part of this concept in my web conferencing project. The server side (called the service provider) is a generic PHP "restserver.php" application that accepts WebSocket connections and uses a back-end MySQL database to store and manage resources and subscriptions. Each connected client is assigned a unique client-id. There are two database tables: the resource table has fields resource-id (e.g., "/login/bob@home.com"), parent-resource-id (e.g., "/login"), type (e.g., "application/json"), entity (i.e., actual JSON representation string), and client-id, whereas the subscribe table has fields resource-id of the target resource and client-id of the subscriber. The subscriptions are always transient, i.e., when the client disconnects all the subscribe rows are removed for that client-id. The resources can be transient or persistent. By default any new resource is created as transient and the client-id is stored in that row. When the client disconnects all the resources with the matching client-id are removed and appropriate notifications generated. A client can create a persistent resource by supplying the "persistent": true attribute in the PUT or POST request, and the server stores an empty client-id for that new resource.
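The two tables and the cleanup on disconnect can be sketched as follows. This is not the restserver.php code: it uses Python with an in-memory SQLite database in place of PHP and MySQL, and the column names (rid, prid, cid) and function names are invented for the example. It also assumes that real client-ids are never empty, since an empty client-id marks a persistent resource.

```python
import sqlite3


def open_store():
    """Create the two tables in an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE resource (
            rid    TEXT PRIMARY KEY,  -- e.g. /login/bob@home.com
            prid   TEXT,              -- parent, e.g. /login
            type   TEXT,              -- e.g. application/json
            entity TEXT,              -- JSON representation string
            cid    TEXT               -- creating client; '' if persistent
        );
        CREATE TABLE subscribe (
            rid TEXT,                 -- target resource
            cid TEXT                  -- subscribing client
        );
    """)
    return db


def put(db, client_id, rid, entity, type_="application/json", persistent=False):
    """Store a resource; transient unless the client asks otherwise."""
    prid = rid.rpartition("/")[0]
    cid = "" if persistent else client_id
    db.execute("INSERT OR REPLACE INTO resource VALUES (?, ?, ?, ?, ?)",
               (rid, prid, type_, entity, cid))


def subscribe(db, client_id, rid):
    db.execute("INSERT INTO subscribe VALUES (?, ?)", (rid, client_id))


def disconnect(db, client_id):
    """Drop the client's subscriptions and transient resources.

    Returns the deleted resource ids so the caller can notify subscribers.
    """
    db.execute("DELETE FROM subscribe WHERE cid = ?", (client_id,))
    gone = [row[0] for row in
            db.execute("SELECT rid FROM resource WHERE cid = ?", (client_id,))]
    db.execute("DELETE FROM resource WHERE cid = ?", (client_id,))
    return gone
```

Storing the parent identifier in its own column makes it cheap to find the subscribers of the parent when a child changes.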

The generic "restserver.php" server application can be used in a variety of web communication applications, and we have shown it to work with web conferencing and slides presentation application.


WebRTC vs Flash Player

This year has been great for the world of IP communications so far -- with the Skype deal, Flash Player adding echo cancellation, and now Google open sourcing WebRTC (with source code) that includes the audio/video codecs and quality engines.

RTC-Web is an effort started in the IETF (and Web-RTC in W3C) to standardize the way media streams are transported end-to-end between two browser instances for a real-time communication experience within the browser. It consists of a protocol for establishing an end-to-end media path, abstractions for audio/video codecs and devices, and the language elements to use this feature from within Javascript/HTML. Traditionally browser communication has been done using plugins such as Flash Player. I have written a few open source software projects that use Flash based audio and video communication (flash-videoio, siprtmp, vvowproject). The WebRTC effort brings a completely new dimension, in a good way, because now we do not depend on external plugins for web based real time communications. Real-time communication becomes a first class construct for web developers.

This article summarizes some differences between WebRTC and Flash Player approaches for real-time audio/video communication. It also mentions a separate application approach as described in the VVoW project.

WebRTC is in line with the evolution of web protocols whereas using Flash Player is like patching an incomplete system. With WebRTC there is no external dependency beyond the basic web browser. However, given the ubiquitous availability of Flash Player compared to basic inter-operating HTML5 features, the Flash Player approach is still promising, at least in the short term.

The number of web developers who understand Javascript/HTML is clearly much larger than the number who understand Actionscript/MXML, which benefits the WebRTC approach as many more new applications and use cases can be implemented in practice. However, the complexity of building a Javascript based application that combines various individual pieces of the communication elements may be overwhelming. On the other hand, existing IDE tools for Flash development take away a lot of complexity from the developers.

Many users are reluctant to change their browser, and hence getting ubiquitous user adoption may take a long time unless this gets added to Internet Explorer. Moreover, dealing with device interfaces in a portable manner is a challenge. It is also not clear how the devices should be accessed across multiple instances of the same browser or across different browsers.

In the past, incompatibility in HTML among browsers has been a nightmare for web developers, and extending HTML for yet another feature is bound to cause more interoperability problems. Two interoperability scenarios are significant: between browsers from different vendors running the same web page, and between two different web sites. The latter is tricky from a security point of view if open standards are used, because a web site owner would want to restrict its users from communicating with the users of another web site, whereas the protocol will be capable of such communication.

On the other hand, Flash Player has shown more ubiquitous availability on users' desktops and laptops than any specific web browser. Flash Player allows implementing platform agnostic software because all the incompatibilities between browsers and platforms are taken care of by the plugin vendor.

Flash Player has the ability to do group communication by building scalable application level multicast tree among Flash Player instances. This is useful for one-to-many broadcast type communication scenarios. WebRTC is still in the initial phases of two party communication. Obviously, multiparty communication can be built on top of the two-party communication elements, but requires more effort to achieve efficiency.

In terms of video codecs, WebRTC provides an open source high quality video codec, whereas Flash Player's camera captured video still uses the outdated Sorenson codec, which is difficult to interoperate with non-Flash products. Availability of source code enables a WebRTC-based project to add new codecs as needed without depending on the vendor to provide new audio and video codec features.

The main problem with the Flash Player approach is that the protocol for the end-to-end media path is proprietary, so interoperating with existing VoIP gear is inefficient without buying server pieces from the plugin vendor. Although interoperability is possible using open RTMP and SIP-RTMP translators, it is not efficient because the browser to translator media path over TCP incurs unnecessary latency for some users. Secondly, for any new feature, we depend on the vendor, for example, echo cancellation, a new codec, or portability to a new device. Luckily, Adobe has been releasing new updates with new features periodically. For example, the echo cancellation feature released in Flash Player 10.3 solved a lot of problems for real-time communication. (Please see the public-chat demo in my flash-videoio project page to try out the video conference with echo cancellation.)

Some problems common to both the approaches are: (1) lack of a listening TCP socket or a general purpose UDP socket which could be used to implement a peer-to-peer application protocol within the browser without relying on servers, (2) the scope of an application is within a web page as defined by the Javascript or Flash elements, so if the user navigates to another web page the communication is lost. This is not a problem for web communication use case, but people are generally not used to this model in traditional communication.

On the other hand, the separate application model as used in the VVoW project allows you to have host resident software for communication, which can be used by any application including a web application running in your browser by connecting to the resident software locally. The resident application can reuse the existing research, e.g., Host Identity Protocol and P2P-SIP. This can save initial setup time for every connection of WebRTC. The main problem is that it involves yet another download and installation by the end user which hampers wide adoption.

I will continue to explore the WebRTC software developed by Google and try to include it in my open source projects. Some example projects could be: (1) add interoperability between WebRTC and Flash Player for communication in my siprtmp project, (2) add an option to detect WebRTC support and use that in my flash-videoio project if available, and fall back to Flash Player otherwise, (3) use the WebRTC source code to implement a separate application with a high quality end-to-end media path in the VVoW project, and (4) create a Python wrapper to use WebRTC from within any Python application.

Performance of siprtmp: multitask vs gevent

Poor performance has been an issue in my RTMP server and SIP-RTMP gateway. Traditionally, I blamed the multitask framework for the poor performance. In this article I present my measurement results as well as introduce an alternative gevent-based implementation to improve the performance.

There are several performance aspects of this software, e.g., CPU utilization per call or session, memory usage, bandwidth requirement, etc. This article only focuses on the CPU performance. Moreover, I only consider the steady state CPU usage to measure the number of active simultaneous calls through the gateway. The CPU usage during call setup and termination is not considered.

The conclusion of my measurement is as follows. The SIP-RTMP gateway software using gevent takes about 2/3 of the CPU cycles taken when using multitask, and the RTMP server software using gevent takes about 1/2 of the CPU cycles taken when using multitask. After the improvements, on a dual-core 2.13 GHz CPU machine, a single audio call going through gevent-based siprtmp using the Speex audio codec at 8 kHz sampling takes about 3.1% CPU, and hence in theory can support about 60 active calls in steady state. Another way to look at it is that the software requires CPU cycles of about 66 MHz per audio call.
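The arithmetic behind these numbers can be checked with a short calculation. It assumes that the 3.1% figure is the share of a single 2.13 GHz core and that both cores can be fully used for calls. That is my reading of the figures, under which they are mutually consistent.

```python
CORE_HZ = 2.13e9               # clock rate of one core
CORES = 2                      # dual-core machine
CPU_FRACTION_PER_CALL = 0.031  # assumed to be the share of ONE core


def per_call_mhz():
    """CPU cycles per second consumed by one call, in MHz."""
    return CORE_HZ * CPU_FRACTION_PER_CALL / 1e6


def max_calls():
    """Theoretical number of simultaneous calls across all cores."""
    return CORES / CPU_FRACTION_PER_CALL


print("%.0f MHz per call, about %.0f calls" % (per_call_mhz(), max_calls()))
```

This gives roughly 66 MHz per call and about 64 calls, which matches the "about 60 active calls" estimate once some headroom is left for the operating system.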

The gevent-based software is also available under the same license for you to try out. The next step to further improve the performance is to move part of the media processing of siprtmp to an external C/C++ extension module.

Background

Traditionally, I have used the multitask framework for co-operative multitasking in my Python software including p2p-sip, rtmplite and siprtmp. In the past, people have complained about high CPU utilization in siprtmp for a single call or even with no call. Part of the discussion is documented in issue 31. It turned out that the no-call CPU usage was a bug, and that we could optimize the multitask framework to improve the performance by approximately 2x. The optimization alters the way in which the multitask framework looks for io-events and more tasks. In particular, it gives more preference to tasks than to io-events, hence if a single io-event generates multiple tasks, all of them run before waiting for next io-events. These optimizations and fixes are in SVN r60 and r68. Unfortunately, these optimizations are not enough.
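The scheduling change described above can be pictured with a toy loop. This is only a sketch of the idea (drain every runnable task before polling for the next io-event), not the actual multitask code; the names run and worker, and the simulated io-event list standing in for a real select() call, are assumptions made for illustration.

```python
from collections import deque

def run(io_events, log):
    """Toy scheduler: runnable tasks always take priority over io-events."""
    tasks = deque()
    io_events = deque(io_events)
    while tasks or io_events:
        while tasks:                       # drain all runnable tasks first
            task = tasks.popleft()
            try:
                next(task)                 # run the task up to its next yield
            except StopIteration:
                continue                   # task finished, drop it
            tasks.append(task)             # task yielded, reschedule it
        if io_events:
            # Stand-in for waiting on select()/poll() for one io-event.
            name, handlers = io_events.popleft()
            log.append("io:" + name)
            tasks.extend(handler(log) for handler in handlers)

def worker(label, steps):
    """Hypothetical task factory: each task logs one line per step."""
    def make(log):
        for i in range(steps):
            log.append("%s.%d" % (label, i))
            yield
    return make

log = []
run([("pkt1", [worker("a", 2), worker("b", 1)]),
     ("pkt2", [worker("c", 1)])], log)
print(log)
```

In this sketch both tasks spawned by the first io-event run to completion before the second io-event is even looked at, which is the preference order the paragraph describes.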

To further improve the performance, I looked at the built-in asyncore module of Python and re-implemented rtmp.py to use asyncore. There was significant improvement of approximately another 1.5x to 2x. Unfortunately, getting timers to work with asyncore is not trivial. Hence I couldn't implement siprtmp easily as the SIP/RTP library relies heavily on timers.

Then I looked at the gevent project, thanks to a co-worker for recommending it. It supports co-routine based co-operative multitasking by modifying the existing blocking modules such as socket. Compared to the multitask framework, the source code using gevent is more readable and easier to maintain because the task switching happens behind the scenes. In contrast, the multitask framework requires yield statements scattered everywhere and a non-trivial StopIteration exception to return from a task. I re-implemented siprtmp.py, and related SIP/RTP modules, using gevent. Since the siprtmp module includes all of the rtmp module, this can also be used as an RTMP server in addition to being a SIP-RTMP gateway.
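To make the readability difference concrete, here is a small standard-library sketch that contrasts the two coding styles. It does not use the multitask or gevent APIs; read_all_generator, drive and read_all_plain are hypothetical names, and the chunk list stands in for data arriving from a socket. It is written for Python 3, where "return value" inside a generator surfaces as StopIteration.value; Python 2 code had to raise StopIteration explicitly.

```python
def read_all_generator(chunks):
    """Generator-style task: every potential blocking point is an explicit yield."""
    data = []
    for chunk in chunks:
        yield                       # hand control back to the scheduler
        data.append(chunk)
    return "".join(data)            # delivered to the driver via StopIteration.value

def drive(task):
    """Tiny stand-in for a scheduler: step the task until it finishes."""
    while True:
        try:
            next(task)
        except StopIteration as done:
            return done.value       # the task's result travels in the exception

def read_all_plain(chunks):
    """The same logic in the plain blocking style that gevent lets you keep."""
    return "".join(chunks)

chunks = ["RT", "MP"]
generator_result = drive(read_all_generator(chunks))
plain_result = read_all_plain(chunks)
print(generator_result, plain_result)
```

Both functions compute the same thing; the generator version needs the extra yield and the exception-based return path, which is the overhead in readability that the text refers to.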

Test Setup

All my tests were done on my MacBook laptop, 2.13 GHz Intel core 2 duo, 2GB memory, and running Mac OS X 10.5.6. I used Python 2.7 for server side components and flash debug player version MAC 10,0,45,2 (how to find?). I used X-lite version 3 as a standard SIP client. The debug trace on the server was disabled, by not supplying any -d option. All my clients and server ran locally on my local host hence bandwidth was not an issue. I used Mac's Activity Monitor to measure the CPU usage.

Measurement Result
The main metric is the CPU usage in percentage as reported by the Activity Monitor. There are several parameters that were altered and the effects were measured.

The siprtmp performance was measured for an audio call between a web-based VideoPhone sample application available as part of the siprtmp software, and the third-party X-Lite application. The sampling rate of the Speex audio codec can be 8kHz or 16kHz. The larger the sampling rate, the larger the encoded packet is. The CPU usage increases with higher sampling rate. Note that there is no transcoding in siprtmp. The following table shows the percentage CPU usage for siprtmp using multitask and gevent, and for the two sampling rates.

Rate      multitask    gevent
8 kHz     4.8-5.1%     3.1-3.2%
16 kHz    6.2-6.5%     4.0-4.1%

Based on these, we can conclude that the gevent-based SIP-RTMP gateway takes about 2/3 of the CPU compared to the multitask-based gateway. Roughly, the gevent-based gateway takes about 66 MHz/audio-call of the CPU cycles in steady state.
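The arithmetic behind these two figures can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the reported percentage is relative to one 2.13 GHz core and uses the midpoints of the measured ranges; mhz_per_call and midpoint are hypothetical helper names.

```python
CPU_CLOCK_MHZ = 2130.0  # 2.13 GHz test machine

def mhz_per_call(cpu_percent, clock_mhz=CPU_CLOCK_MHZ):
    """Convert a CPU usage percentage into an equivalent clock budget in MHz."""
    return cpu_percent / 100.0 * clock_mhz

def midpoint(low, high):
    """Midpoint of a measured range."""
    return (low + high) / 2.0

# Measured CPU usage ranges in percent: (multitask, gevent) per sampling rate.
measurements = {
    "8 kHz": ((4.8, 5.1), (3.1, 3.2)),
    "16 kHz": ((6.2, 6.5), (4.0, 4.1)),
}

ratios = {}
for rate, (multitask, gevent) in measurements.items():
    ratios[rate] = midpoint(*gevent) / midpoint(*multitask)
    print(rate, "gevent/multitask ratio: %.2f" % ratios[rate])

print("MHz per 8 kHz call at 3.1%%: %.1f" % mhz_per_call(3.1))
```

The ratios come out a little under 2/3 for both sampling rates, and 3.1% of 2130 MHz is about 66 MHz.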

The rtmp performance was measured using one publisher and zero or more players. The CPU usage increases with the number of players. Typically, an audio-only session gives less variance in the CPU usage, whereas if video is included then the packet size changes depending on the amount of movement or image detail, and so does the CPU usage. I used the Flash VideoIO project's test page to perform the tests. If video is present, then Flash Player's camera capture uses these properties: cameraQuality=80, cameraWidth=320, cameraHeight=240, cameraFPS=12. Audio is always Speex 16 kHz with encodeQuality=6. The following table shows the percentage CPU usage using multitask and gevent, with one publisher and different numbers of players, and with or without video. If the variance is small, only the average is reported, whereas if the variance is large the range is listed.

Media          #players    multitask    gevent
Audio          0           2.2%         1.3%
Audio          1           3.5%         1.8%
Audio          2           4.5%         2.1%
Audio          3           5.5%         2.5%
Audio+Video    0           3.0-3.9%     1.4-1.7%
Audio+Video    1           4.2-4.7%     2.1%
Audio+Video    2           5.5-6.3%     2.7%
Audio+Video    3           7.0-7.6%     3.1%

Based on these, we can conclude that the gevent-based software takes less than 1/2 of the CPU that the multitask-based software takes for RTMP streaming. Roughly, the gevent-based server takes 34 MHz/publisher and 12 MHz/player of the CPU cycles in steady state.
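The text does not say how the per-publisher and per-player figures were derived. One plausible reading, sketched below, is a straight-line fit of the gevent audio+video measurements against the number of players: the intercept is the publisher cost and the slope is the cost of each extra player. The use of a least-squares fit, the midpoint 1.55% for the 1.4-1.7% range, and the helper names fit_line and to_mhz are all assumptions.

```python
CPU_CLOCK_MHZ = 2130.0  # 2.13 GHz test machine

# gevent, audio+video: CPU percent for 0..3 players
# (the 0-player range 1.4-1.7% is replaced by its midpoint).
PLAYERS = [0, 1, 2, 3]
CPU_PERCENT = [1.55, 2.1, 2.7, 3.1]

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def to_mhz(cpu_percent):
    """Convert a CPU usage percentage into MHz of the test machine's clock."""
    return cpu_percent / 100.0 * CPU_CLOCK_MHZ

intercept, slope = fit_line(PLAYERS, CPU_PERCENT)
publisher_mhz = to_mhz(intercept)
player_mhz = to_mhz(slope)
print("publisher: %.1f MHz, per player: %.1f MHz" % (publisher_mhz, player_mhz))
```

Under these assumptions the fit gives roughly 33.5 MHz for the publisher and 11.2 MHz per player, in the same range as the rounded figures quoted above.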

Implementing video conferencing and text chat using Channel API

Last week, Google finally released the Channel API [1, 2] for Google App Engine. It has been available to developers for six months [3], but not on actual app engine for production. I had built a few video conferencing and text chat applications [4, 5] using Flash VideoIO project [6] on Google App Engine. Earlier, I had to use Ajax/polling technique to get events related to chat and user list. In the last couple of days, I modified those applications to use the asynchronous event notifications using the Channel API. More text from [6] follows:

"Random-Face [4]: This is a chatroulette-type application built using the Flash VideoIO component on Adobe Stratus service and Python-based Google App Engine. ... You can view the source code of two files, index.html that renders the front end user interface and main.py that forms the back-end service."

"Public-Chat [5]: This is a multi-party audio, video and text chat application built on top of Python-based Google App Engine and using Channel API for asynchronous instant messaging and presence. ... Developers can see the source code files: index.html is the front-end user interface, webtalk.js is the client side Javascript to do signaling, and main.py is the back-end service code."

The Channel API essentially implements an XMPP-style asynchronous communication from your server to the Javascript client. I use this to implement notifications for new messages, change in user list, and update of user video session to other participants in the system.

What is Flash Media Gateway?

I recently saw a description of Adobe's Flash Media Gateway [1] and related information on how Adobe Connect 8 can use it to make and receive SIP calls [2]. This article lists my view on advantages and problems of such a gateway architecture. (Disclaimer: I have not used any of these products though, so my views may be completely wrong).

In summary, the new Flash Media Gateway is similar to the bunch of other SIP-RTMP gateway products that have already existed for a few years, e.g., siprtmp, gtalk2voip, flaphone and red5phone. I feel the industry demand for interoperating between Flash Player and SIP devices eventually forced the company to do something about it. Unfortunately, it did something which is sub-optimal as I describe here.

I have been involved with development of open source siprtmp project [3] hence I can speak from my experience about advantages and problems with such an architecture. I have also blogged earlier about FAQ on using Flash Player to make phone calls [4].

Advantages of Flash Media Gateway
  1. It allows you to build Flash applications that can talk to SIP devices using Adobe's servers at the back end. While it is not useful for those who already have resorted to other solutions such as Red5 and Wowza, it is useful to those who use Adobe's Flash Media Server (FMS) and do not want to switch to other alternatives for any reason. Problem: It is not clear whether the Flash Media Gateway can work with other media servers such as Red5 or Wowza.
  2. It supports audio transcoding among Speex, Nellymoser and G.711, as well as mixing for a simple conference bridge. This allows working with older Flash players that do not have Speex and with SIP devices that do not have Speex. A third-party product such as siprtmp is typically reluctant to implement transcoding with Nellymoser because of licensing restrictions. Problem: In general transcoding is not the best option because it takes significant CPU cycles on your (expensive) hosted servers. It can drastically reduce the capacity of your server by a factor, e.g., support 100 calls with Speex or support 10 calls between Speex and G.711.
  3. It supports video using H.264. Problem: It is not clear whether it allows only one-direction H.264 from SIP device to Flash Player, or whether it supports bi-directional H.264. Bi-directional H.264 would be a huge advantage, but would mean that Flash Player is capable of capturing and sending H.264 video, which does not look like the case.
  4. It can potentially support UDP between Flash Player and server. Note that one of the biggest issues with real-time voice calls with Flash Player was that RTMP (over TCP) caused high latency not suitable for interactive communication. Adobe added another protocol, RTMFP (over UDP), that could allow end-to-end media path among the participants thus drastically reducing the end-to-end audio latency. While a gateway architecture does not allow end-to-end media path, it can still allow UDP between Flash Player and media server using RTMFP. This could reduce the end-to-end latency to some extent. Problem: It is not clear whether RTMFP can be used in conjunction with Flash Media Gateway.
Problems with Flash Media Gateway
  1. It does not allow you to build a SIP client in the browser. The communication between Flash application and the media server/gateway is still over RTMP (or RTMFP). This means unlike true end-to-end media path for SIP calls, the media must go through the server/gateway. I don't think the connect plugin is implementing a SIP/RTP stack because it says that it uses the gateway in the back end.
  2. If RTMFP is not allowed for such SIP calls, then the RTMP (over TCP) connection will significantly contribute to latency which is not suitable for interactive voice calls unless you have deployed the gateway close to your user.
  3. Most SIP-PSTN gateways that translate SIP calls to the phone network support the traditional voice codecs G.711, G.729 and G.723.1 but not Speex or Nellymoser, whereas the Flash Player supports only Speex and Nellymoser for captured voice. Thus you always need transcoding. Unfortunately, G.711 at 64 kb/s is expensive on bandwidth compared to, say, G.729 at 8 kb/s. Since the gateway does not support the common voice codecs of PSTN providers, in most cases you will need to run some form of transcoding twice, or live with higher bandwidth usage.
  4. It does not add any more significant value to what already exists with red5phone or siprtmp. You still need to use a third-party SIP provider who can terminate your PSTN calls. It does not optimize the media path latency because of the gateway architecture. And finally it does not really improve the call experience for Flash to SIP calls to the end-user.
Ideally, the SIP/RTP and related protocols should become part of Flash Player, so that it allows one to create a SIP user agent in the browser and enable low latency end-to-end media path with third-party SIP user agents.


How to extend HTML5 for real-time video communication?

A few months ago, I was discussing HTML5 with a friend of mine. We tried to figure out what it would take to extend it to support web-based video communication. The proposed HTML5 already includes audio and video tags, but they are useful only for streaming video applications. This article presents more refined thoughts on how to extend the browser to support video communication.

First approach: extend the video tag
W3C has added new video element in HTML5 to facilitate playback of interoperable video formats across browsers. Existing web sites use "object" element to run an external plugin such as Flash Player for video playback, which is intended to be replaced by the HTML5's video element. This allows browser manufacturers especially for phones and other devices to easily playback web videos, without having to implement the full Flash Player plugin. The "src" property allows specifying the URL of the video to play, and additional properties such as poster, preload, autoplay, loop and controls allow controlling the behavior of the video player.

One way to support video communication is to extend the video element with additional properties that allow it to capture and publish local video, and control camera and microphone behavior. For example, in a two-party call between Alice and Bob, Alice can have two video elements, one to publish local video to URL stream "alice" and other to play remote video from URL stream "bob". Similarly, Bob can have two video elements, one to publish local video to URL stream "bob" and other to play remote video from URL stream "alice". The "src" property can specify the central media server or rendezvous server location as well as the publish or play stream names, e.g., "rtmp://server/conf123?publish=alice".

This is the idea behind my Flash-based audio and video communication project. In addition to existing properties such as src, preload, autoplay, loop and controls, it defines new properties for microphone, camera, playing, recording, etc., as you can see on How to use the VideoIO API?. It also overloads the "src" property to allow "rtmp" and "rtmfp" URLs for media server or rendezvous server location, respectively. This application with its new properties can be used as a drop-in replacement for a video element that supports video communication in the browser.

This approach of extending the existing video element with new properties works well for two-party as well as multi-party conferences, and centralized as well as end-to-end media path. The nice thing about this approach is that it keeps the actual call signaling out-of-scope of the video element, e.g., your web application implements call signaling using existing Javascript/Ajax/websocket/server-event technologies. It keeps the specific rendezvous protocol mechanism such as "rtmp", "rtmfp", and in future "sip" or "rtsp", outside the video element.

To avoid interoperability problems, a minimum subset of supported rendezvous protocols is recommended. The requirement for such a protocol is to support real-time media transport, preferably over UDP, in centralized or end-to-end path in the presence of network middle boxes such as NATs and firewalls.

Second approach: define new connection object
The previous approach integrates capture, playback and connection functions into a single video element, with additional properties. Alternatively, these functions can be split into different elements and Javascript objects, e.g., the video element does display/playback, but new camera and microphone objects allow capture, and a new connection object allows end-to-end real-time media path among participants. The Javascript application actually connects these different elements and objects to build a complete video communication system.

There are several proposals on what the new connection or transport API will look like. Example attributes are: protocol (udp or tcp), list of reflectors and relay servers, mode (initiating or receiving), secure (boolean). Additionally, it has methods such as connect and send, and events to indicate connection status and incoming data. Existing protocols such as ICE, STUN, RTP/RTCP and SIP may be implemented in the browser or external gateways to support such a transport object. Finally, these transport objects can be piped with display and capture components, audio and video codecs and filters, etc., to implement a complete video communication application.

In summary, this approach defines new Javascript objects such as Transport, Camera, Microphone, Codec, etc., and allows the application to connect these objects to build a real application. This is more complex than the first approach, but allows fine-grained application logic.

Third approach: use external application
This approach understands the limitations of HTML and does not try to "add" video communication to it. We are considering this approach of a separate application in our web communications project at Illinois Institute of Technology.

While the idea of extending HTML to support video communication is useful and interesting, there are many limitations. In the past, incompatibility in HTML among browsers has been a nightmare for web developers, and extending HTML for yet another feature is bound to cause more interoperability problems. Browser manufacturers are sometimes not too keen to add a new feature, e.g., for business reasons if it competes with the manufacturer's existing product or service. Third important reason is that the video element of HTML5 lacks some digital rights management related features, which causes media owners to publish their media using restricted Flash application. Fourth, adoption of new HTML5 is slow, so web site developers still need to fall back to Flash-based application for video playback at least in the short term. Finally, adding capture and end-to-end transport components in HTML5 gives rise to a plethora of issues related to privacy, security and denial of service attacks, in case of faulty browser implementation. Due to these reasons many people believe that extending HTML and browsers to support video communication is not the right approach.

Hundreds of applications exist that implement consumer video communication. Some popular ones are Skype, Gmail, tinychat and Facetime. The technology behind these are drastically different, especially for signaling and control. However, at the bottom, every video communication application tends to establish some form of end-to-end UDP-based real-time media path, and fall-back to relays if that fails. As mentioned before, IETF standards exist to establish such media path and relays.

Imagine a standard-compliant resident application, rtc-app, that runs on the user's machine independent of the browser, but allows any application including the browser to establish a real-time media path. The browser can use an existing API such as websocket or HTTP to interact with rtc-app. The rtc-app application is not owned by a specific vendor, and is installed by the end-user. This avoids re-implementing the feature by every vendor who wants to do real-time video communication. To address the privacy and security concerns, rtc-app must directly ask permission from the end-user before initiating or accepting a connection instead of automatically (and randomly) on API calls. This is similar to how Flash Player asks the end-user for permission to capture from microphone or camera, but can remember the application for future use if told so by the end-user.

The main advantage of this approach is that it does not require changing the browser or HTML, but still is a generic implementation-focussed way to enable real-time video communication for many other applications. If an existing vendor such as Skype or Google opens up its API, it will be a big step forward. While rtc-app can provide transport functions, the audio and video capture still needs to be done somehow. Various codec licensing issues may prevent us from including it in rtc-app, but Flash player based application similar to the first approach can perform capture on its behalf. The main problem with this approach is that it requires an additional download and install by the end-user.



How to conduct a technical interview for software engineer?

(This article presents my thoughts on how to effectively conduct a technical interview for a software engineering position. It presents the "interviewer's" point of view based on more than 30 technical interviews I have conducted, and quality of candidates I have recommended. If you are an "interviewee" I suggest you look elsewhere, e.g., interview questions.)
  1. Know the position you are hiring for. If you have been part of a software engineering team or have read the book, "The mythical man-month", you would know that you need several different "types" of members in a successful team. You need a "magician", who knows or can figure out the solution to every technical problem you may have. You need a couple of "plumbers" who are willing to fix any broken software piece. You need a "general" who is very motivated about what you are doing, knows how and when to delegate, and keeps everyone together. You need a few "soldiers" who can follow orders, do the job, and be happy to contribute. And so on. As an interviewer, you need to know what position you are hiring for. You need to tailor your interview as per the requirement. One interview pattern does not fit all types.
  2. Do your homework. Before the interview, thoroughly read the candidate's resume/CV. If she has extensive work experience, identify only one or two of her past projects to focus on. If you have even a slight doubt about her programming ability, prepare a written programming test. If possible, schedule a separate or additional time slot before your face-to-face interview for the programming test. Do not use any existing online programming test material, otherwise you won't be able to distinguish between someone who knows how to program vs someone who has gone through many web sites containing interview questions. Do not give take home tests. Do not share your programming interview questions with other interviewers in your organization.
  3. Start with knowledge questions. During the interview, after initial introductions, start with a question on her past experience. Your interview should balance between knowledge and application types of questions. Do not ignore her experience or knowledge, and do not focus only on her experience. Getting started with what the candidate already knows is also a good way to make her comfortable. You can ask something from her past project, e.g., "Describe in one minute what you did in XYZ?", or ask about a past technology that she used extensively, e.g., "Did you use STL in C++? What are the common STL classes available?"
  4. Focus on real application problems. Most software engineering positions require applying your existing knowledge to a new problem. The one quality which distinguishes a good programmer from a mediocre programmer is that a good programmer can easily translate your problem into pseudo-code. If you are interviewing for "soldiers" and not "magician" or "general", avoid discussing high-level design type of problems, but instead focus on more low level real technical problems. For example, instead of asking "How would you design a scalable web server for blah blah?" ask more specific questions. In my experience, people who can answer high level design questions can create "vaporware" but those who can translate a small real problem to pseudo-code can actually write "software". If you need software engineers, avoid wasting time on high level design questions. Also, such application problems should be independent of specific domains but just be able to test whether the candidate has the required mathematical and computer skills to translate your problem to pseudo-code. I have given some examples later.
  5. Follow thought processes and provide hints. If you believe that the candidate is getting diverted into an incorrect answer, there is no harm in giving hints or counter-questions to course correct her thoughts. Do not be too adamant on your answer. Sometimes, a 75% correct answer is good enough.
  6. Provide itemized feedback. When you submit your recommendation to the HR or your manager about a candidate, specifically itemize individual qualities and performance, and emphasize specific skills and the lack thereof. For example, "I had a nice 45 min conversation with XYZ, and I found her to be a very good programmer but needs training on Flex. After initial introductions, I asked one algorithm and three programming questions. She did well in two programming ones and average on others. Programming ability: very good; Needs hand-holding: yes; Algorithms: average; Strength: programming; Weakness: Flex; Recommendation: weak accept." My final recommendation is one of strong-accept, weak-accept, weak-reject, or strong-reject, with implied meaning of "a very strong candidate, and must hire her", "a good enough candidate, but won't argue to hire her if others disliked her", "an average candidate, but won't argue to reject her if others strongly liked her", "a poor candidate, and must not hire her", respectively.
As an interviewer you would be wondering about examples of real questions that would distinguish a good programmer from an average one. These are some examples. As mentioned before, you should create your own question, instead of using these, otherwise you cannot distinguish a candidate who genuinely solved the problem from the one who has read this blog.
  1. Video conferencing layout: suppose you know the window dimension, WxH, and want to fit participant videos in an MxM tile. Each video has a fixed aspect ratio of 4:3. All video objects are of the same size in the layout. Your MxM tile should be laid out in the middle-center, with potential empty spaces near window edges. The layout should maximize the size of the MxM tile, so that the empty spaces near edges are minimized. You are given an array of video objects V[] and a function layout(v:video, x, y, w, h) which lays out a single video object with size (w, h) at position (x, y) inside the window. Write pseudo-code to lay out participant videos. (Hint: start with 1 video, then 2x2, then 3x3, then generalize. Additional questions: how would you modify it to an NxM tile instead of MxM? What should happen if the number of videos is more than 9 but less than 16 -- which boxes are empty? How would you modify it so that empty spaces including empty boxes are minimized in an NxM layout?)
  2. Path optimization: suppose you have a map of a city with a Manhattan-style layout. Suppose north-south streets are named a1, a2, etc., and east-west streets are named b1, b2, etc. Some streets have traffic signals, with a 5-second walk sign, a 15-second count-down to continue walking if started, and a 20-second don't walk sign, periodically repeating in that order. Other streets do not have a traffic signal, in which case traffic must yield to pedestrians. Suppose you need to walk from the corner of a5/b5 to the corner of a7/b10, and the only street with traffic lights on your way is a6. You walk at the same speed. Crossing a6 takes 15 seconds whereas crossing any other street takes only 5 seconds. You do not want to cross a6 if you know you can't finish before it turns to the don't walk sign. You want to minimize the time taken from source to destination, hence minimize the time waiting on traffic lights. You have functions named walk(), turn(left or right), stop(). Write pseudo-code for your decision process from your source to destination point. (Hint: draw out the map first, then it becomes easy to visualize and solve. Additional question: can you generalize between any two points as long as you know the complete map and which streets have signals?)
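To show the level of detail a good answer to the first question might reach, here is one possible sketch in Python rather than pseudo-code. It assumes M is the smallest integer with M*M at least the number of videos, fills boxes row by row so any empty boxes are the last ones, and returns rectangles instead of calling layout() directly; tile_layout is a hypothetical name and not the only valid solution.

```python
import math

def tile_layout(num_videos, win_w, win_h, aspect_w=4, aspect_h=3):
    """Return one (x, y, w, h) rectangle per video for a centered MxM tile."""
    if num_videos <= 0:
        return []
    m = math.isqrt(num_videos - 1) + 1      # smallest M with M*M >= num_videos
    # An MxM block of 4:3 videos is itself 4:3, so fit the largest 4:3 block
    # into the window and divide it evenly into M columns and M rows.
    block_w = min(win_w, win_h * aspect_w / aspect_h)
    w = block_w / m
    h = w * aspect_h / aspect_w
    x0 = (win_w - m * w) / 2.0              # center the block horizontally
    y0 = (win_h - m * h) / 2.0              # and vertically
    rects = []
    for i in range(num_videos):
        row, col = divmod(i, m)             # fill row by row, left to right
        rects.append((x0 + col * w, y0 + row * h, w, h))
    return rects

four = tile_layout(4, 800, 600)
one_wide = tile_layout(1, 1000, 300)
print(four)
print(one_wide)
```

Each returned rectangle can then be passed to the given function as layout(V[i], x, y, w, h). Extending this to an NxM tile means searching over column counts for the one that gives the largest video size, which makes a natural follow-up question.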
If you have more ideas, feel free to comment.