The dangers to the promise of WebRTC

WebRTC is an emerging HTML5 component that enables real-time audio and video communication from within the web browser on desktop and mobile platforms. In this article I present my critical view of what can potentially go wrong in achieving the promise of WebRTC.

The promise

The technology promises web developers a plugin-free cross-browser standard that will allow them to create exciting new audio/video communication applications and bring them to the millions (or even billions!) of web users on desktop and mobile platforms.

The rest of the article assumes these premises:
  1. A technology is easy to invent but hard to change.
  2. Business motivation drives technology deployment.

The dangers

Following are just some of the questions I will explore in this article.
  1. What if Apple does not commit to WebRTC?
  2. What if Microsoft's implementation is different than Google's?
  3. What if there are different flavors of the API implementation?
  4. What if a web browser cannot talk to legacy SIP infrastructure without gateways?
  5. What if "they" cannot agree on a common video codec?
Let us begin by listing the existing technologies: real-time communication protocols (RTP, SIP, SDP), HTTP, web servers, and Flash Player with its related media servers on the desktop. These are hard to change. On the other hand, the emerging HTML5 technologies can still be steered in the right direction if needed.

Now what are the business motivations for the browser vendors? Google is the most open and web-centric browser vendor, with plenty of money and plenty of motivation to drive the technology toward true browser-to-browser real-time communication. It makes sense for it to design a platform that enables web clients to connect to each other via a rendezvous or relay service running in the cloud. Improving and simplifying the tools available to web developers, while hiding the complexity of protocol negotiation, improves the overall web experience, which in turn drives business for web-centric companies such as Google. Moreover, end-to-end encryption and open codecs matter more to Google than compatibility with legacy VoIP infrastructure, which is reflected in the choice of several new RTP profiles for the media path and VP8 for video.

On the other hand, Apple prefers closed systems with integrated offerings that work seamlessly across Apple platforms. Opening up FaceTime (or related technology) to all web developers is interesting but not compelling enough, especially when the resulting web application and WebRTC session are controlled by web developers and users instead of the iOS developer platform or the iTunes portal. This makes Apple, as a browser vendor, uncommitted to WebRTC until the technology proves its worth to Apple's business.

Microsoft already has a huge communication server business that is largely SIP based. Any technology that makes it difficult to interoperate (at the media path/RTP level) with the existing SIP application servers is likely to cause trouble. Microsoft could build a gateway, but it would be hard to compete with others who can provide such services on low-cost cloud infrastructure. Hence, having the building blocks in place that enable a web browser to talk to existing SIP devices in the media path is probably a core requirement for Microsoft. Thus, low-level access to SDP and more control over the media path are proposed in its draft.

For a web browser on the desktop, WebRTC is interesting because it brings a variety of new use cases and web applications. However, these use cases are already available to desktop web users via alternatives such as Adobe's Flash Player. It is not clear what WebRTC contributes in terms of additional use cases for desktop web applications. For mobile platforms and smartphones, however, it is a completely different story. Flash Player either does not work or is too cumbersome for video conferencing on mobile. HTML5 presents a great opportunity to bridge the differences between end-user platforms when it comes to web applications. Thus, adding WebRTC to web browsers on mobile phones such as the iPhone and Android devices adds a lot of value.

Let us consider a few "what if"s that pose danger to the promise of WebRTC.

What if Apple does not commit to WebRTC?

If Apple, with its iOS platform, variety of cool gadgets and fervent fan base, decides to keep real-time communication out of reach of web developers, people will still build such applications, albeit as standalone native applications sold through iTunes. Eventually the web developer may end up writing custom platform-dependent kludges, e.g., invoke the native API on iOS but use WebRTC elsewhere. Once Apple realizes that web-based real-time communication is slipping from its grip, it may be too late. Moreover, other browsers besides Safari could run on iOS instead.

What if Microsoft's implementation is different than Google's?

Based on historic browser interoperability issues and the recent announcement by Microsoft, this is quite likely. Fortunately, web developers are used to such custom browser-dependent kludges. However, WebRTC poses a greater risk of differences in operation and interoperability that go beyond HTML/Javascript hacks. Having a server-side gateway in the media path to facilitate interoperability among browsers from different vendors violates the spirit of the end-to-end principle and is not the right motivation for WebRTC, in my opinion. Moreover, the Chrome Frame extension for Internet Explorer could relieve web developers to some extent.

What if there are different flavors of the API in different browsers?

This is similar to the previous threat: if the implementations in different browsers do not interwork, especially among the leading mobile platform vendors such as Apple, Google and Microsoft, then we are back to square one. As a web developer, what is the motivation to build three different applications using WebRTC when one can build once using Adobe Flash Player? Perhaps Adobe will revive the mobile version of Flash Player to solve the differences :)

What if a web browser cannot talk to legacy SIP infrastructure without gateways?

This issue has troubled telecom and web-centric vendors alike. Web browsers do not want to blindly implement the legacy SIP family of standards (actually RTP) because the problems on the web are different and need to be addressed differently. Telecom-centric vendors do not want to redo everything just to interoperate with web browsers. If they can get the web browser to talk to their existing SIP/RTP components while keeping the media path intact, it will be a big deal. Having to implement a variety of new RTP profiles in their equipment for interoperability with the web holds little appeal for them.

This could go one of two ways: either browser vendors agree not to interoperate with legacy RTP devices, in which case we will need a gateway for interoperability, or browser vendors implement proprietary extensions (back doors!) that allow such hooks. If it is the latter, those hooks will immediately be picked up by telecom-sponsored developers to build applications that use only the legacy RTP mode, which is not a good thing!

What if "they" cannot agree on a common video codec?

The debate on which codec to pick is as old as the debate on WebRTC itself. Currently, "they" are forming two camps: one that supports the royalty-free VP8 family and another that backs the ubiquitous H.264 family (or wants no mandatory codec at all). The advantage of H.264 is that much existing hardware already includes fast H.264 processing, so it will be available natively on mobile platforms, whereas with VP8 you may end up spending your battery life on the codec, at least until VP8 becomes available in hardware. It is not clear what will happen. If VP8 is picked as the mandatory-to-implement codec in the browser, I can see some browser vendors diverging from the standard and implementing only H.264 (the hardware-supported codec), or implementing both VP8 and H.264.

What are the other potential threats to this technology? What if the standardization work drags on for too long, until it is too late and web users have moved on to something better? The web is driven by web applications. What a browser vendor does or what a standardization committee adopts is just a shadow of what web applications do (or want to do). A web developer prefers a consistent, simple and unified API for real-time communication.

If it were in my hands, I would ban any IETF or W3C proposal without a running and open implementation. But the reality is far from this!

Getting started with WebRTC

The emerging WebRTC standards and the corresponding Javascript APIs allow creating web-based real-time communication systems that use audio and video without depending on external plugins such as Flash Player. I got involved in a couple of interesting projects that use the pre-standard WebRTC API available in the Google Chrome Canary release. This article describes how to get started with these projects.

The first step is to download a browser that supports WebRTC, e.g., Google Chrome Canary. In the browser's address bar, go to chrome://flags, search for Enable Media Stream and click to enable it. This enables the WebRTC APIs for web applications running in the browser. Use this browser for getting started with the following two projects.

  1. The first project is voice and video on web, which enables web-based audio and video conferencing among multiple participants. It uses resource-oriented data access to move all the application logic to the client application in HTML/Javascript, while the server merely acts as a data store interface. The thin server makes the architecture scalable and robust. The source code of the WebRTC integration is available in the SVN branch named webrtc. Follow the instructions on the project web page to install and configure the system, but use the link to the webrtc branch instead of trunk in SVN. After visiting the main page of the installed project, see the help tab to create a new conference. On the preferences page, make the conference persistent, check the box to enable WebRTC and uncheck the box that enables Flash Player for this conference. Visit the conference from multiple tabs in your browser, using different user names. Let the moderator participant enable the video checkboxes for the various participants to see audio and video flowing via WebRTC among the participants.

  2. The second project is SIP in Javascript, which creates a complete SIP endpoint running in your browser. It contains a SIP/SDP stack implemented in Javascript along with a SIP soft-phone demonstration application. It has two modes: Flash and WebRTC. In the WebRTC mode, it uses a variation of SIP over WebSocket for signaling and secure media over WebRTC. The default mode is Flash, which you can change to WebRTC as follows. Check out the source code from SVN and put it on a web server. Install and run a SIP proxy server that supports SIP over WebSocket, as mentioned on the project page. Now visit the page phone.html?network_type=WebRTC on your web server in two tabs of your browser. Follow the getting-started instructions on that page to register the first tab's phone with one name and the second with another, and try a call from the first to the second. An example program trace at the proxy server from a past experiment shows how the WebRTC capabilities are negotiated between the two phones using SIP over WebSocket. In addition to network_type, you can supply additional parameters in the page URL to pre-configure the phone, e.g., username, displayname, outbound_proxy and target_value. Example links are available to configure the phones as alice and bob connecting to the web and SIP servers on localhost, assuming your web server is running locally on port 8080 and the SIP server on WebSocket port 5080. Open the two URLs in two tabs, click Register in each, make a call from one by clicking the Call button, accept the call on the other by clicking its Call button, and see the audio and video flowing via WebRTC between the two phones.



Translating H.264 between Flash Player and SIP/RTP

Our SIP-RTMP gateway, part of the rtmplite project, includes translation of the packetization between Flash Player's RTMP and SIP/RTP for H.264. There are some hurdles, but it is doable! In this article I present what it takes to achieve such interoperability. If you are interested in the implementation, please see the _rtmp2rtpH264 and _rtp2rtmpH264 functions in the siprtmp.py module.

Before jumping into the details, let us go over some background on H.264 packetization. The encoder encodes a sequence of frames, or pictures, to generate the encoded stream, which is consumed by the decoder to re-create the video. The encoder generates what is called a NALU, or network abstraction layer unit. The decoder works on one NALU at a time and needs a sequence of NALUs to decode the video. Each frame can have one or more slices, and each slice can be encoded in one or more NALUs. Certain pieces of information remain the same across all or many frames. For example, the sequence parameter set (SPS) and picture parameter set (PPS) are configuration elements that need to be sent once or only occasionally instead of with every frame or NALU. The configuration parameters apply to the encoder, whereas the decoder should be able to decode any configuration.

RTMP Payload

Flash Player 11+ is capable of capturing from the camera and encoding in H.264 to send on an RTMP stream. Each RTMP message contains a header and data (or payload): the header carries crucial information such as the timestamp and stream identifier, and the payload carries the encoded video NALUs or the configuration data. The format of the payload is the same as that of the F4V/FLV tag for H.264 video in an FLV file. Each RTMP message contains one frame but may contain more than one NALU. The first byte contains the encoding type, which for H.264 is either 0x17 (for an intra frame) or 0x27 (for a non-intra frame). The second byte contains the packet type and is either 0x00 (configuration data) or 0x01 (picture data). The configuration data contains both SPS and PPS as described below.

rtmp-payload := enc-type[1B] | type[1B] | remaining
enc-type     := is-intra[4b] | codec-type[4b]
is-intra     := 1 if intra and 2 if non-intra
codec-type   := 7 for H.264/AVC
If the type is configuration data, then the next four bytes are the configuration version (0x01), the profile index, the profile compatibility and the level index. This is followed by one byte whose least-significant two bits determine the number of bytes used for the length of each NALU in subsequent picture data messages. For example, if the bits are 11b then the NALU length uses 3+1=4 bytes, and if the bits are 01b then it uses 1+1=2 bytes. Let's call this the length-size; possible values are 1, 2 or 4. This is followed by a byte whose least-significant 5 bits give the number of subsequent SPS blocks. Each SPS block is prefixed by a 16-bit length followed by the bit-wise encoding of the SPS as per the H.264 specification. This is followed by a byte containing the number of subsequent PPS blocks. Each PPS block is prefixed by a 16-bit length followed by the bit-wise encoding of the PPS as per the H.264 specification. Typically only one SPS and one PPS block are present.

remaining for config := version[1B] | profile-idc[1B]
                      | profile-compat[1B]
                      | level-idc[1B]
                      | length-flag[1B]
                      | sps-count[1B] | sps0 ...
                      | pps-count[1B] | pps0 ...
length-flag := 0[6b] | value[2b] where value + 1 is length-size
sps-count   := 0[3b] | count[5b] where count is number of sps
pps-count   := number of pps elements
sps(n)      := length[2B] | sps
pps(n)      := length[2B] | pps
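
To make this layout concrete, here is a minimal Python sketch (my own illustration, not the actual siprtmp.py code) that parses a configuration payload laid out as above; the function name and the use of Python 3 bytes are assumptions for this example.

    import struct

    def parse_rtmp_h264_config(payload):
        # payload is the RTMP message body (bytes), laid out as described above
        enc_type, pkt_type = payload[0], payload[1]
        assert enc_type & 0x0f == 7 and pkt_type == 0x00, 'not an H.264 config payload'
        version, profile_idc, profile_compat, level_idc = payload[2:6]
        length_size = (payload[6] & 0x03) + 1      # bytes per NALU length in picture data
        pos = 7
        sps_count, pos = payload[pos] & 0x1f, pos + 1
        sps_list = []
        for _ in range(sps_count):
            (size,) = struct.unpack('>H', payload[pos:pos + 2]); pos += 2
            sps_list.append(payload[pos:pos + size]); pos += size
        pps_count, pos = payload[pos], pos + 1
        pps_list = []
        for _ in range(pps_count):
            (size,) = struct.unpack('>H', payload[pos:pos + 2]); pos += 2
            pps_list.append(payload[pos:pos + size]); pos += size
        return dict(version=version, profile=profile_idc, compat=profile_compat,
                    level=level_idc, length_size=length_size, sps=sps_list, pps=pps_list)

The recovered length_size is needed later when walking the length-prefixed NALUs in the picture data messages.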
If the type is picture data, then the next three bytes contain a 24-bit number for the decoder delay value of the frame, which is applicable only to B-frames. The default baseline profile does not include B-frames. Thus the first five bytes of the picture data payload act as header data. This is followed by one or more NALU blocks. Each NALU block is prefixed by the length of the NALU's encoded bits, and the number of bytes used to encode this length is the length-size mentioned earlier. The NALU itself is encoded as per the H.264 specification.

remaining-picture := delay[3B] | nalu0 | nalu1 ...
nalu(n) := length | nalu
length  := number in length-size bytes
nalu    := NAL unit as per H.264
The first byte of each NALU contains flags: a most-significant forbidden bit, then 2 bits of nri (NAL reference index), and the 5 least-significant bits of nal-type. There are several nal-types, such as 0x01 for regular non-intra pictures and 0x05 for intra pictures; please see the H.264 specification for the complete list.
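
As a sanity check of the picture layout, the following hedged Python sketch (again my own illustration) walks the length-prefixed NALUs of one picture payload and reads the nal-type from each NALU's flags byte; length_size is the value recovered from the configuration payload.

    def parse_rtmp_h264_picture(payload, length_size):
        is_intra = (payload[0] >> 4) == 1          # 1 = intra, 2 = non-intra
        nalus, pos = [], 5                         # skip enc-type, type and 24-bit delay
        while pos + length_size <= len(payload):
            length = int.from_bytes(payload[pos:pos + length_size], 'big')
            pos += length_size
            nalu = payload[pos:pos + length]; pos += length
            nri = (nalu[0] >> 5) & 0x03            # flags byte: forbidden | nri | nal-type
            nal_type = nalu[0] & 0x1f              # e.g., 1, 5, 6 or 9
            nalus.append((nal_type, nri, nalu))
        return is_intra, nalus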

The camera-captured and encoded data in Flash Player contains three NALUs in each RTMP message: the access unit delimiter (nal-type 0x09), the SEI timing information (nal-type 0x06) and the picture slice (nal-type 0x01 or 0x05). The Flash Player is capable of decoding other nal-types as well, and does not require the access unit delimiter or timing information NALUs for decoding. I haven't seen any support for aggregated or fragmented NALUs in the Flash Player.

RTP Payload

The RTP payload format for H.264 is specified in RFC 6184 and is typically supported in SIP-based video phones. The RTP header carries the crucial information such as the payload type, the timing data and the sequence number, whereas the configuration and picture NALUs are sent in the payload as specified by the RFC. The first byte of the payload contains one forbidden bit, two bits of nri and 5 bits of nal-type.

nalu      := nal-flags[1B] | encoded-data
nal-flags := forbidden[1b] | nri[2b] | nal-type[5b]
In addition to the base nal-types of H.264, the RFC defines new nal-types for fragmentation and aggregation. The Internet, plagued by middle-boxes, NATs and firewalls, imposes a practical limit on the size of a UDP packet, and the typical usable MTU is around 1400-1500 bytes. The H.264 encoder can generate much larger encoded frames, which therefore cannot be sent as one frame per RTP packet over UDP in many cases. On the other hand, some encoded frames are much smaller than the MTU and incur extra overhead for RTP headers; these small frames can be aggregated for efficiency.

Many SIP video phones configure their H.264 encoders to use multiple slice NALUs in a single frame, unlike Flash Player, which generates one picture NALU per frame. Thus traditional SIP video phones can produce encoded payloads small enough to fit in a single RTP/UDP packet without needing the fragmentation mechanism of RFC 6184.

When a large encoded frame is fragmented into smaller fragments, nal-type 28 is used in the first byte of each fragment, followed by a second byte containing the actual nal-type of the frame as well as the start and end markers. This is followed by the actual encoded data. The RTP headers of all these fragments carry the same timestamp value. The last fragment of the frame has the marker set to true, whereas all the previous ones set it to false. When multiple smaller frames are aggregated, nal-type 24 is used in the first byte of the aggregate payload, followed by one or more NAL data blocks, each prefixed by the 16-bit length of the encoded NALU. There are non-trivial rules on how the nri is obtained; we refer you to the RFC for the details. A sketch of both operations follows the grammar below.

To fragment:
let encoded-data = fragment0 | fragment1 | fragment2 ...
encoded-data of fragment(n) := orig-nal-flags[1B] | fragment(n)
orig-nal-flags := start[1b] | end[1b] | ignore[1b]
                | orig-nal-type[5b]
start := 1 if first fragment else 0
end   := 1 if last fragment else 0

To aggregate:
encoded-data of aggregate := nalu0 | nalu1 | nalu2 ...
nalu(n) := length[2B] | orig-nalu(n)
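
The following Python sketch illustrates the two packetization modes under the assumptions stated in the comments; the function names and the 1400-byte fragment size are my own choices for illustration, not values from the RFC or from siprtmp.py.

    def fragment_fu_a(nalu, max_size=1400):
        # Split one large NALU into FU-A (nal-type 28) payloads.
        orig_flags, data = nalu[0], nalu[1:]
        fu_indicator = (orig_flags & 0xe0) | 28    # keep forbidden + nri bits, type = 28
        orig_type = orig_flags & 0x1f
        chunks = [data[i:i + max_size] for i in range(0, len(data), max_size)]
        payloads = []
        for i, chunk in enumerate(chunks):
            start = 0x80 if i == 0 else 0
            end = 0x40 if i == len(chunks) - 1 else 0
            payloads.append(bytes([fu_indicator, start | end | orig_type]) + chunk)
        return payloads   # all share one RTP timestamp; marker true only on the last packet

    def aggregate_stap_a(nalus, nri=3):
        # Combine several small NALUs into one STAP-A (nal-type 24) payload.
        out = bytes([(nri << 5) | 24])
        for nalu in nalus:
            out += len(nalu).to_bytes(2, 'big') + nalu   # 16-bit length prefix per NALU
        return out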
In addition to sending the SPS and PPS packets in RTP, video phones also negotiate the configuration data via an external protocol such as SIP/SDP. Since Flash Player does not do that, we will not discuss it further.

Translating

Now that we understand the packetization of H.264 for Flash Player as well as SIP/RTP, let us go over the details of the translation process.

The configuration data is sent periodically by Flash Player before every intra-picture frame. However, SIP phones may not send the configuration data periodically. It is desirable to cache the configuration data received from each side and re-use it when the other side connects. The first packet sent must contain the configuration data. It is also desirable for the translator to periodically send the configuration data to both the Flash Player and SIP sides, irrespective of whether the configuration data is received periodically. In our translator we send the configuration data before every intra frame.
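
A minimal sketch of that caching policy, assuming a hypothetical helper class rather than the gateway's actual structure, could look like this: remember the latest configuration payload seen on one side and replay it ahead of every intra frame sent to the other side.

    class ConfigCache(object):
        # Caches the most recent configuration data from one side of the gateway.
        def __init__(self):
            self.config = None

        def on_config(self, payload):      # called whenever configuration data arrives
            self.config = payload

        def before_frame(self, is_intra):  # returns the config to (re)send ahead of this frame
            return self.config if is_intra else None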

In the Flash Player to SIP/RTP direction, when configuration data is received on RTMP, it is sent as two RTP packets, one for the SPS and one for the PPS; both use the same timestamp and set the marker to true. When picture data is received on RTMP and needs to be sent to the RTP side, it is dropped until configuration data has been sent to the RTP side. If the picture data is not dropped, all the NALUs are extracted. The last of the three NALUs in each RTMP message is the actual picture NALU, which is sent to the RTP side as follows. Only nal-types 1 and 5 are used; the others are ignored. If the NALU size is less than 1500 bytes, it is used as-is in the RTP payload with the marker set to true. If it is larger, it is fragmented into smaller fragments of at most 1500 bytes each, and multiple fragmented RTP packets are generated as per the RFC. All but the last fragment have the marker set to false; a marker of true indicates the end of the frame. All the fragments use the same timestamp value.
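
Putting those rules together, a hedged sketch of the picture-forwarding step could look like the following; it reuses fragment_fu_a from the earlier sketch, send_rtp(payload, timestamp, marker) is a hypothetical callback standing in for the real RTP sender, and the configuration forwarding (separate SPS and PPS packets) is omitted.

    def forward_picture_to_rtp(nalus, timestamp, send_rtp, max_size=1500):
        # nalus: list of (nal_type, nri, nalu) extracted from one RTMP picture message
        for nal_type, _nri, nalu in nalus:
            if nal_type not in (1, 5):                  # only regular and intra slices
                continue
            if len(nalu) < max_size:
                send_rtp(nalu, timestamp, marker=True)  # single NAL unit packet
            else:
                frags = fragment_fu_a(nalu, max_size - 2)   # leave room for the two FU bytes
                for i, frag in enumerate(frags):
                    send_rtp(frag, timestamp, marker=(i == len(frags) - 1))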

In the SIP/RTP to Flash Player direction, the configuration data is received in multiple RTP packets and is cached by the translator. Once both SPS and PPS payloads have been received from the RTP side, we are ready to start streaming to the Flash Player side. Any incoming RTP packet is put in a queue. When the most recently received packet in the queue has the marker set to true, the queue is examined and RTMP messages are created to be sent to the Flash Player side. Since Flash Player handles a complete frame in each RTMP message, we wait for the marker so that we only send complete frames. If the RTMP stream is ready but we have not yet received the configuration data from RTP, or have not yet sent the first intra frame to RTMP, the received packets are dropped. If no intra frame is received for 5 seconds, we send a full intra request (FIR) to the SIP/RTP side, which triggers the SIP phone to send an intra frame.

Once we decide that we can send packets to RTMP from the received RTP queue, we divide the queue into groups of packets with the same timestamp and same nal-type, while preserving the order of the packets. If the nal-type is 5, indicating that an intra frame is being sent to RTMP, then we send the configuration data before the actual picture data. The configuration payload format was explained earlier and contains both SPS and PPS along with other elements. Each group of packets with the same timestamp and nal-type is sent as a single RTMP message containing one or more NALUs, in the same order. If the nal-type is 1 or 5, the NALU from the RTP payload is used as-is in the RTMP payload with the five bytes of header explained earlier. If the nal-type is 28, indicating fragmented packets, then all the fragment payloads are combined into a single NALU. If the nal-type is 24, indicating an aggregated packet, then it is split into individual NALU data. Finally, the sequence of NALUs generated from the group of packets with the same timestamp and nal-type is combined into a single RTMP payload to be sent to the Flash Player.
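
As an illustration of this grouping and reassembly (again a sketch under my own naming assumptions, not the actual _rtp2rtmpH264 code), the queue can be processed roughly as follows; each queue item is a (timestamp, rtp_payload) tuple in arrival order, and the config-before-intra step is left out for brevity.

    from itertools import groupby

    def rtp_queue_to_rtmp_payloads(queue, length_size=4):
        payloads = []
        key = lambda item: (item[0], item[1][0] & 0x1f)        # (timestamp, nal-type)
        for (ts, ntype), group in groupby(queue, key=key):
            packets = [p for _, p in group]
            if ntype in (1, 5):                                # single NAL unit packets
                nalus = packets
            elif ntype == 28:                                  # FU-A: stitch fragments back together
                flags = (packets[0][0] & 0xe0) | (packets[0][1] & 0x1f)
                nalus = [bytes([flags]) + b''.join(p[2:] for p in packets)]
            elif ntype == 24:                                  # STAP-A: split out each embedded NALU
                nalus = []
                for p in packets:
                    pos = 1
                    while pos + 2 <= len(p):
                        size = int.from_bytes(p[pos:pos + 2], 'big'); pos += 2
                        nalus.append(p[pos:pos + size]); pos += size
            else:
                continue                                       # ignore other nal-types
            intra = any(n[0] & 0x1f == 5 for n in nalus)
            body = bytes([0x17 if intra else 0x27, 0x01, 0, 0, 0])   # 2-byte header + zero delay
            for n in nalus:
                body += len(n).to_bytes(length_size, 'big') + n
            payloads.append((ts, body))
        return payloads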

Gotchas

As mentioned in my previous article, there are a few gotchas. You must use the new-style RTMP handshake, otherwise the Flash Player will not decode or display the received H.264 stream. You must use Flash Player 11.2 (beta) or later when using "live" mode, otherwise the Flash Player does not accept multiple slice NALUs of a single frame. If both audio and video are enabled, then the timestamp of video must be synchronized with the timestamp of audio sent to RTMP. Note that RTP picks a random initial timestamp for each media stream, so the audio and video RTP timestamp values are not easily correlated unless you use RTCP or an external mechanism. You need to correlate the RTP timestamps of audio and video to the single timestamp clock of RTMP.
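
One simple way to do that correlation, shown as a hedged sketch below (not necessarily the gateway's actual mechanism), is to anchor each RTP stream's first timestamp to a shared wall-clock epoch and convert to RTMP milliseconds from there; RTCP sender reports, when available, give a more accurate mapping.

    import time

    class RtpToRtmpClock(object):
        def __init__(self, clock_rate, epoch):
            self.rate, self.epoch, self.first_ts = clock_rate, epoch, None

        def to_rtmp_ms(self, rtp_ts):
            if self.first_ts is None:                        # anchor on the first packet seen
                self.first_ts = rtp_ts
                self.anchor_ms = int((time.time() - self.epoch) * 1000)
            elapsed = (rtp_ts - self.first_ts) & 0xffffffff  # tolerate 32-bit wrap-around
            return self.anchor_ms + elapsed * 1000 // self.rate

    epoch = time.time()                          # one shared epoch per call/session
    video_clock = RtpToRtmpClock(90000, epoch)   # 90 kHz RTP clock for video
    audio_clock = RtpToRtmpClock(8000, epoch)    # e.g., 8 kHz for narrow-band audio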

Conclusion

It is possible to re-packetize H.264 between Flash Player's RTMP and standard SIP/RTP without doing any actual video transcoding. This article explained the tricks and gotchas of doing so!

The implementation works between Flash Player 11.2 and a few SIP video phones such as Ekiga and Bria 3.

References
[6] ITU-T Recommendation H.264, "Advanced video coding for generic audiovisual services", March 2010.
[7] ISO/IEC International Standard 14496-10:2008.