How to extend HTML5 for real-time video communication?

A few months ago, I was discussing HTML5 with a friend of mine. We tried to figure out what would it take to extend it to support web-based video communication. The proposed HTML5 already includes audio and video tags, but are useful only for streaming video applications. This article presents more refined thoughts on how to extend the browser to support video communication.

First approach: extend the video tag
W3C has added new video element in HTML5 to facilitate playback of interoperable video formats across browsers. Existing web sites use "object" element to run an external plugin such as Flash Player for video playback, which is intended to be replaced by the HTML5's video element. This allows browser manufacturers especially for phones and other devices to easily playback web videos, without having to implement the full Flash Player plugin. The "src" property allows specifying the URL of the video to play, and additional properties such as poster, preload, autoplay, loop and controls allow controlling the behavior of the video player.

One way to support video communication is to extend the video element with additional properties that allow it to capture and publish local video, and control camera and microphone behavior. For example, in a two-party call between Alice and Bob, Alice can have two video elements, one to publish local video to URL stream "alice" and other to play remote video from URL stream "bob". Similarly, Bob can have two video elements, one to publish local video to URL stream "bob" and other to play remote video from URL stream "alice". The "src" property can specify the central media server or rendezvous server location as well as the publish or play stream names, e.g., "rtmp://server/conf123?publish=alice".

This is the idea behind my Flash-based audio and video communication project. In addition to existing properties such as src, preload, autoplay, loop and controls, it defines new properties for microphone, camera, playing, recording, etc., as you can see on How to use the VideoIO API?. It also overloads the "src" property to allow "rtmp" and "rtmfp" URLs for media server or rendezvous server location, respectively. This application with its new properties can be used as a drop-in replacement for a video element that supports video communication in the browser.

This approach of extending the existing video element with new properties works well for two-party as well as multi-party conferences, and centralized as well as end-to-end media path. The nice thing about this approach is that it keeps the actual call signaling out-of-scope of the video element, e.g., your web application implements call signaling using existing Javascript/Ajax/websocket/server-event technologies. It keeps the specific rendezvous protocol mechanism such as "rtmp", "rtmfp", and in future "sip" or "rtsp", outside the video element.

To avoid interoperability problems, a minimum subset of supported rendezvous is recommended. The requirements of such a protocol is to support real-time media transport, preferably over UDP, in centralized or end-to-end path in presence of network middle boxes such as NATs and firewalls.

Second approach: define new connection object
The previous approach integrates capture, playback and connection functions in to a single video element, with additional properties. Alternatively, these functions can be split in to different elements and Javascript objects, e.g., the video element does display/playback, but new camera and microphone objects allow capture, and new connection object allows end-to-end real-time media path among participants. The Javascript application actually connects these different elements and objects to build a complete video communication system.

There are several proposals on how the new connection or transport API will look like. Example attributes are: protocol (udp or tcp), list of reflectors and relay servers , mode (initiating or receiving), secure (boolean). Additionally, it has methods such as connect and send, and events to indicate connection status and incoming data. Existing protocols such as ICE, STUN, RTP/RTCP and SIP may be implemented in the browser or external gateways to support such as transport object. Finally, these transport objects can be piped with display and capture components, audio and video codecs and filters, etc., to implement a complete video communication application.

In summary, this approach defines new Javascript objects such as Transport, Camera, Microphone, Codec, etc., and allows the application to connect these objects to build a real application. This is more complex than the first approach, but allows fine-grained application logic.

Third approach: use external application
This approach understands the limitations of HTML and does not try to "add" video communication to it. We are considering this approach of a separate application in our web communications project at Illinois Institute of Technology.

While the idea of extending HTML to support video communication is useful and interesting, there are many limitations. In the past, incompatibility in HTML among browsers has been a nightmare for web developers, and extending HTML for yet another feature is bound to cause more interoperability problems. Browser manufacturers are sometimes not too keen to add a new feature, e.g., for business reasons if it competes with the manufacturer's existing product or service. Third important reason is that the video element of HTML5 lacks some digital rights management related features, which causes media owners to publish their media using restricted Flash application. Fourth, adoption of new HTML5 is slow, so web site developers still need to fall back to Flash-based application for video playback at least in the short term. Finally, adding capture and end-to-end transport components in HTML5 gives rise to a plethora of issues related to privacy, security and denial of service attacks, in case of faulty browser implementation. Due to these reasons many people believe that extending HTML and browsers to support video communication is not the right approach.

Hundreds of applications exist that implement consumer video communication. Some popular ones are Skype, Gmail, tinychat and Facetime. The technology behind these are drastically different, especially for signaling and control. However, at the bottom, every video communication application tends to establish some form of end-to-end UDP-based real-time media path, and fall-back to relays if that fails. As mentioned before, IETF standards exist to establish such media path and relays.

Imagine a standard-compliant resident application, rtc-app, that runs on user's machine independent of the browser, but allows any application including browser to establish real-time media-path. The browser can use existing API such as websocket or HTTP to interact with rtc-app. The rtc-app application is not owned by a specific vendor, and is installed by the end-user. The avoids re-implementing the feature by every vendor who wants to do real-time video communication. To address the privacy and security concerns, rtc-app must directly ask permission from the end-user before initiating or accepting a connection instead of automatically (and randomly) on API calls. This is similar to how Flash Player asks the end-user for permission to capture from microphone or camera, but can remember the application for future use if told so by the end-user.

The main advantage of this approach is that it does not require changing the browser or HTML, but still is a generic implementation-focussed way to enable real-time video communication for many other applications. If an existing vendor such as Skype or Google opens up its API, it will be a big step forward. While rtc-app can provide transport functions, the audio and video capture still needs to be done somehow. Various codec licensing issues may prevent us from including it in rtc-app, but Flash player based application similar to the first approach can perform capture on its behalf. The main problem with this approach is that it requires an additional download and install by the end-user.



No comments: