Friday, January 01, 2016

My WebRTC related papers from 2015

This post continues from the previous one and gives an overview of my ongoing research on web-based multimedia communication or WebRTC. I wrote five published co-authored research papers last year answering these questions - How do we solve some of the enterprise challenges of WebRTC using browser extensions? How do we address cloud challenges of a multi-services and apps vendor? How do we create pure web-based enterprise communication and collaboration system without depending on legacy protocols? How do we do user reachability in a multi-apps environment created by WebRTC? And, how do we do write-once-run-anywhere for WebRTC based team apps using a cross platform tool? Read on if one or more of these questions interest you...

Enterprise WebRTC powered by browser extensions

Traversing WebRTC flows created by external third-party websites across restricted enterprise firewalls is challenging. There are other challenges in adopting WebRTC in enterprises, e.g., how to integrate it seamlessly with existing communication equipments, or how to enforce enterprise policies such as call recording of WebRTC flows on third-party websites? We show how to use browser extensions to solve these problems. This systems paper is based on my implementations of two interesting projects.

“We use browser extensions to solve two important issues in adopting WebRTC (Web Real-Time Communications) in enterprises: how to integrate WebRTC-centric communication with existing systems such as corporate directories, communication infrastructure and intranet websites, and how to traverse media paths across enterprise firewalls. Vclick is a simple and easy to use web-based video collaboration application that enables click-to-call from other webpages. SecureEdge is a network border traversal system for policy and security enforcement, and consists of a secure media relay that sits at the network border or in the cloud. A browser extension in the enterprise user’s device transparently injects this media relay in every WebRTC media path needing to traverse the enterprise network edge to enable authenticated border traversal without help from the websites hosting the WebRTC pages. We attempt to generically support WebRTC in enterprises on a variety of application scenarios instead of creating another fragmented communication island. The challenges faced and techniques used in our proof-of-concepts are likely extensible to other enterprise WebRTC scenarios using the emerging HTML5 technologies.”

Keywords: WebRTC, enterprise communication, secure edge, browser extension, VoIP, video call, firewall traversal, media relay.


ALICE: Avaya Labs Innovations Cloud Engagement

Although we know how to create cloud-hosted services, platforms and infrastructures, little is known about cloud hosted communication and collaboration services, especially to enable multi-tenancy and self-service. This research focuses on the challenges of hosting cloud services for customer trials, where resources are limited to make existing services cloud ready or to fit a specific platform. This is based on my work on creating a cloud portal to host research-oriented early or pre-product services on the cloud, and identifying common themes and techniques.

“We present the architecture and implementation of our enterprise cloud portal named ALICE, Avaya Labs Innovations Cloud Engagement, which provides self-service access to service developers, tenants, and users to various communication and collaboration applications. Currently ALICE is used for field testing of advanced research prototype services based on technologies such as WebRTC and HTML5. This paper describes the current portal and extensions to support multi-tenancy.

We describe challenges in creating a self-service multi-tenant SaaS (software-as-a-service) portal to host communications and collaboration applications for small to medium scale businesses. The challenges faced and the techniques used in our architecture relate to security, provisioning, management, complexity, cost savings and multi-tenancy, and are applicable and useful to other cloud deployments of diverse enterprise applications.”

Keywords: Cloud, system architecture, portal, multi-tenancy, Internet telephony, enterprise communications, web collaboration.


Vclick: endpoint driven enterprise WebRTC

One of my earliest project at Avaya Labs was on creating a light-weight service for wide range of web communication and collaboration scenarios. Vclick is a collection of many loosely coupled apps that run the app-logic in the browser or endpoint, and mash up at the data level. It contains applications for video call, conferencing, video presence, text chat, click-to-call, screen sharing, shared notepad and whiteboard, and so on. It goes against the conventional web wisdom of thin-client, single-page-apps, or rigid GUI, and presents a new software architecture to create robust endpoint driven apps. The paper is really about how to keep the endpoints smart and network (or service) dumb in the context of collaboration applications.

“We present a robust, scalable and secure system architecture for web-based multimedia collaboration that keeps the application logic in the endpoint browser. Vclick is a simple and easy-to-use application for video interaction, collaboration and presence using HTML5 technologies including WebRTC (Web Real Time Communication), and is independent of legacy Voice-over-IP systems. Since its conception in early 2013, it has received many positive feedbacks, undergone improvements, and has been used in many enterprise communications research projects both in the cloud and on premise, on desktop as well as mobile. The techniques used and the challenges faced are useful to other emerging WebRTC applications.”

Keywords: WebRTC, enterprise communication, web video conferencing, resource-based architecture, web applications.


User reachability in multi-apps environment

With numerous "walled-garden" services and apps emerging because of WebRTC, there is a need to identify the best way to reach your contacts, irrespective of which app or service she is on. This systems paper describes my work on implementing a mobile (and desktop) app called Strata Top9, to quickly reach your important contacts. It really is a front-end to launch other applications. Unlike existing presence based systems, we propose to iterate during call initiation. The paper presents the software architecture and design decisions along with several motivational use cases of our project. It also details the concept of dynamic contacts, and endpoint driven caller and receiver policies.

“Recent progress in web real-time communication (WebRTC) promotes multi-apps environment by creating islands of communication apps where users of one website or service cannot easily communicate with those of another. We describe the architecture and implementation of a multi-platform system to do user reachability in multiple communication services where users decide how they want to be reached on multiple apps, e.g., in an organization that has voice-over-IP, web conferencing and messaging from different vendors. Our architecture separates the user contacts from reachability apps, supports user and endpoint driven reachability policies, and has several independent and non-interoperable WebRTC-based apps for two-way and multiparty multimedia communication. Our flexible implementation can be used for enterprise or personal communications, or as a white-labeled app for consumers of a business.”

Keywords: system design; mobile app; user reachability; multiservices; VoIP; WebRTC; caller policy.


Developing WebRTC-based team apps with a cross-platform mobile framework

Ability to write-once-run-anywhere still eludes many app developers. Luckily several cross-platform development tools exist. However, creating cross-platform communication and collaboration related apps is still challenging. This paper presents my implementation work on creating cross platform apps. In particular, four types of platforms - web app on PC and mobile, and installed app on PC and mobile - are considered, and seven different apps are covered for a wide range of enterprise use cases. Techniques and steps for creating such cross platform apps are presented along with lessons learned based on practical experience. Additionally, considerations for iOS and wearable Glass devices are presented.

“We present lessons learned in developing cross platform multi-party team applications. Our apps include a range of communication and collaboration scenarios: document and content sharing in a team space, an agent-based meeting helper, phone number dialer via a voice-over-IP (VoIP) gateway, and multi-party call in peer-to-peer or client-server mode. We use web real-time communication (WebRTC) to enable the audio and video media paths in the apps. We use frameworks such as Chrome Apps and Apache Cordova to create apps that can be accessed from a browser, or installed on a desktop, mobile device, or wearable. The challenges and techniques described in our paper related to audio, video, network, power conservation and security are important to other developers building cross-platform apps involving WebRTC, VoIP and cloud services.”

Keywords: HTML5, Apache Cordova, Chrome Apps, WebRTC, Mobile, Cloud, Wearable.


Wednesday, December 31, 2014


Every time a new protocol appears in the protocol jungle of multimedia communications [2], people attempt to compare and contrast it with existing established protocols, such as SIP. With WebRTC as the new kid on the map, you can find several attempts to compare SIP and WebRTC [3][4][5][6][7]. Depending on which camp the comparison originates from, you may find flavors of favoritism, or unintentional downplay of the importance of the other camp.

This article presents my point of view, hopefully unbiased, on this topic. Let me start by saying that SIP and WebRTC are different, and it is not fair to compare them without an established context.

Complex protocol vs. simple API

SIP is a protocol, not an API; whereas WebRTC is an API, with an associated set of protocols.
Consider that TCP is a protocol but socket is an API. TCP has complex state machinery to enable reliable bi-directional end-to-end packet flow assuming that intermediate routers and networks can have problems but end to end reliability is assured. On the other hand, the socket API hides all the complexity of TCP and provides an easy to use abstraction. Luckily in the case of TCP, developers only work on top of the standardized socket API. Whereas in the case of SIP, usually they end up dealing with the protocol directly, or work with (partly broken) implementations of intermediate network element such as a SIP proxy, or use (poorly defined) APIs from third-party SIP libraries. Depending on what level of abstraction is exposed in the SIP library, the development effort can range from very simple to very complex. In the case of WebRTC, the API is being defined and standardized, and is intended as a high level abstraction suitable for  JavaScript/Web developers.

To list the differences between SIP and WebRTC - one could say that SIP is programming language agnostic but WebRTC is not, and must use JavaScript, or that SIP application could run on any device or platform, but a WebRTC application requires a browser to host the app. These are superficial differences - due to the core difference in the nature of SIP vs. WebRTC, i.e., protocol vs. API.

SIP would have been perceived as much simpler if there were well defined APIs from the start. But that would have limited the flexibility (and freedom) of what SIP could be used for. More on the flexibility and freedom is discussed later in this article.

One thing to note, however, is that WebRTC does use a set of complex protocols behind the scenes, many of which are shared by a SIP system, as we describe next. Although a web application using WebRTC is simple due to the high level API, the implementation of WebRTC in a browser itself is not that simple!

Protocol stack

SIP is an established protocol for Internet communication. WebRTC is an emerging API intended to be provided by the browsers and consumed by the applications in JavaScript. It has an associated set of protocols, some of which overlap with those used in a SIP system. The title of this article indicates a comparison of WebRTC with SIP/SDP, and not just with SIP. This is because SIP is often only a part of the overall puzzle in a real world communication system, as shown in the following diagram taken from my research paper [1].  In fact, the role that SIP plays in a real system is outside the scope of the WebRTC specification currently. Which means that it does not really make sense to compare SIP alone with WebRTC protocols, but it is okay to compare a SIP system with that based on WebRTC.

Comparison of typical SIP vs WebRTC application stack. The dotted line separates what is programmed by the application developer and what is provided by the platform. 
Although most SIP systems use SDP, it is not a requirement, and some SIP systems may use something else if needed. Same thing holds true for other protocol elements such as RTP for media transport. However, within the context of a voice and video call application, you will likely see the above mentioned protocols in effect. On the other hand, the WebRTC effort specifies the mandatory set of protocols including SDP, SRTP and ICE, and also, the minimum list of mandatory audio and video codecs, which must be implemented by a WebRTC capable browser. In the diagram above, while SIP is only one block in the system, WebRTC specifies multiple blocks including the API, codecs, SRTP/RTP/RTCP, ICE and to some extent SDP. While SDP is used in the WebRTC API, an application is free to change it to anything else when exchanging the session information with the peer, if needed. Defining and mandating more things in WebRTC makes the interoperability a little easier, unlike SIP where many such things are left to the implementation. We will talk about interoperability later in this article.

Network elements

The SIP specification defines the behavior of an endpoint (or user agent) as well as various network elements such as proxies and redirect servers. Other SIP extensions define more network elements such as a conference server, a presence server, and so on. With WebRTC, the focus is only on browser-to-browser communication, and only the browser's behavior is defined in the specification. Other elements such as a gateway or media server are free to implement the behavior. However, that part is likely not going to be standardized. This hopefully limits the number of specifications related to WebRTC, and hence limits the resulting complexity.

This does not mean that WebRTC systems will not need such network elements. Ability to call a phone number via a gateway, or to host a multi-party voice mixing or recording on a media server, or to locate the target user via a lookup service are all examples of crucial services required in many real world applications. Lack of standardization means that every vendor will likely define its own version of these elements, creating fragmentation. Such scenarios are actually inline with WebRTC's goal which aims at creating media path among browsers visiting the same website, and does not care for inter-website communication. More on interoperability vs. fragmentation is discussed later in this article.

Fortunately, it is entirely possible to create fully functional WebRTC systems without using complex or heavy weight network elements. In fact, many initial demonstrations of WebRTC applications have followed this path with peer-to-peer media flows, along with a very simple signaling/rendezvous service. SIP, too, started out with this model of peer-to-peer media flows with a simple signaling/rendezvous service in the form of a SIP proxy or redirect server. However, practical limitations and real-world deployments have mostly thrown this idealistic model out of the window, and have largely adopted a "managed" network centric model with heavy weight network elements in the form of back-to-back user agents, session border controllers, and call stateful proxies. It is likely that the similar real world forces may cause a similar network centric deployment model in the case of WebRTC. Only time will tell!

Message flow

To establish a multimedia call, three pieces of information are exchanged between the two parties - (1) an "invite" from caller to callee as an intention to establish a call, which the callee may answer or decline, or ignore, (2) session description containing capabilities of each party, (3) transport candidates for establishing connectivity on the media path, hopefully peer-to-peer. Let us call these steps as (1) "invite", (2) session description exchange and (3) transport candidates exchange.

In a WebRTC-based application, these may be separated out into different phases or may be combined together, as determined the web application. In fact, (1) above is not part of the specification, but is often implemented in a WebRTC call, and (2) and (3) are exchanged between the two parties using an out-of-band channel outside the scope of the WebRTC specification. The term "trickle ICE" refers to separating (3) from (2) in a WebRTC session negotiation, and is the default mode - although it is possible to not use it, as determined by the web application. In a SIP system, all three pieces are usually bundled in a single request-response transaction and are carried in a SIP INVITE and it's 2xx response between the two parties, often via one or more intermediate proxies.

The separation of these pieces of information in WebRTC makes it more flexible, e.g., an application can invent any kind of user lookup service over any protocol (HTTP POST, WebSocket, XMPP, Google Channel, LDAP, hard-coded, or whatever, or even SIP [1]). Some applications may not even need such a lookup service, e.g., a virtual presence application where everyone who visits the website is able to see and hear others on that site. Unlike this, in SIP, a user lookup service must be accessible via SIP.

For those who remember H.323, and its multiple steps of call setup in version 1, may argue that we have come back a full circle - started with multiple steps call setup, invited H.323 fast connect and SIP for single step call setup, and now back to multiple steps to establish a WebRTC call. However the situations are different in H.323 vs. WebRTC - in H.323 all the steps were done by the same entity, and after the initial "invite" step, the subsequent ones were follow through, with no perceived benefit to keep them separate. In the case of WebRTC, the "invite" step is outside the scope of WebRTC specification, hence it makes sense to keep it separate. The last transport candidates exchange step is incremental with real benefit to keep it separate from session description exchange, so that better peer-to-peer flows can be detected in future, which first peer-to-peer flow that works gets in use as soon as possible. In the case of H.323, there is no equivalent of this step in my opinion, because the second and third steps are about exchanging media capabilities, and then establishing fixed set of logical channels based on those media capabilities.

The SIP specification and many of its extensions, such as call transfer, interoperability with phone, or capabilities of user agents, actually deal with only the first piece of information mentioned above, i.e., how to reach the target user or device, and are often related to the notion of a "call". In WebRTC specification, there is no notion of a "call", or signaling to lookup the target user. It is assumed that such constructs and concepts belong to the web application or the website that is hosting the application. This gives flexibility to the application developer on how to implement advanced features such as conferencing or call transfer. With more flexibility, comes more power, and more non-interoperability, described next.

Interoperability vs. fragmentation

In the past fifteen years, non-interoperability has been a major problem in the SIP community. Although it is often easy to achieve interoperable voice call signaling, it is rather difficult to make sure that those hundreds or thousands of SIP systems out there will implement every little detail consistently as per the specification. Moreover, there is no mandatory codec defined by SIP, which means that two perfectly compliant SIP endpoints may not talk to each other.

With WebRTC, you only need to achieve interoperability among the major browsers. WebRTC easily enables communication between users visiting the same website, where the application fills the void of signaling, user lookup, and the likes. Since signaling is outside the scope of the specification, interoperability between users visiting separate websites is not readily available and requires one-to-one federation between the two websites or web application. This fragmented communication behavior is inline with how existing web applications behave, i.e., they tend to create islands of users, where users within an island can talk to each other, but not beyond. Have you ever thought of being able to reach your Linked-In contact from your Facebook page? I have, and have tried to do something about it [8][9].

The fundamental difference of triangular vs trapezoidal call model in WebRTC vs SIP, respectively -- which means the two parties are connected to the same server in WebRTC, whereas two parties can potentially be on separate servers in SIP -- results in major differences in interoperability requirements.

The triangular way of WebRTC is not how VoIP community perceived open communication to be. From that perspective, even though WebRTC is an emerging open standard, it will created closed walled gardens of communication islands - leading to communication fragmentation. Unfortunately, even in SIP-based systems, due to non-standard behaviors in implementations or different implementation choices (e.g., codecs) by different vendors, it tends to create fragmented islands, where an equipment from one vendor rarely talks freely with that of another. This tends to create an ecosystem, where a customer must purchase phone endpoints and servers from the same vendor, because it is often hard to achieve true mix-and-match interoperability out-of-the-box.

The problem is more relaxed in the case of WebRTC, because there are only a few major browsers, and the protocol specification does not involve any signaling or media server, except for generic STUN and TURN servers, when appropriate. A fair comparison would have been possible if there were hundreds or thousands of browsers implementing WebRTC, similar to the plethora of SIP implementations that exist today. Or if there were only five major vendors creating SIP user agents. With SIP, the protocol was invented first, followed by an attempt to deploy the user agents and servers. With WebRTC, we already had user agents (browsers) and servers (web servers, STUN/TURN servers), and we put together a set of protocols to let them talk in real-time.

By leaving signaling outside the scope, a WebRTC specification escapes many problems related to non-interoperability. In practice, fragmentation seems to be unavoidable, largely due to business reasons -- or lack of incentive for one website to let its users talk to that on another website. The benefit of better interoperability, maintainability and usability far out weighs the cost of fragmentation for many websites and web applications. For some applications that require cross-site communication, a gateway can do the job, without being part of the specification.

The biggest non-interoperability and fragmentation threat to WebRTC is that some major browser vendors may not implement it altogether, or may implement a variant of the specification, making it non-interoperable with other browsers. Hopefully, market pressure will stabilize the interoperability among major browser. Or developers will find alternatives (plugins? gateways?) to fill the gap. Again, only time will tell!

Flexibility and freedom

We have mentioned about flexibility at the protocol and application level. In summary, SIP provides greater flexibility at the protocol level and WebRTC at the application level. With greater flexibility comes more freedom to innovate, and unfortunately, more problems in interoperability. 

First, SIP provides flexibility at the protocol level, e.g., ability to use non-SDP based session description, or ability to use RTP vs SRTP vs ZRTP based media transport, or ability to add newer extensions for media path or call signaling. WebRTC mandates many such elements in the protocol, e.g., use of SDP, SRTP, ICE and so on. In future, if a newer or better media transport, NAT traversal or session description is invented, the specification will need to be changed to incorporate that. However, upgrading the specification when there are only a few browser implementations should not be a nightmare, especially if the upgrade is backward compatible. Once WebRTC specification gets ratified, it leaves little room to change or improve, in an implementation. With SIP's flexibility and resulting freedom, people have created many hundreds of different implementations and use cases using the same base protocol. WebRTC lacks the ability, for example, to implement non-trivial media path, such as application level multicast or local LAN broadcast, or to let the web application configure differential service for different media flows, or to let the web application inject text-to-speech in a media path.

The protocol flexibility of SIP comes at a price - no guarantee of common features such as security or common codecs in different implementations. For example, the voice path in one SIP call could be unencrypted and in another could be encrypted - determined by the implementation choice. Whereas in WebRTC, many of such features are well defined and mandated, as described in the next section.

Keeping the signaling part out of scope and defining the API give good application level flexibility to WebRTC systems. As mentioned before, features such as call transfer, conferencing or user lookup are application's responsibility. This allows a social website to implement these features different from a click-to-call provider's website, for example. Given the numerous websites, and their potential for incorporating real-time in-browser communication, such flexibility is very useful and highly desired.

Protocol differences in SIP vs WebRTC system

Both SIP and WebRTC systems use certain common set of protocols, e.g., SDP for session description, ICE for NAT traversal, and (S)RTP for media transport. The following chart from my paper [1] summarizes the interoperability differences.

Interoperability differences in SIP vs WebRTC; (o) means optional.
In practice, the session description (SDP) and media path defined by WebRTC requires many extensions, that are not found in existing SIP systems, and hence it is very hard to transparently interoperate between the two without using a media gateway. Eventually, one would hope that existing SIP systems would adapt to incorporate these new extensions. But history has told us that this is next to impossible - given the large number of SIP implementations, and given the availability of media gateways. Thus, a browser-to-SIP call will almost always involve a media gateway, unless you sell your own SIP endpoint.

Frequently asked questions (FAQ)

I have heard these. You will likely hear them too.
  1. Are SIP and WebRTC competing technologies? Yes and no. It depends on the context. As an access protocol to connect a client to an established voice/video service or infrastructure, yes. For anything else, most likely no.
  2. Is WebRTC a subset or refinement of SIP? No. WebRTC has certain elements and features that are not present in SIP/SDP. For example peer-to-peer data channel, or trickle ICE, or programming APIs. 
  3. Is WebRTC a superset or generic case of SIP? No. A SIP-based system includes elements and features that are not present in a WebRTC browser, e.g., well defined network elements, call event notifications, or user lookup service.
  4. Is WebRTC similar to P2P-SIP? or Jingle? or RTMFP? or name-your-favorite-here? No, for all. In theory, WebRTC has some overlap with session description and media path of a standard SIP system. In practice, WebRTC is very different from any of the existing systems we had.
  5. Does SIP make WebRTC better? What about the other way? Depends on who you talk to. A SIP proponent would argue that cross-site communication and PSTN interoperability requires using SIP at the core, or that SIP in JavaScript is good enough for signaling of WebRTC. A WebRTC proponent would argue that replacing SIP with WebRTC will improve on security, interoperability and call quality. In my opinion, these arguments are either too narrowly focussed or a compelling counter argument can easily be made. So my answer is "no" on both counts.
  6. Can SIP and WebRTC work together, instead of competing? Absolutely. Firstly, they do not compete except in a very narrow context. And secondly, they do work together - the SDP generated by WebRTC can be carried in SIP, directly from the browser or via a gateway. See [1] for alternatives.
  7. Will WebRTC remain at the end, and must SIP be at the core? No. It is entirely possible to create WebRTC capable network elements such as media servers and gateways without depending on SIP, that allow creating voice/video infrastructure with WebRTC at the core. It also possible to create many existing use cases with peer-to-peer media path of WebRTC, without using these network elements at all. 
  8. Does WebRTC have better voice quality than SIP? No. Because, SIP does not specific fixed set of voice codecs or voice quality. At present, when many SIP systems use G.711, G.729 or Speex, the voice quality of Opus used in WebRTC is superior to those SIP systems. However, nobody has stopped those SIP systems from using Opus (and related media engine) in their voice stack.
  9. Will WebRTC improve voice quality of my PC-to-phone application? No. Unless you are talking about smart-phone with WebRTC running on phone also. When voice path gets translated to a phone call, the quality will likely be reduced to G.711 or G.729 or whatever the gateway provider has deemed suitable on the phone network.
  10. Will WebRTC give better voice quality for my conferencing application? Depends. See previous question. For centralized media path (with audio mixer), it requires transcoding, so voice quality will likely suffer. Optimization such as using peer-to-peer voice path, or reflecting loudest-N streams to client without mixing may preserve the end-to-end voice quality in conferencing. 
  11. Why does browser not implement SIP directly? For browser-to-browser use case, SIP is not needed. For anything else where one end is a browser, a gateway can do the job of translating to SIP if needed. And WebRTC is focussed on browser-to-browser use case only. Having said that, it is possible to do SIP in JavaScript [1] running in your browser, or to implement your own plugin or browser with built-in SIP stack and related APIs. 
  12. What does it mean that a SIP conference server supports WebRTC? It could mean that it supports a browser end-point to participate in a multi-party conference over WebRTC, or it could mean that it uses Google's WebRTC media engine or the OPUS/VP8 codec for the voice/video stack but requires SIP/H.323 endpoints to connect. Be careful!
  13. Is WebRTC better than SIP/SDP? Hard to say. WebRTC attempts to fill the gap for a very specific problem, but requires robust, secure and high quality implementation, and defines simple high level API. On the other hand, SIP leaves many such things open to implementers. So for the specific problem, WebRTC is better, some might say.
  14. Between WebRTC and SIP, who will win the racing? There is no racing. Or if there is one, then WebRTC and SIP are racing in different directions, at 90 degrees to each other. Depending on where you look from, or which axis has the finish line, either of them could win. But the truth is, WebRTC is for browser-to-browser communication, where SIP/SDP/RTP is not feasible currently -- so there is no racing. One could argue that with more number of browsers supporting WebRTC than number of SIP endpoints, or with more number of JavaScript developers than number of SIP implementors, WebRTC is likely to win the racing. No comments on that. A more correct analogy of racing, in my opinion, is in specific contexts, e.g., between WebRTC of browser and RTMP+ RTMFP of Flash Player for browser-to-browser communication, or between SIP phone vs.  WebRTC browser for client access to backend voice/video service or infrastructure. 
  15. What is the real difference between WebRTC and SIP? Ah, glad you asked! If you have already read the rest of this article, you will be surprised to find that unlike all the differences mentioned before, the real difference is who the most important customers of WebRTC vs. SIP are - for WebRTC it is the web developers community, and for SIP it has been telecom providers and equipment vendors.


This article has attempted to present several factors comparing and contrasting SIP/SDP vs. WebRTC systems, without really diving in to the specific protocol differences of how SDP/RTP are used in SIP vs. WebRTC. This is because, while WebRTC specifies the list of profiles and extensions to SDP/RTP, the SIP system is free to choose whichever it likes. So any such comparison will end up being one between WebRTC and SIP of vendor-X.

There are similarities and differences between the two types of systems. In the end it is about the applications and user experiences, rather than protocols and APIs. In practice, it is easy for an established SIP-based IVR system to add WebRTC access, compared to that for a newbie WebRTC system to implement full fledged IVR functionality. The same holds true for many existing telecom, web communication and Internet conferencing applications. Thus, many existing SIP-based applications will likely also adopt WebRTC for access, if there is a demand. And similarly, new pure WebRTC-based web applications will likely incorporate SIP gateway-ing to reach out to  non-browser users, if there is demand. SIP or WebRTC itself is just a small piece of the puzzle in the overall system or application. Nevertheless, WebRTC attempts to fill a void that has existed for a really long time in the SIP-based communication world, for which past attempts such as using plugins have largely been unsuccessful, especially on mobile platform.

Monday, December 29, 2014

About Bellheads, Netheads and Webheads

Many people are familiar with the Netheads vs Bellheads battle [1][2]. A Bellhead has a tightly controlled and centrally managed view of services that work all the time and on which the provider can deploy applications and consumers will pay for them, usually a heft price every month. A Nethead is happy to use loosely connected hardware that is up 99% of the time, but allows creating a massive loosely connected network of software applications, where the hardware does not care which software is running or sending traffic.

I believe, most people of my generation belong to the Netheads family, even though their employers may be Bellheads. From [1] "This philosophical divide should sound familiar. It's the difference between 19th-century scientists who believed in a clockwork universe, mechanistic and predictable, and those contemporary scientists who see everything as an ecosystem, riddled with complexity. It's the difference between central and distributed control, between mainframes and workstations."

More recently a new generation is evolving, called Webheads. (Please do not google for this term, or you will be misled). For a Bellhead, the control must be strict and centralized, with managed services. For a Nethead, the control, if any, is distributed in the end-point, with layered applications on top of the basic best effort infrastructure. For a Webhead, everything must run or be accessible from a browser.

"Weird!" - some might say.

"How is it even in the same league?" - others might ask.

Here is my attempt to present the distinguishing attributes.

Evolution Initially there were Bellheads. They were happy living in their islands, working hard towards what the island demanded, and created applications and services for the island. Then Netheads evolved, and separated applications from the grasp of the services. And interconnected the islands. They created Internet. They created the world-wide-web. They created SIP. And the list goes on. Then Webheads emerged, and they started moving applications to the web. Slowly and steadily. First, it was email to webmail. Then documents on web editor. Then VoIP to web telephony. Now it looks like they have spread every where and practically moved all kinds of applications to the web.
Control Believe in centralized control, like one super-power to rule them all. The central element has enough gravity to pull many applications and services, resulting is heavy weight servers but thin clients. Believe in distributed control or no-control, resulting in chaos at times. The fluidity of the services and applications keeps them decoupled, or loosely connected with each other, spreading their wings to the peripherals. Believe in centralized control in the form of the website on which their application is hosted, but often accept the fact that there can be numerous such control elements. Work hard to avoid any connectivity or information leak to other controls, outside their own.
Robustness Everything hardware must be 99.999% reliable. Overall system must be 99.9% reliable. If system goes down, even if nobody is using it right now, they will lose business. Things have strict guarantees.The system works most of the time, but fails sometimes, but no one person or entity may be blamed, because it is collective responsibility. No guarantees, just best effort service, whatever... whatever works wellThe system works if it does, and does not if it does not. They work hard to keep the system running. But many systems are poorly designed, and often fail on sudden popularity of the website. Often blame for the failure is put on the network, and many times on the database.
Example To add video communication on top of voice calls is a massive renovation project involving upgrading existing services, making the pipes bigger to carry video, incorporating fixed bit rate video codecs to guarantee quality, and upgrading terminal devices to support video. It must work all the time from day one.To add video on top of a voice call client-server application, is just changing the application to use the desktop webcam and display, and pushing new video packet along side voice, without changing any other infrastructure element in most cases.To add video on top of a voice call web page, there is a redesign of the website to accommodate the visual elements, so that everyone viewing the same web page can see and hear each other. Regarding the codecs and bandwidth, delegate that to Flash or WebRTC.

In some ways, like Bellheads, Webheads also think of centralized control and management. The hosting website is the controller of an application and keeper of all the user data that could be generated on that website. In a way, they both create islands of innovations which do not communicate with each other very easily, except occasionally via rowboats, i.e., the likes of oauth and web APIs. Like Netheads, Webheads do not care about lower layers of communication, as long as the browsers can reach their web servers. Many things that are well studied in the Netheads community, e.g., socket programming, voice and video communication, or binary data processing, are only recently being introduced to Webheads, e.g., in the form of WebSocket, WebRTC or ArrayBuffer. If Bellheads are purists, then Netheads are hackers. And compared to them, Webheads are hackers++.

[Heard elsewhere...]

Bellhead: "We invented video communication two decades ago - from H.320, H.323/Q.931/H.225/H.245, H.324, H.325.. (15 more). It works on ISDN, LAN, ATM, MPLS, POTS, 802.1x, ... (10 more). There is wide variety of codecs including G.711, G.722, G.723, G.729, H.261, H.263, H.264, ... (20 more). We just finalized our first $100k sale of equipment to whats-it-called company for their intra-office video telepresence."

Nethead: "We re-invented everything you did, and we can quickly build endpoint terminals that can do two party video call to large scale video broadcast. Everyone is now using what we created. We even sent our engineers to your companies to add our protocols to your equipments. Did you know that your equipment also supports SIP for video call? You may not like it, but our protocols are everywhere, and better than yours."

Webhead: "Not sure what you guys are talking about, but just got a tweet from my IT guy that our click-to-video-call app reached 1 million unique users. We are going to hire a new CEO to monetize our app now."

For Webheads, focus is not about services or protocols, but only about apps and users.

Friday, December 26, 2014

Spaghetti, ravioli and lasagna in software


As I was walking down the street, down the street, down the street. A very good friend I chanced to meet... And we went to this italiano restauranto around the corner. Here is what we talked about. [blink blink, screen turns black and white, time goes back...]

"Have you heard of spaghetti code?"

"Of course, I write it all the time. It is easy. I think of a problem, and start writing code to solve it. And do not stop until either the code is complete or my laptop battery runs out. And yes, I don't delete any line of code that I write. Eventually I end up with like two thousand lines of code in one giant function, and like two hundred control statements. It is fast, furious and fun. But tastes awful the next day."


"Yes. So, what kind of code do you write?"

"Well, I have tried my hands on ravioli, lasagna, and more recently farfalle and penne."

"I know ravioli and lasagna - just too modular or too many layers for my taste - I get lost in which piece or layer I am working on, and end up making a mess, embarrassing myself. What about farfalle and penne?"

"Oh, farfalle is what I create when I refactor my correctly working software to smaller pieces, to such an extent that each piece looks broken and half complete."

"What is penne?"

"It is when I refactor it right, and pieces look good. Not too big, not too small. And no information hiding and stuff like that. Frankly, I don't really like information hiding - it is more hype than hoopla. I like data oriented programming - software that is independent of the data - or pasta that is independent of the meat."

[waitress comes by, and asks...]

"I heard pasta, so assume you are ready. What would you two gentlemen have?"

"I will have the three-pasta house combo, with lasagna at the core, the first layer filled with penne, and the second layer with cheese ravioli. Spare the meat. Mix tomato, pesto and alfredo sauce and spread it all over."

[waitress makes a face, and turns to the other...]

"I like spaghetti, but I will also have what my esteemed friend asked for. Additionally, can you open all the layers and separate them, placing side by side, and break all the raviolis to remove the cheese and put it on the side too. I like to keep my pasta separate from other ingredients."

Irony in Internet Communications

We live in a world where

  1. first, we build Internet Protocol over phone network (modem), and then we build a phone system over Internet Protocol (VoIP),
  2. first, we build a phone system to run on Internet (VoIP), and then drive the internet from our phones (smart phones),
  3. first, we build open communication on top of monopolies (Internet over PSTN), and then build closed communication on top of that to promote monopolies (Facetime, Skype, or you-name-it-here).
Wouldn't it be easier if we start afresh?

Thursday, December 25, 2014

NetConnection or PeerConnection?

Flash Player and the corresponding ActionScript programming language have NetConnection/NetStream programming APIs for developing communicating web applications. Emerging WebRTC has PeerConnection/MediaStream APIs to enable real-time multimedia communication in the browser. This article compares and contrasts these two approaches to designing application programming interfaces (API). I have attempted to compare the API-styles rather than the specific implementations in Flash Player or browsers.


In Flash Player + ActionScript, four categories of objects (or classes) are needed to build a communicating web application such as a video chat. First, you need a connection to the server which may be a media server processing the media path in a client server approach or just a rendezvous server for establishing a peer-to-peer media path in the peer-to-peer approach. Second, you create named streams, for publishing and for playing. Third, you attach the publish side stream with your capture devices for audio and/or video. And fourth, you attach the play side stream with display or playback of remote media. The corresponding classes for these are: NetConnection, NetStream, Camera and/or Microphone, and Video or VideoDisplay [1] (pp.16-18). The diagram below shows a two party video call between our regular actors, Alice and Bob, each creating a named stream to publish, and each playing the other party's named stream to display. On each terminal, a NetConnection object is used to create two named NetStream objects, one is attached to Camera and Microphone and another to a Video or VideoDisplay element. Additionally, for displaying local preview, the Camera object is also attached directly to a Video or VideoDisplay.

In WebRTC capable browser + Javascript, again, four categories of objects (or classes) are needed. First, a LocalMediaStream created via the getUserMedia function to represent the capture stream from the user's camera and microphone devices. Second, an PeerConnection (actually called RTCPeerConnection) object is created to establish a peer-to-peer path between the two browser instances. Third, the LocalMediaStream object is attached (or added) to the PeerConnection on one end, and appears as remote MediaStream object at the other end. Fourth, this remote MediaStream is attached to an audio or video element to playback or display the received media. The LocalMediaStream is also attached to another video element to display local preview. Unlike publish vs. play mode of NetStream, I put the LocalMediaStream vs. remote MediaStream into separate categories, due to the way they are created and what they represent, even though LocalMediaStream is also a MediaStream.

For comparison purpose, let us call these Flash vs WebRTC set of APIs as NetConnection vs PeerConnection.

Level of abstraction: named stream vs. offer-answer

Clearly NetConnection presents a higher level of abstraction than the PeerConnection. In PeerConnection, each media path is explicitly created via a new RTCPeerConnection object on each side. It uses offer-answer model - where one side sends a session offer with its capabilities and the other side responds back with an answer with its side's capabilities. The offer-answer messages are exchanged between the two browsers outside the WebRTC APIs. In case of NetConnection, when a participant publishes a named stream (MediaStream), zero or more other participants may play that named stream. The implementation of these connection and stream classes hide the details of the peer-to-peer or client-server media path establishment. Internally, some variation of offer-answer and transport candidate messages are exchanged between the Flash Player instances, but are not exposed to the programming interface - thus making it a higher level of abstraction.

Topology creation: client-server vs. peer-to-peer

A NetConnection is a client-server abstraction where as a PeerConnection is a peer-to-peer path abstraction. A NetConnection object can be used to create client-server as well as peer-to-peer media paths, whereas PeerConnection is targeted only towards a peer-to-peer media path. To further clarify - a NetConnection is between a client and a server, which may be using RTMP in which case the server is a media server for the client-server media path, or may be RTMFP in which case the server may just be a rendezvous server to establish peer-to-peer MediaStreams between the two browsers. Note that with RTMFP it is also possible to do client-server media path. Additionally, it also allows application level multicast for use cases such as large event broadcast. With PeerConnection, the media path is intended to be peer-to-peer, but it is also possible to implement PeerConnection at a server (outside the browser), in which case it gets transparently used to create a client-server media path because the server behaves as another browser endpoint of the WebRTC media path. With WebRTC, once MediaStream proxying is implemented in the browser, it may be possible, though not trivial, to create application level multicast for large group event broadcast. Proxying a media stream means that a received remote MediaStream from one peer may be attached (or attached) to another PeerConnection to another peer, thus proxying the media path via this peer.

Signaling channel: built-in vs. external

As explained in the previous points, NetConnection is the signaling channel to establish the client-server or peer-to-peer MediaStream objects. Due to the built-in nature of the client-server signaling channel, it is non-trivial for third-party vendors to build such a signaling server without implementing the full RTMP and/or RTMFP protocol. On the other hand, a PeerConnection generates the offer-answer and other signaling messages, which must be exchanged via another channel between the two browser instances, e.g., via a notification service over WebSocket, by using Ajax, or over other third-party channels. Such notification service does not need to understand the details of the WebRTC protocol. This gives more flexibility and freedom to the developers on how the signaling channel for a particular communicating web application is implemented.

Server dependent vs. independent

As explained in the previous points, a NetConnection is between a client and a server, thus dependent on the server. Additionally, any server side media processing such as mixing or gateway'ing to VoIP can only be done by extending or modifying this server. If the server is vendor dependent, which seems to be the case for RTMFP, until recently, then it makes these media processing and gateway'ing features also vendor dependent. On the other hand, a PeerConnection does not care about the server side. But an implementation of a web application that uses this object does. Hence, in practice, these types of servers may be needed - notification server for exchanging offer-answer and other messages, media server for doing things like server side recording or mixing, media gateway for translating to VoIP or other networks, and media relay for gathering reflexive or relayed transport candidates and for actually relaying media path across network middle boxes.

Summary and further reading

In summary, the higher level NetConnection-style interface with built-in signaling channel makes it easier for programmers to create applications, ranging from two-party video calls to multi-party conferences to large scale event broadcasts. On the other hand, greater flexibility (and freedom) of the PeerConnection-style interface allows building wide range of system architectures using third-party components such as notification servers, media servers, media gateways and media relays.

In my past research on flash-videoio [2] and aRtisy [3], I have proposed another higher level abstraction for the programming APIs for communicating web applications. In fact [3] describes how to implement the NetConnection-style interface on top of WebRTC's PeerConnection-style interface and using an external resource server to provide the notification service. It extends the named stream model to named resource model to implement several other applications such as text chat or shared whiteboard or notepad. Then, [2] and [3] describe how to build a widget-oriented interface which consists of a single video-io widget used in either publish or play mode to create a wide range of communicating web applications such as two-party calls, multi-party conferences and video presence for Flash and WebRTC, respectively.

Sunday, March 16, 2014

My WebRTC related papers from 2013

This post gives an overview of my past and ongoing research on web-based multimedia communication. I wrote four published co-authored research papers last year answering these questions - How does WebRTC interoperate with existing SIP/VoIP systems and how does SIP in JavaScript compare with other approaches? What challenges lie ahead of enterprises in adopting WebRTC and what can they do? How can developers take advantage of WebRTC and HTML5 in creating rich internet applications in the cloud? How does emerging HTML5 technology allow better enterprise collaboration using rich application mash-ups? Read on if one or more of these questions interest you...

A Case for SIP in JavaScript

WebRTC (Web Real Time Communications) promises easy interaction with others using just your web browser on a website. SIP (Session Initiation Protocol) is the backbone of Internet communications interconnecting many devices, servers and systems you use daily for your call. Connecting WebRTC with SIP/RTP raises more questions than what we already know, e.g., does interoperability really work? what changes do you need in the existing devices and systems? does the conversion happen at the server or in the browser? what are the trade-offs of doing the translation? what are the risks if you go full throttle with WebRTC in the browser? and many more. There is a need for a fair comparison to evaluate not only what is theoretically possible but also practically achievable with existing systems.
"This paper presents the challenges and compares the alternatives to interoperate between the Session Initiation Protocol (SIP)-based systems and the emerging standards for the Web Real-Time Communication (WebRTC). We argue for an end-point and web-focused architecture, and present both sides of the SIP in JavaScript approach. Until WebRTC has ubiquitous cross-browser availability, we suggest a fall back strategy for web developers — detect and use HTML5 if available, otherwise fall back to a browser plugin."

Taking on WebRTC in an enterprise

On one hand WebRTC enables the end-users to communicate with others directly from their browsers, on the other it opens a can of worms a door to many novel communicating applications that scares your enterprise IT guy. Existing enterprise policies and tools are designed to work with existing communication systems such as emails and VoIP, and employ bunch of edge devices such as firewalls, reverse proxies, email filters and SBCs (session border controllers) to protect the enterprise data and interaction from the outside world. Unfortunately, session-less WebRTC media flows and, more importantly, end-to-end encrypted data channels, pose the same threat to enterprise security that peer-to-peer applications did. The first reaction from your IT guy will likely be to block all such peer-to-peer flows in the firewall. There is a need to systematically understand the risk and propose potential solutions so that the pearls of WebRTC can co-exist with the chains-and-locks of enterprise security.
 "WebRTC will have a major impact on enterprise communications, as well as consumer communications and their interactions with enterprises. This article illustrates and discusses a number of issues that are specific to WebRTC enterprise usage. Some of these relate to security: firewall traversal, access control, and peer-to-peer data flows. Others relate to compliance: recording, logging, and enforcing enterprise policies. Additional enterprise considerations relate to integration and interoperation with existing communication infrastructure and session-centric telephony systems."

Building communicating web applications leveraging endpoints and cloud resource service

One problem with the growth of SIP was that there were not many applications that demonstrated its capabilities beyond replacing a telephone call. This resulted in telephony mindset dominating the evolution in both standards and deployments. At the core of it, WebRTC essentially does most of what SIP aims for - peer-to-peer media flows and control in the end-points. Unfortunately, telephony influence in existing SIP devices, servers and providers have hidden that benefit from the end-user. There is a need to avoid such dominating telephony influence on WebRTC by demonstrating how easy it is to build communicating web applications, without relying on existing (and expensive) VoIP infrastructure. 
Secondly, the open web is threatened by closed social websites that tend to lock the user data in their ecosystem, and force the user to only use what they build, go where they exist and talk to who they allow. This developer focused research is an attempt to go against them, by showing a rich internet application architecture focused on application logic in the end-point and application mash-ups at the data-level. It shows my 15+ crucial web widgets written in few hundred lines of code each, covering wide rand of use cases from video click-to-call to automatic-call-distribution, call-queue and shared white-boards, and my web applications such as multiparty video collaboration, instant messaging/communicator, video presence and personal social wall, all written using HTML5 without depending on "the others" (imagine tune from lost played here) - the others are the existing SIP, XMPP or Flash Player based systems.
"We describe a resource-based architecture to quickly and easily build communicating web applications. Resources are structured and hierarchical data stored in the server but accessed by the endpoint via the application logic running in the browser. The architecture enables deployments that are fully cloud based, fully on-premise or hybrid of the two. Unlike a single web application controlling the user's social data, this model allows any application to access the authenticated user's resources promoting application mash-ups. For example, user contacts are created by one application but used by another based on the permission from the user instead of the first application."

Private overlay of enterprise social data and interactions in the public web context

Continuing the previous research theme, this research focuses on enterprise collaboration and enterprise social interactions. The goal is to define a few architectural principles to facilitate rich user interactions, both real-time and offline in the form of digital trail, in an enterprise environment. In particular, my proof-of-concept project described in the paper shows how to do web annotations, virtual presence, co-browsing, click-to-call from corporate directory, and internal context sensitive personal wall. The basic idea is to separate the application logic from the user data, keep the user data protected within the enterprise network, use public web as a context for storing the user interactions and build a generic web application framework running the browser.
"We describe our project, living-content, that creates a private overlay of enterprise social data and interactions on public websites via a browser extension and HTML5. It enables starting collaboration from and storing private interactions in the context of web pages. It addresses issues such as lack of adoption, privacy concerns and fragmented collaboration seen in enterprise social networks. In a data-centric approach, we show application scenarios for virtual presence, web annotations, interactions and mash-ups such as showing a user's presence on linked-in pages or embedding a social wall in corporate directory without help from those websites. The project enables new group interactions not found in existing social networks."

Monday, August 06, 2012

The dangers to the promise of WebRTC

WebRTC is an emerging HTML5 component to enable real-time audio and video communication from within the web browser of a desktop or mobile platform. In this article I present my critical views on what can potentially go wrong in achieving the promise of WebRTC.

The promise

The technology promises the web developers a plugin-free cross-browser standard that will allow them to create exciting new audio/video communication applications and bring them to the millions (or even billions!) of web users on the desktop and mobile platform.

The rest of the article assumes these premises:
  1. A technology is easy to invent but hard to change.
  2. Business motivation drives technology deployment.

The dangers

Following are just some of the questions I will explore in this article.
  1. What if Apple does not commit to WebRTC?
  2. What if Microsoft's implementation is different than Google's?
  3. What if there are different flavors of the API implementation?
  4. What if a web browser cannot talk to legacy SIP infrastructure without gateways?
  5. What if "they" cannot agree on a common video codec?
Let us begin by listing the existing technologies - real-time communication protocols (RTP, SIP, SDP), HTTP, web servers, Flash Player and related media servers on desktop - these are hard to change. On the other hand the emerging HTML5 technologies can be changed in the right direction if needed.

Now what are the business motivations for the browser vendors? Google is the most open, rich and web-centric browser vendor with a lot of money and a lot of motivation to drive the technology for a true browser-to-browser real-time communication. It makes sense to design a platform that enables web clients to connect to each other via a rendezvous or relay service that runs in the cloud. Improving and simplifying the tools available to web developers while hiding the complexity of protocol negotiation will improve the overall web experience, which in turn will drive business for web-centric companies such as Google. Moreover, the end-to-end encryption and open codecs are more important to Google than having to deal with legacy VoIP infrastructure -- which gets reflected in the choice of the several new RTP profiles for media path and VP8 for video.

On the other hand, Apple prefers closed systems with integrated offering that works seamlessly across Apple platforms. Opening up the Facetime (or related technology) to all the web developers is interesting but not motivational enough, especially when the resulting web application and the WebRTC session is controlled by the web developers and users instead of the iOS developer platform or the iTunes portal. This makes Apple, as a browser vendor, uncommitted to WebRTC until it proves its worth to the business.

Microsoft already has a huge communication server business that is largely SIP based. Any technology that makes it difficult to interoperate (at the media path/RTP level) with the existing SIP application servers is likely to cause trouble. They could build a gateway, but it will be hard to compete with others who can provide such services on the low cost cloud infrastructure. Hence, having the building blocks in place that will enable a web browser to talk to existing SIP devices in the media path is probably a core requirement for Microsoft. Thus, low level access to SDP and more control over media path is proposed in their draft.

For a web browser on desktop, WebRTC is interesting in bringing a variety of new use cases and web application. However, these use cases are already available to web users on the desktop via alternatives such as Adobe's Flash Player. It it not clear what the WebRTC can contribute in terms of additional use cases for web applications on desktop. However, for mobile platforms and smart phones it is a completely different story. Flash Player either does not work or is too difficult to work for video conferencing on mobile. HTML5 presents a great opportunity to bridge the differences between end user platforms when it comes to web applications. Thus, adding WebRTC to web browsers on mobile phones such as iPhone and Android adds a lot of value.

Let us consider a few "what if"s that pose danger to the promise of WebRTC.

What if Apple does not commit to WebRTC?

If Apple with its iOS platform and variety of cool gadgets and frantic fans-base decides to keep the real-time communication out of reach of web developers -- people will still build such application albeit as standalone native applications and sell over iTunes. Eventually the web developer may end up writing custom  platform dependent kludges, e.g., invoke the native API for iOS but use WebRTC for others. Once Apple realizes that the web-based real-time communication is slipping its grips it may be too late.   Moreover, other browsers besides Safari could run on iOS instead.

What if Microsoft's implementation is different than Google's?

Based on the historic browser interoperability issues and the recent announcement by Microsoft, this is quite likely. Fortunately, web developers are used to such custom browser-dependent kludges. However, WebRTC poses greater risk of differences in operation and interoperability beyond just some HTML/Javascript hacks. Having a server side gateway for media path to facilitate interoperability among browsers from different vendors is not in the spirit of end-to-end principle and is not the correct motivation for WebRTC in my opinion. Moreover, the Chrome Frame extension for Internet Explorer could relieve web developers to some extent.

What if there are different flavors of the API in different browsers?

This is similar to the previous threat -- if the implementations among the browsers do not interwork, especially among the leading mobile platform vendors such as Apple, Google and Microsoft, then we are back to the square one. As a web developer what will be the motivation to build three different applications using WebRTC when one can build once using Adobe Flash Player? Perhaps, Adobe will revive the mobile version of Flash Player to solve the differences :)

What if a web browser cannot talk to legacy SIP infrastructure without gateways?

This issue has troubled both telecom and web-centric vendors alike. Web browsers do not want to blindly implement legacy SIP family of standards (actually RTP) because the problems on the web are different, and need to be addressed differently. The telecom centric vendors do not want to re-do everything to interoperate with web browser. If they can get the web browser to talk to their existing SIP/RTP components by keeping the media path intact, it will be a big deal. Having to implement a variety of new RTP profiles in their equipment for interoperability with web is unmotivational.

This could go in one of the two ways - either browser vendors agree on not interoperating with legacy RTP devices in which case we will need a gateway to do interoperability, or the browser vendors implement proprietary extensions (back doors!) that allow such hooks. If it is the latter, it will get immediately picked up by the telecom-sponsored developers to build applications that use the legacy RTP mode only -- not a good thing!

What if "they" cannot agree on a common video codec?

The debate on what codec to pick is as old as the debate on WebRTC. Currently, "they" are forming two camps -- one that supports royalty free VP8 family of standards and other that backs ubiquitous H.264 family of standards (or keep no mandatory codec). The advantage of H.264 is that many hardware already comes with fast H.264 processing hence will be available natively on mobile platform, whereas with VP8 you may end up using your battery life on the codec -- until VP8 becomes available in the hardware platform. It is not clear what will happen. If VP8 gets picked as the mandatory-to-implement codec in the browser, I can see some browser vendors digressing from the standard and implementing only H.264 (hardware supported codec) or implementing both VP8 and H.264.

What are the other potential threats to this technology? What if the standardization works drags for too long, until it is too late, and web users have moved on to something better! Web is driven by web applications. What a browser vendor does or what a standardization committee adopts are just a shadow of what the web applications do (or want to do). A web developer prefers to have a consistent, simple and unified API for real-time communication.

If it were in my hands, I would ban any IETF or W3C proposal without a running and open implementation. But the reality is far from this!

Saturday, March 31, 2012

Getting started with WebRTC

The emerging WebRTC standards and the corresponding Javascript APIs allow creating web-based real-time communication systems that use audio and video without depending on external plugins such as Flash Player. I got involved in a couple of interesting projects that use the pre-standard WebRTC API available in the Google Chrome Canary release. This article describes how to get started with these projects.

The first step is to download a browser that supports WebRTC, e.g.,  Google Chrome Canary. In your browser address bar, go to chrome://flags, search for Enable Media Stream and click to enable it. This enables the WebRTC APIs for your web applications running in the browser. Use this browser for getting started with the following two projects.

1. The first project is voice and video on web that enables web-based audio and video conferencing among multiple participants. It uses resource-oriented data access to move all the application logic running to the client application in HTML/Javascript, whereas the server merely acts as a data store interface. The architecture enables scalable and robust systems because of the thin server. The source code of WebRTC integration is available in the SVN branch, webrtc. Follow the instructions on the project web page to install and configure the system, but use the link to the webrtc branch instead of trunk in SVN. After visiting the main page of the installed project, see the help tab to create a new conference. On the preferences page, make this conference persistent, check to enable WebRTC and uncheck to disable Flash Player for this conference. Visit the conference from multiple tabs in your browser, using different user names. Let the moderator participant enable the video checkboxes for various participants to see the audio and video flowing via WebRTC among the participants.

2. The second project is SIP in Javascript that creates a complete SIP endpoint running in your browser. It contains a SIP/SDP stack implemented in Javascript along with a SIP soft-phone demonstration application. It has two modes: Flash and WebRTC. In the WebRTC mode, it uses a variation of SIP over WebSocket and secure media over WebRTC. The default mode is to use Flash, which you can change to WebRTC as follows. Checkout the source code from SVN and put it on a web server. Install and run a SIP proxy server that supports SIP over WebSocket as mentioned on the project page. Now visit the page on your web server phone.html?network_type=WebRTC on two tabs of your browser. Follow the getting started instructions on that page to register on the first tab's phone with one name, and on the second with another, and try a call from the first to the second. This is an example program trace at the proxy server from a past experiment on how the WebRTC capabilities are negotiated between the two phones using SIP over WebSocket. In addition to the network_type, you can also supply additional parameters on the page URL to pre-configure the phone, e.g., username, displayname, outbound_proxy, target_value. Example links are available to configure as alice and bob to connect to the web and SIP server on localhost, assuming your webserver is running locally on port 8080 and SIP server on websocket port 5080. Just open the two URLs in the two tabs, click on Register on each, make a call from one to another by clicking on the Call button, accept the call from another by clicking on the Call button, and see the audio and video flowing via WebRTC between the two phones.

Sunday, January 22, 2012

Translating H.264 between Flash Player and SIP/RTP

Our SIP-RTMP gateway as part of the rtmplite project includes the translation of packetization between Flash Player's RTMP and SIP/RTP for H.264. There are some hurdles, but it is doable! In this article I present what it takes to do such an interoperability. If you are interested in looking at the implementation, please see the _rtmp2rtpH264 and _rtp2rtmpH264 functions in the module.

Before jumping in to the details, let us take a brief background of H.264 packetization. The encoder encodes sequence of frames or pictures to generate the encoded stream, which is consumed by the decoder to re-create the video. The encoder generates what is called as NALU or network abstraction layer unit. The decoder works on a single NALU and needs sequence of NALUs to decode. Each frame can have one or more slices. Each slice can be encoded in one or more NALUs. There are certain pieces of information that remain same for all or many frames. For example, the sequence parameter set (SPS) and picture parameter set (PPS) are like configuration elements that need to be sent once or only occasionally instead of with every frame or NALU. The configuration parameters apply to the encoder, whereas the decoder should be able to decode any configuration.

RTMP Payload

Flash Player 11+ is capable of capturing from camera and encoding in H.264 to send to an RTMP stream. Each RTMP message contains header and data (or payload), where the header contains crucial information such as timestamp, stream identifier, and the payload contains the encoded video NALUs or actual configuration data. The format of the payload is same as that of the F4V/FLV tag for H.264 video in an FLV file. Each RTMP message contains one frame but may contain more than one NALUs. The first byte contains the encoding type, and for H.264 is either 0x17 (for intra-frame) or 0x27 (for non-intra frame). The second byte contains packet type and is either 0x00 (configuration data) or 0x01 (picture data). The configuration data contains both SPS and PPS as described here.

  rtmp-payload := enc-type[1B] | type[1B] | remaining
enc-type := is-intra[4b] | codec-type[4b]
is-intra := 1 if intra and 2 if non-intra
codec-type := 7 for H.264/AVC
If the type is configuration data then the next four bytes are configuration version (0x01), the profile index, the profile compatibility and the level index. This is followed by one byte containing least-significant two-bits that determine the number of bytes to use for the length of the NALU in subsequent picture data messages. For example, if the bits are 11b then it indicates 3+1=4 bytes of NALU length, and if the bits are 01b then it indicates 1+1=2 bytes of NALU length. Lets call this the length-size and possible values are 1, 2 or 4. This is followed by a byte containing least-significant 5 bit for the number of subsequent SPS blocks. Each SPS block is prefixed by 16-bits length followed by the bit-wise encoding of SPS as per H.264 specification. This is followed by a byte containing the number of subsequent PPS blocks. Each PPS block is prefixed by 16-bits length followed by the bit-wise encoding of PPS as per H.264 specification. Typically only one SPS and one PPS blocks are present.

remaining for config := version[1B] | profile-idc[1B]
| profile-compat[1B]
| level-idc[1B]
| length-flag[1B]
| sps-count[1B] | sps0 ...
| pps-count[1B] | pps0 ...
length-flag := 0[6b] | value[2b] where value + 1 is length-size
sps-count := 0[3b] | count[5b] where count is number of sps
pps-count := number of pps elements
sps(n) := length[2B] | sps
pps(n) := length[2B] | pps
If the type is picture data then the next three bytes contain a 24-bit number for the decoder delay value for the frame and is applicable only for B-frames. The default baseline profile does not include the B-frames. Thus the first five bytes of the picture data payload are like header data. This is followed by one or more NALU blocks. Each NALU block is prefixed by the length of the next NALU encoded-bits. The number of bytes used to encode this length is determined by length-size mentioned earlier. Then the NALU is encoded as per H.264 specification.

remaining-picture := delay[3B] | nalu0 | nalu1 ...
nalu(n) := length | nalu
length := number in length-size bytes
nalu := NAL unit as per H.264
Each NALU has first byte of flags. The flags contains 1 most-significant bit of forbidden, next 2-bits of nri (NAL reference index) and final 5 least-significant bits of nal-type. There are several nal-types such as 0x01 for non-intra regular pictures, 0x05 for intra-pictures, etc. Please see the H.264 specification for the complete list.

The camera captured and encoded data in Flash Player contains three NALUs in each RTMP message -- the access unit delimiter (nal-type 0x06), the timing-information (nal-type 0x09) and the picture slice (nal-type 0x01 or 0x05). The Flash Player is capable of decoding other nal-types as well, and does not require access unit delimiter or timing-information NALUs for decoding. I haven't seen any support for aggregated or fragmented NALUs in the Flash Player.

RTP Payload

The RTP payload format for H.264 is specified in RFC 6184 and is typically supported in SIP-based video phones. The RTP header contains the crucial information such as the payload type, the timing data, and the sequence number, whereas the actual configuration and picture NALUs are sent in the payload as specified by this RFC. The first byte is the type containing one bit forbidden, two bits of nri and 5 bits of nal-type.

nalu := nal-flags[1B] | encoded-data
nal-flags := forbidden[1b] | nri[2b] | nal-type [5b]
In addition to the base nal-types of H.264, the RFC defines new nal-types for fragmentation and aggregation. Traditionally, the Internet plagued by middle-boxes, NATs and firewalls has imposed a limit on the size of the UDP packet that can be pragmatically used on the Internet, and the typically MTU is around 1400-1500 bytes. The H.264 encoder is capable of generating much larger encoded frame sizes hence cannot be successfully sent as one frame per RTP packet over UDP in many cases. On the other hand, some low-sized encoded frames may be much smaller than MTU thus incurring additional overhead for RTP headers. These low-sized frames can be aggregated for efficiency.

Many SIP video phones configure their H.264 encoders to use multiple slice NALUs in a single frame, unlike Flash Player which generates one picture NALU per frame. Thus the traditional SIP video phones are capable of using low sized encoded payload without RFC 6184 which can be sent in a single RTP/UDP packet.

When a large encoded frame is fragmented to smaller fragments, the nal-type=28 is used in the first byte of each fragment, followed by the second byte containing the actual nal-type of the frame as well as the start and end markers. This is followed by the actual encoded data. The RTP header of all these fragments contain the same timestamp value. The last fragment of the frame contains the marker set to true, whereas all the previous ones set it to false. When multiple smaller frames are aggregated, the nal-type of 24 is used in the first byte of the aggregate payload, followed by one or more NAL data. Each NAL data is prefixed by 16-bit length of the encoded NALU. There are non-trivial rules on how the nri is obtained and we refer you to the RFC for the details.

To fragment:
let encoded-data = fragment0 | fragment1 | fragment2...
encoded-data of fragment(n) := orig-nal-flags[1B] | fragment(n)
orig-nal-flags := start[1b] | end[1b] | ignore[1b]
| orig-nal-type[5b]
start := 1 if first fragment else 0
end := 1 if last fragment else 0

To aggregate:
encoded-data of aggregate := nalu0 | nalu1 | nalu2 ...
nalu(n) := length[2B] | orig-nalu(n)
In additional to sending the SPS and PPS packets in RTP, the video phones also negotiate the configuration data via external protocol such as SIP/SDP. Since Flash Player does not do that, we will not discuss it further.


Now that we understand the packetization of H.264 for Flash Player as well as SIP/RTP, let us go over the details of the translation process.

The configuration data is sent periodically by Flash Player before every intra-picture frame. However, SIP phones may not send the configuration data periodically. It is desirable to cache the configuration data received from both sides, and re-use it when the other side connects. The first packet sent must contain the configuration data. It is also desirable to periodically send the configuration data to both Flash Player and SIP sides from the translator, irrespective of whether the configuration data is received periodically. In our translator we send the configuration data before every infra frame.

In Flash Player to SIP/RTP direction, when the configuration data is received on RTMP, it is sent in two RTP packets, one for SPS and one for PPS. Both use the same timestamp and set the marker to true. When picture data is received on RTMP and need to be sent to the RTP side, it is dropped until a previous configuration data has been sent to the RTP side. If the picture data is not dropped, all the NALUs are extracted. The last out-of-three NALUs per RTMP message is the actual picture NALU which is sent to the RTP side as follows. Only the nal-type of 1 and 5 are used, whereas others are ignored. If the NAL size is less than 1500 bytes, it is used as is in the RTP payload with marker set to true. If the NAL size is more, it is fragmented in to smaller fragments with each of size at most 1500 bytes. Multiple fragmented RTP packets are generated as per the RFC. All but the last fragment has marker set to false. The RTP marker of true indicates end of frame. All the fragments use the same timestamp value.

In the SIP/RTP to Flash Player direction, the configuration data is received in multiple RTP packets and are cached by the translator. When both SPS and PPS payloads have been received from the RTP side, we are ready to start streaming to the Flash Player side. Any incoming RTP packet is put in a queue. When the last packet in the queue (that was most recently received) has marker set to true, the queue is examined and RTMP messages are created to be sent to the Flash Player side. Since Flash Player handles complete frames in each RTMP message, we need to wait until the marker is set to true so that we only send complete frames to Flash Player. If the RTMP stream is ready but we have not received the configuration data from RTP or we have not or are not sending the first intra frame to RTMP, then received packets are dropped. If no infra frames are received for 5 seconds, then we send a fast-intra-update (FIR) request to the SIP/RTP side, so that it triggers the SIP phone to send an intra frame.

Once we decide that we can send packets to RTMP from the received RTP queue, we divide the queue in to groups of packets of same timestamp and same nal-type values while preserving the order of the packets. If the nal-type is 5 indicating that an intra-frame is being sent to RTMP, then we send a configuration data too before the actual picture data. The configuration payload format is explained earlier and contains both PPS and SPS along with other elements. Each group of packets of the same timestamp and same nal-type is sent as a single RTMP message in the same order containing one or more NALUs. If the nal-type is 1 or 5, the NALU from the RTP payload is used as is in the RTMP payload with five bytes of header as explained earlier. If the nal-type is 28 indicating fragmented packets, then all the fragmented payloads are combined in to a single NALU. If the nal-type is 24 indicating aggregated packet, then it is split in to individual NALU data. Then the sequence of NALUs generated from this group of packets of same timestamp and nal-type are combined in to a single RTMP payload to be sent to the Flash Player.


As mentioned in my previous article, there are a few gotchas. You must use the new-style RTMP handshake, otherwise the Flash Player will not decode/display the received H.264 stream. You must use Flash Player 11.2 (beta) or later when using "live" mode, otherwise the Flash Player does not accept multiple slice NALUs of a single frame. If audio and video are enabled, then the timestamp of video must be synchronized with the timestamp of audio sent to RTMP. Note that RTP picks random initial timestamp for each media stream so the audio and video RTP timestamp values are not easily co-related unless using RTCP or external mechanism. You need to co-related the RTP timestamps of audio and video to a single timestamp clock of RTMP.


It is possible to do re-packetization of H.264 between Flash Player's RTMP and standard SIP/RTP without having to do actual video transcoding. This article explains the tricks and gotchas of doing so!

The implementation works between Flash Player 11.2 and a few SIP video phones such as Ekiga and Bria 3.

[6] ITU-T recommendation H.264, "advanced video coding for generic audiovisual services", March 2010.
[7] ISO/IEC International Standard 14496-10:2008.