Kundan Singh

A journey through real time...

I did my first project on real-time communications in 1997. In the past two decades I did numerous projects in this general area. In particular, these were related to voice over IP (VoIP), multimedia and web communication. As 2016 comes to an end, I reminisce my 20-year journey. I attempt to capture the gist of my projects. How my project themes and ideas evolved over time?

What are endpoint driven communication systems?

A lot of my work in the past decade has focussed on endpoint driven systems, e.g., peer-to-peer Internet telephony (P2P-SIP) [1,2] for inherent scalability/robustness, Rich Internet Applications for web video conferencing [3,4], and more recently, resource-based software architecture [5,6]. In this article, I emphasize the importance of such systems, and differentiate them from other system architectures in the context of real-time communication.

Impress.js 3D presentations on cloud, communication, WebRTC and mobile

I am really impressed by impress.js presentation framework based on CSS3. It is pretty low level, requires knowledge of HTML and CSS, and is small enough that it can stay out of your way. Others have created very impressive 3D presentations. I also did my share of 3D presentations in 2015, as various conference paper presentations, while at Avaya Labs.

click here to see all my presentations

Below, I list my presentations and associated demo videos largely based on impress.js framework. When viewing a presentation, use an HTML5 capable browser such as Chrome or Safari, and please wait for it to loads completely, and the browser's loading icon to stop spinning, before doing the slide show.

Command line Twilio client

[This post is neither contributed not endorsed by Twilio]

Twilio client [1] enables embedding voice conversation in web and mobile apps, by creating a voice pipe between your browser or mobile device and the service. One thing missing there is the ability to create such a voice pipe from non-mobile or non-web programs, such as a command line application. There is a shell script [2] to place a call for testing from command line, but it uses pre-defined text or recorded file for media, and looses the real-time interactive nature of the voice path. In particular, it does not create a voice pipe between the local machine and the service.

Motivation

Ability to connect real-time interactive voice path from an application opens doors to wide range of other use cases such as media path processing and analysis, e.g., for real-time transcription, or to bridge call between diverse services, e.g., translate between IM and voice call. Secondly, such as mechanism is independent of a specific browser or mobile platform, and can work in headless mode for automated testing or client-side programmability of voice call or its media path.

At the high level, there are four potential ways to create such a voice pipe. Two of these can be accomplished using my rtclite project, that we will describe in more detail in this article.

WebRTC - using Twilio 1.3+ web client with command line WebRTC app
Using Twilio mobile client interface ported to command line app
RTMP - using Twilio 1.2 web client API with command line RTMP client
SIP - send/receive SIP call to/from Twilio [3]

The first two approaches essentially implement a command line version of the web and mobile SDKs. For example, a WebRTC stack compiled for Linux/OS X may be used to accomplish (1). The third approach uses a command line RTMP client and an older version of client API, and the fourth one uses a command line SIP endpoint to dial into the service.

We describe how to do the last two using software pieces from our open source rtclite project [5]. The following video demonstration shows the command line call initiation. Don't forget to view in full screen!

Connect to Twilio from command line SIP endpoint

The approach is described on the provider's website [3] including the steps for creating the SIP domain/endpoint such as yourname.sip.twilio.com. The description there is targeted for VoIP providers rather than client. Currently it is not possible [4] to configure your SIP softphone to use the Twilio service as SIP server. Once the SIP domain is created, you can send request to "sip:something@yourname.sip.twilio.com". The sender's IP address needs to be white-listed or the caller's credential needs to be preapproved for authentication. we have only tried the IP address white-listing, and not the SIP authentication using credentials.

After configuring a SIP endpoint, you should be able to see the SIP domain and its associated voice URL on the provider's website, e.g., "yourname.sip.twilio.com" mapped to voice URL of "http://yourserver/yourtwiml.xml".A simple call forwarding can be done using the following TwiML. These steps can be used to connect to any other TwiML application from the command line SIP endpoint, if you configure your SIP endpoint's voice URL accordingly.

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Dial timeout="20" callerId="1415xxxxxxx" method="POST">415yyyyyyy</Dial>
</Response>

You can also create programmable server side script to derive the target number using the called SIP address. For example, send call to sip:1415yyyyyyy@yourname.sip.twilio.com, and in your script, extract the user name part of the URL to populate the Dial tag's content.

Once the initial configuration is complete, use the SIP endpoint available in the rtclite project to initiate a call. More details on the command line SIP endpoint are available in my previous blog post as well as on the project website [6]. After setting up the initial dependencies of the project, run the caller module as follows.

$ python -m rtclite.app.sip.caller --domain=example.net --use-lf --samplerate=48000 \
         --to=sip:something@yourname.sip.twilio.com

The "domain", "use-lf" and "samplerate" options are described on the project website. The "to" option is important, and represents the target address to call to. Once the call is established, the py-audio project's modules are used to interface with the audio device, and to send and receive audio in the call.

Some important notes follow. For a demo account with the provider, your callerId and target number in the Dial tag must both be verified for your account. Some Internet Service Providers (ISP) may block a SIP request on port 5060. In fact, I sometime experience wifi reset of my residential equipment, when I send a SIP packet out from my home machine to outside VoIP service. Router interference of SIP messages may also be the reason for incorrect CRLF handling, and the need for use-lf option. Using TLS may be an option to work around router interference. Make sure that the sample rate specified on the command matches the allowed sample rate of the audio device on your client machine. On Mac OS X, audio capture is usually done with 48kHz.

Connect to Twilio from command line RTMP client

Here, we exploit the older Flash-based Twilio Web Client SDK version 1.2. First, we describe how to figure out the client-server RTMP message flow, and next we show how to do this using the rtmpclient module from the rtclite project.

The web client SDK 1.2 used this twilio.js javascript file. A quick look indicates that it internally loads another twilio.js file.

This second JavaScript file shows that the Flash-based client-server connection is accomplished using the MediaStream class which internally attempts RTMP connection to the service.

Using the hello-client-monkey.php example from the provider, but hosted on your website, and running wireshark on default RTMP port 1935, you can find more information about this client-server exchange. Make sure to use http instead of https for your hosted web app to avoid encryption. The following screenshots show a few example RTMP messages, and their order, and the parameters sent in the initial connect request.

In summary, the NetConnection's connect is done to "rtmp://chunder.twilio.com/chunder", with additional 6 arguments. The first argument looks like the capability token generated by the helper library. The second and third are null and empty string respectively. The fourth one is a JSON formatted string with some client side attributes. The fifth looks like the account SID. And last one looks like the client SDK version. After connect is complete, there are two createStream calls and a startCall RPC method. This is followed by a received RPC method callsid from the server. Next, the "input" stream name is published, and "output" stream name is played. This is followed by bunch of audio data.

We wrote an example application, client.py [7], which uses the rtmpclient module from the rtclite project, and automates client-server exchange process. It then connects the audio sent/received on the two streams to the local audio device. Although, not well tested, there are other command line options in this application that allow recording the received stream, or playing a file to the sent stream.

The command line application takes three mandatory parameters, which if not supplied, will be prompted for. These parameters are for account SID, auth token and application SID. The client application then connects to the service using RTMP and uses those supplied parameters. An example command line invocation is shown below, where the three parameters are fake - use the correct values in your test!

$ python -m rtclite.vnd.twilio.client --account=ACXXXXX --token=YYYYY --app=APZZZZZ

Use ctrl-C to terminate the client application.

Comparison

Here, we attempt to compare our two approaches for command line client: SIP and RTMP.

The audio codec used are G.711 in SIP and Speex in RTMP. Thus bandwidth requirement is more for SIP, but can theoretically give better quality, e.g., for real-time transcription. It may be possible to send G.711 in RTMP to server, but is not clear how to force the server to send back G.711 in RTMP. The media path is over UDP (RTP) for SIP vs. TCP for RTMP - making RTMP one more suseptible to network issues such as latency in interactive conversations. While both these are voice only at this time, using WebRTC or Mobile SDK based approach may allow video pipe. The py-webrtc project may become useful for this, once it is completed.

The SIP approach has provider supplied documentation on how to do incoming calls, but the RTMP approach requires more work to figure that out. The SIP approach seems to target server-to-server call flows, e.g., to connect your soft PBX to provider service. Due to router interference of SIP messages, it may not work in all the cases or all the time from a client machine. On the other hand, the RTMP approach is inspired by the client SDK, and hence suitable for clients. However, Flash-based client has been deprecated by the provider, and the corresponding service may no longer be available in near future. Moreover, RTMP is kind of an obsolete technology. Using WebRTC or Mobile SDK approach will work better in that case. Once SIP registration becomes available on the provider service, a standard command line SIP endpoint [6] should be enough for send/receive of calls.

References

Twilio client: embed voice conversation in Web and Mobile Apps, https://www.twilio.com/client
Twilio Labs: place a Twilio call from the shell, https://www.twilio.com/labs/bash
Programmable voice SIP, https://www.twilio.com/docs/api/twilio-sip
Can I configure my soft phone to work with the Twilio SIP endpoint, https://www.twilio.com/help/faq/voice/can-i-configure-my-soft-phone-to-work-with-the-twilio-sip-endpoint
Rtclite: light weight implementations of real-time communication protocols and applications in Python, https://github.com/theintencity/rtclite
Command line SIP endpoint, https://github.com/theintencity/rtclite/blob/master/rtclite/app/sip/caller.md
Command line Twilio client, https://github.com/theintencity/rtclite/blob/master/rtclite/vnd/twilio/client.py

How to make phone calls from command line?

This article presents a command line SIP endpoint available as part of my rtclite [1] project. It is useful in a number of scenarios such as:

dialing out a phone number from command line,
performing automated VoIP system tests,
showing quick demos of communication systems, or
experimenting with media processing on the voice path, e.g., for speech recognition, recording or text-to-speech.

These tasks cannot easily be done using existing user interface based web, installed or mobile apps. I had implemented something like this, named sipua [2], about 15 years ago in C/C++ at Columbia University. There are other projects such as sipp [3], pjsua [4] or sipcmd [5] that implement some version of command line SIP user agent, but may have limitations such as lack of support for audio capture device, or hard to extend to add new media processing capability such as text-to-speech. This article describes a SIP endpoint written in Python [6] as part of my open source project.

Click here to see the full description of this SIP endpoint written in Python [6].

The project page shows how to use the SIP endpoint to send/receive instant messages and voice call. The command line options allow you to configure various attributes such as whether to register with a SIP server, how to respond to incoming requests, how to work around SIP entities that incorrectly handle CRLF for line endings, how to test signaling without media path, and so on. The project page also describes how to use the module in your own Python project.

Following video demo of the SIP endpoint shows interoperability with X-lite terminal and dialing out toll free numbers using a VoIP provider. It shows how to dial a phone number using a VoIP provider, how to send DTMF digits from terminal, and some experimental features such as text-to-speech and speech recognition. Do not forget to watch in full screen!

References

Rtclite: light weight implementations of real-time communication protocols and applications in Python, http://www.rtclite.com, https://github.com/theintencity/rtclite
Sipua: a SIP test user agent for Solaris, Linux, FreeBSD and Windows NT, http://www1.cs.columbia.edu/irt/cinema/doc/sipua.html
SIPp: free open source test tool/traffic generator for SIP, http://sipp.sourceforge.net/
Pjsua: open source command line SIP user agent (softphone), http://www.pjsip.org/pjsua.htm
sipcmd: the command line SIP/H.323/RTP softphone, https://github.com/tmakkonen/sipcmd
caller: SIP application to initiate or receive VoIP calls from command line. https://github.com/theintencity/rtclite/blob/master/rtclite/app/sip/caller.md

Introducing rtclite

We have attempted to unify our diverse Python based projects related to SIP, RTP, XMPP, RTMP, REST, etc., into a single theme of real-time communication (RTC). In particular, we have migrated the relevant source code from other projects to the new "rtclite" project after cleanup and refactoring. Moving forward, any new functionality related to Python implementations of real-time communication protocols and applications will be done in this new project.

Over the next few weeks, we will continue to evolve this project, migrate or fix issues/comments, and deprecate the previous older projects.

I will also write blog posts here to introduce various specific modules and how they are used in real world in the new few weeks.

From the project page at http://www.rtclite.com (currently forwards to https://github.com/theintencity/rtclite)

"Light-weight implementations of real-time communication protocols and application in Python

This project aims to create an open source repository of light weight implementations of real-time communication (RTC) protocols and applications. In a nutshell, it contains reference implementations, prototypes and sample applications, mostly written in Python, of various RTC standards defined in the IETF, W3C and elsewhere, e.g., RFC 3261 (SIP), RFC 3550 (RTP), RFC 3920 (XMPP), RTMP, etc."

We encourage all users of my open source projects to visit this new project, and if possible migrate your applications to use this new project.

My WebRTC related papers from 2015

This post continues from the previous one and gives an overview of my ongoing research on web-based multimedia communication or WebRTC. I wrote five published co-authored research papers last year answering these questions - How do we solve some of the enterprise challenges of WebRTC using browser extensions? How do we address cloud challenges of a multi-services and apps vendor? How do we create pure web-based enterprise communication and collaboration system without depending on legacy protocols? How do we do user reachability in a multi-apps environment created by WebRTC? And, how do we do write-once-run-anywhere for WebRTC based team apps using a cross platform tool? Read on if one or more of these questions interest you...

Enterprise WebRTC powered by browser extensions

Traversing WebRTC flows created by external third-party websites across restricted enterprise firewalls is challenging. There are other challenges in adopting WebRTC in enterprises, e.g., how to integrate it seamlessly with existing communication equipments, or how to enforce enterprise policies such as call recording of WebRTC flows on third-party websites? We show how to use browser extensions to solve these problems. This systems paper is based on my implementations of two interesting projects.

“We use browser extensions to solve two important issues in adopting WebRTC (Web Real-Time Communications) in enterprises: how to integrate WebRTC-centric communication with existing systems such as corporate directories, communication infrastructure and intranet websites, and how to traverse media paths across enterprise firewalls. Vclick is a simple and easy to use web-based video collaboration application that enables click-to-call from other webpages. SecureEdge is a network border traversal system for policy and security enforcement, and consists of a secure media relay that sits at the network border or in the cloud. A browser extension in the enterprise user’s device transparently injects this media relay in every WebRTC media path needing to traverse the enterprise network edge to enable authenticated border traversal without help from the websites hosting the WebRTC pages. We attempt to generically support WebRTC in enterprises on a variety of application scenarios instead of creating another fragmented communication island. The challenges faced and techniques used in our proof-of-concepts are likely extensible to other enterprise WebRTC scenarios using the emerging HTML5 technologies.”

Keywords: WebRTC, enterprise communication, secure edge, browser extension, VoIP, video call, firewall traversal, media relay.

more>>

ALICE: Avaya Labs Innovations Cloud Engagement

Although we know how to create cloud-hosted services, platforms and infrastructures, little is known about cloud hosted communication and collaboration services, especially to enable multi-tenancy and self-service. This research focuses on the challenges of hosting cloud services for customer trials, where resources are limited to make existing services cloud ready or to fit a specific platform. This is based on my work on creating a cloud portal to host research-oriented early or pre-product services on the cloud, and identifying common themes and techniques.

“We present the architecture and implementation of our enterprise cloud portal named ALICE, Avaya Labs Innovations Cloud Engagement, which provides self-service access to service developers, tenants, and users to various communication and collaboration applications. Currently ALICE is used for field testing of advanced research prototype services based on technologies such as WebRTC and HTML5. This paper describes the current portal and extensions to support multi-tenancy.

We describe challenges in creating a self-service multi-tenant SaaS (software-as-a-service) portal to host communications and collaboration applications for small to medium scale businesses. The challenges faced and the techniques used in our architecture relate to security, provisioning, management, complexity, cost savings and multi-tenancy, and are applicable and useful to other cloud deployments of diverse enterprise applications.”

Keywords: Cloud, system architecture, portal, multi-tenancy, Internet telephony, enterprise communications, web collaboration.

more>>

Vclick: endpoint driven enterprise WebRTC

One of my earliest project at Avaya Labs was on creating a light-weight service for wide range of web communication and collaboration scenarios. Vclick is a collection of many loosely coupled apps that run the app-logic in the browser or endpoint, and mash up at the data level. It contains applications for video call, conferencing, video presence, text chat, click-to-call, screen sharing, shared notepad and whiteboard, and so on. It goes against the conventional web wisdom of thin-client, single-page-apps, or rigid GUI, and presents a new software architecture to create robust endpoint driven apps. The paper is really about how to keep the endpoints smart and network (or service) dumb in the context of collaboration applications.

“We present a robust, scalable and secure system architecture for web-based multimedia collaboration that keeps the application logic in the endpoint browser. Vclick is a simple and easy-to-use application for video interaction, collaboration and presence using HTML5 technologies including WebRTC (Web Real Time Communication), and is independent of legacy Voice-over-IP systems. Since its conception in early 2013, it has received many positive feedbacks, undergone improvements, and has been used in many enterprise communications research projects both in the cloud and on premise, on desktop as well as mobile. The techniques used and the challenges faced are useful to other emerging WebRTC applications.”

Keywords: WebRTC, enterprise communication, web video conferencing, resource-based architecture, web applications.

more>>

User reachability in multi-apps environment

With numerous "walled-garden" services and apps emerging because of WebRTC, there is a need to identify the best way to reach your contacts, irrespective of which app or service she is on. This systems paper describes my work on implementing a mobile (and desktop) app called Strata Top9, to quickly reach your important contacts. It really is a front-end to launch other applications. Unlike existing presence based systems, we propose to iterate during call initiation. The paper presents the software architecture and design decisions along with several motivational use cases of our project. It also details the concept of dynamic contacts, and endpoint driven caller and receiver policies.

“Recent progress in web real-time communication (WebRTC) promotes multi-apps environment by creating islands of communication apps where users of one website or service cannot easily communicate with those of another. We describe the architecture and implementation of a multi-platform system to do user reachability in multiple communication services where users decide how they want to be reached on multiple apps, e.g., in an organization that has voice-over-IP, web conferencing and messaging from different vendors. Our architecture separates the user contacts from reachability apps, supports user and endpoint driven reachability policies, and has several independent and non-interoperable WebRTC-based apps for two-way and multiparty multimedia communication. Our flexible implementation can be used for enterprise or personal communications, or as a white-labeled app for consumers of a business.”

Keywords: system design; mobile app; user reachability; multiservices; VoIP; WebRTC; caller policy.

more>>

Developing WebRTC-based team apps with a cross-platform mobile framework

Ability to write-once-run-anywhere still eludes many app developers. Luckily several cross-platform development tools exist. However, creating cross-platform communication and collaboration related apps is still challenging. This paper presents my implementation work on creating cross platform apps. In particular, four types of platforms - web app on PC and mobile, and installed app on PC and mobile - are considered, and seven different apps are covered for a wide range of enterprise use cases. Techniques and steps for creating such cross platform apps are presented along with lessons learned based on practical experience. Additionally, considerations for iOS and wearable Glass devices are presented.

“We present lessons learned in developing cross platform multi-party team applications. Our apps include a range of communication and collaboration scenarios: document and content sharing in a team space, an agent-based meeting helper, phone number dialer via a voice-over-IP (VoIP) gateway, and multi-party call in peer-to-peer or client-server mode. We use web real-time communication (WebRTC) to enable the audio and video media paths in the apps. We use frameworks such as Chrome Apps and Apache Cordova to create apps that can be accessed from a browser, or installed on a desktop, mobile device, or wearable. The challenges and techniques described in our paper related to audio, video, network, power conservation and security are important to other developers building cross-platform apps involving WebRTC, VoIP and cloud services.”

Keywords: HTML5, Apache Cordova, Chrome Apps, WebRTC, Mobile, Cloud, Wearable.

more>>

WebRTC vs. SIP/SDP

Every time a new protocol appears in the protocol jungle of multimedia communications [2], people attempt to compare and contrast it with existing established protocols, such as SIP. With WebRTC as the new kid on the map, you can find several attempts to compare SIP and WebRTC [3][4][5][6][7]. Depending on which camp the comparison originates from, you may find flavors of favoritism, or unintentional downplay of the importance of the other camp.

This article presents my point of view, hopefully unbiased, on this topic. Let me start by saying that SIP and WebRTC are different, and it is not fair to compare them without an established context.

Complex protocol vs. simple API

SIP is a protocol, not an API; whereas WebRTC is an API, with an associated set of protocols.
Consider that TCP is a protocol but socket is an API. TCP has complex state machinery to enable reliable bi-directional end-to-end packet flow assuming that intermediate routers and networks can have problems but end to end reliability is assured. On the other hand, the socket API hides all the complexity of TCP and provides an easy to use abstraction. Luckily in the case of TCP, developers only work on top of the standardized socket API. Whereas in the case of SIP, usually they end up dealing with the protocol directly, or work with (partly broken) implementations of intermediate network element such as a SIP proxy, or use (poorly defined) APIs from third-party SIP libraries. Depending on what level of abstraction is exposed in the SIP library, the development effort can range from very simple to very complex. In the case of WebRTC, the API is being defined and standardized, and is intended as a high level abstraction suitable for JavaScript/Web developers.

To list the differences between SIP and WebRTC - one could say that SIP is programming language agnostic but WebRTC is not, and must use JavaScript, or that SIP application could run on any device or platform, but a WebRTC application requires a browser to host the app. These are superficial differences - due to the core difference in the nature of SIP vs. WebRTC, i.e., protocol vs. API.

SIP would have been perceived as much simpler if there were well defined APIs from the start. But that would have limited the flexibility (and freedom) of what SIP could be used for. More on the flexibility and freedom is discussed later in this article.

One thing to note, however, is that WebRTC does use a set of complex protocols behind the scenes, many of which are shared by a SIP system, as we describe next. Although a web application using WebRTC is simple due to the high level API, the implementation of WebRTC in a browser itself is not that simple!

Protocol stack

SIP is an established protocol for Internet communication. WebRTC is an emerging API intended to be provided by the browsers and consumed by the applications in JavaScript. It has an associated set of protocols, some of which overlap with those used in a SIP system. The title of this article indicates a comparison of WebRTC with SIP/SDP, and not just with SIP. This is because SIP is often only a part of the overall puzzle in a real world communication system, as shown in the following diagram taken from my research paper [1]. In fact, the role that SIP plays in a real system is outside the scope of the WebRTC specification currently. Which means that it does not really make sense to compare SIP alone with WebRTC protocols, but it is okay to compare a SIP system with that based on WebRTC.

Comparison of typical SIP vs WebRTC application stack. The dotted line separates what is programmed by the application developer and what is provided by the platform.

Although most SIP systems use SDP, it is not a requirement, and some SIP systems may use something else if needed. Same thing holds true for other protocol elements such as RTP for media transport. However, within the context of a voice and video call application, you will likely see the above mentioned protocols in effect. On the other hand, the WebRTC effort specifies the mandatory set of protocols including SDP, SRTP and ICE, and also, the minimum list of mandatory audio and video codecs, which must be implemented by a WebRTC capable browser. In the diagram above, while SIP is only one block in the system, WebRTC specifies multiple blocks including the API, codecs, SRTP/RTP/RTCP, ICE and to some extent SDP. While SDP is used in the WebRTC API, an application is free to change it to anything else when exchanging the session information with the peer, if needed. Defining and mandating more things in WebRTC makes the interoperability a little easier, unlike SIP where many such things are left to the implementation. We will talk about interoperability later in this article.

Network elements

The SIP specification defines the behavior of an endpoint (or user agent) as well as various network elements such as proxies and redirect servers. Other SIP extensions define more network elements such as a conference server, a presence server, and so on. With WebRTC, the focus is only on browser-to-browser communication, and only the browser's behavior is defined in the specification. Other elements such as a gateway or media server are free to implement the behavior. However, that part is likely not going to be standardized. This hopefully limits the number of specifications related to WebRTC, and hence limits the resulting complexity.

This does not mean that WebRTC systems will not need such network elements. Ability to call a phone number via a gateway, or to host a multi-party voice mixing or recording on a media server, or to locate the target user via a lookup service are all examples of crucial services required in many real world applications. Lack of standardization means that every vendor will likely define its own version of these elements, creating fragmentation. Such scenarios are actually inline with WebRTC's goal which aims at creating media path among browsers visiting the same website, and does not care for inter-website communication. More on interoperability vs. fragmentation is discussed later in this article.

Fortunately, it is entirely possible to create fully functional WebRTC systems without using complex or heavy weight network elements. In fact, many initial demonstrations of WebRTC applications have followed this path with peer-to-peer media flows, along with a very simple signaling/rendezvous service. SIP, too, started out with this model of peer-to-peer media flows with a simple signaling/rendezvous service in the form of a SIP proxy or redirect server. However, practical limitations and real-world deployments have mostly thrown this idealistic model out of the window, and have largely adopted a "managed" network centric model with heavy weight network elements in the form of back-to-back user agents, session border controllers, and call stateful proxies. It is likely that the similar real world forces may cause a similar network centric deployment model in the case of WebRTC. Only time will tell!

Message flow

To establish a multimedia call, three pieces of information are exchanged between the two parties - (1) an "invite" from caller to callee as an intention to establish a call, which the callee may answer or decline, or ignore, (2) session description containing capabilities of each party, (3) transport candidates for establishing connectivity on the media path, hopefully peer-to-peer. Let us call these steps as (1) "invite", (2) session description exchange and (3) transport candidates exchange.

In a WebRTC-based application, these may be separated out into different phases or may be combined together, as determined the web application. In fact, (1) above is not part of the specification, but is often implemented in a WebRTC call, and (2) and (3) are exchanged between the two parties using an out-of-band channel outside the scope of the WebRTC specification. The term "trickle ICE" refers to separating (3) from (2) in a WebRTC session negotiation, and is the default mode - although it is possible to not use it, as determined by the web application. In a SIP system, all three pieces are usually bundled in a single request-response transaction and are carried in a SIP INVITE and it's 2xx response between the two parties, often via one or more intermediate proxies.

The separation of these pieces of information in WebRTC makes it more flexible, e.g., an application can invent any kind of user lookup service over any protocol (HTTP POST, WebSocket, XMPP, Google Channel, LDAP, hard-coded, or whatever, or even SIP [1]). Some applications may not even need such a lookup service, e.g., a virtual presence application where everyone who visits the website is able to see and hear others on that site. Unlike this, in SIP, a user lookup service must be accessible via SIP.

For those who remember H.323, and its multiple steps of call setup in version 1, may argue that we have come back a full circle - started with multiple steps call setup, invited H.323 fast connect and SIP for single step call setup, and now back to multiple steps to establish a WebRTC call. However the situations are different in H.323 vs. WebRTC - in H.323 all the steps were done by the same entity, and after the initial "invite" step, the subsequent ones were follow through, with no perceived benefit to keep them separate. In the case of WebRTC, the "invite" step is outside the scope of WebRTC specification, hence it makes sense to keep it separate. The last transport candidates exchange step is incremental with real benefit to keep it separate from session description exchange, so that better peer-to-peer flows can be detected in future, which first peer-to-peer flow that works gets in use as soon as possible. In the case of H.323, there is no equivalent of this step in my opinion, because the second and third steps are about exchanging media capabilities, and then establishing fixed set of logical channels based on those media capabilities.

The SIP specification and many of its extensions, such as call transfer, interoperability with phone, or capabilities of user agents, actually deal with only the first piece of information mentioned above, i.e., how to reach the target user or device, and are often related to the notion of a "call". In WebRTC specification, there is no notion of a "call", or signaling to lookup the target user. It is assumed that such constructs and concepts belong to the web application or the website that is hosting the application. This gives flexibility to the application developer on how to implement advanced features such as conferencing or call transfer. With more flexibility, comes more power, and more non-interoperability, described next.

Interoperability vs. fragmentation

In the past fifteen years, non-interoperability has been a major problem in the SIP community. Although it is often easy to achieve interoperable voice call signaling, it is rather difficult to make sure that those hundreds or thousands of SIP systems out there will implement every little detail consistently as per the specification. Moreover, there is no mandatory codec defined by SIP, which means that two perfectly compliant SIP endpoints may not talk to each other.

With WebRTC, you only need to achieve interoperability among the major browsers. WebRTC easily enables communication between users visiting the same website, where the application fills the void of signaling, user lookup, and the likes. Since signaling is outside the scope of the specification, interoperability between users visiting separate websites is not readily available and requires one-to-one federation between the two websites or web application. This fragmented communication behavior is inline with how existing web applications behave, i.e., they tend to create islands of users, where users within an island can talk to each other, but not beyond. Have you ever thought of being able to reach your Linked-In contact from your Facebook page? I have, and have tried to do something about it [8][9].

The fundamental difference of triangular vs trapezoidal call model in WebRTC vs SIP, respectively -- which means the two parties are connected to the same server in WebRTC, whereas two parties can potentially be on separate servers in SIP -- results in major differences in interoperability requirements.

The triangular way of WebRTC is not how VoIP community perceived open communication to be. From that perspective, even though WebRTC is an emerging open standard, it will created closed walled gardens of communication islands - leading to communication fragmentation. Unfortunately, even in SIP-based systems, due to non-standard behaviors in implementations or different implementation choices (e.g., codecs) by different vendors, it tends to create fragmented islands, where an equipment from one vendor rarely talks freely with that of another. This tends to create an ecosystem, where a customer must purchase phone endpoints and servers from the same vendor, because it is often hard to achieve true mix-and-match interoperability out-of-the-box.

The problem is more relaxed in the case of WebRTC, because there are only a few major browsers, and the protocol specification does not involve any signaling or media server, except for generic STUN and TURN servers, when appropriate. A fair comparison would have been possible if there were hundreds or thousands of browsers implementing WebRTC, similar to the plethora of SIP implementations that exist today. Or if there were only five major vendors creating SIP user agents. With SIP, the protocol was invented first, followed by an attempt to deploy the user agents and servers. With WebRTC, we already had user agents (browsers) and servers (web servers, STUN/TURN servers), and we put together a set of protocols to let them talk in real-time.

By leaving signaling outside the scope, a WebRTC specification escapes many problems related to non-interoperability. In practice, fragmentation seems to be unavoidable, largely due to business reasons -- or lack of incentive for one website to let its users talk to that on another website. The benefit of better interoperability, maintainability and usability far out weighs the cost of fragmentation for many websites and web applications. For some applications that require cross-site communication, a gateway can do the job, without being part of the specification.

The biggest non-interoperability and fragmentation threat to WebRTC is that some major browser vendors may not implement it altogether, or may implement a variant of the specification, making it non-interoperable with other browsers. Hopefully, market pressure will stabilize the interoperability among major browser. Or developers will find alternatives (plugins? gateways?) to fill the gap. Again, only time will tell!

Flexibility and freedom

We have mentioned about flexibility at the protocol and application level. In summary, SIP provides greater flexibility at the protocol level and WebRTC at the application level. With greater flexibility comes more freedom to innovate, and unfortunately, more problems in interoperability.

First, SIP provides flexibility at the protocol level, e.g., ability to use non-SDP based session description, or ability to use RTP vs SRTP vs ZRTP based media transport, or ability to add newer extensions for media path or call signaling. WebRTC mandates many such elements in the protocol, e.g., use of SDP, SRTP, ICE and so on. In future, if a newer or better media transport, NAT traversal or session description is invented, the specification will need to be changed to incorporate that. However, upgrading the specification when there are only a few browser implementations should not be a nightmare, especially if the upgrade is backward compatible. Once WebRTC specification gets ratified, it leaves little room to change or improve, in an implementation. With SIP's flexibility and resulting freedom, people have created many hundreds of different implementations and use cases using the same base protocol. WebRTC lacks the ability, for example, to implement non-trivial media path, such as application level multicast or local LAN broadcast, or to let the web application configure differential service for different media flows, or to let the web application inject text-to-speech in a media path.

The protocol flexibility of SIP comes at a price - no guarantee of common features such as security or common codecs in different implementations. For example, the voice path in one SIP call could be unencrypted and in another could be encrypted - determined by the implementation choice. Whereas in WebRTC, many of such features are well defined and mandated, as described in the next section.

Keeping the signaling part out of scope and defining the API give good application level flexibility to WebRTC systems. As mentioned before, features such as call transfer, conferencing or user lookup are application's responsibility. This allows a social website to implement these features different from a click-to-call provider's website, for example. Given the numerous websites, and their potential for incorporating real-time in-browser communication, such flexibility is very useful and highly desired.

Protocol differences in SIP vs WebRTC system

Both SIP and WebRTC systems use certain common set of protocols, e.g., SDP for session description, ICE for NAT traversal, and (S)RTP for media transport. The following chart from my paper [1] summarizes the interoperability differences.

Interoperability differences in SIP vs WebRTC; (o) means optional.

In practice, the session description (SDP) and media path defined by WebRTC requires many extensions, that are not found in existing SIP systems, and hence it is very hard to transparently interoperate between the two without using a media gateway. Eventually, one would hope that existing SIP systems would adapt to incorporate these new extensions. But history has told us that this is next to impossible - given the large number of SIP implementations, and given the availability of media gateways. Thus, a browser-to-SIP call will almost always involve a media gateway, unless you sell your own SIP endpoint.

Frequently asked questions (FAQ)

I have heard these. You will likely hear them too.

Are SIP and WebRTC competing technologies? Yes and no. It depends on the context. As an access protocol to connect a client to an established voice/video service or infrastructure, yes. For anything else, most likely no.
Is WebRTC a subset or refinement of SIP? No. WebRTC has certain elements and features that are not present in SIP/SDP. For example peer-to-peer data channel, or trickle ICE, or programming APIs.
Is WebRTC a superset or generic case of SIP? No. A SIP-based system includes elements and features that are not present in a WebRTC browser, e.g., well defined network elements, call event notifications, or user lookup service.
Is WebRTC similar to P2P-SIP? or Jingle? or RTMFP? or name-your-favorite-here? No, for all. In theory, WebRTC has some overlap with session description and media path of a standard SIP system. In practice, WebRTC is very different from any of the existing systems we had.
Does SIP make WebRTC better? What about the other way? Depends on who you talk to. A SIP proponent would argue that cross-site communication and PSTN interoperability requires using SIP at the core, or that SIP in JavaScript is good enough for signaling of WebRTC. A WebRTC proponent would argue that replacing SIP with WebRTC will improve on security, interoperability and call quality. In my opinion, these arguments are either too narrowly focussed or a compelling counter argument can easily be made. So my answer is "no" on both counts.
Can SIP and WebRTC work together, instead of competing? Absolutely. Firstly, they do not compete except in a very narrow context. And secondly, they do work together - the SDP generated by WebRTC can be carried in SIP, directly from the browser or via a gateway. See [1] for alternatives.
Will WebRTC remain at the end, and must SIP be at the core? No. It is entirely possible to create WebRTC capable network elements such as media servers and gateways without depending on SIP, that allow creating voice/video infrastructure with WebRTC at the core. It also possible to create many existing use cases with peer-to-peer media path of WebRTC, without using these network elements at all.
Does WebRTC have better voice quality than SIP? No. Because, SIP does not specific fixed set of voice codecs or voice quality. At present, when many SIP systems use G.711, G.729 or Speex, the voice quality of Opus used in WebRTC is superior to those SIP systems. However, nobody has stopped those SIP systems from using Opus (and related media engine) in their voice stack.
Will WebRTC improve voice quality of my PC-to-phone application? No. Unless you are talking about smart-phone with WebRTC running on phone also. When voice path gets translated to a phone call, the quality will likely be reduced to G.711 or G.729 or whatever the gateway provider has deemed suitable on the phone network.
Will WebRTC give better voice quality for my conferencing application? Depends. See previous question. For centralized media path (with audio mixer), it requires transcoding, so voice quality will likely suffer. Optimization such as using peer-to-peer voice path, or reflecting loudest-N streams to client without mixing may preserve the end-to-end voice quality in conferencing.
Why does browser not implement SIP directly? For browser-to-browser use case, SIP is not needed. For anything else where one end is a browser, a gateway can do the job of translating to SIP if needed. And WebRTC is focussed on browser-to-browser use case only. Having said that, it is possible to do SIP in JavaScript [1] running in your browser, or to implement your own plugin or browser with built-in SIP stack and related APIs.
What does it mean that a SIP conference server supports WebRTC? It could mean that it supports a browser end-point to participate in a multi-party conference over WebRTC, or it could mean that it uses Google's WebRTC media engine or the OPUS/VP8 codec for the voice/video stack but requires SIP/H.323 endpoints to connect. Be careful!
Is WebRTC better than SIP/SDP? Hard to say. WebRTC attempts to fill the gap for a very specific problem, but requires robust, secure and high quality implementation, and defines simple high level API. On the other hand, SIP leaves many such things open to implementers. So for the specific problem, WebRTC is better, some might say.
Between WebRTC and SIP, who will win the racing? There is no racing. Or if there is one, then WebRTC and SIP are racing in different directions, at 90 degrees to each other. Depending on where you look from, or which axis has the finish line, either of them could win. But the truth is, WebRTC is for browser-to-browser communication, where SIP/SDP/RTP is not feasible currently -- so there is no racing. One could argue that with more number of browsers supporting WebRTC than number of SIP endpoints, or with more number of JavaScript developers than number of SIP implementors, WebRTC is likely to win the racing. No comments on that. A more correct analogy of racing, in my opinion, is in specific contexts, e.g., between WebRTC of browser and RTMP+ RTMFP of Flash Player for browser-to-browser communication, or between SIP phone vs. WebRTC browser for client access to backend voice/video service or infrastructure.
What is the real difference between WebRTC and SIP? Ah, glad you asked! If you have already read the rest of this article, you will be surprised to find that unlike all the differences mentioned before, the real difference is who the most important customers of WebRTC vs. SIP are - for WebRTC it is the web developers community, and for SIP it has been telecom providers and equipment vendors.

Conclusions

This article has attempted to present several factors comparing and contrasting SIP/SDP vs. WebRTC systems, without really diving in to the specific protocol differences of how SDP/RTP are used in SIP vs. WebRTC. This is because, while WebRTC specifies the list of profiles and extensions to SDP/RTP, the SIP system is free to choose whichever it likes. So any such comparison will end up being one between WebRTC and SIP of vendor-X.

There are similarities and differences between the two types of systems. In the end it is about the applications and user experiences, rather than protocols and APIs. In practice, it is easy for an established SIP-based IVR system to add WebRTC access, compared to that for a newbie WebRTC system to implement full fledged IVR functionality. The same holds true for many existing telecom, web communication and Internet conferencing applications. Thus, many existing SIP-based applications will likely also adopt WebRTC for access, if there is a demand. And similarly, new pure WebRTC-based web applications will likely incorporate SIP gateway-ing to reach out to non-browser users, if there is demand. SIP or WebRTC itself is just a small piece of the puzzle in the overall system or application. Nevertheless, WebRTC attempts to fill a void that has existed for a really long time in the SIP-based communication world, for which past attempts such as using plugins have largely been unsuccessful, especially on mobile platform.

About Bellheads, Netheads and Webheads

Many people are familiar with the Netheads vs Bellheads battle [1][2]. A Bellhead has a tightly controlled and centrally managed view of services that work all the time and on which the provider can deploy applications and consumers will pay for them, usually a heft price every month. A Nethead is happy to use loosely connected hardware that is up 99% of the time, but allows creating a massive loosely connected network of software applications, where the hardware does not care which software is running or sending traffic.

I believe, most people of my generation belong to the Netheads family, even though their employers may be Bellheads. From [1] "This philosophical divide should sound familiar. It's the difference between 19th-century scientists who believed in a clockwork universe, mechanistic and predictable, and those contemporary scientists who see everything as an ecosystem, riddled with complexity. It's the difference between central and distributed control, between mainframes and workstations."

More recently a new generation is evolving, called Webheads. (Please do not google for this term, or you will be misled). For a Bellhead, the control must be strict and centralized, with managed services. For a Nethead, the control, if any, is distributed in the end-point, with layered applications on top of the basic best effort infrastructure. For a Webhead, everything must run or be accessible from a browser.

"Weird!" - some might say.

"How is it even in the same league?" - others might ask.

Here is my attempt to present the distinguishing attributes.

	Bellhead	Nethead	Webhead
Evolution	Initially there were Bellheads. They were happy living in their islands, working hard towards what the island demanded, and created applications and services for the island.	Then Netheads evolved, and separated applications from the grasp of the services. And interconnected the islands. They created Internet. They created the world-wide-web. They created SIP. And the list goes on.	Then Webheads emerged, and they started moving applications to the web. Slowly and steadily. First, it was email to webmail. Then documents on web editor. Then VoIP to web telephony. Now it looks like they have spread every where and practically moved all kinds of applications to the web.
Control	Believe in centralized control, like one super-power to rule them all. The central element has enough gravity to pull many applications and services, resulting is heavy weight servers but thin clients.	Believe in distributed control or no-control, resulting in chaos at times. The fluidity of the services and applications keeps them decoupled, or loosely connected with each other, spreading their wings to the peripherals.	Believe in centralized control in the form of the website on which their application is hosted, but often accept the fact that there can be numerous such control elements. Work hard to avoid any connectivity or information leak to other controls, outside their own.
Robustness	Everything hardware must be 99.999% reliable. Overall system must be 99.9% reliable. If system goes down, even if nobody is using it right now, they will lose business. Things have strict guarantees.	The system works most of the time, but fails sometimes, but no one person or entity may be blamed, because it is collective responsibility. No guarantees, just best effort service, whatever... whatever works well	The system works if it does, and does not if it does not. They work hard to keep the system running. But many systems are poorly designed, and often fail on sudden popularity of the website. Often blame for the failure is put on the network, and many times on the database.
Example	To add video communication on top of voice calls is a massive renovation project involving upgrading existing services, making the pipes bigger to carry video, incorporating fixed bit rate video codecs to guarantee quality, and upgrading terminal devices to support video. It must work all the time from day one.	To add video on top of a voice call client-server application, is just changing the application to use the desktop webcam and display, and pushing new video packet along side voice, without changing any other infrastructure element in most cases.	To add video on top of a voice call web page, there is a redesign of the website to accommodate the visual elements, so that everyone viewing the same web page can see and hear each other. Regarding the codecs and bandwidth, delegate that to Flash or WebRTC.

In some ways, like Bellheads, Webheads also think of centralized control and management. The hosting website is the controller of an application and keeper of all the user data that could be generated on that website. In a way, they both create islands of innovations which do not communicate with each other very easily, except occasionally via rowboats, i.e., the likes of oauth and web APIs. Like Netheads, Webheads do not care about lower layers of communication, as long as the browsers can reach their web servers. Many things that are well studied in the Netheads community, e.g., socket programming, voice and video communication, or binary data processing, are only recently being introduced to Webheads, e.g., in the form of WebSocket, WebRTC or ArrayBuffer. If Bellheads are purists, then Netheads are hackers. And compared to them, Webheads are hackers++.

[Heard elsewhere...]

Bellhead: "We invented video communication two decades ago - from H.320, H.323/Q.931/H.225/H.245, H.324, H.325.. (15 more). It works on ISDN, LAN, ATM, MPLS, POTS, 802.1x, ... (10 more). There is wide variety of codecs including G.711, G.722, G.723, G.729, H.261, H.263, H.264, ... (20 more). We just finalized our first $100k sale of equipment to whats-it-called company for their intra-office video telepresence."

Nethead: "We re-invented everything you did, and we can quickly build endpoint terminals that can do two party video call to large scale video broadcast. Everyone is now using what we created. We even sent our engineers to your companies to add our protocols to your equipments. Did you know that your equipment also supports SIP for video call? You may not like it, but our protocols are everywhere, and better than yours."

Webhead: "Not sure what you guys are talking about, but just got a tweet from my IT guy that our click-to-video-call app reached 1 million unique users. We are going to hire a new CEO to monetize our app now."

For Webheads, focus is not about services or protocols, but only about apps and users.

Spaghetti, ravioli and lasagna in software

[fiction]

As I was walking down the street, down the street, down the street. A very good friend I chanced to meet... And we went to this italiano restauranto around the corner. Here is what we talked about. [blink blink, screen turns black and white, time goes back...]

"Have you heard of spaghetti code?"

"Of course, I write it all the time. It is easy. I think of a problem, and start writing code to solve it. And do not stop until either the code is complete or my laptop battery runs out. And yes, I don't delete any line of code that I write. Eventually I end up with like two thousand lines of code in one giant function, and like two hundred control statements. It is fast, furious and fun. But tastes awful the next day."

"Really?"

"Yes. So, what kind of code do you write?"

"Well, I have tried my hands on ravioli, lasagna, and more recently farfalle and penne."

"I know ravioli and lasagna - just too modular or too many layers for my taste - I get lost in which piece or layer I am working on, and end up making a mess, embarrassing myself. What about farfalle and penne?"

"Oh, farfalle is what I create when I refactor my correctly working software to smaller pieces, to such an extent that each piece looks broken and half complete."

"What is penne?"

"It is when I refactor it right, and pieces look good. Not too big, not too small. And no information hiding and stuff like that. Frankly, I don't really like information hiding - it is more hype than hoopla. I like data oriented programming - software that is independent of the data - or pasta that is independent of the meat."

[waitress comes by, and asks...]

"I heard pasta, so assume you are ready. What would you two gentlemen have?"

"I will have the three-pasta house combo, with lasagna at the core, the first layer filled with penne, and the second layer with cheese ravioli. Spare the meat. Mix tomato, pesto and alfredo sauce and spread it all over."

[waitress makes a face, and turns to the other...]

"I like spaghetti, but I will also have what my esteemed friend asked for. Additionally, can you open all the layers and separate them, placing side by side, and break all the raviolis to remove the cheese and put it on the side too. I like to keep my pasta separate from other ingredients."

Irony in Internet Communications

We live in a world where

first, we build Internet Protocol over phone network (modem), and then we build a phone system over Internet Protocol (VoIP),
first, we build a phone system to run on Internet (VoIP), and then drive the internet from our phones (smart phones),
first, we build open communication on top of monopolies (Internet over PSTN), and then build closed communication on top of that to promote monopolies (Facetime, Skype, or you-name-it-here).

Wouldn't it be easier if we start afresh?

NetConnection or PeerConnection?

Flash Player and the corresponding ActionScript programming language have NetConnection/NetStream programming APIs for developing communicating web applications. Emerging WebRTC has PeerConnection/MediaStream APIs to enable real-time multimedia communication in the browser. This article compares and contrasts these two approaches to designing application programming interfaces (API). I have attempted to compare the API-styles rather than the specific implementations in Flash Player or browsers.

Background

In Flash Player + ActionScript, four categories of objects (or classes) are needed to build a communicating web application such as a video chat. First, you need a connection to the server which may be a media server processing the media path in a client server approach or just a rendezvous server for establishing a peer-to-peer media path in the peer-to-peer approach. Second, you create named streams, for publishing and for playing. Third, you attach the publish side stream with your capture devices for audio and/or video. And fourth, you attach the play side stream with display or playback of remote media. The corresponding classes for these are: NetConnection, NetStream, Camera and/or Microphone, and Video or VideoDisplay [1] (pp.16-18). The diagram below shows a two party video call between our regular actors, Alice and Bob, each creating a named stream to publish, and each playing the other party's named stream to display. On each terminal, a NetConnection object is used to create two named NetStream objects, one is attached to Camera and Microphone and another to a Video or VideoDisplay element. Additionally, for displaying local preview, the Camera object is also attached directly to a Video or VideoDisplay.

In WebRTC capable browser + Javascript, again, four categories of objects (or classes) are needed. First, a LocalMediaStream created via the getUserMedia function to represent the capture stream from the user's camera and microphone devices. Second, an PeerConnection (actually called RTCPeerConnection) object is created to establish a peer-to-peer path between the two browser instances. Third, the LocalMediaStream object is attached (or added) to the PeerConnection on one end, and appears as remote MediaStream object at the other end. Fourth, this remote MediaStream is attached to an audio or video element to playback or display the received media. The LocalMediaStream is also attached to another video element to display local preview. Unlike publish vs. play mode of NetStream, I put the LocalMediaStream vs. remote MediaStream into separate categories, due to the way they are created and what they represent, even though LocalMediaStream is also a MediaStream.

For comparison purpose, let us call these Flash vs WebRTC set of APIs as NetConnection vs PeerConnection.

Level of abstraction: named stream vs. offer-answer

Clearly NetConnection presents a higher level of abstraction than the PeerConnection. In PeerConnection, each media path is explicitly created via a new RTCPeerConnection object on each side. It uses offer-answer model - where one side sends a session offer with its capabilities and the other side responds back with an answer with its side's capabilities. The offer-answer messages are exchanged between the two browsers outside the WebRTC APIs. In case of NetConnection, when a participant publishes a named stream (MediaStream), zero or more other participants may play that named stream. The implementation of these connection and stream classes hide the details of the peer-to-peer or client-server media path establishment. Internally, some variation of offer-answer and transport candidate messages are exchanged between the Flash Player instances, but are not exposed to the programming interface - thus making it a higher level of abstraction.

Topology creation: client-server vs. peer-to-peer

A NetConnection is a client-server abstraction where as a PeerConnection is a peer-to-peer path abstraction. A NetConnection object can be used to create client-server as well as peer-to-peer media paths, whereas PeerConnection is targeted only towards a peer-to-peer media path. To further clarify - a NetConnection is between a client and a server, which may be using RTMP in which case the server is a media server for the client-server media path, or may be RTMFP in which case the server may just be a rendezvous server to establish peer-to-peer MediaStreams between the two browsers. Note that with RTMFP it is also possible to do client-server media path. Additionally, it also allows application level multicast for use cases such as large event broadcast. With PeerConnection, the media path is intended to be peer-to-peer, but it is also possible to implement PeerConnection at a server (outside the browser), in which case it gets transparently used to create a client-server media path because the server behaves as another browser endpoint of the WebRTC media path. With WebRTC, once MediaStream proxying is implemented in the browser, it may be possible, though not trivial, to create application level multicast for large group event broadcast. Proxying a media stream means that a received remote MediaStream from one peer may be attached (or attached) to another PeerConnection to another peer, thus proxying the media path via this peer.

Signaling channel: built-in vs. external

As explained in the previous points, NetConnection is the signaling channel to establish the client-server or peer-to-peer MediaStream objects. Due to the built-in nature of the client-server signaling channel, it is non-trivial for third-party vendors to build such a signaling server without implementing the full RTMP and/or RTMFP protocol. On the other hand, a PeerConnection generates the offer-answer and other signaling messages, which must be exchanged via another channel between the two browser instances, e.g., via a notification service over WebSocket, by using Ajax, or over other third-party channels. Such notification service does not need to understand the details of the WebRTC protocol. This gives more flexibility and freedom to the developers on how the signaling channel for a particular communicating web application is implemented.

Server dependent vs. independent

As explained in the previous points, a NetConnection is between a client and a server, thus dependent on the server. Additionally, any server side media processing such as mixing or gateway'ing to VoIP can only be done by extending or modifying this server. If the server is vendor dependent, which seems to be the case for RTMFP, until recently, then it makes these media processing and gateway'ing features also vendor dependent. On the other hand, a PeerConnection does not care about the server side. But an implementation of a web application that uses this object does. Hence, in practice, these types of servers may be needed - notification server for exchanging offer-answer and other messages, media server for doing things like server side recording or mixing, media gateway for translating to VoIP or other networks, and media relay for gathering reflexive or relayed transport candidates and for actually relaying media path across network middle boxes.

Summary and further reading

In summary, the higher level NetConnection-style interface with built-in signaling channel makes it easier for programmers to create applications, ranging from two-party video calls to multi-party conferences to large scale event broadcasts. On the other hand, greater flexibility (and freedom) of the PeerConnection-style interface allows building wide range of system architectures using third-party components such as notification servers, media servers, media gateways and media relays.

In my past research on flash-videoio [2] and aRtisy [3], I have proposed another higher level abstraction for the programming APIs for communicating web applications. In fact [3] describes how to implement the NetConnection-style interface on top of WebRTC's PeerConnection-style interface and using an external resource server to provide the notification service. It extends the named stream model to named resource model to implement several other applications such as text chat or shared whiteboard or notepad. Then, [2] and [3] describe how to build a widget-oriented interface which consists of a single video-io widget used in either publish or play mode to create a wide range of communicating web applications such as two-party calls, multi-party conferences and video presence for Flash and WebRTC, respectively.

My WebRTC related papers from 2013

This post gives an overview of my past and ongoing research on web-based multimedia communication. I wrote four published co-authored research papers last year answering these questions - How does WebRTC interoperate with existing SIP/VoIP systems and how does SIP in JavaScript compare with other approaches? What challenges lie ahead of enterprises in adopting WebRTC and what can they do? How can developers take advantage of WebRTC and HTML5 in creating rich internet applications in the cloud? How does emerging HTML5 technology allow better enterprise collaboration using rich application mash-ups? Read on if one or more of these questions interest you...

A Case for SIP in JavaScript

WebRTC (Web Real Time Communications) promises easy interaction with others using just your web browser on a website. SIP (Session Initiation Protocol) is the backbone of Internet communications interconnecting many devices, servers and systems you use daily for your call. Connecting WebRTC with SIP/RTP raises more questions than what we already know, e.g., does interoperability really work? what changes do you need in the existing devices and systems? does the conversion happen at the server or in the browser? what are the trade-offs of doing the translation? what are the risks if you go full throttle with WebRTC in the browser? and many more. There is a need for a fair comparison to evaluate not only what is theoretically possible but also practically achievable with existing systems.

"This paper presents the challenges and compares the alternatives to interoperate between the Session Initiation Protocol (SIP)-based systems and the emerging standards for the Web Real-Time Communication (WebRTC). We argue for an end-point and web-focused architecture, and present both sides of the SIP in JavaScript approach. Until WebRTC has ubiquitous cross-browser availability, we suggest a fall back strategy for web developers — detect and use HTML5 if available, otherwise fall back to a browser plugin."

read >>

Taking on WebRTC in an enterprise

On one hand WebRTC enables the end-users to communicate with others directly from their browsers, on the other it opens ~~a can of worms~~ a door to many novel communicating applications that scares your enterprise IT guy. Existing enterprise policies and tools are designed to work with existing communication systems such as emails and VoIP, and employ bunch of edge devices such as firewalls, reverse proxies, email filters and SBCs (session border controllers) to protect the enterprise data and interaction from the outside world. Unfortunately, session-less WebRTC media flows and, more importantly, end-to-end encrypted data channels, pose the same threat to enterprise security that peer-to-peer applications did. The first reaction from your IT guy will likely be to block all such peer-to-peer flows in the firewall. There is a need to systematically understand the risk and propose potential solutions so that the pearls of WebRTC can co-exist with the chains-and-locks of enterprise security.

"WebRTC will have a major impact on enterprise communications, as well as consumer communications and their interactions with enterprises. This article illustrates and discusses a number of issues that are specific to WebRTC enterprise usage. Some of these relate to security: firewall traversal, access control, and peer-to-peer data flows. Others relate to compliance: recording, logging, and enforcing enterprise policies. Additional enterprise considerations relate to integration and interoperation with existing communication infrastructure and session-centric telephony systems."

read>>

Building communicating web applications leveraging endpoints and cloud resource service

One problem with the growth of SIP was that there were not many applications that demonstrated its capabilities beyond replacing a telephone call. This resulted in telephony mindset dominating the evolution in both standards and deployments. At the core of it, WebRTC essentially does most of what SIP aims for - peer-to-peer media flows and control in the end-points. Unfortunately, telephony influence in existing SIP devices, servers and providers have hidden that benefit from the end-user. There is a need to avoid such dominating telephony influence on WebRTC by demonstrating how easy it is to build communicating web applications, without relying on existing (and expensive) VoIP infrastructure.

Secondly, the open web is threatened by closed social websites that tend to lock the user data in their ecosystem, and force the user to only use what they build, go where they exist and talk to who they allow. This developer focused research is an attempt to go against them, by showing a rich internet application architecture focused on application logic in the end-point and application mash-ups at the data-level. It shows my 15+ crucial web widgets written in few hundred lines of code each, covering wide rand of use cases from video click-to-call to automatic-call-distribution, call-queue and shared white-boards, and my web applications such as multiparty video collaboration, instant messaging/communicator, video presence and personal social wall, all written using HTML5 without depending on "the others" (imagine tune from lost played here) - the others are the existing SIP, XMPP or Flash Player based systems.

"We describe a resource-based architecture to quickly and easily build communicating web applications. Resources are structured and hierarchical data stored in the server but accessed by the endpoint via the application logic running in the browser. The architecture enables deployments that are fully cloud based, fully on-premise or hybrid of the two. Unlike a single web application controlling the user's social data, this model allows any application to access the authenticated user's resources promoting application mash-ups. For example, user contacts are created by one application but used by another based on the permission from the user instead of the first application."

read>>

Private overlay of enterprise social data and interactions in the public web context

Continuing the previous research theme, this research focuses on enterprise collaboration and enterprise social interactions. The goal is to define a few architectural principles to facilitate rich user interactions, both real-time and offline in the form of digital trail, in an enterprise environment. In particular, my proof-of-concept project described in the paper shows how to do web annotations, virtual presence, co-browsing, click-to-call from corporate directory, and internal context sensitive personal wall. The basic idea is to separate the application logic from the user data, keep the user data protected within the enterprise network, use public web as a context for storing the user interactions and build a generic web application framework running the browser.

"We describe our project, living-content, that creates a private overlay of enterprise social data and interactions on public websites via a browser extension and HTML5. It enables starting collaboration from and storing private interactions in the context of web pages. It addresses issues such as lack of adoption, privacy concerns and fragmented collaboration seen in enterprise social networks. In a data-centric approach, we show application scenarios for virtual presence, web annotations, interactions and mash-ups such as showing a user's presence on linked-in pages or embedding a social wall in corporate directory without help from those websites. The project enables new group interactions not found in existing social networks."

read>>

The dangers to the promise of WebRTC

WebRTC is an emerging HTML5 component to enable real-time audio and video communication from within the web browser of a desktop or mobile platform. In this article I present my critical views on what can potentially go wrong in achieving the promise of WebRTC.

The promise

The technology promises the web developers a plugin-free cross-browser standard that will allow them to create exciting new audio/video communication applications and bring them to the millions (or even billions!) of web users on the desktop and mobile platform.

The rest of the article assumes these premises:

A technology is easy to invent but hard to change.
Business motivation drives technology deployment.

The dangers

Following are just some of the questions I will explore in this article.

What if Apple does not commit to WebRTC?
What if Microsoft's implementation is different than Google's?
What if there are different flavors of the API implementation?
What if a web browser cannot talk to legacy SIP infrastructure without gateways?
What if "they" cannot agree on a common video codec?

Let us begin by listing the existing technologies - real-time communication protocols (RTP, SIP, SDP), HTTP, web servers, Flash Player and related media servers on desktop - these are hard to change. On the other hand the emerging HTML5 technologies can be changed in the right direction if needed.

Now what are the business motivations for the browser vendors? Google is the most open, rich and web-centric browser vendor with a lot of money and a lot of motivation to drive the technology for a true browser-to-browser real-time communication. It makes sense to design a platform that enables web clients to connect to each other via a rendezvous or relay service that runs in the cloud. Improving and simplifying the tools available to web developers while hiding the complexity of protocol negotiation will improve the overall web experience, which in turn will drive business for web-centric companies such as Google. Moreover, the end-to-end encryption and open codecs are more important to Google than having to deal with legacy VoIP infrastructure -- which gets reflected in the choice of the several new RTP profiles for media path and VP8 for video.

On the other hand, Apple prefers closed systems with integrated offering that works seamlessly across Apple platforms. Opening up the Facetime (or related technology) to all the web developers is interesting but not motivational enough, especially when the resulting web application and the WebRTC session is controlled by the web developers and users instead of the iOS developer platform or the iTunes portal. This makes Apple, as a browser vendor, uncommitted to WebRTC until it proves its worth to the business.

Microsoft already has a huge communication server business that is largely SIP based. Any technology that makes it difficult to interoperate (at the media path/RTP level) with the existing SIP application servers is likely to cause trouble. They could build a gateway, but it will be hard to compete with others who can provide such services on the low cost cloud infrastructure. Hence, having the building blocks in place that will enable a web browser to talk to existing SIP devices in the media path is probably a core requirement for Microsoft. Thus, low level access to SDP and more control over media path is proposed in their draft.

For a web browser on desktop, WebRTC is interesting in bringing a variety of new use cases and web application. However, these use cases are already available to web users on the desktop via alternatives such as Adobe's Flash Player. It it not clear what the WebRTC can contribute in terms of additional use cases for web applications on desktop. However, for mobile platforms and smart phones it is a completely different story. Flash Player either does not work or is too difficult to work for video conferencing on mobile. HTML5 presents a great opportunity to bridge the differences between end user platforms when it comes to web applications. Thus, adding WebRTC to web browsers on mobile phones such as iPhone and Android adds a lot of value.

Let us consider a few "what if"s that pose danger to the promise of WebRTC.

What if Apple does not commit to WebRTC?

If Apple with its iOS platform and variety of cool gadgets and frantic fans-base decides to keep the real-time communication out of reach of web developers -- people will still build such application albeit as standalone native applications and sell over iTunes. Eventually the web developer may end up writing custom platform dependent kludges, e.g., invoke the native API for iOS but use WebRTC for others. Once Apple realizes that the web-based real-time communication is slipping its grips it may be too late. Moreover, other browsers besides Safari could run on iOS instead.

What if Microsoft's implementation is different than Google's?

Based on the historic browser interoperability issues and the recent announcement by Microsoft, this is quite likely. Fortunately, web developers are used to such custom browser-dependent kludges. However, WebRTC poses greater risk of differences in operation and interoperability beyond just some HTML/Javascript hacks. Having a server side gateway for media path to facilitate interoperability among browsers from different vendors is not in the spirit of end-to-end principle and is not the correct motivation for WebRTC in my opinion. Moreover, the Chrome Frame extension for Internet Explorer could relieve web developers to some extent.

What if there are different flavors of the API in different browsers?

This is similar to the previous threat -- if the implementations among the browsers do not interwork, especially among the leading mobile platform vendors such as Apple, Google and Microsoft, then we are back to the square one. As a web developer what will be the motivation to build three different applications using WebRTC when one can build once using Adobe Flash Player? Perhaps, Adobe will revive the mobile version of Flash Player to solve the differences :)

What if a web browser cannot talk to legacy SIP infrastructure without gateways?

This issue has troubled both telecom and web-centric vendors alike. Web browsers do not want to blindly implement legacy SIP family of standards (actually RTP) because the problems on the web are different, and need to be addressed differently. The telecom centric vendors do not want to re-do everything to interoperate with web browser. If they can get the web browser to talk to their existing SIP/RTP components by keeping the media path intact, it will be a big deal. Having to implement a variety of new RTP profiles in their equipment for interoperability with web is unmotivational.

This could go in one of the two ways - either browser vendors agree on not interoperating with legacy RTP devices in which case we will need a gateway to do interoperability, or the browser vendors implement proprietary extensions (back doors!) that allow such hooks. If it is the latter, it will get immediately picked up by the telecom-sponsored developers to build applications that use the legacy RTP mode only -- not a good thing!

What if "they" cannot agree on a common video codec?

The debate on what codec to pick is as old as the debate on WebRTC. Currently, "they" are forming two camps -- one that supports royalty free VP8 family of standards and other that backs ubiquitous H.264 family of standards (or keep no mandatory codec). The advantage of H.264 is that many hardware already comes with fast H.264 processing hence will be available natively on mobile platform, whereas with VP8 you may end up using your battery life on the codec -- until VP8 becomes available in the hardware platform. It is not clear what will happen. If VP8 gets picked as the mandatory-to-implement codec in the browser, I can see some browser vendors digressing from the standard and implementing only H.264 (hardware supported codec) or implementing both VP8 and H.264.

What are the other potential threats to this technology? What if the standardization works drags for too long, until it is too late, and web users have moved on to something better! Web is driven by web applications. What a browser vendor does or what a standardization committee adopts are just a shadow of what the web applications do (or want to do). A web developer prefers to have a consistent, simple and unified API for real-time communication.

If it were in my hands, I would ban any IETF or W3C proposal without a running and open implementation. But the reality is far from this!