Articles on technology, Internet, Protocols, Web, Open Source, VoIP, P2P, SIP, RTMP and WebRTC.
A journey through real time...
What are endpoint driven communication systems?
Impress.js 3D presentations on cloud, communication, WebRTC and mobile
click here to see all my presentations
Below, I list my presentations and associated demo videos largely based on impress.js framework. When viewing a presentation, use an HTML5 capable browser such as Chrome or Safari, and please wait for it to loads completely, and the browser's loading icon to stop spinning, before doing the slide show.
Command line Twilio client
[This post is neither contributed not endorsed by Twilio]
Twilio client [1] enables embedding voice conversation in web and mobile apps, by creating a voice pipe between your browser or mobile device and the service. One thing missing there is the ability to create such a voice pipe from non-mobile or non-web programs, such as a command line application. There is a shell script [2] to place a call for testing from command line, but it uses pre-defined text or recorded file for media, and looses the real-time interactive nature of the voice path. In particular, it does not create a voice pipe between the local machine and the service.
Motivation
Ability to connect real-time interactive voice path from an application opens doors to wide range of other use cases such as media path processing and analysis, e.g., for real-time transcription, or to bridge call between diverse services, e.g., translate between IM and voice call. Secondly, such as mechanism is independent of a specific browser or mobile platform, and can work in headless mode for automated testing or client-side programmability of voice call or its media path.
At the high level, there are four potential ways to create such a voice pipe. Two of these can be accomplished using my rtclite project, that we will describe in more detail in this article.
- WebRTC - using Twilio 1.3+ web client with command line WebRTC app
- Using Twilio mobile client interface ported to command line app
- RTMP - using Twilio 1.2 web client API with command line RTMP client
- SIP - send/receive SIP call to/from Twilio [3]
The first two approaches essentially implement a command line version of the web and mobile SDKs. For example, a WebRTC stack compiled for Linux/OS X may be used to accomplish (1). The third approach uses a command line RTMP client and an older version of client API, and the fourth one uses a command line SIP endpoint to dial into the service.
We describe how to do the last two using software pieces from our open source rtclite project [5]. The following video demonstration shows the command line call initiation. Don't forget to view in full screen!
Connect to Twilio from command line SIP endpoint
The approach is described on the provider's website [3] including the steps for creating the SIP domain/endpoint such as yourname.sip.twilio.com. The description there is targeted for VoIP providers rather than client. Currently it is not possible [4] to configure your SIP softphone to use the Twilio service as SIP server. Once the SIP domain is created, you can send request to "sip:something@yourname.sip.twilio.com". The sender's IP address needs to be white-listed or the caller's credential needs to be preapproved for authentication. we have only tried the IP address white-listing, and not the SIP authentication using credentials.
After configuring a SIP endpoint, you should be able to see the SIP domain and its associated voice URL on the provider's website, e.g., "yourname.sip.twilio.com" mapped to voice URL of "http://yourserver/yourtwiml.xml".A simple call forwarding can be done using the following TwiML. These steps can be used to connect to any other TwiML application from the command line SIP endpoint, if you configure your SIP endpoint's voice URL accordingly.
<?xml version="1.0" encoding="UTF-8"?> <Response> <Dial timeout="20" callerId="1415xxxxxxx" method="POST">415yyyyyyy</Dial> </Response>
You can also create programmable server side script to derive the target number using the called SIP address. For example, send call to sip:1415yyyyyyy@yourname.sip.twilio.com, and in your script, extract the user name part of the URL to populate the Dial tag's content.
Once the initial configuration is complete, use the SIP endpoint available in the rtclite project to initiate a call. More details on the command line SIP endpoint are available in my previous blog post as well as on the project website [6]. After setting up the initial dependencies of the project, run the caller module as follows.
$ python -m rtclite.app.sip.caller --domain=example.net --use-lf --samplerate=48000 \
--to=sip:something@yourname.sip.twilio.com
The "domain", "use-lf" and "samplerate" options are described on the project website. The "to" option is important, and represents the target address to call to. Once the call is established, the py-audio project's modules are used to interface with the audio device, and to send and receive audio in the call.
Some important notes follow. For a demo account with the provider, your callerId and target number in the Dial tag must both be verified for your account. Some Internet Service Providers (ISP) may block a SIP request on port 5060. In fact, I sometime experience wifi reset of my residential equipment, when I send a SIP packet out from my home machine to outside VoIP service. Router interference of SIP messages may also be the reason for incorrect CRLF handling, and the need for use-lf option. Using TLS may be an option to work around router interference. Make sure that the sample rate specified on the command matches the allowed sample rate of the audio device on your client machine. On Mac OS X, audio capture is usually done with 48kHz.
Connect to Twilio from command line RTMP client
Here, we exploit the older Flash-based Twilio Web Client SDK version 1.2. First, we describe how to figure out the client-server RTMP message flow, and next we show how to do this using the rtmpclient module from the rtclite project.
The web client SDK 1.2 used this twilio.js javascript file. A quick look indicates that it internally loads another twilio.js file.
This second JavaScript file shows that the Flash-based client-server connection is accomplished using the MediaStream class which internally attempts RTMP connection to the service.Using the hello-client-monkey.php example from the provider, but hosted on your website, and running wireshark on default RTMP port 1935, you can find more information about this client-server exchange. Make sure to use http instead of https for your hosted web app to avoid encryption. The following screenshots show a few example RTMP messages, and their order, and the parameters sent in the initial connect request.
In summary, the NetConnection's connect is done to "rtmp://chunder.twilio.com/chunder", with additional 6 arguments. The first argument looks like the capability token generated by the helper library. The second and third are null and empty string respectively. The fourth one is a JSON formatted string with some client side attributes. The fifth looks like the account SID. And last one looks like the client SDK version. After connect is complete, there are two createStream calls and a startCall RPC method. This is followed by a received RPC method callsid from the server. Next, the "input" stream name is published, and "output" stream name is played. This is followed by bunch of audio data.
We wrote an example application, client.py [7], which uses the rtmpclient module from the rtclite project, and automates client-server exchange process. It then connects the audio sent/received on the two streams to the local audio device. Although, not well tested, there are other command line options in this application that allow recording the received stream, or playing a file to the sent stream.
The command line application takes three mandatory parameters, which if not supplied, will be prompted for. These parameters are for account SID, auth token and application SID. The client application then connects to the service using RTMP and uses those supplied parameters. An example command line invocation is shown below, where the three parameters are fake - use the correct values in your test!
$ python -m rtclite.vnd.twilio.client --account=ACXXXXX --token=YYYYY --app=APZZZZZ
Use ctrl-C to terminate the client application.
Comparison
Here, we attempt to compare our two approaches for command line client: SIP and RTMP.
The audio codec used are G.711 in SIP and Speex in RTMP. Thus bandwidth requirement is more for SIP, but can theoretically give better quality, e.g., for real-time transcription. It may be possible to send G.711 in RTMP to server, but is not clear how to force the server to send back G.711 in RTMP. The media path is over UDP (RTP) for SIP vs. TCP for RTMP - making RTMP one more suseptible to network issues such as latency in interactive conversations. While both these are voice only at this time, using WebRTC or Mobile SDK based approach may allow video pipe. The py-webrtc project may become useful for this, once it is completed.
The SIP approach has provider supplied documentation on how to do incoming calls, but the RTMP approach requires more work to figure that out. The SIP approach seems to target server-to-server call flows, e.g., to connect your soft PBX to provider service. Due to router interference of SIP messages, it may not work in all the cases or all the time from a client machine. On the other hand, the RTMP approach is inspired by the client SDK, and hence suitable for clients. However, Flash-based client has been deprecated by the provider, and the corresponding service may no longer be available in near future. Moreover, RTMP is kind of an obsolete technology. Using WebRTC or Mobile SDK approach will work better in that case. Once SIP registration becomes available on the provider service, a standard command line SIP endpoint [6] should be enough for send/receive of calls.
References
- Twilio client: embed voice conversation in Web and Mobile Apps, https://www.twilio.com/client
- Twilio Labs: place a Twilio call from the shell, https://www.twilio.com/labs/bash
- Programmable voice SIP, https://www.twilio.com/docs/api/twilio-sip
- Can I configure my soft phone to work with the Twilio SIP endpoint, https://www.twilio.com/help/faq/voice/can-i-configure-my-soft-phone-to-work-with-the-twilio-sip-endpoint
- Rtclite: light weight implementations of real-time communication protocols and applications in Python, https://github.com/theintencity/rtclite
- Command line SIP endpoint, https://github.com/theintencity/rtclite/blob/master/rtclite/app/sip/caller.md
- Command line Twilio client, https://github.com/theintencity/rtclite/blob/master/rtclite/vnd/twilio/client.py
How to make phone calls from command line?
- dialing out a phone number from command line,
- performing automated VoIP system tests,
- showing quick demos of communication systems, or
- experimenting with media processing on the voice path, e.g., for speech recognition, recording or text-to-speech.
Click here to see the full description of this SIP endpoint written in Python [6].
The project page shows how to use the SIP endpoint to send/receive instant messages and voice call. The command line options allow you to configure various attributes such as whether to register with a SIP server, how to respond to incoming requests, how to work around SIP entities that incorrectly handle CRLF for line endings, how to test signaling without media path, and so on. The project page also describes how to use the module in your own Python project.Following video demo of the SIP endpoint shows interoperability with X-lite terminal and dialing out toll free numbers using a VoIP provider. It shows how to dial a phone number using a VoIP provider, how to send DTMF digits from terminal, and some experimental features such as text-to-speech and speech recognition. Do not forget to watch in full screen!
References
- Rtclite: light weight implementations of real-time communication protocols and applications in Python, http://www.rtclite.com, https://github.com/theintencity/rtclite
- Sipua: a SIP test user agent for Solaris, Linux, FreeBSD and Windows NT, http://www1.cs.columbia.edu/irt/cinema/doc/sipua.html
- SIPp: free open source test tool/traffic generator for SIP, http://sipp.sourceforge.net/
- Pjsua: open source command line SIP user agent (softphone), http://www.pjsip.org/pjsua.htm
- sipcmd: the command line SIP/H.323/RTP softphone, https://github.com/tmakkonen/sipcmd
- caller: SIP application to initiate or receive VoIP calls from command line. https://github.com/theintencity/rtclite/blob/master/rtclite/app/sip/caller.md
Introducing rtclite
"Light-weight implementations of real-time communication protocols and application in Python


My WebRTC related papers from 2015
This post continues from the previous one and gives an overview of my ongoing research on web-based multimedia communication or WebRTC. I wrote five published co-authored research papers last year answering these questions - How do we solve some of the enterprise challenges of WebRTC using browser extensions? How do we address cloud challenges of a multi-services and apps vendor? How do we create pure web-based enterprise communication and collaboration system without depending on legacy protocols? How do we do user reachability in a multi-apps environment created by WebRTC? And, how do we do write-once-run-anywhere for WebRTC based team apps using a cross platform tool? Read on if one or more of these questions interest you...
Enterprise WebRTC powered by browser extensions
Traversing WebRTC flows created by external third-party websites across restricted enterprise firewalls is challenging. There are other challenges in adopting WebRTC in enterprises, e.g., how to integrate it seamlessly with existing communication equipments, or how to enforce enterprise policies such as call recording of WebRTC flows on third-party websites? We show how to use browser extensions to solve these problems. This systems paper is based on my implementations of two interesting projects.
“We use browser extensions to solve two important issues in adopting WebRTC (Web Real-Time Communications) in enterprises: how to integrate WebRTC-centric communication with existing systems such as corporate directories, communication infrastructure and intranet websites, and how to traverse media paths across enterprise firewalls. Vclick is a simple and easy to use web-based video collaboration application that enables click-to-call from other webpages. SecureEdge is a network border traversal system for policy and security enforcement, and consists of a secure media relay that sits at the network border or in the cloud. A browser extension in the enterprise user’s device transparently injects this media relay in every WebRTC media path needing to traverse the enterprise network edge to enable authenticated border traversal without help from the websites hosting the WebRTC pages. We attempt to generically support WebRTC in enterprises on a variety of application scenarios instead of creating another fragmented communication island. The challenges faced and techniques used in our proof-of-concepts are likely extensible to other enterprise WebRTC scenarios using the emerging HTML5 technologies.”
Keywords: WebRTC, enterprise communication, secure edge, browser extension, VoIP, video call, firewall traversal, media relay.
ALICE: Avaya Labs Innovations Cloud Engagement
Although we know how to create cloud-hosted services, platforms and infrastructures, little is known about cloud hosted communication and collaboration services, especially to enable multi-tenancy and self-service. This research focuses on the challenges of hosting cloud services for customer trials, where resources are limited to make existing services cloud ready or to fit a specific platform. This is based on my work on creating a cloud portal to host research-oriented early or pre-product services on the cloud, and identifying common themes and techniques.
“We present the architecture and implementation of our enterprise cloud portal named ALICE, Avaya Labs Innovations Cloud Engagement, which provides self-service access to service developers, tenants, and users to various communication and collaboration applications. Currently ALICE is used for field testing of advanced research prototype services based on technologies such as WebRTC and HTML5. This paper describes the current portal and extensions to support multi-tenancy.
We describe challenges in creating a self-service multi-tenant SaaS (software-as-a-service) portal to host communications and collaboration applications for small to medium scale businesses. The challenges faced and the techniques used in our architecture relate to security, provisioning, management, complexity, cost savings and multi-tenancy, and are applicable and useful to other cloud deployments of diverse enterprise applications.”
Keywords: Cloud, system architecture, portal, multi-tenancy, Internet telephony, enterprise communications, web collaboration.
Vclick: endpoint driven enterprise WebRTC
One of my earliest project at Avaya Labs was on creating a light-weight service for wide range of web communication and collaboration scenarios. Vclick is a collection of many loosely coupled apps that run the app-logic in the browser or endpoint, and mash up at the data level. It contains applications for video call, conferencing, video presence, text chat, click-to-call, screen sharing, shared notepad and whiteboard, and so on. It goes against the conventional web wisdom of thin-client, single-page-apps, or rigid GUI, and presents a new software architecture to create robust endpoint driven apps. The paper is really about how to keep the endpoints smart and network (or service) dumb in the context of collaboration applications.
“We present a robust, scalable and secure system architecture for web-based multimedia collaboration that keeps the application logic in the endpoint browser. Vclick is a simple and easy-to-use application for video interaction, collaboration and presence using HTML5 technologies including WebRTC (Web Real Time Communication), and is independent of legacy Voice-over-IP systems. Since its conception in early 2013, it has received many positive feedbacks, undergone improvements, and has been used in many enterprise communications research projects both in the cloud and on premise, on desktop as well as mobile. The techniques used and the challenges faced are useful to other emerging WebRTC applications.”
Keywords: WebRTC, enterprise communication, web video conferencing, resource-based architecture, web applications.
User reachability in multi-apps environment
With numerous "walled-garden" services and apps emerging because of WebRTC, there is a need to identify the best way to reach your contacts, irrespective of which app or service she is on. This systems paper describes my work on implementing a mobile (and desktop) app called Strata Top9, to quickly reach your important contacts. It really is a front-end to launch other applications. Unlike existing presence based systems, we propose to iterate during call initiation. The paper presents the software architecture and design decisions along with several motivational use cases of our project. It also details the concept of dynamic contacts, and endpoint driven caller and receiver policies.
“Recent progress in web real-time communication (WebRTC) promotes multi-apps environment by creating islands of communication apps where users of one website or service cannot easily communicate with those of another. We describe the architecture and implementation of a multi-platform system to do user reachability in multiple communication services where users decide how they want to be reached on multiple apps, e.g., in an organization that has voice-over-IP, web conferencing and messaging from different vendors. Our architecture separates the user contacts from reachability apps, supports user and endpoint driven reachability policies, and has several independent and non-interoperable WebRTC-based apps for two-way and multiparty multimedia communication. Our flexible implementation can be used for enterprise or personal communications, or as a white-labeled app for consumers of a business.”
Keywords: system design; mobile app; user reachability; multiservices; VoIP; WebRTC; caller policy.
Developing WebRTC-based team apps with a cross-platform mobile framework
Ability to write-once-run-anywhere still eludes many app developers. Luckily several cross-platform development tools exist. However, creating cross-platform communication and collaboration related apps is still challenging. This paper presents my implementation work on creating cross platform apps. In particular, four types of platforms - web app on PC and mobile, and installed app on PC and mobile - are considered, and seven different apps are covered for a wide range of enterprise use cases. Techniques and steps for creating such cross platform apps are presented along with lessons learned based on practical experience. Additionally, considerations for iOS and wearable Glass devices are presented.
“We present lessons learned in developing cross platform multi-party team applications. Our apps include a range of communication and collaboration scenarios: document and content sharing in a team space, an agent-based meeting helper, phone number dialer via a voice-over-IP (VoIP) gateway, and multi-party call in peer-to-peer or client-server mode. We use web real-time communication (WebRTC) to enable the audio and video media paths in the apps. We use frameworks such as Chrome Apps and Apache Cordova to create apps that can be accessed from a browser, or installed on a desktop, mobile device, or wearable. The challenges and techniques described in our paper related to audio, video, network, power conservation and security are important to other developers building cross-platform apps involving WebRTC, VoIP and cloud services.”
Keywords: HTML5, Apache Cordova, Chrome Apps, WebRTC, Mobile, Cloud, Wearable.
WebRTC vs. SIP/SDP
This article presents my point of view, hopefully unbiased, on this topic. Let me start by saying that SIP and WebRTC are different, and it is not fair to compare them without an established context.
Complex protocol vs. simple API
SIP is a protocol, not an API; whereas WebRTC is an API, with an associated set of protocols.Consider that TCP is a protocol but socket is an API. TCP has complex state machinery to enable reliable bi-directional end-to-end packet flow assuming that intermediate routers and networks can have problems but end to end reliability is assured. On the other hand, the socket API hides all the complexity of TCP and provides an easy to use abstraction. Luckily in the case of TCP, developers only work on top of the standardized socket API. Whereas in the case of SIP, usually they end up dealing with the protocol directly, or work with (partly broken) implementations of intermediate network element such as a SIP proxy, or use (poorly defined) APIs from third-party SIP libraries. Depending on what level of abstraction is exposed in the SIP library, the development effort can range from very simple to very complex. In the case of WebRTC, the API is being defined and standardized, and is intended as a high level abstraction suitable for JavaScript/Web developers.
To list the differences between SIP and WebRTC - one could say that SIP is programming language agnostic but WebRTC is not, and must use JavaScript, or that SIP application could run on any device or platform, but a WebRTC application requires a browser to host the app. These are superficial differences - due to the core difference in the nature of SIP vs. WebRTC, i.e., protocol vs. API.
SIP would have been perceived as much simpler if there were well defined APIs from the start. But that would have limited the flexibility (and freedom) of what SIP could be used for. More on the flexibility and freedom is discussed later in this article.
One thing to note, however, is that WebRTC does use a set of complex protocols behind the scenes, many of which are shared by a SIP system, as we describe next. Although a web application using WebRTC is simple due to the high level API, the implementation of WebRTC in a browser itself is not that simple!
Protocol stack
SIP is an established protocol for Internet communication. WebRTC is an emerging API intended to be provided by the browsers and consumed by the applications in JavaScript. It has an associated set of protocols, some of which overlap with those used in a SIP system. The title of this article indicates a comparison of WebRTC with SIP/SDP, and not just with SIP. This is because SIP is often only a part of the overall puzzle in a real world communication system, as shown in the following diagram taken from my research paper [1]. In fact, the role that SIP plays in a real system is outside the scope of the WebRTC specification currently. Which means that it does not really make sense to compare SIP alone with WebRTC protocols, but it is okay to compare a SIP system with that based on WebRTC.![]() |
| Comparison of typical SIP vs WebRTC application stack. The dotted line separates what is programmed by the application developer and what is provided by the platform. |
Network elements
The SIP specification defines the behavior of an endpoint (or user agent) as well as various network elements such as proxies and redirect servers. Other SIP extensions define more network elements such as a conference server, a presence server, and so on. With WebRTC, the focus is only on browser-to-browser communication, and only the browser's behavior is defined in the specification. Other elements such as a gateway or media server are free to implement the behavior. However, that part is likely not going to be standardized. This hopefully limits the number of specifications related to WebRTC, and hence limits the resulting complexity.This does not mean that WebRTC systems will not need such network elements. Ability to call a phone number via a gateway, or to host a multi-party voice mixing or recording on a media server, or to locate the target user via a lookup service are all examples of crucial services required in many real world applications. Lack of standardization means that every vendor will likely define its own version of these elements, creating fragmentation. Such scenarios are actually inline with WebRTC's goal which aims at creating media path among browsers visiting the same website, and does not care for inter-website communication. More on interoperability vs. fragmentation is discussed later in this article.
Fortunately, it is entirely possible to create fully functional WebRTC systems without using complex or heavy weight network elements. In fact, many initial demonstrations of WebRTC applications have followed this path with peer-to-peer media flows, along with a very simple signaling/rendezvous service. SIP, too, started out with this model of peer-to-peer media flows with a simple signaling/rendezvous service in the form of a SIP proxy or redirect server. However, practical limitations and real-world deployments have mostly thrown this idealistic model out of the window, and have largely adopted a "managed" network centric model with heavy weight network elements in the form of back-to-back user agents, session border controllers, and call stateful proxies. It is likely that the similar real world forces may cause a similar network centric deployment model in the case of WebRTC. Only time will tell!
Message flow
To establish a multimedia call, three pieces of information are exchanged between the two parties - (1) an "invite" from caller to callee as an intention to establish a call, which the callee may answer or decline, or ignore, (2) session description containing capabilities of each party, (3) transport candidates for establishing connectivity on the media path, hopefully peer-to-peer. Let us call these steps as (1) "invite", (2) session description exchange and (3) transport candidates exchange.In a WebRTC-based application, these may be separated out into different phases or may be combined together, as determined the web application. In fact, (1) above is not part of the specification, but is often implemented in a WebRTC call, and (2) and (3) are exchanged between the two parties using an out-of-band channel outside the scope of the WebRTC specification. The term "trickle ICE" refers to separating (3) from (2) in a WebRTC session negotiation, and is the default mode - although it is possible to not use it, as determined by the web application. In a SIP system, all three pieces are usually bundled in a single request-response transaction and are carried in a SIP INVITE and it's 2xx response between the two parties, often via one or more intermediate proxies.
The separation of these pieces of information in WebRTC makes it more flexible, e.g., an application can invent any kind of user lookup service over any protocol (HTTP POST, WebSocket, XMPP, Google Channel, LDAP, hard-coded, or whatever, or even SIP [1]). Some applications may not even need such a lookup service, e.g., a virtual presence application where everyone who visits the website is able to see and hear others on that site. Unlike this, in SIP, a user lookup service must be accessible via SIP.
For those who remember H.323, and its multiple steps of call setup in version 1, may argue that we have come back a full circle - started with multiple steps call setup, invited H.323 fast connect and SIP for single step call setup, and now back to multiple steps to establish a WebRTC call. However the situations are different in H.323 vs. WebRTC - in H.323 all the steps were done by the same entity, and after the initial "invite" step, the subsequent ones were follow through, with no perceived benefit to keep them separate. In the case of WebRTC, the "invite" step is outside the scope of WebRTC specification, hence it makes sense to keep it separate. The last transport candidates exchange step is incremental with real benefit to keep it separate from session description exchange, so that better peer-to-peer flows can be detected in future, which first peer-to-peer flow that works gets in use as soon as possible. In the case of H.323, there is no equivalent of this step in my opinion, because the second and third steps are about exchanging media capabilities, and then establishing fixed set of logical channels based on those media capabilities.
The SIP specification and many of its extensions, such as call transfer, interoperability with phone, or capabilities of user agents, actually deal with only the first piece of information mentioned above, i.e., how to reach the target user or device, and are often related to the notion of a "call". In WebRTC specification, there is no notion of a "call", or signaling to lookup the target user. It is assumed that such constructs and concepts belong to the web application or the website that is hosting the application. This gives flexibility to the application developer on how to implement advanced features such as conferencing or call transfer. With more flexibility, comes more power, and more non-interoperability, described next.
Interoperability vs. fragmentation
In the past fifteen years, non-interoperability has been a major problem in the SIP community. Although it is often easy to achieve interoperable voice call signaling, it is rather difficult to make sure that those hundreds or thousands of SIP systems out there will implement every little detail consistently as per the specification. Moreover, there is no mandatory codec defined by SIP, which means that two perfectly compliant SIP endpoints may not talk to each other.With WebRTC, you only need to achieve interoperability among the major browsers. WebRTC easily enables communication between users visiting the same website, where the application fills the void of signaling, user lookup, and the likes. Since signaling is outside the scope of the specification, interoperability between users visiting separate websites is not readily available and requires one-to-one federation between the two websites or web application. This fragmented communication behavior is inline with how existing web applications behave, i.e., they tend to create islands of users, where users within an island can talk to each other, but not beyond. Have you ever thought of being able to reach your Linked-In contact from your Facebook page? I have, and have tried to do something about it [8][9].
The fundamental difference of triangular vs trapezoidal call model in WebRTC vs SIP, respectively -- which means the two parties are connected to the same server in WebRTC, whereas two parties can potentially be on separate servers in SIP -- results in major differences in interoperability requirements.
The triangular way of WebRTC is not how VoIP community perceived open communication to be. From that perspective, even though WebRTC is an emerging open standard, it will created closed walled gardens of communication islands - leading to communication fragmentation. Unfortunately, even in SIP-based systems, due to non-standard behaviors in implementations or different implementation choices (e.g., codecs) by different vendors, it tends to create fragmented islands, where an equipment from one vendor rarely talks freely with that of another. This tends to create an ecosystem, where a customer must purchase phone endpoints and servers from the same vendor, because it is often hard to achieve true mix-and-match interoperability out-of-the-box.
The problem is more relaxed in the case of WebRTC, because there are only a few major browsers, and the protocol specification does not involve any signaling or media server, except for generic STUN and TURN servers, when appropriate. A fair comparison would have been possible if there were hundreds or thousands of browsers implementing WebRTC, similar to the plethora of SIP implementations that exist today. Or if there were only five major vendors creating SIP user agents. With SIP, the protocol was invented first, followed by an attempt to deploy the user agents and servers. With WebRTC, we already had user agents (browsers) and servers (web servers, STUN/TURN servers), and we put together a set of protocols to let them talk in real-time.
By leaving signaling outside the scope, a WebRTC specification escapes many problems related to non-interoperability. In practice, fragmentation seems to be unavoidable, largely due to business reasons -- or lack of incentive for one website to let its users talk to that on another website. The benefit of better interoperability, maintainability and usability far out weighs the cost of fragmentation for many websites and web applications. For some applications that require cross-site communication, a gateway can do the job, without being part of the specification.
The biggest non-interoperability and fragmentation threat to WebRTC is that some major browser vendors may not implement it altogether, or may implement a variant of the specification, making it non-interoperable with other browsers. Hopefully, market pressure will stabilize the interoperability among major browser. Or developers will find alternatives (plugins? gateways?) to fill the gap. Again, only time will tell!
Flexibility and freedom
Keeping the signaling part out of scope and defining the API give good application level flexibility to WebRTC systems. As mentioned before, features such as call transfer, conferencing or user lookup are application's responsibility. This allows a social website to implement these features different from a click-to-call provider's website, for example. Given the numerous websites, and their potential for incorporating real-time in-browser communication, such flexibility is very useful and highly desired.
Protocol differences in SIP vs WebRTC system
Both SIP and WebRTC systems use certain common set of protocols, e.g., SDP for session description, ICE for NAT traversal, and (S)RTP for media transport. The following chart from my paper [1] summarizes the interoperability differences.![]() |
| Interoperability differences in SIP vs WebRTC; (o) means optional. |
Frequently asked questions (FAQ)
- Are SIP and WebRTC competing technologies? Yes and no. It depends on the context. As an access protocol to connect a client to an established voice/video service or infrastructure, yes. For anything else, most likely no.
- Is WebRTC a subset or refinement of SIP? No. WebRTC has certain elements and features that are not present in SIP/SDP. For example peer-to-peer data channel, or trickle ICE, or programming APIs.
- Is WebRTC a superset or generic case of SIP? No. A SIP-based system includes elements and features that are not present in a WebRTC browser, e.g., well defined network elements, call event notifications, or user lookup service.
- Is WebRTC similar to P2P-SIP? or Jingle? or RTMFP? or name-your-favorite-here? No, for all. In theory, WebRTC has some overlap with session description and media path of a standard SIP system. In practice, WebRTC is very different from any of the existing systems we had.
- Does SIP make WebRTC better? What about the other way? Depends on who you talk to. A SIP proponent would argue that cross-site communication and PSTN interoperability requires using SIP at the core, or that SIP in JavaScript is good enough for signaling of WebRTC. A WebRTC proponent would argue that replacing SIP with WebRTC will improve on security, interoperability and call quality. In my opinion, these arguments are either too narrowly focussed or a compelling counter argument can easily be made. So my answer is "no" on both counts.
- Can SIP and WebRTC work together, instead of competing? Absolutely. Firstly, they do not compete except in a very narrow context. And secondly, they do work together - the SDP generated by WebRTC can be carried in SIP, directly from the browser or via a gateway. See [1] for alternatives.
- Will WebRTC remain at the end, and must SIP be at the core? No. It is entirely possible to create WebRTC capable network elements such as media servers and gateways without depending on SIP, that allow creating voice/video infrastructure with WebRTC at the core. It also possible to create many existing use cases with peer-to-peer media path of WebRTC, without using these network elements at all.
- Does WebRTC have better voice quality than SIP? No. Because, SIP does not specific fixed set of voice codecs or voice quality. At present, when many SIP systems use G.711, G.729 or Speex, the voice quality of Opus used in WebRTC is superior to those SIP systems. However, nobody has stopped those SIP systems from using Opus (and related media engine) in their voice stack.
- Will WebRTC improve voice quality of my PC-to-phone application? No. Unless you are talking about smart-phone with WebRTC running on phone also. When voice path gets translated to a phone call, the quality will likely be reduced to G.711 or G.729 or whatever the gateway provider has deemed suitable on the phone network.
- Will WebRTC give better voice quality for my conferencing application? Depends. See previous question. For centralized media path (with audio mixer), it requires transcoding, so voice quality will likely suffer. Optimization such as using peer-to-peer voice path, or reflecting loudest-N streams to client without mixing may preserve the end-to-end voice quality in conferencing.
- Why does browser not implement SIP directly? For browser-to-browser use case, SIP is not needed. For anything else where one end is a browser, a gateway can do the job of translating to SIP if needed. And WebRTC is focussed on browser-to-browser use case only. Having said that, it is possible to do SIP in JavaScript [1] running in your browser, or to implement your own plugin or browser with built-in SIP stack and related APIs.
- What does it mean that a SIP conference server supports WebRTC? It could mean that it supports a browser end-point to participate in a multi-party conference over WebRTC, or it could mean that it uses Google's WebRTC media engine or the OPUS/VP8 codec for the voice/video stack but requires SIP/H.323 endpoints to connect. Be careful!
- Is WebRTC better than SIP/SDP? Hard to say. WebRTC attempts to fill the gap for a very specific problem, but requires robust, secure and high quality implementation, and defines simple high level API. On the other hand, SIP leaves many such things open to implementers. So for the specific problem, WebRTC is better, some might say.
- Between WebRTC and SIP, who will win the racing? There is no racing. Or if there is one, then WebRTC and SIP are racing in different directions, at 90 degrees to each other. Depending on where you look from, or which axis has the finish line, either of them could win. But the truth is, WebRTC is for browser-to-browser communication, where SIP/SDP/RTP is not feasible currently -- so there is no racing. One could argue that with more number of browsers supporting WebRTC than number of SIP endpoints, or with more number of JavaScript developers than number of SIP implementors, WebRTC is likely to win the racing. No comments on that. A more correct analogy of racing, in my opinion, is in specific contexts, e.g., between WebRTC of browser and RTMP+ RTMFP of Flash Player for browser-to-browser communication, or between SIP phone vs. WebRTC browser for client access to backend voice/video service or infrastructure.
- What is the real difference between WebRTC and SIP? Ah, glad you asked! If you have already read the rest of this article, you will be surprised to find that unlike all the differences mentioned before, the real difference is who the most important customers of WebRTC vs. SIP are - for WebRTC it is the web developers community, and for SIP it has been telecom providers and equipment vendors.
Conclusions
This article has attempted to present several factors comparing and contrasting SIP/SDP vs. WebRTC systems, without really diving in to the specific protocol differences of how SDP/RTP are used in SIP vs. WebRTC. This is because, while WebRTC specifies the list of profiles and extensions to SDP/RTP, the SIP system is free to choose whichever it likes. So any such comparison will end up being one between WebRTC and SIP of vendor-X.There are similarities and differences between the two types of systems. In the end it is about the applications and user experiences, rather than protocols and APIs. In practice, it is easy for an established SIP-based IVR system to add WebRTC access, compared to that for a newbie WebRTC system to implement full fledged IVR functionality. The same holds true for many existing telecom, web communication and Internet conferencing applications. Thus, many existing SIP-based applications will likely also adopt WebRTC for access, if there is a demand. And similarly, new pure WebRTC-based web applications will likely incorporate SIP gateway-ing to reach out to non-browser users, if there is demand. SIP or WebRTC itself is just a small piece of the puzzle in the overall system or application. Nevertheless, WebRTC attempts to fill a void that has existed for a really long time in the SIP-based communication world, for which past attempts such as using plugins have largely been unsuccessful, especially on mobile platform.
About Bellheads, Netheads and Webheads
I believe, most people of my generation belong to the Netheads family, even though their employers may be Bellheads. From [1] "This philosophical divide should sound familiar. It's the difference between 19th-century scientists who believed in a clockwork universe, mechanistic and predictable, and those contemporary scientists who see everything as an ecosystem, riddled with complexity. It's the difference between central and distributed control, between mainframes and workstations."
More recently a new generation is evolving, called Webheads. (Please do not google for this term, or you will be misled). For a Bellhead, the control must be strict and centralized, with managed services. For a Nethead, the control, if any, is distributed in the end-point, with layered applications on top of the basic best effort infrastructure. For a Webhead, everything must run or be accessible from a browser.
"Weird!" - some might say.
"How is it even in the same league?" - others might ask.
Here is my attempt to present the distinguishing attributes.
| Bellhead | Nethead | Webhead | |
|---|---|---|---|
| Evolution | Initially there were Bellheads. They were happy living in their islands, working hard towards what the island demanded, and created applications and services for the island. | Then Netheads evolved, and separated applications from the grasp of the services. And interconnected the islands. They created Internet. They created the world-wide-web. They created SIP. And the list goes on. | Then Webheads emerged, and they started moving applications to the web. Slowly and steadily. First, it was email to webmail. Then documents on web editor. Then VoIP to web telephony. Now it looks like they have spread every where and practically moved all kinds of applications to the web. |
| Control | Believe in centralized control, like one super-power to rule them all. The central element has enough gravity to pull many applications and services, resulting is heavy weight servers but thin clients. | Believe in distributed control or no-control, resulting in chaos at times. The fluidity of the services and applications keeps them decoupled, or loosely connected with each other, spreading their wings to the peripherals. | Believe in centralized control in the form of the website on which their application is hosted, but often accept the fact that there can be numerous such control elements. Work hard to avoid any connectivity or information leak to other controls, outside their own. |
| Robustness | Everything hardware must be 99.999% reliable. Overall system must be 99.9% reliable. If system goes down, even if nobody is using it right now, they will lose business. Things have strict guarantees. | The system works most of the time, but fails sometimes, but no one person or entity may be blamed, because it is collective responsibility. No guarantees, just best effort service, whatever... whatever works well | The system works if it does, and does not if it does not. They work hard to keep the system running. But many systems are poorly designed, and often fail on sudden popularity of the website. Often blame for the failure is put on the network, and many times on the database. |
| Example | To add video communication on top of voice calls is a massive renovation project involving upgrading existing services, making the pipes bigger to carry video, incorporating fixed bit rate video codecs to guarantee quality, and upgrading terminal devices to support video. It must work all the time from day one. | To add video on top of a voice call client-server application, is just changing the application to use the desktop webcam and display, and pushing new video packet along side voice, without changing any other infrastructure element in most cases. | To add video on top of a voice call web page, there is a redesign of the website to accommodate the visual elements, so that everyone viewing the same web page can see and hear each other. Regarding the codecs and bandwidth, delegate that to Flash or WebRTC. |
In some ways, like Bellheads, Webheads also think of centralized control and management. The hosting website is the controller of an application and keeper of all the user data that could be generated on that website. In a way, they both create islands of innovations which do not communicate with each other very easily, except occasionally via rowboats, i.e., the likes of oauth and web APIs. Like Netheads, Webheads do not care about lower layers of communication, as long as the browsers can reach their web servers. Many things that are well studied in the Netheads community, e.g., socket programming, voice and video communication, or binary data processing, are only recently being introduced to Webheads, e.g., in the form of WebSocket, WebRTC or ArrayBuffer. If Bellheads are purists, then Netheads are hackers. And compared to them, Webheads are hackers++.
[Heard elsewhere...]
Bellhead: "We invented video communication two decades ago - from H.320, H.323/Q.931/H.225/H.245, H.324, H.325.. (15 more). It works on ISDN, LAN, ATM, MPLS, POTS, 802.1x, ... (10 more). There is wide variety of codecs including G.711, G.722, G.723, G.729, H.261, H.263, H.264, ... (20 more). We just finalized our first $100k sale of equipment to whats-it-called company for their intra-office video telepresence."
Nethead: "We re-invented everything you did, and we can quickly build endpoint terminals that can do two party video call to large scale video broadcast. Everyone is now using what we created. We even sent our engineers to your companies to add our protocols to your equipments. Did you know that your equipment also supports SIP for video call? You may not like it, but our protocols are everywhere, and better than yours."
Webhead: "Not sure what you guys are talking about, but just got a tweet from my IT guy that our click-to-video-call app reached 1 million unique users. We are going to hire a new CEO to monetize our app now."
For Webheads, focus is not about services or protocols, but only about apps and users.
Spaghetti, ravioli and lasagna in software
As I was walking down the street, down the street, down the street. A very good friend I chanced to meet... And we went to this italiano restauranto around the corner. Here is what we talked about. [blink blink, screen turns black and white, time goes back...]
"Have you heard of spaghetti code?"
"Of course, I write it all the time. It is easy. I think of a problem, and start writing code to solve it. And do not stop until either the code is complete or my laptop battery runs out. And yes, I don't delete any line of code that I write. Eventually I end up with like two thousand lines of code in one giant function, and like two hundred control statements. It is fast, furious and fun. But tastes awful the next day."
"Really?"
"Yes. So, what kind of code do you write?"
"Well, I have tried my hands on ravioli, lasagna, and more recently farfalle and penne."
"I know ravioli and lasagna - just too modular or too many layers for my taste - I get lost in which piece or layer I am working on, and end up making a mess, embarrassing myself. What about farfalle and penne?"
"Oh, farfalle is what I create when I refactor my correctly working software to smaller pieces, to such an extent that each piece looks broken and half complete."
"What is penne?"
"It is when I refactor it right, and pieces look good. Not too big, not too small. And no information hiding and stuff like that. Frankly, I don't really like information hiding - it is more hype than hoopla. I like data oriented programming - software that is independent of the data - or pasta that is independent of the meat."
[waitress comes by, and asks...]
"I heard pasta, so assume you are ready. What would you two gentlemen have?"
"I will have the three-pasta house combo, with lasagna at the core, the first layer filled with penne, and the second layer with cheese ravioli. Spare the meat. Mix tomato, pesto and alfredo sauce and spread it all over."
[waitress makes a face, and turns to the other...]
"I like spaghetti, but I will also have what my esteemed friend asked for. Additionally, can you open all the layers and separate them, placing side by side, and break all the raviolis to remove the cheese and put it on the side too. I like to keep my pasta separate from other ingredients."
Irony in Internet Communications
- first, we build Internet Protocol over phone network (modem), and then we build a phone system over Internet Protocol (VoIP),
- first, we build a phone system to run on Internet (VoIP), and then drive the internet from our phones (smart phones),
- first, we build open communication on top of monopolies (Internet over PSTN), and then build closed communication on top of that to promote monopolies (Facetime, Skype, or you-name-it-here).
NetConnection or PeerConnection?
Background
In Flash Player + ActionScript, four categories of objects (or classes) are needed to build a communicating web application such as a video chat. First, you need a connection to the server which may be a media server processing the media path in a client server approach or just a rendezvous server for establishing a peer-to-peer media path in the peer-to-peer approach. Second, you create named streams, for publishing and for playing. Third, you attach the publish side stream with your capture devices for audio and/or video. And fourth, you attach the play side stream with display or playback of remote media. The corresponding classes for these are: NetConnection, NetStream, Camera and/or Microphone, and Video or VideoDisplay [1] (pp.16-18). The diagram below shows a two party video call between our regular actors, Alice and Bob, each creating a named stream to publish, and each playing the other party's named stream to display. On each terminal, a NetConnection object is used to create two named NetStream objects, one is attached to Camera and Microphone and another to a Video or VideoDisplay element. Additionally, for displaying local preview, the Camera object is also attached directly to a Video or VideoDisplay.In WebRTC capable browser + Javascript, again, four categories of objects (or classes) are needed. First, a LocalMediaStream created via the getUserMedia function to represent the capture stream from the user's camera and microphone devices. Second, an PeerConnection (actually called RTCPeerConnection) object is created to establish a peer-to-peer path between the two browser instances. Third, the LocalMediaStream object is attached (or added) to the PeerConnection on one end, and appears as remote MediaStream object at the other end. Fourth, this remote MediaStream is attached to an audio or video element to playback or display the received media. The LocalMediaStream is also attached to another video element to display local preview. Unlike publish vs. play mode of NetStream, I put the LocalMediaStream vs. remote MediaStream into separate categories, due to the way they are created and what they represent, even though LocalMediaStream is also a MediaStream.
For comparison purpose, let us call these Flash vs WebRTC set of APIs as NetConnection vs PeerConnection.
Level of abstraction: named stream vs. offer-answer
Clearly NetConnection presents a higher level of abstraction than the PeerConnection. In PeerConnection, each media path is explicitly created via a new RTCPeerConnection object on each side. It uses offer-answer model - where one side sends a session offer with its capabilities and the other side responds back with an answer with its side's capabilities. The offer-answer messages are exchanged between the two browsers outside the WebRTC APIs. In case of NetConnection, when a participant publishes a named stream (MediaStream), zero or more other participants may play that named stream. The implementation of these connection and stream classes hide the details of the peer-to-peer or client-server media path establishment. Internally, some variation of offer-answer and transport candidate messages are exchanged between the Flash Player instances, but are not exposed to the programming interface - thus making it a higher level of abstraction.Topology creation: client-server vs. peer-to-peer
A NetConnection is a client-server abstraction where as a PeerConnection is a peer-to-peer path abstraction. A NetConnection object can be used to create client-server as well as peer-to-peer media paths, whereas PeerConnection is targeted only towards a peer-to-peer media path. To further clarify - a NetConnection is between a client and a server, which may be using RTMP in which case the server is a media server for the client-server media path, or may be RTMFP in which case the server may just be a rendezvous server to establish peer-to-peer MediaStreams between the two browsers. Note that with RTMFP it is also possible to do client-server media path. Additionally, it also allows application level multicast for use cases such as large event broadcast. With PeerConnection, the media path is intended to be peer-to-peer, but it is also possible to implement PeerConnection at a server (outside the browser), in which case it gets transparently used to create a client-server media path because the server behaves as another browser endpoint of the WebRTC media path. With WebRTC, once MediaStream proxying is implemented in the browser, it may be possible, though not trivial, to create application level multicast for large group event broadcast. Proxying a media stream means that a received remote MediaStream from one peer may be attached (or attached) to another PeerConnection to another peer, thus proxying the media path via this peer.Signaling channel: built-in vs. external
As explained in the previous points, NetConnection is the signaling channel to establish the client-server or peer-to-peer MediaStream objects. Due to the built-in nature of the client-server signaling channel, it is non-trivial for third-party vendors to build such a signaling server without implementing the full RTMP and/or RTMFP protocol. On the other hand, a PeerConnection generates the offer-answer and other signaling messages, which must be exchanged via another channel between the two browser instances, e.g., via a notification service over WebSocket, by using Ajax, or over other third-party channels. Such notification service does not need to understand the details of the WebRTC protocol. This gives more flexibility and freedom to the developers on how the signaling channel for a particular communicating web application is implemented.Server dependent vs. independent
Summary and further reading
In summary, the higher level NetConnection-style interface with built-in signaling channel makes it easier for programmers to create applications, ranging from two-party video calls to multi-party conferences to large scale event broadcasts. On the other hand, greater flexibility (and freedom) of the PeerConnection-style interface allows building wide range of system architectures using third-party components such as notification servers, media servers, media gateways and media relays.In my past research on flash-videoio [2] and aRtisy [3], I have proposed another higher level abstraction for the programming APIs for communicating web applications. In fact [3] describes how to implement the NetConnection-style interface on top of WebRTC's PeerConnection-style interface and using an external resource server to provide the notification service. It extends the named stream model to named resource model to implement several other applications such as text chat or shared whiteboard or notepad. Then, [2] and [3] describe how to build a widget-oriented interface which consists of a single video-io widget used in either publish or play mode to create a wide range of communicating web applications such as two-party calls, multi-party conferences and video presence for Flash and WebRTC, respectively.
My WebRTC related papers from 2013
A Case for SIP in JavaScript
Taking on WebRTC in an enterprise
Building communicating web applications leveraging endpoints and cloud resource service
Private overlay of enterprise social data and interactions in the public web context
The dangers to the promise of WebRTC
The promise
The technology promises the web developers a plugin-free cross-browser standard that will allow them to create exciting new audio/video communication applications and bring them to the millions (or even billions!) of web users on the desktop and mobile platform.The rest of the article assumes these premises:
- A technology is easy to invent but hard to change.
- Business motivation drives technology deployment.
The dangers
- What if Apple does not commit to WebRTC?
- What if Microsoft's implementation is different than Google's?
- What if there are different flavors of the API implementation?
- What if a web browser cannot talk to legacy SIP infrastructure without gateways?
- What if "they" cannot agree on a common video codec?
On the other hand, Apple prefers closed systems with integrated offering that works seamlessly across Apple platforms. Opening up the Facetime (or related technology) to all the web developers is interesting but not motivational enough, especially when the resulting web application and the WebRTC session is controlled by the web developers and users instead of the iOS developer platform or the iTunes portal. This makes Apple, as a browser vendor, uncommitted to WebRTC until it proves its worth to the business.
Microsoft already has a huge communication server business that is largely SIP based. Any technology that makes it difficult to interoperate (at the media path/RTP level) with the existing SIP application servers is likely to cause trouble. They could build a gateway, but it will be hard to compete with others who can provide such services on the low cost cloud infrastructure. Hence, having the building blocks in place that will enable a web browser to talk to existing SIP devices in the media path is probably a core requirement for Microsoft. Thus, low level access to SDP and more control over media path is proposed in their draft.
For a web browser on desktop, WebRTC is interesting in bringing a variety of new use cases and web application. However, these use cases are already available to web users on the desktop via alternatives such as Adobe's Flash Player. It it not clear what the WebRTC can contribute in terms of additional use cases for web applications on desktop. However, for mobile platforms and smart phones it is a completely different story. Flash Player either does not work or is too difficult to work for video conferencing on mobile. HTML5 presents a great opportunity to bridge the differences between end user platforms when it comes to web applications. Thus, adding WebRTC to web browsers on mobile phones such as iPhone and Android adds a lot of value.
Let us consider a few "what if"s that pose danger to the promise of WebRTC.
What if Apple does not commit to WebRTC?
What are the other potential threats to this technology? What if the standardization works drags for too long, until it is too late, and web users have moved on to something better! Web is driven by web applications. What a browser vendor does or what a standardization committee adopts are just a shadow of what the web applications do (or want to do). A web developer prefers to have a consistent, simple and unified API for real-time communication.
If it were in my hands, I would ban any IETF or W3C proposal without a running and open implementation. But the reality is far from this!






