Wednesday, December 31, 2014


Every time a new protocol appears in the protocol jungle of multimedia communications [2], people attempt to compare and contrast it with existing established protocols, such as SIP. With WebRTC as the new kid on the map, you can find several attempts to compare SIP and WebRTC [3][4][5][6][7]. Depending on which camp the comparison originates from, you may find flavors of favoritism, or unintentional downplay of the importance of the other camp.

This article presents my point of view, hopefully unbiased, on this topic. Let me start by saying that SIP and WebRTC are different, and it is not fair to compare them without an established context.

Complex protocol vs. simple API

SIP is a protocol, not an API; whereas WebRTC is an API, with an associated set of protocols.
Consider that TCP is a protocol but socket is an API. TCP has complex state machinery to enable reliable bi-directional end-to-end packet flow assuming that intermediate routers and networks can have problems but end to end reliability is assured. On the other hand, the socket API hides all the complexity of TCP and provides an easy to use abstraction. Luckily in the case of TCP, developers only work on top of the standardized socket API. Whereas in the case of SIP, usually they end up dealing with the protocol directly, or work with (partly broken) implementations of intermediate network element such as a SIP proxy, or use (poorly defined) APIs from third-party SIP libraries. Depending on what level of abstraction is exposed in the SIP library, the development effort can range from very simple to very complex. In the case of WebRTC, the API is being defined and standardized, and is intended as a high level abstraction suitable for  JavaScript/Web developers.

To list the differences between SIP and WebRTC - one could say that SIP is programming language agnostic but WebRTC is not, and must use JavaScript, or that SIP application could run on any device or platform, but a WebRTC application requires a browser to host the app. These are superficial differences - due to the core difference in the nature of SIP vs. WebRTC, i.e., protocol vs. API.

SIP would have been perceived as much simpler if there were well defined APIs from the start. But that would have limited the flexibility (and freedom) of what SIP could be used for. More on the flexibility and freedom is discussed later in this article.

One thing to note, however, is that WebRTC does use a set of complex protocols behind the scenes, many of which are shared by a SIP system, as we describe next. Although a web application using WebRTC is simple due to the high level API, the implementation of WebRTC in a browser itself is not that simple!

Protocol stack

SIP is an established protocol for Internet communication. WebRTC is an emerging API intended to be provided by the browsers and consumed by the applications in JavaScript. It has an associated set of protocols, some of which overlap with those used in a SIP system. The title of this article indicates a comparison of WebRTC with SIP/SDP, and not just with SIP. This is because SIP is often only a part of the overall puzzle in a real world communication system, as shown in the following diagram taken from my research paper [1].  In fact, the role that SIP plays in a real system is outside the scope of the WebRTC specification currently. Which means that it does not really make sense to compare SIP alone with WebRTC protocols, but it is okay to compare a SIP system with that based on WebRTC.

Comparison of typical SIP vs WebRTC application stack. The dotted line separates what is programmed by the application developer and what is provided by the platform. 
Although most SIP systems use SDP, it is not a requirement, and some SIP systems may use something else if needed. Same thing holds true for other protocol elements such as RTP for media transport. However, within the context of a voice and video call application, you will likely see the above mentioned protocols in effect. On the other hand, the WebRTC effort specifies the mandatory set of protocols including SDP, SRTP and ICE, and also, the minimum list of mandatory audio and video codecs, which must be implemented by a WebRTC capable browser. In the diagram above, while SIP is only one block in the system, WebRTC specifies multiple blocks including the API, codecs, SRTP/RTP/RTCP, ICE and to some extent SDP. While SDP is used in the WebRTC API, an application is free to change it to anything else when exchanging the session information with the peer, if needed. Defining and mandating more things in WebRTC makes the interoperability a little easier, unlike SIP where many such things are left to the implementation. We will talk about interoperability later in this article.

Network elements

The SIP specification defines the behavior of an endpoint (or user agent) as well as various network elements such as proxies and redirect servers. Other SIP extensions define more network elements such as a conference server, a presence server, and so on. With WebRTC, the focus is only on browser-to-browser communication, and only the browser's behavior is defined in the specification. Other elements such as a gateway or media server are free to implement the behavior. However, that part is likely not going to be standardized. This hopefully limits the number of specifications related to WebRTC, and hence limits the resulting complexity.

This does not mean that WebRTC systems will not need such network elements. Ability to call a phone number via a gateway, or to host a multi-party voice mixing or recording on a media server, or to locate the target user via a lookup service are all examples of crucial services required in many real world applications. Lack of standardization means that every vendor will likely define its own version of these elements, creating fragmentation. Such scenarios are actually inline with WebRTC's goal which aims at creating media path among browsers visiting the same website, and does not care for inter-website communication. More on interoperability vs. fragmentation is discussed later in this article.

Fortunately, it is entirely possible to create fully functional WebRTC systems without using complex or heavy weight network elements. In fact, many initial demonstrations of WebRTC applications have followed this path with peer-to-peer media flows, along with a very simple signaling/rendezvous service. SIP, too, started out with this model of peer-to-peer media flows with a simple signaling/rendezvous service in the form of a SIP proxy or redirect server. However, practical limitations and real-world deployments have mostly thrown this idealistic model out of the window, and have largely adopted a "managed" network centric model with heavy weight network elements in the form of back-to-back user agents, session border controllers, and call stateful proxies. It is likely that the similar real world forces may cause a similar network centric deployment model in the case of WebRTC. Only time will tell!

Message flow

To establish a multimedia call, three pieces of information are exchanged between the two parties - (1) an "invite" from caller to callee as an intention to establish a call, which the callee may answer or decline, or ignore, (2) session description containing capabilities of each party, (3) transport candidates for establishing connectivity on the media path, hopefully peer-to-peer. Let us call these steps as (1) "invite", (2) session description exchange and (3) transport candidates exchange.

In a WebRTC-based application, these may be separated out into different phases or may be combined together, as determined the web application. In fact, (1) above is not part of the specification, but is often implemented in a WebRTC call, and (2) and (3) are exchanged between the two parties using an out-of-band channel outside the scope of the WebRTC specification. The term "trickle ICE" refers to separating (3) from (2) in a WebRTC session negotiation, and is the default mode - although it is possible to not use it, as determined by the web application. In a SIP system, all three pieces are usually bundled in a single request-response transaction and are carried in a SIP INVITE and it's 2xx response between the two parties, often via one or more intermediate proxies.

The separation of these pieces of information in WebRTC makes it more flexible, e.g., an application can invent any kind of user lookup service over any protocol (HTTP POST, WebSocket, XMPP, Google Channel, LDAP, hard-coded, or whatever, or even SIP [1]). Some applications may not even need such a lookup service, e.g., a virtual presence application where everyone who visits the website is able to see and hear others on that site. Unlike this, in SIP, a user lookup service must be accessible via SIP.

For those who remember H.323, and its multiple steps of call setup in version 1, may argue that we have come back a full circle - started with multiple steps call setup, invited H.323 fast connect and SIP for single step call setup, and now back to multiple steps to establish a WebRTC call. However the situations are different in H.323 vs. WebRTC - in H.323 all the steps were done by the same entity, and after the initial "invite" step, the subsequent ones were follow through, with no perceived benefit to keep them separate. In the case of WebRTC, the "invite" step is outside the scope of WebRTC specification, hence it makes sense to keep it separate. The last transport candidates exchange step is incremental with real benefit to keep it separate from session description exchange, so that better peer-to-peer flows can be detected in future, which first peer-to-peer flow that works gets in use as soon as possible. In the case of H.323, there is no equivalent of this step in my opinion, because the second and third steps are about exchanging media capabilities, and then establishing fixed set of logical channels based on those media capabilities.

The SIP specification and many of its extensions, such as call transfer, interoperability with phone, or capabilities of user agents, actually deal with only the first piece of information mentioned above, i.e., how to reach the target user or device, and are often related to the notion of a "call". In WebRTC specification, there is no notion of a "call", or signaling to lookup the target user. It is assumed that such constructs and concepts belong to the web application or the website that is hosting the application. This gives flexibility to the application developer on how to implement advanced features such as conferencing or call transfer. With more flexibility, comes more power, and more non-interoperability, described next.

Interoperability vs. fragmentation

In the past fifteen years, non-interoperability has been a major problem in the SIP community. Although it is often easy to achieve interoperable voice call signaling, it is rather difficult to make sure that those hundreds or thousands of SIP systems out there will implement every little detail consistently as per the specification. Moreover, there is no mandatory codec defined by SIP, which means that two perfectly compliant SIP endpoints may not talk to each other.

With WebRTC, you only need to achieve interoperability among the major browsers. WebRTC easily enables communication between users visiting the same website, where the application fills the void of signaling, user lookup, and the likes. Since signaling is outside the scope of the specification, interoperability between users visiting separate websites is not readily available and requires one-to-one federation between the two websites or web application. This fragmented communication behavior is inline with how existing web applications behave, i.e., they tend to create islands of users, where users within an island can talk to each other, but not beyond. Have you ever thought of being able to reach your Linked-In contact from your Facebook page? I have, and have tried to do something about it [8][9].

The fundamental difference of triangular vs trapezoidal call model in WebRTC vs SIP, respectively -- which means the two parties are connected to the same server in WebRTC, whereas two parties can potentially be on separate servers in SIP -- results in major differences in interoperability requirements.

The triangular way of WebRTC is not how VoIP community perceived open communication to be. From that perspective, even though WebRTC is an emerging open standard, it will created closed walled gardens of communication islands - leading to communication fragmentation. Unfortunately, even in SIP-based systems, due to non-standard behaviors in implementations or different implementation choices (e.g., codecs) by different vendors, it tends to create fragmented islands, where an equipment from one vendor rarely talks freely with that of another. This tends to create an ecosystem, where a customer must purchase phone endpoints and servers from the same vendor, because it is often hard to achieve true mix-and-match interoperability out-of-the-box.

The problem is more relaxed in the case of WebRTC, because there are only a few major browsers, and the protocol specification does not involve any signaling or media server, except for generic STUN and TURN servers, when appropriate. A fair comparison would have been possible if there were hundreds or thousands of browsers implementing WebRTC, similar to the plethora of SIP implementations that exist today. Or if there were only five major vendors creating SIP user agents. With SIP, the protocol was invented first, followed by an attempt to deploy the user agents and servers. With WebRTC, we already had user agents (browsers) and servers (web servers, STUN/TURN servers), and we put together a set of protocols to let them talk in real-time.

By leaving signaling outside the scope, a WebRTC specification escapes many problems related to non-interoperability. In practice, fragmentation seems to be unavoidable, largely due to business reasons -- or lack of incentive for one website to let its users talk to that on another website. The benefit of better interoperability, maintainability and usability far out weighs the cost of fragmentation for many websites and web applications. For some applications that require cross-site communication, a gateway can do the job, without being part of the specification.

The biggest non-interoperability and fragmentation threat to WebRTC is that some major browser vendors may not implement it altogether, or may implement a variant of the specification, making it non-interoperable with other browsers. Hopefully, market pressure will stabilize the interoperability among major browser. Or developers will find alternatives (plugins? gateways?) to fill the gap. Again, only time will tell!

Flexibility and freedom

We have mentioned about flexibility at the protocol and application level. In summary, SIP provides greater flexibility at the protocol level and WebRTC at the application level. With greater flexibility comes more freedom to innovate, and unfortunately, more problems in interoperability. 

First, SIP provides flexibility at the protocol level, e.g., ability to use non-SDP based session description, or ability to use RTP vs SRTP vs ZRTP based media transport, or ability to add newer extensions for media path or call signaling. WebRTC mandates many such elements in the protocol, e.g., use of SDP, SRTP, ICE and so on. In future, if a newer or better media transport, NAT traversal or session description is invented, the specification will need to be changed to incorporate that. However, upgrading the specification when there are only a few browser implementations should not be a nightmare, especially if the upgrade is backward compatible. Once WebRTC specification gets ratified, it leaves little room to change or improve, in an implementation. With SIP's flexibility and resulting freedom, people have created many hundreds of different implementations and use cases using the same base protocol. WebRTC lacks the ability, for example, to implement non-trivial media path, such as application level multicast or local LAN broadcast, or to let the web application configure differential service for different media flows, or to let the web application inject text-to-speech in a media path.

The protocol flexibility of SIP comes at a price - no guarantee of common features such as security or common codecs in different implementations. For example, the voice path in one SIP call could be unencrypted and in another could be encrypted - determined by the implementation choice. Whereas in WebRTC, many of such features are well defined and mandated, as described in the next section.

Keeping the signaling part out of scope and defining the API give good application level flexibility to WebRTC systems. As mentioned before, features such as call transfer, conferencing or user lookup are application's responsibility. This allows a social website to implement these features different from a click-to-call provider's website, for example. Given the numerous websites, and their potential for incorporating real-time in-browser communication, such flexibility is very useful and highly desired.

Protocol differences in SIP vs WebRTC system

Both SIP and WebRTC systems use certain common set of protocols, e.g., SDP for session description, ICE for NAT traversal, and (S)RTP for media transport. The following chart from my paper [1] summarizes the interoperability differences.

Interoperability differences in SIP vs WebRTC; (o) means optional.
In practice, the session description (SDP) and media path defined by WebRTC requires many extensions, that are not found in existing SIP systems, and hence it is very hard to transparently interoperate between the two without using a media gateway. Eventually, one would hope that existing SIP systems would adapt to incorporate these new extensions. But history has told us that this is next to impossible - given the large number of SIP implementations, and given the availability of media gateways. Thus, a browser-to-SIP call will almost always involve a media gateway, unless you sell your own SIP endpoint.

Frequently asked questions (FAQ)

I have heard these. You will likely hear them too.
  1. Are SIP and WebRTC competing technologies? Yes and no. It depends on the context. As an access protocol to connect a client to an established voice/video service or infrastructure, yes. For anything else, most likely no.
  2. Is WebRTC a subset or refinement of SIP? No. WebRTC has certain elements and features that are not present in SIP/SDP. For example peer-to-peer data channel, or trickle ICE, or programming APIs. 
  3. Is WebRTC a superset or generic case of SIP? No. A SIP-based system includes elements and features that are not present in a WebRTC browser, e.g., well defined network elements, call event notifications, or user lookup service.
  4. Is WebRTC similar to P2P-SIP? or Jingle? or RTMFP? or name-your-favorite-here? No, for all. In theory, WebRTC has some overlap with session description and media path of a standard SIP system. In practice, WebRTC is very different from any of the existing systems we had.
  5. Does SIP make WebRTC better? What about the other way? Depends on who you talk to. A SIP proponent would argue that cross-site communication and PSTN interoperability requires using SIP at the core, or that SIP in JavaScript is good enough for signaling of WebRTC. A WebRTC proponent would argue that replacing SIP with WebRTC will improve on security, interoperability and call quality. In my opinion, these arguments are either too narrowly focussed or a compelling counter argument can easily be made. So my answer is "no" on both counts.
  6. Can SIP and WebRTC work together, instead of competing? Absolutely. Firstly, they do not compete except in a very narrow context. And secondly, they do work together - the SDP generated by WebRTC can be carried in SIP, directly from the browser or via a gateway. See [1] for alternatives.
  7. Will WebRTC remain at the end, and must SIP be at the core? No. It is entirely possible to create WebRTC capable network elements such as media servers and gateways without depending on SIP, that allow creating voice/video infrastructure with WebRTC at the core. It also possible to create many existing use cases with peer-to-peer media path of WebRTC, without using these network elements at all. 
  8. Does WebRTC have better voice quality than SIP? No. Because, SIP does not specific fixed set of voice codecs or voice quality. At present, when many SIP systems use G.711, G.729 or Speex, the voice quality of Opus used in WebRTC is superior to those SIP systems. However, nobody has stopped those SIP systems from using Opus (and related media engine) in their voice stack.
  9. Will WebRTC improve voice quality of my PC-to-phone application? No. Unless you are talking about smart-phone with WebRTC running on phone also. When voice path gets translated to a phone call, the quality will likely be reduced to G.711 or G.729 or whatever the gateway provider has deemed suitable on the phone network.
  10. Will WebRTC give better voice quality for my conferencing application? Depends. See previous question. For centralized media path (with audio mixer), it requires transcoding, so voice quality will likely suffer. Optimization such as using peer-to-peer voice path, or reflecting loudest-N streams to client without mixing may preserve the end-to-end voice quality in conferencing. 
  11. Why does browser not implement SIP directly? For browser-to-browser use case, SIP is not needed. For anything else where one end is a browser, a gateway can do the job of translating to SIP if needed. And WebRTC is focussed on browser-to-browser use case only. Having said that, it is possible to do SIP in JavaScript [1] running in your browser, or to implement your own plugin or browser with built-in SIP stack and related APIs. 
  12. What does it mean that a SIP conference server supports WebRTC? It could mean that it supports a browser end-point to participate in a multi-party conference over WebRTC, or it could mean that it uses Google's WebRTC media engine or the OPUS/VP8 codec for the voice/video stack but requires SIP/H.323 endpoints to connect. Be careful!
  13. Is WebRTC better than SIP/SDP? Hard to say. WebRTC attempts to fill the gap for a very specific problem, but requires robust, secure and high quality implementation, and defines simple high level API. On the other hand, SIP leaves many such things open to implementers. So for the specific problem, WebRTC is better, some might say.
  14. Between WebRTC and SIP, who will win the racing? There is no racing. Or if there is one, then WebRTC and SIP are racing in different directions, at 90 degrees to each other. Depending on where you look from, or which axis has the finish line, either of them could win. But the truth is, WebRTC is for browser-to-browser communication, where SIP/SDP/RTP is not feasible currently -- so there is no racing. One could argue that with more number of browsers supporting WebRTC than number of SIP endpoints, or with more number of JavaScript developers than number of SIP implementors, WebRTC is likely to win the racing. No comments on that. A more correct analogy of racing, in my opinion, is in specific contexts, e.g., between WebRTC of browser and RTMP+ RTMFP of Flash Player for browser-to-browser communication, or between SIP phone vs.  WebRTC browser for client access to backend voice/video service or infrastructure. 
  15. What is the real difference between WebRTC and SIP? Ah, glad you asked! If you have already read the rest of this article, you will be surprised to find that unlike all the differences mentioned before, the real difference is who the most important customers of WebRTC vs. SIP are - for WebRTC it is the web developers community, and for SIP it has been telecom providers and equipment vendors.


This article has attempted to present several factors comparing and contrasting SIP/SDP vs. WebRTC systems, without really diving in to the specific protocol differences of how SDP/RTP are used in SIP vs. WebRTC. This is because, while WebRTC specifies the list of profiles and extensions to SDP/RTP, the SIP system is free to choose whichever it likes. So any such comparison will end up being one between WebRTC and SIP of vendor-X.

There are similarities and differences between the two types of systems. In the end it is about the applications and user experiences, rather than protocols and APIs. In practice, it is easy for an established SIP-based IVR system to add WebRTC access, compared to that for a newbie WebRTC system to implement full fledged IVR functionality. The same holds true for many existing telecom, web communication and Internet conferencing applications. Thus, many existing SIP-based applications will likely also adopt WebRTC for access, if there is a demand. And similarly, new pure WebRTC-based web applications will likely incorporate SIP gateway-ing to reach out to  non-browser users, if there is demand. SIP or WebRTC itself is just a small piece of the puzzle in the overall system or application. Nevertheless, WebRTC attempts to fill a void that has existed for a really long time in the SIP-based communication world, for which past attempts such as using plugins have largely been unsuccessful, especially on mobile platform.


نوفل said...

Thank you so match. This was really helpful. I have a question in regard to WebRTC and the QoS provided by ISPs. Do ISPs need to add some configurations to provide QoS for WebRTC streams. I have 1Mbps connection, I tried "" with a friend of mine who has better connection but the quality was very low. I couldn't see past the first image of his face.

Thanks in advance

naren for INDIA said...

Do write more. You explanations are crisp. I recently started my career in telecom. I work on gateways and gateway controllers. I am impressed by the book implementing SIP in python. I took a python course and started to follow your book. Hopefully, I ll be able to build the whole stack. Wish me luck
Thank you kundan