WebRTC vs. SIP/SDP

Every time a new protocol appears in the protocol jungle of multimedia communications [2], people attempt to compare and contrast it with existing established protocols, such as SIP. With WebRTC as the new kid on the map, you can find several attempts to compare SIP and WebRTC [3][4][5][6][7]. Depending on which camp the comparison originates from, you may find flavors of favoritism, or unintentional downplay of the importance of the other camp.

This article presents my point of view, hopefully unbiased, on this topic. Let me start by saying that SIP and WebRTC are different, and it is not fair to compare them without an established context.

Complex protocol vs. simple API

SIP is a protocol, not an API; whereas WebRTC is an API, with an associated set of protocols.
Consider that TCP is a protocol but socket is an API. TCP has complex state machinery to enable reliable bi-directional end-to-end packet flow assuming that intermediate routers and networks can have problems but end to end reliability is assured. On the other hand, the socket API hides all the complexity of TCP and provides an easy to use abstraction. Luckily in the case of TCP, developers only work on top of the standardized socket API. Whereas in the case of SIP, usually they end up dealing with the protocol directly, or work with (partly broken) implementations of intermediate network element such as a SIP proxy, or use (poorly defined) APIs from third-party SIP libraries. Depending on what level of abstraction is exposed in the SIP library, the development effort can range from very simple to very complex. In the case of WebRTC, the API is being defined and standardized, and is intended as a high level abstraction suitable for  JavaScript/Web developers.

To list the differences between SIP and WebRTC - one could say that SIP is programming language agnostic but WebRTC is not, and must use JavaScript, or that SIP application could run on any device or platform, but a WebRTC application requires a browser to host the app. These are superficial differences - due to the core difference in the nature of SIP vs. WebRTC, i.e., protocol vs. API.

SIP would have been perceived as much simpler if there were well defined APIs from the start. But that would have limited the flexibility (and freedom) of what SIP could be used for. More on the flexibility and freedom is discussed later in this article.

One thing to note, however, is that WebRTC does use a set of complex protocols behind the scenes, many of which are shared by a SIP system, as we describe next. Although a web application using WebRTC is simple due to the high level API, the implementation of WebRTC in a browser itself is not that simple!

Protocol stack

SIP is an established protocol for Internet communication. WebRTC is an emerging API intended to be provided by the browsers and consumed by the applications in JavaScript. It has an associated set of protocols, some of which overlap with those used in a SIP system. The title of this article indicates a comparison of WebRTC with SIP/SDP, and not just with SIP. This is because SIP is often only a part of the overall puzzle in a real world communication system, as shown in the following diagram taken from my research paper [1].  In fact, the role that SIP plays in a real system is outside the scope of the WebRTC specification currently. Which means that it does not really make sense to compare SIP alone with WebRTC protocols, but it is okay to compare a SIP system with that based on WebRTC.


Comparison of typical SIP vs WebRTC application stack. The dotted line separates what is programmed by the application developer and what is provided by the platform. 
Although most SIP systems use SDP, it is not a requirement, and some SIP systems may use something else if needed. Same thing holds true for other protocol elements such as RTP for media transport. However, within the context of a voice and video call application, you will likely see the above mentioned protocols in effect. On the other hand, the WebRTC effort specifies the mandatory set of protocols including SDP, SRTP and ICE, and also, the minimum list of mandatory audio and video codecs, which must be implemented by a WebRTC capable browser. In the diagram above, while SIP is only one block in the system, WebRTC specifies multiple blocks including the API, codecs, SRTP/RTP/RTCP, ICE and to some extent SDP. While SDP is used in the WebRTC API, an application is free to change it to anything else when exchanging the session information with the peer, if needed. Defining and mandating more things in WebRTC makes the interoperability a little easier, unlike SIP where many such things are left to the implementation. We will talk about interoperability later in this article.

Network elements

The SIP specification defines the behavior of an endpoint (or user agent) as well as various network elements such as proxies and redirect servers. Other SIP extensions define more network elements such as a conference server, a presence server, and so on. With WebRTC, the focus is only on browser-to-browser communication, and only the browser's behavior is defined in the specification. Other elements such as a gateway or media server are free to implement the behavior. However, that part is likely not going to be standardized. This hopefully limits the number of specifications related to WebRTC, and hence limits the resulting complexity.

This does not mean that WebRTC systems will not need such network elements. Ability to call a phone number via a gateway, or to host a multi-party voice mixing or recording on a media server, or to locate the target user via a lookup service are all examples of crucial services required in many real world applications. Lack of standardization means that every vendor will likely define its own version of these elements, creating fragmentation. Such scenarios are actually inline with WebRTC's goal which aims at creating media path among browsers visiting the same website, and does not care for inter-website communication. More on interoperability vs. fragmentation is discussed later in this article.

Fortunately, it is entirely possible to create fully functional WebRTC systems without using complex or heavy weight network elements. In fact, many initial demonstrations of WebRTC applications have followed this path with peer-to-peer media flows, along with a very simple signaling/rendezvous service. SIP, too, started out with this model of peer-to-peer media flows with a simple signaling/rendezvous service in the form of a SIP proxy or redirect server. However, practical limitations and real-world deployments have mostly thrown this idealistic model out of the window, and have largely adopted a "managed" network centric model with heavy weight network elements in the form of back-to-back user agents, session border controllers, and call stateful proxies. It is likely that the similar real world forces may cause a similar network centric deployment model in the case of WebRTC. Only time will tell!

Message flow

To establish a multimedia call, three pieces of information are exchanged between the two parties - (1) an "invite" from caller to callee as an intention to establish a call, which the callee may answer or decline, or ignore, (2) session description containing capabilities of each party, (3) transport candidates for establishing connectivity on the media path, hopefully peer-to-peer. Let us call these steps as (1) "invite", (2) session description exchange and (3) transport candidates exchange.

In a WebRTC-based application, these may be separated out into different phases or may be combined together, as determined the web application. In fact, (1) above is not part of the specification, but is often implemented in a WebRTC call, and (2) and (3) are exchanged between the two parties using an out-of-band channel outside the scope of the WebRTC specification. The term "trickle ICE" refers to separating (3) from (2) in a WebRTC session negotiation, and is the default mode - although it is possible to not use it, as determined by the web application. In a SIP system, all three pieces are usually bundled in a single request-response transaction and are carried in a SIP INVITE and it's 2xx response between the two parties, often via one or more intermediate proxies.

The separation of these pieces of information in WebRTC makes it more flexible, e.g., an application can invent any kind of user lookup service over any protocol (HTTP POST, WebSocket, XMPP, Google Channel, LDAP, hard-coded, or whatever, or even SIP [1]). Some applications may not even need such a lookup service, e.g., a virtual presence application where everyone who visits the website is able to see and hear others on that site. Unlike this, in SIP, a user lookup service must be accessible via SIP.

For those who remember H.323, and its multiple steps of call setup in version 1, may argue that we have come back a full circle - started with multiple steps call setup, invited H.323 fast connect and SIP for single step call setup, and now back to multiple steps to establish a WebRTC call. However the situations are different in H.323 vs. WebRTC - in H.323 all the steps were done by the same entity, and after the initial "invite" step, the subsequent ones were follow through, with no perceived benefit to keep them separate. In the case of WebRTC, the "invite" step is outside the scope of WebRTC specification, hence it makes sense to keep it separate. The last transport candidates exchange step is incremental with real benefit to keep it separate from session description exchange, so that better peer-to-peer flows can be detected in future, which first peer-to-peer flow that works gets in use as soon as possible. In the case of H.323, there is no equivalent of this step in my opinion, because the second and third steps are about exchanging media capabilities, and then establishing fixed set of logical channels based on those media capabilities.

The SIP specification and many of its extensions, such as call transfer, interoperability with phone, or capabilities of user agents, actually deal with only the first piece of information mentioned above, i.e., how to reach the target user or device, and are often related to the notion of a "call". In WebRTC specification, there is no notion of a "call", or signaling to lookup the target user. It is assumed that such constructs and concepts belong to the web application or the website that is hosting the application. This gives flexibility to the application developer on how to implement advanced features such as conferencing or call transfer. With more flexibility, comes more power, and more non-interoperability, described next.

Interoperability vs. fragmentation

In the past fifteen years, non-interoperability has been a major problem in the SIP community. Although it is often easy to achieve interoperable voice call signaling, it is rather difficult to make sure that those hundreds or thousands of SIP systems out there will implement every little detail consistently as per the specification. Moreover, there is no mandatory codec defined by SIP, which means that two perfectly compliant SIP endpoints may not talk to each other.

With WebRTC, you only need to achieve interoperability among the major browsers. WebRTC easily enables communication between users visiting the same website, where the application fills the void of signaling, user lookup, and the likes. Since signaling is outside the scope of the specification, interoperability between users visiting separate websites is not readily available and requires one-to-one federation between the two websites or web application. This fragmented communication behavior is inline with how existing web applications behave, i.e., they tend to create islands of users, where users within an island can talk to each other, but not beyond. Have you ever thought of being able to reach your Linked-In contact from your Facebook page? I have, and have tried to do something about it [8][9].

The fundamental difference of triangular vs trapezoidal call model in WebRTC vs SIP, respectively -- which means the two parties are connected to the same server in WebRTC, whereas two parties can potentially be on separate servers in SIP -- results in major differences in interoperability requirements.

The triangular way of WebRTC is not how VoIP community perceived open communication to be. From that perspective, even though WebRTC is an emerging open standard, it will created closed walled gardens of communication islands - leading to communication fragmentation. Unfortunately, even in SIP-based systems, due to non-standard behaviors in implementations or different implementation choices (e.g., codecs) by different vendors, it tends to create fragmented islands, where an equipment from one vendor rarely talks freely with that of another. This tends to create an ecosystem, where a customer must purchase phone endpoints and servers from the same vendor, because it is often hard to achieve true mix-and-match interoperability out-of-the-box.

The problem is more relaxed in the case of WebRTC, because there are only a few major browsers, and the protocol specification does not involve any signaling or media server, except for generic STUN and TURN servers, when appropriate. A fair comparison would have been possible if there were hundreds or thousands of browsers implementing WebRTC, similar to the plethora of SIP implementations that exist today. Or if there were only five major vendors creating SIP user agents. With SIP, the protocol was invented first, followed by an attempt to deploy the user agents and servers. With WebRTC, we already had user agents (browsers) and servers (web servers, STUN/TURN servers), and we put together a set of protocols to let them talk in real-time.

By leaving signaling outside the scope, a WebRTC specification escapes many problems related to non-interoperability. In practice, fragmentation seems to be unavoidable, largely due to business reasons -- or lack of incentive for one website to let its users talk to that on another website. The benefit of better interoperability, maintainability and usability far out weighs the cost of fragmentation for many websites and web applications. For some applications that require cross-site communication, a gateway can do the job, without being part of the specification.

The biggest non-interoperability and fragmentation threat to WebRTC is that some major browser vendors may not implement it altogether, or may implement a variant of the specification, making it non-interoperable with other browsers. Hopefully, market pressure will stabilize the interoperability among major browser. Or developers will find alternatives (plugins? gateways?) to fill the gap. Again, only time will tell!

Flexibility and freedom

We have mentioned about flexibility at the protocol and application level. In summary, SIP provides greater flexibility at the protocol level and WebRTC at the application level. With greater flexibility comes more freedom to innovate, and unfortunately, more problems in interoperability. 

First, SIP provides flexibility at the protocol level, e.g., ability to use non-SDP based session description, or ability to use RTP vs SRTP vs ZRTP based media transport, or ability to add newer extensions for media path or call signaling. WebRTC mandates many such elements in the protocol, e.g., use of SDP, SRTP, ICE and so on. In future, if a newer or better media transport, NAT traversal or session description is invented, the specification will need to be changed to incorporate that. However, upgrading the specification when there are only a few browser implementations should not be a nightmare, especially if the upgrade is backward compatible. Once WebRTC specification gets ratified, it leaves little room to change or improve, in an implementation. With SIP's flexibility and resulting freedom, people have created many hundreds of different implementations and use cases using the same base protocol. WebRTC lacks the ability, for example, to implement non-trivial media path, such as application level multicast or local LAN broadcast, or to let the web application configure differential service for different media flows, or to let the web application inject text-to-speech in a media path.

The protocol flexibility of SIP comes at a price - no guarantee of common features such as security or common codecs in different implementations. For example, the voice path in one SIP call could be unencrypted and in another could be encrypted - determined by the implementation choice. Whereas in WebRTC, many of such features are well defined and mandated, as described in the next section.

Keeping the signaling part out of scope and defining the API give good application level flexibility to WebRTC systems. As mentioned before, features such as call transfer, conferencing or user lookup are application's responsibility. This allows a social website to implement these features different from a click-to-call provider's website, for example. Given the numerous websites, and their potential for incorporating real-time in-browser communication, such flexibility is very useful and highly desired.

Protocol differences in SIP vs WebRTC system

Both SIP and WebRTC systems use certain common set of protocols, e.g., SDP for session description, ICE for NAT traversal, and (S)RTP for media transport. The following chart from my paper [1] summarizes the interoperability differences.

Interoperability differences in SIP vs WebRTC; (o) means optional.
In practice, the session description (SDP) and media path defined by WebRTC requires many extensions, that are not found in existing SIP systems, and hence it is very hard to transparently interoperate between the two without using a media gateway. Eventually, one would hope that existing SIP systems would adapt to incorporate these new extensions. But history has told us that this is next to impossible - given the large number of SIP implementations, and given the availability of media gateways. Thus, a browser-to-SIP call will almost always involve a media gateway, unless you sell your own SIP endpoint.

Frequently asked questions (FAQ)

I have heard these. You will likely hear them too.
  1. Are SIP and WebRTC competing technologies? Yes and no. It depends on the context. As an access protocol to connect a client to an established voice/video service or infrastructure, yes. For anything else, most likely no.
  2. Is WebRTC a subset or refinement of SIP? No. WebRTC has certain elements and features that are not present in SIP/SDP. For example peer-to-peer data channel, or trickle ICE, or programming APIs. 
  3. Is WebRTC a superset or generic case of SIP? No. A SIP-based system includes elements and features that are not present in a WebRTC browser, e.g., well defined network elements, call event notifications, or user lookup service.
  4. Is WebRTC similar to P2P-SIP? or Jingle? or RTMFP? or name-your-favorite-here? No, for all. In theory, WebRTC has some overlap with session description and media path of a standard SIP system. In practice, WebRTC is very different from any of the existing systems we had.
  5. Does SIP make WebRTC better? What about the other way? Depends on who you talk to. A SIP proponent would argue that cross-site communication and PSTN interoperability requires using SIP at the core, or that SIP in JavaScript is good enough for signaling of WebRTC. A WebRTC proponent would argue that replacing SIP with WebRTC will improve on security, interoperability and call quality. In my opinion, these arguments are either too narrowly focussed or a compelling counter argument can easily be made. So my answer is "no" on both counts.
  6. Can SIP and WebRTC work together, instead of competing? Absolutely. Firstly, they do not compete except in a very narrow context. And secondly, they do work together - the SDP generated by WebRTC can be carried in SIP, directly from the browser or via a gateway. See [1] for alternatives.
  7. Will WebRTC remain at the end, and must SIP be at the core? No. It is entirely possible to create WebRTC capable network elements such as media servers and gateways without depending on SIP, that allow creating voice/video infrastructure with WebRTC at the core. It also possible to create many existing use cases with peer-to-peer media path of WebRTC, without using these network elements at all. 
  8. Does WebRTC have better voice quality than SIP? No. Because, SIP does not specific fixed set of voice codecs or voice quality. At present, when many SIP systems use G.711, G.729 or Speex, the voice quality of Opus used in WebRTC is superior to those SIP systems. However, nobody has stopped those SIP systems from using Opus (and related media engine) in their voice stack.
  9. Will WebRTC improve voice quality of my PC-to-phone application? No. Unless you are talking about smart-phone with WebRTC running on phone also. When voice path gets translated to a phone call, the quality will likely be reduced to G.711 or G.729 or whatever the gateway provider has deemed suitable on the phone network.
  10. Will WebRTC give better voice quality for my conferencing application? Depends. See previous question. For centralized media path (with audio mixer), it requires transcoding, so voice quality will likely suffer. Optimization such as using peer-to-peer voice path, or reflecting loudest-N streams to client without mixing may preserve the end-to-end voice quality in conferencing. 
  11. Why does browser not implement SIP directly? For browser-to-browser use case, SIP is not needed. For anything else where one end is a browser, a gateway can do the job of translating to SIP if needed. And WebRTC is focussed on browser-to-browser use case only. Having said that, it is possible to do SIP in JavaScript [1] running in your browser, or to implement your own plugin or browser with built-in SIP stack and related APIs. 
  12. What does it mean that a SIP conference server supports WebRTC? It could mean that it supports a browser end-point to participate in a multi-party conference over WebRTC, or it could mean that it uses Google's WebRTC media engine or the OPUS/VP8 codec for the voice/video stack but requires SIP/H.323 endpoints to connect. Be careful!
  13. Is WebRTC better than SIP/SDP? Hard to say. WebRTC attempts to fill the gap for a very specific problem, but requires robust, secure and high quality implementation, and defines simple high level API. On the other hand, SIP leaves many such things open to implementers. So for the specific problem, WebRTC is better, some might say.
  14. Between WebRTC and SIP, who will win the racing? There is no racing. Or if there is one, then WebRTC and SIP are racing in different directions, at 90 degrees to each other. Depending on where you look from, or which axis has the finish line, either of them could win. But the truth is, WebRTC is for browser-to-browser communication, where SIP/SDP/RTP is not feasible currently -- so there is no racing. One could argue that with more number of browsers supporting WebRTC than number of SIP endpoints, or with more number of JavaScript developers than number of SIP implementors, WebRTC is likely to win the racing. No comments on that. A more correct analogy of racing, in my opinion, is in specific contexts, e.g., between WebRTC of browser and RTMP+ RTMFP of Flash Player for browser-to-browser communication, or between SIP phone vs.  WebRTC browser for client access to backend voice/video service or infrastructure. 
  15. What is the real difference between WebRTC and SIP? Ah, glad you asked! If you have already read the rest of this article, you will be surprised to find that unlike all the differences mentioned before, the real difference is who the most important customers of WebRTC vs. SIP are - for WebRTC it is the web developers community, and for SIP it has been telecom providers and equipment vendors.

Conclusions

This article has attempted to present several factors comparing and contrasting SIP/SDP vs. WebRTC systems, without really diving in to the specific protocol differences of how SDP/RTP are used in SIP vs. WebRTC. This is because, while WebRTC specifies the list of profiles and extensions to SDP/RTP, the SIP system is free to choose whichever it likes. So any such comparison will end up being one between WebRTC and SIP of vendor-X.

There are similarities and differences between the two types of systems. In the end it is about the applications and user experiences, rather than protocols and APIs. In practice, it is easy for an established SIP-based IVR system to add WebRTC access, compared to that for a newbie WebRTC system to implement full fledged IVR functionality. The same holds true for many existing telecom, web communication and Internet conferencing applications. Thus, many existing SIP-based applications will likely also adopt WebRTC for access, if there is a demand. And similarly, new pure WebRTC-based web applications will likely incorporate SIP gateway-ing to reach out to  non-browser users, if there is demand. SIP or WebRTC itself is just a small piece of the puzzle in the overall system or application. Nevertheless, WebRTC attempts to fill a void that has existed for a really long time in the SIP-based communication world, for which past attempts such as using plugins have largely been unsuccessful, especially on mobile platform.


About Bellheads, Netheads and Webheads

Many people are familiar with the Netheads vs Bellheads battle [1][2]. A Bellhead has a tightly controlled and centrally managed view of services that work all the time and on which the provider can deploy applications and consumers will pay for them, usually a heft price every month. A Nethead is happy to use loosely connected hardware that is up 99% of the time, but allows creating a massive loosely connected network of software applications, where the hardware does not care which software is running or sending traffic.

I believe, most people of my generation belong to the Netheads family, even though their employers may be Bellheads. From [1] "This philosophical divide should sound familiar. It's the difference between 19th-century scientists who believed in a clockwork universe, mechanistic and predictable, and those contemporary scientists who see everything as an ecosystem, riddled with complexity. It's the difference between central and distributed control, between mainframes and workstations."

More recently a new generation is evolving, called Webheads. (Please do not google for this term, or you will be misled). For a Bellhead, the control must be strict and centralized, with managed services. For a Nethead, the control, if any, is distributed in the end-point, with layered applications on top of the basic best effort infrastructure. For a Webhead, everything must run or be accessible from a browser.

"Weird!" - some might say.

"How is it even in the same league?" - others might ask.

Here is my attempt to present the distinguishing attributes.

BellheadNetheadWebhead
Evolution Initially there were Bellheads. They were happy living in their islands, working hard towards what the island demanded, and created applications and services for the island. Then Netheads evolved, and separated applications from the grasp of the services. And interconnected the islands. They created Internet. They created the world-wide-web. They created SIP. And the list goes on. Then Webheads emerged, and they started moving applications to the web. Slowly and steadily. First, it was email to webmail. Then documents on web editor. Then VoIP to web telephony. Now it looks like they have spread every where and practically moved all kinds of applications to the web.
Control Believe in centralized control, like one super-power to rule them all. The central element has enough gravity to pull many applications and services, resulting is heavy weight servers but thin clients. Believe in distributed control or no-control, resulting in chaos at times. The fluidity of the services and applications keeps them decoupled, or loosely connected with each other, spreading their wings to the peripherals. Believe in centralized control in the form of the website on which their application is hosted, but often accept the fact that there can be numerous such control elements. Work hard to avoid any connectivity or information leak to other controls, outside their own.
Robustness Everything hardware must be 99.999% reliable. Overall system must be 99.9% reliable. If system goes down, even if nobody is using it right now, they will lose business. Things have strict guarantees.The system works most of the time, but fails sometimes, but no one person or entity may be blamed, because it is collective responsibility. No guarantees, just best effort service, whatever... whatever works wellThe system works if it does, and does not if it does not. They work hard to keep the system running. But many systems are poorly designed, and often fail on sudden popularity of the website. Often blame for the failure is put on the network, and many times on the database.
Example To add video communication on top of voice calls is a massive renovation project involving upgrading existing services, making the pipes bigger to carry video, incorporating fixed bit rate video codecs to guarantee quality, and upgrading terminal devices to support video. It must work all the time from day one.To add video on top of a voice call client-server application, is just changing the application to use the desktop webcam and display, and pushing new video packet along side voice, without changing any other infrastructure element in most cases.To add video on top of a voice call web page, there is a redesign of the website to accommodate the visual elements, so that everyone viewing the same web page can see and hear each other. Regarding the codecs and bandwidth, delegate that to Flash or WebRTC.


In some ways, like Bellheads, Webheads also think of centralized control and management. The hosting website is the controller of an application and keeper of all the user data that could be generated on that website. In a way, they both create islands of innovations which do not communicate with each other very easily, except occasionally via rowboats, i.e., the likes of oauth and web APIs. Like Netheads, Webheads do not care about lower layers of communication, as long as the browsers can reach their web servers. Many things that are well studied in the Netheads community, e.g., socket programming, voice and video communication, or binary data processing, are only recently being introduced to Webheads, e.g., in the form of WebSocket, WebRTC or ArrayBuffer. If Bellheads are purists, then Netheads are hackers. And compared to them, Webheads are hackers++.

[Heard elsewhere...]

Bellhead: "We invented video communication two decades ago - from H.320, H.323/Q.931/H.225/H.245, H.324, H.325.. (15 more). It works on ISDN, LAN, ATM, MPLS, POTS, 802.1x, ... (10 more). There is wide variety of codecs including G.711, G.722, G.723, G.729, H.261, H.263, H.264, ... (20 more). We just finalized our first $100k sale of equipment to whats-it-called company for their intra-office video telepresence."

Nethead: "We re-invented everything you did, and we can quickly build endpoint terminals that can do two party video call to large scale video broadcast. Everyone is now using what we created. We even sent our engineers to your companies to add our protocols to your equipments. Did you know that your equipment also supports SIP for video call? You may not like it, but our protocols are everywhere, and better than yours."

Webhead: "Not sure what you guys are talking about, but just got a tweet from my IT guy that our click-to-video-call app reached 1 million unique users. We are going to hire a new CEO to monetize our app now."

For Webheads, focus is not about services or protocols, but only about apps and users.


Spaghetti, ravioli and lasagna in software

[fiction]

As I was walking down the street, down the street, down the street. A very good friend I chanced to meet... And we went to this italiano restauranto around the corner. Here is what we talked about. [blink blink, screen turns black and white, time goes back...]

"Have you heard of spaghetti code?"

"Of course, I write it all the time. It is easy. I think of a problem, and start writing code to solve it. And do not stop until either the code is complete or my laptop battery runs out. And yes, I don't delete any line of code that I write. Eventually I end up with like two thousand lines of code in one giant function, and like two hundred control statements. It is fast, furious and fun. But tastes awful the next day."

"Really?"

"Yes. So, what kind of code do you write?"

"Well, I have tried my hands on ravioli, lasagna, and more recently farfalle and penne."

"I know ravioli and lasagna - just too modular or too many layers for my taste - I get lost in which piece or layer I am working on, and end up making a mess, embarrassing myself. What about farfalle and penne?"

"Oh, farfalle is what I create when I refactor my correctly working software to smaller pieces, to such an extent that each piece looks broken and half complete."

"What is penne?"

"It is when I refactor it right, and pieces look good. Not too big, not too small. And no information hiding and stuff like that. Frankly, I don't really like information hiding - it is more hype than hoopla. I like data oriented programming - software that is independent of the data - or pasta that is independent of the meat."

[waitress comes by, and asks...]

"I heard pasta, so assume you are ready. What would you two gentlemen have?"

"I will have the three-pasta house combo, with lasagna at the core, the first layer filled with penne, and the second layer with cheese ravioli. Spare the meat. Mix tomato, pesto and alfredo sauce and spread it all over."

[waitress makes a face, and turns to the other...]

"I like spaghetti, but I will also have what my esteemed friend asked for. Additionally, can you open all the layers and separate them, placing side by side, and break all the raviolis to remove the cheese and put it on the side too. I like to keep my pasta separate from other ingredients."

Irony in Internet Communications

We live in a world where

  1. first, we build Internet Protocol over phone network (modem), and then we build a phone system over Internet Protocol (VoIP),
  2. first, we build a phone system to run on Internet (VoIP), and then drive the internet from our phones (smart phones),
  3. first, we build open communication on top of monopolies (Internet over PSTN), and then build closed communication on top of that to promote monopolies (Facetime, Skype, or you-name-it-here).
Wouldn't it be easier if we start afresh?

NetConnection or PeerConnection?

Flash Player and the corresponding ActionScript programming language have NetConnection/NetStream programming APIs for developing communicating web applications. Emerging WebRTC has PeerConnection/MediaStream APIs to enable real-time multimedia communication in the browser. This article compares and contrasts these two approaches to designing application programming interfaces (API). I have attempted to compare the API-styles rather than the specific implementations in Flash Player or browsers.

Background

In Flash Player + ActionScript, four categories of objects (or classes) are needed to build a communicating web application such as a video chat. First, you need a connection to the server which may be a media server processing the media path in a client server approach or just a rendezvous server for establishing a peer-to-peer media path in the peer-to-peer approach. Second, you create named streams, for publishing and for playing. Third, you attach the publish side stream with your capture devices for audio and/or video. And fourth, you attach the play side stream with display or playback of remote media. The corresponding classes for these are: NetConnection, NetStream, Camera and/or Microphone, and Video or VideoDisplay [1] (pp.16-18). The diagram below shows a two party video call between our regular actors, Alice and Bob, each creating a named stream to publish, and each playing the other party's named stream to display. On each terminal, a NetConnection object is used to create two named NetStream objects, one is attached to Camera and Microphone and another to a Video or VideoDisplay element. Additionally, for displaying local preview, the Camera object is also attached directly to a Video or VideoDisplay.


In WebRTC capable browser + Javascript, again, four categories of objects (or classes) are needed. First, a LocalMediaStream created via the getUserMedia function to represent the capture stream from the user's camera and microphone devices. Second, an PeerConnection (actually called RTCPeerConnection) object is created to establish a peer-to-peer path between the two browser instances. Third, the LocalMediaStream object is attached (or added) to the PeerConnection on one end, and appears as remote MediaStream object at the other end. Fourth, this remote MediaStream is attached to an audio or video element to playback or display the received media. The LocalMediaStream is also attached to another video element to display local preview. Unlike publish vs. play mode of NetStream, I put the LocalMediaStream vs. remote MediaStream into separate categories, due to the way they are created and what they represent, even though LocalMediaStream is also a MediaStream.


For comparison purpose, let us call these Flash vs WebRTC set of APIs as NetConnection vs PeerConnection.

Level of abstraction: named stream vs. offer-answer

Clearly NetConnection presents a higher level of abstraction than the PeerConnection. In PeerConnection, each media path is explicitly created via a new RTCPeerConnection object on each side. It uses offer-answer model - where one side sends a session offer with its capabilities and the other side responds back with an answer with its side's capabilities. The offer-answer messages are exchanged between the two browsers outside the WebRTC APIs. In case of NetConnection, when a participant publishes a named stream (MediaStream), zero or more other participants may play that named stream. The implementation of these connection and stream classes hide the details of the peer-to-peer or client-server media path establishment. Internally, some variation of offer-answer and transport candidate messages are exchanged between the Flash Player instances, but are not exposed to the programming interface - thus making it a higher level of abstraction.

Topology creation: client-server vs. peer-to-peer

A NetConnection is a client-server abstraction where as a PeerConnection is a peer-to-peer path abstraction. A NetConnection object can be used to create client-server as well as peer-to-peer media paths, whereas PeerConnection is targeted only towards a peer-to-peer media path. To further clarify - a NetConnection is between a client and a server, which may be using RTMP in which case the server is a media server for the client-server media path, or may be RTMFP in which case the server may just be a rendezvous server to establish peer-to-peer MediaStreams between the two browsers. Note that with RTMFP it is also possible to do client-server media path. Additionally, it also allows application level multicast for use cases such as large event broadcast. With PeerConnection, the media path is intended to be peer-to-peer, but it is also possible to implement PeerConnection at a server (outside the browser), in which case it gets transparently used to create a client-server media path because the server behaves as another browser endpoint of the WebRTC media path. With WebRTC, once MediaStream proxying is implemented in the browser, it may be possible, though not trivial, to create application level multicast for large group event broadcast. Proxying a media stream means that a received remote MediaStream from one peer may be attached (or attached) to another PeerConnection to another peer, thus proxying the media path via this peer.

Signaling channel: built-in vs. external

As explained in the previous points, NetConnection is the signaling channel to establish the client-server or peer-to-peer MediaStream objects. Due to the built-in nature of the client-server signaling channel, it is non-trivial for third-party vendors to build such a signaling server without implementing the full RTMP and/or RTMFP protocol. On the other hand, a PeerConnection generates the offer-answer and other signaling messages, which must be exchanged via another channel between the two browser instances, e.g., via a notification service over WebSocket, by using Ajax, or over other third-party channels. Such notification service does not need to understand the details of the WebRTC protocol. This gives more flexibility and freedom to the developers on how the signaling channel for a particular communicating web application is implemented.

Server dependent vs. independent

As explained in the previous points, a NetConnection is between a client and a server, thus dependent on the server. Additionally, any server side media processing such as mixing or gateway'ing to VoIP can only be done by extending or modifying this server. If the server is vendor dependent, which seems to be the case for RTMFP, until recently, then it makes these media processing and gateway'ing features also vendor dependent. On the other hand, a PeerConnection does not care about the server side. But an implementation of a web application that uses this object does. Hence, in practice, these types of servers may be needed - notification server for exchanging offer-answer and other messages, media server for doing things like server side recording or mixing, media gateway for translating to VoIP or other networks, and media relay for gathering reflexive or relayed transport candidates and for actually relaying media path across network middle boxes.

Summary and further reading

In summary, the higher level NetConnection-style interface with built-in signaling channel makes it easier for programmers to create applications, ranging from two-party video calls to multi-party conferences to large scale event broadcasts. On the other hand, greater flexibility (and freedom) of the PeerConnection-style interface allows building wide range of system architectures using third-party components such as notification servers, media servers, media gateways and media relays.

In my past research on flash-videoio [2] and aRtisy [3], I have proposed another higher level abstraction for the programming APIs for communicating web applications. In fact [3] describes how to implement the NetConnection-style interface on top of WebRTC's PeerConnection-style interface and using an external resource server to provide the notification service. It extends the named stream model to named resource model to implement several other applications such as text chat or shared whiteboard or notepad. Then, [2] and [3] describe how to build a widget-oriented interface which consists of a single video-io widget used in either publish or play mode to create a wide range of communicating web applications such as two-party calls, multi-party conferences and video presence for Flash and WebRTC, respectively.