Kundan Singh: Problems


7 Project ideas on the topic of Web Apps and/or Multimedia Communication

Here are some project ideas largely related to multimedia communication, WebRTC and related technologies. Please get in touch with me if you would like to explore further, contribute to, or sponsor one or more of these projects. I will be happy to provide insights, guidance, partial code, and/or connect with the right folks. As I make more progress in any of these, I will create separate articles describing them.

How not to design a video conferencing product?

"Can you please stop sharing your screen so that I can share mine?" "I can't see that part of your shared screen because those buttons are overlaid on top." "How can I share my second webcam without stopping the first?" "Can you please say that again? I missed the last part." "It shows only up to nine videos even though I have a very big screen."

Do you ever feel frustrated due to some artificial restriction imposed by the video conferencing product you use? In this informal article, which some will find quite opinionated, I list some annoying product "features" when it comes to video conferencing.

What do I do as a software architect?

"I am the Architect. I created the matrix. I’ve been waiting for you." -- The Architect

I present my view on what an architect does or should do, and what I consider the important things to keep in mind.

Topics: Introduction. Making decisions using trade-offs. Research and evaluate technology and tools. Create proof-of-concept of big picture. Systematically divide the goal into smaller solvable problems. Create knowledge base and training for others. Continuous monitoring and improvement of the system. Identify and address disruptive technologies. Conclusions.

From "TO DO" to "DONE"

What does it take to get things done? Why do some tasks get done quickly while others drag on forever? Is there a science behind it? Or just philosophy?

Here, I present my thoughts on these topics. This is largely based on my first-hand experience in the software industry as well as academia, working in small and large companies, collaborating in small and large teams, and observing a wide range of things getting or not getting done.

Reason for technology failures - chat bots, video conferencing, or you name it.

Every so often, I come across articles explaining why a piece of technology failed. For example, why chat bots failed in 2018, or why video call did not work for customer support. I think the answer to these and other similar questions can be attributed to three points: (1) not a holistic approach, (2) wrong audience, (3) unreasonable expectations. Let me elaborate further.

What are endpoint driven communication systems?

A lot of my work in the past decade has focussed on endpoint driven systems, e.g., peer-to-peer Internet telephony (P2P-SIP) [1,2] for inherent scalability/robustness, Rich Internet Applications for web video conferencing [3,4], and more recently, resource-based software architecture [5,6]. In this article, I emphasize the importance of such systems, and differentiate them from other system architectures in the context of real-time communication.

My WebRTC related papers from 2015

This post continues from the previous one and gives an overview of my ongoing research on web-based multimedia communication or WebRTC. I co-authored five published research papers last year answering these questions - How do we solve some of the enterprise challenges of WebRTC using browser extensions? How do we address cloud challenges of a multi-services and apps vendor? How do we create a pure web-based enterprise communication and collaboration system without depending on legacy protocols? How do we do user reachability in a multi-apps environment created by WebRTC? And, how do we do write-once-run-anywhere for WebRTC based team apps using a cross platform tool? Read on if one or more of these questions interest you...

Enterprise WebRTC powered by browser extensions

Traversing WebRTC flows created by external third-party websites across restricted enterprise firewalls is challenging. There are other challenges in adopting WebRTC in enterprises, e.g., how to integrate it seamlessly with existing communication equipment, or how to enforce enterprise policies such as call recording of WebRTC flows on third-party websites. We show how to use browser extensions to solve these problems. This systems paper is based on my implementations of two interesting projects.

“We use browser extensions to solve two important issues in adopting WebRTC (Web Real-Time Communications) in enterprises: how to integrate WebRTC-centric communication with existing systems such as corporate directories, communication infrastructure and intranet websites, and how to traverse media paths across enterprise firewalls. Vclick is a simple and easy to use web-based video collaboration application that enables click-to-call from other webpages. SecureEdge is a network border traversal system for policy and security enforcement, and consists of a secure media relay that sits at the network border or in the cloud. A browser extension in the enterprise user’s device transparently injects this media relay in every WebRTC media path needing to traverse the enterprise network edge to enable authenticated border traversal without help from the websites hosting the WebRTC pages. We attempt to generically support WebRTC in enterprises on a variety of application scenarios instead of creating another fragmented communication island. The challenges faced and techniques used in our proof-of-concepts are likely extensible to other enterprise WebRTC scenarios using the emerging HTML5 technologies.”

Keywords: WebRTC, enterprise communication, secure edge, browser extension, VoIP, video call, firewall traversal, media relay.


ALICE: Avaya Labs Innovations Cloud Engagement

Although we know how to create cloud-hosted services, platforms and infrastructures, little is known about cloud hosted communication and collaboration services, especially to enable multi-tenancy and self-service. This research focuses on the challenges of hosting cloud services for customer trials, where resources are limited to make existing services cloud ready or to fit a specific platform. This is based on my work on creating a cloud portal to host research-oriented early or pre-product services on the cloud, and identifying common themes and techniques.

“We present the architecture and implementation of our enterprise cloud portal named ALICE, Avaya Labs Innovations Cloud Engagement, which provides self-service access to service developers, tenants, and users to various communication and collaboration applications. Currently ALICE is used for field testing of advanced research prototype services based on technologies such as WebRTC and HTML5. This paper describes the current portal and extensions to support multi-tenancy.

We describe challenges in creating a self-service multi-tenant SaaS (software-as-a-service) portal to host communications and collaboration applications for small to medium scale businesses. The challenges faced and the techniques used in our architecture relate to security, provisioning, management, complexity, cost savings and multi-tenancy, and are applicable and useful to other cloud deployments of diverse enterprise applications.”

Keywords: Cloud, system architecture, portal, multi-tenancy, Internet telephony, enterprise communications, web collaboration.


Vclick: endpoint driven enterprise WebRTC

One of my earliest projects at Avaya Labs was on creating a light-weight service for a wide range of web communication and collaboration scenarios. Vclick is a collection of many loosely coupled apps that run the app-logic in the browser or endpoint, and mash up at the data level. It contains applications for video call, conferencing, video presence, text chat, click-to-call, screen sharing, shared notepad and whiteboard, and so on. It goes against the conventional web wisdom of thin-client, single-page-apps, or rigid GUI, and presents a new software architecture to create robust endpoint driven apps. The paper is really about how to keep the endpoints smart and network (or service) dumb in the context of collaboration applications.

“We present a robust, scalable and secure system architecture for web-based multimedia collaboration that keeps the application logic in the endpoint browser. Vclick is a simple and easy-to-use application for video interaction, collaboration and presence using HTML5 technologies including WebRTC (Web Real Time Communication), and is independent of legacy Voice-over-IP systems. Since its conception in early 2013, it has received many positive feedbacks, undergone improvements, and has been used in many enterprise communications research projects both in the cloud and on premise, on desktop as well as mobile. The techniques used and the challenges faced are useful to other emerging WebRTC applications.”

Keywords: WebRTC, enterprise communication, web video conferencing, resource-based architecture, web applications.


User reachability in multi-apps environment

With numerous "walled-garden" services and apps emerging because of WebRTC, there is a need to identify the best way to reach your contacts, irrespective of which app or service they are on. This systems paper describes my work on implementing a mobile (and desktop) app called Strata Top9, to quickly reach your important contacts. It really is a front-end to launch other applications. Unlike existing presence based systems, we propose to iterate during call initiation. The paper presents the software architecture and design decisions along with several motivational use cases of our project. It also details the concept of dynamic contacts, and endpoint driven caller and receiver policies.

“Recent progress in web real-time communication (WebRTC) promotes multi-apps environment by creating islands of communication apps where users of one website or service cannot easily communicate with those of another. We describe the architecture and implementation of a multi-platform system to do user reachability in multiple communication services where users decide how they want to be reached on multiple apps, e.g., in an organization that has voice-over-IP, web conferencing and messaging from different vendors. Our architecture separates the user contacts from reachability apps, supports user and endpoint driven reachability policies, and has several independent and non-interoperable WebRTC-based apps for two-way and multiparty multimedia communication. Our flexible implementation can be used for enterprise or personal communications, or as a white-labeled app for consumers of a business.”

Keywords: system design; mobile app; user reachability; multiservices; VoIP; WebRTC; caller policy.


Developing WebRTC-based team apps with a cross-platform mobile framework

The ability to write-once-run-anywhere still eludes many app developers. Luckily several cross-platform development tools exist. However, creating cross-platform communication and collaboration related apps is still challenging. This paper presents my implementation work on creating cross-platform apps. In particular, four types of platforms - web app on PC and mobile, and installed app on PC and mobile - are considered, and seven different apps are covered for a wide range of enterprise use cases. Techniques and steps for creating such cross-platform apps are presented along with lessons learned based on practical experience. Additionally, considerations for iOS and wearable Glass devices are presented.

“We present lessons learned in developing cross platform multi-party team applications. Our apps include a range of communication and collaboration scenarios: document and content sharing in a team space, an agent-based meeting helper, phone number dialer via a voice-over-IP (VoIP) gateway, and multi-party call in peer-to-peer or client-server mode. We use web real-time communication (WebRTC) to enable the audio and video media paths in the apps. We use frameworks such as Chrome Apps and Apache Cordova to create apps that can be accessed from a browser, or installed on a desktop, mobile device, or wearable. The challenges and techniques described in our paper related to audio, video, network, power conservation and security are important to other developers building cross-platform apps involving WebRTC, VoIP and cloud services.”

Keywords: HTML5, Apache Cordova, Chrome Apps, WebRTC, Mobile, Cloud, Wearable.


WebRTC vs. SIP/SDP

Every time a new protocol appears in the protocol jungle of multimedia communications [2], people attempt to compare and contrast it with existing established protocols, such as SIP. With WebRTC as the new kid on the block, you can find several attempts to compare SIP and WebRTC [3][4][5][6][7]. Depending on which camp the comparison originates from, you may find flavors of favoritism, or unintentional downplay of the importance of the other camp.

This article presents my point of view, hopefully unbiased, on this topic. Let me start by saying that SIP and WebRTC are different, and it is not fair to compare them without an established context.

Complex protocol vs. simple API

SIP is a protocol, not an API; whereas WebRTC is an API, with an associated set of protocols.
Consider that TCP is a protocol but socket is an API. TCP has complex state machinery to provide a reliable, bi-directional, end-to-end byte stream: it assumes that intermediate routers and networks may lose, delay or reorder packets, and recovers from those problems at the endpoints. On the other hand, the socket API hides all the complexity of TCP and provides an easy to use abstraction. Luckily in the case of TCP, developers only work on top of the standardized socket API. Whereas in the case of SIP, they usually end up dealing with the protocol directly, or work with (partly broken) implementations of intermediate network elements such as a SIP proxy, or use (poorly defined) APIs from third-party SIP libraries. Depending on what level of abstraction is exposed in the SIP library, the development effort can range from very simple to very complex. In the case of WebRTC, the API is being defined and standardized, and is intended as a high level abstraction suitable for JavaScript/Web developers.
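To make the protocol-vs-API point concrete, here is a minimal sketch in Python of a loopback echo over TCP. The function names echo_once and roundtrip are made up for this illustration. Nothing in the code deals with sequence numbers, retransmission or connection state; all of that stays hidden behind the socket calls.

```python
import socket
import threading


def echo_once(server: socket.socket) -> None:
    """Accept one connection and echo back the first chunk received."""
    conn, _addr = server.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data)


def roundtrip(message: bytes) -> bytes:
    """Send a short message over loopback TCP and return the echoed reply."""
    # Port 0 asks the operating system to pick any free port.
    with socket.create_server(("127.0.0.1", 0)) as server:
        port = server.getsockname()[1]
        worker = threading.Thread(target=echo_once, args=(server,))
        worker.start()
        with socket.create_connection(("127.0.0.1", port)) as client:
            client.sendall(message)
            reply = client.recv(1024)
        worker.join()
    return reply


if __name__ == "__main__":
    print(roundtrip(b"hello"))
```

A SIP developer rarely gets an equivalent of these few standardized calls, which is the contrast drawn above.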

To list the differences between SIP and WebRTC - one could say that SIP is programming language agnostic but WebRTC is not, and must use JavaScript, or that a SIP application can run on any device or platform, but a WebRTC application requires a browser to host the app. These are superficial differences - due to the core difference in the nature of SIP vs. WebRTC, i.e., protocol vs. API.

SIP would have been perceived as much simpler if there were well defined APIs from the start. But that would have limited the flexibility (and freedom) of what SIP could be used for. More on the flexibility and freedom is discussed later in this article.

One thing to note, however, is that WebRTC does use a set of complex protocols behind the scenes, many of which are shared by a SIP system, as we describe next. Although a web application using WebRTC is simple due to the high level API, the implementation of WebRTC in a browser itself is not that simple!

Protocol stack

SIP is an established protocol for Internet communication. WebRTC is an emerging API intended to be provided by the browsers and consumed by the applications in JavaScript. It has an associated set of protocols, some of which overlap with those used in a SIP system. The title of this article indicates a comparison of WebRTC with SIP/SDP, and not just with SIP. This is because SIP is often only a part of the overall puzzle in a real world communication system, as shown in the following diagram taken from my research paper [1]. In fact, the role that SIP plays in a real system is currently outside the scope of the WebRTC specification, which means that it does not really make sense to compare SIP alone with WebRTC protocols, but it is okay to compare a SIP system with one based on WebRTC.

Comparison of typical SIP vs WebRTC application stack. The dotted line separates what is programmed by the application developer and what is provided by the platform.

Although most SIP systems use SDP, it is not a requirement, and some SIP systems may use something else if needed. The same holds true for other protocol elements such as RTP for media transport. However, within the context of a voice and video call application, you will likely see the above mentioned protocols in effect. On the other hand, the WebRTC effort specifies the mandatory set of protocols including SDP, SRTP and ICE, and also, the minimum list of mandatory audio and video codecs, which must be implemented by a WebRTC capable browser. In the diagram above, while SIP is only one block in the system, WebRTC specifies multiple blocks including the API, codecs, SRTP/RTP/RTCP, ICE and to some extent SDP. While SDP is used in the WebRTC API, an application is free to change it to anything else when exchanging the session information with the peer, if needed. Defining and mandating more things in WebRTC makes interoperability a little easier, unlike SIP where many such things are left to the implementation. We will talk about interoperability later in this article.
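As an illustration of an application changing SDP to something else on the wire, the sketch below converts SDP text to a JSON list of type/value pairs and back. The sample SDP is a trimmed, made-up fragment, and the function names are hypothetical. A list is used rather than an object because the order of SDP lines is significant.

```python
import json

# Trimmed, made-up fragment; a real browser offer has many more lines.
SAMPLE_SDP = (
    "v=0\r\n"
    "o=- 4611731400430051336 2 IN IP4 127.0.0.1\r\n"
    "s=-\r\n"
    "t=0 0\r\n"
    "m=audio 9 UDP/TLS/RTP/SAVPF 111\r\n"
    "a=rtpmap:111 opus/48000/2\r\n"
)


def sdp_to_json(sdp: str) -> str:
    """Encode each '<type>=<value>' line as a [type, value] pair."""
    pairs = []
    for line in sdp.splitlines():
        if not line:
            continue
        kind, _sep, value = line.partition("=")
        pairs.append([kind, value])
    return json.dumps(pairs)


def json_to_sdp(payload: str) -> str:
    """Rebuild the SDP text, restoring the CRLF line endings."""
    return "".join(f"{kind}={value}\r\n" for kind, value in json.loads(payload))


if __name__ == "__main__":
    wire = sdp_to_json(SAMPLE_SDP)
    print(wire)
    print(json_to_sdp(wire) == SAMPLE_SDP)
```

The receiving side converts back to SDP text before handing it to the browser API, so the browser never sees the alternative format.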

Network elements

The SIP specification defines the behavior of an endpoint (or user agent) as well as various network elements such as proxies and redirect servers. Other SIP extensions define more network elements such as a conference server, a presence server, and so on. With WebRTC, the focus is only on browser-to-browser communication, and only the browser's behavior is defined in the specification. Other elements such as a gateway or media server are free to implement the behavior. However, that part is likely not going to be standardized. This hopefully limits the number of specifications related to WebRTC, and hence limits the resulting complexity.

This does not mean that WebRTC systems will not need such network elements. Ability to call a phone number via a gateway, or to host a multi-party voice mixing or recording on a media server, or to locate the target user via a lookup service are all examples of crucial services required in many real world applications. Lack of standardization means that every vendor will likely define its own version of these elements, creating fragmentation. Such scenarios are actually in line with WebRTC's goal, which aims at creating a media path among browsers visiting the same website, and does not care for inter-website communication. More on interoperability vs. fragmentation is discussed later in this article.

Fortunately, it is entirely possible to create fully functional WebRTC systems without using complex or heavy weight network elements. In fact, many initial demonstrations of WebRTC applications have followed this path with peer-to-peer media flows, along with a very simple signaling/rendezvous service. SIP, too, started out with this model of peer-to-peer media flows with a simple signaling/rendezvous service in the form of a SIP proxy or redirect server. However, practical limitations and real-world deployments have mostly thrown this idealistic model out of the window, and have largely adopted a "managed" network centric model with heavy weight network elements in the form of back-to-back user agents, session border controllers, and call stateful proxies. It is likely that similar real-world forces will lead to a similar network centric deployment model in the case of WebRTC. Only time will tell!

Message flow

To establish a multimedia call, three pieces of information are exchanged between the two parties - (1) an "invite" from caller to callee as an intention to establish a call, which the callee may answer, decline or ignore, (2) session description containing capabilities of each party, (3) transport candidates for establishing connectivity on the media path, hopefully peer-to-peer. Let us call these steps (1) "invite", (2) session description exchange and (3) transport candidates exchange.

In a WebRTC-based application, these may be separated out into different phases or may be combined together, as determined by the web application. In fact, (1) above is not part of the specification, but is often implemented in a WebRTC call, and (2) and (3) are exchanged between the two parties using an out-of-band channel outside the scope of the WebRTC specification. The term "trickle ICE" refers to separating (3) from (2) in a WebRTC session negotiation, and is the default mode - although it is possible to not use it, as determined by the web application. In a SIP system, all three pieces are usually bundled in a single request-response transaction and are carried in a SIP INVITE and its 2xx response between the two parties, often via one or more intermediate proxies.
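The difference in bundling can be sketched with plain data structures. The message shapes and names below (CallSetup, bundled_messages, trickled_messages) are hypothetical and only illustrate how the same three pieces can travel in one message or in several. In a real SIP call the candidates would typically sit inside the SDP body rather than in a separate field.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CallSetup:
    invite_to: str               # (1) whom the caller wants to reach
    session_description: str     # (2) media capabilities, e.g. SDP text
    candidates: List[str] = field(default_factory=list)  # (3) transport candidates


def bundled_messages(setup: CallSetup) -> List[dict]:
    """SIP-like: a single request carries all three pieces."""
    return [{
        "type": "invite",
        "to": setup.invite_to,
        "description": setup.session_description,
        "candidates": list(setup.candidates),
    }]


def trickled_messages(setup: CallSetup) -> List[dict]:
    """WebRTC-like with trickle ICE: each piece travels on its own."""
    messages = [
        {"type": "invite", "to": setup.invite_to},
        {"type": "description", "description": setup.session_description},
    ]
    messages += [{"type": "candidate", "candidate": c} for c in setup.candidates]
    return messages


if __name__ == "__main__":
    setup = CallSetup("bob", "v=0 ...", ["host 10.0.0.5:5000", "relay 203.0.113.7:3478"])
    print(len(bundled_messages(setup)), len(trickled_messages(setup)))
```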

The separation of these pieces of information in WebRTC makes it more flexible, e.g., an application can invent any kind of user lookup service over any protocol (HTTP POST, WebSocket, XMPP, Google Channel, LDAP, hard-coded, or whatever, or even SIP [1]). Some applications may not even need such a lookup service, e.g., a virtual presence application where everyone who visits the website is able to see and hear others on that site. Unlike this, in SIP, a user lookup service must be accessible via SIP.
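For the virtual presence case, where no lookup service is needed, a rendezvous service can be as small as the in-memory sketch below. The Room class is a hypothetical stand-in for whatever the website uses (HTTP POST, WebSocket, and so on). It only relays opaque messages between visitors of the same page and knows nothing about SDP or ICE.

```python
from typing import Dict, List, Tuple


class Room:
    """Hypothetical in-memory rendezvous point for visitors of one web page."""

    def __init__(self) -> None:
        self.inboxes: Dict[str, List[Tuple[str, dict]]] = {}

    def join(self, visitor: str) -> List[str]:
        """Register a visitor and return who else is already present."""
        others = [name for name in self.inboxes if name != visitor]
        self.inboxes.setdefault(visitor, [])
        return others

    def send(self, sender: str, receiver: str, message: dict) -> None:
        """Queue an opaque signaling message for another visitor."""
        if receiver not in self.inboxes:
            raise KeyError(f"{receiver} is not in the room")
        self.inboxes[receiver].append((sender, message))

    def receive(self, visitor: str) -> List[Tuple[str, dict]]:
        """Drain and return the pending messages of a visitor."""
        pending = self.inboxes[visitor]
        self.inboxes[visitor] = []
        return pending


if __name__ == "__main__":
    room = Room()
    room.join("alice")
    for peer in room.join("bob"):
        room.send("bob", peer, {"type": "description", "description": "v=0 ..."})
    print(room.receive("alice"))
```

Each new visitor simply sends a session description to everyone already present, so the "invite" step disappears entirely.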

Those who remember H.323, and its multiple steps of call setup in version 1, may argue that we have come back a full circle - started with multiple steps call setup, invented H.323 fast connect and SIP for single step call setup, and now back to multiple steps to establish a WebRTC call. However the situations are different in H.323 vs. WebRTC - in H.323 all the steps were done by the same entity, and after the initial "invite" step, the subsequent ones were follow through, with no perceived benefit to keep them separate. In the case of WebRTC, the "invite" step is outside the scope of WebRTC specification, hence it makes sense to keep it separate. The last transport candidates exchange step is incremental, with real benefit to keep it separate from session description exchange: the first peer-to-peer flow that works gets used as soon as possible, while better peer-to-peer flows can still be detected later. In the case of H.323, there is no equivalent of this step in my opinion, because the second and third steps are about exchanging media capabilities, and then establishing fixed set of logical channels based on those media capabilities.
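The incremental nature of the candidates exchange can be sketched as follows. The priority formula and the type preference values are the recommended ones from the ICE specification. Everything else is a simplification assumed for illustration: real ICE ranks candidate pairs rather than single candidates, and the function selected_over_time is hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

# Recommended type preferences from the ICE specification.
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def candidate_priority(kind: str, local_preference: int = 65535,
                       component_id: int = 1) -> int:
    """priority = 2^24 * type pref + 2^8 * local pref + (256 - component id)."""
    return (TYPE_PREFERENCE[kind] << 24) + (local_preference << 8) + (256 - component_id)


def selected_over_time(arrivals: Iterable[Tuple[str, bool]]) -> List[Optional[str]]:
    """After each trickled candidate, report which candidate type is in use.

    Each arrival is (candidate type, whether its connectivity check passed).
    """
    best: Optional[str] = None
    history: List[Optional[str]] = []
    for kind, works in arrivals:
        if works and (best is None or candidate_priority(kind) > candidate_priority(best)):
            best = kind
        history.append(best)
    return history


if __name__ == "__main__":
    # The relayed path works first and is used; a better path replaces it later.
    print(selected_over_time([("host", False), ("relay", True), ("srflx", True)]))
```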

The SIP specification and many of its extensions, such as call transfer, interoperability with phone, or capabilities of user agents, actually deal with only the first piece of information mentioned above, i.e., how to reach the target user or device, and are often related to the notion of a "call". In the WebRTC specification, there is no notion of a "call", or signaling to look up the target user. It is assumed that such constructs and concepts belong to the web application or the website that is hosting the application. This gives flexibility to the application developer on how to implement advanced features such as conferencing or call transfer. With more flexibility comes more power, and more non-interoperability, described next.

Interoperability vs. fragmentation

In the past fifteen years, non-interoperability has been a major problem in the SIP community. Although it is often easy to achieve interoperable voice call signaling, it is rather difficult to make sure that those hundreds or thousands of SIP systems out there will implement every little detail consistently as per the specification. Moreover, there is no mandatory codec defined by SIP, which means that two perfectly compliant SIP endpoints may not talk to each other.
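The codec problem is easy to see in a sketch. The codec lists below are made up, and BASELINE is an illustrative assumption standing in for a mandated minimum set, not a statement of what any particular endpoint ships.

```python
from typing import List

# Illustrative assumption: a baseline that every endpoint is required to offer.
BASELINE = ["opus", "PCMU"]


def common_codecs(offer: List[str], answer: List[str]) -> List[str]:
    """Codecs usable for the call, kept in the offerer's preference order."""
    supported = {codec.lower() for codec in answer}
    return [codec for codec in offer if codec.lower() in supported]


if __name__ == "__main__":
    phone_a = ["G729", "speex"]
    phone_b = ["PCMU", "iLBC"]
    # Both endpoints may be perfectly compliant, yet share nothing.
    print(common_codecs(phone_a, phone_b))
    # With a mandated baseline on both sides, the intersection is never empty.
    print(common_codecs(phone_a + BASELINE, phone_b + BASELINE))
```

An empty result means the signaling can succeed while the call still has no usable media.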

With WebRTC, you only need to achieve interoperability among the major browsers. WebRTC easily enables communication between users visiting the same website, where the application fills the void of signaling, user lookup, and the like. Since signaling is outside the scope of the specification, interoperability between users visiting separate websites is not readily available and requires one-to-one federation between the two websites or web applications. This fragmented communication behavior is in line with how existing web applications behave, i.e., they tend to create islands of users, where users within an island can talk to each other, but not beyond. Have you ever thought of being able to reach your LinkedIn contact from your Facebook page? I have, and have tried to do something about it [8][9].

The fundamental difference of triangular vs trapezoidal call model in WebRTC vs SIP, respectively -- which means the two parties are connected to the same server in WebRTC, whereas two parties can potentially be on separate servers in SIP -- results in major differences in interoperability requirements.
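The two call models can be written down as signaling paths. This is only a sketch with hypothetical names. The point is that the trapezoid contains a server-to-server hop, which is where cross-domain interoperability has to be specified, while the triangle has none.

```python
from typing import List


def signaling_path(caller: str, caller_server: str,
                   callee: str, callee_server: str) -> List[str]:
    """Hops that a signaling message takes from caller to callee."""
    if caller_server == callee_server:
        # Triangle: both parties hang off the same website or server.
        return [caller, caller_server, callee]
    # Trapezoid: the two servers must speak a common protocol to each other.
    return [caller, caller_server, callee_server, callee]


def needs_server_to_server_protocol(path: List[str]) -> bool:
    """True when the path includes a hop between two different servers."""
    return len(path) == 4


if __name__ == "__main__":
    print(signaling_path("alice", "site-a.example", "bob", "site-a.example"))
    print(signaling_path("alice", "sip.a.example", "bob", "sip.b.example"))
```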

The triangular way of WebRTC is not how the VoIP community perceived open communication to be. From that perspective, even though WebRTC is an emerging open standard, it will create closed walled gardens of communication islands - leading to communication fragmentation. Unfortunately, even SIP-based systems tend to create fragmented islands, due to non-standard behaviors in implementations or different implementation choices (e.g., codecs) by different vendors, so that equipment from one vendor rarely talks freely with that of another. This tends to create an ecosystem where a customer must purchase phone endpoints and servers from the same vendor, because it is often hard to achieve true mix-and-match interoperability out-of-the-box.

The problem is more relaxed in the case of WebRTC, because there are only a few major browsers, and the protocol specification does not involve any signaling or media server, except for generic STUN and TURN servers, when appropriate. A fair comparison would have been possible if there were hundreds or thousands of browsers implementing WebRTC, similar to the plethora of SIP implementations that exist today. Or if there were only five major vendors creating SIP user agents. With SIP, the protocol was invented first, followed by an attempt to deploy the user agents and servers. With WebRTC, we already had user agents (browsers) and servers (web servers, STUN/TURN servers), and we put together a set of protocols to let them talk in real-time.

By leaving signaling outside the scope, the WebRTC specification escapes many problems related to non-interoperability. In practice, fragmentation seems to be unavoidable, largely due to business reasons -- or lack of incentive for one website to let its users talk to those on another website. The benefit of better interoperability, maintainability and usability far outweighs the cost of fragmentation for many websites and web applications. For some applications that require cross-site communication, a gateway can do the job, without being part of the specification.

The biggest non-interoperability and fragmentation threat to WebRTC is that some major browser vendors may not implement it at all, or may implement a variant of the specification, making it non-interoperable with other browsers. Hopefully, market pressure will stabilize the interoperability among major browsers. Or developers will find alternatives (plugins? gateways?) to fill the gap. Again, only time will tell!

Flexibility and freedom

We have mentioned flexibility at the protocol and application level. In summary, SIP provides greater flexibility at the protocol level and WebRTC at the application level. With greater flexibility comes more freedom to innovate, and unfortunately, more problems in interoperability.

First, SIP provides flexibility at the protocol level, e.g., ability to use non-SDP based session description, or ability to use RTP vs SRTP vs ZRTP based media transport, or ability to add newer extensions for media path or call signaling. WebRTC mandates many such elements in the protocol, e.g., use of SDP, SRTP, ICE and so on. In future, if a newer or better media transport, NAT traversal or session description is invented, the specification will need to be changed to incorporate that. However, upgrading the specification when there are only a few browser implementations should not be a nightmare, especially if the upgrade is backward compatible. Once the WebRTC specification gets ratified, it leaves little room for an implementation to change or improve. With SIP's flexibility and resulting freedom, people have created many hundreds of different implementations and use cases using the same base protocol. WebRTC lacks the ability, for example, to implement a non-trivial media path, such as application level multicast or local LAN broadcast, or to let the web application configure differential service for different media flows, or to let the web application inject text-to-speech in a media path.

The protocol flexibility of SIP comes at a price - no guarantee of common features such as security or common codecs in different implementations. For example, the voice path in one SIP call could be unencrypted and in another could be encrypted - determined by the implementation choice. Whereas in WebRTC, many such features are well defined and mandated, as described in the next section.

Keeping the signaling part out of scope and defining the API give good application level flexibility to WebRTC systems. As mentioned before, features such as call transfer, conferencing or user lookup are the application's responsibility. This allows a social website to implement these features differently from a click-to-call provider's website, for example. Given the numerous websites, and their potential for incorporating real-time in-browser communication, such flexibility is very useful and highly desired.

Protocol differences in SIP vs WebRTC system

Both SIP and WebRTC systems use a common set of protocols, e.g., SDP for session description, ICE for NAT traversal, and (S)RTP for media transport. The following chart from my paper [1] summarizes the interoperability differences.

Interoperability differences in SIP vs WebRTC; (o) means optional.

In practice, the session description (SDP) and media path defined by WebRTC require many extensions that are not found in existing SIP systems, and hence it is very hard to transparently interoperate between the two without using a media gateway. Eventually, one would hope that existing SIP systems would adapt to incorporate these new extensions. But history has told us that this is next to impossible - given the large number of SIP implementations, and given the availability of media gateways. Thus, a browser-to-SIP call will almost always involve a media gateway, unless you sell your own SIP endpoint.
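A gateway or signaling adapter could run a quick check like the one sketched below to see why a legacy SDP will not work directly with a browser. EXPECTED_ATTRIBUTES is an illustrative subset assumed for this example (ICE credentials, DTLS fingerprint, RTP/RTCP multiplexing), not the full list of what a browser requires, and the sample SDP strings are made up.

```python
from typing import List

# Illustrative subset of SDP attributes that a browser expects to find.
EXPECTED_ATTRIBUTES = ("ice-ufrag", "ice-pwd", "fingerprint", "rtcp-mux")


def missing_webrtc_attributes(sdp: str) -> List[str]:
    """Return the expected attribute names that do not appear in the SDP."""
    present = set()
    for line in sdp.splitlines():
        if line.startswith("a="):
            # 'a=name:value' or the flag form 'a=name'
            present.add(line[2:].split(":", 1)[0])
    return [name for name in EXPECTED_ATTRIBUTES if name not in present]


if __name__ == "__main__":
    legacy_sdp = (
        "v=0\r\n"
        "m=audio 49170 RTP/AVP 0\r\n"
        "a=rtpmap:0 PCMU/8000\r\n"
    )
    print(missing_webrtc_attributes(legacy_sdp))
```

Anything reported as missing has to be supplied by the SIP endpoint itself or terminated at a media gateway.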

Frequently asked questions (FAQ)

I have heard these. You will likely hear them too.

Are SIP and WebRTC competing technologies? Yes and no. It depends on the context. As an access protocol to connect a client to an established voice/video service or infrastructure, yes. For anything else, most likely no.
Is WebRTC a subset or refinement of SIP? No. WebRTC has certain elements and features that are not present in SIP/SDP. For example peer-to-peer data channel, or trickle ICE, or programming APIs.
Is WebRTC a superset or generic case of SIP? No. A SIP-based system includes elements and features that are not present in a WebRTC browser, e.g., well defined network elements, call event notifications, or user lookup service.
Is WebRTC similar to P2P-SIP? or Jingle? or RTMFP? or name-your-favorite-here? No, for all. In theory, WebRTC has some overlap with session description and media path of a standard SIP system. In practice, WebRTC is very different from any of the existing systems we had.
Does SIP make WebRTC better? What about the other way? Depends on who you talk to. A SIP proponent would argue that cross-site communication and PSTN interoperability requires using SIP at the core, or that SIP in JavaScript is good enough for signaling of WebRTC. A WebRTC proponent would argue that replacing SIP with WebRTC will improve on security, interoperability and call quality. In my opinion, these arguments are either too narrowly focussed or a compelling counter argument can easily be made. So my answer is "no" on both counts.
Can SIP and WebRTC work together, instead of competing? Absolutely. Firstly, they do not compete except in a very narrow context. And secondly, they do work together - the SDP generated by WebRTC can be carried in SIP, directly from the browser or via a gateway. See [1] for alternatives.
Will WebRTC remain at the end, and must SIP be at the core? No. It is entirely possible to create WebRTC capable network elements such as media servers and gateways without depending on SIP, that allow creating voice/video infrastructure with WebRTC at the core. It is also possible to create many existing use cases with peer-to-peer media path of WebRTC, without using these network elements at all.
Does WebRTC have better voice quality than SIP? No, because SIP does not specify a fixed set of voice codecs or voice quality. At present, when many SIP systems use G.711, G.729 or Speex, the voice quality of Opus used in WebRTC is superior to those SIP systems. However, nobody has stopped those SIP systems from using Opus (and related media engine) in their voice stack.
Will WebRTC improve voice quality of my PC-to-phone application? No. Unless you are talking about smart-phone with WebRTC running on phone also. When voice path gets translated to a phone call, the quality will likely be reduced to G.711 or G.729 or whatever the gateway provider has deemed suitable on the phone network.
Will WebRTC give better voice quality for my conferencing application? Depends. See previous question. For centralized media path (with audio mixer), it requires transcoding, so voice quality will likely suffer. Optimization such as using peer-to-peer voice path, or reflecting loudest-N streams to client without mixing may preserve the end-to-end voice quality in conferencing.
Why does browser not implement SIP directly? For browser-to-browser use case, SIP is not needed. For anything else where one end is a browser, a gateway can do the job of translating to SIP if needed. And WebRTC is focussed on browser-to-browser use case only. Having said that, it is possible to do SIP in JavaScript [1] running in your browser, or to implement your own plugin or browser with built-in SIP stack and related APIs.
What does it mean that a SIP conference server supports WebRTC? It could mean that it supports a browser end-point to participate in a multi-party conference over WebRTC, or it could mean that it uses Google's WebRTC media engine or the OPUS/VP8 codec for the voice/video stack but requires SIP/H.323 endpoints to connect. Be careful!
Is WebRTC better than SIP/SDP? Hard to say. WebRTC attempts to fill the gap for a very specific problem, but requires robust, secure and high quality implementation, and defines simple high level API. On the other hand, SIP leaves many such things open to implementers. So for the specific problem, WebRTC is better, some might say.
Between WebRTC and SIP, who will win the race? There is no race. Or if there is one, then WebRTC and SIP are racing in different directions, at 90 degrees to each other. Depending on where you look from, or which axis has the finish line, either of them could win. But the truth is, WebRTC is for browser-to-browser communication, where SIP/SDP/RTP is not feasible currently -- so there is no race. One could argue that with more browsers supporting WebRTC than there are SIP endpoints, or with more JavaScript developers than SIP implementors, WebRTC is likely to win the race. No comments on that. A more correct analogy of a race, in my opinion, is in specific contexts, e.g., between WebRTC of browser and RTMP + RTMFP of Flash Player for browser-to-browser communication, or between SIP phone vs. WebRTC browser for client access to backend voice/video service or infrastructure.
What is the real difference between WebRTC and SIP? Ah, glad you asked! If you have already read the rest of this article, you will be surprised to find that unlike all the differences mentioned before, the real difference is who the most important customers of WebRTC vs. SIP are - for WebRTC it is the web developers community, and for SIP it has been telecom providers and equipment vendors.

Conclusions

This article has attempted to present several factors comparing and contrasting SIP/SDP vs. WebRTC systems, without really diving in to the specific protocol differences of how SDP/RTP are used in SIP vs. WebRTC. This is because, while WebRTC specifies the list of profiles and extensions to SDP/RTP, the SIP system is free to choose whichever it likes. So any such comparison will end up being one between WebRTC and SIP of vendor-X.

There are similarities and differences between the two types of systems. In the end it is about the applications and user experiences, rather than protocols and APIs. In practice, it is easy for an established SIP-based IVR system to add WebRTC access, compared to that for a newbie WebRTC system to implement full fledged IVR functionality. The same holds true for many existing telecom, web communication and Internet conferencing applications. Thus, many existing SIP-based applications will likely also adopt WebRTC for access, if there is a demand. And similarly, new pure WebRTC-based web applications will likely incorporate SIP gateway-ing to reach out to non-browser users, if there is demand. SIP or WebRTC itself is just a small piece of the puzzle in the overall system or application. Nevertheless, WebRTC attempts to fill a void that has existed for a really long time in the SIP-based communication world, for which past attempts such as using plugins have largely been unsuccessful, especially on mobile platform.

Three Problems in Interoperating with H.264 of Flash Player

H.264 decoding has been part of Flash Player since version 9, but H.264 encoding was only recently added in version 11. Once Flash Player 11 beta was out I started looking into integrating video translation in the SIP-RTMP gateway project. For a Flash-to-Flash video conference you do not need to understand the problems related to H.264 in Flash Player because everything is taken care of behind the scenes by Flash Player. Adding H.264 support in the flash-videoio project was relatively straightforward. However, if you are building your own translator to interoperate video between Flash Player and some other application, you will need to understand these problems.

1) The first problem is that Flash Player doesn't enable H.264 even for decoding if the RTMP connection does not use the new-style "secure" handshake. In the older version, handshaking with bytes containing zeros worked, but it does not work when using H.264. Eventually I found out about this by reading an open-source-flash (osflash) forum post and incorporated it in my gateway.

2) The H.264 encoder generates some sequence headers (called SPS and PPS) which are essential in decoding the rest of the video data packets. The same is true with AAC audio codec. In particular in live H.264 publish mode, Flash Player generates periodic SPS/PPS packets so the other Flash Player (or SIP phone) can join the call later and still be able to start decoding the stream. However, some existing SIP video phones generate the sequence packets only once at the beginning. The SIP-RTMP gateway needed to be modified to cache the sequence packets received from non-Flash Player client and re-send them with correct timestamp to the Flash Player client that joined the stream late.

3) Looks like Flash Player 11.0 changed something related to buffering of live stream, which causes problems if the SIP side generates multiple slice NALU (primitive data units in H.264) per frame. The Flash Player itself generates one NALU per frame, however some existing SIP video phones (e.g., Bria 3) generate old-style multiple slice per frame and one NALU per slice and cannot be decoded and displayed in Flash Player 11 in live mode. You can read more about the problem. This is not a problem for buffered playback though. (update on 12/12/2011 -- I can verify that this bug has been fixed in Flash Player 11.2.202.96 and video call works fine now between Bria 3 and Flash Player via my SIP-RTMP gateway)

Ekiga SIP phone uses the new-style RTP mechanism for fragmenting a full H.264 frame instead of using multiple slices in H.264 encoding. This can be easily translated to Flash Player and works with my SIP-RTMP gateway. However, Ekiga has another problem in incorrectly interpreting RTP timestamp of received stream which makes it play the stream much slower.
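
The caching described in the second problem above can be sketched roughly as follows. This is only an illustration and not the actual SIP-RTMP gateway code: the class name ParameterSetCache and its methods are hypothetical, and the RTMP packaging of the cached data is left out. The only H.264 facts it relies on are that the NAL unit type is the low 5 bits of the first NALU byte, and that types 7 and 8 are SPS and PPS.

```python
SPS, PPS = 7, 8  # H.264 NAL unit types: sequence and picture parameter sets


def nal_type(nalu: bytes) -> int:
    """Return the NAL unit type, i.e., the low 5 bits of the first byte."""
    return nalu[0] & 0x1F


class ParameterSetCache:
    """Hypothetical helper: remembers the latest SPS and PPS from a sender."""

    def __init__(self):
        self._latest = {}  # NAL unit type -> most recent NALU bytes

    def observe(self, nalu: bytes) -> None:
        """Call this for every NALU received from the non-Flash side."""
        if nalu and nal_type(nalu) in (SPS, PPS):
            self._latest[nal_type(nalu)] = nalu

    def replay(self, timestamp: int):
        """Return (timestamp, nalu) pairs to send first to a late joiner.

        The list is empty until both SPS and PPS have been seen.
        The cached units are re-stamped with the caller's current timestamp,
        SPS first and then PPS.
        """
        if SPS not in self._latest or PPS not in self._latest:
            return []
        return [(timestamp, self._latest[SPS]), (timestamp, self._latest[PPS])]
```

When a Flash Player client joins late, the gateway would call replay() with the current stream timestamp and send the result before forwarding any further video data.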

Lessons in starting a software project

This article presents my thoughts on DOs and DONTs of starting a new software project. Many lessons listed in this article are already well known or common sense, but usually not always followed!

DOs

Brainstorm often: During the initial phases of software growth or even before starting to write a single line of code, you should do several sessions of brain storming. It could be on validating your idea, figuring out competition, predicting the future, picking a programming language, potential learning, etc. This is the difference between carefully planned birth versus unexpected pregnancy. Just because you can write some software, should you? Especially if better alternatives exist?
Use good version control system: Even for the most trivial projects, you should try to use version control system. I like SVN (subversion) for my open-source projects, but if you can afford git, it works better for complex project management. If you are starting an open source project, consider code.google.com for hosting your SVN repository -- it is fast, simple and hassle free. It is like a good home for your baby software.
Document all ideas: When the software is evolving you will have many ideas for new features, doing things differently, or incorporating competing features. Obviously due to lack of resources and time, you won't be able to incorporate all these. But you must document all the ideas, and if possible prioritize them. Keep a single list of ideas. Usually the software will evolve on its own to attract new features. Implement only the most crucial ideas and features, and resist the temptation to add many features.
Few developers during growth: Keep the core set of excellent developers to one, two or at most three when the project is growing. Every major piece of software should have only one excellent developer. This avoids unnecessary friction and induces a feeling of ownership. Software is like a baby, which needs a good parent to raise and grow, before it can mature and face the world. You wouldn't want to raise your software in a foster house where nobody feels ownership, i.e., in an organization with an engineering "team".
Pick the right language and tools: Every programming language has some strengths and weaknesses. Make sure you select the right language, that is quick to develop with and maintain, and works well for your target application. For example, with low-level C/C++ you get performance, and with high-level Java, Python, you get portability. Over the years I have liked Python for most of my applications. Unfortunately, in corporate environment, Java is the pet-child because there are many fold more software developers and managers who know Java well. For modern Internet and web applications, Python, Ruby, Erlang and ActionScript are becoming more popular.
Include testing and defensive programming: To be successful, sooner or later your software project will need to get out of the demo-mode and face the real world. It might become too late at that time to worry about scalability or glaring bugs if those involve redesigning your software. It saves a lot of time and energy to use common techniques such as good logging, unit testing, performance best practices, and defensive programming from day 1. Also maintain an issue tracker and log even the tiniest of issues with your software. Sooner or later you will need to address them.

DONTs

Don't procrastinate: If you have an idea to work on, don't procrastinate. Just get started, write something up, try to get a prototype going. Most successful projects need a complete re-write at least once. So don't be afraid to write throwaway code.
Don't document before coding: While software engineering people will say that you should follow good software process -- writing requirements specification, design document, test cases, etc. -- those can be written later too! Source code is what makes or breaks a software. You can write detailed specification and design documents, after you already have a prototype and want to document it or propose a change. In my experience, any design document written before writing the code is incorrect, and needs to change drastically after the source code is written.
Don't spend time on one-off items: For your software, there are some items which are directly related, and then there are one-off items. For example, for a VoIP client, the protocol implementation, good voice quality, etc., are directly related. On the other hand, having a user signup page, instant messaging text chat, file sharing, etc., are one or two-off items, which are not directly related, but indirectly assist users in VoIP. When you start a project, do not spend time doing one-off items, but work on directly related items first.
Don't wait too long for 1.0 release: There is 80% difference between an 80% complete software and a released software. When you formally release your software, you have to take care of user manual, getting started guide, installer as well as finish those last annoying bugs. In the case of software projects, it is very easy to get started but very difficult to put an end. There is always an endless list of features which needs to be completed before the release, and hence your release never happens. Unless, you make it happen. You will have to make a firm decision about what bugs are important and what can remain as known issues for version 1.0.

FAQ on using Flash Player to make phone calls

I present my answers to some frequently asked questions (FAQ) on using Flash Player to make phone calls.

1. Is Flash Application a good choice for VOIP?

It depends. An RTMP-based application is not a good choice, whereas a new RTMFP application is good for Flash-to-Flash Internet voice applications. For Flash-to-Phone applications, Flash is not a good choice as it is. Flash is good at user interface and ubiquitous availability, but the TCP-based RTMP is not suitable for real-time interactive media, and the UDP-based RTMFP is proprietary so it cannot interwork with existing SIP-based VoIP systems.

Secondly, Flash Player is missing some of the crucial VoIP pieces such as good silence suppression and echo cancellation, so Flash based VoIP client becomes useless without a headset.

Thirdly, although Flash Player supports the open standard Speex audio codec, many existing VoIP providers do not support Speex, and expect only traditional voice codecs like G.729 and G.723.1. So you may also need to incorporate transcoding, which is CPU intensive. Video transcoding is more difficult because of the proprietary video codec in Flash Player.

2. Will there be any performance degradation when the call goes through the following paths? (Flash Client -> Media Server ->RTMP to SIP Converter -> VOIP Server -> VoIP/PSTN Gateway -> PSTN Network -> Telephone)

Yes. If you can avoid intermediaries to cut down on media path latency, it will help a lot. Typically the VoIP Server (or SIP proxy server) is independent of the media path so that doesn't affect. But the media path goes through Media Server (FMS?) and RTMP to SIP converter, and that too over TCP. This degrades the quality a lot. One way could be to remove the "Media Server" from your path by having Flash Client directly connect to the RTMP to SIP converter. Also if you can reduce the network distance between the Flash Client and RTMP to SIP Converter, that will help a lot.

Secondly, with Flash Player you may need to do audio transcoding in your RTMP to SIP converter. This further degrades the performance and limits the scalability of your converter.

3. Some experts say that development in C or C++ is preferred over Flash Player for VoIP calls to phones, for performance reasons. Is that true?

A native VoIP client is preferred over Flash Player because the media packets can go directly from the client to the telephone instead of going through the RTMP to SIP converter. The advantage is because (1) the native client can use UDP instead of restricted to TCP-based RTMP, and (2) the network distance is lower for a direct path. Even if your converter is on good network and close to your client so that the network distance is not much of an issue, the UDP-vs-TCP makes a great impact in improving the quality of native VoIP client implementation over Flash Player.

In general the network component affects the quality more than the programming language. So whether you use C/C++, Python, Java or some other language, it doesn't matter much. But if you can have end-to-end media path over UDP between the two clients, or between the client and the gateway, it is much better. Obviously with Flash Player you cannot have the packets go directly unless your RTMP to SIP converter is local to the Flash Client.

All the existing good quality systems (Skype, GTalk) tend to use end-to-end media-path over UDP as much as possible.

4. There are different media servers available. like Adobe Flash Media server (FMS), Wowza, Red5 etc. Which one is the best choice?

Do you still want to pursue an RTMP to SIP converter? Anyways: In terms of performance I would guess that FMS is the best choice. But if your aim is to build an RTMP to SIP converter, then Red5 is probably the best. FMS is proprietary with not many customization/programming choices available, so you cannot easily integrate a SIP stack or an RTMP to SIP converter with FMS. On the other hand, Red5 is completely open source and in Java, so it allows easy integration with other Java-based SIP stacks. Additionally, you could integrate SIP stacks written in other advanced languages such as Python or Ruby because Red5 allows applications in those languages, whereas an FMS application is restricted to ActionScript 1.0.

I haven't worked with or used Wowza so I cannot comment on that. I have worked with FMS and Red5 though, as well as Python based rtmplite and siprtmp projects.

6. We are now in a confusion whether to develop our VOIP application in Flash technology or QT/Java/C#. What will be your choice?

I think that decision mostly comes from your business case. But I would suggest non-Flash technology if possible and if your business demands very good quality of voice service. If your VoIP client will be assisting your main business, then people won't mind downloading and installing the VoIP client. The advantage Flash has is that it is already available on most people's browser so doesn't require additional download or installation. So if your VoIP application is only a small part of your main web-based business, then Flash technology will be better I think.

Another option is to use the Gmail video/voice architecture described in my article. Basically it uses Flash Player for user interface, but all the networking or voice related processing happens using their native GoogleTalk plugin.

Systems Software Research

A very interesting talk by Rob Pike on "Systems Software Research is Irrelevant".

Some quotes from the slides (by Rob Pike):

"We see a thriving software industry that largely ignored research, and a research community that writes papers rather than software".

"Java is to C++ as Windows is to Macintosh: an industrial response to an interesting but technically flawed piece of systems software."

"Linux's cleverness is not in the software, but in the development model, hardly a triumph of academic CS (software engineering) by any measure."

"It (systems research) is just a lot of measurement: a misinterpretation and misapplication of the scientific method. Invention has been replaced by observation."

"If it didn't run on a PC, it didn't matter because the average, mean, median, and mode computer was a PC."

"To be a viable computer system, one must honor a huge list of large, and often changing, standards: TCP/IP, HTTP, HTML, XML, CORBA, Unicode, POSIX, NFS, SMB, MIME, POP, IMAP, X, ... With so many externally imposed structure, there is little left for novelty."

"Commercial companies that 'own' standards deliberately make standards hard to comply with, to frustrate competition. Academia is a casualty."

"New employees in our lab now bring their world (Unix, X, Emacs, Tex) with them, or expect it to be there when they arrive... Narrowness of experience leads to narrowness of imagination."

"In science, we reserve our highest honors for those who prove we were wrong. But in computer science..."

"How can operating systems research be relevant when the resulting operating systems are all indistinguishable? (Unix is) a victim of its own success: portability led to ubiquity. That meant architecture didn't matter, so there's only one."

"Government funded and corporate research is directed at very fast 'return on investment'... The metric of merit is wrong."

"Measure success by ideas, not just papers and money. Make the industry want your work."

"The future is distributed computation, but the language community has done very little to address that possibility."

My take on the lessons learned, again in the form of quotes:

"Keep the ideas flowing, even if the implementation is not feasible (using existing systems)."

"When thinking of distributed systems -- think beyond web, Browser and Flash Player"

"Something is popular, does not mean it is correct or best way to do that thing."

"Do not publish papers that fake measurement as research."

"Do not take a job that you are not truly motivated about."

"Writing software in Java is like writing detailed machine instructions. Learn Python instead."

Problems in RTMP

Adobe's RTMP or Real-Time Messaging Protocol was recently made available to the public as an open specification as part of Adobe's Open Screen initiative. Most of the protocol had already been implemented in third-party software such as Red5, rtmpy and rtmplite long before this specification became public. In this article I take a critical look at the protocol.

There are three parts in the specification: (1) RTMP chunk stream, (2) RTMP message format and (3) RTMP command messages. At the high level, there are different types of messages such as command, data, audio and video. The last specification describes the high-level RPC (remote-procedure call) for various commands and their responses such as creating a network stream or publishing a stream. The actual formatting and parsing of individual types in a command are specified using AMF (Action Message Format) which comes in two flavors: AMF0 and AMF3. The messages that control the protocol such as setting the window size of lower layer or bandwidth for the peer, are specified in the second specification. Finally, the first specification defines the low level chunk format and separates the high level message stream from low level transport (chunk) stream.

The first (and worst) problem with RTMP is that it is overly complex in doing what it does. One reason is that it was poorly designed without extensibility or competing peer protocols in mind, and later on "fixed" itself to extend new features. As an example of complexity: the chunk stream ID field in the first specification was initially intended to be up to 63 but later extended to 65599. For ID 2 to 63, the first byte stores the value in its least significant 6 bits (the two most significant bits carry the chunk header format). For ID in the range 64-319 the second byte stores the value minus 64, whereas the low 6 bits of the first byte store 0. For values between 64 and 65599, the second and third bytes store the value minus 64 (as the third byte times 256 plus the second byte), whereas the low 6 bits of the first byte store 1. Another example is the timestamp field, which is 24 bits. However, the protocol supports a 32-bit timestamp such that if the value does not fit in 24 bits, then the 24-bit field is set to all 1's, and the actual (extended) timestamp is stored after the header. What is surprising is that a binary protocol called RTP (Real-time Transport Protocol) existed before RTMP was conceived, and had a well defined and well thought-out message layout. For example, RTP has a version field for extensibility, and a 32-bit timestamp. Unfortunately, RTMP didn't learn from the peer protocol and suffered in the form of excessive complexity.
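
To make the complexity concrete, here is a minimal sketch of the chunk stream ID and timestamp encodings as I read them from the specification. The function names are mine and not part of any library, the sketch always picks the shortest form for a given ID, and it treats any timestamp of 0xFFFFFF or more as needing the extended field.

```python
import struct


def encode_basic_header(fmt: int, csid: int) -> bytes:
    """Encode the chunk basic header: 2-bit format plus chunk stream ID."""
    if not 0 <= fmt <= 3:
        raise ValueError("fmt must fit in 2 bits")
    if 2 <= csid <= 63:
        # One byte: format in the top 2 bits, ID in the low 6 bits.
        return bytes([(fmt << 6) | csid])
    if 64 <= csid <= 319:
        # Two bytes: low 6 bits of first byte are 0, second byte is ID - 64.
        return bytes([fmt << 6, csid - 64])
    if 320 <= csid <= 65599:
        # Three bytes: low 6 bits of first byte are 1, then ID - 64 with the
        # low-order byte first, so ID = third * 256 + second + 64.
        value = csid - 64
        return bytes([(fmt << 6) | 1, value & 0xFF, value >> 8])
    raise ValueError("chunk stream ID out of range")


def encode_timestamp(ts: int):
    """Return (24-bit field, extended field) for a 32-bit timestamp."""
    if not 0 <= ts <= 0xFFFFFFFF:
        raise ValueError("timestamp must fit in 32 bits")
    if ts >= 0xFFFFFF:
        # Field is all 1's; the real value follows the header as 4 bytes.
        return b"\xff\xff\xff", struct.pack(">I", ts)
    return struct.pack(">I", ts)[1:], b""
```

Compare this with RTP, where the timestamp is simply a fixed 32-bit field in every packet.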

RTMP is designed to work only on TCP, and cannot work on UDP without several modifications. One well understood conclusion of early Internet multimedia research was that UDP is better suited than TCP for real-time media transport. While RTMP calls itself real-time, it was designed to work solely on TCP. There is no sequence number to handle lost packets, hence it relies on the lower layer (TCP) to provide guaranteed packet delivery. Note that the timestamp cannot be used to detect lost packets. The header optimization does not work if packets are delivered out-of-order. The new RTMFP does work over UDP but has its own set of problems and is not yet an open specification.

RTMP has several unnecessary elements. The chunk stream mechanism is not necessary and actually hurts the performance of real-time media transport, besides complicating the implementation. In particular, for client-server communication where typically number of connections/streams between one client-server pair is one, there is no good advantage of using chunks. It can have advantage in server-to-server communication in avoiding head-of-line blocking of one stream from another. Secondly, the initial bulky handshake of RTMP which, I believe, was intended to measure bandwidth or end-to-end latency, actually is not useful.

Media and control path should be separate. The IETF Protocols such as RTSP or SIP as well as ITU-T protocol H.323 exhibit this separation by delegating the media transport to separate RTP stream. This has several advantages because control path usually travels through application servers that are CPU and memory intensive, and have different scaling requirements than media servers which are bandwidth and disk intensive. Separating media from control path achieves scalability, robustness and distributed component architecture in the system. On the other hand, in RTMP control goes hand-in-hand with media. For example, the application server that handles shared objects and conference state, also handles media storage and transport.

RTMP has inconsistencies. First example is the use of some data types. The stream ID field appears at several places in the protocol, in different forms: 32-bit little endian, 32-bit big endian, and 64-bit floating point number. Second example is incoherency between layers: The default chunk size is 128 bytes. The default real-time audio captured from microphone is streamed to the server using Nellymoser encoded audio packets with two frames per packet. Each Nellymoser encoded frame is 64 bytes. Besides, there is a one byte header indicating the codec type. Thus each packet in the default case is 129 bytes. Thus, under default operation, a Flash Player should immediately change the chunk size from 128 to 129 to accommodate a full audio packet in a chunk (so as to avoid fragmenting it which will be inefficient). Going off by 1 byte indicates that something went wrong while designing the protocol for the default case.
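
The off-by-one arithmetic above can be checked with a few lines. This is only a sketch using the sizes stated in this article; the constant and function names are mine.

```python
import math

FRAME_BYTES = 64          # one Nellymoser encoded frame, as stated above
FRAMES_PER_PACKET = 2     # default frames per audio packet
CODEC_HEADER_BYTES = 1    # one byte header indicating the codec type
DEFAULT_CHUNK_SIZE = 128  # default RTMP chunk size


def chunks_needed(message_bytes: int, chunk_size: int) -> int:
    """Number of chunks a message payload is split into."""
    return math.ceil(message_bytes / chunk_size)


# Size of one default audio packet: 1 + 2 * 64 = 129 bytes.
packet_bytes = CODEC_HEADER_BYTES + FRAMES_PER_PACKET * FRAME_BYTES
```

With the default chunk size, chunks_needed(packet_bytes, DEFAULT_CHUNK_SIZE) is 2, and the second chunk carries a single byte of audio. With a chunk size of 129 it is 1.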

When rest of the world was moving towards open standards such as RTP, Adobe embraced closed and proprietary RTMP. Adobe has been a proponent of proprietary technologies and imposing sub-optimal technologies to the developers and users. Another example is the RTMPE extension for encrypted RTMP communication. Readers are encouraged to read this article: "The major implication of this takedown notice is that Adobe has definitively told us that a fully-compliant free software Flash player is illegal. This is because RTMPE is part of Flash, circumventing RTMPE is illegal (in the US at least), and Adobe will never give a key to a free software project since they cannot hide the key. As a result, Flash cannot truly be a standard..."

Problems due to NATs and firewalls

Network Address Translators (NAT) and firewalls create problems for end-to-end connectivity on the Internet. This not only affects P2P-SIP but also client-server SIP. In this article I post some example numbers to illustrate the point.

These numbers are for example only: suppose there are 10% public Internet nodes, 30% nodes behind good (cone or address restricted) NAT, 30% nodes behind bad (symmetric) NAT and 30% nodes behind UDP blocking firewalls (F). Let's denote these as P=10%, G=30%, B=30%, F=30%. Here the public Internet nodes are typically from universities and research institutes, those behind good NAT are usually from residential DSL/Cable access, those behind bad NAT are partly from residential and partly from enterprise environment, and those behind UDP blocking firewalls are from enterprise and corporate networks. For the purpose of the probability analysis, suppose that call events are independent of each other and that a node is equally likely to call any other node. Thus, the percentage of calls between two public Internet nodes is (10%)^2 = 0.01 = 1%.

Now let us enumerate the NAT and firewall traversal techniques available to SIP. STUN helps with good NAT, whereas TURN relay is needed for bad NAT. ICE is used to negotiate the connectivity using STUN or TURN bindings. A TCP-based relay (or even HTTP relay) is needed for UDP blocking and very restricted firewalls. (what about TCP hole punching and other techniques?) A STUN server is light in terms of bandwidth utilization, whereas a TURN relay needs high network bandwidth and hence costs the service provider more money. Same is the case with TCP-based relay.

In a call if one participant is behind a UDP blocking firewall (F), then the call must use a TCP relay. This amounts to 1-(1-F)^2 = 51% calls going through TCP relay.

In a call if both participants are behind bad NAT, then we need a TURN relay. This amounts to B^2 = 9% of the calls.

If one participant in a call is either on public Internet or good NAT and other is on public Internet, good NAT or bad NAT, then the media can go end-to-end using STUN bindings. This amounts to 40% of the calls.

In conclusion, the VoIP provider will need to host UDP or TCP relays for 51+9=60% of the calls. This is not a good proposition.
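
The breakdown above can be reproduced with a short calculation. This is a sketch under the same independence assumption; the function name call_mix and the dictionary keys are my own.

```python
def call_mix(p: float, g: float, b: float, f: float) -> dict:
    """Split calls by the traversal technique they need.

    p, g, b, f are the fractions of nodes on the public Internet, behind
    good NAT, behind bad NAT, and behind UDP blocking firewalls.
    """
    if abs(p + g + b + f - 1.0) > 1e-9:
        raise ValueError("fractions must add up to 1")
    tcp_relay = 1 - (1 - f) ** 2   # at least one side behind a firewall
    turn_relay = b ** 2            # both sides behind bad NAT
    direct = (1 - f) ** 2 - b ** 2  # no firewall, and not both on bad NAT
    return {"tcp_relay": tcp_relay, "turn_relay": turn_relay, "direct": direct}


mix = call_mix(p=0.10, g=0.30, b=0.30, f=0.30)
for name, share in mix.items():
    print(f"{name}: {share:.0%}")
```

For the example distribution this gives 51% TCP relay, 9% TURN relay and 40% end-to-end, and the three shares always add up to 100%.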

In the real world, the call events are not independent of each other: the probability of a corporate user calling another corporate user within the same corporation is high, and so is the probability of a home user calling another home user. For example, a SIP service targeted towards consumers can expect to have most of the calls among residential users. Thus, the percentage of calls that can be end-to-end is much higher than 40%. Similarly, an enterprise VoIP system can expect to have mostly internal intra-enterprise calls, which do not need to cross the enterprise firewall. Hence the percentage of calls needing the relay is not as high as 60%. Let us analyze these two use cases separately.

Suppose, for a consumer SIP service, the distribution of nodes is P=15%, G=50%, B=30%, F=5%, i.e., fewer users are behind bad NAT or UDP blocking firewalls. In this scenario 1-(1-F)^2 = about 10% of the calls need a TCP relay and B^2 = 9% need a TURN relay, so about 20% calls need a relay whereas 80% calls don't.

In an enterprise VoIP system, suppose 60% of calls are intra-office and 40% are with parties outside the office network; then only those 40% of calls need a relay whereas 60% don't. In a properly engineered enterprise VoIP system, appropriate ports are opened for UDP and appropriate media relays are installed in the DMZ, which together provide a smooth media path for inter-office communication.

While we can play with these numbers as much as we want, the fact remains that a significant percentage of calls need a media relay, either a UDP TURN relay or a TCP relay. This puts an unnecessary burden on the VoIP service provider, who must either install and manage relays and buy network bandwidth for them, or simply disallow calls that require a relay (in which case they may lose customers).

In a peer-to-peer system with super nodes such as Skype, these super nodes can act as media relays and hence save a lot of bandwidth and maintenance cost for the provider. There are some things to consider though: a node on the public Internet can become a UDP as well as a TCP relay for any call, whereas a node behind good NAT can become only a UDP relay with some workaround, but not a TCP relay. This puts too much burden on nodes on the public Internet.

Let us consider the original example with P=10%, G=30%, B=30%, F=30%. In this case the 51% of calls that require a TCP relay must use one of the 10% P nodes. When a node acts as a relay, it needs twice the bandwidth it needs when it is itself in a call. Suppose each node makes N calls a day and, generally speaking, needs bandwidth for N calls. A public Internet node, however, needs bandwidth not only for its own N calls but also for relaying about 5xN calls of other users (51% of calls spread over 10% of nodes), which amounts to total bandwidth for 11xN calls. Thus, while the super-node architecture is beneficial to the provider, it heavily punishes users on the public Internet. (My guess is that only about 4-5% of nodes using VoIP are on the public Internet, which further burdens those nodes.)
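The back-of-the-envelope estimate for a public node can be sketched as follows. It assumes the total number of calls is N times the number of users, that relayed calls are spread evenly over the public nodes, and that relaying a call costs twice the bandwidth of being in one. The function name `public_node_load` and its parameters are only illustrative.

```python
def public_node_load(public_share, tcp_relay_share, relay_cost=2.0):
    """Bandwidth a public node needs, in multiples of its own N calls.

    public_share: fraction of nodes on the public Internet (P)
    tcp_relay_share: fraction of all calls that need a TCP relay
    relay_cost: bandwidth of relaying one call relative to being in one call
    """
    if not 0 < public_share <= 1:
        raise ValueError("public_share must be in (0, 1]")
    relayed_per_node = tcp_relay_share / public_share  # relayed calls per own call
    own_calls = 1.0
    return own_calls + relay_cost * relayed_per_node


load = public_node_load(public_share=0.10, tcp_relay_share=0.51)
print(f"a public node needs bandwidth for about {load:.1f} x N calls")
```

With P=10% and 51% of calls relayed, each public node relays about 5.1 calls per call of its own, so the load is about 11.2, i.e., roughly the 11xN above. Halving the share of public nodes while keeping the relayed share fixed roughly doubles the relaying part of that load.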

A managed P2P-SIP infrastructure can be a good alternative, where corporations and universities donate hosts/bandwidth on high speed network to act as relays/super-nodes. Alternatively, one can have an incentive system to promote hosts to become relays and super-nodes.

RTMFP vs SIP

Adobe's RTMFP is not P2P-VoIP as exemplified by Skype. On the other hand, RTMFP is closer to client-server SIP or H.323 where signaling happens via a server and media path can be end-to-end between the endpoints. When people refer to RTMFP as P2P, it is more like 'end-to-end media' similar to client-server SIP.

Why is RTMFP important? The previous Adobe protocol, RTMP, is strictly client-server even for the media path. This gives poor quality for real-time media communication because media packets go from client to server over TCP, and are then redistributed to the other client, again over TCP. End-to-end media based VoIP systems existed before Adobe implemented RTMP. I suppose the difficulty of NAT and firewall traversal and the lack of an interactive video communication requirement in Flash Player resulted in RTMP. Adobe corrected this mistake in the new protocol RTMFP, which allows NAT and firewall traversal (to some extent) and an end-to-end media path without going through the server, although signaling still goes via the central server.

Once we understand this difference between P2P-VoIP and RTMFP, let's enumerate the differences between an RTMFP-based and a client-server SIP-based communication system.

1. RTMFP is a closed protocol, although Adobe recently opened up the previous RTMP. On the other hand, SIP is an open standard from the IETF. This means anyone can implement SIP whereas only Adobe can implement RTMFP. It also means that a bug in the RTMFP protocol or its implementation is not open to public review, for example by security experts.

2. RTMFP is an integrated protocol that has support for signaling, encryption, media flow (flow control and congestion control), NAT traversal. Whereas SIP is just one piece of the puzzle, that is used in conjunction with RTP/RTCP, SDP, STUN, TURN, ICE, SRTP, etc. to build a complete system. In that regard there is more scope for interoperability problems in SIP systems. The SIP interoperability test (SIPit) events have helped in solving interoperability problems among current products for over a decade. (see next point on why RTMFP alone may not be sufficient?)

3. Based on the available documentation, RTMFP works on UDP. Whereas SIP can work on UDP as well as TCP. In an RTMFP application, the client should fall back to TCP-based RTMP if for some reason UDP is blocked for the client-server communication. This also means that the client will lose some of the benefits such as encryption available in RTMFP. There are other protocols RTMPS and RTMPE to facilitate security and encryption over TCP-based RTMP.

4. Although RTMFP works on UDP, it implements additional flow control and TCP-friendly congestion control. This helps media traffic deal with network congestion and slow receivers. On the other hand, most existing SIP systems do not implement such mechanisms in the media path. While this looks like an advantage of RTMFP, it turns out to be a problem because of the way it is implemented. In particular, the network components are disconnected from the media source components such as the camera and microphone. The rate control mechanisms are implemented in the network components, which internally slow down the media traffic by delaying or dropping UDP media packets, while the encode quality settings on the camera and microphone components are unaffected. This results in packet drops due to congestion and hence choppy video or audio drop-outs. A good application built on top of RTMFP is supposed to get feedback from the network components and adjust the encode quality parameters (framerate, bitrate, quality) in the camera and microphone components so that packet drops are reduced. Thus, unless the application is smart enough to deal with this, the disconnected implementation of rate control and media source causes quality problems in RTMFP.

5. Both RTMFP and SIP can use media relays to work around NATs and firewalls. However, RTMFP does not use a super-node architecture where some clients (Flash Player instances) act as relays, whereas (P2P) SIP can use existing client nodes to act as media relays. This means that when using RTMFP, the service provider must bear all the bandwidth cost of the relays, whereas in (P2P) SIP the cost can be distributed among the users because of the peer-to-peer nature. I analyze the cost due to NAT and firewall traversal in my next post.
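The application-level feedback loop mentioned in point 4 can be sketched roughly as below. This is not how Flash Player or RTMFP does it; it is a generic controller that I made up for illustration, and the function name `next_bitrate`, the loss thresholds, the step sizes and the bitrate limits are all assumptions.

```python
def next_bitrate(current_kbps, loss_fraction, floor_kbps=64, ceiling_kbps=1000):
    """Pick the next encoder bitrate from the loss reported by the network layer.

    Back off multiplicatively when loss is high and probe upward additively
    when the path looks clean. All thresholds are illustrative assumptions.
    """
    if loss_fraction > 0.05:
        target = current_kbps * 0.75   # congestion: cut the encoder bitrate
    elif loss_fraction < 0.01:
        target = current_kbps + 25     # clean path: probe for more bandwidth
    else:
        target = current_kbps          # in between: hold steady
    return max(floor_kbps, min(ceiling_kbps, target))


bitrate = 500
for loss in [0.10, 0.08, 0.02, 0.0, 0.0]:
    bitrate = next_bitrate(bitrate, loss)
    print(f"loss={loss:.0%} -> encode at {bitrate:.0f} kb/s")
```

The point is only that the encoder setting follows the network feedback, instead of the network layer silently dropping packets that the encoder keeps producing at full rate.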

Why does client-server video conference fail?

I analyze some problems in client-server communication for multi-party video conferencing.

Audio communication differs from video in two important ways: (1) usually in a conference only one person is speaking at any time whereas everyone's video is on, (2) audio codecs are usually fixed bit-rate whereas video codecs adjust bit-rate based on various parameters such as available network bandwidth and desired frame-rate.

Problem 1:
In client-server mode, because the video coming from one participant needs to be distributed to all the other participants, the bandwidth and processing requirements at the server can be higher, unlike audio where usually only one person is speaking. Secondly, the downstream video bandwidth requirement at the client increases with the number of participants in a conference. In an N-party conference, each client will usually have one outbound audio stream, one inbound audio stream, one outbound video stream and N-1 inbound video streams. Note that this problem is worse for a peer-to-peer (P2P) video conference, where everyone sends a video stream to everyone else, in which case there are N-1 inbound and N-1 outbound video streams at each client. For asymmetric network access (ADSL or cable), where upstream bandwidth is lower than downstream, this causes early saturation of the outbound network bandwidth. Shutting down the video stream or reducing the video quality while a person is not speaking saves some bandwidth, especially for speaker mode conferences.
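The stream counts above can be turned into a quick capacity estimate. The sketch below ignores audio and protocol overhead and assumes every video stream uses the same bitrate; the link speeds in the example are made-up numbers for an asymmetric access line, and the function names are mine.

```python
def video_streams(n, mode):
    """Per-client video stream counts in an n-party conference."""
    if mode == "client-server":
        return {"outbound": 1, "inbound": n - 1}
    if mode == "p2p":
        return {"outbound": n - 1, "inbound": n - 1}
    raise ValueError(f"unknown mode: {mode}")


def max_parties(up_kbps, down_kbps, video_kbps, mode):
    """Largest conference size a client's access link can carry."""
    if video_kbps <= 0:
        raise ValueError("video_kbps must be positive")
    n = 1
    while True:
        streams = video_streams(n + 1, mode)
        if (streams["outbound"] * video_kbps > up_kbps
                or streams["inbound"] * video_kbps > down_kbps):
            return n
        n += 1


# Hypothetical ADSL line: 1000 kb/s up, 8000 kb/s down, 300 kb/s per video.
for mode in ("client-server", "p2p"):
    print(mode, max_parties(1000, 8000, 300, mode))
```

With these assumed numbers the client-server case is limited by the downlink at 27 parties, whereas the P2P case is limited by the uplink at only 4 parties.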

Problem 2:
Second point of difference is that audio is usually encoded using fixed bit-rate codec whereas video bit-rate is adjusted based on several parameters such as available network bandwidth, desired quality and frame-rate. In a client-server environment most implementations use the client-to-server network quality information to decide what bit-rate to use for client's video encoding. Consider a two party client-server conference, where first client is closer to the server hence has lower latency. The first client decides to use high quality high bitrate video encoding. On the other hand the second client decides to use low quality low bitrate video encoding. This asymmetry causes the first client to receive poor quality video whereas the second client's downstream link gets congested with high bitrate video. The problem is further aggravated if in a multi-party conference there is only one participant on poor quality network. The problem is caused because we use client-server network latency metric instead of end-to-end network latency metric in deciding the video encoding bitrate.

Problem 3:
Sometimes, the conference server imposes bitrate control to limit the traffic towards a low bandwidth client. However, for efficiency reasons the server doesn't re-encode the video packets. Instead, it just drops non-Intra frames if there is not enough bandwidth. This causes marginal to no improvement, primarily because Intra frames are several times bigger than other frames. Secondly, it causes choppy video, which further degrades the experience. The layered encoding in MPEG solves this problem.
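A rough way to see the trade-off is to compare how many bytes and how many frames survive when the server keeps only Intra frames. The group-of-pictures length and the Intra-to-other frame size ratio below are assumed numbers, not measurements, and `drop_non_intra` is just an illustrative name.

```python
def drop_non_intra(gop_length, intra_size_ratio):
    """Return (fraction of bytes still sent, fraction of frames still sent).

    gop_length: frames per group of pictures, one Intra frame plus the rest
    intra_size_ratio: size of an Intra frame relative to one other frame
    """
    if gop_length < 1 or intra_size_ratio <= 0:
        raise ValueError("need gop_length >= 1 and intra_size_ratio > 0")
    other_frames = gop_length - 1
    total_size = intra_size_ratio + other_frames  # in units of one non-Intra frame
    bytes_kept = intra_size_ratio / total_size
    frames_kept = 1 / gop_length
    return bytes_kept, frames_kept


bytes_kept, frames_kept = drop_non_intra(gop_length=15, intra_size_ratio=10)
print(f"bytes still sent: {bytes_kept:.0%}, frames still shown: {frames_kept:.0%}")
```

Under these assumptions the server still sends roughly 42% of the bytes while showing only about 7% of the frames, and the bigger the Intra frames are relative to the others, the smaller the saving gets.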

Problem 4:
Larger video packets may not traverse end-to-end over UDP. An encoded audio packet is usually small, on the order of 10-80 bytes per 20 ms. On the other hand, an intra-frame video packet can be much larger, say 1000-10000 bytes. When media packets are sent over UDP and the packet size is large, there is a high probability of the packet getting dropped. This is because of the MTU restriction and middle-boxes (NATs and firewalls) in the media path. A UDP packet larger than the MTU (typically approx 1300-1400 bytes) gets fragmented at the IP layer such that the fragments after the first one do not carry the UDP header information (such as source and destination port numbers). A port-inspecting NAT or firewall that doesn't handle fragmentation correctly may drop such subsequent fragments, causing loss of the whole UDP packet at the receiver end. Thus, video over UDP has to take care of additional fragmentation and reassembly, and/or discovery of the path MTU, in the application layer.
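Application-layer fragmentation can be sketched as below, so that every UDP datagram stays under the path MTU and carries its own UDP header. The 8-byte chunk header (frame id, chunk index, chunk count) and the 1200-byte limit are assumptions for illustration, not a standard format; real systems typically rely on the RTP payload format of the codec for this.

```python
import struct

# Hypothetical chunk header: frame id (4 bytes), chunk index and count (2 bytes each).
HEADER = struct.Struct("!IHH")
MAX_DATAGRAM = 1200  # assumed safe UDP payload size, below the typical path MTU


def fragment(frame_id, frame):
    """Split one encoded frame into datagrams no larger than MAX_DATAGRAM."""
    size = MAX_DATAGRAM - HEADER.size
    chunks = [frame[i:i + size] for i in range(0, len(frame), size)] or [b""]
    return [HEADER.pack(frame_id, index, len(chunks)) + chunk
            for index, chunk in enumerate(chunks)]


def reassemble(datagrams):
    """Rebuild a frame from the datagrams of ONE frame, in any order.

    Returns None if any chunk is missing. Sorting datagrams by frame id
    is left out to keep the sketch short.
    """
    parts, count = {}, None
    for datagram in datagrams:
        _frame_id, index, count = HEADER.unpack_from(datagram)
        parts[index] = datagram[HEADER.size:]
    if count is None or len(parts) != count:
        return None
    return b"".join(parts[i] for i in range(count))


frame = bytes(range(256)) * 40          # a 10240-byte stand-in for an Intra frame
datagrams = fragment(7, frame)
print(len(datagrams), max(len(d) for d in datagrams))
```

The receiver can then detect an incomplete frame (`reassemble` returns `None`) and wait for the next Intra frame instead of decoding garbage.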

Problem 5:
The server may allow video over UDP as well as TCP from the clients, typically to support NAT and firewall traversal. If some clients are over TCP and others over UDP, then the server also needs to proxy packets from one to other. If the client over TCP assumes ordered packet delivery, then the server will also need to do buffering, packet re-ordering and delay adjustment, which further adds to the implementation complexity of the server. The problem is not that visible for audio beyond a glitch in sound, whereas for video the view may get completely corrupted until the next Intra frame.

Problem 6:
A slightly related problem is when the conference server does audio mixing but video forwarding. In this case, the server must perform delay adjustment, packet re-ordering, and buffering for the audio path. However, for efficiency reasons it may blindly forward the video packets among the participants. Thus the synchronization information between the audio and video gets lost, and performing lip synchronization at the receiving client becomes a challenge. A correct implementation of the server should act as an RTP mixer, i.e., include the contributing source information in the mixed audio stream, and distribute RTCP information to all the participants for synchronization. (How to do this if each audio call leg is a separate RTP session?)

Some of these problems (2,3,5,6) can be solved to some extent by using peer-to-peer video conferencing.