Kundan Singh

WebRTC vs Flash Player

This year has been great for the world of IP communications so far -- with the Skype deal, Flash Player adding echo cancellation, and now Google open sourcing WebRTC (with source code) that includes the audio/video codecs and quality engines.

RTC-Web is an effort started in the IETF (and Web-RTC in W3C) to standardize the way media streams are transported end-to-end between two browser instances for a real-time communication experience within the browser. It consists of a protocol for establishing an end-to-end media path, abstractions for audio/video codecs and devices, and the language elements to use this feature from within Javascript/HTML. Traditionally, browser communication has been done using plugins such as Flash Player. I have written a few open source software projects that use Flash based audio and video communication (flash-videoio, siprtmp, vvowproject). The WebRTC effort brings a completely new dimension, in a good way, because now we do not depend on external plugins for web based real time communications. Real-time communication becomes a first-class construct for web developers.

This article summarizes some differences between WebRTC and Flash Player approaches for real-time audio/video communication. It also mentions a separate application approach as described in the VVoW project.

WebRTC is in line with the evolution of web protocols, whereas using Flash Player is like patching an incomplete system. With WebRTC there is no external dependency beyond the basic web browser. However, given the ubiquitous availability of Flash Player compared to basic inter-operating HTML5 features, the Flash Player approach is still promising, at least in the short term.

The number of web developers who understand Javascript/HTML is clearly much larger than the number who understand Actionscript/MXML, which benefits the WebRTC approach as there can be many more new applications and use cases implemented in practice. However, the complexity of building a Javascript based application that combines the various individual pieces of the communication elements may be overwhelming. On the other hand, existing IDE tools for Flash development take away a lot of complexity from the developers.

Many users are reluctant to change their browser, and hence getting ubiquitous user adoption may take a long time unless this gets added to Internet Explorer. Moreover, dealing with device interfaces in a portable manner is a challenge. It is also not clear how the devices should be accessed across multiple instances of the same browser or across different browsers.

In the past, incompatibility in HTML among browsers has been a nightmare for web developers, and extending HTML for yet another feature is bound to cause more interoperability problems. Two interoperability scenarios are significant: between browsers from different vendors running the same web page, and between two different web sites. The latter is tricky from a security point of view if open standards are used, because a web site owner would want to restrict communication between its users and the users of another web site, whereas the protocol will be capable of such communication.

On the other hand, Flash Player has shown more ubiquitous availability on users' desktops and laptops than any specific web browser. Flash Player allows implementing platform agnostic software because all the incompatibilities between browsers and platforms are taken care of by the plugin vendor.

Flash Player has the ability to do group communication by building a scalable application level multicast tree among Flash Player instances. This is useful for one-to-many broadcast type communication scenarios. WebRTC is still in the initial phases of two-party communication. Obviously, multiparty communication can be built on top of the two-party communication elements, but it requires more effort to achieve efficiency.

In terms of video codecs, WebRTC provides an open source, high quality video codec, whereas Flash Player's camera-captured video still uses the outdated Sorenson codec, which is difficult to interoperate with non-Flash products. Availability of source code enables a WebRTC-based project to add new codecs as needed without depending on the vendor to provide new audio and video codec features.

The main problem with the Flash Player approach is that the protocol for the end-to-end media path is proprietary, so interoperating with existing VoIP gear is inefficient without buying server pieces from the plugin vendor. Although interoperability is possible using open RTMP and SIP-RTMP translators, it is not efficient because the browser-to-translator media path over TCP incurs unnecessary latency for some users. Secondly, for any new feature we depend on the vendor, for example, echo cancellation, a new codec, or portability to a new device. Luckily, Adobe has been releasing new updates with new features periodically. For example, the echo cancellation feature released in Flash Player 10.3 solved a lot of problems for real-time communication. (Please see the public-chat demo in my flash-videoio project page to try out the video conference with echo cancellation.)

Some problems common to both approaches are: (1) the lack of a listening TCP socket or a general purpose UDP socket which could be used to implement a peer-to-peer application protocol within the browser without relying on servers, and (2) the scope of an application is within a web page as defined by the Javascript or Flash elements, so if the user navigates to another web page the communication is lost. This is not a problem for the web communication use case, but people are generally not used to this model in traditional communication.

On the other hand, the separate application model as used in the VVoW project allows you to have host resident software for communication, which can be used by any application including a web application running in your browser by connecting to the resident software locally. The resident application can reuse the existing research, e.g., Host Identity Protocol and P2P-SIP. This can save initial setup time for every connection of WebRTC. The main problem is that it involves yet another download and installation by the end user which hampers wide adoption.

I will continue to explore the WebRTC software developed by Google and try to include it in my open source projects. Some example projects could be: (1) add interoperability between WebRTC and Flash Player for communication in my siprtmp project, (2) add an option to detect WebRTC support in my flash-videoio project, use it if available, and fall back to Flash Player otherwise, (3) use the WebRTC source code to implement a separate application with high quality end-to-end media path in the VVoW project, and (4) create a Python wrapper to use WebRTC from within any Python application.

Performance of siprtmp: multitask vs gevent

Poor performance has been an issue in my RTMP server and SIP-RTMP gateway. Traditionally, I blamed the multitask framework for the poor performance. In this article I present my measurement results as well as introduce an alternative gevent-based implementation to improve the performance.

There are several performance aspects of this software, e.g., CPU utilization per call or session, memory usage, bandwidth requirement, etc. This article only focuses on the CPU performance. Moreover, I only consider the steady state CPU usage to measure the number of active simultaneous calls through the gateway. The CPU usage during call setup and termination is not considered.

The conclusion of my measurement is as follows. The SIP-RTMP gateway software using gevent takes about 2/3 of the CPU cycles compared to using multitask, and the RTMP server software using gevent takes about 1/2 of the CPU cycles compared to using multitask. After the improvements, on a dual-core 2.13 GHz CPU machine, a single audio call going through gevent-based siprtmp using the Speex audio codec at 8 kHz sampling takes about 3.1% CPU, and hence in theory the machine can support about 60 active calls in steady state. Another way to look at it is that the software requires CPU cycles of about 66 MHz per audio call.

The gevent-based software is also available under the same license for you to try out. The next step to further improve the performance is to move part of the media processing of siprtmp to an external C/C++ extension module.

Background

Traditionally, I have used the multitask framework for co-operative multitasking in my Python software including p2p-sip, rtmplite and siprtmp. In the past, people have complained about high CPU utilization in siprtmp for a single call or even with no call. Part of the discussion is documented in issue 31. It turned out that the no-call CPU usage was a bug, and that we could optimize the multitask framework to improve the performance by approximately 2x. The optimization alters the way in which the multitask framework looks for io-events and more tasks. In particular, it gives more preference to tasks than to io-events, hence if a single io-event generates multiple tasks, all of them run before waiting for next io-events. These optimizations and fixes are in SVN r60 and r68. Unfortunately, these optimizations are not enough.
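The scheduling change described above can be illustrated with a toy scheduler. This is a minimal stdlib-only sketch and not the actual multitask code: the names run, poll_io and worker are hypothetical, and a scripted list of batches stands in for real socket polling.

```python
from collections import deque

def run(initial_tasks, poll_io):
    """Toy cooperative scheduler that prefers runnable tasks over io-events."""
    ready = deque(initial_tasks)
    while True:
        # Drain every runnable task before looking at io-events again.
        while ready:
            task = ready.popleft()
            try:
                next(task)          # run the task up to its next yield
                ready.append(task)  # still alive, so keep it runnable
            except StopIteration:
                pass                # task finished
        new_tasks = poll_io()       # stand-in for waiting on the next io-event
        if not new_tasks:
            break                   # nothing left to wait for
        ready.extend(new_tasks)

trace = []

def worker(name, steps):
    for i in range(steps):
        trace.append((name, i))
        yield

# Hypothetical io source: the first event spawns two tasks, the second spawns one.
batches = [[worker("a", 2), worker("b", 1)], [worker("c", 1)]]
run([], lambda: batches.pop(0) if batches else [])
```

In this sketch both tasks created by the first event run to completion before the scheduler polls again, which is the preference for tasks over io-events that the optimization introduced.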

To further improve the performance, I looked at the built-in asyncore module of Python and re-implemented rtmp.py to use asyncore. There was significant improvement of approximately another 1.5x to 2x. Unfortunately, getting timers to work with asyncore is not trivial. Hence I couldn't implement siprtmp easily as the SIP/RTP library relies heavily on timers.

Then I looked at the gevent project, thanks to a co-worker for recommending it. It supports co-routine based co-operative multitasking by modifying the existing blocking modules such as socket. Compared to the multitask framework, the source code using gevent is more readable and easier to maintain because gevent works behind the scenes. In contrast, the multitask framework requires yield statements scattered everywhere and a non-trivial StopIteration exception to return from a task. I re-implemented siprtmp.py, and related SIP/RTP modules, using gevent. Since the siprtmp module includes all of the rtmp module, this can also be used as an RTMP server in addition to being a SIP-RTMP gateway.
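To make the readability difference concrete, here is a small stdlib-only sketch of generator-style tasks next to the same logic in plain blocking style. It does not use the multitask or gevent APIs; run, read_number, add and add_blocking are hypothetical names. In the Python 2 era a task raised StopIteration(value) explicitly to return a result; the sketch uses the Python 3 form, return value inside a generator, which delivers the result through the same exception.

```python
from inspect import isgenerator

def read_number(text):
    yield                      # cooperative switch point, as a task would have around io
    return int(text)           # delivered to the caller via StopIteration.value

def add(a, b):
    x = yield read_number(a)   # every call into a sub-task needs a yield
    y = yield read_number(b)
    return x + y

def run(task):
    """Tiny trampoline: drives one task and its nested sub-tasks to completion."""
    stack, value = [], None
    while True:
        try:
            result = task.send(value)
        except StopIteration as stop:
            if not stack:
                return stop.value                   # top-level task finished
            task, value = stack.pop(), stop.value   # resume the caller with the result
            continue
        if isgenerator(result):
            stack.append(task)                      # suspend the caller, run the sub-task
            task, value = result, None
        else:
            value = None                            # plain yield: just a switch point

def add_blocking(a, b):
    # The same logic written in ordinary blocking style, with no yields.
    return int(a) + int(b)

total = run(add("2", "3"))
```

With a library that switches tasks inside the blocking calls themselves, application code can look like add_blocking rather than add, which is the maintainability gain described above.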

Test Setup

All my tests were done on my MacBook laptop with a 2.13 GHz Intel Core 2 Duo and 2 GB memory, running Mac OS X 10.5.6. I used Python 2.7 for server side components and flash debug player version MAC 10,0,45,2 (how to find?). I used X-lite version 3 as a standard SIP client. The debug trace on the server was disabled by not supplying any -d option. All my clients and the server ran on the local host, hence bandwidth was not an issue. I used Mac's Activity Monitor to measure the CPU usage.

Measurement Result

The main metric is the CPU usage in percentage as reported by the Activity Monitor. There are several parameters that were altered and the effects were measured.

The siprtmp performance was measured for an audio call between a web-based VideoPhone sample application available as part of the siprtmp software, and the third-party X-Lite application. The sampling rate of the Speex audio codec can be 8kHz or 16kHz. The larger the sampling rate, the larger the encoded packet is. The CPU usage increases with higher sampling rate. Note that there is no transcoding in siprtmp. The following table shows the percentage CPU usage for siprtmp using multitask and gevent, and for the two sampling rates.

Rate	multitask	gevent
8 kHz	4.8-5.1%	3.1-3.2%
16 kHz	6.2-6.5%	4.0-4.1%

Based on these, we can conclude that the gevent-based SIP-RTMP gateway takes about 2/3 of the CPU compared to the multitask-based gateway. Roughly, the gevent-based gateway takes about 66 MHz/audio-call of the CPU cycles in steady state.
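The per-call figure and the theoretical capacity can be reproduced with a few lines of arithmetic. This is a sketch under one assumption that is mine and not stated in the measurements: the Activity Monitor percentage is read as a share of one core, so the dual-core machine offers 200% in total. The helper names are hypothetical.

```python
CLOCK_MHZ = 2130.0   # 2.13 GHz clock of the test machine
CORES = 2            # dual-core

def mhz_per_call(cpu_percent):
    """CPU cycles used by one call, in MHz, assuming percent of a single core."""
    return cpu_percent / 100.0 * CLOCK_MHZ

def max_calls(cpu_percent):
    """Theoretical steady-state capacity if both cores could be fully used."""
    return int(CORES * 100.0 / cpu_percent)

gevent_8khz = 3.1      # lower end of the measured 3.1-3.2% range
multitask_8khz = 4.8   # lower end of the measured 4.8-5.1% range

per_call = mhz_per_call(gevent_8khz)      # close to the 66 MHz quoted above
capacity = max_calls(gevent_8khz)         # in the region of 60 calls
ratio = gevent_8khz / multitask_8khz      # roughly 2/3
```

The capacity number is an upper bound only, since it ignores call setup, other processes, and whether a single process can actually use both cores.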

The rtmp performance was measured using one publisher and zero or more players. The CPU usage increases with the number of players. Typically, an audio-only session gives less variance in the CPU usage, whereas if video is included then depending on the amount of movement or image details the packet size changes, and so does the CPU usage. I used the Flash VideoIO project's test page to perform the tests. If video is present, then Flash Player's camera capture uses these properties: cameraQuality=80, cameraWidth=320, cameraHeight=240, cameraFPS=12. Audio is always Speex 16 kHz with encodeQuality=6. The following table shows the percentage CPU usage using multitask and gevent, with one publisher and different numbers of players, and with or without video. If the variance is small, only the average is reported, whereas if the variance is large the range is listed.

Media	#players	multitask	gevent
Audio	0	2.2%	1.3%
Audio	1	3.5%	1.8%
Audio	2	4.5%	2.1%
Audio	3	5.5%	2.5%
Audio+Video	0	3.0-3.9%	1.4-1.7%
Audio+Video	1	4.2-4.7%	2.1%
Audio+Video	2	5.5-6.3%	2.7%
Audio+Video	3	7.0-7.6%	3.1%

Based on these, we can conclude that the gevent-based software takes about 1/2 of the CPU compared to the multitask-based software for RTMP streaming. Roughly, the gevent-based server takes 34 MHz/publisher and 12 MHz/player of the CPU cycles in steady state.
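One way to arrive at per-publisher and per-player costs from the table is sketched here. The exact fit used for the quoted figures is not stated, so this is only an approximation under my own assumptions: percentages are a share of one 2.13 GHz core, ranges are replaced by their midpoints, and the per-player cost is the average increase from zero to three players.

```python
CLOCK_MHZ = 2130.0  # 2.13 GHz test machine

# gevent column of the table for 0..3 players; ranges replaced by midpoints.
gevent_audio = [1.3, 1.8, 2.1, 2.5]
gevent_audio_video = [1.55, 2.1, 2.7, 3.1]

def to_mhz(percent):
    """Convert percent of one core into MHz (assumption: single-core percent)."""
    return percent / 100.0 * CLOCK_MHZ

def publisher_and_player_cost(samples):
    """Return (MHz for the publisher alone, average MHz added per player)."""
    base = to_mhz(samples[0])
    per_player = to_mhz(samples[-1] - samples[0]) / (len(samples) - 1)
    return base, per_player

audio = publisher_and_player_cost(gevent_audio)
audio_video = publisher_and_player_cost(gevent_audio_video)
```

With audio and video this lands near 33 MHz for the publisher and 11 MHz per player, in the same region as the rounded figures above; audio-only sessions come out somewhat lower.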

Implementing video conferencing and text chat using Channel API

Last week, Google finally released the Channel API [1, 2] for Google App Engine. It has been available to developers for six months [3], but not on actual app engine for production. I had built a few video conferencing and text chat applications [4, 5] using Flash VideoIO project [6] on Google App Engine. Earlier, I had to use Ajax/polling technique to get events related to chat and user list. In the last couple of days, I modified those applications to use the asynchronous event notifications using the Channel API. More text from [6] follows:

"Random-Face [4]: This is a chatroulette-type application built using the Flash VideoIO component on Adobe Stratus service and Python-based Google App Engine. ... You can view the source code of two files, index.html that renders the front end user interface and main.py that forms the back-end service."

"Public-Chat [5]: This is a multi-party audio, video and text chat application built on top of Python-based Google App Engine and using Channel API for asynchronous instant messaging and presence. ... Developers can see the source code files: index.html is the front-end user interface, webtalk.js is the client side Javascript to do signaling, and main.py is the back-end service code."

The Channel API essentially implements an XMPP-style asynchronous communication from your server to the Javascript client. I use this to implement notifications for new messages, change in user list, and update of user video session to other participants in the system.
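The notification fan-out described above can be sketched as follows. This is not the code of the applications mentioned, and it does not call the Channel API: the Room class and its send callback are hypothetical, with the callback standing in for whatever server-to-client push the platform provides.

```python
import json

class Room:
    """Fans out presence and chat events to every other participant."""

    def __init__(self, send):
        self.send = send          # hypothetical push function: send(client_id, text)
        self.members = []

    def _notify(self, sender, event):
        payload = json.dumps(event, sort_keys=True)
        for client_id in self.members:
            if client_id != sender:
                self.send(client_id, payload)

    def join(self, client_id):
        self.members.append(client_id)
        self._notify(client_id, {"type": "userlist", "users": list(self.members)})

    def leave(self, client_id):
        self.members.remove(client_id)
        self._notify(client_id, {"type": "userlist", "users": list(self.members)})

    def post(self, client_id, text):
        self._notify(client_id, {"type": "message", "from": client_id, "text": text})

# Collect pushed events in a list instead of delivering them to browsers.
outbox = []
room = Room(lambda client_id, payload: outbox.append((client_id, json.loads(payload))))
room.join("alice")
room.join("bob")
room.post("bob", "hello")
```

Compared to Ajax polling, the server decides when a client hears about a change, so the client no longer has to ask repeatedly for the user list or new messages.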

What is Flash Media Gateway?

I recently saw a description of Adobe's Flash Media Gateway [1] and related information on how Adobe Connect 8 can use it to make and receive SIP calls [2]. This article lists my views on the advantages and problems of such a gateway architecture. (Disclaimer: I have not used any of these products, so my views may be completely wrong.)

In summary, the new Flash Media Gateway is similar to the bunch of other SIP-RTMP gateway products that have already existed for a few years, e.g., siprtmp, gtalk2voip, flaphone and red5phone. I feel the industry demand for interoperating between Flash Player and SIP devices eventually forced the company to do something about it. Unfortunately, it did something which is sub-optimal, as I describe here.

I have been involved with the development of the open source siprtmp project [3], hence I can speak from my experience about the advantages and problems of such an architecture. I have also blogged earlier about FAQ on using Flash Player to make phone calls [4].

Advantages of Flash Media Gateway

It allows you to build Flash applications that can talk to SIP devices using Adobe's servers at the back end. While it is not useful for those who already have resorted to other solutions such as Red5 and Wowza, it is useful to those who use Adobe's Flash Media Server (FMS) and do not want to switch to other alternatives for any reason. Problem: It is not clear whether the Flash Media Gateway can work with other media servers such as Red5 or Wowza.
It supports audio transcoding among Speex, Nellymoser and G.711, as well as mixing for a simple conference bridge. This allows working with older Flash players that do not have Speex and with SIP devices that do not have Speex. A third-party product such as siprtmp is typically reluctant to implement transcoding with Nellymoser because of licensing restrictions. Problem: In general, transcoding is not the best option because it takes significant CPU cycles on your (expensive) hosted servers. It can drastically reduce the capacity of your server, e.g., from 100 calls with Speex on both sides to 10 calls when transcoding between Speex and G.711.
It supports video using H.264. Problem: It is not clear whether it allows only one-direction H.264 from SIP device to Flash Player, or whether it supports bi-directional H.264. Bi-directional H.264 would be a huge advantage, but it would mean that Flash Player is capable of capturing and sending H.264 video, which does not look like the case.
It can potentially support UDP between Flash Player and server. Note that one of the biggest issues with real-time voice calls with Flash Player was that RTMP (over TCP) caused high latency not suitable for interactive communication. Adobe added another protocol, RTMFP (over UDP), that could allow an end-to-end media path among the participants, thus drastically reducing the end-to-end audio latency. While a gateway architecture does not allow an end-to-end media path, it can still allow UDP between Flash Player and media server using RTMFP. This could reduce the end-to-end latency to some extent. Problem: It is not clear whether RTMFP can be used in conjunction with Flash Media Gateway.

Problems with Flash Media Gateway

It does not allow you to build a SIP client in the browser. The communication between the Flash application and the media server/gateway is still over RTMP (or RTMFP). This means that, unlike a true end-to-end media path for SIP calls, the media must go through the server/gateway. I don't think the Connect plugin implements a SIP/RTP stack, because it says that it uses the gateway in the back end.
If RTMFP is not allowed for such SIP calls, then the RTMP (over TCP) connection will significantly contribute to latency which is not suitable for interactive voice calls unless you have deployed the gateway close to your user.
Most SIP-PSTN gateways that translate SIP calls to the phone network support the traditional voice codecs G.711, G.729 and G.723.1 but not Speex or Nellymoser, whereas Flash Player supports only Speex and Nellymoser for captured voice. Thus you always need transcoding. Unfortunately, G.711 at 64 kb/s is expensive on bandwidth compared to, say, G.729 at 8 kb/s. Since the gateway does not support common voice codecs of PSTN providers, in most cases you will need to either run some form of transcoding twice or live with higher bandwidth usage.
It does not add any significant value beyond what already exists with red5phone or siprtmp. You still need to use a third-party SIP provider who can terminate your PSTN calls. It does not optimize the media path latency because of the gateway architecture. And finally, it does not really improve the end-user's call experience for Flash to SIP calls.
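To put the bandwidth point about G.711 versus G.729 in numbers, here is a rough sketch of the one-way IP bandwidth of an RTP audio stream. The codec rates come from the text above; the 20 ms packetization interval and the plain IPv4/UDP/RTP header sizes are my assumptions, and link-layer overhead is ignored.

```python
HEADER_BYTES = 20 + 8 + 12   # IPv4 + UDP + RTP headers, no options or extensions
PACKET_MS = 20               # assumed packetization interval

def wire_kbps(codec_kbps, packet_ms=PACKET_MS):
    """Approximate one-way IP bandwidth of an RTP audio stream in kb/s."""
    packets_per_second = 1000.0 / packet_ms
    header_kbps = HEADER_BYTES * 8 * packets_per_second / 1000.0
    return codec_kbps + header_kbps

g711 = wire_kbps(64)   # G.711 payload rate from the text
g729 = wire_kbps(8)    # G.729 payload rate from the text
```

Under these assumptions the header overhead is the same for both codecs, so the gap on the wire is smaller than the 8x difference in payload rate, but G.711 still costs several times the bandwidth of G.729 per call.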

Ideally, the SIP/RTP and related protocols should become part of Flash Player, so that it allows one to create a SIP user agent in the browser and enable low latency end-to-end media path with third-party SIP user agents.

How to extend HTML5 for real-time video communication?

A few months ago, I was discussing HTML5 with a friend of mine. We tried to figure out what it would take to extend it to support web-based video communication. The proposed HTML5 already includes audio and video tags, but these are useful only for streaming video applications. This article presents more refined thoughts on how to extend the browser to support video communication.

First approach: extend the video tag

W3C has added a new video element in HTML5 to facilitate playback of interoperable video formats across browsers. Existing web sites use the "object" element to run an external plugin such as Flash Player for video playback, which is intended to be replaced by HTML5's video element. This allows browser manufacturers, especially for phones and other devices, to easily play back web videos without having to implement the full Flash Player plugin. The "src" property allows specifying the URL of the video to play, and additional properties such as poster, preload, autoplay, loop and controls allow controlling the behavior of the video player.

One way to support video communication is to extend the video element with additional properties that allow it to capture and publish local video, and control camera and microphone behavior. For example, in a two-party call between Alice and Bob, Alice can have two video elements, one to publish local video to URL stream "alice" and the other to play remote video from URL stream "bob". Similarly, Bob can have two video elements, one to publish local video to URL stream "bob" and the other to play remote video from URL stream "alice". The "src" property can specify the central media server or rendezvous server location as well as the publish or play stream names, e.g., "rtmp://server/conf123?publish=alice".

This is the idea behind my Flash-based audio and video communication project. In addition to existing properties such as src, preload, autoplay, loop and controls, it defines new properties for microphone, camera, playing, recording, etc., as you can see on How to use the VideoIO API?. It also overloads the "src" property to allow "rtmp" and "rtmfp" URLs for media server or rendezvous server location, respectively. This application with its new properties can be used as a drop-in replacement for a video element that supports video communication in the browser.
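The stream naming in the Alice and Bob example can be sketched as a small helper that computes the "src" values each participant would assign to their video elements. The helper name stream_urls is hypothetical, and the "play" query parameter is an assumption made by analogy with the "publish" example given earlier.

```python
def stream_urls(base, me, others):
    """Return (publish src, list of play srcs) for one participant."""
    publish = "%s?publish=%s" % (base, me)
    # Assumed parameter name: "play" mirrors the documented "publish" form.
    play = ["%s?play=%s" % (base, name) for name in others]
    return publish, play

base = "rtmp://server/conf123"
alice_publish, alice_play = stream_urls(base, "alice", ["bob"])
bob_publish, bob_play = stream_urls(base, "bob", ["alice"])
```

For a multi-party conference the same helper yields one publish URL and one play URL per remote participant, so each browser ends up with N video elements for N participants.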

This approach of extending the existing video element with new properties works well for two-party as well as multi-party conferences, and centralized as well as end-to-end media path. The nice thing about this approach is that it keeps the actual call signaling out-of-scope of the video element, e.g., your web application implements call signaling using existing Javascript/Ajax/websocket/server-event technologies. It keeps the specific rendezvous protocol mechanism such as "rtmp", "rtmfp", and in future "sip" or "rtsp", outside the video element.

To avoid interoperability problems, a minimum subset of supported rendezvous protocols is recommended. The requirement for such a protocol is to support real-time media transport, preferably over UDP, in a centralized or end-to-end path in the presence of network middle boxes such as NATs and firewalls.

Second approach: define new connection object

The previous approach integrates capture, playback and connection functions into a single video element, with additional properties. Alternatively, these functions can be split into different elements and Javascript objects, e.g., the video element does display/playback, but new camera and microphone objects allow capture, and a new connection object allows an end-to-end real-time media path among participants. The Javascript application actually connects these different elements and objects to build a complete video communication system.

There are several proposals on what the new connection or transport API will look like. Example attributes are: protocol (udp or tcp), list of reflectors and relay servers, mode (initiating or receiving), secure (boolean). Additionally, it has methods such as connect and send, and events to indicate connection status and incoming data. Existing protocols such as ICE, STUN, RTP/RTCP and SIP may be implemented in the browser or external gateways to support such a transport object. Finally, these transport objects can be piped with display and capture components, audio and video codecs and filters, etc., to implement a complete video communication application.
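The shape of such a transport object can be sketched as follows. This is an illustration only and not any specific proposal: the class, its attribute names and its event callbacks are hypothetical, it is written in Python rather than Javascript to match the other examples here, and it links two objects in memory instead of running ICE or STUN.

```python
class Transport:
    """Hypothetical transport object mirroring the attributes listed above."""

    def __init__(self, protocol="udp", servers=(), mode="initiating", secure=False):
        if protocol not in ("udp", "tcp"):
            raise ValueError("protocol must be 'udp' or 'tcp'")
        if mode not in ("initiating", "receiving"):
            raise ValueError("mode must be 'initiating' or 'receiving'")
        self.protocol, self.servers = protocol, list(servers)  # reflectors and relays
        self.mode, self.secure = mode, secure
        self.peer = None
        self.on_status = lambda status: None   # connection status event
        self.on_data = lambda data: None       # incoming data event

    def connect(self, peer):
        # A real implementation would negotiate a path here; this only links two objects.
        self.peer, peer.peer = peer, self
        self.on_status("connected")
        peer.on_status("connected")

    def send(self, data):
        if self.peer is None:
            raise RuntimeError("not connected")
        self.peer.on_data(data)

events = []
alice = Transport(mode="initiating", servers=["stun.example.net"])
bob = Transport(mode="receiving")
bob.on_status = lambda status: events.append(("bob", status))
bob.on_data = lambda data: events.append(("bob", data))
alice.connect(bob)
alice.send(b"frame-1")
```

The application wires capture and display components to on_data and send, which is the piping of objects that this approach leaves to the web developer.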

In summary, this approach defines new Javascript objects such as Transport, Camera, Microphone, Codec, etc., and allows the application to connect these objects to build a real application. This is more complex than the first approach, but allows fine-grained application logic.

Third approach: use external application

This approach understands the limitations of HTML and does not try to "add" video communication to it. We are considering this approach of a separate application in our web communications project at Illinois Institute of Technology.

While the idea of extending HTML to support video communication is useful and interesting, there are many limitations. In the past, incompatibility in HTML among browsers has been a nightmare for web developers, and extending HTML for yet another feature is bound to cause more interoperability problems. Browser manufacturers are sometimes not too keen to add a new feature, e.g., for business reasons if it competes with the manufacturer's existing product or service. A third important reason is that the video element of HTML5 lacks some digital rights management related features, which causes media owners to publish their media using a restricted Flash application. Fourth, adoption of new HTML5 is slow, so web site developers still need to fall back to a Flash-based application for video playback, at least in the short term. Finally, adding capture and end-to-end transport components in HTML5 gives rise to a plethora of issues related to privacy, security and denial of service attacks, in case of faulty browser implementation. Due to these reasons many people believe that extending HTML and browsers to support video communication is not the right approach.

Hundreds of applications exist that implement consumer video communication. Some popular ones are Skype, Gmail, tinychat and Facetime. The technologies behind these are drastically different, especially for signaling and control. However, at the bottom, every video communication application tends to establish some form of end-to-end UDP-based real-time media path, and falls back to relays if that fails. As mentioned before, IETF standards exist to establish such media paths and relays.

Imagine a standard-compliant resident application, rtc-app, that runs on the user's machine independent of the browser, but allows any application including the browser to establish a real-time media path. The browser can use an existing API such as websocket or HTTP to interact with rtc-app. The rtc-app application is not owned by a specific vendor, and is installed by the end-user. This avoids re-implementing the feature by every vendor who wants to do real-time video communication. To address the privacy and security concerns, rtc-app must directly ask permission from the end-user before initiating or accepting a connection, instead of doing so automatically (and randomly) on API calls. This is similar to how Flash Player asks the end-user for permission to capture from microphone or camera, but can remember the application for future use if told so by the end-user.
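The permission behavior described for rtc-app can be sketched as a small gate that prompts the end-user and optionally remembers the answer. This is only an illustration of the idea: PermissionGate, ask_user and the example origins are hypothetical names, and a scripted function stands in for a real user prompt.

```python
class PermissionGate:
    """Asks the end-user before each connection unless a decision was remembered."""

    def __init__(self, ask_user):
        self.ask_user = ask_user   # hypothetical prompt: ask_user(origin) -> (allow, remember)
        self.remembered = {}

    def allow(self, origin):
        if origin in self.remembered:
            return self.remembered[origin]   # no prompt for remembered applications
        allowed, remember = self.ask_user(origin)
        if remember:
            self.remembered[origin] = allowed
        return allowed

prompts = []

def fake_prompt(origin):
    prompts.append(origin)
    # Pretend the user trusts one site permanently and denies everything else once.
    trusted = origin == "https://chat.example"
    return trusted, trusted

gate = PermissionGate(fake_prompt)
first = gate.allow("https://chat.example")
second = gate.allow("https://chat.example")
denied = gate.allow("https://other.example")
denied_again = gate.allow("https://other.example")
```

The trusted application triggers a single prompt, while an application whose decision was not remembered is asked about on every connection attempt.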

The main advantage of this approach is that it does not require changing the browser or HTML, but still is a generic implementation-focussed way to enable real-time video communication for many other applications. If an existing vendor such as Skype or Google opens up its API, it will be a big step forward. While rtc-app can provide transport functions, the audio and video capture still needs to be done somehow. Various codec licensing issues may prevent us from including it in rtc-app, but a Flash Player based application similar to the first approach can perform capture on its behalf. The main problem with this approach is that it requires an additional download and install by the end-user.

How to conduct a technical interview for software engineer?

(This article presents my thoughts on how to effectively conduct a technical interview for a software engineering position. It presents the "interviewer's" point of view based on more than 30 technical interviews I have conducted, and quality of candidates I have recommended. If you are an "interviewee" I suggest you look elsewhere, e.g., interview questions.)

Know the position you are hiring for. If you have been part of a software engineering team or have read the book, "The mythical man-month", you would know that you need several different "types" of members in a successful team. You need a "magician", who knows or can figure out a solution to every technical problem you may have. You need a couple of "plumbers" who are willing to fix any broken software piece. You need a "general" who is very motivated about what you are doing, knows how and when to delegate, and keeps everyone together. You need a few "soldiers" who can follow orders, do the job, and be happy to contribute. And so on. As an interviewer, you need to know what position you are hiring for. You need to tailor your interview as per the requirement. One interview pattern does not fit all types.
Do your homework. Before the interview, thoroughly read the candidate's resume/CV. If she has extensive work experience, identify only one or two of her past projects to focus on. If you have even a slight doubt about her programming ability, prepare a written programming test. If possible, schedule a separate or additional time slot before your face-to-face interview for the programming test. Do not use any existing online programming test material, otherwise you won't be able to distinguish between someone who knows how to program vs someone who has gone through many web sites containing interview questions. Do not give take home tests. Do not share your programming interview questions with other interviewers in your organization.
Start with knowledge questions. During the interview, after initial introductions, start with a question on her past experience. Your interview should balance between knowledge and application types of questions. Do not ignore her experience or knowledge, and do not focus only on her experience. Getting started with what the candidate already knows is also a good way to make her comfortable. You can ask something from her past project, e.g., "Describe in one minute what you did in XYZ?", or ask about a past technology that she used extensively, e.g., "Did you use STL in C++? What are the common STL classes available?"
Focus on real application problems. Most software engineering positions require applying your existing knowledge to a new problem. The one quality which distinguishes a good programmer from a mediocre programmer is that a good programmer can easily translate your problem into pseudo-code. If you are interviewing for "soldiers" and not a "magician" or "general", avoid discussing high-level design type of problems, but instead focus on more low level real technical problems. For example, instead of asking "How would you design a scalable web server for blah blah?" ask more specific questions. In my experience, people who can answer high level design questions can create "vaporware" but those who can translate a small real problem to pseudo-code can actually write "software". If you need software engineers, avoid wasting time on high level design questions. Also, such application problems should be independent of specific domains and should just test whether the candidate has the required mathematical and computer skills to translate your problem to pseudo-code. I have given some examples later.
Follow thought processes and provide hints. If you believe that the candidate is getting diverted toward an incorrect answer, there is no harm in giving hints or counter-questions to course-correct her thoughts. Do not be too adamant on your answer. Sometimes, a 75% correct answer is good enough.
Provide itemized feedback. When you submit your recommendation to HR or your manager about a candidate, specifically itemize individual qualities and performance, and emphasize specific skills and the lack thereof. For example, "I had a nice 45 min conversation with XYZ, and I found her to be a very good programmer but needs training on Flex. After initial introductions, I asked one algorithm and three programming questions. She did well in two programming ones and average on others. Programming ability: very good; Needs hand-holding: yes; Algorithms: average; Strength: programming; Weakness: Flex; Recommendation: weak accept." My final recommendation is one of strong-accept, weak-accept, weak-reject, or strong-reject, with implied meaning of "a very strong candidate, and must hire her", "a good enough candidate, but won't argue to hire her if others disliked her", "an average candidate, but won't argue to reject her if others strongly liked her", "a poor candidate, and must not hire her", respectively.

As an interviewer you would be wondering about examples of real questions that would distinguish a good programmer from an average one. These are some examples. As mentioned before, you should create your own question, instead of using these, otherwise you cannot distinguish a candidate who genuinely solved the problem from the one who has read this blog.

Video conferencing layout: suppose you know the window dimensions, WxH, and want to fit participant videos in an MxM tile. Each video has a fixed aspect ratio of 4:3. All video objects are of the same size in the layout. Your MxM tile should be laid out in the middle-center, with potential empty spaces near window edges. The layout should maximize the size of the MxM tile, so that the empty spaces near edges are minimized. You are given an array of video objects V[] and a function layout(v:video, x, y, w, h) which lays out a single video object with size (w, h) at position (x, y) inside the window. Write pseudo-code to lay out participant videos. (Hint: start with 1 video, then 2x2, then 3x3, then generalize. Additional questions: how would you modify it to an NxM tile instead of MxM? What should happen if the number of videos is more than 9 but less than 16 -- which boxes are empty? How would you modify it so that empty spaces including empty boxes are minimized in an NxM layout?)
Path optimization: suppose you have a map of a city with Manhattan-style layout. Suppose north-south streets are named a1, a2, etc., and east-west streets are named b1, b2, etc. Some streets have traffic signals, with a 5-second walk sign, a 15-second count-down to continue walking if started, and a 20-second don't walk sign, periodically repeating in that order. Other streets do not have traffic signals, in which case traffic must yield to pedestrians. Suppose you need to walk from the corner of a5/b5 to the corner of a7/b10, and the only street with traffic lights on your way is a6. You walk at a constant speed. Crossing a6 takes 15 seconds whereas crossing any other street takes only 5 seconds. You do not want to start crossing a6 if you know you can't finish before it turns to the don't walk sign. You want to minimize the time taken from source to destination, hence minimize the time waiting at traffic lights. You have functions named walk(), turn(left or right), stop(). Write pseudo-code for your decision process from your source to destination point. (Hint: draw out the map first, then it becomes easy to visualize and solve. Additional question: can you generalize between any two points as long as you know the complete map and which streets have signals?)
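To give a feel for the level of answer I would expect for the video conferencing layout question, here is a minimal sketch in Python rather than pseudo-code. The function name tile_layout and its parameters are my own for illustration. It assumes row-major filling, so when the number of videos is not a perfect square the trailing boxes of the last rows stay empty; that is only one possible answer to the "which boxes are empty" follow-up.

```python
import math


def tile_layout(num_videos, win_w, win_h, aspect_w=4, aspect_h=3):
    """Return one (x, y, w, h) box per video for a centered MxM grid."""
    if num_videos <= 0:
        return []
    # Smallest M such that M*M >= num_videos (1 -> 1, 2..4 -> 2, 5..9 -> 3, ...).
    m = math.isqrt(num_videos - 1) + 1
    # Largest 4:3 cell such that M cells fit across and M cells fit down.
    cell_w = min(win_w / m, win_h / m * aspect_w / aspect_h)
    cell_h = cell_w * aspect_h / aspect_w
    # Center the whole grid; leftover space is split evenly between opposite edges.
    x0 = (win_w - m * cell_w) / 2
    y0 = (win_h - m * cell_h) / 2
    boxes = []
    for i in range(num_videos):
        row, col = divmod(i, m)  # row-major placement
        boxes.append((x0 + col * cell_w, y0 + row * cell_h, cell_w, cell_h))
    return boxes
```

With the V[] array and layout() function from the question, the caller would then do something like "for v, (x, y, w, h) in zip(V, tile_layout(len(V), W, H)): layout(v, x, y, w, h)".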

If you have more ideas, feel free to comment.

Why do P2P-SIP?

One of the questions I often get: if SIP itself is peer-to-peer, why do P2P SIP?

Unlike a typical client-server web application or Jabber presence system, a SIP user agent is both client and server. This means that a SIP user agent can send as well as receive SIP requests. In theory, a SIP proxy is not a required component in a SIP system. Ideally, the caller does not need to know whether a call to "sip:bob@some-server.com" will be received by a SIP user agent directly or by an intermediate SIP proxy server first. In practice, people often refer to an intermediate SIP proxy (and registration) server as the SIP server, and such SIP servers are the norm rather than the exception.

In theory, once the end-to-end SIP session is established between the caller and callee user agents, the media can traverse directly between the two user agents on the IP network. This is what makes SIP systems support a peer-to-peer (or end-to-end) media path. In practice, the presence of network address translators and firewalls prevents direct IP connectivity between the two user agents for the media path. This requires workarounds such as interactive connectivity establishment or network relay services such as a media proxy.

A bigger problem in practice is that due to business reasons, the VoIP provider does not want the media path to be end-to-end, so that it can have control over the "service" for accounting, billing, advertisement or other reasons.

In short, SIP is a tool that enables peer-to-peer (or end-to-end) media as well as signaling path. Protocols are like tools, which vendors use to build applications and systems, similar to how a construction worker uses various tools to build a house. In the current Internet, many vendors have used SIP to build closed walled gardens of managed services, unlike what SIP was initially envisioned for.

On the other hand, P2P-SIP is a realization of this problem to explicitly define SIP-based communication without using managed servers. Instead of central SIP servers managed by your provider, which can impose constraints to break end-to-end media and communication services, P2P-SIP aims at decentralizing the registration and proxy functions so that the signaling path is peer-to-peer.

The main benefits of P2P-SIP are as follows:

Organizations and service providers can save the cost of server maintenance. In particular, there is no new training or position required specifically for VoIP IT staff. There is no need to host dedicated servers in data centers with 99% availability and pay for energy and bandwidth.
The VoIP industry can move away from the traditional service provider oriented business to a more open end-to-end user application. Essentially, it becomes more in line with your other Internet services such as web and email: web browsers don't distinguish between servers or domains, and you can send email using any mail client and any provider to anybody else.
Finally, the most important reason is that the cost saving eventually propagates to end-user, who can enjoy free VoIP service as long as they are paying for their IP network access. Peer-to-peer infrastructure enables highly scalable communication system such as Skype at a very low cost. A small vendor's VoIP is able to scale to millions of users only if it can save cost of server maintenance and bandwidth.

The important aspects of P2P-SIP are:

It does not depend on a VoIP service provider for signaling and media path. Hence there is not much money for managed services in P2P-SIP.
It can use end-user devices on public Internet to relay media path for end-users behind restricted networks. Hence it requires incentive for public end-users to help restricted end-users.

As you can see, the benefits of P2P-SIP are for end-users, at the cost of service provider businesses. Most of the existing VoIP effort is driven by large corporations and carriers who have no interest in making the service open to end-users and losing control over them.

Theory vs practice of SIP-based VoIP

I recently attended the VoIP conference and expo [1], at Illinois Institute of Technology, organized by Prof Carol Davids, and also got a chance to speak on a couple of topics [2]. There were many interesting presentations in the conference giving perspectives from leading software and equipment vendors, carriers and service providers, government bodies, standardization forums as well as open source developers. This article presents some of my thoughts regarding the theory and practice of SIP-based VoIP.

The inaugural session showed a demonstration of IP-based 911 call by students by integrating and using the software pieces developed at other universities. It was great to see sipd [3] being used by other universities for exciting new projects, and brought back the memories when we were developing sipd.

Theory

The session initiation protocol (SIP) was invented to create and control multimedia sessions on the Internet because the previous protocols were either insufficient (e.g., HTTP, RTSP) or too complex (e.g., H.323). Unlike HTTP which typically requires a dedicated server, SIP was designed to be more peer-to-peer. Hence, your VoIP phone itself acts like both client as well as server to send and receive SIP requests. Ideally, you do not need to keep the SIP proxies in the session path, except for initial call setup. The protocol is designed to encourage subsequent requests such as ACK, BYE or re-INVITEs to be end-to-end, if possible. The protocol includes mechanism to enable an intermediate proxy to require that all subsequent requests in a session be sent via that proxy. But this was designed to be used as an exception rather than a rule. The motivation is scalability: keep the proxies lightly loaded.

In theory, a SIP-based system follows the end-to-end principle of Internet: keep the intelligence in the end, and have the network (or intermediate proxies) be dumb. The end-to-end principle has been crucial in the success of the Internet and more recently the web applications. As long as you keep the network provider independent of your application provider, you see explosion of application innovations.

Practice

In practice, your SIP-based system is typically "owned" by your network provider who has business incentive to provide you billable applications and services, and prevent you from talking to other open SIP-based systems without going through their billed "services". The largest SIP systems such as Comcast and Verizon digital voice are designed to be closed systems which use SIP in the network without allowing end-users to access it directly. More recently, Apple's Facetime is a closed service and does not inter-operate with other SIP services. With SIP-based IP multimedia subsystem (IMS) being adopted by wireless carriers, there is more incentive for businesses to convert SIP from an end-to-end protocol to a network centered and managed service.

One of the terms I kept hearing during the VoIP conference was "managed" services, along with why they are useful for consumers and what kind of new innovations are happening. When I looked at the details, these services and innovations are basically what SIP already provides, e.g., service APIs, unified communications, etc. I had the opportunity to work on some of these 5-10 years ago when I was at Columbia University. It is discouraging to know that because of narrow-minded business incentives of vendors and carriers, walled gardens of SIP systems are created which prevent open innovation and require many fold more effort to get basic features to the consumers. First, (1) the vendors and carriers use an open protocol, SIP, to build a VoIP system, then (2) invest resources in making it a walled garden, and finally (3) invest more money and resources to create a federation of these walled gardens. In the end, (2) and (3) nullify each other, and just (1) was sufficient. All the money invested in (2) and (3) gets wasted in the long term.

Conclusion

I would like to advise vendors and carriers to just focus on providing a good IP network with some quality of service and less restricted NATs, and leave the rest of the VoIP services to the millions of application developers. As Henry Sinnreich said during the conference, the only service is "the Internet". Instead of providing "managed" network services, open up your network for end-to-end innovations. In the long term this will boost your network and bring more revenue. With Internet and web, there is more opportunity for everyone, and a walled garden approach in the network is just going to keep you away from the long term growth.

The open innovations in VoIP are bound to happen. If you do not want to be part of that, someone else definitely will. Adrian Georgescu presented Blink [4], a fully featured easy to use SIP user agent, as a great example of SIP-based open innovation. For every VoIP developer in "managed" service organization, there are probably a thousand independent developers such as web application developers. Sooner or later, these developers will build some open platform or system which will attract hundred times more traffic than your managed services and hence many fold more revenue. At that time all your investment in "managed" services will go down the drain. Because if you don't open up your SIP systems, these developers will not wait for it, and build something else. This has happened before with Internet and web applications, and will happen again sooner or later with Internet voice/video communication or VoIP.

Tips for implementing application protocols

This article presents some tips for implementing application protocols such as for web services, multimedia communication, streaming or Internet telephony. The tips are mostly relevant for implementations in the Python programming language.

Keep all blocking operations outside your protocol implementation. This mostly includes sockets, files and timers. If you design your protocol parser and controller to be independent of blocking calls, then it can easily be converted to various asynchronous or synchronous controllers as needed. For example, the rfc3261.py module implements the core SIP stack using the Stack class. The application supplies the API for timer creation, message receiving as well as sending. When the application receives a packet on a socket, it invokes a method on the stack. When the stack has parsed the received packet and needs to inform the application of a high-level event such as an incoming call, it invokes a method on the application. This allows an application such as voip.py to provide a co-operative multitasking based controller. On the other hand, the built-in HTTPServer in Python includes synchronous and blocking calls for sockets and disks. This makes the built-in class' HTTP implementation hard to use for various high-performance applications that cannot afford to block. Due to this, almost every web framework implements its own HTTP, instead of re-using the built-in class. The trade-off is that your implementation may become more involved if you keep blocking operations outside.
Do not use multi-threading in your protocol implementation. Firstly, getting a multi-threaded application right is very hard. Secondly, for CPU intensive tasks or disk I/O bound tasks, the CPython's global interpreter lock (GIL) will prevent efficiency anyway; hence multiprocessing should be used. Thirdly, for network I/O bound applications multi-threading has advantage, but not as much as multiprocessing. Consider using multiprocessing, but beware of cross-platform problems, especially on Windows! In my experience, co-operative multitasking (or green-thread) works best for protocol implementation. If you are worried about efficiency on multi-core CPUs, you should leave that decision to the main application that will use your protocol implementation to present a client or a server application. The main application can decide whether to use multi-threading or multiprocessing and co-ordinate among them.
Decouple the protocol parsing and handling implementation. Sometimes you may need to use just the parser without the handler. For example, if a single incoming TCP connection can have either HTTP, SIP or RTSP messages, then it becomes easier for the application to first parse the message to determine what it is, and then invoke the appropriate handler. Because of NAT and firewall, many application protocols need to be sent via a single port, e.g., 80 or 443. If an application from Flash Player connects to your server on TCP, it will first send a socket policy request, before sending any other actual application protocol message. If your protocol parser is separate from the handler, you can invoke the socket policy request parser as well as protocol parser, to determine what request it is.
Avoid blocking on DNS lookup, if possible. This goes back to first point; do not block in your protocol implementation. Usually it is hard to notice the DNS lookup as blocking. Most built-in libraries provide synchronous and blocking calls for DNS lookup. Consider using some asynchronous DNS library. If that is not possible, move the DNS lookup out of the core protocol implementation, to the main application. Sometimes DNS lookup is done during logging, e.g., to convert client IP address to host name, and may be hard to detect.
Log all warnings, errors and exceptions. In server implementations, you may get tempted to handle various exceptions and ignore them, to make your server more "robust". Unfortunately, this practice leads to more headaches later on when some critical bug appears but is hard to detect. If you log all warning, error or exception conditions, even if you ignore them, you may be able to detect such bugs early on. A warning is a suspicious behavior either in your code or in an external system. An error is a failure case due to some external problem, e.g., a file requested by the client was not found on the server. An exception is most likely a programming mistake, e.g., accessing an attribute on "NoneType".
Do not hold on to resources. With automatic reference counting and garbage collection, it becomes your responsibility to free up any unused references. Typically the application protocol defines how long the resources should be kept, e.g., how long a SIP transaction lasts. But there are some resources which can persist for much longer duration, e.g., user contact location. External databases are more suitable for such resources. Secondly, with event driven software architecture such as listener-provider model, it is easy to get in to reference loops, e.g., listener has reference to provider and vice-versa. Similarly, a Message object may have list of Header objects, and each Header may refer back to the Message. Your cleanup code should correctly free up unused references, e.g., "del varname".
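As an illustration of the first and third tips above (no blocking calls in the protocol code, and a parser that is separate from the handler), here is a minimal sketch of a classifier that looks only at bytes already received and guesses which protocol they belong to. The classify function and its return values are hypothetical names for this example and are not the API of rfc3261.py. It relies on two facts: the Flash Player socket policy request is the null-terminated string "<policy-file-request/>", and HTTP, SIP and RTSP all carry their protocol version as the first token of a response line or the last token of a request line.

```python
POLICY_REQUEST = b"<policy-file-request/>\x00"
TEXT_PROTOCOLS = (b"HTTP", b"SIP", b"RTSP")


def classify(data: bytes):
    """Guess the protocol of the first message in data without touching a socket.

    Returns 'policy', 'http', 'sip', 'rtsp' or 'unknown', or None when more
    bytes are needed before a decision can be made.
    """
    if data.startswith(POLICY_REQUEST):
        return "policy"
    if POLICY_REQUEST.startswith(data):
        return None  # empty input or only a prefix of the policy request so far
    line, sep, _rest = data.partition(b"\r\n")
    if not sep:
        return None  # first line is not complete yet
    tokens = line.split()
    if not tokens:
        return "unknown"
    # Responses start with the version ("SIP/2.0 200 OK"); requests end with it
    # ("INVITE sip:bob@example.com SIP/2.0").
    for token in (tokens[0], tokens[-1]):
        name = token.split(b"/")[0].upper()
        if name in TEXT_PROTOCOLS:
            return name.decode("ascii").lower()
    return "unknown"
```

The main application owns the socket: it reads whatever is available, calls classify() on its buffer, reads more if the result is None, and otherwise passes the buffer to the matching handler. Because classify() never blocks, the same code works under a select loop, green threads or plain blocking sockets.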

Scalability vs Performance

I have been reading articles on scalability and performance. This article summarizes some of my understanding about this topic.

Scalability is the ability to scale the system to higher load. Performance determines the throughput of the system under load [1]. In theory, scalability and performance are orthogonal; you can handle higher load either by scaling the system or by improving the performance of individual components of the system. In practice, scaling and performance improvement are used together to improve the overall system.

Suppose a single machine can handle a load of N. If it is possible to handle 2N load by adding another machine, or kN load by adding another k-1 machines, then the application is designed to be scalable. On the other hand, you can always try to optimize your application or buy more expensive hardware to make your application handle 2N load in the single machine. Clearly there is a limit to the performance gain on a single machine. Also, for the same amount of overall improvement, typically scaling the system by adding redundancy is much cheaper than improving the performance of single machine by optimization or buying more expensive hardware.

This seems to indicate that scaling should always be preferred. Unfortunately the problem is that designing your application for scalability is not trivial. As an example, Google AppEngine (GAE) is designed to be scalable, but not necessarily high-performance [2]. On the other hand, a relational database such as MySQL can be optimized for high performance, but designing your application to scale with MySQL is a challenge. In most web applications, the database server eventually becomes a bottleneck at high load. Google's Bigtable, in contrast, is designed to be scalable. The tradeoff is that the GAE API does not allow many relational database features such as join, and hence requires programmers to learn a new way of data storage and access.

Horizontal scalability refers to adding more machines to handle the load, whereas vertical scalability (which we call high-performance) refers to adding hardware components in existing machines such as more memory or better CPU to handle higher load [3].

High Scalability Techniques

Partitioning the data is most common scalability technique. It allows you to distribute different partitions on different servers. Consistent hashing has been used in distributed hash tables and distributed server farm to assist partitioning and replication of data in the presence of high churn when machines come and go frequently.
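The following is a minimal sketch of consistent hashing, to show why it tolerates churn: when a server leaves, only the keys that were stored on that server move. The HashRing class, its method names and the choice of 100 virtual points per server are assumptions for illustration, not taken from any particular system.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with several virtual points per server."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas
        self._points = []  # sorted positions on the ring
        self._owner = {}   # position -> server name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            pos = self._hash(f"{node}#{i}")
            self._owner[pos] = node
            bisect.insort(self._points, pos)

    def remove(self, node):
        for i in range(self.replicas):
            pos = self._hash(f"{node}#{i}")
            del self._owner[pos]
            self._points.remove(pos)

    def lookup(self, key):
        """Return the server owning key: the next point clockwise on the ring."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

For example, with ring = HashRing(["a", "b", "c"]), calling ring.remove("c") reassigns only the keys that previously mapped to "c"; keys on "a" and "b" stay where they were, which is what a plain "hash(key) mod N" scheme cannot offer.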

Stateless servers are much more scalable than stateful, because stateful servers may need to communicate with each other or share state which limits the scalability. Web servers and SIP proxy servers are easy to make stateless, whereas conference servers, presence servers or gateways are difficult to make stateless. Many applications too require stateful processing at the server, e.g., web applications that need stateful database storage. This concept can be used together with partitioning to build a two-stage server farm where first stage stateless servers just do load balancing whereas second stage stateful servers work on a small data partition. Unfortunately, some applications such as presence or publish-subscribe are too complex for easy data partition.

High Performance Techniques

The C10K problem [4] talks about the typical web server limitation of only about ten thousand simultaneous connections due to operating system and software constraints, and presents several references to improve the performance. The usual software performance bottlenecks are data copies, context switches, memory allocation and lock contention. Various techniques to handle these problems are summarized in [5].

Asynchronous and non-blocking IO are commonly used to convert blocking/synchronous methods to event-based. Although asynchronous and non-blocking refer to almost the same thing, there are certain crucial differences in the API [6]. Non-blocking refers to making your methods not block and hence return immediately, e.g., with an error code indicating that the method is not complete. Typically, additional method is available to know the state of the IO. For example, socket API allows non-blocking mode, and can use select to check the state of the socket, whether read or write can be done or not. Thus, the application program has full control of when the read is done and in which thread/stack. On the other hand, asynchronous API are more event-based, where the application registers a method handler for an event, and the system calls the method when that event occurs from within the system thread, or posts that event to the main application's handler loop.
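The non-blocking half of this distinction can be sketched with the standard socket and select modules. The helper names try_read and nonblocking_demo are mine, and the example uses a local socket pair instead of a real network connection. Note how the application decides when to call recv; the system only reports readiness.

```python
import select
import socket


def try_read(sock, size=4096):
    """Non-blocking read: return bytes, or None if nothing is available yet."""
    try:
        return sock.recv(size)
    except BlockingIOError:
        return None  # the call returned immediately instead of waiting


def nonblocking_demo():
    a, b = socket.socketpair()
    try:
        b.setblocking(False)
        first = try_read(b)  # nothing has been sent yet
        a.sendall(b"hello")
        # Ask the OS about the state of the socket, waiting up to one second.
        readable, _, _ = select.select([b], [], [], 1.0)
        second = try_read(b) if readable else None
        return first, second
    finally:
        a.close()
        b.close()
```

Here the first read finds no data and returns right away, and the second read happens only after select reports the socket as readable.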

A well known topic of debate is whether event-based systems or threads are more suitable for high performance servers. Theoretically, both are equivalent with non-preemptive threads and co-operative multi-tasking. But in practice, due to the way threads are implemented and the resources needed by threads, event-based systems have performed better on single CPU machines. Unfortunately, it is difficult for pure event-based systems to take advantage of multi-CPU hardware.

Thread-pool and process-pool have been used to improve the system performance and take advantage of multi-CPU hardware. Both multi-process and multi-thread systems have been built in practice. The advantage of multi-process implementation is that multiple processes can listen for incoming connections on the same socket, whereas in multi-thread implementation only one thread can be listening on a socket. The problem in multi-process implementation is that it needs explicit inter-process communication using message passing or shared memory, whereas in multi-thread implementation it is easy to use global variables with mutex and conditions to share state. With respect to event-based systems, there are two design patterns: a reactor pattern allows the application to register for "ready" event and perform the read operation when event is received; a proactor pattern allows the application to register for "complete" event and receives the incoming data along with the read event [7].
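A reactor can be sketched in a few lines with Python's selectors module. The Reactor class and reactor_demo function are illustrative names of my own. The application registers a callback for the "ready" event and the callback performs the read itself; a proactor would instead hand the callback the data after the read has completed.

```python
import selectors
import socket


class Reactor:
    """Minimal reactor: dispatches 'ready to read' events to callbacks."""

    def __init__(self):
        self._selector = selectors.DefaultSelector()

    def register(self, sock, on_readable):
        sock.setblocking(False)
        self._selector.register(sock, selectors.EVENT_READ, on_readable)

    def unregister(self, sock):
        self._selector.unregister(sock)

    def run_once(self, timeout=1.0):
        """Wait for events, invoke their callbacks, return how many fired."""
        events = self._selector.select(timeout)
        for key, _mask in events:
            key.data(key.fileobj)  # the callback does the actual recv()
        return len(events)

    def close(self):
        self._selector.close()


def reactor_demo():
    received = []
    a, b = socket.socketpair()
    reactor = Reactor()
    reactor.register(b, lambda sock: received.append(sock.recv(4096)))
    try:
        a.sendall(b"ping")
        reactor.run_once()
        return received
    finally:
        reactor.unregister(b)
        reactor.close()
        a.close()
        b.close()
```

A real server would call run_once() in a loop and register the listening socket too, with a callback that accepts the connection and registers the new socket.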

The thundering herd problem in an OS is that when an IO event is received, all the waiting threads are woken up. But only one thread will handle the event and the others will go back to sleep. This wastes CPU cycles. The problem and a solution are described in [8].

For a high-performance server implementation, the general consensus is to always use non-blocking IO, and to use a thread or process pool with the minimum number of threads/processes. The idea is that there should be one thread/process per CPU. This paper [9] describes a SIP server architecture which can maintain a few hundred thousand active TCP connections. For pure network IO it is possible to always use non-blocking IO on commodity hardware, whereas for disk IO it is not so easy. Hence, the thread-pool model, with worker threads that wait on disk IO completion, has been used with success in the past.
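One way to combine the two ideas in Python is sketched below: an event loop for everything that can be non-blocking, plus a small pool of worker threads that absorb the blocking disk reads. The function names and the default of four workers are assumptions for illustration only; this is not the architecture of the server in [9].

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def _read_file(path):
    # Ordinary blocking disk read; runs in a worker thread, not in the event loop.
    with open(path, "rb") as f:
        return f.read()


async def read_file_async(path, pool):
    """Hand the blocking read to the pool and suspend until it completes."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, _read_file, path)


def read_many(paths, workers=4):
    """Read several files concurrently; results come back in the order of paths."""

    async def main():
        with ThreadPoolExecutor(max_workers=workers) as pool:
            tasks = [read_file_async(p, pool) for p in paths]
            return await asyncio.gather(*tasks)

    return asyncio.run(main())
```

While the worker threads wait on the disk, the event loop stays free to service network events, which is the point of keeping the pool small and dedicated to the operations that cannot be made non-blocking.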

Lessons in starting a software project

This article presents my thoughts on DOs and DONTs of starting a new software project. Many lessons listed in this article are already well known or common sense, but usually not always followed!

DOs

Brainstorm often: During the initial phases of software growth or even before starting to write a single line of code, you should do several sessions of brain storming. It could be on validating your idea, figuring out competition, predicting the future, picking a programming language, potential learning, etc. This is the difference between carefully planned birth versus unexpected pregnancy. Just because you can write some software, should you? Especially if better alternatives exist?
Use good version control system: Even for the most trivial projects, you should try to use version control system. I like SVN (subversion) for my open-source projects, but if you can afford git, it works better for complex project management. If you are starting an open source project, consider code.google.com for hosting your SVN repository -- it is fast, simple and hassle free. It is like a good home for your baby software.
Document all ideas: When the software is evolving you will have many ideas for new features, doing things differently, or incorporating competing features. Obviously due to lack of resources and time, you won't be able to incorporate all these. But you must document all the ideas, and if possible prioritize them. Keep a single list of ideas. Usually the software will evolve on its own to attract new features. Implement only the most crucial ideas and features, and resist the temptation to add many features.
Few developers during growth: Keep the core set of excellent developers to one, two or at most three when the project is growing. Every major piece of software should have only one excellent developer. This avoids unnecessary friction and induces a feeling of ownership. Software is like a baby, which needs a good parent to raise and grow it, before it can mature and face the world. You wouldn't want to raise your software in a foster house where nobody feels ownership, i.e., in an organization with an engineering "team".
Pick the right language and tools: Every programming language has some strengths and weaknesses. Make sure you select the right language, that is quick to develop with and maintain, and works well for your target application. For example, with low-level C/C++ you get performance, and with high-level Java, Python, you get portability. Over the years I have liked Python for most of my applications. Unfortunately, in corporate environment, Java is the pet-child because there are many fold more software developers and managers who know Java well. For modern Internet and web applications, Python, Ruby, Erlang and ActionScript are becoming more popular.
Include testing and defensive programming: To be successful, sooner or later your software project will need to get out of the demo-mode and face the real world. It might become too late at that time to worry about scalability or glaring bugs if those involve redesigning your software. It saves a lot of time and energy to use common techniques such as good logging, unit testing, performance best practices, and defensive programming from day 1. Also maintain an issue tracker and log even the tiniest of issues with your software. Sooner or later you will need to address them.

DONTs

Don't procrastinate: If you have an idea to work on, don't procrastinate. Just get started, write something up, try to get a prototype going. Most successful projects need a complete re-write at least once. So don't be afraid to write throwaway code.
Don't document before coding: While software engineering people will say that you should follow good software process -- writing requirements specification, design document, test cases, etc. -- those can be written later too! Source code is what makes or breaks a software. You can write detailed specification and design documents, after you already have a prototype and want to document it or propose a change. In my experience, any design document written before writing the code is incorrect, and needs to change drastically after the source code is written.
Don't spend time on one-off items: For your software, there are some items which are directly related, and then there are one-off items. For example, for a VoIP client, the protocol implementation, good voice quality, etc., are directly related. On the other hand, having a user signup page, instant messaging text chat, file sharing, etc., are one or two-off items, which are not directly related, but indirectly assist users in VoIP. When you start a project, do not spend time doing one-off items, but work on directly related items first.
Don't wait too long for 1.0 release: There is 80% difference between an 80% complete software and a released software. When you formally release your software, you have to take care of user manual, getting started guide, installer as well as finish those last annoying bugs. In the case of software projects, it is very easy to get started but very difficult to put an end. There is always an endless list of features which needs to be completed before the release, and hence your release never happens. Unless, you make it happen. You will have to make a firm decision about what bugs are important and what can remain as known issues for version 1.0.

Flash-VideoIO: Flash-based audio and video communication

I launched the Flash-VideoIO project to facilitate audio and video communication using easy-to-use reusable Flash application with extensive JavaScript API. More from the project page...

"Flash-VideoIO is a reusable generic Flash application to record and play live audio and video content. The Flash-VideoIO project aims at implementing a generic Flash application named VideoIO.swf which can be used for variety of use cases in audio and video communication, e.g., live camera view, recording of multimedia messages, playing video files from web server or via streaming, live video call and conferencing using client-server as well as peer-to-peer technology."

Developers are invited to explore and experiment with the VideoIO component, provide feedback and/or contribute to the development.

Distributed Systems Development: Client vs Server

In this article I compare the distributed systems development for client vs server. When you start implementing a distributed system such as a client or server for some protocol, the basic functionality is easy to implement. But to make your software usable in real world, the client or server specific considerations take a lot of time. This article tells you how to build good quality distributed systems: client or server.

Client

Considerations: Auto-configuration, IP address change detection, NAT and firewall traversal, robustness against failures, adapt to network condition, consistent user interface and view, command line vs user interface, guaranteed security, idle and sleep detection, responsiveness of user interface, redundant connections to servers, keep-alive, caching, analytics.

Examples: Firefox browser, Skype, Gtalk

Description: A client should automatically configure as much as possible, e.g., network IP, hostname, username, machine type, etc., from the system. If the client relies on the local IP address, it should automatically detect any change in IP address. For example, a SIP client should re-register the new IP address as its contact if there is a change. NAT and firewall traversal is one annoying reality on the Internet. Most often an HTTP based client works out of the box because most networks are permissive of HTTP and HTTPS. However, if you are building any other client such as IM and chat, VoIP or media recording, then there is some network in some enterprise which will block your connection. Most protocols have an alternative to perform NAT and firewall traversal. For example, RTMP has RTMPT, XMPP can work on BOSH, and SIP uses a bunch of techniques.

A client should be able to adapt to any network condition. This not only applies to the network topology and filtering, but also to the bandwidth and quality. A VoIP client should automatically adapt to a lower quality codec if it detects lower end-to-end bandwidth. Bandwidth detection and adaptation should be a continuous process instead of being performed only at the beginning. If you need to connect to a server, and there are many distributed servers, the client could periodically detect a list of closest servers, and connect to one or more of the closest servers in network proximity, where proximity is determined by network distance or by delay and jitter. This allows your client to handle geographic distribution. If you have multiple redundant servers, your client should be able to fail over in case one server fails. A better approach could be to keep persistent connections to two servers, so that failover latency is minimized. The automatic configuration, detection and adaptation of various network and system conditions is one of the most crucial properties of successful peer-to-peer clients such as Skype. Some clients need detection of idle or sleep behavior, e.g., to update your presence status. If the user puts the system to sleep (or standby) then your software may not get any chance to communicate the status to the server. In such cases, your protocol or server should be robust in detecting idle clients.
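Two of the adaptation steps above, choosing a codec for the measured bandwidth and choosing the closest servers, can be sketched in Python. The function names, the headroom factor and the way bandwidth and round-trip times are measured are all assumptions for illustration. The codec table lists payload bitrates only and ignores packet overhead.

```python
from typing import Dict, List, Optional, Tuple

# (codec name, payload bitrate in kbit/s), best quality first.
# Packet header overhead is ignored; the headroom factor below is a
# rough, assumed allowance for it.
CODECS: List[Tuple[str, int]] = [("G.711", 64), ("G.726-32", 32), ("G.729", 8)]


def pick_codec(available_kbps: float, headroom: float = 1.5) -> Optional[str]:
    """Return the best codec that fits the measured bandwidth, or None."""
    for name, kbps in CODECS:
        if available_kbps >= kbps * headroom:
            return name
    return None


def closest_servers(rtt_ms: Dict[str, Optional[float]],
                    count: int = 2) -> List[str]:
    """Return up to `count` reachable servers, lowest round-trip time first.

    rtt_ms maps a server name to its measured round-trip time, or to None
    if the server did not answer the probe.
    """
    reachable = [(rtt, host) for host, rtt in rtt_ms.items() if rtt is not None]
    return [host for _, host in sorted(reachable)[:count]]
```

Calling both functions from a periodic timer, rather than once at startup, gives the continuous adaptation the text asks for. Keeping count at 2 matches the suggestion of holding persistent connections to two servers so that the second one is ready when the first fails.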

A client is user facing software. The responsiveness of the client user interface distinguishes good software from average software. For example, if the client doesn't get a server response within 200 ms, it can automatically inform the user via an hour glass or rolling wheel indication. If your GUI becomes unresponsive while it is "processing" instead of giving an indication, then the user is likely to get annoyed or make mistakes such as clicking on the same button multiple times. You should always use an event-based system for your user interface, instead of synchronous processing, especially if it can block. Caching can be used if needed to speed up your performance. For example, instead of fetching the user list to display in your client every time you switch to the user list view, you can cache it and display a cached copy. Periodically, refresh your local cache with the actual data from the server. Caching is also useful in other places where client-server communication becomes overloaded.
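The user list caching mentioned above could look like the following Python sketch. The class name CachedValue, the fetch callback and the time-to-live value are hypothetical. The clock is passed in so that the behavior can be checked without waiting.

```python
import time
from typing import Any, Callable, Optional


class CachedValue:
    """Keep one fetched value and refresh it after ttl_seconds."""

    def __init__(self, fetch: Callable[[], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch              # e.g., a request to the server
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Any = None
        self._expires_at: Optional[float] = None

    def get(self) -> Any:
        """Return the cached copy, fetching again only if it has expired."""
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value

    def invalidate(self) -> None:
        """Force the next get() to fetch, e.g., after the user edits the list."""
        self._expires_at = None
```

For example, users = CachedValue(fetch_user_list, ttl_seconds=60) lets the user list view call users.get() on every switch while contacting the server at most once a minute. Here fetch_user_list is your own function. In an event-based client the fetch itself should still run asynchronously so that an expired entry does not block the user interface.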

Command line clients are becoming less common these days, but such clients are more powerful in some scenarios. Consider whether a command line alternative is useful and feasible as well. Finally, a guarantee on security is a must for Internet client applications. Most application protocols define secure communication, e.g., over TLS/SSL, S/MIME, etc. Your client should have an option to go completely secure and encrypted.

In summary, good client software is software that does the one thing it is meant for. You may add many new features, but how you do the essential function is what will make your client useful and popular. Consider using analytics to find which feature is gaining popularity, or which feature is no longer used. Software is like a human body. If you don't exercise to remove body fat -- remove unused pieces and re-factor periodically -- you will become too fat, slow and useless. This is more important for client facing software, because client behavior keeps changing, and the client you used last year may not be the same client this year.

Server

Considerations: Easy configuration, logging, vertical and horizontal scalability, robustness and automatic failover, auto loading of configuration changes, connectivity to different backends, programmability, event based but multi-threaded, use multi-core CPUs, memory usage optimization, management console, command line control, activity monitoring, admission control for quality, stateless vs stateful, replication of critical data, partitioning of data for scalability, caching, keep-alive for crash detection of server, detection of idle or unresponsive clients.

Examples: Apache web server, ejabberd, SIP express router

Anti-example: Tomcat, Flash Media Server

Description: A server should have explicit, easy and extensive configuration options so that it can be deployed in a variety of different scenarios, e.g., the Apache config file. Note that when it comes to configuration: explicit is better than implicit, easy is better than complex. Another important feature of the server is being able to load configuration changes without having to kill the server. For example, Tomcat automatically detects new war files and re-deploys the applications. Apache web server can be made to re-read the configuration using a Unix signal. Some servers take the configuration to an extreme by defining an easy to use script that controls the server behavior. For example, SIP express router defines a perl-like programming script to handle an incoming request, forward it to a telephony gateway or perform authentication. Such fine grained configuration allows deploying the server in a variety of different environments -- from personal use to enterprise or carrier deployments. On the other hand, I find the J2EE model of defining services and classes in XML configuration files hard to use. Even though the configuration is done by configuration file or script, an easy to use web based management console gives a clean interface to the server control and monitoring.
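Reloading configuration on a Unix signal, as described above, might be sketched in Python like this. It is a minimal illustration and not how Apache or any other server mentioned here is implemented. The class name, the JSON file format and the choice of SIGHUP are assumptions. The signal handler only sets a flag and the main loop does the actual reload, which keeps the handler trivial.

```python
import json
import signal
from typing import Any, Dict


class ReloadableConfig:
    """Configuration loaded from a JSON file and reloadable on request."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.values: Dict[str, Any] = {}
        self._reload_requested = False
        self.load()

    def load(self) -> None:
        with open(self.path, "r", encoding="utf-8") as f:
            new_values = json.load(f)
        # Swap only after a successful parse, so a broken file
        # never replaces a working configuration.
        self.values = new_values

    def request_reload(self, signum: Any = None, frame: Any = None) -> None:
        """Signal handler: just remember that a reload was asked for."""
        self._reload_requested = True

    def reload_if_requested(self) -> bool:
        """Call from the server main loop; returns True if it reloaded."""
        if not self._reload_requested:
            return False
        self._reload_requested = False
        self.load()
        return True


def install_sighup_handler(config: ReloadableConfig) -> None:
    """Reload on SIGHUP where the platform has it (Unix, not Windows)."""
    if hasattr(signal, "SIGHUP"):
        signal.signal(signal.SIGHUP, config.request_reload)
```

With this in place an administrator edits the file and runs kill -HUP with the server's process id, and the server picks up the change on its next loop iteration without dropping existing connections.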

Easy to use and configurable logging is another crucial piece of server software. A server log is typically the first place you go when you detect a problem. There is a tradeoff between extensive logging and selective logging. I prefer extensive logging with selective viewing. Also I prefer accessing the log from the command line using "tail -f logfile.log" instead of the variety of web based log viewers.
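One way to get extensive logging with selective viewing is sketched below in Python using the standard logging module. The function names and the log line layout (level first) are assumptions made for this example. Everything is written to the file, and the selection happens only when reading.

```python
import logging
from typing import Iterable, List


def make_server_logger(path: str, name: str = "server") -> logging.Logger:
    """Log everything (DEBUG and up) to a plain text file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    handler = logging.FileHandler(path, encoding="utf-8")
    # Level first, so a viewer can filter on the first word of each line.
    handler.setFormatter(
        logging.Formatter("%(levelname)s %(asctime)s %(name)s %(message)s"))
    logger.addHandler(handler)
    return logger


def select_lines(path: str,
                 levels: Iterable[str] = ("WARNING", "ERROR", "CRITICAL"),
                 ) -> List[str]:
    """Return only the log lines whose level is in `levels`."""
    wanted = set(levels)
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f
                if line.split(" ", 1)[0] in wanted]
```

Because the file is plain text with the level as the first word, the same selective view is available from the command line by piping "tail -f logfile.log" through grep.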

Scalability and robustness are part of good server design. There are many other articles and web sites dedicated to discussing this, e.g., highscalability.com. There are several techniques such as event based thread pool, connectivity to different backends, bi-directional master-slave databases, replication of critical data, in-memory distributed cache such as memcache, partitioning of data, two-stage load sharing architecture, and use of servers from different vendors for robustness against security exploits. The server should prefer stateless operations. It should be able to detect unresponsive clients in case of stateful sessions, e.g., by periodically sending keep-alives. Note that a server-initiated keep-alive is more robust than a client-initiated one for distributed applications. For example, in client-initiated keep-alive, if client1's keep-alive fails, client1 assumes it is disconnected, but client2 doesn't know that client1 is disconnected; whereas in server-initiated keep-alive, once the server detects that client1 is disconnected, it can inform other related clients about it.
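The server-initiated keep-alive in the client1/client2 example can be sketched in Python as follows. The class and method names are hypothetical, the transport that actually sends keep-alives and receives responses is left out, and "related clients" is simplified to all other connected clients.

```python
import time
from typing import Callable, Dict, List


class KeepAliveMonitor:
    """Server-side tracking of which clients still answer keep-alives."""

    def __init__(self, timeout_seconds: float,
                 notify: Callable[[str, str], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = timeout_seconds
        self._notify = notify    # notify(recipient, disconnected_client)
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def client_responded(self, client_id: str) -> None:
        """Record a keep-alive response (or any other traffic) from a client."""
        self._last_seen[client_id] = self._clock()

    def sweep(self) -> List[str]:
        """Drop silent clients and tell the remaining ones about each."""
        now = self._clock()
        dead = [c for c, seen in self._last_seen.items()
                if now - seen > self._timeout]
        for c in dead:
            del self._last_seen[c]
        for c in dead:
            for other in self._last_seen:
                self._notify(other, c)
        return dead
```

The server would call sweep() from a timer after each round of keep-alives. Because the detection happens on the server, client2 learns about client1's disconnection even though client1 itself can no longer say anything.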

The server should use the available resources in the best possible way. Typically memory, CPU and bandwidth are the critical resources. Some form of activity monitor should detect the resource usage by the server and inform the concerned IT person in case of abnormal behavior. This could be because of a memory leak in the server or some security attack from malicious systems. Obviously the implementers should strive to fix any memory leaks. Another useful behavior by the server is to do admission control based on available resources. For example, if the server detects that it is using 90% of its bandwidth, then it should not admit a new media streaming client, or if it detects that its CPU is fully utilized, it should reject new requests with an appropriate error response, so that the client retries with exponential back-off timeouts. In a distributed server farm, the servers should be able to not only automatically configure based on the configuration of other servers, but also detect overload on and share load from other servers in the farm. For example, a self organizing server can detect other servers in the farm, and automatically assume load sharing and/or secondary server responsibility.
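The admission control and the client's exponential back-off described above can be sketched in Python. The function names, the reason strings and the default timer values are assumptions. The 90% bandwidth and full CPU thresholds come from the example in the text.

```python
from typing import List, Tuple


def admit_request(bandwidth_used: float, cpu_used: float,
                  needs_media: bool = False,
                  bandwidth_limit: float = 0.90,
                  cpu_limit: float = 1.0) -> Tuple[bool, str]:
    """Server side: decide whether to accept a new request.

    bandwidth_used and cpu_used are fractions between 0.0 and 1.0.
    """
    if cpu_used >= cpu_limit:
        return False, "cpu overloaded"
    if needs_media and bandwidth_used >= bandwidth_limit:
        return False, "bandwidth nearly exhausted"
    return True, "ok"


def backoff_delays(base: float = 0.5, factor: float = 2.0,
                   max_delay: float = 32.0, attempts: int = 6) -> List[float]:
    """Client side: seconds to wait before each retry after a rejection."""
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(min(delay, max_delay))
        delay *= factor
    return delays
```

With the defaults, backoff_delays() gives waits of 0.5, 1, 2, 4, 8 and 16 seconds. A real client would usually add some random jitter to each delay so that many rejected clients do not all retry at the same instant.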

In summary, configuration, scalability and robustness form the core of a good server implementation.

What Great Programmers think?

I found a very interesting blog article and wanted to summarize the great programmers' view!

1. What is the most important skill every programmer should possess?

Good "taste". Communication skills and expression in writing. A strong sense of what the work you are doing is worth. Concentration. Passion. Self motivation. Thinking clearly. Preferring evidence over intuition.

2. What will be the next big thing in computer programming?

Web application programming will replace any GTK, Java Swing, Qt, Win32, MFC, etc. Real AI can change the current incremental trends in programming. Large-scale distributed processing. But many great programmers admit that they can't and don't want to predict future.

3. Why are some programmers 10 or 100 times more productive than others?

Ability to restate hard problems as easy ones. Genius is one percent inspiration and ninety-nine percent perspiration. Ability to fit the whole problem in their heads at once. Care about what they do. They don't rush and slap things together, but have a holistic picture of what is to be built. Knowledge of tools.

4. What are the most important tools?

Python, Lisp, Emacs, SVN, MySQL, GIMP, Firefox, TextMate, Pine, Ruby, make, TeX, vi, Unix, sam, bash -- they all are extensible. Learn everything in /bin and /usr/bin on Unix.

My take on the articles is that there is something common among all great programmers -- modesty, persistence, self motivation, taste, and extensive knowledge of useful tools.

I also suggest reading this great article on how to recognize great programmers. If you want to be successful as a programmer, I also suggest reading this book.

FAQ on using Flash Player to make phone calls

I present my answers to some frequently asked questions (FAQ) on using Flash Player to make phone calls.

1. Is Flash Application a good choice for VOIP?

It depends. An RTMP based application is not a good choice, whereas a newer RTMFP based application is good for Flash to Flash Internet voice applications. For Flash to Phone applications, Flash is not a good choice as it is. Flash is good at user interface and ubiquitous availability, but the TCP-based RTMP is not suitable for real-time interactive media, and the UDP-based RTMFP is proprietary so it cannot interwork with existing SIP-based VoIP systems.

Secondly, Flash Player is missing some of the crucial VoIP pieces such as good silence suppression and echo cancellation, so a Flash based VoIP client becomes useless without a headset.

Thirdly, although Flash Player supports the open standard Speex audio codec, many existing VoIP providers do not support Speex, and expect only traditional voice codecs like G.729 and G.723.1. So you may also need to incorporate transcoding, which is CPU intensive. Video transcoding is more difficult because of the proprietary video codec in Flash Player.

2. Will there be any performance degradation when the call goes through the following path? (Flash Client -> Media Server -> RTMP to SIP Converter -> VOIP Server -> VoIP/PSTN Gateway -> PSTN Network -> Telephone)

Yes. If you can avoid intermediaries to cut down on media path latency, it will help a lot. Typically the VoIP Server (or SIP proxy server) is independent of the media path, so it doesn't affect the media quality. But the media path goes through the Media Server (FMS?) and the RTMP to SIP converter, and that too over TCP. This degrades the quality a lot. One improvement could be to remove the "Media Server" from your path by having the Flash Client directly connect to the RTMP to SIP converter. Also if you can reduce the network distance between the Flash Client and the RTMP to SIP Converter, that will help a lot.

Secondly, with Flash Player you may need to do audio transcoding in your RTMP to SIP converter. This further degrades the performance and limits the scalability of your converter.

3. Some experts say that development in C or C++ is preferred over Flash Player for VOIP calls to phones, for performance reasons. Is that true?

A native VoIP client is preferred over Flash Player because the media packets can go directly from the client to the telephone instead of going through the RTMP to SIP converter. The advantage arises because (1) the native client can use UDP instead of being restricted to TCP-based RTMP, and (2) the network distance is lower for a direct path. Even if your converter is on a good network and close to your client so that the network distance is not much of an issue, the UDP-vs-TCP difference has a great impact on the quality of a native VoIP client implementation compared to Flash Player.

In general the network component affects the quality more than the programming language. So whether you use C/C++, Python, Java or some other language, it doesn't matter much. But if you can have end-to-end media path over UDP between the two clients, or between the client and the gateway, it is much better. Obviously with Flash Player you cannot have the packets go directly unless your RTMP to SIP converter is local to the Flash Client.

All the existing good quality systems (Skype, GTalk) tend to use end-to-end media-path over UDP as much as possible.
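The UDP-vs-TCP point above can be illustrated with a small Python model of when each voice packet reaches the application. This is a deliberately simplified sketch and not a model of real TCP. The function name and all timing values (20 ms packet interval, 50 ms one-way delay, 200 ms until a lost packet is retransmitted) are illustrative assumptions.

```python
from typing import List, Optional, Set


def delivery_times(n_packets: int, lost: Set[int], in_order: bool,
                   interval_ms: int = 20, one_way_ms: int = 50,
                   retransmit_ms: int = 200) -> List[Optional[int]]:
    """Time (ms) at which each packet is handed to the application.

    in_order=True models a reliable, ordered stream such as TCP: a lost
    packet is retransmitted and every later packet waits behind it.
    in_order=False models UDP: a lost packet is simply missing (None)
    and later packets are delivered as soon as they arrive.
    """
    arrival: List[Optional[int]] = []
    for i in range(n_packets):
        sent = i * interval_ms
        if i in lost:
            arrival.append(sent + retransmit_ms + one_way_ms if in_order else None)
        else:
            arrival.append(sent + one_way_ms)
    if not in_order:
        return arrival
    delivered: List[Optional[int]] = []
    latest = 0
    for t in arrival:
        latest = max(latest, t)   # cannot be delivered before earlier packets
        delivered.append(latest)
    return delivered
```

With five packets and packet 1 lost, the ordered stream delivers at 50, 270, 270, 270 and 270 ms, so one loss delays every packet behind it, while the UDP case delivers at 50, 90, 110 and 130 ms with only packet 1 missing. For interactive voice a single missing 20 ms frame is usually far less noticeable than a burst of late audio.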

4. There are different media servers available, like Adobe Flash Media Server (FMS), Wowza, Red5 etc. Which one is the best choice?

Do you still want to pursue an RTMP to SIP converter? Anyway: in terms of performance I would guess that FMS is the best choice. But if your aim is to build an RTMP to SIP converter then Red5 is probably the best. FMS is proprietary with not many customization/programming choices available, so you cannot easily integrate a SIP stack or an RTMP to SIP converter into FMS. On the other hand Red5 is completely open source and in Java, so it allows easy integration with other Java based SIP stacks. Additionally you could integrate SIP stacks written in other advanced languages such as Python or Ruby because Red5 allows applications in those languages, whereas an FMS application is restricted to ActionScript 1.0.

I haven't worked with or used Wowza so I cannot comment on that. I have worked with FMS and Red5 though, as well as Python based rtmplite and siprtmp projects.

6. We are now in a confusion whether to develop our VOIP application in Flash technology or QT/Java/C#. What will be your choice?

I think that decision mostly comes from your business case. But I would suggest non-Flash technology if possible and if your business demands very good quality of voice service. If your VoIP client will be assisting your main business, then people won't mind downloading and installing the VoIP client. The advantage Flash has is that it is already available in most people's browsers, so it doesn't require an additional download or installation. So if your VoIP application is only a small part of your main web-based business, then I think Flash technology will be better.

Another option is to use the Gmail video/voice architecture described in my article. Basically it uses Flash Player for user interface, but all the networking or voice related processing happens using their native GoogleTalk plugin.