How to extend HTML5 for real-time video communication?

A few months ago, I was discussing HTML5 with a friend of mine. We tried to figure out what would it take to extend it to support web-based video communication. The proposed HTML5 already includes audio and video tags, but are useful only for streaming video applications. This article presents more refined thoughts on how to extend the browser to support video communication.

First approach: extend the video tag
W3C has added new video element in HTML5 to facilitate playback of interoperable video formats across browsers. Existing web sites use "object" element to run an external plugin such as Flash Player for video playback, which is intended to be replaced by the HTML5's video element. This allows browser manufacturers especially for phones and other devices to easily playback web videos, without having to implement the full Flash Player plugin. The "src" property allows specifying the URL of the video to play, and additional properties such as poster, preload, autoplay, loop and controls allow controlling the behavior of the video player.

One way to support video communication is to extend the video element with additional properties that allow it to capture and publish local video, and control camera and microphone behavior. For example, in a two-party call between Alice and Bob, Alice can have two video elements, one to publish local video to URL stream "alice" and other to play remote video from URL stream "bob". Similarly, Bob can have two video elements, one to publish local video to URL stream "bob" and other to play remote video from URL stream "alice". The "src" property can specify the central media server or rendezvous server location as well as the publish or play stream names, e.g., "rtmp://server/conf123?publish=alice".

This is the idea behind my Flash-based audio and video communication project. In addition to existing properties such as src, preload, autoplay, loop and controls, it defines new properties for microphone, camera, playing, recording, etc., as you can see on How to use the VideoIO API?. It also overloads the "src" property to allow "rtmp" and "rtmfp" URLs for media server or rendezvous server location, respectively. This application with its new properties can be used as a drop-in replacement for a video element that supports video communication in the browser.

This approach of extending the existing video element with new properties works well for two-party as well as multi-party conferences, and centralized as well as end-to-end media path. The nice thing about this approach is that it keeps the actual call signaling out-of-scope of the video element, e.g., your web application implements call signaling using existing Javascript/Ajax/websocket/server-event technologies. It keeps the specific rendezvous protocol mechanism such as "rtmp", "rtmfp", and in future "sip" or "rtsp", outside the video element.

To avoid interoperability problems, a minimum subset of supported rendezvous is recommended. The requirements of such a protocol is to support real-time media transport, preferably over UDP, in centralized or end-to-end path in presence of network middle boxes such as NATs and firewalls.

Second approach: define new connection object
The previous approach integrates capture, playback and connection functions in to a single video element, with additional properties. Alternatively, these functions can be split in to different elements and Javascript objects, e.g., the video element does display/playback, but new camera and microphone objects allow capture, and new connection object allows end-to-end real-time media path among participants. The Javascript application actually connects these different elements and objects to build a complete video communication system.

There are several proposals on how the new connection or transport API will look like. Example attributes are: protocol (udp or tcp), list of reflectors and relay servers , mode (initiating or receiving), secure (boolean). Additionally, it has methods such as connect and send, and events to indicate connection status and incoming data. Existing protocols such as ICE, STUN, RTP/RTCP and SIP may be implemented in the browser or external gateways to support such as transport object. Finally, these transport objects can be piped with display and capture components, audio and video codecs and filters, etc., to implement a complete video communication application.

In summary, this approach defines new Javascript objects such as Transport, Camera, Microphone, Codec, etc., and allows the application to connect these objects to build a real application. This is more complex than the first approach, but allows fine-grained application logic.

Third approach: use external application
This approach understands the limitations of HTML and does not try to "add" video communication to it. We are considering this approach of a separate application in our web communications project at Illinois Institute of Technology.

While the idea of extending HTML to support video communication is useful and interesting, there are many limitations. In the past, incompatibility in HTML among browsers has been a nightmare for web developers, and extending HTML for yet another feature is bound to cause more interoperability problems. Browser manufacturers are sometimes not too keen to add a new feature, e.g., for business reasons if it competes with the manufacturer's existing product or service. Third important reason is that the video element of HTML5 lacks some digital rights management related features, which causes media owners to publish their media using restricted Flash application. Fourth, adoption of new HTML5 is slow, so web site developers still need to fall back to Flash-based application for video playback at least in the short term. Finally, adding capture and end-to-end transport components in HTML5 gives rise to a plethora of issues related to privacy, security and denial of service attacks, in case of faulty browser implementation. Due to these reasons many people believe that extending HTML and browsers to support video communication is not the right approach.

Hundreds of applications exist that implement consumer video communication. Some popular ones are Skype, Gmail, tinychat and Facetime. The technology behind these are drastically different, especially for signaling and control. However, at the bottom, every video communication application tends to establish some form of end-to-end UDP-based real-time media path, and fall-back to relays if that fails. As mentioned before, IETF standards exist to establish such media path and relays.

Imagine a standard-compliant resident application, rtc-app, that runs on user's machine independent of the browser, but allows any application including browser to establish real-time media-path. The browser can use existing API such as websocket or HTTP to interact with rtc-app. The rtc-app application is not owned by a specific vendor, and is installed by the end-user. The avoids re-implementing the feature by every vendor who wants to do real-time video communication. To address the privacy and security concerns, rtc-app must directly ask permission from the end-user before initiating or accepting a connection instead of automatically (and randomly) on API calls. This is similar to how Flash Player asks the end-user for permission to capture from microphone or camera, but can remember the application for future use if told so by the end-user.

The main advantage of this approach is that it does not require changing the browser or HTML, but still is a generic implementation-focussed way to enable real-time video communication for many other applications. If an existing vendor such as Skype or Google opens up its API, it will be a big step forward. While rtc-app can provide transport functions, the audio and video capture still needs to be done somehow. Various codec licensing issues may prevent us from including it in rtc-app, but Flash player based application similar to the first approach can perform capture on its behalf. The main problem with this approach is that it requires an additional download and install by the end-user.



How to conduct a technical interview for software engineer?

(This article presents my thoughts on how to effectively conduct a technical interview for a software engineering position. It presents the "interviewer's" point of view based on more than 30 technical interviews I have conducted, and quality of candidates I have recommended. If you are an "interviewee" I suggest you look elsewhere, e.g., interview questions.)
  1. Know the position you are hiring for. If you have been part of a software engineering team or have read the book, "The mythical man-month", you would know that you need several different "types" of members in a successful team. You need a "magician", who knows or can figure out solution to every technical problem you may have. You need a couple of "plumbers" who are willing to fix any broken software piece. You need a "general" who is very motivated about what you are doing, knows how and when to delegate, and keeps everyone together. You need a few "soldiers" who can follow orders, do the job, and be happy to contribute. And so on. As an interviewer, you need to know what position you are hiring for? You need to tailer your interview as per the requirement. One interview pattern does not fit all types.
  2. Do your homework. Before the interview, thoroughly read the candidate's resume/CV. If she has extensive work experience, identify only one or two of her past projects to focus on. If you have even a slight doubt about her programming ability, prepare a written programming test. If possible, scheduler a separate or additional time slot before your face-to-face interview for the programming test. Do not use any existing online programming test material, otherwise you won't be able to distinguish between someone who knows how to program vs someone who has gone through many web sites containing interview questions. Do not give take home tests. Do not share your programming interview questions with other interviewers in your organization.
  3. Start with knowledge questions. During the interview, after initial introductions, start with a question on her past experience. Your interview should balance between knowledge and application types of questions. Do not ignore his experience or knowledge, and do not focus only on his experience. Getting started with what the candidate already knows is also a good way to make her comfortable. You can ask something from her past project, e.g., "Describe in one minute what you did in XYZ?", or ask about a past technology that he used extensively, e.g., "Did you use STL in C++? What are the common STL classes available?"
  4. Focus on real application problems. Most software engineering positions require applying your existing knowledge to a new problem. The one quality which distinguishes a good programmer from a mediocre programmer is that a good programmer can easily translate your problem in to pseudo-code. If you are interviewing for "soldiers" and not "magician" or "general", avoid discussing high-level design type of problems, but instead focus on more low level real technical problems. For example, instead of asking "How would you design a scalable web server for blah blah?" ask more specific questions. In my experience, people who can answer high level design questions can create "vaporware" but those who can translate a small real problem to pseudo-code can actually write "software". If you need software engineers, avoid wasting time on high level design questions. Also, such application problems should be independent of specific domains but just be able to test whether the candidate has the required mathematical and computer skills to translate your problem to pseudo-code. I have given some examples later.
  5. Follow thought processes and provide hints. If you believe that the candidate is getting diverted in to incorrect answer, there is no harm to give hints or counter-questions to course correct her thoughts. Do not be too adamant on your answer. Sometimes, a 75% correct answer is good enough.
  6. Provide itemized feedback. When you submit your recommendation to the HR or your manager about a candidate, specifically itemize individual qualities and performance, and emphasize specific skills and lack of it. For example, "I had a nice 45 min conversation with XYZ, and I found her to be a very good programmer but needs training on Flex. After initial introductions, I asked one algorithm and three programming questions. She did good in two programming ones and average on others. Programming ability: very good; Needs hand-holding: yes; Algorithms: average; Strength: programming; Weakness: Flex; Recommendation: weak accept." My final recommendation is one of strong-accept, weak-accept, weak-reject, or strong-reject, with implied meaning of "a very strong candidate, and must hire her", "a good enough candidate, but won't argue to hire him if others disliked her", "an average candidate, but won't argue to reject him if others strongly liked her", "a poor candidate, and must not hire her", respectively.
As an interviewer you would be wondering about examples of real questions that would distinguish a good programmer from an average one. These are some examples. As mentioned before, you should create your own question, instead of using these, otherwise you cannot distinguish a candidate who genuinely solved the problem from the one who has read this blog.
  1. Video conferencing layout: suppose you know the window dimension, WxH, and want to fit participant videos in MxM tile. Each video has fixed aspect ratio of 4:3. All video objects are of same size in the layout. Your MxM tile should be laid out in the middle-center, with potential empty spaces near window edges. The layout should maximize the size of the MxM tile, so that the empty spaces near edges are minimized. You are given an array of video objects V[] and a function layout(v:video, x, y, w, h) which lays out a single video object with size (w, h) at position (x, y) inside the window. Write pseudo-code to layout participant videos. (Hint: start with 1 video, then 2x2, then 3x3, then generalize. Additional questions: how would you modify it to NxM tile instead of MxM? What should happen if number of videos is more then 9 but less than 16 -- which boxes are empty? How would you modify it so that empty spaces including empty boxes are minimized in NxM layout?)
  2. Path optimization: suppose you have a map of a city with Manhattan-style layout. Suppose north-south streets are named, a1, a2, etc., and east-west streets are named b1, b2, etc. Some streets have traffic signals, with 5-second walk sign, 15 seconds count-down to continue walking if started, and 20 seconds don't walk sign, periodically repeating in that order. Other streets do not have traffic signal, in which case traffic must yield to pedestrians. Suppose you need to walk from corner of a5/b5 to corner of a7/b10, and only street with traffic lights on you way is a6. You walk at the same speed. Crossing a6 takes 15 seconds whereas crossing any other street takes only 5 seconds. You do not want to cross a6 if you know you can't finish before it turns to don't walk sign. You want to minimize the time taken from source to destination, hence minimize the time waiting on traffic lights. You have function named walk(), turn(left or right), stop(). Write pseudo-code for your decision process from your source to destination point. (Hint: draw out the map first, then it becomes easy to visualize and solve. Additional question: can you generalize between any two points as long as you know the complete map and which streets have signals?)
If you have more ideas, feel free to comment.