Sunday, June 23, 2019

Multi-language software programs

When two bilingual people talk to each other, they often mix the two languages in their conversation. Have you heard of Hinglish?

As a software developer, it makes me wonder - is this doable or already done for programming languages?

There are ways to compile modules or files written in different languages into the same binary or application, e.g., JNI or a Python module, so that a program in one language can invoke an "external" module of another language. There are ways to write a program in one language and auto-generate equivalent programs in others, e.g., haxe. Or to connect separate programs in separate languages via networking. (see more, and more, and much more)

However, syntax embedding is not very common, except in web-style or user-interface-driven programs. An HTML web page often includes scripts in JavaScript and styles in CSS, and can potentially include code snippets from any other language, as long as the underlying browser or the embedded script can interpret or apply it.

The interesting question is - is it possible, or even beneficial, to have a program file that can freely intermix any language code on demand?

This is like including code snippets written in Python, C/C++, Java and JavaScript in the same program file. Different languages' code snippets would access the same data (data structures, variables, stack, etc.). It would allow a multi-lingual developer to use whichever programming language is best for the given functionality. Even if the mechanism requires separate files within the same compiled or interpreted application, can they freely share the data, and can a developer add a new file in a new language without pre-provisioning the supported languages?

What does it take to create such a developer tool?

Vanilla JS

I am a proponent of using Vanilla JS [1,2,3,4] wherever possible. It is plain JavaScript without external frameworks. In my last two jobs, I built many feature-rich and cross-platform apps using HTML5, JavaScript and CSS, without depending on popular frameworks such as jQuery, React or Angular. There are numerous articles on the web about the benefits (and problems) of using plain JS, i.e., without frameworks. Here, I present my thoughts and a reading list for developers who also want to avoid frameworks if possible.

Many JavaScript frameworks and libraries fall under two categories: (a) a stop-gap measure until the popular browsers implement the desired and standardized behavior, or (b) a shortcut to quickly build a type of application. This calls for understanding the difference between a polyfill library and a framework. Libraries and polyfills [5,6] are actually life savers for web development. However, frameworks are of concern, because they change the way you need to write code, or structure the application, or fit your thoughts into a particular paradigm [7,8,9,10,11,12,13,14].
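
To make the distinction concrete, here is a simplified sketch of a polyfill: it fills a gap in older browsers by implementing a missing standard method (here, `Array.prototype.includes`) without dictating how the application is structured. For brevity, this sketch omits the standard's handling of negative `fromIndex` values.

```javascript
// Polyfill sketch: define Array.prototype.includes only if the browser
// lacks it. It implements standard behavior without changing how the
// rest of the application is written -- unlike a framework.
if (!Array.prototype.includes) {
  Array.prototype.includes = function (searchElement, fromIndex) {
    const list = Object(this);
    const length = list.length >>> 0;
    let i = Math.max(fromIndex | 0, 0); // simplified: negative indices clamped
    for (; i < length; i++) {
      const item = list[i];
      // SameValueZero comparison: NaN matches NaN, unlike indexOf.
      if (item === searchElement ||
          (typeof item === 'number' && typeof searchElement === 'number' &&
           isNaN(item) && isNaN(searchElement))) {
        return true;
      }
    }
    return false;
  };
}
```

Once the native method ships in all target browsers, the polyfill becomes a no-op and can be dropped; application code never changes.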

At a high level, many popular frameworks require or recommend developing single-page applications (SPA). One reason is the inefficiency of reloading the framework code on every page of a multi-page application. Another is the ability to maintain state between screen transitions. Single-page applications have their own drawbacks [15,16] - they can cause bulky script injection at runtime, affecting performance, or memory leaks that build up over time - hence they may not be suitable for long-running mobile (or even desktop) apps. Without a framework, such monolithic code becomes hard to maintain, and with a framework, it is hard to develop beyond the features available in or allowed by the framework.

At the low level, some frameworks attempt to do in JavaScript the same things that are available natively in the browser in a more efficient manner, such as CSS animation/transition or DOM node selection. When such applications are ported to mobile, e.g., using Cordova or PhoneGap, the effect on performance is visible. On the other hand, CSS transition and keyframe styles are powerful, versatile and fast, and behave like native animation on mobile devices.

Emerging web standards and their availability on modern browsers are outstanding. With web components [17,18,19], a chunk of a web page and its related processing can be delegated to separate, independent code. Alternatively, iframes can be used to modularize the software into loosely-coupled components. Unlike an SPA, careful use of iframes can avoid long-running memory leak buildup - the content and related memory are unloaded when the iframe source is unloaded. Constructs such as ES6 classes and Promises further help create short, concise and easy-to-understand code in plain JS.
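As a small illustration of how far plain ES6 goes, the hypothetical helper below uses only functions and Promises; the names `delay` and `withRetry` are my own, not from any library.

```javascript
// A small Promise-based retry helper in plain ES6 -- the kind of utility
// that no longer needs a framework. (Hypothetical example.)
function delay(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function withRetry(task, attempts = 3, backoffMs = 100) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await task();              // resolve on first success
    } catch (err) {
      lastError = err;
      await delay(backoffMs * (i + 1)); // linear backoff between attempts
    }
  }
  throw lastError;                      // all attempts failed
}
```

A decade ago this pattern required a library and callback pyramids; today it is a dozen lines of standard JavaScript.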

In the end, whether a framework is used or not, good programming practice always wins. I agree that a good framework can prevent common programming mistakes by mediocre developers. However, the technical debt it can incur and the frustration it can cause can be huge if not carefully planned. So are you ready to take the no-framework challenge? [20]

Saturday, June 22, 2019

A generic video-io component for WebRTC

This article proposes a generic video-io web component for WebRTC. It can enable various real-time applications such as voice/video conferencing, video messaging, video presence and video broadcast. This client side component may be connected with any signaling channel or mechanism to integrate with existing or new websites or applications. 

By encapsulating the core functions in a single easy-to-understand component, the component enables reuse of the core media features present in WebRTC. Without this, a user relies on the individual website or application developer to support and expose those media features. 

Additionally, I describe the design of a global secure application service that can act as a signaling mechanism for all such component instances.

An early version of the idea, in the form of a video-io widget, exists and is described in my papers [artisy,vclick].


In my earlier life, I had implemented the flash-videoio project [github,paper,slides] and demonstrated a wide range of applications using a simple, flexible and easy-to-use widget. These applications included video messaging, live broadcast, two-party video call, multi-party conference, panel discussion broadcast, interoperability with phone network, and audio-only communication. 

The flash-videoio widget combined the various media abstractions available in Flash Player in a single box that exposed a well defined API. It could be invoked from the embedding web page in JavaScript or the parent Flash application in ActionScript. 

Two things were very important in that project - independence of the client-side widget from the website hosting it, and a single abstraction that was easy to understand. Since signaling and control remained outside the widget, it could enable integration with existing or new application services over XMPP, WebSocket, Google Channel API, Facebook API, and so on. 

With WebRTC, the APIs available in the browser are different than what the Flash Player provides. But the idea of a widget that can provide a simple abstraction, and can work with any website or service is still very promising.

In HTML5, web components allow creating new custom elements with associated styles and script. This promotes interoperability, reuse and modularization of web application fragments. Numerous web components are available for various applications [link].

This article proposes a generic video-io component and its API, and discusses ideas to make it secure, robust and versatile.


Firstly, let us look at the common use cases enabled by WebRTC — not just for browsers, but also for mobile apps or other devices. These scenarios can motivate us further to create a generic video-io component, and guide us in its API design.

1) Customer support: a visitor of a webpage can easily connect with the owner or a support person. Although the visitor is on the browser or mobile app, the support person may not be. Interactive dialog with a machine, such as IVR, Alexa or Siri-style communication, is similar, except that the other end is an automated response system.

2) Surveillance or one way media: passively record activities via audio and video, and generate alerts or connect to live agents when something interesting happens. Several always-on videos, drones or IoT use cases also fall under this category.

3) Fast interaction: video games and other AR/VR-related applications require very low latency communication among participants. Other emerging areas, such as online healthcare and remote surgery, also fall under this category.

4) Online classroom: typically a one-to-many media path, but occasionally two-way media for asking questions or short interactions. Event broadcasts, webinars or online TV to a large number of viewers also fall under this category.

5) Team collaboration: online meetings including two-party as well as multi-party group meetings. Besides the ability to do real-time communication, it requires other features such as ability to record or share screen.

6) Instant communicator: a text chat or instant messaging application often needs the ability to initiate voice and video calls, and to add more participants to the call. Social video chat, video office, and other forms of impromptu multi-modal communication fall under this category.

7) Yet another client: for many telecom service providers, enabling browser and mobile devices to connect to their services is a big plus. Various types of gateways to connect WebRTC with other systems fall under this category.

These are just some of the categories that come to mind, but there are definitely many more. 

What are the common abstractions among these? 

Earlier, I wrote about three paradigms - call, conference and named stream [link] - and, indeed, they seem to be applicable in these examples as well. For example, surveillance and online classroom roughly fall under the named stream paradigm, team collaboration under conference, whereas instant communicator, yet another client and customer support fall under the call paradigm. Note that the boundary between the call and conference abstractions is blurred in many scenarios.

Unfortunately, no single application provider supports all these abstractions. In fact, existing application providers are often too rigid, without the flexibility to enable scenarios beyond their carefully crafted use cases. On the other hand, separating the signaling services from the media features enables innovation on multiple fronts. A video-io component facilitates such separation.


Here are some reasons to support a generic component that encapsulates media features for many such application scenarios.

Reuse media features across applications:

A video-on-demand website typically encapsulates the media features in a video player (like a widget or component), but keeps other provider-specific features, such as authentication or access control, outside the video player. Examples of media features are pause and the playback timeline.

A WebRTC application also includes several features - some are provider-specific, such as identity, authentication or access control, and others are media related, such as the ability to mute, zoom, record or change camera capture attributes. Typically, a WebRTC application developer would like to restrict and control the provider-specific features, but would prefer to enable a wide range of common media features. But the developer must explicitly include the implementations of these features in the application. Otherwise, the customer is left wondering why she cannot mute her call, or how she can zoom in to a part of the video.

Encapsulating such media features in an easy-to-understand web component allows reusing the common functions across various applications and websites. By keeping the signaling outside the component, it can still be integrated with a wide range of existing and new application scenarios. A comprehensive and well-defined API for this component can serve as a list of required media features among supporting applications, instead of depending on specific website or provider implementations, especially when the feature is entirely local or end-to-end.

Reduce dependency on third-party application providers: 

WebRTC promotes communication silos [link]. It enables any website to easily become a communication service provider for its users or visitors. 

Previously, VoIP signaling protocols such as SIP presented a consistent endpoint identity and allowed the communicating endpoints to exchange media and transport information. This allowed users of separate providers to talk to each other in a trapezoid topology [link]. 

With the triangle topology of WebRTC, and lack of global identity or signaling channel specification, the two endpoints connected to separate websites cannot easily talk to each other. Moreover, cross-site communication lacks the business incentives especially for websites that want to control (and hide) every aspect of their visitors’ interactions. 

The end result is the availability of numerous incompatible “signaling SDKs” for various WebRTC applications. These are often too rigid to be used in new scenarios, or too tightly-coupled with a third-party application provider service outside the main website provider. 

If the website’s communication is locked to such an external provider, many issues arise, e.g., which identity is used on the two systems? How does the third-party audio/video integrate with the website’s existing communication? Who stores the recordings of the interactions? Who can perform real-time or offline media analytics?

Oftentimes, it is in the long-term interest of the website to reuse its own service to enable audio/video communication, instead of relying on (and getting locked to) a third-party application provider. Using a client-side web component can off-load some aspects of this task, and can easily allow reusing the existing web servers and user identities of the website provider.

Block diagram

So what does the video-io component look like? Here is a logical block diagram of the component. 

The component combines the individual low level primitives available in WebRTC into a single widget represented as a box. The component can act as an audio/video publisher or subscriber (i.e., player). A publisher box can display video captured from the camera device, and also send the captured audio/video stream out, whereas a subscriber box can play a received audio/video stream. 

The component hides the differences between various web or mobile audio/video software to provide a comprehensive but consistent and easy-to-understand abstraction. It also keeps the actual signaling outside the box, and provides a generic way to plug in a new signaling mechanism.

The diagram above shows the two interfaces - application and signaling. The application interface is used to control the component by the application, and the signaling interface is used by the component to exchange signaling data with some signaling or notification service or with other component instances. 
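As an illustration of the signaling interface, the sketch below shows one possible adapter shape: the component calls `send` for outgoing signaling data, and the adapter invokes the component's `onSignal` callback for incoming data. All names here are hypothetical, not a defined API.

```javascript
// Hypothetical shape of the signaling interface: the component talks to an
// adapter, and the adapter talks to any transport (WebSocket, XMPP, etc.).
class SignalingAdapter {
  constructor() {
    this.onSignal = null;   // set by the component: incoming signaling data
  }
  send(message) {           // called by the component: outgoing signaling data
    throw new Error('subclass must implement send()');
  }
}

// A loopback adapter that pairs two component instances in-process,
// standing in for a real transport during development or testing.
class LoopbackAdapter extends SignalingAdapter {
  pairWith(other) {
    this.peer = other;
    other.peer = this;
  }
  send(message) {
    if (this.peer && this.peer.onSignal) this.peer.onSignal(message);
  }
}
```

Because the component only sees the adapter interface, swapping the loopback for a WebSocket- or XMPP-backed adapter requires no change to the component itself.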


The component can aim to define generic and extensible APIs to enable a wide range of existing and future signaling and media technologies. For example, the media stack could use WebRTC as default, but could fall back to Flash Player or third-party plugins for the actual media stream if needed. The external signaling can potentially use broadcast on a local area network, WebSocket, Google Channel, XMPP, a proprietary plugin, mobile device notifications, or something else.

Furthermore, a cloud hosted service to enable secure and robust signaling in the default case can help bootstrap the usage of the component. More about this global service is discussed later in this document.

The following diagram shows how the component is embedded or used on web vs. mobile. If the mobile platform does not allow or support WebRTC or WebView, then the library communicates with an external mobile app container that hosts one or more component instances. The component loaded from a web page should support both WebRTC and Flash Player, and depending on the session negotiation and capability of the browser, can pick either. The component loaded in the WebView of a mobile app should include only WebRTC capability for platforms that support WebRTC in the WebView. For other platforms, a separate mobile app may be launched that presents the user interface for one or more component instances. This is just one way to implement the cross platform component, using WebView and cross-platform tools such as Cordova.

The library runs in the scope of the parent web application, presents simple programming primitives, and interfaces with the component, or uses an iframe for browsers that do not support web components. In the case of a mobile app, the library provides native programming primitives and bridges with either the WebView or the external mobile app containing the component instances. The external mobile app may be implemented by a third party by following the interface described in this document.


Let us enumerate some component interfaces for common media features that are applicable to many of the scenarios listed above. A number of attributes below are inspired by the HTML5 video element or the flash-videoio widget. Furthermore, the proposed component is expected to support both live and stored media, for publish and play.

  • Signaling and failover: ability to attach a signaling channel or mechanism to a component instance. This may use application or website specific mechanism, such as over WebSocket or SIP, or using proprietary RESTful APIs. Furthermore, it includes the ability to fallback to a secondary signaling channel or mechanism if the primary one fails during runtime. Seamless failover is needed to reduce initial (call setup) latency.
  • Appearance: controls how the component appears, and should have the ability to scale down or remove visual elements if the media type is audio-only. 
  • Poster: specify an initial image to display before the video starts playing, either from local or remote stream. This is similar to the poster attribute of the video element. The component may allow a video file in addition to an image.
  • Preload, autoplay, loop: Similar to the video element, but applicable only for playback of stored media in general.
  • Controls: whether the component displays its own user interface controls, such as pause, mute, sound volume, etc. In addition to the playback mode, this also needs to support the publish mode.
  • Publish, play, record, mode: named stream (to publish or play) that is attached to this component, using live voice and/or video, or stored content. The mode determines whether it is live or stored content. A live stream may also be recorded, in which case the record mode may be append or replace, to append or replace an existing recording of the same name.
  • Playing, publishing: controls and indicates whether the component is currently actively playing or publishing.
  • Bidirection: controls and indicates whether the component has negotiated a bidirectional media stream. Even though one component instance can either publish or play a stream, multiple instances may work together behind the scenes to negotiate separate unidirectional streams or combined bidirectional streams.
  • Activity: indicates whether user is involved in any mouse pointer or touch event activity with the component. This can be used by the embedding application to show or hide the application defined controls for that component.
  • Devices (microphone, camera, sound, display, level, volume, width, height): controls whether those underlying features are enabled. For example, to mute the microphone or speaker, to stop the camera capture, or to hide the displayed video. Additionally, the microphone level and sound volume can be used to indicate the microphone activity level or to control the speaker volume. The capture and display width and height attributes can control or indicate the camera capture dimension or video display dimension. These attributes should also allow controlling which of multiple microphones or cameras is used. Also, a list of available devices should be exposed to the application.
  • Codecs and parameters: controls the voice and video codecs to use for publishing, and indicates the codecs in use for playing. Additionally, it includes other parameters such as capture microphone sampling rate, camera or display frame rate, video encode quality, default video key frame interval, camera capture quality, capture bandwidth, echo cancellation, silence suppression. In case of simulcast or scalable video codecs, additional attributes should control or specify a simulcast stream or scalable property, e.g., to receive only low frame rate stream, or low quality video.
  • Zoom: controls how the video content is zoomed to fit in the component size, and is roughly similar to the object-fit CSS attribute. Additionally, the application should be able to zoom the component view to a specified rectangular area. 
  • Mirrored: controls whether the local view is mirrored. Although CSS can be used to control the horizontal or vertical flipping of the component, having such an attribute allows flipping the video horizontally without flipping the controls.
  • Fullscreen: controls and indicates whether the component is rendered in full screen, in which case it may switch to hardware rendering, or enable higher quality automatically, or perform other optimizations as needed.
  • Snapshot: ability to take a one-time snapshot of the video as an image content, that can be consumed in the embedding JavaScript application.
  • Current time, duration: controls and indicates the current capture or playback time since the beginning of publishing or playing. Additionally, other attributes may indicate the total duration of a stored playback content, or statistics of the bytes loaded or total.
  • Quality: indicates playback quality of the audio and video streams, and may be used by the application to indicate to the user, e.g., as quality bars. Separate values for audio and video may be used. Additionally, separate attributes may be used to indicate the current bandwidth, packet loss frequency, playback buffer size, or other quality related metrics.
  • Duplicate: ability to create a duplicate play mode component, which plays from the same source as the original, albeit via separate re-negotiation of the streams. If a publish component is duplicated, it creates a play mode component that plays that original stream.
  • Send/receive: ability to send or receive metadata or other non-media data, such as for text chat. A publish component may be able to send to all the current players. A play component may be able to send to the publisher if any.
  • Group: ability to group together multiple component instances, so that they can share the signaling channel, and can be treated as bidirectional when needed. A two-party call or multi-party conference will often create a group for each call or conference, covering one publish and one or more play component instances.
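
To give a feel for this API surface, here is a skeletal sketch of a few of the attributes above as a plain JavaScript class. It is illustrative only: a real implementation would be a web component wrapping the WebRTC media stack, and all names are assumptions.

```javascript
// Skeletal video-io component state: a plain class standing in for the
// proposed web component. Only a few of the proposed attributes are shown.
class VideoIo {
  constructor() {
    this.publish = null;     // named stream to publish, or null
    this.play = null;        // named stream to play, or null
    this.mode = 'live';      // 'live' or 'stored' content
    this.playing = false;    // currently playing?
    this.publishing = false; // currently publishing?
    this.mirrored = false;   // mirror the local view?
  }
  // Setting publish or play would trigger media (re)negotiation in a real
  // component; this sketch only tracks the state transitions.
  start() {
    if (this.publish) this.publishing = true;
    else if (this.play) this.playing = true;
  }
  stop() {
    this.publishing = false;
    this.playing = false;
  }
}
```

Note that one instance either publishes or plays, matching the single-box abstraction described earlier; grouping multiple instances covers the bidirectional case.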

Many of the features listed above are clearly media features, whereas some are not. In particular, the signaling, failover and group features require some form of external signaling service or mechanism, via the signaling interface. That will allow the component to exchange various media attributes or features over the signaling channel.

The description presented above can be used to implement a generic video-io component. In practice, the component will be part of an application that includes some form of signaling channel or mechanism. The rest of the article attempts to describe a global service for such signaling. Note, however, that the client-side component is independent of such a service, and may be used in a proprietary or application-specific manner.

Application and service

There are roughly three high level entities in a WebRTC application - the end user, the client application, and the application service - or user, application and service for short.

A WebRTC application can be classified along several dimensions. Let us look at a few of these.

1) Walled garden vs. open system: 

Examples of walled garden (service) are websites that allow WebRTC-based communication only among the visitors of that website, or users of their APIs. Typically, such lookup APIs on the web server or access to the notification server are restricted, e.g., to account holders of that provider or visitors of that website. 

On the other hand, an open system (service) enables third-party developers to create client applications that can connect using the services of that provider or website. In that case one client application from one developer can talk to another client application from another developer using the same service. 

2) Proprietary protocol vs. standards based: 

Examples of standards-based system (service) are those that map to some existing standard protocol such as SIP, so that the triangle topology is readily converted to trapezoid. The goal is to allow a third-party developer to create servers (service) that can federate using open standards. This is so that the users of one service can communicate with those of another. On the other hand, a proprietary protocol is enough for a single service or provider system. 

Thus, walled garden does not allow clients from outsiders, and proprietary protocol does not allow servers from outsiders. 

Note that a walled garden does not mean proprietary, and an open system does not mean standards based. A system can be walled garden but standards based, e.g., if authentication for inclusion in the federation is controlled by one provider, even if all the servers in the federation use SIP. Similarly, it can be open system with proprietary protocol, e.g., with open APIs that do not follow existing signaling standards. 

3) Endpoint driven vs. server driven: 

An endpoint driven signaling system keeps most of the application logic running in the endpoint or client device [link]. In that case, a server is typically needed only for event notification and data storage. On the other hand, a server driven signaling relies on the application logic running in the server, e.g., to determine room membership, stream subscribers, or picking the right user device.

It is not surprising that almost all existing systems are server driven. Endpoint driven systems exhibit little business value, because the service provider loses control of the application logic, especially if it is an open system that allows third-party client applications. Furthermore, many existing systems are walled garden, with proprietary protocol, and server driven. 

Global service

One may ask - is there any benefit in a standards-based endpoint-driven open system? 

What would it look like? Who will benefit from it? Can it allow all the scenarios listed previously? If it can, then it enables an open platform for developing almost all kinds of use cases, where any developer can contribute with a client application or a hosted service, creating a global WebRTC-based system.

The video-io component proposal presented in this article enables creating such a global system. In particular, the media features already reside in the component (in the endpoint). If an application service provider uses a generic and open API, and a pluggable server farm architecture, then such a system can be created. Note, however, that the component itself does not impose any restrictions on the system type, and can readily support open or closed, standards or proprietary, and endpoint vs. server driven system. In any case, the application service provider needs to implement some basic signaling primitives.

These primitives were discussed earlier, and fall under the call, conference or named stream abstractions. The call abstraction creates a bi-directional media pipe and also allows registration and lookup of user identity. The conference paradigm presents a room and participant abstraction, where each participant in a room can see and hear the others. The named stream paradigm allows unidirectional media flow from zero or one publisher of a stream to zero or more subscribers of that stream, where the publisher and subscribers can come or go independently of each other.

So the question is - what does it take to build an endpoint-driven standards-based open system for WebRTC signaling?

First, the server must be lightweight and without any application logic beyond the above-mentioned abstractions. Second, the user or device identity must be independent of a single provider. Third, it must use existing standards for communication between servers, if needed. Finally, it must use open, published interfaces between client and server, whenever needed.

If the goal is to prevent this service from being usable only by proprietary applications, it must allow third-party developers to create client applications. On the other hand, existing cloud WebRTC application service providers typically allow developers to create applications within the scope of their developer key. Thus one developer owns that application, and often prevents other developers from creating an application that interacts with her application.

Public keys/certificates or their fingerprints can be used as the user or device identity. The identity provider should be independent of the application service provider, and different application providers may trust different identity providers - similar to how a browser can trust many different root certificate authorities.

For simplicity, based on the previous discussion, the application can be abstracted with three concepts: first, a component that publishes or plays media; second, a named stream which can be attached to a publishing or subscribing (playing) component, such that there is at most one publisher per stream at any time; and third, a collection of zero or more component instances and named streams that are dynamically adjusted as the application demands. Generally, these abstractions are enough to support the application scenarios listed earlier.

Named stream

One of the key features described before is the ability of the component to publish or play a named stream. Depending on the signaling layer, this may or may not be readily available in the service. If not, then a server is needed that keeps state for the various named streams, and connects the publishers and players of each named stream.

My earlier projects include a lightweight resource server that facilitates this feature. However, implementing this from scratch on a clean slate is not difficult. Such a named stream server needs to maintain a list of named streams that are being published or subscribed. Furthermore, it needs to enforce at most one publisher per stream, and zero or more subscribers. The publishers and subscribers of a stream may come or go in any order, and at any time.

The stream name may be globally unique and globally accessible (e.g., URI), or may be unique within a single server. In the latter case, it is scoped within the server address, so that it becomes globally unique. A unique stream name allows simple interface for failover when needed, e.g., publish to stream A on server 1, and if that fails then to stream B on server 2. Otherwise, if server address cannot be specified in the stream name, then a failover within a single service is not robust in all scenarios.
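The state logic of such a named stream server can be sketched in a few lines; the class and method names below are illustrative, not taken from my earlier resource server.

```javascript
// In-memory registry for named streams: at most one publisher per stream,
// zero or more subscribers, joining and leaving in any order.
class StreamRegistry {
  constructor() {
    this.streams = new Map(); // name -> { publisher, subscribers: Set }
  }
  _get(name) {
    if (!this.streams.has(name)) {
      this.streams.set(name, { publisher: null, subscribers: new Set() });
    }
    return this.streams.get(name);
  }
  publish(name, clientId) {
    const s = this._get(name);
    if (s.publisher && s.publisher !== clientId) {
      return false;           // enforce at most one publisher per stream
    }
    s.publisher = clientId;
    return true;
  }
  subscribe(name, clientId) {
    this._get(name).subscribers.add(clientId);
  }
  unpublish(name, clientId) {
    const s = this._get(name);
    if (s.publisher === clientId) s.publisher = null;
  }
}
```

A real server would additionally notify the subscribers when the publisher arrives or leaves, so that stream negotiation can be (re)started.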

Secure design

With global stream names come global problems, i.e., security and access control.

To elaborate on the challenges, consider a naive system that uses clear text stream names. User Alice instructs her application to create a component and publish to the stream named “alice”. User Bob instructs his application to create a component and subscribe to the stream named “alice”. Now Bob will be able to see and hear Alice. Similarly, the reverse-direction media flow can be done with a stream named “bob”. If a malicious participant Marvin knows the stream names, he can pretend to be either Alice or Bob by publishing to that named stream, thus disrupting the system.

To improve upon this, the system can require the publishing application to use the clear-text stream name, but the subscribing application to use a hash (e.g., SHA-256) of the stream name. Thus, only Alice will know her clear-text stream name, e.g., “alice543”, and Bob will know only the hash, H(alice543). The application will not deliver the clear-text stream name to the other participants. Thus, Marvin will not be able to publish to the streams of Alice or Bob.

To support other complex scenarios such as to restrict the subscribers to known ones, or to limit the number of subscribers per stream, additional data can be included in the stream name. For example, separate indirections can be distributed to separate subscribers, all pointing to the same published stream, albeit with different keys.

As mentioned earlier, there are three entities here - the end user, the client application, and the application service provider - or user, application and service in short. Should the application trust the service to not misuse the clear text stream name? No. If the service leaks the clear text stream name, then the system falls apart.

Another approach is to use a public-private key-pair for a stream, and have the stream name be derived from the public key. The system allows only the owner of the private key to publish to that stream. In that case the application does not have to give out the private key to the service, but the application can still prove that it owns the private key.

Should the application create its own key-pair for the stream, or should the user provide the credentials? Should the user trust the application to not misuse the credentials? No. If the application gets access to the user’s private key, then the trust model falls apart again. Hence, the application should be a transparent bridge for the credentials, and the user should manage her private key independently of the application.

One approach to solve this is to use client certificates, directly on the underlying tool (browser or mobile device), bypassing the application for these credentials.

First, the user gets client certificates out-of-band from some identity provider. The certificate verifies that the user is who she claims to be. The user gives out her public key (identity) to other participants, again, out-of-band, e.g., via email or other means. When another participant’s identity is received, the user signs it using her private key, to create a contact certificate - which is also a client certificate but signed by this user.

Second, the user gets an application and instructs it to connect to the service over SSL. If the connection is for publishing, the service prompts for a client certificate directly at the tool level. Here the tool is the browser or mobile device, below the client application. The user selects the right client certificate. The service then uses the public key identity from the client certificate to publish that stream name on this connection. If the connection is for subscribing, and the service prompts for a client certificate, the user selects the contact certificate. The service then uses it to determine the right stream to play on this connection. Note that the same connection may be used for both requests; however, that will disrupt the separate abstractions for publisher and subscriber components.

Third, the service keeps track of active connections for each stream name. When a publisher arrives, the previous publisher if any is disconnected, and all the subscribers are informed. When a subscriber arrives, the publisher is informed. The publisher and subscriber exchange messages via the service on their respective connections, and create peer-to-peer WebRTC media or data channels. Once connected, they verify their credentials, to ensure that the two users talking are who they claim to be. This is needed because a malicious service may act as Man-in-the-Middle, unless the end users can verify each other using pre-determined keys.

Getting prompted to select a client certificate for every connection may become annoying, and deteriorate the user experience. To solve this, a lightweight open source application may be used, which securely saves and reuses the certificate containing the private key. The open source nature will enable the end user to trust the application. Finally, on mobile devices, an open API of this application should enable other applications to be built on top, without having to deal with low level session negotiations.

In summary, it is possible to create a global system where user, application and service interact with each other such that third parties can contribute alternative applications and service nodes, without compromising the end-to-end security and trust among the users.


The article described the motivation and design of a generic video-io component. It encapsulates common media features, can be cross platform with failover to other technologies, and can be reused across wide range of application scenarios.

The article also described the motivation and design of a secure global service that can support instances of such a generic video-io component, without sacrificing end-to-end security and trust among the end users.

Sunday, June 16, 2019

Lessons in web software development

Here is a summary of what I feel is important for creating web applications as an individual developer or in a small team. My hope is that this assembled list of topics helps other developers too.

1) Local development - This is the ability to run all the pieces of the puzzle locally on the development machine - even if some of the pieces are replacement stubs. What's more? If everything can work without an Internet connection, it brings heaven to home - it can take rapid prototyping and productivity to the next level. How many times did you get stuck because the external server did not behave as expected, or did not easily expose the logs for an internal server error? The ability to retain local development becomes more difficult but more important as the system becomes more complex, such as with mobile devices or third-party services.

2) Cross platform and responsive - The product designer may not yet have a mobile view, or the sales person is not yet targeting the Opera browser. But sooner or later these requirements will come up. If I don't write web applications to deal with these aspects from the beginning, then I am just incurring technical debt, which may bring foreclosure as the software grows. Luckily there are tools such as Chrome devtools or CSS autoprefixer [link] that can help me. Behind the scenes, cross platform for web applications is largely about HTML5 modern browsers vs. IE. The good news is that Edge is a modern browser. Moreover, there are polyfills out there to work around incompatible features.

3) Loose coupling but strict APIs - A basic guideline of modularity is to create small (preferably single file) modules that interact with other modules using well defined interfaces. For web applications, web components as well as iframes can enable such designs. Some frameworks have ways to enforce such modularity using declarative interface specification. Furthermore, for client-server APIs, enforcing and checking the strict behavior of the request and response content can help detect problems early on. Did the server team change an attribute, but not tell the client team about it? Now you can catch that early on.
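A sketch of such a strict check at the client boundary, without any framework; the checkShape helper and the userSpec example are hypothetical:

```javascript
// Verify that a server response has the attributes and types the client expects.
// A renamed or retyped attribute fails loudly at the boundary, not deep in the UI.
function checkShape(obj, spec) {
  for (const [key, type] of Object.entries(spec)) {
    if (!(key in obj)) throw new Error('missing attribute: ' + key);
    if (typeof obj[key] !== type) {
      throw new Error('wrong type for ' + key + ': expected ' + type);
    }
  }
  return obj;
}

// Hypothetical spec for a user API response, for illustration only.
const userSpec = { id: 'string', name: 'string', age: 'number' };
```

Calling checkShape on every parsed response turns a silent server-side rename into an immediate, attributable client error.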

4) Flexible code - Software flexibility is a huge plus, especially for early stage systems. As the requirements emerge, as the business landscape clears up, and as the team grows, the ability to quickly change the behavior without incurring four sprints for one feature is really important. Flexibility can appear in how the data is stored, how the data is passed from one module to another, or how the code operates on the data. Such flexible pieces of code can then be easily orchestrated to run in one scenario versus the other. Modern JavaScript constructs such as ES6 features and Promise further help in creating flexible yet clean code.

5) Customization - Visual web applications are particularly susceptible to problems in beauty (or lack of it). Since beauty lies in the eye of the beholder (or beer-holder), every user can prefer things differently than what you do, or what your product designer does. At the basic level, knobs to control the theme color, font-family or font-size brings some engagement with the user, and at the advanced level the ability to customize every visual aspect brings incredible flexibility to navigate the rapidly changing landscape of product requirements. Luckily separation of view (CSS) from control (JavaScript) in web programming readily enables such customizations.

6) Avoid artificial or arbitrary restrictions - Although this is loosely related to flexible code, it is important enough to have its own bullet point. "The list display will not fit in small screen if it has more than seven items, should I restrict to maximum seven items, or should I do something else?" "If five tabs are open at the same time, the title starts getting clipped, should I restrict to maximum five tabs, or should I do something different?" You get the idea? Solving this does tend to make things complex in the short term, but is really important in avoiding technical debt, or a call at 3am about broken app. If you really need to have such a restriction, convey it clearly to the end user.

7) State vs. stateless - Depending on the web software architecture, the application state may or may not be at the client browser, and may or may not be at the server or database. However, each individual module also maintains some state. The ability to recreate the module from a stored state, and the ability to save the current state, goes a long way in making a robust and flexible web application. For example, what would you do if your long running chat interaction app is now embedded in another web page, which allows the end user to navigate and hence reload the chat app? How would you recreate the last state before the user's browser crashed, when the user re-launches your web application? Implementing stateless modules with complete separation of data is ideal, but not always feasible.
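A sketch of a module that can save and recreate its state; the ChatModule class and its methods are illustrative only, and in a browser the saved snapshot would typically go to sessionStorage before unload:

```javascript
// A module whose state survives a host-page reload, because it can be
// serialized to a snapshot and recreated from one.
class ChatModule {
  constructor(state) {
    // recreate from a previously saved state, or start fresh
    this.messages = (state && state.messages) || [];
  }
  post(text) { this.messages.push(text); }
  saveState() {
    // return a serializable snapshot; a browser app could store this
    // in sessionStorage on the beforeunload event
    return JSON.stringify({ messages: this.messages });
  }
  static restore(saved) { return new ChatModule(JSON.parse(saved)); }
}
```

The key design point is that the constructor accepts state rather than assuming a fresh start, so "reload" and "first load" share one code path.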

8) Separation of data and application logic - I have talked about this in my earlier posts as well as presented system papers describing applications built using this paradigm. Nevertheless, this is one of the basic principles of scalable software in my opinion. Some existing frameworks enforce the idea. However, some design philosophies (such as object oriented design) can easily go against it. In my opinion, the web application logic should be able to work on data that may be obtained from any place - instead of enforcing a tight coupling of where the data is stored and how it is accessed. This allows many other ideas to be applied easily, such as data partitioning, client-driven sharding, or testing local code changes with production data. Furthermore, separation of static vs. dynamic data is just as important for scalability and performance.

9) Performance - With event driven JavaScript, and extremely fast engines in the browser, performance is rarely an issue for web developers. However, a few points are certainly important: (a) use the right algorithms and data structures, (b) use the right tool for animation or background tasks, and (c) optimize and collate API calls if needed. You may be tempted to use an array for linear search, hoping that the array will not have more than twenty items or so. But adding a few more lines of code to use a hash table, with proper cleanup, will not only give you confidence in your code, but also prepare it for those corner cases when there could be ten thousand items. Using CSS for animation is far more efficient than JavaScript. Finally, with RESTful principles, there is often a tendency to GET the full data object when you just need one or two attributes. How can the client-server interaction be optimized, in a generic manner, such that the client and server can accomplish what they are intended to do, as quickly as possible, without doing seven lock-steps to show one web form?
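To illustrate point (a), here is a sketch comparing the two lookups; the helper names are mine:

```javascript
// Right data structure: a Map gives O(1) lookup where a linear array
// scan is O(n). It works the same for twenty items or ten thousand.
function buildIndex(items) {
  const byId = new Map();
  for (const item of items) byId.set(item.id, item);
  return byId;
}

function findLinear(items, id) {   // O(n): fine for tiny lists only
  return items.find(item => item.id === id);
}

function findIndexed(index, id) {  // O(1): prepared for the corner cases
  return index.get(id);
}
```

The indexed version costs one extra pass to build, after which every lookup is constant time regardless of list size.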

10) Robustness - At the system level, robustness is about the ability to recover from intermediate transient failures. At the module or code level, it can exhibit as proper error and exception handling, data verification before use, not to mention checking for null or undefined. Often times a catch-all error notification is used in web applications to display or log any abnormal behavior. However, code robustness involves the ability to recover from the error if possible. For example, if a function that expects a JSON object failed to understand the object, was it because a JSON string was supplied? If this is an external facing function of the module, does it make sense to provide that flexibility and robustness? What happens if the string has unsupported characters that are not allowed in JSON? Robustness for client-server communication is also useful. How can the client retry the failed API or WebSocket connection? How often should it retry? If keepalive is needed for persistent connections or for liveness checks, who should initiate the keepalive? Concepts such as soft-state and exponential back-off refresh timers are well known, and are often useful in distributed web applications.
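Two of these ideas can be sketched in a few lines; toObject and backoffDelay are hypothetical helper names, and the back-off constants are arbitrary choices:

```javascript
// Accept either a JSON object or a JSON string at a module boundary,
// and signal unrecoverable input with null instead of an uncaught throw.
function toObject(input) {
  if (typeof input === 'string') {
    try { return JSON.parse(input); }
    catch (e) { return null; }   // recoverable: the caller decides what to do
  }
  return (input && typeof input === 'object') ? input : null;
}

// Exponential back-off for retries: delay = min(base * 2^attempt, max), in ms.
function backoffDelay(attempt, base = 500, max = 30000) {
  return Math.min(base * Math.pow(2, attempt), max);
}
```

The cap on backoffDelay keeps a long outage from pushing retries out indefinitely, while early attempts stay fast.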

11) Caching - Web protocols as well as browsers and servers are experts at caching. However, due to project requirements or customer demands, web caching may have been configured at a sub-optimal level. For example, if images and APIs are served from the same server, perhaps the no-cache policy of the API also applies to the images. Furthermore, problems in client side software may cause repeated requests to the same image or API in a short duration. Would it be useful to cache such requests in the client application code, instead of invoking the request every time? What should be the cache duration? If the same image is being used at twenty different places in the web application, should that be cached instead of issuing a request for each such instance?
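A sketch of such an application-level cache; TtlCache and cachedLoad are illustrative names of mine, and the TTL policy is an assumption:

```javascript
// In-application request cache with a time-to-live, so repeated requests
// for the same resource within a short window reuse the earlier result.
class TtlCache {
  constructor(ttlMs) { this.ttlMs = ttlMs; this.entries = new Map(); }
  get(key, now = Date.now()) {
    const e = this.entries.get(key);
    if (e && now - e.at < this.ttlMs) return e.value;
    this.entries.delete(key); // expired or missing
    return undefined;
  }
  set(key, value, now = Date.now()) {
    this.entries.set(key, { value, at: now });
  }
}

// Wrap any async loader (e.g., a fetch of an image or API call) with the cache.
async function cachedLoad(cache, key, loader) {
  const hit = cache.get(key);
  if (hit !== undefined) return hit;
  const value = await loader(key);
  cache.set(key, value);
  return value;
}
```

Twenty uses of the same image URL then cost one actual request within the TTL window, instead of twenty.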

12) One more level of indirection - The web runs on indirection, i.e., the ability to resolve one name to another: one DNS hostname to an IP address, or one web path to a specific blog article. The core idea is that there can be multiple names pointing to the same thing. This has wide-ranging implications in web application development - in creating short links, in routing paths to pages, in converting URLs to API calls, and so on. How this gets applied in a specific scenario completely depends on the scenario. However, there are two ways the indirection is resolved - proxy vs. redirect, or recursion vs. iteration. As an example, if the code needs to do ten sequential tasks, should the controller invoke those ten tasks one after another, i.e., invoke the next one based on the result of the previous; or should the first task take care of invoking the next one, and return to the controller when all the tasks are completed?
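The two styles can be sketched with promises; runIterative and runChained are illustrative names:

```javascript
// Two ways to run sequential tasks: the controller iterates (iteration),
// or each task invokes the next (recursion/chaining). Same result either way.
async function runIterative(tasks, input) {
  let value = input;
  for (const task of tasks) value = await task(value); // controller drives each step
  return value;
}

function runChained(tasks, input) {
  const step = (i, value) =>
    i >= tasks.length
      ? Promise.resolve(value)
      : Promise.resolve(tasks[i](value)).then(v => step(i + 1, v));
  return step(0, input); // the first task takes care of invoking the next
}
```

The iterative form keeps control (and error handling) in one place; the chained form lets each task decide what comes next, which is handy when the sequence itself is data-dependent.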

13) Security - Whatever I can write about security here is not going to be enough. For web applications, not only the security of the client-server exchange is important, but also that of the software code. Code obfuscation and minification are often done for web application files, but deobfuscation tools are equally popular. Is the application logic found in the client code something to protect? Is the client-server communication encrypted on the network? Should the client-server API be hidden from the developers that use devtools? Are client certificates useful for the web application? Are passwords stored unencrypted in cookies, local storage or database?

14) Configuration - Developers often use fixed values or constants in the code, or assume a default from multiple choices to be applied if needed. Such values form the application configuration. Usually the configuration is stored at the server or database, and the client code just uses the values. However, in some cases it makes sense to allow the client app to be configured differently for different launches, e.g., using URL parameters. Identifying crucial configuration items and exposing them as easy-to-turn knobs or controls not only pleases the user but also makes your software more flexible.

15) Reduce fat - This is probably the most neglected one. Lean software has many benefits - easy maintenance, quick change, rapid testing, fast debugging, and above all, better performance and load time. Not just web developers but software developers in general have a tendency to not remove code, even if the code is no longer needed or is replaceable by similar code elsewhere. The fear of breaking the running code is far greater than the pleasure of clean, concise code. Also, use of external frameworks often exacerbates the fear in my opinion. Unfortunately, this causes technical debt, like no other, that is hard to fix later. With version control history and code minification tools, at the very least the unused code should be removed or commented out, and the near duplicate code should be merged or refactored. Like body fat, software fat reduces agility, and hence your productivity.

That's all folks! Happy developing....

Saturday, June 15, 2019

WebRTC notification system and signaling paradigms

This article describes the notification system in WebRTC, and presents some common signaling paradigms found in existing WebRTC applications.

How does WebRTC work? 

WebRTC refers to the ongoing efforts by W3C, IETF and browser vendors to enable web pages to exchange real-time media streams. A page can capture from local microphone and/or camera using getUserMedia as a local media stream abstraction, create RTCPeerConnection, a peer-to-peer abstraction between browser instances, and send a media stream from one browser to another, as shown in the diagram below. 

(Borrowed from my paper on "Developing WebRTC-based team apps with a cross platform mobile framework")

The RTCPeerConnection object emits certain signaling data such as session description and transport addresses, which must be sent and applied to the RTCPeerConnection object at the other browser to establish a media path. How this signaling data is exchanged is out-of-scope of the ongoing standardization - which means every application is free to implement its own signaling channel.

What is a notification service?

This application-specific signaling mechanism is often called a notification service in WebRTC. It facilitates the exchange of events such as call intention, offered session, transport addresses, etc., from one endpoint to another, often with the help of a server or server farm.

Technically, a dedicated network service is not needed, e.g., if the events can be propagated via other means, say emails, or copy-pasting. However, for all practical purposes, you would see a notification service as part of any WebRTC-based communication application.

Often times, such a notification service is part of another application-specific server. This server may also process other information and state, such as the participants in a room, or authentication to call or connect.

What are the signaling paradigms?

Signaling in existing WebRTC applications is often modeled after the telephone call or conference abstraction. For example, with call semantics, one endpoint sends the offered session and call intention to the other endpoint, via the notification service, and the other endpoint responds with the answered session on call answer. With conference semantics, a conference room or scope is identified by a URL or path, and all the participants connecting to the notification service under the same scope join the same conference session, and are able to communicate with each other. When a participant leaves the scope, all other participants are informed.

Call: For call semantics, an endpoint registers or listens on some well-known identifier, e.g., a phone number or address, and another endpoint can send an event notification for call intention (and optionally an offered session) to that identifier. After the second endpoint responds, the two can hear and see each other. This type of application needs a registration and notification server, similar to the existing SIP proxy, that can maintain the current mapping from identifier to endpoint.

If persistent WebSocket is used, then an endpoint is essentially represented as a socket connection at the server. The server enables exchange of signaling data between the two communicating endpoints.
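A sketch of such a registration map; the Registrar class and its message format are hypothetical, and connections are stand-in objects with a send() method in place of real WebSockets:

```javascript
// Minimal registrar for call semantics: identifier -> connection, a tiny
// analogue of a SIP registrar over persistent sockets.
class Registrar {
  constructor() { this.bindings = new Map(); }
  register(id, conn) { this.bindings.set(id, conn); }  // latest binding wins
  unregister(id) { this.bindings.delete(id); }
  invite(fromId, toId, offer) {
    const target = this.bindings.get(toId);
    if (!target) return false;                         // callee not registered
    target.send({ type: 'invite', from: fromId, offer }); // relay signaling data
    return true;
  }
}
```

The server never interprets the offer; it only relays the opaque signaling data between the bound endpoints, which matches WebRTC's signaling-agnostic design.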

Conference: For conference semantics, each endpoint comes to know about a unique scope identifier for that conference, e.g., a room number or access code. The endpoints can join or be invited to a conference room. Once joined, everyone in the room can hear or see each other, similar to a conference bridge model. This type of application needs an application-specific server that maintains room and participant states.

If persistent WebSocket is used, then an endpoint is essentially represented as a socket connection at the server, but likely only for the duration of the conference. The server enables exchange of signaling data among the participants of the same conference room.

The call and conference semantics can be intermixed. For example, the conference could limit the participants to two, to force a two-party call scenario, or a call could allow inviting more people, while maintaining the full mesh topology behind the scenes.

There is also a third paradigm of named streams, popularized by Flash Player, but also applicable to WebRTC applications.

Named stream: Endpoints can publish and play named streams. A stream can have at most one publisher and zero or more players at any time. The primary difference from the previous paradigms is that a stream is unidirectional, at least logically. Thus, a two-party call will need two streams, one published by each participant and played by the other. Similarly, an N-party conference will need N streams, each published by one participant, and played by all others. Behind the scenes, the application may optimize by using bi-directional connections when needed.

This type of application needs an application-specific server that maintains stream states, including their publishers and players. Similar to the conference paradigm, a persistent WebSocket connection is needed only for the duration of streaming. And the server enables exchange of signaling data among the publishers and players of the same named stream.

Which one would you choose?

The choice largely depends on the use case. For example, a broadcast application naturally maps to the named stream abstraction - where the speaker publishes on the named stream, and all the viewers play the stream. A multi-party conference obviously needs a conference abstraction. A panel discussion can be a mix of conference among the panelists, and broadcast from panelists to viewers. A gateway or translator from one to another often needs the call semantics, e.g., to call a phone number from a WebRTC application in the browser or vice-versa. Then a multi-party conference where some participants may be on phone network needs a mix of call and conference semantics.

Many publicly available WebRTC applications typically implement the call or conference paradigms. On the other hand, I have regularly used named streams in my past projects, both in open source [flash-videoio][vvowproject] and in the industry [vclick][artisy]. I have also created an open source lightweight notification server in about 200 source-lines-of-code in Python with an associated sample web app [,webrtc.html]. This can enable some form of randomly generated conference room abstraction, with at most two participants allowed in the room. Interested readers can also see my earlier article on NetConnection vs PeerConnection, where the former implements named stream and the latter has call semantics.

Unfortunately, existing systems often include closed walled garden servers. In that case, even if the APIs are public, they are locked to one of these abstractions. This limits certain use cases, e.g., a broadcast scenario that needs named stream abstraction must now use conference room; or a two-party call must create a room and exchange the room information out-of-band to the two parties.

Can the abstractions be converted from one another?

Luckily, it is not difficult to derive one abstraction from another in the above list. For example, to implement call semantics on top of conference semantics, one can assign a conference room to each user, in which that owner user is always joined, more like a listener. The room name represents the owner's identity. When another user wants to talk to this owner user, he connects to her room. When the owner detects another participant in her room, she creates another random unique room, and informs the other participant to join that. Thus, a call abstraction can work on top of the conference room abstraction.

To implement the named stream abstraction on top of the conference abstraction, one can represent each stream as a conference room. If the participants can selectively join for publishing or playing in each room, then the application can enforce one approved participant as publisher and all others as player in each room.

For supporting a conference abstraction on top of a call abstraction, the application must maintain states for the various participants. In one example, it can create full mesh call paths to emulate a conference room. In another example, it can treat a conference bridge as a call endpoint, creating a centralized conferencing topology. Similarly, the call abstraction where the participants can selectively join to send or receive media can be extended to expose the named stream abstraction.

The named stream abstraction is pretty low level, and can easily support a call abstraction by creating two named streams, one for each direction of media, or a conference abstraction by creating N named streams, one for publishing from each participant. The earlier trick of using a separate room for user identity can also be used here - a separate named stream for user identity, to which the owner pretends to publish. When she detects a player, she informs the player to instead play from another randomly generated named stream, for that new call.

These conversions among the various abstractions rely on certain assumptions, such as the ability to join a call or conference with one-way media, or the ability to join just to listen for events without actual media. Unfortunately, with locked APIs, many of these assumptions do not hold true in existing applications.

The interesting question is - if the abstractions are roughly interchangeable, would it make sense to define a generic API for WebRTC notification server that can provide all these three abstractions, albeit in a secure, scalable and robust manner? Moreover, can such an interchangeable abstraction be provided by a third-party layer, without modifying an existing WebRTC service? That will give the freedom to the application developers to pick the best abstraction for any particular scenario.

Reasons for technology failures - chat bots, video conferencing, or you name it

Every so often, I come across articles explaining why a piece of technology failed. For example, why chat bots failed in 2018, or why video call did not work for customer support. I think the answer to these and other similar questions can be attributed to three points: (1) not a holistic approach, (2) wrong audience, (3) unreasonable expectations. Let me elaborate further.

1) Not a holistic approach: Communication has not failed and will never fail. People (and machines) have always and will always communicate. Nevertheless, many communication tools and technologies are considered failures. Because communication is not a problem to be solved. Budget shopping is a problem to be solved. Getting prompt health advice from a doctor is a problem to be solved. Finding affordable housing in a good place is a problem to be solved. Arriving at a decision in a corporate meeting is a problem to be solved. A shiny new responsive-designed web and mobile app for a hospital alone does not solve the problem of getting prompt medical advice. A brand new multimodal multichannel ubiquitous video conferencing system does not magically create efficient corporate meetings. As for the chat bots' misfortune, the problem was to efficiently and economically resolve customer problems, and chat bots were considered a solution. This is like when someone wants to live in a tree house, you offer her a boat that can be put on a tree. What's more? You tell her that it will not work on short trees, or near a river! Mobile apps, web pages, chat bots are just tools. Tools that solve a problem only in a specific situation, or only a part of the problem, and can only be marginally successful in the best case if used alone. A tool or a piece of technology can only solve a subset of the problem, but a business that relies on it ought to consider the problem as a whole.

2) Wrong audience: Often times a solution is created by or for one group of people, but used by another. This may or may not be a good thing depending on how you measure success. A chat bot that can reduce the call volume to humans is good for cost saving. But it may result in customer frustration, which eventually reduces the business (which is failure), while continuing to improve the bot-to-human chat ratio (which is success). A new chat bot to book flights on an airline may streamline the booking process or save some administrative cost (which is success), but may not address what most of its customers want - which is to compare the costs among airlines before booking - resulting in lower revenue (which is failure). Eventually, the new but non-trivial way to book flights on their new chat bot loses its appeal, and the customers move on. When writing software or creating applications, one should think from the point of view of the business. Otherwise, the business failure will eventually trickle down to the software. In the real world, one size does not fit all. Moreover, it is hard to write the business objectives in stone, never to change, especially for small teams and startups. If the software system is flexible enough to handle a range of customers or scenarios, then there is a higher chance of success. In that case, even if a failure happens, there is a higher chance of quick recovery from intermediate failures.

3) Unreasonable expectations: Change is usually incremental and slow. Once in a while we see a revolution. Those are exceptions. Hence, success should also be measured at the same scale and speed. In the era of instant gratification, quick turnaround and fast return-on-investment, people have formed a habit of unreasonable expectations. If something did not behave exactly as we planned, it is termed a failure. If something did not reach a milestone in the time we set, it is termed a failure. If we step back, many times we will realize that the deadlines were arbitrary, or the desired behaviors were unreasonable, or even if they had behaved in the expected way, the eventual outcome of “success” vs. “failure” would have been the same.

If we believe that failure is required for success, then things that have failed are also successful in contributing to the future success. Medicines are successful in preventing and recovering from some health problems. Germs are successful in getting rid of the weak specimens, and in the evolution of strong species. A chat bot that went bigoted or racist is successful in illuminating the evolution of human (or animal) behavior when there is no empathy or conscience. A tool is rarely a failure; it only gets into the hands of the wrong audience or carries the wrong expectations.

So, what is the conclusion?

Nothing, really. But there is some advice, especially for small startups working on new technologies.

First, think about the holistic approach to the problem space, even if you are attempting to address only a part of it. Second, prepare the system to benefit the end users first and foremost, even if the end users are not the direct customers of the company. Third, be prepared for many failures, but design the solution to be flexible enough to easily recover from such intermediate failures.

What does this mean? For designing a chat bot, you ought to think in terms of the time that the end user will save in getting answers to specific as well as vague questions, instead of thinking in terms of how many chats were answered by bots vs. humans. For video conferencing, you ought to think in terms of how the users will benefit from video in what they are already doing, instead of thinking in terms of screen share, white board, note-taking tool, and not to forget, chat bots. It means being agile, accommodating and flexible, being ready to deliver numerous scenarios, instead of putting all the eggs in one or two use case baskets. At the software development level, it means picking flexibility over rigidity, doing rapid prototyping with quick customer feedback instead of six-month projects with rigid goals.