Kundan Singh: A generic video-io component for WebRTC

This article proposes a generic video-io web component for WebRTC. It can enable various real-time applications such as voice/video conferencing, video messaging, video presence and video broadcast. This client side component may be connected with any signaling channel or mechanism to integrate with existing or new websites or applications.

By encapsulating the core functions in a single easy-to-understand component, the component enables reuse of the core media features present in WebRTC. Without this, a user relies on the individual website or application developer to support and expose those media features.

Additionally, I describe the design of a global secure application service that can act as a signaling mechanism for all such component instances.

An early version of the idea exists in the form of a video-io widget, and is described in my papers [artisy,vclick].

Background

In my earlier life, I had implemented the flash-videoio project [github,paper,slides] and demonstrated a wide range of applications using a simple, flexible and easy-to-use widget. These applications included video messaging, live broadcast, two-party video call, multi-party conference, panel discussion broadcast, interoperability with phone network, and audio-only communication.

The flash-videoio widget combined the various media abstractions available in Flash Player in a single box that exposed a well defined API. It could be invoked from the embedding web page in JavaScript or the parent Flash application in ActionScript.

Two things were very important in that project - independence of the client-side widget from the website hosting it, and a single abstraction that was easy to understand. Since signaling and control remained outside the widget, it could enable integration with existing or new application services over XMPP, WebSocket, Google Channel API, Facebook API, and so on.

With WebRTC, the APIs available in the browser are different from what the Flash Player provides. But the idea of a widget that provides a simple abstraction and works with any website or service is still very promising.

In HTML5, web components allow creating new custom elements with associated styles and script. This promotes interoperability, reuse and modularization of web application fragments. Numerous web components are available for various applications [link].

This article proposes a generic video-io component and its API, and discusses ideas to make it secure, robust and versatile.

Scenarios

Firstly, let us look at the common use cases enabled by WebRTC — not just for browsers, but also for mobile apps or other devices. These scenarios can motivate us further to create a generic video-io component, and guide us in its API design.

1) Customer support: a visitor of a webpage can easily connect with the owner or a support person. Although the visitor is on the browser or mobile app, the support person may not be. Interactive dialog with a machine, such as IVR, Alexa or Siri-style communication, is similar except that the other end is an automated response system.

2) Surveillance or one way media: passively record activities via audio and video, and generate alerts or connect to live agents when something interesting happens. Several always-on videos, drones or IoT use cases also fall under this category.

3) Fast interaction: video games and other AR/VR related applications require very low latency communication among participants. Other emerging areas such as online healthcare, remote surgery operation, etc., also fall under this category.

4) Online classroom: typically a one-to-many media path, but occasionally two-way media for asking questions or short interactions. Event broadcast, webinars or online TV to a large number of visitors also fall under this category.

5) Team collaboration: online meetings including two-party as well as multi-party group meetings. Besides the ability to do real-time communication, it requires other features such as ability to record or share screen.

6) Instant communicator: a text chat or instant messaging application often needs the ability to initiate a voice and video call, and to add more participants to the call. Social video chat, video office, and other forms of impromptu multi-modal communication fall under this category.

7) Yet another client: for many telecom service providers, enabling browser and mobile devices to connect to their services is a big plus. Various types of gateways to connect between WebRTC and other systems fall under this category.

These are just some of the categories that come to mind, but there are definitely many more.

What are the common abstractions among these?

Earlier, I wrote about three paradigms - call, conference and named stream [link] - and, indeed, they seem to be applicable in these examples as well. For example, surveillance and online classroom roughly fall under named stream, team collaboration under conference, whereas instant communicator, yet another client and customer support fall under the call paradigm. Please note that the boundary between the call and conference abstractions is blurred in many scenarios.

Unfortunately, no single application provider supports all these abstractions. In fact, existing application providers are often too rigid, without the flexibility to enable scenarios beyond their carefully crafted use cases. On the other hand, separating the signaling services from the media features enables innovation on multiple fronts. A video-io component facilitates such separation.

Motivation

Here are some reasons to support a generic component that encapsulates media features for many such application scenarios.

Reuse media features across applications:

A video-on-demand website typically encapsulates the media features in a video player (like a widget or component), but keeps other provider-specific features such as authentication or access control outside the video player. Examples of media features are pause or playback timeline.

A WebRTC application also includes several features - some are provider specific such as for identity, authentication or access control, and others are media related such as the ability to mute, zoom, record or change camera capture attributes. Typically, a WebRTC application developer would like to restrict and control the provider specific features, but would prefer to enable a wide range of common media features. But the developer must explicitly include the implementations of these features in the application. Otherwise, the customer is left wondering why she cannot mute her call, or how she can zoom in to a part of the video.

Encapsulating such media features in an easy to understand web component allows reusing the common functions across various applications and websites. By keeping the signaling outside the component, it can still be integrated with a wide range of existing and new application scenarios. A comprehensive and well defined API for this component can serve as a list of required media features among supporting applications, instead of depending on the specific website or provider implementations, especially when the feature is entirely local or end-to-end.

Reduce dependency on third-party application providers:

WebRTC promotes communication silos [link]. It enables any website to easily become a communication service provider for its users or visitors.

Previously, VoIP signaling protocols such as SIP presented a consistent endpoint identity and allowed the communicating endpoints to exchange media and transport information. This allowed users of separate providers to talk to each other in a trapezoid topology [link].

With the triangle topology of WebRTC, and lack of global identity or signaling channel specification, the two endpoints connected to separate websites cannot easily talk to each other. Moreover, cross-site communication lacks the business incentives especially for websites that want to control (and hide) every aspect of their visitors’ interactions.

The end result is the availability of numerous incompatible “signaling SDKs” for various WebRTC applications. These are often too rigid to be used in new scenarios, or too tightly-coupled with a third-party application provider service outside the main website provider.

If the website’s communication is locked to such an external provider, many issues arise, e.g., which identity is used on the two systems? How does third-party audio/video integrate with the website’s existing communication? Who stores the recordings of the interactions? Who can perform real-time or offline media analytics?

Often, it is in the long term interest of the website to reuse its own service to enable audio/video communication, instead of relying on (and getting locked to) a third-party application provider. Using a client side web component can off-load some aspects of this task, and can easily allow reusing existing web servers and user identities of the website provider.

Block diagram

So what does the video-io component look like? Here is a logical block diagram of the component.

The component combines the individual low level primitives available in WebRTC into a single widget represented as a box. The component can act as an audio/video publisher or subscriber (i.e., player). A publisher box can display video captured from the camera device, and also send the captured audio/video stream out, whereas a subscriber box can play a received audio/video stream.

The component hides the differences between various web or mobile audio/video software to provide a comprehensive but consistent and easy to understand abstraction. It also keeps the actual signaling outside the box, and provides a generic way to plug in a new signaling mechanism.

The diagram above shows the two interfaces - application and signaling. The application interface is used to control the component by the application, and the signaling interface is used by the component to exchange signaling data with some signaling or notification service or with other component instances.

Extensible

The component can aim to define generic and extensible APIs to enable a wide range of existing and future signaling and media technologies. For example, the media stack could use WebRTC as default, but could fall back to Flash Player or third-party plugins for actual media stream if needed. The external signaling can potentially use broadcast on local area network, WebSocket, Google Channel, XMPP, proprietary plugin, mobile device notifications, or something else.

Furthermore, a cloud hosted service to enable secure and robust signaling in the default case can help bootstrap the usage of the component. More about this global service is discussed later in this document.

The following diagram shows how the component is embedded or used on web vs. mobile. If the mobile platform does not allow or support WebRTC or WebView, then the library communicates with an external mobile app container that hosts one or more component instances. The component loaded from a web page should support both WebRTC and Flash Player, and depending on the session negotiation and capability of the browser, can pick either. The component loaded in the WebView of a mobile app should include only WebRTC capability for platforms that support WebRTC in the WebView. For other platforms, a separate mobile app may be launched that presents the user interface for one or more component instances. This is just one way to implement the cross platform component, using WebView and cross-platform tools such as Cordova.

The library runs in the scope of the parent web application, presents simple programming primitives, and interfaces with the component, or uses an iframe for browsers that do not support web components. In the case of mobile app, the library provides native programming primitives and bridges with either the WebView or the external mobile app containing the component instances. The external mobile app may be implemented by a third-party by following the interface described in this document.

Features

Let us enumerate some component interfaces for common media features that are applicable to many of the scenarios listed above. A number of attributes below are inspired by the HTML5 video element or the flash-videoio widget. Furthermore, the proposed component is expected to support both live and stored media, for publish and play.

Signaling and failover: ability to attach a signaling channel or mechanism to a component instance. This may use an application or website specific mechanism, such as over WebSocket or SIP, or using proprietary RESTful APIs. Furthermore, it includes the ability to fall back to a secondary signaling channel or mechanism if the primary one fails during runtime. Seamless failover is needed to reduce initial (call setup) latency.
Appearance: controls how the component appears, and should have the ability to scale down or remove visual elements if the media type is audio-only.
Poster: specify an initial image to display before the video starts playing, either from local or remote stream. This is similar to the poster attribute of the video element. The component may allow a video file in addition to an image.
Preload, autoplay, loop: Similar to the video element, but applicable only for playback of stored media in general.
Controls: whether the component displays its own user interface controls, such as for pause, mute, sound volume, etc. In addition to the playback mode, this also needs to support publish mode.
Publish, play, record, mode: named stream (to publish or play) that is attached to this component, using live voice and/or video, or stored content. The mode determines whether it is live or stored content. A live stream may also be recorded, in which case the record mode may be append or replace, to append or replace an existing recording of the same name.
Playing, publishing: controls and indicates whether the component is currently actively playing or publishing.
Bidirection: controls and indicates whether the component has negotiated a bidirectional media stream or not. Even though one component instance can either publish or play a stream, multiple instances may work together behind the scenes to negotiate separate unidirectional streams or combined bidirectional streams.
Activity: indicates whether user is involved in any mouse pointer or touch event activity with the component. This can be used by the embedding application to show or hide the application defined controls for that component.
Devices (microphone, camera, sound, display, level, volume, width, height): controls whether those underlying features are enabled or not. For example, to mute the microphone or speaker, or to stop the camera capture or to hide the displayed video. Additionally, microphone level and sound volume can be used to indicate the microphone activity level or to control the speaker volume. The capture and display width and height attributes can control or indicate the camera capture dimension or video display dimension. These attributes should also allow controlling which of the multiple microphones or cameras may be used. Also, a list of available devices should be exposed to the application.
Codecs and parameters: controls the voice and video codecs to use for publishing, and indicates the codecs in use for playing. Additionally, it includes other parameters such as capture microphone sampling rate, camera or display frame rate, video encode quality, default video key frame interval, camera capture quality, capture bandwidth, echo cancellation, silence suppression. In case of simulcast or scalable video codecs, additional attributes should control or specify a simulcast stream or scalable property, e.g., to receive only low frame rate stream, or low quality video.
Zoom: controls how the video content is zoomed to fit in the component size, and is roughly similar to the object-fit CSS attribute. Additionally, the application should be able to zoom the component view to a specified rectangular area.
Mirrored: controls whether the local view is mirrored or not. Although CSS can be used to control the horizontal or vertical flipping of the component, having such an attribute allows flipping the video horizontally without flipping the controls.
Fullscreen: controls and indicates whether the component is rendered in full screen, in which case it may switch to hardware rendering, or enable higher quality automatically, or perform other optimizations as needed.
Snapshot: ability to take a one-time snapshot of the video as an image content, that can be consumed in the embedding JavaScript application.
Current time, duration: controls and indicates the current capture or playback time since the beginning of publishing or playing. Additionally, other attributes may indicate the total duration of a stored playback content, or statistics of the bytes loaded or total.
Quality: indicates playback quality of the audio and video streams, and may be used by the application to indicate to the user, e.g., as quality bars. Separate values for audio and video may be used. Additionally, separate attributes may be used to indicate the current bandwidth, packet loss frequency, playback buffer size, or other quality related metrics.
Duplicate: ability to create a duplicate play mode component, which plays from the same source as the original, albeit via separate re-negotiation of the streams. If a publish component is duplicated, it creates a play mode component that plays that original stream.
Send/receive: ability to send or receive metadata or other non-media data, such as for text chat. A publish component may be able to send to all the current players. A play component may be able to send to the publisher if any.
Group: ability to group together multiple component instances, so that they can share the signaling channel, and can be treated as bidirectional when needed. A two-party call or multi-party conference will often create a group for each call or conference, covering one publish and one or more play component instances.

It is trivial to see that many of the features listed above are media features, whereas some are not. In particular, the signaling, failover and group features should include some form of external signaling services or mechanisms, via the signaling interface. That will allow the component to exchange various media attributes or features on the signaling channel.
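To make a few of the attributes above concrete, here is a small Python model of the component state. It is only a sketch under stated assumptions: the class name VideoIOState, the attribute names and the duplicate() method are hypothetical, chosen to mirror the feature list, and a real component would be a custom element in the browser rather than a Python object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VideoIOState:
    """Hypothetical model of a few video-io attributes (all names are assumptions)."""
    publish: Optional[str] = None   # named stream to publish, if any
    play: Optional[str] = None      # named stream to play, if any
    microphone: bool = True         # False mutes the microphone
    camera: bool = True             # False stops the camera capture
    mirrored: bool = False          # flip the local view horizontally

    def __post_init__(self) -> None:
        # One component instance either publishes or plays a stream, never both.
        if self.publish is not None and self.play is not None:
            raise ValueError("a component instance can publish or play, not both")

    def duplicate(self) -> "VideoIOState":
        """Return a play-mode copy that plays the same source as this instance."""
        source = self.publish if self.publish is not None else self.play
        return VideoIOState(play=source)


publisher = VideoIOState(publish="alice", mirrored=True)
viewer = publisher.duplicate()   # a play-mode component for the same stream
```

Duplicating a publish component yields a play component for the same stream, which matches the Duplicate feature described above. The device and appearance attributes of the copy are simply reset to defaults in this sketch.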

The description presented above can be used to implement a generic video-io component. In practice, the component will be part of an application that will include some form of signaling channel or mechanism. The rest of the article attempts to describe a global service for such signaling. Note, however, that the client side component is independent of such a service, and may be used in a proprietary or application specific manner.

Application and service

There are roughly three high level entities in a WebRTC application - the end user, the client application, and the application service - or user, application and service for short.

A WebRTC application can be classified along several dimensions. Let us look at a few of these.

1) Walled garden vs. open system:

Examples of walled garden (service) are websites that allow WebRTC-based communication only among the visitors of that website, or users of their APIs. Typically, such lookup APIs on the web server or access to the notification server are restricted, e.g., to account holders of that provider or visitors of that website.

On the other hand, an open system (service) enables third-party developers to create client applications that can connect using the services of that provider or website. In that case one client application from one developer can talk to another client application from another developer using the same service.

2) Proprietary protocol vs. standards based:

Examples of standards-based system (service) are those that map to some existing standard protocol such as SIP, so that the triangle topology is readily converted to trapezoid. The goal is to allow a third-party developer to create servers (service) that can federate using open standards. This is so that the users of one service can communicate with those of another. On the other hand, a proprietary protocol is enough for a single service or provider system.

Thus, walled garden does not allow clients from outsiders, and proprietary protocol does not allow servers from outsiders.

Note that a walled garden does not mean proprietary, and an open system does not mean standards based. A system can be walled garden but standards based, e.g., if authentication for inclusion in the federation is controlled by one provider, even if all the servers in the federation use SIP. Similarly, it can be open system with proprietary protocol, e.g., with open APIs that do not follow existing signaling standards.

3) Endpoint driven vs. server driven:

An endpoint driven signaling system keeps most of the application logic running in the endpoint or client device [link]. In that case, a server is typically needed only for event notification and data storage. On the other hand, a server driven signaling relies on the application logic running in the server, e.g., to determine room membership, stream subscribers, or picking the right user device.

It is not surprising that almost all existing systems are server driven. Endpoint driven systems exhibit little business value, because the service provider loses control of the application logic, especially if it is an open system that allows third-party client applications. Furthermore, many existing systems are walled garden, with proprietary protocol, and server driven.

Global service

One may ask - is there any benefit in a standards-based endpoint-driven open system?

What would it look like? Who will benefit from it? Can it allow all the scenarios listed previously? If it can, then it enables an open platform for developing almost all kinds of use cases, where any developer can contribute with a client application or a hosted service, creating a global WebRTC-based system.

The video-io component proposal presented in this article enables creating such a global system. In particular, the media features already reside in the component (in the endpoint). If an application service provider uses a generic and open API, and a pluggable server farm architecture, then such a system can be created. Note, however, that the component itself does not impose any restrictions on the system type, and can readily support open or closed, standards or proprietary, and endpoint vs. server driven system. In any case, the application service provider needs to implement some basic signaling primitives.

These primitives were discussed earlier, and fall under call, conference or named stream abstractions. The call abstraction creates a bi-directional media pipe and also allows registration and lookup of user identity. The conference paradigm presents a room and participant abstraction where the participants in a room can see and hear each other. The named stream paradigm allows unidirectional media flow from zero or one publisher of a stream to zero or more subscribers of that stream, where the publisher and subscribers can come or go independent of each other.

So the question is - what does it take to build an endpoint-driven standards-based open system for WebRTC signaling?

First, the server must be lightweight and without any application logic beyond the above mentioned abstractions. Second, the user or device identity must be independent of a single provider. Third, it must use existing standards for communication between servers, if needed. Finally, it must use open published interfaces between client and server, whenever needed.

If the goal is to prevent creating proprietary applications using this service, it must allow third-party developers to create client applications. On the other hand existing cloud WebRTC application service providers typically allow developers to create applications within the scope of their developer key. Thus one developer owns that application, and often, prevents other developers from creating an application that interacts with her application.

Public key/certificates or their fingerprints can be used as the user or device identity. The identity provider should be independent of the application service provider, and different application providers may trust different identity providers - similar to how a browser can trust many different root certificate authorities.

For simplicity, based on the previous discussion, the application can be abstracted out with three concepts: first, a component that publishes or plays media; second a named stream which can be attached to a publishing or subscribing (playing) component, such that there can be at most one publisher per stream at any time; and third, a collection of zero or more component instances and named streams that are dynamically adjusted as the application demands. Generally, these abstractions are enough to support the application scenarios listed earlier.

Named stream

One of the key features described before is the ability of the component to publish or play a named stream. Depending on the signaling layer, this may or may not be readily available in the service. If not, then a server is needed that keeps state for the various named streams, and connects the publishers and players of each named stream.

My earlier projects include a lightweight resource server that facilitates this feature. However, implementing this from scratch on a clean slate is not difficult. Such a named stream server needs to maintain a list of named streams that are being published or subscribed. Furthermore, it needs to enforce at most one publisher per stream, and zero or more subscribers. The publishers and subscribers of a stream may come or go in any order, and at any time.
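The bookkeeping described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the resource server mentioned above: the names NamedStreamRegistry, StreamState and the connection identifiers are hypothetical, and rejecting a second publisher is just one assumed way to enforce the single-publisher rule.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class StreamState:
    publisher: Optional[str] = None                      # connection id of the publisher
    subscribers: Set[str] = field(default_factory=set)   # connection ids of subscribers


class NamedStreamRegistry:
    """Hypothetical in-memory state for named streams."""

    def __init__(self) -> None:
        self._streams: Dict[str, StreamState] = {}

    def publish(self, name: str, conn: str) -> None:
        state = self._streams.setdefault(name, StreamState())
        # Enforce at most one publisher per stream (assumed policy: reject).
        if state.publisher not in (None, conn):
            raise RuntimeError(f"stream {name!r} already has a publisher")
        state.publisher = conn

    def subscribe(self, name: str, conn: str) -> None:
        # Subscribers may arrive before any publisher exists.
        self._streams.setdefault(name, StreamState()).subscribers.add(conn)

    def leave(self, name: str, conn: str) -> None:
        state = self._streams.get(name)
        if state is None:
            return
        if state.publisher == conn:
            state.publisher = None
        state.subscribers.discard(conn)
        if state.publisher is None and not state.subscribers:
            del self._streams[name]   # forget streams nobody is attached to

    def peers(self, name: str) -> Tuple[Optional[str], List[str]]:
        state = self._streams.get(name, StreamState())
        return state.publisher, sorted(state.subscribers)
```

A subscriber can attach to a stream first and the publisher can join later, so the registry creates the stream entry on whichever request arrives first.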

The stream name may be globally unique and globally accessible (e.g., URI), or may be unique within a single server. In the latter case, it is scoped within the server address, so that it becomes globally unique. A unique stream name allows a simple interface for failover when needed, e.g., publish to stream A on server 1, and if that fails then to stream B on server 2. Otherwise, if the server address cannot be specified in the stream name, then a failover within a single service is not robust in all scenarios.
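As a rough illustration of scoped names and failover, the Python sketch below builds globally unique names from a server address and a local stream name, and tries them in order. The function names, the "server/stream" name format and the example server addresses are assumptions; try_publish stands in for whatever signaling call the application actually uses.

```python
from typing import Callable, Iterable, Optional


def scoped_name(server: str, stream: str) -> str:
    """Scope a server-local stream name by the server address (assumed format)."""
    return f"{server}/{stream}"


def publish_with_failover(targets: Iterable[str],
                          try_publish: Callable[[str], bool]) -> Optional[str]:
    """Try each globally unique stream name in order; return the first that works."""
    for name in targets:
        if try_publish(name):
            return name
    return None   # every target failed


# Publish to stream A on server 1, and if that fails then to stream B on server 2.
targets = [scoped_name("server1.example.net", "A"),
           scoped_name("server2.example.net", "B")]
unreachable = {"server1.example.net/A"}   # pretend server 1 is down
chosen = publish_with_failover(targets, lambda name: name not in unreachable)
```

Because each name carries its server address, the failover list can span independent services without any coordination between them.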

Secure design

With global stream names come global problems, i.e., security and access control.

To elaborate the challenges, consider a naive system that uses clear text named streams. User Alice instructs her application to create a component and publish it to stream named “alice”. User Bob instructs his application to create a component and subscribes it to stream named “alice”. Now Bob will be able to view and hear Alice. Similarly, the reverse direction media flow can be done with stream named “bob”. If a malicious participant Marvin knows about the stream names, he can pretend to be either Alice or Bob, by publishing to that named stream, thus disrupting the system.

To improve upon this, the system can require the publishing application to use clear text stream name, but the subscribing application to use hash (e.g., SHA256) of the stream name. Thus, only Alice will know her clear text stream name, e.g., “alice543”, and Bob will know only the hash, H(alice543). The application will not deliver the clear text stream name to the other participants. Thus, Marvin will not be able to publish to streams of Alice or Bob.
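The hashed-name idea can be sketched as follows. This is an illustrative Python model, assuming the service indexes streams by the SHA-256 hex digest of the clear text name; the class HashedStreamService and the connection identifiers are hypothetical.

```python
import hashlib
from typing import Dict, Optional


def stream_hash(clear_name: str) -> str:
    """SHA-256 hex digest of a stream name, as handed to subscribers."""
    return hashlib.sha256(clear_name.encode("utf-8")).hexdigest()


class HashedStreamService:
    """Hypothetical service: publish with the clear name, subscribe with its hash."""

    def __init__(self) -> None:
        self._publishers: Dict[str, str] = {}   # hash of stream name -> publisher id

    def publish(self, clear_name: str, publisher: str) -> str:
        # The service always hashes what the publisher presents.
        key = stream_hash(clear_name)
        self._publishers[key] = publisher
        return key   # this value can be shared with subscribers

    def subscribe(self, hashed_name: str) -> Optional[str]:
        # Subscribers present only the hash; returns the current publisher, if any.
        return self._publishers.get(hashed_name)


service = HashedStreamService()
shared = service.publish("alice543", "alice-conn")   # Alice keeps "alice543" secret
service.publish(shared, "marvin-conn")               # Marvin knows only the hash
```

Since the service hashes whatever a publisher presents, Marvin's attempt lands on the unrelated stream H(H(alice543)), and Bob, who subscribes with H(alice543), still reaches Alice.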

To support other complex scenarios such as to restrict the subscribers to known ones, or to limit the number of subscribers per stream, additional data can be included in the stream name. For example, separate indirections can be distributed to separate subscribers, all pointing to the same published stream, albeit with different keys.

As mentioned earlier, there are three entities here - the end user, the client application, and the application service provider - or user, application and service in short. Should the application trust the service to not misuse the clear text stream name? No. If the service leaks the clear text stream name, then the system falls apart.

Another approach is to use a public-private key-pair for a stream, and have the stream name be derived from the public key. The system allows only the owner of the private key to publish to that stream. In that case the application does not have to give out the private key to the service, but the application can still prove that it owns the private key.

Should the application create its own key-pair for the stream, or should the user provide the credentials? Should the user trust the application to not misuse the credentials? No. If the application gets access to the user’s private key, then the trust model falls apart again. Hence, the application should be a transparent bridge for the credentials, and the user should manage her private key independent of the application.

One approach to solve this is to use client certificates, directly on the underlying tool (browser or mobile device), bypassing the application for these credentials.

First, the user gets client certificates out-of-band from some identity provider. The certificate verifies that the user is who she claims to be. The user gives out her public key (identity) to other participants, again, out-of-band, e.g., via email or other means. When another participant’s identity is received, the user signs it using her private key, to create a contact certificate - which is also a client certificate but signed by this user.

Second, the user gets an application and instructs it to connect to the service, over SSL. If the connection is for publishing, the service prompts for client certificate directly at the tool level to the user. Here tool is the browser or mobile device, below the client application. The user selects the right client certificate. The service then uses the public key identity from the client certificate to publish that stream name on this connection. If the connection is for subscribing, and the service prompts for client certificate, the user selects the contact certificate. The service then uses it to determine the right stream to play on this connection. Note that the same connection may be used for both requests, however that will disrupt the separate abstractions for publisher and subscriber components.

Third, the service keeps track of active connections for each stream name. When a publisher arrives, the previous publisher if any is disconnected, and all the subscribers are informed. When a subscriber arrives, the publisher is informed. The publisher and subscriber exchange messages via the service on their respective connections, and create peer-to-peer WebRTC media or data channels. Once connected, they verify their credentials, to ensure that the two users talking are who they claim to be. This is needed because a malicious service may act as Man-in-the-Middle, unless the end users can verify each other using pre-determined keys.
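The connection tracking in this third step can be sketched as a small event generator in Python. The class name, the event labels ("disconnect", "publisher-arrived", "subscriber-arrived") and the connection identifiers are assumptions for illustration; certificate checks and the actual WebRTC negotiation are left out.

```python
from typing import Dict, List, Set, Tuple

Event = Tuple[str, str, str]   # (recipient connection, event label, stream name)


class StreamConnections:
    """Hypothetical tracker of active connections per stream name."""

    def __init__(self) -> None:
        self._publisher: Dict[str, str] = {}
        self._subscribers: Dict[str, Set[str]] = {}

    def publisher_arrived(self, stream: str, conn: str) -> List[Event]:
        events: List[Event] = []
        previous = self._publisher.get(stream)
        if previous is not None and previous != conn:
            # The previous publisher, if any, is disconnected.
            events.append((previous, "disconnect", stream))
        self._publisher[stream] = conn
        # All current subscribers are informed about the publisher.
        for sub in sorted(self._subscribers.get(stream, ())):
            events.append((sub, "publisher-arrived", stream))
        return events

    def subscriber_arrived(self, stream: str, conn: str) -> List[Event]:
        self._subscribers.setdefault(stream, set()).add(conn)
        publisher = self._publisher.get(stream)
        if publisher is None:
            return []   # nobody to inform yet
        # The publisher is informed so that negotiation can start.
        return [(publisher, "subscriber-arrived", stream)]
```

Each returned event would be delivered on the recipient's own connection, after which the publisher and subscriber exchange their session messages through the service.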

Getting prompted to select a client certificate for every connection may become annoying, and deteriorate the user experience. To solve this, a light weight open source application may be used, which securely saves and reuses the certificate containing the private key. The open source nature will enable the end user to trust the application. Finally, on mobile devices, open API of this application should enable other applications to be built on top, without having to deal with low level session negotiations.

In summary, it is possible to create a global system where user, application and service interact with each other such that third parties can contribute alternative applications and service nodes, without compromising the end-to-end security and trust among the users.

Summary

The article described the motivation and design of a generic video-io component. It encapsulates common media features, can be cross platform with failover to other technologies, and can be reused across a wide range of application scenarios.

The article also described the motivation and design of a secure global service that can support such generic video-io component instances, without sacrificing end-to-end security and trust among the end users.

2 comments:

Anonymous said...: Excellent Kudan! do you think there will be a way to substitute the famous netgroup /sharedObject / multicast in Flash? I heard that Adobe is working hard to mutate Flash in webassembly....; 9:44 PM
Kundan Singh said...: Good point!

In some way these features depend on the server side as well. So changing or implementing only the client for these features is not enough, unless those clients connect to the same server side piece. However, in that case it largely becomes an application problem. Several web applications already have some form of sharedObject or RPC or equivalent if the use case demands. Since WebRTC media streams can be forwarded, new web applications can create app-level multicast or netgroup equivalent models too if the use case demands.

In any case, something like that will fit nicely with the global service discussed in the blog post above.; 1:04 AM