Saturday, June 15, 2019

WebRTC notification system and signaling paradigms

This article describes the notification system in WebRTC, and presents some common signaling paradigms found in existing WebRTC applications.

How does WebRTC work? 

WebRTC refers to the ongoing efforts by W3C, IETF and browser vendors to enable web pages to exchange real-time media streams. A page can capture from the local microphone and/or camera using getUserMedia, which provides a local media stream abstraction, create an RTCPeerConnection, a peer-to-peer abstraction between browser instances, and send a media stream from one browser to another, as shown in the diagram below.

(Borrowed from my paper on "Developing WebRTC-based team apps with a cross platform mobile framework")

The RTCPeerConnection object emits certain signaling data such as session description and transport addresses, which must be sent and applied to the RTCPeerConnection object at the other browser to establish a media path. How this signaling data is exchanged is out-of-scope of the ongoing standardization - which means every application is free to implement its own signaling channel.
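Since the format of this exchange is left to the application, the following is only a minimal sketch of one possible envelope for the signaling data. The field names (type, from, to, data) and the helpers encode_signal and decode_signal are my assumptions for illustration, not part of any WebRTC specification.

```python
import json

# Kinds of signaling data an endpoint may need to relay (hypothetical labels).
SIGNAL_KINDS = ("offer", "answer", "candidate")


def encode_signal(kind, sender, receiver, data):
    """Wrap signaling data in a JSON envelope for the signaling channel."""
    if kind not in SIGNAL_KINDS:
        raise ValueError(f"unknown signal kind: {kind!r}")
    return json.dumps({"type": kind, "from": sender, "to": receiver, "data": data})


def decode_signal(text):
    """Parse an envelope received from the signaling channel."""
    msg = json.loads(text)
    if msg.get("type") not in SIGNAL_KINDS:
        raise ValueError("not a recognized signaling message")
    return msg


# The placeholder string stands in for a real session description.
wire = encode_signal("offer", "alice", "bob", {"sdp": "<session description placeholder>"})
received = decode_signal(wire)
```

The receiving page would apply the data field to its own RTCPeerConnection object; that step happens in the browser and is not shown here.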

What is a notification service?

This application-specific signaling mechanism is often called a notification service in WebRTC. It facilitates exchange of events such as call intention, offered session, transport addresses, etc., from one endpoint to another, often with the help of a server or server-farm.

Technically, a dedicated network service is not needed, e.g., if the events can be propagated via other means, say emails, or copy-pasting. However, for all practical purposes, you would see a notification service as part of any WebRTC-based communication application.

Often, such a notification service is part of another application-specific server. This server may also process other information and state, such as participants in a room, or authentication to call or connect.

What are the signaling paradigms?

Signaling in existing WebRTC applications is often modeled after a telephone call or conference abstraction. For example, in the call semantics, one endpoint sends the offered session and call intention to the other endpoint, via the notification service, and the other endpoint responds with the answered session on call answer. In the conference semantics, a conference room or scope is identified by a URL or path, and all the participants connecting to the notification service under the same scope join the same conference session, and are able to communicate with each other. When a participant leaves the scope, all other participants are informed.

Call: For the call semantics, an endpoint registers or listens on some well-known identifier, e.g., a phone number or address, and another endpoint can send an event notification for call intention (and optionally offered session) to that identifier. After the second endpoint responds, the two can hear and see each other. This type of application needs a registration and notification server, similar to the existing SIP proxy, that can maintain the current mapping from identifier to endpoint.

If persistent WebSocket is used, then an endpoint is essentially represented as a socket connection at the server. The server enables exchange of signaling data between the two communicating endpoints.
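The call paradigm can be sketched with an in-memory server that maps identifiers to endpoints. This is only an illustration under my own assumptions: the CallServer class, its method names and the event names (invite, accept) are hypothetical, and a plain callback stands in for the WebSocket connection.

```python
class CallServer:
    """In-memory stand-in for a registration and notification server."""

    def __init__(self):
        # identifier -> callback standing in for the endpoint's socket
        self.endpoints = {}

    def register(self, identifier, deliver):
        """Start listening on a well-known identifier."""
        self.endpoints[identifier] = deliver

    def unregister(self, identifier):
        self.endpoints.pop(identifier, None)

    def notify(self, sender, target, event, data=None):
        """Forward an event to the endpoint registered under target."""
        deliver = self.endpoints.get(target)
        if deliver is None:
            return False  # nobody is listening on that identifier
        deliver({"from": sender, "event": event, "data": data})
        return True


server = CallServer()
alice_inbox, bob_inbox = [], []
server.register("alice", alice_inbox.append)
server.register("bob", bob_inbox.append)

# Alice sends the call intention with an offered session; Bob answers.
server.notify("alice", "bob", "invite", {"sdp": "<offer placeholder>"})
server.notify("bob", "alice", "accept", {"sdp": "<answer placeholder>"})
```

The server only looks up the target and forwards the event. It never inspects the session data, which is what keeps such a server small.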

Conference: For the conference semantics, each endpoint comes to know about a unique scope identifier for that conference, e.g., room number or access code. The endpoints can join or be invited to a conference room. Once joined, everyone in the room can hear or see each other, similar to a conference bridge model. This type of application needs an application-specific server that maintains room and participant states.

If persistent WebSocket is used, then an endpoint is essentially represented as a socket connection at the server, but likely only for the duration of the conference. The server enables exchange of signaling data among the participants of the same conference room.
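A comparable sketch for the conference paradigm keeps room and participant state on the server, and informs the remaining participants on join and leave. Again, the ConferenceServer class, its methods and the event names (joined, left, signal) are hypothetical, and callbacks stand in for socket connections.

```python
class ConferenceServer:
    """In-memory stand-in for a server that keeps room and participant state."""

    def __init__(self):
        # room -> {participant name: callback standing in for a socket}
        self.rooms = {}

    def join(self, room, name, deliver):
        """Add a participant, inform the others, and return who was present."""
        members = self.rooms.setdefault(room, {})
        existing = [n for n in members if n != name]
        for other in existing:
            members[other]({"event": "joined", "room": room, "who": name})
        members[name] = deliver
        return existing

    def leave(self, room, name):
        """Remove a participant and inform everyone still in the room."""
        members = self.rooms.get(room, {})
        if name not in members:
            return
        del members[name]
        for deliver in members.values():
            deliver({"event": "left", "room": room, "who": name})
        if not members:
            del self.rooms[room]  # forget empty rooms

    def relay(self, room, sender, target, data):
        """Pass signaling data between two participants of the same room."""
        members = self.rooms.get(room, {})
        if sender not in members or target not in members:
            return False
        members[target]({"event": "signal", "room": room, "from": sender, "data": data})
        return True


server = ConferenceServer()
alice_inbox, bob_inbox = [], []
server.join("room42", "alice", alice_inbox.append)
peers = server.join("room42", "bob", bob_inbox.append)
server.relay("room42", "bob", "alice", {"sdp": "<offer placeholder>"})
server.leave("room42", "bob")
```

Here the newcomer learns the existing participants from the return value of join, and can then start an offer towards each of them.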

The call and conference semantics can be intermixed. For example, the conference could limit the participants to two, to force a two-party call scenario, or a call could allow inviting more people, while maintaining the full mesh topology behind the scenes.

There is also a third paradigm of named streams, popularized by Flash Player, but also applicable to WebRTC applications.

Named stream: Endpoints can publish and play named streams. A stream can have at most one publisher and zero or more players at any time. The primary difference from the previous paradigms is that a stream is unidirectional, at least logically. Thus, a two-party call will need two streams, one published by each participant and played by the other. Similarly, an N-party conference will need N streams, each published by one participant, and played by all others. Behind the scenes, the application may be optimized by using bi-directional connections when needed.

This type of application needs an application-specific server that maintains stream states, including their publishers and players. Similar to the conference paradigm, a persistent WebSocket connection is needed only for the duration of streaming. And the server enables exchange of signaling data among the publishers and players of the same named stream.
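The stream state described above can be sketched as follows. The StreamServer class, its methods and the stream names are hypothetical, chosen only to show the one-publisher, many-players rule and how a two-party call maps to two streams.

```python
class StreamServer:
    """In-memory stand-in for a server that tracks named stream state."""

    def __init__(self):
        self.publishers = {}  # stream name -> publisher id
        self.players = {}     # stream name -> set of player ids

    def publish(self, stream, who):
        """Claim a stream; a stream has at most one publisher at a time."""
        if stream in self.publishers:
            raise ValueError(f"{stream!r} already has a publisher")
        self.publishers[stream] = who
        return sorted(self.players.get(stream, set()))  # players already waiting

    def unpublish(self, stream, who):
        if self.publishers.get(stream) == who:
            del self.publishers[stream]

    def play(self, stream, who):
        """Subscribe to a stream; returns the publisher, or None if not yet published."""
        self.players.setdefault(stream, set()).add(who)
        return self.publishers.get(stream)

    def stop(self, stream, who):
        self.players.get(stream, set()).discard(who)


# A two-party call needs two streams, one for each direction of media.
server = StreamServer()
server.publish("alice-out", "alice")
server.publish("bob-out", "bob")
alice_plays_from = server.play("bob-out", "alice")
bob_plays_from = server.play("alice-out", "bob")
```

Once a publisher and a player of the same stream know about each other, the server can relay the signaling data between them in the same way as in the other paradigms.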

Which one would you choose?

The choice largely depends on the use case. For example, a broadcast application naturally maps to the named stream abstraction - where the speaker publishes on the named stream, and all the viewers play the stream. A multi-party conference obviously needs a conference abstraction. A panel discussion can be a mix of conference among the panelists, and broadcast from panelists to viewers. A gateway or translator from one to another often needs the call semantics, e.g., to call a phone number from a WebRTC application in the browser or vice-versa. Finally, a multi-party conference where some participants may be on the phone network needs a mix of call and conference semantics.

Many publicly available WebRTC applications implement the call or conference paradigm. On the other hand, I have regularly used named streams in my past projects, both in open source [flash-videoio][vvowproject] and in the industry [vclick][artisy]. I have also created an open source light-weight notification server in about 200 source-lines-of-code in Python with an associated sample web app [notify.py,webrtc.html]. This can enable some form of randomly generated conference room abstraction, with at most two participants allowed in the room. Interested readers can also see my earlier article on NetConnection vs PeerConnection, where the former implements named stream and the latter has call semantics.

Unfortunately, existing systems often include closed, walled-garden servers. In that case, even if the APIs are public, they are locked to one of these abstractions. This limits certain use cases, e.g., a broadcast scenario that needs the named stream abstraction must now use a conference room; or a two-party call must create a room and exchange the room information out-of-band between the two parties.

Can the abstractions be converted from one another?

Luckily, it is not difficult to derive one abstraction from another in the above list. For example, to implement the call semantics on top of the conference semantics, one can assign a conference room for each user, in which that owner user is always joined, more like a listener. The room name represents the owner's identity. When another user wants to talk to this owner user, he connects to her room. When the owner detects another participant in her room, she creates another room with a random, unique name, and informs the other participant to join that. Thus, a call abstraction can work on top of the conference room abstraction.
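The trick of using a per-user room can be sketched as below. The Rooms class and the listen and call functions are hypothetical names of mine, and the sketch collapses the owner's detection of the visitor and her redirect message into a single function call; in a real application those would be separate events over the signaling channel.

```python
import secrets


class Rooms:
    """Minimal conference service: join a room and learn who else is there."""

    def __init__(self):
        self.members = {}  # room name -> list of participant names

    def join(self, room, who):
        present = self.members.setdefault(room, [])
        others = list(present)
        present.append(who)
        return others

    def leave(self, room, who):
        if who in self.members.get(room, []):
            self.members[room].remove(who)


def listen(rooms, owner):
    """The owner stays joined in the room named after her identity."""
    rooms.join(owner, owner)


def call(rooms, caller, callee):
    """The caller visits the callee's room, and both move to a fresh room."""
    if callee not in rooms.join(callee, caller):
        rooms.leave(callee, caller)
        return None  # the callee is not listening
    # The owner picks a random room name and tells the visitor to join it.
    private = "call-" + secrets.token_hex(8)
    rooms.leave(callee, caller)
    rooms.join(private, callee)
    rooms.join(private, caller)
    return private


rooms = Rooms()
listen(rooms, "alice")
call_room = call(rooms, "bob", "alice")
```

After the call is set up, the owner is still alone in her identity room, so she remains reachable for the next caller.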

To implement the named stream abstraction on top of the conference abstraction, one can represent each stream as a conference room. If the participants can selectively join for publishing or playing in each room, then the application can enforce one approved participant as publisher and all others as player in each room.
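As a rough sketch of this, a room can accept a role with each join and enforce the single approved publisher. The RoleRoom class and the role names (publish, play) are assumptions for illustration only.

```python
class RoleRoom:
    """A conference room whose participants join either to publish or to play."""

    def __init__(self, approved_publisher):
        self.approved = approved_publisher
        self.publisher = None
        self.players = set()

    def join(self, who, role):
        if role == "publish":
            # Only the approved participant may publish, and only once.
            if who != self.approved or self.publisher is not None:
                raise PermissionError(f"{who} may not publish in this room")
            self.publisher = who
        elif role == "play":
            self.players.add(who)
        else:
            raise ValueError(f"unknown role: {role!r}")


# One room per named stream: the speaker publishes, the viewers play.
room = RoleRoom(approved_publisher="speaker")
room.join("speaker", "publish")
room.join("viewer1", "play")
room.join("viewer2", "play")
```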

To support a conference abstraction on top of a call abstraction, the application must maintain state for the various participants. In one example, it can create full-mesh call paths to emulate a conference room. In another example, it can treat a conference bridge as a call endpoint, creating a centralized conferencing topology. Similarly, a call abstraction in which the participants can selectively join to send or receive media can be extended to expose the named stream abstraction.
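The two topologies differ in how many two-party calls the application must set up and track. A small sketch, with hypothetical function names and a hypothetical "bridge" endpoint identifier:

```python
from itertools import combinations


def mesh_calls(participants):
    """Two-party calls needed to connect every pair of participants."""
    return list(combinations(sorted(participants), 2))


def bridge_calls(participants, bridge="bridge"):
    """Two-party calls needed when everyone calls a central bridge endpoint."""
    return [(p, bridge) for p in sorted(participants)]


people = ["alice", "bob", "carol", "dave"]
mesh = mesh_calls(people)        # 6 calls, i.e., N*(N-1)/2 for N = 4
central = bridge_calls(people)   # 4 calls, i.e., N
```

The full mesh needs N*(N-1)/2 calls but no media server, whereas the centralized topology needs only N calls but requires the bridge to mix or forward media.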

The named stream abstraction is pretty low level, and can easily create a call abstraction by creating two named streams, one for each direction of media, or a conference abstraction by creating N named streams, one for publishing from each participant. The earlier trick of using a separate room for user identity can also be used here - a separate named stream for user identity, to which the owner pretends to publish. When the owner detects a player, she informs the player to instead play from another randomly generated named stream, for that new call.
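The mapping from a call or conference to named streams can be sketched as a simple plan of which stream each participant publishes and which ones it plays. The conference_streams function and the room/participant naming scheme for streams are my assumptions for illustration.

```python
def conference_streams(room, participants):
    """Map each participant to the stream it publishes and the streams it plays."""
    names = {p: f"{room}/{p}" for p in participants}
    return {
        p: {
            "publish": names[p],
            "play": [names[q] for q in participants if q != p],
        }
        for p in participants
    }


# A two-party call is just the N = 2 case: two streams, one per direction.
call_plan = conference_streams("call1", ["alice", "bob"])

# A three-party conference uses three streams, each played by the other two.
conf_plan = conference_streams("room42", ["alice", "bob", "carol"])
```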

These conversions among the various abstractions rely on certain assumptions, such as the ability to join a call or conference with one-way media, or the ability to join just for listening for events without actual media. Unfortunately, with locked APIs, many of these assumptions do not hold true in existing applications.

The interesting question is - if the abstractions are roughly interchangeable, would it make sense to define a generic API for a WebRTC notification server that can provide all three abstractions in a secure, scalable and robust manner? Moreover, can such an interchangeable abstraction be provided by a third-party layer, without modifying an existing WebRTC service? That would give application developers the freedom to pick the best abstraction for any particular scenario.

