I analyze some problems in client-server communication for multi-party video conferencing.
Audio communication differs from video in two important ways: (1) usually in a conference only one person is speaking at any time whereas everyone's video is on, (2) audio codecs are usually fixed bit-rate whereas video codecs adjust bit-rate based on various parameters such as available network bandwidth and desired frame-rate.
In client-server mode, video from each participant must be distributed to all the other participants, so the bandwidth and processing requirements at the server can be high; audio is cheaper because usually only one person is speaking at a time. Secondly, the downstream video bandwidth requirement at the client increases with the number of participants in a conference. In an N-party conference, each client usually has one outbound audio stream, one inbound audio stream, one outbound video stream and N-1 inbound video streams. Note that this problem is worse for a peer-to-peer (P2P) video conference, where everyone sends a video stream to everyone else, so there are N-1 inbound and N-1 outbound video streams at each client. For asymmetric network access (ADSL or cable), where upstream bandwidth is lower than downstream, this causes early saturation of the outbound network bandwidth. Shutting down the video stream, or reducing the video quality, while a person is not speaking saves some bandwidth, especially for speaker-mode conferences.
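The per-client stream counts above can be captured in a small calculation. This is a sketch; the function name, the 256 kb/s per-stream figure and the 5-party example are illustrative assumptions, not from the text.

```python
def video_stream_counts(n, topology):
    """Per-client video stream counts for an N-party conference.

    In client-server mode each client uploads one video stream and
    downloads N-1; in P2P mode it both uploads and downloads N-1.
    (Audio is one stream each way in client-server mode.)
    """
    if topology == "client-server":
        return {"video_out": 1, "video_in": n - 1}
    if topology == "p2p":
        return {"video_out": n - 1, "video_in": n - 1}
    raise ValueError("unknown topology: " + topology)

# Illustrative 5-party conference at an assumed 256 kb/s per video stream:
cs = video_stream_counts(5, "client-server")
p2p = video_stream_counts(5, "p2p")
print(cs["video_out"] * 256, "kb/s upstream, client-server")  # 256
print(p2p["video_out"] * 256, "kb/s upstream, P2P")           # 1024
```

The fourfold difference in upstream load is why an asymmetric (ADSL/cable) uplink saturates much earlier in the P2P topology.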
The second point of difference is that audio is usually encoded with a fixed bit-rate codec, whereas video bit-rate is adjusted based on several parameters such as available network bandwidth, desired quality and frame-rate. In a client-server environment, most implementations use the client-to-server network quality to decide what bit-rate to use for the client's video encoding. Consider a two-party client-server conference where the first client is closer to the server and hence has lower latency. The first client decides to use high-quality, high-bitrate video encoding, while the second client decides to use low-quality, low-bitrate encoding. This asymmetry causes the first client to receive poor quality video, whereas the second client's downstream link gets congested with high-bitrate video. The problem is further aggravated if, in a multi-party conference, there is only one participant on a poor quality network. The root cause is that the client-to-server network metric is used instead of an end-to-end metric when deciding the video encoding bitrate.
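An end-to-end policy can be sketched as follows: the sender's bitrate is capped by its own uplink and by the slowest receiver's downlink, rather than by the client-to-server path alone. The function name, the floor value and the numbers are hypothetical.

```python
def choose_encode_bitrate(sender_uplink_kbps, receiver_downlinks_kbps,
                          floor_kbps=64):
    """Pick a video encoding bitrate from end-to-end constraints.

    Cap at the sender's uplink and at the slowest receiver's downlink,
    but never go below an assumed minimum usable rate (floor_kbps).
    """
    limit = min([sender_uplink_kbps] + list(receiver_downlinks_kbps))
    return max(limit, floor_kbps)

# Two-party example from the text: a well-connected first client
# sending to a poorly connected second client.
print(choose_encode_bitrate(2000, [200]))  # 200, not 2000
```

With only the client-to-server metric, the first client would pick something near 2000 kb/s and congest the second client's downlink; the end-to-end minimum avoids that.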
Sometimes, the conference server imposes bitrate control to limit the traffic towards a low-bandwidth client. However, for efficiency reasons the server doesn't re-encode the video packets; instead, it just drops non-intra frames when there is not enough bandwidth. This yields marginal to no improvement, primarily because intra frames are several times bigger than other frames. It also causes choppy video, which further degrades the experience. The layered encoding in MPEG addresses this problem.
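A back-of-the-envelope calculation shows why dropping non-intra frames helps so little when intra frames dominate. The frame sizes and GOP lengths below are assumed for illustration.

```python
def savings_from_dropping_non_intra(i_frame_bytes, p_frame_bytes, gop_size):
    """Fraction of bandwidth saved by forwarding only intra frames,
    for a GOP of one I-frame followed by (gop_size - 1) P-frames."""
    total = i_frame_bytes + (gop_size - 1) * p_frame_bytes
    return 1 - i_frame_bytes / total

# Short GOP with a large I-frame (frequent intra refresh): dropping
# every P-frame saves only 20% of the bandwidth.
print(round(savings_from_dropping_non_intra(8000, 1000, 3), 2))   # 0.2

# Long GOP with small P-frames: the same policy would save a lot,
# but at the cost of very choppy (1 frame per GOP) video.
print(round(savings_from_dropping_non_intra(8000, 1000, 30), 2))  # 0.78
```

The savings depend entirely on the GOP structure, and the surviving frame rate drops to one frame per GOP, hence the choppiness noted above.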
Larger video packets may not traverse end-to-end over UDP. An encoded audio packet is usually small, of the order of 10-80 bytes per 20 ms. On the other hand, an intra-frame video packet can be much larger, say 1000-10000 bytes. When media packets are sent over UDP and the packet size is large, there is a high probability of the packet being dropped. This is because of the MTU restriction and middle-boxes (NATs and firewalls) in the media path. A UDP packet larger than the MTU (typically approx 1300-1400 bytes) gets fragmented at the IP layer, such that fragments after the first one do not carry the UDP header information (such as source and destination port numbers). A port-inspecting NAT or firewall that doesn't handle fragmentation correctly may drop such subsequent fragments, causing loss of the whole UDP packet at the receiver end. Thus, video over UDP has to take care of fragmentation and reassembly, and/or discovery of the path MTU, in the application layer.
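Application-layer fragmentation and reassembly can be sketched as below. This is a minimal illustration with an assumed 1200-byte payload budget; real systems use the payload format's own mechanism (e.g., FU-A fragmentation units in the H.264 RTP payload format) rather than an ad-hoc index/total header.

```python
def fragment(payload, mtu_payload=1200):
    """Split one encoded frame into MTU-safe chunks, each tagged with
    (index, total) so the receiver can reassemble the frame."""
    chunks = [payload[i:i + mtu_payload]
              for i in range(0, len(payload), mtu_payload)]
    total = len(chunks)
    return [(idx, total, chunk) for idx, chunk in enumerate(chunks)]

def reassemble(fragments):
    """Rebuild the frame; return None if any fragment is missing."""
    fragments = sorted(fragments, key=lambda f: f[0])
    total = fragments[0][1]
    if len(fragments) != total:
        return None  # incomplete frame: wait, or drop the whole frame
    return b"".join(chunk for _, _, chunk in fragments)

frame = bytes(5000)                  # e.g., a 5000-byte intra frame
frags = fragment(frame)
print(len(frags))                    # 5 fragments, each <= 1200 bytes
print(reassemble(frags) == frame)    # True
```

Note that losing any one fragment loses the whole frame, which is exactly why keeping each UDP datagram under the path MTU matters so much for intra frames.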
The server may allow video over UDP as well as TCP from the clients, typically to support NAT and firewall traversal. If some clients are on TCP and others on UDP, then the server also needs to proxy packets from one to the other. If the client over TCP assumes ordered packet delivery, then the server also needs to do buffering, packet re-ordering and delay adjustment, which further adds to the implementation complexity of the server. For audio, the problem is barely visible beyond a glitch in the sound, whereas for video the view may get completely corrupted until the next intra frame.
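The re-ordering the server must do per TCP client can be sketched as a sequence-number buffer: packets arriving out of order over UDP are held until the next in-sequence packet is available. The class and its interface are hypothetical; a real server also bounds how long it waits (the delay-adjustment part), which is omitted here.

```python
import heapq

class ReorderBuffer:
    """Hold out-of-order packets and release them in sequence order."""
    def __init__(self, first_seq=0):
        self.heap = []
        self.next_seq = first_seq

    def push(self, seq, packet):
        heapq.heappush(self.heap, (seq, packet))

    def pop_ready(self):
        """Return the maximal in-order run starting at next_seq."""
        ready = []
        while self.heap and self.heap[0][0] == self.next_seq:
            ready.append(heapq.heappop(self.heap)[1])
            self.next_seq += 1
        return ready

buf = ReorderBuffer()
buf.push(1, "pkt1"); buf.push(0, "pkt0")
print(buf.pop_ready())   # ['pkt0', 'pkt1']
buf.push(3, "pkt3")
print(buf.pop_ready())   # [] -- still waiting for seq 2
```

For audio a lost or late packet can simply be skipped after a deadline; for video, skipping means decoding against a missing reference frame, hence the corruption until the next intra frame.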
A slightly related problem arises when the conference server does audio mixing but video forwarding. In this case, the server must perform delay adjustment, packet re-ordering and buffering for the audio path. However, for efficiency reasons it may blindly forward the video packets among the participants. Thus the synchronization information between the audio and video gets lost, and performing lip synchronization at the receiving client becomes a challenge. A correct implementation of the server should act as an RTP mixer, i.e., include the contributing source information in the mixed audio stream, and distribute RTCP information to all the participants for synchronization. (How to do this if each audio call leg is a separate RTP session?)
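The RTCP information matters because lip synchronization works by mapping each stream's RTP timestamps onto a common wall clock, using the (NTP time, RTP timestamp) pair from the most recent RTCP Sender Report. A sketch of that mapping, ignoring 32-bit timestamp wrap-around; the numbers are illustrative:

```python
def rtp_to_wallclock(rtp_ts, sr_ntp_seconds, sr_rtp_ts, clock_rate):
    """Map an RTP timestamp to sender wall-clock time using the most
    recent RTCP Sender Report (SR) for that stream."""
    return sr_ntp_seconds + (rtp_ts - sr_rtp_ts) / clock_rate

# Audio at an 8 kHz RTP clock and video at 90 kHz map onto the same
# wall clock, so the receiver can align their playout:
audio_t = rtp_to_wallclock(8160, 100.0, 8000, 8000)      # 100.02 s
video_t = rtp_to_wallclock(91800, 100.0, 90000, 90000)   # 100.02 s
print(round(audio_t, 2), round(video_t, 2))
```

If the server mixes audio (creating new RTP timestamps) but forwards video untouched, and does not emit corresponding RTCP for the mixed stream, the receiver has no such mapping for the audio leg, which is exactly the lip-sync challenge described above.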
Some of these problems (2,3,5,6) can be solved to some extent by using peer-to-peer video conferencing.