How not to design a video conferencing product?

"Can you please stop sharing your screen so that I can share mine?" "I can't see that part of your shared screen because those buttons are overlaid on top." "How can I share my second webcam without stopping the first?" "Can you please say that again? I missed the last part." "It shows only up to nine videos even though I have a very big screen."

Do you ever feel frustrated due to some artificial restriction imposed by the video conferencing product you use? In this informal article, which some will find quite opinionated, I list some annoying product "features" when it comes to video conferencing. 


This is mostly a random list of problems, in no particular order, and not attributed to any specific product or brand. These problems can be roughly classified into five, sometimes overlapping, categories.
  1. In the name of security
  2. One size fits all
  3. Rigid video layout
  4. No feedback, self-test or customization
  5. Video conference is extra work

It takes four steps to connect, and three steps to disconnect. Joining a meeting requires sequential steps of login or signup, approval for cookies and/or terms and conditions, device selection and permission, and waiting for the meeting to "start". When you ask why, you hear, "These steps are for extra security." What's more, you can't just close the browser to leave the meeting - you first click the leave or logout button, click to confirm, provide feedback on call quality, and only then close the browser tab.

Over time, people get used to this without asking questions, like how folks remove shoes or liquids at an airport security checkpoint. We can do better - one click to call and one click to join.

Sometimes, there is no access to audio/video controls or settings until you "join" the conference. And you can't join the conference until it is "started" by the meeting organizer or host. Why can't a participant join before the organizer? You end up waiting there - and then once the organizer joins, everyone else rushes to join too, and to configure devices, mute the microphone, and turn off the camera, or say, "hello hello, one two three, can you hear me?" Can't these be automated, preferably independent of the "join" step?
 
Many video conferencing products are available as tools that are intended to be used actively, with full attention and focus during the use, and targeted for meeting-oriented teams. Multitasking or passive participation is either discouraged or not supported. 

You can't resize the video window (passive work) into a narrow vertical box on the side, so that your code editor (active work) can occupy most of the screen. A passive participant is suddenly woken up by a directed question, and says, "Can you please repeat the question? I missed the last part." Why can't the product have running subtitles, or a quick-click ability to replay the last five or ten seconds?
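
As a rough illustration of the "replay the last few seconds" idea, here is a minimal sketch of a rolling buffer that keeps only the most recent audio chunks. The class name `ReplayBuffer`, its methods, and the idea of storing timestamped chunks are assumptions made for this example, not a description of how any product works.

```python
from collections import deque


class ReplayBuffer:
    """Keeps only the most recent `keep_seconds` of timestamped media chunks."""

    def __init__(self, keep_seconds: float = 10.0):
        self.keep_seconds = keep_seconds
        self._chunks = deque()  # items are (timestamp_in_seconds, payload)

    def push(self, timestamp: float, payload) -> None:
        """Add a chunk and drop everything older than the retention window."""
        self._chunks.append((timestamp, payload))
        cutoff = timestamp - self.keep_seconds
        while self._chunks and self._chunks[0][0] <= cutoff:
            self._chunks.popleft()

    def last(self, seconds: float) -> list:
        """Return payloads whose timestamp is within `seconds` of the newest chunk."""
        if not self._chunks:
            return []
        cutoff = self._chunks[-1][0] - seconds
        return [payload for t, payload in self._chunks if t > cutoff]


# Hypothetical usage: one chunk per second, then replay the last five.
buf = ReplayBuffer(keep_seconds=10.0)
for t in range(20):
    buf.push(float(t), f"c{t}")
print(buf.last(5.0))
```

Because the buffer is bounded, memory stays constant no matter how long the meeting runs, which is what makes a one-click replay cheap to offer.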

Screen or app sharing is encouraged, where a presenter shows a slide deck to others. But what about two software developers working on code review or pair programming where both want to share their code editors, and multiple apps and windows with each other at the same time? 

Many products allow only one participant to share only one screen or app at any time. If that participant wants to share a second screen or app, she must stop the first one. If someone else wants to share the screen, the first participant must stop too. Given that multiple videos are already shown in a video conference, I feel this is largely an artificial restriction imposed by the product.

The same can be said about multiple webcams and microphones. If a user has two webcams, she could point one to herself, and the other to the room's whiteboard. Have you heard, "I will point my laptop camera towards the whiteboard, so that others can see what you are drawing there"? Restricting one participant to one webcam video is another artificial restriction, even if the system and network have the ability to send multiple video streams.

When it comes to network capability, product decisions are often made as a compromise. "It requires at least 1 Mbps upstream, and disables webcam if not enough bandwidth is detected." The underlying audio and video codecs are perfectly capable of working from a very low bandwidth of a few tens of kbps to a very high bandwidth, by making appropriate quality trade-offs. Accommodating heterogeneous network conditions requires a quality engineering effort.
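
To make the trade-off concrete, here is a small sketch that degrades quality step by step instead of switching the webcam off at a single threshold. The ladder and its kbps values are illustrative round numbers chosen for this example, not measured limits of any codec or product.

```python
# Illustrative quality ladder: (minimum kbps needed, what to send).
# The thresholds are assumptions for the sketch, sorted from best to worst.
LADDER = [
    (1500, "720p video + audio"),
    (600, "360p video + audio"),
    (200, "180p video + audio"),
    (60, "low-frame-rate thumbnail + audio"),
    (0, "audio only"),
]


def pick_quality(available_kbps: float) -> str:
    """Return the best rung of the ladder that fits the available bandwidth."""
    for min_kbps, label in LADDER:
        if available_kbps >= min_kbps:
            return label
    return LADDER[-1][1]  # negative or nonsensical input: fall back to audio


for kbps in (2000, 700, 80, 30):
    print(kbps, "kbps ->", pick_quality(kbps))
```

The point is only that there are many useful states between "full video" and "webcam disabled".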

The one-size-fits-all model is actually a mindset. It often assumes that if a restriction affects only a small percentage of users, then it is fine. The problem with this is that the probability multiplies with each such independent restriction. Suppose you have five features. In each feature, some restriction is made, so that 80% of the users are fine with it, and it ignores the needs of only 20% of the users. Suppose these five features are independent of each other, in terms of how they make their users happy. Then, with p = 0.2 and N = 5, only (1-p)^N = 0.8^5, or about 33%, of the users will be happy with all the restrictions applied together. The remaining two thirds of the users will have some issue with the product, and will be willing to look elsewhere for a better option. The majority becomes a minority - only a minority of the users are happy with the product!
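
The arithmetic is easy to check. The snippet below is just a worked version of the numbers in the paragraph above, under the same independence assumption; the function name is made up for the example.

```python
def fraction_happy(p_unhappy: float, n_features: int) -> float:
    """Share of users who are fine with all n independent restrictions."""
    return (1.0 - p_unhappy) ** n_features


# 20% of users are left out by each of five independent restrictions.
share = fraction_happy(0.20, 5)
print(f"{share:.1%}")  # 32.8%
```

With ten such restrictions instead of five, the same formula leaves only about one user in ten fully satisfied.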

The video layout is pretty rigid. You see the participant videos laid out in 3x2 or 5x4 mode, with some empty space. The layout does not adjust to the window size, i.e., it stays in landscape orientation, and if you resize the window to portrait, either some videos are cropped out or everything is scaled down while keeping the landscape layout within the portrait window. Sometimes, the videos are laid out in a fixed layout that requires more than 80% of the screen to avoid cropping or hiding some videos.
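
A layout that follows the window does not need much logic. The sketch below tries every column count and keeps the one that gives the largest tiles. It assumes equally sized 16:9 tiles and ignores margins, and the function name is hypothetical.

```python
import math


def best_grid(n: int, width: float, height: float, aspect: float = 16 / 9):
    """Return (cols, rows, tile_w, tile_h) that maximizes tile size for n videos."""
    if n <= 0:
        raise ValueError("need at least one video")
    best = None
    for cols in range(1, n + 1):
        rows = math.ceil(n / cols)
        # A tile is limited by the column width or by the row height.
        tile_w = min(width / cols, (height / rows) * aspect)
        if best is None or tile_w > best[2]:
            best = (cols, rows, tile_w, tile_w / aspect)
    return best


# Four videos: a landscape window gets a 2x2 grid,
# a portrait window gets a single column of four.
print(best_grid(4, 1600, 900)[:2])   # (2, 2)
print(best_grid(4, 900, 1600)[:2])   # (1, 4)
```

Re-running this on every window resize is enough to move between landscape and portrait arrangements without cropping anything.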

Video sizes are fixed and cannot be changed, i.e., a video is displayed at 360p or 180p or 720p, but not scaled to fit. Resizing the window does not change the video size. The video of each participant is in a landscape box. If a mobile user joins with a portrait-orientation camera, the video appears either excessively zoomed in or with excessive padding on the sides.

Each participant appears as a video box, even if that participant's webcam is not enabled, taking up your precious screen real estate. There is no easy way to hide boxes that are not showing live video, so that the actual videos occupy more space.

The number of videos displayed is fixed. For example, even if I have low bandwidth and someone else has high bandwidth, both of us receive the same nine video feeds to display. Whether I use a small laptop or a very large monitor, exactly twenty videos are displayed in grid view.
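
One possible alternative is to derive the number of videos from what the viewer's screen and network can actually handle. The sketch below is only that, a sketch: the minimum tile size and the per-thumbnail bitrate are invented defaults for illustration, and a real product would measure or configure them.

```python
def max_videos(window_w: int, window_h: int, downlink_kbps: float,
               min_tile_w: int = 160, min_tile_h: int = 90,
               kbps_per_thumbnail: float = 150.0) -> int:
    """Upper bound on videos to show, limited by both screen area and downlink."""
    by_screen = (window_w // min_tile_w) * (window_h // min_tile_h)
    by_network = int(downlink_kbps // kbps_per_thumbnail)
    # Always show at least one video, for example the active speaker.
    return max(1, min(by_screen, by_network))


print(max_videos(3840, 2160, 10_000))  # large monitor, fast network
print(max_videos(1280, 720, 500))      # laptop on a slow network
print(max_videos(320, 180, 10_000))    # tiny window, fast network
```

Each viewer then gets a different count, which is exactly what a fixed "nine videos for everyone" rule prevents.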

Videos are displayed in an all-or-nothing mode. There is no easy way to open the shared screen feed or an individual's video feed in a separate browser tab, or in large/focussed mode, or to select which videos you want to view and in what order. There is no easy way to zoom a displayed video, and even though the browser has a zoom feature, the product intentionally disables the default browser or device zoom.

The desire to have one-size-fits-all causes a lot of friction with mobile app development, especially when the device screen sizes and abilities vary a lot. "Let's show four videos for both tablet and phone as a compromise, even though nine videos on tablet and one active video on phone may be ideal." Such considerations then prompt separate app development and product roadmaps for mobile vs. desktop. If the app were designed from the beginning to scale up and down based on the available window size, the user could just use the web app on mobile.

There is limited or no feedback about the video quality or connectivity. "It was hard to tell if the person was not moving or if the video froze." "I kept speaking, but others did not hear. It looks like I got disconnected." "I couldn't hear him well, but wasn't sure if the problem was due to my Bluetooth or on his device."

In a telephony-style flow, you are not able to call yourself for a self-test. The product does not allow the same user to join the meeting multiple times. There is no other easy way to do a self-test to check if my audio and video are working well end-to-end.

The product automatically disconnects or reduces service, such as stopping video, if low quality or a network issue is detected. But it does not automatically reconnect or re-enable the service when the condition improves. Users do not have any way to control or change the default behavior. Reloading the web app requires me to redo the steps to arrive at this state. The page does not restore the previous state of the video layout or other display elements.
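
Re-enabling video when conditions improve can be sketched with two thresholds (hysteresis), so that the video does not flap on and off around a single limit. The class name, the kbps thresholds, and the `auto` flag that lets the user opt out are all assumptions made for this example.

```python
class VideoToggle:
    """Turns video off on a poor network and back on once it recovers."""

    def __init__(self, off_below_kbps: float = 150.0,
                 on_above_kbps: float = 300.0, auto: bool = True):
        self.off_below_kbps = off_below_kbps
        self.on_above_kbps = on_above_kbps
        self.auto = auto          # user-controllable: False leaves video untouched
        self.enabled = True

    def update(self, measured_kbps: float) -> bool:
        """Feed a new bandwidth estimate and return whether video should be on."""
        if not self.auto:
            return self.enabled
        if self.enabled and measured_kbps < self.off_below_kbps:
            self.enabled = False
        elif not self.enabled and measured_kbps > self.on_above_kbps:
            self.enabled = True
        return self.enabled


toggle = VideoToggle()
states = [toggle.update(kbps) for kbps in (500, 100, 200, 350)]
print(states)  # [True, False, False, True]
```

The gap between the two thresholds is the design choice here: video stays off at 200 kbps because the network has not clearly recovered, and comes back on its own at 350 kbps.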

Lack of customization is a big deal. There are tens, or even hundreds, of decisions made during product development, e.g., how quickly to adjust the layout based on low bitrate, or how many times to retry a direct connection before falling back to a media relay, or how much performance penalty is allowed for a higher quality. Many of these decisions become hard-coded during product development. Even if they are controlled by some internal configuration, they are not exposed to the end user or administrator.
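
Exposing such decisions does not have to mean giving up sensible defaults. Here is a minimal sketch in which the product ships defaults and a user or administrator overrides only what they care about. The setting names and default values are hypothetical, loosely based on the examples in the paragraph above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ConferenceSettings:
    """Hypothetical tunables that are often hard-coded."""
    layout_adjust_delay_s: float = 2.0    # wait before reacting to low bitrate
    direct_connect_retries: int = 3       # attempts before falling back to relay
    prefer_quality_over_cpu: bool = False


DEFAULTS = ConferenceSettings()


def with_overrides(defaults: ConferenceSettings, **overrides) -> ConferenceSettings:
    """Return a copy of the defaults with selected fields changed.

    dataclasses.replace raises TypeError for unknown field names,
    so a misspelled setting fails loudly instead of being ignored.
    """
    return replace(defaults, **overrides)


# An administrator who always wants the media relay skips direct attempts.
mine = with_overrides(DEFAULTS, direct_connect_retries=0)
print(mine)
```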

This lack of customization further fuels the one-size-fits-all mindset. Some may argue for automatic selection of the default configuration instead of manual. But the ability to alter the default has been the foundation of great and versatile software products - unless your market targets only a very narrow set of use cases.

Using the video conferencing tool is in itself some work - instead of adding video conferencing to the real work that I do, such as reviewing code, discussing a document, or showing a demonstration. On top of that, the product is too intrusive - it must be kept in the foreground, and it interferes with my actual work.

An immersive video conferencing experience has been tried a few times in the industry, but has not caught on yet. Ideally, communication using audio or video should be immersed within the collaborative work I am doing. It should be more like an always-on virtual presence, where just saying "Hey, Bob, check this piece of code?" while working from home should be enough to start the engagement, with app share happening behind the scenes, instead of using a "video-telephone" to dial out and engage actively with another person.

Using APIs, and embeddable technologies, it should be possible to add conferencing experience seamlessly in other work related and productivity tools in a non-intrusive manner. That is the holy grail of the video conferencing technology in my opinion.

On the other hand, the industry in general has a reverse focus - where the video conference product also covers messaging, and note taking, and voting, and whatnot - towards a unified product. It is easier to market the idea of an integrated system with video conference, messaging, notes and others than a hodgepodge system with video conference from one provider, note taking from another, and messaging from a third app.

The tradeoff is that the integrated product now competes on all axes, but can't be the best on all of them. Users end up suffering by not being able to use the best product in each category. Moreover, the web platform, with its separate tabs and frames, is ideally suited for such a mixture or mashup of apps from separate providers working seamlessly.

The problems I listed above are largely due to product decisions, and many times due to artificial restrictions, which could be resolved with some additional insights and engineering. If the goal is to make the users happy, a product should (1) avoid unnecessary steps in the name of security, (2) avoid the one-size-fits-all mindset, (3) not make video layouts rigid, (4) provide feedback, self-test and customization, and (5) not be a distraction in the actual work of the end user.

Did you ever feel frustrated due to some artificial restriction imposed by the video conferencing product you used? What are your thoughts or your list of such problems?
