We’ve all been there: On a video call, a few people start talking at the same time and suddenly you can’t hear what anyone is saying. The truth is, video call audio is, in general, pretty bad, leaving users feeling awkward, tired, and frustrated after a long day of being on calls. It doesn’t have to be this way. Other industries, such as the gaming industry, have found ways to make rich, dynamic sound experiences (with natural-sounding ambient noise and complex soundstages that let users hear whether something is coming from the side, in front, or behind them) that users can spend hours in and still want more. How do they do this? By prioritizing multi-channel, well-mixed audio. Video chat engineers should take a cue from these other industries and make use of available technology, such as Dolby Voice, that can help create a natural-feeling interaction that’s richer, easier, and lets people feel more like they’re meeting face to face.
The new normal of seemingly endless virtual meetings is rapidly changing social norms and driving new interest in the technology that connects us. While video conference platforms have been available for years, they were not designed to replace in-person interaction. After a year of pushing through “Zoom fatigue,” it’s time to take a fresh look at which features and attributes make users feel connected, and how they can be improved. One important aspect is sound. In fact, for a conversation in which everyone can hear, one could argue that sound is the most important feature of in-person interaction.
Good sound allows us to not only understand the words in a conversation, but also pick up on the mood conveyed by vocal intonation and environmental sounds. Bad sound, on the other hand, leaves us frustrated. In video chats with more than two people (quite common for virtual happy hours, team meetings, and collaborative sessions), concurrent speakers inevitably drown each other out. Today’s most popular video chat platforms aren’t compatible with rapid dialogue.
But video conversations aren’t going anywhere, given how many companies are continuing work-from-home policies. So to make meetings and other gatherings more productive (and more fun), it’s important to understand why the experience is so poor, and to know that solutions for video platform developers already exist in the worlds of video games and music.
Why Video Conference Sound Is Awful
At the most basic level, microphone levels vary between individuals, which makes simply combining each person’s sound wave into a single audio stream difficult. On some platforms, this can lead to speaker bias where the loudest person wins; on others, only the active presenter’s audio stream is prioritized. The resulting interruptions, repetition, and confusion lead people to interact differently on video chat than they normally would. This is a technical problem, and it all comes back to what’s called the “phase” between the combined sound waves.
Here’s how this works: Two sound waves of the same frequency that are perfectly aligned have a phase difference of 0, which we call “in phase.” When waves that are in phase combine, like two people saying the exact same thing at the same time, they produce a combined wave with twice the amplitude, which we hear as a louder sound. The problem comes when two sound waves in similar frequency ranges are “out of phase.” When the waves don’t line up, they start to cancel each other out, to the point where two waves that are exact opposites (180 degrees out of phase) cancel each other out entirely. Noise-cancelling headphones work by doing this on purpose.
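The effect described above can be reproduced with a few lines of Python. This is a minimal sketch under simplifying assumptions: it uses pure sine tones rather than real speech, and the sample rate, tone frequency, and helper names (`sine_wave`, `mix`, `peak`) are illustrative choices, not taken from any video chat platform.

```python
import math

SAMPLE_RATE = 8000  # samples per second (illustrative value)
FREQ_HZ = 440.0     # frequency of the test tone (illustrative value)


def sine_wave(phase_rad, n_samples=800):
    """Return a unit-amplitude sine tone that starts at the given phase."""
    return [
        math.sin(2 * math.pi * FREQ_HZ * i / SAMPLE_RATE + phase_rad)
        for i in range(n_samples)
    ]


def mix(a, b):
    """Combine two signals by adding them sample by sample."""
    return [x + y for x, y in zip(a, b)]


def peak(samples):
    """Largest absolute sample value in the signal."""
    return max(abs(s) for s in samples)


reference = sine_wave(0.0)

# Same phase: the waves reinforce each other and the amplitude doubles.
in_phase = mix(reference, sine_wave(0.0))

# Opposite phase (pi radians, or 180 degrees): the waves cancel out.
opposite = mix(reference, sine_wave(math.pi))

# A quarter cycle apart: partial reinforcement, between the two extremes.
quarter = mix(reference, sine_wave(math.pi / 2))

print("in phase peak:     ", round(peak(in_phase), 3))
print("opposite peak:     ", round(peak(opposite), 3))
print("quarter-cycle peak:", round(peak(quarter), 3))
```

Real voices contain many frequencies at once, so in practice the reinforcement and cancellation happen unevenly across the spectrum rather than all at once as in this toy case.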
In most video calls, phase issues make it hard to hear what other people are saying. When the sound waves of several speakers are combined, some portions of the waves are cancelled out, others are amplified at seemingly random points, and the stretches between the in-phase and out-of-phase points of the wave cycle come out noisier.
There’s an additional issue: Video conferencing was built for in-office meetings, where you can expect quiet surroundings (though even in an ideal setting most services have weak points). What happens when we want to virtually grab coffee, socialize, attend a dance class, or do any of the other things we normally do outside of work? Platforms typically filter sound to reduce the volume of frequencies when people aren’t talking; compounded with the aforementioned phasing issues, this means that ambient sound cannot be introduced into today’s video chats without further diminishing (or completely cancelling out) dialogue intelligibility. So the city sounds of traffic, your favorite ’90s hip-hop in the background at the coffee shop, or a high-paced pop song driving the mood are reserved for IRL. This is disappointing. Without ambient noise, we suffer from mood-killing silence when we take a break from talking.
We don’t need to accept this limitation, however. While video chat technology introduces challenges around natural dialogue cadence and environmental mood, in other spaces, like music and gaming, audio engineering has been a focal point of advancement for quite some time.
What Video Chat Companies Can Learn from Music and Gaming
When you hop into a virtual environment, say Call of Duty multiplayer mode, you’re immersed in a first-person point of view in which you can effortlessly spend hours. It’s not by chance that gamers get less fatigued than users of a platform like Zoom. The visual experience is paired with complete audio clarity across ambient sound, action sound effects, virtual teammate dialogue, and more. While an explosion might be happening in front of you, you can hear crickets in your periphery and your teammates’ voices center stage right, all within your headphones. Simply put, this is possible because of audio mapping (or mixing). By placing each sound at a certain distance and direction from the user, the game combines sound waves using signal processing that alleviates the challenges of phasing and noise.
This analogy carries over to music as well. Though you may not realize it, music has required mixing ever since we began to combine a multitude of sounds into a single experience. In music, producers not only blend together vocals for the main and background singers but also seamlessly add in strings, horns, bass, and other instrumentation to deliver hit songs. Audio engineers make sure instruments don’t clash, the volume of the song is appropriate, and the core emotion of a song comes through by properly staging (filtering, compressing, and more) the sounds in the composition.
If you compare the audio from music (or video games) with the audio on a video call, you’ll see how far video conferencing has to go. Try listening to one of your favorite songs with your eyes closed, and hear where the vocals sit on the soundstage in comparison to the other instruments that are driving the melody. Then, next time you’re on a Google chat with two other people on your desktop, listen to where all the voices come from. You’ll notice that in video chat the soundstage is not used to the extent it should be.
Video chat developers can take a cue from audio engineers in the music industry, who have been responsible for delivering near-perfect sound experiences for decades. Consider how this could apply to the audio of a collaborative virtual meeting. Imagine hearing one person more from the left and another more from the right, while at the outer edges of the soundstage there is ambient music playing. Now you have a more natural interaction, and by giving the soundstage more room to drive the video chat experience, the user’s brain will actually better map each voice to its source on screen. This approach lends itself to better user orientation, ultimately resulting in reduced Zoom fatigue.
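The left and right placement imagined above can be sketched with a constant-power pan law, a standard technique in audio mixing. This is a minimal illustration, not a description of how any particular platform works: the function names (`pan_gains`, `mix_to_stereo`), the pan positions, the levels, and the tiny sample lists are all hypothetical.

```python
import math


def pan_gains(position):
    """Constant-power pan law.

    position runs from -1.0 (hard left) through 0.0 (center) to 1.0 (hard right).
    Returns (left_gain, right_gain) with left_gain**2 + right_gain**2 == 1.
    """
    angle = (position + 1.0) * math.pi / 4.0  # maps [-1, 1] onto [0, pi/2]
    return math.cos(angle), math.sin(angle)


def mix_to_stereo(sources):
    """Mix mono sources into a stereo pair.

    sources is a list of (mono_samples, position, level) tuples.
    Returns (left_samples, right_samples).
    """
    length = max(len(samples) for samples, _, _ in sources)
    left = [0.0] * length
    right = [0.0] * length
    for samples, position, level in sources:
        left_gain, right_gain = pan_gains(position)
        for i, sample in enumerate(samples):
            left[i] += sample * level * left_gain
            right[i] += sample * level * right_gain
    return left, right


# Hypothetical inputs: two voices and a quiet ambient track (toy sample values).
voice_a = [0.5, 0.5, 0.5, 0.5]
voice_b = [0.25, -0.25, 0.25, -0.25]
ambient = [0.1, 0.1, 0.1, 0.1]

left, right = mix_to_stereo([
    (voice_a, -0.6, 1.0),  # first speaker, placed left of center
    (voice_b, 0.6, 1.0),   # second speaker, placed right of center
    (ambient, -1.0, 0.3),  # ambient music at the far left edge, turned down
    (ambient, 1.0, 0.3),   # and at the far right edge
])
```

Keeping the squared gains summing to one means a voice keeps roughly the same overall power wherever it is placed, so moving a speaker across the soundstage does not also make them louder or quieter.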
Building a Better Video Chat
With audio coming to the forefront as a key part of the video chat experience, companies like Pilotly (where I’m the CEO) and BlueJeans have taken steps to move towards the future of virtual conversation. Working with leaders in the audio processing space, both companies have applied algorithms to video chat that create more clarity around dialogue through robust audio mixing.
BlueJeans, recently acquired by Verizon, was one of the first to work with a partner, Dolby, to enhance its user experience. To put clear dialogue at the forefront of its value proposition, the company pulled in Dolby Voice, a system that could normalize audio levels, optimize for particular voice bandwidth, reduce noise, and do some mixing in the cloud to prevent cancellations when multiple parties talk in a meeting.
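To make one of those steps concrete, here is a minimal sketch of level normalization: scaling each participant’s audio toward a common RMS (root mean square) level before mixing, so the loudest microphone does not automatically win. This is a generic illustration and not Dolby Voice’s actual algorithm; the function names, the target level, and the gain cap are assumptions chosen for the example.

```python
import math


def rms(samples):
    """Root-mean-square level of a signal (0.0 for an empty signal)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_rms(samples, target_rms=0.1, max_gain=10.0):
    """Scale a signal so its RMS level approaches target_rms.

    max_gain caps the boost so that a nearly silent input (mostly
    background hiss) is not amplified without limit.
    """
    level = rms(samples)
    if level == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = min(target_rms / level, max_gain)
    return [s * gain for s in samples]


# Hypothetical participants: the same tone captured by a quiet and a loud mic.
tone = [math.sin(2 * math.pi * i / 40) for i in range(400)]
quiet_mic = [0.05 * s for s in tone]
loud_mic = [0.8 * s for s in tone]

quiet_fixed = normalize_rms(quiet_mic)
loud_fixed = normalize_rms(loud_mic)

print("before:", round(rms(quiet_mic), 3), round(rms(loud_mic), 3))
print("after: ", round(rms(quiet_fixed), 3), round(rms(loud_fixed), 3))
```

A production system would adjust the gain continuously over short time windows rather than once over a whole recording, but the principle of bringing every speaker to a comparable level is the same.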
Other companies can follow suit. Dolby is currently working to make it easier for applications to improve their audio capabilities. Paul Boustead, Chief Architect of Dolby’s Communications Business Group, says that expanding the use of these technologies is a priority for the company. “I’ve been specializing in voice and video communications for over 20 years, as a researcher, an engineer and an architect,” he says. “I’ve really been pushing to make online communication as natural as possible.”
Pilotly’s video chat platform, Reelchat, is focused on creating a virtual environment that will be akin to a gaming experience. The first application of Reelchat has been virtual focus groups, where it’s important to have quick, free-flowing conversations in which you can hear more than one person at a time, just like in a meeting or at a happy hour. This is one of the reasons why we’ve prioritized audio mapping: to make conversation as comfortable and intuitive as possible for participants. We believe that the key to making virtual human interactions work is moving the user into a space where sound exists more naturally.
Adjusting and accelerating the rate of advancement in video chat technology will be central to the success of business, higher education, and social connection as we continue to endure extreme limitations on IRL interaction during a pandemic. Audio, long ignored as a central factor in the audience experience of visual media, is the future of interaction. Gaming and music understand this, and the next generation of collaboration and meeting platforms would do well to bring the same kind of audio mixing into their user experiences.