Google on Thursday announced an accessibility feature for Android it calls Expressive Captions. The software, which is built atop Google’s existing Live Captions feature, uses artificial intelligence to help Deaf and hard-of-hearing people understand emotion in spoken dialogue.
Google boasts that Expressive Captions not only allows users to read what people are saying; in the company's words, "you get a sense of the emotion too."
The company made the announcement in a blog post written by Angana Ghosh, director of Android product management. She called today's news "a meaningful update" because people who can't hear well deserve the opportunity to feel what is said on screen, not merely read it. At a technical level, Ghosh explained, Expressive Captions works by using AI on one's Android device to convey vocal attributes such as tone and volume; environmental sounds, like crowd noise during sporting events, are represented as well. These cues make a significant impact "in conveying what goes beyond words," according to Ghosh.
In a brief interview with me, Ghosh said developing Expressive Captions was a collaborative effort within Google that included the DeepMind team and many others. The work wasn't trivial, either; she added that bringing Expressive Captions to life spanned "the last few years." As for the nerdy nitty-gritty, Ghosh told me Expressive Captions functions by "[using] multiple AI models to interpret different signals that allow it to give you a full picture of what is in the audio." The AI processes incoming audio locally on the device to recognize non-speech and ambient sounds, along with what Ghosh called "transcribing speech and recognizing appropriate expressive stylization."
“All these models are working nicely together to give us the experience we want for our users,” she said.
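To give a concrete mental model of the kind of pipeline Ghosh describes, here is a minimal sketch of how several signals (a transcript, an estimate of loudness, and detected ambient sounds) might be merged into one styled caption. The data structure, threshold, and rendering rules below are purely illustrative assumptions, not Google's implementation, which it has not published.

```python
# Hypothetical sketch of a multi-model caption pipeline: speech is
# transcribed, vocal attributes and ambient sounds are detected, and the
# results are merged into a single styled caption line. All names,
# fields, and heuristics here are illustrative placeholders.

from dataclasses import dataclass
from typing import List


@dataclass
class AudioChunk:
    text: str               # output of a speech-to-text model (assumed)
    loudness_db: float      # relative loudness estimated from the waveform
    ambient_tags: List[str] # non-speech sounds, e.g. ["crowd cheering"]


def stylize(chunk: AudioChunk) -> str:
    """Combine transcription, vocal attributes, and ambient sounds into
    one caption, roughly mirroring the cues Expressive Captions surfaces
    (capitalization for volume, bracketed tags for non-speech sounds)."""
    text = chunk.text
    # Hypothetical rule: render loud speech in capitals to convey volume.
    if chunk.loudness_db > -10.0:
        text = text.upper()
    # Append ambient, non-speech sounds as bracketed descriptors.
    tags = "".join(f" [{tag}]" for tag in chunk.ambient_tags)
    return text + tags


if __name__ == "__main__":
    chunk = AudioChunk(text="what a goal", loudness_db=-5.0,
                       ambient_tags=["crowd cheering"])
    print(stylize(chunk))  # -> "WHAT A GOAL [crowd cheering]"
```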
Ghosh said Google cares deeply about accessibility; its goal is to build products for everyone, including disabled people. She noted Live Captions debuted in 2019 as a way to make media more accessible to people with limited hearing or none at all, as aural content "often remains inaccessible to the Deaf and hard-of-hearing communities."
“Expressive Captions pushes that a step further to provide people with the context and emotion behind what is being said, making audio and video content even more accessible,” Ghosh said.
She added: “When we build more accessible technology, we create better products overall. Oftentimes, they can be beneficial for a wide range of people, including those who don’t have disabilities. With captions, this is especially the case, as 70% of Gen Z uses captions regularly.”
In the years since Live Captions was introduced, Ghosh said Google has heard from many in the Deaf and hard-of-hearing community that they missed "the emotions and nuances behind the content." That matters because, as she said, "in many cases those nuances of audio, like a well-placed sigh or laugh, can completely alter the meaning of what is being said." Ghosh told me Google collaborated with a number of experts, including theatre artists and speech and language pathologists, in making Expressive Captions; this helped the team understand where current technology falls short and, more saliently, "what's important to emphasize within audio."
In other words, Google sought to “ensure context was being reflected.”
“Expressive Captions provides that information in a consistent way across all apps and platforms on your phone,” she said. “Expressive Captions aims to provide the full picture of audio and video content, capturing the nuances of tone and non-verbal sounds. We hope this is a step towards making captions more helpful and equitable for people.”
However technically impressive, and yet another example of wielding AI's sword for genuine good, it should be mentioned that what Google has done with Expressive Captions isn't necessarily novel. Professional captioners, such as those employed at companies like VITAC, have long augmented closed captions with emotive metadata. In many places, parenthetical descriptors denote, per Ghosh's aforementioned points, ambient details like a well-placed sigh or swelling crowd noise. There are even indicators of what song or type of music is playing during a television show or movie.
When asked about feedback, Ghosh said Expressive Captions has been received positively. Because it is a brand-new technology, she said it was important to the team to "embed" testing throughout the development cycle, trying out different stylizations and deploying prototypes to a range of groups. The overarching goal was to build a product that felt "helpful and intuitive" to people, as readability and comprehensibility are paramount to captions. Many participants reported during the testing phases that Expressive Captions improved accuracy and context.
Looking towards the future, Ghosh expressed excitement.
“We’re incredibly excited to be releasing [Expressive Captions],” she said of the feature’s advent. “It’s a new challenge to consider how to bring more expression and context into captions. It isn’t something that has been done with automatically generated captions before and we look forward to receiving feedback from people, including the Deaf and hard-of-hearing communities, as they use the feature. We want to be thoughtful about making Expressive Captions truly helpful for people.”