Dan’s Weekly AI Speech and Language Scoop #26

These notes are written for a Meta-internal audience. I quickly redact Meta-internal information before publishing but don’t take the time to rewrite. If a particular edition seems thin or a transition doesn’t make sense, this is why.

Dispatches from Interspeech

While one day remains, I've seen enough great talks, met enough great people, and learned enough great stuff to feel confident writing a wrap-up (and it's Wednesday, the day I block off for writing these).

Where are the audio LLMs?

My main takeaway was that despite the ASR + LLM session being by far the best attended, LLMs, and especially audio LLMs, have not really reached the broader speech community. I never saw systems benchmarked against GPT-4o, Gemini, or some kind of Llama-based speech system, even when the comparison seemed apt, and I only caught a few posters using LLMs to improve ASR in various ways [1, 2]. I was expecting the speech community to be all over this trend, especially since I believe audio LLMs have the potential to subsume many of these research fields, but they seemed largely pushed to the corner. Please write me if you think I am missing something here, because I genuinely want to understand how the broader speech community is reacting to these models.

Saving time and money with HuBERT-EE (early exit)

I thought this was a clever idea. The authors claim that modern speech recognition models are big and inefficient (true!), so they attempt to do something about it. They add an early-exit branch to each layer of the transformer encoder so the model can stop computation once it crosses a confidence threshold, saving compute and dropping latency. Their results aren't stellar, but I think they represent a great step toward making large models more efficient, and I'm sure there is more juice to squeeze here. The approach is similar in spirit to Google's Mixture of Depths.
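For intuition, here's a minimal sketch of what early-exit inference can look like: a small classification head on every encoder layer plus a confidence check. The class names, head design, and confidence score are illustrative, not the paper's actual recipe.

```python
# Minimal early-exit inference sketch (illustrative, not the HuBERT-EE implementation).
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    def __init__(self, d_model=256, n_layers=12, vocab_size=32, threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        # One lightweight head per layer so any layer can produce output logits.
        self.exit_heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_layers)
        )
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, x):
        """x: (batch, time, d_model) acoustic features already projected to d_model."""
        for i, (layer, head) in enumerate(zip(self.layers, self.exit_heads)):
            x = layer(x)
            logits = head(x)                              # (batch, time, vocab)
            probs = logits.softmax(dim=-1)
            confidence = probs.max(dim=-1).values.mean()  # avg top-1 prob over frames
            if confidence >= self.threshold:
                return logits, i + 1                      # exit early, report depth used
        return logits, len(self.layers)                   # fell through: full depth

model = EarlyExitEncoder()
features = torch.randn(1, 200, 256)   # fake utterance features
logits, layers_used = model(features)
print(f"exited after {layers_used} layers")
```

Confident inputs exit in a few layers; hard ones pay for the full stack, which is where the compute and latency savings come from.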

MASSIVE dataset for audio LLMs

NLU (natural language understanding) is a traditional NLP (natural language processing) task that requires tags to be applied to words in an utterance, e.g. Open Uber and book me a ride to the airport → transport type: car, place name: airport. This is referred to as slot filling, and each tag type is a slot (there is a predefined set of slots for each domain). When the input for this task is audio rather than text, it becomes SLU (spoken language understanding) instead of NLU.
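To make the format concrete, a slot-filling prediction for the example above might be serialized like this (the intent and slot names are illustrative, not MASSIVE's exact ontology):

```python
# Illustrative only: a slot-filling prediction for the utterance above,
# in the rough intent + slots shape used by NLU/SLU corpora.
utterance = "Open Uber and book me a ride to the airport"

prediction = {
    "intent": "book_ride",          # hypothetical intent label for the whole utterance
    "slots": {                      # each filled slot maps to a span of the utterance
        "transport_type": "car",
        "place_name": "airport",
    },
}
print(prediction)
```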

A team at Interspeech presented Speech-MASSIVE, a multilingual SLU corpus. I ran a few of its examples through Gemini and GPT-4o, and both did pretty well. I think SLU could become an important part of audio LLM benchmarking before better metrics arrive.

e2e models are all you need?

I saw a bunch of papers (notably not the Speech-MASSIVE paper above) showing that an e2e system could beat cascaded baselines on some speech + NLP task. This seems like a somewhat obvious result given how much information is carried in the paralinguistics, but why stop there? Related to my first point, I would expect these groups to reach the logical conclusion that audio LLMs can accomplish all of these tasks and that the action is probably in totally general models.

Hung-yi Lee is an amazing speaker (and gave an excellent survey of audio LLMs)

If you can find a video of his presentation, watch it. I really like how he framed the problem and presented the state of the field. Looking through his slides and reading the references is a good backup plan if a video isn’t available.

The Chatbot Arena attempts to fight off Goodhart

I’ve written extensively about how this seemingly ungameable (and previously unparalleled) human preference benchmark is obviously being hacked by a number of labs (and loved seeing our fearless leader Thread (??) the same this week). LMSYS has responded to the criticism with a style-controlled Elo score that attempts to separate ratings based on style (gameable) from those based on substance (not gameable).

Their methodology is simple but makes a lot of sense given the biases I’ve seen in annotators working on human preference data. And the results are exactly what you’d expect. Last week, I noted how crazy it was that Claude 3.5 Sonnet was tied with Gemini 1.5 Flash in the Arena. The former is obviously a more capable model.

When controlling for style, Claude 3.5 Sonnet rockets from 4th to 2nd and Gemini 1.5 Flash drops from 6th to 9th. It may have just been a coincidence that I selected these two to experiment with, but it is quite refreshing to see my experience quantified through a study like this. The big winners are Anthropic and Meta, whose models improve when controlling for style; the losers are Grok and Google; and OpenAI is relatively neutral (aside from mini).
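For a rough sense of how controlling for style can work, here's a toy Bradley-Terry-style logistic regression with a style covariate (response-length difference) added alongside the per-model terms. The data and feature choice are made up, and this is only in the spirit of LMSYS's method, not their exact implementation:

```python
# Toy style-controlled Bradley-Terry sketch: fit pairwise battle outcomes with
# per-model coefficients plus a style covariate, so the model coefficients
# reflect "substance" with the style effect partialled out. Data are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

models = ["model_a", "model_b", "model_c"]
idx = {m: i for i, m in enumerate(models)}

# Each battle: (winner, loser, length_diff) where length_diff is
# winner-response tokens minus loser-response tokens.
battles = [
    ("model_a", "model_b", +120),
    ("model_b", "model_c", +300),
    ("model_a", "model_c", -50),
    ("model_c", "model_b", -10),
    ("model_a", "model_b", +80),
]

X, y = [], []
for winner, loser, length_diff in battles:
    row = np.zeros(len(models) + 1)
    row[idx[winner]] += 1.0      # +1 for the model listed first (the winner here)
    row[idx[loser]] -= 1.0       # -1 for its opponent
    row[-1] = length_diff        # style covariate: length difference
    X.append(row)
    y.append(1)                  # label 1 = first-listed model won
    X.append(-row)               # mirrored battle so both outcome classes appear
    y.append(0)

clf = LogisticRegression(fit_intercept=False).fit(np.array(X), y)
strength = clf.coef_[0][: len(models)]  # style-adjusted, Elo-like strengths
style_effect = clf.coef_[0][-1]         # how much longer answers help, per token
print(dict(zip(models, strength.round(3))), "length effect:", round(style_effect, 5))
```

The per-model coefficients give the "substance" ranking, while the style coefficient absorbs whatever preference annotators have for longer (or more heavily formatted) answers.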

The benchmark we’ve all been waiting for

I recently wrote about the challenges of evaluating audio LLMs (note: at Interspeech I heard these things referred to as audio LLMs, spoken LLMs, and speech LLMs; I like the first because it encompasses audio beyond speech). Unbeknownst to me, a group in Singapore had already put together a pretty comprehensive suite called AudioBench.

They include the old standbys like ASR (LibriSpeech, Common Voice) and augment them with speech understanding, voice understanding, and audio scene understanding (another thing I like about this paper: they define speech understanding as tasks that can be accomplished from the words alone and voice understanding as tasks that need the non-verbal audio).

Speech understanding tasks include:

  • ASR
  • Speech QA
  • Speech instruction following

Audio scene understanding:

  • Audio captioning
  • Audio-scene QA

Voice understanding:

  • Accent recognition
  • Gender recognition
  • Emotion recognition

While I don’t think these tasks alone would define a strong audio LLM that humans would like to use, they are a darn good start (and better than ASR and AST alone). I found it a bit curious that they left out AST (automatic speech translation) and SLU (spoken language understanding), but this is admittedly a work in progress (they want to add speech generation metrics, among others). I think this could evolve into the standard benchmark for audio LLMs and buy the industry a bit of time while human evals and private leaderboards mature.
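As a sketch of how a taxonomy like this might be wired into an eval harness (this is not AudioBench's actual API; the task names and scoring loop are placeholders):

```python
# Hypothetical harness sketch: organize the task taxonomy above as data and
# loop a model over it, averaging per-category scores.
from statistics import mean

TASKS = {
    "speech_understanding": ["asr", "speech_qa", "speech_instruction_following"],
    "audio_scene_understanding": ["audio_captioning", "audio_scene_qa"],
    "voice_understanding": ["accent_recognition", "gender_recognition", "emotion_recognition"],
}

def evaluate(model, task):
    """Placeholder: run `model` on one task's dataset and return a 0-1 score."""
    return model(task)

def run_benchmark(model):
    return {category: mean(evaluate(model, t) for t in tasks)
            for category, tasks in TASKS.items()}

# Toy "model" that returns a fixed score, just to show the harness shape.
print(run_benchmark(lambda task: 0.5))
```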

On the note of human evals for audio LLMs, Hung-yi Lee of NTU is creating the Chatbot Arena for audio (SpokenAgent-SUPERB). Email him if you’d like to submit a model.

Cough2Vec

One of Google’s ML research teams trained a “bioacoustic foundation model” on a giant corpus of human sounds. They achieve SOTA on a number of tasks, including cough and sneeze detection, and are partnering with a startup to detect tuberculosis.

Now that nearly every human on the planet has access to a camera and microphone connected to the internet, I hope that we really accelerate this kind of work. I’ve heard about these kinds of projects for quite some time, but I never really know how real they are or what the impact is. Who wouldn’t want an app that could give a directional diagnosis based on a cough or photo of a mole? I don’t doubt that we have the technology to perform this better than the existing medical system today, so I hope we can figure out how to cut through the red tape and give people better lives. 

Amazon surrenders

Following Apple, Amazon has elected to license a 3P LLM for its core voice assistant product. I don’t know what to make of either of these moves. Are they craftily using OpenAI/Anthropic as accelerants? Are these deals going to catapult OpenAI and Anthropic into the mainstream? I have no idea, but both do seem like some kind of capitulation.

And I do wonder why they didn’t pull a Llama off the shelf and fine-tune it for their needs. I know that the Meta license requires some kind of arrangement when the licensee has >700M MAU, but I can’t imagine that integrating an API call would accelerate their own work more than fine-tuning and deploying a Llama.