The embryonic state of audio LLM leaderboards
Back in 2021, Google introduced the concept of instruction fine-tuning in their paper “Finetuned Language Models are Zero-Shot Learners”, aka FLAN, a play on OpenAI’s GPT-3 paper title, “Language Models are Few-Shot Learners”. Through instruction fine-tuning, language models could be taught to answer any conceivable question: “What is the capital of France?”, “How do you feel today?”, “Who is the main character in this passage?”, and so on. This was the first time, AFAIK, that a lab tuned a language model to really talk to a human (OpenAI significantly refined this approach a few months later with “Training language models to follow instructions with human feedback”, aka InstructGPT).
However, the FLAN team created quite a problem for themselves. How could they evaluate a model that could do anything? Their response was to test on everything! They assembled a battery of 62 (!!!) existing NLP benchmarks, including some familiar names still used today like HellaSwag, StoryCloze, ARC, Winogrande, and SQuAD, and measured performance on all of them. While this battery of evals has largely been discarded and replaced by ones that more precisely measure whether humans find these models valuable, it served to get the field off the ground and hillclimb toward the incredible products we have today.
Audio LLM evals are in their FLAN era, so I wanted to take a deeper look at what’s out there. So far, these benchmarks are all quite focused on understanding and reasoning, but I’m hoping we’ll see more generation benchmarks appear soon, especially now that we can plausibly use the strong understanding models as judges.
Dynamic Superb
Dynamic Superb is an eval from Hung-yi Lee’s group (the same professor trying to create a Chatbot Arena for audio) focused on speech and audio understanding (no generation tasks included). It is composed of 55 tasks (a subset is shared below) organized into 6 buckets (tasks marked with a * are possible with cascaded systems):
- Content
- *Dialogue classification
- *Dialogue emotion classification
- *Intent classification
- *Language identification
- *Speech command recognition
- *Speech detection
- *Speech text matching
- Speaker
- Accent classification
- Speaker distance
- *Multispeaker detection
- Speaker counting
- Speaker verification
- Degradation
- Noise detection
- SNR prediction
- Paralinguistic
- *Emotion recognition
- Sarcasm detection
- Stress detection
- Audio
- Bird sound detection
- Chord classification
- Environmental sound classification
All of these tasks come with an OSS evaluation suite and dataset, so they should be relatively straightforward to run. As with FLAN’s battery, these tasks don’t truly represent what users want from an audio LLM, but they do capture a big chunk of the kinds of things a user might ask. Their leaderboard doesn’t appear very active, but Qwen-Audio is on top.
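Most of these tasks reduce to the same loop: hand the model an audio clip plus a natural-language instruction and exact-match the answer against a label. Here is a minimal sketch of that harness, assuming a hypothetical `model.generate(audio, prompt)` interface and a made-up example schema; the actual Dynamic Superb suite is more elaborate.

```python
# Minimal sketch of an instruction-style audio classification eval,
# in the spirit of Dynamic Superb's tasks. The model interface and
# dataset fields here are hypothetical stand-ins, not the real harness.
from dataclasses import dataclass

@dataclass
class Example:
    audio_path: str   # path to a wav/flac clip
    instruction: str  # e.g. "Which emotion does the speaker convey? Answer: happy/sad/angry/neutral"
    label: str        # gold answer, e.g. "happy"

def evaluate_task(model, examples: list[Example]) -> float:
    """Exact-match accuracy over one task; most tasks above reduce to this."""
    correct = 0
    for ex in examples:
        prediction = model.generate(audio=ex.audio_path, prompt=ex.instruction)
        if prediction.strip().lower() == ex.label.strip().lower():
            correct += 1
    return correct / len(examples)

# Per-bucket scores would then just be the mean over the tasks in that bucket.
```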
AIR-Bench
AIR-Bench (Audio InstRuction Benchmark) is a newer speech and audio understanding benchmark that takes a similar tack to Dynamic Superb. They identify 23 tasks in 4 categories (speech, music, sound, and chat) and also show Qwen2-Audio at the top of their list, outperforming Gemini 1.5 Pro (but underperforming a strong Whisper+GPT-4 cascaded system in the categories where that is feasible). In all cases, GPT-4 is used as a judge (which seems like overkill for multiple-choice questions, but GPT-4 is cheap these days); a minimal sketch of that judging setup follows the task list.
- Speech
- *Intent classification
- *Speaker number verification
- *Speech entity recognition
- Speaker age prediction
- Speaker gender recognition
- *Spoken language identification
- *Emotion recognition
- Speech grounding (identify beginning/end timestamps for a given word)
- Sound
- Audio grounding
- Vocal sound classification
- Acoustic scene classification
- Music
- Music instrument classification
- Music genre classification
- Music note analysis
- Music emotion detection
- Chat
- *Speech QA
- Sound QA
- Music QA
- Mixed audio QA
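For the chat-style tasks, the judging loop is roughly the following. This is a sketch using the OpenAI Python client; the judge prompt and the 1–5 scale are my assumptions, not AIR-Bench’s actual template.

```python
# Sketch of LLM-as-a-judge scoring in the style AIR-Bench describes.
# The judge prompt and 1-5 scale are assumptions, not the benchmark's exact template.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an audio model's answer.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Rate the model answer from 1 (wrong) to 5 (fully correct). Reply with the number only."""

def judge_answer(question: str, reference: str, candidate: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```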
AudioBench
AudioBench feels like the most modern of the three, having pulled in the latest speech instruction-following evals and stood up a dashboard on HuggingFace. A Whisper + Llama cascaded system leads in all categories answerable from semantics alone, and Qwen2-Audio leads the rest. They propose a taxonomy similar to Dynamic Superb’s and AIR-Bench’s.
- Speech understanding
- *ASR
- *Speech QA
- *Speech instruction following (this is very similar to QA)
- *Speech translation
- Audio scene understanding
- Audio captioning
- Audio scene QA
- Voice understanding
- Accent recognition
- Gender recognition
- Emotion recognition
ASR and AST
Most of these models are also evaluated on various ASR (automatic speech recognition) and AST (automatic speech translation) tasks, even though these capabilities represent the bare minimum of what is expected of an audio LLM. I believe this is largely because the evals exist and are easy to run. Gemini 1.5 Pro and GPT-4o shared benchmarks only on these axes.
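As an illustration of how easy these are to run, corpus-level word error rate is a few lines with the jiwer package; `transcribe` below is a stand-in for whichever audio LLM or cascaded system is being tested.

```python
# Word error rate for an ASR eval; `transcribe` is a stand-in for
# whichever audio LLM or cascaded system is being evaluated.
import jiwer

def asr_wer(examples, transcribe) -> float:
    """examples: iterable of (audio_path, reference_transcript) pairs."""
    references, hypotheses = [], []
    for audio_path, reference in examples:
        references.append(reference)
        hypotheses.append(transcribe(audio_path))
    # jiwer aggregates WER over the whole corpus, not per utterance
    return jiwer.wer(references, hypotheses)
```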
My conclusions from this are fourfold:
- There is an opportunity to align on a set of canonical quasi-academic audio LLM evals to track progress.
- A cascaded system is really hard to beat for tasks that only require semantic understanding.
- Qwen2-Audio seems to be leading in OSS speech understanding.
- There is no consistent way to measure generation quality (and most models have not focused on generation at all). Kyutai used MOSNet, but I think that this only tells a small part of the story.
In defense of GSM8K, or a rebuttal to Apple’s paper claiming that LLMs cannot reason
Some Apple researchers recently published a simple paper showing that performance on GSM8K regresses when they make minor changes to the benchmark (their perturbed version is called GSM-Symbolic). Their work was widely reported on with headlines like “Apple study exposes deep cracks in LLMs’ reasoning capabilities”, which is somewhat hyperbolic but generally aligned with the paper’s conclusion section.
I don’t think either is justified. To briefly summarize, they ran three experiments (a toy sketch of the first perturbation follows this list):
- Replaced templated entities (names, numbers, etc.) in GSM8K with randomly selected alternatives
- Increased the difficulty of the benchmark by adding constraints that require additional reasoning steps
- Added extraneous information to see if the models ignored it
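As a toy illustration of that first perturbation: take a templated GSM8K-style problem and resample the names and numbers while the underlying reasoning stays fixed. The template and value ranges below are made up for illustration; GSM-Symbolic’s actual templates are more carefully constructed.

```python
# Toy version of the GSM-Symbolic perturbation: resample the surface
# details of a templated problem while keeping the reasoning identical.
# Template and value ranges are illustrative, not from the actual benchmark.
import random

TEMPLATE = ("{name} buys {n_bags} bags of apples with {per_bag} apples in each bag. "
            "{name} then eats {eaten} apples. How many apples are left?")

NAMES = ["Sophie", "Liam", "Ava", "Noah"]

def sample_problem(rng: random.Random):
    name = rng.choice(NAMES)
    n_bags = rng.randint(2, 9)
    per_bag = rng.randint(3, 12)
    eaten = rng.randint(1, n_bags * per_bag - 1)  # keep the answer positive
    question = TEMPLATE.format(name=name, n_bags=n_bags, per_bag=per_bag, eaten=eaten)
    answer = n_bags * per_bag - eaten
    return question, answer

rng = random.Random(0)
print(sample_problem(rng))  # same reasoning chain, different surface form on each draw
```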
The results of the first experiment were similar to those previously released by Scale.ai on their private GSM1K eval: they show that all models regress to some extent (although they do not benchmark Anthropic models, which Scale showed actually improved on their held-out benchmark), but the delta for leading models is within their confidence intervals. However, even the most overfit model, Mistral 7B, achieves 48.0% on GSM8K and 41.1% +/- 3.7% on GSM-Symbolic. While this is a large relative change (roughly a 14% relative drop, and consistent with Scale’s findings that the Mistral models were most overfit), scoring 41.1% still demonstrates a strong ability to reason. Perhaps there was some contamination in the training sets or methodology (Scale attempts to quantify this), but that should be the conclusion rather than that LLMs cannot reason. And further, if human reasoners had previously seen a problem with a particular set of names and numbers, I suspect their performance would suffer if the names were switched as well.
Their second experiment adds reasoning steps. Not surprisingly, most models perform worse as difficulty increases. The authors argue that because accuracy drops at an increasing rate as difficulty increases, the models are not reasoning their way to an answer. I’m not sure I totally follow this argument (is human performance on a task linear in the number of reasoning steps? I doubt it; I suspect human reasoners suffer from the same effect), and the reported distributions are quite wide, making conclusions difficult.
They (and the media) spend the most time on their third experiment. When seemingly relevant but extraneous information is added, all models regress significantly (“catastrophically” in their language), with drops ranging from 17.5% for o1-preview to 65.7% for Phi 3 mini. Because this additional information should not impact reasoning, the authors argue that this experiment “exposes a critical flaw in LLMs’ ability to genuinely understand mathematical concepts”. I think this is misguided. It is well known that human reasoners get distracted by extraneous information (one random reference). Parsing information and determining what is relevant to the problem at hand is itself a hard reasoning problem!
The conclusions of this paper come down to how you define reasoning versus pattern matching. I would argue that the authors’ data show that LLMs reason just like humans, making similar mistakes and succumbing to similar pitfalls. Maybe we are all just pattern matchers.
Mo’ tokens, mo’ reasoning: Thought Preference Optimization
FAIR has been on a roll publishing simple papers with big impact. This week, they showed that they could lift performance on AlpacaEval and Arena-Hard by creating machine-selected DPO examples, generated by prompting the model to split its response between thinking and responding. After a few iterations, they significantly outperform the DPO baseline. Interestingly, this outperformance is not limited to topics that seemingly require more reasoning, like “math and calculations” and “reasoning and problem-solving”, but also shows up in “marketing and sales” and “personal development”. Does this mean that, quantitatively, marketing and sales is on par with math for reasoning? No more enterprise sales drone jokes!
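Roughly, the recipe looks like the sketch below: sample several “think, then answer” generations, judge only the user-visible response, and keep the best/worst pair for DPO. The prompt wording, delimiters, and the `generate`/`judge` callables are stand-ins, not FAIR’s exact setup.

```python
# Rough sketch of the thought/response framing behind Thought Preference
# Optimization: sample several "think, then answer" generations, judge only
# the visible response, and keep best/worst pairs for DPO.
THOUGHT_PROMPT = (
    "Respond to the user's query. First write your internal thoughts after "
    "'Thought:', then write the reply the user will see after 'Response:'.\n\n"
    "Query: {query}"
)

def split_thought(completion: str) -> tuple[str, str]:
    thought, _, response = completion.partition("Response:")
    return thought.removeprefix("Thought:").strip(), response.strip()

def build_dpo_pair(query, generate, judge, n_samples=8):
    samples = [generate(THOUGHT_PROMPT.format(query=query)) for _ in range(n_samples)]
    # Judge only the user-visible response; the thought tokens are free "scratch space".
    scored = sorted(samples, key=lambda s: judge(query, split_thought(s)[1]))
    return {"prompt": query, "rejected": scored[0], "chosen": scored[-1]}
```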
When I was working on Gemini in Google Cloud, I would occasionally hop on sales calls to give our clients advice on how to make the models work better. My response was almost always: “figure out how to get the model to use more tokens to do the same task”. My mental model was that each token contained a quantum of reasoning, and at a high level all the different techniques out there, e.g. chain of thought, LLM-as-a-judge at inference time, etc., were just different ways to increase tokens per response. This paper (and I suppose o1-preview and all the work on inference-time compute these days) is a great validation of that general thesis.
Early fusion speech model from HomeBrew
The HomeBrew team has been working on teaching Llama to listen for a while now. They’ve played around with adapters but recently shared Ichigo, their model trained with native speech tokens (note: in LLM-speak, early fusion refers to bringing additional modalities into training early as native tokens, not to merging them at a lower layer of the neural network).
They demonstrated many of the benefits of early fusion speech models, including super low latency (111ms TTFT) and the ability to understand interleaved speech and text. They tokenize using discrete WhisperVQ units, train on 16k hours of multilingual speech derived mostly from ASR datasets, and use WhisperSpeech to generate 1.3M speech-instruction/text-answer pairs of post-training data. Interestingly, it appears that the authors didn’t make any effort to align the speech and text tokens, but the model’s responses clearly demonstrate that some alignment did happen, e.g. generating code from a spoken prompt, even though they did see fairly significant regressions on text benchmarks.
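Conceptually, the early fusion recipe is: quantize the audio into discrete units, map each unit to a new entry in the text tokenizer’s vocabulary, and train on interleaved sequences. Below is a sketch of that sequence construction, where `whisper_vq_encode`, the special tokens, and the assumption of a HuggingFace-style tokenizer (with the speech tokens already added to its vocabulary) are all illustrative stand-ins rather than Ichigo’s actual layout.

```python
# Sketch of early-fusion sequence construction: discrete speech units become
# new vocabulary entries interleaved with text tokens. The encoder function,
# special tokens, and token naming scheme are illustrative stand-ins.
def build_interleaved_sequence(audio, text, tokenizer, whisper_vq_encode,
                               sound_start="<|sound_start|>", sound_end="<|sound_end|>"):
    speech_units = whisper_vq_encode(audio)            # e.g. [312, 17, 909, ...] codebook indices
    speech_tokens = [f"<|sound_{u:04d}|>" for u in speech_units]  # one new vocab token per unit
    # Wrap the speech span in markers and append the text answer, so the LM
    # learns to attend across both modalities within a single token sequence.
    pieces = [sound_start, *speech_tokens, sound_end]
    return tokenizer.convert_tokens_to_ids(pieces) + tokenizer.encode(text, add_special_tokens=False)
```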
They outperform Qwen2-Audio on two LLM-as-a-judge spoken QA benchmarks (openhermes_instruction_test and alpaca_audio_test) and roughly split results with an equivalent cascaded system, while cutting latency by about 4x.
It’s great to see more strong audio LLMs and experimentation with different approaches. I’m looking forward to seeing HomeBrew continue to refine their methodology and (hopefully) add generation soon.