Dan’s Weekly AI Speech and Language Scoop #38

New SOTA in ASR; does anyone care?

A small team of researchers at SandLogic just released a SOTA ASR model based on the Samba state-space architecture we covered last summer. The architecture’s creator said it really excelled on continuous signals like speech, so perhaps this is another great proof point.

They dropped WER on LibriSpeech to 1.17% on test-clean and 2.48% on test-other, just 0.17 pp below the previous SOTA on the former and 0.01 pp on the latter. I don’t know off the top of my head how many words are in the LS test sets, but this likely amounts to flipping a handful of words from incorrect to correct. These error rates have long been significantly below single-pass inter-rater agreement on these datasets, which means these models are superhuman on this type of task.
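For a rough sense of scale, here is a back-of-the-envelope sketch; the ~50k-word test-set size is my assumption from memory, not a figure from the paper:

```python
# Back-of-the-envelope: how many word flips does each WER improvement imply?
# Assumes each LibriSpeech test set has roughly 50k reference words
# (an approximation, not an exact count).
ASSUMED_TEST_SET_WORDS = 50_000

for split, delta_pp in [("test-clean", 0.17), ("test-other", 0.01)]:
    flipped = ASSUMED_TEST_SET_WORDS * delta_pp / 100
    print(f"{split}: ~{flipped:.0f} words flipped from incorrect to correct")

# test-clean: ~85 words flipped from incorrect to correct
# test-other: ~5 words flipped from incorrect to correct
```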

My probably-not-too-provocative question: is this style of speech recognition solved? Does anyone care that we can make marginal progress on these clean, English benchmarks? Whisper gained incredible adoption by working pretty well for all languages out of the box despite being far from SOTA. My prediction is that the future of ASR is pushing performance on multilingual speech, hard entities, and challenging audio conditions and accents rather than on academic benchmarks.

The return of the AI hobbyists; who needs Stargate?

There was a brief golden age in AI research a few years ago when the most cracked model was as likely to come from an anon on Twitter with an anime girl pfp as from a top lab. The big dawgs had laid the groundwork for instruction tuning but were weighed down by intellectual property, internal process, and AI doomerism. All one needed to win was a table of prompts and responses, a 4090, and a basic understanding of low-rank matrix decomposition. But times have changed. The strongest “amateur” model is sitting at #42 in the Chatbot Arena, thanks to three students at UVA and Princeton still fighting the good fight.

This all changed this week (credit to AI News for breaking the story). Berkeley’s Sky Computing lab did the simplest possible thing to achieve parity with o1-preview across many hard evals: generate 17k SFT examples from QwQ (Alibaba’s reasoning model; weaker than DeepSeek R1), rewrite them with GPT-4o mini, rejection sample, and train Qwen 2.5 on the result, at a total cost of $450. This is an insane result. If it holds (and I think it will, given that this is the general approach DeepSeek took with their distilled dense models, and another lab has already reproduced and surpassed it), it means that replicating a model after release is millions of times cheaper and easier than training the original.
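For context, the recipe reduces to a very short data pipeline. Here is a rough sketch of it; the callables are placeholders for the steps described above, not APIs from the Sky Computing release:

```python
from typing import Callable, Iterable

def build_sft_dataset(
    problems: Iterable[str],
    teacher: Callable[[str], str],         # e.g. QwQ generating a reasoning trace
    rewriter: Callable[[str], str],        # e.g. GPT-4o mini normalizing the formatting
    verifier: Callable[[str, str], bool],  # rejection sampling: keep only correct traces
) -> list[dict]:
    """Sketch of the distill-then-SFT data pipeline described above."""
    dataset = []
    for problem in problems:
        trace = rewriter(teacher(problem))
        if verifier(problem, trace):
            dataset.append({"prompt": problem, "response": trace})
    return dataset

# The resulting ~17k examples are then used for plain supervised fine-tuning
# of Qwen 2.5, which is what keeps the total cost in the $450 range.
```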

Why bother spending $500B on Stargate when distilling from whatever emerges from it can be done for $500?

The one question not answered here, and it’s a really relevant one, is whether this works across multiple stages. If OpenAI trains a $500B model with Stargate and releases a distilled version, can the community create something equivalent from that? Or do they genuinely need access to the big model? Regardless, this is a big development in OSS LLMs, and I’m really looking forward to anime anons starting to build again.

The true coding benchmark: write a Gemini 2.0 client

Gemini 2.0 debuted in preview mode a few weeks ago. Google’s API docs are an absolute mess, but it appears that while native (i.e. omni) audio generation is only available behind an allowlist (can someone at Google get me on that list?), cascaded audio generation is available now. Google provides a number of poorly documented examples, but the two relevant to me were turn-based text-in/speech-out and real-time speech-in/speech-out. I wanted turn-based speech-in/speech-out.

You would think this would be easy, but the send function in the turn-based example only accepted text, and the real-time example used a bunch of streaming machinery I didn’t really understand or feel like learning about. This is a perfect task for an LLM.

After digging a bit, I decided the best approach would be to swap out the streaming microphone function for one that streams a file from my local machine. For my prompt, I described the exact problem I was trying to solve and pasted in the complete real-time speech-in/speech-out client (only 300 LOC). Easy, right?
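To be concrete about the swap I had in mind, here is a minimal sketch of the file-streaming side only. The chunk size and pacing are assumptions meant to mimic a microphone capture loop, and how the chunks get handed to the Gemini session is deliberately left out, since that part comes from the real-time example:

```python
import time
import wave

def stream_wav_like_a_mic(path: str, chunk_ms: int = 20):
    """Yield raw PCM chunks from a local WAV file, paced roughly like a live
    microphone, so it can stand in for the mic-capture function."""
    with wave.open(path, "rb") as wav:
        frames_per_chunk = int(wav.getframerate() * chunk_ms / 1000)
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            yield data
            time.sleep(chunk_ms / 1000)  # mimic real-time pacing

# Usage: iterate these chunks wherever the mic callback used to fire and pass
# each one to the client's audio-send path (not shown; it varies by example).
```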

I started with Gemini (both 2.0 and preview models through AI Studio). Their latest releases crushed the Chatbot Arena, including the code section. Unfortunately, it initially gave me this answer: `The Gemini API, in its current form, does not directly support sending audio files as input. It primarily works with text and image inputs.` With a bit more massaging, e.g. removing mentions of Gemini from the prompt, I got the model to start generating code, but unfortunately it was all wrong. After 5 minutes of back and forth, I decided Gemini wasn’t up to the job.

I then moved to Claude, ChatGPT, and Meta AI with similar results. Using the exact same prompt, I never got a working solution within 5 minutes from any of them, despite quite a bit of back-and-forth, pasting errors, and so on. Before giving up, I tried o1. Within a turn or two, I got a perfect working demo and a lesson on how to properly encode my audio files. If you are interested in turn-based speech-in/speech-out with Gemini, feel free to use its results here.

Why did I share this? I haven’t been coding a lot recently, but it’s obvious that the coding evals are not reflective of real-world use. Gemini-Exp-1206 is neck and neck with o1-2024-12-17 on the leaderboards, but for this representative problem, they were nowhere close. I think that there are two things going on here:

  1. Labs are hacking leaderboards. We’ve discussed this at length before.
  2. We need new coding evals for weird use-cases. It would be great if there were some kind of public repository where I could drop every public coding question I’ve ever had alongside responses (good and bad) from models; a minimal record format is sketched below. Does this exist? Is someone working on this? I assume it would be quite valuable to top labs.
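To make that concrete, each entry in such a repository might look something like the record below. This is a purely hypothetical schema, not an existing project:

```python
from dataclasses import dataclass, field

@dataclass
class CodingEvalRecord:
    """Hypothetical schema for one crowd-sourced, real-world coding question."""
    question: str                                                   # the prompt as originally asked
    context: str = ""                                               # pasted code, docs, error messages
    model_responses: dict[str, str] = field(default_factory=dict)   # model -> answer
    worked: dict[str, bool] = field(default_factory=dict)           # model -> did it actually run?
    notes: str = ""

record = CodingEvalRecord(
    question="Write a turn-based speech-in/speech-out client for Gemini 2.0",
    context="<the ~300 LOC real-time example pasted here>",
    model_responses={"o1": "<working demo>", "gemini-exp-1206": "<non-working code>"},
    worked={"o1": True, "gemini-exp-1206": False},
)
```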

ByteDance makes a statement in voice

ByteDance has been dabbling in voice for a while now with some interesting OSS releases (SALMONN, LSLM, etc.). Alongside the larger Doubao launch, they just announced their omni model, which they show handily exceeding GPT-4o on their human side-by-side (SxS) eval. I don’t speak Chinese, but the expressiveness and naturalness are quite impressive, in addition to the quality of the answers. It’s not totally clear how an English speaker would get access to these models, but competition is heating up on both sides of the Pacific.

Full-duplex goes mainstream

Deedy Das, an early employee at Glean, has become quite influential on Twitter. I was pleasantly surprised to see him call out all three known full-duplex speech models. Appreciation for natural conversations is going mainstream!

Humanity’s last exam?

The Center for AI Safety and Scale AI announced a few months ago that they were partnering to create “Humanity’s Last Exam.” They have now released the results. Browsing through the questions, I don’t find them that hard. There may be some nuance I am missing, but I feel like I could sit down and solve many of the physics and chemistry questions (unlike FrontierMath, where I would have no idea where to start). I predict that this will get saturated by autumn.

Two nice surveys of speech LLMs

To be honest, I didn’t read either of these in detail but am dropping them here for future reference. This is a nice image tracking the evolution of various OSS speech projects, and this is a nice survey of audio LLMs, including the latest releases like SpiritLM and Moshi.