Dan’s Weekly AI Speech and Language Scoop #28

This episode was written inside the Eagle County Courthouse while I was awaiting jury selection. Something about witnessing the grinding inefficiencies of our justice system inspires me to get even more excited about AI.

You just won a chance to chat with John Cena or my slightly biased review of Meta AI voice

Perhaps this is my Marques Brownlee moment. After I spent the last six months reviewing various audio LLMs and voice products, our own has finally leaped into the wild, and I think it is pretty good! After a few months of dogfooding and a day playing with the production product, I want to share my slightly biased but candid review of Meta AI voice.

First, I want to explicitly share how I think about the value of audio experiences. Whether they be music, podcasts, voice assistants, conversations, or anything else, I grade these experiences on a continuum from educational to entertaining. Pop music, All-In, Kyutai Moshi, banter with college bros: pure entertainment. Latent Space, Yann LeCun’s YouTube lectures, OMSCS courses: pure education. Ezra Klein, Odd Lots, debates with activist friends: somewhere in between. For an audio experience to be compelling, it needs to sit on the efficient entertainment/education frontier.

As I mentioned last week, I don’t think that any voice products have really nailed entertainment yet. I can’t quite put my finger on why, but none of these models are that fun to talk to. Unfortunately, none are that educational either: until now, no other conversational voice experience has had access to real-time information. Meta AI sets that straight, and now I can have a moderately entertaining but highly educational conversation about any topic because answers are grounded in search and other internal knowledge bases. We aren’t at that frontier yet, but this alone makes Meta AI the most compelling voice assistant for me. In addition, I am most impressed by Meta AI’s ability to handle interruptions and explain concepts concisely. I also prefer the default voice, but that is just personal preference.

As far as the negatives, OpenAI’s Advanced Voice Mode is more instructable in terms of speech generation, it speaks more languages (Meta AI is largely monolingual for now), and it seems to have better paralinguistic understanding, but none of that makes up for the fact that Advanced Voice Mode can’t have basic, grounded conversations about the world with me. I’m sure that OpenAI will add this functionality soon, but right now it is a deal breaker for me.

Play.ai previews the state of business agents (and really easy, fun voice cloning)

I recently stumbled upon Play.ai. They are building a voice interface toolbox for developers and businesses. Nothing about their technology is particularly noteworthy (I believe that they use a simple cascaded system based on Llama), but they really nail the guardrails. Their demo is a sales agent for Play.ai, and I was super impressed by how constrained its responses were. Before that experience, I would not have had high confidence that an AI could handle open-ended call center queries by voice, but their agent stayed totally on script, and I feel it could be a drop-in replacement for an L1 agent if they fixed a bunch of the polish issues (latency, quality, etc.).

Their voice cloning is also really fun. I know that there are a ton of these companies now, but I haven’t actually tried that many of them. With Play.ai, I recorded a 30-second clip of my voice and built an embodied agent. It was eerie; that was the closest I’ve ever gotten to having a conversation with myself. Give it a try if you want to be weirded out. It only takes 5 minutes.

DIY Audio LLM post-training with Llama-Omni

The Llama-Omni paper reminds me of the early days of instruction-tuning text LLMs. It seems like there is a huge collection of small teams just trying stuff out in the wild (and teaching me the names of every member of the Lama genus; yes, one “l”; llama is the species and Lama is the genus).

The Llama-Omni team showed that they could achieve nice spoken QA and instruction-following performance by bolting a speech encoder (Whisper) onto a Llama 3.1 backbone modified to decode speech and text simultaneously. Most novel to me is how they approached SFT-only post-training. They started with publicly available text instruction-tuning datasets, prompted LLMs to rewrite the responses to be more appropriate for voice (this is very important; we consume information quite differently visually and aurally), and generated audio with TTS. It’s that simple. I hope that this paper inspires more audio hackers and excitement in the space.
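To make the recipe concrete, here is a minimal sketch of that data-construction step as I read it, not the authors’ actual code. The helpers rewrite_for_voice and synthesize_speech are hypothetical placeholders where you would call your LLM of choice and a TTS system, and the JSONL field names are made up for illustration.

```python
# Sketch of a Llama-Omni-style spoken SFT data pipeline (my reading, not the paper's code).
import json
from pathlib import Path

REWRITE_PROMPT = (
    "Rewrite the following answer so it sounds natural when spoken aloud: "
    "short sentences, no markdown or lists, numbers spelled out.\n\n{answer}"
)

def rewrite_for_voice(answer: str) -> str:
    """Hypothetical: send REWRITE_PROMPT.format(answer=answer) to an LLM."""
    return answer  # identity stand-in so the sketch runs end to end

def synthesize_speech(text: str, out_path: Path) -> None:
    """Hypothetical: call a TTS model and write a wav file to out_path."""
    out_path.write_bytes(b"")  # stand-in for real audio

def build_spoken_sft_set(text_sft_path: Path, out_dir: Path) -> None:
    """Turn a text instruction-tuning JSONL file into a spoken QA SFT set."""
    out_dir.mkdir(parents=True, exist_ok=True)
    records = []
    with text_sft_path.open() as f:
        for i, line in enumerate(f):
            ex = json.loads(line)  # assumed fields: "instruction", "response"
            spoken = rewrite_for_voice(ex["response"])        # ear-friendly rewrite
            prompt_wav = out_dir / f"prompt_{i}.wav"
            answer_wav = out_dir / f"answer_{i}.wav"
            synthesize_speech(ex["instruction"], prompt_wav)  # spoken question (encoder input)
            synthesize_speech(spoken, answer_wav)             # spoken answer (decoder target)
            records.append({
                "instruction_audio": str(prompt_wav),
                "response_text": spoken,
                "response_audio": str(answer_wav),
            })
    (out_dir / "spoken_sft.jsonl").write_text(
        "\n".join(json.dumps(r) for r in records) + "\n"
    )

if __name__ == "__main__":
    build_spoken_sft_set(Path("text_sft.jsonl"), Path("spoken_sft"))
```

Point it at any text instruction-tuning set in JSONL form and you get paired spoken questions, voice-friendly text answers, and spoken answer targets to fine-tune against.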

Pressure is on in the OSS LLM race. Is quality enough to win?

The OSS LLM ecosystem has really coalesced around Llama despite strong releases from DeepSeek, Google, and Qwen. However, Alibaba’s Qwen team released a suite of models this week that could challenge that dominance. Their Qwen 2.5 72B Instruct model outperforms Llama 70B across all measured academic benchmarks and matches or beats 405B on many. I don’t think that they are overfitting, because the vibes from the community seem quite positive as well, but it would be nice to see how they perform on the private Scale leaderboards or in the Chatbot Arena.

That said, I think that Llama offers more than high-quality models. The ecosystem is starting to develop some stickiness: there are Llama Guard, Llama Stack, integration guides, community forums, how-tos, hyperscaler support, and more. If Qwen had dropped this model two years ago, I think that they would have immediately driven significant adoption, but now I believe that we are learning that quality is not all you need. We’ll see if I am right and Alibaba starts matching Meta’s ecosystem investments.

A taxonomy of multimodal LLM architectures

Hung-yi Lee introduced me to “The Evolution of Multimodal Model Architectures” during his phenomenal Interspeech talk on audio LLMs (or spoken LLMs, as he calls them). We suffer from a vocabulary crisis when describing multimodal LLMs, and this paper proposes a taxonomy that separates the different approaches. If you are wondering how these models work at a high level, this is a great place to start. Briefly (with a rough code sketch after the list):

  • Type A and B models are “easy”. They encode additional modalities (image/video/audio) and fuse the resulting features into the language model through cross-attention layers. This lets the models understand these modalities and respond to them in text on a per-turn basis; interleaved understanding is not possible.
  • Type C models encode the additional modalities as embeddings that are fed into the model alongside the text tokens. This lets the model understand interleaved text and media (and multiple modalities at the same time).
  • Type D models encode all modalities as tokens, enabling both understanding and generation of interleaved text, audio, and images. This is the hard mode but seems most promising to me.
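To make the distinctions concrete, here is a schematic PyTorch sketch of the three input paths. The dimensions, module choices, and vocab sizes are illustrative assumptions of mine, not taken from the paper.

```python
# Schematic contrast of the three fusion styles above (illustrative shapes only).
import torch
import torch.nn as nn

d = 512
text = torch.randn(1, 16, d)   # 16 text-token embeddings
audio = torch.randn(1, 40, d)  # 40 audio-encoder frames, already projected to d

# Type A/B: text hidden states cross-attend to the media features inside the LM,
# so media is consumed per turn but never interleaved with the text stream.
cross_attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
fused, _ = cross_attn(query=text, key=audio, value=audio)

# Type C: media embeddings are spliced directly into the input sequence next to
# the text embeddings, so self-attention sees one interleaved stream.
interleaved = torch.cat([text[:, :8], audio, text[:, 8:]], dim=1)

# Type D: every modality is discretized into tokens (e.g., via a codec/VQ model),
# so the same next-token head can generate audio or image tokens, not just read them.
audio_tokens = torch.randint(0, 1024, (1, 40))   # codec token ids
text_tokens = torch.randint(0, 32_000, (1, 16))  # text token ids
# (a real model offsets the vocab ranges so the id spaces don't collide)
```

The punchline is the last block: once everything lives in one token space, generating interleaved audio and images falls out of the same next-token objective, which is why Type D looks the most promising to me despite being the hardest to train.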

Have a hard question? Submit it to Scale.ai and earn your chance to win $5k.

Scale.ai and the Center for AI Safety partnered to build “Humanity’s Last Exam”. This is a neat idea. They want to collect the world’s hardest questions from all domains and assemble the canonical held-out LLM eval. If I get some time this weekend, I’ll try to write up a few obscure ones related to my PhD thesis. My bet is that existing reasoning models will do pretty well out of the gate on this eval and that it will saturate as quickly as GPQA did. We will see.