The Sesame heard round the world
If you work in audio and you didn’t hear about Sesame last week, you might want a quick career check. This team, founded by former Meta Reality Labs leaders, debuted their “Conversational Speech Model”, the most expressive and natural audio LLM I have experienced. While I don’t know that I want Sesame in a product (if you ask it a simple question like, “What is the capital of Romania?” it answers something like, “Well, wow. That is so interesting. A geography question, huh? I think…”), the range is impressive and I am confident that they can adjust to land the experience they want.
As far as I can tell from their blog post, they built a half-duplex Moshi optimized for highly expressive speech. I’m sure that there are some interesting technical details and approaches to evals withheld, but it seems like the biggest innovation here is the opinionated product. Instead of threading the needle between helpful assistant and entertaining companion, they turned the knob to 11 on the latter.
In addition to the demo hosted on their website, they OSS’ed a conversational TTS model that competes with the best of them for voice cloning and enables a totally new type of speech generation. Beware, though: their HuggingFace demo is very impressive, but there have been many reports of difficulty reproducing its results. I haven’t tried myself, but I suspect there may be a bit of a Fish-like sleight of hand involved in their OSS strategy.
Moshi straight from the horse’s mouth with Alex Défossez and shout out to Pooneh Mousavi and her ConvAI reading group
I’ve been impressed with Kyutai’s Moshi full-duplex audio LLM since the beginning, but I never took the time to really understand some of the novel bits in their paper, particularly around the Mimi tokenizer and how the depth and temporal transformers interact. I was delighted to see that Alex was hosting a deep dive with Pooneh Mousavi’s ConvAI reading group. His presentation totally clarified their approach, and I am grateful to Pooneh for hosting these. I’ll try to catch all of these going forward.
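For my own notes, here is roughly how I now understand the split: a large temporal transformer runs causally over frames, and a small depth transformer autoregressively predicts the Mimi codebooks within each frame. The sketch below is my own toy reconstruction of that RQ-Transformer-style idea (shapes, sizes, and names are mine, not Kyutai’s), and it ignores the text stream and acoustic delay that Moshi layers on top.

```python
import torch
import torch.nn as nn

class ToyMoshiDecoder(nn.Module):
    """Toy RQ-Transformer-style decoder: a big temporal model over frames,
    a small depth model over the K codebooks inside each frame."""
    def __init__(self, n_codebooks=8, vocab=2048, d_model=256):
        super().__init__()
        self.vocab, self.n_codebooks = vocab, n_codebooks
        # One embedding table per codebook, flattened into a single table via offsets.
        self.embed = nn.Embedding(vocab * n_codebooks, d_model)
        layer = lambda: nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer(), num_layers=4)  # causal over time
        self.depth = nn.TransformerEncoder(layer(), num_layers=2)     # causal over codebooks
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(n_codebooks))

    def forward(self, codes):  # codes: (batch, time, n_codebooks) Mimi-style token ids
        B, T, K = codes.shape
        offsets = torch.arange(K, device=codes.device) * self.vocab
        tok = self.embed(codes + offsets)                   # (B, T, K, d)
        frame = tok.sum(dim=2)                              # summed per-frame embedding
        t_mask = nn.Transformer.generate_square_subsequent_mask(T).to(codes.device)
        ctx = self.temporal(frame, mask=t_mask)             # (B, T, d) temporal context
        # Depth input per frame: [context, codebook_0, ..., codebook_{K-2}], so that
        # position k predicts codebook k given the context and the earlier codebooks.
        depth_in = torch.cat([ctx.unsqueeze(2), tok[:, :, :-1]], dim=2).reshape(B * T, K, -1)
        d_mask = nn.Transformer.generate_square_subsequent_mask(K).to(codes.device)
        h = self.depth(depth_in, mask=d_mask).reshape(B, T, K, -1)
        return [head(h[:, :, k]) for k, head in enumerate(self.heads)]  # K logit tensors
```

The payoff of the split is that the expensive model runs once per frame rather than once per codebook token, which is what makes real-time generation over many codebooks tractable.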
Markets are inefficient: Cartesia just raised $64M at >$600M for yet another good TTS system
The same month that Sesame, Canopy Labs, and Nanyang Technological University OSS’ed CSM, Orpheus, and Seed-VC, three very strong TTS models, Cartesia raised a giant round to expand their business. While Cartesia’s voices are high quality, I don’t think they are differentiated from the now dozen or so strong small OSS projects, and their voice cloning capabilities lag behind the new entrants, e.g. Fish, Sesame, and Orpheus.
They claim they are the fastest system, perhaps due to leaning on SSMs, but I’m not totally sure why that matters in TTS. The Groq CTO recently asked the same question on Threads and got crickets in response. I think that less-than-real-time generation is adequate for most applications, and I haven’t seen many applications yet that require super-fast batch processing (and if I did, would it be cheaper to scale machines or to build new model architectures?).
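To make the speed question concrete, the framing I use is real-time factor plus time-to-first-audio; the numbers below are made up, purely for illustration.

```python
# Hypothetical numbers: the usual way to talk about TTS speed is the real-time factor
# (RTF = seconds of compute per second of audio) plus time-to-first-audio.
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    return synthesis_seconds / audio_seconds

fast, faster = rtf(1.5, 10.0), rtf(0.5, 10.0)   # two imaginary systems: RTF 0.15 vs 0.05
print(fast, faster)  # once audio streams chunk by chunk, a listener can't tell these apart;
                     # the gap only shows up in first-chunk latency or huge batch jobs
```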
I’ll be interested to see how this plays out. Maybe this is an implicit bet on SSMs as a new architecture, with TTS just the first application? That is the best way I can rationalize this investment.
Grok 3’s voice afterthought
I spent some time with Grok 3’s voice modes. While some of them made for great shareable clips (who doesn’t chuckle a little bit hearing an AI curse?), I wasn’t super impressed with the execution of their voice mode. Voice felt bolted onto text and many of the elements of Sesame that really impressed me were absent from Grok.
While the factuality and semantic steerability were excellent on this model, expressivity, conversationality, and acoustic steerability were poor; if not for a few audio-LLM-specific bugs, I would have guessed that they just bolted a mid TTS system onto the text outputs.
That said, I do think that there is something interesting from a product perspective in the voice modes. In theory, a perfect conversational AI would be able to intuit what the user wants, but in practice this is very hard. With Grok 3, the model doesn’t have to guess whether you want NSFW romance or homework help. You can just tell it directly.
Gemma 3 makes hacking great again
DeepSeek R1 is amazing at 671B params! But you might have to sell a kidney to figure out how to run and fine-tune it. Gemma 3 brings OSS models back to reality with very strong 1B, 4B, 12B, and 27B dense models. I haven’t played with these much yet, but my initial reaction is positive and their benchmark numbers are very impressive.
They released both pretrained and instruction-tuned variants of all sizes along with pretty robust documentation and developer tools. I’m looking forward to seeing what the anime-girl pp crowd on Twitter builds with these.
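If you want to kick the tires the way I plan to, something like the following should be all it takes with Hugging Face transformers. The model id and chat-style pipeline call are my assumptions; the checkpoints are gated, so you also need to accept Google’s license and have a HF token configured.

```python
from transformers import pipeline

# Assumed checkpoint name for the smallest instruction-tuned variant; the larger
# 4B/12B/27B checkpoints follow the same naming pattern but also take images.
chat = pipeline("text-generation", model="google/gemma-3-1b-it", device_map="auto")
messages = [{"role": "user", "content": "Give me three weekend hack ideas for a 1B model."}]
print(chat(messages, max_new_tokens=128)[0]["generated_text"])
```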
Teardown of Phi-4 Multimodal technical report
Hours after I published the last speech scoop, Microsoft released the Phi-4 Multimodal technical report. While this wasn’t a dialogue model, they achieve very strong numbers on a bunch of audio understanding tasks (although I do think that some of their evals are cooked; after playing with the demo, I have a very hard time believing that they best Gemini and GPT-4o on a bunch of audio QA tasks) and took a novel approach that I found worth covering.
Instead of training a model from some checkpoint with audio tokens or bolting on an adapter, they dynamically load LoRAs for different token types. This let them maintain the performance of the pre-trained text model while adding new modalities and deliver an e2e-type approach with minimal additional training. Serving a model that loads and unloads all of these LoRAs seems hard, and I’m not sure that I buy the merits of the approach given the complexity, but time will tell if something like this wins out.
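I have no idea what Microsoft’s serving stack actually looks like, but the core trick is easy to sketch. Below is my own hand-rolled toy version of modality-keyed LoRAs (names, ranks, and shapes are mine, not Phi-4’s): a frozen base projection is shared, and a small low-rank delta is swapped in depending on the modality of the incoming tokens.

```python
import torch
import torch.nn as nn

class ModalityLoRALinear(nn.Module):
    """Toy 'one LoRA per modality' wrapper around a frozen linear layer."""
    def __init__(self, base: nn.Linear, modalities=("text", "audio", "vision"), rank=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # the pre-trained text weights stay frozen
        self.A = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(rank, base.in_features)) for m in modalities})
        self.B = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(base.out_features, rank)) for m in modalities})
        for m in modalities:
            nn.init.normal_(self.A[m], std=0.02)  # B starts at zero, so each delta starts as a no-op
        self.active = "text"

    def set_modality(self, modality: str):
        self.active = modality               # "loading" an adapter here is just a pointer swap

    def forward(self, x):
        delta = (x @ self.A[self.active].T) @ self.B[self.active].T
        return self.base(x) + delta

# Flip adapters per request depending on what kind of tokens are flowing through.
layer = ModalityLoRALinear(nn.Linear(512, 512))
layer.set_modality("audio")
out = layer(torch.randn(2, 8, 512))          # (batch, seq, hidden)
```

In this form the appeal is obvious: the text weights never move and switching modalities is cheap. The pain I’d expect shows up when a single batch mixes modalities or when many adapters have to stay resident on the GPU at once.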
Don’t sleep on Chinese models part VII: Baidu does audio with Ernie 4.5
I haven’t been able to test this because their consumer product is entirely in Chinese and doesn’t have an obvious voice entry point, but their launch announcement claims frontier performance across modalities, including audio support, with an OSS release promised soon. It’s not clear whether these are true native dialogue capabilities or just audio understanding along the lines of Gemini 1.5 and Qwen.
Measuring dialogue quality just got a little easier
Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics and Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities just dropped, proposing new methods to evaluate the conversationality of dialogue models. I haven’t read them in depth yet, but I think this is important work and expect some of these metrics to become industry standards.