Everyone does voice agents; where does the value accrue?

Voice agents are hot these days. A coworker driving through rural Georgia pinged me to let me know that a Bojangles drive-thru took orders with a voice agent. Who knew that a Southern fried chicken chain would be making investments in gen AI?

On the heels of their big raise last month, Cartesia announced a voice agent platform called Forethought without many details. Today, Amazon announced Nova Sonic with a few demos (no access for me yet). It’s not clear how these new products are different from those already on the market from Fixie, Play, ElevenLabs, OpenAI, or what Google will inevitably ship with Gemini. This is in addition to the 22% of F24 YC companies that are building voice agents.

I don’t doubt that there is a big opportunity here (after all, I work in the space), but this does seem like a flood of undifferentiated products. In my mind there are four fairly hard problems worth solving around voice agents:

Build excellent model (cascaded or e2e)
Focus on integration and orchestration
Market and sell your solution
Solve an actual business problem

Only a few well-resourced companies are tackling 1. Based on my experience with text models, I find it hard to believe that that is really a defensible edge. All frontier-ish text models are essentially good enough for the vast majority of use-cases these days and prices have absolutely crashed for a given level of intelligence.

Most groups seem to be focused on 2. While I recognize that integrating a voice agent with a customer system has its challenges, this seems like naively the easiest place to start, which is why we see the explosion of voice startups and, I think, Cartesia and Amazon’s recent launches. It’s hard for me to understand how to differentiate here.

I think that most of the value will accrue to 3 or 4. Most of the existing solutions are probably good enough for many simple applications like taking a drive-through order. Someone needs to do the hard work to get through the sales cycle to land these in production. Once the customer relationship is established, my sense is that this is pretty durable. It’s not clear to me if the group to do this is an ISV-type integrator or whether 1-3 will come from the same company. I think that every business model is in play.

That said, there is a world where the ultimate winner is the customer who actually solves a business problem. If Bojangles can deliver better burgers for cheaper by having AI take orders and they can be agnostic to the provider, maybe they or their customers who presumably enjoy lower prices are the ultimate winners and they just leave a bunch of failed AI voice agent companies in their wake.

Google’s models crush; why don’t I use Gemini?

DeepMind has been on fire. Gemini 2.0 Flash is dirt cheap and capable. Gemini 2.5 Pro is widely considered the best model out there (I personally wouldn’t place it at #1, but it’s up there). If models are indeed the product in GenAI, which many assert, why don’t I use Gemini/AI Studio?

This is a bit polemical because I do use Gemini/AI Studio occasionally, but Google’s products are far from my daily driver. When I sit down to use an AI model, I almost always go to OpenAI first. Part of this is familiarity, but part of it is that the product is better.

OpenAI’s web interface is a clean white design that makes it easy to concentrate on the problem at hand. AI Studio defaults to a white-on-black typography with dozens of options (all of which I understand, but I think it’s unlikely that an average user does). You might argue that Gemini the web interface is the right comparison to ChatGPT.com, but this simpler product suffers from many of the same design issues (I am not a design guy, but Gemini and AI Studio make my eyes bleed compared to ChatGPT.com) and layers on a stack of refusals so thick that it makes the product impossible to use.

For example, if I prompt Gemini and AI Studio with “What is chrM?” (2.5 Pro Experimental), Gemini responds, “I’m a text-based AI and can’t assist with that.” and AI Studio, “chrM is a common abbreviation used in genomics and bioinformatics to refer to the mitochondrial chromosome…” These kinds of soft or hard refusals and layered into the Gemini product make it almost impossible to use as a daily driver.

I think Google has finally recognized that excellent models are necessary but not sufficient for excellent consumer AI products and elevated Josh Woodward, the inspiration behind starting this newsletter, to lead the Gemini app. Best of luck to Josh. My $20/month I pay OpenAI is up for grabs, so we’ll see if Josh earns it with a rebooted Gemini Advanced.

BREAKING: Josh and Google actually earned my $20 between when I wrote this and when I hit send. Offering Deep Research with Gemini 2.5 to Advanced customers is interesting enough for me to shell out. I was quite impressed with my first queries and believe that they clearly surpass OAI Deep Research. Will follow up with details in the next edition.

A brief note on Gemini 2.5 Pro

I recently completed a two-week long vibe coding project where I needed to come up to speed on a largely new domain (genetic sequencing) and write a bunch of code to process raw reads, understand variants, and investigate their effect. This project required essentially every type of AI model: deep research, reasoning, simple QA, coding/debugging, search, etc. I started right when Gemini 2.5 Pro was released and fully intended to use it, but for whatever reason (I spent my time on the project not thinking carefully about why one model was better than the other), the OpenAI answers were just better.

This could be my familiarity with OpenAI’s models and comfort in switching between modes (my understanding is that Gemini 2.5 Pro is supposed to do it all), but I just couldn’t get the same quality out of Google. I saw this both on deep research-style tasks and code, where o1 could one-shot generate the analysis I needed and Gemini often needed multiple steps to debug.

I was quite surprised by this experience given the Gemini 2.5 leaderboard numbers and response on social media, so I’m reflecting on whether the issue is actually what I raised about and I gave OpenAI more of a chance because the product itself is designed better. We can sometimes be blinded by raw truth when idiosyncratic aspects of the experience don’t land.

Omni(ish) Audio LLMs come to OSS

Both Qwen and Moshi released OSS speech + image models since I last published.

MoshiVis uses a simpler Type-B approach, where they inject the image embedding directly into the attention layer. Their demo is the same old Moshi plus an image as context. I had fun chatting with the model but found its ability to both hold a deeper conversation and understand the nuances of various images was missing, but I think that Kyutai wanted a proof of concept rather than a product.

Qwen2.5-Omni-7B is a bit more interesting. This is a true Type-D model (tokenized speech, text, and images) that you can chat with on their hosted demo. I found the overall experience fairly unimpressive (I am 100% sure that they cooked their metrics showing human parity for speech generation), but the model was able to carry on a conversation without any issues and recognize all the common objects I flashed in front of my webcam. The fact that they achieved this level of reliability in a 7B model they OSS’ed is quite cool. I’m looking forward to seeing how good this gets with scale.

OpenAI.fm and yet another SOTA ASR system

OpenAI released a next generation of audio models and an incredibly meme-worthy website, OpenAI.fm, to accompany.

The most interesting part of this release is that it appears they distilled standalone ASR and TTS models from GPT-4o. We’ve seen a lot of Llama-based TTS models recently, so this isn’t totally surprising, but I find it quite elegant that OpenAI was able to really pull it all together.

And a side benefit to this approach is that they can guide their TTS model with a text prompt, enabling the fun demos they show on their website.

That said, from a performance standpoint, I wasn’t blown away. There are so many great TTS models out there like this and theirs did not feel differentiated. Their ASR model claims SOTA, but I don’t even know what that means anymore. Their semantic VAD model is quite clever, but I didn’t practically notice a difference in my everyday use of AVM.

Tragically, Meta actually shipped a similar demo years ago but never productionized it. OpenAI knows how to ship (and market; OpenAI.fm is genius).

Daniel D. McKinnon

Musings on adventuring in the modern era and tinkering with technological curiosities

Category Archives: Speech and Language Scoop

Dan’s Weekly AI Speech and Language Scoop #42

Everyone does voice agents; where does the value accrue?

Google’s models crush; why don’t I use Gemini?

A brief note on Gemini 2.5 Pro

Omni(ish) Audio LLMs come to OSS

OpenAI.fm and yet another SOTA ASR system