The price wars come for voice
GPT-4 prices have fallen 99% since launch, which probably hasn’t been great for the token sellers’ bottom lines, even accounting for advances in distillation and efficient inference. The root cause is quite likely the rise of open models like Llama, Qwen, and DeepSeek, which AI-inference providers can offer essentially at cost.
When OpenAI first introduced GPT-4o-realtime-preview, their accountants probably breathed a sigh of relief. They were the only game in town for high-quality speech-to-speech models (this isn’t totally true; a number of companies, including Fixie and Play, offer strong cascaded voice agents via API) and used that position to command a premium. Initial pricing was ~$15/hr, depending on the ratio of input to output tokens.
Fast forward mere weeks and Google announced Gemini 2.0, which is advertised to match OpenAI’s real-time API for voice performance (my initial review was not able to reproduce the fancy demos, but it’s possible the real thing isn’t live yet). While Google hasn’t announced pricing, OpenAI knows quite well that Google can finance whatever economics it wants with its colossal existing business, and history has shown that Google comes in low (I think they were as much a part of the race to the bottom on pricing as Llama was).
Google’s announcement triggered a response from OpenAI: a 60% price drop on GPT-4o-realtime and the introduction of GPT-4o-mini-realtime at a 90% discount from the original pricing.
Wow. What took over a year for GPT-4 took a month for voice. Consumers are the big winners here. I look forward to seeing what people build.
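To make the per-hour arithmetic concrete, here is a minimal sketch of how per-token audio pricing translates into an hourly cost. The token rates and tokens-per-minute densities below are my own illustrative assumptions, not official published numbers; they are chosen only so the blended cost lands in the same ballpark as the figures above.

```python
# Illustrative sketch: converting per-token audio pricing into $/hr.
# All rates and token densities are assumptions for the sake of the
# arithmetic, not official published numbers.

def hourly_cost(input_per_m: float, output_per_m: float,
                in_tok_per_min: float = 600, out_tok_per_min: float = 1200,
                talk_ratio: float = 0.5) -> float:
    """Blended $/hr for a conversation where the user speaks
    `talk_ratio` of the time and the model speaks the rest."""
    in_cost = input_per_m / 1e6 * in_tok_per_min * 60 * talk_ratio
    out_cost = output_per_m / 1e6 * out_tok_per_min * 60 * (1 - talk_ratio)
    return in_cost + out_cost

# Assumed launch-era rates of $100/1M audio input tokens, $200/1M output:
print(hourly_cost(100, 200))  # $9.00/hr at a 50/50 mix; ~$14/hr if the model talks most
# Same usage after a 60% cut on both rates:
print(hourly_cost(40, 80))    # $3.60/hr
# A hypothetical mini tier at 90% off the original rates:
print(hourly_cost(10, 20))    # $0.90/hr
```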
Update from Nature on decoding animal speech
Nature, the fancy scientific journal, has a new (to me) format of pop-sci articles published for the web. Its latest edition reviews how scientists are using tools similar to audio LLMs to discover that:
- elephants, elephant seals, and monkeys call each other by specific names
- sperm whales have a phonetic alphabet
- crows use a consistent language for some tasks
Unfortunately, as far as I can tell, no one has taken the logical next step with these projects: generating audio and trying to communicate back with the animals.
Voice agent cash bonanza
When I started writing this newsletter last year, it was pretty easy to stay on top of every startup offering some kind of differentiated voice service. Every few months someone would launch, I would play with the model, and life would go on.
This has changed. In the last few months, Hamming.ai raised $3.8M seed, Vapi $20M Series A, Retell AI $4.6M seed, LiveKit AI $22.5M Series A, Hume.ai $50M Series B, and probably a bunch I am missing.
These companies focus on diverse aspects of the audio LLM space, from evals to modeling to infra, and are starting to create the foundation of the voice ecosystem that has existed in text for a few years. I don’t have time to test all of their products for this week’s edition, but I will slowly go through this list and figure out what they are all about.
A day in the life of a voice phisherman
I was clutching my pearls last month when I realized that I could create a very realistic telephone scam agent using Play’s tools in <5 min, but I questioned whether I was overreacting. Is scam volume gated on supply or demand? Does increasing supply with automated tools actually matter?
Krebs on Security assuages my concerns to some extent through this excellent profile of what it takes to run an audio scam. TL;DR is that scamming somewhat sophisticated targets out of cryptocurrencies requires an elaborate ballet of different roles operating in different areas. I think that it is very unlikely that this type of orchestration could be replaced with AI in the near future, especially given the distribution of the rewards.
I do acknowledge that this is quite a sophisticated operation, and vanilla audio versions of Nigerian princes will always be out there, but I think that the jobs of real phishermen are safe from AI for now.
Strong grounding is a must for audio models; Google shares a new benchmark
Factuality is essential for audio LLMs. When I’m sitting in front of a computer, hallucinations can easily be cross-referenced and corrected. I don’t need a chatbot to tell me exactly which winter tires are best for a Tesla Model Y with Induction wheels; I just need a pointer that lets me finish the job. Users of voice interfaces have no such luxury.
If I am frictionlessly talking to my AI agent through my wearable about what restaurant to stop in, that restaurant had better exist and be good, or else I will throw away that wearable and revert to typing on my phone, a pretty good alternative. This kind of factuality is achieved through two different mechanisms: parametric memory and grounding.
Parametric memory is the knowledge that the LLM stores in its parameters. This is what’s measured by evals like SimpleQA, which tests whether an LLM can retrieve obscure facts [aside: I think we should hill-climb on fraction wrong, not fraction right, on SimpleQA; it is better to not answer than to be wrong]. While models can store a lot of general information this way, it is immutable post-training and generally lacking in the long tail.
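As a concrete version of that aside, here is a toy sketch of why fraction-right and fraction-wrong diverge once abstention is allowed. The grade labels mirror SimpleQA’s correct / incorrect / not-attempted scheme; the data is made up.

```python
from collections import Counter

# Toy grades in SimpleQA's scheme: each answer is judged
# "correct", "incorrect", or "not_attempted" (i.e., the model abstained).
grades = ["correct", "incorrect", "not_attempted", "correct",
          "not_attempted", "incorrect", "correct", "not_attempted"]

counts = Counter(grades)
n = len(grades)

fraction_right = counts["correct"] / n    # what most leaderboards report
fraction_wrong = counts["incorrect"] / n  # what I'd rather hill-climb on

print(f"right: {fraction_right:.2f}, wrong: {fraction_wrong:.2f}")
# right: 0.38, wrong: 0.25
# Maximizing fraction_right rewards guessing on every question;
# minimizing fraction_wrong rewards abstaining instead of hallucinating.
```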
To account for this, models can ground their responses on externally provided text like search results and knowledge graphs. This technique is essential for use cases like the restaurant one above. To my knowledge, there was no great benchmark for factuality on grounded content until Google released FACTS Grounding: a set of 860 prompts with associated documents, plus an LLM-as-a-judge evaluation framework. I look forward to seeing how audio models do.
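For a sense of what the grounded setup looks like mechanically, here is a minimal sketch: the model is told to answer only from a supplied document, and a second model call judges whether the answer is supported. The function names and prompts are my own illustration, not the actual FACTS Grounding harness, and `call_llm` is a placeholder for whatever chat-completion client you use.

```python
# Minimal sketch of grounded generation plus an LLM-as-a-judge check.
# The prompts are illustrative, not the FACTS Grounding rubric.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire up your model client here")

def grounded_answer(question: str, document: str) -> str:
    # Constrain the model to the supplied document instead of
    # its parametric memory.
    return call_llm(
        "Answer using ONLY the document below. If the document does not "
        "contain the answer, say you don't know.\n\n"
        f"Document:\n{document}\n\nQuestion: {question}"
    )

def judge_grounding(answer: str, document: str) -> bool:
    # A second model grades whether every claim is supported.
    verdict = call_llm(
        "Is every factual claim in the answer supported by the document? "
        "Reply SUPPORTED or UNSUPPORTED.\n\n"
        f"Document:\n{document}\n\nAnswer:\n{answer}"
    )
    return verdict.strip().upper().startswith("SUPPORTED")
```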