Dan’s Weekly AI Speech and Language Scoop #33

Standard Intelligence’s hertz-dev is the Zoolander of audio LLMs

Standard Intelligence, a company that as far as I can tell was started by a bunch of high school students (aside: wow, you guys are impressive if you are reading this), just released the second Type D audio LLM I’m aware of (Kyutai’s Moshi was the first): hertz-dev. Like Kyutai, they claim full-duplex generation and very impressive 120 ms algorithmic latency, which should make this a very powerful tool for natural dialogue. Unlike Kyutai, they released the base model, so you can have all kinds of fun with few-shot prompts and continuations.

The fun part of continuations is that the model just tries to complete the sentence in speech tokens. So if you prompt the model with your own voice, it will clone it and continue your thought. hertz-dev speech generation performance is very impressive. With repeated sampling, I could generate audio that sounded nearly indistinguishable from my own voice with a really wide range of expressivity.

Unfortunately, while the model is really really ridiculously good sounding, its semantic performance (intelligence and reasoning) is hilariously bad. The model can barely complete a low entropy sentence like “Once upon a time there was…”, much less handle more sophisticated GPT-3-style few-shot prompts. This is a bit surprising since they started from a strong pre-trained language checkpoint. I suspect that introducing the audio could have triggered some kind of catastrophic forgetting or that their model hasn’t learned to align speech and text.

Regardless, this is a really strong result from a young, talented team. I’m looking forward to playing with their to-be-released 70B variant. Give it a shot yourself at their HuggingFace space or on your local machine.

Search is hard: My review of SearchGPT

I think that most technologists agree that getting an answer is better than 10 links, and there are now a lot of products out there that claim to do just that: Google itself, Perplexity, You.com, Meta AI, and the list goes on. But all of these miss the mark on some axis. Search is really hard.

I had high hopes for SearchGPT. I thought that OpenAI would nail this one and drive a step change in the experience. Unfortunately, they left me disappointed. There appear to be two SearchGPT modes:

  1. References inline and clearly cited for simple facts
  2. References lumped together for more complex ones

Mode 1 works pretty well, but these types of facts are generally very easy to Google oneself, and digging through the reference to find the actual source of the citation takes longer than the equivalent Google search. They could significantly improve the experience by linking directly to the relevant passage.
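For what it’s worth, browsers already have a mechanism for pointing a link at a specific passage: URL text fragments (`#:~:text=...`), which Chromium-based browsers and some others will scroll to and highlight. Below is a minimal sketch of building such a link. The function name `text_fragment_url` and the example URL are hypothetical, and nothing here is a claim about how SearchGPT actually constructs its citations.

```python
from urllib.parse import quote, urldefrag


def text_fragment_url(url: str, passage: str) -> str:
    """Return `url` with a text-fragment directive pointing at `passage`.

    Hypothetical helper for illustration only.
    """
    # Drop any existing #fragment so the directive is the only fragment.
    base, _ = urldefrag(url)
    # Percent-encode everything; quote() leaves "-" alone, but a dash has
    # special meaning inside a text directive, so encode it by hand.
    encoded = quote(passage, safe="").replace("-", "%2D")
    return f"{base}#:~:text={encoded}"


if __name__ == "__main__":
    print(text_fragment_url("https://example.com/forum/thread", "all-season, not snow"))
```

A citation built this way would drop the reader on the quoted sentence rather than at the top of a long forum thread, provided the cited page still contains that exact text.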

Mode 2 reminds me of the early days of ChatGPT before users learned to check for hallucinations. The model generates a very authoritative answer supported by sometimes dozens of references, but the chance of these references actually supporting the conclusion varies depending on the query.

I recently bought oversize snow tires for my Tesla Model Y (stock = 245/45/20; mine = 275/45/20). Figuring out what tires would fit involved a painful tour through a bunch of forums and is exactly the kind of query an agentic search engine could really help with. However, when I ask SearchGPT this question, it recommends two tires that don’t actually come in that size, one tire that someone in the Tesla forums mounted on aftermarket rims of a different width, and one tire that Tesla owners commonly run but isn’t a snow tire. Neither the tires I bought nor the other commonly used oversize snow tires were anywhere to be found.

Returning the wrong answer really isn’t a problem. At this point, people are used to LLM hallucinations. Returning the wrong answer with thumbnails and confident links to authoritative websites is, especially when the product almost filibusters the user with the volume of references.

To be fair to OpenAI, no other product on the market answers this query well today, but I was hoping that they would come in and level up the game. Maybe it turns out that this is just a really hard problem and it will be a while until consumers get something great.

Tencent enters the chat

Tencent released an open 52B active parameter MoE model. Nothing in their paper leapt out at me, but it was interesting to see that they only trained on 7T (vs. 15T for Llama 3) tokens and mostly synthetically generated their post-training data yet still managed to mostly beat Llama 3 405B and DeepSeek V2 on academic benchmarks.

What does leap out at me is wondering which chat Tencent entered. Their marketing page is only available in Chinese. Their consumer product, Yuanbao, seems on par with ChatGPT but is basically unusable as an English speaker. I don’t see their models in the Chatbot Arena or any of the major leaderboards, e.g. Scale’s SEAL.

What is going on in this parallel universe of open Chinese LLMs? Are they just building for the domestic market? Is there any reason that they don’t seem to be putting any effort into international adoption? Alibaba’s Qwen, Tencent’s Hunyuan, and DeepSeek’s DeepSeek could all challenge Llama for the OSS adoption throne but seem to have mostly opted out. If anyone knows what’s going on with these Chinese models, let me know. I am curious.

Anthropic cheaps out

Someone has to make money in AI one of these days. LLM inference has gotten roughly 100x cheaper since GPT-4’s release and Anthropic finally put their foot down. They both raised prices on Haiku (almost 5x!), arguing that the latest version outperforms the original Opus, and periodically swap out Sonnet for Haiku in their Claude web interface.

As LLMs find PMF and investors start wanting to see a return, I suspect that this trend will continue. We know from leaked docs that OpenAI lost almost $5B on $3.7B in revenue this year. While they do pay well, most of this isn’t in cash and the company is still pretty small, so I suspect that they are losing money on every $20/month subscription and most API calls (although it’s totally possible that training costs are what’s eating their margins). I don’t doubt that Anthropic is in the same boat and this is their first of many moves to try to reset expectations.
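To put those leaked figures in perspective, here is a back-of-the-envelope sketch of what they imply. It treats “almost $5B” as a round $5.0B and assumes loss is simply spend minus revenue. Both are simplifications of mine, and the calculation says nothing about how the spend splits between inference and training.

```python
# Rough figures from the leaked docs, in billions of USD.
REVENUE_B = 3.7
LOSS_B = 5.0  # "almost $5B", rounded up for illustration


def implied_spend(revenue_b: float, loss_b: float) -> float:
    """Total spend implied by revenue and loss, assuming loss = spend - revenue."""
    return revenue_b + loss_b


def spend_per_revenue_dollar(revenue_b: float, loss_b: float) -> float:
    """Dollars spent for every dollar of revenue earned."""
    return implied_spend(revenue_b, loss_b) / revenue_b


if __name__ == "__main__":
    spend = implied_spend(REVENUE_B, LOSS_B)
    ratio = spend_per_revenue_dollar(REVENUE_B, LOSS_B)
    print(f"Implied spend: ${spend:.1f}B, or ${ratio:.2f} per $1 of revenue")
```

Under those assumptions the company spends a bit more than two dollars for every dollar it takes in, which is why a price reset somewhere in the industry seems inevitable.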

The Uber of knowledge: Scale’s expert match platform

Scale is making boatloads of money. They apparently earned $400M in revenue last half. But I suspect their business isn’t spectacular by the lofty standards of the tech industry. They work with AI companies to generate human annotations on which to train their models, but this is really, really hard to do these days. One can no longer just set up a simple task on Mechanical Turk to generate QA pairs and expect to lift model performance. Annotators must be trained, systems must be built to guide their responses, QA processes must be established to prune low-quality data, and so on.

Rather than deal with the messy realities of human workforces, a better business would be to build a platform to connect annotators with AI companies and take a slice. And Scale just announced that they are doing just that.

Beyond being great for Scale’s margins, this is cool on two levels.

  1. Frontier models are really, really good. They probably have a master’s level understanding of essentially every topic. To improve them with human annotation, you really need PhD level annotators, but these people are hard to find. Just like GLG, AlphaSights, and the other expert network companies really unlocked the ability of a PM to connect with an expert without a ton of work, Scale can now do that for an AI company. If this is successful, I expect this effort to accelerate growth in the number of specialized domains in which models perform at top human level.
  2. This is a fun way for experts to help democratize their knowledge and earn a bit of money on the side. I did my PhD in Chemical Engineering and would have a lot of fun writing QA pairs for AIs on my thesis. And I’m probably one of a few people in the world who could inject that bit of knowledge into the model. If you multiply that by a million, the model can get a lot smarter, those million people can feel prouder that their esoteric PhD work might actually be helping someone, and they would have a few more shekels bouncing around in their pockets.