Dan’s Weekly AI Speech and Language Scoop #24

These notes are written for a Meta-internal audience. I quickly redact Meta-internal information before publishing but don’t take the time to rewrite. If a particular edition seems thin or a transition doesn’t make sense, this is why.

Voice assistants are nothing without their tools

TechCrunch reviewed ChatGPT Advanced Voice Mode, spilling a few pages of ink lauding its expressivity and creativity while burying the lede halfway down:

However, AVM falls short in other ways. ChatGPT’s voice feature can’t set timers or reminders, surf the web in real time, check the weather, or interact with any APIs on your phone. Right now, at least, it’s not an effective replacement for virtual assistants.

Ultimately, our users come to a voice assistant to solve a problem. That problem may be entertainment, role-playing, or something else that can be handled well with the model’s parametric memory alone, but the vast majority of assistant use-cases that people care about today require some kind of real-time information. We’ve trained users that voice assistants can check the weather, set timers, create reminders, search the internet, and more. Removing these capabilities is a big step back, which I believe will hurt the adoption of next-gen voice assistants.
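For context, this is the plumbing that text-based assistants lean on and that AVM reportedly lacks: the model emits a structured tool call, and the app executes it against a real API. Here is a minimal sketch using the OpenAI Python SDK and a hypothetical get_weather tool (the tool name and schema are my own illustration, not anything AVM exposes today):

```python
from openai import OpenAI  # assumes the openai Python SDK (v1+)

client = OpenAI()

# A hypothetical weather tool, declared so the model can request it by name.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Menlo Park?"}],
    tools=tools,
)

# When the model decides a tool is needed, it returns a structured call for the
# app to execute against a real weather API, rather than answering from memory.
tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name, tool_call.function.arguments)
```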

Evaluating voice assistants

Discussing the challenges of evaluating LLMs has been a consistent theme of this newsletter [1, 2, 3, 4]. We started with academic evals, saturated/overfit those, moved to human SxS, created models that were capable of answering essentially any non-expert human question correctly, and now are coalescing around domain-specific private evals.

We are back to square one with voice assistants. The Gemini and GPT-4o technical reports shared ASR (automatic speech recognition) and AST (automatic speech translation) numbers on a few standard academic benchmarks. Spirit LM expanded upon these by adding spoken versions of a few written benchmarks and sentiment classification tasks. However, these benchmarks are only proxies for what we all really care about: is this model useful to humans?

I’m looking forward to seeing a speed run of the text eval pipeline evolution for speech (and vision) models. Will LMSYS create a Chatbot Arena for speech models? Can Gemini or GPT-4o serve as a reliable LLM-as-a-judge out of the box and become the de facto standard the way GPT-4 did last year? Do the standard human annotation processes for text apply well to speech? Time will tell, but I’m excited to see what the industry builds.
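If someone does build LLM-as-a-judge for speech, my guess is that the first version looks just like the text pipelines, run over transcripts. A hedged sketch of that idea (the prompt, the rubric, and the use of transcripts instead of raw audio are all my assumptions):

```python
from openai import OpenAI  # assumes the openai Python SDK (v1+)

client = OpenAI()

JUDGE_PROMPT = """You are comparing two voice-assistant responses to the same request.
Judge helpfulness, correctness, and how natural each would sound when spoken aloud.
Answer with exactly "A", "B", or "tie".

Request: {request}
Response A (transcript): {a}
Response B (transcript): {b}"""

def judge(request: str, transcript_a: str, transcript_b: str) -> str:
    # Text-only proxy: judging transcripts misses prosody, latency, and barge-in
    # behavior, which is exactly why speech evals may need more than this.
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            request=request, a=transcript_a, b=transcript_b)}],
        temperature=0,
    )
    return result.choices[0].message.content.strip()
```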

LLM price wars have reached their logical conclusion: Free!

We’ve been joking about intelligence too cheap to meter for a while now. Prices for LLM-based inference have come down ~100x in the past two years, but Google finally did the inevitable. They are now pricing their Gemini Flash model at “free for all intents and purposes” (and offer a fairly generous free tier for Pro).

Now that all top models are essentially indistinguishable for most use-cases, I have no idea how the inference providers, hyperscalers, and frontier model companies will compete. I am quite glad that I am focused on building delightful products with our models and not trying to compete on this scorched earth (that said, I do think that releasing the Llama models for free was one of the key catalysts driving these prices to zero).

LLMs as a societal mirror: Reproducing social science experiments in one shot with GPT-4o

Social science experiments are notoriously hard to reproduce (62% of major studies can be directionally reproduced, and the measured effects average only about 50% of those originally reported). People’s preferences and behaviors are squishy, and hypotheses often fit preconceived notions. However, social science can be really important in helping us understand how we think and improve how we structure society.

A group of researchers showed that they could replicate most (67%) major text-based social science experiments with a single GPT-4o prompt that included the exact prompt given to respondents. This number goes up to 85% when the responses from 20 prompts are averaged. Both of these numbers are higher than in the paper linked above, where researchers attempted to reproduce results by traditional means. Their method is also much simpler than I originally thought: they don’t simulate N participants and researchers through some agentic system; they just ask the model what the result of the experiment would be.
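I haven’t seen the authors’ exact prompt, but the shape of the method is simple enough to sketch. Everything below (the wording, the 0–100 answer format, the averaging over 20 samples) is my reconstruction, not their code:

```python
from openai import OpenAI  # assumes the openai Python SDK (v1+)

client = OpenAI()

def predict_result(study_stimulus: str, n_samples: int = 20) -> float:
    """Ask GPT-4o to predict a study's headline result, then average N samples."""
    question = (
        "Below is the exact material shown to participants in a social science "
        "experiment. Predict the outcome: what percentage of participants give "
        "the treatment-consistent response? Reply with a single number from 0 to 100.\n\n"
        + study_stimulus
    )
    predictions = []
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": question}],
            temperature=1.0,  # sampling noise is what the 20-sample average smooths out
        )
        predictions.append(float(resp.choices[0].message.content.strip().rstrip("%")))
    return sum(predictions) / len(predictions)
```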

Why do I think this is an important result?

  • If you want a directionally correct view of how humans behave, just ask an LLM carefully. The results will be approximately as high quality as those from more expensive, time-consuming studies.
  • It’s not clear which way causality flows. Are researchers subconsciously selecting hypotheses to test because they are consistent with humanity’s written corpus and align with our expectations (and thus the expectations of the model)? Or are the models genuinely able to predict the results of novel studies? Or are these sentences saying the same thing?
  • If I were in social science, I would introduce the notion of a high-entropy experiment. We should seek to build on the LLM’s understanding of human preference and behavior. Before conducting an experiment, I would ask the model for an answer and only proceed if I believed I could show that the model was wrong and thus add new information to the system (see the sketch after this list).
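A toy version of that gate, reusing the predict_result sketch above (the threshold and the percentage-point framing are arbitrary choices on my part):

```python
def worth_running(llm_prediction: float, my_prediction: float,
                  threshold_pp: float = 10.0) -> bool:
    """Hypothetical gate: only fund the human study when the researcher's expected
    result disagrees with the LLM's guess by enough percentage points that the
    experiment stands to add new information."""
    return abs(llm_prediction - my_prediction) >= threshold_pp

# Example: the model predicts 72% compliance, I expect ~50%, so the study is worth running.
print(worth_running(llm_prediction=72.0, my_prediction=50.0))  # True
```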

The puzzle of what to do with “unsafe” LLMs

xAI released Grok 2 last week. Not being a Twitter Blue Plus Extreme member, I haven’t gotten the opportunity to try it, but by all accounts it is good. Joining the “I have a top model” club would have been big news a few months ago, but now it barely registers.

That said, this release is interesting for a different reason. Grok’s safety alignment is much, much lighter than other models’. The internet went wild with Grok- (or, technically, Flux-) generated pictures of Donald Trump on dates with Kamala Harris, Mickey Mouse posing with machine guns, and Bill Gates snorting cocaine. I didn’t see any text examples, but I’m sure the results are similar.

There is clearly demand for this kind of tool, and there always has been. Satirists have created this kind of art since long before there were LLMs. In my SF bachelor days, I had a print of George Washington posterizing Kim Jong Un, with Lincoln boxing out Stalin in the background, hanging in my kitchen. It’s funny! People loved it! But if I ask any of the mainstream image generation models to reproduce it, they will refuse.

Now that we’ve had LLMs in the market for a few years and had a chance to understand the tradeoffs around safety, I’ll be curious to see how consensus evolves.