Dan’s Weekly AI Speech and Language Scoop #32

On SimpleQA and factuality in LLMs

OpenAI released an LLM benchmark called SimpleQA [csv of questions]. It is far from simple. This is pub trivia on steroids. I suspect I could reason my way to directionally correct answers for about 10% of the questions, and for at least 25% of them I don’t even fully understand what the question is referring to.

The best-performing model, o1-preview, answers only 42.7% of the questions correctly, while the worst, Claude-3-haiku, scores just 5.1%. Bad, right? Not so fast. While Claude-3-haiku answers only 5.1% correctly, it also answers only 19.6% incorrectly, passing on 75.3%. o1-preview gets 48.1% wrong, so when it does attempt an answer it is right less than half the time, worse than a coin flip.
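To make the coin-flip comparison concrete, here is the arithmetic on those headline numbers (conditioning on attempted questions is my framing, not a metric the benchmark reports):

```python
# Back-of-envelope check of the "worse than a coin flip" claim, using the
# headline SimpleQA numbers quoted above (percent of all questions).
scores = {
    "o1-preview":     {"correct": 42.7, "incorrect": 48.1},
    "claude-3-haiku": {"correct": 5.1,  "incorrect": 19.6},
}

for model, s in scores.items():
    attempted = s["correct"] + s["incorrect"]      # everything not passed on
    acc_when_attempted = s["correct"] / attempted
    print(f"{model}: attempts {attempted:.1f}% of questions, "
          f"right {acc_when_attempted:.0%} of the time when it does")

# o1-preview: attempts 90.8% of questions, right 47% of the time when it does
# claude-3-haiku: attempts 24.7% of questions, right 21% of the time when it does
```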

Which would you prefer as a user? Personally, I would rather have Haiku. When I am interacting with an LLM, especially over audio, I want the answer to be correct. I don’t want the additional mental overhead of having to double-check roughly half of the answers. I want the model to recognize that the answers to these questions are probably not lurking in its parametric memory and to go seek out the answer for me.

And existing LLM-based products for the most part do just this. From sampling a half dozen questions, Meta AI, ChatGPT, and Gemini all answer nearly perfectly on this benchmark thanks to their search grounding (tragically, Claude does not offer this yet; I’m surprised that Anthropic is releasing agents but still hasn’t built search into its consumer product).

So why is this an LLM eval rather than a user-facing system eval? Why do I care how the base model scores on these very challenging trivia questions? The answer, I think, is that I don’t. This is very reminiscent of the early work on teaching LLMs to do complex arithmetic: people cared until they realized the correct approach was to just give the model a calculator.

However, OpenAI motivates this benchmark in an interesting way. They suggest it can be used to measure the calibration of language models, that is, whether a model’s stated confidence matches how often it is actually right, and measuring that requires a set of prompts the model is likely to struggle with.

I’m always glad to see an interesting new benchmark, but I hope that this doesn’t start popping up in marketing materials as something to optimize for and instead remains an interesting way to explore model calibration. LLMs can do lots of wonderful things, and I don’t believe recalling obscure facts is a capability worth optimizing for. I would much rather see a smaller, faster, more agentic model than one that has memorized Wikipedia.

Stated vs. revealed LLM preferences: Dunning-Kruger Effect in LLMs

Figure 2 in the SimpleQA paper shows how the OpenAI researchers measured model calibration. They both asked the model how confident it was in its answer and sampled the same prompt multiple times to measure how frequently it yielded the same answer. This is very reminiscent of a human stated vs. revealed preference experiment.

When asked (stated preference), the models were always far more confident in their answers than was justified: they correctly answered only about 50% of the questions for which they claimed 90% confidence. However, when sampled, the models were much better calibrated (and o1-preview almost perfectly so). They are just like us humans: deludedly overconfident in our abilities.
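Here is a rough sketch of both probes, assuming a hypothetical ask_model helper that returns an answer plus a stated confidence; the confidence bucketing is my own simplification of the paper’s setup, not OpenAI’s code:

```python
from collections import Counter

def ask_model(question: str, seed: int = 0) -> tuple[str, float]:
    """Hypothetical stand-in for an LLM call returning (answer, stated confidence in %)."""
    raise NotImplementedError("wire this up to your model of choice")

def stated_calibration(questions, gold, bucket_width=10):
    """Bucket questions by stated confidence and compare to empirical accuracy."""
    buckets: dict[int, tuple[int, int]] = {}  # bucket lower edge -> (n_correct, n_total)
    for q in questions:
        answer, confidence = ask_model(q)
        lo = int(confidence // bucket_width) * bucket_width
        correct, total = buckets.get(lo, (0, 0))
        buckets[lo] = (correct + (answer == gold[q]), total + 1)
    # Perfect calibration: accuracy in each bucket roughly equals the stated confidence.
    return {lo: correct / total for lo, (correct, total) in buckets.items()}

def sampled_confidence(question: str, n_samples: int = 10) -> tuple[str, float]:
    """Re-sample the same prompt; the modal answer's frequency acts as an implicit confidence."""
    answers = [ask_model(question, seed=i)[0] for i in range(n_samples)]
    modal_answer, count = Counter(answers).most_common(1)[0]
    return modal_answer, count / n_samples
```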

The information theory behind parametric memory

How many parameters do you need to memorize Wikipedia? According to some very smart people who write papers on the physics of language models that I don’t totally understand, the empirical limit for knowledge storage in LLMs is 2 bits per parameter. English Wikipedia contains 28.2B characters. While an optimal Huffman code over English letter frequencies requires about 4 bits per character, Wikipedia tells me that the entropy of English text drops to ~1 bit/character once you account for the redundancy of the language.

This means we need 14.1B parameters just to memorize Wikipedia (28.2B characters × ~1 bit/character ÷ 2 bits/parameter). And Wikipedia is a small fraction of the information found on the internet and only has answers to a fraction (~50%?) of the questions in SimpleQA. That means we would need ~28.2B parameters just to store the knowledge behind these questions at the theoretical limit. I suspect the real number is far higher.
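As a worked example, here is the same arithmetic in a few lines (the 2 bits/parameter figure comes from that line of work; the ~1 bit/character entropy and the ~50% coverage are the rough assumptions above):

```python
# Back-of-envelope: parameters needed to memorize a corpus at ~2 bits/parameter.
BITS_PER_PARAM = 2.0       # empirical knowledge-capacity limit from the literature
BITS_PER_CHAR = 1.0        # rough entropy of English text after accounting for redundancy
WIKI_CHARS = 28.2e9        # characters in English Wikipedia
SIMPLEQA_COVERAGE = 0.5    # guess: fraction of SimpleQA answerable from Wikipedia alone

params_for_wikipedia = WIKI_CHARS * BITS_PER_CHAR / BITS_PER_PARAM
params_for_simpleqa = params_for_wikipedia / SIMPLEQA_COVERAGE

print(f"Memorizing Wikipedia: ~{params_for_wikipedia / 1e9:.1f}B parameters")  # ~14.1B
print(f"Covering SimpleQA:    ~{params_for_simpleqa / 1e9:.1f}B parameters")   # ~28.2B
```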

Reasoning one’s way to winning pub trivia

Have you ever found yourself sitting at a bar with friends playing trivia and totally stumped by the answer? What was Lift 5 at Beaver Creek called in 2005? What is Lift 5 called today? Is that Rose Bowl? Cinch? Who knows? Let’s have a drink.

If you’re anything like me, reasoning through true trivia questions rarely works. You either know the number of rushing yards some obscure NFL player had in a given year or you don’t (no points for Fermi estimates in trivia!).

OpenAI’s previously mentioned SimpleQA eval shows that machines are no different. Their GPT-4o model scores 38% correct, and o1-preview adds only about 5 pp through reasoning. This is in sharp contrast to true reasoning problems like competition math, where o1-preview adds a whopping 44 pp.

This is another nice demonstration that models are just like us after all.