Intelligence too cheap to meter
The theme of this week was cheap models. OpenAI released GPT-4o mini, which performs nearly as well as GPT-4o (itself released only a couple of months ago) while pricing inference at $0.60/M output tokens. DeepSeek released an updated checkpoint of their model, charging only $0.28/M output tokens. And we, of course, updated Llama 3 70B to be competitive with both, which Together.ai is serving for $0.54/M output tokens.
To put this in context, 1M output tokens represents about 2,500 pages of text, or the entirety of the King James Bible. An approximation of this work, which took 47 people and 7 years to write, can be yours for less than a dollar in a few minutes (assuming you have already scraped the web for the source documents).
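The arithmetic is easy to check. A quick sketch (the KJV word count and the ~0.75 words-per-token ratio are rough rules of thumb, and real tokenization varies by model):

```python
# Rough cost to generate a King James Bible's worth of text.
# Assumptions: ~783,000 words in the KJV, ~0.75 words per token.
KJV_WORDS = 783_000
WORDS_PER_TOKEN = 0.75

# $ per 1M output tokens, as quoted above.
price_per_m_tokens = {
    "GPT-4o mini": 0.60,
    "DeepSeek": 0.28,
    "Llama 3 70B (Together.ai)": 0.54,
}

tokens = KJV_WORDS / WORDS_PER_TOKEN  # ~1.04M tokens
for model, price in price_per_m_tokens.items():
    print(f"{model}: ${tokens / 1_000_000 * price:.2f}")
# GPT-4o mini: $0.63, DeepSeek: $0.29, Llama 3 70B: $0.56 -- all under a dollar.
```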
Inference prices have fallen >100x since GPT-4 was first released 14 months ago, with only marginal impact on quality. Even if this trend halts, which I think is unlikely, the economics of these models already make entirely new business models and ways of interacting with computers possible. I think that, for the first time, we have a line of sight to a world where it is feasible for every experience to be perfectly personalized and generated on the fly for each user.
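Where does the >100x come from? At launch in March 2023, GPT-4 charged $0.06 per 1K output tokens, i.e. $60/M:

```python
# Output-token prices in $/M. GPT-4's launch price is public; the rest
# restate the figures quoted above.
gpt4_launch = 60.00
gpt4o_mini = 0.60
deepseek = 0.28

print(f"{gpt4_launch / gpt4o_mini:.0f}x cheaper")  # 100x
print(f"{gpt4_launch / deepseek:.0f}x cheaper")    # 214x
```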
So physics really is the hardest science
In grad school, I worked in an interdisciplinary lab in the chemical engineering department studying gels. I studied physics as an undergrad, and one of the postdocs was a biologist. No matter how hard I tried, I could never convince him that physics was the harder science (in both difficulty and rigor). Looking back, this was a bratty thing to inflict upon a forever biology postdoc, but thanks to the Llama 3 paper, I finally have the evidence I need to support my claim.
We fed subject-specific AP tests to various flavors of Llamas, GPTs, Claudes, and Nemotron, and some clear trends emerged. All the models saturate, or nearly saturate, the AP Biology exam, but even the best performer, our own Llama 3 405B, scores only 93% on Physics. The liberal arts folks may have the last laugh, however: AP Art History earns the lowest top score and seems to challenge all the models uniformly.
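For the curious, this kind of eval is simple to reproduce. A minimal sketch, not our actual harness: the file name, question schema, and model id below are illustrative, and it assumes multiple-choice questions served to an OpenAI-compatible chat endpoint:

```python
import json
import re

from openai import OpenAI  # any OpenAI-compatible client works

client = OpenAI()  # point base_url at your provider of choice


def score_ap_exam(path: str, model: str) -> float:
    """Fraction of multiple-choice questions answered correctly."""
    with open(path) as f:
        # Assumed schema: [{"question": ..., "choices": [...], "answer": "B"}, ...]
        questions = json.load(f)

    correct = 0
    for q in questions:
        choices = "\n".join(
            f"{letter}. {text}" for letter, text in zip("ABCDE", q["choices"])
        )
        resp = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": f"{q['question']}\n{choices}\n\nAnswer with a single letter.",
            }],
            temperature=0.0,
        )
        # Take the first letter the model commits to.
        match = re.search(r"[A-E]", resp.choices[0].message.content)
        if match and match.group(0) == q["answer"]:
            correct += 1
    return correct / len(questions)


print(score_ap_exam("ap_physics.json", model="gpt-4o-mini"))
```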
Llama 3 405B dazzles on Scale's private evals and OpenAI hacks the Chatbot Arena
We’ve spent a lot of time talking about the challenges of evals in this newsletter [1, 2, 3, 4, 5, 6, and more]. I wanted to point out two interesting ones this week.
First, Llama 3 405B earns the top spot (within the confidence intervals) on the Scale.ai private leaderboards for coding, instruction following, and math. This isn't surprising to me based on vibes and strong benchmark performance, but it is reassuring to see us deliver on reputable, private evals that I believe have become the gold standard for evaluating model performance.
Second, GPT-4o mini takes second place (behind GPT-4o and ahead of the widely lauded Claude 3.5 Sonnet, online Gemini Advanced, GPT-4-Turbo, and many other strong models) in the Chatbot Arena. Based on vibes and other benchmarks, there is no way this low-cost new release is really on par with (or better than) the others. We suggested a few months ago that Goodhart's Law had come for the Arena [1, 2, 3], but I now think we can officially declare its arrival. It is clear that companies are fitting to the Arena's distribution of prompts and users, and I believe this metric has outlived its usefulness.
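To see why the Arena is gameable, recall how the leaderboard is built: pairwise human votes aggregated into Elo-style ratings (LMSYS has since moved to a Bradley-Terry fit, but the intuition is the same). A toy update rule, for illustration only:

```python
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0) -> tuple[float, float]:
    """One Elo update after a head-to-head battle between models A and B."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * ((1.0 if a_wins else 0.0) - expected_a)
    return r_a + delta, r_b - delta

# Every vote asks only "which reply did this user prefer?" -- so a model tuned
# to the Arena's prompt and voter distribution climbs the ladder without being
# generally better. That is Goodhart's Law in action.
r_mini, r_sonnet = 1200.0, 1200.0
r_mini, r_sonnet = elo_update(r_mini, r_sonnet, a_wins=True)
print(r_mini, r_sonnet)  # 1216.0, 1184.0
```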