These notes are written for a Meta-internal audience. I quickly redact Meta-internal information before publishing but don’t take the time to rewrite. If a particular edition seems thin or a transition doesn’t make sense, this is why.
Impressive Llama 3 launch and Goodhart comes for the Chatbot Arena
The impressive success of the Llama 3 launch has been well covered, but I wanted to spend some time on one of my favorite subjects: evals. Llama 3 famously (and maybe unexpectedly?) rocketed to the top of the LMSYS Chatbot Arena, leading many to claim that Meta achieved GPT-4-level performance with a relatively small open model.
But does this make sense? GPT-4 is reportedly a ~1.8T-parameter model that a very capable team has been optimizing for over a year now. Can we really pack that much reasoning and knowledge into so many fewer parameters? The research we covered last week suggests not.
Instead, I think Meta may have unintentionally hacked what has become the canonical leaderboard for LLM performance. The average Meta AI user wants to chat about the weather, not the intricacies of the Solidity compiler, so Meta would naturally tune its model toward chatty, friendly answers that feel at home in social apps. That behavior happens to align well with what Chatbot Arena voters reward, leading us to take one of the top slots.
This was a great moment for Meta, but what does it mean for the Chatbot Arena? I predict that, as with academic benchmarks like MMLU, Goodhart’s law will drive down its mindshare after this performance.
In fact, I think we may already be seeing some Chatbot Arena hacking. Shortly after the Llama 3 launch, Google unveiled Gemini 1.5 Pro’s Chatbot Arena ranking, debuting at #2. I have been using this model for weeks. While it is capable, I am very confident that it doesn’t reach the overall quality of GPT-4 or Claude 3 Opus on complex tasks, and unlike with Llama 3, I’m not so sure that Google has a strong business reason to optimize its foundation instruct model for these types of use cases.
LMSYS has anticipated this problem and released Arena-Hard, an offline eval built from prompts they’ve received that they believe better capture overall LLM performance. There, not surprisingly, the latest GPT-4 comfortably outperforms Claude 3 Opus, which in turn edges out Llama 3 70B.
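As a refresher on why chatty answers can move the needle here: the Arena leaderboard is just a rating fit to pairwise human votes. Below is a minimal sketch of an online Elo update over a battle log. This is my own toy simplification, not LMSYS’s actual pipeline (they fit a Bradley-Terry model with bootstrapped confidence intervals), and the battle records are made up.

```python
import math
from collections import defaultdict

# Toy online Elo over Chatbot-Arena-style battles (illustrative only).
K = 4          # small K smooths out noisy individual votes
BASE = 1000.0  # every model starts at the same rating

ratings = defaultdict(lambda: BASE)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + math.pow(10.0, (r_b - r_a) / 400.0))

def update(model_a: str, model_b: str, outcome: float) -> None:
    """outcome = 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (outcome - e_a)
    ratings[model_b] += K * ((1.0 - outcome) - (1.0 - e_a))

# Hypothetical battle log: (model_a, model_b, winner)
battles = [
    ("llama-3-70b", "gpt-4", "llama-3-70b"),
    ("gpt-4", "gemini-1.5-pro", "gpt-4"),
    ("llama-3-70b", "claude-3-opus", "tie"),
]

for a, b, winner in battles:
    update(a, b, 1.0 if winner == a else (0.0 if winner == b else 0.5))

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

The point is that every vote is a raw human preference, so anything that systematically tilts preferences (length, friendliness, formatting) moves a model up the ladder regardless of its reasoning ability.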
This discussion all comes back to one of this newsletter’s primary themes: evals are hard.
Phi-3 mini, moar benchmarks, and the wrath of Susan Zhang
Shortly after the Llama 3 launch, Microsoft announced the Phi-3 family of models. Their unique contribution is pretraining on a small amount of “heavily filtered”, high-quality web data deemed to have “high educational value”, augmented by a large corpus of synthetically generated text (adjacent to last week’s discussion).
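Microsoft hasn’t released the filter itself, so the sketch below is purely hypothetical: a placeholder “educational value” scorer gating web pages, plus a teacher model writing textbook-style synthetic text. The classifier heuristic, threshold, and `teacher_llm.complete` call are all stand-ins I made up to illustrate the shape of the recipe.

```python
# Hypothetical sketch of a "heavily filtered" pretraining data pipeline.
# Nothing here is Microsoft's actual code; the scorer, threshold, and
# synthetic-generation prompt are illustrative placeholders.

from typing import Iterable, Iterator

EDU_THRESHOLD = 0.8  # made-up cutoff for "high educational value"

def educational_value(page_text: str) -> float:
    """Stand-in for a learned quality classifier; a crude keyword heuristic
    here so the sketch runs end to end."""
    markers = ("theorem", "example", "exercise", "definition", "step")
    hits = sum(m in page_text.lower() for m in markers)
    return min(1.0, hits / len(markers))

def filter_web_corpus(pages: Iterable[str]) -> Iterator[str]:
    """Keep only the small slice of web text deemed educational."""
    for page in pages:
        if educational_value(page) >= EDU_THRESHOLD:
            yield page

def generate_synthetic_docs(teacher_llm, topics: Iterable[str]) -> Iterator[str]:
    """Augment the filtered web data with textbook-style text written by a
    stronger teacher model (hypothetical `complete` API)."""
    for topic in topics:
        yield teacher_llm.complete(
            f"Write a short, self-contained textbook section about {topic}."
        )
```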
Their 3.8B variant outperforms Llama 3 8B on a collection of academic benchmarks, igniting the internet. If Meta’s 70B model beats OpenAI’s 1.8T model and MSFT’s 3.8B model beats Meta’s 8B model, when are we going to get GPT-4 running on our toothbrush?
However, as always, evals are hard. Meta alumna Susan Zhang, crowned the queen of LLM reviews, showed that Phi-3 appears to have prior knowledge of several important benchmarks, suggesting that test data somehow leaked into training.
I don’t think this was intentional and suspect the root cause lies in the synthetic training-data generation, but, again, evals are hard.
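For readers curious what this kind of contamination probe looks like, here’s one crude version, my own simplification rather than Susan’s actual methodology: feed the model the first half of a benchmark item and check whether it reproduces the rest nearly verbatim. The `generate` callable is a placeholder for whatever inference API you have for the model under test.

```python
# Crude contamination probe (illustrative; not Susan Zhang's exact method).
from difflib import SequenceMatcher

def looks_memorized(generate, question: str, threshold: float = 0.8) -> bool:
    """Prompt with the first half of a benchmark item and check whether the
    model completes the second half nearly verbatim, which it shouldn't be
    able to do unless the item (or a near copy) was in its training data."""
    cut = len(question) // 2
    prefix, held_out = question[:cut], question[cut:]
    completion = generate(prefix)
    similarity = SequenceMatcher(None, completion[:len(held_out)], held_out).ratio()
    return similarity >= threshold

# Hypothetical usage over a benchmark split:
# suspicious = [q for q in benchmark_test_items if looks_memorized(model.generate, q)]
# print(f"{len(suspicious)} / {len(benchmark_test_items)} items look memorized")
```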
Google quantifies the value of a 1M-token context window
Google released Gemini 1.5 Pro several months ago, whose differentiating feature was a 1M-token context window. They had a moment prior to launch when the internet was aflame with clever demos using the full window, but many of those demos were later at least partly debunked when other users pointed out that the same responses could be generated from the model’s parametric knowledge alone, without the provided context, e.g. creating a graph of all the characters in the Harry Potter series.
They recently published “Many-Shot In-Context Learning”, which shows that performance on a number of tasks keeps improving with up to 8,192 in-context examples. This seems like an inefficient way to eke out marginal gains, and the result reads more like a marketing piece for long context windows, but it could be a great way to onboard long-tail tasks quickly for experimentation.
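The mechanics are simple, which is part of the appeal for quick experimentation: keep stuffing labeled examples into the prompt until the context budget runs out. Here is a generic sketch, not the paper’s exact setup; `count_tokens` and `llm_complete` are placeholders for whatever tokenizer and inference API you use.

```python
# Minimal sketch of many-shot in-context learning: pack as many labeled
# examples into the prompt as the context window allows.

def build_many_shot_prompt(examples, query, count_tokens, context_budget=1_000_000):
    """examples: list of (input, label) pairs; returns a prompt with as many
    shots as fit in the token budget, followed by the query."""
    header = "Classify the following examples.\n\n"
    footer = f"Input: {query}\nLabel:"
    budget = context_budget - count_tokens(header) - count_tokens(footer)

    shots = []
    for x, y in examples:  # could also shuffle or subsample here
        shot = f"Input: {x}\nLabel: {y}\n\n"
        cost = count_tokens(shot)
        if cost > budget:
            break
        shots.append(shot)
        budget -= cost

    return header + "".join(shots) + footer

# Hypothetical usage:
# prompt = build_many_shot_prompt(train_pairs[:8192], new_input, count_tokens)
# answer = llm_complete(prompt)
```

There’s no fine-tuning step and no infrastructure beyond prompt assembly, which is why this pattern is attractive for onboarding long-tail tasks before committing to anything heavier.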