These notes are written for a Meta-internal audience. I quickly redact Meta-internal information before publishing but don’t take the time to rewrite. If a particular edition seems thin or a transition doesn’t make sense, this is why.
Smol models from Nvidia
All things being equal, small models are better than big ones. And the trend has been squeezing more and more from smaller and smaller models. Google’s 540B PaLM model (quaintly undertrained on only 780B tokens) achieved only 69.3 on MMLU, a score that Llama 3.1 8B now matches. In two short years, we have squeezed roughly 70x the reasoning per parameter (540B / 8B ≈ 67x)!
Those gains came from a whole host of improvements across data, training, and prompting too numerous to cover. We have looked at a few specific promising techniques in this newsletter (aggressive quantization, dropping entire layers, etc.), but Nvidia just added another one to the repertoire.
They show that they can use a combination of distillation, layer dropping or depth pruning (their figure shows that they can drop 2 middle layers, out of 32 total, with essentially no impact on performance), and a new (at least to me) technique called width pruning. For the latter, they estimate the importance of each component of each head at each layer using a surprisingly simple function, rank the components, and delete the less important ones.
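For intuition, here is a minimal sketch of what activation-based width pruning of attention heads could look like. This is my own illustration, not Nvidia’s code: the function names, the L2-norm score, and the keep_ratio parameter are all made up, and the paper’s exact importance function differs in detail.

```python
import torch

def head_importance(attn_outputs):
    """Estimate per-head importance from attention outputs collected on a
    small calibration set (e.g., via forward hooks on each attention block).

    attn_outputs: list of tensors of shape (batch, seq, n_heads, head_dim).
    The score for a head is the mean L2 norm of its output activations.
    """
    per_batch = [x.norm(dim=-1).mean(dim=(0, 1)) for x in attn_outputs]  # each (n_heads,)
    return torch.stack(per_batch).mean(dim=0)                            # (n_heads,)

def heads_to_keep(importance, keep_ratio=0.75):
    """Rank heads by importance and return the indices of the top fraction;
    the remaining heads would be pruned away."""
    n_keep = max(1, int(round(importance.numel() * keep_ratio)))
    return torch.topk(importance, n_keep).indices.sort().values
```

The appeal of a score like this is that it only needs forward passes over a small calibration set, no gradients or retraining, and the distillation step then recovers most of the lost quality.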
Combining all these techniques, they show that they can shrink Llama 3.1 8B into a 4B Minitron package while essentially maintaining all benchmark scores. I wonder if, in 2026, we will look back two years and be shocked that Llama 3.1 needed 8B parameters to score 69 on MMLU. Maybe a GPT-4o-quality model will be running on your toothbrush soon?
Google’s launches and a modest proposal for more private evals
Google refreshed Gemini 1.5 Pro, Flash, and a smaller, 8B version of Flash. All of these models perform extremely well on the Lmsys leaderboard. This is an impressive launch, but I think that we collectively are overfitting to the Arena. When I sent a handful of randomly selected arena-hard prompts through Gemini 1.5 Flash and Claude 3.5 Sonnet (which, I think, was released before Lmsys hacking became so pervasive), the latter’s answers seemed almost universally better. Try “What is the most efficient way to uniformly sample a point inside a right angle triangle?” or “Write code to simulate a ballistic projectile in non-uniform gravity.” to see for yourself.
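(For reference, the textbook answer to the first prompt is the “fold the unit square” trick. The sketch below is my own, not either model’s output.)

```python
import random

def sample_in_triangle(a, b, c):
    """Uniformly sample a point inside the triangle with vertices a, b, c
    (each an (x, y) pair). Draw (u, v) in the unit square; any draw that
    lands in the wrong half is reflected back across the diagonal, so no
    rejection step is needed. Works for right triangles or any triangle.
    """
    u, v = random.random(), random.random()
    if u + v > 1.0:
        u, v = 1.0 - u, 1.0 - v
    return (a[0] + u * (b[0] - a[0]) + v * (c[0] - a[0]),
            a[1] + u * (b[1] - a[1]) + v * (c[1] - a[1]))
```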
Because some subset (or maybe all) of the prompts are released, I suspect that Google (and probably every other lab) is identifying losses and patching them, betting that the distribution of prompts will stay roughly stable over time. In other words, they are optimizing for the median Chatbot Arena user rather than focusing on building general-purpose intelligence, which is much harder for an average person to evaluate. These two things may be correlated, which mitigates some of this criticism, but I would argue that the average Lmsys user isn’t super representative of the population we collectively want to serve with these models, and we shouldn’t over-rotate.
I think some of this shows up in how metrics are reported. If you click the “Full Leaderboard” tab, you can see that many recently submitted models don’t report Arena-Hard-Auto, MT-bench, or MMLU and don’t appear in Scale’s SEAL private leaderboard. The first three are easy-to-run auto evals, so one would think they would be included alongside the human voting in the Chatbot Arena, but I suspect labs are not seeing progress, or are even seeing regressions, and so only want to show the results of voting, which is highly skewed by formatting, refusals, and other issues only tangentially related to intelligence.
I don’t mean to pick on Google here. I think that this behavior has become pervasive as the Chatbot Arena has evolved into the one metric that everyone tracks, but per Goodhart’s law, “when a measure becomes a target, it ceases to be a good measure.” I hope that this is a small call-to-action for more, better private evals, and for the labs behind the newest models to standardize on submitting checkpoints to all of them.
Cursor smokes GPT-4o with a Llama 70B fine-tune showing off the power of the OSS LLM community
Cursor sells an AI-powered code editor that seems to have caught fire on social recently (I have never personally tried it). They showed that they could outperform GPT-4o on both quality and speed on their code-editing task through a clever fine-tuning pipeline that augments a bunch of real user actions with synthetic data.
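The snippet below is my own generic illustration of how a recorded edit might be framed as a supervised fine-tuning example; the edit_to_example name and its fields are invented for the sake of the sketch and are not taken from Cursor’s write-up.

```python
def edit_to_example(file_before, cursor_context, user_instruction, file_after):
    """Turn one recorded editing event into a supervised fine-tuning example.
    A real pipeline would also filter noisy events and mix in synthetic edits
    generated by a stronger model; this only shows the prompt/target framing.
    """
    prompt = (
        "You are a code-editing assistant.\n"
        f"Current file:\n{file_before}\n"
        f"Cursor context:\n{cursor_context}\n"
        f"Instruction: {user_instruction}\n"
        "Rewrite the file to apply the instruction."
    )
    return {"prompt": prompt, "completion": file_after}
```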
This seems like a really neat test case for two reasons:
1/ Beating one of the top closed models with a relatively small open one is a nice accomplishment
2/ They published their pipeline, so anyone working in a similar space (including researchers at Meta) can leverage their learnings
Interestingly, they also experimented with DeepSeek Coder, which starts out with GPT-4-level coding benchmark scores, but shipped the Llama 3 fine-tune because it “felt” better. Vibes-based evals FTW!