These notes are written for a Meta-internal audience. I quickly redact Meta-internal information before publishing but don’t take the time to rewrite. If a particular edition seems thin or a transition doesn’t make sense, this is why.
NVIDIA beats Meta to a large (340B) open-weight model with middling results
NVIDIA released a 340B parameter model that appears to significantly outperform Llama 3 70B on a handful of important benchmarks and match its performance in the Chatbot Arena (remember that the Chatbot Arena tends to favor friendly, chatty models like Llama over competitors that place more emphasis on challenging reasoning tasks).
The pace of open model releases has been dizzying this year, but this one is worth discussing for a number of reasons:
- It’s big! This is by far the largest open-weight model released to date. However, given the cost involved in fine-tuning and serving a model of this size and its relatively poor performance per parameter, I don’t think it will see significant adoption.
- They released the reward model. I don’t believe that any of the major open labs have released their reward models, which are essential to pushing model capabilities beyond those of the average annotator (the Llama 2 paper describes this phenomenon well) and can be considered part of the secret sauce alongside the data selection pipelines (see the sketch after this list).
- 98% of their data was synthetic. They only paid for 10k SFT examples and 10k preference pairs to be annotated by humans. Training on synthetic data is quite standard, but I think this ratio is unique, as is NVIDIA’s full disclosure of exactly how their data were generated.
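To make concrete why a released reward model matters: here is a minimal sketch of best-of-n reranking, where a reward model scores several sampled completions and the highest-scoring one is kept. The `generate` and `reward_model` callables are hypothetical placeholders, not NVIDIA’s actual interfaces.

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str, int], List[str]],
              reward_model: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate completions and return the one the reward model scores highest.

    This is the basic mechanism by which a good reward model can lift output quality
    above what the average annotator would write: annotators only need to *rank*
    responses, and the reward model generalizes that ranking to unseen completions.
    """
    candidates = generate(prompt, n)  # hypothetical sampler: returns n completions
    return max(candidates, key=lambda c: reward_model(prompt, c))
```

The same scoring function can also be reused to filter or rank synthetic training data, which is presumably part of why releasing it alongside the data-generation recipe is notable.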
GPT-4 finally dethroned by an open-weight model
DeepSeek, a quant fund turned AI company, released DeepSeek-Coder-V2 (paper, GitHub, chat), the first open-weight model that truly achieves SOTA for a broad domain. Always skeptical of published numbers, I fired a few really hard coding prompts at their chat interface that GPT-4 could not answer (I have a bunch of these from an old project; they mostly involve complicated translations from Solidity to Rust, two quasi-obscure languages), and it did indeed outperform both my previously saved GPT-4 results and those from the latest GPT-4o model.
After seeing this and going back to review their previous launches, I am very bullish on this project. They are achieving near-SOTA performance across many domains with a very sparse MoE model that they are selling for $0.28 per million output tokens, more than 50x cheaper than OpenAI and Anthropic. Wow.
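For a rough sense of the gap, a back-of-the-envelope check, assuming flagship output pricing of roughly $15 per million tokens for the OpenAI/Anthropic models at the time (my recollection, not a figure from the source):

```python
# Back-of-the-envelope price comparison; the flagship price is an assumption.
deepseek_price = 0.28   # USD per 1M output tokens (DeepSeek's published price)
flagship_price = 15.00  # USD per 1M output tokens, assumed for GPT-4o / Claude-class models

ratio = flagship_price / deepseek_price
print(f"~{ratio:.0f}x cheaper")  # ~54x, consistent with the ">50x" figure above
```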
Retro “AI” reveals that elephants call each other by name
Hot on the heels of researchers using modern semi-supervised techniques to decode meerkat speech, another group just showed that they could identify elephants’ names from their “rumblings” (apparently elephants speak via rumblings) significantly better than random chance with a fully supervised dataset and a random forest model. Their results aren’t stellar (15-30% accuracy) and their dataset is small (~600 rumblings), but if they hold, this is the first evidence we have of animals addressing one another by name.
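To illustrate the shape of the setup (this is not the authors’ actual pipeline or data), here is a minimal scikit-learn sketch: each rumble is summarized as an acoustic feature vector, a random forest is trained to predict which elephant the call was addressed to, and accuracy is compared against chance.

```python
# Illustrative sketch only: features and labels are random placeholders,
# not the authors' actual data or feature extraction.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Pretend dataset: ~600 rumbles, each reduced to a 32-dim acoustic feature
# vector, labeled with the identity of the elephant being addressed.
n_rumbles, n_features, n_elephants = 600, 32, 20
X = rng.normal(size=(n_rumbles, n_features))
y = rng.integers(0, n_elephants, size=n_rumbles)

clf = RandomForestClassifier(n_estimators=300, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)

chance = 1 / n_elephants
print(f"accuracy {scores.mean():.2f} vs chance {chance:.2f}")
# On real acoustic features, the paper reports ~15-30% accuracy, well above chance.
```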
No AI for Mickey D’s
McDonald’s, Wendy’s, Arby’s, and a few other fast food joints have been experimenting with replacing human drive-through order takers with AI. After a three-year pilot, McDonald’s announced that they are pausing the effort. On its face, understanding drive-through orders seems like a relatively simple problem given how constrained the menus are. Is the problem their technology partner, IBM? Or that the cost/embarrassment of an edge case is higher than that of a human taking the order? I’ll be keeping an eye out for an announcement from Wendy’s, which partnered with Google Cloud, to answer those questions.
Breaking: Claude 3.5 Sonnet drops, achieving SOTA on most academic test sets
Right as I was about to post, I saw that our former colleague Mike Krieger (co-founder and former CTO of IG, currently CPO at Anthropic) announced the release of Claude 3.5 Sonnet, a relatively small model (sizes not shared) that outperforms, in some cases significantly, both GPT-4o, the current leader, and Anthropic’s own large model, Claude 3 Opus, on most academic benchmarks.
We have talked extensively in this newsletter about how these academic benchmarks have lost most of their value [1, 2, 3], but this release is interesting for two reasons:
- The trend seems to be squeezing more performance out of smaller models. Google first showed this with Gemini 1.5 Pro outperforming Gemini 1.0 Ultra (and later with very strong performance from Gemini 1.5 Flash), and rumor has it that GPT-4o is significantly smaller than GPT-4 (a reasonable guess given its much faster inference and its free release to all consumer users).
- OpenAI’s lead seems to have evaporated. Users can now get excellent performance from Anthropic, Google, Yi, and even open-weight models like DeepSeek, Llama 3, and Nemotron at cut-rate prices. It’s hard for me to understand how the spoils of this AI race will go to the foundation model providers unless, like Google, they can monetize through their own existing products.