These notes are written for a Meta-internal audience. I quickly redact Meta-internal information before publishing but don’t take the time to rewrite. If a particular edition seems thin or a transition doesn’t make sense, this is why.
GAIA-style benchmarks for voice assistants?
Last fall, Meta’s own Gregoire Mialon and team released GAIA, an interesting AI benchmark that tests models’ abilities to perform relatively simple tasks requiring reasoning, multimodality, tool use, search, and more. I only recently stumbled upon their work because it was presented at ICLR (how about that for an academic review cycle?). I don’t want to leak any of their prompts here (they are available behind a login on HuggingFace), but they are similar to, “I was trying to remember which model provider first published that silly diagram with a made-up performance benchmark on the y-axis, some kind of compute metric on the x-axis, and a magical triangle in the upper-left that showed they were best.”
I struggle with these types of queries constantly in my everyday life: “What was that one thing that I saw on the internet and is now relevant to the conversation I’m having?” Looking forward to the day when our assistants can handle these.
Have frontier models plateaued?
OpenAI released GPT-4o and Google released Gemini 1.5 Pro at their recent developer conferences. While improved model performance took a back seat to really impressive voice and multimodal interfaces, both models did significantly lift their scores in the Chatbot Arena.
However, we saw a few interesting trends emerge:
- GPT-4o debuted 60 Elo points above GPT-4 (and even more on challenging subsets), but that gap has dropped to 30 since launch (see the sketch after this list for what these gaps mean head-to-head).
- Gemini 1.5 Pro beat GPT-4 by 10 points and took the second-place slot. These models are clearly clustering around the same level of performance.
- Gemini 1.5 Pro is tied with Gemini Advanced. The latter has access to the internet while the former does not. When Google first submitted an online model, Bard (Gemini Pro), to the arena, it beat its offline equivalent, Gemini Pro, by almost 100 Elo points. Are models getting advanced enough that search grounding no longer provides this lift?
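To put those gaps in perspective, the standard Elo expectation formula converts a rating difference into an expected head-to-head win rate. This is my own back-of-the-envelope sketch, not Chatbot Arena’s exact fitting procedure (they use a Bradley-Terry model with confidence intervals):

```python
# Rough sketch: convert an Elo rating gap into an expected head-to-head win rate.
# Standard Elo expectation formula; Chatbot Arena's actual methodology differs in detail.

def expected_win_rate(elo_gap: float) -> float:
    """Probability that the higher-rated model wins a single comparison."""
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400))

for gap in (100, 60, 30, 10):
    print(f"{gap:>4} Elo points -> {expected_win_rate(gap):.1%} expected win rate")

# ~100 pts ≈ 64%, ~60 pts ≈ 59%, ~30 pts ≈ 54%, ~10 pts ≈ 51% --
# a 10-30 point lead is close to a coin flip in head-to-head votes.
```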
My takeaway is that, between GPT-4o and Gemini 1.5 Pro scoring essentially the same and Gemini Advanced not outperforming Gemini 1.5 Pro, we may be reaching some kind of saturation point with these models, one I would argue Meta has essentially reached with Llama 3. I will be very curious to see whether we can break through this asymptote and, if we can, what new use cases the significantly more advanced models will unlock.
Anthropic’s peek inside of Claude’s brain
Anthropic published a really interesting paper showing that they could identify networks of neurons in their Claude models that fire on fine-grained concepts, e.g. “transit infrastructure” or “sycophantic praise”. I didn’t take the time to really understand how they did this, but I found it really neat that they could turn the activations associated with various concepts up or down and meaningfully tune the model’s response. This kind of fine-grained control could be really useful in guiding the behavior of AI assistants.
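For a flavor of what that kind of dial-turning can look like mechanically, here is a minimal sketch of nudging a model’s hidden states along a “concept direction” with a PyTorch forward hook. This is my own illustration on GPT-2 with a placeholder random direction, not Anthropic’s actual sparse-autoencoder features or code:

```python
# Minimal activation-steering sketch: add a scaled "concept direction" to one
# transformer block's output during generation. Illustrative only -- Anthropic
# derives directions from sparse autoencoders trained on Claude; here the
# direction is just a random placeholder vector applied to GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER = 6        # which transformer block's output to modify
STRENGTH = 4.0   # how hard to "turn the dial" on the concept

# Placeholder for a learned concept direction (e.g. "transit infrastructure").
concept_direction = torch.randn(model.config.n_embd)
concept_direction /= concept_direction.norm()

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the hidden states.
    hidden = output[0] + STRENGTH * concept_direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    ids = tokenizer("The best way to get across the bay is", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=30, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so the model returns to normal behavior
```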