Dan’s Weekly AI Speech and Language Scoop #22

These notes are written for a Meta-internal audience. I quickly redact Meta-internal information before publishing but don’t take the time to rewrite. If a particular edition seems thin or a transition doesn’t make sense, this is why.

More critiques of the Chatbot Arena

We’ve been pretty hard on the Chatbot Arena recently. What was once the marquee LLM eval has been relegated to a guardrail [123]. And things don’t look like they will turn around any time soon.

After a bunch of digging, the community has generally come to the conclusion that GPT-4o mini scored so unexpectedly high due to:

  1. Lack of refusals
  2. Response length
  3. Formatting

These “hacks” actually take precedence over traditional metrics of model quality. Top labs of course know this, so every model can be optimized for Chatbot Arena friendliness instead of, or in addition to, real business metrics (likely including the latest Gemma-2-2B release, which beats GPT-3.5 and Mixtral).
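To see how blunt these effects are, here is a toy sketch of the kind of check people have been running. The battles below are simulated (not real Arena data) and length is the only feature, but the idea carries over: regress "which answer won" on the length gap alone and watch the coefficient come out large.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated arena-style battles; a real check would use the released
    # Chatbot Arena conversations instead.
    n = 5000
    len_a = rng.integers(50, 2000, n)
    len_b = rng.integers(50, 2000, n)
    # Make the simulated vote mostly length-driven to mimic the bias.
    p_a_wins = 1 / (1 + np.exp(-(len_a - len_b) / 400))
    a_wins = (rng.random(n) < p_a_wins).astype(float)

    # One-feature logistic regression by gradient descent (no sklearn needed):
    # does the length gap alone predict the human vote?
    x = (len_a - len_b) / 1000.0  # length gap in thousands of characters
    w, b = 0.0, 0.0
    for _ in range(2000):
        p = 1 / (1 + np.exp(-(w * x + b)))
        w -= 0.5 * np.mean((p - a_wins) * x)
        b -= 0.5 * np.mean(p - a_wins)

    print(f"log-odds of winning per extra 1k characters: {w:.2f}")
    # A strongly positive coefficient means "write longer" is a ranking hack
    # before anyone has looked at answer quality.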

In fact, we now have OSS instructions for doing this. This team showed that they could boost LLM-as-a-judge scores on all of the popular chat benchmarks by aligning response format with human preferences, which is easy to do given the Chatbot Arena dataset. While they didn’t explicitly enter their models into the Arena, I have no doubt that these autoeval gains would translate into human preference.
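To make the trick concrete, here is a hypothetical sketch (my own toy functions, not the team’s code): take an answer, re-package it with judge-friendly structure, and drop both versions into a generic pairwise judge prompt. Same facts, different formatting.

    def reformat_for_judge(answer: str) -> str:
        """Re-present the same content with judge-friendly structure."""
        points = [s.strip().rstrip(".") for s in answer.split(". ") if s.strip()]
        bullets = "\n".join(f"- **{p}**" for p in points)
        return f"### Answer\n\n{bullets}\n\n### TL;DR\n\n{points[0]}."

    def pairwise_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
        """Generic A/B judge prompt; a real eval would send this to an LLM judge."""
        return (
            "You are judging two assistant answers to the same question.\n"
            f"Question: {question}\n\n"
            f"Answer A:\n{answer_a}\n\n"
            f"Answer B:\n{answer_b}\n\n"
            'Reply with "A" or "B", whichever answer is better.'
        )

    plain = "Paris is the capital of France. It sits on the Seine. It hosts the Louvre."
    print(pairwise_judge_prompt("What is the capital of France?",
                                plain, reformat_for_judge(plain)))
    # Identical facts, different packaging; a format-biased judge (or voter)
    # will reliably prefer B.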

But Meta’s 405B model is sitting in 3rd or 4th place, so forget about what I just wrote and let’s celebrate anyway.

(In all seriousness, this critique of the Chatbot Arena is largely pointed toward companies that are building models for less social use-cases. I think that “pleasant to chat with” is actually a totally reasonable metric that Meta should aim for.)

Are Nature publishers ok? Do AI models really collapse when trained on recursively generated data?

I previously shared a seemingly routine paper that made it into Nature on using a semantic similarity metric to detect hallucinations. This week, another Nature paper set the AI world ablaze. I need to read through their theoretical arguments in more detail, but they basically show that naively training on the output of a model over successive generations will degrade the quality of the model.
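The mechanism is easy to see in a toy setting (my own illustration, not the paper’s language-model experiments): fit a distribution to data, sample a synthetic dataset from the fit, refit on the samples, and repeat.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=0.0, scale=1.0, size=50)   # generation 0: "human" data, std = 1

    for generation in range(1, 51):
        mu, sigma = data.mean(), data.std()          # "train" a model on the current data
        data = rng.normal(mu, sigma, size=50)        # next generation sees only model output
        if generation % 10 == 0:
            print(f"generation {generation:2d}: fitted std = {sigma:.3f}")
    # The spread tends to drift toward zero as estimation error compounds:
    # the tails of the original distribution vanish, i.e. "model collapse".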

Similar experiments have been published previously (perhaps their theoretical treatment is the novel part?), but what I (and, I think, the rest of the world working in this area) found funny was that the paper never mentions that every top lab is now successfully (though not “indiscriminately”) training on synthetic data and has been for years. I’m really surprised and disappointed that Nature’s editors didn’t ask the authors to contextualize this work.
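For what it’s worth, even in the toy setting above, a crude stand-in for what labs actually do (curating and mixing in real data rather than training indiscriminately on model output) is enough to stop the collapse.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(0.0, 1.0, size=50)

    for generation in range(1, 51):
        mu, sigma = data.mean(), data.std()
        synthetic = rng.normal(mu, sigma, size=50)        # model output
        fresh_real = rng.normal(0.0, 1.0, size=50)        # keep some grounding in real data
        data = np.concatenate([synthetic, fresh_real])    # curated mix, not model output alone
    print(f"std after 50 mixed generations: {data.std():.3f}")   # stays close to 1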

In a totally unconflicted statement (/s), Alexandr Wang, the CEO of Scale, a company that makes billions of dollars renting out humans to annotate data, shared his support for the paper in a {Tweet,X}Storm and let practitioners know that they should keep paying him.