Dan’s Weekly AI Speech and Language Scoop #11

These notes are written for a Meta-internal audience. I quickly redact Meta-internal information before publishing but don’t take the time to rewrite. If a particular edition seems thin or a transition doesn’t make sense, this is why.

More on automated evals or how to earn yourself $100k

Last week Scale AI, a data annotation company, released GSM1k, a private benchmark built in the style of GSM8k (grade school math), to understand the impact of overfitting on the scores of various models. They draw a couple of interesting conclusions:

  • Certain model families, notably the Mistral and Phi series, systematically outperform on GSM8k relative to GSM1k, indicating some kind of systematic overfitting (a rough sketch of this comparison follows the list). By contrast, frontier models like GPT-4 and Claude 3 Opus perform similarly on both.
  • Data contamination does not cause most of the overfitting. It would be easy to jump to the conclusion that the overfitting models simply accidentally introduced the test data into training somewhere. Rather, the authors suggest that overfitting is due to more subtle factors like model builders collecting data similar to these benchmarks or using scores on benchmarks to select model checkpoints. In my mind, these observations continue to reinforce how the value of these evals decays steeply over time.
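
To make the first point concrete, the check amounts to comparing per-model accuracy on the public set against a held-out private set and flagging large gaps. A minimal sketch; the model names, scores, and 5-point threshold are made-up placeholders, not numbers from the Scale AI report:

```python
# Hypothetical sketch: flag models whose public-benchmark score is much higher
# than their score on a held-out private benchmark. The accuracy numbers below
# are illustrative placeholders, not figures from the Scale AI report.
public_acc = {"model_a": 0.82, "model_b": 0.74, "model_c": 0.88}   # e.g. GSM8k
private_acc = {"model_a": 0.71, "model_b": 0.73, "model_c": 0.87}  # e.g. a GSM1k-style private set

for name in public_acc:
    gap = public_acc[name] - private_acc[name]
    status = "possible overfitting" if gap > 0.05 else "consistent"
    print(f"{name}: public={public_acc[name]:.2f} private={private_acc[name]:.2f} "
          f"gap={gap:+.2f} -> {status}")
```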

What can we do about this? Human raters are the obvious fallback, but they are also quite fallible (and slow and expensive). LLM-based auto-raters do really well for more structured tasks but don’t always capture the je ne sais quoi that humans expect in a chatbot. To solve this problem, LMSYS and Kaggle are partnering to sponsor a $100k competition to predict human preference in the Chatbot Arena.
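
If you’re wondering what an entry might even look like, here is a minimal baseline sketch. The data layout, features, and toy examples are my own assumptions rather than the actual competition schema; a serious entry would presumably fine-tune a reward model or LLM judge instead of hand-crafted surface features.

```python
# Minimal baseline sketch for pairwise preference prediction, in the spirit of
# the LMSYS/Kaggle task: given a prompt and two responses, predict which one the
# human picked. The fields and features here are assumptions, not the real schema.
from sklearn.linear_model import LogisticRegression

def features(prompt: str, resp_a: str, resp_b: str) -> list:
    # Crude surface features; a real entry would use a fine-tuned reward model or LLM judge.
    return [
        len(resp_a) - len(resp_b),                # length difference
        resp_a.count("\n") - resp_b.count("\n"),  # structure (lists, line breaks)
        resp_a.lower().count("cannot") - resp_b.lower().count("cannot"),  # refusals
    ]

# Toy training rows: label 0 = human preferred response A, 1 = preferred B.
train = [
    ("What is the capital of Colorado?", "Denver.", "I cannot help with that.", 0),
    ("Write a haiku about rain.", "No.", "Soft rain on tin roofs, / gutters hum a gray chorus.", 1),
]
X = [features(p, a, b) for p, a, b, _ in train]
y = [label for *_, label in train]

clf = LogisticRegression().fit(X, y)
test = features("Explain transformers simply.", "A careful multi-paragraph answer...", "I cannot answer that.")
print(clf.predict([test]))  # 0 = predicts the human prefers response A
```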

Hybrid LLM routing for giant drops in latency and inference cost

Some queries are complex: “What are the lifetime emissions of an F-250 relative to a transatlantic flight?” Some are simple: “What is the capital of Colorado?” Is it possible to route the former to a large model capable of complex reasoning and the latter to a small model capable of delivering low-latency responses on cheap hardware?

The answer is yes. A team recently showed that they could drop the cost of inference by up to 99% for a 2% performance impact by training a router to shift queries between Llama 7B and 13B.
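
I haven’t dug into their router implementation, but the shape of the idea is easy to sketch. Everything below is a toy stand-in: the paper trains the router rather than hard-coding heuristics, and the two model calls are hypothetical stubs for whatever endpoints you actually run.

```python
# Toy sketch of query routing between a small and a large model. The router here
# is a hard-coded heuristic and the model calls are hypothetical stubs; a real
# system would train the router and point the stubs at real inference endpoints.

def route_score(query: str) -> float:
    """Return a rough 'complexity' score in [0, 1]."""
    signals = [
        len(query.split()) > 10,                                            # longer questions tend to be harder
        any(w in query.lower() for w in ("relative", "compare", "trade-off", "emissions")),
        query.count(",") + query.count("?") > 2,                            # multi-part questions
    ]
    return sum(signals) / len(signals)

def call_large_model(q: str) -> str:  # hypothetical stub, e.g. a large model for complex reasoning
    return f"[large model] {q}"

def call_small_model(q: str) -> str:  # hypothetical stub, e.g. a small model on cheap hardware
    return f"[small model] {q}"

def answer(query: str, threshold: float = 0.5) -> str:
    model = call_large_model if route_score(query) >= threshold else call_small_model
    return model(query)

print(answer("What is the capital of Colorado?"))
print(answer("What are the lifetime emissions of an F-250 relative to a transatlantic flight?"))
```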

To sweeten the pot, Meta’s own FAIR just showed that we can both improve quality and reduce latency by predicting multiple tokens at the same time. Perhaps these techniques could be combined to deliver a big latency win?
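
If you haven’t read the FAIR paper, the core idea is to bolt several output heads onto a shared trunk, with head i trained to predict the token i steps further ahead; at inference the extra heads can drive self-speculative decoding for the latency win. A toy sketch, where the GRU trunk and all the sizes are placeholders rather than the paper’s architecture:

```python
# Toy sketch of multi-token prediction: a shared trunk feeds k output heads, and
# head i is trained to predict the token i+1 steps ahead. The GRU trunk and all
# sizes are placeholders; the paper attaches the heads to a full transformer.
import torch
import torch.nn as nn

class MultiTokenPredictor(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, k: int = 4):
        super().__init__()
        self.trunk = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for a transformer trunk
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab_size) for _ in range(k)])

    def forward(self, x: torch.Tensor):
        h, _ = self.trunk(x)                      # (batch, seq, d_model)
        return [head(h) for head in self.heads]   # k logit tensors, one per future offset

model = MultiTokenPredictor(d_model=64, vocab_size=1000, k=4)
x = torch.randn(2, 16, 64)                        # (batch, seq, embedding dim)
logits = model(x)
print([tuple(t.shape) for t in logits])           # 4 x (2, 16, 1000)
```

During training each head gets its own shifted next-token loss; at inference you can keep only the first head for standard decoding or use the extras to draft tokens speculatively.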

The ultimate self-training

Some researchers at Google published a funny way to improve LLM pretraining data: just ask the model. In practice, I don’t think this makes much sense, because the compute spent selecting the training data could probably be better spent training the model. Still, they show that the model has some awareness of what it needs to improve, and that small models can reach the same perplexity with far fewer training tokens.
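
As I understand it, the recipe is roughly: prompt the model with each candidate document, ask whether it would be useful to train on, and keep the highest-scoring fraction. A rough sketch below; the prompt wording and the stand-in scoring function are my own guesses, not the paper’s exact setup, which would score the model’s own judgement with a real LLM call per document.

```python
# Sketch of "just ask the model" data selection: score each candidate document by
# asking whether it would be useful to train on, then keep the top fraction.
# score_with_model is a stand-in; the real method would be one forward pass of an
# LLM per document, scoring something like its probability of answering "yes".

PROMPT = (
    "Here is a candidate pretraining document:\n\n{doc}\n\n"
    "Would training on this document improve a language model? Answer yes or no."
)

def score_with_model(doc: str) -> float:
    # Placeholder so the sketch runs end to end: a trivial length proxy instead of
    # an actual LLM call on PROMPT.format(doc=doc).
    return min(len(doc.split()) / 100.0, 1.0)

def select_training_data(docs: list, keep_fraction: float = 0.2) -> list:
    ranked = sorted(docs, key=score_with_model, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]

docs = [
    "buy cheap followers now click here",
    "The derivative of sin(x) is cos(x), which follows from the limit definition and the angle-addition formula.",
    "asdf asdf asdf asdf",
]
print(select_training_data(docs, keep_fraction=0.4))
```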

KAN: Kolmogorov–Arnold Networks

I’m sure this paper has been flying around your feed as well. I have no real understanding of the Kolmogorov-Arnold representation theorem that underlies these models, but they seem like a really promising MLP replacement. Maybe some generous smart person can explain these to me?