Dan’s Weekly AI Speech and Language Scoop #2

These notes are written for a Meta-internal audience. I quickly redact Meta-internal information before publishing but don’t take the time to rewrite. If a particular edition seems thin or a transition doesn’t make sense, this is why.

Machines notch two more wins in the battle vs. man

Last week, I mentioned that DeepMind launched a call for highly specialized AI tutors (e.g., people with a PhD in a specific field) to help write SFT examples and evals because even the best human raters are significantly worse at these tasks than the best AIs.

In a similar vein, Klarna shared that their customer service AI is now doing the work of 700 full-time agents while cutting repeat inquiries by 25%, reducing time to resolution from 11 minutes to 2 minutes, and achieving equivalent CSAT [press release].

Both of these tidbits indicate that the best AIs can now outcompete all but the most specialized humans across unrelated domains. These data will impact how I think about annotation, customer feedback, and other tasks related to building an assistant that people can’t live without.

Community reviews of Gemma seem to confirm that “standard” LLM benchmarks have lost relevance

I suggested last week that Google’s Gemma OS models did not perform as well in practice as the Mistral and Phi counterparts they bested across a broad range of “standard” evals like MMLU. Since then, the community has spent a lot of time with these models and seems to have reached the same conclusion.

Claude 3 makes dramatic progress on GPQA

On a related note, Anthropic dropped their Claude 3 model. What leapt out at me was not the impressive numbers on the benchmarks I argued above have lost relevance, but rather the performance on GPQA. The team that built this benchmark designed it to be Google-proof, and it really is.

They asked PhD students in biology, chemistry, or physics to write multiple-choice questions in their own domains that they believed would be extremely challenging for experts in even closely adjacent domains to answer, even with 30 minutes and access to the internet. They were successful: for the best tier of the eval, 81% of questions were answered correctly by a peer specializing in the same field, but only 22% by a peer in an adjacent field (25% represents random chance).

GPT-4 achieved an only slightly better-than-random 36% accuracy, but Claude 3 made a large jump to 50.4%. This is really impressive. With Claude 3, we have gotten to the point where our AIs are significantly better than not just human raters or customer service employees, but also experts who have spent years studying a closely related field.

Note that I haven’t played with Claude 3 much yet, so I don’t have a POV on general model performance, but I am very impressed with their number on this scientific benchmark.

Super compression with 1-bit LLMs

Quantization has really accelerated over the last couple of years. Not that long ago, people felt like they were getting a good deal running half-precision models (fp16/bf16) to cut memory in half, speed up inference, and reduce power consumption. This week, researchers showed that they could quantize models to a single bit (really 1.58 bits, since the valid values are -1, 0, and 1, and log2(3) ≈ 1.58) with little degradation in quality. In addition to the obvious benefits from shrinking the weight matrices, 1.58-bit quantization lets a bunch of matrix multiplications be replaced with additions, which leads to further improvements in inference speed and cost.
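
To make that last point concrete, here is a minimal toy sketch (my own illustration, not the authors’ code) of why ternary weights eliminate multiplications: with every weight in {-1, 0, +1}, each output element is just a signed sum of the input activations.

```python
import numpy as np

# Toy illustration of ternary ("1.58-bit") weights: every weight is -1, 0, or +1,
# so each weight carries log2(3) ~= 1.58 bits of information.
rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))   # ternary weight matrix (out_dim x in_dim)
x = rng.standard_normal(8)             # input activations

# Standard matrix-vector product: multiplications plus additions.
y_matmul = W @ x

# Equivalent ternary form: no multiplications at all. Add the activations where
# the weight is +1, subtract them where it is -1, and skip the zeros.
y_ternary = np.array([x[row == 1].sum() - x[row == -1].sum() for row in W])

assert np.allclose(y_matmul, y_ternary)
```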

I’m looking forward to seeing how we can leverage these quantization strategies to bring as much compute as close to the user as possible. I think that a truly natural conversation can’t involve multiple round trips to the cloud, and this type of research can pave the path to better on-device inference.

[paper, HackerNews]

TTS Arena launches

User preference is the gold standard for evaluating many types of models these days. The LMSYS Chatbot Arena is the canonical leaderboard for LLMs, and the team recently launched a TTS equivalent.

[blog post, arena]
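
For context on how arena-style leaderboards turn raw votes into a ranking, here is a rough sketch of an Elo-style update from pairwise preference votes. This is my own simplification, not the LMSYS implementation, and the system names and votes are made up.

```python
from collections import defaultdict

def update_elo(ratings, winner, loser, k=32):
    """Apply one Elo update from a single pairwise preference vote."""
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += k * (1.0 - expected_win)
    ratings[loser] -= k * (1.0 - expected_win)

# Hypothetical vote log: each tuple is (preferred system, other system).
votes = [("tts_a", "tts_b"), ("tts_a", "tts_c"), ("tts_b", "tts_c")]

ratings = defaultdict(lambda: 1000.0)  # every system starts at the same rating
for winner, loser in votes:
    update_elo(ratings, winner, loser)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))  # highest-rated first
```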

LLM adapter achieves SOTA for lipreading

A team in South Korea published a very simple paper showing that they can achieve strong speech recognition performance through visual cues alone (lip reading) by training a visual encoder on top of a frozen Llama model.

[paper]
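
As a rough sketch of the general recipe, not the paper’s actual architecture or dimensions, the idea is to encode lip-region video features, project them into the frozen LLM’s embedding space, and train only the visual encoder and projection. The snippet below assumes a HuggingFace-style model that accepts inputs_embeds; the module choices and sizes are illustrative.

```python
import torch
import torch.nn as nn

class LipReadingAdapter(nn.Module):
    """Sketch of the adapter recipe: trainable visual front-end, frozen LLM.

    Dimensions and module choices here are illustrative, not the paper's.
    """
    def __init__(self, llm, visual_dim=512, llm_dim=4096):
        super().__init__()
        self.llm = llm
        for p in self.llm.parameters():   # keep the language model frozen
            p.requires_grad = False
        # Trainable pieces: a small temporal encoder over per-frame lip features
        # plus a linear projection into the LLM's token-embedding space.
        self.visual_encoder = nn.GRU(visual_dim, visual_dim, batch_first=True)
        self.proj = nn.Linear(visual_dim, llm_dim)

    def forward(self, video_feats, text_embeds):
        # video_feats: (batch, frames, visual_dim) lip-region features
        # text_embeds: (batch, tokens, llm_dim) embedded text prompt/targets
        enc, _ = self.visual_encoder(video_feats)
        visual_tokens = self.proj(enc)                           # (batch, frames, llm_dim)
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # visual "prompt" first
        return self.llm(inputs_embeds=inputs)
```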