These notes are written for a Meta-internal audience. I quickly redact Meta-internal information before publishing but don’t take the time to rewrite. If a particular edition seems thin or a transition doesn’t make sense, this is why.
Giant models are all you need?
While none of these papers are new, Ethan Mollick renewed the conversation on social by pointing to a somewhat obscure study showing that GPT4 handily outperformed BloombergGPT, a custom model trained on Bloomberg’s proprietary financial datasets, across a wide range of financial tasks. I noticed something similar when I was at Google, where GPT4 outperformed MedPaLM on many clinical-style prompts.
While the comparison isn’t totally apples-to-apples, since BloombergGPT and MedPaLM are older, smaller models trained on fewer tokens, I do think both examples show how hard it is to outperform a top frontier model, even with a strong proprietary dataset.
Compute is all you need (for speech)?
I stumbled upon a fun essay written by Rich Sutton in 2019, “The Bitter Lesson,” which argues that we should stop trying to encode human knowledge into our systems and instead throw general methods and compute at the problem. Since 2019, I would argue that we have largely followed his advice, but it was a controversial take at the time. Noam Shazeer and his colleagues at Google famously couldn’t convince leadership to invest in scaling language models (arguably letting OpenAI take the lead), and we at Meta were split on whether to invest in end-to-end speech recognition models that discarded the linguistic information we had previously encoded into our systems.
Correlating benchmarks to human preference
We’ve talked a lot in this newsletter about the value of academic benchmarks in predicting model performance [1, 2, 3]. My general take is that they are very coarse-grained tools for understanding which class a model falls into (is it competitive with GPT3.5? GPT4?), but that most of the top labs are implicitly or explicitly gaming the benchmarks.
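For readers who want to see what “correlating benchmarks to human preference” looks like mechanically, here is a minimal sketch: it rank-correlates a benchmark score against a human-preference rating (an arena-style Elo is assumed as the proxy) across a handful of models. The model names and all numbers are invented purely for illustration, not drawn from any real leaderboard.

```python
# Minimal sketch: how strongly does a benchmark's ranking of models agree
# with a human-preference ranking? Spearman rank correlation captures
# agreement in ordering, which is the coarse-grained signal discussed above.
from scipy.stats import spearmanr

# Hypothetical scores keyed by model name (illustrative numbers only).
benchmark_scores = {"model_a": 86.4, "model_b": 70.0, "model_c": 67.0, "model_d": 81.2}
human_preference_elo = {"model_a": 1250, "model_b": 1115, "model_c": 1120, "model_d": 1180}

# Align the two score lists by model so the pairs match up.
models = sorted(benchmark_scores)
rho, p_value = spearmanr(
    [benchmark_scores[m] for m in models],
    [human_preference_elo[m] for m in models],
)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

A high rho means the benchmark at least preserves the ordering humans prefer, even if the absolute scores say little; a low rho is a hint that the benchmark is being gamed or is measuring something people don’t actually care about.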