These notes are written for a Meta-internal audience. I quickly redact Meta-internal information before publishing but don’t take the time to rewrite. If a particular edition seems thin or a transition doesn’t make sense, this is why.
LLM as a meta-judge
Sometime last year, a bunch of people independently figured out that you could improve LLM responses by asking the model to reflect on its response. The hand-wavey rationale was something like “models have a fixed amount of reasoning per token, so if you ask the model to judge its own response, it will get more tokens and thus more reasoning.” This observation has since evolved into a bunch of different techniques, from self-rewarding to chain-of-thought, that rely on the same general principle.
FAIR just shared a fun paper, “Meta-Rewarding Language Models: Improving Alignment with LLM-as-a-Meta-Judge”, that adds one more layer to the recursive stack: they show that the model significantly improves (Llama 8B achieves parity with GPT-4 on several benchmarks) when it is not only asked to judge its response but also to judge the judge judging the response. One could ask, why stop there? Can we access another round of improvements by adding another layer? This all does feel quite “turtles all the way down”.
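For intuition, here is a minimal sketch of the three layers (actor, judge, meta-judge). This is my own illustration rather than the paper’s actual training pipeline, and `llm` is a hypothetical stand-in for whatever completion API you have on hand:

```python
def llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat/completion API call."""
    raise NotImplementedError

def answer_judge_metajudge(question: str) -> dict:
    # Layer 0 (actor): the model answers the question.
    response = llm(f"Answer the following question.\n\nQuestion: {question}")

    # Layer 1 (judge): the same model scores its own answer.
    judgment = llm(
        "You are a judge. Score the response on a 1-5 scale and explain why.\n\n"
        f"Question: {question}\n\nResponse: {response}"
    )

    # Layer 2 (meta-judge): the same model evaluates the judgment itself.
    # Roughly, the paper uses meta-judgments like this to build preferences
    # over judgments, so the judging ability improves alongside the answering
    # ability across training iterations.
    meta_judgment = llm(
        "You are a meta-judge. Assess whether the judgment below is fair and "
        "well reasoned.\n\n"
        f"Question: {question}\n\nResponse: {response}\n\nJudgment: {judgment}"
    )

    return {"response": response, "judgment": judgment, "meta_judgment": meta_judgment}
```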
ByteDance cracks dialogue models
ByteDance published an approach to giving an audio LLM the ability to listen while speaking. The model processes two tokens per time step: a speaking token and a listening token. It generates either a speaking token or an interruption token, which instructs the model to stop speaking and let the user finish. This is a relatively simple system, but it’s easy to imagine how it could be extended to something very capable of natural dialogue.
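Here is a rough sketch of what that decoding loop might look like, based only on the description above (the `model`, `mic_stream`, and `speaker` interfaces are hypothetical, not ByteDance’s actual API):

```python
from enum import Enum, auto

class Control(Enum):
    INTERRUPT = auto()  # model decides to stop speaking and yield the floor

def run_full_duplex(model, mic_stream, speaker):
    """One listening token in, one speaking (or interrupt) token out, per step."""
    state = model.initial_state()
    for listen_token in mic_stream:            # tokenized incoming user audio
        out_token, state = model.step(listen_token, state)
        if out_token is Control.INTERRUPT:     # user barged in: stop talking
            speaker.stop()
        else:
            speaker.play(out_token)            # otherwise keep speaking
```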
Interesting side note: the authors’ demo video uses a clip of InfoWars, which Wikipedia describes as “an American far-right conspiracy theory and fake news website owned by Alex Jones”, a personality famous for denying a tragic mass school shooting. Seems like a really weird choice.
Apple opens up (in more ways than one)
Applers were busy last week. They released:
- a technical report explaining how they trained their foundation models (seems mostly boilerplate to me at this point; I skimmed it and nothing caught my eye)
- an open weight and open data 7B model (this is really unique; the vast majority of open model releases do not include the dataset; this release will let researchers play with dataset construction in a way that really wasn’t possible previously with a model of this caliber)
- their system prompts (Apple Intelligence system prompts were found buried in plaintext in a beta OS release; nothing super noteworthy beyond aspirational instructions like “do not hallucinate”; with a very early version of Gemini, we found that instructing the model to tell the truth actually helped factuality; I wonder if this is still quantitatively true)
- the first iteration of Apple Intelligence (seems very limited and left reviewers wanting more; looking forward to seeing if Apple can truly understand and complete complex queries like the ones they showed in their launch video)
Google dethrones OpenAI after a slow start
OpenAI has dominated the leaderboards, including the often criticized Chatbot Arena [1, 2, 3, 4], for the last two years. A new sheriff is in town. Google’s latest Gemini 1.5 Pro is currently sitting at #1 in the Arena by a wide margin.
I spent 30 minutes running a few hard prompts through their playground and didn’t notice any difference between their model, GPT-4o, Sonnet 3.5, and Llama 405B, so I can’t say that this release is qualitatively better in my eyes, but I do like seeing a shakeup at the top (and see this as further evidence that we are beginning to saturate performance for even challenging human tasks).