Dan’s Weekly AI Speech and Language Scoop #8

These notes are written for a Meta-internal audience. I quickly redact Meta-internal information before publishing but don’t take the time to rewrite. If a particular edition seems thin or a transition doesn’t make sense, this is why.

Synthetic data and more thoughts on the future of human annotation 

A persistent theme in this newsletter is tracking the battle of man vs. machine across a variety of tasks [1, 2, 3]. Unfortunately for this author and his co-workers, the machines appear to be doing quite well. This week, a couple of pieces of evidence dropped indicating that this trend shows no sign of reversing.

First, a team at Google released a survey on generating synthetic data for LLMs. Since it is a survey, nothing in it is novel, but the whole package makes a compelling case that synthetic data outperforms human-generated data across a wide range of domains. They conclude with “as we approach human-level or even superhuman-level intelligence, obtaining synthetic data becomes even more crucial, given that models need better-than-average-human quality data to progress.”

Second, the WizardLM team at Microsoft released the second generation of their model, fine-tuned with synthetic data and rebased on Mixtral 8x22B, and nearly achieved GPT-4-level performance with a truly open system. To my knowledge, this was the first group to both recognize the power of generating synthetic instruction-tuning datasets and achieve (at the time) SOTA results, so it’s not surprising that they’ve done it again. They close their paper with, “as the natural world’s human data becomes increasingly exhausted through LLM training, we believe that: the data carefully created by AI and the model step-by-step supervised by AI will be the sole path towards more powerful AI.”
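
To make the recipe concrete, here is a minimal sketch of the general idea: growing an instruction-tuning set by having an LLM rewrite and then answer its own prompts, loosely in the spirit of Evol-Instruct. This is my own illustration, not WizardLM’s actual pipeline, and the `generate` function is a hypothetical stand-in for whatever LLM API you call.

```python
# Minimal sketch of LLM-generated instruction data (my own illustration,
# loosely in the spirit of Evol-Instruct; not WizardLM's actual pipeline).
import json
import random

def generate(prompt: str) -> str:
    """Hypothetical placeholder: swap in a call to whatever LLM API you use."""
    return f"<LLM output for: {prompt[:40]}...>"

EVOLVE_TEMPLATES = [
    # Make a seed instruction harder or more constrained.
    "Rewrite this instruction so it requires multi-step reasoning:\n{instruction}",
    "Add a realistic constraint to this instruction:\n{instruction}",
    # Broaden coverage with a brand-new instruction in the same domain.
    "Write a new, unrelated instruction in the same topic area as:\n{instruction}",
]

def evolve_dataset(seed_instructions: list[str], rounds: int = 2) -> list[dict]:
    """Grow a synthetic instruction-tuning set from a small human-written seed."""
    pool = list(seed_instructions)
    examples = []
    for _ in range(rounds):
        next_pool = []
        for instruction in pool:
            template = random.choice(EVOLVE_TEMPLATES)
            evolved = generate(template.format(instruction=instruction))
            response = generate(evolved)  # the same (or a stronger) model writes the answer
            examples.append({"instruction": evolved, "response": response})
            next_pool.append(evolved)
        pool = next_pool
    return examples

if __name__ == "__main__":
    seeds = ["Explain what a Mixture-of-Experts layer is."]
    print(json.dumps(evolve_dataset(seeds, rounds=1), indent=2))
```

In a real pipeline the expensive parts are the ones this sketch skips: filtering and deduplicating the generated pairs, and having a stronger model (or reward model) judge the answers before anything reaches training.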

Based on these papers, both teams’ closing statements, and my own experience, I think there is now a consensus that synthetic data will continue to play a huge role in training more powerful AIs.

LLMs only store 2 bits of knowledge per parameter?

Meta’s own Zeyuan Allen-Zhu published an interesting paper, which I can’t say I totally understand, arguing that LLMs asymptotically store 2 bits of knowledge per parameter, independent of quantization (down to int8). I think that this (along with our recent discussions of 1.58-bit quantization and dropping layers) indicates that these models are overparameterized and that there should be a way to get closer to the theoretical limit described by information theory.
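
To get a feel for what 2 bits per parameter implies, here is some back-of-envelope arithmetic (my own, not a calculation from the paper): at fp16 we spend 16 bits of storage per parameter to hold roughly 2 bits of knowledge, an ~8x gap, and even int8 leaves a 4x cushion, which is at least consistent with quantization not hurting capacity.

```python
# Back-of-envelope arithmetic for the ~2 bits of knowledge per parameter claim.
# My own illustration of the storage-vs-knowledge gap, not a calculation from the paper.

BITS_OF_KNOWLEDGE_PER_PARAM = 2.0  # the paper's asymptotic estimate

for num_params in (7e9, 70e9):
    knowledge_gib = num_params * BITS_OF_KNOWLEDGE_PER_PARAM / 8 / 2**30
    fp16_gib = num_params * 16 / 8 / 2**30  # what we actually store on disk
    int8_gib = num_params * 8 / 8 / 2**30
    print(
        f"{num_params / 1e9:.0f}B params: ~{knowledge_gib:.1f} GiB of knowledge "
        f"vs {fp16_gib:.0f} GiB of fp16 weights ({fp16_gib / knowledge_gib:.0f}x), "
        f"or {int8_gib:.0f} GiB at int8 ({int8_gib / knowledge_gib:.0f}x)"
    )
```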

Encoder-decoder architectures are back?

Last week, there was a flurry of LLM releases, including a new GPT-4 model that has retaken the top spot on the Chatbot Arena leaderboard, a Mixtral 8x22B that seems to have dethroned DBRX as the top open-weight model days after its release, and a few others. These launches were all covered better elsewhere, but a detail in the Reka technical report caught my eye. They released a GPT-4-competitive model called Core that appears to use an encoder-decoder architecture like T5 for all modalities, including text. Since the dawn of the GPTs, the vast majority of language models have relied on a decoder-only architecture, so it’s interesting to see something new, especially from a new lab that released what appears to be quite a capable model on their first shot. I look forward to seeing what they can do!
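
As a refresher on the distinction, here is a toy contrast between the two layouts in PyTorch (my own generic sketch, not Reka Core’s actual architecture or dimensions): the encoder-decoder reads the input with bidirectional self-attention and lets the decoder cross-attend to it, while the decoder-only model runs one causal stack over the whole concatenated sequence.

```python
# Toy contrast between encoder-decoder and decoder-only layouts.
# A generic sketch, not Reka Core's actual architecture or dimensions.
import torch
import torch.nn as nn

d_model, nhead = 64, 4
src = torch.randn(1, 10, d_model)  # "input" tokens (prompt, or encoded image/audio)
tgt = torch.randn(1, 5, d_model)   # tokens generated so far

# Encoder-decoder (T5-style): separate stacks, decoder cross-attends to the encoder output.
enc_dec = nn.Transformer(d_model=d_model, nhead=nhead,
                         num_encoder_layers=2, num_decoder_layers=2,
                         batch_first=True)
tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
out_ed = enc_dec(src, tgt, tgt_mask=tgt_mask)  # shape (1, 5, 64): one state per target position

# Decoder-only (GPT-style): a single causal stack over the concatenated sequence.
# (PyTorch's "encoder" layer is just self-attention + MLP, which is all a decoder-only LM uses.)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
dec_only = nn.TransformerEncoder(layer, num_layers=2)
seq = torch.cat([src, tgt], dim=1)
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
out_do = dec_only(seq, mask=causal_mask)  # shape (1, 15, 64)

print(out_ed.shape, out_do.shape)
```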