These notes are written for a Meta-internal audience. I quickly redact Meta-internal information before publishing but don’t take the time to rewrite. If a particular edition seems thin or a transition doesn’t make sense, this is why.
You can’t hide from the Scale private leaderboards
The difficulty in evaluating LLMs has been one of the key themes of this newsletter [1, 2, 3, 4, 5, 6]. We started with academic evals, which are now both too easy and too easily gamed. We moved to the Chatbot Arena, which I think still has some predictive power but really rewards fun, chatty models like Llama 3.
This week, Scale.ai, a large data annotation company, announced a series of private evals across a number of tasks including coding, instruction following, and math. Because of Scale’s extensive experience in generating SFT/reward data and evaluating models for all the large players, I believe that this is now the gold standard (and will remain so until companies start fitting to their positions on this leaderboard).
They show the usual players at the top of most categories, but interestingly Llama 3 70B takes 3rd in the instruction-following category.
Anthropic (finally) ships tool use
OpenAI (and I think Google?) have offered a tool-use API for a while now. The idea is that you pass a name, description, and schema for each tool the LLM can use, e.g. a calculator, code interpreter, or weather API, and the model uses its reasoning abilities to figure out when to call one. When that happens, the model generates the function arguments and pauses generation. The client then runs the function with those arguments and returns the result to the LLM, which uses it to complete the response. This approach works really well for a lot of domains where LLMs struggle, e.g. math, so Anthropic decided to join the party.
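The loop above can be sketched in a few lines of Python. This is a hypothetical, provider-agnostic sketch, not Anthropic's or OpenAI's actual API: the tool schema shape, the `fake_model` stub, and the message roles are all assumptions standing in for a real LLM client.

```python
import json

# Hypothetical tool definition in the general style of tool-use APIs:
# a name, a description, and a JSON schema for the arguments.
CALCULATOR_TOOL = {
    "name": "calculator",
    "description": "Evaluate a basic arithmetic expression.",
    "input_schema": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
}

def run_tool(name, arguments):
    """Client-side tool execution. The model never runs code itself;
    it only emits a tool name plus arguments for the client to run."""
    if name == "calculator":
        # A real client would use a safe expression parser, not eval().
        return str(eval(arguments["expression"], {"__builtins__": {}}))
    raise ValueError(f"unknown tool: {name}")

def fake_model(messages, tools):
    """Stand-in for the LLM API call. On the first turn it 'decides' to
    call the calculator and pauses; once a tool result is present in the
    conversation, it completes the response."""
    last = messages[-1]
    if last["role"] == "tool":
        return {"role": "assistant", "content": f"The answer is {last['content']}."}
    return {
        "role": "assistant",
        "tool_call": {"name": "calculator",
                      "arguments": {"expression": "19 * 23"}},
    }

def chat_with_tools(user_text, tools):
    """The tool-use loop: call the model, execute any tool call it
    emits, feed the result back, repeat until a plain answer arrives."""
    messages = [{"role": "user", "content": user_text}]
    while True:
        reply = fake_model(messages, tools)
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]
        result = run_tool(call["name"], call["arguments"])
        messages.append(reply)
        messages.append({"role": "tool", "content": result})

print(chat_with_tools("What is 19 * 23?", [CALCULATOR_TOOL]))
# → The answer is 437.
```

The key design point is that the model only ever produces structured arguments; all side effects (running code, hitting a weather API) happen on the client, which keeps the model sandboxed.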
SSMs for speech
This week was a big one for SSMs. Cartesia.ai shared some really promising speech generation numbers using SSMs. Their founder argues that “SSMs generally crush on data derived from continuous signals”. Given that speech is a continuous signal, maybe there’s something here for us?
Shortly afterward, their chief scientist shared the second version of their open-source SSM LLM, Mamba2. I need to read the paper carefully and understand how these things actually work, but they showed low perplexity for a given compute budget and a few other positive characteristics. All of this should probably be taken with a grain of salt given their financial incentives to promote SSMs, but it seems like an interesting trend to follow.
GPT4 will trade stocks for you, or how to capture headlines with an academic paper
A group at the University of Chicago business school just published “Financial Statement Analysis with Large Language Models”, showing that GPT4 plus their special prompt can profitably pick stocks by analyzing financial statements. This work got picked up in Forbes, VentureBeat, Yahoo News, and a bunch of other news sites, and all over social media.
Tragically for our portfolios, the baselines they used are a logistic regression model from 1989 and a “SOTA” 900k-parameter ANN, and even then GPT4 barely outperformed. Hilariously, all four techniques beat a human analyst. Another win for the machines?