These notes are written for a Meta-internal audience. I quickly redact Meta-internal information before publishing but don’t take the time to rewrite. If a particular edition seems thin or a transition doesn’t make sense, this is why.
MobileLLM unlocks on-device chatbot use-cases
I believe that latency is one of the biggest problems slowing the adoption of chatbots. This week researchers showed that they could achieve performance similar to Llama-7B on an API-calling task with only 125M parameters quantized to 1.57 bits, for a total model size of 16MB [paper]! For the more general MT-Bench eval, they only see a ~5% regression vs. the original Llama-13B.
I think that this work really opens up the potential for high-quality, super-low-latency, on-device models that let users have very snappy basic conversations.
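To make the quantization claim a bit more concrete, here is a minimal sketch of sub-2-bit (ternary) weight quantization: weights are mapped to {-1, 0, +1} plus a per-tensor scale, which works out to roughly log2(3) ≈ 1.58 bits of information per weight. This is a generic absmean-style illustration, not necessarily the exact scheme used in the paper.

```python
import numpy as np

def ternary_quantize(w: np.ndarray):
    """Map weights to {-1, 0, +1} with a per-tensor scale (absmean-style)."""
    scale = np.mean(np.abs(w)) + 1e-8        # avoid division by zero
    q = np.clip(np.round(w / scale), -1, 1)  # ternary codes in {-1, 0, +1}
    return q.astype(np.int8), float(scale)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct approximate float weights from ternary codes."""
    return q.astype(np.float32) * scale

# Tiny usage example on random weights.
w = np.random.randn(4, 4).astype(np.float32)
q, s = ternary_quantize(w)
print(q)                                      # entries in {-1, 0, 1}
print(np.abs(w - dequantize(q, s)).mean())    # average quantization error
```

Part of the appeal for on-device inference is that matrix multiplies against ternary weights reduce to additions and subtractions, which helps both memory footprint and latency.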
GPT-4 finally meets its match
The first Claude 3 Opus numbers appeared in the Lmsys Chatbot Arena. They are just outside the confidence interval of being equivalent to GPT-4. This is the first model that has really demonstrated GPT-4-level performance, almost a year after GPT-4's release. Looking forward to seeing how OpenAI/Google/Mistral/we respond.
tinyBenchmarks shows that reliable numbers for LLMs on academic benchmarks can be established with ~100 examples
We talked a bit about the difficulties in evaluating LLMs in the last two editions [1, 2]. While this doesn’t solve the root problem that academic benchmarks have lost some of their predictive power, a team of researchers showed that they can dramatically reduce the size of the test set for academic benchmarks and still obtain reliable estimates of full-benchmark performance.
This work is interesting to me because it significantly lowers the barrier for the broader community to evaluate models and scrutinize the numbers published by big model providers, in a similar fashion to what Lmsys did with the Chatbot Arena.
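For intuition on why ~100 examples can already give meaningful numbers, here is a quick sketch on simulated data. Note the tinyBenchmarks authors go further than this and select informative anchor examples rather than sampling at random; the benchmark size, accuracy, and plain random subsampling below are illustrative assumptions, not numbers or methods from the paper.

```python
import random

def simulate_benchmark(n_items: int = 10_000, true_acc: float = 0.72, seed: int = 0):
    """Simulated per-example correctness for a model with a given 'true' accuracy."""
    rng = random.Random(seed)
    return [1 if rng.random() < true_acc else 0 for _ in range(n_items)]

def subsample_estimate(results, k: int = 100, seed: int = 1) -> float:
    """Estimate full-benchmark accuracy from k randomly sampled examples."""
    rng = random.Random(seed)
    return sum(rng.sample(results, k)) / k

results = simulate_benchmark()
full_acc = sum(results) / len(results)
print(f"full benchmark:       {full_acc:.3f}")
print(f"100-example estimate: {subsample_estimate(results):.3f}")
# For accuracy around 0.7, the standard error of a 100-example estimate is
# roughly sqrt(0.7 * 0.3 / 100) ~= 0.046, i.e. a few points either way.
```

Even naive random subsampling gets you within a few points; the curated-example approach tightens this further, which is what makes ~100 examples per benchmark practical.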