One of my colleagues at Google wrote a quick note every week about interesting things he observed across the tech ecosystem that made him think differently about his work. I really enjoyed his musings and thought that I might try to emulate him upon my return to Meta.
I spend a lot of time reading about speech and language in my spare time, so perhaps my notes will be valuable to others (and if not, this will serve as a nice diary for myself).
These notes are written for a Meta-internal audience. I quickly redact Meta-internal information before publishing but don’t take the time to rewrite. If a particular edition seems thin or a transition doesn’t make sense, this is why.
Google enters the Open LLM battle
Google released Gemma 2B and 7B, open models with permissive Apache licenses based on the Gemini architecture. What I found interesting in this launch was not that Google entered the fray with small open models but that I (and, from what I have seen, many in the community) didn't feel that the model quality matched the expectations set by its strong academic benchmark results. I feel like this and the Gemini Ultra launch may be the beginning of the end of companies optimizing for and publicizing results on these benchmarks (Goodhart's Law strikes again!).
DeepMind launches a call for AI tutors
It is widely accepted that the best models outperform the most talented pool of humans likely to work as writers. To make this concrete: if the best human likely to work as a writer generating SFT (supervised fine-tuning) data and GPT-4 both write a response to the same prompt, other humans will probably rate the GPT-4 response higher than the human's. This is partially why RLHF is so helpful in improving models these days: humans have an easier time selecting the higher-quality response than writing one on their own.
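This "selecting beats writing" intuition is exactly what reward modeling formalizes: the labeler (or the reward model trained on their labels) only has to score which of two responses is better, never to author a good one. A minimal sketch of the pairwise (Bradley-Terry) loss commonly used to train RLHF reward models — the function name and the example scores are my own illustration, not from any particular codebase:

```python
import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model already scores the
    human-preferred response higher, and large when it disagrees
    with the human comparison.
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A labeler compared two responses and preferred A. If the reward model
# scores A at 1.2 and B at 0.3, it agrees with the label and the loss is small.
loss_agree = pairwise_preference_loss(1.2, 0.3)

# Had the model scored them the other way, the loss would be larger,
# pushing its scores toward the human preference during training.
loss_disagree = pairwise_preference_loss(0.3, 1.2)

print(loss_agree < loss_disagree)  # True
```

Note that the training signal here is only a comparison between two candidates, which is a much cheaper and more reliable human judgment than asking the same labeler to write a GPT-4-quality response from scratch.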
However, notice that I mentioned AI would outperform the “most talented pool of humans likely to work” as prompt writers. DeepMind is addressing this problem by expanding that pool and calling for experts in all domains to work part-time as writers. I believe that this program could significantly increase the performance of their models on both these specialized domains and simpler queries that involve reasoning and related skills (analogous to how adding code to the training mixture improves natural language performance).
Mistral releases GPT-3.5/Gemini-Pro/Claude competitor
Mistral released their 4th (?) model in less than a year. This one seems to match GPT-3.5 and Gemini Pro on academic benchmarks, but I expect it to outperform them in the Chatbot Arena, based on the quality of their other models. I am simultaneously impressed with and puzzled by Mistral's ability to ship such high-quality models so quickly with such a small team. The explanations I've heard are that they are less concerned about high-quality copyrighted data than Google/Meta/etc. (one textbook token is worth ~10+ web tokens) and that they don't spend time on safety, but I think there is more to it than that.
Google Gemini rekt with safety issues
This has been covered extensively across all channels, but TL;DR: Gemini's responses and safety filters were not aligned with a large fraction (likely the vast majority) of its user base, and the internet erupted. This is interesting to me because I believe that Google has a very strong language modeling team that was mostly unaware of these issues. I think it will be relatively easy for them to change course, but Ben Thompson of Stratechery fame feels otherwise. I look forward to looking back on this in a year to see how Google is faring (note that Google has plenty of other problems; I am referring only to model quality here [1, 2]).
Air Canada chatbot response ruled legally binding
Air Canada shipped a chatbot on its marketing site to help customers book travel. The chatbot promised a passenger a discount that wasn't actually available. A judge ruled that the chatbot was an agent of the company and that its offer was binding [BBC story]. I suspect that this will send ripples through the enterprise chat industry, probably opening opportunities for AI factuality/grounding startups, and it could even impact how we think about our consumer products.
Amazon trains a SOTA foundation TTS model
I haven't historically followed TTS, but I think this domain will become increasingly important as users' expectations for AI assistants evolve from turn-based answer engines to human-like companions. Amazon published a paper using a new LLM-like technique to achieve SOTA naturalness in TTS. I haven't read the paper closely and can't claim to understand it, but this is an area I want to spend more time exploring [paper, demo].