These notes are written for a Meta-internal audience. I quickly redact Meta-internal information before publishing but don’t take the time to rewrite. If a particular edition seems thin or a transition doesn’t make sense, this is why.
Decoding meerkat speech with animal2vec
I believe that animals are much smarter than we give them credit for. Years ago, I was amazed to learn that elk adapt their migration patterns to changes in hunting seasons, seeking refuge in national parks and private land [1, 2]. Today, I don’t find this kind of intelligence surprising.
That said, the underappreciated corner of machine learning that excites me most is decoding animal speech. We have made some early progress with whale songs, fruit bat vocalizations, and likely other species, but these efforts relied largely on researchers' intuition and domain expertise, plus whatever supervised data they could assemble.
This week, a team of researchers released animal2vec and the accompanying proof-of-concept dataset MeerKAT: 1,068 hours of meerkat recordings, 184 hours of which carry ground-truth labels. They show that with animal2vec, they can learn unsupervised or lightly supervised representations of animal speech and use those representations to classify previously unheard utterances. This work harks back to a previous edition where we discussed "The Bitter Lesson," an essay by famous reinforcement learning researcher Rich Sutton, who argued that we should stop trying to encode human knowledge into our systems and just make GPUs go brrrrrr. I look forward to the day when biologists need to do nothing more than place a microphone array out in the wild and fire up some GPUs to let us eavesdrop on the talk of the animal town.
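To make that recipe concrete, here is a toy sketch of the pretrain-then-probe workflow in PyTorch. Everything in it is illustrative, not the animal2vec architecture: the tiny encoder, the masking scheme, and the 30 call types are stand-ins I made up to show the shape of the two phases.

```python
import torch
import torch.nn as nn

class TinyAudioEncoder(nn.Module):
    """Stand-in encoder; the real model is a much larger transformer."""
    def __init__(self, n_mels=64, dim=128):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):                      # x: (batch, frames, n_mels)
        return self.encoder(self.proj(x))      # -> (batch, frames, dim)

encoder, predictor = TinyAudioEncoder(), nn.Linear(128, 128)
opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(predictor.parameters()), lr=3e-4
)

# Phase 1: self-supervised pretraining on unlabeled audio. Mask random frames
# and regress the masked positions onto (detached) targets from the clean
# input -- a heavily simplified masked-prediction objective.
unlabeled = torch.randn(8, 200, 64)            # fake mel-spectrogram batch
mask = torch.rand(8, 200) < 0.15               # mask ~15% of frames
with torch.no_grad():
    targets = encoder(unlabeled)               # teacher targets (real systems use an EMA teacher)
corrupted = unlabeled.clone()
corrupted[mask] = 0.0
loss = nn.functional.mse_loss(predictor(encoder(corrupted))[mask], targets[mask])
loss.backward(); opt.step(); opt.zero_grad()

# Phase 2: with the encoder frozen, train a lightweight classifier on the
# small labeled subset (30 call types is a made-up number).
probe = nn.Linear(128, 30)
clips, labels = torch.randn(8, 200, 64), torch.randint(0, 30, (8,))
with torch.no_grad():
    feats = encoder(clips).mean(dim=1)         # pool frames into one clip vector
probe_loss = nn.functional.cross_entropy(probe(feats), labels)
```

The payoff of this shape is that the expensive phase needs no labels at all, which is exactly the regime bioacoustics is in: audio is cheap, annotations are scarce.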
Apple achieves SOTA on the first try with adapters
Much has been written about the launch of Apple Intelligence, but I wanted to touch on something I found relevant to our assistant efforts. In their launch blog post, they show that their Apple Server model matches or outperforms GPT-4-Turbo (presumably the latest version, unless they're cheating) and that their on-device model outperforms Gemma, Mistral, and Phi on their proprietary human preference datasets and on instruction-following (IF) and writing benchmarks. Very impressive!
However, anyone who has worked on these things can tell you that general GPT-4-level performance is really hard to nail on the first try (Google struggled for over a year). Achieving it on a subset of tasks is much easier, and that is the approach Apple appears to have taken: they apply a different adapter for each task, letting them reach SOTA performance on specific tasks without a SOTA base model.
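For intuition on how that works, here is a toy LoRA-style adapter layer: one frozen base weight shared across tasks, plus a tiny low-rank delta per task that gets swapped in at request time. The task names, rank, and dimensions are made up for illustration; this is not Apple's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base Linear plus hot-swappable per-task low-rank adapters."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # the shared base stays frozen
        self.rank, self.scale = rank, alpha / rank
        self.adapters = nn.ModuleDict()        # one low-rank (A, B) pair per task
        self.active = None

    def add_adapter(self, name: str):
        in_f, out_f = self.base.in_features, self.base.out_features
        self.adapters[name] = nn.ParameterDict({
            "A": nn.Parameter(torch.randn(self.rank, in_f) * 0.01),
            "B": nn.Parameter(torch.zeros(out_f, self.rank)),  # zero init => adapter starts as a no-op
        })

    def forward(self, x):
        y = self.base(x)
        if self.active is not None:            # add the low-rank delta for the current task
            a = self.adapters[self.active]
            y = y + (x @ a["A"].T @ a["B"].T) * self.scale
        return y

layer = LoRALinear(nn.Linear(512, 512))
for task in ("summarize", "proofread", "reply"):   # hypothetical task names
    layer.add_adapter(task)

layer.active = "summarize"                     # swap a tiny adapter, reuse the big base
out = layer(torch.randn(1, 512))
```

Because each adapter is only rank × (in + out) parameters, you can ship dozens of specialized "models" on a phone while storing the big base weights exactly once.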
OpenAI follows Anthropic’s lead on interpretability
A few weeks ago, Anthropic published a really interesting blog post highlighting how they peek inside Claude's brain. This week, OpenAI published something similar. They use a sparse autoencoder to identify patterns of activated neurons that seem to encode semantic concepts.
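For a feel for the technique, here is a minimal sparse autoencoder over captured activations. The dimensions, the TopK mechanism, and the training loop are simplified stand-ins; the labs' actual setups differ in scale and details.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstruct activations through a wide latent layer where only the
    k largest latents survive; each latent is a candidate 'feature'."""
    def __init__(self, d_model=768, d_latent=16384, k=32):
        super().__init__()
        self.k = k
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)

    def forward(self, acts):
        z = torch.relu(self.enc(acts))
        topk = torch.topk(z, self.k, dim=-1)   # keep only the k strongest latents
        sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return self.dec(sparse), sparse

sae = SparseAutoencoder()
acts = torch.randn(64, 768)                    # activations captured from some layer
recon, latents = sae(acts)
loss = nn.functional.mse_loss(recon, acts)     # train to reconstruct; the sparse
                                               # latents are then inspected by hand
```

The sparsity constraint is the whole trick: forcing each input to be explained by a handful of latents pushes individual latents to line up with individual human-legible concepts.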
This was Ilya and Jan's last work at OpenAI before they left, so I suspect it didn't get the full media treatment, but it is cool to see that both labs showed that patterns of activated neurons do correlate with concepts, a property reminiscent of the human brain.
Qwen2 disappoints in the Chatbot Arena
Alibaba released a new version of their Llama competitor, Qwen. They claim SOTA open-weight performance across all of the academic benchmarks and seemed to give Llama3 70B a run for its money. However, when the model faced the scrutiny of 10k+ blinded human reviewers, it managed only 15th place on the leaderboard, a solid 20 Elo points behind Llama3 70B.
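For a sense of what 20 Elo points buys you: under the standard Elo formula (ignoring ties), the gap translates to only a modest head-to-head edge.

```python
def elo_win_prob(gap: float) -> float:
    """Expected score of the higher-rated model under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** (-gap / 400))

print(round(elo_win_prob(20), 3))  # ~0.529: Llama3 70B wins ~53% of head-to-head votes
```

A 53/47 split is small per matchup, but with 10k+ blinded votes it is more than enough signal to separate the two models on the leaderboard.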
As we’ve discussed previously in this newsletter [1, 2, 3, 4, 5, 6], LLM evals are hard and academic benchmarks have lost much of their predictive power due to both data leakage and unintentional overfitting. I see them as nothing more than a sanity check these days.