These notes are written for a Meta-internal audience. I quickly redact Meta-internal information before publishing but don’t take the time to rewrite. If a particular edition seems thin or a transition doesn’t make sense, this is why.
Kyutai Labs’ Moshi chatbot dazzles the internet (but not me), and the need for voice assistant evals
Kyutai Labs announced an audio language model that they call the “lowest latency conversational AI ever released”. Unlike some of the systems we’ve looked at here previously, such as Fixie/Ultravox, Moshi is natively multimodal rather than a cascaded system (separate ASR, LLM, and TTS stages).
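To make the distinction concrete, here’s a minimal sketch of the two architectures. Every function below is a hypothetical stub of my own, not anything from Moshi’s or Fixie’s actual stack:

```python
# Hypothetical stubs standing in for real models; nothing here is Moshi's
# or Fixie's actual API.

def asr(audio: bytes) -> str:    # speech -> text (stub)
    return "hello"

def llm(text: str) -> str:       # text -> text (stub)
    return f"you said: {text}"

def tts(text: str) -> bytes:     # text -> speech (stub)
    return text.encode()

def cascaded_turn(user_audio: bytes) -> bytes:
    """Cascaded: three separate models run in sequence, so their latencies add."""
    return tts(llm(asr(user_audio)))

def audio_model(tokens: list[int]) -> list[int]:  # audio-token model (stub)
    return tokens

def native_turn(user_audio: bytes) -> bytes:
    """Natively multimodal: one model maps audio tokens directly to audio
    tokens, so there is no per-stage handoff to wait on."""
    tokens = list(user_audio)    # stand-in for a real audio tokenizer
    return bytes(audio_model(tokens))
```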
After spending some time with their demo, I was surprised by the extremely positive reception it received. The latency is impressive, but it wasn’t noticeably different from Fixie’s, and both the quality of the responses (the voice itself as well as the content) and the ability to handle barge-ins and interruptions were noticeably worse. I found my interactions with Moshi quite robotic and curt, while I actually enjoyed chatting with Fixie, which is slower but gives better-quality responses.
That said, this review is a vibes-based eval rather than anything backed by hard numbers. I do think there is a real need for a more quantitative eval that goes beyond somewhat contrived metrics like spoken versions of text benchmarks, which only address correctness rather than experience.
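As a sketch of what an experience-focused eval could record, here’s the kind of per-turn schema I have in mind. The field names and thresholds are my own invention, not an established benchmark:

```python
from dataclasses import dataclass

@dataclass
class TurnMetrics:
    """Per-turn measurements for a hypothetical voice-assistant eval."""
    time_to_first_audio_ms: float  # how long until the assistant starts speaking
    barge_in_recovered: bool       # did it stop cleanly when interrupted?
    response_correct: bool         # the part spoken text benchmarks already cover
    naturalness_score: float       # e.g. a 1-5 human rating of voice and content

def experience_score(turns: list[TurnMetrics]) -> float:
    """Toy aggregate: the fraction of turns that were fast, well-behaved,
    correct, and natural. The thresholds are arbitrary illustrations."""
    good = [
        t.time_to_first_audio_ms < 500
        and t.barge_in_recovered
        and t.response_correct
        and t.naturalness_score >= 4.0
        for t in turns
    ]
    return sum(good) / len(good)
```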
So the best LLMs aren’t actually capable of graduate-level reasoning?
Last year, a team released a really clever LLM benchmark called GPQA, right when it was obvious that the existing leaderboards were saturating. The benchmark is split into three categories (physics, biology, and chemistry), and a question was accepted only if PhD candidates studying that field could answer it reasonably accurately (74%) while PhD candidates in an adjacent field could not (34%), even with 30 minutes and access to Google. At the time, GPT-4 achieved 39%, marginally better than random chance (25%, since the questions are four-way multiple choice). Since then, Claude 3.5 Sonnet has notched 59%, providing evidence that Anthropic’s models have already achieved the PhD-level intelligence that Mira Murati claimed would come with GPT-5.
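The acceptance criterion is easy to state in code. Here’s a simplified sketch of my reading of it; the real protocol uses multiple expert validation rounds, and the majority-vote thresholds below are my simplification, not GPQA’s:

```python
def keep_question(expert_correct: list[bool], nonexpert_correct: list[bool]) -> bool:
    """Keep a question only if in-domain PhD candidates mostly answer it
    correctly while candidates from an adjacent field mostly do not, even
    given 30 minutes and Google. Thresholds are illustrative."""
    expert_acc = sum(expert_correct) / len(expert_correct)
    nonexpert_acc = sum(nonexpert_correct) / len(nonexpert_correct)
    return expert_acc >= 0.5 and nonexpert_acc <= 0.5
```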
However, a new benchmark just dropped that challenges that claim. SciCode asks LLMs to code up solutions to hard scientific problems. And these are really hard: I studied physics as an undergrad and did my PhD in materials science (aka physics for people who aren’t smart enough to be physicists), and I have no idea how to even start answering some of the questions. The best model, Claude 3.5 Sonnet, achieves only a 5% pass rate.
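To make that 5% concrete: as I understand the setup, a problem only counts as solved if the model’s code passes the tests for every one of its subproblems, so the headline number could be computed roughly like this (a toy sketch with a hypothetical results structure):

```python
def pass_rate(results: dict[str, list[bool]]) -> float:
    """Map each main problem to its per-subproblem test outcomes; a problem
    counts as solved only if every subproblem passes. 5% means roughly
    1 in 20 problems fully solved."""
    solved = [all(subtests) for subtests in results.values()]
    return sum(solved) / len(solved)
```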
I do believe that once we saturate this benchmark, we will be able to confidently claim that our models have achieved PhD-level intelligence. Looking forward to seeing who does it first.
Personalized, AI-generated Olympic summaries from NBC
Al Michaels (that’s Al as in Alan, not Artificial Intelligence; a perfect name for the role) is a beloved sportscaster who apparently drives watch time for NBC. Because Al has only so many hours in the day, he can’t possibly comment on every event, but his AI avatar can.
Ahead of the Olympics, NBC built a feature that generates a personalized daily summary for each user, announced in Al’s AI-generated voice. While the summaries will all be reviewed by humans, this is the first time I’ve seen an AI recreation of a real person deployed in such a public way.