Dan’s Weekly AI Speech and Language Scoop #19

Fixie/Ultravox open-sources low-latency Llama-based voice assistant

A company I’d never heard of, Fixie AI, launched an impressive Llama-based voice assistant. They bolt together a speech encoder, a multimodal projector/adaptor, and a Llama 3 8B model, which streams text that they feed to a TTS system. For the next edition, they plan to replace text generation with token generation plus a vocoder, yielding a truly end-to-end voice chatbot.
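To make the architecture concrete, here is a minimal sketch of the projector/adaptor piece, assuming a Whisper-style speech encoder feeding a Llama-style LLM; the module names and dimensions are my own illustrative guesses, not Ultravox’s actual code.

```python
import torch
import torch.nn as nn

class SpeechToLLMAdapter(nn.Module):
    """Sketch of an Ultravox-style adaptor: a frozen speech encoder produces
    audio embeddings, a small projector maps them into the LLM's embedding
    space, and the LLM generates text that is handed to a TTS system.
    Dimensions here are assumptions, not the real configuration."""

    def __init__(self, audio_dim: int = 1280, llm_dim: int = 4096):
        super().__init__()
        # projector/adaptor: maps speech-encoder features to LLM token embeddings
        self.projector = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # audio_features: (batch, frames, audio_dim), e.g. from a Whisper encoder
        # returns:        (batch, frames, llm_dim), ready to prepend to the LLM's input embeddings
        return self.projector(audio_features)
```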

I didn’t spend any time trying to run their system locally, but the demo is very impressive. Their AI assistant punts or delays quite a bit and doesn’t handle multilingual input well at all, but overall it seems like a great product. Their speech recognizer outperforms every standard speech recognition system on rare words like “Thermopylae” (one of my conversations was about ancient Greece).

Gemma 2 fits to the Chatbot Arena prompts

I previously joked that Goodhart’s law was coming for the Chatbot Arena [1, 2], which historically has been the canonical place to find directionally correct rankings of LLMs. That moment arrived faster than I expected. 

Google’s new Gemma 2 open models appear very impressive for their size, independent of any metric hacking, but I find it funny that they explicitly mention in the technical report that they “use the prompts, but not the answers from LMSYS-chat-1M” to fine-tune their model. Not surprisingly, after aligning their 27B model to the exact prompt distribution produced by raters active in the Chatbot Arena, they rocketed to the top of the open model leaderboard.

Detect hallucinations (and publish in Nature) with semantic entropy

A group of researchers showed that they could detect hallucinations using a new metric called semantic entropy. They sample multiple answers to the same question, cluster them based on semantic similarity and entailment, and use the entropy over the cluster weights to determine the likelihood of a hallucination. The figure in their paper describes this far better than I can in words.
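For concreteness, here is a rough sketch of the clustering-plus-entropy computation as the paper describes it, assuming you already have sampled answers and some entailment checker (e.g. an NLI model) passed in as `entails`; this is my paraphrase of the idea, not the authors’ code.

```python
import math

def semantic_entropy(answers, entails):
    """Sketch of semantic entropy: cluster sampled answers by bidirectional
    entailment (same meaning), then compute the entropy of the cluster
    distribution. `entails(a, b)` is an assumed helper returning True if
    answer a entails answer b."""
    clusters = []  # each cluster holds answers judged semantically equivalent
    for ans in answers:
        for cluster in clusters:
            rep = cluster[0]
            # bidirectional entailment ~= same meaning -> same cluster
            if entails(ans, rep) and entails(rep, ans):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])

    n = len(answers)
    probs = [len(c) / n for c in clusters]
    # high entropy over meaning-clusters -> the model is uncertain -> likely hallucination
    return -sum(p * math.log(p) for p in probs)
```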

This approach seems fairly obvious and similar to other approaches to quantifying LLM factuality I’ve seen, but I mostly wanted to share it because it made it into Nature. I don’t know if Nature is as important or well known in CS as it is in other fields, but I feel like there must be something here that I’m missing or underappreciating. Can anyone tell me?

Dramatically improve LLM inference efficiency with routing

The result is fairly obvious, but this is the first time I’ve seen the impact of model routing quantified. The LMSYS team showed that they could train a router to direct each query to a model of the appropriate power and significantly reduce cost/latency with little impact on quality [blog, paper].
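As a hedged sketch of what such a router could look like in practice, assume a learned scorer that estimates whether the strong model is actually needed for a given query; the scorer, threshold, and model callables below are placeholders, not LMSYS’s actual implementation.

```python
from typing import Callable

def make_router(
    score: Callable[[str], float],   # assumed: predicted probability the strong model is needed
    strong: Callable[[str], str],    # expensive, high-quality model
    cheap: Callable[[str], str],     # fast, low-cost model
    threshold: float = 0.7,
) -> Callable[[str], str]:
    """Route each query to the cheap model unless the scorer says the strong
    model is worth its cost; the threshold trades quality against cost/latency."""
    def route(query: str) -> str:
        return strong(query) if score(query) >= threshold else cheap(query)
    return route

# illustrative usage with stand-in callables
router = make_router(
    score=lambda q: 0.9 if len(q.split()) > 50 else 0.2,   # toy heuristic scorer
    strong=lambda q: f"[strong model answer to: {q}]",
    cheap=lambda q: f"[cheap model answer to: {q}]",
)
print(router("What is the capital of France?"))
```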

FAIR solves length constraints

Due to complexities with tokenization and the human preference for longer responses, LLMs generally have trouble following length constraints. However, for one of the most prominent LLM use cases, generating marketing content, length constraints are critical. Marketers need to ask an LLM to generate copy that is 10 words long for a Google AdWords campaign, 20 words for Instagram, 200 words for a blog post, and so on.

FAIR figured it out by augmenting existing preference data with length constraints. This approach seems so obvious that I’m surprised that no one else did it first, so I suspect that there is some secret sauce or tradeoff that I’m missing. This is a nice step toward more controllable LLMs.
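As a hedged illustration of what this augmentation could look like (my reading of the approach, not FAIR’s code), the sketch below prepends a word-limit instruction to an existing preference pair and, when only the originally rejected response satisfies the limit, swaps the preference so the constraint-following answer wins; the field names are assumptions about the data format.

```python
def augment_with_length_constraint(example: dict, max_words: int) -> dict:
    """Sketch of length-instruction augmentation in the spirit of the FAIR work.
    Assumed fields: 'prompt', 'chosen', 'rejected'."""
    def word_count(text: str) -> int:
        return len(text.split())

    prompt = f"Answer in at most {max_words} words.\n{example['prompt']}"
    chosen, rejected = example["chosen"], example["rejected"]

    chosen_ok = word_count(chosen) <= max_words
    rejected_ok = word_count(rejected) <= max_words
    if not chosen_ok and rejected_ok:
        # constraint compliance decides the preference in this augmented pair
        chosen, rejected = rejected, chosen

    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```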