Dan’s Weekly AI Speech and Language Scoop #27

Kyutai’s Moshi makes voice assistants fun again

Audio LLMs are all the rage. OpenAI showed off Her. Google showed off Gemini Live. But why? What is the point of these things?

We have lots of product hypotheses, but I think a big one that gets overlooked is that they should be fun. The entire world knows about all the cool things ChatGPT can do but generally ignores the runaway success of Character.ai.

I’ve played with every audio LLM I’ve been able to get my hands on, and I find them all boring. If I didn’t work in this area, I would not use these products outside of simple utility tasks like asking for the weather, which can be easily accomplished through a cascaded system (side note: Ray-Ban Meta’s voice assistant and form factor are AWESOME for these use cases).
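For anyone unfamiliar with the term, here is a minimal sketch of what I mean by a cascaded system: speech is transcribed, a text model (or even simple rules) produces a response, and a TTS voice reads it back. The functions below are hypothetical placeholders, not any particular product’s API.

```python
# A minimal sketch of a cascaded voice assistant: ASR -> language model/rules -> TTS.
# All three stage functions are hypothetical stand-ins, just to show the data flow.

def transcribe(audio: bytes) -> str:
    """ASR stage: audio in, text out (a Whisper-style model in a real system)."""
    return "what's the weather in seattle"  # placeholder transcript

def answer(transcript: str) -> str:
    """Language stage: simple utility intents are easy to handle with rules or a text LLM."""
    if "weather" in transcript:
        return "It's 62 degrees and cloudy in Seattle."  # placeholder weather lookup
    return "Sorry, I can only handle simple utility tasks."

def synthesize(text: str) -> bytes:
    """TTS stage: text in, audio out (any off-the-shelf TTS voice)."""
    return text.encode("utf-8")  # placeholder "audio"

def handle_turn(user_audio: bytes) -> bytes:
    # Each stage only sees the previous stage's text, which is why cascades are easy
    # to build for utility queries but drop the paralinguistic cues (tone, timing,
    # emotion) that an end-to-end audio LLM could react to.
    return synthesize(answer(transcribe(user_audio)))

print(handle_turn(b"fake-mic-input"))
```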

However, I think that Kyutai’s new release, Moshi, may have started to crack this nut. Yes, its voice makes RFK Jr. sound like Adele. Yes, it gets stressed and a bit unhinged. Yes, it interrupts users way too aggressively. Yes, it has a weak relationship with truth.

But, yes, it is fun in a way that no other experience is. Why? Moshi has a personality. It asks the user questions. It has preferences. It carries the other side of the conversation in a way that no other model does yet. 

With other audio LLMs, the user has to drive 100% of the conversation. This is exhausting. With Moshi, it is more balanced. It opened my first conversation by letting me know that it was stressed (lol) about its upcoming Japanese exam. If I ask other audio LLMs about their Japanese exams, they will tell me that they are AI language models and don’t have exams. This is no fun. The tech and product obviously still have a lot of problems, but I think this is an important insight.

Give it a try for yourself at moshi.chat, read their technical report, or even run the model locally on your MacBook with two commands (although it runs too slowly locally to be fun for me), and let me know if you agree with my assessment.

Synthetic podcasts with Google Labs

Google released a really neat product as part of NotebookLM and Illuminate that generates synthetic podcasts from uploaded source materials. The system runs offline (podcasts take a few minutes to generate) and is likely stitching together a powerful version of Gemini, prompted or tuned for dialogue, with a very impressive TTS system rather than using a native audio LLM, but the results are similar.
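To make that hypothesis concrete, here is a rough sketch of the kind of cascaded pipeline I suspect is behind it. This is my speculation, not Google’s documented design, and draft_script and tts are hypothetical stand-ins for whatever models are actually used.

```python
# A sketch of a cascaded podcast pipeline (my guess, not a confirmed Google design):
# an LLM drafts a two-host script from the source document, then a multi-voice TTS
# renders each turn and the clips are stitched together.

from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str  # e.g. "HOST_A" or "HOST_B"
    text: str

def draft_script(document: str) -> list[Turn]:
    """Prompted/tuned LLM stage: turn source material into a grounded two-host script."""
    return [
        Turn("HOST_A", "Today we're digging into the paper you uploaded."),
        Turn("HOST_B", "Right, and the core idea is surprisingly simple..."),
    ]  # placeholder script; a real system would call an LLM here

def tts(text: str, voice: str) -> bytes:
    """TTS stage: render one audio clip per turn in that host's voice."""
    return f"[{voice}] {text}".encode("utf-8")  # placeholder "audio"

def make_podcast(document: str) -> bytes:
    # Because generation runs offline, the script can be drafted, grounded against the
    # source, and synthesized turn by turn before the listener ever hears anything.
    clips = [tts(turn.text, voice=turn.speaker) for turn in draft_script(document)]
    return b"\n".join(clips)

print(make_podcast("text of 'Attention Is All You Need'").decode("utf-8"))
```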

From a tech perspective, I am very impressed by the quality of the dialogue and TTS. Their AI podcast hosts outperform the majority of humans from a simple polish perspective. For more popular papers/documents, the grounding and factuality are solid. I didn’t spot a single error, including references to figures and tables, in the “Attention Is All You Need” podcast. The explanation of the important concepts was totally aligned with my understanding of that paper.

Factuality and intent fell apart for lesser-known works. I generated a podcast based on one of my grad-school papers on covalently adaptable hydrogels. The podcast made a number of factual errors and generally missed the intent of the work. When I generated a podcast from both papers together, it tried to draw some very strange analogies between hydrogels and attention mechanisms.

If you want to learn about a very well-known topic, I think the combination of the model’s grounding and strong parametric understanding of the topic yields a great result, but I would be skeptical of using this for general documents. This comes back to some of my earlier feedback on audio LLMs: they really need to be factual because checking the results is so hard in the audio modality. Once the model loses the user’s trust, the conversation becomes quasi-sociopathic.

From a marketing perspective, Google did an amazing job landing this. I highly doubt that most people amplifying their message actually generated and listened through an entire podcast, but the message obviously resonated. Bravo to the team for uncovering this.

From a product perspective, I’m not sure that I would actually use this as designed. I listen to podcasts for two reasons: to learn and to be entertained. Some podcasts are totally vapid but fun. Some are super dense but educational. Most are somewhere in the middle. To me, these synthetic podcasts are not dense or informative enough to be educational (I could get the information more efficiently from a text-based summary) and not human enough to be entertaining (while the TTS and dialogue modeling are amazing, I like vapid podcasts because I share the same cultural context as the hosts and get the jokes).

Make your own at https://notebooklm.google.com/ and let me know what you think.

Google’s new whale LID model

Language identification is a common speech task: given an audio clip, classify which languages are being spoken at various timestamps. Google applied this basic concept to underwater audio and trained a model to recognize whether Humpback, Orca, Blue, Fin, Minke, Bryde’s, or Right whale is being spoken. I don’t think the modeling technique is particularly novel, but I thought it was fun how close the pipeline (speech enhancement, side speech rejection, log-mel front-end, etc.) is to a human LID system, and I love this application. They include recordings on their website. I’ve heard whale calls IRL, but I had no idea these creatures made such a diverse array of sounds.
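For intuition, here is a toy version of that kind of pipeline: a log-mel front-end feeding a small classifier that is scored over sliding windows to produce per-timestamp species labels. The architecture, window size, and extra "noise" class below are generic stand-ins of my own, not Google’s actual model.

```python
# Toy LID-style pipeline for whale calls: log-mel front-end + windowed classifier.
# Generic stand-in architecture, not Google's model.

import torch
import torchaudio

SPECIES = ["humpback", "orca", "blue", "fin", "minke", "brydes", "right", "noise"]
SAMPLE_RATE = 16_000

# Log-mel front-end, the same kind used for human-speech LID.
frontend = torch.nn.Sequential(
    torchaudio.transforms.MelSpectrogram(
        sample_rate=SAMPLE_RATE, n_fft=1024, hop_length=160, n_mels=64
    ),
    torchaudio.transforms.AmplitudeToDB(),
)

# Tiny stand-in classifier: a real system would use a much larger network,
# plus enhancement and side-sound rejection before this stage.
classifier = torch.nn.Sequential(
    torch.nn.Conv2d(1, 16, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(16, len(SPECIES)),
)

def classify_windows(audio: torch.Tensor, window_s: float = 5.0) -> list[tuple[float, str]]:
    """Return (start_time_in_seconds, predicted_species) for each non-overlapping window."""
    hop = int(window_s * SAMPLE_RATE)
    results = []
    for start in range(0, audio.shape[-1] - hop + 1, hop):
        mels = frontend(audio[..., start:start + hop])  # (1, n_mels, frames)
        logits = classifier(mels.unsqueeze(0))          # (1, n_species)
        results.append((start / SAMPLE_RATE, SPECIES[int(logits.argmax())]))
    return results

# 30 seconds of fake hydrophone audio, just to exercise the pipeline end to end.
print(classify_windows(torch.randn(1, 30 * SAMPLE_RATE)))
```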

For the Orca language, they were able to decode several behaviors (echolocation, whistle, call). I’m looking forward to when we can fully decode these whale calls and talk to them ourselves.