Dan’s Weekly AI Speech and Language Scoop #30

Will the real OSS voice cloning toolkit please stand up, and the joys of Hacker News

Last week, I tried to understand the dearth of good OSS packages for zero-shot voice cloning. I’ve heard from many people that this is easy, but nothing out there seems to work very well compared to the closed systems. Alongside my own investigation, I asked the Hacker News hivemind the same question. Between writing my previous newsletter and today, popalchemist responded and let me know that I should check out VoiceCraft. Despite sitting near the bottom of the TTS Arena standings, its zero-shot voice cloning capabilities are apparently unrivaled.

Their GitHub repo links to a working Colab that let me clone my voice with only a few minutes of setup. Performance is still quite a bit below Character and ElevenLabs, but I think it is on par with Play and a step above all the other OSS systems I tested. The chances of this voice fooling anyone who knows me well are near zero, but it really does pick up my unique timbre and gets close to the commercial systems.

And what makes this such a fun coincidence is that, when glancing through the VoiceCraft paper to understand how the system works, I was so pleasantly surprised to see Shang-Wen Li, my buddy from my first project at Meta, and Abdelrahman Mohamed, one of my favorite FAIR collaborators, who is now in start-up land. Nice job on this, Shang-Wen and Abdo!

Why don’t I write this newsletter with NotebookLM?

NotebookLM has gotten a lot of press recently. Their “generate a podcast” feature went majorly viral (my review here) and has graced the pages of more publications than I can count. But the core product is not actually generating podcasts. It is a multi-document RAG tool wrapped in a nice UI that lets users upload an arbitrary number of docs and chat with them.
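For anyone who hasn’t peeked under the hood, “multi-document RAG” is less magic than it sounds. Here is a minimal sketch of the pattern, not NotebookLM’s actual implementation; the file names are hypothetical and the embedding model is just a small default I reach for, so treat it as an illustration only.

```python
# Minimal multi-document RAG sketch (not NotebookLM's implementation):
# chunk every uploaded doc, embed the chunks, retrieve the most relevant
# ones for a question, and ground the answer prompt in those chunks.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

docs = {
    "voicecraft.txt": open("voicecraft.txt").read(),   # hypothetical file names
    "emu3.txt": open("emu3.txt").read(),
}

def chunk(text, size=800, overlap=100):
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

chunks = [(name, c) for name, text in docs.items() for c in chunk(text)]

model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vecs = model.encode([c for _, c in chunks], normalize_embeddings=True)

def retrieve(question, k=5):
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q  # cosine similarity on normalized vectors
    best = np.argsort(-scores)[:k]
    return [chunks[i] for i in best]

question = "How does Emu3 compare to specialist image generators?"
context = "\n\n".join(f"[{name}] {text}" for name, text in retrieve(question))
prompt = f"Answer using only these sources:\n{context}\n\nQuestion: {question}"
# `prompt` then goes to whatever chat model sits behind the UI.
```

The hard parts, of course, are everything this sketch waves away: attribution back to sources, a model that can actually synthesize, and a UI that makes it all feel effortless.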

In theory, this tool would be perfect for writing this newsletter. My process is that I collect interesting papers, blog posts, Tweets/Threads, and projects over the course of the week. I should be able to add them to a NotebookLM notebook as they arrive and use the tool to help me write this newsletter. Why don’t I do this? Well, I tried. And the results were bad.

This is what I think NotebookLM needs to fix before being able to assist with writing like this:

  • Context window is way too small. NotebookLM’s context window is enough for short questions but not for more complex prompts. I can’t paste previous editions of my newsletter into the prompt and ask it to write something in my style (see the sketch after this list).
  • Model doesn’t understand general context. NotebookLM is able to summarize documents at the level of a high school student while competitors, including Google AI Studio, do the same at a much higher level. I think that something about the model’s grounding and attribution behavior has hurt its ability to synthesize information. For example, when I ask NotebookLM to synthesize the unique ideas in the multimodal architecture paper I discussed last week, it gives me a regurgitation of the subsection headings. If I ask other models to do the same, they clearly break down the taxonomy and explain why it is novel in clear language.
  • Model doesn’t understand my context. This one is a little harder and potentially related to the short context window. I write this newsletter to synthesize my view of the world. I select topics because they interest me and I think I have the ability to add some unique value on top of the primary source. According to Gemini 1.5 Flash, these themes are: The Evolving Landscape of Evaluation, The Race for Smaller, More Efficient Models, The Future of Voice Assistants, The Rise of Multimodality, and The Importance of Open-Source Models. This is pretty good! With a larger context window or some way to connect new projects to previous ones, NotebookLM could understand that these were the themes I cared about and extract the elements of a given source relevant to those themes.
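To make the contrast concrete, here is roughly the workflow I wish NotebookLM supported, sketched against the Gemini API and its long context window. The file layout and prompt wording are mine, invented for illustration, not anything NotebookLM exposes.

```python
# Sketch of the long-context workflow NotebookLM can't do today: stuff prior
# editions into one prompt so the model can pick up my themes and voice.
# File paths and prompt wording are illustrative, not a real pipeline.
import glob
import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key="...")  # your API key
model = genai.GenerativeModel("gemini-1.5-flash")  # ~1M-token context window

past_issues = "\n\n---\n\n".join(
    open(path).read() for path in sorted(glob.glob("newsletters/issue_*.md"))
)

prompt = (
    "Here are previous editions of my newsletter:\n\n"
    f"{past_issues}\n\n"
    "Identify the recurring themes, then draft a paragraph about this week's "
    "paper in the same voice, tying it back to those themes.\n\n"
    + open("sources/emu3_notes.md").read()  # hypothetical notes file
)

print(model.generate_content(prompt).text)
```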

That said, NotebookLM did uncover a key product insight that I think all the other consumer AI applications can learn from: multi-document RAG plus a simple UI is really awesome for a lot of tasks. As far as I know, every other interface only allows uploading a single doc, which is a total bummer because people rarely work in a vacuum. I want to be able to upload multiple documents and ask the model to help me understand the connections between them. I think that if NotebookLM kept its existing interface, extended the context window, and swapped in a more powerful model, even if it meant losing some attribution or factuality, it would be a really killer product. Now let’s see if someone else does this first…

Next-Token Prediction is All You Need: An open Type D multimodal model

Hold on… Before we get to NTP is all you need, someone already did multi-document RAG with a super capable model and a long context window. ChatGPT actually lets you upload multiple documents and ask questions across them. Watch out, NotebookLM! Its UI is worse and the feature is a bit hidden, but this seems quite compelling. Google AI Studio does the same. I will try to write next week’s issue with both of these tools and see how they stack up against NotebookLM and my own writing process, given that I think they both address the shortcomings I flagged with NotebookLM above.

Back to our original programming: Emu3: Next-Token Prediction is All You Need is a fresh paper from the Beijing Academy of Artificial Intelligence. This is the first time I’ve seen a Type D (native multimodal understanding and generation) model out in the wild. They show that they outperform essentially every specialty model in every domain with a very simple approach: tokenize all modalities and train on NTP. I spent some time with their HuggingFace demo and I have to say that I agree. Image generation seems on par with or better than vanilla SD1.5 (but far worse than fine-tunes) and vision-language understanding seems solid (I have less familiarity with this task). This model is only 8B, so I’m going to get this running locally this weekend to really put it through its paces.
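The recipe is easy to caricature in a few lines of PyTorch. This is my sketch of the idea, not BAAI’s code; the vocabulary sizes, model dimensions, and random tokens are all invented, and the real image tokens would come from the paper’s vision tokenizer.

```python
# Caricature of the "tokenize everything, train on NTP" recipe (not Emu3's code):
# map every modality into one shared discrete vocabulary and train a plain
# decoder-only LM with the usual next-token cross-entropy.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMAGE_VOCAB = 32_000, 8_192          # image codes from a VQ tokenizer (made up)
VOCAB = TEXT_VOCAB + IMAGE_VOCAB                 # one shared token space

class TinyMultimodalLM(nn.Module):
    def __init__(self, d=512, layers=4, heads=8, max_len=1024):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, d)
        self.pos = nn.Embedding(max_len, d)
        block = nn.TransformerEncoderLayer(d, heads, 4 * d, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(d, VOCAB)

    def forward(self, ids):                      # ids: (batch, seq) mixed text+image tokens
        seq = ids.shape[1]
        x = self.tok(ids) + self.pos(torch.arange(seq, device=ids.device))
        mask = nn.Transformer.generate_square_subsequent_mask(seq).to(ids.device)
        return self.head(self.backbone(x, mask=mask))

# A fake interleaved sequence: text prompt tokens followed by image tokens.
text_ids = torch.randint(0, TEXT_VOCAB, (1, 32))
image_ids = torch.randint(TEXT_VOCAB, VOCAB, (1, 256))
ids = torch.cat([text_ids, image_ids], dim=1)

model = TinyMultimodalLM()
logits = model(ids[:, :-1])                      # predict token t+1 from tokens <= t
loss = F.cross_entropy(logits.reshape(-1, VOCAB), ids[:, 1:].reshape(-1))
loss.backward()                                  # same objective, whatever the modality
```

That uniformity is the whole pitch: no diffusion head, no separate vision encoder at generation time, just one autoregressive objective over one token stream.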

While they limit themselves to text, images, and video, I wouldn’t be surprised if they extended this to audio, giving the OSS community access to multimodal assistant experiences ahead of any of the closed models.

Quickly back to my two-document RAG aside: I uploaded this paper and the multimodal taxonomy one, then asked ChatGPT to slot Emu3 into the taxonomy and explain the novelty. It did a great job.

RLEF is so clever; Meta shares a fun approach

When I first started subscribing to GitHub Copilot, I was so frustrated that the model would generate code that wouldn’t compile or was syntactically incorrect. Don’t we have tools (compilers and interpreters) that we could plug into training or inference to check the model’s response and make it essentially impossible to generate non-runnable code (whether the code is correct is another story)?

A few different labs have shared their approach to RLEF, but I thought that ours was fun. Gabriel Synnaeve and team show that they can significantly improve performance on coding benchmarks by passing the written execution feedback back to the model each time it fails a test, retrying until it passes, and then adding the passing example to the DPO loop. It’s a simple, clever way to use execution feedback without a lot of added machinery.
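To make that concrete, here is the shape of the loop as I understand it, as a toy inference-time sketch rather than Meta’s training code; the `generate` callable is a placeholder for whatever code LLM you want to wrap.

```python
# The shape of an execution-feedback loop (my sketch, not Meta's RLEF code):
# run the candidate against a test, and if it fails, hand the error text back
# to the model and ask for a fix. `generate` stands in for any code LLM call.
import subprocess
import tempfile

def run_candidate(code: str, test: str) -> tuple[bool, str]:
    """Execute candidate + test in a subprocess; return (passed, feedback)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test)
        path = f.name
    proc = subprocess.run(["python", path], capture_output=True, text=True, timeout=10)
    return proc.returncode == 0, proc.stderr

def solve_with_feedback(generate, prompt: str, test: str, max_turns: int = 4) -> str:
    code = generate(prompt)
    for _ in range(max_turns):
        passed, feedback = run_candidate(code, test)
        if passed:
            # In an RLEF-style setup, this passing trajectory would become a
            # positive training example; here we just return the working code.
            return code
        prompt = (f"{prompt}\n\nYour previous attempt:\n{code}\n\n"
                  f"It failed with:\n{feedback}\nFix the code.")
        code = generate(prompt)
    return code
```

Run the loop only at inference and you get fewer non-runnable completions; feed the resulting passes and failures back into preference training and you are in the territory the paper explores.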