Dan’s Weekly AI Speech and Language Scoop #36

A new OSS zero-shot voice cloning sheriff is in town

A couple of months ago, I lamented the low performance of OSS zero-shot voice cloning tools. Even the best option, VoiceCraft, was far behind what I would consider the SOTA closed models: ElevenLabs, Character.ai, and Play.ai.

Well, that was then and Fish Audio is now. They still feel a bit behind ElevenLabs and Character.ai, but I think they have matched or surpassed Play.ai, and a clone of my voice could very convincingly fool someone who didn’t know me super well.

I didn’t go deep into how the dozen or so TTS systems I previously tried work, and it’s not obvious from the Fish Audio paper what changed, but this one works very, very well and I think it represents a breakthrough in OSS zero-shot voice cloning (ZS VC).

Also interesting to note is that the second- and third-place models benchmarked in their paper are Reecho and CosyVoice. The former has made zero effort to distribute to English speakers, and the latter not much more. As I mentioned last week, these Chinese labs are absolutely cooking and many are not even on my radar. I spent hours trying to find every OSS ZS VC system I could, and the two best ones (prior to Fish Audio) totally evaded detection.

Gemini 2.0 Flash speech to speech is live (maybe)

Gemini 2.0 launched this week with a bunch of super impressive speech-to-speech capabilities. As soon as I heard the news, I jumped over to AI Studio Live to check it out. Unfortunately, the system just didn’t work as I had hoped.

  • A native audio model should be able to both understand and generate paralinguistics. This one did neither: it did not respond appropriately to my tone (e.g. when I complained about getting 80% on my math test, it congratulated me), nor did it speak in a whisper when instructed.
  • The model didn’t know that it couldn’t follow my paralinguistic instructions. When I asked it to speak like a pirate or to whisper, the semantics changed appropriately but the paralinguistics did not.
  • The system felt very assistant-y and unnatural. Latency was good, but OpenAI and OSS models like Moshi do a better job of simulating a natural conversation. Barge-ins were functional but slow.
  • Multilinguality is handled less naturally than in GPT-4o. When I started a conversation in Spanish, it continued in Spanish and wouldn’t respond to my English barge-in. When I asked to learn basic French using English as a base language, the model pronounced the French words with an English accent, and pronunciation of even simple words was surprisingly bad, e.g. fran-kay for français. It feels like there is some kind of language switching going on behind the scenes rather than a single native multilingual audio model.
  • “As a language model, I don’t have preferences…” This is more style than tech, but I think a dialogue model shouldn’t respond like that. It is no fun to talk to a model when the user has to drive the entire conversation.
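
If anyone wants to poke at this outside the AI Studio UI, the same experimental model is exposed through the Multimodal Live API, so the whisper test is easy to script. Below is a minimal sketch of that probe using the google-genai Python SDK; this is my own test, not an official example, the API key and filenames are placeholders, and the SDK surface (client.aio.live.connect, session.send) was brand new at launch, so exact names may shift between versions.

```python
# A minimal probe of Gemini 2.0 Flash audio output over the Multimodal Live API.
# Sketch only: method names follow the google-genai SDK docs at launch and may
# change; the API key and output path are placeholders.
import asyncio
import wave

from google import genai

client = genai.Client(api_key="YOUR_API_KEY",
                      http_options={"api_version": "v1alpha"})

MODEL = "gemini-2.0-flash-exp"
CONFIG = {"response_modalities": ["AUDIO"]}  # ask for spoken replies

async def whisper_probe() -> None:
    async with client.aio.live.connect(model=MODEL, config=CONFIG) as session:
        # The semantics of the reply should change; the question is whether
        # the paralinguistics (an actual whisper) change too.
        await session.send(input="Whisper a short bedtime sentence to me.",
                           end_of_turn=True)

        # Replies stream back as chunks of 24 kHz, 16-bit mono PCM.
        with wave.open("reply.wav", "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)
            wav.setframerate(24000)
            async for response in session.receive():
                if response.data is not None:
                    wav.writeframes(response.data)

asyncio.run(whisper_probe())
```

Text in, audio out is enough to check the generation side of paralinguistics; checking the understanding side (reacting to my tone) would require streaming microphone audio into the same session, which the Live API also supports.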

Given that I couldn’t even reproduce many of the examples highlighted in their demo video, I suspect that the Gemini 2.0 Flash Experimental model in AI Studio today is just the existing Gemini Live cascaded system powered by Gemini 2.0 Flash rather than the real native audio model. Can anyone from Google confirm or deny?