The trials and tribulations of zero-shot voice cloning and the puzzle of why commercial >> OSS
I’ve been thinking a lot about zero-shot voice cloning and voice deep fakes recently. This election was supposed to be defined by deep-fake misinfo. Instead, it has been defined by absurd AI-generated memes. That said, I’ve been getting more interested in speech generation and wanted to learn more about the tools out there and what it would take to deep fake my own voice.
I rank the tools approximately by quality. Note that ElevenLabs is widely considered SOTA, but I wanted to limit myself to free options that anyone can use without any obvious guardrails.
Note: I removed the clips I used to clone my voice and the examples from the public edition. If you have a strong interest in ZS voice cloning and these would help you, I’m happy to share privately.
Character.ai (closed source)
Character.ai lets anyone create a deep fake version of themself with only a few clicks. Your character will not only mimic your voice but whatever topics you’d like to include in your bio. They don’t disclose exactly what technology they use to enable this, but the results are really good. My Character sounds almost indistinguishable from me. While they don’t offer API access to cloned voices or an explicit TTS interface, you can ask the character to say something and record system audio.
Play.ai (closed source)
Play.ai provides a similar interface to Character with the addition of more business-focused options like adding a phone number and knowledge base. They disclose that they use an off-the-shelf LLM but do not disclose how their ZS voice cloning works. Performance is noticeably worse than Character. Like Character, there is no official API or TTS interface.
WhisperSpeech (open source)
WhisperSpeech inverted Whisper to create the best OSS zero-shot voice cloning system. They provide a super simple example script that runs without modification in a Google Colab and yields a cloned voice almost on par with Play.
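Their example really is only a few lines. Here is a minimal sketch based on the usage shown in the WhisperSpeech README; the `speaker` reference-audio parameter and the filename `my_voice_sample.wav` are assumptions on my part, so check their repo for the current API:

```python
# Hedged sketch of WhisperSpeech zero-shot cloning; API names taken from
# their README and may have drifted. The import is guarded so this file
# loads even without the library installed.
try:
    from whisperspeech.pipeline import Pipeline
    HAVE_WHISPERSPEECH = True
except ImportError:
    HAVE_WHISPERSPEECH = False

def clone_voice(text: str, ref_wav: str, out_wav: str = "cloned.wav") -> None:
    """Synthesize `text` in the voice from `ref_wav` (assumed `speaker` kwarg)."""
    pipe = Pipeline()  # downloads the default T2S/S2A checkpoints on first run
    pipe.generate_to_file(out_wav, text, speaker=ref_wav)

if HAVE_WHISPERSPEECH:
    # "my_voice_sample.wav" is a hypothetical reference clip of your own voice.
    clone_voice("Testing my cloned voice.", "my_voice_sample.wav")
```

On a Colab GPU the whole thing, checkpoint download included, finishes in a couple of minutes.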
Tortoise (open source)
Neets.ai offers some really impressive cloned voices of celebrities created with Tortoise. Unfortunately for the sake of this exercise, they fine-tuned the model on these voices rather than using zero-shot techniques. Setting up Tortoise is super easy if you follow their instructions carefully, but its zero-shot voice cloning is disappointing. My clone affects some of my spoken idiosyncrasies, but it almost sounds like a generic male voice is half-heartedly trying to imitate me.
Bark, XTTS, MetaVoice, OpenVoice, Vall-E-X (open source)
I had high hopes for all of these models and went to great lengths to get them all running. Unfortunately, they all disappointed. Every one of these repos claims strong zero-shot voice cloning performance, but I could not replicate those results. I suspect the examples they showed were similar to voices present in the training data and thus cloned nicely. Perhaps my voice was a bit out-of-distribution and thus too challenging.
This experience raises the question of what ElevenLabs and Character.ai are doing. I’ve heard from many TTS experts that zero-shot voice cloning is not hard, but we don’t seem to have a good option out there in the open. Please reach out if you can explain to me why these OSS projects are so far behind the SOTA.
OpenAI ships Google Duplex’s six-year-old dream and finds margins with a speech-to-speech API
A seeming eternity (in AI time) after the initial GPT-4o demo and the alpha release of Advanced Voice Mode, OpenAI launched its speech-to-speech API. The API appears essentially identical to AVM with the exception of offering function calling, which was my main criticism of AVM. Some of their launch videos show the voice mode seamlessly interacting with other systems, teasing how powerful this voice interface will be. The six-year-old dream of Google Duplex has finally arrived.
Speech-to-speech is priced at ~$10/hour, depending on the exact blend of input and output tokens, or roughly 20x more than their text API. My theory for this aggressive pricing is twofold:
- OpenAI really needs to find some margin. I suspect that their margins on the text APIs have been pushed close to zero (or even negative) by Google trying to catch up and Meta offering free Llama models.
- There is really no other competitive speech model on the market today. Many startups, like Tavus mentioned below, are cooking up cascaded systems, but OpenAI was first to market with a speech-to-speech API (Google does offer speech-in/text-out for Gemini).
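To see where the ~$10/hour figure comes from, here is a back-of-envelope calculation using the per-minute audio rates OpenAI quoted at launch (roughly $0.06/min of audio in and $0.24/min of audio out; these are assumptions from the launch pricing and ignore the much cheaper text tokens):

```python
# Assumed launch-era Realtime API audio rates; adjust if prices have changed.
AUDIO_IN_PER_MIN = 0.06   # $/min of user audio sent in
AUDIO_OUT_PER_MIN = 0.24  # $/min of model audio generated

def call_cost(input_minutes: float, output_minutes: float) -> float:
    """Blended cost of one call, split by who is speaking."""
    return input_minutes * AUDIO_IN_PER_MIN + output_minutes * AUDIO_OUT_PER_MIN

# A balanced hour-long conversation: each side speaks ~30 minutes.
print(f"${call_cost(30, 30):.2f}/hour")  # → $9.00/hour
```

A balanced hour lands around $9; skew the call toward model speech and you quickly clear $10/hour, which is why the blend matters.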
At $10/hour, I’m not sure that either the companion or call-center use case makes sense at scale (I’d bet that off-shored call-center operators pay employees on this scale and don’t have to deal with setting up a bunch of expensive IT infra), but I would bet with strong odds that these prices will come down 10x-100x fast. I’m sure that Google is hung up on safety and wouldn’t be surprised if we see a Gemini response very soon now that OpenAI has crossed the Rubicon on generative voice.
Y conversational AI video calls?
I’ve been following a bunch of startups building conversational agents for a variety of use-cases (mostly centered around companionship/entertainment and call center replacement). Whenever I see a demo posted, I try to spend some time chatting with the agent to understand the limitations of various approaches.
Last week Tavus got some traction by relaunching their video agents on Hacker News. While they claim low latency and strong conversational awareness, I found the experience inferior to most audio-only competitors. However, they differentiate by offering video cloning as well as audio cloning.
The video cloning demo is impressive (similar to what Zuck demoed at Connect) but falls into the uncanny valley. I felt a sense of unease chatting with both Hassan and Carter that was qualitatively different from chatting with an audio clone. Many of the responses on Hacker News echoed the same sentiment, in addition to being generally creeped out by the technology (many commenters covering their webcams to avoid being cloned themselves).
Is there a strong use-case for AI video calls? Does interacting with an AI over video add something new? I don’t have an answer to either of these. If you do, let me know.