Dan’s Weekly AI Speech and Language Scoop #13

These notes are written for a Meta-internal audience. I quickly redact Meta-internal information before publishing but don’t take the time to rewrite. If a particular edition seems thin or a transition doesn’t make sense, this is why.

One-shot language learning with Gemini 1.5 Pro

Google released the 153-page Gemini 1.5 technical report. I haven’t gotten through the whole thing yet, but I think they finally found a genuinely novel use case for their differentiated long-context windows: one-shot language learning.

Kalamang is a New Guinean language spoken by fewer than 200 people and documented in only a handful of field manuals and other academic materials (~250k tokens). The Gemini team shows that the model can achieve near-human-level translation from English to Kalamang by providing all available materials in the context window (in fairness, GPT-4 and Claude 3 also do reasonably well with less context). They show similar results for speech recognition, prompting the model with just 45 minutes of audio and achieving a 23% CER for transcribing Kalamang. I am impressed and would love to see this extended to other ultra-low-resource languages.
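For anyone curious what this setup looks like in practice, here is a minimal sketch of the long-context recipe: concatenate all available Kalamang reference material into the prompt and ask for a translation. It assumes the google-generativeai Python client and the "gemini-1.5-pro-latest" model id; the file names and the example sentence are hypothetical placeholders, not anything from the report.

```python
# Minimal sketch of long-context "one-shot language learning":
# stuff all available Kalamang documentation (~250k tokens) into the
# prompt, then ask the model to translate using only that material.
# Assumes the google-generativeai client; model id is an assumption.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro-latest")

# Hypothetical local copies of the field materials described in the report.
reference_files = [
    "kalamang_grammar.txt",
    "kalamang_dictionary.txt",
    "kalamang_parallel_sentences.txt",
]
reference_text = "\n\n".join(
    open(path, encoding="utf-8").read() for path in reference_files
)

prompt = (
    "Below is the complete available documentation of the Kalamang language:\n\n"
    f"{reference_text}\n\n"
    "Using only this material, translate the following English sentence "
    "into Kalamang:\n"
    "I am going to the garden tomorrow."
)

response = model.generate_content(prompt)
print(response.text)
```

The whole trick is that nothing is fine-tuned: the grammar book and dictionary simply fit in the context window, and the model learns the language "in context" at inference time.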

Note to self: Do not steal voices

Scarlett Johansson voiced the AI in the movie Her, about a man who falls in love with an AI assistant. I never watched the film, but apparently it was a hit, and a huge chunk of the general public associates ScarJo with AIs.

OpenAI recognized this and thought that her voice would help drive excitement for their new voice assistants (rightly so, I think), so they reached out to negotiate a license. She declined, so OpenAI moved forward with a sound-alike that was obviously inspired by her character (and possibly by the associated Warner Bros IP).

Post-launch, ScarJo was “shocked and angered” and seems to be pursuing legal action. It’s not clear to me whether creating sound-alike voices is strictly illegal, but it’s a really bad look for OpenAI.