These notes are written for a Meta-internal audience. I quickly redact Meta-internal information before publishing but don’t take the time to rewrite. If a particular edition seems thin or a transition doesn’t make sense, this is why.

Meta AI Speech team wins best paper at ICASSP

ICASSP is one of the top speech conferences in the world. Meta’s speech team took home the best paper award for a novel method of factorizing the RNN-T model so that the acoustic model and an internal language model can be jointly optimized more effectively. This should make it easier to bring world knowledge to a speech recognizer, and it significantly improved both WER and rare-word WER on benchmarks.
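For readers less familiar with the metric, here is a minimal sketch of how WER (word error rate) is typically computed: the word-level edit distance between the reference transcript and the recognizer's hypothesis, divided by the number of reference words. The function name and whitespace tokenization are my own simplifications for illustration, not anything taken from the paper, and real scoring pipelines usually normalize text first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A rare-word variant would presumably restrict the count to errors on an infrequent-word list, but the exact definition used in the paper is not given here, so the sketch leaves it out.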

A few interesting non-Meta papers that caught my eye:

Lab testing can be simulated. A team at Google showed that room simulations could replace or augment lab testing for training and evaluation.
Speech models can be aggressively quantized. A team at Google showed that speech models can be compressed by up to 90% with minimal reduction in WER using quantization.
Bolting an LLM decoder onto a speech model helps performance [1, 2].
Context really helps with rare words.
PEFT for personalized ASR. A team of academics showed that they can significantly improve ASR performance through a personalized fine-tuning technique.

How to make sure that your model is in the top corner? Invent an eval!

Mistral and Snowflake both released new open-weight models last week. Neither is particularly notable beyond Snowflake’s decision to ship a super sparse architecture (128 experts/480B with only 2 experts/17B active). The bit relevant to this newsletter is that Snowflake positioned their model as SOTA on a contrived plot that combined training compute with “Enterprise Intelligence”, which is the tortured combination of four standard benchmarks.

I bring this up not only to make fun of Snowflake for creating a silly eval to crown themselves the winner, but also to salute them for creating a silly eval. They know their customers best. If they believe that these metrics represent their customer use-cases, then good on them for reporting.

Jury > Judge for AI evals

A team at Cohere showed (not surprisingly) that a panel of LLMs was more effective at evaluating models than any individual model alone.
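The basic idea can be sketched in a few lines. This is a toy illustration assuming simple majority voting over pairwise preferences, which may not match the aggregation Cohere actually used. The judge functions are hypothetical stand-ins for real LLM calls, and every name here is invented for the example.

```python
from collections import Counter
from typing import Callable, Sequence

# A judge takes (prompt, answer_a, answer_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def jury_verdict(judges: Sequence[Judge], prompt: str,
                 answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie" by majority vote across all judges."""
    votes = Counter(judge(prompt, answer_a, answer_b) for judge in judges)
    if votes["A"] == votes["B"]:
        return "tie"
    return "A" if votes["A"] > votes["B"] else "B"


# Hypothetical stand-ins for LLM judges, each with its own bias.
def prefers_longer(prompt: str, answer_a: str, answer_b: str) -> str:
    return "A" if len(answer_a) >= len(answer_b) else "B"


def prefers_shorter(prompt: str, answer_a: str, answer_b: str) -> str:
    return "A" if len(answer_a) <= len(answer_b) else "B"


if __name__ == "__main__":
    panel = [prefers_longer, prefers_longer, prefers_shorter]
    print(jury_verdict(panel, "What is WER?", "short", "a much longer answer"))
```

The appeal of a panel is that each judge's individual bias (here, a crude length preference) gets diluted by the others rather than deciding the outcome alone.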

Deep audio fakes are coming. Are we ready?

A Baltimore principal was fired last week for allegedly making racist comments on social media. Plot twist: the recording was faked by a disgruntled PE teacher. What’s not surprising is that this happened. Voice cloning tools are everywhere. What is surprising is that it took a week to reverse the firing and (I believe) only old-fashioned gumshoe police work got to the bottom of the story.

Daniel D. McKinnon

Musings on adventuring in the modern era and tinkering with technological curiosities

Dan’s Weekly AI Speech and Language Scoop #10
