Dan’s Weekly AI Speech and Language Scoop #7

These notes are written for a Meta-internal audience. I quickly redact Meta-internal information before publishing but don’t take the time to rewrite. If a particular edition seems thin or a transition doesn’t make sense, this is why.

Gemini’s “native” audio understanding falls flat

Google announced a big Gemini update today whose marquee feature appears to be that “Gemini 1.5 Pro can now hear”. I assume that this means the model consumes “native” speech tokens instead of transcribed speech or cascaded speech embeddings, which should help make the chatbot more robust to ASR errors and vice versa.

I hopped over to Google AI Studio to give a few hard test cases a whirl. Unfortunately, this “native” audio capability seems to perform worse than several open-source cascaded systems. I picked two failure cases where the pronunciation drives an incorrect transcription but the correct entity should be obvious from context, exactly the kind of repair an LLM should excel at (if you paste either transcript into an LLM and ask it to correct the error, it will).

In both cases, after reading these utterances to Gemini, it happily hallucinated entire stories about places that don’t actually exist. I was hoping this launch would teach me something about the value of native audio understanding, but alas, it may instead serve as a warning.
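For reference, the kind of cascaded baseline I have in mind is sketched below: an off-the-shelf ASR model produces a transcript, and an LLM then repairs entity errors from context. This is a minimal illustration (Whisper for the ASR step; the correction prompt and the placeholder LLM call are my own assumptions), not any particular production system.

```python
# Cascaded baseline: ASR produces a transcript, then an LLM repairs entity errors
# from context. Sketch only; the prompt and the LLM call are placeholders.
import whisper

asr = whisper.load_model("base")                 # any open-source ASR model would do
raw = asr.transcribe("hard_test_case.wav")["text"]

prompt = (
    "This transcript may contain misrecognized place or entity names. "
    "Using the surrounding context, rewrite it with the most plausible entities:\n\n"
    + raw
)
# corrected = my_favorite_llm.generate(prompt)   # any instruction-tuned LLM works here
```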

Mixture of Depths enables models to reason about computation

Last week, we learned that we can drop almost half of the layers in a Llama model without meaningfully impacting performance. However, this approach is a little brutish. What if certain tokens require more reasoning or computation to generate than others?

This week, a group of Google researchers examined just that. They introduced the “Mixture of Depths” architecture, which dynamically allocates compute per token: a learned router decides which tokens a given block should process, and the rest skip it through the residual connection. In practice, this means that “easy” or low-entropy tokens can be generated with less compute while the model’s full depth is reserved for harder ones.
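The core mechanism is easy to sketch. Below is a rough, simplified PyTorch illustration of a Mixture-of-Depths block as I understand it: a router scores every token, only the top-k tokens (a fixed capacity) are run through the block’s attention/MLP, and the rest ride the residual path untouched. The class and parameter names are mine, and details like causal routing at inference time are omitted.

```python
import torch
import torch.nn as nn

class MixtureOfDepthsBlock(nn.Module):
    """Simplified Mixture-of-Depths block: a router picks the top-k tokens to
    process; all other tokens bypass the block via the residual connection."""

    def __init__(self, d_model: int, block: nn.Module, capacity: float = 0.5):
        super().__init__()
        self.router = nn.Linear(d_model, 1)   # scalar routing score per token
        self.block = block                    # the usual attention + MLP sub-block
        self.capacity = capacity              # fraction of tokens that get compute

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, _ = x.shape
        k = max(1, int(self.capacity * seq_len))

        scores = self.router(x).squeeze(-1)        # (batch, seq_len)
        topk = scores.topk(k, dim=-1).indices      # tokens selected for full compute

        out = x.clone()                            # skipped tokens pass through unchanged
        for b in range(batch):                     # gather/scatter kept deliberately simple
            idx = topk[b]
            processed = self.block(x[b, idx].unsqueeze(0)).squeeze(0)
            # Weight the block output by the router score so routing stays differentiable,
            # then add it back onto the residual stream.
            weight = torch.sigmoid(scores[b, idx]).unsqueeze(-1)
            out[b, idx] = x[b, idx] + weight * processed
        return out
```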

This is a natural progression of the MoE (mixture of experts) architecture and enables us to run more powerful models with less compute (perhaps trading off memory). I hope that this work combined with aggressive quantization will get us to the point of being able to run more powerful models on weak hardware.

Cohere released the first open-weight model competitive with GPT4: Your move, Llama3?

Cohere is an interesting company. They took a stab at competing in frontier models and appear to have pivoted to supporting their new collection of open-weight models to the benefit of the broader community. 

This week, they released the weights of Command R+, the first open-weight model to touch GPT4-level performance in the Chatbot Arena, a result confirmed by my own vibes-based eval. Looking forward to seeing if Meta can take back the throne with Llama3.