Dan’s Weekly AI Speech and Language Scoop #12

These notes are written for a Meta-internal audience. I quickly redact Meta-internal information before publishing but don’t take the time to rewrite. If a particular edition seems thin or a transition doesn’t make sense, this is why.

Speech is the belle of the ball

For those of you who have been hibernating, OpenAI hosted their “Spring Update” yesterday and Google’s I/O begins today. Both events centered on (or, in Google’s case, will likely center on, judging from their preview video) the launch of more fluent voice interfaces to marginally improved multimodal models.

The 20-minute OpenAI video (and accompanying snippets of various fun use-cases) is well worth a watch. They demonstrate how their voice interface can:

  • Intuit emotion
  • Respond to non-verbal audio like breathing pace
  • Adjust tone according to instructions and gravity of conversation
  • Land non-dad jokes
  • Fluently switch languages
  • Handle interruptions and redirects

These demos were impressive, but as Soumith put it, they feel within reach.

The rise and fall of KANs, or the wild world of new architectures

Last week, I mentioned that a new network architecture called Kolmogorov–Arnold Networks (KANs) was blowing up on social media with claims of significantly better performance than MLPs on a variety of simple fitting tasks. Days later, the community consensus appears to be that KANs are essentially re-expressions of MLPs parameterized in different ways. This neural architecture theory is way above my pay grade, but if you want to watch a debate unfold in real time, start with this Twitter thread and notebook.
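To make the comparison a bit more concrete, here is a toy sketch (names and details are mine, not from the paper) of the structural difference: an MLP layer applies a fixed nonlinearity after a learned linear map, while a KAN-style layer puts a learnable univariate function on every edge and sums the results. I use a few fixed Gaussian bumps as the basis instead of the paper’s B-splines to keep it short; the “re-expression” argument is roughly that each such edge function can itself be rewritten as a small MLP.

```python
import torch
import torch.nn as nn

class ToyKANLayer(nn.Module):
    """A KAN-flavored layer: one learnable univariate function per edge.

    Each edge function is a linear combination of fixed Gaussian bumps,
    a simplification of the learnable B-splines in the actual paper.
    """
    def __init__(self, d_in, d_out, n_basis=8):
        super().__init__()
        # Basis centers spread over an assumed input range of [-1, 1].
        self.register_buffer("centers", torch.linspace(-1, 1, n_basis))
        # One coefficient vector per edge (d_in x d_out edges in total).
        self.coef = nn.Parameter(0.1 * torch.randn(d_in, d_out, n_basis))

    def forward(self, x):  # x: (batch, d_in)
        # Evaluate every basis bump on every input coordinate.
        bumps = torch.exp(-((x.unsqueeze(-1) - self.centers) ** 2) / 0.1)  # (batch, d_in, n_basis)
        # phi_ij(x_i) = sum_k coef[i, j, k] * bump_k(x_i); outputs sum over i.
        return torch.einsum("bik,iok->bo", bumps, self.coef)

class ToyMLPLayer(nn.Module):
    """The usual contrast: a fixed nonlinearity after a learned linear map."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)

    def forward(self, x):
        return torch.relu(self.linear(x))
```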

And on that note, another exciting new architecture has made the rounds this week: xLSTM. The idea is to capture the scaling power of transformers without the quadratic compute required by a traditional attention matrix. Unlike KANs, this new architecture is far more accessible to the dilettante, so feel free to give this one a read without risk of headache.
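The complexity argument is easy to see in code. Below is a minimal sketch (shapes and names made up, and all of the paper’s gates, normalization, and stabilization tricks omitted) contrasting vanilla attention, which materializes a T × T score matrix, with a fixed-size recurrent “matrix memory” in the spirit of xLSTM’s mLSTM and linear attention, whose cost grows only linearly in sequence length.

```python
import torch

T, d = 1024, 64  # made-up sequence length and head dimension
q, k, v = (torch.randn(T, d) for _ in range(3))

# Vanilla attention materializes a T x T score matrix: O(T^2) time and memory.
scores = (q @ k.T) / d ** 0.5                 # (T, T)
attn_out = torch.softmax(scores, dim=-1) @ v  # (T, d)

# Recurrent alternative: a fixed-size d x d state updated once per step,
# so total cost is O(T * d^2), linear in T.
state = torch.zeros(d, d)
rec_out = torch.zeros(T, d)
for t in range(T):
    state = state + torch.outer(v[t], k[t])   # write step t into the memory
    rec_out[t] = state @ q[t]                 # read it back with the query
```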

Reverse training to nurse the reversal curse

A very common criticism of LLMs is that they fail at reversals of relationships they have only seen in one direction (the “reversal curse”). For example, the models can easily predict what line comes after “Gave us proof through the night that our flag was still there” in the National Anthem but cannot predict what line comes before it.

Our team at FAIR just published a simple, clever technique to address the problem: they reverse the data!
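Here is a minimal sketch of the idea as I understand it: augment the training corpus with reversed copies of each example so the model also learns “what comes before,” not just “what comes after.” The paper considers several reversal granularities (including variants that keep entity names intact); this shows only simple word-level reversal, and the mini-corpus and function name are mine.

```python
def word_reverse(text: str) -> str:
    """Reverse a training example at the word level (one of several
    reversal granularities; entity-preserving variants keep names
    like "Abraham Lincoln" intact)."""
    return " ".join(reversed(text.split()))

# Hypothetical mini-corpus: training on both directions exposes the model
# to the relationship in reverse order as well.
corpus = ["Gave us proof through the night that our flag was still there"]
augmented = corpus + [word_reverse(s) for s in corpus]
for example in augmented:
    print(example)
```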