Dan’s Weekly AI Speech and Language Scoop #4

These notes are written for a Meta-internal audience. I quickly redact Meta-internal information before publishing but don’t take the time to rewrite. If a particular edition seems thin or a transition doesn’t make sense, this is why.

Bias in speech recognition models

I stumbled upon a paper that evaluated both Seamless (Meta) and Whisper (OpenAI) on male and female voices. The authors identified clear gaps, with both models generally performing better on male speakers. I later learned that Meta reported the same finding in the Seamless paper (Page 79, Table 56).

I believe these biases are important to keep in mind as we think about assistant-related metrics. Strong mean performance can hide poor performance for entire groups of users, so we need to look at the distribution, not just the average.

Are all of our users having a pretty good experience? Or are most having an excellent experience while some are unable to use the product at all?
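To make the mean-versus-distribution point concrete, here is a toy Python sketch with made-up per-speaker WER numbers (not real Seamless or Whisper results): the overall mean looks acceptable while one group, and one speaker in particular, is effectively locked out.

```python
# Hypothetical per-speaker word error rates (WERs), purely illustrative.
wers = {
    "speaker_a": 0.04,  # male
    "speaker_b": 0.05,  # male
    "speaker_c": 0.05,  # male
    "speaker_d": 0.06,  # female
    "speaker_e": 0.18,  # female -- the product is effectively unusable here
}
groups = {
    "male": ["speaker_a", "speaker_b", "speaker_c"],
    "female": ["speaker_d", "speaker_e"],
}

overall = sum(wers.values()) / len(wers)
print(f"overall mean WER: {overall:.3f}")  # ~0.076 -- looks fine in aggregate

# Per-group means and the worst case tell a different story.
for name, members in groups.items():
    group_wers = [wers[m] for m in members]
    print(f"{name}: mean={sum(group_wers) / len(group_wers):.3f} "
          f"max={max(group_wers):.3f}")
```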

YouTube takes a page out of the Meta captions playbook and recognizes the value of user-contributed transcriptions

Several years ago, Meta built a series of features that nudged Facebook creators into generating and correcting automatic captions for their videos. This is a great example of a win-win-win AI flywheel that benefits producers (through more reach for properly captioned videos), consumers (through access to captions), and Meta (through additional training data). This week, YouTube announced that they will be following suit.

FAIR shows it is possible to train semantic experts for MoE models

Mixture of experts (MoE) language models generally replace the single feed-forward network (FFN) in some transformer blocks with a router and a collection of FFNs that are selected based on the context. This means that only a fraction of the model's parameters are activated for each generated token, so inference cost is reduced substantially versus a dense (non-MoE) model at the same level of performance. Empirically, these models seem to be more efficient to train as well.
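As a concrete illustration of the routing idea, here is a minimal PyTorch sketch of a top-k MoE feed-forward layer. This is my own simplification, not code from Mixtral or any Meta model, and the dimensions and top_k value are arbitrary.

```python
# Minimal sketch of a top-k MoE feed-forward layer (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # scores experts per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (n_tokens, d_model)
        scores = self.router(x)                 # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only top_k of n_experts FFNs run for each token, so only a fraction
        # of the layer's parameters are active per token.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(4, 512)
print(MoEFeedForward()(tokens).shape)  # torch.Size([4, 512])
```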

Intuitively, it seems like these models should learn to route science queries to a science expert, pop culture queries to a pop culture expert, and so on, but this turns out not to be true. Rather, experts seem to specialize in different syntactic structures (Figure 8 in the Mixtral paper).

Researchers at FAIR developed a new approach to MoE called Branch-Train-MiX that lets researchers train domain experts independently and then merge them into a single model (compare Figure 6 in their paper to Figure 8 in the Mixtral paper to see this in action). They mostly touted the efficiency benefits inherent to such an approach, but I immediately leapt to how this might be used to improve assistants.
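My rough mental model of the "mix" step, sketched in Python below: each domain-trained dense model contributes its FFN weights as one expert, the remaining weights are averaged, and a freshly initialized router is finetuned afterwards. The checkpoint layout and key names here are hypothetical, not the paper's actual code.

```python
# Rough sketch of merging independently trained dense checkpoints into one
# MoE checkpoint, in the spirit of Branch-Train-MiX. Illustrative only.
import torch

def mix_checkpoints(domain_state_dicts, ffn_prefix="ffn."):
    """Merge N dense checkpoints (same architecture) into one MoE state dict."""
    merged = {}
    reference = domain_state_dicts[0]
    for name, tensor in reference.items():
        if name.startswith(ffn_prefix):
            # FFN weights stay separate: expert e gets domain model e's FFN.
            for e, sd in enumerate(domain_state_dicts):
                merged[f"experts.{e}.{name}"] = sd[name].clone()
        else:
            # Everything else (attention, embeddings, norms) is averaged.
            merged[name] = torch.stack(
                [sd[name] for sd in domain_state_dicts]).mean(0)
    return merged

# Tiny demo with fake "checkpoints"; in practice these would be the dense
# models trained on, say, math, code, and general web data, and the merged
# model (plus a fresh router) would then be finetuned on a mixed data stream.
dense = lambda: {"attn.weight": torch.randn(4, 4), "ffn.w1": torch.randn(8, 4)}
moe_sd = mix_checkpoints([dense(), dense(), dense()])
print(sorted(moe_sd.keys()))
```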

If we know that users do a fixed number of things, can we train an expert for each domain (perhaps some experts are very small and trained only to use simple tools) and then combine them into an interpretable end-to-end system that handles all user queries?