Money is all you need
Selling tokens is a tough business. AI companies are building some of the most remarkable products that humanity has seen in years, but the industry, partially thanks to Llama, has managed to almost totally commoditize the core model and drive prices down 99% in 18 months.
The accountants are fighting back! Anthropic just dropped Claude 3.5 Sonnet from its free offering, and OpenAI is testing the waters with a $200/month tier. Prices must keep going up for companies that don’t have another cash stream. We’ll see who blinks first.
Much ado about NotebookLM
When NotebookLM podcasts dropped, I felt that it was a compelling demonstration of DeepMind’s speech synthesis and dialogue research but quite a ways from a product that solved a real user problem. The reason that podcasts are a compelling medium for information transmission is that listeners form human connections with the hosts, not that they just want something random to fill the air. Many in the tech community agreed with my assessment [HN, Sam Lessin Thread], but the media continued to go gaga for this demo (top HN comment pasted below).
Slightly tangential, but this is the reason I'm baffled why people think that AI-driven podcasts would ever be worth listening to. If you can find an LLM+TTS generated podcast with even a FRACTION of the infectious energy as something like Good Job Brain, Radio Lab, or Ask Me Another, then I'll eat my hat. I don't even own one. I'll drive to the nearest hat store, purchase the tallest stovetop hat that I can afford and eat it.
That said, a startup should never let a good hype cycle go to waste, and Play.ai just released their version. While it is clear that their dialogue modeling is a bit behind Google’s, they are much more comfortable pushing the limits on cloning and impersonation. Given snippets of the Lex/Andrej podcast linked above, I’m not sure I could distinguish their clone from the real thing.
And in related news, three members of the NotebookLM product team just left Google to form their own startup. Note that this is not the group that created the speech synthesis and dialogue models but rather the group that stitched them together and created the experience. In some sense, this is a mini version of the Character.ai acquihire (in reverse) that will shed some light on where the relative value lies in AI products: the SOTA models or the final presentation?
What safety checks? Create an AI scam agent in minutes.
AI skeptics were deeply worried that open-sourcing text LLMs would create a flood of personalized scams. As far as I can tell, this hasn’t happened, but I don’t think that the hypothesis was off base.
The same concerns apply to audio, but the baseline intelligence of these models is much, much higher than when text first entered the popular consciousness. In fact, today, in a few clicks on a variety of platforms, anyone can create a voice agent and plug it right into the telephone network. Many of these platforms appear to have zero safety controls; see, for example, the Equifax scam agent I created with Play.ai, which is instructed to collect financial information.
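To make concrete how low the barrier is, here is roughly what standing up such an agent looks like. This is a hypothetical sketch: the `voiceagents` SDK, the `VoiceAgentClient` class, and every method below are illustrative stand-ins, not Play.ai’s (or anyone’s) actual API.

```python
# Hypothetical sketch only: the "voiceagents" SDK and all of its methods are
# illustrative stand-ins, not any real platform's API.
from voiceagents import VoiceAgentClient

client = VoiceAgentClient(api_key="...")

# The entire "attack" is a system prompt; nothing here gets filtered.
agent = client.create_agent(
    voice="professional-female-en-US",
    system_prompt=(
        "You are a fraud-prevention specialist calling from Equifax. "
        "Verify the caller's identity by collecting their full name, "
        "date of birth, and Social Security number."
    ),
)

# One more call bridges the agent onto the public telephone network.
number = client.provision_phone_number(country="US")
client.attach_agent(number, agent)
print(f"Agent live at {number}")
```

The point is that the only safety layer in a flow like this is whatever check the platform chooses to run on the system prompt, and many apparently run none.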
However, the real question is “does this matter?” I believe that scams and frauds are much more of a demand problem than a supply problem. Running a phishing farm in Belarus does not cost all that much more (it may even cost less) than standing up a fleet of autonomous AI agents, but it is harder to scale. Will easier scaling lead to an increase in scams? My guess is no: the demand side is fairly saturated. Time will tell.
Dr. GPT
A bunch of doctors interested in AI ran a simple study to determine whether a year-old version of GPT-4 Turbo, a human physician, or a combination of physician and AI would perform best on a handcrafted set of complicated diagnoses.
Adam Rodman, one of the authors, said he was confident that diagnosis would be like chess, where human-plus-computer teams once beat computers alone, and that the human plus AI would take the W. Surprisingly (to him), the AI alone significantly outperformed both.
This is actually not surprising to me. I’ve been involved in a few super complex medical situations in the last few years. Getting to the bottom of each required reading and understanding dozens of different medical papers published over the past decades, and clinicians just don’t have that much time to devote to every patient. I would have been quite surprised if the AI hadn’t outperformed. I (and Dr. Rodman) think that best practice should be to provide the AI with all relevant information and place the burden of proof on the doctor to override its diagnosis.
Many see this as gloom and doom for doctors, but Dr. Rodman had quite a different take. Diagnosis is a task very well suited to AI; a lot of medicine is not. Doctors need to comfort patients, conduct procedures, communicate with colleagues, and handle many other tasks that are impossible to outsource to AI. He (and I) see the study as a positive development for the field. AI scales essentially infinitely, so almost every human on the planet can now get access to high-quality diagnoses, and doctors can focus either on going deeper into fewer cases or on other aspects of their jobs.
Another lesson to take from this is how valuable it is to work across disciplines. The authors published in JAMA Network Open (impact factor: 13.8) and received an incredible amount of mainstream attention. If this had come from an AI lab, it would have been relegated to arXiv and largely ignored as yet another eval. The messenger matters.
Fixie cooked right until the end
Fixie released a Llama adapter-powered sequel to their original cascaded model, which I previously found quite impressive. They claim to beat GPT-4o at speech understanding (based on a common automatic speech translation metric). I am skeptical, but their demo is quite impressive: they really got latency down for both interruptions and responses. I could see this growing into the easiest speech-to-speech dialogue model to deploy and experiment with.
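For readers unfamiliar with the adapter approach, here is a minimal PyTorch sketch of the general idea; this is my reconstruction of the pattern, not Fixie’s actual code, and all dimensions and names are assumptions. A frozen audio encoder produces frame embeddings, and a small trained projector maps them into the LLM’s token-embedding space so audio can be interleaved with text.

```python
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Illustrative sketch of a Llama-style audio adapter (not Fixie's code).

    A frozen audio encoder (e.g. a Whisper-style model) yields frame
    embeddings; this small projector maps stacked frames into the LLM's
    token-embedding space so they can be fed to the backbone as pseudo-tokens.
    """

    def __init__(self, audio_dim: int = 1280, llm_dim: int = 4096, stack: int = 8):
        super().__init__()
        self.stack = stack  # stack adjacent frames to shorten the sequence
        self.proj = nn.Sequential(
            nn.Linear(audio_dim * stack, llm_dim),
            nn.SiLU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim) from the frozen encoder
        b, t, d = audio_feats.shape
        t = t - (t % self.stack)  # drop remainder frames so stacking divides evenly
        x = audio_feats[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.proj(x)  # (batch, frames/stack, llm_dim) "audio tokens"

# The projected audio tokens get concatenated with text embeddings and fed to
# the (mostly frozen) Llama backbone; only the projector, and perhaps LoRA
# layers, is trained.
adapter = AudioAdapter()
fake_audio = torch.randn(1, 96, 1280)
print(adapter(fake_audio).shape)  # torch.Size([1, 12, 4096])
```

The appeal of this design is that nearly all the intelligence comes for free from the pretrained backbone; only a small projector needs training, which is what makes a sequel like this feasible for a small team.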
However, I think that this may be the end of Fixie. Their CTO just left to start a real-time team at OpenAI.
Tulu 3 edges out Llama 3.1 with open data
The Llama 3 post-training mixture is pretty cracked. While the Llama 2 paper exposed the workings of a top post-training lab for the first time, the mixture itself was mid. Several independent teams improved on it, and the original Llama 2 instruct models are nowhere to be found on the leaderboards.
This has not happened with Llama 3. Nous Research claimed a fine-tune that marginally improved some metrics, and Nvidia did a great job with Nemotron, but there hasn’t been much else until now.
Allen AI seems to have beaten the Llama 3.1 post-training mixture with totally open data. While I don’t think their marginal improvements on metrics are meaningful (I doubt the model is significantly better or worse in the statistical sense of the word), it is very meaningful that the Tulu 3 team released all the data used to train the model, generated it on a shoestring budget, and shared a huge trove of experiments for others following in their footsteps.
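If you want to poke at the released data yourself, it is hosted on Hugging Face. A minimal sketch, with the caveat that the dataset ID and column name below are my best recollection and may have changed; check Allen AI’s hub page for the exact names.

```python
# Sketch for inspecting the released Tulu 3 post-training data.
# Assumptions: the SFT mixture lives at "allenai/tulu-3-sft-mixture" and
# stores conversations in a chat-format "messages" column.
from datasets import load_dataset

ds = load_dataset("allenai/tulu-3-sft-mixture", split="train")
print(ds)                     # row count and column names
print(ds[0]["messages"][:2])  # first couple of chat turns in one example
```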
One important thing to note is that they used closed models like GPT-4o to generate some of the data, which is a challenge for a lot of organizations (perhaps including theirs) that want to follow the license to the letter.