The Death of the Large Language Model
Behind the scenes of this crazy week in AI
Return the Fund 🚀
The frontier of tech-focused VC research
In today’s edition
Behind the scenes of 2024’s wildest week in AI
How the engineering world reacted to GPT-4o mini and Llama 3.1 + some shadowy deductions about OpenAI’s behavior
SearchGPT, Perplexity, and the path of true discovery
MADE POSSIBLE BY
Intercom for Startups
Join Intercom’s Early Stage Program to receive a 90% discount.
Get a direct line to your customers. Try the only complete AI-first customer service solution.
None of our startup picks are ever sponsored. So, simply checking out Intercom’s website above is the easiest way to keep Return the Fund alive and free. 🤝
BEFORE WE DIVE IN
Shoutout to the reader whom I saw scrolling RTF during a conference. ;)
And shoutout to those of you who responded to last week’s edition with thoughtful comments.
My mission is the unrelenting pursuit of knowledge, insight, and discovery. So thank you for being a part of that journey. 🍻
BACKGROUND
The OpenAI Escapism Stack
A month ago, we featured 3 companies empowering the “fine-tuned LLM revolution” as a “narrative of escaping OpenAI’s grasp.”
Much has changed since November of 2022 when OpenAI shocked the world with ChatGPT.
Cost, convenience, performance
OpenAI’s API has always been incredibly convenient. With one line of code, developers can access state-of-the-art intelligence. Their API unequivocally powered the AI startup revolution of the last two years.
I refer to this quote from the above RTF edition:
While small, open-source models (think Llama and Mistral) are not drop-in replacements for large foundational models (think GPT-4o and Claude 3), with careful domain-specific fine-tuning, they can actually outperform large foundational models in niche use cases.
Given that most products use LLMs to handle specifically niche requests, fine-tuning small models is an important alternative to consider due to speed, cost, and control benefits.
Fine-tuning models is both a science and an art. We like OpenPipe because they abstract low-level complexities away from the tuning process. Users can train models simply with a dataset and a couple clicks.
Of course, this isn’t a no-brainer—the magic of a powerful fine-tune still lies in the dataset. Nonetheless, such abstraction will empower a wave of companies to experiment with small models as an alternative to pegging their businesses to the OpenAI API.
Improvements in small language models, mostly thanks to Meta (Llama) and Mistral, have empowered the “OpenAI Escapism Stack.” Now, companies can match OpenAI’s convenience and unlock the benefits of open-source (fine-tuning, data ownership, customization, deployment flexibility) without a single reference to api.openai.com in the codebase.
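To make that concrete, here’s a minimal sketch of what “escaping” can look like in code. It assumes a self-hosted, OpenAI-compatible inference server (think vLLM serving a fine-tuned Llama) running at a hypothetical local address; the model name and prompt are placeholders.

```python
# Minimal sketch: point the same chat API at a self-hosted server instead of OpenAI.
# Assumes an OpenAI-compatible endpoint (e.g., vLLM) serving a fine-tuned Llama model.
# The URL, API key, and model name below are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # your own infrastructure, not api.openai.com
    api_key="not-needed-locally",
)

response = client.chat.completions.create(
    model="my-org/llama-3.1-8b-finetuned",  # hypothetical fine-tuned model
    messages=[{"role": "user", "content": "Summarize this user profile: ..."}],
)
print(response.choices[0].message.content)
```

Same wire format, same one-line convenience, but every request hits infrastructure you control.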
Distillation
Fine-tuning is a fantastic option for those with existing chat logs and datasets whose behavior needs to be mimicked (ex. training a legal report writer on thousands of human-written reports).
But what about generalist customers, sending structured prompts to OpenAI via their products? (i.e. “Summarize this user profile: {profile}”)
A nifty technique known as “distillation” involves using the outputs of a large model to fine-tune a smaller one. For example, using GPT-4 Turbo to generate the dataset used to fine-tune Llama 3.1 8B.
The data-collection phase of model distillation.
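For the curious, here is a rough sketch of that data-collection step (the prompts and filenames are illustrative, and the chat-style JSONL shown is just one common fine-tuning format):

```python
# Sketch of distillation data collection: query a large "teacher" model and save
# prompt/response pairs as a JSONL dataset for fine-tuning a smaller "student" model.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
prompts = [
    "Summarize this user profile: ...",   # illustrative prompts only
    "Draft a follow-up email about ...",
]

with open("distillation_dataset.jsonl", "w") as f:
    for prompt in prompts:
        completion = client.chat.completions.create(
            model="gpt-4-turbo",  # the large teacher model
            messages=[{"role": "user", "content": prompt}],
        )
        answer = completion.choices[0].message.content
        # One training example per line, in chat fine-tuning format
        f.write(json.dumps({"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ]}) + "\n")
```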
The theme of distillation will surface repeatedly as we analyze the developments in AI from this week.
Moral of the story
A consortium of companies has provided customers with everything they need to evade OpenAI’s grasp. A fine-tuning SaaS, a model storage provider, a deployment mechanism, and an observability tool are the building blocks of an LLM’s lifecycle.
This is the OpenAI Escapism Stack, and it’s increasingly becoming the AI standard.
With that background out of the way, let’s dive into what happened this week.
BREAKING IT DOWN
2024’s Craziest Week for AI
So… what happened, who were the winners and losers, what was behind the releases, and what can we expect going forward?
Let’s move in chronological order.
OpenAI releases GPT-4o mini
On July 18, OpenAI introduced a “new affordable and intelligent small model that’s significantly smarter, cheaper, and just as fast as GPT-3.5 Turbo.”
GPT-4o mini is tiny, but has a 128k-token context window—enough to contain a novel.
They “recommend developers using GPT-3.5 Turbo switch to GPT-4o mini to unlock higher intelligence at a lower cost.” To them, GPT-4o mini is the fastest and cheapest method of accessing OpenAI’s intelligence.
Why did OpenAI deviate so far from their norm?
To save an enormous amount of money
To compete against the OpenAI Escapism Stack
Let’s clarify and contextualize these reasons.
Where did GPT-4o mini come from? Well, it was likely trained by distilling GPT-4o’s outputs. But where did GPT-4o come from? Unclear.
After speaking with a handful of startups building products that depend on OpenAI LLMs, it’s apparent that GPT-4o is simply not as performant or reliable as GPT-4 Turbo in production.
OpenAI has shared little about the GPT-4o training process.
The most exciting aspect of their release demo was the integrated voice and video (interpreting the environment via a camera, watching a user’s screen, detecting vocal inflection, etc.).
With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network.
Yet, GPT-4o is twice as fast as GPT-4 Turbo? Something doesn’t add up.
It’s not a stretch to posit that GPT-4o’s base model was also born of distillation. The resulting smaller model is cheaper and faster while being trained to mimic the behavior of its larger counterpart.
But what of the evals OpenAI published? Clearly, they show GPT-4o outperforming all other models…
The eval conundrum
According to OpenAI, GPT-4o is the best model out there. According to Anthropic, the best model is Claude Sonnet 3.5. According to Mistral, it’s Mistral Large 2. You get the idea.
These evaluations are a hallmark of the LLM ecosystem, used to generalize a model’s performance. Their prevalence leads model providers to increasingly ‘overfit’ their models during RLHF and fine-tuning (whether intentionally or not) to inflate scores.
It’s a bold claim—but not unheard of.
If a math teacher teaches the course using questions from the eventual final exam, nobody would take a 99% class average seriously. Similarly, memorizing the answers to a practice exam is the worst way to study for a final.
By definition, letting a model intuit that which it will eventually be evaluated on is ‘overfitting’. The fact that GPT-4o is widely known to perform significantly worse than GPT-4 Turbo, yet beats it in evals, is highly indicative of overfitting.
The logical expectation of a smaller, distilled model is that it behaves similarly to its larger counterpart but struggles with step-by-step reasoning and instruction following. These expectations align with user findings, echoed throughout online forums, including OpenAI’s own community site.
Two examples:
Played around with it a bunch, and it is very obvious, that GPT-4-turbo is a lot better than GPT-4o. Give it any logic riddle or tell it to act in a certain way, and it fails way more.
GPT 4 turbo is much better for step by step tasks. In general it understands much better the prompt instructions.
Ultimately, while GPT-4o’s current underperformance relative to GPT-4 Turbo is well-documented, accusations of distillation and overfitting are lofty, but not out of the question.
LLM evals should always be interpreted with a grain of salt. Production performance is the true evaluator.
Meta unveils the Llama 3.1 fleet
On Tuesday, Meta reasserted its commitment to open-source and its dominance in the small model realm by releasing the Llama 3.1 fleet.
These models come in 8B, 70B, and 405B variants, with the 405B variant being the largest open-source model ever released.
(ICYDK—the numbers represent the number of parameters, in billions. The more parameters, the larger the model, and the more expensive it is to train and run.)
According to Meta’s human evaluation report (once again, take it with a grain of salt), Llama 3.1 405B outperforms GPT-4 Turbo, GPT-4o, and Claude 3.5 Sonnet in response quality.
In reality, they’re mostly on par with one another, each excelling in different settings.
Human evaluations for Llama 3.1 405B relative to large, proprietary SOTA models.
Even Llama 3.1’s 70B variant appears to be comparable to GPT-4o, which is why everyone and their mother is tweeting about “having SOTA models in open-source.”
According to active user feedback and my own usage of these models, they are incredibly strong. Especially after fine-tuning.
OpenAI enables fine-tuning GPT-4o mini
The release of the Llama 3.1 fleet was leaked quite early. Threatened by the aforementioned OpenAI Escapism Stack, OpenAI likely released GPT-4o mini to compete in the small model space.
GPT-4o mini’s pricing, at $0.15 per million input tokens and $0.60 per million output tokens (~750k words), is 20x cheaper than GPT-4o, 50x cheaper than GPT-4 Turbo, and an order of magnitude cheaper than most off-the-shelf open-source models.
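To put those numbers in perspective, here is a quick back-of-the-envelope calculation (the workload below is made up purely for illustration):

```python
# Rough cost estimate at GPT-4o mini's listed prices. The request volume and
# per-request token counts below are hypothetical.
INPUT_PRICE = 0.15 / 1_000_000   # dollars per input token
OUTPUT_PRICE = 0.60 / 1_000_000  # dollars per output token

requests_per_day = 100_000
input_tokens, output_tokens = 2_000, 300  # per request

daily_cost = requests_per_day * (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE)
print(f"${daily_cost:,.2f} per day")  # $48.00 per day for 100k sizable requests
```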
After Meta released Llama 3.1, thousands of engineers raced to run fine-tuning jobs and test the shiny models in their apps, reporting back with optimism.
OpenAI then released fine-tuning for GPT-4o mini—the missing piece it needed to compete with small open-source models.
Enabling fine-tuning allows customers of the OpenAI Escapism Stack to run their existing datasets, with no extra effort, through OpenAI. Inference through GPT-4o mini is, then, absurdly cheap.
To incentivize this process, OpenAI made fine-tuning GPT-4o mini free (with limits). It seems engineers took advantage of this opportunity.
I personally ran every one of my Llama fine-tuning jobs, truncated to OpenAI’s threshold, through their API for free. The results were excellent. A few of the resulting models are currently in production.
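For reference, kicking off such a job takes only a few lines against OpenAI’s fine-tuning API. This is a sketch based on the publicly documented flow; the dataset filename and model snapshot string are assumptions.

```python
# Sketch: upload an existing chat-format JSONL dataset and start a GPT-4o mini
# fine-tuning job. Filename and model snapshot are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

training_file = client.files.create(
    file=open("distillation_dataset.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # fine-tunable snapshot at the time of writing
)
print(job.id, job.status)
```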
Good move, OpenAI.
Saving money
Recently, a new research outfit, FutureSearch, released a report on OpenAI’s revenue.
Shockingly, despite margins of error, a majority of OpenAI’s revenue comes from consumer subscriptions to ChatGPT (as opposed to the enterprise API).
OpenAI likely recognized an opportunity for massive cost savings.
Given that most ChatGPT messages are not computationally intensive (think basic conversation, basic web searching, basic interactions), there's no need to send those requests to a trillion-plus parameter model.
Over time, OpenAI has been modifying the ChatGPT interface to flow between models. Users can even choose which model they're interacting with inside the same conversation.
Consider a world wherein OpenAI automatically routes each message to the most appropriate model. Basic conversation goes to GPT-4o mini and advanced data science orchestration goes to GPT-4o/GPT-4 Turbo. OpenAI would save tens (if not hundreds) of millions of dollars in compute.
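A naive version of that router is easy to imagine. The sketch below is speculative (no claim about how OpenAI actually does it); the heuristic, thresholds, and model names are placeholders.

```python
# Speculative sketch of per-message model routing: cheap requests go to a small
# model, demanding ones to a large one. The heuristic is deliberately naive.
from openai import OpenAI

client = OpenAI()

HARD_HINTS = ("analyze", "prove", "debug", "step by step", "write code")

def pick_model(message: str) -> str:
    looks_hard = len(message) > 1_500 or any(h in message.lower() for h in HARD_HINTS)
    return "gpt-4o" if looks_hard else "gpt-4o-mini"

def respond(message: str) -> str:
    completion = client.chat.completions.create(
        model=pick_model(message),
        messages=[{"role": "user", "content": message}],
    )
    return completion.choices[0].message.content
```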
Empowered by GPT-4o mini. Once again—touché, OpenAI.
Honorable mentions
Kudos to Mistral for their release of Mistral Large 2. Totally overshadowed by Llama, though.
And, kudos to Mark Zuckerberg for his open-source manifesto.
Moral of the story
Competition is heating up in the small model ecosystem, with OpenAI now getting their hands dirty to protect their advantage.
Open models are cheaper and better than ever before. A wave of new infrastructure companies are empowering the OpenAI Escapism Stack.
As models improve over time, evals are dodgy—pay attention to production performance.
THOUGHTS ON THE FRONTIER
Search
OpenAI is backed into a corner, throwing punches at everyone like Drake in May. The only difference is that, for OpenAI, it seems to be working.
Their next target? Search.
OpenAI unveiled SearchGPT, a prototype LLM-powered search interface soon to be integrated into ChatGPT.
Meanwhile, somehow-unicorn Perplexity is (privately) already raising another $250 million, led by SoftBank’s Vision Fund II.
My favorite quote from their deck: “Building a wrapper is hard.”
Jokes aside, while building a wrapper is a weekend project, scaling a wrapper is a Sisyphean ordeal.
Personally, I pay $20 a month for Perplexity Pro’s convenience. Scaling a consumer product the way Perplexity has is hard. It requires an incredible level of attention to detail. That said, assuming it performs up to OpenAI’s standard, SearchGPT calls Perplexity’s moat into question (again).
The problem with AI search
The central limitation of search-powered LLM tools like Perplexity is their lack of depth. True discovery comes from following leads down the rabbit hole and thinking critically about every intermediate insight.
Each step influences the next, as the searcher becomes increasingly aware of what action will guide them closer to discovery.
A mentor once shared a quote that has guided me, as it did him.
The beginner chases the right answers. The master chases the right questions.
The key to discovery is learning what needs to be discovered.
Perhaps the philosophy of research and discovery can be engineered into a recursively self-guided agentic system. Certainly worth trying. But, for now, I remain highly skeptical of any “all-encompassing” search-powered platforms.
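For what it’s worth, a toy version of that loop might look like the sketch below. The web_search helper is a hypothetical stand-in for any search API, and the fixed hop count is an arbitrary stopping rule.

```python
# Toy sketch of a recursively self-guided research loop: each intermediate insight
# decides the next question. web_search() is a hypothetical stand-in for a search API.
from openai import OpenAI

client = OpenAI()

def web_search(query: str) -> str:
    raise NotImplementedError("plug in your search provider here")

def ask(prompt: str) -> str:
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return out.choices[0].message.content

def research(question: str, max_hops: int = 4) -> str:
    notes = []
    for _ in range(max_hops):
        results = web_search(question)
        notes.append(ask(f"Question: {question}\nSources: {results}\nState the key insight."))
        joined = "\n".join(notes)
        # The crucial step: let the insights so far decide what to ask next.
        question = ask(f"Given these notes:\n{joined}\nWhat is the single best next question?")
    return ask("Synthesize a final answer from these notes:\n" + "\n".join(notes))
```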
With that said, I can’t wait to try SearchGPT and see how it changes my perspective.
WHO WE ARE
Return the Fund 🚀
One startup a week poised for 10x growth; market deep dives with actionable insights for builders and investors.
Technical breakdowns of advanced new tech to empower informed decision-making
Find your next prospect, product, job, customer, or partner. 🤝 Written by undercover pioneers of the field; trusted by builders and investors from Silicon Valley to NYC. 💸
Last week, we dove into the context behind Biden dropping out of the 2024 election, and JD Vance’s stances.
As you know, we’re hell-bent on uncovering future unicorns cruising under the radar. Preeminent companies are lean, quiet, and driven before reaching their watershed moments. By the time people start talking about them, it’s too late.
In a nutshell—we pitch you startups like you’re an esteemed VC. If you’re interested in them as a partner, product, or prospect, we’ll make a warm intro. Humbly, our network knows no bounds!
We’ll also intuitively break down advanced tech so you can stay ahead of trends and critically analyze hype in the news and in your circles (regardless of your technical prowess).
Periodically, we’ll propose niche market opportunities. These are tangible ways to extract alpha from the private markets.
You won’t get editions from us very often. Weekly at best. Two reasons:
We’re balancing full-time VC/startup work with Return the Fund.
We prioritize depth, insight, and value. This is not a daily news publication… We hope that when you do get an email from us, it’s dense, valuable, actionable, and worth saving.
Thanks for reading today’s RTF. Feel free to reach out to us at [email protected]. 🤝
Psst: None of our company picks are ever sponsored. All research and opinions are completely held by the Return the Fund team.