The Death of the Large Language Model
Behind the scenes of this crazy week in AI
Return the Fund 🚀
The frontier of tech-focused VC research
In today’s edition
Behind the scenes of 2024’s wildest week in AI
How the engineering world reacted to GPT-4o mini and Llama 3.1 + some shadowy deductions about OpenAI’s behavior
SearchGPT, Perplexity, and the path of true discovery
MADE POSSIBLE BY
Intercom for Startups
Join Intercom’s Early Stage Program to receive a 90% discount.
Get a direct line to your customers. Try the only complete AI-first customer service solution.
None of our startup picks are ever sponsored. So, simply checking out Intercom’s website above is the easiest way to keep Return the Fund alive and free. 🤝
BEFORE WE DIVE IN
Shoutout to the reader whom I saw scrolling RTF during a conference. ;)
And shoutout to those of you who responded to last week’s edition with thoughtful comments.
My mission is the unrelenting pursuit of knowledge, insight, and discovery. So thank you for being a part of that journey. 🍻
BACKGROUND
The OpenAI Escapism Stack
A month ago, we featured 3 companies empowering the “fine-tuned LLM revolution” as a “narrative of escaping OpenAI’s grasp.”
Much has changed since November of 2022 when OpenAI shocked the world with ChatGPT.
Cost, convenience, performance
OpenAI’s API has always been incredibly convenient. With one line of code, developers can access state-of-the-art intelligence. Their API unequivocally powered the AI startup revolution of the last two years.
I refer to this quote from the above RTF edition:
While small, open-source models (think Llama and Mistral) are not drop-in replacements for large foundational models (think GPT-4o and Claude 3), with careful domain-specific fine-tuning, they can actually outperform large foundational models in niche use cases.
Given that most products use LLMs to handle specifically niche requests, fine-tuning small models is an important alternative to consider due to speed, cost, and control benefits.
Fine-tuning models is both a science and an art. We like OpenPipe because they abstract low-level complexities away from the tuning process. Users can train models simply with a dataset and a couple clicks.
Of course, this isn’t a no-brainer—the magic of a powerful fine-tune still lies in the dataset. Nonetheless, such abstraction will empower a wave of companies to experiment with small models as an alternative to pegging their businesses to the OpenAI API.
Improvements in small language models, mostly thanks to Meta (Llama) and Mistral, have empowered the “OpenAI Escapism Stack.” Now, companies can match OpenAI’s convenience and unlock the benefits of open-source (fine-tuning, data ownership, customization, deployment flexibility) without a single reference to api.openai.com in the codebase.
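To make that concrete, here’s a minimal sketch of what “escaping” can look like in code. It assumes a self-hosted, OpenAI-compatible inference server (think vLLM serving a fine-tuned Llama) running at a hypothetical local address; the model name and prompt are placeholders.

```python
# Minimal sketch: point the same chat API at a self-hosted server instead of OpenAI.
# Assumes an OpenAI-compatible endpoint (e.g., vLLM) serving a fine-tuned Llama model.
# The URL, API key, and model name below are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # your own infrastructure, not api.openai.com
    api_key="not-needed-locally",
)

response = client.chat.completions.create(
    model="my-org/llama-3.1-8b-finetuned",  # hypothetical fine-tuned model
    messages=[{"role": "user", "content": "Summarize this user profile: ..."}],
)
print(response.choices[0].message.content)
```

Same wire format, same one-line convenience, but every request hits infrastructure you control.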
Distillation
Fine-tuning is a fantastic option for those with existing chat logs and datasets whose behavior needs to be mimicked (ex. training a legal report writer on thousands of human-written reports).
But what about generalist customers, sending structured prompts to OpenAI via their products? (i.e. “Summarize this user profile: {profile}”)
A nifty technique known as “distillation” involves using the outputs of a large model to fine-tune a smaller one. For example, using GPT-4 Turbo to generate the dataset used to fine-tune Llama 3.1 8B.
The data-collection phase of model distillation.
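For the curious, here is a rough sketch of that data-collection step (the prompts and filenames are illustrative, and the chat-style JSONL shown is just one common fine-tuning format):

```python
# Sketch of distillation data collection: query a large "teacher" model and save
# prompt/response pairs as a JSONL dataset for fine-tuning a smaller "student" model.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
prompts = [
    "Summarize this user profile: ...",   # illustrative prompts only
    "Draft a follow-up email about ...",
]

with open("distillation_dataset.jsonl", "w") as f:
    for prompt in prompts:
        completion = client.chat.completions.create(
            model="gpt-4-turbo",  # the large teacher model
            messages=[{"role": "user", "content": prompt}],
        )
        answer = completion.choices[0].message.content
        # One training example per line, in chat fine-tuning format
        f.write(json.dumps({"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ]}) + "\n")
```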
The theme of distillation will surface repeatedly as we analyze the developments in AI from this week.
Moral of the story
A consortium of companies has provided customers with everything they need to evade OpenAI’s grasp. A fine-tuning SaaS, a model storage provider, a deployment mechanism, and an observability tool are the building blocks of an LLM’s lifecycle.
This is the OpenAI Escapism Stack, and it’s increasingly becoming the AI standard.
With that background out of the way, let’s dive into what happened this week.
BREAKING IT DOWN
2024’s Craziest Week for AI
So… what happened, who were the winners and losers, what was behind the releases, and what can we expect going forward?
Let’s move in chronological order.
OpenAI releases GPT-4o mini
On July 18, OpenAI introduced a “new affordable and intelligent small model that’s significantly smarter, cheaper, and just as fast as GPT-3.5 Turbo.”
GPT-4o mini is tiny, but has a 128k-token context window—enough to contain a novel.
They “recommend developers using GPT-3.5 Turbo switch to GPT-4o mini to unlock higher intelligence at a lower cost.” To them, GPT-4o mini is the fastest and cheapest method of accessing OpenAI’s intelligence.
Why did OpenAI deviate so far from their norm?
To save an enormous amount of money
To compete against the OpenAI Escapism Stack
Let’s clarify and contextualize these reasons.
Where did GPT-4o mini come from? Well, it was likely trained by distilling GPT-4o’s outputs. But where did GPT-4o come from? Unclear.
After speaking with a handful of startups building products that depend on OpenAI LLMs, it’s apparent that GPT-4o is simply not as performant or reliable as GPT-4 Turbo in production.
OpenAI has shared little about the GPT-4o training process.
The most exciting aspect of their release demo was the integrated voice and video (interpreting the environment via a camera, watching a user’s screen, detecting vocal inflection, etc.).
With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network.
Yet, GPT-4o is twice as fast as GPT-4 Turbo? Something doesn’t add up.
It’s not a stretch to posit that GPT-4o’s base model was also born of distillation. The resulting smaller model is cheaper and faster while being trained to mimic the behavior of its larger counterpart.
But what of the evals OpenAI published? Clearly, they show GPT-4o outperforming all other models…
The eval conundrum
According to OpenAI, GPT-4o is the best model out there. According to Anthropic, the best model is Claude Sonnet 3.5. According to Mistral, it’s Mistral Large 2. You get the idea.
These evaluations are a hallmark of the LLM ecosystem, used to generalize a model’s performance. Their prevalence leads model providers to increasingly ‘overfit’ their models during RLHF and fine-tuning (whether intentionally or not) to inflate scores.
It’s a bold claim—but not unheard of.
If a math teacher teaches the course using questions from the eventual final exam, nobody would take a 99% class average seriously. Similarly, memorizing the answers to a practice exam is the worst way to study for a final.
By definition, letting a model intuit that which it will eventually be evaluated on is ‘overfitting’. The fact that GPT-4o is widely known to perform significantly worse than GPT-4 Turbo, yet beats it in evals, is highly indicative of overfitting.
The logical expectation of a smaller, distilled model is that it behaves similarly to its larger counterpart but struggles with step-by-step reasoning and instruction following. These expectations align with user findings, echoed throughout online forums, including OpenAI’s own community site.
Two examples:
Played around with it a bunch, and it is very obvious, that GPT-4-turbo is a lot better than GPT-4o. Give it any logic riddle or tell it to act in a certain way, and it fails way more.
GPT 4 turbo is much better for step by step tasks. In general it understands much better the prompt instructions.
Ultimately, while GPT-4o’s current underperformance relative to GPT-4 Turbo is well-documented, accusations of distillation and overfitting are lofty, but not out of the question.
LLM evals should always be interpreted with a grain of salt. Production performance is the true evaluator.
Meta unveils the Llama 3.1 fleet
On Tuesday, Meta reasserted its commitment to open-source and its dominance in the small model realm by releasing the Llama 3.1 fleet.
These models come in 8B, 70B, and 405B variants, with the 405B variant being the largest open-source model ever released.
(ICYDK—the numbers represent the number of parameters, in billions. The more parameters, the larger the model, and the more expensive it is to train and run.)
According to Meta’s human evaluation report (once again, take it with a grain of salt), Llama 3.1 405B outperforms GPT-4 Turbo, GPT-4o, and Claude 3.5 Sonnet in response quality.
In reality, they’re mostly on par with one another, each excelling in different settings.
Human evaluations for Llama 3.1 405B relative to large, proprietary SOTA models.
Even Llama 3.1’s 70B variant appears to be comparable to GPT-4o, which is why everyone and their mother is tweeting about “having SOTA models in open-source.”
According to active user feedback and my own usage of these models, they are incredibly strong. Especially after fine-tuning.
OpenAI enables fine-tuning GPT-4o mini
The release of the Llama 3.1 fleet was leaked quite early. Threatened by the aforementioned OpenAI Escapism Stack, OpenAI likely released GPT-4o mini to compete in the small model space.
GPT-4o mini’s pricing, at $0.15 per million input tokens and $0.60 per million output tokens (~750k words), is 20x cheaper than GPT-4o, 50x cheaper than GPT-4 Turbo, and an order of magnitude cheaper than most off-the-shelf open-source models.
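To put those numbers in perspective, here is a quick back-of-the-envelope calculation (the workload below is made up purely for illustration):

```python
# Rough cost estimate at GPT-4o mini's listed prices. The request volume and
# per-request token counts below are hypothetical.
INPUT_PRICE = 0.15 / 1_000_000   # dollars per input token
OUTPUT_PRICE = 0.60 / 1_000_000  # dollars per output token

requests_per_day = 100_000
input_tokens, output_tokens = 2_000, 300  # per request

daily_cost = requests_per_day * (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE)
print(f"${daily_cost:,.2f} per day")  # $48.00 per day for 100k sizable requests
```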
After Meta released Llama 3.1, thousands of engineers raced to run fine-tuning jobs and test the shiny models in their apps, reporting back with optimism.
OpenAI then released fine-tuning for GPT-4o mini—the missing piece it needed to compete with small open-source models.
Enabling fine-tuning allows customers of the OpenAI Escapism Stack to run their existing datasets, with no extra effort, through OpenAI. Inference through GPT-4o mini is, then, absurdly cheap.
To incentivize this process, OpenAI made fine-tuning GPT-4o mini free (with limits). It seems engineers took advantage of this opportunity.
I personally ran every one of my Llama fine-tuning jobs, truncated to OpenAI’s threshold, through their API for free. The results were excellent. A few of the resulting models are currently in production.
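For reference, kicking off such a job takes only a few lines against OpenAI’s fine-tuning API. This is a sketch based on the publicly documented flow; the dataset filename and model snapshot string are assumptions.

```python
# Sketch: upload an existing chat-format JSONL dataset and start a GPT-4o mini
# fine-tuning job. Filename and model snapshot are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

training_file = client.files.create(
    file=open("distillation_dataset.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # fine-tunable snapshot at the time of writing
)
print(job.id, job.status)
```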
Good move, OpenAI.
Saving money
Recently, a new research outfit, FutureSearch, released a report on OpenAI’s revenue.
Shockingly, despite margins of error, a majority of OpenAI’s revenue comes from consumer subscriptions to ChatGPT (as opposed to the enterprise API).
OpenAI likely recognized an opportunity for massive cost savings.
Given that most ChatGPT messages are not computationally intensive (think basic conversation, basic web searching, basic interactions), there's no need to send those requests to a trillion-plus parameter model.
Over time, OpenAI has been modifying the ChatGPT interface to flow between models. Users can even choose which model they're interacting with inside the same conversation.
Consider a world wherein OpenAI automatically routes each message to the most appropriate model. Basic conversation goes to GPT-4o mini and advanced data science orchestration goes to GPT-4o/GPT-4 Turbo. OpenAI would save tens (if not hundreds) of millions of dollars in compute.
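A naive version of that router is easy to imagine. The sketch below is speculative (no claim about how OpenAI actually does it); the heuristic, thresholds, and model names are placeholders.

```python
# Speculative sketch of per-message model routing: cheap requests go to a small
# model, demanding ones to a large one. The heuristic is deliberately naive.
from openai import OpenAI

client = OpenAI()

HARD_HINTS = ("analyze", "prove", "debug", "step by step", "write code")

def pick_model(message: str) -> str:
    looks_hard = len(message) > 1_500 or any(h in message.lower() for h in HARD_HINTS)
    return "gpt-4o" if looks_hard else "gpt-4o-mini"

def respond(message: str) -> str:
    completion = client.chat.completions.create(
        model=pick_model(message),
        messages=[{"role": "user", "content": message}],
    )
    return completion.choices[0].message.content
```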
Empowered by GPT-4o mini. Once again—touché, OpenAI.
Honorable mentions
Kudos to Mistral for their release of Mistral Large 2. Totally overshadowed by Llama, though.
And, kudos to Mark Zuckerberg for his open-source manifesto.
Moral of the story
Competition is heating up in the small model ecosystem, with OpenAI now getting their hands dirty to protect their advantage.
Open models are cheaper and better than ever before. A wave of new infrastructure companies are empowering the OpenAI Escapism Stack.
As models improve over time, evals are dodgy—pay attention to production performance.
THOUGHTS ON THE FRONTIER
Search
OpenAI is backed into a corner, throwing punches at everyone like Drake in May. The only difference is that, for OpenAI, it seems to be working.
Their next target? Search.
OpenAI unveiled SearchGPT, a prototype LLM-powered search interface soon to be integrated into ChatGPT.
Meanwhile, somehow-unicorn Perplexity is (privately) already raising another $250 million, led by SoftBank’s Vision Fund II.
My favorite quote from their deck: “Building a wrapper is hard.”
Jokes aside, while building a wrapper is a weekend project, scaling a wrapper is a Sisyphean ordeal.
Personally, I pay $20 a month for Perplexity Pro’s convenience. Scaling a consumer product the way Perplexity has is hard. It requires an incredible level of attention to detail. That said, assuming it performs up to OpenAI’s standard, SearchGPT calls Perplexity’s moat into question (again).
The problem with AI search
The central limitation of search-powered LLM tools like Perplexity is their lack of depth. True discovery comes from following leads down the rabbit hole and thinking critically about every intermediate insight.
Each step influences the next, as the searcher becomes increasingly aware of what action will guide them closer to discovery.
A mentor once shared a quote that has guided me, as it did him.
The beginner chases the right answers. The master chases the right questions.
The key to discovery is learning what needs to be discovered.
Perhaps the philosophy of research and discovery can be engineered into a recursively self-guided agentic system. Certainly worth trying. But, for now, I remain highly skeptical of any “all-encompassing” search-powered platforms.
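For what it’s worth, a toy version of that loop might look like the sketch below. The web_search helper is a hypothetical stand-in for any search API, and the fixed hop count is an arbitrary stopping rule.

```python
# Toy sketch of a recursively self-guided research loop: each intermediate insight
# decides the next question. web_search() is a hypothetical stand-in for a search API.
from openai import OpenAI

client = OpenAI()

def web_search(query: str) -> str:
    raise NotImplementedError("plug in your search provider here")

def ask(prompt: str) -> str:
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return out.choices[0].message.content

def research(question: str, max_hops: int = 4) -> str:
    notes = []
    for _ in range(max_hops):
        results = web_search(question)
        notes.append(ask(f"Question: {question}\nSources: {results}\nState the key insight."))
        joined = "\n".join(notes)
        # The crucial step: let the insights so far decide what to ask next.
        question = ask(f"Given these notes:\n{joined}\nWhat is the single best next question?")
    return ask("Synthesize a final answer from these notes:\n" + "\n".join(notes))
```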
With that said, I can’t wait to try SearchGPT and see how it changes my perspective.
WHO WE ARE
Return the Fund 🚀
One startup a week poised for 10x growth; market deep dives with actionable insights for builders and investors.
Technical breakdowns of advanced new tech to empower informed decision-making
Find your next prospect, product, job, customer, or partner. 🤝 Written by undercover pioneers of the field; trusted by builders and investors from Silicon Valley to NYC. 💸
Last week, we dove into the context behind Biden dropping out of the 2024 election, and JD Vance’s stances.
As you know, we’re hell-bent on uncovering future unicorns cruising under the radar. Preeminent companies are lean, quiet, and driven before reaching their watershed moments. By the time people start talking about them, it’s too late.
In a nutshell—we pitch you startups like you’re an esteemed VC. If you’re interested in them as a partner, product, or prospect, we’ll make a warm intro. Humbly, our network knows no bounds!
We’ll also intuitively break down advanced tech so you can stay ahead of trends and critically analyze hype in the news and in your circles (regardless of your technical prowess).
Periodically, we’ll propose niche market opportunities. These are tangible ways to extract alpha from the private markets.
You won’t get editions from us very often. Weekly at best. Two reasons:
We’re balancing full-time VC/startup work with Return the Fund.
We prioritize depth, insight, and value. This is not a daily news publication… We hope that when you do get an email from us, it’s dense, valuable, actionable, and worth saving.
Thanks for reading today’s RTF. Feel free to reach out to us at [email protected]. 🤝
Psst: None of our company picks are ever sponsored. All research and opinions are completely held by the Return the Fund team.