Why Model Distillation Matters (And How It’s Playing Out One Year Later) - Part 1
A year ago, at DevDay 2024, OpenAI announced tools for model distillation that seemed like they’d change how people build AI products. Twelve months later, it’s worth looking back at what was promised and what has actually mattered in practice.
The core problem they were addressing: AI is simultaneously too powerful and not powerful enough. Too powerful in the sense that GPT-4o can answer graduate-level physics questions, but not powerful enough because most apps don’t need that – they just need something that works fast and doesn’t bankrupt you on API calls.
That’s where model distillation comes in. OpenAI made it easier to do, but more interesting is seeing who’s actually using it and what for.
The Teacher-Student Thing (But Actually Useful)
Model distillation is basically when a big, expensive AI model teaches a smaller, cheaper one. Think GPT-4o passing its knowledge down to GPT-4o-mini. The metaphor everyone uses is teacher-student, which honestly undersells how practical this is.
Here’s what actually happens: you take your massive model that knows everything but costs a fortune to run, and you use its outputs to train a focused, efficient model that does one specific thing really well. The small model learns to reproduce the big model’s behavior for your particular use case, without needing all that general knowledge weighing it down.
The Economics That Actually Drive This
Look at the numbers that matter. If you’re running a customer service bot that handles 10 million queries a month, GPT-4o at list pricing will cost you roughly $150,000 a month. Switch to a distilled GPT-4o-mini that performs comparably on your specific use case? That drops to around $10,000. That’s not optimization - that’s the difference between a feature that works financially and one that doesn’t.
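Here’s roughly how that math shakes out. This is a back-of-the-envelope sketch, not a quote of current pricing – the per-query token counts are assumptions, and the per-million-token prices are roughly the late-2024 list prices, so check the pricing page before trusting the output:

```python
# Back-of-the-envelope monthly cost comparison for a 10M-query/month bot.
# Token counts per query and per-token prices are illustrative assumptions.
QUERIES_PER_MONTH = 10_000_000
INPUT_TOKENS, OUTPUT_TOKENS = 2_500, 500  # assumed average per query

# (input $/1M tokens, output $/1M tokens), approximate late-2024 list prices
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def monthly_cost(model: str) -> float:
    in_price, out_price = PRICES[model]
    per_query = (INPUT_TOKENS * in_price + OUTPUT_TOKENS * out_price) / 1e6
    return per_query * QUERIES_PER_MONTH

for model in PRICES:
    print(f"{model}: ${monthly_cost(model):,.0f}/month")
# Prints roughly $112,500 for gpt-4o and $6,750 for gpt-4o-mini --
# the same order of magnitude as the figures above.
```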
Speed is the other constraint people underestimate. A 200ms response difference doesn’t sound like much until you’re building something interactive. Voice translation needs to feel real-time. Code completion needs to appear while someone’s still thinking. Chat interfaces need to respond before users get impatient. Large models, no matter how capable, introduce latency that breaks these experiences. Smaller models don’t just save money - they enable entirely different interaction patterns.
The accessibility angle is less obvious but possibly more important long-term. When your AI feature requires $50,000/month in API costs to run at meaningful scale, you’re limited to well-funded companies and heavily-used products. When you can get comparable performance from a distilled model running on modest infrastructure, suddenly independent developers and smaller companies can actually build things. The barrier to entry drops from “need venture funding” to “can afford a decent cloud instance.”
OpenAI’s Three-Step Process (That Actually Makes Sense)
OpenAI laid out how to do distillation properly, and unlike most framework announcements, this one tracks with how you’d actually build something.
Step 1: Define what “good” looks like
You can’t improve what you can’t measure. Before you do anything else, you need task-specific evaluation metrics. What does success look like for your particular use case? Don’t skip this. Seriously. Everyone wants to skip this part and jump straight to training, and that’s how you end up with a model that technically works but doesn’t actually solve your problem.
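To make that concrete, here’s a minimal, hypothetical eval for an email-reply task – a handful of held-out cases, a grading function, and a pass rate. The grading criteria (required facts, length bounds) and the threshold are stand-ins for whatever “good” means in your domain:

```python
# Minimal task-specific eval: grade model replies against held-out cases.
# The grader is deliberately simple (keyword + length checks); a real eval
# would use domain-specific criteria or an LLM-as-judge.
from dataclasses import dataclass

@dataclass
class EvalCase:
    email_thread: str          # the input the model will see
    must_mention: list[str]    # facts a good reply has to include

def grade(reply: str, case: EvalCase) -> bool:
    mentions_facts = all(kw.lower() in reply.lower() for kw in case.must_mention)
    reasonable_length = 20 <= len(reply.split()) <= 120
    return mentions_facts and reasonable_length

def pass_rate(replies: list[str], cases: list[EvalCase]) -> float:
    passed = sum(grade(r, c) for r, c in zip(replies, cases))
    return passed / len(cases)

# Run the same eval against GPT-4o and against your distilled model;
# the distilled model is "good enough" only if it clears the bar the
# teacher set, e.g. pass_rate(...) >= 0.90.
```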
Step 2: Generate high-quality training data
Use your big model (GPT-4o) to create examples of perfect performance. These are your inputs and ideal outputs. The key word here is “ideal” – you’re capturing what excellent looks like, not just what works. This becomes your training dataset for the smaller model.
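In practice that step looks something like this: run your real inputs through GPT-4o with your best production prompt, and write the pairs out in the chat-format JSONL that the fine-tuning endpoint expects. The prompt, sample inputs, and file name below are placeholders:

```python
# Generate "ideal" examples with the teacher model (GPT-4o) and save them
# in the chat-format JSONL used for fine-tuning.
import json
from openai import OpenAI

client = OpenAI()
SYSTEM_PROMPT = "You draft concise, polite email replies."  # placeholder

def make_example(email_thread: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": email_thread},
        ],
    )
    reply = response.choices[0].message.content
    # One JSONL line per example: the conversation plus the ideal answer.
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": email_thread},
            {"role": "assistant", "content": reply},
        ]
    }

sample_threads = [  # in reality, real inputs pulled from production
    "Hi, can we move Thursday's call to 3pm?",
    "Thanks for the proposal -- what's the timeline for phase two?",
]

with open("training_data.jsonl", "w") as f:
    for thread in sample_threads:
        f.write(json.dumps(make_example(thread)) + "\n")
```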
Step 3: Fine-tune the smaller model
Now you train GPT-4o-mini on that dataset. You’re essentially compressing the intelligence of the larger model into the smaller one, at least for your specific domain. The small model learns to replicate the big model’s responses without needing all that general knowledge.
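With the dataset from step 2, kicking off the job is a couple of API calls. A minimal sketch – the file name and model snapshot are assumptions, and you’d watch progress in the dashboard or by polling the job:

```python
# Fine-tune the smaller student model on the teacher-generated dataset.
from openai import OpenAI

client = OpenAI()

# Upload the JSONL produced in step 2.
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the fine-tuning job on the small model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # fine-tunable snapshot; check current names
)
print(job.id, job.status)

# When the job finishes, job.fine_tuned_model holds the new model name,
# which you then run through the same evals from step 1.
```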
The Tools They Shipped (And What Actually Happened)
At DevDay 2024, OpenAI announced two features that were supposed to make distillation way less painful:
Stored Completions: You could add store: true to your API calls and OpenAI would save the full input and output. You could tag these interactions too, which meant you could build datasets organically as your app ran in production – there’s a short code sketch of what this looks like just below.
Evals Product (Beta): A platform for managing the whole distillation process inside OpenAI’s ecosystem. You could set up evaluation criteria, run them against different models, and compare results.
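Here’s what the stored-completions side looks like in a production call – a minimal sketch, with the prompt and metadata tags as placeholders:

```python
# Opt a production call into stored completions and tag it, so it can be
# pulled into an eval or distillation dataset later. Tags are placeholders.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    store=True,                 # persist the full input/output pair
    metadata={                  # free-form tags for filtering later
        "feature": "quick-reply",
        "app_version": "2.1.0",
    },
    messages=[
        {"role": "system", "content": "Draft a concise email reply."},
        {"role": "user", "content": "Hi, can we move Thursday's call to 3pm?"},
    ],
)
print(response.choices[0].message.content)
```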
A year later? The stored completions feature is actually getting used. Being able to collect real production data without building your own infrastructure made the whole process less intimidating. The Evals product went through the typical beta evolution – initially clunky, gradually more useful.
What’s more interesting is what people learned from actually trying this at scale.
When Distillation Makes Sense
The OpenAI folks had this useful framework for thinking about when distillation works:
Narrow domain, low precision needs: This is the sweet spot. Something like summarizing customer reviews, where you’re working in a defined space and don’t need perfect accuracy every time. Small models crush this.
High precision, narrow domain: Categorization tasks in well-defined domains. You’ll need more training examples and a more diverse dataset, but it’s still a good fit. Think email routing or content classification.
Broad domain, low precision: Tasks that span multiple areas but don’t require pinpoint accuracy. Creative text generation, rough translations, that kind of thing.
What doesn’t work: Tasks that need both broad knowledge across domains AND high precision. These still need the full power of large models. No shortcuts here.
The Real-World Example: Superhuman’s Quick Replies
OpenAI showed a case study from Superhuman’s email app. They have this “quick reply” feature that suggests response options after reading an email thread. Simple idea, right? But how do you scale that to hundreds of millions of emails without going bankrupt?
They distilled a small model specifically for generating email replies. It doesn’t need to know quantum physics or write poetry – it just needs to understand email context and suggest reasonable responses. That’s the perfect distillation use case: narrow domain, clearly defined task, needs to run at massive scale.
Things That Will Trip You Up
Uneven or biased data: Your training data needs to match your production data distribution. If you train on one pattern and then deploy to handle a different pattern, you’re going to have a bad time.
Sparse examples: This is especially brutal for rare events. If you’re building fraud detection and fraud is uncommon, your 1,000 training examples might not include any fraud at all. Your model will have blind spots you won’t discover until production.
Dataset size: You don’t actually need millions of examples. OpenAI said they typically see distillation work best with thousands of examples, not millions. Start with a few hundred, verify it’s working through your evals, then scale up. Don’t jump straight to collecting massive datasets – and before you spend money on a fine-tuning run, a quick audit like the sketch below will tell you whether the size and sparsity problems are lurking in your data.
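Here’s the kind of pre-flight audit I mean – a rough sketch that assumes a classification-style task where the assistant message in the chat-format JSONL is the label (say, “fraud” / “not_fraud”); the thresholds are arbitrary:

```python
# Pre-flight sanity check on a fine-tuning dataset: size and label balance.
import json
from collections import Counter

def audit_dataset(path: str, min_examples: int = 300, min_per_label: int = 20) -> None:
    labels = Counter()
    total = 0
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            assistant_msgs = [m["content"] for m in example["messages"]
                              if m["role"] == "assistant"]
            if assistant_msgs:
                labels[assistant_msgs[-1].strip().lower()] += 1
            total += 1

    print(f"{total} examples, {len(labels)} distinct labels")
    if total < min_examples:
        print(f"warning: fewer than {min_examples} examples overall")
    for label, count in labels.items():
        if count < min_per_label:
            print(f"warning: only {count} examples of '{label}' -- likely blind spot")

audit_dataset("training_data.jsonl")
```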
The Iterative Approach (Or: Why Your First Try Will Probably Fail)
Fine-tuning rarely works on the first attempt. There are too many variables. The smart move is to start small – a few hundred examples – and scale up once you know you’re on the right track based on your evaluation metrics.
This is where those stored completions become really useful. You can start collecting examples in production right away, even before you’re ready to fine-tune. By the time you’re ready to distill, you’ve already got real-world data sitting there.
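When you do get there, turning those stored completions into a dataset is mostly filtering by the tags you set earlier. A rough sketch – it assumes your openai SDK version exposes the stored-completions listing endpoint and the same metadata tags used above, so treat the exact parameters as an assumption and check the current API reference:

```python
# Pull stored, tagged production completions for review or dataset building.
# Assumes the stored-completions list endpoint is available in your SDK
# version; parameter names may differ, so verify against the API reference.
from openai import OpenAI

client = OpenAI()

stored = client.chat.completions.list(
    model="gpt-4o",
    metadata={"feature": "quick-reply"},  # same tag used when storing
    limit=100,
)

for completion in stored:
    print(completion.id, completion.choices[0].message.content[:80])
```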
The Lock-In Question (One Year Later)
One thing that struck me during the presentations: distillation creates serious platform lock-in. If you build your app purely on prompt engineering, you can swap between different LLM providers relatively easily. But once you’ve fine-tuned a model with thousands of examples of proprietary data? You’re committed.
OpenAI was obviously aware of this. They were betting that once you’ve distilled models on their platform, you’re not going anywhere.
A year out, this played out exactly as you’d expect. Teams that invested heavily in distillation on OpenAI’s platform are still there. But what’s interesting is that this didn’t stop people – the economics were compelling enough that the lock-in became acceptable. When you’re cutting costs by 10-15x on your biggest expense line, switching costs become less relevant.
The other thing that happened: other providers figured out they needed similar tools. Anthropic, Google, others all rolled out their own versions of fine-tuning and distillation workflows. So the lock-in became less about the concept and more about where your training data lives.
The Hybrid Future (And How It’s Actually Playing Out)
OpenAI’s vision was that most applications would eventually use a collection of distilled small models for specific tasks, with a few large models handling the stuff that genuinely needs broad capabilities and high precision.
A year later, this is mostly what’s happening, but not always in the way they predicted. The pattern that emerged is more nuanced:
Most production apps do use a mix of model sizes
But that mix isn’t always the result of distillation; often it’s just picking mini vs. full-size models depending on the task
The distillation investment makes sense when you have really high volume on a specific, repeated task
For everything else, people are just using the appropriate model size out of the box
The “specialized tools for specific jobs” metaphor holds up. But building those specialized tools through distillation requires enough scale to justify the effort. If you’re not running millions of inferences on the same type of task, you’re probably better off just using a smaller base model and good prompts.
What We Learned After a Year
The predictions about when distillation works mostly held up, but with some nuance:
The success stories are exactly what you’d expect: high-volume, well-defined tasks. Customer service routing, content classification, email triage. Places where you’re doing the same type of thing millions of times and the quality bar is “good enough, consistently” rather than “perfect every time.”
The failures are interesting too. Teams that jumped straight to distillation without really nailing their evals first. Companies that tried to distill too early, before they had enough production data to know what good performance actually looked like. People who underestimated how much iteration they’d need.
The surprise is how few teams actually needed distillation. A lot of use cases that seemed like distillation candidates turned out to work fine with just GPT-4o-mini and decent prompts. The base small models got good enough that the distillation investment only made sense at real scale.
The other thing that became clear: data quality matters more than data quantity. The “thousands not millions” guidance was right, but those thousands need to be really good examples. Garbage in, garbage out still applies, even with fancy fine-tuning tools.
In Part 2, we’ll look at how distillation became unexpectedly political in 2025, with the DeepSeek controversy and what it means for the future of AI development.

