Why Model Distillation Matters - Part 2: When It Became Geopolitical
In Part 1, we covered what model distillation is, how OpenAI’s tools work, and what teams learned from a year of implementation. Now we’re getting into the part nobody saw coming: how distillation became a flashpoint in US-China AI competition.
The DeepSeek Controversy: When Distillation Became a Geopolitical Issue
In January 2025, distillation became front-page news. DeepSeek, a Chinese AI startup, released their R1 model, claiming it cost only $6 million to train and performed comparably to frontier models. The AI world lost its mind.
OpenAI accused DeepSeek of using model distillation on OpenAI’s models without authorization, allegedly extracting reasoning outputs through API access to train their own competing system. David Sacks, the White House AI czar, stated there was “substantial evidence” of this distillation, though specifics weren’t made public.
Here’s what made this different from normal distillation: OpenAI’s terms of service explicitly prohibit using their models for distillation purposes. If DeepSeek did what OpenAI claims, they weren’t just using a public technique – they were allegedly violating terms of service at massive scale to build a competitor.
The accusations raised uncomfortable questions that the industry hadn’t really grappled with:
The legal gray zone: Model distillation itself is a normal, legal practice if the source model’s license permits it. The issue is when someone uses API access to a closed model that explicitly forbids distillation. But proving this happened is remarkably difficult. Since only the final model is public, not the training data, the burden of proof falls on the accuser.
The irony problem: Some experts pointed out the hypocrisy of OpenAI complaining about terms of service violations when they likely trained ChatGPT on copyrighted content from publishers like Forbes and The New York Times, arguably in violation of those publishers’ own terms. The whole thing got messy fast.
The security angle: A bipartisan House report called DeepSeek a “profound threat” to national security, alleging it siphons data back to China and creates security vulnerabilities. Multiple governments banned DeepSeek from official devices, including the US Congress, NASA, and government agencies in Taiwan, Japan, and South Korea.
The detection problem: OpenAI and Microsoft started working together to identify accounts attempting distillation, revoking access when detected. But this is reactive, not preventive. By the time you catch someone, they might already have the data they need.
What emerged from all this is that distillation, which seemed like a straightforward technical optimization a year ago, suddenly became tangled up in intellectual property law, national security concerns, and geopolitical tensions. The AI industry saw a staggering 99% price drop in just two years, from $0.02 per thousand tokens in early 2023 to $0.00014 per thousand tokens with DeepSeek’s pricing. That kind of commoditization raised fundamental questions about how AI companies maintain competitive advantage.
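Taking those per-thousand-token figures at face value, the arithmetic behind the 99% number is simple enough to check (a quick sketch, using only the prices quoted above):

```python
# Back-of-the-envelope check on the price collapse quoted above.
# Both figures are per thousand tokens, as stated in the text.
early_2023_price = 0.02     # USD per 1K tokens, early 2023
deepseek_price = 0.00014    # USD per 1K tokens, DeepSeek's pricing
drop = 1 - deepseek_price / early_2023_price
print(f"Price drop: {drop:.1%}")  # -> Price drop: 99.3%
```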
The practical upshot: OpenAI said it would take “steps to prevent distillation” and work closely with the US government to protect the most capable models. Other providers got more aggressive about rate limiting and detection. The era of freely accessible API access to frontier models started tightening up.
Meanwhile, the Legitimate Distillation Ecosystem Kept Growing
While OpenAI and DeepSeek were fighting, the legitimate distillation tools kept improving. The contrast was stark – all this controversy happened while major cloud providers were actively building out distillation features as core platform capabilities.
Anthropic announced distillation support for Claude 3 Haiku in Amazon Bedrock in October 2025, with the distilled Haiku achieving Claude 3.5 Sonnet-like accuracy on specific tasks at the price and speed of Anthropic’s most cost-effective model. Amazon Bedrock Model Distillation automated the entire process, generating synthetic training data with a range of data synthesis methods so developers didn’t have to hand-craft training examples.
Microsoft expanded Azure OpenAI distillation capabilities in January 2025, adding more regions and models to their Stored Completions feature, plus a comparison experience for evaluating distilled models against base teacher models. The enterprise players were clearly betting that distillation would become standard infrastructure, not a controversial edge case.
The difference between these implementations and the DeepSeek situation? Transparency and authorization. These were officially supported features with clear terms, not unauthorized scraping of API outputs.
What This Means Practically (One Year In)
If you’re building AI products now, here’s what actually matters based on what worked and what didn’t over the past year:
Don’t start with distillation. Start with the smallest model that might work and good prompts. Only invest in distillation when you have clear evidence you need it – usually when costs at scale become painful or when base model performance consistently falls short.
But do start logging. If you think you might eventually need distillation, turn on stored completions from day one. Building that dataset costs you nothing and gives you options later. The teams that did this had a huge advantage when they decided to fine-tune.
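If you’re on OpenAI’s stack, this can be as simple as setting one flag on the calls you’re already making. Here’s a minimal sketch using the OpenAI Python SDK’s stored completions support; the model choice, task, and metadata tags are placeholders for your own setup:

```python
# Minimal sketch: log production traffic as stored completions so it can
# feed a distillation or fine-tuning job later. Model name and metadata
# values below are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def triage_email(subject: str, body: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # the "teacher" model you might later distill from
        messages=[
            {"role": "system", "content": "Classify this email as urgent, normal, or spam."},
            {"role": "user", "content": f"Subject: {subject}\n\n{body}"},
        ],
        store=True,  # persist this completion for later review and training
        metadata={"task": "email-triage", "version": "v1"},  # makes filtering the dataset easier later
    )
    return response.choices[0].message.content
```

The specific task doesn’t matter; the point is that every production call quietly becomes a candidate training example you can filter and export later.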
Use official tools only. The DeepSeek situation made this crystal clear: if you’re going to distill, use officially supported features from your provider. Don’t try to get clever with unauthorized API scraping or violating terms of service. The legal and reputational risks aren’t worth it, and detection is getting better.
Actually build those evaluation metrics. This is still the step everyone wants to skip, and it’s still the difference between success and failure. Every team that did distillation successfully had solid evals first. Every team that struggled didn’t.
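It doesn’t have to be elaborate. A minimal sketch of the shape this usually takes: one fixed, labeled eval set drawn from real traffic, and a single score you can compute for the teacher, the base small model, and every distilled candidate (the examples and labels below are invented for illustration):

```python
# Minimal eval harness sketch: one labeled set, one comparable number per model.
# The eval examples and the predict functions are placeholders.
from typing import Callable

EVAL_SET = [
    # (input text, expected label) pairs pulled from real production traffic
    ("Server is down and customers can't check out", "urgent"),
    ("Congratulations, you've won a free cruise", "spam"),
    ("Can we move Thursday's sync to 3pm?", "normal"),
]

def accuracy(predict: Callable[[str], str]) -> float:
    """Fraction of eval examples where the model's label matches the expected one."""
    correct = sum(
        1 for text, expected in EVAL_SET
        if predict(text).strip().lower() == expected
    )
    return correct / len(EVAL_SET)

# Usage idea: score every model with the same harness and only ship the
# distilled one if it stays within your tolerance of the teacher, e.g.
#   teacher_score = accuracy(call_teacher_model)
#   student_score = accuracy(call_distilled_candidate)
```

The exact metric matters less than having the same yardstick for every model you try: exact-match accuracy works for classification, while open-ended outputs usually need rubric-based or graded scoring.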
Expect iteration. Your first distilled model won’t be good enough. Budget for 3-5 rounds of refinement. The teams that succeeded planned for this from the start.
Check if you even need it. A year of data suggests most applications don’t actually need distilled models. Base small models plus prompt engineering handles more use cases than people expected. Distillation is an optimization for scale, not a default approach.
Consider the IP implications. If you’re distilling models and the technique creates real competitive advantage, think about patent protection. The legal frameworks around AI IP are still evolving, but patents may offer better protection than copyright for model architecture and training techniques.
The Commoditization Question
DeepSeek’s pricing – even if the $6 million training cost is disputed – points to a bigger trend. When model capabilities can be distilled and replicated at a fraction of the original cost, what happens to competitive moats in AI?
Traditionally, software competition has been driven by product differentiation and economies of scale. But if distillation becomes a reliable way to capture most of a frontier model’s capabilities for a specific domain at 1/15th the cost, the value shifts elsewhere:
Proprietary training data becomes more valuable than model architecture
Brand trust and reliability matter more than raw capabilities
Integration and distribution become the real competitive advantages
Legal and policy relationships (who gets access to what) become strategic assets
We’re seeing the beginnings of this shift now. The companies winning aren’t necessarily those with the best base models – they’re the ones with the best data, the strongest partnerships with cloud providers, and the clearest path to regulatory compliance.
What Changed (And What Didn’t)
A year ago, model distillation looked like a straightforward cost optimization technique. Some things changed, some didn’t:
What stayed the same:
The economics still work exactly as promised for high-volume use cases
The three-step process (evals, data generation, fine-tuning) is still the right approach
Data quality matters more than quantity
Most teams don’t actually need distillation
What changed:
Distillation became a geopolitical issue, not just a technical one
Terms of service enforcement got much more serious
The legal frameworks started catching up to the technology
Base small models improved enough to raise the bar for when distillation makes sense
Cloud providers made distillation a standard platform feature, not an advanced technique
What surprised everyone:
How fast pricing collapsed (99% in two years)
How quickly this became about national security
How difficult it is to prove unauthorized distillation happened
How much the controversy exposed gaps in AI IP protection
Looking Back and Forward
A year ago, the really interesting thing about distillation seemed to be what it would enable. When you can take frontier model capabilities and compress them into fast, cheap, specialized tools, you unlock different categories of applications.
That part was right. Real-time voice translation, instant email triage, live customer service that actually works - these are all real now, and distillation is part of why they’re economically viable.
What was less obvious a year ago: how political this would become. Model distillation went from a technical optimization to a flashpoint in US-China AI competition. The DeepSeek controversy showed that as AI capabilities become more strategic, the techniques for transferring those capabilities become strategic too. Expect more export controls, more aggressive terms of service enforcement, and more legal battles over what constitutes legitimate model training versus IP theft.
The other surprise was how much the base small models improved on their own. GPT-4o-mini today is dramatically better than it was at DevDay 2024. Same with Claude 3.5 Haiku getting 60% faster on AWS Trainium2. That raised the bar for when distillation makes sense. You need higher volume or more specialized use cases to justify the investment.
The tools OpenAI announced made distillation easier, but easier doesn’t mean necessary. Most teams should probably optimize their prompts and model selection before investing in fine-tuning. But for the use cases where distillation does make sense - high volume, well-defined tasks, need for consistent behavior - it’s proven to be exactly as useful as promised.
The pattern is becoming clear, but it’s more complicated than we expected a year ago: use the smallest model that works, invest in customization only at scale, always start with evals, and stay firmly within the legal and ethical boundaries because those boundaries are now being actively enforced.
We’re still early, but not as early as we were. And we’re no longer just optimizing for performance and cost – we’re navigating geopolitics, IP law, and questions about what constitutes legitimate AI development versus theft. The technical challenges turned out to be the easy part.
This is Part 2 of a two-part series on model distillation. Read Part 1 for the fundamentals and practical implementation guide.

