Simulated Minds, Real Gaps: What Stanford's GPT-4 Experiment Reveals About the Limits of AI in Social Science
Stanford researchers replaced human participants in 476 randomized controlled trials with GPT-4. The instruction was almost comically simple: "Be the participant."
Here's what they did: instead of recruiting thousands of real people for studies spanning everything from health nudges to behavioral economics, they let AI step in. What normally takes weeks of recruitment, follow-ups, and data wrangling took hours.
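The study's full pipeline isn't reproduced here, but the basic mechanic is easy to picture. Below is a minimal sketch of an LLM "participant" answering a single survey item, assuming OpenAI's chat completions API; the vignette, question, and 1-7 scale are invented for illustration and are not the study's actual materials.

```python
# Minimal sketch of an LLM "participant" answering one survey item.
# The vignette, question, and response scale are illustrative stand-ins,
# not the Stanford team's actual materials.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = "Be the participant."  # the instruction quoted above

vignette = (
    "You receive a text message reminding you that your flu shot is free "
    "at a pharmacy two blocks from your home."
)
question = (
    "How likely are you to get the flu shot this week? "
    "Answer with a single number from 1 (not at all likely) to 7 (extremely likely)."
)

response = client.chat.completions.create(
    model="gpt-4",
    temperature=1.0,  # keep some response variability
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"{vignette}\n\n{question}"},
    ],
)

print(response.choices[0].message.content)  # e.g. "6"
```

Repeat that across conditions and hundreds of simulated respondents, and you have a synthetic study arm in an afternoon.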
The results? GPT-4 predicted human behavior with a correlation of r ≈ 0.85. That's not just good - that's expert-level accuracy. And it held even for studies published after GPT-4's training cutoff, suggesting real generalization rather than memorization.
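For anyone who wants to see what that benchmark step looks like, here's a toy sketch: correlate the effect sizes estimated from simulated participants with the effects observed in the original studies. The numbers are made up purely for illustration.

```python
# Toy sketch of the benchmarking step: correlate effect sizes from
# simulated participants with those observed in the real studies.
# All numbers below are fabricated for illustration.
import numpy as np
from scipy.stats import pearsonr

observed_effects = np.array([0.12, 0.30, -0.05, 0.41, 0.18, 0.07])
simulated_effects = np.array([0.10, 0.26, 0.02, 0.38, 0.22, 0.05])

r, p_value = pearsonr(simulated_effects, observed_effects)
print(f"r = {r:.2f}, p = {p_value:.3f}")
```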
The efficiency gains are staggering. Early-stage pilot budgets slashed by up to 90%.
But here's where it gets interesting - and where my skepticism kicks in.
These AI "participants" were just too agreeable. They nodded along, rarely pushed back, and showed the kind of compliance that would make any experienced researcher suspicious.
The diversity of responses was thin. Outliers and contrarians - the people who often reveal the most important insights - barely showed up in the data.
The Stanford team tried to fix this. They used multiple personas per prompt, added few-shot learning examples, and seeded responses with prior distribution data. Still, something fundamental was missing: the beautiful, stubborn messiness that real humans bring to every interaction.
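To make the persona idea concrete, here's a hedged sketch of what that kind of variation can look like in practice: run the same item through several demographic personas and inspect the spread of answers. The personas, vignette, and prompt phrasing are all invented for illustration, not taken from the paper.

```python
# Sketch of persona variation: ask the same survey item through several
# demographic personas and inspect the spread of answers.
# Personas, vignette, and scale are invented for illustration.
from openai import OpenAI

client = OpenAI()

vignette = (
    "You receive a text message reminding you that your flu shot is free "
    "at a pharmacy two blocks from your home."
)
question = (
    "How likely are you to get the flu shot this week? "
    "Answer with a single number from 1 (not at all) to 7 (extremely likely)."
)

personas = [
    "a 23-year-old graduate student who distrusts pharmaceutical companies",
    "a 67-year-old retired nurse who gets a flu shot every year",
    "a 41-year-old shift worker with no regular doctor",
]

answers = []
for persona in personas:
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=1.0,  # leave room for variability
        messages=[
            {"role": "system", "content": f"Be the participant. You are {persona}."},
            {"role": "user", "content": f"{vignette}\n\n{question}"},
        ],
    )
    answers.append(resp.choices[0].message.content.strip())

print(answers)  # the worry: three near-identical, compliant answers
```

If those three answers come back nearly identical, that's the agreeableness problem in miniature.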
There's another wrinkle that makes me pause: as these models get trained on fresher data, benchmarking becomes tricky. Eventually, we'll need to rely on unpublished or archival studies to ensure the AI isn't just echoing patterns it's already seen.
And even with headlines screaming about r = 0.85 correlations, I keep coming back to the same thought: synthetic data is still a map, not the territory.
Real people bring chaos. They bring context you didn't expect, cultural nuances you missed, and reactions that don't fit your models. They bring the stuff you can't automate away or average out in your analysis.
So yes, r = 0.85 is impressive. But I'm more interested in everything that 0.85 doesn't capture - the wild cards, the dissenters, the voices that change everything. That's the part you'll never fully automate out of human behavior.