AI Scientists: The First Peer-Reviewed Paper Generated by AI
Recently, I encountered The AI Scientist, a system that claims to "fully" automate scientific discovery: it formulates hypotheses, designs experiments, analyzes results, and even writes scientific papers - all without human involvement.
The AI Scientist-v2 achieved a historic milestone: it produced the first scientific paper generated entirely by AI to be accepted at a workshop after peer review.
The system can conduct research around the clock, potentially finding patterns that humans might miss due to cognitive limitations.
The project is open to the community: source code on GitHub allows for experimentation, refinement, and integration of new models.
Such an approach could accelerate scientific progress and make research more transparent and accessible, but it also raises new questions about the role of humans, trust in results, and the ethics of autonomous AI in science.
What's Behind the Technology
After diving deep into The AI Scientist-v2 system from Sakana AI, I was impressed by its architectural sophistication. This isn't just "GPT writes papers" - it's a comprehensive research pipeline with several revolutionary improvements:
Agentic Tree Search. Unlike the linear approach of the previous version, v2 uses a tree-like structure for experiments. Each tree node represents a separate experiment with Python code, research plan, and results. The system explores multiple hypotheses in parallel, automatically debugs code when errors occur, and selects the best directions for further development.
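To make the mechanics concrete, here is a minimal, self-contained sketch of such a tree search. Every name in it (ExperimentNode, propose_variants, run_experiment) is an illustrative placeholder with stubbed-in dummy logic; it shows the expand-score-prune loop, not Sakana AI's actual code.

```python
import random
from dataclasses import dataclass, field

@dataclass
class ExperimentNode:
    plan: str                        # research plan for this branch
    code: str = ""                   # generated experiment code (unused stub)
    score: float = 0.0               # metric used to rank branches
    children: list["ExperimentNode"] = field(default_factory=list)

def run_experiment(node: ExperimentNode) -> float:
    # Stand-in for executing the node's generated code and reading a metric.
    return random.random()

def propose_variants(parent: ExperimentNode, k: int) -> list["ExperimentNode"]:
    # Stand-in for an LLM proposing k follow-up experiments from a parent.
    return [ExperimentNode(plan=f"{parent.plan} / variant {i}") for i in range(k)]

def tree_search(root: ExperimentNode, depth: int = 3, beam: int = 3) -> ExperimentNode:
    frontier, best = [root], root
    for _ in range(depth):
        next_frontier = []
        # Expand only the highest-scoring branches; weaker ones are pruned.
        for node in sorted(frontier, key=lambda n: n.score, reverse=True)[:beam]:
            for child in propose_variants(node, beam):
                try:
                    child.score = run_experiment(child)
                except Exception:
                    continue  # the real system would debug the code and retry
                node.children.append(child)
                next_frontier.append(child)
                best = max(best, child, key=lambda n: n.score)
        frontier = next_frontier or frontier
    return best

print(tree_search(ExperimentNode(plan="baseline")).plan)
```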
Complete Autonomy. The system no longer depends on human-written code templates. AI independently generates all experimental code from scratch, based only on high-level research ideas. This dramatically expands the system's applicability.
Vision-Language Model Integration. VLMs analyze generated graphs and visualizations, checking their correctness, caption clarity, and alignment with conclusions. This ensures quality scientific presentation of results.
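As a rough illustration of how such a check could be wired up, here is a sketch that sends each figure and its caption to a vision-capable model via the OpenAI chat API. The model name, prompt, and review criteria are my assumptions, not Sakana AI's implementation.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def review_figure(png_path: str, caption: str) -> str:
    # Encode the rendered figure so it can be sent inline to the model.
    with open(png_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model would do
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Does this figure match its caption? Flag unlabeled "
                         f"axes, wrong legends, or mismatches.\nCaption: {caption}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```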
Four-Stage Process. Research follows structured phases: preliminary investigation → hyperparameter tuning → research agenda execution → ablation studies. At each stage, the system runs up to 21 experiments in parallel.
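The staged fan-out itself is easy to picture. The sketch below uses the stage names from the description above and the 21-experiment cap mentioned there; the function names and the thread-pool mechanism are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

STAGES = [
    "preliminary_investigation",
    "hyperparameter_tuning",
    "research_agenda_execution",
    "ablation_studies",
]
MAX_PARALLEL = 21  # cap on concurrent experiments per stage

def run_one(stage: str, idx: int) -> dict:
    # Stand-in for generating and executing one experiment in this stage.
    return {"stage": stage, "idx": idx, "metric": 0.0}

results = []
for stage in STAGES:  # stages run sequentially...
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        # ...but experiments within a stage run in parallel, and each
        # stage's results inform what the next stage explores.
        futures = [pool.submit(run_one, stage, i) for i in range(MAX_PARALLEL)]
        results.extend(f.result() for f in futures)
```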
Historic Achievement with Caveats
The result is impressive: one of three generated papers received an average score of 6.33 out of 10 from reviewers at the ICLR workshop "I Can't Believe It's Not Better", exceeding the acceptance threshold. The paper investigated compositional regularization in neural networks and, interestingly, received positive feedback precisely for honestly presenting negative results.
However, it's important to understand the context: this was a workshop, not a main conference track. Acceptance rates at workshops typically run at 60-80%, versus 20-30% at top-tier conferences. Only one of the three submitted papers was accepted.
Upon detailed analysis of the accepted paper, the system's authors identified significant flaws: inaccuracies in figure captions, problems with training/test dataset overlap (57% overlap!), terminology confusion, and insufficient justification for methodological choices.
Technical Limitations and Ethical Questions
The system is still far from generating breakthrough hypotheses or deep domain understanding. Experiments are limited to relatively simple ML tasks, and the quality of work doesn't yet reach top-conference standards.
The ethical aspect is particularly important. The Sakana AI team obtained ethics committee approval, warned reviewers about possible AI-generated submissions, and subsequently withdrew the accepted paper to avoid setting a precedent without public discussion.
This raises fundamental questions: Should AI papers be labeled? How should they be evaluated alongside human work?
Development Trajectory
Despite limitations, the trajectory is impressive. In two years, we've moved from proofs of concept to systems capable of passing peer review. The rapid development of AI tools suggests that within a few years, we might see conference-level AI researchers.
The potential is enormous: AI can work 24/7, is less prone to human cognitive biases, can test many hypotheses in parallel, and can process vast volumes of literature. In fields with large datasets and clearly defined metrics - from bioinformatics to materials science - such systems could significantly accelerate discoveries.
But the main question remains open: Can AI generate truly revolutionary ideas, or only optimize known approaches? So far, creativity and conceptual breakthroughs remain human prerogatives.
What This Means for Science
We stand on the threshold of a fundamental change in the scientific process. AI researchers could become powerful tools for accelerating routine aspects of science: literature reviews, experiment replication, systematic testing of variations.
But this also requires rethinking the role of human researchers. Perhaps the future lies with hybrid teams, where AI performs large-scale computational work while humans focus on posing deep questions, interpreting results in broad context, and determining research directions.
The AI Scientist-v2 system isn't the end of the road, but an important milestone toward a new paradigm of scientific research. How we integrate these tools will determine the future of science for decades to come.
The Reality Check
Let me be clear about what we're actually seeing here. When I examined the accepted paper's code and methodology, several concerning issues emerged:
Dataset contamination: 57% overlap between training and test sets, fundamentally compromising the reliability of results (a simple check for this kind of leakage is sketched after this list)
Methodological confusion: The paper confused "embedding states" with "hidden states," indicating imprecise understanding
Overclaimed results: The system reported 100% accuracy that was due mainly to task simplicity, not an algorithmic breakthrough
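On the first issue, a leakage check that would have caught the 57% overlap can be as simple as the sketch below. The exact-match comparison is an assumption; real checks often need normalization or near-duplicate detection.

```python
def overlap_fraction(train: list[str], test: list[str]) -> float:
    # Fraction of test examples that also appear verbatim in the training set.
    train_set = set(train)
    return sum(x in train_set for x in test) / len(test) if test else 0.0

train = ["a b c", "d e f", "g h i"]
test = ["a b c", "x y z"]
print(f"{overlap_fraction(train, test):.0%}")  # -> 50%
```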
This isn't to diminish the achievement - it's to calibrate our expectations. We're witnessing the first steps of AI scientific reasoning, not its maturation.
The Bigger Picture
What excites me most isn't the current capabilities, but the acceleration curve. The system went from template-dependent linear exploration to autonomous tree-based research in just one iteration. The improvements are architectural, not just computational - suggesting we're on a steep learning curve.
We have the opportunity to create tools that enhance human curiosity rather than replace it. We can accelerate progress while preserving the essential human elements that make science meaningful.
This is our moment to shape the future of knowledge itself. Let's make it count.
The paper and additional materials are available in Sakana AI's GitHub repository.