The system, called The AI Scientist, was developed by Sakana AI, a Tokyo-based company. It generated research ideas, wrote the code, ran the experiments, wrote the manuscript, and performed its own automated peer review. The researchers submitted three papers to the ICBINB workshop at the International Conference on Learning Representations (ICLR) 2025. One passed. The team withdrew the paper before publication, as had been agreed with the workshop organisers in advance.
The Sakana AI team were candid about the limitations of the system: hallucinated citations, figures repeated across sections, and an acceptance at a workshop with a 70% acceptance rate, which leaves the broader research integrity question open. The human researchers documented these findings and published them in Nature. The conversation that followed focused, predictably, on AI.
It should have focused on citations.
Citation behaviour in biomedical literature has been studied for decades. The findings are not flattering. A significant proportion of citations in published papers are secondhand, copied from the reference lists of other papers rather than verified at the primary source [2]. Authors cite papers they have not read. They cite papers that, on inspection, do not support the claims they are used to support. This is not a new problem. It is a structural feature of how scientific papers are currently written.
What The AI Scientist paper introduced was not a new failure mode. It was a faster, higher-volume version of an existing one. An AI system can produce fifty plausible-sounding references in the time a researcher takes to draft a paragraph. Each reference has a title, a journal name, authors, and a year. Each looks credible. Very few will be checked at source.
AI does not invent the citation problem. It produces plausible-looking citations faster, at scale, and with lower detectability.
The error mode is one researchers have always had. The volume and velocity are new.
The practical implications are not limited to papers submitted to machine learning workshops. Consider three scenarios that are not hypothetical:
A research team uses an AI tool to generate a literature summary supporting a new project direction. The summary includes six supporting references. Two do not exist. The summary is approved. The project direction is set. The error is not discovered until a regulatory submission reviewer asks for the primary sources.
A PI uses an AI-assisted drafting tool to prepare the background section of a grant application. The tool suggests three supporting citations for the scientific rationale. One does not exist. The application is submitted. The funder's scientific advisor, who works in the same subfield, searches for the paper and does not find it.
A clinical team uses AI to assist with the literature review section of a regulatory dossier. Supporting evidence for a safety claim includes an AI-suggested reference that, on inspection, does not support the claim as cited. The regulatory reviewer requests clarification on the citation. The submission is delayed.
In each of these cases, the AI tool was used as a time-saving measure. In each case, the citation verification step was implicitly skipped because it was never formally required. The AI did not cause the failure. The absence of a mandatory verification protocol did.
The obvious institutional response to The AI Scientist paper is to invest in better AI detection tools. This is understandable. It also addresses the wrong layer of the problem.
AI detection tools identify stylistic patterns in writing. They do not verify citation validity. A paper written entirely by a human that includes one AI-generated paragraph of references will not reliably be flagged. The problem sits upstream of detection: it is in the writing and review workflow itself, not in the output text.
Arguing for better detection tools in response to this problem is arguing for better smoke alarms in a building where the sprinkler system was never installed. The tools address symptoms. The sprinkler system, a mandatory citation verification protocol, was never built.
This is not an argument against detection tools. They serve a function. It is an argument against treating them as the primary response to a workflow failure they were not designed to catch.
These are not proposals for institutional reform. They are changes that a PI, a postdoc, or a research operations lead can implement in their current manuscript workflow without requiring new tools, a new budget, or a committee decision.
Citation verification is a scientific responsibility, not an editorial one. It belongs on the manuscript checklist alongside statistical review and data availability confirmation, before submission, not during copyediting. If you have a manuscript submission checklist, citation verification goes on it. If you do not have a checklist, this is a reason to build one.
Any reference added during an AI-assisted literature search or drafting session should be flagged as unverified until checked against a reliable literature database such as PubMed, Web of Science, or Scopus. A simple naming convention in your reference manager is sufficient. Remove the flag only after you have confirmed the paper exists, is accessible, and supports the claim it is being used to support.
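The existence half of that check is mechanical enough to script. Here is a minimal sketch, assuming your references are exported as DOIs and using Crossref's public REST API (the `check_doi` helper and the example DOIs in the demo are illustrative, not part of any standard tooling):

```python
# Minimal sketch: confirm each DOI in a reference list resolves to a real
# record in Crossref. Existence is necessary but not sufficient: a human
# still has to confirm the paper supports the claim it is cited for.
import json
import urllib.request
from urllib.error import HTTPError

CROSSREF = "https://api.crossref.org/works/"

def check_doi(doi: str) -> str | None:
    """Return the registered title for a DOI, or None if Crossref has no record."""
    try:
        with urllib.request.urlopen(CROSSREF + doi, timeout=10) as resp:
            record = json.load(resp)
        titles = record["message"].get("title", [])
        return titles[0] if titles else "(untitled record)"
    except HTTPError as err:
        if err.code == 404:   # DOI not registered: possibly hallucinated
            return None
        raise                 # other errors (rate limit, outage) warrant a retry

if __name__ == "__main__":
    dois = [
        "10.1038/s41586-020-2649-2",  # real (the NumPy paper in Nature)
        "10.9999/fake.2024.0001",     # fabricated here, for illustration
    ]
    for doi in dois:
        title = check_doi(doi)
        status = "VERIFIED  " if title else "UNVERIFIED"
        print(f"{status} {doi}  {title or ''}")
```

An existence check like this catches fabricated references. It does not catch the subtler failure, a real paper cited for a claim it does not make, which still requires reading the source.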
For research organisations: this does not require new tools. It requires adding one explicit step to the internal review process that currently does not exist in most lab workflows. A citation audit takes thirty minutes on a typical manuscript. It identifies secondhand citations, unverifiable references, and AI-suggested papers that do not exist. The thirty minutes are recoverable from the time lost to a delayed submission or a returned manuscript.
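That audit step can also be made mechanical at the end. The sketch below assumes the flag convention suggested above is recorded in each BibTeX entry's note field as an "UNVERIFIED" prefix; that convention is an assumption of this example, not a reference-manager standard:

```python
# Minimal audit sketch: scan a BibTeX export for entries still carrying the
# UNVERIFIED flag, and fail the pre-submission check until every flag is cleared.
import re
import sys

ENTRY = re.compile(r"@\w+\{(?P<key>[^,]+),")
FLAG = re.compile(r"note\s*=\s*[{\"]\s*UNVERIFIED", re.IGNORECASE)

def audit(bibtex: str) -> list[str]:
    """Return citation keys of entries whose note field is flagged UNVERIFIED."""
    flagged = []
    # Split on entry boundaries so each chunk holds one reference.
    for chunk in re.split(r"(?=@\w+\{)", bibtex):
        m = ENTRY.search(chunk)
        if m and FLAG.search(chunk):
            flagged.append(m.group("key").strip())
    return flagged

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as fh:
        keys = audit(fh.read())
    if keys:
        print(f"{len(keys)} unverified reference(s): " + ", ".join(keys))
        sys.exit(1)   # non-zero exit blocks the submission step
    print("All references verified.")
```

Wired into a pre-submission script, the non-zero exit code makes the verification step mandatory rather than aspirational.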
The practical response to the ICLR workshop experiment, then, is not a new detection tool. It is a new mandatory step in how manuscripts are prepared, one that treats every reference as evidence rather than decoration. Citation verification was never formalised as a step because it was assumed. AI has ended the period in which that assumption was safe to make.
The researchers at Sakana AI were transparent about what their system produced and where it fell short: hallucinated citations, a paper accepted at a high-acceptance-rate workshop, and a result that was withdrawn before it entered the published literature. That transparency is useful precisely because it names a problem that was already present in scientific publishing before The AI Scientist wrote a single word.
The question is not whether AI will change scientific publishing. It already has. The question is whether the infrastructure around manuscript preparation will catch up, and whether that happens proactively, by changing workflow protocols, or reactively, after a retraction that could have been prevented.