The 5 Principles Every AI Research Stack Now Has to Solve

In ~7 mins: the 5 principles a 20-author survey of 250+ tools just landed on, an 8-stage map of where AI helps and where it breaks, and 6 rules for using it without losing scientific accountability.
This article was originally posted on X last Fri (22 May).

The AI Scientist generates a complete research paper for $15.

FARS ran for 228 hours, burned 11.4 billion tokens, and shipped 100 papers.

On MLR-Bench’s fully autonomous track, 80% of reported results turned out to be fabricated.

The bottleneck moved. It is no longer generation. It is verification, provenance, and human handoffs across the research lifecycle.

Twenty researchers just published the first end-to-end survey of AI across the complete academic research lifecycle.

The paper is authored by Lingdong Kong (NUS / Apple) with 19 co-authors and titled “AI for Auto-Research: Roadmap & User Guide.” It was posted to arXiv on May 18, 2026 and covers developments through April 2026.

The scope is unusual. Most existing surveys cover one stage: literature review, or coding agents, or paper writing. This one maps 250+ tools, 52 benchmarks, and 33 end-to-end systems into a single framework of 4 phases and 8 stages.

The companion repo, worldbench/awesome-ai-auto-research, is MIT-licensed and had 100+ stars and 8 forks as of May 22.

The framework breaks the lifecycle into Creation (ideation, literature, coding, figures), Writing (manuscript), Validation (peer review, rebuttal), and Dissemination (Paper2X). That structure carries the whole argument.

The cost-to-quality numbers refuse to scale.

AI Scientist v2 scores 6.33 on an ICLR 1–10 scale at $25 per paper. FARS, running roughly $1,000 per paper, scores 5.05. The ICLR acceptance threshold is 5.69.

The cheaper system is past the line. The 40x more expensive one is below it.
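The arithmetic behind that comparison can be checked in a few lines. This is only a sketch: the figures are the ones quoted above, and the names (`systems`, `margin`) are illustrative, not from the paper.

```python
# Figures quoted in the article; all names here are illustrative.
ICLR_THRESHOLD = 5.69

systems = {
    "AI Scientist v2": {"cost_usd": 25, "score": 6.33},
    "FARS": {"cost_usd": 1000, "score": 5.05},
}


def margin(score, threshold=ICLR_THRESHOLD):
    """Signed distance from the acceptance threshold, rounded to 2 places."""
    return round(score - threshold, 2)


cost_ratio = systems["FARS"]["cost_usd"] / systems["AI Scientist v2"]["cost_usd"]
margins = {name: margin(s["score"]) for name, s in systems.items()}

print(f"cost ratio: {cost_ratio:.0f}x")
for name, m in margins.items():
    side = "above" if m > 0 else "below"
    print(f"{name}: {abs(m):.2f} {side} the threshold")
```

On these numbers the two systems sit the same distance from the line, 0.64 points, on opposite sides of it.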

Pattern-matching benchmarks overstate scientific coding ability too. Frontier systems clear 76%+ on SWE-bench Verified. The same systems top out at 37–39% on ResearchCodeBench, where the task is to implement a method described in a paper. The semantic-error rate there is 58.6%.

The validation numbers are worse. In an LLM-reviewer benchmark, 95.8% of rejected papers were misclassified as acceptable. In MLR-Bench’s fully autonomous track, 80% of reported results turned out to be fabricated.

The paper’s read is that the field stopped being capability-limited and became reliability-limited. That reframe is what makes the survey worth reading.

AI is reliable when the task is structured, grounded in retrievable evidence, and externally checkable.

SWE-bench Verified at 76%+ vs. ResearchCodeBench at 37–39% is the cleanest illustration. One measures bug fixes against known passing tests. The other measures whether the model implemented the algorithm the paper actually described.

Same models. Different ceiling.

This holds across stages. Retrieval, citation candidates, plot drafts, grammar polishing, format conversion: solid. Novelty assessment, decisive experiment design, long-horizon reasoning, contribution framing: fragile.

This is the paper’s central tension. AI produces research-shaped artifacts faster than it can prove they are correct.

Ideas look novel on the page and weaken after a single implementation attempt. Code runs cleanly while implementing a different algorithm than the paper described. Figures look publication-ready while distorting axes or dropping baselines.

Reviews come back coherent and lenient. Rebuttals read as persuasive and promise experiments that authors never run. Dissemination artifacts simplify results past the evidence the paper actually provides.

The risk is not that the artifacts are useless. It is that they get treated as validated because they look complete.
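A toy version of "runs cleanly, implements something else" makes the gap concrete. This is a hypothetical example, not one from the paper: assume a method section specifies sample variance with an n - 1 denominator, and the generated code divides by n. Both functions pass a smoke test. Only a check against a hand-worked value tells them apart.

```python
def variance_as_described(xs):
    """Sample variance with the n - 1 denominator (the hypothetical paper's definition)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


def variance_as_generated(xs):
    """Runs cleanly, but divides by n: a different estimator than the one described."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def smoke_test(fn):
    """The kind of check both versions pass: returns a non-negative float."""
    out = fn([1.0, 2.0, 3.0, 4.0])
    return isinstance(out, float) and out >= 0.0


def matches_reference(fn, xs, expected, tol=1e-12):
    """Execution-grounded check against a value worked out by hand."""
    return abs(fn(xs) - expected) <= tol


# Mean is 5, squared deviations sum to 32, so the described value is 32 / 7.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
expected = 32 / 7

for fn in (variance_as_described, variance_as_generated):
    print(fn.__name__, smoke_test(fn), matches_reference(fn, data, expected))
```

The smoke test is the "looks complete" signal. The reference check is the one that carries evidence.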

The strongest empirical result in the paper comes from the ICLR 2025 randomized study of AI in peer review. Across 22,467 reviews, 89% improved in quality when an LLM gave feedback on a human reviewer’s draft.

Hand the same family of models a paper to review alone, and 95.8% of rejected papers come back misclassified as acceptable.

Assist mode lifts quality. Replace mode breaks it.

That asymmetry shows up everywhere the paper looks for it: writing, review, rebuttal, dissemination. AI augments researchers reliably. As the reviewer of record, the same model fails reliably.

Systems that produce credible work all combine the same three layers, regardless of branding.

Exploration searches over hypotheses, code variants, or response strategies through MCTS, evolutionary methods, or branching agents. Execution drives external tools: code interpreters, retrieval engines, experiment runners, plotters, document editors. Verification checks intermediate outputs through execution feedback, citation validation, adversarial critique, or human review.
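The three layers can be sketched as one loop. Everything below is a toy under stated assumptions: the "experiment" is a one-line objective, exploration is random perturbation rather than MCTS or evolution, and `agent_report` is a hypothetical stand-in for an agent that sometimes inflates its numbers. No surveyed system is implemented this way.

```python
import random


def explore(candidates, n_variants, rng):
    """Exploration: propose variants of the current candidates (toy random perturbation)."""
    return [c + rng.uniform(-1.0, 1.0) for c in candidates for _ in range(n_variants)]


def execute(candidate):
    """Execution: run the candidate through an external tool (here, a toy objective)."""
    return -(candidate - 3.0) ** 2  # higher is better, peak at 3.0


def agent_report(candidate, rng):
    """Hypothetical agent that inflates roughly 20% of its reported results."""
    true_score = execute(candidate)
    return true_score + 5.0 if rng.random() < 0.2 else true_score


def verify(candidate, reported):
    """Verification: re-run independently and reject reports that do not reproduce."""
    return abs(execute(candidate) - reported) < 1e-9


def research_loop(rounds=20, keep=3, seed=0):
    rng = random.Random(seed)
    pool = [(execute(0.0), 0.0)]  # (verified score, candidate)
    for _ in range(rounds):
        verified = list(pool)
        for cand in explore([c for _, c in pool], n_variants=4, rng=rng):
            reported = agent_report(cand, rng)
            if verify(cand, reported):  # unverified reports never enter the pool
                verified.append((reported, cand))
        verified.sort(reverse=True)
        pool = verified[:keep]
    return pool[0]


score, best = research_loop()
print(f"best candidate: {best:.3f}, verified score: {score:.3f}")
```

Drop the `verify` call and the inflated reports sort to the top of the pool every round. That is the failure mode the verification layer exists to block.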

Stacks built on “more agents = better” lose on sequential reasoning. Google and MIT scaling work cited in the paper finds an empirical sweet spot at 3 to 4 coordinated agents. Bigger swarms accumulate communication overhead faster than they gain critique quality.

Corpus studies estimate detectable AI modification in 17.5% of computer science abstracts and 13.5% of biomedical abstracts. Self-reported usage runs higher.

Detection-based enforcement does not scale. AI text detectors false-positive on formal academic prose and non-native writing. Watermarking depends on provider cooperation and breaks under paraphrase and translation.

What stays durable is a different set of questions. Which forms of AI use must be disclosed? Who is accountable when an AI-generated citation is fabricated, or an AI-drafted rebuttal commits to an experiment that never runs?

Policy has to follow disclosure and accountability, not detection. The paper lands there.

The lifecycle compresses into a map. Each stage has work AI does well, work that needs human inspection, and work that should not be delegated yet.

Six rules sit underneath that map.

Treat every phase handoff as a verification checkpoint, not a transition.
Prefer execution-grounded evaluation over LLM-as-judge for any claim that can be tested.
Use AI to strengthen human reviews. The 89% quality lift only shows up in assist mode.
Track every rebuttal promise against the actual manuscript diff before camera-ready.
Compare every dissemination artifact against the paper’s caveats before release.
Disclose AI use, attribute decisions, accept accountability for AI-generated claims.
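The fourth rule is the easiest to mechanize. A minimal sketch, assuming each rebuttal promise is logged with a few evidence keywords (a hypothetical format, not one the paper prescribes) and using Python's standard `difflib` to find what was actually added between the submitted and camera-ready text:

```python
import difflib


def added_lines(old_text, new_text):
    """Lines present in the revised manuscript but not in the submitted one."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]


def unmet_promises(promises, old_text, new_text):
    """Promises whose evidence keywords never all appear in the added text."""
    added = " ".join(added_lines(old_text, new_text)).lower()
    return [
        p["text"]
        for p in promises
        if not all(k.lower() in added for k in p["evidence_keywords"])
    ]


# Hypothetical manuscript snippets and promise log, for illustration only.
submitted = "We evaluate on CIFAR-10.\nOur method improves accuracy."
camera_ready = (
    "We evaluate on CIFAR-10.\n"
    "We add an ablation over the learning rate.\n"
    "Our method improves accuracy."
)
promises = [
    {"text": "Add a learning-rate ablation", "evidence_keywords": ["ablation", "learning rate"]},
    {"text": "Add ImageNet results", "evidence_keywords": ["ImageNet"]},
]

print(unmet_promises(promises, submitted, camera_ready))
```

Keyword matching is a crude proxy. It flags promises for a human to chase. It does not confirm the experiment was run or that the added text is correct.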

The AI Scientist camp is not wrong about cost. They are wrong about ceiling.

Sakana AI, FARS, and the AI Scientist v2 line argue autonomy is already useful at the right cost-quality tradeoff. The rebuttal sits inside the paper itself. A 40x increase in spend per paper (from $25 to $1,000) does not buy quality. It buys volume that scores below the acceptance threshold.

Three problems remain open across every system surveyed. They are the real reason the field is not where the demos suggest.

Phase-boundary faithfulness. No system surveyed maintains a traceable link from hypothesis to dissemination. Hypotheses, logs, manuscript claims, and rebuttal promises break apart at every handoff.
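One way to picture a traceable link is a ledger where every artifact points at the upstream artifact it was derived from. The sketch below is an assumption about what such a ledger could look like, not a design from the paper. The stage names, the `Artifact` type, and the IDs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stage order, upstream to downstream.
STAGES = ("hypothesis", "experiment_log", "manuscript_claim", "dissemination")


@dataclass(frozen=True)
class Artifact:
    id: str
    stage: str
    parent: Optional[str]  # id of the upstream artifact this one derives from


def trace(artifact_id, ledger):
    """Walk parent links upstream and return the stages visited, in order."""
    stages, seen = [], set()
    current = ledger.get(artifact_id)
    while current is not None and current.id not in seen:
        seen.add(current.id)
        stages.append(current.stage)
        current = ledger.get(current.parent) if current.parent else None
    return stages


def is_traceable(artifact_id, ledger):
    """True only if the chain passes through every upstream stage back to a hypothesis."""
    start = ledger[artifact_id].stage
    expected = list(reversed(STAGES[: STAGES.index(start) + 1]))
    return trace(artifact_id, ledger) == expected


artifacts = [
    Artifact("h1", "hypothesis", None),
    Artifact("log1", "experiment_log", "h1"),
    Artifact("claim1", "manuscript_claim", "log1"),
    Artifact("post1", "dissemination", "claim1"),
    Artifact("post2", "dissemination", "claim9"),  # points at a claim that was never logged
]
ledger = {a.id: a for a in artifacts}

print(is_traceable("post1", ledger), is_traceable("post2", ledger))
```

The second post is the broken handoff: it reads like a result, but nothing upstream backs it.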

Citation provenance. Generated bibliographies routinely mix metadata across preprint, workshop, conference, and journal versions of the same work. Author list, year, venue, and DOI can come from four different records of one paper. No surveyed tool fixes this.
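The mixing problem is checkable once the candidate records are in hand. A minimal sketch, assuming you already hold the metadata for each version of a paper. The records, placeholder DOIs, and field names below are invented for illustration. A real check would pull them from a bibliographic source.

```python
FIELDS = ("authors", "year", "venue", "doi")

# Hypothetical records for two versions of one paper (placeholder DOIs).
records = [
    {"version": "preprint", "authors": "A. Author, B. Author", "year": 2024,
     "venue": "arXiv", "doi": "10.0000/example.preprint"},
    {"version": "conference", "authors": "A. Author, B. Author, C. Author", "year": 2025,
     "venue": "ExampleConf", "doi": "10.0000/example.conf"},
]


def field_sources(entry, records):
    """For each field, list the versions whose record carries that exact value."""
    return {f: [r["version"] for r in records if r[f] == entry[f]] for f in FIELDS}


def is_mixed(entry, records):
    """True if no single record accounts for every field of the entry."""
    return not any(all(entry[f] == r[f] for f in FIELDS) for r in records)


# A generated entry that stitches the preprint's authors and DOI onto the conference's year and venue.
generated = {"authors": "A. Author, B. Author", "year": 2025,
             "venue": "ExampleConf", "doi": "10.0000/example.preprint"}
clean = {f: records[1][f] for f in FIELDS}

print(is_mixed(generated, records), field_sources(generated, records))
print(is_mixed(clean, records))
```

Exact string matching is deliberately strict here. Real metadata needs normalization of author names and venue abbreviations first, which this sketch skips.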

Cognitive ownership. Aggressive automation hides the work that turns a junior researcher into a senior one. Delegating literature synthesis or rebuttal writing keeps researchers from developing the field judgment and critical reasoning that build over time.

The reason this paper is worth seven minutes is the reframe. It stops asking whether one autonomous AI scientist can replace a human researcher. It starts asking whether the artifacts a research process produces still link to evidence by the time they reach the public.

That is the right question to be asking in May 2026.

Which principle does your current AI research stack solve, and which one is still open?
