
The AI Replication Engine: New Experiments, New Results, and the Road to Beta


When we first introduced the AI Replication Engine in November 2025 (blog), the goal was to build a system that could help automate research verification at scale. In February 2026 (blog), we shared the first benchmark results and showed that autonomous replication was already possible on a real paper with strong performance. This update is the next step in that sequence: broader experiments, clearer benchmarks, and a shorter path from proof of concept to a usable public tool.

What is new this time is not just that the system can rerun a replication package. We are expanding the test suite across more papers, more tasks, and more models, with the aim of evaluating the full workflow rather than a single successful run. That means testing not only computational reproducibility but also the parts of the process that matter for real verification work: catching coding mistakes, handling messy documentation, and checking whether results remain stable under reasonable alternatives.
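To make those evaluation dimensions concrete, here is a minimal sketch of what a per-run verification record could look like. The `ReplicationCheck` type and its field names are hypothetical illustrations chosen to mirror the checks described above; they are not the Engine's actual internal schema.

```python
from dataclasses import dataclass, field

@dataclass
class ReplicationCheck:
    """Hypothetical record of one verification run (illustrative only)."""
    paper_id: str
    reproduced: bool                  # did the replication package rerun end to end?
    coding_errors: list[str] = field(default_factory=list)  # mistakes caught in the original code
    doc_gaps: list[str] = field(default_factory=list)       # missing or unclear documentation
    robust: bool | None = None        # stable under reasonable alternative specifications?

    def clean(self) -> bool:
        # A run is fully clean only if it reproduced, surfaced no coding
        # errors, and did not fail any robustness check that was attempted.
        return self.reproduced and not self.coding_errors and self.robust is not False
```

Keeping these as separate fields reflects the point in the paragraph above: rerunning code, catching mistakes, and probing robustness are independent checks, and a run can pass one while failing another.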

The newest experimental results are encouraging. Across seven new test sets, the Engine earned Gold on every run. Four test sets achieved perfect scores, and the remaining three all scored at least 95 percent. Average performance across the seven runs was 98.2 percent, with completion times between 1.9 and 2.8 minutes. Taken together, these results suggest that the Engine is moving from an impressive one-off replication toward a system that performs consistently across a broader benchmark.

Newest results

Test set   Grade   Score    Matches   Time
10001      Gold    97.3%    126/130   2.8 min
10010      Gold    100.0%   60/60     2.1 min
10011      Gold    95.0%    66/70     2.0 min
10090      Gold    100.0%   188/188   2.8 min
10166      Gold    100.0%   45/45     2.6 min
10167      Gold    95.1%    58/61     1.9 min
10177      Gold    100.0%   128/128   1.9 min

These seven runs all landed in the Gold tier, with four perfect matches and the slowest run still finishing in under three minutes.
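As a quick sanity check, the aggregate numbers quoted earlier can be recomputed directly from the table. The sketch below uses only the published figures; the scores and times are copied verbatim from the results table.

```python
# Scores (%) and completion times (min) copied from the results table above.
runs = {
    "10001": (97.3, 2.8),
    "10010": (100.0, 2.1),
    "10011": (95.0, 2.0),
    "10090": (100.0, 2.8),
    "10166": (100.0, 2.6),
    "10167": (95.1, 1.9),
    "10177": (100.0, 1.9),
}

scores = [score for score, _ in runs.values()]
times = [time for _, time in runs.values()]

print(f"Average score: {sum(scores) / len(scores):.1f}%")                           # 98.2%
print(f"Perfect runs:  {sum(score == 100.0 for score in scores)} of {len(scores)}")  # 4 of 7
print(f"Time range:    {min(times)}-{max(times)} min")                               # 1.9-2.8 min
```

Running this reproduces the 98.2 percent average, the four perfect scores, and the 1.9 to 2.8 minute range reported above.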

Project timeline

• November 25, 2025: First blog post published, introducing the AI Replication Engine and its three-agent architecture.

• February 11, 2026: Second blog post published, reporting the first benchmark results and outlining the scaling plan.

• April-May 2026: Large-scale experiments underway.

• Mid-May 2026: EMNLP paper submission.

• June 1, 2026: Beta version released on the website.

This is the transition point for the project. The first phase was about defining the architecture. The second was about showing that the Engine could deliver strong replication performance on a real task. The current phase is about systematic evaluation and scaling: understanding where the system already works reliably, where it still fails, and what is needed to make it genuinely useful for journals, researchers, and research organizations.

That is why this stage is less about a single demo and more about infrastructure. If the Engine continues to perform at this level across a larger and more varied benchmark, it will become much easier to imagine a practical verification workflow that combines fast automated checks with targeted human oversight. The next few months should tell us how close that vision really is.
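To make that vision slightly more concrete, here is a deliberately simple triage sketch of how automated scores could route work to human reviewers. The 95 percent cutoff and the routing labels are assumptions for illustration only, not thresholds the Engine actually uses.

```python
def triage(score: float, gold_cutoff: float = 95.0) -> str:
    """Hypothetical routing rule: automated checks run first, and human
    attention goes only where the automated score leaves doubt."""
    if score == 100.0:
        return "auto-accept"         # every output matched; spot-check occasionally
    if score >= gold_cutoff:
        return "light human review"  # Gold tier but imperfect; inspect the mismatches
    return "full human review"       # below the tier threshold; treat as unverified

# Example: the three imperfect Gold runs from the table would all get a light review.
for score in (97.3, 95.0, 95.1):
    print(score, "->", triage(score))
```

The point of the sketch is the division of labor: the Engine handles the fast, exhaustive comparison, and human reviewers focus on the small set of mismatches it surfaces.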