In our first post, we introduced the AI Replication Engine: an autonomous system built at the Institute for Replication (I4R) that uses specialized AI agents to reproduce, verify, and stress-test empirical research. Since then, we’ve moved from architecture to action. Today we’re sharing results from our first systematic model comparison and announcing that the project has been submitted for major research funding to scale over the next two years.
We recently benchmarked three open-source models on a replication of the paper “Racial Flux and Voting Behavior”, tasking each with autonomously executing R code and comparing outputs against 16 published metrics (coefficients, R², N values, and standard errors). The results were striking. glm-4.7-flash (quantized to 8-bit) achieved a perfect 16/16 match in just 5 minutes. qwen3-coder:30b hit the 75-iteration limit after completing only 1 comparison. And qwen3-next:80b, the largest model at 80B parameters, failed entirely: it could not resolve basic file paths and consumed all of its iterations debugging without producing a single result. The takeaway is clear: for replication tasks, disciplined workflow management and precise instruction-following matter far more than raw model size.
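To make the scoring concrete, here is a minimal sketch of how a reproduced-versus-published metric comparison might work. This is an illustration only, not the Engine's actual harness: the function name, tolerance, and all metric names and values below are hypothetical assumptions.

```python
import math

def compare_metrics(published, reproduced, rel_tol=1e-3):
    """Compare reproduced metrics against published ones within a
    relative tolerance; return per-metric flags and a 'k/n' score.
    (Hypothetical sketch; tolerance choice is an assumption.)"""
    results = {}
    for name, pub_value in published.items():
        rep_value = reproduced.get(name)
        # A metric matches only if it was produced at all and is
        # numerically close to the published value.
        results[name] = (
            rep_value is not None
            and math.isclose(pub_value, rep_value, rel_tol=rel_tol)
        )
    score = f"{sum(results.values())}/{len(results)}"
    return results, score

# Hypothetical example values, not taken from the paper:
published = {"coef_flux": 0.142, "r_squared": 0.31, "n_obs": 2048}
reproduced = {"coef_flux": 0.142, "r_squared": 0.31, "n_obs": 2048}
flags, score = compare_metrics(published, reproduced)
print(score)  # → 3/3
```

A published-minus-missing metric counts as a failure rather than an error, which lets a run like the 1-of-16 result above still receive a partial score.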
To take the Engine from proof-of-concept to production, we’ve submitted an application for a SSHRC Insight Development Grant requesting $95,000 over two years (August 2026 – July 2028). The grant will fund three things. First, systematic evaluation against human-verified benchmarks from the full I4R Games dataset (250+ papers) and approximately 430 World Bank policy research working papers. Second, multi-model comparison across several different model families using zero-shot, few-shot, and fine-tuning approaches. Third, an open-source release of all code, trained weights, and benchmark datasets, so that journals, funding agencies, and research offices can deploy verification tools independently.
We’re now expanding our test suite across more papers, more models, and increasingly complex tasks, moving beyond computational reproducibility into coding error detection and robustness assessment through our three-agent architecture. Preliminary results will be submitted to NeurIPS 2026, and we plan to release the first version of the open-source toolkit later this year. If you’re interested in contributing replication data, testing the Engine, or collaborating on evaluation, we’d love to hear from you!
