Funding
16 October 2024
We are grateful for the funding provided by Open Philanthropy for our new project! Here is our proposal.
We will conduct a test of LLMs’ ability to determine whether claims about social and behavioural science phenomena are true or false. We operationalize this question by investigating whether LLMs can assess a scientific paper and determine whether its primary findings reproduce successfully. This operationalization of a fundamental capability has numerous virtues:
● It is hard, even for humans
● It is possible, even for humans
● It is a real-world task for scientists and policymakers
● It is feasible to obtain ground truth evidence for accuracy
● It is possible to investigate systematic bias in false positives and false negatives
● LLM performance can be compared to the performance of human-only and human-machine teams
The Institute for Replication (I4R) has existing research, infrastructure, workflows, and financial support that this effort could leverage directly to accelerate progress on evaluating LLM capabilities on this real-world task. I4R is currently reproducing and replicating 250 studies per year published in leading social science journals, thanks to financial support from Open Philanthropy. I4R organizes up to two Replication Games (RGs) per month all over the world. These events are modeled after hackathons and are open to faculty, post-docs, graduate students, and other researchers. Participants join a small team and are asked to computationally reproduce the results of, and detect coding errors in, a published paper or study in their field of interest. During previous RGs, many coding errors have been uncovered, some of which drastically change the conclusions of the original studies. See, for instance, this report written by a team that participated in an RG: https://econpapers.repec.org/paper/zbwi4rdps/20.htm. As of now, LLMs have not been used by any of our participants.
As a pilot, the next Replication Games in Toronto (February 2024) will randomly assign teams to one of three treatment arms: human-only (who do not use LLMs), human-machine (who may use LLMs), and machine with restricted human input (who must use LLMs exclusively). Participants in the human-machine and machine with restricted human input arms will receive a short LLM training offered by I4R collaborators and the University of Toronto. Teams will be asked to detect coding errors and discrepancies between the code and the article, computationally reproduce the results both with the same software as the authors and in another language, and conduct robustness checks. The three arms will be assigned the same study, for which we know there are major coding errors. These errors will have been identified in regular (human-only) RGs and will not be publicly disclosed until after the event. We are seeking support to run more RGs following this model.
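To make the design concrete, here is a minimal sketch of how teams could be randomly assigned to the three arms. It is purely illustrative: the use of Python, the team labels, the fixed seed, and the round-robin balancing scheme are our assumptions for exposition, not I4R’s actual assignment procedure.

```python
import random

# Illustrative sketch only; the real assignment procedure may differ.
ARMS = ["human-only", "human-machine", "machine-restricted-human"]

def assign_teams(team_ids, seed=42):
    """Randomly assign teams to the three treatment arms in roughly equal numbers."""
    rng = random.Random(seed)      # fixed seed so the assignment can be reproduced
    shuffled = list(team_ids)
    rng.shuffle(shuffled)
    # Deal shuffled teams round-robin across arms to keep arm sizes balanced.
    return {team: ARMS[i % len(ARMS)] for i, team in enumerate(shuffled)}

if __name__ == "__main__":
    teams = [f"team_{i:02d}" for i in range(1, 10)]   # hypothetical team labels
    for team, arm in sorted(assign_teams(teams).items()):
        print(team, "->", arm)
```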
Reproduction success is one of the most readily available and empirically grounded ways to establish ground truth. We believe that assessing reproducibility offers an ideal context to advance the assessment of LLMs’ sensitivity to the credibility of claims about social and behavioural science phenomena. Moreover, our proposal offers the possibility of testing LLM performance at different levels of task complexity: reproducibility is, in principle, determinable directly by evaluating the paper and the underlying data; robustness is determinable indirectly by evaluating the paper and reusing the underlying data with reasonable alterations to the analysis pipeline; and detecting (known) coding errors is determinable directly from the code without knowledge of the study. LLMs may perform perfectly on easy reproduction assessments (for example, detecting minor errors in original papers) and only moderately on robustness assessments and complex coding errors. Variation in performance across these indicators will reveal which aspects of the overall credibility assessment process are amenable to automation.
With Open Philanthropy support we will:
● Organize 7 (additional) Replication Games in which we will test the performance of human-machine and machine-only teams, including at Cambridge University, the University of Ottawa, the University of Nottingham, and the University of Toronto. We will co-locate two of the RGs with large conferences such as the European Economic Association’s annual conference in Rotterdam (August 2024). No funding is requested for running the RG in Toronto described above, which will serve as a pilot for the games we are requesting funding for.
● Recruit a total of about 450 participants. Recruitment will be conducted through social media and emails to our network, mainly through the Institute for Replication’s board of editors. The board of editors covers all subfields of behavioral science, economics, finance, and political science. The board is selected with DEI considerations in mind and has varied institutional ties, allowing us to cast a very wide net.
● Write a paper summarizing our results to be published in a leading outlet such as Nature, Science, or PNAS. We will publicly release a preprint as soon as our draft is ready (by the end of the grant period), and all RG participants will be offered co-authorship.
We already have 22 human-only RGs planned for 2024, thanks to the support of OP, Sloan, and other funders, so we will have plenty of papers with both minor and major coding errors to choose from in order to ground-truth the performance of the human-machine and machine with restricted human input teams.
Policymakers and private organisations often leverage social and behavioural science research to design plans, guide investments, assess outcomes, and build models of human social systems and behaviours. However, recent results have revealed that the evidence base in the social and behavioural sciences contains many claims that fail to replicate and reproduce, which could have important real-world implications. If successful, our project would offer end-users of research a rapid and scalable way of performing basic quality control before putting research into practice. It would also accelerate science production by providing a stronger evidence base to build on.
Duration of the project: 15 months. This project would start in April 2024 and end in June 2025.