
A Researcher’s Guide to Replication Packages: Episode 3

Categories: Publishing

Episode 3: The Return of the Code

Hooray, your paper has been accepted! As you are putting together the replication package for the journal, you ask your research assistant to re-run everything, just in case. You carefully inspect the tables and figures they sent you, and yep, there it is: Figure 2 does not match. How is this possible?

Two calls, three coffees, and one entire chocolate bar later, you have the answer. It actually matters whether your code is executed in one go by your master file, or whether you run each script by hand, one at a time. Welcome to your Defence Against the Dark Arts class: badly structured replication code.

Your overconfidence is your weakness

Never assume that it’s obvious how to run your code. It might be obvious to you right now because you have spent the last few weeks working on the final journal resubmission, but remember that the majority of people who will try to run your code will be doing so for the first time in their life (data editor, replicators, PhD students learning from your work, …).

Ideally, you would make running your code simple: provide one master file (“one file to rule them all!”), so that after adjusting a single path, a single press of a button reproduces all of your results. Data cleaning, merging, re-shaping, analysis, tables, figures – all done.
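For illustration, here is a minimal sketch of such a master file in Python (your actual scripts might live in Stata, R, or MATLAB instead); the script names echo the numbering convention discussed in this post and are entirely made up.

```python
# A hypothetical master file, "0_master.py": replicators edit ROOT once,
# then every step runs in order with a single command.
import subprocess
import sys
from pathlib import Path

ROOT = Path("/path/to/replication_package")  # the ONE line replicators adjust

STEPS = [
    "0_preprocessing.py",
    "1_cleaning.py",
    "2_analysis_tables.py",
    "3_analysis_figures.py",
]

def run_all(root, steps=STEPS):
    """Run each numbered script in order; stop at the first failure."""
    completed = []
    for script in steps:
        print(f"Running {script} ...")
        result = subprocess.run([sys.executable, str(Path(root) / script)])
        if result.returncode != 0:
            raise SystemExit(f"{script} failed; later steps were not run.")
        completed.append(script)
    return completed

# To reproduce everything, a replicator would simply run:
# run_all(ROOT)
```

Stopping at the first failure matters: a half-finished run that silently continues is exactly how mismatched Figures 2 happen.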

Of course, reality is rarely ideal: you might be working with restricted datasets (and thus cannot share parts of the cleaning or analysis), you might be relying on different software environments (and thus need multiple “master” files), or certain procedures take too long to run as part of a regular reproducibility check (think of “too long” as “longer than overnight”). Don’t stress – but do make sure you explain everything in your Readme file. Bonus points if you number your files in the order they should be handled (e.g., “0_preprocessing.R”, “1_cleaning.R”, “2_analysis_tables.R”, “3_analysis_figures.R”). Here are some clear examples of what you might have to do:

“Part A of the data cleaning and analysis requires <restricted access dataset> from <organization>. This part of the code is commented out, and Tables X and Y cannot be reproduced. Replicators with access to this data should name the dataset partA.dta, store it in the Raw folder, and execute Part A of the code.”

“Our code for part B first cleans the data in Stata. The master file is called “1_cleaning.do” and should be run first. Subsequently, our data is imported into MATLAB and analysed using master file “2_analysis.m”. All auxiliary files are called directly by the master files and do not need any adjustments; both master files require working directory adjustments on lines 2 and 4 respectively.”

“The master file “0_master.do” runs all data cleaning and analyses. However, part C of the data cleaning takes approximately 28 hours on a standard laptop, so for convenience we provide a post-cleaning part C dataset in the Intermediate folder. Replicators using this file can simply comment out lines 2–132 and start with part D of the data analysis on line 133.”

[Image caption: a replicator who just realized the main analysis is supposed to run for 26 days (true story!)]

Data Cleaning: A Few Tips

Let’s stop here for a second. Data cleaning can have a huge influence on your results, so it deserves to be transparent and well-explained.

  • Do keep your data cleaning separate from your analysis (to the extent possible) because that makes it much easier to follow.
  • Do comment on what each step is doing and, ideally, why (e.g., “as pre-registered, we exclude all participants who currently do not have a job”, or “for each student, we merge in their exam scores from years 2022 and 2024”). When something unexpected happens, you will be glad to know why you merged 1:m instead of m:1.
  • And finally, do check that the code does what you think it does (e.g., verify that your re-shaping works as intended on a small sample that you can check manually).
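As a toy illustration of that last tip, here is a hand-checkable reshape in plain Python; the column names (student, year, score) and values are invented for the example.

```python
# A minimal sketch of "test your reshape on a tiny sample": reshape
# long-format records to wide, then verify the result by hand.

def long_to_wide(rows, id_key, col_key, val_key):
    """Reshape long-format records into one wide record per id."""
    wide = {}
    for row in rows:
        rec = wide.setdefault(row[id_key], {id_key: row[id_key]})
        rec[f"{val_key}_{row[col_key]}"] = row[val_key]
    return list(wide.values())

# A sample small enough to verify manually:
long_rows = [
    {"student": "A", "year": 2022, "score": 71},
    {"student": "A", "year": 2024, "score": 78},
    {"student": "B", "year": 2022, "score": 64},
]

wide_rows = long_to_wide(long_rows, "student", "year", "score")
# Student A should end up with both years; student B with 2022 only.
assert wide_rows[0] == {"student": "A", "score_2022": 71, "score_2024": 78}
assert wide_rows[1] == {"student": "B", "score_2022": 64}
```

The same idea applies in Stata or R: run the reshape on three rows you can eyeball, before trusting it on thirty thousand.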

Analysis Code: More Tips

Once your data are clean, the next challenge is making your analysis scripts easy to navigate. And just like with cleaning, clarity is king.

  • Do link your code directly to your paper. For example, use comments such as “Table 4, columns 1–5” or “footnote 7 mentioning that 91% of subjects understood the instructions”. This saves a lot of time for anyone trying to find a specific piece of code related to your results.
  • Do explain what “unused” lines of code do. For example, a referee might have asked you to check whether your main result is robust to the inclusion of a specific control variable. You re-ran the analysis, the results stayed the same, and the editor and referee agreed that you do not have to show or discuss this analysis in the manuscript. You can leave this code commented out, with a note such as: “A referee requested a re-analysis with <this control variable> included; our results are robust to this adjustment. See lines 334 to 337 (commented out).”
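Put together, a paper-linked script might look something like this sketch (Python here; the variable names, numbers, and the robustness note are all invented for illustration):

```python
# Hypothetical analysis snippet showing paper-linked comments.

# --- Footnote: share of subjects who understood the instructions ---
understood = [True, True, False, True, True, True, True, True, True, True]
share_understood = sum(understood) / len(understood)

# --- Referee-requested robustness check (not reported in the paper);
#     results were unchanged with the extra control included, so the
#     code stays commented out here for the record. ---
# model = run_regression(outcome, controls + ["extra_control"])
```

A replicator hunting for one specific number in the paper can now find it with a text search instead of reading the whole script.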

Avoiding Oopsies: A Final Tip

It is okay (and often helpful) to leave notes for yourself in the code, such as “TODO: verify whether this clustering of standard errors is correct”. Just make sure you search for all your to-do notes before you submit your package; after all, no data editor wants to see code that is yelling at itself.
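One way to run that final sweep is a small script. This Python sketch walks the package and reports every leftover note; the marker words and file extensions are assumptions, so adjust them to your own conventions.

```python
# A pre-submission sweep for leftover notes-to-self in a replication package.
from pathlib import Path

MARKERS = ("TODO", "FIXME", "XXX")

def find_todos(root, patterns=("*.py", "*.do", "*.R", "*.m")):
    """Return (file, line number, text) for every leftover marker."""
    hits = []
    for pattern in patterns:
        for path in Path(root).rglob(pattern):
            lines = path.read_text(errors="ignore").splitlines()
            for lineno, line in enumerate(lines, 1):
                if any(marker in line for marker in MARKERS):
                    hits.append((str(path), lineno, line.strip()))
    return hits
```

Run it on your package folder just before zipping; an empty result means no code yelling at itself.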

Smooth Operation

In the end, your replication code is not just a record of what you did; it’s a conversation with your future self (and anyone else re-running the code to stand on the shoulders of a giant – yes, you!). You would like that conversation to be smooth, and well-structured code might just be the best opening line you can wish for.