
Don’t Panic: A Researcher’s Guide to Replication Packages


Episode 1: A New README

Who Needs a README?

Imagine coming across a really cool paper in a top journal. The kind of paper you wish you had written yourself; the kind of paper that gives you seven ideas for how to build on it and entirely change our understanding of the topic. Excited, you download the replication package, eager to learn how to do something like this yourself.

You unzip the folder, and find yourself in a trash compactor.[1] Twenty different folders with cryptic names like “Vr_d_preliminary4” and “Xkj_bbr9”, each containing several datasets, code files, and PDFs. After an hour of digging, you manage to find what seems to be the master code file: “master_v2_corrected_final_final.do”. Great. Three hours later, after fixing several broken paths, downloading five custom functions, and discovering code comments that make you a little uneasy (“fix this later when we have the latest file”, “who did this?”, “why do you replace the zeros?”), you finally press Run. And the code runs. For hours and hours, until it breaks again, because it requires an input file that is not contained in the folder. You pour yourself a drink.

You try to summon help. “README, you are my only hope.” Alas, no README today. Let’s fight this common sight, and let me explain why.[2]

Contents of a README

A good README file is a bit like an experienced tour guide. It takes you where you want to go, mentions all important things that you will encounter on the way, and gives you advance notice if you need to arrange something before you get going. Let’s break this down:

Step 1: What is in the Package

A README should explain what files are contained in the replication repository. Not just provide a list, mind you, but actually explain what each file does. A bad README will thus say “All files necessary for replication are contained in this folder.” (Well, thanks, I expected as much. But what is the difference between “data_final_time.dta” and “data_final_Alice.dta”? Which one should I open first? And who is Alice?)

A good README might instead say: “This package is organized as follows: Study protocols, IRB approvals, and survey instruments are in folder Documents; raw data as downloaded from survey software is in folder Raw; clean data is in folder Input. All code files (merging of raw files, cleaning, and analysis) are in folder Code, and are numbered in the order they should be run.”
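If the package is large, a small script can draft this overview for you. The sketch below is only an illustration in Python: the function name describe_package and the FOLDER_NOTES dictionary are hypothetical, and the folder descriptions are simply copied from the example README text above. It lists every top-level folder, counts its files, and flags folders you have not described yet.

```python
from pathlib import Path

# Hypothetical notes, copied from the example README text above.
FOLDER_NOTES = {
    "Documents": "study protocols, IRB approvals, survey instruments",
    "Raw": "raw data as downloaded from survey software",
    "Input": "clean data",
    "Code": "merging, cleaning, and analysis code, numbered in run order",
}


def describe_package(root, notes=FOLDER_NOTES):
    """Return one README line per top-level folder: name, file count, note."""
    lines = []
    folders = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for folder in folders:
        # Count files in the folder and all of its subfolders.
        n_files = sum(1 for f in folder.rglob("*") if f.is_file())
        # Folders without a note are flagged so they are not forgotten.
        note = notes.get(folder.name, "TODO: describe this folder")
        lines.append(f"{folder.name}/ [{n_files} file(s)]: {note}")
    return lines


if __name__ == "__main__":
    # Run from the root of the replication package.
    print("\n".join(describe_package(".")))
```

The output is a starting point, not a finished README: the TODO lines are exactly the folders (hello, “Xkj_bbr9”) that a replicator would otherwise have to guess about.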

Step 2: How to reproduce results

Never assume that re-running your code is trivial. (Try doing it yourself for your own paper from ten years ago, I dare you.)

Thus, think of the README as a step-by-step recipe.

  • What software (incl. version) is needed to open and run your code?
  • How many and what custom adjustments are needed (incl. path changes), and for which files?
  • In what order should the code files be run, and do they have to be run all at once, or is it possible to re-run only a part of the analysis?
  • How long does it take to run your code (and on what type of machine)?
  • Are there any known software issues or requirements others should know about?
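Some of these answers can be collected automatically rather than typed from memory. The following Python sketch makes two assumptions: that code files start with a number (as in the example README above), and that the hypothetical helpers run_order and environment_lines are names you are free to change. It lists the files in numeric order, so that file 10 comes after file 2, and records the machine on which you measured the run time.

```python
import platform
import re
from pathlib import Path


def run_order(code_dir):
    """List code files by their leading number, so 10_x comes after 2_x."""
    numbered = []
    for path in Path(code_dir).iterdir():
        match = re.match(r"(\d+)", path.name)
        # Files without a leading number are skipped (assumed to be helpers).
        if path.is_file() and match:
            numbered.append((int(match.group(1)), path.name))
    return [name for _, name in sorted(numbered)]


def environment_lines():
    """Describe the interpreter and machine the timings were measured on."""
    return [
        f"Python {platform.python_version()}",
        f"OS: {platform.system()} {platform.release()}",
        f"Machine: {platform.machine()}",
    ]


if __name__ == "__main__":
    # "Code" is the folder name used in the example README; adjust as needed.
    print("Run the files in this order:")
    for name in run_order("Code"):
        print(f"  {name}")
    print("Timings were measured on:")
    for line in environment_lines():
        print(f"  {line}")
```

Note that this only reports the Python version; if your analysis runs in Stata or R, the version of that software still has to be written into the README by hand.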

Step 3: What is the data we are dealing with

If you are able to make your data publicly available, you should note where the data comes from, whether it is raw, pre-processed, or cleaned, and provide a data dictionary.
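A data dictionary does not have to be written from scratch. As a minimal sketch, assuming your data can be exported to CSV, the Python snippet below builds a skeleton with one entry per variable; the function name dictionary_skeleton and the sample columns student_id and gpa are invented for illustration. The descriptions are deliberately left as TODO, because only you know what the variables mean.

```python
import csv
import io


def dictionary_skeleton(csv_text):
    """Return one entry per variable: name, non-missing count, example value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    skeleton = []
    for name in reader.fieldnames or []:
        # Treat empty cells (and cells missing from short rows) as missing.
        values = [row[name] for row in rows if row[name] not in ("", None)]
        skeleton.append({
            "variable": name,
            "non_missing": len(values),
            "example": values[0] if values else "",
            "description": "TODO",  # to be filled in by the author
        })
    return skeleton


if __name__ == "__main__":
    # Invented sample data; replace with the contents of your own file.
    sample = "student_id,gpa\n1,3.5\n2,\n"
    for entry in dictionary_skeleton(sample):
        print(entry)
```

Be careful with the example column if your data is sensitive: an example value from a de-identified file is fine, an example value from the original records is not.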

In cases when you cannot publicly share your data, you should nevertheless describe where the data comes from and how others may access it. In the simplest case, data can be downloaded directly from a specific website; in more complicated cases, formal data requests and IRB approvals will be necessary. It is usually more “robust” to provide information about standardized data requests at relevant institutions rather than provide personal emails of data stewards who might switch jobs or retire (e.g., bobrogers@universityU.edu), or hyperlinks to forms that might become out-of-date.

Finally, in cases when you can share only a portion of your data (e.g., for privacy reasons), it is important to explain which data is withheld and whether and under what circumstances it could be accessed. An example explanation might read: “For privacy reasons, we provide de-identified student records from university U; original files with student numbers and demographics may be applied for via university U’s data archive (reference code 3456K) and are subject to IRB approval. For more information and the most recent version of the data application form, you can contact university U’s research office at research@universityU.edu.”

Step 4: Output

Describe what output (e.g., figures, tables) your code produces, and whether any additional setup is needed (e.g., whether replicators should create an empty folder called Results, and if so, where). In cases when many files are produced, consider creating a mapping between outputs and results in your paper for easier navigation (e.g., “Figure 1 in the paper consists of four panels; these are created by code file analysis_part1.do, and are called Fig1a, Fig1b, Fig1c, and Fig1d. The appendix version of the figure, which excludes incomplete survey responses, is created by analysis_appendix.do, and the relevant files are called Fig1aApp, Fig1bApp, Fig1cApp, and Fig1dApp.”).
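Such a mapping can also double as a check that a run produced everything it should. The Python sketch below reuses the file names from the example above; everything else is an assumption made for illustration, in particular the names OUTPUT_MAP and missing_outputs, the Results folder as the output location, and the .png file extension.

```python
from pathlib import Path

# Mapping from exhibits in the paper to the code file and output files,
# using the names from the example above.
OUTPUT_MAP = {
    "Figure 1": {
        "code": "analysis_part1.do",
        "files": ["Fig1a", "Fig1b", "Fig1c", "Fig1d"],
    },
    "Figure 1 (appendix)": {
        "code": "analysis_appendix.do",
        "files": ["Fig1aApp", "Fig1bApp", "Fig1cApp", "Fig1dApp"],
    },
}


def missing_outputs(results_dir, output_map=OUTPUT_MAP, suffix=".png"):
    """Return {exhibit: [missing file names]} for outputs not found on disk.

    The ".png" suffix is an assumption; use whatever your code exports.
    """
    results = Path(results_dir)
    missing = {}
    for exhibit, entry in output_map.items():
        absent = [
            name for name in entry["files"]
            if not (results / f"{name}{suffix}").exists()
        ]
        if absent:
            missing[exhibit] = absent
    return missing


if __name__ == "__main__":
    for exhibit, names in missing_outputs("Results").items():
        code_file = OUTPUT_MAP[exhibit]["code"]
        print(f"{exhibit}: missing {names}; re-run {code_file}")
```

An empty result means every file listed in the mapping exists; it says nothing about whether the figures match the published ones, which still requires a human eye.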

Step 5: Anything else we need to know?

The journal that you are publishing in might ask you to include additional information; make sure you follow those guidelines. Some nice additions include:

  • A short summary of your research paper: What are you using the data for, and what is your central result? How should we cite your paper?
  • How should we cite this dataset/replication package?
  • In case your data cannot be shared publicly, are you available to help with replication efforts?
  • A codebook. Very useful if your variable labels are not detailed enough or are incomplete.
  • Other underlying materials that help others understand the data or the study, or learn from your example: these may include randomization code, original survey forms, informed consent forms, recruitment protocols, de-briefing scripts, experiment logs detailing technical problems or deviations from standard procedures (e.g., during a power outage), etc.

Final Thoughts

A good README does more than just guide replicators (or yourself when doing another revision round). It communicates professionalism, transparency, and care for scientific integrity. It’s a small investment that pays off – not just in credibility, but also in impact.

In the next episode, “The data strikes back”, we will have a look at organizing and describing your data. Stay tuned!


[1] Pardon the occasional Star Wars reference. You have been warned.

[2] That’s a Herman’s Hermits reference.