A Researcher’s Guide to Replication Packages: Episode 2

Categories: Publishing

Data Strikes Back

The Importance of Being Earnest

It’s Friday afternoon and your PhD student has just sent you an excited email: they have finished the first round of data collection on the other side of the world. They have sent you a link to a (secure!) data repository so that you can have a look.

You open the link to discover an Excel sheet, “raw_data_wave1.xls”. As you scroll down, you notice multiple blanks for questions subjects have refused to answer. “That’s odd,” you think, since you remember quite clearly programming the software to code missing values as -999. Then you notice a couple of subjects giving catastrophic self-evaluations of happiness and well-being. Are these people from the region that was recently flooded? That would make a lot of sense. As you go to check their location, you see that a string variable recording their city was replaced by a numeric code. When was this coded? Was a different version of the survey implemented? You are starting to panic.

Upon seeing a redacted string entry that would have presumably identified the respondent, you realize what is going on: Your PhD student already pre-processed the data, removing all personally identifiable information, recoding missing values, and encoding strings. That is why this file does not correspond to your survey anymore. Only problem: The actual raw data is nowhere to be found, and neither is the processing code.

This is not an uncommon scene in research. Somewhere between enthusiasm and tidiness, the link between raw data and results can quietly vanish, making reproduction of results impossible. Well, it’s time to have “the talk” about responsible data processing.

Three Stages of Data

Broadly, data come in three stages: Raw, intermediate/processed, and final/clean.

We typically understand “raw” data as data directly from a source before any processing took place. For example, if you run a survey using an online platform, the downloaded .csv or .xls file would be considered raw. If you download country statistics from the OECD website, that file would be considered raw.

Understandably, some raw files cannot be shared publicly in replication packages because they contain sensitive and identifiable information about people (such as medical histories, exact locations including IP addresses, contact information, …). To include a subset of these files in your replication repository, it is necessary to remove or redact the relevant variables. Note, however, that sometimes part of the code that does this cleaning cannot be shared either, because it mentions the very information that should be redacted or dropped.

A good practice is thus to keep the original raw file securely stored together with any data-cleaning code that cannot be made public (e.g., in your university’s recommended repository), and to use a pre-processed file, one that only drops the necessary variables and is otherwise unmodified, as the starting point for all other data cleaning and analysis. This file is then an “intermediate” file: It has been modified, but it is (typically) not the final dataset used to arrive at results.
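The document's own examples use Stata `.do` files; as a language-agnostic sketch, here is what that first pre-processing step could look like in Python using only the standard library. The PII column names are hypothetical; the key property is that the script *only* drops variables and changes nothing else.

```python
import csv

# Hypothetical PII column names; adjust to your survey's actual export.
PII_COLUMNS = {"name", "phone_number", "ip_address", "bank_account"}

def drop_pii(raw_path, out_path):
    """Create the intermediate file: drop PII columns, leave all other
    rows and values exactly as they appear in the raw export."""
    with open(raw_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        kept = [c for c in reader.fieldnames if c not in PII_COLUMNS]
        writer = csv.DictWriter(dst, fieldnames=kept, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            writer.writerow(row)
```

Because the script never mentions any specific redacted value, only variable names, it can itself live in the public “Code” folder.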

Of course, when dealing with many different data sources, it is common to have multiple stages of intermediate files as we merge, reshape, and clean our data. While it is often helpful to store these in-between steps (especially if some data processing is very computationally time-consuming), it is strictly speaking not necessary as long as the original files and the entire corresponding processing code are available, since all in-between steps can be easily re-created.

That leads us to the final stage: The “clean”, ready-to-use dataset that is needed to run the analysis. It is generally a good idea to provide this final dataset in the replication package so that researchers interested in reproducing the main result can easily do so (again, especially when the initial data processing takes a lot of time).

Example

Let’s go back to our PhD student. Ideally, the student would have done the following:

  1. Stored the original, “as downloaded”, dataset in a folder called “Raw”. Made a note that this folder should not be shared publicly because it contains some personally identifiable information.
  2. Wrote code that removes the personally identifiable information (names, exact locations, bank account numbers, government ID numbers, phone numbers, etc.). If the code can be written such that it does not reveal this information (e.g., it only drops variables without redacting specific passages of text), it can be shared publicly and stored in a folder “Code”. A good name for this script could be “0_initial_PII_removal.do”.
  3. Saved a pre-processed dataset in a folder called “Intermediate”, and made a note that this dataset is a starting point for further processing. A good name for this dataset could be “0_intermediate_noPII.dta”.
  4. Continued processing the dataset, e.g., replacing missing values of “-999” with actual missings and storing the code (e.g., “1_data_cleaning.do”) and resulting dataset (e.g., “1_intermediate_preliminary_cleaning.dta”) in the “Code” and “Intermediate” folders.
  5. Noted everything in a (preliminary) Readme file to make the structure obvious to themselves, all co-authors, and (later) the data editor.
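The cleaning in step 4 can also be sketched in a few lines. This is a minimal illustration, not the student's actual `.do` file: it assumes a CSV intermediate file and assumes the survey software wrote “-999” for refused answers, replacing that sentinel with an empty cell so downstream tools read it as missing.

```python
import csv

MISSING_CODE = "-999"  # assumed sentinel for refused answers

def recode_missing(in_path, out_path):
    """Cleaning step 1: replace the -999 sentinel with an empty cell,
    leaving every other value untouched."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        for row in reader:
            writer.writerow("" if cell == MISSING_CODE else cell for cell in row)
```

Because each numbered script reads the previous numbered dataset and writes the next one, any step can be re-run in isolation when a mistake is found upstream.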

What are the odds of navigating your replication package? Are they better than 3,720 to 1?

Who Cares?

Why does it matter that we know whether a file is raw or not? Why build an entire structure to follow the dataset over its “lifetime”? Three reasons:

  1. After the initial investment, it makes your (=the author’s) life much easier. When you realize some processing was done incorrectly, it is easy to go back and re-run only the affected files. It is likewise easy to explain to the data editor at a journal what they need to do to reproduce your analysis, which can prevent some unnecessary back-and-forth.
  2. It enhances transparency, which isn’t just a moral stance but a strategic advantage: Others can verify what has been done (thus giving you credibility), build on your work (giving you citations), or refine or correct your results in case of mistakes or new developments (which might be initially unpleasant, but really is a win for science). Finally, journals and funding agencies increasingly expect clear documentation of data provenance, so adopting this structure early saves you a lot of pain later.
  3. Transparency allows you to verify data integrity that you would not have been able to verify otherwise. Think back to our example: Suppose in the initial data cleaning, the PhD student accidentally turned a couple of non-missing values into missing ones. If no record of this cleaning exists, and there is no original raw file to compare to, how would you know that the data file you are working with is incorrect? (Spoilers: you wouldn’t.)
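That last check is mechanical once the raw file exists. A minimal sketch, assuming CSV files and assuming “-999” and empty cells are the only missing-value codes: count missings per column in both files and flag any column where cleaning *increased* the count.

```python
import csv

def missing_counts(path, sentinels=("", "-999")):
    """Count missing entries per column, treating the listed sentinel
    values as missing."""
    counts = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            for col, val in row.items():
                counts[col] = counts.get(col, 0) + (val in sentinels)
    return counts

def check_integrity(raw_path, clean_path):
    """Return the columns where cleaning introduced extra missing values."""
    raw, clean = missing_counts(raw_path), missing_counts(clean_path)
    return [c for c in clean if c in raw and clean[c] > raw[c]]
```

Without the raw file, there is nothing to compare against, and this kind of silent data loss goes undetected.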

I Want that Data, Not Excuses

In the end, good data practices aren’t about perfection; they’re about traceability. When every dataset has a clear origin, when every transformation is scripted and documented, and when sensitive material is handled responsibly, reproduction stops being a burden.

Future you, your co-authors, reviewers, replicators, and that one PhD student trying to build on your work will all appreciate being able to see the full data story, not just the final five minutes.