This is a draft document. Please provide comments through issues in this GitHub project.
The following readings might be useful. The earlier in a project these are consulted, the less pain there will be at the end.
- Project Tier https://www.projecttier.org/
- Wilson G. et al (2016) "Good enough practices for scientific computing" https://arxiv.org/pdf/1609.00037.pdf
- Gentzkow M, Shapiro JM. Code and Data for the Social Sciences: A Practitioner’s Guide; 2014. Available from: http://web.stanford.edu/~gentzkow/research/CodeAndData.pdf.
We strongly suggest adopting the best practices suggested by the literature cited above:
- separation of code and data
- separation of read-only input data and modified "analysis" data
- a clearly defined sequence of processing (possibly through a script)
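These practices map onto a simple directory layout. A sketch in shell (the directory names are illustrative, not prescribed):

```shell
# Separate read-only inputs, generated data, code, and outputs
# (names are made up for illustration):
mkdir -p project/data/raw       # input data: read-only, never modified
mkdir -p project/data/analysis  # cleaned data, regenerated by scripts
mkdir -p project/code           # cleaning and analysis programs
mkdir -p project/output         # tables and figures for the paper
```

Keeping the acquired data in its own read-only directory makes it easy to verify that no cleaning step has silently altered the original files.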
We will refer to a (simplified) data structure as described below. Real-life data structures are often more complex, and the distinctions made in the simplified example should be adapted accordingly.
+------------+                              +---------------+                              +--------------------+
| Input data | --- [Cleaning programs] ---> | Analysis data | --- [Analysis programs] ---> |      Outputs       |
| (acquired) |                              |   (cleaned)   |                              | (Table in article) |
+------------+                              +---------------+                              +--------------------+
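The sequence in this diagram can be encoded in a master script. A minimal, self-contained sketch in shell, where trivial commands stand in for the real cleaning and analysis programs (all file names and data are made up for illustration):

```shell
#!/bin/sh
set -e                                        # abort at the first error
mkdir -p data/raw data/analysis output
# Stand-in for the acquired input data (read-only in a real project)
printf 'id,x\n1,2\n2,\n3,4\n' > data/raw/input.csv
# Stage 1 (cleaning): drop observations with a missing value of x
grep -v ',$' data/raw/input.csv > data/analysis/clean.csv
# Stage 2 (analysis): count the remaining observations into a "table"
tail -n +2 data/analysis/clean.csv | wc -l > output/table1.txt
```

In practice each stage would invoke the real cleaning and analysis programs (Stata do-files, R scripts, etc.), but the structure — one script, fixed sequence, `set -e` so failures are not silently skipped — is the point.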
Regarding the data, enough information should be provided
- to accurately describe the data so that somebody who doesn't have knowledge of the data can understand its principal (and salient) characteristics;
- to be able to acquire the data (whether by download, by contract, by application process, etc.)
- to assure the reader (and the AEA Data Editor) that the data is available for a sufficiently long period of time
- Please describe all data, including input data. At a minimum, all variables that are used in the paper should be well-described (variable/column labels, value labels) through a codebook.
- Where multiple versions of the data exist, describe the version of the data used by the author.
- If possible, provide a Digital Object Identifier.
- This information can be generated by the paper author, or provided by pointing to a URL of the data provider (for instance an online codebook).
- It doesn't need to be complicated, but should be complete:
      // in Stata, the 'codebook' command can generate such information.
      use my_input_data
      describe
      codebook
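For input data that is not in Stata format, even a minimal machine-generated summary is better than nothing. A sketch in shell (the file and variable names are made up for illustration):

```shell
# Stand-in input file with three variables
printf 'id,wage,educ\n1,10.5,12\n2,15.0,16\n' > input.csv
# Variable names, one per line (a rudimentary codebook)
head -1 input.csv | tr ',' '\n' > codebook.txt
# Number of observations (lines minus the header row)
echo "observations: $(($(wc -l < input.csv) - 1))" >> codebook.txt
```

A real codebook would add variable and value labels, but even this much tells a reader what is in the file without opening it.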
- The description of data access should provide enough information for an uninformed user to access the data.
- This can be as simple as a download URL.
- It might be a pointer to a directory in the replication archive.
- This might also be the URL for a description of the application procedure (e.g. NCHS, https://www.cdc.gov/rdc/leftbrch/userestricdt.htm), and an estimate of the monetary and time cost of the application process.
- Data should remain available for a sufficiently long time.
- By depositing in the AEA Data and Code Repository, the data (and code) will remain available indefinitely.
- This is also true if the data is in various other repositories (list of acceptable repositories forthcoming).
- This may also be true if the data cannot be shared (restricted access data). Ask your data provider what the archival duration or retention period is.
- A good minimum benchmark is 10 years, but this may not always be feasible with data the author does not control.
A "data provider" in this sense can be a public repository where the data can be found (ICPSR), a website that provided the data (IPUMS), a statistical agency or private company that granted access to the data (U.S. Census Bureau, Twitter, Acme Inc.). The author may also be the data provider, for instance, because the author conducted the survey used in the article.
We strongly suggest
- a clear documentation of all code (within or through a README)
- provision of a master script ("master do file", "make file") and a description of how to run or invoke the master script
- identification of all pre-requisites (data, code, programs, software, possibly operating system)
- (optional, but useful) an indication of how long potential replicators should expect the programs to run
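A low-cost way to identify prerequisites is to record the environment the code was last run in, and ship that record with the archive. A sketch in shell (the file name is arbitrary, and the commented commands are illustrative, depending on your toolchain):

```shell
# Record operating system and run date alongside the code
uname -srm > environment.txt
printf 'run date: %s\n' "$(date +%Y-%m-%d)" >> environment.txt
# Add one line per tool actually used, e.g. (illustrative, commented out):
#   R --version | head -1 >> environment.txt
#   python3 --version    >> environment.txt 2>&1
```

This does not replace a proper list of package versions, but it anchors the archive to a concrete, verifiable configuration.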
All replication archives should have a README (in PDF, text, or a simple formatting language such as Markdown, like this document). The README should provide a sufficient description to understand the structure of the replication archive (directory structure, what is acquired from third parties, what is generated by scripts, how much output to expect). It should document each file or class of files that are included.
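A README meeting these requirements might be structured as follows (a sketch; the section names are suggestions, not requirements):

```
README (sketch)
1. Overview: what the archive produces (tables, figures) and how.
2. Data availability: source, access conditions, and version/DOI for each dataset.
3. Directory structure: what is acquired from third parties, what is generated by scripts.
4. Requirements: software, packages (with versions), expected runtime.
5. Instructions: how to invoke the master script; description of each file or class of files.
```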
We strongly encourage the provision of a master script. The master script should run all programs necessary to provide the outputs, in the right sequence. In some cases, the master script might also serve as a README (for instance, "README.bash", "README.py", "README.Rmd"), as long as it satisfies all conditions of the README as well (i.e., ample comments).
- We strongly discourage writing comments like
      Run this a first time, generating column 1 of Table 3.
      Then comment out line 55, then run a second time, which should
      give you column 3 of Table 3.
      Then uncomment line 55, change the parameter in line 67 to "5",
      and run again to get column 2 of Table 3.
(this is only slightly paraphrased from an actual example).
- Avoid ambiguous or imprecise instructions like
      Have superDynare available
  or
      Use the outreg55 package
  (no URL or installation command provided).
- Write code that can be run without human intervention.
- Use functions/programs/loops/etc. to iterate through variations of an otherwise identical procedure (but ensure that the purpose of the loop is well described)
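For example, three near-identical estimations can be driven by one documented loop instead of three copies of the code. A sketch in shell (the specification names are made up, and `echo` stands in for the real estimation command):

```shell
# Run the same estimation under three specifications; the loop body is a
# stand-in for the real command, e.g. a do-file taking the name as argument.
for spec in baseline controls full; do
    echo "specification: ${spec}"
done > specs.log
```

One loop with a comment stating what varies is easier to audit than three diverging copies of the same block.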
- Identify all requirements to allow somebody to successfully run the code who has NOT been experimenting with the software and code for the past 5 years on the same laptop. This means
- what packages need to be installed, from where, and which versions
      ssc install outreg55, from(https://myurl/to/o)
  or
      install.packages(c("dplyr","devtools"))
      library(dplyr)
      library(devtools)
      install_github("myrepo/superols")