This is a draft document. Please provide comments through issues in this GitHub project.
The following readings might be useful. The earlier in a project these are consulted, the less pain there will be at the end.
- Project Tier https://www.projecttier.org/
- Wilson G. et al (2016) "Good enough practices for scientific computing" https://arxiv.org/pdf/1609.00037.pdf
- Gentzkow M, Shapiro JM. Code and Data for the Social Sciences: A Practitioner’s Guide; 2014. Available from: http://web.stanford.edu/~gentzkow/research/CodeAndData.pdf.
We strongly suggest adopting the best practices suggested by the literature cited above:
- separation of code and data
- separation of read-only input data and modified "analysis" data
- a clearly defined sequence of processing (possibly through a script)
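These practices map onto a simple directory layout. A sketch in shell (the directory names are illustrative, not prescribed):

```shell
# Separate read-only inputs, generated data, code, and outputs
# (names are made up for illustration):
mkdir -p project/data/raw       # input data: read-only, never modified
mkdir -p project/data/analysis  # cleaned data, regenerated by scripts
mkdir -p project/code           # cleaning and analysis programs
mkdir -p project/output         # tables and figures for the paper
```

Keeping the acquired data in its own read-only directory makes it easy to verify that no cleaning step has silently altered the original files.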
We will refer to a (simplified) data structure as described below. Real-life data structures are often more complex, and the distinctions made in the simplified example should be adapted accordingly.
+------------+                              +---------------+                              +--------------------+
| Input data | --- [Cleaning programs] ---> | Analysis data | --- [Analysis programs] ---> |      Outputs       |
| (acquired) |                              |   (cleaned)   |                              | (Table in article) |
+------------+                              +---------------+                              +--------------------+
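The sequence in this diagram can be encoded in a master script. A minimal, self-contained sketch in shell, where trivial commands stand in for the real cleaning and analysis programs (all file names and data are made up for illustration):

```shell
#!/bin/sh
set -e                                        # abort at the first error
mkdir -p data/raw data/analysis output
# Stand-in for the acquired input data (read-only in a real project)
printf 'id,x\n1,2\n2,\n3,4\n' > data/raw/input.csv
# Stage 1 (cleaning): drop observations with a missing value of x
grep -v ',$' data/raw/input.csv > data/analysis/clean.csv
# Stage 2 (analysis): count the remaining observations into a "table"
tail -n +2 data/analysis/clean.csv | wc -l > output/table1.txt
```

In practice each stage would invoke the real cleaning and analysis programs (Stata do-files, R scripts, etc.), but the structure — one script, fixed sequence, `set -e` so failures are not silently skipped — is the point.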
Regarding the data, enough information should be provided
- to accurately describe the data so that somebody who doesn't have knowledge of the data can understand its principal (and salient) characteristics;
- to be able to acquire the data (whether by download, by contract, by application process, etc.)
- to assure the reader (and the AEA Data Editor) that the data is available for a sufficiently long period of time
- Please describe all data, including input data. At a minimum, all variables that are used in the paper should be well-described (variable/column labels, value labels) through a codebook.
- Where multiple versions of the data exist, describe the version of the data used by the author.
- If possible, provide a Digital Object Identifier.
- This information can be generated by the paper author, or provided by pointing to a URL of the data provider (for instance an online codebook).
- It doesn't need to be complicated, but should be complete:
      // in Stata, the 'codebook' command can generate such information.
      use my_input_data
      describe
      codebook
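For input data that is not in Stata format, even a minimal machine-generated summary is better than nothing. A sketch in shell (the file and variable names are made up for illustration):

```shell
# Stand-in input file with three variables
printf 'id,wage,educ\n1,10.5,12\n2,15.0,16\n' > input.csv
# Variable names, one per line (a rudimentary codebook)
head -1 input.csv | tr ',' '\n' > codebook.txt
# Number of observations (lines minus the header row)
echo "observations: $(($(wc -l < input.csv) - 1))" >> codebook.txt
```

A real codebook would add variable and value labels, but even this much tells a reader what is in the file without opening it.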
- The description of data access should provide enough information for an uninformed user to access the data.
- This can be as simple as a download URL.
- It might be a pointer to a directory in the replication archive.
- This might also be the URL for a description of the application procedure (e.g. NCHS, https://www.cdc.gov/rdc/leftbrch/userestricdt.htm), and an estimate of the monetary and time cost of the application process.
- Data should remain available for a sufficiently long time.
- By depositing in the AEA Data and Code Repository, the data (and code) will remain available indefinitely.
- This is also true if the data is in various other repositories (list of acceptable repositories forthcoming).
- This may also be true if the data cannot be shared (restricted access data). Ask your data provider what the archival duration or retention period is.
- A good minimum benchmark is 10 years, but this may not always be feasible with data the author does not control.
A "data provider" in this sense can be a public repository where the data can be found (ICPSR), a website that provided the data (IPUMS), a statistical agency or private company that granted access to the data (U.S. Census Bureau, Twitter, Acme Inc.). The author may also be the data provider, for instance, because the author conducted the survey used in the article.
We strongly suggest
- a clear documentation of all code (within or through a README)
- provision of a master script ("master do file", "make file") and a description of how to run or invoke the master script
- identification of all pre-requisites (data, code, programs, software, possibly operating system)
- (optional, but useful) an indication of how long potential replicators should expect the programs to run
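A low-cost way to identify prerequisites is to record the environment the code was last run in, and ship that record with the archive. A sketch in shell (the file name is arbitrary, and the commented commands are illustrative, depending on your toolchain):

```shell
# Record operating system and run date alongside the code
uname -srm > environment.txt
printf 'run date: %s\n' "$(date +%Y-%m-%d)" >> environment.txt
# Add one line per tool actually used, e.g. (illustrative, commented out):
#   R --version | head -1 >> environment.txt
#   python3 --version    >> environment.txt 2>&1
```

This does not replace a proper list of package versions, but it anchors the archive to a concrete, verifiable configuration.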
All replication archives should have a README (in PDF, text, or a simple formatting language such as Markdown, like this document). The README should provide a sufficient description to understand the structure of the replication archive (directory structure, what is acquired from third parties, what is generated by scripts, how much output to expect). It should document each file or class of files that are included.
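A README meeting these requirements might be structured as follows (a sketch; the section names are suggestions, not requirements):

```
README (sketch)
1. Overview: what the archive produces (tables, figures) and how.
2. Data availability: source, access conditions, and version/DOI for each dataset.
3. Directory structure: what is acquired from third parties, what is generated by scripts.
4. Requirements: software, packages (with versions), expected runtime.
5. Instructions: how to invoke the master script; description of each file or class of files.
```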
We strongly encourage the provision of a master script. The master script should run all programs necessary to provide the outputs, in the right sequence. In some cases, the master script might also serve as a README (for instance, "README.bash", "README.py", "README.Rmd"), as long as it satisfies all conditions of the README as well (i.e., ample comments).
- We strongly discourage writing comments like
      Run this a first time, generating column 1 of Table 3.
      Then comment out line 55, then run a second time, which should
      give you column 3 of Table 3.
      Then uncomment line 55, change the parameter in line 67 to "5",
      and run again to get column 2 of Table 3.
(this is only slightly paraphrased from an actual example).
- Avoid ambiguous or imprecise instructions like
      Have superDynare available
  or
      Use the outreg55 package
  (no URL or installation command provided).
- Write code that can be run without human intervention.
- Use functions/programs/loops/etc. to iterate through variations of an otherwise identical procedure (but ensure that the purpose of the loop is well described)
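For example, three near-identical estimations can be driven by one documented loop instead of three copies of the code. A sketch in shell (the specification names are made up, and `echo` stands in for the real estimation command):

```shell
# Run the same estimation under three specifications; the loop body is a
# stand-in for the real command, e.g. a do-file taking the name as argument.
for spec in baseline controls full; do
    echo "specification: ${spec}"
done > specs.log
```

One loop with a comment stating what varies is easier to audit than three diverging copies of the same block.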
- Identify all requirements to allow somebody to successfully run the code who has NOT been experimenting with the software and code for the past 5 years on the same laptop. This means
- what packages need to be installed, from where, and which versions
      ssc install outreg55, from(https://myurl/to/o)
  or
      install.packages(c("dplyr","devtools"))
      library(dplyr)
      library(devtools)
      install_github("myrepo/superols")