This is the data used by this project.
Much of this data is referenced in our paper "Using Automated Code Repair to Fight Back the Deluge of False Positives".
See the README.dataset.md file for details about that data.
The dataset will be a Zip file with the following directory contents:
data.publication
README.md
: use data/README.dataset.md
Dockerfile.rosecheckers
: for conveniently constructing a container to use with codebases and tools (NOTE: Do not use clang-tidy from this container; it is an older version than the one we used.)
Dockerfile.redemption
: for conveniently constructing a container to use with clang-tidy version 15
codebases.yml
data
: the redemption/data directory, minus some files noted below
paper/oss_frequency.csv
: table redemption/paper/oss_frequency.csv
paper/tables
: the redemption/paper/tables directory, minus some redemption/paper files noted below
code/analysis
: from the redemption/code/analysis directory, only include the following files: cert_rules.2016.tsv, checkers.csv, my-gcc.sh, my-g++.sh, {clang_tidy,cppcheck,rosecheckers}2tsv.py
LICENSE.txt
: license redemption/License.dataset.txt
ABOUT
: per-file markings redemption/ABOUT.dataset
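The layout above can be sketched as data, which may help when scripting the packaging step. This is a minimal sketch, not the project's actual build script: the names `RENAMED`, `ANALYSIS_FILES`, and `keep_analysis_file` are hypothetical, and all paths are assumed to be relative to the root of a redemption checkout.

```python
from pathlib import PurePosixPath

# Files whose name in the archive differs from their name in the repository
# (archive path -> path relative to the redemption checkout; assumed layout).
RENAMED = {
    "README.md": "data/README.dataset.md",
    "LICENSE.txt": "License.dataset.txt",
    "ABOUT": "ABOUT.dataset",
}

# The only files taken from code/analysis, with the brace pattern
# {clang_tidy,cppcheck,rosecheckers}2tsv.py written out in full.
ANALYSIS_FILES = {
    "cert_rules.2016.tsv",
    "checkers.csv",
    "my-gcc.sh",
    "my-g++.sh",
    "clang_tidy2tsv.py",
    "cppcheck2tsv.py",
    "rosecheckers2tsv.py",
}


def keep_analysis_file(repo_path: str) -> bool:
    """Return True if a path directly under code/analysis belongs in the dataset."""
    p = PurePosixPath(repo_path)
    return p.parent == PurePosixPath("code/analysis") and p.name in ANALYSIS_FILES
```

An allowlist is used for code/analysis because the text names the files to include, rather than the files to leave out.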
These are files and directories not included in the published dataset.
The accolade.zeek4 file contains info relevant to Zeek v4, which came from our collaborator; it is CUI-derived data, so it should not be published. The writeup said the following:
header: "CERT Guidelines Ranked by Effort Worthiness for This Project"
The raw table lives in accolade.csv. The tables in the paper contain this data reformatted to fit the page.
This table was generated manually based on the "Excerpt of Per-CERT-Rule Alert Counts and Related Data for Tools and Codebases Used" table. Each rule that had a non-empty rank column was added to this table. This table also coalesces information about zeek4 from our collaborator's data (which is CUI and not provided).
Data related to zeek4. CUI...comes from Brandon.
Data related to zeek5. CUI...comes from Brandon.
Data related to the scan-build SA tool. Format similar to clang-tidy, rosecheckers, cppcheck. Scan-build turned out to be less useful than clang-tidy.
Data and scripts related to testing our ACR tool on git and zeek. Not useful for our paper.
IIRC, I used this script to join some of the pivot tables when creating all_alerts.csv.
We exclude LaTeX, IEEE, and figure files that were used for the paper from the dataset release. From the redemption/paper directory, the dataset excludes the files accolade.org, IEEEtran.bst, IEEEtran.cls, makefile, mathmode-spacing.tex, paper.md, paper.tex, and refs.bib. It also excludes all files from the redemption/figs directory.
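The paper-related exclusions above could be expressed as a small filter. This is a sketch under assumptions: `EXCLUDED_PAPER_FILES` and `is_excluded` are hypothetical names, and paths are assumed to be relative to the root of a redemption checkout, with `figs` at the top level as written above.

```python
from pathlib import PurePosixPath

# Files in the paper directory that are left out of the dataset release.
EXCLUDED_PAPER_FILES = {
    "accolade.org",
    "IEEEtran.bst",
    "IEEEtran.cls",
    "makefile",
    "mathmode-spacing.tex",
    "paper.md",
    "paper.tex",
    "refs.bib",
}


def is_excluded(repo_path: str) -> bool:
    """Return True if a repository-relative path is excluded from the release."""
    p = PurePosixPath(repo_path)
    # Everything under figs/ is excluded, at any depth.
    if p.parts[:1] == ("figs",):
        return True
    # Only the named files directly inside paper/ are excluded.
    return p.parent == PurePosixPath("paper") and p.name in EXCLUDED_PAPER_FILES
```

Note that paper/oss_frequency.csv and the contents of paper/tables are not matched by this filter, consistent with their inclusion in the dataset.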
Since the one README file needed is in the top-level directory of the publication dataset, you should make sure these files are deleted from the publication dataset: data/README.md and data/README.dataset.md
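One way to confirm that the two README copies were removed is to inspect the finished archive. The sketch below uses Python's standard `zipfile` module. The function name `stray_readmes` is hypothetical, and it assumes archive entries are stored relative to the dataset root, with no enclosing top-level folder.

```python
import zipfile

# README copies that must not appear in the published archive.
STRAY_READMES = {"data/README.md", "data/README.dataset.md"}


def stray_readmes(zip_path) -> list[str]:
    """Return the stray README entries still present in the archive, sorted."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(STRAY_READMES.intersection(zf.namelist()))
```

An empty list means neither file is present; any other result names the entries that still need to be deleted before publication.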