---
title: "0 - Setup & Tools"
output: html_notebook
---
```{r setup, echo=F, results='hide'}
Sys.setenv(R_NOTEBOOK_HOME = getwd())
```
> Do not execute this notebook on the virtual machine during artifact evaluation; all necessary tools have already been set up for you in advance.

This notebook details the setup and tools required to reproduce the paper's results.
## Prerequisites
`ant`, `curl`, `git`, `Java`, `Python`, `C++/C`, `cmake`, `R`, and `MySQL`/`MariaDB` are required to reproduce the paper's results. On Linux machines, these can be easily installed using the system's package manager. Other operating systems might work too (e.g. WSL on Windows, or Homebrew on macOS), but were not tested.
MySQL or MariaDB must not use `secure_file_priv` (disable it by setting it to `""` in the configuration file), and the database user must have rights to access the dataset folders (this may require changing AppArmor or SELinux settings, depending on the distribution).
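A corresponding configuration fragment might look as follows (the exact file location, e.g. `/etc/mysql/my.cnf` or a file under `/etc/mysql/conf.d/`, varies by distribution):

```
[mysqld]
# an empty value disables the secure_file_priv restriction
secure_file_priv = ""
```

The active value can be checked from a database client with `SHOW VARIABLES LIKE 'secure_file_priv';`.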
For running the `Rmd` files, RStudio is also required.
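A quick way to confirm that the command-line prerequisites are available is to check each one against `PATH`. This is a best-effort sketch; the exact executable names (e.g. `mysql` vs. `mariadb`) may differ per distribution:

```{bash}
# report any required tool that is not on PATH
for tool in ant curl git java javac python cmake R mysql; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done
```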
Install required R packages:
```{r}
install.packages(c("RCurl", "RMySQL", "rjson", "bitops", "ggplot2"))
```
## Creating `config.R`
For the notebooks to work properly, a `config.R` file must be created in the folder where the notebooks reside. This file must specify the dataset names and paths, database connection information, GitHub API tokens, and so on, in the following format:
```
# basic database connection settings
DB_HOST = "127.0.0.1"
DB_USER = "<<your user>>"
DB_PASSWORD = "<<your password>>"
# dataset information
DATASET_NAME = "js" # or language you wish to use
DATASET_PATH = "/home/peta/devel/oopsla17-artifact/datasets/js" # this path must be absolute
# github authentication tokens to circumvent github api limitations
GITHUB_TOKENS = c("","","")
```
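As a hypothetical sanity check (the variable list below simply mirrors the template above), you can verify that `config.R` defines every required setting before running the notebooks:

```{bash}
cd $R_NOTEBOOK_HOME
# report any required setting missing from config.R
for v in DB_HOST DB_USER DB_PASSWORD DATASET_NAME DATASET_PATH GITHUB_TOKENS; do
  grep -q "^$v" config.R || echo "config.R is missing: $v"
done
```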
## Installing Tools
All tools will be installed into the `tools` folder. The following commands remove any state from previous runs and create the required directories:
```{bash}
cd $R_NOTEBOOK_HOME
# cleanup
rm -rf tools
rm -rf graphs
rm -rf datasets
rm -rf downloads
# prepare directories
mkdir -p tools
mkdir -p downloads/js
mkdir -p graphs/js
```
### Sourcerer CC
```{bash}
cd $R_NOTEBOOK_HOME
cd tools
git clone https://github.com/Mondego/SourcererCC.git
```
SourcererCC analyzes the similarity of given file inputs based on a specified threshold.
### Javascript Tokenizer
```{bash}
cd $R_NOTEBOOK_HOME
cd tools
git clone https://github.com/reactorlabs/js-tokenizer.git
cd js-tokenizer
git checkout oopsla17
mkdir build
cd build
cmake ..
make
```
Due to the sheer volume of the JavaScript projects, it was not feasible for us to keep the downloaded projects. The downloader therefore downloads the projects one by one, tokenizes the files within them, and then deletes the projects, keeping only the tokenized files for later stages.
### GHT Pipeline
```{bash}
cd $R_NOTEBOOK_HOME
cd tools
git clone https://github.com/reactorlabs/ght-pipeline.git
cd ght-pipeline
git checkout stridemerger
mkdir build
cd build
cmake ..
make
```
Because the downloader uses multiple passes, its results must be merged into single files. The `ght` tool does exactly that.
### SCC preprocessor
```{bash}
cd $R_NOTEBOOK_HOME
cd tools
git clone https://github.com/reactorlabs/sccpreprocessor.git
cd sccpreprocessor/src
javac *.java
```
The SCC preprocessor implements various simple data analyses (such as NPM file detection and test file detection).
### Clone Finder
```{bash}
cd $R_NOTEBOOK_HOME
cd tools
git clone https://github.com/reactorlabs/clone_finder.git
cd clone_finder
mkdir build
cd build
cmake ..
make
```
Clone finder analyzes the project-level cloning among the downloaded projects.
## Next Steps
[Getting GitHub project URLs](1-getting-projects.nb.html) in file [`1-getting-projects.Rmd`](1-getting-projects.Rmd).