---
title: "5 - Additional Processing"
output: html_notebook
---
```{r setup, echo=F, results='hide'}
Sys.setenv(R_NOTEBOOK_HOME = getwd())
source("config.R")
source("helpers.R")
```
## Heatmap
First, a helper file (`projects_heat.csv`) containing project ids, stars, and commits must be exported from the database:
```{r}
exportHeatmapData(DATASET_NAME, DATASET_PATH)
```
Then run `sccpreprocessor originals PATH_TO_DATASET NUM_CHUNKS` to calculate the heatmap:
```{bash}
cd $R_NOTEBOOK_HOME
cd tools/sccpreprocessor/src
java SccPreprocessor originals ../../../datasets/js 1
```
Here `NUM_CHUNKS` is the number of chunks produced by the clone finder (i.e. how many threads the clone finder used). This produces `heatmap.csv`, which contains one row per project with the following:
- project id
- number of stars
- number of commits
- number of original files in the project
- number of clones the project contains
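
As a quick sanity check, the result can be summarized from the shell. The sketch below assumes `heatmap.csv` is comma-separated, has no header row, and stores its columns in the order listed above; none of this is confirmed here, so adjust the field numbers if the file differs. The function name `heatmap_summary` is hypothetical.

```bash
# Hypothetical helper: summarize a heatmap.csv file.
# Assumes comma-separated rows without a header, with columns:
# project id, stars, commits, original files, clones.
heatmap_summary() {
  awk -F',' '
    { projects++; originals += $4; clones += $5 }
    END { printf "projects=%d originals=%d clones=%d\n", projects, originals, clones }
  ' "$1"
}

# Example on a tiny made-up file (values are illustrative only).
tmp=$(mktemp)
printf '1,10,100,5,2\n2,0,3,1,7\n' > "$tmp"
heatmap_summary "$tmp"   # prints: projects=2 originals=6 clones=9
rm -f "$tmp"
```

For the real data, the function would be called on the `heatmap.csv` file inside the dataset folder.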
## Tokenizer Extra Information
The JavaScript tokenizer produces extra information for its projects and files, stored in `projects_extra.txt` and `files_extra.txt`. When imported, these add columns to the `projects` and `files` tables. The `importJSExtras` function in `functions.R` does this:
```{r}
importJSExtras(DATASET_NAME, DATASET_PATH)
```
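
Before running the import, it can help to confirm that both tokenizer outputs are present. This is a minimal sketch that assumes the two files sit directly in the dataset folder; `check_extras` is a hypothetical helper, not part of the project.

```bash
# Hypothetical pre-flight check.
# Assumes both files are stored directly in the dataset folder.
check_extras() {
  missing=0
  for f in projects_extra.txt files_extra.txt; do
    if [ ! -f "$1/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Example: a temporary folder that has only one of the two files.
dir=$(mktemp -d)
touch "$dir/projects_extra.txt"
check_extras "$dir" || echo "not ready for import"
rm -rf "$dir"
```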
## File Types
For JavaScript, we also calculate the kinds of files. Files can be `npm` or `bower` files, tests, `min.js` files, or locale files, and they carry other properties (file name without extension, npm package name, and first npm package name (blame)). To calculate this information, `sccpreprocessor` must be executed first to create `files_nm.csv`:
```{bash}
cd $R_NOTEBOOK_HOME
cd tools/sccpreprocessor/src
java SccPreprocessor nm ../../../datasets/js
```
When done, the file can be imported into the database, and a database with the `_nonpm` suffix can be created that contains only non-npm files:
```{r}
importAndCreateJS_NPM(paste(DATASET_NAME, "nonpm", sep = "_"), DATASET_NAME, DATASET_PATH)
```
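
The classification rules themselves live in `sccpreprocessor` and are not shown here. Purely as an illustration of what path-based classification could look like, the sketch below assumes npm files sit under `node_modules`, bower files under `bower_components`, and tests under `test` or `tests` directories. These rules and the name `classify_path` are assumptions, and unlike the real data, where a file can carry several flags, this sketch reports only the first match.

```bash
# Hypothetical classifier; the real rules in sccpreprocessor may differ.
# Reports only the first matching kind for a given file path.
classify_path() {
  case "$1" in
    */node_modules/*|node_modules/*)         echo npm ;;
    */bower_components/*|bower_components/*) echo bower ;;
    *.min.js)                                echo minjs ;;
    */test/*|*/tests/*|test/*|tests/*)       echo test ;;
    *)                                       echo other ;;
  esac
}

classify_path "lib/node_modules/express/index.js"   # prints: npm
classify_path "dist/app.min.js"                     # prints: minjs
```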
## Aggregation of files and projects
`fileId`, `cdate`, `npm`, `bower`, `test`, `minjs`, `thash`
```{r}
exportAggregationData(DATASET_NAME, DATASET_PATH)
```
```{bash}
cd $R_NOTEBOOK_HOME
cd tools/sccpreprocessor/src
java SccPreprocessor aggregate ../../../datasets/js
```
## All NPM using projects
This downloads the `package.json` file of every project that has one.
```{bash}
cd $R_NOTEBOOK_HOME
mkdir -p pjsons
cd tools/sccpreprocessor/src
java SccPreprocessor npm ../../../datasets/js ../../../pjsons 8
```
The arguments are, in order: the path to the dataset, the output folder where the `package.json` files will be stored, and the number of threads to use.
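
Once the download finishes, a rough count of the stored files shows how many projects had a `package.json`. The naming scheme of the downloaded files is not described here, so this sketch simply counts every regular file under the output folder; `count_downloaded` is a hypothetical helper.

```bash
# Hypothetical check: count the files stored in the output folder.
# The file naming scheme is unknown, so every regular file is counted.
count_downloaded() {
  find "$1" -type f | wc -l | tr -d ' '
}

# Example on a temporary folder with two placeholder files.
dir=$(mktemp -d)
touch "$dir/a.json" "$dir/b.json"
count_downloaded "$dir"   # prints: 2
rm -rf "$dir"
```

For the run above, the function would be called on the `pjsons` folder.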
## Next Steps
[Reporting](8-reporting.nb.html) in file [`8-reporting.Rmd`](8-reporting.Rmd).