-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
181 lines (110 loc) · 8.58 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
---
output: github_document
resource_files:
- inst/extdata/comparison_report.xlsx
- inst/extdata/comparison_report_screenshot.png
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# compareDFx <a href="https://skgithub14.github.io/compareDFx/"><img src="man/figures/logo.png" align="right" height="139" alt="compareDFx website" /></a>
<!-- badges: start -->
[![R-CMD-check](https://github.com/skgithub14/compareDFx/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/skgithub14/compareDFx/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/skgithub14/compareDFx/branch/master/graph/badge.svg)](https://app.codecov.io/gh/skgithub14/compareDFx?branch=master)
<!-- badges: end -->
{compareDFx} compares two data frames with detailed reporting natively in R or in MS Excel. The MS Excel report is a multi-tab workbook that color codes additions, deletions, and changes which is ideal for R users and non-R users alike and makes identifying data issues and version differences quick and easy.
![Screen shot from 'data top-bottom' tab of comparison_report.xlsx](https://github.com/skgithub14/compareDFx/raw/master/inst/extdata/report_top_bottom.png)
## Installation
You can install the development version of {compareDFx} like so:
``` r
# install.package("devtools")
install_github("skgithub14/compareDFx")
```
<br>
________________________________________________________________________________
## Comparing two data frames using {compareDFx}
Let's assume that `df1` and `df2` are two versions of the same data set, where `df1` is the newer version. We want to know exactly which values changed. Using the two example data frames included with {compareDFx}, `df1` and `df2`, we will walk through the {compareDFx} workflow.
### Creating the report
We will start by loading the {compareDFx} package and defining our data frames:
```{r}
library(compareDFx)
data(df1)
data(df2)
```
<details><summary>View the example data</summary>
```{r}
# example data, df1
knitr::kable(df1)
```
```{r}
# example data, df2
knitr::kable(df2)
```
</details>
{compareDFx} uses a row's unique ID to evaluate what changed between two data frames. Unique IDs can consist of one or multiple columns. For `df1` and `df2`, the columns the collectively make a unique row ID are `id1` and `id2`. This means that in `df1` there are two columns named `id1` and `id2` and in `df2` there are also two columns named `id1` and `id2`.
```{r id_col}
id_cols <- c("id1", "id2")
```
Now we can run the comparison and generate the report in Excel:
``` r
# run the comparison
comparison <- get_comparison(df1, df2, id_cols)
# create the excel report
create_comparison_excel(comparison, path = "comparison_report.xlsx")
```
```{r, eval=TRUE, include=FALSE}
# run the comparison
comparison <- get_comparison(df1, df2, id_cols)
# directory and file name for the report, hiding this to avoid confusing users
# who don't know what system.file() does
dir <- system.file("extdata", package = "compareDFx")
fname <- "comparison_report.xlsx"
path <- file.path(dir, fname)
# create the excel report
create_comparison_excel(comparison, path = path)
```
Now we have a multi-tab, color-coded MS Excel report with summary statistics. On the data tabs of the report, all cells from `df1` will have a white background and all cells from `df2` will have a grey background. The text is colored accordingly:
- Green text: cell values in `df1` that are different from `df2`
- Red text: cell values in `df2` that are different from `df1`
- Black text: cell values that are the unchanged between `df1` and `df2`
You can explore the file yourself here:
[Download comparison_report.xlsx](https://github.com/skgithub14/compareDFx/raw/master/inst/extdata/comparison_report.xlsx)
As you can see, there are seven tabs in the Excel workbook:
1. summary: summary statistics
![Screen shot from 'summary' tab of comparison_report.xlsx](https://github.com/skgithub14/compareDFx/raw/master/inst/extdata/summary_tab.png)
2. summary by column: column summary statistics
![Screen shot from 'summary by column' tab of comparison_report.xlsx](https://github.com/skgithub14/compareDFx/raw/master/inst/extdata/summary_by_column_tab.png)
3. data pivoted: the combined data displayed so that each value in `df1` is directly to the left of its `df2` counterpart.
![Screen shot from 'data pivoted' tab of comparison_report.xlsx](https://github.com/skgithub14/compareDFx/raw/master/inst/extdata/report_pivoted.png)
4. data top-bottom: the combined data displayed so that each row of `df1` is directly above its `df2` counterpart.
![Screen shot from 'data top-bottom' tab of comparison_report.xlsx](https://github.com/skgithub14/compareDFx/raw/master/inst/extdata/report_top_bottom.png)
5. data left-right: the same data as shown in data top-bottom, but where the `df1` columns are immediately to the left of their `df2` counterpart.
![Screen shot from 'data left-right' tab of comparison_report.xlsx](https://github.com/skgithub14/compareDFx/raw/master/inst/extdata/report_left_right.png)
6. df1: the raw data from `df1`, for reference
7. df2: the raw data from `df2`, for reference
If you aren't interested in the Excel report but want to work with the report in R instead, you can use the output from `comparison` object directly (see the documentation for `get_comparison()`).
### Understanding the report
In the data top-bottom and data left-right tabs, you will see some extra columns that report useful comparison information:
- `discrepancy`: the comparison status, which can take one or more of the following values:
- `"matched"`: a row that exactly matched between `df1` and `df2`
- `"addition"`: a row that only existed in `df1`
- `"deletion"`: a row that only existed in `df2`
- `"changed"`: a row where the `id_cols` matched between between `df1` and `df2` but which had a least one other column with a change
- `"exact duplicate"`: a row with multiple occurrences in either `df1` or `df2`. "Exact" means that every value in the row matched perfectly to its duplicate copies. Exact duplicates can also be marked as `"matched"`, `"addition"`, `"deletion"`, or `"changed"`.
- `"ID duplicate (not exact duplicate)"`: a row that had multiple occurrences of the same combination of `id_cols` in either `df1` or `df2` but for which there was not an exact duplicate (meaning at least one non-ID value in the row was different). An ID duplicate cannot be marked as `"matched"`, `"addition"`, `"deletion"`, or `"changed"` because there will be multiple value matches when comparing `df1` and `df2`. The user should fix these errors and re-run the comparison.
- `"ID contains NA"`: a row in either `df1` or `df2` which had an `NA` value in one or more `id_cols`. The comparison procedure requires non-`NA` values for all `id_cols` therefore these rows will not be marked as `"matched"`, `"addition"`, `"deletion"`, or `"changed"`. The user should fix these errors and re-run the comparison.
- `source`: this column is only show in the data top-bottom sheet and tells you whether the data in the row came from `df1` or `df2`
The next set of columns are hidden by default, but can be shown by clicking the "+" in the top left corner of the worksheet:
- `change group`: if `discrepancy` contains `"changed"` this column will be populated with an arbitrary number that links changed records between `df1` and `df2`
- `exact dup cnt`: if `discrepancy` contains `"exact duplicate"` this column gives the number of exact duplicates in `df1` or `df2`
- `ID dup cnt`: if `discrepancy` contains `"ID duplicate (not exact duplicate)"` this column gives the number of ID duplicates in `df1` or `df2`
In the data pivoted tab you will see the comparison columns:
- `value changed`: will be `"N"` for all values that were unchanged between `df1` and `df2`. For values that were changed, added, or deleted between `df1` and `df2`, this will be `"Y"`. For values which could not be compared between `df1` and `df2` due to ID column issues, this will be `"UNK"`.
- `row discrepancy`: this column does not describes the discrepancy status between `df1` and `df2` for the value being compared rather, it provides the discrepancy status for the entire row in `df1` and/or `df2` that contains the value. This matches the value of the `discrepancy` column from the data top-bottom and data left-right tabs.
- `record from`: this describes whether the value being compared was in both `df1` and `df2`, `df1` only, or `df2` only.