README.Rmd

---
output: github_document
resource_files:
  - inst/extdata/comparison_report.xlsx
  - inst/extdata/comparison_report_screenshot.png
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# compareDFx <a href="https://skgithub14.github.io/compareDFx/"><img src="man/figures/logo.png" align="right" height="139" alt="compareDFx website" /></a>


<!-- badges: start -->
[![R-CMD-check](https://github.com/skgithub14/compareDFx/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/skgithub14/compareDFx/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/skgithub14/compareDFx/branch/master/graph/badge.svg)](https://app.codecov.io/gh/skgithub14/compareDFx?branch=master)
<!-- badges: end -->

{compareDFx} compares two data frames with detailed reporting natively in R or in MS Excel. The MS Excel report is a multi-tab workbook that color codes additions, deletions, and changes which is ideal for R users and non-R users alike and makes identifying data issues and version differences quick and easy. 

![Screen shot from 'data top-bottom' tab of comparison_report.xlsx](https://github.com/skgithub14/compareDFx/raw/master/inst/extdata/report_top_bottom.png)

## Installation

You can install the development version of {compareDFx} like so:

``` r
# install.package("devtools")
install_github("skgithub14/compareDFx")
```

<br>

________________________________________________________________________________

## Comparing two data frames using {compareDFx} 

Let's assume that `df1` and `df2` are two versions of the same data set, where `df1` is the newer version. We want to know exactly which values changed. Using the two example data frames included with {compareDFx}, `df1` and `df2`, we will walk through the {compareDFx} workflow.

### Creating the report

We will start by loading the {compareDFx} package and defining our data frames:

```{r}
library(compareDFx)
data(df1)
data(df2)
```


<details><summary>View the example data</summary>

```{r}
# example data, df1 
knitr::kable(df1)
```


```{r}
# example data, df2
knitr::kable(df2)
```

</details>

{compareDFx} uses a row's unique ID to evaluate what changed between two data frames. Unique IDs can consist of one or multiple columns. For `df1` and `df2`, the columns the collectively make a unique row ID are `id1` and `id2`. This means that in `df1` there are two columns named `id1` and `id2` and in `df2` there are also two columns named `id1` and `id2`. 

```{r id_col}
id_cols <- c("id1", "id2")
```

Now we can run the comparison and generate the report in Excel:

``` r
# run the comparison
comparison <- get_comparison(df1, df2, id_cols)

# create the excel report
create_comparison_excel(comparison, path = "comparison_report.xlsx")
```

```{r, eval=TRUE, include=FALSE}
# run the comparison
comparison <- get_comparison(df1, df2, id_cols)

# directory and file name for the report, hiding this to avoid confusing users 
#  who don't know what system.file() does
dir <- system.file("extdata", package = "compareDFx")
fname <- "comparison_report.xlsx"
path <- file.path(dir, fname)

# create the excel report
create_comparison_excel(comparison, path = path)
```

Now we have a multi-tab, color-coded MS Excel report with summary statistics. On the data tabs of the report, all cells from `df1` will have a white background and all cells from `df2` will have a grey background. The text is colored accordingly:

-  Green text: cell values in `df1` that are different from `df2`

-  Red text: cell values in `df2` that are different from `df1`

-  Black text: cell values that are the unchanged between `df1` and `df2`

You can explore the file yourself here:

[Download comparison_report.xlsx](https://github.com/skgithub14/compareDFx/raw/master/inst/extdata/comparison_report.xlsx)

As you can see, there are seven tabs in the Excel workbook:

1.  summary: summary statistics

![Screen shot from 'summary' tab of comparison_report.xlsx](https://github.com/skgithub14/compareDFx/raw/master/inst/extdata/summary_tab.png) 

2.  summary by column: column summary statistics

![Screen shot from 'summary by column' tab of comparison_report.xlsx](https://github.com/skgithub14/compareDFx/raw/master/inst/extdata/summary_by_column_tab.png)

3. data pivoted: the combined data displayed so that each value in `df1` is directly to the left of its `df2` counterpart.

![Screen shot from 'data pivoted' tab of comparison_report.xlsx](https://github.com/skgithub14/compareDFx/raw/master/inst/extdata/report_pivoted.png)

4.  data top-bottom: the combined data displayed so that each row of `df1` is directly above its `df2` counterpart.

![Screen shot from 'data top-bottom' tab of comparison_report.xlsx](https://github.com/skgithub14/compareDFx/raw/master/inst/extdata/report_top_bottom.png)

5.  data left-right: the same data as shown in  data top-bottom, but where the `df1` columns are immediately to the left of their `df2` counterpart.

![Screen shot from 'data left-right' tab of comparison_report.xlsx](https://github.com/skgithub14/compareDFx/raw/master/inst/extdata/report_left_right.png)

6.  df1: the raw data from `df1`, for reference

7.  df2: the raw data from `df2`, for reference

If you aren't interested in the Excel report but want to work with the report in R instead, you can use the output from `comparison` object directly (see the documentation for `get_comparison()`).

### Understanding the report

In the data top-bottom and data left-right tabs, you will see some extra columns that report useful comparison information:

-  `discrepancy`: the comparison status, which can take one or more of the following values:

    - `"matched"`: a row that exactly matched between `df1` and `df2`
  
    - `"addition"`: a row that only existed in `df1`
  
    - `"deletion"`: a row that only existed in `df2`
  
    - `"changed"`: a row where the `id_cols` matched between between `df1` and `df2` but which had a least one other column with a change
  
    - `"exact duplicate"`: a row with multiple occurrences in either `df1` or `df2`. "Exact" means that every value in the row matched perfectly to its duplicate copies. Exact duplicates can also be marked as `"matched"`, `"addition"`, `"deletion"`, or `"changed"`.
  
    - `"ID duplicate (not exact duplicate)"`: a row that had multiple occurrences of the same combination of `id_cols` in either `df1` or `df2` but for which there was not an exact duplicate (meaning at least one non-ID value in the row was different). An ID duplicate cannot be marked as `"matched"`, `"addition"`, `"deletion"`, or `"changed"` because there will be multiple value matches when comparing `df1` and `df2`. The user should fix these errors and re-run the comparison.
  
    - `"ID contains NA"`: a row in either `df1` or `df2` which had an `NA` value in one or more `id_cols`. The comparison procedure requires non-`NA` values for all `id_cols` therefore these rows will not be marked as `"matched"`, `"addition"`, `"deletion"`, or `"changed"`. The user should fix these errors and re-run the comparison.
    
-  `source`: this column is only show in the data top-bottom sheet and tells you whether the data in the row came from `df1` or `df2`
    
The next set of columns are hidden by default, but can be shown by clicking the "+" in the top left corner of the worksheet:
  
-  `change group`: if `discrepancy` contains `"changed"` this column will be populated with an arbitrary number that links changed records between `df1` and `df2`

-  `exact dup cnt`: if `discrepancy` contains `"exact duplicate"` this column gives the number of exact duplicates in `df1` or `df2`

-  `ID dup cnt`: if `discrepancy` contains `"ID duplicate (not exact duplicate)"` this column gives the number of ID duplicates in `df1` or `df2`

In the data pivoted tab you will see the comparison columns:

- `value changed`: will be `"N"` for all values that were unchanged between `df1` and `df2`. For values that were changed, added, or deleted between `df1` and `df2`, this will be `"Y"`. For values which could not be compared between `df1` and `df2` due to ID column issues, this will be `"UNK"`.

-  `row discrepancy`: this column does not describes the discrepancy status between `df1` and `df2` for the value being compared rather, it provides the discrepancy status for the entire row in `df1` and/or `df2` that contains the value. This matches the value of the `discrepancy` column from the data top-bottom and data left-right tabs. 

-  `record from`: this describes whether the value being compared was in both `df1` and `df2`, `df1` only, or `df2` only.