Commit

Add Table of content
zeotuan committed Dec 7, 2024
1 parent 02d5799 commit c316b48
Showing 1 changed file with 63 additions and 46 deletions.
109 changes: 63 additions & 46 deletions README.md
@@ -11,6 +11,21 @@ Use [chispa](https://github.com/MrPowers/chispa) for PySpark applications.
Read [Testing Spark Applications](https://leanpub.com/testing-spark) for a full explanation of the best way to test
Spark code! Good test suites yield higher-quality codebases that are easy to refactor.

## Table of Contents
- [Install](#install)
- [Examples](#simple-examples)
- [Why is this library fast?](#why-is-this-library-fast)
- [Usage](#usage)
  - [Local Testing SparkSession](#local-sparksession-for-test)
  - [DataFrameComparer](#datasetcomparer--dataframecomparer)
    - [Unordered DataFrames comparison](#unordered-dataframe-equality-comparisons)
    - [Approximate DataFrames comparison](#approximate-dataframe-equality)
    - [Ignore Nullable DataFrames comparison](#equality-comparisons-ignoring-the-nullable-flag)
  - [ColumnComparer](#column-equality)
  - [SchemaComparer](#schema-equality)
- [Testing tips](#testing-tips)


## Install

Fetch the JAR file from Maven.
@@ -149,6 +164,7 @@ slower.

## Usage

### Local SparkSession for test
The spark-fast-tests project doesn't provide a SparkSession object in your test suite, so you'll need to make one
yourself.

@@ -176,6 +192,7 @@ big DataFrames in your test suite.
Make sure to only use the `SparkSessionTestWrapper` trait in your test suite. You don't want to use test-specific
configuration (like one shuffle partition) when running production code.

### DatasetComparer / DataFrameComparer
The `DatasetComparer` trait defines the `assertSmallDatasetEquality` method. Extend your spec file with the
`SparkSessionTestWrapper` trait to create DataFrames and the `DatasetComparer` trait to make DataFrame comparisons.

@@ -221,50 +238,7 @@ assertLargeDatasetEquality(actualDF, expectedDF)
`assertSmallDatasetEquality` is faster for test suites that run on your local machine. `assertLargeDatasetEquality`
should only be used for DataFrames that are split across nodes in a cluster.

### Column Equality

The `assertColumnEquality` method can be used to assess the equality of two columns in a DataFrame.

Suppose you have the following DataFrame with two columns that are not equal.

```
+-------+-------------+
| name|expected_name|
+-------+-------------+
| phil| phil|
| rashid| rashid|
|matthew| mateo|
| sami| sami|
| li| feng|
| null| null|
+-------+-------------+
```

The following code will throw a `ColumnMismatch` error:

```scala
assertColumnEquality(df, "name", "expected_name")
```

<p>
<img src="./images/assertColumnEquality_error_message.png" alt="assertColumnEquality error message" width="500" height="200">
</p>

Mix the `ColumnComparer` trait into your test class to access the `assertColumnEquality` method:

```scala
import com.github.mrpowers.spark.fast.tests.ColumnComparer

object MySpecialClassTest
extends TestSuite
with ColumnComparer
with SparkSessionTestWrapper {

// your tests
}
```

### Unordered DataFrame equality comparisons
#### Unordered DataFrame equality comparisons

Suppose you have the following `actualDF`:

@@ -297,7 +271,7 @@ performing the comparison.

`assertSmallDataFrameEquality(sourceDF, expectedDF, orderedComparison = false)` will not throw an error.
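The idea behind an order-insensitive comparison can be sketched in plain Scala, without Spark. This is only an illustration of the concept, not the library's implementation: `NumberRow` and `equalIgnoringOrder` are hypothetical names, and the sketch assumes that both sides are sorted by the same key before they are compared.

```scala
// Hypothetical stand-in for one DataFrame row; not a spark-fast-tests type.
final case class NumberRow(number: Int)

object UnorderedComparisonSketch {
  // Sort both sides by the same key so the original row order stops mattering,
  // then compare the sorted sequences. Sorting keeps duplicate rows, so two
  // inputs with different row counts are still reported as different.
  def equalIgnoringOrder(actual: Seq[NumberRow], expected: Seq[NumberRow]): Boolean =
    actual.sortBy(_.number) == expected.sortBy(_.number)

  def main(args: Array[String]): Unit = {
    val actual = Seq(NumberRow(1), NumberRow(5))
    val expected = Seq(NumberRow(5), NumberRow(1))
    println(actual == expected)                   // false: same rows, different order
    println(equalIgnoringOrder(actual, expected)) // true: order is ignored
  }
}
```

Sorting first keeps the comparison a simple element-by-element check, at the cost of an extra sort on each side.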

### Equality comparisons ignoring the nullable flag
#### Equality comparisons ignoring the nullable flag

You might also want to make equality comparisons that ignore the nullable flags for the DataFrame columns.

@@ -326,7 +300,7 @@ val expectedDF = spark.createDF(
assertSmallDatasetEquality(sourceDF, expectedDF, ignoreNullable = true)
```
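To make the effect of ignoring the nullable flag concrete, here is a minimal plain-Scala sketch. It assumes a simplified, hypothetical `Field` model in place of Spark's `StructField`, and it is not how the library is implemented: it only shows that normalizing the flag on both sides turns a strict schema mismatch into a match.

```scala
// Hypothetical, simplified model of a schema field. Spark's real type is StructField.
final case class Field(name: String, dataType: String, nullable: Boolean)

object NullableSketch {
  // Strict comparison: name, data type, and nullable flag must all match.
  def sameSchema(a: Seq[Field], b: Seq[Field]): Boolean = a == b

  // Relaxed comparison: force the nullable flag to the same value on both
  // sides, so only names and data types can cause a mismatch.
  def sameSchemaIgnoringNullable(a: Seq[Field], b: Seq[Field]): Boolean =
    a.map(_.copy(nullable = true)) == b.map(_.copy(nullable = true))

  def main(args: Array[String]): Unit = {
    val source = Seq(Field("name", "string", nullable = true))
    val expected = Seq(Field("name", "string", nullable = false))
    println(sameSchema(source, expected))                 // false: flags differ
    println(sameSchemaIgnoringNullable(source, expected)) // true: flags ignored
  }
}
```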

### Approximate DataFrame Equality
#### Approximate DataFrame Equality

The `assertApproximateDataFrameEquality` function is useful for DataFrames that contain `DoubleType` columns. The
precision threshold must be set when using the `assertApproximateDataFrameEquality` function.
@@ -355,6 +329,49 @@ val expectedDF = spark.createDF(
assertApproximateDataFrameEquality(sourceDF, expectedDF, 0.01)
```
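What a precision threshold means can be shown with a short plain-Scala sketch. The names `approxEqual` and `approxEqualColumns` are hypothetical, and treating a difference exactly equal to the threshold as a match is an assumption of this sketch rather than a statement about the library's behavior.

```scala
object ApproximateSketch {
  // Two doubles count as approximately equal when their absolute difference
  // does not exceed the precision threshold (inclusive bound is an assumption).
  def approxEqual(a: Double, b: Double, precision: Double): Boolean =
    math.abs(a - b) <= precision

  // Compare two columns of doubles position by position.
  def approxEqualColumns(actual: Seq[Double], expected: Seq[Double], precision: Double): Boolean =
    actual.length == expected.length &&
      actual.zip(expected).forall { case (a, b) => approxEqual(a, b, precision) }

  def main(args: Array[String]): Unit = {
    println(approxEqualColumns(Seq(1.2, 5.1), Seq(1.205, 5.1), 0.01)) // true
    println(approxEqualColumns(Seq(1.2, 5.1), Seq(1.3, 5.1), 0.01))   // false
  }
}
```

A looser threshold accepts more rounding noise but can also hide real differences, so it is worth picking the smallest value that the computation under test can reliably meet.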

### Column Equality

The `assertColumnEquality` method can be used to assess the equality of two columns in a DataFrame.

Suppose you have the following DataFrame with two columns that are not equal.

```
+-------+-------------+
| name|expected_name|
+-------+-------------+
| phil| phil|
| rashid| rashid|
|matthew| mateo|
| sami| sami|
| li| feng|
| null| null|
+-------+-------------+
```

The following code will throw a `ColumnMismatch` error:

```scala
assertColumnEquality(df, "name", "expected_name")
```

<p>
<img src="./images/assertColumnEquality_error_message.png" alt="assertColumnEquality error message" width="500" height="200">
</p>

Mix the `ColumnComparer` trait into your test class to access the `assertColumnEquality` method:

```scala
import com.github.mrpowers.spark.fast.tests.ColumnComparer

object MySpecialClassTest
extends TestSuite
with ColumnComparer
with SparkSessionTestWrapper {

// your tests
}
```
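The row-by-row check behind a column comparison can be sketched in plain Scala using the data from the table above. This is an illustration only: `mismatches` is a hypothetical helper, `None` stands in for a SQL `null`, and treating two nulls as equal is an assumption of the sketch.

```scala
object ColumnEqualitySketch {
  // Each row holds the values of the two columns being compared.
  // None models a SQL null.
  type Row = (Option[String], Option[String])

  // Return the rows whose two values differ. None == None is true in Scala,
  // so a row with two nulls counts as a match in this sketch.
  def mismatches(rows: Seq[Row]): Seq[Row] =
    rows.filter { case (actual, expected) => actual != expected }

  def main(args: Array[String]): Unit = {
    val rows: Seq[Row] = Seq(
      (Some("phil"), Some("phil")),
      (Some("rashid"), Some("rashid")),
      (Some("matthew"), Some("mateo")),
      (Some("sami"), Some("sami")),
      (Some("li"), Some("feng")),
      (None, None)
    )
    // Reports the matthew/mateo and li/feng rows as mismatches.
    mismatches(rows).foreach(println)
  }
}
```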

### Schema Equality

`SchemaComparer` provides the `assertSchemaEqual` API, which is useful for comparing the schemas of DataFrames.