diff --git a/README.md b/README.md
index 16a64d1..b04f2a2 100644
--- a/README.md
+++ b/README.md
@@ -11,6 +11,21 @@ Use [chispa](https://github.com/MrPowers/chispa) for PySpark applications.
 
 Read [Testing Spark Applications](https://leanpub.com/testing-spark) for a full explanation on the best way to test Spark code! Good test suites yield higher quality codebases that are easy to refactor.
 
+## Table of Contents
+- [Install](#install)
+- [Examples](#simple-examples)
+- [Why is this library fast?](#why-is-this-library-fast)
+- [Usage](#usage)
+  - [Local Testing SparkSession](#local-sparksession-for-test)
+  - [DataFrameComparer](#datasetcomparerdataframecomparer)
+    - [Unordered DataFrames comparison](#unordered-dataframe-equality-comparisons)
+    - [Approximate DataFrames comparison](#approximate-dataframe-equality)
+    - [Ignore Nullable DataFrames comparison](#equality-comparisons-ignoring-the-nullable-flag)
+  - [ColumnComparer](#column-equality)
+  - [SchemaComparer](#schema-equality)
+- [Testing tips](#testing-tips)
+
+
 ## Install
 
 Fetch the JAR file from Maven.
@@ -149,6 +164,7 @@ slower.
 
 ## Usage
 
+### Local SparkSession for test
 The spark-fast-tests project doesn't provide a SparkSession object in your test suite, so you'll need to make one yourself.
 
@@ -176,6 +192,7 @@ big DataFrames in your test suite.
 
 Make sure to only use the `SparkSessionTestWrapper` trait in your test suite. You don't want to use test specific configuration (like one shuffle partition) when running production code.
 
+### DatasetComparer / DataFrameComparer
 The `DatasetComparer` trait defines the `assertSmallDatasetEquality` method. Extend your spec file with the `SparkSessionTestWrapper` trait to create DataFrames and the `DatasetComparer` trait to make DataFrame comparisons.
 
@@ -221,50 +238,7 @@ assertLargeDatasetEquality(actualDF, expectedDF)
 
 `assertSmallDatasetEquality` is faster for test suites that run on your local machine.
 `assertLargeDatasetEquality` should only be used for DataFrames that are split across nodes in a cluster.
 
-### Column Equality
-
-The `assertColumnEquality` method can be used to assess the equality of two columns in a DataFrame.
-
-Suppose you have the following DataFrame with two columns that are not equal.
-
-```
-+-------+-------------+
-|   name|expected_name|
-+-------+-------------+
-|   phil|         phil|
-| rashid|       rashid|
-|matthew|        mateo|
-|   sami|         sami|
-|     li|         feng|
-|   null|         null|
-+-------+-------------+
-```
-
-The following code will throw a `ColumnMismatch` error message:
-
-```scala
-assertColumnEquality(df, "name", "expected_name")
-```
-
-[image: Description]
-
-Mix in the `ColumnComparer` trait to your test class to access the `assertColumnEquality` method:
-
-```scala
-import com.github.mrpowers.spark.fast.tests.ColumnComparer
-
-object MySpecialClassTest
-    extends TestSuite
-    with ColumnComparer
-    with SparkSessionTestWrapper {
-
-  // your tests
-}
-```
-
-### Unordered DataFrame equality comparisons
+#### Unordered DataFrame equality comparisons
 
 Suppose you have the following `actualDF`:
 
@@ -297,7 +271,7 @@ performing the comparison.
 
 `assertSmallDataFrameEquality(sourceDF, expectedDF, orderedComparison = false)` will not throw an error.
 
-### Equality comparisons ignoring the nullable flag
+#### Equality comparisons ignoring the nullable flag
 
 You might also want to make equality comparisons that ignore the nullable flags for the DataFrame columns.
 
@@ -326,7 +300,7 @@ val expectedDF = spark.createDF(
 assertSmallDatasetEquality(sourceDF, expectedDF, ignoreNullable = true)
 ```
 
-### Approximate DataFrame Equality
+#### Approximate DataFrame Equality
 
 The `assertApproximateDataFrameEquality` function is useful for DataFrames that contain `DoubleType` columns. The precision threshold must be set when using the `assertApproximateDataFrameEquality` function.
 
@@ -355,6 +329,49 @@ val expectedDF = spark.createDF(
 assertApproximateDataFrameEquality(sourceDF, expectedDF, 0.01)
 ```
 
+### Column Equality
+
+The `assertColumnEquality` method can be used to assess the equality of two columns in a DataFrame.
+
+Suppose you have the following DataFrame with two columns that are not equal.
+
+```
++-------+-------------+
+|   name|expected_name|
++-------+-------------+
+|   phil|         phil|
+| rashid|       rashid|
+|matthew|        mateo|
+|   sami|         sami|
+|     li|         feng|
+|   null|         null|
++-------+-------------+
+```
+
+The following code will throw a `ColumnMismatch` error message:
+
+```scala
+assertColumnEquality(df, "name", "expected_name")
+```
+
+[image: Description]
+
+Mix in the `ColumnComparer` trait to your test class to access the `assertColumnEquality` method:
+
+```scala
+import com.github.mrpowers.spark.fast.tests.ColumnComparer
+
+object MySpecialClassTest
+    extends TestSuite
+    with ColumnComparer
+    with SparkSessionTestWrapper {
+
+  // your tests
+}
+```
+
 ### Schema Equality
 
 The SchemaComparer provide `assertSchemaEqual` API which is useful for comparing schema of dataframe schema