Read our AWS Big Data Blog for an in-depth look at this solution.
Deequ is an open source library built on top of Apache Spark for defining “unit tests for data”. It is used internally at Amazon for verifying the quality of large production datasets, particularly to:
- Suggest data quality constraints on input tables/files
- Verify the suggested constraints
- Compute quality metrics
- Run column profiling
More details on Deequ can be found in this AWS Blog.
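To give a flavor of such a "unit test for data", here is a minimal sketch using Deequ's Scala API. The DataFrame `df` and the column names are illustrative placeholders, not part of this framework:

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}

// df is any Spark DataFrame; "id" and "priority" are example columns
val verificationResult = VerificationSuite()
  .onData(df)
  .addCheck(
    Check(CheckLevel.Error, "basic integrity checks")
      .hasSize(_ >= 1)                                  // table must not be empty
      .isComplete("id")                                 // "id" must never be null
      .isUnique("id")                                   // "id" must not contain duplicates
      .isContainedIn("priority", Array("high", "low"))) // only allowed values
  .run()

if (verificationResult.status != CheckStatus.Success) {
  println("Some data quality checks failed")
}
```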
This repository showcases a serverless data quality framework based on Deequ and running on AWS Glue. It takes a database and tables in the AWS Glue Catalog as inputs and outputs various data quality metrics into S3. Additionally, it automatically generates constraints on previously unseen data. The suggestions are stored in DynamoDB tables and can be reviewed and amended at any point by data owners in a UI. All constraints are disabled by default. Once enabled, they are used by the Glue jobs to carry out the data quality checks on the tables.
To deploy the infrastructure and code, you'll need an AWS account and a correctly configured AWS profile with enough permissions to create the architecture below - Administrator rights are recommended.
```bash
cd ./src
./deploy.sh -p <aws_profile> -r <aws_region>
```
All arguments to the `deploy.sh` script are optional. The default AWS profile and region are used if none are provided.
The script will:
- Create an S3 bucket to host Deequ Scripts and Jar
- Create a CodeCommit repository and push the local code copy to it
- Create a CloudFormation stack named `amazon-deequ-glue` holding all the infrastructure listed below
- Deploy an AWS Amplify UI and monitor the job
- Upload the Deequ scripts and Jar to the S3 bucket
The initial deployment can take 10-15 minutes. The same command can be used for both creating and updating infrastructure.
If you choose NOT to deploy the AWS Amplify frontend, add the `-d` flag to the deploy script:
```bash
cd ./src
./deploy.sh -p <aws_profile> -r <aws_region> -d
```
- A time-based CloudWatch Event Rule is configured to trigger the data quality step function, passing a JSON with the relevant metadata (i.e. `glueDatabase` and `glueTables`). It is scheduled to fire every 30 minutes and is disabled by default
- The first step in the data quality step function makes a synchronous call to the AWS Glue Controller job. This Glue job is responsible for determining which data quality checks should be performed
- The below Glue jobs can be triggered by the Controller (a sketch of the underlying Deequ calls follows this list):
  - (Phase a) Suggestions Job: started when working on previously unseen data to perform an automatic suggestion of constraints. The job first recommends column-level data quality constraints inferred from a first pass on the data, which it then logs in the `DataQualitySuggestion` DynamoDB table. It also outputs the quality check results based on these suggestions into S3. The suggestions can be reviewed and amended at any point by data owners
  - (Phase b) Verification Job: reads the user-defined and automatically suggested constraints from both the `DataQualitySuggestion` and `DataQualityAnalysis` DynamoDB tables and runs 1) a constraint verification and 2) an analysis metrics computation, which it outputs in Parquet format to S3
  - Profiler Job: always run regardless of the phase. It performs single-column profiling of the data, generating a profile for each column in the input data, including the completeness of the column, the approximate number of distinct values and the inferred datatype
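For illustration, below is a minimal sketch of the Deequ calls underlying the Suggestions and Profiler jobs. This is not the framework's actual job code: `df` stands for a Spark DataFrame loaded from the table under test, and persisting the results to DynamoDB/S3 is elided.

```scala
import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}
import com.amazon.deequ.profiles.ColumnProfilerRunner

// Phase a: infer candidate constraints from a first pass on the data.
// In this framework the suggestions would then be written to the
// DataQualitySuggestion DynamoDB table for human review.
val suggestionResult = ConstraintSuggestionRunner()
  .onData(df)
  .addConstraintRules(Rules.DEFAULT)
  .run()

suggestionResult.constraintSuggestions.foreach { case (column, suggestions) =>
  suggestions.foreach { s => println(s"$column: ${s.description}") }
}

// Profiler: single-column profiles (completeness, approximate number of
// distinct values, inferred datatype), always computed regardless of phase
val profiles = ColumnProfilerRunner()
  .onData(df)
  .run()

profiles.profiles.foreach { case (column, profile) =>
  println(s"$column: completeness=${profile.completeness}, " +
    s"approxDistinct=${profile.approximateNumDistinctValues}, " +
    s"dataType=${profile.dataType}")
}
```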
- Once the Controller job succeeds, the second step in the data quality step function triggers an AWS Lambda which calls the `data-quality-crawler`. The metrics in the data quality S3 bucket are crawled, stored in a `data_quality_db` database in the AWS Glue Catalog and are immediately available to be queried in Athena
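As a purely hypothetical example of that last step, the crawled metrics could be queried from a Spark session attached to the Glue Catalog (or equivalently via SQL in the Athena console). The table name below is illustrative; actual names depend on what the crawler registers:

```scala
// "verification_results" is an illustrative table name, not guaranteed by the framework;
// "spark" is assumed to be a SparkSession configured to use the Glue Catalog
val metrics = spark.sql(
  """SELECT *
    |FROM data_quality_db.verification_results
    |LIMIT 10""".stripMargin)
metrics.show()
```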
We assume you have a Glue database hosting one or more tables in the same region where you deployed this framework.
- Navigate to Step Functions in the AWS console and select the `data-quality-sm` state machine. Start an execution, inputting a JSON like the below:

```json
{
  "glueDatabase": "my_database",
  "glueTables": "table1,table2"
}
```

The data quality process described in the previous section begins, and you can follow it by looking at the execution runs of the different Glue jobs. By the end of this process, you should see that data quality suggestions were logged in the `DataQualitySuggestions` DynamoDB table and that Glue tables were created in the `data_quality_db` Glue database, which can be queried in Athena
- Navigate to Amplify in the AWS console and select the `deequ-constraints` app. Then click on the highlighted URL (listed as `https://<env>.<appsync_app_id>.amplifyapp.com`) to open the data quality constraints web app. After completing the registration process (i.e. Create Account) and signing in, a UI listing the data quality suggestions produced by the Glue job in the previous step is displayed. Data owners can add/remove and enable/disable these constraints at any point in this UI. Notice how the `Enabled` field is set to `N` by default for all suggestions. This ensures all constraints are human-reviewed before they are processed. Click on the checkbox button to enable a constraint
- Optionally, you can also provide data quality constraints in the `Analyzers` tab. These constraints are used by Deequ to calculate column-level statistics on the dataset (e.g. CountDistinct, DataType, Completeness…) called metrics (refer to the Data Analysis section of this blog for more details). A sketch of what such an analysis translates to appears after this list
- Once you have reviewed both suggestion and analysis constraints, start a new execution of the step function with the same JSON as input. This time the Glue jobs use the reviewed constraints to perform the data quality checks. Once more, the results can be immediately queried in Athena
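For reference, here is a hedged sketch of what such analyzers translate to in Deequ's Scala API, assuming illustrative columns `id` and `status` and a SparkSession `spark` (in the framework itself, the analyzers to run come from the `DataQualityAnalysis` DynamoDB table):

```scala
import com.amazon.deequ.analyzers.{ApproxCountDistinct, Completeness, DataType, Size}
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}

// Compute column-level metrics on df; "id" and "status" are illustrative columns
val analysisResult = AnalysisRunner
  .onData(df)
  .addAnalyzer(Size())                    // number of rows
  .addAnalyzer(Completeness("id"))        // fraction of non-null values in "id"
  .addAnalyzer(ApproxCountDistinct("id")) // approximate distinct count of "id"
  .addAnalyzer(DataType("status"))        // inferred datatype of "status"
  .run()

// Render the computed metrics as a DataFrame
AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult).show()
```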
An exhaustive list of suggestion and analysis constraints can be found in the docs.
- Deequ 1.0.3-RC1.jar
- Spark 2.2.0 (Scala)
Translate Scala scripts to Python using python-deequ
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.