chore: documentation update (#50)
- refactored Swagger schema for Job Configuration
- made some typo corrections and formatting adjustments for proper rendering.
gabb1er authored Jul 19, 2024
1 parent ca09fe4 commit 13f569a
Showing 65 changed files with 1,670 additions and 2,390 deletions.
18 changes: 9 additions & 9 deletions docs/01-application-setup/01-ApplicationSettings.md
@@ -18,11 +18,11 @@ section for more details on working with date and time in Checkita Framework.
DateTime settings include the following:

* `timeZone` - Time zone in which string representation of reference date and execution date are parsed and rendered.
*Optional, default is `"UTC"`*.
*Optional, default is `"UTC"`*.
* `referenceDateFormat` - datetime format used to parse and render reference date.
*Optional, default is `"yyyy-MM-dd'T'HH:mm:ss.SSS"`*.
*Optional, default is `"yyyy-MM-dd'T'HH:mm:ss.SSS"`*.
* `executionDateFormat` - datetime format used to parse and render execution date.
*Optional, default is `"yyyy-MM-dd'T'HH:mm:ss.SSS"`*
*Optional, default is `"yyyy-MM-dd'T'HH:mm:ss.SSS"`*

If `dateTimeOptions` section is missing then default values are used for all parameters above.
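
For illustration, a fully specified `dateTimeOptions` section might look roughly like this (a minimal sketch assuming HOCON syntax for the application configuration; only the parameters and defaults listed above are used):

```hocon
dateTimeOptions: {
  timeZone: "UTC"                                    # zone used to parse and render reference/execution dates
  referenceDateFormat: "yyyy-MM-dd'T'HH:mm:ss.SSS"   # format used for the reference date
  executionDateFormat: "yyyy-MM-dd'T'HH:mm:ss.SSS"   # format used for the execution date
}
```

Since all three parameters are optional, omitting the section entirely is equivalent to the values shown.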

@@ -36,7 +36,7 @@ section for more details on running data quality checks over streaming sources.
* `window` - Window interval: defines tumbling window size used to accumulate metrics.
All metric results and checks are evaluated per each window once it is finalised. *Optional, default is `10m`*.
* `watermark` - Watermark level: defines time interval after which late records are no longer processed.
*Optional, default is `5m`*.
* `allowEmptyWindows` - Boolean flag indicating whether empty windows are allowed. Thus, in a situation when a window is
below the watermark and there are no results for some of the processed streams, all related checks will be skipped
if this flag is set to `true`. Otherwise, checks will be processed and will return error status with
@@ -52,20 +52,20 @@ Section `enablers` of application configuration file defines various boolean switches
that control various aspects of data quality job execution:

* `allowSqlQueries` - Enables usage of arbitrary SQL queries in data quality job configuration.
*Optional, default is `false`*
* `allowNotifications` - Enables notifications to be sent from DQ application.
*Optional, default is `false`*
* `aggregatedKafkaOutput` - Enables sending aggregated messages for Kafka Targets (one per each target type).
By default, Kafka messages are sent per each result entity.
*Optional, default is `false`*
* `enableCaseSensitivity` - Enables column case sensitivity. Controls column names comparison and lookup.
*Optional, default is `false`*
* `errorDumpSize` - Maximum number of errors to be collected per single metric. Framework is able to collect source
data rows where metric evaluation yielded some errors. But in order to prevent OOM errors, the number of collected errors
has to be limited to a reasonable value. Thus, the maximum allowable number of errors per metric is `10000`.
It is possible to lower this number by setting this parameter. *Optional, default is `10000`*
* `outputRepartition` - Sets the number of partitions when writing outputs. By default, a single file is written.
*Optional, default is `1`*
* `metricEngineAPI` - Sets engine to be used for regular metric processing: `rdd` (RDD-engine) or `df` (DF-engine) are
available. It is recommended to use DF-engine for batch applications while streaming applications support only
RDD-engine. *Optional, default is `rdd`*.
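
As a rough example, an `enablers` section that switches on SQL queries and notifications while keeping other options close to their defaults might look as follows (a sketch assuming HOCON syntax; parameter names are taken from the list above, the values are hypothetical):

```hocon
enablers: {
  allowSqlQueries: true         # permit arbitrary SQL queries in job configurations
  allowNotifications: true      # allow the DQ application to send notifications
  aggregatedKafkaOutput: false  # one Kafka message per result entity (default behaviour)
  enableCaseSensitivity: false  # column names are matched case-insensitively
  errorDumpSize: 1000           # collect at most 1000 errors per metric (hard cap is 10000)
  outputRepartition: 1          # write outputs as a single file
  metricEngineAPI: "df"         # DF-engine, recommended for batch applications
}
```
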
10 changes: 5 additions & 5 deletions docs/01-application-setup/04-APIServer.md
@@ -12,11 +12,11 @@ with Checkita configuration and results.
Thus, at the current moment the API server supports the following functionality:

* Configuration:
    * Validation of application configuration.
    * Validation of job configuration.
* DQ Storage:
    * Fetch overall summary for all jobs in DQ storage.
    * Fetch actual job state that was run.
    * Fetch job results for given datetime interval.

See [Swagger Doc](../swagger/index.md#swagger-doc) for more details on Checkita API Server methods.
6 changes: 3 additions & 3 deletions docs/01-application-setup/index.md
@@ -19,8 +19,8 @@ this mode of operation. A typical architecture for working with Checkita Data Qu
* Spark Application is started.
* Spark Application loads the sources described in the configuration file (HDFS, S3, Hive, external databases),
calculates metrics, performs checks and saves the results:
    * The main results are saved in the framework database.
    * Additionally, results and notifications are sent via channels configured in the pipeline.
* Based on the results, dashboards are formed to monitor data quality
(not included in the functionality of this framework).

@@ -29,4 +29,4 @@ however, this functionality is currently in experimental state and is subject to
information on running quality checks over streaming sources, please see
[Data Quality Checks over Streaming Sources](../02-general-information/05-StreamingMode.md) chapter.

-![image](../../diagrams/Architecture.png)
+![image](../diagrams/Architecture.png)
2 changes: 1 addition & 1 deletion docs/02-general-information/03-StatusModel.md
@@ -14,7 +14,7 @@ is how statuses are communicated with user:

* When computing metrics, status is obtained for each data row during metric increment step. If status other than
`Success` then metric error is collected for this particular row of data. Then, metric error reports can be requested
-as [Error Collection Targets](../03-job-configuration/08-Targets.md#error-collection-targets). For more information
+as [Error Collection Targets](../03-job-configuration/10-Targets.md#error-collection-targets). For more information
on metric error collection, see [Metric Error Collection](04-ErrorCollection.md) chapter.
* As for checks, status is their primary result output. Therefore, it is written into data quality storage along with
a detailed message.
7 changes: 5 additions & 2 deletions docs/02-general-information/04-ErrorCollection.md
@@ -20,6 +20,9 @@ additionally limited in the application settings by setting `errorDumpSize` para
See [Enablers](../01-application-setup/01-ApplicationSettings.md#enablers) chapter for more details.

Collected metric errors could be used to identify and debug problems in the data. In order to save or send metric error
-reports, [Error Collection Targets](../03-job-configuration/08-Targets.md#error-collection-targets) can be configured in
+reports, [Error Collection Targets](../03-job-configuration/10-Targets.md#error-collection-targets) can be configured in
`targets` section of job configuration. Note that error collection reports will contain excerpts from data and,
-therefore, should be communicated with caution. For the same reason they are never saved in Data Quality storage.
+therefore, should be communicated with caution. For the same reason it is up to the user to decide whether metric errors
+will be saved in Data Quality storage. This behaviour is controlled by the `saveErrorsToStorage` enabler within the
+[Storage Configuration](../01-application-setup/01-ApplicationSettings.md#storage-configuration) section of the application
+configuration.
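
For instance, the enabler might be set roughly as follows (a sketch assuming the storage section of the application configuration is a HOCON block; the `storage` block layout is an assumption, while the `saveErrorsToStorage` flag itself comes from the text above):

```hocon
storage: {
  # ... DQ storage connection settings ...
  saveErrorsToStorage: true  # persist collected metric errors in the Data Quality storage
}
```
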
36 changes: 16 additions & 20 deletions docs/02-general-information/05-StreamingMode.md
@@ -45,30 +45,26 @@ Summarizing, data quality streaming job processing routine consists of the following
* Start streaming queries from provided sources with `forEachBatch` sink.
* Start window processor in a separate thread.
* For each micro-batch (evaluated once per trigger interval) process data:

    * register metric error accumulator;
    * for each record increment metric calculators corresponding to the window to which record is assigned;
    * collect metric errors if any;
    * if record is late to current watermark, then it is skipped and metric calculators state is unchanged;
    * compute new watermark based on time values obtained from processed records;
    * update processor buffer state, which contains state of metric calculators for all windows as well as collected
      metric errors (also per each window). In addition, processor buffer tracks current watermark levels per each
      processed streaming source.
* Window processor checks processor buffer (also once per trigger interval) for windows that are completely below the
watermark level. **IMPORTANT** In order to support synchronised processing of multiple streaming sources, the minimum
watermark level is used (computed from current watermark levels of all the processed sources). This ensures that
window is finalised for all processed sources.
* Once finalised window is obtained, then for this window all data quality routines are performed:
    * metric results are retrieved from calculators;
    * composed metrics are calculated;
    * checks are performed;
    * results are stored in the data quality storage;
    * all targets are processed and results (or notifications) are sent to required channels.
    * checkpoints are saved if checkpoint directory is configured. **This is a new feature available since Checkita 2.0.**
    * processor buffer is cleared: state for processed window is removed.
* Streaming queries and window processor run until application is stopped (`sigterm` signal received) or error occurs.

**Important note on results saving**: since a set of results is generated per each processed window, then for each set of
2 changes: 1 addition & 1 deletion docs/03-job-configuration/02-Schemas.md
@@ -64,7 +64,7 @@ Fixed-short schema definition contains the following parameters:
* `id` - *Required*. Schema ID;
* `description` - *Optional*. Schema description;
* `schema` - *Required*. List of schema columns where each column is a string in format `columnName:columnWidth`.
*Type of columns is always a StringType.*
* `metadata` - *Optional*. List of user-defined metadata parameters specific to this schema where each parameter
is a string in format:`param.name=param.value`.
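
As an illustration, a fixed-short schema built from these parameters might be declared roughly like this (a sketch assuming HOCON job configuration; the enclosing `schemas` list and the `kind` field are assumptions, the remaining fields follow the parameter list above):

```hocon
schemas: [
  {
    id: "hr_fixed_schema"
    kind: "fixedShort"
    description: "Fixed-width layout of the HR extract"
    # each column is declared as columnName:columnWidth; all columns are read as StringType
    schema: ["employee_id:10", "full_name:40", "hire_date:10"]
    metadata: ["source.system=hr"]
  }
]
```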

37 changes: 14 additions & 23 deletions docs/03-job-configuration/08-Metrics.md
@@ -4,7 +4,7 @@ Calculation of various metrics over the data is the main part of Data Quality jo
various indicators that describe data from both technical and business points of view. Indicators in their turn can
signal about problems in the data.

-All metrics are linked to a source over which they are calculated. Most of the metrics are computed directly over
+Most of the metrics are linked to a source over which they are calculated. Most of the metrics are computed directly over
the data source. Such metrics are called `regular`. Apart from regular metrics there are two special kinds of metrics:

* `composed` metrics - can be calculated based on other metrics results thus allowing metric compositions.
@@ -190,8 +190,8 @@ Therefore, to prevent OOM errors for extremely large sequences, it is recommende
the [Approximate Sequence Completeness Metric](#approximate-sequence-completeness-metric), which uses HLL probabilistic
algorithm to estimate number of unique values.

-The required number of elements is determined by the formula: `(max_value - min_value) / increment + 1`,
-Where:
+The required number of elements is determined by the formula: `(max_value - min_value) / increment + 1`, where:

* `min_value` - the minimum value in the sequence;
* `max_value` - the maximum value in the sequence;
* `increment` - sequence step, default is 1.
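
For example, for a column expected to cover the identifier range from 1 to 10000 with the default increment of 1, the required number of elements evaluates to

$$
\frac{\text{max\_value} - \text{min\_value}}{\text{increment}} + 1 = \frac{10000 - 1}{1} + 1 = 10000,
$$

so if only 9900 distinct values are actually observed, the resulting sequence completeness would presumably be `9900 / 10000 = 0.99`.
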
@@ -229,8 +229,8 @@ Metric definition does not require additional parameters: `params` should not be

All minimum string metrics are defined in `minString` subsection.

-Metric increment returns `Failure` status for rows where all values in the specified columns are not castable
-to string and, therefore, minimum string length cannot be computed.
+Metric is not reversible. Metric increment returns `Failure` status for rows where all values in the specified
+columns cannot be cast to string and, therefore, minimum string length cannot be computed.

### Maximum String Metric

@@ -618,7 +618,7 @@ Calculates an arbitrary quantile for the values in the specified column. Metric
Additional parameters should be supplied:

* `accuracyError` - *Optional, default is `0.01`*. Accuracy error for calculation of quantile value.
-* `target` - *Required*. A number in the interval `[0, 1]` corresponding to the quantile that need to be caclulated.
+* `target` - *Required*. A number in the interval `[0, 1]` corresponding to the quantile that needs to be calculated.

Metric is not reversible and metric increment returns `Failure` status for rows where value in the specified column
cannot be cast to number.
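
As an illustration, a quantile metric computing an approximate 95th percentile might be declared roughly as follows (a sketch assuming HOCON job configuration; the `metrics.regular.quantile` nesting and the `source`/`columns` fields are assumptions, while `accuracyError` and `target` come from the parameter list above):

```hocon
metrics: {
  regular: {
    quantile: [
      {
        id: "salary_p95"
        description: "Approximate 95th percentile of the salary column"
        source: "hr_data"       # ID of the source the metric is computed over
        columns: ["salary"]     # values must be castable to number
        params: {
          accuracyError: 0.005  # tighter than the default of 0.01
          target: 0.95          # the quantile to calculate, in [0, 1]
        }
      }
    ]
  }
}
```
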
@@ -716,9 +716,6 @@ Metric definition does not require additional parameters: `params` should not be

**This metric works with exactly two columns.**

-> **IMPORTANT**. For the metric to be calculated, values in the specified columns must not be empty or null and
-> also can be cast to number (double). If at least one corrupt value is found, then metric calculator returns NaN value.
Metric is not reversible and metric increment returns `Failure` status for rows where some values in the specified
columns cannot be cast to number.

@@ -731,9 +728,6 @@ Metric definition does not require additional parameters: `params` should not be

**This metric works with exactly two columns.**

-> **IMPORTANT**. For the metric to be calculated, values in the specified columns must not be empty or null and
-> also can be cast to number (double). If at least one corrupt value is found, then metric calculator returns NaN value.
Metric is not reversible and metric increment returns `Failure` status for rows where some values in the specified
columns cannot be cast to number.

@@ -746,9 +740,6 @@ Metric definition does not require additional parameters: `params` should not be

**This metric works with exactly two columns.**

-> **IMPORTANT**. For the metric to be calculated, values in the specified columns must not be empty or null and
-> also can be cast to number (double). If at least one corrupt value is found, then metric calculator returns NaN value.
Metric is not reversible and metric increment returns `Failure` status for rows where some values in the specified
columns cannot be cast to number.

@@ -807,22 +798,22 @@ Thus, trend metrics are defined in `trend` subsection using the following set of parame

* `id` - *Required*. Trend metric ID;
* `description` - *Optional*. Trend metric description.
* `kind` - *Required*. Kind of statistic to be calculated over historical metric results.
    * Available trend metric kinds are: `avg`, `std`, `min`, `max`, `sum`, `median`, `firstQuartile`, `thirdQuartile`, `quantile`.
* `quantile` - *Required*. **ONLY FOR `quantile` TREND METRIC**. Quantile to compute over historical metric results
(must be a number in range `[0, 1]`).
* `lookupMetric` - *Required*. Lookup metric ID: metric which results will be pulled from DQ storage.
* `rule` - *Required*. The rule for loading historical metric results from DQ storage. There are two rules supported:
    * `record` - loads specified number of historical metric result records.
    * `datetime` - loads historical metric results for configured datetime window.
* `windowSize` - *Required*. Size of the window for which historical results are loaded:
    * If `rule` is set to `record` then window size is the number of records to retrieve.
    * If `rule` is set to `datetime` then window size is a duration string which should conform to Scala Duration.
* `windowOffset` - *Optional, default is `0` or `0s`*. Set window offset back from current reference date
(see [Working with Date and Time](../02-general-information/01-WorkingWithDateTime.md) chapter for more details on
reference date). By default, offset is absent and window start from current reference date (not including it).
    * If `rule` is set to `record` then window offset is the number of records to skip from reference date.
    * If `rule` is set to `datetime` then window offset is a duration string which should conform to Scala Duration.
* `metadata` - *Optional*. List of user-defined metadata parameters specific to this metric where each parameter
is a string in format: `param.name=param.value`.
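
Putting these parameters together, a trend metric that averages the last two weeks of results of some lookup metric might be configured roughly like this (a sketch assuming HOCON job configuration; the metric IDs are hypothetical, the field names follow the list above):

```hocon
trend: [
  {
    id: "avg_row_count_2w"
    description: "Average of hr_row_cnt results over the last two weeks"
    kind: "avg"                 # statistic computed over historical results
    lookupMetric: "hr_row_cnt"  # metric whose results are pulled from the DQ storage
    rule: "datetime"            # load results for a datetime window ...
    windowSize: "14d"           # ... covering the last 14 days (Scala Duration string)
    windowOffset: "0s"          # no offset back from the reference date
  }
]
```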

4 changes: 2 additions & 2 deletions docs/03-job-configuration/09-Checks.md
@@ -5,7 +5,7 @@ then checks can be configured to identify if there are any problems with quality

In Checkita there are two main groups of checks:

-* `Spanshot` checks - allows comparison of metric results with static thresholds or with other metric results in the
+* `Snapshot` checks - allows comparison of metric results with static thresholds or with other metric results in the
same Data Quality job.
* `Trend` checks - allows evaluation of how metric result is changing over a certain period of time. Checks of this type
are used to detect anomalies in data. In order for trend checks to work, it is required to set up Data Quality storage since
@@ -119,7 +119,7 @@ Expression checks represent a boolean expression referring to one or multiple me
`true` or `false`. Metrics must be referenced by their IDs.

Formula must be written using [Mustache Template](https://mustache.github.io/mustache.5.html) notation, e.g.:
-`{{ metric_1 }} + {{ metic_2 }}`.
+`{{ metric_1 }} + {{ metric_2 }}`.

The following operations are supported to build boolean expressions:
