chore: documentation update (#50)
- refactored Swagger schema for Job Configuration
- made some typo corrections and formatting adjustments for proper rendering.
gabb1er authored Jul 19, 2024
1 parent ca09fe4 commit 13f569a
Showing 65 changed files with 1,670 additions and 2,390 deletions.
18 changes: 9 additions & 9 deletions docs/01-application-setup/01-ApplicationSettings.md
@@ -18,11 +18,11 @@ section for more details on working with date and time in Checkita Framework.
DateTime settings include the following:

* `timeZone` - Time zone in which string representation of reference date and execution date are parsed and rendered.
*Optional, default is `"UTC"`*.
*Optional, default is `"UTC"`*.
* `referenceDateFormat` - datetime format used to parse and render reference date.
*Optional, default is `"yyyy-MM-dd'T'HH:mm:ss.SSS"`*.
*Optional, default is `"yyyy-MM-dd'T'HH:mm:ss.SSS"`*.
* `executionDateFormat` - datetime format used to parse and render execution date.
*Optional, default is `"yyyy-MM-dd'T'HH:mm:ss.SSS"`*
*Optional, default is `"yyyy-MM-dd'T'HH:mm:ss.SSS"`*

If `dateTimeOptions` section is missing then default values are used for all parameters above.
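
For illustration, a fully specified `dateTimeOptions` section might look roughly like this (a minimal sketch assuming HOCON syntax for the application configuration; only the parameters and defaults listed above are used):

```hocon
dateTimeOptions: {
  timeZone: "UTC"                                    # zone used to parse and render reference/execution dates
  referenceDateFormat: "yyyy-MM-dd'T'HH:mm:ss.SSS"   # format used for the reference date
  executionDateFormat: "yyyy-MM-dd'T'HH:mm:ss.SSS"   # format used for the execution date
}
```

Since all three parameters are optional, omitting the section entirely is equivalent to the values shown.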

@@ -36,7 +36,7 @@ section for more details on running data quality checks over streaming sources.
* `window` - Window interval: defines tumbling window size used to accumulate metrics.
All metric results and checks are evaluated per each window once it is finalised. *Optional, default is `10m`*.
* `watermark` - Watermark level: defines time interval after which late records are no longer processed.
*Optional, default is `5m`*.
* `allowEmptyWindows` - Boolean flag indicating whether empty windows are allowed. Thus, in a situation when a window is
below the watermark and there are no results for some of the processed streams, all related checks will be skipped
if this flag is set to `true`. Otherwise, checks will be processed and will return error status with
@@ -52,20 +52,20 @@ Section `enablers` of application configuration file defines various boolean switches
that control various aspects of data quality job execution:

* `allowSqlQueries` - Enables usage of arbitrary SQL queries in data quality job configuration.
*Optional, default is `false`*
* `allowNotifications` - Enables notifications to be sent from DQ application.
*Optional, default is `false`*
* `aggregatedKafkaOutput` - Enables sending aggregated messages for Kafka Targets (one per each target type).
By default, Kafka messages are sent per each result entity.
*Optional, default is `false`*
* `enableCaseSensitivity` - Enables column case sensitivity. Controls column names comparison and lookup.
*Optional, default is `false`*
* `errorDumpSize` - Maximum number of errors to be collected per single metric. Framework is able to collect source
data rows where metric evaluation yielded some errors. But in order to prevent OOM errors, the number of collected errors
has to be limited to a reasonable value. Thus, the maximum allowable number of errors per metric is `10000`.
It is possible to lower this number by setting this parameter. *Optional, default is `10000`*
* `outputRepartition` - Sets the number of partitions when writing outputs. By default, a single file is written.
*Optional, default is `1`*
* `metricEngineAPI` - Sets engine to be used for regular metric processing: `rdd` (RDD-engine) or `df` (DF-engine) are
available. It is recommended to use DF-engine for batch applications while streaming applications support only
RDD-engine. *Optional, default is `rdd`*.
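
As a rough example, an `enablers` section that switches on SQL queries and notifications while keeping other options close to their defaults might look as follows (a sketch assuming HOCON syntax; parameter names are taken from the list above, the values are hypothetical):

```hocon
enablers: {
  allowSqlQueries: true         # permit arbitrary SQL queries in job configurations
  allowNotifications: true      # allow the DQ application to send notifications
  aggregatedKafkaOutput: false  # one Kafka message per result entity (default behaviour)
  enableCaseSensitivity: false  # column names are matched case-insensitively
  errorDumpSize: 1000           # collect at most 1000 errors per metric (hard cap is 10000)
  outputRepartition: 1          # write outputs as a single file
  metricEngineAPI: "df"         # DF-engine, recommended for batch applications
}
```
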
10 changes: 5 additions & 5 deletions docs/01-application-setup/04-APIServer.md
@@ -12,11 +12,11 @@ with Checkita configuration and results.
Thus, at the current moment the API server supports the following functionality:

* Configuration:
    * Validation of application configuration.
    * Validation of job configuration.
* DQ Storage:
    * Fetch overall summary for all jobs in DQ storage.
    * Fetch actual job state that was run.
    * Fetch job results for given datetime interval.

See [Swagger Doc](../swagger/index.md#swagger-doc) for more details on Checkita API Server methods.
6 changes: 3 additions & 3 deletions docs/01-application-setup/index.md
@@ -19,8 +19,8 @@ this mode of operation. A typical architecture for working with Checkita Data Qu
* Spark Application is started.
* Spark Application loads the sources described in the configuration file (HDFS, S3, Hive, external databases),
calculates metrics, performs checks and saves the results:
    * The main results are saved in the framework database.
    * Additionally, results and notifications are sent via channels configured in the pipeline.
* Based on the results, dashboards are formed to monitor data quality
(not included in the functionality of this framework).

@@ -29,4 +29,4 @@ however, this functionality is currently in experimental state and is subject to
information on running quality checks over streaming sources, please see
[Data Quality Checks over Streaming Sources](../02-general-information/05-StreamingMode.md) chapter.

-![image](../../diagrams/Architecture.png)
+![image](../diagrams/Architecture.png)
2 changes: 1 addition & 1 deletion docs/02-general-information/03-StatusModel.md
@@ -14,7 +14,7 @@ is how statuses are communicated with user:

* When computing metrics, status is obtained for each data row during metric increment step. If status other than
`Success` then metric error is collected for this particular row of data. Then, metric error reports can be requested
-as [Error Collection Targets](../03-job-configuration/08-Targets.md#error-collection-targets). For more information
+as [Error Collection Targets](../03-job-configuration/10-Targets.md#error-collection-targets). For more information
on metric error collection, see [Metric Error Collection](04-ErrorCollection.md) chapter.
* As for checks, status is their primary result output. Therefore, it is written into data quality storage along with
a detailed message.
7 changes: 5 additions & 2 deletions docs/02-general-information/04-ErrorCollection.md
@@ -20,6 +20,9 @@ additionally limited in the application settings by setting `errorDumpSize` para
See [Enablers](../01-application-setup/01-ApplicationSettings.md#enablers) chapter for more details.

Collected metric errors could be used to identify and debug problems in the data. In order to save or send metric error
-reports, [Error Collection Targets](../03-job-configuration/08-Targets.md#error-collection-targets) can be configured in
+reports, [Error Collection Targets](../03-job-configuration/10-Targets.md#error-collection-targets) can be configured in
`targets` section of job configuration. Note that error collection reports will contain excerpts from data and,
-therefore, should be communicated with caution. For the same reason they are never saved in Data Quality storage.
+therefore, should be communicated with caution. For the same reason it is up to the user to decide whether metric errors
+will be saved in Data Quality storage. This behaviour is controlled by the `saveErrorsToStorage` enabler within the
+[Storage Configuration](../01-application-setup/01-ApplicationSettings.md#storage-configuration) section of the application
+configuration.
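
For instance, the enabler might be set roughly as follows (a sketch assuming the storage section of the application configuration is a HOCON block; the `storage` block layout is an assumption, while the `saveErrorsToStorage` flag itself comes from the text above):

```hocon
storage: {
  # ... DQ storage connection settings ...
  saveErrorsToStorage: true  # persist collected metric errors in the Data Quality storage
}
```
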
36 changes: 16 additions & 20 deletions docs/02-general-information/05-StreamingMode.md
@@ -45,30 +45,26 @@ Summarizing, data quality streaming job processing routine consists of the following
* Start streaming queries from provided sources with `forEachBatch` sink.
* Start window processor in a separate thread.
* For each micro-batch (evaluated once per trigger interval) process data:

    * register metric error accumulator;
    * for each record increment metric calculators corresponding to the window to which record is assigned;
    * collect metric errors if any;
    * if record is late to current watermark, then it is skipped and metric calculators state is unchanged;
    * compute new watermark based on time values obtained from processed records;
    * update processor buffer state, which contains state of metric calculators for all windows as well as collected
      metric errors (also per each window). In addition, processor buffer tracks current watermark levels per each
      processed streaming source.
* Window processor checks processor buffer (also once per trigger interval) for windows that are completely below the
watermark level. **IMPORTANT** In order to support synchronised processing of multiple streaming sources, the minimum
watermark level is used (computed from current watermark levels of all the processed sources). This ensures that
window is finalised for all processed sources.
* Once finalised window is obtained, then for this window all data quality routines are performed:
    * metric results are retrieved from calculators;
    * composed metrics are calculated;
    * checks are performed;
    * results are stored in the data quality storage;
    * all targets are processed and results (or notifications) are sent to required channels.
    * checkpoints are saved if checkpoint directory is configured. **This is a new feature available since Checkita 2.0.**
    * processor buffer is cleared: state for processed window is removed.
* Streaming queries and window processor run until application is stopped (`sigterm` signal received) or error occurs.

**Important note on results saving**: since a set of results is generated per each processed window, then for each set of
2 changes: 1 addition & 1 deletion docs/03-job-configuration/02-Schemas.md
@@ -64,7 +64,7 @@ Fixed-short schema definition contains the following parameters:
* `id` - *Required*. Schema ID;
* `description` - *Optional*. Schema description;
* `schema` - *Required*. List of schema columns where each column is a string in format `columnName:columnWidth`.
*Type of columns is always a StringType.*
* `metadata` - *Optional*. List of user-defined metadata parameters specific to this schema where each parameter
is a string in format:`param.name=param.value`.
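
As an illustration, a fixed-short schema built from these parameters might be declared roughly like this (a sketch assuming HOCON job configuration; the enclosing `schemas` list and the `kind` field are assumptions, the remaining fields follow the parameter list above):

```hocon
schemas: [
  {
    id: "hr_fixed_schema"
    kind: "fixedShort"
    description: "Fixed-width layout of the HR extract"
    # each column is declared as columnName:columnWidth; all columns are read as StringType
    schema: ["employee_id:10", "full_name:40", "hire_date:10"]
    metadata: ["source.system=hr"]
  }
]
```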

37 changes: 14 additions & 23 deletions docs/03-job-configuration/08-Metrics.md
@@ -4,7 +4,7 @@ Calculation of various metrics over the data is the main part of Data Quality jo
various indicators that describe data from both technical and business points of view. Indicators in their turn can
signal about problems in the data.

-All metrics are linked to a source over which they are calculated. Most of the metrics are computed directly over
+Most of the metrics are linked to a source over which they are calculated. Most of the metrics are computed directly over
the data source. Such metrics are called `regular`. Apart from regular metrics there are two special kinds of metrics:

* `composed` metrics - can be calculated based on other metrics results thus allowing metric compositions.
@@ -190,8 +190,8 @@ Therefore, to prevent OOM errors for extremely large sequences, it is recommende
the [Approximate Sequence Completeness Metric](#approximate-sequence-completeness-metric), which uses HLL probabilistic
algorithm to estimate number of unique values.

-The required number of elements is determined by the formula: `(max_value - min_value) / increment + 1`,
-Where:
+The required number of elements is determined by the formula: `(max_value - min_value) / increment + 1`, where:

* `min_value` - the minimum value in the sequence;
* `max_value` - the maximum value in the sequence;
* `increment` - sequence step, default is 1.
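
For example, for a column expected to cover the identifier range from 1 to 10000 with the default increment of 1, the required number of elements evaluates to

$$
\frac{\text{max\_value} - \text{min\_value}}{\text{increment}} + 1 = \frac{10000 - 1}{1} + 1 = 10000,
$$

so if only 9900 distinct values are actually observed, the resulting sequence completeness would presumably be `9900 / 10000 = 0.99`.
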
@@ -229,8 +229,8 @@ Metric definition does not require additional parameters: `params` should not be

All minimum string metrics are defined in `minString` subsection.

-Metric increment returns `Failure` status for rows where all values in the specified columns are not castable
-to string and, therefore, minimum string length cannot be computed.
+Metric is not reversible. Metric increment returns `Failure` status for rows where all values in the specified
+columns cannot be cast to string and, therefore, minimum string length cannot be computed.

### Maximum String Metric

@@ -618,7 +618,7 @@ Calculates an arbitrary quantile for the values in the specified column. Metric
Additional parameters should be supplied:

* `accuracyError` - *Optional, default is `0.01`*. Accuracy error for calculation of quantile value.
-* `target` - *Required*. A number in the interval `[0, 1]` corresponding to the quantile that need to be caclulated.
+* `target` - *Required*. A number in the interval `[0, 1]` corresponding to the quantile that needs to be calculated.

Metric is not reversible and metric increment returns `Failure` status for rows where value in the specified column
cannot be cast to number.
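
As an illustration, a quantile metric computing an approximate 95th percentile might be declared roughly as follows (a sketch assuming HOCON job configuration; the `metrics.regular.quantile` nesting and the `source`/`columns` fields are assumptions, while `accuracyError` and `target` come from the parameter list above):

```hocon
metrics: {
  regular: {
    quantile: [
      {
        id: "salary_p95"
        description: "Approximate 95th percentile of the salary column"
        source: "hr_data"       # ID of the source the metric is computed over
        columns: ["salary"]     # values must be castable to number
        params: {
          accuracyError: 0.005  # tighter than the default of 0.01
          target: 0.95          # the quantile to calculate, in [0, 1]
        }
      }
    ]
  }
}
```
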
@@ -716,9 +716,6 @@ Metric definition does not require additional parameters: `params` should not be

**This metric works with exactly two columns.**

-> **IMPORTANT**. For the metric to be calculated, values in the specified columns must not be empty or null and
-> also can be cast to number (double). If at least one corrupt value is found, then metric calculator returns NaN value.
Metric is not reversible and metric increment returns `Failure` status for rows where some values in the specified
columns cannot be cast to number.

@@ -731,9 +728,6 @@ Metric definition does not require additional parameters: `params` should not be

**This metric works with exactly two columns.**

-> **IMPORTANT**. For the metric to be calculated, values in the specified columns must not be empty or null and
-> also can be cast to number (double). If at least one corrupt value is found, then metric calculator returns NaN value.
Metric is not reversible and metric increment returns `Failure` status for rows where some values in the specified
columns cannot be cast to number.

@@ -746,9 +740,6 @@ Metric definition does not require additional parameters: `params` should not be

**This metric works with exactly two columns.**

-> **IMPORTANT**. For the metric to be calculated, values in the specified columns must not be empty or null and
-> also can be cast to number (double). If at least one corrupt value is found, then metric calculator returns NaN value.
Metric is not reversible and metric increment returns `Failure` status for rows where some values in the specified
columns cannot be cast to number.

@@ -807,22 +798,22 @@ Thus, trend metrics are defined in `trend` subsection using the following set of parame

* `id` - *Required*. Trend metric ID;
* `description` - *Optional*. Trend metric description.
* `kind` - *Required*. Kind of statistic to be calculated over historical metric results.
    * Available trend metric kinds are: `avg`, `std`, `min`, `max`, `sum`, `median`, `firstQuartile`, `thirdQuartile`, `quantile`.
* `quantile` - *Required*. **ONLY FOR `quantile` TREND METRIC**. Quantile to compute over historical metric results
(must be a number in range `[0, 1]`).
* `lookupMetric` - *Required*. Lookup metric ID: metric which results will be pulled from DQ storage.
* `rule` - *Required*. The rule for loading historical metric results from DQ storage. There are two rules supported:
    * `record` - loads specified number of historical metric result records.
    * `datetime` - loads historical metric results for configured datetime window.
* `windowSize` - *Required*. Size of the window for which historical results are loaded:
    * If `rule` is set to `record` then window size is the number of records to retrieve.
    * If `rule` is set to `datetime` then window size is a duration string which should conform to Scala Duration.
* `windowOffset` - *Optional, default is `0` or `0s`*. Set window offset back from current reference date
(see [Working with Date and Time](../02-general-information/01-WorkingWithDateTime.md) chapter for more details on
reference date). By default, offset is absent and window start from current reference date (not including it).
    * If `rule` is set to `record` then window offset is the number of records to skip from reference date.
    * If `rule` is set to `datetime` then window offset is a duration string which should conform to Scala Duration.
* `metadata` - *Optional*. List of user-defined metadata parameters specific to this metric where each parameter
is a string in format: `param.name=param.value`.
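
Putting these parameters together, a trend metric that averages the last two weeks of results of some lookup metric might be configured roughly like this (a sketch assuming HOCON job configuration; the metric IDs are hypothetical, the field names follow the list above):

```hocon
trend: [
  {
    id: "avg_row_count_2w"
    description: "Average of hr_row_cnt results over the last two weeks"
    kind: "avg"                 # statistic computed over historical results
    lookupMetric: "hr_row_cnt"  # metric whose results are pulled from the DQ storage
    rule: "datetime"            # load results for a datetime window ...
    windowSize: "14d"           # ... covering the last 14 days (Scala Duration string)
    windowOffset: "0s"          # no offset back from the reference date
  }
]
```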

4 changes: 2 additions & 2 deletions docs/03-job-configuration/09-Checks.md
@@ -5,7 +5,7 @@ then checks can be configured to identify if there are any problems with quality

In Checkita there are two main groups of checks:

-* `Spanshot` checks - allows comparison of metric results with static thresholds or with other metric results in the
+* `Snapshot` checks - allows comparison of metric results with static thresholds or with other metric results in the
same Data Quality job.
* `Trend` checks - allows evaluation of how metric result is changing over a certain period of time. Checks of this type
are used to detect anomalies in data. In order for trend checks to work, it is required to set up Data Quality storage since
@@ -119,7 +119,7 @@ Expression checks represent a boolean expression referring to one or multiple me
`true` or `false`. Metrics must be referenced by their IDs.

Formula must be written using [Mustache Template](https://mustache.github.io/mustache.5.html) notation, e.g.:
-`{{ metric_1 }} + {{ metic_2 }}`.
+`{{ metric_1 }} + {{ metric_2 }}`.

The following operations are supported to build boolean expressions:
