From e16349764cd024459ddce1780ce18fcb29204631 Mon Sep 17 00:00:00 2001 From: dat-a-man Date: Fri, 2 Feb 2024 09:57:46 +0530 Subject: [PATCH 01/10] Added info about "Rows count" and "Data loading times" --- .../docs/running-in-production/monitoring.md | 21 +++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git a/docs/website/docs/running-in-production/monitoring.md b/docs/website/docs/running-in-production/monitoring.md index c9b427fd4e..bfab04fe51 100644 --- a/docs/website/docs/running-in-production/monitoring.md +++ b/docs/website/docs/running-in-production/monitoring.md @@ -48,6 +48,27 @@ GitHub Actions workflow DAG: Using `dlt` [tracing](tracing.md), you can configure [Sentry](https://sentry.io) DSN to start receiving rich information on executed pipelines, including encountered errors and exceptions. + +### Rows count +To find the number of rows loaded per table, use the following command: + +```shell +dlt pipeline <pipeline_name> trace +``` + +This command will display the names of the tables that were loaded and the number of rows in each table.
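For monitoring dashboards, it can be handy to turn that printed output into data. Below is a minimal sketch: `parse_row_counts` is a hypothetical helper (not part of dlt) that assumes the `- <table>: <n> row(s)` lines the trace output prints, as shown later on this page:

```python
import re

def parse_row_counts(trace_output: str) -> dict:
    # Hypothetical helper: pull "- <table>: <n> row(s)" lines out of the
    # printed trace output and return a {table_name: row_count} mapping.
    return {
        m.group(1): int(m.group(2))
        for m in re.finditer(r"^- (\w+): (\d+) row\(s\)$", trace_output, re.MULTILINE)
    }

sample = """Normalized data for the following tables:
- _dlt_pipeline_state: 1 row(s)
- payments: 1329 row(s)
"""
print(parse_row_counts(sample))  # {'_dlt_pipeline_state': 1, 'payments': 1329}
```

The parsed mapping can then be pushed to whatever metrics system you already use.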
+For example, below is the plot of the monitored source showing the number of resources on the X-axis and +the number of rows loaded on the Y-axis: + +image + +### Data load times +Data loading times for each table can be obtained using the following command: + +```shell +dlt pipeline <pipeline_name> load-package +``` + ## Data monitoring Data quality monitoring is considered with ensuring that quality data arrives to the data warehouse From ef8eb96b2945769ea16a7f43a460e558bf68273b Mon Sep 17 00:00:00 2001 From: dat-a-man Date: Fri, 2 Feb 2024 10:03:30 +0530 Subject: [PATCH 02/10] Added info about "Rows count" and "Data loading times" --- docs/website/docs/running-in-production/monitoring.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/website/docs/running-in-production/monitoring.md b/docs/website/docs/running-in-production/monitoring.md index bfab04fe51..2589a54ea5 100644 --- a/docs/website/docs/running-in-production/monitoring.md +++ b/docs/website/docs/running-in-production/monitoring.md @@ -60,7 +60,7 @@ This command will display the names of the tables that were loaded and the numbe For example, below is the plot of the monitored source showing the number of resources on the X-axis and the number of rows loaded on the Y-axis: -image +![image](https://storage.googleapis.com/dlt-blog-images/docs_data_monitoring_rows_count_.png) ### Data load times Data loading times for each table can be obtained using the following command: From 57b18d904ebd576a0ba7d542dac0e0abd28d3fd5 Mon Sep 17 00:00:00 2001 From: dat-a-man Date: Fri, 2 Feb 2024 10:08:35 +0530 Subject: [PATCH 03/10] Added info about "Rows count" and "Data loading times" --- docs/website/docs/running-in-production/monitoring.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/website/docs/running-in-production/monitoring.md b/docs/website/docs/running-in-production/monitoring.md index 2589a54ea5..77ed7ba5c9 100644 --- a/docs/website/docs/running-in-production/monitoring.md +++
b/docs/website/docs/running-in-production/monitoring.md @@ -60,7 +60,7 @@ This command will display the names of the tables that were loaded and the numbe For example, below is the plot of the monitored source showing the number of resources on the X-axis and the number of rows loaded on the Y-axis: -![image](https://storage.googleapis.com/dlt-blog-images/docs_data_monitoring_rows_count_.png) +![image](https://storage.googleapis.com/dlt-blog-images/docs_data_monitoring_rows_count) ### Data load times Data loading times for each table can be obtained using the following command: From 3dafdd93734d7ab3fc4cb9695f97c9fe54a6f1bb Mon Sep 17 00:00:00 2001 From: dat-a-man Date: Fri, 2 Feb 2024 10:10:25 +0530 Subject: [PATCH 04/10] Added info about "Rows count" and "Data loading times" --- docs/website/docs/running-in-production/monitoring.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/website/docs/running-in-production/monitoring.md b/docs/website/docs/running-in-production/monitoring.md index 77ed7ba5c9..812d91e6df 100644 --- a/docs/website/docs/running-in-production/monitoring.md +++ b/docs/website/docs/running-in-production/monitoring.md @@ -62,8 +62,8 @@ the number of rows loaded on the Y-axis: ![image](https://storage.googleapis.com/dlt-blog-images/docs_data_monitoring_rows_count) -### Data load times -Data loading times for each table can be obtained using the following command: +### Data load time +Data loading time for each table can be obtained using the following command: ```shell dlt pipeline <pipeline_name> load-package ``` From a751d752c7e82b38052950fe9eadfde04b6dfa52 Mon Sep 17 00:00:00 2001 From: dat-a-man Date: Fri, 2 Feb 2024 10:11:12 +0530 Subject: [PATCH 05/10] Update --- docs/website/docs/running-in-production/monitoring.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/website/docs/running-in-production/monitoring.md b/docs/website/docs/running-in-production/monitoring.md index 812d91e6df..6e27d62cf7 100644 ---
a/docs/website/docs/running-in-production/monitoring.md +++ b/docs/website/docs/running-in-production/monitoring.md @@ -63,7 +63,7 @@ the number of rows loaded on the Y-axis: ![image](https://storage.googleapis.com/dlt-blog-images/docs_data_monitoring_rows_count) ### Data load time -Data loading time for each table can be obtained using the following command: +Data loading time for each table can be obtained by using the following command: ```shell dlt pipeline <pipeline_name> load-package ``` From 01de0581a77ede242ea63a08b4427e202e08f574 Mon Sep 17 00:00:00 2001 From: dat-a-man Date: Fri, 2 Feb 2024 10:19:21 +0530 Subject: [PATCH 06/10] Update --- .../website/docs/running-in-production/monitoring.md | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/docs/website/docs/running-in-production/monitoring.md b/docs/website/docs/running-in-production/monitoring.md index 6e27d62cf7..e327c3430e 100644 --- a/docs/website/docs/running-in-production/monitoring.md +++ b/docs/website/docs/running-in-production/monitoring.md @@ -57,10 +57,16 @@ dlt pipeline <pipeline_name> trace ``` This command will display the names of the tables that were loaded and the number of rows in each table. -For example, below is the plot of the monitored source showing the number of resources on the X-axis and -the number of rows loaded on the Y-axis: +The above command displays the row counts for the source, as shown below: -![image](https://storage.googleapis.com/dlt-blog-images/docs_data_monitoring_rows_count) +```shell +Step normalize COMPLETED in 0.97 seconds.
+Normalized data for the following tables: +- _dlt_pipeline_state: 1 row(s) +- players_games: 1179 row(s) +- players_online_status: 4 row(s) +- players_profiles: 4 row(s) +``` ### Data load time Data loading time for each table can be obtained by using the following command: From a23e4ad36d354bb13c7086ab2353ce901a632dd2 Mon Sep 17 00:00:00 2001 From: dat-a-man Date: Sat, 3 Feb 2024 12:10:58 +0530 Subject: [PATCH 07/10] Update --- .../docs/running-in-production/monitoring.md | 20 +++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/docs/website/docs/running-in-production/monitoring.md b/docs/website/docs/running-in-production/monitoring.md index e327c3430e..e45d10d14b 100644 --- a/docs/website/docs/running-in-production/monitoring.md +++ b/docs/website/docs/running-in-production/monitoring.md @@ -68,6 +68,26 @@ Normalized data for the following tables: - players_profiles: 4 row(s) ``` +To load this information back to the destination, you can use the following: +```python +pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb", dataset_name="my_dataset")  # create a pipeline; replace with your own name, destination, and dataset +pipeline.run(source)  # run the pipeline with your source + +# Get the trace of the last run of the pipeline +# The trace contains timing information on extract, normalize, and load steps +trace = pipeline.last_trace + +# Load the trace information into a table named "_trace" in the destination +pipeline.run([trace], table_name="_trace") +``` +This process loads several additional tables to the destination, which provide insights into +the extract, normalize, and load steps. Information on the number of rows loaded for each table, +along with the `load_id`, can be found in the `_trace__steps__extract_info__table_metrics` table. +The `load_id` is an epoch timestamp that indicates when the loading was completed.
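Since the `load_id` is an epoch timestamp, it can be converted to a readable datetime when inspecting these tables; a small sketch, using a hypothetical `load_id` value and assuming UTC:

```python
from datetime import datetime, timezone

load_id = "1708444910.123456"  # hypothetical load_id value from the trace table
loaded_at = datetime.fromtimestamp(float(load_id), tz=timezone.utc)
print(loaded_at.year, loaded_at.month)  # 2024 2
```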
Here's a graphical +representation of the rows loaded with `load_id` for different tables: + +![image](https://storage.googleapis.com/dlt-blog-images/docs_monitoring_count_of_rows_vs_load_id.jpg) + ### Data load time Data loading time for each table can be obtained by using the following command: From 55f5f9f42c2205c6bae59c661147a18930b75edb Mon Sep 17 00:00:00 2001 From: dat-a-man Date: Sat, 3 Feb 2024 14:18:15 +0530 Subject: [PATCH 08/10] Updated table --- docs/website/docs/running-in-production/monitoring.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/docs/website/docs/running-in-production/monitoring.md b/docs/website/docs/running-in-production/monitoring.md index e45d10d14b..b1531c346f 100644 --- a/docs/website/docs/running-in-production/monitoring.md +++ b/docs/website/docs/running-in-production/monitoring.md @@ -63,9 +63,11 @@ The above command displays the row counts for the source, as shown below: Step normalize COMPLETED in 0.97 seconds. Normalized data for the following tables: - _dlt_pipeline_state: 1 row(s) -- players_games: 1179 row(s) -- players_online_status: 4 row(s) -- players_profiles: 4 row(s) +- payments: 1329 row(s) +- tickets: 1492 row(s) +- orders: 2940 row(s) +- shipment: 2382 row(s) +- retailers: 1342 row(s) ``` To load this information back to the destination, you can use the following: From e4ac3eff299b0d76f16f1df2e7da703d6df9e89f Mon Sep 17 00:00:00 2001 From: dat-a-man Date: Sat, 3 Feb 2024 14:22:06 +0530 Subject: [PATCH 09/10] Updated table --- docs/website/docs/running-in-production/monitoring.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/website/docs/running-in-production/monitoring.md b/docs/website/docs/running-in-production/monitoring.md index b1531c346f..7cc34834f2 100644 --- a/docs/website/docs/running-in-production/monitoring.md +++ b/docs/website/docs/running-in-production/monitoring.md @@ -60,7 +60,7 @@ This command will display the names of the tables that were loaded
and the numbe The above command displays the row counts for the source, as shown below: ```shell -Step normalize COMPLETED in 0.97 seconds. +Step normalize COMPLETED in 2.37 seconds. Normalized data for the following tables: - _dlt_pipeline_state: 1 row(s) - payments: 1329 row(s) From dc95e6b6aea92f80cdf023363bc177d2b226f6fa Mon Sep 17 00:00:00 2001 From: dat-a-man <98139823+dat-a-man@users.noreply.github.com> Date: Tue, 20 Feb 2024 16:21:50 +0000 Subject: [PATCH 10/10] Updated --- .../docs/running-in-production/monitoring.md | 28 ++++++++++++------- 1 file changed, 18 insertions(+), 10 deletions(-) diff --git a/docs/website/docs/running-in-production/monitoring.md b/docs/website/docs/running-in-production/monitoring.md index 7cc34834f2..8532bac36b 100644 --- a/docs/website/docs/running-in-production/monitoring.md +++ b/docs/website/docs/running-in-production/monitoring.md @@ -48,6 +48,18 @@ GitHub Actions workflow DAG: Using `dlt` [tracing](tracing.md), you can configure [Sentry](https://sentry.io) DSN to start receiving rich information on executed pipelines, including encountered errors and exceptions. +## Data monitoring + +Data quality monitoring is concerned with ensuring that quality data arrives in the data warehouse +on time. The reason we monitor rather than alert is that we cannot easily define +alerts for everything that could go wrong. + +This is why we want to capture enough context to allow a person to decide if the data looks OK or +requires further investigation when monitoring the data quality. Staples of monitoring are line +charts and time-series charts that provide a baseline or a pattern that a person can interpret. + +For example, to monitor data loading, consider plotting "count of records by `loaded_at` date/hour", +"created at", "modified at", or other recency markers.
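The "count of records by hour" idea above can be computed with plain Python before handing the result to a charting tool; a toy sketch with hypothetical timestamps (not dlt API):

```python
from collections import Counter
from datetime import datetime

# Hypothetical loaded_at values pulled from the destination
loaded_at = [
    datetime(2024, 2, 20, 9, 5),
    datetime(2024, 2, 20, 9, 40),
    datetime(2024, 2, 20, 10, 15),
]
# Bucket timestamps by hour and count records in each bucket
per_hour = Counter(ts.strftime("%Y-%m-%d %H:00") for ts in loaded_at)
print(dict(per_hour))  # {'2024-02-20 09:00': 2, '2024-02-20 10:00': 1}
```

Plotting these buckets as a line chart gives exactly the kind of baseline a person can eyeball.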
### Rows count To find the number of rows loaded per table, use the following command: @@ -97,15 +109,11 @@ Data loading time for each table can be obtained by using the following command: dlt pipeline <pipeline_name> load-package ``` -## Data monitoring - -Data quality monitoring is considered with ensuring that quality data arrives to the data warehouse -on time. The reason we do monitoring instead of alerting for this is because we cannot easily define -alerts for what could go wrong. +The above information can also be obtained in a script as follows: -This is why we want to capture enough context to allow a person to decide if the data looks OK or -requires further investigation when monitoring the data quality. A staple of monitoring are line -charts and time-series charts that provide a baseline or a pattern that a person can interpret. +```python +info = pipeline.run(source, table_name="table_name", write_disposition="append") -For example, to monitor data loading, consider plotting "count of records by `loaded_at` date/hour", -"created at", "modified at", or other recency markers. +print(info.load_packages[0]) +``` +> `load_packages[0]` will print the information of the first load package in the list of load packages. \ No newline at end of file
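The load package printout includes timestamps, so per-package load time can be derived by differencing them; a minimal sketch using hypothetical start and finish values (the exact attribute names on a real load package may differ):

```python
from datetime import datetime, timezone

# Hypothetical package timestamps as they might appear in the printout
started_at = datetime(2024, 2, 20, 16, 21, 50, tzinfo=timezone.utc)
finished_at = datetime(2024, 2, 20, 16, 22, 5, tzinfo=timezone.utc)

# Duration of the load, in seconds
print((finished_at - started_at).total_seconds())  # 15.0
```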