Added info about "Rows count" and "Data loading times" #929

Merged 11 commits on Feb 21, 2024

49 additions and 0 deletions in `docs/website/docs/running-in-production/monitoring.md`

Using `dlt` [tracing](tracing.md), you can configure a [Sentry](https://sentry.io) DSN to start
receiving rich information on executed pipelines, including encountered errors and exceptions.
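
As a minimal sketch, the DSN can be supplied through dlt's environment-variable configuration before the pipeline runs (the DSN value below is a placeholder; substitute your project's real one):

```python
import os

# dlt resolves runtime settings from environment variables, so the
# Sentry DSN can be set before the pipeline is created and run.
# Placeholder DSN; replace with the value from your Sentry project.
os.environ["RUNTIME__SENTRY_DSN"] = "https://<key>@sentry.io/<project>"
```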


### Rows count
> **Reviewer (Contributor):** I think this info would fit better in the `## Data monitoring` section.
>
> **Author (Collaborator):** okay

To find the number of rows loaded per table, use the following command:

```shell
dlt pipeline <pipeline_name> trace
```

This command displays the names of the tables that were loaded and the number of rows in each.
For example, the output may look like this:

```shell
Step normalize COMPLETED in 2.37 seconds.
Normalized data for the following tables:
- _dlt_pipeline_state: 1 row(s)
- payments: 1329 row(s)
- tickets: 1492 row(s)
- orders: 2940 row(s)
- shipment: 2382 row(s)
- retailers: 1342 row(s)
```

To load this information back into the destination, you can use the following:
```python
import dlt

# Create a pipeline with the specified name, destination, and dataset
# (the names below are placeholders)
pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="duckdb",
    dataset_name="my_dataset",
)

# Run the pipeline (a tiny inline example stands in for your real source)
pipeline.run([{"id": 1}], table_name="example")

# Get the trace of the last run of the pipeline
# The trace contains timing information on the extract, normalize, and load steps
trace = pipeline.last_trace

# Load the trace information into a table named "_trace" in the destination
pipeline.run([trace], table_name="_trace")
```
This process loads several additional tables into the destination, which provide insights into
the extract, normalize, and load steps. The number of rows loaded for each table,
along with the corresponding `load_id`, can be found in the `_trace__steps__extract_info__table_metrics` table.
The `load_id` is an epoch timestamp that indicates when the loading was completed. Here is a graphical
representation of the rows loaded against `load_id` for different tables:

![image](https://storage.googleapis.com/dlt-blog-images/docs_monitoring_count_of_rows_vs_load_id.jpg)
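
Since the trace lands in ordinary tables, these metrics can also be queried back with dlt's SQL client. A minimal sketch, reusing the `pipeline` object from the example above (the column names `table_name` and `row_count` are assumptions here; check the schema dlt generates in your destination):

```python
# Works for SQL destinations such as duckdb; `pipeline` is the object
# created in the example above, after the trace has been loaded.
with pipeline.sql_client() as client:
    rows = client.execute_sql(
        "SELECT table_name, row_count, load_id "
        "FROM _trace__steps__extract_info__table_metrics"
    )
    for table_name, row_count, load_id in rows:
        print(table_name, row_count, load_id)
```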

### Data load time
The data loading time for each table can be obtained with the following command:

```shell
dlt pipeline <pipeline_name> load-package
```

> **Reviewer (Contributor):** is it possible to get this info with Python? Maybe from the load info? I would like to have this info here too.
>
> **Author (Collaborator):** done
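
In Python, one way to get similar information is from the `LoadInfo` object returned by `pipeline.run()` and from the step timings recorded in the trace. A minimal sketch, reusing the `pipeline` from the earlier example (treat the exact attribute names as assumptions based on dlt's trace objects):

```python
# run() returns a LoadInfo object; printing it summarizes the load
# packages and the status of each load job.
load_info = pipeline.run([{"id": 2}], table_name="example")
print(load_info)

# The trace records when each step (extract, normalize, load) started
# and finished, so per-step timings can be inspected.
for step in pipeline.last_trace.steps:
    print(step.step, step.started_at, step.finished_at)
```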

## Data monitoring

Data quality monitoring is concerned with ensuring that quality data arrives in the data warehouse