Commit: run grammar checker
sh-rp committed Nov 25, 2024
1 parent 460020c commit 87f50cc
Showing 6 changed files with 78 additions and 78 deletions.
@@ -6,8 +6,7 @@ keywords: [destination, schema, data, monitoring, testing, quality]

# Data quality dashboards

After deploying a `dlt` pipeline, you might ask yourself: How can we know if the data is and remains
high quality?
After deploying a `dlt` pipeline, you might ask yourself: How can we know if the data is and remains high quality?

There are two ways to catch errors:

@@ -16,9 +15,7 @@ There are two ways to catch errors:

## Tests

The first time you load data from a pipeline you have built, you will likely want to test it. Plot
the data on time series line charts and look for any interruptions or spikes, which will highlight
any gaps or loading issues.
The first time you load data from a pipeline you have built, you will likely want to test it. Plot the data on time series line charts and look for any interruptions or spikes, which will highlight any gaps or loading issues.

### Data usage as monitoring

75 changes: 38 additions & 37 deletions docs/website/docs/general-usage/dataset-access/dataset.md
@@ -4,13 +4,13 @@ description: Conveniently accessing the data loaded to any destination in python
keywords: [destination, schema, data, access, retrieval]
---

# Accessing Loaded Data in Python
# Accessing loaded data in Python

This guide explains how to access and manipulate data that has been loaded into your destination using the `dlt` Python library. After running your pipelines and loading data, you can use the `ReadableDataset` and `ReadableRelation` classes to interact with your data programmatically.

**Note:** The `ReadableDataset` and `ReadableRelation` objects are **lazy-loading**. They will only query and retrieve data when you perform an action that requires it, such as fetching data into a DataFrame or iterating over the data. This means that simply creating these objects does not load data into memory, making your code more efficient.

## Quick Start Example
## Quick start example

Here's a full example of how to retrieve data from a pipeline and load it into a Pandas DataFrame or a PyArrow Table.

@@ -31,7 +31,7 @@ df = items_relation.df()
arrow_table = items_relation.arrow()
```

## Getting Started
## Getting started

Assuming you have a `Pipeline` object (let's call it `pipeline`), you can obtain a `ReadableDataset` and access your tables as `ReadableRelation` objects.

@@ -42,7 +42,7 @@ Assuming you have a `Pipeline` object (let's call it `pipeline`), you can obtain
dataset = pipeline._dataset()
```

### Access Tables as `ReadableRelation`
### Access tables as `ReadableRelation`

You can access tables in your dataset using either attribute access or item access.

@@ -54,11 +54,11 @@ items_relation = dataset.items
items_relation = dataset["items"]
```

## Reading Data
## Reading data

Once you have a `ReadableRelation`, you can read data in various formats and sizes.

### Fetch the Entire Table
### Fetch the entire table

:::caution
Loading full tables into memory without limiting or iterating over them can consume a large amount of memory and may cause your program to crash if the table is too large. It's recommended to use chunked iteration or apply limits when dealing with large datasets.
@@ -76,17 +76,17 @@ df = items_relation.df()
arrow_table = items_relation.arrow()
```

#### As a List of Python Tuples
#### As a list of Python tuples

```py
items_list = items_relation.fetchall()
```

## Lazy Loading Behavior
## Lazy loading behavior

The `ReadableDataset` and `ReadableRelation` objects are **lazy-loading**. This means that they do not immediately fetch data when you create them. Data is only retrieved when you perform an action that requires it, such as calling `.df()`, `.arrow()`, or iterating over the data. This approach optimizes performance and reduces unnecessary data loading.
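
A minimal sketch of this behavior, assuming a `pipeline` that has already loaded an `items` table:

```py
# No data is fetched from the destination on these two lines;
# the objects only describe what will be read.
dataset = pipeline._dataset()
items_relation = dataset.items

# Data is only retrieved here, when .df() materializes it into a DataFrame.
df = items_relation.df()
```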

## Iterating Over Data in Chunks
## Iterating over data in chunks

To handle large datasets efficiently, you can process data in smaller chunks.

@@ -106,48 +106,48 @@ for arrow_chunk in items_relation.iter_arrow(chunk_size=500):
pass
```

### Iterate as Lists of Tuples
### Iterate as lists of tuples

```py
for items_chunk in items_relation.iter_fetch(chunk_size=500):
# Process each chunk of tuples
pass
```

The methods availableon the ReadableRelation correspond to the methods available on the cursor returned by the sql client. Please refer to the [sql client](./sql-client.md#supported-methods-on-the-cursor) guide for more information.
The methods available on the ReadableRelation correspond to the methods available on the cursor returned by the SQL client. Please refer to the [SQL client](./sql-client.md#supported-methods-on-the-cursor) guide for more information.

## Modifying Queries
## Modifying queries

You can refine your data retrieval by limiting the number of records, selecting specific columns, or chaining these operations.

### Limit the Number of Records
### Limit the number of records

```py
# Get the first 50 items as a PyArrow table
arrow_table = items_relation.limit(50).arrow()
```

#### Using `head()` to Get the First 5 Records
#### Using `head()` to get the first 5 records

```py
df = items_relation.head().df()
```

### Select Specific Columns
### Select specific columns

```py
# Select only 'col1' and 'col2' columns
items_list = items_relation.select("col1", "col2").fetchall()

# alternate notation with brackets
# Alternate notation with brackets
items_list = items_relation[["col1", "col2"]].fetchall()

# only get one column
# Only get one column
items_list = items_relation["col1"].fetchall()

```

### Chain Operations
### Chain operations

You can combine `select`, `limit`, and other methods.

@@ -156,47 +156,47 @@ You can combine `select`, `limit`, and other methods.
arrow_table = items_relation.select("col1", "col2").limit(50).arrow()
```

## Supported Destinations
## Supported destinations

All SQL and filesystem destinations supported by `dlt` can utilize this data access interface. For filesystem destinations, `dlt` [uses **DuckDB** under the hood](./sql-client.md#the-filesystem-sql-client) to create views from Parquet or JSONL files dynamically. This allows you to query data stored in files using the same interface as you would with SQL databases. If you plan on accessing data in buckets or the filesystem a lot this way, it is adviced to load data as parquet instead of jsonl, as **DuckDB** is able to only load the parts of the data actually needed for the query to work.
All SQL and filesystem destinations supported by `dlt` can utilize this data access interface. For filesystem destinations, `dlt` [uses **DuckDB** under the hood](./sql-client.md#the-filesystem-sql-client) to create views from Parquet or JSONL files dynamically. This allows you to query data stored in files using the same interface as you would with SQL databases. If you plan on accessing data in buckets or the filesystem a lot this way, it is advised to load data as parquet instead of jsonl, as **DuckDB** is able to only load the parts of the data actually needed for the query to work.
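
A short sketch of loading as Parquet, assuming a pipeline that targets the `filesystem` destination (the source `my_source` is a placeholder; `loader_file_format` is the `pipeline.run()` argument that controls the file format):

```py
import dlt

# Writing Parquet instead of JSONL lets the DuckDB-backed views read only
# the row groups and columns a given query actually needs.
pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="filesystem", dataset_name="my_data")
info = pipeline.run(my_source(), loader_file_format="parquet")
```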

## Examples

### Fetch One Record as a Tuple
### Fetch one record as a tuple

```py
record = items_relation.fetchone()
```

### Fetch Many Records as Tuples
### Fetch many records as tuples

```py
records = items_relation.fetchmany(chunk_size=10)
```

### Iterate Over Data with Limit and Column Selection
### Iterate over data with limit and column selection

**Note:** When iterating over filesystem tables, the underlying DuckDB may give you a different chunksize depending on the size of the parquet files the table is based on.
**Note:** When iterating over filesystem tables, the underlying DuckDB may give you a different chunk size depending on the size of the parquet files the table is based on.

```py

# dataframes
# Dataframes
for df_chunk in items_relation.select("col1", "col2").limit(100).iter_df(chunk_size=20):
...

# arrow tables
# Arrow tables
for arrow_table in items_relation.select("col1", "col2").limit(100).iter_arrow(chunk_size=20):
...

# python tuples
# Python tuples
for records in items_relation.select("col1", "col2").limit(100).iter_fetch(chunk_size=20):
# Process each modified DataFrame chunk
...
```

## Advanced Usage
## Advanced usage

### Using custom sql queries to create `ReadableRelations`
### Using custom SQL queries to create `ReadableRelations`

You can use custom SQL queries directly on the dataset to create a `ReadableRelation`:

@@ -211,27 +211,28 @@ arrow_table = custom_relation.arrow()

### Loading a `ReadableRelation` into a pipeline table

Since the iter_arrow and iter_df methods are generators that iterate over the full ReadableRelation in chunks, you can load use them as a resource for another (or even the same) dlt pipeline:
Since the iter_arrow and iter_df methods are generators that iterate over the full ReadableRelation in chunks, you can use them as a resource for another (or even the same) dlt pipeline:

```py
# create a readable relation with a limit of 1m rows
# Create a readable relation with a limit of 1m rows
limited_items_relation = dataset.items.limit(1_000_000)

# create a new pipeline
# Create a new pipeline
other_pipeline = ...

# we can now load these 1m rows into this pipeline in 10k chunks
# We can now load these 1m rows into this pipeline in 10k chunks
other_pipeline.run(limited_items_relation.iter_arrow(chunk_size=10_000), table_name="limited_items")
```

### Using `ibis` to query the data

Visit the [Native Ibis integration](./ibis-backend.md) guide to learn more.

## Important Considerations
## Important considerations

- **Memory Usage:** Loading full tables into memory without iterating or limiting can consume significant memory, potentially leading to crashes if the dataset is large. Always consider using limits or chunked iteration.
- **Memory usage:** Loading full tables into memory without iterating or limiting can consume significant memory, potentially leading to crashes if the dataset is large. Always consider using limits or chunked iteration.

- **Lazy Evaluation:** `ReadableDataset` and `ReadableRelation` objects delay data retrieval until necessary. This design improves performance and resource utilization.
- **Lazy evaluation:** `ReadableDataset` and `ReadableRelation` objects delay data retrieval until necessary. This design improves performance and resource utilization.

- **Custom SQL queries:** When executing custom SQL queries, remember that additional methods like `limit()` or `select()` won't modify the query. Include all necessary clauses directly in your SQL statement.

- **Custom SQL Queries:** When executing custom SQL queries, remember that additional methods like `limit()` or `select()` won't modify the query. Include all necessary clauses directly in your SQL statement.
11 changes: 6 additions & 5 deletions docs/website/docs/general-usage/dataset-access/ibis-backend.md
@@ -8,10 +8,10 @@ keywords: [data, dataset, ibis]

Ibis is a powerful portable Python dataframe library. Learn more about what it is and how to use it in the [official documentation](https://ibis-project.org/).

`dlt` provides an easy way to handoveor your loaded dataset to an Ibis backend connection.
`dlt` provides an easy way to hand over your loaded dataset to an Ibis backend connection.

:::tip
Not all destinations supported by `dlt` have an equivalent Ibis backend. Natively supported destinations include DuckDB (including Motherduck), Postgres, Redshift, Snowflake, Clickhouse, MSSQL (including Synapse) and BigQuery. The filesystem destination is supported via the [Filesystem SQL client](./sql-client#the-filesystem-sql-client), please install the duckdb backend for ibis to use it. Mutating data with ibis on the filesystem will not result in any actual changes to the persisted files.
Not all destinations supported by `dlt` have an equivalent Ibis backend. Natively supported destinations include DuckDB (including Motherduck), Postgres, Redshift, Snowflake, Clickhouse, MSSQL (including Synapse), and BigQuery. The filesystem destination is supported via the [Filesystem SQL client](./sql-client#the-filesystem-sql-client); please install the duckdb backend for ibis to use it. Mutating data with ibis on the filesystem will not result in any actual changes to the persisted files.
:::

## Prerequisites
@@ -24,15 +24,15 @@ pip install ibis-framework[duckdb]

## Get an ibis connection from your dataset

Dlt datasets have a helper method to return an ibis connection to the destination they live on. The returned object is a native ibis connection to the destination which you can use to read and even transform data. Please consult the [ibis documentation](https://ibis-project.org/docs/backends/) to learn more about what you can do with ibis.
dlt datasets have a helper method to return an ibis connection to the destination they live on. The returned object is a native ibis connection to the destination, which you can use to read and even transform data. Please consult the [ibis documentation](https://ibis-project.org/docs/backends/) to learn more about what you can do with ibis.

```py

# get the dataset from the pipeline
dataset = pipeline._dataset()
dataset_name = pipeline.dataset_name

# get the native ibis connection form the dataset
# get the native ibis connection from the dataset
ibis_connection = dataset.ibis()

# list all tables in the dataset
@@ -46,4 +46,5 @@ table = ibis_connection.table("items", database=dataset_name)
print(table.limit(10).execute())

# Visit the ibis docs to learn more about the available methods
```
```

6 changes: 3 additions & 3 deletions docs/website/docs/general-usage/dataset-access/index.md
@@ -10,9 +10,9 @@ import DocCardList from '@theme/DocCardList';
After one or more successful runs of your pipeline, you can inspect or access the loaded data in various ways:

* We have a simple [`streamlit` app](./streamlit.md) that you can use to view your data locally in your webapp.
* We have a [python interface](./dataset.md) that allows you to access your data in python as python tuples, `arrow` tables or `pandas` dataframes with a simple dataset object or an sql interface. You can even run sql commands on the filesystem destination via `DuckDB` or forward data from any table into another pipeline.
* We have an [`ibis` interface](./ibis-backend.md) that allows you to use hand over your loaded data to the powerful [ibis-framework](https://ibis-project.org/) library.
* Lastly we have some advice for [monitoring and ensuring the quality of your data](./data-quality-dashboard.md).
* We have a [Python interface](./dataset.md) that allows you to access your data in Python as Python tuples, `arrow` tables, or `pandas` dataframes with a simple dataset object or an SQL interface. You can even run SQL commands on the filesystem destination via `DuckDB` or forward data from any table into another pipeline.
* We have an [`ibis` interface](./ibis-backend.md) that allows you to hand over your loaded data to the powerful [ibis-framework](https://ibis-project.org/) library.
* Lastly, we have some advice for [monitoring and ensuring the quality of your data](./data-quality-dashboard.md).

# Learn more
<DocCardList />