From 047ad58c6ad4bcd8b31d0ba959fa6ae34d5846e4 Mon Sep 17 00:00:00 2001
From: David Scharf
Date: Sun, 24 Nov 2024 12:48:43 +0100
Subject: [PATCH] Apply suggestions from code review
Co-authored-by: Alena Astrakhantseva
---
 .../general-usage/dataset-access/dataset.md | 10 ++++---
 .../dataset-access/ibis-backend.md | 9 +++---
 .../dataset-access/sql-client.md | 30 +++++++++----------
 .../general-usage/dataset-access/streamlit.md | 12 ++++----
 4 files changed, 31 insertions(+), 30 deletions(-)
diff --git a/docs/website/docs/general-usage/dataset-access/dataset.md b/docs/website/docs/general-usage/dataset-access/dataset.md
index fb49157539..d05887e4d9 100644
--- a/docs/website/docs/general-usage/dataset-access/dataset.md
+++ b/docs/website/docs/general-usage/dataset-access/dataset.md
@@ -1,5 +1,5 @@
 ---
-title: Accessing Loaded Data in Python
+title: Accessing loaded data in Python
 description: Conveniently accessing the data loaded to any destination in python
 keywords: [destination, schema, data, access, retrieval]
 ---
@@ -158,7 +158,7 @@ arrow_table = items_relation.select("col1", "col2").limit(50).arrow()
 ## Supported destinations
-All SQL and filesystem destinations supported by `dlt` can utilize this data access interface. For filesystem destinations, `dlt` [uses **DuckDB** under the hood](./sql-client.md#the-filesystem-sql-client) to create views from Parquet or JSONL files dynamically. This allows you to query data stored in files using the same interface as you would with SQL databases. If you plan on accessing data in buckets or the filesystem a lot this way, it is advised to load data as parquet instead of jsonl, as **DuckDB** is able to only load the parts of the data actually needed for the query to work.
+All SQL and filesystem destinations supported by `dlt` can utilize this data access interface.
For filesystem destinations, `dlt` [uses **DuckDB** under the hood](./sql-client.md#the-filesystem-sql-client) to create views from Parquet or JSONL files dynamically. This allows you to query data stored in files using the same interface as you would with SQL databases. If you plan on accessing data in buckets or the filesystem a lot this way, it is advised to load data as Parquet instead of JSONL, as **DuckDB** is able to only load the parts of the data actually needed for the query to work.
 ## Examples
@@ -206,12 +206,14 @@ custom_relation = dataset("SELECT * FROM items JOIN other_items ON items.id = ot
 arrow_table = custom_relation.arrow()
 ```
-**Note:** When using custom SQL queries with `dataset()`, methods like `limit` and `select` won't work. Include any filtering or column selection directly in your SQL query.
+:::note
+When using custom SQL queries with `dataset()`, methods like `limit` and `select` won't work. Include any filtering or column selection directly in your SQL query.
+:::
 ### Loading a `ReadableRelation` into a pipeline table
-Since the iter_arrow and iter_df methods are generators that iterate over the full ReadableRelation in chunks, you can use them as a resource for another (or even the same) dlt pipeline:
+Since the `iter_arrow` and `iter_df` methods are generators that iterate over the full `ReadableRelation` in chunks, you can use them as a resource for another (or even the same) `dlt` pipeline:
 ```py
 # Create a readable relation with a limit of 1m rows
diff --git a/docs/website/docs/general-usage/dataset-access/ibis-backend.md b/docs/website/docs/general-usage/dataset-access/ibis-backend.md
index 7e17efce24..8f4b0fb6b6 100644
--- a/docs/website/docs/general-usage/dataset-access/ibis-backend.md
+++ b/docs/website/docs/general-usage/dataset-access/ibis-backend.md
@@ -11,23 +11,22 @@ Ibis is a powerful portable Python dataframe library.
Learn more about what it i
 `dlt` provides an easy way to hand over your loaded dataset to an Ibis backend connection.
 :::tip
-Not all destinations supported by `dlt` have an equivalent Ibis backend. Natively supported destinations include DuckDB (including Motherduck), Postgres, Redshift, Snowflake, Clickhouse, MSSQL (including Synapse), and BigQuery. The filesystem destination is supported via the [Filesystem SQL client](./sql-client#the-filesystem-sql-client); please install the duckdb backend for ibis to use it. Mutating data with ibis on the filesystem will not result in any actual changes to the persisted files.
+Not all destinations supported by `dlt` have an equivalent Ibis backend. Natively supported destinations include DuckDB (including Motherduck), Postgres, Redshift, Snowflake, Clickhouse, MSSQL (including Synapse), and BigQuery. The filesystem destination is supported via the [Filesystem SQL client](./sql-client#the-filesystem-sql-client); please install the DuckDB backend for Ibis to use it. Mutating data with Ibis on the filesystem will not result in any actual changes to the persisted files.
 :::
 ## Prerequisites
-To use the Ibis backend, you will need to have the `ibis-framework` package with the correct ibis extra installed. The following example will install the duckdb backend:
+To use the Ibis backend, you will need to have the `ibis-framework` package with the correct Ibis extra installed. The following example will install the DuckDB backend:
 ```sh
 pip install ibis-framework[duckdb]
 ```
-## Get an ibis connection from your dataset
+## Get an Ibis connection from your dataset
-dlt datasets have a helper method to return an ibis connection to the destination they live on. The returned object is a native ibis connection to the destination, which you can use to read and even transform data. Please consult the [ibis documentation](https://ibis-project.org/docs/backends/) to learn more about what you can do with ibis.
+`dlt` datasets have a helper method to return an Ibis connection to the destination they live on. The returned object is a native Ibis connection to the destination, which you can use to read and even transform data. Please consult the [Ibis documentation](https://ibis-project.org/docs/backends/) to learn more about what you can do with Ibis.
 ```py
-
 # get the dataset from the pipeline
 dataset = pipeline._dataset()
 dataset_name = pipeline.dataset_name
diff --git a/docs/website/docs/general-usage/dataset-access/sql-client.md b/docs/website/docs/general-usage/dataset-access/sql-client.md
index dc21581302..05371ed4e5 100644
--- a/docs/website/docs/general-usage/dataset-access/sql-client.md
+++ b/docs/website/docs/general-usage/dataset-access/sql-client.md
@@ -7,16 +7,16 @@ keywords: [data, dataset, sql]
 # The SQL client
 :::note
-This page contains technical details about the implementation of the SQL client as well as information on how to use low-level APIs. If you simply want to query your data, it's advised to read the pages in this section on accessing data via dlt datasets, streamlit, or ibis.
+This page contains technical details about the implementation of the SQL client as well as information on how to use low-level APIs. If you simply want to query your data, it's advised to read the pages in this section on accessing data via `dlt` datasets, Streamlit, or Ibis.
 :::
-Most dlt destinations use an implementation of the SqlClientBase class to connect to the physical destination to which your data is loaded. DDL statements, data insert or update commands, as well as SQL merge and replace queries, are executed via a connection on this client. It also is used for reading data for the [streamlit app](./streamlit.md) and [data access via dlt datasets](./dataset.md).
+Most `dlt` destinations use an implementation of the `SqlClientBase` class to connect to the physical destination to which your data is loaded.
DDL statements, data insert or update commands, as well as SQL merge and replace queries, are executed via a connection on this client. It also is used for reading data for the [Streamlit app](./streamlit.md) and [data access via `dlt` datasets](./dataset.md).
-All SQL destinations make use of an SQL client; additionally, the filesystem has a special implementation of the SQL client which you can read about below.
+All SQL destinations make use of an SQL client; additionally, the filesystem has a special implementation of the SQL client which you can read about [below](#the-filesystem-sql-client).
 ## Executing a query on the SQL client
-You can access the SQL client of your destination via the sql_client method on your pipeline. The code below shows how to use the SQL client to execute a query.
+You can access the SQL client of your destination via the `sql_client` method on your pipeline. The code below shows how to use the SQL client to execute a query.
 ```py
 pipeline = dlt.pipeline(destination="bigquery", dataset_name="crm")
@@ -31,9 +31,9 @@ with pipeline.sql_client() as client:
 ## Retrieving the data in different formats
-The cursor returned by execute_query has several methods for retrieving the data. The supported formats are Python tuples, pandas DataFrame, and Arrow table.
+The cursor returned by `execute_query` has several methods for retrieving the data. The supported formats are Python tuples, Pandas DataFrame, and Arrow table.
-The code below shows how to retrieve the data as a pandas DataFrame and then manipulate it in memory:
+The code below shows how to retrieve the data as a Pandas DataFrame and then manipulate it in memory:
 ```py
 pipeline = dlt.pipeline(...)
@@ -48,17 +48,17 @@ counts = reactions.sum(0).sort_values(0, ascending=False)
 ## Supported methods on the cursor
-- `fetchall()`: returns all rows as a list of tuples
-- `fetchone()`: returns a single row as a tuple
-- `fetchmany(size=None)`: returns a number of rows as a list of tuples; if no size is provided, all rows are returned
-- `df(chunk_size=None, **kwargs)`: returns the data as a pandas DataFrame; if chunk_size is provided, the data is retrieved in chunks of the given size
-- `arrow(chunk_size=None, **kwargs)`: returns the data as an Arrow table; if chunk_size is provided, the data is retrieved in chunks of the given size
-- `iter_fetch(chunk_size: int)`: iterates over the data in chunks of the given size as lists of tuples
-- `iter_df(chunk_size: int)`: iterates over the data in chunks of the given size as pandas DataFrames
-- `iter_arrow(chunk_size: int)`: iterates over the data in chunks of the given size as Arrow tables
+- `fetchall()`: returns all rows as a list of tuples;
+- `fetchone()`: returns a single row as a tuple;
+- `fetchmany(size=None)`: returns a number of rows as a list of tuples; if no size is provided, all rows are returned;
+- `df(chunk_size=None, **kwargs)`: returns the data as a Pandas DataFrame; if `chunk_size` is provided, the data is retrieved in chunks of the given size;
+- `arrow(chunk_size=None, **kwargs)`: returns the data as an Arrow table; if `chunk_size` is provided, the data is retrieved in chunks of the given size;
+- `iter_fetch(chunk_size: int)`: iterates over the data in chunks of the given size as lists of tuples;
+- `iter_df(chunk_size: int)`: iterates over the data in chunks of the given size as Pandas DataFrames;
+- `iter_arrow(chunk_size: int)`: iterates over the data in chunks of the given size as Arrow tables.
 :::info
-Which retrieval method you should use very much depends on your use case and the destination you are using.
Some drivers for our destinations provided by their vendors natively support Arrow or pandas DataFrames; in these cases, we will use that interface. If they do not, `dlt` will convert lists of tuples into these formats.
+Which retrieval method you should use very much depends on your use case and the destination you are using. Some drivers for our destinations provided by their vendors natively support Arrow or Pandas DataFrames; in these cases, we will use that interface. If they do not, `dlt` will convert lists of tuples into these formats.
 :::
 ## The filesystem SQL client
diff --git a/docs/website/docs/general-usage/dataset-access/streamlit.md b/docs/website/docs/general-usage/dataset-access/streamlit.md
index fced1fd901..32589d8e23 100644
--- a/docs/website/docs/general-usage/dataset-access/streamlit.md
+++ b/docs/website/docs/general-usage/dataset-access/streamlit.md
@@ -1,5 +1,5 @@
 ---
-title: Viewing your data with streamlit
+title: Viewing your data with Streamlit
 description: Viewing your data with streamlit
 keywords: [data, dataset, streamlit]
 ---
@@ -9,12 +9,12 @@ keywords: [data, dataset, streamlit]
 Once you have run a pipeline locally, you can launch a web app that displays the loaded data. For this to work, you will need to have the `streamlit` package installed.
 :::tip
-The streamlit app does not work with all destinations supported by `dlt`. Only destinations that provide a SQL client will work. The filesystem destination has support via the [Filesystem SQL client](./sql-client#the-filesystem-sql-client) and will work in most cases. Vector databases generally are unsupported.
+The Streamlit app does not work with all destinations supported by `dlt`. Only destinations that provide a SQL client will work. The filesystem destination has support via the [Filesystem SQL client](./sql-client#the-filesystem-sql-client) and will work in most cases. Vector databases generally are unsupported.
 :::
 ## Prerequisites
-To install streamlit, run the following command:
+To install Streamlit, run the following command:
 ```sh
 pip install streamlit
@@ -35,11 +35,11 @@ Use the pipeline name you defined in your Python code with the `pipeline_name` a
 You can now inspect the schema and your data. Use the left sidebar to switch between:
-* Exploring your data (default)
-* Information about your loads
+* Exploring your data (default);
+* Information about your loads.
 ## Further reading
-If you are running dlt in Python interactively or in a notebook, read the [Accessing your data with Python](./dataset.md) guide.
+If you are running `dlt` in Python interactively or in a notebook, read the [Accessing loaded data in Python](./dataset.md) guide.