diff --git a/docs/website/docs/build-a-pipeline-tutorial.md b/docs/website/docs/build-a-pipeline-tutorial.md index 078ef5999e..1c87d4f1cf 100644 --- a/docs/website/docs/build-a-pipeline-tutorial.md +++ b/docs/website/docs/build-a-pipeline-tutorial.md @@ -391,7 +391,7 @@ utilization, schema enforcement and curation, and schema change alerts. which consist of a timestamp and pipeline name. Load IDs enable incremental transformations and data vaulting by tracking data loads and facilitating data lineage and traceability. -Read more about [lineage.](dlt-ecosystem/visualizations/understanding-the-tables.md#load-ids) +Read more about [lineage](general-usage/destination-tables.md#data-lineage). ### Schema Enforcement and Curation diff --git a/docs/website/docs/dlt-ecosystem/destinations/filesystem.md b/docs/website/docs/dlt-ecosystem/destinations/filesystem.md index 8db9a35514..32bf561a82 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/filesystem.md +++ b/docs/website/docs/dlt-ecosystem/destinations/filesystem.md @@ -155,7 +155,7 @@ All the files are stored in a single folder with the name of the dataset that yo The name of each file contains essential metadata on the content: - **schema_name** and **table_name** identify the [schema](../../general-usage/schema.md) and table that define the file structure (column names, data types etc.) -- **load_id** is the [id of the load package](https://dlthub.com/docs/dlt-ecosystem/visualizations/understanding-the-tables#load-ids) form which the file comes from. +- **load_id** is the [id of the load package](../../general-usage/destination-tables.md#load-packages-and-load-ids) form which the file comes from. - **file_id** is there are many files with data for a single table, they are copied with different file id. - **ext** a format of the file ie. `jsonl` or `parquet` diff --git a/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md b/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md deleted file mode 100644 index e14ef554f5..0000000000 --- a/docs/website/docs/dlt-ecosystem/visualizations/understanding-the-tables.md +++ /dev/null @@ -1,84 +0,0 @@ ---- -title: Understanding the tables -description: Understanding the tables that have been loaded -keywords: [understanding tables, loaded data, data structure] ---- - -# Understanding the tables - -## Show tables and data in the destination - -``` -dlt pipeline show -``` - -[This command](../../reference/command-line-interface.md#show-tables-and-data-in-the-destination) -generates and launches a simple Streamlit app that you can use to inspect the schemas -and data in the destination as well as your pipeline state and loading status / stats. It should be -executed from the same folder where you ran the pipeline script to access destination credentials. -It requires `streamlit` and `pandas` to be installed. - -## Table and column names - -We [normalize table and column names,](../../general-usage/schema.md#naming-convention) so they fit -what the destination database allows. We convert all the names in your source data into -`snake_case`, alphanumeric identifiers. Please note that in many cases the names you had in your -input document will be (slightly) different from identifiers you see in the database. - -## Child and parent tables - -When creating a schema during normalization, `dlt` recursively unpacks this nested structure into -relational tables, creating and linking children and parent tables. - -This is how table linking works: - -1. 
Each row in all (top level and child) data tables created by `dlt` contains UNIQUE column named - `_dlt_id`. -1. Each child table contains FOREIGN KEY column `_dlt_parent_id` linking to a particular row - (`_dlt_id`) of a parent table. -1. Rows in child tables come from the lists: `dlt` stores the position of each item in the list in - `_dlt_list_idx`. -1. For tables that are loaded with the `merge` write disposition, we add a ROOT KEY column - `_dlt_root_id`, which links child table to a row in top level table. - -> 💡 Note: If you define your own primary key in a child table, it will be used to link to parent table -and the `_dlt_parent_id` and `_dlt_list_idx` will not be added. `_dlt_id` is always added even in -case the primary key or other unique columns are defined. - -## Load IDs - -Each pipeline run creates one or more load packages, which can be identified by their `load_id`. A load -package typically contains data from all [resources](../../general-usage/glossary.md#resource) of a -particular [source](../../general-usage/glossary.md#source). The `load_id` of a particular package -is added to the top data tables (`_dlt_load_id` column) and to the `_dlt_loads` table with a status 0 (when the load process -is fully completed). - -The `_dlt_loads` table tracks complete loads and allows chaining transformations on top of them. -Many destinations do not support distributed and long-running transactions (e.g. Amazon Redshift). -In that case, the user may see the partially loaded data. It is possible to filter such data out—any -row with a `load_id` that does not exist in `_dlt_loads` is not yet completed. The same procedure may be used to delete and identify -and delete data for packages that never got completed. - -For each load, you can test and [alert](../../running-in-production/alerting.md) on anomalies (e.g. -no data, too much loaded to a table). There are also some useful load stats in the `Load info` tab -of the [Streamlit app](understanding-the-tables.md#show-tables-and-data-in-the-destination) -mentioned above. - -You can add [transformations](../transformations) and chain them together -using the `status` column. You start the transformation for all the data with a particular -`load_id` with a status of 0 and then update it to 1. The next transformation starts with the status -of 1 and is then updated to 2. This can be repeated for every additional transformation. - -### Data lineage - -Data lineage can be super relevant for architectures like the -[data vault architecture](https://www.data-vault.co.uk/what-is-data-vault/) or when troubleshooting. -The data vault architecture is a data warehouse that large organizations use when representing the -same process across multiple systems, which adds data lineage requirements. Using the pipeline name -and `load_id` provided out of the box by `dlt`, you are able to identify the source and time of -data. - -You can [save](../../running-in-production/running.md#inspect-and-save-the-load-info-and-trace) -complete lineage info for a particular `load_id` including a list of loaded files, error messages -(if any), elapsed times, schema changes. This can be helpful, for example, when troubleshooting -problems. 
diff --git a/docs/website/docs/general-usage/destination-tables.md b/docs/website/docs/general-usage/destination-tables.md new file mode 100644 index 0000000000..8f95639f87 --- /dev/null +++ b/docs/website/docs/general-usage/destination-tables.md @@ -0,0 +1,320 @@ +--- +title: Destination tables +description: Understanding the tables created in the destination database +keywords: [destination tables, loaded data, data structure, schema, table, child table, load package, load id, lineage, staging dataset, versioned dataset] +--- + +# Destination tables + +When you run a [pipeline](pipeline.md), dlt creates tables in the destination database and loads the data +from your [source](source.md) into these tables. In this section, we will take a closer look at what +destination tables look like and how they are organized. + +We start with a simple dlt pipeline: + +```py +import dlt + +data = [ + {'id': 1, 'name': 'Alice'}, + {'id': 2, 'name': 'Bob'} +] + +pipeline = dlt.pipeline( + pipeline_name='quick_start', + destination='duckdb', + dataset_name='mydata' +) +load_info = pipeline.run(data, table_name="users") +``` + +:::note + +Here we are using the [DuckDb destination](../dlt-ecosystem/destinations/duckdb.md), which is an in-memory database. Other database destinations +will behave similarly and have similar concepts. + +::: + +Running this pipeline will create a database schema in the destination database (DuckDB) along with a table named `users`. Quick tip: you can use the `show` command of the `dlt pipeline` CLI [to see the tables](../dlt-ecosystem/visualizations/exploring-the-data.md#exploring-the-data) in the destination database. + +## Database schema + +The database schema is a collection of tables that represent the data you loaded into the database. +The schema name is the same as the `dataset_name` you provided in the pipeline definition. +In the example above, we explicitly set the `dataset_name` to `mydata`. If you don't set it, +it will be set to the pipeline name with a suffix `_dataset`. + +Be aware that the schema referred to in this section is distinct from the [dlt Schema](schema.md). +The database schema pertains to the structure and organization of data within the database, including table +definitions and relationships. On the other hand, the "dlt Schema" specifically refers to the format +and structure of normalized data within the dlt pipeline. + +## Tables + +Each [resource](resource.md) in your pipeline definition will be represented by a table in +the destination. In the example above, we have one resource, `users`, so we will have one table, `mydata.users`, +in the destination. Where `mydata` is the schema name, and `users` is the table name. Here also, we explicitly set +the `table_name` to `users`. When `table_name` is not set, the table name will be set to the resource name. + +For example, we can rewrite the pipeline above as: + +```py +@dlt.resource +def users(): + yield [ + {'id': 1, 'name': 'Alice'}, + {'id': 2, 'name': 'Bob'} + ] + +pipeline = dlt.pipeline( + pipeline_name='quick_start', + destination='duckdb', + dataset_name='mydata' +) +load_info = pipeline.run(users) +``` + +The result will be the same, but the table is implicitly named `users` based on the resource name. + +:::note + +Special tables are created to track the pipeline state. These tables are prefixed with `_dlt_` +and are not shown in the `show` command of the `dlt pipeline` CLI. However, you can see them when +connecting to the database directly. 
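+For example, a minimal sketch of inspecting them by connecting to the DuckDB file directly,
+assuming the default local database file `quick_start.duckdb` that dlt derives from the pipeline
+name:
+
+```py
+import duckdb
+
+# assuming the default local DuckDB file created in the working directory
+conn = duckdb.connect("quick_start.duckdb")
+
+# lists all tables, including the _dlt_* state and metadata tables
+print(conn.sql("SELECT table_schema, table_name FROM information_schema.tables"))
+```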
+ +::: + +## Child and parent tables + +Now let's look at a more complex example: + +```py +import dlt + +data = [ + { + 'id': 1, + 'name': 'Alice', + 'pets': [ + {'id': 1, 'name': 'Fluffy', 'type': 'cat'}, + {'id': 2, 'name': 'Spot', 'type': 'dog'} + ] + }, + { + 'id': 2, + 'name': 'Bob', + 'pets': [ + {'id': 3, 'name': 'Fido', 'type': 'dog'} + ] + } +] + +pipeline = dlt.pipeline( + pipeline_name='quick_start', + destination='duckdb', + dataset_name='mydata' +) +load_info = pipeline.run(data, table_name="users") +``` + +Running this pipeline will create two tables in the destination, `users` and `users__pets`. The +`users` table will contain the top level data, and the `users__pets` table will contain the child +data. Here is what the tables may look like: + +**mydata.users** + +| id | name | _dlt_id | _dlt_load_id | +| --- | --- | --- | --- | +| 1 | Alice | wX3f5vn801W16A | 1234562350.98417 | +| 2 | Bob | rX8ybgTeEmAmmA | 1234562350.98417 | + +**mydata.users__pets** + +| id | name | type | _dlt_id | _dlt_parent_id | _dlt_list_idx | +| --- | --- | --- | --- | --- | --- | +| 1 | Fluffy | cat | w1n0PEDzuP3grw | wX3f5vn801W16A | 0 | +| 2 | Spot | dog | 9uxh36VU9lqKpw | wX3f5vn801W16A | 1 | +| 3 | Fido | dog | pe3FVtCWz8VuNA | rX8ybgTeEmAmmA | 0 | + +When creating a database schema, dlt recursively unpacks nested structures into relational tables, +creating and linking children and parent tables. + +This is how it works: + +1. Each row in all (top level and child) data tables created by `dlt` contains UNIQUE column named + `_dlt_id`. +1. Each child table contains FOREIGN KEY column `_dlt_parent_id` linking to a particular row + (`_dlt_id`) of a parent table. +1. Rows in child tables come from the lists: `dlt` stores the position of each item in the list in + `_dlt_list_idx`. +1. For tables that are loaded with the `merge` write disposition, we add a ROOT KEY column + `_dlt_root_id`, which links child table to a row in top level table. + + +:::note + +If you define your own primary key in a child table, it will be used to link to parent table +and the `_dlt_parent_id` and `_dlt_list_idx` will not be added. `_dlt_id` is always added even in +case the primary key or other unique columns are defined. + +::: + +## Naming convention: tables and columns + +During a pipeline run, dlt [normalizes both table and column names](schema.md#naming-convention) to ensure compatibility with the destination database's accepted format. All names from your source data will be transformed into snake_case and will only include alphanumeric characters. Please be aware that the names in the destination database may differ somewhat from those in your original input. + +## Load Packages and Load IDs + +Each execution of the pipeline generates one or more load packages. A load package typically contains data retrieved from +all the [resources](glossary.md#resource) of a particular [source](glossary.md#source). +These packages are uniquely identified by a `load_id`. The `load_id` of a particular package is added to the top data tables +(referenced as `_dlt_load_id` column in the example above) and to the special `_dlt_loads` table with a status 0 +(when the load process is fully completed). + +To illustrate this, let's load more data into the same destination: + +```py +data = [ + { + 'id': 3, + 'name': 'Charlie', + 'pets': [] + }, +] +``` + +The rest of the pipeline definition remains the same. Running this pipeline will create a new load +package with a new `load_id` and add the data to the existing tables. 
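+
+You can also see the new `load_id` on the `LoadInfo` object returned by `pipeline.run()`. A minimal
+sketch, re-running the pipeline with the new `data` from above:
+
+```py
+load_info = pipeline.run(data, table_name="users")
+
+# the printed summary includes the id(s) of the load package(s) created by this run
+print(load_info)
+```
+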
The `users` table will now +look like this: + +**mydata.users** + +| id | name | _dlt_id | _dlt_load_id | +| --- | --- | --- | --- | +| 1 | Alice | wX3f5vn801W16A | 1234562350.98417 | +| 2 | Bob | rX8ybgTeEmAmmA | 1234562350.98417 | +| 3 | Charlie | h8lehZEvT3fASQ | **1234563456.12345** | + +The `_dlt_loads` table will look like this: + +**mydata._dlt_loads** + +| load_id | schema_name | status | inserted_at | schema_version_hash | +| --- | --- | --- | --- | --- | +| 1234562350.98417 | quick_start | 0 | 2023-09-12 16:45:51.17865+00 | aOEb...Qekd/58= | +| **1234563456.12345** | quick_start | 0 | 2023-09-12 16:46:03.10662+00 | aOEb...Qekd/58= | + +The `_dlt_loads` table tracks complete loads and allows chaining transformations on top of them. +Many destinations do not support distributed and long-running transactions (e.g. Amazon Redshift). +In that case, the user may see the partially loaded data. It is possible to filter such data out: any +row with a `load_id` that does not exist in `_dlt_loads` is not yet completed. The same procedure may be used to identify +and delete data for packages that never got completed. + +For each load, you can test and [alert](../running-in-production/alerting.md) on anomalies (e.g. +no data, too much loaded to a table). There are also some useful load stats in the `Load info` tab +of the [Streamlit app](../dlt-ecosystem/visualizations/exploring-the-data.md#exploring-the-data) +mentioned above. + +You can add [transformations](../dlt-ecosystem/transformations/) and chain them together +using the `status` column. You start the transformation for all the data with a particular +`load_id` with a status of 0 and then update it to 1. The next transformation starts with the status +of 1 and is then updated to 2. This can be repeated for every additional transformation. + +### Data lineage + +Data lineage can be super relevant for architectures like the +[data vault architecture](https://www.data-vault.co.uk/what-is-data-vault/) or when troubleshooting. +The data vault architecture is a data warehouse that large organizations use when representing the +same process across multiple systems, which adds data lineage requirements. Using the pipeline name +and `load_id` provided out of the box by `dlt`, you are able to identify the source and time of +data. + +You can [save](../running-in-production/running.md#inspect-and-save-the-load-info-and-trace) +complete lineage info for a particular `load_id` including a list of loaded files, error messages +(if any), elapsed times, schema changes. This can be helpful, for example, when troubleshooting +problems. + +## Staging dataset + +So far we've been using the `append` write disposition in our example pipeline. This means that +each time we run the pipeline, the data is appended to the existing tables. When you use [the +merge write disposition](incremental-loading.md), dlt creates a staging database schema for +staging data. This schema is named `_staging` and contains the same tables as the +destination schema. When you run the pipeline, the data from the staging tables is loaded into the +destination tables in a single atomic transaction. + +Let's illustrate this with an example. 
We change our pipeline to use the `merge` write disposition: + +```py +import dlt + +@dlt.resource(primary_key="id", write_disposition="merge") +def users(): + yield [ + {'id': 1, 'name': 'Alice 2'}, + {'id': 2, 'name': 'Bob 2'} + ] + +pipeline = dlt.pipeline( + pipeline_name='quick_start', + destination='duckdb', + dataset_name='mydata' +) + +load_info = pipeline.run(users) +``` + +Running this pipeline will create a schema in the destination database with the name `mydata_staging`. +If you inspect the tables in this schema, you will find `mydata_staging.users` table identical to the +`mydata.users` table in the previous example. + +Here is what the tables may look like after running the pipeline: + +**mydata_staging.users** + +| id | name | _dlt_id | _dlt_load_id | +| --- | --- | --- | --- | +| 1 | Alice 2 | wX3f5vn801W16A | 2345672350.98417 | +| 2 | Bob 2 | rX8ybgTeEmAmmA | 2345672350.98417 | + +**mydata.users** + +| id | name | _dlt_id | _dlt_load_id | +| --- | --- | --- | --- | +| 1 | Alice 2 | wX3f5vn801W16A | 2345672350.98417 | +| 2 | Bob 2 | rX8ybgTeEmAmmA | 2345672350.98417 | +| 3 | Charlie | h8lehZEvT3fASQ | 1234563456.12345 | + +Notice that the `mydata.users` table now contains the data from both the previous pipeline run and +the current one. + +## Versioned datasets + +When you set the `full_refresh` argument to `True` in `dlt.pipeline` call, dlt creates a versioned dataset. +This means that each time you run the pipeline, the data is loaded into a new dataset (a new database schema). +The dataset name is the same as the `dataset_name` you provided in the pipeline definition with a +datetime-based suffix. + +We modify our pipeline to use the `full_refresh` option to see how this works: + +```py +import dlt + +data = [ + {'id': 1, 'name': 'Alice'}, + {'id': 2, 'name': 'Bob'} +] + +pipeline = dlt.pipeline( + pipeline_name='quick_start', + destination='duckdb', + dataset_name='mydata', + full_refresh=True # <-- add this line +) +load_info = pipeline.run(data, table_name="users") +``` + +Every time you run this pipeline, a new schema will be created in the destination database with a +datetime-based suffix. The data will be loaded into tables in this schema. +For example, the first time you run the pipeline, the schema will be named +`mydata_20230912064403`, the second time it will be named `mydata_20230912064407`, and so on. diff --git a/docs/website/docs/general-usage/full-loading.md b/docs/website/docs/general-usage/full-loading.md index f6f359914d..92fdf064fd 100644 --- a/docs/website/docs/general-usage/full-loading.md +++ b/docs/website/docs/general-usage/full-loading.md @@ -40,15 +40,15 @@ replace_strategy = "staging-optimized" ### The `truncate-and-insert` strategy The `truncate-and-insert` replace strategy is the default and the fastest of all three strategies. If you load data with this setting, then the -destination tables will be truncated at the beginning of the load and the new data will be inserted consecutively but not within the same transaction. +destination tables will be truncated at the beginning of the load and the new data will be inserted consecutively but not within the same transaction. The downside of this strategy is, that your tables will have no data for a while until the load is completed. You -may end up with new data in some tables and no data in other tables if the load fails during the run. 
Such incomplete load may be however detected by checking if the -[_dlt_loads table contains load id](../dlt-ecosystem/visualizations/understanding-the-tables.md#load-ids) from _dlt_load_id of the replaced tables. If you prefer to have no data downtime, please use one of the other strategies. +may end up with new data in some tables and no data in other tables if the load fails during the run. Such incomplete load may be however detected by checking if the +[_dlt_loads table contains load id](destination-tables.md#load-packages-and-load-ids) from _dlt_load_id of the replaced tables. If you prefer to have no data downtime, please use one of the other strategies. ### The `insert-from-staging` strategy -The `insert-from-staging` is the slowest of all three strategies. It will load all new data into staging tables away from your final destination tables and will then truncate and insert the new data in one transaction. -It also maintains a consistent state between child and parent tables at all times. Use this strategy if you have the requirement for consistent destination datasets with zero downtime and the `optimized` strategy does not work for you. +The `insert-from-staging` is the slowest of all three strategies. It will load all new data into staging tables away from your final destination tables and will then truncate and insert the new data in one transaction. +It also maintains a consistent state between child and parent tables at all times. Use this strategy if you have the requirement for consistent destination datasets with zero downtime and the `optimized` strategy does not work for you. This strategy behaves the same way across all destinations. ### The `staging-optimized` strategy diff --git a/docs/website/docs/getting-started.mdx b/docs/website/docs/getting-started.mdx index 5afa0fb0da..cdbeac8eab 100644 --- a/docs/website/docs/getting-started.mdx +++ b/docs/website/docs/getting-started.mdx @@ -109,7 +109,7 @@ Learn more: - [The full list of available destinations.](dlt-ecosystem/destinations/) - [Exploring the data](dlt-ecosystem/visualizations/exploring-the-data). - What happens after loading? - [Understanding the tables](dlt-ecosystem/visualizations/understanding-the-tables). + [Destination tables](general-usage/destination-tables). ## Load your data diff --git a/docs/website/docs/user-guides/data-beginner.md b/docs/website/docs/user-guides/data-beginner.md new file mode 100644 index 0000000000..e6dd8b8d22 --- /dev/null +++ b/docs/website/docs/user-guides/data-beginner.md @@ -0,0 +1,130 @@ +--- +title: Data Beginner +description: A guide to using dlt for aspiring data professionals +keywords: [beginner, analytics, machine learning] +--- + +# Data Beginner + +If you are an aspiring data professional, here are some ways you can showcase your understanding and +value to data teams with the help of `dlt`. + +## Analytics: Empowering decision-makers + +Operational users at a company need general business analytics capabilities to make decisions, e.g. +dashboards, data warehouse, self-service, etc. + +### Show you can deliver results, not numbers + +The goal of such a project is to get you into the top 5% of candidates, so you get invited to an +interview and understand pragmatically what is expected of you. + +Depending on whether you want to be more in engineering or analytics, you can focus on different +parts of this project. If you showcase that you are able to deliver end to end, there remains little +reason for a potential employer to not hire you. 
+
+Someone hiring folks on this business analytics path will be looking for the following skills:
+
+- Can you load data to a db?
+  - Can you do incremental loading?
+  - Are your pipelines maintainable?
+  - Are your pipelines reusable? Do they take meaningful arguments?
+- Can you transform the data to a standard architecture?
+  - Do you know dimensional modelling architecture?
+  - Does your model make the data accessible via a user-facing tool to a business user?
+  - Can you translate a business requirement into a technical requirement?
+- Can you identify a use case and prepare reporting?
+  - Are you displaying a sensible use case?
+  - Are you taking a pragmatic approach as to what should be displayed and why?
+  - Did you hard code charts in a notebook that the end user cannot use or did you use a user-facing
+    dashboard tool?
+  - Is the user able to answer follow-up questions by changing the dimensions in a tool or did you
+    hard code queries?
+
+Project idea:
+
+1. Choose an API that produces data. If this data is somehow business relevant, that’s better. Many
+   business apps offer free developer accounts that allow you to develop business apps with them.
+1. Choose a use case for this data. Make sure this use case makes some business sense and is not
+   completely theoretical. Business understanding and pragmatism are key for such roles, so do not
+   waste your chance to show it. Keep the use case simple; otherwise it will not be pragmatic right
+   off the bat, handicapping your chances of a good outcome. A few examples are ranking leads in a
+   sales CRM, clustering users, and something around customer lifetime value predictions.
+1. Build a dlt pipeline that loads data from the API for your use case. Keep the case simple and
+   your code clean. Use explicit variable and method names. Tell a story with your code. For loading
+   mode, use incremental loading and don’t hardcode parameters that are subject to change.
+1. Build a [dbt package](../dlt-ecosystem/transformations/dbt.md) for this pipeline.
+1. Build a visualization. Focus on usability more than code. Remember, your goal is to empower a
+   business user to self-serve, so hard-coded dashboards are usually seen as liabilities that need
+   to be maintained. On the other hand, dashboard tools can be adjusted by business users too. For
+   example, the free “Looker Studio” from Google is relatable to business users, while notebooks
+   might make them feel insecure. Your evaluator will likely not take time to set up and run your
+   things, so make sure your outcomes are well documented with images. Make sure they are readable
+   on their own, and explain how you intend the business user to use this visualization to fulfil
+   the use case.
+1. Make it presentable somewhere public, such as GitHub, and add docs. Show it to someone for
+   feedback. You will find like-minded people in
+   [our Slack](https://join.slack.com/t/dlthub-community/shared_invite/zt-1slox199h-HAE7EQoXmstkP_bTqal65g)
+   that will happily give their opinion.
+
+## Machine Learning: Automating decisions
+
+This path is about solving specific business problems with data products that generate further
+insights and sometimes automate decisions.
+
+### Show you can solve business problems
+
+Here the challenges might seem different from the business analytics path, but they are often quite
+similar. Many courses focus on statistics and data science but very few focus on pragmatic
+approaches to solving business problems in organizations.
Most of the time, the largest obstacles to +solving a problem with ML are not purely algorithmic but rather about the semantics of the business, +data, and people who need to use the data products. + +Employers look for a project that showcases both technical ability and business pragmatism in a use +case. In reality, data does not typically come in files but via APIs with fresh data, where you +usually will have to grab it and move it somewhere to use, so show your ability to deliver end to +end. + +Project idea: + +1. Choose an API that produces data. If this data is somehow business relevant, that’s better. Many + business apps offer free developer accounts that allow you to develop business apps with them. +1. Choose a use case for this data. Make sure this use case makes some business sense and is not + completely theoretical. Business understanding and pragmatism are key for such roles, so do not + waste your chance to show it. Keep the use case simple-otherwise it will not be pragmatic right + off the bat, handicapping yourself from a good outcome. A few examples are ranking leads in a + sales CRM, clustering users, and something around customer lifetime value predictions. +1. Build a dlt pipeline that loads data from the API for your use case. Keep the case simple and + your code clean. Use explicit variable and method names. Tell a story with your code. For loading + mode, use incremental loading and don’t hardcode parameters that are subject to change. +1. Build a data model with SQL. If you are ambitious you could try running the SQL with a + [dbt package](../dlt-ecosystem/transformations). +1. Showcase your chosen use case that uses ML or statistics to achieve your goal. Don’t forget to + mention how you plan to do this “in production”. Choose a case that is simple so you don’t end up + overcomplicating your solution. Focus on outcomes and next steps. Describe what the company needs + to do to use your results, demonstrating that you understand the costs of your propositions. +1. Make it presentable somewhere public, such as GitHub, and add docs. Show it to someone for + feedback. You will find likeminded people in + [our Slack](https://join.slack.com/t/dlthub-community/shared_invite/zt-1slox199h-HAE7EQoXmstkP_bTqal65g) + that will happily give their opinion. + +## Further reading + +Good docs pages to check out: + +- [Getting started.](../getting-started) +- [Create a pipeline.](../walkthroughs/create-a-pipeline) +- [Run a pipeline.](../walkthroughs/run-a-pipeline) +- [Deploy a pipeline with GitHub Actions.](../walkthroughs/deploy-a-pipeline/deploy-with-github-actions) +- [Understand the loaded data.](../general-usage/destination-tables.md) +- [Explore the loaded data in Streamlit.](../dlt-ecosystem/visualizations/exploring-the-data.md) +- [Transform the data with SQL or python.](../dlt-ecosystem/transformations) +- [Contribute a pipeline.](https://github.com/dlt-hub/verified-sources/blob/master/CONTRIBUTING.md) + +Here are some example projects: + +- [Is DuckDB a database for ducks? Using DuckDB to explore the DuckDB open source community.](https://colab.research.google.com/drive/1NfSB1DpwbbHX9_t5vlalBTf13utwpMGx?usp=sharing) +- [Using DuckDB to explore the Rasa open source community.](https://colab.research.google.com/drive/1c9HrNwRi8H36ScSn47m3rDqwj5O0obMk?usp=sharing) +- [MRR and churn calculations on Stripe data.](../dlt-ecosystem/verified-sources/stripe.md) + +Please [open a PR](https://github.com/dlt-hub/verified-sources) to add projects that use `dlt` here! 
diff --git a/docs/website/docs/user-guides/data-scientist.md b/docs/website/docs/user-guides/data-scientist.md new file mode 100644 index 0000000000..b8415937e4 --- /dev/null +++ b/docs/website/docs/user-guides/data-scientist.md @@ -0,0 +1,129 @@ +--- +title: Data Scientist +description: A guide to using dlt for Data Scientists +keywords: [data scientist, data science, machine learning, machine learning engineer] +--- + + +# Data Scientist + +Data Load Tool (`dlt`) can be highly useful for Data Scientists in several ways. Here are three +potential use cases: + +## Use case #1: Efficient Data Ingestion and Optimized Workflow + +Data Scientists often deal with large volumes of data from various sources. `dlt` can help +streamline the process of data ingestion by providing a robust and scalable tool for loading data +into their analytics environment. It can handle diverse data formats, such as CSV, JSON, or database +dumps, and efficiently load them into a data lake or a data warehouse. + +![dlt-main](images/dlt-main.png) + +By using `dlt`, Data Scientists can save time and effort on data extraction and transformation +tasks, allowing them to focus more on data analysis and models training. The tool is designed as a +library that can be added to their code, making it easy to integrate into existing workflows. + +`dlt` can facilitate a seamless transition from data exploration to production deployment. Data +Scientists can leverage `dlt` capabilities to load data in the format that matches the production +environment while exploring and analyzing the data. This streamlines the process of moving from the +exploration phase to the actual implementation of models, saving time and effort. By using `dlt` +throughout the workflow, Data Scientists can ensure that the data is properly prepared and aligned +with the production environment, leading to smoother integration and deployment of their models. + +- [Use existed Verified Sources](../walkthroughs/add-a-verified-source) and pipeline examples or + [create your own](../walkthroughs/create-a-pipeline) quickly. + +- [Deploy the pipeline](../walkthroughs/deploy-a-pipeline), so that the data is automatically loaded + on a schedule. + +- Transform the [loaded data](../dlt-ecosystem/transformations) with dbt or in + Pandas DataFrames. + +- Learn how to [run](../running-in-production/running), + [monitor](../running-in-production/monitoring), and [alert](../running-in-production/alerting) + when you put your pipeline in production. + +- Use `dlt` when doing exploration in a Jupyter Notebook and move more easily to production. Explore + our + [Colab Demo for Chess.com API](https://colab.research.google.com/drive/1NfSB1DpwbbHX9_t5vlalBTf13utwpMGx?usp=sharing) + to realize how easy it is to create and use `dlt` in your projects: + + ![colab-demo](images/colab-demo.png) + +### `dlt` is optimized for local use on laptops + +- It offers a seamless + [integration with Streamlit](../dlt-ecosystem/visualizations/exploring-the-data.md). + This integration enables a smooth and interactive data analysis experience, where Data Scientists + can leverage the power of `dlt` alongside Streamlit's intuitive interface and visualization + capabilities. +- In addition to Streamlit, `dlt` natively supports + [DuckDB](https://dlthub.com/docs/blog/is-duckdb-a-database-for-ducks), an in-process SQL OLAP + database management system. This native support ensures efficient data processing and querying + within `dlt`, leveraging the capabilities of DuckDB. 
By integrating DuckDB, Data Scientists can
+  benefit from fast and scalable data operations, enhancing the overall performance of their
+  analytical workflows.
+- Moreover, `dlt` provides resources that can directly return data in the form of
+  [Pandas DataFrames from an SQL client](../dlt-ecosystem/visualizations/exploring-the-data). This
+  feature simplifies data retrieval and allows Data Scientists to seamlessly work with data in the
+  familiar Pandas DataFrame format. With this capability, Data Scientists can leverage the rich
+  ecosystem of Python libraries and tools that support Pandas.
+
+With `dlt`, the transition from local storage to remote is quick and easy. For example, read the
+documentation [Share a dataset: DuckDB -> BigQuery](../walkthroughs/share-a-dataset).
+
+## Use case #2: Structured Data and Enhanced Data Understanding
+
+### Structured data
+
+Data Scientists often prefer structured data lakes over unstructured ones to facilitate efficient
+data analysis and modeling. `dlt` can help in this regard by offering seamless integration with
+structured data storage systems, allowing Data Scientists to easily load and organize their data in
+a structured format. This enables them to access and analyze the data more effectively, improving
+their understanding of the underlying data structure.
+
+![structured-data](images/structured-data.png)
+
+A `dlt` pipeline is made of a source, which contains resources, and a connection to the destination,
+which we call the pipeline. So in the simplest use case, you can pass your unstructured data to the
+`pipeline` and it will automatically be structured at the destination. See how to do that in our
+[pipeline documentation](../general-usage/pipeline).
+
+Besides sturdiness, this also adds convenience by automatically converting JSON types to database
+types, such as timestamps.
+
+Read more about schema evolution in our blog:
+**[The structured data lake: How schema evolution enables the next generation of data platforms](https://dlthub.com/docs/blog/next-generation-data-platform).**
+
+### Data exploration
+
+Data Scientists require a comprehensive understanding of their data to derive meaningful insights
+and build accurate models. `dlt` can contribute to this by providing intuitive and user-friendly
+features for data exploration. It allows Data Scientists to quickly gain insights into their data by
+visualizing data summaries, statistics, and distributions. With `dlt`, data understanding becomes
+clearer and more accessible, enabling Data Scientists to make informed decisions throughout the
+analysis process.
+
+In addition, having a schema imposed on the data acts as a technical description of the data,
+accelerating the discovery process.
+
+See [Destination tables](../general-usage/destination-tables.md) and
+[Exploring the data](../dlt-ecosystem/visualizations/exploring-the-data) in our documentation.
+
+## Use case #3: Data Preprocessing and Transformation
+
+Data preparation is a crucial step in the data science workflow. `dlt` can facilitate data
+preprocessing and transformation tasks by providing a range of built-in features. It simplifies
+various tasks like data cleaning, anonymizing, handling missing values, data type conversion,
+feature scaling, and feature engineering. Data Scientists can leverage these capabilities to clean
+and transform their datasets efficiently, making them suitable for subsequent analysis and modeling.
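+
+For example, here is a minimal sketch of a preprocessing step that pseudonymizes a sensitive column
+before it ever reaches the destination; the resource and column names are hypothetical, and it uses
+dlt's `add_map` transform (see the customization options linked below):
+
+```py
+import hashlib
+import dlt
+
+@dlt.resource
+def users():
+    # hypothetical raw records with a sensitive column
+    yield {"id": 1, "name": "Alice", "email": "alice@example.com"}
+    yield {"id": 2, "name": "Bob", "email": "bob@example.com"}
+
+def pseudonymize_email(record):
+    # replace the raw value with a stable hash before loading
+    record["email"] = hashlib.sha256(record["email"].encode()).hexdigest()
+    return record
+
+pipeline = dlt.pipeline(
+    pipeline_name="preprocessing_demo",
+    destination="duckdb",
+    dataset_name="clean_data"
+)
+load_info = pipeline.run(users().add_map(pseudonymize_email))
+```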
+ +Python-first users can heavily customize how `dlt` sources produce data, as `dlt` supports +selecting, [filtering](../general-usage/resource#filter-transform-and-pivot-data), +[renaming](../general-usage/customising-pipelines/renaming_columns), +[anonymizing](../general-usage/customising-pipelines/pseudonymizing_columns), and just about any +custom operation. + +Compliance is also a case where preprocessing is the way to solve the issue: Besides being +python-friendly, the ability to apply transformation logic before loading data allows us to +separate, filter or transform sensitive data. diff --git a/docs/website/docs/user-guides/engineering-manager.md b/docs/website/docs/user-guides/engineering-manager.md new file mode 100644 index 0000000000..70e23eb2c1 --- /dev/null +++ b/docs/website/docs/user-guides/engineering-manager.md @@ -0,0 +1,155 @@ +--- +title: Staff Data Engineer +description: A guide to using dlt for Staff Data Engineers +keywords: [staff data engineer, senior data engineer, ETL engineer, head of data platform, + data platform engineer] +--- + +# Staff Data Engineer + +Staff data engineers create data pipelines, data warehouses and data lakes in order to democratize +access to data in their organizations. + +With `dlt` we offer a library and building blocks that data tool builders can use to create modern +data infrastructure for their companies. Staff Data Engineer, Senior Data Engineer, ETL Engineer, +Head of Data Platform - there’s a variety of titles of how data tool builders are called in +companies. + +## What does this role do in an organisation? + +The job responsibilities of this senior vary, but often revolve around building and maintaining a +robust data infrastructure: + +- Tech: They design and implement scalable data architectures, data pipelines, and data processing + frameworks. +- Governance: They ensure data integrity, reliability, and security across the data stack. They + manage data governance, including data quality, data privacy, and regulatory compliance. +- Strategy: Additionally, they evaluate and adopt new technologies, tools, and methodologies to + improve the efficiency, performance, and scalability of data processes. +- Team skills and staffing: Their responsibilities also involve providing technical leadership, + mentoring team members, driving innovation, and aligning the data strategy with the organization's + overall goals. +- Return on investment focus: Ultimately, their focus is on empowering the organization to derive + actionable insights, make data-driven decisions, and unlock the full potential of their data + assets. + +## Choosing a Data Stack + +The above roles play a critical role in choosing the right data stack for their organization. When +selecting a data stack, they need to consider several factors. These include: + +- The organization's data requirements. +- Scalability, performance, data governance and security needs. +- Integration capabilities with existing systems and tools. +- Team skill sets, budget, and long-term strategic goals. + +They evaluate the pros and cons of various technologies, frameworks, and platforms, considering +factors such as ease of use, community support, vendor reliability, and compatibility with their +specific use cases. The goal is to choose a data stack that aligns with the organization's needs, +enables efficient data processing and analysis, promotes data governance and security, and empowers +teams to deliver valuable insights and solutions. 
+ +## What does a senior architect or engineer consider when choosing a tech stack? + +- Company Goals and Strategy. +- Cost and Return on Investment (ROI). +- Staffing and Skills. +- Employee Happiness and Productivity. +- Maintainability and Long-term Support. +- Integration with Existing Systems. +- Scalability and Performance. +- Data Security and Compliance. +- Vendor Reliability and Ecosystem. + +## What makes dlt a must-have for your data stack or platform? + +For starters, `dlt` is the first data pipeline solution that is built for your data team's ROI. Our +vision is to add value, not gatekeep it. + +By being a library built to enable free usage, we are uniquely positioned to run in existing stacks +without replacing them. This enables us to disrupt and revolutionise the industry in ways that only +open source communities can. + +## dlt massively reduces pipeline maintenance, increases efficiency and ROI + +- Reduce engineering effort as much as 5x via a paradigm shift. Structure data automatically to not + do it manually. + Read about [structured data lake](https://dlthub.com/docs/blog/next-generation-data-platform), and + [how to do schema evolution](../reference/explainers/schema-evolution.md). +- Better Collaboration and Communication: Structured data promotes better collaboration and + communication among team members. Since everyone operates on a shared understanding of the data + structure, it becomes easier to discuss and align on data-related topics. Queries, reports, and + analysis can be easily shared and understood by others, enhancing collaboration and teamwork. +- Faster time to build pipelines: After extracting data, if you pass it to `dlt`, you are done. If + not, it needs to be structured. Because structuring is hard, we curate it. Curation involves at + least the producer, and consumer, but often also an analyst and the engineer, and is a long, + friction-ful process. +- Usage focus improves ROI: To use data, we need to understand what it is. Structured data already + contains a technical description, accelerating usage. +- Lower cost: Reading structured data is cheaper and faster because we can specify which parts of a + document we want to read. +- Removing friction: By alerting schema changes to the producer and stakeholder, and by automating + structuring, we can keep the data engineer out of curation and remove the bottleneck. + [Notify maintenance events.](../running-in-production/running#inspect-save-and-alert-on-schema-changes) +- Improving quality: No more garbage in, garbage out. Because `dlt` structures data and alerts schema + changes, we can have better governance. + +## dlt makes your team happy + +- Spend more time using data, less time loading it. When you build a `dlt` pipeline, you only build + the extraction part, automating the tedious structuring and loading. +- Data meshing to reduce friction: By structuring data before loading, the engineer is no longer + involved in curation. This makes both the engineer and the others happy. +- Better governance with end to end pipelining via dbt: + [run dbt packages on the fly](../dlt-ecosystem/transformations/dbt.md), + [lineage out of the box](../general-usage/destination-tables.md#data-lineage). +- Zero learning curve: Declarative loading, simple functional programming. By using `dlt`'s + declarative, standard approach to loading data, there is no complicated code to maintain, and the + analysts can thus maintain the code. 
+- Autonomy and Self service: Customising pipelines is easy, whether you want to plug an anonymiser, + rename things, or curate what you load. + [Anonymisers, renamers](../general-usage/customising-pipelines/pseudonymizing_columns.md). +- Easy discovery and governance: By tracking metadata like data lineage, describing data with + schemas, and alerting changes, we stay on top of the data. +- Simplified access: Querying structured data can be done by anyone with their tools of choice. + +## dlt is a library that you can run in unprecedented places + +Before `dlt` existed, all loading tools were built either + +- as SaaS (5tran, Stitch, etc.); +- as installed apps with their own orchestrator: Pentaho, Talend, Airbyte; +- or as abandonware framework meant to be unrunnable without help (Singer was released without + orchestration, not for public). + +`dlt` is the first python library in this space, which means you can just run it wherever the rest of +your python stuff runs, without adding complexity. + +- You can run `dlt` in [Airflow](../dlt-ecosystem/deployments/orchestrators/airflow-deployment.md) - + this is the first ingestion tool that does this. +- You can run `dlt` in small spaces like [Cloud Functions](../dlt-ecosystem/deployments/running-in-cloud-functions.md) + or [GitHub Actions](../dlt-ecosystem/deployments/orchestrators/github-actions.md) - + so you could easily set up webhooks, etc. +- You can run `dlt` in your Jupyter Notebook and load data to [DuckDB](../dlt-ecosystem/destinations/duckdb.md). +- You can run `dlt` on large machines, it will attempt to make the best use of the resources available + to it. +- You can [run `dlt` locally](../walkthroughs/run-a-pipeline.md) just like you run any python scripts. + +The implications: + +- Empowering Data Teams and Collaboration: You can discover or prototype in notebooks, run in cloud + functions, and deploy to production, the same scalable, robust code. No more friction between + roles. + [Colab demo.](https://colab.research.google.com/drive/1NfSB1DpwbbHX9_t5vlalBTf13utwpMGx?usp=sharing#scrollTo=A3NRS0y38alk) +- Rapid Data Exploration and Prototyping: By running in Colab with DuckDB, you can explore + semi-structured data much faster by structuring it with `dlt` and analysing it in SQL. + [Schema inference](../general-usage/schema#data-normalizer), + [exploring the loaded data](../dlt-ecosystem/visualizations/exploring-the-data.md). +- No vendor limits: `dlt` is forever free, with no vendor strings. We do not create value by creating + a pain for you and solving it. We create value by supporting you beyond. +- `dlt` removes complexity: You can use `dlt` in your existing stack, no overheads, no race conditions, + full observability. Other tools add complexity. +- `dlt` can be leveraged by AI: Because it's a library with low complexity to use, large language + models can produce `dlt` code for your pipelines. +- Ease of adoption: If you are running python, you can adopt `dlt`. `dlt` is orchestrator and + destination agnostic. 
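+
+As a closing illustration, here is a minimal sketch of the notebook-to-production workflow described
+above; the API endpoint and names are hypothetical:
+
+```py
+import dlt
+from dlt.sources.helpers import requests  # dlt's requests helper with built-in retries
+
+@dlt.resource(table_name="events")
+def events():
+    # hypothetical endpoint used purely for illustration
+    response = requests.get("https://api.example.com/events")
+    yield response.json()
+
+pipeline = dlt.pipeline(
+    pipeline_name="notebook_prototype",
+    destination="duckdb",  # swap for a warehouse destination when moving to production
+    dataset_name="events_raw"
+)
+print(pipeline.run(events()))
+```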
diff --git a/docs/website/sidebars.js b/docs/website/sidebars.js index e25f8d8836..713a67165c 100644 --- a/docs/website/sidebars.js +++ b/docs/website/sidebars.js @@ -123,7 +123,6 @@ const sidebars = { 'dlt-ecosystem/transformations/dbt', 'dlt-ecosystem/transformations/sql', 'dlt-ecosystem/transformations/pandas', - , ] }, { @@ -131,7 +130,6 @@ const sidebars = { label: 'Visualizations', items: [ 'dlt-ecosystem/visualizations/exploring-the-data', - 'dlt-ecosystem/visualizations/understanding-the-tables' ] }, ], @@ -184,6 +182,7 @@ const sidebars = { 'general-usage/resource', 'general-usage/source', 'general-usage/pipeline', + 'general-usage/destination-tables', 'general-usage/state', 'general-usage/incremental-loading', 'general-usage/full-loading',