improve how dlt works page #2152

Merged

sh-rp merged 4 commits into devel from docs/improve_how_dlt_works on Dec 17, 2024
Conversation


@sh-rp commented Dec 16, 2024

Description

This PR improves the how dlt works page:

  • Update inaccurate information
  • Update main image
  • Extend text

@sh-rp added the documentation (Improvements or additions to documentation) label on Dec 16, 2024

netlify bot commented Dec 16, 2024

Deploy Preview for dlt-hub-docs ready!

🔨 Latest commit: 53c86bf
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/67615ee53436f0000951af63
😎 Deploy Preview: https://deploy-preview-2152--dlt-hub-docs.netlify.app

@sh-rp marked this pull request as ready for review on December 16, 2024, 15:59
Contributor

@AstrakhantsevaAA left a comment

very good! much much better


### Extract

During the extract phase, `dlt` fully extracts the data from your [sources](../../general-usage/source) to your hard drive into a new [load package](../../general-usage/destination-tables#load-packages-and-load-ids), which will be assigned a unique ID and will contain your raw data as received from your sources. Additionally, you can [supply schema hints](../../general-usage/source#define-schema) to define the data types of some of the columns or add a primary key and unique indexes. You can also control this phase by [limiting](../../general-usage/resource#sample-from-large-data) the number of items extracted in one run, using [incremental cursor fields](../../general-usage/incremental-loading#incremental-loading-with-a-cursor-field), and by tuning the performance with [parallelization](../../reference/performance#extract). You can also apply filters and maps to [obfuscate](../../general-usage/customising-pipelines/pseudonymizing_columns) or [remove](../../general-usage/customising-pipelines/removing_columns) personal data, and you can use [transformers](../../examples/transformers) to create derivative data.
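(Illustration, not part of the PR: a minimal sketch of these extract-time controls, with a made-up resource and a hypothetical `fetch_issues` helper.)

```py
import dlt

@dlt.resource(table_name="issues", primary_key="id")
def issues(updated_at=dlt.sources.incremental("updated_at", initial_value="2024-01-01")):
    # yield only records newer than the cursor value stored from the last run
    yield from fetch_issues(since=updated_at.last_value)  # hypothetical helper

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
# cap the number of items extracted in this run, e.g. while testing
pipeline.run(issues().add_limit(10))
```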
Contributor

can you please mention that they are all part of `pipeline.run`? Users can also run each step separately, like:

```py
pipeline.extract(data)
pipeline.normalize()
pipeline.load()
```
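(For context, a self-contained version of that snippet, with an illustrative pipeline name and sample data, might be:)

```py
import dlt

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
data = [{"id": 1}, {"id": 2}, {"id": 3}]

pipeline.extract(data, table_name="items")  # write a raw load package to local disk
pipeline.normalize()                        # infer the schema and normalize the package
pipeline.load()                             # migrate the schema and load into duckdb
```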

Collaborator Author

done

[schema](../../general-usage/glossary.md#schema), which will automatically evolve to accommodate any future
source data changes (e.g., new fields or tables).

```py
pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
pipeline.run([{"id": 1}, {"id": 2}, {"id": 3}], table_name="items")
```
Contributor

maybe add a hint:
Use `progress="log"` to see logs for each stage separately: https://dlthub.com/docs/walkthroughs/run-a-pipeline#2-see-the-progress-during-loading
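(For reference, switching that on is just a pipeline argument; the pipeline name and data here are illustrative.)

```py
import dlt

# progress="log" prints progress for the extract, normalize, and load stages separately
pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="duckdb",
    progress="log",
)
pipeline.run([{"id": 1}, {"id": 2}], table_name="items")
```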

Collaborator Author

I think this is more something for the walkthrough


During the extract phase, `dlt` fully extracts the data from your [sources](../../general-usage/source) to your hard drive into a new [load package](../../general-usage/destination-tables#load-packages-and-load-ids), which will be assigned a unique ID and will contain your raw data as received from your sources. Additionally, you can [supply schema hints](../../general-usage/source#define-schema) to define the data types of some of the columns or add a primary key and unique indexes. You can also control this phase by [limiting](../../general-usage/resource#sample-from-large-data) the number of items extracted in one run, using [incremental cursor fields](../../general-usage/incremental-loading#incremental-loading-with-a-cursor-field), and by tuning the performance with [parallelization](../../reference/performance#extract). You can also apply filters and maps to [obfuscate](../../general-usage/customising-pipelines/pseudonymizing_columns) or [remove](../../general-usage/customising-pipelines/removing_columns) personal data, and you can use [transformers](../../examples/transformers) to create derivative data.

### Normalize
Contributor

mention here that this step is performed only after the extraction step has completed, so if the extract stage fails, normalization won't be started

Contributor

same for load step

Collaborator Author

done


### Load

During the loading phase, `dlt` first runs schema migrations as needed on your destination and then loads your data into it. `dlt` will load your data in smaller chunks called load jobs to be able to parallelize large loads. If the connection to the destination fails, it is safe to rerun the pipeline, and `dlt` will continue to load all load jobs from the current load package. `dlt` will also create special tables that store the internal dlt schema, information about all load packages, and some state information which, among other things, is used by incremental loading to restore the incremental state from a previous run on another machine. You can control the loading phase by using different [`write_dispositions`](../../general-usage/incremental-loading#choosing-a-write-disposition): replace the data in the destination, simply append to it, or merge on merge keys that you can configure per table. For some destinations, you can use a remote staging dataset on a bucket provider; `dlt` even supports modern open table formats like [Delta tables and Iceberg](../../dlt-ecosystem/destinations/delta-iceberg), and [reverse ETL](../../dlt-ecosystem/destinations/destination) is also possible.
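(Illustration, not part of the PR: choosing a write disposition at run time, with made-up table and data, could look like this.)

```py
import dlt

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")

# merge on a per-table primary key instead of appending or replacing
pipeline.run(
    [{"id": 1, "status": "open"}, {"id": 2, "status": "closed"}],
    table_name="tickets",
    write_disposition="merge",
    primary_key="id",
)
```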
Contributor

can you explain better this part?

> `dlt` will load your data in smaller chunks called load jobs to be able to parallelize large loads.

does it mean dlt loads data in chunks? or one chunk is one resource? or if we enable file rotation it's one file?

or it's too much info?

Contributor

imo, will be too much info :)

Collaborator Author

I'd like to add a new page about the load package and how to inspect it.

[schema](../../general-usage/glossary.md#schema), which will automatically evolve to accommodate any future
source data changes (e.g., new fields or tables).

```py
pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
pipeline.run([{"id": 1}, {"id": 2}, {"id": 3}], table_name="items")
```
Contributor

maybe make the data sample nested? Then you can mention unnesting further in the normalize step?

Collaborator Author

yes, good idea!
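(A sketch of what such a nested sample might look like, with illustrative names; the nested `orders` lists are what the normalize step would unnest into a child table, e.g. `items__orders`.)

```py
import dlt

data = [
    {"id": 1, "name": "alice", "orders": [{"order_id": 10}, {"order_id": 11}]},
    {"id": 2, "name": "bob", "orders": [{"order_id": 12}]},
]

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
pipeline.run(data, table_name="items")
```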

VioletM previously approved these changes Dec 16, 2024
Contributor

@VioletM left a comment

Really enjoyed the explanations about all steps with a lot of links and info combined. Great job!

does this by first [extracting](how-dlt-works.md#extract) the JSON data, then
[normalizing](how-dlt-works.md#normalize) it to a schema, and finally [loading](how-dlt-works.md#load)
it to the location where you will store it.
In a nutshell, `dlt` automatically turns data from a number of available [sources](../../general-usage/source) (e.g., an API, a PostgreSQL database, or Python data structures) into a live dataset stored in a [destination](../../general-usage/destination) of your choice (e.g., Google BigQuery, a Deltalake on Azure, or by pushing the data back via reverse ETL). You can easily implement your own sources, as long as you yield data in a way that is compatible with `dlt`, such as JSON objects, Python lists and dictionaries, pandas dataframes, and arrow tables. `dlt` will be able to automatically compute the schema and move the data to your destination.
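(Aside, not part of the PR: a minimal run of the kind described above, here with a pandas frame as input and made-up names.)

```py
import dlt
import pandas as pd

# dicts, lists of dicts, pandas dataframes, and arrow tables can all be passed as data
df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
pipeline.run(df, table_name="items")
```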
Contributor

Suggested change
In a nutshell, `dlt` automatically turns data from a number of available [sources](../../general-usage/source) (e.g., an API, a PostgreSQL database, or Python data structures) into a live dataset stored in a [destination](../../general-usage/destination) of your choice (e.g., Google BigQuery, a Deltalake on Azure, or by pushing the data back via reverse ETL). You can easily implement your own sources, as long as you yield data in a way that is compatible with `dlt`, such as JSON objects, Python lists and dictionaries, pandas dataframes, and arrow tables. `dlt` will be able to automatically compute the schema and move the data to your destination.
In a nutshell, `dlt` automatically turns data from a number of available [sources](../../dlt-ecosystem/verified-sources) (e.g., an API, a PostgreSQL database, or Python data structures) into a live dataset stored in a [destination](../../dlt-ecosystem/destinations) of your choice (e.g., Google BigQuery, a Deltalake on Azure, or by pushing the data back via reverse ETL). You can easily implement your own sources, as long as you yield data in a way that is compatible with `dlt`, such as JSON objects, Python lists and dictionaries, pandas dataframes, and arrow tables. `dlt` will be able to automatically compute the schema and move the data to your destination.

Collaborator Author

done


### Extract

During the extract phase, `dlt` fully extracts the data from your [sources](../../general-usage/source) to your hard drive into a new [load package](../../general-usage/destination-tables#load-packages-and-load-ids), which will be assigned a unique ID and will contain your raw data as received from your sources. Additionally, you can [supply schema hints](../../general-usage/source#define-schema) to define the data types of some of the columns or add a primary key and unique indexes. You can also control this phase by [limiting](../../general-usage/resource#sample-from-large-data) the number of items extracted in one run, using [incremental cursor fields](../../general-usage/incremental-loading#incremental-loading-with-a-cursor-field), and by tuning the performance with [parallelization](../../reference/performance#extract). You can also apply filters and maps to [obfuscate](../../general-usage/customising-pipelines/pseudonymizing_columns) or [remove](../../general-usage/customising-pipelines/removing_columns) personal data, and you can use [transformers](../../examples/transformers) to create derivative data.
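(Illustration, not part of the PR: supplying such hints on a resource, with made-up column names and types, might look like this.)

```py
import dlt

@dlt.resource(
    table_name="payments",
    primary_key="id",
    columns={"amount": {"data_type": "decimal"}, "paid_at": {"data_type": "timestamp"}},
)
def payments():
    yield {"id": 1, "amount": "12.50", "paid_at": "2024-12-16T12:00:00Z"}

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
pipeline.run(payments())
```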
Contributor

No such page [supply schema hints](../../general-usage/source#define-schema)

Collaborator Author

fixed


### Normalize

During the normalization phase, `dlt` inspects and normalizes your data and computes a [schema](../../general-usage/schema) corresponding to the input data. The schema will automatically evolve to accommodate any future source data changes, for example, new columns or tables. `dlt` will also unnest nested data structures into child tables and create variant columns if detected values do not match a schema computed during a previous run. The result of the normalization phase is an updated load package that holds your normalized data in a format your destination understands and a full schema which can be used to migrate your data to your destination. You can control the normalization phase, for example, by [defining the allowed nesting level](../../general-usage/source#reduce-the-nesting-level-of-generated-tables) of input data, by [applying schema contracts](../../general-usage/schema-contracts) that govern how the schema might evolve, and how rows that do not fit are treated. Performance settings are [also available](../../reference/performance#normalize).
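(Illustration, not part of the PR: a sketch of limiting the nesting level on a source; names are made up, and a `schema_contract` argument could be set in the same place to govern schema evolution.)

```py
import dlt

# anything nested deeper than one level is kept as JSON instead of being unnested
@dlt.source(max_table_nesting=1)
def my_source():
    @dlt.resource(table_name="events")
    def events():
        yield {"id": 1, "payload": {"meta": {"depth": 2}}}
    return events

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
pipeline.run(my_source())
```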
Contributor

Suggested change
During the normalization phase, `dlt` inspects and normalizes your data and computes a [schema](../../general-usage/schema) corresponding to the input data. The schema will automatically evolve to accommodate any future source data changes, for example, new columns or tables. `dlt` will also unnest nested data structures into child tables and create variant columns if detected values do not match a schema computed during a previous run. The result of the normalization phase is an updated load package that holds your normalized data in a format your destination understands and a full schema which can be used to migrate your data to your destination. You can control the normalization phase, for example, by [defining the allowed nesting level](../../general-usage/source#reduce-the-nesting-level-of-generated-tables) of input data, by [applying schema contracts](../../general-usage/schema-contracts) that govern how the schema might evolve, and how rows that do not fit are treated. Performance settings are [also available](../../reference/performance#normalize).
During the normalization phase, `dlt` inspects and normalizes your data and computes a [schema](../../general-usage/schema) corresponding to the input data. The schema will automatically evolve to accommodate any future source data changes, for example, new columns or tables. `dlt` will also unnest nested data structures into child tables and create [variant columns](../../general-usage/schema/#variant-columns) if detected values do not match a schema computed during a previous run. The result of the normalization phase is an updated load package that holds your normalized data in a format your destination understands and a full schema which can be used to migrate your data to your destination. You can control the normalization phase, for example, by [defining the allowed nesting level](../../general-usage/source#reduce-the-nesting-level-of-generated-tables) of input data, by [applying schema contracts](../../general-usage/schema-contracts) that govern how the schema might evolve, and how rows that do not fit are treated. Performance settings are [also available](../../reference/performance#normalize).

Maybe you could also mention that the result is stored on the hard disk as well?

Collaborator Author

I can't see what the change is here? I can't commit it because I changed the branch history..


### Load

During the loading phase, `dlt` first runs schema migrations as needed on your destination and then loads your data into it. `dlt` will load your data in smaller chunks called load jobs to be able to parallelize large loads. If the connection to the destination fails, it is safe to rerun the pipeline, and `dlt` will continue to load all load jobs from the current load package. `dlt` will also create special tables that store the internal dlt schema, information about all load packages, and some state information which, among other things, is used by incremental loading to restore the incremental state from a previous run on another machine. You can control the loading phase by using different [`write_dispositions`](../../general-usage/incremental-loading#choosing-a-write-disposition): replace the data in the destination, simply append to it, or merge on merge keys that you can configure per table. For some destinations, you can use a remote staging dataset on a bucket provider; `dlt` even supports modern open table formats like [Delta tables and Iceberg](../../dlt-ecosystem/destinations/delta-iceberg), and [reverse ETL](../../dlt-ecosystem/destinations/destination) is also possible.
Contributor

imo, will be too much info :)


## Other notable features

`ToDo`
Contributor

Maybe we could list the core dlt features we usually mention, with links:

  • Schema evolution
  • Incremental loading
  • Performance settings
  • Get metadata and traces
  • Transformations of the dataset
  • Deploy dlt

Collaborator

@rudolfix left a comment

please remove the 5MB image from the PR 🙏 Best if we link it from some bucket or use a drastically resized one. Also rewrite the branch so we do not merge any large blobs in past commits.

```py
pipeline.normalize()
```

During the normalization phase, `dlt` inspects and normalizes your data and computes a [schema](../../general-usage/schema) corresponding to the input data. The schema will automatically evolve to accommodate any future source data changes like new columns or tables. `dlt` will also unnest nested data structures into child tables and create variant columns if detected values do not match a schema computed during a previous run. The result of the normalization phase is an updated load package that holds your normalized data in a format your destination understands and a full schema which can be used to migrate your data to your destination. You can control the normalization phase, for example, by [defining the allowed nesting level](../../general-usage/source#reduce-the-nesting-level-of-generated-tables) of input data, by [applying schema contracts](../../general-usage/schema-contracts) that govern how the schema might evolve, and how rows that do not fit are treated. Performance settings are [also available](../../reference/performance#normalize).
Contributor

Suggested change
During the normalization phase, `dlt` inspects and normalizes your data and computes a [schema](../../general-usage/schema) corresponding to the input data. The schema will automatically evolve to accommodate any future source data changes like new columns or tables. `dlt` will also unnest nested data structures into child tables and create variant columns if detected values do not match a schema computed during a previous run. The result of the normalization phase is an updated load package that holds your normalized data in a format your destination understands and a full schema which can be used to migrate your data to your destination. You can control the normalization phase, for example, by [defining the allowed nesting level](../../general-usage/source#reduce-the-nesting-level-of-generated-tables) of input data, by [applying schema contracts](../../general-usage/schema-contracts) that govern how the schema might evolve, and how rows that do not fit are treated. Performance settings are [also available](../../reference/performance#normalize).
During the normalization phase, `dlt` inspects and normalizes your data and computes a [schema](../../general-usage/schema) corresponding to the input data. The schema will automatically evolve to accommodate any future source data changes like new columns or tables. `dlt` will also unnest nested data structures into child tables and create [variant columns](../../general-usage/schema/#variant-columns) if detected values do not match a schema computed during a previous run. The result of the normalization phase is an updated load package that holds your normalized data in a format your destination understands and a full schema which can be used to migrate your data to your destination. You can control the normalization phase, for example, by [defining the allowed nesting level](../../general-usage/source#reduce-the-nesting-level-of-generated-tables) of input data, by [applying schema contracts](../../general-usage/schema-contracts) that govern how the schema might evolve, and how rows that do not fit are treated. Performance settings are [also available](../../reference/performance#normalize).

Contributor

Hm, yes, for some reason GitHub does not show that properly.
I just added the link to variant columns because it might be a confusing term for people.

[variant columns](../../general-usage/schema/#variant-columns)

@sh-rp merged commit 9e70cd2 into devel on Dec 17, 2024
49 checks passed
@sh-rp deleted the docs/improve_how_dlt_works branch on December 17, 2024, 13:39