improve how dlt works page #2152
Conversation
✅ Deploy Preview for dlt-hub-docs ready!
very good! much much better
### Extract
During the extract phase, `dlt` fully extracts the data from your [sources](../../general-usage/source) to your hard drive into a new [load package](../../general-usage/destination-tables#load-packages-and-load-ids), which will be assigned a unique ID and will contain your raw data as received from your sources. Additionally, you can [supply schema hints](../../general-usage/source#define-schema) to define the data types of some of the columns or add a primary key and unique indexes. You can also control this phase by [limiting](../../general-usage/resource#sample-from-large-data) the number of items extracted in one run, using [incremental cursor fields](../../general-usage/incremental-loading#incremental-loading-with-a-cursor-field), and by tuning the performance with [parallelization](../../reference/performance#extract). You can also apply filters and maps to [obfuscate](../../general-usage/customising-pipelines/pseudonymizing_columns) or [remove](../../general-usage/customising-pipelines/removing_columns) personal data, and you can use [transformers](../../examples/transformers) to create derivative data.
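For illustration, a minimal sketch of a few of these extract-phase controls — a schema hint, an incremental cursor field, and a limit on extracted items. The resource and its data are hypothetical, and the exact hints are assumptions based on the linked pages:

```python
import dlt

RAW_EVENTS = [
    {"id": 1, "created_at": "2024-01-01T00:00:00Z"},
    {"id": 2, "created_at": "2024-02-01T00:00:00Z"},
]

@dlt.resource(primary_key="id", columns={"created_at": {"data_type": "timestamp"}})
def events(cursor=dlt.sources.incremental("created_at", initial_value="2024-01-01T00:00:00Z")):
    # yield only items at or after the stored cursor value (ISO strings compare lexicographically)
    for item in RAW_EVENTS:
        if item["created_at"] >= cursor.last_value:
            yield item

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
# stop after at most 10 yields from the resource in this run
pipeline.run(events().add_limit(10))
```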
Can you please mention that they are all parts of `pipeline.run`? Users can run each step separately, like
`pipeline.extract(data)`
`pipeline.normalize()`
`pipeline.load()`
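For reference, a minimal hedged sketch of what running the steps separately could look like, using the same toy data as the docs page:

```python
import dlt

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
data = [{"id": 1}, {"id": 2}, {"id": 3}]

# the three phases normally run inside a single pipeline.run(), but can be called one by one
pipeline.extract(data, table_name="items")  # write raw data to a local load package
pipeline.normalize()                        # infer/evolve the schema and prepare load jobs
pipeline.load()                             # migrate the destination schema and load the jobs
```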
done
[schema](../../general-usage/glossary.md#schema), which will automatically evolve to accommodate any future
source data changes (e.g., new fields or tables).
pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
pipeline.run([{"id": 1}, {"id": 2}, {"id": 3}], table_name="items")
Maybe add a hint: use `progress="log"` to see logs for each stage separately: https://dlthub.com/docs/walkthroughs/run-a-pipeline#2-see-the-progress-during-loading
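For context, a short sketch of that hint (assuming the current `dlt.pipeline` signature):

```python
import dlt

# progress="log" periodically logs counters for the extract, normalize and load stages
pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="duckdb",
    progress="log",
)
pipeline.run([{"id": 1}, {"id": 2}, {"id": 3}], table_name="items")
```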
I think this is more something for the walkthrough
During the extract phase, `dlt` fully extracts the data from your [sources](../../general-usage/source) to your hard drive into a new [load package](../../general-usage/destination-tables#load-packages-and-load-ids), which will be assigned a unique ID and will contain your raw data as received from your sources. Additionally, you can [supply schema hints](../../general-usage/source#define-schema) to define the data types of some of the columns or add a primary key and unique indexes. You can also control this phase by [limiting](../../general-usage/resource#sample-from-large-data) the number of items extracted in one run, using [incremental cursor fields](../../general-usage/incremental-loading#incremental-loading-with-a-cursor-field), and by tuning the performance with [parallelization](../../reference/performance#extract). You can also apply filters and maps to [obfuscate](../../general-usage/customising-pipelines/pseudonymizing_columns) or [remove](../../general-usage/customising-pipelines/removing_columns) personal data, and you can use [transformers](../../examples/transformers) to create derivative data.
### Normalize
Mention here that this step is performed only after the extract step is completed, so if the extract stage fails, normalization won't be started.
same for load step
done
### Load
During the loading phase, `dlt` first runs schema migrations as needed on your destination and then loads your data into the destination. `dlt` will load your data in smaller chunks called load jobs to be able to parallelize large loads. If the connection to the destination fails, it is safe to rerun the pipeline, and `dlt` will continue to load all load jobs from the current load package. `dlt` will also create special tables that store the internal dlt schema, information about all load packages, and some state information which, among other things, are used by the incrementals to be able to restore the incremental state from a previous run to another machine. Some ways to control the loading phase are by using different [`write_dispositions`](../../general-usage/incremental-loading#choosing-a-write-disposition) to replace the data in the destination, simply append to it, or merge on certain merge keys that you can configure per table. For some destinations, you can use a remote staging dataset on a bucket provider, and `dlt` even supports modern open table formats like [deltables and iceberg](../../dlt-ecosystem/destinations/delta-iceberg), and [reverse ETL](../../dlt-ecosystem/destinations/destination) is also possible.
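As a hedged illustration of the write dispositions mentioned in the paragraph above (table and column names are made up):

```python
import dlt

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")

# append (default): add the new rows to the existing table
pipeline.run([{"id": 1, "value": "a"}], table_name="events", write_disposition="append")

# replace: recreate the table contents from this run's data
pipeline.run([{"id": 2, "value": "b"}], table_name="events", write_disposition="replace")

# merge: upsert on a configured primary/merge key
pipeline.run(
    [{"id": 2, "value": "c"}],
    table_name="events",
    write_disposition="merge",
    primary_key="id",
)
```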
Can you explain this part better: "`dlt` will load your data in smaller chunks called load jobs to be able to parallelize large loads."? Does it mean dlt loads data in chunks? Or is one chunk one resource? Or, if we enable file rotation, is it one file? Or is it too much info?
imo, will be too much info :)
I'd like to add a new page about the load package and how to inspect it.
[schema](../../general-usage/glossary.md#schema), which will automatically evolve to accommodate any future
source data changes (e.g., new fields or tables).
pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
pipeline.run([{"id": 1}, {"id": 2}, {"id": 3}], table_name="items")
Maybe make the data sample nested? Then you can mention unnesting further in the normalize step?
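A hedged sketch of what a nested sample could look like (the field names and the resulting child-table name are assumptions):

```python
import dlt

# nested lists are unnested into child tables during the normalize step
data = [
    {"id": 1, "customer": {"name": "alice"}, "items": [{"sku": "a"}, {"sku": "b"}]},
    {"id": 2, "customer": {"name": "bob"}, "items": [{"sku": "c"}]},
]

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
pipeline.run(data, table_name="orders")
# expected result: an "orders" table plus an "orders__items" child table
```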
yes, good idea!
Really enjoyed the explanations about all steps with a lot of links and info combined. Great job!
does this by first [extracting](how-dlt-works.md#extract) the JSON data, then
[normalizing](how-dlt-works.md#normalize) it to a schema, and finally [loading](how-dlt-works#load)
it to the location where you will store it.
In a nutshell, `dlt` automatically turns data from a number of available [sources](../../general-usage/source) (e.g., an API, a PostgreSQL database, or Python data structures) into a live dataset stored in a [destination](../../general-usage/destination) of your choice (e.g., Google BigQuery, a Deltalake on Azure, or by pushing the data back via reverse ETL). You can easily implement your own sources, as long as you yield data in a way that is compatible with `dlt`, such as JSON objects, Python lists and dictionaries, pandas dataframes, and arrow tables. `dlt` will be able to automatically compute the schema and move the data to your destination.
Suggested change:
In a nutshell, `dlt` automatically turns data from a number of available [sources](../../general-usage/source) (e.g., an API, a PostgreSQL database, or Python data structures) into a live dataset stored in a [destination](../../general-usage/destination) of your choice (e.g., Google BigQuery, a Deltalake on Azure, or by pushing the data back via reverse ETL). You can easily implement your own sources, as long as you yield data in a way that is compatible with `dlt`, such as JSON objects, Python lists and dictionaries, pandas dataframes, and arrow tables. `dlt` will be able to automatically compute the schema and move the data to your destination.
In a nutshell, `dlt` automatically turns data from a number of available [sources](../../dlt-ecosystem/verified-sources) (e.g., an API, a PostgreSQL database, or Python data structures) into a live dataset stored in a [destination](../../dlt-ecosystem/destinations) of your choice (e.g., Google BigQuery, a Deltalake on Azure, or by pushing the data back via reverse ETL). You can easily implement your own sources, as long as you yield data in a way that is compatible with `dlt`, such as JSON objects, Python lists and dictionaries, pandas dataframes, and arrow tables. `dlt` will be able to automatically compute the schema and move the data to your destination.
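For illustration of "yield data in a way that is compatible with `dlt`", a minimal hedged sketch of a custom resource that just yields Python dictionaries (names are made up):

```python
import dlt

@dlt.resource(table_name="users")
def users():
    # any iterable of dicts (or dataframes / arrow tables) is compatible with dlt
    yield {"id": 1, "name": "alice"}
    yield {"id": 2, "name": "bob"}

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
pipeline.run(users())
```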
done
### Extract
During the extract phase, `dlt` fully extracts the data from your [sources](../../general-usage/source) to your hard drive into a new [load package](../../general-usage/destination-tables#load-packages-and-load-ids), which will be assigned a unique ID and will contain your raw data as received from your sources. Additionally, you can [supply schema hints](../../general-usage/source#define-schema) to define the data types of some of the columns or add a primary key and unique indexes. You can also control this phase by [limiting](../../general-usage/resource#sample-from-large-data) the number of items extracted in one run, using [incremental cursor fields](../../general-usage/incremental-loading#incremental-loading-with-a-cursor-field), and by tuning the performance with [parallelization](../../reference/performance#extract). You can also apply filters and maps to [obfuscate](../../general-usage/customising-pipelines/pseudonymizing_columns) or [remove](../../general-usage/customising-pipelines/removing_columns) personal data, and you can use [transformers](../../examples/transformers) to create derivative data.
No such page [supply schema hints](../../general-usage/source#define-schema)
fixed
### Normalize
During the normalization phase, `dlt` inspects and normalizes your data and computes a [schema](../../general-usage/schema) corresponding to the input data. The schema will automatically evolve to accommodate any future source data changes, for example, new columns or tables. `dlt` will also unnest nested data structures into child tables and create variant columns if detected values do not match a schema computed during a previous run. The result of the normalization phase is an updated load package that holds your normalized data in a format your destination understands and a full schema which can be used to migrate your data to your destination. You can control the normalization phase, for example, by [defining the allowed nesting level](../../general-usage/source#reduce-the-nesting-level-of-generated-tables) of input data, by [applying schema contracts](../../general-usage/schema-contracts) that govern how the schema might evolve, and how rows that do not fit are treated. Performance settings are [also available](../../reference/performance#normalize).
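As a hedged sketch of the normalization controls mentioned above; the exact effect of these settings is an assumption based on the linked pages:

```python
import dlt

@dlt.resource(table_name="orders")
def orders():
    yield {"id": 1, "items": [{"sku": "a", "specs": [{"key": "color", "value": "red"}]}]}

# max_table_nesting=1: unnest only one level into child tables,
# deeper structures are kept as JSON
@dlt.source(max_table_nesting=1)
def shop():
    return orders

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
# schema_contract: e.g., freeze columns so rows that would add new columns are rejected
pipeline.run(shop(), schema_contract={"columns": "freeze"})
```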
Suggested change:
During the normalization phase, `dlt` inspects and normalizes your data and computes a [schema](../../general-usage/schema) corresponding to the input data. The schema will automatically evolve to accommodate any future source data changes, for example, new columns or tables. `dlt` will also unnest nested data structures into child tables and create variant columns if detected values do not match a schema computed during a previous run. The result of the normalization phase is an updated load package that holds your normalized data in a format your destination understands and a full schema which can be used to migrate your data to your destination. You can control the normalization phase, for example, by [defining the allowed nesting level](../../general-usage/source#reduce-the-nesting-level-of-generated-tables) of input data, by [applying schema contracts](../../general-usage/schema-contracts) that govern how the schema might evolve, and how rows that do not fit are treated. Performance settings are [also available](../../reference/performance#normalize).
During the normalization phase, `dlt` inspects and normalizes your data and computes a [schema](../../general-usage/schema) corresponding to the input data. The schema will automatically evolve to accommodate any future source data changes, for example, new columns or tables. `dlt` will also unnest nested data structures into child tables and create [variant columns](../../general-usage/schema/#variant-columns) if detected values do not match a schema computed during a previous run. The result of the normalization phase is an updated load package that holds your normalized data in a format your destination understands and a full schema which can be used to migrate your data to your destination. You can control the normalization phase, for example, by [defining the allowed nesting level](../../general-usage/source#reduce-the-nesting-level-of-generated-tables) of input data, by [applying schema contracts](../../general-usage/schema-contracts) that govern how the schema might evolve, and how rows that do not fit are treated. Performance settings are [also available](../../reference/performance#normalize).
Maybe you also mention that the result is stored on the hard disk as well?
I can't see what the change is here. I can't commit it because I changed the branch history...
### Load
During the loading phase, `dlt` first runs schema migrations as needed on your destination and then loads your data into the destination. `dlt` will load your data in smaller chunks called load jobs to be able to parallelize large loads. If the connection to the destination fails, it is safe to rerun the pipeline, and `dlt` will continue to load all load jobs from the current load package. `dlt` will also create special tables that store the internal dlt schema, information about all load packages, and some state information which, among other things, are used by the incrementals to be able to restore the incremental state from a previous run to another machine. Some ways to control the loading phase are by using different [`write_dispositions`](../../general-usage/incremental-loading#choosing-a-write-disposition) to replace the data in the destination, simply append to it, or merge on certain merge keys that you can configure per table. For some destinations, you can use a remote staging dataset on a bucket provider, and `dlt` even supports modern open table formats like [deltables and iceberg](../../dlt-ecosystem/destinations/delta-iceberg), and [reverse ETL](../../dlt-ecosystem/destinations/destination) is also possible.
imo, will be too much info :)
## Other notable features
`ToDo`
Maybe we put something we usually mention about core dlt features with links:
- Schema evolution
- Incremental loading
- Performance settings
- Get metadata and traces
- Transformations of the dataset
- Deploy dlt
Please remove the 5MB image from the PR 🙏 best if we link it from some bucket or use a drastically resized one. Also rewrite the branch so we do not merge any large blobs in past commits.
Force-pushed from 3b21f97 to 80ece4a
pipeline.normalize()
During the normalization phase, `dlt` inspects and normalizes your data and computes a [schema](../../general-usage/schema) corresponding to the input data. The schema will automatically evolve to accommodate any future source data changes like new columns or tables. `dlt` will also unnest nested data structures into child tables and create variant columns if detected values do not match a schema computed during a previous run. The result of the normalization phase is an updated load package that holds your normalized data in a format your destination understands and a full schema which can be used to migrate your data to your destination. You can control the normalization phase, for example, by [defining the allowed nesting level](../../general-usage/source#reduce-the-nesting-level-of-generated-tables) of input data, by [applying schema contracts](../../general-usage/schema-contracts) that govern how the schema might evolve, and how rows that do not fit are treated. Performance settings are [also available](../../reference/performance#normalize).
Suggested change:
During the normalization phase, `dlt` inspects and normalizes your data and computes a [schema](../../general-usage/schema) corresponding to the input data. The schema will automatically evolve to accommodate any future source data changes like new columns or tables. `dlt` will also unnest nested data structures into child tables and create variant columns if detected values do not match a schema computed during a previous run. The result of the normalization phase is an updated load package that holds your normalized data in a format your destination understands and a full schema which can be used to migrate your data to your destination. You can control the normalization phase, for example, by [defining the allowed nesting level](../../general-usage/source#reduce-the-nesting-level-of-generated-tables) of input data, by [applying schema contracts](../../general-usage/schema-contracts) that govern how the schema might evolve, and how rows that do not fit are treated. Performance settings are [also available](../../reference/performance#normalize).
During the normalization phase, `dlt` inspects and normalizes your data and computes a [schema](../../general-usage/schema) corresponding to the input data. The schema will automatically evolve to accommodate any future source data changes like new columns or tables. `dlt` will also unnest nested data structures into child tables and create [variant columns](../../general-usage/schema/#variant-columns) if detected values do not match a schema computed during a previous run. The result of the normalization phase is an updated load package that holds your normalized data in a format your destination understands and a full schema which can be used to migrate your data to your destination. You can control the normalization phase, for example, by [defining the allowed nesting level](../../general-usage/source#reduce-the-nesting-level-of-generated-tables) of input data, by [applying schema contracts](../../general-usage/schema-contracts) that govern how the schema might evolve, and how rows that do not fit are treated. Performance settings are [also available](../../reference/performance#normalize).
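As a hedged illustration of how a variant column can appear; the `answer__v_text` name follows dlt's usual variant-column convention, but treat the exact name as an assumption:

```python
import dlt

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")

# first run: "answer" is inferred as an integer column
pipeline.run([{"id": 1, "answer": 42}], table_name="faq")

# second run: the value no longer matches the inferred type, so dlt keeps the
# original column and adds a variant column (e.g. "answer__v_text") for the text value
pipeline.run([{"id": 2, "answer": "forty two"}], table_name="faq")
```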
Hm, yes, for some reason GitHub does not show that properly. I just added the link to [variant columns](../../general-usage/schema/#variant-columns) because it might be a confusing term for people.
Description
This PR improves the how dlt works page: