From 81580bd38fa670afe56c4f0f8e8b212681450716 Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Tue, 28 May 2024 12:48:49 +0200 Subject: [PATCH 01/21] Link rest_api & OpenAPI generator from helpers section in the docs (#1420) --- docs/website/docs/general-usage/http/overview.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/docs/website/docs/general-usage/http/overview.md b/docs/website/docs/general-usage/http/overview.md index 94dc64eac5..2d193ceb2c 100644 --- a/docs/website/docs/general-usage/http/overview.md +++ b/docs/website/docs/general-usage/http/overview.md @@ -8,6 +8,10 @@ dlt has built-in support for fetching data from APIs: - [RESTClient](./rest-client.md) for interacting with RESTful APIs and paginating the results - [Requests wrapper](./requests.md) for making simple HTTP requests with automatic retries and timeouts +Additionally, dlt provides tools to simplify working with APIs: +- [REST API generic source](../../dlt-ecosystem/verified-sources/rest_api) integrates APIs using a [declarative configuration](../../dlt-ecosystem/verified-sources/rest_api#source-configuration) to minimize custom code. +- [OpenAPI source generator](../../dlt-ecosystem/verified-sources/openapi-generator) automatically creates declarative API configurations from [OpenAPI specifications](https://swagger.io/specification/). + ## Quick example Here's a simple pipeline that reads issues from the [dlt GitHub repository](https://github.com/dlt-hub/dlt/issues). The API endpoint is https://api.github.com/repos/dlt-hub/dlt/issues. The result is "paginated", meaning that the API returns a limited number of issues per page. The `paginate()` method iterates over all pages and yields the results which are then processed by the pipeline. From b4e04918d740717ed97da7f78d3aef9bfa6e3773 Mon Sep 17 00:00:00 2001 From: anuunchin <88698977+anuunchin@users.noreply.github.com> Date: Wed, 29 May 2024 10:53:29 +0200 Subject: [PATCH 02/21] Fixed the title of a blog post (#1421) --- .../blog/2024-05-23-contributed-first-pipeline.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/website/blog/2024-05-23-contributed-first-pipeline.md b/docs/website/blog/2024-05-23-contributed-first-pipeline.md index aae6e0f298..c6d9252da3 100644 --- a/docs/website/blog/2024-05-23-contributed-first-pipeline.md +++ b/docs/website/blog/2024-05-23-contributed-first-pipeline.md @@ -1,6 +1,6 @@ --- slug: contributed-first-pipeline -title: "How I contributed my first data pipeline to the open source." +title: "How I Contributed to My First Open Source Data Pipeline" image: https://storage.googleapis.com/dlt-blog-images/blog_my_first_data_pipeline.png authors: name: Aman Gupta @@ -78,13 +78,13 @@ def incremental_resource( With the steps defined above, I was able to load the data from Freshdesk to BigQuery and use the pipeline in production. Here’s a summary of the steps I followed: 1. Created a Freshdesk API token with sufficient privileges. -1. Created an API client to make requests to the Freshdesk API with rate limit and pagination. -1. Made incremental requests to this client based on the “updated_at” field in the response. -1. Ran the pipeline using the Python script. +2. Created an API client to make requests to the Freshdesk API with rate limit and pagination. +3. Made incremental requests to this client based on the “updated_at” field in the response. +4. Ran the pipeline using the Python script. 
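To make step 4 concrete, the runner script for a resource like the one above usually boils down to a few lines. This is only a sketch with illustrative names (the real pipeline name, dataset name and resource arguments will differ):

```py
import dlt

pipeline = dlt.pipeline(
    pipeline_name="freshdesk",       # illustrative name
    destination="bigquery",
    dataset_name="freshdesk_data",   # illustrative name
)

# run the incremental resource defined above (pass its arguments if it takes any)
load_info = pipeline.run(incremental_resource)
print(load_info)
```

On the next run, the `updated_at` cursor stored in the pipeline state ensures only new or changed records are fetched.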
While my journey from civil engineering to data engineering was initially intimidating, it has proved to be a profound learning experience. Writing a pipeline with **`dlt`** mirrors the simplicity of a GET request: you request data, yield it, and it flows from the source to its destination. Now, I help other clients integrate **`dlt`** to streamline their data workflows, which has been an invaluable part of my professional growth. In conclusion, diving into data engineering has expanded my technical skill set and provided a new lens through which I view challenges and solutions. As for me, the lens view mainly was concrete and steel a couple of years back, which has now begun to notice the pipelines of the data world. -Data engineering has proved both challenging, satisfying and a good carrier option for me till now. For those interested in the detailed workings of these pipelines, I encourage exploring dlt's [GitHub repository](https://github.com/dlt-hub/verified-sources) or diving into the [documentation](https://dlthub.com/docs/dlt-ecosystem/verified-sources/freshdesk). \ No newline at end of file +Data engineering has proved both challenging, satisfying, and a good career option for me till now. For those interested in the detailed workings of these pipelines, I encourage exploring dlt's [GitHub repository](https://github.com/dlt-hub/verified-sources) or diving into the [documentation](https://dlthub.com/docs/dlt-ecosystem/verified-sources/freshdesk). \ No newline at end of file From 8f96ff33d1e77e2051c1f2b5a33483a741f1cba6 Mon Sep 17 00:00:00 2001 From: adrianbr Date: Wed, 29 May 2024 15:39:12 +0200 Subject: [PATCH 03/21] blog openapi (#1422) --- ...2024-03-07-openapi-generation-chargebee.md | 2 +- .../blog/2024-05-14-rest-api-source-client.md | 2 +- .../blog/2024-05-28-openapi-pipeline.md | 97 +++++++++++++++++++ 3 files changed, 99 insertions(+), 2 deletions(-) create mode 100644 docs/website/blog/2024-05-28-openapi-pipeline.md diff --git a/docs/website/blog/2024-03-07-openapi-generation-chargebee.md b/docs/website/blog/2024-03-07-openapi-generation-chargebee.md index 3d77c3ea4c..97fc6e4865 100644 --- a/docs/website/blog/2024-03-07-openapi-generation-chargebee.md +++ b/docs/website/blog/2024-03-07-openapi-generation-chargebee.md @@ -7,7 +7,7 @@ authors: title: Data Engineer & ML Engineer url: https://github.com/dlt-hub/dlt image_url: https://avatars.githubusercontent.com/u/89419010?s=48&v=4 -tags: [data observability, data pipeline observability] +tags: [data observability, data pipeline observability, openapi] --- At dltHub, we have been pioneering the future of data pipeline generation, [making complex processes simple and scalable.](https://dlthub.com/product/#multiply-don't-add-to-our-productivity) We have not only been building dlt for humans, but also LLMs. diff --git a/docs/website/blog/2024-05-14-rest-api-source-client.md b/docs/website/blog/2024-05-14-rest-api-source-client.md index 18c8f1196e..ee20b43b41 100644 --- a/docs/website/blog/2024-05-14-rest-api-source-client.md +++ b/docs/website/blog/2024-05-14-rest-api-source-client.md @@ -7,7 +7,7 @@ authors: title: Open source Data Engineer url: https://github.com/adrianbr image_url: https://avatars.githubusercontent.com/u/5762770?v=4 -tags: [full code etl, yes code etl, etl, python elt] +tags: [rest-api, declarative etl] --- ## What is the REST API Source toolkit? 
diff --git a/docs/website/blog/2024-05-28-openapi-pipeline.md b/docs/website/blog/2024-05-28-openapi-pipeline.md new file mode 100644 index 0000000000..60faa062e0 --- /dev/null +++ b/docs/website/blog/2024-05-28-openapi-pipeline.md @@ -0,0 +1,97 @@ +--- +slug: openapi-pipeline +title: "Instant pipelines with dlt-init-openapi" +image: https://storage.googleapis.com/dlt-blog-images/openapi.png +authors: + name: Adrian Brudaru + title: Open source Data Engineer + url: https://github.com/adrianbr + image_url: https://avatars.githubusercontent.com/u/5762770?v=4 +tags: [openapi] +--- + +# The Future of Data Pipelines starts now. + +Dear dltHub Community, + +We are thrilled to announce the launch of our groundbreaking pipeline generator tool. + +We call it `dlt-init-openapi`. + +Just point it to an OpenAPI spec, select your endpoints, and you're done! + + +### What's OpenAPI again? + +[OpenAPI](https://www.openapis.org/) is the world's most widely used API description standard. You may have heard about swagger docs? those are docs generated from the spec. +In 2021 an information-security company named Assetnote scanned the web and unearthed [200,000 public +OpenAPI files](https://www.assetnote.io/resources/research/contextual-content-discovery-youve-forgotten-about-the-api-endpoints). +Modern API frameworks like [FastAPI](https://pypi.org/project/fastapi/) generate such specifications automatically. + +## How does it work? + +**A pipeline is a series of datapoints or decisions about how to extract and load the data**, expressed as code or config. I say decisions because building a pipeline can be boiled down to inspecting a documentation or response and deciding how to write the code. + +Our tool does its best to pick out the necessary details and detect the rest to generate the complete pipeline for you. + +The information required for taking those decisions comes from: +- The OpenAPI [Spec](https://github.com/dlt-hub/openapi-specs) (endpoints, auth) +- The dlt [REST API Source](https://dlthub.com/docs/dlt-ecosystem/verified-sources/rest_api) which attempts to detect pagination +- The [dlt init OpenAPI generator](https://dlthub.com/docs/dlt-ecosystem/verified-sources/openapi-generator) which attempts to detect incremental logic and dependent requests. + +### How well does it work? + +This is something we are also learning about. We did an internal hackathon where we each built a few pipelines with this generator. In our experiments with APIs for which we had credentials, it worked pretty well. + +However, we cannot undertake a big detour from our work to manually test each possible pipeline, so your feedback will be invaluable. +So please, if you try it, let us know how well it worked - and ideally, add the spec you used to our [repository](https://github.com/dlt-hub/openapi-specs). + +### What to do if it doesn't work? + +Once a pipeline is created, it is a **fully configurable instance of the REST API Source**. +So if anything did not go smoothly, you can make the final tweaks. +You can learn how to adjust the generated pipeline by reading our [REST API Source documentation](https://dlthub.com/docs/dlt-ecosystem/verified-sources/rest_api). + +### Are we using LLMS under the hood? + +No. This is a potential future enhancement, so maybe later. + +The pipelines are generated algorithmically with deterministic outcomes. This way, we have more control over the quality of the decisions. + +If we took an LLM-first approach, the errors would compound and put the burden back on the data person. 
+ +We are however considering using LLM-assists for the things that the algorithmic approach can't detect. Another avenue could be generating the OpenAPI spec from website docs. +So we are eager to get feedback from you on what works and what needs work, enabling us to improve it. + +## Try it out now! + +**Video Walkthrough:** + + + + +**[Colab demo](https://colab.research.google.com/drive/1MRZvguOTZj1MlkEGzjiso8lQ_wr1MJRI?usp=sharing)** - Load data from Stripe API to DuckDB using dlt and OpenAPI + +**[Docs](https://dlthub.com/docs/dlt-ecosystem/verified-sources/openapi-generator)** for `dlt-init-openapi` + +dlt init openapi **[code repo.](https://github.com/dlt-hub/dlt-init-openapi)** + +**[Specs repository you can generate from.](https://github.com/dlt-hub/openapi-specs)** + +Showcase your pipeline in the community sources **[here](https://www.notion.so/dlthub/dltHub-Community-Sources-Snippets-7a7f7ddb39334743b1ba3debbdfb8d7f) + +## Next steps: Feedback, discussion and sharing. + +Solving data engineering headaches in the open source is a team sport. +We got this far with your feedback and help (especially on [REST API source](https://dlthub.com/docs/blog/rest-api-source-client)), and are counting on your continuous usage and engagement +to steer our pushing of what's possible into uncharted, but needed directions. + +So here's our call to action: + +- We're excited to see how you will use our new pipeline generator and we are +eager for your feedback. **[Join our community and let us know how we can improve dlt-init-openapi](https://dlthub.com/community)** +- Got an OpenAPI spec? **[Add it to our specs repository](https://github.com/dlt-hub/openapi-specs)** so others may use it. If the spec doesn't work, please note that in the PR and we will use it for R&D. + +*Thank you for being part of our community and for building the future of ETL together!* + +*- dltHub Team* From bad7645202ffab203fd149372fd6227e71725455 Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Fri, 31 May 2024 09:35:27 +0200 Subject: [PATCH 04/21] rest_api: add troubleshooting section (#1371) --- .../verified-sources/rest_api.md | 85 +++++++++++++++++++ 1 file changed, 85 insertions(+) diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md b/docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md index 98725627b9..546adb3e5b 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md @@ -618,3 +618,88 @@ In this example, the source will ignore responses with a status code of 404, res - `content` (str, optional): A substring to search for in the response content. - `action` (str): The action to take when the condition is met. Currently supported actions: - `ignore`: Ignore the response. + +## Troubleshooting + +If you encounter issues while running the pipeline, enable [logging](../../running-in-production/running.md#set-the-log-level-and-format) for detailed information about the execution: + +```sh +RUNTIME__LOG_LEVEL=INFO python my_script.py +``` + +This also provides details on the HTTP requests. + +### Configuration issues + +#### Getting validation errors + +When you running the pipeline and getting a `DictValidationException`, it means that the [source configuration](#source-configuration) is incorrect. The error message provides details on the issue including the path to the field and the expected type. 
+ +For example, if you have a source configuration like this: + +```py +config: RESTAPIConfig = { + "client": { + # ... + }, + "resources": [ + { + "name": "issues", + "params": { # <- Wrong: this should be inside + "sort": "updated", # the endpoint field below + }, + "endpoint": { + "path": "issues", + # "params": { # <- Correct configuration + # "sort": "updated", + # }, + }, + }, + # ... + ], +} +``` + +You will get an error like this: + +```sh +dlt.common.exceptions.DictValidationException: In path .: field 'resources[0]' +expects the following types: str, EndpointResource. Provided value {'name': 'issues', 'params': {'sort': 'updated'}, +'endpoint': {'path': 'issues', ... }} with type 'dict' is invalid with the following errors: +For EndpointResource: In path ./resources[0]: following fields are unexpected {'params'} +``` + +It means that in the first resource configuration (`resources[0]`), the `params` field should be inside the `endpoint` field. + +:::tip +Import the `RESTAPIConfig` type from the `rest_api` module to have convenient hints in your editor/IDE and use it to define the configuration object. + +```py +from rest_api import RESTAPIConfig +``` +::: + +#### Getting wrong data or no data + +If incorrect data is received from an endpoint, check the `data_selector` field in the [endpoint configuration](#endpoint-configuration). Ensure the JSONPath is accurate and points to the correct data in the response body. `rest_api` attempts to auto-detect the data location, which may not always succeed. See the [data selection](#data-selection) section for more details. + +#### Getting insufficient data or incorrect pagination + +Check the `paginator` field in the configuration. When not explicitly specified, the source tries to auto-detect the pagination method. If auto-detection fails, or the system is unsure, a warning is logged. For production environments, we recommend to specify an explicit paginator in the configuration. See the [pagination](#pagination) section for more details. Some APIs may have non-standard pagination methods, and you may need to implement a [custom paginator](../../general-usage/http/rest-client.md#implementing-a-custom-paginator). + +#### Getting HTTP 404 errors + +Some API may return 404 errors for resources that do not exist or have no data. Manage these responses by configuring the `ignore` action in [response actions](#response-actions). + +### Authentication issues + +If experiencing 401 (Unauthorized) errors, this could indicate: + +- Incorrect authorization credentials. Verify credentials in the `secrets.toml`. Refer to [Secret and configs](../../general-usage/credentials/configuration#understanding-the-exceptions) for more information. +- An incorrect authentication type. Consult the API documentation for the proper method. See the [authentication](#authentication) section for details. For some APIs, a [custom authentication method](../../general-usage/http/rest-client.md#custom-authentication) may be required. + +### General guidelines + +The `rest_api` source uses the [RESTClient](../../general-usage/http/rest-client.md) class for HTTP requests. Refer to the RESTClient [troubleshooting guide](../../general-usage/http/rest-client.md#troubleshooting) for debugging tips. + +For further assistance, join our [Slack community](https://dlthub.com/community). We're here to help! 
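To recap the configuration-side fixes from this section in one place, here is a sketch that sets the data selector explicitly and ignores 404 responses. The `base_url`, resource name, and JSONPath are illustrative placeholders:

```py
from rest_api import RESTAPIConfig  # as recommended in the tip above

config: RESTAPIConfig = {
    "client": {
        "base_url": "https://api.example.com",  # illustrative
    },
    "resources": [
        {
            "name": "issues",
            "endpoint": {
                "path": "issues",
                # point directly at the data instead of relying on auto-detection
                "data_selector": "results",
                # treat missing resources as non-fatal instead of failing the run
                "response_actions": [
                    {"status_code": 404, "action": "ignore"},
                ],
            },
        },
    ],
}
```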
\ No newline at end of file From 964a94d61f6a8243cab3195a0bbbe7e65bc6384b Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Tue, 4 Jun 2024 15:24:34 +0200 Subject: [PATCH 05/21] RESTClient: add docs for `init_request` (#1442) --- docs/website/docs/general-usage/http/rest-client.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/docs/website/docs/general-usage/http/rest-client.md b/docs/website/docs/general-usage/http/rest-client.md index 1093428b0f..8f517389c6 100644 --- a/docs/website/docs/general-usage/http/rest-client.md +++ b/docs/website/docs/general-usage/http/rest-client.md @@ -305,7 +305,9 @@ client = RESTClient( ### Implementing a custom paginator -When working with APIs that use non-standard pagination schemes, or when you need more control over the pagination process, you can implement a custom paginator by subclassing the `BasePaginator` class and `update_state` and `update_request` methods: +When working with APIs that use non-standard pagination schemes, or when you need more control over the pagination process, you can implement a custom paginator by subclassing the `BasePaginator` class and implementing `init_request`, `update_state` and `update_request` methods: + +- `init_request(request: Request) -> None`: This method is called before making the first API call in the `RESTClient.paginate` method. You can use this method to set up the initial request query parameters, headers, etc. For example, you can set the initial page number or cursor value. - `update_state(response: Response) -> None`: This method updates the paginator's state based on the response of the API call. Typically, you extract pagination details (like the next page reference) from the response and store them in the paginator instance. @@ -325,6 +327,10 @@ class QueryParamPaginator(BasePaginator): self.page_param = page_param self.page = initial_page + def init_request(self, request: Request) -> None: + # This will set the initial page number (e.g. page=1) + self.update_request(request) + def update_state(self, response: Response) -> None: # Assuming the API returns an empty list when no more data is available if not response.json(): From b6a3969d3d5f4bfb08ab6c7f1810b877b0a227ea Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Wed, 12 Jun 2024 19:02:39 +0200 Subject: [PATCH 06/21] Add a troubleshooting section to incremental docs (#1458) --- .../verified-sources/rest_api.md | 6 ++ .../docs/general-usage/incremental-loading.md | 63 ++++++++++++++++++- 2 files changed, 68 insertions(+), 1 deletion(-) diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md b/docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md index 546adb3e5b..e28c5bac30 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md @@ -577,6 +577,8 @@ When the API endpoint supports incremental loading, you can configure the source See the [incremental loading](../../general-usage/incremental-loading.md#incremental-loading-with-a-cursor-field) guide for more details. +If you encounter issues with incremental loading, see the [troubleshooting section](../../general-usage/incremental-loading.md#troubleshooting) in the incremental loading guide. 
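For orientation, an incrementally loaded endpoint is configured along these lines. The query parameter (`since`), cursor field (`updated_at`) and initial value below are illustrative, and the exact options are described in the incremental loading section above:

```py
# one entry of the "resources" list in the RESTAPIConfig
issues_resource = {
    "name": "issues",
    "endpoint": {
        "path": "issues",
        "params": {
            # bind the API's query parameter to a cursor field in the response
            "since": {
                "type": "incremental",
                "cursor_path": "updated_at",
                "initial_value": "2024-01-01T00:00:00Z",
            },
        },
    },
}
```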
+ ## Advanced configuration `rest_api_source()` function creates the [dlt source](../../general-usage/source.md) and lets you configure the following parameters: @@ -687,6 +689,10 @@ If incorrect data is received from an endpoint, check the `data_selector` field Check the `paginator` field in the configuration. When not explicitly specified, the source tries to auto-detect the pagination method. If auto-detection fails, or the system is unsure, a warning is logged. For production environments, we recommend to specify an explicit paginator in the configuration. See the [pagination](#pagination) section for more details. Some APIs may have non-standard pagination methods, and you may need to implement a [custom paginator](../../general-usage/http/rest-client.md#implementing-a-custom-paginator). +#### Incremental loading not working + +See the [troubleshooting guide](../../general-usage/incremental-loading.md#troubleshooting) for incremental loading issues. + #### Getting HTTP 404 errors Some API may return 404 errors for resources that do not exist or have no data. Manage these responses by configuring the `ignore` action in [response actions](#response-actions). diff --git a/docs/website/docs/general-usage/incremental-loading.md b/docs/website/docs/general-usage/incremental-loading.md index 18bdb13b06..f99a3a9e57 100644 --- a/docs/website/docs/general-usage/incremental-loading.md +++ b/docs/website/docs/general-usage/incremental-loading.md @@ -501,7 +501,7 @@ def get_events(last_created_at = dlt.sources.incremental("$", last_value_func=by ``` ### Using `last_value_func` for lookback -The example below uses the `last_value_func` to load data from the past month. +The example below uses the `last_value_func` to load data from the past month. ```py def lookback(event): last_value = None @@ -977,3 +977,64 @@ def search_tweets(twitter_bearer_token=dlt.secrets.value, search_terms=None, sta yield page ``` + +## Troubleshooting + +If you see that the incremental loading is not working as expected and the incremental values are not modified between pipeline runs, check the following: + +1. Make sure the `destination`, `pipeline_name` and `dataset_name` are the same between pipeline runs. + +2. Check if `dev_mode` is `False` in the pipeline configuration. Check if `refresh` for associated sources and resources is not enabled. + +3. Check the logs for `Bind incremental on ...` message. This message indicates that the incremental value was bound to the resource and shows the state of the incremental value. + +4. After the pipeline run, check the state of the pipeline. You can do this by running the following command: + +```sh +dlt pipeline -v info +``` + +For example, if your pipeline is defined as follows: + +```py +@dlt.resource +def my_resource( + incremental_object = dlt.sources.incremental("some_key", initial_value=0), +): + ... + +pipeline = dlt.pipeline( + pipeline_name="example_pipeline", + destination="duckdb", +) + +pipeline.run(my_resource) +``` + +You'll see the following output: + +```text +Attaching to pipeline +... + +sources: +{ + "example": { + "resources": { + "my_resource": { + "incremental": { + "some_key": { + "initial_value": 0, + "last_value": 42, + "unique_hashes": [ + "nmbInLyII4wDF5zpBovL" + ] + } + } + } + } + } +} +``` + +Verify that the `last_value` is updated between pipeline runs. 
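Related to points 1 and 2 in the checklist above: incremental state only survives between runs when the pipeline identity stays stable and development mode is off. A minimal sanity check (argument names may vary slightly with your dlt version):

```py
import dlt

pipeline = dlt.pipeline(
    pipeline_name="example_pipeline",  # keep name, destination and dataset constant between runs
    destination="duckdb",
    dataset_name="example_data",
    dev_mode=False,  # True would reset the state (and the incremental cursor) on every run
)
```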
\ No newline at end of file From 5ac69c22c71a87c0d5f9a5b9c080214dcf69906b Mon Sep 17 00:00:00 2001 From: adrianbr Date: Mon, 17 Jun 2024 08:52:27 +0200 Subject: [PATCH 07/21] blog pandas (#1462) --- .../2024-06-12-from-pandas-to-production.md | 205 ++++++++++++++++++ 1 file changed, 205 insertions(+) create mode 100644 docs/website/blog/2024-06-12-from-pandas-to-production.md diff --git a/docs/website/blog/2024-06-12-from-pandas-to-production.md b/docs/website/blog/2024-06-12-from-pandas-to-production.md new file mode 100644 index 0000000000..92fe58e107 --- /dev/null +++ b/docs/website/blog/2024-06-12-from-pandas-to-production.md @@ -0,0 +1,205 @@ +--- +slug: pandas-to-production +title: "From Pandas to Production: why dlt is the right ELT tool for Normies" +image: https://storage.googleapis.com/dlt-blog-images/i-am-normal.png +authors: + name: Adrian Brudaru + title: Open source Data Engineer + url: https://github.com/adrianbr + image_url: https://avatars.githubusercontent.com/u/5762770?v=4 +tags: [pandas, production, etl, etl] +--- + + + +:::tip +**TL;DR: We created a library to reduce friction between data engineers, data scientists, and the rest of the team. From Pandas to Production article tells the story of how we got here.** + +But if you want to load pandas dfs to production databases, with all the best practices built-in, check out this [documentation](https://dlthub.com/docs/dlt-ecosystem/verified-sources/arrow-pandas) or this colab notebook that shows [easy handling of complex api data](https://colab.research.google.com/drive/1DhaKW0tiSTHDCVmPjM-eoyL47BJ30xmP#scrollTo=1wf1R0yQh7pv). + +Here are the best practices: [wishlist becomes reality](#our-dream-a-tool-that-meets-production-pipelines-requirements) + +Or check out more resources [at the end of the article](#call-to-action) +::: + +## I. The background story: Normal people load data too + +Hey, I’m Adrian, cofounder of dlt. I’ve been working in the data industry since 2012, doing all kinds of end-to-end things. + +In 2017, a hiring team called me a data engineer. As I saw that title brought me a lot of work offers, I kept it and went with it. + +But was I doing data engineering? Yes and no. Since my studies were not technical, I always felt some impostor syndrome calling myself a data engineer. I had started as an analyst, did more and more and became an end to end data professional that does everything from building the tech stack, collecting requirements, getting managers to agree on the metrics used 🙄, creating roadmap and hiring a team. + +Back in 2022 there was an online conference called [Normconf](https://normconf.com/) and I ‘felt seen’. As [I watched Normconf participants](https://www.youtube.com/@normconf), I could relate more to them than to the data engineer label. No, I am not just writing code and pushing best practices - I am actually just trying to get things done without getting bogged down in bad practice gotchas. And it seemed at this conference that many people felt this way. + +![normal](https://storage.googleapis.com/dlt-blog-images/i-am-normal.png) + +### Normies: Problem solvers with antipathy for black boxes, gratuitous complexity and external dependencies + +At Normconf, "normie" participants often embodied the three fundamental psychological needs identified in Self-Determination Theory: autonomy, competence, and relatedness. 
+ +They talked about how they autonomously solved all kinds of problems, related on the pains and gains of their roles, and showed off their competence across the board, in solving problems. + +What they did, was what I also did as a data engineer: We start from a business problem, and work back through what needs to be done to understand and solve it. + +By very definition, Normie is someone not very specialised at one thing or another, and in our field, even data engineers are jacks of all trades. + +What undermines the Normie mission are things that clash with the basic needs, from uncustomisable products, to vendors that add bottlenecks and unreliable dependencies. + +### Encountering friction between data engineers and Python-first analysts + +Before becoming a co-founder of dlt I had 5 interesting years as a startup employee, a half-year nightmare in a corporation with no autonomy or mastery (I got fired for refusing the madness, and it was such a huge relief), followed by 5 fun, rewarding and adventure-filled years of freelancing. Much of my work was “build&hire” which usually meant building a first time data warehouse and hiring a team for it. The setups that I did were bespoke to the businesses that were getting them, including the teams - Meaning, the technical complexity was also tailored to the (lack of) technical culture of the companies I was building for. + +In this time, I saw an acute friction between data engineers and Python-first analysts, mostly around the fact that data engineers easily become a bottleneck and data scientists are forced to pick up the slack. And of course, this causes other issues that might further complicate the life of the data engineer, while still not being a good solution for the data consumers. + +So at this point I started building boilerplate code for data warehouses and learning how to better cater to the entire team. + + +### II. The initial idea: pandas.df.to_sql() with data engineering best practices + +After a few attempts I ended up with the hypothesis that df.to_sql() is the natural abstraction a data person would use - I have a table here, I want a table there, shouldn’t be harder than a function call right? + +Right. + +Except that particular function call is anything but data engineering complete. A single run will do what it promises. A production pipeline will also have many additional requirements. In the early days, we wrote up an ideal list of features that should be auto-handled (spoiler alert: today dlt does all that and more). Read on for the wish list: + +### Our dream: a tool that meets production pipelines requirements + +- Wouldn’t it be nice if we could auto-flatten and unpack nested structures into tables with generated join keys? + + +- Wouldn’t it be nice if data types were properly defined and managed? +- Wouldn’t it be nice if we could load the data incrementally, meaning retain some state to know where to start from? +- Wouldn’t it be nice if this incremental load was bound to a way to do incremental extraction? +- Wouldn’t it be nice if we didn’t run out of memory? +- Wouldn’t it be nice if we got alerted/notified when schemas change? +- Wouldn’t it be nice if schema changes were self healing? +- Wouldn’t it be nice if I could run it all in parallel, or do async calls? +- Wouldn’t it be nice if it ran on different databases too, from dev to prod? +- Wouldn’t it be nice if it offered requests with built in retries for those nasty unreliable apis (Hey Zendesk, why you fail on call 99998/100000?) 
+- Wouldn’t it be nice if we had some extraction helpers like pagination detection? + +Auto typing and unpacking with generated keys: +![keys](https://storage.googleapis.com/dlt-blog-images/generated_keys.png) + +Performance [docs](https://dlthub.com/docs/reference/performance) + + +### The initial steps + +How did we go about it? At first dlt was created as an engine to iron out its functionality. During this time, it was deployed it in several projects, from startups to enterprises, particularly to accelerate data pipeline building in a robust way. + +A while later, to prepare this engine for the general public, we created the current interface on top of it. We then tested it in a workshop with many “Normies” of which over 50% were pre-employment learners. + +For the workshop we broke down the steps to build an incremental pipeline into 20 steps. In the 6 hour workshop we asked people to react on Slack to each “checkpoint”. We then exported the slack data and loaded it with dlt, exposing the completion rate per checkpoint. Turns out, it was 100%. +Everyone who started, managed to build the pipeline. “This is it!” we thought, and spend the next 6 months preparing our docs and adding some plugins for easy deployment. + +## III. Launching dlt + +We finally launched dlt mid 2023 to the general public. Our initial community was mostly data engineers who had been using dlt without docs, +managing from reading code. As we hoped a lot of “normies” are using dlt, too! + +## dlt = code + docs + Slack support + +A product is a sum of many parts. For us dlt is not only the dlt library and interface, but also our docs and Slack community and the support and discussions there. + +In the early days of dlt we talked to Sebastian Ramirez from FastAPI who told us that he spends 2/3 of his FastAPI time writing documentation. + +In this vein, from the beginning docs were very important to us and we quickly adopted our own [docs standard](https://www.writethedocs.org/videos/eu/2017/the-four-kinds-of-documentation-and-why-you-need-to-understand-what-they-are-daniele-procida/). + +However, when we originally launched dlt, we found that different user types, especially Normies, expect different things from our docs, and because we asked for feedback, they told us. + +So overall, we were not satisfied to stop there. + +### "Can you make your docs more like my favorite tool's docs?" + +To this end we built and embedded our own docs helper in our docs. + +The result? The docs helper has been running for a year and we currently see around **300 questions per day.** Comparing this to other communities that do AI support on Slack, that’s almost 2 orders of magnitude difference in question volume by community size. + +We think this is a good thing, and a result of several factors. + +- Embedded in docs means at the right place at the right time. Available to anyone, whether they use Slack or not. +- Conversations are private and anonymous. This reduces the emotional barrier of asking. We suspect this is great for the many “Normies” / “problem solvers” that work in data. +- The questions are different than in our Slack community: Many questions are around “Setup and configuration”, “Troubleshooting” and “General questions” about dlt architecture. In Slack, we see the questions that our docs or assistant could not answer. +- The bot is conversational and will remember recent context, enabling it to be particularly helpful. 
This is different from the “question answering service” that many Slack bots offer, which do not keep context once a question was answered. By retaining context, it’s possible to reach a useful outcome even if it doesn’t come in the first reply. + +### dlt = “pip install and go” - the fastest way to create a pipeline and source + +dlt offers a small number of verified sources, but encourages you to build your own. As we +mentioned, creating an ad hoc dlt [pipeline and source](https://dlthub.com/docs/tutorial/load-data-from-an-api) is +[dramatically simpler](https://dlthub.com/docs/build-a-pipeline-tutorial#the-simplest-pipeline-1-liner-to-load-data-with-schema-evolution) compared to other python libraries. +Maintaining a custom dlt source in production takes no time at all because the pipeline won't break unless the source stops existing. + +The sources you build and run that are not shared back into the verified sources are what we call “private sources”. + +By the end of 2023, our community had already built 1,000 private sources, [2,000 by early March](https://dlthub.com/docs/blog/code-vs-buy). We +are now at the end of q2 2024 and we see 5,000 private sources. + +### Embracing LLM-free code generation + +We recently launched additional tooling that helps our users build sources. If you wish to try our python-first +dict-based declarative approach to building sources, check out the relevant post. + +- Rest api connector +- Openapi based pipeline generator that configures the rest api connector. + +Alena introduces the generator and troubleshoots the outcome in 4min: + + +Community videos for rest api source: [playlist](https://www.youtube.com/playlist?list=PLpTgUMBCn15rs2NkB4ise780UxLKImZTh). + +Both tools are LLM-free pipeline generators. I stress LLM free, because in our experience, GPT can +do some things to some extent - so if we ask it to complete 10 tasks to produce a pipeline, each +having 50-90% accuracy, we can expect very low success rates. + +To get around this problem, we built from the OpenAPI standard which contains information that can +be turned into a pipeline algorithmically. OpenAPI is an Api spec that’s also used by FastAPI and +constantly growing in popularity, with around 50% of apis currently supporting it. + +By leveraging the data in the spec, we are able to have a basic pipeline. Our generator also infers +some other pieces of information algorithmically to make the pipeline incremental and add some other useful details. + +### When generation doesn’t work + +Of course, generation doesn’t always work but you can take the generated pipeline and make the final +adjustments to have a standard REST API config-based pipeline that won’t suffer from code smells. + +### The benefit of minimalistic sources + +The real benefit of this declarative source is not at building time - A declarative interface requires +more upfront knowledge. Instead, by having this option, we enable minimalistic pipelines that anyone could +maintain, including non coders or human-assisted LLMs. After all, LLMs are particularly proficient at translating configurations back and forth. + +Want to influence us? we listen, so you’re welcome to discuss with us in our slack channel [**#4-discussions**](https://dlthub.com/community) + +### Towards a paid offering + +dlt is an open core product, meaning it won’t be gated to push you to the paid version at some point. +Instead, much like Kafka and Confluent, we will offer things around dlt to help you leverage it in your context. 
+ +If you are interested to help us research what’s needed, you can apply for our design partnership +program, that aims to help you deploy dlt, while helping us learn about your challenges. + +## Call to action. + +If you like the idea of dlt, there is one thing that would help us: + +**Set aside 30min and try it.** + +See resource below. + +We often hear variations of “oh i postponed dlt so long but it only took a few minutes to get going, wish I hadn’t +installed [other tool] which took 2 weeks to set up properly and now we need to maintain or replace”, so don't be that guy. + + +Here are some notebooks and docs to open your appetite: + + +- An [API pipeline step by step tutorial](https://dlthub.com/docs/tutorial/load-data-from-an-api) to build a production pipeline from an api +- A colab demo of [schema evolution](https://colab.research.google.com/drive/1H6HKFi-U1V4p0afVucw_Jzv1oiFbH2bu#scrollTo=e4y4sQ78P_OM) (2min read) +- Docs: RestClient, the imperative class that powers the REST API source, featuring auto pagination https://dlthub.com/docs/general-usage/http/rest-client +- Docs: [Build a simple pipeline](https://dlthub.com/docs/walkthroughs/create-a-pipeline) +- Docs: [Build a complex pipeline](https://dlthub.com/docs/walkthroughs/create-a-pipeline) +- Docs: [capabilities overview](https://dlthub.com/docs/build-a-pipeline-tutorial) hub page +- Community & Help: [Slack join link.](https://dlthub.com/community) \ No newline at end of file From 773a3030656c72ba0e81055fe491decf780f908a Mon Sep 17 00:00:00 2001 From: Adrian Date: Mon, 17 Jun 2024 09:02:33 +0200 Subject: [PATCH 08/21] blog post format --- docs/website/blog/2024-06-12-from-pandas-to-production.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/website/blog/2024-06-12-from-pandas-to-production.md b/docs/website/blog/2024-06-12-from-pandas-to-production.md index 92fe58e107..0de99085a7 100644 --- a/docs/website/blog/2024-06-12-from-pandas-to-production.md +++ b/docs/website/blog/2024-06-12-from-pandas-to-production.md @@ -12,7 +12,7 @@ tags: [pandas, production, etl, etl] -:::tip +:::info **TL;DR: We created a library to reduce friction between data engineers, data scientists, and the rest of the team. From Pandas to Production article tells the story of how we got here.** But if you want to load pandas dfs to production databases, with all the best practices built-in, check out this [documentation](https://dlthub.com/docs/dlt-ecosystem/verified-sources/arrow-pandas) or this colab notebook that shows [easy handling of complex api data](https://colab.research.google.com/drive/1DhaKW0tiSTHDCVmPjM-eoyL47BJ30xmP#scrollTo=1wf1R0yQh7pv). From b455f61aff4937741244821d8aa37f39637c1ebf Mon Sep 17 00:00:00 2001 From: dat-a-man <98139823+dat-a-man@users.noreply.github.com> Date: Mon, 17 Jun 2024 12:40:24 +0530 Subject: [PATCH 09/21] Added slowly changing dimensions and incremental loading blog article (#1468) * Added slowly changing dimensions and incremental loading blog article * Updated pic * Corrected for the error in `sql_database.md`: ``` Type checking Python snippets WARNING: Failed to type check Snippet No. 
1131 in ../website/docs/dlt-ecosystem/verified-sources/sql_database.md at line 173 with language py lint_setup/lint_me.py:47: error: Module has no attribute "Double" [attr-defined] Found 1 error in 1 file (checked 1 source file) ``` * Added info about slowly changing dimensions --- ...2024-06-19-scd2-and-incremental-loading.md | 91 +++++++++++++++++++ .../verified-sources/sql_database.md | 4 +- .../docs/general-usage/incremental-loading.md | 7 +- 3 files changed, 97 insertions(+), 5 deletions(-) create mode 100644 docs/website/blog/2024-06-19-scd2-and-incremental-loading.md diff --git a/docs/website/blog/2024-06-19-scd2-and-incremental-loading.md b/docs/website/blog/2024-06-19-scd2-and-incremental-loading.md new file mode 100644 index 0000000000..6e6fe4e3b0 --- /dev/null +++ b/docs/website/blog/2024-06-19-scd2-and-incremental-loading.md @@ -0,0 +1,91 @@ +--- +slug: scd2-and-incremental-loading +title: "Slowly Changing Dimensions and Incremental loading strategies" +authors: + name: Aman Gupta + title: Junior Data Engineer + url: https://github.com/dat-a-man + image_url: https://dlt-static.s3.eu-central-1.amazonaws.com/images/aman.png +tags: [scd2, incremental loading, slowly changing dimensions, python data pipelines] +--- + +Data flows over time. Recognizing this is crucial for building effective data pipelines. This article focuses on the temporal aspect of data, especially when evaluating the volume of data processed in previous runs and planning for new data loads. You can easily tackle this aspect by using timestamps or cursor fields. + +Incremental loading is a key technique here. It captures the latest changes by tracking where the last data chunk was processed and starting the next load. This state management ensures you only process new or changed data. + +Additionally, slowly changing dimensions (SCDs) play a vital role. They help capture changes over time, ensuring historical data accuracy. Using SCDs allows you to manage and track data changes, maintaining current and historical views. This article will delve into these concepts and provide insights on how to implement them effectively. + +### **What is Slowly Changing Dimension Type 2 (SCD2)?** + +Let’s consider Slowly Changing Dimension Type 2 (SCD2), a method that tracks changes over time without discarding history. For example, when a customer updates their address in a delivery platform, a new entry is created instead of replacing the old address. This marks each address’s relevance timeline, showcasing SCD2’s practical application. + +When a customer updates their address, the previous one isn’t erased; a new row is added, and the old record is timestamped to indicate its validity period. This approach enriches the customer profile with a historical dimension rather than simply overwriting it. + +For more information on how dlt handles SCD2 strategy, refer to the [documentation here](https://dlthub.com/docs/general-usage/incremental-loading#scd2-strategy). + +### **Loading usually doesn’t preserve change history** + +Incremental loading boosts data loading efficiency by selectively processing only new or modified data based on the source extraction pattern. This approach is similar to updating a journal with new events without rewriting past entries. For example, in a customer interaction tracking system, incremental loading captures only the most recent interactions, avoiding the redundancy of reprocessing historical data. + +Consider two issues, “Issue A” and “Issue B”. Initially, both are processed. 
If the pipeline is set to increment based on an `updated_at` field and “Issue B” gets updated, only this issue will be fetched and loaded in the next run. The `updated_at` timestamp is stored in the pipeline state and serves as a reference point for the next data load. How the data is added to the table depends on the pipeline’s write disposition. + +For more details on incremental loading, refer to the [documentation here](https://dlthub.com/docs/general-usage/incremental-loading). + +### The change history: SCD2 and Incremental loading + +Combining SCD2 with incremental loading creates a symphony in data management. While SCD2 safeguards historical data, incremental loading efficiently incorporates new updates. This synergy is illustrated through the following example: + +**Example 1: Customer Status Changes** + +**Initial load with slowly changing dimensions enabled:** + +- Alice and Bob start with the statuses "bronze" and "gold", respectively. + +| customer_key | name | status | _dlt_valid_from | _dlt_valid_to | last_updated | +| ------------ | ----- | ------ | ------------------- | ------------- | ------------------- | +| 1 | Alice | bronze | 2024-01-01 00:00:00 | NULL | 2024-01-01 12:00:00 | +| 2 | Bob | gold | 2024-01-01 00:00:00 | NULL | 2024-01-01 12:00:00 | + +**Incremental load (Alice's status changes to silver):** +- Alice’s status update to "silver" triggers a new entry, while her "bronze" status is preserved with a timestamp marking its duration. + +| customer_key | name | status | _dlt_valid_from | _dlt_valid_to | last_updated | +| ------------ | ----- | ------ | ------------------- | ------------------- | ------------------- | +| 1 | Alice | bronze | 2024-01-01 00:00:00 | 2024-02-01 00:00:00 | 2024-01-01 12:00:00 | +| 1 | Alice | silver | 2024-02-01 00:00:00 | NULL | 2024-02-01 12:00:00 | +| 2 | Bob | gold | 2024-01-01 00:00:00 | NULL | 2024-01-01 12:00:00 | + + +Incremental loading would process only Alice's record because it was updated after the last load, and slowly changing dimensions would keep the record until Alice’s status was bronze. + +This demonstrates how using the `last_updated` timestamp, an incremental loading strategy, ensures only the latest data is fetched and loaded. Meanwhile, SCD2 helps maintain historical records. + +### Simple steps to determine data loading strategy and write disposition + +This decision flowchart helps determine the most suitable data loading strategy and write disposition: + +1. Is your data stateful? Stateful data is subject to change, like your age. Stateless data does not change, for example, events that happened in the past are stateless. + 1. If your data is stateless, such as logs, you can just increment by appending new logs + 2. If it is stateful, do you need to track changes to it? + 1. If yes, then use SCD2 to track changes + 2. If no, + 1. Can you extract it incrementally (new changes only?) + 1. If yes, load incrementally via merge + 2. If no, re-load fully via replace. + +Below is a visual representation of steps discussed above: +![Image](https://storage.googleapis.com/dlt-blog-images/flowchart_for_scd2.png) + +### **Conclusion** + +Slowly changing dimensions detect and log changes between runs. The mechanism is not perfect and it will not capture if multiple changes occurred to a single data point between the runs - only the last state will be reflected. 
However, they enable you to track things you could not before such as
+
+- Hard deletes
+- Daily changes and when they occurred
+- Different versions of entities valid at different historical times
+
+Want to discuss?
+
+[Join the dlt slack community](https://dlthub.com/community) to take part in the conversation.
\ No newline at end of file
diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/sql_database.md b/docs/website/docs/dlt-ecosystem/verified-sources/sql_database.md
index 1891157b4a..f3d94ab1be 100644
--- a/docs/website/docs/dlt-ecosystem/verified-sources/sql_database.md
+++ b/docs/website/docs/dlt-ecosystem/verified-sources/sql_database.md
@@ -177,9 +177,9 @@ pipeline = dlt.pipeline(
 )
 
 def _double_as_decimal_adapter(table: sa.Table) -> None:
-    """Return double as double, not decimals, this is mysql thing"""
+    """Emits floats instead of decimals."""
     for column in table.columns.values():
-        if isinstance(column.type, sa.Double):
+        if isinstance(column.type, sa.Float):
             column.type.asdecimal = False
 
 sql_alchemy_source = sql_database(
diff --git a/docs/website/docs/general-usage/incremental-loading.md b/docs/website/docs/general-usage/incremental-loading.md
index 18bdb13b06..c3466dc688 100644
--- a/docs/website/docs/general-usage/incremental-loading.md
+++ b/docs/website/docs/general-usage/incremental-loading.md
@@ -29,7 +29,7 @@ using `primary_key`. Use `write_disposition='merge'`.
-![write disposition flowchart](/img/write-dispo-choice.png) +![write disposition flowchart](https://storage.googleapis.com/dlt-blog-images/flowchart_for_scd2.png)
@@ -41,8 +41,9 @@ user's profile Stateless data cannot change - for example, a recorded event, suc Because stateless data does not need to be updated, we can just append it. -For stateful data, comes a second question - Can I extract it incrementally from the source? If not, -then we need to replace the entire data set. If however we can request the data incrementally such +For stateful data, comes a second question - Can I extract it incrementally from the source? If yes, you should use [slowly changing dimensions (Type-2)](#scd2-strategy), which allow you to maintain historical records of data changes over time. + +If not, then we need to replace the entire data set. If however we can request the data incrementally such as "all users added or modified since yesterday" then we can simply apply changes to our existing dataset with the merge write disposition. From 8a3e95c63672cde9ecb2ee1ba2b51e88c9d57bd2 Mon Sep 17 00:00:00 2001 From: Marcin Rudolf Date: Mon, 17 Jun 2024 14:52:37 +0200 Subject: [PATCH 10/21] fixes wrong oauth sample config for snowflake --- .../docs/dlt-ecosystem/destinations/snowflake.md | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/docs/website/docs/dlt-ecosystem/destinations/snowflake.md b/docs/website/docs/dlt-ecosystem/destinations/snowflake.md index deaaff3562..13dd2d878a 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/snowflake.md +++ b/docs/website/docs/dlt-ecosystem/destinations/snowflake.md @@ -73,7 +73,7 @@ You can also decrease the suspend time for your warehouse to 1 minute (**Admin** Snowflake destination accepts three authentication types: - password authentication - [key pair authentication](https://docs.snowflake.com/en/user-guide/key-pair-auth) -- external authentication +- oauth authentication The **password authentication** is not any different from other databases like Postgres or Redshift. `dlt` follows the same syntax as the [SQLAlchemy dialect](https://docs.snowflake.com/en/developer-guide/python-connector/sqlalchemy#required-parameters). @@ -81,6 +81,7 @@ You can also pass credentials as a database connection string. For example: ```toml # keep it at the top of your toml file! before any section starts destination.snowflake.credentials="snowflake://loader:@kgiotue-wn98412/dlt_data?warehouse=COMPUTE_WH&role=DLT_LOADER_ROLE" + ``` In **key pair authentication**, you replace the password with a private key string that should be in Base64-encoded DER format ([DBT also recommends](https://docs.getdbt.com/docs/core/connect-data-platform/snowflake-setup#key-pair-authentication) base64-encoded private keys for Snowflake connections). The private key may also be encrypted. In that case, you must provide a passphrase alongside the private key. @@ -100,16 +101,19 @@ If you pass a passphrase in the connection string, please URL encode it. destination.snowflake.credentials="snowflake://loader:@kgiotue-wn98412/dlt_data?private_key=&private_key_passphrase=" ``` -In **external authentication**, you can use an OAuth provider like Okta or an external browser to authenticate. You pass your authenticator and refresh token as below: +In **oauth authentication**, you can use an OAuth provider like Snowflake, Okta or an external browser to authenticate. In case of Snowflake, you pass your authenticator and refresh token as below: ```toml [destination.snowflake.credentials] database = "dlt_data" username = "loader" -authenticator="..." 
+password = "ignore" # put a fake password until #1456 is fixed +authenticator="oauth" +[destination.snowflake.credentials.query] token="..." ``` or in the connection string as query parameters. -Refer to Snowflake [OAuth](https://docs.snowflake.com/en/user-guide/oauth-intro) for more details. + +In case of external authentication, you need to find documentation for your OAuth provider. Refer to Snowflake [OAuth](https://docs.snowflake.com/en/user-guide/oauth-intro) for more details. ## Write disposition All write dispositions are supported. From e792db6bc6ebd7ab00485917bdd3c79aa749557f Mon Sep 17 00:00:00 2001 From: adrianbr Date: Mon, 17 Jun 2024 17:13:56 +0200 Subject: [PATCH 11/21] blog post format (#1475) --- ...2024-06-19-scd2-and-incremental-loading.md | 131 ++++++++++++------ 1 file changed, 86 insertions(+), 45 deletions(-) diff --git a/docs/website/blog/2024-06-19-scd2-and-incremental-loading.md b/docs/website/blog/2024-06-19-scd2-and-incremental-loading.md index 6e6fe4e3b0..8b0a4c3cdd 100644 --- a/docs/website/blog/2024-06-19-scd2-and-incremental-loading.md +++ b/docs/website/blog/2024-06-19-scd2-and-incremental-loading.md @@ -1,6 +1,6 @@ --- slug: scd2-and-incremental-loading -title: "Slowly Changing Dimensions and Incremental loading strategies" +title: "Slowly Changing Dimension Type2: Explanation and code" authors: name: Aman Gupta title: Junior Data Engineer @@ -9,68 +9,115 @@ authors: tags: [scd2, incremental loading, slowly changing dimensions, python data pipelines] --- -Data flows over time. Recognizing this is crucial for building effective data pipelines. This article focuses on the temporal aspect of data, especially when evaluating the volume of data processed in previous runs and planning for new data loads. You can easily tackle this aspect by using timestamps or cursor fields. -Incremental loading is a key technique here. It captures the latest changes by tracking where the last data chunk was processed and starting the next load. This state management ensures you only process new or changed data. -Additionally, slowly changing dimensions (SCDs) play a vital role. They help capture changes over time, ensuring historical data accuracy. Using SCDs allows you to manage and track data changes, maintaining current and historical views. This article will delve into these concepts and provide insights on how to implement them effectively. +:::info +**TL;DR: Check this colab notebook for a short and sweet demo: +[Colab demo](https://colab.research.google.com/drive/115cRdw1qvekZbXIQSXYkAZzLAqD9_x_I) +::: -### **What is Slowly Changing Dimension Type 2 (SCD2)?** +# What is a slowly changing dimension? -Let’s consider Slowly Changing Dimension Type 2 (SCD2), a method that tracks changes over time without discarding history. For example, when a customer updates their address in a delivery platform, a new entry is created instead of replacing the old address. This marks each address’s relevance timeline, showcasing SCD2’s practical application. +Slowly changing dimensions are a dimensional modelling technique created for historising changes in data. -When a customer updates their address, the previous one isn’t erased; a new row is added, and the old record is timestamped to indicate its validity period. This approach enriches the customer profile with a historical dimension rather than simply overwriting it. +This technique only works if the dimensions change slower than we read the data, since we would not be able to track changes happening between reads. 
+For example, if someone changes their address once in a blue moon, we will capture the changes with daily loads - but if +they change their address 3x in a day, we will only see the last state and only capture 2 of the 4 versions of the address. -For more information on how dlt handles SCD2 strategy, refer to the [documentation here](https://dlthub.com/docs/general-usage/incremental-loading#scd2-strategy). +However, they enable you to track things you could not before such as: -### **Loading usually doesn’t preserve change history** +- Hard deletes. +- Most of the changes and when they occurred. +- Different versions of entities valid at different historical times. -Incremental loading boosts data loading efficiency by selectively processing only new or modified data based on the source extraction pattern. This approach is similar to updating a journal with new events without rewriting past entries. For example, in a customer interaction tracking system, incremental loading captures only the most recent interactions, avoiding the redundancy of reprocessing historical data. +## What is Slowly Changing Dimension Type 2 (SCD2)? and why use it? -Consider two issues, “Issue A” and “Issue B”. Initially, both are processed. If the pipeline is set to increment based on an `updated_at` field and “Issue B” gets updated, only this issue will be fetched and loaded in the next run. The `updated_at` timestamp is stored in the pipeline state and serves as a reference point for the next data load. How the data is added to the table depends on the pipeline’s write disposition. +The Type 2 subtype of Slowly Changing Dimensions (SCD) manages changes in data over time. +When data changes, a new record is added to the database, but the old record remains unchanged. +Each record includes a timestamp or version number. This allows you to view both the historical +data and the most current data separately. -For more details on incremental loading, refer to the [documentation here](https://dlthub.com/docs/general-usage/incremental-loading). +Traditional data loading methods often involve updating existing records with new information, which results in the loss of historical data. -### The change history: SCD2 and Incremental loading +SCD2 not only preserves an audit trail of data changes but also allows for accurate historical analysis and reporting. -Combining SCD2 with incremental loading creates a symphony in data management. While SCD2 safeguards historical data, incremental loading efficiently incorporates new updates. This synergy is illustrated through the following example: +## SCD2 applications -**Example 1: Customer Status Changes** +[Colab demo](https://colab.research.google.com/drive/115cRdw1qvekZbXIQSXYkAZzLAqD9_x_I) -**Initial load with slowly changing dimensions enabled:** +### Use Case 1: Versioning a record that changes -- Alice and Bob start with the statuses "bronze" and "gold", respectively. +In environments where maintaining a complete historical record of data changes is crucial, +such as in financial services or healthcare, SCD Type 2 plays a vital role. For instance, if a +customer's address changes, SCD2 ensures that the old address is preserved in historical +records while the new address is available for current transactions. This ability to view the +evolution of data over time supports auditing, tracking changes, and analyzing trends without losing +the context of past information. It allows organizations to track the lifecycle of a data +entity across different states. 
-| customer_key | name | status | _dlt_valid_from | _dlt_valid_to | last_updated | -| ------------ | ----- | ------ | ------------------- | ------------- | ------------------- | -| 1 | Alice | bronze | 2024-01-01 00:00:00 | NULL | 2024-01-01 12:00:00 | -| 2 | Bob | gold | 2024-01-01 00:00:00 | NULL | 2024-01-01 12:00:00 | +Here's an example with the customer address change. -**Incremental load (Alice's status changes to silver):** -- Alice’s status update to "silver" triggers a new entry, while her "bronze" status is preserved with a timestamp marking its duration. +Before: -| customer_key | name | status | _dlt_valid_from | _dlt_valid_to | last_updated | -| ------------ | ----- | ------ | ------------------- | ------------------- | ------------------- | -| 1 | Alice | bronze | 2024-01-01 00:00:00 | 2024-02-01 00:00:00 | 2024-01-01 12:00:00 | -| 1 | Alice | silver | 2024-02-01 00:00:00 | NULL | 2024-02-01 12:00:00 | -| 2 | Bob | gold | 2024-01-01 00:00:00 | NULL | 2024-01-01 12:00:00 | +| `_dlt_valid_from` | `_dlt_valid_to` | `customer_key` | `c1` | `c2` | +|-----------------------------|-----------------|----------------|-------------|------| +| 2024-04-09 18:27:53.734235 | NULL | 1 | 123 Elm St | TN | +After update: -Incremental loading would process only Alice's record because it was updated after the last load, and slowly changing dimensions would keep the record until Alice’s status was bronze. +| `_dlt_valid_from` | `_dlt_valid_to` | `customer_key` | `c1` | `c2` | +|-----------------------------|-----------------------------|----------------|-------------|------| +| 2024-04-09 18:27:53.734235 | 2024-05-01 17:00:00.000000 | 1 | 123 Elm St | TN | +| 2024-05-02 08:00:00.000000 | NULL | 1 | 456 Oak Ave | TN | -This demonstrates how using the `last_updated` timestamp, an incremental loading strategy, ensures only the latest data is fetched and loaded. Meanwhile, SCD2 helps maintain historical records. +In the updated state, the previous address record is closed with an `_dlt_valid_to` timestamp, and a new record is created +with the new address "456 Oak Ave" effective from May 2, 2024. The NULL in the `_dlt_valid_to` field for this +new record signifies that it is the current and active address. + +### Use Case 2: Tracking deletions + +This approach ensures that historical data is preserved for audit and compliance purposes, even though the +record is no longer active in the current dataset. It allows businesses to maintain integrity and a full +historical trail of their data changes. + +State Before Deletion: Customer Record Active + +| `_dlt_valid_from` | `_dlt_valid_to` | `customer_key` | `c1` | `c2` | +|-----------------------------|-----------------|----------------|-------------|------| +| 2024-04-09 18:27:53.734235 | NULL | 1 | 123 Elm St | TN | +This table shows the customer record when it was active, with an address at "123 Elm St". The `_dlt_valid_to` field is NULL, indicating that the record is currently active. + +State after deletion: Customer record marked as deleted + +| `_dlt_valid_from` | `_dlt_valid_to` | `customer_key` | `c1` | `c2` | +|-----------------------------|-----------------------------|----------------|-------------|------| +| 2024-04-09 18:27:53.734235 | 2024-06-01 10:00:00.000000 | 1 | 123 Elm St | TN | + +In this updated table, the record that was previously active is marked as deleted by updating the `_dlt_valid_to` field +to reflect the timestamp when the deletion was recognized, in this case, June 1, 2024, at 10:00 AM. 
The presence +of a non-NULL `_dlt_valid_to` date indicates that this record is no longer active as of that timestamp. + + +Learn how to customise your column names and validity dates in our [SDC2 docs](https://dlthub.com/docs/general-usage/incremental-loading#scd2-strategy). + + + +### Surrogate keys, what are they? Why use? + +Every record in the SCD2 table needs its own id. We call this a surrogate key. We use it to identify the specific +record or version of an entity, and we can use it when joining to our fact tables for performance (as opposed to joining on entity id + validity time). ### Simple steps to determine data loading strategy and write disposition This decision flowchart helps determine the most suitable data loading strategy and write disposition: 1. Is your data stateful? Stateful data is subject to change, like your age. Stateless data does not change, for example, events that happened in the past are stateless. - 1. If your data is stateless, such as logs, you can just increment by appending new logs - 2. If it is stateful, do you need to track changes to it? - 1. If yes, then use SCD2 to track changes - 2. If no, - 1. Can you extract it incrementally (new changes only?) - 1. If yes, load incrementally via merge + 1. If your data is stateless, such as logs, you can just increment by appending new logs. + 2. If it is stateful, do you need to track changes to it? + 1. If yes, then use SCD2 to track changes. + 2. If no, + 1. Can you extract it incrementally (new changes only)? + 1. If yes, load incrementally via merge. 2. If no, re-load fully via replace. Below is a visual representation of steps discussed above: @@ -78,14 +125,8 @@ Below is a visual representation of steps discussed above: ### **Conclusion** -Slowly changing dimensions detect and log changes between runs. The mechanism is not perfect and it will not capture if multiple changes occurred to a single data point between the runs - only the last state will be reflected. - -However, they enable you to track things you could not before such as - -- Hard deletes -- Daily changes and when they occurred -- Different versions of entities valid at different historical times +Use SCD2 where it makes sense but keep in mind the shortcomings related to the read vs update frequency. +Use dlt to do it at loading and keep everything downstream clean and simple. Want to discuss? - -[Join the dlt slack community](https://dlthub.com/community) to take part in the conversation. 
\ No newline at end of file +[Join the dlt slack community!](https://dlthub.com/community) \ No newline at end of file From 8d94b0bf6fb122966fd8346cff89672a4a865c87 Mon Sep 17 00:00:00 2001 From: adrianbr Date: Mon, 17 Jun 2024 17:35:43 +0200 Subject: [PATCH 12/21] Scd2 blog (#1478) --- .../blog/2024-06-19-scd2-and-incremental-loading.md | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/docs/website/blog/2024-06-19-scd2-and-incremental-loading.md b/docs/website/blog/2024-06-19-scd2-and-incremental-loading.md index 8b0a4c3cdd..fef6538a3a 100644 --- a/docs/website/blog/2024-06-19-scd2-and-incremental-loading.md +++ b/docs/website/blog/2024-06-19-scd2-and-incremental-loading.md @@ -12,8 +12,7 @@ tags: [scd2, incremental loading, slowly changing dimensions, python data pipeli :::info -**TL;DR: Check this colab notebook for a short and sweet demo: -[Colab demo](https://colab.research.google.com/drive/115cRdw1qvekZbXIQSXYkAZzLAqD9_x_I) +**Check [this Colab Notebook](https://colab.research.google.com/drive/115cRdw1qvekZbXIQSXYkAZzLAqD9_x_I) for a short and sweet demo.** ::: # What is a slowly changing dimension? @@ -24,11 +23,11 @@ This technique only works if the dimensions change slower than we read the data, For example, if someone changes their address once in a blue moon, we will capture the changes with daily loads - but if they change their address 3x in a day, we will only see the last state and only capture 2 of the 4 versions of the address. -However, they enable you to track things you could not before such as: +However, they enable you to track things you could not before such as -- Hard deletes. -- Most of the changes and when they occurred. -- Different versions of entities valid at different historical times. +- Hard deletes +- Most of the changes and when they occurred +- Different versions of entities valid at different historical times ## What is Slowly Changing Dimension Type 2 (SCD2)? and why use it? @@ -101,7 +100,6 @@ of a non-NULL `_dlt_valid_to` date indicates that this record is no longer activ Learn how to customise your column names and validity dates in our [SDC2 docs](https://dlthub.com/docs/general-usage/incremental-loading#scd2-strategy). - ### Surrogate keys, what are they? Why use? Every record in the SCD2 table needs its own id. We call this a surrogate key. We use it to identify the specific @@ -112,6 +110,7 @@ record or version of an entity, and we can use it when joining to our fact table This decision flowchart helps determine the most suitable data loading strategy and write disposition: 1. Is your data stateful? Stateful data is subject to change, like your age. Stateless data does not change, for example, events that happened in the past are stateless. + 1. If your data is stateless, such as logs, you can just increment by appending new logs. 2. If it is stateful, do you need to track changes to it? 1. If yes, then use SCD2 to track changes. 
From 46d59ca65f73fe0b7587b102341cb6c2dc8c7d83 Mon Sep 17 00:00:00 2001 From: adrianbr Date: Tue, 18 Jun 2024 08:35:53 +0200 Subject: [PATCH 13/21] Update 2024-06-19-scd2-and-incremental-loading.md with image --- docs/website/blog/2024-06-19-scd2-and-incremental-loading.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/website/blog/2024-06-19-scd2-and-incremental-loading.md b/docs/website/blog/2024-06-19-scd2-and-incremental-loading.md index fef6538a3a..11c858c076 100644 --- a/docs/website/blog/2024-06-19-scd2-and-incremental-loading.md +++ b/docs/website/blog/2024-06-19-scd2-and-incremental-loading.md @@ -1,6 +1,7 @@ --- slug: scd2-and-incremental-loading title: "Slowly Changing Dimension Type2: Explanation and code" +image: https://storage.googleapis.com/dlt-blog-images/flowchart_for_scd2.png authors: name: Aman Gupta title: Junior Data Engineer @@ -128,4 +129,4 @@ Use SCD2 where it makes sense but keep in mind the shortcomings related to the r Use dlt to do it at loading and keep everything downstream clean and simple. Want to discuss? -[Join the dlt slack community!](https://dlthub.com/community) \ No newline at end of file +[Join the dlt slack community!](https://dlthub.com/community) From 303c8a52e2c8d0ce1374512510c59673c70d4a50 Mon Sep 17 00:00:00 2001 From: Adrian Date: Tue, 18 Jun 2024 12:01:27 +0200 Subject: [PATCH 14/21] blog post format --- .../blog/2024-06-12-from-pandas-to-production.md | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/docs/website/blog/2024-06-12-from-pandas-to-production.md b/docs/website/blog/2024-06-12-from-pandas-to-production.md index 0de99085a7..3ca005dcf1 100644 --- a/docs/website/blog/2024-06-12-from-pandas-to-production.md +++ b/docs/website/blog/2024-06-12-from-pandas-to-production.md @@ -1,6 +1,6 @@ --- slug: pandas-to-production -title: "From Pandas to Production: why dlt is the right ELT tool for Normies" +title: "From Pandas to Production: How we built dlt as the right ELT tool for Normies" image: https://storage.googleapis.com/dlt-blog-images/i-am-normal.png authors: name: Adrian Brudaru @@ -13,11 +13,17 @@ tags: [pandas, production, etl, etl] :::info -**TL;DR: We created a library to reduce friction between data engineers, data scientists, and the rest of the team. From Pandas to Production article tells the story of how we got here.** +**TL;DR: dlt is a library for Normies: Problem solvers with antipathy for black boxes, gratuitous complexity and external dependencies. -But if you want to load pandas dfs to production databases, with all the best practices built-in, check out this [documentation](https://dlthub.com/docs/dlt-ecosystem/verified-sources/arrow-pandas) or this colab notebook that shows [easy handling of complex api data](https://colab.research.google.com/drive/1DhaKW0tiSTHDCVmPjM-eoyL47BJ30xmP#scrollTo=1wf1R0yQh7pv). 
+This post tells the story of how we got here.** + +Try it in colab: +* [Schema evolution](https://colab.research.google.com/drive/1H6HKFi-U1V4p0afVucw_Jzv1oiFbH2bu#scrollTo=e4y4sQ78P_OM) +* [Data Talks Club Open Source Spotlight](https://colab.research.google.com/drive/1D39_koejvi-eTtA_8AI33AHhMGGOklfb) + [Video](https://www.youtube.com/playlist?list=PL3MmuxUbc_hJ5t5nnjzC0F2zan76Dpsz0) +* [Hackernews Api demo](https://colab.research.google.com/drive/1DhaKW0tiSTHDCVmPjM-eoyL47BJ30xmP) -Here are the best practices: [wishlist becomes reality](#our-dream-a-tool-that-meets-production-pipelines-requirements) + +But if you want to load pandas dfs to production databases, with all the best practices built-in, check out this [documentation](https://dlthub.com/docs/dlt-ecosystem/verified-sources/arrow-pandas) or this colab notebook that shows [easy handling of complex api data](https://colab.research.google.com/drive/1DhaKW0tiSTHDCVmPjM-eoyL47BJ30xmP#scrollTo=1wf1R0yQh7pv). Or check out more resources [at the end of the article](#call-to-action) ::: From f700f9f36b4cbbb1043ee6b62611e88878068f5a Mon Sep 17 00:00:00 2001 From: Adrian Date: Tue, 18 Jun 2024 12:21:23 +0200 Subject: [PATCH 15/21] blog post format --- docs/website/blog/2024-06-12-from-pandas-to-production.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/website/blog/2024-06-12-from-pandas-to-production.md b/docs/website/blog/2024-06-12-from-pandas-to-production.md index 3ca005dcf1..a12d1a4229 100644 --- a/docs/website/blog/2024-06-12-from-pandas-to-production.md +++ b/docs/website/blog/2024-06-12-from-pandas-to-production.md @@ -13,9 +13,9 @@ tags: [pandas, production, etl, etl] :::info -**TL;DR: dlt is a library for Normies: Problem solvers with antipathy for black boxes, gratuitous complexity and external dependencies. 
+**TL;DR: dlt is a library for Normies: Problem solvers with antipathy for black boxes, gratuitous complexity and external dependencies.** -This post tells the story of how we got here.** +**This post tells the story of how we got here.** Try it in colab: * [Schema evolution](https://colab.research.google.com/drive/1H6HKFi-U1V4p0afVucw_Jzv1oiFbH2bu#scrollTo=e4y4sQ78P_OM) From ed0f5b0947a6beae2cabfe233ae8b4f045fe9dd8 Mon Sep 17 00:00:00 2001 From: Adrian Date: Tue, 18 Jun 2024 13:30:46 +0200 Subject: [PATCH 16/21] blog post format --- docs/website/blog/2024-06-12-from-pandas-to-production.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/website/blog/2024-06-12-from-pandas-to-production.md b/docs/website/blog/2024-06-12-from-pandas-to-production.md index a12d1a4229..5dbd494a3e 100644 --- a/docs/website/blog/2024-06-12-from-pandas-to-production.md +++ b/docs/website/blog/2024-06-12-from-pandas-to-production.md @@ -21,7 +21,7 @@ Try it in colab: * [Schema evolution](https://colab.research.google.com/drive/1H6HKFi-U1V4p0afVucw_Jzv1oiFbH2bu#scrollTo=e4y4sQ78P_OM) * [Data Talks Club Open Source Spotlight](https://colab.research.google.com/drive/1D39_koejvi-eTtA_8AI33AHhMGGOklfb) + [Video](https://www.youtube.com/playlist?list=PL3MmuxUbc_hJ5t5nnjzC0F2zan76Dpsz0) * [Hackernews Api demo](https://colab.research.google.com/drive/1DhaKW0tiSTHDCVmPjM-eoyL47BJ30xmP) - +* [LLM-free pipeline generation demo](https://colab.research.google.com/drive/1MRZvguOTZj1MlkEGzjiso8lQ_wr1MJRI) +[4min Video](https://www.youtube.com/watch?v=b99qv9je12Q) But if you want to load pandas dfs to production databases, with all the best practices built-in, check out this [documentation](https://dlthub.com/docs/dlt-ecosystem/verified-sources/arrow-pandas) or this colab notebook that shows [easy handling of complex api data](https://colab.research.google.com/drive/1DhaKW0tiSTHDCVmPjM-eoyL47BJ30xmP#scrollTo=1wf1R0yQh7pv). From a43eb387f03f28676ff9cec0f754acbe22e221f9 Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Thu, 20 Jun 2024 13:38:27 +0200 Subject: [PATCH 17/21] Add a section covering custom auth; rework auth and paginators sections (#1493) --- .../verified-sources/rest_api.md | 109 +++++++++++------- .../docs/general-usage/http/rest-client.md | 4 +- 2 files changed, 71 insertions(+), 42 deletions(-) diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md b/docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md index e28c5bac30..0f8360e591 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md @@ -187,12 +187,12 @@ config: RESTAPIConfig = { #### `client` -`client` contains the configuration to connect to the API's endpoints. It includes the following fields: +The `client` configuration is used to connect to the API's endpoints. It includes the following fields: - `base_url` (str): The base URL of the API. This string is prepended to all endpoint paths. For example, if the base URL is `https://api.example.com/v1/`, and the endpoint path is `users`, the full URL will be `https://api.example.com/v1/users`. -- `headers` (dict, optional): Additional headers to be sent with each request. -- `auth` (optional): Authentication configuration. It can be a simple token, a `AuthConfigBase` object, or a more complex authentication method. -- `paginator` (optional): Configuration for the default pagination to be used for resources that support pagination. 
See the [pagination](#pagination) section for more details. +- `headers` (dict, optional): Additional headers that are sent with each request. +- `auth` (optional): Authentication configuration. This can be a simple token, an `AuthConfigBase` object, or a more complex authentication method. +- `paginator` (optional): Configuration for the default pagination used for resources that support pagination. Refer to the [pagination](#pagination) section for more details. #### `resource_defaults` (optional) @@ -291,46 +291,69 @@ The REST API source will try to automatically handle pagination for you. This wo In some special cases, you may need to specify the pagination configuration explicitly. -:::note -Currently pagination is supported only for GET requests. To handle POST requests with pagination, you need to implement a [custom paginator](../../general-usage/http/rest-client.md#custom-paginator). -::: +To specify the pagination configuration, use the `paginator` field in the [client](#client) or [endpoint](#endpoint-configuration) configurations. You may either use a dictionary with a string alias in the `type` field along with the required parameters, or use a [paginator class instance](../../general-usage/http/rest-client.md#paginators). -These are the available paginators: +#### Example + +Suppose the API response for `https://api.example.com/posts` contains a `next` field with the URL to the next page: -| Paginator class | String Alias (`type`) | Description | -| -------------- | ------------ | ----------- | -| [JSONResponsePaginator](../../general-usage/http/rest-client.md#jsonresponsepaginator) | `json_response` | The links to the next page are in the body (JSON) of the response. | -| [HeaderLinkPaginator](../../general-usage/http/rest-client.md#headerlinkpaginator) | `header_link` | The links to the next page are in the response headers. | -| [OffsetPaginator](../../general-usage/http/rest-client.md#offsetpaginator) | `offset` | The pagination is based on an offset parameter. With total items count either in the response body or explicitly provided. | -| [PageNumberPaginator](../../general-usage/http/rest-client.md#pagenumberpaginator) | `page_number` | The pagination is based on a page number parameter. With total pages count either in the response body or explicitly provided. | -| [JSONCursorPaginator](../../general-usage/http/rest-client.md#jsonresponsecursorpaginator) | `cursor` | The pagination is based on a cursor parameter. The value of the cursor is in the response body (JSON). | -| SinglePagePaginator | `single_page` | The response will be interpreted as a single-page response, ignoring possible pagination metadata. | -| `None` | `auto` | Explicitly specify that the source should automatically detect the pagination method. | +```json +{ + "data": [ + {"id": 1, "title": "Post 1"}, + {"id": 2, "title": "Post 2"}, + {"id": 3, "title": "Post 3"} + ], + "pagination": { + "next": "https://api.example.com/posts?page=2" + } +} +``` -To specify the pagination configuration, use the `paginator` field in the [client](#client) or [endpoint](#endpoint-configuration) configurations. You may either use a dictionary with a string alias in the `type` field along with the required parameters, or use the paginator instance directly: +You can configure the pagination for the `posts` resource like this: ```py { - # ... 
+ "path": "posts", "paginator": { - "type": "json_links", - "next_url_path": "paging.next", + "type": "json_response", + "next_url_path": "pagination.next", } } ``` -Or using the paginator instance: +Alternatively, you can use the paginator instance directly: ```py +from dlt.sources.helpers.rest_client.paginators import JSONResponsePaginator + +# ... + { - # ... + "path": "posts", "paginator": JSONResponsePaginator( - next_url_path="paging.next" + next_url_path="pagination.next" ), } ``` -This is useful when you're [implementing and using a custom paginator](../../general-usage/http/rest-client.md#custom-paginator). +:::note +Currently pagination is supported only for GET requests. To handle POST requests with pagination, you need to implement a [custom paginator](../../general-usage/http/rest-client.md#custom-paginator). +::: + +These are the available paginators: + +| `type` | Paginator class | Description | +| ------------ | -------------- | ----------- | +| `json_response` | [JSONResponsePaginator](../../general-usage/http/rest-client.md#jsonresponsepaginator) | The link to the next page is in the body (JSON) of the response.
*Parameters:*
  • `next_url_path` (str) - the JSONPath to the next page URL
| +| `header_link` | [HeaderLinkPaginator](../../general-usage/http/rest-client.md#headerlinkpaginator) | The links to the next page are in the response headers.
*Parameters:*
  • `links_next_key` (str) - the name of the link relation in the `Link` header that contains the next page URL. Defaults to "next"
| +| `offset` | [OffsetPaginator](../../general-usage/http/rest-client.md#offsetpaginator) | The pagination is based on an offset parameter. With total items count either in the response body or explicitly provided.
*Parameters:*
  • `limit` (int) - the maximum number of items to retrieve in each request
  • `offset` (int) - the initial offset for the first request. Defaults to `0`
  • `offset_param` (str) - the name of the query parameter used to specify the offset. Defaults to "offset"
  • `limit_param` (str) - the name of the query parameter used to specify the limit. Defaults to "limit"
  • `total_path` (str) - a JSONPath expression for the total number of items. If not provided, pagination is controlled by `maximum_offset`
  • `maximum_offset` (int) - optional maximum offset value. Limits pagination even without total count
| +| `page_number` | [PageNumberPaginator](../../general-usage/http/rest-client.md#pagenumberpaginator) | The pagination is based on a page number parameter. With total pages count either in the response body or explicitly provided.
*Parameters:*
  • `initial_page` (int) - the starting page number. Defaults to `0`
  • `page_param` (str) - the query parameter name for the page number. Defaults to "page"
  • `total_path` (str) - a JSONPath expression for the total number of pages. If not provided, pagination is controlled by `maximum_page`
  • `maximum_page` (int) - optional maximum page number. Stops pagination once this page is reached
| +| `cursor` | [JSONResponseCursorPaginator](../../general-usage/http/rest-client.md#jsonresponsecursorpaginator) | The pagination is based on a cursor parameter. The value of the cursor is in the response body (JSON).
*Parameters:*
  • `cursor_path` (str) - the JSONPath to the cursor value. Defaults to "cursors.next"
  • `cursor_param` (str) - the query parameter name for the cursor. Defaults to "after"
| +| `single_page` | SinglePagePaginator | The response will be interpreted as a single-page response, ignoring possible pagination metadata. | +| `auto` | `None` | Explicitly specify that the source should automatically detect the pagination method. | + +For more complex pagination methods, you can implement a [custom paginator](../../general-usage/http/rest-client.md#implementing-a-custom-paginator), instantiate it, and use it in the configuration. ### Data selection @@ -387,11 +410,11 @@ Read more about [JSONPath syntax](https://github.com/h2non/jsonpath-ng?tab=readm ### Authentication -Many APIs require authentication to access their endpoints. The REST API source supports various authentication methods, such as token-based, query parameters, basic auth, etc. +For APIs that require authentication to access their endpoints, the REST API source supports various authentication methods, including token-based authentication, query parameters, basic authentication, and custom authentication. The authentication configuration is specified in the `auth` field of the [client](#client) either as a dictionary or as an instance of the [authentication class](../../general-usage/http/rest-client.md#authentication). #### Quick example -One of the most common method is token-based authentication. To authenticate with a token, you can use the `token` field in the `auth` configuration: +One of the most common methods is token-based authentication (also known as Bearer token authentication). To authenticate using this method, you can use the following shortcut: ```py { @@ -405,23 +428,12 @@ One of the most common method is token-based authentication. To authenticate wit } ``` -:::warning -Make sure to store your access tokens and other sensitive information in the `secrets.toml` file and never commit it to the version control system. -::: - -Available authentication types: - -| Authentication class | String Alias (`type`) | Description | -| ------------------- | ----------- | ----------- | -| [BearTokenAuth](../../general-usage/http/rest-client.md#bearer-token-authentication) | `bearer` | Bearer token authentication. | -| [HTTPBasicAuth](../../general-usage/http/rest-client.md#http-basic-authentication) | `http_basic` | Basic HTTP authentication. | -| [APIKeyAuth](../../general-usage/http/rest-client.md#api-key-authentication) | `api_key` | API key authentication with key defined in the query parameters or in the headers. | - -To specify the authentication configuration, use the `auth` field in the [client](#client) configuration: +The full version of the configuration would also include the authentication type (`bearer`) explicitly: ```py { "client": { + # ... "auth": { "type": "bearer", "token": dlt.secrets["your_api_token"], @@ -444,6 +456,23 @@ config = { } ``` +:::warning +Make sure to store your access tokens and other sensitive information in the `secrets.toml` file and never commit it to the version control system. +::: + +Available authentication types: + +| `type` | Authentication class | Description | +| ----------- | ------------------- | ----------- | +| `bearer` | [BearTokenAuth](../../general-usage/http/rest-client.md#bearer-token-authentication) | Bearer token authentication.
Parameters:
  • `token` (str)
| +| `http_basic` | [HTTPBasicAuth](../../general-usage/http/rest-client.md#http-basic-authentication) | Basic HTTP authentication.
Parameters:
  • `username` (str)
  • `password` (str)
| +| `api_key` | [APIKeyAuth](../../general-usage/http/rest-client.md#api-key-authentication) | API key authentication with key defined in the query parameters or in the headers.
Parameters:
  • `name` (str) - the name of the query parameter or header
  • `api_key` (str) - the API key value
  • `location` (str, optional) - the location of the API key in the request. Can be `query` or `header`. Default is `header`
| + + +For more complex authentication methods, you can implement a [custom authentication class](../../general-usage/http/rest-client.md#implementing-custom-authentication) and use it in the configuration. + + + ### Define resource relationships When you have a resource that depends on another resource, you can define the relationship using the `resolve` configuration. With it you link a path parameter in the child resource to a field in the parent resource's data. diff --git a/docs/website/docs/general-usage/http/rest-client.md b/docs/website/docs/general-usage/http/rest-client.md index 8f517389c6..32c74ec908 100644 --- a/docs/website/docs/general-usage/http/rest-client.md +++ b/docs/website/docs/general-usage/http/rest-client.md @@ -231,7 +231,7 @@ Note, that in this case, the `total_path` parameter is set explicitly to `None` **Parameters:** -- `initial_page`: The starting page number. Defaults to `1`. +- `initial_page`: The starting page number. Defaults to `0`. - `page_param`: The query parameter name for the page number. Defaults to `"page"`. - `total_path`: A JSONPath expression for the total number of pages. If not provided, pagination is controlled by `maximum_page`. - `maximum_page`: Optional maximum page number. Stops pagination once this page is reached. @@ -413,7 +413,7 @@ The available authentication methods are defined in the `dlt.sources.helpers.res - [APIKeyAuth](#api-key-authentication) - [HttpBasicAuth](#http-basic-authentication) -For specific use cases, you can [implement custom authentication](#implementing-custom-authentication) by subclassing the `AuthBase` class from the Requests library. +For specific use cases, you can [implement custom authentication](#implementing-custom-authentication) by subclassing the [`AuthBase`](https://requests.readthedocs.io/en/latest/api/#requests.auth.AuthBase) class from the Requests library. ### Bearer token authentication From 8eba8347ca4e8e2692c6756cbc1106774e81d612 Mon Sep 17 00:00:00 2001 From: dat-a-man <98139823+dat-a-man@users.noreply.github.com> Date: Fri, 21 Jun 2024 12:42:28 +0530 Subject: [PATCH 18/21] Added blog google-forms-to-notion (#1497) --- .../blog/2024-06-21-google-forms-to-notion.md | 142 ++++++++++++++++++ 1 file changed, 142 insertions(+) create mode 100644 docs/website/blog/2024-06-21-google-forms-to-notion.md diff --git a/docs/website/blog/2024-06-21-google-forms-to-notion.md b/docs/website/blog/2024-06-21-google-forms-to-notion.md new file mode 100644 index 0000000000..ec1631bc44 --- /dev/null +++ b/docs/website/blog/2024-06-21-google-forms-to-notion.md @@ -0,0 +1,142 @@ +--- +slug: google-forms-to-notion +title: "Syncing Google Forms data with Notion using dlt" +authors: + name: Aman Gupta + title: Junior Data Engineer + url: https://github.com/dat-a-man + image_url: https://dlt-static.s3.eu-central-1.amazonaws.com/images/aman.png +tags: [google forms, cloud functions, google-forms-to-notion] +--- + +## Why do we do it? + +Hello, I'm Aman, and I assist the dlthub team with various data-related tasks. In a recent project, the Operations team needed to gather information through Google Forms and integrate it into a Notion database. Initially, they tried using the Zapier connector as a quick and cost-effective solution, but it didn’t work as expected. Since we’re at dlthub, where everyone is empowered to create pipelines, I stepped in to develop one that would automate this process. + +The solution involved setting up a workflow to automatically sync data from Google Forms to a Notion database. 
This was achieved using Google Sheets, Google Apps Script, and a `dlt` pipeline, ensuring that every new form submission was seamlessly transferred to the Notion database without the need for manual intervention. + +## Implementation + +So here are a few steps followed: + +**Step 1: Link Google Form to Google Sheet** + +Link the Google Form to a Google Sheet to save responses in the sheet. Follow [Google's documentation](https://support.google.com/docs/answer/2917686?hl=en#zippy=%2Cchoose-where-to-store-responses) for setup. + +**Step 2: Google Apps Script for Data Transfer** + +Create a Google Apps Script to send data from Google Sheets to a Notion database via a webhook. This script triggers every time a form response is saved. + +**Google Apps Script code:** + +```text +function sendWebhookOnEdit(e) { + var sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet(); + var range = sheet.getActiveRange(); + var updatedRow = range.getRow(); + var lastColumn = sheet.getLastColumn(); + var headers = sheet.getRange(1, 1, 1, lastColumn).getValues()[0]; + var updatedFields = {}; + var rowValues = sheet.getRange(updatedRow, 1, 1, lastColumn).getValues()[0]; + + for (var i = 0; i < headers.length; i++) { + updatedFields[headers[i]] = rowValues[i]; + } + + var jsonPayload = JSON.stringify(updatedFields); + Logger.log('JSON Payload: ' + jsonPayload); + + var url = 'https://your-webhook.cloudfunctions.net/to_notion_from_google_forms'; // Replace with your Cloud Function URL + var options = { + 'method': 'post', + 'contentType': 'application/json', + 'payload': jsonPayload + }; + + try { + var response = UrlFetchApp.fetch(url, options); + Logger.log('Response: ' + response.getContentText()); + } catch (error) { + Logger.log('Failed to send webhook: ' + error.toString()); + } +} +``` + +**Step 3: Deploying the ETL Pipeline** + +Deploy a `dlt` pipeline to Google Cloud Functions to handle data transfer from Google Sheets to the Notion database. The pipeline is triggered by the Google Apps Script. + +1. Create a Google Cloud function. +2. Create `main.py` with the Python code below. +3. Ensure `requirements.txt` includes `dlt`. +4. Deploy the pipeline to Google Cloud Functions. +5. Use the function URL in the Google Apps Script. + +:::note +This pipeline uses `@dlt.destination` decorator which is used to set up custom destinations. Using custom destinations is a part of `dlt's` reverse ETL capabilities. 
To read more about `dlt's` reverse ETL pipelines, please read the [documentation here.](https://dlthub.com/docs/dlt-ecosystem/destinations/destination) +::: + +**Python code for `main.py` (Google cloud functions) :** + +```py +import dlt +from dlt.common import json +from dlt.common.typing import TDataItems +from dlt.common.schema import TTableSchema +from datetime import datetime +from dlt.sources.helpers import requests + +@dlt.destination(name="notion", batch_size=1, naming_convention="direct", skip_dlt_columns_and_tables=True) +def insert_into_notion(items: TDataItems, table: TTableSchema) -> None: + api_key = dlt.secrets.value # Add your notion API key to "secrets.toml" + database_id = "your_notion_database_id" # Replace with your Notion Database ID + url = "https://api.notion.com/v1/pages" + headers = { + "Authorization": f"Bearer {api_key}", + "Content-Type": "application/json", + "Notion-Version": "2022-02-22" + } + + for item in items: + if isinstance(item.get('Timestamp'), datetime): + item['Timestamp'] = item['Timestamp'].isoformat() + data = { + "parent": {"database_id": database_id}, + "properties": { + "Timestamp": { + "title": [{ + "text": {"content": item.get('Timestamp')} + }] + }, + # Add other properties here + } + } + response = requests.post(url, headers=headers, data=json.dumps(data)) + print(response.status_code, response.text) + +def your_webhook(request): + data = request.get_json() + Event = [data] + + pipeline = dlt.pipeline( + pipeline_name='platform_to_notion', + destination=insert_into_notion, + dataset_name='webhooks', + full_refresh=True + ) + + pipeline.run(Event, table_name='webhook') + return 'Event received and processed successfully.' +``` + +### Step 4: Automation and Real-Time updates + +With everything set up, the workflow automates data transfer as follows: + +1. Form submission saves data in Google Sheets. +2. Google Apps Script sends a POST request to the Cloud Function. +3. The `dlt` pipeline processes the data and updates the Notion database. + +# Conclusion + +We initially considered using Zapier for this small task, but ultimately, handling it ourselves proved to be quite effective. Since we already use an orchestrator for our other automations, the only expense was the time I spent writing and testing the code. This experience demonstrates that `dlt` is a straightforward and flexible tool, suitable for a variety of scenarios. Essentially, wherever Python can be used, `dlt` can be applied effectively for data loading, provided it meets your specific needs. \ No newline at end of file From ace102e200ecbe0ff8be1e079fc2948feef09e20 Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Mon, 24 Jun 2024 22:00:32 +0200 Subject: [PATCH 19/21] rest_api: add an example to the incremental load section (#1502) --- .../verified-sources/rest_api.md | 125 +++++++++++++----- 1 file changed, 93 insertions(+), 32 deletions(-) diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md b/docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md index 0f8360e591..8271694523 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md @@ -560,49 +560,110 @@ This will include the `id`, `title`, and `created_at` fields from the `issues` r Some APIs provide a way to fetch only new or changed data (most often by using a timestamp field like `updated_at`, `created_at`, or incremental IDs). 
This is called [incremental loading](../../general-usage/incremental-loading.md) and is very useful as it allows you to reduce the load time and the amount of data transferred. -When the API endpoint supports incremental loading, you can configure the source to load only the new or changed data using these two methods: +When the API endpoint supports incremental loading, you can configure dlt to load only the new or changed data using these two methods: -1. Defining a special parameter in the `params` section of the [endpoint configuration](#endpoint-configuration): +1. Defining a special parameter in the `params` section of the [endpoint configuration](#endpoint-configuration). +2. Specifying the `incremental` field in the endpoint configuration. - ```py - { - "": { - "type": "incremental", - "cursor_path": "", - "initial_value": "", - }, - } - ``` +Let's start with the first method. - For example, in the `issues` resource configuration in the GitHub example, we have: +### Incremental loading in `params` - ```py - { - "since": { +Imagine we have the following endpoint `https://api.example.com/posts` and it: +1. Accepts a `created_since` query parameter to fetch posts created after a certain date. +2. Returns a list of posts with the `created_at` field for each post. + +For example, if we query the endpoint with `https://api.example.com/posts?created_since=2024-01-25`, we get the following response: + +```json +{ + "results": [ + {"id": 1, "title": "Post 1", "created_at": "2024-01-26"}, + {"id": 2, "title": "Post 2", "created_at": "2024-01-27"}, + {"id": 3, "title": "Post 3", "created_at": "2024-01-28"} + ] +} +``` + +To enable the incremental loading for this endpoint, you can use the following endpoint configuration: + +```py +{ + "path": "posts", + "data_selector": "results", # Optional JSONPath to select the list of posts + "params": { + "created_since": { "type": "incremental", - "cursor_path": "updated_at", - "initial_value": "2024-01-25T11:21:28Z", + "cursor_path": "created_at", # The JSONPath to the field we want to track in each post + "initial_value": "2024-01-25", }, - } - ``` + }, +} +``` - This configuration tells the source to create an incremental object that will keep track of the `updated_at` field in the response and use it as a value for the `since` parameter in subsequent requests. +After you run the pipeline, dlt will keep track of the last `created_at` from all the posts fetched and use it as the `created_since` parameter in the next request. +So in our case, the next request will be made to `https://api.example.com/posts?created_since=2024-01-28` to fetch only the new posts created after `2024-01-28`. -2. Specifying the `incremental` field in the [endpoint configuration](#endpoint-configuration): +Let's break down the configuration. - ```py - { - "incremental": { - "start_param": "", - "end_param": "", - "cursor_path": "", - "initial_value": "", - "end_value": "", - } +1. We explicitly set `data_selector` to `"results"` to select the list of posts from the response. This is optional, if not set, dlt will try to auto-detect the data location. +2. We define the `created_since` parameter as an incremental parameter with the following fields: + +```py +{ + "created_since": { + "type": "incremental", + "cursor_path": "created_at", + "initial_value": "2024-01-25", + }, +} +``` + +- `type`: The type of the parameter definition. In this case, it must be set to `incremental`. +- `cursor_path`: The JSONPath to the field within each item in the list. 
The value of this field will be used in the next request. In the example above our items look like `{"id": 1, "title": "Post 1", "created_at": "2024-01-26"}` so to track the created time we set `cursor_path` to `"created_at"`. Note that the JSONPath starts from the root of the item (dict) and not from the root of the response. +- `initial_value`: The initial value for the cursor. This is the value that will initialize the state of incremental loading. In this case, it's `2024-01-25`. The value type should match the type of the field in the data item. + +### Incremental loading using the `incremental` field + +The alternative method is to use the `incremental` field in the [endpoint configuration](#endpoint-configuration). This method is more flexible and allows you to specify the start and end conditions for the incremental loading. + +Let's take the same example as above and configure it using the `incremental` field: + +```py +{ + "path": "posts", + "data_selector": "results", + "incremental": { + "start_param": "created_since", + "cursor_path": "created_at", + "initial_value": "2024-01-25", + }, +} +``` + +Note that we specify the query parameter name `created_since` in the `start_param` field and not in the `params` section. + +The full available configuration for the `incremental` field is: + +```py +{ + "incremental": { + "start_param": "", + "end_param": "", + "cursor_path": "", + "initial_value": "", + "end_value": "", } - ``` +} +``` + +The fields are: - This configuration is more flexible and allows you to specify the start and end conditions for the incremental loading. +- `start_param` (str): The name of the query parameter to be used as the start condition. If we use the example above, it would be `"created_since"`. +- `end_param` (str): The name of the query parameter to be used as the end condition. This is optional and can be omitted if you only need to track the start condition. This is useful when you need to fetch data within a specific range and the API supports end conditions (like `created_before` query parameter). +- `cursor_path` (str): The JSONPath to the field within each item in the list. This is the field that will be used to track the incremental loading. In the example above, it's `"created_at"`. +- `initial_value` (str): The initial value for the cursor. This is the value that will initialize the state of incremental loading. +- `end_value` (str): The end value for the cursor to stop the incremental loading. This is optional and can be omitted if you only need to track the start condition. If you set this field, `initial_value` needs to be set as well. See the [incremental loading](../../general-usage/incremental-loading.md#incremental-loading-with-a-cursor-field) guide for more details. 
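+
+For example, to backfill a fixed date range you can combine both conditions. Here `created_before` is a hypothetical query parameter of the example API above that filters out newer posts:
+
+```py
+{
+    "path": "posts",
+    "data_selector": "results",
+    "incremental": {
+        "start_param": "created_since",
+        "end_param": "created_before",
+        "cursor_path": "created_at",
+        "initial_value": "2024-01-01",
+        "end_value": "2024-06-30",
+    },
+}
+```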
From 7874dedc73a4972e914fc3395b8661ff64646156 Mon Sep 17 00:00:00 2001 From: dat-a-man <98139823+dat-a-man@users.noreply.github.com> Date: Mon, 1 Jul 2024 11:34:22 +0530 Subject: [PATCH 20/21] Updated API playground article (#1529) * Updated article API playground * Removed one API from the list --- .../blog/2024-02-06-practice-api-sources.md | 175 +++++++++++++++++- 1 file changed, 174 insertions(+), 1 deletion(-) diff --git a/docs/website/blog/2024-02-06-practice-api-sources.md b/docs/website/blog/2024-02-06-practice-api-sources.md index 4e78fc48e4..248d4ae647 100644 --- a/docs/website/blog/2024-02-06-practice-api-sources.md +++ b/docs/website/blog/2024-02-06-practice-api-sources.md @@ -32,6 +32,7 @@ This article outlines 10 APIs, detailing their use cases, any free tier limitati ### Data talks club open source spotlight * [Video](https://www.youtube.com/watch?v=eMbhyOECpcE) * [Notebook](https://github.com/dlt-hub/dlt_demos/blob/main/spotlight_demo.ipynb) +* DTC Learners showcase (review again) ### Docs * [Getting started](https://dlthub.com/docs/getting-started) @@ -100,8 +101,166 @@ This article outlines 10 APIs, detailing their use cases, any free tier limitati - **Free:** Varies by API. - **Auth:** Depends on API. +### 11. News API +- **URL**: [News API](https://newsapi.ai/). +- **Use**: Get datasets containing current and historic news articles. +- **Free**: Access to current news articles. +- **Auth**: API-Key. + +### 12. Exchangerates API +- **URL**: [Exchangerate API](https://exchangeratesapi.io/). +- **Use**: Get realtime, intraday and historic currency rates. +- **Free**: 250 monthly requests. +- **Auth**: API-Key. + +### 13. Spotify API +- **URL**: [Spotify API](https://developer.spotify.com/documentation/web-api). +- **Use**: Get spotify content and metadata about songs. +- **Free**: Rate limit. +- **Auth**: API-Key. + +### 14. Football API +- **URL**: [FootBall API](https://www.api-football.com/). +- **Use**: Get information about Football Leagues & Cups. +- **Free**: 100 requests/day. +- **Auth**: API-Key. + +### 15. Yahoo Finance API +- **URL**: [Yahoo Finance API](https://rapidapi.com/sparior/api/yahoo-finance15/details). +- **Use**: Access a wide range of financial data. +- **Free**: 500 requests/month. +- **Auth**: API-Key. + +### 16. Basketball API + +- URL: [Basketball API](https://www.api-basketball.com/). +- Use: Get information about basketball leagues & cups. +- Free: 100 requests/day. +- Auth: API-Key. + +### 17. NY Times API + +- URL: [NY Times API](https://developer.nytimes.com/apis). +- Use: Get info about articles, books, movies and more. +- Free: 500 requests/day or 5 requests/minute. +- Auth: API-Key. + +### 18. Spoonacular API + +- URL: [Spoonacular API](https://spoonacular.com/food-api). +- Use: Get info about ingredients, recipes, products and menu items. +- Free: 150 requests/day and 1 request/sec. +- Auth: API-Key. + +### 19. Movie database alternative API + +- URL: [Movie database alternative API](https://rapidapi.com/rapidapi/api/movie-database-alternative/pricing). +- Use: Movie data for entertainment industry trend analysis. +- Free: 1000 requests/day and 10 requests/sec. +- Auth: API-Key. + +### 20. RAWG Video games database API + +- URL: [RAWG Video Games Database](https://rawg.io/apidocs). +- Use: Gather video game data, such as release dates, platforms, genres, and reviews. +- Free: Unlimited requests for limited endpoints. +- Auth: API key. + +### 21. Jikan API + +- **URL:** [Jikan API](https://jikan.moe/). 
+- **Use:** Access data from MyAnimeList for anime and manga projects. +- **Free:** Rate-limited. +- **Auth:** None. + +### 22. Open Library Books API + +- URL: [Open Library Books API](https://openlibrary.org/dev/docs/api/books). +- Use: Access data about millions of books, including titles, authors, and publication dates. +- Free: Unlimited. +- Auth: None. + +### 23. YouTube Data API + +- URL: [YouTube Data API](https://developers.google.com/youtube/v3/docs/search/list). +- Use: Access YouTube video data, channels, playlists, etc. +- Free: Limited quota. +- Auth: Google API key and OAuth 2.0. + +### 24. Reddit API + +- URL: [Reddit API](https://www.reddit.com/dev/api/). +- Use: Access Reddit data for social media analysis or content retrieval. +- Free: Rate-limited. +- Auth: OAuth 2.0. + +### 25. World Bank API + +- URL: [World bank API](https://documents.worldbank.org/en/publication/documents-reports/api). +- Use: Access economic and development data from the World Bank. +- Free: Unlimited. +- Auth: None. + Each API offers unique insights for data engineering, from ingestion to visualization. Check each API's documentation for up-to-date details on limitations and authentication. +## Using the above sources + +You can create a pipeline for the APIs discussed above by using `dlt's` REST API source. Let’s create a PokeAPI pipeline as an example. Follow these steps: + +1. Create a Rest API source: + + ```sh + dlt init rest_api duckdb + ``` + +2. The following directory structure gets generated: + + ```sh + rest_api_pipeline/ + ├── .dlt/ + │ ├── config.toml # configs for your pipeline + │ └── secrets.toml # secrets for your pipeline + ├── rest_api/ # folder with source-specific files + │ └── ... + ├── rest_api_pipeline.py # your main pipeline script + ├── requirements.txt # dependencies for your pipeline + └── .gitignore # ignore files for git (not required) + ``` + +3. Configure the source in `rest_api_pipeline.py`: + + ```py + def load_pokemon() -> None: + pipeline = dlt.pipeline( + pipeline_name="rest_api_pokemon", + destination='duckdb', + dataset_name="rest_api_data", + ) + + pokemon_source = rest_api_source( + { + "client": { + "base_url": "https://pokeapi.co/api/v2/", + }, + "resource_defaults": { + "endpoint": { + "params": { + "limit": 1000, + }, + }, + }, + "resources": [ + "pokemon", + "berry", + "location", + ], + } + ) + + ``` + +For a detailed guide on creating a pipeline using the Rest API source, please read the Rest API source [documentation here](https://dlthub.com/docs/dlt-ecosystem/verified-sources/rest_api). + ## Example projects Here are some examples from dlt users and working students: @@ -115,5 +274,19 @@ Here are some examples from dlt users and working students: - Japanese language demos [Notion calendar](https://stable.co.jp/blog/notion-calendar-dlt) and [exploring csv to bigquery with dlt](https://soonraah.github.io/posts/load-csv-data-into-bq-by-dlt/). - Demos with [Dagster](https://dagster.io/blog/dagster-dlt) and [Prefect](https://www.prefect.io/blog/building-resilient-data-pipelines-in-minutes-with-dlt-prefect). +## DTC learners showcase +Check out the incredible projects from our DTC learners: + +1. [e2e_de_project](https://github.com/scpkobayashi/e2e_de_project/tree/153d485bba3ea8f640d0ccf3ec9593790259a646) by [scpkobayashi](https://github.com/scpkobayashi). +2. [de-zoomcamp-project](https://github.com/theDataFixer/de-zoomcamp-project/tree/1737b6a9d556348c2d7d48a91e2a43bb6e12f594) by [theDataFixer](https://github.com/theDataFixer). +3. 
[data-engineering-zoomcamp2024-project2](https://github.com/pavlokurochka/data-engineering-zoomcamp2024-project2/tree/f336ed00870a74cb93cbd9783dbff594393654b8) by [pavlokurochka](https://github.com/pavlokurochka). +4. [de-zoomcamp-2024](https://github.com/snehangsude/de-zoomcamp-2024) by [snehangsude](https://github.com/snehangsude). +5. [zoomcamp-data-engineer-2024](https://github.com/eokwukwe/zoomcamp-data-engineer-2024) by [eokwukwe](https://github.com/eokwukwe). +6. [data-engineering-zoomcamp-alex](https://github.com/aaalexlit/data-engineering-zoomcamp-alex) by [aaalexlit](https://github.com/aaalexlit). +7. [Zoomcamp2024](https://github.com/alfredzou/Zoomcamp2024) by [alfredzou](https://github.com/alfredzou). +8. [data-engineering-zoomcamp](https://github.com/el-grudge/data-engineering-zoomcamp) by [el-grudge](https://github.com/el-grudge). + +Explore these projects to see the innovative solutions and hard work the learners have put into their data engineering journeys! + ## Showcase your project -If you want your project to be featured, let us know in the [#sharing-and-contributing channel of our community Slack](https://dlthub.com/community). +If you want your project to be featured, let us know in the [#sharing-and-contributing channel of our community Slack](https://dlthub.com/community). \ No newline at end of file From 41918a3f1c2220a24b85e9963ac5ac48ebc7cecf Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Wed, 3 Jul 2024 12:41:17 +0200 Subject: [PATCH 21/21] Docs: update integrations and verified sources pages (#1532) Co-authored-by: Alena Astrakhantseva --- docs/website/docs/dlt-ecosystem/index.md | 18 +++++++++++++ .../dlt-ecosystem/verified-sources/index.md | 27 ++++++++++++++----- docs/website/sidebars.js | 7 ++--- 3 files changed, 41 insertions(+), 11 deletions(-) create mode 100644 docs/website/docs/dlt-ecosystem/index.md diff --git a/docs/website/docs/dlt-ecosystem/index.md b/docs/website/docs/dlt-ecosystem/index.md new file mode 100644 index 0000000000..740a3a3a39 --- /dev/null +++ b/docs/website/docs/dlt-ecosystem/index.md @@ -0,0 +1,18 @@ +--- +title: Integrations +description: List of integrations +keywords: ['integrations, sources, destinations'] +--- +import DocCardList from '@theme/DocCardList'; +import Link from '../_book-onboarding-call.md'; + +Speed up the process of creating data pipelines by using dlt's multiple pre-built sources and destinations: + +- Each [dlt verified source](verified-sources) allows you to create [pipelines](../general-usage/pipeline) that extract data from a particular source: a database, a cloud service, or an API. +- [Destinations](destinations) are where you want to load your data. dlt supports a variety of destinations, including databases, data warehouses, and data lakes. + + + +:::tip +Most source-destination pairs work seamlessly together. If the merge [write disposition](../general-usage/incremental-loading#choosing-a-write-disposition) is not supported by a destination (for example, [file sytem destination](destinations/filesystem)), dlt will automatically fall back to the [append](../general-usage/incremental-loading#append) write disposition. 
+::: \ No newline at end of file diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/index.md b/docs/website/docs/dlt-ecosystem/verified-sources/index.md index d9ae2d1f21..7b5d9e2bcb 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/index.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/index.md @@ -6,14 +6,29 @@ keywords: ['verified source'] import DocCardList from '@theme/DocCardList'; import Link from '../../_book-onboarding-call.md'; -Pick one of our verified sources that we wrote or maintain ourselves. All of them are constantly tested on real data and distributed as simple Python code so they can be easily customized or hacked. +Choose from our collection of verified sources, developed and maintained by the dlt team and community. Each source is rigorously tested against a real API and provided as Python code for easy customization. -* Need more info? [Join our Slack community](https://dlthub.com/community) and ask in the tech help channel or . +Planning to use dlt in production and need a source that isn't listed? We're happy to build it for you: . -Do you plan to run dlt in production and source is missing? We are happy to build it. -* Source missing? [Request a new verified source](https://github.com/dlt-hub/verified-sources/issues/new?template=source-request.md) -* Missing endpoint or a feature? [Request or contribute](https://github.com/dlt-hub/verified-sources/issues/new?template=extend-a-source.md) +### Popular sources + +- [SQL databases](sql_database). Supports PostgreSQL, MySQL, MS SQL Server, BigQuery, Redshift, and more. +- [REST API generic source](rest_api). Loads data from REST APIs using declarative configuration. +- [OpenAPI source generator](openapi-generator). Generates a source from an OpenAPI 3.x spec using the REST API source. +- [Cloud and local storage](filesystem). Retrieves data from AWS S3, Google Cloud Storage, Azure Blob Storage, local files, and more. -Otherwise pick a source below: +### Full list of verified sources + +:::tip +If you're looking for a source that isn't listed and it provides a REST API, be sure to check out our [REST API generic source](rest_api) + source. +::: + + +### Get help + +* Source missing? [Request a new verified source.](https://github.com/dlt-hub/verified-sources/issues/new?template=source-request.md) +* Missing endpoint or a feature? [Request or contribute](https://github.com/dlt-hub/verified-sources/issues/new?template=extend-a-source.md) +* [Join our Slack community](https://dlthub.com/community) and ask in the technical-help channel. diff --git a/docs/website/sidebars.js b/docs/website/sidebars.js index d3d7def8fc..1ab525a890 100644 --- a/docs/website/sidebars.js +++ b/docs/website/sidebars.js @@ -46,11 +46,8 @@ const sidebars = { type: 'category', label: 'Integrations', link: { - type: 'generated-index', - title: 'Integrations', - description: 'dlt fits everywhere where the data flows. check out our curated data sources, destinations and unexpected places where dlt runs', - slug: 'dlt-ecosystem', - keywords: ['getting started'], + type: 'doc', + id: 'dlt-ecosystem/index', }, items: [ {