From 76f44ce34d36c1e17d2ae7c941155f1b1cb3a179 Mon Sep 17 00:00:00 2001 From: Adrian Date: Tue, 14 May 2024 12:04:34 +0200 Subject: [PATCH 1/8] rest api blog --- .../blog/2024-05-14-rest-api-source-client.md | 236 ++++++++++++++++++ 1 file changed, 236 insertions(+) create mode 100644 docs/website/blog/2024-05-14-rest-api-source-client.md diff --git a/docs/website/blog/2024-05-14-rest-api-source-client.md b/docs/website/blog/2024-05-14-rest-api-source-client.md new file mode 100644 index 0000000000..07deadde5d --- /dev/null +++ b/docs/website/blog/2024-05-14-rest-api-source-client.md @@ -0,0 +1,236 @@ +--- +slug: rest-api-source-client +title: "Announcing: REST API Source toolkit from dltHub - A Python-only high level approach to pipelines" +image: https://storage.googleapis.com/dlt-blog-images/martin_salo_tweet.png +authors: + name: Adrian Brudaru + title: Open source Data Engineer + url: https://github.com/adrianbr + image_url: https://avatars.githubusercontent.com/u/5762770?v=4 +tags: [full code etl, yes code etl, etl, python elt] +--- + +## What is the REST API Source toolkit? +:::tip +** tl;dr: You are probably familiar with REST APIs. + +- Our new **REST API Source** is a short, declarative configuration driven way of creating sources. +- Our new **REST API Client** is a collection of Python helpers used by the above source, which you can also use as a standalone, config-free, imperative high level abstraction for building pipelines. + +Want to skip to docs? links at the [bottom of the post](#next-steps) +::: + +### Why REST configuration pipeline? Obviously, we need one! + +But of course! Why repeat write all this code for requests and loading, when we could write it once and re-use it with different apis with different configs? + +Once you have built a few pipelines from REST APIs, you can recognise we could, instead of writing code, write configuration. 
+ +**We can call such an obvious next step in ETL tools a “[focal point](https://en.wikipedia.org/wiki/Focal_point_(game_theory))” of “[convergent evolution](https://en.wikipedia.org/wiki/Convergent_evolution)”.** + +And if you’ve been in a few larger more mature companies, you will see a variety of home-grown solutions that look similar. You might also have seen such solutions as commercial products or offerings. + +### But ours will be better… + +So far we have seen many REST API configurators and products - they suffer from predictable flaws + +- Local homebrewed flavors are local for a reason: They aren’t suitable for the broad audience. And often if you ask the users/beneficiaries of these frameworks, they will sometimes argue that they aren’t suitable for anyone at all. +- Commercial products are yet another data product that doesn’t plug into your stack, brings black boxes and removes autonomy, so they simply aren’t an acceptable solution in many cases. + +So how can dlt do better? + +Because it can keep the best of both worlds: the autonomy of a library, the quality of a commercial product. + +As you will see further, we created not just a standalone “configuration-based source builder” but we also expose the REST API client used enabling its use directly in code. + +## Hey community, you made us do it! + +The push for this is coming from you, the community. While we had considered the concept before, there were many things dlt needed before creating a new way to build pipelines. A declarative extractor after all, would not make dlt easier to adopt, because a declarative approach requires more upfront knowledge. + +Credits: + +- So, thank you Alex Butler for building a first version of this and donating it to us back in August ‘23 https://github.com/dlt-hub/dlt-init-openapi/pull/2. 
+- And thank you Francesco Mucio and Willi Müller for re-opening the topic, and creating video [tutorials](https://www.youtube.com/playlist?list=PLpTgUMBCn15rs2NkB4ise780UxLKImZTh). +- And last but not least, thank you to dlt team’s Anton Burnashev (also known for [gspread](https://github.com/burnash/gspread) library) for building it out! + +## The outcome? Two Python-only interfaces, one declarative, one imperative. + +- **dlt’s REST API Source is a Python dictionary-first declarative source builder,** that has enhanced flexibility**,** supports callable passes, native config validations via python dictionaries, and composability directly in your scripts. It enables generating sources dynamically during runtime, enabling straightforward, manual or automated workflows for adapting sources to changes. +- **dlt’s REST API client** is the low level abstraction that powers the **REST API Source.** You can use it in your imperative code for more automation and brevity, if you do not wish to use the higher level declarative interface. + +## Useful for those who frequently build new pipelines + +If you are on a team with 2-3 pipelines that never change much you likely won’t see much benefit from our latest tool. What we observe from early feedback a declarative extractor is great at is enabling easier work at scale. We heard excitement about the **REST API Source** from: **** + +- companies with many pipelines that frequently create new pipelines, +- data platform teams, +- freelancers and agencies, +- folks who want to generate pipelines with LLMs and need a simple interface. + +## How to use the REST API Source? + +Since this is a declarative interface, we can’t make things up as we go along, and instead need to understand what we want to do upfront and declare that. + +In some cases, we might not have the information upfront, so we will show you how to get that info during your development workflow. 
+ +Depending on how you learn better, you can either watch the videos that our community members made, or follow the walkthrough below. + +## **Video walkthroughs:** + +In these videos, you will learn at a leisurely pace how to use the new interface. +[playlist link](https://www.youtube.com/playlist?list=PLpTgUMBCn15rs2NkB4ise780UxLKImZTh) + + +## Workflow walkthrough: Step by step. + +If you prefer to do things at your own pace, try the workflow walkthrough, which will show you the workflow of using the declarative interface. + +In the example below, we will show how to create an API integration with 2 endpoints. One of these is a child resource, using the data from the parent endpoint to make a new request. + +### Configuration Checklist: **Before getting started** + +We will use GitHub’s API as an example. + +1. Collect your api url and endpoints + - an url is the base of the request, for example: `https://api.github.com/` + - an endpoint is the path of an individual resource such as: + - `/repos/{OWNER}/{REPO}/issues` + - or `/repos/{OWNER}/{REPO}/issues/{issue_number}/comments` which would require the issue number from the above endpoint + - or `/users/{username}/starred` etc. +2. Identify the authentication methods + - Github uses bearer tokens for auth, but we can also skip it for public endpoints https://docs.github.com/en/rest/authentication/authenticating-to-the-rest-api?apiVersion=2022-11-28 +3. Identify if you have any dependent request patterns such as first get ids in a list, then use id for requesting details. + 1. for github we might do the below or any other chained requests. + 1. get all repos of an org [`https://api.github.com/orgs/{org}/repos`](https://api.github.com/orgs/%7Borg%7D/repos) + 2. then get all contributors [`https://api.github.com/repos/{owner}/{repo}/contributors`](https://api.github.com/repos/%7Bowner%7D/%7Brepo%7D/contributors) +4. How does pagination work? is there any? do we know the exact pattern? 
+ - On github we have consistent [pagination](https://docs.github.com/en/rest/using-the-rest-api/using-pagination-in-the-rest-api?apiVersion=2022-11-28) between endpoints that looks like this `link_header = response.headers.get('Link', None)` +5. Identify the necessary information for incremental loading + - Will any endpoints be loaded incrementally? + - What columns will you use for incremental extraction and loading? + - github example: we can extract new issues by requesting issues after a particular time: `https://api.github.com/repos/{repo_owner}/{repo_name}/issues?since={since}` + +### Configuration Checklist: Checking responses during **development** + +1. Data path + - You could print the source and see what is yielded +2. Unless you had full documentation at point 4 (which we did), you likely need to still figure out some details on how pagination works. + 1. To do that we suggest using curl or a second python script to do a request and inspect the response. This gives you flexibility to try anything. + 2. Or you could print the source as above - but if there is metadata in headers etc, you might miss it. + +## Applying the configuration + +Here’s what a configured example could look like + +1. Base Url and endpoints +2. Authentication +3. Chained request +4. Pagination +5. Incremental configuration +6. Dependent resource (child) configuration + +```python +# This source has 2 resources: +# - issues: Parent resource, retrieves issues incl issue number. 
+# - issues_comments: child resource which needs the issue number + +import os +from rest_api import RESTAPIConfig + +github_config: RESTAPIConfig = { + "client": { + "base_url": "https://api.github.com/repos/dlt-hub/dlt/", #(1) + # Optional auth for improving rate limits #(2) + # "auth": { + # "token": os.environ.get('GITHUB_TOKEN', userdata.get('GITHUB_TOKEN')), + # }, + }, + # The paginator is autodetected, but we can pass it explicitly #(4) + # "paginator": { + # "type": "header_link", + # "next_url_path": "paging.link", + # } + # we can declare generic settings in one place + # our data is stateful so we load it incrementally by merging on id. + "resource_defaults": { + "primary_key": "id", #(5) + "write_disposition": "merge", #(5) + # these are request params specific to github + "endpoint": { + "params": { + "per_page": 10, + }, + }, + }, + "resources": [ + # This is the first issue + { + "name": "issues", + "endpoint": { + "path": "issues", #(1) + "params": { + "sort": "updated", + "direction": "desc", + "state": "open", + "since": { + "type": "incremental", + "cursor_path": "updated_at", + "initial_value": "2024-01-25T11:21:28Z", + }, + } + }, + }, + # Configuration for fetching comments on issues #(3) + # This is a child resource - as in, it needs something from another. + { + "name": "issue_comments", + "endpoint": { + "path": "issues/{issue_number}/comments", #(1) + # For child resources, you can use values from the parent resource for params. + "params": { + "issue_number": { + # Use type "resolve" to define a child endpoint param which should be resolved from the parent + "type": "resolve", + # Parent endpoint + "resource": "issues", + # The specific field in the issues resource to use for resolution + "field": "number", + } + }, + }, + # A list of fields, from the parent resource, which will be included in the child resource output. + "include_from_parent": ["id"], + }, + ], +} +``` + +## And that’s a wrap - what else should you know?
+ +- As we mentioned, there’s also a REST client - an imperative way to use the same abstractions, for example the auto-paginator - check out this runnable snippet + + ```python + from dlt.sources.helpers.rest_client import RESTClient + + # Initialize the RESTClient with the Pokémon API base URL + client = RESTClient(base_url="https://pokeapi.co/api/v2") + + # Define a function to fetch and paginate through Pokémon data + def fetch_pokemon(): + # Using the paginate method to automatically handle pagination + for page in client.paginate("/pokemon"): + print(page) + # Call the function to start fetching data + fetch_pokemon() + ``` + +- We are going to generate a bunch of sources from openapi specs - stay tuned for an update in a couple of weeks. + +## Next steps: + +- Read more about the + - [REST API Source](https://dlthub.com/docs/dlt-ecosystem/verified-sources/rest_api) and + - **[RESTClient](https://dlthub.com/docs/general-usage/http/rest-client),** + - **and the related [API helpers](https://dlthub.com/devel/general-usage/http/overview) and** [request](https://dlthub.com/docs/general-usage/http/requests)s helper. +- [Join our community](https://dlthub.com/community) and give us feedback! +- Want to share back your work? 
See this page for instructions: [https://dlthub.notion.site/dltHub-Community-Sources-Snippets-7a7f7ddb39334743b1ba3debbdfb8d7f](https://www.notion.so/7a7f7ddb39334743b1ba3debbdfb8d7f?pvs=21) \ No newline at end of file From 0ac3e7870e00c1942b0aec34e9ae6c275482917b Mon Sep 17 00:00:00 2001 From: Adrian Date: Tue, 14 May 2024 13:04:16 +0200 Subject: [PATCH 2/8] format --- .../blog/2024-05-14-rest-api-source-client.md | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/docs/website/blog/2024-05-14-rest-api-source-client.md b/docs/website/blog/2024-05-14-rest-api-source-client.md index 07deadde5d..ec76e118b4 100644 --- a/docs/website/blog/2024-05-14-rest-api-source-client.md +++ b/docs/website/blog/2024-05-14-rest-api-source-client.md @@ -17,7 +17,7 @@ tags: [full code etl, yes code etl, etl, python elt] - Our new **REST API Source** is a short, declarative configuration driven way of creating sources. - Our new **REST API Client** is a collection of Python helpers used by the above source, which you can also use as a standalone, config-free, imperative high level abstraction for building pipelines. -Want to skip to docs? links at the [bottom of the post](#next-steps) +Want to skip to docs? links at the [bottom of the post](#next-steps) ** ::: ### Why REST configuration pipeline? Obviously, we need one! @@ -28,7 +28,7 @@ Once you have built a few pipelines from REST APIs, you can recognise we could, **We can call such an obvious next step in ETL tools a “[focal point](https://en.wikipedia.org/wiki/Focal_point_(game_theory))” of “[convergent evolution](https://en.wikipedia.org/wiki/Convergent_evolution)”.** -And if you’ve been in a few larger more mature companies, you will see a variety of home-grown solutions that look similar. You might also have seen such solutions as commercial products or offerings. +And if you’ve been in a few larger more mature companies, you will have seen a variety of home-grown solutions that look similar. 
You might also have seen such solutions as commercial products or offerings. ### But ours will be better… @@ -60,7 +60,9 @@ Credits: ## Useful for those who frequently build new pipelines -If you are on a team with 2-3 pipelines that never change much you likely won’t see much benefit from our latest tool. What we observe from early feedback a declarative extractor is great at is enabling easier work at scale. We heard excitement about the **REST API Source** from: **** +If you are on a team with 2-3 pipelines that never change much, you likely won’t see much benefit from our latest tool. +Early feedback suggests that what a declarative extractor is really great at is making work at scale easier. +We heard excitement about the **REST API Source** from: - companies with many pipelines that frequently create new pipelines, - data platform teams, @@ -128,6 +130,7 @@ Here’s what a configured example could look like 4. Pagination 5. Incremental configuration 6. Dependent resource (child) configuration +If you are using a narrow screen, scroll the snippet below to look for the numbers designating each component `(n)` ```python # This source has 2 resources: @@ -145,7 +148,7 @@ github_config: RESTAPIConfig = { # "token": os.environ.get('GITHUB_TOKEN', userdata.get('GITHUB_TOKEN')), # }, }, - # The paginator is autodetected, but we can pass it explicitly #(4) + # The paginator is autodetected, but we can pass it explicitly #(4) # "paginator": { # "type": "header_link", # "next_url_path": "paging.link", @@ -233,4 +236,4 @@ github_config: RESTAPIConfig = { - **[RESTClient](https://dlthub.com/docs/general-usage/http/rest-client),** - **and the related [API helpers](https://dlthub.com/devel/general-usage/http/overview) and** [request](https://dlthub.com/docs/general-usage/http/requests)s helper. - [Join our community](https://dlthub.com/community) and give us feedback! -- Want to share back your work?
See this page for instructions: [https://dlthub.notion.site/dltHub-Community-Sources-Snippets-7a7f7ddb39334743b1ba3debbdfb8d7f](https://www.notion.so/7a7f7ddb39334743b1ba3debbdfb8d7f?pvs=21) \ No newline at end of file +- Want to share back your work? See this page for instructions: [dltHub-Community-Sources-Snippets](https://www.notion.so/7a7f7ddb39334743b1ba3debbdfb8d7f?pvs=21) \ No newline at end of file From 2f9276eba8e02f420d97b4e03a4bd1d9d40bc883 Mon Sep 17 00:00:00 2001 From: Adrian Date: Tue, 14 May 2024 14:10:08 +0200 Subject: [PATCH 3/8] format --- docs/website/blog/2024-05-14-rest-api-source-client.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/docs/website/blog/2024-05-14-rest-api-source-client.md b/docs/website/blog/2024-05-14-rest-api-source-client.md index ec76e118b4..2fc6773a9b 100644 --- a/docs/website/blog/2024-05-14-rest-api-source-client.md +++ b/docs/website/blog/2024-05-14-rest-api-source-client.md @@ -1,7 +1,7 @@ --- slug: rest-api-source-client title: "Announcing: REST API Source toolkit from dltHub - A Python-only high level approach to pipelines" -image: https://storage.googleapis.com/dlt-blog-images/martin_salo_tweet.png +image: https://storage.googleapis.com/dlt-blog-images/rest-img.png authors: name: Adrian Brudaru title: Open source Data Engineer @@ -12,12 +12,12 @@ tags: [full code etl, yes code etl, etl, python elt] ## What is the REST API Source toolkit? :::tip -** tl;dr: You are probably familiar with REST APIs. +tl;dr: You are probably familiar with REST APIs. - Our new **REST API Source** is a short, declarative configuration driven way of creating sources. - Our new **REST API Client** is a collection of Python helpers used by the above source, which you can also use as a standalone, config-free, imperative high level abstraction for building pipelines. -Want to skip to docs? links at the [bottom of the post](#next-steps) ** +Want to skip to docs? 
links at the [bottom of the post](#next-steps) ::: ### Why REST configuration pipeline? Obviously, we need one! @@ -130,6 +130,7 @@ Here’s what a configured example could look like 4. Pagination 5. Incremental configuration 6. Dependent resource (child) configuration + If you are using a narrow screen, scroll the snippet below to look for the numbers designating each component `(n)` ```python From 1ef77d315437583b290207ca5dfbda77f9d9e615 Mon Sep 17 00:00:00 2001 From: AstrakhantsevaAA Date: Tue, 14 May 2024 14:24:24 +0200 Subject: [PATCH 4/8] made it prettier --- .../blog/2024-05-14-rest-api-source-client.md | 150 +++++++++--------- 1 file changed, 73 insertions(+), 77 deletions(-) diff --git a/docs/website/blog/2024-05-14-rest-api-source-client.md b/docs/website/blog/2024-05-14-rest-api-source-client.md index 2fc6773a9b..1b6478640e 100644 --- a/docs/website/blog/2024-05-14-rest-api-source-client.md +++ b/docs/website/blog/2024-05-14-rest-api-source-client.md @@ -15,14 +15,14 @@ tags: [full code etl, yes code etl, etl, python elt] tl;dr: You are probably familiar with REST APIs. - Our new **REST API Source** is a short, declarative configuration driven way of creating sources. -- Our new **REST API Client** is a collection of Python helpers used by the above source, which you can also use as a standalone, config-free, imperative high level abstraction for building pipelines. +- Our new **REST API Client** is a collection of Python helpers used by the above source, which you can also use as a standalone, config-free, imperative high-level abstraction for building pipelines. -Want to skip to docs? links at the [bottom of the post](#next-steps) +Want to skip to docs? Links at the [bottom of the post.](#next-steps) ::: ### Why REST configuration pipeline? Obviously, we need one! -But of course! Why repeat write all this code for requests and loading, when we could write it once and re-use it with different apis with different configs? +But of course! 
Why write all this code for requests and loading again and again, when we could write it once and re-use it with different APIs with different configs? Once you have built a few pipelines from REST APIs, you can recognise we could, instead of writing code, write configuration. @@ -32,12 +32,12 @@ And if you’ve been in a few larger more mature companies, you will have seen a ### But ours will be better… -So far we have seen many REST API configurators and products - they suffer from predictable flaws +So far we have seen many REST API configurators and products, and they suffer from predictable flaws: - Local homebrewed flavors are local for a reason: They aren’t suitable for the broad audience. And often if you ask the users/beneficiaries of these frameworks, they will sometimes argue that they aren’t suitable for anyone at all. - Commercial products are yet another data product that doesn’t plug into your stack, brings black boxes and removes autonomy, so they simply aren’t an acceptable solution in many cases. -So how can dlt do better? +So how can `dlt` do better? Because it can keep the best of both worlds: the autonomy of a library, the quality of a commercial product. @@ -45,18 +45,18 @@ As you will see further, we created not just a standalone “configuration-based ## Hey community, you made us do it! -The push for this is coming from you, the community. While we had considered the concept before, there were many things dlt needed before creating a new way to build pipelines. A declarative extractor after all, would not make dlt easier to adopt, because a declarative approach requires more upfront knowledge. +The push for this is coming from you, the community. While we had considered the concept before, there were many things `dlt` needed before creating a new way to build pipelines. A declarative extractor, after all, would not make `dlt` easier to adopt, because a declarative approach requires more upfront knowledge.
Credits: -- So, thank you Alex Butler for building a first version of this and donating it to us back in August ‘23 https://github.com/dlt-hub/dlt-init-openapi/pull/2. +- So, thank you Alex Butler for building a first version of this and donating it to us back in August ‘23: https://github.com/dlt-hub/dlt-init-openapi/pull/2. - And thank you Francesco Mucio and Willi Müller for re-opening the topic, and creating video [tutorials](https://www.youtube.com/playlist?list=PLpTgUMBCn15rs2NkB4ise780UxLKImZTh). -- And last but not least, thank you to dlt team’s Anton Burnashev (also known for [gspread](https://github.com/burnash/gspread) library) for building it out! +- And last but not least, thank you to the `dlt` team’s Anton Burnashev (also known for the [gspread](https://github.com/burnash/gspread) library) for building it out! ## The outcome? Two Python-only interfaces, one declarative, one imperative. -- **dlt’s REST API Source is a Python dictionary-first declarative source builder,** that has enhanced flexibility**,** supports callable passes, native config validations via python dictionaries, and composability directly in your scripts. It enables generating sources dynamically during runtime, enabling straightforward, manual or automated workflows for adapting sources to changes. +- **dlt’s REST API Source** is a Python dictionary-first declarative source builder that offers enhanced flexibility, supports passing callables, validates configs natively via Python dictionaries, and composes directly in your scripts. It lets you generate sources dynamically at runtime, enabling straightforward manual or automated workflows for adapting sources to changes.
+- **dlt’s REST API Client** is the low-level abstraction that powers the REST API Source. You can use it in your imperative code for more automation and brevity, if you do not wish to use the higher-level declarative interface. ## Useful for those who frequently build new pipelines @@ -77,66 +77,66 @@ In some cases, we might not have the information upfront, so we will show you ho Depending on how you learn better, you can either watch the videos that our community members made, or follow the walkthrough below. -## **Video walkthroughs:** +## Video walkthroughs In these videos, you will learn at a leisurely pace how to use the new interface. -[playlist link](https://www.youtube.com/playlist?list=PLpTgUMBCn15rs2NkB4ise780UxLKImZTh) +[playlist link.](https://www.youtube.com/playlist?list=PLpTgUMBCn15rs2NkB4ise780UxLKImZTh) -## Workflow walkthrough: Step by step. +## Workflow walkthrough: Step by step If you prefer to do things at your own pace, try the workflow walkthrough, which will show you the workflow of using the declarative interface. In the example below, we will show how to create an API integration with 2 endpoints. One of these is a child resource, using the data from the parent endpoint to make a new request. -### Configuration Checklist: **Before getting started** +### Configuration Checklist: Before getting started We will use GitHub’s API as an example. -1. Collect your api url and endpoints - - an url is the base of the request, for example: `https://api.github.com/` - - an endpoint is the path of an individual resource such as: - - `/repos/{OWNER}/{REPO}/issues` - - or `/repos/{OWNER}/{REPO}/issues/{issue_number}/comments` which would require the issue number from the above endpoint +1. Collect your API URL and endpoints: + - A URL is the base of the request, for example: `https://api.github.com/`.
+ - An endpoint is the path of an individual resource such as: + - `/repos/{OWNER}/{REPO}/issues`; + - or `/repos/{OWNER}/{REPO}/issues/{issue_number}/comments` which would require the issue number from the above endpoint; - or `/users/{username}/starred` etc. -2. Identify the authentication methods - - Github uses bearer tokens for auth, but we can also skip it for public endpoints https://docs.github.com/en/rest/authentication/authenticating-to-the-rest-api?apiVersion=2022-11-28 +2. Identify the authentication methods: + - GitHub uses bearer tokens for auth, but we can also skip it for public endpoints https://docs.github.com/en/rest/authentication/authenticating-to-the-rest-api?apiVersion=2022-11-28. 3. Identify if you have any dependent request patterns such as first get ids in a list, then use id for requesting details. - 1. for github we might do the below or any other chained requests. - 1. get all repos of an org [`https://api.github.com/orgs/{org}/repos`](https://api.github.com/orgs/%7Borg%7D/repos) - 2. then get all contributors [`https://api.github.com/repos/{owner}/{repo}/contributors`](https://api.github.com/repos/%7Bowner%7D/%7Brepo%7D/contributors) -4. How does pagination work? is there any? do we know the exact pattern? - - On github we have consistent [pagination](https://docs.github.com/en/rest/using-the-rest-api/using-pagination-in-the-rest-api?apiVersion=2022-11-28) between endpoints that looks like this `link_header = response.headers.get('Link', None)` -5. Identify the necessary information for incremental loading + + For GitHub, we might do the below or any other chained requests: + 1. Get all repos of an org `https://api.github.com/orgs/{org}/repos`. + 2. Then get all contributors `https://api.github.com/repos/{owner}/{repo}/contributors`. +4. How does pagination work? Is there any? Do we know the exact pattern? 
+ - On GitHub, [pagination](https://docs.github.com/en/rest/using-the-rest-api/using-pagination-in-the-rest-api?apiVersion=2022-11-28) is consistent between endpoints: the URL of the next page arrives in the `Link` response header, which you can read with `link_header = response.headers.get('Link', None)`. +5. Identify the necessary information for incremental loading: - Will any endpoints be loaded incrementally? - What columns will you use for incremental extraction and loading? - - github example: we can extract new issues by requesting issues after a particular time: `https://api.github.com/repos/{repo_owner}/{repo_name}/issues?since={since}` + - GitHub example: We can extract new issues by requesting issues after a particular time: `https://api.github.com/repos/{repo_owner}/{repo_name}/issues?since={since}`. -### Configuration Checklist: Checking responses during **development** +### Configuration Checklist: Checking responses during development -1. Data path - - You could print the source and see what is yielded +1. Data path: + - You could print the source and see what is yielded. 2. Unless you had full documentation at point 4 (which we did), you likely need to still figure out some details on how pagination works. - 1. To do that we suggest using curl or a second python script to do a request and inspect the response. This gives you flexibility to try anything. + 1. To do that, we suggest using `curl` or a second Python script to make a request and inspect the response. This gives you the flexibility to try anything. 2. Or you could print the source as above - but if there is metadata in headers etc, you might miss it. -## Applying the configuration +### Applying the configuration -Here’s what a configured example could look like +Here’s what a configured example could look like: -1. Base Url and endpoints -2. Authentication -3. Chained request -4. Pagination -5. Incremental configuration -6. Dependent resource (child) configuration +1. Base URL and endpoints. +2. Authentication. +3. Pagination. +4.
Incremental configuration. +5. Dependent resource (child) configuration. -If you are using a narrow screen, scroll the snippet below to look for the numbers designating each component `(n)` +If you are using a narrow screen, scroll the snippet below to look for the numbers designating each component `(n)`. -```python +```py # This source has 2 resources: -# - issues: Parent resource, retrieves issues incl issue number. -# - issues_comments: child resource which needs the issue number +# - issues: Parent resource, retrieves issues incl. issue number +# - issues_comments: Child resource which needs the issue number import os from rest_api import RESTAPIConfig @@ -144,22 +144,22 @@ from rest_api import RESTAPIConfig github_config: RESTAPIConfig = { "client": { "base_url": "https://api.github.com/repos/dlt-hub/dlt/", #(1) - # Optional auth for improving rate limits #(2) - # "auth": { - # "token": os.environ.get('GITHUB_TOKEN', userdata.get('GITHUB_TOKEN')), + # Optional auth for improving rate limits + # "auth": { #(2) + # "token": os.environ.get('GITHUB_TOKEN'), # }, }, - # The paginator is autodetected, but we can pass it explicitly #(4) + # The paginator is autodetected, but we can pass it explicitly #(3) # "paginator": { - # "type": "header_link", - # "next_url_path": "paging.link", + # "type": "header_link", + # "next_url_path": "paging.link", # } - # we can declare generic settings in one place - # our data is stateful so we load it incrementally by merging on id. 
+ # We can declare generic settings in one place + # Our data is stateful so we load it incrementally by merging on id "resource_defaults": { - "primary_key": "id", #(5) - "write_disposition": "merge", #(5) - # these are request params specific to github + "primary_key": "id", #(4) + "write_disposition": "merge", #(4) + # these are request params specific to GitHub "endpoint": { "params": { "per_page": 10, @@ -167,7 +167,7 @@ github_config: RESTAPIConfig = { }, }, "resources": [ - # This is the first issue + # This is the first resource - issues { "name": "issues", "endpoint": { @@ -177,15 +177,15 @@ github_config: RESTAPIConfig = { "direction": "desc", "state": "open", "since": { - "type": "incremental", - "cursor_path": "updated_at", - "initial_value": "2024-01-25T11:21:28Z", + "type": "incremental", #(4) + "cursor_path": "updated_at", #(4) + "initial_value": "2024-01-25T11:21:28Z", #(4) }, } }, }, - # Configuration for fetching comments on issues #(3) - # This is a child resource - as in, it needs something from another. + # Configuration for fetching comments on issues #(5) + # This is a child resource - as in, it needs something from another { "name": "issue_comments", "endpoint": { @@ -209,32 +209,28 @@ github_config: RESTAPIConfig = { } ``` -## And that’s a wrap - what else should you know? +## And that’s a wrap — what else should you know? 
-- As we mentioned, there’s also a REST client - an imperative way to use the same abstractions, for example the auto-paginator - check out this runnable snippet
+- As we mentioned, there’s also a **REST Client** - an imperative way to use the same abstractions, for example, the auto-paginator - check out this runnable snippet:

-    ```python
+    ```py
     from dlt.sources.helpers.rest_client import RESTClient
     # Initialize the RESTClient with the Pokémon API base URL
     client = RESTClient(base_url="https://pokeapi.co/api/v2")
-    # Define a function to fetch and paginate through Pokémon data
-    def fetch_pokemon():
-        # Using the paginate method to automatically handle pagination
-        for page in client.paginate("/pokemon"):
-            print(page)
-    # Call the function to start fetching data
-    fetch_pokemon()
+    # Using the paginate method to automatically handle pagination
+    for page in client.paginate("/pokemon"):
+        print(page)
     ```
-- We are going to generate a bunch of sources from openapi specs - stay tuned for an update in a couple of weeks.
+- We are going to generate a bunch of sources from OpenAPI specs — stay tuned for an update in a couple of weeks!
 ## Next steps:
-- Read more about the
-    - [REST API Source](https://dlthub.com/docs/dlt-ecosystem/verified-sources/rest_api) and
-    - **[RESTClient](https://dlthub.com/docs/general-usage/http/rest-client),**
-    - **and the related [API helpers](https://dlthub.com/devel/general-usage/http/overview) and** [request](https://dlthub.com/docs/general-usage/http/requests)s helper.
+- Read more about the related [API helpers](https://dlthub.com/devel/general-usage/http/overview):
+    - [REST API Source](https://dlthub.com/docs/dlt-ecosystem/verified-sources/rest_api),
+    - [RESTClient](https://dlthub.com/docs/general-usage/http/rest-client),
+    - and [requests](https://dlthub.com/docs/general-usage/http/requests) helper.
 - [Join our community](https://dlthub.com/community) and give us feedback!
-- Want to share back your work? See this page for instructions: [dltHub-Community-Sources-Snippets](https://www.notion.so/7a7f7ddb39334743b1ba3debbdfb8d7f?pvs=21)
\ No newline at end of file
+- Want to share back your work? See this page for instructions: [dltHub Community: Sources & Snippets.](https://www.notion.so/7a7f7ddb39334743b1ba3debbdfb8d7f?pvs=21)
\ No newline at end of file

From ed29fff1b85191fec681b75b819df449e7c27249 Mon Sep 17 00:00:00 2001
From: Adrian
Date: Tue, 14 May 2024 14:17:54 +0200
Subject: [PATCH 5/8] format

---
 .../blog/2024-05-14-rest-api-source-client.md | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/docs/website/blog/2024-05-14-rest-api-source-client.md b/docs/website/blog/2024-05-14-rest-api-source-client.md
index 1b6478640e..e736df7a84 100644
--- a/docs/website/blog/2024-05-14-rest-api-source-client.md
+++ b/docs/website/blog/2024-05-14-rest-api-source-client.md
@@ -227,10 +227,9 @@ github_config: RESTAPIConfig = {
 - We are going to generate a bunch of sources from OpenAPI specs — stay tuned for an update in a couple of weeks!
 ## Next steps:
-
-- Read more about the related [API helpers](https://dlthub.com/devel/general-usage/http/overview):
-    - [REST API Source](https://dlthub.com/docs/dlt-ecosystem/verified-sources/rest_api),
-    - [RESTClient](https://dlthub.com/docs/general-usage/http/rest-client),
-    - and [requests](https://dlthub.com/docs/general-usage/http/requests) helper.
-- [Join our community](https://dlthub.com/community) and give us feedback!
-- Want to share back your work? See this page for instructions: [dltHub Community: Sources & Snippets.](https://www.notion.so/7a7f7ddb39334743b1ba3debbdfb8d7f?pvs=21)
\ No newline at end of file
+- Share back your work! Instructions: **[dltHub-Community-Sources-Snippets](https://www.notion.so/7a7f7ddb39334743b1ba3debbdfb8d7f?pvs=21)**
+- Read more about the
+    - **[REST API Source](https://dlthub.com/docs/dlt-ecosystem/verified-sources/rest_api)** and
+    - **[RESTClient](https://dlthub.com/docs/general-usage/http/rest-client),**
+    - and the related **[API helpers](https://dlthub.com/devel/general-usage/http/overview)** and **[requests](https://dlthub.com/docs/general-usage/http/requests)** helper.
+- **[Join our community](https://dlthub.com/community)** and give us feedback!

From 9c518b14e27085902054fed9661e9700f99ea62c Mon Sep 17 00:00:00 2001
From: Adrian
Date: Tue, 14 May 2024 14:44:10 +0200
Subject: [PATCH 6/8] format

---
 docs/website/blog/2024-05-14-rest-api-source-client.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/docs/website/blog/2024-05-14-rest-api-source-client.md b/docs/website/blog/2024-05-14-rest-api-source-client.md
index e736df7a84..b1511fe438 100644
--- a/docs/website/blog/2024-05-14-rest-api-source-client.md
+++ b/docs/website/blog/2024-05-14-rest-api-source-client.md
@@ -91,7 +91,10 @@ In the example below, we will show how to create an API integration with 2 endpo
 ### Configuration Checklist: Before getting started
-We will use GitHub’s API as an example.
+We will use GitHub’s API as an example. #
+
+We will link to examples also in this [Colab tutorial demo](https://colab.research.google.com/drive/1qnzIM2N4iUL8AOX1oBUypzwoM3Hj5hhG#scrollTo=SCr8ACUtyfBN&forceEdit=true&sandboxMode=true)
+
 1. Collect your api url and endpoints:
     - An URL is the base of the request, for example: `https://api.github.com/`.

From 2d6a6ee319180987ec2fb5de5d111ce230d91f04 Mon Sep 17 00:00:00 2001
From: Adrian
Date: Tue, 14 May 2024 14:51:33 +0200
Subject: [PATCH 7/8] format

---
 .../blog/2024-05-14-rest-api-source-client.md | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/docs/website/blog/2024-05-14-rest-api-source-client.md b/docs/website/blog/2024-05-14-rest-api-source-client.md
index b1511fe438..720230a11c 100644
--- a/docs/website/blog/2024-05-14-rest-api-source-client.md
+++ b/docs/website/blog/2024-05-14-rest-api-source-client.md
@@ -96,22 +96,22 @@ We will use GitHub’s API as an example. #
 We will link to examples also in this [Colab tutorial demo](https://colab.research.google.com/drive/1qnzIM2N4iUL8AOX1oBUypzwoM3Hj5hhG#scrollTo=SCr8ACUtyfBN&forceEdit=true&sandboxMode=true)
-1. Collect your api url and endpoints:
+1. Collect your api url and endpoints, [colab example](https://colab.research.google.com/drive/1qnzIM2N4iUL8AOX1oBUypzwoM3Hj5hhG#scrollTo=bKthJGV6Mg6C):
     - An URL is the base of the request, for example: `https://api.github.com/`.
     - An endpoint is the path of an individual resource such as:
         - `/repos/{OWNER}/{REPO}/issues`;
         - or `/repos/{OWNER}/{REPO}/issues/{issue_number}/comments` which would require the issue number from the above endpoint;
         - or `/users/{username}/starred` etc.
-2. Identify the authentication methods:
+2. Identify the authentication methods, [colab example](https://colab.research.google.com/drive/1qnzIM2N4iUL8AOX1oBUypzwoM3Hj5hhG#scrollTo=mViSDre8McI7):
     - GitHub uses bearer tokens for auth, but we can also skip it for public endpoints https://docs.github.com/en/rest/authentication/authenticating-to-the-rest-api?apiVersion=2022-11-28.
 3. Identify if you have any dependent request patterns such as first get ids in a list, then use id for requesting details.
-    - For GitHub, we might do the below or any other chained requests:
+    For GitHub, we might do the below or any other dependent requests. [colab example](https://colab.research.google.com/drive/1qnzIM2N4iUL8AOX1oBUypzwoM3Hj5hhG#scrollTo=vw7JJ0BlpFyh):
         1. Get all repos of an org `https://api.github.com/orgs/{org}/repos`.
         2. Then get all contributors `https://api.github.com/repos/{owner}/{repo}/contributors`.
-4. How does pagination work? Is there any? Do we know the exact pattern?
+
+4. How does pagination work? Is there any? Do we know the exact pattern? [colab example](https://colab.research.google.com/drive/1qnzIM2N4iUL8AOX1oBUypzwoM3Hj5hhG#scrollTo=rqqJhUoCB9F3)
     - On GitHub, we have consistent [pagination](https://docs.github.com/en/rest/using-the-rest-api/using-pagination-in-the-rest-api?apiVersion=2022-11-28) between endpoints that looks like this `link_header = response.headers.get('Link', None)`.
-5. Identify the necessary information for incremental loading:
+5. Identify the necessary information for incremental loading, [colab example](https://colab.research.google.com/drive/1qnzIM2N4iUL8AOX1oBUypzwoM3Hj5hhG#scrollTo=fsd_SPZD7nBj):
     - Will any endpoints be loaded incrementally?
     - What columns will you use for incremental extraction and loading?
     - GitHub example: We can extract new issues by requesting issues after a particular time: `https://api.github.com/repos/{repo_owner}/{repo_name}/issues?since={since}`.
@@ -119,9 +119,9 @@ We will link to examples also in this [Colab tutorial demo](https://colab.resear
 ### Configuration Checklist: Checking responses during development
 1. Data path:
-    - You could print the source and see what is yielded.
+    - You could print the source and see what is yielded. [Colab example](https://colab.research.google.com/drive/1qnzIM2N4iUL8AOX1oBUypzwoM3Hj5hhG#scrollTo=oJ9uWLb8ZYto&line=6&uniqifier=1)
 2. Unless you had full documentation at point 4 (which we did), you likely need to still figure out some details on how pagination works.
-    1. To do that, we suggest using `curl` or a second python script to do a request and inspect the response. This gives you flexibility to try anything.
+    1. To do that, we suggest using `curl` or a second python script to do a request and inspect the response. This gives you flexibility to try anything. [Colab example](https://colab.research.google.com/drive/1qnzIM2N4iUL8AOX1oBUypzwoM3Hj5hhG#scrollTo=tFZ3SrZIMTKH)
     2. Or you could print the source as above - but if there is metadata in headers etc, you might miss it.
 ### Applying the configuration
@@ -139,7 +139,7 @@ If you are using a narrow screen, scroll the snippet below to look for the numbe
 ```py
 # This source has 2 resources:
 # - issues: Parent resource, retrieves issues incl. issue number
-# - issues_comments: Child resource which needs the issue number
+# - issues_comments: Child resource which needs the issue number from parent.
 import os
 from rest_api import RESTAPIConfig

From 3bf3e68b7092f7b5b88cf9a9a7b4a2476f264bbf Mon Sep 17 00:00:00 2001
From: Adrian
Date: Tue, 14 May 2024 14:57:00 +0200
Subject: [PATCH 8/8] format

---
 docs/website/blog/2024-05-14-rest-api-source-client.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/website/blog/2024-05-14-rest-api-source-client.md b/docs/website/blog/2024-05-14-rest-api-source-client.md
index 720230a11c..87066fe7bd 100644
--- a/docs/website/blog/2024-05-14-rest-api-source-client.md
+++ b/docs/website/blog/2024-05-14-rest-api-source-client.md
@@ -58,7 +58,7 @@ Credits:
 - **dlt’s REST API Source** is a Python dictionary-first declarative source builder, that has enhanced flexibility, supports callable passes, native config validations via python dictionaries, and composability directly in your scripts. It enables generating sources dynamically during runtime, enabling straightforward, manual or automated workflows for adapting sources to changes.
 - **dlt’s REST API Client** is the low-level abstraction that powers the REST API Source. You can use it in your imperative code for more automation and brevity, if you do not wish to use the higher level declarative interface.
-## Useful for those who frequently build new pipelines
+### Useful for those who frequently build new pipelines
 If you are on a team with 2-3 pipelines that never change much you likely won’t see much benefit from our latest tool. What we observe from early feedback a declarative extractor is great at is enabling easier work at scale.
@@ -77,7 +77,7 @@ In some cases, we might not have the information upfront, so we will show you ho
 Depending on how you learn better, you can either watch the videos that our community members made, or follow the walkthrough below.
-## Video walkthroughs
+## **Video walkthroughs**
 In these videos, you will learn at a leisurely pace how to use the new interface. [playlist link.](https://www.youtube.com/playlist?list=PLpTgUMBCn15rs2NkB4ise780UxLKImZTh)