Merge branch 'master' into devel
rudolfix authored Jul 8, 2024
2 parents 20f6b04 + 41918a3 commit 34e97cc
Showing 17 changed files with 1,135 additions and 84 deletions.
175 changes: 174 additions & 1 deletion docs/website/blog/2024-02-06-practice-api-sources.md
@@ -32,6 +32,7 @@
### Data talks club open source spotlight
* [Video](https://www.youtube.com/watch?v=eMbhyOECpcE)
* [Notebook](https://github.com/dlt-hub/dlt_demos/blob/main/spotlight_demo.ipynb)
* DTC learners showcase (see the section below)

### Docs
* [Getting started](https://dlthub.com/docs/getting-started)
@@ -100,8 +101,166 @@
- **Free:** Varies by API.
- **Auth:** Depends on API.

### 11. News API
- **URL:** [News API](https://newsapi.ai/).
- **Use:** Get datasets containing current and historical news articles.
- **Free:** Access to current news articles.
- **Auth:** API key.

### 12. Exchangerates API
- **URL:** [Exchangerates API](https://exchangeratesapi.io/).
- **Use:** Get real-time, intraday, and historical currency rates.
- **Free:** 250 monthly requests.
- **Auth:** API key.

### 13. Spotify API
- **URL:** [Spotify API](https://developer.spotify.com/documentation/web-api).
- **Use:** Get Spotify content and metadata about songs.
- **Free:** Rate-limited.
- **Auth:** API key.

### 14. Football API
- **URL:** [Football API](https://www.api-football.com/).
- **Use:** Get information about football leagues & cups.
- **Free:** 100 requests/day.
- **Auth:** API key.

### 15. Yahoo Finance API
- **URL:** [Yahoo Finance API](https://rapidapi.com/sparior/api/yahoo-finance15/details).
- **Use:** Access a wide range of financial data.
- **Free:** 500 requests/month.
- **Auth:** API key.

### 16. Basketball API

- **URL:** [Basketball API](https://www.api-basketball.com/).
- **Use:** Get information about basketball leagues & cups.
- **Free:** 100 requests/day.
- **Auth:** API key.

### 17. NY Times API

- **URL:** [NY Times API](https://developer.nytimes.com/apis).
- **Use:** Get info about articles, books, movies, and more.
- **Free:** 500 requests/day or 5 requests/minute.
- **Auth:** API key.

### 18. Spoonacular API

- **URL:** [Spoonacular API](https://spoonacular.com/food-api).
- **Use:** Get info about ingredients, recipes, products, and menu items.
- **Free:** 150 requests/day and 1 request/sec.
- **Auth:** API key.

### 19. Movie database alternative API

- **URL:** [Movie database alternative API](https://rapidapi.com/rapidapi/api/movie-database-alternative/pricing).
- **Use:** Movie data for entertainment industry trend analysis.
- **Free:** 1000 requests/day and 10 requests/sec.
- **Auth:** API key.

### 20. RAWG Video Games Database API

- **URL:** [RAWG Video Games Database](https://rawg.io/apidocs).
- **Use:** Gather video game data, such as release dates, platforms, genres, and reviews.
- **Free:** Unlimited requests for limited endpoints.
- **Auth:** API key.

### 21. Jikan API

- **URL:** [Jikan API](https://jikan.moe/).
- **Use:** Access data from MyAnimeList for anime and manga projects.
- **Free:** Rate-limited.
- **Auth:** None.

### 22. Open Library Books API

- **URL:** [Open Library Books API](https://openlibrary.org/dev/docs/api/books).
- **Use:** Access data about millions of books, including titles, authors, and publication dates.
- **Free:** Unlimited.
- **Auth:** None.

### 23. YouTube Data API

- **URL:** [YouTube Data API](https://developers.google.com/youtube/v3/docs/search/list).
- **Use:** Access YouTube video data, channels, playlists, etc.
- **Free:** Limited quota.
- **Auth:** Google API key and OAuth 2.0.

### 24. Reddit API

- **URL:** [Reddit API](https://www.reddit.com/dev/api/).
- **Use:** Access Reddit data for social media analysis or content retrieval.
- **Free:** Rate-limited.
- **Auth:** OAuth 2.0.

### 25. World Bank API

- **URL:** [World Bank API](https://documents.worldbank.org/en/publication/documents-reports/api).
- **Use:** Access economic and development data from the World Bank.
- **Free:** Unlimited.
- **Auth:** None.

Each API offers unique insights for data engineering, from ingestion to visualization. Check each API's documentation for up-to-date details on limitations and authentication.
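
Before wiring any of these APIs into a pipeline, it is worth smoke-testing the endpoint with a plain HTTP call. Below is a minimal sketch against the Open Library Books API from the list above (it requires no auth); the ISBN is arbitrary, and the `title` field comes from that API's documented response shape, so verify it against a live call:

```py
import requests

# Fetch metadata for a single book by ISBN from the Open Library Books API.
response = requests.get(
    "https://openlibrary.org/api/books",
    params={
        "bibkeys": "ISBN:0451526538",  # any valid ISBN works here
        "format": "json",
        "jscmd": "data",
    },
    timeout=10,
)
response.raise_for_status()

# The response is a JSON object keyed by the bibkeys you requested.
for bibkey, book in response.json().items():
    print(bibkey, "->", book.get("title"))
```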

## Using the above sources

You can create a pipeline for the APIs discussed above by using `dlt`'s REST API source. Let's create a PokeAPI pipeline as an example. Follow these steps:

1. Create a REST API source:

```sh
dlt init rest_api duckdb
```

2. The following directory structure gets generated:

```sh
rest_api_pipeline/
├── .dlt/
│   ├── config.toml          # configs for your pipeline
│   └── secrets.toml         # secrets for your pipeline
├── rest_api/                # folder with source-specific files
│   └── ...
├── rest_api_pipeline.py     # your main pipeline script
├── requirements.txt         # dependencies for your pipeline
└── .gitignore               # ignore files for git (not required)
```

3. Configure the source in `rest_api_pipeline.py`:

```py
import dlt

from rest_api import rest_api_source


def load_pokemon() -> None:
    pipeline = dlt.pipeline(
        pipeline_name="rest_api_pokemon",
        destination='duckdb',
        dataset_name="rest_api_data",
    )

    # Declarative source config: one client definition, shared endpoint
    # defaults, and the list of resources (endpoints) to load.
    pokemon_source = rest_api_source(
        {
            "client": {
                "base_url": "https://pokeapi.co/api/v2/",
            },
            "resource_defaults": {
                "endpoint": {
                    "params": {
                        "limit": 1000,
                    },
                },
            },
            "resources": [
                "pokemon",
                "berry",
                "location",
            ],
        }
    )

    load_info = pipeline.run(pokemon_source)
    print(load_info)


if __name__ == "__main__":
    load_pokemon()
```
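
With the source configured, running the pipeline is just a matter of executing the script. A sketch of the remaining commands, assuming the default scaffold shown above (the `show` step is optional and needs `streamlit` installed):

```sh
pip install -r requirements.txt     # install dlt and the duckdb dependency
python rest_api_pipeline.py         # run the pipeline; data lands in a local DuckDB file
dlt pipeline rest_api_pokemon show  # optional: browse the loaded tables
```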

For a detailed guide on creating a pipeline using the REST API source, please read the [REST API source documentation](https://dlthub.com/docs/dlt-ecosystem/verified-sources/rest_api).

## Example projects

Here are some examples from dlt users and working students:
@@ -115,5 +274,19 @@
- Japanese language demos [Notion calendar](https://stable.co.jp/blog/notion-calendar-dlt) and [exploring csv to bigquery with dlt](https://soonraah.github.io/posts/load-csv-data-into-bq-by-dlt/).
- Demos with [Dagster](https://dagster.io/blog/dagster-dlt) and [Prefect](https://www.prefect.io/blog/building-resilient-data-pipelines-in-minutes-with-dlt-prefect).

## DTC learners showcase
Check out the incredible projects from our DTC learners:

1. [e2e_de_project](https://github.com/scpkobayashi/e2e_de_project/tree/153d485bba3ea8f640d0ccf3ec9593790259a646) by [scpkobayashi](https://github.com/scpkobayashi).
2. [de-zoomcamp-project](https://github.com/theDataFixer/de-zoomcamp-project/tree/1737b6a9d556348c2d7d48a91e2a43bb6e12f594) by [theDataFixer](https://github.com/theDataFixer).
3. [data-engineering-zoomcamp2024-project2](https://github.com/pavlokurochka/data-engineering-zoomcamp2024-project2/tree/f336ed00870a74cb93cbd9783dbff594393654b8) by [pavlokurochka](https://github.com/pavlokurochka).
4. [de-zoomcamp-2024](https://github.com/snehangsude/de-zoomcamp-2024) by [snehangsude](https://github.com/snehangsude).
5. [zoomcamp-data-engineer-2024](https://github.com/eokwukwe/zoomcamp-data-engineer-2024) by [eokwukwe](https://github.com/eokwukwe).
6. [data-engineering-zoomcamp-alex](https://github.com/aaalexlit/data-engineering-zoomcamp-alex) by [aaalexlit](https://github.com/aaalexlit).
7. [Zoomcamp2024](https://github.com/alfredzou/Zoomcamp2024) by [alfredzou](https://github.com/alfredzou).
8. [data-engineering-zoomcamp](https://github.com/el-grudge/data-engineering-zoomcamp) by [el-grudge](https://github.com/el-grudge).

Explore these projects to see the innovative solutions and hard work the learners have put into their data engineering journeys!

## Showcase your project
If you want your project to be featured, let us know in the [#sharing-and-contributing channel of our community Slack](https://dlthub.com/community).
@@ -7,7 +7,7 @@
title: Data Engineer & ML Engineer
url: https://github.com/dlt-hub/dlt
image_url: https://avatars.githubusercontent.com/u/89419010?s=48&v=4
-tags: [data observability, data pipeline observability]
+tags: [data observability, data pipeline observability, openapi]
---

At dltHub, we have been pioneering the future of data pipeline generation, [making complex processes simple and scalable.](https://dlthub.com/product/#multiply-don't-add-to-our-productivity) We have not only been building dlt for humans, but also LLMs.
2 changes: 1 addition & 1 deletion docs/website/blog/2024-05-14-rest-api-source-client.md
@@ -7,7 +7,7 @@
title: Open source Data Engineer
url: https://github.com/adrianbr
image_url: https://avatars.githubusercontent.com/u/5762770?v=4
-tags: [full code etl, yes code etl, etl, python elt]
+tags: [rest-api, declarative etl]
---

## What is the REST API Source toolkit?
Expand Down
10 changes: 5 additions & 5 deletions docs/website/blog/2024-05-23-contributed-first-pipeline.md
@@ -1,6 +1,6 @@
---
slug: contributed-first-pipeline
title: "How I contributed my first data pipeline to the open source."
title: "How I Contributed to My First Open Source Data Pipeline"
image: https://storage.googleapis.com/dlt-blog-images/blog_my_first_data_pipeline.png
authors:
name: Aman Gupta
@@ -78,13 +78,13 @@ def incremental_resource(
With the steps defined above, I was able to load the data from Freshdesk to BigQuery and use the pipeline in production. Here’s a summary of the steps I followed:

1. Created a Freshdesk API token with sufficient privileges.
-1. Created an API client to make requests to the Freshdesk API with rate limit and pagination.
-1. Made incremental requests to this client based on the “updated_at” field in the response.
-1. Ran the pipeline using the Python script.
+2. Created an API client to make requests to the Freshdesk API with rate limiting and pagination.
+3. Made incremental requests to this client based on the “updated_at” field in the response.
+4. Ran the pipeline using the Python script.
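
As an illustration of steps 2 and 3, here is a minimal sketch (not the post's actual code) of how incremental extraction on “updated_at” commonly looks with dlt; the Freshdesk domain and API key are placeholders:

```py
import dlt
from dlt.sources.helpers import requests  # requests drop-in with built-in retries

DOMAIN = "<your_domain>"    # placeholder: your Freshdesk subdomain
API_KEY = "<your_api_key>"  # placeholder: your Freshdesk API token


@dlt.resource(write_disposition="merge", primary_key="id")
def tickets(
    updated_at=dlt.sources.incremental("updated_at", initial_value="2022-01-01T00:00:00Z"),
):
    # Only request records changed since the stored watermark.
    page = 1
    while True:
        response = requests.get(
            f"https://{DOMAIN}.freshdesk.com/api/v2/tickets",
            params={"updated_since": updated_at.last_value, "page": page},
            auth=(API_KEY, "X"),  # Freshdesk takes the API key as the basic-auth username
        )
        response.raise_for_status()
        results = response.json()
        if not results:
            break
        yield results
        page += 1
```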


While my journey from civil engineering to data engineering was initially intimidating, it has proved to be a profound learning experience. Writing a pipeline with **`dlt`** mirrors the simplicity of a GET request: you request data, yield it, and it flows from the source to its destination. Now, I help other clients integrate **`dlt`** to streamline their data workflows, which has been an invaluable part of my professional growth.

In conclusion, diving into data engineering has expanded my technical skill set and provided a new lens through which I view challenges and solutions. A couple of years back that lens mostly saw concrete and steel; now it notices the pipelines of the data world.

-Data engineering has proved both challenging, satisfying and a good carrier option for me till now. For those interested in the detailed workings of these pipelines, I encourage exploring dlt's [GitHub repository](https://github.com/dlt-hub/verified-sources) or diving into the [documentation](https://dlthub.com/docs/dlt-ecosystem/verified-sources/freshdesk).
+Data engineering has proved challenging, satisfying, and a good career option for me so far. For those interested in the detailed workings of these pipelines, I encourage exploring dlt's [GitHub repository](https://github.com/dlt-hub/verified-sources) or diving into the [documentation](https://dlthub.com/docs/dlt-ecosystem/verified-sources/freshdesk).
97 changes: 97 additions & 0 deletions docs/website/blog/2024-05-28-openapi-pipeline.md
@@ -0,0 +1,97 @@
---
slug: openapi-pipeline
title: "Instant pipelines with dlt-init-openapi"
image: https://storage.googleapis.com/dlt-blog-images/openapi.png
authors:
name: Adrian Brudaru
title: Open source Data Engineer
url: https://github.com/adrianbr
image_url: https://avatars.githubusercontent.com/u/5762770?v=4
tags: [openapi]
---

# The Future of Data Pipelines starts now.

Dear dltHub Community,

We are thrilled to announce the launch of our groundbreaking pipeline generator tool.

We call it `dlt-init-openapi`.

Just point it to an OpenAPI spec, select your endpoints, and you're done!


### What's OpenAPI again?

[OpenAPI](https://www.openapis.org/) is the world's most widely used API description standard. You may have heard of Swagger docs: those are generated from an OpenAPI spec.
In 2021 an information-security company named Assetnote scanned the web and unearthed [200,000 public
OpenAPI files](https://www.assetnote.io/resources/research/contextual-content-discovery-youve-forgotten-about-the-api-endpoints).
Modern API frameworks like [FastAPI](https://pypi.org/project/fastapi/) generate such specifications automatically.
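
To make that concrete, here is a minimal, illustrative FastAPI app (the route and fields are made up for this example); the framework derives the spec from the typed route definitions and serves it automatically:

```py
from fastapi import FastAPI

app = FastAPI(title="Berry Store")  # hypothetical demo service

@app.get("/berries/{berry_id}")
def get_berry(berry_id: int, fresh: bool = False):
    # Path and query parameter types flow into the generated spec.
    return {"id": berry_id, "fresh": fresh}

# Run with `uvicorn app:app`; FastAPI serves the generated OpenAPI spec
# at /openapi.json and interactive Swagger docs at /docs.
```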

## How does it work?

**A pipeline is a series of datapoints or decisions about how to extract and load the data**, expressed as code or config. I say decisions because building a pipeline boils down to inspecting the documentation or a sample response and deciding how to write the code.

Our tool does its best to pick out the necessary details and detect the rest to generate the complete pipeline for you.

The information required for making those decisions comes from:
- The OpenAPI [Spec](https://github.com/dlt-hub/openapi-specs) (endpoints, auth)
- The dlt [REST API Source](https://dlthub.com/docs/dlt-ecosystem/verified-sources/rest_api) which attempts to detect pagination
- The [dlt init OpenAPI generator](https://dlthub.com/docs/dlt-ecosystem/verified-sources/openapi-generator) which attempts to detect incremental logic and dependent requests.
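
In practice, a generation run looks roughly like the sketch below (see the docs linked above for the exact options; the spec URL is a placeholder, so substitute a real spec, e.g. one from the specs repository):

```sh
pip install dlt-init-openapi

# Point the generator at an OpenAPI spec and name the source to generate;
# it proposes endpoints, pagination, and auth from the spec.
dlt-init-openapi pokemon --url https://example.com/path/to/openapi.yml
```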

### How well does it work?

This is something we are also learning about. We did an internal hackathon where we each built a few pipelines with this generator. In our experiments with APIs for which we had credentials, it worked pretty well.

However, we cannot undertake a big detour from our work to manually test each possible pipeline, so your feedback will be invaluable.
So please, if you try it, let us know how well it worked - and ideally, add the spec you used to our [repository](https://github.com/dlt-hub/openapi-specs).

### What to do if it doesn't work?

Once a pipeline is created, it is a **fully configurable instance of the REST API Source**.
So if anything did not go smoothly, you can make the final tweaks.
You can learn how to adjust the generated pipeline by reading our [REST API Source documentation](https://dlthub.com/docs/dlt-ecosystem/verified-sources/rest_api).

### Are we using LLMs under the hood?

No. This is a potential future enhancement, so maybe later.

The pipelines are generated algorithmically with deterministic outcomes. This way, we have more control over the quality of the decisions.

If we took an LLM-first approach, the errors would compound and put the burden back on the data person.

We are, however, considering LLM assists for the things the algorithmic approach can't detect. Another avenue could be generating the OpenAPI spec from website docs.
So we are eager to get your feedback on what works and what needs work, so we can keep improving.

## Try it out now!

**Video Walkthrough:**

<iframe width="560" height="315" src="https://www.youtube.com/embed/b99qv9je12Q?si=veVVSlHkKQxDX3FX" title="OpenAPI tutorial" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>


**[Colab demo](https://colab.research.google.com/drive/1MRZvguOTZj1MlkEGzjiso8lQ_wr1MJRI?usp=sharing)** - Load data from Stripe API to DuckDB using dlt and OpenAPI

**[Docs](https://dlthub.com/docs/dlt-ecosystem/verified-sources/openapi-generator)** for `dlt-init-openapi`

`dlt-init-openapi` **[code repo](https://github.com/dlt-hub/dlt-init-openapi)**.

**[Specs repository you can generate from.](https://github.com/dlt-hub/openapi-specs)**

Showcase your pipeline in the community sources **[here](https://www.notion.so/dlthub/dltHub-Community-Sources-Snippets-7a7f7ddb39334743b1ba3debbdfb8d7f)**.

## Next steps: Feedback, discussion and sharing.

Solving data engineering headaches in the open source is a team sport.
We got this far with your feedback and help (especially on the [REST API source](https://dlthub.com/docs/blog/rest-api-source-client)), and we are counting on your continued usage and engagement
to help us push what's possible into uncharted but needed directions.

So here's our call to action:

- We're excited to see how you will use our new pipeline generator and we are
eager for your feedback. **[Join our community and let us know how we can improve dlt-init-openapi](https://dlthub.com/community)**
- Got an OpenAPI spec? **[Add it to our specs repository](https://github.com/dlt-hub/openapi-specs)** so others may use it. If the spec doesn't work, please note that in the PR and we will use it for R&D.

*Thank you for being part of our community and for building the future of ETL together!*

*- dltHub Team*