Merge branch 'master' into devel
rudolfix authored Jul 8, 2024
2 parents 20f6b04 + 41918a3 commit 34e97cc
Showing 17 changed files with 1,135 additions and 84 deletions.
175 changes: 174 additions & 1 deletion docs/website/blog/2024-02-06-practice-api-sources.md
@@ -32,6 +32,7 @@
### Data talks club open source spotlight
* [Video](https://www.youtube.com/watch?v=eMbhyOECpcE)
* [Notebook](https://github.com/dlt-hub/dlt_demos/blob/main/spotlight_demo.ipynb)
* DTC learners showcase (see the section below)

### Docs
* [Getting started](https://dlthub.com/docs/getting-started)
@@ -100,8 +101,166 @@
- **Free:** Varies by API.
- **Auth:** Depends on API.

### 11. News API
- **URL:** [News API](https://newsapi.ai/).
- **Use:** Get datasets containing current and historical news articles.
- **Free:** Access to current news articles.
- **Auth:** API key.

### 12. Exchangerates API
- **URL:** [Exchangerates API](https://exchangeratesapi.io/).
- **Use:** Get real-time, intraday, and historical currency rates.
- **Free:** 250 monthly requests.
- **Auth:** API key.

### 13. Spotify API
- **URL:** [Spotify API](https://developer.spotify.com/documentation/web-api).
- **Use:** Get Spotify content and metadata about songs.
- **Free:** Rate-limited.
- **Auth:** API key.

### 14. Football API
- **URL:** [Football API](https://www.api-football.com/).
- **Use:** Get information about football leagues & cups.
- **Free:** 100 requests/day.
- **Auth:** API key.

### 15. Yahoo Finance API
- **URL:** [Yahoo Finance API](https://rapidapi.com/sparior/api/yahoo-finance15/details).
- **Use:** Access a wide range of financial data.
- **Free:** 500 requests/month.
- **Auth:** API key.

### 16. Basketball API

- **URL:** [Basketball API](https://www.api-basketball.com/).
- **Use:** Get information about basketball leagues & cups.
- **Free:** 100 requests/day.
- **Auth:** API key.

### 17. NY Times API

- **URL:** [NY Times API](https://developer.nytimes.com/apis).
- **Use:** Get info about articles, books, movies, and more.
- **Free:** 500 requests/day or 5 requests/minute.
- **Auth:** API key.

### 18. Spoonacular API

- **URL:** [Spoonacular API](https://spoonacular.com/food-api).
- **Use:** Get info about ingredients, recipes, products, and menu items.
- **Free:** 150 requests/day and 1 request/sec.
- **Auth:** API key.

### 19. Movie database alternative API

- **URL:** [Movie database alternative API](https://rapidapi.com/rapidapi/api/movie-database-alternative/pricing).
- **Use:** Movie data for entertainment industry trend analysis.
- **Free:** 1000 requests/day and 10 requests/sec.
- **Auth:** API key.

### 20. RAWG Video Games Database API

- **URL:** [RAWG Video Games Database](https://rawg.io/apidocs).
- **Use:** Gather video game data, such as release dates, platforms, genres, and reviews.
- **Free:** Unlimited requests for limited endpoints.
- **Auth:** API key.

### 21. Jikan API

- **URL:** [Jikan API](https://jikan.moe/).
- **Use:** Access data from MyAnimeList for anime and manga projects.
- **Free:** Rate-limited.
- **Auth:** None.

### 22. Open Library Books API

- **URL:** [Open Library Books API](https://openlibrary.org/dev/docs/api/books).
- **Use:** Access data about millions of books, including titles, authors, and publication dates.
- **Free:** Unlimited.
- **Auth:** None.

### 23. YouTube Data API

- **URL:** [YouTube Data API](https://developers.google.com/youtube/v3/docs/search/list).
- **Use:** Access YouTube video data, channels, playlists, etc.
- **Free:** Limited quota.
- **Auth:** Google API key and OAuth 2.0.

### 24. Reddit API

- **URL:** [Reddit API](https://www.reddit.com/dev/api/).
- **Use:** Access Reddit data for social media analysis or content retrieval.
- **Free:** Rate-limited.
- **Auth:** OAuth 2.0.

### 25. World Bank API

- **URL:** [World Bank API](https://documents.worldbank.org/en/publication/documents-reports/api).
- **Use:** Access economic and development data from the World Bank.
- **Free:** Unlimited.
- **Auth:** None.

Each API offers unique insights for data engineering, from ingestion to visualization. Check each API's documentation for up-to-date details on limitations and authentication.
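
Before wiring any of these APIs into a pipeline, it is worth smoke-testing the endpoint with a plain HTTP call. Below is a minimal sketch against the Open Library Books API from the list above (it requires no auth); the ISBN is arbitrary, and the `title` field comes from that API's documented response shape, so verify it against a live call:

```py
import requests

# Fetch metadata for a single book by ISBN from the Open Library Books API.
response = requests.get(
    "https://openlibrary.org/api/books",
    params={
        "bibkeys": "ISBN:0451526538",  # any valid ISBN works here
        "format": "json",
        "jscmd": "data",
    },
    timeout=10,
)
response.raise_for_status()

# The response is a JSON object keyed by the bibkeys you requested.
for bibkey, book in response.json().items():
    print(bibkey, "->", book.get("title"))
```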

## Using the above sources

You can create a pipeline for the APIs discussed above by using `dlt`'s REST API source. Let's create a PokeAPI pipeline as an example. Follow these steps:

1. Create a REST API source:

```sh
dlt init rest_api duckdb
```

2. The following directory structure gets generated:

```sh
rest_api_pipeline/
├── .dlt/
│   ├── config.toml          # configs for your pipeline
│   └── secrets.toml         # secrets for your pipeline
├── rest_api/                # folder with source-specific files
│   └── ...
├── rest_api_pipeline.py     # your main pipeline script
├── requirements.txt         # dependencies for your pipeline
└── .gitignore               # ignore files for git (not required)
```

3. Configure the source in `rest_api_pipeline.py`:

```py
import dlt

from rest_api import rest_api_source


def load_pokemon() -> None:
    pipeline = dlt.pipeline(
        pipeline_name="rest_api_pokemon",
        destination='duckdb',
        dataset_name="rest_api_data",
    )

    # Declarative source config: one client definition, shared endpoint
    # defaults, and the list of resources (endpoints) to load.
    pokemon_source = rest_api_source(
        {
            "client": {
                "base_url": "https://pokeapi.co/api/v2/",
            },
            "resource_defaults": {
                "endpoint": {
                    "params": {
                        "limit": 1000,
                    },
                },
            },
            "resources": [
                "pokemon",
                "berry",
                "location",
            ],
        }
    )

    load_info = pipeline.run(pokemon_source)
    print(load_info)


if __name__ == "__main__":
    load_pokemon()
```
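
With the source configured, running the pipeline is just a matter of executing the script. A sketch of the remaining commands, assuming the default scaffold shown above (the `show` step is optional and needs `streamlit` installed):

```sh
pip install -r requirements.txt     # install dlt and the duckdb dependency
python rest_api_pipeline.py         # run the pipeline; data lands in a local DuckDB file
dlt pipeline rest_api_pokemon show  # optional: browse the loaded tables
```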

For a detailed guide on creating a pipeline using the REST API source, please read the [REST API source documentation](https://dlthub.com/docs/dlt-ecosystem/verified-sources/rest_api).

## Example projects

Here are some examples from dlt users and working students:
@@ -115,5 +274,19 @@
- Japanese language demos [Notion calendar](https://stable.co.jp/blog/notion-calendar-dlt) and [exploring csv to bigquery with dlt](https://soonraah.github.io/posts/load-csv-data-into-bq-by-dlt/).
- Demos with [Dagster](https://dagster.io/blog/dagster-dlt) and [Prefect](https://www.prefect.io/blog/building-resilient-data-pipelines-in-minutes-with-dlt-prefect).

## DTC learners showcase
Check out the incredible projects from our DTC learners:

1. [e2e_de_project](https://github.com/scpkobayashi/e2e_de_project/tree/153d485bba3ea8f640d0ccf3ec9593790259a646) by [scpkobayashi](https://github.com/scpkobayashi).
2. [de-zoomcamp-project](https://github.com/theDataFixer/de-zoomcamp-project/tree/1737b6a9d556348c2d7d48a91e2a43bb6e12f594) by [theDataFixer](https://github.com/theDataFixer).
3. [data-engineering-zoomcamp2024-project2](https://github.com/pavlokurochka/data-engineering-zoomcamp2024-project2/tree/f336ed00870a74cb93cbd9783dbff594393654b8) by [pavlokurochka](https://github.com/pavlokurochka).
4. [de-zoomcamp-2024](https://github.com/snehangsude/de-zoomcamp-2024) by [snehangsude](https://github.com/snehangsude).
5. [zoomcamp-data-engineer-2024](https://github.com/eokwukwe/zoomcamp-data-engineer-2024) by [eokwukwe](https://github.com/eokwukwe).
6. [data-engineering-zoomcamp-alex](https://github.com/aaalexlit/data-engineering-zoomcamp-alex) by [aaalexlit](https://github.com/aaalexlit).
7. [Zoomcamp2024](https://github.com/alfredzou/Zoomcamp2024) by [alfredzou](https://github.com/alfredzou).
8. [data-engineering-zoomcamp](https://github.com/el-grudge/data-engineering-zoomcamp) by [el-grudge](https://github.com/el-grudge).

Explore these projects to see the innovative solutions and hard work the learners have put into their data engineering journeys!

## Showcase your project
If you want your project to be featured, let us know in the [#sharing-and-contributing channel of our community Slack](https://dlthub.com/community).
@@ -7,7 +7,7 @@
title: Data Engineer & ML Engineer
url: https://github.com/dlt-hub/dlt
image_url: https://avatars.githubusercontent.com/u/89419010?s=48&v=4
-tags: [data observability, data pipeline observability]
+tags: [data observability, data pipeline observability, openapi]
---

At dltHub, we have been pioneering the future of data pipeline generation, [making complex processes simple and scalable.](https://dlthub.com/product/#multiply-don't-add-to-our-productivity) We have not only been building dlt for humans, but also LLMs.
2 changes: 1 addition & 1 deletion docs/website/blog/2024-05-14-rest-api-source-client.md
@@ -7,7 +7,7 @@
title: Open source Data Engineer
url: https://github.com/adrianbr
image_url: https://avatars.githubusercontent.com/u/5762770?v=4
-tags: [full code etl, yes code etl, etl, python elt]
+tags: [rest-api, declarative etl]
---

## What is the REST API Source toolkit?
Expand Down
10 changes: 5 additions & 5 deletions docs/website/blog/2024-05-23-contributed-first-pipeline.md
@@ -1,6 +1,6 @@
---
slug: contributed-first-pipeline
title: "How I contributed my first data pipeline to the open source."
title: "How I Contributed to My First Open Source Data Pipeline"
image: https://storage.googleapis.com/dlt-blog-images/blog_my_first_data_pipeline.png
authors:
name: Aman Gupta
@@ -78,13 +78,13 @@ def incremental_resource(
With the steps defined above, I was able to load the data from Freshdesk to BigQuery and use the pipeline in production. Here’s a summary of the steps I followed:

1. Created a Freshdesk API token with sufficient privileges.
-1. Created an API client to make requests to the Freshdesk API with rate limit and pagination.
-1. Made incremental requests to this client based on the “updated_at” field in the response.
-1. Ran the pipeline using the Python script.
+2. Created an API client to make requests to the Freshdesk API with rate limiting and pagination.
+3. Made incremental requests to this client based on the “updated_at” field in the response.
+4. Ran the pipeline using the Python script.
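
As an illustration of steps 2 and 3, here is a minimal sketch (not the post's actual code) of how incremental extraction on “updated_at” commonly looks with dlt; the Freshdesk domain and API key are placeholders:

```py
import dlt
from dlt.sources.helpers import requests  # requests drop-in with built-in retries

DOMAIN = "<your_domain>"    # placeholder: your Freshdesk subdomain
API_KEY = "<your_api_key>"  # placeholder: your Freshdesk API token


@dlt.resource(write_disposition="merge", primary_key="id")
def tickets(
    updated_at=dlt.sources.incremental("updated_at", initial_value="2022-01-01T00:00:00Z"),
):
    # Only request records changed since the stored watermark.
    page = 1
    while True:
        response = requests.get(
            f"https://{DOMAIN}.freshdesk.com/api/v2/tickets",
            params={"updated_since": updated_at.last_value, "page": page},
            auth=(API_KEY, "X"),  # Freshdesk takes the API key as the basic-auth username
        )
        response.raise_for_status()
        results = response.json()
        if not results:
            break
        yield results
        page += 1
```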


While my journey from civil engineering to data engineering was initially intimidating, it has proved to be a profound learning experience. Writing a pipeline with **`dlt`** mirrors the simplicity of a GET request: you request data, yield it, and it flows from the source to its destination. Now, I help other clients integrate **`dlt`** to streamline their data workflows, which has been an invaluable part of my professional growth.

In conclusion, diving into data engineering has expanded my technical skill set and provided a new lens through which I view challenges and solutions. A couple of years back that lens mostly saw concrete and steel; now it notices the pipelines of the data world.

-Data engineering has proved both challenging, satisfying and a good carrier option for me till now. For those interested in the detailed workings of these pipelines, I encourage exploring dlt's [GitHub repository](https://github.com/dlt-hub/verified-sources) or diving into the [documentation](https://dlthub.com/docs/dlt-ecosystem/verified-sources/freshdesk).
+Data engineering has proved challenging, satisfying, and a good career option for me so far. For those interested in the detailed workings of these pipelines, I encourage exploring dlt's [GitHub repository](https://github.com/dlt-hub/verified-sources) or diving into the [documentation](https://dlthub.com/docs/dlt-ecosystem/verified-sources/freshdesk).
97 changes: 97 additions & 0 deletions docs/website/blog/2024-05-28-openapi-pipeline.md
@@ -0,0 +1,97 @@
---
slug: openapi-pipeline
title: "Instant pipelines with dlt-init-openapi"
image: https://storage.googleapis.com/dlt-blog-images/openapi.png
authors:
name: Adrian Brudaru
title: Open source Data Engineer
url: https://github.com/adrianbr
image_url: https://avatars.githubusercontent.com/u/5762770?v=4
tags: [openapi]
---

# The Future of Data Pipelines starts now.

Dear dltHub Community,

We are thrilled to announce the launch of our groundbreaking pipeline generator tool.

We call it `dlt-init-openapi`.

Just point it to an OpenAPI spec, select your endpoints, and you're done!


### What's OpenAPI again?

[OpenAPI](https://www.openapis.org/) is the world's most widely used API description standard. You may have heard of Swagger docs: those are generated from an OpenAPI spec.
In 2021 an information-security company named Assetnote scanned the web and unearthed [200,000 public
OpenAPI files](https://www.assetnote.io/resources/research/contextual-content-discovery-youve-forgotten-about-the-api-endpoints).
Modern API frameworks like [FastAPI](https://pypi.org/project/fastapi/) generate such specifications automatically.
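
To make that concrete, here is a minimal, illustrative FastAPI app (the route and fields are made up for this example); the framework derives the spec from the typed route definitions and serves it automatically:

```py
from fastapi import FastAPI

app = FastAPI(title="Berry Store")  # hypothetical demo service

@app.get("/berries/{berry_id}")
def get_berry(berry_id: int, fresh: bool = False):
    # Path and query parameter types flow into the generated spec.
    return {"id": berry_id, "fresh": fresh}

# Run with `uvicorn app:app`; FastAPI serves the generated OpenAPI spec
# at /openapi.json and interactive Swagger docs at /docs.
```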

## How does it work?

**A pipeline is a series of datapoints or decisions about how to extract and load the data**, expressed as code or config. I say decisions because building a pipeline boils down to inspecting the documentation or a sample response and deciding how to write the code.

Our tool does its best to pick out the necessary details and detect the rest to generate the complete pipeline for you.

The information required for making those decisions comes from:
- The OpenAPI [Spec](https://github.com/dlt-hub/openapi-specs) (endpoints, auth)
- The dlt [REST API Source](https://dlthub.com/docs/dlt-ecosystem/verified-sources/rest_api) which attempts to detect pagination
- The [dlt init OpenAPI generator](https://dlthub.com/docs/dlt-ecosystem/verified-sources/openapi-generator) which attempts to detect incremental logic and dependent requests.
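
In practice, a generation run looks roughly like the sketch below (see the docs linked above for the exact options; the spec URL is a placeholder, so substitute a real spec, e.g. one from the specs repository):

```sh
pip install dlt-init-openapi

# Point the generator at an OpenAPI spec and name the source to generate;
# it proposes endpoints, pagination, and auth from the spec.
dlt-init-openapi pokemon --url https://example.com/path/to/openapi.yml
```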

### How well does it work?

This is something we are also learning about. We did an internal hackathon where we each built a few pipelines with this generator. In our experiments with APIs for which we had credentials, it worked pretty well.

However, we cannot undertake a big detour from our work to manually test each possible pipeline, so your feedback will be invaluable.
So please, if you try it, let us know how well it worked - and ideally, add the spec you used to our [repository](https://github.com/dlt-hub/openapi-specs).

### What to do if it doesn't work?

Once a pipeline is created, it is a **fully configurable instance of the REST API Source**.
So if anything did not go smoothly, you can make the final tweaks.
You can learn how to adjust the generated pipeline by reading our [REST API Source documentation](https://dlthub.com/docs/dlt-ecosystem/verified-sources/rest_api).

### Are we using LLMs under the hood?

No. This is a potential future enhancement, so maybe later.

The pipelines are generated algorithmically with deterministic outcomes. This way, we have more control over the quality of the decisions.

If we took an LLM-first approach, the errors would compound and put the burden back on the data person.

We are, however, considering LLM assists for the things the algorithmic approach can't detect. Another avenue could be generating the OpenAPI spec from website docs.
So we are eager to get your feedback on what works and what needs work, so we can keep improving.

## Try it out now!

**Video Walkthrough:**

<iframe width="560" height="315" src="https://www.youtube.com/embed/b99qv9je12Q?si=veVVSlHkKQxDX3FX" title="OpenAPI tutorial" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>


**[Colab demo](https://colab.research.google.com/drive/1MRZvguOTZj1MlkEGzjiso8lQ_wr1MJRI?usp=sharing)** - Load data from Stripe API to DuckDB using dlt and OpenAPI

**[Docs](https://dlthub.com/docs/dlt-ecosystem/verified-sources/openapi-generator)** for `dlt-init-openapi`

`dlt-init-openapi` **[code repo](https://github.com/dlt-hub/dlt-init-openapi)**.

**[Specs repository you can generate from.](https://github.com/dlt-hub/openapi-specs)**

Showcase your pipeline in the community sources **[here](https://www.notion.so/dlthub/dltHub-Community-Sources-Snippets-7a7f7ddb39334743b1ba3debbdfb8d7f)**.

## Next steps: Feedback, discussion and sharing.

Solving data engineering headaches in the open source is a team sport.
We got this far with your feedback and help (especially on the [REST API source](https://dlthub.com/docs/blog/rest-api-source-client)), and we are counting on your continued usage and engagement
to help us push what's possible into uncharted but needed directions.

So here's our call to action:

- We're excited to see how you will use our new pipeline generator and we are
eager for your feedback. **[Join our community and let us know how we can improve dlt-init-openapi](https://dlthub.com/community)**
- Got an OpenAPI spec? **[Add it to our specs repository](https://github.com/dlt-hub/openapi-specs)** so others may use it. If the spec doesn't work, please note that in the PR and we will use it for R&D.

*Thank you for being part of our community and for building the future of ETL together!*

*- dltHub Team*