Merge branch 'master' into devel
rudolfix authored Mar 5, 2024
2 parents f608131 + b15a71c commit 2ffcd97
Showing 4 changed files with 25 additions and 9 deletions.
14 changes: 7 additions & 7 deletions docs/website/blog/2023-06-21-open-api-spec-for-dlt-init.md
@@ -13,17 +13,17 @@ Today we are releasing a proof of concept of the [`dlt init`](https://dlthub.com

If you build APIs, for example with [FastAPI](https://fastapi.tiangolo.com/), you can, thanks to the [OpenAPI spec](https://spec.openapis.org/oas/v3.1.0), automatically generate a [Python client](https://pypi.org/project/openapi-python-client/0.6.0a4/) and give it to your users. Our demo takes this a step further and enables you to generate advanced `dlt` pipelines that, in essence, convert your API into a live dataset.
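As a minimal sketch of that workflow (the app and route below are made up for illustration, not part of the demo), a FastAPI service automatically publishes its OpenAPI spec at `/openapi.json`:

```python
# app.py - a tiny FastAPI service; FastAPI generates and serves
# its OpenAPI spec at /openapi.json out of the box
from fastapi import FastAPI

app = FastAPI(title="Tiny Pokemon API")

@app.get("/pokemon")
def list_pokemon(limit: int = 20, offset: int = 0):
    # a paginated list endpoint - the kind of shape a code generator
    # can turn into a typed client method (or a `dlt` resource)
    return {"count": 2, "results": [{"name": "bulbasaur"}, {"name": "ivysaur"}]}
```

Pointing `openapi-python-client generate --url http://localhost:8000/openapi.json` at a running instance then produces the typed Python client.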

You can see how Marcin generates such a pipeline from the OpenAPI spec using the [Pokemon API](https://pokeapi.co/) in the Loom below.
[![marcin-demo](https://storage.googleapis.com/dlt-blog-images/openapi_loom_old.png)](https://www.loom.com/share/2806b873ba1c4e0ea382eb3b4fbaf808?sid=501add8b-90a0-4734-9620-c6184d840995)

Part of our vision is that each API will come with a `dlt` pipeline - similar to how, these days, it often comes with a Python client. We believe that API users very often do not really want to deal with endpoints, HTTP requests, and JSON responses. They need live, evolving datasets that they can place anywhere they want so that they are accessible to any workflow.

We believe that API builders will bundle `dlt` pipelines with their APIs only if such a process is hassle-free. One answer to that is code generation and the reuse of information from the OpenAPI spec.

This release is a part of a bigger vision for `dlt`: a world centered around accessible data for modern data teams. In these new times, code is becoming more disposable, but the data stays valuable. We eventually want to create an ecosystem where hundreds of thousands of pipelines will be created, shared, and deployed, and where datasets, reports, and analytics can be written and shared publicly and privately. [Code generation is automation on steroids](https://dlthub.com/product/#code-generation-is-automation-on-steroids) and we are going to be releasing many more features based on this principle.

## Generating a pipeline for PokeAPI using OpenAPI spec

In the embedded Loom you saw Marcin pull data from the `dlt` pipeline created from the OpenAPI spec. The proof of concept already uses a few tricks and heuristics to generate useful code. Contrary to what you may think, PokeAPI is a complex API with a lot of linked data types and endpoints!

- It created a resource for all endpoints that return lists of objects (see the sketch below).
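To give a feel for the output, here is a minimal hand-written sketch of the kind of resource such a generator might emit for a list endpoint - illustrative only, not the actual generated code:

```python
import dlt
import requests

@dlt.resource(write_disposition="append")
def pokemon(limit: int = 100):
    # follow the paginated list endpoint until the "next" link runs out
    url = f"https://pokeapi.co/api/v2/pokemon?limit={limit}"
    while url:
        page = requests.get(url).json()
        yield page["results"]
        url = page.get("next")

pipeline = dlt.pipeline(
    pipeline_name="pokemon", destination="duckdb", dataset_name="pokemon_data"
)
pipeline.run(pokemon())
```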
2 changes: 1 addition & 1 deletion docs/website/blog/2024-02-28-what-is-pyairbyte.md
@@ -18,7 +18,7 @@ Here at dltHub, we work on the Python library for data ingestion. So when I hear

PyAirbyte is an interesting initiative from Airbyte - similar to the one that Meltano undertook 3 years ago. It provides a convenient way to download and install Airbyte sources and run them locally, storing the data in a cache dataset. Users can then read the data from this cache.

A Python wrapper on the Airbyte source is quite nice and has a feeling close to [Alto](https://github.com/z3z1ma/alto). The whole process of cloning/pip-installing the repository, spawning a separate process to run the Airbyte connector, and reading the data via a UNIX pipe is hidden behind a Pythonic interface.
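A rough sketch of that interface, assuming PyAirbyte's published API (the source name and config here are illustrative):

```python
import airbyte as ab

# download and install the connector, then run it in a local subprocess
source = ab.get_source(
    "source-pokeapi",
    config={"pokemon_name": "bulbasaur"},
    install_if_missing=True,
)
source.check()
source.select_all_streams()

# read into the local cache, then pull a stream out as a DataFrame
result = source.read()
df = result["pokemon"].to_pandas()
```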

Note that this library is not an Airbyte replacement - the loaders of Airbyte and the library are very different. The library loader uses `pandas.to_sql` and SQLAlchemy, and is not a replacement for the Airbyte destinations that are available in open-source Airbyte.

16 changes: 16 additions & 0 deletions docs/website/docs/dlt-ecosystem/destinations/filesystem.md
@@ -185,6 +185,22 @@ bucket_url = "file:///absolute/path" # three / for absolute path
- `replace` - all files that belong to such tables are deleted from the dataset folder, and then the current set of files is added.
- `merge` - falls back to `append` (a usage sketch follows below)
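A minimal sketch of how a disposition is chosen at run time, assuming a configured filesystem destination (the table and rows are illustrative):

```python
import dlt

pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline",
    destination="filesystem",
    dataset_name="chess_players_games_data",
)

players = [{"id": 1, "name": "magnus"}, {"id": 2, "name": "hikaru"}]

# "replace" deletes the table's previously loaded files, then writes the new set;
# "append" would keep them and add the new files alongside
pipeline.run(players, table_name="players", write_disposition="replace")
```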

## File compression

The filesystem destination in the dlt library uses `gzip` compression by default for efficiency, so files may be stored in a compressed format that is not directly readable as plain text or JSON Lines (`jsonl`). If you encounter files that seem unreadable, they are likely compressed.

To handle compressed files:

- To disable compression, set `data_writer.disable_compression` in your `config.toml` file. This can be useful if you want to access the files directly without having to decompress them. For example:
```toml
[normalize.data_writer]
disable_compression=true
```

- To decompress a `gzip` file, you can use tools like `gunzip`, which converts the compressed file back to its original format, making it readable (see the Python sketch below).
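For instance, a compressed `jsonl` file can also be read directly from Python with the standard library (the file path is illustrative):

```python
import gzip
import json

# open the gzip-compressed JSON Lines file in text mode and parse line by line
with gzip.open("chess_players_games_data/players/load_1.jsonl.gz", "rt", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]

print(rows[:3])
```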

For more details on managing file compression, please visit our documentation on performance optimization: [Disabling and Enabling File Compression](https://dlthub.com/docs/reference/performance#disabling-and-enabling-file-compression).

## Data loading
All the files are stored in a single folder with the name of the dataset that you passed to the `run` or `load` methods of `pipeline`. In our example chess pipeline, it is **chess_players_games_data**.

@@ -140,4 +140,4 @@ def get_pages(

# check that stuff was loaded
row_counts = pipeline.last_trace.last_normalize_info.row_counts
assert row_counts["ticket_events"] == 17
