rs-parquet-gql

This repo is a Rust implementation of the service described in my Towards Data Science post, "Data Access API over Data Lake Tables Without the Complexity".
You can find the post here

In short, it is a GraphQL query service that serves GraphQL queries over Parquet files in a data lake table. It is implemented using Axum and Apache Arrow DataFusion. This is the intro from the post:

"...providing thin clients the ability to query data lake files fast usually comes at the price of adding more moving parts and processes to our pipeline, in order to either copy and ingest data to more expensive customer-facing warehouses or aggregate and transform it to fit low-latency databases. The purpose of this post is to explore and demonstrate a different and simpler approach to tackle this requirement using lighter in-process query engines. Specifically, I show how we can use in-process engines, such as DuckDB and Arrow Data Fusion, in order to create services that can both handle data lake files and volumes and act as a fast memory store that serves low-latency API calls. Using this approach, we can efficiently consolidate the required functionality into a single query service, which can be horizontally scaled, that will load data, aggregate and store it in memory, and serve API calls efficiently and fast."
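
To make the query-service idea above more concrete, here is a minimal sketch of one step such a service might perform: turning an already-parsed request (selected fields, equality filters, a row limit) into a SQL string that an in-process engine can run. This is not the code in this repo. `TableQuery`, `to_sql`, and the helper functions are hypothetical names invented for illustration, and the sketch uses only the Rust standard library so that it compiles on its own.

```rust
/// Hypothetical, simplified stand-in for a parsed GraphQL request:
/// the table to read, the selected fields, equality filters, and a row limit.
pub struct TableQuery {
    pub table: String,
    pub fields: Vec<String>,
    pub filters: Vec<(String, String)>,
    pub limit: Option<usize>,
}

/// Accept only plain identifiers (letters, digits, underscores, not starting
/// with a digit) so request text is never spliced into SQL as a name.
fn is_identifier(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {}
        _ => return false,
    }
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}

/// Quote a string literal, doubling any embedded single quotes.
fn quote_literal(s: &str) -> String {
    format!("'{}'", s.replace('\'', "''"))
}

/// Build a SELECT statement from the request, or return an error message
/// if the request is empty or contains an invalid table or column name.
pub fn to_sql(q: &TableQuery) -> Result<String, String> {
    if q.fields.is_empty() {
        return Err("at least one field must be selected".to_string());
    }
    // Validate the table name, every selected field, and every filter column.
    let names = std::iter::once(&q.table)
        .chain(q.fields.iter())
        .chain(q.filters.iter().map(|(column, _)| column));
    for name in names {
        if !is_identifier(name) {
            return Err(format!("invalid identifier: {}", name));
        }
    }
    let mut sql = format!("SELECT {} FROM {}", q.fields.join(", "), q.table);
    if !q.filters.is_empty() {
        let conditions: Vec<String> = q
            .filters
            .iter()
            .map(|(column, value)| format!("{} = {}", column, quote_literal(value)))
            .collect();
        sql.push_str(" WHERE ");
        sql.push_str(&conditions.join(" AND "));
    }
    if let Some(n) = q.limit {
        sql.push_str(&format!(" LIMIT {}", n));
    }
    Ok(sql)
}
```

In a service like the one described, a string produced this way could plausibly be passed to DataFusion (for example through `SessionContext::sql`, after registering the Parquet files as a table), with Axum handling the HTTP layer. How this repo actually maps GraphQL requests to queries may differ.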