Introduction
Kedro is a data pipelining framework that is well-suited for defining data transformations (i.e. the "T" in ELT). However, Kedro is not designed for the extract-and-load ("EL") part.
Enter dlt. dlt is a purpose-built EL framework (established competitors include Fivetran, Airbyte, etc.). Compared to some of these competitors, a differentiating factor is that dlt is a lightweight, open-source, Python-first framework—very aligned with Kedro's value prop in the data transformation pipeline space. It also solves problems that Kedro users have struggled with for years; for example, incremental loading is supported natively.
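To make the incremental loading point concrete, here is a minimal sketch of dlt's native incremental cursor (the resource, pipeline, and toy data are made up for illustration):

```python
import dlt

# Toy in-memory "source"; in practice this would be an API or database query.
ROWS = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-02-01"},
]


@dlt.resource(primary_key="id", write_disposition="merge")
def issues(updated_at=dlt.sources.incremental("updated_at", initial_value="1970-01-01")):
    # dlt persists the cursor between runs and injects the last seen value,
    # so only new or changed rows get extracted and loaded.
    yield from (row for row in ROWS if row["updated_at"] > updated_at.last_value)


pipeline = dlt.pipeline(pipeline_name="issues_demo", destination="duckdb", dataset_name="raw")
print(pipeline.run(issues()))
```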
Background
We want to first figure out what is the best way to integrate Kedro and dlt, and then figure out a path to implementing this integration.
Problem
What's in scope
TBD
What's not in scope
TBD
Design
Alternatives considered
dlt-dbt-like integration
You can run dbt with dlt using the dlt runner. This integration is well established, and it works relatively well for users¹. https://github.com/dlt-hub/dlt_dbt_hubspot showcases another (real-world) example of how they can be used together.
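For reference, a condensed sketch of how that integration is typically invoked, roughly following the dlt docs (pipeline name, dataset, and dbt package path are placeholders):

```python
import dlt

# dlt first loads the raw data into the destination...
pipeline = dlt.pipeline(
    pipeline_name="hubspot",
    destination="duckdb",
    dataset_name="hubspot_raw",
)
# pipeline.run(...)  # EL step omitted here

# ...then hands its credentials/config to dbt and runs the package against the same destination.
dbt = dlt.dbt.package(pipeline, "path/to/dbt_package")
models = dbt.run_all()
for m in models:
    print(m.model_name, m.status)
```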
My high-level understanding is that this means Kedro pipeline execution would be triggered from dlt upon data load. This makes sense for a (batch) production workflow; you expect to run some downstream transformations, and maybe even retrain a model, when you have new data.
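As a sketch of what the analogous dlt-runs-Kedro flow could look like (project path, pipeline names, and the toy payload are placeholders; this is not an existing integration):

```python
from pathlib import Path

import dlt
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

# EL step: dlt loads the raw data (a toy payload stands in for a real source).
dlt_pipeline = dlt.pipeline(pipeline_name="raw_ingest", destination="duckdb", dataset_name="raw")
load_info = dlt_pipeline.run([{"id": 1, "event": "push"}], table_name="events")
print(load_info)

# T step: once the load finishes, trigger the downstream Kedro pipeline.
project_path = Path("my_kedro_project")
bootstrap_project(project_path)
with KedroSession.create(project_path=project_path) as session:
    session.run(pipeline_name="data_processing")
```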
In general, Kedro could be seen as a direct alternative to dbt in this scenario, especially if the user is using Kedro and Ibis together for data engineering workflows. However, there are a few differences that come to mind:
- Kedro pipelines require input datasets to be defined; dbt pipelines do not (models can just reference tables in a target database by name). Initially, I think it's acceptable for users to define catalog entries pointing to the data that dlt would load; however, down the line, it may make sense to smoothen this experience. Could Access data after load as dataframes with ibis (dlt-hub/dlt#1095), or even another integration like pandas output, be used to hand off data directly to a Kedro node?
- dlt passes configuration and credentials to dbt, but in Kedro these are defined in a more dataset-centric way (unless the user reuses configuration, and there's no standard way to do this). (Disclaimer: I don't know anything about how the configuration passing works in the dlt-dbt integration yet; maybe this is a non-issue.)
dlt dataset in Kedro
The Kedro-centric view of the world would have a dlt dataset. However, a complicating factor is that a dlt resource can actually result in multiple tables.
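A minimal sketch of what such a dataset might look like, assuming a hypothetical `DltDataset` that only implements the save path (all names are made up; the multiple-tables question is left open):

```python
from typing import Any

import dlt
import pandas as pd
from kedro.io import AbstractDataset


class DltDataset(AbstractDataset[pd.DataFrame, pd.DataFrame]):
    """Hypothetical Kedro dataset that writes a DataFrame through a dlt pipeline."""

    def __init__(self, pipeline_name: str, destination: str, dataset_name: str, table_name: str):
        self._table_name = table_name
        self._pipeline = dlt.pipeline(
            pipeline_name=pipeline_name,
            destination=destination,
            dataset_name=dataset_name,
        )

    def _save(self, data: pd.DataFrame) -> None:
        # dlt infers the schema and may unpack nested fields into child tables,
        # which is exactly the "one resource, multiple tables" complication above.
        self._pipeline.run(data, table_name=self._table_name)

    def _load(self) -> pd.DataFrame:
        # Reading back would need something like the "access data after load" work in dlt.
        raise NotImplementedError

    def _describe(self) -> dict[str, Any]:
        return {"table_name": self._table_name, "pipeline_name": self._pipeline.pipeline_name}
```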
In the initial discussion, it was mentioned that we would add an additional destination type for Kedro.
(This option needs more context; I don't think I've fully understood what this would look like.)
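Purely as a guess at what it could mean, dlt's custom destination decorator lets arbitrary Python code receive the load batches, which a Kedro-flavoured destination could perhaps build on (everything below is speculative):

```python
import dlt


@dlt.destination(batch_size=100)
def kedro_destination(items, table) -> None:
    # `items` is a batch of extracted rows, `table` the dlt table schema.
    # A real integration might save these through a Kedro DataCatalog entry instead.
    print(f"received {len(items)} rows for table {table['name']}")


pipeline = dlt.pipeline(pipeline_name="to_kedro", destination=kedro_destination)
pipeline.run([{"id": 1}], table_name="events")
```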
dlt pipeline as a Kedro node
A dlt pipeline could be run as a Kedro node that takes data from the source, manipulates it (or not), and loads it to a destination.
However, for this to make sense, we would need to avoid the node function essentially reimplementing `dlt.pipeline(...)`.
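A minimal sketch of what that node could look like (all names hypothetical); it also illustrates the concern, since the node body is little more than a call to `dlt.pipeline(...).run(...)`:

```python
import dlt
from kedro.pipeline import node, pipeline


def ingest_events(raw_rows: list[dict]) -> str:
    """Hypothetical Kedro node that hands Kedro-managed input data to dlt."""
    dlt_pipeline = dlt.pipeline(
        pipeline_name="events_ingest",
        destination="duckdb",
        dataset_name="raw",
    )
    load_info = dlt_pipeline.run(raw_rows, table_name="events")
    return str(load_info)


ingestion_pipeline = pipeline(
    [node(ingest_events, inputs="raw_rows", outputs="load_report")]
)
```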
Rollout strategy
TBD
Future iterations
TBD
Footnotes
¹ In an initial call with @AstrakhantsevaAA, @VioletM, and @astrojuanlu, I asked whether dlt + dbt users were generally happy with the integration, or if it was something thrown together early on that they would like to redesign; it seems to be the former.