Define best practice for integrating Kedro and dlt #4057

Open
deepyaman opened this issue Aug 2, 2024 · 0 comments

Introduction

Kedro is a data pipelining framework that is well-suited for defining data transformations (i.e. the "T" in ELT). However, Kedro is not designed for the extract-and-load ("EL") part.

Enter dlt. dlt is a purpose-built EL framework (established competitors include Fivetran, Airbyte, etc.). Compared to some of these competitors, a differentiating factor is that dlt is a lightweight, open-source, Python-first framework—very aligned with Kedro's value prop in the data transformation pipeline space. It also solves problems that Kedro users have struggled with for years; for example, incremental loading is supported natively.
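
To make the incremental-loading point concrete, here is a minimal dlt sketch (not from the issue): a resource that only pulls rows newer than the last seen cursor value. It assumes the duckdb destination, and fetch_issues and the field names are illustrative stand-ins for a real API:

```python
import dlt


def fetch_issues(since: str):
    """Hypothetical stand-in for a real API call returning issues updated after `since`."""
    yield {"id": 1, "title": "Define best practice for integrating Kedro and dlt", "updated_at": "2024-08-02T00:00:00Z"}


@dlt.resource(primary_key="id", write_disposition="merge")
def issues(
    # dlt persists this cursor between runs, so subsequent runs load only new or updated rows.
    updated_at=dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z"),
):
    yield from fetch_issues(since=updated_at.last_value)


pipeline = dlt.pipeline(pipeline_name="github_issues", destination="duckdb", dataset_name="raw")
pipeline.run(issues)
```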

Background

We first want to figure out the best way to integrate Kedro and dlt, and then work out a path to implementing that integration.

Problem

What's in scope

TBD

What's not in scope

TBD

Design

Alternatives considered

  1. dlt-dbt-like integration

    You can run dbt with dlt using dlt's dbt runner. This integration is well established, and it works relatively well for users.¹ https://github.com/dlt-hub/dlt_dbt_hubspot showcases another (real-world) example of how they can be used together.

    My high-level understanding is that this means Kedro pipeline execution would be triggered from dlt upon data load (a sketch of this alternative is included after the list). This makes sense for a (batch) production workflow; you expect to run some downstream transformations, and maybe even retrain a model, when you have new data.

    In general, Kedro could be seen as a direct alternative to dbt in this scenario, especially if the user is using Kedro and Ibis together for data engineering workflows. However, there are a few differences that come to mind:

    1. Kedro pipelines require input datasets to be defined; dbt pipelines do not (models can just reference tables in a target database by name). Initially, I think it's acceptable for users to define catalog entries pointing to the data that dlt would load; however, down the line, it may make sense to smooth out this experience. Could "Access data after load as dataframes with ibis" (dlt-hub/dlt#1095), or even another integration such as pandas output, be used to hand off data directly to a Kedro node?
    2. dlt passes configuration and credentials to dbt, but in Kedro these are defined in a more dataset-centric way (unless the user reuses configuration, but there's no standard way to do this). (Disclaimer: I don't yet know how configuration passing works in the dlt-dbt integration; maybe this is a non-issue.)
  2. dlt dataset in Kedro

    The Kedro-centric view of the world would have a dlt dataset. However, a complicating factor is that a dlt resource can actually result in multiple tables.

    In the initial discussion, it was mentioned that we would add an additional destination type for Kedro.

    (This option needs more context; I don't think I've fully understood what this would look like. A very rough strawman sketch of one possible reading is included after the list.)

  3. dlt pipeline as a Kedro node

    A dlt pipeline could be run as a Kedro node that takes data from the source, manipulates it (or not), and loads it to a destination.

    However, for this to make sense, we would need to avoid the node function essentially reimplementing dlt.pipeline(...); a sketch of this alternative, illustrating that concern, is included below.
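
For alternative 1, here is a minimal sketch (not from the issue) of what the hand-off could look like on the Kedro side. It assumes a standard Kedro project run from its root and a registered pipeline named "transform" (both assumptions), and uses inline data as a stand-in for a real dlt source:

```python
from pathlib import Path

import dlt
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

# 1. Extract and load with dlt (the "EL" part); inline data stands in for a real source.
raw = dlt.pipeline(pipeline_name="crm_raw", destination="duckdb", dataset_name="raw")
load_info = raw.run([{"id": 1, "email": "a@example.com"}], table_name="contacts")
print(load_info)

# 2. On a successful load, trigger the downstream Kedro transformation pipeline
#    (the "T" part), much as dlt's dbt runner triggers dbt today.
project_path = Path.cwd()  # assumes this script is run from the root of a Kedro project
bootstrap_project(project_path)
with KedroSession.create(project_path=project_path) as session:
    session.run(pipeline_name="transform")  # assumes a registered pipeline named "transform"
```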
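
For alternative 2, purely as a strawman to anchor the discussion, one possible reading of "dlt dataset in Kedro" is a catalog dataset whose save method hands data to a dlt pipeline. All names below are hypothetical, and the multiple-tables complication noted above still applies:

```python
from typing import Any

import dlt
from kedro.io import AbstractDataset


class DltDataset(AbstractDataset):
    """Hypothetical catalog dataset whose save() hands data to a dlt pipeline.

    Note the complicating factor from above: a single save may fan out into
    multiple destination tables (e.g. for nested data), so "one dataset = one
    table" does not necessarily hold.
    """

    def __init__(self, pipeline_name: str, destination: str, dataset_name: str, table_name: str):
        self._pipeline = dlt.pipeline(
            pipeline_name=pipeline_name, destination=destination, dataset_name=dataset_name
        )
        self._table_name = table_name

    def _save(self, data: Any) -> None:
        self._pipeline.run(data, table_name=self._table_name)

    def _load(self) -> Any:
        # Reading back would need something like the ibis/dataframe access
        # discussed under alternative 1.
        raise NotImplementedError

    def _describe(self) -> dict:
        return {"pipeline_name": self._pipeline.pipeline_name, "table_name": self._table_name}
```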
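
For alternative 3, here is a minimal sketch (not from the issue) of a dlt pipeline wrapped in a Kedro node; it also illustrates the concern above, since the node body is essentially just dlt.pipeline(...).run(...). The inline data and names are illustrative:

```python
import dlt
from kedro.pipeline import node, pipeline


def load_raw_users() -> str:
    """Kedro node wrapping a dlt load: the body is essentially dlt.pipeline(...).run(...)."""
    data = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]  # stand-in for a real dlt source
    p = dlt.pipeline(pipeline_name="raw_users", destination="duckdb", dataset_name="raw")
    info = p.run(data, table_name="users")
    return str(info)  # expose the load info to downstream nodes / logging


ingestion = pipeline([node(load_raw_users, inputs=None, outputs="dlt_load_info")])
```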

Rollout strategy

TBD

Future iterations

TBD

Footnotes

  1. In an initial call with @AstrakhantsevaAA, @VioletM, and @astrojuanlu, I asked whether dlt + dbt users were generally happy with the integration, or if it was something thrown together early on that they would like to redesign; it seems to be the former.
