Define best practice for integrating Kedro and dlt #4057

Open
deepyaman opened this issue Aug 2, 2024 · 0 comments

Introduction

Kedro is a data pipelining framework that is well-suited for defining data transformations (i.e. the "T" in ELT). However, Kedro is not designed for the extract-and-load ("EL") part.

Enter dlt. dlt is a purpose-built EL framework (established competitors include Fivetran, Airbyte, etc.). Compared to some of these competitors, a differentiating factor is that dlt is a lightweight, open-source, Python-first framework—very aligned with Kedro's value prop in the data transformation pipeline space. It also solves problems that Kedro users have struggled with for years; for example, incremental loading is supported natively.
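
To make the incremental-loading point concrete, here is a minimal dlt sketch (not from the issue): a resource that only pulls rows newer than the last seen cursor value. It assumes the duckdb destination, and fetch_issues and the field names are illustrative stand-ins for a real API:

```python
import dlt


def fetch_issues(since: str):
    """Hypothetical stand-in for a real API call returning issues updated after `since`."""
    yield {"id": 1, "title": "Define best practice for integrating Kedro and dlt", "updated_at": "2024-08-02T00:00:00Z"}


@dlt.resource(primary_key="id", write_disposition="merge")
def issues(
    # dlt persists this cursor between runs, so subsequent runs load only new or updated rows.
    updated_at=dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z"),
):
    yield from fetch_issues(since=updated_at.last_value)


pipeline = dlt.pipeline(pipeline_name="github_issues", destination="duckdb", dataset_name="raw")
pipeline.run(issues)
```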

Background

We first want to figure out the best way to integrate Kedro and dlt, and then work out a path to implementing that integration.

Problem

What's in scope

TBD

What's not in scope

TBD

Design

Alternatives considered

  1. dlt-dbt-like integration

    You can run dbt with dlt using dlt's dbt runner. This integration is well established, and it works relatively well for users.¹ https://github.com/dlt-hub/dlt_dbt_hubspot showcases another (real-world) example of how they can be used together.

    My high-level understanding is that this means Kedro pipeline execution would be triggered from dlt upon data load (a sketch of this alternative is included after the list). This makes sense for a (batch) production workflow; you expect to run some downstream transformations, and maybe even retrain a model, when you have new data.

    In general, Kedro could be seen as a direct alternative to dbt in this scenario, especially if the user is using Kedro and Ibis together for data engineering workflows. However, there are a few differences that come to mind:

    1. Kedro pipelines require input datasets to be defined; dbt pipelines do not (models can just reference tables in a target database by name). Initially, I think it's acceptable for users to define catalog entries pointing to the data that dlt would load; however, down the line, it may make sense to smooth out this experience. Could "Access data after load as dataframes with ibis" (dlt-hub/dlt#1095), or even another integration such as pandas output, be used to hand off data directly to a Kedro node?
    2. dlt passes configuration and credentials to dbt, but in Kedro these are defined in a more dataset-centric way (unless the user reuses configuration, but there's no standard way to do this). (Disclaimer: I don't yet know how configuration passing works in the dlt-dbt integration; maybe this is a non-issue.)
  2. dlt dataset in Kedro

    The Kedro-centric view of the world would have a dlt dataset. However, a complicating factor is that a dlt resource can actually result in multiple tables.

    In the initial discussion, it was mentioned that we would add an additional destination type for Kedro.

    (This option needs more context; I don't think I've fully understood what this would look like. A very rough strawman sketch of one possible reading is included after the list.)

  3. dlt pipeline as a Kedro node

    A dlt pipeline could be run as a Kedro node that takes data from the source, manipulates it (or not), and loads it to a destination.

    However, for this to make sense, we would need to avoid the node function essentially reimplementing dlt.pipeline(...); a sketch of this alternative, illustrating that concern, is included below.
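
For alternative 1, here is a minimal sketch (not from the issue) of what the hand-off could look like on the Kedro side. It assumes a standard Kedro project run from its root and a registered pipeline named "transform" (both assumptions), and uses inline data as a stand-in for a real dlt source:

```python
from pathlib import Path

import dlt
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

# 1. Extract and load with dlt (the "EL" part); inline data stands in for a real source.
raw = dlt.pipeline(pipeline_name="crm_raw", destination="duckdb", dataset_name="raw")
load_info = raw.run([{"id": 1, "email": "a@example.com"}], table_name="contacts")
print(load_info)

# 2. On a successful load, trigger the downstream Kedro transformation pipeline
#    (the "T" part), much as dlt's dbt runner triggers dbt today.
project_path = Path.cwd()  # assumes this script is run from the root of a Kedro project
bootstrap_project(project_path)
with KedroSession.create(project_path=project_path) as session:
    session.run(pipeline_name="transform")  # assumes a registered pipeline named "transform"
```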
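
For alternative 2, purely as a strawman to anchor the discussion, one possible reading of "dlt dataset in Kedro" is a catalog dataset whose save method hands data to a dlt pipeline. All names below are hypothetical, and the multiple-tables complication noted above still applies:

```python
from typing import Any

import dlt
from kedro.io import AbstractDataset


class DltDataset(AbstractDataset):
    """Hypothetical catalog dataset whose save() hands data to a dlt pipeline.

    Note the complicating factor from above: a single save may fan out into
    multiple destination tables (e.g. for nested data), so "one dataset = one
    table" does not necessarily hold.
    """

    def __init__(self, pipeline_name: str, destination: str, dataset_name: str, table_name: str):
        self._pipeline = dlt.pipeline(
            pipeline_name=pipeline_name, destination=destination, dataset_name=dataset_name
        )
        self._table_name = table_name

    def _save(self, data: Any) -> None:
        self._pipeline.run(data, table_name=self._table_name)

    def _load(self) -> Any:
        # Reading back would need something like the ibis/dataframe access
        # discussed under alternative 1.
        raise NotImplementedError

    def _describe(self) -> dict:
        return {"pipeline_name": self._pipeline.pipeline_name, "table_name": self._table_name}
```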
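
For alternative 3, here is a minimal sketch (not from the issue) of a dlt pipeline wrapped in a Kedro node; it also illustrates the concern above, since the node body is essentially just dlt.pipeline(...).run(...). The inline data and names are illustrative:

```python
import dlt
from kedro.pipeline import node, pipeline


def load_raw_users() -> str:
    """Kedro node wrapping a dlt load: the body is essentially dlt.pipeline(...).run(...)."""
    data = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]  # stand-in for a real dlt source
    p = dlt.pipeline(pipeline_name="raw_users", destination="duckdb", dataset_name="raw")
    info = p.run(data, table_name="users")
    return str(info)  # expose the load info to downstream nodes / logging


ingestion = pipeline([node(load_raw_users, inputs=None, outputs="dlt_load_info")])
```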

Rollout strategy

TBD

Future iterations

TBD

Footnotes

  1. In an initial call with @AstrakhantsevaAA, @VioletM, and @astrojuanlu, I asked whether dlt + dbt users were generally happy with the integration, or if it was something thrown together early on that they would like to redesign; it seems to be the former.
