From 041f06a9a019602c32ff7e3bf2d42b0e391a7323 Mon Sep 17 00:00:00 2001
From: Sean Lopp
Date: Thu, 8 Aug 2024 13:11:13 -0600
Subject: [PATCH] Update README.md

---
 README.md | 29 ++++++++++++++---------------
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/README.md b/README.md
index 6c2b3143..0dcdf2da 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # Hooli, Inc. Data Engineering
 
-This repository includes Dagster application code developed by the fictional data engineering team at Hooli.
+This repository includes Dagster code developed by the fictional data engineering team at Hooli.
 Another realistic dagster code base is Dagster Lab's own data engineering repository which is publicly shared here: https://github.com/dagster-io/dagster-open-platform
 
 ## Getting Started
@@ -8,7 +8,8 @@ You can clone and run this example locally:
 
 ```
 git clone https://github.com/dagster-io/hooli-data-eng-pipelines
-pip install -e ".[dev]"
+pip install uv
+make dependencies
 make deps
 make manifest
 dagster dev
@@ -16,19 +17,17 @@ dagster dev
 
 ## Code Structure
 
-To understand the structure, start with the file `hooli_data_eng/definitions.py`. This example includes a few key Dagster concepts:
-
-- *Assets*: are used to represent the datasets the Hooli data team manages. This example includes assets generated from dbt, Python, and other sources.
-- *Resources*: represent external systems. This example uses different resources for different environments (DuckDB locally, snowflake + s3 in production). The example also shows how to create custom resources, see `resources/api.py`.
-- *Jobs*: allow us to automate when our assets are updated. This example includes jobs that run on a *Schedule* and *Sensors* that trigger jobs to run when upstream data is ready. Jobs can target assets, see `definitions.py` or they can define imperative operations, see `jobs/watch_s3.py`.
-- *Job Configuration* allows Dagster to parameterize tasks, this example includes a forecasting model with hyper parameters passed as job config.
-- *Partitions and Backfills* allow Dagster to represent partitioned data with no additional code. This example shows how daily partitioned assets can automatically be scheduled daily, and how those same daily partitions can seemlessly roll up into a weekly partitioned asset.
-- The asset `big_orders` in `hooli_data_eng/assets/forecasting/__init__.py` uses Spark. Locally, Spark is run through a local PySpark process. In production, a `resources/databricks.py` Databricks *Step Launcher* is used to dynamically create a Spark cluster for processing.
-- The asset `model_nb` is an example of *Dagstermill* which lets you run Jupyter Notebooks as assets, including notebooks that should take upstream assets as inputs.
-- *Sensors* are used to run jobs based on external events. See for example `hooli_data_eng/jobs/watch_s3.py`.
-- *Declarative Scheduling* is used to keep certain marketing and analytics assets up to date based on a stakeholder SLA using freshness policies and auto materialization policies. Examples include `hooli_data_eng/assets/marketing/__init__.py` and `dbt_project/models/ANALYTICS/weekly_order_summary.sql`.
-- *Retries* are enabled for both runs and assets, making the pipeline robust to occassional flakiness. See `hooli_data_eng/definitions.py` for examples of retries on jobs, and `hooli_data_eng/assets/marketing/__init__.py` for an example of a more complex retry policy on an asset including backoff and jitter. Flakiness is generated in `hooli_data_eng/resources/api.py`.
-- *Alerts* are enabled through Dagster Clould alert policies based on job tags. A custom alert is also specified to notify when assets with SLAs are later than expected. See `hooli_data_eng/assets/delayed_asset_alerts.py`.
+The team at Hooli uses multiple projects so that different teams can take ownership of their own data products. You will find these projects in separate folders, e.g. `hooli_data_eng` and `hooli-demo-assets`. The team also uses Dagster to manage a large dbt project, which is colocated in this repo in the `dbt_project` folder.
+
+Each of these projects is deployed to Dagster+, resulting in a single pane of glass across teams, with RBAC enforcing who can launch runs of different assets. The deployment is managed by the workflows in `.github/workflows`, and deploys to a Dagster+ Hybrid Kubernetes setup. The `dagster_cloud.yaml` file is used to configure the projects.
+
+To see this in action, check out [this video](https://www.youtube.com/watch?v=qiOytuAjdbE&feature=youtu.be).
+
+> Dev Note: To run multiple projects locally, you will want to use the `workspaces.yml` file. By default, `dagster dev` runs the `hooli_data_eng` project, which accounts for the majority of the examples in this repo.
+
+## Main Features
+
+TBD.
 
 ## Assets spanning Multiple Code Locations
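The `dagster_cloud.yaml` file mentioned in the patch's new Code Structure section is what tells Dagster+ which code locations to deploy. As a rough illustration of the idea, here is a minimal, hypothetical sketch of such a file with two code locations named after the project folders the patch mentions; the key names follow the documented Dagster+ schema, but the package names, paths, and build directories are assumptions, and the repository's actual file may differ:

```
# dagster_cloud.yaml -- illustrative sketch only, not the repository's actual file
locations:
  # Each entry is deployed as its own code location; Dagster+ presents them
  # together in one UI while RBAC governs who can launch runs in each.
  - location_name: hooli_data_eng
    code_source:
      package_name: hooli_data_eng        # assumed Python package name
    build:
      directory: ./                       # assumed build context

  - location_name: hooli_demo_assets
    code_source:
      package_name: hooli_demo_assets     # assumed package name for the demo project
    build:
      directory: ./hooli-demo-assets      # assumed folder, matching the name in the patch
```

For local development, the `workspaces.yml` file called out in the Dev Note plays the analogous role for `dagster dev`, pointing it at more than one of these projects at once.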