Update README.md #111

Merged · 1 commit · Aug 8, 2024
29 changes: 14 additions & 15 deletions README.md
@@ -1,34 +1,33 @@
# Hooli, Inc. Data Engineering

This repository includes Dagster application code developed by the fictional data engineering team at Hooli.
This repository includes Dagster code developed by the fictional data engineering team at Hooli. Another realistic Dagster code base is Dagster Labs' own data engineering repository, which is publicly shared here: https://github.com/dagster-io/dagster-open-platform

## Getting Started

You can clone and run this example locally:

```
git clone https://github.com/dagster-io/hooli-data-eng-pipelines
pip install -e ".[dev]"
pip install uv
make dependencies
make deps
make manifest
dagster dev
```

## Code Structure

To understand the structure, start with the file `hooli_data_eng/definitions.py`. This example includes a few key Dagster concepts:

- *Assets* are used to represent the datasets the Hooli data team manages. This example includes assets generated from dbt, Python, and other sources (a minimal asset-and-resource sketch appears after this list).
- *Resources* represent external systems. This example uses different resources for different environments (DuckDB locally, Snowflake + S3 in production). The example also shows how to create custom resources; see `resources/api.py`.
- *Jobs* allow us to automate when our assets are updated. This example includes jobs that run on a *Schedule* and *Sensors* that trigger jobs when upstream data is ready. Jobs can target assets (see `definitions.py`) or define imperative operations (see `jobs/watch_s3.py`).
- *Job configuration* allows Dagster to parameterize tasks; this example includes a forecasting model whose hyperparameters are passed as job config (see the config sketch after this list).
- *Partitions and backfills* allow Dagster to represent partitioned data with no additional code. This example shows how daily partitioned assets can automatically be scheduled daily, and how those same daily partitions can seamlessly roll up into a weekly partitioned asset (see the partitioned-asset sketch after this list).
- The asset `big_orders` in `hooli_data_eng/assets/forecasting/__init__.py` uses Spark. Locally, Spark runs as a local PySpark process. In production, a Databricks *Step Launcher* (`resources/databricks.py`) is used to dynamically create a Spark cluster for processing.
- The asset `model_nb` is an example of *Dagstermill*, which lets you run Jupyter notebooks as assets, including notebooks that take upstream assets as inputs.
- *Sensors* are used to run jobs based on external events. See, for example, `hooli_data_eng/jobs/watch_s3.py`.
- *Declarative scheduling* is used to keep certain marketing and analytics assets up to date based on a stakeholder SLA, using freshness policies and auto-materialization policies. Examples include `hooli_data_eng/assets/marketing/__init__.py` and `dbt_project/models/ANALYTICS/weekly_order_summary.sql`.
- *Retries* are enabled for both runs and assets, making the pipeline robust to occasional flakiness. See `hooli_data_eng/definitions.py` for examples of retries on jobs, and `hooli_data_eng/assets/marketing/__init__.py` for a more complex retry policy on an asset, including backoff and jitter (sketched after this list). Flakiness is generated in `hooli_data_eng/resources/api.py`.
- *Alerts* are enabled through Dagster Cloud alert policies based on job tags. A custom alert is also specified to notify when assets with SLAs are later than expected. See `hooli_data_eng/assets/delayed_asset_alerts.py`.
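
As a rough illustration of the *Assets* and *Resources* concepts above, here is a minimal sketch. The `RawDataAPI` resource and asset names are hypothetical, not code from this repo; compare `resources/api.py` for the real custom resource:

```
from dagster import ConfigurableResource, Definitions, asset


# Hypothetical resource standing in for an external system.
class RawDataAPI(ConfigurableResource):
    base_url: str

    def fetch_orders(self) -> list[dict]:
        # A real resource would call the external API here.
        return [{"order_id": 1, "amount": 10.0}]


# An asset materialized from that resource; Dagster injects the resource by parameter name.
@asset
def raw_orders(api: RawDataAPI) -> list[dict]:
    return api.fetch_orders()


defs = Definitions(
    assets=[raw_orders],
    resources={"api": RawDataAPI(base_url="https://example.com")},
)
```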
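
Similarly, a hedged sketch of job configuration and a schedule; the `ForecastConfig` fields and cron string are illustrative, not the repo's actual values:

```
from dagster import Config, ScheduleDefinition, asset, define_asset_job


# Hypothetical run config, in the spirit of the forecasting hyperparameters mentioned above.
class ForecastConfig(Config):
    learning_rate: float = 0.01
    num_iterations: int = 100


@asset
def order_forecast_model(config: ForecastConfig) -> dict:
    # Train a model using the supplied hyperparameters (stubbed out here).
    return {"learning_rate": config.learning_rate, "iterations": config.num_iterations}


# A job that targets the asset, refreshed on a daily schedule.
forecast_job = define_asset_job("forecast_job", selection="order_forecast_model")
forecast_schedule = ScheduleDefinition(job=forecast_job, cron_schedule="0 6 * * *")
```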
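
And a sketch of a daily-partitioned asset with a retry policy using backoff and jitter; the asset itself is hypothetical, but it mirrors the partitioning and retry ideas described above:

```
from dagster import (
    AssetExecutionContext,
    Backoff,
    DailyPartitionsDefinition,
    Jitter,
    RetryPolicy,
    asset,
)


@asset(
    partitions_def=DailyPartitionsDefinition(start_date="2024-01-01"),
    retry_policy=RetryPolicy(
        max_retries=3,
        delay=1,  # seconds before the first retry
        backoff=Backoff.EXPONENTIAL,
        jitter=Jitter.PLUS_MINUS,
    ),
)
def daily_orders(context: AssetExecutionContext) -> None:
    # context.partition_key identifies the day being materialized.
    context.log.info(f"Materializing partition {context.partition_key}")
```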

The team at Hooli uses multiple projects to allow different teams to take ownership of their own data products. You will find these projects in separate folders, e.g. `hooli_data_eng` and `hooli-demo-assets`. The team also uses Dagster to manage a large dbt project, which is colocated in this repo in the `dbt_project` folder.

Each of these projects is deployed to Dagster+, resulting in a single pane of glass across teams, with RBAC enforcing who can launch runs of different assets. The deployment is managed by the workflows in `.github/workflows` and targets a Dagster+ Hybrid Kubernetes setup. The `dagster_cloud.yaml` file configures the projects.

To see this in action, check out [this video](https://www.youtube.com/watch?v=qiOytuAjdbE&feature=youtu.be).

> Dev Note: to run multiple projects locally, you will want to use the `workspaces.yml` file. By default, `dagster dev` will run the `hooli_data_eng` project, which accounts for the majority of the examples in this repo.

## Main Features

TBD.

## Assets spanning Multiple Code Locations
