[Feature] Auto data generation and fixing data quality issues (#24)
This PR removes the seeds from the repo, opting for a flow that installs the [`jaffle-shop-generator`](https://github.com/dbt-labs/jaffle-shop-generator) in the project and generates data that way. This allows more data to be generated easily for more interesting Jaffle Shop projects with a longer lifespan, and future proofs the repo for updates to `jafgen` (if `jafgen` adds new entities and interactions, Jaffle Shop can build on them instead of being stuck with the old static seeds). This PR also fixes several data quality issues.
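
As a rough sketch of the flow described above (not part of the commit itself), the sequence can be wrapped in a shell function. The function name `jaffle_refresh` is hypothetical. The commands are the ones this commit adds to the README and `Taskfile.yml`, and the sketch assumes an activated virtual environment with `requirements.txt` installed, a configured dbt Cloud CLI, and that `jafgen` writes its output to `jaffle-data`.

```shell
# Hypothetical helper: generate, seed, clean up, then build.
# Assumes `jafgen` and `dbt` are on PATH (activated virtual environment).
jaffle_refresh() {
  jafgen &&              # generate synthetic data with the default settings
  dbt deps &&            # install the packages configured in packages.yml
  dbt seed &&            # seed the generated data into the warehouse
  rm -rf jaffle-data &&  # delete the generated data so later builds don't re-seed it
  dbt build              # build and test the project
}
```

Chaining with `&&` stops at the first failing step, so the generated data is only deleted after a successful seed.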
gwenwindflower authored Apr 9, 2024
1 parent 9e622f7 commit a637a82
Showing 18 changed files with 121 additions and 156,098 deletions.
9 changes: 0 additions & 9 deletions .pre-commit-config.yaml
@@ -12,12 +12,3 @@ repos:
- id: ruff
args: [--fix, --exit-non-zero-on-fix]
- id: ruff-format
- repo: https://github.com/sqlfluff/sqlfluff
rev: "3.0.3"
hooks:
- id: sqlfluff-fix
additional_dependencies:
[
"dbt-metricflow[snowflake,bigquery,postgres]~=0.6.0",
"sqlfluff-templater-dbt~=3.0.3",
]
59 changes: 41 additions & 18 deletions README.md
@@ -12,7 +12,7 @@ This is a sandbox project for exploring the basic functionality and latest featu
![Click 'Use this template'](/.github/static/use-template.gif)
</details>

2. Follow the steps to create a new repository.
2. Follow the steps to create a new repository. You should choose the option to only copy the `main` branch.

## Platform setup

@@ -27,51 +27,74 @@ This is a sandbox project for exploring the basic functionality and latest featu
### dbt Cloud CLI (if you prefer to work locally)

> [!NOTE]
> If you'd like to use the dbt Cloud CLI, but are a little intimidated by the terminal, we've included a task runner called, fittingly, `task`. It's a simple way to run the commands you need to get started with dbt. You can install it by following the instructions [here](https://taskfile.dev/#/installation). We'll call out the `task` based alternative to each command below. You can also run `task setup` to perform all the setup commands at once.
> If you'd like to use the dbt Cloud CLI, but are a little intimidated by the terminal, we've included config for a task runner called, fittingly, `task`. It's a simple way to run the commands you need to get started with dbt. You can install it by following the instructions [here](https://taskfile.dev/#/installation). We'll call out the `task` based alternative to each command below.
1. Run `git clone [new repo name]` to clone your new repo to your local machine.
1. Run `git clone [new repo name]` (or `gh repo clone [repo owner]/[new repo name]` if you prefer GitHub's excellent CLI) to clone your new repo to your local machine.

2. [Follow Step 1 on this page](https://cloud.getdbt.com/cloud-cli) to install the dbt Cloud CLI, we'll do the other steps in a second.
2. [Follow the steps on this page](https://cloud.getdbt.com/cloud-cli) to install and set up a dbt Cloud connection with the dbt Cloud CLI.

3. Set up a virtual environment and activate it. I like to call my virtual environment `.venv` and add it to my `.gitignore` file (we've already done this if you name your virtual environment '.venv') so that I don't accidentally commit it to the repository, but you can call it whatever you want.
> [!TIP]
> If you're using `task`, you can run `task setup` to skip all the rest of this and run all the setup commands in one easy command. I recommend it!
3. Set up a virtual environment and activate it. I like to call my virtual environment `.venv` and add it to my `.gitignore` file (we've already done this if you name your virtual environment '`.venv`') so that I don't accidentally commit it to the repository, but you can call it whatever you want, just make sure you `.gitignore` it.

```shell
python3 -m venv .venv # create a virtual environment
source .venv/bin/activate # activate the virtual environment
OR
task venv # create a virtual environment

source .venv/bin/activate # activate the virtual environment
```

4. Install the project's requirements into your virtual environment.
```shell
python3 -m pip install --upgrade pip # upgrade pip (always a good idea!)
python3 -m pip install -r requirements.txt # install the project's requirements
OR
task install # install the project's requirements
```

5. [Follow steps 2 and 3 on this page](https://cloud.getdbt.com/cloud-cli) to setup dbt Cloud CLI's connection to dbt Cloud, only if you haven't already done so (we handled step 1 above and will do step 4 together next).

6. Double check that your `dbt_project.yml` is set up correctly by running `dbt list`. You should get back a list of models and tests in your project.
5. Double check that your `dbt_project.yml` is set up correctly by running `dbt list`. You should get back a list of models and tests in your project.

## Project setup

Once your development platform of choice is set up, use the following steps to get the project ready for whatever you'd like to do with it.
1. Run `dbt build` to load the sample data into your raw schema, build your models, and test your project.
1. Run `task gen` to generate a year of synthetic data for the Jaffle Shop.
2. Run `task build` to seed the generated data into your warehouse and build the project.
2. Delete the `jaffle-data` directory now that the raw data is loaded into the warehouse. It will be loaded into a `raw_jaffle_shop` schema in your warehouse, which both the `dev` and `prod` targets are set up to use. Take a look at the `generate_schema_name` macro in the `macros` directory if you're curious how this is done.
3. Run `task clean` to delete the generated data to avoid re-seeding the same data repeatedly for no reason.
#### OR
1. Run `task build`.
1. In your activated virtual environment with dependencies installed, run `jafgen` to generate a year of synthetic data for the Jaffle Shop, no arguments are necessary for the defaults.
2. Run `dbt deps` to install the dbt packages configured in the `packages.yml` file.
3. Run `dbt seed` to seed the generated data into your warehouse.
## Pre-commit and linting with SQLFluff
4. Delete the generated data to avoid re-seeding the same data repeatedly for no reason.
```shell
rm -rf jaffle-data
```
5. Run `dbt build` to build and test the project, make sure you deleted the generated data first or you'll be re-seeding the same data.

This project uses a tool called [pre-commit](https://pre-commit.com/) to automatically run a suite of processes on your code, like linters and formatters, when you commit. If it finds an issue and updates a file, you'll need to stage the changes and commit them again (the first commit will not have gone through because pre-commit found and fixed an issue). The outcome of this is that your code will be more consistent automatically, and everybody's changes will be running through the same set of processes. We recommend it for any project. You can see the configuration for pre-commit in the `.pre-commit-config.yaml` file. You can run the checks manually with `pre-commit run --all-files` to see what it does without making a commit.
## Pre-commit and SQLFluff

The most important pre-commit hook that runs in this project is [SQLFluff](https://sqlfluff.com/), which will lint your SQL code. It's configured with the `.sqlfluff` file in the root of the project. You can also run this manually, either to lint your code or to fix it automatically (which also functions loosely as a fairly relaxed formatter), with `pre-commit run sqlfluff-lint` or `pre-commit run sqlfluff-fix` respectively, but if you don't, it will still run whenever you commit to ensure the committed code is consistent.
There's an optional tool included with the project called `pre-commit`.
> [!NOTE]
> SQLFluff's dbt templater relies on dbt Core, which conflicts with dbt Cloud CLI for the time being. Thankfully, pre-commit installs its hooks into isolated environments, so you can still use SQLFluff with dbt Cloud CLI via pre-commit, but you can't call SQLFluff directly. The dbt Labs team is actively working on a solution for this issue.
[pre-commit](https://pre-commit.com/) automatically runs a suite of processes on your code, like linters and formatters, when you commit. If it finds an issue and updates a file, you'll need to stage the changes and commit them again (the first commit will not have gone through because pre-commit found and fixed an issue). The outcome of this is that your code will be more consistent automatically, and everybody's changes will be running through the same set of processes. We recommend it for any project.
You can see the configuration for pre-commit in the `.pre-commit-config.yaml` file. It's installed as part of the project's `requirements.txt`, but you'll need to opt-in to using it by running `pre-commit install`. This will install _git hooks_ which run when you commit. You can also run the checks manually with `pre-commit run --all-files` to see what it does without making a commit.

At present the following checks are run:

- `ruff` - an incredibly fast linter and formatter for Python, in case you add any Python models
- `check-yaml` - which validates YAML files
- `end-of-file-fixer` - which ensures all files end with a newline
- `trailing-whitespace` - which trims trailing whitespace from files

At present, the popular SQL linter and formatter SQLFluff doesn't play nicely with the dbt Cloud CLI, so we've omitted it from this project _for now_. If you'd like auto-formatting and linting for SQL, check out the dbt Cloud IDE!
We have kept a `.sqlfluff` config file to show what that looks like, and to future proof the repo for when the Cloud CLI supports linting and formatting.
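
The README's warning about deleting the generated data before running `dbt build` could be enforced with a small guard. This is only a sketch, not something the commit includes: the function name `build_if_clean` is hypothetical, and it assumes the generated data lives in `jaffle-data` (the directory listed in `seed-paths`) and that `dbt` is on your PATH.

```shell
# Hypothetical guard: refuse to build while generated seed data is still on disk.
build_if_clean() {
  if [ -d jaffle-data ]; then
    echo "jaffle-data still exists; delete it (rm -rf jaffle-data) before building" >&2
    return 1
  fi
  dbt build
}
```

Calling `build_if_clean` in place of `dbt build` fails fast with a message instead of silently re-seeding the same data.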
19 changes: 16 additions & 3 deletions Taskfile.yml
@@ -4,22 +4,35 @@ tasks:
venv:
cmds:
- python3 -m venv .venv
silent: true

install:
cmds:
- source .venv/bin/activate && python3 -m pip install --upgrade pip
- source .venv/bin/activate && python3 -m pip install -r requirements.txt --progress-bar off
- source .venv/bin/activate && python3 -m pip install --upgrade pip --progress-bar off > /dev/null
- source .venv/bin/activate && python3 -m pip install -r requirements.txt --progress-bar off > /dev/null
silent: true

gen:
cmds:
- source .venv/bin/activate && jafgen
silent: true

build:
cmds:
- dbt deps
- dbt seed
- rm -rf jaffle-data
- dbt run
- dbt test

clean:
cmds:
- rm -rf jaffle-data
silent: true

setup:
cmds:
- task: venv
- task: install
- task: gen
- task: build
- task: clean
2 changes: 1 addition & 1 deletion dbt_project.yml
@@ -14,7 +14,7 @@ profile: default
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["data-tests"]
seed-paths: ["jaffle-data"]
seed-paths: ["jaffle-data", "seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
