Refactor: run DBT from devcontainer #3515

Merged
merged 5 commits into from
Nov 20, 2024
58 changes: 41 additions & 17 deletions .devcontainer/Dockerfile
@@ -1,22 +1,46 @@
# See here for image contents: https://github.com/microsoft/vscode-dev-containers/tree/v0.177.0/containers/python-3/.devcontainer/base.Dockerfile
FROM python:3.9

# [Choice] Python version: 3, 3.9, 3.8, 3.7, 3.6
ARG VARIANT="3.9"
FROM mcr.microsoft.com/vscode/devcontainers/python:0-${VARIANT}
LABEL org.opencontainers.image.source=https://github.com/cal-itp/data-infra

# [Option] Install Node.js
ARG INSTALL_NODE="true"
ARG NODE_VERSION="lts/*"
RUN if [ "${INSTALL_NODE}" = "true" ]; then su vscode -c "umask 0002 && . /usr/local/share/nvm/nvm.sh && nvm install ${NODE_VERSION} 2>&1"; fi
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    USER=calitp

# [Optional] If your pip requirements rarely change, uncomment this section to add them to the image.
# COPY requirements.txt /tmp/pip-tmp/
# RUN pip3 --disable-pip-version-check --no-cache-dir install -r /tmp/pip-tmp/requirements.txt \
# && rm -rf /tmp/pip-tmp
# install gcloud CLI
RUN apt-get update && apt-get install -y apt-transport-https ca-certificates curl gnupg
RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \
    curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg && \
    apt-get update -y && apt-get install -y google-cloud-cli

# [Optional] Uncomment this section to install additional OS packages.
# RUN apt-get update && export DEBIAN_FRONTEND=noninteractive \
# && apt-get -y install --no-install-recommends <your-package-list-here>
# install pygraphviz deps
RUN apt-get update && apt-get install -y libgdal-dev libgraphviz-dev graphviz-dev

# [Optional] Uncomment this line to install global node packages.
# RUN su vscode -c "source /usr/local/share/nvm/nvm.sh && npm install -g <your-package-here>" 2>&1
# create and switch to non-root user for devcontainer
RUN useradd --create-home --shell /bin/bash $USER && \
    chown -R $USER:$USER /home/$USER
USER $USER

# setup warehouse deps
WORKDIR /home/$USER/app/warehouse
# pip install location for non-root
ENV PATH="$PATH:/home/$USER/.local/bin"
# upgrade pip, install poetry
RUN python -m pip install --upgrade pip && pip install poetry

# copy source files
COPY ./warehouse/pyproject.toml pyproject.toml
COPY ./warehouse/poetry.lock poetry.lock
COPY ./warehouse/dbt_project.yml dbt_project.yml
COPY ./warehouse/packages.yml packages.yml

# install warehouse deps
RUN poetry install
RUN poetry run dbt deps

# install dev deps
RUN pip install black memray pre-commit

# switch back to app root
WORKDIR /home/$USER/app
# CMD for devcontainers
CMD ["sleep", "infinity"]
29 changes: 29 additions & 0 deletions .devcontainer/compose.yml
@@ -0,0 +1,29 @@
services:
  dev:
    build:
      context: ..
      dockerfile: .devcontainer/Dockerfile
    image: data_infra:dev
    entrypoint: sleep infinity
    environment:
      - DBT_PROFILES_DIR=/home/calitp/.dbt
      - GOOGLE_APPLICATION_CREDENTIALS=/home/calitp/.config/gcloud/application_default_credentials.json
    volumes:
      - ..:/home/calitp/app
      - ~/.dbt:/home/calitp/.dbt
      - ~/.config/gcloud:/home/calitp/.config/gcloud

  dbt:
    build:
      context: ..
      dockerfile: .devcontainer/Dockerfile
    image: data_infra:dev
    entrypoint: ["poetry", "run", "dbt"]
    environment:
      - DBT_PROFILES_DIR=/home/calitp/.dbt
      - GOOGLE_APPLICATION_CREDENTIALS=/home/calitp/.config/gcloud/application_default_credentials.json
    volumes:
      - ..:/home/calitp/app
      - ~/.dbt:/home/calitp/.dbt
      - ~/.config/gcloud:/home/calitp/.config/gcloud
    working_dir: /home/calitp/app/warehouse
72 changes: 25 additions & 47 deletions .devcontainer/devcontainer.json
@@ -1,59 +1,37 @@
// For format details, see https://aka.ms/devcontainer.json. For config options, see the README at:
// https://github.com/microsoft/vscode-dev-containers/tree/v0.177.0/containers/python-3
{
"name": "Python 3",
"build": {
"dockerfile": "Dockerfile",
"context": "..",
"args": {
// Update 'VARIANT' to pick a Python version: 3, 3.6, 3.7, 3.8, 3.9
"VARIANT": "3.8",
// Options
"INSTALL_NODE": "false",
"NODE_VERSION": "lts/*"
}
},

// Set *default* container specific settings.json values on container create.
"settings": {
"terminal.integrated.shell.linux": "/bin/bash",
"python.pythonPath": "/usr/local/bin/python",
"python.languageServer": "Pylance",
"python.linting.enabled": true,
"python.linting.pylintEnabled": true,
"python.formatting.autopep8Path": "/usr/local/py-utils/bin/autopep8",
"python.formatting.blackPath": "/usr/local/py-utils/bin/black",
"python.formatting.yapfPath": "/usr/local/py-utils/bin/yapf",
"python.linting.banditPath": "/usr/local/py-utils/bin/bandit",
"python.linting.flake8Path": "/usr/local/py-utils/bin/flake8",
"python.linting.mypyPath": "/usr/local/py-utils/bin/mypy",
"python.linting.pycodestylePath": "/usr/local/py-utils/bin/pycodestyle",
"python.linting.pydocstylePath": "/usr/local/py-utils/bin/pydocstyle",
"python.linting.pylintPath": "/usr/local/py-utils/bin/pylint",
"name": "cal-itp/data-infra",
"dockerComposeFile": ["./compose.yml"],
"service": "dev",
"runServices": ["dev"],
"workspaceFolder": "/home/calitp/app",
"postAttachCommand": ["/bin/bash", ".devcontainer/postAttach.sh"],
"customizations": {
"vscode": {
"settings": {
"terminal.integrated.defaultProfile.linux": "bash",
"terminal.integrated.profiles.linux": {
"bash": {
"path": "/bin/bash"
}
},
"editor.formatOnSave": true,
"files.trimTrailingWhitespace": true,
"files.insertFinalNewline": true
},
"files.insertFinalNewline": true,
"files.encoding": "utf8",
"files.eol": "\n",
"python.languageServer": "Pylance"
},

// Add the IDs of extensions you want installed when the container is created.
"extensions": [
// Add the IDs of extensions you want installed when the container is created.
"extensions": [
"ms-python.python",
"ms-python.vscode-pylance",
"davidanson.vscode-markdownlint",
"bierner.markdown-mermaid",
"mhutchie.git-graph"
],

// Use 'forwardPorts' to make a list of ports inside the container available locally.
// "forwardPorts": [],

// Use 'postCreateCommand' to run commands after the container is created.
// "postCreateCommand": "pip3 install --user -r requirements.txt",

// Comment out connect as root instead. More info: https://aka.ms/vscode-remote/containers/non-root.
"remoteUser": "vscode",
"portsAttributes": {
"8000": {
"label": "documentation website"
}
]
}
}
}
21 changes: 21 additions & 0 deletions .devcontainer/postAttach.sh
@@ -0,0 +1,21 @@
#!/usr/bin/env bash
set -u

# workaround VS Code devcontainer .git mounting issue
git config --global --add safe.directory /home/calitp/app

# initialize hook environments
pre-commit install --install-hooks --overwrite

cd warehouse/

if [ ! -f ~/.dbt/profiles.yml ]; then
    poetry run dbt init
fi

poetry run dbt debug

if [[ $? != 0 ]]; then
    gcloud init
    gcloud auth application-default login
fi
31 changes: 0 additions & 31 deletions warehouse/Dockerfile.local

This file was deleted.

81 changes: 60 additions & 21 deletions warehouse/README.md
@@ -27,14 +27,14 @@ are already configured/installed.

3. Execute `poetry install` to create a virtual environment and install requirements.

> [!NOTE]
> If you run into an error complaining about graphviz (e.g. `fatal error: 'graphviz/cgraph.h' file not found`); see [pygraphviz#398](https://github.com/pygraphviz/pygraphviz/issues/398).
>
> ```bash
> export CFLAGS="-I $(brew --prefix graphviz)/include"
> export LDFLAGS="-L $(brew --prefix graphviz)/lib"
> poetry install
> ```
> [!NOTE]
> If you run into an error complaining about graphviz (e.g. `fatal error: 'graphviz/cgraph.h' file not found`); see [pygraphviz#398](https://github.com/pygraphviz/pygraphviz/issues/398).
>
> ```bash
> export CFLAGS="-I $(brew --prefix graphviz)/include"
> export LDFLAGS="-L $(brew --prefix graphviz)/lib"
> poetry install
> ```

4. Execute `poetry run dbt deps` to install the dbt dependencies defined in `packages.yml` (such as `dbt_utils`).

@@ -59,15 +59,15 @@ are already configured/installed.

See [the dbt docs on profiles.yml](https://docs.getdbt.com/dbt-cli/configure-your-profile) for more background on this file.

> [!NOTE]
> This default profile template will set a maximum bytes billed of 2 TB; no models should fail with the default lookbacks in our development environment, even with a full refresh. You can override this limit during the init, or change it later by calling init again and choosing to overwrite (or editing the profiles.yml directly).
>
> [!WARNING]
> If you receive a warning similar to the following, do **NOT** overwrite the file. This is a sign that you do not have a `DBT_PROFILES_DIR` variable available in your environment and need to address that first (see step 5).
>
> ```text
> The profile calitp_warehouse already exists in /data-infra/warehouse/profiles.yml. Continue and overwrite it? [y/N]:
> ```
> [!NOTE]
> This default profile template will set a maximum bytes billed of 2 TB; no models should fail with the default lookbacks in our development environment, even with a full refresh. You can override this limit during the init, or change it later by calling init again and choosing to overwrite (or editing the profiles.yml directly).
>
> [!WARNING]
> If you receive a warning similar to the following, do **NOT** overwrite the file. This is a sign that you do not have a `DBT_PROFILES_DIR` variable available in your environment and need to address that first (see step 5).
>
> ```text
> The profile calitp_warehouse already exists in /data-infra/warehouse/profiles.yml. Continue and overwrite it? [y/N]:
> ```

7. Check whether `~/.dbt/profiles.yml` was successfully created, e.g. `cat ~/.dbt/profiles.yml`. If you encountered an error, you may create it by hand and fill it with the same content - this will point your models at BigQuery datasets (schemas) in the `cal-itp-data-infra-staging` project that are prefixed with your name, where operations on them will not impact production data:

@@ -147,10 +147,10 @@ Once you have performed the setup above, you are good to go run
2. You will need to re-run seeds if new seeds are added, or existing ones are changed.
2. `poetry run dbt run`
1. Will run all the models, i.e. execute SQL in the warehouse.
2. In the future, you can specify [selections](https://docs.getdbt.com/reference/node-selection/syntax) (via the `-s` or `--select` flags) to run only a subset of models, otherwise this will run *all* the tables.
2. In the future, you can specify [selections](https://docs.getdbt.com/reference/node-selection/syntax) (via the `-s` or `--select` flags) to run only a subset of models, otherwise this will run _all_ the tables.
3. By default, your very first `run` is a [full refresh](https://docs.getdbt.com/reference/commands/run#refresh-incremental-models) but you'll need to pass the `--full-refresh` flag in the future if you want to change the schema of incremental tables, or "backfill" existing rows with new logic.

> [!NOTE]
> [!NOTE]
> In general, it's a good idea to run `seed` and `run --full-refresh` if you think your local environment is substantially outdated (for example, if you haven't worked on dbt models in a few weeks but want to create or modify a model). We have macros in the project that prevent a non-production "full refresh" from actually processing all possible data.
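
The selection and full-refresh flags described above can be sketched as a pair of small shell helpers. This is only an illustration: the function names and `my_model` are hypothetical and not part of this repository, and `DRY_RUN` is an invented convenience switch that prints the command instead of running it.

```bash
# Hypothetical helper: wraps `poetry run dbt`. With DRY_RUN=1 it only prints
# the command, so a selection can be checked before anything hits the warehouse.
dbt_cmd() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "poetry run dbt $*"
  else
    poetry run dbt "$@"
  fi
}

# Run one model plus everything upstream of it (`+` is dbt's graph operator).
# The model name is supplied by the caller; `my_model` below is a placeholder.
run_with_parents() {
  dbt_cmd run --select "+$1"
}

# Rebuild a single incremental model from scratch, e.g. after a schema change.
full_refresh_model() {
  dbt_cmd run --select "$1" --full-refresh
}
```

For instance, `DRY_RUN=1 run_with_parents my_model` would print the command it is about to run, while `full_refresh_model my_model` would execute it.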

Some additional helpful commands:
@@ -177,10 +177,10 @@ If this is your first time using the terminal, we recommend reading "[Learning t

You can enable [displaying hidden folders/files in macOS Finder](https://www.macworld.com/article/671158/how-to-show-hidden-files-on-a-mac.html) but generally, we recommend using the terminal when possible for editing these files. Generally, `nano ~/.dbt/profiles.yml` will be the easiest method for editing your personal profiles file. `nano` is a simple terminal-based text editor; you use the arrows keys to navigate and the hotkeys displayed at the bottom to save and exit. Reading an [online tutorial for using `nano`](https://www.howtogeek.com/42980/the-beginners-guide-to-nano-the-linux-command-line-text-editor/) may be useful if you haven't used a terminal-based editor before.

> [!NOTE]
> [!NOTE]
> These instructions assume you are on macOS, but are largely similar for other operating systems. Most \*nix OSes will have a package manager that you should use instead of Homebrew.
>
> [!NOTE]
> [!NOTE]
> If you get `Operation not permitted` when attempting to use the terminal, you may need to [fix your terminal permissions](https://osxdaily.com/2018/10/09/fix-operation-not-permitted-terminal-error-macos/)

### Install Homebrew (if you haven't)
@@ -303,6 +303,45 @@ and the cal-itp-data-infra-staging project's default service account (`473674835
since the buckets for compiled Python models (`gs://calitp-dbt-python-models` and `gs://test-calitp-dbt-python-models`)
as well as external tables exist in the production project.

## Run with VS Code Dev Containers

This repository comes with a [Dev Containers](https://containers.dev/) configuration that makes it possible to run everything
within VS Code with minimal dependencies, from any operating system.

1. Ensure you have Docker and Docker Compose installed locally.
1. Ensure you have the Dev Containers VS Code extension installed: `ms-vscode-remote.remote-containers`.
1. If you have never run the dbt project before, create the following directories locally:

```console
mkdir ~/.dbt
mkdir -p ~/.config/gcloud
```

1. Open this repository in VS Code
1. When prompted, choose `Reopen in Container`, or open the Command Palette (`Ctrl/Cmd` + `Shift` + `P`) and type `Dev Containers`.
1. If you have never run the dbt project before, you will be guided through the initialization process for dbt and the Google Cloud CLI once the devcontainer has built and opened.
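
The prerequisites in the steps above could be checked with a short preflight script before opening the container. The sketch below is an assumption-laden illustration rather than something shipped in this repository: `ensure_mount_dirs` and `require_tools` are hypothetical names, and the directories are the two host paths that `compose.yml` mounts into the container.

```bash
# Hypothetical preflight helpers; not part of this repository.

# Create the host directories that compose.yml bind-mounts into the container.
# Takes an optional base directory (defaults to $HOME).
ensure_mount_dirs() {
  local home_dir="${1:-$HOME}"
  mkdir -p "$home_dir/.dbt" "$home_dir/.config/gcloud"
}

# Return 0 if every named tool is on PATH; otherwise report the missing
# ones on stderr and return 1.
require_tools() {
  local tool missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A typical call might be `require_tools docker && docker compose version && ensure_mount_dirs`. Creating the directories up front means they belong to your own user instead of being created by Docker when the mounts are first resolved.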

You can also run any dbt command from your local machine via Docker Compose.

Change into the `.devcontainer/` directory:

```console
cd .devcontainer/
```

Then use `docker compose run` with a `dbt <command>`:

```console
docker compose run dbt <command>
```

E.g.

```console
docker compose run dbt debug
```
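
If changing directories each time is inconvenient, a small wrapper along the following lines may help. It is a sketch under stated assumptions: `dcdbt`, `COMPOSE_FILE_PATH`, and `DRY_RUN` are hypothetical names introduced here, and the function assumes it is called from the repository root. It relies on the `-f` option of `docker compose` to point at the compose file and on `--rm` to clean up the one-off container.

```bash
# Hypothetical wrapper, not part of this repository. Assumes it is called from
# the repository root, so the compose file lives at .devcontainer/compose.yml.
dcdbt() {
  local compose_file="${COMPOSE_FILE_PATH:-.devcontainer/compose.yml}"
  # --rm deletes the one-off container once the dbt command exits.
  set -- docker compose -f "$compose_file" run --rm dbt "$@"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the assembled command instead of running it.
    echo "$*"
  else
    "$@"
  fi
}
```

With this in your shell profile, `dcdbt debug` would behave like the `docker compose run dbt debug` example above, and `DRY_RUN=1 dcdbt debug` would only print the underlying command.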

## Testing Warehouse Image Changes

A person with Docker set up locally can build a development version of the underlying warehouse image at any time after making changes to the Dockerfile or its requirements. From the relevant subfolder, run
3 changes: 0 additions & 3 deletions warehouse/dbt.sh

This file was deleted.
