Devcontainer support including Duckdb bootstrapped with minimal synthetic data #57

Open
wants to merge 3 commits into
base: main
44 changes: 44 additions & 0 deletions .devcontainer/devcontainer.json
@@ -0,0 +1,44 @@
// For format details, see https://aka.ms/devcontainer.json. For config options, see the
// README at: https://github.com/devcontainers/templates/tree/main/src/python
{
"name": "Python 3",
// Or use a Dockerfile or Docker Compose file. More info: https://containers.dev/guide/dockerfile
"image": "mcr.microsoft.com/devcontainers/python:3.12",
// Features to add to the dev container. More info: https://containers.dev/features.
// "features": {},
// Use 'forwardPorts' to make a list of ports inside the container available locally.
// "forwardPorts": [],
// Use 'postCreateCommand' to run commands after the container is created.
"postCreateCommand": "bash ./.devcontainer/scripts/postCreate.sh",
"postStartCommand": "bash ./.devcontainer/scripts/postStart.sh",
Collaborator

This postStart script is not needed

Suggested change
"postStartCommand": "bash ./.devcontainer/scripts/postStart.sh",

Collaborator Author

Possibly - and this may be a relic from a fix to get devcontainers working for several projects on my machine (Windows with Docker Desktop using WSL2). If someone can confirm that git does not throw permission errors without that line, please feel free to remove it. It is borrowed from another devcontainer that does a bunch of stuff after starting.

Collaborator

The postStart.sh is needed in order to install the duckdb pieces.
I'm working off that shell script so that these happen after the Setup dbt section

I haven't added fancy logic to detect which OS.
My assumption is that, for GitHub Codespaces, this is good enough for demoing at the symposium.
A longer term solution will be needed.

```bash
# Setup dbt

echo "Setting up duckdb synthetic data directory"
mkdir ./data

echo "Installing Duckdb in Github Codespaces which is Ubuntu based OS"
wget https://github.com/duckdb/duckdb/releases/download/v1.0.0/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip

echo "Initialize the duckdb file and exit duckdb"
/workspaces/dbt-synthea/duckdb ./data/synthea_omop_etl.duckdb -s .quit

echo "Debugging dbt to check the connection to duckdb"
dbt debug
echo "Configure dbt dependent Python packages"
dbt deps
echo "Seed dbt duckdb with data"
dbt seed
echo "Compile and Build the dbt project"
dbt compile
dbt build
```
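
On the OS question, here is a minimal sketch of what that detection might look like, assuming the script keeps downloading the DuckDB CLI from the v1.0.0 release. `duckdb_asset_name` is a hypothetical helper. Only the `linux-amd64` asset name comes from the script above; the other asset names are assumptions that should be checked against the release page.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map an OS/architecture pair (as reported by uname)
# to a DuckDB CLI release asset name.
duckdb_asset_name() {
  local os="$1" arch="$2"
  case "${os}-${arch}" in
    Linux-x86_64)              echo "duckdb_cli-linux-amd64.zip" ;;   # name used in the script above
    Linux-aarch64|Linux-arm64) echo "duckdb_cli-linux-aarch64.zip" ;; # assumed asset name
    Darwin-*)                  echo "duckdb_cli-osx-universal.zip" ;; # assumed asset name
    *)                         return 1 ;;                            # unknown platform
  esac
}

# Resolve the asset for the current machine, or report an unsupported platform.
if asset="$(duckdb_asset_name "$(uname -s)" "$(uname -m)")"; then
  echo "Would download: https://github.com/duckdb/duckdb/releases/download/v1.0.0/${asset}"
else
  echo "Unsupported platform: $(uname -s) $(uname -m)" >&2
fi
```

Keeping the mapping in one function means the `wget` and `unzip` lines only need the resolved name, and an unknown platform is reported with a clear message instead of downloading a binary that cannot run.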

Collaborator

Update: my comment above is obviated by Lawrence's awesome work to reconcile the Python modules, so that `dbt debug` both checks the connection to duckdb and initializes the file specified in the project yaml file.

// "mounts": [],
"customizations": {
"vscode": {
"extensions": [
"davidanson.vscode-markdownlint",
"editorconfig.editorconfig",
"mads-hartmann.bash-ide-vscode",
"mechatroner.rainbow-csv",
"ms-python.black-formatter",
"ms-python.python",
"ms-python.vscode-pylance",
"njpwerner.autodocstring",
"innoverio.vscode-dbt-power-user"
],
Comment on lines +17 to +27
Collaborator

Some of these are unused

Suggested change
"extensions": [
"davidanson.vscode-markdownlint",
"editorconfig.editorconfig",
"mads-hartmann.bash-ide-vscode",
"mechatroner.rainbow-csv",
"ms-python.black-formatter",
"ms-python.python",
"ms-python.vscode-pylance",
"njpwerner.autodocstring",
"innoverio.vscode-dbt-power-user"
],
"extensions": [
"davidanson.vscode-markdownlint",
"mechatroner.rainbow-csv",
"ms-python.python",
"ms-python.vscode-pylance",
"njpwerner.autodocstring",
"innoverio.vscode-dbt-power-user"
],

Collaborator Author

Are you sure you want to get rid of black?

Collaborator

I forgot there were 3 python files!

"settings": {
"python.formatting.provider": "black",
"python.analysis.completeFunctionParens": true,
"[python]": {
"editor.defaultFormatter": "ms-python.black-formatter",
"editor.formatOnSave": true
},
"python.defaultInterpreterPath": "./dbt-env/bin/python3",
"dbt.dbtPythonPathOverride": "./dbt-env/bin/python3"
Comment on lines +35 to +36
Collaborator

can trim down on unnecessary venv configs

Suggested change
"python.defaultInterpreterPath": "./dbt-env/bin/python3",
"dbt.dbtPythonPathOverride": "./dbt-env/bin/python3"

}
}
},
Collaborator

Valid json

Suggested change
},
}

// Configure tool-specific properties.
// "customizations": {},
// Uncomment to connect as root instead. More info: https://aka.ms/dev-containers-non-root.
// "remoteUser": "root"
}
5 changes: 5 additions & 0 deletions .devcontainer/scripts/minimal_requirements.txt
@@ -0,0 +1,5 @@
dbt-core==1.8.2
dbt-duckdb==1.8.1
pre-commit==3.7.1
black
python-dotenv
Comment on lines +1 to +5
Collaborator

I think we don't need a requirements.txt file for this (if we only need to install dbt-duckdb)

Suggested change
dbt-core==1.8.2
dbt-duckdb==1.8.1
pre-commit==3.7.1
black
python-dotenv

Collaborator

I am also unsure of the purpose of the root requirements.txt, but I think that's for a different PR!

Collaborator Author

A devcontainer is for a consistent development experience - and black and pre-commit are the bare minimum and ensure that code contributions meet a minimum standard. We should have sqlfluff included explicitly in there too - I think it gets installed anyway as part of dbt.

I can see that the .pre-commit-config.yaml file is very minimal and can be improved as well, possibly with a subset of what is here -> https://github.com/vvcb/dbt-synthea/blob/vc/databricks/.pre-commit-config.yaml

Typically, one wouldn't need a separate requirements file within the devcontainer and the project requirements.txt should be sufficient. The new PR could address this with a single requirements.txt file for the project with database-specific dependencies and dependencies required only for development (such as black, pre-commit, etc.) specified as optional dependencies. This should allow the user to install the project with pip install dbt_synthea[postgres,dev] for instance.

Development dependencies are going to be different from dependencies required for just running the code.
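
As a rough sketch of the optional-dependency layout described above (an assumption about how it could look, not an agreed design): the pinned versions come from minimal_requirements.txt in this PR, while the extras names, the unpinned `dbt-postgres` and `sqlfluff` entries, the placeholder version, and the scratch directory are illustrative. A real setup would also need a build backend and package discovery settings, which are omitted here.

```bash
#!/usr/bin/env bash
# Write an illustrative pyproject.toml into a scratch directory (not the repo).
scratch_dir="$(mktemp -d)"
cat > "${scratch_dir}/pyproject.toml" <<'EOF'
[project]
name = "dbt_synthea"
version = "0.0.0"                      # placeholder version
dependencies = ["dbt-core==1.8.2"]

[project.optional-dependencies]
duckdb = ["dbt-duckdb==1.8.1"]
postgres = ["dbt-postgres"]            # unpinned: version is an assumption
dev = ["pre-commit==3.7.1", "black", "sqlfluff"]
EOF

# With such a file at the repo root, a contributor could then run, e.g.:
#   pip install ".[postgres,dev]"
cat "${scratch_dir}/pyproject.toml"
```

This keeps runtime and development dependencies in one file while letting each user pick the database adapter they need.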

Collaborator

Thank you for your explanation.

Ah that makes more sense, I trimmed out some of it as it wasn't referenced anywhere else and didn't see why it would be needed ~ it's probably fine to leave it in for now and can have another PR to tighten up precommit actions

Collaborator

can/should this be updated now that #67 is merged in?

40 changes: 40 additions & 0 deletions .devcontainer/scripts/postCreate.sh
@@ -0,0 +1,40 @@
# Set up git
git config --global --add safe.directory /workspaces/dbt-synthea
git config --global init.defaultBranch main

Comment on lines +1 to +4
Collaborator

Happy to be corrected, but I think some of this is unnecessary - but maybe is required to defend against WSL2 instances (as I know git can get funny with Windows' ACLs)? Can leave in but I think not needed!

Regardless, the default branch doesn't need to be changed as the project already exists

Suggested change
# Set up git
git config --global --add safe.directory /workspaces/dbt-synthea
git config --global init.defaultBranch main
# Set up git
git config --global --add safe.directory /workspaces/dbt-synthea
git config --global init.defaultBranch main

Collaborator Author

Agree regarding the default branch. But the `safe.directory` directive is required when running this on WSL2.

# Install requirements
cd /workspaces/dbt-synthea
python -m venv dbt-env
source dbt-env/bin/activate
pip install -r .devcontainer/scripts/minimal_requirements.txt
lawrenceadams marked this conversation as resolved.

# Setup pre-commit
pre-commit install

# Setup bash history search
cat >> ~/.inputrc <<'EOF'
"\e[A": history-search-backward
"\e[B": history-search-forward
EOF



# Setup dbt profile
mkdir /home/vscode/.dbt
cat >> /home/vscode/.dbt/profiles.yml <<'EOF'
synthea_omop_etl:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: ./data/synthea_omop_etl.duckdb
      schema: dbt_synthea_dev
EOF

# Setup dbt

echo "Setting up duckdb synthetic data and dbt"
mkdir ./data

dbt deps
dbt seed
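
One caveat with the profile setup in this script: `mkdir` without `-p` fails if the directory already exists, and `cat >>` appends a second copy of the profile if the script is ever re-run in the same container. A minimal sketch of an idempotent variant follows. `write_dbt_profile` is a hypothetical helper, the YAML indentation is the standard dbt profile layout (assumed here), and the demo writes to a scratch directory rather than /home/vscode/.dbt.

```bash
#!/usr/bin/env bash
# Hypothetical helper: create the dbt profile only if it does not exist yet.
write_dbt_profile() {
  local dbt_dir="$1"
  mkdir -p "$dbt_dir"                       # no error if the directory exists
  if [ ! -f "$dbt_dir/profiles.yml" ]; then
    cat > "$dbt_dir/profiles.yml" <<'EOF'
synthea_omop_etl:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: ./data/synthea_omop_etl.duckdb
      schema: dbt_synthea_dev
EOF
  fi
}

# Demo in a scratch directory: calling twice still leaves a single profile entry.
demo_dir="$(mktemp -d)/.dbt"
write_dbt_profile "$demo_dir"
write_dbt_profile "$demo_dir"
grep -c '^synthea_omop_etl:' "$demo_dir/profiles.yml"
```

The same `mkdir -p` treatment would apply to the `./data` directory created further down.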
2 changes: 2 additions & 0 deletions .devcontainer/scripts/postStart.sh
@@ -0,0 +1,2 @@
git config --global --add safe.directory /workspaces/dbt-synthea

Comment on lines +1 to +2
Collaborator

I don't think this needs to be run on every postStart (if it has already been done onCreate), if at all

Suggested change
git config --global --add safe.directory /workspaces/dbt-synthea
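
For context on the repeat-run concern: `git config --add` appends a new entry each time it runs, so running it on every start accumulates duplicate `safe.directory` lines in the global config. If the line has to stay for WSL2, a guarded version is one option. A minimal sketch, where `add_safe_directory` is a hypothetical helper:

```bash
#!/usr/bin/env bash
# Hypothetical helper: add a safe.directory entry only if it is not already set.
add_safe_directory() {
  local dir="$1"
  # --get-all lists every configured value, one per line;
  # grep -Fx matches a whole line literally.
  if ! git config --global --get-all safe.directory | grep -Fxq -- "$dir"; then
    git config --global --add safe.directory "$dir"
  fi
}

# Usage in postCreate.sh or postStart.sh (commented out so the sketch has no side effects):
#   add_safe_directory /workspaces/dbt-synthea
```

With the guard in place it does not matter whether the call lives in the create hook, the start hook, or both.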

3 changes: 2 additions & 1 deletion .vscode/settings.json
@@ -5,6 +5,7 @@
},
"python.terminal.activateEnvInCurrentTerminal": true,
"python.defaultInterpreterPath": "./dbt-env/bin/python3",
"dbt.dbtPythonPathOverride": "./dbt-env/bin/python3",
Comment on lines 6 to +8
Collaborator

Suggested change
"python.terminal.activateEnvInCurrentTerminal": true,
"python.defaultInterpreterPath": "./dbt-env/bin/python3",
"dbt.dbtPythonPathOverride": "./dbt-env/bin/python3",

Collaborator Author

Hah 😄 I should have scrolled here before responding to the previous comment. Agree with this change - the user should be responsible for managing the environment rather than making this part of the codebase. But, this needs to be discussed with the others as it will change the behaviour of vscode for users.

Collaborator

so, i had set it up this way so that folks using VS Code would get their virtualenv activated each time they open the repo, without having to remember to activate it. if there's a way to instruct people to set this up without committing it to the repo, i'm open to doing that instead - is there?

Collaborator Author

Setting up and activating an environment should be the user's responsibility and outside of source control. What if someone wants to use uv or conda to manage their environments or use something other than vscode for development?

I would propose moving this to the documentation in README and modifying my devcontainer hack to get rid of setting up a virtual environment within a devcontainer - which as @lawrenceadams rightly pointed out, is redundant.

Collaborator

In this case should we add .vscode to .gitignore? (and then maybe we could add this settings json file to an extras folder so users can easily copy it to their local if they want to use it without thinking too much about it)

Collaborator Author

My preference is to keep .vscode under source control to share settings (e.g. spaces vs. tabs, task configuration, etc.); there is a good discussion here: https://stackoverflow.com/questions/32964920/should-i-commit-the-vscode-folder-to-source-control

However, environment activation steps can simply go in the README rather than in code. I would expect the user to be able to set up a virtual environment if they are planning to use DuckDB, Postgres and dbt. If not, it is a good time to learn anyway.

I will update this PR (with rebase).

Collaborator

Sounds good to me! Please go ahead and include that change in this PR :)

"dbt.queryLimit": 500,
"yaml.schemas": {
"https://raw.githubusercontent.com/dbt-labs/dbt-jsonschema/main/schemas/dbt_yml_files.json": [
@@ -25,4 +26,4 @@
"packages.yml"
]
},
}
}
61 changes: 49 additions & 12 deletions README.md
@@ -1,5 +1,8 @@
[![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://codespaces.new/OHDSI/dbt-synthea)

# [Under developement] dbt-synthea
The purpose of this project is to re-create the Synthea-->OMOP ETL implemented in https://github.com/OHDSI/ETL-Synthea using [dbt](https://github.com/dbt-labs/dbt-core).

The purpose of this project is to re-create the Synthea-->OMOP ETL implemented in <https://github.com/OHDSI/ETL-Synthea> using [dbt](https://github.com/dbt-labs/dbt-core).

The project is currently under development and is not yet ready for production use.

@@ -9,6 +12,11 @@ We built dbt-synthea to demonstrate the power of dbt for building OMOP ETLs. **

...and this is just the beginning! We hope someday to grow the OHDSI dbt ecosystem to include a generic dbt project template and source-specific macros and models. Stay tuned and please reach out if you are interested in contributing.

## Devcontainer development

The project supports quick experimentation for developers wishing to test the entire dbt workflow on synthetic data using [DuckDb](https://duckdb.org/) through the use of a [dev container](https://containers.dev/).
This also allows the use of a consistent, pre-configured development environment that can also be used in [GitHub Codespaces](https://github.com/features/codespaces).

## Developer Setup

Currently this project is set up to run an OMOP ETL into either duckdb or Postgres. Setup instructions for each are provided below.
@@ -18,35 +26,46 @@ By default, the project will source the Synthea and OMOP vocabulary data from se
Users are welcomed, however, to utilize their own Synthea and/or OMOP vocabulary tables as sources. Instructions for the "BYO data" setup are provided below.

### Prerequisites

- See the top of [this page](https://docs.getdbt.com/docs/core/pip-install) for OS & Python requirements. (Do NOT install dbt yet - see below for project installation and setup.)
- It is recommended to use [VS Code](https://code.visualstudio.com/) as your IDE for developing this project. Install the `dbt Power User` extension in VS Code to enjoy a plethora of useful features that make dbt development easier
- This project currently only supports **Synthea v3.0.0**

### Repo Setup

1. Clone this repository to your machine
2. `cd` into the repo directory and set up a virtual environment:

```bash
python3 -m venv dbt-env
```
- If you are using VS Code, create a .env file in the root of your repo workspace (`touch .env`) and add a PYTHONPATH entry for your virtual env (for example, if you cloned your repo in your computer's home directory, the entry will read as: `PYTHONPATH="~/dbt-synthea/dbt-env/bin/python"`)
- Now, in VS Code, once you set this virtualenv as your preferred interpreter for the project, the vscode config in the repo will automatically source this env each time you open a new terminal in the project. Otherwise, each time you open a new terminal to use dbt for this project, run:

- If you are using VS Code, create a .env file in the root of your repo workspace (`touch .env`) and add a PYTHONPATH entry for your virtual env (for example, if you cloned your repo in your computer's home directory, the entry will read as: `PYTHONPATH="~/dbt-synthea/dbt-env/bin/python"`)
- Now, in VS Code, once you set this virtualenv as your preferred interpreter for the project, the vscode config in the repo will automatically source this env each time you open a new terminal in the project. Otherwise, each time you open a new terminal to use dbt for this project, run:

```bash
source dbt-env/bin/activate # activate the environment for Mac and Linux OR
dbt-env\Scripts\activate # activate the environment for Windows
```

4. In your virtual environment, install dbt and other required dependencies as follows:

```bash
pip3 install -r requirements.txt
pre-commit install
```
- This will install dbt-core, the dbt duckdb and postgres adapters, SQLFluff (a SQL linter), pre-commit (in order to run SQLFluff on all newly-committed code in this repo), duckdb (to support bootstrapping scripts), and various dependencies for the listed packages

- This will install dbt-core, the dbt duckdb and postgres adapters, SQLFluff (a SQL linter), pre-commit (in order to run SQLFluff on all newly-committed code in this repo), duckdb (to support bootstrapping scripts), and various dependencies for the listed packages

### DuckDB Setup

1. Create a duckdb database in this repo's `data` directory (e.g. `data/synthea_omop_etl.duckdb`)

2. Set up your [profiles.yml file](https://docs.getdbt.com/docs/core/connect-data-platform/profiles.yml):
- Create a directory `.dbt` in your root directory if one doesn't exist already, then create a `profiles.yml` file in `.dbt`
- Add the following block to the file:

- Create a directory `.dbt` in your root directory if one doesn't exist already, then create a `profiles.yml` file in `.dbt`
- Add the following block to the file:

```yaml
synthea_omop_etl:
outputs:
@@ -58,22 +77,27 @@ pre-commit install
```

3. Ensure your profile is setup correctly using dbt debug:

```bash
dbt debug
```

4. Load dbt dependencies:

```bash
dbt deps
```

5. **If you'd like to run the default ETL using the pre-seeded Synthea dataset,** run `dbt seed` to load the CSVs with the Synthea dataset and vocabulary data. This materializes the seed CSVs as tables in your target schema (vocab) and a _synthea schema (Synthea tables). **Then, skip to step 9 below.**

```bash
dbt seed
```

6. **If you'd like to run the ETL on your own Synthea dataset,** first toggle the `seed_source` variable in `dbt_project.yml` to `false`. This will tell dbt not to look for the source data in the seed schemas.

7. **[BYO DATA ONLY]** Load your Synthea and Vocabulary data into the database by running the following commands (modify the commands as needed to specify the path to the folder storing the Synthea and vocabulary csv files, respectively). The vocabulary tables will be created in the target schema specified in your profiles.yml for the profile you are targeting. The Synthea tables will be created in a schema named "<target schema>_synthea". **NOTE only Synthea v3.0.0 is supported at this time.**

``` bash
file_dict=$(python3 scripts/python/get_csv_filepaths.py path/to/synthea/csvs)
dbt run-operation load_data_duckdb --args "{file_dict: $file_dict, vocab_tables: false}"
@@ -82,26 +106,32 @@ dbt run-operation load_data_duckdb --args "{file_dict: $file_dict, vocab_tables:
```

8. Seed the location mapper and currently unused empty OMOP tables:

```bash
dbt seed --select states omop
```

9. Build the OMOP tables:

```bash
dbt run
```

10. Run tests:

```bash
dbt test
```
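
The default (pre-seeded) path through the steps above can be sketched as a single script. This is an illustrative wrapper, not part of the repo: `run_step`, `default_etl` and the `DRY_RUN` variable are hypothetical names, and with `DRY_RUN=1` (the default here) the commands are only printed rather than executed.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the default DuckDB workflow (steps 3, 4, 5, 9 and 10).
DRY_RUN="${DRY_RUN:-1}"

# Print a command, and execute it unless DRY_RUN=1.
run_step() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

# Run the steps in order, stopping at the first failure.
default_etl() {
  run_step dbt debug &&
  run_step dbt deps &&
  run_step dbt seed &&
  run_step dbt run &&
  run_step dbt test
}

default_etl
```

Setting `DRY_RUN=0` inside an activated virtual environment would execute the same sequence for real.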

### Postgres Setup

1. Set up a local Postgres database with a dedicated schema for developing this project (e.g. `dbt_synthea_dev`)

2. Set up your [profiles.yml file](https://docs.getdbt.com/docs/core/connect-data-platform/profiles.yml):
- Create a directory `.dbt` in your root directory if one doesn't exist already, then create a `profiles.yml` file in `.dbt`
- Add the following block to the file:

- Create a directory `.dbt` in your root directory if one doesn't exist already, then create a `profiles.yml` file in `.dbt`
- Add the following block to the file:

```yaml
synthea_omop_etl:
outputs:
@@ -118,23 +148,27 @@ dbt test
```

3. Ensure your profile is setup correctly using dbt debug:

```bash
dbt debug
```

4. Load dbt dependencies:

```bash
dbt deps
```

5. **If you'd like to run the default ETL using the pre-seeded Synthea dataset,** run `dbt seed` to load the CSVs with the Synthea dataset and vocabulary data. This materializes the seed CSVs as tables in your target schema (vocab) and a _synthea schema (Synthea tables). **Then, skip to step 10 below.**

```bash
dbt seed
```

6. **If you'd like to run the ETL on your own Synthea dataset,** first toggle the `seed_source` variable in `dbt_project.yml` to `false`. This will tell dbt not to look for the source data in the seed schemas.

7. **[BYO DATA ONLY]** Create the empty vocabulary and Synthea tables by running the following commands. The vocabulary tables will be created in the target schema specified in your profiles.yml for the profile you are targeting. The Synthea tables will be created in a schema named "<target schema>_synthea".

``` bash
dbt run-operation create_vocab_tables
dbt run-operation create_synthea_tables
@@ -143,16 +177,19 @@ dbt run-operation create_synthea_tables
8. **[BYO DATA ONLY]** Use the technology/package of your choice to load the OMOP vocabulary and raw Synthea files into these newly-created tables. **NOTE only Synthea v3.0.0 is supported at this time.**

9. Seed the location mapper and currently unused empty OMOP tables:

```bash
dbt seed --select states omop
```

10. Build the OMOP tables:

```bash
dbt run
```

11. Run tests:

```bash
dbt test
```
```
4 changes: 2 additions & 2 deletions package-lock.yml
Collaborator

Forgot to bump this! Nice catch

@@ -1,4 +1,4 @@
packages:
- package: dbt-labs/dbt_utils
version: 1.2.0
sha1_hash: d4f259856543b0ef301e0b3b0bbc94ccb6b12a54
version: 1.3.0
sha1_hash: 226ae69cdfbc9367e2aa2c472b01f99dbce11de0