Superset deployment #3715

Merged · 17 commits · Aug 16, 2024
295 changes: 148 additions & 147 deletions environments/conda-linux-64.lock.yml

Large diffs are not rendered by default.

4,266 changes: 2,226 additions & 2,040 deletions environments/conda-lock.yml


288 changes: 148 additions & 140 deletions environments/conda-osx-64.lock.yml


294 changes: 151 additions & 143 deletions environments/conda-osx-arm64.lock.yml


1 change: 1 addition & 0 deletions pyproject.toml
@@ -152,6 +152,7 @@ dev = [
    "ruff-lsp",
    "jupyter-resource-usage",
    "pygraphviz",
    "terraform>=1.9.2"
]

[tool.setuptools]
14 changes: 14 additions & 0 deletions superset/Dockerfile
@@ -0,0 +1,14 @@
FROM apache/superset:4.0.2

# hadolint ignore=DL3002
USER root

COPY --chown=superset superset_config.py /app/
ENV SUPERSET_CONFIG_PATH /app/superset_config.py

# Install extra Python dependencies from the requirements file
COPY --chown=superset requirements.txt /app/
RUN pip install --no-cache-dir -r /app/requirements.txt && \
rm /app/requirements.txt
# Switching back to using the `superset` user
USER superset
49 changes: 49 additions & 0 deletions superset/README.md
@@ -0,0 +1,49 @@
# PUDL Superset
This directory contains files required to build and deploy PUDL's Superset instance.

## Local deployment
To test out a local deployment, build the images:

```
docker compose build
```

Before you start the services you'll need to set some environment variables:

```
# These Auth0 variables are required for authentication.
# For local development it's best to create your own
# Auth0 application so we don't accidentally muck with the
# production Auth0 configuration.
# You can follow the instructions here: https://auth0.com/docs/get-started/auth0-overview/create-applications
export AUTH0_CLIENT_ID="auth0 client id"
export AUTH0_CLIENT_SECRET="auth0 client secret"
export AUTH0_DOMAIN="auth0 client domain"

# Set the connection details
export SUPERSET_DB_HOST=postgres
export SUPERSET_DB_PORT=5432
export SUPERSET_DB_USER=superset
export SUPERSET_DB_PASS=superset
export SUPERSET_DB_NAME=superset
```

Then you can start the services:

```
docker compose up
```

If this is the first time running Superset locally, or you recently ran `docker compose down`, you'll need to run the commands in `setup.sh`.
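`superset_config.py` (added later in this PR) reads the Auth0 and database variables with hard `os.environ[...]` lookups, so an unset variable makes the container fail at startup with a `KeyError`. A stdlib-only pre-flight sketch (the helper name and the variable list are ours, not part of the PR):

```python
import os

# Variables that superset_config.py reads unconditionally via os.environ[...]
# (hard lookups raise KeyError when the variable is unset).
REQUIRED_VARS = [
    "AUTH0_CLIENT_ID",
    "AUTH0_CLIENT_SECRET",
    "AUTH0_DOMAIN",
    "SUPERSET_DB_USER",
    "SUPERSET_DB_PASS",
    "SUPERSET_DB_NAME",
]


def missing_vars(environ=None):
    """Return the required variables that are not set in the given environment."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if name not in environ]
```

Running `missing_vars()` before `docker compose up` surfaces a readable list of unset variables instead of a mid-startup traceback.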

## Making changes to the production deployment
TODO: instructions on how to connect to Cloud SQL
Review comment (Member Author):
I need to figure out how to use the Cloud Auth Proxy to make changes to the production database. Also, how can we protect against people making changes to the production database when they are just experimenting with the local deployment?

Reply (Member):
How did you make changes to the Cloud SQL database for the initial deploy?

I think we can use the Docker version of the Cloud SQL Auth Proxy in a mutate-prod docker compose file. That would just be the same as our dev docker-compose, but with the Cloud SQL proxy standing in for the postgres service. Then executing superset db-upgrade or whatever, within the pudl-superset docker-compose service, would point at prod Cloud SQL.

We'd need to make a new service account to give that Cloud SQL Auth Proxy. To stop people from accidentally changing the prod DB, we could restrict the ability to create a key for that SA to only a subset of Catalyst. So it'd require a fair amount of effort/oversight to be able to make a change to prod DB.


## Deploy to Cloud Run
Review comment (Member Author):
What should the deployment flow be for Superset? I can think of a few reasons to trigger a new deployment:

- a package in requirements.txt is updated
- the base image is updated
- we make a change to superset_config.py
- I think we'll probably want to redeploy superset when there is new data, because changes to tables might break shared dashboards. Also, should superset point at nightly or stable?

Reply (Member):
I think redeploying Superset for nightly builds is a good way to test that our deployment infrastructure still works, and has the benefit of catching all these other changes too - so long as the cloud build doesn't cost a ton I think that's the easiest way forward.

I also think that eventually we probably want both nightly and stable to be on Superset - we can default to using stable, but have the option to connect to nightly in the database selector if people want.

Once you've made changes to the superset docker image, you can update the production deployment with this command:

```
gcloud builds submit --config cloudbuild.yaml .
```

Reply (Member): Love a 1 line redeploy :)

This command uses Cloud Build to build the Docker image, push it to Google Cloud Artifact Registry, and redeploy the `pudl-superset` Cloud Run service with the new image.
29 changes: 29 additions & 0 deletions superset/cloudbuild.yaml
@@ -0,0 +1,29 @@
steps:
  - name: "gcr.io/cloud-builders/docker"
    args:
      [
        "build",
        "-t",
        "us-central1-docker.pkg.dev/catalyst-cooperative-pudl/pudl-superset/pudl-superset:latest",
        "--platform",
        "linux/amd64",
        ".",
      ]
  - name: "gcr.io/cloud-builders/docker"
    args:
      [
        "push",
        "us-central1-docker.pkg.dev/catalyst-cooperative-pudl/pudl-superset/pudl-superset:latest",
      ]

  - name: "gcr.io/cloud-builders/gcloud"
    args:
      [
        "run",
        "deploy",
        "pudl-superset",
        "--image",
        "us-central1-docker.pkg.dev/catalyst-cooperative-pudl/pudl-superset/pudl-superset:latest",
        "--region",
        "us-central1",
      ]
31 changes: 31 additions & 0 deletions superset/docker-compose.yml
@@ -0,0 +1,31 @@
services:
  superset:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: pudl-superset
    environment:
      AUTH0_CLIENT_ID: ${AUTH0_CLIENT_ID}
      AUTH0_CLIENT_SECRET: ${AUTH0_CLIENT_SECRET}
      AUTH0_DOMAIN: ${AUTH0_DOMAIN}
      SUPERSET_SECRET_KEY: ${SUPERSET_SECRET_KEY}
      SUPERSET_DB_HOST: ${SUPERSET_DB_HOST-postgres}
      SUPERSET_DB_PORT: ${SUPERSET_DB_PORT-5432}
      SUPERSET_DB_USER: ${SUPERSET_DB_USER-superset}
      SUPERSET_DB_PASS: ${SUPERSET_DB_PASS-superset}
      SUPERSET_DB_NAME: ${SUPERSET_DB_NAME-superset}
    ports:
      - "8080:8088"
    volumes:
      - ${PUDL_OUTPUT}/pudl.duckdb:/app/pudl.duckdb
      - ./roles.json:/app/roles.json
Review comment (Member Author): Do we want to try to track the role definitions in git so our local deployments can use the roles we're using in production?

Reply (Member): Yeah, that makes a ton of sense to me :)

    depends_on:
      - postgres
  postgres:
    image: postgis/postgis:13-3.1-alpine
    environment:
      POSTGRES_DB: superset
      POSTGRES_USER: superset
      POSTGRES_PASSWORD: superset
    ports:
      - 8081:5432
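The `${VAR-default}` references in this compose file substitute the default only when the variable is unset, unlike `${VAR:-default}`, which also replaces empty values. In Python terms the unset-only form behaves like a key-presence check (the helper below is our own illustration, not part of the PR):

```python
def compose_interpolate(environ: dict, name: str, default: str) -> str:
    """Mimic docker compose's ${NAME-default}: fall back only when NAME is unset.

    An empty-but-set value is kept, which is what distinguishes
    ${NAME-default} from ${NAME:-default}.
    """
    return environ[name] if name in environ else default


# ${SUPERSET_DB_HOST-postgres} with nothing exported falls back to "postgres"
assert compose_interpolate({}, "SUPERSET_DB_HOST", "postgres") == "postgres"
# An exported empty string is preserved by the `-` form
assert compose_interpolate({"SUPERSET_DB_HOST": ""}, "SUPERSET_DB_HOST", "postgres") == ""
```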
5 changes: 5 additions & 0 deletions superset/requirements.txt
@@ -0,0 +1,5 @@
duckdb==1.0.0
duckdb-engine==0.13.0
psycopg2-binary==2.9.9
Authlib==1.3.1
pg8000==1.31.2
16 changes: 16 additions & 0 deletions superset/setup.sh
@@ -0,0 +1,16 @@
# Description: This script was used to set up the Superset instance for the first time.

# Create an admin user
docker compose exec -it superset superset fab create-admin \
  --username admin \
  --firstname Superset \
  --lastname Admin \
  --email [email protected] \
  --password admin
Review comment (@jdangerx, Member, Aug 14, 2024):
This is run for the dev superset instance, right? What happens in production?

Separately, should we be passing these username/name/email/password as arguments to this script? In prod, we'd have all the secrets set as env vars at this point, right?


# Initialize the database and run migrations
docker compose exec -it superset superset db upgrade
docker compose exec -it superset superset init

# Import custom roles that include a new role that combines permissions of Gamma and sql_user roles
# docker exec -it pudl-superset superset fab import-roles --path /app/roles.json
74 changes: 74 additions & 0 deletions superset/superset_config.py
@@ -0,0 +1,74 @@
"""PUDL's Superset configuration."""

import os

import sqlalchemy as sa
from flask import Flask
from flask_appbuilder.security.manager import (
AUTH_OAUTH,
)

AUTH_TYPE = AUTH_OAUTH

AUTH0_CLIENT_ID = os.environ["AUTH0_CLIENT_ID"]
AUTH0_CLIENT_SECRET = os.environ["AUTH0_CLIENT_SECRET"]
AUTH0_DOMAIN = os.environ["AUTH0_DOMAIN"]

OAUTH_PROVIDERS = [
    {
        "name": "auth0",
        "token_key": "access_token",  # Name of the token in the response of access_token_url
        "icon": "fa-key",  # Icon for the provider
        "remote_app": {
            "client_id": AUTH0_CLIENT_ID,  # Client Id (Identify Superset application)
            "client_secret": AUTH0_CLIENT_SECRET,  # Secret for this Client Id (Identify Superset application)
            "client_kwargs": {"scope": "openid profile email groups"},
            "server_metadata_url": f"https://{AUTH0_DOMAIN}/.well-known/openid-configuration",
        },
    }
]

AUTH_USER_REGISTRATION = True
AUTH_USER_REGISTRATION_ROLE = "GammaSQLLab"


def get_db_connection_string() -> str:
    """Get the database connection string."""
    drivername = "postgresql+psycopg2"
    host = os.environ.get("SUPERSET_DB_HOST")
    port = os.environ.get("SUPERSET_DB_PORT")
    username = os.environ["SUPERSET_DB_USER"]
    password = os.environ["SUPERSET_DB_PASS"]
    database = os.environ["SUPERSET_DB_NAME"]

    is_cloud_run = os.environ.get("IS_CLOUD_RUN", False)

    if is_cloud_run:
        cloud_sql_connection_name = os.environ.get("CLOUD_SQL_CONNECTION_NAME")
        # I couldn't figure out how to use unix sockets with the sa.engine.url.URL
        # class so I'm creating the connection string manually
        return f"postgresql+psycopg2://{username}:{password}@/{database}?host=/cloudsql/{cloud_sql_connection_name}"
    return str(
        sa.engine.url.URL.create(
            drivername=drivername,
            host=host,
            port=port,
            username=username,
            password=password,
            database=database,
        )
    )

Reply (Member), on the manual connection string: This seems fine - even the docs seem to just use string URIs.


SQLALCHEMY_DATABASE_URI = get_db_connection_string()


def FLASK_APP_MUTATOR(app: Flask) -> None:  # noqa: N802
    """Superset function that allows you to configure the Flask app.

    Args:
        app: The Flask app instance
    """
    app.config.update(
        PREFERRED_URL_SCHEME="https",
    )
Review comment (Member Author), on lines +66 to +74:
Unfortunately, this didn't resolve the HTTP redirect issue. I think the issue might be related to the auth0 Oauth development tokens we're using.

Reply (Member):
Hmm, nothing in that list looks suspicious to me, I'd love to see the authentication logs and see if there are specific errors showing up, or what the callback URL is being parsed as, etc.
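The branching in `get_db_connection_string` above can be paraphrased without SQLAlchemy. A stdlib-only sketch (note the assumptions: the real config builds the non-Cloud-Run URI with `sa.engine.url.URL.create`, which also URL-escapes credentials, and it leaves host/port unset rather than defaulting them; the defaults below mirror the compose file's fallbacks):

```python
def build_db_uri(environ: dict) -> str:
    """Sketch of the config's connection-string logic.

    On Cloud Run the database is reached through a Unix socket under
    /cloudsql/<connection name>; otherwise a normal host:port URI is built.
    """
    username = environ["SUPERSET_DB_USER"]
    password = environ["SUPERSET_DB_PASS"]
    database = environ["SUPERSET_DB_NAME"]
    if environ.get("IS_CLOUD_RUN"):
        conn = environ["CLOUD_SQL_CONNECTION_NAME"]
        # Unix-socket form: empty host, socket directory passed as a query arg.
        return f"postgresql+psycopg2://{username}:{password}@/{database}?host=/cloudsql/{conn}"
    # Defaults here mirror the docker-compose fallbacks, for illustration only.
    host = environ.get("SUPERSET_DB_HOST", "postgres")
    port = environ.get("SUPERSET_DB_PORT", "5432")
    return f"postgresql+psycopg2://{username}:{password}@{host}:{port}/{database}"
```

Unlike `URL.create`, the f-string form breaks if the password contains URL-special characters, which is worth keeping in mind for production credentials.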

82 changes: 41 additions & 41 deletions terraform/.terraform.lock.hcl

Some generated files are not rendered by default.