
Task: Reconfigure the Docker setup to incorporate the timescaledb extension #997

Open · 5 tasks
nlebovits opened this issue Nov 13, 2024 · 7 comments
Labels: backend, help wanted (Extra attention is needed)

Comments

@nlebovits (Collaborator) commented Nov 13, 2024

Add TimescaleDB Extension and Configure Hypertables for Time Series Analysis

Describe the Task

Our current PostgreSQL instance is not optimized for time series data. Right now, the script dumps the existing postgres schema into a backup schema named with the date it was created, and then loads the new data. Instead, we should switch to a single schema with the timescaledb extension. We can then convert our main tables to hypertables, partitioned monthly, rather than creating a pile of backup schemas. Additionally, we want to implement data compression policies for data older than one year to optimize storage.
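
As a rough SQL sketch of what this could look like (the table and column names below are placeholders, not the final schema):

-- Enable the extension
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- Convert an existing table to a hypertable with monthly partitioning
SELECT create_hypertable(
    'vacant_properties',
    'create_date',
    chunk_time_interval => INTERVAL '1 month',
    migrate_data => true
);

-- Enable compression and automatically compress chunks older than one year
ALTER TABLE vacant_properties SET (timescaledb.compress);
SELECT add_compression_policy('vacant_properties', INTERVAL '1 year');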

As an optional improvement, consider adding spatial indexing to tables containing geospatial data. If this is implemented, please document the process and any decisions made.
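
If spatial indexing is implemented, the standard PostGIS approach is a GiST index on the geometry column; a minimal sketch, assuming the column is named geometry (adjust to the actual column name):

-- Speeds up spatial joins and bounding-box queries on PostGIS geometries
CREATE INDEX IF NOT EXISTS idx_vacant_properties_geometry
    ON vacant_properties USING GIST (geometry);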

Another optional but beneficial addition is setting up pg_stat to monitor query performance and track table growth over time.
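
Assuming "pg_stat" here refers to the pg_stat_statements extension plus the built-in pg_stat_user_tables view (an assumption; the exact monitoring setup is still open), the setup and a couple of example queries would look roughly like:

-- Requires pg_stat_statements in shared_preload_libraries, e.g.:
--   shared_preload_libraries = 'timescaledb,pg_stat_statements'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top queries by total execution time
SELECT query, calls, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Table sizes and live row counts, for tracking growth over time
SELECT relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
       n_live_tup
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC;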

Acceptance Criteria

  • Add timescaledb extension to the PostgreSQL instance.
  • Convert main tables to hypertables with monthly partitioning.
  • Implement data compression policies for data older than one year.
  • (Optional) Add spatial indexing to geospatial tables if deemed useful.
  • (Optional) Set up pg_stat to monitor query performance and table growth over time.

Additional Context

  • Ensure that documentation is thorough, detailing the setup and configuration steps.
  • Provide any relevant references or resources that may assist in understanding timescaledb and PostgreSQL configurations for time series analysis.

Existing Work

I've already put some work into this. Here's what I have so far:

  1. My docker-compose.yml
services:
  vacant-lots-proj-experimental:
    build:
      context: .
      dockerfile: Dockerfile
    image: vacant-lots-proj-experimental:latest
    environment:
      - GOOGLE_APPLICATION_CREDENTIALS=/app/service-account-key.json
      - CFP_MAPBOX_TOKEN_UPLOADER
      - VACANT_LOTS_DB
      - CLEAN_GREEN_GOOGLE_KEY
      - PYTHONUNBUFFERED=1
      - GOOGLE_CLOUD_BUCKET_NAME
      - GOOGLE_CLOUD_PROJECT
      - CAGP_SLACK_API_TOKEN
    volumes:
      - ./src:/usr/src/app
      - ~/.config/gcloud/application_default_credentials.json:/app/service-account-key.json
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    extra_hosts:
      - host.docker.internal:host-gateway
    network_mode: 'host'

  formatter-experimental:
    build:
      context: .
    volumes:
      - ./src:/usr/src/app
    command: sh -c "pip install ruff && ruff format --exclude '/usr/src/app/awkde/'"
    network_mode: 'host'

  linter-experimental:
    build:
      context: .
    volumes:
      - ./src:/usr/src/app
    command: sh -c "pip install ruff && ruff check --fix --exclude '/usr/src/app/awkde/'"
    network_mode: 'host'

  streetview-experimental:
    build:
      context: .
    environment:
      - GOOGLE_APPLICATION_CREDENTIALS=/app/service-account-key.json
      - CLEAN_GREEN_GOOGLE_KEY
      - VACANT_LOTS_DB
      - GOOGLE_CLOUD_BUCKET_NAME
    volumes:
      - ./src:/usr/src/app
      - ~/.config/gcloud/application_default_credentials.json:/app/service-account-key.json
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    command: sh -c "pipenv run python streetview.py"
    extra_hosts:
      - host.docker.internal:host-gateway
    network_mode: 'host'

  postgres-experimental:
    container_name: cagp-postgres-experimental
    build:
      context: .
      dockerfile: Dockerfile-pg
    environment:
      - PGPORT=5433
      - POSTGRES_PASSWORD
    restart: always
    ports:
      - '5433:5433'
    volumes:
      - database_volume:/var/lib/postgresql/data
      - ./init_pg.sql:/docker-entrypoint-initdb.d/init_pg.sql
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    extra_hosts:
      - host.docker.internal:host-gateway

volumes:
  database_volume:

  2. My Dockerfile
# Use an official Python 3.11 runtime as a parent image
FROM python:3.11.4

# Set the working directory in the container
WORKDIR /usr/src/app

# Install system dependencies for GDAL and Tippecanoe
RUN apt-get update && apt-get install -y \
    libgdal-dev \
    gcc \
    git \
    build-essential \
    libsqlite3-dev \
    zlib1g-dev \
    lsb-release \
    && rm -rf /var/lib/apt/lists/*

# install postgres client 16 for psql and pg_dump executables for backups.
# should match the version used in the other docker file for the postgres install
RUN sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
RUN curl -fsSL https://www.postgresql.org/media/keys/ACCC4CF8.asc |  gpg --dearmor -o /etc/apt/trusted.gpg.d/postgresql.gpg
RUN apt update
RUN apt install -y postgresql-client-16

# Set GDAL environment variables
ENV GDAL_VERSION=3.6.2
ENV GDAL_CONFIG=/usr/bin/gdal-config

# Install Pipenv
RUN pip install pipenv

# Copy the Pipfile and Pipfile.lock from the src directory
COPY src/Pipfile src/Pipfile.lock ./

# Regenerate the lock file (must run after the Pipfile is copied in)
RUN pipenv lock

# Install the dependencies from the lock file
RUN pipenv install --deploy --ignore-pipfile

# Clone and build Tippecanoe
RUN git clone https://github.com/felt/tippecanoe.git \
    && cd tippecanoe \
    && make \
    && make install

# Copy the src directory (including awkde)
COPY src/ .

# Install awkde within the Pipenv environment
# Assuming awkde is located in /usr/src/app/awkde
WORKDIR /usr/src/app/awkde
RUN pipenv run pip install .

# Change back to the original working directory
WORKDIR /usr/src/app

# Use Pipenv to run the script
# Adjust the path to your main Python script if needed
CMD ["pipenv", "run", "python", "./script.py"]

  3. My Dockerfile-pg
FROM postgres:16-bullseye

LABEL maintainer="PostGIS Project - https://postgis.net" \
      org.opencontainers.image.description="PostGIS with PostgreSQL 16 and TimescaleDB" \
      org.opencontainers.image.source="https://github.com/postgis/docker-postgis"

ENV POSTGIS_MAJOR 3

# Install dependencies and set up TimescaleDB repository
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
           gnupg \
           postgresql-common \
           apt-transport-https \
           lsb-release \
           wget \
           ca-certificates \
    && yes | /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh \
    && echo "deb https://packagecloud.io/timescale/timescaledb/debian/ $(lsb_release -c -s) main" > /etc/apt/sources.list.d/timescaledb.list \
    && wget --quiet -O - https://packagecloud.io/timescale/timescaledb/gpgkey | gpg --dearmor -o /etc/apt/trusted.gpg.d/timescaledb.gpg \
    && apt-get update \
    && apt-get install -y --no-install-recommends \
           postgresql-16-postgis-3 \
           postgresql-16-postgis-3-scripts \
           timescaledb-2-postgresql-16 \
           postgresql-client-16 \
    && rm -rf /var/lib/apt/lists/*

RUN mkdir -p /docker-entrypoint-initdb.d

# After the TimescaleDB installation, ensure the configuration is updated
RUN echo "shared_preload_libraries = 'timescaledb'" >> /usr/share/postgresql/postgresql.conf.sample

  4. My init_pg.sql
CREATE DATABASE vacantlotdb;
\c vacantlotdb;
CREATE EXTENSION postgis;
CREATE EXTENSION timescaledb;
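
Once the container is up, a quick sanity check from psql can confirm that TimescaleDB is preloaded and both extensions were created (not part of the files above, just a suggested verification step):

-- Run against vacantlotdb
SHOW shared_preload_libraries;                 -- should include 'timescaledb'
SELECT extname, extversion FROM pg_extension;  -- should list postgis and timescaledb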

  5. My new draft of script.py
import os
import sys
from datetime import datetime

import pandas as pd
from config.psql import conn
from data_utils.access_process import access_process
from data_utils.city_owned_properties import city_owned_properties
from data_utils.community_gardens import community_gardens
from data_utils.conservatorship import conservatorship
from data_utils.contig_neighbors import contig_neighbors
from data_utils.deliquencies import deliquencies
from data_utils.dev_probability import dev_probability
from data_utils.drug_crimes import drug_crimes
from data_utils.gun_crimes import gun_crimes
from data_utils.imm_dang_buildings import imm_dang_buildings
from data_utils.l_and_i import l_and_i
from data_utils.nbhoods import nbhoods
from data_utils.negligent_devs import negligent_devs
from data_utils.opa_properties import opa_properties
from data_utils.owner_type import owner_type
from data_utils.park_priority import park_priority
from data_utils.phs_properties import phs_properties
from data_utils.ppr_properties import ppr_properties
from data_utils.priority_level import priority_level
from data_utils.rco_geoms import rco_geoms
from data_utils.tactical_urbanism import tactical_urbanism
from data_utils.tree_canopy import tree_canopy
from data_utils.unsafe_buildings import unsafe_buildings
from data_utils.vacant_properties import vacant_properties
from sqlalchemy import text

# Ensure the directory containing awkde is in the Python path
awkde_path = "/usr/src/app"
if awkde_path not in sys.path:
    sys.path.append(awkde_path)

services = [
    city_owned_properties,
    phs_properties,
    l_and_i,
    rco_geoms,
    tree_canopy,
    nbhoods,
    gun_crimes,
    drug_crimes,
    deliquencies,
    opa_properties,
    unsafe_buildings,
    imm_dang_buildings,
    tactical_urbanism,
    conservatorship,
    owner_type,
    community_gardens,
    park_priority,
    ppr_properties,
    contig_neighbors,
    dev_probability,
    negligent_devs,
]

# Load Vacant Property Data
dataset = vacant_properties()

# Load and join other datasets
for service in services:
    dataset = service(dataset)

# Additional processing and print statements
before_drop = dataset.gdf.shape[0]
dataset.gdf = dataset.gdf.drop_duplicates(subset="opa_id")
after_drop = dataset.gdf.shape[0]
print(
    f"Duplicate dataset rows dropped after initial services: {before_drop - after_drop}"
)

# Add Priority Level
dataset = priority_level(dataset)
print("Distribution of priority level:")
print(dataset.gdf["priority_level"].value_counts())

# Add Access Process
dataset = access_process(dataset)
print("Distribution of access process:")
print(dataset.gdf["access_process"].value_counts())

before_drop = dataset.gdf.shape[0]
dataset.gdf = dataset.gdf.drop_duplicates(subset="opa_id")
after_drop = dataset.gdf.shape[0]
print(f"Duplicate final dataset rows dropped: {before_drop - after_drop}")

# Add create_date column
today_date = datetime.now().strftime("%m-%d-%Y")
dataset.gdf["create_date"] = today_date
dataset.gdf["create_date"] = pd.to_datetime(dataset.gdf["create_date"])

os.makedirs("tmp", exist_ok=True)
dataset.gdf.to_file("tmp/full_dataset.geojson", driver="GeoJSON")
print("Final dataset saved to tmp/ folder.")

# back up old tiles file whether we are reloading data or not
# if backup is None:
#     backup = BackupArchiveDatabase()
# backup.backup_tiles_file()

# Commit the data to PostgreSQL
dataset.gdf.to_postgis("vacant_properties_end", conn, if_exists="replace", index=False)
conn.commit()
print("Data committed to PostgreSQL.")

# Check if the table is already a hypertable and create it if not
with conn.begin():  # Use conn directly if it's an existing Connection object
    result = conn.execute(
        text(
            "SELECT * FROM timescaledb_information.hypertables WHERE hypertable_name = 'vacant_properties_end';"
        )
    )
    if result.first() is None:  # rowcount is unreliable for SELECTs on some drivers
        conn.execute(
            text(
                "SELECT create_hypertable('vacant_properties_end', 'create_date', migrate_data => true);"
            )
        )
        print("Table converted to a TimescaleDB hypertable.")
    else:
        print("Table is already a TimescaleDB hypertable.")

# Post to Mapbox
# dataset.build_and_publish(tiles_file_id_prefix)

# if we are reloading, run the diff report, then archive the backup and finally prune old archives
# if FORCE_RELOAD:
#     diff_report = DiffReport(timestamp_string=backup.timestamp_string)
#     diff_report.run()
#     backup.archive_backup_schema()
#     conn.commit()
#     backup.prune_old_archives()
#     conn.commit()

conn.close()

# except Exception as e:
#     error_message = f"Error in backend job: {str(e)}\n\n{traceback.format_exc()}"
#     send_error_to_slack(error_message)
#     raise  # Optionally re-raise the exception
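
One gap still open relative to the acceptance criteria: the create_hypertable call above uses TimescaleDB's default chunk interval (7 days) rather than monthly partitioning, and the one-year compression policy isn't added yet. The chunk interval can be adjusted on the existing hypertable (a sketch; the compression policy would follow the pattern shown earlier):

-- Applies to chunks created after this call
SELECT set_chunk_time_interval('vacant_properties_end', INTERVAL '1 month');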


Please create a draft PR that includes strong documentation, including:

  • Changes made to the PostgreSQL configuration.
  • Steps for adding and configuring timescaledb and the setup of hypertables.
  • Notes on any optional enhancements if implemented (spatial indexing, pg_stat setup).
nlebovits added the help wanted (Extra attention is needed) label on Nov 13, 2024
@rmartinsen (Contributor) commented:

I'd love to take a shot at this one and get involved with the project in general. After talking last night and looking through the issue I'm starting to get a feel for it.

One caveat: I'll be out of town this weekend and won't be able to really sink my teeth into it until the latter part of next week. Will that be an issue?

@nlebovits (Collaborator, Author) commented:

Not an issue at all, @rmartinsen! I've assigned the ticket to you.

nlebovits moved this from To Do to In Development in Clean & Green Philly on Nov 14, 2024
@nlebovits (Collaborator, Author) commented:

@rmartinsen

I've been rethinking the development process and realized I was overcomplicating things when we talked the other night. Rather than trying to swap out the database + container in one go, it's likely easier for you to create a new container for this TimescaleDB database and integrate that into our existing docker-compose.yml.

For now, you can create a new container for the TimescaleDB database (with the PostGIS and TimescaleDB extensions) and integrate it into our existing docker-compose.yml. This new container should be accessible from the main vacant-lots-proj container, just like the current PostgreSQL database is. Essentially, we’ll add it as a third container in our setup.

While you're working on the new database container, I'll write a new version of script.py (with a different name, e.g., main.py). This new script will be based on the OPA properties dataset, rather than the vacant properties dataset. Once your new database is ready, we’ll test the new script by writing data to it.

After that, we can:

  1. Archive the old script.py and the current PostgreSQL database.
  2. Dump the old data for backups.
  3. Transition fully to the new setup.

Summary of Tasks

  • You: Set up a new PostgreSQL container with PostGIS + TimescaleDB extensions and integrate it with our existing Docker Compose setup.
  • Me: Write a new version of the ETL script based on the OPA properties dataset, designed to interact with the new TimescaleDB.

Once these two parts are ready, we’ll work together to validate the new ETL script by writing data to the new database. Afterward, I’ll handle deployment to our VM, set up backups, and finalize the migration.

Does this approach make sense to you?

@rmartinsen (Contributor) commented:

@nlebovits That sounds like a plan. I'll get started on it over the next few days.

@nlebovits (Collaborator, Author) commented:

@rmartinsen, see #1014; the new ETL pipeline is outlined there, and you should be able to integrate it directly into the new Docker container + pg database.

@rmartinsen (Contributor) commented Nov 23, 2024:

@nlebovits I put up a PR with the changes we discussed. It looks pretty similar to what you'd put together, just with a separate postgres container. Let me know if this is what you're looking for.

@rmartinsen (Contributor) commented:

Just saw the other comment about #1014. I'll look at integrating that next.
