Lebovits/refactor etl pipeline #1014

nlebovits · 2024-11-21T02:41:09Z

This PR does several major things:

Implements main.py to create a new ETL pipeline based around the full OPA properties dataset as opposed to script.py, which is only based around the vacant properties datasets.
Updates most of the data_utils and other files relevant to the pipeline, storying them in the new_etl subdirectory to avoid conflicts with the main pipeline.
Adds a bunch of data quality controls and checks in the pipeline--these are currently coarse and can/should be improved as we continue to develop the pipeline.
Adds parallelization where relevant (e.g., reading in large Carto datasets, calculating KDE) to speed up the pipeline.
Changes how several datapoints are calculated, e.g., L&I complaints are converted to a citywide KDE as opposed to trying to spatially join points to individual parcels.

Currently, it runs without changes to the Docker setup. Run docker compose -d postgres, then docker compose run vacant-lots-proj bash, and then pipenv run python -m main. This is because the data are not written to a postgres database. We're waiting on #997 to do this, so main.py just writes to test_output.parquet in the tmp/ directory for now.

Some points of note:

As with the OPA Property dashboard, we are using PWD Parcels as the basis for property geometries, as the OPA properties dataset now only includes point geometry. This is done via a spatial join and leaves ~38k properties (seemingly randomly distributed) without matching polygon geometry. The pipeline handles this gracefully, but it's something that should ideally be improved in the future. I also tried using the DOR parcels dataset, but it's a nightmare, with a lot of spatially overlapping parcels, poor metadata, and no intelligible way to filter for the right parcel boundaries. Same issue with the city's parcel-level land use dataset. In the future, coming up with better parcel boundaries is ideal, but there may not be a best option.
This definitely still leaves some logging to be cleaned up in the pipeline. Likely worthy of a future PR, but simply not something I have bandwidth for right now.
I haven't added any tests here--they're needed, but, again, no bandwidth.
I'm still skeptical of how n_contiguous is being calculated. It worked fine with the vacant properties data alone; I am not 100% sure I have it working properly here.
There are some data gaps with the park_priority and tree_canopy_gap datasets, which are both national-level data (I think from the same Parkserve dataset). It would be nice to fix these, but it's not a priority right now.

vercel · 2024-11-21T02:41:14Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
vacant-lots-proj	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Nov 22, 2024 0:37am

…yet finished--need to handle mismatched number of observations

nlebovits · 2024-11-22T00:34:13Z

Removing Dockerfile-pg from the PR because it's outside the scope, but I was using this to get around postgis versioning:

FROM postgres:16-bullseye

LABEL maintainer="PostGIS Project - https://postgis.net" \
      org.opencontainers.image.description="PostGIS with PostgreSQL 16" \
      org.opencontainers.image.source="https://github.com/postgis/docker-postgis"

ENV POSTGIS_MAJOR 3

# Install dependencies and PostGIS
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
           gnupg \
           postgresql-common \
           apt-transport-https \
           lsb-release \
           wget \
           ca-certificates \
    && yes | /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh \
    && apt-get update \
    && apt-get install -y --no-install-recommends \
           postgresql-16-postgis-3 \
           postgresql-16-postgis-3-scripts \
           postgresql-client-16 \
    && rm -rf /var/lib/apt/lists/*

RUN mkdir -p /docker-entrypoint-initdb.d

Relevant for #997

…current PR

nlebovits added 2 commits November 18, 2024 08:43

run precommit

cd87b34

skip precommit hook

b40c1cd

github-actions bot added backend frontend labels Nov 21, 2024

precommit hook

e28ded3

vercel bot deployed to Preview November 21, 2024 03:01 View deployment

nlebovits added 2 commits November 20, 2024 22:01

reset regular pipeline to staging

7bdec82

add main.py for new etl script

4255b5d

vercel bot deployed to Preview November 21, 2024 03:04 View deployment

fix imports

e8ceaaa

vercel bot deployed to Preview November 21, 2024 03:38 View deployment

clean up li complaints-related items; add dor parcel boundaries (not …

b228c3a

…yet finished--need to handle mismatched number of observations

vercel bot deployed to Preview November 21, 2024 05:45 View deployment

more formatting

e629bb2

vercel bot deployed to Preview November 22, 2024 00:33 View deployment

Reset Dockerfile-pg to match staging as changes are outside scope of …

ea35c07

…current PR

vercel bot deployed to Preview November 22, 2024 00:37 View deployment

nlebovits marked this pull request as ready for review November 22, 2024 00:37

nlebovits merged commit f4f6964 into staging Nov 22, 2024
11 of 13 checks passed

This was referenced Nov 22, 2024

Feature request: L&I violations aggregated by owner #998

Open

Task: Reconfigure the Docker setup to incorporate the timeseriesdb extension #997

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lebovits/refactor etl pipeline #1014

Lebovits/refactor etl pipeline #1014

nlebovits commented Nov 21, 2024 •

edited

Loading

vercel bot commented Nov 21, 2024 •

edited

Loading

nlebovits commented Nov 22, 2024

Lebovits/refactor etl pipeline #1014

Lebovits/refactor etl pipeline #1014

Conversation

nlebovits commented Nov 21, 2024 • edited Loading

vercel bot commented Nov 21, 2024 • edited Loading

nlebovits commented Nov 22, 2024

nlebovits commented Nov 21, 2024 •

edited

Loading

vercel bot commented Nov 21, 2024 •

edited

Loading