Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lebovits/refactor etl pipeline #1014

Merged
merged 9 commits into from
Nov 22, 2024
Merged

Conversation

nlebovits
Copy link
Collaborator

@nlebovits nlebovits commented Nov 21, 2024

This PR does several major things:

  1. Implements main.py to create a new ETL pipeline based around the full OPA properties dataset as opposed to script.py, which is only based around the vacant properties datasets.
  2. Updates most of the data_utils and other files relevant to the pipeline, storying them in the new_etl subdirectory to avoid conflicts with the main pipeline.
  3. Adds a bunch of data quality controls and checks in the pipeline--these are currently coarse and can/should be improved as we continue to develop the pipeline.
  4. Adds parallelization where relevant (e.g., reading in large Carto datasets, calculating KDE) to speed up the pipeline.
  5. Changes how several datapoints are calculated, e.g., L&I complaints are converted to a citywide KDE as opposed to trying to spatially join points to individual parcels.

Currently, it runs without changes to the Docker setup. Run docker compose -d postgres, then docker compose run vacant-lots-proj bash, and then pipenv run python -m main. This is because the data are not written to a postgres database. We're waiting on #997 to do this, so main.py just writes to test_output.parquet in the tmp/ directory for now.

Some points of note:

  • As with the OPA Property dashboard, we are using PWD Parcels as the basis for property geometries, as the OPA properties dataset now only includes point geometry. This is done via a spatial join and leaves ~38k properties (seemingly randomly distributed) without matching polygon geometry. The pipeline handles this gracefully, but it's something that should ideally be improved in the future. I also tried using the DOR parcels dataset, but it's a nightmare, with a lot of spatially overlapping parcels, poor metadata, and no intelligible way to filter for the right parcel boundaries. Same issue with the city's parcel-level land use dataset. In the future, coming up with better parcel boundaries is ideal, but there may not be a best option.
  • This definitely still leaves some logging to be cleaned up in the pipeline. Likely worthy of a future PR, but simply not something I have bandwidth for right now.
  • I haven't added any tests here--they're needed, but, again, no bandwidth.
  • I'm still skeptical of how n_contiguous is being calculated. It worked fine with the vacant properties data alone; I am not 100% sure I have it working properly here.
  • There are some data gaps with the park_priority and tree_canopy_gap datasets, which are both national-level data (I think from the same Parkserve dataset). It would be nice to fix these, but it's not a priority right now.

Copy link

vercel bot commented Nov 21, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
vacant-lots-proj ✅ Ready (Inspect) Visit Preview 💬 Add feedback Nov 22, 2024 0:37am

…yet finished--need to handle mismatched number of observations
@nlebovits
Copy link
Collaborator Author

Removing Dockerfile-pg from the PR because it's outside the scope, but I was using this to get around postgis versioning:

FROM postgres:16-bullseye

LABEL maintainer="PostGIS Project - https://postgis.net" \
      org.opencontainers.image.description="PostGIS with PostgreSQL 16" \
      org.opencontainers.image.source="https://github.com/postgis/docker-postgis"

ENV POSTGIS_MAJOR 3

# Install dependencies and PostGIS
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
           gnupg \
           postgresql-common \
           apt-transport-https \
           lsb-release \
           wget \
           ca-certificates \
    && yes | /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh \
    && apt-get update \
    && apt-get install -y --no-install-recommends \
           postgresql-16-postgis-3 \
           postgresql-16-postgis-3-scripts \
           postgresql-client-16 \
    && rm -rf /var/lib/apt/lists/*

RUN mkdir -p /docker-entrypoint-initdb.d

Relevant for #997

@nlebovits nlebovits marked this pull request as ready for review November 22, 2024 00:37
@nlebovits nlebovits merged commit f4f6964 into staging Nov 22, 2024
11 of 13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant