Add workflow to gen data (#329)
* run main workflow in container

* remove container action

* try running existing make code in container

* add postgres to generation job

* set up access to postgres

* config postgres service

* remove misplaced 'if' in workflow

* set password for connecting to postgres and list db at start to confirm availability

* separate out step to check postgres so that we can easily check the output

* remove unused script

* remove new version of sqlalchemy since it breaks csvsql in `make import`

* add some conditions for job execution in main workflow

* add check for version of sqlalchemy

* get container to rebuild when requirements change

* adjust logic for when jobs run in main workflow

* try summarizing tables created in main workflow

* make sure we clean up downloads before we start

* try testing csvsql early

* some logging to test csvsql

* turn on verbose output while testing csvsql

* try sending csv file through stdin for csvsql

* upgrade csvkit to 1.3.0 and upgraded its dependencies where needed

* remove files no longer needed in download dir

* set postgres version in dev container and workflow to 9.6 to match Travis CI

* update workflow names

* make use of sql files to create tables

* Update import-file to display schema of created table

* Update import-file to log more info about postgresql tables

* Update import-file to point psql to DATABASE_NAME

* Update import-file to use the right quote around table name in psql

* Update import-file to remove debug logging

* Update Makefile to use saved sql for creating tables from spreadsheet data

* fix Makefile by moving bash into file and saved generated sql for tables that hold spreadsheet data

* some fixes to get csvkit 1.3.0 working - not fully working yet...

* make sure data upload for spreadsheet data does not use inference (i.e., alter data) and increase length of filer name field for committees

* debug version of csvkit installed

* verify python version at time of install on travis

* remove sudo for pip install

* remove download/main.py dependency on latest version of sqlalchemy

* use later postgres

* update postgres for dev container also

* download new netfile csvs before import

* gracefully handle records missing transaction data

* add netfile v2 data to database during import

* make sure dir exists for saving v2 csv files

* make netfile v2 download a part of `make download`

* add requirements for netfile v2 code

* update python-dateutil

* try to cause failure when pip install fails

* upgrade babel

* update pytz

* allow csvkit to pull in the correct agate dependencies and add script to trim whitespace for some columns

* remove whitespace for some key columns

* split contributions by type to multiple elections when a candidate was in multiple elections

* removed commented code

* create candidate_summary view to associate "Summary" info with specific election

* add total contributions to digest.json

* use hash of hash for contributions by type

* add total contributions by type and source to digests

* take election into account when calculating total contributions and contributions by source

* organize totals calculated from various sources in digests.json

* update digests.json to include more totals

* calculate contribution totals for all tickets (candidates and referendums) combined

* add more totals to digest and separate by contributions vs expenditures
vs loans

* update expenditures to be split on election and other calculations to take election into account

* revert committee contribution list calculator

* some comments about the totals calculated for digests.json

* update digests to only show totals that we want to compare

* add loans to total for contributions by type and origin

* move totals logic out of main

* switch total expenditures calculator to use the new candidate_summary view, which joins Summary with candidates using the from and thru dates instead of the report date. This provides consistency: if we later decide to join on the report date instead, we change it in the view and it applies everywhere.

* add report on candidate totals

* attempt to get python 3.9 to be used

* don't use sudo for pip install

* remove unused var in calculator

* match up calculator with master branch

* upgrade csvkit

* match schema to latest inferred by old csvkit

* make sure we are pushing to the same branch when deploying build

* specify the branch to push to for travis auto-deploy

* add schema.sql file

* don't deploy build on pull request build

* increase size of filer name for committees

* clean up whitespace for some more candidate columns

* remove whitespace from referendums summary

* remove commented out line

* combine removal of leading and trailing whitespace into a single update

* update build with recent fixes from main branch

* re-use code to create table in bin/import-file

* clean up request to dump database schema

* pick committee distinct on filer ID according to order of value in election column (see the SQL sketch after this list)

* remove check for Ballot_Measure_Election when looking for committee name since it wasn't checked for before

* change image used for workflow to generate website data to match version for pg_dump

* set dev container and github actions to use the same postgres version

* try action checkout v4

* print out some dir info to figure out why git thinks it is not a repo

* cause early git failure so we can try to fix it

* remove tab from github workflow file

* show version of key components when cleaning

* add place to insert new downloads

* get image to be created with new branch and don't use the image during the create event

* add explicit check for docker image in order to run jobs that require it

* log in to docker early

* build container if it's not there

* try increasing size of filer name col

* put shared postgres settings in global env vars

* clean up dev container

* add post-create-command.sh back

* remove pwd in Dockerfile

* write csv from polars dataframe

* merge requirements for netfile v2 into main requirements file

* allow committee id to be null in H-Loan data

* remove copy of download/requirements.txt from Dockerfile

* move new data to be imported to a different target in Makefile

* provide means to switch to ruby 2.7.1 if needed

* remove whitespace from data_warning column

* make data_warning empty instead of null

* make make-null-empty executable

* maintain a consistent order for the candidates report

* make null empty for data_warning in committees

* output consistent order to ensure that output doesn't change when postgres version changes

* set null committee name to empty string so that we can get consistent outputs when the postgres version changes

* use floats everywhere when calculating totals in create-digests

* increase column size for instagram column in candidates table

* add some additional totals for oakland-2024 election in digests.json to help with debugging

* change election name in digests.json to include full date to correctly capture multiple elections in the same year

* remove commented postgres 9.6 in workflow

* only run netfile v2 download when credentials are set up
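A note on the "distinct on filer ID" message above: in postgres this kind of pick is typically a DISTINCT ON query. The sketch below is illustrative only; the table and column names (committees, Filer_ID, election) are assumptions drawn from the Makefile and these messages, not the committed query.

cat <<QUERY | psql disclosure-backend
SELECT DISTINCT ON ("Filer_ID") *
FROM committees
ORDER BY "Filer_ID", "election";
QUERY

DISTINCT ON keeps the first row per "Filer_ID" in ORDER BY order, so the sort on "election" decides which duplicate row wins.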
ChenglimEar authored and ckingbailey committed Dec 12, 2024
1 parent 8f89882 commit a9ada3b
Showing 26 changed files with 363 additions and 81 deletions.
19 changes: 19 additions & 0 deletions .devcontainer/install-ruby-2.7.1.sh
@@ -0,0 +1,19 @@
#!/bin/bash --login

# Here's a way to install old ruby 2.7.1 using rvm on Debian bookworm
# https://github.com/rvm/rvm/issues/5209

sudo apt install build-essential
cd ~/Downloads
wget https://www.openssl.org/source/openssl-1.1.1t.tar.gz
tar zxvf openssl-1.1.1t.tar.gz
cd openssl-1.1.1t
./config --prefix=$HOME/.openssl/openssl-1.1.1t --openssldir=$HOME/.openssl/openssl-1.1.1t
make
make install
rm -rf ~/.openssl/openssl-1.1.1t/certs
ln -s /etc/ssl/certs ~/.openssl/openssl-1.1.1t/certs
cd ~
rvm install ruby-2.7.1 --with-openssl-dir=$HOME/.openssl/openssl-1.1.1t # replace ruby-x.x.x to install other older versions

rvm use 2.7.1
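Hypothetical usage, assuming rvm is already installed in the dev container (the script itself only calls rvm at the end):

# Run with a login shell so rvm is loaded into the environment.
bash --login .devcontainer/install-ruby-2.7.1.sh
rvm use 2.7.1
ruby -v   # expect ruby 2.7.1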
120 changes: 103 additions & 17 deletions .github/workflows/main.yml
@@ -1,25 +1,111 @@
# This workflow will later be replaced with logic to "Generate Website Data"
# The verify-gdrive.yml workflow file will be renamed to this one
# We have to introduce this change in steps because GitHub gets confused until
# we add the new workflow file to the master branch
name: "Generate Website Data"
on:
workflow_dispatch:
push:
env:
POSTGRES_USER: app_user
POSTGRES_DB: disclosure-backend
POSTGRES_PASSWORD: app_password
jobs:
build:
runs-on: ubuntu-latest
env:
REPO_OWNER: ${{ github.repository_owner}}
REPO_BRANCH: ${{ github.ref_name }}
SERVICE_ACCOUNT_KEY_JSON: ${{ secrets.SERVICE_ACCOUNT_KEY_JSON }}
GDRIVE_FOLDER: ${{ vars.GDRIVE_FOLDER }}
outputs:
devcontainer: ${{ steps.filter.outputs.devcontainer }}
noncontainer: ${{ steps.filter.outputs.noncontainer }}
steps:
- name: Login to GitHub Container Registry
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{github.actor}}
password: ${{secrets.GITHUB_TOKEN}}
- uses: actions/checkout@v3
- name: Get changed files
id: changed-files
uses: tj-actions/changed-files@v40
- name: List all changed files
id: filter
run: |
echo ${{github.event_name}}
noncontainer=true
if docker pull ghcr.io/caciviclab/disclosure-backend-static/${{github.ref_name}}:latest; then
devcontainer=false
else
devcontainer=true
fi
for file in ${{ steps.changed-files.outputs.all_changed_files }}; do
echo "$file was changed"
if [[ ${{github.event_name}} = push ]]; then
if [[ $file = .devcontainer* ]]; then
devcontainer=true
elif [[ $file = *requirements.txt* ]]; then
devcontainer=true
elif [[ $file = Gemfile* ]]; then
devcontainer=true
fi
fi
done
echo "devcontainer=$devcontainer" >> $GITHUB_OUTPUT
echo "noncontainer=$noncontainer" >> $GITHUB_OUTPUT
- name: Build dev container
if: steps.filter.outputs.devcontainer == 'true'
run: |
docker build --no-cache --tag ghcr.io/caciviclab/disclosure-backend-static/${{github.ref_name}}:latest -f ./.devcontainer/Dockerfile .
docker push ghcr.io/caciviclab/disclosure-backend-static/${{github.ref_name}}:latest
- name: Check code changes
if: steps.filter.outputs.noncontainer == 'true'
run: |
echo "TODO: run test to verify that code changes are good"
generate:
needs: build
if: needs.build.outputs.noncontainer == 'true'
runs-on: ubuntu-latest
container:
image: ghcr.io/caciviclab/disclosure-backend-static/${{github.ref_name}}:latest
credentials:
username: ${{ github.actor }}
password: ${{ secrets.github_token }}
env:
REPO_OWNER: ${{ github.repository_owner}}
REPO_BRANCH: ${{ github.ref_name }}
SERVICE_ACCOUNT_KEY_JSON: ${{ secrets.SERVICE_ACCOUNT_KEY_JSON }}
GDRIVE_FOLDER: ${{ vars.GDRIVE_FOLDER }}
PGHOST: postgres
PGDATABASE: ${{ env.POSTGRES_DB }}
PGUSER: ${{ env.POSTGRES_USER }}
PGPASSWORD: ${{ env.POSTGRES_PASSWORD }}
services:
postgres:
image: postgres:15.6-bullseye
env:
POSTGRES_USER: ${{ env.POSTGRES_USER }}
POSTGRES_DB: ${{ env.POSTGRES_DB }}
POSTGRES_PASSWORD: ${{ env.POSTGRES_PASSWORD }}
steps:
- uses: actions/checkout@v4
- name: Check setup
run: |
git -v
# Without this, git reports "dubious ownership" and refuses to treat the checkout as a repo even though a .git dir exists
git config --global --add safe.directory "$GITHUB_WORKSPACE"
psql -l
echo "c1,c2" > test.csv
echo "a,b" >> test.csv
cat test.csv
csvsql -v --db postgresql:///disclosure-backend --insert test.csv
echo "List tables"
psql -c "SELECT * FROM pg_catalog.pg_tables WHERE schemaname != 'pg_catalog' AND schemaname != 'information_schema';"
pip show sqlalchemy
- name: Create csv files
run: |
make clean
make download
make import
make process
- name: Summarize results
run: |
echo "List tables"
psql -c "SELECT * FROM pg_catalog.pg_tables WHERE schemaname != 'pg_catalog' AND schemaname != 'information_schema';"
10 changes: 8 additions & 2 deletions .github/workflows/verify-gdrive.yml
@@ -4,15 +4,21 @@ on:
jobs:
check:
runs-on: ubuntu-latest
container:
image: ghcr.io/caciviclab/disclosure-backend-static/${{github.ref_name}}:latest
credentials:
username: ${{ github.actor }}
password: ${{ secrets.github_token }}

env:
REPO_OWNER: ${{ github.repository_owner}}
REPO_BRANCH: ${{ github.ref_name }}
SERVICE_ACCOUNT_KEY_JSON: ${{ secrets.SERVICE_ACCOUNT_KEY_JSON }}
GDRIVE_FOLDER: ${{ vars.GDRIVE_FOLDER }}
steps:
- uses: actions/checkout@v3
- run: pip install -r gdrive_requirements.txt
- name: Test pull from gdrive
run: python test_pull_from_gdrive.py
- name: Archive pulled files
uses: actions/upload-artifact@v3
with:
38 changes: 33 additions & 5 deletions Makefile
@@ -6,11 +6,23 @@ CSV_PATH?=downloads/csv
CD := $(shell pwd)
WGET=bin/wget-wrapper --no-verbose --tries=3

ifdef SERVICE_ACCOUNT_KEY_JSON
NETFILE_V2_DOWNLOAD=download-netfile-v2
NETFILE_V2_IMPORT=import-new-data
else ifneq ("$(wildcard .local/SERVICE_ACCOUNT_KEY_JSON.json)","")
NETFILE_V2_DOWNLOAD=download-netfile-v2
NETFILE_V2_IMPORT=import-new-data
endif

clean-spreadsheets:
rm -rf downloads/csv/*.csv downloads/csv/office_elections.csv downloads/csv/measure_committees.csv downloads/csv/elections.csv

clean:
rm -rf downloads/raw downloads/csv .local/downloads .local/csv
git --version
python --version
ruby --version
psql --version

process: process.rb
# todo: remove RUBYOPT variable when activerecord fixes deprecation warnings
@@ -21,6 +33,9 @@ process: process.rb
bin/report-candidates
git --no-pager diff build/digests.json

download-netfile-v2:
python download/main.py

download-spreadsheets: downloads/csv/candidates.csv downloads/csv/committees.csv \
downloads/csv/referendums.csv downloads/csv/name_to_number.csv \
downloads/csv/office_elections.csv downloads/csv/elections.csv
@@ -36,7 +51,8 @@ upload-cache:
tar czf - downloads/csv downloads/static downloads/cached-db \
| aws s3 cp - s3://odca-data-cache/$(shell date +%Y-%m-%d).tar.gz --acl public-read

download: $(NETFILE_V2_DOWNLOAD) \
download-spreadsheets \
download-COAK-2014 download-COAK-2015 download-COAK-2016 \
download-COAK-2017 download-COAK-2018 \
download-COAK-2019 download-COAK-2020 \
@@ -81,13 +97,16 @@ do-import-spreadsheets:
./bin/remove-whitespace $(DATABASE_NAME) candidates Instagram
./bin/remove-whitespace $(DATABASE_NAME) candidates Twitter
./bin/remove-whitespace $(DATABASE_NAME) candidates Bio
./bin/make-null-empty $(DATABASE_NAME) candidates data_warning
./bin/make-null-empty $(DATABASE_NAME) candidates Committee_Name

echo 'DROP TABLE IF EXISTS referendums CASCADE;' | psql $(DATABASE_NAME)
./bin/create-table $(DATABASE_NAME) $(CSV_PATH) referendums
csvsql --db postgresql:///$(DATABASE_NAME) --insert --no-create --no-inference $(CSV_PATH)/referendums.csv
echo 'ALTER TABLE "referendums" ADD COLUMN id SERIAL PRIMARY KEY;' | psql $(DATABASE_NAME)
./bin/remove-whitespace $(DATABASE_NAME) referendums Short_Title
./bin/remove-whitespace $(DATABASE_NAME) referendums Summary
./bin/make-null-empty $(DATABASE_NAME) referendums data_warning

echo 'DROP TABLE IF EXISTS name_to_number CASCADE;' | psql $(DATABASE_NAME)
./bin/create-table $(DATABASE_NAME) $(CSV_PATH) name_to_number
@@ -98,6 +117,8 @@ do-import-spreadsheets:
csvsql --db postgresql:///$(DATABASE_NAME) --insert --no-create --no-inference $(CSV_PATH)/committees.csv
echo 'ALTER TABLE "committees" ADD COLUMN id SERIAL PRIMARY KEY;' | psql $(DATABASE_NAME)
./bin/remove-whitespace $(DATABASE_NAME) committees Filer_NamL
./bin/make-null-empty $(DATABASE_NAME) committees Filer_NamL
./bin/make-null-empty $(DATABASE_NAME) committees data_warning

echo 'DROP TABLE IF EXISTS office_elections CASCADE;' | psql $(DATABASE_NAME)
./bin/create-table $(DATABASE_NAME) $(CSV_PATH) office_elections
@@ -110,9 +131,7 @@ do-import-spreadsheets:
csvsql --db postgresql:///$(DATABASE_NAME) --insert --no-create --no-inference downloads/csv/elections.csv
echo 'ALTER TABLE "elections" ADD COLUMN id SERIAL PRIMARY KEY;' | psql $(DATABASE_NAME)

import-data: import-old-data $(NETFILE_V2_IMPORT)
echo 'CREATE TABLE IF NOT EXISTS "calculations" (id SERIAL PRIMARY KEY, subject_id integer, subject_type varchar(30), name varchar(40), value jsonb);' | psql $(DATABASE_NAME)
./bin/remove_duplicate_transactions
./bin/make_view
@@ -124,9 +143,18 @@ recreatedb:
reindex:
ruby search_index.rb

import-new-data: elections_v2 committees_v2 a_contributions_v2

import-old-data: 496 497 A-Contributions B1-Loans B2-Loans C-Contributions \
D-Expenditure E-Expenditure F-Expenses F461P5-Expenditure F465P3-Expenditure \
F496P3-Contributions G-Expenditure H-Loans I-Contributions Summary

496 497 A-Contributions B1-Loans B2-Loans C-Contributions D-Expenditure E-Expenditure F-Expenses F461P5-Expenditure F465P3-Expenditure F496P3-Contributions G-Expenditure H-Loans I-Contributions Summary:
DATABASE_NAME=$(DATABASE_NAME) ./bin/import-file $(CSV_PATH) $@

elections_v2 committees_v2 a_contributions_v2:
DATABASE_NAME=$(DATABASE_NAME) ./bin/import-file $(CSV_PATH) $@ 0

downloads/csv/candidates.csv:
mkdir -p downloads/csv downloads/raw
$(WGET) -O- \
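The import recipes above call ./bin/remove-whitespace and ./bin/make-null-empty, whose bodies this diff does not show. A minimal sketch of what bin/make-null-empty might look like, assuming the <database> <table> <column> argument order used above and the psql-heredoc style of bin/clean below:

#!/bin/bash
# Hypothetical sketch -- not the committed script.
database_name=$1
table_name=$2
column_name=$3

# NULL and '' print and sort differently across postgres versions;
# normalizing to '' keeps the generated output stable.
cat <<QUERY | psql "$database_name"
UPDATE "$table_name"
SET "$column_name" = ''
WHERE "$column_name" IS NULL;
QUERY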
2 changes: 2 additions & 0 deletions bin/clean
@@ -18,4 +18,6 @@ cat <<-QUERY | psql ${database_name}
DELETE FROM "$table_name"
WHERE "Tran_Date" is NULL;
QUERY
else
echo
fi
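Only the tail of bin/clean appears in this hunk. A hedged reconstruction of the surrounding conditional, assuming the script takes <database> <table> and skips tables that lack a Tran_Date column:

#!/bin/bash
# Hypothetical reconstruction -- only the DELETE and the else branch
# above are confirmed by this diff.
database_name=$1
table_name=$2

# Only transaction tables carry Tran_Date; skip the rest gracefully
# so `make import` doesn't fail on records missing transaction data.
if psql "${database_name}" -c "\d \"${table_name}\"" | grep -q Tran_Date; then
cat <<-QUERY | psql ${database_name}
DELETE FROM "$table_name"
WHERE "Tran_Date" is NULL;
QUERY
else
echo
fi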