Add workflow to gen data (#329)
* run main workflow in container

* remove container action

* try running existing make code in container

* add postgres to generation job

* set up access to postgres

* config postgres service

* remove misplaced 'if' in workflow

* set password for connecting to postgres and list db at start to confirm availability

* separate out step to check postgres so that we can easily check the output

* remove unused script

* remove new version of sqlalchemy since it breaks csvsql in `make import`

* add some conditions for job execution in main workflow

* add check for version of sqlalchemy

* get container to rebuild when requirements change

* adjust logic for when jobs run in main workflow

* try summarizing tables created in main workflow

* make sure we clean up downloads before we start

* try testing csvsql early

* some logging to test csvsql

* turn on verbose output while testing csvsql

* try sending csv file through stdin for csvsql

* upgrade csvkit to 1.3.0 and upgraded its dependencies where needed

* remove files no longer needed in download dir

* set postgres version in dev container and workflow to 9.6 to match Travis CI

* update workflow names

* make use of sql files to create tables

* Update import-file to display schema of created table

* Update import-file to log more info about postgresql tables

* Update import-file to point psql to DATABASE_NAME

* Update import-file to use the right quote around table name in psql

* Update import-file to remove debug logging

* Update Makefile to use saved sql for creating tables from spreadsheet data

* fix Makefile by moving bash into file and saved generated sql for tables that hold spreadsheet data

* some fixes to get csvkit 1.3.0 working - not fully working yet...

* make sure data upload for spreadsheet data does not use inference (i.e., alter data) and increase length of filer name field for committees

* debug version of csvkit installed

* verify python version at time of install on travis

* remove sudo for pip install

* remove download/main.py dependency on latest version of sqlalchemy

* use later postgres

* update postgres for dev container also

* download new netfile csvs before import

* gracefully handle records missing transaction data

* add netfile v2 data to database during import

* make sure dir exists for saving v2 csv files

* make netfile v2 download a part of `make download`

* add requirements for netfile v2 code

* update python-dateutil

* try to cause failure when pip install fails

* upgrade babel

* update pytz

* allow csvkit to pull in the correct agate dependencies and add script to trim whitespace for some columns

* remove whitespace for some key columns

* split contributions by type to multiple elections when a candidate was in multiple elections

* removed commented code

* create candidate_summary view to associate "Summary" info with specific election

* add total contributions to digest.json

* use hash of hash for contributions by type

* add total contributions by type and source to digests

* take election into account when calculating total contributions and contributions by source

* organize totals calculated from various sources in digests.json

* update digests.json to include more totals

* calculate contribution totals for all tickets (candidates and referendums) combined

* add more totals to digest and separate by contributions vs expenditures
vs loans

* update expenditures to be split on election and other calculations to take election into account

* revert committee contribution list calculator

* some comments about the totals calculated for digests.json

* update digests to only show totals that we want to compare

* add loans to total for contributions by type and origin

* move totals logic out of main

* switch total expenditures calculator to use the new candidate_summary view, which joins Summary with candidates using the from and thru dates instead of the report date. This provides consistency: if we later decide to join on the report date instead, we change it in the view and it applies everywhere.

* add report on candidate totals

* attempt to get python 3.9 to be used

* don't use sudo for pip install

* remove unused var in calculator

* match up calculator with master branch

* upgrade csvkit

* match schema to latest inferred by old csvkit

* make sure we are pushing to the same branch when deploying build

* specify the branch to push to for travis auto-deploy

* add schema.sql file

* don't deploy build on pull request build

* increase size of filer name for committees

* clean up whitespace for some more candidate columns

* remove whitespace from referendums summary

* remove commented out line

* combine removal of leading and trailing whitespace into a single update

* update build with recent fixes from main branch

* re-use code to create table in bin/import-file

* clean up request to dump database schema

* pick committee distinct on filer ID according to order of value in election column (see the SQL sketch after this list)

* remove check for Ballot_Measure_Election when looking for committee name since it wasn't checked for before

* change image used for workflow to generate website data to match version for pg_dump

* set dev container and github actions to use the same postgres version

* try action checkout v4

* print out some dir info to figure out why git thinks it is not a repo

* cause early git failure so we can try to fix it

* remove tab from github workflow file

* show version of key components when cleaning

* add place to insert new downloads

* get image to be created with new branch and don't use the image during the create event

* add explicit check for docker image in order to run jobs that require it

* log in to docker early

* build container if it's not there

* try increasing size of filer name col

* put shared postgres settings in global env vars

* clean up dev container

* add post-create-command.sh back

* remove pwd in Dockerfile

* write csv from polars dataframe

* merge requirements for netfile v2 into main requirements file

* allow committee id to be null in H-Loan data

* remove copy of download/requirements.txt from Dockerfile

* move new data to be imported to a different target in Makefile

* provide means to switch to ruby 2.7.1 if needed

* remove whitespace from data_warning column

* make data_warning empty instead of null

* make make-null-empty executable

* maintain a consistent order for the candidates report

* make null empty for data_warning in committees

* output consistent order to ensure that output doesn't change when postgres version changes

* set null committee name to empty string so that we can get consistent outputs when the postgres version changes

* use floats everywhere when calculating totals in create-digests

* increase column size for instagram column in candidates table

* add some additional totals for oakland-2024 election in digests.json to help with debugging

* change election name in digests.json to include full date to correctly capture multiple elections in the same year

* remove commented postgres 9.6 in workflow

* only run netfile v2 download when credentials are set up
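A note on the "distinct on filer ID" message above: in postgres this kind of pick is typically a DISTINCT ON query. The sketch below is illustrative only; the table and column names (committees, Filer_ID, election) are assumptions drawn from the Makefile and these messages, not the committed query.

cat <<QUERY | psql disclosure-backend
SELECT DISTINCT ON ("Filer_ID") *
FROM committees
ORDER BY "Filer_ID", "election";
QUERY

DISTINCT ON keeps the first row per "Filer_ID" in ORDER BY order, so the sort on "election" decides which duplicate row wins.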
ChenglimEar authored and ckingbailey committed Dec 12, 2024
1 parent 8f89882 commit a9ada3b
Showing 26 changed files with 363 additions and 81 deletions.
19 changes: 19 additions & 0 deletions .devcontainer/install-ruby-2.7.1.sh
@@ -0,0 +1,19 @@
#!/bin/bash --login

# Here's a way to install old ruby 2.7.1 using rvm on Debian bookworm
# https://github.com/rvm/rvm/issues/5209

sudo apt install build-essential
cd ~/Downloads
wget https://www.openssl.org/source/openssl-1.1.1t.tar.gz
tar zxvf openssl-1.1.1t.tar.gz
cd openssl-1.1.1t
./config --prefix=$HOME/.openssl/openssl-1.1.1t --openssldir=$HOME/.openssl/openssl-1.1.1t
make
make install
rm -rf ~/.openssl/openssl-1.1.1t/certs
ln -s /etc/ssl/certs ~/.openssl/openssl-1.1.1t/certs
cd ~
rvm install ruby-2.7.1 --with-openssl-dir=$HOME/.openssl/openssl-1.1.1t # replace ruby-x.x.x to install other older versions

rvm use 2.7.1
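Hypothetical usage, assuming rvm is already installed in the dev container (the script itself only calls rvm at the end):

# Run with a login shell so rvm is loaded into the environment.
bash --login .devcontainer/install-ruby-2.7.1.sh
rvm use 2.7.1
ruby -v   # expect ruby 2.7.1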
120 changes: 103 additions & 17 deletions .github/workflows/main.yml
@@ -1,25 +1,111 @@
# This workflow will later be replaced with logic to "Generate Website Data"
# The verify-gdrive.yml workflow file will be renamed to this one
# We have to introduce this change in steps because GitHub gets confused until
# we add the new workflow file to the master branch
name: "Generate Website Data"
on:
workflow_dispatch:
push:
env:
POSTGRES_USER: app_user
POSTGRES_DB: disclosure-backend
POSTGRES_PASSWORD: app_password
jobs:
build:
runs-on: ubuntu-latest
env:
REPO_OWNER: ${{ github.repository_owner}}
REPO_BRANCH: ${{ github.ref_name }}
SERVICE_ACCOUNT_KEY_JSON: ${{ secrets.SERVICE_ACCOUNT_KEY_JSON }}
GDRIVE_FOLDER: ${{ vars.GDRIVE_FOLDER }}
outputs:
devcontainer: ${{ steps.filter.outputs.devcontainer }}
noncontainer: ${{ steps.filter.outputs.noncontainer }}
steps:
- name: Login to GitHub Container Registry
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{github.actor}}
password: ${{secrets.GITHUB_TOKEN}}
- uses: actions/checkout@v3
- name: Get changed files
id: changed-files
uses: tj-actions/changed-files@v40
- name: List all changed files
id: filter
run: |
echo ${{github.event_name}}
noncontainer=true
if docker pull ghcr.io/caciviclab/disclosure-backend-static/${{github.ref_name}}:latest; then
devcontainer=false
else
devcontainer=true
fi
for file in ${{ steps.changed-files.outputs.all_changed_files }}; do
echo "$file was changed"
if [[ ${{github.event_name}} = push ]]; then
if [[ $file = .devcontainer* ]]; then
devcontainer=true
elif [[ $file = *requirements.txt* ]]; then
devcontainer=true
elif [[ $file = Gemfile* ]]; then
devcontainer=true
fi
fi
done
echo "devcontainer=$devcontainer" >> $GITHUB_OUTPUT
echo "noncontainer=$noncontainer" >> $GITHUB_OUTPUT
- name: Build dev container
if: steps.filter.outputs.devcontainer == 'true'
run: |
docker build --no-cache --tag ghcr.io/caciviclab/disclosure-backend-static/${{github.ref_name}}:latest -f ./.devcontainer/Dockerfile .
docker push ghcr.io/caciviclab/disclosure-backend-static/${{github.ref_name}}:latest
- name: Check code changes
if: steps.filter.outputs.noncontainer == 'true'
run: |
echo "TODO: run test to verify that code changes are good"
generate:
needs: build
if: needs.build.outputs.noncontainer == 'true'
runs-on: ubuntu-latest
container:
image: ghcr.io/caciviclab/disclosure-backend-static/${{github.ref_name}}:latest
credentials:
username: ${{ github.actor }}
password: ${{ secrets.github_token }}
env:
REPO_OWNER: ${{ github.repository_owner}}
REPO_BRANCH: ${{ github.ref_name }}
SERVICE_ACCOUNT_KEY_JSON: ${{ secrets.SERVICE_ACCOUNT_KEY_JSON }}
GDRIVE_FOLDER: ${{ vars.GDRIVE_FOLDER }}
PGHOST: postgres
PGDATABASE: ${{ env.POSTGRES_DB }}
PGUSER: ${{ env.POSTGRES_USER }}
PGPASSWORD: ${{ env.POSTGRES_PASSWORD }}
services:
postgres:
image: postgres:15.6-bullseye
env:
POSTGRES_USER: ${{ env.POSTGRES_USER }}
POSTGRES_DB: ${{ env.POSTGRES_DB }}
POSTGRES_PASSWORD: ${{ env.POSTGRES_PASSWORD }}
steps:
- uses: actions/checkout@v4
- name: Check setup
run: |
git -v
# Without this, git reports "dubious ownership" and refuses to treat the checkout as a repo even though a .git dir exists
git config --global --add safe.directory "$GITHUB_WORKSPACE"
psql -l
echo "c1,c2" > test.csv
echo "a,b" >> test.csv
cat test.csv
csvsql -v --db postgresql:///disclosure-backend --insert test.csv
echo "List tables"
psql -c "SELECT * FROM pg_catalog.pg_tables WHERE schemaname != 'pg_catalog' AND schemaname != 'information_schema';"
pip show sqlalchemy
- name: Create csv files
run: |
make clean
make download
make import
make process
- name: Summarize results
run: |
echo "List tables"
psql -c "SELECT * FROM pg_catalog.pg_tables WHERE schemaname != 'pg_catalog' AND schemaname != 'information_schema';"
10 changes: 8 additions & 2 deletions .github/workflows/verify-gdrive.yml
@@ -4,15 +4,21 @@ on:
jobs:
check:
runs-on: ubuntu-latest
container:
image: ghcr.io/caciviclab/disclosure-backend-static/${{github.ref_name}}:latest
credentials:
username: ${{ github.actor }}
password: ${{ secrets.github_token }}

env:
REPO_OWNER: ${{ github.repository_owner}}
REPO_BRANCH: ${{ github.ref_name }}
SERVICE_ACCOUNT_KEY_JSON: ${{ secrets.SERVICE_ACCOUNT_KEY_JSON }}
GDRIVE_FOLDER: ${{ vars.GDRIVE_FOLDER }}
steps:
- uses: actions/checkout@v3
- run: pip install -r gdrive_requirements.txt
- name: Test pull from gdrive
run: python test_pull_from_gdrive.py
- name: Archive pulled files
uses: actions/upload-artifact@v3
with:
38 changes: 33 additions & 5 deletions Makefile
@@ -6,11 +6,23 @@ CSV_PATH?=downloads/csv
CD := $(shell pwd)
WGET=bin/wget-wrapper --no-verbose --tries=3

ifdef SERVICE_ACCOUNT_KEY_JSON
NETFILE_V2_DOWNLOAD=download-netfile-v2
NETFILE_V2_IMPORT=import-new-data
else ifneq ("$(wildcard .local/SERVICE_ACCOUNT_KEY_JSON.json)","")
NETFILE_V2_DOWNLOAD=download-netfile-v2
NETFILE_V2_IMPORT=import-new-data
endif

clean-spreadsheets:
rm -rf downloads/csv/*.csv downloads/csv/office_elections.csv downloads/csv/measure_committees.csv downloads/csv/elections.csv

clean:
rm -rf downloads/raw downloads/csv .local/downloads .local/csv
git --version
python --version
ruby --version
psql --version

process: process.rb
# todo: remove RUBYOPT variable when activerecord fixes deprecation warnings
@@ -21,6 +33,9 @@ process: process.rb
bin/report-candidates
git --no-pager diff build/digests.json

download-netfile-v2:
python download/main.py

download-spreadsheets: downloads/csv/candidates.csv downloads/csv/committees.csv \
downloads/csv/referendums.csv downloads/csv/name_to_number.csv \
downloads/csv/office_elections.csv downloads/csv/elections.csv
@@ -36,7 +51,8 @@ upload-cache:
tar czf - downloads/csv downloads/static downloads/cached-db \
| aws s3 cp - s3://odca-data-cache/$(shell date +%Y-%m-%d).tar.gz --acl public-read

download: $(NETFILE_V2_DOWNLOAD) \
download-spreadsheets \
download-COAK-2014 download-COAK-2015 download-COAK-2016 \
download-COAK-2017 download-COAK-2018 \
download-COAK-2019 download-COAK-2020 \
@@ -81,13 +97,16 @@ do-import-spreadsheets:
./bin/remove-whitespace $(DATABASE_NAME) candidates Instagram
./bin/remove-whitespace $(DATABASE_NAME) candidates Twitter
./bin/remove-whitespace $(DATABASE_NAME) candidates Bio
./bin/make-null-empty $(DATABASE_NAME) candidates data_warning
./bin/make-null-empty $(DATABASE_NAME) candidates Committee_Name

echo 'DROP TABLE IF EXISTS referendums CASCADE;' | psql $(DATABASE_NAME)
./bin/create-table $(DATABASE_NAME) $(CSV_PATH) referendums
csvsql --db postgresql:///$(DATABASE_NAME) --insert --no-create --no-inference $(CSV_PATH)/referendums.csv
echo 'ALTER TABLE "referendums" ADD COLUMN id SERIAL PRIMARY KEY;' | psql $(DATABASE_NAME)
./bin/remove-whitespace $(DATABASE_NAME) referendums Short_Title
./bin/remove-whitespace $(DATABASE_NAME) referendums Summary
./bin/make-null-empty $(DATABASE_NAME) referendums data_warning

echo 'DROP TABLE IF EXISTS name_to_number CASCADE;' | psql $(DATABASE_NAME)
./bin/create-table $(DATABASE_NAME) $(CSV_PATH) name_to_number
@@ -98,6 +117,8 @@ do-import-spreadsheets:
csvsql --db postgresql:///$(DATABASE_NAME) --insert --no-create --no-inference $(CSV_PATH)/committees.csv
echo 'ALTER TABLE "committees" ADD COLUMN id SERIAL PRIMARY KEY;' | psql $(DATABASE_NAME)
./bin/remove-whitespace $(DATABASE_NAME) committees Filer_NamL
./bin/make-null-empty $(DATABASE_NAME) committees Filer_NamL
./bin/make-null-empty $(DATABASE_NAME) committees data_warning

echo 'DROP TABLE IF EXISTS office_elections CASCADE;' | psql $(DATABASE_NAME)
./bin/create-table $(DATABASE_NAME) $(CSV_PATH) office_elections
@@ -110,9 +131,7 @@ do-import-spreadsheets:
csvsql --db postgresql:///$(DATABASE_NAME) --insert --no-create --no-inference downloads/csv/elections.csv
echo 'ALTER TABLE "elections" ADD COLUMN id SERIAL PRIMARY KEY;' | psql $(DATABASE_NAME)

import-data: import-old-data $(NETFILE_V2_IMPORT)
echo 'CREATE TABLE IF NOT EXISTS "calculations" (id SERIAL PRIMARY KEY, subject_id integer, subject_type varchar(30), name varchar(40), value jsonb);' | psql $(DATABASE_NAME)
./bin/remove_duplicate_transactions
./bin/make_view
@@ -124,9 +143,18 @@ recreatedb:
reindex:
ruby search_index.rb

import-new-data: elections_v2 committees_v2 a_contributions_v2

import-old-data: 496 497 A-Contributions B1-Loans B2-Loans C-Contributions \
D-Expenditure E-Expenditure F-Expenses F461P5-Expenditure F465P3-Expenditure \
F496P3-Contributions G-Expenditure H-Loans I-Contributions Summary

496 497 A-Contributions B1-Loans B2-Loans C-Contributions D-Expenditure E-Expenditure F-Expenses F461P5-Expenditure F465P3-Expenditure F496P3-Contributions G-Expenditure H-Loans I-Contributions Summary:
DATABASE_NAME=$(DATABASE_NAME) ./bin/import-file $(CSV_PATH) $@

elections_v2 committees_v2 a_contributions_v2:
DATABASE_NAME=$(DATABASE_NAME) ./bin/import-file $(CSV_PATH) $@ 0

downloads/csv/candidates.csv:
mkdir -p downloads/csv downloads/raw
$(WGET) -O- \
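The import recipes above call ./bin/remove-whitespace and ./bin/make-null-empty, whose bodies this diff does not show. A minimal sketch of what bin/make-null-empty might look like, assuming the <database> <table> <column> argument order used above and the psql-heredoc style of bin/clean below:

#!/bin/bash
# Hypothetical sketch -- not the committed script.
database_name=$1
table_name=$2
column_name=$3

# NULL and '' print and sort differently across postgres versions;
# normalizing to '' keeps the generated output stable.
cat <<QUERY | psql "$database_name"
UPDATE "$table_name"
SET "$column_name" = ''
WHERE "$column_name" IS NULL;
QUERY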
2 changes: 2 additions & 0 deletions bin/clean
@@ -18,4 +18,6 @@ cat <<-QUERY | psql ${database_name}
DELETE FROM "$table_name"
WHERE "Tran_Date" is NULL;
QUERY
else
echo
fi
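Only the tail of bin/clean appears in this hunk. A hedged reconstruction of the surrounding conditional, assuming the script takes <database> <table> and skips tables that lack a Tran_Date column:

#!/bin/bash
# Hypothetical reconstruction -- only the DELETE and the else branch
# above are confirmed by this diff.
database_name=$1
table_name=$2

# Only transaction tables carry Tran_Date; skip the rest gracefully
# so `make import` doesn't fail on records missing transaction data.
if psql "${database_name}" -c "\d \"${table_name}\"" | grep -q Tran_Date; then
cat <<-QUERY | psql ${database_name}
DELETE FROM "$table_name"
WHERE "Tran_Date" is NULL;
QUERY
else
echo
fi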