Refactor: Benefits Amplitude events #3468

Merged
merged 9 commits into from
Oct 25, 2024

thekaveman
Member

@thekaveman thekaveman commented Sep 19, 2024

Description

We recently completed a big refactor of the models in Benefits, see cal-itp/benefits#1666 for more background.

The last piece of this refactor is updating our new and historic analytics events. The following PRs update the logic for generating new events:

And this PR is for the warehouse side, to handle the new fields and adjust historical data already captured in GCS.

We don't want to merge this PR until all of the above PRs are merged and released to our prod environment.

Closes cal-itp/benefits#2247
Closes cal-itp/benefits#2248
Closes cal-itp/benefits#2249
Closes cal-itp/benefits#2390

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

How has this been tested?

$ poetry run dbt run -s +fct_benefits_events
19:38:34  Running with dbt=1.5.1
19:38:35  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.calitp_warehouse.mart.ad_hoc
19:38:35  Found 420 models, 950 tests, 0 snapshots, 0 analyses, 852 macros, 0 operations, 12 seed files, 175 sources, 4 exposures, 0 metrics, 0 groups
19:38:35  
19:39:53  Concurrency: 8 threads (target='dev')
19:39:53  
19:39:53  1 of 2 START sql view model kegan_staging.stg_amplitude__benefits_events ....... [RUN]
19:39:54  1 of 2 OK created sql view model kegan_staging.stg_amplitude__benefits_events .. [CREATE VIEW (0 processed) in 1.22s]
19:39:54  2 of 2 START sql table model kegan_mart_benefits.fct_benefits_events ........... [RUN]
19:40:15  2 of 2 OK created sql table model kegan_mart_benefits.fct_benefits_events ...... [CREATE TABLE (26.9m rows, 73.1 GiB processed) in 20.37s]
19:40:15  
19:40:15  Finished running 1 view model, 1 table model in 0 hours 1 minutes and 39.82 seconds (99.82s).
19:40:15  
19:40:15  Completed successfully
19:40:15  
19:40:15  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
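
For anyone verifying a run like this without reading the whole log, the closing summary line can be checked mechanically. This is a minimal sketch and not part of the PR: `parse_dbt_summary` and `run_succeeded` are hypothetical helper names, and it assumes the summary keeps the `NAME=count` shape shown in the log above.

```python
import re
from typing import Dict


def parse_dbt_summary(line: str) -> Dict[str, int]:
    """Extract the NAME=count pairs from a dbt summary line into a dict."""
    return {name: int(count) for name, count in re.findall(r"\b([A-Z]+)=(\d+)", line)}


def run_succeeded(line: str) -> bool:
    """Return True only if a summary was found and it reports zero errors."""
    counts = parse_dbt_summary(line)
    return bool(counts) and counts.get("ERROR", 0) == 0


if __name__ == "__main__":
    summary = "19:40:15  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2"
    print(parse_dbt_summary(summary))
    print(run_succeeded(summary))
```

A line with no summary counts is treated as a failure, so a truncated log does not pass by accident.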

Post-merge follow-ups

Document any actions that must be taken post-merge to deploy or otherwise implement the changes in this PR (for example, running a full refresh of some incremental model in dbt). If these actions will take more than a few hours after the merge or if they will be completed by someone other than the PR author, please create a dedicated follow-up issue and link it here to track resolution.

  • Go through any charts etc. using the deprecated fields, and update to the new fields
  • Go through any charts etc. using old values for e.g. eligibility_verifier, and update to the new values
  • Delete the deprecated fields


github-actions bot commented Sep 19, 2024

Warehouse report 📦

DAG

Legend (in order of precedence)

| Resource type | Indicator | Resolution |
|---|---|---|
| Large table-materialized model | Orange | Make the model incremental |
| Large model without partitioning or clustering | Orange | Add partitioning and/or clustering |
| View with more than one child | Yellow | Materialize as a table or incremental |
| Incremental | Light green | |
| Table | Green | |
| View | White | |

Member

@angela-tran angela-tran left a comment

Thank you for the code walkthrough and explanations! These changes look good to me. 👍

Member

@evansiroky evansiroky left a comment

Can you output the logs of dbt run to ensure this works properly? See #3502 for an example of how this is done.

@thekaveman
Member Author

thekaveman commented Oct 17, 2024

> Can you output the logs of dbt run to ensure this works properly? See #3502 for an example of how this is done.

@evansiroky @vevetron I'm following these instructions: https://github.com/cal-itp/data-infra/blob/main/warehouse/README.md

And I have to say, this is just a brutal developer experience...

  • Need poetry installed to be able to run poetry install
  • Need graphviz installed to be able to install pygraphviz (from poetry install)
  • All instructions assume brew, which is a macOS tool. I'm on Linux.
  • None of this works from within the included devcontainer config

Does everyone run this on a Mac? I've tried to update the devcontainer to be able to get all this running locally. I got as far as:

  • installing poetry and brew
  • running brew install graphviz
  • using the workaround export CFLAGS... and export LDFLAGS mentioned in the above README

But I still get an error when running poetry install at the pygraphviz step:

/workspaces/data-infra/warehouse$ echo $CFLAGS
-I /home/linuxbrew/.linuxbrew/opt/graphviz/include

/workspaces/data-infra/warehouse$ echo $LDFLAGS
-L /home/linuxbrew/.linuxbrew/opt/graphviz/lib

/workspaces/data-infra/warehouse$ poetry install
The currently activated Python version 3.8.17 is not supported by the project (~3.9).
Trying to find and use a compatible version. 
Using python3.9 (3.9.2)
Installing dependencies from lock file

Package operations: 1 install, 0 updates, 0 removals

  - Installing pygraphviz (1.11): Failed

...

creating build/temp.linux-x86_64-cpython-39/pygraphviz
  x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -ffile-prefix-map=/build/python3.9-RNBry6/python3.9-3.9.2=. -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -I /home/linuxbrew/.linuxbrew/opt/graphviz/include -fPIC -DSWIG_PYTHON_STRICT_BYTE_CHAR -I/tmp/tmpuiqe4_ep/.venv/include -I/usr/include/python3.9 -c pygraphviz/graphviz_wrap.c -o build/temp.linux-x86_64-cpython-39/pygraphviz/graphviz_wrap.o
  pygraphviz/graphviz_wrap.c:168:11: fatal error: Python.h: No such file or directory
    168 | # include <Python.h>
        |           ^~~~~~~~~~

Any idea how to get this working?
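
The `fatal error: Python.h: No such file or directory` line means the compiler could not find the CPython development headers for the interpreter that poetry selected (python3.9). That is separate from the graphviz include path set in `CFLAGS`. On Debian-based images these headers usually come from a separate development package; that is an assumption about this devcontainer and is not confirmed anywhere in the thread. A minimal sketch for checking whether a given interpreter has its headers, using only the standard library (`python_header_path` and `has_python_headers` are hypothetical helper names):

```python
import sysconfig
from pathlib import Path


def python_header_path() -> Path:
    """Return the location where this interpreter expects Python.h to live."""
    return Path(sysconfig.get_paths()["include"]) / "Python.h"


def has_python_headers() -> bool:
    """Return True if Python.h exists for the interpreter running this script."""
    return python_header_path().is_file()


if __name__ == "__main__":
    # Run this with the same interpreter poetry picked (python3.9 in the log above).
    print(f"Expected header: {python_header_path()}")
    print(f"Present: {has_python_headers()}")
```

If this reports that the header is absent, C extensions such as pygraphviz cannot be compiled for that interpreter, regardless of the graphviz flags.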

@thekaveman
Member Author

Alternatively, if you all are already set up to run these dbt commands for verification, that would be really helpful.

@vevetron
Contributor

I think everyone who works with dbt right now uses either a local Mac or JupyterHub to run and test changes. Linux should work as well, but I don't think anyone is using devcontainers.

@thekaveman
Member Author

thekaveman commented Oct 18, 2024

Thanks @vevetron. I got hold of a MacBook and got as far as running poetry run dbt debug, but it gave me this output:

(.venv) kegans-MBP:warehouse kegan$ poetry run dbt debug
20:17:52  Running with dbt=1.5.1
20:17:52  dbt version: 1.5.1
20:17:52  python version: 3.9.6
20:17:52  python path: /Users/kegan/git/data-infra/warehouse/.venv/bin/python
20:17:52  os info: macOS-14.2-arm64-arm-64bit
20:17:52  Using profiles.yml file at /Users/kegan/.dbt/profiles.yml
20:17:52  Using dbt_project.yml file at /Users/kegan/git/data-infra/warehouse/dbt_project.yml
20:17:52  Configuration:
20:17:52  Error importing adapter: No module named 'dbt.adapters.bigquery'
20:17:52    profiles.yml file [ERROR invalid]
20:17:52    dbt_project.yml file [OK found and valid]
20:17:52  Required dependencies:
20:17:52   - git [OK found]

20:17:52  1 check failed:
20:17:52  Profile loading failed for the following reason:
Runtime Error
  Credentials in profile "calitp_warehouse", target "dev" invalid: Runtime Error
    Could not find adapter type bigquery!

My ~/.dbt/profiles.yml file looks like:

calitp_warehouse:
  outputs:
    dev:
      dataproc_batch:
        runtime_config:
          container_image: gcr.io/cal-itp-data-infra/dbt-spark:2023.3.28
          properties:
            spark.dynamicAllocation.maxExecutors: '16'
            spark.executor.cores: '4'
            spark.executor.instances: '4'
            spark.executor.memory: 4g
      dataproc_region: us-west2
      fixed_retries: 1
      gcs_bucket: test-calitp-dbt-python-models
      location: us-west2
      maximum_bytes_billed: 2000000000000
      method: oauth
      priority: interactive
      project: cal-itp-data-infra-staging
      schema: kegan
      submission_method: serverless
      threads: 8
      timeout_seconds: 3000
      type: bigquery
  target: dev

And bq ls produces output that suggests I do have a connection:

                datasetId                 
 ---------------------------------------- 
  airtable                                
  amplitude                               
  audit                                   
  calitp_py                               
  charlie                                 
  charlie_dbt_test__audit                 
  charlie_gtfs_schedule                   
  charlie_gtfs_views_staging              
  charlie_intermediate                    
  charlie_mart_ad_hoc                     
  charlie_mart_agency_service             
  charlie_mart_feed_aggregator_checks     
  charlie_mart_gtfs                       
  charlie_mart_gtfs_guidelines            
  charlie_mart_gtfs_quality               
  charlie_mart_ntd                        
  charlie_mart_payments                   
  charlie_mart_transit_database           
  charlie_payments                        
  charlie_staging                         
  charlie_views                           
  christian                               
  christian_mart_ad_hoc                   
  christian_mart_audit                    
  christian_mart_benefits                 
  christian_mart_gtfs                     
  christian_mart_gtfs_quality             
  christian_mart_gtfs_schedule_latest     
  christian_mart_ntd                      
  christian_mart_payments                 
  christian_mart_transit_database         
  christian_mart_transit_database_latest  
  christian_staging                       
  ci_staging                              
  eric                                    
  eric_mart_ad_hoc                        
  eric_mart_audit                         
  eric_mart_benefits                      
  eric_mart_gtfs                          
  eric_mart_gtfs_quality                  
  eric_mart_gtfs_schedule_latest          
  eric_mart_ntd                           
  eric_mart_payments                      
  eric_mart_transit_database              
  eric_mart_transit_database_latest       
  eric_payments                           
  eric_staging                            
  eric_views                              
  erika                                   
  erika_dbt_test__audit

Will come back to this a little later and look into it more.

@vevetron
Contributor

Your profiles.yml looks exactly the same as mine. My debug statement is almost the same as well.

Maybe retry poetry install? Or pip install dbt-bigquery? Or maybe it's running in the wrong environment.
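
One way to tell a missing adapter apart from a wrong environment is to ask the interpreter directly which executable is running and whether it can see the module named in the error. This is a minimal sketch using only the standard library; `module_available` is a hypothetical helper name, and the module name is copied from the `dbt debug` error above.

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """Return True if `name` can be located by the current interpreter."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (for example `dbt` itself) is missing.
        return False


if __name__ == "__main__":
    # Compare this path with the "python path" line printed by `dbt debug`.
    print(f"Interpreter: {sys.executable}")
    print(f"dbt.adapters.bigquery importable: {module_available('dbt.adapters.bigquery')}")
```

Running it through `poetry run python` and through a plain `python` shows whether the two resolve to different environments.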

@thekaveman
Member Author

Finally got it running!

I am seeing the same error output that you showed:

$ poetry run dbt run -s +fct_benefits_events
19:32:16  Running with dbt=1.5.1
19:32:16  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.calitp_warehouse.mart.ad_hoc
19:32:17  Found 420 models, 950 tests, 0 snapshots, 0 analyses, 852 macros, 0 operations, 12 seed files, 175 sources, 4 exposures, 0 metrics, 0 groups
19:32:17  
19:32:20  Concurrency: 8 threads (target='dev')
19:32:20  
19:32:20  1 of 2 START sql view model kegan_staging.stg_amplitude__benefits_events ....... [RUN]
19:32:21  1 of 2 OK created sql view model kegan_staging.stg_amplitude__benefits_events .. [CREATE VIEW (0 processed) in 1.26s]
19:32:21  2 of 2 START sql table model kegan_mart_benefits.fct_benefits_events ........... [RUN]
19:32:23  BigQuery adapter: https://console.cloud.google.com/bigquery?project=cal-itp-data-infra-staging&j=bq:us-west2:ee6d3a66-62ef-49c5-818c-709b8d75e98a&page=queryresults
19:32:23  2 of 2 ERROR creating sql table model kegan_mart_benefits.fct_benefits_events .. [ERROR in 2.17s]
19:32:23  
19:32:23  Finished running 1 view model, 1 table model in 0 hours 0 minutes and 6.64 seconds (6.64s).
19:32:23  
19:32:23  Completed with 1 error and 0 warnings:
19:32:23  
19:32:23  Database Error in model fct_benefits_events (models/mart/benefits/fct_benefits_events.sql)
19:32:23    Unrecognized name: event_properties_claims_provider at [158:9]
19:32:23    compiled Code at target/run/calitp_warehouse/models/mart/benefits/fct_benefits_events.sql
19:32:23  
19:32:23  Done. PASS=1 WARN=0 ERROR=1 SKIP=0 TOTAL=2

Will work on getting these corrected.
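
The `[158:9]` in the database error is a line and column position in the compiled SQL file named in the log, not in the model source. A minimal sketch for jumping to that spot, assuming the position is 1-based for both line and column (`parse_position` and `show_position` are hypothetical helper names):

```python
import re
from typing import Tuple


def parse_position(message: str) -> Tuple[int, int]:
    """Pull the first [line:column] pair out of a database error message."""
    match = re.search(r"\[(\d+):(\d+)\]", message)
    if match is None:
        raise ValueError("no [line:column] position found in message")
    return int(match.group(1)), int(match.group(2))


def show_position(sql: str, line: int, column: int) -> str:
    """Return the offending line followed by a caret under the given column."""
    source_line = sql.splitlines()[line - 1]  # positions are assumed 1-based
    return f"{source_line}\n{' ' * (column - 1)}^"


if __name__ == "__main__":
    error = "Unrecognized name: event_properties_claims_provider at [158:9]"
    line, column = parse_position(error)
    # Path copied from the log above; run from the warehouse directory.
    path = "target/run/calitp_warehouse/models/mart/benefits/fct_benefits_events.sql"
    with open(path, encoding="utf-8") as compiled:
        print(show_position(compiled.read(), line, column))
```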

@thekaveman thekaveman requested a review from vevetron October 21, 2024 19:42
@thekaveman
Member Author

thekaveman commented Oct 21, 2024

@vevetron I updated the PR description with the results of running locally, which is now passing.

@thekaveman thekaveman force-pushed the refactor/benefits-events branch from 83ff624 to 34aa4a0 Compare October 25, 2024 18:38
@thekaveman thekaveman merged commit dfab12a into main Oct 25, 2024
4 checks passed
@thekaveman thekaveman deleted the refactor/benefits-events branch October 25, 2024 18:43
4 participants