Revitalize the collection of PUDL usage metrics #128

Open
18 of 30 tasks
jdangerx opened this issue Jun 10, 2024 · 5 comments · Fixed by #162
Assignees
Labels
datasette Relating to Datasette usage metrics Epic github_actions Pull requests that update GitHub Actions code s3 Relating to S3 usage metrics superset Relating to Superset usage metrics

Comments

@jdangerx
Member

jdangerx commented Jun 10, 2024

Overview

In order to better trace the development of PUDL, the success of our outreach efforts and the effects of our new Superset instance, we need to revitalize the pudl-usage-metrics repository and collect usage metrics from the following sources:

  • S3
  • Datasette (until retirement)
  • Superset
  • Zenodo
  • Kaggle
  • GitHub

We're interested in the following types of metrics to start:

  • how many different IPs are accessing our data via each method?
  • what tables and versions of the data are people accessing?

As a first step, we should be able to ETL the logs and metrics from each of these data sources and get a weekly summary that we can look at. As a second step, we want to hook up our metrics to a private Superset dataset and build some dashboards for easy interpretation.
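As a rough illustration of the weekly summary, here is a minimal sketch that counts distinct IPs per access method per ISO week. The input shape (tuples of timestamp, source name, and IP) and the function name are assumptions made for this example, not the repository's actual schema.

```python
from collections import defaultdict
from datetime import datetime


def weekly_unique_ips(requests):
    """Count distinct IPs per (ISO year, ISO week, source).

    `requests` is an iterable of (timestamp, source, ip) tuples.
    This layout is hypothetical; real logs would be normalized into it first.
    """
    seen = defaultdict(set)
    for timestamp, source, ip in requests:
        iso = timestamp.isocalendar()
        # iso[0] is the ISO year, iso[1] the ISO week number.
        seen[(iso[0], iso[1], source)].add(ip)
    return {key: len(ips) for key, ips in seen.items()}


# Made-up requests purely for illustration.
EXAMPLE_REQUESTS = [
    (datetime(2024, 6, 10), "s3", "192.0.2.1"),
    (datetime(2024, 6, 11), "s3", "192.0.2.1"),
    (datetime(2024, 6, 11), "s3", "192.0.2.2"),
    (datetime(2024, 6, 18), "datasette", "192.0.2.3"),
]
```

Grouping by ISO week keeps the summary independent of when the job runs; the same structure extends to counting tables or versions by swapping the IP for the requested key.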

Out of scope

  • Google Analytics - we already have a dashboard we can use to look at these metrics!
  • Migrating out of CloudSQL to another storage backend
  • Setting up a permanent dagster server with sensors and schedules

Infrastructure

The pudl-usage-metrics repository hasn't been maintained for a while. We'll need to get it up to speed to support this development work.

Infrastructure tasks

  1. 0 of 7
    cloud superset
  2. 0 of 1
    github_actions

S3 Logs

Our main programmatic access method. S3 logs are currently mirrored to a GCS bucket. Each request produces one log.
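Since each request produces one log line, the first ETL step is mostly line parsing. The sketch below assumes the mirrored logs use the standard Amazon S3 server access log layout and only extracts the leading fields (through the HTTP status). The sample line, bucket name, and object key are made up for illustration.

```python
import re
from datetime import datetime

# Leading fields of an S3 server access log line, in order.
LOG_PATTERN = re.compile(
    r'(?P<bucket_owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] '
    r'(?P<remote_ip>\S+) (?P<requester>\S+) (?P<request_id>\S+) '
    r'(?P<operation>\S+) (?P<key>\S+) (?P<request_uri>"[^"]*"|-) '
    r'(?P<http_status>\S+)'
)

# Hypothetical example line; the values are not from our real logs.
SAMPLE = (
    '79a59df900b949e5 example-bucket [06/Feb/2019:00:00:38 +0000] '
    '192.0.2.3 - 3E57427F3EXAMPLE REST.GET.OBJECT nightly/example.parquet '
    '"GET /nightly/example.parquet HTTP/1.1" 200 - 113 113 7 - '
    '"-" "example-client/1.0" -'
)


def parse_s3_log_line(line):
    """Return a dict of the leading log fields, or None if the line doesn't match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Timestamps look like 06/Feb/2019:00:00:38 +0000.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["request_uri"] = record["request_uri"].strip('"')
    return record
```

The `key` field is what we would inspect to work out which tables and versions people are requesting.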

S3 Usage Metrics

  1. 8 of 8
    s3
    e-belfer
  2. 13 of 13
    github_actions s3
    e-belfer

Datasette

While we're planning to retire Datasette, it'd still be helpful to understand the history of usage and to see how usage changes during the transition to Superset. The log ETL that exists in pudl-usage-metrics hasn't worked since the transition to fly.io.

fly.io currently doesn't retain logs for long, so we need to use the fly log shipper (https://github.com/superfly/fly-log-shipper) to send logs to S3.

It also doesn't log the original IP address of the Datasette requests; my guess is that the IP currently logged is the load balancer's. Usually the load balancer includes some sort of "forwarded this request from original IP" information in the headers, so we should be able to extract that somehow. It seems we can't configure the Datasette access logs, so we'll need to put Datasette behind something we can configure, like NGINX.
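For reference, the conventional header for this is X-Forwarded-For, whose value is a comma-separated list with the original client first. Whether fly.io's proxy sets it for our app is an assumption we'd need to verify. A minimal sketch of the extraction, with a hypothetical function name:

```python
def client_ip(headers, peer_ip):
    """Return the original client IP if a forwarding header is present.

    `headers` is a mapping of request headers; `peer_ip` is the address the
    server actually saw (likely the load balancer). Assumes the proxy sets
    X-Forwarded-For, which has not been confirmed for our deployment.
    """
    # Header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    forwarded = normalized.get("x-forwarded-for", "")
    # The leftmost entry is the client; later entries are intermediate proxies.
    first = forwarded.split(",")[0].strip()
    return first or peer_ip
```

The leftmost entry can be spoofed by the client, which is acceptable for rough usage counts but not for anything security-sensitive.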

Datasette Tasks

  1. 0 of 2
    datasette
    jdangerx

Superset

We're slowly deploying a new data visualization tool! It'll give us a lot of usage information, which we should process and handle. See https://engineering.hometogo.com/monitor-superset-usage-via-superset-c7f9fba79525 for a template.

Superset Tasks


Zenodo

Zenodo API calls return stats on views and downloads for a record at a particular point in time. We should periodically (weekly?) collect stats on all of our archives on Zenodo and archive them for later processing.
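A periodic collection job could look roughly like the sketch below. It assumes the record JSON exposes a `stats` object with `views`, `unique_views`, `downloads`, and `unique_downloads` keys; those field names, the JSON Lines archive format, and the function names are assumptions to check against a real API response.

```python
import json
from datetime import datetime, timezone
from urllib.request import urlopen

ZENODO_RECORD_URL = "https://zenodo.org/api/records/{record_id}"
# Assumed stat field names; verify against an actual record response.
STAT_FIELDS = ("views", "unique_views", "downloads", "unique_downloads")


def fetch_record(record_id):
    """Fetch one record's JSON from the Zenodo REST API."""
    with urlopen(ZENODO_RECORD_URL.format(record_id=record_id)) as response:
        return json.load(response)


def snapshot_stats(record, collected_at=None):
    """Flatten a record's stats into one timestamped row."""
    collected_at = collected_at or datetime.now(timezone.utc)
    stats = record.get("stats", {})
    row = {
        "record_id": record.get("id"),
        "collected_at": collected_at.isoformat(),
    }
    for field in STAT_FIELDS:
        # Missing fields become None rather than raising.
        row[field] = stats.get(field)
    return row


def append_snapshot(row, path):
    """Append one snapshot row to a JSON Lines archive file."""
    with open(path, "a", encoding="utf-8") as archive:
        archive.write(json.dumps(row) + "\n")
```

Storing the collection time with each row is what lets us reconstruct trends later, since the API only reports totals as of the moment of the call.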

Zenodo Tasks

  1. zenodo
    e-belfer
  2. 4 of 4
    zenodo
    e-belfer

Kaggle

Kaggle collects data on views and downloads through its dataset metadata JSON, accessible through the api.metadata_get(KAGGLE_OWNER, KAGGLE_DATASET) call from the KaggleApi. Like Zenodo, this is data reported at the time of query, so we'll need to archive these metrics to see changes over time.
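Because the reported numbers are cumulative totals at query time, changes over time come from differencing consecutive archived snapshots. A minimal sketch, assuming each archived snapshot has been reduced to a dict with an ISO `date` plus `views` and `downloads` counts (hypothetical field names, not necessarily those in the Kaggle metadata JSON):

```python
def snapshot_deltas(snapshots):
    """Turn cumulative snapshots into per-interval changes.

    Each snapshot is a dict with "date" (ISO string), "views", and
    "downloads". These keys are assumptions made for this example.
    """
    # ISO date strings sort chronologically as plain strings.
    ordered = sorted(snapshots, key=lambda snapshot: snapshot["date"])
    deltas = []
    for previous, current in zip(ordered, ordered[1:]):
        deltas.append(
            {
                "date": current["date"],
                "views": current["views"] - previous["views"],
                "downloads": current["downloads"] - previous["downloads"],
            }
        )
    return deltas
```

The first snapshot produces no delta, so the history only starts from the second archived collection.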

Kaggle

  1. 2 of 2
    github kaggle
    e-belfer
  2. 1 of 1
    github kaggle
    e-belfer
  3. 7 of 7
    github kaggle
    e-belfer
  4. 0 of 3
    kaggle

GitHub

Migrate our GitHub metrics archiving from the business repository, and add it to our ETL.

GitHub

  1. 2 of 4
    github_actions
    bendnorman
  2. 7 of 7
    github kaggle
    e-belfer

Reporting and Visualization

Once the data is processed, we'll need to analyze and report on key metrics so that we can interpret changes in usage and highlight notable trends.

Some interesting references for Superset usage dashboards can be found here.

Reporting and Summaries

  1. 0 of 1
@jdangerx
Member Author

jdangerx commented Jul 1, 2024

We should timebox this to 5h and prioritize getting S3 parquet logs because of the possibility of replacing datasette altogether.

@bendnorman
Member

I think revamping the pudl-usage-metrics repo will take some work. Maybe we can simplify the task by "disabling" the current metrics in the ETL:

  • old datasette logs from the Cloud Run days which are probably still helpful for us
  • intake catalog logs which we never really utilized

and integrate just the s3 logs since those are the highest value / most relevant rn. I opened a PR with my janky s3 log download script and notebook.

The ETL generally works like this:

  1. Pull some logs from GCS
  2. Do some cleaning with pandas and dagster
  3. Load the cleaned logs into Cloud SQL postgres
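To make the shape of those three steps concrete, here is a stripped-down sketch. It is not the repository's code: it swaps in a hard-coded list for the GCS download, plain Python for pandas and dagster, and an in-memory SQLite database for Cloud SQL Postgres. The table and column names are hypothetical.

```python
import sqlite3


def extract():
    """Stand-in for pulling raw log rows from GCS (made-up rows)."""
    return [
        {"remote_ip": " 192.0.2.3 ", "http_status": "200"},
        {"remote_ip": "192.0.2.4", "http_status": "200"},
        {"remote_ip": "", "http_status": "403"},
    ]


def clean(rows):
    """Trim whitespace, drop rows without an IP, and cast the status to int."""
    cleaned = []
    for row in rows:
        ip = row["remote_ip"].strip()
        if not ip:
            continue
        cleaned.append((ip, int(row["http_status"])))
    return cleaned


def load(rows, conn):
    """Write cleaned rows to a table (SQLite here in place of Postgres)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS s3_logs (remote_ip TEXT, http_status INTEGER)"
    )
    conn.executemany("INSERT INTO s3_logs VALUES (?, ?)", rows)
    conn.commit()


def run_etl(conn):
    """Run extract, clean, and load in order."""
    load(clean(extract()), conn)
```

Keeping the three stages as separate functions mirrors how each would become its own asset or op in an orchestrated pipeline.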

I have a github action that processes the latest logs and loads them to Cloud SQL. Cloud SQL is kind of expensive so it might make more sense to use BQ.

I think it makes sense to create a quick design doc for the usage metrics revamp, given there is a lot we could do.

@e-belfer e-belfer transferred this issue from catalyst-cooperative/pudl Jul 17, 2024
@e-belfer e-belfer added Epic github_actions Pull requests that update GitHub Actions code s3 Relating to S3 usage metrics datasette Relating to Datasette usage metrics superset Relating to Superset usage metrics labels Aug 14, 2024
@e-belfer e-belfer changed the title from "Get basic user metrics we technically have access to" to "Revitalize the collection of PUDL usage metrics" Aug 14, 2024
@e-belfer
Member

I've updated this issue to be an epic reflecting all our logs and possible workflows, and have tried to structure out smaller steps in the tasklists.

@e-belfer e-belfer linked a pull request Sep 13, 2024 that will close this issue
@github-project-automation github-project-automation bot moved this from In progress to Done in Catalyst Megaproject Sep 16, 2024
@zaneselvans
Member

@e-belfer was this issue supposed to get closed by #162?

@e-belfer e-belfer reopened this Sep 16, 2024
@e-belfer
Member

Definitely not!

Projects
Status: In progress

4 participants