Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Archive raw Kaggle and Github metrics daily #162

Merged
merged 11 commits into from
Sep 16, 2024
Merged

Archive raw Kaggle and Github metrics daily #162

merged 11 commits into from
Sep 16, 2024

Conversation

e-belfer
Copy link
Member

@e-belfer e-belfer commented Sep 11, 2024

Overview

Closes part of #161.

What problem does this address?
Runs a daily archive of Github and Kaggle metrics and saves to GCS.

What did you change in this PR?
Created a Kaggle archiving script and moved the Github archiving script from the business repo.

Testing

How did you make sure this worked? How can a reviewer verify this?
We'll need to merge this PR in order to run the action, but you can test running the scripts locally.

To-do list

Tasks

Preview Give feedback

@e-belfer e-belfer self-assigned this Sep 11, 2024
@e-belfer e-belfer requested review from bendnorman and removed request for bendnorman September 11, 2024 20:05
@e-belfer e-belfer changed the title Kaggle pipeline Archive raw Kaggle and Github metrics daily Sep 11, 2024
@e-belfer e-belfer added kaggle Relating to Kaggle usage metrics github Relating to Github usage metrics labels Sep 13, 2024
@e-belfer e-belfer marked this pull request as ready for review September 13, 2024 16:44
@e-belfer e-belfer requested a review from jdangerx September 13, 2024 17:10
Copy link
Member

@jdangerx jdangerx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small change requests about module-level variables. Otherwise, looks good! And the slack notifications are saying that it works :)

name: Save Kaggle Metrics
env:
KAGGLE_KEY: ${{ secrets.KAGGLE_KEY }}
KAGGLE_USERNAME: ${{ secrets.KAGGLE_USERNAME }}
Copy link
Member

@jdangerx jdangerx Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking: Did you just use your own personal key/username for these? or do we have an organizational kaggle account?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have an organizational level account!

raise HTTPError(
f"HTTP error occurred: {http_err}\n\tResponse test: {response.text}"
)
except Exception as err:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would let the non-HTTPError errors raise like normal instead of manually raising them here.


metrics = []
page = 1
while True:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking: if somehow Github gives us an infinite stream of forks and stargazers (e.g. their pagination stops working for some reason??), we wouldn't want to keep requesting forever.

We could work around this by:

  • limit how many pages to look at
  • set a reasonable timeout on the job itself

In any case, the worst case scenario is the GHA runner eventually timing out after 6 hours of free compute. Hence, non-blocking.

Comment on lines 13 to 14
OWNER = "catalyst-cooperative"
REPO = "pudl"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these are used anywhere else, should we delete them?

Comment on lines 26 to 37
TOKEN = os.getenv("API_TOKEN_GITHUB", "...")
OWNER = "catalyst-cooperative"
REPO = "pudl"
BUCKET_NAME = "pudl-usage-metrics-archives.catalyst.coop"

BIWEEKLY_METRICS = [
Metric("clones", "clones"),
Metric("popular/paths", "popular_paths"),
Metric("popular/referrers", "popular_referrers"),
Metric("views", "views"),
]
PERSISTENT_METRICS = [Metric("stargazers", "stargazers"), Metric("forks", "forks")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these probably don't need to be module-level variables and can instead be set within each function - the GH-auth related ones can go into make_github_request and the lists of metrics can be in save_metrics. Then it's clear that nobody else depends on these variables.

Comment on lines 11 to 12
KAGGLE_OWNER = "catalystcooperative"
KAGGLE_DATASET = "pudl-project"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these (and BUCKET_NAME) probably don't need to be module-level variables either, sort of like the github metrics situation.

@e-belfer e-belfer requested a review from jdangerx September 16, 2024 14:29
Copy link
Member

@jdangerx jdangerx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@e-belfer e-belfer merged commit b9e98b6 into main Sep 16, 2024
5 checks passed
@e-belfer e-belfer deleted the kaggle-pipeline branch September 16, 2024 14:43
@e-belfer e-belfer mentioned this pull request Sep 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
github Relating to Github usage metrics kaggle Relating to Kaggle usage metrics
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Move Github metric archive out of business repo Revitalize the collection of PUDL usage metrics
2 participants