-
-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Archive raw Kaggle and Github metrics daily #162
Conversation
75247df
to
8caf839
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some small change requests about module-level variables. Otherwise, looks good! And the slack notifications are saying that it works :)
name: Save Kaggle Metrics | ||
env: | ||
KAGGLE_KEY: ${{ secrets.KAGGLE_KEY }} | ||
KAGGLE_USERNAME: ${{ secrets.KAGGLE_USERNAME }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking: Did you just use your own personal key/username for these? or do we have an organizational kaggle account?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have an organizational level account!
raise HTTPError( | ||
f"HTTP error occurred: {http_err}\n\tResponse test: {response.text}" | ||
) | ||
except Exception as err: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would let the non-HTTPError
errors raise like normal instead of manually raising them here.
|
||
metrics = [] | ||
page = 1 | ||
while True: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking: if somehow Github gives us an infinite stream of forks and stargazers (e.g. their pagination stops working for some reason??), we wouldn't want to keep requesting forever.
We could work around this by:
- limit how many pages to look at
- set a reasonable timeout on the job itself
In any case, the worst case scenario is the GHA runner eventually timing out after 6 hours of free compute. Hence, non-blocking.
OWNER = "catalyst-cooperative" | ||
REPO = "pudl" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think these are used anywhere else, should we delete them?
TOKEN = os.getenv("API_TOKEN_GITHUB", "...") | ||
OWNER = "catalyst-cooperative" | ||
REPO = "pudl" | ||
BUCKET_NAME = "pudl-usage-metrics-archives.catalyst.coop" | ||
|
||
BIWEEKLY_METRICS = [ | ||
Metric("clones", "clones"), | ||
Metric("popular/paths", "popular_paths"), | ||
Metric("popular/referrers", "popular_referrers"), | ||
Metric("views", "views"), | ||
] | ||
PERSISTENT_METRICS = [Metric("stargazers", "stargazers"), Metric("forks", "forks")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these probably don't need to be module-level variables and can instead be set within each function - the GH-auth related ones can go into make_github_request
and the lists of metrics can be in save_metrics
. Then it's clear that nobody else depends on these variables.
KAGGLE_OWNER = "catalystcooperative" | ||
KAGGLE_DATASET = "pudl-project" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these (and BUCKET_NAME
) probably don't need to be module-level variables either, sort of like the github metrics situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
Overview
Closes part of #161.
What problem does this address?
Runs a daily archive of Github and Kaggle metrics and saves to GCS.
What did you change in this PR?
Created a Kaggle archiving script and moved the Github archiving script from the
business
repo.Testing
How did you make sure this worked? How can a reviewer verify this?
We'll need to merge this PR in order to run the action, but you can test running the scripts locally.
To-do list
Tasks