This repository has been archived by the owner on Oct 16, 2024. It is now read-only.

Fix pagerank percentile #2182

Merged 1 commit on Nov 21, 2023

Conversation

jcushman
Contributor

I noticed this when a user asked me to explain the meaning of the numbers in our pagerank export, and figured I may as well throw in a fix --

The code that generates pagerank percentiles was clearly intended to give all cases with the same pagerank the same percentile score, but it was missing a last_score = score line. The upshot is that the first row gets a percentile of 0, the second gets 1/n, the third gets 2/n, and so on, instead of all rows with the same score sharing one percentile:

id,raw_score,percentile
4,4.076306636737454e-08,0.0
5,4.076306636737454e-08,1.9539073261755196e-07
7,4.076306636737454e-08,3.907814652351039e-07
13,4.076306636737454e-08,5.861721978526558e-07
17,4.076306636737454e-08,7.815629304702078e-07

This just puts in the missing assignment, so those rows would all have percentile 0.0. Note that this has the largest effect on the lowest-ranked cases: about a third of our cases share the same lowest pagerank, so the case at row 1.7 million gets a percentile of 33% when it should be 0%. After that there's more gradation, so the bug has less effect.
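
For anyone reading along, here is a minimal sketch of the tie-handling logic described above. It is not the actual export code: the function name assign_percentiles, the update_last_score flag, and the toy scores are all made up for illustration, and the index / n formula is an assumption inferred from the sample rows. It only shows why leaving out the last_score assignment gives every row its own percentile.

```python
def assign_percentiles(scores, update_last_score=True):
    """Return (case_id, raw_score, percentile) rows, lowest score first.

    Hypothetical helper for illustration only. With update_last_score=False
    it mimics the bug: last_score never changes, so every row looks like
    the start of a new score group.
    """
    ranked = sorted(scores, key=lambda pair: pair[1])  # stable sort by score
    n = len(ranked)
    rows = []
    last_score = None
    percentile = 0.0
    for index, (case_id, score) in enumerate(ranked):
        if score != last_score:
            # First row of a new score group: everything below it is index / n.
            percentile = index / n
            if update_last_score:
                last_score = score  # the assignment that was missing
        rows.append((case_id, score, percentile))
    return rows


# Toy data (not real pagerank values): three cases tied at the lowest score.
sample = [(4, 1.0), (5, 1.0), (7, 1.0), (13, 2.0), (17, 2.0)]

buggy = assign_percentiles(sample, update_last_score=False)
fixed = assign_percentiles(sample)

for label, rows in (("buggy", buggy), ("fixed", fixed)):
    print(label, [percentile for _, _, percentile in rows])
```

With the assignment in place, the three tied cases all get 0.0 and the next group starts at 0.6; without it, the percentiles climb by 1/n per row. As a rough sanity check on the numbers above, the step of about 1.95e-07 between rows in the export implies n of roughly 5.1 million, which fits row 1.7 million landing at about 33%.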

@jcushman jcushman requested a review from a team as a code owner November 21, 2023 20:50
@jcushman jcushman requested review from bensteinberg and removed request for a team November 21, 2023 20:50

codecov bot commented Nov 21, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (09f8625) 61.97% compared to head (ff5b10a) 61.98%.
Report is 1 commit behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #2182   +/-   ##
========================================
  Coverage    61.97%   61.98%           
========================================
  Files          107      107           
  Lines        11817    11818    +1     
========================================
+ Hits          7324     7325    +1     
  Misses        4493     4493           


Contributor

@bensteinberg bensteinberg left a comment

👍

When this is deployed, I think I'll run fab calculate_pagerank_scores ; fab load_pagerank_scores?

@jcushman jcushman merged commit 3521773 into harvard-lil:develop Nov 21, 2023
2 checks passed
@jcushman
Contributor Author

I think it would be fab export_citation_graph?
