Counting and Visualizing CRAN Downloads with packageRank (with Caveats!) - R-hub blog #101

Open
utterances-bot opened this issue May 11, 2020 · 6 comments

Comments

@utterances-bot

Counting and Visualizing CRAN Downloads with packageRank (with Caveats!) - R-hub blog

https://blog.r-hub.io/2020/05/11/packagerank-intro/

Member

jeroen commented May 11, 2020

Very cool post! Another interesting metric is the package pagerank in the dependency graph. See for example the code at the end of this post: https://blog.revolutionanalytics.com/2014/12/a-reproducible-r-example-finding-the-most-popular-packages-using-the-pagerank-algorithm.html
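For readers unfamiliar with the idea, here is a minimal base-R sketch of PageRank on a dependency graph. It does not reproduce the code in the linked post: the four package names and their dependencies are made up, `page_rank()` is a hypothetical helper, and the damping factor of 0.85 is simply the conventional default. Edges point from a package to the packages it depends on, so rank flows toward heavily depended-on packages.

```r
# Hypothetical toy graph: pkgA and pkgB depend on pkgC, pkgC depends on pkgD.
pkgs <- c("pkgA", "pkgB", "pkgC", "pkgD")
adj <- matrix(0, nrow = 4, ncol = 4, dimnames = list(pkgs, pkgs))
adj["pkgA", "pkgC"] <- 1
adj["pkgB", "pkgC"] <- 1
adj["pkgC", "pkgD"] <- 1

# Hypothetical helper: PageRank by power iteration.
page_rank <- function(adj, damping = 0.85, tol = 1e-10, max_iter = 1000) {
  n <- nrow(adj)
  out_degree <- rowSums(adj)

  # Row-normalize the adjacency matrix; a package with no dependencies
  # (a "dangling" node) spreads its rank evenly over all packages.
  trans <- adj
  for (i in seq_len(n)) {
    trans[i, ] <- if (out_degree[i] > 0) adj[i, ] / out_degree[i] else rep(1 / n, n)
  }

  rank <- rep(1 / n, n)
  for (iter in seq_len(max_iter)) {
    new_rank <- (1 - damping) / n + damping * as.vector(rank %*% trans)
    converged <- sum(abs(new_rank - rank)) < tol
    rank <- new_rank
    if (converged) break
  }
  setNames(rank, rownames(adj))
}

ranks <- page_rank(adj)
print(sort(ranks, decreasing = TRUE))
```

In this toy graph pkgD, which everything ultimately depends on, ends up with the highest rank, followed by pkgC.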

@lindbrook
Contributor

Thanks!

For what it's worth, the name 'packageRank' is a nod to PageRank. But for my purposes, getting a "better" estimate of user (rather than developer) interest in a package, what I actually want is an "inverse" PageRank algorithm, which discounts rather than credits dependencies. A task for the future.


ferroao commented Dec 11, 2020

Nice work. I think there might also be some kind of server bias: IPs that download hundreds or thousands of packages a day might not represent real users.

@lindbrook
Contributor

Thanks! In the current development version of ‘packageRank’ I’ve been working on functions that try to do what you suggest. Among other things, they try to filter out log entries due to CI/unit testing and “unofficial” efforts to mirror CRAN.

As far as server bias is concerned, tell me what you have in mind. Something along the lines of people who use RStudio’s CRAN Mirror, which generates the logs used to count package downloads, tend to do more testing, package development, etc. than those who use other mirrors?

@ferroao

ferroao commented Dec 12, 2020

It seems there are IPs that download the same number of packages day after day. The IPs are not real (they are coded) and change arbitrarily, but it seems odd that a fixed number of IPs (y-axis) download a fixed number of packages (x-axis) on different days, i.e., some peaks of this graph repeat on different days. That seems non-human.

# install.packages(c("data.table", "R.utils", "ggplot2"))
library(ggplot2)

# Use one date for both the download and the read step.
log_date <- "2020-12-08"
log_url <- paste0("http://cran-logs.rstudio.com/", substr(log_date, 1, 4), "/",
                  log_date, ".csv.gz")
log_file <- file.path("CRANlogs", paste0(log_date, ".csv.gz"))

dir.create("CRANlogs", showWarnings = FALSE)
download.file(log_url, log_file, mode = "wb")

# fread() needs the R.utils package to read .gz files.
file.db <- data.table::fread(log_file, header = TRUE, sep = ",",
                             dec = ".", fill = TRUE)

# Unique packages downloaded by each IP.
pacPerIp <- file.db[, unique(.SD), by = ip_id, .SDcols = c("package", "country")]

# Number of distinct packages per IP, largest first.
pacPerIpSum <- pacPerIp[, .(numberOfPacs = .N), by = .(ip_id, country)]
pacPerIpSum <- pacPerIpSum[order(pacPerIpSum$numberOfPacs, decreasing = TRUE)]

# Number of IPs for each package count.
pacPerIp2 <- pacPerIpSum[, .(countOfIPs = .N), by = numberOfPacs]
pacPerIp2 <- pacPerIp2[order(pacPerIp2$numberOfPacs, decreasing = FALSE)]
pacPerIp2$numberOfPacs <- factor(pacPerIp2$numberOfPacs,
                                 levels = sort(pacPerIp2$numberOfPacs))

# ylim(0, 500) drops bars taller than 500 from the plot (with a warning).
ggplot(pacPerIp2, aes(numberOfPacs, countOfIPs)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  ylim(0, 500)
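Because the coded IPs change from day to day, one way to check whether peaks repeat is to compare the distributions rather than the IPs themselves. The following is only a rough sketch under stated assumptions: the two vectors are made-up per-IP package counts (the `numberOfPacs` values for two days), and `repeated_peaks()`, `min_packages`, and `min_ips` are hypothetical names chosen for illustration.

```r
# Made-up per-IP package counts for two days (one entry per ip_id).
day1 <- c(1, 1, 2, 250, 250, 250, 3, 980, 980)
day2 <- c(1, 2, 2, 250, 250, 250, 5, 980, 980)

# Hypothetical helper: package counts that are downloaded by the same
# number of IPs on both days, ignoring small counts and one-off IPs.
repeated_peaks <- function(day_a, day_b, min_packages = 100, min_ips = 2) {
  a <- table(day_a[day_a >= min_packages])
  b <- table(day_b[day_b >= min_packages])
  shared <- intersect(names(a), names(b))
  n_a <- as.vector(a[shared])
  n_b <- as.vector(b[shared])
  same <- shared[n_a == n_b & n_a >= min_ips]
  data.frame(numberOfPacs = as.numeric(same),
             countOfIPs = as.vector(a[same]))
}

peaks <- repeated_peaks(day1, day2)
print(peaks)
```

With real logs, `day1` and `day2` would be the `numberOfPacs` column of `pacPerIpSum` computed for two different dates.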

@lindbrook
Contributor

Yes. It's not "human". There seem to be a lot of repeated, regularly scheduled, scripted downloads (probably due to cron jobs, AWS, Docker, CI/unit testing, unofficial CRAN mirroring, etc.). Because IP addresses are anonymized, identifying the specific culprit is not a trivial exercise (to me, this concern for privacy is perfectly understandable and should be respected). That said, even without that information I think it's still possible to reduce the contribution of automated downloads to the overall package download count.
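As one crude illustration of that last point (and not a description of how 'packageRank' does it), downloads could be recounted after dropping any coded IP that fetches an implausibly large number of distinct packages in a day. Everything here is an assumption for the sake of the sketch: the toy rows are made up, `drop_bulk_ips()` is a hypothetical helper, and the threshold is arbitrary.

```r
# Toy log using two column names from the CRAN download logs (ip_id, package);
# the rows themselves are made up.
toy_log <- data.frame(
  ip_id   = c(1, 1, 2, 3, 3, 3, 3),
  package = c("ggplot2", "dplyr", "ggplot2", "ggplot2", "dplyr", "Rcpp", "cholera"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop every IP that downloaded more than
# `max_packages` distinct packages on the day covered by the log.
drop_bulk_ips <- function(log, max_packages) {
  distinct <- unique(log[, c("ip_id", "package")])
  per_ip <- table(distinct$ip_id)
  bulk_ips <- names(per_ip)[as.vector(per_ip) > max_packages]
  log[!(as.character(log$ip_id) %in% bulk_ips), , drop = FALSE]
}

# An unrealistically low threshold, chosen only so the toy data shows an effect.
filtered <- drop_bulk_ips(toy_log, max_packages = 3)

# Downloads per package before and after filtering.
raw_counts <- table(toy_log$package)
filtered_counts <- table(filtered$package)
print(raw_counts)
print(filtered_counts)
```

A fixed cutoff like this will also discard legitimate heavy users, so it is better read as a starting point than as a filter to trust.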


5 participants