Counting and Visualizing CRAN Downloads with packageRank (with Caveats!) - R-hub blog #101

utterances-bot · 2020-05-11T11:59:47Z

Counting and Visualizing CRAN Downloads with packageRank (with Caveats!) - R-hub blog

https://blog.r-hub.io/2020/05/11/packagerank-intro/

jeroen · 2020-05-11T11:59:48Z

Very cool post! Another interesting metric is the package pagerank in the dependency graph. See for example the code at the end of this post: https://blog.revolutionanalytics.com/2014/12/a-reproducible-r-example-finding-the-most-popular-packages-using-the-pagerank-algorithm.html
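As a rough illustration of the idea (this is not the code from the linked post), here is a minimal base-R sketch of PageRank on a dependency graph. The package names and edges are invented, and the edge direction (a package "votes" for the packages it depends on) is an assumption about how such a graph would be set up.

```r
# Toy dependency edges: each row means `from` depends on `to`.
# All names and relationships here are made up for illustration.
edges <- data.frame(
  from = c("pkgA", "pkgB", "pkgC", "pkgC", "pkgD"),
  to   = c("core", "core", "pkgA", "core", "pkgC"),
  stringsAsFactors = FALSE
)

pkgs <- sort(unique(c(edges$from, edges$to)))
n <- length(pkgs)

# Column-stochastic transition matrix: M[i, j] is the probability of
# moving from package j to package i along a "depends on" edge.
M <- matrix(0, n, n, dimnames = list(pkgs, pkgs))
for (k in seq_len(nrow(edges))) {
  M[edges$to[k], edges$from[k]] <- 1
}
out_degree <- colSums(M)
dangling <- out_degree == 0
M[, !dangling] <- sweep(M[, !dangling, drop = FALSE], 2,
                        out_degree[!dangling], "/")
M[, dangling] <- 1 / n  # packages with no dependencies jump uniformly

# Power iteration with the conventional damping factor of 0.85.
damping <- 0.85
pr <- setNames(rep(1 / n, n), pkgs)
for (iter in seq_len(100)) {
  previous <- pr
  pr <- setNames(as.vector((1 - damping) / n + damping * (M %*% previous)), pkgs)
  if (sum(abs(pr - previous)) < 1e-10) break
}

print(sort(pr, decreasing = TRUE))
```

In this toy graph the package that everything depends on ("core") ends up with the highest score, which is the sense in which PageRank credits dependencies.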

lindbrook · 2020-05-12T20:59:02Z

Thanks!

For what it's worth, the name 'packageRank' is a nod to PageRank. But for my purposes, getting a "better" estimate of user (rather than developer) interest in a package, what I actually want is an "inverse" PageRank algorithm, which discounts rather than credits dependencies. A task for the future.

ferroao · 2020-12-11T01:55:06Z

Nice work. I think there might also be some kind of server bias: IPs that download hundreds or thousands of packages a day may not represent real users.

lindbrook · 2020-12-11T23:36:00Z

Thanks! In the current development version of ‘packageRank’ I’ve been working on functions that try to do what you suggest. Among other things, they try to filter out log entries due to CI/unit testing and “unofficial” efforts to mirror CRAN.

As far as server bias is concerned, tell me what you have in mind. Something along the lines of people who use RStudio’s CRAN Mirror, which generates the logs used to count package downloads, tend to do more testing, package development, etc. than those who use other mirrors?

ferroao · 2020-12-12T13:35:42Z

It seems there are IPs that download the same number of packages day after day. The IPs are not real (they are coded) and change arbitrarily, but it seems odd that a fixed number of IPs (y-axis) download a fixed number of packages (x-axis) on different days (i.e., some peaks of this graph repeat on different days). That seems non-human.

# install.packages(c("data.table", "R.utils", "ggplot2"))
library(ggplot2)

# Download one day's log and read that same file back
log_date <- "2020-12-08"
url <- paste0("http://cran-logs.rstudio.com/", substr(log_date, 1, 4), "/",
              log_date, ".csv.gz")
dir.create("CRANlogs", showWarnings = FALSE)
log_file <- file.path("CRANlogs", paste0(log_date, ".csv.gz"))
download.file(url, log_file, mode = "wb")

file.db <- data.table::fread(log_file, header = TRUE, sep = ",",
                             dec = ".", fill = TRUE)

# Distinct packages downloaded by each (anonymized) IP
pacPerIp <- file.db[ , unique(.SD), by = ip_id, .SDcols = c("package", "country")]

# Number of distinct packages per IP
pacPerIpSum <- pacPerIp[, .(numberOfPacs = .N), by = .(ip_id, country)]
pacPerIpSum <- pacPerIpSum[order(pacPerIpSum$numberOfPacs, decreasing = TRUE)]

# Number of IPs for each value of "distinct packages downloaded"
pacPerIp2 <- pacPerIpSum[ , .(countOfIPs = .N), by = numberOfPacs]
pacPerIp2 <- pacPerIp2[order(pacPerIp2$numberOfPacs, decreasing = FALSE)]
pacPerIp2$numberOfPacs <- factor(pacPerIp2$numberOfPacs, levels = sort(pacPerIp2$numberOfPacs))

# Note: ylim() drops bars taller than 500 rather than clipping them
ggplot(pacPerIp2, aes(numberOfPacs, countOfIPs)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  ylim(0, 500)
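To check whether peaks really repeat across days, one option is to compare the two daily distributions directly. Since the coded IPs change from day to day, IPs cannot be matched, only the (numberOfPacs, countOfIPs) pairs. The following is a minimal base-R sketch on made-up data; the helper name `packages_per_ip_distribution` and the toy logs are hypothetical.

```r
# Hypothetical helper: for one day's log, count how many IPs downloaded
# each number of distinct packages.
packages_per_ip_distribution <- function(log) {
  distinct <- unique(log[, c("ip_id", "package")])
  per_ip <- table(distinct$ip_id)  # distinct packages per ip_id
  dist <- as.data.frame(table(as.integer(per_ip)), stringsAsFactors = FALSE)
  names(dist) <- c("numberOfPacs", "countOfIPs")
  dist$numberOfPacs <- as.integer(as.character(dist$numberOfPacs))
  dist
}

# Toy logs for two days (invented data; ip_id values differ between days)
day1 <- data.frame(
  ip_id   = c(1, 1, 1, 2, 2, 3, 3, 3),
  package = c("a", "b", "c", "a", "a", "a", "b", "c"),
  stringsAsFactors = FALSE
)
day2 <- data.frame(
  ip_id   = c(7, 7, 7, 8, 8, 8, 9, 9),
  package = c("a", "b", "c", "x", "y", "z", "a", "b"),
  stringsAsFactors = FALSE
)

dist1 <- packages_per_ip_distribution(day1)
dist2 <- packages_per_ip_distribution(day2)

# Pairs that occur on both days: same number of IPs downloading the
# same number of distinct packages.
repeated <- merge(dist1, dist2, by = c("numberOfPacs", "countOfIPs"))
print(repeated)
```

With real logs, small values of numberOfPacs will match by chance fairly often, so it is the repeated pairs at large numberOfPacs that would be the interesting ones.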

lindbrook · 2020-12-13T18:25:20Z

Yes. It's not "human". There seem to be a lot of repeated, regularly scheduled, scripted downloads (probably due to cron jobs, AWS, Docker, CI/unit testing, unofficial CRAN mirroring, etc.). Because IP addresses are anonymized, identifying the specific culprit is not a trivial exercise (to me, this concern for privacy is perfectly understandable and should be respected). That said, even without that information, I think it's still possible to reduce the contribution of automated downloads to the overall package download count.
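One very simple way to approximate this, sketched below in base R, is to drop every IP that downloads more than some number of distinct packages in a day before counting. This is only an illustration on invented data, not how packageRank filters logs, and the cutoff `max_packages_per_ip` is an arbitrary, hypothetical value.

```r
# Toy one-day log (invented data) with the two columns needed here
log <- data.frame(
  ip_id   = c(1, 1, 2, 3, 3, 3, 3, 3),
  package = c("ggplot2", "dplyr", "ggplot2",
              "ggplot2", "dplyr", "cholera", "sf", "Rcpp"),
  stringsAsFactors = FALSE
)

# Hypothetical cutoff: IPs above this many distinct packages are
# treated as automated.
max_packages_per_ip <- 3

# Distinct packages per IP
distinct <- unique(log[, c("ip_id", "package")])
packages_per_ip <- tapply(distinct$package, distinct$ip_id, length)

# IPs over the cutoff, and the log with those IPs removed
suspect_ips <- as.numeric(names(packages_per_ip)[packages_per_ip > max_packages_per_ip])
filtered <- log[!log$ip_id %in% suspect_ips, ]

# Download counts per package before and after filtering
raw_counts <- table(log$package)
filtered_counts <- table(filtered$package)
print(raw_counts)
print(filtered_counts)
```

A fixed cutoff is crude: it will also remove legitimate users who install many packages at once, so in practice the threshold would need to be chosen by looking at the distribution.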

maelle mentioned this issue May 11, 2020

packageRank draft #74

Merged

maelle added the comments 💬 label Oct 12, 2020
