diff --git a/.travis.yml b/.travis.yml index b49c4b66..2933e44c 100644 --- a/.travis.yml +++ b/.travis.yml @@ -11,36 +11,32 @@ ## with this program; if not, write to the Free Software Foundation, Inc., ## 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. ## -## Copyright 2017-2018 by Claus Hunsen +## Copyright 2017-2018,2020 by Claus Hunsen ## All Rights Reserved. +# TravisCI container +os: linux +dist: xenial +warnings_are_errors: false +# R environment, dependencies and information language: r r: - 3.3 - 3.4 - 3.5 - -# TravisCI container -sudo: required -dist: trusty -warnings_are_errors: false - -# # Branches -# branches: -# only: -# - travis -# - claus-updates - -# R dependencies and information + - 3.6 cache: packages repos: CRAN: https://cloud.r-project.org -# installation +# Installation install: + # package dependencies - sudo apt-get install libudunits2-dev + # package installation - Rscript install.R +# Tests script: - Rscript tests.R diff --git a/NEWS.md b/NEWS.md index c5d0d92a..67c3c6e1 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,5 +1,21 @@ # coronet – Changelog +## 3.6 + +### Added +- Add a parameter `editor.definition` to the function `add.vertex.attribute.artifact.editor.count` which can be used to define whether authors, committers, or both count as editors when computing the attribute values. (#92, ff1e147ba563b2d71f8228afd49492a315a5ad48) +- Add the possibility to filter out patchstack mails from the mails of the `ProjectData`. The option can be toggled using the newly added configuration option `mails.filter.patchstack.mails`. (1608e28ca36610c58d2a5447d12ee2052c6eb976, a932c8cdaa6fe5149c798bc09d9e421ba679c48d) +- Add a new file `util-plot-evaluation.R` containing functions to plot commit edit types per author and project. (PR #171, d4af515f859ce16ffaa0963d6d3d4086bcbb7377, aa542a215f59bc3ed869cfefbc5a25fa050b1fc9, 
0a0a5903e7c609dfe805a3471749eb2241efafe2) + +### Changed/Improved + +- Add R version 3.6 to test suite (8b2a52d38475a59c55feb17bb54ed12b9252a937, #161) +- Update `.travis.yml` to improve compatibility with Travis CI (41ce589b3b50fd581a10e6af33ac6b1bbea63bb8) + +### Fixed + +- Ensure sorting of commit-count and LOC-count data.frames to fix tests with R 3.3 (33d63fd50c4b29d45a9ca586c383650f7d29efd5) + ## 3.5 diff --git a/README.md b/README.md index 3b540e6b..a325d7f4 100644 --- a/README.md +++ b/README.md @@ -103,7 +103,7 @@ While `proximity` triggers a file/function-based commit analysis in `Codeface`, When using this network library, the user only needs to give the `artifact` parameter to the [`ProjectConf`](#projectconf) constructor, which automatically ensures that the correct tagging is selected. The configuration files `{project-name}_{tagging}.conf` are mandatory and contain some basic configuration regarding a performed `Codeface` analysis (e.g., project name, name of the corresponding repository, name of the mailing list, etc.). -For further details on those files, please have a look at some [example files](https://github.com/siemens/codeface/tree/master/conf) files in the `Codeface` repository. +For further details on those files, please have a look at some [example files](https://github.com/siemens/codeface/tree/master/conf) in the `Codeface` repository. All the `*.list` files listed above are output files of `codeface-extraction` and contain meta data of, e.g., commits or e-mails to the mailing list, etc., in CSV format. This network library lazily loads and processes these files when needed. @@ -133,7 +133,7 @@ Alternatively, you can run `Rscript install.R` to install the packages. Please insert the project into yours by use of [git submodules](https://git-scm.com/book/en/v2/Git-Tools-Submodules). Furthermore, the file `install.R` installs all needed R packages (see [below](#needed-r-packages)) into your R library. 
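Review note on the "Ensure sorting" entry under *Fixed*: ties in `freq` leave the row order of the count data.frames unspecified unless a secondary sort key is added. A minimal, self-contained sketch with toy, hypothetical data of the deterministic ordering in base R:

```r
## toy commit-count data with a tie in `freq` (hypothetical example data)
counts = data.frame(author.name = c("Zoe", "Ann", "Ben"),
                    freq = c(3L, 5L, 3L),
                    stringsAsFactors = FALSE)

## sorting by `freq` alone leaves tied rows in unspecified order;
## a secondary, ascending key on the name makes the result deterministic,
## analogous to changing `ORDER BY freq DESC` to `ORDER BY freq DESC, author.name ASC`
counts.sorted = counts[order(-counts[["freq"]], counts[["author.name"]]), ]
rownames(counts.sorted) = NULL
```

This mirrors the SQL `ORDER BY` changes made in `util-core-peripheral.R` in this patch.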
-Although, the use of of [packrat](https://rstudio.github.io/packrat/) with your project is recommended. +However, the use of [packrat](https://rstudio.github.io/packrat/) with your project is recommended. This library is written in a way to not interfere with the loading order of your project's `R` packages (i.e., `library()` calls), so that the library does not lead to masked definitions. @@ -415,6 +415,8 @@ Additionally, for more examples, the file `showcase.R` is worth a look. * Functionality for the identification of network motifs (subgraph patterns) - `util-plot.R` * Everything needed for plotting networks +- `util-plot-evaluation.R` + * Plotting functions for data evaluation - `util-misc.R` * Helper functions and also legacy functions, both needed in the other files - `showcase.R` @@ -521,6 +523,10 @@ There is no way to update the entries, except for the revision-based parameters. - `commits.filter.untracked.files` * Remove all information concerning untracked files from the commit data. This effect becomes clear when retrieving commits using `get.commits.filtered`, because then the result of which does not contain any commits that solely changed untracked files. Networks built on top of this `ProjectData` do also not contain any information about untracked files. * [*`TRUE`*, `FALSE`] +- `mails.filter.patchstack.mails` + * Filter patchstack mails from the mail data. In a thread, a patchstack spans the first sequence of mails where each mail has been authored by the thread creator and has been sent within a short time window after the preceding mail. The mails spanned by a patchstack are called +'patchstack mails' and for each patchstack, every patchstack mail but the first one is filtered when `mails.filter.patchstack.mails = TRUE`. 
+ * [`TRUE`, *`FALSE`*] - `issues.only.comments` * Only use comments from the issue data on disk and no further events such as references and label changes * [*`TRUE`*, `FALSE`] diff --git a/showcase.R b/showcase.R index 16861cac..8a2828a5 100644 --- a/showcase.R +++ b/showcase.R @@ -17,6 +17,7 @@ ## Copyright 2017 by Felix Prasse ## Copyright 2017-2018 by Thomas Bock ## Copyright 2018 by Jakob Kronawitter +## Copyright 2019 by Klara Schlueter ## All Rights Reserved. @@ -80,6 +81,13 @@ revisions.callgraph = proj.conf$get.value("revisions.callgraph") x.data = ProjectData$new(project.conf = proj.conf) x = NetworkBuilder$new(project.data = x.data, network.conf = net.conf) +## * Evaluation plots ------------------------------------------------------ + +# edit.types = plot.commit.edit.types.in.project(x.data) +# edit.types.scaled = plot.commit.edit.types.in.project(x.data, TRUE) +# editor.types = plot.commit.editor.types.by.author(x.data) +# editor.types.scaled = plot.commit.editor.types.by.author(x.data, TRUE) + ## * Data retrieval -------------------------------------------------------- # x.data$get.commits() diff --git a/tests/test-data.R b/tests/test-data.R index ed7c7d8d..f996eefe 100644 --- a/tests/test-data.R +++ b/tests/test-data.R @@ -12,7 +12,8 @@ ## 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. ## ## Copyright 2018 by Christian Hechtl -## Copyright 2018 by Claus Hunsen +## Copyright 2018-2019 by Claus Hunsen +## Copyright 2019 by Jakob Kronawitter ## All Rights Reserved. @@ -34,6 +35,7 @@ test_that("Compare two ProjectData objects", { ##initialize a ProjectData object with the ProjectConf and clone it into another one proj.conf = ProjectConf$new(CF.DATA, CF.SELECTION.PROCESS, CASESTUDY, ARTIFACT) + proj.conf$update.value("pasta", TRUE) proj.data.one = ProjectData$new(project.conf = proj.conf) proj.data.two = proj.data.one$clone() @@ -43,19 +45,20 @@ test_that("Compare two ProjectData objects", { ## second object, as well, and test for equality. 
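For readers trying out the new option documented above, a short configuration sketch (not part of the patch). It assumes coronet's sources are loaded and that the variables `CF.DATA`, `CF.SELECTION.PROCESS`, `CASESTUDY`, and `ARTIFACT` point at existing Codeface extraction data, as in the test setup:

```r
## a sketch, assuming the coronet util-*.R files are sourced and the
## CF.* / CASESTUDY / ARTIFACT variables point at valid extraction data
proj.conf = ProjectConf$new(CF.DATA, CF.SELECTION.PROCESS, CASESTUDY, ARTIFACT)

## enable the new patchstack-mail filtering (the default is FALSE)
proj.conf$update.value("mails.filter.patchstack.mails", TRUE)

proj.data = ProjectData$new(project.conf = proj.conf)
mails = proj.data$get.mails()  ## patchstack mails are filtered on read-in
```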
##change the second data object - proj.data.one$get.commits() + + proj.data.two$get.pasta() expect_false(proj.data.one$equals(proj.data.two), "Two not identical ProjectData objects.") - proj.data.two$get.commits() + proj.data.one$get.pasta() expect_true(proj.data.one$equals(proj.data.two), "Two identical ProjectData objects.") - proj.data.two$get.pasta() + proj.data.one$get.commits() expect_false(proj.data.one$equals(proj.data.two), "Two not identical ProjectData objects.") - proj.data.one$get.pasta() + proj.data.two$get.commits() expect_true(proj.data.one$equals(proj.data.two), "Two identical ProjectData objects.") @@ -123,3 +126,56 @@ test_that("Compare two RangeData objects", { expect_false(proj.data.base$equals(range.data.four)) }) + +test_that("Filter patchstack mails", { + + proj.conf = ProjectConf$new(CF.DATA, CF.SELECTION.PROCESS, CASESTUDY, ARTIFACT) + proj.conf$update.value("mails.filter.patchstack.mails", TRUE) + + ## create the project data + proj.data = ProjectData$new(proj.conf) + + ## retrieve the mails while filtering patchstack mails + mails.filtered = proj.data$get.mails() + + ## create new project with filtering disabled + proj.conf$update.value("mails.filter.patchstack.mails", FALSE) + proj.data = ProjectData$new(proj.conf) + + ## retrieve the mails without filtering patchstack mails + mails.unfiltered = proj.data$get.mails() + + ## get message ids + mails.filtered.mids = mails.filtered[["message.id"]] + mails.unfiltered.mids = mails.unfiltered[["message.id"]] + + expect_equal(setdiff(mails.unfiltered.mids, mails.filtered.mids), c("", + "", + "", + "", + "")) +}) + +test_that("Filter patchstack mails with PaStA enabled", { + proj.conf = ProjectConf$new(CF.DATA, CF.SELECTION.PROCESS, CASESTUDY, ARTIFACT) + proj.conf$update.value("mails.filter.patchstack.mails", TRUE) + proj.conf$update.value("pasta", TRUE) + + proj.data = ProjectData$new(proj.conf) + + ## retrieve filtered PaStA data by calling 'get.pasta' which calls the filtering functionality 
internally + filtered.pasta = proj.data$get.pasta() + + ## ensure that the remaining mails have not been touched + expect_true("" %in% filtered.pasta[["message.id"]]) + expect_true("" %in% filtered.pasta[["message.id"]]) + expect_true("" %in% filtered.pasta[["message.id"]]) + expect_equal(2, sum(filtered.pasta[["message.id"]] == "")) + + ## ensure that the three PaStA entries relating to the filtered patchstack mails have been merged into a single new + ## PaStA entry which is assigned the message ID of the first patchstack mail + expect_true("" %in% filtered.pasta[["message.id"]]) + + ## ensure that there are no other entries than the ones that have been verified to exist above + expect_equal(6, nrow(filtered.pasta)) +}) diff --git a/tests/test-networks-covariates.R b/tests/test-networks-covariates.R index 44233926..eb7d71e2 100644 --- a/tests/test-networks-covariates.R +++ b/tests/test-networks-covariates.R @@ -818,9 +818,7 @@ test_that("Test add.vertex.attribute.artifact.editor.count", { networks.and.data = get.network.covariates.test.networks("artifact") - expected.attributes = network.covariates.test.build.expected(list(1L), list(1L), list(3L, 1L)) - - expected.attributes = list( + expected.attributes.author = list( range = network.covariates.test.build.expected( c(1L), c(1L), c(3L, 1L)), cumulative = network.covariates.test.build.expected( @@ -834,18 +832,58 @@ complete = network.covariates.test.build.expected( c(2L), c(2L), c(3L, 1L)) ) + expected.attributes.committer = list( + range = network.covariates.test.build.expected( + c(1L), c(1L), c(2L, 1L)), + cumulative = network.covariates.test.build.expected( + c(1L), c(1L), c(2L, 1L)), + all.ranges = network.covariates.test.build.expected( + c(1L), c(1L), c(2L, 1L)), + project.cumulative = network.covariates.test.build.expected( + c(1L), c(1L), c(2L, 1L)), + project.all.ranges = network.covariates.test.build.expected( + c(1L), c(1L), c(2L, 1L)), + 
complete = network.covariates.test.build.expected( + c(1L), c(1L), c(2L, 1L)) + ) + expected.attributes.both = list( + range = network.covariates.test.build.expected( + c(1L), c(2L), c(3L, 1L)), + cumulative = network.covariates.test.build.expected( + c(1L), c(2L), c(3L, 1L)), + all.ranges = network.covariates.test.build.expected( + c(2L), c(2L), c(3L, 1L)), + project.cumulative = network.covariates.test.build.expected( + c(1L), c(2L), c(3L, 1L)), + project.all.ranges = network.covariates.test.build.expected( + c(2L), c(2L), c(3L, 1L)), + complete = network.covariates.test.build.expected( + c(2L), c(2L), c(3L, 1L)) + ) ## Test lapply(AGGREGATION.LEVELS, function(level) { - networks.with.attr = add.vertex.attribute.artifact.editor.count( + networks.with.attr.author = add.vertex.attribute.artifact.editor.count( networks.and.data[["networks"]], networks.and.data[["project.data"]], aggregation.level = level ) + networks.with.attr.committer = add.vertex.attribute.artifact.editor.count( + networks.and.data[["networks"]], networks.and.data[["project.data"]], + aggregation.level = level, editor.definition = "committer" + ) + networks.with.attr.both = add.vertex.attribute.artifact.editor.count( + networks.and.data[["networks"]], networks.and.data[["project.data"]], + aggregation.level = level, editor.definition = c("author", "committer") + ) - actual.attributes = lapply(networks.with.attr, igraph::get.vertex.attribute, name = "editor.count") + actual.attributes.author = lapply(networks.with.attr.author, igraph::get.vertex.attribute, name = "editor.count") + actual.attributes.committer = lapply(networks.with.attr.committer, igraph::get.vertex.attribute, name = "editor.count") + actual.attributes.both = lapply(networks.with.attr.both, igraph::get.vertex.attribute, name = "editor.count") - expect_equal(expected.attributes[[level]], actual.attributes) + expect_equal(expected.attributes.author[[level]], actual.attributes.author) + 
expect_equal(expected.attributes.committer[[level]], actual.attributes.committer) + expect_equal(expected.attributes.both[[level]], actual.attributes.both) }) }) diff --git a/util-conf.R b/util-conf.R index 974ae65a..0aecfc43 100644 --- a/util-conf.R +++ b/util-conf.R @@ -355,6 +355,12 @@ ProjectConf = R6::R6Class("ProjectConf", inherit = Conf, allowed = c(TRUE, FALSE), allowed.number = 1 ), + mails.filter.patchstack.mails = list( + default = FALSE, + type = "logical", + allowed = c(TRUE, FALSE), + allowed.number = 1 + ), synchronicity = list( default = FALSE, type = "logical", diff --git a/util-core-peripheral.R b/util-core-peripheral.R index e57e162c..028a0856 100644 --- a/util-core-peripheral.R +++ b/util-core-peripheral.R @@ -14,7 +14,7 @@ ## Copyright 2017 by Mitchell Joblin ## Copyright 2017 by Ferdinand Frank ## Copyright 2017 by Sofie Kemper -## Copyright 2017-2019 by Claus Hunsen +## Copyright 2017-2020 by Claus Hunsen ## Copyright 2017 by Felix Prasse ## Copyright 2018-2019 by Christian Hechtl ## Copyright 2018 by Klara Schlüter @@ -637,7 +637,7 @@ get.committer.not.author.commit.count = function(range.data) { res = sqldf::sqldf("SELECT *, COUNT(*) AS `freq` FROM `commits.df` WHERE `committer.name` <> `author.name` GROUP BY `committer.name`, `author.name` - ORDER BY `freq` DESC") + ORDER BY `freq` DESC, `author.name` ASC") logging::logdebug("get.committer.not.author.commit.count: finished.") return(res) @@ -664,7 +664,7 @@ get.committer.and.author.commit.count = function(range.data) { res = sqldf::sqldf("SELECT *, COUNT(*) AS `freq` FROM `commits.df` WHERE `committer.name` = `author.name` GROUP BY `committer.name`, `author.name` - ORDER BY `freq` DESC") + ORDER BY `freq` DESC, `author.name` ASC") logging::logdebug("get.committer.and.author.commit.count: finished.") return(res) @@ -699,7 +699,7 @@ get.committer.or.author.commit.count = function(range.data) { res = sqldf::sqldf("SELECT *, COUNT(*) AS `freq` FROM `ungrouped` GROUP BY `name` - ORDER BY `freq` 
DESC") + ORDER BY `freq` DESC, `name` ASC") logging::logdebug("get.committer.or.author.commit.count: finished.") return(res) @@ -725,7 +725,7 @@ get.committer.commit.count = function(range.data) { ## Execute a query to get the commit count per author res = sqldf::sqldf("SELECT *, COUNT(*) AS `freq` FROM `commits.df` - GROUP BY `committer.name` ORDER BY `freq` DESC") + GROUP BY `committer.name` ORDER BY `freq` DESC, `committer.name` ASC") logging::logdebug("get.committer.commit.count: finished.") return(res) @@ -751,7 +751,7 @@ get.author.commit.count = function(proj.data) { ## Execute a query to get the commit count per author res = sqldf::sqldf("SELECT `author.name`, COUNT(*) AS `freq` FROM `commits.df` - GROUP BY `author.name` ORDER BY `freq` DESC") + GROUP BY `author.name` ORDER BY `freq` DESC, `author.name` ASC") logging::logdebug("get.author.commit.count: finished.") return(res) @@ -813,7 +813,7 @@ get.author.loc.count = function(proj.data) { ## Execute a query to get the changed lines per author res = sqldf::sqldf("SELECT `author.name`, SUM(`added.lines`) + SUM(`deleted.lines`) AS `loc` FROM `commits.df` - GROUP BY `author.name` ORDER BY `loc` DESC") + GROUP BY `author.name` ORDER BY `loc` DESC, `author.name` ASC") logging::logdebug("get.author.loc.count: finished.") return(res) diff --git a/util-data.R b/util-data.R index 6492ebc7..e1e424df 100644 --- a/util-data.R +++ b/util-data.R @@ -63,6 +63,9 @@ DATASOURCE.TO.ARTIFACT.COLUMN = list( "issues" = "issue.id" ) +## the maximum time difference between subsequent mails of a patchstack +PATCHSTACK.MAIL.DECAY.THRESHOLD = "30 seconds" + ## / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / ## ProjectData ------------------------------------------------------------- @@ -101,6 +104,7 @@ ProjectData = R6::R6Class("ProjectData", commits = NULL, # data.frame ## mails mails = NULL, # data.frame + mails.patchstacks = NULL, # list ## issues issues = NULL, #data.frame ## authors @@ -113,37 +117,147 
@@ ProjectData = R6::R6Class("ProjectData", ## timestamps of mail, issue and commit data data.timestamps = NULL, #data.frame - ## * * filtering commits ------------------------------------------- + ## * * commit filtering -------------------------------------------- - #' Filter commits retrieved by the method \code{get.commits} after potentially removing untracked files and the - #' base artifact (see parameters). + #' Filter commits by potentially removing commits to untracked files or to the base artifact (see parameters). #' + #' @param commits the data.frame of commits on which filtering will be applied #' @param remove.untracked.files flag whether untracked files are kept or removed #' @param remove.base.artifact flag whether the base artifact is kept or removed #' - #' @return the commits retrieved by the method \code{get.commits} after all filters have been applied - filter.commits = function(remove.untracked.files, remove.base.artifact) { + #' @return the commits after all filters have been applied + filter.commits = function(commits, remove.untracked.files, remove.base.artifact) { logging::logdebug("filter.commits: starting.") - ## get commit data - commit.data = self$get.commits() - ## filter out the untracked files if (remove.untracked.files) { - commit.data = subset(commit.data, file != UNTRACKED.FILE) + commits = subset(commits, file != UNTRACKED.FILE) } ## filter out the base artifacts (i.e., Base_Feature, File_Level) if (remove.base.artifact) { - commit.data = subset(commit.data, !(artifact %in% BASE.ARTIFACTS)) + commits = subset(commits, !(artifact %in% BASE.ARTIFACTS)) } logging::logdebug("filter.commits: finished.") - return(commit.data) + return(commits) + }, + + ## * * mail filtering ---------------------------------------------- + + #' Filter patchstack mails from the mails that are currently cached in the field \code{mails} and return them. + #' Store detected patchstacks in the field \code{mails.patchstacks}. 
They are used later in the + #' function \code{filter.pasta.data} to also account for the deleted mails in the PaStA data. + #' + #' In a thread, a patchstack spans the first sequence of mails where each mail has been authored by the thread + #' creator and has been sent within a short time window (see \code{PATCHSTACK.MAIL.DECAY.THRESHOLD}) after the + #' preceding mail. + #' The mails spanned by a patchstack are called 'patchstack mails'. + #' + #' For each patchstack, all patchstack mails but the first one are filtered. + #' + #' @return the mail data after filtering patchstack mails + filter.patchstack.mails = function() { + logging::logdebug("filter.patchstack.mails: starting.") + + ## retrieve mails grouped by thread IDs + thread.data = self$group.authors.by.data.column("mails", "thread") + + ## extract the patchstack mails and the filtered mails for each thread + result = parallel::mclapply(thread.data, function(thread) { + + ## ensure that all mails within the thread are ordered correctly + thread = thread[order(thread["date"]), ] + + running = TRUE + i = 1 + + ## find the largest index 'i' for which it holds that each mail up to index 'i' has been authored by the + ## thread creator and that all mails up to index 'i' have been received within a successive time window + ## of 'PATCHSTACK.MAIL.DECAY.THRESHOLD' + while (i < nrow(thread) && running) { + if (thread[1, "author.name"] == thread[i + 1, "author.name"] && + thread[i + 1, "date"] - thread[i, "date"] <= + lubridate::as.duration(PATCHSTACK.MAIL.DECAY.THRESHOLD)) { + i = i + 1 + } else { + running = FALSE + } + } + + ## return the mails of the thread with all patchstack mails but the first one being removed + return (list(keep = thread[setdiff(seq_len(nrow(thread)), seq_len(i)[-1]), ], + patchstack = thread[seq_len(i), ])) + }) + + ## override thread data with filtered thread data + thread.data = lapply(result, function(x) x[["keep"]]) + + ## flatten the list of mail-dataframes (i.e. 
thread.data) to a single mail-dataframe + mails = plyr::rbind.fill(thread.data) + + ## Retrieve patchstacks from the result above which are used to manipulate the PaStA data. This needs to be + ## done because the PaStA data relates to some of the filtered mails and must be adjusted accordingly. + patchstacks = lapply(result, function(x) x[["patchstack"]]) + + ## only patchstacks that contain at least two mails are considered patchstacks + patchstacks = patchstacks[lapply(patchstacks, nrow) > 1] + + ## store patchstack information + private$mails.patchstacks = patchstacks + + logging::logdebug("filter.patchstack.mails: finished.") + return(mails) }, ## * * PaStA data -------------------------------------------------- + #' Use the information about the deleted patchstack mails that are stored in the field \code{mails.patchstacks} + #' to also filter out PaStA information that relates to the deleted mails. + #' + #' The PaStA information is not discarded completely however but instead is gathered for each patchstack and is + #' assigned to the first mail in each patchstack because this very first mail has not been filtered and + #' represents the patchstack. + #' + #' @return the filtered PaStA data + filter.pasta.data = function() { + logging::logdebug("filter.pasta.data: starting.") + + new.pasta = parallel::mclapply(private$mails.patchstacks, function(patchstack) { + + ## get all PaStA data that relates to the current patchstack (do not drop data.frame structure!) 
+ pasta.tmp = private$pasta[private$pasta[["message.id"]] %in% patchstack[["message.id"]], , drop = FALSE] + + ## override all old message IDs with the message ID of the first mail in the patchstack since it + ## is the only one that is kept (if any data is available in 'pasta.tmp') + if (nrow(pasta.tmp) > 0) { + pasta.tmp["message.id"] = patchstack[1, "message.id"] + } + + return(pasta.tmp) + }) + ## combine new re-written PaStA data + new.pasta = plyr::rbind.fill(new.pasta) + + ## remove potential duplicates + new.pasta = unique(new.pasta) + + ## remove old items from PaStA data + ## 1) flatten the list of mail-dataframes (i.e. patchstacks) to a single mail-dataframe + patchstack.mails = plyr::rbind.fill(private$mails.patchstacks) + ## 2) delete any PaStA information that relates to message IDs of mails that will be discarded + pasta = private$pasta[!(private$pasta[["message.id"]] %in% patchstack.mails[["message.id"]]), ] + + ## append the new pasta data to the old pasta data + pasta = plyr::rbind.fill(pasta, new.pasta) + + ## reestablish ordering using the 'revision.set.id' column of the PaStA data + pasta = pasta[order(pasta[["revision.set.id"]]), ] + + logging::logdebug("filter.pasta.data: finished.") + return(pasta) + }, + #' Aggregate PaStA data for convenient merging to main data sources. #' #' In detail, the given PaStA data is independently aggregated by both the @@ -153,17 +267,14 @@ ProjectData = R6::R6Class("ProjectData", #' #' **Note**: The column \code{commit.hash} gets renamed to \code{hash} to match #' the corresponding column in the commit data (see \code{read.commits}). 
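The per-message-ID aggregation that `aggregate.pasta.data` performs can be sketched in isolation. The data below is hypothetical; the `aggregate` call follows the same formula pattern as the method (grouping column on the right-hand side, all remaining columns collapsed by a group function):

```r
## hypothetical PaStA-like data: several rows may share one message ID
pasta = data.frame(message.id = c("<a@x>", "<a@x>", "<b@x>"),
                   revision.set.id = c(1L, 2L, 3L),
                   stringsAsFactors = FALSE)

## collapse all rows per message ID; the group function decides how the
## remaining cells are combined (here: unique values joined into one string)
group.fun = function(cells) paste(unique(cells), collapse = ", ")
group.col = "message.id"
pasta.by.mid = aggregate(as.formula(sprintf(". ~ %s", group.col)),
                         pasta, group.fun, na.action = na.pass)
## pasta.by.mid now has one row per message ID
```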
- #' - #' @param pasta.data a data.frame of PaStA data as retrieved from - #' \code{ProjectData$get.pasta.data} - aggregate.pasta.data = function(pasta.data) { + aggregate.pasta.data = function() { logging::logdebug("aggregate.pasta.data: starting.") ## check for data first - if (nrow(pasta.data) == 0) { + if (nrow(private$pasta) == 0) { ## take (empty) input data and no rows from it - private$pasta.mails = pasta.data[0, ] - private$pasta.commits = pasta.data[0, ] + private$pasta.mails = create.empty.pasta.list() + private$pasta.commits = create.empty.pasta.list() } else { ## compute aggregated data.frames for easier merging ## 1) define group function (determines result in aggregated data.frame cells) @@ -171,13 +282,13 @@ ProjectData = R6::R6Class("ProjectData", ## 2) aggregate by message ID group.col = "message.id" private$pasta.mails = aggregate( - as.formula(sprintf(". ~ %s", group.col)), pasta.data, + as.formula(sprintf(". ~ %s", group.col)), private$pasta, group.fun, na.action = na.pass ) ## 3) aggregate by commit hash group.col = "commit.hash" private$pasta.commits = aggregate( - as.formula(sprintf(". ~ %s", group.col)), pasta.data, + as.formula(sprintf(". ~ %s", group.col)), private$pasta, group.fun, na.action = na.pass ) } @@ -189,6 +300,107 @@ ProjectData = R6::R6Class("ProjectData", logging::logdebug("aggregate.pasta.data: finished.") }, + #' Update the PaStA-related columns \code{pasta} and \code{revision.set.id} that are appended to \code{commits} + #' using the currently available PaStA data from the field \code{pasta.commits}. 
+ update.pasta.commit.data = function() { + logging::logdebug("update.pasta.commit.data: starting.") + + ## return immediately if no commits available + if (!is.null(private$commits)) { + + ## remove previous PaStA data + private$commits["pasta"] = NULL + private$commits["revision.set.id"] = NULL + + ## merge PaStA data + private$commits = merge(private$commits, private$pasta.commits, + by = "hash", all.x = TRUE, sort = FALSE) + + ## sort by date again because 'merge' disturbs the order + private$commits = private$commits[order(private$commits[["date"]], decreasing = FALSE), ] + } + + logging::logdebug("update.pasta.commit.data: finished.") + }, + + #' Update the PaStA-related columns \code{pasta} and \code{revision.set.id} that are appended to \code{mails} + #' using the currently available PaStA data from the field \code{pasta.mails}. + update.pasta.mail.data = function() { + logging::logdebug("update.pasta.mail.data: starting.") + + ## return immediately if no mails available + if (!is.null(private$mails)) { + + ## remove previous PaStA data + private$mails["pasta"] = NULL + private$mails["revision.set.id"] = NULL + + ## merge PaStA data + private$mails = merge(private$mails, private$pasta.mails, + by = "message.id", all.x = TRUE, sort = FALSE) + + ## sort by date again because 'merge' disturbs the order + private$mails = private$mails[order(private$mails[["date"]], decreasing = FALSE), ] + } + + logging::logdebug("update.pasta.mail.data: finished.") + }, + + #' Recompute the values of the cached fields \code{pasta.mails} and \code{pasta.commits} using the currently + #' available PaStA information of the field \code{pasta} and also assign/update this PaStA information to + #' \code{mails} and \code{commits}. + #' + #' This method should be called whenever the field \code{pasta} is changed. 
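The merge-then-resort idiom used in `update.pasta.commit.data` and `update.pasta.mail.data` above can be sketched with toy data (all names hypothetical): `merge` performs the left join but does not preserve row order, so the frame is sorted by date again afterwards.

```r
## toy commit data ordered by date, plus PaStA info keyed by commit hash
commits = data.frame(hash = c("c1", "c2", "c3"),
                     date = as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-03"),
                                       tz = "UTC"),
                     stringsAsFactors = FALSE)
pasta.commits = data.frame(hash = c("c3", "c1"),
                           revision.set.id = c(7L, 4L),
                           stringsAsFactors = FALSE)

## left join: keep every commit, attach PaStA data where available (NA otherwise)
merged = merge(commits, pasta.commits, by = "hash", all.x = TRUE, sort = FALSE)

## 'merge' disturbs the order, so sort by date again
merged = merged[order(merged[["date"]], decreasing = FALSE), ]
```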
+ update.pasta.data = function() { + logging::logdebug("update.pasta.data: starting.") + + ## filter patchstack mails from PaStA data if configured + if (private$project.conf$get.value("mails.filter.patchstack.mails")) { + private$pasta = private$filter.pasta.data() + } + + ## aggregate by message IDs and commit hashes + private$aggregate.pasta.data() + + ## update mail data by attaching PaStA data + if (!is.null(private$mails)) { + private$update.pasta.mail.data() + } + + ## update commit data by attaching PaStA data + if (!is.null(private$commits)) { + private$update.pasta.commit.data() + } + + logging::logdebug("update.pasta.data: finished.") + }, + + ## * * synchronicity data ------------------------------------------ + + #' Update the column \code{synchronicity} that is appended to commits using the currently available + #' synchronicity data from the field \code{synchronicity}. + #' + #' This method should be called whenever the field \code{synchronicity} is changed. + update.synchronicity.data = function() { + logging::logdebug("update.synchronicity.data: starting.") + + ## update commit data by attaching synchronicity data + if (!is.null(private$commits)) { + ## remove previous synchronicity data + private$commits["synchronicity"] = NULL + + ## merge synchronicity data + private$commits = merge(private$commits, private$synchronicity, + by = "hash", all.x = TRUE, sort = FALSE) + + ## sort by date again because 'merge' disturbs the order + private$commits = private$commits[order(private$commits[["date"]], decreasing = FALSE), ] + + } + + logging::logdebug("update.synchronicity.data: finished.") + }, + ## * * timestamps -------------------------------------------------- #' Call the getters of the specified data sources in order to @@ -388,6 +600,7 @@ ProjectData = R6::R6Class("ProjectData", get.commits.filtered = function() { if (is.null(private$commits.filtered)) { private$commits.filtered = private$filter.commits( + self$get.commits(), 
private$project.conf$get.value("commits.filter.untracked.files"), private$project.conf$get.value("commits.filter.base.artifact") ) @@ -408,7 +621,7 @@ ProjectData = R6::R6Class("ProjectData", #' #' @seealso get.commits.filtered get.commits.filtered.uncached = function(remove.untracked.files, remove.base.artifact) { - return (private$filter.commits(remove.untracked.files, remove.base.artifact)) + return (private$filter.commits(self$get.commits(), remove.untracked.files, remove.base.artifact)) }, #' Get the list of commits which have the artifact kind configured in the \code{project.conf}. @@ -445,52 +658,45 @@ ProjectData = R6::R6Class("ProjectData", set.commits = function(commit.data) { logging::loginfo("Setting commit data.") - # TODO: Also check for correct shape (column names and data types) of the passed data - if (is.null(commit.data)) { commit.data = create.empty.commits.list() } - ## append synchronicity data if wanted + ## store commit data + private$commits = commit.data + + ## add synchronicity data if wanted if (private$project.conf$get.value("synchronicity")) { - logging::loginfo("Adding synchronicity data.") - synchronicity.data = self$get.synchronicity() - ## remove previous synchronicity data - if ("synchronicity" %in% colnames(commit.data)) { - commit.data["synchronicity"] = NULL + if (is.null(private$synchronicity)) { + ## get data (no assignment because we just want to trigger anything synchronicity-related) + self$get.synchronicity() + } else { + ## update all synchronicity-related data + private$update.synchronicity.data() } - commit.data = merge(commit.data, synchronicity.data, - by = "hash", all.x = TRUE, sort = FALSE) } ## add PaStA data if wanted if (private$project.conf$get.value("pasta")) { - logging::loginfo("Adding PaStA data.") - ## get data - self$get.pasta() # no assignment because we just want to trigger the read-in - ## remove previous PaStA data - if ("pasta" %in% colnames(commit.data)) { - commit.data["pasta"] = NULL - 
commit.data["revision.set.id"] = NULL + if (is.null(private$pasta)) { + ## get data (no assignment because we just want to trigger anything PaStA-related) + self$get.pasta() + } else { + ## update all PaStA-related data + private$update.pasta.data() } - ## merge PaStA data - commit.data = merge(commit.data, private$pasta.commits, - by = "hash", all.x = TRUE, sort = FALSE) } - ## sort by date again (because 'merge' is doing bullshit!) - commit.data = commit.data[order(commit.data[["date"]], decreasing = FALSE), ] # sort! - - private$commits = commit.data + ## sort by date + private$commits = private$commits[order(private$commits[["date"]], decreasing = FALSE), ] ## remove cached data for filtered commits as these need to be re-computed after ## changing the data private$commits.filtered = NULL }, - #' Get the synchronicity data. - #' If it does not already exist call the read method. - #' Call the setter function to set the data. + #' Get the synchronicity data. If it is not already stored in the ProjectData, this function triggers a read in + #' from disk. #' #' @return the synchronicity data get.synchronicity = function() { @@ -500,16 +706,19 @@ ProjectData = R6::R6Class("ProjectData", if (private$project.conf$get.value("synchronicity")) { ## if data are not read already, read them if (is.null(private$synchronicity)) { - synchronicity.data = read.synchronicity( + private$synchronicity = read.synchronicity( self$get.data.path.synchronicity(), private$project.conf$get.value("artifact"), private$project.conf$get.value("synchronicity.time.window") ) - ## set actual data - self$set.synchronicity(synchronicity.data) + ## no read of commit data needed here! + + ## update all synchronicity-related data + private$update.synchronicity.data() } } else { + logging::logwarn("You have not set the ProjectConf parameter 'synchronicity' to 'TRUE'! 
Ignoring...") ## mark synchronicity data as empty self$set.synchronicity(NULL) } @@ -534,16 +743,16 @@ ProjectData = R6::R6Class("ProjectData", ## add synchronicity data to the commit data if configured if (private$project.conf$get.value("synchronicity")) { - logging::loginfo("Updating synchronicity data.") - if (!is.null(private$commits)) { - self$set.commits(private$commits) - } + + ## no read of commit data needed here! + + ## update all synchronicity-related data + private$update.synchronicity.data() } }, - #' Get the PaStA data. - #' If it does not already exist call the read method. - #' Call the setter function to set the data. + #' Get the PaStA data. If it is not already stored in the ProjectData, this function triggers a read in + #' from disk. #' #' @return the PaStA data get.pasta = function() { @@ -553,12 +762,21 @@ ProjectData = R6::R6Class("ProjectData", if (private$project.conf$get.value("pasta")) { ## if data are not read already, read them if (is.null(private$pasta)) { - pasta.data = read.pasta(self$get.data.path.pasta()) - - ## set actual data - self$set.pasta(pasta.data) + ## read PaStA data from disk + private$pasta = read.pasta(self$get.data.path.pasta()) + + ## read mail data if filtering patchstack mails + if (is.null(private$mails) + && private$project.conf$get.value("mails.filter.patchstack.mails")) { + ## just triggering read-in, no storage + self$get.mails() + } else { + ## update all PaStA-related data + private$update.pasta.data() + } } } else { + logging::logwarn("You have not set the ProjectConf parameter 'pasta' to 'TRUE'! 
Ignoring...") ## mark PaStA data as empty self$set.pasta(NULL) } @@ -581,17 +799,19 @@ ProjectData = R6::R6Class("ProjectData", ## set the actual data private$pasta = data - ## aggregate by message IDs and commit hashes - private$aggregate.pasta.data(private$pasta) - ## add PaStA data to commit and mail data if configured if (private$project.conf$get.value("pasta")) { - logging::loginfo("Updating PaStA data.") - if (!is.null(private$commits)) { - self$set.commits(private$commits) - } - if (!is.null(private$mails)) { - self$set.mails(private$mails) + + ## read mail data if filtering patchstack mails + if (is.null(private$mails) && + private$project.conf$get.value("mails.filter.patchstack.mails")) { + ## just triggering read-in, no storage + self$get.mails() + + } else { + ## update all PaStA-related data + private$update.pasta.data() + } } }, @@ -609,7 +829,7 @@ ProjectData = R6::R6Class("ProjectData", if (is.null(private$mails)) { mails.read = read.mails(self$get.data.path()) - self$set.mails(data = mails.read) + self$set.mails(mails.read) } private$extract.timestamps(source = "mails") @@ -619,33 +839,35 @@ ProjectData = R6::R6Class("ProjectData", #' Set the mail data to the given new data and add PaStA data #' if configured in the field \code{project.conf}. 
#' - #' @param data the new mail data - set.mails = function(data) { + #' @param mail.data the new mail data + set.mails = function(mail.data) { logging::loginfo("Setting e-mail data.") - if (is.null(data)) { - data = create.empty.mails.list() + if (is.null(mail.data)) { + mail.data = create.empty.mails.list() + } + + ## store mail data + private$mails = mail.data + + ## filter patchstack mails and store again + if (private$project.conf$get.value("mails.filter.patchstack.mails")) { + private$mails = private$filter.patchstack.mails() } ## add PaStA data if wanted if (private$project.conf$get.value("pasta")) { - logging::loginfo("Adding PaStA data.") - ## get data - self$get.pasta() # no assignment because we just want to trigger the read-in - ## remove previous PaStA data - if ("pasta" %in% colnames(data)) { - data["pasta"] = NULL - data["revision.set.id"] = NULL + if (is.null(private$pasta)) { + ## get data (no assignment because we just want to trigger anything PaStA-related) + self$get.pasta() + } else { + ## update all PaStA-related data + private$update.pasta.data() } - ## merge PaStA data - data = merge(data, private$pasta.mails, - by = "message.id", all.x = TRUE, sort = FALSE) } - ## sort by date again (because 'merge' is doing bullshit!) - data = data[order(data[["date"]], decreasing = FALSE), ] # sort! - - private$mails = data + ## sort by date + private$mails = private$mails[order(private$mails[["date"]], decreasing = FALSE), ] }, #' Get the author data. diff --git a/util-init.R b/util-init.R index e307c788..df6db710 100644 --- a/util-init.R +++ b/util-init.R @@ -16,6 +16,7 @@ ## Copyright 2017 by Raphael Nömmer ## Copyright 2017 by Sofie Kemper ## Copyright 2017 by Felix Prasse +## Copyright 2019 by Klara Schlüter ## All Rights Reserved. 
@@ -60,3 +61,4 @@ source("util-plot.R") source("util-core-peripheral.R") source("util-networks-metrics.R") source("util-networks-covariates.R") +source("util-plot-evaluation.R") diff --git a/util-networks-covariates.R b/util-networks-covariates.R index b6510817..4c4a945a 100644 --- a/util-networks-covariates.R +++ b/util-networks-covariates.R @@ -666,6 +666,8 @@ add.vertex.attribute.author.role = function(list.of.networks, classification.res #' \code{"project.cumulative"}, \code{"project.all.ranges"}, and #' \code{"complete"}. See \code{split.data.by.networks} for #' more details. [default: "range"] +#' @param editor.definition Determines who is counted as an editor of an artifact (one or more of +#' \code{c("author", "committer")}). [default: "author"] #' @param default.value The default value to add if a vertex has no matching value [default: 0] #' #' @return A list of networks with the added attribute @@ -673,17 +675,28 @@ add.vertex.attribute.artifact.editor.count = function(list.of.networks, project.
aggregation.level = c("range", "cumulative", "all.ranges", "project.cumulative", "project.all.ranges", "complete"), + editor.definition = c("author", "committer"), default.value = 0) { aggregation.level = match.arg.or.default(aggregation.level, default = "range") + ## match editor definitions to column name in commit dataframe + if (missing(editor.definition)) { + editor.definition = "author" + } else { + editor.definition = match.arg.or.default(editor.definition, choices = c("author", "committer"), several.ok = TRUE) + } + editor.definition = paste0(editor.definition, ".name") + nets.with.attr = split.and.add.vertex.attribute( list.of.networks, project.data, name, aggregation.level, default.value, function(range, range.data, net) { - lapply(range.data$group.authors.by.data.column("commits", "artifact"), - function(x) { - length(unique(x[["author.name"]])) + vertex.attributes = lapply(range.data$group.authors.by.data.column("commits", "artifact"), + function(artifact.commits) { + editor.count = length(unique(unlist(artifact.commits[editor.definition]))) + return(editor.count) } ) + return(vertex.attributes) } ) diff --git a/util-plot-evaluation.R b/util-plot-evaluation.R new file mode 100644 index 00000000..97dbe409 --- /dev/null +++ b/util-plot-evaluation.R @@ -0,0 +1,129 @@ +## This file is part of coronet, which is free software: you +## can redistribute it and/or modify it under the terms of the GNU General +## Public License as published by the Free Software Foundation, version 2. +## +## This program is distributed in the hope that it will be useful, +## but WITHOUT ANY WARRANTY; without even the implied warranty of +## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +## GNU General Public License for more details. +## +## You should have received a copy of the GNU General Public License along +## with this program; if not, write to the Free Software Foundation, Inc., +## 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. 
+## +## Copyright 2019 by Klara Schlüter +## All Rights Reserved. + +## / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / +## Libraries --------------------------------------------------------------- + +requireNamespace("ggplot2") ## plotting + + +## / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / +## Plots regarding commit edit and editor types --------------------------------------- + +#' Produces a barplot showing for every editor the number of commits for which they are only the author, only the committer, and +#' both author and committer. +#' +#' @param data the project data +#' @param percentage.per.author if \code{TRUE}, the barplot shows the relative number of differently edited commits per +#' author: each bar in the barplot (representing the commits of one editor) is scaled to +#' 100%. Otherwise, the absolute number of commits per author is shown in the plot. +#' [default: FALSE] +#' +#' @return a ggplot2/ggraph plot object +plot.commit.editor.types.by.author = function(data, percentage.per.author = FALSE) { + + ## get editor data + and = get.committer.and.author.commit.count(data) + or = get.committer.not.author.commit.count(data) + + ## build data frame as required for plotting + both = data.frame(and[["author.name"]], and[["freq"]]) + colnames(both) = c("editor", "author.and.committer") + + author = aggregate(or[["freq"]], by = list(or[["author.name"]]), FUN = sum) + colnames(author) = c("editor", "only.author") + + committer = aggregate(or[["freq"]], by = list(or[["committer.name"]]), FUN = sum) + colnames(committer) = c("editor", "only.committer") + + plot.data = merge(merge(both, author, all = TRUE), committer, all = TRUE) + plot.data[is.na(plot.data)] = 0 + + ## if desired, calculate percentage of editor types per author + if (percentage.per.author) { + name.column = plot.data[1] + value.columns = plot.data[2:4] + + ## scale data values per author (represented by one line) to 100% + scaled.value.columns = 
t(apply(value.columns, 1, function(x) {x / sum(x)})) + + plot.data = cbind(name.column, scaled.value.columns) + } + + ## compute order of bars from data: only author < author and committer < only committer + ordered.editors = plot.data[["editor"]][with(plot.data, + order(`only.committer`, `author.and.committer`, `only.author`))] + + ## prepare data for a stacked barplot (prepare for stacking the editor types) + plot.data = reshape2::melt(plot.data) + names(plot.data) = c("editor", "editor.type", "commit.count") + + ## draw plot + plot = ggplot2::ggplot(data = plot.data, mapping = ggplot2::aes(x = factor(editor, levels = ordered.editors), + y = `commit.count`, fill = `editor.type`)) + + ## use data frame values instead of counting entries + ggplot2::geom_bar(stat = 'identity') + + ## rotate x-axis labels by 90 degrees + ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, hjust = 1)) + + ## set proper legend items and title + ggplot2::scale_fill_discrete(name = "Commit edit type", + labels = c("author and committer", "only author", "only committer")) + + ## add proper axis labels + ggplot2::labs( + x = "Authors", + y = "Commit count" + ) + return(plot) +} + +#' Produces a barplot showing for how many commits committer and author are the same person and for how many commits +#' committer and author are different. +#' +#' @param data the project data +#' @param relative.y.scale if \code{TRUE}, the y axis shows the percentage of the number of commits of the respective edit +#' type with respect to all commits. If \code{FALSE}, the y axis shows the absolute number of +#' commits. 
[default: FALSE] +#' +#' @return a ggplot2/ggraph plot object +plot.commit.edit.types.in.project = function(data, relative.y.scale = FALSE) { + + ## get commit count + and = get.committer.and.author.commit.count(data) + or = get.committer.not.author.commit.count(data) + + ## build data frame as required for plotting + plot.data = data.frame(c("author.!=.committer", "author.=.committer"), c(sum(or[["freq"]]), sum(and[["freq"]]))) + colnames(plot.data) = c("edit.types", "commit.count") + + ## if desired, calculate values for y-axis labels showing percentage of all commits + if (relative.y.scale) { + plot.data = cbind(plot.data[1], plot.data[2] / sum(plot.data[2])) + } + + ## draw plot + plot = ggplot2::ggplot(data = plot.data, mapping = ggplot2::aes(y = `commit.count`, x = `edit.types`)) + + ## use data frame values instead of counting entries + ggplot2::geom_bar(stat = 'identity') + + ## set proper bar labels + ggplot2::scale_x_discrete(labels = c("author.!=.committer" = "author != committer", + "author.=.committer" = "author = committer")) + + ## add proper axis labels + ggplot2::labs( + x = "Edit types", + y = "Commit count" + ) + return(plot) +}
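As a note outside the patch itself: the core data preparation in `plot.commit.editor.types.by.author` is a pair of full outer joins on the `editor` column followed by zero-filling, and it can be sketched with base R alone. The editor names and counts below are made-up illustration data; in the real function these frames come from the coronet helpers `get.committer.and.author.commit.count` and `get.committer.not.author.commit.count`.

```r
## Sketch of the merge-and-zero-fill step from plot.commit.editor.types.by.author,
## with made-up per-editor commit counts instead of coronet's helper output.

both      = data.frame(editor = c("alice", "bob"),   author.and.committer = c(3, 1))
author    = data.frame(editor = c("alice", "carol"), only.author = c(2, 4))
committer = data.frame(editor = "bob",               only.committer = 5)

## two full outer joins on the shared 'editor' column, as in the patch
plot.data = merge(merge(both, author, all = TRUE), committer, all = TRUE)

## editors missing from one of the inputs come back as NA; zero-fill them
plot.data[is.na(plot.data)] = 0

plot.data
```

Each editor ends up as one row with three count columns, which is exactly the wide shape that `reshape2::melt` later turns into the long format needed for the stacked barplot.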