Add commit message merge functionality #193

nlschn · 2021-01-07T20:34:17Z

Prerequisites

I adhere to the coding conventions (described here) in my code.
I have updated the copyright headers of the files I have modified.
I have written appropriate commit messages, i.e., I have recorded the goal, the need, the needed changes, and the location of my code modifications for each commit. This includes also, e.g., referencing to relevant issues.
I have put signed-off tags in all commits.
I have updated the changelog file NEWS.md appropriately.
The pull request is opened against the branch dev.

Description

[Now from mydev branch]
In order to address issue #180 this PR adds functionality to read and process commit message data in "commitMessages.list" files. The first non-white-space line of the message is considered the title, the rest the body of the commit message.
This functionality can be found in "util-read.R"

The ability to merge the commit messages to the commit data is implemented in "set.commits" in "util-data.R" and a new attribute "commit.messages" has been added to "proj.conf" to control what data shall be merged.

For each new functionality tests are provided.

Changelog

Add functionality to read and process commit messages in order to merge them to the commit data (See issue Read in and add commit messages #180). Three values are available for the new attribute commit.messages in proj.conf:
1. none is the default value and does not change the previous behavior of proj.data$set.commits.
2. title merges the commit message titles (i.e. the first non white space line of a commit message) to the commit data. This gives the data frame an additional column title.
3. messages merges both titles and message bodies to the commit data frame. This adds two new columns title and message.body.

In order to test new functionality (i.e. the read.commit.messages function) new files containing test commit data were needed. Add two files containing messages corresponding to the test commit data that already exists. See se-sic#180 Signed-off-by: Niklas Schneider <[email protected]>

Add the read.commit.messages and create.empty.commits.list functions to util-read.R as well as a unit test to test the new functions. This allows to read 'commitMessages.list' files and return the commit data separated into message title and body. See se-sic#180 Signed-off-by: Niklas Schneider <[email protected]>

Discussion has shown that codeface separates lines with five spaces, not four. So the two test files have been modified to account for that fact. See discussion in se-sic#180 Signed-off-by: Niklas Schneider <[email protected]>

Change the line breaks in the expected output to \n's. See se-sic#180 Signed-off-by: Niklas Schneider <[email protected]>

Replace five spaces with \n, remove any white space at the beginning and the end of a commit message. See se-sic#180 Signed-off-by: Niklas Schneider <[email protected]>

Get the commit messsage data using the new read function and merge either nothing, the title or message and title into the commit.data of the proj.conf instance. See se-sic#180 Signed-off-by: Niklas Schneider <[email protected]>

Add the new attribute "commit.messages" to the project configuration class with options "none", "title" and "message" to make it possible to specify what exactly of the commit message data is to be merged to the commit data. See se-sic#180 Signed-off-by: Niklas Schneider <[email protected]>

Signed-off-by: Niklas Schneider <[email protected]>

Add two tests for testing the merge functionality for both full commit messages and titles only. Fix bug that merges message body instead of title when selecting option "title" See se-sic#180 Signed-off-by: Niklas Schneider <[email protected]>

Signed-off-by: Niklas Schneider <[email protected]>

Also exchange the merge attribute when merging data frames of commit messages from commit.id to hash. Signed-off-by: Niklas Schneider <[email protected]>

Signed-off-by: Niklas Schneider <[email protected]>

As commit.id was the first column of the data frame anyway, merging has not changed the order. But when using the hash column it is taken as the first colum of the resulting data frame. Change the order of the columns in order to not break anything that relies on the order. See se-sic#180 Signed-off-by: Niklas Schneider <[email protected]>

clhunsen

Hi @nlschn!

Thank you very much for your PR! 😀👍 And congrats for opening your (kind of) first one for coronet! 🥳 Your work integrates a feature that @bockthom requested for a long time. He is definitely very happy about this! 😁

I added some comments to your changes, with the main goal to improve maintainability:

There is an implementation error in the method ProjectData$get.commit.messages. I highlighted the corresponding comment with the symbol 🐛.
You access and modify data.frames using indices in several places – instead of using columns names. This may cause problems when functionality changes in the future. You should always avoid using indices in any case if possible (at least, outside the file util-read.R).
The algorithm of post-processing commit messages (after they have initially been read from disk) can be tweaked to speed things up and make the implementation more comprehensible.
You have added a configuration option to the class ProjectConf, but you have not updated the README file accordingly. I added a corresponding comment to your update of the file NEWS.md.

Your additions to coronet are a very nice piece of work! Keep it up! 👍

If you have any questions regarding my comments or anything else, please do not hesitate to ask. 🙂

By the way, if you got an e-mail indicating a single new and, in the end, non-existing comment on PR #192: Sorry for this! I accidentally clicked the wrong button when starting to review the PR and, then, deleted the comment right away. 🤦‍♂️

util-data.R

util-conf.R

NEWS.md

util-data.R

util-read.R

util-data.R

@clhunsen

Follow the review suggestions of @clhunsen. See se-sic#180 Signed-off-by: Niklas Schneider <[email protected]>

Following the review of se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

Remove some empty lines and indent some lines. Also remove commit.message.data.unprocessed variable and use the commit.message.data variable from the beginning. Add column names beforehand in order to enable access without indices. See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

Create private function update.commit.message.data in util-data.R which handles the merge and change the location where it is called in set.commits. See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

Fix an error where the value of a variable that is defined in an if block is returned outside that if block. See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

README.md

bockthom

Thanks @nlschn, your implementation looks really good. Sorry for my late review, I have hoped to achieve that earlier. Anyway, I just have found a few minor issues addressing consistency. So, please have a look at them (and please don't overstate the number of my comments, most of my comments go into the same direction). Don't hesitate to ask if something is unclear to you.

Also thanks to @clhunsen for your review, you already have pointed out the most important issues.

NEWS.md

README.md

tests/test-data.R

util-read.R

Replaced a loop with a conversion from a list of vectors in a data frame and access its columns directly See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

Move functions concerning reading commit messages and the constants used by them to a new section in util.read. Replace subset with proper indexing and minor comment fixes. See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

Also adapt 'update.commit.messages' to better match the implementation of similar methods. Add 'set.commit.messages' in order to be able to set the commit messages to NULL. See se-sic#193. Signed-off-by: Niklas Schneider <[email protected]>

Introduce new function 'format.commit.ids' in along with new section in util-read.R. Also put format "<commit-%s>" into a constant. See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

@clhunsen

Take advice by @clhunsen to replace if else cascade for rearranging columns with better merge call. Also modify test-data tests regarding commit messages: Row names are no longer ignored. See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

Change order in 'README.md', 'util-conf.R' and 'util-data.R' Also fix table of contents in the readme. See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

Add (empty) commit message data to all data split tests in 'tests-split.R'. Also sor the additional data sources alphabetically in the tests. See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

Add (empty) commit message data to all data split tests in 'test-split-sliding-window.R'. Also sort the additional data sources alphabetically in the tests. See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

tests/test-split-sliding-window.R

See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

These functions act like cleanup.pasta, they remove rows from the data that are not part of a range data object anymore. See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

Fix a copy-paste-error where a varibale name was not changed from a prior code snippet. See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

nlschn · 2021-01-30T11:10:41Z

I have tested the cleanup.synchronicity function manually on busybox data with the given parameters (artifact=file and synchronicity.time.window=5) as suggested by @bockthom and it seems to work fine.
Except for R-3.3, there is the same error as in #194. 😢

JoJoDeveloping · 2021-01-30T12:33:45Z

Except for R-3.3, there is the same error as in #194. 😢

Is it possible that this fail is a problem with the pipeline in general? Does it also appear if you re-run the pipeline for master or dev? @bockthom

bockthom · 2021-01-30T13:13:41Z

Except for R-3.3, there is the same error as in #194. cry

Is it possible that this fail is a problem with the pipeline in general? Does it also appear if you re-run the pipeline for master or dev? @bockthom

Yes, this problem will occur everywhere where package sqldf is installed on R 3.3. I had a quick look at the test logs and what I can see there is that sqldf cannot be installed on R 3.3 anymore as there might be some dependencies of sqldf not be available anymore. I will have a closer look at this tonight or tomorrow, not sure if there really is a solution for that problem, but I will check.
@nlschn @JoJoDeveloping just ignore the failing pipeline on R 3.3, it is not related to your PRs and you don't have to deal with that 😉

bockthom · 2021-01-31T13:47:46Z

Yes, this problem will occur everywhere where package sqldf is installed on R 3.3. I had a quick look at the test logs and what I can see there is that sqldf cannot be installed on R 3.3 anymore as there might be some dependencies of sqldf not be available anymore. I will have a closer look at this tonight or tomorrow, not sure if there really is a solution for that problem, but I will check.

After spending more than two hours on investigating why the pipeline fails for R 3.3, eventually, I was able to find out what is going wrong there:
Package sqldf imports package RSQLite, which in turn imports package memoise. Last Tuesday (2021-01-26), a new version of package memoise was released (which explains why the pipeline still ran successfully at the beginning of last week). The new release of memoise has introduced a new dependency to package cachem (this dependency was not present before). Package cachem imports package fastmap (which is also new in our dependency tree). And now we are close to the failing builds: The installation and compilation of package fastmap ends up in an "internal compiler error: Segmentation fault" for R 3.3.3. There is no segmentation fault for more recent R versions (>= 3.4), and also no segmentation fault for the even earlier versions R 3.3.1 and R 3.3.2. So only R 3.3.3. is the problem. I will open a bug report on fastmap soon––but I am not sure how they will react on my bug report, as R 3.3.3 is already quite outdated and I am not sure if they are willing to debug old R versions. Even if they will fix this bug, it will take some time until a new release will be available on CRAN. So, for the moment (i.e., current and future PRs on coronet), we have only the following two options: Either completely ignore the build fail on R 3.3 and hope the problem to be fixed soon, or temporarily replace R 3.3.3 by R 3.3.2 in our pipeline to get the green check mark again 😄

bockthom

All the changes in this PR look good already. I have only found some minor issues. And this time I really hope that we can merge the PR afterwards...

util-data.R

clhunsen

This PR looks pretty close to merging! Nice work, @nlschn! 🙂 I only have one tiny bug that needs to be fixed and one (optional) suggestion for improving code comprehensibility. When the bug is fixed and @bockthom's comments are addressed, this PR is ready to be merged. 😀

util-read.R

bockthom

Just spotted two other things:

For the first one: We have a warning notifying the user about cleaning up the PaStA data, which we should also add for the synchronicity data. Could you please add the following warning in line 450 of util-data.R, in update.synchronicity.data (in a similary way as in update.pasta.data):

logging::logwarn("There might be synchronicity data that does not appear in the commit data.
                  To clean this up you can call the function 'cleanup.synchronicity.data()'.")

And also in a similar way for commit messages.

The second one is just an update of the NEWS.md, as we have added these cleanup functions, see comment below.

(Now I am finished with my review. I won't look at the existing parts of this PR any more unless someone else spots any other issues.)

NEWS.md

bockthom · 2021-02-01T11:45:52Z

After spending more than two hours on investigating why the pipeline fails for R 3.3, eventually, I was able to find out what is going wrong there:

Just for documentation reasons: The discussion about the failing pipeline for R 3.3.3 will be continued here: #161 (comment)

See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

Warn user when updating commit messages or synchronicity that they should call the corresponding cleanup method. Remove calls to unlist in those cleanup functions. Fix copy-pasted comments. See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

Fix the regex that removes spaces at the end of a commit message. Change the assignment of the 'commit.message.data' data frame such that no new data frame is instantiated anymore. See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

clhunsen

Nice work! 🥳👍

Our CI pipeline does currently not work for "R-3.3" (but for all the other R versions in the pipeline). The reason for that is the following: As of 2021-01-26, there is a new version of package `memoise` (2.0.0), which is imported by `RSQLite`, which in turn is imported by `sqldf`. The new version of `memoise` imports package `cachem`, which in turn imports package `fastmap`. Using all these packages is not a problem per se, they are compatible with R 3.3.3. However, the docker container `r-base:3.3.3`, which we use in our pipeline "R-3.3", uses g++ version 6.3.0-9, which contains a bug. Due to the bug in the compiler, `fastmap` cannot be compiled and installed in tje `r-base:3.3.3` docker container, letting the CI pipeline fail. To circumvent this problem but still keep "R-3.3" in our CI pipeline, we now use docker image `r-base:3.3.2` instead, as there is another g++ version installed in this docker image. With that, the CI pipeline should succeed for "R-3.3" again. Signed-off-by: Thomas Bock <[email protected]>

bockthom

Good work @nlschn, thanks for your endurance.

As everyone has approved this PR and also the test pipeline succeeds again as of the pipeline fix, I will merge right away.

Create a new helper function to be used by 'metrics.centrality' in order to compute the authors from the data.sources of the network's relations automatically. See se-sic#193

nlschn added 5 commits January 2, 2021 12:26

Adapt commit message read test to new test files

17a61ed

Change the line breaks in the expected output to \n's. See se-sic#180 Signed-off-by: Niklas Schneider <[email protected]>

Adapt read.commit.messages to handle line breaks correctly

c624c90

Replace five spaces with \n, remove any white space at the beginning and the end of a commit message. See se-sic#180 Signed-off-by: Niklas Schneider <[email protected]>

nlschn force-pushed the mydev branch from 8a8cb91 to 1e7a95d Compare January 7, 2021 20:43

nlschn added 5 commits January 7, 2021 21:44

Replace seq with seq_along and add missing log statement in util-read.R

f80b24b

Signed-off-by: Niklas Schneider <[email protected]>

Add description of changes to unversioned section of NEWS.md

359b12c

Signed-off-by: Niklas Schneider <[email protected]>

nlschn force-pushed the mydev branch from 1e7a95d to 359b12c Compare January 7, 2021 20:45

nlschn added 3 commits January 7, 2021 21:54

Remove unnecessary empty lines from several files

70c8395

Also exchange the merge attribute when merging data frames of commit messages from commit.id to hash. Signed-off-by: Niklas Schneider <[email protected]>

Fix a syntax error in util-read

89a6ea6

Signed-off-by: Niklas Schneider <[email protected]>

nlschn force-pushed the mydev branch from 851c337 to 6e9147e Compare January 8, 2021 11:07

clhunsen requested changes Jan 10, 2021

View reviewed changes

nlschn added 5 commits January 13, 2021 14:52

Modify README and NEWS

c9c7ff7

Follow the review suggestions of @clhunsen. See se-sic#180 Signed-off-by: Niklas Schneider <[email protected]>

Rename "message.body" column to "message" everywhere

0457dd5

Following the review of se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

Put merge functionality into own function

8e28a1f

Create private function update.commit.message.data in util-data.R which handles the merge and change the location where it is called in set.commits. See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

Fix error when returning a variable that is not defined

703ab3e

Fix an error where the value of a variable that is defined in an if block is returned outside that if block. See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

bockthom reviewed Jan 15, 2021

View reviewed changes

README.md Outdated Show resolved Hide resolved

bockthom requested changes Jan 15, 2021

View reviewed changes

nlschn added 6 commits January 15, 2021 14:52

Simplify data frame creation in read.commit.messages

7caaa8d

Replaced a loop with a conversion from a list of vectors in a data frame and access its columns directly See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

Add helper function to format 'commit.id' column

d5c8c78

Introduce new function 'format.commit.ids' in along with new section in util-read.R. Also put format "<commit-%s>" into a constant. See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

Change order of data sources to be alphabetical

70b3cb6

Change order in 'README.md', 'util-conf.R' and 'util-data.R' Also fix table of contents in the readme. See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

nlschn added 2 commits January 25, 2021 12:17

Change all data split tests to include commit message data

98e83b0

Add (empty) commit message data to all data split tests in 'tests-split.R'. Also sor the additional data sources alphabetically in the tests. See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

bockthom reviewed Jan 25, 2021

View reviewed changes

tests/test-split-sliding-window.R Outdated Show resolved Hide resolved

nlschn added 3 commits January 26, 2021 10:02

Fix minor comment issue in 'test-split-sliding-window.R'

c052dfb

See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

Add new cleanup functions for commit messages and synchronicity

d3bbae0

These functions act like cleanup.pasta, they remove rows from the data that are not part of a range data object anymore. See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

Fix wrong variable name in 'cleanup.synchronicity'

9385084

Fix a copy-paste-error where a varibale name was not changed from a prior code snippet. See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

bockthom reviewed Jan 31, 2021

View reviewed changes

util-data.R Outdated Show resolved Hide resolved

util-data.R Outdated Show resolved Hide resolved

util-data.R Outdated Show resolved Hide resolved

util-data.R Outdated Show resolved Hide resolved

clhunsen previously approved these changes Jan 31, 2021

View reviewed changes

util-read.R Outdated Show resolved Hide resolved

util-read.R Outdated Show resolved Hide resolved

bockthom reviewed Feb 1, 2021

View reviewed changes

NEWS.md Outdated Show resolved Hide resolved

bockthom mentioned this pull request Feb 1, 2021

R versions to support and which to cover in test suite #161

Open

nlschn added 3 commits February 1, 2021 17:11

Add cleanup functions to NEWS.md

63b6f79

See se-sic#193 Signed-off-by: Niklas Schneider <[email protected]>

nlschn dismissed clhunsen’s stale review via e1e1ba8 February 1, 2021 16:39

bockthom previously approved these changes Feb 1, 2021

View reviewed changes

clhunsen added the enhancement label Feb 1, 2021

clhunsen added this to the v3.8 milestone Feb 1, 2021

clhunsen previously approved these changes Feb 1, 2021

View reviewed changes

nlschn dismissed stale reviews from clhunsen and bockthom via 18843a8 February 3, 2021 13:34

bockthom approved these changes Feb 3, 2021

View reviewed changes

bockthom merged commit c72188e into se-sic:dev Feb 3, 2021

bockthom mentioned this pull request Sep 1, 2021

Version 4.0 #212

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add commit message merge functionality #193

Add commit message merge functionality #193

nlschn commented Jan 7, 2021

clhunsen left a comment

bockthom left a comment

nlschn commented Jan 30, 2021 •

edited

Loading

JoJoDeveloping commented Jan 30, 2021

bockthom commented Jan 30, 2021

bockthom commented Jan 31, 2021 •

edited

Loading

bockthom left a comment

clhunsen left a comment

bockthom left a comment

bockthom commented Feb 1, 2021

clhunsen left a comment

bockthom left a comment

Add commit message merge functionality #193

Add commit message merge functionality #193

Conversation

nlschn commented Jan 7, 2021

Prerequisites

Description

Changelog

clhunsen left a comment

Choose a reason for hiding this comment

bockthom left a comment

Choose a reason for hiding this comment

nlschn commented Jan 30, 2021 • edited Loading

JoJoDeveloping commented Jan 30, 2021

bockthom commented Jan 30, 2021

bockthom commented Jan 31, 2021 • edited Loading

bockthom left a comment

Choose a reason for hiding this comment

clhunsen left a comment

Choose a reason for hiding this comment

bockthom left a comment

Choose a reason for hiding this comment

bockthom commented Feb 1, 2021

clhunsen left a comment

Choose a reason for hiding this comment

bockthom left a comment

Choose a reason for hiding this comment

nlschn commented Jan 30, 2021 •

edited

Loading

bockthom commented Jan 31, 2021 •

edited

Loading