Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add commit message merge functionality #193

Merged
merged 43 commits into from
Feb 3, 2021
Merged

Add commit message merge functionality #193

merged 43 commits into from
Feb 3, 2021

Conversation

nlschn
Copy link
Contributor

@nlschn nlschn commented Jan 7, 2021

Prerequisites

  • I adhere to the coding conventions (described here) in my code.
  • I have updated the copyright headers of the files I have modified.
  • I have written appropriate commit messages, i.e., I have recorded the goal, the need, the needed changes, and the location of my code modifications for each commit. This includes also, e.g., referencing to relevant issues.
  • I have put signed-off tags in all commits.
  • I have updated the changelog file NEWS.md appropriately.
  • The pull request is opened against the branch dev.

Description

[Now from mydev branch]
In order to address issue #180 this PR adds functionality to read and process commit message data in "commitMessages.list" files. The first non-white-space line of the message is considered the title, the rest the body of the commit message.
This functionality can be found in "util-read.R"

The ability to merge the commit messages to the commit data is implemented in "set.commits" in "util-data.R" and a new attribute "commit.messages" has been added to "proj.conf" to control what data shall be merged.

For each new functionality tests are provided.

Changelog

  • Add functionality to read and process commit messages in order to merge them to the commit data (See issue Read in and add commit messages #180). Three values are available for the new attribute commit.messages in proj.conf:
    1. none is the default value and does not change the previous behavior of proj.data$set.commits.
    2. title merges the commit message titles (i.e. the first non white space line of a commit message) to the commit data. This gives the data frame an additional column title.
    3. messages merges both titles and message bodies to the commit data frame. This adds two new columns title and message.body.

In order to test new functionality (i.e. the read.commit.messages
function) new files containing test commit data were needed.
Add two files containing messages corresponding to the test commit data that
already exists.

See se-sic#180

Signed-off-by: Niklas Schneider <[email protected]>
Add the read.commit.messages and create.empty.commits.list functions
to util-read.R as well as a unit test to test the new functions.
This allows to read 'commitMessages.list' files and return the commit
data separated into message title and body.

See se-sic#180

Signed-off-by: Niklas Schneider <[email protected]>
Discussion has shown that codeface separates lines with five spaces, not four.
So the two test files have been modified to account for that fact.

See discussion in se-sic#180

Signed-off-by: Niklas Schneider <[email protected]>
Change the line breaks in the expected output to \n's.

See se-sic#180

Signed-off-by: Niklas Schneider <[email protected]>
Replace five spaces with \n, remove any white space at the beginning and
the end of a commit message.

See se-sic#180

Signed-off-by: Niklas Schneider <[email protected]>
Get the commit messsage data using the new read function and merge either nothing, the
title or message and title into the commit.data of the proj.conf instance.

See se-sic#180

Signed-off-by: Niklas Schneider <[email protected]>
Add the new attribute "commit.messages" to the project configuration class with options "none",
"title" and "message" to make it possible to specify what exactly of the commit message data
is to be merged to the commit data.

See se-sic#180

Signed-off-by: Niklas Schneider <[email protected]>
Add two tests for testing the merge functionality for both full commit messages and
titles only. Fix bug that merges message body instead of title when selecting
option "title"

See se-sic#180

Signed-off-by: Niklas Schneider <[email protected]>
Also exchange the merge attribute when merging data frames of commit
messages from commit.id to hash.

Signed-off-by: Niklas Schneider <[email protected]>
Signed-off-by: Niklas Schneider <[email protected]>
As commit.id was the first column of the data frame anyway,
merging has not changed the order. But when using the hash column
it is taken as the first colum of the resulting data frame.
Change the order of the columns in order to not break anything
that relies on the order.

See se-sic#180

Signed-off-by: Niklas Schneider <[email protected]>
Copy link
Collaborator

@clhunsen clhunsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nlschn!

Thank you very much for your PR! 😀👍 And congrats for opening your (kind of) first one for coronet! 🥳 Your work integrates a feature that @bockthom requested for a long time. He is definitely very happy about this! 😁

I added some comments to your changes, with the main goal to improve maintainability:

  • There is an implementation error in the method ProjectData$get.commit.messages. I highlighted the corresponding comment with the symbol 🐛.
  • You access and modify data.frames using indices in several places – instead of using columns names. This may cause problems when functionality changes in the future. You should always avoid using indices in any case if possible (at least, outside the file util-read.R).
  • The algorithm of post-processing commit messages (after they have initially been read from disk) can be tweaked to speed things up and make the implementation more comprehensible.
  • You have added a configuration option to the class ProjectConf, but you have not updated the README file accordingly. I added a corresponding comment to your update of the file NEWS.md.

Your additions to coronet are a very nice piece of work! Keep it up! 👍

If you have any questions regarding my comments or anything else, please do not hesitate to ask. 🙂


By the way, if you got an e-mail indicating a single new and, in the end, non-existing comment on PR #192: Sorry for this! I accidentally clicked the wrong button when starting to review the PR and, then, deleted the comment right away. 🤦‍♂️

util-data.R Outdated Show resolved Hide resolved
util-conf.R Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
util-data.R Outdated Show resolved Hide resolved
util-read.R Outdated Show resolved Hide resolved
util-read.R Outdated Show resolved Hide resolved
util-read.R Outdated Show resolved Hide resolved
util-read.R Outdated Show resolved Hide resolved
util-read.R Outdated Show resolved Hide resolved
util-data.R Show resolved Hide resolved
Follow the review suggestions of @clhunsen.

See se-sic#180

Signed-off-by: Niklas Schneider <[email protected]>
Following the review of se-sic#193

Signed-off-by: Niklas Schneider <[email protected]>
Remove some empty lines and indent some lines.
Also remove commit.message.data.unprocessed variable and use the
commit.message.data variable from the beginning. Add column names
beforehand in order to enable access without indices.

See se-sic#193

Signed-off-by: Niklas Schneider <[email protected]>
Create private function update.commit.message.data in util-data.R
which handles the merge and change the location where it is
called in set.commits.

See se-sic#193

Signed-off-by: Niklas Schneider <[email protected]>
Fix an error where the value of a variable that is defined
in an if block is returned outside that if block.

See se-sic#193

Signed-off-by: Niklas Schneider <[email protected]>
README.md Outdated Show resolved Hide resolved
Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nlschn, your implementation looks really good. Sorry for my late review, I have hoped to achieve that earlier. Anyway, I just have found a few minor issues addressing consistency. So, please have a look at them (and please don't overstate the number of my comments, most of my comments go into the same direction). Don't hesitate to ask if something is unclear to you.

Also thanks to @clhunsen for your review, you already have pointed out the most important issues.

NEWS.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
tests/test-data.R Outdated Show resolved Hide resolved
tests/test-data.R Outdated Show resolved Hide resolved
tests/test-data.R Outdated Show resolved Hide resolved
util-read.R Outdated Show resolved Hide resolved
util-read.R Outdated Show resolved Hide resolved
util-read.R Outdated Show resolved Hide resolved
util-read.R Outdated Show resolved Hide resolved
util-read.R Outdated Show resolved Hide resolved
Replaced a loop with a conversion from a list of vectors
in a data frame and access its columns directly

See se-sic#193

Signed-off-by: Niklas Schneider <[email protected]>
Move functions concerning reading commit messages and the constants used
by them to a new section in util.read.
Replace subset with proper indexing and minor comment fixes.

See se-sic#193

Signed-off-by: Niklas Schneider <[email protected]>
Also adapt 'update.commit.messages' to better match the implementation
of similar methods.
Add 'set.commit.messages' in order to be able to set the commit messages
to NULL.

See se-sic#193.

Signed-off-by: Niklas Schneider <[email protected]>
Introduce new function 'format.commit.ids' in along with new section in
util-read.R. Also put format "<commit-%s>" into a constant.

See se-sic#193

Signed-off-by: Niklas Schneider <[email protected]>
Take advice by @clhunsen to replace if else cascade for rearranging columns with
better merge call.
Also modify test-data tests regarding commit messages: Row names are no longer
ignored.

See se-sic#193

Signed-off-by: Niklas Schneider <[email protected]>
Change order in 'README.md', 'util-conf.R' and 'util-data.R'
Also fix table of contents in the readme.

See se-sic#193

Signed-off-by: Niklas Schneider <[email protected]>
Add (empty) commit message data to all data split tests in 'tests-split.R'.
Also sor the additional data sources alphabetically in the tests.

See se-sic#193

Signed-off-by: Niklas Schneider <[email protected]>
Add (empty) commit message data to all data split tests in
'test-split-sliding-window.R'.
Also sort the additional data sources alphabetically in the tests.

See se-sic#193

Signed-off-by: Niklas Schneider <[email protected]>
These functions act like cleanup.pasta, they remove rows from the data
that are not part of a range data object anymore.

See se-sic#193

Signed-off-by: Niklas Schneider <[email protected]>
Fix a copy-paste-error where a varibale name was not changed from a
prior code snippet.

See se-sic#193

Signed-off-by: Niklas Schneider <[email protected]>
@nlschn
Copy link
Contributor Author

nlschn commented Jan 30, 2021

I have tested the cleanup.synchronicity function manually on busybox data with the given parameters (artifact=file and synchronicity.time.window=5) as suggested by @bockthom and it seems to work fine.
Except for R-3.3, there is the same error as in #194. 😢

@JoJoDeveloping
Copy link
Contributor

Except for R-3.3, there is the same error as in #194. 😢

Is it possible that this fail is a problem with the pipeline in general? Does it also appear if you re-run the pipeline for master or dev? @bockthom

@bockthom
Copy link
Collaborator

Except for R-3.3, there is the same error as in #194. cry

Is it possible that this fail is a problem with the pipeline in general? Does it also appear if you re-run the pipeline for master or dev? @bockthom

Yes, this problem will occur everywhere where package sqldf is installed on R 3.3. I had a quick look at the test logs and what I can see there is that sqldf cannot be installed on R 3.3 anymore as there might be some dependencies of sqldf not be available anymore. I will have a closer look at this tonight or tomorrow, not sure if there really is a solution for that problem, but I will check.
@nlschn @JoJoDeveloping just ignore the failing pipeline on R 3.3, it is not related to your PRs and you don't have to deal with that 😉

@bockthom
Copy link
Collaborator

bockthom commented Jan 31, 2021

Yes, this problem will occur everywhere where package sqldf is installed on R 3.3. I had a quick look at the test logs and what I can see there is that sqldf cannot be installed on R 3.3 anymore as there might be some dependencies of sqldf not be available anymore. I will have a closer look at this tonight or tomorrow, not sure if there really is a solution for that problem, but I will check.

After spending more than two hours on investigating why the pipeline fails for R 3.3, eventually, I was able to find out what is going wrong there:
Package sqldf imports package RSQLite, which in turn imports package memoise. Last Tuesday (2021-01-26), a new version of package memoise was released (which explains why the pipeline still ran successfully at the beginning of last week). The new release of memoise has introduced a new dependency to package cachem (this dependency was not present before). Package cachem imports package fastmap (which is also new in our dependency tree). And now we are close to the failing builds: The installation and compilation of package fastmap ends up in an "internal compiler error: Segmentation fault" for R 3.3.3. There is no segmentation fault for more recent R versions (>= 3.4), and also no segmentation fault for the even earlier versions R 3.3.1 and R 3.3.2. So only R 3.3.3. is the problem. I will open a bug report on fastmap soon––but I am not sure how they will react on my bug report, as R 3.3.3 is already quite outdated and I am not sure if they are willing to debug old R versions. Even if they will fix this bug, it will take some time until a new release will be available on CRAN. So, for the moment (i.e., current and future PRs on coronet), we have only the following two options: Either completely ignore the build fail on R 3.3 and hope the problem to be fixed soon, or temporarily replace R 3.3.3 by R 3.3.2 in our pipeline to get the green check mark again 😄

Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the changes in this PR look good already. I have only found some minor issues. And this time I really hope that we can merge the PR afterwards...

util-data.R Outdated Show resolved Hide resolved
util-data.R Outdated Show resolved Hide resolved
util-data.R Outdated Show resolved Hide resolved
util-data.R Outdated Show resolved Hide resolved
clhunsen
clhunsen previously approved these changes Jan 31, 2021
Copy link
Collaborator

@clhunsen clhunsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks pretty close to merging! Nice work, @nlschn! 🙂 I only have one tiny bug that needs to be fixed and one (optional) suggestion for improving code comprehensibility. When the bug is fixed and @bockthom's comments are addressed, this PR is ready to be merged. 😀

util-read.R Outdated Show resolved Hide resolved
util-read.R Outdated Show resolved Hide resolved
Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just spotted two other things:

For the first one: We have a warning notifying the user about cleaning up the PaStA data, which we should also add for the synchronicity data. Could you please add the following warning in line 450 of util-data.R, in update.synchronicity.data (in a similary way as in update.pasta.data):

logging::logwarn("There might be synchronicity data that does not appear in the commit data.
                  To clean this up you can call the function 'cleanup.synchronicity.data()'.")

And also in a similar way for commit messages.

The second one is just an update of the NEWS.md, as we have added these cleanup functions, see comment below.

(Now I am finished with my review. I won't look at the existing parts of this PR any more unless someone else spots any other issues.)

NEWS.md Outdated Show resolved Hide resolved
@bockthom
Copy link
Collaborator

bockthom commented Feb 1, 2021

After spending more than two hours on investigating why the pipeline fails for R 3.3, eventually, I was able to find out what is going wrong there:

Just for documentation reasons: The discussion about the failing pipeline for R 3.3.3 will be continued here: #161 (comment)

See se-sic#193

Signed-off-by: Niklas Schneider <[email protected]>
Warn user when updating commit messages or synchronicity that they should
call the corresponding cleanup method.
Remove calls to unlist in those cleanup functions.
Fix copy-pasted comments.

See se-sic#193

Signed-off-by: Niklas Schneider <[email protected]>
Fix the regex that removes spaces at the end of a commit message.
Change the assignment of the 'commit.message.data' data frame such that
no new data frame is instantiated anymore.

See se-sic#193

Signed-off-by: Niklas Schneider <[email protected]>
bockthom
bockthom previously approved these changes Feb 1, 2021
@clhunsen clhunsen added this to the v3.8 milestone Feb 1, 2021
clhunsen
clhunsen previously approved these changes Feb 1, 2021
Copy link
Collaborator

@clhunsen clhunsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! 🥳👍

Our CI pipeline does currently not work for "R-3.3" (but for all the other
R versions in the pipeline). The reason for that is the following:
As of 2021-01-26, there is a new version of package `memoise` (2.0.0),
which is imported by `RSQLite`, which in turn is imported by `sqldf`.
The new version of `memoise` imports package `cachem`, which in turn
imports package `fastmap`. Using all these packages is not a problem per
se, they are compatible with R 3.3.3. However, the docker container
`r-base:3.3.3`, which we use in our pipeline "R-3.3", uses g++ version
6.3.0-9, which contains a bug. Due to the bug in the compiler, `fastmap`
cannot be compiled and installed in tje `r-base:3.3.3` docker container,
letting the CI pipeline fail.

To circumvent this problem but still keep "R-3.3" in our CI pipeline, we
now use docker image `r-base:3.3.2` instead, as there is another g++
version installed in this docker image. With that, the CI pipeline should
succeed for "R-3.3" again.

Signed-off-by: Thomas Bock <[email protected]>
@nlschn nlschn dismissed stale reviews from clhunsen and bockthom via 18843a8 February 3, 2021 13:34
Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work @nlschn, thanks for your endurance.

As everyone has approved this PR and also the test pipeline succeeds again as of the pipeline fix, I will merge right away.

@bockthom bockthom merged commit c72188e into se-sic:dev Feb 3, 2021
nlschn added a commit to nlschn/coronet that referenced this pull request Feb 24, 2021
Create a new helper function to be used by 'metrics.centrality' in order
to compute the authors from the data.sources of the network's relations
automatically.

See se-sic#193
@bockthom bockthom mentioned this pull request Sep 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants