Braden Simpson edited this page Aug 9, 2013 · 2 revisions

#Vision

This project should contribute a method for comparing corpora of text from different sources. Initially, these sources could include commit comments, commit messages, pull request comments, and issue comments, all from the GhTorrent dataset.

We would create a database of all known words and their tf-idf scores relative to each “domain” that we choose. These domains, which we would define ourselves, could include the four sources mentioned above.

| Domain | Priority (1 = low, 5 = high) |
|---|---|
| Commit comments | 5 |
| Pull request comments | 5 |
| Issue comments | 5 |
| Commit messages | 3 |
| Top (10?) projects | 2 |
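As a rough illustration of the per-domain tf-idf database described above, here is a minimal sketch in Python. It assumes each domain is treated as one large document, uses naive whitespace tokenization, and uses the plain `tf * log(N / df)` weighting; the function names and the sample texts are hypothetical and are not taken from GhTorrent.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def domain_tfidf(domains):
    """Map each domain name to a {word: tf-idf} dictionary.

    `domains` maps a domain name to a list of text strings.
    Each domain is treated as a single document (an assumption).
    """
    counts = {
        name: Counter(w for text in texts for w in tokenize(text))
        for name, texts in domains.items()
    }
    n_domains = len(counts)

    # Document frequency: in how many domains does each word occur?
    df = Counter()
    for word_counts in counts.values():
        df.update(word_counts.keys())

    table = {}
    for name, word_counts in counts.items():
        total = sum(word_counts.values())
        table[name] = {
            word: (k / total) * math.log(n_domains / df[word])
            for word, k in word_counts.items()
        }
    return table


# Hypothetical sample data, for illustration only.
example = {
    "commit_messages": ["fix null pointer bug", "fix typo"],
    "issue_comments": ["thanks for the fix", "this bug is annoying"],
}
table = domain_tfidf(example)
```

With this weighting, a word that occurs in every domain (such as "fix" in the sample) scores zero, while words specific to one domain score higher. A real implementation would likely need a better tokenizer and some smoothing.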

We would then come up with a few research questions about the data. Some that are interesting to me (Braden):

  • Are some types of communication within a project more or less technical than others? If so, where are they?
  • How do projects compare to each other? By language? Geographic location? Distributed projects? Closed / open source? etc.
  • Can we generalize findings derived from commit messages to other types of data (commit, issue, and pull request comments)?

We would then write a paper or two based on some of those research questions, and contribute to the community by providing a way to compare domains / corpora of text.
