This script tries to identify different aliases, i.e., ([optional login], name, email) tuples, used by the same person in GitHub / GHTorrent data.
There are several reasons why multiple aliases occur. For example, since on GitHub the name and email address of committers and authors are set locally in each developer's git client, rather than globally at GitHub level, there is variation in these attributes across devices and time. Moreover, GHTorrent may introduce artificial user accounts when encountering contributions by "unknown" users while crawling data from GitHub's API.
The input is a CSV file or database table, such as users in GHTorrent. See the Alias class for possible fields.
Important: each alias must have a unique numeric id. The script will produce a map of alias ids to person ids.
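Since the script relies on unique numeric ids, it can help to check the input before running it. The following is a minimal sketch, not part of the script itself: the column name `id`, the other column names, and the sample rows are assumptions made for illustration only.

```python
import csv
import io


def check_alias_ids(rows, id_field="id"):
    """Return (line number, raw id, problem) for ids that are not numeric or not unique.

    `rows` is any iterable of dicts, e.g. a csv.DictReader.
    The column name `id` is an assumption, not taken from the script.
    """
    seen = set()
    problems = []
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(rows, start=2):
        raw = (row.get(id_field) or "").strip()
        if not raw.isdecimal():
            problems.append((line_no, raw, "not numeric"))
        elif int(raw) in seen:
            problems.append((line_no, raw, "duplicate"))
        else:
            seen.add(int(raw))
    return problems


# Hypothetical input, for illustration only.
SAMPLE = """id,login,name,email
1,jdoe,John Doe,jdoe@example.com
2,,John Doe,john@example.org
2,asmith,Alice Smith,alice@example.net
x,,Bob,bob@example.net
"""

problems = check_alias_ids(csv.DictReader(io.StringIO(SAMPLE)))
for line_no, raw, problem in problems:
    print(f"line {line_no}: id {raw!r} is {problem}")
```

An empty result means every alias id in the file is numeric and unique.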
The script:
- For every pair of aliases, collects clues that could indicate the aliases belong to the same person, e.g., the email address is the same, the name is the same, or the prefix of the email address matches the user's login.
- Creates clusters of aliases that share clues, as candidates for merging.
- Uses heuristics to decide whether each of the previous clusters is valid. For example, if all candidates have the same email then the cluster is considered valid and all candidates are merged. Similarly, if all candidates in the cluster have the same full name and email domain (after clearer options have been exhausted) then the cluster is considered valid and all candidates are merged.
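The three steps above can be sketched roughly as follows. This is a simplified illustration, not the script's actual implementation: the tuple layout, the clue labels, the union-find clustering, and the sample aliases are all assumptions, and only the two validation heuristics mentioned above are included.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical aliases as (id, login, name, email); illustrative data only.
ALIASES = [
    (1, "jdoe", "John Doe", "jdoe@example.com"),
    (2, None, "John Doe", "john@example.org"),
    (3, None, "J. Doe", "jdoe@example.com"),
    (4, "asmith", "Alice Smith", "alice@example.net"),
    (5, None, "A. Smith", "alice@example.net"),
]


def norm(value):
    """Lower-case and strip a field; missing values become ''."""
    return value.strip().lower() if value else ""


def collect_clues(a, b):
    """Step 1: clues suggesting that aliases a and b are the same person."""
    found = set()
    if norm(a[3]) and norm(a[3]) == norm(b[3]):
        found.add("email")
    if norm(a[2]) and norm(a[2]) == norm(b[2]):
        found.add("name")
    # The prefix of one alias's email matches the other alias's login.
    for x, y in ((a, b), (b, a)):
        if norm(x[1]) and norm(x[1]) == norm(y[3]).split("@")[0]:
            found.add("login=prefix")
    return found


def cluster(aliases):
    """Step 2: group aliases connected by at least one clue (union-find)."""
    parent = {a[0]: a[0] for a in aliases}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(aliases, 2):
        if collect_clues(a, b):
            ra, rb = find(a[0]), find(b[0])
            parent[max(ra, rb)] = min(ra, rb)
    groups = defaultdict(list)
    for a in aliases:
        groups[find(a[0])].append(a)
    return list(groups.values())


def is_valid(group):
    """Step 3: same email for all, or same full name and email domain for all."""
    emails = {norm(a[3]) for a in group}
    if len(emails) == 1 and "" not in emails:
        return True
    names = {norm(a[2]) for a in group}
    domains = {norm(a[3]).rpartition("@")[2] for a in group}
    return len(names) == 1 and len(domains) == 1 and "" not in (names | domains)


def resolve(aliases):
    """Return (alias id -> person id map, clusters left unmerged)."""
    mapping = {a[0]: a[0] for a in aliases}
    maybe = []
    for group in cluster(aliases):
        if len(group) < 2:
            continue
        if is_valid(group):
            person = min(a[0] for a in group)
            for a in group:
                mapping[a[0]] = person
        else:
            maybe.append(sorted(a[0] for a in group))
    return mapping, maybe


mapping, maybe = resolve(ALIASES)
```

In this toy data, aliases 4 and 5 share an email and are merged, while aliases 1, 2 and 3 share clues but fail both heuristics, so they stay separate and are only reported as candidates.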
There are three main files generated by the script.
- idm_map.csv is a map of alias user ids (first column) to the unique person id (second column).
- idm_log.csv is a log file with information on which aliases have been merged and why, i.e., what clues were used to make that decision.
- idm_maybe.csv is another log file, with identical structure to idm_log.csv, listing the clusters of candidate aliases that share clues and so could potentially have been validated, but that the heuristics as currently implemented did not merge.
Important: Carefully inspect these files manually. If you observe (many) false positives in idm_log.csv, it means the heuristics were too greedy and should be made more conservative. If instead you observe (many) false negatives in idm_maybe.csv, it means the heuristics were too conservative and can be made greedier.
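To support this manual inspection, one option is to load idm_map.csv and list the persons to whom more than one alias was assigned. The sketch below assumes a comma-delimited file with no header row, with the alias id in the first column and the person id in the second; the helper names and the sample content are hypothetical.

```python
import csv
import io
from collections import defaultdict


def load_id_map(handle):
    """Read alias id -> person id pairs (assumes two integer columns, no header)."""
    return {int(row[0]): int(row[1]) for row in csv.reader(handle) if row}


def merged_persons(id_map):
    """Return person id -> sorted alias ids, for persons with several aliases."""
    people = defaultdict(list)
    for alias_id, person_id in id_map.items():
        people[person_id].append(alias_id)
    return {p: sorted(ids) for p, ids in people.items() if len(ids) > 1}


# Hypothetical file content; in practice use open("idm_map.csv", newline="").
SAMPLE_MAP = "1,1\n2,2\n4,4\n5,4\n"

id_map = load_id_map(io.StringIO(SAMPLE_MAP))
merged = merged_persons(id_map)
for person_id, alias_ids in merged.items():
    print(person_id, alias_ids)
```

Each listed person can then be checked against the corresponding entries in idm_log.csv to see which clues led to the merge.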
For more details see section II.A.a from this MSR 2015 paper:
@inproceedings{vasilescu2015msrdata,
author = {Vasilescu, Bogdan and Serebrenik, Alexander and Filkov, Vladimir},
title = {A Data Set for Social Diversity Studies of {GitHub} Teams},
booktitle = {12th Working Conference on Mining Software Repositories, Data Track},
year = {2015},
series = {MSR},
pages = {514--517},
publisher = {IEEE},
doi = {10.1109/MSR.2015.77}
}