Description
When detecting renames between m
and n
files, as well as when it detects similarities in git range-diff
between m
and n
commits, Git currently performs m
times n
comparisons, which is quite expensive for larger m
and n
.
There are a number of ways to accelerate that, primarily by pre-processing the file contents/commits and then trying to find correspondences in a more guided manner on the processed items. The most obvious approaches are all based on performing some sort of Nearest Neighbor Search.
A large part of this project will be to compare the available approaches to determine which one to implement, then implement it in libgit.a
and use it for the rename detection and for commit matching in git range-diff
.
Ideas
Approximate Nearest Neighbor Search
There are quite a few algorithms to perform "Approximate Nearest Neighbor Search". See e.g. a comparison at https://github.com/erikbern/ann-benchmarks.
- Hierarchical Navigable Small World graphs
- Navigating Spreading-out Graph (there is also some MIT-licensed C++ source code)
- Neighborhood Graph and Tree
Locality-sensitive hashing
Locality-sensitive hashing was made famous by its application in web search.
Other methods
- Classifying with a pre-trained Support Vector Machine (see e.g. how Gerrit does it)