Accelerate rename detection and range-diff

When detecting renames between `m` and `n` files, and likewise when matching `m` and `n` commits in `git range-diff`, Git currently performs `m` times `n` comparisons, which becomes quite expensive as `m` and `n` grow.
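To make the cost concrete, here is a minimal sketch of the all-pairs pattern described above. It is an illustration only: the helper names `jaccard` and `best_matches_bruteforce` are hypothetical, and the toy similarity (Jaccard index over sets of lines) is an assumption, not the scoring Git actually uses.

```python
def jaccard(a: str, b: str) -> float:
    """Toy similarity: Jaccard index over the sets of lines.

    This is an assumption made for illustration, not Git's scoring.
    """
    sa, sb = set(a.splitlines()), set(b.splitlines())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def best_matches_bruteforce(sources, targets, threshold=0.5):
    """Compare every source against every target: m * n similarity calls."""
    comparisons = 0
    matches = {}
    for s_name, s_body in sources.items():
        best_name, best_score = None, threshold
        for t_name, t_body in targets.items():
            comparisons += 1
            score = jaccard(s_body, t_body)
            if score >= best_score:
                best_name, best_score = t_name, score
        if best_name is not None:
            matches[s_name] = best_name
    return matches, comparisons


if __name__ == "__main__":
    sources = {"a.c": "int x;\nint y;\nint z;\n", "b.c": "foo\nbar\n"}
    targets = {
        "c.c": "int x;\nint y;\nint w;\n",
        "d.c": "foo\nbar\nbaz\n",
        "e.c": "nothing\n",
    }
    print(best_matches_bruteforce(sources, targets))
```

The comparison counter grows as `len(sources) * len(targets)` regardless of how dissimilar most pairs are, which is the cost the ideas below try to avoid.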

There are a number of ways to accelerate that, primarily by pre-processing the file contents or commits and then searching for correspondences among the processed items in a more guided manner. The most obvious approaches are all based on some form of Nearest Neighbor Search.

A large part of this project will be to compare the available approaches to determine which one to implement, then implement it in `libgit.a` and use it for the rename detection and for commit matching in `git range-diff`.

# Ideas

## Approximate Nearest Neighbor Search

There are quite a few algorithms for Approximate Nearest Neighbor Search; see e.g. the comparison at https://github.com/erikbern/ann-benchmarks. Some candidates:

- [Hierarchical Navigable Small World graphs](https://arxiv.org/ftp/arxiv/papers/1603/1603.09320.pdf)
- [Navigating Spreading-out Graph](https://arxiv.org/pdf/1707.00143.pdf) (there is also some MIT-licensed [C++ source code](https://github.com/ZJULearning/nsg))
- [Neighborhood Graph and Tree](https://github.com/yahoojapan/NGT)

## Locality-sensitive hashing

[Locality-sensitive hashing](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) was made famous by its application in web search, where it is used to find near-duplicate documents. Two well-known schemes:

- [SimHash](https://en.wikipedia.org/wiki/SimHash)
- [MinHash](https://en.wikipedia.org/wiki/MinHash)
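As a rough illustration of how a locality-sensitive scheme could narrow down the pairs worth comparing, the following sketch combines MinHash signatures with banding. Everything here is an assumption made for the example: the function names, the use of whole lines as features, SHA-1 as the hash family, and the parameters `num_hashes` and `bands`. It is not a proposal for how this should look in `libgit.a`.

```python
import hashlib
from collections import defaultdict


def _hash(seed: int, feature: str) -> int:
    """One member of a hash family, selected by `seed` (SHA-1 based, for simplicity)."""
    digest = hashlib.sha1(f"{seed}:{feature}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text: str, num_hashes: int = 32) -> list:
    """MinHash signature over the set of lines (the feature choice is an assumption)."""
    features = set(text.splitlines()) or {""}
    return [min(_hash(i, f) for f in features) for i in range(num_hashes)]


def estimated_similarity(sig_a, sig_b) -> float:
    """Fraction of agreeing positions; estimates the Jaccard index of the feature sets."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(sources, targets, num_hashes=32, bands=16):
    """Return (source, target) pairs that share at least one signature band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    # Index every target once, one bucket key per band.
    for name, body in targets.items():
        sig = minhash_signature(body, num_hashes)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(name)
    # Look up each source; only bucket collisions become candidates.
    pairs = set()
    for name, body in sources.items():
        sig = minhash_signature(body, num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            for target in buckets.get(key, ()):
                pairs.add((name, target))
    return pairs


if __name__ == "__main__":
    body = "int main(void)\n{\n\treturn 0;\n}\n"
    sources = {"old.c": body}
    targets = {"new.c": body, "other.c": "completely\ndifferent\ncontent\n"}
    print(candidate_pairs(sources, targets))
```

Only the candidate pairs would then go through the expensive exact comparison. More bands with fewer rows each make the filter more permissive (fewer missed matches, more candidates); fewer bands with more rows make it stricter.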

## Other methods

- Classifying with a pre-trained Support Vector Machine (see e.g. [how Gerrit does it](https://gerrit-review.googlesource.com/c/gerrit/+/91253/18/gerrit-server/src/main/java/com/google/gerrit/server/git/SimilarityDecisionFunction.java))


Accelerate rename detection and range-diff #519
