Skip to content

Accelerate rename detection and range-diff #519

Open
@dscho

Description

@dscho

When detecting renames between m and n files, as well as when it detects similarities in git range-diff between m and n commits, Git currently performs m times n comparisons, which is quite expensive for larger m and n.

There are a number of ways to accelerate that, primarily by pre-processing the file contents/commits and then trying to find correspondences in a more guided manner on the processed items. The most obvious approaches are all based on performing some sort of Nearest Neighbor Search.

A large part of this project will be to compare the available approaches to determine which one to implement, then implement it in libgit.a and use it for the rename detection and for commit matching in git range-diff.

Ideas

Approximate Nearest Neighbor Search

There are quite a few algorithms to perform "Approximate Nearest Neighbor Search". See e.g. a comparison at https://github.com/erikbern/ann-benchmarks.

Locality-sensitive hashing

Locality-sensitive hashing was made famous by its application in web search.

Other methods

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions