Skip to content

Help minimize export dumps differences when inspecting --dry-run results #657

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Beuc
Copy link

@Beuc Beuc commented Apr 15, 2025

A no-op test rewrite of the git-filter-repo repo, using ./git-filter-repo --proceed, unexpectedly results in a different repo.

Diffing the fast-export dumps with --dry-run, it appears git-filter-repo is using Python's default sorting algorithm for Git trees, rather than the Git-specific algorithm, which is described a bit at: https://stackoverflow.com/questions/78981525/how-git-sort-entries-in-trees

@@ -1451,25 +1329,23 @@
 D testcases/expected/case1-twenty
 D testcases/inputs/case1
 M 100755 0a13abf13c87a13ae56c1fc664baa8b44bdac1de testcases/t9390-repo-filter.sh
+M 100644 de3799fa8cd84a9ac56fad023ddf3da44eb444bb testcases/t9390/case1
 M 100644 e0c88454095bd887852791401a3b2336ad47d012 testcases/t9390/case1-filename
 M 100644 a1aa78ff11f3a40cafdcd9b531e8c1342e8a75e5 testcases/t9390/case1-ten
 M 100644 488cbd9d101b107809611940d1cd00a8f72e1b17 testcases/t9390/case1-twenty
-M 100644 de3799fa8cd84a9ac56fad023ddf3da44eb444bb testcases/t9390/case1

Unlike described above though, it appears the fast-export dumps treats all files as directory when sorting them (case1 is a file in the example above), so appending "/" to all entries maintains the original tree order.

This avoids unnecessary tree hash variation and early, cascading commit id changes, which shouldn't happen before a commit is effectively rewritten by git-filter-repo.

With this change, ./git-filter-repo --proceed is stable and produces an identical repository, with the same HEAD commit id.

@newren
Copy link
Owner

newren commented Apr 16, 2025

The change looks reasonable, but the commit message doesn't look right to me.

A no-op test rewrite of the git-filter-repo repo, using ./git-filter-repo --proceed, unexpectedly results in a different repo.

Yes, we apparently have a signed commit in our history, and rewriting will strip that signature.

Diffing the fast-export dumps with --dry-run, it appears git-filter-repo is using Python's default sorting algorithm for Git trees, rather than the Git-specific algorithm, which is described a bit at: https://stackoverflow.com/questions/78981525/how-git-sort-entries-in-trees

@@ -1451,25 +1329,23 @@
 D testcases/expected/case1-twenty
 D testcases/inputs/case1
 M 100755 0a13abf13c87a13ae56c1fc664baa8b44bdac1de testcases/t9390-repo-filter.sh
+M 100644 de3799fa8cd84a9ac56fad023ddf3da44eb444bb testcases/t9390/case1
 M 100644 e0c88454095bd887852791401a3b2336ad47d012 testcases/t9390/case1-filename
 M 100644 a1aa78ff11f3a40cafdcd9b531e8c1342e8a75e5 testcases/t9390/case1-ten
 M 100644 488cbd9d101b107809611940d1cd00a8f72e1b17 testcases/t9390/case1-twenty
-M 100644 de3799fa8cd84a9ac56fad023ddf3da44eb444bb testcases/t9390/case1

git-filter-repo doesn't write tree objects directly[*]; it has git-fast-import do that. git-fast-import states that the order of the filemodify directives given to it does not matter, and in fact I can find in the code where it is doing its own sorting.

Additionally, I tried to duplicate what you stated here and have no idea how you generated the above diff. Digging further myself, I can't find any difference. For example:

$ mkdir tmp
$ git clone [email protected]:newren/git-filter-repo copy1
$ git clone [email protected]:newren/git-filter-repo copy2
$ cd copy2
$ ./git-filter-repo --proceed
$ cd ..
$ diff -u <(git -C copy1 log --format=%T --all) <(git -C copy2 log --format=%T --all)

which shows that every single tree in both the original history and the rewritten history match.

Unlike described above though, it appears the fast-export dumps treats all files as directory when sorting them (case1 is a file in the example above), so appending "/" to all entries maintains the original tree order.

Fair enough.

This avoids unnecessary tree hash variation and early, cascading commit id changes, which shouldn't happen before a commit is effectively rewritten by git-filter-repo.

Can you provide more details about the exact commands you are running and the differences you see? I don't see any tree differences.

With this change, ./git-filter-repo --proceed is stable and produces an identical repository, with the same HEAD commit id.

No it doesn't; the signed commit in the history is stripped so the HEAD commit id and all commits going back to 2019 are different.

@Beuc
Copy link
Author

Beuc commented Apr 16, 2025

Hi,

First, sorry for the confusion, I'm not sure how I got to conclude the repos were identical yesterday, maybe I ended up working on an already rewritten repository at a point, or forgot to drop a dry-run somewhere -_-'

Speaking of --dry-run, here's how to generate the diff:

./git-filter-repo --proceed --dry-run
diff -u .git/filter-repo/fast-export.original .git/filter-repo/fast-export.filtered | less  # search for '^\+M' or 'case1'

So this patch would only help minimize export dumps differences when inspecting --dry-run results.
Do you want me to rewrite the commit message accordingly?
EDIT: done

Indeed, the last matching commit is b9c6254 and the next 9cf87ae has a signature, which I had missed.
I suppose there's no way to maintain signatures for unchanged commits?

Diffing the fast-export dumps with `--dry-run`, it appears git-filter-repo is using Python's default sorting algorithm for Git trees, rather than the Git-specific algorithm, which is described a bit at:
https://stackoverflow.com/questions/78981525/how-git-sort-entries-in-trees

```
$ diff -u .git/filter-repo/fast-export.original .git/filter-repo/fast-export.filtered
...
@@ -1451,25 +1329,23 @@
 D testcases/expected/case1-twenty
 D testcases/inputs/case1
 M 100755 0a13abf13c87a13ae56c1fc664baa8b44bdac1de testcases/t9390-repo-filter.sh
+M 100644 de3799fa8cd84a9ac56fad023ddf3da44eb444bb testcases/t9390/case1
 M 100644 e0c88454095bd887852791401a3b2336ad47d012 testcases/t9390/case1-filename
 M 100644 a1aa78ff11f3a40cafdcd9b531e8c1342e8a75e5 testcases/t9390/case1-ten
 M 100644 488cbd9d101b107809611940d1cd00a8f72e1b17 testcases/t9390/case1-twenty
-M 100644 de3799fa8cd84a9ac56fad023ddf3da44eb444bb testcases/t9390/case1
```

Unlike described above though, it appears the fast-export dumps treats all files as directory when sorting them (`case1` is a file in the example above), so appending "/" to all entries maintains the original tree order.

Note: git-filter-repo doesn't write tree objects directly; it has git-fast-import do that. git-fast-import states that the order of the filemodify directives given to it does not matter. So this doesn't change the resulting Git repository.
@Beuc Beuc changed the title Avoid extra commit id changes by respecting Git tree ordering Help minimize export dumps differences when inspecting --dry-run results Apr 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants