-
Notifications
You must be signed in to change notification settings - Fork 133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pack-objects: create new name-hash algorithm #1785
base: master
Are you sure you want to change the base?
Changes from all commits
9c8f8f3
612dbd1
e173de6
543382b
4d2381a
80ba362
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -15,7 +15,8 @@ SYNOPSIS | |
[--revs [--unpacked | --all]] [--keep-pack=<pack-name>] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Junio C Hamano wrote (reply to this): "Derrick Stolee via GitGitGadget" <[email protected]> writes:
> +static inline uint32_t pack_full_name_hash(const char *name)
> +{
> + const uint32_t bigp = 1234572167U;
> + uint32_t c, hash = bigp;
> +
> + if (!name)
> + return 0;
> +
> + /*
> + * Do the simplest thing that will resemble pseduo-randomness: add
"pseduo" -> "pseudo"
> + * random multiples of a large prime number with a binary shift.
> + * The goal is not to be cryptographic, but to be generally
> + * uniformly distributed.
> + */
Other than that, there is no substantial difference since the
previous iteration; this step looks good.
Thanks. |
||
[--cruft] [--cruft-expiration=<time>] | ||
[--stdout [--filter=<filter-spec>] | <base-name>] | ||
[--shallow] [--keep-true-parents] [--[no-]sparse] < <object-list> | ||
[--shallow] [--keep-true-parents] [--[no-]sparse] | ||
[--full-name-hash] < <object-list> | ||
|
||
|
||
DESCRIPTION | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,7 +9,9 @@ git-repack - Pack unpacked objects in a repository | |
SYNOPSIS | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Junio C Hamano wrote (reply to this): "Derrick Stolee via GitGitGadget" <[email protected]> writes:
> diff --git a/t/t0450/txt-help-mismatches b/t/t0450/txt-help-mismatches
> index 28003f18c92..c4a15fd0cb8 100644
> --- a/t/t0450/txt-help-mismatches
> +++ b/t/t0450/txt-help-mismatches
> @@ -45,7 +45,6 @@ rebase
> remote
> remote-ext
> remote-fd
> -repack
> reset
> restore
> rev-parse
This is very much appreciated ;-) |
||
-------- | ||
[verse] | ||
'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m] [--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>] [--write-midx] | ||
'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m] | ||
[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>] | ||
[--write-midx] [--full-name-hash] | ||
|
||
DESCRIPTION | ||
----------- | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -266,6 +266,14 @@ struct configured_exclusion { | |
static struct oidmap configured_exclusions; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Junio C Hamano wrote (reply to this): "Derrick Stolee via GitGitGadget" <[email protected]> writes:
> From: Derrick Stolee <[email protected]>
>
> Add a new environment variable to opt-in to the --full-name-hash option
> in 'git pack-objects'. This allows for extra testing of the feature
> without repeating all of the test scenarios.
This also allows the programmer on the C implementation side to be a
bit lazy, as the --full-name-hash option does not have to be plumbed
through from the end-user facing commands (like "bundle") down to
the underlying "pack-objects" command ;-).
As an end-user facing tweak mechanism, an environment variable is
the most clunky, followed by a configuration variable (which can be
used via "git -c" and exhibits the same clunkiness as an environment
variable), and a command line parameter is the most versatile in
allowing users to customize the behaviour per-invocation of the
commands. So in the longer term, we probably want to plumb through
the option, like you did for "repack -> pack-objects" call chain,
for all end-user visible commands that call into pack-objects.
But for testing purposes, the solution presented here is of course
good enough.
> Second, there are two tests in t5616-partial-clone.sh that I believe are
> actually broken scenarios. While the client is set up to clone the
> 'promisor-server' repo via a treeless partial clone filter (tree:0),
> that filter does not translate to the 'server' repo. Thus, fetching from
> these repos causes the server to think that the client has all reachable
> trees and blobs from the commits advertised as 'haves'. This leads the
> server to providing a thin pack assuming those objects as delta bases.
In short, the tests are based on broken assumption and checking
bogus outcome? Somebody familiar with the partial clone area should
probably take a look into it and fix the tests if that is the case.
> - if (write_bitmap_index && use_full_name_hash)
> + if (write_bitmap_index && use_full_name_hash > 0)
> die(_("currently, the --full-name-hash option is incompatible with --write-bitmap-index"));
> + if (use_full_name_hash < 0)
> + use_full_name_hash = git_env_bool("GIT_TEST_FULL_NAME_HASH", 0);
OK. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Derrick Stolee wrote (reply to this): On 9/19/24 6:22 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <[email protected]> writes:
>> Second, there are two tests in t5616-partial-clone.sh that I believe are
>> actually broken scenarios. While the client is set up to clone the
>> 'promisor-server' repo via a treeless partial clone filter (tree:0),
>> that filter does not translate to the 'server' repo. Thus, fetching from
>> these repos causes the server to think that the client has all reachable
>> trees and blobs from the commits advertised as 'haves'. This leads the
>> server to providing a thin pack assuming those objects as delta bases.
>
> In short, the tests are based on broken assumption and checking
> bogus outcome? Somebody familiar with the partial clone area should
> probably take a look into it and fix the tests if that is the case.
That is my understanding, yes. It is a common issue that I've had when
using partial clones with multiple remotes and forgetting to modify the
filter for each, so is also a usability wart that could use mending
(say, by adding the existing filter when adding a new remote).
Thanks,
-Stolee
|
||
|
||
static struct oidset excluded_by_config; | ||
static int use_full_name_hash = -1; | ||
|
||
static inline uint32_t pack_name_hash_fn(const char *name) | ||
{ | ||
if (use_full_name_hash) | ||
return pack_full_name_hash(name); | ||
return pack_name_hash(name); | ||
} | ||
|
||
/* | ||
* stats | ||
|
@@ -1698,7 +1706,7 @@ static int add_object_entry(const struct object_id *oid, enum object_type type, | |
return 0; | ||
} | ||
|
||
create_object_entry(oid, type, pack_name_hash(name), | ||
create_object_entry(oid, type, pack_name_hash_fn(name), | ||
exclude, name && no_try_delta(name), | ||
found_pack, found_offset); | ||
return 1; | ||
|
@@ -1912,7 +1920,7 @@ static void add_preferred_base_object(const char *name) | |
{ | ||
struct pbase_tree *it; | ||
size_t cmplen; | ||
unsigned hash = pack_name_hash(name); | ||
unsigned hash = pack_name_hash_fn(name); | ||
|
||
if (!num_preferred_base || check_pbase_path(hash)) | ||
return; | ||
|
@@ -3422,7 +3430,7 @@ static void show_object_pack_hint(struct object *object, const char *name, | |
* here using a now in order to perhaps improve the delta selection | ||
* process. | ||
*/ | ||
oe->hash = pack_name_hash(name); | ||
oe->hash = pack_name_hash_fn(name); | ||
oe->no_try_delta = name && no_try_delta(name); | ||
|
||
stdin_packs_hints_nr++; | ||
|
@@ -3572,7 +3580,7 @@ static void add_cruft_object_entry(const struct object_id *oid, enum object_type | |
entry = packlist_find(&to_pack, oid); | ||
if (entry) { | ||
if (name) { | ||
entry->hash = pack_name_hash(name); | ||
entry->hash = pack_name_hash_fn(name); | ||
entry->no_try_delta = no_try_delta(name); | ||
} | ||
} else { | ||
|
@@ -3595,7 +3603,7 @@ static void add_cruft_object_entry(const struct object_id *oid, enum object_type | |
return; | ||
} | ||
|
||
entry = create_object_entry(oid, type, pack_name_hash(name), | ||
entry = create_object_entry(oid, type, pack_name_hash_fn(name), | ||
0, name && no_try_delta(name), | ||
pack, offset); | ||
} | ||
|
@@ -4426,6 +4434,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) | |
OPT_STRING_LIST(0, "uri-protocol", &uri_protocols, | ||
N_("protocol"), | ||
N_("exclude any configured uploadpack.blobpackfileuri with this protocol")), | ||
OPT_BOOL(0, "full-name-hash", &use_full_name_hash, | ||
N_("optimize delta compression across identical path names over time")), | ||
OPT_END(), | ||
}; | ||
|
||
|
@@ -4573,6 +4583,11 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) | |
if (pack_to_stdout || !rev_list_all) | ||
write_bitmap_index = 0; | ||
|
||
if (write_bitmap_index && use_full_name_hash > 0) | ||
die(_("currently, the --full-name-hash option is incompatible with --write-bitmap-index")); | ||
if (use_full_name_hash < 0) | ||
use_full_name_hash = git_env_bool("GIT_TEST_FULL_NAME_HASH", 0); | ||
|
||
if (use_delta_islands) | ||
strvec_push(&rp, "--topo-order"); | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
/* | ||
* test-name-hash.c: Read a list of paths over stdin and report on their | ||
* name-hash and full name-hash. | ||
*/ | ||
|
||
#include "test-tool.h" | ||
#include "git-compat-util.h" | ||
#include "pack-objects.h" | ||
#include "strbuf.h" | ||
|
||
int cmd__name_hash(int argc UNUSED, const char **argv UNUSED) | ||
{ | ||
struct strbuf line = STRBUF_INIT; | ||
|
||
while (!strbuf_getline(&line, stdin)) { | ||
uint32_t name_hash = pack_name_hash(line.buf); | ||
uint32_t full_hash = pack_full_name_hash(line.buf); | ||
|
||
printf("%10"PRIu32"\t%10"PRIu32"\t%s\n", name_hash, full_hash, line.buf); | ||
} | ||
|
||
strbuf_release(&line); | ||
return 0; | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
#!/bin/sh | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Junio C Hamano wrote (reply to this): "Derrick Stolee via GitGitGadget" <[email protected]> writes:
> From: Derrick Stolee <[email protected]>
>
> As custom options are added to 'git pack-objects' and 'git repack' to
> adjust how compression is done, use this new performance test script to
> demonstrate their effectiveness in performance and size.
>
> The recently-added --full-name-hash option swaps the default name-hash
> algorithm with one that attempts to uniformly distribute the hashes
> based on the full path name instead of the last 16 characters.
>
> This has a dramatic effect on full repacks for repositories with many
> versions of most paths. It can have a negative impact on cases such as
> pushing a single change.
>
> This can be seen by running pt5313 on the open source fluentui
> repository [1]. Most commits will have this kind of output for the thin
> and big pack cases, though certain commits (such as [2]) will have
> problematic thin pack size for other reasons.
>
> [1] https://github.com/microsoft/fluentui
> [2] a637a06df05360ce5ff21420803f64608226a875
>
> Checked out at the parent of [2], I see the following statistics:
>
> Test this tree
> ------------------------------------------------------------------
> 5313.2: thin pack 0.02(0.01+0.01)
> 5313.3: thin pack size 1.1K
> 5313.4: thin pack with --full-name-hash 0.02(0.01+0.00)
> 5313.5: thin pack size with --full-name-hash 3.0K
> 5313.6: big pack 1.65(3.35+0.24)
> 5313.7: big pack size 58.0M
> 5313.8: big pack with --full-name-hash 1.53(2.52+0.18)
> 5313.9: big pack size with --full-name-hash 57.6M
> 5313.10: repack 176.52(706.60+3.53)
> 5313.11: repack size 446.7K
> 5313.12: repack with --full-name-hash 37.47(134.18+3.06)
> 5313.13: repack size with --full-name-hash 183.1K
>
> Note that this demonstrates a 3x size _increase_ in the case that
> simulates a small "git push". The size change is neutral on the case of
> pushing the difference between HEAD and HEAD~1000.
>
> However, the full repack case is both faster and more efficient.
Nice.
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
> t/perf/p5313-pack-objects.sh | 71 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 71 insertions(+)
> create mode 100755 t/perf/p5313-pack-objects.sh
"wc -c" -> "test_file_size" or "test-tool path-utils file-size"? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Derrick Stolee wrote (reply to this): On 9/19/24 5:58 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <[email protected]> writes:
>> t/perf/p5313-pack-objects.sh | 71 ++++++++++++++++++++++++++++++++++++
>> 1 file changed, 71 insertions(+)
>> create mode 100755 t/perf/p5313-pack-objects.sh
> > "wc -c" -> "test_file_size" or "test-tool path-utils file-size"?
Thanks. I was unfamiliar with those options.
I will squash this change into the next version. The use of a
variable for the packfile name was important at some point while
I was testing this with various repos on different platforms.
--- >8 ---
diff --git a/t/perf/p5313-pack-objects.sh b/t/perf/p5313-pack-objects.sh
index 5dcf52acb0..bf6f0d69e4 100755
--- a/t/perf/p5313-pack-objects.sh
+++ b/t/perf/p5313-pack-objects.sh
@@ -25,7 +25,7 @@ test_perf 'thin pack' '
'
test_size 'thin pack size' '
- wc -c <out
+ test_file_size out
'
test_perf 'thin pack with --full-name-hash' '
@@ -33,7 +33,7 @@ test_perf 'thin pack with --full-name-hash' '
'
test_size 'thin pack size with --full-name-hash' '
- wc -c <out
+ test_file_size out
'
test_perf 'big pack' '
@@ -41,7 +41,7 @@ test_perf 'big pack' '
'
test_size 'big pack size' '
- wc -c <out
+ test_file_size out
'
test_perf 'big pack with --full-name-hash' '
@@ -49,7 +49,7 @@ test_perf 'big pack with --full-name-hash' '
'
test_size 'big pack size with --full-name-hash' '
- wc -c <out
+ test_file_size out
'
test_perf 'repack' '
@@ -57,7 +57,8 @@ test_perf 'repack' '
'
test_size 'repack size' '
- wc -c <.git/objects/pack/pack-*.pack
+ pack=$(ls .git/objects/pack/pack-*.pack) &&
+ test_file_size "$pack"
'
test_perf 'repack with --full-name-hash' '
@@ -65,7 +66,8 @@ test_perf 'repack with --full-name-hash' '
'
test_size 'repack size with --full-name-hash' '
- wc -c <.git/objects/pack/pack-*.pack
+ pack=$(ls .git/objects/pack/pack-*.pack) &&
+ test_file_size "$pack"
'
test_done
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Junio C Hamano wrote (reply to this): Derrick Stolee <[email protected]> writes:
> On 9/19/24 5:58 PM, Junio C Hamano wrote:
>> "Derrick Stolee via GitGitGadget" <[email protected]> writes:
>
>>> t/perf/p5313-pack-objects.sh | 71 ++++++++++++++++++++++++++++++++++++
>>> 1 file changed, 71 insertions(+)
>>> create mode 100755 t/perf/p5313-pack-objects.sh
>> "wc -c" -> "test_file_size" or "test-tool path-utils file-size"?
>
> Thanks. I was unfamiliar with those options.
I was, too ;-). There are a few other test pieces that use "wc -c"
to count bytes, but it made me a bit nervous to use it to count
bytes in a binary file.
> test_size 'repack size' '
> - wc -c <.git/objects/pack/pack-*.pack
> + pack=$(ls .git/objects/pack/pack-*.pack) &&
> + test_file_size "$pack"
> '
I am perfectly fine (and actually even prefer) to pay the cost to
fork "ls" here, but if you really wanted to avoid it, we could do
something like
pack=$(
set x .git/objects/pack/pack-*.pack &&
test $# = 2 && echo "$2" || exit 1
) &&
test_file_size "$pack"
;-)
> test_perf 'repack with --full-name-hash' '
> @@ -65,7 +66,8 @@ test_perf 'repack with --full-name-hash' '
> '
>
> test_size 'repack size with --full-name-hash' '
> - wc -c <.git/objects/pack/pack-*.pack
> + pack=$(ls .git/objects/pack/pack-*.pack) &&
> + test_file_size "$pack"
> '
>
> test_done |
||
|
||
test_description='Tests pack performance using bitmaps' | ||
. ./perf-lib.sh | ||
|
||
GIT_TEST_PASSING_SANITIZE_LEAK=0 | ||
export GIT_TEST_PASSING_SANITIZE_LEAK | ||
|
||
test_perf_large_repo | ||
|
||
test_expect_success 'create rev input' ' | ||
cat >in-thin <<-EOF && | ||
$(git rev-parse HEAD) | ||
^$(git rev-parse HEAD~1) | ||
EOF | ||
|
||
cat >in-big <<-EOF | ||
$(git rev-parse HEAD) | ||
^$(git rev-parse HEAD~1000) | ||
EOF | ||
' | ||
|
||
test_perf 'thin pack' ' | ||
git pack-objects --thin --stdout --revs --sparse <in-thin >out | ||
' | ||
|
||
test_size 'thin pack size' ' | ||
test_file_size out | ||
' | ||
|
||
test_perf 'thin pack with --full-name-hash' ' | ||
git pack-objects --thin --stdout --revs --sparse --full-name-hash <in-thin >out | ||
' | ||
|
||
test_size 'thin pack size with --full-name-hash' ' | ||
test_file_size out | ||
' | ||
|
||
test_perf 'big pack' ' | ||
git pack-objects --stdout --revs --sparse <in-big >out | ||
' | ||
|
||
test_size 'big pack size' ' | ||
test_file_size out | ||
' | ||
|
||
test_perf 'big pack with --full-name-hash' ' | ||
git pack-objects --stdout --revs --sparse --full-name-hash <in-big >out | ||
' | ||
|
||
test_size 'big pack size with --full-name-hash' ' | ||
test_file_size out | ||
' | ||
|
||
test_perf 'repack' ' | ||
git repack -adf | ||
' | ||
|
||
test_size 'repack size' ' | ||
pack=$(ls .git/objects/pack/pack-*.pack) && | ||
test_file_size "$pack" | ||
' | ||
|
||
test_perf 'repack with --full-name-hash' ' | ||
git repack -adf --full-name-hash | ||
' | ||
|
||
test_size 'repack size with --full-name-hash' ' | ||
pack=$(ls .git/objects/pack/pack-*.pack) && | ||
test_file_size "$pack" | ||
' | ||
|
||
test_done |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
#!/bin/sh | ||
|
||
test_description='Tests pack performance using bitmaps' | ||
. ./perf-lib.sh | ||
|
||
GIT_TEST_PASSING_SANITIZE_LEAK=0 | ||
export GIT_TEST_PASSING_SANITIZE_LEAK | ||
|
||
test_perf_large_repo | ||
|
||
test_size 'paths at head' ' | ||
git ls-tree -r --name-only HEAD >path-list && | ||
wc -l <path-list | ||
' | ||
|
||
test_size 'number of distinct name-hashes' ' | ||
cat path-list | test-tool name-hash >name-hashes && | ||
cat name-hashes | awk "{ print \$1; }" | sort -n | uniq -c >name-hash-count && | ||
wc -l <name-hash-count | ||
' | ||
|
||
test_size 'number of distinct full-name-hashes' ' | ||
cat name-hashes | awk "{ print \$2; }" | sort -n | uniq -c >full-name-hash-count && | ||
wc -l <full-name-hash-count | ||
' | ||
|
||
test_size 'maximum multiplicity of name-hashes' ' | ||
cat name-hash-count | \ | ||
sort -nr | \ | ||
head -n 1 | \ | ||
awk "{ print \$1; }" | ||
' | ||
|
||
test_size 'maximum multiplicity of fullname-hashes' ' | ||
cat full-name-hash-count | \ | ||
sort -nr | \ | ||
head -n 1 | \ | ||
awk "{ print \$1; }" | ||
' | ||
|
||
test_done |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -45,7 +45,6 @@ rebase | |
remote | ||
remote-ext | ||
remote-fd | ||
repack | ||
reset | ||
restore | ||
rev-parse | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Junio C Hamano wrote (reply to this):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Derrick Stolee wrote (reply to this):