Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pack-objects: allow --path-walk with --shallow #699

Merged
merged 4 commits into from
Oct 22, 2024

Conversation

derrickstolee
Copy link
Collaborator

@derrickstolee derrickstolee commented Oct 21, 2024

This pull request aims to correct a pretty big issue when dealing with UNINTERESTING objects in the path-walk API. They somehow were only exposed when trying to perform a push from a shallow clone.

This will require rewriting the upstream version so this is avoided from the start, but we can do a forward fix for now.

The key issue is that the path-walk API was not walking UNINTERESTING trees at the right time, and the way it was being done was more complicated than it needed to be. This changes some of the way the path-walk API works in the presence of UNINTERSTING commits, but these are good changes to make.

I had briefly attempted to remove the use of the edge_aggressive option in struct path_walk_info in favor of using the --objects-edge-aggressive option in the revision struct. When I started down that road, though, I somehow got myself into a bind of things not working correctly. I backed out to this version that is working with our test cases.

I tested this using the thin and big pack tests in p5313 which had the same performance as before this change.

The new change is that in a shallow clone we can get the same git push improvements.

I was hung up on testing this for a long time as I wasn't getting the same results in my shallow clone as in my regular clones. It turns out that I had forgotten to use --no-reuse-delta in my test command, so it was picking the deltas that were given by the initial clone instead of picking new ones per the algorithm. 🤦🏻

@derrickstolee
Copy link
Collaborator Author

I created derrickstolee#31 as a potential extension to include a more complicated edge walk to better mimic the behavior when constructing a packfile with shallow commits in it.

@derrickstolee
Copy link
Collaborator Author

I created derrickstolee#31 as a potential extension to include a more complicated edge walk to better mimic the behavior when constructing a packfile with shallow commits in it.

This is really important. I was able to confirm that without the second commit, my pushes from a shallow client to a full clone caused huge packfile sizes.

I'm not sure what caused this to happen, as the revision walk should have been the same. But I guess we need to do things the hard way.

Copy link
Member

@dscho dscho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea of edge_uninteresting makes a ton of sense to me. I wonder whether we can add a unit or regression test to verify that the delta is as intended (i.e. that suboptimal deltas are avoided)?

builtin/pack-objects.c Outdated Show resolved Hide resolved
path-walk.c Outdated Show resolved Hide resolved
path-walk.c Outdated Show resolved Hide resolved
path-walk.h Outdated Show resolved Hide resolved
@dscho
Copy link
Member

dscho commented Oct 21, 2024

I'm not sure what caused this to happen, as the revision walk should have been the same.

Maybe the "shallow commits" are treated as uninteresting, which would be the explanation of bad deltas because they contain the valuable delta targets in --depth=1 clones.

@derrickstolee
Copy link
Collaborator Author

I'm not sure what caused this to happen, as the revision walk should have been the same.

Maybe the "shallow commits" are treated as uninteresting, which would be the explanation of bad deltas because they contain the valuable delta targets in --depth=1 clones.

In the case where I'm pushing a single commit on top of my shallow commit, that shallow commit should be revealed as a "boundary" commit. I will try to debug into things to figure out what's going on to cause this to be different.

@derrickstolee derrickstolee force-pushed the shallow-path-walk branch 2 times, most recently from 38b444f to 3c94d2c Compare October 22, 2024 03:04
@derrickstolee derrickstolee marked this pull request as draft October 22, 2024 03:40
@derrickstolee
Copy link
Collaborator Author

I'm taking a break for the night. I'm running in circles and need to come back with fresh eyes.

These two tests in t5616-partial-clone.sh are actually already broken
and there are comments supporting that. Those comments were focused on
the GIT_TEST_FULL_NAME_HASH variable, but they also apply to this one.
We will want to avoid issues here.

Signed-off-by: Derrick Stolee <[email protected]>
@derrickstolee derrickstolee marked this pull request as ready for review October 22, 2024 14:01
@derrickstolee
Copy link
Collaborator Author

This version should be good to go. My testing that caused issues last night were 100% because I wasn't using --no-reuse-delta and that was causing my git pack-objects command to trip over existing deltas. That won't happen for our targeted use case which has freshly created the objects and so no deltas will exist on disk to reuse.

Copy link
Member

@dscho dscho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work, thank you so much!

My only suggestions in addition to the suggestions below would be to maybe try to reword the second commit message (maybe briefly explain for what scenarios the revs API's edge_hint_aggressive flag was introduced, i.e. stress the why over the what, and then merely mention that we need to address the same in the push-in-shallow scenario), and to fix the typos in the third commit message.

path-walk.c Show resolved Hide resolved
t/t5538-push-shallow.sh Outdated Show resolved Hide resolved
t/t5538-push-shallow.sh Show resolved Hide resolved
@derrickstolee
Copy link
Collaborator Author

Here's the range-diff for the latest push:
1:  17ce490c3ea = 1:  17ce490c3ea t5616: mark tests as bogus with --path-walk
2:  d13133bf96e ! 2:  8599051ccb4 path-walk: add new 'edge_aggressive' option
    @@ Commit message
     
         In preparation for allowing both the --shallow and --path-walk options
         in the 'git pack-objects' builtin, create a new 'edge_aggressive' option
    -    in the path-walk API.
    +    in the path-walk API. This option will help walk the boundary more
    +    thoroughly and help avoid sending extra objects during fetches and
    +    pushes.
     
         The only use of the 'edge_hint_aggressive' option in the revision API is
         within mark_edges_uninteresting(), which is usually called before
    @@ Commit message
         are walked until a boundary is found.
     
         We didn't use this in the past because we would mark objects
    -    UNINTERESTING after doing the initial commit walk to the boundary. But
    -    we actually should be marking these objects as UNINTERESTING, but we
    -    shouldn't _emit_ them all via the path-walk algorithm or else our delta
    -    calculations will get really slow.
    +    UNINTERESTING after doing the initial commit walk to the boundary. While
    +    we should be marking these objects as UNINTERESTING, we shouldn't _emit_
    +    them all via the path-walk algorithm or else our delta calculations will
    +    get really slow.
     
         Based on these observations, the way we were handling the UNINTERESTING
         flag in walk_objects_by_path() was overly complicated and buggy. A lot
3:  1203ee56eb8 ! 3:  2065f5464fc pack-objects: allow --shallow and --path-walk
    @@ Commit message
         However, before the previous change, a trivial removal of the warning
         would cause a failure in t5500-fetch-pack.sh when
         GIT_TEST_PACK_PATH_WALK is enabled. The shallow fetch would provide more
    -    objets than we desired, due to some incorrect behavior of the path-walk
    -    API, especially around walking unintersting objects.
    +    objects than we desired, due to some incorrect behavior of the path-walk
    +    API, especially around walking uninteresting objects.
     
    -    Now that these things were fixed, we can make this change. Further, we
    -    can add a test to show that the Git client is similarly careful about
    -    selecting the right objects during 'git push' from a shallow clone.
    +    To also cover the symmetrical case of pushing from a shallow clone, add
    +    a new test to t5538-push-shallow.sh that confirms the correct behavior
    +    of pushing only the new object. This works to validate both the
    +    --path-walk and --no-path-walk case when toggling the
    +    GIT_TEST_PACK_PATH_WALK environment variable. This test would have
    +    failed in the --path-walk case if we created it before the previous
    +    change.
     
         Signed-off-by: Derrick Stolee <[email protected]>
     
    @@ t/t5538-push-shallow.sh: EOF
     +
     +test_expect_success 'push new commit from shallow clone to origin is efficient' '
     +	git init origin &&
    -+	echo a >origin/a &&
    -+	git -C origin add a &&
    -+	git -C origin commit -m "base" &&
    -+	echo b >origin/b &&
    -+	git -C origin add b &&
    -+	git -C origin commit -m "tip" &&
    ++	test_commit -C origin a &&
    ++	test_commit -C origin b &&
     +
     +	git clone --depth=1 "file://$(pwd)/origin" client &&
     +	git -C client checkout -b topic &&

In preparation for allowing both the --shallow and --path-walk options
in the 'git pack-objects' builtin, create a new 'edge_aggressive' option
in the path-walk API. This option will help walk the boundary more
thoroughly and help avoid sending extra objects during fetches and
pushes.

The only use of the 'edge_hint_aggressive' option in the revision API is
within mark_edges_uninteresting(), which is usually called before
between prepare_revision_walk() and before visiting commits with
get_revision(). In prepare_revision_walk(), the UNINTERESTING commits
are walked until a boundary is found.

We didn't use this in the past because we would mark objects
UNINTERESTING after doing the initial commit walk to the boundary. While
we should be marking these objects as UNINTERESTING, we shouldn't _emit_
them all via the path-walk algorithm or else our delta calculations will
get really slow.

Based on these observations, the way we were handling the UNINTERESTING
flag in walk_objects_by_path() was overly complicated and buggy. A lot
of it can be removed and simplified to work with this new approach.

It also means that we will see the UNINTERESTING boundaries of paths
when doing a default path-walk call, changing some existing test cases.

Signed-off-by: Derrick Stolee <[email protected]>
@derrickstolee
Copy link
Collaborator Author

derrickstolee commented Oct 22, 2024

Huge thanks to @dscho for prompting me to add an extra test case, demonstrating an issue with shallow pushes. We now have a lot more confidence in the correctness.

Range-diff for latest push
1:  8599051ccb4 ! 1:  1af435e33b7 path-walk: add new 'edge_aggressive' option
    @@ path-walk.c: static void clear_strmap(struct strmap *map)
      	strmap_init(map);
      }
      
    -+static void show_edge(struct commit *commit UNUSED)
    ++static struct repository *edge_repo;
    ++static struct type_and_oid_list *edge_tree_list;
    ++
    ++static void show_edge(struct commit *commit)
     +{
    -+	/* empty */
    ++	struct tree *t = repo_get_commit_tree(edge_repo, commit);
    ++
    ++	if (!t)
    ++		return;
    ++
    ++	if (commit->object.flags & UNINTERESTING)
    ++		t->object.flags |= UNINTERESTING;
    ++
    ++	if (t->object.flags & SEEN)
    ++		return;
    ++	t->object.flags |= SEEN;
    ++
    ++	oid_array_append(&edge_tree_list->oids, &t->object.oid);
     +}
     +
      /**
    @@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
     +	 * This is particularly important when 'edge_aggressive' is set.
     +	 */
     +	info->revs->edge_hint_aggressive = info->edge_aggressive;
    ++
    ++	edge_repo = info->revs->repo;
    ++	edge_tree_list = root_tree_list;
     +	mark_edges_uninteresting(info->revs, show_edge, info->prune_all_uninteresting);
     +
      	info->revs->blob_objects = info->revs->tree_objects = 0;
2:  2065f5464fc = 2:  2361d5830cd pack-objects: allow --shallow and --path-walk
-:  ----------- > 3:  423e4cdb537 t5538: add test to confirm deltas in shallow pushes

Update: I also rechecked the thin pack tests in p5313 across a variety of full and shallow repos. Everything looks good there.

Copy link
Member

@dscho dscho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

There does not appear to be anything particularly incompatible about the
--shallow and --path-walk options of 'git pack-objects'. If shallow
commits are to be handled differently, then it is by the revision walk
that defines the commit set and which are interesting or uninteresting.

However, before the previous change, a trivial removal of the warning
would cause a failure in t5500-fetch-pack.sh when
GIT_TEST_PACK_PATH_WALK is enabled. The shallow fetch would provide more
objects than we desired, due to some incorrect behavior of the path-walk
API, especially around walking uninteresting objects.

To also cover the symmetrical case of pushing from a shallow clone, add
a new test to t5538-push-shallow.sh that confirms the correct behavior
of pushing only the new object. This works to validate both the
--path-walk and --no-path-walk case when toggling the
GIT_TEST_PACK_PATH_WALK environment variable. This test would have
failed in the --path-walk case if we created it before the previous
change.

Signed-off-by: Derrick Stolee <[email protected]>
It can be notoriously difficult to detect if delta bases are being
computed properly during 'git push'. Construct an example where it will
make a kilobyte worth of difference when a delta base is not found. We
can then use the progress indicators to distinguish between bytes and
KiB depending on whether the delta base is found and used.

Signed-off-by: Derrick Stolee <[email protected]>
@derrickstolee
Copy link
Collaborator Author

I pushed the minor test name change updates in order to trigger new builds that won't time out on mac now that #700 is merged.

@dscho
Copy link
Member

dscho commented Oct 22, 2024

As to win test (8) failing, that's a flake that I documented here.

@dscho dscho merged commit 81ca930 into microsoft:vfs-2.47.0 Oct 22, 2024
49 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants