Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

octopus-merge (part 2: blob-merge) #1585

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from
Draft

octopus-merge (part 2: blob-merge) #1585

wants to merge 6 commits into from

Conversation

Byron
Copy link
Owner

@Byron Byron commented Sep 7, 2024

Implement an octopus merge based on trees, and (mostly) equivalent to merge-ORT in Git. The foundation, tree-editing, was implemented in #1566.

Related to gitbutlerapp/gitbutler#4793.

Tasks

This PR was re-focussed on blob-based conflict detection and the generation of blob merge-results with conflict markers.

  • sketch API and document all knowledge from Git docs through it
    • flatten merge into platform API, similar to how it's done in diff Bring merge::State to being able to execute the merge.
  • Pipeline implementation and tests
    • renormalization also works
  • Platform implementation and tests
    • generalize merge-command invocation - lots of context must be passed
      • See if shell-quoting of the %P argument for merge-drivers is needed for it to get through to the driver program
      • Allow to pass additional environment variables so GIT_REFLOG_ACTION, GIT_EXEC_PATH and GIT_PREFIX can be passed
    • Can 'zealous' conflict resolution (or information) be part of that, whether it's used or not? Is it decided here?
    • binary merge
      • with large file and binary file support (i.e. it doesn't fail due to these conditions)
    • text merge
      • resolve conflicts with union, ours and theirs
      • merge style
      • diff3 style
      • zdiff3 style
      • integrate tests for diff3-conflict-markers.sh (some special cases)
      • tests for newline handling (in baseline)
      • integrate tests for merge-file (less simplistic input)
      • How are newlines written in the <<< lines that the text merge is adding? Probably configurable?
      • What about deletions when zealously handling hunks? Have a test for that!
    • assure tests run into every condition.
  • validate that a difference in line-ending style will clash (it should, tokens-with-line-endings are used) (also for gix-diff.

Questions

  • How to deal with MERGE_HEAD and additional files the merge might create?
    • For merging commits only
  • Does Git provide binary data to external merge programs (and diff drivers, for that matter?)
    • Yes, and it also does that for files that are too large (larger than 1023MB). It has no problem reading them from disk to memory though, which is inefficient, and we should do better.

Next PR / Outscoped

  • textconv with context, see this gist for details.
    • There seem to be different 'tiers' of tools, some don't get GIT_DIR set, others do.
    • It also seems that diff-programs get too much context right now, but that depends on how much is passed to them by the caller as gix-command::Context.
  • How to model virtual-merge-bases? Can be none or many, user should have control over how this is done.
  • Understand how conflicts are currently handled in GitButler.
  • Actual tree-based merging
    • See also, all the merge options
    • Submodule merges are also possible! Maybe outscope it though! libgit2 also doesn't try it.
    • diff3-conflict-markers.sh - be sure to capture the 'empty tree' label , but also other special cases
  • Make blob-merge based conflict handling possible in the tree merge from gix at least.

Research

Everything is about MergeORT.

  • it uses an empty tree if there is no merge-base - we must allow the same.
  • it allows for multiple merge-bases, creating a virtual one by merging all merge-bases together using the same algorithm, recursively.
  • merges can have conflicts without a individual files being involved, for instance when directory renames clash
  • Must make sure that possible types of conflicts are properly communicated, to not degenerate information
  • It puts conflict-markers in the result tree, with annotations to provide additional context
  • Need resolution configuration, see git2::MergeOptions.
  • data stored by path, and is interned in the map to allow pointer-based comparisons
    • merge-info with everything one needs to know, also related to renames
    • or conflict information
    • it uses a memory-pool/arena to get memory for many paths all at once (and also release it like that)
  • paths start out as conflicted, and then can later be changed to non-conflicting if a content-based merged succeeded.
    • If it remains conflicts, the meta-data is used to produce an 'as merged as possible' version with conflict markers that can be checked out to the working tree.
  • hunks can partially overlap, but can also be resolved line-by line to some extend.

Crates

Mostly for performance optimization, also interesting for handling the index, and the tree::editor types.

Building Blocks for String Interning

String Interining

Handle Special Cases

  • A file was renamed differently
  • deal with "merge.directoryRenames"

Questions

Is git2::merge_trees() a trivial merge? Does it handle all the cases of MergeORT?

Maybe not, but it definitely handles rename tracking. See these git2 flags for more information.

How does rename-tracking affect a tree-merge?

The first round is done without renames, then there is a second round to find renames, and perform the merge of renamed items.

How is an octopus merge implemented, particularly with Merge ORT?

  • recursive mere-base merge to get virtual merge base
  • three-way tree iterator merge
  • rename tracking

References

Removed Utility

Prints a patch with context 3, using bytes only, based on imara-diff.

/// An adapted version of the unified diff writer to get a first idea.
        // TODO: remove this
        #[allow(dead_code)]
        mod unified_test {
            use bstr::ByteVec;
            use imara_diff::intern::{InternedInput, Interner, Token};
            use imara_diff::Sink;
            use std::ops::Range;

            /// A [`Sink`] that creates a textual diff
            /// in the format typically output by git or gnu-diff if the `-u` option is used
            pub struct UnifiedDiffBuilder<'a, W>
            where
                W: std::io::Write,
            {
                before: &'a [Token],
                after: &'a [Token],
                interner: &'a Interner<&'a [u8]>,

                pos: u32,
                before_hunk_start: u32,
                after_hunk_start: u32,
                before_hunk_len: u32,
                after_hunk_len: u32,

                buffer: Vec<u8>,
                dst: W,
            }

            impl<'a, W> UnifiedDiffBuilder<'a, W>
            where
                W: std::io::Write,
            {
                /// Create a new `UnifiedDiffBuilder` for the given `input`,
                /// that will writes it output to the provided implementation of [`Write`].
                pub fn with_writer(input: &'a InternedInput<&'a [u8]>, writer: W) -> Self {
                    Self {
                        before_hunk_start: 0,
                        after_hunk_start: 0,
                        before_hunk_len: 0,
                        after_hunk_len: 0,
                        buffer: Vec::with_capacity(8),
                        dst: writer,
                        interner: &input.interner,
                        before: &input.before,
                        after: &input.after,
                        pos: 0,
                    }
                }

                fn print_tokens(&mut self, tokens: &[Token], prefix: char) {
                    for &token in tokens {
                        self.buffer.push_char(prefix);
                        self.buffer.extend_from_slice(self.interner[token]);
                    }
                }

                fn flush(&mut self) {
                    if self.before_hunk_len == 0 && self.after_hunk_len == 0 {
                        return;
                    }

                    let end = (self.pos + 3).min(self.before.len() as u32);
                    self.update_pos(end, end);

                    writeln!(
                        &mut self.dst,
                        "@@ -{},{} +{},{} @@",
                        self.before_hunk_start + 1,
                        self.before_hunk_len,
                        self.after_hunk_start + 1,
                        self.after_hunk_len,
                    )
                    .unwrap();
                    self.dst.write_all(&self.buffer).unwrap();
                    self.buffer.clear();
                    self.before_hunk_len = 0;
                    self.after_hunk_len = 0
                }

                fn update_pos(&mut self, print_to: u32, move_to: u32) {
                    self.print_tokens(&self.before[self.pos as usize..print_to as usize], ' ');
                    let len = print_to - self.pos;
                    self.pos = move_to;
                    self.before_hunk_len += len;
                    self.after_hunk_len += len;
                }
            }

            impl<W> Sink for UnifiedDiffBuilder<'_, W>
            where
                W: std::io::Write,
            {
                type Out = W;

                fn process_change(&mut self, before: Range<u32>, after: Range<u32>) {
                    if before.start - self.pos > 6 {
                        self.flush();
                        self.pos = before.start - 3;
                        self.before_hunk_start = self.pos;
                        self.after_hunk_start = after.start - 3;
                    }
                    self.update_pos(before.start, before.end);
                    self.before_hunk_len += before.end - before.start;
                    self.after_hunk_len += after.end - after.start;
                    self.print_tokens(&self.before[before.start as usize..before.end as usize], '-');
                    self.print_tokens(&self.after[after.start as usize..after.end as usize], '+');
                }

                fn finish(mut self) -> Self::Out {
                    self.flush();
                    self.dst
                }
            }
        }
    }

@Byron Byron changed the title octopus-merge (part 2) octopus-merge (part 2: blob-diff) Sep 11, 2024
@Byron Byron changed the title octopus-merge (part 2: blob-diff) octopus-merge (part 2: blob-merge) Sep 11, 2024
@Byron
Copy link
Owner Author

Byron commented Sep 12, 2024

@EliahKagan Just as a note in case you'd like to submit a first patch to Git that will definitely be able to land, maybe to warm-up with the mailing-list workflow, then I have something for you.

The attribute-documentation about three-way merges contains a reference to merge.default which is indeed used in code as well.

However, the git-config documentation doesn't mention it at all.

@Byron Byron force-pushed the merge branch 3 times, most recently from 0ae3791 to ff0975c Compare September 13, 2024 12:13
gix-diff/src/blob/pipeline.rs Outdated Show resolved Hide resolved
@Byron Byron force-pushed the merge branch 3 times, most recently from 22e88d7 to 75bbe2b Compare September 14, 2024 07:39
@Byron Byron force-pushed the merge branch 3 times, most recently from f7643ae to 917e52a Compare September 15, 2024 05:39
@Byron Byron force-pushed the merge branch 7 times, most recently from 9f30cc2 to 2f9c91b Compare September 18, 2024 08:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants