Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Same-named local subrepos with different origins cannot be converted due to ambiguity in the mapping #188

Open
jdanielp opened this issue Jan 16, 2020 · 11 comments
Labels
contribution-required Maintainer has no time and/or no plans to work on this, interested parties should submit a patch verified-bug A confirmed bug in fast-export

Comments

@jdanielp
Copy link

jdanielp commented Jan 16, 2020

Please see attached an example repo and mapping file that cannot be converted with the given submodule mapping technique.

Top level repo is srt_ambiguous. In v1 and v2 branches there is a srt_subx subrepo. But, they have different origins per branch. in branch v1 srt_subx corresponds to srt_suba repo. In v2 branch srt_subx refers to srt_subb repo.

I suspect the mapping file convention will need to be extended to define per-branch mappings. As with issue #187, I am not familiar-enough with the inner workings of hg2git and hg-fast-export(.sh|.py) to hack this myself.

ambiguous.zip

@frej frej added contribution-required Maintainer has no time and/or no plans to work on this, interested parties should submit a patch verified-bug A confirmed bug in fast-export labels Jan 17, 2020
@frej
Copy link
Owner

frej commented Jan 26, 2020

When extending the mapping file format, be sure to also handle the case when a subrepo is changed without a change in the current branch. I'm thinking of a case when, for example, you have been using an internally generated conversion of a Subversion repo, but when upstream switched to Mercurial, you started using that repo directly.

@jdanielp
Copy link
Author

I'm making my way through this. It's doable without too much surgery so far as I can tell. I have submodule_mappings converted to deal with different urls in different branches. Now I need to deal with the mapping_cache, marks_cache, and subrepo_cache. Those give me some concern since I'll need to learn their usage, etc.

@frej Do you have an example repo for the scenario you describe?

@frej
Copy link
Owner

frej commented Jan 28, 2020

@jdanielp: Sorry, no I don't have an example. Another thing, make sure that you stay backwards compatible so that existing incremental setups are not broken.

@jdanielp
Copy link
Author

jdanielp commented Jan 29, 2020

@frej

I was able to get a solution in place this morning. That solution works for the example repository attached on this issue.

Then I realized that I had to get my validation script working so I could validate bigger repos automatically rather than manually spot checking a few commits. I have that script working now (hopefully), but it's late, so I will verify my changes work Wednesday.

The basic idea of my changes is that I allow the user to specify a json file for --subrepo-map. No .sh script changes needed. The code to load the mapping sees that it's a .json and uses a different function I added.

With json, the user can define a per-branch mapping in a file format that lends itself to nested dicts. For example, here is the mapping file for the srt_ambiguous example repo:

{
    "srt_subx": { "": "c:/Users/Jpietkiewicz/repos/conversion/git/srt_suba",
                 "v2": "c:/Users/Jpietkiewicz/repos/conversion/git/srt_subb"
                }
}

I did something special in that file. You'll notice that the first entry is an empty string for the branch name. That's intentional. A user might have only one branch that uses a different url. Therefore, I defined "" to be the fallback. If a branch name does not exist in the keys for the sub-dict srt_subx, then the url for "" is returned. I put that into a helper function to keep that logic contained to one place.

Also, you don't have to have a mapping per branch:

A made-up example:

{
    "srt_subx": { "": "c:/Users/Jpietkiewicz/repos/conversion/git/srt_suba",
                 "v2": "c:/Users/Jpietkiewicz/repos/conversion/git/srt_subb"
                }
    "another_subrepo": "c:/Users/Jpietkiewicz/repos/conversion/git/another_sub"
}

Any consumers of subrepo_cache or submodule_mappings now need to know that subrepo_cache[name] or submodule_mapping[name] might be a dict or it might be a str. The places that make use of subrepo_cache and submodule_mapping have been altered by my changes to handle appropriately.

Edit: I misspoke on subrepo_cache. That is always a nested dict by submodule name, then branch name. It's late. In any case, the code has been updated to handle the new scheme for both.

Unless I did something completely boneheaded, this should all be backwards compatible with the old mapping format and behavior. If the user uses anything but a .json, it will use the old load_mapping functionality. I can run some tests on my end to re-convert some repos to make sure they match how they were converted before my changes. Do you have any tests or example repos for testing?

I'm not sure what you mean by 'existing incremental setups'. I'm asking to make sure I don't ignore or misunderstand. Do you mean scripts that used to work before should still work now? Then yes, my changes will definitely accommodate.

One last thing I remembered - I'll obviously need to update the README-SUBMODULES.md once we've agreed on the strategy. I'll hold off on that until we get the code in a good place.

I know this is alot of info. I should probably have made you a PR to look at (if only just for the preliminaries). I'm a newbie with PRs or I'd just do one now. I'm assuming I would just keep updating the branch for the PR as we go back and forth. Like I said - total noob when it comes to PRs. I need to go slow so I don't do something stupid this late at night. :)

Off to bed.

@frej
Copy link
Owner

frej commented Jan 29, 2020

@MokhamedDakhraui and @johannesc I would appreciate your input on this issue as I expect you to have a better grasp of this part of the code than I have.

Do you mean scripts that used to work before should still work now?

Exactly, I don't want to break someones setup that has been running for years just because fast-export is updated, having aquired a new feature.

[The following depends on my assumption that what's stored in the super Mercurial repo when a sub-repo is updated, is the hash of the sub-repo. If that's not true, just tell me I'm wrong]

During conversion we write the .git/hg2git-mapping file which lists all hashes on the Mercurial side and links them to the Mercurial changeset number which, in .git/hg2git-marks is mapped to the corresponding Git hash. If we were to extend the subrepo mapping file to, for each mercurial subrepo path, allow multiple Git repos, we could, at startup, load all the .git/hg2git-mapping files. Then, during conversion of the super repo, when we detect that a submodule has changed, we look up the hash and by that we can figure out which converted sub-repo it belongs to and update the subrepo file accordingly. That way there is no need for manual configuration per branch, just "for this path, one of these converted GIt repos will contain the commits you need".

@johannesc
Copy link
Contributor

I think using json as an additional format should work and be backwards compatible.

Note that you should always use relative paths to the subrepos, otherwise you will end up with the absolute path/url in the git history. So your example should probably be:

{
    "srt_subx": { "": "../srt_suba",
                 "v2": "../srt_subb"
                }
    "another_subrepo": "../another_sub"
}

Another option is to use yaml which I find a little easier to hand write. Your example could then look like:

srt_subx:
  "": "../srt_suba"
  v2: "../srt_subb"

another_subrepo:
  "": "../another_sub"

Parsing yaml in python is as easy as parsing json.

@jdanielp
Copy link
Author

@johannesc Thanks for the feedback. I will have a look at yaml instead.

Actually, your comment addresses another burning question I had about the mappings. I realize we're going on a tangent,but I'll try to be quick. I assume relative paths mean I always have to keep a copy of the submodules next to my repo, e.g. in ~/repos/ I have srt_ambiguous, another_subrepo, srt_suba, and srt_subb. Currently we work directly in srt_ambiguous alone without the need for subrepos alongside. Is keeping a copy adjacent to the top level repo a common practice with git submodules or is this a requirement for this type of conversion? The paths couldn't be urls to bitbucket repos since fast-export looks in those actual folders for the hg2git-* files, right?

@frej I know I originally thought I'd have to mess with the format of the hg2git-* files, but I did not in my current implementation. I know I thought about it, but I don't remember why I did not. I think the assumption was that the the sha1 would not collide between different branches and so it does not matter so long as the submodule_mappings remained consistent. I will obviously need to spend some time thinking about that some more and providing justification.

Unfortunately, I have to put this one on pause for another day or so. I am investigating another issue in my converted work repo that has me scratching my head. Looks like we have some commits that change hgsubstate that refer to hashes I don't see in the mercurial repo. Not sure what is happening there. Not sure if related to #187 or not, but I doubt it.

I'll keep pressing forward on this one as soon as I figure out what's going on with the head-scratcher.

@johannesc
Copy link
Contributor

@jdanielp When working with fast-export you need a local copy of the subrepos. Once you migrate to git they will be "hidden" inside the .git/modules/ when you clone a repository (if you only checkout a version where the submodule is removed the submodule will not be downloaded).

Using relative url means that you can move your repository from bitbucket to github to sourceforge without any trouble as long as you move all the repositories.

I hope I made things more clear to you!

@jdanielp
Copy link
Author

@johannesc You made things much clearer. And you taught me about the existence of .git/modules; I did not know about that. Thank you.

@frej
I was able to get my other problem diagnosed and am able to focus back on this issue. Turns out mercurial lets you commit 'corrupted' .hgsubstate that refer to non-existent commits in the subrepo if the user does not correctly commit in the correct order (commit in subrepo, then at top level repo). At least, that's what my boss and I think happened in the few cases we found in our repo. I'll need to manually intervene for those commits somehow, or detect the problem and make sure that the .gitmodules file has an entry for those commits. I can probably figure out relatively quickly how to reproduce that issue and submit a new issue with a new example repo. Still not sure what the correct course of action is in that scenario, but it ate my lunch today and I was not able to work on this.

I will not be able to work on this again until Monday unfortunately. I need to take Fri-Sun for some personal time.

I will be back in earnest on this (#188) issue on Monday. The good news is that I seem to have worked out all the kinks in my validation script and it's currently running (slowly) on the big work repo. That means I should have a way to validate my changes for this issue.

I will ask my boss about the possibility of "donating" that backup script to this project. That way we at least have a way to run regression tests on new changes.

Regards to all. Sorry for the long posts and not much work. :)

@jdanielp
Copy link
Author

@frej Sorry to leave you hanging on this. Some things have come up at work that have pulled me off this issue. I've confirmed my fix will convert the repo successfully, but I don't know how future-proof the fix is. At least the good news is that my validation script has verified the conversion with my fix is successful.

I hope to be back on this in a week or two. Unfortunately, I cannot be sure of the exact timing due to the other work issues. If you want me to put a patch here for someone else to look at, let me know and I will do that.

@frej
Copy link
Owner

frej commented Feb 22, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
contribution-required Maintainer has no time and/or no plans to work on this, interested parties should submit a patch verified-bug A confirmed bug in fast-export
Projects
None yet
Development

No branches or pull requests

3 participants