Parallelize git-fat #34

Open
bhklein opened this issue Jan 21, 2015 · 5 comments

Comments

@bhklein

bhklein commented Jan 21, 2015

Hi all,
Has anyone looked into parallelizing some of the git-fat commands such as the one here? I was thinking about trying this myself but would first like to see if anyone here had any thoughts on the matter.
Thanks

@abraithwaite

To be frank, I don't think parallelizing git-fat would really help. The major commands are all I/O bound: push/pull are network bound, the filters are called by git, and anything moving files around is likely disk bound.

I'd be delighted to be proven wrong though, so if you'd like to investigate and submit a pull request with some numbers showing improvements, I'd welcome it.

@justinclift

With data that goes over ssh (e.g. rsync over ssh) it might be worth thinking about.

Ssh itself is not at all throughput oriented (and not multi-threaded, AFAIK), so a single synchronous transfer over it on a fast network generally doesn't get anywhere near wire speed. Running parallel transfers over it does get to about wire speed (as lftp and others do).

Parallelizing at least that part of git-fat might be a significant win for large repos with rsync-over-ssh remote stores.
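
To make the idea concrete, here is a rough sketch of what parallel rsync transfers could look like. It is not git-fat's code: the function names, the worker count, and the example paths are all made up for illustration, and it has not been benchmarked. It only assumes rsync's `--from0` and `--files-from=-` options, which let a file list be fed on stdin.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def split_round_robin(items, n):
    """Deal items into at most n non-empty batches, round-robin."""
    batches = [items[i::n] for i in range(n)]
    return [b for b in batches if b]


def build_rsync_command(src, dst):
    # --files-from=- reads the file list from stdin; --from0 makes it NUL-separated.
    return ['rsync', '--from0', '--files-from=-', src, dst]


def transfer_batch(files, src, dst, run=subprocess.run):
    """Run one rsync process for one batch of file names."""
    payload = b'\0'.join(f.encode('utf-8') for f in files)
    return run(build_rsync_command(src, dst), input=payload, check=True)


def parallel_transfer(files, src, dst, workers=4, run=subprocess.run):
    """Start up to `workers` rsync processes at once, one per batch.

    Threads are enough here: each one just waits on its own rsync
    (and ssh) child process, so the transfers overlap.
    """
    batches = split_round_robin(list(files), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(transfer_batch, b, src, dst, run) for b in batches]
        return [f.result() for f in futures]


# Hypothetical usage (paths are placeholders, not git-fat's real layout):
# parallel_transfer(wanted_files, 'user@host:/fat/store/', '.git/fat/objects/')
```

Whether this gets anywhere near wire speed would need the kind of numbers asked for above; each worker opens its own ssh connection, so the remote has to tolerate that.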

@AndrewJDR

So I've actually been looking at that very function (checkout()).

I wanted to assess this by simply backgrounding all the git checkout-index calls and seeing how that performed, but the problem I ran into is that the following line takes a lock on the git index, so there's no way to run simultaneous git checkout-index calls:
sub.check_call(['git', 'checkout-index', '--index', '--force', fname])
My next step was to try to find a way around this, perhaps by using an external tool capable of understanding the git index in a more multi-threaded way (I believe https://rtyley.github.io/bfg-repo-cleaner/ does something like this), but I haven't gotten that far yet.
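
One possible way to sidestep the lock, rather than fight it, would be to batch: hand all the file names to a single git checkout-index process instead of spawning one per file. This is batching, not parallelism, and it is only a sketch. `checkout_many` and its helpers are hypothetical names, and it is untested against git-fat's checkout(). It relies on git checkout-index's `--stdin` and `-z` options, which read NUL-terminated paths from standard input.

```python
import subprocess


def checkout_index_command():
    # Same --index/--force flags as the per-file call, but paths come from stdin.
    # -z (only meaningful with --stdin) means paths are NUL-terminated.
    return ['git', 'checkout-index', '--index', '--force', '-z', '--stdin']


def encode_paths(fnames):
    """Encode file names as a NUL-terminated byte stream for --stdin -z."""
    return b''.join(f.encode('utf-8') + b'\0' for f in fnames)


def checkout_many(fnames, run=subprocess.run):
    """Check out every file in one git process, so the index is locked once."""
    fnames = list(fnames)
    if not fnames:
        return None  # nothing to do; avoid spawning git at all
    return run(checkout_index_command(), input=encode_paths(fnames), check=True)
```

This removes the per-file process startup and repeated index locking, but the smudge filter would still run once per file inside that single git process, so it may not help much where the filter itself is the slow part.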

Also see my issue here for more of my performance notes about checkout():
#37

@abraithwaite

Hmm, Windows performance will be one tough nut to crack, it seems. Just a heads up though: if we do end up doing something about it, I'd like to keep all dependencies optional and configurable. What attracted me to git-fat in the first place was that it was only one file and used rsync.

@AndrewJDR

It will be tough. The more I look at it, the more I think it cannot be truly addressed without changes to git itself. Until that happens, I'm not convinced the smudge/clean filter approach is very good for performance in general. git-annex has a long post on this:
http://git-annex.branchable.com/todo/smudge/
