Patch: On-the-fly UTF-8 conversion #128

Closed
tsdh opened this issue Jun 14, 2018 · 5 comments

tsdh commented Jun 14, 2018

Over the last weekend, I migrated our huge Mercurial repository with 16 years of history (nearly 200,000 commits, about 20,000 Java files) to Git using the hg-fast-export.sh script.

My goal was also to convert our Java files to UTF-8 from a wild mix of ISO-8859-15, Cp1252, UTF-8, and simply broken encodings. I tested a git filter-branch --tree-filter ... approach, but someone on Stack Overflow suggested doing the conversion directly in either fast-export or fast-import. That's what I did, and it worked like a charm. Conversion time increased from about 4-5 hours to 56 hours, though. Guessing the current encoding with chardet is a bit costly, and decoding/encoding is expensive, too.

The patch is against the hg-4.6-compat branch.

From 081e406e8b1454bcbaf2a73a403ae2a229667970 Mon Sep 17 00:00:00 2001
From: Tassilo Horn <[email protected]>
Date: Sun, 10 Jun 2018 08:27:10 +0200
Subject: [PATCH] Convert java files to UTF-8 on the fly

Also change sourceEncoding in pom.xml accordingly
---
 hg-fast-export.py | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/hg-fast-export.py b/hg-fast-export.py
index 0714b30..eea3ad0 100755
--- a/hg-fast-export.py
+++ b/hg-fast-export.py
@@ -11,6 +11,7 @@ from optparse import OptionParser
 import re
 import sys
 import os
+import chardet
 
 if sys.platform == "win32":
   # On Windows, sys.stdout is initially opened in text mode, which means that
@@ -123,6 +124,8 @@ def get_author(logmessage,committer,authors):
       return r
   return committer
 
+src_enc_rx=re.compile(r'\<project\.build\.sourceEncoding\>.*\</project\.build\.sourceEncoding\>')
+
 def export_file_contents(ctx,manifest,files,hgtags,encoding=''):
   count=0
   max=len(files)
@@ -132,6 +135,17 @@ def export_file_contents(ctx,manifest,files,hgtags,encoding=''):
       sys.stderr.write('Skip %s\n' % (file))
       continue
     d=ctx.filectx(file).data()
+
+    if (d != None) and (file.endswith('.java')):
+      enc=chardet.detect(d)['encoding']
+      if (enc != None) and (enc != "ascii") and (enc != 'utf-8'):
+        if (enc != 'Windows-1252') and (enc != 'ISO-8859-1') and (enc != 'ISO-8859-15'):
+          enc = 'utf8' # fallback: at least we get a replacement character instead of garbage
+        d=u''.join(d.decode(enc, 'replace')).encode('utf8', 'replace')
+    elif (d != None) and (file.endswith('pom.xml')):
+      d=src_enc_rx.sub('<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>', d)
+
+
     if encoding:
       filename=file.decode(encoding).encode('utf8')
     else:
-- 
2.17.1

Of course, the patch is completely specific to my use-case (the target encoding is fixed, only Java files are converted, and the encoding setting in Maven pom.xml files is changed as well), but it might be a starting point for a new conversion option.
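For readers who want to try the same idea outside the patch, here is a minimal Python 3 sketch of the per-file conversion step. It is not the patch itself: it swaps chardet for a cheaper "try strict UTF-8 first" check and assumes that anything which is not valid UTF-8 is Cp1252, so it will not tell ISO-8859-15 apart from Cp1252 the way the chardet-based patch tries to. The names `to_utf8`, `convert`, `SRC_ENC_RX` and `SRC_ENC_UTF8` are hypothetical helpers, and the regex is non-greedy, unlike the one in the patch.

```python
import re

# Same idea as the patch's src_enc_rx, but compiled for bytes (Python 3)
# and non-greedy so that it stops at the first closing tag.
SRC_ENC_RX = re.compile(
    rb"<project\.build\.sourceEncoding>.*?</project\.build\.sourceEncoding>"
)
SRC_ENC_UTF8 = b"<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>"


def to_utf8(data: bytes, fallback: str = "cp1252") -> bytes:
    """Return data as UTF-8 bytes (hypothetical helper, not part of hg-fast-export)."""
    try:
        data.decode("utf-8")  # ASCII and valid UTF-8 pass through untouched
        return data
    except UnicodeDecodeError:
        # Assumption: a file that is not valid UTF-8 is a legacy 8-bit file.
        # Bytes undefined in the fallback codec become U+FFFD.
        return data.decode(fallback, "replace").encode("utf-8")


def convert(path: str, data: bytes) -> bytes:
    """Apply the two rewrites from the patch based on the file name."""
    if path.endswith(".java"):
        return to_utf8(data)
    if path.endswith("pom.xml"):
        return SRC_ENC_RX.sub(SRC_ENC_UTF8, data)
    return data
```

Checking for valid UTF-8 before anything else means that files which are already ASCII or UTF-8 never go through a decode/encode round trip, which may help with the run time mentioned above.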

(Just to address the obvious question: why not update the encoding with a normal commit?
Because (depending on how many non-ASCII characters you have in your code) it becomes almost impossible to cherry-pick/graft/merge commits from the era before the UTF-8 change-over without getting tons of merge conflicts.)

Owner

frej commented Jun 15, 2018

It is fun to see that fast-export is being used. Have you seen issue #95? When it is complete, it will provide a way to filter file contents during conversion without having to modify fast-export.

As you say, this is completely specific to your use-case, so I'll close this issue for now. Perhaps it can serve as inspiration for a future --filter-contents filter.

frej closed this as completed Jun 15, 2018

Owner

frej commented Jun 22, 2018

#95 is now merged. The above mangling of file contents can now be done by an external program using --filter-contents.
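As an illustration, such an external filter could look roughly like the Python 3 sketch below. It rests on assumptions that should be checked against the fast-export README: that the file content arrives on standard input, that the filtered content is written to standard output, and that the filter receives the file path, the hg hash, and a binary flag ("1" for binary) as its three arguments. The `transcode` and `main` names are hypothetical, and the Cp1252 fallback is a stand-in for the chardet-based detection in the patch.

```python
import sys


def transcode(data: bytes) -> bytes:
    """Pass valid UTF-8 through; treat everything else as Cp1252 (an assumption)."""
    try:
        data.decode("utf-8")
        return data
    except UnicodeDecodeError:
        return data.decode("cp1252", "replace").encode("utf-8")


def main(argv, stdin, stdout) -> int:
    # Assumed argument order: file path, hg hash, binary flag ("1" = binary).
    path = argv[1] if len(argv) > 1 else ""
    is_binary = len(argv) > 3 and argv[3] == "1"
    data = stdin.read()
    if not is_binary and path.endswith(".java"):
        data = transcode(data)
    stdout.write(data)  # everything else is passed through unchanged
    return 0


# To use this as a filter script, end the file with:
#   if __name__ == "__main__":
#       sys.exit(main(sys.argv, sys.stdin.buffer, sys.stdout.buffer))
```

Taking the streams as parameters keeps the logic testable with in-memory buffers; the real entry point passes the binary `sys.stdin.buffer` and `sys.stdout.buffer` so that no text-mode decoding happens on the way in or out.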

Author

tsdh commented Jun 30, 2018

@frej #95 came a bit too late for me, but it looks like it would have done the job, too. Great to have that now. I guess I'll have to convert some more repositories in the coming month. 👍


Utkarsh-nk commented Mar 17, 2022

@frej @tsdh I am using fast-export to convert some Mercurial repos to Git, but I am facing ASCII encoding errors during Java compilation.

Before finding this thread, I tried the approach below, and it fixed my current failures.
Add the following to the .gitattributes file and check the file in to the Mercurial repo:
*.java encoding=utf-8

Use the following in the javac Ant target:
<compilerarg line="-encoding utf-8"/>
Could you share your thoughts on whether the above makes sense?

If the above approach is not correct, could you please point me towards how I can solve the encoding problem using the --filter-contents option?

Owner

frej commented Mar 18, 2022

@Utkarsh-nk, I'm sorry, but I understand neither your question nor what your issue with fast-export is.

If your issue is with how to use --filter-contents, it is quite simple: you just add it to the command line as --filter-contents <path-to-your-filter> (./hg-fast-export.sh --help shows this). The readme documents the script's interface.

If you're asking about how to write a filter that detects an arbitrary encoding and converts it to UTF-8 or how to configure a Java build system to accept a particular encoding, this is not the right forum.

BTW, as far as I can tell from the man page for gitattributes, the directive encoding=utf-8 only sets the encoding to use when displaying the file in GUI tools. If you want Git to convert from the internal UTF-8 to something else when checking out files, you should use working-tree-encoding.
