[RFE#1726] refactor: filters: create(Reader|Writer), get(Input|Output)Encoding overrides #802
…verrides
- Bump [email protected]
- Use the BOMInputStream class in place of hand-crafted detectors
- Prefer overriding the get(Input|Output)Encoding methods over create(Reader|Writer)
Signed-off-by: Hiroshi Miura <[email protected]>
Here is a point reviewers should know.
Charset charset;
if (inEncoding != null) {
charset = Charset.forName(inEncoding);
BOMInputStream bomInputStream = BOMInputStream.builder().setFile(inFile)
.setByteOrderMarks(ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE).get();
bomLastParsedFile = bomInputStream.getBOM();
String charset;
if (bomLastParsedFile != null) {
charset = bomLastParsedFile.getCharsetName();
This is the place to look for the core change to the AbstractFilter class.
I use BOMInputStream.builder() to detect the BOM for UTF-8, UTF-16BE and UTF-16LE.
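As a rough illustration of what this detection step checks, here is a JDK-only sketch that recognizes the same three byte order marks. It is not the Commons IO implementation and not the code in this PR, which delegates to BOMInputStream; `BomSniffer` and `detectCharsetName` are hypothetical names introduced only for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of OmegaT or Commons IO.
final class BomSniffer {
    private BomSniffer() {
    }

    /**
     * Returns the charset name implied by a leading BOM, or null when the
     * data does not start with a UTF-8, UTF-16BE or UTF-16LE BOM.
     */
    static String detectCharsetName(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8.name();
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE.name();
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE.name();
        }
        return null;
    }

    // Compares the first bytes of data, as unsigned values, with the prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        System.out.println(detectCharsetName(utf8WithBom));
        System.out.println(detectCharsetName("plain".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

A null result corresponds to the "no BOM found" case, where the caller has to fall back to another source for the encoding.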
if (bomLastParsedFile != null) {
charset = Charset.forName(bomLastParsedFile.getCharsetName());
} else if (outEncoding != null) {
I've changed this so that a charset already detected from the BOM takes priority over outEncoding.
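The selection order can be sketched in isolation as follows. `OutputCharsetChooser` and `choose` are hypothetical names, and the final UTF-8 fallback is an assumption for the example; the hunk above only shows the first two branches.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing the selection order only.
final class OutputCharsetChooser {
    private OutputCharsetChooser() {
    }

    /**
     * Picks the output charset: a charset detected from the source BOM wins,
     * then the explicitly requested output encoding, then UTF-8.
     */
    static Charset choose(String detectedBomCharsetName, String outEncoding) {
        if (detectedBomCharsetName != null) {
            return Charset.forName(detectedBomCharsetName);
        }
        if (outEncoding != null) {
            return Charset.forName(outEncoding);
        }
        // Assumed last-resort fallback; not shown in the quoted hunk.
        return StandardCharsets.UTF_8;
    }
}
```

With this order, a file that was read with a UTF-16LE BOM is written back as UTF-16LE even when a different output encoding was requested.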
protected String getInputEncoding(FilterContext filterContext, File infile) throws IOException {
String encoding = filterContext.getInEncoding();
if (encoding == null && isSourceEncodingVariable()) {
try (HTMLReader hreader = new HTMLReader(infile.getAbsolutePath(), StandardCharsets.UTF_8.name())) {
The fallback charset is UTF-8 when there is no hint.
@@ -52,9 +52,9 @@
* @author Maxym Mykhalchuk
* @author Didier Briel
*/
public class HTMLReader extends Reader {
public class HTMLReader extends Reader implements AutoCloseable {
Make HTMLReader AutoCloseable so it can be used in a try-with-resources statement.
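A minimal sketch of the try-with-resources pattern with a Reader subclass follows. `TrackingReader` is a hypothetical stand-in, not HTMLReader itself. Note that java.io.Reader already implements Closeable, which extends AutoCloseable, so the sketch needs no extra interface declaration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a Reader subclass such as HTMLReader.
final class TrackingReader extends Reader {
    private final Reader delegate;
    private boolean closed;

    TrackingReader(Reader delegate) {
        this.delegate = delegate;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        return delegate.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        closed = true;
        delegate.close();
    }

    boolean isClosed() {
        return closed;
    }

    /** Reads one character inside try-with-resources and reports whether close() ran. */
    static boolean closesAutomatically() {
        TrackingReader reader = new TrackingReader(new StringReader("<html></html>"));
        try (TrackingReader r = reader) {
            r.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return reader.isClosed();
    }

    public static void main(String[] args) {
        System.out.println(closesAutomatically());
    }
}
```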
/**
* Creates a reader of an input file.
* <p>
* Override because of keep buggy behavior in OmegaT 5.7.1 or before. It set
* default encoding US-ASCII but Java standard InputStreamReader class
* wrongly accept non-ASCII characters as-is.
* </p>
*
* @param inFile
* The source file.
* @param inEncoding
* Encoding of the input file, if the filter supports it.
* Otherwise null.
* @return The reader for the source file
* @throws IOException
* If any I/O Error occurs upon reader creation
*/
@Override
public BufferedReader createReader(File inFile, String inEncoding) throws IOException {
Charset charset;
if (inEncoding != null) {
charset = Charset.forName(inEncoding);
} else {
charset = Charset.defaultCharset();
}
Here we finally change a buggy behavior. In previous versions, ResourceBundleFilter could read files in the platform default charset, such as Windows-1251, which does not comply with the Java standard.
new BOMInputStream(new FileInputStream(file)), StandardCharsets.US_ASCII))) {
BOMInputStream.builder().setFile(file).get(), StandardCharsets.US_ASCII))) {
This change comes from a change in Commons IO.
We should not use the platform default charset, such as Windows-1251 or Shift-JIS.
We can use StandardCharsets.UTF_8 instead.
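To show why an explicit charset matters, here is a small JDK-only sketch that decodes the same UTF-8 bytes twice. ISO-8859-1 stands in for a legacy platform default because every Java runtime is guaranteed to provide it; `ExplicitCharsetDemo` and `readFirstLine` are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of OmegaT.
final class ExplicitCharsetDemo {
    private ExplicitCharsetDemo() {
    }

    /** Decodes the bytes with the given charset and returns the first line. */
    static String readFirstLine(byte[] data, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "caf" followed by U+00E9, encoded as UTF-8 (the last character takes two bytes).
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // Decoding with the charset the file was written in round-trips the text.
        System.out.println(readFirstLine(utf8Bytes, StandardCharsets.UTF_8));
        // Decoding with a single-byte legacy charset turns U+00E9 into two characters.
        System.out.println(readFirstLine(utf8Bytes, StandardCharsets.ISO_8859_1));
    }
}
```

When the charset comes from Charset.defaultCharset(), which of these two results a user sees can depend on the machine, which is the argument for a fixed UTF-8 fallback.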
The change here will affect all filter authors and existing filters.
Not only the behavior of the 5 filters you mention in the code description?
The AbstractFilter class can be used as a base class: a filter extends it to implement the IFilter interface that every filter plugin must implement.
All questions are answered.
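The reach of a base-class change can be sketched with simplified stand-ins. The classes below are hypothetical and far smaller than the real AbstractFilter; in particular, the real hook takes a FilterContext and a File, while this sketch takes only a requested encoding name.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical, heavily simplified stand-in for a shared filter base class.
abstract class SketchFilter {
    /** Hook: subclasses override this instead of the whole processing method. */
    protected String getInputEncoding(String requestedEncoding) {
        // Assumed default: honor the request, otherwise fall back to UTF-8.
        return requestedEncoding != null ? requestedEncoding : StandardCharsets.UTF_8.name();
    }

    /** Shared processing step; every subclass goes through it and the hook. */
    final String describeProcessing(String requestedEncoding) {
        return "reading as " + getInputEncoding(requestedEncoding);
    }
}

// Inherits the base behavior, so a change to the base default reaches it.
final class DefaultSketchFilter extends SketchFilter {
}

// Enforces its own encoding by overriding only the hook.
final class AsciiOnlySketchFilter extends SketchFilter {
    @Override
    protected String getInputEncoding(String requestedEncoding) {
        return StandardCharsets.US_ASCII.name();
    }
}
```

A filter that overrides the hook keeps its own behavior, while a filter that inherits the default picks up whatever the base class decides.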
Improvement, refactoring, and style changes for the AbstractFilter class.
Some filters override processFile(File, File, fc) to enforce an encoding, but I prefer to implement the getInputEncoding and getOutputEncoding methods.
Pull request type
Refactoring
Which ticket is resolved?
What does this PR change?
StandardCharsets.UTF_8 rather than Charset.defaultCharset()
Other information