Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFE#1726] refactor: filters: create(Reader|Writer), get(Input|Output)Encoding overrides #802

Merged
merged 9 commits into from
Nov 21, 2023

Conversation

miurahr
Copy link
Member

@miurahr miurahr commented Nov 15, 2023

Improvement, refactor and style change for AbstractFilter class.

Some filter override processFile(File, File, fc) to enforce encoding, but I prefer to implement getInputEncoding and getOutputEncoding methods.

Pull request type

  • Other (describe below)
    Refactoring

Which ticket is resolved?

What does this PR change?

  • Bump [email protected]
    • Modify MagicComment and TMXReader2 classes
  • Use BOMInputStream class where hand-crafted detectors in AbstractFilter class
    • Change fallback default charsets to StandardCharsets.UTF_8 rather than Charset.defaultCharset()
  • Prefer override get(Input|Output)Encoding method instead of create(Reader|Writer)
    • MozillaDTDFilter
    • MozillaLangFilter
    • ResourceBundleFilter
    • MoodlePHPFilter
    • PoFilter
  • Apply spotless
  • Fix missing javadoc for AbstractFilter class

Other information

…verrides

- Bump [email protected]
- Use BOMInputStream class where hand-crafted detectors
- Prefer override get(Input|Output)Encoding method instead of create(Reader|Writer)

Signed-off-by: Hiroshi Miura <[email protected]>
@miurahr miurahr marked this pull request as ready for review November 15, 2023 12:28
@miurahr miurahr changed the title refactor: filters: create(Reader|Writer), get(Input|Output)Encoding overrides [RFE#1726] refactor: filters: create(Reader|Writer), get(Input|Output)Encoding overrides Nov 15, 2023
Copy link
Member Author

@miurahr miurahr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a point reviewer should know.

Comment on lines -338 to +334
Charset charset;
if (inEncoding != null) {
charset = Charset.forName(inEncoding);
BOMInputStream bomInputStream = BOMInputStream.builder().setFile(inFile)
.setByteOrderMarks(ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE).get();
bomLastParsedFile = bomInputStream.getBOM();
String charset;
if (bomLastParsedFile != null) {
charset = bomLastParsedFile.getCharsetName();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a place to look where core change of AbstractFiilter class. I use BOMInputStream.builder() to detect BOM for UTF-8, UTF-16BE and UTF-16LE.

Comment on lines +361 to +363
if (bomLastParsedFile != null) {
charset = Charset.forName(bomLastParsedFile.getCharsetName());
} else if (outEncoding != null) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed to put priority for detected charset when already detected.

protected String getInputEncoding(FilterContext filterContext, File infile) throws IOException {
String encoding = filterContext.getInEncoding();
if (encoding == null && isSourceEncodingVariable()) {
try (HTMLReader hreader = new HTMLReader(infile.getAbsolutePath(), StandardCharsets.UTF_8.name())) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fallback default charset is UTF-8 when there is no hint.

@@ -52,9 +52,9 @@
* @author Maxym Mykhalchuk
* @author Didier Briel
*/
public class HTMLReader extends Reader {
public class HTMLReader extends Reader implements AutoCloseable {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improve HTMLReader to be autocloseable

Comment on lines -149 to -173
/**
* Creates a reader of an input file.
* <p>
* Override because of keep buggy behavior in OmegaT 5.7.1 or before. It set
* default encoding US-ASCII but Java standard InputStreamReader class
* wrongly accept non-ASCII characters as-is.
* </p>
*
* @param inFile
* The source file.
* @param inEncoding
* Encoding of the input file, if the filter supports it.
* Otherwise null.
* @return The reader for the source file
* @throws IOException
* If any I/O Error occurs upon reader creation
*/
@Override
public BufferedReader createReader(File inFile, String inEncoding) throws IOException {
Charset charset;
if (inEncoding != null) {
charset = Charset.forName(inEncoding);
} else {
charset = Charset.defaultCharset();
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we finally change a buggy behavior. In previous versions, the ResourceBundleFilter can read platform default charset, such as Windows-1251, that is not compliant with Java standard.

new BOMInputStream(new FileInputStream(file)), StandardCharsets.US_ASCII))) {
BOMInputStream.builder().setFile(file).get(), StandardCharsets.US_ASCII))) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change come from a change in COMMONS-IO

Copy link
Member Author

@miurahr miurahr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not use Default charset such as Windows-1251, Shift-JIS, etc.
We can use StandardCharsets.UTF_8

src/org/omegat/filters2/AbstractFilter.java Outdated Show resolved Hide resolved
src/org/omegat/filters2/AbstractFilter.java Outdated Show resolved Hide resolved
@miurahr
Copy link
Member Author

miurahr commented Nov 16, 2023

The change here will affected all the filter authors and existing filters.

@brandelune
Copy link
Contributor

The change here will affected all the filter authors and existing filters.

Not only the behavior for the 5 filters you mention in the code description?

@miurahr
Copy link
Member Author

miurahr commented Nov 16, 2023

The change here will affected all the filter authors and existing filters.

Not only the behavior for the 5 filters you mention in the code description?

AbstractFilter class is able to be used as a base, extends, to implement IFilter interface that any filter plugin should implement;

@miurahr
Copy link
Member Author

miurahr commented Nov 19, 2023

All questions are answered.

@miurahr miurahr merged commit 7fc2c08 into master Nov 21, 2023
8 checks passed
@miurahr miurahr deleted the topic/miurahr/filters/base-abstract-reader-aware-bom branch November 21, 2023 09:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants