[RFE#1726] refactor: filters: create(Reader|Writer), get(Input|Output)Encoding overrides #802

miurahr · 2023-11-15T02:42:13Z

Improvements, refactoring and style changes for the AbstractFilter class.

Some filters override processFile(File, File, fc) to enforce an encoding, but I prefer to implement the getInputEncoding and getOutputEncoding methods instead.

Pull request type

Other (describe below)
Refactoring

Which ticket is resolved?

#1726 AbstractFilter should handle BOM when create reader
https://sourceforge.net/p/omegat/feature-requests/1726/

What does this PR change?

- Bump [email protected]
  - Modify MagicComment and TMXReader2 classes
- Use the BOMInputStream class in place of the hand-crafted BOM detectors in the AbstractFilter class
  - Change the fallback default charset to StandardCharsets.UTF_8 rather than Charset.defaultCharset()
- Prefer overriding the get(Input|Output)Encoding methods instead of create(Reader|Writer)
  - MozillaDTDFilter
  - MozillaLangFilter
  - ResourceBundleFilter
  - MoodlePHPFilter
  - PoFilter
- Apply spotless
- Fix missing javadoc for the AbstractFilter class
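
The preference for overriding an encoding hook rather than the reader factory can be illustrated with a small sketch. This is not the real AbstractFilter API: `SketchFilter`, `FixedEncodingFilter` and `EncodingHookDemo` are hypothetical names, and the hook signature is simplified to take the requested encoding directly. The idea is that the base class owns reader creation and the UTF-8 fallback, while a filter that must enforce an encoding overrides only the hook.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical miniature of the pattern; not the real AbstractFilter class.
abstract class SketchFilter {
    /** Hook: return the encoding to read with, or null for "no preference". */
    protected String getInputEncoding(String requestedEncoding, Path inFile) throws IOException {
        return requestedEncoding;
    }

    /** Shared reader creation; subclasses no longer need to override this. */
    public final BufferedReader createReader(Path inFile, String requestedEncoding) throws IOException {
        String name = getInputEncoding(requestedEncoding, inFile);
        // Assumed fallback: UTF-8 instead of the platform default charset.
        Charset charset = (name != null) ? Charset.forName(name) : StandardCharsets.UTF_8;
        return Files.newBufferedReader(inFile, charset);
    }
}

// A filter whose format mandates one encoding overrides only the hook.
final class FixedEncodingFilter extends SketchFilter {
    @Override
    protected String getInputEncoding(String requestedEncoding, Path inFile) {
        return StandardCharsets.ISO_8859_1.name();
    }
}

final class EncodingHookDemo {
    /** Writes the bytes to a temp file and reads the first line back through the filter. */
    static String readFirstLine(SketchFilter filter, byte[] content, String requested) {
        try {
            Path tmp = Files.createTempFile("hook-demo", ".txt");
            try {
                Files.write(tmp, content);
                try (BufferedReader reader = filter.createReader(tmp, requested)) {
                    return reader.readLine();
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "cafe" with an e-acute, encoded as ISO-8859-1.
        byte[] latin1 = {0x63, 0x61, 0x66, (byte) 0xE9};
        System.out.println(readFirstLine(new FixedEncodingFilter(), latin1, null));
    }
}
```

With this shape, a change to the shared reader logic (for example BOM handling) applies to every subclass without each one re-implementing createReader.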

Other information

…verrides - Bump [email protected] - Use BOMInputStream class where hand-crafted detectors - Prefer override get(Input|Output)Encoding method instead of create(Reader|Writer) Signed-off-by: Hiroshi Miura <[email protected]>

Signed-off-by: Hiroshi Miura <[email protected]>

miurahr

Here are the points reviewers should know.

miurahr · 2023-11-15T15:27:35Z

src/org/omegat/filters2/AbstractFilter.java

-        Charset charset;
-        if (inEncoding != null) {
-            charset = Charset.forName(inEncoding);
+        BOMInputStream bomInputStream = BOMInputStream.builder().setFile(inFile)
+                .setByteOrderMarks(ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE).get();
+        bomLastParsedFile = bomInputStream.getBOM();
+        String charset;
+        if (bomLastParsedFile != null) {
+            charset = bomLastParsedFile.getCharsetName();


This is the place to look for the core change in the AbstractFilter class. I use BOMInputStream.builder() to detect a BOM for UTF-8, UTF-16BE and UTF-16LE.
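
For readers without Commons IO at hand, the detection step can be approximated with the JDK alone. The sketch below is a hand-rolled stand-in, not BOMInputStream itself: `BomSniffer` and its methods are hypothetical names. It checks for the same three byte order marks, picks the matching charset, and otherwise uses a caller-supplied fallback.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical JDK-only stand-in for the BOM detection step; not the Commons IO API.
final class BomSniffer {
    /** Returns the charset implied by a leading BOM, or null if none of the three is present. */
    static Charset detect(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(data, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    /** Decodes the data, skipping the BOM when one was found, else using the fallback charset. */
    static String decode(byte[] data, Charset fallback) {
        Charset detected = detect(data);
        int skip = (detected == null) ? 0 : (detected.equals(StandardCharsets.UTF_8) ? 3 : 2);
        Charset charset = (detected != null) ? detected : fallback;
        return new String(data, skip, data.length - skip, charset);
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        System.out.println(detect(utf16le) + " -> " + decode(utf16le, StandardCharsets.UTF_8));
    }
}
```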

miurahr · 2023-11-15T15:28:47Z

src/org/omegat/filters2/AbstractFilter.java

+        if (bomLastParsedFile != null) {
+            charset = Charset.forName(bomLastParsedFile.getCharsetName());
+        } else if (outEncoding != null) {


I've changed the writer to give priority to the charset detected from the BOM, when one has already been detected.
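
The priority order can be written out as a small resolver. This is a sketch under assumptions: `OutputCharsetResolver` is a hypothetical name, and the final UTF-8 fallback is inferred from the PR description, since the diff excerpt above is cut off before its last branch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the priority order; not code from the PR.
final class OutputCharsetResolver {
    /**
     * Picks the writer charset:
     * 1. the charset detected from the source file's BOM, if any;
     * 2. otherwise the explicitly requested output encoding, if any;
     * 3. otherwise UTF-8 (assumed fallback).
     */
    static Charset resolve(Charset detectedFromBom, String outEncoding) {
        if (detectedFromBom != null) {
            return detectedFromBom;
        }
        if (outEncoding != null) {
            return Charset.forName(outEncoding);
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(resolve(StandardCharsets.UTF_16LE, "ISO-8859-1"));
        System.out.println(resolve(null, "ISO-8859-1"));
        System.out.println(resolve(null, null));
    }
}
```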

miurahr · 2023-11-15T15:30:43Z

src/org/omegat/filters2/html2/HTMLFilter2.java

+    protected String getInputEncoding(FilterContext filterContext, File infile) throws IOException {
+        String encoding = filterContext.getInEncoding();
+        if (encoding == null && isSourceEncodingVariable()) {
+            try (HTMLReader hreader = new HTMLReader(infile.getAbsolutePath(), StandardCharsets.UTF_8.name())) {


The fallback default charset is UTF-8 when there is no hint.

miurahr · 2023-11-15T15:37:15Z

src/org/omegat/filters2/html2/HTMLReader.java

@@ -52,9 +52,9 @@
 * @author Maxym Mykhalchuk
 * @author Didier Briel
 */
-public class HTMLReader extends Reader {
+public class HTMLReader extends Reader implements AutoCloseable {


Improve HTMLReader to be AutoCloseable, so it can be used in a try-with-resources statement.
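
A minimal sketch of what try-with-resources buys for a custom reader follows. `TrackingReader` is a hypothetical stand-in for a Reader subclass such as HTMLReader. Note that java.io.Reader already implements Closeable, which extends AutoCloseable, so any Reader subclass should be usable this way.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a custom Reader subclass; it records whether close() ran.
final class TrackingReader extends Reader {
    private final Reader delegate;
    private boolean closed;

    TrackingReader(String content) {
        this.delegate = new StringReader(content);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        return delegate.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        closed = true;
        delegate.close();
    }

    boolean isClosed() {
        return closed;
    }
}

final class TryWithResourcesDemo {
    /** Reads one character inside try-with-resources and reports whether the reader was closed. */
    static boolean closedAfterUse(String content) {
        TrackingReader reader = new TrackingReader(content);
        try (TrackingReader r = reader) {
            r.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return reader.isClosed();
    }

    public static void main(String[] args) {
        System.out.println(closedAfterUse("<html></html>"));
    }
}
```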

miurahr · 2023-11-15T15:42:18Z

src/org/omegat/filters2/text/bundles/ResourceBundleFilter.java

-    /**
-     * Creates a reader of an input file.
-     * <p>
-     * Override because of keep buggy behavior in OmegaT 5.7.1 or before. It set
-     * default encoding US-ASCII but Java standard InputStreamReader class
-     * wrongly accept non-ASCII characters as-is.
-     * </p>
-     *
-     * @param inFile
-     *            The source file.
-     * @param inEncoding
-     *            Encoding of the input file, if the filter supports it.
-     *            Otherwise null.
-     * @return The reader for the source file
-     * @throws IOException
-     *             If any I/O Error occurs upon reader creation
-     */
-    @Override
-    public BufferedReader createReader(File inFile, String inEncoding) throws IOException {
-        Charset charset;
-        if (inEncoding != null) {
-            charset = Charset.forName(inEncoding);
-        } else {
-            charset = Charset.defaultCharset();
-        }


Here we finally change a buggy behavior. In previous versions, the ResourceBundleFilter could read files in the platform default charset, such as Windows-1251, which is not compliant with the Java standard.

src/org/omegat/filters2/po/PoFilter.java

miurahr · 2023-11-15T15:47:10Z

src/org/omegat/util/MagicComment.java

-                new BOMInputStream(new FileInputStream(file)), StandardCharsets.US_ASCII))) {
+                BOMInputStream.builder().setFile(file).get(), StandardCharsets.US_ASCII))) {


This change comes from a change in COMMONS-IO.

miurahr

We should not use the platform default charset, such as Windows-1251 or Shift-JIS.
We can use StandardCharsets.UTF_8 instead.
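
A short illustration of why a platform-dependent default is risky: the same bytes decode to different text depending on the charset. The sketch uses ISO-8859-1 as a stand-in for a legacy platform default, since it is guaranteed to be available on every JVM. `CharsetMismatchDemo` is a hypothetical name.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; ISO-8859-1 stands in for a legacy platform default charset.
final class CharsetMismatchDemo {
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "cafe" with an e-acute, written as UTF-8: the accent becomes two bytes (C3 A9).
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the file was written in round-trips correctly.
        System.out.println(decode(utf8Bytes, StandardCharsets.UTF_8).length());      // 4 characters

        // Decoding with a different single-byte charset turns the accent into two wrong characters.
        System.out.println(decode(utf8Bytes, StandardCharsets.ISO_8859_1).length()); // 5 characters
    }
}
```

Fixing the fallback to StandardCharsets.UTF_8 makes the result independent of the machine the filter runs on.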

src/org/omegat/filters2/AbstractFilter.java

Signed-off-by: Hiroshi Miura <[email protected]>

miurahr · 2023-11-16T01:33:31Z

The change here will affect all filter authors and existing filters.

brandelune · 2023-11-16T02:44:21Z

> The change here will affect all filter authors and existing filters.

Not only the behavior of the 5 filters you mention in the code description?

miurahr · 2023-11-16T07:29:50Z

> > The change here will affect all filter authors and existing filters.
>
> Not only the behavior of the 5 filters you mention in the code description?

The AbstractFilter class can be used as a base class (via extends) to implement the IFilter interface, which every filter plugin must implement, so the change reaches every filter that extends it, not only the five listed.

miurahr · 2023-11-19T02:23:20Z

All questions are answered.

miurahr added 2 commits November 15, 2023 11:41

refactor: ResourceBundleFilter: reduce duplicated code

f154f52

Signed-off-by: Hiroshi Miura <[email protected]>

miurahr added the refactoring label Nov 15, 2023

miurahr requested a review from damien-rembert November 15, 2023 02:42

miurahr added 4 commits November 15, 2023 12:24

fix: set charsets for InputStreamReader constructor

65a8010

Signed-off-by: Hiroshi Miura <[email protected]>

style: Update supressions and some fix

24e38c2

Signed-off-by: Hiroshi Miura <[email protected]>

docs: update Javadoc in AbstractFilter

e407c46

Signed-off-by: Hiroshi Miura <[email protected]>

fix: avoid NPE

a60ea75

Signed-off-by: Hiroshi Miura <[email protected]>

miurahr marked this pull request as ready for review November 15, 2023 12:28

miurahr changed the title ~~refactor: filters: create(Reader|Writer), get(Input|Output)Encoding overrides~~ [RFE#1726] refactor: filters: create(Reader|Writer), get(Input|Output)Encoding overrides Nov 15, 2023

miurahr added enhancement documentation labels Nov 15, 2023

fix: html2filter: fallback default encoding to be UTF-8

6893dd4

Signed-off-by: Hiroshi Miura <[email protected]>

fix: AbstractFilter: fallback default encoding to be UTF-8

3e9e2f7

Signed-off-by: Hiroshi Miura <[email protected]>

miurahr requested review from brandelune and t-cordonnier November 16, 2023 01:32

Merge branch 'master' into topic/miurahr/filters/base-abstract-reader…

9fd3fd7

…-aware-bom

miurahr merged commit 7fc2c08 into master Nov 21, 2023
8 checks passed

miurahr deleted the topic/miurahr/filters/base-abstract-reader-aware-bom branch November 21, 2023 09:45
