Review all Encoding usage for BOM compatibility #1027

paulirwin · 2024-11-17T00:25:53Z

Is there an existing issue for this?

I have searched the existing issues

Task description

Java's StandardCharsets.UTF_8 does not write a Byte-Order Mark (BOM), while .NET's System.Text.Encoding.UTF8 includes a BOM by default. We have ensured that IOUtils.CHARSET_UTF_8 does not include a BOM, to match Java, and as part of #1018 we added an internal Support class that allows using StandardCharsets.UTF_8. We still need to review all usage of System.Text.Encoding.UTF8 to determine whether it should be replaced with StandardCharsets.UTF_8 or IOUtils.CHARSET_UTF_8 (whichever best matches the corresponding Java Lucene code) to avoid BOM issues.

paulirwin · 2024-12-24T23:06:33Z

I have reviewed usages of Encoding (most commonly Encoding.UTF8), and determined that most usages do not need to be changed. The following cases do not result in a BOM being generated:

Encoding.UTF8.GetBytes(string)
J2N.Text.StringExtensions.GetBytes(this string, Encoding)
Use of Encoding.Default or Encoding.GetEncoding(0), which use a BOM-less UTF-8 encoding on modern .NET, or the current system code page on .NET Framework
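
As a rough illustration of the write-side behavior described above, here is a minimal sketch. It does not use Lucene.NET code: new UTF8Encoding(false) stands in for a BOM-less encoding such as IOUtils.CHARSET_UTF_8, and the BomWriteDemo class and WriteWith helper are hypothetical names made up for this example.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomWriteDemo
{
    // Hypothetical helper: writes text through a StreamWriter using the
    // given encoding and returns the raw bytes that reached the stream.
    public static byte[] WriteWith(Encoding encoding, string text)
    {
        var stream = new MemoryStream();
        using (var writer = new StreamWriter(stream, encoding))
        {
            writer.Write(text);
        }
        // MemoryStream.ToArray() still works after the stream is closed.
        return stream.ToArray();
    }

    public static void Main()
    {
        // GetBytes(string) encodes only the characters; no preamble is added.
        byte[] direct = Encoding.UTF8.GetBytes("a");

        // A StreamWriter over Encoding.UTF8 emits the 3-byte preamble first.
        byte[] withBom = WriteWith(Encoding.UTF8, "a");

        // Assumption: a BOM-less encoding instance, standing in for
        // IOUtils.CHARSET_UTF_8, makes the same writer emit no preamble.
        byte[] withoutBom = WriteWith(new UTF8Encoding(false), "a");

        Console.WriteLine(BitConverter.ToString(direct));
        Console.WriteLine(BitConverter.ToString(withBom));
        Console.WriteLine(BitConverter.ToString(withoutBom));
    }
}
```

The point of the sketch is that the BOM comes from a writer emitting the encoding's preamble, not from the encoding of the characters themselves, which would explain why GetBytes call sites do not need to change.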

The following cases ignore a BOM if present, and do not fail if there is not a BOM, and thus do not need to be changed to a BOM-less Encoding:

Any TextReader use (such as StreamReader)
IOUtils.GetDecodingReader(...)
Encoding.UTF8.GetString(byte[])
FileStream with FileAccess.Read
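
For the reader case, a small sketch along these lines shows a StreamReader producing the same text whether or not the input starts with a BOM. The BomReadDemo class and ReadAll helper are hypothetical names for this example only, and only the StreamReader case is covered here.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomReadDemo
{
    // Hypothetical helper: decodes the bytes with a StreamReader that was
    // constructed with Encoding.UTF8, as many existing call sites are.
    public static string ReadAll(byte[] bytes)
    {
        using (var reader = new StreamReader(new MemoryStream(bytes), Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x61 }; // BOM followed by "a"
        byte[] withoutBom = { 0x61 };                // just "a"

        // The reader skips the preamble when present and does not
        // require it when absent, so both inputs decode to "a".
        Console.WriteLine(ReadAll(withBom) == "a");
        Console.WriteLine(ReadAll(withoutBom) == "a");
    }
}
```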

So you'll see in the PR that the number of changes needed to address BOM issues is small; that's because most usages fall into the buckets above.

NightOwl888 · 2024-12-26T14:07:29Z

Looks like you missed OfflineSorter. The tests specifically failed when it was configured to use a BOM, although I didn't analyze it at a high level to find out why that was the case. No objections if you wish to investigate this, but it definitely makes a difference as far as the tests are concerned.

It has gone through several rounds of refactoring since then, but currently it has a DEFAULT_ENCODING field that we added to ensure the tests pass. So, we have a couple of options:

Remove the DEFAULT_ENCODING field and replace it with IOUtils.CHARSET_UTF_8. Update the OfflineSorter documentation for ByteSequencesReader and ByteSequencesWriter to indicate that constructor overloads that accept BinaryReader and BinaryWriter should use IOUtils.CHARSET_UTF_8.
Initialize the DEFAULT_ENCODING field with the same instance as IOUtils.CHARSET_UTF_8.

Given that we added this field specifically because OfflineSorter requires there to be no BOM (which differs from the .NET default), this could go either way. Because IOUtils.CHARSET_UTF_8 was only recently changed to remove the BOM, using it wasn't an option when the DEFAULT_ENCODING field was added. If it had been, it would have been reused in this case and the field wouldn't have been added.

Side note: perhaps we should also rename IOUtils.CHARSET_UTF_8 because it is public and "CharSet" is Java nomenclature. ENCODING_UTF8_NO_BOM would be a better name.

paulirwin added the is:task A chore to be done label Nov 17, 2024

paulirwin added this to the 4.8.0-beta00018 milestone Nov 17, 2024

paulirwin mentioned this issue Nov 17, 2024

Test review A-D, #259 #1018

Merged

4 tasks

paulirwin added the pri:normal label Nov 21, 2024

paulirwin self-assigned this Dec 24, 2024

paulirwin added a commit to paulirwin/lucene.net that referenced this issue Dec 24, 2024

SWEEP: Use BOM-less UTF-8 encoding for writes, apache#1027

2b3bf9e

paulirwin linked a pull request Dec 24, 2024 that will close this issue

SWEEP: Use BOM-less UTF-8 encoding for writes, #1027 #1075

Open

4 tasks
