Java's StandardCharsets.UTF_8 does not write a Byte-Order Mark (BOM), while .NET's System.Text.Encoding.UTF8 does include a BOM by default. We have ensured that the IOUtils.CHARSET_UTF_8 does not include a BOM to match Java, and as part of #1018 we've added an internal Support class to allow for using StandardCharsets.UTF_8, but we need to review all usage of System.Text.Encoding.UTF8 to determine if it should be replaced with StandardCharsets.UTF_8 or IOUtils.CHARSET_UTF_8 (whatever best matches the corresponding Java Lucene code) to avoid BOM issues.
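For reference, the Java side of this comparison can be checked with a minimal sketch like the one below. The `BomDemo` class and its helper names are hypothetical and exist only for illustration; they are not part of Lucene or Lucene.NET. It writes text through an `OutputStreamWriter` using `StandardCharsets.UTF_8`, checks the output for the bytes `EF BB BF`, and also shows that Java's UTF-8 decoder keeps a leading BOM as the character U+FEFF instead of stripping it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class BomDemo {
    // U+FEFF encoded as UTF-8, commonly called the UTF-8 BOM.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Encodes text the way a Java writer configured with StandardCharsets.UTF_8 would.
    static byte[] encodeWithWriter(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Returns true if the byte array starts with EF BB BF.
    static boolean hasUtf8Bom(byte[] bytes) {
        return bytes.length >= 3
                && bytes[0] == UTF8_BOM[0]
                && bytes[1] == UTF8_BOM[1]
                && bytes[2] == UTF8_BOM[2];
    }

    // Simulates data produced by a writer that does emit a BOM.
    static byte[] prependBom(byte[] bytes) {
        byte[] result = new byte[UTF8_BOM.length + bytes.length];
        System.arraycopy(UTF8_BOM, 0, result, 0, UTF8_BOM.length);
        System.arraycopy(bytes, 0, result, UTF8_BOM.length, bytes.length);
        return result;
    }

    // Decodes with Java's UTF-8 charset, which does not strip a leading BOM.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] plain = encodeWithWriter("abc");
        System.out.println("BOM written by Java writer: " + hasUtf8Bom(plain));

        String decoded = decode(prependBom(plain));
        System.out.println("Decoded length with BOM-prefixed input: " + decoded.length());
        System.out.println("First char is U+FEFF: " + (decoded.charAt(0) == '\uFEFF'));
    }
}
```

The point for this review is that any byte-for-byte comparison with data written by Java Lucene assumes the three BOM bytes are absent, so a .NET writer that emits the encoding's preamble would produce different output for the same text.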
I have reviewed usages of Encoding (most commonly Encoding.UTF8), and determined that most usages do not need to be changed. The following cases do not result in a BOM being generated:
Looks like you missed OfflineSorter. The tests specifically failed when it was configured to use a BOM, although I didn't analyze it at a high level to find out why that was the case. No objections if you wish to investigate this, but it definitely makes a difference as far as the tests are concerned.
It has gone through several rounds of refactoring since then, but currently it has a DEFAULT_ENCODING field that we added to ensure the tests pass. So, we have a couple of options:
1. Remove the DEFAULT_ENCODING field and replace it with IOUtils.CHARSET_UTF_8. Update the OfflineSorter documentation for ByteSequencesReader and ByteSequencesWriter to indicate that constructor overloads that accept BinaryReader and BinaryWriter should use IOUtils.CHARSET_UTF_8.
2. Initialize the DEFAULT_ENCODING field with the same instance as IOUtils.CHARSET_UTF_8.
Given that we added this field specifically because OfflineSorter requires there to be no BOM (which differs from the .NET default), this could go either way. We only recently changed IOUtils.CHARSET_UTF_8 to remove the BOM, so using it wasn't an option when the DEFAULT_ENCODING field was added. If it had been, it would have been reused here and the field wouldn't have been added.
Side note: perhaps we should also rename IOUtils.CHARSET_UTF_8 because it is public and "CharSet" is Java nomenclature. ENCODING_UTF8_NO_BOM would be a better name.