Skip to content

Conversation

@paulirwin
Copy link
Contributor

@paulirwin paulirwin commented Aug 27, 2025

  • You've read the Contributor Guide and Code of Conduct.
  • You've included unit or integration tests for your change, where applicable.
  • You've included inline docs for your change, where applicable.
  • There's an open issue for the PR that you are making. If you'd like to propose a change, please open an issue to discuss the change or find an existing issue.

Some fixes to BytesRef/UnicodeUtil UTF-8 ToString behavior.

Related #1024

Description

  • Change UnicodeUtil's UTF8toUTF16 to throw a DecoderFallbackException instead of FormatException.
  • Make BytesRef's DebuggerDisplay use Utf8ToStringWithFallback

paulirwin and others added 2 commits August 26, 2025 20:24
…ache#1024

Changed UTF8toUTF16 method to throw DecoderFallbackException instead of FormatException
when invalid UTF-8 is encountered. This aligns with .NET conventions where
DecoderFallbackException is the appropriate exception type for character decoding
issues (equivalent to Java's CharacterCodingException).

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Changed DebuggerDisplay attribute to use Utf8ToStringWithFallback() instead of
Utf8ToString() to prevent exceptions when debugging BytesRef instances that
contain invalid UTF-8 sequences.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@paulirwin paulirwin added the notes:bug-fix Contains a fix for a bug label Aug 27, 2025
@paulirwin paulirwin changed the title More BytesRef/UnicodeUtil UTF-8-to-string improvements, #1024 More BytesRef/UnicodeUtil UTF-8-to-string fixes, #1024 Aug 27, 2025
#endif
// LUCENENET specific: Not implementing ICloneable per Microsoft's recommendation
[DebuggerDisplay("{ToString()} {Utf8ToString()}")]
[DebuggerDisplay("{ToString()} {Utf8ToStringWithFallback()}")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I originally meant by "fallback" was the same logic we use for the formatter:

if (bytesRef.TryUtf8ToString(out var utf8String))
{
return utf8String;
}
else
{
return bytesRef.ToString();
}

That is, when there is invalid UTF-8 in the BytesRef, it would fall back to ToString(), which shows the raw bytes. This would give the developer a clear indication that the bytes were invalid. It also seems like displaying the bytes before the string is reversed. The string is the important bit, so it should go first.

IMO, whatever the solution, there should be a very clear indication that the bytes are invalid. Showing the fallback characters looks a bit strange, but it is hard to tell when developing whether they are actually coming from the code you are debugging, or the debugger itself.

So I propose:

  1. Show the string at the beginning and replace it with the text "Invalid UTF-8" if is invalid.
  2. Show the bytes afterward.
  3. Ideally, show each on separate lines (which requires writing a small custom visualizer).

A custom visualizer for this might look like:

using System;
using System.Diagnostics;
using System.Text;
using Microsoft.VisualStudio.DebuggerVisualizers;

public class MyClassVisualizer : DialogDebuggerVisualizer
{
    protected override void Show(IDialogVisualizerService windowService, IVisualizerObjectProvider objectProvider)
    {
        // Get the object being visualized
        BytesRef bytesRef = objectProvider.GetObject() as BytesRef;
        if (bytesRef is null)
        {
            return;
        }

        // Prepare the display content (multi-line)
        StringBuilder sb = new StringBuilder();
        sb.Append("Content = ");
        if (bytesRef.TryUtf8ToString(out var utf8String))
            sb.AppendLine(utf8String);
        else
            sb.AppendLine("Invalid UTF-8");
        sb.Append("Bytes = ");
        sb.AppendLine(bytesRef .ToString());

        // Show the content in a dialog window
        windowService.ShowDialog(sb.ToString());
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

notes:bug-fix Contains a fix for a bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants