MBS-13453: Avoid double-decoding some errors #3323

reosarevok · 2024-07-10T10:01:34Z

Fix MBS-13453

Problem

We are getting wide error ISEs that hide the real errors sometimes when there are PSQL problems.

Solution

While the idea in 20640de was correct, the change was too wide in application; it was also decoding errors that did not need it, causing a wide character ISE because of the double decoding.

mwiencek looked into the root causes and said: "We do always get a UTF-8 string. But sometimes it's just a raw byte string (with no UTF8 flag set), and other times it's already been decoded (UTF8 flag on)."

So this checks whether the flag is on, and only decodes the string if it is not.

Testing

Manually.

This specific issue can be triggered for testing by changing the host in DBDefs to something dumb and utf8ish. I used '♥localhost'.

The double-decoding could be triggered (before this patch) by doing a similar change on a .sql data file for a test and then running it - I changed work.sql to have an invalid artist ID (to trigger an error) and '♥' instead of 'ABBA' as an AC name. Then I just triggered the error with prove -lv t/tests.t :: --tests Edit::Work::Create

yvanzo

Thank you! Changes makes perfect sense.

Would it be feasible to make the test described in commit message a CI test?

lib/MusicBrainz/Server/Connector.pm

mwiencek · 2025-03-26T03:22:00Z

The actual cause of the double encodings we saw with MBS-11207 seems to be Devel::StackTrace::as_string - so this just decodes those error strings rather than all of them.

Do you remember how you came to that conclusion? Because I'm not sure that that's the issue...I tried playing with Devel::StackTrace::as_string in the REPL and it doesn't seem to perform any encoding on its output. A look at its source code didn't show that either.

The "Wide character" error that happens when you run the tests is being thrown by the decode_utf8($msg) call we had in HandleError. I tried inspecting the string there during the different error conditions, and became even more confused (the utf8 flag seems to be inverted from what you'd expected when decode_utf8 throws or doesn't).

I don't particularly like adding evals to work around spooky behavior, but maybe we should keep it inside HandleError instead, unless you can remember how Devel::StackTrace is breaking it.

            eval { $msg = decode_utf8($msg); };
            $exception->throw( sqlstate => $state, message => $msg );

reosarevok · 2025-03-26T08:03:18Z

I honestly have zero memory of this by now. I think I just looked at what stopped breaking when changing things in different places, and this was the one I found, but it might be deeper than that. I do remember being very confused about why the stuff was different in some cases than others, at least...

Happy to try with your option, anyway, if you tested that it does the same.

reosarevok · 2025-03-27T10:00:35Z

I tested your suggestion now and it seems to work - I still don't understand what the origin of the issue is and would like to figure it out, but maybe this is good enough for now. We should have a comment explaining the issue though and why we have the eval there - wanna write that since you properly looked into it now and I only did it a year ago?

mwiencek · 2025-03-27T16:07:55Z

I tested your suggestion now and it seems to work - I still don't understand what the origin of the issue is and would like to figure it out, but maybe this is good enough for now. We should have a comment explaining the issue though and why we have the eval there - wanna write that since you properly looked into it now and I only did it a year ago?

Well, I'm also not sure why we have an eval there. :) It appears that sometimes the error message we get back from DBD::Pg is already UTF-8 encoded, and other times it's a character string, but I don't know the reason.

I'm actually wondering if we should remove the decode entirely and intentionally regress on MBS-11207.

From what I understand, the mojibake in MBS-11207 has primarily only been observed for connection errors. Initially it was seen by yvanzo when the postgres locale was set to fr, and another time when we intentionally added a wide character to the connection string (creating a bogus host name). While this is annoying, the double-encoding won't cause any cascading errors.
The eval around the decode doesn't make the decoding correct, and it can introduce weird artifacts of its own. For example, if the decoded character string doesn't contain any wide characters it can still succeed but output garbage if you interpret it as UTF-8 again. decode('utf-8', "\xC3\xA5") (å) correctly gives the code point "\x{e5}", but if the output handle expects an encoded UTF-8 string, it'll just display �. (But I haven't been able to reproduce such display, which hints again that I'm not sure where the issue is...)

reosarevok · 2025-03-27T16:54:24Z

I can live with that as well if you prefer - ideally we'd still figure out where this is coming from, but...

mwiencek · 2025-03-27T19:18:57Z

It appears that sometimes the error message we get back from DBD::Pg is already UTF-8 encoded, and other times it's a character string, but I don't know the reason.

This isn't exactly right. After inspecting the $msg in HandleError with Devel::Peek under various circumstances, we do always get a UTF-8 string. But sometimes it's just a raw byte string (with no UTF8 flag set), and other times it's already been decoded (UTF8 flag on). So I think the code below should be fine. No eval needed, IIUC, since it's impossible for $msg to contain wide characters (as Perl understands them) if the UTF8 flag is off. (This is confusing, but utf8::is_utf8 has nothing to do with whether the string contains UTF8-encoded bytes or not.)

diff --git a/lib/MusicBrainz/Server/Connector.pm b/lib/MusicBrainz/Server/Connector.pm
index b0f5b66222..871037b727 100644
--- a/lib/MusicBrainz/Server/Connector.pm
+++ b/lib/MusicBrainz/Server/Connector.pm
@@ -3,6 +3,7 @@ use Moose;
 use MusicBrainz::Server::Exceptions;
 use DBIx::Connector;
 use Sql;
+use Encode qw( decode_utf8 );

 has 'conn' => (
     isa        => 'DBIx::Connector',
@@ -55,6 +56,10 @@ sub _build_conn
             my $exception = 'MusicBrainz::Server::Exceptions::DatabaseError';
             $exception .= '::StatementTimedOut'
                 if $state eq '57014';
+            # Sometimes we receive a byte string that doesn't have the UTF8
+            # flag set; other times it's already been decoded (MBS-11207).
+            $msg = decode_utf8($msg)
+                unless utf8::is_utf8($msg);
             $exception->throw( sqlstate => $state, message => $msg );
         },
         RaiseError        => 0,

While the idea in 20640de was correct, the change was too wide in application; it was also decoding errors that did not need it, causing a wide character ISE because of the double decoding. The double-decoding could be triggered (before this patch) by doing a utf8 change on a sql data file for a test and then running it - I changed work.sql to have an invalid work ID (to trigger an error) and ♥ instead of Test as a work name. Then I just triggered the error with prove -lv t/tests.t :: --tests Edit::Work::Create mwiencek looked into the root causes and said: "We do always get a UTF-8 string. But sometimes it's just a raw byte string (with no UTF8 flag set), and other times it's already been decoded (UTF8 flag on)." So this checks whether the flag is on, and only decodes the string if it is not. Co-Authored-By: Michael Wiencek <[email protected]>

reosarevok · 2025-03-28T07:27:41Z

Ok, that does seem more elegant. Ideally we'd still find out why only some come pre-decoded, but it's a lot less messy than before. Changed (and changed the commit message accordingly).

mwiencek · 2025-03-28T15:59:38Z

Thanks, I replaced parts of the PR description with the updated commit message to avoid confusion if we have to revisit this.

reosarevok added the Bug Bugs that should be checked/fixed soonish label Jul 10, 2024

yvanzo reviewed Jul 16, 2024

View reviewed changes

lib/MusicBrainz/Server/Connector.pm Outdated Show resolved Hide resolved

reosarevok force-pushed the MBS-13453 branch 2 times, most recently from a2fc983 to 4fb2bfe Compare March 27, 2025 09:59

reosarevok force-pushed the MBS-13453 branch from 4fb2bfe to 6a78576 Compare March 28, 2025 07:26

mwiencek merged commit 629b522 into metabrainz:master Mar 28, 2025
1 of 2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MBS-13453: Avoid double-decoding some errors #3323

MBS-13453: Avoid double-decoding some errors #3323

reosarevok commented Jul 10, 2024 •

edited by mwiencek

Loading

yvanzo left a comment

mwiencek commented Mar 26, 2025

reosarevok commented Mar 26, 2025

reosarevok commented Mar 27, 2025

mwiencek commented Mar 27, 2025 •

edited

Loading

reosarevok commented Mar 27, 2025

mwiencek commented Mar 27, 2025

reosarevok commented Mar 28, 2025

mwiencek commented Mar 28, 2025

MBS-13453: Avoid double-decoding some errors #3323

MBS-13453: Avoid double-decoding some errors #3323

Conversation

reosarevok commented Jul 10, 2024 • edited by mwiencek Loading

Fix MBS-13453

Problem

Solution

Testing

yvanzo left a comment

Choose a reason for hiding this comment

mwiencek commented Mar 26, 2025

reosarevok commented Mar 26, 2025

reosarevok commented Mar 27, 2025

mwiencek commented Mar 27, 2025 • edited Loading

reosarevok commented Mar 27, 2025

mwiencek commented Mar 27, 2025

reosarevok commented Mar 28, 2025

mwiencek commented Mar 28, 2025

reosarevok commented Jul 10, 2024 •

edited by mwiencek

Loading

mwiencek commented Mar 27, 2025 •

edited

Loading