Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MBS-13453: Avoid double-decoding some errors #3323

Merged
merged 1 commit into from
Mar 28, 2025

Conversation

reosarevok
Copy link
Member

@reosarevok reosarevok commented Jul 10, 2024

Fix MBS-13453

Problem

We are getting wide error ISEs that hide the real errors sometimes when there are PSQL problems.

Solution

While the idea in 20640de was correct, the change was too wide in application; it was also decoding errors that did not need it, causing a wide character ISE because of the double decoding.

mwiencek looked into the root causes and said: "We do always get a UTF-8 string. But sometimes it's just a raw byte string (with no UTF8 flag set), and other times it's already been decoded (UTF8 flag on)."

So this checks whether the flag is on, and only decodes the string if it is not.

Testing

Manually.

This specific issue can be triggered for testing by changing the host in DBDefs to something dumb and utf8ish. I used '♥localhost'.

The double-decoding could be triggered (before this patch) by doing a similar change on a .sql data file for a test and then running it - I changed work.sql to have an invalid artist ID (to trigger an error) and '♥' instead of 'ABBA' as an AC name. Then I just triggered the error with prove -lv t/tests.t :: --tests Edit::Work::Create

@reosarevok reosarevok added the Bug Bugs that should be checked/fixed soonish label Jul 10, 2024
Copy link
Contributor

@yvanzo yvanzo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Changes makes perfect sense.

Would it be feasible to make the test described in commit message a CI test?

@mwiencek
Copy link
Member

The actual cause of the double encodings we saw with MBS-11207 seems to be Devel::StackTrace::as_string - so this just decodes those error strings rather than all of them.

Do you remember how you came to that conclusion? Because I'm not sure that that's the issue...I tried playing with Devel::StackTrace::as_string in the REPL and it doesn't seem to perform any encoding on its output. A look at its source code didn't show that either.

The "Wide character" error that happens when you run the tests is being thrown by the decode_utf8($msg) call we had in HandleError. I tried inspecting the string there during the different error conditions, and became even more confused (the utf8 flag seems to be inverted from what you'd expected when decode_utf8 throws or doesn't).

I don't particularly like adding evals to work around spooky behavior, but maybe we should keep it inside HandleError instead, unless you can remember how Devel::StackTrace is breaking it.

            eval { $msg = decode_utf8($msg); };
            $exception->throw( sqlstate => $state, message => $msg );

@reosarevok
Copy link
Member Author

I honestly have zero memory of this by now. I think I just looked at what stopped breaking when changing things in different places, and this was the one I found, but it might be deeper than that. I do remember being very confused about why the stuff was different in some cases than others, at least...

Happy to try with your option, anyway, if you tested that it does the same.

@reosarevok reosarevok force-pushed the MBS-13453 branch 2 times, most recently from a2fc983 to 4fb2bfe Compare March 27, 2025 09:59
@reosarevok
Copy link
Member Author

I tested your suggestion now and it seems to work - I still don't understand what the origin of the issue is and would like to figure it out, but maybe this is good enough for now. We should have a comment explaining the issue though and why we have the eval there - wanna write that since you properly looked into it now and I only did it a year ago?

@mwiencek
Copy link
Member

mwiencek commented Mar 27, 2025

I tested your suggestion now and it seems to work - I still don't understand what the origin of the issue is and would like to figure it out, but maybe this is good enough for now. We should have a comment explaining the issue though and why we have the eval there - wanna write that since you properly looked into it now and I only did it a year ago?

Well, I'm also not sure why we have an eval there. :) It appears that sometimes the error message we get back from DBD::Pg is already UTF-8 encoded, and other times it's a character string, but I don't know the reason.

I'm actually wondering if we should remove the decode entirely and intentionally regress on MBS-11207.

  • From what I understand, the mojibake in MBS-11207 has primarily only been observed for connection errors. Initially it was seen by yvanzo when the postgres locale was set to fr, and another time when we intentionally added a wide character to the connection string (creating a bogus host name). While this is annoying, the double-encoding won't cause any cascading errors.
  • The eval around the decode doesn't make the decoding correct, and it can introduce weird artifacts of its own. For example, if the decoded character string doesn't contain any wide characters it can still succeed but output garbage if you interpret it as UTF-8 again. decode('utf-8', "\xC3\xA5") (å) correctly gives the code point "\x{e5}", but if the output handle expects an encoded UTF-8 string, it'll just display �. (But I haven't been able to reproduce such display, which hints again that I'm not sure where the issue is...)

@reosarevok
Copy link
Member Author

I can live with that as well if you prefer - ideally we'd still figure out where this is coming from, but...

@mwiencek
Copy link
Member

It appears that sometimes the error message we get back from DBD::Pg is already UTF-8 encoded, and other times it's a character string, but I don't know the reason.

This isn't exactly right. After inspecting the $msg in HandleError with Devel::Peek under various circumstances, we do always get a UTF-8 string. But sometimes it's just a raw byte string (with no UTF8 flag set), and other times it's already been decoded (UTF8 flag on). So I think the code below should be fine. No eval needed, IIUC, since it's impossible for $msg to contain wide characters (as Perl understands them) if the UTF8 flag is off. (This is confusing, but utf8::is_utf8 has nothing to do with whether the string contains UTF8-encoded bytes or not.)

diff --git a/lib/MusicBrainz/Server/Connector.pm b/lib/MusicBrainz/Server/Connector.pm
index b0f5b66222..871037b727 100644
--- a/lib/MusicBrainz/Server/Connector.pm
+++ b/lib/MusicBrainz/Server/Connector.pm
@@ -3,6 +3,7 @@ use Moose;
 use MusicBrainz::Server::Exceptions;
 use DBIx::Connector;
 use Sql;
+use Encode qw( decode_utf8 );

 has 'conn' => (
     isa        => 'DBIx::Connector',
@@ -55,6 +56,10 @@ sub _build_conn
             my $exception = 'MusicBrainz::Server::Exceptions::DatabaseError';
             $exception .= '::StatementTimedOut'
                 if $state eq '57014';
+            # Sometimes we receive a byte string that doesn't have the UTF8
+            # flag set; other times it's already been decoded (MBS-11207).
+            $msg = decode_utf8($msg)
+                unless utf8::is_utf8($msg);
             $exception->throw( sqlstate => $state, message => $msg );
         },
         RaiseError        => 0,

While the idea in 20640de
was correct, the change was too wide in application;
it was also decoding errors that did not need it, causing
a wide character ISE because of the double decoding.

The double-decoding could be triggered (before this patch)
by doing a utf8 change on a sql data file for a test and then
running it - I changed work.sql to have an invalid work ID
(to trigger an error) and ♥ instead of Test as a work name.
Then I just triggered the error with
prove -lv t/tests.t :: --tests Edit::Work::Create

mwiencek looked into the root causes and said:
"We do always get a UTF-8 string. But sometimes it's just a raw
byte string (with no UTF8 flag set), and other times it's already
been decoded (UTF8 flag on)."

So this checks whether the flag is on, and only decodes the string
if it is not.

Co-Authored-By: Michael Wiencek <[email protected]>
@reosarevok
Copy link
Member Author

Ok, that does seem more elegant. Ideally we'd still find out why only some come pre-decoded, but it's a lot less messy than before. Changed (and changed the commit message accordingly).

@mwiencek
Copy link
Member

Thanks, I replaced parts of the PR description with the updated commit message to avoid confusion if we have to revisit this.

@mwiencek mwiencek merged commit 629b522 into metabrainz:master Mar 28, 2025
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Bugs that should be checked/fixed soonish
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants