json Utf8 behavior #964

yanick · 2013-09-22T17:32:27Z

Coming from #771

In a nutshell, we make Dancer's from_json behave like 'decode_json', but don't do the reverse for 'encode_json'. Hilarity ensues.

This fix document our helpful behavior, and make the from_json and to_json.

Now, I'm kinda scared of the comment that was there for 'encode_json', to the effect that the utf8 encoding would be done down the line. I still have to make due dilligences to make sure this fix doesn't inject any double-encoding bug.

Deals with the funny stuff seen in #771. The documentation should also make clear what we do turn on the utf8 flag on by default. We should also make sure this doesn't break other stuff at a distance -- I'm quite weary of the comment in 'serialize' that we don't utf8 the thing there because it's done "later on"...

ambs · 2013-09-22T17:37:22Z

FEAR!

dagolden · 2013-09-23T01:55:58Z

The broader concern I have is about what other utf8 handling is going on. I'm very disturbed to see utf8::is_utf8 in Dancer::Request, because that doesn't mean what people often think it means and in this case, it seems to be doing the wrong thing (possibly UTF-8 decoding non UTF-8 bytes between 0x80 and 0xff) which sure looks similar to what #771 could be about. You're getting a character string, possibly Latin-1, and handing it to JSON which is expecting UTF-8 and so "boom".

If you're not really careful decoding at the I/O gateways of the application, you can get really crazy strings and it's hard to know what to do with it. I offer as evidence this nasty dzil code for guessing whether or not to encode META.yml files.

From what I can see, Dancer::Request is reading the input handle raw into body and then everything else just guesses from there. Nothing appears to check the charset on content-type which is what probably should be done to decode the content body before doing anything else with it. Then you'd be working with characters and could UTF-8 decode JSON without the UTF-8 flag.

(I don't know what that AJAX is doing in the #771 example. It it setting an appropriate content-type? Does the content-type match the content?)

So... I'm not sure this PR is really getting at the underlying issue, but I'm not familiar enough with the Dancer1 guts to do or test the invasive surgery it probably needs.

yanick · 2013-09-24T00:30:04Z

On 13-09-22 09:55 PM, David Golden wrote:

The broader concern I have is about what other utf8 handling is going
on.I'm very disturbed to see |utf8::is_utf8| in Dancer::Request,
because that doesn't mean what people often think it means and in this
case, it seems to be doing the wrong thing (possibly UTF-8 decoding non
UTF-8 bytes between 0x80 and 0xff) which sure looks similar to what #771
#771 could be about. You're
getting a character string, possibly Latin-1, and handing it to JSON
which is expecting UTF-8 and so "boom".

(I don't what that AJAX is doing in the #771
#771 example. It it setting
an appropriate content-type? Does the content-type match the content?)

So... I'm not sure this PR is really getting at the underlying issue,
but I'm not familiar enough with the Dancer1 guts to do or test the
invasive surgery it probably needs.

Hmm... So you're saying that Dancer1 need a fix, but that ain't it.
Okay, fair enough. :-)

Lemme try to reformulate that into a proposal:

Dancer1's problem with utf8 seems to be that we make assumptions about
the encoding of our inputs and outputs and, well, those assumptions
aren't always right. From here, it seems there are two main paths we can
take.

The first is to continue the very hard job of converting every input
into utf8 before it get into Dancer core, and convert the outputs into
whatever the use wants. As we've seen, it's always harder than it sound,
and it sounds pretty hard.

The second, is to let the developer figure out what he wants. We make no
assumption whatsoever on anything that comes in, except perhaps if its
encoding is explicitly given in the request header, and we don't do any
shenanigan on the way out.

I must say, the second path seems to be the saner of the two. At it is,
I've just checked, and there are not that many utf8-diddling bits in the
Dancer guts. Soooo, I propose that I grab a set of bonesaws, try to
remove all utf8 shenanigans, and... see how badly the test cases are
going to blow. Once I have something to show, we'll reconvene and see if
the result looks like a fumbling in the right direction, or if I should
be shot before I do any more damage.

Sounds like a plan? :-)

Joy,
`/anick

dagolden · 2013-09-24T01:16:40Z

I think the goal should be to try to get input into characters, work with characters (including serialization/deserialization) and then put output into an encoding based on what the app developer says they want (presumably UTF-8).

I'm not an expert on this, but what jumps out at me are these points of input:

the request URI and query string -- presumably these are URI decoded into characters (does Plack do this? Does Dancer do this?)
body content (whether form data or a blob) -- this should be decoded based on content-type charset of the body or of MIME parts (RFC2616 says assume Latin1 if missing and content type is text and I'd leave it as raw octets for anything else). What is HTTP::Body giving you? octets or characters?
parameters set from within Dancer itself -- presumably these are already characters

Generally, Dancer shouldn't have to guess at encodings. The RFCs are pretty careful about making sure charset is well defined for different kinds of fields, etc. So I think it's less about being tricky and more about just following the spec.

yanick · 2013-09-26T01:56:12Z

This sounds very sane. I like.

I still have to dig in the guts of Dancer to figure out where exactly we juggle with the encodings. Once I think I have a grasp on how it goes, I'll amend this PR with my proposals. Stay tuned. :-)

karenetheridge · 2020-05-01T16:29:26Z

Please at the same time switch off of JSON.pm (JSON::MaybeXS provides a pp-or-XS interface which is very battle-tested), to avoid the extra insanity that JSON::XS would drag in.

yanick added 3 commits September 22, 2013 13:13

json encoding test

eeccedf

amend documentation for the utf8 behavior

cb0dbc7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

json Utf8 behavior #964

json Utf8 behavior #964

yanick commented Sep 22, 2013

ambs commented Sep 22, 2013

dagolden commented Sep 23, 2013

yanick commented Sep 24, 2013

dagolden commented Sep 24, 2013

yanick commented Sep 26, 2013

karenetheridge commented May 1, 2020

json Utf8 behavior #964

Are you sure you want to change the base?

json Utf8 behavior #964

Conversation

yanick commented Sep 22, 2013

ambs commented Sep 22, 2013

dagolden commented Sep 23, 2013

yanick commented Sep 24, 2013

dagolden commented Sep 24, 2013

yanick commented Sep 26, 2013

karenetheridge commented May 1, 2020