Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

json Utf8 behavior #964

Open
wants to merge 3 commits into
base: devel
Choose a base branch
from
Open

json Utf8 behavior #964

wants to merge 3 commits into from

Conversation

yanick
Copy link
Contributor

@yanick yanick commented Sep 22, 2013

Coming from #771

In a nutshell, we make Dancer's from_json behave like 'decode_json', but don't do the reverse for 'encode_json'. Hilarity ensues.

This fix document our helpful behavior, and make the from_json and to_json.

Now, I'm kinda scared of the comment that was there for 'encode_json', to the effect that the utf8 encoding would be done down the line. I still have to make due dilligences to make sure this fix doesn't inject any double-encoding bug.

Deals with the funny stuff seen in #771.

The documentation should also make clear what we do turn on the
utf8 flag on by default.

We should also make sure this doesn't
break other stuff at a distance -- I'm quite weary of the
comment in 'serialize' that we don't utf8 the thing there
because it's done "later on"...
@ambs
Copy link
Member

ambs commented Sep 22, 2013

FEAR!

@dagolden
Copy link
Contributor

The broader concern I have is about what other utf8 handling is going on. I'm very disturbed to see utf8::is_utf8 in Dancer::Request, because that doesn't mean what people often think it means and in this case, it seems to be doing the wrong thing (possibly UTF-8 decoding non UTF-8 bytes between 0x80 and 0xff) which sure looks similar to what #771 could be about. You're getting a character string, possibly Latin-1, and handing it to JSON which is expecting UTF-8 and so "boom".

If you're not really careful decoding at the I/O gateways of the application, you can get really crazy strings and it's hard to know what to do with it. I offer as evidence this nasty dzil code for guessing whether or not to encode META.yml files.

From what I can see, Dancer::Request is reading the input handle raw into body and then everything else just guesses from there. Nothing appears to check the charset on content-type which is what probably should be done to decode the content body before doing anything else with it. Then you'd be working with characters and could UTF-8 decode JSON without the UTF-8 flag.

(I don't know what that AJAX is doing in the #771 example. It it setting an appropriate content-type? Does the content-type match the content?)

So... I'm not sure this PR is really getting at the underlying issue, but I'm not familiar enough with the Dancer1 guts to do or test the invasive surgery it probably needs.

@yanick
Copy link
Contributor Author

yanick commented Sep 24, 2013

On 13-09-22 09:55 PM, David Golden wrote:

The broader concern I have is about what other utf8 handling is going
on.I'm very disturbed to see |utf8::is_utf8| in Dancer::Request,
because that doesn't mean what people often think it means and in this
case, it seems to be doing the wrong thing (possibly UTF-8 decoding non
UTF-8 bytes between 0x80 and 0xff) which sure looks similar to what #771
#771 could be about. You're
getting a character string, possibly Latin-1, and handing it to JSON
which is expecting UTF-8 and so "boom".

(I don't what that AJAX is doing in the #771
#771 example. It it setting
an appropriate content-type? Does the content-type match the content?)

So... I'm not sure this PR is really getting at the underlying issue,
but I'm not familiar enough with the Dancer1 guts to do or test the
invasive surgery it probably needs.

Hmm... So you're saying that Dancer1 need a fix, but that ain't it.
Okay, fair enough. :-)

Lemme try to reformulate that into a proposal:

Dancer1's problem with utf8 seems to be that we make assumptions about
the encoding of our inputs and outputs and, well, those assumptions
aren't always right. From here, it seems there are two main paths we can
take.

The first is to continue the very hard job of converting every input
into utf8 before it get into Dancer core, and convert the outputs into
whatever the use wants. As we've seen, it's always harder than it sound,
and it sounds pretty hard.

The second, is to let the developer figure out what he wants. We make no
assumption whatsoever on anything that comes in, except perhaps if its
encoding is explicitly given in the request header, and we don't do any
shenanigan on the way out.

I must say, the second path seems to be the saner of the two. At it is,
I've just checked, and there are not that many utf8-diddling bits in the
Dancer guts. Soooo, I propose that I grab a set of bonesaws, try to
remove all utf8 shenanigans, and... see how badly the test cases are
going to blow. Once I have something to show, we'll reconvene and see if
the result looks like a fumbling in the right direction, or if I should
be shot before I do any more damage.

Sounds like a plan? :-)

Joy,
`/anick

@dagolden
Copy link
Contributor

I think the goal should be to try to get input into characters, work with characters (including serialization/deserialization) and then put output into an encoding based on what the app developer says they want (presumably UTF-8).

I'm not an expert on this, but what jumps out at me are these points of input:

  • the request URI and query string -- presumably these are URI decoded into characters (does Plack do this? Does Dancer do this?)
  • body content (whether form data or a blob) -- this should be decoded based on content-type charset of the body or of MIME parts (RFC2616 says assume Latin1 if missing and content type is text and I'd leave it as raw octets for anything else). What is HTTP::Body giving you? octets or characters?
  • parameters set from within Dancer itself -- presumably these are already characters

Generally, Dancer shouldn't have to guess at encodings. The RFCs are pretty careful about making sure charset is well defined for different kinds of fields, etc. So I think it's less about being tricky and more about just following the spec.

@yanick
Copy link
Contributor Author

yanick commented Sep 26, 2013

This sounds very sane. I like.

I still have to dig in the guts of Dancer to figure out where exactly we juggle with the encodings. Once I think I have a grasp on how it goes, I'll amend this PR with my proposals. Stay tuned. :-)

@karenetheridge
Copy link

Please at the same time switch off of JSON.pm (JSON::MaybeXS provides a pp-or-XS interface which is very battle-tested), to avoid the extra insanity that JSON::XS would drag in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants