-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace the form decoder with URLCodec #15
base: master
Are you sure you want to change the base?
Conversation
- This addresses the following URLdecoding issue ring-clojure/ring#269
Hm, looks like the codec library encodes/decodes utf-16 differently than the Java one.
|
Yes, I noticed that as well when I tried to fix it, and it's not just UTF-16 that it encodes differently. Java treats every non-percent-encoded character as literal, while Commons Codec treats them all as a series of bytes. So for instance, take the previous Shift-JIS encoded example you used:
Notice that there's a "W", "o", "P", "R"and "C" in there. Commons Codec says "treat them as ASCII bytes" whereas Java says, "treat them as literal characters". Now, that makes me think Java is correct and Commons Codec is wrong, but you mentioned that browsers encode the same way as Commons Codec? Is there a site I can check the Shift-JIS encoding at? Otherwise I guess I'll build a small handler with the right encoding. |
Found one! With the example under discussion. (My perception tells me that browsers encode the Commons Codec way) |
And as far as the "correctness" of Java vs Codec goes... Reading the percent encoding section makes me think so too, |
https://github.com/nwjs/chromium.src/blob/df7f8c8582b9a78c806a7fa1e9d3f3ba51f7a698/third_party/WebKit/Source/platform/weborigin/KURL.cpp#L628 the chromium encode flow is
so Shift-JIS, UTF-16 encoding should not be an issue, since the string should be convert to utf-8 string before encode? I guess the decode flow should be just reverse?
|
But if that's the case we could just decode with UTF-8. The string "%83%82%83W%83o%83P%83R%83%8F%83C" represents a series of bytes encoded in Shift-JIS, not UTF-8. user=> (.decode (URLCodec.) "%83%82%83W%83o%83P%83R%83%8F%83C" "Shift-JIS")
"モジバケコワイ"
user=> (.decode (URLCodec.) "%83%82%83W%83o%83P%83R%83%8F%83C" "UTF-8")
"���W�o�P�R���C" |
I think adding support for a pluggable decoder is the best way to go for these reasons:
Here is what I have cooked up as an idea, Would greatly appreciate if I can get your feedback on this idea. |
I don't think the encoder should be pluggable. We might as well use two libraries instead. If Chrome and other browsers decode in the same way as Commons Codec does, we should adopt that form. |
@weavejester |
The |
@weavejester With the commons implementation? |
- test data has been updated accordingly
@weavejester So I just went ahead and replaced the implementation of url-encode and url-decode using URL codec. Please let me know if this is not what was asked. The tests are failing in travis, but run and pass just fine on my machine with jdk8. Would you know how to reconcile this assuming I implemented everything correctly? |
No, they need to be rewritten to handle encoding in the same way. We still need to differentiate between form encoding and URL encoding, so I don't think we can use Commons Codec. In practice it means we'll probably need to set up a manual When I have time I'll take a look at this and write an updated implementation. If you want to you can have a bash at it yourself, but it'll be more complex than just substituting in the Commons Codec. |
(is (= (url-encode "foo+bar") "foo+bar")) | ||
(is (= (url-encode "foo bar") "foo%20bar"))) | ||
(is (= (url-encode "foo/bar" "UTF-16") "%FE%FF%00f%00o%00o%00%2F%00b%00a%00r")) | ||
(is (= (url-encode "foo+bar") "foo%2Bbar")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrect. "foo+bar" should be encoded as "foo+bar", hence the test.
(is (= (url-encode "foo bar") "foo%20bar"))) | ||
(is (= (url-encode "foo/bar" "UTF-16") "%FE%FF%00f%00o%00o%00%2F%00b%00a%00r")) | ||
(is (= (url-encode "foo+bar") "foo%2Bbar")) | ||
(is (= (url-encode "foo bar") "foo+bar"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"foo bar" should be encoded as "foo%20bar".
Thanks for the feedback and it totally makes sense. |
This addresses the decoding part of the following issue
Params middleware does not decode some strings correctly ring#269, which would fix my problem but does not address the other parts that are mentioned (encoder/custom percent encoder)
Would appreciate feedback whether I should address the other concerns, or this change itself could be merged in.