-
-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support to optionally allow surrogate pair entities (#165) #174
Add support to optionally allow surrogate pair entities (#165) #174
Conversation
First of all: thank you for contributing this feature. Second: could you add a description that explains why this is needed, what is the use case (ideally with a simple example) and so on? |
Description added. |
try { | ||
streamThrough(sr); | ||
fail("Expected an exception for invalid surrogate pair"); | ||
} catch (Exception e) { } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be better to catch specific Exception
subtype and check exception message that it fails for expected reason (just match a phrase, no need to match exact message).
First of all: apologies for ultra-slow follow up here. But I am ready to get this merged. One minor legal thing first: if we haven't yet gotten a CLA, we'd need one from here (it's under Jackson project, that's fine): https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf and it's easiest to print, fill & sign, scan/photo, email to This is only needed once for all contributions to Woodstox, and once I get it, I can proceed with merge. Thank you again for this contribution! EDIT: I gave wrong email address -- CLA actually needs to be sent to |
* If pair value is not in range of low surrogate values, then throw an error | ||
*/ | ||
if (pairValue < 0xDC00 || pairValue > 0xDFFF) { | ||
reportNoSurrogatePair(value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be better if we reported both leading surrogate and second non-surrogate entity value?
} else { | ||
mInputPtr = ptr; // so error points to correct char | ||
throwUnexpectedChar(c, "; expected a hex digit (0-9a-fA-F)."); | ||
boolean isValueHighSurrogate = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so, I assume this works, but is quite hard to follow. What I'd suggest is to split code in such a way that the original decoding for the first entity is within initial method; and the rest of processing in separate method, only called if:
- Entity surrogate pair decoding is enabled
- First entity is valid first surrogate
This probably leads to slight duplication of code (decoding loop) but allows removing the outer loop since each method only decodes one entity.
And if duplicate looks too ugly decoding loops can be extracted into separate method too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try to refactor this part in my free time (maybe on the weekend), but my first assumptions was to write functional code with minimal changes in the class architecture, to not conflict your rules and coding style. I haven't seen before any contributor guide, so I wanted to be careful and do this most safety way. Please give me a time, because I don't have much free time after hours
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Magmaruss understood! I could try my hand at change, but was worried I might get some parts wrong easily.
And my suggestion was not meant as critique; I much appreciate the contribution and it follows all reasonable expectations (esp. in absence of much documentation on what is expected of contributions).
@Magmaruss quick note: I gave a wrong email address to send CLA to -- CLA actually needs to be sent to |
NOTE: CLA received; can now focus on finalizing PR. |
@Magmaruss I hope you don't mind my applying changes I suggested here, wrt naming and adding Beyond this, I think it'd be good to refactor decoding loop, possibly the way I suggested (but not necessarily, there may well be other ways to improve readability). Also thank you for getting CLA; I am hoping we can get this PR merged soon! |
02607fe
to
1dcb8be
Compare
* This unit test checks that resolving of surrogate pairs works | ||
* as expected, including different ways of writing entities | ||
*/ | ||
public void testValidSurrogatePairEntities() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't there be a test or two for invalid ones too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test for invalid ones is already there - testInvalidSurrogatePairEntities
method: https://github.com/FasterXML/woodstox/pull/174/files/eb5189d107d911709774c79e1ed445865567d7ff#diff-50c2abc037511dda94147cf808bde7666e630ad76d08151e92ff86fb56b6f17aR466
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I meant the case when new option was disabled, as per your other comment.
@Magmaruss Ok, I did some more tweaks and I think things look good. About the only thing I think that is needed would be a test or two for invalid cases; one for catching surrogate pairs without enabling their use; and another one for just invalid pairings with handling enabled. Thank you again for contributing this! |
I can write simple test for disabled option, but for now when disabled it's throwing "Illegal character entity..." - default behavior before introducing the feature. Should I replace this exception with something special when first valid high surrogate detected or leave the default behavior and add test only? |
Either way works for me: obviously improved error messages are always a plus, but if you want to limit the scope of changes & extra work, verifying default behavior is fine also. |
Hi @Magmaruss! I was wondering if you might be able to add basic unit test? No rush if you don't have time -- I just hope to be able to merge this in, get 6.6.0 released, and then be able to do #134 (increase min JDK baseline to 8 for Woodstox). |
Yes, sorry for that, but i am having a very difficult time. I will try to deliver it tonight or tomorrow at the latest |
@Magmaruss Sorry did not mean to add pressure at all. I appreciate your help here and please take your time -- I don't want this to be another stress point. |
Test case for disabled option added. Please tell if you need something more. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok good, let's merge this!
This feature is an answer for the issue #165. It's extending the XML reader functionality with a new configuration option (default disabled) to support reading XML with unicode characters written using surrogate pair (UTF-16 encoding). This feature is very important to correctly read XML from the legacy services which produces responses this way and there is no way to change the behavior of them.
Example:
The external service produces a response with a unicode character written using UCS2:
Default behavior of XMLStreamReader is throwing an exception with message:
After setting the
P_RESOLVE_ENTITY_SURROGATE_PAIRS
parameter totrue
it will be readed with no exception converting the surrogate pair entities to the appropriate unicode (🎅)