Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support to optionally allow surrogate pair entities (#165) #174

Merged
merged 13 commits into from
Jan 16, 2024

Conversation

Magmaruss
Copy link
Contributor

@Magmaruss Magmaruss commented Jun 18, 2023

This feature is an answer for the issue #165. It's extending the XML reader functionality with a new configuration option (default disabled) to support reading XML with unicode characters written using surrogate pair (UTF-16 encoding). This feature is very important to correctly read XML from the legacy services which produces responses this way and there is no way to change the behavior of them.

Example:
The external service produces a response with a unicode character written using UCS2:

<response>
    <value>Merry Christmas &#55356;&#57221;</value>
</response>

Default behavior of XMLStreamReader is throwing an exception with message:

Illegal character entity: expansion character (code 0xd83c

After setting the P_RESOLVE_ENTITY_SURROGATE_PAIRS parameter to true it will be readed with no exception converting the surrogate pair entities to the appropriate unicode (🎅)

@cowtowncoder
Copy link
Member

First of all: thank you for contributing this feature.

Second: could you add a description that explains why this is needed, what is the use case (ideally with a simple example) and so on?

@Magmaruss
Copy link
Contributor Author

Description added.

try {
streamThrough(sr);
fail("Expected an exception for invalid surrogate pair");
} catch (Exception e) { }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be better to catch specific Exception subtype and check exception message that it fails for expected reason (just match a phrase, no need to match exact message).

@cowtowncoder
Copy link
Member

cowtowncoder commented Nov 4, 2023

First of all: apologies for ultra-slow follow up here. But I am ready to get this merged.

One minor legal thing first: if we haven't yet gotten a CLA, we'd need one from here (it's under Jackson project, that's fine):

https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf

and it's easiest to print, fill & sign, scan/photo, email to info at fasterxml dot com.
(for project you can add Woodstox or FasterXML/Woodstox).

This is only needed once for all contributions to Woodstox, and once I get it, I can proceed with merge.
I added some notes already; just need to re-read main logic and should be good to merge.

Thank you again for this contribution!

EDIT: I gave wrong email address -- CLA actually needs to be sent to info at fasterxml dot com instead; cla email alias did not yet exist unlike I thought.

* If pair value is not in range of low surrogate values, then throw an error
*/
if (pairValue < 0xDC00 || pairValue > 0xDFFF) {
reportNoSurrogatePair(value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be better if we reported both leading surrogate and second non-surrogate entity value?

} else {
mInputPtr = ptr; // so error points to correct char
throwUnexpectedChar(c, "; expected a hex digit (0-9a-fA-F).");
boolean isValueHighSurrogate = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so, I assume this works, but is quite hard to follow. What I'd suggest is to split code in such a way that the original decoding for the first entity is within initial method; and the rest of processing in separate method, only called if:

  1. Entity surrogate pair decoding is enabled
  2. First entity is valid first surrogate

This probably leads to slight duplication of code (decoding loop) but allows removing the outer loop since each method only decodes one entity.
And if duplicate looks too ugly decoding loops can be extracted into separate method too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to refactor this part in my free time (maybe on the weekend), but my first assumptions was to write functional code with minimal changes in the class architecture, to not conflict your rules and coding style. I haven't seen before any contributor guide, so I wanted to be careful and do this most safety way. Please give me a time, because I don't have much free time after hours

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Magmaruss understood! I could try my hand at change, but was worried I might get some parts wrong easily.
And my suggestion was not meant as critique; I much appreciate the contribution and it follows all reasonable expectations (esp. in absence of much documentation on what is expected of contributions).

@cowtowncoder
Copy link
Member

@Magmaruss quick note: I gave a wrong email address to send CLA to -- CLA actually needs to be sent to info at fasterxml dot com instead; cla email alias did not yet exist unlike I thought.
So just in case you already sent one, could you please re-send it. Apologies for mix-up.

@cowtowncoder
Copy link
Member

NOTE: CLA received; can now focus on finalizing PR.

@cowtowncoder
Copy link
Member

@Magmaruss I hope you don't mind my applying changes I suggested here, wrt naming and adding @since tags. I am open to different naming too if you strongly prefer yet different naming.

Beyond this, I think it'd be good to refactor decoding loop, possibly the way I suggested (but not necessarily, there may well be other ways to improve readability).

Also thank you for getting CLA; I am hoping we can get this PR merged soon!

@cowtowncoder cowtowncoder changed the title Support for entity surrogate pairs (#165) Add support to optionally allow surrogate pair entities (#165) Nov 13, 2023
* This unit test checks that resolving of surrogate pairs works
* as expected, including different ways of writing entities
*/
public void testValidSurrogatePairEntities()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't there be a test or two for invalid ones too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I meant the case when new option was disabled, as per your other comment.

@cowtowncoder
Copy link
Member

@Magmaruss Ok, I did some more tweaks and I think things look good.

About the only thing I think that is needed would be a test or two for invalid cases; one for catching surrogate pairs without enabling their use; and another one for just invalid pairings with handling enabled.
I hope to get this merged soon, and then soon after release 6.6.0.

Thank you again for contributing this!

@Magmaruss
Copy link
Contributor Author

I can write simple test for disabled option, but for now when disabled it's throwing "Illegal character entity..." - default behavior before introducing the feature. Should I replace this exception with something special when first valid high surrogate detected or leave the default behavior and add test only?

@cowtowncoder
Copy link
Member

I can write simple test for disabled option, but for now when disabled it's throwing "Illegal character entity..." - default behavior before introducing the feature. Should I replace this exception with something special when first valid high surrogate detected or leave the default behavior and add test only?

Either way works for me: obviously improved error messages are always a plus, but if you want to limit the scope of changes & extra work, verifying default behavior is fine also.

@cowtowncoder
Copy link
Member

Hi @Magmaruss! I was wondering if you might be able to add basic unit test? No rush if you don't have time -- I just hope to be able to merge this in, get 6.6.0 released, and then be able to do #134 (increase min JDK baseline to 8 for Woodstox).

@Magmaruss
Copy link
Contributor Author

Yes, sorry for that, but i am having a very difficult time. I will try to deliver it tonight or tomorrow at the latest

@cowtowncoder
Copy link
Member

@Magmaruss Sorry did not mean to add pressure at all. I appreciate your help here and please take your time -- I don't want this to be another stress point.

@Magmaruss
Copy link
Contributor Author

Test case for disabled option added. Please tell if you need something more.

Copy link
Member

@cowtowncoder cowtowncoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok good, let's merge this!

@cowtowncoder cowtowncoder added this to the 6.6.0 milestone Jan 16, 2024
@cowtowncoder cowtowncoder merged commit 172371f into FasterXML:master Jan 16, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants