Highlighting in documents doesn't handle CDATA sections correctly #521
Comments
Attempted a fix in 981b8a5, but it causes a StackOverflowError during regex evaluation, even though the small-scale test succeeds. Possibly the document is too large to handle this way. See https://stackoverflow.com/a/7510006
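For reference, a commonly reported cause of this error in `java.util.regex` is a repeated group containing an alternation, such as `(?:.|\n)*?`, which is matched recursively and so uses stack depth proportional to the length of the match. I don't know whether the regex in 981b8a5 has this shape, so the following is only a sketch of the usual workaround, not a description of that commit. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the project's code base.
final class CdataRegexSketch {
    // DOTALL lets '.' match line breaks, so the body of the section can be
    // written as a plain lazy ".*?" instead of a repeated alternation
    // such as (?:.|\n)*?.
    static final Pattern CDATA =
            Pattern.compile("<!\\[CDATA\\[.*?\\]\\]>", Pattern.DOTALL);

    /** Counts the complete CDATA sections in the given XML text. */
    static int countCdataSections(String xml) {
        Matcher m = CDATA.matcher(xml);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }
}
```

Whether this is enough depends on the rest of the highlighting regexes; any other repeated group with alternation in them could still overflow on a large document.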
It might be time to consider rewriting the highlighting of document fragments using something like jsoup (https://jsoup.org/). If that doesn't seem practical for whatever reason, another alternative is to loop through the document character by character, only using regexes whenever we find a
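A minimal sketch of the character-by-character idea, assuming the goal is to collect only the tags that lie outside CDATA sections. The class and method names are hypothetical, and this is not the current highlighting code. It deliberately ignores comments, processing instructions, `>` inside attribute values, and fragments that start in the middle of a CDATA section.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the project's code base.
final class CdataAwareTagScanner {
    private static final String CDATA_START = "<![CDATA[";
    private static final String CDATA_END = "]]>";

    /** Returns the tags found outside CDATA sections, in document order. */
    static List<String> findTags(String xml) {
        List<String> tags = new ArrayList<>();
        int i = 0;
        while (i < xml.length()) {
            if (xml.startsWith(CDATA_START, i)) {
                // Skip the whole section: its content is character data,
                // so anything that looks like a tag in there is not markup.
                int end = xml.indexOf(CDATA_END, i + CDATA_START.length());
                i = (end < 0) ? xml.length() : end + CDATA_END.length();
            } else if (xml.charAt(i) == '<') {
                int end = xml.indexOf('>', i);
                if (end < 0) {
                    break; // fragment was cut off inside a tag
                }
                tags.add(xml.substring(i, end + 1));
                i = end + 1;
            } else {
                i++;
            }
        }
        return tags;
    }
}
```

For the input `<a><![CDATA[<b> text </c>]]></a>` this returns only `<a>` and `</a>`, so the balancing step would never see `<b>` or `</c>`. A regex could then be applied to each returned tag individually, which keeps every regex match short.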
@KCMertens mentioned that Saxon's parsing can be customized as well, including how it deals with unbalanced tags; maybe this could be a good solution.
Here is how I've done it in the past using command-line arguments: https://github.com/INL/vws-conversie/blob/master/saxon/run-xslt-tagsoup.sh#L28 TagSoup specifically is written to be lenient.
I think TagSoup would transform e.g.
to
instead of
The latter is what we currently do and (arguably) what we need for our purposes, although we should probably add an ellipsis inside the new start tag, e.g.
"Tags" inside a CDATA section are treated as actual (unbalanced) XML start tags, and matching closing tags are added at the end of the document.
Example:
https://portal.clarin.ivdnt.org/blacklab-server-new/opensonar/docs/WR-P-E-C-0000000129/contents?query=%5Bword%3D%22schip%22%5D&wordstart=7000