Highlighting in documents doesn't handle CDATA sections correctly #521

jan-niestadt · 2024-06-25T12:04:59Z

"Tags" inside a CDATA are seen as actual (unbalanced) XML open tags, and closing tags are added at the end of the document.

Example:

https://portal.clarin.ivdnt.org/blacklab-server-new/opensonar/docs/WR-P-E-C-0000000129/contents?query=%5Bword%3D%22schip%22%5D&wordstart=7000

jan-niestadt · 2024-06-25T12:34:22Z

Attempted fix in 981b8a5, but this causes a StackOverflowError during regex evaluation (even though a small-scale test succeeds). The document is possibly too large to handle this way. See https://stackoverflow.com/a/7510006

jan-niestadt · 2024-06-25T12:38:31Z

It might be time to consider rewriting highlighting of document fragments using something like https://jsoup.org/

If that doesn't seem practical for whatever reason, another alternative is to loop through the document character by character, only using regexes whenever we find a < (and possibly not using them at all for comments or CDATA, which can get large, unlike tags).
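To make the character-by-character alternative concrete, here is a minimal sketch. It is not BlackLab's code; the class and method names (`TagScanner`, `findTags`, `skipPast`) are hypothetical. It walks the text, stops only at `<`, and skips CDATA sections and comments with plain `indexOf` calls so that no regex ever runs over a large span. It assumes no `>` appears inside attribute values and does not treat processing instructions specially.

```java
import java.util.ArrayList;
import java.util.List;

public class TagScanner {

    /** Returns the raw text of every tag found outside CDATA sections and comments. */
    public static List<String> findTags(String xml) {
        List<String> tags = new ArrayList<>();
        int i = 0;
        int n = xml.length();
        while (i < n) {
            if (xml.charAt(i) != '<') {
                i++; // ordinary character, nothing to do
                continue;
            }
            if (xml.startsWith("<![CDATA[", i)) {
                // CDATA can be large: skip it without any regex
                i = skipPast(xml, "]]>", i + "<![CDATA[".length());
            } else if (xml.startsWith("<!--", i)) {
                // Comments can be large too: same approach
                i = skipPast(xml, "-->", i + "<!--".length());
            } else {
                // A real tag; tags are short, so a regex could safely be
                // applied to just this substring if more detail is needed.
                int end = xml.indexOf('>', i);
                if (end < 0) {
                    break; // truncated tag at the end of the fragment
                }
                tags.add(xml.substring(i, end + 1));
                i = end + 1;
            }
        }
        return tags;
    }

    /** Returns the index just past the terminator, or the end of the string if it is missing. */
    private static int skipPast(String s, String terminator, int from) {
        int end = s.indexOf(terminator, from);
        return end < 0 ? s.length() : end + terminator.length();
    }
}
```

With this approach, a `<b>` that appears inside a CDATA section or a comment is never reported as a tag, so it cannot be mistaken for an unbalanced open tag.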

jan-niestadt · 2024-07-09T12:54:18Z

@KCMertens mentioned that Saxon's parsing can be customized as well, including how to deal with unbalanced tags; maybe this could be a good solution

KCMertens · 2024-07-09T13:11:29Z

Here is how I've done it in the past using commandline arguments: https://github.com/INL/vws-conversie/blob/master/saxon/run-xslt-tagsoup.sh#L28
That's different of course, but docs for doing it programmatically are here: https://saxonica.com/html/documentation9.6/sourcedocs/controlling-parsing.html

TagSoup specifically is written to be lenient.

jan-niestadt · 2024-07-19T08:54:46Z

I think TagSoup would transform e.g.

a snippet.</s> <s>It starts halfway through a sentence!</s>

to

a snippet.<s></s> <s>It starts halfway through a sentence!</s>

instead of

<s>a snippet.</s> <s>It starts halfway through a sentence!</s>

The latter is what we currently do and (arguably) what we need for our purposes. (Although we should probably add an ellipsis inside the new start tag, e.g. <s>…, to show that some words are probably missing there.)
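As an illustration of the behaviour described above (prepending a start tag for each close tag that has no matching open tag in the fragment), here is a rough sketch. It is not the actual BlackLab implementation; `FragmentBalancer` and `balance` are hypothetical names. It ignores CDATA sections and comments, does not add the suggested ellipsis, and only applies a regex to individual short tags.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FragmentBalancer {

    // Matches one tag: group 1 = "/" for a close tag, group 2 = tag name,
    // group 3 = "/" for a self-closing tag.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][\\w.-]*)[^>]*?(/?)>");

    /** Wraps the fragment so that every tag in it has a partner. */
    public static String balance(String fragment) {
        Deque<String> open = new ArrayDeque<>();     // open tags still awaiting a close
        StringBuilder prefix = new StringBuilder();  // start tags to prepend
        Matcher m = TAG.matcher(fragment);
        while (m.find()) {
            String name = m.group(2);
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            if (selfClosing) {
                continue; // already balanced on its own
            }
            if (!closing) {
                open.push(name);
            } else if (!open.isEmpty() && open.peek().equals(name)) {
                open.pop();
            } else {
                // Close tag whose open tag lies before the fragment:
                // later unmatched closes belong to outer elements, so prepend.
                prefix.insert(0, "<" + name + ">");
            }
        }
        StringBuilder suffix = new StringBuilder();
        while (!open.isEmpty()) {
            suffix.append("</").append(open.pop()).append(">");
        }
        return prefix + fragment + suffix;
    }
}
```

Applied to the snippet above, `balance("a snippet.</s> <s>It starts halfway through a sentence!</s>")` places the new `<s>` at the very start of the fragment rather than directly before the stray `</s>`.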

jan-niestadt added the bug label Jun 25, 2024
