HTML Ampersand Character Codes #57

Irishx · 2018-10-25T10:53:11Z

I notice that raw texts are very often taken from the web and contain strings such as &amp; or &quot;.

These strings are not recognized by ucto and are split into multiple tokens: & amp ;
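
A minimal Python sketch of this splitting follows. It does not call ucto: `naive_tokenize` is a hypothetical stand-in that treats runs of word characters as one token and every other non-space character as its own token, which is enough to reproduce the three-way split.

```python
import re

def naive_tokenize(text):
    """Hypothetical stand-in for a tokenizer (NOT ucto's actual rules):
    a run of word characters is one token, any other non-space
    character is a token of its own."""
    return re.findall(r"\w+|[^\w\s]", text)

# The entity falls apart into '&', 'amp' and ';'.
print(naive_tokenize("Tom &amp; Jerry"))
# -> ['Tom', '&', 'amp', ';', 'Jerry']
```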

Obviously the user is responsible for clean text input but these HTML codes are easily overlooked in large quantities of data.
We could:

- add a rule to recognize HTML codes and keep them as 1 token
- add a rule to recognize HTML codes and replace them with the actual character they represent
- perhaps only give a warning --this text contains HTML codes-- ?
- just close this issue without any changes ;-)
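
The warning option could be sketched roughly as below, outside ucto and in Python. The pattern `ENTITY_RE` and the helpers `find_entities` and `warn_if_entities` are illustrative names, not part of ucto; the same pattern could also serve as a starting point for the "keep as 1 token" option.

```python
import re
import sys

# Assumed pattern: named (&amp;), decimal (&#233;) and
# hexadecimal (&#xE9;) references, each ending in ';'.
ENTITY_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")

def find_entities(text):
    """Return every substring that looks like an HTML entity."""
    return ENTITY_RE.findall(text)

def warn_if_entities(text):
    """Print a warning to stderr if the text contains entities;
    return how many were found."""
    found = find_entities(text)
    if found:
        print(f"warning: this text contains {len(found)} HTML codes, "
              f"e.g. {found[0]}", file=sys.stderr)
    return len(found)

warn_if_entities("AT&amp;T said &quot;hello&quot;")
```

A bare ampersand followed by a space, as in "a & b;", is not matched, so ordinary text should not trigger the warning.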

kosloot · 2018-10-25T11:14:28Z

Hmm, interesting. And probably useful.
I think the first step would be to create a rule to detect these 'amp-encoded' entities.
A next step might be to translate them.
See for instance here

Note: we should handle both &eacute; (é) and &#223; (ß)

kosloot · 2018-10-25T12:06:17Z

Hmm, maybe such a rule isn't that easy.
&Eacute;&eacute;n (de facto: Één) may very well be captured, but still.
The assigned class would NOT be WORD, I suppose. That might be confusing, as
Één IS tokenized as a word.
But I think it is worth experimenting...

proycon · 2018-10-30T10:10:52Z

These are called XML/HTML (character) entities technically. It might indeed be a nice feature to have ucto detect and substitute these, though not of the highest priority I'd say. Best make it opt-in with perhaps an automatic detection and suggestion to enable it?

Note: we should handle both &eacute; (é) and &#223; (ß)

Yes, and both hexadecimal and decimal representations.
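
As a rough illustration of the three notations, the Python sketch below decodes numeric references by hand and leaves named ones to the standard library's `html.unescape`. The helper `decode_numeric` is a hypothetical name introduced only for this example.

```python
import html

def decode_numeric(ref):
    """Decode one numeric reference such as '&#233;' (decimal)
    or '&#xE9;' (hexadecimal) into the character it stands for."""
    body = ref[2:-1]                      # strip leading '&#' and trailing ';'
    if body[:1] in ("x", "X"):
        return chr(int(body[1:], 16))     # hexadecimal code point
    return chr(int(body, 10))             # decimal code point

# The same two characters, each written three ways.
print(html.unescape("&eacute;"), decode_numeric("&#233;"), decode_numeric("&#xE9;"))
print(html.unescape("&szlig;"), decode_numeric("&#223;"), decode_numeric("&#xDF;"))
```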

kosloot · 2018-10-30T16:31:39Z

Well,
solving this as an ucto configuration rule is almost (?) impossible.
HTML entities may occur everywhere, not just in WORDS.

The only feasible solution seems to be to implement this as a filter in ucto, replacing all kinds of entities by their UTF-8 equivalents.
I will label this as an enhancement.
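
Until such a filter exists, the same idea can be approximated as a preprocessing step before the text reaches the tokenizer. The sketch below is in Python and relies on the standard library's `html.unescape`; it is not ucto's implementation, and `unescape_lines` is a name made up for this example.

```python
import html

def unescape_lines(lines):
    """Yield each input line with HTML/XML character references
    (named, decimal and hexadecimal) replaced by the characters
    they represent."""
    for line in lines:
        yield html.unescape(line)

raw = ["&Eacute;&eacute;n &amp; twee", "Stra&#223;e", "caf&#xE9;"]
cleaned = list(unescape_lines(raw))
print(cleaned)
# -> ['Één & twee', 'Straße', 'café']
```

Two properties of `html.unescape` are worth keeping in mind: it follows HTML5 rules, so some legacy names are converted even without the trailing semicolon, and it decodes only one level, so a double-escaped `&amp;amp;` becomes `&amp;` rather than `&`.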

kosloot assigned proycon, kosloot and Irishx Oct 25, 2018

kosloot added the enhancement label Oct 30, 2018
