Hmm, interesting. And probably useful
I think the first step would be to create a rule to detect these 'amp-encoded' entities.
A next step might be to translate them.
e.g. see for instance here
Note: we should handle both &eacute; (é) and &szlig; (ß)
Hmm, maybe such a rule isn't that easy. &Eacute;&eacute;n (de facto: Één) may very well be captured, but still.
The assigned class would NOT be WORD, I suppose. That might be confusing, as
Één IS tokenized as a word.
But I think it is worth experimenting...
These are called XML/HTML (character) entities technically. It might indeed be a nice feature to have ucto detect and substitute these, though not of the highest priority I'd say. Best make it opt-in with perhaps an automatic detection and suggestion to enable it?
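A rough idea of what such an automatic detection could look like, sketched in Python rather than in ucto's own code. The names `ENTITY_RE`, `count_entities` and `suggest_filter` are made up for illustration and are not part of ucto; the threshold is an arbitrary assumption.

```python
import re

# Matches named (&eacute;), decimal (&#233;) and hexadecimal (&#xE9;) entities.
# Hypothetical pattern for illustration; it requires the trailing semicolon.
ENTITY_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def count_entities(text: str) -> int:
    """Count substrings that look like XML/HTML character entities."""
    return len(ENTITY_RE.findall(text))


def suggest_filter(text: str, threshold: int = 1) -> bool:
    """Return True when enough entity-like strings occur to justify a hint."""
    return count_entities(text) >= threshold


if __name__ == "__main__":
    sample = "Tom &amp; Jerry said &quot;hi&quot;"
    if suggest_filter(sample):
        print(f"{count_entities(sample)} entity-like strings found; "
              "consider enabling entity substitution")
```

A pattern like this will also flag text that only looks like an entity (an ampersand followed by letters and a semicolon), which is one more reason to keep the substitution opt-in and only print a suggestion.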
Note: we should handle both &eacute; (é) and &szlig; (ß)
Yes, and both hexadecimal and decimal representations.
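To make the three notations concrete, here is a small Python illustration using the standard library's `html.unescape`. It only shows what the expected result of a substitution would be; it says nothing about how ucto would implement it.

```python
import html

# The same two characters written as named, decimal and hexadecimal entities.
samples = ["&eacute;", "&#233;", "&#xE9;", "&szlig;", "&#223;", "&#xDF;"]

# html.unescape turns each notation into the corresponding Unicode character.
decoded = [html.unescape(s) for s in samples]

for raw, char in zip(samples, decoded):
    print(f"{raw} -> {char}")
```

All three notations for a given character decode to the same code point, so a substitution step has to accept each of them.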
Well, solving this as a ucto configuration rule is almost(?) impossible.
HTML entities may occur everywhere, not just in WORDS.
The only feasible solution seems to be implementing this as a filter in ucto, replacing all kinds of entities by their UTF-8 variant.
I will label this as an enhancement.
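Until something like that exists in ucto, the same effect can be approximated by cleaning the text before tokenizing. A minimal sketch in Python, assuming the input is plain UTF-8 text read line by line; `decode_entities` is a hypothetical helper name, not an ucto function.

```python
import html
from typing import Iterable, Iterator


def decode_entities(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line with XML/HTML character entities replaced.

    Works on the whole line, so entities are replaced wherever they
    occur, not only inside words.
    """
    for line in lines:
        yield html.unescape(line)


if __name__ == "__main__":
    raw = ["Tom &amp; Jerry", "&Eacute;&eacute;n &quot;test&quot;"]
    for cleaned in decode_entities(raw):
        print(cleaned)
```

Note that `html.unescape` follows the HTML5 rules, which also decode a number of legacy names when the trailing semicolon is missing, so it is somewhat more permissive than a strict XML entity decoder would be.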
I notice that raw texts are very often taken from the web and contain strings like &amp; or &quot;.
These strings are not recognized by ucto and are split into multiple tokens: & amp ;
Obviously the user is responsible for clean text input but these HTML codes are easily overlooked in large quantities of data.
We could: