HTML Ampersand Character Codes #57

Open
Irishx opened this issue Oct 25, 2018 · 4 comments

@Irishx
Contributor

Irishx commented Oct 25, 2018

I notice that raw texts are very often taken from the web and contain &amp; or &quot; strings.

These strings are not recognized by ucto and are split into multiple tokens: & amp ;

Obviously the user is responsible for clean text input, but these HTML codes are easily overlooked in large quantities of data.
We could:

  • add a rule to recognize HTML codes and keep them as 1 token
  • add a rule to recognize HTML codes and replace them with the actual character they represent
  • perhaps only give a warning --this text contains HTML codes-- ?
  • just close this issue without any changes ;-)
@kosloot
Contributor

kosloot commented Oct 25, 2018

Hmm, interesting. And probably useful.
I think the first step would be to create a rule to detect these 'amp-encoded' entities.
A next step might be to translate them.
See, for instance, here

Note: we should handle both &eacute; (é) and &#223; (ß)

@kosloot
Contributor

kosloot commented Oct 25, 2018

Hmm, maybe such a rule isn't that easy.
&Eacute;&eacute;n (de facto: Één) may very well be captured, but still.
The assigned class would NOT be WORD, I suppose. That might be confusing, as
Één IS tokenized as a word.
But I think it is worth experimenting...

@proycon
Member

proycon commented Oct 30, 2018

These are called XML/HTML (character) entities technically. It might indeed be a nice feature to have ucto detect and substitute these, though not of the highest priority I'd say. Best make it opt-in with perhaps an automatic detection and suggestion to enable it?

Note: we should handle both &eacute; (é) and &#223; (ß)

Yes, and both hexadecimal and decimal representations.
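
To make the three forms concrete, here is a rough Python sketch of a detector for named, decimal, and hexadecimal entities. It is only an illustration: the pattern and the name `find_entities` are made up for this example and are not part of ucto, and the pattern accepts anything that is syntactically an entity without checking that the name is a known one.

```python
import re

# Hypothetical pattern (not taken from ucto). It matches three forms:
#   named:       &eacute;
#   decimal:     &#223;
#   hexadecimal: &#xDF;
ENTITY_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def find_entities(text):
    """Return every substring that looks like an HTML/XML character entity."""
    return ENTITY_RE.findall(text)


sample = "caf&eacute; &amp; Stra&#223;e and Stra&#xDF;e"
print(find_entities(sample))
```

A check like this could back the "automatic detection and suggestion" idea: if the list is non-empty, warn the user and suggest enabling the substitution.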

@kosloot
Contributor

kosloot commented Oct 30, 2018

Well,
solving this as an ucto configuration rule is almost? impossible.
HTML entities may occur everywhere, not just in WORDS.

The only feasible solution seems to be to implement this as a filter in ucto, replacing all kinds of entities with their UTF-8 equivalents.
I will label this as an enhancement.
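
As a minimal sketch of what such a filter could look like outside ucto, the text can be run through Python's standard `html.unescape` before it reaches the tokenizer. The function name `entity_filter` and the opt-in `decode` flag are assumptions for this illustration and do not correspond to any existing ucto option.

```python
import html


def entity_filter(lines, decode=False):
    """Yield lines, optionally replacing character entities with the
    characters they encode. Decoding is opt-in, so the default leaves
    the input untouched."""
    for line in lines:
        # html.unescape handles named, decimal and hexadecimal references.
        yield html.unescape(line) if decode else line


raw = ["&Eacute;&eacute;n &amp; twee", "Stra&#223;e / Stra&#xDF;e"]
for cleaned in entity_filter(raw, decode=True):
    print(cleaned)
```

Note that `html.unescape` follows the HTML5 rules, which also decode a few legacy entities that lack the trailing semicolon, so a filter built into ucto might want to be stricter than this.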
