Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escaping of non-ascii characters in entry XML ID #1107

Open
1313ou opened this issue Oct 4, 2024 · 17 comments
Open

Escaping of non-ascii characters in entry XML ID #1107

1313ou opened this issue Oct 4, 2024 · 17 comments
Labels
bug Something isn't working
Milestone

Comments

@1313ou
Copy link
Contributor

1313ou commented Oct 4, 2024

The escape_lemma(lemma) function, whose purpose is to format the lemma so it is valid XML id, is flawed when it comes to escaping non-ascii characters.

It converts any such characters to '-%04x-' % ord(c), which used 4 times withe the current data:
'oewn-Se-00f1-or-n',
'oewn-Se-00f1-ora-n',
'oewn-Se-00f1-orita-n',
'oewn-Capital-003a-_Critique_of_Political_Economy-n',

Decoding would involve the reverse process of converting any r'-[0-9A-Fa-f]{4}-' back to the character.

The snag is such sequences as

-abbe-
-abed-
-face-
-baba-
-bead-
-beef-
-cafe-
-caff-
-dada-
-dead-
-deaf-
-deed-
-fade-
-feed-

also match, qualifying as valid hex sequences (in addition to any four-digit like -1000-).

These sequences will be found in:

oewn-1000-a
oewn-1000-n
oewn-1728-n
oewn-2019-nCoV_acute_respiratory_disease-n
oewn-Adad-n
oewn-Bade-n
oewn-Beda-n
oewn-Bede-n
oewn-Daba-n
oewn-Edda-n
oewn-Rain-in-the-Face-n
oewn-abbe-n
oewn-abed-r
oewn-about-face-n
oewn-about-face-v
oewn-baba-n
oewn-babe-n
oewn-bead-n
oewn-bead-v
oewn-beef-n
oewn-beef-v
oewn-cafe-n
oewn-caff-n
oewn-cede-v
oewn-dace-n
oewn-dada-n
oewn-dead-a
oewn-dead-air_space-n
oewn-dead-burned_lime-n
oewn-dead-end-a
oewn-dead-end_street-n
oewn-dead-man-ap-s-fingers-n
oewn-dead-man-ap-s_float-n
oewn-dead-men-ap-s-fingers-n
oewn-dead-n
oewn-dead-on-a
oewn-dead-r
oewn-deaf-a
oewn-deaf-aid-n
oewn-deaf-and-dumb-a
oewn-deaf-and-dumb_person-n
oewn-deaf-mute-a
oewn-deaf-mute-n
oewn-deaf-muteness-n
oewn-deaf-mutism-n
oewn-deaf-n
oewn-deaf-v
oewn-deed-n
oewn-drop-dead-r
oewn-edda-n
oewn-face-amount_certificate_company-n
oewn-face-harden-v
oewn-face-lift-v
oewn-face-n
oewn-face-off-n
oewn-face-saving-a
oewn-face-to-face-a
oewn-face-to-face-r
oewn-face-v
oewn-fade-n
oewn-fade-v
oewn-feed-n
oewn-feed-v
oewn-force-feed-v
oewn-full-face-a
oewn-in-your-face-a
oewn-lie-abed-n
oewn-pousse-cafe-n
oewn-pudding-face-n
oewn-sick-abed-a
oewn-stone-dead-a
oewn-stone-deaf-a
oewn-stone-face-n
oewn-tone-deaf-a
oewn-volte-face-n

thus making decoding hazardous (because it's impossible to tell the string 'face' from the hex 'face').

Added to that, the '-de.*-' sequences will result in unicode surrogate characters reserved for coding and raising an error when printed.

Te good news is that unicode letters can be be part of an XML ID

Here are regular expressions for valid NameStartChar and NameChar based on the XML 1.0 specification:

name_start_char_re = re.compile(r'^[A-Z_a-z\xC0-\xD6\xD8-\xF6\xF8-\u02FF\u0370-\u037D\u037F-\u1FFF'
                                r'\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF'
                                r'\uF900-\uFDCF\uFDF0-\uFFFD]$')

name_char_re = re.compile(r'^[A-Z_a-z0-9\x2D\x2E\xB7\xC0-\xD6\xD8-\xF6\xF8-\u02FF'
                          r'\u0300-\u036F\u203F-\u2040\u0370-\u037D\u037F-\u1FFF'
                          r'\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF'
                          r'\uF900-\uFDCF\uFDF0-\uFFFD]$')

With this in mind,

'é' is valid in XML ID (including as start character)
'ñ' is valid in XML ID (including as start character)
'汉' is valid in XML ID (including as start character)

That settles the problem for 3 cases out of 4: señor, señora, señorita. Only remaining problem, the colon, currently used only in 'Capital: Critique of Political Economy'
Why not use '-cn' (-cl- is not available being used by cl for centilitre) which would yield

oewn-Capital-cn-_Critique_of_Political_Economy-n

instead of

oewn-Capital-003a-_Critique_of_Political_Economy-n

Risky if a 'cn' is later introduced, abbreviation for China for instance. Personnally I would not accept colons within lemma which has been generating problems from the start, only for a single entry. Besides 'Capital: Critique of Political_Economy' can hardly be argued to be a lemma or a dictionary entry (possibly an encyclopedia entry).

This affects only XML so there is nothing to fix but code because the XML is just derived, not source. However tools that work from XML will have to be reviewed if they try to unescape lemmas in entry ids and work with XML not generated with fixed code.

While correcting, I suggest replacing

    elif c == '-':
            return '-'

which is a NO-OP, with

     elif c == ':':
          return '-cn-'

and change

    else:
        return '-%04x-' % ord(c)

to

    elif name_char_re.match(c) or name_char_re.match(c):
        return c
    raise ValueError(f'Illegal character {c}')

with the regexprs above

@1313ou 1313ou added the bug Something isn't working label Oct 4, 2024
@jmccrae
Copy link
Member

jmccrae commented Oct 4, 2024

I think this is a good point and I fully agree. The implementation you propose is good and so please go ahead and make the PR (but check that it is compatible with #1096)

However, I don't think we should assume that IDs should be invertible and unique. For example if we were to introduce accented forms of word, e.g., 'protégé', I think we should at least consider whether we just combine with the existing identifier oewn-protege-n rather than introduced some identifier like oewn-prot-eac-g-eac--n. Of course, this assumes that accents are not meaningful in English ('resumé'/'resume' being maybe an exception but they are also different parts-of-speech).

@jmccrae
Copy link
Member

jmccrae commented Oct 4, 2024

Actually, I see we do have some accented entries already:

oewn-Se-00f1-or-n
oewn-Se-00f1-ora-n
oewn-Se-00f1-orita-n

So maybe we can fix some accented characters as well?

@oewntk
Copy link

oewntk commented Oct 7, 2024

I don't think we should assume that IDs should be invertible and unique
I agree they don't have to be invertible. However XML IDs have to be unique otherwise the document won't validate. The issue is that the current generation of IDs might lead to collisions and duplicates.
I suspect what you had in mind was more the definition of a lexical entry: is 'protege' a single entry with different forms (protégé, protege) or two different entries with same synset membership (the current scheme in PWN31, it seems)?

@jmccrae
Copy link
Member

jmccrae commented Oct 8, 2024

When I think about it more, I think that we should probably have two lexical entries for 'protege' and 'protégé' so they both appear in the UI (https://en-word.net). Alternative forms are not shown in the interface normally. If XML IDs can support most accented characters as @1313ou says then there is no problem with having nice IDs as well.

@jmccrae jmccrae added this to the 2024 Release milestone Oct 11, 2024
@goodmami
Copy link
Member

This discussion is related to globalwordnet/schemas#55, right? If so, you can compare to what we do for OMW: https://github.com/omwn/omw-data/blob/main/scripts/util.py

The basic strategy is to allow anything that is allowed in XML IDs, then to use HTML4 entity names with some additional names in a custom mapping. Finally, since - acts as the escaping operator, literal hyphens use the double-escape strategy: --.

>>> import util
>>> util.escape_lemma("Capital: Critique of Political Economy")
'Capital-colon-_Critique_of_Political_Economy'
>>> util.escape_lemma("protégé")
'protégé'
>>> util.escape_lemma("ex-president")
'ex--president'

@1313ou
Copy link
Contributor Author

1313ou commented Oct 12, 2024

Much better to have an escape character instead of a -xy- sequence, with xy thought to be unlikely in the input.

Changing would affect
1310 -ap- sequences (apostrophe)
22 -sl- (slash)
2 -pl- (plus)
1 -ex- (exclamation)

It would also remove the confusion between
-ex- and ex-
-cm- and cm (for centimeter)
-lb- and lb (pound)
-pl- and PL for Poland
-cn- and CN for China

@jmccrae
Copy link
Member

jmccrae commented Oct 16, 2024

@1313ou is this issue closed or do you wish to implement @goodmami's suggestion?

It seems like a more standard and simple solution to me.

@1313ou
Copy link
Contributor Author

1313ou commented Oct 16, 2024

You can close the issue as it was originally about non-ascii characters.

But in the discussion , the issue was also raised about escaping ASCII characters that can't be found in an XML-ID (comma, colon, brackets, exclamation, question ...).

I can implement @goodmami's suggestion (in fact copy his code, though I suspect validation would also have to be tweaked) if you give me some time and if you give the greenlight to changing some 1350 IDs for entries only.

Does it affect both entry IDs (escaped lemmas) and sense IDs (escaped sensekeys) ?

@1313ou 1313ou closed this as completed Oct 16, 2024
@jmccrae
Copy link
Member

jmccrae commented Oct 16, 2024

Yeah, I don't think it is a problem to change the entry IDs.

This probably doesn't need to affect the sense keys as they do not need to be valid XML ids.

@1313ou
Copy link
Contributor Author

1313ou commented Oct 16, 2024

The sense IDs are mapped/derived from sensekeys through map_sense_key() in wordnet.py. This is to get an acceptable XML ID for senses. What I was meaning is : Do we have to escape them the same way, which does make sense. This affects XML only. Sensekeys in YAML are unaffected.

@1313ou
Copy link
Contributor Author

1313ou commented Oct 16, 2024

This would yield:

protégé -> protégé
señor -> señor
a.b.c -> a.b.c
a-b-c -> a--b--c
a_b_c -> a-lowbar-b-lowbar-c
a b c -> a_b_c
a:b:c -> a-colon-b-colon-c
a/b/c -> a-sol-b-sol-c
a|b|c -> a-vert-b-vert-c
a\b\c -> a-bsol-b-bsol-c
a,b,c -> a-comma-b-comma-c
a;b;c -> a-semi-b-semi-c
a=b=c -> a-equals-b-equals-c
a+b+c -> a-plus-b-plus-c
a(b)c -> a-lpar-b-rpar-c
a{b}c -> a-lbrace-b-rbrace-c
a[b]c -> a-lsqb-b-rsqb-c
a!b!c -> a-excl-b-excl-c
a?b?c -> a-quest-b-quest-c
a$b$c -> a-dollar-b-dollar-c
a#b#c -> a-num-b-num-c
a%b%c -> a-percnt-b-percnt-c
a@b@c -> a-commat-b-commat-c
a^b^c -> a-Hat-b-Hat-c
a*b*c -> a-ast-b-ast-c
a'b'c -> a-apos-b-apos-c
a`b`c -> a-grave-b-grave-c
a´b´c -> a-acute-b-acute-c
a‘quoted’b -> a-lsquo-quoted-rsquo-b
a→b←c -> ERROR '→' [2192] is illegal character in XML ID and no escape sequence is defined

@goodmami
Copy link
Member

Thanks for these examples, @1313ou. Here are the differences from my implementation:

- a_b_c -> a-lowbar-b-lowbar-c
+ a_b_c -> a_b_c

This was a design choice on my end to keep _ even though it became conflated with the replacement for spaces. Even though reversibility wasn't a requirement, I think I prefer your approach here.

- a´b´c -> a-acute-b-acute-c
+ a´b´c -> aacutebacutec
- a‘quoted’b -> a-lsquo-quoted-rsquo-b
+ a‘quoted’b -> alsquoquotedrsquob

These reveal a bug in my script. For entity names looked up from Python's html.entities.codepoint2name, I was forgetting to put the - escapes around them.

- a→b←c -> ERROR '→' [2192] is illegal character in XML ID and no escape sequence is defined
+ a→b←c -> ararrblarrc

For me this is the same problem as above, but I'm not sure why you're getting an error?

@1313ou
Copy link
Contributor Author

1313ou commented Oct 18, 2024

About the arrows, there are arrows and arrows... Some are excluded (a→b←c'), some are allowed ('a🠀b🠂c'). The ones excuded here have codepoint x2190 and x2192 . the ones allowed x1F800 and x1F802. There are 3 arrow blocks I know of in Unicode.

@1313ou
Copy link
Contributor Author

1313ou commented Nov 13, 2024

To avoid confusion, let's say that this does not deal with YAML ids. These are under lesser constraints, the main one
still being, for sense ids, to be parsable (their raison d'être) sensekeys, see issue #1123

XML IDs, and for that matter xsd:id, are under stricter constraints.

As suggested above I have been implementing @goodmami's proposal of singling out an escaping dash/hyphen character (repeated if dash is really meant). The scheme encloses between dashes what looks like HTML entity names.
Being HTML5 standard, they appear to be readable. More importantly, the solution also seems to be more robust than the current one for reasons given above in this issue.

In the oewn-core xml module, I have defined an interface
that materializes as 2 different implementations :

The latter does the following for XML-invalid (extended) ASCII characters:

'-' 2D ESCAPED AS --
' ' 20 ESCAPED AS _
'!' 21 ESCAPED AS -excl-
'#' 23 ESCAPED AS -num-
'$' 24 ESCAPED AS -dollar-
'%' 25 ESCAPED AS -percnt-
'&' 26 ESCAPED AS -amp-
"'" 27 ESCAPED AS -apos-
'(' 28 ESCAPED AS -lpar-
')' 29 ESCAPED AS -rpar-
'*' 2A ESCAPED AS -ast-
'+' 2B ESCAPED AS -plus-
',' 2C ESCAPED AS -comma-
'/' 2F ESCAPED AS -sol-
'{' 7B ESCAPED AS -lbrace-
'|' 7C ESCAPED AS -vert-
'}' 7D ESCAPED AS -rbrace-
'~' 7E ESCAPED AS -tilde-
'¢' A2 ESCAPED AS -cent-
'£' A3 ESCAPED AS -pound-
'§' A7 ESCAPED AS -sect-
'©' A9 ESCAPED AS -copy-
'®' AE ESCAPED AS -reg-
'°' B0 ESCAPED AS -deg-
'´' B4 ESCAPED AS -acute-
'¶' B6 ESCAPED AS -para-
'º' BA ESCAPED AS -ordm-

Contrary to the above, Unicode 'letters' like these don't need to be escaped :

ñ F1  start mid in señor
é E9  start mid in protégé
è E8  start mid
à E0  start mid in à propos
â E2  start mid
ç E7  start mid
汉 6C49  start mid
...

This is already implemented so things like 'tête-à-tête' now pose no problem.

COLON

The colon is a valid character that can be part of an XML ID as per XML 1.0 and 1.1 specifications.
But, being a namespace separator, it cannot be part of a xsd:id (a subset of XML ID)

oewn-despotical·3:01:00:: is a valid XML ID but not a valid xsd:id

As there doesn't seem to be any XSD validation plan in the offing,
the option of leaving the colon or escaping it is a matter of choice.

THE MIDDLE DOT

Interestingly the middle dot · is allowed. In my view it would do a better job than the double underscore in sensekeys.

oewn-despotical·3:01:00::,oewn-despotical·3.01.00.. are better looking than oewn-despotical__3.01.00..
and since the unescaping code is bound to be updated to conform to the dash scheme, why not use it ?

COMPARING ESCAPE SCHEMES

Here is what things would look like for existing escapable cases, if this proposal is accepted:

' (apostrophe)
	mendel's_law%1:09:00::
		--new--> oewn-mendel-apos-s_law·1.09.00..
		--leg--> oewn-mendel-ap-s_law__1.09.00..
	light-o'-love%1:18:00::
		--new--> oewn-light--o-apos---love·1.18.00..
		--leg--> oewn-light-o-ap--love__1.18.00..
, (comma)
	prince_william,_duke_of_cumberland%1:18:01::
		--new--> oewn-prince_william-comma-_duke_of_cumberland·1.18.01..
		--leg--> oewn-prince_william-cm-_duke_of_cumberland__1.18.01..
+ (plus)
	lgbtqia+%1:14:00::
		--new--> oewn-lgbtqia-plus-·1.14.00..
		--leg--> oewn-lgbtqia-pl-__1.14.00..
! (exclamation)
	yahoo!%1:10:01::
		--new--> oewn-yahoo-excl-·1.10.01..
		--leg--> oewn-yahoo-ex-__1.10.01..
/ (slash)
	a/c%1:06:00::
		--new--> oewn-a-sol-c·1.06.00..
		--leg--> oewn-a-sl-c__1.06.00..
	20/20%1:09:00::
		--new--> oewn-20-sol-20·1.09.00..
		--leg--> oewn-20-sl-20__1.09.00..
	km/s%1:28:01::
		--new--> oewn-km-sol-s·1.28.01..
		--leg--> oewn-km-sl-s__1.28.01..
- (hyphen or dash)
	by-product%1:19:00::
		--new--> oewn-by--product·1.19.00..
		--leg--> oewn-by-product__1.19.00..
	moving-coil_microphone%1:06:01::
		--new--> oewn-moving--coil_microphone·1.06.01..
		--leg--> oewn-moving-coil_microphone__1.06.01..
: (colon)
	capital:_critique_of_political_economy%1:10:01::
		--new--> oewn-capital-colon-_critique_of_political_economy·1.10.01..
		--leg--> oewn-capital-cn-_critique_of_political_economy__1.10.01..

If this proposal suits you, the code can be imported in the scripts.

@1313ou 1313ou reopened this Nov 13, 2024
@goodmami
Copy link
Member

Thanks for writing this up, @1313ou. This looks good and is pretty close to what I'm doing with OMW. A few thoughts/notes:

  1. The middle dot character looks better, but it's also much harder to type (at least on my keyboard). That may not be a concern if you don't expect people to type the IDs.
  2. I recently switched to escape _ to -lowbar- for OMW, but for WNDB to WN-LMF conversion this resulted in things like Mendel's_law (in WNDB) becoming mendel-apos-s-lowbar-law. Since the _ in WNDB was already an encoding of a space, I just decoded it back to a space before escaping again (i.e., Mendel's_lawMendel's lawMendel-apos-s_law). I think you don't have this problem since you're not starting from WNDB.
  3. It probably makes sense to do NFC or NFKC unicode normalization first so we are using é (U+00e9) and not (U+0065,U+0301). I'm not yet doing this in OMW but I probably should, and not just for IDs.

@1313ou
Copy link
Contributor Author

1313ou commented Nov 14, 2024

@goodmami, I agree with 3 that Unicode normalisation is mandatory. However I don't see this as an XML problem but as a source problem. Take 'Señor'

In src/yaml/entries-s.yaml we have

Señor:
  n:
    sense:
    - id: 'señor%1:10:01::'
      synset: 06353232-n

and in src/yaml/noun-commmunication.yaml

06353232-n:
  members:
  - Senor
  - Señor

We have cross-references that would fail if the 3 occurences of ñ were coded differently (as either u00f1 or n + u0303)... unless the YAML library is doing Unicode normalization. Of course the problem would propagate to XML but it's not up to XML to suppress it silently (let exceptions propagate, you get a better chance of catching the problem).

@goodmami
Copy link
Member

However I don't see this as an XML problem but as a source problem.

You're exactly right, we can be doing this as a general QA pass over all the data and not just for XML or IDs.

I can't imagine any reason why we'd want to use combining characters when canonical single-character versions exist, but NFC normalization would also change things like roman-numerals ( (U+2160) to I (U+0049)) and unit signs ( (U+212B) to Å (U+00C5)). I'm less sure if we want NFKC normalization which would also change things like to x2.

@jmccrae jmccrae modified the milestones: 2024 Release, 2025 Release Nov 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants