Escaping of non-ascii characters in entry XML ID #1107
Comments
I think this is a good point and I fully agree. The implementation you propose is good, so please go ahead and make the PR (but check that it is compatible with #1096). However, I don't think we should assume that IDs should be invertible and unique. For example, if we were to introduce accented forms of words, e.g., 'protégé', I think we should at least consider whether we just combine with the existing identifier |
Actually, I see we do have some accented entries already:
So maybe we can fix some accented characters as well? |
|
When I think about it more, I think that we should probably have two lexical entries for 'protege' and 'protégé' so they both appear in the UI (https://en-word.net). Alternative forms are not shown in the interface normally. If XML IDs can support most accented characters as @1313ou says then there is no problem with having nice IDs as well. |
This discussion is related to globalwordnet/schemas#55, right? If so, you can compare to what we do for OMW: https://github.com/omwn/omw-data/blob/main/scripts/util.py The basic strategy is to allow anything that is allowed in XML IDs, then to use HTML4 entity names with some additional names in a custom mapping. Finally, since
>>> import util
>>> util.escape_lemma("Capital: Critique of Political Economy")
'Capital-colon-_Critique_of_Political_Economy'
>>> util.escape_lemma("protégé")
'protégé'
>>> util.escape_lemma("ex-president")
'ex--president' |
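The three outputs above suggest a scheme along the following lines. This is only a minimal sketch, not the code in the OMW util.py: the function name escape_lemma_sketch, the CUSTOM_NAMES table, and the use of str.isalnum() as a rough stand-in for the real XML NameChar test are all assumptions made for illustration. The only library call used is html.entities.codepoint2name from the Python standard library.

```python
from html.entities import codepoint2name

# Hypothetical extra names for ASCII punctuation that HTML4 does not name.
CUSTOM_NAMES = {":": "colon", ",": "comma", "!": "excl", "?": "quest"}


def escape_lemma_sketch(lemma: str) -> str:
    """Escape a lemma using '-' as the escape character (illustrative only)."""
    out = []
    for c in lemma:
        if c == "-":
            out.append("--")          # a literal dash is doubled
        elif c == " ":
            out.append("_")           # spaces become underscores
        elif c.isalnum() or c in "_.":
            out.append(c)             # assumed to be legal in an XML ID as-is
        else:
            # Look for a name: custom table first, then HTML4 entity names.
            name = CUSTOM_NAMES.get(c) or codepoint2name.get(ord(c))
            if name is None:
                raise ValueError(f"no escape name defined for {c!r} (U+{ord(c):04X})")
            out.append(f"-{name}-")   # entity name enclosed between dashes
    return "".join(out)


print(escape_lemma_sketch("Capital: Critique of Political Economy"))
print(escape_lemma_sketch("ex-president"))
```

Under these assumptions the sketch reproduces the three results shown in the session above. Doubling the dash is what keeps a literal hyphen distinguishable from the dashes that delimit an entity name.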
Much better to have an escape character instead of a -xy- sequence, with xy thought to be unlikely in the input. Changing would affect It would also remove the confusion between |
You can close the issue, as it was originally about non-ASCII characters. But the discussion also raised the question of escaping ASCII characters that can't be found in an XML ID (comma, colon, brackets, exclamation mark, question mark ...). I can implement @goodmami's suggestion (in fact copy his code, though I suspect validation would also have to be tweaked) if you give me some time and the green light to change some 1350 IDs, for entries only. Does it affect both entry IDs (escaped lemmas) and sense IDs (escaped sensekeys)? |
Yeah, I don't think it is a problem to change the entry IDs. This probably doesn't need to affect the sense keys as they do not need to be valid XML ids. |
The sense IDs are derived from sensekeys through map_sense_key() in wordnet.py, in order to get an acceptable XML ID for senses. What I meant was: do we have to escape them the same way? That would make sense. This affects XML only; sensekeys in YAML are unaffected. |
This would yield:
|
Thanks for these examples, @1313ou. Here are the differences from my implementation:
- a_b_c -> a-lowbar-b-lowbar-c
+ a_b_c -> a_b_c
This was a design choice on my end to keep
- a´b´c -> a-acute-b-acute-c
+ a´b´c -> aacutebacutec
- a‘quoted’b -> a-lsquo-quoted-rsquo-b
+ a‘quoted’b -> alsquoquotedrsquob
These reveal a bug in my script. For entity names looked up from Python's
- a→b←c -> ERROR '→' [2192] is illegal character in XML ID and no escape sequence is defined
+ a→b←c -> ararrblarrc
For me this is the same problem as above, but I'm not sure why you're getting an error? |
About the arrows, there are arrows and arrows... Some are excluded ('a→b←c'), some are allowed ('a🠀b🠂c'). The ones excluded here have code points U+2190 and U+2192; the ones allowed have U+1F800 and U+1F802. There are 3 arrow blocks I know of in Unicode. |
To avoid confusion, let's say that this does not deal with YAML ids. These are under lesser constraints, the main one XML IDs, and for that matter xsd:id, are under stricter constraints. As suggested above I have been implementing @goodmami's proposal of singling out an escaping dash/hyphen character (repeated if dash is really meant). The scheme encloses between dashes what looks like HTML entity names. In the oewn-core xml module, I have defined an interface
The latter does the following for XML-invalid (extended) ASCII characters:
Contrary to the above, Unicode 'letters' like these don't need to be escaped :
This is already implemented so things like 'tête-à-tête' now pose no problem.
COLON
The colon is a valid character that can be part of an XML ID as per XML 1.0 and 1.1 specifications.
As there doesn't seem to be any XSD validation plan in the offing,
THE MIDDLE DOT
Interestingly the middle dot · is allowed. In my view it would do a better job than the double underscore in sensekeys.
COMPARING ESCAPE SCHEMES
Here is what things would look like for existing escapable cases, if this proposal is accepted:
If this proposal suits you, the code can be imported in the scripts. |
Thanks for writing this up, @1313ou. This looks good and is pretty close to what I'm doing with OMW. A few thoughts/notes:
|
@goodmami, I agree with 3 that Unicode normalisation is mandatory. However, I don't see this as an XML problem but as a source problem. Take 'Señor'. In src/yaml/entries-s.yaml we have
and in src/yaml/noun-commmunication.yaml
We have cross-references that would fail if the 3 occurrences of ñ were encoded differently (as either U+00F1 or n + U+0303)... unless the YAML library is doing Unicode normalization. Of course the problem would propagate to XML, but it's not up to XML to suppress it silently (let exceptions propagate; you get a better chance of catching the problem). |
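The cross-reference failure described above can be reproduced in a few lines. This is a minimal sketch using Python's standard unicodedata module; the helper names nfc and find_non_nfc are made up for illustration and are not part of the project's scripts.

```python
import unicodedata

composed = "Se\u00f1or"      # ñ as the single code point U+00F1
decomposed = "Sen\u0303or"   # n followed by combining tilde U+0303

# The two strings render identically but compare unequal,
# so a lookup keyed on one would miss the other.
print(composed == decomposed)


def nfc(text: str) -> str:
    """Return the NFC (canonical composed) form of text."""
    return unicodedata.normalize("NFC", text)


def find_non_nfc(strings):
    """Return the strings that are not already in NFC (hypothetical QA check)."""
    return [s for s in strings if s != nfc(s)]


print(nfc(decomposed) == composed)
print(len(find_non_nfc([composed, decomposed])))
```

A check like find_non_nfc reports offending strings rather than silently rewriting them, which matches the preference for letting the problem surface. Note that NFC only composes canonical equivalents: a compatibility character such as the Roman numeral Ⅳ (U+2163) is left unchanged by NFC and is folded to 'IV' only by NFKC.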
You're exactly right, we can be doing this as a general QA pass over all the data and not just for XML or IDs. I can't imagine any reason why we'd want to use combining characters when canonical single-character versions exist, but NFC normalization would also change things like roman-numerals ( |
The escape_lemma(lemma) function, whose purpose is to format the lemma so that it is a valid XML ID, is flawed when it comes to escaping non-ASCII characters.
It converts any such character to '-%04x-' % ord(c), which is used 4 times with the current data:
'oewn-Se-00f1-or-n',
'oewn-Se-00f1-ora-n',
'oewn-Se-00f1-orita-n',
'oewn-Capital-003a-_Critique_of_Political_Economy-n',
Decoding would involve the reverse process of converting any r'-[0-9A-Fa-f]{4}-' back to the character.
The snag is such sequences as
also match, qualifying as valid hex sequences (in addition to any four-digit like -1000-).
These sequences will be found in:
thus making decoding hazardous (because it's impossible to tell the string 'face' from the hex 'face').
Added to that, the '-de.*-' sequences will decode to Unicode surrogate code points, which are reserved for encoding and raise an error when printed.
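Both hazards can be demonstrated with a short sketch. The functions below are simplified stand-ins, not the project's code: old_escape only hex-escapes non-ASCII characters, naive_unescape applies the reverse regular expression quoted above, and the input strings 'some-face-id' and 'a-dead-b' are invented for the demonstration.

```python
import re

# The decoding pattern quoted above: four hex digits between dashes.
HEX_ESCAPE = re.compile(r"-([0-9A-Fa-f]{4})-")


def old_escape(text: str) -> str:
    """Simplified stand-in: hex-escape each non-ASCII character as -xxxx-."""
    return "".join(c if ord(c) < 128 else "-%04x-" % ord(c) for c in text)


def naive_unescape(escaped: str) -> str:
    """Reverse the escape by decoding every -xxxx- match."""
    return HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), escaped)


# Round trip works when the input has no hex-looking word between dashes.
print(naive_unescape(old_escape("Se\u00f1or")) == "Se\u00f1or")

# 'face' is a valid hex number, so a plain hyphenated word is mangled.
print(naive_unescape("some-face-id") == "some-face-id")

# 'dead' decodes to U+DEAD, a surrogate that cannot be encoded as UTF-8.
broken = naive_unescape("a-dead-b")
try:
    broken.encode("utf-8")
except UnicodeEncodeError as err:
    print("cannot encode:", err.reason)
```

The second case is the core problem: nothing in the escaped string records whether '-face-' came from an escape or from the lemma itself.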
The good news is that Unicode letters can be part of an XML ID.
Here are regular expressions for valid NameStartChar and NameChar based on the XML 1.0 specification:
With this in mind,
That settles the problem for 3 cases out of 4: señor, señora, señorita. The only remaining problem is the colon, currently used only in 'Capital: Critique of Political Economy'.
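For reference, here is one way to test a candidate ID against the XML 1.0 (Fifth Edition) Name production. It is a sketch transcribed from the NameStartChar and NameChar ranges in the specification, not necessarily the expressions posted in this issue, and the names NAME_START, NAME_CHAR and is_xml_name are made up for illustration.

```python
import re

# NameStartChar ranges from the XML 1.0 (Fifth Edition) specification.
NAME_START = (
    ":A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    "\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# NameChar adds '-', '.', digits, the middle dot and combining marks.
NAME_CHAR = NAME_START + r"\-.0-9" + "\u00B7\u0300-\u036F\u203F-\u2040"

XML_NAME = re.compile(f"[{NAME_START}][{NAME_CHAR}]*")


def is_xml_name(candidate: str) -> bool:
    """True if candidate matches the XML 1.0 Name production."""
    return XML_NAME.fullmatch(candidate) is not None


print(is_xml_name("oewn-Se\u00f1or-n"))   # accented letter
print(is_xml_name("a\u2192b"))            # arrow U+2192
print(is_xml_name("a,b"))                 # comma
```

Under this production the accented IDs pass without escaping, while the comma and the U+2190/U+2192 arrows are rejected and arrows in the U+1F800 block are accepted. The colon also passes here because Name allows it; the stricter NCName production used by xsd:ID does not.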
Why not use '-cn' ('-cl-' is not available, since 'cl' is already used for centilitre), which would yield
instead of
This is risky if a 'cn' is later introduced, as an abbreviation for China for instance. Personally I would not accept colons within lemmas: they have been generating problems from the start, and only for a single entry. Besides, 'Capital: Critique of Political_Economy' can hardly be argued to be a lemma or a dictionary entry (possibly an encyclopedia entry).
This affects only XML, so there is nothing to fix but the code: the XML is derived, not source. However, tools that work from XML will have to be reviewed if they try to unescape lemmas in entry IDs and work with XML that was not generated by the fixed code.
While correcting, I suggest replacing
which is a NO-OP, with
and change
to
with the regular expressions above