Description
The escape_lemma(lemma) function, whose purpose is to format the lemma so that it is a valid XML ID, is flawed when it comes to escaping non-ASCII characters.
It converts any such character (and any other character without a dedicated escape, such as the colon) to '-%04x-' % ord(c), which is used 4 times with the current data:
'oewn-Se-00f1-or-n',
'oewn-Se-00f1-ora-n',
'oewn-Se-00f1-orita-n',
'oewn-Capital-003a-_Critique_of_Political_Economy-n',
Decoding would involve the reverse process of converting any r'-[0-9A-Fa-f]{4}-' back to the character.
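To make the reverse process concrete, here is a minimal sketch of such a decoder. The names HEX_ESCAPE_RE and unescape_hex are mine, purely for illustration; they are not part of the existing code.

```python
import re

# Pattern from the sentence above: four hex digits between two hyphens.
HEX_ESCAPE_RE = re.compile(r'-([0-9A-Fa-f]{4})-')


def unescape_hex(escaped):
    """Replace each -XXXX- escape with the character it encodes (hypothetical helper)."""
    return HEX_ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), escaped)


print(unescape_hex('oewn-Se-00f1-or-n'))
# oewn-Señor-n
print(unescape_hex('oewn-Capital-003a-_Critique_of_Political_Economy-n'))
# oewn-Capital:_Critique_of_Political_Economy-n
```

On the four ids that really contain an escape, this round-trips as expected.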
The snag is that sequences such as
-abbe-
-abed-
-face-
-baba-
-bead-
-beef-
-cafe-
-caff-
-dada-
-dead-
-deaf-
-deed-
-fade-
-feed-
also match, since they qualify as valid hex sequences (as does any four-digit number, like -1000-).
These sequences will be found in:
oewn-1000-a
oewn-1000-n
oewn-1728-n
oewn-2019-nCoV_acute_respiratory_disease-n
oewn-Adad-n
oewn-Bade-n
oewn-Beda-n
oewn-Bede-n
oewn-Daba-n
oewn-Edda-n
oewn-Rain-in-the-Face-n
oewn-abbe-n
oewn-abed-r
oewn-about-face-n
oewn-about-face-v
oewn-baba-n
oewn-babe-n
oewn-bead-n
oewn-bead-v
oewn-beef-n
oewn-beef-v
oewn-cafe-n
oewn-caff-n
oewn-cede-v
oewn-dace-n
oewn-dada-n
oewn-dead-a
oewn-dead-air_space-n
oewn-dead-burned_lime-n
oewn-dead-end-a
oewn-dead-end_street-n
oewn-dead-man-ap-s-fingers-n
oewn-dead-man-ap-s_float-n
oewn-dead-men-ap-s-fingers-n
oewn-dead-n
oewn-dead-on-a
oewn-dead-r
oewn-deaf-a
oewn-deaf-aid-n
oewn-deaf-and-dumb-a
oewn-deaf-and-dumb_person-n
oewn-deaf-mute-a
oewn-deaf-mute-n
oewn-deaf-muteness-n
oewn-deaf-mutism-n
oewn-deaf-n
oewn-deaf-v
oewn-deed-n
oewn-drop-dead-r
oewn-edda-n
oewn-face-amount_certificate_company-n
oewn-face-harden-v
oewn-face-lift-v
oewn-face-n
oewn-face-off-n
oewn-face-saving-a
oewn-face-to-face-a
oewn-face-to-face-r
oewn-face-v
oewn-fade-n
oewn-fade-v
oewn-feed-n
oewn-feed-v
oewn-force-feed-v
oewn-full-face-a
oewn-in-your-face-a
oewn-lie-abed-n
oewn-pousse-cafe-n
oewn-pudding-face-n
oewn-sick-abed-a
oewn-stone-dead-a
oewn-stone-deaf-a
oewn-stone-face-n
oewn-tone-deaf-a
oewn-volte-face-n
thus making decoding hazardous (because it's impossible to tell the string 'face' from the hex 'face').
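The collision is easy to reproduce. The sketch below runs the decoding pattern over a few ids copied from the list above, none of which contains a real escape; HEX_ESCAPE_RE and sample_ids are illustrative names, not existing code.

```python
import re

# Same pattern a decoder would have to use: four hex digits between hyphens.
HEX_ESCAPE_RE = re.compile(r'-([0-9A-Fa-f]{4})-')

# Ids taken from the list above; all of them are plain lemmas.
sample_ids = ['oewn-1000-n', 'oewn-face-n', 'oewn-about-face-n', 'oewn-cafe-n']

# Map each id to the "escapes" the pattern believes it has found.
false_positives = {entry_id: HEX_ESCAPE_RE.findall(entry_id) for entry_id in sample_ids}

for entry_id, hits in false_positives.items():
    print(entry_id, '->', hits)
```

Every one of these ids yields a match, so a decoder would silently replace 'face' with the character U+FACE, and so on.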
Added to that, the '-de..-' sequences (and likewise '-da..-' ones such as '-dada-') decode to Unicode surrogate code points (U+D800 to U+DFFF), which are reserved for UTF-16 encoding and raise an error when printed.
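A short check of the surrogate problem, assuming UTF-8 output (the usual case); can_encode is a helper I made up for the illustration.

```python
# U+D800..U+DFFF is the surrogate block; these "words" all land inside it.
for word in ('dead', 'deaf', 'deed', 'dada'):
    assert 0xD800 <= int(word, 16) <= 0xDFFF


def can_encode(text):
    """Return True if text can be encoded as UTF-8 (hypothetical helper)."""
    try:
        text.encode('utf-8')
        return True
    except UnicodeEncodeError:
        return False


print(can_encode(chr(0x00f1)))  # True: 'ñ' is an ordinary character
print(can_encode(chr(0xdead)))  # False: a lone surrogate cannot be encoded
```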
The good news is that Unicode letters can be part of an XML ID.
Here are regular expressions for valid NameStartChar and NameChar, based on the XML 1.0 specification (they leave out the colon and the code points above U+FFFF, which the specification also allows):
import re

# Characters allowed as the first character of an XML name.
name_start_char_re = re.compile(
    r'^[A-Z_a-z\xC0-\xD6\xD8-\xF6\xF8-\u02FF\u0370-\u037D\u037F-\u1FFF'
    r'\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF'
    r'\uF900-\uFDCF\uFDF0-\uFFFD]$')
# Characters allowed anywhere after the first character of an XML name.
name_char_re = re.compile(
    r'^[A-Z_a-z0-9\x2D\x2E\xB7\xC0-\xD6\xD8-\xF6\xF8-\u02FF'
    r'\u0300-\u036F\u203F-\u2040\u0370-\u037D\u037F-\u1FFF'
    r'\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF'
    r'\uF900-\uFDCF\uFDF0-\uFFFD]$')
With this in mind,
'é' is valid in an XML ID (including as start character)
'ñ' is valid in an XML ID (including as start character)
'汉' is valid in an XML ID (including as start character)
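These three claims can be checked directly against the two patterns. The snippet repeats them so that it runs on its own; the colon is added as a counter-example.

```python
import re

# The two patterns quoted above, repeated here so the snippet is self-contained.
name_start_char_re = re.compile(
    r'^[A-Z_a-z\xC0-\xD6\xD8-\xF6\xF8-\u02FF\u0370-\u037D\u037F-\u1FFF'
    r'\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF'
    r'\uF900-\uFDCF\uFDF0-\uFFFD]$')
name_char_re = re.compile(
    r'^[A-Z_a-z0-9\x2D\x2E\xB7\xC0-\xD6\xD8-\xF6\xF8-\u02FF'
    r'\u0300-\u036F\u203F-\u2040\u0370-\u037D\u037F-\u1FFF'
    r'\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF'
    r'\uF900-\uFDCF\uFDF0-\uFFFD]$')

# Report, for each character, whether it may start a name and whether it may appear in one.
for ch in ('é', 'ñ', '汉', ':'):
    print(repr(ch), bool(name_start_char_re.match(ch)), bool(name_char_re.match(ch)))
```

The first three characters match both patterns; the colon matches neither, which is why it still needs an escape.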
That settles the problem for 3 cases out of 4: señor, señora, señorita. The only remaining problem is the colon, currently used only in 'Capital: Critique of Political Economy'.
Why not use '-cn-' ('-cl-' is not available, being already taken by 'cl' for centilitre), which would yield
oewn-Capital-cn-_Critique_of_Political_Economy-n
instead of
oewn-Capital-003a-_Critique_of_Political_Economy-n
This is risky if a 'cn' entry is later introduced, as an abbreviation for China for instance. Personally, I would not accept colons within lemmas: they have been generating problems from the start, and all for a single entry. Besides, 'Capital: Critique of Political Economy' can hardly be argued to be a lemma or a dictionary entry (possibly an encyclopedia entry).
This affects only the XML, so there is nothing to fix but the code, because the XML is merely derived, not source. However, tools that work from the XML will have to be reviewed if they try to unescape lemmas in entry ids and may be fed XML that was not generated with the fixed code.
While correcting, I suggest replacing
elif c == '-':
    return '-'
which returns the character unchanged and becomes redundant once valid name characters are passed through, with
elif c == ':':
    return '-cn-'
and changing
else:
    return '-%04x-' % ord(c)
to
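elif name_char_re.match(c):
    return c
raise ValueError(f'Illegal character {c}')
using the regular expressions above. Testing name_char_re alone is enough: it accepts everything name_start_char_re accepts, and the lemma never starts the ID, which begins with 'oewn-'.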
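Put together, the proposal could look like the sketch below. It is not the real function: escape_char and escape_lemma_sketch are hypothetical names, and only the branches that can be read off the ids quoted in this issue are included (space to '_', apostrophe to '-ap-'). The actual function presumably has more branches, which would stay as they are.

```python
import re

# NameChar pattern from this issue, repeated so the sketch is self-contained.
name_char_re = re.compile(
    r'^[A-Z_a-z0-9\x2D\x2E\xB7\xC0-\xD6\xD8-\xF6\xF8-\u02FF'
    r'\u0300-\u036F\u203F-\u2040\u0370-\u037D\u037F-\u1FFF'
    r'\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF'
    r'\uF900-\uFDCF\uFDF0-\uFFFD]$')


def escape_char(c):
    """Escape one character (hypothetical; other existing branches omitted)."""
    if c == ' ':
        return '_'       # as seen in '..._Critique_of_Political_Economy'
    elif c == "'":
        return '-ap-'    # as seen in 'dead-man-ap-s_float'
    elif c == ':':
        return '-cn-'    # proposed replacement for '-003a-'
    elif name_char_re.match(c):
        return c         # valid XML name character: keep as is
    raise ValueError(f'Illegal character {c!r}')


def escape_lemma_sketch(lemma):
    """Assumed shape of escape_lemma: escape every character and join."""
    return ''.join(escape_char(c) for c in lemma)


print(escape_lemma_sketch('señor'))
print(escape_lemma_sketch('Capital: Critique of Political Economy'))
```

With this, 'señor' is left untouched, the colon becomes '-cn-', and anything that is neither handled nor a valid name character fails loudly instead of producing an ambiguous hex escape.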