Escaping of non-ascii characters in entry XML ID #1107

Open
@1313ou

The escape_lemma(lemma) function, whose purpose is to format the lemma so that it is a valid XML ID, is flawed when it comes to escaping non-ASCII characters.

It converts any such character to '-%04x-' % ord(c), which is used 4 times with the current data:
'oewn-Se-00f1-or-n',
'oewn-Se-00f1-ora-n',
'oewn-Se-00f1-orita-n',
'oewn-Capital-003a-_Critique_of_Political_Economy-n',

Decoding would involve the reverse process of converting any r'-[0-9A-Fa-f]{4}-' back to the character.
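
The round trip described above can be sketched as follows. This is a minimal illustration, not the project's code: `escape_char_current` and `naive_unescape` are hypothetical names, and only the fallback branch of escape_lemma is reproduced.

```python
import re

# Pattern for the escape sequences described above: four hex digits between hyphens.
HEX_ESCAPE_RE = re.compile(r'-([0-9A-Fa-f]{4})-')


def escape_char_current(c):
    # Hypothetical stand-in for the fallback branch of escape_lemma only.
    return '-%04x-' % ord(c)


def naive_unescape(escaped):
    # Reverse process: turn every -XXXX- sequence back into a character.
    return HEX_ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), escaped)


print(escape_char_current(':'))                        # -003a-
print(ascii(escape_char_current('ñ')))                 # '-00f1-'
print(naive_unescape('Se-00f1-or') == 'Se\u00f1or')    # True
```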

The snag is such sequences as

-abbe-
-abed-
-face-
-baba-
-bead-
-beef-
-cafe-
-caff-
-dada-
-dead-
-deaf-
-deed-
-fade-
-feed-

also match, since they qualify as valid hex sequences (as does any four-digit number such as -1000-).

These sequences will be found in:

oewn-1000-a
oewn-1000-n
oewn-1728-n
oewn-2019-nCoV_acute_respiratory_disease-n
oewn-Adad-n
oewn-Bade-n
oewn-Beda-n
oewn-Bede-n
oewn-Daba-n
oewn-Edda-n
oewn-Rain-in-the-Face-n
oewn-abbe-n
oewn-abed-r
oewn-about-face-n
oewn-about-face-v
oewn-baba-n
oewn-babe-n
oewn-bead-n
oewn-bead-v
oewn-beef-n
oewn-beef-v
oewn-cafe-n
oewn-caff-n
oewn-cede-v
oewn-dace-n
oewn-dada-n
oewn-dead-a
oewn-dead-air_space-n
oewn-dead-burned_lime-n
oewn-dead-end-a
oewn-dead-end_street-n
oewn-dead-man-ap-s-fingers-n
oewn-dead-man-ap-s_float-n
oewn-dead-men-ap-s-fingers-n
oewn-dead-n
oewn-dead-on-a
oewn-dead-r
oewn-deaf-a
oewn-deaf-aid-n
oewn-deaf-and-dumb-a
oewn-deaf-and-dumb_person-n
oewn-deaf-mute-a
oewn-deaf-mute-n
oewn-deaf-muteness-n
oewn-deaf-mutism-n
oewn-deaf-n
oewn-deaf-v
oewn-deed-n
oewn-drop-dead-r
oewn-edda-n
oewn-face-amount_certificate_company-n
oewn-face-harden-v
oewn-face-lift-v
oewn-face-n
oewn-face-off-n
oewn-face-saving-a
oewn-face-to-face-a
oewn-face-to-face-r
oewn-face-v
oewn-fade-n
oewn-fade-v
oewn-feed-n
oewn-feed-v
oewn-force-feed-v
oewn-full-face-a
oewn-in-your-face-a
oewn-lie-abed-n
oewn-pousse-cafe-n
oewn-pudding-face-n
oewn-sick-abed-a
oewn-stone-dead-a
oewn-stone-deaf-a
oewn-stone-face-n
oewn-tone-deaf-a
oewn-volte-face-n

thus making decoding hazardous (because it's impossible to tell the string 'face' from the hex 'face').
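
To make the hazard concrete, here is a small sketch that applies the reverse regex to a few of the IDs listed above. The helper `naive_unescape` is a hypothetical name for the decoding step; results are printed with ascii() so the corrupted characters are visible as escapes.

```python
import re

HEX_ESCAPE_RE = re.compile(r'-([0-9A-Fa-f]{4})-')


def naive_unescape(escaped):
    # Hypothetical decoder: replaces every -XXXX- with the matching character.
    return HEX_ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), escaped)


for entry_id in ('oewn-Se-00f1-or-n', 'oewn-about-face-n', 'oewn-1000-n'):
    decoded = naive_unescape(entry_id)
    # The first ID is decoded as intended; the other two are silently corrupted.
    print(entry_id, '->', ascii(decoded))
```

Only the first ID contains a genuine escape, yet all three are rewritten, because the decoder has no way to know that 'face' and '1000' were literal text.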

On top of that, the '-de..-' sequences (-dead-, -deaf-, -deed-) decode to code points in the Unicode surrogate range (U+D800 to U+DFFF), which are reserved for UTF-16 encoding and raise an error when printed.
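
A short sketch of the surrogate problem, assuming the decoder builds characters with chr() as in the reverse process described earlier. `is_surrogate` is a hypothetical helper introduced only for this illustration.

```python
def is_surrogate(ch):
    # Code points U+D800..U+DFFF are reserved for UTF-16 surrogate pairs.
    return 0xD800 <= ord(ch) <= 0xDFFF


for hex_word in ('dead', 'deaf', 'deed'):
    ch = chr(int(hex_word, 16))
    print(hex_word, is_surrogate(ch))  # True for all three
    try:
        # Printing to a UTF-8 stream needs this encoding step to succeed.
        ch.encode('utf-8')
    except UnicodeEncodeError as err:
        print('  cannot encode:', err.reason)
```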

The good news is that Unicode letters can be part of an XML ID.

Here are regular expressions for valid NameStartChar and NameChar based on the XML 1.0 specification:

import re

name_start_char_re = re.compile(r'^[A-Z_a-z\xC0-\xD6\xD8-\xF6\xF8-\u02FF\u0370-\u037D\u037F-\u1FFF'
                                r'\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF'
                                r'\uF900-\uFDCF\uFDF0-\uFFFD]$')

name_char_re = re.compile(r'^[A-Z_a-z0-9\x2D\x2E\xB7\xC0-\xD6\xD8-\xF6\xF8-\u02FF'
                          r'\u0300-\u036F\u203F-\u2040\u0370-\u037D\u037F-\u1FFF'
                          r'\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF'
                          r'\uF900-\uFDCF\uFDF0-\uFFFD]$')

With this in mind,

'é' is valid in XML ID (including as start character)
'ñ' is valid in XML ID (including as start character)
'汉' is valid in XML ID (including as start character)
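
The check behind these three statements can be reproduced with the regular expressions above, copied here so the snippet runs on its own. The colon and the hyphen are added for comparison; the helper names `is_name_start_char` and `is_name_char` are hypothetical.

```python
import re

name_start_char_re = re.compile(r'^[A-Z_a-z\xC0-\xD6\xD8-\xF6\xF8-\u02FF\u0370-\u037D\u037F-\u1FFF'
                                r'\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF'
                                r'\uF900-\uFDCF\uFDF0-\uFFFD]$')

name_char_re = re.compile(r'^[A-Z_a-z0-9\x2D\x2E\xB7\xC0-\xD6\xD8-\xF6\xF8-\u02FF'
                          r'\u0300-\u036F\u203F-\u2040\u0370-\u037D\u037F-\u1FFF'
                          r'\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF'
                          r'\uF900-\uFDCF\uFDF0-\uFFFD]$')


def is_name_start_char(c):
    return bool(name_start_char_re.match(c))


def is_name_char(c):
    return bool(name_char_re.match(c))


for c in ('é', 'ñ', '汉', '-', ':'):
    # ascii() keeps the output readable on any terminal encoding.
    print(ascii(c), is_name_start_char(c), is_name_char(c))
```

With these two patterns, the three letters pass both checks, the hyphen passes only as a non-initial character, and the colon passes neither.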

That settles the problem for 3 cases out of 4: señor, señora, señorita. The only remaining problem is the colon, currently used only in 'Capital: Critique of Political Economy'.
Why not use '-cn-' ('-cl-' is not available, being already used by cl for centilitre), which would yield

oewn-Capital-cn-_Critique_of_Political_Economy-n

instead of

oewn-Capital-003a-_Critique_of_Political_Economy-n

This is risky if a 'cn' entry is later introduced, as an abbreviation for China for instance. Personally I would not accept colons within lemmas: they have been generating problems from the start, and only for a single entry. Besides, 'Capital: Critique of Political Economy' can hardly be argued to be a lemma or a dictionary entry (possibly an encyclopedia entry).

This affects only the XML, so there is nothing to fix but code, because the XML is derived, not source. However, tools that work from XML will have to be reviewed if they try to unescape lemmas in entry IDs and work with XML that was not generated with the fixed code.

While correcting, I suggest replacing

    elif c == '-':
        return '-'

which is a no-op, with

    elif c == ':':
        return '-cn-'

and change

    else:
        return '-%04x-' % ord(c)

to

    elif name_start_char_re.match(c) or name_char_re.match(c):
        return c
    raise ValueError(f'Illegal character {c}')

using the regular expressions above.
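
Put together, the proposal could look like the sketch below. It is not the actual escape_lemma: `escape_char` and `escape_lemma_sketch` are hypothetical names, the space-to-underscore branch is only inferred from the IDs quoted in this issue, and the other existing branches are left out. Only name_char_re is tested because it is a superset of name_start_char_re and the escaped lemma always follows the 'oewn-' prefix.

```python
import re

name_char_re = re.compile(r'^[A-Z_a-z0-9\x2D\x2E\xB7\xC0-\xD6\xD8-\xF6\xF8-\u02FF'
                          r'\u0300-\u036F\u203F-\u2040\u0370-\u037D\u037F-\u1FFF'
                          r'\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF'
                          r'\uF900-\uFDCF\uFDF0-\uFFFD]$')


def escape_char(c):
    if c == ' ':
        return '_'        # assumption: inferred from the IDs listed in this issue
    elif c == ':':
        return '-cn-'     # proposed replacement for '-003a-'
    elif name_char_re.match(c):
        return c          # valid XML name character, kept as-is
    raise ValueError(f'Illegal character {c!r}')


def escape_lemma_sketch(lemma):
    # Hypothetical wrapper: escape each character and join the results.
    return ''.join(escape_char(c) for c in lemma)


print(escape_lemma_sketch('Capital: Critique of Political Economy'))
# Capital-cn-_Critique_of_Political_Economy
print(escape_lemma_sketch('Señor') == 'Señor')  # True
```

Raising on anything unexpected, rather than falling back to a hex escape, means a new problematic character is noticed when the XML is generated instead of producing an ID that cannot be decoded.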

Labels: bug