Escaping of non-ASCII characters in entry XML ID #1107

The escape_lemma(lemma) function, whose purpose is to format the lemma so that it is a valid XML ID, is flawed when it comes to escaping non-ASCII characters.

It converts any such character to  '-%04x-' % ord(c),  which happens 4 times with the current data:
    'oewn-Se-00f1-or-n',
    'oewn-Se-00f1-ora-n',
    'oewn-Se-00f1-orita-n',
    'oewn-Capital-003a-_Critique_of_Political_Economy-n',

Decoding would involve the reverse process of converting any  r'\-[0-9A-Fa-f]{4}\-' back to the character.

The snag is that sequences such as
```
-abbe-
-abed-
-face-
-baba-
-bead-
-beef-
-cafe-
-caff-
-dada-
-dead-
-deaf-
-deed-
-fade-
-feed-
```
also match, since they qualify as valid hex sequences (as does any four-digit number such as -1000-).

These sequences will be found in:

> oewn-1000-a
> oewn-1000-n
> oewn-1728-n
> oewn-2019-nCoV_acute_respiratory_disease-n
> oewn-Adad-n
> oewn-Bade-n
> oewn-Beda-n
> oewn-Bede-n
> oewn-Daba-n
> oewn-Edda-n
> oewn-Rain-in-the-Face-n
> oewn-abbe-n
> oewn-abed-r
> oewn-about-face-n
> oewn-about-face-v
> oewn-baba-n
> oewn-babe-n
> oewn-bead-n
> oewn-bead-v
> oewn-beef-n
> oewn-beef-v
> oewn-cafe-n
> oewn-caff-n
> oewn-cede-v
> oewn-dace-n
> oewn-dada-n
> oewn-dead-a
> oewn-dead-air_space-n
> oewn-dead-burned_lime-n
> oewn-dead-end-a
> oewn-dead-end_street-n
> oewn-dead-man-ap-s-fingers-n
> oewn-dead-man-ap-s_float-n
> oewn-dead-men-ap-s-fingers-n
> oewn-dead-n
> oewn-dead-on-a
> oewn-dead-r
> oewn-deaf-a
> oewn-deaf-aid-n
> oewn-deaf-and-dumb-a
> oewn-deaf-and-dumb_person-n
> oewn-deaf-mute-a
> oewn-deaf-mute-n
> oewn-deaf-muteness-n
> oewn-deaf-mutism-n
> oewn-deaf-n
> oewn-deaf-v
> oewn-deed-n
> oewn-drop-dead-r
> oewn-edda-n
> oewn-face-amount_certificate_company-n
> oewn-face-harden-v
> oewn-face-lift-v
> oewn-face-n
> oewn-face-off-n
> oewn-face-saving-a
> oewn-face-to-face-a
> oewn-face-to-face-r
> oewn-face-v
> oewn-fade-n
> oewn-fade-v
> oewn-feed-n
> oewn-feed-v
> oewn-force-feed-v
> oewn-full-face-a
> oewn-in-your-face-a
> oewn-lie-abed-n
> oewn-pousse-cafe-n
> oewn-pudding-face-n
> oewn-sick-abed-a
> oewn-stone-dead-a
> oewn-stone-deaf-a
> oewn-stone-face-n
> oewn-tone-deaf-a
> oewn-volte-face-n

thus making decoding hazardous (because it's impossible to tell the string 'face' from the hex 'face').
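
To make the ambiguity concrete, here is a minimal sketch of the reverse process described above. The function name `naive_unescape` is hypothetical (it is not part of the existing code); it only applies the four-hex-digit pattern to an entry ID.

```python
import re

# Pattern from the decoding description above: four hex digits between hyphens.
HEX_ESCAPE = re.compile(r'-([0-9A-Fa-f]{4})-')


def naive_unescape(xml_id: str) -> str:
    """Hypothetical decoder: replace every -XXXX- by the code point it names."""
    return HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), xml_id)


# A genuine escape is decoded as intended.
print(naive_unescape('oewn-Se-00f1-or-n'))

# An ordinary lemma that happens to look like hex is corrupted:
# '-face-' is read as U+FACE.
print(ascii(naive_unescape('oewn-face-n')))
```

The decoder has no way to know that the second ID never contained an escape in the first place.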

Added to that, sequences that fall in the range d800-dfff, such as '-dada-', '-dead-', '-deaf-' and '-deed-', decode to Unicode surrogate code points, which are reserved for UTF-16 encoding and raise an error when printed.
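
The surrogate problem can be checked in isolation. This is only an illustrative sketch: it builds the character a decoder would produce for '-dead-' and tries to encode it as UTF-8, which is what printing to a UTF-8 terminal or file has to do.

```python
# Character a hex decoder would produce for the sequence '-dead-'.
lone = chr(0xdead)

# U+D800..U+DFFF is the surrogate range.
is_surrogate = 0xD800 <= ord(lone) <= 0xDFFF

try:
    lone.encode('utf-8')
    encodable = True
except UnicodeEncodeError:
    # Lone surrogates cannot be encoded with the strict UTF-8 codec.
    encodable = False

print(is_surrogate, encodable)
```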

The good news is that Unicode **letters** can be part of an XML ID.

Here are regular expressions for valid NameStartChar and NameChar based on the XML 1.0 specification:
```
import re

# XML 1.0 NameStartChar and NameChar, one character at a time.
# Note: the ':' and the supplementary planes (#x10000-#xEFFFF) that the
# specification also allows are left out of these classes.
name_start_char_re = re.compile(r'^[A-Z_a-z\xC0-\xD6\xD8-\xF6\xF8-\u02FF\u0370-\u037D\u037F-\u1FFF'
                                r'\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF'
                                r'\uF900-\uFDCF\uFDF0-\uFFFD]$')

name_char_re = re.compile(r'^[A-Z_a-z0-9\x2D\x2E\xB7\xC0-\xD6\xD8-\xF6\xF8-\u02FF'
                          r'\u0300-\u036F\u203F-\u2040\u0370-\u037D\u037F-\u1FFF'
                          r'\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF'
                          r'\uF900-\uFDCF\uFDF0-\uFFFD]$')
```

With this in mind,

> 'é' is valid in an XML ID (including as start character)
> 'ñ' is valid in an XML ID (including as start character)
> '汉' is valid in an XML ID (including as start character)
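
These claims can be verified with a few lines of Python. The sketch below repeats the two expressions so that it runs on its own; the helper `classify` is a hypothetical name introduced for the illustration.

```python
import re

# Same character classes as the expressions above.
name_start_char_re = re.compile(r'^[A-Z_a-z\xC0-\xD6\xD8-\xF6\xF8-\u02FF\u0370-\u037D\u037F-\u1FFF'
                                r'\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF'
                                r'\uF900-\uFDCF\uFDF0-\uFFFD]$')

name_char_re = re.compile(r'^[A-Z_a-z0-9\x2D\x2E\xB7\xC0-\xD6\xD8-\xF6\xF8-\u02FF'
                          r'\u0300-\u036F\u203F-\u2040\u0370-\u037D\u037F-\u1FFF'
                          r'\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF'
                          r'\uF900-\uFDCF\uFDF0-\uFFFD]$')


def classify(c: str) -> tuple:
    """Return (valid as first character, valid elsewhere) for one character."""
    return bool(name_start_char_re.match(c)), bool(name_char_re.match(c))


# The three non-ASCII letters, plus ':' and '-' for comparison.
for c in 'éñ汉:-':
    print(repr(c), classify(c))
```

With these expressions the colon is rejected in both positions, while the hyphen is accepted everywhere except as the first character.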

That settles the problem for 3 cases out of 4: señor, señora, señorita. The only remaining problem is the colon, currently used only in 'Capital: Critique of Political Economy'.
Why not use '-cn-' ('-cl-' is not available, being already used by cl for centilitre), which would yield

> oewn-Capital-cn-_Critique_of_Political_Economy-n 

instead of

> oewn-Capital-003a-_Critique_of_Political_Economy-n 

Risky if a 'cn' is later introduced, as an abbreviation for China for instance. Personally I would not accept colons within lemmas, which have been generating problems from the start, only for a single entry. Besides, 'Capital: Critique of Political Economy' can hardly be argued to be a lemma or a dictionary entry (possibly an encyclopedia entry).

This affects only the XML, so there is nothing to fix but code: the XML is derived, not source. However, tools that work from XML will have to be reviewed if they try to unescape lemmas in entry IDs and work with XML that was not generated with the fixed code.

While correcting, I suggest replacing
```
    elif c == '-':
        return '-'
```
which is a NO-OP, with
```
    elif c == ':':
        return '-cn-'
```

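and changing

```
    else:
        return '-%04x-' % ord(c)
```
to
```
    elif name_char_re.match(c):
        return c
    raise ValueError(f'Illegal character {c}')
```
using the regular expressions above. name_char_re is enough here, since every NameStartChar is also a NameChar and the lemma never comes first in the ID (it always follows the 'oewn-' prefix).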
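
Put together, the proposal could look like the sketch below. It is not the actual escape_lemma code: the names `escape_char` and `escape_lemma_sketch` are hypothetical, and the space and apostrophe branches are only inferred from the IDs listed earlier ('_' and '-ap-'); the real function may handle more characters.

```python
import re

# NameChar class from the expressions above.
name_char_re = re.compile(r'^[A-Z_a-z0-9\x2D\x2E\xB7\xC0-\xD6\xD8-\xF6\xF8-\u02FF'
                          r'\u0300-\u036F\u203F-\u2040\u0370-\u037D\u037F-\u1FFF'
                          r'\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF'
                          r'\uF900-\uFDCF\uFDF0-\uFFFD]$')


def escape_char(c: str) -> str:
    """Escape one character of a lemma (sketch, assumptions noted inline)."""
    if c == ' ':
        return '_'        # assumption: inferred from IDs such as dead-air_space
    elif c == "'":
        return '-ap-'     # assumption: inferred from dead-man-ap-s-fingers
    elif c == ':':
        return '-cn-'     # proposed replacement for '-003a-'
    elif name_char_re.match(c):
        return c          # valid in an XML ID as is, including 'ñ'
    raise ValueError(f'Illegal character {c!r}')


def escape_lemma_sketch(lemma: str) -> str:
    """Apply escape_char to every character of the lemma."""
    return ''.join(escape_char(c) for c in lemma)


print(escape_lemma_sketch('señor'))
print(escape_lemma_sketch('Capital: Critique of Political Economy'))
```

Raising on anything unexpected, rather than hex-escaping it, means a new problematic character is noticed when the XML is generated instead of surfacing later as an undecodable ID.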




