Support for script in addition to language #70

Closed
kba opened this issue Apr 8, 2021 · 21 comments

@kba

kba commented Apr 8, 2021

With the @LANG attribute, ISO 639-3 codes can be used to express the language of the String, TextLine, TextBlock, and at some point PAGE (#55). But what about the script the text is written in? With a @SCRIPT attribute in all places where @LANG is allowed, one could express the script, e.g. as a 4-letter ISO 15924 code. PAGE-XML has support for both, roughly representing ISO 639 and ISO 15924.

@mittagessen
Contributor

ISO 15924 is a horrible mess and not a good place to start. The 'granularity' of the descriptors varies widely and their semantics are unclear, e.g. there exist separate identifiers for the different styles of Syriac but all Arabic scripts share one identifier. Free text fields might be better in the absence of a good standard.

@artunit
Member

artunit commented May 5, 2021

As per the 2021-04-29 Board Meeting, I am adding a link to BCP 47, which might be one possible approach for script identification, based on positive experience at Google.

@cipriandinu
Member

I think that xsd:language can also cover the script topic. According to the xsd:language definition, this is the set of language codes defined by RFC 1766. RFC 1766 (https://www.rfc-editor.org/rfc/pdfrfc/rfc1766.txt.pdf) defines the language tag as:

Language-Tag = Primary-tag *( "-" Subtag )
Primary-tag = 1*8ALPHA
Subtag = 1*8ALPHA

where, in the first subtag:

  • All 2-letter codes are interpreted as ISO 3166 alpha-2
    country codes denoting the area in which the language is
    used.
  • Codes of 3 to 8 letters may be registered with the IANA by
    anyone who feels a need for it, according to the rules in
    chapter 5 of this document.

The information in the subtag may for instance be:

  • Country identification, such as en-US (this usage is
    described in ISO 639)
  • Dialect or variant information, such as no-nynorsk or
    en-cockney
  • Languages not listed in ISO 639 that are not variants of
    any listed language, which can be registered with the
    i-prefix, such as i-cherokee
  • Script variations, such as az-arabic and az-cyrillic

The list of registered codes can be found here:

https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

Using this registry, it is also possible to encode scripts: about 212 scripts are defined there that can be used as a subtag.

Changing to a different standard would probably create backward-compatibility issues.
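
For illustration, the RFC 1766 grammar quoted above can be checked with a regular expression. This is only a sketch: the function name `is_rfc1766_tag` is made up for this example, and it checks syntax only, not whether a subtag is actually registered with IANA.

```python
import re

# Language-Tag = Primary-tag *( "-" Subtag ), where each part is 1*8ALPHA.
RFC1766_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*")


def is_rfc1766_tag(tag: str) -> bool:
    """Return True if `tag` matches the RFC 1766 Language-Tag syntax.

    Hypothetical helper: purely syntactic, no registry lookup.
    """
    return RFC1766_TAG.fullmatch(tag) is not None


# Tags taken from the RFC text quoted above, plus one malformed tag.
for candidate in ("en-US", "az-arabic", "az-cyrillic", "i-cherokee", "az_arabic"):
    print(candidate, is_rfc1766_tag(candidate))
```

Only the last candidate is rejected, because an underscore is not a valid separator.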

@cipriandinu
Member

After discussion on the ALTO board, I propose closing this item, based on the previous comment. @kba, please review and give us your feedback. I will put the topic up for voting (ACCEPT means you agree to close the item).

@kba
Author

kba commented Apr 26, 2022

You're right, xsd:language is flexible enough to express both. We will use the RFC 1766 mechanics for language/script. That should also support subtyping Arabic as @mittagessen wanted AFAICT.

@kba
Author

kba commented Apr 26, 2022

ACCEPT

@cipriandinu
Member

ACCEPT

@cneud
Member

cneud commented May 3, 2022

ACCEPT

@ntra00
Member

ntra00 commented May 3, 2022

ACCEPT

@Haighton

Haighton commented May 4, 2022

ACCEPT

@acpopat
Member

acpopat commented May 4, 2022

ACCEPT

@cipriandinu
Member

@kba it looks like I was in fact a bit out of date: the latest definition of xsd:language refers to BCP 47, which replaces the older standards. But from the ALTO perspective we use xsd:language, and this allows us to cover any specific script as well (this was the original topic), using either BCP 47 or the much older RFC 1766 mechanism.
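
To make the BCP 47 route concrete: there, the script is an optional 4-letter subtag (an ISO 15924 code) that follows the language subtag and any extended-language subtags. A rough sketch of pulling it out of a tag is below. The function name `script_subtag` is hypothetical, and the code assumes a well-formed tag rather than validating one.

```python
from typing import Optional


def _is_alpha(subtag: str, length: int) -> bool:
    """True if `subtag` is exactly `length` ASCII letters."""
    return len(subtag) == length and subtag.isascii() and subtag.isalpha()


def script_subtag(tag: str) -> Optional[str]:
    """Return the script subtag of a BCP 47 tag in title case, or None.

    Hypothetical helper; assumes the tag is well-formed.
    """
    subtags = tag.split("-")
    # Private-use ("x-...") and grandfathered ("i-...") tags carry no script.
    if subtags[0].lower() in ("x", "i"):
        return None
    # Skip the language subtag, then up to three 3-letter extlang subtags.
    i = 1
    while i < len(subtags) and i <= 3 and _is_alpha(subtags[i], 3):
        i += 1
    # A script subtag, if present, comes next and is exactly 4 letters.
    if i < len(subtags) and _is_alpha(subtags[i], 4):
        return subtags[i].title()
    return None


for tag in ("az-Arab", "az-Cyrl", "sr-cyrl-RS", "zh-cmn-Hans-CN", "en-US"):
    print(tag, script_subtag(tag))
```

A tag such as `en-US` yields `None`, since it names a region but no script.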

@cowboyMontana
Member

ACCEPT

@JLoitzenbauer-CRKN

ACCEPT

@callylaw
Member

callylaw commented Oct 6, 2022

ACCEPT

@hanyelsawy
Member

ACCEPT

@Haighton

ACCEPT

@c-sebastien

ACCEPT

@jukervin
Member

ACCEPT

@ntra00
Member

ntra00 commented Dec 15, 2022

ACCEPT

@cipriandinu
Member

Closed based on existing votes
