Support for script in addition to language #70
Comments
ISO 15924 is a horrible mess and not a good place to start. The 'granularity' of the descriptors varies widely and their semantics are unclear, e.g. there exist separate identifiers for the different styles of Syriac but all Arabic scripts share one identifier. Free text fields might be better in the absence of a good standard. |
As per the 2021-04-29 Board Meeting, I am adding a link to BCP 47, which might be one possible approach for script identification, based on positive experience at Google. |
I think that xsd:language can also cover the script topic. According to the xsd:language definition, this is the set of language codes defined by RFC 1766. RFC 1766 (https://www.rfc-editor.org/rfc/pdfrfc/rfc1766.txt.pdf) defines the language tag as: Language-Tag = Primary-tag *( "-" Subtag ) where first subtag is:
The list of registered codes can be found here: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry With this registry it is also possible to encode scripts: about 212 scripts are defined that can be used as a subtag. Changing to a different standard would probably create backward-compatibility issues. |
After discussion on the ALTO board, I propose to close this item, based on the previous comment. @kba please review and give us your feedback. I will put the topic to a vote (ACCEPT means you agree to close the item). |
You're right, |
ACCEPT |
ACCEPT |
ACCEPT |
ACCEPT |
ACCEPT |
ACCEPT |
@kba it looks like I was in fact a bit outdated: the latest definition of xsd:language refers to BCP 47, which replaces the older standards. But from the ALTO perspective we use xsd:language, and this allows covering any specific script as well (this was the original topic), using either BCP 47 or the much older RFC 1766 mechanism. |
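To illustrate how a script can ride along inside an xsd:language value, here is a minimal Python sketch that pulls the script subtag out of a BCP 47 style tag such as `sr-Cyrl` or `zh-Hant-TW`. The function name `script_subtag` is hypothetical, and the parsing is deliberately simplified: it assumes a well-formed tag, ignores grandfathered and private-use tags, and does not check the subtag against the IANA registry.

```python
from typing import Optional


def _is_ascii_alpha(s: str) -> bool:
    """True if s is non-empty and consists only of ASCII letters."""
    return s.isascii() and s.isalpha()


def script_subtag(tag: str) -> Optional[str]:
    """Return the 4-letter script subtag of a language tag, or None.

    Simplified sketch: handles language[-extlang]*[-script]... only.
    """
    parts = tag.split("-")
    # The primary language subtag is 2 to 8 ASCII letters; anything else
    # (e.g. a private-use tag starting with "x") is not handled here.
    if not (2 <= len(parts[0]) <= 8 and _is_ascii_alpha(parts[0])):
        return None
    i = 1
    # Skip up to three extended-language subtags (three letters each).
    while i < len(parts) and i <= 3 and len(parts[i]) == 3 and _is_ascii_alpha(parts[i]):
        i += 1
    # A script subtag, if present, comes next and is exactly four letters.
    if i < len(parts) and len(parts[i]) == 4 and _is_ascii_alpha(parts[i]):
        return parts[i].title()
    return None


print(script_subtag("sr-Cyrl"))
print(script_subtag("zh-Hant-TW"))
print(script_subtag("de-DE"))
```

A consumer of ALTO could apply something like this to the value of `LANG` to recover the script without a separate attribute, which is the point made in the comment above.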
ACCEPT |
ACCEPT |
ACCEPT |
ACCEPT |
ACCEPT |
ACCEPT |
ACCEPT |
ACCEPT |
Closed based on existing votes |
With the @LANG attribute, ISO 639-3 codes can be used to express the language of the String, TextLine, TextBlock, and at some point PAGE (#55). But what about the script the text is written in? With a @SCRIPT attribute in all places where @LANG is allowed, one could express the script, e.g. as a 4-letter ISO 15924 code. PAGE-XML has support for both, roughly representing ISO 639 and ISO 15924.