Add bidi support and address UAX31/UTS55 requirements #884
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -134,17 +134,23 @@ A **_<dfn>local variable</dfn>_** is a _variable_ created as the result of a _lo | |
> > An exception to this is: whitespace inside a _pattern_ is **always** significant. | ||
|
||
> [!NOTE] | ||
> The syntax assumes that each _message_ will be displayed with a left-to-right display order | ||
> The MessageFormat 2 syntax assumes that each _message_ will be displayed | ||
> with a left-to-right display order | ||
> and be processed in the logical character order. | ||
> The syntax also permits the use of right-to-left characters in _identifiers_, | ||
> The syntax permits the use of right-to-left characters in _identifiers_, | ||
> _literals_, and other values. | ||
> This can result in confusion when viewing the _message_. | ||
> This can result in confusion when viewing the message | ||
> or users might incorrectly insert bidi controls or marks that negatively affect the output | ||
> of the message. | ||
> | ||
> To assist with this, the syntax permits the use of various controls and | ||
> strongly-directional markers in both optional and required _whitespace_ | ||
> in a _message_, as well as encouraging the use of isolating controls | ||
> with _expressions_ and _quoted patterns_. | ||
> See: [whitespace](#whitespace) (below) for more information. | ||
> | ||
> Additional restrictions or requirements, | ||
> such as permitting the use of certain bidirectional control characters in the syntax, | ||
> might be added during the Tech Preview to better manage bidirectional text. | ||
> Feedback on the creation and management of _messages_ | ||
> containing bidirectional tokens is strongly desired. | ||
> Additional restrictions or requirements might be added during the | ||
> Tech Preview to better manage bidirectional text. | ||
|
||
A _message_ can be a _simple message_ or it can be a _complex message_. | ||
|
||
|
@@ -160,7 +166,7 @@ Whitespace at the start or end of a _simple message_ is significant, | |
and a part of the _text_ of the _message_. | ||
|
||
```abnf | ||
simple-message = [s] [simple-start pattern] | ||
simple-message = owsp [simple-start pattern] | ||
simple-start = simple-start-char / escaped-char / placeholder | ||
``` | ||
|
||
|
@@ -176,7 +182,7 @@ Whitespace at the start or end of a _complex message_ is not significant, | |
and does not affect the processing of the _message_. | ||
|
||
```abnf | ||
complex-message = [s] *(declaration [s]) complex-body [s] | ||
complex-message = owsp *(declaration owsp) complex-body owsp | ||
``` | ||
|
||
### Declarations | ||
|
@@ -193,8 +199,8 @@ A **_<dfn>local-declaration</dfn>_** binds a _variable_ to the resolved value of | |
|
||
```abnf | ||
declaration = input-declaration / local-declaration | ||
input-declaration = input [s] variable-expression | ||
local-declaration = local s variable [s] "=" [s] expression | ||
input-declaration = input owsp variable-expression | ||
local-declaration = local wsp variable owsp "=" owsp expression | ||
``` | ||
|
||
_Variables_, once declared, MUST NOT be redeclared. | ||
|
@@ -254,7 +260,7 @@ A _quoted pattern_ starts with a sequence of two U+007B LEFT CURLY BRACKET `{{` | |
and ends with a sequence of two U+007D RIGHT CURLY BRACKET `}}`. | ||
|
||
```abnf | ||
quoted-pattern = "{{" pattern "}}" | ||
quoted-pattern = owsp "{{" pattern "}}" owsp | ||
``` | ||
|
||
A _quoted pattern_ MAY be empty. | ||
|
@@ -352,8 +358,8 @@ otherwise, a corresponding _Data Model Error_ will be produced during processing | |
_Literal_ _keys_ are compared by their contents, not their syntactical appearance. | ||
|
||
```abnf | ||
matcher = match-statement s variant *([s] variant) | ||
match-statement = match 1*(s selector) | ||
matcher = match-statement wsp variant *(owsp variant) | ||
match-statement = match 1*(wsp selector) | ||
``` | ||
|
||
> A _message_ with a _matcher_: | ||
|
@@ -425,7 +431,7 @@ Each _key_ is separated from each other by whitespace. | |
Whitespace is permitted but not required between the last _key_ and the _quoted pattern_. | ||
|
||
```abnf | ||
variant = key *(s key) [s] quoted-pattern | ||
variant = owsp key *(wsp key) owsp quoted-pattern | ||
key = literal / "*" | ||
``` | ||
|
||
|
@@ -461,9 +467,9 @@ A **_<dfn>function-expression</dfn>_** contains a _function_ without an _operand | |
expression = literal-expression | ||
/ variable-expression | ||
/ function-expression | ||
literal-expression = "{" [s] literal [s function] *(s attribute) [s] "}" | ||
variable-expression = "{" [s] variable [s function] *(s attribute) [s] "}" | ||
function-expression = "{" [s] function *(s attribute) [s] "}" | ||
literal-expression = "{" owsp literal [wsp function] *(wsp attribute) owsp "}" | ||
variable-expression = "{" owsp variable [wsp function] *(wsp attribute) owsp "}" | ||
function-expression = "{" owsp function *(wsp attribute) owsp "}" | ||
``` | ||
|
||
There are several types of _expression_ that can appear in a _message_. | ||
|
@@ -520,7 +526,7 @@ The _identifier_ MAY be followed by one or more _options_. | |
_Options_ are not required. | ||
|
||
```abnf | ||
function = ":" identifier *(s option) | ||
function = ":" identifier *(wsp option) | ||
``` | ||
|
||
> A _message_ with a _function_ operating on the _variable_ `$now`: | ||
|
@@ -549,7 +555,7 @@ and will produce a _Duplicate Option Name_ error during processing. | |
The order of _options_ is not significant. | ||
|
||
```abnf | ||
option = identifier [s] "=" [s] (literal / variable) | ||
option = identifier owsp "=" owsp (literal / variable) | ||
``` | ||
|
||
> Examples of _functions_ with _options_ | ||
|
@@ -594,8 +600,8 @@ It MAY include _options_. | |
is a _pattern_ part ending a span. | ||
|
||
```abnf | ||
markup = "{" [s] "#" identifier *(s option) *(s attribute) [s] ["/"] "}" ; open and standalone | ||
/ "{" [s] "/" identifier *(s option) *(s attribute) [s] "}" ; close | ||
markup = "{" owsp "#" identifier *(wsp option) *(wsp attribute) owsp ["/"] "}" ; open and standalone | ||
/ "{" owsp "/" identifier *(wsp option) *(wsp attribute) owsp "}" ; close | ||
``` | ||
|
||
> A _message_ with one `button` markup span and a standalone `img` markup element: | ||
|
@@ -637,7 +643,7 @@ all but the last _attribute_ with the same _identifier_ are ignored. | |
The order of _attributes_ is not otherwise significant. | ||
|
||
```abnf | ||
attribute = "@" identifier [[s] "=" [s] literal] | ||
attribute = "@" identifier [owsp "=" owsp literal] | ||
``` | ||
|
||
> Examples of _expressions_ and _markup_ with _attributes_: | ||
|
@@ -763,7 +769,7 @@ in this release. | |
|
||
```abnf | ||
variable = "$" name | ||
option = identifier [s] "=" [s] (literal / variable) | ||
option = identifier owsp "=" owsp (literal / variable) | ||
|
||
identifier = [namespace ":"] name | ||
namespace = name | ||
|
@@ -803,24 +809,104 @@ and inside _patterns_ only escape `{` and `}`. | |
|
||
### Whitespace | ||
|
||
**_<dfn>Whitespace</dfn>_** is defined as one or more of | ||
U+0009 CHARACTER TABULATION (tab), | ||
U+000A LINE FEED (new line), | ||
U+000D CARRIAGE RETURN, | ||
U+3000 IDEOGRAPHIC SPACE, | ||
or U+0020 SPACE. | ||
The syntax limits whitespace characters outside of a _pattern_ to the following: | ||
`U+0009 CHARACTER TABULATION` (tab), | ||
`U+000A LINE FEED` (new line), | ||
`U+000D CARRIAGE RETURN`, | ||
`U+3000 IDEOGRAPHIC SPACE`, | ||
or `U+0020 SPACE`. | ||
|
||
Inside _patterns_ and _quoted literals_, | ||
whitespace is part of the content and is recorded and stored verbatim. | ||
Whitespace is not significant outside translatable text, except where required by the syntax. | ||
|
||
There are two whitespace productions in the syntax. | ||
**_<dfn>Optional whitespace</dfn>_** is whitespace that is not required by the syntax, | ||
but which users might want to include to increase the readability of a _message_. | ||
**_<dfn>Required whitespace</dfn>_** is whitespace that is required by the syntax. | ||
|
||
Both types of whitespace optionally permit the use of the bidirectional isolate controls | ||
and certain strongly directional marks. | ||
These can assist users in presenting _messages_ that contain right-to-left | ||
text, _literals_, or _names_ (including those for _functions_, _options_, | ||
_option values_, and _keys_). | ||
|
||
_Messages_ that contain right-to-left (aka RTL) characters SHOULD use one of the | ||
following mechanisms to make messages display intelligibly in plain-text editors: | ||
|
||
1. Use paired isolating bidi controls `U+2066 LEFT-TO-RIGHT ISOLATE` | ||
and `U+2069 POP DIRECTIONAL ISOLATE` as permitted by the ABNF around | ||
parts of any _message_ containing RTL characters: | ||
- _inside_ of _placeholder_ markers `{` and `}` | ||
- _outside_ _quoted-pattern_ markers `{{` and `}}` | ||
- _identifiers_ | ||
- _literals_ (This is especially important for individual _keys_ in a _variant_) | ||
- _option_ values | ||
|
||
2. Use the 'local-effect' bidi marks | ||
`U+061C ARABIC LETTER MARK`, `U+200E LEFT-TO-RIGHT MARK` or | ||
`U+200F RIGHT-TO-LEFT MARK` as permitted by the ABNF before or after _identifiers_, | ||
_names_, unquoted _literals_, or _option_ values, | ||
especially when the values contain a mix of neutral, weakly directional, and | ||
strongly directional characters. | ||
|
||
> [!IMPORTANT] | ||
> Always take care **not** to add bidirectional controls or marks | ||
> where they would be semantically significant | ||
> or where they would unintentionally become part of the _message_'s output: | ||
> - do not put them inside of a _literal_ except when they are part of the value, | ||
> (instead put them outside of _literal_ quotes, such as `<LRM>|...|<LRM>`) | ||
> - do not put them inside quoted _patterns_ except when they are part of the text, | ||
> (instead put them outside of quoted _patterns_, such as `<LRI>{{...}}<PDI>`) | ||
> - do not put them outside _placeholders_, | ||
> (instead put them inside the _placeholder_, such as `{<LRI>$foo :number<PDI>}`) | ||
> | ||
> Controls placed inside _literal_ quotes or quoted _patterns_ are part of the _literal_ | ||
> or _pattern_. | ||
> Controls in a _pattern_ will appear in the output of the message. | ||
> Controls inside _literal_ quotes are part of the _literal_ and | ||
> will be considered in operations such as matching a _key_ to a _selector_. | ||
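
The difference between a mark placed outside and inside _literal_ quotes can be illustrated with a small sketch. It is not a MessageFormat 2 parser: `quoted_literal_value` is a hypothetical helper that ignores escape sequences and only strips the whitespace and bidi characters that the ABNF in this change allows around a quoted _literal_.

```python
# Characters allowed by the `s` and `bidi` productions in this change.
WHITESPACE = " \t\r\n\u3000"
BIDI = "\u061c\u200e\u200f\u2066\u2067\u2068\u2069"
LRM = "\u200e"


def quoted_literal_value(token: str) -> str:
    """Return the contents of a |quoted| literal (hypothetical helper).

    Whitespace and bidi characters outside the quotes are dropped;
    everything between the quotes is kept verbatim. Escapes are not handled.
    """
    stripped = token.strip(WHITESPACE + BIDI)
    if len(stripped) < 2 or stripped[0] != "|" or stripped[-1] != "|":
        raise ValueError("not a quoted literal")
    return stripped[1:-1]


outside = quoted_literal_value(LRM + "|one|" + LRM)  # mark outside the quotes
inside = quoted_literal_value("|" + LRM + "one|")    # mark inside the quotes

print(outside == "one")  # True: the mark is not part of the value
print(inside == "one")   # False: the mark became part of the value
```

With the mark inside the quotes, a _key_ written this way would no longer compare equal to the value `one`.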
|
||
> [!NOTE] | ||
> Users cannot be expected to create or manage bidirectional controls or | ||
> marks in _messages_, since the characters are invisible and can be difficult | ||
> to manage. | ||
> Tools (such as resource editors or translation editors) | ||
> and other implementations of MessageFormat 2 serialization are strongly | ||
> encouraged to provide paired isolates around any right-to-left | ||
> syntax as described above so that _messages_ display appropriately as plain text. | ||
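
A serializer could apply the recommendation above roughly as follows. This is a minimal sketch under stated assumptions: `has_rtl` and `isolate_placeholder` are hypothetical names, the input is the text between `{` and `}` of a _placeholder_, and a character counts as right-to-left when its Unicode bidi class is `R` or `AL`.

```python
import unicodedata

LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def has_rtl(text: str) -> bool:
    """True if any character has bidi class R or AL."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


def isolate_placeholder(body: str) -> str:
    """Serialize a placeholder, adding LRI...PDI inside the braces if needed.

    The isolates go inside `{` and `}`, where the ABNF permits optional
    whitespace, so they do not become part of the message output.
    """
    if has_rtl(body):
        return "{" + LRI + body + PDI + "}"
    return "{" + body + "}"


print(isolate_placeholder("$count :number") == "{$count :number}")  # True
print(isolate_placeholder("$\u05e9\u05dd") == "{\u2066$\u05e9\u05dd\u2069}")  # True
```

Placeholders without right-to-left characters are left unchanged, so existing left-to-right messages serialize exactly as before.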
|
||
These definitions of _whitespace_ implement | ||
[UAX#31 Requirement R3a-2](https://www.unicode.org/reports/tr31/#R3a-2). | ||
It is a profile of R3a-1 in that specification because: | ||
the following pattern whitespace characters are not allowed: | ||
`U+000B VERTICAL TABULATION`, | ||
`U+000C FORM FEED`, | ||
`U+0085 NEXT LINE`, | ||
`U+2028 LINE SEPARATOR` and | ||
`U+2029 PARAGRAPH SEPARATOR`; | ||
the character `U+3000 IDEOGRAPHIC SPACE` | ||
_is_ interpreted as whitespace, | ||
and the directional isolates U+2066..U+2069 | ||
are treated as ignorable format controls. | ||
So is U+061C, I think?

Yes, and LRM/RLM

Right, but LRM and RLM are part of R3a-1, see the first note under https://www.unicode.org/reports/tr31/#R3a. The profile includes adding U+061C to the ignorable format controls, not just the isolates.

The second note talks about ALM, including:
In any case, I now have a list of ignorable format controls, which might be overkill, but saves reading rule R3a 😉.

Indeed, but you still need to say that the profile adds it! I agree that listing the set is probably better than the diff at this point.

We could also not allow ALM as a

I feel uncomfortable removing bidi marks. The ALM was added many years after RLM/LRM and its differences with RLM are minor. But we want bidi language users to have the tools they need to make things look right (and still be functional). I would hate to remove it because we need to add a couple of words to the spec.

My main concern here is that treating it as an ignorable format control requires us to deviate further from the XML name production.

I think this change is the right thing to do. ALM is an invisible, default-ignorable, non-spacing code point. As noted elsewhere, it was added to Unicode after XML/XMLName were defined. According to XML's rules, an ALM all-by-itself is a valid identifier. That seems like a bug, not a feature. Maybe we should call out the deviation more clearly and maybe (wearing my other chair hat) W3C should be called on to do an erratum. @macchiati Any thoughts?

I strongly agree; it should be added.
||
|
||
> [!NOTE] | ||
> The character U+3000 IDEOGRAPHIC SPACE is included in whitespace for | ||
> compatibility with certain East Asian keyboards and input methods, | ||
> in which users might accidentally create these characters in a _message_. | ||
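
The profile described in this section can be checked with plain set arithmetic. In this sketch, `PATTERN_WHITE_SPACE` lists the code points with the Unicode `Pattern_White_Space` property, and the two `MF2_*` sets are copied from the `s` and `bidi` productions in this change; the variable names are only illustrative.

```python
# Code points with the Unicode Pattern_White_Space property.
PATTERN_WHITE_SPACE = {
    0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020,
    0x0085, 0x200E, 0x200F, 0x2028, 0x2029,
}

# From the ABNF in this change: `s` and `bidi`.
MF2_WHITESPACE = {0x0009, 0x000A, 0x000D, 0x0020, 0x3000}
MF2_BIDI = {0x061C, 0x200E, 0x200F, 0x2066, 0x2067, 0x2068, 0x2069}

# Pattern whitespace that the syntax does not allow at all.
removed = PATTERN_WHITE_SPACE - MF2_WHITESPACE - MF2_BIDI
# Whitespace and controls that the profile adds.
added_whitespace = MF2_WHITESPACE - PATTERN_WHITE_SPACE
added_controls = MF2_BIDI - PATTERN_WHITE_SPACE

print(sorted(f"U+{cp:04X}" for cp in removed))
print(sorted(f"U+{cp:04X}" for cp in added_whitespace))
print(sorted(f"U+{cp:04X}" for cp in added_controls))
```

Under these definitions, `added_controls` contains U+061C as well as the isolates U+2066..U+2069, which is the point raised in the review discussion.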
|
||
```abnf | ||
s = 1*( SP / HTAB / CR / LF / %x3000 ) | ||
; Optional whitespace | ||
owsp = *(s / bidi) | ||
|
||
; Required whitespace | ||
wsp = (owsp) 1*s (owsp) | ||
|
||
; Bidirectional marks and isolates | ||
; ALM / LRM / RLM / LRI, RLI, FSI & PDI | ||
bidi = %x061C / %x200E / %x200F / %x2066-2069 | ||
|
||
; Whitespace characters | ||
s = SP / HTAB / CR / LF / %x3000 | ||
``` | ||
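
For experimenting with the productions above, they can be transcribed into regular expressions. This is a rough sketch, not a conformance test: the names `S`, `BIDI`, `OWSP`, and `WSP` mirror the ABNF rules, and `is_owsp` / `is_wsp` are hypothetical helpers.

```python
import re

# s = SP / HTAB / CR / LF / %x3000
S = r"[ \t\r\n\u3000]"
# bidi = %x061C / %x200E / %x200F / %x2066-2069
BIDI = r"[\u061C\u200E\u200F\u2066-\u2069]"
# owsp = *(s / bidi)
OWSP = rf"(?:{S}|{BIDI})*"
# wsp = (owsp) 1*s (owsp)
WSP = rf"{OWSP}{S}+{OWSP}"


def is_owsp(text: str) -> bool:
    """True if text matches the optional-whitespace production."""
    return re.fullmatch(OWSP, text) is not None


def is_wsp(text: str) -> bool:
    """True if text matches the required-whitespace production."""
    return re.fullmatch(WSP, text) is not None


print(is_owsp(""))               # True: optional whitespace may be empty
print(is_wsp("\u200e"))          # False: a mark alone is not required whitespace
print(is_wsp("\u2066 \u2069"))   # True: isolates around a real space
```

In effect, required whitespace is any run of whitespace and bidi characters that contains at least one actual whitespace character.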
|
||
## Complete ABNF | ||
|