-
Notifications
You must be signed in to change notification settings - Fork 215
Update UAX#29 text segmenter data rules to 16.0. #6367
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Now that the issues with InCB in Unicode 15.1 have been resolved (see UTC-179-C26), we should get rid of our own derivation here: icu4x/provider/source/src/segmenter/mod.rs Lines 292 to 336 in 7e287f1
Right now it is not causing any problems, since our custom derivation is consistent with what Unicode 16.0 does, but in 17.0 there will be substantial changes to the derivation, see UTC-179-C27 and https://www.unicode.org/reports/tr44/proposed.html#Derivation_InCB. These 17.0 changes do not come with any changes to the rules themselves, so the update should be just as easy as this one, but only if we are actually using the InCB from UTC. |
InCB isn't implemented in ICU4X yet, I will create a PR to add this property. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If updating the properties is all we need and no UAX29 rules needs a change, we should update these comments point to 16.0.0 and the new URLs.
icu4x/provider/source/data/segmenter/grapheme.toml
Lines 5 to 6 in b886706
# These grapheme boundary rules are based on UAX #29, Unicode Version 15.1.0. | |
# https://www.unicode.org/reports/tr29/tr29-43.html |
icu4x/provider/source/data/segmenter/sentence.toml
Lines 5 to 6 in b886706
# These sentence boundary rules are based on UAX #29, Unicode Version 15.1.0. | |
# https://www.unicode.org/reports/tr29/tr29-43.html |
icu4x/provider/source/data/segmenter/word.toml
Lines 5 to 6 in b886706
# These word boundary rules are based on UAX #29, Unicode Version 15.1.0. | |
# https://www.unicode.org/reports/tr29/tr29-43.html |
From #6367, @eggrobin suggests to use Indic_Conjunct_Break (InCB) property for Grapheme Cluster Break. Also, `InCB.toml` is incomplete yet like the following, since it is added by ICU76 as a draft API. ``` values = [ {discr = 0, long = "None", short = "None"}, ] ``` It means that names (short / long / parse) are empty for this implementation.
From Unicode 16.0, Extend if Indic_Conjunct_Break doesn't include ccc.
for i in 0..(CODEPOINT_TABLE_LEN as u32) { | ||
if let Some(c) = char::from_u32(i) { | ||
if incb.get(c) == IndicConjunctBreak::Consonant { | ||
properties_map[c as usize] = property_index; | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it's probably more efficient to use incb.iter_ranges_for_value
than to call incb.get
a million times
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code here does this already, I'm going to fix it in a followup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unicode 16.0.0 changes are property value only. So updating data files and test data.