Skip to content

Update UAX#29 text segmenter data rules to 16.0. #6367

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Apr 8, 2025

Conversation

makotokato
Copy link
Member

Unicode 16.0.0 changes are property value only. So updating data files and test data.

@robertbastian robertbastian requested review from eggrobin and removed request for robertbastian March 27, 2025 13:17
@eggrobin
Copy link
Member

Now that the issues with InCB in Unicode 15.1 have been resolved (see UTC-179-C26), we should get rid of our own derivation here:

// The Indic_Conjunct_Break property is separate from the Grapheme_Cluster_Break property.
// See https://unicode.org/reports/tr44/#Indic_Conjunct_Break
if p.name == "InCBConsonant" || p.name == "InCBLinker" || p.name == "InCBExtend"
{
let gcb_extend = gcb_name_to_enum
.get_loose("Extend")
.expect("property name should be valid!");
for i in 0..(CODEPOINT_TABLE_LEN as u32) {
if let Some(c) = char::from_u32(i) {
let insc_value = insc.get(c);
let sc = script.get(c);
let is_gb9c_script = sc == Script::Bengali
|| sc == Script::Devanagari
|| sc == Script::Gujarati
|| sc == Script::Malayalam
|| sc == Script::Oriya
|| sc == Script::Telugu;
let is_incb_consonant = insc_value
== IndicSyllabicCategory::Consonant
&& is_gb9c_script;
let is_incb_linker =
insc_value == IndicSyllabicCategory::Virama && is_gb9c_script;
// InCB = Linker or InCB = Consonant
if (p.name == "InCBConsonant" && is_incb_consonant)
|| (p.name == "InCBLinker" && is_incb_linker)
// ZWJ is InCB=Extend, but is in a different GCB class anyway so
// it needs to be special-cased in the tables.
// NOTE(eggrobin): UAX #44, Version 15.1, instead excludes based
// on InSC.
// I believe that to be a defect in that version of Unicode.
// This has been brought to the attention of the Properties and
// Algorithms Group.
|| (p.name == "InCBExtend"
&& (gb.get32(i) == gcb_extend
&& ccc.get32(i) != CanonicalCombiningClass::NotReordered
&& !is_incb_consonant
&& !is_incb_linker))
{
properties_map[c as usize] = property_index;
}
}
}
continue;
}

Right now it is not causing any problems, since our custom derivation is consistent with what Unicode 16.0 does, but in 17.0 there will be substantial changes to the derivation, see UTC-179-C27 and https://www.unicode.org/reports/tr44/proposed.html#Derivation_InCB. These 17.0 changes do not come with any changes to the rules themselves, so the update should be just as easy as this one, but only if we are actually using the InCB from UTC.

@makotokato
Copy link
Member Author

InCB isn't implemented in ICU4X yet, I will create a PR to add this property.

Copy link
Contributor

@aethanyc aethanyc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If updating the properties is all we need and no UAX29 rules needs a change, we should update these comments point to 16.0.0 and the new URLs.

# These grapheme boundary rules are based on UAX #29, Unicode Version 15.1.0.
# https://www.unicode.org/reports/tr29/tr29-43.html

# These sentence boundary rules are based on UAX #29, Unicode Version 15.1.0.
# https://www.unicode.org/reports/tr29/tr29-43.html

# These word boundary rules are based on UAX #29, Unicode Version 15.1.0.
# https://www.unicode.org/reports/tr29/tr29-43.html

@sffc sffc removed their request for review April 4, 2025 07:44
makotokato added a commit that referenced this pull request Apr 7, 2025
From #6367, @eggrobin suggests
to use Indic_Conjunct_Break (InCB) property for Grapheme Cluster Break.

Also, `InCB.toml` is incomplete yet like the following, since it is
added by ICU76 as a draft API.
```
values = [
  {discr = 0, long = "None", short = "None"},
]
```
It means that names (short / long / parse) are empty for this
implementation.
Manishearth
Manishearth previously approved these changes Apr 7, 2025
Comment on lines +289 to +295
for i in 0..(CODEPOINT_TABLE_LEN as u32) {
if let Some(c) = char::from_u32(i) {
if incb.get(c) == IndicConjunctBreak::Consonant {
properties_map[c as usize] = property_index;
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it's probably more efficient to use incb.iter_ranges_for_value than to call incb.get a million times

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code here does this already, I'm going to fix it in a followup

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Manishearth Manishearth merged commit 211db9b into unicode-org:main Apr 8, 2025
29 checks passed
@makotokato makotokato deleted the uax29-16.0 branch April 10, 2025 01:45
Manishearth added a commit that referenced this pull request Apr 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants