Update UAX#29 text segmenter data rules to 16.0. #6367

makotokato · 2025-03-27T11:09:05Z

The Unicode 16.0.0 changes affect property values only, so this PR updates the data files and test data.

eggrobin · 2025-03-27T13:34:00Z

Now that the issues with InCB in Unicode 15.1 have been resolved (see UTC-179-C26), we should get rid of our own derivation here:

icu4x/provider/source/src/segmenter/mod.rs

Lines 292 to 336 in 7e287f1

    
    // The Indic_Conjunct_Break property is separate from the Grapheme_Cluster_Break property.
    // See https://unicode.org/reports/tr44/#Indic_Conjunct_Break
    if p.name == "InCBConsonant" || p.name == "InCBLinker" || p.name == "InCBExtend"
    {
        let gcb_extend = gcb_name_to_enum
            .get_loose("Extend")
            .expect("property name should be valid!");
        for i in 0..(CODEPOINT_TABLE_LEN as u32) {
            if let Some(c) = char::from_u32(i) {
                let insc_value = insc.get(c);
                let sc = script.get(c);
                let is_gb9c_script = sc == Script::Bengali
                    || sc == Script::Devanagari
                    || sc == Script::Gujarati
                    || sc == Script::Malayalam
                    || sc == Script::Oriya
                    || sc == Script::Telugu;
                let is_incb_consonant = insc_value
                    == IndicSyllabicCategory::Consonant
                    && is_gb9c_script;
                let is_incb_linker =
                    insc_value == IndicSyllabicCategory::Virama && is_gb9c_script;
                // InCB = Linker or InCB = Consonant
                if (p.name == "InCBConsonant" && is_incb_consonant)
                    || (p.name == "InCBLinker" && is_incb_linker)
                    // ZWJ is InCB=Extend, but is in a different GCB class anyway so
                    // it needs to be special-cased in the tables.
                    // NOTE(eggrobin): UAX #44, Version 15.1, instead excludes based
                    // on InSC.
                    // I believe that to be a defect in that version of Unicode.
                    // This has been brought to the attention of the Properties and
                    // Algorithms Group.
                    || (p.name == "InCBExtend"
                        && (gb.get32(i) == gcb_extend
                            && ccc.get32(i) != CanonicalCombiningClass::NotReordered
                            && !is_incb_consonant
                            && !is_incb_linker))
                {
                    properties_map[c as usize] = property_index;
                }
            }
        }
        continue;
    }

Right now it is not causing any problems, since our custom derivation is consistent with what Unicode 16.0 does. In 17.0, however, there will be substantial changes to the derivation; see UTC-179-C27 and https://www.unicode.org/reports/tr44/proposed.html#Derivation_InCB. These 17.0 changes do not come with any changes to the rules themselves, so the update should be just as easy as this one, but only if we are actually using the InCB from UTC.
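
For readers skimming the thread, the quoted derivation boils down to a per-code-point decision. The sketch below restates that logic as a pure function. It assumes hypothetical stand-in types (`Script`, `InSc`, `InCb`, `Props`) and a hypothetical `derive_incb` function rather than ICU4X's real property enums and tries. ZWJ, which the snippet's comments say is special-cased in the tables, is left out.

```rust
// Hypothetical stand-in types for illustration only; the real datagen code
// uses ICU4X property enums and code point tries instead.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Script { Bengali, Devanagari, Gujarati, Malayalam, Oriya, Telugu, Other }

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum InSc { Consonant, Virama, Other }

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum InCb { Consonant, Linker, Extend, None }

/// Per-code-point inputs, assumed to have been looked up elsewhere.
#[derive(Clone, Copy, Debug)]
pub struct Props {
    pub script: Script,
    pub insc: InSc,
    /// True when Grapheme_Cluster_Break is Extend.
    pub gcb_is_extend: bool,
    /// Canonical combining class; 0 means Not_Reordered.
    pub ccc: u8,
}

/// Mirrors the conditions of the quoted snippet (ZWJ handling omitted).
pub fn derive_incb(p: Props) -> InCb {
    // The six scripts listed in the snippet's `is_gb9c_script` check.
    let is_gb9c_script = !matches!(p.script, Script::Other);
    if is_gb9c_script && p.insc == InSc::Consonant {
        InCb::Consonant
    } else if is_gb9c_script && p.insc == InSc::Virama {
        InCb::Linker
    } else if p.gcb_is_extend && p.ccc != 0 {
        // Extend: GCB=Extend with a nonzero ccc, and neither of the above.
        InCb::Extend
    } else {
        InCb::None
    }
}
```

With these stand-ins, a consonant in one of the six scripts maps to `InCb::Consonant`, a virama in one of them maps to `InCb::Linker`, and a GCB=Extend character with a nonzero ccc that is neither of those maps to `InCb::Extend`. Consuming the published InCB values directly makes this function unnecessary, which is the point of the comment above.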

makotokato · 2025-03-28T04:35:28Z

InCB isn't implemented in ICU4X yet; I will create a PR to add this property.

provider/source/data/segmenter/uprops/small/ExtPict.toml

aethanyc

If updating the properties is all we need and no UAX #29 rules need to change, we should update these comments to point to 16.0.0 and the new URLs.

icu4x/provider/source/data/segmenter/grapheme.toml

Lines 5 to 6 in b886706

    
    # These grapheme boundary rules are based on UAX #29, Unicode Version 15.1.0.
    # https://www.unicode.org/reports/tr29/tr29-43.html

icu4x/provider/source/data/segmenter/sentence.toml

Lines 5 to 6 in b886706

    
    # These sentence boundary rules are based on UAX #29, Unicode Version 15.1.0.
    # https://www.unicode.org/reports/tr29/tr29-43.html

icu4x/provider/source/data/segmenter/word.toml

Lines 5 to 6 in b886706

    
    # These word boundary rules are based on UAX #29, Unicode Version 15.1.0.
    # https://www.unicode.org/reports/tr29/tr29-43.html

components/segmenter/tests/testdata/GraphemeBreakExtraTest.txt

@eggrobin

From #6367, @eggrobin suggests using the Indic_Conjunct_Break (InCB) property for Grapheme Cluster Break. Also, `InCB.toml` is still incomplete, as shown below, since the property was added in ICU 76 as a draft API.

```
values = [
  {discr = 0, long = "None", short = "None"},
]
```

This means that the names (short / long / parse) are empty for this implementation.

As of Unicode 16.0, the Extend value of Indic_Conjunct_Break doesn't include the ccc condition.

robertbastian · 2025-04-08T08:49:59Z

provider/source/src/segmenter/mod.rs

+                        for i in 0..(CODEPOINT_TABLE_LEN as u32) {
+                            if let Some(c) = char::from_u32(i) {
+                                if incb.get(c) == IndicConjunctBreak::Consonant {
+                                    properties_map[c as usize] = property_index;
+                                }
+                            }
+                        }


nit: it's probably more efficient to use incb.iter_ranges_for_value than to call incb.get a million times

The code here does this already, I'm going to fix it in a followup
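
To illustrate the nit above, here is a minimal sketch contrasting a per-code-point lookup with range iteration. `RangeTable`, `ranges_for_value`, `fill_by_lookup`, and `fill_by_ranges` are hypothetical stand-ins invented for this example; they are not the ICU4X API, and the real `iter_ranges_for_value` may differ in signature and behavior.

```rust
use std::ops::RangeInclusive;

/// Hypothetical stand-in for a property table: non-overlapping code point
/// ranges, each mapped to a small integer value.
pub struct RangeTable {
    ranges: Vec<(RangeInclusive<u32>, u8)>,
}

impl RangeTable {
    pub fn new(ranges: Vec<(RangeInclusive<u32>, u8)>) -> Self {
        Self { ranges }
    }

    /// Point lookup: the value for a single code point, if any.
    pub fn get(&self, cp: u32) -> Option<u8> {
        self.ranges
            .iter()
            .find(|(r, _)| r.contains(&cp))
            .map(|(_, v)| *v)
    }

    /// Yields only the ranges that carry `value`.
    pub fn ranges_for_value(&self, value: u8) -> impl Iterator<Item = RangeInclusive<u32>> + '_ {
        self.ranges
            .iter()
            .filter(move |(_, v)| *v == value)
            .map(|(r, _)| r.clone())
    }
}

/// Per-code-point approach: one lookup for every slot in `map`.
pub fn fill_by_lookup(table: &RangeTable, value: u8, index: u8, map: &mut [u8]) {
    for cp in 0..map.len() as u32 {
        if table.get(cp) == Some(value) {
            map[cp as usize] = index;
        }
    }
}

/// Range approach: visits only the code points that actually match.
pub fn fill_by_ranges(table: &RangeTable, value: u8, index: u8, map: &mut [u8]) {
    for range in table.ranges_for_value(value) {
        for cp in range {
            // Ignore code points beyond the end of the map.
            if let Some(slot) = map.get_mut(cp as usize) {
                *slot = index;
            }
        }
    }
}
```

Both functions fill `map` identically. The range version does work proportional to the number of matching code points rather than to the size of the whole table.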

Followup from #6367

makotokato requested review from sffc, robertbastian, Manishearth, aethanyc and a team as code owners March 27, 2025 11:09

robertbastian requested review from eggrobin and removed request for robertbastian March 27, 2025 13:17

makotokato mentioned this pull request Mar 28, 2025

Add Indic_Conjunct_Break property. #6379

Merged

Manishearth reviewed Mar 31, 2025

provider/source/data/segmenter/uprops/small/ExtPict.toml

aethanyc reviewed Apr 1, 2025

components/segmenter/tests/testdata/GraphemeBreakExtraTest.txt

sffc removed their request for review April 4, 2025 07:44

Manishearth previously approved these changes Apr 7, 2025

makotokato added 3 commits April 8, 2025 09:16

Update UAX#29 text segmenter data rules to 16.0.

c6dafc6

Use Indic_Conjunct_Break directly instead of Unicode 15.1's defines.

b192b7d

From Unicode 16.0, Extend if Indic_Conjunct_Break doesn't include ccc.

Update Unicode version in toml files.

9801375

makotokato dismissed Manishearth’s stale review via 9801375 April 8, 2025 00:31

makotokato force-pushed the uax29-16.0 branch from 800c964 to 9801375 Compare April 8, 2025 00:31

Manishearth approved these changes Apr 8, 2025

robertbastian reviewed Apr 8, 2025

aethanyc approved these changes Apr 8, 2025

Manishearth merged commit 211db9b into unicode-org:main Apr 8, 2025
29 checks passed

Manishearth mentioned this pull request Apr 8, 2025

Use range iteration in segmenter datagen #6430

Merged

makotokato deleted the uax29-16.0 branch April 10, 2025 01:45

Manishearth added a commit that referenced this pull request Apr 10, 2025

Use range iteration in segmenter datagen (#6430)

cd9a3e3

Followup from #6367
