[DRAFT] Case-insensitive matching over union of strings #14350
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -743,4 +743,42 @@ static int[] lookupAlternates(int codepoint) { | |
|
||
return alts; | ||
} | ||
|
||
/** | ||
* Folds the case of the given character according to {@link Character#toLowerCase(int)}, but with | ||
* exceptions if the turkic flag is set. | ||
* | ||
* @param codepoint the code point for the character to fold | |
* @param turkic if true, then apply tr/az folding rules | ||
* @return the folded character | ||
*/ | ||
static int foldCase(int codepoint, boolean turkic) { | ||
if (turkic) { | ||
if (codepoint == 0x00130) { // İ [LATIN CAPITAL LETTER I WITH DOT ABOVE] | ||
return 0x00069; // i [LATIN SMALL LETTER I] | ||
} else if (codepoint == 0x000049) { // I [LATIN CAPITAL LETTER I] | ||
return 0x00131; // ı [LATIN SMALL LETTER DOTLESS I] | ||
} | ||
} | ||
return Character.toLowerCase(codepoint); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For real case folding we have to do more than this. it is a simple 1-1 mapping but e.g. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Got it. Checking https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt, indeed I can see those entries:
Ideally, I'd love to just use those folding rules. I could get them from Would it be worthwhile to write a generator that pulls https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt (updated to whatever the current Unicode spec is) and generates a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe, depending what we are going to do with it? if done correctly we could replace But for the purpose of this PR, we may want to start simpler (this is the same approach I mentioned on regex caseless PR). We should avoid huge arrays and large data files in lucene-core, just for adding more inefficient user regular expressions that isn't really related to searching. On the other hand, if we are going to get serious benefit everywhere (e.g. improve all analyzers), then maybe the tradeoff makes sense. And I don't understand why we'd parse text files versus just write any generator itself to use ICU... especially since we already use such an approach in the build already: https://github.com/apache/lucene/blob/main/gradle/generation/icu/GenerateUnicodeProps.groovy Still I wouldn't immediately jump to generation as a start, it is a lot of work, and we should iterate. First i'd compare |
||
} | ||
|
||
/** | ||
* Attempts to convert the given character to upper case, according to {@link | |
* Character#toUpperCase(int)}, but with exceptions if the turkic flag is set. | ||
* | ||
* @param codepoint the code point for the character to convert to upper case | |
* @param turkic if true, then apply tr/az folding rules | ||
* @return the upper case character | ||
*/ | ||
static int upperCase(int codepoint, boolean turkic) { | ||
if (turkic) { | ||
if (codepoint == 0x00069) { // i [LATIN SMALL LETTER I] | ||
return 0x00130; // İ [LATIN CAPITAL LETTER I WITH DOT ABOVE] | ||
} else if (codepoint == 0x00131) { // ı [LATIN SMALL LETTER DOTLESS I] | ||
return 0x000049; // I [LATIN CAPITAL LETTER I] | ||
} | ||
} | ||
return Character.toUpperCase(codepoint); | ||
} | ||
} |
turkic parameter is missing from the javadoc.
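
To make the gap between `Character.toLowerCase(int)` and Unicode case folding concrete, here is a minimal sketch. It copies `foldCase` from the diff so it compiles on its own. The class name `CaseFoldSketch` and the helper `upperThenLower` are hypothetical and not part of the PR. The two sample code points, U+03C2 (final sigma) and U+017F (long s), are illustrative picks from CaseFolding.txt and not necessarily the examples the reviewer had in mind. Upper-casing then lower-casing is only a rough approximation that happens to agree with the simple folding for these two code points; it is not a substitute for the real folding tables.

```java
public class CaseFoldSketch {

  // Copy of foldCase from the diff: lower-case, with tr/az exceptions when turkic is set.
  static int foldCase(int codepoint, boolean turkic) {
    if (turkic) {
      if (codepoint == 0x0130) { // İ LATIN CAPITAL LETTER I WITH DOT ABOVE
        return 0x0069; // i LATIN SMALL LETTER I
      } else if (codepoint == 0x0049) { // I LATIN CAPITAL LETTER I
        return 0x0131; // ı LATIN SMALL LETTER DOTLESS I
      }
    }
    return Character.toLowerCase(codepoint);
  }

  // Hypothetical helper, NOT in the PR: upper-case first, then lower-case.
  // Only an approximation of simple case folding.
  static int upperThenLower(int codepoint) {
    return Character.toLowerCase(Character.toUpperCase(codepoint));
  }

  public static void main(String[] args) {
    // I, İ, final sigma, long s
    int[] samples = {0x0049, 0x0130, 0x03C2, 0x017F};
    for (int cp : samples) {
      System.out.printf(
          "U+%04X: foldCase=U+%04X, foldCase(turkic)=U+%04X, upperThenLower=U+%04X%n",
          cp, foldCase(cp, false), foldCase(cp, true), upperThenLower(cp));
    }
  }
}
```

Both U+03C2 and U+017F are already lower case, so `foldCase` returns them unchanged, while CaseFolding.txt folds them to U+03C3 and U+0073 respectively. A case-insensitive match built only on `toLowerCase` would therefore treat "ς" and "σ" as distinct.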