[DRAFT] Case-insensitive matching over union of strings #14350


Draft · wants to merge 2 commits into main

Conversation

@msfroh (Contributor) commented Mar 13, 2025

Description

This is a rough attempt to make StringsToAutomaton support case-insensitive strings.

@rmuir rmuir requested a review from dweiss March 13, 2025 01:38
@rmuir (Member) commented Mar 13, 2025

@dweiss understands this one the best, he implemented it.

    ? Character.toUpperCase(label)
    : Character.toLowerCase(label);
    if (altCase != label) {
      a.addTransition(converted, dest, altCase);
@rmuir (Member) commented on this diff:
maybe this works because the finishState() will sort the transitions always, and still give us the same efficient build?

@msfroh (Contributor, Author) replied Mar 13, 2025:

Essentially, I tried to copy what you did for the case-insensitive regex matching to add extra transition arcs for the other letter-cases.

I think finishState will sort the transitions. I don't know for sure, though.

Note that this implementation will be far more efficient if all of the input strings share the same case; otherwise, it may miss common (case-insensitive) prefixes. I'm imagining that a query would lowercase everything first.

@msfroh msfroh force-pushed the case_insensitive_string_union branch from 14802c6 to 036668a Compare March 13, 2025 01:47
@rmuir (Member) commented Mar 13, 2025

To me it seems like a potentially safe and practical addition. The idea would be that we can add transition "alternatives" (e.g. A vs. a) without breaking the high-level algorithm, due to the internal representation and how we sort transitions when doing e.g. finishState()?

I don't really have confidence either way, but it is a good one to explore. Maybe with testing we can be confident it still does the right thing?

@rmuir (Member) commented Mar 13, 2025

OG paper: https://aclanthology.org/J00-1002.pdf

@dweiss (Contributor) commented Mar 13, 2025

@dweiss understands this one the best, he implemented it.

... 15 years ago in LUCENE-3832. Thanks for putting so much trust in my memory. I'll take a look.

@dweiss (Contributor) commented Mar 13, 2025

I think this will work just fine in most cases and is a rather inexpensive way to implement case-insensitive matching, but it comes at the cost of an output automaton that may not be minimal. Consider this example:

    List<BytesRef> terms = new ArrayList<>(List.of(
            newBytesRef("abc"),
            newBytesRef("aBC")));
    Collections.sort(terms);
    Automaton a = build(terms, false, false);

which produces:
[automaton diagram omitted]

However, when you naively expand just the transitions for each letter variant, you get this:
[automaton diagram omitted]
which clearly isn't minimal (and doesn't pass checkMinimized).

I think the absolute worst case is for the automaton to multiply the number of transitions by however many alternatives there are for each codepoint; the number of states remains the same. So it's not going to expand uncontrollably... but it's no longer minimal. Perhaps this is acceptable, given the constrained worst case?

@rmuir (Member) commented Mar 13, 2025

Bigger downside: that example isn't deterministic either.

@dweiss (Contributor) commented Mar 13, 2025

Crap, you're right. Didn't think of it.

@dweiss (Contributor) commented Mar 13, 2025

I also don't think you can make it deterministic in any trivial way.

@msfroh (Contributor, Author) commented Mar 13, 2025

My thinking is that a query using this should lowercase, dedupe, and sort the input before feeding it into StringsToAutomaton. That would handle @dweiss's example: that input is "invalid", or at least exact duplicates (after lowercasing) wouldn't add any new states, I think, since the full string already exists as a prior prefix.

Would a check with Character.isLowerCase() on each input codepoint for the case-insensitive case be sufficient to reject that kind of input across all valid Unicode strings?
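As a caller-side sketch of that pre-processing (a hypothetical helper, not part of Lucene; it lowercases with the JDK's locale-insensitive Character.toLowerCase, then dedupes and sorts):

```java
import java.util.*;
import java.util.stream.*;

public class Preprocess {
  // Hypothetical pre-processing a caller might apply before feeding
  // terms to StringsToAutomaton: lowercase each codepoint, dedupe, sort.
  static List<String> normalize(Collection<String> raw) {
    return raw.stream()
        .map(s -> s.codePoints()
            .map(Character::toLowerCase)
            .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
            .toString())
        .distinct()
        .sorted()
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // "abc" and "aBC" collapse to one term after lowercasing.
    System.out.println(normalize(List.of("abc", "aBC", "Cat"))); // [abc, cat]
  }
}
```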

@rmuir (Member) commented Mar 13, 2025

Would a check with Character.isLowerCase() on each input codepoint for the case-insensitive case be sufficient to reject that kind of input across all valid Unicode strings?

I don't think so for Greek. I would step back from that and try to get matching working with simple Character.toLowerCase/toUpperCase first? If the user provides data with a certain order/casing as you suggest, will they always get a DFA?

I'm less concerned about it being minimal; let's start with deterministic. I don't think we should do this if it will explode (e.g. into an NFA). And a union of strings really wants to explode if handled the naive way; that's why there is a special algorithm for it.

@msfroh (Contributor, Author) commented Mar 14, 2025

To the best of my understanding from reading through the code while sketching this PR, I believe it would produce a minimal DFA if every character in a set of alternatives in the input strings has the same canonical representation. (The existing implementation already throws if the input is not sorted BytesRefs.)

That is, if you input cap, cat, cats, cob, it will generate the minimal DFA. If you input CAP, CAT, CATS, COB, you'll end up with the same minimal DFA (albeit with the transitions added in the opposite order, which I think is fine). But if you input CAP, CATS, cat, cob, you'll end up with an NFA.
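To make the DFA-vs-NFA distinction concrete: after naive expansion, a state can end up with two different targets for the same label. A toy determinism check over a simplified transition map (this is not Lucene's Automaton representation, just an illustration of the property under discussion):

```java
import java.util.*;

public class DeterminismCheck {
  // Toy automaton: state -> (codepoint label -> set of target states).
  // Deterministic iff every label out of every state has at most one target.
  static boolean isDeterministic(Map<Integer, Map<Integer, Set<Integer>>> transitions) {
    for (Map<Integer, Set<Integer>> arcs : transitions.values()) {
      for (Set<Integer> targets : arcs.values()) {
        if (targets.size() > 1) return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Suppose naive case expansion gave state 1 two distinct 'a'-targets:
    Map<Integer, Map<Integer, Set<Integer>>> t =
        Map.of(1, Map.of((int) 'a', Set.of(2, 3)));
    System.out.println(isDeterministic(t)); // false
  }
}
```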

@dweiss (Contributor) commented Mar 14, 2025

I don't know Unicode as well as Rob so I can't say what these alternate case-folding equivalence classes are... but they definitely don't have a "canonical" representation with regard to Character.toLowerCase. Consider the killer Turkish dotless i, for example:

    public void testCornerCase() throws Exception {
        List<BytesRef> terms = Stream.of(
                        "aIb", "aıc")
                .map(s -> {
                    int[] lowercased = s.codePoints().map(Character::toLowerCase).toArray();
                    return new String(lowercased, 0, lowercased.length);
                })
                .map(LuceneTestCase::newBytesRef)
                .sorted()
                .collect(Collectors.toCollection(ArrayList::new));
        Automaton a = build(terms, false, true);
        System.out.println(a.toDot());
        assertTrue(a.isDeterministic());
    }

which yields:
[automaton diagram omitted]

It would take some kind of character-normalization filter on both the index side and the automaton building/expansion side for this to work (but then you don't really need to bother with any of this: if your index is "normalized" and your query is "normalized", then a normal term query will do just fine?).

@dweiss (Contributor) commented Mar 14, 2025

Or we can just embrace the fact that it can be a non-minimal NFA and just let it run like that (with NFARunAutomaton).

@rmuir (Member) commented Mar 14, 2025

This is why I recommended not using the Unicode function and starting simple. Then you have a potential way to get it working efficiently.

@rmuir (Member) commented Mar 14, 2025

Or we can just embrace the fact that it can be a non-minimal NFA and just let it run like that (with NFARunAutomaton).

I don't think this is currently a good option either: users won't just do that. They will determinize, minimize, and tableize and then be confused when things are slow or use too much memory.

@dweiss (Contributor) commented Mar 14, 2025

Ok, fair enough.

@msfroh (Contributor, Author) commented Mar 14, 2025

This is kind of what I had in mind:

  private static int canonicalize(int codePoint) {
    int[] alternatives = CaseFolding.lookupAlternates(codePoint);
    if (alternatives != null) {
      for (int cp : alternatives) {
        codePoint = Math.min(codePoint, cp);
      }
    } else {
      int altCase = Character.isLowerCase(codePoint) ? Character.toUpperCase(codePoint) : Character.toLowerCase(codePoint);
      codePoint = Math.min(codePoint, altCase);
    }
    return codePoint;
  }

  public void testCornerCase() throws Exception {
    List<BytesRef> terms = Stream.of(
                    "aIb", "aıc")
            .map(s -> {
              int[] lowercased = s.codePoints().map(TestStringsToAutomaton::canonicalize).toArray();
              return new String(lowercased, 0, lowercased.length);
            })
            .map(LuceneTestCase::newBytesRef)
            .sorted()
            .collect(Collectors.toCollection(ArrayList::new));
    Automaton a = build(terms, false, true);
    System.out.println(a.toDot());
    assertTrue(a.isDeterministic());
  }

That produces this automaton, which is minimal and deterministic:
[automaton diagram omitted]

I don't know if that canonicalize method is a good idea, though.

@rmuir (Member) commented Mar 14, 2025

It isn't a good idea. If the user wants to "erase case differences", they should apply foldCase(ch): that's what case folding means. That CaseFolding class does everything except that. Again, it's why I recommend not messing with it for now and starting simpler.

@msfroh (Contributor, Author) commented Mar 15, 2025

Hmm... I'm thinking of just requiring that input is lowercase (per Character.toLowerCase(c)), then checking for collisions on uppercase versions when adding transitions, and throwing an exception (since the result won't be a DFA).

Unfortunately, that would mess with Turkish, if someone tries searching for sınıf (class) and sinirli (nervous). Without locale info, we'd get two transitions from s to I.

As I understand it, the regexp case-insensitive implementation just says that i, ı, I, and İ are all the same letter, which is what the canonicalize hack does (collapsing them all into I on input and exploding them in the automaton).
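For reference, the JDK's locale-insensitive simple mappings really do conflate these four letters (plain java.lang.Character behavior, no Lucene involved):

```java
public class TurkishI {
  public static void main(String[] args) {
    // Locale-insensitive simple mappings in java.lang.Character:
    // dotless ı (U+0131) uppercases to plain I (U+0049), and dotted İ
    // (U+0130) lowercases to plain i (U+0069). Round-tripping through
    // these mappings therefore mixes all four letters together.
    System.out.println(Character.toUpperCase(0x0131) == 'I'); // true
    System.out.println(Character.toLowerCase(0x0130) == 'i'); // true
    System.out.println(Character.toLowerCase('I') == 'i');    // true
  }
}
```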

@rmuir (Member) commented Mar 15, 2025

+1 to start simple with Character.toLowerCase, that's the best you can get in Java.

The problem is java not having a Character.foldCase. A proper function would look like ICU's UCharacter.foldCase(int, boolean): https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/lang/UCharacter.html#foldCase-int-boolean-

The regexp folding code doesn't handle Turkish correctly either. Dotless and dotted I are DIFFERENT, but it mixes all these characters up and conflates them. So I'd like us not to perpetuate this further; creating some "nonstandard case folding" disagrees with the Unicode standard.

@rmuir (Member) commented Mar 15, 2025

Separately, it would be nice to add a boolean flag (for Turkish/Azeri) to that CaseFolding class and fix it to do the right thing, so it doesn't match unrelated characters in Turkish. Ultimately, if we want to add a function to that class to "fold" (e.g. for purposes like here), it should expose that boolean and match the Unicode case-folding algorithm.

@msfroh (Contributor, Author) commented Mar 17, 2025

Instead of a boolean flag, what if we define an interface that specifies the folding rules?

It could have two methods: one that folds input characters to a canonical representation (before sorting) and one that expands from the canonical representation to the characters that should be matched. We could ship ASCII and Turkish implementations to start, say. If someone has a Romanian corpus that has a mix of characters with and without diacritics, they might strip diacritics on input and expand them for matching. (That would effectively combine lowercase and ASCII folding.)

I think the same pluggable folding logic could be applied to regex matching too.

@rmuir (Member) commented Mar 17, 2025

I think my ask is misunderstood; it is just to follow the Unicode standard. There are two mappings for simple case folding:

  • Default
  • Alternate (Turkish/Azeri)

@rmuir (Member) commented Mar 17, 2025

If you want to do fancy Romanian accent removal, use an analyzer and normalize your data. That's what a search engine is all about.

But if we want to provide some limited runtime expansion (which I'm still very unsure about), let's just stick with the standards and not invent something else. No need for interfaces, abstractions, or lambdas; a boolean is the correct solution.

@msfroh (Contributor, Author) commented Mar 17, 2025

Okay, got it! That's the piece that I was misunderstanding. I didn't realize that Turkish/Azeri is the only other valid folding. I kept thinking of it as just an example where the default folding wouldn't work.

I'll go ahead and update this PR with that in mind.

Thanks a lot for the feedback and patience as I wrap my head around this!

@rmuir (Member) commented Mar 18, 2025

It is confusing, because the Unicode case-folding algorithm is supposed to work for everyone. But here's the problem:

for most of the world:

  • lowercase i has a dot, uppercase I has no dot.

for turkish/azeri world:

  • lowercase i corresponds to İ (with a dot), and dotless lowercase ı corresponds to I

that's why they need their own separate mappings.
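The two mappings show up directly in the JDK's locale-sensitive String methods, which make a handy reference for the difference described above:

```java
import java.util.Locale;

public class LocaleCase {
  public static void main(String[] args) {
    Locale tr = Locale.forLanguageTag("tr");
    // Most of the world: 'I' lowercases to dotted 'i'.
    System.out.println("I".toLowerCase(Locale.ROOT)); // i
    // Turkish/Azeri: 'I' lowercases to dotless 'ı' (U+0131),
    // and 'i' uppercases to dotted 'İ' (U+0130).
    System.out.println("I".toLowerCase(tr));                          // ı
    System.out.println("i".toUpperCase(tr).codePointAt(0) == 0x0130); // true
  }
}
```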

@dweiss (Contributor) left a review comment:

Those turkic flags... they will proliferate throughout the factory methods.

I don't know. Seems like we're trying to compensate for something that could indeed be done during indexing (analyzer pipeline).

    * @return An {@link Automaton} accepting all input strings. The resulting automaton is codepoint
    * based (full unicode codepoints on transitions).
    */
    public static Automaton makeCaseInsensitiveStringUnion(
@dweiss (Contributor) commented on this diff:

turkic parameter is missing from the javadoc.

    return 0x00131; // ı [LATIN SMALL LETTER DOTLESS I]
      }
    }
    return Character.toLowerCase(codepoint);
@rmuir (Member) commented on this diff:

For real case folding we have to do more than this. It is a simple 1:1 mapping, but e.g. Σ, σ, and ς will all fold to σ, whereas toLowerCase(ς) = ς, because it is already lower-case, just in final form. This is just one example. To see more, compare your function against ICU's UCharacter.foldCase(int, boolean) across all of Unicode.

@msfroh (Contributor, Author) replied:

Got it.

Checking https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt, indeed I can see those entries:

    03A3; C; 03C3; # GREEK CAPITAL LETTER SIGMA
    ...
    03C2; C; 03C3; # GREEK SMALL LETTER FINAL SIGMA

Ideally, I'd love to just use those folding rules.

I could get them from UCharacter.foldCase(int, bool), but that involves pulling in icu4j as a dependency, which is an extra 12MB jar.

Would it be worthwhile to write a generator that pulls https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt (updated to whatever the current Unicode spec is) and generates a foldCase method that's functionally equivalent to UCharacter.foldCase(int, bool)?
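Such a generator would parse records of the form `code; status; mapping; # name`, keeping the C (common) and S (simple) statuses for a 1:1 simple fold. A minimal, hypothetical parsing sketch (not from the PR) over the sample records quoted above:

```java
import java.util.*;

public class FoldTableSketch {
  // Parse CaseFolding.txt-style records, keeping statuses C (common)
  // and S (simple), which together give a 1:1 simple-fold mapping.
  static Map<Integer, Integer> parse(List<String> lines) {
    Map<Integer, Integer> fold = new HashMap<>();
    for (String line : lines) {
      int hash = line.indexOf('#');
      if (hash >= 0) line = line.substring(0, hash); // strip comment
      String[] fields = line.split(";");
      if (fields.length < 3) continue;
      String status = fields[1].trim();
      if (status.equals("C") || status.equals("S")) {
        fold.put(Integer.parseInt(fields[0].trim(), 16),
                 Integer.parseInt(fields[2].trim(), 16));
      }
    }
    return fold;
  }

  public static void main(String[] args) {
    Map<Integer, Integer> fold = parse(List.of(
        "03A3; C; 03C3; # GREEK CAPITAL LETTER SIGMA",
        "03C2; C; 03C3; # GREEK SMALL LETTER FINAL SIGMA"));
    // Both capital and final sigma fold to σ (U+03C3).
    System.out.println(Integer.toHexString(fold.get(0x03C2))); // 3c3
  }
}
```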

@rmuir (Member) replied:

Maybe, depending on what we are going to do with it? If done correctly, we could replace LowerCaseFilter, GreekLowerCaseFilter, etc. in the analysis chain. Of course "correctly" is a difficult bar here, as it would impact 100% of users in a very visible way and could easily bottleneck indexing or waste resources if done wrong. For example, large arrays of objects, or even of primitives, are a big no here. See https://www.strchr.com/multi-stage_tables and look at what the JDK and ICU already do.

But for the purpose of this PR, we may want to start simpler (this is the same approach I mentioned on the regex caseless PR). We should avoid huge arrays and large data files in lucene-core just to support more inefficient user regular expressions that aren't really related to searching. On the other hand, if we are going to get serious benefit everywhere (e.g. improve all analyzers), then maybe the tradeoff makes sense.

And I don't understand why we'd parse text files versus just writing the generator itself to use ICU, especially since we already use such an approach in the build: https://github.com/apache/lucene/blob/main/gradle/generation/icu/GenerateUnicodeProps.groovy

Still, I wouldn't immediately jump to generation as a start; it is a lot of work, and we should iterate. First I'd compare Character.toLowerCase(Character.toUpperCase(x)) to UCharacter.foldCase(int, false) to see what the delta really needs to be as far as data. I'd expect it to be very small. You can start prototyping with that instead of investing a ton of up-front time.
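That composition already improves on plain toLowerCase for the final-sigma example discussed earlier, which hints that the remaining delta is small. A sketch of the suggested baseline (JDK only; the actual comparison against ICU's UCharacter.foldCase is omitted here):

```java
public class FoldApprox {
  // Approximate simple case folding without ICU: upper-case first, then
  // lower-case. For final sigma ς (U+03C2) this yields σ (U+03C3),
  // whereas plain toLowerCase leaves ς unchanged.
  static int foldApprox(int cp) {
    return Character.toLowerCase(Character.toUpperCase(cp));
  }

  public static void main(String[] args) {
    int finalSigma = 0x03C2; // ς
    System.out.println(Integer.toHexString(Character.toLowerCase(finalSigma))); // 3c2 (unchanged)
    System.out.println(Integer.toHexString(foldApprox(finalSigma)));            // 3c3 (σ)
  }
}
```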

@rmuir (Member) commented Mar 21, 2025

Maybe this one helps the issue: #14389
