[DRAFT] Case-insensitive matching over union of strings #14350


Draft · wants to merge 2 commits into main

Conversation

@msfroh (Contributor) commented Mar 13, 2025

Description

This is a rough attempt to make StringsToAutomaton support case-insensitive strings.

@rmuir rmuir requested a review from dweiss March 13, 2025 01:38
@rmuir (Member) commented Mar 13, 2025

@dweiss understands this one the best, he implemented it.

    ? Character.toUpperCase(label)
    : Character.toLowerCase(label);
    if (altCase != label) {
      a.addTransition(converted, dest, altCase);
@rmuir (Member) commented on this diff:
maybe this works because the finishState() will sort the transitions always, and still give us the same efficient build?

@msfroh (Contributor, Author) replied Mar 13, 2025:

Essentially, I tried to copy what you did for the case-insensitive regex matching to add extra transition arcs for the other letter-cases.

I think finishState will sort the transitions. I don't know for sure, though.

Note that this implementation will be far more efficient if all of the input strings share the same case; otherwise, it may miss common (case-insensitive) prefixes. I'm imagining that a query would lowercase everything first.

@msfroh msfroh force-pushed the case_insensitive_string_union branch from 14802c6 to 036668a Compare March 13, 2025 01:47
@rmuir (Member) commented Mar 13, 2025

To me it seems like a potentially safe and practical addition. The idea would be that we can add transition "alternatives" (e.g. A vs. a) without breaking the high-level algorithm, due to the internal representation and how we sort transitions when doing e.g. finishState()?

I don't really have confidence either way, but it is a good one to explore. Maybe with testing we can be confident it still does the right thing?

@rmuir (Member) commented Mar 13, 2025

OG paper: https://aclanthology.org/J00-1002.pdf

@dweiss (Contributor) commented Mar 13, 2025

@dweiss understands this one the best, he implemented it.

... 15 years ago in LUCENE-3832. Thanks for putting so much trust in my memory. I'll take a look.

@dweiss (Contributor) commented Mar 13, 2025

I think this will work just fine in most cases and is a rather inexpensive way to implement case-insensitive matching, but it comes at the cost of an output automaton that may not be minimal. Consider this example:

    List<BytesRef> terms = new ArrayList<>(List.of(
            newBytesRef("abc"),
            newBytesRef("aBC")));
    Collections.sort(terms);
    Automaton a = build(terms, false, false);

which produces:
[automaton diagram omitted]

However, when you naively expand just the transitions for each letter variant, you get this:
[automaton diagram omitted]
which clearly isn't minimal (and doesn't pass checkMinimized).

I think the absolute worst case is for the automaton to multiply the number of transitions by however many alternatives there are for each codepoint; the number of states remains the same. So it's not going to expand uncontrollably... but it's no longer minimal. Perhaps this is acceptable, given the constrained worst case?

@rmuir (Member) commented Mar 13, 2025

Bigger downside: that example isn't deterministic either.

@dweiss (Contributor) commented Mar 13, 2025

Crap, you're right. Didn't think of it.

@dweiss (Contributor) commented Mar 13, 2025

I also don't think you can make it deterministic in any trivial way.

@msfroh (Contributor, Author) commented Mar 13, 2025

My thinking is that a query using this should lowercase, dedupe, and sort the input before feeding it into StringsToAutomaton. That would handle @dweiss's example: that input is "invalid", or at least exact duplicates (after lowercasing) wouldn't add any new states, I think, since the full string already exists as a prior prefix.

Would a check with Character.isLowerCase() on each input codepoint for the case-insensitive case be sufficient to reject that kind of input across all valid Unicode strings?
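As a caller-side sketch of that pre-processing (a hypothetical helper, not part of Lucene; it lowercases with the JDK's locale-insensitive Character.toLowerCase, then dedupes and sorts):

```java
import java.util.*;
import java.util.stream.*;

public class Preprocess {
  // Hypothetical pre-processing a caller might apply before feeding
  // terms to StringsToAutomaton: lowercase each codepoint, dedupe, sort.
  static List<String> normalize(Collection<String> raw) {
    return raw.stream()
        .map(s -> s.codePoints()
            .map(Character::toLowerCase)
            .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
            .toString())
        .distinct()
        .sorted()
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // "abc" and "aBC" collapse to one term after lowercasing.
    System.out.println(normalize(List.of("abc", "aBC", "Cat"))); // [abc, cat]
  }
}
```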

@rmuir (Member) commented Mar 13, 2025

Would a check with Character.isLowerCase() on each input codepoint for the case-insensitive case be sufficient to reject that kind of input across all valid Unicode strings?

I don't think so for Greek. I would step back from that and try to get matching working with simple Character.toLowerCase/toUpperCase first? If the user provides data with a certain order/casing as you suggest, will they always get a DFA?

I'm less concerned about it being minimal; let's start with deterministic. I don't think we should do this if it will explode (e.g. into an NFA). And a union of strings really wants to explode if handled the naive way; that's why there is a special algorithm for it.

@msfroh (Contributor, Author) commented Mar 14, 2025

To the best of my understanding from reading through the code while sketching this PR, I believe it would produce a minimal DFA if every character in a set of alternatives in the input strings has the same canonical representation. (The existing implementation already throws if the input is not sorted BytesRefs.)

That is, if you input cap, cat, cats, cob, it will generate the minimal DFA. If you input CAP, CAT, CATS, COB, you'll end up with the same minimal DFA (albeit with the transitions added in the opposite order, which I think is fine). But if you input CAP, CATS, cat, cob, you'll end up with an NFA.
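To make the DFA-vs-NFA distinction concrete: after naive expansion, a state can end up with two different targets for the same label. A toy determinism check over a simplified transition map (this is not Lucene's Automaton representation, just an illustration of the property under discussion):

```java
import java.util.*;

public class DeterminismCheck {
  // Toy automaton: state -> (codepoint label -> set of target states).
  // Deterministic iff every label out of every state has at most one target.
  static boolean isDeterministic(Map<Integer, Map<Integer, Set<Integer>>> transitions) {
    for (Map<Integer, Set<Integer>> arcs : transitions.values()) {
      for (Set<Integer> targets : arcs.values()) {
        if (targets.size() > 1) return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Suppose naive case expansion gave state 1 two distinct 'a'-targets:
    Map<Integer, Map<Integer, Set<Integer>>> t =
        Map.of(1, Map.of((int) 'a', Set.of(2, 3)));
    System.out.println(isDeterministic(t)); // false
  }
}
```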

@dweiss (Contributor) commented Mar 14, 2025

I don't know Unicode as well as Rob so I can't say what these alternate case-folding equivalence classes are... but they definitely don't have a "canonical" representation with regard to Character.toLowerCase. Consider the killer Turkish dotless i, for example:

    public void testCornerCase() throws Exception {
        List<BytesRef> terms = Stream.of(
                        "aIb", "aıc")
                .map(s -> {
                    int[] lowercased = s.codePoints().map(Character::toLowerCase).toArray();
                    return new String(lowercased, 0, lowercased.length);
                })
                .map(LuceneTestCase::newBytesRef)
                .sorted()
                .collect(Collectors.toCollection(ArrayList::new));
        Automaton a = build(terms, false, true);
        System.out.println(a.toDot());
        assertTrue(a.isDeterministic());
    }

which yields:
[automaton diagram omitted]

It would take some kind of character-normalization filter on both the index side and the automaton building/expansion side for this to work (but then you don't really need to bother with any of this: if your index is "normalized" and your query is "normalized", then a normal term query will do just fine?).

@dweiss (Contributor) commented Mar 14, 2025

Or we can just embrace the fact that it can be a non-minimal NFA and just let it run like that (with NFARunAutomaton).

@rmuir (Member) commented Mar 14, 2025

This is why I recommended not using the Unicode function and starting simple. Then you have a potential way to get it working efficiently.

@rmuir (Member) commented Mar 14, 2025

Or we can just embrace the fact that it can be a non-minimal NFA and just let it run like that (with NFARunAutomaton).

I don't think this is currently a good option either: users won't just do that. They will determinize, minimize, and tableize and then be confused when things are slow or use too much memory.

@dweiss (Contributor) commented Mar 14, 2025

Ok, fair enough.

@msfroh (Contributor, Author) commented Mar 14, 2025

This is kind of what I had in mind:

  private static int canonicalize(int codePoint) {
    int[] alternatives = CaseFolding.lookupAlternates(codePoint);
    if (alternatives != null) {
      for (int cp : alternatives) {
        codePoint = Math.min(codePoint, cp);
      }
    } else {
      int altCase = Character.isLowerCase(codePoint) ? Character.toUpperCase(codePoint) : Character.toLowerCase(codePoint);
      codePoint = Math.min(codePoint, altCase);
    }
    return codePoint;
  }

  public void testCornerCase() throws Exception {
    List<BytesRef> terms = Stream.of(
                    "aIb", "aıc")
            .map(s -> {
              int[] lowercased = s.codePoints().map(TestStringsToAutomaton::canonicalize).toArray();
              return new String(lowercased, 0, lowercased.length);
            })
            .map(LuceneTestCase::newBytesRef)
            .sorted()
            .collect(Collectors.toCollection(ArrayList::new));
    Automaton a = build(terms, false, true);
    System.out.println(a.toDot());
    assertTrue(a.isDeterministic());
  }

That produces this automaton, which is minimal and deterministic:
[automaton diagram omitted]

I don't know if that canonicalize method is a good idea, though.

@rmuir (Member) commented Mar 14, 2025

It isn't a good idea. If the user wants to "erase case differences", they should apply foldCase(ch): that's what case folding means. That CaseFolding class does everything except that. Again, it's why I recommend not messing with it for now and starting simpler.

@msfroh (Contributor, Author) commented Mar 15, 2025

Hmm... I'm thinking of just requiring that input is lowercase (per Character.toLowerCase(c)), then checking for collisions on uppercase versions when adding transitions, and throwing an exception (since the result won't be a DFA).

Unfortunately, that would mess with Turkish, if someone tries searching for sınıf (class) and sinirli (nervous). Without locale info, we'd get two transitions from s to I.

As I understand it, the regexp case-insensitive implementation just says that i, ı, I, and İ are all the same letter, which is what the canonicalize hack does (collapsing them all into I on input and exploding them in the automaton).
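For reference, the JDK's locale-insensitive simple mappings really do conflate these four letters (plain java.lang.Character behavior, no Lucene involved):

```java
public class TurkishI {
  public static void main(String[] args) {
    // Locale-insensitive simple mappings in java.lang.Character:
    // dotless ı (U+0131) uppercases to plain I (U+0049), and dotted İ
    // (U+0130) lowercases to plain i (U+0069). Round-tripping through
    // these mappings therefore mixes all four letters together.
    System.out.println(Character.toUpperCase(0x0131) == 'I'); // true
    System.out.println(Character.toLowerCase(0x0130) == 'i'); // true
    System.out.println(Character.toLowerCase('I') == 'i');    // true
  }
}
```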

@rmuir (Member) commented Mar 15, 2025

+1 to start simple with Character.toLowerCase, that's the best you can get in Java.

The problem is java not having a Character.foldCase. A proper function would look like ICU's UCharacter.foldCase(int, boolean): https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/lang/UCharacter.html#foldCase-int-boolean-

The regexp folding code doesn't handle Turkish correctly either. Dotless and dotted I are DIFFERENT, but it mixes all these characters up and conflates them. So I'd like us not to perpetuate this further; creating some "nonstandard case folding" disagrees with the Unicode standard.

@rmuir (Member) commented Mar 15, 2025

Separately, it would be nice to add a boolean flag (for Turkish/Azeri) to that CaseFolding class and fix it to do the right thing, so it doesn't match unrelated characters in Turkish. Ultimately, if we want to add a function to that class to "fold" (e.g. for purposes like here), it should expose that boolean and match the Unicode case-folding algorithm.

@msfroh (Contributor, Author) commented Mar 17, 2025

Instead of a boolean flag, what if we define an interface that specifies the folding rules?

It could have two methods: one that folds input characters to a canonical representation (before sorting) and one that expands from the canonical representation to the characters that should be matched. We could ship ASCII and Turkish implementations to start, say. If someone has a Romanian corpus that has a mix of characters with and without diacritics, they might strip diacritics on input and expand them for matching. (That would effectively combine lowercase and ASCII folding.)

I think the same pluggable folding logic could be applied to regex matching too.

@rmuir (Member) commented Mar 17, 2025

I think my ask is misunderstood; it is just to follow the Unicode standard. There are two mappings for simple case folding:

  • Default
  • Alternate (Turkish/Azeri)

@rmuir (Member) commented Mar 17, 2025

If you want to do fancy Romanian accent removal, use an analyzer and normalize your data. That's what a search engine is all about.

But if we want to provide some limited runtime expansion (which I'm still very unsure about), let's just stick with the standards and not invent something else. No need for interfaces, abstractions, or lambdas; a boolean is the correct solution.

@msfroh (Contributor, Author) commented Mar 17, 2025

Okay, got it! That's the piece that I was misunderstanding. I didn't realize that Turkish/Azeri is the only other valid folding. I kept thinking of it as just an example where the default folding wouldn't work.

I'll go ahead and update this PR with that in mind.

Thanks a lot for the feedback and patience as I wrap my head around this!

@rmuir (Member) commented Mar 18, 2025

It is confusing, because the Unicode case-folding algorithm is supposed to work for everyone. But here's the problem:

for most of the world:

  • lowercase i has a dot, uppercase I has no dot.

for turkish/azeri world:

  • lowercase i corresponds to İ (with a dot), and dotless lowercase ı corresponds to I

that's why they need their own separate mappings.
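The two mappings show up directly in the JDK's locale-sensitive String methods, which make a handy reference for the difference described above:

```java
import java.util.Locale;

public class LocaleCase {
  public static void main(String[] args) {
    Locale tr = Locale.forLanguageTag("tr");
    // Most of the world: 'I' lowercases to dotted 'i'.
    System.out.println("I".toLowerCase(Locale.ROOT)); // i
    // Turkish/Azeri: 'I' lowercases to dotless 'ı' (U+0131),
    // and 'i' uppercases to dotted 'İ' (U+0130).
    System.out.println("I".toLowerCase(tr));                          // ı
    System.out.println("i".toUpperCase(tr).codePointAt(0) == 0x0130); // true
  }
}
```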

@dweiss (Contributor) left a review comment:

Those turkic flags... they will proliferate throughout the factory methods.

I don't know. Seems like we're trying to compensate for something that could indeed be done during indexing (analyzer pipeline).

    * @return An {@link Automaton} accepting all input strings. The resulting automaton is codepoint
    * based (full unicode codepoints on transitions).
    */
    public static Automaton makeCaseInsensitiveStringUnion(
@dweiss (Contributor) commented on this diff:

turkic parameter is missing from the javadoc.

    return 0x00131; // ı [LATIN SMALL LETTER DOTLESS I]
      }
    }
    return Character.toLowerCase(codepoint);
@rmuir (Member) commented on this diff:

For real case folding we have to do more than this. It is a simple 1:1 mapping, but e.g. Σ, σ, and ς will all fold to σ, whereas toLowerCase(ς) = ς, because it is already lower-case, just in final form. This is just one example. To see more, compare your function against ICU's UCharacter.foldCase(int, boolean) across all of Unicode.

@msfroh (Contributor, Author) replied:

Got it.

Checking https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt, indeed I can see those entries:

    03A3; C; 03C3; # GREEK CAPITAL LETTER SIGMA
    ...
    03C2; C; 03C3; # GREEK SMALL LETTER FINAL SIGMA

Ideally, I'd love to just use those folding rules.

I could get them from UCharacter.foldCase(int, bool), but that involves pulling in icu4j as a dependency, which is an extra 12MB jar.

Would it be worthwhile to write a generator that pulls https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt (updated to whatever the current Unicode spec is) and generates a foldCase method that's functionally equivalent to UCharacter.foldCase(int, bool)?
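Such a generator would parse records of the form `code; status; mapping; # name`, keeping the C (common) and S (simple) statuses for a 1:1 simple fold. A minimal, hypothetical parsing sketch (not from the PR) over the sample records quoted above:

```java
import java.util.*;

public class FoldTableSketch {
  // Parse CaseFolding.txt-style records, keeping statuses C (common)
  // and S (simple), which together give a 1:1 simple-fold mapping.
  static Map<Integer, Integer> parse(List<String> lines) {
    Map<Integer, Integer> fold = new HashMap<>();
    for (String line : lines) {
      int hash = line.indexOf('#');
      if (hash >= 0) line = line.substring(0, hash); // strip comment
      String[] fields = line.split(";");
      if (fields.length < 3) continue;
      String status = fields[1].trim();
      if (status.equals("C") || status.equals("S")) {
        fold.put(Integer.parseInt(fields[0].trim(), 16),
                 Integer.parseInt(fields[2].trim(), 16));
      }
    }
    return fold;
  }

  public static void main(String[] args) {
    Map<Integer, Integer> fold = parse(List.of(
        "03A3; C; 03C3; # GREEK CAPITAL LETTER SIGMA",
        "03C2; C; 03C3; # GREEK SMALL LETTER FINAL SIGMA"));
    // Both capital and final sigma fold to σ (U+03C3).
    System.out.println(Integer.toHexString(fold.get(0x03C2))); // 3c3
  }
}
```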

@rmuir (Member) replied:

Maybe, depending on what we are going to do with it? If done correctly, we could replace LowerCaseFilter, GreekLowerCaseFilter, etc. in the analysis chain. Of course "correctly" is a difficult bar here, as it would impact 100% of users in a very visible way and could easily bottleneck indexing or waste resources if done wrong. For example, large arrays of objects, or even of primitives, are a big no here. See https://www.strchr.com/multi-stage_tables and look at what the JDK and ICU already do.

But for the purpose of this PR, we may want to start simpler (this is the same approach I mentioned on the regex caseless PR). We should avoid huge arrays and large data files in lucene-core just to support more inefficient user regular expressions that aren't really related to searching. On the other hand, if we are going to get serious benefit everywhere (e.g. improve all analyzers), then maybe the tradeoff makes sense.

And I don't understand why we'd parse text files versus just writing the generator itself to use ICU, especially since we already use such an approach in the build: https://github.com/apache/lucene/blob/main/gradle/generation/icu/GenerateUnicodeProps.groovy

Still, I wouldn't immediately jump to generation as a start; it is a lot of work, and we should iterate. First I'd compare Character.toLowerCase(Character.toUpperCase(x)) to UCharacter.foldCase(int, false) to see what the delta really needs to be as far as data. I'd expect it to be very small. You can start prototyping with that instead of investing a ton of up-front time.
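That composition already improves on plain toLowerCase for the final-sigma example discussed earlier, which hints that the remaining delta is small. A sketch of the suggested baseline (JDK only; the actual comparison against ICU's UCharacter.foldCase is omitted here):

```java
public class FoldApprox {
  // Approximate simple case folding without ICU: upper-case first, then
  // lower-case. For final sigma ς (U+03C2) this yields σ (U+03C3),
  // whereas plain toLowerCase leaves ς unchanged.
  static int foldApprox(int cp) {
    return Character.toLowerCase(Character.toUpperCase(cp));
  }

  public static void main(String[] args) {
    int finalSigma = 0x03C2; // ς
    System.out.println(Integer.toHexString(Character.toLowerCase(finalSigma))); // 3c2 (unchanged)
    System.out.println(Integer.toHexString(foldApprox(finalSigma)));            // 3c3 (σ)
  }
}
```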

@rmuir (Member) commented Mar 21, 2025

Maybe this one helps the issue: #14389
