[DRAFT] Case-insensitive matching over union of strings #14350
Conversation
Force-pushed from 99808bc to 14802c6
@dweiss understands this one the best, he implemented it.
```java
        ? Character.toUpperCase(label)
        : Character.toLowerCase(label);
if (altCase != label) {
  a.addTransition(converted, dest, altCase);
```
Maybe this works because finishState() will always sort the transitions, and still give us the same efficient build?
Essentially, I tried to copy what you did for the case-insensitive regex matching to add extra transition arcs for the other letter-cases. I think finishState will sort the transitions. I don't know for sure, though.

Note that this implementation will be way more efficient if all of the input strings are the same case. Otherwise, it might miss common (case-insensitive) prefixes. I'm imagining that a query would lowercase everything first.
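A standalone sketch of the alternate-case expansion described above. The `altCase` helper name and the `main` harness are illustrative, not from the PR, which does the computation inline when adding transitions:

```java
public class AltCase {
  // Returns the "other case" of a codepoint, or -1 if there is none.
  // (Hypothetical helper; the PR adds a transition for this alternate
  // label alongside the original one.)
  static int altCase(int label) {
    int alt =
        Character.isLowerCase(label)
            ? Character.toUpperCase(label)
            : Character.toLowerCase(label);
    return alt != label ? alt : -1;
  }

  public static void main(String[] args) {
    System.out.println(altCase('a'));    // 65 ('A')
    System.out.println(altCase(0x0131)); // 73 ('I'): dotless i upper-cases to plain I
    System.out.println(altCase('7'));    // -1: no case variant, no extra transition
  }
}
```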
Force-pushed from 14802c6 to 036668a
To me it seems like a potentially safe and practical addition. The idea would be that we can add transition "alternatives" (e.g. …). I don't really have confidence either way, but it is a good one to explore. Maybe with testing we can be confident it still does the right thing?
OG paper: https://aclanthology.org/J00-1002.pdf
... 15 years ago in LUCENE-3832. Thanks for putting so much trust in my memory. I'll take a look.
Bigger downside: that example isn't deterministic either.

Crap, you're right. Didn't think of it.

I also don't think you can make it deterministic in any trivial way.
My thinking is that a query that uses this should lowercase, dedupe, and sort the input before feeding it into …. Would a check with …?
I don't think so for Greek. I would step back from that and try to get matching working with simple Character.toLowerCase/toUpperCase first. If the user provides data with a certain order/casing as you suggest, will they always get a DFA? I'm less concerned about it being minimal; let's start with deterministic. I don't think we should do this if it will explode (e.g. into an NFA). And a union of strings really wants to do that if handled the naive way; that's why there is a special algorithm for it.
To the best of my understanding from reading through the code while sketching this PR, I believe it would produce a minimal DFA if every character in a set of alternatives in the input strings has the same canonical representation. (The existing implementation already throws if the input is not sorted BytesRefs.) That is, if you input …
I don't know Unicode as well as Rob, so I can't say what these alternate case foldings …
It would take some kind of character normalization filter on both the index side and the automaton building/expansion side for this to work (but then you don't really need to bother with any of this: if your index is "normalized" and your query is "normalized", then a normal term query will do just fine?).
Or we can just embrace the fact that it can be a non-minimal NFA and just let it run like that (with NFARunAutomaton).
This is why I recommended not using the Unicode function and starting simple. Then you have a potential way to get it working efficiently.
I don't think this is currently a good option either: users won't just do that. They will determinize, minimize, and tableize, and then be confused when things are slow or use too much memory.
Ok, fair enough.
It isn't a good idea. If the user wants to "erase case differences", then they should apply …
Hmm... I'm thinking of just requiring that the input is lowercase (per …). Unfortunately, that would mess with Turkish if someone tries searching for sınıf (class) and sinirli (nervous). Without locale info, we'd get two transitions from s to I. As I understand it, the regexp case-insensitive implementation just says that i, ı, I, and İ are all the same letter, which is what the …
+1 to start simple with Character.toLowerCase; that's the best you can get in Java. The problem is Java not having a Character.foldCase. A proper function would look like ICU's …

The regexp folding code doesn't handle Turkish correctly either. Dotless and dotted I are DIFFERENT, but it mixes all these characters up and conflates them. So I'd like for us not to perpetuate this further; somehow creating "nonstandard case folding" disagrees with the Unicode standard.
Separately, it would be nice to add a boolean flag (for Turkish/Azeri) to that CaseFolding class, and fix it to do the right thing, so it doesn't match unrelated characters in Turkish. Ultimately, if we want to add a function to that class to "fold" (e.g. for purposes like here), it should expose that boolean and match the Unicode case folding algorithm.
Instead of a boolean flag, what if we define an interface that specifies the folding rules? It could have two methods: one that folds input characters to a canonical representation (before sorting) and one that expands from the canonical representation to the characters that should be matched. We could ship ASCII and Turkish implementations to start, say. If someone has a Romanian corpus that has a mix of characters with and without diacritics, they might strip diacritics on input and expand them for matching. (That would effectively combine lowercasing and ASCII folding.) I think the same pluggable folding logic could be applied to regex matching too.
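A rough sketch of the pluggable folding interface being proposed here (all names are hypothetical, and later comments in the thread argue against adding such an abstraction):

```java
public class FoldSketch {
  // Hypothetical pluggable folding rules: fold to a canonical form before
  // sorting, then expand the canonical form to everything that should match.
  interface CaseFolder {
    int fold(int codepoint);
    int[] expand(int codepoint);
  }

  // Toy ASCII implementation for illustration only.
  static final CaseFolder ASCII =
      new CaseFolder() {
        public int fold(int cp) {
          return (cp >= 'A' && cp <= 'Z') ? cp + 32 : cp;
        }

        public int[] expand(int cp) {
          return (cp >= 'a' && cp <= 'z') ? new int[] {cp, cp - 32} : new int[] {cp};
        }
      };

  public static void main(String[] args) {
    System.out.println(ASCII.fold('A'));          // 97 ('a')
    System.out.println(ASCII.expand('a').length); // 2: matches both 'a' and 'A'
  }
}
```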
I think my ask is misunderstood; it is just to follow the Unicode standard. There are two mappings for simple case folding:

- the default mappings, and
- the Turkish/Azeri mappings (the T entries in CaseFolding.txt).
If you want to do fancy Romanian accent removal, use an analyzer and normalize your data. That's what a search engine is all about. But if we want to provide some limited runtime expansion (which I'm still very unsure about), let's just stick with the standards and not invent something else. No need for interfaces, abstractions, or lambdas.
Okay, got it! That's the piece that I was misunderstanding. I didn't realize that Turkish/Azeri is the only other valid folding; I kept thinking of it as just an example where the default folding wouldn't work. I'll go ahead and update this PR with that in mind. Thanks a lot for the feedback and patience as I wrap my head around this!
It is confusing, because the Unicode case folding algorithm is supposed to work for everyone. But here's the problem. For most of the world:

- I folds to i

For the Turkish/Azeri world:

- I folds to ı, and İ folds to i

That's why they need their own separate mappings.
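The Turkish/Azeri special case is visible directly in the JDK's locale-sensitive string casing (this snippet is just an illustration, not part of the PR):

```java
import java.util.Locale;

public class TurkishCase {
  public static void main(String[] args) {
    Locale tr = Locale.forLanguageTag("tr");
    // Default (root locale) rules: plain dotted/dotless pairing.
    System.out.println("i".toUpperCase(Locale.ROOT)); // I
    // Turkish rules: dotted i upper-cases to dotted İ (U+0130),
    // and plain I lower-cases to dotless ı (U+0131).
    System.out.println("i".toUpperCase(tr)); // İ
    System.out.println("I".toLowerCase(tr)); // ı
  }
}
```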
Those Turkic flags... they will proliferate throughout the factory methods.
I don't know. Seems like we're trying to compensate for something that could indeed be done during indexing (analyzer pipeline).
```java
 * @return An {@link Automaton} accepting all input strings. The resulting automaton is codepoint
 *     based (full unicode codepoints on transitions).
 */
public static Automaton makeCaseInsensitiveStringUnion(
```
The turkic parameter is missing from the Javadoc.
```java
    return 0x00131; // ı [LATIN SMALL LETTER DOTLESS I]
  }
}
return Character.toLowerCase(codepoint);
```
For real case folding we have to do more than this. It is a simple 1-1 mapping, but e.g. Σ, σ, and ς will all fold to σ, whereas toLowerCase(ς) = ς, because it is already in lower case, just in final form. This is just an example; to see more, compare your function against ICU's UCharacter.foldCase(int, bool) across all of Unicode.
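The final-sigma behavior is easy to check against the JDK directly (illustrative snippet; the round-trip trick here is the same one suggested later in the thread):

```java
public class SigmaFold {
  public static void main(String[] args) {
    int finalSigma = 0x03C2; // ς GREEK SMALL LETTER FINAL SIGMA
    // Plain toLowerCase leaves final sigma alone: it is already lowercase.
    System.out.println(Integer.toHexString(Character.toLowerCase(finalSigma))); // 3c2
    // Round-tripping through toUpperCase reaches Σ (U+03A3) first,
    // whose lowercase mapping is the non-final σ (U+03C3) -- the fold target.
    System.out.println(
        Integer.toHexString(Character.toLowerCase(Character.toUpperCase(finalSigma)))); // 3c3
  }
}
```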
Got it.

Checking https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt, indeed I can see those entries:

```
03A3; C; 03C3; # GREEK CAPITAL LETTER SIGMA
...
03C2; C; 03C3; # GREEK SMALL LETTER FINAL SIGMA
```

Ideally, I'd love to just use those folding rules. I could get them from UCharacter.foldCase(int, bool), but that involves pulling in icu4j as a dependency, which is an extra 12MB jar.

Would it be worthwhile to write a generator that pulls https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt (updated to whatever the current Unicode spec is) and generates a foldCase method that's functionally equivalent to UCharacter.foldCase(int, bool)?
Maybe, depending on what we are going to do with it? If done correctly we could replace LowerCaseFilter, GreekLowerCaseFilter, etc. in the analysis chain. Of course "correctly" there is a difficult bar, as it would impact 100% of users in a very visible way and could easily bottleneck indexing / waste resources if not done correctly. For example, large arrays of objects or even primitives are a big no here. See https://www.strchr.com/multi-stage_tables and look at what the JDK and ICU do already.

But for the purpose of this PR, we may want to start simpler (this is the same approach I mentioned on the regex caseless PR). We should avoid huge arrays and large data files in lucene-core just for adding more inefficient user regular expressions that aren't really related to searching. On the other hand, if we are going to get serious benefit everywhere (e.g. improve all analyzers), then maybe the tradeoff makes sense.

And I don't understand why we'd parse text files versus just writing the generator itself to use ICU, especially since we already use such an approach in the build: https://github.com/apache/lucene/blob/main/gradle/generation/icu/GenerateUnicodeProps.groovy

Still, I wouldn't immediately jump to generation as a start; it is a lot of work, and we should iterate. First I'd compare Character.toLowerCase(Character.toUpperCase(x)) to UCharacter.foldCase(int, false) to see what the delta really needs to be as far as data. I'd expect this to be very small. You can start prototyping with that instead of investing a ton of up-front time.
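One illustrative way to start that comparison without pulling in the icu4j dependency: scan all codepoints and count where plain toLowerCase disagrees with the toLowerCase(toUpperCase(x)) round trip. This is only a rough proxy for the delta against UCharacter.foldCase (the class name here is made up); with ICU on the classpath you would compare each result against foldCase directly.

```java
public class FoldDelta {
  public static void main(String[] args) {
    int count = 0;
    for (int cp = 0; cp <= 0x10FFFF; cp++) {
      if (!Character.isDefined(cp)) continue;
      int simple = Character.toLowerCase(cp);
      int roundTrip = Character.toLowerCase(Character.toUpperCase(cp));
      // Disagreements flag codepoints where simple lowercasing is not
      // the same as the round-trip approximation of case folding.
      if (simple != roundTrip) {
        count++;
      }
    }
    System.out.println("codepoints where the two disagree: " + count);
    // Final sigma (U+03C2) is one such codepoint:
    System.out.println(
        Character.toLowerCase(0x03C2)
            != Character.toLowerCase(Character.toUpperCase(0x03C2))); // true
  }
}
```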
Maybe this one helps the issue: #14389
Description

This is a rough attempt to make StringsToAutomaton support case-insensitive strings.