[DRAFT] Case-insensitive matching over union of strings #14350

Draft: wants to merge 2 commits into base: main
Original file line number Diff line number Diff line change
@@ -608,7 +608,24 @@ public static Automaton makeStringUnion(Iterable<BytesRef> utf8Strings) {
if (utf8Strings.iterator().hasNext() == false) {
return makeEmpty();
} else {
return StringsToAutomaton.build(utf8Strings, false);
return StringsToAutomaton.build(utf8Strings, false, false, false);
}
}

/**
* Returns a new (deterministic and minimal) automaton that accepts the union of the given
* collection of {@link BytesRef}s representing UTF-8 encoded strings.
*
* @param utf8Strings The input strings, UTF-8 encoded. The collection must be in sorted order.
* @return An {@link Automaton} accepting all input strings. The resulting automaton is codepoint
* based (full unicode codepoints on transitions).
*/
public static Automaton makeCaseInsensitiveStringUnion(
Contributor

The turkic parameter is missing from the Javadoc.

Iterable<BytesRef> utf8Strings, boolean turkic) {
if (utf8Strings.iterator().hasNext() == false) {
return makeEmpty();
} else {
return StringsToAutomaton.build(utf8Strings, false, true, turkic);
}
}

@@ -625,7 +642,7 @@ public static Automaton makeBinaryStringUnion(Iterable<BytesRef> utf8Strings) {
if (utf8Strings.iterator().hasNext() == false) {
return makeEmpty();
} else {
return StringsToAutomaton.build(utf8Strings, true);
return StringsToAutomaton.build(utf8Strings, true, false, false);
}
}

@@ -638,7 +655,7 @@ public static Automaton makeBinaryStringUnion(Iterable<BytesRef> utf8Strings) {
* based (full unicode codepoints on transitions).
*/
public static Automaton makeStringUnion(BytesRefIterator utf8Strings) throws IOException {
return StringsToAutomaton.build(utf8Strings, false);
return StringsToAutomaton.build(utf8Strings, false, false, false);
}

/**
@@ -651,6 +668,6 @@ public static Automaton makeStringUnion(BytesRefIterator utf8Strings) throws IOE
* based (UTF-8 encoded byte transition labels).
*/
public static Automaton makeBinaryStringUnion(BytesRefIterator utf8Strings) throws IOException {
return StringsToAutomaton.build(utf8Strings, true);
return StringsToAutomaton.build(utf8Strings, true, false, false);
}
}
@@ -743,4 +743,42 @@ static int[] lookupAlternates(int codepoint) {

return alts;
}

/**
* Folds the case of the given character according to {@link Character#toLowerCase(int)}, but with
* exceptions if the turkic flag is set.
*
* @param codepoint the code point for the character to fold
* @param turkic if true, then apply tr/az folding rules
* @return the folded character
*/
static int foldCase(int codepoint, boolean turkic) {
if (turkic) {
if (codepoint == 0x00130) { // İ [LATIN CAPITAL LETTER I WITH DOT ABOVE]
return 0x00069; // i [LATIN SMALL LETTER I]
} else if (codepoint == 0x000049) { // I [LATIN CAPITAL LETTER I]
return 0x00131; // ı [LATIN SMALL LETTER DOTLESS I]
}
}
return Character.toLowerCase(codepoint);
Member

Real case folding needs more than this. It is a simple 1-1 mapping, but, for example, Σ, σ, and ς all fold to σ, whereas toLowerCase(ς) = ς because ς is already lower-case, just in final form. That is only one example. To see more, compare your function against ICU's UCharacter.foldCase(int, boolean) across all of Unicode.
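
The sigma example can be checked with the JDK alone. The following is a minimal sketch, not part of the PR; the class and method names are hypothetical. It shows that plain lower-casing leaves ς (U+03C2) unchanged, while upper-casing first and then lower-casing maps it to σ (U+03C3).

```java
// Sketch only: SigmaFoldingDemo and its methods are hypothetical names, not Lucene APIs.
final class SigmaFoldingDemo {
  static final int CAPITAL_SIGMA = 0x03A3; // Σ
  static final int SMALL_SIGMA = 0x03C3; // σ
  static final int FINAL_SIGMA = 0x03C2; // ς

  /** Plain lower-casing, which is what the draft foldCase does when turkic is false. */
  static int lowerOnly(int codepoint) {
    return Character.toLowerCase(codepoint);
  }

  /** Upper-case first, then lower-case, so lower-case variants collapse too. */
  static int upperThenLower(int codepoint) {
    return Character.toLowerCase(Character.toUpperCase(codepoint));
  }

  public static void main(String[] args) {
    for (int cp : new int[] {CAPITAL_SIGMA, SMALL_SIGMA, FINAL_SIGMA}) {
      System.out.printf(
          "U+%04X lowerOnly=U+%04X upperThenLower=U+%04X%n",
          cp, lowerOnly(cp), upperThenLower(cp));
    }
  }
}
```

Under plain lower-casing, the final sigma stays a distinct character, so an input term containing ς would pass the "must be lower-case" check yet never match text written with σ.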

Contributor Author

Got it.

Checking https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt, indeed I can see those entries:

03A3; C; 03C3; # GREEK CAPITAL LETTER SIGMA
...
03C2; C; 03C3; # GREEK SMALL LETTER FINAL SIGMA

Ideally, I'd love to just use those folding rules.

I could get them from UCharacter.foldCase(int, boolean), but that involves pulling in icu4j as a dependency, which is an extra 12MB jar.

Would it be worthwhile to write a generator that pulls https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt (updated to whatever the current Unicode spec is) and generates a foldCase method that's functionally equivalent to UCharacter.foldCase(int, boolean)?
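
To make the file format concrete, here is a minimal sketch of reading CaseFolding.txt-style lines such as the two quoted above. It is an illustration under assumptions, not a proposed generator: the class and method names are hypothetical, and it keeps only the simple one-to-one foldings (status C and S), skipping full (F) and Turkic (T) entries.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: CaseFoldingLines is a hypothetical helper, not part of Lucene or the PR.
final class CaseFoldingLines {

  /**
   * Parses lines of the form {@code <code>; <status>; <mapping>; # <name>} and returns the simple
   * one-to-one foldings (status C or S). Comment-only, blank, and malformed lines are skipped.
   */
  static Map<Integer, Integer> parseSimpleFoldings(List<String> lines) {
    Map<Integer, Integer> folds = new LinkedHashMap<>();
    for (String line : lines) {
      int hash = line.indexOf('#');
      String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
      if (data.isEmpty()) {
        continue; // blank or comment-only line
      }
      String[] fields = data.split(";");
      if (fields.length < 3) {
        continue; // not a data line
      }
      String status = fields[1].trim();
      if (status.equals("C") || status.equals("S")) {
        int from = Integer.parseInt(fields[0].trim(), 16);
        int to = Integer.parseInt(fields[2].trim(), 16);
        folds.put(from, to);
      }
    }
    return folds;
  }

  public static void main(String[] args) {
    List<String> sample =
        List.of(
            "03A3; C; 03C3; # GREEK CAPITAL LETTER SIGMA",
            "03C2; C; 03C3; # GREEK SMALL LETTER FINAL SIGMA");
    parseSimpleFoldings(sample)
        .forEach((from, to) -> System.out.printf("U+%04X -> U+%04X%n", from, to));
  }
}
```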

Member

Maybe, depending on what we are going to do with it. If done correctly, we could replace LowerCaseFilter, GreekLowerCaseFilter, etc. in the analysis chain. Of course, "correctly" is a difficult bar there, as it would impact 100% of users in a very visible way and could easily bottleneck indexing or waste resources if not done correctly. For example, large arrays of objects, or even of primitives, are a big no here. See https://www.strchr.com/multi-stage_tables and look at what the JDK and ICU already do.

But for the purpose of this PR, we may want to start simpler (this is the same approach I mentioned on the regex caseless PR). We should avoid huge arrays and large data files in lucene-core just to add more inefficient user regular expressions that aren't really related to searching. On the other hand, if we are going to get serious benefit everywhere (e.g. improve all analyzers), then maybe the tradeoff makes sense.

And I don't understand why we'd parse text files rather than just write the generator itself to use ICU, especially since we already use such an approach in the build: https://github.com/apache/lucene/blob/main/gradle/generation/icu/GenerateUnicodeProps.groovy

Still, I wouldn't jump straight to generation: it is a lot of work, and we should iterate. First, I'd compare Character.toLowerCase(Character.toUpperCase(x)) to UCharacter.foldCase(int, false) to see what the delta really needs to be in terms of data. I'd expect it to be very small. You can start prototyping with that instead of investing a ton of up-front time.
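
A first step toward that comparison can be sketched with the JDK alone. The comparison against UCharacter.foldCase would need icu4j on the classpath, so the sketch below only measures the JDK-only half: the code points where plain Character.toLowerCase(x) disagrees with Character.toLowerCase(Character.toUpperCase(x)). The class and method names are hypothetical, and the size of the result depends on the Unicode version of the running JDK.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: FoldingDeltaScan is a hypothetical helper, not part of Lucene or the PR.
final class FoldingDeltaScan {

  /** Returns every code point where toLowerCase(x) != toLowerCase(toUpperCase(x)). */
  static List<Integer> lowerVsUpperThenLower() {
    List<Integer> delta = new ArrayList<>();
    for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
      int lower = Character.toLowerCase(cp);
      int roundTrip = Character.toLowerCase(Character.toUpperCase(cp));
      if (lower != roundTrip) {
        delta.add(cp);
      }
    }
    return delta;
  }

  public static void main(String[] args) {
    List<Integer> delta = lowerVsUpperThenLower();
    System.out.println("code points that differ: " + delta.size());
    for (int cp : delta) {
      System.out.printf(
          "U+%04X lower=U+%04X upperThenLower=U+%04X%n",
          cp, Character.toLowerCase(cp), Character.toLowerCase(Character.toUpperCase(cp)));
    }
  }
}
```

Replacing the `lower` side with a call to ICU's fold function would give the delta the comment asks about. Note that ı (U+0131) appears in this JDK-only list because its upper case is I, so the round trip maps it to i.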

}

/**
* Attempts to convert the given character to upper case, according to {@link
* Character#toUpperCase(int)}, but with exceptions if the turkic flag is set.
*
* @param codepoint the code point for the character to convert to upper case
* @param turkic if true, then apply tr/az folding rules
* @return the upper case character
*/
static int upperCase(int codepoint, boolean turkic) {
if (turkic) {
if (codepoint == 0x00069) { // i [LATIN SMALL LETTER I]
return 0x00130; // İ [LATIN CAPITAL LETTER I WITH DOT ABOVE]
} else if (codepoint == 0x00131) { // ı [LATIN SMALL LETTER DOTLESS I]
return 0x000049; // I [LATIN CAPITAL LETTER I]
}
}
return Character.toUpperCase(codepoint);
}
}
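
The behaviour of the two helpers above on the dotted and dotless i can be illustrated in isolation. This is a sketch, not part of the PR: the class name is hypothetical, and the two methods copy the logic of foldCase and upperCase from the diff so that the example runs with the JDK alone.

```java
// Sketch only: TurkicAlternatesDemo is a hypothetical class; the two methods mirror the diff.
final class TurkicAlternatesDemo {

  /** Same logic as the diff's foldCase: lower-case, with tr/az exceptions when turkic is set. */
  static int foldCase(int codepoint, boolean turkic) {
    if (turkic) {
      if (codepoint == 0x0130) { // İ
        return 0x0069; // i
      } else if (codepoint == 0x0049) { // I
        return 0x0131; // ı
      }
    }
    return Character.toLowerCase(codepoint);
  }

  /** Same logic as the diff's upperCase: upper-case, with tr/az exceptions when turkic is set. */
  static int upperCase(int codepoint, boolean turkic) {
    if (turkic) {
      if (codepoint == 0x0069) { // i
        return 0x0130; // İ
      } else if (codepoint == 0x0131) { // ı
        return 0x0049; // I
      }
    }
    return Character.toUpperCase(codepoint);
  }

  public static void main(String[] args) {
    for (boolean turkic : new boolean[] {true, false}) {
      for (int cp : new int[] {0x0069, 0x0131}) {
        System.out.printf(
            "turkic=%b U+%04X -> alternate U+%04X%n", turkic, cp, upperCase(cp, turkic));
      }
    }
  }
}
```

With turkic set, i and ı get distinct alternates (İ and I), and foldCase inverts both. With turkic unset, Character.toUpperCase maps both i and ı to I, so two sibling transitions labelled i and ı would receive the same alternate label. That case may be worth covering alongside testCornerCase, which only exercises turkic = true.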
@@ -758,7 +758,7 @@ private Automaton toAutomaton(
* @param codepoint the Character code point to encode as an Automaton
* @return the original codepoint and the set of alternates
*/
private int[] toCaseInsensitiveChar(int codepoint) {
static int[] toCaseInsensitiveChar(int codepoint) {
int[] altCodepoints = CaseFolding.lookupAlternates(codepoint);
if (altCodepoints != null) {
int[] concat = new int[altCodepoints.length + 1];
@@ -40,10 +40,13 @@
* @see Automata#makeBinaryStringUnion(BytesRefIterator)
*/
final class StringsToAutomaton {
private final boolean caseInsensitive;
private final boolean turkic;

/** The default constructor is private. Use static methods directly. */
private StringsToAutomaton() {
super();
private StringsToAutomaton(boolean caseInsensitive, boolean turkic) {
this.caseInsensitive = caseInsensitive;
this.turkic = turkic;
}

/** DFSA state with <code>char</code> labels on transitions. */
@@ -195,7 +198,11 @@ private boolean setPrevious(BytesRef current) {

/** Internal recursive traversal for conversion. */
private static int convert(
Automaton.Builder a, State s, IdentityHashMap<State, Integer> visited) {
Automaton.Builder a,
State s,
IdentityHashMap<State, Integer> visited,
boolean caseInsensitive,
boolean turkic) {

Integer converted = visited.get(s);
if (converted != null) {
@@ -209,7 +216,15 @@ private static int convert(
int i = 0;
int[] labels = s.labels;
for (StringsToAutomaton.State target : s.states) {
a.addTransition(converted, convert(a, target, visited), labels[i++]);
int label = labels[i++];
int dest = convert(a, target, visited, caseInsensitive, turkic);
a.addTransition(converted, dest, label);
if (caseInsensitive) {
int altCase = CaseFolding.upperCase(label, turkic);
if (altCase != label) {
a.addTransition(converted, dest, altCase);
}
}
}

return converted;
@@ -227,7 +242,7 @@ private Automaton completeAndConvert() {

// Convert:
Automaton.Builder a = new Automaton.Builder();
convert(a, root, new IdentityHashMap<>());
convert(a, root, new IdentityHashMap<>(), caseInsensitive, turkic);
return a.finish();
}

@@ -237,8 +252,12 @@ private Automaton completeAndConvert() {
* UTF-8 codepoints as transition labels or binary (compiled) transition labels based on {@code
* asBinary}.
*/
static Automaton build(Iterable<BytesRef> input, boolean asBinary) {
final StringsToAutomaton builder = new StringsToAutomaton();
static Automaton build(
Iterable<BytesRef> input, boolean asBinary, boolean caseInsensitive, boolean turkic) {
if (asBinary && caseInsensitive) {
throw new IllegalArgumentException("Cannot use caseInsensitive on binary automaton");
}
final StringsToAutomaton builder = new StringsToAutomaton(caseInsensitive, turkic);

for (BytesRef b : input) {
builder.add(b, asBinary);
@@ -253,8 +272,13 @@ static Automaton build(Iterable<BytesRef> input, boolean asBinary) {
* UTF-8 codepoints as transition labels or binary (compiled) transition labels based on {@code
* asBinary}.
*/
static Automaton build(BytesRefIterator input, boolean asBinary) throws IOException {
final StringsToAutomaton builder = new StringsToAutomaton();
static Automaton build(
BytesRefIterator input, boolean asBinary, boolean caseInsensitive, boolean turkic)
throws IOException {
if (asBinary && caseInsensitive) {
throw new IllegalArgumentException("Cannot use caseInsensitive on binary automaton");
}
final StringsToAutomaton builder = new StringsToAutomaton(caseInsensitive, turkic);

for (BytesRef b = input.next(); b != null; b = input.next()) {
builder.add(b, asBinary);
@@ -293,6 +317,10 @@ private void add(BytesRef current, boolean asBinary) {
} else {
while (pos < max) {
codePoint = UnicodeUtil.codePointAt(bytes, pos, codePoint);
if (caseInsensitive
&& codePoint.codePoint != CaseFolding.foldCase(codePoint.codePoint, turkic)) {
throw new IllegalArgumentException("Case-insensitive input must be lower-case");
}
next = state.lastChild(codePoint.codePoint);
if (next == null) {
break;
@@ -26,6 +26,8 @@
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import org.apache.lucene.tests.util.LuceneTestCase;
import org.apache.lucene.tests.util.TestUtil;
import org.apache.lucene.tests.util.automaton.AutomatonTestUtil;
@@ -47,6 +49,26 @@ public void testBasic() throws Exception {
checkMinimized(a);
}

public void testCaseInsensitive() throws Exception {
List<BytesRef> terms = basicTerms();
Collections.sort(terms);

Automaton a = buildCaseInsensitive(terms, false);
checkAutomaton(terms, a, false, true);
checkMinimized(a);
}

public void testCornerCase() throws Exception {
List<BytesRef> terms =
Stream.of("aib", "aıc")
.map(LuceneTestCase::newBytesRef)
.sorted()
.collect(Collectors.toCollection(ArrayList::new));
Automaton a = buildCaseInsensitive(terms, true);
System.out.println(a.toDot());
assertTrue(a.isDeterministic());
}

public void testBasicBinary() throws Exception {
List<BytesRef> terms = basicTerms();
Collections.sort(terms);
@@ -84,6 +106,46 @@ public void testRandomMinimized() throws Exception {
}
}

public void testRandomMinimizedCaseInsensitive() throws Exception {
int iters = RandomizedTest.isNightly() ? 20 : 5;
for (int i = 0; i < iters; i++) {
int size = random().nextInt(2, 50);
Set<BytesRef> terms = new HashSet<>();
List<Automaton> automatonList = new ArrayList<>(size);
boolean turkic = random().nextBoolean();
for (int j = 0; j < size; j++) {
String s = TestUtil.randomRealisticUnicodeString(random(), 8);
int[] lowercased = s.codePoints().map(c -> CaseFolding.foldCase(c, turkic)).toArray();
s = new String(lowercased, 0, lowercased.length);
terms.add(newBytesRef(s));
List<Automaton> charAutomata =
s.codePoints()
.mapToObj(
c -> {
Automaton a = Automata.makeChar(c);
int altCase = CaseFolding.upperCase(c, turkic);
if (altCase != c) {
return Operations.union(List.of(a, Automata.makeChar(altCase)));
}
return a;
})
.collect(Collectors.toList());
if (charAutomata.isEmpty()) {
automatonList.add(Automata.makeEmptyString());
} else {
automatonList.add(Operations.concatenate(charAutomata));
}
}
List<BytesRef> sortedTerms = terms.stream().sorted().toList();

Automaton expected =
MinimizationOperations.minimize(
Operations.union(automatonList), Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
Automaton actual = buildCaseInsensitive(sortedTerms, turkic);
assertSameAutomaton(expected, actual);
}
}

public void testRandomUnicodeOnly() throws Exception {
testRandom(false);
}
@@ -131,6 +193,11 @@ private void testRandom(boolean allowBinary) throws Exception {
}

private void checkAutomaton(List<BytesRef> expected, Automaton a, boolean isBinary) {
checkAutomaton(expected, a, isBinary, false);
}

private void checkAutomaton(
List<BytesRef> expected, Automaton a, boolean isBinary, boolean caseInsensitive) {
CompiledAutomaton c = new CompiledAutomaton(a, true, false, isBinary);
ByteRunAutomaton runAutomaton = c.runAutomaton;

@@ -141,12 +208,14 @@ private void checkAutomaton(List<BytesRef> expected, Automaton a, boolean isBina
readable + " should be found but wasn't", runAutomaton.run(t.bytes, t.offset, t.length));
}

// Make sure every term produced by the automaton is expected
BytesRefBuilder scratch = new BytesRefBuilder();
FiniteStringsIterator it = new FiniteStringsIterator(c.automaton);
for (IntsRef r = it.next(); r != null; r = it.next()) {
BytesRef t = Util.toBytesRef(r, scratch);
assertTrue(expected.contains(t));
if (caseInsensitive == false) {
// Make sure every term produced by the automaton is expected
BytesRefBuilder scratch = new BytesRefBuilder();
FiniteStringsIterator it = new FiniteStringsIterator(c.automaton);
for (IntsRef r = it.next(); r != null; r = it.next()) {
BytesRef t = Util.toBytesRef(r, scratch);
assertTrue(expected.contains(t));
}
}
}

@@ -174,9 +243,18 @@ private List<BytesRef> basicTerms() {

private Automaton build(Collection<BytesRef> terms, boolean asBinary) throws IOException {
if (random().nextBoolean()) {
return StringsToAutomaton.build(terms, asBinary);
return StringsToAutomaton.build(terms, asBinary, false, false);
} else {
return StringsToAutomaton.build(new TermIterator(terms), asBinary, false, false);
}
}

private Automaton buildCaseInsensitive(Collection<BytesRef> terms, boolean turkic)
throws IOException {
if (random().nextBoolean()) {
return StringsToAutomaton.build(terms, false, true, turkic);
} else {
return StringsToAutomaton.build(new TermIterator(terms), asBinary);
return StringsToAutomaton.build(new TermIterator(terms), false, true, turkic);
}
}
