
Speed up exhaustive evaluation. #14679


Merged · 14 commits · May 22, 2025

Conversation

@jpountz (Contributor) commented May 16, 2025

This change helps speed up exhaustive evaluation of term queries, i.e. calling `DocIdSetIterator#nextDoc()` then `Scorer#score()` in a loop.

It helps in two ways:

  • Iteration of matching doc IDs gets a bit more efficient, especially in the case when a block of postings is encoded as a bit set.
  • Computation of scores now gets (auto-)vectorized.

While this change doesn't help much when dynamic pruning kicks in, I'm hopeful that we can improve this in the future.
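As a minimal illustration of the scoring half of this change (plain Java, no Lucene types; the class and method names below are hypothetical, not the actual Lucene API): computing scores for a whole batch of docs in one tight loop over primitive arrays is a good candidate for JIT auto-vectorization, unlike one virtual `score()` call per document.

```java
public final class BatchedScoringSketch {

  /** Computes scores[i] for each of the first {@code size} docs in one pass. */
  public static void scoreBatch(int size, int[] freqs, long[] norms, float weight, float[] scores) {
    for (int i = 0; i < size; ++i) {
      // Stand-in formula; a real similarity (e.g. BM25) would go here.
      scores[i] = weight * freqs[i] / (freqs[i] + (float) norms[i]);
    }
  }
}
```

The loop body contains no branches and no virtual calls, which is what makes it eligible for auto-vectorization.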


This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR.

@jpountz (Contributor, Author) commented May 16, 2025

Exhaustive evaluation (totalHitsThreshold=Integer.MAX_VALUE):

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                FilteredOr3Terms       53.34      (2.1%)       52.58      (1.5%)   -1.4% (  -4% -    2%) 0.229
                       CountTerm     8599.61      (6.2%)     8508.01      (3.0%)   -1.1% (  -9% -    8%) 0.730
                  FilteredOrMany        5.75      (1.8%)        5.70      (2.2%)   -0.9% (  -4% -    3%) 0.484
                          IntNRQ       30.32     (11.9%)       30.09     (11.2%)   -0.8% ( -21% -   25%) 0.915
              FilteredOrHighHigh       48.15      (2.3%)       47.81      (2.7%)   -0.7% (  -5% -    4%) 0.644
               FilteredOrHighMed       56.92      (2.2%)       56.62      (2.2%)   -0.5% (  -4% -    4%) 0.710
                 FilteredPrefix3       29.16      (3.5%)       29.04      (2.8%)   -0.4% (  -6% -    5%) 0.829
                AndMedOrHighHigh       43.48      (3.3%)       43.30      (3.0%)   -0.4% (  -6% -    6%) 0.835
                 CountAndHighMed      301.20      (0.8%)      300.43      (1.3%)   -0.3% (  -2% -    1%) 0.717
                   TermMonthSort     3203.56      (2.0%)     3195.42      (2.3%)   -0.3% (  -4% -    4%) 0.852
                DismaxOrHighHigh        7.01      (3.5%)        6.99      (4.5%)   -0.2% (  -7% -    8%) 0.925
                  FilteredIntNRQ       98.99      (1.0%)       98.86      (1.7%)   -0.1% (  -2% -    2%) 0.886
          CountFilteredOrHighMed      148.68      (0.8%)      148.59      (0.7%)   -0.1% (  -1% -    1%) 0.893
                     CountOrMany       30.07      (1.3%)       30.06      (3.3%)   -0.1% (  -4% -    4%) 0.972
                          Phrase        3.07      (0.5%)        3.07      (0.7%)   -0.0% (  -1% -    1%) 0.950
             CountFilteredOrMany       27.64      (1.6%)       27.64      (1.4%)   -0.0% (  -2% -    3%) 0.996
             CountFilteredPhrase       25.98      (1.3%)       25.99      (1.0%)    0.1% (  -2% -    2%) 0.933
             FilteredOrStopWords       29.82      (2.0%)       29.84      (2.2%)    0.1% (  -4% -    4%) 0.962
               TermDayOfYearSort      295.71      (1.4%)      295.96      (1.0%)    0.1% (  -2% -    2%) 0.907
                CountAndHighHigh      363.25      (1.8%)      363.57      (1.6%)    0.1% (  -3% -    3%) 0.935
                  FilteredPhrase       23.36      (0.4%)       23.38      (1.3%)    0.1% (  -1% -    1%) 0.877
                    AndStopWords       16.41      (1.1%)       16.43      (1.4%)    0.1% (  -2% -    2%) 0.865
                   TermTitleSort       90.77      (6.6%)       90.91      (3.8%)    0.2% (  -9% -   11%) 0.965
                     CountPhrase        4.21      (0.5%)        4.22      (0.9%)    0.2% (  -1% -    1%) 0.635
         CountFilteredOrHighHigh      137.11      (0.7%)      137.42      (0.8%)    0.2% (  -1% -    1%) 0.617
                        Wildcard       18.10      (5.8%)       18.16      (5.1%)    0.3% (  -9% -   11%) 0.920
                       And3Terms      148.00      (2.1%)      148.56      (1.8%)    0.4% (  -3% -    4%) 0.760
                  CountOrHighMed      339.44      (1.8%)      340.76      (2.1%)    0.4% (  -3% -    4%) 0.756
     FilteredAnd2Terms2StopWords      195.69      (1.2%)      196.57      (0.6%)    0.5% (  -1% -    2%) 0.447
                     AndHighHigh       27.09      (1.1%)       27.23      (0.9%)    0.5% (  -1% -    2%) 0.438
                         Prefix3       14.72      (6.6%)       14.80      (6.0%)    0.5% ( -11% -   14%) 0.892
                    FilteredTerm      112.90      (1.3%)      113.52      (0.5%)    0.5% (  -1% -    2%) 0.379
                 CountOrHighHigh      347.22      (1.8%)      349.12      (2.0%)    0.5% (  -3% -    4%) 0.649
               FilteredAnd3Terms      222.15      (1.8%)      223.38      (0.6%)    0.6% (  -1% -    3%) 0.518
              FilteredAndHighMed      138.64      (1.3%)      139.44      (0.8%)    0.6% (  -1% -    2%) 0.410
             FilteredAndHighHigh       71.98      (2.0%)       72.41      (0.9%)    0.6% (  -2% -    3%) 0.550
                 DismaxOrHighMed       10.76      (4.2%)       10.84      (4.2%)    0.7% (  -7% -    9%) 0.787
            FilteredAndStopWords       49.82      (2.3%)       50.22      (1.6%)    0.8% (  -3% -    4%) 0.524
                IntervalsOrdered        2.26      (3.0%)        2.28      (2.6%)    0.8% (  -4% -    6%) 0.635
      FilteredOr2Terms2StopWords       27.40      (2.6%)       27.64      (1.5%)    0.9% (  -3% -    5%) 0.516
             And2Terms2StopWords       86.94      (1.4%)       87.73      (1.9%)    0.9% (  -2% -    4%) 0.382
                    CombinedTerm       31.88      (3.3%)       32.18      (3.1%)    0.9% (  -5% -    7%) 0.648
                      DismaxTerm       45.81      (5.8%)       46.24      (4.9%)    0.9% (  -9% -   12%) 0.783
                      AndHighMed       92.77      (1.4%)       93.65      (1.0%)    1.0% (  -1% -    3%) 0.227
                          Fuzzy2       85.92      (1.2%)       86.85      (1.2%)    1.1% (  -1% -    3%) 0.153
                            Term       74.11      (0.9%)       74.92      (1.5%)    1.1% (  -1% -    3%) 0.169
                        PKLookup      316.12      (5.0%)      320.82      (5.7%)    1.5% (  -8% -   12%) 0.660
              CombinedAndHighMed       40.66      (3.8%)       41.29      (1.9%)    1.5% (  -4% -    7%) 0.418
                 AndHighOrMedMed       33.86      (3.3%)       34.41      (2.1%)    1.6% (  -3% -    7%) 0.360
             CombinedAndHighHigh       12.60      (5.5%)       12.83      (2.1%)    1.9% (  -5% -   10%) 0.480
                      TermDTSort      402.82      (6.1%)      412.67      (1.0%)    2.4% (  -4% -   10%) 0.374
                          Fuzzy1       91.70      (1.3%)       95.61      (0.8%)    4.3% (   2% -    6%) 0.000
              CombinedOrHighHigh        5.93      (4.2%)        6.26      (5.3%)    5.5% (  -3% -   15%) 0.066
               CombinedOrHighMed        9.13      (4.3%)        9.65      (5.3%)    5.6% (  -3% -   15%) 0.064
                      OrHighRare       19.06      (2.4%)       21.38      (3.8%)   12.1% (   5% -   18%) 0.000
                       OrHighMed       19.01      (1.8%)       29.92      (0.9%)   57.4% (  53% -   61%) 0.000
                        Or3Terms       25.69      (1.5%)       40.45      (0.7%)   57.5% (  54% -   60%) 0.000
                      OrHighHigh       12.66      (1.5%)       20.58      (1.0%)   62.6% (  59% -   66%) 0.000
                     OrStopWords        6.11      (1.3%)       10.55      (1.3%)   72.7% (  69% -   76%) 0.000
              Or2Terms2StopWords        5.30      (1.9%)        9.28      (1.8%)   75.1% (  70% -   80%) 0.000
                          OrMany        1.63      (1.4%)        3.13      (1.6%)   91.8% (  87% -   96%) 0.000

When dynamic pruning is enabled (Lucene's defaults):

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
            FilteredAndStopWords       45.33      (1.2%)       43.82      (3.6%)   -3.3% (  -8% -    1%) 0.034
             FilteredAndHighHigh       65.85      (1.5%)       64.06      (3.1%)   -2.7% (  -7% -    1%) 0.051
                    AndStopWords       29.85      (5.7%)       29.08      (6.2%)   -2.6% ( -13% -    9%) 0.457
                IntervalsOrdered        2.28      (2.3%)        2.23      (4.5%)   -2.2% (  -8% -    4%) 0.290
     FilteredAnd2Terms2StopWords      172.40      (0.9%)      168.82      (1.6%)   -2.1% (  -4% -    0%) 0.006
                      DismaxTerm      484.12      (1.7%)      474.54      (1.4%)   -2.0% (  -4% -    1%) 0.025
                      TermDTSort      405.39      (4.2%)      397.98      (4.5%)   -1.8% ( -10% -    7%) 0.470
                       And3Terms      173.19      (3.2%)      170.07      (3.7%)   -1.8% (  -8% -    5%) 0.368
              FilteredAndHighMed      125.60      (2.3%)      123.40      (2.4%)   -1.7% (  -6% -    2%) 0.190
               FilteredAnd3Terms      168.40      (2.2%)      165.67      (1.8%)   -1.6% (  -5% -    2%) 0.153
                     CountPhrase        4.16      (2.6%)        4.10      (3.7%)   -1.6% (  -7% -    4%) 0.401
                      AndHighMed      134.30      (0.6%)      132.22      (1.3%)   -1.5% (  -3% -    0%) 0.009
                AndMedOrHighHigh       65.58      (1.5%)       64.62      (2.0%)   -1.5% (  -4% -    2%) 0.147
                     AndHighHigh       42.50      (1.4%)       41.92      (2.1%)   -1.4% (  -4% -    2%) 0.183
                   TermTitleSort       90.81      (3.3%)       89.59      (3.1%)   -1.3% (  -7% -    5%) 0.470
             CountFilteredPhrase       25.42      (1.0%)       25.08      (1.4%)   -1.3% (  -3% -    1%) 0.061
                            Term      434.45      (3.0%)      429.49      (6.3%)   -1.1% ( -10% -    8%) 0.689
             And2Terms2StopWords      166.31      (3.2%)      164.47      (3.7%)   -1.1% (  -7% -    5%) 0.577
                        PKLookup      323.33      (3.9%)      320.16      (5.1%)   -1.0% (  -9% -    8%) 0.708
                       OrHighMed      185.68      (1.9%)      183.88      (3.8%)   -1.0% (  -6% -    4%) 0.576
                      OrHighHigh       49.69      (2.4%)       49.22      (4.1%)   -0.9% (  -7% -    5%) 0.628
                       CountTerm     8469.36      (4.1%)     8397.80      (3.8%)   -0.8% (  -8% -    7%) 0.710
              FilteredOrHighHigh       67.32      (2.2%)       66.76      (1.5%)   -0.8% (  -4% -    2%) 0.433
             FilteredOrStopWords       46.18      (2.8%)       45.81      (2.0%)   -0.8% (  -5% -    4%) 0.568
                  FilteredOrMany       16.62      (1.7%)       16.50      (2.9%)   -0.7% (  -5% -    3%) 0.593
               CombinedOrHighMed       73.42      (1.3%)       72.90      (2.0%)   -0.7% (  -4% -    2%) 0.472
               TermDayOfYearSort      285.02      (3.4%)      283.24      (4.0%)   -0.6% (  -7% -    7%) 0.772
                  FilteredPhrase       33.28      (1.8%)       33.08      (1.2%)   -0.6% (  -3% -    2%) 0.501
                        Or3Terms      164.02      (3.3%)      163.11      (4.9%)   -0.6% (  -8% -    7%) 0.817
                    CombinedTerm       30.17      (3.6%)       30.01      (3.0%)   -0.5% (  -6% -    6%) 0.783
                  CountOrHighMed      368.23      (2.0%)      366.30      (2.2%)   -0.5% (  -4% -    3%) 0.661
                     OrStopWords       31.61      (6.3%)       31.47      (8.0%)   -0.5% ( -13% -   14%) 0.912
               FilteredOrHighMed      152.49      (1.5%)      151.89      (0.8%)   -0.4% (  -2% -    1%) 0.562
              CombinedOrHighHigh       19.12      (1.9%)       19.05      (2.3%)   -0.4% (  -4% -    3%) 0.775
              CombinedAndHighMed       39.13      (3.9%)       39.01      (3.3%)   -0.3% (  -7% -    7%) 0.888
                FilteredOr3Terms      166.72      (1.8%)      166.36      (0.6%)   -0.2% (  -2% -    2%) 0.777
                    FilteredTerm      157.71      (2.9%)      157.40      (2.3%)   -0.2% (  -5% -    5%) 0.897
      FilteredOr2Terms2StopWords      148.31      (1.5%)      148.08      (1.0%)   -0.2% (  -2% -    2%) 0.837
                          Phrase       14.61      (2.9%)       14.64      (1.2%)    0.2% (  -3% -    4%) 0.881
              Or2Terms2StopWords      156.06      (3.7%)      156.47      (4.4%)    0.3% (  -7% -    8%) 0.911
             CountFilteredOrMany       27.40      (0.7%)       27.48      (1.6%)    0.3% (  -1% -    2%) 0.684
             CombinedAndHighHigh       11.51      (4.0%)       11.55      (3.8%)    0.4% (  -7% -    8%) 0.870
          CountFilteredOrHighMed      148.24      (0.5%)      148.89      (0.5%)    0.4% (   0% -    1%) 0.140
                 AndHighOrMedMed       46.45      (1.5%)       46.66      (2.4%)    0.5% (  -3% -    4%) 0.697
                 CountAndHighMed      309.81      (1.8%)      311.39      (1.7%)    0.5% (  -2% -    4%) 0.608
                          IntNRQ      304.10      (2.3%)      306.45      (0.8%)    0.8% (  -2% -    3%) 0.435
                 DismaxOrHighMed      166.23      (1.4%)      167.53      (2.6%)    0.8% (  -3% -    4%) 0.522
                  FilteredIntNRQ      300.74      (2.2%)      303.30      (0.5%)    0.8% (  -1% -    3%) 0.359
         CountFilteredOrHighHigh      136.40      (1.0%)      137.57      (0.5%)    0.9% (   0% -    2%) 0.056
                         Prefix3      164.17      (5.1%)      165.64      (2.5%)    0.9% (  -6% -    8%) 0.700
                 FilteredPrefix3      161.26      (5.4%)      162.79      (2.1%)    0.9% (  -6% -    8%) 0.686
                 CountOrHighHigh      346.50      (1.6%)      349.92      (0.9%)    1.0% (  -1% -    3%) 0.203
                          Fuzzy2       85.93      (1.4%)       86.79      (1.0%)    1.0% (  -1% -    3%) 0.156
                     CountOrMany       30.53      (0.9%)       30.87      (0.8%)    1.1% (   0% -    2%) 0.031
                CountAndHighHigh      358.55      (1.8%)      363.03      (1.7%)    1.2% (  -2% -    4%) 0.208
                        Wildcard       93.12      (3.3%)       94.52      (1.8%)    1.5% (  -3% -    6%) 0.333
                          Fuzzy1      101.52      (1.8%)      103.07      (1.5%)    1.5% (  -1% -    4%) 0.113
                DismaxOrHighHigh      112.45      (2.5%)      114.28      (3.5%)    1.6% (  -4% -    7%) 0.354
                   TermMonthSort     3132.96      (1.4%)     3185.18      (2.2%)    1.7% (  -1% -    5%) 0.114
                      OrHighRare      265.78      (3.5%)      274.41      (7.6%)    3.2% (  -7% -   14%) 0.341
                          OrMany       19.03      (3.3%)       20.56      (3.3%)    8.0% (   1% -   15%) 0.000

@gf2121 (Contributor) left a comment

The speed up is very exciting! I did a rough pass and left some minor suggestions/questions.

So this optimization can typically help cases like TOPN_COUNT, which needs to evaluate all docs, especially for indices with deleted docs, where the count cannot be computed in constant time!

```java
/** Grow both arrays to ensure that they can store at least the given number of entries. */
public void grow(int minSize) {
  if (docs.length < minSize) {
    docs = ArrayUtil.grow(docs, minSize);
```
gf2121 (Contributor): Maybe we typically need `growNoCopy` instead of `grow`?
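For context on the distinction: `grow` preserves existing contents across the resize, while `growNoCopy` just allocates, which is cheaper when the caller is about to overwrite the buffer anyway. A simplified stand-in (the real helpers live in `org.apache.lucene.util.ArrayUtil` and additionally oversize the new array):

```java
import java.util.Arrays;

public final class GrowSketch {

  /** Resizes while preserving existing contents, in the spirit of ArrayUtil#grow. */
  public static int[] grow(int[] arr, int minSize) {
    return arr.length >= minSize ? arr : Arrays.copyOf(arr, minSize);
  }

  /** Resizes without copying: old contents are discarded. Appropriate when the
   *  caller refills the array from scratch, in the spirit of ArrayUtil#growNoCopy. */
  public static int[] growNoCopy(int[] arr, int minSize) {
    return arr.length >= minSize ? arr : new int[minSize];
  }
}
```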

```java
 *
 * <p><b>NOTE</b>: The returned {@link DocAndFreqBuffer} should not hold references to internal
 * data structures.
 *
```
gf2121 (Contributor): Clarify that we should not call this when unpositioned?

```java
size2 = enumerateSetBits(docBitSet.getBits()[i], i << 6, reuse.docs, size2);
}
assert size2 >= size : size2 + " < " + size;
for (int i = 0; i < size; ++i) {
```
@gf2121 (Contributor) commented May 16, 2025:
Though this loop might get vectorized, would it be faster to just add the base inside `enumerateSetBits`, since these words typically have dense 1 bits?

```java
enumerateSetBits(docBitSet.getBits()[i], (i << 6) + docBitSetBase, reuse.docs, size2)
```
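For readers unfamiliar with the technique under discussion: decoding a bit-set-encoded block of postings means turning the set bits of 64-bit words back into doc IDs. A self-contained sketch of such a helper (not the actual Lucene implementation; the signature merely mirrors the snippet above):

```java
public final class BitSetDecodeSketch {

  /**
   * Writes the indexes of the set bits of {@code word}, offset by {@code base},
   * into {@code docs} starting at {@code offset}; returns the new offset.
   */
  public static int enumerateSetBits(long word, int base, int[] docs, int offset) {
    while (word != 0L) {
      docs[offset++] = base + Long.numberOfTrailingZeros(word); // doc = block base + bit index
      word &= word - 1L; // clear the lowest set bit
    }
    return offset;
  }
}
```

Folding the block base into `base` here, as gf2121 suggests, avoids a second pass over the output array to add the base afterwards.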

```java
/** Grow both arrays to ensure that they can store at least the given number of entries. */
public void grow(int minSize) {
  if (docs.length < minSize) {
    docs = ArrayUtil.grow(docs, minSize);
```
gf2121 (Contributor): Same here, `growNoCopy` might be better?

```java
 * <p><b>NOTE</b>: The returned {@link DocAndScoreBuffer} should not hold references to internal
 * data structures.
 *
 * <p><b>NOTE</b>: In case this {@link Scorer} exposes a {@link #twoPhaseIterator()
```
gf2121 (Contributor): When only the DISI is exposed, it should be positioned as well?

jpountz (Contributor, Author):
Yes indeed, will clarify.

```java
 * return reuse;
 * </pre>
 *
 * <p><b>NOTE</b>: The returned {@link DocAndFreqBuffer} should not hold references to internal
```
gf2121 (Contributor):
Can we make the buffer arrays private and only expose getters and grow?

jpountz (Contributor, Author): There are a couple of places where I call `System#arraycopy` directly on these arrays, let me think more about it.

jpountz (Contributor, Author): I ended up not applying this suggestion, as the API calls would have looked awkward otherwise. I hope this is ok.

gf2121 (Contributor):
Thanks for trying! Let's keep it then.

```java
 * <p>This method behaves as if implemented as below, which is the default implementation:
 *
 * <pre class="prettyprint">
 * int batchSize = 16;
```
gf2121 (Contributor):
When I first read this javadoc without looking at the impls, I thought impls should limit their block size to 16 as well :) Maybe clarify that the max size of the buffer depends on the underlying data structures.

```java
}

int size = docAndFreqBuffer.size;
normValues = ArrayUtil.grow(normValues, size);
```
gf2121 (Contributor): `growNoCopy`?

```java
 */
public void score(int size, int[] freqs, long[] norms, float[] scores) {
  for (int i = 0; i < size; ++i) {
    scores[i] = score(freqs[i], norms[i]);
```
gf2121 (Contributor):

> Computation of scores now gets (auto-)vectorized.

By this, do you mean this method can get vectorized? So the abstraction layer does not prevent inlining here?

jpountz (Contributor, Author): Auto-vectorization indeed requires `score(float, long)` to get inlined, which only happens if at most two impls of `SimScorer` are in use. We may need to implement `score(int, int[], long[], float[])` on our main similarities in the future to make performance more predictable. We may also be able to do a bit better than calling `score` in a loop. I was trying to keep the change small.

gf2121 (Contributor):

> We may also be able to do a bit better than calling score in a loop

Yeah! I played with BM25 a bit and the result looks promising:

```
Benchmark                               Mode  Cnt   Score   Error   Units
VectorizedBM25Benchmark.scoreBaseline  thrpt    5  10.991 ± 0.356  ops/us
VectorizedBM25Benchmark.scoreVector    thrpt    5  15.149 ± 0.029  ops/us
```

```java
public static void scoreBaseline(int size, int[] freqs, long[] norms, float[] scores, float[] cache, int weight, float[] buffer) {
  for (int i = 0; i < size; ++i) {
    float normInverse = cache[((byte) norms[i]) & 0xFF];
    scores[i] = weight - weight / (1f + freqs[i] * normInverse);
  }
}

public static void scoreVector(int size, int[] freqs, long[] norms, float[] scores, float[] cache, int weight, float[] buffer) {
  for (int i = 0; i < size; ++i) {
    buffer[i] = cache[((byte) norms[i]) & 0xFF];
  }
  for (int i = 0; i < size; ++i) {
    scores[i] = weight - weight / (1f + freqs[i] * buffer[i]);
  }
}
```

jpountz (Contributor, Author):
Exciting!

@jpountz (Contributor, Author) commented May 16, 2025

> So this optimization can typically help cases like TOPN_COUNT which needs to evaluate all docs, especially for the indices with deleted docs which makes count can not return in constant time!

Right. We don't currently use `Weight#count` for TOPN_COUNT, though we probably should!

It should also help sparse neural search, where weights are less predictable, dynamic pruning works less well, and evaluation is effectively exhaustive in practice.

Finally, I'm hoping that we can iterate on this change to also speed up top-n evaluation in the future.

@rmuir (Member) commented May 20, 2025

Do we really need the method on Similarity? I guess I feel most users are probably using BM25Similarity, so I don't understand the explanation in the comments.

If we have "bogus" instances (such as wrappers) of similarity in use, then that's a java problem, let's fix that instead.

@jpountz (Contributor, Author) commented May 20, 2025

You are correct, no need for additional APIs on Similarity at this point, I removed it. I suspect it may be tempting in the future, because it enables further optimizations as @gf2121 showed in #14679 (comment) (though let's see if it actually translates to speedups with luceneutil), and because FeatureField is a contributor to SimScorer#score polymorphism. We can discuss this more in a followup.

I cleaned up the change, it's now ready for review.

@jpountz jpountz marked this pull request as ready for review May 20, 2025 20:05
@jpountz jpountz added this to the 10.3.0 milestone May 20, 2025
jpountz added a commit to jpountz/lucene that referenced this pull request May 20, 2025
Calls to `DocIdSetIterator#nextDoc`, `DocIdSetIterator#advance` and
`SimScorer#score` are currently interleaved and include lots of conditionals.
This builds on apache#14679 and refactors the code a bit to make it eligible for
auto-vectorization and better pipelining.

This effectively speeds up conjunctive queries (e.g. `AndHighHigh`) but also
disjunctive queries that run as conjunctive queries in practice (e.g.
`OrHighHigh`).
@rmuir (Member) commented May 21, 2025

Thank you! "bulkpostings 2.0" is looking really clean and non-invasive :)

> I suspect it may be tempting in the future, because it enables further optimizations as @gf2121 showed in #14679 (comment) (though let's see if it actually translates to speedups with luceneutil), and because FeatureField is a contributor to SimScorer#score polymorphism. We can discuss this more in a followup.

Yes, thank you, I agree 100% to investigate it as followup: the additional speedup hinted at there seems promising. If we can proceed with caution there, it would help.

For similarities in particular, getting the formula correct can be difficult, and if you have to implement it twice, I have some concerns around correctness. At the very minimum we'd want to improve BaseSimilarityTestCase...

For PostingsEnum/Scorer changes I have similar concerns about correctness, I think what's happening in Asserting is not enough to guarantee correctness? E.g. for the PostingsEnum one I would think about CheckIndex itself validating the new bulk API, BasePostingsFormatTestCase additions, and also TestDuelingCodecs.

@gf2121 (Contributor) left a comment:

Fantastic job!

```java
freq();

int start = docBufferUpto - 1;
buffer.size = 0;
```
gf2121 (Contributor):
Nit: `buffer.size` has been set to 0 above (line 1047), can we avoid this one?

```java
int batchSize = 16; // arbitrary
buffer.growNoCopy(batchSize);
int size = 0;
DocIdSetIterator iterator = iterator();
```
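The default implementation quoted above pulls docs from the iterator in fixed-size batches. As a rough stand-in for the idea, using a plain `PrimitiveIterator.OfInt` in place of Lucene's `DocIdSetIterator` (the class and method names here are hypothetical):

```java
import java.util.PrimitiveIterator;
import java.util.stream.IntStream;

public final class BatchCollectSketch {

  /** Copies up to {@code batchSize} doc IDs from {@code it} into {@code buffer};
   *  returns how many were copied (less than batchSize once the iterator is exhausted). */
  public static int nextBatch(PrimitiveIterator.OfInt it, int[] buffer, int batchSize) {
    int size = 0;
    while (size < batchSize && it.hasNext()) {
      buffer[size++] = it.nextInt();
    }
    return size;
  }
}
```

Downstream code then processes `buffer[0..size)` in tight loops over primitives rather than one doc at a time.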
gf2121 (Contributor):
We have many implementations returning a new iterator here (like `TwoPhaseIterator.asDocIdSetIterator`); will the object construction for every 16 docs cause noticeable overhead?

jpountz (Contributor, Author):
Possibly indeed. Let's look into it as a follow-up? I'm not sure if we should cache the iterator here or rather fix impls to avoid allocating in `#iterator()`.

gf2121 (Contributor):

> Let's look into it as a follow-up

+1

```java
int size = docAndFreqBuffer.size;
normValues = ArrayUtil.growNoCopy(normValues, size);
if (norms == null) {
  Arrays.fill(normValues, 0, size, 1L);
```
gf2121 (Contributor):
Can we only do this fill when grow happens?
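The pattern gf2121 suggests, sketched in isolation (simplified, hypothetical names; the real code uses `ArrayUtil`): since surviving slots already hold 1L from earlier calls, only the newly allocated tail needs filling, and only when the array actually grows.

```java
import java.util.Arrays;

public final class NormOnesSketch {
  private long[] normValues = new long[0];

  /** Ensures capacity {@code size}; every slot below {@code size} holds 1L afterwards. */
  public long[] ensureOnes(int size) {
    if (normValues.length < size) {
      int oldLength = normValues.length;
      normValues = Arrays.copyOf(normValues, size); // keeps the old all-1L prefix
      Arrays.fill(normValues, oldLength, size, 1L); // fill only the new tail
    }
    return normValues;
  }
}
```

This relies on the buffer only ever holding 1L values; if the same array were reused for real norms in between, the invariant would break, which is presumably why the original code refills on every call.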

@jpountz (Contributor, Author) commented May 21, 2025

Thanks for the feedback, both. I added coverage to BasePostingsFormatTestCase. TestDuelingCodecs is a bit tricky since implementations are free to return buffers of arbitrary sizes. I will look into CheckIndex next, paying attention not to slow it down too much.

```java
    Arrays.fill(normValues, 1L);
  }
}
normValues = ArrayUtil.growNoCopy(normValues, size);
```
gf2121 (Contributor):
This line can be removed?

@jpountz (Contributor, Author) commented May 21, 2025

CheckIndex integration is pushed, I hooked into a place where we were already exhaustively consuming the PostingsEnum anyway, so it shouldn't cause a major slowdown.

@jpountz jpountz merged commit 3dad852 into apache:main May 22, 2025
7 checks passed
@jpountz jpountz deleted the vectorized_exhaustive_evaluation branch May 22, 2025 09:49
jpountz added a commit that referenced this pull request May 22, 2025
jpountz added a commit to jpountz/lucene that referenced this pull request May 23, 2025
Existing vectorization of scores is a bit fragile since it relies on
`SimScorer#score` being inlined in the for loops where it is called. This is
currently the case in nightly benchmarks, but may not be the case in the real
world where more implementations of `SimScorer` may be used, in particular
those from `FeatureField`.

Furthermore, existing vectorization has some room for improvement as @gf2121
highlighted at
apache#14679 (comment).
@jpountz
Copy link
Contributor Author

jpountz commented May 23, 2025

Nightly benchmarks reported a ~6% speedup on the OrMany task: https://benchmarks.mikemccandless.com/OrMany.html. I'll push an annotation.

gf2121 pushed a commit to gf2121/lucene that referenced this pull request May 24, 2025
gf2121 pushed a commit to gf2121/lucene that referenced this pull request May 26, 2025
weizijun added a commit to weizijun/lucene that referenced this pull request May 27, 2025
* main: (32 commits)
  update os.makedirs with pathlib mkdir (apache#14710)
  Optimize AbstractKnnVectorQuery#createBitSet with intoBitset (apache#14674)
  Implement #docIDRunEnd() on PostingsEnum. (apache#14693)
  Speed up TermQuery (apache#14709)
  Refactor main top-n bulk scorers to evaluate hits in a more term-at-a-time fashion. (apache#14701)
  Fix WindowsFS test failure seen on Policeman Jenkins (apache#14706)
  Use a temporary repository location to download certain ecj versions ("drops") (apache#14703)
  Add assumption to ignore occasional test failures due to disconnected graphs (apache#14696)
  Return MatchNoDocsQuery when IndexOrDocValuesQuery::rewrite does not match (apache#14700)
  Minor access modifier adjustment to a couple of lucene90 backward compat types (apache#14695)
  Speed up exhaustive evaluation. (apache#14679)
  Specify and test that IOContext is immutable (apache#14686)
  deps(java): bump org.gradle.toolchains.foojay-resolver-convention (apache#14691)
  deps(java): bump org.eclipse.jgit:org.eclipse.jgit (apache#14692)
  Clean up how the test framework creates asserting scorables. (apache#14452)
  Make competitive iterators more robust. (apache#14532)
  Remove DISIDocIdStream. (apache#14550)
  Implement AssertingPostingsEnum#intoBitSet. (apache#14675)
  Fix patience knn queries to work with seeded knn queries (apache#14688)
  Added toString() method to BytesRefBuilder (apache#14676)
  ...
jpountz added a commit that referenced this pull request Jun 10, 2025
jpountz added a commit that referenced this pull request Jun 10, 2025
gf2121 pushed a commit to gf2121/lucene that referenced this pull request Jul 16, 2025
gf2121 pushed a commit to gf2121/lucene that referenced this pull request Jul 19, 2025