cache preset dict for LZ4WithPresetDictDecompressor #14397

Open
wants to merge 1 commit into
base: main

kkewwei
Contributor

@kkewwei kkewwei commented Mar 24, 2025

Description

As mentioned in #14347, when LZ4WithPresetDictDecompressor is used for decompression, the preset dictionary is re-read for every document in non-merge scenarios. If two consecutive documents fall into the same chunk, the dictionary read for the first document can be reused for the second. The optimization does not change the decompressed output, and the cached preset dictionary does not take up additional memory.
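
The idea can be illustrated with a minimal sketch. It is not the code in this PR and does not use Lucene's API: `CachedPresetDictReader`, `DictLoader`, `dictFor`, and `loadCount` are hypothetical names, and the chunk is identified here by an assumed start pointer. The sketch only shows that the dictionary is loaded again when the requested chunk differs from the previous one.

```java
/** Hypothetical sketch: reuse the preset dictionary when consecutive reads hit the same chunk. */
final class CachedPresetDictReader {

  /** Hypothetical stand-in for reading and decompressing a chunk's dictionary. */
  interface DictLoader {
    byte[] load(long chunkStartPointer);
  }

  private final DictLoader loader;
  private long cachedChunkStart = -1L; // -1 means nothing has been cached yet
  private byte[] cachedDict;
  private int loadCount; // counts real loads, for demonstration only

  CachedPresetDictReader(DictLoader loader) {
    this.loader = loader;
  }

  /** Returns the dictionary for the chunk, loading it only if the chunk changed. */
  byte[] dictFor(long chunkStartPointer) {
    if (chunkStartPointer != cachedChunkStart) {
      cachedDict = loader.load(chunkStartPointer);
      cachedChunkStart = chunkStartPointer;
      loadCount++;
    }
    return cachedDict;
  }

  int loadCount() {
    return loadCount;
  }

  public static void main(String[] args) {
    // Fake loader: the "dictionary" is a single byte derived from the chunk pointer.
    CachedPresetDictReader reader =
        new CachedPresetDictReader(ptr -> new byte[] {(byte) ptr});
    // Chunk start pointers of six documents read one after another.
    long[] chunkOfDoc = {0L, 0L, 0L, 64L, 64L, 0L};
    for (long chunk : chunkOfDoc) {
      reader.dictFor(chunk);
    }
    System.out.println("dictionary loads: " + reader.loadCount()); // prints "dictionary loads: 3"
  }
}
```

Only the most recent chunk is remembered, so going back to an earlier chunk triggers a new load. That matches the "two consecutive documents" case described above, and it is why the sketch keeps a single dictionary slot rather than a map.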

Lucene benchmark: python3 src/python/localrun.py wikimediumall
Hardware used: linux ecs.t2-c1m2dev.8xlarge | 32 cores | 64G

To measure the performance impact more reliably, I ran the benchmark six times. The results are as follows:

Test1:

                            Task   QPS baseline      StdDev   QPS my_modified_version      StdDev                Pct diff p-value
               HighTermMonthSort      951.76     (10.5%)      897.70      (8.7%)   -5.7% ( -22% -   15%) 0.061
       BrowseDayOfYearSSDVFacets        3.41     (19.3%)        3.27     (17.9%)   -4.2% ( -34% -   40%) 0.474
            BrowseDateSSDVFacets        0.65     (15.0%)        0.62     (12.4%)   -3.9% ( -27% -   27%) 0.372
                   OrHighNotHigh      181.95      (6.2%)      175.52      (5.1%)   -3.5% ( -13% -    8%) 0.048
                           range     2164.58      (7.7%)     2098.62      (6.5%)   -3.0% ( -16% -   12%) 0.178
                      OrHighHigh       18.62      (5.0%)       18.20      (5.7%)   -2.3% ( -12% -    8%) 0.179
            HighTermTitleBDVSort        2.15      (4.1%)        2.11      (4.0%)   -2.0% (  -9% -    6%) 0.122
                         LowTerm      262.36      (7.1%)      257.53      (6.6%)   -1.8% ( -14% -   12%) 0.397
            HighIntervalsOrdered        0.42      (5.9%)        0.41      (5.6%)   -1.4% ( -12% -   10%) 0.429
                   OrNotHighHigh      137.47      (4.0%)      135.52      (3.9%)   -1.4% (  -8% -    6%) 0.257
                    HighSpanNear        5.50      (2.1%)        5.43      (2.2%)   -1.4% (  -5% -    3%) 0.044
               HighTermTitleSort       61.59      (5.9%)       60.76      (6.3%)   -1.4% ( -12% -   11%) 0.485
                       OrHighLow      128.91      (4.9%)      127.20      (4.7%)   -1.3% ( -10% -    8%) 0.382
                    OrHighNotMed      167.55      (6.0%)      165.47      (6.2%)   -1.2% ( -12% -   11%) 0.521
                        HighTerm      170.81      (6.9%)      168.74      (5.9%)   -1.2% ( -13% -   12%) 0.548
                    OrNotHighMed      124.94      (3.5%)      123.44      (5.3%)   -1.2% (  -9% -    7%) 0.398
            MedTermDayTaxoFacets        9.20      (2.8%)        9.10      (2.9%)   -1.1% (  -6% -    4%) 0.230
                    OrNotHighLow      179.84      (4.5%)      177.97      (3.7%)   -1.0% (  -8% -    7%) 0.427
          OrHighMedDayTaxoFacets        2.43      (5.0%)        2.41      (3.5%)   -1.0% (  -9% -    7%) 0.474
                       OrHighMed       40.19      (3.9%)       39.81      (2.4%)   -1.0% (  -7% -    5%) 0.347
                      AndHighMed       26.18      (2.2%)       25.94      (3.0%)   -0.9% (  -5% -    4%) 0.268
                          IntNRQ       17.99      (6.2%)       17.83      (6.0%)   -0.9% ( -12% -   12%) 0.652
                     MedSpanNear       23.26      (3.0%)       23.08      (2.5%)   -0.8% (  -6% -    4%) 0.372
                        Wildcard       28.13      (4.5%)       27.94      (4.2%)   -0.7% (  -8% -    8%) 0.633
                         Prefix3      327.18      (9.4%)      325.10      (8.4%)   -0.6% ( -16% -   18%) 0.821
                    OrHighNotLow      200.50      (5.6%)      199.53      (4.2%)   -0.5% (  -9% -    9%) 0.754
                      AndHighLow      297.86      (6.7%)      296.43      (7.5%)   -0.5% ( -13% -   14%) 0.832
                      TermDTSort       90.01      (6.6%)       89.67      (4.3%)   -0.4% ( -10% -   11%) 0.831
             LowIntervalsOrdered        2.83      (2.9%)        2.83      (3.0%)   -0.2% (  -5% -    5%) 0.855
                HighSloppyPhrase        4.48      (3.3%)        4.48      (4.0%)   -0.1% (  -7% -    7%) 0.903
                     AndHighHigh       18.78      (2.9%)       18.77      (2.5%)   -0.1% (  -5% -    5%) 0.919
                     LowSpanNear       32.35      (3.0%)       32.36      (2.8%)    0.0% (  -5% -    5%) 0.993
                         MedTerm      325.56      (5.8%)      325.60      (6.0%)    0.0% ( -11% -   12%) 0.994
                       LowPhrase        4.37      (3.8%)        4.38      (4.3%)    0.1% (  -7% -    8%) 0.933
        AndHighHighDayTaxoFacets        3.50      (1.8%)        3.51      (2.6%)    0.2% (  -4% -    4%) 0.751
                        PKLookup      103.50      (4.6%)      103.98      (4.2%)    0.5% (  -7% -    9%) 0.742
                      HighPhrase       39.74      (5.6%)       39.95      (5.3%)    0.5% (  -9% -   12%) 0.760
                 LowSloppyPhrase       26.29      (2.6%)       26.47      (2.8%)    0.7% (  -4% -    6%) 0.419
                       MedPhrase       61.27      (5.9%)       61.71      (4.8%)    0.7% (  -9% -   12%) 0.673
         AndHighMedDayTaxoFacets       12.94      (2.9%)       13.07      (2.3%)    0.9% (  -4% -    6%) 0.249
                 MedSloppyPhrase       35.28      (2.8%)       35.63      (2.4%)    1.0% (  -4% -    6%) 0.225
           HighTermDayOfYearSort      108.74      (4.3%)      109.82      (4.5%)    1.0% (  -7% -   10%) 0.475
             MedIntervalsOrdered       13.63      (5.9%)       13.79      (3.9%)    1.1% (  -8% -   11%) 0.475
     BrowseRandomLabelSSDVFacets        1.78      (9.6%)        1.80      (8.6%)    1.2% ( -15% -   21%) 0.686
                          Fuzzy1       35.79      (4.1%)       36.21      (4.5%)    1.2% (  -7% -   10%) 0.380
                          Fuzzy2       31.10      (4.9%)       31.60      (3.8%)    1.6% (  -6% -   10%) 0.250
                         Respell       28.75      (5.1%)       29.23      (3.2%)    1.7% (  -6% -   10%) 0.213
           BrowseMonthSSDVFacets        3.32     (11.0%)        3.39     (17.4%)    2.0% ( -23% -   34%) 0.661
            BrowseDateTaxoFacets        4.14     (40.9%)        4.26     (46.1%)    2.9% ( -59% -  152%) 0.831
       BrowseDayOfYearTaxoFacets        4.30     (41.2%)        4.44     (49.6%)    3.2% ( -61% -  159%) 0.823
     BrowseRandomLabelTaxoFacets        2.57     (35.2%)        2.68     (38.5%)    4.4% ( -51% -  120%) 0.705
           BrowseMonthTaxoFacets        4.08     (43.3%)        5.16     (65.5%)   26.6% ( -57% -  239%) 0.131

Test2:

                            Task   QPS baseline      StdDev   QPS my_modified_version      StdDev                Pct diff p-value
                        PKLookup      107.53      (3.9%)      104.81      (4.5%)   -2.5% ( -10% -    6%) 0.056
            BrowseDateSSDVFacets        0.69     (16.5%)        0.67     (15.2%)   -1.9% ( -28% -   35%) 0.708
                        Wildcard       27.49      (4.5%)       27.11      (4.0%)   -1.4% (  -9% -    7%) 0.311
               HighTermMonthSort     1057.35      (8.3%)     1044.09      (9.6%)   -1.3% ( -17% -   18%) 0.659
            HighIntervalsOrdered        5.42      (7.8%)        5.35      (7.5%)   -1.2% ( -15% -   15%) 0.612
                          Fuzzy2       31.65      (2.5%)       31.29      (3.4%)   -1.2% (  -6% -    4%) 0.219
                          IntNRQ       34.85      (3.5%)       34.47      (4.6%)   -1.1% (  -8% -    7%) 0.387
                     LowSpanNear        3.88      (2.1%)        3.85      (2.7%)   -0.9% (  -5% -    4%) 0.253
                         Respell       19.33      (2.6%)       19.18      (3.8%)   -0.8% (  -6% -    5%) 0.435
                       OrHighMed       59.82      (2.5%)       59.42      (4.0%)   -0.7% (  -7% -    6%) 0.531
         AndHighMedDayTaxoFacets       15.30      (3.5%)       15.20      (3.4%)   -0.7% (  -7% -    6%) 0.549
                          Fuzzy1       36.34      (3.6%)       36.12      (4.5%)   -0.6% (  -8% -    7%) 0.647
                         Prefix3       63.78      (5.5%)       63.43      (5.3%)   -0.6% ( -10% -   10%) 0.743
                       MedPhrase       36.74      (4.3%)       36.54      (3.6%)   -0.5% (  -8% -    7%) 0.663
                      TermDTSort       88.62      (4.4%)       88.16      (4.4%)   -0.5% (  -8% -    8%) 0.708
                 MedSloppyPhrase        2.14      (2.7%)        2.13      (3.2%)   -0.5% (  -6% -    5%) 0.591
           HighTermDayOfYearSort      114.76      (5.3%)      114.29      (4.8%)   -0.4% (  -9% -   10%) 0.795
                      OrHighHigh       15.01      (4.6%)       14.95      (3.7%)   -0.4% (  -8% -    8%) 0.788
                   OrHighNotHigh      141.36      (4.3%)      140.94      (4.1%)   -0.3% (  -8% -    8%) 0.820
                      HighPhrase       34.33      (5.6%)       34.25      (6.1%)   -0.2% ( -11% -   12%) 0.901
             MedIntervalsOrdered       10.06      (5.8%)       10.04      (6.3%)   -0.2% ( -11% -   12%) 0.924
                    OrNotHighLow      333.78      (5.1%)      333.23      (5.7%)   -0.2% ( -10% -   11%) 0.922
            MedTermDayTaxoFacets        8.98      (2.8%)        8.98      (4.1%)   -0.1% (  -6% -    7%) 0.939
                HighSloppyPhrase        7.60      (2.7%)        7.59      (3.4%)   -0.0% (  -5% -    6%) 0.972
                    HighSpanNear        4.69      (4.1%)        4.69      (2.1%)   -0.0% (  -5% -    6%) 0.974
                     MedSpanNear        4.58      (3.1%)        4.58      (2.8%)   -0.0% (  -5% -    6%) 0.995
                         LowTerm      207.87      (7.7%)      207.88      (6.3%)    0.0% ( -13% -   15%) 0.998
                         MedTerm      245.93      (6.4%)      245.97      (3.1%)    0.0% (  -8% -   10%) 0.992
                     AndHighHigh       22.64      (3.1%)       22.65      (2.9%)    0.1% (  -5% -    6%) 0.948
        AndHighHighDayTaxoFacets        5.66      (2.5%)        5.67      (2.4%)    0.2% (  -4% -    5%) 0.822
                    OrHighNotMed      222.59      (5.5%)      223.01      (4.5%)    0.2% (  -9% -   10%) 0.906
                 LowSloppyPhrase       23.25      (2.9%)       23.34      (3.0%)    0.4% (  -5% -    6%) 0.680
          OrHighMedDayTaxoFacets        3.30      (3.6%)        3.31      (3.9%)    0.4% (  -6% -    8%) 0.741
             LowIntervalsOrdered       11.12      (2.9%)       11.16      (3.3%)    0.4% (  -5% -    6%) 0.683
               HighTermTitleSort       26.88      (5.6%)       26.98      (4.3%)    0.4% (  -9% -   10%) 0.797
                      AndHighMed       57.43      (4.5%)       57.74      (3.3%)    0.5% (  -7% -    8%) 0.670
                   OrNotHighHigh      131.23      (5.6%)      131.98      (4.7%)    0.6% (  -9% -   11%) 0.727
                       LowPhrase       50.84      (3.4%)       51.44      (4.2%)    1.2% (  -6% -    9%) 0.335
            HighTermTitleBDVSort        4.59      (3.1%)        4.64      (3.5%)    1.2% (  -5% -    8%) 0.246
                        HighTerm      174.02      (6.6%)      176.17      (3.4%)    1.2% (  -8% -   12%) 0.459
                    OrHighNotLow      167.90      (5.0%)      170.06      (5.0%)    1.3% (  -8% -   11%) 0.418
                    OrNotHighMed      127.88      (3.7%)      129.60      (4.0%)    1.3% (  -6% -    9%) 0.274
                       OrHighLow      170.29      (4.5%)      172.60      (2.9%)    1.4% (  -5% -    9%) 0.253
       BrowseDayOfYearSSDVFacets        3.33     (16.5%)        3.38     (16.2%)    1.5% ( -26% -   41%) 0.770
           BrowseMonthSSDVFacets        3.28      (8.9%)        3.33      (8.6%)    1.6% ( -14% -   21%) 0.567
                      AndHighLow      323.35      (4.9%)      329.17      (5.2%)    1.8% (  -7% -   12%) 0.262
                           range     2173.60      (8.0%)     2219.00      (8.5%)    2.1% ( -13% -   20%) 0.425
     BrowseRandomLabelSSDVFacets        1.75      (6.9%)        1.81      (9.4%)    3.1% ( -12% -   20%) 0.232
     BrowseRandomLabelTaxoFacets        2.70     (33.1%)        2.95     (37.9%)    9.0% ( -46% -  119%) 0.422
            BrowseDateTaxoFacets        4.31     (38.7%)        4.77     (47.8%)   10.7% ( -54% -  158%) 0.438
       BrowseDayOfYearTaxoFacets        4.45     (39.4%)        4.93     (49.3%)   10.9% ( -55% -  164%) 0.438
           BrowseMonthTaxoFacets        4.48     (47.7%)        5.80     (60.7%)   29.3% ( -53% -  263%) 0.090

Test3:

                            Task   QPS baseline      StdDev   QPS my_modified_version      StdDev                Pct diff p-value
       BrowseDayOfYearSSDVFacets        3.39     (14.2%)        3.18      (6.5%)   -6.1% ( -23% -   17%) 0.083
           BrowseMonthSSDVFacets        3.61     (17.0%)        3.41     (12.3%)   -5.4% ( -29% -   28%) 0.250
                           range     2205.30      (8.2%)     2119.30     (11.4%)   -3.9% ( -21% -   17%) 0.214
                          Fuzzy1       38.53      (3.6%)       37.93      (3.8%)   -1.6% (  -8% -    6%) 0.177
             MedIntervalsOrdered       15.14      (4.3%)       14.99      (4.6%)   -1.0% (  -9% -    8%) 0.477
               HighTermTitleSort       89.49      (4.2%)       88.63      (5.4%)   -1.0% ( -10% -    9%) 0.530
                   OrHighNotHigh      199.21      (4.4%)      197.48      (3.9%)   -0.9% (  -8% -    7%) 0.510
                         Respell       27.65      (3.8%)       27.42      (3.6%)   -0.9% (  -7% -    6%) 0.460
           HighTermDayOfYearSort      125.46      (6.1%)      124.48      (3.5%)   -0.8% (  -9% -    9%) 0.622
                HighSloppyPhrase        4.01      (3.2%)        3.98      (3.1%)   -0.7% (  -6% -    5%) 0.457
                         MedTerm      227.57      (4.2%)      225.90      (3.2%)   -0.7% (  -7% -    7%) 0.537
                       MedPhrase       23.76      (5.6%)       23.59      (4.1%)   -0.7% (  -9% -    9%) 0.650
            MedTermDayTaxoFacets        7.47      (2.3%)        7.43      (2.5%)   -0.5% (  -5% -    4%) 0.488
                       OrHighMed       40.60      (2.8%)       40.40      (4.6%)   -0.5% (  -7% -    7%) 0.684
                     AndHighHigh       16.67      (2.7%)       16.61      (3.0%)   -0.3% (  -5% -    5%) 0.715
            BrowseDateSSDVFacets        0.63     (14.5%)        0.63     (14.4%)   -0.3% ( -25% -   33%) 0.949
          OrHighMedDayTaxoFacets        1.26      (4.5%)        1.26      (3.4%)   -0.3% (  -7% -    7%) 0.820
                 MedSloppyPhrase        2.91      (2.5%)        2.90      (2.5%)   -0.2% (  -5% -    4%) 0.753
     BrowseRandomLabelSSDVFacets        1.81      (8.3%)        1.81      (7.8%)   -0.2% ( -15% -   17%) 0.952
                      HighPhrase       31.29      (8.0%)       31.25      (4.7%)   -0.1% ( -11% -   13%) 0.944
        AndHighHighDayTaxoFacets        3.85      (2.1%)        3.85      (2.3%)   -0.1% (  -4% -    4%) 0.854
            HighIntervalsOrdered        2.24      (5.2%)        2.24      (5.0%)   -0.1% (  -9% -   10%) 0.974
                      TermDTSort       84.23      (5.2%)       84.25      (4.6%)    0.0% (  -9% -   10%) 0.990
                          Fuzzy2       35.64      (4.4%)       35.65      (3.5%)    0.0% (  -7% -    8%) 0.981
                       OrHighLow      148.60      (3.7%)      148.71      (5.2%)    0.1% (  -8% -    9%) 0.956
               HighTermMonthSort      905.42      (6.4%)      906.40      (5.7%)    0.1% ( -11% -   12%) 0.955
                     LowSpanNear        7.78      (2.2%)        7.80      (2.2%)    0.2% (  -4% -    4%) 0.749
                 LowSloppyPhrase       12.79      (2.4%)       12.83      (1.2%)    0.2% (  -3% -    3%) 0.683
                        Wildcard       26.08      (3.3%)       26.16      (4.0%)    0.3% (  -6% -    7%) 0.794
                      AndHighMed       43.27      (3.8%)       43.44      (3.0%)    0.4% (  -6% -    7%) 0.718
                         Prefix3       89.79      (5.0%)       90.22      (4.8%)    0.5% (  -8% -   10%) 0.757
                          IntNRQ       78.48      (6.9%)       78.88      (5.7%)    0.5% ( -11% -   14%) 0.801
         AndHighMedDayTaxoFacets       17.39      (2.7%)       17.49      (2.5%)    0.5% (  -4% -    5%) 0.519
                     MedSpanNear        9.92      (3.3%)        9.98      (2.1%)    0.6% (  -4% -    6%) 0.509
                        PKLookup      105.50      (5.4%)      106.12      (4.4%)    0.6% (  -8% -   10%) 0.706
                    OrHighNotLow      199.60      (5.6%)      200.99      (4.6%)    0.7% (  -8% -   11%) 0.667
                    HighSpanNear        7.14      (2.6%)        7.20      (2.0%)    0.8% (  -3% -    5%) 0.259
            HighTermTitleBDVSort        2.49      (3.5%)        2.52      (3.8%)    0.9% (  -6% -    8%) 0.445
                      OrHighHigh       21.16      (4.2%)       21.36      (3.5%)    1.0% (  -6% -    9%) 0.438
                       LowPhrase        8.67      (3.2%)        8.76      (3.0%)    1.0% (  -4% -    7%) 0.299
                    OrNotHighMed      130.15      (3.5%)      131.48      (3.6%)    1.0% (  -5% -    8%) 0.366
             LowIntervalsOrdered        4.90      (3.3%)        4.96      (2.4%)    1.2% (  -4% -    7%) 0.182
                   OrNotHighHigh      156.60      (2.6%)      158.68      (3.5%)    1.3% (  -4% -    7%) 0.169
                        HighTerm      224.01      (3.0%)      227.90      (4.8%)    1.7% (  -5% -    9%) 0.168
                      AndHighLow      211.74      (6.6%)      215.73      (4.6%)    1.9% (  -8% -   14%) 0.297
                    OrNotHighLow      262.16      (3.7%)      267.97      (3.2%)    2.2% (  -4% -    9%) 0.042
                         LowTerm      228.72      (4.8%)      233.79      (7.1%)    2.2% (  -9% -   14%) 0.248
                    OrHighNotMed      176.76      (4.8%)      182.29      (4.8%)    3.1% (  -6% -   13%) 0.039
     BrowseRandomLabelTaxoFacets        2.38     (27.8%)        2.58     (31.1%)    8.5% ( -39% -   93%) 0.360
       BrowseDayOfYearTaxoFacets        3.69     (30.6%)        4.08     (43.1%)   10.6% ( -48% -  121%) 0.372
            BrowseDateTaxoFacets        3.59     (30.9%)        3.97     (44.5%)   10.6% ( -49% -  124%) 0.381
           BrowseMonthTaxoFacets        3.75     (33.1%)        4.20     (46.7%)   12.0% ( -50% -  137%) 0.348

Test4:

                            Task   QPS baseline      StdDev   QPS my_modified_version      StdDev                Pct diff p-value
       BrowseDayOfYearSSDVFacets        3.28      (7.8%)        3.20      (8.3%)   -2.2% ( -17% -   15%) 0.383
     BrowseRandomLabelSSDVFacets        1.92      (9.2%)        1.88      (8.8%)   -2.1% ( -18% -   17%) 0.458
                      AndHighLow      373.59      (3.2%)      365.70      (4.0%)   -2.1% (  -9% -    5%) 0.065
                         LowTerm      238.90      (5.9%)      234.19      (3.8%)   -2.0% ( -11% -    8%) 0.209
                   OrHighNotHigh      160.60      (4.2%)      158.04      (3.6%)   -1.6% (  -9% -    6%) 0.197
               HighTermMonthSort      924.41      (7.5%)      909.94      (4.5%)   -1.6% ( -12% -   11%) 0.425
                     LowSpanNear        8.91      (1.8%)        8.80      (2.1%)   -1.3% (  -5% -    2%) 0.043
           HighTermDayOfYearSort      119.25      (3.6%)      117.88      (4.2%)   -1.1% (  -8% -    6%) 0.356
             MedIntervalsOrdered        3.82      (6.1%)        3.78      (6.9%)   -1.0% ( -13% -   12%) 0.619
            BrowseDateTaxoFacets        3.80     (30.3%)        3.76     (30.3%)   -1.0% ( -47% -   85%) 0.917
               HighTermTitleSort       49.61      (7.6%)       49.13      (7.4%)   -1.0% ( -14% -   15%) 0.684
                    HighSpanNear        1.40      (1.5%)        1.39      (2.0%)   -0.9% (  -4% -    2%) 0.125
                         MedTerm      312.23      (3.1%)      309.55      (4.5%)   -0.9% (  -8% -    6%) 0.478
                     AndHighHigh       15.47      (2.0%)       15.36      (2.8%)   -0.7% (  -5% -    4%) 0.365
                HighSloppyPhrase        3.06      (2.6%)        3.04      (2.6%)   -0.7% (  -5% -    4%) 0.432
            HighIntervalsOrdered        0.60      (5.5%)        0.60      (6.1%)   -0.6% ( -11% -   11%) 0.742
                      AndHighMed       45.12      (2.1%)       44.84      (2.5%)   -0.6% (  -5% -    4%) 0.416
                    OrNotHighMed      112.15      (3.8%)      111.49      (4.1%)   -0.6% (  -8% -    7%) 0.635
            MedTermDayTaxoFacets        9.67      (2.9%)        9.62      (2.4%)   -0.6% (  -5% -    4%) 0.491
                         Prefix3      100.76      (6.8%)      100.20      (4.7%)   -0.6% ( -11% -   11%) 0.764
                    OrHighNotMed      178.70      (3.5%)      177.85      (2.8%)   -0.5% (  -6% -    6%) 0.636
                      OrHighHigh       13.29      (3.4%)       13.23      (3.0%)   -0.4% (  -6% -    6%) 0.667
                      TermDTSort       82.39      (4.2%)       82.11      (3.4%)   -0.3% (  -7% -    7%) 0.779
                 MedSloppyPhrase       14.51      (2.3%)       14.47      (1.9%)   -0.3% (  -4% -    4%) 0.630
        AndHighHighDayTaxoFacets        2.77      (1.8%)        2.76      (1.9%)   -0.3% (  -3% -    3%) 0.582
          OrHighMedDayTaxoFacets        3.66      (3.2%)        3.64      (2.8%)   -0.3% (  -6% -    5%) 0.753
             LowIntervalsOrdered       25.74      (3.4%)       25.69      (3.1%)   -0.2% (  -6% -    6%) 0.842
                 LowSloppyPhrase        4.02      (2.8%)        4.01      (2.6%)   -0.2% (  -5% -    5%) 0.836
                       OrHighLow      229.87      (5.1%)      229.52      (5.5%)   -0.2% ( -10% -   11%) 0.927
       BrowseDayOfYearTaxoFacets        3.85     (28.1%)        3.85     (32.0%)   -0.1% ( -46% -   83%) 0.989
                       MedPhrase       11.60      (3.6%)       11.59      (3.2%)   -0.1% (  -6% -    6%) 0.935
                   OrNotHighHigh      181.54      (2.8%)      181.48      (4.8%)   -0.0% (  -7% -    7%) 0.978
                      HighPhrase       34.64      (4.8%)       34.64      (3.2%)    0.0% (  -7% -    8%) 0.998
            HighTermTitleBDVSort        2.47      (4.2%)        2.48      (4.1%)    0.1% (  -7% -    8%) 0.956
         AndHighMedDayTaxoFacets       38.29      (2.6%)       38.32      (1.8%)    0.1% (  -4% -    4%) 0.908
                       LowPhrase       18.49      (4.3%)       18.50      (3.7%)    0.1% (  -7% -    8%) 0.944
                       OrHighMed       26.44      (2.8%)       26.48      (3.5%)    0.2% (  -5% -    6%) 0.876
                        Wildcard       20.32      (3.8%)       20.37      (3.3%)    0.2% (  -6% -    7%) 0.832
                     MedSpanNear       17.47      (2.0%)       17.53      (1.8%)    0.3% (  -3% -    4%) 0.554
                          Fuzzy2       35.81      (2.2%)       36.00      (2.7%)    0.5% (  -4% -    5%) 0.502
                         Respell       28.51      (2.9%)       28.68      (2.8%)    0.6% (  -4% -    6%) 0.497
     BrowseRandomLabelTaxoFacets        2.45     (27.6%)        2.47     (26.1%)    0.7% ( -41% -   75%) 0.939
                    OrHighNotLow      166.55      (3.1%)      167.93      (3.9%)    0.8% (  -5% -    8%) 0.455
                        HighTerm      268.60      (4.4%)      271.37      (4.8%)    1.0% (  -7% -   10%) 0.478
                    OrNotHighLow      341.54      (4.4%)      345.89      (4.9%)    1.3% (  -7% -   11%) 0.384
                          IntNRQ       19.69      (2.9%)       19.95      (3.5%)    1.3% (  -4% -    7%) 0.199
                          Fuzzy1       39.15      (2.3%)       39.87      (2.0%)    1.8% (  -2% -    6%) 0.007
           BrowseMonthSSDVFacets        3.62     (18.0%)        3.69     (20.5%)    2.2% ( -30% -   49%) 0.723
                           range     2175.09     (10.7%)     2229.64      (8.3%)    2.5% ( -14% -   24%) 0.409
                        PKLookup      104.87      (3.2%)      107.93      (3.3%)    2.9% (  -3% -    9%) 0.004
            BrowseDateSSDVFacets        0.66     (14.7%)        0.68     (17.1%)    3.1% ( -25% -   40%) 0.543
           BrowseMonthTaxoFacets        4.18     (42.3%)        5.26     (64.0%)   25.7% ( -56% -  228%) 0.134

Test5:

                            Task   QPS baseline      StdDev   QPS my_modified_version      StdDev                Pct diff p-value
       BrowseDayOfYearTaxoFacets        4.73     (45.7%)        4.07     (35.1%)  -14.0% ( -65% -  122%) 0.276
            BrowseDateTaxoFacets        4.57     (43.9%)        3.95     (36.2%)  -13.6% ( -65% -  118%) 0.286
     BrowseRandomLabelTaxoFacets        2.83     (37.0%)        2.60     (29.0%)   -8.1% ( -54% -   91%) 0.439
                           range     2378.00     (11.9%)     2260.56      (8.5%)   -4.9% ( -22% -   17%) 0.131
       BrowseDayOfYearSSDVFacets        3.40     (10.3%)        3.32     (12.9%)   -2.2% ( -23% -   23%) 0.551
     BrowseRandomLabelSSDVFacets        1.96     (10.5%)        1.93     (13.9%)   -2.0% ( -23% -   25%) 0.605
            MedTermDayTaxoFacets        8.98      (3.4%)        8.85      (3.0%)   -1.4% (  -7% -    5%) 0.174
                          Fuzzy1       42.66      (3.8%)       42.12      (4.1%)   -1.3% (  -8% -    6%) 0.311
               HighTermTitleSort       56.06      (5.1%)       55.36      (2.4%)   -1.3% (  -8% -    6%) 0.317
                     AndHighHigh       14.14      (2.2%)       13.96      (3.1%)   -1.2% (  -6% -    4%) 0.154
               HighTermMonthSort      986.79      (7.8%)      975.74      (7.6%)   -1.1% ( -15% -   15%) 0.646
                    HighSpanNear        2.60      (2.0%)        2.57      (2.3%)   -1.1% (  -5% -    3%) 0.100
                 LowSloppyPhrase       21.13      (2.0%)       20.96      (2.9%)   -0.8% (  -5% -    4%) 0.301
                HighSloppyPhrase        2.23      (2.9%)        2.21      (2.6%)   -0.8% (  -6% -    4%) 0.354
                       LowPhrase       36.07      (2.8%)       35.81      (3.1%)   -0.7% (  -6% -    5%) 0.444
                    OrNotHighLow      338.46      (3.2%)      336.26      (4.9%)   -0.6% (  -8% -    7%) 0.617
                      AndHighMed       61.39      (3.0%)       61.01      (2.7%)   -0.6% (  -6% -    5%) 0.497
                        HighTerm      205.75      (3.1%)      204.50      (5.0%)   -0.6% (  -8% -    7%) 0.645
                        Wildcard       24.56      (4.0%)       24.42      (3.0%)   -0.6% (  -7% -    6%) 0.601
        AndHighHighDayTaxoFacets        2.54      (2.4%)        2.53      (2.7%)   -0.6% (  -5% -    4%) 0.476
                         Respell       27.74      (2.5%)       27.58      (3.4%)   -0.6% (  -6% -    5%) 0.559
           HighTermDayOfYearSort      116.66      (3.6%)      116.21      (4.0%)   -0.4% (  -7% -    7%) 0.748
                         Prefix3      401.91      (5.9%)      401.13      (8.4%)   -0.2% ( -13% -   15%) 0.933
                 MedSloppyPhrase       37.73      (2.9%)       37.67      (2.9%)   -0.2% (  -5% -    5%) 0.866
                      HighPhrase       11.75      (4.9%)       11.75      (6.7%)   -0.0% ( -11% -   12%) 0.999
                     MedSpanNear        5.55      (1.7%)        5.56      (2.3%)    0.2% (  -3% -    4%) 0.806
                     LowSpanNear        3.72      (2.2%)        3.72      (1.9%)    0.2% (  -3% -    4%) 0.771
         AndHighMedDayTaxoFacets       21.81      (2.5%)       21.87      (2.3%)    0.3% (  -4% -    5%) 0.711
                          Fuzzy2       36.16      (4.5%)       36.27      (2.4%)    0.3% (  -6% -    7%) 0.780
                          IntNRQ       19.82      (5.7%)       19.88      (5.0%)    0.3% (  -9% -   11%) 0.843
                       OrHighLow      216.46      (4.2%)      217.19      (3.8%)    0.3% (  -7% -    8%) 0.790
                   OrNotHighHigh      138.32      (5.1%)      138.98      (4.2%)    0.5% (  -8% -   10%) 0.745
                      OrHighHigh       13.04      (2.7%)       13.10      (3.0%)    0.5% (  -5% -    6%) 0.588
            HighTermTitleBDVSort        3.85      (3.5%)        3.87      (4.2%)    0.6% (  -6% -    8%) 0.632
             MedIntervalsOrdered       10.81      (4.7%)       10.87      (5.0%)    0.6% (  -8% -   10%) 0.702
            HighIntervalsOrdered        1.37      (6.7%)        1.38      (7.5%)    0.7% ( -12% -   15%) 0.750
             LowIntervalsOrdered        9.17      (2.5%)        9.24      (2.7%)    0.8% (  -4% -    6%) 0.339
                    OrNotHighMed      207.30      (3.3%)      208.97      (4.1%)    0.8% (  -6% -    8%) 0.492
                       MedPhrase       12.82      (4.3%)       12.93      (2.7%)    0.8% (  -5% -    8%) 0.464
           BrowseMonthSSDVFacets        3.48      (6.5%)        3.52     (15.2%)    0.9% ( -19% -   24%) 0.804
                      AndHighLow      285.41      (5.9%)      288.14      (5.0%)    1.0% (  -9% -   12%) 0.579
                        PKLookup      107.13      (4.1%)      108.20      (3.1%)    1.0% (  -5% -    8%) 0.380
                      TermDTSort       89.93      (5.0%)       90.88      (3.9%)    1.1% (  -7% -   10%) 0.449
          OrHighMedDayTaxoFacets        3.44      (3.1%)        3.48      (2.6%)    1.1% (  -4% -    7%) 0.222
                       OrHighMed       42.67      (2.7%)       43.23      (2.4%)    1.3% (  -3% -    6%) 0.102
                   OrHighNotHigh      206.53      (6.2%)      209.43      (5.3%)    1.4% (  -9% -   13%) 0.441
                    OrHighNotMed      175.18      (3.3%)      178.12      (5.3%)    1.7% (  -6% -   10%) 0.230
                         MedTerm      241.21      (4.3%)      245.54      (5.0%)    1.8% (  -7% -   11%) 0.222
                    OrHighNotLow      160.53      (4.3%)      163.59      (4.8%)    1.9% (  -6% -   11%) 0.187
                         LowTerm      211.35      (3.0%)      216.15      (4.5%)    2.3% (  -5% -   10%) 0.060
            BrowseDateSSDVFacets        0.65     (13.1%)        0.68     (16.5%)    5.3% ( -21% -   40%) 0.265
           BrowseMonthTaxoFacets        4.67     (44.8%)        5.27     (56.3%)   13.0% ( -60% -  206%) 0.419

Test6:

                            Task   QPS baseline      StdDev   QPS my_modified_version      StdDev                Pct diff p-value
                    OrHighNotMed      171.74      (3.8%)      165.28      (6.1%)   -3.8% ( -13% -    6%) 0.019
                         LowTerm      262.47      (6.8%)      254.38      (4.9%)   -3.1% ( -13% -    9%) 0.099
                    OrHighNotLow      185.00      (4.5%)      179.79      (3.6%)   -2.8% ( -10% -    5%) 0.029
                   OrNotHighHigh      136.66      (5.2%)      133.82      (3.8%)   -2.1% ( -10% -    7%) 0.149
                      HighPhrase       66.26      (6.4%)       65.02      (4.1%)   -1.9% ( -11% -    9%) 0.274
                    OrNotHighMed      200.38      (5.3%)      197.56      (6.1%)   -1.4% ( -12% -   10%) 0.433
                       LowPhrase       10.09      (3.2%)        9.96      (3.3%)   -1.3% (  -7% -    5%) 0.214
                      AndHighMed       70.37      (3.5%)       69.63      (4.4%)   -1.0% (  -8% -    7%) 0.410
                HighSloppyPhrase        4.63      (3.3%)        4.59      (3.3%)   -0.9% (  -7% -    5%) 0.395
                      TermDTSort       60.46      (7.3%)       60.03      (7.2%)   -0.7% ( -14% -   14%) 0.753
                    OrNotHighLow      225.20      (3.2%)      223.60      (4.0%)   -0.7% (  -7% -    6%) 0.532
                       MedPhrase       50.43      (4.9%)       50.12      (5.3%)   -0.6% ( -10% -   10%) 0.697
           HighTermDayOfYearSort      111.28      (5.3%)      110.68      (4.1%)   -0.5% (  -9% -    9%) 0.719
         AndHighMedDayTaxoFacets       34.61      (3.2%)       34.49      (2.3%)   -0.3% (  -5% -    5%) 0.694
                       OrHighLow      222.40      (4.8%)      221.63      (3.7%)   -0.3% (  -8% -    8%) 0.797
                     AndHighHigh        9.56      (2.4%)        9.52      (2.3%)   -0.3% (  -4% -    4%) 0.641
                         Respell       22.70      (3.0%)       22.65      (3.5%)   -0.2% (  -6% -    6%) 0.827
                   OrHighNotHigh      150.87      (4.4%)      150.54      (5.3%)   -0.2% (  -9% -    9%) 0.887
                          Fuzzy1       21.63      (3.5%)       21.58      (3.5%)   -0.2% (  -6% -    7%) 0.844
               HighTermTitleSort       46.88      (3.6%)       46.79      (3.2%)   -0.2% (  -6% -    6%) 0.855
                          IntNRQ       16.16      (6.5%)       16.16      (6.2%)   -0.0% ( -11% -   13%) 0.999
          OrHighMedDayTaxoFacets        2.01      (3.9%)        2.01      (4.5%)    0.1% (  -7% -    8%) 0.917
                       OrHighMed       43.53      (2.9%)       43.59      (3.1%)    0.2% (  -5% -    6%) 0.869
                     LowSpanNear        3.64      (2.1%)        3.65      (1.9%)    0.3% (  -3% -    4%) 0.642
            HighTermTitleBDVSort        2.92      (3.0%)        2.93      (3.4%)    0.4% (  -5% -    6%) 0.723
                 MedSloppyPhrase       30.86      (4.1%)       30.99      (3.4%)    0.4% (  -6% -    8%) 0.725
             LowIntervalsOrdered        9.38      (2.7%)        9.42      (3.2%)    0.4% (  -5% -    6%) 0.646
                    HighSpanNear        0.61      (4.4%)        0.61      (3.7%)    0.4% (  -7% -    8%) 0.728
             MedIntervalsOrdered        2.84      (3.5%)        2.85      (3.4%)    0.5% (  -6% -    7%) 0.633
                 LowSloppyPhrase        4.07      (3.3%)        4.10      (3.3%)    0.7% (  -5% -    7%) 0.496
                        Wildcard       27.11      (4.3%)       27.31      (5.2%)    0.8% (  -8% -   10%) 0.615
                        HighTerm      347.91      (5.7%)      350.57      (4.2%)    0.8% (  -8% -   11%) 0.630
     BrowseRandomLabelSSDVFacets        1.83     (12.5%)        1.85     (11.4%)    0.8% ( -20% -   28%) 0.833
                      OrHighHigh       16.78      (2.4%)       16.92      (3.9%)    0.9% (  -5% -    7%) 0.397
            MedTermDayTaxoFacets        5.69      (3.0%)        5.74      (3.2%)    0.9% (  -5% -    7%) 0.364
                          Fuzzy2       31.38      (3.7%)       31.66      (3.6%)    0.9% (  -6% -    8%) 0.432
        AndHighHighDayTaxoFacets        5.60      (2.9%)        5.65      (2.4%)    1.0% (  -4% -    6%) 0.251
            BrowseDateSSDVFacets        0.64     (15.2%)        0.65     (15.4%)    1.0% ( -25% -   37%) 0.830
                     MedSpanNear        4.94      (2.6%)        5.00      (1.7%)    1.2% (  -3% -    5%) 0.088
                         Prefix3       78.50      (4.9%)       79.47      (6.3%)    1.2% (  -9% -   13%) 0.491
                        PKLookup      103.02      (5.0%)      104.30      (3.9%)    1.2% (  -7% -   10%) 0.381
                      AndHighLow      222.95      (4.3%)      226.54      (4.4%)    1.6% (  -6% -   10%) 0.244
                         MedTerm      350.42      (5.9%)      356.74      (4.3%)    1.8% (  -7% -   12%) 0.269
           BrowseMonthSSDVFacets        3.34     (14.4%)        3.40     (13.1%)    1.8% ( -22% -   34%) 0.672
            HighIntervalsOrdered        1.22      (4.7%)        1.24      (4.5%)    2.0% (  -6% -   11%) 0.171
                           range     1161.32     (10.4%)     1187.32      (9.3%)    2.2% ( -15% -   24%) 0.472
       BrowseDayOfYearSSDVFacets        3.09      (7.8%)        3.16      (7.1%)    2.3% ( -11% -   18%) 0.336
               HighTermMonthSort      951.11      (8.8%)      990.93      (6.9%)    4.2% ( -10% -   21%) 0.094
           BrowseMonthTaxoFacets        3.92     (31.4%)        4.11     (45.0%)    4.9% ( -54% -  118%) 0.687
       BrowseDayOfYearTaxoFacets        3.92     (29.6%)        4.15     (36.8%)    5.7% ( -46% -  102%) 0.587
            BrowseDateTaxoFacets        3.76     (28.7%)        4.04     (37.4%)    7.4% ( -45% -  103%) 0.482
     BrowseRandomLabelTaxoFacets        2.36     (23.3%)        2.57     (35.0%)    8.8% ( -40% -   87%) 0.348

Contributor

@jainankitk jainankitk left a comment

@kkewwei - Thanks for raising this PR. Looks like a useful change to me.

Comment on lines +515 to 516
decompressor.reset();
decompressor.decompress(fieldsStream, toDecompress, 0, toDecompress, spare);
Contributor

I am wondering if reset should be the default behavior. We can pass another flag to indicate reuse if possible.

Contributor Author

It seems that reset is essential. When the block changes, we must discard the cache promptly, and only the caller can detect that change.

Contributor

"When the block changes, we must discard the cache promptly, and only the caller can detect that change."

I am not questioning that. My point is to not add a reset method to the Decompressor interface, and instead add another decompress method that takes reuseIfPossible as a parameter. That ensures functional correctness even if a reset call is missed somewhere in the code, and allows explicit optimization wherever we deem appropriate. The risk of missing an explicit reset call is much greater than the cost of using the original decompress without reuse.

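import java.io.IOException;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.util.BytesRef;

public abstract class Decompressor implements Cloneable {

  protected Decompressor() {}

  // Existing entry point: never reuses cached state, so callers unaware of the cache stay correct.
  public void decompress(
      DataInput in, int originalLength, int offset, int length, BytesRef bytes) throws IOException {
    decompress(in, originalLength, offset, length, bytes, false);
  }

  // Proposed overload: reuseIfPossible lets a caller opt in to reusing previously read state.
  public abstract void decompress(
      DataInput in, int originalLength, int offset, int length, BytesRef bytes, boolean reuseIfPossible)
      throws IOException;

  @Override
  public abstract Decompressor clone();
}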

Contributor Author

@kkewwei kkewwei Apr 17, 2025

I tried relying only on the caller's reuseIfPossible flag to decide whether to use the cached preset dict, but it did not work. In the following case, the caller must call reset to clear the cache.

We have two chunks:

  1. chunk0 [doc0(length>0)]
  2. chunk1 [doc0(length=0), doc1(length=1)]

The steps are as follows:

  1. Read chunk0/doc0 with reuseIfPossible=false.
  2. Read chunk1/doc0 with reuseIfPossible=false. Because the length is 0, Lucene does not read the preset dict, so the preset dict is not cached.
  3. Read chunk1/doc1. doc1 is in the current chunk (chunk1), so reuseIfPossible=true, but the preset dict is not cached yet, and Lucene throws an exception.

In this case, we should call reset in step 1.
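
The three steps above can be replayed with a toy model. This is only a sketch and not Lucene code: `ToyDictDecompressor`, its `read` method, and the string "dictionaries" are hypothetical stand-ins, and the model assumes (as described above) that a zero-length read skips the dictionary entirely. In this toy the failure shows up as a stale dictionary being returned rather than as an exception.

```java
/** Replays the chunk0/chunk1 scenario against the hypothetical toy decompressor below. */
final class PresetDictCacheDemo {

  /** Returns the dictionary that ends up being used for chunk1/doc1. */
  static String replay(boolean resetOnChunkChange) {
    ToyDictDecompressor d = new ToyDictDecompressor();
    d.read("dict0", 5, false);       // step 1: chunk0/doc0, length > 0
    if (resetOnChunkChange) {
      d.reset();                     // the caller notices that the chunk changed
    }
    d.read("dict1", 0, false);       // step 2: chunk1/doc0, length == 0
    return d.read("dict1", 1, true); // step 3: chunk1/doc1, same chunk as step 2
  }

  public static void main(String[] args) {
    System.out.println("flag only:  " + replay(false)); // prints "flag only:  dict0"
    System.out.println("with reset: " + replay(true));  // prints "with reset: dict1"
  }
}

/** Toy model (hypothetical, not Lucene's API) of a decompressor that caches a preset dictionary. */
final class ToyDictDecompressor {
  private String cachedDict; // null means nothing is cached

  /** Drops the cached dictionary. */
  void reset() {
    cachedDict = null;
  }

  /** Returns the dictionary used to decode `length` bytes of a chunk whose real dictionary is chunkDict. */
  String read(String chunkDict, int length, boolean reuseIfPossible) {
    if (length == 0) {
      return ""; // nothing to decode, so the dictionary is neither read nor cached
    }
    if (reuseIfPossible && cachedDict != null) {
      return cachedDict; // stale if the chunk changed and nobody called reset()
    }
    cachedDict = chunkDict; // read the chunk's dictionary and remember it
    return cachedDict;
  }
}
```

Without the reset, step 3 picks up the dictionary cached in step 1, because the zero-length read in step 2 never replaced it. With the reset, the cache is empty at step 3 and the correct dictionary is loaded.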

@kkewwei kkewwei force-pushed the optimize_storedfield_read branch 2 times, most recently from 9ab44d5 to b2c1f0b on March 29, 2025 09:47
Comment on lines +102 to +111
if (reused) {
  assert buffer.length >= dictLength + blockLength;
  in.skipBytes(compressedLengths[0]);
} else {
  // Read the dictionary
  buffer = ArrayUtil.growNoCopy(buffer, dictLength + blockLength);
  if (LZ4.decompress(in, dictLength, buffer, 0) != dictLength) {
    throw new CorruptIndexException("Illegal dict length", in);
  }
  reused = true;
Contributor

@jainankitk jainankitk Mar 31, 2025

I am wondering if we should expose a metric (perhaps a simple counter) for how many times the dictionary was reused versus how many times it had to be read from disk. That would give useful insight into the benefit of this change.
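
One possible shape for such a counter is sketched below. `DictReuseStats` and its method names are hypothetical and are not part of Lucene or of this PR; the sketch only assumes that the decompressor would call `onReuse()` on the reuse branch and `onLoad()` on the branch that reads the dictionary.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical reuse/load counters for the preset dictionary; not part of Lucene. */
final class DictReuseStats {
  // LongAdder keeps increments cheap when several threads record events concurrently.
  private final LongAdder reused = new LongAdder();
  private final LongAdder loaded = new LongAdder();

  /** Records that the cached dictionary was reused. */
  void onReuse() {
    reused.increment();
  }

  /** Records that the dictionary had to be read and decompressed again. */
  void onLoad() {
    loaded.increment();
  }

  long reusedCount() {
    return reused.sum();
  }

  long loadedCount() {
    return loaded.sum();
  }

  /** Fraction of recorded events that were reuses, or 0.0 if nothing was recorded. */
  double reuseRatio() {
    long r = reused.sum();
    long total = r + loaded.sum();
    return total == 0 ? 0.0 : (double) r / total;
  }

  public static void main(String[] args) {
    DictReuseStats stats = new DictReuseStats();
    stats.onLoad();  // first document of a chunk: dictionary is read
    stats.onReuse(); // three more documents from the same chunk
    stats.onReuse();
    stats.onReuse();
    System.out.println("reuse ratio: " + stats.reuseRatio()); // prints "reuse ratio: 0.75"
  }
}
```

A ratio near zero on a real workload would suggest that consecutive reads rarely land in the same chunk, in which case the cache adds little.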

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Apr 15, 2025
@kkewwei kkewwei force-pushed the optimize_storedfield_read branch from b2c1f0b to 57d9661 on April 17, 2025 14:00
@github-actions github-actions bot removed the Stale label Apr 18, 2025

github-actions bot commented May 2, 2025

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label May 2, 2025