cache preset dict for LZ4WithPresetDictDecompressor #14397
base: main
Conversation
Force-pushed from e5193bc to 100974c
@kkewwei - Thanks for raising this PR. Looks like a useful change to me.
lucene/core/src/java/org/apache/lucene/codecs/lucene90/LZ4WithPresetDictCompressionMode.java (outdated review comment, resolved)
```java
decompressor.reset();
decompressor.decompress(fieldsStream, toDecompress, 0, toDecompress, spare);
```
I am wondering if reset should be the default behavior. We can pass another flag to indicate reuse if possible.
It seems that `reset` is essential. When the block changes, we must discard the cache immediately, and this change can only be detected from outside the decompressor.
> When the block changes, we must discard the cache immediately, and this change can only be detected from outside the decompressor.

I am not questioning that. My point is to not have a `reset` method in the Decompressor interface, and instead add another `decompress` method that takes `reuseIfPossible` as one of its parameters. That ensures functional correctness even if we forget to make the `reset` call somewhere in the code, and it still allows explicit optimization wherever we deem appropriate. The risk of not explicitly making the `reset` call is much greater than the cost of using the original `decompress` without reuse.
```java
public abstract class Decompressor implements Cloneable {

  protected Decompressor() {}

  public void decompress(DataInput in, int originalLength, int offset, int length, BytesRef bytes)
      throws IOException {
    decompress(in, originalLength, offset, length, bytes, false);
  }

  public abstract void decompress(
      DataInput in, int originalLength, int offset, int length, BytesRef bytes,
      boolean reuseIfPossible) throws IOException;

  @Override
  public abstract Decompressor clone();
}
```
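To illustrate how such a two-method shape keeps the safe default while letting callers opt in, here is a standalone toy model; the class names, the `byte[]`/`StringBuilder` signatures, and the `dictReads` counter are invented for illustration and are not the actual Lucene code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Toy model of the proposed interface: the one-arg overload defaults to no
// reuse, so forgetting to opt in is always safe, while callers that know the
// chunk has not changed can pass reuseIfPossible=true. Names are illustrative.
abstract class Decompressor implements Cloneable {
  public void decompress(byte[] in, StringBuilder out) throws IOException {
    decompress(in, out, /*reuseIfPossible=*/ false); // safe default: re-read the dict
  }

  public abstract void decompress(byte[] in, StringBuilder out, boolean reuseIfPossible)
      throws IOException;

  @Override
  public abstract Decompressor clone();
}

public class CallSite {
  static class CountingDecompressor extends Decompressor {
    int dictReads = 0; // how many times the preset dict was (re)read

    @Override
    public void decompress(byte[] in, StringBuilder out, boolean reuseIfPossible) {
      if (!reuseIfPossible) {
        dictReads++; // simulate re-reading the preset dictionary
      }
      out.append(new String(in, StandardCharsets.UTF_8));
    }

    @Override
    public Decompressor clone() {
      return new CountingDecompressor();
    }
  }

  public static void main(String[] args) throws IOException {
    CountingDecompressor d = new CountingDecompressor();
    StringBuilder out = new StringBuilder();
    byte[] doc = "doc".getBytes(StandardCharsets.UTF_8);
    d.decompress(doc, out);       // first doc of a chunk: dict is read
    d.decompress(doc, out, true); // same chunk: caller opts into reuse
    System.out.println("dictReads=" + d.dictReads); // prints dictReads=1
  }
}
```

The design point is that every existing one-arg call site stays correct without modification; only hot paths need to be audited to pass `reuseIfPossible=true`.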
I tried but failed to rely only on the outer `reuseIfPossible` to decide whether to cache the preset dict. In the following case, the outer code must still call `reset` to clear the cache.

We have two chunks:
- chunk0 [doc0(length>0)]
- chunk1 [doc0(length=0), doc1(length=1)]

The steps are as follows:
1. Read chunk0/doc0 with `reuseIfPossible`=false.
2. Read chunk1/doc0 with `reuseIfPossible`=false. As the length is 0, Lucene will not read the preset dict, so the preset dict is not cached.
3. Read chunk1/doc1. In this case doc1 is in the current chunk1, so `reuseIfPossible`=true, but the preset dict is not cached, and Lucene will throw an exception.

In this case, we should call `reset` in step 2, when the chunk changes.
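The failure sequence above can be sketched with a toy cache; the chunk/doc modeling and all names here are invented for illustration, not the actual Lucene code. It shows that without an external `reset` between chunks, the reuse step either finds an empty cache or, worse, a stale dictionary from the previous chunk.

```java
// Toy model of the preset-dict cache across the three steps. Names invented.
public class StaleCacheDemo {
  static byte[] cachedDict = null; // preset dict cached by the last dict read

  // Returns true if the doc could be decompressed with a cached/fresh dict.
  static boolean decompress(String chunk, int docLength, boolean reuseIfPossible) {
    if (reuseIfPossible) {
      // Step 3 lands here: doc1 is in the current chunk, but the dict was
      // never cached because doc0 of this chunk had length 0.
      return cachedDict != null;
    }
    if (docLength == 0) {
      return true; // nothing to decompress: the preset dict is NOT read or cached
    }
    cachedDict = chunk.getBytes(); // read and cache the dict
    return true;
  }

  static void reset() { // what the external reset call does on a chunk change
    cachedDict = null;
  }

  public static void main(String[] args) {
    decompress("chunk0", 3, false);             // step 1: chunk0 dict is cached
    // Without reset() here, the chunk0 dict lingers into chunk1.
    decompress("chunk1", 0, false);             // step 2: length 0, dict not re-read
    boolean ok = decompress("chunk1", 1, true); // step 3: reuse requested
    // ok is true only because the STALE chunk0 dict is still cached; the real
    // code would either decompress with the wrong dictionary or throw.
    System.out.println("reused stale dict: " + ok);
  }
}
```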
Force-pushed from 9ab44d5 to b2c1f0b
```java
if (reused) {
  assert buffer.length >= dictLength + blockLength;
  in.skipBytes(compressedLengths[0]);
} else {
  // Read the dictionary
  buffer = ArrayUtil.growNoCopy(buffer, dictLength + blockLength);
  if (LZ4.decompress(in, dictLength, buffer, 0) != dictLength) {
    throw new CorruptIndexException("Illegal dict length", in);
  }
  reused = true;
}
```
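A minimal standalone sketch of how `reset` pairs with the `reused` flag: clearing the flag forces the next call to re-read the dictionary, while back-to-back calls on the same chunk skip it. The class and the `dictReads` counter are toy constructs for illustration, not the actual decompressor.

```java
// Toy caching flag mirroring the reused/reset interplay. Illustrative only.
public class PresetDictCache {
  private boolean reused = false;
  int dictReads = 0; // counts real dictionary reads

  void decompress() {
    if (reused) {
      // same chunk: skip over the compressed dictionary bytes
    } else {
      dictReads++; // read + decompress the dictionary into the buffer
      reused = true;
    }
  }

  void reset() { // called externally when moving to a different chunk
    reused = false;
  }

  public static void main(String[] args) {
    PresetDictCache c = new PresetDictCache();
    c.decompress(); // reads the dict
    c.decompress(); // same chunk: dict reused
    c.reset();      // new chunk
    c.decompress(); // dict read again
    System.out.println("dictReads=" + c.dictReads); // prints dictReads=2
  }
}
```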
I am wondering if we should consider exposing a metric (maybe a simple counter) on how many times we could reuse the dict, and how many times we had to read it from disk. That would provide useful insight into the benefit of this change.
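The counter suggested here could be as simple as the following sketch; `PresetDictMetrics`, `reuseCount`, and `missCount` are invented names for illustration, not an existing Lucene API.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a reuse-vs-miss counter for the preset-dict cache. Names invented.
public class PresetDictMetrics {
  static final AtomicLong reuseCount = new AtomicLong(); // dict served from cache
  static final AtomicLong missCount = new AtomicLong();  // dict read from disk

  static void onDecompress(boolean reused) {
    if (reused) {
      reuseCount.incrementAndGet();
    } else {
      missCount.incrementAndGet();
    }
  }

  public static void main(String[] args) {
    onDecompress(false); // first doc of a chunk: disk read
    onDecompress(true);  // next doc, same chunk: cache hit
    onDecompress(true);
    System.out.println("reuse=" + reuseCount.get() + " miss=" + missCount.get());
    // prints reuse=2 miss=1
  }
}
```

`AtomicLong` keeps the counters thread-safe without locking, which matters since stored-fields readers can be used from multiple threads.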
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
Force-pushed from b2c1f0b to 57d9661
Description

As mentioned in #14347, when we use `LZ4WithPresetDictDecompressor` to decompress, we always read the preset dict for every doc in non-merge scenarios. If two consecutive documents fall into the same chunk, we can reuse the same dictionary. This is a lossless optimization: the cached preset dict does not take up additional memory.

Lucene benchmark: `python3 src/python/localrun.py wikimediumall`
Hardware used: linux ecs.t2-c1m2dev.8xlarge | 32 cores | 64G

To obtain an accurate measure of the performance improvement, I ran the benchmark 6 times in total; the results are as follows:
Test1 through Test6: (benchmark result tables were attached as images and are not reproduced here)