-
Notifications
You must be signed in to change notification settings - Fork 149
Update merge policy floor_setting from 2MB to 16MB #2623
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update merge policy floor_setting from 2MB to 16MB #2623
Conversation
Signed-off-by: Finn Roblin <[email protected]>
Signed-off-by: Finn Roblin <[email protected]>
Signed-off-by: Finn Roblin <[email protected]>
Signed-off-by: Finn Roblin <[email protected]>
* This change is in line with Lucene 10.2 default and will lead to fewer segments (more merges), improving search performance. | ||
*/ | ||
@Override | ||
public Settings getAdditionalIndexSettings(String indexName, boolean isDataStreamIndex, Settings templateAndRequestSettings) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This overridden method hooks into OpenSearch here.
Signed-off-by: Finn Roblin <[email protected]>
Currently running cohere-10M benchmark. |
src/test/java/org/opensearch/knn/index/SegmentSizeFloorMergePolicyIT.java
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/plugin/KNNIndexSettingProvider.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/plugin/KNNIndexSettingProvider.java
Outdated
Show resolved
Hide resolved
src/test/java/org/opensearch/knn/index/SegmentSizeFloorMergePolicyIT.java
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/plugin/KNNIndexSettingProvider.java
Outdated
Show resolved
Hide resolved
Here are benchmark results for 10M 768D dataset: Setup:instance types
benchmark setup
results
full document |
Signed-off-by: Finn Roblin <[email protected]>
Signed-off-by: Finn Roblin <[email protected]>
Signed-off-by: Finn Roblin <[email protected]>
Signed-off-by: Finn Roblin <[email protected]>
String regularIndexName = "test-regular-floor-segment"; | ||
createIndex(regularIndexName, Settings.EMPTY); | ||
|
||
String knnFloorSegment = getIndexSettingByName(knnIndexName, "index.merge.policy.floor_segment", true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Use the setting key here org.opensearch.index.TieredMergePolicyProvider.INDEX_MERGE_POLICY_FLOOR_SEGMENT_SETTING
Signed-off-by: Finn Roblin <[email protected]>
Want to call out that OS core just opened a PR for this change: opensearch-project/OpenSearch#17699. We can merge in this listener + setting update on the k-NN side to see the performance impact on nightly benchmarks, but we should probably revert the change once it's merged into core since it's unnecessary overhead to overwrite an equivalent setting value. |
Closing this out since core merged this in with #17687 |
Description
Sets
merge.policy.floor_segment : 16mb
for knn indices to create more merges and have fewer segments after indexing. Thefloor_segment
setting in Lucene TieredMergePolicy dictates the "size" of a segment to the merge logic. That is, the merge logic usesmax(floor_segment, actualSegmentSize)
when calculating merges. The OpenSearch default is 2mb. The Lucene default is 16mb since 10.2. This PR is against k-NN repo and not core because there are likely cases where a 2mb default is smarter for non-vector search 😄.This blog post by a Lucene maintainer explains and visualizes Lucene merges. The TieredMergePolicy javadoc also describes how the merge works. The upshot is that this setting change will lead to fewer segments after indexing, which will improve search performance.
Benchmarks on the 10M 768d cohere dataset with 3x r6g.4xl instances show 5-7% improvements in query throughput and common-case latency, larger 20+% improvements in tail latency, ~4% decrease in indexing time, but ~27% increase in median merge time. The original merge floor policy results in 212 segments and this new merge floor policy results in 154 segments. The ~27% decrease in segments explains the ~27% increase in merging time.
We can see that these setting listeners worked by querying the
_settings
endpoint. The response includes"merge":{"policy":{"floor_segment":"16mb"}}
for knn indices andpolicy.floor_segment":"2097152b"
for non-knn indices. See the full_settings
response for knn index here and the full_settings
response for non-knn index here.Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.