Description
[The following is a brainstormy kind of idea ... I have no clue how to actually approach it ... patches/ideas very welcome!]
Segment merging is a tricky balance between index-time resources (costly CPU and IO for each merge), replication costs (sending the newly created segments out to all replicas during each refresh), and search-time costs (keeping delete % low and segment count low for most efficient searching).
Ideally (and at Lucene's defaults) merging is amortized logarithmic cost, i.e. each indexed document will only be merged into a new segment O(log(N))
times, where N is the number of documents or bytes in your index.
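
As a toy illustration of that amortized O(log(N)) claim (my own sketch, assuming an idealized LogMergePolicy-style scheme with the default mergeFactor of 10; not anything from the Lucene sources), each document gets rewritten roughly once per "level" it climbs:

```java
// Toy calculation only: counts how many times one document is rewritten under an
// idealized log-structured merge scheme where `mergeFactor` same-sized segments
// are repeatedly merged into one bigger segment.
public class MergeCostToy {
  public static void main(String[] args) {
    long numDocs = 100_000_000L;   // N: total docs eventually in the index
    int mergeFactor = 10;          // LogMergePolicy's default merge factor

    int timesMerged = 0;
    for (long level = mergeFactor; level <= numDocs; level *= mergeFactor) {
      timesMerged++;               // the doc is copied once per level it climbs
    }
    // Prints ~log_10(N) = 8 for N = 100M: the amortized O(log(N)) cost per doc.
    System.out.println("Each doc merged ~" + timesMerged + " times");
  }
}
```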
TieredMergePolicy has a number of tunables to trade off the above costs, and different use cases will tune it very differently. If you tune it to very aggressively reclaim deletes (which we do for Amazon's product search), the merge cost can approach O(N^2) (or worse?) instead. But TieredMergePolicy is tricky/trappy to configure, and maybe buggy, and leads to confusion/problems like this. Plus there are fun ideas like a merge simulator, a GUI/utility API to manually kick off your own merge, and the new segment tracing tools/GUI in luceneutil for easier merge/flush segment debugging.
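
For concreteness, here is what tuning those knobs looks like today (a minimal sketch; the values are made up purely to show the tunables, not recommended settings, and the analyzer choice is arbitrary):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

// Illustrative values only, not recommendations.
public class TieredMergePolicyExample {
  static IndexWriterConfig aggressiveDeleteReclaim() {
    TieredMergePolicy tmp = new TieredMergePolicy();
    tmp.setSegmentsPerTier(10.0);           // how many segments per "tier" before a merge is selected
    tmp.setMaxMergedSegmentMB(5 * 1024);    // cap on the size of a merged segment
    tmp.setFloorSegmentMB(2.0);             // tiny segments are "rounded up" to this size
    tmp.setDeletesPctAllowed(20.0);         // lower => reclaim deletes more aggressively, at higher merge IO/CPU cost
    tmp.setForceMergeDeletesPctAllowed(5.0);

    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setMergePolicy(tmp);
    return iwc;
  }
}
```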
This leads to further ideas like a MergePolicy/MergeScheduler that caps net bandwidth required, so that during peace time it could spend the bandwidth pushing delete % very low, but during war time (a sudden, sustained surge in updates) it could tolerate some increase in % deletes in order to rate-limit the MB/sec required for replicating segments. But such a thing would have tunables that smart humans somewhere would need to pay attention to and then tweak.
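
One very rough sketch of the bandwidth-capping idea (plain Java, no Lucene APIs; the class and method names are hypothetical): a byte-budget tracker that such a MergePolicy/MergeScheduler could consult before admitting a merge, so big delete-reclaiming merges are deferred during war time and allowed through during peace time:

```java
// Hypothetical sketch: a simple byte budget ("token bucket") that a bandwidth-aware
// MergePolicy/MergeScheduler could consult before admitting a merge. Not a Lucene API.
public class MergeBandwidthBudget {
  private final long maxBytesPerSec;   // cap on bytes/sec of newly written (and replicated) segments
  private double availableBytes;       // bytes we may still spend right now
  private long lastRefillNanos = System.nanoTime();

  public MergeBandwidthBudget(long maxBytesPerSec) {
    this.maxBytesPerSec = maxBytesPerSec;
    this.availableBytes = maxBytesPerSec;  // start with one second of budget
  }

  /** Returns true if a merge expected to write {@code estimatedMergeBytes} fits the current budget. */
  public synchronized boolean tryAdmit(long estimatedMergeBytes) {
    refill();
    if (estimatedMergeBytes <= availableBytes) {
      availableBytes -= estimatedMergeBytes;   // peace time: spend budget driving delete % down
      return true;
    }
    return false;                              // war time: defer this merge, tolerate higher % deletes
  }

  private void refill() {
    long now = System.nanoTime();
    double elapsedSec = (now - lastRefillNanos) / 1_000_000_000.0;
    lastRefillNanos = now;
    // Never bank more than a few seconds of budget, so a long idle period
    // cannot be followed by an unbounded burst of merging/replication.
    availableBytes = Math.min(availableBytes + elapsedSec * maxBytesPerSec, 5.0 * maxBytesPerSec);
  }
}
```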
In general, seen from the MergePolicy's standpoint, this is a game against a challenging adversary who keeps making new segments (by indexing docs using N threads, making updates/deletes at varying rates, and refreshing every M seconds). It's fun to visualize this game-theoretic approach.
So this all got me thinking: could we instead make a machine-learned MergePolicy? We could train it from InfoStream output collected from real-world Lucene applications (or maybe privately, from just your application's Lucene usage) -- is it append only? are there updates? how often is it an append vs. an update? are there update storms? It would then build a multi-objective ML model (optimizing for low % deletes searched over time, low IO/CPU cost during indexing, low MB/sec replicated, etc.) and use it to tune its merge selection. I don't know what kind of ML model structure this would be, since it would operate with time as one dimension/input, and its output is not a vector but rather discrete events (a MergeSpec) at specific times stating which merge should be kicked off for which segments. (Or maybe that is a binary vector, one dimension (bit) per segment in the index, ranked by descending size or something?)
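
To make the binary-vector idea a bit more concrete, here is one possible encoding (entirely hypothetical; the class, record, and field names are made up, and this is not anything that exists in Lucene): one bit per segment, segments ranked by descending size, with the bit set if the model selects that segment for the next merge.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the "binary vector" action encoding: one bit per segment
// (ranked by descending size), set to 1 if the model picks that segment for the
// next merge. Names are invented for illustration.
public class MergeActionEncoder {
  /** Byte size of one live segment, e.g. as observed from InfoStream. */
  public record SegmentSnapshot(String name, long sizeInBytes) {}

  /**
   * Encodes a merge decision as a fixed-width 0/1 vector over the
   * {@code maxSegments} largest segments; smaller segments beyond that width are ignored.
   */
  public static int[] encode(List<SegmentSnapshot> segments, List<String> segmentsToMerge, int maxSegments) {
    List<SegmentSnapshot> ranked = segments.stream()
        .sorted(Comparator.comparingLong(SegmentSnapshot::sizeInBytes).reversed())
        .limit(maxSegments)
        .toList();
    int[] action = new int[maxSegments];
    for (int i = 0; i < ranked.size(); i++) {
      action[i] = segmentsToMerge.contains(ranked.get(i).name()) ? 1 : 0;
    }
    return action;
  }
}
```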