Description
[The following is a brainstormy kind of idea ... I have no clue how to actually approach it ... patches/ideas very welcome!]
Segment merging is a tricky balance between index-time resources (costly CPU and IO for each merge), replication costs (sending the newly created segments out to all replicas during each refresh), and search-time costs (keeping delete % low and segment count low for most efficient searching).
Ideally (and at Lucene's defaults) merging is amortized logarithmic cost, i.e. each indexed document will only be merged into a new segment O(log(N))
times, where N is the number of documents or bytes in your index.
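
As a toy illustration of that amortized O(log(N)) claim (my own sketch, assuming an idealized LogMergePolicy-style scheme with the default mergeFactor of 10; not anything from the Lucene sources), each document gets rewritten roughly once per "level" it climbs:

```java
// Toy calculation only: counts how many times one document is rewritten under an
// idealized log-structured merge scheme where `mergeFactor` same-sized segments
// are repeatedly merged into one bigger segment.
public class MergeCostToy {
  public static void main(String[] args) {
    long numDocs = 100_000_000L;   // N: total docs eventually in the index
    int mergeFactor = 10;          // LogMergePolicy's default merge factor

    int timesMerged = 0;
    for (long level = mergeFactor; level <= numDocs; level *= mergeFactor) {
      timesMerged++;               // the doc is copied once per level it climbs
    }
    // Prints ~log_10(N) = 8 for N = 100M: the amortized O(log(N)) cost per doc.
    System.out.println("Each doc merged ~" + timesMerged + " times");
  }
}
```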
TieredMergePolicy has a number of tunables to trade off the above costs, and different use cases will tune it very differently. If you tune it to very aggressively reclaim deletes (which we do for Amazon's product search), the merge cost can approach O(N^2) (or worse?) instead. But TieredMergePolicy is tricky/trappy to configure, and maybe buggy, and leads to confusion/problems like this. Plus there are fun ideas like a merge simulator, a GUI/utility API to manually kick off your own merge, and the new segment tracing tools/GUI in luceneutil for easier merge/flush segment debugging.
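
For concreteness, here is what tuning those knobs looks like today (a minimal sketch; the values are made up purely to show the tunables, not recommended settings, and the analyzer choice is arbitrary):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

// Illustrative values only, not recommendations.
public class TieredMergePolicyExample {
  static IndexWriterConfig aggressiveDeleteReclaim() {
    TieredMergePolicy tmp = new TieredMergePolicy();
    tmp.setSegmentsPerTier(10.0);           // how many segments per "tier" before a merge is selected
    tmp.setMaxMergedSegmentMB(5 * 1024);    // cap on the size of a merged segment
    tmp.setFloorSegmentMB(2.0);             // tiny segments are "rounded up" to this size
    tmp.setDeletesPctAllowed(20.0);         // lower => reclaim deletes more aggressively, at higher merge IO/CPU cost
    tmp.setForceMergeDeletesPctAllowed(5.0);

    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setMergePolicy(tmp);
    return iwc;
  }
}
```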
This leads to further ideas like a MergePolicy/MergeScheduler that caps net bandwidth required, so that during peace time it could spend the bandwidth pushing delete % very low, but during war time (a sudden, sustained surge in updates) it could tolerate some increase in % deletes in order to rate-limit the MB/sec required for replicating segments. But such a thing would have tunables that smart humans somewhere would need to pay attention to and then tweak.
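
One very rough sketch of the bandwidth-capping idea (plain Java, no Lucene APIs; the class and method names are hypothetical): a byte-budget tracker that such a MergePolicy/MergeScheduler could consult before admitting a merge, so big delete-reclaiming merges are deferred during war time and allowed through during peace time:

```java
// Hypothetical sketch: a simple byte budget ("token bucket") that a bandwidth-aware
// MergePolicy/MergeScheduler could consult before admitting a merge. Not a Lucene API.
public class MergeBandwidthBudget {
  private final long maxBytesPerSec;   // cap on bytes/sec of newly written (and replicated) segments
  private double availableBytes;       // bytes we may still spend right now
  private long lastRefillNanos = System.nanoTime();

  public MergeBandwidthBudget(long maxBytesPerSec) {
    this.maxBytesPerSec = maxBytesPerSec;
    this.availableBytes = maxBytesPerSec;  // start with one second of budget
  }

  /** Returns true if a merge expected to write {@code estimatedMergeBytes} fits the current budget. */
  public synchronized boolean tryAdmit(long estimatedMergeBytes) {
    refill();
    if (estimatedMergeBytes <= availableBytes) {
      availableBytes -= estimatedMergeBytes;   // peace time: spend budget driving delete % down
      return true;
    }
    return false;                              // war time: defer this merge, tolerate higher % deletes
  }

  private void refill() {
    long now = System.nanoTime();
    double elapsedSec = (now - lastRefillNanos) / 1_000_000_000.0;
    lastRefillNanos = now;
    // Never bank more than a few seconds of budget, so a long idle period
    // cannot be followed by an unbounded burst of merging/replication.
    availableBytes = Math.min(availableBytes + elapsedSec * maxBytesPerSec, 5.0 * maxBytesPerSec);
  }
}
```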
In general, seen from the MergePolicy's standpoint, this is a game against a challenging adversary who keeps making new segments (by indexing docs using N threads, making updates/deletes at varying rates, and refreshing every M seconds). It's fun to visualize this game-theoretic approach.
So this all got me thinking: could we instead make a machine-learned MergePolicy? We could train it from InfoStream output collected from real-world Lucene applications (or maybe privately, from just your application's Lucene usage) -- is it append only? are there updates? how often is it an append vs. an update? are there update storms? It would then build a multi-objective ML model (optimizing for low % deletes searched over time, low IO/CPU cost during indexing, low MB/sec replicated, etc.) and use it to tune its merge selection. I don't know what kind of ML model structure this would be, since it would operate with time as one dimension/input, and its output is not a vector but rather discrete events (a MergeSpec) at specific times stating which merge should be kicked off for which segments. (Or maybe that is a binary vector, one dimension (bit) per segment in the index, ranked by descending size or something?)
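
To make the binary-vector idea a bit more concrete, here is one possible encoding (entirely hypothetical; the class, record, and field names are made up, and this is not anything that exists in Lucene): one bit per segment, segments ranked by descending size, with the bit set if the model selects that segment for the next merge.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the "binary vector" action encoding: one bit per segment
// (ranked by descending size), set to 1 if the model picks that segment for the
// next merge. Names are invented for illustration.
public class MergeActionEncoder {
  /** Byte size of one live segment, e.g. as observed from InfoStream. */
  public record SegmentSnapshot(String name, long sizeInBytes) {}

  /**
   * Encodes a merge decision as a fixed-width 0/1 vector over the
   * {@code maxSegments} largest segments; smaller segments beyond that width are ignored.
   */
  public static int[] encode(List<SegmentSnapshot> segments, List<String> segmentsToMerge, int maxSegments) {
    List<SegmentSnapshot> ranked = segments.stream()
        .sorted(Comparator.comparingLong(SegmentSnapshot::sizeInBytes).reversed())
        .limit(maxSegments)
        .toList();
    int[] action = new int[maxSegments];
    for (int i = 0; i < ranked.size(); i++) {
      action[i] = segmentsToMerge.contains(ranked.get(i).name()) ? 1 : 0;
    }
    return action;
  }
}
```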