perf: experiment with REMIX/Disco-inspired overlay index

[REMIX](https://www.usenix.org/system/files/fast21-zhong.pdf) and [Disco](https://dl.acm.org/doi/pdf/10.1145/3709683) introduce the idea of creating an index on top of sstables in order to reduce or eliminate the impact of read amplification from multiple levels. The overlay index is a fraction of the size of the sstables that are indexed. REMIX and Disco both utilize a mutable b+tree for the index and produce the index over all sstables in the LSM-tree. This issue proposes an alternative formulation: create an immutable index over a small set of overlapping sstables. Because the index is immutable we can utilize an sstable itself as the index, rather than a b+tree.

Consider the following setup:

```
L0: [3]      d-----------------------p
L1: [2]          f---------------------------t
L2: [1] a-------------------------------------------------z
```

In the diagram above, there are 3 levels, each with one sstable (`1`, `2`, and `3`). Performing a scan from `[a,g]` would typically involve creating a "merging iter" across all of the levels. The merging iter is composed of iterators on each of the sstables. In order to perform the scan we seek each iterator to the start of the scan (`a`) and then using a heap to repeatedly access retrieve the next key. Read amplification manifests as the separate sstable seeks and the heap maintenance during iteration (heap maintenance requires key comparisons). Normally, if we want to reduce read amplification we would compact the sstables into one merged sstable. Yet compactions create write amplification which we also want to minimize.

Rather than compacting the 3 sstables, we can instead create an additional index structure that sits on top of them. In REMIX, this index structure is created over "segments" of keys where each segment is something like 100 keys. While the index only contains every 100th key, the metadata for each segment specifies the merged ordering of the keys (known as "run selectors" in the papers). The metadata also specifies the "cursor offset" for where the first key from each sstable is located within the sstable. In Pebble, these cursor offsets would be `<blockNum, index>` tuples where the index is 0-based within the block.

We can interpret the creation of the overlay index for a set of sstables as a type of logical compaction. We'd replace the existing sstables in the in-memory LSM-tree with a single logical sstable that spans `[a,z]`. This logical sstable would be composed of the existing input sstables (`1.sst`, `2.sst`, and `3.sst`) along with the a new overlay sstable (`4.overlay`). The LSM in-memory state would look like:

```
L0:
L1:
L2: [4] a-------------------------------------------------z
```

In the metadata of the overlay index sstable we'd store an array of the sstable fileNums. The metadata would also have to store per-sstable arrays mapping `blockNum` to `<blockOffset, blockLength>`. We'd limit an overlay index to covering a max of 255 sstables. Each record in the overlay index would contain:

```
  N: <number of keys in the segment> (1 or 2 bytes)
  R: <number of sstables referenced by the segment (1 byte)
  cursorOffsets[R]: <blockNum, index> (R * (4 + 2) bytes)
  runSelectors[N]: <fileNumIndex> (N bytes)
```

For 100 keys, the size of an overlay index record is ~100-200 bytes, depending on how many sstables the record references, or ~1-2 bytes/key. We could do fancier run-length encoding of the `runSelectors` to reduce this size further. Or we could limit the number of sstables an overlay index covers to 15 so we could use 4-bits per entry in the `runSelectors` array. We might want to expand the `cursorOffsets` metadata to allow easier binary search within a segment, though with 100-key segments we'll typically only be seeking in one or two data blocks in the underlying sstables. Seeking within a overlay index involves seeking in the index block, seeking within the data block to find the overlay record, and then seeking within the data blocks of each of the sstables referenced by the segment. Note that we don't need to seek within the index blocks of the referenced sstables. If a segment references all of the sstables an overlay index covers then we'll have modest savings. There is a larger win if a segment references a small fraction of the tables covered by an overlay index. For example, an overlay index for tables in L0 could cover 8 levels of table, but a segment might reference only 1 or 2 of those tables. Whether this occurs
in practice is dependent on the workload.

Besides the per-key overhead, the size of the overlay index sstable also includes ~1/N of the input sstable key size. Larger N reduces the size of the overlay index, but increases the likelihood that a segment references more of the covered sstables. For N=100 and 100-byte avg key size, the overlay index size is ~2-3% of the input sstable key size, so likely <1% of the total input sstable size once the values are accounted for (though value-separation needs to be accounted for).

In order to determine the merit of this idea we could prototype the overlay index sstable and run micro-benchmarks for point lookups and range scans under uniform and skewed key distributions. Full integration into Pebble would not be a small project, but probably not as large as columnar blocks or value-separation. One of the bigger challenges is that the overlay index needs to violate the sstable abstraction and have knowledge of the sstable internals. We'd need to expose an sstable iterator that allows retrieving a block by `<blockOffset, blockLength>`. Ideally this could all be contained within the `sstable` package internals.

Note that Disco does something fancier than REMIX in the overlay index records and maintains additional per-key state in order to further eliminate seeks for non-existent records in covering sstables. If we start with the REMIX approach the Disco idea would be a potential follow-on optimization.

Jira issue: PEBBLE-530

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: experiment with REMIX/Disco-inspired overlay index #5019

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

perf: experiment with REMIX/Disco-inspired overlay index #5019

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions