Skip to content

Pack file pointers when merging BKD trees #14393

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Mar 25, 2025
Merged

Pack file pointers when merging BKD trees #14393

merged 6 commits into from
Mar 25, 2025

Conversation

iverase
Copy link
Contributor

@iverase iverase commented Mar 24, 2025

We are currently using long arrays to hold file pointers. These arrays can get pretty big if the number of points is big which seems wasteful, moreover when those file pointers are monotonically increasing. This commit proposes to pack these arrays using existing lucene algorithms to save some heap during merging and to avoid humongous allocations.

relates #14382

final OneDimensionBKDWriter oneDimWriter =
new OneDimensionBKDWriter(metaOut, indexOut, dataOut);
new OneDimensionBKDWriter(metaOut, indexOut, dataOut, new ArrayLongAccumulator(numLeaves));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could/should we use the "packed" impl all the time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, I am concerned on the overhead of using it when flushing small segments.
Maybe it is not worthy the complexity of specializing for that case and we should use the packed impl all the time. I am ok either way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intuitively, there is ~512x (the leaf node size) more data than file pointers, so the bottleneck of flushing/merging should be on the data rather than on file pointers. So having small overhead on file pointers should be ok (hopefully benchmarks will confirm this).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, let's use it all the time for OneDimensionBKDWriter, it clearly simplifies the implementation

@@ -1961,7 +1989,7 @@ private void build(
int leafCardinality = heapSource.computeCardinality(from, to, commonPrefixLengths);

// Save the block file pointer:
leafBlockFPs[leavesOffset] = out.getFilePointer();
leafBlockFPs.add(out.getFilePointer());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, since filepointers are monotonic, we can make the compact. NICE!

@iverase iverase added this to the 10.2.0 milestone Mar 25, 2025
@iverase iverase merged commit 510ce5f into apache:main Mar 25, 2025
7 checks passed
@iverase iverase deleted the BKDFP branch March 25, 2025 07:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants