cache preset dict for LZ4WithPresetDictDecompressor #14397
base: main
Conversation
Force-pushed from e5193bc to 100974c
@kkewwei - Thanks for raising this PR. Looks like a useful change to me.
lucene/core/src/java/org/apache/lucene/codecs/lucene90/LZ4WithPresetDictCompressionMode.java (outdated review comment, resolved)
```java
decompressor.reset();
decompressor.decompress(fieldsStream, toDecompress, 0, toDecompress, spare);
```
I am wondering if reset should be the default behavior. We can pass another flag to indicate reuse if possible.
It seems that `reset` is essential. When the block changes, we must discard the cache immediately, and this change can only be detected from outside the decompressor.
> When the block changes, we must discard the cache immediately, and this change can only be detected from outside the decompressor.

I am not questioning that. My point is to not have a `reset` method in the Decompressor interface, and instead add another `decompress` method that takes `reuseIfPossible` as one of its parameters. That ensures functional correctness even if we forget to make the `reset` call somewhere in the code, and it still allows explicit optimization wherever we deem appropriate. The risk of not explicitly making the `reset` call is much greater than the cost of using the original `decompress` without reuse.
```java
public abstract class Decompressor implements Cloneable {

  protected Decompressor() {}

  public void decompress(DataInput in, int originalLength, int offset, int length, BytesRef bytes)
      throws IOException {
    decompress(in, originalLength, offset, length, bytes, false);
  }

  public abstract void decompress(
      DataInput in, int originalLength, int offset, int length, BytesRef bytes,
      boolean reuseIfPossible) throws IOException;

  @Override
  public abstract Decompressor clone();
}
```
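To illustrate how such a two-method shape keeps the safe default while letting callers opt in, here is a standalone toy model; the class names, the `byte[]`/`StringBuilder` signatures, and the `dictReads` counter are invented for illustration and are not the actual Lucene code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Toy model of the proposed interface: the one-arg overload defaults to no
// reuse, so forgetting to opt in is always safe, while callers that know the
// chunk has not changed can pass reuseIfPossible=true. Names are illustrative.
abstract class Decompressor implements Cloneable {
  public void decompress(byte[] in, StringBuilder out) throws IOException {
    decompress(in, out, /*reuseIfPossible=*/ false); // safe default: re-read the dict
  }

  public abstract void decompress(byte[] in, StringBuilder out, boolean reuseIfPossible)
      throws IOException;

  @Override
  public abstract Decompressor clone();
}

public class CallSite {
  static class CountingDecompressor extends Decompressor {
    int dictReads = 0; // how many times the preset dict was (re)read

    @Override
    public void decompress(byte[] in, StringBuilder out, boolean reuseIfPossible) {
      if (!reuseIfPossible) {
        dictReads++; // simulate re-reading the preset dictionary
      }
      out.append(new String(in, StandardCharsets.UTF_8));
    }

    @Override
    public Decompressor clone() {
      return new CountingDecompressor();
    }
  }

  public static void main(String[] args) throws IOException {
    CountingDecompressor d = new CountingDecompressor();
    StringBuilder out = new StringBuilder();
    byte[] doc = "doc".getBytes(StandardCharsets.UTF_8);
    d.decompress(doc, out);       // first doc of a chunk: dict is read
    d.decompress(doc, out, true); // same chunk: caller opts into reuse
    System.out.println("dictReads=" + d.dictReads); // prints dictReads=1
  }
}
```

The design point is that every existing one-arg call site stays correct without modification; only hot paths need to be audited to pass `reuseIfPossible=true`.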
I tried but failed to rely only on the outer `reuseIfPossible` to decide whether to cache the preset dict. In the following case, the outer code must still call `reset` to clear the cache.

We have two chunks:
- chunk0 [doc0(length>0)]
- chunk1 [doc0(length=0), doc1(length=1)]

The steps are as follows:
1. Read chunk0/doc0 with `reuseIfPossible`=false.
2. Read chunk1/doc0 with `reuseIfPossible`=false. As the length is 0, Lucene will not read the preset dict, so the preset dict is not cached.
3. Read chunk1/doc1. In this case doc1 is in the current chunk1, so `reuseIfPossible`=true, but the preset dict is not cached, and Lucene will throw an exception.

In this case, we should call `reset` in step 2, when the chunk changes.
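The failure sequence above can be sketched with a toy cache; the chunk/doc modeling and all names here are invented for illustration, not the actual Lucene code. It shows that without an external `reset` between chunks, the reuse step either finds an empty cache or, worse, a stale dictionary from the previous chunk.

```java
// Toy model of the preset-dict cache across the three steps. Names invented.
public class StaleCacheDemo {
  static byte[] cachedDict = null; // preset dict cached by the last dict read

  // Returns true if the doc could be decompressed with a cached/fresh dict.
  static boolean decompress(String chunk, int docLength, boolean reuseIfPossible) {
    if (reuseIfPossible) {
      // Step 3 lands here: doc1 is in the current chunk, but the dict was
      // never cached because doc0 of this chunk had length 0.
      return cachedDict != null;
    }
    if (docLength == 0) {
      return true; // nothing to decompress: the preset dict is NOT read or cached
    }
    cachedDict = chunk.getBytes(); // read and cache the dict
    return true;
  }

  static void reset() { // what the external reset call does on a chunk change
    cachedDict = null;
  }

  public static void main(String[] args) {
    decompress("chunk0", 3, false);             // step 1: chunk0 dict is cached
    // Without reset() here, the chunk0 dict lingers into chunk1.
    decompress("chunk1", 0, false);             // step 2: length 0, dict not re-read
    boolean ok = decompress("chunk1", 1, true); // step 3: reuse requested
    // ok is true only because the STALE chunk0 dict is still cached; the real
    // code would either decompress with the wrong dictionary or throw.
    System.out.println("reused stale dict: " + ok);
  }
}
```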
Force-pushed from 9ab44d5 to b2c1f0b
```java
if (reused) {
  assert buffer.length >= dictLength + blockLength;
  in.skipBytes(compressedLengths[0]);
} else {
  // Read the dictionary
  buffer = ArrayUtil.growNoCopy(buffer, dictLength + blockLength);
  if (LZ4.decompress(in, dictLength, buffer, 0) != dictLength) {
    throw new CorruptIndexException("Illegal dict length", in);
  }
  reused = true;
}
```
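A minimal standalone sketch of how `reset` pairs with the `reused` flag: clearing the flag forces the next call to re-read the dictionary, while back-to-back calls on the same chunk skip it. The class and the `dictReads` counter are toy constructs for illustration, not the actual decompressor.

```java
// Toy caching flag mirroring the reused/reset interplay. Illustrative only.
public class PresetDictCache {
  private boolean reused = false;
  int dictReads = 0; // counts real dictionary reads

  void decompress() {
    if (reused) {
      // same chunk: skip over the compressed dictionary bytes
    } else {
      dictReads++; // read + decompress the dictionary into the buffer
      reused = true;
    }
  }

  void reset() { // called externally when moving to a different chunk
    reused = false;
  }

  public static void main(String[] args) {
    PresetDictCache c = new PresetDictCache();
    c.decompress(); // reads the dict
    c.decompress(); // same chunk: dict reused
    c.reset();      // new chunk
    c.decompress(); // dict read again
    System.out.println("dictReads=" + c.dictReads); // prints dictReads=2
  }
}
```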
I am wondering if we should consider exposing a metric (maybe a simple counter) on how many times we could reuse the dict, and how many times we had to read it from disk. That would provide useful insight into the benefit of this change.
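The counter suggested here could be as simple as the following sketch; `PresetDictMetrics`, `reuseCount`, and `missCount` are invented names for illustration, not an existing Lucene API.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a reuse-vs-miss counter for the preset-dict cache. Names invented.
public class PresetDictMetrics {
  static final AtomicLong reuseCount = new AtomicLong(); // dict served from cache
  static final AtomicLong missCount = new AtomicLong();  // dict read from disk

  static void onDecompress(boolean reused) {
    if (reused) {
      reuseCount.incrementAndGet();
    } else {
      missCount.incrementAndGet();
    }
  }

  public static void main(String[] args) {
    onDecompress(false); // first doc of a chunk: disk read
    onDecompress(true);  // next doc, same chunk: cache hit
    onDecompress(true);
    System.out.println("reuse=" + reuseCount.get() + " miss=" + missCount.get());
    // prints reuse=2 miss=1
  }
}
```

`AtomicLong` keeps the counters thread-safe without locking, which matters since stored-fields readers can be used from multiple threads.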
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
Force-pushed from b2c1f0b to 57d9661
Description

As mentioned in #14347, when we use `LZ4WithPresetDictDecompressor` to decompress, we always read the preset dict for every doc in non-merge scenarios. If two consecutive documents fall into the same chunk, we can reuse the same dictionary. This is a lossless optimization: the cached preset dict does not take up additional memory.

Lucene benchmark: `python3 src/python/localrun.py wikimediumall`
Hardware used: linux ecs.t2-c1m2dev.8xlarge | 32 cores | 64G

To obtain an accurate measure of the performance improvement, I ran the benchmark 6 times in total; the results are as follows:
Test1 through Test6: (benchmark result tables were attached as images and are not reproduced here)