Skip to content

Completion FSTs to be loaded off-heap at all times #14364

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Mar 19, 2025

Conversation

javanna
Copy link
Contributor

@javanna javanna commented Mar 17, 2025

All the existing completion postings format load their FSTs on-heap. It is possible to customize that behaviour by mainintaing a custom postings format that override the fst load mode.

TestSuggestField attempted to test all modes, but in practice, it can only test the default load mode because the read path relies on the default constructor called via SPI, that does not allow providing the fst load mode.

This commit switches loading of FST to off-heap, and removes the fst load mode argument from all completion postings format classes, effectively making the loading always off-heap.

See related discussion where this change originated at #14275 .

All the existing completion postings format load their FSTs on-heap.
It is possible to customize that behaviour by mainintaing a custom postings
format that override the fst load mode.

TestSuggestField attempted to test all modes, but in practice, it can only test
the default load mode because the read path relies on the default constructor
called via SPI, that does not allow providing the fst load mode.

This commit changes the default value to OFF_HEAP, and addresses the testing
gap in TestSuggestField, so that all load modes are tested. This unveiled a bug
when loading FST off heap and the index input is a MemorySegmentIndexInput
which is fixed by apache#14271.
* Creates a {@link Completion101PostingsFormat} that will use the provided <code>fstLoadMode
* </code> to determine if the completion FST should be loaded on or off heap.
*/
public Completion101PostingsFormat(FSTLoadMode fstLoadMode) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find that these constructors taking the fst load mode are trappy, in that they make readers think that calling them allows to override the load mode. In practice, the load mode is only ever used at read time, and that codepath goes through SPI loading, so via the default constructor. I think removing these is the best path forward to avoid confusion.

@@ -122,11 +122,6 @@ public enum FSTLoadMode {

private final FSTLoadMode fstLoadMode;

/** Used only by core Lucene at read-time via Service Provider instantiation */
public CompletionPostingsFormat(String name) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was unused and redundant.

protected PostingsFormat delegatePostingsFormat() {
return delegate.delegatePostingsFormat();
}
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW this may work accidentally on NRT segments if IndexWriter opens them with the codec configured on the IndexWriterConfig, but it you were to open an index from a Directory using this postings format, it would open it using the default FST load mode and ignore the one passed to the constructor here. This is because, Lucene would read "Completion101" from the segment infos and look it up through SPI, which would return the default format that is registered under this name rather than this instance that you are creating here.

I wonder if we should remove FSTLoadMode completely. To your point, it's trappy to think that this thing is actually configurable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, I tried to quickly hack this together and I ended up making the same mistake we already had in existing tests. Shame on me! I really do need to register the custom postings format to SPI for this to leverage the random fst load mode. I was also thinking about the option to remove configurability of fst load mode. That would certainly simplify things and avoid this same mistake in the future. Either that, or have a proper example of how the postings format load mode can be customized.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this PR to remove the option to configure the load mode entirely.

@javanna javanna changed the title Completion FSTs to be loaded off-heap by default Completion FSTs to be loaded off-heap at all times Mar 18, 2025
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, don't miss adding a CHANGES entry. :) We may want to add back-compat for NRTSuggester#load(IndexInput, FstLoadMode) upon backporting?

@javanna
Copy link
Contributor Author

javanna commented Mar 19, 2025

Thanks @jpountz what's your suggestion around back-compat? Sounds like you are suggesting not backporting the removal of the fst load mode enum but only the switch to off-heap by default, is that accurate?

@jpountz
Copy link
Contributor

jpountz commented Mar 19, 2025

I was thinking of keeping the load(IndexInput, FSTLoadMode) static method, documenting that the load mode is ignored and deprecating it. Indeed that would require keeping the FSTLoadMode enum, which would be deprecated too.

@javanna
Copy link
Contributor Author

javanna commented Mar 19, 2025

Cool then I will target this PR at main only, and open a separate PR for branch_10x. Out of curiosity, what are the usecases where you'd expect users to call NRTSuggester#load directly?

@jpountz
Copy link
Contributor

jpountz commented Mar 19, 2025

Good question. The class is public and looked like a user-facing API hence my comment, but you can't serialize a NRTSuggester yourself since NRTSuggesterBuilder is pkg-private. So it looks like this load method should really be pkg-private too?

@javanna
Copy link
Contributor Author

javanna commented Mar 19, 2025

I opened #14372 to address the visibility issue of load, that should simplify this PR and backporting it once merged.

@javanna javanna added this to the 10.2.0 milestone Mar 19, 2025
@javanna
Copy link
Contributor Author

javanna commented Mar 19, 2025

I merged main in after merging #14372 and added a changelog entry. I believe this is ready to go, and can now be backported to branch_10x as-is.

javanna added a commit to javanna/elasticsearch that referenced this pull request Mar 19, 2025
A while ago we introduced a completion postings format extension to eventually
be able to customize how completion FSTs are loaded. See elastic#111494.

We have never leveraged this extension, and meanwhile Lucene is moving
to always load FSTs off-heap, and no longer allow on-heap.
See apache/lucene#14364 .

This commit removes the SPI extension as it is no longer needed.
@jpountz
Copy link
Contributor

jpountz commented Mar 19, 2025

Agreed!

@javanna javanna merged commit fe86094 into apache:main Mar 19, 2025
7 checks passed
@javanna
Copy link
Contributor Author

javanna commented Mar 19, 2025

Thanks @jpountz for all the help!

javanna added a commit that referenced this pull request Mar 19, 2025
All the existing completion postings format load their FSTs on-heap. It is possible to customize that behaviour by mainintaing a custom postings format that override the fst load mode.

TestSuggestField attempted to test all modes, but in practice, it can only test the default load mode because the read path relies on the default constructor called via SPI, that does not allow providing the fst load mode.

This commit switches loading of FST to off-heap, and removes the fst load mode argument from all completion postings format classes, effectively making the loading always off-heap.
javanna added a commit to elastic/elasticsearch that referenced this pull request Mar 20, 2025
A while ago we introduced a completion postings format extension to eventually
be able to customize how completion FSTs are loaded. See #111494.

We have never leveraged this extension, and meanwhile Lucene is moving
to always load FSTs off-heap, and no longer allow on-heap.
See apache/lucene#14364 .

This commit removes the SPI extension as it is no longer needed.
afoucret pushed a commit to afoucret/elasticsearch that referenced this pull request Mar 21, 2025
A while ago we introduced a completion postings format extension to eventually
be able to customize how completion FSTs are loaded. See elastic#111494.

We have never leveraged this extension, and meanwhile Lucene is moving
to always load FSTs off-heap, and no longer allow on-heap.
See apache/lucene#14364 .

This commit removes the SPI extension as it is no longer needed.
smalyshev pushed a commit to smalyshev/elasticsearch that referenced this pull request Mar 21, 2025
A while ago we introduced a completion postings format extension to eventually
be able to customize how completion FSTs are loaded. See elastic#111494.

We have never leveraged this extension, and meanwhile Lucene is moving
to always load FSTs off-heap, and no longer allow on-heap.
See apache/lucene#14364 .

This commit removes the SPI extension as it is no longer needed.
jpountz pushed a commit to jpountz/lucene that referenced this pull request Mar 24, 2025
All the existing completion postings format load their FSTs on-heap. It is possible to customize that behaviour by mainintaing a custom postings format that override the fst load mode.

TestSuggestField attempted to test all modes, but in practice, it can only test the default load mode because the read path relies on the default constructor called via SPI, that does not allow providing the fst load mode.

This commit switches loading of FST to off-heap, and removes the fst load mode argument from all completion postings format classes, effectively making the loading always off-heap.
jpountz pushed a commit to jpountz/lucene that referenced this pull request Mar 24, 2025
All the existing completion postings format load their FSTs on-heap. It is possible to customize that behaviour by mainintaing a custom postings format that override the fst load mode.

TestSuggestField attempted to test all modes, but in practice, it can only test the default load mode because the read path relies on the default constructor called via SPI, that does not allow providing the fst load mode.

This commit switches loading of FST to off-heap, and removes the fst load mode argument from all completion postings format classes, effectively making the loading always off-heap.
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025
A while ago we introduced a completion postings format extension to eventually
be able to customize how completion FSTs are loaded. See elastic#111494.

We have never leveraged this extension, and meanwhile Lucene is moving
to always load FSTs off-heap, and no longer allow on-heap.
See apache/lucene#14364 .

This commit removes the SPI extension as it is no longer needed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants