-
Notifications
You must be signed in to change notification settings - Fork 149
Add multi-vector-support faiss patch to IndexHNSW::search_level_0 #2647
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: AnnTian Shao <[email protected]>
Signed-off-by: AnnTian Shao <[email protected]>
This PR adds a new In the GPU remote index builder workflow:
Setting However, to unit test on KNN, we cannot build a
This ensures that the unit test goes down the same search code path as the workflow with GPU remote index builder. This is the reason for adding a new I'm looking for consensus on this approach in the first take of this PR before adding all the JNI and KNN tests for |
2a60f73
to
968d2dc
Compare
for (SpaceType spaceType : spaces) { | ||
|
||
final long pointer; | ||
try (IndexInput indexInput = loadHnswBinary("data/remoteindexbuild/faiss_hnsw_cagra_nested_float_1000_vectors_128_dims.bin")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This faiss graph binary was created with the IndexHNSWCagra
index (base_level_only==true
) containing the test_vectors_nested_1000x128.json
vectors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT : could you add it in the comment?
Following up from #2647 (comment), we decided to remove the new JNI layer This ensures that the |
…ethod updateIndexSettings Signed-off-by: AnnTian Shao <[email protected]>
List<SpaceType> spaces = ImmutableList.of(SpaceType.L2, SpaceType.INNER_PRODUCT); | ||
int[] parentIds = toParentIdArray(testDataNested.indexData.docs); | ||
Map<Integer, Integer> idToParentIdMap = toIdToParentIdMap(testDataNested.indexData.docs); | ||
for (SpaceType spaceType : spaces) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you make a separate function accepting SpaceType
?
doTestQueryIndex_faissCagra_parentIds(SpaceType.L2, ...);
doTestQueryIndex_faissCagra_parentIds(SpaceType.INNER_PRODUCT, ...);
Signed-off-by: AnnTian Shao <[email protected]>
…ensearch-project#2647) * Add multi-vector-support faiss patch to IndexHNSW::search_level_0 Signed-off-by: AnnTian Shao <[email protected]> * Add tests to JNI and KNN Signed-off-by: AnnTian Shao <[email protected]> * Update tests by adding hnsw cagra index binary and remove JNI layer method updateIndexSettings Signed-off-by: AnnTian Shao <[email protected]> * test fixes Signed-off-by: AnnTian Shao <[email protected]> --------- Signed-off-by: AnnTian Shao <[email protected]> Co-authored-by: AnnTian Shao <[email protected]>
…ensearch-project#2647) * Add multi-vector-support faiss patch to IndexHNSW::search_level_0 Signed-off-by: AnnTian Shao <[email protected]> * Add tests to JNI and KNN Signed-off-by: AnnTian Shao <[email protected]> * Update tests by adding hnsw cagra index binary and remove JNI layer method updateIndexSettings Signed-off-by: AnnTian Shao <[email protected]> * test fixes Signed-off-by: AnnTian Shao <[email protected]> --------- Signed-off-by: AnnTian Shao <[email protected]> Co-authored-by: AnnTian Shao <[email protected]>
Description
This PR fixes a bug in the GPU remote index builder feature by adding the multi-vector-support faiss patch https://github.com/opensearch-project/k-NN/blob/main/jni/patches/faiss/0001-Custom-patch-to-support-multi-vector.patch to the
IndexHNSW::search_level_0
method.The remote index builder first creates a GPU
GpuIndexCagra
index, then converts it to a CPUIndexHNSWCagra
index withbase_level_only = true
, which will be downloaded and saved in KNN. The search method on aIndexHNSWCagra
withbase_level_only = true
callsIndexHNSW::search_level_0
. So this PR builds off of the existing faiss patch by adding theGroupedHeapBlockResultHandler
to theIndexHNSW::search_level_0
method.As seen in the Github issues below, several existing KNN ITs fail when run against the GPU remote index builder feature. This is largely due to KNN nested search queries returning hits to nested fields belonging to the same parent doc, resulting in
<k
docs returned in search results. This bug was previously patched in KNN, which uses theIndexHNSW
index callingIndexHNSW::search()
. But the remote index builder usesIndexHNSWCagra
indices, callingIndexHNSW::search_level_0
, which was not covered in the previous patch.So this PR fixes the issue by using
GroupedHeapBlockResultHandler
inIndexHNSW::search_level_0
to ensure that documents are deduplicated based on their parentId, and the method returns k results for documents with nested fields.For convenience, I opened up a PR on my faiss clone for review: https://github.com/anntians/faiss/pull/1/files. Please feel free to leave comments there.
Related Issues
Resolves
NestedSearchIT: opensearch-project/remote-vector-index-builder#34
ExpandNestedDocsIT: opensearch-project/remote-vector-index-builder#35
AdvancedFilteringUseCasesIT: opensearch-project/remote-vector-index-builder#32
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.