Skip to content

Add multi-vector-support faiss patch to IndexHNSW::search_level_0 #2647

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Apr 9, 2025

Conversation

anntians
Copy link
Contributor

@anntians anntians commented Apr 3, 2025

Description

This PR fixes a bug in the GPU remote index builder feature by adding the multi-vector-support faiss patch https://github.com/opensearch-project/k-NN/blob/main/jni/patches/faiss/0001-Custom-patch-to-support-multi-vector.patch to the IndexHNSW::search_level_0 method.

The remote index builder first creates a GPU GpuIndexCagra index, then converts it to a CPU IndexHNSWCagra index with base_level_only = true, which will be downloaded and saved in KNN. The search method on a IndexHNSWCagra with base_level_only = true calls IndexHNSW::search_level_0. So this PR builds off of the existing faiss patch by adding the GroupedHeapBlockResultHandler to the IndexHNSW::search_level_0 method.

As seen in the Github issues below, several existing KNN ITs fail when run against the GPU remote index builder feature. This is largely due to KNN nested search queries returning hits to nested fields belonging to the same parent doc, resulting in <k docs returned in search results. This bug was previously patched in KNN, which uses the IndexHNSW index calling IndexHNSW::search(). But the remote index builder uses IndexHNSWCagra indices, calling IndexHNSW::search_level_0, which was not covered in the previous patch.

So this PR fixes the issue by using GroupedHeapBlockResultHandler in IndexHNSW::search_level_0 to ensure that documents are deduplicated based on their parentId, and the method returns k results for documents with nested fields.

For convenience, I opened up a PR on my faiss clone for review: https://github.com/anntians/faiss/pull/1/files. Please feel free to leave comments there.

Related Issues

Resolves
NestedSearchIT: opensearch-project/remote-vector-index-builder#34
ExpandNestedDocsIT: opensearch-project/remote-vector-index-builder#35
AdvancedFilteringUseCasesIT: opensearch-project/remote-vector-index-builder#32

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: AnnTian Shao <[email protected]>
@anntians
Copy link
Contributor Author

anntians commented Apr 7, 2025

This PR adds a new JNIService method updateIndexSettings to set base_level_only = true on the IndexHNSWCagra created in the KNN JNIServiceTests https://github.com/opensearch-project/k-NN/pull/2647/files#diff-0d5ddae71f742d733c35aeccb027096ba6dd3dc0076899cc07c760211a2b7a90. So this method was created in order to unit test the faiss patch on IndexHNSWCagra from KNN. Below is context on why this decision was made:

In the GPU remote index builder workflow:

  1. createGpuIndexCagra
  2. add vectors to GpuIndexCagra
  3. create IndexHNSWCagra with base_level_only = true
  4. copy GpuIndexCagra to IndexHNSWCagra

Setting base_level_only = true is used to copy the knn graph from GpuIndexCagra to the base level of IndexHNSWCagra without adding upper levels. Doing so enables to search the HNSW index, but removes the ability to add vectors by making the index immutable.

However, to unit test on KNN, we cannot build a GpuIndexCagra first since we cannot assume KNN will be running on GPU machines. So in order to test:

  1. create IndexHNSWCagra
  2. add vectors to IndexHNSWCagra
  3. set base_level_only = true. This setting makes the graph immutable, so we set it after adding vectors

This ensures that the unit test goes down the same search code path as the workflow with GPU remote index builder. This is the reason for adding a new JNIService method updateIndexSettings.

I'm looking for consensus on this approach in the first take of this PR before adding all the JNI and KNN tests for updateIndexSettings

for (SpaceType spaceType : spaces) {

final long pointer;
try (IndexInput indexInput = loadHnswBinary("data/remoteindexbuild/faiss_hnsw_cagra_nested_float_1000_vectors_128_dims.bin")) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This faiss graph binary was created with the IndexHNSWCagra index (base_level_only==true) containing the test_vectors_nested_1000x128.json vectors

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT : could you add it in the comment?

@anntians
Copy link
Contributor Author

anntians commented Apr 8, 2025

Following up from #2647 (comment), we decided to remove the new JNI layer updateIndexSettings method, and instead load a faiss graph binary created with the IndexHNSWCagra index (base_level_only==true) containing the test_vectors_nested_1000x128.json vectors

This ensures that the JNIServiceTests unit test https://github.com/opensearch-project/k-NN/pull/2647/files#diff-0d5ddae71f742d733c35aeccb027096ba6dd3dc0076899cc07c760211a2b7a90 goes down the same search code path as the workflow with GPU remote index builder

…ethod updateIndexSettings

Signed-off-by: AnnTian Shao <[email protected]>
List<SpaceType> spaces = ImmutableList.of(SpaceType.L2, SpaceType.INNER_PRODUCT);
int[] parentIds = toParentIdArray(testDataNested.indexData.docs);
Map<Integer, Integer> idToParentIdMap = toIdToParentIdMap(testDataNested.indexData.docs);
for (SpaceType spaceType : spaces) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make a separate function accepting SpaceType?

doTestQueryIndex_faissCagra_parentIds(SpaceType.L2, ...);
doTestQueryIndex_faissCagra_parentIds(SpaceType.INNER_PRODUCT, ...);

Signed-off-by: AnnTian Shao <[email protected]>
@jmazanec15 jmazanec15 merged commit 4a11c33 into opensearch-project:main Apr 9, 2025
36 checks passed
@github-project-automation github-project-automation bot moved this from 3.0 to ✅ Done in Vector Search RoadMap Apr 9, 2025
anntians added a commit to anntians/k-NN that referenced this pull request Apr 22, 2025
…ensearch-project#2647)

* Add multi-vector-support faiss patch to IndexHNSW::search_level_0

Signed-off-by: AnnTian Shao <[email protected]>

* Add tests to JNI and KNN

Signed-off-by: AnnTian Shao <[email protected]>

* Update tests by adding hnsw cagra index binary and remove JNI layer method updateIndexSettings

Signed-off-by: AnnTian Shao <[email protected]>

* test fixes

Signed-off-by: AnnTian Shao <[email protected]>

---------

Signed-off-by: AnnTian Shao <[email protected]>
Co-authored-by: AnnTian Shao <[email protected]>
oaganesh pushed a commit to oaganesh/k-NN that referenced this pull request Apr 24, 2025
…ensearch-project#2647)

* Add multi-vector-support faiss patch to IndexHNSW::search_level_0

Signed-off-by: AnnTian Shao <[email protected]>

* Add tests to JNI and KNN

Signed-off-by: AnnTian Shao <[email protected]>

* Update tests by adding hnsw cagra index binary and remove JNI layer method updateIndexSettings

Signed-off-by: AnnTian Shao <[email protected]>

* test fixes

Signed-off-by: AnnTian Shao <[email protected]>

---------

Signed-off-by: AnnTian Shao <[email protected]>
Co-authored-by: AnnTian Shao <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants