
[WIP] Multi-Vector support for HNSW search #13525


Open

wants to merge 135 commits into base: main

Conversation

vigyasharma
Contributor

@vigyasharma vigyasharma commented Jun 26, 2024

Adds support for multi-valued vectors to Lucene.

In addition to max-similarity aggregations like parent-block joins, this change supports ColBERT style distance functions that compute interaction across all query and document vector values. Documents can have a variable number of vector values, but to support distance function computations, we require all values to have the same dimension.

This is a big change and I still need to work on tests (existing and new), backward compatibility, benchmarks, and some code refactoring/cleanup. I'm raising this early version to get feedback on the overall approach, and have marked the PR with nocommit tags.

Addresses #12313 .

.

Approach

We define a new "Tensor" field that comprises multiple vector values, and a new TensorSimilarityFunction to compute distance across multiple vectors (uses SumMax() currently). Node ordinal is assigned to the tensor value, giving us one ordinal per document. All vector values of a tensor field are processed together during writing, reading and scoring. They are passed around as a packed float[] or byte[] array with all vector values concatenated. Consumers (like the TensorSimilarityFunction) slice this array by dimension to get individual vector values.
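
For illustration, a minimal sketch of a SumMax-style interaction over this packed representation (illustrative only, not the exact TensorSimilarityFunction code in this PR):

// Sketch: sum over query vectors of the max dot-product against any doc vector.
// Both arguments are packed arrays of concatenated vectors, all with the same dimension.
static float sumMaxDotProduct(float[] query, float[] doc, int dimension) {
  float total = 0f;
  for (int q = 0; q < query.length; q += dimension) {
    float best = Float.NEGATIVE_INFINITY;
    for (int d = 0; d < doc.length; d += dimension) {
      float dot = 0f;
      for (int i = 0; i < dimension; i++) {
        dot += query[q + i] * doc[d + i];
      }
      best = Math.max(best, dot);
    }
    total += best;
  }
  return total;
}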

Tensors are stored using a new FlatVectorStorage that supports writing/reading variable-length values per field (allowing a different number of vectors per tensor). We reuse the existing HNSW readers and writers. Each graph node is a tensor and maps to a single document. I also added a new codec tensor format, to allow both tensors and vectors to coexist. I'm not yet sure how to integrate with the quantization changes (that can be a separate, later change) and didn't want to force everything into a single format. Tensors continue to work with KnnVectorWriter/Reader and extend the FlatVectorWriter/Reader classes.

Finally, I named the field and format "Tensors", though technically these are only rank-2 tensors. The thought was that we might extend this field and format if we ever add support for higher-rank tensors. I'm open to renaming based on community feedback.

.

Major Changes

The PR touches a lot of files, which makes it impractical to review in one pass. Here are the files with the key changes (a usage sketch follows the list). If we align on the approach, I'm happy to re-raise this as separate, smaller PRs.

  1. New fields and similarity function for tensors.
    1. lucene/core/src/java/org/apache/lucene/document/FieldType.java
    2. lucene/core/src/java/org/apache/lucene/document/KnnByteTensorField.java
    3. lucene/core/src/java/org/apache/lucene/util/ByteTensorValue.java
    4. lucene/core/src/java/org/apache/lucene/document/KnnFloatTensorField.java
    5. lucene/core/src/java/org/apache/lucene/util/FloatTensorValue.java
    6. lucene/core/src/java/org/apache/lucene/index/TensorSimilarityFunction.java
    7. lucene/core/src/java/org/apache/lucene/index/FieldInfo.java
    8. lucene/core/src/java/org/apache/lucene/index/FieldInfos.java
  2. Indexing chain changes
    1. lucene/core/src/java/org/apache/lucene/index/IndexingChain.java
    2. lucene/core/src/java/org/apache/lucene/index/VectorValuesConsumer.java
  3. Reader side changes to return a tensor reader for tensor fields
    1. lucene/core/src/java/org/apache/lucene/index/SegmentCoreReaders.java
    2. lucene/core/src/java/org/apache/lucene/index/SegmentReader.java
  4. A new tensor format in the codec
    1. lucene/core/src/java/org/apache/lucene/codecs/KnnTensorsFormat.java
    2. lucene/core/src/java/org/apache/lucene/index/CodecReader.java
  5. A new tensor scorer to work with multiple vector values
    1. lucene/core/src/java/org/apache/lucene/codecs/hnsw/FlatTensorsScorer.java
    2. lucene/core/src/java/org/apache/lucene/codecs/hnsw/DefaultFlatTensorScorer.java
  6. A Lucene99FlatTensorsWriter for writing in the new flat tensor format - lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatTensorsWriter.java
  7. A Lucene99FlatTensorsReader for reading the flat tensor format - lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatTensorsReader.java
  8. An HnswTensorFormat that uses FlatTensorFormat to initialize the flat storage readers/writers underlying the HNSW reader/writer.
    1. lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswTensorsFormat.java
  9. Hnsw reader and writer changes to support tensor fields and similarity function
    1. lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java
    2. lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsWriter.java
  10. Off-heap byte and float TensorValues for use by scorers
    1. lucene/core/src/java/org/apache/lucene/codecs/lucene99/OffHeapByteTensorValues.java
    2. lucene/core/src/java/org/apache/lucene/codecs/lucene99/OffHeapFloatTensorValues.java
  11. Setup to read and write tensor data value offsets to support variable vector count per tensor. This uses a DirectMonotonicReader/Writer.
    1. lucene/core/src/java/org/apache/lucene/codecs/lucene99/TensorDataOffsetsReaderConfiguration.java
  12. Syntactic sugar for tensor queries
    1. lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java
    2. lucene/core/src/java/org/apache/lucene/search/KnnByteTensorQuery.java
    3. lucene/core/src/java/org/apache/lucene/search/KnnFloatTensorQuery.java
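
As a rough usage sketch (the constructor shapes, the dimension argument, and the SUM_MAX constant below are assumptions modeled on the existing KnnFloatVectorField/KnnFloatVectorQuery, not necessarily the exact API in this PR):

// Hypothetical usage, mirroring the single-vector KNN API; actual signatures may differ.
float[] packedDocVectors = {0.1f, 0.2f, 0.3f, 0.4f};   // two 2-d vectors, concatenated
Document doc = new Document();
doc.add(new KnnFloatTensorField("colbert", packedDocVectors, 2, TensorSimilarityFunction.SUM_MAX));
indexWriter.addDocument(doc);

float[] packedQueryVectors = {0.5f, 0.6f, 0.7f, 0.8f}; // two 2-d query vectors, concatenated
TopDocs hits = searcher.search(new KnnFloatTensorQuery("colbert", packedQueryVectors, 10), 10);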

.

Open Questions

  1. I like the type safety of a separate field and similarityFn. It avoids user traps like passing a single-valued VectorSimilarityFunction for tensors. But it does add a bunch of extra code, and constructor changes across a lot of files. Some options to simplify could be:
    • Reuse vectorEncoding and vectorDimension attributes in FieldInfo instead of a separate tensor encoding and dimension
    • Also reuse VectorSimilarityFunction, but create a separate "tensor aggregator" that corresponds to SumMax.

@vigyasharma
Contributor Author

As mentioned earlier, here is my rough plan for splitting this change into smaller PRs. Some of these steps could be merged if the impl. warrants it:

  1. Multi-Vector similarity and aggregation classes.
  2. FieldInfo changes to add a new attribute for "aggregation". This will default to NONE for single-valued vectors and for formats prior to this change.
  3. Multi-vector support to flat vectors writer.
  4. Random access vector values for multi-vectors.
  5. Multi-vector support to flat vectors reader.
  6. Hnsw writer/reader changes to work with multi-vectors if configured.
  7. Support to index and query multi-vector values (may need to add this with the flat writer/reader PRs).

@jimczi
Contributor

jimczi commented Oct 29, 2024

The more I think about it, the less I feel like the knn codec is the best choice for this feature (assuming that this issue is focused on late interaction models).

It is possible that HNSW is not the ideal data structure to expose multi-vector ANN. We don't really change much in the hnsw impl, except using multi-vector similarity for comparisons (graph build and search). Users can use the PerFieldKnnVectorsFormat to wire different data structures on top of the flat multi-vector format. We can also provide something out of the box in a subsequent change. I think the aggregation function interface is also flexible enough for different types of similarity implementations?

Using the knn codec to handle multi-vectors seems limiting, especially since it treats multi-vectors as a single unit for scoring. This works well for late interaction models, where we’re dealing with a collection of embeddings, but it’s restrictive if we want to index each vector separately.
Using the original max similarity for HNSW is just not practical: it doesn't scale, and I don't think it's something we'd actually want to support.

It could be helpful to explore other options instead of relying on the knn codec alone. Along those lines, I created a quick draft of a LateInteractionField using binary doc values, which keeps things simple and avoids major changes to the knn codec. I don’t think the flat vector format really offers any advantages over using binary doc values. In both cases, we’re able to store plain dense vectors as bytes, so there doesn’t seem to be a clear benefit to using the flat format here.
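
As a rough sketch of that direction (illustrative only, not the actual LateInteractionField draft; the helper names are made up), a document's embeddings can be serialized into a single BinaryDocValuesField and decoded again when rescoring the top hits:

// Sketch: pack a bag of float vectors into one BinaryDocValuesField value.
// Uses org.apache.lucene.document.BinaryDocValuesField, org.apache.lucene.util.BytesRef, java.nio.*.
static BinaryDocValuesField toField(String name, float[][] vectors, int dimension) {
  ByteBuffer buffer =
      ByteBuffer.allocate(vectors.length * dimension * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
  for (float[] vector : vectors) {
    for (float v : vector) {
      buffer.putFloat(v);
    }
  }
  return new BinaryDocValuesField(name, new BytesRef(buffer.array()));
}

// Sketch: decode the packed bytes back into vectors for late-interaction rescoring.
static float[][] fromBytes(BytesRef bytes, int dimension) {
  FloatBuffer floats =
      ByteBuffer.wrap(bytes.bytes, bytes.offset, bytes.length).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer();
  float[][] vectors = new float[floats.remaining() / dimension][dimension];
  for (float[] vector : vectors) {
    floats.get(vector);
  }
  return vectors;
}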

What do you think of this approach? It feels like we could skip the full knn framework if our main goal is just to score a bag of embeddings. This would keep things simpler and allow us to focus specifically on max similarity scoring without the added weight of the full knn codec.

My main worry is that adding multi-vectors to the knn codec as a late interaction model might add complexity later. It’s really two different approaches, and it seems valuable to keep the option for indexing each vector separately. We could expose this flexibility through the aggregation function, but that might complicate things across all codecs, as they’d need to handle both aggregate and independent cases efficiently.

@vigyasharma
Contributor Author

One use-case for multi-vectors is indexing product aspects as separate embeddings for e-commerce search. At Amazon Product Search (where I work), we'd like to experiment with separate embeddings to represent product attributes, user product opinions, and product images. Such e-commerce use-cases would have a limited set of embeddings, but leverage similarity computations across all of them.

I see your point about scaling challenges with very high cardinality multi-vectors like token-level ColBERT embeddings. Keeping them in a BinaryDocValues field is a good idea for scoring-only applications. I like the LateInteractionField wrapper you shared; we should bring it into Lucene for such use cases.

However, I do think there is space for both solutions. It's not obvious to me how this pollutes the knn codec with future complexity. We would still support single vectors as is. My mental model is: if you want to use multi-vectors in nearest neighbor search (hnsw or newer algos later), index them in the knn field. Otherwise, index them separately as doc values used only for re-ranking top results.

@benwtrent
Member

One use-case for multi-vectors is indexing product aspects as separate embeddings for e-commerce search. At Amazon Product Search (where I work), we'd like to experiment with separate embeddings to represent product attributes, user product opinions, and product images. Such e-commerce use-cases would have a limited set of embeddings, but leverage similarity computations across all of them.

This seems like it could be handled with more than one knn field, or with the nested field support.

But, I understand the desire to add multi-vector support to the flat codecs. I am honestly torn about what's the best path forward for the majority of users in Lucene.

@vigyasharma
Contributor Author

I tried to find some blogs and benchmarks on other library implementations. Astra Db, Vespa, faiss, and nmslib all seem to support multi-vectors in some form.

From what I can tell, Astra Db and Vespa have ColBERT-style multi-vector support in ANN [1] [2]. Benchmarks indicate ColBERT outperforms other techniques in quality, but full ColBERT on ANN has higher latency [3]. For large-scale applications, users seem to over-query on ANN with single-vector representations, and rerank them with ColBERT token vectors [4]. However, there's also ongoing work/research on reducing the number of embeddings in ColBERT, like PLAID, which replaces groups of vectors with their centroids [5].

...

I am honestly torn about what's the best path forward for the majority of users in Lucene.

I hear you! And I don't want to add complexity only because we have some body of work in this PR. Thanks for raising the concern Jim, it led me to some interesting reading.

...

My current thinking is that this is a rapidly evolving field, and it's early to lean one way or another. Adding this support unlocks experimentation. We might add different, scalable ANN algos going forward, and our flat storage format should work with most of them. Meanwhile, there's research on different ways to run late interaction with multiple but fewer vectors. This change will help users experiment with what works at their scale, for their cost/performance/quality requirements.

I'm happy to change my perspective, and would like to hear more opinions. One reason to not add this would be if it makes the single vector setup hard to evolve. I'd like to understand if (and how) this is happening, and think on how we can address those concerns.
...

[1] https://docs.datastax.com/en/ragstack/examples/colbert.html
[2] https://blog.vespa.ai/semantic-search-with-multi-vector-indexing/
[3] https://thenewstack.io/overcoming-the-limits-of-rag-with-colbert/
[4] https://blog.vespa.ai/announcing-long-context-colbert-in-vespa/
[5] PLAID: https://arxiv.org/abs/2205.09707

@krickert

krickert commented Nov 9, 2024

My current thinking is that this is a rapidly evolving field, and it's early to lean one way or another. Adding this support unlocks experimentation.

Amen!

This ends up being so domain-specific. Multi-embeddings become key when you deal with domain voids in the LLMs used to create the embeddings, which is true of most big corpora. So at least being able to experiment would get you far more feedback.

I would be ok with writing some tests if that helps.

@jimczi
Contributor

jimczi commented Nov 15, 2024

One reason to not add this would be if it makes the single vector setup hard to evolve. I'd like to understand if (and how) this is happening, and think on how we can address those concerns.

I believe we should carefully consider the approach to adding multi-vector support through an aggregate function. From the outset, we assume that multi-vectors should be scored together, which is an important principle. Moreover, the default aggregate function proposed in the PR relies on brute force, which is not practical for any indexing setup.

My concern is that this proposal doesn’t truly add support for independent multi-vectors. Instead, it introduces a block of vectors that must be scored together, which feels like a workaround rather than a comprehensive solution. This approach doesn’t address the key challenges of implementing true multi-vector support in the codec.

The root issue is that the current KNN codec assumes the number of vectors is bounded by a single integer, a limitation that needs to be addressed first. Removing this constraint is a complex task but essential for properly supporting multi-vectors. Once that foundation is in place, adding support for setups like ColBERT should become relatively straightforward.

Finally, while the max-sim function proposed in this PR may work as a ranking function, it isn't suitable for indexing documents. A true solution should allow independent multi-vectors to be queried and scored flexibly, without these constraints.

@vigyasharma
Contributor Author

My concern is that this proposal doesn’t truly add support for independent multi-vectors.

That's a valid concern. I've been thinking about a more comprehensive multi-vector solution. Sharing some raw thoughts below, would love to get feedback.

We support a default aggregation value of NONE, which builds the graph with independent multi-vectors. Each node will be a separate vector value. As a first change, we can just support this without creating an aggregation enum. (Adding a plan for indexing this in a follow-up comment).

Once this is in place, we can add support for "dependent" multi-vector values like ColBERT. They'll take an aggregation function. Each graph node will represent all vectors for a document and use aggregated similarity (like in this PR). This will let us experiment with full ANN on ColBERT style multi-vectors.

@vigyasharma
Contributor Author

...contd. from above – thoughts on supporting independent multi-vectors specified via NONE multi-vector aggregation...
__

The Knn{Float|Byte}Vector fields will accept multiple vector values for documents. Each vector value will be uniquely identifiable by a nodeId. Vectors for a doc will be stored adjacent to each other in flat storage. KnnVectorValues will support APIs for 1) getting docId for a given nodeId (existing), 2) getting vector value for a specific nodeId (existing), 3) getting all vector values for the document corresponding to a nodeId (new).

Our codec today has a single, unique, sequentially increasing vector ordinal per doc, which we store and fetch with a DirectMonotonicWriter/Reader. For multi-vectors, we need to handle multiple nodeIds mapping to a single document.

I'm thinking of using "ordinals" and "sub-ordinals" to identify each vector value. The 'ordinal' is incremented when the docId changes. 'Sub-ordinals' start at 0 for each new doc and are incremented for subsequent vector values in the doc. A nodeId in the graph is a "long" with the ordinal packed into the most-significant 32 bits and the sub-ordinal into the least-significant 32 bits.
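
Concretely, the packing/unpacking is simple bit manipulation, e.g.:

// Pack ordinal (most-significant 32 bits) and sub-ordinal (least-significant 32 bits) into one nodeId.
static long toNodeId(int ordinal, int subOrdinal) {
  return ((long) ordinal << 32) | (subOrdinal & 0xFFFFFFFFL);
}

static int ordinal(long nodeId) {
  return (int) (nodeId >>> 32);   // upper 32 bits
}

static int subOrdinal(long nodeId) {
  return (int) nodeId;            // lower 32 bits
}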

For flat storage, we can continue to use the technique in this PR; i.e. have one DirectMonotonicWriter object for docIds indexed by "ordinals", and another that stores start offsets for each docId, again indexed by ordinals. The sub-ordinal bits help us seek to exact vector values from this metadata.

int ordToDoc(long nodeId) {
  // get int ordinal from most-significant 32 bits
  // get docId for the ordinal via the DirectMonotonicReader
}

float[] vectorValue(long nodeId) {
  // get int ordinal from most-significant 32 bits
  // get "startOffset" for ordinal
  // get subOrdinal from least-significant 32 bits
  // read vector value from startOffset + (subOrdinal * dimension * byteSize)
}

float[] getAllVectorValues(long nodeId) {
  // get int ordinal from most-significant 32 bits
  // get "startOffset" for ordinal
  // get "endOffset" from offset value for ordinal + 1
  // return values from [startOffset, endOffset)
}

With this setup, we won't need parent-block join queries for multiple vector values. And we can use getAllVectorValues() for scoring with max or avg of all vectors in the doc at query time.

I'm skeptical that this will give a visible performance boost. It should at least be similar to the block-join setup we have today, but hopefully more convenient to use. And it sets us up for "dependent" multi-vector values like ColBERT.

We'll need to code this up to iron out any wrinkles. I can work on a draft PR if the idea makes sense.
__

Note that this still doesn't allow >2B vector values. While the "long" nodeId can support it, our ANN impl. returns arrays containing all nodeIds in various places, and Java arrays can't exceed ~2B entries. But we can address this limitation separately, perhaps with a different ANN algo for such high cardinality graphs.

@krickert

And we can use getAllVectorValues() for scoring with max or avg of all vectors in the doc at query time.

Your proposal to implement getAllVectorValues() for scoring documents by aggregating their vectors (using methods like max or average) at query time has a lot of use cases, and I think it's a great idea. However, on my domain-specific data, this approach hasn't enhanced search results. Still, providing a default implementation, as you suggested, with the option for customization, could be beneficial.

(sidenote: if you are doing max/average, you can do that during index time though, right?)

I'm currently conducting A/B tests on three methods to retrieve and rank documents with multiple vectors:

  1. Aggregate Scoring: Computing a single relevance score per document by aggregating all its vectors. Flexibility in the aggregation method would help me a lot.
  2. Chunk-Based Highlighting: Treating each vector as a distinct document chunk to facilitate highlighting. This involves returning the top N documents, with each document potentially containing multiple relevant sections; K becomes more dynamic based on aggregate scores, because we want the top N documents and K represents the chunks that make up those documents. Implementing per-doc thresholds can help manage performance.
  3. Custom Aggregation with Embedding Tags: Associating vectors with specific tags, such as user access levels or n-gram embeddings, to enable dynamic aggregation strategies. This allows for personalized and context-sensitive relevance scoring and would require the ability to override/customize.

The third approach is particularly promising for domain-specific applications, where standard aggregation methods may not suffice. For instance, embedding tags could be linked to user access controls, unlocking certain vectors at query time, or to specific n-grams, activating them based on query content.

Incorporating a mechanism to override the default aggregation method would facilitate experimentation with these strategies.

@vigyasharma
Contributor Author

Thank you for sharing these use-cases @krickert !

  1. Aggregate Scoring – I think we can do this today by joining the child doc hits with their parents and calculating the score over all children in the ToParentBlockJoinQuery. The getAllVectorValues() API should make this easier by avoiding the two-phase query. We could also use aggregate query scores during approximate-search graph traversal itself (use the aggregate query similarity against all vector values for the doc)?

  2. Chunk-Based Highlighting – Interesting. With getAllVectorValues(), we can find all vector values with similarity above a separate sim-threshold for highlights?

  3. Custom Aggregation with Embedding Tags – I think this one plays better with a separate child doc per vector value. We can store these tags and access related data as separate fields in child docs and filter on them during search.

Honestly, I think the existing parent-block join can achieve most use-cases for independent multi-vectors (the passage vector use case). But the approach above might make it easier to use? We also need it for dependent multi-vectors like ColBERT, though it's a separate question on whether ANN is even viable for ColBERT (v/s only for reranking).

I'd like to know what issues or limitations people face with the existing parent-child support for multiple vector values, so we can address them here.

@krickert

Chunk-Based Highlighting – Interesting. With getAllVectorValues(), we can find all vector values with similarity above a separate sim-threshold for highlights?

Not sure. But it is frustrating for me: we only calculate K chunks and not N documents. I want to return N documents all the time, and keep running K until N is reached. Since it runs K on the chunks, I'd rather it return all the chunks it can until it reaches N documents. Then we can return the chunks that match, which can be used for highlighting.

I think this one plays better with a separate child doc per vector value. We can store these tags and access related data as separate fields in child docs and filter on them during search.

Indexing the child docs requires making more docs. We just care about the resulting embedding, so why not treat it like a tensor instead of an entire document? It's frustrating to always make a child doc for multiple vectors when I could just use a keyword-value style instead. Also, there are definitely some limitations with how you can use it for scoring, and the query ends up looking like a mess. If we can simplify the query syntax, that would help a lot.

If you can get a unit test going for your PR, I'd be glad to expand on it and play with it a bit.


github-actions bot commented Dec 5, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Dec 5, 2024
@vigyasharma
Contributor Author

I pivoted to an approach that handles independent multi-vectors within flat storage, instead of requiring index-time parent-block joins. I've raised a draft PR here – #14173

@github-actions github-actions bot removed the Stale label Jan 29, 2025

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@heemin32
Contributor

heemin32 commented May 22, 2025

I believe this proposal to add a new field for multi-vector support is facing significant challenges primarily because we aim to support HNSW-based search on it. However, if our goal were limited to enabling MAXSIM scoring between two multi-vectors using only exact search (i.e., without KNN indexing for faster retrieval), many of these challenges could be avoided. This exact search with MAXSIM scoring approach would still be valuable for re-scoring documents after an initial KNN query retrieves a small candidate set.

If we want to represent multi-vectors as an array of vectors within a single document—rather than relying on a parent-child document structure—it might make sense to integrate this into the existing field type. However, such an approach should clearly demonstrate a performance advantage to justify the change.

@heemin32
Contributor

On second thought, if the user just disables indexing, implementing multi-vectors in the same field might be the same as having a separate field without indexing capability.

@github-actions github-actions bot removed the Stale label May 23, 2025
@vigyasharma
Contributor Author

if our goal were limited to enabling MAXSIM scoring between two multi-vectors using only exact search (i.e., without KNN indexing for faster retrieval), many of these challenges could be avoided.

One of the challenges, and perhaps the main concern with this PR, has been the practical scaling challenge of working with multi-vectors over the entire corpus. This would apply to exact search as well. A viable middle ground would be to over-collect documents via regular single-vector knn search in the first pass, and then rerank them using late-interaction multi-vectors in the second pass. This idea was shared by jimczi@ earlier in this thread.

I'm working on a PR that combines his approach with a FunctionScore query for reranking. Should be able to raise it for review this week.


This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jun 11, 2025