
[Review][Java] Extend Dataset to work as an output data container #1111


Open
wants to merge 9 commits into base: branch-25.08

Conversation

ldematte
Contributor

In #902 and #1034 we introduced a Dataset interface to support on-heap and off-heap ("native") memory seamlessly as inputs for cagra and bruteforce index building.

As we expand the functionality of cuvs-java, we realized we have similar needs for outputs (see e.g. #1105 / #1102 or #1104).

This PR extends Dataset to support being used as an output, wrapping native (off-heap) memory in a convenient and efficient way, and providing common utilities to transform to and from on-heap memory.
This work is inspired by the existing raft mdspan and DLTensor data structures, but tailored to our needs (2D only, just three data types, etc.). The PR deliberately keeps the current implementation simple and minimal, but structures it in a way that is easy to extend.
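Purely as an illustration of the general pattern (none of the names below, e.g. OffHeapFloatMatrix or toHostArray, is the actual cuvs-java API introduced by this PR), an output container of this kind pairs an off-heap buffer that native code can write into with a utility to copy the result back on-heap:

```java
// Illustrative sketch only: an off-heap 2D float buffer that native code could
// write results into, plus a conversion back to on-heap memory. These names are
// placeholders, not the actual cuvs-java API.
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

final class OffHeapFloatMatrix implements AutoCloseable {
    private final Arena arena = Arena.ofConfined();
    private final MemorySegment segment;
    private final int rows;
    private final int cols;

    OffHeapFloatMatrix(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        // Native (off-heap) memory a cuVS algorithm could fill with its output.
        this.segment = arena.allocate(ValueLayout.JAVA_FLOAT, (long) rows * cols);
    }

    MemorySegment segment() {
        return segment;
    }

    // Copies the off-heap result back into a conventional on-heap array.
    float[][] toHostArray() {
        float[][] out = new float[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                out[r][c] = segment.getAtIndex(ValueLayout.JAVA_FLOAT, (long) r * cols + c);
            }
        }
        return out;
    }

    @Override
    public void close() {
        arena.close();
    }
}
```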

By itself, the PR is just a refactoring to extend the Dataset implementation and reorganize the implementation classes; its real usefulness will come from the PRs mentioned above that build on it (in fact, this PR has been extracted from #1105).
The implementation class hierarchy is designed with future extensions in mind: at the moment we have a single HostMemoryDatasetImpl, but we are already planning a corresponding DeviceMemoryDatasetImpl that will wrap and manage views on GPU memory, to avoid (in some cases) extra copies of data from GPU memory to CPU memory just to process it or forward it to another algorithm (e.g. quantization followed by indexing); a rough sketch of that split is below.
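As a rough sketch of that split (the class names here are placeholders for this discussion, not the PR's actual implementation classes, and the raw device pointer is an assumption), the host implementation owns its off-heap allocation while a device implementation only holds a view over memory that already lives on the GPU:

```java
// Placeholder sketch of the host/device implementation split described above;
// not the PR's actual classes.
import java.lang.foreign.MemorySegment;

sealed interface MatrixImpl permits HostMemoryMatrixImpl, DeviceMemoryMatrixImpl { }

// Owns a host-side (off-heap) allocation that can be copied to/from on-heap arrays.
record HostMemoryMatrixImpl(MemorySegment hostSegment, long rows, long cols)
        implements MatrixImpl { }

// Holds only a view (an opaque device pointer, assumed here) over memory that is
// already on the GPU, so results can be forwarded to another algorithm
// (e.g. quantization followed by indexing) without a round trip through host memory.
record DeviceMemoryMatrixImpl(long devicePointer, long rows, long cols)
        implements MatrixImpl { }
```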

Future work will also include adding support for (and refactoring of) allocating and managing GPU memory and DLTensors (e.g. working better with, or refactoring, prepareTensor).


copy-pr-bot bot commented Jul 14, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cjnolet moved this to In Progress in Elasticsearch + cuVS Team on Jul 15, 2025
@mythrocks added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels on Jul 15, 2025
@ldematte
Contributor Author

As @chatman pointed out in #1105, this needs a better name, as it would be neither a "Dataset" nor a "Graph" in the lexicon of cuVS.
@mythrocks what about CuVSMatrix or CuVSArray2d?

@cjnolet
Member

cjnolet commented Jul 16, 2025

Linking cjnolet/nv_elastic#22

@mythrocks
Contributor

@mythrocks what about CuVSMatrix or CuVSArray2d?

Sorry for the delay. (It has been a packed day today.) I like the idea of calling this CuVSMatrix. 👍

There might be value in differentiating whether the matrix is in __host__ or __device__ memory. Separate types: CuVSDeviceMatrix vs CuVSHostMatrix?

@cjnolet
Member

cjnolet commented Jul 17, 2025

There might be value in differentiating whether the matrix is in host or device memory. Separate types: CuVSDeviceMatrix vs CuVSHostMatrix?

+1 for this. We do this in all of our other APIs (Python, C++) and it really helps us keep them clean and relatively self-documenting.

@ldematte
Contributor Author

I like the idea of separate types; I was already planning to do this at the implementation level (e.g. HostMemoryDatasetImpl). I can have intermediate CPU/GPU interfaces, but I'd like to keep a common ancestor. So we'll have CuVSMatrix, which can be either, and the sub-interfaces CuVSDeviceMatrix and CuVSHostMatrix.
Why the common type? So that when we want to be specific, we can be; and when we don't, we can pass the more generic type to any cuVS function and choose the path to take "dynamically", like we do in the C API: if (isGPU(matrix)) { do(); } else { copyToGPUMemory(); do(); } (very much pseudo-code; see the rough sketch below).
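In very rough Java (every name here, including copyToDevice and launch, is a hypothetical placeholder for this proposal, not existing cuvs-java API), the hierarchy and the dynamic path choice could look like:

```java
// Hypothetical sketch of the proposed CuVSMatrix hierarchy and the
// "choose the path dynamically" idea; none of these names are real cuvs-java API.
interface CuVSMatrix {
    long rows();
    long cols();
}

interface CuVSHostMatrix extends CuVSMatrix { }

interface CuVSDeviceMatrix extends CuVSMatrix { }

final class MatrixOps {
    // A cuVS-style function accepts the common type and branches on the concrete
    // kind, copying to device memory only when needed.
    static void runOnDevice(CuVSMatrix matrix) {
        if (matrix instanceof CuVSDeviceMatrix deviceMatrix) {
            launch(deviceMatrix);                 // already on the GPU: no copy
        } else if (matrix instanceof CuVSHostMatrix hostMatrix) {
            launch(copyToDevice(hostMatrix));     // copy to device first, then run
        } else {
            throw new IllegalArgumentException("Unknown matrix kind: " + matrix);
        }
    }

    private static CuVSDeviceMatrix copyToDevice(CuVSHostMatrix host) {
        throw new UnsupportedOperationException("placeholder for a host-to-device copy");
    }

    private static void launch(CuVSDeviceMatrix device) {
        // placeholder for invoking the native algorithm on device memory
    }
}
```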

WDYT?

ldematte added 6 commits July 17, 2025 12:05
…ed-dataset

# Conflicts:
#	java/cuvs-java/src/main/java/com/nvidia/cuvs/Dataset.java
#	java/cuvs-java/src/main/java/com/nvidia/cuvs/spi/CuVSProvider.java
#	java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/BruteForceIndexImpl.java
#	java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CagraIndexImpl.java
#	java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/HnswIndexImpl.java
#	java/cuvs-java/src/test/java/com/nvidia/cuvs/CagraBuildAndSearchIT.java
…ed-dataset

# Conflicts:
#	java/cuvs-java/src/main/java22/com/nvidia/cuvs/spi/JDKProvider.java