
[Review][Java] Extend Dataset to work as an output data container #1111


Open
wants to merge 9 commits into base: branch-25.08

Conversation

ldematte
Contributor

In #902 and #1034 we introduced a Dataset interface to support on-heap and off-heap ("native") memory seamlessly as inputs for cagra and bruteforce index building.

As we expand the functionality of cuvs-java, we realized we have similar needs for outputs (see e.g. #1105 / #1102 or #1104).

This PR extends Dataset to support being used as an output, wrapping native (off-heap) memory in a convenient and efficient way, and providing common utilities to transform to and from on-heap memory.
This work is inspired by the existing raft mdspan and DLTensor data structures, but tailored to our needs (2D only, just three data types, etc.). The PR deliberately keeps the current implementation simple and minimal, but structures it in a way that is easy to extend.
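Purely as an illustration of the general pattern (none of the names below, e.g. OffHeapFloatMatrix or toHostArray, is the actual cuvs-java API introduced by this PR), an output container of this kind pairs an off-heap buffer that native code can write into with a utility to copy the result back on-heap:

```java
// Illustrative sketch only: an off-heap 2D float buffer that native code could
// write results into, plus a conversion back to on-heap memory. These names are
// placeholders, not the actual cuvs-java API.
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

final class OffHeapFloatMatrix implements AutoCloseable {
    private final Arena arena = Arena.ofConfined();
    private final MemorySegment segment;
    private final int rows;
    private final int cols;

    OffHeapFloatMatrix(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        // Native (off-heap) memory a cuVS algorithm could fill with its output.
        this.segment = arena.allocate(ValueLayout.JAVA_FLOAT, (long) rows * cols);
    }

    MemorySegment segment() {
        return segment;
    }

    // Copies the off-heap result back into a conventional on-heap array.
    float[][] toHostArray() {
        float[][] out = new float[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                out[r][c] = segment.getAtIndex(ValueLayout.JAVA_FLOAT, (long) r * cols + c);
            }
        }
        return out;
    }

    @Override
    public void close() {
        arena.close();
    }
}
```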

By itself, the PR is just a refactoring to extend the Dataset implementation and reorganize the implementation classes; its real usefulness will come from the PRs mentioned above that build on it (in fact, this PR has been extracted from #1105).
The implementation class hierarchy is designed with future extensions in mind: at the moment we have a single HostMemoryDatasetImpl, but we are already planning a corresponding DeviceMemoryDatasetImpl that will wrap and manage views on GPU memory, to avoid (in some cases) extra copies of data from GPU memory to CPU memory just to process it or forward it to another algorithm (e.g. quantization followed by indexing); a rough sketch of that split is below.
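As a rough sketch of that split (the class names here are placeholders for this discussion, not the PR's actual implementation classes, and the raw device pointer is an assumption), the host implementation owns its off-heap allocation while a device implementation only holds a view over memory that already lives on the GPU:

```java
// Placeholder sketch of the host/device implementation split described above;
// not the PR's actual classes.
import java.lang.foreign.MemorySegment;

sealed interface MatrixImpl permits HostMemoryMatrixImpl, DeviceMemoryMatrixImpl { }

// Owns a host-side (off-heap) allocation that can be copied to/from on-heap arrays.
record HostMemoryMatrixImpl(MemorySegment hostSegment, long rows, long cols)
        implements MatrixImpl { }

// Holds only a view (an opaque device pointer, assumed here) over memory that is
// already on the GPU, so results can be forwarded to another algorithm
// (e.g. quantization followed by indexing) without a round trip through host memory.
record DeviceMemoryMatrixImpl(long devicePointer, long rows, long cols)
        implements MatrixImpl { }
```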

Future work will also include adding support for (and refactoring of) allocating and managing GPU memory and DLTensors (e.g. working better with, or refactoring, prepareTensor).


copy-pr-bot bot commented Jul 14, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cjnolet moved this to In Progress in Elasticsearch + cuVS Team on Jul 15, 2025
@mythrocks added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels on Jul 15, 2025
@ldematte
Contributor Author

As @chatman pointed out in #1105, this needs a better name, as it would be neither a "Dataset" nor a "Graph" in the lexicon of cuVS.
@mythrocks what about CuVSMatrix or CuVSArray2d?

@cjnolet
Member

cjnolet commented Jul 16, 2025

Linking cjnolet/nv_elastic#22

@mythrocks
Contributor

@mythrocks what about CuVSMatrix or CuVSArray2d?

Sorry for the delay. (It has been a packed day today.) I like the idea of calling this CuVSMatrix. 👍

There might be value in differentiating whether the matrix is in __host__ or __device__ memory. Separate types: CuVSDeviceMatrix vs CuVSHostMatrix?

@cjnolet
Member

cjnolet commented Jul 17, 2025

There might be value in differentiating whether the matrix is in host or device memory. Separate types: CuVSDeviceMatrix vs CuVSHostMatrix?

+1 for this. We do this in all of our other APIs (Python, C++) and it really helps us keep them clean and relatively self-documenting.

@ldematte
Contributor Author

I like the idea of separate types; I was already planning to do this at the implementation level (e.g. HostMemoryDatasetImpl). I can have intermediate CPU/GPU interfaces, but I'd like to keep a common ancestor. So we'll have CuVSMatrix, which can be either, and the sub-interfaces CuVSDeviceMatrix and CuVSHostMatrix.
Why the common type? So that when we want to be specific, we can be; and when we don't, we can pass the more generic type to any cuVS function and choose the path to take "dynamically", like we do in the C API: if (isGPU(matrix)) { do(); } else { copyToGPUMemory(); do(); } (very much pseudo-code; see the rough sketch below).
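In very rough Java (every name here, including copyToDevice and launch, is a hypothetical placeholder for this proposal, not existing cuvs-java API), the hierarchy and the dynamic path choice could look like:

```java
// Hypothetical sketch of the proposed CuVSMatrix hierarchy and the
// "choose the path dynamically" idea; none of these names are real cuvs-java API.
interface CuVSMatrix {
    long rows();
    long cols();
}

interface CuVSHostMatrix extends CuVSMatrix { }

interface CuVSDeviceMatrix extends CuVSMatrix { }

final class MatrixOps {
    // A cuVS-style function accepts the common type and branches on the concrete
    // kind, copying to device memory only when needed.
    static void runOnDevice(CuVSMatrix matrix) {
        if (matrix instanceof CuVSDeviceMatrix deviceMatrix) {
            launch(deviceMatrix);                 // already on the GPU: no copy
        } else if (matrix instanceof CuVSHostMatrix hostMatrix) {
            launch(copyToDevice(hostMatrix));     // copy to device first, then run
        } else {
            throw new IllegalArgumentException("Unknown matrix kind: " + matrix);
        }
    }

    private static CuVSDeviceMatrix copyToDevice(CuVSHostMatrix host) {
        throw new UnsupportedOperationException("placeholder for a host-to-device copy");
    }

    private static void launch(CuVSDeviceMatrix device) {
        // placeholder for invoking the native algorithm on device memory
    }
}
```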

WDYT?

ldematte added 6 commits July 17, 2025 12:05
…ed-dataset

# Conflicts:
#	java/cuvs-java/src/main/java/com/nvidia/cuvs/Dataset.java
#	java/cuvs-java/src/main/java/com/nvidia/cuvs/spi/CuVSProvider.java
#	java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/BruteForceIndexImpl.java
#	java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CagraIndexImpl.java
#	java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/HnswIndexImpl.java
#	java/cuvs-java/src/test/java/com/nvidia/cuvs/CagraBuildAndSearchIT.java
…ed-dataset

# Conflicts:
#	java/cuvs-java/src/main/java22/com/nvidia/cuvs/spi/JDKProvider.java