Bench improvements with config files #462

Merged
marianotepper merged 30 commits into main from bench-improvements
May 23, 2025

@marianotepper (Collaborator) commented Apr 30, 2025

This PR introduces a few ease-of-use tools in jvector-examples. They mainly enable codeless benchmarking, with hyperparameters set through config files.

  1. Bench now loads the list of available datasets from a YAML file, jvector-examples/yaml-configs/datasets.yml. Entries in datasets.yml can be commented out to run only a subset, for example. Additionally, Bench now accepts a list of overquery values for each topK value.
  2. It adds BenchYAML, which reads config files that specify JVector hyperparameters in YAML format. For details on the format of these config files, see below.
  3. It adds HelloVectorWorld, a single, clean, and simple example that uses ada002-100k. This example does not do a grid search over the hyperparameters, but it does try a few overquery values for topK=10 and topK=100. The corresponding config file is jvector-examples/yaml-configs/ada002-100k.yml.
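
As a rough illustration of how a list of overquery values per topK might be consumed, the sketch below expands each (topK, overquery) pair into a candidate count. It assumes that overquery acts as a multiplier on topK, giving the number of candidates fetched before reranking; that interpretation, the class name `OverquerySketch`, its methods `candidateCount` and `expand`, and the values in `main` are all illustrative assumptions and are not taken from jvector-examples.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of jvector-examples.
final class OverquerySketch {

    // Assumption: overquery is a multiplier on topK, rounded up to a whole candidate.
    static int candidateCount(int topK, double overquery) {
        if (topK <= 0 || overquery < 1.0) {
            throw new IllegalArgumentException("topK must be positive and overquery >= 1.0");
        }
        return (int) Math.ceil(topK * overquery);
    }

    // Expands a topK -> overquery-list mapping into one candidate count per pair,
    // preserving the order in which the topK values were given.
    static Map<Integer, List<Integer>> expand(Map<Integer, List<Double>> topKOverquery) {
        Map<Integer, List<Integer>> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, List<Double>> entry : topKOverquery.entrySet()) {
            List<Integer> counts = new ArrayList<>();
            for (double overquery : entry.getValue()) {
                counts.add(candidateCount(entry.getKey(), overquery));
            }
            result.put(entry.getKey(), counts);
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative values: two topK settings, each with its own overquery list.
        Map<Integer, List<Double>> grid = new LinkedHashMap<>();
        grid.put(10, List.of(1.0, 2.0, 5.0, 10.0));
        grid.put(100, List.of(1.0, 2.0));
        System.out.println(expand(grid));
    }
}
```

Keeping the overquery list per topK, rather than one global list, lets a small topK be probed with aggressive overquery while a large topK stays cheap.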

Here's an example YAML file showing which hyperparameters can be specified and how:

configVersion: 4 # do not change this number unless you know what you are doing

dataset: ada002-100k # needs to match one of the names in jvector-examples/yaml-configs/datasets.yml

construction:
  outDegree: [32, 48, 64, 96, 128]
  efConstruction: [60, 80, 100, 120, 160, 200, 400, 600, 800]
  neighborOverflow: [1.2f, 2.0f]
  addHierarchy: [No, Yes]
  compression:
    - type: None
    - type: PQ
      parameters:
        m: 192 # we can specify either the integer m or the integer mFactor. With mFactor, m is set to the data dimensionality divided by mFactor
        # mFactor: 8
        k: 256 # optional parameter. By default, k=256
        centerData: No # optional parameter. By default, centerData=Yes if the dataset similarity function is Euclidean
        anisotropicThreshold: -1.0 # optional parameter. By default, anisotropicThreshold=-1 (i.e., no anisotropy)
    - type: PQ
      parameters: # here, the default values of k, centerData, and anisotropicThreshold are used
        mFactor: 2
  reranking:
    - FP
    - NVQ
  useSavedIndexIfExists: Yes

search:
  topKOverquery:
    # the value of topK followed by a list with the overquery rates we want to cover for that topK
    10: [1.0, 2.0, 5.0, 10.0]
    100: [1.0, 2.0]
  useSearchPruning: [No, Yes]
  compression:
    - type: None
    - type: PQ
      parameters:
        m: 192 # we can either specify the integer m or the integer mFactor.
        k: 256 # optional parameter. By default, k=256
        centerData: No # optional parameter. By default, centerData=Yes if the dataset similarity function is Euclidean
        anisotropicThreshold: -1.0 # optional parameter. By default, anisotropicThreshold=-1 (i.e., no anisotropy)
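
The m versus mFactor rule in the PQ comments can be sketched as follows. This is a minimal illustration, not the PR's code: the class `PqSubspaceSketch` and method `resolveM` are hypothetical, rejecting configs that set both or neither parameter is an assumption, and so is the use of truncating integer division. The example in `main` assumes 1536-dimensional vectors for ada002-100k, in which case mFactor: 8 yields the same m: 192 used in the config.

```java
import java.util.OptionalInt;

// Hypothetical helper for illustration only; not part of jvector-examples.
final class PqSubspaceSketch {

    // Resolves the number of PQ subspaces from either m or mFactor.
    // Assumption: exactly one of the two must be provided.
    static int resolveM(int dimension, OptionalInt m, OptionalInt mFactor) {
        if (m.isPresent() == mFactor.isPresent()) {
            throw new IllegalArgumentException("specify exactly one of m or mFactor");
        }
        if (m.isPresent()) {
            return m.getAsInt();
        }
        int factor = mFactor.getAsInt();
        if (factor <= 0) {
            throw new IllegalArgumentException("mFactor must be positive");
        }
        // m = data dimensionality divided by mFactor (assumed to truncate if not divisible).
        return dimension / factor;
    }

    public static void main(String[] args) {
        int dimension = 1536; // assumed dimensionality of the ada002-100k vectors
        System.out.println(resolveM(dimension, OptionalInt.of(192), OptionalInt.empty()));
        System.out.println(resolveM(dimension, OptionalInt.empty(), OptionalInt.of(8)));
        System.out.println(resolveM(dimension, OptionalInt.empty(), OptionalInt.of(2)));
    }
}
```

Expressing the setting as mFactor keeps one config reusable across datasets of different dimensionality, whereas a fixed m ties it to a specific dataset.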

@marianotepper changed the title from "Bench improvements" to "Bench improvements with config files" May 1, 2025
# Conflicts:
#	rat-excludes.txt
@marianotepper marianotepper marked this pull request as ready for review May 22, 2025 22:14
@marianotepper marianotepper requested a review from tlwillke May 22, 2025 22:14
@tlwillke (Collaborator) left a comment

Looks good.

@marianotepper marianotepper merged commit 7353ad9 into main May 23, 2025
8 checks passed
@marianotepper marianotepper deleted the bench-improvements branch May 23, 2025 14:09