
Optimization options for source storage for particular fields #6356

Open
@jmazanec15

Description


Currently, the k-NN plugin introduces a data type called knn_vector. For the purposes of this issue, this data type lets users define fixed-dimensional arrays of floating point numbers (e.g. [1.2,3.2,4.2,5.2]). On disk, we have a codec that serializes the vectors as binary doc values, so 1 million 128-dimensional vectors consume 1,000,000 * 128 * 4 = 512,000,000 bytes ~= 488 MB.
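
For reference, a minimal sketch of the binary packing implied by that arithmetic (not the plugin's actual codec code): each dimension costs Float.BYTES = 4 bytes, which is where the 512,000,000-byte figure comes from.

```java
import java.nio.ByteBuffer;

public class VectorSerialization {
    // Illustrative only: pack a float vector into its raw binary form,
    // 4 bytes per dimension, matching the doc values footprint above.
    static byte[] toBytes(float[] vector) {
        ByteBuffer buffer = ByteBuffer.allocate(vector.length * Float.BYTES);
        for (float v : vector) {
            buffer.putFloat(v);
        }
        return buffer.array();
    }

    public static void main(String[] args) {
        float[] vector = new float[128];
        System.out.println(toBytes(vector).length); // 512 bytes per 128-dim vector
    }
}
```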

The problem is that the vectors also get stored in the source field. For 10K vectors with dimension=100, the file storage breaks down as follows.

With the BEST_SPEED codec:

| Component | Size |
| --- | --- |
| Total index size | 24.3 MB |
| HNSW files | 5.91 MB |
| Doc values | 3.8 MB |
| Source | 14.6 MB |

With the BEST_COMPRESSION codec:

| Component | Size |
| --- | --- |
| Total index size | 18.3 MB |
| HNSW files | 5.91 MB |
| Doc values | 3.75 MB |
| Source | 8.64 MB |

As you can see, in both cases the source takes up significantly more space than the doc values.

Part of the problem is that a floating point number represented as a string averages around 16 characters, so the total string storage size for the vectors in the example above would be 1,000,000 * 128 * 16 = 2,048,000,000 bytes ~= 1953 MB, not including additional characters like commas and spaces. I understand that this will be compressed, but as the tables above show, the source field remains very large.
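
To make the text-vs-binary gap concrete, here is a small, self-contained illustration (not plugin code) that measures the uncompressed JSON text size of one random vector against its raw binary size:

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.StringJoiner;

public class SourceSizeComparison {
    public static void main(String[] args) {
        Random random = new Random(42);
        float[] vector = new float[128];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = random.nextFloat();
        }

        // Text form, roughly as the vector appears in _source JSON
        StringJoiner json = new StringJoiner(",", "[", "]");
        for (float v : vector) {
            json.add(Float.toString(v));
        }
        int textBytes = json.toString().getBytes(StandardCharsets.UTF_8).length;

        // Raw binary form: 4 bytes per dimension
        int binaryBytes = vector.length * Float.BYTES;

        System.out.println("text bytes:   " + textBytes);
        System.out.println("binary bytes: " + binaryBytes);
    }
}
```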

I'm wondering if it would be possible to optimize the stored field representation at the field level. I am aware of SourceMapper, where we are able to filter the source based on fields. I'm wondering if it would be feasible to hook in there and modify the representation for certain types before it is added as a stored field.
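
To illustrate the idea, here is a purely hypothetical sketch (the class and hook are made up, this is not an existing OpenSearch extension point): before the source is serialized, a rewriter replaces a knn_vector field's number array with a base64 string of the packed floats. Reading _source back would of course require the inverse transform.

```java
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: assumes a hook that lets a field type rewrite
// its value in the source map before the source is serialized and stored.
public class CompactVectorSourceRewriter {

    // Replace a float-array vector field with a base64 string of its packed bytes.
    @SuppressWarnings("unchecked")
    static void rewriteVectorField(Map<String, Object> sourceMap, String fieldName) {
        Object value = sourceMap.get(fieldName);
        if (!(value instanceof List)) {
            return;
        }
        List<Number> vector = (List<Number>) value;
        ByteBuffer buffer = ByteBuffer.allocate(vector.size() * Float.BYTES);
        for (Number n : vector) {
            buffer.putFloat(n.floatValue());
        }
        // base64 costs ~4/3 of the binary size, still well below the decimal text form
        sourceMap.put(fieldName, Base64.getEncoder().encodeToString(buffer.array()));
    }
}
```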
