Description
Currently, for the k-NN plugin, we introduce a data type called `knn_vector`. For the purposes of this issue, this data type allows users to define fixed-dimensional arrays of floating-point numbers (e.g. `[1.2, 3.2, 4.2, 5.2]`). On disk, we have a codec that serializes the vectors as binary doc values, so 1 million 128-dimensional vectors would consume 1,000,000 * 128 * 4 = 512,000,000 bytes ~= 488 MB.
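As a quick sanity check on that math, here is a small standalone Python sketch (nothing plugin-specific) that packs a vector at 4 bytes per dimension and scales the total up to the 1M-vector example:

```python
import struct

# A 128-dimensional vector packed as little-endian 32-bit floats
# takes exactly 128 * 4 = 512 bytes per document.
dimension = 128
vector = [0.1] * dimension
packed = struct.pack(f"<{dimension}f", *vector)
assert len(packed) == dimension * 4

# Scaled up to the 1M-vector example above.
num_vectors = 1_000_000
total_bytes = num_vectors * dimension * 4
print(f"{total_bytes:,} bytes ~= {total_bytes / 1024 ** 2:.0f} MB")  # ~488 MB
```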
The problem is that the vectors also get stored in the source field. So, the file storage breaks down like this for 10K vectors with dimension=100, using the BEST_SPEED codec:
| Component | Size |
|---|---|
| Total index size | 24.3 MB |
| HNSW files | 5.91 MB |
| Doc values | 3.8 MB |
| Source | 14.6 MB |
With the BEST_COMPRESSION codec:
| Component | Size |
|---|---|
| Total index size | 18.3 MB |
| HNSW files | 5.91 MB |
| Doc values | 3.75 MB |
| Source | 8.64 MB |
As you can see, in both cases the source takes up significantly more space than the doc values.
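For anyone who wants to reproduce a comparison like this, a setup along these lines should work. This is only a sketch: it assumes a local cluster with the k-NN plugin installed, and the index name, field name, and endpoint are placeholders.

```python
import json
import random
import requests

BASE = "http://localhost:9200"   # placeholder: local cluster with the k-NN plugin
INDEX = "knn-source-size-test"   # hypothetical index name
DIMENSION = 100
NUM_DOCS = 10_000

# Create the index. "index.codec": "best_compression" switches Lucene's stored
# fields from BEST_SPEED (the default) to BEST_COMPRESSION.
requests.put(f"{BASE}/{INDEX}", json={
    "settings": {
        "index.knn": True,
        "index.codec": "best_compression",  # drop this line for the BEST_SPEED run
    },
    "mappings": {
        "properties": {
            "my_vector": {"type": "knn_vector", "dimension": DIMENSION}
        }
    },
})

# Bulk-index random vectors.
lines = []
for i in range(NUM_DOCS):
    lines.append(json.dumps({"index": {"_index": INDEX, "_id": str(i)}}))
    lines.append(json.dumps({"my_vector": [random.random() for _ in range(DIMENSION)]}))
requests.post(f"{BASE}/_bulk", data="\n".join(lines) + "\n",
              headers={"Content-Type": "application/x-ndjson"})

# Refresh, merge down to one segment, and read the total store size. The per-file
# breakdown (HNSW vs. doc values vs. stored fields) comes from inspecting the
# segment files in the data directory.
requests.post(f"{BASE}/{INDEX}/_refresh")
requests.post(f"{BASE}/{INDEX}/_forcemerge?max_num_segments=1")
stats = requests.get(f"{BASE}/{INDEX}/_stats/store").json()
print(stats["_all"]["primaries"]["store"]["size_in_bytes"])
```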
Part of the problem is that a floating-point number represented as a string averages around 16 characters, so the total string storage for the vectors in the earlier example would be roughly 1,000,000 * 128 * 16 = 2,048,000,000 bytes ~= 1953 MB, not including additional characters like commas and spaces. I understand that this will be compressed, but as the tables above show, the source field is still very large.
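To make that concrete, here is a small standalone Python sketch (random data standing in for real vectors) comparing the JSON text size of a single vector with its packed binary size:

```python
import json
import random
import struct

DIMENSION = 128
vector = [random.uniform(-1, 1) for _ in range(DIMENSION)]

# Size of the vector as it appears in _source (a JSON array of decimal strings)
# versus the 4-bytes-per-dimension binary doc-values encoding.
json_bytes = len(json.dumps(vector).encode("utf-8"))
binary_bytes = len(struct.pack(f"<{DIMENSION}f", *vector))

print(f"JSON text: {json_bytes} bytes ({json_bytes / DIMENSION:.1f} bytes/dim)")
print(f"Binary:    {binary_bytes} bytes ({binary_bytes / DIMENSION:.1f} bytes/dim)")
# With full-precision random floats the JSON form is typically ~5x larger than
# the binary form, before stored-field compression is applied.
```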
I'm wondering if it would be possible to optimize the stored field representation at the field level. I am aware of the SourceMapper here, where we are able to filter the source based on fields. Would it be feasible to hook in there and modify the representation of certain types before adding them as a stored field?
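To illustrate the kind of per-field transformation I have in mind, here is a purely conceptual Python sketch (not the plugin's SourceMapper API; the helper names are made up) that swaps the vector field in a source document for a base64 string of packed floats and back:

```python
import base64
import json
import random
import struct

def compact_vector_source(source: dict, field: str) -> dict:
    """Replace a float-array field with a base64 string of packed 32-bit floats."""
    vector = source[field]
    packed = struct.pack(f"<{len(vector)}f", *vector)
    return {**source, field: base64.b64encode(packed).decode("ascii")}

def restore_vector_source(source: dict, field: str) -> dict:
    """Inverse transform: decode the base64 string back into a float array."""
    raw = base64.b64decode(source[field])
    return {**source, field: list(struct.unpack(f"<{len(raw) // 4}f", raw))}

doc = {"title": "example", "my_vector": [random.uniform(-1, 1) for _ in range(128)]}
compact = compact_vector_source(doc, "my_vector")
print(len(json.dumps(doc)), "->", len(json.dumps(compact)))
# Roughly a 4x reduction in the vector's stored text for full-precision floats;
# values round-trip at float32 precision, which is what the doc-values codec
# keeps anyway.
restored = restore_vector_source(compact, "my_vector")
```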