Add optional AVX512-FP16 arithmetic for the scalar quantizer. #4225

Open · wants to merge 10 commits into main

Conversation

mulugetam (Contributor) commented Mar 6, 2025

PR #4025 introduced a new architecture mode, avx512_spr, which enables the use of features available since Intel® Sapphire Rapids. The Hamming Distance Optimization (PR #4020), based on this mode, is now used by OpenSearch to speed up the indexing and searching of binary vectors.

This PR adds support for AVX512-FP16 arithmetic for the Scalar Quantizer. It introduces a new Boolean flag, ENABLE_AVX512_FP16, which, when used together with the avx512_spr mode, explicitly enables avx512fp16 arithmetic.
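
For reference, a typical configure and build step with the new option might look like this (the build directory and other settings are illustrative, not part of this PR):

    $ cmake -B build -DCMAKE_BUILD_TYPE=Release \
          -DFAISS_OPT_LEVEL=avx512_spr -DFAISS_ENABLE_AVX512_FP16=ON .
    $ cmake --build build -j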

Tests on an AWS r7i instance demonstrate up to a 1.6x speedup in execution time when using AVX512-FP16 compared to AVX512. The improvement comes from a reduction in instruction path length.


-DFAISS_OPT_LEVEL=avx512:

$ numactl -C 2 ./build/perf_tests/bench_scalar_quantizer_distance --d=768 --n=2000 --iterations=20
----------------------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------
QT_4bit/iterations:20                377313128 ns    377307894 ns           20 code_size=384
QT_4bit_uniform/iterations:20        375116351 ns    375113141 ns           20 code_size=384
QT_6bit/iterations:20                313387520 ns    313382880 ns           20 code_size=576
QT_8bit/iterations:20                256168739 ns    256166690 ns           20 code_size=768
QT_8bit_direct/iterations:20          86934297 ns     86933750 ns           20 code_size=768
QT_8bit_direct_signed/iterations:20  182414307 ns    182413034 ns           20 code_size=768
QT_8bit_uniform/iterations:20        199623822 ns    199621677 ns           20 code_size=768
QT_bf16/iterations:20                205998126 ns    205996720 ns           20 code_size=1.536k
QT_fp16/iterations:20                204291381 ns    204290326 ns           20 code_size=1.536k

$ numactl -C 2 ./build/perf_tests/bench_scalar_quantizer_accuracy --d=768 --n=2000 --iterations=20
----------------------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------
QT_4bit/iterations:20                    0.000 ns        0.000 ns            0 code_size=384 code_size_two=768k ndiff_for_idempotence=0 sql2_recons_error=0.284829
QT_4bit_uniform/iterations:20            0.000 ns        0.000 ns            0 code_size=384 code_size_two=768k ndiff_for_idempotence=0 sql2_recons_error=0.2845
QT_6bit/iterations:20                    0.000 ns        0.000 ns            0 code_size=576 code_size_two=1.152M ndiff_for_idempotence=0 sql2_recons_error=0.0164574
QT_8bit/iterations:20                    0.000 ns        0.000 ns            0 code_size=768 code_size_two=1.536M ndiff_for_idempotence=23.184k sql2_recons_error=1.32533m
QT_8bit_direct/iterations:20             0.000 ns        0.000 ns            0 code_size=768 code_size_two=1.536M ndiff_for_idempotence=0 sql2_recons_error=255.806
QT_8bit_direct_signed/iterations:20      0.000 ns        0.000 ns            0 code_size=768 code_size_two=1.536M ndiff_for_idempotence=0 sql2_recons_error=255.799
QT_8bit_uniform/iterations:20            0.000 ns        0.000 ns            0 code_size=768 code_size_two=1.536M ndiff_for_idempotence=0 sql2_recons_error=983.725u
QT_bf16/iterations:20                    0.000 ns        0.000 ns            0 code_size=1.536k code_size_two=3.072M ndiff_for_idempotence=0 sql2_recons_error=558.005u
QT_fp16/iterations:20                    0.000 ns        0.000 ns            0 code_size=1.536k code_size_two=3.072M ndiff_for_idempotence=0 sql2_recons_error=8.71347u

-DFAISS_OPT_LEVEL=avx512_spr -DFAISS_ENABLE_AVX512_FP16=ON:

$ numactl -C 2 ./build/perf_tests/bench_scalar_quantizer_distance --d=768 --n=2000 --iterations=20
----------------------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------
QT_4bit/iterations:20                235309050 ns    235307645 ns           20 code_size=384
QT_4bit_uniform/iterations:20        232136257 ns    232133724 ns           20 code_size=384
QT_6bit/iterations:20                285525339 ns    285522795 ns           20 code_size=576
QT_8bit/iterations:20                194723922 ns    194722583 ns           20 code_size=768
QT_8bit_direct/iterations:20          92415395 ns     92415183 ns           20 code_size=768
QT_8bit_direct_signed/iterations:20  166470157 ns    166469954 ns           20 code_size=768
QT_8bit_uniform/iterations:20        182553814 ns    182547301 ns           20 code_size=768
QT_bf16/iterations:20                222188800 ns    222187274 ns           20 code_size=1.536k
QT_fp16/iterations:20                198899549 ns    198898266 ns           20 code_size=1.536k

$ numactl -C 2 ./build/perf_tests/bench_scalar_quantizer_accuracy --d=768 --n=2000 --iterations=20
----------------------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------
QT_4bit/iterations:20                    0.000 ns        0.000 ns            0 code_size=384 code_size_two=768k ndiff_for_idempotence=0 sql2_recons_error=0.284803
QT_4bit_uniform/iterations:20            0.000 ns        0.000 ns            0 code_size=384 code_size_two=768k ndiff_for_idempotence=0 sql2_recons_error=0.284519
QT_6bit/iterations:20                    0.000 ns        0.000 ns            0 code_size=576 code_size_two=1.152M ndiff_for_idempotence=0 sql2_recons_error=0.0164255
QT_8bit/iterations:20                    0.000 ns        0.000 ns            0 code_size=768 code_size_two=1.536M ndiff_for_idempotence=27.191k sql2_recons_error=1.3546m
QT_8bit_direct/iterations:20             0.000 ns        0.000 ns            0 code_size=768 code_size_two=1.536M ndiff_for_idempotence=0 sql2_recons_error=255.806
QT_8bit_direct_signed/iterations:20      0.000 ns        0.000 ns            0 code_size=768 code_size_two=1.536M ndiff_for_idempotence=0 sql2_recons_error=255.799
QT_8bit_uniform/iterations:20            0.000 ns        0.000 ns            0 code_size=768 code_size_two=1.536M ndiff_for_idempotence=0 sql2_recons_error=1.0097m
QT_bf16/iterations:20                    0.000 ns        0.000 ns            0 code_size=1.536k code_size_two=3.072M ndiff_for_idempotence=0 sql2_recons_error=558.005u
QT_fp16/iterations:20                    0.000 ns        0.000 ns            0 code_size=1.536k code_size_two=3.072M ndiff_for_idempotence=0 sql2_recons_error=8.71347u

alexanderguzhva (Contributor)

@mulugetam do I get it right that this PR introduces an optional tradeoff between accuracy and speed?

mulugetam (Contributor, author)

> @mulugetam do I get it right that this PR introduces an optional tradeoff between accuracy and speed?

Yes.

mdouze (Contributor) commented Mar 7, 2025

Where does the loss in accuracy come from? Is it because the computation is performed as fp16 * fp16, instead of converting fp16 -> fp32 and then computing fp32 * fp32?

alexanderguzhva (Contributor)

Yes, pure fp16 FMA (fused multiply-add) operations.
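
To make the tradeoff concrete, here is a minimal scalar illustration (not code from this PR; it assumes a compiler and target that support _Float16, e.g. gcc/clang with -march=sapphirerapids):

    #include <cstdio>

    int main() {
        _Float16 a = (_Float16)0.1f;
        _Float16 b = (_Float16)0.2f;

        // FP16 path (what avx512_spr + ENABLE_AVX512_FP16 enables, vectorized):
        // the multiply-add stays in fp16, so each intermediate product is
        // rounded to fp16's 10-bit stored significand.
        _Float16 acc_h = (_Float16)0.0f;
        acc_h += a * b;

        // Default path: fp16 codes are widened to fp32 before the multiply-add,
        // so only the stored values lose precision, not the arithmetic.
        float acc_s = 0.0f;
        acc_s += (float)a * (float)b;

        std::printf("fp16 accumulate: %.8f\nfp32 accumulate: %.8f\n",
                    (float)acc_h, acc_s);
        return 0;
    }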

mulugetam (Contributor, author)

@alexanderguzhva I'm not sure why this PR fails some of the pytest tests in tests/test_*.py. On my local machine, the current main branch also fails some of these tests. Any ideas?

alexanderguzhva (Contributor)

@mulugetam well, in my experience that's the saddest part of submitting PRs. Can you still reproduce the problem if you rebase your PR on top of the current head? Otherwise I have no explanation: it could be different hardware on the CI machine, a different compiler, or maybe your code.

mulugetam (Contributor, author)

@alexanderguzhva Yes, it's reproducible. If I git clone https://github.com/facebookresearch/faiss.git and run pytest tests/test_*.py on my machine, some of the tests fail.

mulugetam (Contributor, author)

@mengdilin @alexanderguzhva Would you please review the code changes?

mdouze (Contributor) commented Apr 1, 2025

The ScalarQuantizer implementation is already quite complicated, and this diff complicates it further, with a tradeoff that has to be decided at compilation time.
We are thinking about how to improve the structure of the library to support several architectures and dispatch among them at runtime, so we will probably put this PR on hold until we have converged on that topic.

mdouze (Contributor) left a comment:


The objectives of the changes requested here are:

  1. avoid carrying around the T (accumulator type) everywhere
  2. get closer to a state where we could switch between float32 and float16 computations at runtime

@@ -58,7 +58,12 @@ struct ScalarQuantizer : Quantizer {
size_t bits = 0;

/// trained values (including the range)
#if defined(ENABLE_AVX512_FP16) && defined(__AVX512FP16__) && \
defined(__FLT16_MANT_DIG__)
std::vector<_Float16> trained;
Contributor:

Would it have a performance impact if we don't touch the externally visible fields? I.e., convert the trained data to float16 on the fly.

Contributor (author):

I'll undo my changes and then check the performance with on-the-fly float16 conversion.

Contributor (author):

If we don't have a std::vector<_Float16> trained for AVX-512-FP16, it'll hurt performance quite a bit in the non-uniform case.

With std::vector<_Float16> trained:

    FAISS_ALWAYS_INLINE __m512h
    reconstruct_32_components(const uint8_t* code, int i) const {
        // vdiff/vmin are already stored as _Float16: one fp16 load each
        // and a single fp16 FMA per 32 components.
        __m512h xi = Codec::decode_32_components(code, i);
        return _mm512_fmadd_ph(
                xi,
                _mm512_loadu_ph(this->vdiff + i),
                _mm512_loadu_ph(this->vmin + i));
    }

With std::vector<float> trained (converting to fp16 on the fly):

    FAISS_ALWAYS_INLINE __m512h
    reconstruct_32_components(const uint8_t* code, int i) const {
        __m512h xi = Codec::decode_32_components(code, i);

        // Load 32 float32 values as two __m512 halves.
        __m512 vdiff_lo = _mm512_loadu_ps(this->vdiff + i);
        __m512 vdiff_hi = _mm512_loadu_ps(this->vdiff + i + 16);
        __m512 vmin_lo  = _mm512_loadu_ps(this->vmin + i);
        __m512 vmin_hi  = _mm512_loadu_ps(this->vmin + i + 16);

        // Convert each half to packed fp16 (16 x 16-bit lanes in a __m256i).
        __m256i vdiff_h_lo = _mm512_cvtps_ph(vdiff_lo, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
        __m256i vdiff_h_hi = _mm512_cvtps_ph(vdiff_hi, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
        __m256i vmin_h_lo  = _mm512_cvtps_ph(vmin_lo,  _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
        __m256i vmin_h_hi  = _mm512_cvtps_ph(vmin_hi,  _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);

        // Pack the two halves back into full 512-bit fp16 registers.
        __m512i vdiff_i = _mm512_inserti64x4(_mm512_castsi256_si512(vdiff_h_lo), vdiff_h_hi, 1);
        __m512i vmin_i  = _mm512_inserti64x4(_mm512_castsi256_si512(vmin_h_lo), vmin_h_hi, 1);

        __m512h vdiff_h = _mm512_castsi512_ph(vdiff_i);
        __m512h vmin_h  = _mm512_castsi512_ph(vmin_i);

        return _mm512_fmadd_ph(xi, vdiff_h, vmin_h);
    }

Contributor:

Thanks, the tests are convincing.
It would be better to have a void* pointer to a class that contains the _Float16* data. It would be synced with the reference std::vector<float> trained after training and index loading (and deallocated at destruction time).
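
A minimal sketch of one way that could look (type and member names here are hypothetical, not actual FAISS API):

    #include <vector>

    // Opaque cache holding a _Float16 mirror of the trained values; rebuilt
    // after training / index loading and freed in the destructor, while the
    // externally visible std::vector<float> stays unchanged.
    struct TrainedFp16Cache {
        std::vector<_Float16> values;
    };

    struct ScalarQuantizerSketch {
        std::vector<float> trained;   // unchanged public field
        void* fp16_cache = nullptr;   // points to a TrainedFp16Cache when FP16 is enabled

        void sync_fp16_cache() {
            delete static_cast<TrainedFp16Cache*>(fp16_cache);
            auto* c = new TrainedFp16Cache;
            c->values.assign(trained.begin(), trained.end());
            fp16_cache = c;
        }

        ~ScalarQuantizerSketch() {
            delete static_cast<TrainedFp16Cache*>(fp16_cache);
        }
    };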

@@ -67,6 +67,7 @@ option(FAISS_ENABLE_PYTHON "Build Python extension." ON)
option(FAISS_ENABLE_C_API "Build C API." OFF)
option(FAISS_ENABLE_EXTRAS "Build extras like benchmarks and demos" ON)
option(FAISS_USE_LTO "Enable Link-Time optimization" OFF)
option(FAISS_ENABLE_AVX512_FP16 "Enable AVX512-FP16 arithmetic (for avx512_spr opt level)." OFF)
Contributor:

This should be a runtime option, not a compile-time option.

Contributor (author):

Thanks @mdouze. The current functions, like reconstruct_8_components and reconstruct_16_components, are picked at compile time. I'm not exactly sure how to approach what you're suggesting. Could you explain a bit more?
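
For illustration only, a generic pattern for selecting an implementation at runtime through a function pointer; the helper names are hypothetical and this is not necessarily the dispatch design intended for FAISS:

    #include <cstddef>
    #include <cstdint>

    // Two builds of the same kernel, compiled with different target options.
    float distance_fp32(const uint8_t* code, const float* query, size_t d);
    float distance_fp16(const uint8_t* code, const float* query, size_t d);

    // Hypothetical CPUID wrapper.
    bool cpu_has_avx512fp16();

    using dist_fn = float (*)(const uint8_t*, const float*, size_t);

    // Pick the kernel once, at runtime, instead of via #ifdef at compile time.
    dist_fn select_distance(bool prefer_fp16) {
        if (prefer_fp16 && cpu_has_avx512fp16()) {
            return distance_fp16;
        }
        return distance_fp32;
    }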

@@ -366,28 +443,28 @@ struct Codec6bit {

enum class QuantizerTemplateScaling { UNIFORM = 0, NON_UNIFORM = 1 };

template <class Codec, QuantizerTemplateScaling SCALING, int SIMD>
template <class T, class Codec, QuantizerTemplateScaling SCALING, int SIMD>
Contributor:

T is not specific enough. Another name?

: ScalarQuantizer::SQuantizer {
const size_t d;
const float vmin, vdiff;
const T vmin, vdiff;
Contributor:

Why does this scalar code need to be in float16? Will it be optimized to SIMD automatically?

void train_Uniform(
RangeStat rs,
float rs_arg,
idx_t n,
int k,
const float* x,
std::vector<float>& trained) {
std::vector<T>& trained) {
Contributor:

In any case, this can be converted from float32 to float16 later.

sim.begin_32();
for (size_t i = 0; i < quant.d; i += 32) {
__m512h xi = quant.reconstruct_32_components(code, i);
// print_m512h(xi);
Contributor:

!

@@ -1511,17 +1810,155 @@ struct DCTemplate<Quantizer, Similarity, 1> : SQDistanceComputer {
}
};

#if defined(USE_AVX512_FP16)

template <class T, class Quantizer, class Similarity>
Contributor:

Please define T with a using declaration inside the Quantizer and/or Similarity class so that it does not need to be declared every time.
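
A small sketch of the idea (struct names are illustrative only):

    // The element type is declared once, inside the quantizer...
    struct QuantizerFP16Sketch {
        using component_t = _Float16;
        // ... decode / reconstruct members ...
    };

    // ...and consumers recover it without an extra template parameter.
    template <class Quantizer, class Similarity, int SIMDWIDTH>
    struct DCTemplateSketch {
        using T = typename Quantizer::component_t;
        // ...
    };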

@@ -1463,23 +1763,22 @@ struct SimilarityIP<8> {
* code-to-vector or code-to-code comparisons
*******************************************************************/

template <class Quantizer, class Similarity, int SIMDWIDTH>
template <class T, class Quantizer, class Similarity, int SIMDWIDTH>
Contributor:

T could be a member type of Quantizer (i.e., Quantizer::T).

alexanderguzhva (Contributor)

I'm not sure that I vote for adding T EVERYWHERE

mdouze (Contributor) commented Apr 25, 2025

PR #4309 (currently a draft) shows how the AVX variants are going to be handled. It will avoid spaghetti code (lots of #ifdefs), which was especially a problem for ScalarQuantizer.

satymish (Contributor)

@pankajsingh88

7 participants