Add optional AVX512-FP16 arithmetic for the scalar quantizer. #4225
Conversation
@mulugetam do I understand correctly that this PR introduces an optional tradeoff between accuracy and speed?

Yes.

Where does the loss in accuracy come from? Because the computation is performed as fp16 × fp16, instead of converting fp16 → fp32 and then doing fp32 × fp32?

Yes, pure fp16 FMAD operations.
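For illustration only (not from the PR): a minimal sketch of why pure fp16 arithmetic loses accuracy, assuming a compiler with `_Float16` support (recent GCC/Clang on x86-64). fp16 keeps an 11-bit significand, so every intermediate result is rounded much more coarsely than in fp32:

```cpp
#include <cstdio>

int main() {
    // In [512, 1024) the spacing between representable fp16 values is 0.5,
    // so adding 0.25 to 1000.0 rounds back to 1000.0 in pure fp16.
    _Float16 x = (_Float16)1000.0f;
    _Float16 d = (_Float16)0.25f;
    _Float16 fp16_sum = x + d;            // rounded to 1000.0
    float fp32_sum = (float)x + (float)d; // exact: 1000.25
    printf("fp16: %f  fp32: %f\n", (float)fp16_sum, fp32_sum);
    return 0;
}
```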
@alexanderguzhva I'm not sure why this PR fails some of the CI checks.

@mulugetam well, that's the saddest part of committing PRs, in my experience. Will you be able to reproduce the problem if you try rebasing your PR on top of the head? Otherwise, I have no explanation: it could be different hardware on the CI machine, the compiler, or maybe your code.

@alexanderguzhva Yes, it's reproducible. If I …
@mengdilin @alexanderguzhva Would you please review the code changes?

The ScalarQuantizer implementation is already quite complicated. This diff complicates it further, with one tradeoff that has to be decided at compilation time.
The objectives of the changes requested here are:
- avoid carrying around the T (accumulator type) everywhere
- try to get closer to a state where we could switch between float32 and float16 computations at runtime
```diff
@@ -58,7 +58,12 @@ struct ScalarQuantizer : Quantizer {
     size_t bits = 0;

     /// trained values (including the range)
+#if defined(ENABLE_AVX512_FP16) && defined(__AVX512FP16__) && \
+        defined(__FLT16_MANT_DIG__)
+    std::vector<_Float16> trained;
```
Would it have a performance impact if we don't touch the externally visible fields? I.e., convert the trained data on-the-fly to float16.
I'll undo my changes and then check the performance with on-the-fly float16 conversion.
If we don't have a `std::vector<_Float16> trained` for AVX512-FP16, it hurts performance quite a bit in the non-uniform case.

With `std::vector<_Float16> trained`:

```cpp
FAISS_ALWAYS_INLINE __m512h
reconstruct_32_components(const uint8_t* code, int i) const {
    __m512h xi = Codec::decode_32_components(code, i);
    // vdiff and vmin are already fp16: two loads and one fused multiply-add
    return _mm512_fmadd_ph(
            xi,
            _mm512_loadu_ph(this->vdiff + i),
            _mm512_loadu_ph(this->vmin + i));
}
```
With `std::vector<float> trained`:

```cpp
FAISS_ALWAYS_INLINE __m512h
reconstruct_32_components(const uint8_t* code, int i) const {
    __m512h xi = Codec::decode_32_components(code, i);
    // load 32 fp32 values for vdiff and vmin (two 512-bit registers each)
    __m512 vdiff_lo = _mm512_loadu_ps(this->vdiff + i);
    __m512 vdiff_hi = _mm512_loadu_ps(this->vdiff + i + 16);
    __m512 vmin_lo = _mm512_loadu_ps(this->vmin + i);
    __m512 vmin_hi = _mm512_loadu_ps(this->vmin + i + 16);
    // convert fp32 -> fp16 on the fly
    __m256i vdiff_h_lo = _mm512_cvtps_ph(vdiff_lo, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    __m256i vdiff_h_hi = _mm512_cvtps_ph(vdiff_hi, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    __m256i vmin_h_lo = _mm512_cvtps_ph(vmin_lo, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    __m256i vmin_h_hi = _mm512_cvtps_ph(vmin_hi, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    // repack the two half-width results into full 512-bit fp16 registers
    __m512i vdiff_i = _mm512_inserti64x4(_mm512_castsi256_si512(vdiff_h_lo), vdiff_h_hi, 1);
    __m512i vmin_i = _mm512_inserti64x4(_mm512_castsi256_si512(vmin_h_lo), vmin_h_hi, 1);
    __m512h vdiff_h = _mm512_castsi512_ph(vdiff_i);
    __m512h vmin_h = _mm512_castsi512_ph(vmin_i);
    return _mm512_fmadd_ph(xi, vdiff_h, vmin_h);
}
```
Thanks for the tests, they are convincing.

It would be better to have a `void*` pointer to a class that contains the `_Float16*` data. It would be synced with the reference `std::vector<float> trained` after training and index loading (and deallocated at destruction time).
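A minimal sketch of this suggestion (hypothetical names, not the PR's code): the public `std::vector<float> trained` stays untouched, and an internal fp16 shadow copy is re-synced whenever the reference copy changes:

```cpp
#include <vector>

// Hypothetical holder for the fp16 mirror of the trained values;
// the ScalarQuantizer would keep an opaque pointer to one of these.
struct TrainedFP16 {
    std::vector<_Float16> data;

    // Call after train() and after loading an index from disk.
    void sync(const std::vector<float>& trained) {
        data.assign(trained.begin(), trained.end()); // fp32 -> fp16 element-wise
    }
};
```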
```diff
@@ -67,6 +67,7 @@ option(FAISS_ENABLE_PYTHON "Build Python extension." ON)
 option(FAISS_ENABLE_C_API "Build C API." OFF)
 option(FAISS_ENABLE_EXTRAS "Build extras like benchmarks and demos" ON)
 option(FAISS_USE_LTO "Enable Link-Time optimization" OFF)
+option(FAISS_ENABLE_AVX512_FP16 "Enable AVX512-FP16 arithmetic (for avx512_spr opt level)." OFF)
```
This should be a runtime option, not a compile-time option.
Thanks @mdouze. The current functions, like `reconstruct_8_components` and `reconstruct_16_components`, are picked at compile time. I'm not exactly sure how to approach what you're suggesting. Could you explain a bit more?
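For what it's worth, one common shape for such a runtime switch is sketched below (hypothetical, not Faiss's actual dispatch code): compile both variants into the binary and pick one via a CPU feature probe when the distance computer is created:

```cpp
#include <cstdint>

// Hypothetical: both paths exist in the binary; only the selection
// happens at runtime, so no recompilation is needed.
float reconstruct_fp32_path(const uint8_t*, int) { return 0.0f; } // stub, always compiled
float reconstruct_fp16_path(const uint8_t*, int) { return 0.0f; } // stub, built with -mavx512fp16

using reconstruct_fn = float (*)(const uint8_t*, int);

reconstruct_fn pick_reconstruct(bool cpu_has_avx512fp16) {
    // cpu_has_avx512fp16 would come from a CPUID-based probe.
    return cpu_has_avx512fp16 ? reconstruct_fp16_path : reconstruct_fp32_path;
}
```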
```diff
@@ -366,28 +443,28 @@ struct Codec6bit {

 enum class QuantizerTemplateScaling { UNIFORM = 0, NON_UNIFORM = 1 };

-template <class Codec, QuantizerTemplateScaling SCALING, int SIMD>
+template <class T, class Codec, QuantizerTemplateScaling SCALING, int SIMD>
```
T is not specific enough. Another name?
```diff
         : ScalarQuantizer::SQuantizer {
     const size_t d;
-    const float vmin, vdiff;
+    const T vmin, vdiff;
```
Why does this scalar code need to be in float16? Will it be optimized to SIMD automatically?
```diff
 void train_Uniform(
         RangeStat rs,
         float rs_arg,
         idx_t n,
         int k,
         const float* x,
-        std::vector<float>& trained) {
+        std::vector<T>& trained) {
```
In any case, this can be converted later from float32 to float16; see the sketch below.
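I.e., something like this (a sketch, assuming `_Float16` compiler support): training stays in float32 as today, and the result is narrowed once at the end:

```cpp
#include <vector>

// Narrow the fp32 training output to fp16 in a single pass.
std::vector<_Float16> to_fp16(const std::vector<float>& trained) {
    return std::vector<_Float16>(trained.begin(), trained.end());
}
```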
```diff
 sim.begin_32();
 for (size_t i = 0; i < quant.d; i += 32) {
     __m512h xi = quant.reconstruct_32_components(code, i);
+    // print_m512h(xi);
```
!
```diff
@@ -1511,17 +1810,155 @@ struct DCTemplate<Quantizer, Similarity, 1> : SQDistanceComputer {
     }
 };

+#if defined(USE_AVX512_FP16)
+
+template <class T, class Quantizer, class Similarity>
```
Please redefine T with a `using` inside the Quantizer and/or Similarity class, so that it does not need to be declared all the time.
```diff
@@ -1463,23 +1763,22 @@ struct SimilarityIP<8> {
  * code-to-vector or code-to-code comparisons
  *******************************************************************/

-template <class Quantizer, class Similarity, int SIMDWIDTH>
+template <class T, class Quantizer, class Similarity, int SIMDWIDTH>
```
T could be a member type of the quantizer, i.e. `Quantizer::T`.
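A sketch of what that could look like (hypothetical names): the quantizer publishes its component type as a member typedef, and downstream templates recover it instead of taking an extra parameter:

```cpp
// Hypothetical: each quantizer exposes the type it computes in.
struct QuantizerFP16Example {
    using component_t = _Float16;
};

// Downstream templates recover T from the Quantizer rather than
// declaring it as an additional template parameter.
template <class Quantizer, class Similarity, int SIMDWIDTH>
struct DCTemplateExample {
    using T = typename Quantizer::component_t;
};
```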
I'm not sure that I vote for adding …

PR #4309 shows how the AVX variants are going to be handled (it is currently a draft).
PR #4025 introduced a new architecture mode, `avx512_spr`, which enables the use of features available since Intel® Sapphire Rapids. The Hamming distance optimization (PR #4020), based on this mode, is now used by OpenSearch to speed up the indexing and searching of binary vectors.

This PR adds support for AVX512-FP16 arithmetic in the scalar quantizer. It introduces a new Boolean flag, `ENABLE_AVX512_FP16`, which, when used together with the `avx512_spr` mode, explicitly enables `avx512fp16` arithmetic.

Tests on an AWS r7i instance demonstrate up to a 1.6x speedup in execution time when using AVX512-FP16 compared to AVX512; the improvement comes from a reduction in path length. The two build configurations compared were:

- baseline: `-DFAISS_OPT_LEVEL=avx512`
- FP16: `-DFAISS_ENABLE_AVX512_FP16=ON -DFAISS_OPT_LEVEL=avx512_spr`