Simdized quantized operations #2904
Conversation
Signed-off-by: Alexandre Eichenberger <[email protected]>
@chentong319 there is currently an error; I am working to fix it. It will only be a small change.
@chentong319 I ran independent tests and the fix works. The latest commit should have a green build.
Summary of changes:

- **Elementwise:** `RoundOp` was previously expanded manually in the elementwise lowering, as a full loop over all operations. But I needed it as an operation performing on a scalar or a SIMD vector, so I pulled the implementation into `MathBuilder`, where I can call it anywhere I need to compute Round (which is an elaborate operation rounding to even whole numbers). Enabled SIMD for dequantize; the issue that prevented it was the lack of SIMD support in `MathBuilder.cast`. I had to add this for the quantize operations (which are now vectorized), so it now works here too. Delayed splatting in …
- **Reduction:** Migrated some list support into a separate file. Created a new operation. Changed the interface to know when ops need a division, using a templated approach.
- **ONNXToKrnlCommon:** For elementary, simple operations (such as Add/Sub...) that don't have a custom …
- **[Dynamic] Quantize Linear:** Added 2 functions to perform the dynamic part (compute min/max to get the scale/zero point) and perform the conversion. Simply moved the methods into new, independently callable operations (as they will also be needed elsewhere in the future). Removed the …
- **Krnl DialectBuilder:** Generate a SIMD loop for the given kernel. See the .hpp for an explanation of the scheme.
- **MLIR DialectBuilder:** Added handling of scalar/vector for … I added a new … Code was added for …
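To make the Round discussion concrete, here is a minimal scalar sketch of round-half-to-even (the rounding behavior the ONNX Round operator requires). It mirrors the kind of decomposition a builder-style lowering performs with primitive ops (floor, compare, select); the function name is hypothetical and this is an illustration, not the actual `MathBuilder` implementation.

```cpp
#include <cassert>
#include <cmath>

// Round to the nearest whole number; exact halves go to the even neighbor
// (so 2.5 -> 2, 3.5 -> 4). Written with floor/compare/select-style logic,
// the primitives a scalar-or-SIMD lowering would use.
static double roundHalfToEven(double x) {
  double floorX = std::floor(x);
  double frac = x - floorX;                 // fractional part, in [0, 1)
  if (frac > 0.5)
    return floorX + 1.0;
  if (frac < 0.5)
    return floorX;
  // Exactly halfway: pick whichever neighbor is even.
  bool floorIsEven = (std::fmod(floorX, 2.0) == 0.0);
  return floorIsEven ? floorX : floorX + 1.0;
}
```

Each branch here would become a vector compare plus a select in a SIMD lowering, so the same scheme applies unchanged to a whole vector lane-wise.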
All very good questions.
I use them interchangeably. If you feel strongly about one or the other, I can do a cleanup in a subsequent PR. Technically, vectorization does not require the use of SIMD. For example, ESSL has a vector mode where, instead of calling one "math" function at a time, it calls a long vector of them of arbitrary length, and uses a mixture of SIMD and scalar operations to execute them as fast as possible. SIMD implies the use of SIMD instructions.
There are two components in VL. One is the hardware constraint of the machine: for z, 4 floats, 8 dlfloat16, ... The LLVM backend efficiently supports arbitrary vector lengths that are multiples of the hardware constraint. Essentially, if we create an 8-wide float vector, it generates 2 SIMD instructions for each operation. That is a very good way to exploit ILP. I call this second factor the "unroll" factor, as it effectively unrolls the loop further. When presented to the loops (for blocking), the VL is the product of the hardware constraint and the "unroll" factor. In practice, I also look at the register pressure: if a kernel has very few SIMD operations, I want a larger unroll factor, and if there are lots of SIMD operations, I want a smaller unroll factor, as otherwise we may blow past the number of registers. Pressure is approximated by the number of SIMD operations. Ideally I would use a better metric, but it works well in practice so far.
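The heuristic described above can be sketched as follows. The function name and thresholds are hypothetical, chosen only to illustrate the two ideas: the effective VL is the hardware VL times an unroll factor, and the unroll factor shrinks as the kernel contains more SIMD operations (a crude proxy for register pressure).

```cpp
#include <cassert>

// Illustrative sketch: pick the total vector length handed to the
// loop-blocking code as hardwareVL * unrollFactor, where the unroll
// factor is reduced for kernels with many SIMD operations to avoid
// spilling vector registers. Thresholds are made up for illustration.
static int chooseTotalVL(int hardwareVL, int numSimdOpsInKernel) {
  int unroll;
  if (numSimdOpsInKernel <= 2)
    unroll = 8; // tiny kernel: unroll a lot to expose ILP
  else if (numSimdOpsInKernel <= 8)
    unroll = 4;
  else
    unroll = 2; // big kernel: keep register pressure low
  return hardwareVL * unroll;
}
```

With a 4-float hardware vector, a one-op kernel would be blocked at VL 32, while a kernel with a dozen SIMD ops would be blocked at VL 8.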
Ideally yes, I would love for you to integrate multiple reductions into a single kernel. Maybe with this new infrastructure for SIMD, it will be easier to do. Note that this reduction pattern reduces a whole loop to a single scalar (not an array of reductions). That pattern is currently not supported for any of the unary/binary elementwise reductions, (1) because it does not occur except for this very special quantization of a whole vector, and (2) because I would have to significantly rewrite all of the reductions to handle this quite specific pattern.
Before, the divide-by-mean was a flag on the pattern. That does not work for supporting multiple reductions, so I moved it to a template parameter that can be individually turned on for each specific op. I am interested in learning more about the tensor representation; my goal here was to introduce as few changes as possible.
Agreed. I am trying to simplify the generated code a bit, but it is not easy. On x86/arm, there are efficient horizontal/across-vector reduction instructions, for example. Z supports some of them for integers but not for floats. Thus I need a custom scheme that handles VL reductions at once, so that I may efficiently do a VL-by-VL permute pattern to fully use the SIMD operations (which still requires 4 additional permute operations that are not needed on machines with horizontal reductions). I am looking into the possibility of doing a krnl-simd-reduce, but it's a bit involved, so I needed to first generate the easier code manually and then look into abstracting it into a support function.
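For readers unfamiliar with permute-based reductions: on a machine without an across-vector float reduction, a VL-wide sum can be computed in log2(VL) shuffle-and-add steps. The scalar simulation below illustrates the scheme (it is not the onnx-mlir code, which operates on real SIMD values via permutes); `horizontalAdd` is a hypothetical name and VL is assumed to be a power of two.

```cpp
#include <array>
#include <cassert>

// Scalar simulation of a shuffle-based horizontal add: at each step the
// upper half of the vector is "permuted" down onto the lower half and
// added, halving the live width until one lane holds the full sum.
template <int VL> // VL must be a power of two
static float horizontalAdd(std::array<float, VL> v) {
  for (int half = VL / 2; half >= 1; half /= 2)
    for (int i = 0; i < half; ++i)
      v[i] += v[i + half]; // permute upper half down, then add
  return v[0];
}
```

A machine with a native horizontal-reduction instruction replaces this whole loop with one instruction, which is why the custom permute scheme costs a few extra operations on z.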
LGTM!
Thanks, I will implement your suggestions in the next PR.
Jenkins Linux s390x Build #15330 [push] Simdized quantized opera... started at 15:52
Jenkins Linux ppc64le Build #14355 [push] Simdized quantized opera... started at 15:53
Jenkins Linux amd64 Build #15325 [push] Simdized quantized opera... started at 14:52
Jenkins Linux amd64 Build #15325 [push] Simdized quantized opera... passed after 1 hr 13 min
Jenkins Linux s390x Build #15330 [push] Simdized quantized opera... passed after 1 hr 46 min
Jenkins Linux ppc64le Build #14355 [push] Simdized quantized opera... passed after 2 hr 5 min
Simdized quantized operations: DynamicQuantizeLinear, QuantizeLinear, and DequantizeLinear.

Added support for reduction to a scalar (the current scheme for our tensor-only quantization) and for the fused reduction of min and max needed for dynamic quantization, and added generic support in KrnlBuilder to generate SIMD loops.

Also added `MathBuilder` support for `clip` and `round` so that we don't need to rely on `onnx` operators to do so when lowering to Krnl.
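The "dynamic part" mentioned above (deriving scale and zero point from the min/max reduction results) follows the ONNX DynamicQuantizeLinear specification for uint8. The sketch below shows that math; the struct and function names are hypothetical, and this is an illustration of the formula, not the onnx-mlir lowering itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

struct QuantParams {
  float scale;
  uint8_t zeroPoint;
};

// Per the ONNX DynamicQuantizeLinear spec (uint8 target): the input range
// is first extended to include 0 so that real zero is exactly
// representable, then scale maps that range onto [0, 255], and the zero
// point is where real 0 lands in quantized space, rounded and clamped.
// Assumes xMin < xMax after the extension (so scale is nonzero).
static QuantParams computeQuantParams(float xMin, float xMax) {
  const float qMin = 0.0f, qMax = 255.0f;
  xMin = std::min(xMin, 0.0f); // range must include zero
  xMax = std::max(xMax, 0.0f);
  float scale = (xMax - xMin) / (qMax - qMin);
  float zp = qMin - xMin / scale; // where real 0 maps in [0, 255]
  zp = std::clamp(std::nearbyint(zp), qMin, qMax);
  return {scale, static_cast<uint8_t>(zp)};
}
```

The actual conversion then computes `saturate(round(x / scale) + zeroPoint)` per element, which is the part that benefits from the SIMD `cast`, `clip`, and `round` support added in this PR.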