
Simdized quantized operations #2904


Merged
merged 50 commits into onnx:main on Aug 13, 2024

Conversation

AlexandreEichenberger
Collaborator

@AlexandreEichenberger AlexandreEichenberger commented Aug 8, 2024

Simdized quantized operations: DynamicQuantizeLinear, QuantizedLinear, and DequantizeLinear.

Added support for reduction to a scalar (the current scheme for our tensor-only quantization), the fused reduction of min and max needed for dynamic quantization, and generic support in KrnlBuilder to generate SIMD loops.

Also added MathBuilder support for clip and round so that we don't need to rely on onnx operators to do so when lowering to Krnl.

Signed-off-by: Alexandre Eichenberger <[email protected]>
@AlexandreEichenberger
Collaborator Author

@chentong319 there is currently an error; I am working to fix it. It will only be a small change.

@AlexandreEichenberger
Collaborator Author

@chentong319 ran independent tests; the fix works. The latest commit should have a green build.

@AlexandreEichenberger
Collaborator Author

AlexandreEichenberger commented Aug 9, 2024

Summary of changes:

Elementwise:

RoundOp was previously expanded manually in the elementwise lowering, as a full loop over all elements. But I needed it as an operation that applies to a scalar or a SIMD vector. So I pulled the implementation into MathBuilder, so that I can call it anywhere I need to compute Round (which is an elaborate operation that rounds to the nearest whole number, breaking ties toward even numbers).
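The round-to-nearest-even semantics can be sketched in plain C++ as follows. This is an illustrative model, not the actual MathBuilder lowering, and the same arithmetic applies unchanged to each SIMD lane:

```cpp
#include <cmath>

// Round half to even ("banker's rounding"), the semantics ONNX Round
// requires. The function name is illustrative, not the MathBuilder API.
static double roundHalfToEven(double x) {
  double rounded = std::floor(x + 0.5); // naive round-half-up
  double diff = rounded - x;
  // On an exact .5 tie, pick the even neighbor instead.
  if (diff == 0.5 && std::fmod(rounded, 2.0) != 0.0)
    rounded -= 1.0;
  return rounded;
}
```

So 2.5 rounds to 2 and 3.5 rounds to 4. (On hardware, `std::nearbyint` in the default FE_TONEAREST mode gives the same result in a single instruction; the expansion above is what one emits when no such instruction is available for the target.)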

Enabled SIMD for dequantize. The issue that prevented it was the lack of SIMD support in MathBuilder.cast. I had to add this for the quantize operations (which are now vectorized), so it now works here too.

Delayed splatting in getPartiallyFlattenedSimdCode. Since MathBuilder does the splatting when operations have a mixture of scalar and SIMD operands, there is no need to do it here anymore.

Reduction:

Migrated some list support into a separate file.

Created a new operation, emitFullSIMDReductionFor, that performs a reduction to a single scalar (previous support only reduced to an array of reductions, not a single scalar). While at it, I also enabled the fused reduction of 2 distinct reductions, as DynamicQuantizeLinear needs both the min and the max at the same time.
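The fused min/max reduction can be modeled as below. This is a scalar sketch (not the actual emitFullSIMDReductionFor code); a SIMD version keeps one vector accumulator per reduction and finalizes each with a horizontal reduction at the end:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Fuse two reductions -- min and max -- into a single traversal, as
// DynamicQuantizeLinear needs both at once. Name is illustrative.
static std::pair<float, float> fusedMinMax(const std::vector<float> &v) {
  float mn = v[0], mx = v[0];
  for (float x : v) { // one pass updates both accumulators
    mn = std::min(mn, x);
    mx = std::max(mx, x);
  }
  return {mn, mx};
}
```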

Changed the interface to indicate, via a templated approach, when ops need a division.

ONNXToKrnlCommon

Elementary, simple operations (such as Add/Sub, ...) don't have a custom emitScalarOpFor template that uses the MathBuilder, and thus they lacked the scalar/vector expansion scheme. Added it there directly.

[Dynamic] Quantize Linear

Added 2 functions to perform the dynamic part (compute the min/max to get the scale/zero point) and to perform the conversion. Simply moved the code into new, independently callable functions (as they will also be needed elsewhere in the future). Removed the onnx.xxx operations and replaced them with math.xxx builder calls, as we now generate fused loops.
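The "dynamic part" follows the ONNX DynamicQuantizeLinear spec for uint8: extend the range to include 0 (so 0 is exactly representable), then derive scale and zero point. A minimal sketch, with illustrative names and assuming a nondegenerate input range:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct QuantParams { float scale; uint8_t zeroPoint; };

// Compute uint8 quantization parameters from the data's min/max,
// per the ONNX DynamicQuantizeLinear formulas (qmin = 0, qmax = 255).
static QuantParams computeQuantParams(float rawMin, float rawMax) {
  float mn = std::min(rawMin, 0.0f); // range must contain 0
  float mx = std::max(rawMax, 0.0f);
  float scale = (mx - mn) / 255.0f;  // (max - min) / (qmax - qmin)
  float zp = std::nearbyint(-mn / scale); // qmin - min/scale
  zp = std::clamp(zp, 0.0f, 255.0f);      // saturate to uint8 range
  return {scale, static_cast<uint8_t>(zp)};
}
```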

Krnl DialectBuilder

Generate a SIMD loop for the given kernel. See the .hpp for explanation of the scheme.

MLIR DialectBuilder

Added handling of scalar/vector for math.select and cast (that one is tricky; I left an explanation in the code, which is mostly about systematically using the proper type: elementType or the original, possibly vector, type).

I added a new computeSuitableUnrollFactor to guide SIMD. It basically looks at whether SIMD is possible for the data type, then looks at the average usage of SIMD operations and decides on an additional unroll factor given the register pressure (low pressure: more unrolling; high pressure because of many operations: less unrolling).
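The heuristic can be sketched as below. This is a hypothetical simplification (the real computeSuitableUnrollFactor is more involved): with few SIMD ops in the kernel, unroll more to expose ILP; with many, unroll less so the unrolled body does not exceed the vector register file.

```cpp
#include <algorithm>

// Pick an ILP unroll factor from approximate register pressure.
// Names and constants are illustrative, not the onnx-mlir code.
static int suitableUnrollFactor(int numSimdOpsInKernel,
                                int numVectorRegs = 32) {
  if (numSimdOpsInKernel <= 0)
    return 1; // nothing to unroll for
  // Budget: the unrolled body should roughly fit in the register file.
  int unroll = numVectorRegs / numSimdOpsInKernel;
  return std::clamp(unroll, 1, 8); // cap ILP unrolling at 8x
}
```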

Code was added for math.round and math.clip.

@chentong319
Collaborator

  1. Do the terms, vector and simd, have different meanings in the code?
  2. What's the constraint on the vector length (VL)? I am not clear about the relationship between the vector dialect and the final simd code for a particular machine.
  3. For reduction, I wonder whether loop fusion in a later pass can save us the trouble of handling multiple reductions.
  4. For divide by mean, we could represent that semantics with the tensor dialect and make our code cleaner. But that is not easy in our onnx to krnl framework.
  5. Overall, we are trying to generate the best-performing code for common patterns, at the cost of complicated lowering code.

@AlexandreEichenberger
Collaborator Author

AlexandreEichenberger commented Aug 13, 2024

All very good questions

Do the terms, vector and simd, have different meaning in the code?

I use them interchangeably. If you feel strongly about one or the other, I can do a cleanup in a subsequent PR. Technically, vectorization does not require the use of SIMD. For example, ESSL has a vector mode where, instead of calling one "math" function at a time, it calls a long vector of them of arbitrary length, and uses a mixture of SIMD and scalar operations to execute them as fast as possible. SIMD implies the use of SIMD instructions.

What's the constraint for the vector length (VL)? I am not clear about the relationship of vector dialect and the final simd code for particular machine.

There are two components to VL. One is the hardware constraint of the machine; for z: 4 floats, 8 dlfloat16, ... The LLVM backend efficiently supports arbitrary vector lengths that are multiples of the hardware length. Essentially, if we create an 8-wide float vector, it generates 2 SIMD instructions for it. That is a very good way to exploit ILP. I call this second factor the "unroll" factor, as it effectively unrolls the loop further. When presented to the loops (for blocking), the VL is the product of the hardware length and the "unroll" factor.
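In numbers, assuming z's 128-bit (16-byte) vector registers as described above:

```cpp
// Hardware VL for z-style 128-bit vector registers: 4 floats (4 bytes
// each), 8 dlfloat16 (2 bytes each). Illustrative helper names.
constexpr int hwVL(int elemBytes) { return 16 / elemBytes; }

// The VL presented to the loops is hardware VL times the "unroll" factor;
// e.g. floats with unroll 2 give an 8-wide vector, which LLVM lowers to
// 2 hardware SIMD instructions per operation.
constexpr int totalVL(int elemBytes, int unroll) {
  return hwVL(elemBytes) * unroll;
}
```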

In practice, I also look at the register pressure: if a kernel has very few SIMD operations, then I want a larger unroll factor, and if there are lots of SIMD operations, then I want a smaller one, as otherwise we may exceed the number of registers. Pressure is approximated by the number of SIMD operations. Ideally I would use a better metric, but it works well in practice so far.

For reduction, I wonder whether the loop fusion in later pass can save us the trouble of handling multiple reductions.

Ideally yes, I would love for you to integrate multiple reductions into a single kernel. Maybe with this new infrastructure for SIMD, it will be easier to do.

Note that this reduction pattern reduces a whole loop to a single scalar (not an array of reductions). That pattern is currently not supported for any of the unary/binary elementwise reductions (1) because it does not occur except for this very specific quantization of a whole tensor, and (2) because I would have to significantly rewrite all of the reductions to handle this quite specific pattern.

For divide by mean, we could represent that semantics with tensor dialect and make our code cleaner. But not easy in our onnx to krnl framework.

Before, the divide by mean was a flag on the pattern. That does not work for supporting multiple reductions, so I moved it to a template that can be individually turned on for each specific op. I am interested in learning more about the tensor representation; my goal here was to introduce as few changes as possible.

Overall, we are trying to generate the best code in performance for common patterns with complicated code.

Agreed. I am trying to simplify the generated code a bit, but it is not easy. On x86/arm, there are efficient horizontal/across-lane reduction instructions, for example. Z supports some of them for integers but not for floats. Thus I need a custom scheme that handles VL reductions at once, so that I may use an efficient VL-by-VL permute pattern to fully exploit the SIMD operations (which still requires 4 additional permute operations that are not needed on machines with horizontal reductions).
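For a machine without a horizontal-add instruction, the basic building block is a log-step shuffle reduction. Below is a scalar model of it for a single 4-wide vector (an assumed simplification; the PR's scheme interleaves VL of these and shares the permutes across them):

```cpp
#include <array>

// Log-step shuffle reduction: repeatedly add a lane-permuted copy of the
// vector to itself, halving the active width each step. VL = 4 floats
// here, so log2(4) = 2 permute+add steps. Name is illustrative.
static float shuffleReduceAdd(std::array<float, 4> v) {
  for (int stride = 2; stride >= 1; stride /= 2)
    for (int i = 0; i < stride; ++i)
      v[i] += v[i + stride]; // add the shifted copy lane-wise
  return v[0];               // the full sum lands in lane 0
}
```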

I am looking into the possibility of doing a krnl - simd - reduce, but it's a bit involved, so I needed to first generate the easier code manually and then look into abstracting it into a support function.

Collaborator

@chentong319 chentong319 left a comment


LGTM!

@AlexandreEichenberger
Collaborator Author

Thanks, will implement your suggestions in the next PR, namely:

  • distinguish more precisely between the Vector Length (VL) dictated by hardware vs the additional unrolling for performance
  • give an example of how to use the new SIMD krnl interface.

@AlexandreEichenberger AlexandreEichenberger merged commit 2164245 into onnx:main Aug 13, 2024
7 checks passed
@jenkins-droid
Collaborator

Jenkins Linux s390x Build #15330 [push] Simdized quantized opera... started at 15:52

@jenkins-droid
Collaborator

Jenkins Linux ppc64le Build #14355 [push] Simdized quantized opera... started at 15:53

@jenkins-droid
Collaborator

Jenkins Linux amd64 Build #15325 [push] Simdized quantized opera... started at 14:52

@jenkins-droid
Collaborator

Jenkins Linux amd64 Build #15325 [push] Simdized quantized opera... passed after 1 hr 13 min

@jenkins-droid
Collaborator

Jenkins Linux s390x Build #15330 [push] Simdized quantized opera... passed after 1 hr 46 min

@jenkins-droid
Collaborator

Jenkins Linux ppc64le Build #14355 [push] Simdized quantized opera... passed after 2 hr 5 min

3 participants