Feature Request: Support for ONNX backend for CrossEncoders. #3039


Closed
SupreethRao99 opened this issue Nov 6, 2024 · 7 comments · Fixed by #3319

Comments

@SupreethRao99

Recently, I noticed that the SentenceTransformer class has gained the ability to use the ONNX backend, which is incredibly beneficial for enhancing performance, especially on CPUs.

I would like to request a similar feature for the CrossEncoder class. Adding support for the ONNX backend in CrossEncoder would be a significant enhancement. It would greatly accelerate reranking tasks on CPU, making the library even more powerful and efficient.

Here are some potential benefits:

  • Improved Performance: Faster inference times on CPU, useful when GPUs are not available.
  • Scalability: Ability to handle larger reranking workloads with reduced latency.
  • Consistency: Ensuring that both the SentenceTransformer and CrossEncoder classes can leverage the same performance optimizations.
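
For reference, the existing API on the SentenceTransformer side looks roughly like the sketch below (the model name and sentences are illustrative, and an ONNX runtime needs to be installed):

```python
from sentence_transformers import SentenceTransformer

# Existing feature on the SentenceTransformer side: selecting the ONNX backend
# at load time. Requires onnxruntime (e.g. pip install sentence-transformers[onnx]).
# The model name and sentences below are illustrative.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", backend="onnx")

embeddings = model.encode([
    "ONNX can speed up inference on CPU.",
    "CrossEncoder reranking could benefit from the same option.",
])
print(embeddings.shape)
```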

Thank you for considering this feature request.

@tomaarsen
Collaborator

Hello!

Thanks for the suggestion. Since I took over this project, I have made various improvements to SentenceTransformer models, such as multi-GPU training, bf16, loss logging, new backends, etc. My intention is to spend some time, starting next week, extending these improvements to CrossEncoder, on both the training and the inference side. That will include adding ONNX/OV backends to CrossEncoder.
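
To illustrate the direction (a sketch only, assuming CrossEncoder adopts the same backend argument that SentenceTransformer already exposes; the final API may differ):

```python
from sentence_transformers import CrossEncoder

# Sketch only: assumes CrossEncoder gains the same backend argument as
# SentenceTransformer. The model name and sentence pairs are illustrative.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")

scores = model.predict([
    ("what is onnx", "ONNX is an open format for representing machine learning models."),
    ("what is onnx", "The weather is nice today."),
])
print(scores)
```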

  • Tom Aarsen

@arjungandeeva

Hello @tomaarsen, is there any update on this? I am not able to see any ONNX models on Hugging Face for:
https://huggingface.co/cross-encoder/ms-marco-MiniLM-L6-v2
https://huggingface.co/cross-encoder/ms-marco-MiniLM-L4-v2

Could you please add ONNX models for these?

@tomaarsen
Collaborator

Hello!

I wrote an implementation for ONNX and OpenVINO support last Friday, and I'll be benchmarking and testing it this week.
It will be included in the next v4.1 release, and all existing models on the cross-encoder organization on Hugging Face will have ONNX models uploaded.

I'm aiming to release v4.1 as soon as possible.

  • Tom Aarsen

@toniopelo

I could test this performance-wise on a consumer GPU if there is a pre-release available, @tomaarsen.
I have a large dataset to score with CrossEncoder and would greatly benefit from this to speed up the process.
Do you have a rough idea of the expected speedup with the ONNX backend now?

Anyway, thanks for your great work on this lib and the very clean v4 release!

@tomaarsen
Collaborator

You're in luck! I've prepared my Pull Request here: #3319
It contains two pictures that detail the expected average speedup on both GPUs and CPUs. You can also already install this branch and use it if you'd like, but the documentation on how to use it isn't on https://sbert.net yet (only in the PR itself). The full v4.1 release with this feature should be published next week.

  • Tom Aarsen

@toniopelo

toniopelo commented Apr 12, 2025

Wow, your response time doesn't need a speedup, it's definitely SOTA! 🥇
The pictures are great for getting an idea of the speedup. If I understand correctly, there is nothing faster than torch fp16 on GPU?
If so, and I am already running with model_kwargs={"torch_dtype": "float16"}, then I have nothing to gain speedup-wise from another backend like ONNX, am I right?

@tomaarsen
Collaborator

That's right! Sometimes bfloat16 is marginally faster (e.g. 1-2%), but it can also perform slightly worse.

ONNX can also be slightly faster under certain settings, but fp16 outperformed ONNX on average.
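
For reference, the fp16 setup discussed above looks roughly like this (a minimal sketch; the model name and pairs are illustrative):

```python
from sentence_transformers import CrossEncoder

# fp16 on GPU: per the discussion above, this is typically the fastest option
# for CrossEncoder inference. The model name and pairs are illustrative.
model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    device="cuda",
    model_kwargs={"torch_dtype": "float16"},
)

scores = model.predict([
    ("how to speed up reranking", "Running the model in fp16 reduces inference latency on GPU."),
    ("how to speed up reranking", "Bananas are rich in potassium."),
])
print(scores)
```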
