
[backend] Add ONNX & OpenVINO support for Cross Encoder (reranker) models #3319


Merged · 5 commits · Apr 15, 2025
docs/cross_encoder/usage/efficiency.rst (602 changes: 602 additions & 0 deletions)

Large diffs are not rendered by default.

docs/cross_encoder/usage/usage.rst (3 changes: 2 additions & 1 deletion)

@@ -73,4 +73,5 @@ Once you have `installed <../../installation.html>`_ Sentence Transformers, you
:caption: Tasks

Cross-Encoder vs Bi-Encoder <../../../examples/cross_encoder/applications/README>
-../../../examples/sentence_transformer/applications/retrieve_rerank/README
+../../../examples/sentence_transformer/applications/retrieve_rerank/README
+efficiency
Binary file added docs/img/ce_backends_benchmark_cpu.png
Binary file added docs/img/ce_backends_benchmark_gpu.png
docs/sentence_transformer/usage/efficiency.rst (20 changes: 10 additions & 10 deletions)
@@ -132,9 +132,9 @@ Optimizing ONNX Models

.. include:: backend_export_sidebar.rst

-ONNX models can be optimized using Optimum, allowing for speedups on CPUs and GPUs alike. To do this, you can use the :func:`~sentence_transformers.backend.export_optimized_onnx_model` function, which saves the optimized model in a directory or model repository that you specify. It expects:
+ONNX models can be optimized using `Optimum <https://huggingface.co/docs/optimum/index>`_, allowing for speedups on CPUs and GPUs alike. To do this, you can use the :func:`~sentence_transformers.backend.export_optimized_onnx_model` function, which saves the optimized model in a directory or model repository that you specify. It expects:

-- ``model``: a Sentence Transformer model loaded with the ONNX backend.
+- ``model``: a Sentence Transformer or Cross Encoder model loaded with the ONNX backend.
- ``optimization_config``: ``"O1"``, ``"O2"``, ``"O3"``, or ``"O4"`` representing optimization levels from :class:`~optimum.onnxruntime.AutoOptimizationConfig`, or an :class:`~optimum.onnxruntime.OptimizationConfig` instance.
- ``model_name_or_path``: a path to save the optimized model file, or the repository name if you want to push it to the Hugging Face Hub.
- ``push_to_hub``: (Optional) a boolean to push the optimized model to the Hugging Face Hub.
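For reference, a minimal sketch of how this applies to the newly supported Cross Encoder models, assuming the top-level import path for the export function; the reranker checkpoint and output path are illustrative, not taken from this PR:

```python
from sentence_transformers import CrossEncoder, export_optimized_onnx_model

# Load a reranker with the ONNX backend (illustrative checkpoint name)
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")

# Apply O3 graph optimizations and save the result to a local directory
export_optimized_onnx_model(
    model,
    optimization_config="O3",
    model_name_or_path="ms-marco-MiniLM-L6-v2-onnx-o3",
)
```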
@@ -204,9 +204,9 @@ Quantizing ONNX Models

.. include:: backend_export_sidebar.rst

-ONNX models can be quantized to int8 precision using Optimum, allowing for faster inference on CPUs. To do this, you can use the :func:`~sentence_transformers.backend.export_dynamic_quantized_onnx_model` function, which saves the quantized model in a directory or model repository that you specify. Dynamic quantization, unlike static quantization, does not require a calibration dataset. It expects:
+ONNX models can be quantized to int8 precision using `Optimum <https://huggingface.co/docs/optimum/index>`_, allowing for faster inference on CPUs. To do this, you can use the :func:`~sentence_transformers.backend.export_dynamic_quantized_onnx_model` function, which saves the quantized model in a directory or model repository that you specify. Dynamic quantization, unlike static quantization, does not require a calibration dataset. It expects:

-- ``model``: a Sentence Transformer model loaded with the ONNX backend.
+- ``model``: a Sentence Transformer or Cross Encoder model loaded with the ONNX backend.
- ``quantization_config``: ``"arm64"``, ``"avx2"``, ``"avx512"``, or ``"avx512_vnni"`` representing quantization configurations from :class:`~optimum.onnxruntime.AutoQuantizationConfig`, or an :class:`~optimum.onnxruntime.QuantizationConfig` instance.
- ``model_name_or_path``: a path to save the quantized model file, or the repository name if you want to push it to the Hugging Face Hub.
- ``push_to_hub``: (Optional) a boolean to push the quantized model to the Hugging Face Hub.
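As above, a minimal sketch for a Cross Encoder; the checkpoint and output names are illustrative, and ``avx512`` is just one of the listed quantization configurations:

```python
from sentence_transformers import CrossEncoder, export_dynamic_quantized_onnx_model

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")

# int8 dynamic quantization for AVX512 CPUs; no calibration dataset is needed
export_dynamic_quantized_onnx_model(
    model,
    quantization_config="avx512",
    model_name_or_path="ms-marco-MiniLM-L6-v2-onnx-int8",
)
```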
@@ -329,15 +329,15 @@ Quantizing OpenVINO Models

.. include:: backend_export_sidebar.rst

-OpenVINO models can be quantized to int8 precision using Optimum Intel to speed up inference.
+OpenVINO models can be quantized to int8 precision using `Optimum Intel <https://huggingface.co/docs/optimum/main/en/intel/index>`_ to speed up inference.
To do this, you can use the :func:`~sentence_transformers.backend.export_static_quantized_openvino_model` function,
which saves the quantized model in a directory or model repository that you specify.
Post-Training Static Quantization expects:

-- ``model``: a Sentence Transformer model loaded with the OpenVINO backend.
+- ``model``: a Sentence Transformer or Cross Encoder model loaded with the OpenVINO backend.
- ``quantization_config``: (Optional) The quantization configuration. This parameter accepts either:
-  ``None`` for the default 8-bit quantization, a dictionary representing quantization configurations, or
-  an :class:`~optimum.intel.OVQuantizationConfig` instance.
+  ``None`` for the default 8-bit quantization, a dictionary representing quantization configurations, or
+  an :class:`~optimum.intel.OVQuantizationConfig` instance.
- ``model_name_or_path``: a path to save the quantized model file, or the repository name if you want to push it to the Hugging Face Hub.
- ``dataset_name``: (Optional) The name of the dataset to load for calibration. If not specified, defaults to ``sst2`` subset from the ``glue`` dataset.
- ``dataset_config_name``: (Optional) The specific configuration of the dataset to load.
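A minimal sketch with the default settings, where ``quantization_config=None`` falls back to 8-bit quantization calibrated on the ``sst2`` subset of ``glue``, per the parameters above; the checkpoint and output names are illustrative:

```python
from sentence_transformers import CrossEncoder, export_static_quantized_openvino_model

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="openvino")

# Post-training static int8 quantization with the default configuration
export_static_quantized_openvino_model(
    model,
    quantization_config=None,
    model_name_or_path="ms-marco-MiniLM-L6-v2-openvino-qint8",
)
```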
@@ -541,8 +541,8 @@ Based on the benchmarks, this flowchart should help you decide which backend to
}
}}%%
graph TD
-A(What is your hardware?) -->|GPU| B(Is your text usually smaller than 500 characters?)
-A -->|CPU| C(Is a 0.4% accuracy loss acceptable?)
+A(What is your hardware?) -->|GPU| B(Is your text usually smaller<br>than 500 characters?)
+A -->|CPU| C(Is a 0.4% accuracy loss<br>acceptable?)
B -->|yes| D[onnx-O4]
B -->|no| F[float16]
C -->|yes| G[openvino-qint8]
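To act on the flowchart's recommendation, loading an exported model looks roughly like this; the ``file_name`` values assume the default suffixes used by the export functions above and may differ in practice:

```python
from sentence_transformers import SentenceTransformer

# CPU path from the flowchart: openvino-qint8
model = SentenceTransformer(
    "path/to/model",
    backend="openvino",
    model_kwargs={"file_name": "openvino/openvino_model_qint8_quantized.xml"},
)

# GPU path with short texts: onnx-O4
# model = SentenceTransformer(
#     "path/to/model",
#     backend="onnx",
#     model_kwargs={"file_name": "onnx/model_O4.onnx"},
# )
```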
docs/sentence_transformer/usage/usage.rst (2 changes: 1 addition & 1 deletion)

@@ -56,6 +56,6 @@ Once you have `installed <../../installation.html>`_ Sentence Transformers, you
../../../examples/sentence_transformer/applications/parallel-sentence-mining/README
../../../examples/sentence_transformer/applications/image-search/README
../../../examples/sentence_transformer/applications/embedding-quantization/README
-efficiency
custom_models
+efficiency
