[docs] Add links to backend-export in Speeding up Inference (#3071)
tomaarsen authored Nov 20, 2024
1 parent 8fabce0 commit efbf3ee
Showing 2 changed files with 28 additions and 2 deletions.
5 changes: 5 additions & 0 deletions docs/sentence_transformer/usage/backend_export_sidebar.rst
@@ -0,0 +1,5 @@
.. sidebar:: Export, Optimize, and Quantize Hugging Face models

This Hugging Face Space provides a user interface for exporting, optimizing, and quantizing models for either ONNX or OpenVINO:

- `sentence-transformers/backend-export <https://huggingface.co/spaces/sentence-transformers/backend-export>`_
25 changes: 23 additions & 2 deletions docs/sentence_transformer/usage/efficiency.rst
@@ -18,12 +18,16 @@ Sentence Transformers supports 3 backends for computing embeddings, each with it
</a>
<a href="#openvino" class="box">
<div class="header">OpenVINO</div>
-        Optimization of models, primarily for Intel Hardware.
+        Optimization of models, mainly for Intel Hardware.
</a>
<a href="#benchmarks" class="box">
<div class="header">Benchmarks</div>
Benchmarks for the different backends.
</a>
<a href="#user-interface" class="box">
<div class="header">User Interface</div>
GUI to export, optimize, and quantize models.
</a>
</div>
<br>

@@ -74,6 +78,8 @@ If you're using a GPU, then you can use the following options to speed up your i
ONNX
----

.. include:: backend_export_sidebar.rst

ONNX can be used to speed up inference by converting the model to ONNX format and using ONNX Runtime to run the model. To use the ONNX backend, you must install Sentence Transformers with the ``onnx`` or ``onnx-gpu`` extra for CPU or GPU acceleration, respectively:

.. code-block:: bash
@@ -120,6 +126,8 @@ All keyword arguments passed via ``model_kwargs`` will be passed on to :meth:`OR
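
For illustration, here is a minimal sketch of loading a model with the ONNX backend; the model name is a placeholder, not part of this commit:

.. code-block:: python

    from sentence_transformers import SentenceTransformer

    # backend="onnx" uses an existing ONNX file from the model repository,
    # or exports the model to ONNX on the fly.
    model = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")

    embeddings = model.encode(["ONNX Runtime often speeds up CPU inference."])
    print(embeddings.shape)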
Optimizing ONNX Models
^^^^^^^^^^^^^^^^^^^^^^

.. include:: backend_export_sidebar.rst

ONNX models can be optimized using Optimum, allowing for speedups on CPUs and GPUs alike. To do this, you can use the :func:`~sentence_transformers.backend.export_optimized_onnx_model` function, which saves the optimized model in a directory or model repository that you specify. It expects:

- ``model``: a Sentence Transformer model loaded with the ONNX backend.
@@ -190,6 +198,8 @@ See this example for exporting a model with :doc:`optimization level 3 <optimum:
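
As a rough sketch of that optimization step (the output path and the ``"O3"`` level are illustrative assumptions):

.. code-block:: python

    from sentence_transformers import SentenceTransformer
    from sentence_transformers.backend import export_optimized_onnx_model

    # Load with the ONNX backend, then save an O3-optimized copy.
    model = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")
    export_optimized_onnx_model(model, "O3", "path/to/optimized-model")

The optimized file can afterwards be selected at load time, e.g. with ``model_kwargs={"file_name": "onnx/model_O3.onnx"}``; the exact file name may differ.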
Quantizing ONNX Models
^^^^^^^^^^^^^^^^^^^^^^

.. include:: backend_export_sidebar.rst

ONNX models can be quantized to int8 precision using Optimum, allowing for faster inference on CPUs. To do this, you can use the :func:`~sentence_transformers.backend.export_dynamic_quantized_onnx_model` function, which saves the quantized model in a directory or model repository that you specify. Dynamic quantization, unlike static quantization, does not require a calibration dataset. It expects:

- ``model``: a Sentence Transformer model loaded with the ONNX backend.
@@ -262,6 +272,8 @@ See this example for quantizing a model to ``int8`` with :doc:`avx512_vnni <opti
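
A comparable sketch for dynamic ``int8`` quantization; the path is a placeholder, and ``"avx512_vnni"`` is only one of the supported configurations:

.. code-block:: python

    from sentence_transformers import SentenceTransformer
    from sentence_transformers.backend import export_dynamic_quantized_onnx_model

    # Dynamic quantization requires no calibration dataset.
    model = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")
    export_dynamic_quantized_onnx_model(model, "avx512_vnni", "path/to/quantized-model")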
OpenVINO
--------

.. include:: backend_export_sidebar.rst

OpenVINO allows for accelerated inference on CPUs by exporting the model to the OpenVINO format. To use the OpenVINO backend, you must install Sentence Transformers with the ``openvino`` extra:

.. code-block:: bash
@@ -305,6 +317,8 @@ To convert a model to OpenVINO format, you can use the following code:
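
A minimal sketch, using the same ``backend`` argument as for ONNX (the model name is a placeholder):

.. code-block:: python

    from sentence_transformers import SentenceTransformer

    # backend="openvino" converts the model to OpenVINO format on the fly
    # if the repository does not already contain an OpenVINO file.
    model = SentenceTransformer("all-MiniLM-L6-v2", backend="openvino")

    embeddings = model.encode(["OpenVINO accelerates inference on Intel CPUs."])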
Quantizing OpenVINO Models
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. include:: backend_export_sidebar.rst

OpenVINO models can be quantized to int8 precision using Optimum Intel to speed up inference.
To do this, you can use the :func:`~sentence_transformers.backend.export_static_quantized_openvino_model` function,
which saves the quantized model in a directory or model repository that you specify.
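A hedged sketch of that call; the default ``OVQuantizationConfig`` and the output path are assumptions:

.. code-block:: python

    from optimum.intel import OVQuantizationConfig

    from sentence_transformers import SentenceTransformer
    from sentence_transformers.backend import export_static_quantized_openvino_model

    # Static quantization calibrates on a dataset; the default config is used here.
    model = SentenceTransformer("all-MiniLM-L6-v2", backend="openvino")
    quantization_config = OVQuantizationConfig()
    export_static_quantized_openvino_model(model, quantization_config, "path/to/quantized-model")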
@@ -533,4 +547,11 @@ Based on the benchmarks, this flowchart should help you decide which backend to

.. note::

Your mileage may vary, and you should always test the different backends with your specific model and data to find the best one for your use case.

User Interface
^^^^^^^^^^^^^^

This Hugging Face Space provides a user interface for exporting, optimizing, and quantizing models for either ONNX or OpenVINO:

- `sentence-transformers/backend-export <https://huggingface.co/spaces/sentence-transformers/backend-export>`_
