Showing 11 changed files with 400 additions and 22 deletions.
@@ -3,5 +3,4 @@ Pygments==2.13.0
 myst-parser==0.18.1
 furo==2022.12.7
 sphinxcontrib-googleanalytics
 djlint
--e ./pkg/reazon_theme
@@ -0,0 +1,125 @@
======================================================================
(2024-08-01) ReazonSpeech v2.1: Setting a New Standard in Japanese ASR
======================================================================
Today, we're excited to announce ReazonSpeech v2.1. In this release, we
publish ReazonSpeech-k2-v2, an open-source Japanese ASR model that sets
new records in benchmark tests. It is built on the
`Next-gen Kaldi framework <https://k2-fsa.org/>`_ and distributed in
the platform-neutral
`Open Neural Network Exchange (ONNX) format <https://github.com/onnx/onnx>`_.
ReazonSpeech-k2-v2 excels in accuracy, compactness, and inference speed,
and can run on-device without a GPU.
We published the ReazonSpeech-k2-v2 model under the Apache 2.0 license. The
model files and the inference code are readily available on
`Hugging Face <https://huggingface.co/reazon-research/reazonspeech-k2-v2>`_
and
`GitHub <https://github.com/reazon-research/ReazonSpeech>`_.
.. figure:: ../_static/blog/2024-08-01-ReazonSpeech/cer.png

   **Figure 1: ReazonSpeech v2.1 on common Japanese ASR benchmark tests**
What is ReazonSpeech v2.1?
==========================

ReazonSpeech v2.1 represents the latest iteration of Reazon Human Interaction
Lab's ASR research. This release introduces a new Japanese ASR model that:
* Outperforms existing Japanese ASR models on the JSUT-BASIC5000 [#jsut-basic5000]_,
  Common Voice v8.0 [#cv]_, and TEDxJP-10K [#tedx]_ benchmark sets (see the
  chart above).

* Excels in compactness, with only 159M parameters.

* Excels in inference speed, making it one of the fastest models at
  processing short audio inputs.
What enables such outstanding performance is the state-of-the-art Transformer
architecture called Zipformer [#zipformer]_. We trained this novel network
on 35,000 hours of the `ReazonSpeech v2.0 corpus
<https://huggingface.co/datasets/reazon-research/reazonspeech>`_,
which yielded best-in-class performance.
.. tip::

   For further details about the ReazonSpeech-k2-v2 model, the full training
   recipe is available on `k2-fsa/icefall <https://github.com/k2-fsa/icefall/tree/master/egs/reazonspeech/ASR>`_.
Easy deployment with ONNX
=========================

The ReazonSpeech-k2-v2 model is available in the ONNX format, significantly
enhancing its versatility across a wide range of platforms. Leveraging the
ONNX Runtime, which is independent of the PyTorch framework, simplifies the
setup process and facilitates seamless integration across diverse
environments. This adaptability ensures practical application on various
devices even without a GPU, including Linux, macOS, Windows, embedded
systems, Android, and iOS.

For more details about the supported platforms, please refer to the
`Sherpa-ONNX documentation <https://k2-fsa.github.io/sherpa/onnx/index.html>`_.
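As a rough sketch of what ONNX deployment can look like, the following
snippet decodes a WAV file with the ``sherpa-onnx`` Python bindings. The
model file names are placeholders; substitute the actual encoder, decoder,
joiner, and ``tokens.txt`` files shipped with ReazonSpeech-k2-v2.

.. code:: python3

   import soundfile as sf
   import sherpa_onnx

   # Placeholder file names -- use the ONNX files from the
   # reazon-research/reazonspeech-k2-v2 release on Hugging Face.
   recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
       tokens="tokens.txt",
       encoder="encoder.onnx",
       decoder="decoder.onnx",
       joiner="joiner.onnx",
       num_threads=2,
   )

   # Decode 16 kHz mono audio.
   samples, sample_rate = sf.read("test.wav", dtype="float32")
   stream = recognizer.create_stream()
   stream.accept_waveform(sample_rate, samples)
   recognizer.decode_stream(stream)
   print(stream.result.text)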
Reduce memory footprint with quantization
=========================================

We also released an ``int8``-quantized version of the ReazonSpeech-k2-v2 model.
The quantized model exhibits a significantly smaller footprint, as shown
in the following table.

.. table:: Table 1: The effects of quantization on model size

   ============ ================ ================
   FILE         FILE SIZE (FP32) FILE SIZE (INT8)
   ============ ================ ================
   Encoder      565 MB           148 MB
   Decoder      12 MB            3 MB
   Joiner       11 MB            3 MB
   ============ ================ ================
These quantized models (154 MB in total) are up to 10x smaller than comparable
ASR models such as Whisper-Large-v3, enabling their deployment on a wide range
of devices with computational constraints. Notably, when the quantized encoder
is paired with a non-quantized decoder (the ``int8-fp32`` configuration in
Table 2), accuracy remains comparable to the fully non-quantized model. This
enables the deployment of our model even on devices with very limited
computational capacity.
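With the ReazonSpeech Python package, the precision is selected at load time
(see the ``reazonspeech.k2.asr`` API reference):

.. code:: python3

   from reazonspeech.k2.asr import load_model

   # Quantized encoder with a non-quantized decoder.
   model = load_model(device="cpu", precision="int8-fp32")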
.. table:: Table 2: The effects of quantization on accuracy (CER, %)

   ============================== ======= ============ ==========
   Model Name                     JSUT    Common Voice TEDxJP-10K
   ============================== ======= ============ ==========
   ReazonSpeech-k2-v2             6.45    7.85         9.09
   ReazonSpeech-k2-v2 (int8)      6.63    8.19         9.86
   ReazonSpeech-k2-v2 (int8-fp32) 6.45    7.87         9.15
   Whisper Large-v3               7.18    8.18         9.96
   ReazonSpeech-NeMo-v2           7.31    8.81         10.42
   ReazonSpeech-ESPnet-v2         6.89    8.27         9.28
   ============================== ======= ============ ==========
Future goals
============

With this release, we have significantly enhanced both the speed and accuracy
of our Japanese ASR models. By making our model open source on the K2
Sherpa-ONNX platform, we have greatly improved accessibility for a broad range
of users and developers across various platforms.
Looking ahead, we are committed to further advancing our models by expanding
our dataset, developing streaming ASR capabilities, and incorporating
multilingual data to create an exceptional bilingual English-Japanese ASR
model.
This release represents a major milestone, and we are excited to continue
pushing the boundaries of Japanese speech processing technology. Currently,
ReazonSpeech-k2-v2 can process longer segments of audio with the help of
voice activity detection (VAD). In the future, we plan to release a
streaming version of this model that innately supports real-time
transcription.
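To illustrate the idea (this is not the actual pipeline), long audio can be
segmented at silence and each speech segment transcribed separately. Below is
a minimal energy-threshold sketch assuming 16 kHz float32 input; a real VAD
model is far more robust, and the zero-filled input is a placeholder.

.. code:: python3

   import numpy as np
   from reazonspeech.k2.asr import audio_from_numpy, load_model, transcribe

   def split_on_silence(samples, rate, frame=0.1, threshold=1e-4):
       """Crude energy-based segmentation (illustrative stand-in for VAD)."""
       hop = int(rate * frame)
       segments, current = [], []
       for i in range(0, len(samples), hop):
           chunk = samples[i:i + hop]
           if np.mean(chunk ** 2) > threshold:
               current.append(chunk)
           elif current:
               segments.append(np.concatenate(current))
               current = []
       if current:
           segments.append(np.concatenate(current))
       return segments

   model = load_model()
   long_audio = np.zeros(16000 * 60, dtype=np.float32)  # placeholder input
   for segment in split_on_silence(long_audio, 16000):
       result = transcribe(model, audio_from_numpy(segment, 16000))
       print(result.text)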
Footnotes
=========

.. [#jsut-basic5000] Ryosuke Sonobe, Shinnosuke Takamichi and Hiroshi Saruwatari, "JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis," arXiv:1711.00354, 2017.
.. [#cv] https://commonvoice.mozilla.org/
.. [#tedx] https://github.com/laboroai/TEDxJP-10K
.. [#zipformer] https://arxiv.org/abs/2310.11230
source/projects/ReazonSpeech/api/reazonspeech.k2.asr.rst (140 additions, 0 deletions)

@@ -0,0 +1,140 @@
===================
reazonspeech.k2.asr
===================

.. py:module:: reazonspeech.k2.asr

This reference describes the interface for performing speech recognition
with the K2 model.

Functions
=========

.. function:: load_model(device="cpu", precision="fp32")

   Load the ReazonSpeech K2 model.

   :param device: ``cuda``, ``cpu``, or ``coreml``
   :param precision: ``fp32``, ``int8``, or ``int8-fp32``
   :rtype: sherpa_onnx.OfflineRecognizer
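For example, to load the quantized model on CPU:

.. code:: python3

   from reazonspeech.k2.asr import load_model

   # Select the quantized weights at load time.
   model = load_model(device="cpu", precision="int8")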
.. function:: transcribe(model, audio, config=None)

   Transcribe audio with the ReazonSpeech model and return the result.

   **Example**

   .. code:: python3

      from reazonspeech.k2.asr import audio_from_path, load_model, transcribe

      audio = audio_from_path("test.wav")
      model = load_model()
      ret = transcribe(model, audio)

      print('TEXT:')
      print(' -', ret.text)
      print('SUBWORDS:')
      for subword in ret.subwords[:9]:
          print(' -', subword)

   **Output**

   .. code:: yaml

      TEXT:
      - ヤンバルクイナとの出会いは十八歳の時だった
      SUBWORDS:
      - Subword(seconds=0.03, token='ヤ')
      - Subword(seconds=1.36, token='ン')
      - Subword(seconds=1.55, token='バ')
      - Subword(seconds=1.75, token='ル')
      - Subword(seconds=1.91, token='ク')
      - Subword(seconds=2.11, token='イ')
      - Subword(seconds=2.27, token='ナ')
      - Subword(seconds=2.51, token='と')
      - Subword(seconds=2.67, token='の')

   :param sherpa_onnx.OfflineRecognizer model: ReazonSpeech model
   :param AudioData audio: Audio data
   :param TranscribeConfig config: Additional options (optional)
   :rtype: TranscribeResult
Helper functions
================

.. function:: audio_from_path(path)

   Read an audio file and return its audio data.

   :param str path: Path to the audio file
   :rtype: AudioData

.. function:: audio_from_numpy(array, samplerate)

   Take a NumPy array and return audio data.

   :param numpy.ndarray array: Audio data
   :param int samplerate: Sampling rate
   :rtype: AudioData

.. function:: audio_from_tensor(tensor, samplerate)

   Take a PyTorch tensor and return audio data.

   :param torch.Tensor tensor: Audio data
   :param int samplerate: Sampling rate
   :rtype: AudioData
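For instance, audio already held in memory as a NumPy array can be wrapped
and transcribed as follows (the silent one-second input is purely
illustrative):

.. code:: python3

   import numpy as np
   from reazonspeech.k2.asr import audio_from_numpy, load_model, transcribe

   # One second of 16 kHz silence as a stand-in for real audio.
   samples = np.zeros(16000, dtype=np.float32)
   audio = audio_from_numpy(samples, 16000)

   model = load_model()
   print(transcribe(model, audio).text)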
Classes
=======

.. class:: TranscribeConfig

   Configuration class for tuning the transcription process.

   .. attribute:: verbose
      :type: bool
      :value: True
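Assuming ``TranscribeConfig`` exposes a dataclass-style constructor (as the
attribute listing suggests), a quieter transcription run might look like:

.. code:: python3

   from reazonspeech.k2.asr import (TranscribeConfig, audio_from_path,
                                    load_model, transcribe)

   # Assumed keyword constructor; verbose defaults to True.
   config = TranscribeConfig(verbose=False)
   ret = transcribe(load_model(), audio_from_path("test.wav"), config)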
.. class:: TranscribeResult

   Data class that holds the transcription result.

   .. attribute:: text
      :type: str

      The recognized text.

   .. attribute:: subwords
      :type: List[Subword]

      Subword-level timestamp information.

.. class:: Subword

   Subword-level recognition result.

   .. attribute:: seconds
      :type: float

      The time at which the subword occurs, in seconds.

   .. attribute:: token
      :type: str

      The subword string.
.. class:: AudioData

   Container that holds audio data.

   .. attribute:: waveform
      :type: numpy.ndarray

      The audio samples.

   .. attribute:: samplerate
      :type: int

      The sampling rate.