Commit 135060a

Merge branch 'sf/v2.1'

fujimotos committed Aug 1, 2024
2 parents a259d83 + 22cae49
Showing 11 changed files with 400 additions and 22 deletions.
2 changes: 1 addition & 1 deletion pkg/reazon_theme/theme/static/style.css

Large diffs are not rendered by default.

1 change: 0 additions & 1 deletion requirements.txt
@@ -3,5 +3,4 @@ Pygments==2.13.0
myst-parser==0.18.1
furo==2022.12.7
sphinxcontrib-googleanalytics
djlint
-e ./pkg/reazon_theme
Binary file modified source/_static/cer.png
Binary file modified source/_static/rtf.png
125 changes: 125 additions & 0 deletions source/blog/2024-08-01-ReazonSpeech.rst
@@ -0,0 +1,125 @@
======================================================================
(2024-08-01) ReazonSpeech v2.1: Setting a New Standard in Japanese ASR
======================================================================

Today, we're excited to announce ReazonSpeech v2.1. This release
introduces ReazonSpeech-k2-v2, an open-source Japanese ASR model that
sets new records on benchmark tests. It is built on the
`Next-gen Kaldi framework <https://k2-fsa.org/>`_ and distributed in
the platform-neutral
`Open Neural Network Exchange (ONNX) format <https://github.com/onnx/onnx>`_.
ReazonSpeech-k2-v2 excels in accuracy, compactness, and inference speed,
and can run on-device without a GPU.

The ReazonSpeech-k2-v2 model is published under the Apache 2.0 license.
The model files and inference code are readily available on
`Hugging Face <https://huggingface.co/reazon-research/reazonspeech-k2-v2>`_
and
`GitHub <https://github.com/reazon-research/ReazonSpeech>`_.
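
Getting started takes only a few lines of Python. The following is a
minimal sketch using the ``reazonspeech.k2.asr`` package documented in
this release; it assumes the package is installed and that
``speech.wav`` is a Japanese audio file on disk.

.. code:: python3

   from reazonspeech.k2.asr import audio_from_path, load_model, transcribe

   # Load the ReazonSpeech-k2-v2 model (CPU, full precision).
   model = load_model(device="cpu", precision="fp32")

   # Read a local audio file and run recognition.
   audio = audio_from_path("speech.wav")
   result = transcribe(model, audio)

   print(result.text)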

.. figure:: ../_static/blog/2024-08-01-ReazonSpeech/cer.png

   **Figure 1: ReazonSpeech v2.1 on common Japanese ASR benchmark tests**

What is ReazonSpeech v2.1?
==========================

ReazonSpeech v2.1 represents the latest iteration of Reazon Human Interaction
Lab's ASR research. This release introduces a new Japanese ASR model that:

* Outperforms existing Japanese ASR models on JSUT-BASIC5000
  [#jsut-basic5000]_, Common Voice v8.0 [#cv]_, and TEDxJP-10K [#tedx]_
  benchmark sets (see the chart above).

* Excels in compactness, with only 159M parameters.

* Excels in inference speed, ranking among the fastest models at
  processing short audio inputs.

What enables such outstanding performance is a state-of-the-art
Transformer architecture called Zipformer [#zipformer]_. We trained this
novel network on 35,000 hours of the `ReazonSpeech v2.0 corpus
<https://huggingface.co/datasets/reazon-research/reazonspeech>`_,
which yielded best-in-class performance.

.. tip::

   The full training recipe for the ReazonSpeech-k2-v2 model is available
   on `k2-fsa/icefall <https://github.com/k2-fsa/icefall/tree/master/egs/reazonspeech/ASR>`_.

Easy deployment with ONNX
=========================

The ReazonSpeech-k2-v2 model is available in the ONNX format, which
significantly enhances its versatility across a wide range of platforms.
The ONNX runtime has no dependency on the PyTorch framework, which
simplifies the setup process and facilitates seamless integration across
diverse environments. This adaptability ensures practical use on Linux,
macOS, Windows, embedded systems, Android, and iOS, even on devices
without a GPU.

For more details about the supported platforms, please refer to the
`Sherpa-ONNX documentation <https://k2-fsa.github.io/sherpa/onnx/index.html>`_.
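
For readers who prefer to load the ONNX files directly rather than use
our Python package, here is a rough sketch using the ``sherpa_onnx``
Python API. The model file names are assumptions based on the standard
transducer layout (encoder, decoder, joiner, plus a token table); check
the Hugging Face repository for the actual paths.

.. code:: python3

   import sherpa_onnx
   import soundfile as sf  # any reader that yields float32 samples works

   # File names are illustrative; substitute the actual paths from the
   # reazonspeech-k2-v2 Hugging Face repository.
   recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
       encoder="encoder.onnx",
       decoder="decoder.onnx",
       joiner="joiner.onnx",
       tokens="tokens.txt",
       num_threads=2,
   )

   samples, sample_rate = sf.read("speech.wav", dtype="float32")
   stream = recognizer.create_stream()
   stream.accept_waveform(sample_rate, samples)
   recognizer.decode_stream(stream)
   print(stream.result.text)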

Reduce memory footprint with quantization
=========================================

We have also released an ``int8``-quantized version of the
ReazonSpeech-k2-v2 model. The quantized model has a significantly
smaller footprint, as shown in the following table.

.. table:: Table 1: The effects of quantization on model size

   ============ ================ ================
   FILE         FILE SIZE (FP32) FILE SIZE (INT8)
   ============ ================ ================
   Encoder      565 MB           148 MB
   Decoder      12 MB            3 MB
   Joiner       11 MB            3 MB
   ============ ================ ================

These quantized models are up to 10x smaller than comparable ASR models
such as Whisper-Large-v3, enabling their deployment on a wide range of
devices with computational constraints. Notably, when the quantized
encoder is paired with a non-quantized decoder (the ``int8-fp32``
variant below), accuracy remains comparable to the non-quantized model.

.. table:: Table 2: The effects of quantization on accuracy (character error rate; lower is better)

   ============================== ======= ============ ==========
   Model Name                     JSUT    Common Voice TEDxJP-10K
   ============================== ======= ============ ==========
   ReazonSpeech-k2-v2             6.45    7.85         9.09
   ReazonSpeech-k2-v2 (int8)      6.63    8.19         9.86
   ReazonSpeech-k2-v2 (int8-fp32) 6.45    7.87         9.15
   Whisper Large-v3               7.18    8.18         9.96
   ReazonSpeech-NeMo-v2           7.31    8.81         10.42
   ReazonSpeech-ESPnet-v2         6.89    8.27         9.28
   ============================== ======= ============ ==========
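
Selecting a quantized variant from Python is a one-line change. The
sketch below uses the ``precision`` parameter of ``load_model`` from the
API reference added in this release.

.. code:: python3

   from reazonspeech.k2.asr import load_model

   # Full-precision model: largest files, reference accuracy.
   model_fp32 = load_model(precision="fp32")

   # Quantized encoder with a non-quantized decoder: near-fp32 accuracy.
   model_mixed = load_model(precision="int8-fp32")

   # Fully quantized model: smallest footprint.
   model_int8 = load_model(precision="int8")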

Future goals
============

With this release, we have significantly enhanced both the speed and
accuracy of our Japanese ASR models. By open-sourcing our model on the
k2/Sherpa-ONNX platform, we have made it far more accessible to users
and developers across a wide range of environments.

Looking ahead, we are committed to further advancing our models by expanding
our dataset, developing streaming ASR capabilities, and incorporating
multilingual data to create an exceptional bilingual English-Japanese ASR
model.

This release represents a major milestone, and we are excited to
continue pushing the boundaries of Japanese speech processing
technology. Currently, ReazonSpeech-k2-v2 can process longer segments of
audio with the help of voice activity detection (VAD); in the future, we
plan to release a streaming version of this model that natively supports
real-time transcription.

Footnotes
=========

.. [#jsut-basic5000] Ryosuke Sonobe, Shinnosuke Takamichi and Hiroshi Saruwatari, "JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis," arXiv preprint, 1711.00354, 2017.
.. [#cv] https://commonvoice.mozilla.org/
.. [#tedx] https://github.com/laboroai/TEDxJP-10K
.. [#zipformer] https://arxiv.org/abs/2310.11230
4 changes: 4 additions & 0 deletions source/index.rst
@@ -6,6 +6,8 @@ Reazon Human Interaction Laboratory

.. list-table::

   * - August 1, 2024
     - :any:`Released the latest version of ReazonSpeech, v2.1. <blog/2024-08-01-ReazonSpeech>`
   * - February 14, 2024
     - :any:`Released the latest version of ReazonSpeech, v2.0. <blog/2024-02-14-ReazonSpeech>`
   * - June 15, 2023
@@ -18,6 +20,8 @@ Reazon Human Interaction Laboratory
Latest articles
---------------

* :any:`blog/2024-08-01-ReazonSpeech`
* :any:`blog/2024-03-02-how-to-run-aloha-developers.agirobots.com`
* :any:`blog/2024-02-14-ReazonSpeech`
* :any:`blog/2023-04-04-ReazonSpeech`
* :any:`blog/2023-01-15-DDS-performance`
1 change: 1 addition & 0 deletions source/projects/ReazonSpeech/api/index.rst
@@ -9,5 +9,6 @@ ReazonSpeech provides various Python interfaces for speech processing
   :caption: ReazonSpeech API Reference

   reazonspeech.nemo.asr.rst
   reazonspeech.k2.asr.rst
   reazonspeech.espnet.asr.rst
   reazonspeech.espnet.oneseg.rst
140 changes: 140 additions & 0 deletions source/projects/ReazonSpeech/api/reazonspeech.k2.asr.rst
@@ -0,0 +1,140 @@
===================
reazonspeech.k2.asr
===================

.. py:module:: reazonspeech.k2.asr

This reference describes the interface for performing speech recognition
with the K2 model.

Functions
=========

.. function:: load_model(device="cpu", precision="fp32")

   Load the ReazonSpeech K2 model.

   :param device: ``cuda``, ``cpu``, or ``coreml``
   :param precision: ``fp32``, ``int8``, or ``int8-fp32``
   :rtype: sherpa_onnx.OfflineRecognizer

.. function:: transcribe(model, audio, config=None)

   Recognize speech with a ReazonSpeech model and return the result.

   **Sample code**

   .. code:: python3

      from reazonspeech.k2.asr import audio_from_path, load_model, transcribe

      audio = audio_from_path("test.wav")
      model = load_model()
      ret = transcribe(model, audio)

      print('TEXT:')
      print(' -', ret.text)
      print('SUBWORDS:')
      for subword in ret.subwords[:9]:
          print(' -', subword)

   **Output**

   .. code:: yaml

      TEXT:
      - ヤンバルクイナとの出会いは十八歳の時だった
      SUBWORDS:
      - Subword(seconds=0.03, token='ヤ')
      - Subword(seconds=1.36, token='ン')
      - Subword(seconds=1.55, token='バ')
      - Subword(seconds=1.75, token='ル')
      - Subword(seconds=1.91, token='ク')
      - Subword(seconds=2.11, token='イ')
      - Subword(seconds=2.27, token='ナ')
      - Subword(seconds=2.51, token='と')
      - Subword(seconds=2.67, token='の')

   :param sherpa_onnx.OfflineRecognizer model: ReazonSpeech model
   :param AudioData audio: Audio data
   :param TranscribeConfig config: Additional options (optional)
   :rtype: TranscribeResult

Helper functions
================

.. function:: audio_from_path(path)

   Read an audio file and return its audio data.

   :param str path: Path to the audio file
   :rtype: AudioData

.. function:: audio_from_numpy(array, samplerate)

   Take a NumPy array and return audio data.

   :param numpy.ndarray array: Audio data
   :param int samplerate: Sampling rate
   :rtype: AudioData
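
   **Sample code** (a brief sketch; the 16 kHz sampling rate and the
   silent waveform are illustrative stand-ins for real audio)

   .. code:: python3

      import numpy as np
      from reazonspeech.k2.asr import audio_from_numpy, load_model, transcribe

      # One second of silence at 16 kHz, as a stand-in for real audio.
      waveform = np.zeros(16000, dtype=np.float32)
      audio = audio_from_numpy(waveform, 16000)

      model = load_model()
      ret = transcribe(model, audio)
      print(ret.text)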

.. function:: audio_from_tensor(tensor, samplerate)

   Take a PyTorch tensor and return audio data.

   :param torch.Tensor tensor: Audio data
   :param int samplerate: Sampling rate
   :rtype: AudioData

Classes
=======

.. class:: TranscribeConfig

   Configuration class for tuning the speech recognition process.

   .. attribute:: verbose
      :type: bool
      :value: True
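
   **Sample code** (a brief sketch; it assumes ``TranscribeConfig`` is
   exported from the module and that ``verbose=False`` suppresses
   progress output, as the attribute name suggests)

   .. code:: python3

      from reazonspeech.k2.asr import (TranscribeConfig, audio_from_path,
                                       load_model, transcribe)

      model = load_model()
      audio = audio_from_path("test.wav")

      # Run recognition without progress output.
      config = TranscribeConfig(verbose=False)
      ret = transcribe(model, audio, config)
      print(ret.text)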

.. class:: TranscribeResult

   Data class that stores the result of speech recognition.

   .. attribute:: text
      :type: str

      The recognized text.

   .. attribute:: subwords
      :type: List[Subword]

      Per-subword timestamp information.

.. class:: Subword

   Recognition result for a single subword.

   .. attribute:: seconds
      :type: float

      The time at which the subword occurs.

   .. attribute:: token
      :type: str

      The subword string.

.. class:: AudioData

   Container for audio data.

   .. attribute:: waveform
      :type: numpy.array

      The audio samples.

   .. attribute:: samplerate
      :type: int

      The sampling rate.
