Showing 11 changed files with 400 additions and 22 deletions.
@@ -3,5 +3,4 @@ Pygments==2.13.0
 myst-parser==0.18.1
 furo==2022.12.7
 sphinxcontrib-googleanalytics
 djlint
--e ./pkg/reazon_theme
@@ -0,0 +1,125 @@
======================================================================
(2024-08-01) ReazonSpeech v2.1: Setting a New Standard in Japanese ASR
======================================================================
Today, we're excited to announce ReazonSpeech v2.1. In this release, we
publish ReazonSpeech-k2-v2, an open-source Japanese ASR model that sets
new records in benchmark tests. It is built on the
`Next-gen Kaldi framework <https://k2-fsa.org/>`_ and distributed in
the platform-neutral
`Open Neural Network Exchange (ONNX) format <https://github.com/onnx/onnx>`_.
ReazonSpeech-k2-v2 excels in accuracy, compactness, and inference speed,
and can run on-device without a GPU.
We published the ReazonSpeech-k2-v2 model under the Apache 2.0 license. The
model files and the inference code are readily available on
`Hugging Face <https://huggingface.co/reazon-research/reazonspeech-k2-v2>`_
and
`GitHub <https://github.com/reazon-research/ReazonSpeech>`_.
.. figure:: ../_static/blog/2024-08-01-ReazonSpeech/cer.png

   **Figure 1: ReazonSpeech v2.1 on common Japanese ASR benchmark tests**
What is ReazonSpeech v2.1?
==========================

ReazonSpeech v2.1 represents the latest iteration of Reazon Human Interaction
Lab's ASR research. This release introduces a new Japanese ASR model that:
* Outperforms existing Japanese ASR models on the JSUT-BASIC5000 [#jsut-basic5000]_,
  Common Voice v8.0 [#cv]_, and TEDxJP-10K [#tedx]_ benchmark sets (see the
  chart above).

* Excels in compactness, with only 159M parameters.

* Excels in inference speed, making it one of the fastest models at
  processing short audio inputs.
What enables such outstanding performance is the state-of-the-art Transformer
architecture called Zipformer [#zipformer]_. We trained this novel network
on 35,000 hours of the `ReazonSpeech v2.0 corpus
<https://huggingface.co/datasets/reazon-research/reazonspeech>`_,
which yielded best-in-class performance.
.. tip::

   For further details about the ReazonSpeech-k2-v2 model, the full training
   recipe is available on `k2-fsa/icefall <https://github.com/k2-fsa/icefall/tree/master/egs/reazonspeech/ASR>`_.
Easy deployment with ONNX
=========================

The ReazonSpeech-k2-v2 model is available in the ONNX format, significantly
enhancing its versatility across a wide range of platforms. Leveraging the
ONNX Runtime, which is independent of the PyTorch framework, simplifies the
setup process and facilitates seamless integration across diverse
environments. This adaptability ensures practical application on various
devices even without a GPU, including Linux, macOS, Windows, embedded
systems, Android, and iOS.

For more details about the supported platforms, please refer to the
`Sherpa-ONNX documentation <https://k2-fsa.github.io/sherpa/onnx/index.html>`_.
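As a rough sketch of what ONNX deployment can look like, the following
snippet decodes a WAV file with the ``sherpa-onnx`` Python bindings. The
model file names are placeholders; substitute the actual encoder, decoder,
joiner, and ``tokens.txt`` files shipped with ReazonSpeech-k2-v2.

.. code:: python3

   import soundfile as sf
   import sherpa_onnx

   # Placeholder file names -- use the ONNX files from the
   # reazon-research/reazonspeech-k2-v2 release on Hugging Face.
   recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
       tokens="tokens.txt",
       encoder="encoder.onnx",
       decoder="decoder.onnx",
       joiner="joiner.onnx",
       num_threads=2,
   )

   # Decode 16 kHz mono audio.
   samples, sample_rate = sf.read("test.wav", dtype="float32")
   stream = recognizer.create_stream()
   stream.accept_waveform(sample_rate, samples)
   recognizer.decode_stream(stream)
   print(stream.result.text)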
Reduce memory footprint with quantization
=========================================

We also released an ``int8``-quantized version of the ReazonSpeech-k2-v2 model.
The quantized model exhibits a significantly smaller footprint, as shown
in the following table.

.. table:: Table 1: The effects of quantization on model size

   ============ ================ ================
   FILE         FILE SIZE (FP32) FILE SIZE (INT8)
   ============ ================ ================
   Encoder      565 MB           148 MB
   Decoder      12 MB            3 MB
   Joiner       11 MB            3 MB
   ============ ================ ================
These quantized models (154 MB in total) are up to 10x smaller than comparable
ASR models such as Whisper-Large-v3, enabling their deployment on a wide range
of devices with computational constraints. Notably, when the quantized encoder
is paired with a non-quantized decoder (the ``int8-fp32`` configuration in
Table 2), accuracy remains comparable to the fully non-quantized model. This
enables the deployment of our model even on devices with very limited
computational capacity.
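With the ReazonSpeech Python package, the precision is selected at load time
(see the ``reazonspeech.k2.asr`` API reference):

.. code:: python3

   from reazonspeech.k2.asr import load_model

   # Quantized encoder with a non-quantized decoder.
   model = load_model(device="cpu", precision="int8-fp32")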
.. table:: Table 2: The effects of quantization on accuracy (CER, %)

   ============================== ======= ============ ==========
   Model Name                     JSUT    Common Voice TEDxJP-10K
   ============================== ======= ============ ==========
   ReazonSpeech-k2-v2             6.45    7.85         9.09
   ReazonSpeech-k2-v2 (int8)      6.63    8.19         9.86
   ReazonSpeech-k2-v2 (int8-fp32) 6.45    7.87         9.15
   Whisper Large-v3               7.18    8.18         9.96
   ReazonSpeech-NeMo-v2           7.31    8.81         10.42
   ReazonSpeech-ESPnet-v2         6.89    8.27         9.28
   ============================== ======= ============ ==========
Future goals
============

With this release, we have significantly enhanced both the speed and accuracy
of our Japanese ASR models. By making our model open source on the K2
Sherpa-ONNX platform, we have greatly improved accessibility for a broad range
of users and developers across various platforms.
Looking ahead, we are committed to further advancing our models by expanding
our dataset, developing streaming ASR capabilities, and incorporating
multilingual data to create an exceptional bilingual English-Japanese ASR
model.
This release represents a major milestone, and we are excited to continue
pushing the boundaries of Japanese speech processing technology. Currently,
ReazonSpeech-k2-v2 can process longer segments of audio with the help of
voice activity detection (VAD). In the future, we plan to release a
streaming version of this model that innately supports real-time
transcription.
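To illustrate the idea (this is not the actual pipeline), long audio can be
segmented at silence and each speech segment transcribed separately. Below is
a minimal energy-threshold sketch assuming 16 kHz float32 input; a real VAD
model is far more robust, and the zero-filled input is a placeholder.

.. code:: python3

   import numpy as np
   from reazonspeech.k2.asr import audio_from_numpy, load_model, transcribe

   def split_on_silence(samples, rate, frame=0.1, threshold=1e-4):
       """Crude energy-based segmentation (illustrative stand-in for VAD)."""
       hop = int(rate * frame)
       segments, current = [], []
       for i in range(0, len(samples), hop):
           chunk = samples[i:i + hop]
           if np.mean(chunk ** 2) > threshold:
               current.append(chunk)
           elif current:
               segments.append(np.concatenate(current))
               current = []
       if current:
           segments.append(np.concatenate(current))
       return segments

   model = load_model()
   long_audio = np.zeros(16000 * 60, dtype=np.float32)  # placeholder input
   for segment in split_on_silence(long_audio, 16000):
       result = transcribe(model, audio_from_numpy(segment, 16000))
       print(result.text)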
Footnotes
=========

.. [#jsut-basic5000] Ryosuke Sonobe, Shinnosuke Takamichi and Hiroshi Saruwatari, "JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis," arXiv:1711.00354, 2017.
.. [#cv] https://commonvoice.mozilla.org/
.. [#tedx] https://github.com/laboroai/TEDxJP-10K
.. [#zipformer] https://arxiv.org/abs/2310.11230
source/projects/ReazonSpeech/api/reazonspeech.k2.asr.rst (140 additions, 0 deletions)

@@ -0,0 +1,140 @@
===================
reazonspeech.k2.asr
===================

.. py:module:: reazonspeech.k2.asr

This reference describes the interface for performing speech recognition
with the K2 model.

Functions
=========

.. function:: load_model(device="cpu", precision="fp32")

   Load the ReazonSpeech K2 model.

   :param device: ``cuda``, ``cpu``, or ``coreml``
   :param precision: ``fp32``, ``int8``, or ``int8-fp32``
   :rtype: sherpa_onnx.OfflineRecognizer
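For example, to load the quantized model on CPU:

.. code:: python3

   from reazonspeech.k2.asr import load_model

   # Select the quantized weights at load time.
   model = load_model(device="cpu", precision="int8")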
.. function:: transcribe(model, audio, config=None)

   Transcribe audio with the ReazonSpeech model and return the result.

   **Example**

   .. code:: python3

      from reazonspeech.k2.asr import audio_from_path, load_model, transcribe

      audio = audio_from_path("test.wav")
      model = load_model()
      ret = transcribe(model, audio)

      print('TEXT:')
      print(' -', ret.text)
      print('SUBWORDS:')
      for subword in ret.subwords[:9]:
          print(' -', subword)

   **Output**

   .. code:: yaml

      TEXT:
      - ヤンバルクイナとの出会いは十八歳の時だった
      SUBWORDS:
      - Subword(seconds=0.03, token='ヤ')
      - Subword(seconds=1.36, token='ン')
      - Subword(seconds=1.55, token='バ')
      - Subword(seconds=1.75, token='ル')
      - Subword(seconds=1.91, token='ク')
      - Subword(seconds=2.11, token='イ')
      - Subword(seconds=2.27, token='ナ')
      - Subword(seconds=2.51, token='と')
      - Subword(seconds=2.67, token='の')

   :param sherpa_onnx.OfflineRecognizer model: ReazonSpeech model
   :param AudioData audio: Audio data
   :param TranscribeConfig config: Additional options (optional)
   :rtype: TranscribeResult
Helper functions
================

.. function:: audio_from_path(path)

   Read an audio file and return its audio data.

   :param str path: Path to the audio file
   :rtype: AudioData

.. function:: audio_from_numpy(array, samplerate)

   Take a NumPy array and return audio data.

   :param numpy.ndarray array: Audio data
   :param int samplerate: Sampling rate
   :rtype: AudioData

.. function:: audio_from_tensor(tensor, samplerate)

   Take a PyTorch tensor and return audio data.

   :param torch.Tensor tensor: Audio data
   :param int samplerate: Sampling rate
   :rtype: AudioData
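For instance, audio already held in memory as a NumPy array can be wrapped
and transcribed as follows (the silent one-second input is purely
illustrative):

.. code:: python3

   import numpy as np
   from reazonspeech.k2.asr import audio_from_numpy, load_model, transcribe

   # One second of 16 kHz silence as a stand-in for real audio.
   samples = np.zeros(16000, dtype=np.float32)
   audio = audio_from_numpy(samples, 16000)

   model = load_model()
   print(transcribe(model, audio).text)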
Classes
=======

.. class:: TranscribeConfig

   Configuration class for tuning the transcription process.

   .. attribute:: verbose
      :type: bool
      :value: True
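Assuming ``TranscribeConfig`` exposes a dataclass-style constructor (as the
attribute listing suggests), a quieter transcription run might look like:

.. code:: python3

   from reazonspeech.k2.asr import (TranscribeConfig, audio_from_path,
                                    load_model, transcribe)

   # Assumed keyword constructor; verbose defaults to True.
   config = TranscribeConfig(verbose=False)
   ret = transcribe(load_model(), audio_from_path("test.wav"), config)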
.. class:: TranscribeResult

   Data class that holds the transcription result.

   .. attribute:: text
      :type: str

      The recognized text.

   .. attribute:: subwords
      :type: List[Subword]

      Subword-level timestamp information.

.. class:: Subword

   Subword-level recognition result.

   .. attribute:: seconds
      :type: float

      The time at which the subword occurs, in seconds.

   .. attribute:: token
      :type: str

      The subword string.
.. class:: AudioData

   Container that holds audio data.

   .. attribute:: waveform
      :type: numpy.ndarray

      The audio samples.

   .. attribute:: samplerate
      :type: int

      The sampling rate.