Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Guided #723

Closed
wants to merge 32 commits into from
Closed

Guided #723

wants to merge 32 commits into from

Conversation

Patchethium
Copy link
Contributor

The continue of #252, use a custom forced aligner snfa instead of julius for the sake of my nerve.
snfa is very young and unstable, I'll keep this API in experiment sub dir and lock with --enable-guided as in #252.

@github-actions
Copy link

github-actions bot commented Aug 13, 2023

Coverage Result

Resultを開く
Name Stmts Miss Cover
run.py 446 296 coverage-34%
voicevox_engine/init.py 1 0 coverage-100%
voicevox_engine/acoustic_feature_extractor.py 75 0 coverage-100%
voicevox_engine/cancellable_engine.py 85 66 coverage-22%
voicevox_engine/dev/core/init.py 2 0 coverage-100%
voicevox_engine/dev/core/mock.py 27 12 coverage-56%
voicevox_engine/dev/synthesis_engine/init.py 2 0 coverage-100%
voicevox_engine/dev/synthesis_engine/mock.py 38 3 coverage-92%
voicevox_engine/downloadable_library.py 93 5 coverage-95%
voicevox_engine/engine_manifest/EngineManifest.py 34 0 coverage-100%
voicevox_engine/engine_manifest/EngineManifestLoader.py 12 0 coverage-100%
voicevox_engine/engine_manifest/init.py 3 0 coverage-100%
voicevox_engine/full_context_label.py 162 3 coverage-98%
voicevox_engine/guided/init.py 0 0 coverage-100%
voicevox_engine/guided/extractor.py 67 3 coverage-96%
voicevox_engine/kana_parser.py 86 1 coverage-99%
voicevox_engine/metas/Metas.py 33 0 coverage-100%
voicevox_engine/metas/MetasStore.py 29 12 coverage-59%
voicevox_engine/metas/init.py 2 0 coverage-100%
voicevox_engine/model.py 160 9 coverage-94%
voicevox_engine/mora_list.py 4 0 coverage-100%
voicevox_engine/morphing.py 70 46 coverage-34%
voicevox_engine/part_of_speech_data.py 5 0 coverage-100%
voicevox_engine/preset/Preset.py 12 0 coverage-100%
voicevox_engine/preset/PresetError.py 2 0 coverage-100%
voicevox_engine/preset/PresetManager.py 81 2 coverage-98%
voicevox_engine/preset/init.py 4 0 coverage-100%
voicevox_engine/setting/Setting.py 11 0 coverage-100%
voicevox_engine/setting/SettingLoader.py 19 0 coverage-100%
voicevox_engine/setting/init.py 3 0 coverage-100%
voicevox_engine/synthesis_engine/init.py 5 0 coverage-100%
voicevox_engine/synthesis_engine/core_wrapper.py 201 146 coverage-27%
voicevox_engine/synthesis_engine/make_synthesis_engines.py 59 30 coverage-49%
voicevox_engine/synthesis_engine/synthesis_engine.py 156 35 coverage-78%
voicevox_engine/synthesis_engine/synthesis_engine_base.py 70 10 coverage-86%
voicevox_engine/user_dict.py 144 11 coverage-92%
voicevox_engine/utility/init.py 5 0 coverage-100%
voicevox_engine/utility/connect_base64_waves.py 37 0 coverage-100%
voicevox_engine/utility/core_version_utility.py 8 1 coverage-88%
voicevox_engine/utility/mutex_utility.py 10 0 coverage-100%
voicevox_engine/utility/path_utility.py 26 8 coverage-69%
TOTAL 2289 699 coverage-69%

Copy link
Member

@Hiroshiba Hiroshiba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand this is still a draft, but I've taken a look at the pull request!
I don't have any comments regarding the direction at the moment.

voicevox_engine/guided/extractor.py Outdated Show resolved Hide resolved
@Patchethium Patchethium marked this pull request as ready for review August 15, 2023 18:02
@Patchethium Patchethium requested a review from a team as a code owner August 15, 2023 18:02
@Patchethium Patchethium requested review from Hiroshiba and removed request for a team August 15, 2023 18:02
@Patchethium Patchethium marked this pull request as draft August 15, 2023 18:03
@Patchethium
Copy link
Contributor Author

Sorry, I forgot to set up the build pipeline.

@Patchethium Patchethium marked this pull request as ready for review August 15, 2023 19:48
@Patchethium
Copy link
Contributor Author

Think it's ready for review.

Copy link
Member

@sevenc-nanashi sevenc-nanashi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

音声合成周りはあまりわからないのでそれ以外を

run.py Outdated Show resolved Hide resolved
run.py Outdated Show resolved Hide resolved
voicevox_engine/synthesis_engine/make_synthesis_engines.py Outdated Show resolved Hide resolved
Copy link
Member

@takana-v takana-v left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

不要だと思われる部分にコメントを付けました。
また、以下のファイルにcv_jp.binのダウンロード処理の追加が必要かと思われます。

  • Dockerfile
  • .github/workflows/build.yml

run.py Outdated
@@ -1287,6 +1333,7 @@ def custom_openapi():
runtime_dirs=args.runtime_dir,
cpu_num_threads=cpu_num_threads,
enable_mock=args.enable_mock,
enable_guided=args.enable_guided,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

この引数は使っていないので不要かと思います。

Suggested change
enable_guided=args.enable_guided,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually used in
https://github.com/Patchethium/voicevox_engine/blob/5b78b4d1c61ed9bfb671cb7154a69815e7c1bf00/run.py#L342-L346
in case as snfa is still unstable, users may want to disable it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

分かりにくく申し訳ありませんでした。
「引数」はargs.enable_guidedの事ではなく、voicevox_engine/synthesis_engine/make_synthesis_engines.pymake_synthesis_engines関数の引数であるenable_guidedの事です。
関数内でenable_guidedは参照されていないため、削除しても問題ないと思います。


Sorry for the confusion.
"Argument" is not args.enable_guided, but enable_guided which is the argument of make_synthesis_engines function in voicevox_engine/synthesis_engine/make_synthesis_engines.py.
There is no reference to enable_guided in the function, so it should be fine to remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, fixed, thanks!

@@ -15,6 +15,7 @@ def make_synthesis_engines(
runtime_dirs: Optional[List[Path]] = None,
cpu_num_threads: Optional[int] = None,
enable_mock: bool = True,
enable_guided: bool = False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
enable_guided: bool = False,

Comment on lines 40 to 41
enable_guided: bool, optional, default=False
入力音声を解析してAudio Queryで返す機能が有効かどうか
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
enable_guided: bool, optional, default=False
入力音声を解析してAudio Queryで返す機能が有効かどうか

Copy link
Member

@Hiroshiba Hiroshiba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR!!

I think it might be best to decide after listening to the actual synthesis results!
(@y-chan, what do you think? 👀 )

I can run it here, but due to time constraints, it might take a while.
I've prepared sample input voices. The filenames are the content of the lines. If you could create a sample with this, it would be very helpful 🙇
hiho_input_data.zip

We can decide later where to place the file and how to set the parameters!


PRありがとうございます!!

実際の合成結果を聞いてから判断するのが良いのかなと思いました!
@y-chan どうでしょう 👀 )

こちらで動かしても良いのですが、時間の都合上かなり遅くなるかもしれません。
サンプルの入力音声をご用意しました。ファイル名がセリフ内容になっています。もしよかったらこれでサンプルを作っていただけるととても助かります 🙇
hiho_input_data.zip

@@ -12,6 +12,8 @@ datas = [
('presets.yaml', '.'),
('default_setting.yml', '.'),
('ui_template', 'ui_template'),
('model', 'model'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

不要そう?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this code is unnecessary.

parser.add_argument(
"--guide_model",
type=Path,
default="cv_jp.bin",
Copy link
Member

@Hiroshiba Hiroshiba Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can decide later where to place the file and how to set the parameters!


ファイル名をどうするかや、引数をどうするかは後で決めさせていただこうと思います!

@Patchethium
Copy link
Contributor Author

@Hiroshiba I made a quick script to do generate audio samples from the reference audio you provided:

Expand
import requests
import json
import os

guide_url = "http://localhost:50021/guide"
query_url = "http://localhost:50021/audio_query"
synthesis_url = "http://localhost:50021/synthesis"

cur_dir = os.getcwd()
result_dir = os.path.join(cur_dir, "result")
if not os.path.exists(result_dir):
    os.mkdir(result_dir)

speaker = 1
filenames = os.listdir(".")
for filename in filenames:
    if filename[-4:] == ".wav":
        print(f"Processing: {filename}")
        full_guide_url = f"{guide_url}?speaker={speaker}&normalize=true&ref_path={os.path.join(cur_dir, filename)}"
        full_query_url = f"{query_url}?speaker={speaker}&text={filename[:-4]}"
        query = requests.request("POST", full_query_url).text
        headers = {"Content-Type": "application/json", "charset": "utf-8"}
        guided_query = requests.request("POST", full_guide_url, headers=headers, data=query.encode("utf-8")).text
        full_synthesis_url = f"{synthesis_url}?speaker={speaker}"
        audio = requests.request("POST", full_synthesis_url, headers=headers, data=guided_query.encode("utf-8")).content
        with open(os.path.join(result_dir, filename), "wb") as f:
            f.write(audio)

The result is attached as hiho_result.zip. I'd say the quality is not pretty satisfying for the alignment search collapses a several times especially with singing voice. However, it's snfa's problem, which I have the confidence to fix in later upgrades.

Copy link
Member

@Hiroshiba Hiroshiba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for sharing the sample. I was able to intuitively grasp its accuracy!
I agree that there seems to be an issue with the singing.
Separately, I noticed that the control of intonation seems slightly off.
I've commented on possible causes for the observed trend where the pitch tends to go low.
Additionally, there also seems to be a pattern where the voice becomes noticeably quieter, as in the audio clip of "のるしか無いっすよね このビッグウェーブにね," but I couldn't think of a reason for this.


サンプルの共有ありがとうございます。精度を直感的に感じることができました!
歌に関して問題がありそうなのは同感です。
別で、イントネーションの制御が微妙にできていないのが若干気になりました。
ピッチが低くなる傾向がありそうでしたが、その原因となりそうな候補をコメントに書きました。
他にも「のるしか無いっすよね このビッグウェーブにね」の音声のように声が全体的になぜか小さくなっているパターンもありそうですが、こっちは原因を思いつくことができませんでした。

voicevox_engine/guided/extractor.py Outdated Show resolved Hide resolved
@Patchethium
Copy link
Contributor Author

@takana-v I modified the Dockerfile and it should be fine, however, I'm not familiar with github workflow i.e. the .github/workflows/build.yml. Some further instruction would be helpful.

@Hiroshiba I don't know the suitable way to test the guide API. I mean, while testing the synthesis engine it only produces some blank audio, which is not suitable for forced alignment. Should I add an addition audio source, like a clip from common voice dataset (public domain), into the repository and align on that for testing?

Copy link
Member

@takana-v takana-v left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Build run.py with PyInstallerの前に、ダウンロード処理を入れれば大丈夫だと思います。
ダウンロードするファイルをキャッシュするようにするとより良いですが、ひとまずはキャッシュ無しでも良いと思います。

- name: Download model for guided synthesis
  run: curl -L https://github.com/Patchethium/snfa/releases/download/v0.0.1/cv_jp.bin -o ./cv_jp.bin

Comment on lines 39 to 40
enable_guided: bool, optional, default=False
入力音声を解析して音声合成クエリで返す機能が有効かどうか
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

消し忘れかもしれません。

Suggested change
enable_guided: bool, optional, default=False
入力音声を解析して音声合成クエリで返す機能が有効かどうか

@Hiroshiba
Copy link
Member

@Patchethium
It's great that you're implementing the tests!!
It seems like a good idea to test if the expected AudioQuery is returned when providing a sample voice. The sample voice can be from the public domain, but my voice samples are also okay. For example, the "もしもーし 聞こえてますか.wav" seems to have good synthesis quality, so how about adding this audio data to the test directory as an OSS data?


テストを実装するの、良いですね!!
サンプル音声を与えた時、想定通りのAudioQueryが帰ってくることをテストするのが良さそうに感じました。
サンプル音声はパブリックドメインのものでも良いですが、僕の音声サンプルでも大丈夫です。
例えば「もしもーし 聞こえてますか.wav」などは合成結果の品質も良さそうなので、この音声データをOSSデータとしてテストディレクトリに放り込むのはどうでしょうか。

@Patchethium
Copy link
Contributor Author

That should be all. I'll finish the GUI in no time based on the old code.

Copy link
Member

@Hiroshiba Hiroshiba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's shaping up nicely!
I have a suggestion to improve the maintainability of VOICEVOX.

Comment on lines +38 to +39
# convert to int, to avoid overflow in np.int32 type
wav = resample(wav, int(aligner.sr) * int(wav.shape[0]) // src_sr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a better approach?

Suggested change
# convert to int, to avoid overflow in np.int32 type
wav = resample(wav, int(aligner.sr) * int(wav.shape[0]) // src_sr)
# convert to int, to avoid overflow in np.int32 type
wav = resample(wav, int(aligner.sr * wav.shape[0]) // src_sr)

Comment on lines +99 to +102
# Download snfa forced aligner model
- name: Download model for guided synthesis
run: curl -N -L https://github.com/Patchethium/snfa/releases/download/v0.0.1/cv_jp.bin -o ./cv_jp.bin

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a suggestion that could improve the usability of the library! What do you think about bundling this binary file within the snfa library or adding a feature to automatically download the model file if it's missing?
Bundling is fairly common; for example, the soundfile library includes a DLL. Auto-downloading is also a common feature; for example, pyopenjtalk automatically downloads a missing dictionary
https://github.com/r9y9/pyopenjtalk/blob/22852ba6e36faaf2589b458e731c701e24f9dc9d/pyopenjtalk/__init__.py#L77-L79.


ライブラリの使い勝手が上がりそうな提案があります!
このバイナリファイルをsnfaライブラリの中に同梱したり、あるいはモデルファイルがなかったら自動でダウンロードする機能をつけるのはどうでしょうか?
同梱するのは結構普通のことで、例えばsoundfileなどもdllが同梱されていたと思います。
自動ダウンロードもよくある機能で、例えばpyopenjtalkは辞書がない場合に自動的にダウンロードしています。
https://github.com/r9y9/pyopenjtalk/blob/22852ba6e36faaf2589b458e731c701e24f9dc9d/pyopenjtalk/__init__.py#L77-L79

@@ -12,6 +12,8 @@ datas = [
('presets.yaml', '.'),
('default_setting.yml', '.'),
('ui_template', 'ui_template'),
('model', 'model'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this code is unnecessary.

Comment on lines +149 to +151
extract(
wav, sr, query=self.audio_query, model_path="cv_jp.bin"
) # as long as it doesn't throw exceptions it's okay
Copy link
Member

@Hiroshiba Hiroshiba Sep 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be a good idea to write code to verify whether the output of extract matches the expected results. What do you think? 👀

@tarepan
Copy link
Contributor

tarepan commented Mar 25, 2024

@Hiroshiba
こちらの PR はどのような状態でしょうか?
PR 概要から関連 issue や機能詳細がわからないため、現時点では引き継ぎも厳しそうに感じます。

@Patchethium
Copy link
Contributor Author

The current state is that I'm not satisfied with the alignment and packaging strategy of the snfa lib, neither do I have time to really sit down and improve it.

So if you think it doesn't look good to be hanging on the pr tab, it's okay to close it and I may reopen it when I eventually release a better version of snfa.

@tarepan
Copy link
Contributor

tarepan commented Mar 25, 2024

@Patchethium
Thank you for quick response!

I'm not satisfied with the alignment and packaging strategy of the snfa lib, neither do I have time to really sit down and improve it.

I understood the situation👍️
Please feel free to improve this PR at your own pace!

if you think it doesn't look good to be hanging on the pr tab

No problem, you can keep this PR on PR tab👍️
Instead, can you do two things for us?

  1. Convert this PR as draft (because normal PR means Ready-Review)
  2. Write PR detail in initial comment (PULL_REQUEST_TEMPLATE.md will help you)

@Patchethium
Copy link
Contributor Author

Well it was ready for review as I was expecting others to improve it after it's released as an experimental feature but things didn't go as expected.
It's good to keep it as a draft but I prefer to close because my enthusiasm is not as much as the time I created this pr. I may pretty much abandon it.
I'll describe this pr with more detail when I have the time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants