Guided #723
Conversation
I understand this is still a draft, but I've taken a look at the pull request!
I don't have any comments regarding the direction at the moment.
Sorry, I forgot to set up the build pipeline.
I think it's ready for review.
I don't know much about the speech-synthesis side, so I reviewed everything else.
I've added comments on the parts that seem unnecessary.
Also, I think the following files need a download step for cv_jp.bin added:
- Dockerfile
- .github/workflows/build.yml
run.py
@@ -1287,6 +1333,7 @@ def custom_openapi():
runtime_dirs=args.runtime_dir,
cpu_num_threads=cpu_num_threads,
enable_mock=args.enable_mock,
enable_guided=args.enable_guided,
This argument isn't used, so I think it's unnecessary.
enable_guided=args.enable_guided,
It's actually used in
https://github.com/Patchethium/voicevox_engine/blob/5b78b4d1c61ed9bfb671cb7154a69815e7c1bf00/run.py#L342-L346
Since snfa is still unstable, users may want to disable it.
Sorry for the confusion.
By "argument" I didn't mean args.enable_guided, but the enable_guided parameter of the make_synthesis_engines function in voicevox_engine/synthesis_engine/make_synthesis_engines.py.
There is no reference to enable_guided inside that function, so it should be fine to remove it.
I see, fixed, thanks!
@@ -15,6 +15,7 @@ def make_synthesis_engines(
runtime_dirs: Optional[List[Path]] = None,
cpu_num_threads: Optional[int] = None,
enable_mock: bool = True,
enable_guided: bool = False,
enable_guided: bool = False,
enable_guided: bool, optional, default=False
    入力音声を解析してAudio Queryで返す機能が有効かどうか
Co-authored-by: Nanashi. <[email protected]>
Thank you for the PR!!
I think it might be best to decide after listening to the actual synthesis results!
(@y-chan, what do you think? 👀 )
I can run it here, but due to time constraints, it might take a while.
I've prepared sample input voices. The filenames are the content of the lines. If you could create a sample with this, it would be very helpful 🙇
hiho_input_data.zip
We can decide later where to place the file and how to set the parameters!
@@ -12,6 +12,8 @@ datas = [
('presets.yaml', '.'),
('default_setting.yml', '.'),
('ui_template', 'ui_template'),
('model', 'model'),
This seems unnecessary?
I think this code is unnecessary.
parser.add_argument(
    "--guide_model",
    type=Path,
    default="cv_jp.bin",
We can decide later where to place the file and how to set the parameters!
@Hiroshiba I made a quick script to generate audio samples from the reference audio you provided:

import requests
import json
import os
guide_url = "http://localhost:50021/guide"
query_url = "http://localhost:50021/audio_query"
synthesis_url = "http://localhost:50021/synthesis"
cur_dir = os.getcwd()
result_dir = os.path.join(cur_dir, "result")
if not os.path.exists(result_dir):
    os.mkdir(result_dir)
speaker = 1
filenames = os.listdir(".")
for filename in filenames:
    if filename[-4:] == ".wav":
        print(f"Processing: {filename}")
        full_guide_url = f"{guide_url}?speaker={speaker}&normalize=true&ref_path={os.path.join(cur_dir, filename)}"
        full_query_url = f"{query_url}?speaker={speaker}&text={filename[:-4]}"
        query = requests.request("POST", full_query_url).text
        headers = {"Content-Type": "application/json", "charset": "utf-8"}
        guided_query = requests.request("POST", full_guide_url, headers=headers, data=query.encode("utf-8")).text
        full_synthesis_url = f"{synthesis_url}?speaker={speaker}"
        audio = requests.request("POST", full_synthesis_url, headers=headers, data=guided_query.encode("utf-8")).content
        with open(os.path.join(result_dir, filename), "wb") as f:
            f.write(audio)

The result is attached as hiho_result.zip. I'd say the quality is not very satisfying, as the alignment search collapses several times, especially with the singing voice. However, it's
Thank you for sharing the sample. I was able to intuitively grasp its accuracy!
I agree that there seems to be an issue with the singing.
Separately, I noticed that the control of intonation seems slightly off.
I've commented on possible causes for the observed trend where the pitch tends to go low.
Additionally, there also seems to be a pattern where the voice becomes noticeably quieter, as in the audio clip of "のるしか無いっすよね このビッグウェーブにね," but I couldn't think of a reason for this.
@takana-v I modified the Dockerfile and it should be fine; however, I'm not familiar with the GitHub workflow, i.e. the build.yml file.

@Hiroshiba I don't know the suitable way to test the guide API. I mean, while testing, the synthesis engine only produces some blank audio, which is not suitable for forced alignment. Should I add an additional audio source, like a clip from the Common Voice dataset (public domain), into the repository and align on that for testing?
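One possible shape for such a test, sketched here under some assumptions: the engine is running locally, a small public-domain clip is checked into the repository (the path and transcript below are placeholders, not files in this PR), and the /guide endpoint behaves as in the script above.

import os

import requests

BASE = "http://localhost:50021"
# Placeholder transcript and reference clip; in practice this would be a short
# public-domain recording whose transcript is known.
TEXT = "テストです"
REF_WAV = os.path.abspath("test/data/reference.wav")


def test_guide_returns_audio_query():
    # Build a plain query first, then let /guide align it against the reference audio.
    query = requests.post(
        f"{BASE}/audio_query", params={"speaker": 1, "text": TEXT}
    ).text
    guided = requests.post(
        f"{BASE}/guide",
        params={"speaker": 1, "normalize": "true", "ref_path": REF_WAV},
        headers={"Content-Type": "application/json"},
        data=query.encode("utf-8"),
    )
    assert guided.status_code == 200
    # The guided result should still be an AudioQuery with accent phrases.
    assert "accent_phrases" in guided.json()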
I think it's fine if you add the download step before the "Build run.py with PyInstaller" step.
It would be even better to cache the downloaded file, but for now it's fine without caching.
- name: Download model for guided synthesis
  run: curl -L https://github.com/Patchethium/snfa/releases/download/v0.0.1/cv_jp.bin -o ./cv_jp.bin
enable_guided: bool, optional, default=False
    入力音声を解析して音声合成クエリで返す機能が有効かどうか
This might be something you forgot to delete.
@Patchethium Implementing a test sounds great!!
use -N to avoid overwrite
That should be all. I'll finish the GUI in no time based on the old code.
I think it's shaping up nicely!
I have a suggestion to improve the maintainability of VOICEVOX.
# convert to int, to avoid overflow in np.int32 type
wav = resample(wav, int(aligner.sr) * int(wav.shape[0]) // src_sr)
This might be a better approach?
# convert to int, to avoid overflow in np.int32 type
wav = resample(wav, int(aligner.sr * wav.shape[0]) // src_sr)
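For reference, a tiny standalone snippet (not from the PR; the values are made up) showing why the explicit int() casts matter when the operands are NumPy 32-bit integers:

import numpy as np

# Made-up values: a target sample rate and a long clip's sample count.
sr = np.int32(24_000)
n_samples = np.int32(90_000_000)

# NumPy keeps the int32 dtype here, so the product wraps around on overflow
# (and may emit a RuntimeWarning) instead of giving the true value.
wrapped = sr * n_samples

# Casting to Python ints first gives arbitrary-precision arithmetic.
exact = int(sr) * int(n_samples)

print(wrapped, exact)  # the two differ once the product exceeds 2**31 - 1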
# Download snfa forced aligner model
- name: Download model for guided synthesis
  run: curl -N -L https://github.com/Patchethium/snfa/releases/download/v0.0.1/cv_jp.bin -o ./cv_jp.bin
|
I have a suggestion that could improve the usability of the library! What do you think about bundling this binary file within the snfa library or adding a feature to automatically download the model file if it's missing?
Bundling is fairly common; for example, the soundfile library includes a DLL. Auto-downloading is also a common feature; for example, pyopenjtalk automatically downloads a missing dictionary
https://github.com/r9y9/pyopenjtalk/blob/22852ba6e36faaf2589b458e731c701e24f9dc9d/pyopenjtalk/__init__.py#L77-L79.
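As a rough illustration of the auto-download idea (this is not snfa's actual API; the function name and the cache location are assumptions for illustration), something like:

# Hypothetical sketch of lazy model download inside snfa, in the spirit of
# pyopenjtalk's dictionary handling. ensure_model() and DEFAULT_MODEL_PATH
# are illustrative names, not part of the real library.
import os
import urllib.request

MODEL_URL = "https://github.com/Patchethium/snfa/releases/download/v0.0.1/cv_jp.bin"
DEFAULT_MODEL_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), "cv_jp.bin")


def ensure_model(path: str = DEFAULT_MODEL_PATH) -> str:
    """Download the aligner model on first use and return its local path."""
    if not os.path.exists(path):
        print(f"Downloading aligner model from {MODEL_URL} ...")
        urllib.request.urlretrieve(MODEL_URL, path)
    return path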
@@ -12,6 +12,8 @@ datas = [
('presets.yaml', '.'),
('default_setting.yml', '.'),
('ui_template', 'ui_template'),
('model', 'model'),
I think this code is unnecessary.
extract(
    wav, sr, query=self.audio_query, model_path="cv_jp.bin"
)  # as long as it doesn't throw exceptions it's okay
It might be a good idea to write code to verify whether the output of extract matches the expected results. What do you think? 👀
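For instance, a check along these lines might work (a sketch only, continuing the quoted test code above; it assumes extract() returns an AudioQuery whose moras carry the aligned timings, which is an assumption rather than a documented contract):

# Continuation of the test above; wav, sr and self.audio_query come from the
# surrounding test code. The assertions assume extract() fills mora timings
# into an AudioQuery (an assumption for illustration).
query = extract(wav, sr, query=self.audio_query, model_path="cv_jp.bin")

moras = [m for phrase in query.accent_phrases for m in phrase.moras]
# Every aligned vowel should get a positive duration...
assert all(m.vowel_length > 0 for m in moras)
# ...and the summed durations should not exceed the reference audio's length.
total = sum((m.consonant_length or 0.0) + m.vowel_length for m in moras)
assert total <= len(wav) / sr + 1e-2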
@Hiroshiba
The current state is that I'm not satisfied with the alignment and the packaging strategy of the snfa aligner. So if you think it doesn't look good to be hanging on the PR tab, it's okay to close it, and I may reopen it when I eventually release a better version of snfa.
@Patchethium
I understand the situation 👍️
No problem, you can keep this PR on the PR tab 👍️
|
Well, it was ready for review since I was expecting others to improve it after it was released as an experimental feature, but things didn't go as expected.
The continuation of #252: use a custom forced aligner snfa instead of julius for the sake of my nerves. snfa is very young and unstable, so I'll keep this API in the experiment sub dir and lock it behind --enable-guided as in #252.