Guided Synthesis #252

Merged
merged 32 commits into from
Mar 10, 2022

Conversation

Patchethium
Contributor

@Patchethium Patchethium commented Dec 28, 2021

Contents

As discussed in #231, I got julius4seg working for the forced alignment, and it seems to work at least some of the time. Here's an example (audio included):

guided_good.mp4

Most of the time, however, it just throws various exceptions or produces poorly synthesized voices. I think it's worth sharing this progress anyway, and maybe getting some help from developers who are more familiar with audio and signal processing.

I have some problems here:

  1. I use scipy to resample audio to match Julius' 16 kHz requirement, but in testing the scipy output sometimes fails while the same file resampled with Audacity works, which is pretty confusing.
  2. julius4seg keeps throwing exceptions even when the sample rate is right.
  3. Most of the results are pretty weird and low quality, like this failed example (audio included):
     guided_bad.mp4
     I don't know what's going on here...
  4. I'm using a simple min-max to normalize the extracted f0; I guess there should be better methods...

So far I have only implemented the second method I mentioned in #231; hopefully the first one will perform better. Until that is done, I'll keep this PR as a WIP.

Issue

#231

@Patchethium Patchethium marked this pull request as draft December 28, 2021 13:31
@Hiroshiba
Member

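I think this is an amazing result!!!!!!
In response to your enthusiasm, I'll try to give every piece of advice I can think of.

First, about Julius raising errors.
It may be because Julius ignores silence by default.
If you specify -nostrip so that silence is not ignored, the errors may go away.
I assume you are using this julius4seg, but using my julius4seg, to which I have made various modifications, may solve a number of problems.
https://github.com/Hiroshiba/julius4seg

The low quality is probably because the pitch extraction method differs from what the VOICEVOX core expects.
VOICEVOX takes the logarithm rather than applying min-max normalization.
https://github.com/Hiroshiba/acoustic_feature_extractor/blob/478f730c1cd5b24015c73872c2186d123be1b3bc/acoustic_feature_extractor/data/f0.py#L60-L62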
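
A minimal sketch of the difference between the two F0 treatments mentioned above, assuming unvoiced frames are marked with 0 and should be left untouched. The function names are hypothetical, plain lists stand in for NumPy arrays to keep the sketch dependency-free, and this is not the code at the linked f0.py.

```python
import math


def minmax_normalize(f0):
    """Min-max scaling of voiced frames to [0, 1] (the approach being replaced)."""
    voiced = [v for v in f0 if v > 0]
    lo, hi = min(voiced), max(voiced)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat contour
    return [(v - lo) / span if v > 0 else 0.0 for v in f0]


def log_f0(f0):
    """Natural log of voiced frames; unvoiced frames (0) stay 0."""
    return [math.log(v) if v > 0 else 0.0 for v in f0]


f0_hz = [0.0, 110.0, 220.0, 440.0, 0.0]
print(minmax_normalize(f0_hz))
print(log_f0(f0_hz))
```

In the log domain an octave (a doubling in Hz) is always a step of ln 2, whereas the min-max result depends on the range of the particular recording and maps the lowest voiced frame to 0, the same value used for unvoiced frames.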

Have a great year!

@Patchethium
Contributor Author

Happy New Year!

After changing the normalization algorithm, the results turned out better than I expected. Check out this example:

example1.mp4

I also implemented the accent phrases part, which gives a pretty decent result too:

usage.mp4
example2.mp4

As for the forced alignment part, unfortunately, switching to your fork doesn't seem to help much 😢. Julius still throws exceptions fairly frequently, but I'm starting to consider that acceptable, since it reflects how unreliable its ASR is. I added simple error handling that tells the user to try a different audio file when Julius crashes; I guess that's enough in practice until someone kind enough to improve this part comes along 🙏

As a result, I'm marking this PR as ready for review. Feel free to bring up any questions.

@Patchethium Patchethium marked this pull request as ready for review December 30, 2021 11:24
@Patchethium Patchethium changed the title from "[WIP] Guided synthesis" to "Guided Synthesis" Dec 30, 2021
@Hiroshiba
Member

Hmm, the changes are indeed large, which makes this a tough one.

Would it be possible to split julius4seg out as a separate library?
Is the code almost unchanged from Hiroshiba/julius4seg?
If so, how about forking it into the VOICEVOX/julius4seg repository and then using it from voicevox_engine?

@Patchethium
Contributor Author

Oh again? Buhhhh...

No, I don't think I'm capable of doing that, whether making it a Git submodule or creating a Python module that can be downloaded and installed with pip. Excluding FastAPI's Form() from flake8 took me almost an hour before I gave up; I just don't want to jump into another rabbit hole and end up spending hours figuring out tons of configuration.

The julius4seg folder has had literally NOTHING changed since I copied it from the original repository, so you can simply ignore it when reviewing, which halves the workload. If you want me to split the two APIs (guided synthesis and guided accent phrases) into separate pull requests, I can do that in five minutes.

@Patchethium
Contributor Author

Patchethium commented Jan 8, 2022

It's been a week, how's it going now?

run.py Outdated
    except ParseKanaError:
        print(traceback.format_exc())
        raise HTTPException(
            status_code=500,
Member

@takana-v takana-v Jan 8, 2022

I think using 422 instead of 500 for the status code is better.
ref #91
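
The distinction can be sketched without any web framework: a parse failure is caused by the caller's input, so it maps to 422 (Unprocessable Entity), while 500 stays reserved for unexpected server-side faults. ParseKanaError below is a local stand-in for the project's exception, and parse_kana_stub and handle_request are hypothetical names used only for illustration.

```python
class ParseKanaError(Exception):
    """Local stand-in for the project's kana parsing error."""


def parse_kana_stub(text: str) -> list:
    # Hypothetical parser: rejects empty input, otherwise splits on "/".
    if not text:
        raise ParseKanaError("text is empty")
    return text.split("/")


def handle_request(text: str) -> tuple:
    """Return (status_code, body) the way an HTTP handler might."""
    try:
        return 200, {"phrases": parse_kana_stub(text)}
    except ParseKanaError as err:
        # Bad client input: report it as 422 rather than a generic 500.
        return 422, {"detail": str(err)}


print(handle_request("a/b"))
print(handle_request(""))
```

In the FastAPI handler under review, the corresponding change would be passing status_code=422 to the existing HTTPException call.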

        self,
        query: AudioQuery,
        speaker_id: int,
        audio_file: Optional[IO],
Member

[QUESTION] Why is the audio_file argument set to Optional?
I think an error will occur if audio_file is None.
https://docs.python.org/3/library/typing.html#typing.Optional
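
To illustrate the question: Optional[IO] is shorthand for Union[IO, None], so the annotation tells callers that None is acceptable. The sketch below shows the two consistent options, using hypothetical function names: either require the file in the signature, or keep Optional and handle None explicitly.

```python
import io
from typing import IO, Optional


def read_required(audio_file: IO) -> bytes:
    # The annotation promises a file object, so it can be used directly.
    return audio_file.read()


def read_optional(audio_file: Optional[IO]) -> bytes:
    # Optional[IO] admits None, so that case must be handled before use.
    if audio_file is None:
        raise ValueError("audio_file is required for guided synthesis")
    return audio_file.read()


print(read_required(io.BytesIO(b"RIFF")))
try:
    read_optional(None)
except ValueError as err:
    print(err)
```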

Member

@Hiroshiba Hiroshiba left a comment

No, I don't think I'm capable of doing that, whether making it a Git submodule or creating a Python module that can be downloaded and installed with pip.

Okay, understood.
In that case, let's treat the guided synthesis feature as experimental for now!

Please create a directory such as voicevox_engine/experimental and move guided_extractor.py and julius4seg into it.
Once it's merged, we can keep refining it to make it cooler!

Comment on lines 76 to 86
    def get_normalize_scale(engine, kana: str, f0: np.ndarray, speaker_id: int):
        f0_avg = _no_nan(np.average(f0[f0 != 0]))
        predicted_phrases, _ = parse_kana(kana, False)
        engine.replace_mora_data(predicted_phrases, speaker_id=speaker_id)
        pitch_list = []
        for phrase in predicted_phrases:
            for mora in phrase.moras:
                pitch_list.append(mora.pitch)
        pitch_list = np.array(pitch_list, dtype=np.float64)
        predicted_avg = _no_nan(np.average(pitch_list[pitch_list != 0]))
        return predicted_avg / f0_avg
Member

@Hiroshiba Hiroshiba Jan 9, 2022

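Actually generating the target speaker's output once to obtain its average pitch is an interesting idea!

Is the intent here to compute the target speaker's average pitch and match the input speaker's pitch to that average?

The accurate way to match pitch is to add the difference between the averages, not to scale by their ratio.
So this function should return predicted_avg - f0_avg, and the caller should apply pitch += diff; that is the correct formula.
Renaming the function to something like get_pitch_diff would also be cooler, I think.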
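
The suggestion can be sketched as follows, assuming pitch values are log-F0 (as noted earlier in this thread) and that 0 marks unvoiced frames. In the log domain, multiplying a frequency by a constant is the same as adding a constant, which is why a difference of means, not a ratio, is the natural correction. get_pitch_diff is the name proposed in the review; shift_pitch and the plain-list inputs are illustrative assumptions.

```python
import math
from statistics import fmean


def get_pitch_diff(predicted_pitch, input_pitch):
    """Difference between the mean voiced pitch of the target and the input."""
    predicted_avg = fmean(p for p in predicted_pitch if p != 0)
    input_avg = fmean(p for p in input_pitch if p != 0)
    return predicted_avg - input_avg


def shift_pitch(input_pitch, diff):
    """Apply `pitch += diff` to voiced frames only; unvoiced frames stay 0."""
    return [p + diff if p != 0 else 0.0 for p in input_pitch]


# Input speaker around 110-130 Hz, target speaker around 220-260 Hz (log-F0).
input_pitch = [0.0, math.log(110.0), math.log(130.0)]
predicted_pitch = [math.log(220.0), math.log(260.0)]

diff = get_pitch_diff(predicted_pitch, input_pitch)
shifted = shift_pitch(input_pitch, diff)
print([round(math.exp(p)) if p != 0 else 0 for p in shifted])
```

Shifting keeps the shape of the intonation contour intact, while scaling log values would stretch or compress the intervals between frames.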


    def guided_accent_phrases(
        self,
        query: AudioQuery,
Member

@Hiroshiba Hiroshiba Jan 9, 2022

This function doesn't need query; I think list[AccentPhrase] is enough. That way the caller doesn't have to go to the trouble of building a query.

kana can be created from list[AccentPhrase]:

    def create_kana(accent_phrases: List[AccentPhrase]) -> str:

run.py Outdated
@@ -206,6 +207,63 @@ def accent_phrases(
        enable_interrogative=enable_interrogative,
    )

@app.post(
    "/guided_accent_phrase",
    response_model=AudioQuery,
Member

List[AccentPhrase] looks like the correct type here.

    response_model=List[AccentPhrase],

run.py Outdated
Comment on lines 216 to 221
    def guided_accent_phrase(
        kana: str = Form(...),  # noqa: B008
        speaker_id: int = Form(...),  # noqa: B008
        normalize: int = Form(...),  # noqa: B008
        audio_file: UploadFile = File(...),  # noqa: B008
    ):
Member

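Here, kana does not mean "hiragana"; we use it to mean "text in AquesTalk notation".

Matching the format of the other APIs would make this easier for users.
Please align the API format with this one and write it like this: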

    def guided_accent_phrase(
        text: str,
        speaker: int,
        is_kana: bool = False,
        enable_interrogative: bool = enable_interrogative_query_param(),  # noqa: B008
        audio_file: UploadFile = File(...),  # noqa: B008
    ):

@coveralls

Pull Request Test Coverage Report for Build 1641048449

  • 82 of 622 (13.18%) changed or added relevant lines in 6 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-32.4%) to 54.282%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| voicevox_engine/dev/synthesis_engine/mock.py | 4 | 6 | 66.67% |
| voicevox_engine/synthesis_engine/synthesis_engine_base.py | 6 | 8 | 75.0% |
| voicevox_engine/synthesis_engine/synthesis_engine.py | 6 | 69 | 8.7% |
| voicevox_engine/guided_extractor.py | 36 | 125 | 28.8% |
| voicevox_engine/julius4seg/sp_inserter.py | 27 | 116 | 23.28% |
| voicevox_engine/julius4seg/converter.py | 3 | 298 | 1.01% |

Totals
Change from base Build 1640294394: -32.4%
Covered Lines: 767
Relevant Lines: 1413

💛 - Coveralls

@github-actions

github-actions bot commented Jan 9, 2022

Coverage Result

Open result

| Name | Stmts | Miss | Cover |
|---|---|---|---|
| voicevox_engine/__init__.py | 1 | 0 | 100% |
| voicevox_engine/acoustic_feature_extractor.py | 75 | 0 | 100% |
| voicevox_engine/dev/synthesis_engine/__init__.py | 2 | 0 | 100% |
| voicevox_engine/dev/synthesis_engine/mock.py | 41 | 4 | 90% |
| voicevox_engine/experimental/__init__.py | 0 | 0 | 100% |
| voicevox_engine/experimental/guided_extractor.py | 124 | 89 | 28% |
| voicevox_engine/experimental/julius4seg/__init__.py | 0 | 0 | 100% |
| voicevox_engine/experimental/julius4seg/converter.py | 298 | 295 | 1% |
| voicevox_engine/experimental/julius4seg/sp_inserter.py | 116 | 89 | 23% |
| voicevox_engine/full_context_label.py | 162 | 3 | 98% |
| voicevox_engine/kana_parser.py | 86 | 1 | 99% |
| voicevox_engine/model.py | 136 | 7 | 95% |
| voicevox_engine/mora_list.py | 4 | 0 | 100% |
| voicevox_engine/part_of_speech_data.py | 5 | 0 | 100% |
| voicevox_engine/preset/Preset.py | 12 | 0 | 100% |
| voicevox_engine/preset/PresetLoader.py | 34 | 1 | 97% |
| voicevox_engine/preset/__init__.py | 3 | 0 | 100% |
| voicevox_engine/synthesis_engine/__init__.py | 5 | 0 | 100% |
| voicevox_engine/synthesis_engine/core_wrapper.py | 156 | 126 | 19% |
| voicevox_engine/synthesis_engine/make_synthesis_engines.py | 52 | 43 | 17% |
| voicevox_engine/synthesis_engine/synthesis_engine.py | 179 | 66 | 63% |
| voicevox_engine/synthesis_engine/synthesis_engine_base.py | 69 | 9 | 87% |
| voicevox_engine/user_dict.py | 88 | 10 | 89% |
| voicevox_engine/utility/__init__.py | 3 | 0 | 100% |
| voicevox_engine/utility/connect_base64_waves.py | 35 | 3 | 91% |
| voicevox_engine/utility/engine_root.py | 9 | 2 | 78% |
| TOTAL | 1695 | 748 | 56% |

# Conflicts:
#	.gitignore
#	voicevox_engine/dev/synthesis_engine/mock.py
# Conflicts:
#	run.py
#	voicevox_engine/dev/synthesis_engine/mock.py
#	voicevox_engine/synthesis_engine/synthesis_engine.py
#	voicevox_engine/synthesis_engine/synthesis_engine_base.py
@Patchethium
Contributor Author

Should be okay now.

@Yosshi999 Yosshi999 requested a review from Hiroshiba February 21, 2022 14:12
Member

@Hiroshiba Hiroshiba left a comment

Sorry, I'm a little busy this week, so I'd appreciate it if you could wait until next week...!
Four new characters will be added. :->

@Patchethium
Contributor Author

Okay, I'll be dealing with the GUI these days.

Four new characters will be added. :->

That's good, but I'm a bit concerned about the speed at which characters are joining. TTS with characters is itself a niche market, and too many products flooding in may destroy the balance in which customers take the time to accept a new character... Just a thought.

@takana-v takana-v requested a review from Hiroshiba March 3, 2022 07:38
@Hiroshiba
Member

@Patchethium san
Sorry for the delay.
I'm hoping to take a closer look this weekend, so please bear with me just a little longer...!

TTS with characters is itself a niche market, and too many products flooding in may destroy the balance in which customers take the time to accept a new character... Just a thought.

I see...
If I start to feel that the number of characters is reaching saturation, I will review the policy. :->

voicevox_engine/experimental/guided_extractor.py (outdated, resolved)
run.py (resolved)
@Hiroshiba
Member

Sorry to keep you waiting, I've reviewed it!

@Patchethium
Contributor Author

Should be okay now

Member

@Hiroshiba Hiroshiba left a comment

LGTM!!!
Thank you for the README as well!!
I'm really looking forward to the GUI implementation!

@Hiroshiba Hiroshiba merged commit 65e657e into VOICEVOX:master Mar 10, 2022
@Patchethium Patchethium mentioned this pull request Aug 13, 2023