
Long sequence pronunciation issue of pre-train model for custom language #675

Open
4 tasks done
saifulislam79 opened this issue Dec 28, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@saifulislam79

Checks

  • This template is only for bug reports, usage problems go with 'Help Wanted'.
  • I have thoroughly reviewed the project documentation but couldn't find information to solve my problem.
  • I have searched for existing issues, including closed ones, and couldn't find a solution.
  • I confirm that I am using English to submit this report in order to facilitate communication.

Environment Details

  1. Same as the project's stated requirements.

Steps to Reproduce

  1. Same as the documented steps.

✔️ Expected Behavior

I have trained a model for a custom language for around 500,000 steps with a vocab size of 134, but the inference audio is not clear for long sequences and does not match the input text. Short text is generated well; the problem appears with sequences of roughly 100 characters or more.

❌ Actual Behavior

What causes the issue with long sequences of roughly 150 characters or more in a sentence?

@saifulislam79 saifulislam79 added the bug Something isn't working label Dec 28, 2024
@SWivid
Owner

SWivid commented Dec 28, 2024

The released base model is trained on an up-to-30-second audio dataset.
The inference code clips the reference audio to a maximum of 15 seconds if it is longer,
so generation of up to 15 seconds is supported.

If your dataset contains samples of at most 20 seconds and you train from scratch, the model never sees samples close to 30 seconds.
Thus, with the default settings, only an up-to-5-second generation is possible if you provide a 15-second reference audio.
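The duration arithmetic described above can be sketched as a small helper. Note that `max_generation_sec` is a hypothetical function for illustration, not part of the repo; it assumes the simplified view that reference audio plus generated audio must fit within the maximum sample length seen during training.

```python
def max_generation_sec(train_max_sec: float, ref_audio_sec: float) -> float:
    """Rough upper bound on generated audio length.

    Assumes ref audio and generated audio together must fit within the
    longest sample duration the model saw during training (a simplified
    reading of the explanation above).
    """
    return max(0.0, train_max_sec - ref_audio_sec)


# Base model: 30s training cap, 15s ref -> up to 15s of generation.
print(max_generation_sec(30.0, 15.0))
# Custom model trained on up-to-20s data, 15s ref -> only ~5s of generation.
print(max_generation_sec(20.0, 15.0))
```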

Try modifying the code in these places:

if clip_short:
    # 1. try to find long silence for clipping
    non_silent_segs = silence.split_on_silence(
        aseg, min_silence_len=1000, silence_thresh=-50, keep_silence=1000, seek_step=10
    )
    non_silent_wave = AudioSegment.silent(duration=0)
    for non_silent_seg in non_silent_segs:
        if len(non_silent_wave) > 6000 and len(non_silent_wave + non_silent_seg) > 15000:
            show_info("Audio is over 15s, clipping short. (1)")
            break
        non_silent_wave += non_silent_seg

    # 2. try to find short silence for clipping if 1. failed
    if len(non_silent_wave) > 15000:
        non_silent_segs = silence.split_on_silence(
            aseg, min_silence_len=100, silence_thresh=-40, keep_silence=1000, seek_step=10
        )
        non_silent_wave = AudioSegment.silent(duration=0)
        for non_silent_seg in non_silent_segs:
            if len(non_silent_wave) > 6000 and len(non_silent_wave + non_silent_seg) > 15000:
                show_info("Audio is over 15s, clipping short. (2)")
                break
            non_silent_wave += non_silent_seg
        aseg = non_silent_wave

    # 3. if no proper silence found for clipping
    if len(aseg) > 15000:
        aseg = aseg[:15000]
        show_info("Audio is over 15s, clipping short. (3)")

def chunk_text(text, max_chars=135):
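The `chunk_text` function referenced above splits input text into pieces bounded by a character count before synthesis. A minimal greedy splitter in the same spirit (a sketch for illustration, not the repo's actual implementation) might look like:

```python
import re


def chunk_text_sketch(text: str, max_chars: int = 135) -> list[str]:
    """Greedily pack sentences into chunks of at most max_chars characters.

    Sentences are split on whitespace following end punctuation; a chunk is
    flushed whenever adding the next sentence would exceed the budget.
    """
    sentences = re.split(r"(?<=[.!?;])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Lowering `max_chars` forces shorter chunks, which keeps each synthesized segment within the duration the model was trained to handle.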

@saifulislam79
Author

@SWivid Thank you for your response, but I want to know how many up-to-30-second samples a custom dataset needs to contain to handle long sequences. Roughly, how frequently do 30-second samples appear in your dataset, such that long-sequence generation remains clear and natural?

@SWivid
Copy link
Owner

SWivid commented Dec 29, 2024

We use the Emilia dataset, which is an open-source one; you can check out everything you need at https://huggingface.co/datasets/amphion/Emilia-Dataset,
and also Amphion's GitHub repo for the pipeline they use to organize this corpus.

@saifulislam79
Author

@SWivid Another query regarding the long-sequence issue: you mentioned that the base model handles long sequences, and based on your observation I am generating data for a custom dataset. But if I train a small variant like the F5-TTS small model, will it also solve the long-sequence issue, provided that long samples appear in my dataset?

@SWivid
Owner

SWivid commented Dec 30, 2024

The released base model is trained on an up-to-30-second audio dataset.

The point is to include up-to-30-second audio samples; the model size does not matter.
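Since the advice above hinges on the dataset containing long samples, it can help to audit clip durations before training. The sketch below assumes you already have a list of clip durations in seconds (e.g. computed from your manifest); `coverage_report` and its threshold are hypothetical names, not part of the repo.

```python
def coverage_report(durations_sec: list[float], long_threshold_sec: float = 20.0) -> dict:
    """Summarize how much of a dataset consists of long clips.

    Returns total clip count, the count and ratio of clips at or above
    long_threshold_sec, and the longest clip duration.
    """
    long_clips = [d for d in durations_sec if d >= long_threshold_sec]
    return {
        "total": len(durations_sec),
        "long_count": len(long_clips),
        "long_ratio": len(long_clips) / len(durations_sec) if durations_sec else 0.0,
        "max_sec": max(durations_sec, default=0.0),
    }


# Example: a dataset whose longest clip is 29s and 40% of clips exceed 20s.
print(coverage_report([3.2, 12.0, 25.5, 29.0, 8.1]))
```

If `max_sec` is well below 30 seconds, the trained model will not have seen long samples, matching the behavior discussed in this thread.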
