
Long sequence pronunciation issue of pre-train model for custom language #675

Open
4 tasks done
saifulislam79 opened this issue Dec 28, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@saifulislam79

Checks

  • This template is only for bug reports, usage problems go with 'Help Wanted'.
  • I have thoroughly reviewed the project documentation but couldn't find information to solve my problem.
  • I have searched for existing issues, including closed ones, and couldn't find a solution.
  • I confirm that I am using English to submit this report in order to facilitate communication.

Environment Details

  1. Same as the project's stated requirements.

Steps to Reproduce

  1. Same as the documented steps.

✔️ Expected Behavior

I have trained a model for a custom language for around 500,000 steps with a vocab size of 134, but the inference audio is not clear for long sequences and does not match the input text. Short text is generated well; the problem appears with sequences of roughly 100 characters or more.

❌ Actual Behavior

What causes the issue with long sequences of roughly 150 characters or more in a sentence?

@saifulislam79 saifulislam79 added the bug Something isn't working label Dec 28, 2024
@SWivid
Owner

SWivid commented Dec 28, 2024

The released base model is trained on an up-to-30-second audio dataset.
The inference code clips the reference audio to a maximum of 15 seconds if it is longer,
so generation of up to 15 seconds is supported.

If your dataset contains samples of at most 20 seconds and you train from scratch, the model never sees samples close to 30 seconds.
Thus, with the default settings, only an up-to-5-second generation is possible if you provide a 15-second reference audio.
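The duration arithmetic described above can be sketched as a small helper. Note that `max_generation_sec` is a hypothetical function for illustration, not part of the repo; it assumes the simplified view that reference audio plus generated audio must fit within the maximum sample length seen during training.

```python
def max_generation_sec(train_max_sec: float, ref_audio_sec: float) -> float:
    """Rough upper bound on generated audio length.

    Assumes ref audio and generated audio together must fit within the
    longest sample duration the model saw during training (a simplified
    reading of the explanation above).
    """
    return max(0.0, train_max_sec - ref_audio_sec)


# Base model: 30s training cap, 15s ref -> up to 15s of generation.
print(max_generation_sec(30.0, 15.0))
# Custom model trained on up-to-20s data, 15s ref -> only ~5s of generation.
print(max_generation_sec(20.0, 15.0))
```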

Try modifying the code in these places:

if clip_short:
    # 1. try to find long silence for clipping
    non_silent_segs = silence.split_on_silence(
        aseg, min_silence_len=1000, silence_thresh=-50, keep_silence=1000, seek_step=10
    )
    non_silent_wave = AudioSegment.silent(duration=0)
    for non_silent_seg in non_silent_segs:
        if len(non_silent_wave) > 6000 and len(non_silent_wave + non_silent_seg) > 15000:
            show_info("Audio is over 15s, clipping short. (1)")
            break
        non_silent_wave += non_silent_seg

    # 2. try to find short silence for clipping if 1. failed
    if len(non_silent_wave) > 15000:
        non_silent_segs = silence.split_on_silence(
            aseg, min_silence_len=100, silence_thresh=-40, keep_silence=1000, seek_step=10
        )
        non_silent_wave = AudioSegment.silent(duration=0)
        for non_silent_seg in non_silent_segs:
            if len(non_silent_wave) > 6000 and len(non_silent_wave + non_silent_seg) > 15000:
                show_info("Audio is over 15s, clipping short. (2)")
                break
            non_silent_wave += non_silent_seg
        aseg = non_silent_wave

    # 3. if no proper silence found for clipping
    if len(aseg) > 15000:
        aseg = aseg[:15000]
        show_info("Audio is over 15s, clipping short. (3)")

def chunk_text(text, max_chars=135):
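The `chunk_text` function referenced above splits input text into pieces bounded by a character count before synthesis. A minimal greedy splitter in the same spirit (a sketch for illustration, not the repo's actual implementation) might look like:

```python
import re


def chunk_text_sketch(text: str, max_chars: int = 135) -> list[str]:
    """Greedily pack sentences into chunks of at most max_chars characters.

    Sentences are split on whitespace following end punctuation; a chunk is
    flushed whenever adding the next sentence would exceed the budget.
    """
    sentences = re.split(r"(?<=[.!?;])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Lowering `max_chars` forces shorter chunks, which keeps each synthesized segment within the duration the model was trained to handle.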

@saifulislam79
Author

@SWivid Thank you for your response, but I want to know how many up-to-30-second samples a custom dataset needs to contain to handle long sequences. Roughly, how frequently do 30-second samples appear in your dataset, such that long-sequence generation remains clear and natural?

@SWivid
Copy link
Owner

SWivid commented Dec 29, 2024

We use the Emilia dataset, which is an open-source one; you can check out everything you need at https://huggingface.co/datasets/amphion/Emilia-Dataset,
and also Amphion's GitHub repo for the pipeline they use to organize this corpus.

@saifulislam79
Author

@SWivid Another query regarding the long-sequence issue: you mentioned that the base model handles long sequences, and based on your observation I am generating data for a custom dataset. But if I train a small variant like the F5-TTS small model, will it also solve the long-sequence issue, provided that long samples appear in my dataset?

@SWivid
Owner

SWivid commented Dec 30, 2024

The released base model is trained on an up-to-30-second audio dataset.

The point is to include up-to-30-second audio samples; the model size does not matter.
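Since the advice above hinges on the dataset containing long samples, it can help to audit clip durations before training. The sketch below assumes you already have a list of clip durations in seconds (e.g. computed from your manifest); `coverage_report` and its threshold are hypothetical names, not part of the repo.

```python
def coverage_report(durations_sec: list[float], long_threshold_sec: float = 20.0) -> dict:
    """Summarize how much of a dataset consists of long clips.

    Returns total clip count, the count and ratio of clips at or above
    long_threshold_sec, and the longest clip duration.
    """
    long_clips = [d for d in durations_sec if d >= long_threshold_sec]
    return {
        "total": len(durations_sec),
        "long_count": len(long_clips),
        "long_ratio": len(long_clips) / len(durations_sec) if durations_sec else 0.0,
        "max_sec": max(durations_sec, default=0.0),
    }


# Example: a dataset whose longest clip is 29s and 40% of clips exceed 20s.
print(coverage_report([3.2, 12.0, 25.5, 29.0, 8.1]))
```

If `max_sec` is well below 30 seconds, the trained model will not have seen long samples, matching the behavior discussed in this thread.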
