Regarding the issue of sentence length #916

heartInsert · 2024-11-03T04:03:17Z

Some transcribed sentences are too long for subtitle files. Is there a way to limit the length of transcribed sentences, or to split long sentences in code? Thanks.

jonathanfox5 · 2024-11-22T22:24:04Z

I've been playing about with this today. The SubtitlesProcessor module included with whisperx is really good!

from whisperx.SubtitlesProcessor import SubtitlesProcessor

# Do all of your whisper transcribing / alignment here.
# The output of the alignment stage should be an object called `result`.

# All variable names below apart from `result` are settings that can be exposed to the user.
subtitles_processor = SubtitlesProcessor(
    result["segments"],
    language_code,  # str, two-letter code identifying the language
    max_line_length=max_line_length,  # int, around 100 has been working for me
    min_char_length_splitter=sub_split_threshold,  # int, around 70 has been working for me
    is_vtt=is_vtt,  # bool, True for VTT, False for SRT format
)
# output_path is a str with your desired filename
subtitles_processor.save(output_path, advanced_splitting=True)

There's an alternative example in the pull request here
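
For anyone who wants to see the basic idea without the library, here is a minimal sketch of splitting one aligned segment at word boundaries so that no piece exceeds a character limit. This is not how SubtitlesProcessor works internally. `split_segment` and `max_chars` are hypothetical names, and the sketch assumes each segment carries a `words` list whose entries have a `word` key and usually `start` / `end` keys, falling back to the segment's own times when a word has no timestamp.

```python
def split_segment(segment, max_chars=70):
    """Split one aligned segment into pieces of at most max_chars characters.

    Splits only between words; a single word longer than max_chars
    is kept whole in its own piece.
    """
    pieces, current = [], []
    for word in segment["words"]:
        candidate = " ".join(w["word"] for w in current + [word])
        # Start a new piece when adding this word would exceed the limit.
        if current and len(candidate) > max_chars:
            pieces.append(current)
            current = []
        current.append(word)
    if current:
        pieces.append(current)

    result = []
    for piece in pieces:
        # Assumption: some words may lack timestamps, so fall back to the
        # segment's own start/end in that case.
        starts = [w["start"] for w in piece if "start" in w]
        ends = [w["end"] for w in piece if "end" in w]
        result.append({
            "start": starts[0] if starts else segment["start"],
            "end": ends[-1] if ends else segment["end"],
            "text": " ".join(w["word"] for w in piece),
        })
    return result


# Made-up segment for illustration only.
example_segment = {
    "start": 0.0,
    "end": 3.0,
    "words": [
        {"word": "one", "start": 0.0, "end": 0.5},
        {"word": "two", "start": 0.6, "end": 1.0},
        {"word": "three", "start": 1.1, "end": 2.0},
        {"word": "four", "start": 2.1, "end": 3.0},
    ],
}
example_pieces = split_segment(example_segment, max_chars=9)
```

With the tiny limit of 9 characters, the example segment is split into three pieces: "one two", "three", and "four". The library's `advanced_splitting` option is the better choice for real subtitles; this only illustrates the length cap.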

heartInsert · 2024-11-23T11:08:23Z

I really love you, bro. You are my hero.
