Improve LRC generation #83

Open
Chaphasilor opened this issue Apr 17, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@Chaphasilor

Currently there are a few issues when generating LRC files for lyrics:

  • Instrumental tracks are not properly detected. Here are a few examples of LRC files generated for some of my music:

    Examples

    Direct - So Sure

    [00:14.53]  This is the end of this video, I hope you enjoyed this video, If you
    [00:20.19]  did hit that thumbs up button, it helps me to make good content for
    [00:26.03]  you, other then that, I will see you in tomorrow's video, peace out.
    [00:58.57]  Thanks for watching, I hope you enjoyed this video, If you did hit
    [00:59.97]  that thumbs up button, it helps me to make good content for you,
    [00:59.97]  other then that, I will see you in tomorrow's video, peace out.
    [01:13.84]  Thanks for watching, I hope you enjoyed this video, If you did hit
    [01:28.28]  that thumbs up button, it helps me to make good content for you,
    [01:28.28]  other then that, I will see you in tomorrow's video, peace out.
    [01:44.45]  Thanks for watching, I hope you enjoyed this video, If you did hit
    [01:52.73]  that thumbs up button, it helps me to make good content for you,
    [01:52.84]  other then that, I will see you in tomorrow's video, peace out.
    [02:28.02]  Thanks for watching, I hope you enjoyed this video, If you did hit
    [02:29.97]  that thumbs up button, it helps me to make good content for you,
    [02:29.97]  other then that, I will see you in tomorrow's video, peace out.
    [02:58.56]  Thanks for watching, I hope you enjoyed this video, If you did hit
    [02:59.97]  that thumbs up button, it helps me to make good content for you,
    [02:59.97]  other then that, I will see you in tomorrow's video, peace out.
    [03:16.03]  Thanks for watching, I hope you enjoyed this video, If you did hit
    [03:22.87]  that thumbs up button, it helps me to make good content for you,
    [03:23.02]  other then that, I will see you in tomorrow's video, peace out.
    [03:55.96]  Thanks for watching, I hope you enjoyed this video, If you did hit
    [03:57.43]  that thumbs up button, it helps me to make good content for you,
    [03:57.43]  other then that, I will see you in tomorrow's video, peace out.
    [04:01.87]  Thanks for watching, I hope you enjoyed this video, If you did hit
    [04:03.50]  that thumbs up button, it helps me to make good content for you,
    [04:03.50]  other then that, I will see you in tomorrow's video, peace out.

    Direct - Opal

    [00:28.30]  Hello, and welcome to a new episode of my channel, where I'm going
    [00:29.53]  to be showing you how to make the most of your time in your life.
    [00:29.53]  I hope you enjoy this video, and I hope you enjoy the rest of your day.
    [00:58.57]  I don't know what to do with my life, I don't know what to do with my life
    [01:28.57]  I don't know what to do with my life
    [01:58.57]  I don't know what to do with my life
    [02:28.58]  I don't know what to do with my life
    [02:59.15]  I don't know what to do with my life
    [03:14.96]  I don't know what to do with my life
    

    ENV - Brave

    [01:34.43]  ចតសកសាបនបាថកន
    [01:35.84]  កាងត។ំរិណ្ gesamហែកសាង︶បោានដិْ។
    [01:42.92]  ឡឹូ្ក절បបោរិ ំហងកសំឯពពнова�ើមបាងleans�ន។
    [01:44.31]  ឡ � testimony ឡើង Standing�ហែ៖ង� Manufacture
    

    Droptek - Science

    [01:00.59]  ប � Sugun ប ប ប ប ប ប ទ ᢔ ប ᢔ ។  �Что, ប  ម។ voud paid ក ក ក ថ �athi យ០។ ម០ ។.
    [01:14.29]  ប យ០ ម។ ។ ។ �ietet ។ып៙។។។ 🌈.
    

    Falcon Funk - Catnip Trip (Perkulat0r Remix)

    [00:00.00]  పఢ్ధిటాల్ మాలో కేట్ మాలో క్ందిఎలిందలి మాలో మాలలో వారిందిలో
    [00:09.24]  మాలో శంగామాండి వాభా ఉం పబింది మారింది �納డి వారింద కారిఁ ఎసావా
    [01:57.45]  5.5 cm x 5 cm
    [02:04.34]  6 cm x 6 cm
    [02:12.59]  7 cm x 7 cm
    [02:13.24]  8 cm x 8 cm
    [02:23.40]  9 cm x 9 cm
    [02:35.09]  10 cm x 10 cm
    [02:36.50]  11 cm x 11 cm
    [02:39.59]  12 cm x 12 cm
    [02:46.03]  13 cm x 13 cm
    [02:47.34]  14 cm x 14 cm
    [02:56.40]  15 cm x 15 cm
    [03:01.41]  16 cm x 16 cm
    [03:04.36]  17 cm x 17 cm
    [03:11.90]  18 cm x 18 cm
    [03:20.96]  19 cm x 19 cm
    [03:27.63]  20 cm x 20 cm
    [03:31.00]  21 cm x 21 cm
    [03:35.53]  22 cm x 21 cm
    [03:42.84]  23 cm x 23 cm
    [03:44.59]  24 cm x 24 cm
    [03:55.41]  25 cm x 25 cm
    [03:56.81]  26 cm x 26 cm
    [04:06.34]  27 cm x 27 cm
    [04:08.53]  29 cm x 29 cm
    [04:23.57]  29 cm x 29 cm
    

    Falcon Funk & Bossfight

    [00:19.35]  Hey guys, welcome back to my channel, today
    [00:29.98]  I'm going to be showing you how to create a
    [03:00.09]  My, my, my, my, my
    

    Intercom - Decoy World (feat. Park Avenue) (notice the "Thank you for watching" at the end, that is pretty common)

    [00:11.83]  I stayed awake last night, cause I couldn't close my eyes And see you another night
    [00:23.60]  I drove myself crazy thinking You'd take my
    [00:28.80]  wildest dreams and Tear them all to the ground
    [00:36.07]  So I tried to create a decoy world for you To destroy in my mind
    [00:47.39]  You can stay and believe You're tearing me apart
    [00:54.88]  While I'm coming to life While I'm coming to life
    [01:45.82]  You can stay and believe You're tearing me apart
    [01:47.85]  I couldn't keep the secret You found my darkest
    [01:53.28]  demons And brought them out in the light
    [01:59.84]  So I ran to all the preachers Despite having every reason To shut down and mobilize
    [02:11.97]  So I prayed to the gods For one last safe and grace against all odds
    [02:23.80]  And I built an escape using all my energy Just to come back to life, back to life
    [02:59.12]  I couldn't keep the secret You found my darkest
    [02:59.47]  demons And brought them out in the light
    [03:08.71]  So I ran to all the preachers Despite having every reason To shut down and mobilize
    [03:09.74]  So I ran to all the preachers Despite having every reason To shut down and mobilize
    [03:49.69]  Thank you for watching!
    

    Hosini - Flyga

    [01:03.17]  ḩ� ning Ḥᶽ
    [01:12.65]  She cures the tip in a combo lamp for 30 seconds.
    [01:12.65]  She cures the tip in a combo lamp for 30 seconds.
    [01:28.54]  She cures the tip in a combo lamp for 30 seconds.
    [01:31.34]  She cures the tip in a combo lamp for 30 seconds.
    [01:50.14]  She cures the tip in a combo lamp for 30 seconds.
    [01:52.93]  She cures the tip in a combo lamp for 30 seconds.
    [02:07.73]  She cures the tip in a combo lamp for 30 seconds.
    [02:10.53]  She cures the tip in a combo lamp for 30 seconds.
    [02:25.78]  She cures the tip in a combo lamp for 30 seconds.
    [02:31.22]  She cures the tip in a combo lamp for 30 seconds.
    [02:38.58]  She cures the tip in a combo lamp for 30 seconds.
    [02:52.34]  She cures the tip in a combo lamp for 30 seconds.
    [02:55.13]  She cures the tip in a combo lamp for 30 seconds.
    [03:00.38]  She cures the tip in a combo lamp for 30 seconds.
    [03:12.13]  She cures the tip in a combo lamp for 30 seconds.
    [03:20.40]  She cures the tip in a combo lamp for 30 seconds.
    [03:30.28]  She cures the tip in a combo lamp for 30 seconds.
    [03:33.65]  She cures the tip in a combo lamp for 30 seconds.
    [03:48.68]  She cures the tip in a combo lamp for 30 seconds.
    [03:51.47]  She cures the tip in a combo lamp for 30 seconds.
    [04:09.47]  She cures the tip in a combo lamp for 30 seconds.
    [04:23.02]  She cures the tip in a combo lamp for 30 seconds.
    [04:24.89]  She cures the tip in a combo lamp for 30 seconds.
    [04:27.83]  She cures the tip in a combo lamp for 30 seconds.
    

    Inova - Grime

    [00:11.35]  Music
    [02:46.36]  Thanks for watching, I'll see you in the next one!
    [03:00.00]  Thanks for watching, I'll see you in the next one!
    

    Inova - Enraged

    [00:29.64]  Hubsan x4 H502E Desire
    [00:58.57]  Thanks for watching please subscribe and hit that like button.....
    [01:28.57]  Thanks for watching please subscribe and hit that like button.....
    [01:58.57]  Thanks for watching please subscribe and hit that like button.....
    [02:26.81]  Thanks for watching please subscribe and hit that like button.....
    [02:58.12]  Thanks for watching please subscribe and hit that like button.....
    [03:16.56]  Thanks for watching please subscribe and hit that like button.....
    

    Airmov - PRESENCE

    [00:28.57]  This video is a derivative
     work of the Touhou Project.
    [03:32.50]  You
    
    It would be nice if there was a way to detect instrumental tracks and then skip generating lyrics. Maybe the ML model returns some confidence weights that could be used along with a threshold? The thresholds could also be different for audio files than for video files, to make sure subtitles aren't affected by higher thresholds. Getting rid of the most common random phrases ("Thanks for watching", "Subscribe", "welcome to my channel", etc.) would also be a very nice addition.
  • Some non-instrumental tracks are also not properly detected:

    Examples

    Direct & Matt Van - I Don't Mind

    [00:00.00]  වූ්යෙන්ණය හැකින්තියිය, දැක් පිතින්යිටියිටට දැකින්තියි වීඩිනින්යි
    [00:11.96]  සුයෙන්ට් හානමන්නමිටි හැකින්තිය හැකින්තයිිටිටිටටටට කළඩ�
    [00:31.64]  ශන්හ් ඉඩින්මට කිරීම් ඉධා මන් කරන් කරන් අවශ්රයකරය සමුන් ඉඩින් කරන්
    [00:43.42]  මයකර හැන්නමට හොඳින් මන්න ඉඩින් මයකරයේ හැන් මිහ්ලට කිරා ඔබනට හැන�
    [01:02.79]  අත්ලට සමට කරමණකර ප෸ොහඩවඅි ඔබට එකතු පිසිසින් කරමණක වීත්සමේ.
    [01:16.20]  මිශ්‍ර දින්හයක් දැනන් පිහිනහන් පින්හන්.
    [01:55.92]  ඊලක් පිතු කෝ්බිතු හාරණයක හාරඟ්ඛකක් හෟ් පදිකියේ හාරගන කල් niin
    [02:14.08]  මිත් කරමි, කරමි හැඩාවිකාඦ  කරහරි හලලයක් ප්ළුදු සේ ඇත
    [02:24.78]  මිශ්ර කරමි හැඩාවික් කරමි
    [02:28.56]  කරමි කරමි, මිශ්රම අතිකර කරමි
    [03:45.74]  අවශ් මම දින්තාකට කැහියක් එකතු මිශ්රීම් ඇත නැන් පිදුරින් කරන් රත් පිට්
    [03:47.13]  නිටිෝපා කරන් එකතු පි මිශ්රීම් තුණකට මන්න පිට් ඔබට කරන් බිශ්රීම් �
    
    I know there is a way to "force" detection of a certain language, but it would be nice to have some kind of allow-list instead (e.g. library only contains certain languages, and the detected language has to be one of those). I could also try increasing the language detection duration.
  • Sometimes a randomly-formatted "Music" section (or another random section) is created:

    Examples
    [00:28.57]  〔Music〕
    
    [00:28.57]  Music playing
    
    [00:28.57]  《Joy to the World》
    
    [00:14.06]  Music
    [00:15.46]  Music
    [00:17.48]  Music
    [00:19.98]  Music
    
    [00:00.00]  .
    [00:06.51]  .
    
  • Lyric lines often contain line breaks, which LRC parsers don't handle properly (each line should be one lyric line with a timestamp), so the generated files are essentially a mix of synchronized and unsynchronized lyrics:

    Examples

    AWAY & Midoca & Dark Waves - Too Close

    [00:13.48]  Take the long way back to me
     It's the wrong way, has to be
    [00:26.69]  You pull up in your car,
     then we sit out in the drive
    [00:30.28]  But I keep the lights on, like
     you're still out on the highway
    [00:33.92]  Practice in the mirror,
     everything you wanna say
    [00:37.85]  Hope you come inside, tell
     me that you wanna stay
    [00:45.38]  I feel alone when you get too close to me
    [00:52.10]  It looks wrong, but we're
     just too close to see
    [00:59.38]  It's cliche, but the
     writing's on the wall
    [01:06.48]  Now I wonder why you
     even came home at all
    [01:21.07]  When you get too close to me
    [01:38.00]  Behind closed doors, it's a black hole
    [01:45.21]  It's an old war with old souls
    [01:50.12]  There was a place in my heart
     that only you could get to
    [01:54.64]  Now you feel more like a
     stranger than before I met you
    [01:58.18]  Let me hear the words,
     everything you never say
    [02:01.84]  I hope you never let
     go when I push you away
    [02:09.21]  I feel alone when you get too close to me
    [02:16.37]  It looks wrong, but we're
     just too close to see
    [02:23.52]  It's cliche, but the
     writing's on the wall
    [02:30.49]  Now I wonder why you
     even came home at all
    [02:44.96]  When you get too close to me
    [03:16.41]  Have we lost who we are?
    [03:20.86]  Trying to save what we have
    [03:23.78]  Tell each other it's love,
     even though it feels bad
    [03:30.63]  We should run for our lives
    [03:34.12]  We should never look back
    [03:37.75]  We're just too close to see
    [03:41.34]  Being close makes us sad
    [04:04.12]  Makes us sad
    [04:09.19]  Being close makes us sad
    

    Airmov & Trove - Make Me Break

    [00:10.22]  I lay awake, think of
     what I locked away Take
    [00:15.19]  those secrets to the
     grave, if I can't cave
    [00:20.30]  And it occurred I had taken all my turns
     I don't seem to ever learn, am I unsafe?
    [00:29.80]  Alive if I say, I won't
     ever waste away All
    [00:34.50]  this life is made for
     me, cause I know me well
    [00:39.17]  If this is to be done, my
     will is all I've won Cause I
    [00:44.78]  just can't stop holding on,
     holding on, holding on to you
    [00:58.79]  Hold my hands, I won't
     let go, don't step, don't
    [01:03.84]  step me down, this is
     how you make me break
    [01:26.40]  All I can give now, deep
     from the underground
    [01:32.78]  Take me away, take me
     away, take me away now
    [01:37.48]  It's on my sleeve now,
     stitches tearing out
    [01:42.70]  Take me away, take me away, take me away
    [01:46.42]  Alive if I say, I won't
     ever waste away All
    [01:51.29]  this life is made for
     me, cause I know me well
    [01:55.76]  If this is to be done, my
     will is all I've won Cause I
    [02:01.57]  just can't stop holding on,
     holding on, holding on to you
    [02:13.97]  All I can give now, deep
     from the underground
    [02:24.58]  Take me away, take me
     away, take me away now
    [02:37.46]  Hold my hands, I won't
     let go, don't step, don't
    [02:39.93]  step me down, this is
     how you make me break
    [03:07.46]  Thanks for watching!
    I'm guessing this originates from improved subtitle formatting, using multiple lines. Since it doesn't work well for lyrics though, I'd suggest either removing the line breaks, or (if possible) making the detected lines shorter (maybe by making pause detection more "aggressive" or something, not sure if that's a thing) to properly split the lines.
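
    For the "removing the line breaks" option, a post-processing pass over the generated text could look roughly like the sketch below. It is not part of subgen; the function name `join_wrapped_lines` is made up for illustration, and it assumes every real lyric line starts with a `[mm:ss.xx]` timestamp, so LRC ID tags such as `[ar:...]` are not handled.

```python
import re

# A synchronized LRC line starts with a [mm:ss.xx] timestamp.
TIMESTAMP = re.compile(r"^\s*\[\d{1,2}:\d{2}(?:\.\d{1,3})?\]")


def join_wrapped_lines(lrc_text: str) -> str:
    """Fold lines without a timestamp into the preceding timestamped line."""
    merged: list[str] = []
    for raw in lrc_text.splitlines():
        line = raw.strip()
        if not line:
            continue  # drop empty lines entirely
        if TIMESTAMP.match(line) or not merged:
            merged.append(line)
        else:
            # Continuation of the previous lyric line: join with one space.
            merged[-1] = f"{merged[-1]} {line}"
    return "\n".join(merged)


sample = (
    "[00:13.48]  Take the long way back to me\n"
    " It's the wrong way, has to be\n"
    "[00:26.69]  You pull up in your car,\n"
    " then we sit out in the drive\n"
)
joined = join_wrapped_lines(sample)
```

    This only repairs the file after the fact; it does not make the lines shorter, so long merged lines stay long.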
    
    

Not sure how many of these issues are under your control or could be manually fixed, or if you're even willing to improve the LRC generation. But I wanted to discuss these issues anyway :)

All of this was tested using the default settings, aside from setting up the Jellyfin connection and a transcribe folder. So maybe using another model is a better solution? Although I don't think all issues would be solved by that.
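
To make the confidence-threshold idea from the first bullet concrete, here is a rough sketch. Whisper's reference implementation attaches a `no_speech_prob` to each transcribed segment; whether subgen's result objects expose that value is an assumption, and so are the function names and both threshold values, which would need tuning on real music. That score may also be unreliable on music, so treat this as a heuristic only.

```python
# Hypothetical thresholds: audio files (lyrics) are filtered more aggressively
# than video files, so regular subtitles are less likely to be suppressed.
AUDIO_THRESHOLD = 0.6
VIDEO_THRESHOLD = 0.9


def no_speech_ratio(segments: list[dict]) -> float:
    """Duration-weighted average of the per-segment no-speech probability."""
    total = sum(max(seg["end"] - seg["start"], 0.0) for seg in segments)
    if total == 0:
        return 1.0  # nothing was transcribed at all
    weighted = sum(
        seg.get("no_speech_prob", 0.0) * max(seg["end"] - seg["start"], 0.0)
        for seg in segments
    )
    return weighted / total


def is_probably_instrumental(segments: list[dict], is_audio: bool) -> bool:
    """Decide whether lyric/subtitle generation should be skipped."""
    threshold = AUDIO_THRESHOLD if is_audio else VIDEO_THRESHOLD
    return no_speech_ratio(segments) >= threshold


segments = [
    {"start": 0.0, "end": 10.0, "no_speech_prob": 0.9},
    {"start": 10.0, "end": 20.0, "no_speech_prob": 0.7},
]
skip_lrc = is_probably_instrumental(segments, is_audio=True)
skip_subtitles = is_probably_instrumental(segments, is_audio=False)
```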

@McCloudS
Owner

Hey, thanks for the writeup! LRC was added by request of someone else, I haven't used it.

As far as I know, there is no way to handle the instrumental/music aspect using the current model. As you mentioned, increasing the 'detect-language' is only for the actual detect-language webhook, it will not have any impact outside of using it in Bazarr at this point.

Having a 'library' of forced-languages doesn't work in my head. Say I want it to be fr or en, but Whisper detects it as German. What's my next step?

To fix the line breaks, you could change the CUSTOM_REGROUP back to cm_sp=,* /,_sg=.5_mg=.3+3_sp=.* /。/?/? and it should clean it up.

The rest of what you are seeing are hallucinations caused by the model, and there is no way to fix them here (see: openai/whisper#928 and openai/whisper#679). They would have to be fixed upstream.

If you want to take a hack at fixing some of the other stuff for LRC, I'd take a PR. You'd probably want to look at

def write_lrc(result, file_path):
and
if isAudioFileExtension(file_extension) and lrc_for_audio_files:

@Chaphasilor
Author

Having a 'library' of forced-languages doesn't work in my head. Say I want it to be fr or en, but Whisper detects it as German. What's my next step?

I was hoping the model would maybe offer multiple languages with varying confidence scores (e.g. de: 0.8, en: 0.4, fr: 0.2), which would allow you to use the matching language with the highest score, falling back to the originally detected language if none of the allow-listed languages is present.
But I take it that isn't the case?
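
If such per-language scores were available, the selection logic described above could be sketched like this. The function name `pick_language` and the shape of the input (a plain dict of language code to score) are assumptions for illustration, not something subgen provides today.

```python
def pick_language(probabilities: dict[str, float], allowed: set[str]) -> str:
    """Return the best-scoring allow-listed language.

    Falls back to the overall top detection when none of the allow-listed
    languages appears in the scores. Expects a non-empty dict.
    """
    detected = max(probabilities, key=probabilities.get)
    candidates = {
        lang: score for lang, score in probabilities.items() if lang in allowed
    }
    if not candidates:
        return detected  # no allow-listed language present: keep detection
    return max(candidates, key=candidates.get)


scores = {"de": 0.8, "en": 0.4, "fr": 0.2}
chosen = pick_language(scores, allowed={"en", "fr"})
fallback = pick_language(scores, allowed={"ja"})
```

With the example scores, an allow-list of `en` and `fr` selects `en` even though `de` scored highest, and an allow-list without any match keeps `de`.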

I wasn't aware of the custom regroup, I'll give it a try!

Would you be opposed to some kind of static regex to get rid of some of the more common hallucinations?
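
Something along these lines is what I have in mind, as a rough sketch only. The phrase list is taken from the examples above and is certainly incomplete, the names are made up, and a word like "subscribe" could in principle appear in real lyrics, so this would probably need to be opt-in. It works line by line, so continuation lines of a dropped entry are only removed if they match a pattern themselves.

```python
import re

# Splits "[mm:ss.xx]  text" into timestamp and text.
LRC_LINE = re.compile(r"^\s*(\[\d{1,2}:\d{2}(?:\.\d{1,3})?\])\s*(.*)$")

# Phrases that showed up as hallucinations in the examples above.
PHRASES = re.compile(
    "|".join(
        [
            r"\bthank(?:s| you) for watching\b",
            r"\bwelcome (?:back )?to my channel\b",
            r"\bsubscribe\b",
            r"\bhit that (?:like|thumbs up) button\b",
        ]
    ),
    re.IGNORECASE,
)

# Lines consisting only of punctuation and/or a "Music" placeholder.
PLACEHOLDER = re.compile(r"^[\W_]*(?:music(?:\s+playing)?)?[\W_]*$", re.IGNORECASE)


def strip_hallucinated_lines(lrc_text: str) -> str:
    """Drop lines whose text is a known filler phrase or placeholder."""
    kept = []
    for line in lrc_text.splitlines():
        match = LRC_LINE.match(line)
        text = match.group(2) if match else line
        if PLACEHOLDER.match(text.strip()) or PHRASES.search(text):
            continue
        kept.append(line)
    return "\n".join(kept)


sample = "\n".join(
    [
        "[00:11.35]  Music",
        "[00:28.57]  〔Music〕",
        "[00:47.39]  You can stay and believe You're tearing me apart",
        "[02:46.36]  Thanks for watching, I'll see you in the next one!",
        "[03:49.69]  Thank you for watching!",
        "[04:00.00]  .",
    ]
)
cleaned = strip_hallucinated_lines(sample)
```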

@McCloudS
Owner

I think the model will provide an array of probabilities, though I haven't messed with it. Your idea makes sense now. I'll see if there is any easy way to get that array.

Yup, open to any regex you want to try to throw in.

@Chaphasilor
Author

As you mentioned, increasing the 'detect-language' is only for the actual detect-language webhook, it will not have any impact outside of using it in Bazarr at this point.

Do I understand correctly that the detect-language setting will only use the detected language for sending it to Bazarr, but not for the STT? Or is Bazarr actually doing the detection, and sending the result back to subgen?

@McCloudS
Owner

Bazarr will request detect-language if it doesn’t know the language of a file. Whisper does the detection and sends it back to Bazarr. Then Bazarr will use that to force the language on a subsequent call to generate a subtitle. Whereas the way the LRC is being made, Whisper autodetects the language and uses that for the rest of the file.

There’s no easy way to get the probabilities of languages without rewriting the flow of the program.

@Chaphasilor
Author

Okay, and is the whisper-based autodetection using the configured first 30s (DETECT_LANGUAGE_LENGTH), or another duration, or the entire file?
Because it seems like some languages should be easily detectable, but have a long instrumental intro.

I got a bit confused by your comment about the 'detect-language'...

@McCloudS
Owner

McCloudS commented Apr 18, 2024 via email

@Chaphasilor
Author

Alright, thanks for the clarification. Being able to configure it would be very useful for lyrics.
An option to set the duration as a percentage of track length would also be nice!

@McCloudS
Owner

I'm working on it, but it may not come to fruition.

How many of your files are not in the language you want? Could you not force the language 100% of the time and get your desired transcription?

@McCloudS added the enhancement (New feature or request) label on Apr 20, 2024