I think umlaut characters (äüößÄÜÖ) are currently just removed from the input texts instead of getting their own symbol id or being replaced by similar ASCII encodings ('ae', 'ue', 'oe', 'ss', ...). Even though I guess the neural network learns to pronounce 'fnf' as 'fünf', I think the performance could be improved by fixing this.
The background is that german_transliterate doesn't actually change the umlaut characters, even though it states that it 'replaces Unicode symbols with ASCII characters'. They are still in the string afterwards, and as there is no symbol id for them in symbol_to_id, they are simply left out of the resulting sequence.
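To illustrate what I mean, here is a minimal sketch of the effect. It does not use the project's real code: the symbol list below is a made-up ASCII-only stand-in for ALL_SYMBOLS, and `text_to_sequence` / `sequence_to_text` are hypothetical helpers that assume characters without an id are skipped silently.

```python
# Hypothetical stand-in for ALL_SYMBOLS: ASCII letters, space and basic punctuation only.
ALL_SYMBOLS = list("abcdefghijklmnopqrstuvwxyz .,!?")
symbol_to_id = {symbol: index for index, symbol in enumerate(ALL_SYMBOLS)}


def text_to_sequence(text):
    # Assumed behaviour: characters without an id are dropped without a warning.
    return [symbol_to_id[ch] for ch in text if ch in symbol_to_id]


def sequence_to_text(sequence):
    # Map ids back to characters to see what the model actually receives.
    return "".join(ALL_SYMBOLS[i] for i in sequence)


encoded = text_to_sequence("fünf")
print(sequence_to_text(encoded))  # fnf
```

With a table like this, 'fünf' reaches the network as the three symbols for 'fnf'.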
A solution could be to append those characters to ALL_SYMBOLS to give them their own id. Unfortunately the network probably has to be retrained after changing this.
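Both options could look roughly like the sketch below. Again, this is only an illustration under assumptions: `BASE_SYMBOLS` is a made-up stand-in for the current ALL_SYMBOLS, and `encode`, `replace_umlauts` and `ASCII_FALLBACK` are hypothetical names, not part of the project or of german_transliterate.

```python
# Hypothetical stand-in for the current ALL_SYMBOLS (ASCII only).
BASE_SYMBOLS = list("abcdefghijklmnopqrstuvwxyz .,!?")
UMLAUTS = "äöüßÄÖÜ"

# Option 1: append the umlauts at the end so that all existing ids stay the same.
EXTENDED_SYMBOLS = BASE_SYMBOLS + list(UMLAUTS)
extended_symbol_to_id = {s: i for i, s in enumerate(EXTENDED_SYMBOLS)}


def encode(text, table):
    # Characters missing from the table are still skipped, as before.
    return [table[ch] for ch in text if ch in table]


# Option 2: replace umlauts with ASCII digraphs before the symbol lookup.
ASCII_FALLBACK = {
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
}


def replace_umlauts(text):
    return "".join(ASCII_FALLBACK.get(ch, ch) for ch in text)


print(len(encode("fünf", extended_symbol_to_id)))  # 4
print(replace_umlauts("fünf"))  # fuenf
```

Option 1 adds new ids that an already trained model has never seen, and option 2 changes the spelling the model sees for these words, so either way retraining is probably needed.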
Please don't hesitate to tell me if I got something wrong and umlaut characters are being handled correctly.
[Edit: Thank you Monatis and Thorsten for this really great effort regardless of this issue anyway!]
@luminosuslight Actually you're right, unfortunately :D
I noticed this only after training was complete, which is why the model sometimes (not always) has difficulty with umlauts. Anyway, I'm retraining Tacotron2 (and then FastSpeech2 for mobile and embedded inference), and this issue will be fixed in those models. Thanks for the issue.