FR: per voice override for regex preprocessing #54
Comments
The regex can be modified at run time, no reload required; it's loaded on demand for each API call. A regex per language makes a lot of sense, yes. Per voice? Is that what you really mean? If so, can you explain why?
Oh right, I was thinking "per voice of piper", so meaning a language, I guess. Well, the line is blurry now because of piper's multilingual feature.
Honestly, I kind of hate how much of a hack the regex file is. It doesn't work well for piper AND xtts at the same time either; they each have unique and different flaws. I am very open to suggestions for how to fix it, or for a better system. For now, I am considering simply adding a search for a language-specific regex file, e.g. pre_process_map.fr.yaml
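For illustration, the language-specific file search could look roughly like the sketch below. This is not the project's code: the function name `find_pre_process_map`, its parameters, and the fallback order are assumptions; only the file names pre_process_map.yaml and pre_process_map.fr.yaml come from the comment above.

```python
from pathlib import Path


def find_pre_process_map(config_dir, language=None,
                         default_name="pre_process_map.yaml"):
    """Return the language-specific map file if it exists, else the default.

    Hypothetical helper: the naming scheme <stem>.<language>.yaml follows the
    pre_process_map.fr.yaml example, everything else is an assumption.
    """
    config_dir = Path(config_dir)
    default_path = config_dir / default_name
    if language:
        # pre_process_map.yaml + "fr" -> pre_process_map.fr.yaml
        candidate = config_dir / f"{default_path.stem}.{language}{default_path.suffix}"
        if candidate.is_file():
            return candidate
    # No language given, or no file for that language: use the common map.
    return default_path
```

With this shape, a language without its own file silently falls back to the shared map, so existing setups would keep working unchanged.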
Well, I don't know the flaws, but I'm thinking a maximally customizable implementation has to be the way to go. You are bound to switch to newer, better TTS models in the future. So I'm thinking:
Oh wait, I'm guessing the split happens before the text is sent to the voice, right? If that's the case, then a single preprocessing YAML containing a list should apply before the split. And then what I said above, but without the "before split" list :)
PS: I don't know how much time it takes to load and apply the regex, but since, as you said, it happens at run time, I think you could gain some ms by loading the rules at startup and running re.compile on each. Then, at query time, reload the dict only if the YAML's modification time has changed. It might save real time on things like a Raspberry Pi, especially with long regexes in "before split". IIRC, uncompiled regex can be 10 to 100 times slower than applying str.replace. I do think each ms saved has value for TTS :). Additionally, it would allow checking the validity of the YAML at launch instead of waiting for the first query. Should I create an issue to track this?
PS2: There's overhead in creating a subprocess to call piper, but there's a way to call it directly as a library. It was very hard to figure out from the repo at the time, but I'm using this in a small script I made a while ago. I'm thinking you could create the subprocess in advance, once and for all, and use that. Possibly lots more ms to gain there?
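A minimal sketch of the compile-at-startup, reload-on-change idea described above. Assumptions: the class name `CompiledRuleCache` is hypothetical, the rules are taken to be a list of [pattern, replacement] pairs, and `json.loads` stands in for the parser only to keep the example dependency-free (`yaml.safe_load` from PyYAML could be passed as `parse` instead). Note that Python's `re` module already caches recently used compiled patterns, so the actual saving may be modest and would need measuring.

```python
import json
import os
import re


class CompiledRuleCache:
    """Keep compiled rules in memory; re-read the file only when its mtime changes."""

    def __init__(self, path, parse=json.loads):
        self.path = path
        self.parse = parse      # assumption: pass yaml.safe_load for a YAML file
        self._mtime = None
        self._rules = []
        # Loading here means an invalid file fails at launch, not on first query.
        self.reload_if_changed()

    def reload_if_changed(self):
        mtime = os.stat(self.path).st_mtime_ns
        if mtime != self._mtime:
            with open(self.path, encoding="utf-8") as f:
                raw = self.parse(f.read())
            # Compile first, record the mtime last, so a failed load is retried.
            self._rules = [(re.compile(p), r) for p, r in raw]
            self._mtime = mtime
        return self._rules

    def apply(self, text):
        for pattern, replacement in self.reload_if_changed():
            text = pattern.sub(replacement, text)
        return text
```

Each query then costs one `os.stat` call plus the substitutions, and edits to the file are still picked up without a restart.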
So, my take on your comments is that if I added a new configuration key, available at any level of voice_to_speaker.yaml and merged with any higher-level file, then it should work as you describe.

    # the default top level will be pre_process_map.yaml, but include it here for example's sake:
    pre_process_map: pre_process_map.yaml # this one is common to piper and xtts, and represents the current default configuration
    tts-1:
      pre_process_map: pre_process_map.tts-1.yaml # this one is for generic 'piper'isms
      alloy:
        pre_process_map: pre_process_map.alloy.yaml # This doesn't make a lot of sense to me, but it could work anyway
        pre_process_map: pre_process_map.tts-1.alloy.yaml # OR: same here, not much purpose
        en:
          pre_process_map: pre_process_map.tts-1.alloy.en.yaml # specific alloy en piper-isms, doesn't make much sense
          pre_process_map: pre_process_map.tts-1.en.yaml # OR: general piper/en specifics
          pre_process_map: pre_process_map.en.yaml # OR: general language specific

Even if they don't make much sense, that's how it could work; how people use it is really up to them.
Some limitations to consider: the input to the API is 'voice' and 'model', that's all. Language is an auto-detected feature, and detection happens after the model has been determined (because the set of languages possible for playback is limited by the model). Because the file is a model.voice config, there isn't a great way to organize it around languages instead.
Also, I'm still not very happy with language detection. 99% of users should really disable it, and I may require it to be explicitly enabled before use; the current dev branch isn't merged or released yet. Maybe something like --detect-languages en,es, with the default being none or 'en' but allowing 'any'.
I've often considered switching piper to the Python implementation, but the piped process is so fast, and it streams from a real separate process, which avoids all the Python "multi-thread" problems, so I don't think I will bother. It's actually a very efficient, simple pipeline, and the onnx models must be mmap'd into place, so it's essentially instant on most Linux systems. In the same vein, the regex/yaml processing is so far unnoticeable, so I probably won't go out of my way to make it more complex than needed.
If anyone is interested in an efficient, high-performance, GPU-accelerated, high-concurrency implementation of text to speech for large-scale deployment (1000's of users), I'm happily available for paid consultation and would enjoy the challenge :-)
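To make the "available at any level" idea concrete, here is a rough sketch of how the chain of map files might be collected for one request. It assumes the parsed voice_to_speaker.yaml is a nested dict keyed by model, then voice, then language; the function name `resolve_pre_process_maps` and the top-down ordering are hypothetical, and how the collected files are then merged is left open.

```python
def resolve_pre_process_maps(config, model, voice=None, language=None,
                             key="pre_process_map"):
    """Collect map files from the top level down to the most specific level."""
    files = []
    node = config
    if key in node:
        files.append(node[key])          # common default
    for name in (model, voice, language):
        if name is None:
            break                        # nothing more specific was requested
        node = node.get(name)
        if not isinstance(node, dict):
            break                        # this level is not configured
        if key in node:
            files.append(node[key])
    return files


# Illustrative config only; the file names mirror the example above.
config = {
    "pre_process_map": "pre_process_map.yaml",
    "tts-1": {
        "pre_process_map": "pre_process_map.tts-1.yaml",
        "alloy": {
            "en": {"pre_process_map": "pre_process_map.tts-1.en.yaml"},
        },
    },
}

print(resolve_pre_process_maps(config, "tts-1", "alloy", "en"))
```

Returning the whole chain rather than only the most specific file leaves the choice between "merge" and "override" to the caller.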
Regarding language detection, you're the boss. IMHO, what matters is that people who run instances can configure it. The defaults are not critical for Docker containers, in my opinion, because those are only aimed at pretty savvy people to start with. So if I have to enable it, that's fine with me. Thanks again for doing it in the first place :)
And about speed, yeah, okay. Piper advertises being able to run on Raspberry Pis, so I'm a little suspicious that what you're saying still holds for low-powered systems if you're just inferring from your experience on desktop computers, but you're probably right. In any case, not having time to spend on those edge cases is understandable, of course. And your code is clean enough to make it easy to open a PR if I happen to need it.
Hi,
I noticed that, with the current implementation of the pre-processing regex, "1-5" for example is read as "one to five", but obviously that does not work in all languages: in French it should be "un à cinq", not "un to cinq".
Another example is "ex." being read as "for example", even though French people use "ex." too, but read it as "par exemple".
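To illustrate the two cases above, here is a small sketch of per-language substitution rules. The patterns, the `RULES` table, and the `pre_process` function are made up for the example and are not the project's actual regexes; turning the digits into spoken words is left to the TTS engine.

```python
import re

# Hypothetical rule tables, one list of (pattern, replacement) per language.
RULES = {
    "en": [
        (re.compile(r"(\d+)-(\d+)"), r"\1 to \2"),
        (re.compile(r"\bex\."), "for example"),
    ],
    "fr": [
        (re.compile(r"(\d+)-(\d+)"), r"\1 à \2"),
        (re.compile(r"\bex\."), "par exemple"),
    ],
}


def pre_process(text, language, default="en"):
    """Apply the rules for `language`, falling back to the default language."""
    for pattern, replacement in RULES.get(language, RULES[default]):
        text = pattern.sub(replacement, text)
    return text


print(pre_process("pages 1-5, ex. la page 2", "fr"))
# prints: pages 1 à 5, par exemple la page 2
```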
It would be great if we could specify an override file for a given voice, meaning the voice YAML file would accept a new per-voice key, like "override_preprocess_file", whose value is a string path.
Also, a question: can the regex file be modified at run time, or is it only read at startup? If it can be modified at run time (which helps for tweaking), then I think it deserves a short sentence in the readme.md :)
Have a nice day!