General and speech recognition enhancement #140
Thanks @JorisCos for your feedback! Let's discuss:
Since 1.5.0, Scaper can return audio/annotations in memory (we're on 1.6.4 now).
We presented a source separation tutorial using Scaper (as a real-time data generator) at ISMIR 2020, and wrote it as an online book: https://source-separation.github.io/tutorial. It focuses on music source separation, but I believe it could be applied directly to speech with little or no modification.
We've had a few requests for more background event controls, some of which are documented in #47. Can you give it a quick look and then add/elaborate on your required functionality? There are some non-trivial considerations to keep in mind, e.g., how we handle/determine ref_db, but we can map out a solution for this.
Yup, this has been a constraint since the beginning because it simplifies a lot of the (fairly complex) logic around soundscape generation. That said, I agree it'd be nice to have more flexibility here. Happy to discuss further via #1.
Right now we support this within a single soundscape via

Well, that's quite a bit to unpack! How would you prioritize these issues? We have limited cycles, so it would make sense to tackle this by priority. The way we work on issues is to first discuss them in an issue until we reach consensus about (1) the problem, (2) the high-level solution, and (3) how to implement the solution. Once we complete 1-3, someone opens a PR. Cheers
Thank you for your quick reply! IMO the priority goes to the duration issue, which we will discuss further in #1.
About this issue
I'm opening this issue to start a discussion about some limitations I have encountered when using Scaper for speech recognition in noisy environments.
General limitations
- It would be nice to have the possibility of using multiple background files to cover the soundscape duration, i.e., a succession of different background files instead of the same file duplicated over and over.
- When the background duration is shorter than the soundscape duration, the background is duplicated. The duplication is done "roughly" by using the `numpy.tile` function. A smoother way would be to cross-fade: apply an ascending Hann window to the start of each new background copy and a descending one to the end of the previous copy.
- A sampling-without-replacement method could be really useful for choosing background files and source files. By that, I mean that if I have 200 background files and 200 source files and I generate 200 soundscapes, then each background file and source file should be used only once.
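The cross-fade idea above can be sketched like this (a minimal numpy illustration, not Scaper's actual implementation; the `crossfade_tile` name and the half-Hann fade length are my own choices):

```python
import numpy as np

def crossfade_tile(background, target_len, fade_len=1024):
    """Tile `background` up to `target_len` samples, cross-fading each
    repetition with half-Hann windows instead of the hard joins produced
    by a plain numpy.tile."""
    window = np.hanning(2 * fade_len)
    fade_in, fade_out = window[:fade_len], window[fade_len:]

    out = background.copy()
    while len(out) < target_len:
        nxt = background.copy()
        # Overlap-add at the seam: fade out the tail of `out`,
        # fade in the head of `nxt`, and sum the overlapping region.
        out[-fade_len:] *= fade_out
        nxt[:fade_len] *= fade_in
        out = np.concatenate([out[:-fade_len],
                              out[-fade_len:] + nxt[:fade_len],
                              nxt[fade_len:]])
    return out[:target_len]
```

The same overlap-add trick would also smooth the seam between two different background files, which covers the succession-of-backgrounds request above.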
- Being able to provide a glob pattern for choosing those files could be a plus.
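Those two requests combine naturally: collect the candidates with `glob` and draw from the pool without replacement. A sketch of the desired behavior (not an existing Scaper feature; the path pattern is a placeholder):

```python
import glob
import random

def exhaustive_sampler(pattern, seed=0):
    """Yield each file matching `pattern` exactly once, in random order,
    so N files cover N generated soundscapes without repetition."""
    files = sorted(glob.glob(pattern))
    rng = random.Random(seed)
    rng.shuffle(files)
    yield from files

# Hypothetical usage: one unique background per soundscape.
# for bg_file in exhaustive_sampler('audio/background/*.wav'):
#     ...generate one soundscape using bg_file...
```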
- The generation function is tied to the function that writes to disk, which means you can't build a dynamic data generator. Having the possibility to generate the soundscape without writing it to disk would solve that.
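The dynamic-generator pattern I have in mind looks roughly like this (a sketch: `make_soundscape` stands in for whatever in-memory generation call Scaper exposes; here it is a toy stub so the example runs):

```python
import numpy as np

def make_soundscape(rng, n_samples=44100):
    """Toy stand-in for an in-memory generate() call:
    returns a (mixture, isolated_sources) pair without touching disk."""
    sources = rng.randn(2, n_samples)  # e.g. speech + background noise
    return sources.sum(axis=0), sources

def soundscape_stream(seed=0):
    """Infinite generator of training examples, nothing written to disk."""
    rng = np.random.RandomState(seed)
    while True:
        yield make_soundscape(rng)

stream = soundscape_stream()
mixture, sources = next(stream)
```

With in-memory generation, such a stream can feed a training loop directly, with fresh soundscapes on every epoch.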
Speech recognition related
As mentioned in #1, the duration of a soundscape is fixed. This is a real limitation for speech recognition, since utterances have variable durations: Scaper forces you to generate soundscapes as long as your longest utterance. Having a parameter to force the soundscape duration to match the utterance duration would avoid post-processing.
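As a workaround today, one can read each utterance's duration and construct the Scaper object with exactly that duration (a sketch using only the standard-library `wave` module; `fg_path` and `bg_path` in the commented usage are placeholders):

```python
import wave

def wav_duration(path):
    """Duration of a WAV file in seconds, via the stdlib wave module."""
    with wave.open(path, 'rb') as f:
        return f.getnframes() / f.getframerate()

# Hypothetical usage: one soundscape per utterance, no post-trimming.
# for utt in utterance_files:
#     sc = scaper.Scaper(duration=wav_duration(utt),
#                        fg_path=fg_path, bg_path=bg_path)
#     ...add the utterance as an event spanning the full duration...
```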
Contribution
As mentioned in #92, a tutorial for source separation could be a cool thing to have. I would be glad to contribute a tutorial that uses Scaper to generate data, Asteroid to perform source separation, and ESPnet to perform speech recognition.