You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I am using the latest version of the repository (1.6.4) and when generating soundscapes I noticed a prohibitive slow generation of soundscapes in a loop. here's what I found out:
So following the example in the documentation : https://scaper.readthedocs.io/en/latest/examples.html
in the add_event() method when I specify a list of source files for event (instead of []), every time it wasnt to add an event it will go through validating the entire list of source files because of calling "_validate_source_file()" in "validate_event()" ! this is extremely redundant and unnecessary because Instead of doing so only "one" necessary time it will do it : {number of soundscapes times number of_added_event_per_soundscape} times. particularly if the list of event source files is long (as in my case).
The same argument of course applies when _validate event is called in add_background.
So I believe the _validate_source_file must be moved out of all loops and be just called once for the list of background files and source files through a separate method outside of all the loops.
I am just commenting them out in my case.
Also similary if you dont specify the list of source files in the add_event or add_background and go with [] instead, this redundant file checking happens through another function in utils.py called : "_get_sorted_files()" where in the last line of the method is written:
files = [f for f in files if os.path.isfile(f)]
this line is again is a huge unnecessary overhead and should be performed outside all the loops and only once for all background and source files.
The text was updated successfully, but these errors were encountered:
The reason every distribution tuple is validated when an event is added is that it can be different each time.
In your case you always add the same list of source files, but in general each time you add an event you could specify a different list of source files (or a different distribution tuple altogether) meaning we must validate the distribution tuple every time we call add_event.
I we don't validate the source files when an event is added it could cause Scaper to crash downstream. So from a design standpoint I believe what we are doing is the correct thing to do.
We could, potentially, add a flag to add_event to disable source file validation (e.g. disable_source_file_validation=False by default), since this is the only validation that involves file I/O - I believe all the other validation steps are insignificant computationally.
@moabaom To help us better understand how serious an issue this is, can you share a Minimal Working Example (MWE) and report the execution time with and without source file validation? Thanks!
p.s - actually another option would be to create a "cache" for validated files, such that if a scaper object has already validated that a file exists, it doesn't try to validate it again. This is probably safer than disabling validation and achieves the same speedup. If you create a new scaper object in a loop (like in the example), we could support providing a list of validated files as input, and the scaper object will not try to validate these source files.
Still, let's start with the MWE and timings to better scope the problem. thanks.
I am using the latest version of the repository (1.6.4) and when generating soundscapes I noticed a prohibitive slow generation of soundscapes in a loop. here's what I found out:
So following the example in the documentation : https://scaper.readthedocs.io/en/latest/examples.html
in the add_event() method when I specify a list of source files for event (instead of []), every time it wasnt to add an event it will go through validating the entire list of source files because of calling "_validate_source_file()" in "validate_event()" ! this is extremely redundant and unnecessary because Instead of doing so only "one" necessary time it will do it : {number of soundscapes times number of_added_event_per_soundscape} times. particularly if the list of event source files is long (as in my case).
The same argument of course applies when _validate event is called in add_background.
So I believe the _validate_source_file must be moved out of all loops and be just called once for the list of background files and source files through a separate method outside of all the loops.
I am just commenting them out in my case.
Also similary if you dont specify the list of source files in the add_event or add_background and go with [] instead, this redundant file checking happens through another function in utils.py called : "_get_sorted_files()" where in the last line of the method is written:
files = [f for f in files if os.path.isfile(f)]
this line is again is a huge unnecessary overhead and should be performed outside all the loops and only once for all background and source files.
The text was updated successfully, but these errors were encountered: