Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

redundant unnecessary repetitive calling of _validate_source_file() in nested loops #139

Open
Moadab-AI opened this issue Oct 20, 2020 · 2 comments

Comments

@Moadab-AI
Copy link

Moadab-AI commented Oct 20, 2020

I am using the latest version of the repository (1.6.4) and when generating soundscapes I noticed a prohibitive slow generation of soundscapes in a loop. here's what I found out:

So following the example in the documentation : https://scaper.readthedocs.io/en/latest/examples.html
in the add_event() method when I specify a list of source files for event (instead of []), every time it wasnt to add an event it will go through validating the entire list of source files because of calling "_validate_source_file()" in "validate_event()" ! this is extremely redundant and unnecessary because Instead of doing so only "one" necessary time it will do it : {number of soundscapes times number of_added_event_per_soundscape} times. particularly if the list of event source files is long (as in my case).

The same argument of course applies when _validate event is called in add_background.
So I believe the _validate_source_file must be moved out of all loops and be just called once for the list of background files and source files through a separate method outside of all the loops.
I am just commenting them out in my case.

Also similary if you dont specify the list of source files in the add_event or add_background and go with [] instead, this redundant file checking happens through another function in utils.py called : "_get_sorted_files()" where in the last line of the method is written:
files = [f for f in files if os.path.isfile(f)]

this line is again is a huge unnecessary overhead and should be performed outside all the loops and only once for all background and source files.

@justinsalamon
Copy link
Owner

The reason every distribution tuple is validated when an event is added is that it can be different each time.

In your case you always add the same list of source files, but in general each time you add an event you could specify a different list of source files (or a different distribution tuple altogether) meaning we must validate the distribution tuple every time we call add_event.

I we don't validate the source files when an event is added it could cause Scaper to crash downstream. So from a design standpoint I believe what we are doing is the correct thing to do.

We could, potentially, add a flag to add_event to disable source file validation (e.g. disable_source_file_validation=False by default), since this is the only validation that involves file I/O - I believe all the other validation steps are insignificant computationally.

@moabaom To help us better understand how serious an issue this is, can you share a Minimal Working Example (MWE) and report the execution time with and without source file validation? Thanks!

@justinsalamon
Copy link
Owner

p.s - actually another option would be to create a "cache" for validated files, such that if a scaper object has already validated that a file exists, it doesn't try to validate it again. This is probably safer than disabling validation and achieves the same speedup. If you create a new scaper object in a loop (like in the example), we could support providing a list of validated files as input, and the scaper object will not try to validate these source files.

Still, let's start with the MWE and timings to better scope the problem. thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants