Fix parallel parsing of many files in get_results_as_dataframe #76
Hello,
When working with 1000+ simulation result files, the `parallel_parsing` option of the `CampaignManager.get_results_as_dataframe()` utility causes the main process to get stuck trying to open and load all the files at once, with no progress displayed. If the files are large, this can eventually lead to memory issues and crashes.

This happens because the program builds the full list of inputs for `pool.imap_unordered` up front. As a simple fix, I suggest changing the list into a generator. Generators are lazy iterators, so each file is only opened when `pool.imap_unordered` requests the next element, and closed when the task is done. As an added benefit, this also fixes the `tqdm` progress bar, which otherwise appears only once the list is completely loaded.

Best,
Alessandro
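
For illustration, here is a minimal sketch of the list-to-generator change described above. It is not the library's actual code: `load_file`, `count_lines`, and `parse_all` are hypothetical names, and `ThreadPool` is used as a stand-in for the process pool (it exposes the same `imap_unordered` method) so that the snippet runs anywhere without pickling concerns. Note that the pool may read somewhat ahead of the workers, so the laziness is approximate rather than strictly one file at a time.

```python
import os
import tempfile
from multiprocessing.pool import ThreadPool

loaded = []  # records which files have been read so far (for illustration only)


def load_file(path):
    """Hypothetical loader: reads one result file into memory."""
    loaded.append(path)
    with open(path) as f:
        return os.path.basename(path), f.read()


def count_lines(item):
    """Hypothetical per-file parsing task executed by the pool workers."""
    name, text = item
    return name, len(text.splitlines())


def parse_all(paths, workers=2):
    # Eager version (the problem): every file is read before the pool starts.
    #   inputs = [load_file(p) for p in paths]
    # Lazy version (the fix): nothing is read on this line; load_file runs
    # only as the pool pulls items from the generator.
    inputs = (load_file(p) for p in paths)
    with ThreadPool(workers) as pool:
        # Results arrive in completion order, so collect them by file name.
        return dict(pool.imap_unordered(count_lines, inputs))


if __name__ == "__main__":
    tmp_dir = tempfile.mkdtemp()
    demo_paths = []
    for i in range(3):
        path = os.path.join(tmp_dir, f"result_{i}.txt")
        with open(path, "w") as f:
            f.write("line\n" * (i + 1))
        demo_paths.append(path)
    print(parse_all(demo_paths))
```

Because the generator yields inputs one at a time, a progress bar wrapped around the result iterator can start updating as soon as the first task finishes, instead of waiting for the whole input list to be built.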