High memory consumption (50 GB) for genbank bacteria #120
Hi,
I observe memory issues as well. Here is what I did:
All of this screams memory leak to me.
Running something similar again, I noticed that ncbi-genome-download appears to spawn additional Python processes, each of which consumes a bit of memory, and they accumulate over time. Their number exceeds the parallelism value I configured.
Hi, so I actually finished downloading genbank bacteria. In the successful run it used "only" 20 GB of memory across 4 threads, i.e. 5 GB per thread, which is reasonable given the number of files. Still, I assume this could be lowered. Here is a plot from mprofile:

Edit: I am not sure why it used only 20 GB of memory this time, whereas last time it requested more than that (50 GB). I looked at the code, and it seems strange to me to first collect all jobs and then work them off, instead of launching them directly (in core.py::config_download).
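To make that suggestion concrete, here is a minimal sketch of launching jobs directly from a generator instead of collecting them all first. This is not ncbi-genome-download's actual code: the file name accessions.txt and the helpers iter_jobs, download_one and download_all are hypothetical, and it assumes one candidate accession per input line.

```python
import multiprocessing
from typing import Iterator

def iter_jobs(candidate_file: str) -> Iterator[str]:
    """Yield one download candidate at a time (here: one accession per line)."""
    with open(candidate_file) as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line

def download_one(accession: str) -> str:
    """Placeholder worker; real code would fetch the files for this accession."""
    return accession

def download_all(candidate_file: str, parallel: int = 4) -> None:
    """Stream jobs into a worker pool instead of building the full job list first."""
    # Collect-then-execute would be pool.map(download_one, list(iter_jobs(...))),
    # which keeps every job object alive at once. imap_unordered pulls from the
    # generator lazily, creating jobs roughly as the workers ask for more.
    with multiprocessing.Pool(processes=parallel) as pool:
        for finished in pool.imap_unordered(download_one, iter_jobs(candidate_file), chunksize=64):
            pass  # e.g. log progress for `finished`

if __name__ == "__main__":
    download_all("accessions.txt")
```

With a pre-built list, all 600k+ job objects exist at the same time; with a generator feeding imap_unordered, they are created on demand as the pool requests more work.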
Why would a downloading job require that much memory?
It's over 600,000 files, all of which need several pieces of information in memory. So per download candidate it's only 20 GB / 630,000, which is about 31 kB, which itself seems reasonable. Or does it not?
Not really. NCBI stores file information in assembly summaries, most of which are below 1 kB, and even then you only need a fraction of this for downloading. While these files can be much larger (megabytes) for some species like E. coli, there is no reason to keep all of this information for all species in memory all the time.
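As a sketch of that idea, the snippet below parses an assembly_summary.txt file but keeps only the accession and FTP path per candidate, dropping the rest of each record as soon as the line is read. The column names assembly_accession and ftp_path come from NCBI's summary file header; the function itself is illustrative, not part of ncbi-genome-download.

```python
from typing import Iterator, Tuple

def slim_records(summary_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (accession, ftp_path) per assembly, discarding every other column."""
    acc_idx = path_idx = None
    with open(summary_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#"):
                # The last comment line of assembly_summary.txt carries the column
                # names: "# assembly_accession", "bioproject", ..., "ftp_path", ...
                header = [fields[0].lstrip("# ")] + fields[1:]
                if "assembly_accession" in header and "ftp_path" in header:
                    acc_idx = header.index("assembly_accession")
                    path_idx = header.index("ftp_path")
                continue
            if acc_idx is None or path_idx is None:
                continue  # no usable header found; skip rather than guess columns
            yield fields[acc_idx], fields[path_idx]
```

Two short strings per candidate keep the per-candidate footprint far below the ~31 kB estimated above.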
ncbi-genome-download was built to allow intelligent filtering of files, not to just go grab all of them, and only the large file downloads were run in parallel, which is why there's this collect-and-execute split.
I mean, I agree that the tool could be more careful about dropping things from memory again, but it was built for workflows like "get me all assemblies of genus Streptomyces that have an assembly level of 'complete' or 'chromosome' from GenBank", not "download all 630,000 bacterial genomes on the server".
I was using ncbi-genome-download to fetch all genbank bacteria (currently ~630k genomes). This seems to be a challenging task; I am not sure whether it is even feasible this way.
I noticed it requires up to 50 GB of memory and wondered how that can be. I am currently debugging the tool to figure it out, but thought maybe it is already known to someone; that would save me some trouble.
I am launching it like this: