
High memory consumption (50 GB) for genbank bacteria #120

Open
openpaul opened this issue Apr 30, 2020 · 9 comments

Comments

@openpaul

I was using ncbi-genome-download to fetch all GenBank bacteria (currently about 630k genomes). This seems to be a challenging task, and I am unsure whether it is even feasible this way.
I noticed it requires up to 50 GB of memory and wondered how that can be. I am currently debugging the tool to figure it out, but thought the cause might already be known to someone, which would save me some trouble.
I am launching it like this:

ncbi-genome-download -F fasta -s genbank     \
    --human-readable        \
    --retries 2     \
    --parallel 4    \
    --no-cache     \
    --verbose     \
    -o genomes     \
    bacteria
@kblin
Owner

kblin commented May 10, 2020

Hi,
at what point does the memory usage get that high? I don't currently have the disk space or internet connection to really try downloading all of bacterial GenBank, but at least during the candidate selection and MD5SUMS download steps the memory usage is driven by your --parallel value and doesn't go above 2 GB for me with your input parameters.

@kblin kblin added the need info Needs more info from the issue reporter label May 10, 2020
@Wrzlprmft

Wrzlprmft commented May 16, 2020

I observe memory issues as well. Here is what I did:

  • I wrote a Python script (using the Python interface) to download a list of custom genomes; everything happened within that one script (a sketch of the call pattern is below).

  • This process was killed by the system due to excessive memory usage, as evidenced by the kernel logs. (I have 16 GB of RAM plus some swap.)

  • The kill happened while trying to download something that wasn’t there (“No downloads matched your filter.”). Moreover, when I used the same script to download only the genome whose attempted download triggered the kill, it worked fine and without excessive memory usage.

All of this screams memory leak to me.
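
For reference, this is roughly the call pattern I used. The genus list here is made up, and the exact keyword arguments are my best guess at mirroring the CLI flags, so treat it as a sketch rather than my exact script:

import ncbi_genome_download as ngd

# Hypothetical list standing in for my custom genome selection.
genera = ["Streptomyces", "Escherichia"]

for genus in genera:
    # One download() call per entry, all inside the same long-running process.
    # Keyword names are assumed to mirror the CLI flags.
    exit_code = ngd.download(
        section="genbank",
        file_formats="fasta",
        groups="bacteria",
        genera=genus,
        output="genomes",
        parallel=4,
    )
    # Assumed to behave like the CLI: non-zero when no downloads matched the filter.
    if exit_code != 0:
        print(f"No downloads matched the filter for {genus}")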

@Wrzlprmft

Running something similar again, I noticed that ncbi-genome-download appears to keep spawning additional Python processes, which each consume a bit of memory and accumulate over time. Their number exceeds the value of the parallel argument.

@openpaul
Author

openpaul commented May 18, 2020

Hi, so I actually finished downloading GenBank bacteria. The successful run used "only" 20 GB of memory across 4 threads, so 5 GB per thread, which is reasonable given the number of files. Still, I assume this could be lowered. Here is a plot from mprofile:

[mprofile memory usage plot: Screenshot_2020-05-18_11-13-39]

Edit: I am not sure why it used only 20 GB of memory this time, whereas last time it requested more than that (50 GB).

I looked at the code, and it seems strange to me to first collect all jobs and then work them off, instead of launching them directly (in core.py::config_download). The download_jobs list is created and then used only once in the following loop. Maybe that's the issue?
But I guess more thorough code profiling is needed to really find where the memory stacks up.
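
To illustrate the kind of restructuring I have in mind (this is only a sketch, not the actual core.py code; job_iter and run_job stand in for the existing selection and download steps): the selection step could yield jobs lazily and the executor could keep only a bounded window of pending jobs, so memory no longer grows with the total number of download candidates.

from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from itertools import islice

def run_all(job_iter, run_job, parallel=4, window=16):
    """Consume a lazy job iterator with a bounded number of pending futures."""
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        # Materialize only the first `window` jobs instead of the whole list.
        pending = {pool.submit(run_job, job) for job in islice(job_iter, window)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # surface any download errors
            # Top the window back up from the lazy iterator.
            for job in islice(job_iter, len(done)):
                pending.add(pool.submit(run_job, job))

if __name__ == "__main__":
    # Toy demonstration: "jobs" are numbers, "running" one just doubles it.
    run_all(iter(range(100)), run_job=lambda n: n * 2, parallel=4)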

@Wrzlprmft

5 GB per thread, which is reasonable given the number of files

Why would a downloading job require that much memory?

@openpaul
Author

It's over 600,000 files, all of which require several pieces of information in memory. So per download candidate it's only 20 GB / 630,000, which is about 31 KB, which itself seems reasonable. Or does it not?
Still, as mentioned, I think the logic could be improved to reduce memory consumption. Right now I don't have the time to suggest and test alternatives, but maybe in a week or two I will.

@Wrzlprmft

Wrzlprmft commented May 18, 2020

So per download candidate it's only 20 GB / 630,000, which is about 31 KB, which itself seems reasonable.

Not really. NCBI stores file information in assembly summaries, most of which are below 1 kB, and even then you only need a fraction of that for downloading. While these files can be much larger (megabytes) for some species like E. coli, there is no reason to keep all of this information for all species in memory all the time.

@kblin
Owner

kblin commented May 18, 2020

ncbi-genome-download was built to allow intelligent filtering of files, not just to grab all of them, and only the large file downloads are run in parallel, which is why there is this collect-and-execute split.
If you don't use the tool to download everything at once, the memory consumption is very reasonable, and that's what I built it for.
I'm happy to look at patches to restructure things, but it's not trivial, and if you don't really need the filtering, a simpler script that parses the assembly_summary.txt file and then just downloads those files might serve your needs better.
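
Something along these lines, for example (untested sketch; the summary URL and the position of the ftp_path column are assumptions about the current GenBank summary layout):

import os
import urllib.request

SUMMARY_URL = "https://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt"

def iter_fasta_urls(summary_path):
    """Yield the genomic FASTA URL for every assembly listed in the summary."""
    with open(summary_path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # comment/header lines
            fields = line.rstrip("\n").split("\t")
            ftp_path = fields[19]  # assumed: ftp_path is the 20th column
            if ftp_path in ("", "na"):
                continue
            base = os.path.basename(ftp_path)
            # Genomic FASTA files follow the <assembly>_genomic.fna.gz naming scheme.
            yield ftp_path.replace("ftp://", "https://") + "/" + base + "_genomic.fna.gz"

if __name__ == "__main__":
    os.makedirs("genomes", exist_ok=True)
    urllib.request.urlretrieve(SUMMARY_URL, "assembly_summary.txt")
    for url in iter_fasta_urls("assembly_summary.txt"):
        target = os.path.join("genomes", os.path.basename(url))
        if not os.path.exists(target):
            urllib.request.urlretrieve(url, target)

That skips the MD5 checking and all of the filtering, of course, but it never holds more than one summary line in memory at a time.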

@kblin
Owner

kblin commented May 18, 2020

I mean, I agree that the tool could be more careful about dropping things from memory again, but it was built for workflows like "get me all assemblies of genus Streptomyces that have an assembly level of 'complete' or 'chromosome' from GenBank", and not "download all 630,000 bacterial genomes on the server".

@kblin kblin added enhancement help wanted and removed need info Needs more info from the issue reporter labels May 18, 2020
Wrzlprmft pushed a commit to Wrzlprmft/ncbi-genome-download that referenced this issue May 18, 2020