
High memory consumption (50 GB) for genbank bacteria #120

Open
openpaul opened this issue Apr 30, 2020 · 9 comments

Comments

@openpaul

I was using ncbi-genome-download to fetch all GenBank bacteria (currently about 630k genomes). This seems to be a challenging task, and I am unsure whether it is even feasible this way.
I noticed it requires up to 50 GB of memory and wondered how that can be. I am currently debugging the tool to figure it out, but thought the cause might already be known to someone, which would save me some trouble.
I am launching it like this:

ncbi-genome-download -F fasta -s genbank     \
    --human-readable        \
    --retries 2     \
    --parallel 4    \
    --no-cache     \
    --verbose     \
    -o genomes     \
    bacteria
@kblin
Owner

kblin commented May 10, 2020

Hi,
at what point does the memory usage get that high? I don't currently have the disk space or internet connection to really try downloading all of bacterial GenBank, but at least during the candidate selection and MD5SUMS download steps the memory usage is driven by your --parallel value and doesn't go above 2 GB for me with your input parameters.

@kblin kblin added the need info Needs more info from the issue reporter label May 10, 2020
@Wrzlprmft

Wrzlprmft commented May 16, 2020

I observe memory issues as well. Here is what I did:

  • I wrote a Python script (using the Python interface) to download a list of custom genomes; everything happened within that one script (a sketch of the call pattern is below).

  • This process was killed by the system due to excessive memory usage, as evidenced by the kernel logs. (I have 16 GB of RAM plus some swap.)

  • The kill happened while trying to download something that wasn’t there (“No downloads matched your filter.”). Moreover, when I used the same script to download only the genome whose attempted download triggered the kill, it worked fine and without excessive memory usage.

All of this screams memory leak to me.
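
For reference, this is roughly the call pattern I used. The genus list here is made up, and the exact keyword arguments are my best guess at mirroring the CLI flags, so treat it as a sketch rather than my exact script:

import ncbi_genome_download as ngd

# Hypothetical list standing in for my custom genome selection.
genera = ["Streptomyces", "Escherichia"]

for genus in genera:
    # One download() call per entry, all inside the same long-running process.
    # Keyword names are assumed to mirror the CLI flags.
    exit_code = ngd.download(
        section="genbank",
        file_formats="fasta",
        groups="bacteria",
        genera=genus,
        output="genomes",
        parallel=4,
    )
    # Assumed to behave like the CLI: non-zero when no downloads matched the filter.
    if exit_code != 0:
        print(f"No downloads matched the filter for {genus}")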

@Wrzlprmft

Running something similar again, I noticed that ncbi-genome-download appears to keep spawning additional Python processes, which each consume a bit of memory and accumulate over time. Their number exceeds the value of the parallel argument.

@openpaul
Author

openpaul commented May 18, 2020

Hi, so I actually finished downloading GenBank bacteria. The successful run used "only" 20 GB of memory across 4 threads, so 5 GB per thread, which is reasonable given the number of files. Still, I assume this could be lowered. Here is a plot from mprofile:

[mprofile memory usage plot: Screenshot_2020-05-18_11-13-39]

Edit: I am not sure why it used only 20 GB of memory this time, whereas last time it requested more than that (50 GB).

I looked at the code, and it seems strange to me to first collect all jobs and then work them off, instead of launching them directly (in core.py::config_download). The download_jobs list is created and then used only once in the following loop. Maybe that's the issue?
But I guess more thorough code profiling is needed to really find where the memory stacks up.
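
To illustrate the kind of restructuring I have in mind (this is only a sketch, not the actual core.py code; job_iter and run_job stand in for the existing selection and download steps): the selection step could yield jobs lazily and the executor could keep only a bounded window of pending jobs, so memory no longer grows with the total number of download candidates.

from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from itertools import islice

def run_all(job_iter, run_job, parallel=4, window=16):
    """Consume a lazy job iterator with a bounded number of pending futures."""
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        # Materialize only the first `window` jobs instead of the whole list.
        pending = {pool.submit(run_job, job) for job in islice(job_iter, window)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # surface any download errors
            # Top the window back up from the lazy iterator.
            for job in islice(job_iter, len(done)):
                pending.add(pool.submit(run_job, job))

if __name__ == "__main__":
    # Toy demonstration: "jobs" are numbers, "running" one just doubles it.
    run_all(iter(range(100)), run_job=lambda n: n * 2, parallel=4)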

@Wrzlprmft

5 GB per thread, which is reasonable given the number of files

Why would a downloading job require that much memory?

@openpaul
Author

It's over 600,000 files, all of which require several pieces of information in memory. So per download candidate it's only 20 GB / 630,000, which is about 31 KB, which itself seems reasonable. Or does it not?
Still, as mentioned, I think the logic could be improved to reduce memory consumption. Right now I don't have the time to suggest and test alternatives, but maybe in a week or two I will.

@Wrzlprmft

Wrzlprmft commented May 18, 2020

So per download candidate it's only 20 GB / 630,000, which is about 31 KB, which itself seems reasonable.

Not really. NCBI stores file information in assembly summaries, most of which are below 1 kB, and even then you only need a fraction of that for downloading. While these files can be much larger (megabytes) for some species like E. coli, there is no reason to keep all of this information for all species in memory all the time.

@kblin
Owner

kblin commented May 18, 2020

ncbi-genome-download was built to allow intelligent filtering of files, not just to grab all of them, and only the large file downloads are run in parallel, which is why there is this collect-and-execute split.
If you don't use the tool to download everything at once, the memory consumption is very reasonable, and that's what I built it for.
I'm happy to look at patches to restructure things, but it's not trivial, and if you don't really need the filtering, a simpler script that parses the assembly_summary.txt file and then just downloads those files might serve your needs better.
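
Something along these lines, for example (untested sketch; the summary URL and the position of the ftp_path column are assumptions about the current GenBank summary layout):

import os
import urllib.request

SUMMARY_URL = "https://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt"

def iter_fasta_urls(summary_path):
    """Yield the genomic FASTA URL for every assembly listed in the summary."""
    with open(summary_path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # comment/header lines
            fields = line.rstrip("\n").split("\t")
            ftp_path = fields[19]  # assumed: ftp_path is the 20th column
            if ftp_path in ("", "na"):
                continue
            base = os.path.basename(ftp_path)
            # Genomic FASTA files follow the <assembly>_genomic.fna.gz naming scheme.
            yield ftp_path.replace("ftp://", "https://") + "/" + base + "_genomic.fna.gz"

if __name__ == "__main__":
    os.makedirs("genomes", exist_ok=True)
    urllib.request.urlretrieve(SUMMARY_URL, "assembly_summary.txt")
    for url in iter_fasta_urls("assembly_summary.txt"):
        target = os.path.join("genomes", os.path.basename(url))
        if not os.path.exists(target):
            urllib.request.urlretrieve(url, target)

That skips the MD5 checking and all of the filtering, of course, but it never holds more than one summary line in memory at a time.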

@kblin
Owner

kblin commented May 18, 2020

I mean, I agree that the tool could be more careful about dropping things from memory again, but it was built for workflows like "get me all assemblies of genus Streptomyces that have an assembly level of 'complete' or 'chromosome' from GenBank", and not "download all 630,000 bacterial genomes on the server".

@kblin kblin added enhancement help wanted and removed need info Needs more info from the issue reporter labels May 18, 2020
Wrzlprmft pushed a commit to Wrzlprmft/ncbi-genome-download that referenced this issue May 18, 2020