fastgather is faster than fastmultigather in loading the database #268

Open
mr-eyes opened this issue Mar 8, 2024 · 7 comments

Comments

@mr-eyes
Member

mr-eyes commented Mar 8, 2024

version info

== This is sourmash version 4.8.6. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

=> sourmash_plugin_branchwater 0.9.1; cite Irber et al., doi: 10.1101/2022.11.02.514947

fastgather

...fastgather is done! gather results in 'SAMD00293140_DRX333969.csv'
        Command being timed: "sourmash scripts fastgather sigs/SAMD00009664_SRX4035758.sig /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214-k51.zip -k 51 --scaled 10000 -c 32 -o SAMD00293140_DRX333969.csv"
        User time (seconds): 1368.00
        System time (seconds): 17.77
        Percent of CPU this job got: 2787%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:49.71
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 13608416
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 25
        Minor (reclaiming a frame) page faults: 1072434

fastmultigather
I killed it after 15 minutes.

From watching htop, I suspect the in-memory zip decompression is the bottleneck, but I am unsure.

@mr-eyes mr-eyes changed the title fastgather is faster than fastmultigather fastgather is faster than fastmultigather in loading the database Mar 9, 2024
@bluegenes
Contributor

bluegenes commented Mar 25, 2024

Ok, thinking through the different strategies used by fastgather and fastmultigather:

fastgather loads the single query into memory, and then loads just the collection (i.e. the sketch metadata) for the against sketches. It then uses load_sketches_above_threshold to load each against sketch into memory in parallel and generate the results.

fastmultigather loads the query and against collections first, then loads all against sketches into memory using load_sketches. It then iterates through the queries in parallel, loading each into memory and then comparing against each of the against sketches.
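
To make the contrast between the two strategies concrete, here is a minimal Python sketch. It is not the plugin's code: sketches are modelled as plain sets of hashes, the "database" is a dict of hypothetical loader callables, and all function names are made up for illustration. The fastgather-like path keeps only sketches that pass the threshold, while the fastmultigather-like path holds every sketch in memory before any query runs.

```python
def overlap(query, sketch):
    """Number of hashes shared between a query and a sketch (both are sets)."""
    return len(query & sketch)


def gather_one_query(query, database, threshold):
    """fastgather-like (assumed): load each sketch, keep it only if it passes."""
    kept = {}
    for name, loader in database.items():
        sketch = loader()  # hypothetical on-demand load of one sketch
        if overlap(query, sketch) >= threshold:
            kept[name] = sketch  # everything else is dropped immediately
    return kept


def gather_many_queries(queries, database, threshold):
    """fastmultigather-like (assumed): load everything first, then search."""
    loaded = {name: loader() for name, loader in database.items()}
    results = {}
    for qname, query in queries.items():
        results[qname] = {
            name for name, sketch in loaded.items()
            if overlap(query, sketch) >= threshold
        }
    return results, len(loaded)


# Toy database: three sketches behind loader callables.
database = {
    "a": lambda: {1, 2, 3, 4},
    "b": lambda: {10, 11},
    "c": lambda: {3, 4, 5},
}

kept = gather_one_query({1, 2, 3, 4}, database, threshold=2)
results, n_resident = gather_many_queries(
    {"q1": {1, 2, 3, 4}, "q2": {10, 11, 12}}, database, threshold=2
)
```

In this toy model the single-query path retains two of the three sketches, while the multi-query path keeps all three resident for the whole run.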

So perhaps what you're seeing is the effect of load_sketches_above_threshold being parallelized, while load_sketches is not. I think it would be relatively straightforward to parallelize load_sketches to speed up sketch loading. Note order may not be preserved if we parallelize.
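
The ordering caveat can be illustrated with a small Python sketch using only the standard library. This is an analogy rather than the Rust implementation: load_sketch is a hypothetical stand-in that just sleeps, and the thread pool plays the role of a parallel iterator. Collecting results as they complete gives no ordering guarantee, while an order-preserving map returns them in input order.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_sketch(name):
    """Hypothetical stand-in for reading and decompressing one sketch."""
    time.sleep(0.01)  # simulate I/O and decompression latency
    return name


def load_serial(names):
    """One sketch at a time; results are in input order."""
    return [load_sketch(n) for n in names]


def load_parallel_unordered(names, workers=8):
    """Parallel load; results arrive in completion order, not input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(load_sketch, n) for n in names]
        return [f.result() for f in as_completed(futures)]


def load_parallel_ordered(names, workers=8):
    """Parallel load that still returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_sketch, names))


names = [f"sketch_{i}" for i in range(32)]
serial = load_serial(names)
unordered = load_parallel_unordered(names)
ordered = load_parallel_ordered(names)
```

Whether ordering matters depends on whether downstream code relies on the position of a sketch in the loaded list; if it does, an order-preserving collect is the safer choice.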

@ctb
Collaborator

ctb commented Apr 15, 2024

There's something else going on here... even when using the code from #292, fastmultigather takes 30 minutes to load the things, while fastgather can do a search against GTDB rs214 in well under a minute. Hmm.

@ctb
Collaborator

ctb commented Apr 15, 2024

I guess it could be the need to store stuff in memory, or something. Maybe that consumes a lot of time. But it seems strange to me.

@bluegenes
Contributor

I'm seeing this issue too -- my benchmarks over in #298 were ~2 mins for fastgather and ~8 mins for fastmultigather.

The thing that confuses me is that I don't think any code in fastmultigather has changed since v0.9.0, for which we have benchmarks where both searches took ~2 mins each (#214).

The utils have changed more recently, but I think the only relevant change is the recent parallelization of load_sketches. Both fastgather and fastmultigather use the same underlying functions, prefetch and consume_query_by_gather, for the actual gather.
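
For readers unfamiliar with the shared gather step, the following Python sketch shows the general prefetch-then-greedy-gather idea. It is an assumption-laden illustration and not the plugin's actual prefetch or consume_query_by_gather: sketches are plain sets, the helper names are hypothetical, and a heap of negated overlaps stands in for a max-heap.

```python
import heapq


def prefetch(query, sketches, threshold):
    """Build a max-heap (via negated overlap) of candidates above threshold."""
    heap = [
        (-len(query & sketch), name)
        for name, sketch in sketches.items()
        if len(query & sketch) >= threshold
    ]
    heapq.heapify(heap)
    return heap


def greedy_gather(query, sketches, threshold):
    """Repeatedly take the best match and remove its hashes from the query."""
    remaining = set(query)
    heap = prefetch(remaining, sketches, threshold)
    results = []
    while heap:
        neg_overlap, name = heapq.heappop(heap)
        results.append((name, -neg_overlap))
        remaining -= sketches[name]
        # Re-score the other candidates against the shrunken query and
        # drop any that no longer reach the threshold.
        heap = [
            (-len(remaining & sketches[n]), n)
            for _, n in heap
            if len(remaining & sketches[n]) >= threshold
        ]
        heapq.heapify(heap)
    return results


sketches = {"a": {1, 2, 3, 4, 5}, "b": {4, 5, 6, 7}, "c": {6, 7}}
matches = greedy_gather({1, 2, 3, 4, 5, 6, 7}, sketches, threshold=2)
```

Here "a" is chosen first with 5 shared hashes, then "b" with the 2 hashes that remain, and "c" drops out because nothing is left for it to explain.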

@ctb
Collaborator

ctb commented Apr 29, 2024

ref #312

@ctb
Collaborator

ctb commented Jul 1, 2024

looking at sourmash-bio/sourmash#3232, it still kind of blows my mind how much faster fastgather is than fastmultigather. Grr.

@ctb
Collaborator

ctb commented Oct 22, 2024

Per the benchmarks in #479, this is still true - fastgather is much faster and lower memory than fastmultigather.

Now that I'm much more familiar with the codebase, I am pretty sure there is no simple bug that is slowing down fastmultigather. I would guess that the slowdown is one or more of:

  • slow random memory access to large volumes of memory: because we're loading the whole database into memory, we have to access memory across the entire db.
  • the use of Vec to store all the sketches.
  • the use of BinaryHeap in the prefetch step (although this doesn't ring true...)
  • bad/slow/incompetent use of par_iter in MultiCollection for the search step.

but in any case the next step here is to do profiling, I think.
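
As a first, coarse step before real profiling, the load and search phases can be timed separately to see which one dominates. The Python sketch below only illustrates that idea: load_database and search are hypothetical stand-ins, not functions from the plugin, and a proper investigation of the Rust code would use a real profiler.

```python
import time


def timed(label, func, *args, timings):
    """Run func(*args) and add its wall-clock time to timings[label]."""
    start = time.perf_counter()
    result = func(*args)
    timings[label] = timings.get(label, 0.0) + (time.perf_counter() - start)
    return result


def load_database(n):
    """Hypothetical loader: n toy sketches, each a set of 50 hashes."""
    return [set(range(i, i + 50)) for i in range(n)]


def search(query, sketches):
    """Hypothetical search: indices of sketches sharing >= 10 hashes."""
    return [i for i, s in enumerate(sketches) if len(query & s) >= 10]


timings = {}
sketches = timed("load", load_database, 1000, timings=timings)
matches = timed("search", search, set(range(100, 150)), sketches, timings=timings)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.6f} s")
```

If the load phase dominates, that points at decompression or deserialization; if the search phase dominates, the memory access pattern and the parallel iteration become the more likely suspects.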
