consider adding sourmash gather execution directly to fastgather as postprocessing #107

Closed
ctb opened this issue Sep 9, 2023 · 8 comments

Comments

@ctb
Collaborator

ctb commented Sep 9, 2023

could run with a picklist via the Python API

@ctb
Collaborator Author

ctb commented Jan 3, 2024

I looked into this and it's a bit annoying because fastgather really does take different (and more limited) database types than gather... I think it will be easier after #134 is merged.

@bluegenes
Contributor

bluegenes commented Jan 18, 2024

after #134, I was hoping to provide all gather columns in the branchwater fastmultigather results csv by using the rust GatherResult and adding in the query-specific columns within branchwater.

I think that's feasible, but might require a little more work in rust core:

Ref:

@bluegenes
Contributor

.. hmm, this actually doesn't directly help with fastgather since it's not using the underlying database gather. Sorry! But you might be interested in using GatherResult directly if/when I get that working for fastmultigather.

@ctb
Collaborator Author

ctb commented Jan 18, 2024

> after #134, I was hoping to provide all gather columns in the branchwater fastmultigather results csv by using the rust GatherResult and adding in the query-specific columns within branchwater.

Two thoughts - I love the details in here about what needs to be done! I think maybe it should be its own issue or set of issues!

But, also, my experience with a performance issue in branchwater, see #71, makes me wary here. I don't want to slow fastgather down by accident. I guess I am of two minds: is the only purpose of fastgather to do a full gather, faster? Or are there situations where we might want to take the output of fastgather and NOT do a gather afterwards?

OK, I feel dumb for even saying it. But anyway, be wary of performance issues, is all.

@bluegenes
Contributor

great point!

I think I want the full results often enough that it would be useful to enable. If there's a significant performance hit, it might be worth passing a flag to toggle between lightweight and full versions of gather.

I'll move the steps above over to a new fastmultigather issue :).

@ctb
Collaborator Author

ctb commented Jan 24, 2024

soooo looking at #188, I have a hot take:

both fastgather and full gather are annoying and big and slow for some samples - for a collection of rumen samples, it's been taking ~36 hours per sample with 64 threads to run fastgather against GTDB (!), and 500 GB of RAM to do the full gather on the resulting ~6000 matches with a picklist.

this feels a bit hacky, but I think there's opportunity for a version of gather that just calculates the statistics without doing the full search and so on. That is, it should be possible to just take the fastgather output and flesh out the full stats by "believing" the fastgather output - not using it as a picklist, and instead using it as an ordered scaffold. It would presumably be much faster and lower memory too...
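
A minimal sketch of the "ordered scaffold" idea, assuming each sketch can be treated as a plain set of integer hashes and that the fastgather output supplies the match order. The function name `calc_stats_from_scaffold`, the `GatherRow` fields, and the `scaled` default are hypothetical illustrations loosely modeled on gather's output columns, not the sourmash or branchwater API:

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass
class GatherRow:
    # Hypothetical subset of gather-style columns, for illustration only.
    name: str
    intersect_bp: int
    f_orig_query: float
    f_match: float
    f_unique_to_query: float
    unique_intersect_bp: int
    remaining_bp: int


def calc_stats_from_scaffold(
    query_hashes: Iterable[int],
    ordered_matches: Iterable[Tuple[str, Iterable[int]]],
    scaled: int = 1000,  # assumed scaled value; bp estimates are hashes * scaled
) -> List[GatherRow]:
    """Walk matches in the given order and compute per-match statistics.

    No search is performed: the order is "believed" as-is, and each match
    only claims the query hashes that earlier matches have not claimed.
    """
    orig_query: Set[int] = set(query_hashes)
    if not orig_query:
        return []

    remaining = set(orig_query)
    rows: List[GatherRow] = []

    for name, hashes in ordered_matches:
        match = set(hashes)
        if not match:
            continue  # skip empty sketches rather than divide by zero

        overlap_orig = orig_query & match  # overlap with the original query
        unique = remaining & match         # overlap not yet claimed
        remaining -= unique

        rows.append(
            GatherRow(
                name=name,
                intersect_bp=len(overlap_orig) * scaled,
                f_orig_query=len(overlap_orig) / len(orig_query),
                f_match=len(unique) / len(match),
                f_unique_to_query=len(unique) / len(orig_query),
                unique_intersect_bp=len(unique) * scaled,
                remaining_bp=len(remaining) * scaled,
            )
        )

    return rows
```

Because this is a single pass over the matches in the order given, only the query and one match need to be in memory at a time if the matches are loaded lazily. It also means any mistake in the fastgather ordering carries straight through to the statistics.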

@ctb
Collaborator Author

ctb commented Feb 4, 2024

This is becoming quite the tangle of issues and PRs 😅 but I wanted to connect them a bit more by pointing out we've gone this route with calc-full-gather.

ref PR dib-lab/sourmash-slainte#18 for sourmash-slainte workflow, and sourmash-bio/sourmash#2950 for problems revealed in sourmash.

@ctb
Collaborator Author

ctb commented Feb 4, 2024

You know what? I'm going to close this issue. #187 has all the important stuff remaining.

@ctb ctb closed this as completed Feb 4, 2024