🐌 ➡️ 🐎 Optimize DBSCAN clustering method for genome binning #287

Open
evanroyrees opened this issue Aug 7, 2022 · 0 comments
Labels
enhancement (New feature or request), stretch (wonderful for the community, yet may not be top priority)

Comments

@evanroyrees (Collaborator)

According to the scikit-learn clustering docs, DBSCAN's memory consumption can be reduced:

Memory consumption for large sample sizes

This implementation is by default not memory efficient because it constructs a full pairwise similarity matrix in the case where kd-trees or ball-trees cannot be used (e.g., with sparse matrices). This matrix will consume n² floats. A couple of mechanisms for getting around this are:

Use OPTICS clustering in conjunction with the extract_dbscan method. OPTICS clustering also calculates the full pairwise matrix, but only keeps one row in memory at a time (memory complexity n).

A sparse radius neighborhood graph (where missing entries are presumed to be out of eps) can be precomputed in a memory-efficient way and dbscan can be run over this with metric='precomputed'. See sklearn.neighbors.NearestNeighbors.radius_neighbors_graph.
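A rough sketch of what the two mechanisms could look like, assuming a dense feature matrix `X` and Euclidean distance. The helper names (`dbscan_sparse_precomputed`, `dbscan_via_optics`), the toy data, and the `eps`/`min_samples` values are illustrative assumptions and are not taken from this project's code. In recent scikit-learn releases the DBSCAN extraction step for OPTICS is exposed as the function `cluster_optics_dbscan`, which is what the sketch uses.

```python
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS, cluster_optics_dbscan
from sklearn.neighbors import NearestNeighbors


def dbscan_sparse_precomputed(X, eps, min_samples):
    """DBSCAN over a sparse radius-neighborhood graph (metric='precomputed')."""
    neighbors = NearestNeighbors(radius=eps).fit(X)
    # Only pairs within eps are stored; missing entries are presumed out of eps.
    graph = neighbors.radius_neighbors_graph(X, mode="distance")
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="precomputed")
    return model.fit(graph).labels_


def dbscan_via_optics(X, eps, min_samples):
    """DBSCAN-like labels extracted from an OPTICS ordering."""
    # max_eps=eps limits the neighborhood search to the radius DBSCAN would use.
    optics = OPTICS(min_samples=min_samples, max_eps=eps).fit(X)
    return cluster_optics_dbscan(
        reachability=optics.reachability_,
        core_distances=optics.core_distances_,
        ordering=optics.ordering_,
        eps=eps,
    )


# Hypothetical toy data: two tight, well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(50, 2)),
])

sparse_labels = dbscan_sparse_precomputed(X, eps=0.5, min_samples=5)
optics_labels = dbscan_via_optics(X, eps=0.5, min_samples=5)
```

The sparse-graph route should reproduce plain DBSCAN labels, since it feeds DBSCAN the same within-`eps` distances. The OPTICS route is only DBSCAN-like: border points may be assigned differently, so it is worth comparing binning results before switching.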

@evanroyrees added the enhancement and stretch labels Aug 7, 2022
@evanroyrees changed the title from "🐌 ➡️ 🐎 Optimize DBSCAN clustering method for genome binning speedup" to "🐌 ➡️ 🐎 Optimize DBSCAN clustering method for genome binning" Aug 7, 2022