🐌 ➡️ 🐎 Optimize DBSCAN clustering method for genome binning #287

evanroyrees · 2022-08-07T19:01:11Z

According to the scikit-learn clustering docs, DBSCAN's memory consumption can be reduced:

Memory consumption for large sample sizes

This implementation is by default not memory efficient because it constructs a full pairwise similarity matrix in the case where kd-trees or ball-trees cannot be used (e.g., with sparse matrices). This matrix will consume n^2 floats. A couple of mechanisms for getting around this are:

Use OPTICS clustering in conjunction with the extract_dbscan method. OPTICS clustering also calculates the full pairwise matrix, but only keeps one row in memory at a time (memory complexity n).

A sparse radius neighborhood graph (where missing entries are presumed to be out of eps) can be precomputed in a memory-efficient way and dbscan can be run over this with metric='precomputed'. See sklearn.neighbors.NearestNeighbors.radius_neighbors_graph.
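Both mechanisms can be sketched roughly as follows. This is a minimal illustration on synthetic data, not Autometa's binning code: the array `X` and the `eps` and `min_samples` values are made-up placeholders. In current scikit-learn releases the DBSCAN extraction step for OPTICS is exposed as the function `cluster_optics_dbscan`, which is assumed here to be what the docs mean by the `extract_dbscan` method.

```python
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS, cluster_optics_dbscan
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for embedded contig coordinates (hypothetical data).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(50, 2)),
])
eps = 0.5          # placeholder neighborhood radius
min_samples = 5    # placeholder core-point threshold

# Mechanism 1: OPTICS followed by a DBSCAN-style extraction at a fixed eps.
optics = OPTICS(min_samples=min_samples, max_eps=eps).fit(X)
labels_optics = cluster_optics_dbscan(
    reachability=optics.reachability_,
    core_distances=optics.core_distances_,
    ordering=optics.ordering_,
    eps=eps,
)

# Mechanism 2: precompute a sparse radius-neighborhood graph, then run
# DBSCAN on it. Pairs farther apart than eps are simply absent from the graph.
nn = NearestNeighbors(radius=eps).fit(X)
graph = nn.radius_neighbors_graph(X, mode="distance", sort_results=True)
labels_sparse = DBSCAN(
    eps=eps, min_samples=min_samples, metric="precomputed"
).fit_predict(graph)

# Reference run on the raw coordinates, for comparison only.
labels_dense = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
```

On data like this the sparse-graph run should reproduce the labels of the plain DBSCAN run, while the OPTICS extraction can differ on border points. Whether either route actually saves memory for a given dataset depends on how sparse the radius graph is at the chosen `eps`, so it would need to be profiled on real inputs.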


evanroyrees added the enhancement (New feature or request) and stretch (wonderful for the community, yet may not be top priority) labels Aug 7, 2022

evanroyrees changed the title ~~🐌 ➡️ 🐎 Optimize DBSCAN clustering method for genome binning speedup~~ 🐌 ➡️ 🐎 Optimize DBSCAN clustering method for genome binning Aug 7, 2022
