-
-
Notifications
You must be signed in to change notification settings - Fork 257
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Coreset for Kmeans and GaussianMixture clustering #799
base: main
Are you sure you want to change the base?
Conversation
Currently I made this comparison, taking this sklearn doc as a baseline The original data has 150k rows and 2 features, 200 points are sampled with the Coreset class Notes |
Thanks for continuing to work on this.
Depending on the scheduler there's a roughly 10-200 microsecond per task overhead (https://docs.dask.org/en/latest/institutional-faq.html?highlight=overhead#how-well-does-dask-scale-what-are-dask-s-limitations). Can you think of anyone from the scikit-learn side who would be able to review this? I'm not too familiar with the coreset idea / Gaussian mixing. When you're ready, I can take a look at things from the Dask side. |
@TomAugspurger my pleasure. Concerning the review from the sklearn side, I think that Jérémie du Boisberranger worked on a set of runtime/memory benchmarks for KMeans, whose results have been presented at the 2019 scikit-learn consortium. Concerning interactions with the sklearn.GaussianMixture, I guess we can ask @gmaze, who initially raised the issue on adding Gaussian Mixture Models. He certainly is more informed than I am on how we should validate that our Coreset class can be used (with care ?) with GaussianMixture. My only intuition is that the sklearn.GaussianMixture uses KMeans internally to init the cluster centers (see the Also, I would need help on how to properly compare runtime with sklearn. My current benchmark takes the sklearn.KMeans class as an opponent, but sklearn has a MiniBatchKMeans which is meant to be faster.
|
Introduce a Coreset Meta Estimator, that samples a subset of the original data (keeping the original geometrical shape) and pass this sample to a scikit-learn estimator
original paper
TODO
self
(the dask meta estimator)refit
param in GridSearchCV