Common clustering API (i.e., why aren't KShiftsClustering.jl, QuickShiftClustering.jl, SpectralClustering.jl here...?) #256
Comments
I agree that the 2nd alternative (common interface) is preferable -- we don't want to assemble all possible clustering algorithms in Clustering.jl -- that would potentially bloat its dependencies and compilation times, and make maintenance harder. Plus, it requires that the package authors agree to transfer their code here. The common interface has not been implemented so far because nobody has contributed code implementing it.
I am not 100% sure that all keywords have to go to the ClusteringAlgorithm.
To the best of my knowledge some clustering algorithms decide for themselves how many clusters there should be, such as DBSCAN, which only reinforces the argument that this should be an argument to the cluster type.
Sure, a PR where? Clustering.jl has too many dependencies to be used as the interface holder. We need a new package, but where should this package live?
Yes, the essential part of this effort is to have an interface that covers all/most different flavors of clustering problem specifications and still provides a convenient user API.
Except for Distances.jl and NearestNeighbors.jl, all the dependencies are standard Julia packages, so essentially you have them already. A new package, e.g. ClusteringAlgorithms.jl, is also possible.
The downside of the latter approach is that the implementation is less straightforward. The new package doesn't have to live in a JuliaXXX organization, especially if it is a weak dependency.
Not sure I agree with this. The point of an interface is that it doesn't change based on who participates in the interface. For an interface to be simple and intuitive it has to be the same no matter the types. So again I would argue it is much better if all parameters go into the algorithm type and there is a single function: cluster(alg::ClusteringAlgorithm, data::AbstractArray{<:AbstractArray}). The function returns
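A minimal sketch of what "all parameters go into the algorithm type" could look like. Everything here is hypothetical: GridClustering is a toy algorithm invented for illustration, and the return value (a vector of integer labels) is an assumption, since the comment above does not say what the function returns.

```julia
# Hypothetical interface sketch; none of these names are an existing package API.
abstract type ClusteringAlgorithm end

# All parameters of an algorithm live in its type, not in keyword arguments.
struct GridClustering <: ClusteringAlgorithm
    cellsize::Float64   # side length of each grid cell
end

# The single entry point: dispatch on the algorithm type.
# Assumption: the result is one integer label per data point.
function cluster(alg::GridClustering, data::AbstractArray{<:AbstractArray})
    labels = Vector{Int}(undef, length(data))
    cells = Dict{Vector{Int}, Int}()   # grid cell => cluster label
    for (i, point) in enumerate(data)
        cell = [floor(Int, x / alg.cellsize) for x in point]
        # Points falling in the same cell share a label; a new cell gets the next label.
        labels[i] = get!(cells, cell, length(cells) + 1)
    end
    return labels
end

data = [[0.1, 0.2], [0.3, 0.4], [5.1, 5.2]]
labels = cluster(GridClustering(1.0), data)   # [1, 1, 2]
```

Note that this toy algorithm, like DBSCAN, decides the number of clusters by itself, and the call site looks the same as it would for an algorithm that takes the number of clusters as a field.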
I believe that there is a third way that is superior as it utilizes the new Julia package extensions infrastructure. ClusteringAlgorithms.jl defines the interface and exports the using ClusteringAlgorithms you get In this way, all existing packages remain completely unaffected and ClusteringAlgorithms becomes nobody's dependency. One just has to do PRs to the common package ClusteringAlgorithms.jl to add implementations. Naturally, the downstream packages are recommended to advertise the common interface of ClusteringAlgorithms.jl in their documentation.
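A rough sketch of how such an extension might be laid out. It is a fragment of a package, not a standalone script. ClusteringAlgorithms.jl, its cluster function, and its DBSCAN type with radius and min_neighbors fields are all hypothetical. The only existing calls assumed are Clustering.dbscan(points, radius; min_neighbors) and Clustering.assignments, as found in recent Clustering.jl releases, so check them against the installed version.

```julia
# Project.toml of the hypothetical ClusteringAlgorithms.jl would declare
# Clustering.jl as a weak dependency and name the extension module:
#
#   [weakdeps]
#   Clustering = "<UUID of Clustering.jl>"
#
#   [extensions]
#   ClusteringExt = "Clustering"

# File ext/ClusteringExt.jl: loaded automatically only when the user has
# loaded both ClusteringAlgorithms and Clustering in the same session.
module ClusteringExt

using ClusteringAlgorithms   # hypothetical interface package
using Clustering             # existing package, left untouched

# Assumed to be defined in ClusteringAlgorithms.jl (hypothetical):
#   struct DBSCAN <: ClusteringAlgorithm
#       radius::Float64
#       min_neighbors::Int
#   end
function ClusteringAlgorithms.cluster(alg::ClusteringAlgorithms.DBSCAN, data::AbstractMatrix)
    # Forward the fields of the algorithm type to the backend's own function.
    result = Clustering.dbscan(data, alg.radius; min_neighbors = alg.min_neighbors)
    return Clustering.assignments(result)
end

end # module
```

With this layout neither package depends on the other in the ordinary sense: the method above only comes into existence once both are loaded.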
I think we both agree that there are common parameters that are required for specific classes of clustering problems, and these parameters are distinct between these problem classes.
But this is exactly the 2nd alternative I have described. :) It's great that we have the same vision!
I fully support @Datseris's ideas and have had similar gripes with Clustering.jl multiple times. The MLJ.jl wrappers don't improve the situation either. Currently the end-user experience is extremely bad. If you decide to move forward with this initiative, please let me know how I can help. I would also like to comment that the idea of having a centralized repository where the implementations are maintained together is good. Clustering.jl has this role today, but the lack of maintainers and "stuckness" of this package is really compromising progress here. I would simply start a fresh repository in JuliaML (or any organization where one of us is an admin) and would start writing the most common clustering algorithms in modern idiomatic Julia. I am positive that the Julia ML community would jump in to help. This is a good GSoC project btw.
I've started a solution here: https://discourse.julialang.org/t/rfc-clusteringapi-jl/112258
We are developing front-end software that uses clustering algorithms as a small part of its infrastructure. Ideally we would like users to be able to use a plethora of different clustering algorithms, especially because our application scenarios can range from a medium number of data points (where DBSCAN is ok) to a very large number (where DBSCAN is too expensive).
At the moment, our users only have access to Clustering.jl as that's our only dependency. But I just checked and there are many packages with clustering. So I have to ask, why aren't these algorithms included here?
Or, better yet, why not define a common clustering interface that all of these packages can satisfy, so that the user could just load the specific "clustering backend" and thus use any kind of clustering they want? All clustering has the same API, yet all packages use different names for the clustering functions, like dbscan(data; kwargs...). I think it would be great to define a function cluster(alg::ClusteringAlgorithm, data), and the instance of alg will have all keywords relevant to the specific algorithm.
cc @KalelR @rened @lucianolorenti
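To illustrate why this matters for front-end code, here is a small sketch in which the calling code depends only on the interface. All names are hypothetical; SignSplit and GapSplit are toy 1-D algorithms standing in for real backends such as DBSCAN.

```julia
# Hypothetical names throughout; this is not an existing package API.
abstract type ClusteringAlgorithm end
function cluster end   # generic function that backends add methods to

# Two toy "backends" for 1-D data, each carrying its own keywords as fields.
struct SignSplit <: ClusteringAlgorithm end
struct GapSplit <: ClusteringAlgorithm
    gap::Float64   # new cluster when consecutive sorted values differ by more than this
end

cluster(::SignSplit, data::AbstractVector{<:Real}) = [x < 0 ? 1 : 2 for x in data]

function cluster(alg::GapSplit, data::AbstractVector{<:Real})
    labels = zeros(Int, length(data))
    label = 0
    previous = -Inf
    for i in sortperm(data)          # visit points in increasing order
        if data[i] - previous > alg.gap
            label += 1               # large gap: start a new cluster
        end
        labels[i] = label
        previous = data[i]
    end
    return labels
end

# Front-end code: works with any backend that implements `cluster`.
count_clusters(alg::ClusteringAlgorithm, data) = length(unique(cluster(alg, data)))

data = [-2.0, -1.9, 0.1, 0.2, 5.0]
n_sign = count_clusters(SignSplit(), data)    # 2
n_gap  = count_clusters(GapSplit(1.0), data)  # 3
```

The front-end function never mentions a specific algorithm, so switching from one backend to another is a change of one argument at the call site.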