-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Why is the kmeans algorithm column-oriented instead of row-oriented? #79
Comments
There were discussions about this a while back (though I'm struggling to locate one now), and ultimately it came down to the original package author's personal preference. Apparently observations-as-columns is the norm in machine learning, at least according to some, and observations-as-rows is the norm in classical statistics. Personally I'm in the camp of observations being rows and have been bitten by this package's use of columns. The problem with changing it at this point is that existing code will suddenly become very wrong, and it's not something we'd be able to catch and emit a deprecation warning for. We'd have to redesign the API and deprecate the existing API in favor of the new one. |
Unless we can find another package where clustering works that way, I'd favor moving to rows as observations. A possible deprecation path would be to add a EDIT: see also JuliaStats/Distances.jl#35 |
@nalimilan That sounds perfect. Simple and largely non-disruptive. Nice idea! 💯 |
Sounds like a plan. I can take a first stab at implementing this and submit a pull request for your review. |
That sounds great, @paulhendricks. Thanks! |
Awesome. As far as I can tell, kmeans is the only clustering algorithm that needs changed to close this issue. Below is a first pass at implementing @nalimilan 's suggestion: const _kmeans_default_init = :kmpp
const _kmeans_default_maxiter = 100
const _kmeans_default_tol = 1.0e-6
const _kmeans_default_display = :none
const _vardim = 2
function kmeans!{T<:AbstractFloat}(X::Matrix{T}, centers::Matrix{T};
weights=nothing,
maxiter::Integer=_kmeans_default_maxiter,
tol::Real=_kmeans_default_tol,
display::Symbol=_kmeans_default_display,
vardim::Integer=_vardim)
# Here, we use vardim to assess if matrix X is
# column-oriented (vardim == 1) or row-oriented (vardim == 2).
# By default, X should be row-oriented. Use vardim=1
# to recover the old behavior where X is column-oriented.
#
# See this issue for more details:
# https://github.com/JuliaStats/Clustering.jl/issues/79
info("vardim defaults to 2 in version x.x.x and will default to 1 in x.x.x.")
info("vardim will then be deprecated in version x.x.x.")
if vardim == 2:
# If matrix X is row-oriented (vardim == 2), transpose
# it to be column-oriented. When vardim is deprecated,
# this transposition will be the first operation.
X = transpose(X)
# Once X is column-oriented, the rest of algorithm
# can run as normal with no need for modification.
m, n = size(X)
m2, k = size(centers)
m == m2 || throw(DimensionMismatch("Inconsistent array dimensions."))
(2 <= k < n) || error("k must have 2 <= k < n.")
assignments = zeros(Int, n)
costs = zeros(T, n)
counts = Array(Int, k)
cweights = Array(Float64, k) Does this look good 👍 ? |
Use |
Just as a point of reference, one of the original arguments for having observations as columns was that Julia arrays are in column-major order, so this allows data for individual observations to be stored contiguously. See, e.g., http://julialang.org/blog/2013/09/fast-numeric, especially under "Write cache-friendly codes". |
Right, that's probably how Dahua chose the arrangement to use. Benchmarks are presented in Distances.jl README: https://github.com/JuliaStats/Distances.jl#benchmarks Though I wonder whether it really makes a difference since the text says the computations are done by calling BLAS's Somebody would have to run the benchmarks using the rowwise convention to check this. |
SciKit Learn has a great interface, but in my experience, it doesn't always have the fastest implementation. More importantly, Python/numpy are row-major order by default, so matching that convention already keeps the Dara contiguous/cache friendly. |
(But running the benchmarks is definitely the way to go!) Since Julia just obtained a Transpose type, this actually shouldn't be that costly. |
I didn't know that. It's indeed a good explanation. OTOH, R is column-major, but it doesn't care so much about speed... |
A confirmation that R's |
Most relevant comments have already been made here but just wanted to point out that it seems that ML algorithms are often expressed and written by observation, i.e. for OLS that would be |
|
It's fine to allow I assume that the major motivation of the change is |
Tables.jl really isn't my motivation for this change, as it can easily gain an option to transpose the columns, which should be relatively efficient compared with the cost of computing pairwise distances and clustering (JuliaData/Tables.jl#66). The main reason is that I want to ensure consistency of functions that take a dimension argument across the ecosystem, and
Given what I said above, I'd rather require users to choose explicitly between consistency with the rest of Julia and the world ( We should also provide a Tables.jl-based interface which will automatically transpose the data under the hood, so that the efficient approach is used and people don't have to think about it. |
JuliaStats/Distances.jl#123 implements the convenience Tables.jl interface. I've tried measuring the performance difference between |
As far as I can see, MultivariateStats is unfortunately also implemented columns-are-observations - see e.g. the PCA example julia> using MultivariateStats, RDatasets
julia> iris = dataset("datasets", "iris");
julia> # split half to training set
Xtr = convert(Matrix,iris[:,1:4])'
4×150 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}:
5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 … 6.8 6.7 6.7 6.3 6.5 6.2 5.9
3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 3.2 3.3 3.0 2.5 3.0 3.4 3.0
1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 5.9 5.7 5.2 5.0 5.2 5.4 5.1
0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 2.3 2.5 2.3 1.9 2.0 2.3 1.8
julia> fit(PCA, Xtr)
PCA(indim = 4, outdim = 3, principalratio = 0.9947878161267246)
julia> transform(ans, Xtr)
3×150 Array{Float64,2}:
2.68413 2.71414 2.88899 … -1.76435 -1.90094 -1.39019
0.319397 -0.177001 -0.144949 0.0788589 0.116628 -0.282661
0.0279148 0.210464 -0.0179003 -0.130482 -0.723252 -0.36291 In my view, observations-as-columns are practically never found in the wild outside some parts of Julia's ecosystem, and it would seem quite necessary to converge on observations-as-rows throughout, in spite of the issues with deprecations. |
As an potential additional data point (against the observations-as-row view), observations as columns is nice when converting between a complex number vector and the array used for clustering. Since complex numbers are stored as reshape(reinterpret(Float64,cvals::Vector{Complex{Float64}}),2,length(cvals)) No transpose is needed. |
In the docs (below), the
kmeans
algorithm takes a matrix where each column X[:, i] corresponds to an observed sample. This implementation goes against the idea of tidy data as well as differs from Python's scikit-learn implementation of kmeans and R's base implementation of kmeans.Is there a good reason for this? Should this algorithm be changed from column-oriented to row-oriented so as to be consistent with R and Python as well as with the concept of tidy data?
URL: http://clusteringjl.readthedocs.io/en/stable/overview.html
Inputs
A clustering algorithm, depending on its nature, may accept an input matrix in either of the following forms:
The text was updated successfully, but these errors were encountered: