ClusterShapley is a technique for explaining non-linear dimensionality reduction results. After reducing the data to 2D, you can explain the resulting cluster formation. Read the preprint or publisher version for further details.
ClusterShapley depends on common machine learning libraries, such as scikit-learn and NumPy. It also depends on SHAP.
Requirements:
- shap
- numpy
- scipy
- scikit-learn
- pybind11
With these requirements installed, install ClusterShapley from PyPI:
pip install cluster-shapley
The ClusterShapley package follows the same idea as scikit-learn estimators: you fit the data and then transform it to obtain explanations.
Explaining cluster formation
Suppose you want to investigate the decisions of a dimensionality reduction (DR) technique to impose a projection on 2D. The first thing to do is to project the dataset.
import umap
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# project the dataset to 2D with UMAP
reducer = umap.UMAP(verbose=0, random_state=0)
embedding = reducer.fit_transform(X)

plt.scatter(embedding[:, 0], embedding[:, 1], c=y)
plt.show()
Compute explanations
Now, you can generate explanations to understand why UMAP (or any other DR technique) produced that cluster formation.
import numpy as np

# our library
import dr_explainer as dre

# fit the dataset
clusterShapley = dre.ClusterShapley()
clusterShapley.fit(X, y)

# compute explanations for a random 20% subset of the data
rng = np.random.default_rng(0)
subset = rng.choice(X.shape[0], size=int(X.shape[0] * 0.2), replace=False)
to_explain = X[subset]
shap_values = clusterShapley.transform(to_explain)
The matrix shap_values of shape (3, 30, 4) contains the feature contributions:
- for each class (3);
- for each sample used to generate the explanations (30);
- for each feature (4).
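For a quick numeric summary of this matrix, you can rank the features per class by their mean absolute contribution. The snippet below is a sketch that uses a random stand-in array with the same shape; in practice, shap_values comes from clusterShapley.transform as shown above:

```python
import numpy as np

# Stand-in with the same shape as the shap_values produced above:
# (n_classes=3, n_samples=30, n_features=4).
rng = np.random.default_rng(0)
shap_values = rng.normal(size=(3, 30, 4))

feature_names = ["sepal length (cm)", "sepal width (cm)",
                 "petal length (cm)", "petal width (cm)"]

# mean absolute contribution of each feature, per class
mean_abs = np.abs(shap_values).mean(axis=1)  # shape (3, 4)

for klass in range(mean_abs.shape[0]):
    ranking = np.argsort(mean_abs[klass])[::-1]  # most influential first
    print(f"class {klass}: most influential feature -> {feature_names[ranking[0]]}")
```

Averaging over axis 1 (the samples) collapses the matrix into one importance score per class and feature, which is also what the beeswarm plots below summarize visually.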
Visualize the contributions using SHAP plots
For now, you can rely on the SHAP library to visualize the contributions:
import shap

klass = 0
c_exp = shap.Explanation(shap_values[klass], data=to_explain, feature_names=data.feature_names)
shap.plots.beeswarm(c_exp)
The plot shows the contribution of each feature to the cohesion of the selected class. For example, for 'petal length (cm)':
- low feature values (blue) contribute to the cohesion of the selected class;
- higher feature values (red) do not contribute to the cohesion.
Defining your own clusters
Suppose you want to investigate why UMAP clustered two classes together while projecting the third one far away in 2D.
To understand that, we can use ClusterShapley to explain how the features contribute to these two major clusters.
# fit KMeans with two clusters on the 2D embedding
# (see notebooks/ for the complete code; fitting on the embedding is one option,
# and the cluster labels may be swapped depending on initialization)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0).fit(embedding)
Let's generate explanations, knowing that cluster 0 is on the right and cluster 1 is on the left.
clusterShapley = dre.ClusterShapley()
clusterShapley.fit(X, kmeans.labels_)
shap_values = clusterShapley.transform(to_explain)
*For the right cluster*
c_exp = shap.Explanation(shap_values[0], data=to_explain, feature_names=data.feature_names)
shap.plots.beeswarm(c_exp)
The right cluster is characterized by low values of petal length (cm), petal width (cm), and sepal length (cm).
*For the left cluster*
c_exp = shap.Explanation(shap_values[1], data=to_explain, feature_names=data.feature_names)
shap.plots.beeswarm(c_exp)
On the other hand, the left cluster (composed of two classes) is characterized by high values of petal length (cm), petal width (cm), and sepal length (cm).
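This complementary reading of the two beeswarm plots can also be checked numerically: a feature's mean signed contribution in one cluster tends to have the opposite sign in the other. A minimal sketch with a stand-in array of the same shape (the real one comes from clusterShapley.transform above); the mirrored construction is only there to make the opposite-sign pattern visible:

```python
import numpy as np

feature_names = ["sepal length (cm)", "sepal width (cm)",
                 "petal length (cm)", "petal width (cm)"]

# Stand-in of shape (n_clusters=2, n_samples=30, n_features=4),
# built so cluster 1 mirrors cluster 0 (purely illustrative).
rng = np.random.default_rng(0)
base = rng.normal(size=(30, 4))
shap_values = np.stack([base, -base])

# mean signed contribution per feature, per cluster
mean_signed = shap_values.mean(axis=1)  # shape (2, 4)

for name, right, left in zip(feature_names, mean_signed[0], mean_signed[1]):
    print(f"{name}: right={right:+.3f}, left={left:+.3f}")
```

Features whose means carry opposite signs across the two rows are the ones separating the clusters, matching the blue/red split seen in the beeswarm plots.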
Please use the following reference for further details and to cite ClusterShapley in your work:
@article{MarcilioJr2021_ClusterShapley,
title = {Explaining dimensionality reduction results using Shapley values},
journal = {Expert Systems with Applications},
volume = {178},
pages = {115020},
year = {2021},
issn = {0957-4174},
doi = {10.1016/j.eswa.2021.115020},
url = {https://www.sciencedirect.com/science/article/pii/S0957417421004619},
author = {Wilson E. Marcílio-Jr and Danilo M. Eler}
}
ClusterShapley follows the 3-clause BSD license.
ClusterShapley uses the open-source SHAP implementation.