The t-SNE algorithm is immensely popular for data visualization. But is t-SNE sometimes showing us features -- in other words creating artefacts -- that do not exist in the data? Does it "hallucinate" clusters and shapes that do not exist? Does it miss other structures that do?
For example, we know from, e.g., this paper that t-SNE separates data into clusters when there are clusters.
--a figure of e.g. digits data from sk-learn
But what if t-SNE just likes to create clusters, whether the data is clustered or not? More generally, an artefact is a feature that we observe in the embedding of the data by t-SNE, but which does not exist in the data. Such features can be clusters, holes, "arms", and anything else that we may find interesting, when they are not in the data but appear to us in the embedding by t-SNE. (Of course, the question of artefacts applies to any other embedding algorithm.)
Note that here we do not give a formal definition of "feature", just as researchers who observe these features in data do not use formal definitions.
So, we will give t-SNE very simple artificial data that has no features, and if we observe clusters or other structure, we will learn what artefacts t-SNE likes to create. Since the output of an embedding algorithm also depends on its input parameters, we shall also look briefly into the effect of parameter choices. Here we use the sk-learn implementation of t-SNE. The reader unfamiliar with the algorithm can consult the original paper of van der Maaten and Hinton (2008), and find a variety of implementations here
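As a minimal sketch of the setup used throughout, here is how featureless synthetic data can be fed to the sk-learn implementation of t-SNE. The data size, dimension, and perplexity below are illustrative choices, not the exact values of our experiments.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Isotropic Gaussian data: no clusters, no holes, no "arms"
X = rng.standard_normal((500, 10))

# Default sk-learn parameters except for the random seed (illustrative)
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
Y = tsne.fit_transform(X)   # the 2D embedding, shape (500, 2)
```

Any structure visible in a scatter plot of `Y` beyond a round blob is then, by construction, an artefact.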
How to use t-SNE effectively describes some unexpected behaviours of t-SNE that are mostly specific to data with clusters. While their experiments illustrate small data sets, here we consider larger data, up to
Here is a first example to illustrate the issues a practitioner may face with t-SNE. A scientist has painstakingly obtained multi-dimensional measurements from thousands of objects (they can be cells, patients, users of a web site, stars or galaxies). They use t-SNE to represent the data in 2D. Unbeknownst to them, the data is in fact precisely 2-dimensional, and it can be represented in 2D by a simple rotation in the original space. Hence from the point of view of an embedding algorithm, this should be an easy problem with a stable solution (up to rigid translations and rotations).
The research group is aware that parameter choices may affect the algorithm's output, so they try many possible choices of perplexity. The perplexity is, roughly, the effective number of neighbors each point takes into account.
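The scenario above can be sketched as follows: data that is exactly 2-dimensional (here, a ring, though the scientist does not know this) is recorded in many coordinates via a random rotation, then embedded at several perplexities. All sizes and perplexity values are illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
n, D = 500, 20

# A ring in 2D: angles uniform, radii in a narrow band
theta = rng.uniform(0, 2 * np.pi, n)
r = rng.uniform(0.7, 1.0, n)
X2 = np.column_stack([r * np.cos(theta), r * np.sin(theta)])

# A random rotation hides the 2D structure in D coordinates
Q, _ = np.linalg.qr(rng.standard_normal((D, D)))
X = np.hstack([X2, np.zeros((n, D - 2))]) @ Q.T

# Sweep the perplexity, as the research group would
embeddings = {
    p: TSNE(perplexity=p, random_state=0).fit_transform(X)
    for p in (5, 30, 100)
}
```

Plotting `embeddings[p]` for each perplexity `p` reproduces the kind of disagreement between panels described next.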
These are the embeddings obtained.
From left to right, we see: a circular blob with higher density in the center, clusters of various granularities, or a disk with a hole (that is, a ring). If these data were cells, we could hypothesise that they all behave the same way (the blob), that there are multiple groups with different behaviors (the clusters), or that the behavior varies continuously and cyclically (the ring). How can we figure out what the original data looked like?
Based on generic results on manifold learning from non-parametric statistics, to consistently represent a manifold from samples, the number of neighbors $k$ should grow with the sample size $n$, but slowly, so that $k/n \to 0$ and neighborhoods remain small relative to the data.
On the other hand, by the recommendation of the algorithm's authors and implementers, larger perplexity is better, hence we should believe the plots on the extreme right (the ring), which is indeed the correct answer in this case.
[plot of true ring data]
We now embed a disk of radius 1 using various perplexity values.
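A sketch of how such a disk can be sampled uniformly (an illustrative construction; the sample size is an assumption): for a uniform density on the disk, the radius must be the square root of a uniform variable, otherwise points pile up near the center.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# sqrt() makes the density uniform over the area of the disk
r = np.sqrt(rng.uniform(0, 1, n))
theta = rng.uniform(0, 2 * np.pi, n)
disk = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
```

The array `disk` can then be passed directly to `TSNE(...).fit_transform` at each perplexity of interest.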
Note that here the number of neighbors
It appears that more neighbors work better for t-SNE. However, requiring more neighbors can get an algorithm into trouble when the manifold is non-trivially curved. When the previous disk is deformed, neighborhoods become curved, and their shape cannot be accurately reproduced in 2D. Thus t-SNE has a much harder time finding a good local minimum.
This paper (https://arxiv.org/abs/2102.13009) offers an analysis of t-SNE on structureless graphs, such as the random $k$-regular graph and the Erdos-Renyi random graph defined below.
A $k$-regular graph is a graph in which each node has exactly $k$ neighbors.
An Erdos-Renyi ER$(p)$ graph is a random graph in which each possible edge is present independently with probability $p$.
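One way to feed such a graph to t-SNE (an assumption on our part, not necessarily the protocol of the cited paper) is to use graph shortest-path distances as a precomputed metric. For concreteness the sketch below builds an ER$(p)$ graph; the $k$-regular case can be generated with, e.g., networkx's `random_regular_graph`. The sizes and parameters are illustrative.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
n, p = 300, 0.05

# ER(p): each possible edge present independently with probability p
upper = np.triu(rng.random((n, n)) < p, k=1)
A = (upper | upper.T).astype(float)   # symmetric adjacency matrix

# Shortest-path (hop-count) distances between all pairs of nodes
D = shortest_path(A, method="D", unweighted=True)
assert np.isfinite(D).all()   # connected w.h.p. since p is well above log(n)/n

# Precomputed metric requires a random (not PCA) initialization
Y = TSNE(metric="precomputed", init="random", perplexity=30,
         random_state=0).fit_transform(D)
```

A scatter plot of `Y` should, ideally, show a featureless blob; filaments or faint clusters are artefacts.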
These simple-to-construct graphs are examples of "high-dimensional" data, i.e. of data that cannot be embedded in low dimensions without distortion. They are also featureless with very high probability, hence if their embedding shows anything different from a blob, that would be an artefact. Our experiments show that t-SNE does find blobs, but it still adds some features, like filaments and even faint clusters.
Indeed, for the
The results are more variable for the ER$(p)$ graph, and in fact in some cases we observe artefacts such as thin rings.
For very large
The reason for this (now with actual proofs) is the same as the more intuitive explanation in this FAQ: t-SNE "solves a problem known as the crowding problem"; in other words, it works by trying to push points away from each other, against the "neighborhood ties". In a data set with clusters, the weaker ties between clusters will give way. However, when there are no clusters, t-SNE will still be happier if it can break some ties. Note that the number of ties in a random graph grows like
In most embedding algorithms, the number of neighbors affects both the runtime and the final value of the cost.
For this, we use the ring data.
-- to make a table with 2 columns. these figures need axes and labels!--
| Runtime | Algorithm Cost |
|---|---|
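A sketch of how such a table could be produced: time each fit and read off the final KL divergence (the cost t-SNE minimizes), which sk-learn exposes as the `kl_divergence_` attribute of a fitted estimator. The ring data and the perplexity values below are illustrative stand-ins.

```python
import time
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
n = 500
theta = rng.uniform(0, 2 * np.pi, n)
r = rng.uniform(0.8, 1.0, n)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])  # ring data

rows = []
for perp in (10, 50):
    t0 = time.perf_counter()
    model = TSNE(perplexity=perp, random_state=0)
    model.fit_transform(X)
    rows.append((perp, time.perf_counter() - t0, model.kl_divergence_))

for perp, secs, kl in rows:
    print(f"perplexity={perp}: {secs:.1f} s, KL={kl:.3f}")
```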
t-SNE takes a "statistical" (or shall we say "geometric") parameter, the perplexity, and a number of other parameters that control the iterative descent algorithm that computes the embedding. In our experiments we focus solely on the perplexity, and we run the algorithm with the default parameters in sk-learn and a sufficient number of iterations.
But what is the perplexity? It turns out that the t-SNE algorithm chooses for each point $i$ a bandwidth $\sigma_i$ of the Gaussian kernel, so that the entropy of the resulting distribution over the neighbors of $i$ matches the perplexity set by the user.
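A minimal sketch of this calibration (following the standard description of t-SNE, not sk-learn's exact internals): for each point, binary-search the bandwidth $\sigma_i$ so that $2^{H}$ of the conditional neighbor distribution equals the requested perplexity, where $H$ is the entropy in bits.

```python
import numpy as np

def sigma_for_perplexity(sq_dists, perplexity, iters=50):
    """Binary search for the Gaussian bandwidth of one point.

    sq_dists: squared distances from point i to all other points.
    """
    lo, hi = 1e-10, 1e10
    for _ in range(iters):
        sigma = (lo + hi) / 2
        p = np.exp(-sq_dists / (2 * sigma ** 2))
        p /= p.sum()
        H = -np.sum(p * np.log2(p + 1e-12))   # entropy in bits
        if 2 ** H > perplexity:   # too many effective neighbors: shrink sigma
            hi = sigma
        else:
            lo = sigma
    return sigma

# Hypothetical squared distances from one point to 200 others
rng = np.random.default_rng(0)
d2 = rng.uniform(0.1, 4.0, 200)
s = sigma_for_perplexity(d2, 30.0)
```

After calibration, the effective number of neighbors $2^{H}$ for this point is (up to numerical precision) the requested perplexity, 30.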
But the t-SNE authors recommend setting a parameter called perplexity, which is $2^{H}$, with $H$ the entropy (in bits) of the neighbor distribution, and which can be interpreted as a smooth measure of the effective number of neighbors.
So can we find any intuition on the choice of the perplexity (or equivalently of the $k$ ) parameter?
We look again at the embedding of the disk. Maybe
Rather, Figure ... reference the fig above... suggests that the neighborhood radius is a parameter that influences the topology of the embedding.
Much other artefact-producing behavior is exemplified at https://distill.pub/2016/misread-tsne/