---
title: Artificial Intelligence - Knowledge Representation and Planning - Assignment 3
author: Lorenzo Soligo - 875566
date: Academic Year 2018-2019
subparagraph: true
numbersections: true
documentclass: article
urlcolor: blue
geometry:
---
\clearpage
Read this article presenting a way to improve the discriminative power of graph kernels. Choose one graph kernel among
- Shortest-path Kernel
- Graphlet Kernel
- Random Walk Kernel
- Weisfeiler-Lehman Kernel
Choose one manifold learning technique among
- Isomap
- Diffusion Maps
- Laplacian Eigenmaps
- Local Linear Embedding
Compare the performance of an SVM trained on the given kernel, with or without the manifold learning step, on the following datasets:
- PPI: this is a Protein-Protein Interaction dataset. Here proteins (nodes) are connected by an edge in the graph if they have a physical or functional association. Contains 2 classes.
- Shock: representing 2D shapes. Each graph is a skeletal-based representation of the differential structure of the boundary of a 2D shape. Contains 10 classes.
Note: the datasets are contained in Matlab files. The variable \texttt{G} contains a vector of cells, one per graph. The entry \texttt{am} of each cell is the adjacency matrix of the graph. The variable \texttt{labels} contains the class labels of each graph.
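For reference, the datasets can be loaded in Python with \texttt{scipy.io.loadmat}. The following is a minimal sketch: the file name and the exact indexing into the cell array are assumptions of mine and may need to be adapted to how \texttt{loadmat} unpacks the structures.

```python
from scipy.io import loadmat
import numpy as np

data = loadmat("PPI.mat")            # hypothetical file name; same idea for Shock
G = data["G"]                        # cell array with one cell per graph
labels = data["labels"].ravel()      # class label of each graph

# adjacency matrix of the i-th graph (the indexing may differ slightly
# depending on how the cells/structs are nested after loading)
adjacency_matrices = [np.asarray(G[0, i]["am"][0, 0]) for i in range(G.shape[1])]
```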
A positive-definite kernel is a generalization of a positive-definite function or a positive-definite matrix.
Let $\mathcal{X}$ be a non-empty set. A symmetric function $k : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ is a positive-definite kernel on $\mathcal{X}$ if
$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \, k(x_i, x_j) \geq 0$$
holds for every $n \in \mathbb{N}$, every $x_1, \dots, x_n \in \mathcal{X}$ and every $c_1, \dots, c_n \in \mathbb{R}$.
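As a quick numerical sanity check (a sketch of mine, not part of the assignment code), one can verify that a Gram matrix produced by a candidate kernel is positive semi-definite by checking that its eigenvalues are non-negative:

```python
import numpy as np

def is_positive_semidefinite(K, tol=1e-10):
    """Return True if the symmetric matrix K has no eigenvalue below -tol."""
    eigenvalues = np.linalg.eigvalsh((K + K.T) / 2.0)  # symmetrize for numerical stability
    return bool(np.all(eigenvalues >= -tol))
```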
Kernel methods (SVMs among many others) exploit kernel functions to work in high-dimensional, implicit feature spaces without ever computing the coordinates of the data in that space. This is achieved by evaluating the inner products between the images of all pairs of data points in the feature space, an operation known as the kernel trick. It is extremely useful when the dataset is not linearly separable in the original space but can easily be separated by a hyperplane in a higher-dimensional space.
Formally, a kernel maps two objects $x$ and $x'$ to a real number through an implicit feature map $\phi$: $k(x, x') = \langle \phi(x), \phi(x') \rangle$, where $\langle \cdot, \cdot \rangle$ is an inner product in the feature space.
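To make the kernel trick concrete, here is a tiny, self-contained illustration (not specific to graphs and not part of the assignment code) showing that evaluating the degree-2 polynomial kernel is the same as taking an inner product in an explicit feature space:

```python
import numpy as np

# For k(x, y) = (x . y)^2 on R^2, the implicit feature map is
# phi(x) = (x1^2, sqrt(2) * x1 * x2, x2^2).
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
explicit = phi(x) @ phi(y)   # inner product of the images in feature space
implicit = (x @ y) ** 2      # kernel evaluation, no explicit mapping needed
assert np.isclose(explicit, implicit)
```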
Given two graphs $G$ and $G'$ from a space of graphs $\mathcal{G}$, the graph comparison problem is to find a mapping $s : \mathcal{G} \times \mathcal{G} \rightarrow \mathbb{R}$ such that $s(G, G')$ quantifies the similarity of $G$ and $G'$.
Given two graphs $G$ and $G'$, graph isomorphism asks whether there exists a bijection $f : V(G) \rightarrow V(G')$ such that $(u, v) \in E(G)$ if and only if $(f(u), f(v)) \in E(G')$.
At the moment we do not know a polynomial time algorithm for graph isomorphism, but we also do not know whether the problem is NP-complete.
On the other hand, we know that subgraph isomorphism is NP-complete. Subgraph isomorphism checks whether there is a subset of edges and vertices of the larger graph that is isomorphic to the smaller graph.
Another approach is the graph edit distance: the idea is to count the number of elementary operations (insertions, deletions and relabelings of nodes and edges) that are necessary to transform $G$ into $G'$.
The idea here is to map each graph to a feature vector and then use distances and metrics on vectors for learning on graphs. The clear advantage is that known, efficient tools for feature vectors can be reused, but the feature vector transformation either leads to a loss of topological information or still includes subgraph isomorphism as one step.
From the background, we have understood that computing whether two graphs are isomorphic is usually expensive, often becoming infeasible for "big" graphs. Therefore it would be great to have a polynomial time similarity measure for graphs. Graph kernels allow us to compare substructures of graphs that are computable in polynomial time. We want a graph kernel to be expressive, efficient to compute, positive definite and applicable to a wide range of graphs.
Graphs are usually represented using adjacency lists/matrices. However, standard pattern recognition techniques require data to be represented in vectorial form, which is quite a tough operation for graphs. First of all, the nodes in a graph are not ordered, therefore a reference structure must be established as a prerequisite. Second, even though the graphs could be encoded as vectors, their length would be variable and they would therefore belong to different spaces.
The kernel trick has the advantage of shifting the problem from a vectorial representation -now implicit- to a similarity representation, allowing standard learning techniques to be applied to data for which a vectorial representation is hard to achieve.
A graph kernel is a kernel function that computes an inner product on graphs. Graph kernels can be intuitively understood as functions measuring the similarity of pairs of graphs. They allow kernelized learning algorithms such as SVMs to work directly on graphs, without having to do feature extraction to transform them to fixed-length, real-valued feature vectors.
Computing graph similarities is fundamental in many research areas: for example, a common assumption when working with previously unseen molecules is that molecules with similar structures will have similar functional properties. This is a typical case in which measuring the similarity between graphs is a fundamental aspect of the research work.
To better explain graph kernels, let us introduce R-convolution kernels, a family of kernels of which graph kernels are instances. These kernels compare decompositions of two discrete, structured, compound objects. Most R-convolution kernels simply count the number of isomorphic substructures in the two compared graphs and differ mainly by the type of substructures used in the decomposition and the algorithms used to count them efficiently.
Formally, once we define a positive semi-definite kernel $k$ on the parts of the decomposition, the R-convolution kernel between two graphs $G$ and $G'$ is obtained by summing $k$ over all pairs of parts:
$$k_R(G, G') = \sum_{s \in \mathcal{R}(G)} \; \sum_{s' \in \mathcal{R}(G')} k(s, s'),$$
where $\mathcal{R}(G)$ denotes the multiset of substructures into which $G$ is decomposed.
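As a deliberately simple instance of an R-convolution kernel (a sketch for illustration only, not the kernel used in the experiments), one can decompose each graph into its multiset of node labels and count the matching pairs:

```python
from collections import Counter

def vertex_label_kernel(labels_g, labels_h):
    """k(G, H) = number of pairs (v in G, w in H) whose labels coincide."""
    cg, ch = Counter(labels_g), Counter(labels_h)
    return sum(cg[label] * ch[label] for label in cg.keys() & ch.keys())

# For unlabelled graphs the node degrees can play the role of labels.
print(vertex_label_kernel([1, 2, 2, 3], [2, 2, 2, 4]))  # -> 6
```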
One of the main problems with the aforementioned approach is that, given the high degree of information that graphs express, the task of defining complete kernels (i.e. kernels whose implicit map $\phi$ is injective) is as hard as deciding graph isomorphism.
In particular, let $k(G, G') = \langle \phi(G), \phi(G') \rangle$ be a graph kernel with an injective map $\phi$. Such a kernel could be used to decide whether two graphs are isomorphic.
In fact, since the kernel is an inner product in the feature space, the distance between the images of the two graphs can be computed as
\begin{align*}
&\sqrt{k(G, G)-2k(G,G')+k(G',G')} \\
&=\sqrt{\big \langle \phi(G)-\phi(G'), \phi(G)-\phi(G') \big \rangle} \\
&=||\phi(G)-\phi(G')|| = 0
\end{align*}
iff $\phi(G) = \phi(G')$, i.e. (by the injectivity of $\phi$) iff $G$ and $G'$ are isomorphic.
As we have said, computing kernels for injective mappings is as hard as deciding graph isomorphism. Instead, what we are looking for is a polynomial-time, non-injective kernel which combines expressivity with efficiency.
Many graph kernels are very effective at generating implicit embeddings, but there is no guarantee that the data in the Hilbert space will show a better class separation. This is due to the complexity of the structural embedding problem and to the compromises required for efficient kernel computation. For example, data tends to cluster tightly along a curve that wraps around the embedding space because the geodesic distances on the manifold are consistently underestimated, which places the data onto a highly non-linear manifold in the Hilbert space. This horseshoe is in fact the intersection between the manifold and the plane used to visualise the data. It might be caused by kernel normalisation, which projects data points from the Hilbert space onto the unit sphere, possibly creating an artificial curvature of the space that either generates or exaggerates the horseshoe effect.
Generally the non-linearity of the mapping is used to improve local class separability, while a large curvature might fold the manifold reducing long range separability. The impact of the locality of distance information on the performance of the kernel thus becomes a key point to be studied: we will use some manifold learning techniques to embed the graphs onto a low-dimensional vectorial space, trying to unfold the embedding manifold and increase class separation.
\begin{figure}[H] \centering \includegraphics[width=0.9\textwidth]{img/embedding.png} \caption{Example of reduced linear separability due to the curvature of the embedding: mapping to a manifold with high global curvature results in low linear separability of the data, while a non-linear mapping to a low-curvature manifold makes the data linearly separable. The higher the curvature, the less separable the data.} \end{figure}
Also, many kernels proposed in the literature neglect locational information for the substructures in a graph and therefore cannot establish reliable structural correspondences between the substructures of a pair of graphs, lowering the precision of the similarity measure.
In this assignment, I will test the Weisfeiler-Lehman kernel.
The Weisfeiler-Lehman kernel is a state-of-the-art graph kernel that enumerates the common subtrees between two graphs by using the Weisfeiler-Lehman test of graph isomorphism.
Given two attributed graphs $G$ and $G'$, the idea is to run $h$ iterations of the Weisfeiler-Lehman relabeling procedure on both graphs and compare the resulting label sets.
Let $\{G_0, G_1, \dots, G_h\}$ be the Weisfeiler-Lehman sequence of $G$, where $G_0 = G$ and $G_i$ is the relabelled graph obtained at the $i$-th iteration (and analogously for $G'$). The Weisfeiler-Lehman kernel with $h$ iterations and base kernel $k$ is defined as
$$k_{WL}^{(h)}(G, G') = k(G_0, G'_0) + k(G_1, G'_1) + \dots + k(G_h, G'_h),$$
where $k$ is a positive semi-definite base kernel on graphs; in the subtree variant used here, $k$ counts the pairs of nodes of the two graphs that share the same label.
Perform the following three steps:

- sorting: represent each node $v$ as a sorted list $L_v$ of its neighbors ($\mathcal{O}(m)$)
- compression: compress this list into a hash value $h(L_v)$ ($\mathcal{O}(m)$)
- relabeling: relabel $v$ with $h(L_v)$ as its new node label ($\mathcal{O}(n)$)
Complexity:

- $\mathcal{O}(m \cdot h)$ per pair of graphs
- $\mathcal{O}(N \cdot m \cdot h + N^2 \cdot n \cdot h)$ for $N$ graphs.
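The following is a minimal sketch in Python of one such relabeling iteration (the names are illustrative, not those of the assignment code). Note that when computing the kernel, the compression dictionary must be shared across all graphs being compared, so that identical neighbourhood signatures always receive the same compressed label.

```python
def wl_iteration(adjacency, labels, compressed):
    """One Weisfeiler-Lehman relabeling step.

    adjacency:  dict node -> iterable of neighbouring nodes
    labels:     dict node -> current label
    compressed: shared dict mapping a signature to its compressed integer label
    """
    new_labels = {}
    for v, neighbours in adjacency.items():
        # sorting: own label followed by the sorted labels of the neighbours
        signature = (labels[v], tuple(sorted(labels[u] for u in neighbours)))
        # compression: map the signature to a (possibly new) integer label
        if signature not in compressed:
            compressed[signature] = len(compressed)
        # relabeling
        new_labels[v] = compressed[signature]
    return new_labels
```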
Manifold learning is an approach to non-linear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high and the data actually resides in a low-dimensional manifold embedded in the high-dimensional feature space. Also, the manifold may fold or wrap in the feature space so much that the natural feature-space parametrization does not capture the underlying structure of the problem. Manifold learning algorithms attempt to uncover a non-linear parametrization for the data manifold in order to find a low-dimensional representation of the data that effectively unfolds the manifold and reveals the underlying data structure.
High-dimensional datasets can be very difficult to visualize. In order to visualize the structure of a dataset, the dimension must be reduced in some way. The simplest way to accomplish this dimensionality reduction is by taking a random projection of the data. Though this allows some degree of visualization of the data structure, in the random projection it is likely that the more interesting structure within the data will be lost.
To address this concern, a number of supervised and unsupervised linear dimensionality reduction frameworks have been designed, such as Principal Component Analysis (PCA) and many others. These algorithms define specific rubrics to choose an "interesting" linear projection of the data. These methods can be powerful, but often miss important non-linear structure in the data.
\begin{figure}[H] \centering \begin{minipage}[b]{0.45\textwidth} \includegraphics[width=\textwidth]{img/sphx_glr_plot_lle_digits_001.png} \end{minipage} \hfill \begin{minipage}[b]{0.45\textwidth} \includegraphics[width=\textwidth]{img/sphx_glr_plot_lle_digits_002.png} \end{minipage} \end{figure}
\begin{figure}[H] \centering \includegraphics[width=0.45\textwidth]{img/sphx_glr_plot_lle_digits_004.png} \caption{The representation drastically improves using dimensionality reduction techniques} \end{figure}
Manifold Learning can be thought of as an attempt to generalize linear frameworks like PCA to be sensitive to non-linear structure in data. Though supervised variants exist, the typical manifold learning problem is unsupervised: it learns the high-dimensional structure of the data from the data itself, without the use of predetermined classifications. Intuitively, the "curvier" the considered manifold, the denser the data must be.
Now we will define the two manifold learning algorithms used in this assignment: we will see a global approach (Isomap) and a local one (LLE).
Isomap (short for isometric feature mapping) seeks a low-dimensional representation of the data which maintains geodesic (namely, the shortest path between two points on a surface/manifold) distances between all points. In this sense, it is a direct generalization of Multidimensional Scaling (MDS). Isomap assumes that only the pairwise distances between neighboring points are known. The geodesic distances are approximated as the length of the minimal path on a neighborhood graph, i.e. each distance is estimated as the shortest path distance between the corresponding nodes in the graph.
In Isomap, the long-range distances become more important than the local structure, and this makes it quite sensitive to noise: depending on the topology of the neighborhood graph, Isomap suffers from shortcutting and other distortions.
Isomap mainly relies on a neighborhood graph built from the $k$ nearest neighbors of each point: shortest paths on this graph approximate the geodesic distances, and classical MDS is then applied to the resulting distance matrix to obtain the low-dimensional embedding.
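In practice, the Isomap step can be sketched with scikit-learn as follows (assuming a version recent enough, i.e. $\geq$ 0.22, to support \texttt{metric="precomputed"}); here \texttt{D} stands in for the kernel-induced distance matrix between graphs and is replaced by a toy matrix to keep the snippet self-contained.

```python
import numpy as np
from sklearn.manifold import Isomap
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
D = pairwise_distances(rng.random((150, 5)))   # toy symmetric distance matrix

# 10 neighbours and 9 components correspond to the best Shock configuration reported below
isomap = Isomap(n_neighbors=10, n_components=9, metric="precomputed")
X_embedded = isomap.fit_transform(D)           # low-dimensional coordinates
print(X_embedded.shape)                        # (150, 9)
```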
Locally linear embedding (LLE) seeks a lower-dimensional projection of the data which preserves distances within local neighborhoods. The underlying assumption is that a point and its neighbors in the original space lie on (or close to) a locally linear patch of the manifold; this also makes it possible to reconstruct each point as a linear combination of its neighbors. In particular, the optimal reconstruction weights $W_{ij}$ are obtained by minimizing the reconstruction error $\sum_i \big\| x_i - \sum_j W_{ij} x_j \big\|^2$ subject to $\sum_j W_{ij} = 1$, with $W_{ij} = 0$ whenever $x_j$ is not a neighbor of $x_i$; the same weights are then used to find the low-dimensional coordinates that best preserve these local linear relationships.
LLE can be thought of as a series of local Principal Component Analyses which are globally compared to find the best non-linear embedding. Here the manifold is seen as a collection of overlapping coordinate patches: if the neighborhoods are small enough and the manifold is smooth enough, the local geometry of the patches can be considered approximately linear.
Since LLE focuses on preserving distances locally, LLE can distort the global structure of the data. The idea, in fact, is to characterize the local geometry of each neighborhood as a linear function and to find a mapping to a lower dimensional Euclidean space that preserves the linear relationship between a point and its neighbors.
One well-known issue with LLE is the regularization problem. When the number of neighbors is greater than the number of input dimensions, the matrix defining each local neighborhood is rank-deficient. Standard LLE addresses this problem by applying an arbitrary regularization parameter $r$, chosen relative to the trace of the local weight matrix.
One method to address the regularization problem is to use multiple weight vectors in each neighborhood. This is the essence of modified locally linear embedding (MLLE).
The steps taken are the same as in standard LLE, but constructing the weight matrix takes more time because multiple weight vectors have to be combined. In practice, however, this additional cost is negligible.
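Similarly, the (modified) LLE step can be sketched with scikit-learn. Since its implementation works on feature vectors rather than precomputed distances, \texttt{X} below is a stand-in for some vectorial representation of the graphs (for instance, the rows of the kernel matrix treated as features, which is an assumption of this sketch, not a detail taken from the assignment code).

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(0)
X = rng.random((150, 30))                      # toy feature matrix stand-in

# 20 neighbours and 8 components correspond to the best Shock configuration reported below
mlle = LocallyLinearEmbedding(n_neighbors=20, n_components=8, method="modified")
X_embedded = mlle.fit_transform(X)             # shape: (150, 8)
```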
As previously said, applying multidimensional scaling to the distances in the implicit Hilbert space obtained from R-convolution graph kernels often results in the horseshoe effect: data is distributed tightly along a highly curved line or manifold. This stems from a consistent underestimation of long-range distances, which is inherent to the properties of these kernels.
R-convolution kernels typically count the number of isomorphic substructures in the decompositions of the two graphs, without considering any locational information for the substructures within a graph: the similarity between substructures is not related to their relative positions in the graphs.
When graphs are very dissimilar, many similar small substructures can appear simply because of the statistics of random graphs, and the smaller and simpler the substructures in the decomposition are, the more likely they are to be found in many locations of the two structures. In other words, the smaller the considered sample, the higher the probability of finding similarities due to random fluctuations. Notice that what decreases as the size increases is the proportion of correct matches with respect to the total number of possible correspondences of the same size.
The lack of a locality condition and the consequent summation over the entire structure amplify the effects of these random similarities, resulting in a lower bound on the kernel value that is a function only of the random graph statistics. This leads to a consistent reduction in the estimated distances for dissimilar graphs, adding a strong curvature to the embedding manifold (which can fold on itself) and increasing the effective dimensionality of the embedding.
This assignment requires us to compare an SVM trained on a kernel with and without the manifold learning step. The goal is to try and see whether applying an optimal manifold learning process to the distance matrix leads to an increase in the class separation.
Given a set of $N$ graphs, we compute the $N \times N$ Weisfeiler-Lehman kernel matrix. The SVM is then trained either directly on this (precomputed) kernel, or on a low-dimensional embedding obtained by applying Isomap or LLE to the kernel-induced distance matrix.
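The kernel-induced distance matrix used by the manifold learning step can be derived from the kernel matrix with a few lines of NumPy, using the distance formula seen earlier (this is a sketch of mine, not the assignment code):

```python
import numpy as np

def kernel_to_distance(K):
    """d(G, G')^2 = k(G, G) - 2 k(G, G') + k(G', G')."""
    diag = np.diag(K)
    squared = diag[:, None] - 2.0 * K + diag[None, :]
    return np.sqrt(np.maximum(squared, 0.0))   # clip tiny negatives from rounding
```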
The code has been implemented in Python, mainly relying on the Numpy and Scikit-Learn libraries.
Here is a high-level description of what it does, taken from the original paper:
- Multiset label determination
    - assign a multiset label $M_i(v)$ to each node $v \in G$, consisting of the multiset $\{ l_{i-1}(u) \mid u \text{ is a neighbor of } v \}$; done in \texttt{determine_labels}
    - as per the paper, since our graphs are unlabelled, the node degrees are used as the starting labels
- Sorting each multiset
    - sort the elements of $M_i(v)$ in ascending order and concatenate them into a string $s_i(v)$; sorted and merged in \texttt{get_labels}
    - add $l_{i-1}(v)$ as a prefix to $s_i(v)$; done in \texttt{extend_labels}, which returns the string formatted as requested
- Label compression
    - map each string $s_i(v)$ to a compressed label using a hash function $f : \Sigma^{*} \rightarrow \Sigma$ such that $f(s_i(v)) = f(s_i(w))$ if and only if $s_i(v) = s_i(w)$; done in \texttt{compress_label} and \texttt{relabel}
    - as the first "hash" value, I use the highest node degree across all graphs plus one (so that it is guaranteed to be a hash rather than an original label)
- Relabeling
    - set $l_i(v) = f(s_i(v))$ for all nodes in $G$; done in \texttt{relabel}
After having done all of the above, the similarity matrix for the N graphs is computed. The \texttt{run()} method returns the similarity matrix containing the normalized values for all the graphs.
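Putting the pieces together, the similarity matrix can be assembled from the per-graph label histograms accumulated over the $h$ iterations. The sketch below (with illustrative names, not the ones of the actual implementation) computes the subtree kernel as a dot product of label counts and then normalises it.

```python
import numpy as np

def wl_kernel_matrix(label_histograms):
    """label_histograms: one collections.Counter per graph, counting the
    compressed labels produced over all h iterations."""
    n = len(label_histograms)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            common = label_histograms[i].keys() & label_histograms[j].keys()
            K[i, j] = K[j, i] = sum(label_histograms[i][l] * label_histograms[j][l]
                                    for l in common)
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)   # normalised so that K[i, i] = 1
```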
The reported results are obtained from a 10-fold Cross Validation with a shuffled dataset. The Weisfeiler-Lehman kernel is run with a fixed number of iterations $h$.
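A sketch of the evaluation protocol for the baseline (no manifold learning) case, assuming an SVM on the precomputed kernel matrix; \texttt{K} and \texttt{y} are replaced by toy stand-ins to keep the snippet runnable.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
features = rng.random((150, 30))
K = features @ features.T                      # toy positive semi-definite kernel matrix
y = rng.integers(0, 10, size=150)              # toy class labels

svm = SVC(kernel="precomputed")
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(svm, K, y, cv=cv)     # the splitter slices K's rows and columns
print(scores.min(), scores.mean(), scores.max(), scores.std())
```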
\begin{figure}[H] \centering \begin{minipage}[b]{0.48\textwidth} \includegraphics[width=\textwidth]{img/shock_dm.png} \caption{Pairwise distances for the SHOCK dataset} \end{minipage} \hfill \begin{minipage}[b]{0.48\textwidth} \includegraphics[width=\textwidth]{img/ppi_dm.png} \caption{Pairwise distances for the PPI dataset} \end{minipage} \end{figure}
Shock dataset, no manifold learning:

Value | Accuracy |
---|---|
Minimum | 0.2 |
Mean | 0.32 |
Max | 0.5 |
Standard deviation | 0.1 |
PPI dataset, no manifold learning:

Value | Accuracy |
---|---|
Minimum | 0.5 |
Mean | 0.71 |
Max | 0.889 |
Standard deviation | 0.11 |
The performance of the manifold learning step varies heavily depending on two parameters: the number of neighbors and the number of components.
I tested all the combinations of these two parameters over a range of values; for each dataset and each manifold learning technique, the best and worst configurations are reported below.
$\ $ \newline Shock dataset: LLE
Best results: 20 neighbors, 8 components
Value | Accuracy |
---|---|
Minimum | 0.2 |
Mean | 0.40 |
Max | 0.8 |
Standard deviation | 0.17 |
Worst results: 2 neighbors, 2 components
Value | Accuracy |
---|---|
Minimum | 0.05 |
Mean | 0.135 |
Max | 0.2 |
Standard deviation | 0.06 |
Shock dataset: Isomap
Best results: 10 neighbors, 9 components
Value | Accuracy |
---|---|
Minimum | 0.25 |
Mean | 0.41 |
Max | 0.6 |
Standard deviation | 0.09 |
Worst results: 2 neighbors, 7 components
Value | Accuracy |
---|---|
Minimum | 0.0 |
Mean | 0.14 |
Max | 0.2 |
Standard deviation | 0.06 |
$\ $ \newline PPI dataset: LLE
Best results: 6 neighbors, 8 components
Value | Accuracy |
---|---|
Minimum | 0.555 |
Mean | 0.776 |
Max | 1.0 |
Standard deviation | 0.12 |
Worst results: 2 neighbors, 6 components
Value | Accuracy |
---|---|
Minimum | 0.25 |
Mean | 0.53 |
Max | 0.666 |
Standard deviation | 0.11 |
PPI dataset: Isomap
Best results: 4 neighbors, 7 components
Value | Accuracy |
---|---|
Minimum | 0.555 |
Mean | 0.79 |
Max | 1.0 |
Standard deviation | 0.12 |
Worst results: 2 neighbors, 6 components
Value | Accuracy |
---|---|
Minimum | 0.375 |
Mean | 0.59 |
Max | 0.875 |
Standard deviation | 0.13 |
As we can see, the application of a manifold learning technique doesn't always improve the performance of the SVM classifier. The experimental results tell us that if we can find the "right" combination of parameters (number of neighbors and number of components), the manifold learning step can noticeably increase the classification accuracy, while a poor choice of parameters can make it drop well below the baseline.
Let us now focus on the two datasets separately.
When considering the Shock dataset, in the best case applying the manifold learning step leads to a +10% increase (from ~30% to ~40%) in the classification accuracy of the SVM, while in the worst case it decreases the accuracy to ~10%, which is very poor compared to the ~30% obtained without any manifold learning step. As said before, the results depend entirely on choosing the right parameters, which is not an easy task: as a matter of fact, while testing the various parameter combinations, the mean accuracy ranged from roughly 10% to roughly 40% depending on the choice.
In the PPI dataset the situation changes: here the manifold learning step seldom leads to worse performance than the "standard" one, and it usually improves the classification accuracy. In the best case, both Isomap and LLE reach a 100% classification accuracy on some folds: even though this result should be checked against a wider dataset, the mean classification accuracy is still improved, with a 6-7% increase in the accuracy of the SVM when the parameters are carefully chosen.
For both datasets, it is easy to notice that the worst performance obtained by using the manifold learning algorithms comes from the executions with the number of neighbors set to 2: in other words, as one might expect, considering very few neighbors gives very little information on the global structure of the graph and hence leads to poor performance. On the other hand, considering too many neighbors does not seem like a good idea either, especially in the PPI dataset. Here the best results are yielded by runs considering 4 or 6 neighbors, while the Shock dataset (which seems to be tougher to classify) requires many more neighbors. This might also be related to the fact that the PPI dataset only includes elements from 2 classes, while the Shock dataset contains 150 elements divided into 10 classes, providing very few training examples per class.

When considering the Shock dataset, LLE seems to work better than Isomap, probably because the information given by local neighborhoods is more useful than the one coming from longer-range distances. In PPI, on the other hand, Isomap seems to work better than LLE, probably because in this case the global structure is more relevant than the local one: this sounds reasonable if we recall that the dataset represents interactions between proteins.
To conclude, I think it is fundamental to remark that using manifold learning techniques to improve graph kernels can lead to an improvement in the classification accuracy, but this comes at the non-negligible cost of finding the two "best" hyper-parameters for the manifold learning algorithms. In other words, even though we have seen that manifold learning can improve our classifiers, we cannot know in advance which parameter values will do so unless we first try many possible combinations.
- http://www.dsi.unive.it/~atorsell/AI/graph/Unfolding.pdf
- http://www.dsi.unive.it/~atorsell/AI/graph/kernels.pdf
- https://en.wikipedia.org/wiki/Kernel_method
- https://en.wikipedia.org/wiki/Positive-definite_kernel
- https://www.ethz.ch/content/dam/ethz/special-interest/bsse/borgwardt-lab/documents/slides/
- https://ethz.ch/content/dam/ethz/special-interest/bsse/borgwardt-lab/documents/slides/CA10_WeisfeilerLehman.pdf
- https://scikit-learn.org/stable/modules/manifold.html
- https://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html
- https://en.wikipedia.org/wiki/Graph_kernel