Tools to convert images into vector representations in a vector space with specified properties
This README serves as a general overview of what we are attempting to do and gives context to the code written. Specific instructions on how the code is structured and how to use it can be found in the README files in src.
To use the code, run the BFMethodExample.py file with your choice of image type, image filters, image product, and embedding type.
More detailed explanation of the options can be found under the "Generalised Problems" section.
Note that an Octave installation is required to run certain functions.
The first goal of the project is to create a vectorisation function that maps images in the form of matrices to an N-Dimensional unit vector (called a vector embedding), such that the dot product between any two such vector embeddings is equal to the NCC score between the two original images.
We hope to gain a better understanding of the process of generating image embeddings, a process usually carried out using AI and ML.
This would also speed up calculation of similarity between images since dot products are less computationally intensive than calculating image products.
Given such a function, it would also be possible to convert any input image into a vector embedding, even a completely new image the function has never seen before.
Clustering or classification can then be easily carried out on the vector embeddings.
The NCC (Normalised Cross-Correlation) score between 2 images is defined as the maximum value obtained after performing template matching between one image and the other. For more details see here.
Normally the template image given has to be smaller than the main image, so that the template is able to slide across the main image. However our template image is the same size as our main image. Our solution to this is to pad the main image with copies of itself.
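As an illustration, the NCC score between two same-sized images might be computed along these lines (a minimal sketch assuming OpenCV's `matchTemplate`; the repository's own implementation may differ in its padding and normalisation details):

```python
import numpy as np
import cv2

def ncc_score(image_a: np.ndarray, image_b: np.ndarray) -> float:
    """Max normalised cross-correlation between two same-sized images.

    The main image is padded with copies of itself (tiled, then trimmed)
    so that the template can slide over every cyclic shift of it.
    """
    h, w = image_a.shape
    main = np.tile(image_a, (2, 2))[: 2 * h - 1, : 2 * w - 1].astype(np.float32)
    template = image_b.astype(np.float32)
    result = cv2.matchTemplate(main, template, cv2.TM_CCORR_NORMED)
    return float(result.max())
```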
More important than the exact process of finding the NCC score are the properties that the NCC score has. The 3 important properties of the function are
- Commutative
- Range of [0,1]
- Scalar representation of similarity
The NCC score between 2 images is commutative, i.e. the NCC score with image A as the main image and image B as the template image is the same as the NCC score with image B as the main image and image A as the template.
The maximum value of the NCC score is 1, and the minimum value is 0.
The NCC score will be high if the images are similar, and will have a maximum score of 1 if the 2 input images are identical or a simple translation of each other.
If the images are dissimilar, the NCC score will be lower.
These 3 characteristics above are similar to those of dot products between unit vectors.
Unit vectors which have a low angular distance will have a high dot product, and unit vectors with a high angular distance will have a low dot product.
With these similarities in mind, our goal is to find a function that maps images to unit vectors such that their dot product is equal to their NCC score.
We define these unit vectors as the vector embeddings of the image.
We define this function we are attempting to find as the embedding function.
Since the number of possible 2x2 binary grids is small, we will first attempt a brute force (BF) approach where we numerically solve for the correct vector embedding for every possible image beforehand.
These vector embeddings will be stored in a dictionary, with the images they represent as keys
When our embedding function takes in a new image, it looks the image up in the dictionary and returns the vector embedding stored under that key.
For a 2x2 grid of binary numbers, there are 16 possible images
In an ideal scenario, images which are simple translations of each other should be represented by the same vector. This is because the NCC scores between the images would be 1, and hence the dot product between their vector embeddings will also equal to 1.
As a result, in our dictionary we can use an image and all of its possible translations as keys for the same vector embedding.
Hence, we only need to consider the set of images which are not simple translations of each other. We find only 7 such images.
Given that there are only 7 "unique" images, our function has to produce exactly 7 different vectors for the 16 inputs (multiple squares map to the same vector). Given our constraints, this also means that in general the vectors must have 7 dimensions. 1
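This count can be checked with a short sketch that enumerates all 16 grids and groups them by cyclic translation (illustrative code, not the repository's filter implementation):

```python
import numpy as np
from itertools import product

def canonical(image: np.ndarray) -> bytes:
    """Smallest byte string over all cyclic translations of the image."""
    rows, cols = image.shape
    shifts = (np.roll(np.roll(image, dy, axis=0), dx, axis=1)
              for dy in range(rows) for dx in range(cols))
    return min(s.tobytes() for s in shifts)

images = [np.array(bits, dtype=np.uint8).reshape(2, 2)
          for bits in product([0, 1], repeat=4)]
unique_classes = {canonical(im) for im in images}
print(len(images), len(unique_classes))   # 16 images, 7 translationally unique classes
```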
Let's define matrix A as the matrix whose columns are the 7 vector embeddings we wish to find. This matrix A will also be called the embedding matrix.
Let's also define matrix G as the matrix containing the table of image products between every pair of images. This matrix G will also be called the image product matrix.
Recall that we wish to have the dot products of the vector embeddings to be equal to the NCC score between the respective images.
[i.e. $x_1 x_2 + y_1 y_2 + z_1 z_2 + \dots = \text{NCC score}(\text{image 1}, \text{image 2})$]
Notice that matrix A transposed multiplied by matrix A exactly equals matrix G ($A^T A = G$).
We are able to find matrix G by simply computing the NCC score between all possible 2x2 images 2
There are various methods to find matrix A here given matrix G. One example is the Cholesky decomposition. We use the following method, which makes use of the fact that matrix G is a symmetric matrix.
$$G = P D P^T$$
(Diagonalise the matrix; note $P^T = P^{-1}$ as G is symmetric.)
$$A = \sqrt{D}\, P^T \quad \Rightarrow \quad A^T A = P \sqrt{D}\,\sqrt{D}\, P^T = P D P^T = G$$
Since we need to take the square root of D, the solution only works if matrix G has no negative eigenvalues (i.e. matrix G is positive semi-definite). We can show that if matrix G is not positive semi-definite, there exists no such matrix A (containing only real numbers).
Suppose there exists a matrix A such that $A^T A = G$.
Consider the expression $v^T G v$, where $\lambda$ is an eigenvalue of matrix G and $v$ its corresponding eigenvector. On one hand, $v^T G v = v^T A^T A v = \lVert A v \rVert^2 \ge 0$. On the other hand, $v^T G v = \lambda\, v^T v = \lambda \lVert v \rVert^2$, so $\lambda \ge 0$.
As such, given that there exists an embedding matrix that perfectly satisfies the NCC score requirement, the eigenvalues of matrix G must be positive or zero (i.e. matrix G is positive semi-definite).
In such a case, we are able to calculate $A = \sqrt{D}\, P^T$ to obtain all the vector embeddings for the images.
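In numpy, this decomposition might look like the following sketch (the clip only guards against floating-point round-off when G is positive semi-definite):

```python
import numpy as np

def decompose(G: np.ndarray) -> np.ndarray:
    """Return A such that A.T @ A is (approximately) G, assuming G is PSD."""
    eigvals, P = np.linalg.eigh(G)            # G = P @ diag(eigvals) @ P.T
    eigvals = np.clip(eigvals, 0.0, None)     # remove tiny negative round-off
    A = np.diag(np.sqrt(eigvals)) @ P.T       # A = sqrt(D) @ P.T
    return A                                  # columns are the vector embeddings
```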
However, it is often the case that matrix G is not positive semi-definite. As such, we have to find methods to transform matrix G into a positive semi-definite matrix, while minimising the error incurred.
The first option is to simply raise all the negative eigenvalues in the diagonal matrix D to zero, then calculate $A = \sqrt{D'}\, P^T$ as before, where $D'$ is the clipped diagonal matrix.
The issue with this approach is that the unit vector constraint is not preserved. Previously, the unit vector constraint was preserved by the diagonal of ones in the image product matrix. However, after the negative eigenvalues are zeroed, the diagonal entries will no longer be equal to one.
This can be remedied by normalising the embedding vectors again, though this results in more deviation of G' from G and potentially causes G' to not be symmetric.
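A sketch of this first option (illustrative only; the column normalisation at the end restores the unit-vector constraint):

```python
import numpy as np

def clip_and_decompose(G: np.ndarray) -> np.ndarray:
    """Zero out negative eigenvalues, decompose, then re-normalise each column
    so the embeddings are unit vectors again."""
    eigvals, P = np.linalg.eigh(G)
    A = np.diag(np.sqrt(np.clip(eigvals, 0.0, None))) @ P.T
    return A / np.linalg.norm(A, axis=0, keepdims=True)
```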
A second option is to add a scalar multiple of the identity matrix to G.
For any polynomial f, it can be shown that f(G) has eigenvalues equal to f applied to each eigenvalue of G. This means that adding a scalar multiple n of the identity matrix to G adds n to each eigenvalue of G. By setting the scalar to the negative of the most negative eigenvalue of G, all eigenvalues of G become non-negative. We then divide all elements of G by the new value of the diagonal elements to set them back to 1.
However, this is also not the main method due to rank constraint issues.
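For completeness, a sketch of this identity-shift option (illustrative only; the scalar and rescaling follow the description above):

```python
import numpy as np

def shift_to_psd(G: np.ndarray) -> np.ndarray:
    """Add a multiple of the identity so all eigenvalues become non-negative,
    then rescale so the diagonal entries return to 1."""
    shift = max(0.0, -np.linalg.eigvalsh(G).min())
    G_shifted = G + shift * np.eye(G.shape[0])
    return G_shifted / (1.0 + shift)
```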
A matrix which is positive semi-definite with unit diagonals is called a correlation matrix. Finding the nearest correlation matrix, while minimising the error is a studied problem.
The algorithm we use to find the nearest correlation matrix is the Penalised Correlation (Pencorr) Algorithm. This algorithm finds the nearest correlation matrix G' such that the Frobenius norm of the difference, $\lVert G' - G \rVert_F$, is minimised.
We can use these algorithms to find G' while maintaining the unit vector constraint. The matrix can then be decomposed. More details can be found here.
A weighted version of the above algorithm is also available. This version instead minimises the weighted norm $\lVert W \circ (G' - G) \rVert_F$, where W is a matrix of weights and $\circ$ denotes the elementwise product.
A higher weight assigned to an element indicates that we would like that element to stay close to its original value, while a lower weight indicates that the element is allowed to change more.
More details can be found here and in the section on Pencorr weights
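The Pencorr algorithm itself is not reimplemented here; purely as an illustration of the nearest-correlation-matrix idea, the classic alternating-projections method (Higham, 2002) for the unweighted problem looks roughly like this:

```python
import numpy as np

def nearest_correlation(G: np.ndarray, max_iter: int = 100, tol: float = 1e-8) -> np.ndarray:
    """Alternating projections between the PSD cone and the set of matrices with
    unit diagonal (Higham, 2002).  Not the Pencorr algorithm used by the project,
    only a sketch of the same underlying idea."""
    Y = G.copy()
    correction = np.zeros_like(G)
    for _ in range(max_iter):
        R = Y - correction                                    # Dykstra correction step
        eigvals, P = np.linalg.eigh((R + R.T) / 2)
        X = P @ np.diag(np.clip(eigvals, 0.0, None)) @ P.T    # project onto the PSD cone
        correction = X - R
        Y_new = X.copy()
        np.fill_diagonal(Y_new, 1.0)                          # project onto unit diagonal
        if np.linalg.norm(Y_new - Y, "fro") < tol:
            return Y_new
        Y = Y_new
    return Y
```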
While vector embeddings of dimensions equal to the size of the image set are needed in general to perfectly solve the problem, lower dimensionality is important for scalability.
As such, we would like to have methods which are able to obtain vectors with fewer dimensions, while minimising the impact on the results, especially for the closest neighbours of an image.
We define Ad, n as the embedding matrix with reduced dimensions, such that there are n vector embeddings which only have d dimensions. Ad, n is a d by n matrix.
For instance, Ad=5, n=7 would be 5 by 7 matrix, where 7 is the number of vectors and 5 is the number of dimensions of the vector embedding.
Given an n by n matrix G which has only d non-zero, positive eigenvalues, it is possible to decompose it into a d by n matrix Ad,n such that $A_{d,n}^T A_{d,n} = G$.

$$G = P D P^T$$
(Diagonalise the matrix; note $P^T = P^{-1}$ as G is symmetric.)

Sort the eigenvalues in D and their corresponding eigenvectors in P in descending order, and note that all values other than the top d diagonal entries are zero.

D can then be written as the matrix product $D = D_{root}^T D_{root}$, where $D_{root}$ is a d by n matrix whose diagonal entries are the square roots of the eigenvalues. The reduced embedding matrix is then $A_{d,n} = D_{root} P^T$.
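A numpy sketch of this reduced-dimension decomposition (keeping only the d largest eigenvalues):

```python
import numpy as np

def reduced_decompose(G: np.ndarray, d: int) -> np.ndarray:
    """Return the d x n embedding matrix built from the d largest eigenvalues of G."""
    eigvals, P = np.linalg.eigh(G)
    order = np.argsort(eigvals)[::-1]          # sort eigenvalues in descending order
    eigvals, P = eigvals[order], P[:, order]
    top = np.clip(eigvals[:d], 0.0, None)      # keep the d largest, floor at zero
    D_root = np.diag(np.sqrt(top))             # d x d
    return D_root @ P[:, :d].T                 # A_{d,n}, a d x n matrix
```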
Similar to the method above for dealing with negative eigenvalues, instead of just zeroing the negative eigenvalues, we now zero all but the largest d eigenvalues.
The resulting matrix then can be decomposed using the reduced dimension decomposition.
However, this is again not the preferred method due to deviations from G mentioned above.
Correction of eigenvalues can also be carried out similarly to the above by adding the negative of the $(n-d)$th eigenvalue. However, this presents problems when the selected eigenvalue is positive, causing a change in the relative ordering of elements in G. Therefore, this method is explored but not preferred.
There are also methods to find the nearest correlation matrix with a rank constraint, such that the resulting matrix G' has only d nonzero eigenvalues
The Penalised Correlation algorithm used above has this function built-in, allowing a rank constraint to be specified.
The resulting matrix then can be decomposed using the reduced dimension decomposition.
Now that we have generated embeddings for the images within our sample set, we wish to estimate a vector embedding for a new image that is not in the set.

Having decomposed the matrix G' into $G' = A^T A$ and obtained the embedding matrix $A$, let $b$ be the vector of image products between the new image and each image in the sample set, and let $x$ be the embedding we wish to estimate.

The problem statement is to find the unit vector $x$ whose dot products with the sampled embeddings best match the image products:

$$\min_x \; \lVert A^T x - b \rVert^2 \qquad (1)$$
$$\text{subject to} \quad x^T x = 1 \qquad (2)$$

From above, introducing a Lagrange multiplier $\lambda$ for the constraint gives the Lagrangian $\mathcal{L}(x, \lambda) = \lVert A^T x - b \rVert^2 + \lambda (x^T x - 1)$.

From (1), setting the gradient of the Lagrangian with respect to $x$ to zero gives $(A A^T + \lambda I)\,x = A b$, so $x(\lambda) = (A A^T + \lambda I)^{-1} A b$.

From (2), substituting $x(\lambda)$ into the unit-norm constraint leaves a single scalar equation in $\lambda$: $x(\lambda)^T x(\lambda) = 1$.

We will thereafter solve for the Lagrange multiplier using numerical methods provided by scipy, and find the $x$ that minimises (1).
The first numerical method tried was the Newton-Raphson method. The second numerical method tried was the Brent method from scipy, which resulted in faster code without a noticeable trade-off in accuracy.
The Jenkins-Traub method could also be tried to see if it provides a faster or more accurate way to run the code.
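A minimal sketch of this estimation step, assuming the problem statement above and using scipy's Brent root finder to solve the scalar equation in $\lambda$ (function and variable names are illustrative, not the repository's API):

```python
import numpy as np
from scipy.optimize import brentq

def estimate_embedding(A: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Estimate x minimising ||A.T @ x - b||^2 subject to ||x|| = 1.

    A is the d x n sampled embedding matrix, b the image products between
    the new image and each sampled image."""
    eigvals, U = np.linalg.eigh(A @ A.T)          # diagonalise A @ A.T
    c = U.T @ (A @ b)

    def norm_minus_one(lam: float) -> float:
        # ||x(lam)||^2 - 1, where x(lam) = (A A^T + lam I)^{-1} A b
        return float(np.sum((c / (eigvals + lam)) ** 2) - 1.0)

    lo = -eigvals.min() + 1e-9                    # just above the pole of x(lam)
    hi = lo + 1.0
    while norm_minus_one(hi) > 0:                 # widen the bracket until ||x|| < 1
        hi = lo + 2.0 * (hi - lo)
    lam = brentq(norm_minus_one, lo, hi)          # scalar root-finding for lambda
    return U @ (c / (eigvals + lam))              # (assumes ||x|| > 1 just above the pole)
```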
Since it is not possible to perfectly solve the problem, we would like to come up with metrics to quantify the accuracy of our embeddings.
Ideally, close neighbours of a vector embedding should correspond to close neighbours of the original image. The Relative Positioning Score (k-score) is the fraction of the k most similar images (i.e. highest image product) that also appear among the k nearest vector embeddings (i.e. highest dot product) of the corresponding vector. In an ideal case, the score would be one.
This measurement makes use of the k-nearest neighbours, but is not related to a KNN Classification algorithm.
Note that this score only cares how many of the original closest neighbours remain as the closest neighbours. The order within the closest neighbours does not matter (we can make it matter though).
Let the image products be the vector b, the image product function be f, s be any image that is chosen, and n be the total number of images, so that $b_i = f(\text{image}_s, \text{image}_i)$ is column s of the image product matrix. For example, for the image product table of 3x3 binary unique images, n = 64, and we might pick the image s = 20.

Note that $b_s = 1$, since the image product of an image with itself is 1.

Let x be the estimated vector embedding of image s, and let c be the vector of dot products between x and each of the n vector embeddings.

Let k be the nearest neighbour count, where we track the k nearest neighbours to our vector x that represents image s. In vectors b and c, if k = 3, we will find the top 3 highest values and store their indices in a list.

The k-score is then the fraction of shared indices between the two lists:

$$\text{k-score} = \frac{\lvert N_k(b) \cap N_k(c) \rvert}{k}$$

where $N_k(\cdot)$ denotes the set of indices of the k largest entries.

For example, let n = 5, s = 3 and k = 2 (using 1-based indexing). An illustrative vector b could be $(0.4,\ 0.9,\ 1.0,\ 0.7,\ 0.1)$, whose top 2 indices are $\{3, 2\}$, and an illustrative vector c could be $(0.5,\ 0.6,\ 0.98,\ 0.8,\ 0.2)$, whose top 2 indices are $\{3, 4\}$. Using the formula above, the two lists share only index 3, giving a k-score of $1/2$.

Using another example with the same values of n, s and k, if c were instead $(0.3,\ 0.85,\ 0.97,\ 0.6,\ 0.1)$, the top 2 indices of both b and c would be $\{3, 2\}$, giving a perfect k-score of 1.
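In code, the k-score for a single image reduces to comparing two index sets (a minimal sketch):

```python
import numpy as np

def k_score(b: np.ndarray, c: np.ndarray, k: int) -> float:
    """Fraction of the k nearest neighbours by image product (b) that are also
    among the k nearest neighbours by embedding dot product (c)."""
    top_b = set(np.argsort(b)[-k:])     # indices of the k largest image products
    top_c = set(np.argsort(c)[-k:])     # indices of the k largest dot products
    return len(top_b & top_c) / k
```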
In practice, for the BF method, we are able to directly compare matrices G and G' to see the k-nearest neighbours for the images and the embeddings respectively. However, to measure the accuracy of embeddings generated using the Lagrangian method, each dot product and NCC score has to be calculated explicitly.
Using this score and the data that we have obtained, we are able to plot a number of graphs that show how the quantities correlate with one another. The details of the graphs can be found under visualisations.
While the Brute Force method is the main method of finding embeddings, we would like to improve the accuracy and scalability of the method. The following are variations in the method used to find more accurate embeddings at lower rank constraints.
While the BF approach may be able to find vector embeddings, it is not scalable and requires exhaustively calculating vectors for all possible images.
A more practical approach would be to first take a smaller sample of the total image set and calculate the vector embeddings for these images using the BF approach. A vector embedding for a new image can then be estimated from the sampled image set using the Lagrangian method.
The accuracy of this method is tested by finding the k-scores of embeddings estimated for images within a test image set.
The subset of images from the original image set will be a random sample of n images. This subset will be called the sampled image set.
The image product matrix computed using this sample image set will be called matrix Gn where n is the size of the sampled image set.
The vector embeddings for the sampled image set can then be calculated using the brute force approach. The matrix with its columns corresponding to the sampled vector embeddings will be called the sampled embedding matrix or matrix Ad,n, where d is the number of dimensions chosen for the vector embeddings and n is the size of the sampled image set. Note that matrix Ad,n will be a d by n matrix.
The sample set and testing sets are often non-overlapping samples of the same image set.
First, the size of the sample set will be chosen to be some n, where n is less than the total size of the image set.
A random sample of n images from the original image set will be chosen, and their image product will be computed to obtain matrix Gn.
Next the dimensionality of the embedding vectors will be chosen to be some d.
Matrix Gn will undergo a rank reduction to obtain matrix G'n, followed by a decomposition to obtain the sampled embedding matrix Ad,n.
Once the new matrix G' and subsequently the embedding matrix A are generated, we can use the Lagrangian method to estimate embeddings for all images in our test image set, then calculate the k-score by comparing their dot products with the other embeddings to the actual k-nearest neighbours calculated using the NCC score.
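Putting the pieces together, a hypothetical end-to-end run of this sampled BF + Lagrangian pipeline might look like the sketch below; it reuses the illustrative `ncc_score`, `nearest_correlation`, `reduced_decompose` and `estimate_embedding` helpers sketched earlier (the real pipeline lives in the src code):

```python
import numpy as np

def build_sampled_embeddings(sample_images, d):
    """Compute G_n for the sampled image set, find a nearby correlation matrix
    (stand-in for the Pencorr call) and decompose it at rank d."""
    G = np.array([[ncc_score(a, b) for b in sample_images] for a in sample_images])
    G_prime = nearest_correlation(G)
    return reduced_decompose(G_prime, d)          # A_{d,n}

def embed_new_image(image, sample_images, A):
    """Estimate the embedding of an unseen image via the Lagrangian method."""
    b = np.array([ncc_score(image, s) for s in sample_images])
    return estimate_embedding(A, b)
```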
By itself, the embeddings generated for the NCC score cannot reach the perfect k-score of 1 even if the rank of G' is unrestricted. This is because G always has negative eigenvalues which have to be corrected during the Pencorr algorithm. Using an increasing function that preserves the value of 1, we can retain the relative ordering of elements in G (to preserve the relative positioning of how close images are to each other) while manipulating the eigenvalues of G. The value of 1 should be preserved to ensure all diagonal elements remain as 1 but can be scaled above or below 1 as well since the Pencorr algorithm will correct for this. Some functions tested are:
- $x^n$ for some $n>0$
- $n^{x-1}$ for some $n>1$
- $\log_n(n-1+x)$ for some $n>1$
- $nx$ for some $n>0$
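Purely as an illustration, the transformations listed above might be applied elementwise to G before Pencorr like this (parameter values are arbitrary placeholders):

```python
import numpy as np

# Elementwise monotonic transformations of the image product matrix G.
# Each fixes (or rescales) the value 1 as described above.
transforms = {
    "power": lambda x, n=3.0: x ** n,                           # convex for n > 1, fixes 0 and 1
    "exp":   lambda x, n=4.0: n ** (x - 1.0),                   # fixes 1
    "log":   lambda x, n=4.0: np.log(n - 1.0 + x) / np.log(n),  # concave, fixes 1
    "scale": lambda x, n=0.5: n * x,                            # diagonal rescaled by Pencorr later
}

def transform_G(G: np.ndarray, name: str) -> np.ndarray:
    """Apply one of the candidate transformations to every element of G."""
    return transforms[name](G)
```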
Monotonic transformations have been shown to both decrease and increase the k-score, as well as the rank at which it stabilises.
Convex transformations such as $x^n$ for $n>1$ tend to increase the k-score attainable when the rank constraint is loose.
Mathematically, a plausible explanation for this is that as the transformations become increasingly convex, the elements of G that are between 0 and 1 tend towards 0, causing G to tend towards the identity matrix. Since the identity matrix has eigenvalues of all 1s, the eigenvalues of G tend towards 1. This causes the largest eigenvalue to fall. Since the trace of a matrix (which we kept constant) is equal to the sum of all eigenvalues, the largest eigenvalue decreasing causes all other eigenvalues to increase, resulting in less negative eigenvalues which have to be corrected in Pencorr and therefore, resulting in less deviation of G' from G.
However, this also spreads the information of our embeddings out to more dimensions, causing more information to be lost when a rank constraint is imposed.
Concave transformations such as $\log_n(n-1+x)$ tend to have the opposite effect, decreasing the k-score while lowering the rank at which it stabilises.
The explanation is similar in that increasingly concave transformations cause elements of G to tend towards 1. G therefore tends towards a matrix of 1s which has eigenvalues of 0 (of multiplicity n-1) and n. This causes information to be concentrated in one dimension which decreases the accuracy of our embeddings. At the same time, the largest eigenvalue increases towards n. Since the sum of eigenvalues is still constant, the value of all other eigenvalues fall, resulting in more negative eigenvalues which have to be corrected by the Pencorr algorithm. This causes more deviation of G' from G and decreases the k-score.
The concentration of information however causes less information to be lost when rank constraints are imposed, causing a lower rank at which the k-score stabilises.
Overall, we hope to find optimal monotonic transformations to produce the most accurate embeddings for a given rank constraint.
For some monotonic transformations, the embeddings generated are more accurate at all rank constraints (this is often the case for mildly convex transformations).
Weighting matrices can be used to assign weights during the Pencorr algorithm. We generally want to assign higher weights to the k-nearest neighbours in each column as these elements that are close to 1 are the elements we care more about when calculating k-score.
Several weighting matrices have been tested, such as the matrix with 1s in positions corresponding to the k-nearest neighbours and 0 everywhere else. This weighting matrix has been shown to generate less accurate embeddings.
The matrix G itself has shown to act as a good weighting matrix as it already has larger values for the larger elements which we hope to keep and smaller values in further neighbours that we care less about.
Monotonic transformations of G can also act as good weighting matrices as applying a convex transformation to G before using it as a weighting matrix has sometimes proven to create more accurate embeddings. However, this is not universally true as extremely convex transformations can cause the weighting matrix to lose too much information, resulting in less accurate embeddings.
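As an illustration, the 0/1 k-nearest-neighbour weighting matrix described above might be built as follows (a sketch; G itself, or a transformed G, can instead be passed to the weighted algorithm directly):

```python
import numpy as np

def knn_weight_matrix(G: np.ndarray, k: int) -> np.ndarray:
    """1 at the positions of each column's k nearest neighbours, 0 elsewhere."""
    W = np.zeros_like(G)
    for j in range(G.shape[1]):
        top = np.argsort(G[:, j])[-k:]        # k largest image products in column j
        W[top, j] = 1.0
    return np.maximum(W, W.T)                 # symmetrise; weight matrices for the
                                              # weighted problem are typically symmetric
```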
While the Relative Positioning Score provides a metric to measure the accuracy of our embeddings, we also require metrics to test the scalability of our methods.
The plateau rank is defined as the rank at which the k-score stabilises and stops changing even as rank constraints are lifted.
We find this plateau rank using an algorithm similar to a binary search, measuring the stabilised k-score before searching downward to find the rank at which it changes or stops changing. The start point of our search is the rank of G' when no rank constraint is imposed, as this represents the number of nonzero eigenvalues, which is an upper bound on the plateau rank.
The plateau rank represents the rank constraint beyond which we obtain no returns for allowing more dimensions. It can also represent the rank at which we achieve the maximum accuracy of our embeddings.
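A sketch of this search, written as a binary search over rank constraints; `get_k_score` is a hypothetical callback that rebuilds the embeddings under a given rank constraint and returns their k-score:

```python
def plateau_rank(get_k_score, max_rank: int, tol: float = 1e-9) -> int:
    """Smallest rank whose k-score already matches the unconstrained k-score,
    assuming the k-score is non-decreasing in the rank constraint."""
    target = get_k_score(max_rank)            # stabilised k-score at full rank
    lo, hi = 1, max_rank
    while lo < hi:
        mid = (lo + hi) // 2
        if abs(get_k_score(mid) - target) <= tol:
            hi = mid                          # still on the plateau, search lower
        else:
            lo = mid + 1                      # below the plateau
    return lo
```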
We can also attempt to find the rank at which our generated embeddings reach a specified k-score. However, due to time constraints, this option is yet to be tested. This metric would provide a good gauge of rank constraints to be used in large image sets to obtain embeddings that are "good enough".
An attempt has been made to conduct shape matching on the image set of shapes. More details of this image set can be found here.
This process displays the k + 1 nearest shapes as determined by both the embedding dot products and the actual NCC score calculations. However, the practicality of this remains to be seen, as testing has only been conducted on small shapes (4 by 4). Use on larger shapes may reveal more of its characteristics.
While we are investigating toy problems, ideally we would like to build our code with future extensions in mind, for example using more realistic image datasets. Hence, we define a few terms to frame a more general version of the problem.
There is nothing special in particular about the NCC score. Any function which has the 3 properties above would be suitable. The NCC score was chosen as a relatively simple function to investigate.
While we will continue working with NCC, we define the term Image product to be a function which takes in 2 images, and outputs a scalar, with the scalar satisfying the above 3 properties (Commutative, range bounded to [-1,1] and being a scalar representation of similarity)
Currently, our list of implemented image products are:
- NCC score
We would also want to work with larger images in the future. While the toy example uses 2x2 binary grids, we would hope to work with NxN grids which can contain any number.
A 2D array that serves as an input to an image product will be called an image. This initial choice of input type will be defined as the image type of the problem.
Currently, our list of implemented image types is:
- Nbin: N by N binary squares
- Nbin_MmaxOnes: N by N binary squares, with at most M% of the squares being 1s
- triangles: 4 by 4 triangles in 8 by 8 matrices
- quadrilaterals: 4 by 4 quadrilaterals in 8 by 8 matrices
- shapes: shapes of varying size
- random_shapes: smaller random sample of shapes
More details can be found here.
We can attempt to manipulate the eigenvalues of G to provide better embeddings.
Details can be found here
Weights can be used to augment the Pencorr process.
Details can be found here
While it is possible to enumerate all possible binary images for small values of N, the total number of possible images increases exponentially with N. As such, we would want certain filters to reduce the size of the total image set, leaving only the less noisy images to investigate.
A function which takes in a set of images and outputs a subset of those images based on a certain criterion will be called a filter.
Currently, our list of implemented filters is:
- one_island: Outputs a set of images such that each image has only one connected island of 1s (Diagonals are not connected)
- Pmax_ones: Outputs a set of images such that each image has only P or a lower percentage of 1s (e.g. only up to 50% of the squares can be 1s)
- translationally_unique: Outputs a set of images such that each image is NOT a simple translation of another image in the set 3
Footnotes
- For each vector, we have a specified dot product between it and each other vector. By taking the inverse cosine of that dot product, we have a certain angle being specified. Taking the example of 2 vectors, let's say we have vector A and vector B, with the angle between them specified as 45 degrees (∠AB = 45).
In such an example, we are able to put both vectors in the same 2D plane and satisfy the conditions.
Now adding in a third vector C, with ∠AC = 50 and ∠BC = 60. In general we would need 3 dimensions for A, B and C to satisfy all angle requirements (if ∠AC != ∠AB + ∠BC).
In the case where we tried adding vector C with ∠AC = 10 and ∠BC = 5, the system of equations would not be solvable
Extending the above example, if we added in a vector D with ∠AD, ∠BD and ∠CD being specified, we would need 4 dimensions if the equation is solvable. Hence for 7 vectors, we would need 7 dimensions in general to perfectly satisfy the angle requirements ↩
- Note that this method quickly explodes. The number of computations is proportional to $(2^{\text{size of binary grid}})^2$ without any filters ↩
- This means the NCC score between every pair of distinct images in the set is less than 1. This filter is also how we find the 7 translationally unique squares for the brute force approach ↩