Embedding-Cvx-Projection

This repository presents a basic convex optimization technique that, given a target embedding, aims to find the best approximation of that embedding by means of a reduced set of relevant but diverse embeddings.

The process is similar to answering the question: how can I most effectively explain my input embedding as a meaningful combination of embeddings from a given set?

Presumed applications for it are:

  • Extending a Knowledge Graph by inferring which additional features can be linked to an entity from its already linked features (see the common nouns example below).
  • Retrieving a compact, diverse, and relevant set of contexts for the Retrieval step of a RAG workflow (see the synthetic retrieval example below).

For an in-depth discussion regarding its derivation and applications, please check the extended background document that discusses the theory, implementation, and examples.

HOW TO USE IT

Installation

Before running any examples, you'll need to install the package from the root directory of this repository. There you can run:

pip install .

This will install the embedding_cvx_projection package along with its dependencies.

Running examples

Once the package is installed, you can run the example scripts to understand how to use this technique. Indeed, the embedding_cvx_projection package is very lean, so this repository is best understood as documentation and examples of how to use the given cvxpy projection technique.

Common nouns example

Overview

This script explores a scenario mimicking the extension of a Knowledge Graph (KG) by linking additional features to an entity based on already linked features.

For instance, an e-commerce shop selling furniture may have features associated with its products, where the features are drawn from a closed but extensive set. In a specific example, say a furniture piece has the properties waterproof and durable; leveraging the inference mechanism implemented in our algorithm, it would be possible to automatically attach (or suggest attaching) outdoors to it.

⚠️ Note: The given algorithm uses OpenAI embeddings, so make sure you have a valid OPENAI_API_KEY set up for on-the-fly embedding if any of your words are not in the list of 500 common nouns.

Task

The goal here is to identify which nouns, from a predefined list of around 500 common nouns, are closely related to or can be derived from a given list of input words. The algorithm uses the OpenAI embeddings for those 500 nouns, encapsulated in an EmbeddingLibrary object.
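
For reference, precomputing embeddings for such a list of nouns could look roughly like the sketch below. This is not the repository's EmbeddingLibrary implementation; the model name and the unit-normalization step are assumptions for illustration only.

# Rough sketch, not the repository's EmbeddingLibrary; model name and
# normalization are assumptions for illustration.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_words(words, model="text-embedding-3-small"):
    response = client.embeddings.create(model=model, input=words)
    vectors = np.array([item.embedding for item in response.data])
    # Unit-normalize so that dot products equal cosine similarities.
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

noun_embeddings = embed_words(["mountain", "summit", "explosion", "fire"])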

Running it

Before running this example you need to install the modules in requirements.txt.

Navigate to the examples/common_nouns directory and execute the script infer_nouns_from_nouns.py with your input words as arguments:

python infer_nouns_from_nouns.py word1 word2 ...

The script then processes the input words, which may or may not belong to the list of 500 common nouns. If not, they are embedded on-the-fly using the OpenAI embeddings endpoint, for which a valid OPENAI_API_KEY environment variable needs to be set.

Output

Here's an example output for the inputs mountain, summit, explosion, and fire:

word,     value,   mixture,                 gain
volcano,  0.918,   [0.38 0.1  0.35 0.17],   0.377
lava,     0.893,   [0.33 0.08 0.31 0.28],   0.330
impact,   0.887,   [0.17 0.23 0.29 0.32],   0.317
bomb,     0.909,   [0.11 0.13 0.38 0.37],   0.310
  • The first column lists the words whose embeddings are best approximated by combinations of the input-word embeddings.
  • The second column measures the goodness of the approximation (cosine similarity minus the regularization term).
  • The third column provides the mixing recipe, indicating the proportions in which the input embeddings should be combined to approximate the target embedding most closely.
  • The fourth column shows the gain in similarity, indicating how much closer the target embedding is to the combined embedding than to any single embedding in the set (illustrated in the sketch below).
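
For a concrete reading of the gain column, the following sketch shows one plausible computation. It is illustrative only: the package's exact regularization term is not reproduced, and the unit-norm assumption on the inputs is ours.

# Illustrative only, not the package's exact computation. Assumes the target
# embedding t and the rows of the input matrix X are unit-normalized.
import numpy as np

def gain(t, X, w):
    combined = X.T @ w                         # mix the input embeddings
    combined /= np.linalg.norm(combined)       # renormalize the combination
    sim_combined = float(t @ combined)         # cosine similarity to the mixture
    sim_best_single = float(np.max(X @ t))     # best single-input similarity
    return sim_combined - sim_best_single      # how much the mixture helps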

📈 Interpreting the Output: In the example output, the word 'volcano' is well approximated: it has a high value of 0.918, and its mixture draws on all the input words, with 'mountain' and 'explosion' carrying higher weights. The 'gain' indicates that combining the input words brings us 0.377 closer to 'volcano' than any single input word would.

'Bomb' is also well approximated, but in this case, the words weighted higher are 'explosion' and 'fire'.

Synthetic retrieval example

Overview

This example simulates the scenario of matching a query embedding against a set of embeddings, i.e., the same operation that may be performed in the Retrieval step of a Retrieval-Augmented Generation (RAG) workflow. In that case, the retrieved embeddings point to contexts (text documents or chunks thereof) that may contain information relevant to the input query. In practice, what we expect from applying our method to that end is a set of relevant and diverse embeddings with which to construct a compact but informative context for the downstream Generation step.

🔍 What to Expect: This example aims to mimic a real-world retrieval scenario but in a simplified manner. It helps in evaluating the effectiveness of our algorithm in returning diverse and relevant embeddings.

Task

Instead of using embeddings extracted from real texts, here we abstract the problem by generating random vector embeddings with a particular structure. Namely, embeddings are generated around four prototype embeddings, thus forming four clusters. Moreover, those clusters are exponentially unbalanced in size, with sizes (number of members) of 1, 9, 90, and 900.

As a final step, the query embedding is also constructed as a random additive combination of the four prototype embeddings. With that, we expect that, in order to best reconstruct the query embedding, one should gather at least one embedding from each of the four clusters.

The reason for that is that once you have retrieved an embedding from one cluster, other embeddings from that cluster may still lie close to the query embedding, but returning them would mostly add redundant information, and that redundancy would hinder the performance of the generation step.

📚 Why This Matters: The example demonstrates how redundant or imbalanced clusters can affect the quality of retrieved information. The goal is to showcase the capability of the algorithm to provide a diverse set of relevant embeddings.
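
To make the setup concrete, here is a minimal sketch of how such synthetic data could be generated. The script's actual procedure (for instance, how the minimum cluster contribution is enforced) may differ; the Dirichlet draw for the query weights is just one plausible choice.

# Minimal sketch of the synthetic setup described above; the script's actual
# generation procedure may differ in its details.
import numpy as np

rng = np.random.default_rng(0)
dim, dispersion = 64, 0.1
cluster_sizes = [1, 9, 90, 900]              # exponentially unbalanced clusters

prototypes = rng.normal(size=(4, dim))
prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)

# Each cluster member is a noisy copy of its prototype.
contexts = np.vstack([
    proto + dispersion * rng.normal(size=(size, dim))
    for proto, size in zip(prototypes, cluster_sizes)
])
contexts /= np.linalg.norm(contexts, axis=1, keepdims=True)

# The query is a random additive combination of the four prototypes.
weights = rng.dirichlet(np.ones(4))          # one plausible way to draw weights
query = weights @ prototypes
query /= np.linalg.norm(query)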

Running it

Navigate to examples/synthetic_retrieval and run:

python redundant_unbalanced_contexts.py

You can use different parameters to configure the problem, controlling how dispersed the clusters are and how evenly they are combined in the input query.

python redundant_unbalanced_contexts.py
   -p: number of problems to run
   -n: number of items to retrieve
   -c: minimum contribution of each cluster
   -d: cluster dispersion

🛠 Customization: Feel free to experiment with the parameters to see how the algorithm performs under different conditions. For example, a higher cluster dispersion might make the problem more challenging.

Note that if the cluster dispersion is set very high (>0.5), the cluster structure may eventually be lost.

Output

The script outputs the average number of contexts present in the solution, along with a summary of the configuration parameters.

Problem conf. summary:
----------------------
Number of problems run: 40
Number of items retrieved for each solution: 5
Minimum contribution of each cluster to the solution: 0.3
Cluster dispersion: 0.1
----------------------
Different clusters per solver (average and std):
solve_by_cvx_projection: mean 2.75, std: 0.54
solve_by_least_squares: mean 2.225, std: 0.88
solve_by_similarities: mean 1.325, std: 0.52

📊 Analysis: The average number of different clusters returned by solve_by_cvx_projection is consistently higher here, indicating that our method effectively captures diversity, which would be beneficial for downstream tasks like information retrieval or question answering.

Summary

This repository presents a convex-optimization method designed for approximating target embeddings using a reduced set of diverse and relevant embeddings. Utilizing cosine similarity as the measure of closeness, the method projects a given embedding onto a convex cone defined by a collection of other embeddings.
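
The core problem fits in a few lines of cvxpy. The sketch below is one possible formulation of that projection; the repository's exact objective, constraints, and regularization may differ.

# One possible cvxpy formulation of the projection described above; the
# repository's exact objective and regularization may differ.
import cvxpy as cp
import numpy as np

def project_onto_cone(target, candidates, reg=0.1):
    # target: unit-norm (d,) vector; candidates: (k, d) matrix with unit-norm rows.
    w = cp.Variable(candidates.shape[0], nonneg=True)  # nonnegative mixture weights
    approx = candidates.T @ w                           # a point in the convex cone
    # Bounding the norm of the combination makes the inner product behave like
    # cosine similarity while keeping the problem convex; the weight penalty
    # keeps the mixture sparse, favoring a reduced set of contributing embeddings.
    objective = cp.Maximize(target @ approx - reg * cp.sum(w))
    problem = cp.Problem(objective, [cp.norm(approx, 2) <= 1])
    problem.solve()
    return w.value

In such a formulation, the nonzero entries of w play the role of the "mixture" shown in the common nouns example.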

Is this method novel?

While the method and its application to reasoning with embeddings may not be groundbreaking, it addresses an important gap in practical implementation. Convex optimization techniques are often buried in technical jargon, making them less accessible to practitioners. Yet their very simplicity is what makes these techniques powerful, and they remain underutilized, particularly as embedding adoption surges through Language Models and Vector Databases.

Why this repo matters

This repository aims to demystify these robust, mathematically sound techniques by offering practical examples. By doing so, it seeks to accelerate community adoption and enhance current pipelines, providing a valuable resource even if the methods themselves are not novel. The focus here is not on reinventing the wheel but on facilitating its more effective use in a broader context.

Note: indeed, at this stage, depending on your use case, rather than importing this module, it may be leaner for your project to import cvxpy directly and simply add the code with the problem definition (here).

Next steps

There are mostly two ways forward for this repo:

  • Additional use cases: Develop a more comprehensive set of use case examples, for instance, with a real RAG workflow or specific relation inference from known ontologies.

  • Integrations: Explore integrations either with existing vector DBs or an application with a specific Knowledge DB (e.g., an integration with a neo4j graph DB).

Contributions are welcome.
