diff --git a/qdrant-landing/content/articles/solving-contexto.md b/qdrant-landing/content/articles/solving-contexto.md new file mode 100644 index 000000000..cc8e60670 --- /dev/null +++ b/qdrant-landing/content/articles/solving-contexto.md @@ -0,0 +1,200 @@ +--- +title: Solving Contexto.me with Vector Search +short_description: Solving Contexto.me with Vector Search and its practical implications +description: How to solve Contexto.me with Vector Search and why it is more important than it seems +social_preview_image: /articles_data/solving-contexto/preview/social_preview.jpg +preview_dir: /articles_data/solving-contexto/preview +small_preview_image: /articles_data/solving-contexto/icon.svg +weight: 8 +author: Andrei Vasnetsov +author_link: https://blog.vasnetsov.com/ +date: 2022-06-28T08:57:07.604Z +# aliases: [ /articles/solving-contexto/ ] +--- + + + +## Solving what? + + + +[Contexto.me](https://contexto.me/) is a linguistic game that takes the popular word game [Wordle](https://www.nytimes.com/games/wordle/index.html) to the next level. +In this game, players must guess a secret word by submitting guesses and receiving feedback on the similarity of their guess to the secret word. + +The game claims that it "uses an artificial intelligence algorithm to sort words by their similarity to the secret word". +When a player submits a guess, they receive feedback on its position in the sorted list of words. +Players have an unlimited number of guesses, but the game rewards those who can solve it with fewer attempts. + +Try to solve it yourself and then come back to see how we taught the machine to solve it! + +## Naive approaches + + + +It's clear that the game is using some kind of Word2Vec model to sort words by their similarity to the secret word. + +
+Spoiler + + +Contexto.me uses the GloVe model: [link](https://nlp.stanford.edu/projects/glove/) + +
+ +
+ + +Word2vec is a method for representing words in a way that captures their meanings and relationships to other words. +It uses machine learning to learn these representations from the contexts in which words appear. +This means that words with similar meanings will have similar representations, and words that often appear together will also have similar representations. +The goal of word2vec is to create a compact and efficient representation of words that can be used in natural language processing tasks, such as determining the similarity between words or predicting the next word in a sentence. + +{{< figure src=/articles_data/solving-contexto/cbow-word2vec.webp caption="Word2Vec training architecture">}} + + +Word2vec is one of the first methods used to represent objects in vector space. +There are now many more sophisticated methods that can capture the meaning of whole texts, not just single words. +But for our purposes, word2vec will work just fine. + +So, here's the naive approach you've probably already thought of: + +We can start with a random word and look into the list of similar words using some Word2Vec model (not necessarily the same one used in the game). +If we see a word closer to the secret word, we use it as a reference and repeat the process. + +Although this approach works if we are initially close enough to the secret word, it is generally quite slow and inefficient. +It tends to get stuck in clusters of words that are similar to each other, forcing us to retrieve many words from the model. + +Additionally, using linear algebra techniques to compute the exact vector from distances to given points does not look feasible in this scenario. +This is because the exact word2vec model used to sort the words is unknown, as is the exact distance to the secret word. +The only option is to compare distances between words. 
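The naive loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the game's code or the script from the article: the two-dimensional `TOY_VECTORS`, the `game_rank` stand-in for the game's feedback, and the `naive_solve` helper are all hypothetical names invented for this example.

```python
import math

# Hypothetical 2-D toy "embeddings"; a real word2vec model has hundreds of dimensions.
TOY_VECTORS = {
    "house": (0.9, 0.1),
    "home": (0.85, 0.2),
    "roof": (0.8, 0.3),
    "kitchen": (0.6, 0.5),
    "grill": (0.5, 0.6),
    "sky": (0.2, 0.95),
    "blue": (0.1, 0.9),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def game_rank(guess, secret, vectors):
    """Stand-in for the game's feedback: 1 means the guess is the secret word."""
    ordered = sorted(
        vectors, key=lambda w: cosine(vectors[w], vectors[secret]), reverse=True
    )
    return ordered.index(guess) + 1


def naive_solve(start, secret, vectors, neighbours=2):
    """Greedy search: keep the best word so far and probe its nearest neighbours."""
    best, best_rank = start, game_rank(start, secret, vectors)
    guesses, tried = [start], {start}
    while best_rank > 1:
        untried = [w for w in vectors if w not in tried]
        if not untried:
            break
        # Probe the untried words most similar to the current reference word.
        untried.sort(key=lambda w: cosine(vectors[w], vectors[best]), reverse=True)
        for word in untried[:neighbours]:
            tried.add(word)
            guesses.append(word)
            rank = game_rank(word, secret, vectors)
            if rank < best_rank:
                best, best_rank = word, rank  # closer word becomes the new reference
                break
    return guesses


print(naive_solve("blue", "house", TOY_VECTORS))
```

On a vocabulary this small the loop always reaches the secret word. With a real vocabulary, every probe costs a guess, which is where the inefficiency discussed above comes from.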
+ + +## One working approach + +Based on the previous section, we can conclude that using only the most similar word found so far is not enough to generate efficient guesses. +A more efficient solution must also consider the words deemed dissimilar. + +Let's consider the simplest case: we have guessed two words, `house` and `blue`, and received feedback on their similarity to the secret word. + +One of the words is closer to the secret word than the other, so we can make some assumptions about the secret word. +We understand that the secret word is more likely to be similar to `house` than to `blue`, but we only have information about its relative similarity to these two words. + +Let's assign a score to each word in the vocabulary based on this observation: + +{{< figure src=/articles_data/solving-contexto/scoring-1.png caption="Scoring words based on 2 guesses">}} + +We assign a score of +1 to words that are closer to `house` than to `blue` and a score of -1 to words that are closer to `blue` than to `house`. + +Now, we can use this score to rank the words in the vocabulary and use the word with the highest score as our next guess. + +Let's see how scores change after we make a third guess: + +{{< figure src=/articles_data/solving-contexto/scoring-2.png caption="Ranking words based on next 2 guesses">}} + +We can generalize this approach to any number of guesses. +The simplest way to do this is to sample pairs of guesses and update the score iteratively. + +That's it! We can use this approach to suggest words one by one and extend the guess list accordingly. + +Benefits of this approach: + +- It is stochastic. If there are inconsistencies in the input data, the algorithm can tolerate them. +- The algorithm does not require using exactly the same model as used in the game. It can work with any distance metric and any dimensionality of the vector space. +- The algorithm is invariant to the order of the input data. 
+- The algorithm relies only on the relative similarity of the words and can be easily adapted to other types of input. + +We even made a simple script that you can run yourself. Check it out on [GitHub](https://github.com/qdrant/contexto). + +The script uses [Gensim](https://radimrehurek.com/gensim/) and `word2vec-google-news-300` embeddings. +On average, it takes 20-30 guesses to solve the game. +If we used the same model as the game, it would converge much faster, but in real life such information is rarely available, so we decided to test a more realistic scenario. + + +
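The pairwise scoring idea from this section can be sketched in a few lines. This is a simplified illustration and not the script from the repository: it assumes a tiny hypothetical set of 2-D vectors in place of a real embedding model, and `score_candidates` and `next_guess` are names made up for this example. The ranks for `house` and `blue` are the ones shown in the figures.

```python
import math
import random

# Hypothetical 2-D toy vectors standing in for a real embedding model.
TOY_VECTORS = {
    "house": (0.9, 0.1),
    "home": (0.85, 0.2),
    "roof": (0.8, 0.3),
    "kitchen": (0.6, 0.5),
    "grill": (0.5, 0.6),
    "sky": (0.2, 0.95),
    "blue": (0.1, 0.9),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def score_candidates(guesses, vectors, n_pairs=100, seed=0):
    """Score every unguessed word by sampling pairs of guesses.

    `guesses` maps each guessed word to the rank reported by the game
    (a lower rank means the word is closer to the secret word).
    """
    rng = random.Random(seed)
    scores = {w: 0 for w in vectors if w not in guesses}
    guessed_words = list(guesses)
    for _ in range(n_pairs):
        a, b = rng.sample(guessed_words, 2)
        closer, farther = (a, b) if guesses[a] < guesses[b] else (b, a)
        for word in scores:
            sim_closer = cosine(vectors[word], vectors[closer])
            sim_farther = cosine(vectors[word], vectors[farther])
            if sim_closer > sim_farther:
                scores[word] += 1  # agrees with the game's feedback
            elif sim_closer < sim_farther:
                scores[word] -= 1  # contradicts the game's feedback
    return scores


def next_guess(guesses, vectors):
    """Suggest the unguessed word with the highest score."""
    scores = score_candidates(guesses, vectors)
    return max(scores, key=scores.get)


guesses = {"house": 846, "blue": 5625}
print(score_candidates(guesses, TOY_VECTORS))
print(next_guess(guesses, TOY_VECTORS))
```

Because the scores depend only on which guess in a pair ranks better, the sketch never needs the game's actual model or its distances, which matches the benefits listed above.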
+Here is an animation of how the script selects real words + + +{{< figure src=/articles_data/solving-contexto/sonving.webp caption="Solving Contexto.me with our script">}} + +
+ +
+ +## Why it might be useful in real life + + + +Although this game seems to have nothing to do with issues that arise in real life, it is, in fact, a simplified version of a problem found in many industries. + +For example, recommendation systems try to find the most relevant items for a user based on their previous purchases and reviews. + +Searching for a piece of art or graphics is a similar problem. Users might not know exactly what they want, but they can use similarities to explore the collection. +This scenario can be implemented as navigation in the vector space. + +In general, this approach applies whenever users do not know exactly what they want or cannot describe it with a text query, but can use other items as references. + +Moreover, with modern multi-modal neural networks, like [CLIP](https://openai.com/blog/clip/), you can combine initial text queries with more detailed clarification selections. +So, for example, the user can type "I want a picture of a cat" and then select a breed of cat based on how it looks. + +{{< figure src=/articles_data/solving-contexto/clip.png caption="CLIP model by OpenAI">}} + +### How it scales + +Previously, we mentioned that the algorithm scores each word in the vocabulary. +This operation is fast enough for small vocabularies, but it can become a bottleneck for large ones. + +Fortunately, in most real-life scenarios, we don't need to score all entries in the collection. +Moreover, it will work even better if we don't score pairs that have only a slight difference in their similarities to the target object. + +So what we actually need is to find the top most similar and most dissimilar vectors to the reference query. +And Qdrant is the perfect tool for this task! + +In Qdrant, you can use already stored records to find the most similar vectors **fast**. +And dissimilar vectors are just vectors that are similar to an inverted query. 
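The inverted-query trick is easy to check on a toy example. The sketch below assumes cosine similarity (the same argument holds for the dot product, but not for Euclidean distance); the item names, the vectors, and the `most_similar` helper are hypothetical and exist only for illustration.

```python
import math

# Hypothetical stored vectors; names and values are made up for this example.
ITEMS = {
    "a": (1.0, 0.2),
    "b": (0.5, 0.8),
    "c": (-0.7, 0.4),
    "d": (-0.9, -0.3),
    "e": (0.1, -1.0),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def most_similar(query, vectors, top=2):
    """Return the `top` keys whose vectors are most similar to `query`."""
    ranked = sorted(vectors, key=lambda k: cosine(vectors[k], query), reverse=True)
    return ranked[:top]


query = (0.9, 0.1)
inverted_query = tuple(-x for x in query)  # negate every component

# Searching with the inverted query returns the least similar items first,
# because cosine(v, -q) == -cosine(v, q).
most_dissimilar = most_similar(inverted_query, ITEMS)
print(most_similar(query, ITEMS), most_dissimilar)
```

In other words, a search engine that only knows how to return nearest neighbours can also return the farthest ones, at the cost of one extra query.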
+ +Check out our [documentation](https://qdrant.tech/documentation/search/#recommendation-api) to learn more about how to use Qdrant for this and other tasks. + + diff --git a/qdrant-landing/content/documentation/cloud.md b/qdrant-landing/content/documentation/cloud.md deleted file mode 100644 index 1e60b0957..000000000 --- a/qdrant-landing/content/documentation/cloud.md +++ /dev/null @@ -1,27 +0,0 @@ ---- -title: Cloud -weight: 55 ---- - -[Qdrant Cloud](https://qdrant.tech/surveys/cloud-request/) is an official SaaS offering of the Qdrant vector database. It provides the same -fast and reliable similarity search engine, but without a need to maintain your own infrastructure. The transition from the on-premise -to the cloud version of Qdrant does not require changing anything in the way you interact with the service, except for an API key that has -to be provided to each request. - -The transition is even easier if you use the official client libraries. For example, the [Python Qdrant client](/documentation/install/#python-client) -has the support of the API key already built-in, so you only need to provide it once, when the `QdrantClient` instance is created. - -```python -from qdrant_client import QdrantClient - -qdrant_client = QdrantClient( - host="xyz-example.eu-central.aws.staging-cloud.qdrant.io", - prefer_grpc=True, - api_key="<<-provide-your-own-key->>", -) -``` - -```bash -curl \ - -X GET https://xyz-example.eu-central.aws.staging-cloud.qdrant.io:6333 \ - --header 'api-key: ' diff --git a/qdrant-landing/content/documentation/configuration.md b/qdrant-landing/content/documentation/configuration.md index 906430ee6..3ee13a233 100644 --- a/qdrant-landing/content/documentation/configuration.md +++ b/qdrant-landing/content/documentation/configuration.md @@ -77,7 +77,7 @@ storage: # Segments larger than this threshold will be stored as read-only memmaped file. 
# To enable memmap storage, lower the threshold # Note: 1Kb = 1 vector of size 256 - memmap_threshold_kb: 200000 + memmap_threshold_kb: null # Maximum size (in KiloBytes) of vectors allowed for plain index. # Default value based on https://github.com/google-research/google-research/blob/master/scann/docs/algorithms.md diff --git a/qdrant-landing/content/documentation/storage.md b/qdrant-landing/content/documentation/storage.md index 70ad58d26..4031ae7ec 100644 --- a/qdrant-landing/content/documentation/storage.md +++ b/qdrant-landing/content/documentation/storage.md @@ -27,8 +27,26 @@ The choice has to be made between the search speed and the size of the RAM used. **Memmap storage** - creates a virtual address space associated with the file on disk. [Wiki](https://en.wikipedia.org/wiki/Memory-mapped_file). Mmapped files are not directly loaded into RAM. Instead, they use page cache to access the contents of the file. This scheme allows flexible use of available memory. With sufficient RAM, it is almost as fast as in-memory storage. + + + +### Configuring Memmap storage + +To configure usage of mmap storage, you need to specify the threshold after which the segment will be converted to mmap storage. +There are two ways to do this: + +1. You can set the threshold globally in the [configuration file](../configuration/). The parameter is called `memmap_threshold_kb`. +2. You can set the threshold for each collection separately during [creation](../collections/#create-collection) or [update](../collections/#update-collection-parameters). + + +In addition, you can use mmap storage not only for vectors, but also for HNSW index. +To enable this, you need to set the `hnsw_config.on_disk` parameter to `true` during [creation](../collections/#create-collection) of the collection. 
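As a rough illustration of the per-collection option, the request body for creating a collection might look like the sketch below, built here as a plain Python dictionary. Only `hnsw_config.on_disk` is named in the text above; the per-collection field name `optimizers_config.memmap_threshold` is an assumption based on the Qdrant collections API, and the vector size, distance, and threshold value are placeholders. Check the linked collection documentation for the exact schema of your version.

```python
import json

# Hypothetical request body for creating a collection (values are placeholders).
create_collection_body = {
    "vectors": {"size": 300, "distance": "Cosine"},
    # Assumed per-collection threshold (in KB) after which segments use mmap storage.
    "optimizers_config": {"memmap_threshold": 20000},
    # Store the HNSW index on disk as well.
    "hnsw_config": {"on_disk": True},
}

print(json.dumps(create_collection_body, indent=2))
```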
+ ## Payload storage diff --git a/qdrant-landing/static/articles_data/solving-contexto/cbow-word2vec.webp b/qdrant-landing/static/articles_data/solving-contexto/cbow-word2vec.webp new file mode 100644 index 000000000..bfad7b142 Binary files /dev/null and b/qdrant-landing/static/articles_data/solving-contexto/cbow-word2vec.webp differ diff --git a/qdrant-landing/static/articles_data/solving-contexto/clip.png b/qdrant-landing/static/articles_data/solving-contexto/clip.png new file mode 100644 index 000000000..15de9986a Binary files /dev/null and b/qdrant-landing/static/articles_data/solving-contexto/clip.png differ diff --git a/qdrant-landing/static/articles_data/solving-contexto/icon.svg b/qdrant-landing/static/articles_data/solving-contexto/icon.svg new file mode 100644 index 000000000..a3ef5c37e --- /dev/null +++ b/qdrant-landing/static/articles_data/solving-contexto/icon.svg @@ -0,0 +1,110 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/qdrant-landing/static/articles_data/solving-contexto/image.jpg b/qdrant-landing/static/articles_data/solving-contexto/image.jpg new file mode 100644 index 000000000..4b97c3fe8 Binary files /dev/null and b/qdrant-landing/static/articles_data/solving-contexto/image.jpg differ diff --git a/qdrant-landing/static/articles_data/solving-contexto/preview/preview.jpg b/qdrant-landing/static/articles_data/solving-contexto/preview/preview.jpg new file mode 100644 index 000000000..d073f4edd Binary files /dev/null and b/qdrant-landing/static/articles_data/solving-contexto/preview/preview.jpg differ diff --git a/qdrant-landing/static/articles_data/solving-contexto/preview/preview.webp b/qdrant-landing/static/articles_data/solving-contexto/preview/preview.webp new file mode 100644 index 000000000..ea04fb091 Binary files /dev/null and b/qdrant-landing/static/articles_data/solving-contexto/preview/preview.webp differ diff --git 
a/qdrant-landing/static/articles_data/solving-contexto/preview/social_preview.jpg b/qdrant-landing/static/articles_data/solving-contexto/preview/social_preview.jpg new file mode 100644 index 000000000..c7c3f3113 Binary files /dev/null and b/qdrant-landing/static/articles_data/solving-contexto/preview/social_preview.jpg differ diff --git a/qdrant-landing/static/articles_data/solving-contexto/preview/title.jpg b/qdrant-landing/static/articles_data/solving-contexto/preview/title.jpg new file mode 100644 index 000000000..91d3370c9 Binary files /dev/null and b/qdrant-landing/static/articles_data/solving-contexto/preview/title.jpg differ diff --git a/qdrant-landing/static/articles_data/solving-contexto/preview/title.webp b/qdrant-landing/static/articles_data/solving-contexto/preview/title.webp new file mode 100644 index 000000000..ae317fe32 Binary files /dev/null and b/qdrant-landing/static/articles_data/solving-contexto/preview/title.webp differ diff --git a/qdrant-landing/static/articles_data/solving-contexto/scoring-1.png b/qdrant-landing/static/articles_data/solving-contexto/scoring-1.png new file mode 100644 index 000000000..d0b21467e Binary files /dev/null and b/qdrant-landing/static/articles_data/solving-contexto/scoring-1.png differ diff --git a/qdrant-landing/static/articles_data/solving-contexto/scoring-1.svg b/qdrant-landing/static/articles_data/solving-contexto/scoring-1.svg new file mode 100644 index 000000000..b1ccbc23b --- /dev/null +++ b/qdrant-landing/static/articles_data/solving-contexto/scoring-1.svg @@ -0,0 +1,298 @@ + + + + + + + + + + + + + + house + blue + 5625 + 846 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + blue + house + + --1 + +1 + + diff --git a/qdrant-landing/static/articles_data/solving-contexto/scoring-2.png b/qdrant-landing/static/articles_data/solving-contexto/scoring-2.png new file mode 100644 index 000000000..13a881823 Binary files /dev/null and b/qdrant-landing/static/articles_data/solving-contexto/scoring-2.png 
differ diff --git a/qdrant-landing/static/articles_data/solving-contexto/scoring-2.svg b/qdrant-landing/static/articles_data/solving-contexto/scoring-2.svg new file mode 100644 index 000000000..97a7548e1 --- /dev/null +++ b/qdrant-landing/static/articles_data/solving-contexto/scoring-2.svg @@ -0,0 +1,606 @@ + + + + + + + + + + + + + + grill + 121 + + + + + house + 846 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + house + grill + + --1 + +1 + + diff --git a/qdrant-landing/static/articles_data/solving-contexto/sonving.webp b/qdrant-landing/static/articles_data/solving-contexto/sonving.webp new file mode 100644 index 000000000..e8aeb51d1 Binary files /dev/null and b/qdrant-landing/static/articles_data/solving-contexto/sonving.webp differ