diff --git a/qdrant-landing/content/documentation/101-foundations/04_qdrant_101_cv.md b/qdrant-landing/content/documentation/101-foundations/04_qdrant_101_cv.md
new file mode 100644
index 000000000..608a74e7b
--- /dev/null
+++ b/qdrant-landing/content/documentation/101-foundations/04_qdrant_101_cv.md
@@ -0,0 +1,899 @@
---
notebook_path: 101-foundations/qdrant_101_image_data/04_qdrant_101_cv.ipynb
reading_time_min: 38
title: Qdrant & Image Data
---

# Qdrant & Image Data

![crab](documentation/101-foundations/04_qdrant_101_cv/crabmera.png)

In this tutorial, you will learn how to use semantic search for accurate skin cancer image comparison using Qdrant and the
Hugging Face `transformers` and `datasets` libraries.

## 1. Overview

The aim of this tutorial is to walk you through the process of implementing semantic search techniques with image data and
vector databases. In particular, we'll go over an example of how to assist doctors in comparing rare or challenging images
with labels representing different skin diseases.

Why did we choose this example? With the power of semantic search, medical professionals could enhance
their diagnostic capabilities and make more accurate decisions regarding skin disease diagnosis, effectively helping
people in need of such medical evaluations.

That said, you can swap the dataset used in this tutorial with your own and follow along with minimal adjustments to the code.

The dataset used can be found in the [Hugging Face Hub](https://huggingface.co/datasets/marmal88/skin_cancer) and you don't
need to take any additional step to download it other than to run the code below.

Here is a short description of each of the variables available in the dataset.

- `image` - PIL object of size 600x450
- `image_id` - unique id for the image
- `lesion_id` - unique id for the type of lesion on the skin of the patient
- `dx` - diagnosis given to the patient (e.g., melanocytic_Nevi, melanoma, benign_keratosis-like_lesions, basal_cell_carcinoma,
  actinic_keratoses, vascular_lesions, dermatofibroma)
- `dx_type` - type of diagnosis (e.g., histo, follow_up, consensus, confocal)
- `age` - the age of the patient, from 5 to 86 (some values are missing)
- `sex` - the gender of the patient (female, male, and unknown)
- `localization` - location of the spot on the body (e.g., 'lower extremity', 'upper extremity', 'neck', 'face', 'back',
  'chest', 'ear', 'abdomen', 'scalp', 'hand', 'trunk', 'unknown', 'foot', 'genital', 'acral')

By the end of the tutorial, you will be able to extract embeddings from images using transformers and conduct image-to-image
semantic search with Qdrant. Please note, we do assume a bit of familiarity with machine learning and vector database concepts.

## 2. Set Up

Before you run any line of code, please make sure you have

1. downloaded the data
1. created a virtual environment (if not in Google Colab)
1. installed the packages below
1. started a container with Qdrant

The open source version of Qdrant is available as a Docker image and it can be pulled and run from any machine with Docker
installed. If you don't have Docker installed on your PC, you can follow the instructions in the official documentation
[here](https://docs.docker.com/get-docker/). After that, open your terminal and start by downloading the latest Qdrant
image with the following command.

```sh
docker pull qdrant/qdrant
```

Next, initialize Qdrant with the following command, and you should be good to go.

```sh
docker run -p 6333:6333 \
    -v $(pwd)/qdrant_storage:/qdrant/storage \
    qdrant/qdrant
```

Verify that you are ready to go by installing the dependencies below, importing the required libraries, and connecting to Qdrant via its Python client.

```python
# install packages
%pip install qdrant-client transformers datasets pandas numpy torch streamlit
```
+ +```python +from transformers import ViTImageProcessor, ViTModel +from qdrant_client import QdrantClient +from qdrant_client import models +from datasets import load_dataset +import numpy as np +import torch +``` + +
+ +```python +client = QdrantClient(location=":memory:") +``` + +
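Here we use Qdrant's in-memory mode, which is handy for quick experiments in a notebook. If you would rather run against the Docker container started earlier (or any other running Qdrant instance), point the client at it instead. A minimal sketch, assuming Qdrant is listening on the default port 6333:

```python
from qdrant_client import QdrantClient

# Connect to the instance started with `docker run -p 6333:6333 ... qdrant/qdrant`
client = QdrantClient("localhost", port=6333)
```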
+ +```python +my_collection = "image_collection" +client.recreate_collection( + collection_name=my_collection, + vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE), +) +``` + +``` +:2: DeprecationWarning: `recreate_collection` method is deprecated and will be removed in the future. Use `collection_exists` to check collection existence and `create_collection` instead. + client.recreate_collection( + + + + + +True +``` + +## 3. Image Embeddings + +In computer vision systems, vector databases are used to store image features. These image features are vector representations +of images that capture their visual content, and they are used to improve the performance of computer vision tasks such +as object detection, image classification, and image retrieval. + +To extract these useful feature representation from our images, we'll use vision transformers (ViT). ViTs are advanced +algorithms that enable computers to "see" and understand visual information in a similar fashion to humans. They +use a transformer architecture to process images and extract meaningful features from them. + +To understand how ViTs work, imagine you have a large jigsaw puzzle with many different pieces. To solve the puzzle, +you would typically look at the individual pieces, their shapes, and how they fit together to form the full picture. ViTs +work in a similar way, meaning, instead of looking at the entire image at once, vision transformers break it down +into smaller parts called "patches." Each of these patches is like one piece of the puzzle that captures a specific portion +of the image, and these pieces are then analyzed and processed by the ViTs. + +By analyzing these patches, the ViTs identify important patterns such as edges, colors, and textures, and combines them +to form a coherent understanding of a given image. + +That said, let's get started using transformers to extract features from our images. + +We'll begin by reading in the data and examining a sample. + +```python +dataset = load_dataset("marmal88/skin_cancer", split="train") +dataset +``` + +``` +Dataset({ + features: ['image', 'image_id', 'lesion_id', 'dx', 'dx_type', 'age', 'sex', 'localization'], + num_rows: 9577 +}) +``` + +```python +dataset[8500] +``` + +``` +{'image': , + 'image_id': 'ISIC_0025927', + 'lesion_id': 'HAM_0002557', + 'dx': 'melanoma', + 'dx_type': 'histo', + 'age': 50.0, + 'sex': 'female', + 'localization': 'upper extremity'} +``` + +```python +image = dataset[8500]["image"] +image +``` + +![png](documentation/101-foundations/04_qdrant_101_cv/output_15_0.png) + +The image at index 8500, as shown above, is an instance of melanoma, which is a type of skin cancer that starts +in the cells called melanocytes. These are responsible for producing a pigment called melanin that gives color +to our skin, hair, and eyes. When melanocytes become damaged or mutate, they can start growing and dividing rapidly, +forming a cancerous growth known as melanoma. Melanoma often appears as an unusual or changing mole, spot, or +growth on the skin, and it can be caused by excessive exposure to ultraviolet (UV) radiation from the sun or +tanning beds, as well as genetic factors. If detected early, melanoma can usually be treated successfully, +but if left untreated, it can spread to other parts of the body and become more difficult to treat. 
+ +Because Melanoma can often be hard to detect, and we want to empower doctors with the ability to compare +and contrast cases that are difficult to classify without invasive procedures (i.e., by taking a sample of the +skin of the patient), we will create for them a system that allows them to compare images taken from patients +with those already inside Qdrant in the shape of a vector. + +In order to search through the images and provide the most similar ones to the doctors, we'll need to download +a pre-trained model that will help us extract the embedding layer from our dataset. We'll do this using the +transformers library and Facebook's [DINO model](https://huggingface.co/facebook/dino-vitb8). + +```python +device = torch.device("cuda" if torch.cuda.is_available() else "cpu") +processor = ViTImageProcessor.from_pretrained("facebook/dino-vits16") +model = ViTModel.from_pretrained("facebook/dino-vits16").to(device) +``` + +Let's process the instance of melanoma we selected earlier using our feature extractor from above. To learn more about `ViTImageProcessor` +and `ViTModel`, check out the [docs here](https://huggingface.co/docs/transformers/tasks/image_classification). + +```python +inputs = processor(images=image, return_tensors="pt").to(device) +inputs["pixel_values"].shape, inputs +``` + +``` +(torch.Size([1, 3, 224, 224]), + {'pixel_values': tensor([[[[-1.3987, -1.4329, -1.4500, ..., -1.0904, -1.1075, -1.0904], + [-1.4158, -1.4329, -1.4500, ..., -1.0904, -1.1075, -1.1075], + [-1.4329, -1.4500, -1.4500, ..., -1.1247, -1.1075, -1.1075], + ..., + [-1.1075, -1.1075, -1.0904, ..., -1.4843, -1.5014, -1.5357], + [-1.1075, -1.1075, -1.0904, ..., -1.4843, -1.5185, -1.5528], + [-1.1247, -1.1247, -1.0904, ..., -1.4843, -1.5357, -1.5528]], + + [[-1.7381, -1.7381, -1.7556, ..., -1.5980, -1.6155, -1.6331], + [-1.7381, -1.7381, -1.7556, ..., -1.6155, -1.6331, -1.6506], + [-1.7381, -1.7381, -1.7556, ..., -1.5980, -1.6155, -1.6155], + ..., + [-1.5630, -1.5630, -1.5630, ..., -1.7731, -1.7906, -1.7906], + [-1.5630, -1.5630, -1.5630, ..., -1.7906, -1.7906, -1.8081], + [-1.5805, -1.5805, -1.5630, ..., -1.7906, -1.8081, -1.8081]], + + [[-1.3513, -1.3687, -1.3861, ..., -1.2119, -1.2119, -1.2293], + [-1.3687, -1.3687, -1.3861, ..., -1.2119, -1.2293, -1.2467], + [-1.3687, -1.3861, -1.4036, ..., -1.2119, -1.2119, -1.2293], + ..., + [-1.1770, -1.1770, -1.1596, ..., -1.4559, -1.4907, -1.5081], + [-1.1596, -1.1770, -1.1596, ..., -1.4733, -1.4907, -1.5081], + [-1.1596, -1.1770, -1.1770, ..., -1.4733, -1.4733, -1.5081]]]])}) +``` + +```python +one_embedding = model(**inputs).last_hidden_state +one_embedding.shape, one_embedding[0, 0, :20] +``` + +``` +(torch.Size([1, 197, 384]), + tensor([ 3.0854, 4.9196, -1.1094, 3.3949, -0.8139, 4.8751, 4.4032, -0.6903, + 5.5181, 8.6680, 1.6411, 5.6704, 2.2703, -1.3895, -1.8102, -1.4204, + 8.9997, 8.5076, 5.1398, -7.1862], grad_fn=)) +``` + +As you can see above, what we get back from our preprocessing function is a multi-dimensional tensor represented +as \[`batch_size`, `channels`, `rows`, `columns`\]. The `batch_size` is the amount of samples passed through our +feature extractor and the channels represent the red, green, and blue hues of the image. Lastly, the rows and +columns, which can also be thought of as vectors and dimensions, represent the width and height of the image. This +4-dimensional representation is the input our model expects. 
In return, we get back a tensor +of \[`batch_size`, `patches`, `dimensions`\], and what's left for us to do is to choose a pooling method +for our embedding as it is not feasible to use 197 embedding vectors when one compressed one would suffice for our use +case. For the final step, we'll use mean pooling. + +```python +one_embedding.mean(dim=1).shape +``` + +``` +torch.Size([1, 384]) +``` + +Let's create a function with the process we just walked through above and map it to our dataset to get an +embedding vector for each image. + +```python +def get_embeddings(batch): + inputs = processor(images=batch["image"], return_tensors="pt").to(device) + with torch.no_grad(): + outputs = model(**inputs).last_hidden_state.mean(dim=1).cpu().numpy() + batch["embeddings"] = outputs + return batch +``` + +
+ +```python +dataset = dataset.map(get_embeddings, batched=True, batch_size=16) +``` + +``` +Map: 0%| | 0/9577 [00:00 + +```python +payload = ( + dataset.select_columns(["image_id", "dx", "dx_type", "age", "sex", "localization"]) + .to_pandas() + .fillna({"age": 0}) + .to_dict(orient="records") +) + +payload[:3] +``` + +``` +[{'image_id': 'ISIC_0024329', + 'dx': 'actinic_keratoses', + 'dx_type': 'histo', + 'age': 75.0, + 'sex': 'female', + 'localization': 'lower extremity'}, + {'image_id': 'ISIC_0024372', + 'dx': 'actinic_keratoses', + 'dx_type': 'histo', + 'age': 70.0, + 'sex': 'male', + 'localization': 'lower extremity'}, + {'image_id': 'ISIC_0024418', + 'dx': 'actinic_keratoses', + 'dx_type': 'histo', + 'age': 75.0, + 'sex': 'female', + 'localization': 'lower extremity'}] +``` + +Note that in the cell above we use `.fillna({"age": 0})`, that is because there are several missing values in the `age` column. Because +we don't want to assume the age of a patient, we'll leave this number as 0. Also, at the time of writing, Qdrant will not take in NumPy +`NaN`s but rather regular `None` Python values for anything that might be missing in our dataset. + +To make sure each image has an explicit `id` inside of the Qdrant collection we created earlier, we'll create a new column with a range of +numbers equivalent to the rows in our dataset. In addition, we'll load the embeddings we just saved. + +```python +ids = list(range(dataset.num_rows)) +embeddings = np.load("vectors.npy").tolist() +``` + +We are now ready to upsert the combination of ids, vectors and payload to our collection, and we'll do so in batches of 1000. + +```python +batch_size = 1000 + +for i in range(0, dataset.num_rows, batch_size): + low_idx = min(i + batch_size, dataset.num_rows) + + batch_of_ids = ids[i:low_idx] + batch_of_embs = embeddings[i:low_idx] + batch_of_payloads = payload[i:low_idx] + + client.upsert( + collection_name=my_collection, + points=models.Batch( + ids=batch_of_ids, vectors=batch_of_embs, payloads=batch_of_payloads + ), + ) +``` + +We can make sure our vectors were uploaded successfully by counting them with the `client.count()` method. + +```python +client.count( + collection_name=my_collection, + exact=True, +) +``` + +``` +CountResult(count=9577) +``` + +To visually inspect the collection we just created, we can scroll through our vectors with the `client.scroll()` method. 
+ +```python +client.scroll(collection_name=my_collection, limit=5) +``` + +``` +([Record(id=0, payload={'image_id': 'ISIC_0024329', 'dx': 'actinic_keratoses', 'dx_type': 'histo', 'age': 75.0, 'sex': 'female', 'localization': 'lower extremity'}, vector=None, shard_key=None, order_value=None), + Record(id=1, payload={'image_id': 'ISIC_0024372', 'dx': 'actinic_keratoses', 'dx_type': 'histo', 'age': 70.0, 'sex': 'male', 'localization': 'lower extremity'}, vector=None, shard_key=None, order_value=None), + Record(id=2, payload={'image_id': 'ISIC_0024418', 'dx': 'actinic_keratoses', 'dx_type': 'histo', 'age': 75.0, 'sex': 'female', 'localization': 'lower extremity'}, vector=None, shard_key=None, order_value=None), + Record(id=3, payload={'image_id': 'ISIC_0024450', 'dx': 'actinic_keratoses', 'dx_type': 'histo', 'age': 50.0, 'sex': 'male', 'localization': 'upper extremity'}, vector=None, shard_key=None, order_value=None), + Record(id=4, payload={'image_id': 'ISIC_0024463', 'dx': 'actinic_keratoses', 'dx_type': 'histo', 'age': 50.0, 'sex': 'male', 'localization': 'upper extremity'}, vector=None, shard_key=None, order_value=None)], + 5) +``` + +## 4. Semantic Search + +Semantic search, in the context of vector databases and image retrieval, refers to a method of searching for information or images +based on their meaning or content rather than just using keywords. Imagine you're looking for a specific picture of a skin disease and you +don't know the file name or where it is stored. With semantic search, you can describe what you're looking for using words like +"red rashes with blisters," or you can upload an image that will get processed into an embedding vector, and the system will then analyze +the content of the images to find matches that closely match your description or input image. + +Semantic search enables a more intuitive and efficient way of searching for images, making it easier to find what you're +looking for, even if you can't remember specific details or tags. + +With Qdrant, we can get started searching through our collection with the `client.search()` method. 
+ +```python +client.query_points( + collection_name=my_collection, query=one_embedding.mean(dim=1)[0].tolist(), limit=10 +).points +``` + +``` +[ScoredPoint(id=8500, version=0, score=0.9999999958397132, payload={'image_id': 'ISIC_0025927', 'dx': 'melanoma', 'dx_type': 'histo', 'age': 50.0, 'sex': 'female', 'localization': 'upper extremity'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=9205, version=0, score=0.9296641157036876, payload={'image_id': 'ISIC_0033269', 'dx': 'melanoma', 'dx_type': 'histo', 'age': 55.0, 'sex': 'female', 'localization': 'back'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=9420, version=0, score=0.9249702905493299, payload={'image_id': 'ISIC_0034216', 'dx': 'melanoma', 'dx_type': 'histo', 'age': 35.0, 'sex': 'male', 'localization': 'abdomen'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=9241, version=0, score=0.9203313354555653, payload={'image_id': 'ISIC_0033426', 'dx': 'melanoma', 'dx_type': 'histo', 'age': 55.0, 'sex': 'male', 'localization': 'back'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=968, version=0, score=0.9159926447426332, payload={'image_id': 'ISIC_0025851', 'dx': 'benign_keratosis-like_lesions', 'dx_type': 'consensus', 'age': 80.0, 'sex': 'female', 'localization': 'face'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=8510, version=0, score=0.9090264148967792, payload={'image_id': 'ISIC_0026086', 'dx': 'melanoma', 'dx_type': 'histo', 'age': 55.0, 'sex': 'male', 'localization': 'upper extremity'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=9009, version=0, score=0.9010038920927373, payload={'image_id': 'ISIC_0032244', 'dx': 'melanoma', 'dx_type': 'histo', 'age': 75.0, 'sex': 'female', 'localization': 'face'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=9037, version=0, score=0.8995734112012342, payload={'image_id': 'ISIC_0032511', 'dx': 'melanoma', 'dx_type': 'histo', 'age': 70.0, 'sex': 'male', 'localization': 'scalp'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=180, version=0, score=0.8969780053043659, payload={'image_id': 'ISIC_0029417', 'dx': 'actinic_keratoses', 'dx_type': 'histo', 'age': 80.0, 'sex': 'female', 'localization': 'neck'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=1377, version=0, score=0.8967672074951868, payload={'image_id': 'ISIC_0029613', 'dx': 'benign_keratosis-like_lesions', 'dx_type': 'histo', 'age': 85.0, 'sex': 'female', 'localization': 'chest'}, vector=None, shard_key=None, order_value=None)] +``` + +As you can see in the cell above, we used the melanoma image from earlier and got back other images with Melanoma. The similarity +score also gives us a good indication regarding the similarity of our query image and those in our database (excluding the first one +which is the image itself, of course). But what if our doctors want to look for images from patients as demographically similar +to the one they are evaluating. For this, we can take advantage of Qdrant's Filters. + +```python +female_older_than_55 = models.Filter( + must=[ + models.FieldCondition(key="sex", match=models.MatchValue(value="female")), + ], + should=[ + models.FieldCondition( + key="age", range=models.Range(lt=None, gt=None, gte=55.0, lte=None) + ) + ], +) +``` + +
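A quick note on the clauses used above: conditions under `must` all have to hold, while `should` conditions behave more like an OR, so a point only needs to satisfy at least one of them. If you want both the gender and the age constraint to be hard requirements, a stricter variant of the same filter could look like the sketch below (this exact filter is an assumption, not part of the original tutorial):

```python
strict_female_55_plus = models.Filter(
    must=[
        models.FieldCondition(key="sex", match=models.MatchValue(value="female")),
        models.FieldCondition(key="age", range=models.Range(gte=55.0)),
    ]
)
```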
+ +```python +results = client.query_points( + collection_name=my_collection, + query=one_embedding.mean(dim=1)[0].tolist(), + query_filter=female_older_than_55, + limit=10, +).points +results +``` + +``` +[ScoredPoint(id=9205, version=0, score=0.9296641157036876, payload={'image_id': 'ISIC_0033269', 'dx': 'melanoma', 'dx_type': 'histo', 'age': 55.0, 'sex': 'female', 'localization': 'back'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=968, version=0, score=0.9159926447426332, payload={'image_id': 'ISIC_0025851', 'dx': 'benign_keratosis-like_lesions', 'dx_type': 'consensus', 'age': 80.0, 'sex': 'female', 'localization': 'face'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=9009, version=0, score=0.9010037824853423, payload={'image_id': 'ISIC_0032244', 'dx': 'melanoma', 'dx_type': 'histo', 'age': 75.0, 'sex': 'female', 'localization': 'face'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=180, version=0, score=0.8969780053043659, payload={'image_id': 'ISIC_0029417', 'dx': 'actinic_keratoses', 'dx_type': 'histo', 'age': 80.0, 'sex': 'female', 'localization': 'neck'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=1377, version=0, score=0.8967672074951868, payload={'image_id': 'ISIC_0029613', 'dx': 'benign_keratosis-like_lesions', 'dx_type': 'histo', 'age': 85.0, 'sex': 'female', 'localization': 'chest'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=8943, version=0, score=0.8930567353248282, payload={'image_id': 'ISIC_0031479', 'dx': 'melanoma', 'dx_type': 'histo', 'age': 75.0, 'sex': 'female', 'localization': 'lower extremity'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=8372, version=0, score=0.892454491134191, payload={'image_id': 'ISIC_0024400', 'dx': 'melanoma', 'dx_type': 'histo', 'age': 75.0, 'sex': 'female', 'localization': 'unknown'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=9301, version=0, score=0.8887279207833202, payload={'image_id': 'ISIC_0033678', 'dx': 'melanoma', 'dx_type': 'histo', 'age': 75.0, 'sex': 'female', 'localization': 'back'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=9118, version=0, score=0.8857579899731108, payload={'image_id': 'ISIC_0032892', 'dx': 'melanoma', 'dx_type': 'histo', 'age': 85.0, 'sex': 'female', 'localization': 'back'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=9329, version=0, score=0.8855790164955197, payload={'image_id': 'ISIC_0033834', 'dx': 'melanoma', 'dx_type': 'histo', 'age': 75.0, 'sex': 'female', 'localization': 'back'}, vector=None, shard_key=None, order_value=None)] +``` + +Notice on the payload above how we were able to match the doctors' criteria effortlessly. We can include much convoluted +requests using the other filtering methods available in Qdrant. For more info, please check out the [docs here](https://qdrant.tech/documentation/concepts/filtering/). + +It is important to figure out early whether users of our application should see the similarity score of the results of their search as +this would give them an idea as to how useful the images might be. In addition, they could set up a similarity threshold in Qdrant +and further distill the results they get back. + +Let's evaluate visually the images we just got. 
+ +```python +def see_images(results, top_k=5): + for i in range(top_k): + image_id = results[i].payload["image_id"] + score = results[i].score + dx = results[i].payload["dx"] + gender = results[i].payload["sex"] + age = results[i].payload["age"] + image = dataset.filter( + lambda x: x == image_id, input_columns="image_id" + ).select_columns("image")[0]["image"] + + print(f"Result #{i+1}: {gender} age {age} was diagnosed with {dx}") + print(f"This image score was {score}") + display(image) + print("-" * 50) + print() +``` + +
+ +```python +see_images(results, 3) +``` + +``` +Filter: 0%| | 0/9577 [00:00 + +```python +results3 = client.query_points( + collection_name=my_collection, + query=melo_sample_2["embeddings"], + query_filter=not_cancer_not_face_or_neck, + limit=10, +).points +results3 +``` + +``` +[ScoredPoint(id=1504, version=0, score=0.9205647139973949, payload={'image_id': 'ISIC_0031050', 'dx': 'benign_keratosis-like_lesions', 'dx_type': 'histo', 'age': 70.0, 'sex': 'female', 'localization': 'lower extremity'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=4549, version=0, score=0.9064056682217525, payload={'image_id': 'ISIC_0028252', 'dx': 'melanocytic_Nevi', 'dx_type': 'consensus', 'age': 40.0, 'sex': 'female', 'localization': 'abdomen'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=1552, version=0, score=0.9024446764617793, payload={'image_id': 'ISIC_0031522', 'dx': 'benign_keratosis-like_lesions', 'dx_type': 'histo', 'age': 70.0, 'sex': 'male', 'localization': 'chest'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=1698, version=0, score=0.8992105068273915, payload={'image_id': 'ISIC_0032978', 'dx': 'benign_keratosis-like_lesions', 'dx_type': 'histo', 'age': 70.0, 'sex': 'male', 'localization': 'chest'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=1289, version=0, score=0.8982408384104963, payload={'image_id': 'ISIC_0028972', 'dx': 'benign_keratosis-like_lesions', 'dx_type': 'histo', 'age': 70.0, 'sex': 'male', 'localization': 'lower extremity'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=1426, version=0, score=0.8979201424714957, payload={'image_id': 'ISIC_0030172', 'dx': 'benign_keratosis-like_lesions', 'dx_type': 'histo', 'age': 55.0, 'sex': 'male', 'localization': 'back'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=7734, version=0, score=0.8963290522398759, payload={'image_id': 'ISIC_0033162', 'dx': 'melanocytic_Nevi', 'dx_type': 'histo', 'age': 35.0, 'sex': 'male', 'localization': 'trunk'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=7376, version=0, score=0.8954019301275025, payload={'image_id': 'ISIC_0032555', 'dx': 'melanocytic_Nevi', 'dx_type': 'histo', 'age': 55.0, 'sex': 'female', 'localization': 'back'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=6138, version=0, score=0.8939650176647458, payload={'image_id': 'ISIC_0030645', 'dx': 'melanocytic_Nevi', 'dx_type': 'histo', 'age': 35.0, 'sex': 'female', 'localization': 'upper extremity'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=4450, version=0, score=0.8927497509545003, payload={'image_id': 'ISIC_0028098', 'dx': 'melanocytic_Nevi', 'dx_type': 'histo', 'age': 15.0, 'sex': 'female', 'localization': 'upper extremity'}, vector=None, shard_key=None, order_value=None)] +``` + +```python +see_images(results3, 3) +``` + +``` +Filter: 0%| | 0/9577 [00:00 + +```python +filter_2 = models.Filter( + must_not=[ + models.FieldCondition( + key="dx", match=models.MatchValue(value="benign_keratosis-like_lesions") + ) + ] +) +``` + +
+ +```python +dataset[7700]["dx"] +``` + +``` +'melanocytic_Nevi' +``` + +```python +query_1 = models.QueryRequest( + query=dataset[700]["embeddings"], + filter=filter_1, + with_payload=models.PayloadSelectorExclude( + exclude=["image_id", "dx_type"], + ), + limit=4, +) +``` + +
+ +```python +query_2 = models.QueryRequest( + query=dataset[7700]["embeddings"], + filter=filter_2, + with_payload=models.PayloadSelectorExclude( + exclude=["image_id", "dx_type", "localization"], + ), + limit=7, +) +``` + +
+ +```python +client.query_batch_points(collection_name=my_collection, requests=[query_1, query_2]) +``` + +``` +[QueryResponse(points=[ScoredPoint(id=6965, version=0, score=0.9660647703074432, payload={'dx': 'melanocytic_Nevi', 'age': 55.0, 'sex': 'male', 'localization': 'chest'}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=486, version=0, score=0.9543207786490415, payload={'dx': 'basal_cell_carcinoma', 'age': 55.0, 'sex': 'male', 'localization': 'chest'}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=7130, version=0, score=0.9542695008427471, payload={'dx': 'melanocytic_Nevi', 'age': 55.0, 'sex': 'male', 'localization': 'chest'}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=4063, version=0, score=0.9509104691893362, payload={'dx': 'melanocytic_Nevi', 'age': 75.0, 'sex': 'male', 'localization': 'chest'}, vector=None, shard_key=None, order_value=None)]), + QueryResponse(points=[ScoredPoint(id=7700, version=0, score=1.0000000534635882, payload={'dx': 'melanocytic_Nevi', 'age': 45.0, 'sex': 'female'}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=7726, version=0, score=0.9644913911190263, payload={'dx': 'melanocytic_Nevi', 'age': 45.0, 'sex': 'female'}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=7624, version=0, score=0.9172229754291651, payload={'dx': 'melanocytic_Nevi', 'age': 45.0, 'sex': 'female'}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=6246, version=0, score=0.9084365037261836, payload={'dx': 'melanocytic_Nevi', 'age': 35.0, 'sex': 'male'}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=7852, version=0, score=0.9067266960379179, payload={'dx': 'melanocytic_Nevi', 'age': 40.0, 'sex': 'female'}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=8090, version=0, score=0.9048006034570659, payload={'dx': 'melanocytic_Nevi', 'age': 45.0, 'sex': 'male'}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=9344, version=0, score=0.9044525894722604, payload={'dx': 'melanoma', 'age': 75.0, 'sex': 'male'}, vector=None, shard_key=None, order_value=None)])] +``` + +Excellent! Notice how we got back two lists of results that respect the criteria we've chosen to filter by, and +the payload we wanted to exclude from each. + +That's it! In the next section, we will create an app to showcase the usability of our search engine, Qdrant. + +## 5. 
Putting It All Together

```python
%%writefile image_search_app.py

from transformers import ViTImageProcessor, ViTModel
from qdrant_client import QdrantClient
from PIL import Image
import streamlit as st
import torch

st.title("Skin Images Search Engine")
st.markdown("Upload images with different skin conditions and you'll get the most similar ones from our database of images.")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = ViTImageProcessor.from_pretrained('facebook/dino-vits16')
model = ViTModel.from_pretrained('facebook/dino-vits16').to(device)
client = QdrantClient("localhost", port=6333)

search_top_k = st.slider('How many search results do you want to retrieve?', 1, 40, 5)
image_file = st.file_uploader(label="πŸ“· Skin Condition Image file πŸ”")

if image_file:
    # Open the uploaded file as a PIL image before handing it to the processor
    image = Image.open(image_file)
    st.image(image)

    inputs = processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs).last_hidden_state.mean(dim=1).cpu().numpy()

    st.markdown("## Semantic Search")
    results = client.search(collection_name="image_collection", query_vector=outputs[0], limit=search_top_k)

    for i in range(search_top_k):
        st.header(f"Disease: {results[i].payload['dx']}")
        st.subheader(f"Image ID: {results[i].payload['image_id']}")
        st.markdown(f"Location: {results[i].payload['localization']}")
        st.markdown(f"Gender: {results[i].payload['sex']}")
        st.markdown(f"Age: {results[i].payload['age']}")
```

```
Writing image_search_app.py
```

```python
!streamlit run image_search_app.py
```

```
Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.


  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.28.0.12:8501
  External URL: http://35.192.189.178:8501

```

## 6. Final Thoughts

We have covered quite a bit in this tutorial, and we've only scratched the surface of what we can
do with vectors from images and Qdrant. Since there are a lot of moving parts when it comes to
converting images into vectors, and plenty of ways to go beyond semantic search with Qdrant, here are a
few tutorials you can go through to increase your knowledge on these topics.
+ +- [Fine Tuning Similar Cars Search](https://qdrant.tech/articles/cars-recognition/) +- [Image Similarity with Hugging Face Datasets and Transformers](https://huggingface.co/blog/image-similarity) +- [Fine-Tune ViT for Image Classification with πŸ€— Transformers](https://huggingface.co/blog/fine-tune-vit) +- [Image search with πŸ€— datasets](https://huggingface.co/blog/image-search-datasets) +- [Create a Simple Neural Search Service](https://qdrant.tech/documentation/tutorials/neural-search/) diff --git a/qdrant-landing/content/documentation/101-foundations/collaborative-filtering.md b/qdrant-landing/content/documentation/101-foundations/collaborative-filtering.md new file mode 100644 index 000000000..4c7a2fdc0 --- /dev/null +++ b/qdrant-landing/content/documentation/101-foundations/collaborative-filtering.md @@ -0,0 +1,307 @@ +--- +notebook_path: 101-foundations/collaborative-filtering/collaborative-filtering.ipynb +reading_time_min: 5 +title: Collaborative filtering system for movie recommendations +--- + +# Collaborative filtering system for movie recommendations + +```python +import os +import pandas as pd +import requests +from IPython.display import display, HTML +from qdrant_client import models, QdrantClient +from qdrant_client.http.models import PointStruct, SparseVector, NamedSparseVector +from collections import defaultdict +from dotenv import load_dotenv + +load_dotenv() + +# OMDB API Key +omdb_api_key = os.getenv("OMDB_API_KEY") + +# Collection name +collection_name = "movies" + +# Set Qdrant Client +qdrant_client = QdrantClient( + os.getenv("QDRANT_HOST"), api_key=os.getenv("QDRANT_API_KEY") +) +``` + +
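The client above expects `QDRANT_HOST`, `QDRANT_API_KEY`, and `OMDB_API_KEY` to be defined in a `.env` file, which `python-dotenv` loads for you. If you just want to experiment against a local instance instead of a managed cluster, a minimal alternative (assuming Qdrant is running on the default port, e.g. via `docker run -p 6333:6333 qdrant/qdrant`) is:

```python
from qdrant_client import QdrantClient

# Local Qdrant instance; no API key required
qdrant_client = QdrantClient("localhost", port=6333)
```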

```python
# Function to get movie poster using OMDB API
def get_movie_poster(imdb_id, api_key):
    url = f"https://www.omdbapi.com/?i={imdb_id}&apikey={api_key}"
    response = requests.get(url)
    if response.status_code == 200:
        data = response.json()
        return data.get("Poster", "No Poster Found"), data
    # Return the same (poster, metadata) shape on failure so callers can unpack safely
    return "No Poster Found", {}
```

## Preparing the data

For experimental purposes, the dataset used in this example was [Movielens](https://files.grouplens.org/datasets/movielens/ml-latest.zip), with approximately 33,000,000 ratings and 86,000 movies.

But you can reproduce it with a smaller dataset if you wish; below are two alternatives:

- [Movielens Small](https://files.grouplens.org/datasets/movielens/ml-latest-small.zip)
- [The Movies Dataset from Kaggle](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset/)

```python
# Load CSV files
ratings_df = pd.read_csv("data/ratings.csv", low_memory=False)
movies_df = pd.read_csv("data/movies.csv", low_memory=False)
links = pd.read_csv("data/links.csv")

# Convert movieId in ratings_df and movies_df to string
ratings_df["movieId"] = ratings_df["movieId"].astype(str)
movies_df["movieId"] = movies_df["movieId"].astype(str)

# Add step to convert imdbId to tt format with leading zeros
links["imdbId"] = "tt" + links["imdbId"].astype(str).str.zfill(7)

# Normalize ratings
ratings_df["rating"] = (
    ratings_df["rating"] - ratings_df["rating"].mean()
) / ratings_df["rating"].std()

# Merge ratings with movie metadata to get movie titles
merged_df = ratings_df.merge(
    movies_df[["movieId", "title"]], left_on="movieId", right_on="movieId", how="inner"
)

# Aggregate ratings to handle duplicate (userId, title) pairs
ratings_agg_df = merged_df.groupby(["userId", "movieId"]).rating.mean().reset_index()
```
+ +```python +ratings_agg_df.head() +``` + +
|   | userId | movieId | rating |
|---|--------|---------|--------|
| 0 | 1 | 1 | 0.429960 |
| 1 | 1 | 1036 | 1.369846 |
| 2 | 1 | 1049 | -0.509926 |
| 3 | 1 | 1066 | 0.429960 |
| 4 | 1 | 110 | 0.429960 |
+ +## Create a new Qdrant collection and send the data + +```python +# Create a new Qdrant collection +qdrant_client.create_collection( + collection_name=collection_name, + vectors_config={}, + sparse_vectors_config={"ratings": models.SparseVectorParams()}, +) +``` + +
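Note that the collection defines no dense vectors at all (`vectors_config={}`); every user will be stored purely as a sparse vector named `ratings`, where the indices are movie ids and the values are that user's normalized ratings. As a small illustration, using the numbers from the `ratings_agg_df.head()` output above (this is just a sketch, not data we upload):

```python
from qdrant_client.http.models import SparseVector

# User 1 rated movies 1, 1036, and 1049; only those dimensions are stored
example_user_vector = SparseVector(
    indices=[1, 1036, 1049],
    values=[0.429960, 1.369846, -0.509926],
)
```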
+ +```python +# Convert ratings to sparse vectors +user_sparse_vectors = defaultdict(lambda: {"values": [], "indices": []}) +for row in ratings_agg_df.itertuples(): + user_sparse_vectors[row.userId]["values"].append(row.rating) + user_sparse_vectors[row.userId]["indices"].append(int(row.movieId)) + + +# Define a data generator +def data_generator(): + for user_id, sparse_vector in user_sparse_vectors.items(): + yield PointStruct( + id=user_id, + vector={ + "ratings": SparseVector( + indices=sparse_vector["indices"], values=sparse_vector["values"] + ) + }, + payload={"user_id": user_id, "movie_id": sparse_vector["indices"]}, + ) + + +# Upload points using the data generator +qdrant_client.upload_points(collection_name=collection_name, points=data_generator()) +``` + +## Making a recommendation + +```python +my_ratings = { + 603: 1, # Matrix + 13475: 1, # Star Trek + 11: 1, # Star Wars + 1091: -1, # The Thing + 862: 1, # Toy Story + 597: -1, # Titanic + 680: -1, # Pulp Fiction + 13: 1, # Forrest Gump + 120: 1, # Lord of the Rings + 87: -1, # Indiana Jones + 562: -1, # Die Hard +} +``` + +
+ +```python +# Create sparse vector from my_ratings +def to_vector(ratings): + vector = SparseVector(values=[], indices=[]) + for movie_id, rating in ratings.items(): + vector.values.append(rating) + vector.indices.append(movie_id) + return vector +``` + +
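For example, calling the helper on the ratings defined earlier produces a single sparse query vector whose indices are movie ids and whose values are the +1/-1 preferences (a quick sanity check, assuming the `my_ratings` dictionary from the previous cell):

```python
query_vector = to_vector(my_ratings)
print(query_vector.indices[:3])  # [603, 13475, 11]
print(query_vector.values[:3])   # [1, 1, 1]
```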
+ +```python +# Perform the search +results = qdrant_client.search( + collection_name=collection_name, + query_vector=NamedSparseVector(name="ratings", vector=to_vector(my_ratings)), + limit=20, +) + + +# Convert results to scores and sort by score +def results_to_scores(results): + movie_scores = defaultdict(lambda: 0) + for result in results: + for movie_id in result.payload["movie_id"]: + movie_scores[movie_id] += result.score + return movie_scores + + +# Convert results to scores and sort by score +movie_scores = results_to_scores(results) +top_movies = sorted(movie_scores.items(), key=lambda x: x[1], reverse=True) +``` + +
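The aggregation step implements the core collaborative-filtering idea: each of the 20 most similar users "votes" for every movie they rated, weighted by their similarity score, so movies shared by many similar users rise to the top. A toy illustration of how `results_to_scores` behaves (the hits, ids, and scores here are made up, not real output):

```python
class FakeHit:
    """Stand-in for a Qdrant search hit, for illustration only."""

    def __init__(self, score, movie_ids):
        self.score = score
        self.payload = {"movie_id": movie_ids}


fake_results = [FakeHit(0.75, [1, 2]), FakeHit(0.5, [2, 3])]
print(dict(results_to_scores(fake_results)))
# {1: 0.75, 2: 1.25, 3: 0.5}
```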

```python
# Create HTML to display top 5 results
# NOTE: the original inline HTML/CSS was lost when this notebook was converted,
# so the markup below is a reasonable reconstruction rather than the exact original.
html_content = "<div style='display: flex; flex-wrap: wrap;'>"

for movie_id, score in top_movies[:5]:
    imdb_id_row = links.loc[links["movieId"] == int(movie_id), "imdbId"]
    if not imdb_id_row.empty:
        imdb_id = imdb_id_row.values[0]
        poster_url, movie_info = get_movie_poster(imdb_id, omdb_api_key)
        movie_title = movie_info.get("Title", "Unknown Title")

        html_content += f"""
        <div style='margin: 10px; text-align: center;'>
            <img src='{poster_url}' alt='Poster' style='width: 150px;'>
            <div><b>{movie_title}</b></div>
            <div>Score: {score}</div>
        </div>
        """
    else:
        continue  # Skip if imdb_id is not found

html_content += "</div>"

display(HTML(html_content))
```
The rendered output shows the top five recommended movies with their posters and scores:

```
Toy Story - Score: 131.2033799
Monty Python and the Holy Grail - Score: 131.2033799
Star Wars: Episode V - The Empire Strikes Back - Score: 131.2033799
Star Wars: Episode VI - Return of the Jedi - Score: 131.2033799
Men in Black - Score: 131.2033799
+``` + +```python + +``` diff --git a/qdrant-landing/content/documentation/101-foundations/ecommerce-reverse-image-search.md b/qdrant-landing/content/documentation/101-foundations/ecommerce-reverse-image-search.md new file mode 100644 index 000000000..2a8a48fd7 --- /dev/null +++ b/qdrant-landing/content/documentation/101-foundations/ecommerce-reverse-image-search.md @@ -0,0 +1,2489 @@ +--- +notebook_path: 101-foundations/ecommerce_reverse_image_search/ecommerce-reverse-image-search.ipynb +reading_time_min: 43 +title: 'Ecommerce: reverse image search' +--- + +# Ecommerce: reverse image search + +All e-commerce platforms need a search mechanism. The built-in methods usually rely on some variation of full-text search which finds the relevant documents based on the presence of the words used in a query. In some cases, it might be enough, but there are ways to improve that mechanism and increase sales. If your customer can easily find a product they need, they are more likely to buy it. + +Semantic search is one of the possibilities. It relies not only on keywords but considers the meaning and intention of the query. However, reverse image search might be a way to go if you want to enable non-textual search capabilities. Your customers may struggle to express themselves, so why don't you ease that and start accepting images as your search queries? + +# Amazon product dataset 2020 + +We will use the [Amazon product dataset 2020](https://www.kaggle.com/datasets/promptcloud/amazon-product-dataset-2020/) and see how to enable visual queries for it. The following lines will download it from the cloud and create a directory structure so you can reproduce the results independently. + +```python +!mkdir data +!mkdir data/images +!mkdir queries +!wget -nc --directory-prefix=data/ "https://storage.googleapis.com/qdrant-examples/amazon-product-dataset-2020.zip" +!wget -nc --directory-prefix=queries/ "https://storage.googleapis.com/qdrant-examples/ecommerce-reverse-image-search-queries.zip" +!unzip -u -d queries queries/ecommerce-reverse-image-search-queries.zip +``` + +``` +mkdir: cannot create directory β€˜data’: File exists +mkdir: cannot create directory β€˜data/images’: File exists +mkdir: cannot create directory β€˜queries’: File exists +File β€˜data/amazon-product-dataset-2020.zip’ already there; not retrieving. + +File β€˜queries/ecommerce-reverse-image-search-queries.zip’ already there; not retrieving. 
+ +Archive: queries/ecommerce-reverse-image-search-queries.zip +``` + +```python +!pip install jupyter pandas sentence_transformers "qdrant_client~=1.1.1" pyarrow fastembed +``` + +``` +Requirement already satisfied: jupyter in /usr/local/lib/python3.10/dist-packages (1.0.0) +Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (2.0.3) +Requirement already satisfied: sentence_transformers in /usr/local/lib/python3.10/dist-packages (3.0.1) +Collecting qdrant_client~=1.1.1 + Downloading qdrant_client-1.1.7-py3-none-any.whl (127 kB) + ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 127.8/127.8 kB 1.4 MB/s eta 0:00:00 +[?25hRequirement already satisfied: pyarrow in /usr/local/lib/python3.10/dist-packages (14.0.2) +Requirement already satisfied: fastembed in /usr/local/lib/python3.10/dist-packages (0.3.3) +Requirement already satisfied: notebook in /usr/local/lib/python3.10/dist-packages (from jupyter) (6.5.5) +Requirement already satisfied: qtconsole in /usr/local/lib/python3.10/dist-packages (from jupyter) (5.5.2) +Requirement already satisfied: jupyter-console in /usr/local/lib/python3.10/dist-packages (from jupyter) (6.1.0) +Requirement already satisfied: nbconvert in /usr/local/lib/python3.10/dist-packages (from jupyter) (6.5.4) +Requirement already satisfied: ipykernel in /usr/local/lib/python3.10/dist-packages (from jupyter) (5.5.6) +Requirement already satisfied: ipywidgets in /usr/local/lib/python3.10/dist-packages (from jupyter) (7.7.1) +Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas) (2.8.2) +Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2023.4) +Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2024.1) +Requirement already satisfied: numpy>=1.21.0 in /usr/local/lib/python3.10/dist-packages (from pandas) (1.25.2) +Requirement already satisfied: transformers<5.0.0,>=4.34.0 in /usr/local/lib/python3.10/dist-packages (from sentence_transformers) (4.41.2) +Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from sentence_transformers) (4.66.4) +Requirement already satisfied: torch>=1.11.0 in /usr/local/lib/python3.10/dist-packages (from sentence_transformers) (2.3.0+cu121) +Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from sentence_transformers) (1.2.2) +Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from sentence_transformers) (1.11.4) +Requirement already satisfied: huggingface-hub>=0.15.1 in /usr/local/lib/python3.10/dist-packages (from sentence_transformers) (0.23.4) +Requirement already satisfied: Pillow in /usr/local/lib/python3.10/dist-packages (from sentence_transformers) (10.4.0) +Requirement already satisfied: grpcio>=1.41.0 in /usr/local/lib/python3.10/dist-packages (from qdrant_client~=1.1.1) (1.64.1) +Requirement already satisfied: grpcio-tools>=1.41.0 in /usr/local/lib/python3.10/dist-packages (from qdrant_client~=1.1.1) (1.64.1) +Requirement already satisfied: httpx[http2]>=0.14.0 in /usr/local/lib/python3.10/dist-packages (from qdrant_client~=1.1.1) (0.27.0) +Requirement already satisfied: portalocker<3.0.0,>=2.7.0 in /usr/local/lib/python3.10/dist-packages (from qdrant_client~=1.1.1) (2.10.0) +Collecting pydantic<2.0,>=1.8 (from qdrant_client~=1.1.1) + Downloading pydantic-1.10.17-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB) + 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 18.1 MB/s eta 0:00:00 +[?25hRequirement already satisfied: typing-extensions<5.0.0,>=4.0.0 in /usr/local/lib/python3.10/dist-packages (from qdrant_client~=1.1.1) (4.12.2) +Requirement already satisfied: urllib3<2.0.0,>=1.26.14 in /usr/local/lib/python3.10/dist-packages (from qdrant_client~=1.1.1) (1.26.19) +Requirement already satisfied: PyStemmer<3.0.0,>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from fastembed) (2.2.0.1) +Requirement already satisfied: loguru<0.8.0,>=0.7.2 in /usr/local/lib/python3.10/dist-packages (from fastembed) (0.7.2) +Requirement already satisfied: mmh3<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from fastembed) (4.1.0) +Requirement already satisfied: onnx<2.0.0,>=1.15.0 in /usr/local/lib/python3.10/dist-packages (from fastembed) (1.16.1) +Requirement already satisfied: onnxruntime<2.0.0,>=1.17.0 in /usr/local/lib/python3.10/dist-packages (from fastembed) (1.18.1) +Requirement already satisfied: requests<3.0,>=2.31 in /usr/local/lib/python3.10/dist-packages (from fastembed) (2.31.0) +Requirement already satisfied: snowballstemmer<3.0.0,>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from fastembed) (2.2.0) +Requirement already satisfied: tokenizers<1.0,>=0.15 in /usr/local/lib/python3.10/dist-packages (from fastembed) (0.19.1) +Requirement already satisfied: protobuf<6.0dev,>=5.26.1 in /usr/local/lib/python3.10/dist-packages (from grpcio-tools>=1.41.0->qdrant_client~=1.1.1) (5.27.2) +Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from grpcio-tools>=1.41.0->qdrant_client~=1.1.1) (67.7.2) +Requirement already satisfied: anyio in /usr/local/lib/python3.10/dist-packages (from httpx[http2]>=0.14.0->qdrant_client~=1.1.1) (3.7.1) +Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from httpx[http2]>=0.14.0->qdrant_client~=1.1.1) (2024.6.2) +Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.10/dist-packages (from httpx[http2]>=0.14.0->qdrant_client~=1.1.1) (1.0.5) +Requirement already satisfied: idna in /usr/local/lib/python3.10/dist-packages (from httpx[http2]>=0.14.0->qdrant_client~=1.1.1) (3.7) +Requirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from httpx[http2]>=0.14.0->qdrant_client~=1.1.1) (1.3.1) +Requirement already satisfied: h2<5,>=3 in /usr/local/lib/python3.10/dist-packages (from httpx[http2]>=0.14.0->qdrant_client~=1.1.1) (4.1.0) +Requirement already satisfied: h11<0.15,>=0.13 in /usr/local/lib/python3.10/dist-packages (from httpcore==1.*->httpx[http2]>=0.14.0->qdrant_client~=1.1.1) (0.14.0) +Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.15.1->sentence_transformers) (3.15.4) +Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.15.1->sentence_transformers) (2023.6.0) +Requirement already satisfied: packaging>=20.9 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.15.1->sentence_transformers) (24.1) +Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.15.1->sentence_transformers) (6.0.1) +Requirement already satisfied: coloredlogs in /usr/local/lib/python3.10/dist-packages (from onnxruntime<2.0.0,>=1.17.0->fastembed) (15.0.1) +Requirement already satisfied: flatbuffers in /usr/local/lib/python3.10/dist-packages (from onnxruntime<2.0.0,>=1.17.0->fastembed) (24.3.25) 
+Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from onnxruntime<2.0.0,>=1.17.0->fastembed) (1.12.1) +Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas) (1.16.0) +Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3.0,>=2.31->fastembed) (3.3.2) +Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence_transformers) (3.3) +Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence_transformers) (3.1.4) +Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence_transformers) (12.1.105) +Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence_transformers) (12.1.105) +Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence_transformers) (12.1.105) +Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence_transformers) (8.9.2.26) +Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence_transformers) (12.1.3.1) +Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence_transformers) (11.0.2.54) +Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence_transformers) (10.3.2.106) +Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence_transformers) (11.4.5.107) +Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence_transformers) (12.1.0.106) +Requirement already satisfied: nvidia-nccl-cu12==2.20.5 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence_transformers) (2.20.5) +Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence_transformers) (12.1.105) +Requirement already satisfied: triton==2.3.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence_transformers) (2.3.0) +Requirement already satisfied: nvidia-nvjitlink-cu12 in /usr/local/lib/python3.10/dist-packages (from nvidia-cusolver-cu12==11.4.5.107->torch>=1.11.0->sentence_transformers) (12.5.82) +Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.34.0->sentence_transformers) (2024.5.15) +Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.34.0->sentence_transformers) (0.4.3) +Requirement already satisfied: ipython-genutils in /usr/local/lib/python3.10/dist-packages (from ipykernel->jupyter) (0.2.0) +Requirement already satisfied: ipython>=5.0.0 in /usr/local/lib/python3.10/dist-packages (from ipykernel->jupyter) (7.34.0) +Requirement already satisfied: traitlets>=4.1.0 in /usr/local/lib/python3.10/dist-packages (from ipykernel->jupyter) (5.7.1) +Requirement already satisfied: jupyter-client in 
/usr/local/lib/python3.10/dist-packages (from ipykernel->jupyter) (6.1.12)
Installing collected packages: pydantic, qdrant_client
  Attempting uninstall: pydantic
    Found existing installation: pydantic 2.8.0
    Uninstalling pydantic-2.8.0:
      Successfully uninstalled pydantic-2.8.0
  Attempting uninstall: qdrant_client
    Found existing installation: qdrant-client 1.3.2
    Uninstalling qdrant-client-1.3.2:
      Successfully uninstalled qdrant-client-1.3.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-cloud-aiplatform 1.57.0 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 5.27.2 which is incompatible.
Successfully installed pydantic-1.10.17 qdrant_client-1.1.7
```

The dataset is provided as a CSV file and contains multiple attributes of the products, including URLs of the product images. That gives us a realistic case to work on. Let's check the dataset structure and prepare it for further processing.

**For testing purposes, we can use only a small subset of the dataset. It is enough to show the concept, and you can easily scale it up to the whole dataset. The variable below is the fraction of the dataset that will be used for the rest of the notebook. Feel free to change it to `1.0` if you want to use the whole dataset.**

```python
DATASET_FRACTION = 0.1
```

Now, we can load the dataset and see what it contains.

```python
import pandas as pd
import zipfile

with zipfile.ZipFile("./data/amazon-product-dataset-2020.zip", "r") as z:
    with z.open(
        "home/sdf/marketing_sample_for_amazon_com-ecommerce__20200101_20200131__10k_data.csv"
    ) as f:
        dataset_df = pd.read_csv(f).sample(frac=DATASET_FRACTION)

dataset_df.sample(n=5).T
```
The transposed sample shows one column per product and one row per attribute: `Uniq Id`, `Product Name`, `Brand Name`, `Asin`, `Category`, `Upc Ean Code`, `List Price`, `Selling Price`, `Quantity`, `Model Number`, `About Product`, `Product Specification`, `Technical Details`, `Shipping Weight`, `Product Dimensions`, `Image`, `Variants`, `Sku`, `Product Url`, `Stock`, `Product Details`, `Dimensions`, `Color`, `Ingredients`, `Direction To Use`, `Is Amazon Seller`, `Size Quantity Variant`, and `Product Description`. Many of the attributes are empty (`NaN`), but the `Image` column contains the image URLs we are interested in.
+ +```python +dataset_df.shape +``` + +``` +(1000, 28) +``` + +```python +dataset_df.iloc[0]["Image"] +``` + +``` +'https://images-na.ssl-images-amazon.com/images/I/41SiG59Kt%2BL.jpg|https://images-na.ssl-images-amazon.com/images/I/51gvvBRjG5L.jpg|https://images-na.ssl-images-amazon.com/images/I/51WofS2WwHL.jpg|https://images-na.ssl-images-amazon.com/images/I/51Dk-9h2XEL.jpg|https://images-na.ssl-images-amazon.com/images/I/31tjBvZmVEL.jpg|https://images-na.ssl-images-amazon.com/images/I/31mH8F6HYyL.jpg|https://images-na.ssl-images-amazon.com/images/I/31DVHiEty%2BL.jpg|https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/transparent-pixel.jpg' +``` + +It turns out a single product may have several images. They are stored in a pipe-separated string. + +```python +dataset_df.iloc[0]["Image"].split("|") +``` + +``` +['https://images-na.ssl-images-amazon.com/images/I/41SiG59Kt%2BL.jpg', + 'https://images-na.ssl-images-amazon.com/images/I/51gvvBRjG5L.jpg', + 'https://images-na.ssl-images-amazon.com/images/I/51WofS2WwHL.jpg', + 'https://images-na.ssl-images-amazon.com/images/I/51Dk-9h2XEL.jpg', + 'https://images-na.ssl-images-amazon.com/images/I/31tjBvZmVEL.jpg', + 'https://images-na.ssl-images-amazon.com/images/I/31mH8F6HYyL.jpg', + 'https://images-na.ssl-images-amazon.com/images/I/31DVHiEty%2BL.jpg', + 'https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/transparent-pixel.jpg'] +``` + +The last entry is common for multiple products, so we can simply remove it. + +```python +dataset_df["Image"] = dataset_df["Image"].map(lambda x: x.split("|")[:-1]) +``` + +
+ +```python +dataset_df.iloc[0]["Image"] +``` + +``` +['https://images-na.ssl-images-amazon.com/images/I/41SiG59Kt%2BL.jpg', + 'https://images-na.ssl-images-amazon.com/images/I/51gvvBRjG5L.jpg', + 'https://images-na.ssl-images-amazon.com/images/I/51WofS2WwHL.jpg', + 'https://images-na.ssl-images-amazon.com/images/I/51Dk-9h2XEL.jpg', + 'https://images-na.ssl-images-amazon.com/images/I/31tjBvZmVEL.jpg', + 'https://images-na.ssl-images-amazon.com/images/I/31mH8F6HYyL.jpg', + 'https://images-na.ssl-images-amazon.com/images/I/31DVHiEty%2BL.jpg'] +``` + +```python +dataset_df = dataset_df.explode("Image").dropna(subset=["Image"]) +dataset_df.sample(n=5).T +``` + +
The transposed sample shows the same attributes as before, but after exploding the `Image` column each row of the dataframe now corresponds to a single image URL.
## Downloading the images

We want to create the embeddings out of the images, but first we need to download them locally.

```python
from typing import Optional

import urllib.error
import urllib.request
import os


def download_file(url: str) -> Optional[str]:
    basename = os.path.basename(url)
    target_path = f"./data/images/{basename}"
    if not os.path.exists(target_path):
        try:
            urllib.request.urlretrieve(url, target_path)
        except urllib.error.HTTPError:
            return None
    return target_path
```
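One assumption in `download_file` is that the `./data/images/` directory already exists. If you are starting from a clean working directory, create it first:

```python
import os

# download_file() stores images in ./data/images/, so make sure the directory exists
os.makedirs("./data/images", exist_ok=True)
```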
+ +```python +import numpy as np + +# Our download_file function returns None in case of any HTTP issues. +# We can use that property to filter out the problematic images. +dataset_df["LocalImage"] = ( + dataset_df["Image"].map(download_file).replace({None: np.nan}) +) +dataset_df = dataset_df.dropna(subset=["LocalImage"]) +dataset_df.sample(n=5).T +``` + +
The transposed sample now also includes the `LocalImage` column with the local path of the downloaded file for each row, for example `./data/images/41vJ5amvKeL.jpg`.
## Creating the embeddings

There are various options for creating the embeddings out of our images. But do not even think about training your neural encoder from scratch! Plenty of pre-trained models are available, and some may already give you decent results in your domain. And if not, you can use them as a base for fine-tuning, which can be done way faster than full training.

### Available options

Using pretrained models is easy if you choose a library that exposes them with a convenient interface. Some of the possibilities are:

- [torchvision](https://pytorch.org/vision/stable/index.html) - part of PyTorch
- [embetter](https://koaning.github.io/embetter/) - a great choice if you prefer a pandas-like API
- [Sentence-Transformers](https://www.sbert.net/examples/applications/image-search/README.html) - one of the standard NLP libraries, which exposes the OpenAI CLIP model as well
- [FastEmbed](https://github.com/qdrant/fastembed) - a lightweight, CPU-first library by Qdrant that generates vector embeddings using the [ONNX runtime](https://onnxruntime.ai/)

### Choosing the right model

If you run an e-commerce business, you probably already have a standard full-text search mechanism. Reverse image search is one option to enrich the user experience, but if you also want to experiment with [hybrid search](https://qdrant.tech/articles/hybrid-search/), you should keep that in mind from the beginning. If that's your scenario, it's better to consider multimodality from day one. Such a model can encode texts and images in the same vector space.

For that reason, we are going to use the OpenAI CLIP model, so that in the future we can extend our search mechanism with semantic text search using the same component.

```python
from fastembed import ImageEmbedding

model = ImageEmbedding(model_name="Qdrant/clip-ViT-B-32-vision")
```

```
Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]
```

```python
from typing import List, Optional


def calculate_embedding(image_path: str) -> Optional[List[float]]:
    try:
        return next(model.embed([image_path])).tolist()
    except Exception:
        # Return None for files the model cannot read, e.g. unsupported formats
        return None
```
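Before mapping this over the whole dataframe, a quick sanity check on a single image (not part of the pipeline itself, just a convenience) confirms that the vectors have 512 dimensions - the output size of the CLIP ViT-B/32 vision encoder and the size we will use for the Qdrant collection later:

```python
# Embed one of the downloaded images and inspect the vector size (512 for CLIP ViT-B/32)
sample_path = dataset_df.iloc[0]["LocalImage"]
sample_vector = calculate_embedding(sample_path)
if sample_vector is not None:
    print(sample_path, len(sample_vector))
```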
+ +```python +# Again, our helper function returns None in case of any error, such as +# unsupported image format. We need to remove those entries. +dataset_df["Embedding"] = dataset_df["LocalImage"].map(calculate_embedding) +dataset_df["Embedding"] = dataset_df["Embedding"].replace({None: np.nan}) +dataset_df = dataset_df.dropna(subset=["Embedding"]) +dataset_df.sample(n=5).T +``` + +
The transposed sample now also includes the `Embedding` column, holding the vector calculated for each downloaded image.
+ +```python +dataset_df.to_parquet("./data/amazon-with-embeddings.parquet") +``` + +## Indexing embeddings in Qdrant + +Reverse image search compares the embeddings of the image used as a query and the embeddings of the indexed pictures. That can be theoretically done in a naive way by comparing the query to every single item from our store, but that won't scale if we even go beyond a few hundred. That's what the vector search engines are designed for. Qdrant acts as a fast retrieval layer that performs an efficient search for the closest vectors in the space. + +There are various ways to start using Qdrant, and even though the local mode in Python SDK is possible, it should be running as a service in production. The easiest way is to use a Docker container, which we'll do. + +```python +!docker run -d -p "6333:6333" -p "6334:6334" --name "reverse_image_search" qdrant/qdrant:v1.10.1 +``` + +``` +/bin/bash: line 1: docker: command not found +``` + +```python +from qdrant_client import QdrantClient +from qdrant_client.http import models as rest + +try: + client = QdrantClient("localhost") + collections = client.get_collections() +except Exception: + # Docker is unavailable in Google Colab so we switch to local + # mode available in Python SDK + client = QdrantClient(":memory:") + collections = client.get_collections() + +collections +``` + +``` +CollectionsResponse(collections=[]) +``` + +```python +client.recreate_collection( + collection_name="amazon", + vectors_config=rest.VectorParams( + size=512, + distance=rest.Distance.COSINE, + ), +) +``` + +``` +True +``` + +It's a good practice to use batching while inserting the vectors into the collection. Python SDK has a utility method that performs it automatically. For the purposes of our demo, we're going to store vectors with the product id, name, and description as a payload. + +```python +payloads = ( + dataset_df[["Uniq Id", "Product Name", "About Product", "Image", "LocalImage"]] + .fillna("Unknown") + .rename( + columns={ + "Uniq Id": "ID", + "Product Name": "Name", + "About Product": "Description", + "LocalImage": "Path", + } + ) + .to_dict("records") +) +payloads[0] +``` + +``` +{'ID': 'ec13f3287da541a31f72f5a047ec2e36', + 'Name': 'Btswim NFL Pool Noodles (Pack of 3)', + 'Description': 'Make sure this fits by entering your model number. | Includes: 3 high quality pool noodles measuring 57" x 3" | Removable and re-washable spandex-like cover | 3 unique styles per team', + 'Image': 'https://images-na.ssl-images-amazon.com/images/I/41SiG59Kt%2BL.jpg', + 'Path': './data/images/41SiG59Kt%2BL.jpg'} +``` + +```python +import uuid + +client.upload_collection( + collection_name="amazon", + vectors=list(map(list, dataset_df["Embedding"].tolist())), + payload=payloads, + ids=[uuid.uuid4().hex for _ in payloads], +) +``` + +
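The Python client batches this upload automatically. If you need to tune that behaviour, `upload_collection` also accepts `batch_size` and `parallel` arguments in recent client versions; here is a sketch of the same call with explicit values (an alternative to the cell above, not an additional step, since running both would insert the points twice under different ids):

```python
client.upload_collection(
    collection_name="amazon",
    vectors=list(map(list, dataset_df["Embedding"].tolist())),
    payload=payloads,
    ids=[uuid.uuid4().hex for _ in payloads],
    batch_size=64,  # number of points sent per request
    parallel=2,     # number of parallel upload workers
)
```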
```python
client.count("amazon")
```

```
CountResult(count=3359)
```

## Running the reverse image search

As soon as the data is indexed in Qdrant, it can already act as our reverse image search mechanism. Queries no longer have to be textual; we can use images directly to find similar items. For that, we need a query image, and it will rarely be an image from the dataset itself. Let's use some different examples, for instance from [Unsplash](https://unsplash.com), a source of freely usable images.

```python
from io import BytesIO
from PIL import Image

import base64


def pillow_image_to_base64(image: Image) -> str:
    """
    Convert a Pillow image to a base64-encoded string that can be used as an image
    source in HTML.
    :param image: Pillow image to encode.
    :return: JPEG data URI with the base64-encoded image.
    """
    buffered = BytesIO()
    image.save(buffered, format="JPEG")
    img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
    return f"data:image/jpeg;base64,{img_str}"
```
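The next cell expects a `./queries/` directory containing a few `.jpg` files to use as query images. A minimal way to prepare it (the URL below is only a placeholder - drop in any photos you like, for example ones downloaded from Unsplash):

```python
import os
import urllib.request

os.makedirs("./queries", exist_ok=True)

# Placeholder URL - replace it with any image you want to use as a query
urllib.request.urlretrieve(
    "https://example.com/some-photo.jpg", "./queries/some-photo.jpg"
)
```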
```python
from IPython.display import display, HTML

import glob

query_image_paths = list(glob.glob("./queries/*.jpg"))

# Render each query image as an inline <img> tag with a base64-encoded source
images_html = "".join(
    f'<img src="{pillow_image_to_base64(Image.open(path))}"/>'
    for path in query_image_paths
)
display(HTML(f"<div>{images_html}</div>
")) +``` + +
```python
for query_image_path in query_image_paths:
    query_embedding = next(model.embed(query_image_path)).tolist()

    results = client.query_points(
        collection_name="amazon",
        query=query_embedding,
        with_payload=True,
        limit=5,
    ).points

    output_images = [
        pillow_image_to_base64(Image.open(query_image_path)),
    ]
    for result in results:
        output_images.append(result.payload["Image"])

    # Show the query image first, followed by its five closest matches
    images_html = "".join(
        f'<img src="{path}"/>' for path in output_images
    )
    display(HTML(f"<div>{images_html}</div>
")) +``` + +We've implemented a reverse image search mechanism for e-commerce within a single notebook. We can kill the running Docker container for now, so nothing is left dangling in our environment. + +```python +!docker kill reverse_image_search +!docker rm reverse_image_search +``` + +``` +reverse_image_search +reverse_image_search +``` + +## Futher steps + +The notebook shows the general pipeline of encoding the inventory and using Qdrant to perform the reverse image search. There are, however, some challenges you may encounter while trying to implement it in the real world: + +1. Pretrained models are great to start with but may struggle for some specific kinds of inventory if not trained on similar examples. You can always fine-tune them with small amounts of data to avoid a full training cycle. +1. Models should not be hosted within Jupyter notebooks, but there are some ways to serve them efficiently. We're going to describe the possibilities in a separate tutorial. +1. If you don't want to worry about maintaining another system in your stack, please consider using [Qdrant Cloud](https://cloud.qdrant.io/), our managed solution. Our tier is free forever and available to everyone - no credit card is required. + +```python + +``` diff --git a/qdrant-landing/content/documentation/101-foundations/from-pinecone-to-qdrant.md b/qdrant-landing/content/documentation/101-foundations/from-pinecone-to-qdrant.md new file mode 100644 index 000000000..7228c76a5 --- /dev/null +++ b/qdrant-landing/content/documentation/101-foundations/from-pinecone-to-qdrant.md @@ -0,0 +1,249 @@ +--- +notebook_path: 101-foundations/data-migration/from-pinecone-to-qdrant.ipynb +reading_time_min: 5 +title: Migrating Data From Pinecone to Qdrant +--- + +# Migrating Data From Pinecone to Qdrant + +In this notebook, you will migrate your data into [Qdrant](https://qdrant.to/cloud) from another vector database. +You will use [Vector-io](https://github.com/AI-Northstar-Tech/vector-io), a library that makes it easy to migrate, transform, and manage your data across different vector databases. + +Vector-io uses a standard format called Vector Dataset Format (VDF). This format ensures consistency in the data structure, regardless of the destination database. + +To illustrate, let's consider a Pinecone index that contains several data from a [PubMed dataset](https://huggingface.co/datasets/llamafactory/PubMedQA) generated using the 1536-dimensional OpenAI "text-embedding-3-small" embedding model. + +!["Pinecone"](documentation/101-foundations/from-pinecone-to-qdrant/pinecone.png) + +## Initialize the Environment + +```python +import os +from dotenv import load_dotenv +from datasets import load_dataset + +load_dotenv() +``` + +``` +True +``` + +## Load the Data + +```python +data = load_dataset("llamafactory/PubMedQA", split="train") +data = data.to_pandas() +data.head() +``` + +

|   | instruction | input | output |
|---|-------------|-------|--------|
| 0 | Answer the question based on the following con... | Question: Is naturopathy as effective as conve... | Naturopathy appears to be an effective alterna... |
| 1 | Answer the question based on the following con... | Question: Can randomised trials rely on existi... | Routine data have the potential to support hea... |
| 2 | Answer the question based on the following con... | Question: Is laparoscopic radical prostatectom... | The results of our non-randomized study show t... |
| 3 | Answer the question based on the following con... | Question: Does bacterial gastroenteritis predi... | Symptoms consistent with IBS and functional di... |
| 4 | Answer the question based on the following con... | Question: Is early colonoscopy after admission... | No significant association is apparent between... |
+ +```python +MAX_ROWS = 1000 +OUTPUT = "output" +subset_data = data.head(MAX_ROWS) + +chunks = subset_data[OUTPUT].to_list() +``` + +## Create a Pinecone Index + +```python +from pinecone import Pinecone, ServerlessSpec + +pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY")) +``` + +
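All of the credentials used in this notebook come from environment variables loaded with `load_dotenv()` above. A small, optional check - using only the variable names that actually appear in this notebook and in the CLI commands below - can save some debugging time:

```python
import os

# Variables this notebook expects to find in the local .env file
expected_vars = [
    "PINECONE_API_KEY",  # Pinecone client
    "PINECONE_CLOUD",    # ServerlessSpec cloud, e.g. "aws"
    "PINECONE_REGION",   # ServerlessSpec region
    "OPENAI_API_KEY",    # OpenAI embeddings
    "QDRANT_HOST",       # used by the import_vdf command later on
]

missing = [name for name in expected_vars if not os.getenv(name)]
print("Missing variables:", missing or "none")
```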
+ +```python +# create index +pc.create_index( + name="pubmed", + dimension=1536, + metric="cosine", + spec=ServerlessSpec( + cloud=os.getenv("PINECONE_CLOUD"), region=os.getenv("PINECONE_REGION") + ), +) +``` + +
+ +```python +# set embedding model +import openai + +openai.api_key = os.getenv("OPENAI_API_KEY") + +index = pc.Index("pubmed") + + +def embed(docs: list[str]) -> list[list[float]]: + res = openai.embeddings.create(input=docs, model="text-embedding-3-small") + doc_embeds = [r.embedding for r in res.data] + return doc_embeds +``` + +
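Before upserting anything, it is worth confirming that the helper really returns 1536-dimensional vectors, since that is the dimension the `pubmed` index was created with:

```python
# text-embedding-3-small returns 1536-dimensional vectors,
# matching the dimension of the "pubmed" index created above
sample_embeds = embed(chunks[:2])
print(len(sample_embeds), len(sample_embeds[0]))  # 2 1536
```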
+ +```python +# upsert data to index +from tqdm.auto import tqdm + +batch_size = 100 + +for i in tqdm(range(0, len(chunks), batch_size)): + i_end = min(len(chunks), i + batch_size) + ids = [str(x) for x in range(i, i_end)] + metadatas = [{"text": chunk} for chunk in chunks[i:i_end]] + embeds = embed(chunk for chunk in chunks[i:i_end]) + records = list(zip(ids, embeds, metadatas)) + index.upsert(vectors=records) +``` + +``` + 0%| | 0/10 [00:00\n child 0, element: double\ntext: string\n-- schema metadata --\npandas: '{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": 0, \"' + 592" + } + ] + }, + "exported_at": "2024-05-10T00:13:31.858158-03:00", + "id_column": null +} +Export to disk completed. Exported to: vdf_20240509_145419_88ae5/ +Time taken to export data: 00:00:06 +``` + +### Import Data to Qdrant + +```shell +$ import_vdf qdrant -u $QDRANT_HOST + +Enter the directory of vector dataset to be imported: vdf_20240509_145419_88ae5 +ImportVDB initialized successfully. +Importing data for index 'pubmed' +/Users/infoslack/Projects/vector-migration/vdf_20240509_145419_88ae5/pubmed/i1.parquet/1.parquet read successfully. len(df)=1000 rows +Extracting vectors: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1000/1000 [00:00<00:00, 6349.32it/s] +Metadata was parsed to JSON +Uploading points in batches of 64 in 5 threads: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1000/1000 [00:03<00:00, 280.44it/s] +Iterating parquet files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:04<00:00, 4.14s/it] +Index 'pubmed' has 1000 vectors after import +1000 vectors were imported +Importing namespaces: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:05<00:00, 5.55s/it] +Importing indexes: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:05<00:00, 5.55s/it] +Data import completed successfully. +Time taken: 5.62 seconds +``` + +### Verify Data Migration + +!["Qdrant Cloud"](documentation/101-foundations/from-pinecone-to-qdrant/qdrant.png) + +```python + +``` diff --git a/qdrant-landing/content/documentation/101-foundations/qdrant_and_text_data.md b/qdrant-landing/content/documentation/101-foundations/qdrant_and_text_data.md new file mode 100644 index 000000000..94333e509 --- /dev/null +++ b/qdrant-landing/content/documentation/101-foundations/qdrant_and_text_data.md @@ -0,0 +1,955 @@ +--- +notebook_path: 101-foundations/qdrant_101_text_data/qdrant_and_text_data.ipynb +reading_time_min: 59 +title: Qdrant & Text Data +--- + +# Qdrant & Text Data + +![qdrant](documentation/101-foundations/qdrant_and_text_data/crab_nlp.png) + +This tutorial will show you how to use Qdrant to develop a semantic search service. 
At its core, this service will harness Natural Language Processing (NLP) methods and use Qdrant's API to store, search, and manage vectors with an additional payload. + +## Table of Contents + +1. Learning Outcomes +1. Overview +1. Prerequisites +1. Basic concepts + - Initial setup + - Examine the dataset + - Tokenize and embed data +1. Semantic Search with Qdrant +1. Recommendations API +1. Conclusion +1. Resources + +## 1. Learning outcomes + +By the end of this tutorial, you will be able to + +- Generate embeddings from text data. +- Create collections of vectors using Qdrant. +- Conduct semantic search over a corpus of documents using Qdrant. +- Provide recommendations with Qdrant. + +## 2. Overview + +Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on the interaction between computers and human language. It involves teaching computers to understand, interpret, and generate human language in a way that is both meaningful and useful. NLP techniques can help us with tasks such as text classification, named entity recognition, sentiment analysis, and language generation. + +Vector databases specialize in storing and querying high-dimensional vectors. In the context of NLP, vectors are numerical representations of words, sentences, or documents that capture their semantic meaning. These vector representations, often referred to as word embeddings or document embeddings, transform textual data into a numerical format that machines can easily process and analyze. + +Vector databases serve as efficient storage systems for these vector representations, allowing for fast and accurate similarity searches. They enable users to find similar words, sentences, or documents based on their semantic meaning rather than relying solely on exact matches or keywords. By organizing vectors in a way that facilitates quick retrieval and comparison, Vector databases are instrumental in powering various NLP applications, including information retrieval, recommendation systems, semantic search, and content clustering. + +NLP and vector databases are connected through vector representations in NLP tasks. Vector representations enable NLP algorithms to understand the contextual relationships and semantic meaning of textual data. By leveraging Vector Databases, NLP systems can efficiently store and retrieve these vector representations, making it easier to process and analyze large volumes of textual data. + +Throughout this tutorial, we will delve deeper into the fundamentals of NLP and Vector Databases. You will learn how to create embeddings for a sample dataset of newspaper articles via transformers. After that, you will use Qdrant to store, search and recommend best matches for a chosen newspaper article. + +## 3. Prerequisites + +To get started, use the latest Qdrant Docker image: `docker pull qdrant/qdrant`. + +Next, initialize Qdrant with: + +```sh +docker run -p 6333:6333 \ + -v $(pwd)/qdrant_storage:/qdrant/storage \ + qdrant/qdrant +``` + +```python +# install packages +%pip install qdrant-client transformers datasets torch fastembed matplotlib +``` + +## 4. Basic concepts + +You might have heard of models like [GPT-4](https://openai.com/product/gpt-4), [Codex](https://openai.com/blog/openai-codex), and [PaLM-2](https://ai.google/discover/palm2) which are powering incredible tools such as [ChatGPT](https://openai.com/blog/chatgpt), [GitHub Copilot](https://github.com/features/copilot), and [Bard](https://bard.google.com/?hl=en), respectively. 
These three models are part of a family of deep learning architectures called [transformers](https://arxiv.org/abs/1706.03762). Transformers are known for their ability to learn long-range dependencies between words in a sentence. This ability to learn from text makes them well-suited for tasks such as machine translation, text summarization, and question answering. The transformers architecture has been incredibly influential in the field of machine learning, and one of the tools at the heart of this is the [`transformers`](https://huggingface.co/docs/transformers/index) library. + +Transformer models work by using a technique called attention, which allows them to focus on different parts of a sentence when making predictions. For example, if you are trying to translate a sentence from English to Spanish, the transformer model will use attention to focus on the words in the English sentence that are most important for the translation into Spanish. + +One analogy that can be used to explain transformer models is to think of them as a group of people who are trying to solve a puzzle. Each person in the group is given a different piece of the puzzle, and they need to work together to figure out how the pieces fit together. The transformer model is like the group of people, and the attention mechanism is like the way that the people in the group communicate with each other. + +Transformers are essential to this tutorial. Your initial task is to 1) load sample data, 2) transform it and 3) create embeddings. You will then store the embeddings 4) in the Qdrant vector database and 5) retrieve the data using a recommendation system. + +### 4.1 Initial setup + +In this tutorial, you will create a newspaper article recommendation system. When the user chooses an article, the system will suggest other articles that are similar. + +You will use the **AG News** [sample data set from HuggingFace](https://huggingface.co/datasets/ag_news): + +> "AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July, 2004. The dataset is provided by the academic comunity for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), xml, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link " + +### 4.2 Load sample dataset + +Use HuggingFace's [`datasets`](https://huggingface.co/docs/datasets/index) library to download the dataset and load it into your session. This library is quick, efficient and will allow you to manipulate unstructured data in other ways. + +The `load_dataset` function directly downloads the dataset from the [HuggingFace Data Hub](https://huggingface.co/datasets) to your local machine. + +```python +from datasets import load_dataset +``` + +Indicate that you want to **split** the dataset into a `train` set only. This avoids creating partitions. + +```python +dataset = load_dataset("ag_news", split="train") +dataset +``` + +``` +/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: +The secret `HF_TOKEN` does not exist in your Colab secrets. +To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session. 
+You will be able to reuse this secret in all of your notebooks. +Please note that authentication is recommended but still optional to access public models or datasets. + warnings.warn( + + + + + +Dataset({ + features: ['text', 'label'], + num_rows: 120000 +}) +``` + +### 4.3 Examine the dataset + +Now that we have loaded the dataset, let's look at some sample articles: + +```python +from random import choice + +for i in range(5): + random_sample = choice(range(len(dataset))) + print(f"Sample {i+1}") + print("=" * 70) + print(dataset[random_sample]["text"]) + print() +``` + +``` +Sample 1 +====================================================================== +Blast Kills Man at Chechen Polling Station OISKHARA, Russia - Against a backdrop of war and squalor, Chechens voted Sunday for a replacement for their assassinated president in a vote the Kremlin hopes will bring some stability to the violence-torn region. A man was killed when he attempted to blow up a polling station... + +Sample 2 +====================================================================== +Robin makes rare Christmas visit A rare species of robin settles in for its first winter in Scotland after being spotted on a reserve in Aberdeenshire. + +Sample 3 +====================================================================== +Dravid, Gambhir blast Bangladeshi bowlers Gautam Gambhir scored a half century in his second consecutive innings after India suffered an early setback on the first day of the second and final cricket Test against Bangladesh at Chittagong today. + +Sample 4 +====================================================================== +Tennis: Flying Finn Nieminen knocks off Nalbandian for third semi <b>...</b> BEIJING : Flying Finn Jarkko Nieminen soared into his third semi-final of the season, stunning third seed David Nalbandian 6-2, 2-6, 6-2 at the 500,000-dollar China Open. + +Sample 5 +====================================================================== +Stocks Seen Flat as Microsoft Weighs NEW YORK (Reuters) - U.S. stock futures fell slightly on Friday, indicating stocks would open little changed, as Wall Street weighed climbing oil prices and a disappointing revenue outlook from software maker Microsoft Corp. <A HREF="http://www.investor.reuters.com/FullQuote.aspx?ticker=MSFT.O target=/stocks/quickinfo/fullquote">MSFT.O</A>. +``` + +You can switch to a pandas dataframe by using the method `.to_pandas()`. This can come in handy when you want to manipulate and plot the data. + +Here you will extract the class names of news articles and plot the frequency with which they appear: + +```python +id2label = {str(i): label for i, label in enumerate(dataset.features["label"].names)} +``` + +
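`id2label` maps each integer class id (as a string) to one of the four AG News categories. Printing it is a quick way to see what the labels stand for:

```python
print(id2label)
# {'0': 'World', '1': 'Sports', '2': 'Business', '3': 'Sci/Tech'}
```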
+ +```python +( + dataset.select_columns("label") + .to_pandas() + .astype(str)["label"] + .map(id2label) + .value_counts() + .plot(kind="barh", title="Frequency with which each label appears") +); +``` + +![png](documentation/101-foundations/qdrant_and_text_data/output_26_0.png) + +As you can see, the dataset is well balanced. + +What if you want to know the average length of text per each class label? + +Write a function for this and map to all of the elements in the dataset. +`'length_of_text'` will be the new column in the dataset. + +```python +def get_length_of_text(example): + example["length_of_text"] = len(example["text"]) + return example + + +dataset = dataset.map(get_length_of_text) +dataset[:10]["length_of_text"] +``` + +``` +Map: 0%| | 0/120000 [00:00 + +```python +device = torch.device("cuda" if torch.cuda.is_available() else "cpu") +tokenizer = AutoTokenizer.from_pretrained("gpt2") +model = AutoModel.from_pretrained("gpt2").to(device) +``` + +First, you will need to set a padding token for GPT-2. In natural language processing (NLP), padding refers to adding extra tokens to make all input sequences the same length. When processing text data, it's common for sentences or documents to have different lengths. However, many machine learning models require fixed-size inputs. Padding solves this issue by adding special tokens (such as zeros) to the shorter sequences, making them equal in length to the longest sequence in the dataset. For example, say you have a set of sentences and you want to process them using a model that requires fixed-length input, you may pad the sequences to match the length of the longest sentence, let's say five tokens. The padded sentences would look like this: + +1. "I love cats" -> "I love cats [PAD] [PAD]" +1. "Dogs are friendly" -> "Dogs are friendly [PAD]" +1. "Birds can fly" -> "Birds can fly [PAD] [PAD]" + +By padding the sequences, you ensure that all inputs have the same size, allowing the model to process them uniformly. Because GPT-2 does not have a padding token, we will use the "end of text" token instead. + +```python +tokenizer.eos_token +``` + +``` +'<|endoftext|>' +``` + +```python +tokenizer.pad_token +``` + +
+ +```python +tokenizer.pad_token = tokenizer.eos_token +``` + +Let's go through a quick example: + +```python +text = "What does a cow use to do math? A cow-culator." +inputs = tokenizer( + text, padding=True, truncation=True, max_length=128, return_tensors="pt" +) # .to(device) +inputs +``` + +``` +{'input_ids': tensor([[ 2061, 857, 257, 9875, 779, 284, 466, 10688, 30, 317, + 9875, 12, 3129, 1352, 13]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])} +``` + +The tokenizer will return an input IDs and an attention mask for every word in the sentence. These IDs are represented internally in the vocabulary of the model. To view your tokens, do the following: + +```python +toks = tokenizer.convert_ids_to_tokens(inputs.input_ids[0]) +toks +``` + +``` +['What', + 'Δ does', + 'Δ a', + 'Δ cow', + 'Δ use', + 'Δ to', + 'Δ do', + 'Δ math', + '?', + 'Δ A', + 'Δ cow', + '-', + 'cul', + 'ator', + '.'] +``` + +You can always go back to a sentence as well. + +```python +tokenizer.convert_tokens_to_string(toks) +``` + +``` +'What does a cow use to do math? A cow-culator.' +``` + +If you are curious about how large is the vocabulary in your model, you can always access it with the method `.vocab_size`. + +```python +tokenizer.vocab_size +``` + +``` +50257 +``` + +Now pass the inputs from the tokenizer to your model and check out the response: + +```python +with torch.no_grad(): + embs = model(**inputs) + +embs.last_hidden_state.size(), embs[0] +``` + +``` +(torch.Size([1, 15, 768]), + tensor([[[-0.1643, 0.0957, -0.2844, ..., -0.1632, -0.0774, -0.2154], + [ 0.0472, 0.2181, 0.0754, ..., 0.0281, 0.2386, -0.0731], + [-0.1410, 0.1957, 0.5674, ..., -0.4050, 0.1199, -0.0043], + ..., + [ 0.0686, 0.2000, 0.2881, ..., 0.2151, -0.5111, -0.2907], + [-0.0662, 0.3934, -0.8001, ..., 0.2597, -0.1465, -0.1695], + [-0.1900, -0.2704, -0.3135, ..., 0.3318, -0.4120, -0.0153]]])) +``` + +Notice that you got a tensor of shape `[batch_size, inputs, dimensions]`. The inputs are your tokens and these dimensions are the embedding representation that you want for your sentence rather than each token. So what can you do to get one rather than 15? The answer is **mean pooling**. + +You are going to take the average of all 15 vectors while paying attention to the most important parts of it. The details of how this is happening are outside of the scope of this tutorial, but please refer to the Natural Language Processing with Transformers book mentioned earlier for a richer discussion on the concepts touched on in this section (including the borrowed functions we are about to use). + +```python +def mean_pooling(model_output, attention_mask): + token_embeddings = model_output[0] + input_mask_expanded = ( + attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float() + ) + sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1) + sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9) + return sum_embeddings / sum_mask +``` + +
+ +```python +embedding = mean_pooling(embs, inputs["attention_mask"]) +embedding.shape, embedding[0, :10] +``` + +``` +(torch.Size([1, 768]), + tensor([-0.2175, -0.0280, -0.4393, -0.0739, -0.1338, 0.3550, 3.4335, 0.1762, + -0.1412, 0.1184])) +``` + +Now you have everything you need to extract the embedding layers from our corpus of news. The last piece of the puzzle is to create a function that we can map to every news article and extract the embedding layers with. Use your tokenizer and model from earlier and apply it to a smaller subset of the data (since the dataset is quite large). + +```python +def embed_text(examples): + inputs = tokenizer( + examples["text"], padding=True, truncation=True, return_tensors="pt" + ) # .to(device) + with torch.no_grad(): + model_output = model(**inputs) + pooled_embeds = mean_pooling(model_output, inputs["attention_mask"]) + return {"embedding": pooled_embeds.cpu().numpy()} +``` + +
+ +```python +small_set = ( + dataset.shuffle(42) # randomly shuffles the data, 42 is the seed + .select(range(100)) # we'll take 100 rows + .map( + embed_text, batched=True, batch_size=128 + ) # and apply our function above to 128 articles at a time +) +``` + +``` +Map: 0%| | 0/100 [00:00 + +```python +client = QdrantClient(location=":memory:") +client +``` + +``` + +``` + +```python +my_collection = "news_embeddings" +client.recreate_collection( + collection_name=my_collection, + vectors_config=models.VectorParams(size=dim_size, distance=models.Distance.COSINE), +) +``` + +``` +:2: DeprecationWarning: `recreate_collection` method is deprecated and will be removed in the future. Use `collection_exists` to check collection existence and `create_collection` instead. + client.recreate_collection( + + + + + +True +``` + +Before you fill in your new collection, create a payload that contains the news domain the article belongs to, plus the article itself. Note that this payload is a list of JSON objects where the key is the name of the column and the value is the label or text of that same column. + +```python +payloads = ( + small_set.select_columns(["label_names", "text"]) + .to_pandas() + .to_dict(orient="records") +) +payloads[:3] +``` + +``` +[{'label_names': 'World', + 'text': 'Bangladesh paralysed by strikes Opposition activists have brought many towns and cities in Bangladesh to a halt, the day after 18 people died in explosions at a political rally.'}, + {'label_names': 'Sports', + 'text': 'Desiring Stability Redskins coach Joe Gibbs expects few major personnel changes in the offseason and wants to instill a culture of stability in Washington.'}, + {'label_names': 'World', + 'text': 'Will Putin #39;s Power Play Make Russia Safer? Outwardly, Russia has not changed since the barrage of terrorist attacks that culminated in the school massacre in Beslan on Sept.'}] +``` + +```python +client.upsert( + collection_name=my_collection, + points=models.Batch( + ids=small_set["idx"], vectors=small_set["embedding"], payloads=payloads + ), +) +``` + +``` +UpdateResult(operation_id=0, status=) +``` + +Verify that the collection has been created by scrolling through the points with the following command: + +```python +client.scroll( + collection_name=my_collection, + limit=10, + with_payload=False, # change to True to see the payload + with_vectors=False, # change to True to see the vectors +) +``` + +``` +([Record(id=0, payload=None, vector=None, shard_key=None, order_value=None), + Record(id=1, payload=None, vector=None, shard_key=None, order_value=None), + Record(id=2, payload=None, vector=None, shard_key=None, order_value=None), + Record(id=3, payload=None, vector=None, shard_key=None, order_value=None), + Record(id=4, payload=None, vector=None, shard_key=None, order_value=None), + Record(id=5, payload=None, vector=None, shard_key=None, order_value=None), + Record(id=6, payload=None, vector=None, shard_key=None, order_value=None), + Record(id=7, payload=None, vector=None, shard_key=None, order_value=None), + Record(id=8, payload=None, vector=None, shard_key=None, order_value=None), + Record(id=9, payload=None, vector=None, shard_key=None, order_value=None)], + 10) +``` + +Now that we have our collection ready, let's start querying the data and see what we get. + +```python +query1 = small_set[99]["embedding"] +small_set[99]["text"], query1[:7] +``` + +``` +("Busch's ambience set it apart from others ST. 
LOUIS -- Even a cookie-cutter stadium -- which is the pejorative term that came to describe the multipurpose bowls that seemed to spring up simultaneously in Pittsburgh, Cincinnati, Atlanta, Queens, Philadelphia, and St. Louis in the late 1960s and early '70s -- can have their quirks.", + [0.1424039751291275, + -0.051233235746622086, + -0.27102717757225037, + 0.07963515818119049, + 0.1585829257965088, + -0.35505735874176025, + 3.133955240249634]) +``` + +As you can see the text above is talking about stocks so let's have a look at what kinds of articles we can find with Qdrant. + +```python +client.query_points(collection_name=my_collection, query=query1, limit=3).points +``` + +``` +[ScoredPoint(id=99, version=0, score=0.9999999905189133, payload={'label_names': 'Sports', 'text': "Busch's ambience set it apart from others ST. LOUIS -- Even a cookie-cutter stadium -- which is the pejorative term that came to describe the multipurpose bowls that seemed to spring up simultaneously in Pittsburgh, Cincinnati, Atlanta, Queens, Philadelphia, and St. Louis in the late 1960s and early '70s -- can have their quirks."}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=3, version=0, score=0.999327473250493, payload={'label_names': 'Sci/Tech', 'text': 'U2 pitches for Apple New iTunes ads airing during baseball games Tuesday will feature the advertising-shy Irish rockers.'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=68, version=0, score=0.9992285114798962, payload={'label_names': 'Sports', 'text': 'Mistakes hinder Knicks in loss It will be labeled a learning experience, but that won #39;t remove any of the sting. Momentum was lost again Wednesday night as the Knicks blew a lead and lost 94-93 to the Detroit Pistons at Madison Square Garden.'}, vector=None, shard_key=None, order_value=None)] +``` + +Of course, the first article is going to be the same one we used to query the data as there is no distance between the same vector. The other interesting thing we can see here is that even though we have different labels, we still get semantically similar articles with the label `World` as we do with the label `Business`. + +The nice thing about what we have done is that we are getting decent results and we haven't even fine-tuned the model to our use case. To fine-tune a model means to take a pre-trained model that has learned general knowledge from (usually large amounts of) data and adapt it to a specific task or domain. It's like giving a smart assistant some additional training to make them better at a particular job. When we do this, we should expect even better results from our search. + +Let's pick a random sample from the larger dataset and see what we get back from Qdrant. Note that because our function was created to be applied on a dictionary object, we'll represent the random text in the same way. + +```python +# Step 1 - Select Random Sample +query2 = {"text": dataset[choice(range(len(dataset)))]["text"]} +query2 +``` + +``` +{'text': 'Fifteenth-Ranked Utah Rocks Utah St. 
48-6 (AP) AP - Alex Smith threw for one touchdown and ran for another, and Utah converted three Utah State turnovers into touchdowns to rout its rival 48-6 Saturday night.'} +``` + +```python +# Step 2 - Create a Vector +query2 = embed_text(query2)["embedding"][0, :] +query2.shape, query2[:20] +``` + +``` +((768,), + array([ 0.4631669 , 0.21426679, 0.1769519 , 0.06818997, 0.57228196, + -0.23123097, 7.268908 , -0.34892577, 0.18357149, -0.33726475, + 0.26699057, -0.16434783, -0.30012795, -0.03731229, -0.2809622 , + 0.21101162, -0.28782076, -0.07745638, -0.1231352 , -0.9009491 ], + dtype=float32)) +``` + +```python +query2.tolist()[:20] +``` + +``` +[0.46316689252853394, + 0.2142667919397354, + 0.17695190012454987, + 0.06818997114896774, + 0.5722819566726685, + -0.2312309741973877, + 7.2689080238342285, + -0.34892576932907104, + 0.1835714876651764, + -0.337264746427536, + 0.2669905722141266, + -0.1643478274345398, + -0.30012795329093933, + -0.03731228783726692, + -0.280962198972702, + 0.2110116183757782, + -0.2878207564353943, + -0.07745637744665146, + -0.12313520163297653, + -0.9009491205215454] +``` + +```python +# Step 3 - Search for similar articles. Don't forget to convert the vector to a list. +client.query_points( + collection_name=my_collection, query=query2.tolist(), limit=5 +).points +``` + +``` +[ScoredPoint(id=88, version=0, score=0.9992023641037459, payload={'label_names': 'Sports', 'text': 'Pitt Locks Up BCS Bid Tyler Palko tosses a career-high 411 yards and five touchdowns to push No. 19 Pittsburgh over South Florida, 43-14, on Satudray.'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=51, version=0, score=0.9991318771336059, payload={'label_names': 'Sports', 'text': 'Panthers #39; defense leads to another win Carl Krauser scored all but two of his 17 points at the free throw line to lead No. 11 Pittsburgh to a 70-51 victory over Memphis on Tuesday night in the Jimmy V Classic.'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=43, version=0, score=0.9990993751804613, payload={'label_names': 'Sports', 'text': 'Federer, Moya Dodge the Rain and Post Masters Cup Wins Ending after midnight due to rain delays, Roger Federer finally put his second round robin match in the books early Thursday morning at the ATP Tennis Masters Cup in Houston, defeating Lleyton Hewitt for the fifth consecutive time 6-3, 6-4.'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=78, version=0, score=0.9988737613574972, payload={'label_names': 'Sports', 'text': 'One last mile to go before they sweep Pedro Martinez dazzles for seven as Boston takes a seemingly safe three games to none lead. By MARC TOPKIN, Times Staff Writer. ST.'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=84, version=0, score=0.9988350105229953, payload={'label_names': 'Sports', 'text': 'Padres Paste Slumping Mets 4-0 (AP) AP - Brian Lawrence escaped a bases-loaded jam in the first inning and allowed just four hits the rest of the way Wednesday night, pitching the San Diego Padres to a 4-0 victory over the slumping New York Mets.'}, vector=None, shard_key=None, order_value=None)] +``` + +Because we selected a random sample, you will see something different everytime you go through this part of the tutorial so make sure you read some of the articles that come back and evaluate the similarity of these articles to the one you randomly got from the larger dataset. Have some fun with it. 
πŸ™‚ + +Let's make things more interesting and pick the most similar results from a Business context. We'll do so by creating a field condition with `models.FieldCondition()` with the `key=` parameter set to `label_names` and the `match=` parameter set to `"Business"` via the `models.MatchValue()` function. + +```python +business = models.Filter( + must=[ + models.FieldCondition( + key="label_names", match=models.MatchValue(value="Business") + ) + ] +) +``` + +We will add our `business` variable as a query filter to our `client.search()` call and see what we get. + +```python +client.query_points( + collection_name=my_collection, query=query2.tolist(), query_filter=business, limit=5 +).points +``` + +``` +[ScoredPoint(id=11, version=0, score=0.9980901049053494, payload={'label_names': 'Business', 'text': 'RBC Centura CEO steps down RALEIGH, NC - The head of RBC Centura Bank has stepped down, and his successor will run the bank out of Raleigh rather than Rocky Mount, where the bank is based.'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=27, version=0, score=0.9977535531917637, payload={'label_names': 'Business', 'text': 'Merger could affect Nextel Partners The proposed \\$35 billion merger of Sprint Corp. and Nextel Communications could mean changes for Kirkland-based Nextel Partners Inc.'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=14, version=0, score=0.9977418817397671, payload={'label_names': 'Business', 'text': 'Oracle acquisition of PeopleSoft leads flurry of deals NEW YORK (CBS.MW) -- US stocks closed higher Monday, with the Dow Jones Industrial Average ending at its best level in more than nine months amid better-than-expected economic data and merger-related optimism.'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=82, version=0, score=0.997737225923265, payload={'label_names': 'Business', 'text': 'Tests Show No Mad Cow, Cattle Prices Rise WASHINGTON/CHICAGO (Reuters) - An animal suspected of having mad cow disease was given a clean bill of health in a second round of sophisticated testing, the U.S. Agriculture Department said on Tuesday after cattle prices soared in expectation of the news.'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=85, version=0, score=0.9977358452541891, payload={'label_names': 'Business', 'text': 'Quiz Eisner in shareholder suit Disney big cheese Michael Eisner took the witness stand for the first time yesterday in an ongoing shareholder lawsuit, defending his choice of ex-super talent agent Michael Ovitz as Disney #39;s No.'}, vector=None, shard_key=None, order_value=None)] +``` + +## 6. Recommendations API + +You might notice that even though the similarity score we are getting seem quite high, the results seem to be a bit all over the place. To solve this, we could fine-tune our model and create a new embedding layer, but that would take some time and, most-likely, a GPU. + +What we can do instead is to pick a model that works better off the bat and test the quality of the embeddings while we explore the recommendations API of Qdrant. + +Let's do just that by using the package [`fastembed`](https://github.com/qdrant/fastembed) with the model `sentence-transformers/all-MiniLM-L6-v2`. + +```python +from fastembed import TextEmbedding +``` + +
+ +```python +model = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2") +``` + +We will call the second embedding feature `embedding_2` and extract it from our articles in the same way as before. + +```python +def get_st_embedding(example): + example["embedding_2"] = list(model.embed(example["text"]))[0] + return example + + +small_set = small_set.map(get_st_embedding) +``` + +``` +Map: 0%| | 0/100 [00:00:2: DeprecationWarning: `recreate_collection` method is deprecated and will be removed in the future. Use `collection_exists` to check collection existence and `create_collection` instead. + client.recreate_collection( + + + + + +True +``` + +```python +client.upsert( + collection_name=second_collection, + points=models.Batch( + ids=small_set["idx"], vectors=small_set["embedding_2"], payloads=payloads + ), +) +``` + +``` +UpdateResult(operation_id=0, status=) +``` + +Let's pick a random news article. + +```python +some_txt = small_set[87] +some_txt["idx"], some_txt["text"] +``` + +``` +(87, + '1,600 internet cafes closed in China The Chinese local governments have closed 1600 Internet cafes and fined operators a total of 100m yuan (\\$12m) between February and August this year.') +``` + +They key thing about the recommendation API of Qdrant is that we need at least 1 id for an article +that a user liked, or gave us a πŸ‘ for, but the number of negative articles is optional. + +```python +article_we_liked = small_set[21] +article_we_liked["idx"], article_we_liked["text"] +``` + +``` +(21, + 'Icahn pushes harder to stop Mylan #39;s King acquisition PITTSBURGH Carl Icahn, the largest shareholder of Mylan Laboratories, is now threatening to push for new company directors to stop the generic drug maker #39;s four (B) billion-dollar takeover bid of King Pharmaceuticals.') +``` + +```python +client.query_points( + collection_name=second_collection, + query=models.RecommendQuery( + recommend=models.RecommendInput( + positive=[some_txt["idx"], article_we_liked["idx"]] + ) + ), + limit=5, +).points +``` + +``` +[ScoredPoint(id=93, version=0, score=0.4016504396936016, payload={'label_names': 'Sci/Tech', 'text': 'Feds go after #39;Spam King #39; Federal attorneys are trying to shut down an allegedly illicit Internet business operated out of a dance club in Rochester that is deluging computers with unwanted pop-up ads by hijacking their browsers.'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=56, version=0, score=0.3821894452008371, payload={'label_names': 'Business', 'text': "Amazon moves into China US internet giant Amazon.com is buying China's largest web retailer Joyo.com, in a deal worth \\$75m (41m)."}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=73, version=0, score=0.36303007487200234, payload={'label_names': 'Sci/Tech', 'text': 'China #39;blocks Google news site #39; China has been accused of blocking access to Google News by the media watchdog, Reporters Without Borders. 
The Paris-based pressure group said the English-language news site had been unavailable for the past 10 days.'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=14, version=0, score=0.3339852217199736, payload={'label_names': 'Business', 'text': 'Oracle acquisition of PeopleSoft leads flurry of deals NEW YORK (CBS.MW) -- US stocks closed higher Monday, with the Dow Jones Industrial Average ending at its best level in more than nine months amid better-than-expected economic data and merger-related optimism.'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=20, version=0, score=0.3047674328733027, payload={'label_names': 'Sci/Tech', 'text': 'Cisco switch products target small business Company aggressively addresses smaller businesses with products that reduce the cost and complexity of operating a Cisco network.'}, vector=None, shard_key=None, order_value=None)] +``` + +Not surprisingly, we get a lot of tech news with decent degrees of similarity. + +```python +another_article_we_liked = small_set[70] +another_article_we_liked["idx"], another_article_we_liked["text"] +``` + +``` +(70, + 'Cranes rain down as peace gesture Millions of origami paper cranes yesterday rained down upon Thailand #39;s three violence-torn southern provinces in an extravagant government-sponsored gesture aimed at bringing peace to the predominantly Muslim region.') +``` + +```python +article_we_dont_like = small_set[14] +article_we_dont_like["idx"], article_we_dont_like["text"] +``` + +``` +(14, + 'Oracle acquisition of PeopleSoft leads flurry of deals NEW YORK (CBS.MW) -- US stocks closed higher Monday, with the Dow Jones Industrial Average ending at its best level in more than nine months amid better-than-expected economic data and merger-related optimism.') +``` + +```python +some_other_txt = small_set[88] +some_other_txt +``` + +
```python
# Embed the text of the article (fastembed's embed() expects strings, not dataset rows).
# Note: the recommendation query below works with point IDs, so this embedding is only shown for illustration.
query4 = list(model.embed(some_other_txt["text"]))[0]
```

+ +```python +client.query_points( + collection_name=second_collection, + query=models.RecommendQuery( + recommend=models.RecommendInput( + positive=[ + some_other_txt["idx"], + article_we_liked["idx"], + another_article_we_liked["idx"], + ], + negative=[article_we_dont_like["idx"]], + ) + ), + limit=8, +).points +``` + +``` +[ScoredPoint(id=51, version=0, score=0.2447113759461485, payload={'label_names': 'Sports', 'text': 'Panthers #39; defense leads to another win Carl Krauser scored all but two of his 17 points at the free throw line to lead No. 11 Pittsburgh to a 70-51 victory over Memphis on Tuesday night in the Jimmy V Classic.'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=97, version=0, score=0.19922779504187965, payload={'label_names': 'Sports', 'text': 'MOTOR RACING: IT #39;S TWO IN A ROW FOR DARIO DARIO FRANCHITTI scored a scorching second Indy Racing League race of 2004 on a triumphant return to Pikes Peak. The Scot finished fourth at the Colorado track last year but this time he led the race four '}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=31, version=0, score=0.17695470777361116, payload={'label_names': 'Sports', 'text': 'Shanahan says he intends to honour his deal with Broncos Trying to defuse rumours he might be leaving soon, Denver Broncos coach Mike Shanahan said Thursday night he intends to honour the final four years of his contract.'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=15, version=0, score=0.171702610948617, payload={'label_names': 'Sports', 'text': 'They #146;re in the wrong ATHENS -- Matt Emmons was focusing on staying calm. He should have been focusing on the right target.'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=0, version=0, score=0.14830260252946847, payload={'label_names': 'World', 'text': 'Bangladesh paralysed by strikes Opposition activists have brought many towns and cities in Bangladesh to a halt, the day after 18 people died in explosions at a political rally.'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=28, version=0, score=0.1449231087011586, payload={'label_names': 'Sports', 'text': 'TEXANS STAT CENTER After finally winning consecutive games, the Texans are a team in search of a challenge. quot;I knew that was going to come up, quot; Texans coach Dom Capers said with a laugh.'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=29, version=0, score=0.14203110720347584, payload={'label_names': 'Sports', 'text': 'About-face for Heels Rashad McCants wasn #39;t thinking about last year at Kentucky. Jawad Williams said last year was last year. And Sean May was thinking more about his last game than North Carolina #39;s Jan. 3 defeat at Rupp Arena.'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=16, version=0, score=0.127455928492053, payload={'label_names': 'Sports', 'text': "Mularkey Sticking With Bledsoe As Bills QB (AP) AP - Mike Mularkey has a message to those clamoring for rookie quarterback J.P. Losman to replace Drew Bledsoe as Buffalo's starter. Not yet."}, vector=None, shard_key=None, order_value=None)] +``` + +```python + +``` + +The results seem palatable given the search criteria. Also, this time we want to see only articles that +pass a certain similarity threshold to make sure we only get very relevant results back. 
+ +```python +client.query_points( + collection_name=second_collection, + query=models.RecommendQuery( + recommend=models.RecommendInput( + positive=[some_other_txt["idx"], another_article_we_liked["idx"]], + negative=[article_we_dont_like["idx"]], + ) + ), + score_threshold=0.10, + limit=8, +).points +``` + +``` +[ScoredPoint(id=51, version=0, score=0.23785189164775716, payload={'label_names': 'Sports', 'text': 'Panthers #39; defense leads to another win Carl Krauser scored all but two of his 17 points at the free throw line to lead No. 11 Pittsburgh to a 70-51 victory over Memphis on Tuesday night in the Jimmy V Classic.'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=0, version=0, score=0.1807877638829638, payload={'label_names': 'World', 'text': 'Bangladesh paralysed by strikes Opposition activists have brought many towns and cities in Bangladesh to a halt, the day after 18 people died in explosions at a political rally.'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=29, version=0, score=0.1774875422419001, payload={'label_names': 'Sports', 'text': 'About-face for Heels Rashad McCants wasn #39;t thinking about last year at Kentucky. Jawad Williams said last year was last year. And Sean May was thinking more about his last game than North Carolina #39;s Jan. 3 defeat at Rupp Arena.'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=97, version=0, score=0.16742433889881747, payload={'label_names': 'Sports', 'text': 'MOTOR RACING: IT #39;S TWO IN A ROW FOR DARIO DARIO FRANCHITTI scored a scorching second Indy Racing League race of 2004 on a triumphant return to Pikes Peak. The Scot finished fourth at the Colorado track last year but this time he led the race four '}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=55, version=0, score=0.16378504481516357, payload={'label_names': 'Sports', 'text': 'Bama offense faces tough test Like most Alabama fans, injured quarterback Brodie Croyle will sweat out tonight #39;s LSU game at home. If, somehow, the Crimson Tide can knock off the 17th-ranked Tigers, Croyle '}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=13, version=0, score=0.16075136748638885, payload={'label_names': 'World', 'text': 'Paris Marks Liberation Mindful of Collaboration With solemn commemorations, a ceremonial flag-raising at the Eiffel Tower and columns of 1940s-era tanks and army jeeps, Parisians on Wednesday marked the 60th anniversary '}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=75, version=0, score=0.15894293793543512, payload={'label_names': 'Sports', 'text': 'Orton Engineers Purdue Victory Over Illinois The Illinois rushing offense and the Purdue passing offense dueled in the Big Ten opener for both schools on Saturday at Memorial Stadium in Champaign.'}, vector=None, shard_key=None, order_value=None), + ScoredPoint(id=15, version=0, score=0.15206416157764854, payload={'label_names': 'Sports', 'text': 'They #146;re in the wrong ATHENS -- Matt Emmons was focusing on staying calm. He should have been focusing on the right target.'}, vector=None, shard_key=None, order_value=None)] +``` + +That's it! To see all of the collections that you have created in this tutorial, use `client.get_collections`. + +```python +client.get_collections() +``` + +``` +CollectionsResponse(collections=[CollectionDescription(name='news_embeddings'), CollectionDescription(name='better_news')]) +``` + +## 5. 
Conclusion + +In this tutorial you have learned that (1) vector databases provide efficient storage and retrieval of high-dimensional vectors, making them ideal for similarity-based search tasks. (2) Natural language processing enables us to understand and process human language, opening up possibilities for different kinds of useful applications for digital technologies. (3) Transformers, with their attention mechanism, capture long-range dependencies in language and achieve incredible results in different tasks. Finally, embeddings encode words or sentences into dense vectors, capturing semantic relationships and enabling powerful language understanding. + +By combining these technologies, you can unlock new levels of language understanding, information retrieval, and intelligent systems that continue to push the boundaries of what's possible in the realm of AI. + +## 6. Resources + +Here is a list with some resources that we found useful, and that helped with the development of this tutorial. + +1. Books + - [Natural Language Processing with Transformers](https://transformersbook.com/) by Lewis Tunstall, Leandro von Werra, and Thomas Wolf + - [Natural Language Processing in Action, Second Edition](https://www.manning.com/books/natural-language-processing-in-action-second-edition) by Hobson Lane and Maria Dyshel +1. Articles + - [Fine Tuning Similar Cars Search](https://qdrant.tech/articles/cars-recognition/) + - [Q&A with Similarity Learning](https://qdrant.tech/articles/faq-question-answering/) + - [Question Answering with LangChain and Qdrant without boilerplate](https://qdrant.tech/articles/langchain-integration/) + - [Extending ChatGPT with a Qdrant-based knowledge base](https://qdrant.tech/articles/chatgpt-plugin/) +1. Videos + - [Word Embedding and Word2Vec, Clearly Explained!!!](https://www.youtube.com/watch?v=viZrOnJclY0&ab_channel=StatQuestwithJoshStarmer) by StatQuest with Josh Starmer + - [Word Embeddings, Bias in ML, Why You Don't Like Math, & Why AI Needs You](https://www.youtube.com/watch?v=25nC0n9ERq4&ab_channel=RachelThomas) by Rachel Thomas +1. Courses + - [fast.ai Code-First Intro to Natural Language Processing](https://www.youtube.com/playlist?list=PLtmWHNX-gukKocXQOkQjuVxglSDYWsSh9) + - [NLP Course by Hugging Face](https://huggingface.co/learn/nlp-course/chapter1/1) diff --git a/qdrant-landing/content/documentation/101-foundations/recommend-movies.md b/qdrant-landing/content/documentation/101-foundations/recommend-movies.md new file mode 100644 index 000000000..d0cd4f650 --- /dev/null +++ b/qdrant-landing/content/documentation/101-foundations/recommend-movies.md @@ -0,0 +1,482 @@ +--- +notebook_path: 101-foundations/sparse-vectors-movies-reco/recommend-movies.ipynb +reading_time_min: 7 +title: Movie recommendation system with Qdrant space vectors +--- + +# Movie recommendation system with Qdrant space vectors + +This notebook is a simple example of how to use Qdrant to build a movie recommendation system. +We will use the MovieLens dataset and Qdrant to build a simple recommendation system. + +## How it works + +MovieLens dataset contains a list of movies and ratings given by users. We will use this data to build a recommendation system. + +Our recommendation system will use an approach called **collaborative filtering**. + +The idea behind collaborative filtering is that if two users have similar tastes, then they will like similar movies. 
+We will use this idea to find the most similar users to our own ratings and see what movies these similar users liked, which we haven't seen yet. + +1. We will represent each user's ratings as a vector in a sparse high-dimensional space. +1. We will use Qdrant to index these vectors. +1. We will use Qdrant to find the most similar users to our own ratings. +1. We will see what movies these similar users liked, which we haven't seen yet. + +```python +!pip install qdrant-client pandas +``` + +
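Before loading the real data, here is the intuition in miniature: a tiny, self-contained sketch with made-up ratings (not from MovieLens) showing how the overlap between two users' sparse rating vectors turns into a similarity score.

```python
# Toy illustration with hypothetical ratings: each user is a sparse map of movie_id -> rating.
user_a = {1: 1.0, 3: 0.8, 7: -0.5}
user_b = {1: 0.9, 3: 1.0, 9: -1.0}

# The dot product over the shared movie IDs measures how much their tastes agree.
similarity = sum(user_a[m] * user_b[m] for m in user_a.keys() & user_b.keys())
print(similarity)  # 1.7 -> the more overlap in taste, the higher the score
```

This is the same idea Qdrant applies at scale when it compares our sparse query vector against every stored user.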
+ +```python +# Download and unzip the dataset + +!mkdir -p data +!wget https://files.grouplens.org/datasets/movielens/ml-1m.zip +!unzip ml-1m.zip -d data +``` + +
+ +```python +from qdrant_client import QdrantClient, models +import pandas as pd +``` + +
+ +```python +users = pd.read_csv( + "./data/ml-1m/users.dat", + sep="::", + names=["user_id", "gender", "age", "occupation", "zip"], + engine="python", +) +users +``` + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
|      | user_id | gender | age | occupation | zip   |
|------|---------|--------|-----|------------|-------|
| 0    | 1       | F      | 1   | 10         | 48067 |
| 1    | 2       | M      | 56  | 16         | 70072 |
| 2    | 3       | M      | 25  | 15         | 55117 |
| 3    | 4       | M      | 45  | 7          | 02460 |
| 4    | 5       | M      | 25  | 20         | 55455 |
| ...  | ...     | ...    | ... | ...        | ...   |
| 6035 | 6036    | F      | 25  | 15         | 32603 |
| 6036 | 6037    | F      | 45  | 1          | 76006 |
| 6037 | 6038    | F      | 56  | 1          | 14706 |
| 6038 | 6039    | F      | 45  | 0          | 01060 |
| 6039 | 6040    | M      | 25  | 6          | 11106 |

6040 rows Γ— 5 columns
+ +```python +movies = pd.read_csv( + "./data/ml-1m/movies.dat", + sep="::", + names=["movie_id", "title", "genres"], + engine="python", + encoding="latin-1", +) +movies +``` + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
|      | movie_id | title                              | genres                        |
|------|----------|------------------------------------|-------------------------------|
| 0    | 1        | Toy Story (1995)                   | Animation\|Children's\|Comedy |
| 1    | 2        | Jumanji (1995)                     | Adventure\|Children's\|Fantasy |
| 2    | 3        | Grumpier Old Men (1995)            | Comedy\|Romance               |
| 3    | 4        | Waiting to Exhale (1995)           | Comedy\|Drama                 |
| 4    | 5        | Father of the Bride Part II (1995) | Comedy                        |
| ...  | ...      | ...                                | ...                           |
| 3878 | 3948     | Meet the Parents (2000)            | Comedy                        |
| 3879 | 3949     | Requiem for a Dream (2000)         | Drama                         |
| 3880 | 3950     | Tigerland (2000)                   | Drama                         |
| 3881 | 3951     | Two Family House (2000)            | Drama                         |
| 3882 | 3952     | Contender, The (2000)              | Drama\|Thriller               |

3883 rows Γ— 3 columns
+ +```python +ratings = pd.read_csv( + "./data/ml-1m/ratings.dat", + sep="::", + names=["user_id", "movie_id", "rating", "timestamp"], + engine="python", +) +``` + +
```python
# Normalize ratings

# Sparse vectors can take advantage of negative values, so we normalize ratings to have mean 0 and std 1.
# This way, movies we don't like contribute a negative signal.

ratings.rating = (ratings.rating - ratings.rating.mean()) / ratings.rating.std()
```

+ +```python +# Convert ratings to sparse vectors + +from collections import defaultdict + +user_sparse_vectors = defaultdict(lambda: {"values": [], "indices": []}) + +for row in ratings.itertuples(): + user_sparse_vectors[row.user_id]["values"].append(row.rating) + user_sparse_vectors[row.user_id]["indices"].append(row.movie_id) +``` + +
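As a quick sanity check (not part of the original notebook), we can peek at one user's sparse vector to confirm that movie IDs ended up as indices and normalized ratings as values:

```python
# Indices are movie IDs, values are that user's normalized ratings.
example_user_id = next(iter(user_sparse_vectors))
example_vector = user_sparse_vectors[example_user_id]
print(f"user {example_user_id} rated {len(example_vector['indices'])} movies")
print(list(zip(example_vector["indices"][:5], example_vector["values"][:5])))
```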
```python
# For this small dataset we can use in-memory Qdrant
# For production, we recommend using the server-based version

qdrant = QdrantClient(":memory:")  # or QdrantClient("http://localhost:6333")
```

```python
# Create a collection configured for sparse vectors
# Sparse vectors don't require a dimension to be specified, because it is extracted from the data automatically

qdrant.create_collection(
    "movielens",
    vectors_config={},
    sparse_vectors_config={"ratings": models.SparseVectorParams()},
)
```

```
True
```

```python
# Upload all users' ratings as sparse vectors


def data_generator():
    for user in users.itertuples():
        yield models.PointStruct(
            id=user.user_id,
            vector={"ratings": user_sparse_vectors[user.user_id]},
            payload=user._asdict(),
        )


# This performs a lazy upload of the data
qdrant.upload_points("movielens", data_generator())
```

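If you want to confirm the upload finished, an optional check (not in the original notebook) is to count the points in the collection; there should be one per user:

```python
print(qdrant.count("movielens"))  # expected: count=6040, one point per user
```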
+ +```python +# Let's try to recommend something for ourselves + +# 1 - like +# -1 - dislike + +# Search with +# movies[movies.title.str.contains("Matrix", case=False)] + +my_ratings = { + 2571: 1, # Matrix + 329: 1, # Star Trek + 260: 1, # Star Wars + 2288: -1, # The Thing + 1: 1, # Toy Story + 1721: -1, # Titanic + 296: -1, # Pulp Fiction + 356: 1, # Forrest Gump + 2116: 1, # Lord of the Rings + 1291: -1, # Indiana Jones + 1036: -1, # Die Hard +} + +inverse_ratings = {k: -v for k, v in my_ratings.items()} + + +def to_vector(ratings): + vector = models.SparseVector(values=[], indices=[]) + for movie_id, rating in ratings.items(): + vector.values.append(rating) + vector.indices.append(movie_id) + return vector +``` + +
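To see what the query payload looks like, here is a small illustrative check (not in the original notebook) that converts our ratings and inspects the parallel lists:

```python
# The sparse query vector pairs movie IDs (indices) with our likes/dislikes (values).
query_vector = to_vector(my_ratings)
print(query_vector.indices[:3], query_vector.values[:3])  # [2571, 329, 260] [1, 1, 1]
```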
+ +```python +# Find users with similar taste + +results = qdrant.search( + "movielens", + query_vector=models.NamedSparseVector(name="ratings", vector=to_vector(my_ratings)), + with_vectors=True, # We will use those to find new movies + limit=20, +) +``` + +
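Each hit is a `ScoredPoint`; because we passed `with_vectors=True`, it carries the similar user's full sparse rating vector alongside their payload. A quick look at the top hit (an illustrative check, not part of the original notebook):

```python
top_hit = results[0]
print(top_hit.score, top_hit.payload["user_id"])
print(len(top_hit.vector["ratings"].indices), "movies rated by this user")
```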
+ +```python +# Calculate how frequently each movie is found in similar users' ratings + + +def results_to_scores(results): + movie_scores = defaultdict(lambda: 0) + + for user in results: + user_scores = user.vector["ratings"] + for idx, rating in zip(user_scores.indices, user_scores.values): + if idx in my_ratings: + continue + movie_scores[idx] += rating + + return movie_scores +``` + +
+ +```python +# Sort movies by score and print top 5 + +movie_scores = results_to_scores(results) +top_movies = sorted(movie_scores.items(), key=lambda x: x[1], reverse=True) + +for movie_id, score in top_movies[:5]: + print(movies[movies.movie_id == movie_id].title.values[0], score) +``` + +``` +Star Wars: Episode V - The Empire Strikes Back (1980) 20.023877887283938 +Star Wars: Episode VI - Return of the Jedi (1983) 16.44318377549194 +Princess Bride, The (1987) 15.84006760423755 +Raiders of the Lost Ark (1981) 14.94489407628955 +Sixth Sense, The (1999) 14.570321651488953 +``` + +```python +# Find users with similar taste, but only within my age group +# We can also filter by other fields, like `gender`, `occupation`, etc. + +results = qdrant.search( + "movielens", + query_vector=models.NamedSparseVector(name="ratings", vector=to_vector(my_ratings)), + query_filter=models.Filter( + must=[models.FieldCondition(key="age", match=models.MatchValue(value=25))] + ), + with_vectors=True, + limit=20, +) + +movie_scores = results_to_scores(results) +top_movies = sorted(movie_scores.items(), key=lambda x: x[1], reverse=True) + +for movie_id, score in top_movies[:5]: + print(movies[movies.movie_id == movie_id].title.values[0], score) +``` + +``` +Princess Bride, The (1987) 16.214640029038147 +Star Wars: Episode V - The Empire Strikes Back (1980) 14.652836719595939 +Blade Runner (1982) 13.52911944519415 +Usual Suspects, The (1995) 13.446604377087162 +Godfather, The (1972) 13.300575698740357 +``` diff --git a/qdrant-landing/content/documentation/201-intermediate/Multimodal_Search_with_FastEmbed.md b/qdrant-landing/content/documentation/201-intermediate/Multimodal_Search_with_FastEmbed.md new file mode 100644 index 000000000..4a59dd30e --- /dev/null +++ b/qdrant-landing/content/documentation/201-intermediate/Multimodal_Search_with_FastEmbed.md @@ -0,0 +1,211 @@ +--- +notebook_path: 201-intermediate/multimodal-search/Multimodal_Search_with_FastEmbed.ipynb +reading_time_min: 8 +title: +--- + +### Tutorial + +Install & import **Qdrant** and **FastEmbed** + +We will **FastEmbed** for generating multimodal embeddings and **Qdrant** for storing and retrieving them. 
+ +```python +!python3 -m pip install --upgrade qdrant-client fastembed Pillow +``` + +``` +Requirement already satisfied: qdrant-client in /usr/local/lib/python3.10/dist-packages (1.11.0) +Requirement already satisfied: fastembed in /usr/local/lib/python3.10/dist-packages (0.3.4) +Requirement already satisfied: Pillow in /usr/local/lib/python3.10/dist-packages (10.4.0) +Requirement already satisfied: grpcio>=1.41.0 in /usr/local/lib/python3.10/dist-packages (from qdrant-client) (1.65.5) +Requirement already satisfied: grpcio-tools>=1.41.0 in /usr/local/lib/python3.10/dist-packages (from qdrant-client) (1.65.5) +Requirement already satisfied: httpx>=0.20.0 in /usr/local/lib/python3.10/dist-packages (from httpx[http2]>=0.20.0->qdrant-client) (0.27.0) +Requirement already satisfied: numpy>=1.21 in /usr/local/lib/python3.10/dist-packages (from qdrant-client) (1.26.4) +Requirement already satisfied: portalocker<3.0.0,>=2.7.0 in /usr/local/lib/python3.10/dist-packages (from qdrant-client) (2.10.1) +Requirement already satisfied: pydantic>=1.10.8 in /usr/local/lib/python3.10/dist-packages (from qdrant-client) (2.8.2) +Requirement already satisfied: urllib3<3,>=1.26.14 in /usr/local/lib/python3.10/dist-packages (from qdrant-client) (2.0.7) +Requirement already satisfied: PyStemmer<3.0.0,>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from fastembed) (2.2.0.1) +Requirement already satisfied: huggingface-hub<1.0,>=0.20 in /usr/local/lib/python3.10/dist-packages (from fastembed) (0.23.5) +Requirement already satisfied: loguru<0.8.0,>=0.7.2 in /usr/local/lib/python3.10/dist-packages (from fastembed) (0.7.2) +Requirement already satisfied: mmh3<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from fastembed) (4.1.0) +Requirement already satisfied: onnx<2.0.0,>=1.15.0 in /usr/local/lib/python3.10/dist-packages (from fastembed) (1.16.2) +Requirement already satisfied: onnxruntime<2.0.0,>=1.17.0 in /usr/local/lib/python3.10/dist-packages (from fastembed) (1.19.0) +Requirement already satisfied: requests<3.0,>=2.31 in /usr/local/lib/python3.10/dist-packages (from fastembed) (2.32.3) +Requirement already satisfied: snowballstemmer<3.0.0,>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from fastembed) (2.2.0) +Requirement already satisfied: tokenizers<1.0,>=0.15 in /usr/local/lib/python3.10/dist-packages (from fastembed) (0.19.1) +Requirement already satisfied: tqdm<5.0,>=4.66 in /usr/local/lib/python3.10/dist-packages (from fastembed) (4.66.5) +Requirement already satisfied: protobuf<6.0dev,>=5.26.1 in /usr/local/lib/python3.10/dist-packages (from grpcio-tools>=1.41.0->qdrant-client) (5.27.3) +Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from grpcio-tools>=1.41.0->qdrant-client) (71.0.4) +Requirement already satisfied: anyio in /usr/local/lib/python3.10/dist-packages (from httpx>=0.20.0->httpx[http2]>=0.20.0->qdrant-client) (3.7.1) +Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from httpx>=0.20.0->httpx[http2]>=0.20.0->qdrant-client) (2024.7.4) +Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.10/dist-packages (from httpx>=0.20.0->httpx[http2]>=0.20.0->qdrant-client) (1.0.5) +Requirement already satisfied: idna in /usr/local/lib/python3.10/dist-packages (from httpx>=0.20.0->httpx[http2]>=0.20.0->qdrant-client) (3.7) +Requirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from httpx>=0.20.0->httpx[http2]>=0.20.0->qdrant-client) (1.3.1) +Requirement already satisfied: 
h11<0.15,>=0.13 in /usr/local/lib/python3.10/dist-packages (from httpcore==1.*->httpx>=0.20.0->httpx[http2]>=0.20.0->qdrant-client) (0.14.0) +Requirement already satisfied: h2<5,>=3 in /usr/local/lib/python3.10/dist-packages (from httpx[http2]>=0.20.0->qdrant-client) (4.1.0) +Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.20->fastembed) (3.15.4) +Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.20->fastembed) (2024.6.1) +Requirement already satisfied: packaging>=20.9 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.20->fastembed) (24.1) +Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.20->fastembed) (6.0.2) +Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.20->fastembed) (4.12.2) +Requirement already satisfied: coloredlogs in /usr/local/lib/python3.10/dist-packages (from onnxruntime<2.0.0,>=1.17.0->fastembed) (15.0.1) +Requirement already satisfied: flatbuffers in /usr/local/lib/python3.10/dist-packages (from onnxruntime<2.0.0,>=1.17.0->fastembed) (24.3.25) +Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from onnxruntime<2.0.0,>=1.17.0->fastembed) (1.13.2) +Requirement already satisfied: annotated-types>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from pydantic>=1.10.8->qdrant-client) (0.7.0) +Requirement already satisfied: pydantic-core==2.20.1 in /usr/local/lib/python3.10/dist-packages (from pydantic>=1.10.8->qdrant-client) (2.20.1) +Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3.0,>=2.31->fastembed) (3.3.2) +Requirement already satisfied: hyperframe<7,>=6.0 in /usr/local/lib/python3.10/dist-packages (from h2<5,>=3->httpx[http2]>=0.20.0->qdrant-client) (6.0.1) +Requirement already satisfied: hpack<5,>=4.0 in /usr/local/lib/python3.10/dist-packages (from h2<5,>=3->httpx[http2]>=0.20.0->qdrant-client) (4.0.0) +Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio->httpx>=0.20.0->httpx[http2]>=0.20.0->qdrant-client) (1.2.2) +Requirement already satisfied: humanfriendly>=9.1 in /usr/local/lib/python3.10/dist-packages (from coloredlogs->onnxruntime<2.0.0,>=1.17.0->fastembed) (10.0) +Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy->onnxruntime<2.0.0,>=1.17.0->fastembed) (1.3.0) +``` + +```python +from qdrant_client import QdrantClient, models + +client = QdrantClient( + ":memory:" +) #:memory: option is suitable only for simple prototypes/demos with Python client (!) +``` + +Let's embed a very short selection of images and their captions in the **shared embedding space** with CLIP. 
+ +```python +from fastembed import TextEmbedding, ImageEmbedding + + +documents = [ + {"caption": "A photo of a cute pig", "image": "images/piggy.jpg"}, + {"caption": "A picture with a coffee cup", "image": "images/coffee.jpg"}, + {"caption": "A photo of a colourful lizard", "image": "images/lizard.jpg"}, +] + +text_model_name = "Qdrant/clip-ViT-B-32-text" # CLIP text encoder +text_model = TextEmbedding(model_name=text_model_name) +text_embeddings_size = text_model._get_model_description(text_model_name)[ + "dim" +] # dimension of text embeddings, produced by CLIP text encoder (512) +texts_embeded = list( + text_model.embed([document["caption"] for document in documents]) +) # embedding captions with CLIP text encoder + +image_model_name = "Qdrant/clip-ViT-B-32-vision" # CLIP image encoder +image_model = ImageEmbedding(model_name=image_model_name) +image_embeddings_size = image_model._get_model_description(image_model_name)[ + "dim" +] # dimension of image embeddings, produced by CLIP image encoder (512) +images_embeded = list( + image_model.embed([document["image"] for document in documents]) +) # embedding images with CLIP image encoder +``` + +``` +/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: +The secret `HF_TOKEN` does not exist in your Colab secrets. +To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session. +You will be able to reuse this secret in all of your notebooks. +Please note that authentication is recommended but still optional to access public models or datasets. + warnings.warn( + + + +Fetching 5 files: 0%| | 0/5 [00:00) + +# Load Sample Dataset + +First we need to load our documents. In this example, we will use the [News Category Dataset v3](https://huggingface.co/datasets/heegyu/news-category-dataset). This dataset contains news articles with various fields like `headline`, `category`, `short_description`, `link`, `authors`, and date. Once we load the data, we will reformat it to suit our needs. + +```python +dataset = load_dataset("heegyu/news-category-dataset", split="train") +``` + +``` +Found cached dataset json (/Users/nirantk/.cache/huggingface/datasets/heegyu___json/heegyu--news-category-dataset-a0dcb53f17af71bf/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4) +``` + +```python +def get_single_text(k): + return f"Under the category:\n{k['category']}:\n{k['headline']}\n{k['short_description']}" + + +df = pd.DataFrame(dataset) +df.head() +``` + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
|   | link | headline | category | short_description | authors | date |
|---|------|----------|----------|-------------------|---------|------|
| 0 | https://www.huffpost.com/entry/covid-boosters-... | Over 4 Million Americans Roll Up Sleeves For O... | U.S. NEWS | Health experts said it is too early to predict... | Carla K. Johnson, AP | 2022-09-23 |
| 1 | https://www.huffpost.com/entry/american-airlin... | American Airlines Flyer Charged, Banned For Li... | U.S. NEWS | He was subdued by passengers and crew when he ... | Mary Papenfuss | 2022-09-23 |
| 2 | https://www.huffpost.com/entry/funniest-tweets... | 23 Of The Funniest Tweets About Cats And Dogs ... | COMEDY | "Until you have a dog you don't understand wha... | Elyse Wanshel | 2022-09-23 |
| 3 | https://www.huffpost.com/entry/funniest-parent... | The Funniest Tweets From Parents This Week (Se... | PARENTING | "Accidentally put grown-up toothpaste on my to... | Caroline Bologna | 2022-09-23 |
| 4 | https://www.huffpost.com/entry/amy-cooper-lose... | Woman Who Called Cops On Black Bird-Watcher Lo... | U.S. NEWS | Amy Cooper accused investment firm Franklin Te... | Nina Golgowski | 2022-09-22 |
+ +```python +# Assuming `df` is your original dataframe +df["year"] = df["date"].dt.year + +category_columns_to_keep = ["POLITICS", "THE WORLDPOST", "WORLD NEWS", "WORLDPOST", "U.S. NEWS"] + +# Filter by category +df_filtered = df[df["category"].isin(category_columns_to_keep)] + +# Sample data for each year + + +def sample_func(x): + return x.sample(min(len(x), 200), random_state=42) + + +df_sampled = df_filtered.groupby("year").apply(sample_func).reset_index(drop=True) +``` + +
+ +```python +df_sampled["year"].value_counts() +``` + +``` +year +2014 200 +2015 200 +2016 200 +2017 200 +2018 200 +2019 200 +2020 200 +2021 200 +2022 200 +Name: count, dtype: int64 +``` + +```python +del df +``` + +
+ +```python +df = df_sampled +``` + +
+ +```python +df["text"] = df.apply(get_single_text, axis=1) +df["text"] +``` + +``` +0 Under the category:\nWORLDPOST:\nAfghans Don't... +1 Under the category:\nPOLITICS:\nACLU Seeks To ... +2 Under the category:\nPOLITICS:\nWork and Worth... +3 Under the category:\nPOLITICS:\nJody Hice, Ant... +4 Under the category:\nPOLITICS:\nCapito Wins We... + ... +1795 Under the category:\nPOLITICS:\nA Hard-Right R... +1796 Under the category:\nPOLITICS:\nHerschel Walke... +1797 Under the category:\nU.S. NEWS:\nStocks Fall, ... +1798 Under the category:\nWORLD NEWS:\nPeru Court O... +1799 Under the category:\nPOLITICS:\nMichigan Secre... +Name: text, Length: 1800, dtype: object +``` + +```python +df["text"][9] +``` + +``` +"Under the category:\nWORLDPOST:\nFreed Taliban Commander Tells Relative He'll Fight Americans Again\n" +``` + +```python +df.drop(columns=["year"], inplace=True) +``` + +Next, write these documents to text files in a directory. Each document will be written to a text file named after its date. + +```python +%%time +write_dir = Path("../data/sample").resolve() +if write_dir.exists(): + [f.unlink() for f in write_dir.ls()] +write_dir.mkdir(exist_ok=True, parents=True) +for index, row in df.iterrows(): + date = str(row["date"]).replace("-", "_") # replace '-' in date with '_' to avoid issues with file names + file_path = write_dir / f"date_{date}_row_{index}.txt" + with file_path.open("w") as f: + f.write(row["text"]) +``` + +``` +CPU times: user 45.6 ms, sys: 116 ms, total: 161 ms +Wall time: 162 ms +``` + +```python +# del dataset, df +``` + +## Store Dataset with Qdrant Client + +We'll be using Qdrant as our vector storage system. Qdrant is a high-performance vector database designed for storing and searching large-scale high-dimensional vectors. + +### Local Qdrant Server/Docker + Cloud Instructions + +- If you're running a local Qdrant instance with Docker, use `uri`: + - `uri="http://:"` + +Here I'll be using the cloud, so I am using the url set to my cloud instance + +- Set the API KEY for Qdrant Cloud: + - `api_key=""` + - `url` + +### Memory + +- You can use `:memory:` mode for fast and lightweight experiments. It does not require Qdrant to be deployed anywhere. + +```python +client = QdrantClient(":memory:") +``` + +## Load Data into LlamaIndex + +LlamaIndex has a simple way to load documents from a directory. We can define a function to get the metadata from a file name, and pass this function to the `SimpleDirectoryReader` class. + +```python +def get_file_metadata(file_name: str): + """Get file metadata.""" + date_str = Path(file_name).stem.split("_")[1:4] + return {"date": "-".join(date_str)} + + +documents = SimpleDirectoryReader(input_files=write_dir.ls(), file_metadata=get_file_metadata).load_data() +``` + +
+ +```python +len(documents) +``` + +``` +1800 +``` + +Let's look at the date ranges in our dataset: + +```python +dates, years = [], [] + +for document in documents: + dt = datetime.datetime.fromisoformat(document.extra_info["date"]) + # print(d) + try: + dates.append(dt) + years.append(dt.year) + except Exception: + print(dt) +``` + +This `date` key is *necessary* for the Recency Postprocessor that we are going to use later. + +We have to parse these documents into nodes and create our QdrantVectorStore: + +```python +# define service context (wrapper container around current classes) +service_context = ServiceContext.from_defaults(chunk_size_limit=512) +vector_store = QdrantVectorStore(client=client, collection_name="NewsCategoryv3PoliticsSample") +``` + +Next, we will create our `GPTVectorStoreIndex` from the documents. This operation might take some time as it's creating the index from the documents. + +```python +%%time +index = GPTVectorStoreIndex.from_documents( + documents, vector_store=vector_store, service_context=service_context +) +``` + +## Run a Test Query + +We have made an index. But as we saw in the diagram, we also need some added functionality to do 3 things: + +1. Retrieval + - Convert the text query into embedding + - Find the most similar documents +1. Synthesis + - The LLM (here, OpenAI) texts the question, similar documents and a prompt to give you an answer + +```python +query_engine = index.as_query_engine(similarity_top_k=10) +``` + +
+ +```python +response = query_engine.query("Who is the US President?") +print(response) +``` + +``` +The US President is Joe Biden. +``` + +```python +response = query_engine.query("Who is the current US President?") +print(response) +``` + +``` +The current US President is Joe Biden. +``` + +# Adding Postprocessors + +LlamaIndex excels at composing Retrieval and Ranking steps. + +The intention behind this is to improve answer quality. Let's see if we can use Postprocessors to improve answer quality by using two approaches: + +1. Selecting the most recent nodes (Recency). +1. Reranking using a different model (Cohere Rerank). + +![]() + +Here is what the diagram represents: + +1. The user issues a query to the query engine. +1. The query engine, which has been configured with certain postprocessors, performs a search on the vector store based on the query. +1. The query engine then postprocesses the results. +1. The postprocessed results are then returned to the user + +### Define a Recency Postprocessor + +LlamaIndex allows us to add postprocessors to our query engine. These postprocessors can modify the results of our queries after they are returned from the index. Here, we'll add a recency postprocessor to our query engine. This postprocessor will prioritize recent documents in the results. + +We'll define a single type of recency postprocessor: `FixedRecencyPostprocessor`. + +```python +recency_postprocessor = FixedRecencyPostprocessor(service_context=service_context, top_k=1) +``` + +### Rerank with Cohere + +Cohere Rerank works on the top K results which the Retrieval step from Qdrant returns. While Qdrant works on your entire corpus (here thousands, but Qdrant is designed to work with millions) -- Cohere works with the result from Qdrant. This can improve the search results since it's working on smaller number of entries. + +![]() + +Rerank endpoint takes in a query and a list of texts and produces an ordered array with each text assigned a relevance score. We'll define a `CohereRerank` postprocessor and add it to our query engine. + +## Defining Query Engines + +We'll define four query engines for this tutorial: + +1. Just the Vector Store i.e. Qdrant here +1. A recency query engine +1. A reranking query engine +1. And a combined query engine. + +The recency query engine uses the `FixedRecencyPostprocessor`, the reranking query engine uses the `CohereRerank` postprocessor, and the combined query engine uses both. + +```python +top_k = 10 # set one, reuse from now on, ensures consistency +``` + +
+ +```python +index_query_engine = index.as_query_engine( + similarity_top_k=top_k, +) +``` + +
+ +```python +recency_query_engine = index.as_query_engine( + similarity_top_k=top_k, + node_postprocessors=[recency_postprocessor], +) +``` + +
+ +```python +cohere_rerank = CohereRerank(api_key=os.environ["COHERE_API_KEY"], top_n=top_k) +reranking_query_engine = index.as_query_engine( + similarity_top_k=top_k, + node_postprocessors=[cohere_rerank], +) +``` + +
+ +```python +query_engine = index.as_query_engine( + similarity_top_k=top_k, + node_postprocessors=[cohere_rerank, recency_postprocessor], +) +``` + +## Querying the Engine + +Finally, we can query our engine. Let's ask it "Who is the current US President?" and see the results from each query engine. + +```python +# question = "Who is the current US President?" +response = index_query_engine.query("Who is the US President?") +print(response) +``` + +``` +The US President is Joe Biden. +``` + +The `response` object has a few interesting attributes which help us quickly debug and understand what happened in each of our steps: + +1. What source nodes (similar to Document Chunks in Langchain) were used to answer the question +1. What `extra_info` does the index have which we can use? This could also be sent as a payload to Qdrant to filter on (via epoch time) -- but Llama Index does not + +Let's unpack that a bit, and we'll use what we learn from `response` to improve our understanding of the query engines and post processors themselves. + +Note that `10` which is the top-k parameter we set. This confirms that we retrieved the 10 documents most similar to the question (or more correct: 10 nearest neighbours to the question) and a confidence score. + +Can we show this in a more human-readable way? + +```python +print(response.get_formatted_sources()[:318]) +``` + +``` +> Source (Doc id: 24ec05e1-cb35-492e-8741-fdfe2c582e43): date: 2017-01-28 00:00:00 + +Under the category: +THE WORLDPOST: +World Leaders React To The Reality ... + +> Source (Doc id: 098c2482-ce52-4e31-aa1c-825a385b56a1): date: 2015-01-18 00:00:00 + +Under the category: +POLITICS: +The Issue That's Looming Over The Final ... +``` + +Let's check what is stored in the `extra_info` attribute. + +```python +response.extra_info +``` + +``` +{'24ec05e1-cb35-492e-8741-fdfe2c582e43': {'date': '2017-01-28 00:00:00'}, + '098c2482-ce52-4e31-aa1c-825a385b56a1': {'date': '2015-01-18 00:00:00'}, + 'a3993bb5-64a4-46ce-aa15-0e0672f0994f': {'date': '2014-08-21 00:00:00'}, + 'e48f4521-1bf3-45a3-b00b-fd6a03855d6f': {'date': '2018-12-26 00:00:00'}, + '2a13360c-2c18-4917-aef8-1002931d6a3c': {'date': '2016-06-24 00:00:00'}, + '77bd45bf-5418-4eee-bc47-33d2942e2fb8': {'date': '2014-05-31 00:00:00'}, + '51ab3ea9-67af-48a0-864a-5fa1559b2a63': {'date': '2017-06-29 00:00:00'}, + '023a5a27-1f92-4028-aea6-38e681ff2032': {'date': '2014-12-03 00:00:00'}, + '360fac77-ff67-475e-96d8-1480f2447971': {'date': '2014-12-20 00:00:00'}, + '95f092f4-0bed-46de-bae0-4107b775d603': {'date': '2022-03-26 00:00:00'}} +``` + +This has a `date` key-value as a string against the `doc id` + +Let's setup some tools to have a question, answer and the responses from the index engine in the same object - this will come handy in a bit for explaining a wrong answer. + +```python +def mprint(text: str): + display_markdown(Markdown(text)) + + +class QAInfo: + """This class is used to store the question, correct answer and responses from different query engines.""" + + def __init__(self, question: str, correct_answer: str, query_engines: dict[str, Any]): + self.question = question + self.query_engines = query_engines + self.correct_answer = correct_answer + self.responses = {} + + def add_response(self, engine: str, response: str): + # This method is used to add the response of a query engine to the responses dictionary. 
+ self.responses[engine] = response + + def compare_responses(self): + """This function takes in a QAInfo object and a dictionary of query engines, and runs the question through each query engine. + The responses from each engine are added to the QAInfo object.""" + mprint(f"### Question: {self.question}") + + for engine_name, engine in query_engines.items(): + response = engine.query(self.question) + self.add_response(engine_name, response) + mprint(f"**{engine_name.title()}**: {response}") + + mprint(f"Correct Answer is: {self.correct_answer}") + + def node_print(self, index, preview_count=5): + source_nodes = self.responses[index].source_nodes + for i in range(preview_count): + mprint(f"- {source_nodes[i].node.text}") + + +query_engines = { + "qdrant": index_query_engine, + "recency": recency_query_engine, + "reranking": reranking_query_engine, + "both": query_engine, +} +``` + +
+ +```python +question = "Who is the US President?" +correct_answer = "Joe Biden" # This would normally be determined programmatically. +president_qa_info = QAInfo(question=question, correct_answer=correct_answer, query_engines=query_engines) +president_qa_info.compare_responses() +``` + +### Question: Who is the US President? + +**Qdrant**: +The US President is Joe Biden. + +**Recency**: +The US President is Joe Biden. + +**Reranking**: +The US President is Barack Obama. + +**Both**: +The US President is Joe Biden. + +Correct Answer is: Joe Biden + +```python +president_qa_info.node_print(index="recency", preview_count=1) +``` + +- Under the category: + WORLD NEWS: + Biden On Putin: 'For God's Sake, This Man Cannot Remain In Power' + President Joe Biden visited Poland's capital on Saturday to speak with refugees who've been displaced amid Russia's attack on Ukraine. + +```python +president_qa_info.node_print(index="qdrant", preview_count=1) +``` + +- Under the category: + THE WORLDPOST: + World Leaders React To The Reality Of A Trump Presidency + Many of the presidential memorandums and executive decisions will fundamentally affect countries around the globe. + +## Impact of how a question is asked + +```python +question = "Who is US President in 2022?" +correct_answer = "Joe Biden" # This would normally be determined programmatically. +current_president_qa_info = QAInfo( + question=question, correct_answer=correct_answer, query_engines=query_engines +) +current_president_qa_info.compare_responses() +``` + +### Question: Who is US President in 2022? + +**Qdrant**: +Joe Biden is the US President in 2022. + +**Recency**: +The US President in 2022 is unknown at this time. + +**Reranking**: +Joe Biden is the US President in 2022. + +**Both**: +The US President in 2022 is unknown at this time. + +Correct Answer is: Joe Biden + +### Investigating for Ranking Challenges + +We pull the few top documents which from each query engine. To make them easy to read, we've a utility `node_print` here. + +πŸ’‘ We notice that Qdrant (using embeddings) correctly pulls out a few mentions of "2024", "Joe Biden" and "President Joe Biden" + +πŸ’‘ Cohere also re-orders the top 10 candidates to give the top 3 which mention "President Joe Biden". + +With Recency, we get an undetermined answer. This is because we're only using the one, most recent result. + +## πŸŽ“ Try this now: + +> Change the `top_k` value passed to `llama_index` and see how that changes the answers + +```python +current_president_qa_info.node_print(index="qdrant", preview_count=3) +``` + +- Under the category: + POLITICS: + Joe Biden Says He 'Can't Picture' U.S. Troops Being In Afghanistan In 2022 + The president doubled down on his promise to end America's longest-running war at a Thursday press conference, though he said a May 1 deadline seemed unlikely. + +- Under the category: + POLITICS: + How A Crowded GOP Field Could Bolster A Trump 2024 Campaign + As Donald Trump considers another White House run, polls show he's the most popular figure in the Republican Party. + +- Under the category: + POLITICS: + Biden To Give First State Of The Union Address At Fraught Moment + President Joe Biden aims to navigate the country out a pandemic, reboot his stalled domestic agenda and confront Russia’s aggression. 
+ +```python +current_president_qa_info.node_print(index="recency", preview_count=1) +``` + +- Under the category: + POLITICS: + GOP Senators Refuse To Rule Out Supporting Donald Trump Again β€” Even If He's Indicted + With the ex-president reportedly under criminal investigation, many Senate Republicans are taking a wait-and-hope-it-doesn’t-happen stance. + +```python +current_president_qa_info.node_print(index="reranking", preview_count=3) +``` + +- Under the category: + POLITICS: + Biden To Give First State Of The Union Address At Fraught Moment + President Joe Biden aims to navigate the country out a pandemic, reboot his stalled domestic agenda and confront Russia’s aggression. + +- Under the category: + WORLD NEWS: + Biden On Putin: 'For God's Sake, This Man Cannot Remain In Power' + President Joe Biden visited Poland's capital on Saturday to speak with refugees who've been displaced amid Russia's attack on Ukraine. + +- Under the category: + POLITICS: + Joe Biden Says He 'Can't Picture' U.S. Troops Being In Afghanistan In 2022 + The president doubled down on his promise to end America's longest-running war at a Thursday press conference, though he said a May 1 deadline seemed unlikely. + +## Add a specific Year + +That looks interesting. Let's try this question after specifying the year: + +```python +question = "Who was the US President in 2010?" +correct_answer = "Barack Obama" # This would normally be determined programmatically. +president_2010_qa_info = QAInfo(question=question, correct_answer=correct_answer, query_engines=query_engines) +president_2010_qa_info.compare_responses() +``` + +### Question: Who was the US President in 2010? + +**Qdrant**: +The US President in 2010 was Barack Obama. + +**Recency**: +In 2010, the US President was Barack Obama. + +**Reranking**: +The US President in 2010 was Barack Obama. + +**Both**: +In 2010, the US President was Barack Obama. + +Correct Answer is: Barack Obama + +Let's try a different variant of this question, specify a year and see what happens? + +```python +question = "Who was the Finance Minister of India under Manmohan Singh Govt?" +correct_answer = "P. Chidambaram" # This would normally be determined programmatically. +prime_minister_jan2014 = QAInfo(question=question, correct_answer=correct_answer, query_engines=query_engines) +prime_minister_jan2014.compare_responses() +``` + +### Question: Who was the Finance Minister of India under Manmohan Singh Govt? + +**Qdrant**: +The Finance Minister of India under Manmohan Singh Govt was Palaniappan Chidambaram. + +**Recency**: +The Finance Minister of India under Manmohan Singh Govt was Palaniappan Chidambaram. + +**Reranking**: +The Finance Minister of India under Manmohan Singh Govt was Palaniappan Chidambaram. + +**Both**: +The Finance Minister of India under Manmohan Singh Govt was Palaniappan Chidambaram. + +Correct Answer is: P. Chidambaram + +### Observation + +In this question: All the engines give the correct answer! + +This is despite the fact that the Recency Postprocessor response does not even talk about the Indian Prime Minister! ❌ + +Qdrant via OpenAI Embeddings and Cohere Rerank do not do that much better + +The correct answer comes from OpenAI LLM's knowledge of the world! 
+ +```python +prime_minister_jan2014.node_print(index="qdrant", preview_count=10) +``` + +- Under the category: + POLITICS: + Robbing Main Street to Prop Up Wall Street: Why Jerry Brown's Rainy Day Fund Is a Bad Idea + There is no need to sequester funds urgently needed by Main Street to pay for Wall Street's malfeasance. Californians can have their cake and eat it too - with a state-owned bank. + +- Under the category: + WORLDPOST: + Cities Need To Get Smarter -- And India's On It + +- Under the category: + POLITICS: + It Takes Just 4 Charts To Show A Big Part Of What's Wrong With Congress + +- Under the category: + WORLD NEWS: + Arundhati Roy's New Novel Lays India Bare, Unveiling Worlds Within Our Worlds + Malavika Binny, Jawaharlal Nehru University Wearing two hats at once can be an uncomfortable fit, but it does not seem to + +- Under the category: + POLITICS: + The World Bank Must Commit to Food Security + Much will be said about bringing roads, electricity and infrastructure to underdeveloped regions. But how committed is the World Bank to the planet as a whole when it is doling out its loans? + +- Under the category: + WORLDPOST: + Former Prime Minister: Japan Should Shelve the Islands Dispute With China to Avoid A Spiral into Conflict + +- Under the category: + POLITICS: + Senate Delays Vote On $1.1 Trillion Spending Bill + +- Under the category: + WORLDPOST: + Sweden Election Results Offer Uncertain Future For Austerity + +- Under the category: + THE WORLDPOST: + Greece Demands IMF Explain 'Disaster' Remarks In Explosive Leak + A letter from Greek prime minister Alexis Tsipras questions whether the country "can trust" the lender. + +- Under the category: + WORLDPOST: + Comedians Send Powerful Message Against Sexual Harassment In India + +```python +prime_minister_jan2014.node_print(index="recency", preview_count=1) +``` + +- Under the category: + WORLD NEWS: + Arundhati Roy's New Novel Lays India Bare, Unveiling Worlds Within Our Worlds + Malavika Binny, Jawaharlal Nehru University Wearing two hats at once can be an uncomfortable fit, but it does not seem to + +```python +prime_minister_jan2014.node_print(index="reranking", preview_count=10) +``` + +- Under the category: + WORLDPOST: + Comedians Send Powerful Message Against Sexual Harassment In India + +- Under the category: + WORLD NEWS: + Arundhati Roy's New Novel Lays India Bare, Unveiling Worlds Within Our Worlds + Malavika Binny, Jawaharlal Nehru University Wearing two hats at once can be an uncomfortable fit, but it does not seem to + +- Under the category: + POLITICS: + It Takes Just 4 Charts To Show A Big Part Of What's Wrong With Congress + +- Under the category: + WORLDPOST: + Sweden Election Results Offer Uncertain Future For Austerity + +- Under the category: + WORLDPOST: + Cities Need To Get Smarter -- And India's On It + +- Under the category: + WORLDPOST: + Former Prime Minister: Japan Should Shelve the Islands Dispute With China to Avoid A Spiral into Conflict + +- Under the category: + POLITICS: + Senate Delays Vote On $1.1 Trillion Spending Bill + +- Under the category: + POLITICS: + The World Bank Must Commit to Food Security + Much will be said about bringing roads, electricity and infrastructure to underdeveloped regions. But how committed is the World Bank to the planet as a whole when it is doling out its loans? 
+ +- Under the category: + POLITICS: + Robbing Main Street to Prop Up Wall Street: Why Jerry Brown's Rainy Day Fund Is a Bad Idea + There is no need to sequester funds urgently needed by Main Street to pay for Wall Street's malfeasance. Californians can have their cake and eat it too - with a state-owned bank. + +- Under the category: + THE WORLDPOST: + Greece Demands IMF Explain 'Disaster' Remarks In Explosive Leak + A letter from Greek prime minister Alexis Tsipras questions whether the country "can trust" the lender. + +# Recap + +- 1️⃣ Crafting a Q&A bot with LlamaIndex and Qdrant + - We dumped a news dataset, kicked up a Qdrant client, and stuffed our data into a LlamaIndex +- 2️⃣ Keeping our Q&A bot fresh and cranking up the ranking goodness + - We used a recency postprocessor and a Cohere reranking postprocessor, and put them to work building different query engines +- 3️⃣ Using Node Sources in Llama Index to dig into the Q&A trails + - We threw a bunch of questions at these engines and saw how they stacked up! + +We figured out that recency postprocessing has its perks, but it can leave us hanging when we narrow down the info too much. Plugging in a reranking postprocessor like Cohere can help sort the responses better. diff --git a/qdrant-landing/content/documentation/201-intermediate/bq_with_qdrant.md b/qdrant-landing/content/documentation/201-intermediate/bq_with_qdrant.md new file mode 100644 index 000000000..fa23ac15b --- /dev/null +++ b/qdrant-landing/content/documentation/201-intermediate/bq_with_qdrant.md @@ -0,0 +1,326 @@ +--- +notebook_path: 201-intermediate/binary-quantization-qdrant/bq_with_qdrant.ipynb +reading_time_min: 10 +title: Binary Quantization with Qdrant +--- + +# Binary Quantization with Qdrant + +This notebook demonstrates/evaluates the search performance of Qdrant with Binary Quantization. We will use [Qdrant Cloud](https://qdrant.to/cloud?utm_source=qdrant&utm_medium=social&utm_campaign=binary-openai-v3&utm_content=article) to index and search the embeddings. This demo can be carried out on a free-tier Qdrant cluster as well. + +# Set Up Binary Quantization + +Let's install the 2 Python packages we'll work with. + +```python +%pip install qdrant-client datasets +``` + +For the demo, We use samples from the [Qdrant/dbpedia-entities-openai3-text-embedding-3-small-1536-100K](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-small-1536-100K) dataset. The dataset includes embeddings generated using OpenAI's `text-embedding-3-small` model. + +You can use your own datasets for this evaluation by adjusting the config values below. + +We select 100 records at random from the dataset. We then use the embeddings of the queries to search for the nearest neighbors in the dataset. + +## Configure Credentials + +```python +# QDRANT CONFIG +URL = "https://xyz-example.eu-central.aws.cloud.qdrant.io:6333" +API_KEY = "" +COLLECTION_NAME = "bq-evaluation" + +# EMBEDDING CONFIG +DATASET_NAME = "Qdrant/dbpedia-entities-openai3-text-embedding-3-small-1536-100K" +DIMENSIONS = 1536 +EMBEDDING_COLUMN_NAME = "text-embedding-3-small-1536-embedding" + +## UPLOAD CONFIG +BATCH_SIZE = 1024 # Batch size for uploading points +PARALLEL = 1 # Number of parallel processes for uploading points +``` + +## Setup A Qdrant Collection + +Let's create a Qdrant collection to index our vectors. We set `on_disk` in the vectors config to `True` offload the original vectors to disk to save memory. 
+ +```python +from datasets import load_dataset +from qdrant_client import QdrantClient, models +import logging + +from tqdm import tqdm + +logging.basicConfig(level=logging.INFO) + +client = QdrantClient(url=URL, api_key=API_KEY) + +if not client.collection_exists(COLLECTION_NAME): + client.create_collection( + collection_name=COLLECTION_NAME, + vectors_config=models.VectorParams( + size=DIMENSIONS, distance=models.Distance.COSINE, on_disk=True + ), + quantization_config=models.BinaryQuantization( + binary=models.BinaryQuantizationConfig(always_ram=False), + ), + ) + logging.info(f"Created collection {COLLECTION_NAME}") +else: + collection_info = client.get_collection(collection_name=COLLECTION_NAME) + logging.info( + f"Collection {COLLECTION_NAME} already exists with {collection_info.points_count} points." + ) + +logging.info("Loading Dataset") +dataset = load_dataset( + DATASET_NAME, + split="train", +) +logging.info(f"Loaded {DATASET_NAME} dataset") + +logging.info("Loading Points") +points = [ + models.PointStruct(id=i, vector=embedding) + for i, embedding in enumerate(dataset[EMBEDDING_COLUMN_NAME]) +] +logging.info(f"Loaded {len(points)} points") + +logging.info("Uploading Points") +client.upload_points(COLLECTION_NAME, points=tqdm(points), batch_size=BATCH_SIZE) +logging.info(f"Collection {COLLECTION_NAME} is ready") +``` + +## Evaluate Results + +### Parameters: Oversampling, Rescoring, and Search Limits + +For each record, we run a parameter sweep over the number of oversampling, rescoring, and search limits. We can then understand the impact of these parameters on search accuracy and efficiency. Our experiment was designed to assess the impact of Binary Quantization under various conditions, based on the following parameters: + +- **Oversampling**: By oversampling, we can limit the loss of information inherent in quantization. We experimented with different oversampling factors, and identified the impact on the accuracy and efficiency of search. Spoiler: higher oversampling factors tend to improve the accuracy of searches. However, they usually require more computational resources. + +- **Rescoring**: Rescoring refines the first results of an initial binary search. This process leverages the original high-dimensional vectors to refine the search results, **always** improving accuracy. We toggled rescoring on and off to measure effectiveness, when combined with Binary Quantization. We also measured the impact on search performance. + +- **Search Limits**: We specify the number of results from the search process. We experimented with various search limits to measure their impact the accuracy and efficiency. We explored the trade-offs between search depth and performance. The results provide insight for applications with different precision and speed requirements. + +# Parameterized Search + +We will compare the exact search performance with the approximate search performance. 
+ +```python +def parameterized_search( + point, + oversampling: float, + rescore: bool, + exact: bool, + collection_name: str, + ignore: bool = False, + limit: int = 10, +): + if exact: + return client.query_points( + collection_name=collection_name, + query=point.vector, + search_params=models.SearchParams(exact=exact), + limit=limit, + ).points + else: + return client.query_points( + collection_name=collection_name, + query=point.vector, + search_params=models.SearchParams( + quantization=models.QuantizationSearchParams( + ignore=ignore, + rescore=rescore, + oversampling=oversampling, + ), + exact=exact, + ), + limit=limit, + ).points +``` + +
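Before running the full sweep below, it can help to sanity-check the helper with a single query. This is a minimal sketch that assumes the `client`, `dataset`, and config values defined earlier; the `overlap` value it prints is the same accuracy measure the grid search records (`sample`, `exact_hits`, `approx_hits`, and `overlap` are illustrative names, not part of the original notebook).

```python
# Minimal sanity check: compare exact vs. quantized search for one stored embedding.
sample = models.PointStruct(id=0, vector=dataset[EMBEDDING_COLUMN_NAME][0])

exact_hits = parameterized_search(
    point=sample, oversampling=2.0, rescore=True, exact=True,
    collection_name=COLLECTION_NAME, limit=10,
)
approx_hits = parameterized_search(
    point=sample, oversampling=2.0, rescore=True, exact=False,
    collection_name=COLLECTION_NAME, limit=10,
)

# Fraction of exact results that the quantized search also returned.
overlap = len({p.id for p in exact_hits} & {p.id for p in approx_hits}) / len(exact_hits)
print(f"exact vs. quantized overlap@10: {overlap:.2f}")
```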
+ +```python +import json +import random +import numpy as np + +oversampling_range = np.arange(1.0, 3.1, 1.0) +rescore_range = [True, False] + +ds = dataset.train_test_split(test_size=0.001, shuffle=True, seed=37)["test"] +ds = ds.to_pandas().to_dict(orient="records") + +results = [] +with open(f"{COLLECTION_NAME}.json", "w+") as f: + for element in tqdm(ds): + point = models.PointStruct( + id=random.randint(0, 100000), + vector=element[EMBEDDING_COLUMN_NAME], + ) + ## Running Grid Search + for oversampling in oversampling_range: + for rescore in rescore_range: + limit_range = [100, 50, 20, 10, 5] + for limit in limit_range: + try: + exact = parameterized_search( + point=point, + oversampling=oversampling, + rescore=rescore, + exact=True, + collection_name=COLLECTION_NAME, + limit=limit, + ) + hnsw = parameterized_search( + point=point, + oversampling=oversampling, + rescore=rescore, + exact=False, + collection_name=COLLECTION_NAME, + limit=limit, + ) + except Exception as e: + print(f"Skipping point: {point}\n{e}") + continue + + exact_ids = [item.id for item in exact] + hnsw_ids = [item.id for item in hnsw] + + accuracy = len(set(exact_ids) & set(hnsw_ids)) / len(exact_ids) + + result = { + "query_id": point.id, + "oversampling": oversampling, + "rescore": rescore, + "limit": limit, + "accuracy": accuracy, + } + f.write(json.dumps(result)) + f.write("\n") +``` + +## View The Results + +We can now tabulate our results across the ranges of oversampling and rescoring. + +```python +import pandas as pd + +results = pd.read_json(f"{COLLECTION_NAME}.json", lines=True) + +average_accuracy = results[results["limit"] != 1] +average_accuracy = average_accuracy[average_accuracy["limit"] != 5] +average_accuracy = average_accuracy.groupby(["oversampling", "rescore", "limit"])[ + "accuracy" +].mean() +average_accuracy = average_accuracy.reset_index() + +acc = average_accuracy.pivot( + index="limit", columns=["oversampling", "rescore"], values="accuracy" +) +``` + +
+ +```python +from IPython.display import display, HTML + +display(HTML(acc.to_html())) +``` + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Average accuracy by search `limit` (rows) for each `oversampling` / `rescore` combination (columns):

| limit | 1.0 / False | 1.0 / True | 2.0 / False | 2.0 / True | 3.0 / False | 3.0 / True |
|------:|------------:|-----------:|------------:|-----------:|------------:|-----------:|
| 10    | 0.6950      | 0.9430     | 0.6950      | 0.9860     | 0.6950      | 0.9930     |
| 20    | 0.7050      | 0.9520     | 0.7050      | 0.9860     | 0.7050      | 0.9915     |
| 50    | 0.6962      | 0.9546     | 0.6962      | 0.9838     | 0.6968      | 0.9926     |
| 100   | 0.6991      | 0.9561     | 0.7003      | 0.9904     | 0.7007      | 0.9964     |
+ +## Results + +Here are some key observations, which analyzes the impact of rescoring (`True` or `False`): + +1. **Significantly Improved Accuracy**: + + - Enabling rescoring (`True`) consistently results in higher accuracy scores compared to when rescoring is disabled (`False`). + - The improvement in accuracy is true across various search limits (10, 20, 50, 100). + +1. **Model and Dimension Specific Observations**: + + - Th results suggest a diminishing return on accuracy improvement with higher oversampling in lower dimension spaces. + +1. **Influence of Search Limit**: + + - The performance gain from rescoring seems to be relatively stable across different search limits, suggesting that rescoring consistently enhances accuracy regardless of the number of top results considered. + +In summary, enabling rescoring dramatically improves search accuracy across all tested configurations. It is crucial feature for applications where precision is paramount. The consistent performance boost provided by rescoring underscores its value in refining search results, particularly when working with complex, high-dimensional data. This enhancement is critical for applications that demand high accuracy, such as semantic search, content discovery, and recommendation systems, where the quality of search results directly impacts user experience and satisfaction. + +## Leveraging Binary Quantization: Best Practices + +We recommend the following best practices for leveraging Binary Quantization: + +1. Oversampling: Use an oversampling factor of 3 for the best balance between accuracy and efficiency. This factor is suitable for a wide range of applications. +1. Rescoring: Enable rescoring to improve the accuracy of search results. +1. RAM: Store the full vectors and payload on disk. Limit what you load from memory to the binary quantization index. This helps reduce the memory footprint and improve the overall efficiency of the system. The incremental latency from the disk read is negligible compared to the latency savings from the binary scoring in Qdrant, which uses SIMD instructions where possible. diff --git a/qdrant-landing/content/documentation/201-intermediate/self-query.md b/qdrant-landing/content/documentation/201-intermediate/self-query.md new file mode 100644 index 000000000..3a41da1a7 --- /dev/null +++ b/qdrant-landing/content/documentation/201-intermediate/self-query.md @@ -0,0 +1,483 @@ +--- +notebook_path: 201-intermediate/self-query/self-query.ipynb +reading_time_min: 17 +title: Loading and cleaning data +--- + +```python +import os +import warnings +import pandas as pd +from qdrant_client import models, QdrantClient +from sentence_transformers import SentenceTransformer +from dotenv import load_dotenv + +warnings.filterwarnings("ignore") +encoder = SentenceTransformer("all-MiniLM-L6-v2") +load_dotenv() + +client = QdrantClient(os.getenv("QDRANT_HOST"), api_key=os.getenv("QDRANT_API_KEY")) +``` + +# Loading and cleaning data + +This [dataset](https://www.kaggle.com/datasets/zynicide/wine-reviews) contains approximately 130k reviews from the Wine Enthusiast + +Once cleaned we will have around 120k. + +```python +df = pd.read_csv("winemag-data-130k-v2.csv") +``` + +
+ +```python +wines = df.copy() +wines = wines.drop( + [ + "Unnamed: 0", + "designation", + "province", + "region_1", + "region_2", + "taster_name", + "taster_twitter_handle", + "winery", + ], + axis=1, +) +wines = wines.dropna(subset=["country", "price", "variety"]) +``` + +
+ +```python +wines.head() +``` + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
|   | country | description | points | price | title | variety |
|---|---------|-------------|-------:|------:|-------|---------|
| 1 | Portugal | This is ripe and fruity, a wine that is smooth... | 87 | 15.0 | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red |
| 2 | US | Tart and snappy, the flavors of lime flesh and... | 87 | 14.0 | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris |
| 3 | US | Pineapple rind, lemon pith and orange blossom ... | 87 | 13.0 | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling |
| 4 | US | Much like the regular bottling from 2012, this... | 87 | 65.0 | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir |
| 5 | Spain | Blackberry and raspberry aromas show a typical... | 87 | 15.0 | Tandem 2011 Ars In Vitro Tempranillo-Merlot (N... | Tempranillo-Merlot |
+
+ +```python +wines.info() +``` + +``` + +Index: 120915 entries, 1 to 129970 +Data columns (total 6 columns): + # Column Non-Null Count Dtype +--- ------ -------------- ----- + 0 country 120915 non-null object + 1 description 120915 non-null object + 2 points 120915 non-null int64 + 3 price 120915 non-null float64 + 4 title 120915 non-null object + 5 variety 120915 non-null object +dtypes: float64(1), int64(1), object(4) +memory usage: 6.5+ MB +``` + +# Create a collection + +```python +client.create_collection( + collection_name="wine_reviews", + vectors_config=models.VectorParams( + size=encoder.get_sentence_embedding_dimension(), + distance=models.Distance.COSINE, + ), +) +``` + +
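Since we will later filter on `metadata.country`, `metadata.price`, and `metadata.points`, it can also be worth creating payload indexes for those fields. This is an optional sketch (the demo works without it) using the same `client` and collection as above:

```python
# Optional: payload indexes speed up filtered searches on these metadata fields.
client.create_payload_index(
    collection_name="wine_reviews",
    field_name="metadata.country",
    field_schema=models.PayloadSchemaType.KEYWORD,
)
client.create_payload_index(
    collection_name="wine_reviews",
    field_name="metadata.price",
    field_schema=models.PayloadSchemaType.FLOAT,
)
client.create_payload_index(
    collection_name="wine_reviews",
    field_name="metadata.points",
    field_schema=models.PayloadSchemaType.INTEGER,
)
```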
+ +```python +# Document class to structure data +class Document: + def __init__(self, page_content, metadata): + self.page_content = page_content + self.metadata = metadata + + +# Convert DataFrame rows into Document objects +def df_to_documents(df): + documents = [] + for _, row in df.iterrows(): + metadata = { + "country": row["country"], + "points": row["points"], + "price": row["price"], + "title": row["title"], + "variety": row["variety"], + } + document = Document(page_content=row["description"], metadata=metadata) + documents.append(document) + return documents + + +docs = df_to_documents(wines) +``` + +
+ +```python +points = [ + models.PointStruct( + id=idx, + vector=encoder.encode(doc.page_content).tolist(), + payload={"metadata": doc.metadata, "page_content": doc.page_content}, + ) + for idx, doc in enumerate(docs) +] +``` + +
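Encoding roughly 120k descriptions one call at a time is slow. As an alternative sketch (reusing the `encoder`, `docs`, and `models` objects from above; `descriptions` and `vectors` are illustrative names), you could batch-encode the descriptions first and then build the points:

```python
# Batch-encode all descriptions in one call; SentenceTransformer batches internally.
descriptions = [doc.page_content for doc in docs]
vectors = encoder.encode(descriptions, batch_size=256, show_progress_bar=True)

points = [
    models.PointStruct(
        id=idx,
        vector=vector.tolist(),
        payload={"metadata": doc.metadata, "page_content": doc.page_content},
    )
    for idx, (doc, vector) in enumerate(zip(docs, vectors))
]
```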
+ +```python +client.upload_points( + collection_name="wine_reviews", + points=points, +) +``` + +# Test search + +```python +hits = client.search( + collection_name="wine_reviews", + query_vector=encoder.encode("Quinta dos Avidagos 2011").tolist(), + limit=3, +) + +for hit in hits: + print(hit.payload["metadata"]["title"], "score:", hit.score) +``` + +``` +Aveleda 2010 Follies Quinta da Agueira Touriga Nacional (Beiras) score: 0.46982175 +Quinta da Romaneira 2013 Sino da Romaneira Red (Douro) score: 0.43031913 +Quinta da Romaneira 2013 Sino da Romaneira Red (Douro) score: 0.43031913 +``` + +# Test filtering + +```python +# query filter +hits = client.search( + collection_name="wine_reviews", + query_vector=encoder.encode("Night Sky").tolist(), + query_filter=models.Filter( + must=[ + models.FieldCondition( + key="metadata.country", match=models.MatchValue(value="US") + ), + models.FieldCondition( + key="metadata.price", range=models.Range(gte=15.0, lte=30.0) + ), + models.FieldCondition( + key="metadata.points", range=models.Range(gte=90, lte=100) + ), + ] + ), + limit=3, +) + +for hit in hits: + print( + hit.payload["metadata"]["title"], + "\nprice:", + hit.payload["metadata"]["price"], + "\npoints:", + hit.payload["metadata"]["points"], + "\n\n", + ) +``` + +``` +Ballentine 2010 Fig Tree Vineyard Petite Sirah (St. Helena) +price: 28.0 +points: 91 + + +Seven Angels 2012 St. Peter of Alcantara Vineyard Zinfandel (Paso Robles) +price: 29.0 +points: 92 + + +Jamieson Canyon 1999 Cabernet Sauvignon (Napa Valley) +price: 20.0 +points: 91 +``` + +# Self-querying with LangChain + +```python +from langchain.chains.query_constructor.base import AttributeInfo +from langchain.retrievers.self_query.base import SelfQueryRetriever +from langchain_community.embeddings import HuggingFaceEmbeddings +from langchain.callbacks.tracers import ConsoleCallbackHandler +from langchain_openai import ChatOpenAI +from langchain_qdrant import Qdrant + +handler = ConsoleCallbackHandler() +llm = ChatOpenAI(temperature=0, model="gpt-4o") +# llm = OpenAI(temperature=0) + +embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2") +vectorstore = Qdrant(client, collection_name="wine_reviews", embeddings=embeddings) +``` + +
+ +```python +metadata_field_info = [ + AttributeInfo( + name="country", + description="The country that the wine is from", + type="string", + ), + AttributeInfo( + name="points", + description="The number of points WineEnthusiast rated the wine on a scale of 1-100", + type="integer", + ), + AttributeInfo( + name="price", + description="The cost for a bottle of the wine", + type="float", + ), + AttributeInfo( + name="variety", + description="The grapes used to make the wine", + type="string", + ), +] + +document_content_description = "Brief description of the wine" + +retriever = SelfQueryRetriever.from_llm( + llm, vectorstore, document_content_description, metadata_field_info +) +``` + +
+ +```python +response = retriever.invoke( + "Which US wines are priced between 15 and 30 and have points above 90?" +) +response +``` + +``` +[Document(page_content='An outstanding value, the latest release of this wine dazzles with bold, black cherry and chocolate mocha flavors. The focus and definition throughout are exceptional also. This is a gem at a more than fair tariff.', metadata={'country': 'US', 'points': 91, 'price': 28.0, 'title': 'Dobbes Family Estate 2014 Grand Assemblage Pinot Noir (Willamette Valley)', 'variety': 'Pinot Noir', '_id': 10604, '_collection_name': 'wine_reviews'}), + Document(page_content='This is an amazingly fresh and fruity tank-fermented wine, imparting a subtle hint of grass before unleashing sublime layers of melon and apricot alongside measured, zesty acidity. New winemaker Chris Kajani is taking things in a refreshing, aim-for-the-top direction with this bottling.', metadata={'country': 'US', 'points': 92, 'price': 30.0, 'title': "Bouchaine 2013 ChΓͺne d'Argent Estate Vineyard Chardonnay (Carneros)", 'variety': 'Chardonnay', '_id': 102251, '_collection_name': 'wine_reviews'}), + Document(page_content="A streak of confectionary nougat and lemony acidity combine for a smooth, well-integrated wine, full bodied in style, that's lip-smacking in apple-cider juiciness on the finish.", metadata={'country': 'US', 'points': 92, 'price': 25.0, 'title': 'Conn Creek 2014 Chardonnay (Carneros)', 'variety': 'Chardonnay', '_id': 100685, '_collection_name': 'wine_reviews'}), + Document(page_content='Rick Longoria shows increasing mastery over this popular variety, lifting it into true complexity. After an outstanding 2010 vintage, his 2011 is even better, showing the same crisp acidity and savory orange, apricot and honey flavors, but with even greater elegance.', metadata={'country': 'US', 'points': 91, 'price': 19.0, 'title': 'Longoria 2011 Pinot Grigio (Santa Barbara County)', 'variety': 'Pinot Grigio', '_id': 105297, '_collection_name': 'wine_reviews'})] +``` + +```python +for resp in response: + print( + resp.metadata["title"], + "\n price:", + resp.metadata["price"], + "points:", + resp.metadata["points"], + "\n\n", + ) +``` + +``` +Dobbes Family Estate 2014 Grand Assemblage Pinot Noir (Willamette Valley) + price: 28.0 points: 91 + + +Bouchaine 2013 ChΓͺne d'Argent Estate Vineyard Chardonnay (Carneros) + price: 30.0 points: 92 + + +Conn Creek 2014 Chardonnay (Carneros) + price: 25.0 points: 92 + + +Longoria 2011 Pinot Grigio (Santa Barbara County) + price: 19.0 points: 91 +``` + +# Tracing to see filters in action + +```python +retriever.invoke( + "Which US wines are priced between 15 and 30 and have points above 90?", + {"callbacks": [handler]}, +) +``` + +```` +[chain/start] [retriever:Retriever > chain:query_constructor] Entering Chain run with input: +{ + "query": "Which US wines are priced between 15 and 30 and have points above 90?" +} +[chain/start] [retriever:Retriever > chain:query_constructor > prompt:FewShotPromptTemplate] Entering Prompt run with input: +{ + "query": "Which US wines are priced between 15 and 30 and have points above 90?" 
+} +[chain/end] [retriever:Retriever > chain:query_constructor > prompt:FewShotPromptTemplate] [1ms] Exiting Prompt run with output: +[outputs] +[llm/start] [retriever:Retriever > chain:query_constructor > llm:ChatOpenAI] Entering LLM run with input: +{ + "prompts": [ + "Human: Your goal is to structure the user's query to match the request schema provided below.\n\n<< Structured Request Schema >>\nWhen responding use a markdown code snippet with a JSON object formatted in the following schema:\n\n```json\n{\n \"query\": string \\ text string to compare to document contents\n \"filter\": string \\ logical condition statement for filtering documents\n}\n```\n\nThe query string should contain only text that is expected to match the contents of documents. Any conditions in the filter should not be mentioned in the query as well.\n\nA logical condition statement is composed of one or more comparison and logical operation statements.\n\nA comparison statement takes the form: `comp(attr, val)`:\n- `comp` (eq | lt | lte | gt | gte | like): comparator\n- `attr` (string): name of attribute to apply the comparison to\n- `val` (string): is the comparison value\n\nA logical operation statement takes the form `op(statement1, statement2, ...)`:\n- `op` (and | or | not): logical operator\n- `statement1`, `statement2`, ... (comparison statements or logical operation statements): one or more statements to apply the operation to\n\nMake sure that you only use the comparators and logical operators listed above and no others.\nMake sure that filters only refer to attributes that exist in the data source.\nMake sure that filters only use the attributed names with its function names if there are functions applied on them.\nMake sure that filters only use format `YYYY-MM-DD` when handling date data typed values.\nMake sure that filters take into account the descriptions of attributes and only make comparisons that are feasible given the type of data being stored.\nMake sure that filters are only used as needed. If there are no filters that should be applied return \"NO_FILTER\" for the filter value.\n\n<< Example 1. >>\nData Source:\n```json\n{\n \"content\": \"Lyrics of a song\",\n \"attributes\": {\n \"artist\": {\n \"type\": \"string\",\n \"description\": \"Name of the song artist\"\n },\n \"length\": {\n \"type\": \"integer\",\n \"description\": \"Length of the song in seconds\"\n },\n \"genre\": {\n \"type\": \"string\",\n \"description\": \"The song genre, one of \"pop\", \"rock\" or \"rap\"\"\n }\n }\n}\n```\n\nUser Query:\nWhat are songs by Taylor Swift or Katy Perry about teenage romance under 3 minutes long in the dance pop genre\n\nStructured Request:\n```json\n{\n \"query\": \"teenager love\",\n \"filter\": \"and(or(eq(\\\"artist\\\", \\\"Taylor Swift\\\"), eq(\\\"artist\\\", \\\"Katy Perry\\\")), lt(\\\"length\\\", 180), eq(\\\"genre\\\", \\\"pop\\\"))\"\n}\n```\n\n\n<< Example 2. >>\nData Source:\n```json\n{\n \"content\": \"Lyrics of a song\",\n \"attributes\": {\n \"artist\": {\n \"type\": \"string\",\n \"description\": \"Name of the song artist\"\n },\n \"length\": {\n \"type\": \"integer\",\n \"description\": \"Length of the song in seconds\"\n },\n \"genre\": {\n \"type\": \"string\",\n \"description\": \"The song genre, one of \"pop\", \"rock\" or \"rap\"\"\n }\n }\n}\n```\n\nUser Query:\nWhat are songs that were not published on Spotify\n\nStructured Request:\n```json\n{\n \"query\": \"\",\n \"filter\": \"NO_FILTER\"\n}\n```\n\n\n<< Example 3. 
>>\nData Source:\n```json\n{\n \"content\": \"Brief description of the wine\",\n \"attributes\": {\n \"country\": {\n \"description\": \"The country that the wine is from\",\n \"type\": \"string\"\n },\n \"points\": {\n \"description\": \"The number of points WineEnthusiast rated the wine on a scale of 1-100\",\n \"type\": \"integer\"\n },\n \"price\": {\n \"description\": \"The cost for a bottle of the wine\",\n \"type\": \"float\"\n },\n \"variety\": {\n \"description\": \"The grapes used to make the wine\",\n \"type\": \"string\"\n }\n}\n}\n```\n\nUser Query:\nWhich US wines are priced between 15 and 30 and have points above 90?\n\nStructured Request:" + ] +} +[llm/end] [retriever:Retriever > chain:query_constructor > llm:ChatOpenAI] [3.00s] Exiting LLM run with output: +{ + "generations": [ + [ + { + "text": "```json\n{\n \"query\": \"\",\n \"filter\": \"and(eq(\\\"country\\\", \\\"US\\\"), gte(\\\"price\\\", 15), lte(\\\"price\\\", 30), gt(\\\"points\\\", 90))\"\n}\n```", + "generation_info": { + "finish_reason": "stop", + "logprobs": null + }, + "type": "ChatGeneration", + "message": { + "lc": 1, + "type": "constructor", + "id": [ + "langchain", + "schema", + "messages", + "AIMessage" + ], + "kwargs": { + "content": "```json\n{\n \"query\": \"\",\n \"filter\": \"and(eq(\\\"country\\\", \\\"US\\\"), gte(\\\"price\\\", 15), lte(\\\"price\\\", 30), gt(\\\"points\\\", 90))\"\n}\n```", + "response_metadata": { + "token_usage": { + "completion_tokens": 49, + "prompt_tokens": 922, + "total_tokens": 971 + }, + "model_name": "gpt-4o", + "system_fingerprint": "fp_729ea513f7", + "finish_reason": "stop", + "logprobs": null + }, + "type": "ai", + "id": "run-804927ef-53b2-4236-9c22-15e4913667f5-0", + "tool_calls": [], + "invalid_tool_calls": [] + } + } + } + ] + ], + "llm_output": { + "token_usage": { + "completion_tokens": 49, + "prompt_tokens": 922, + "total_tokens": 971 + }, + "model_name": "gpt-4o", + "system_fingerprint": "fp_729ea513f7" + }, + "run": null +} +[chain/start] [retriever:Retriever > chain:query_constructor > parser:StructuredQueryOutputParser] Entering Parser run with input: +[inputs] +[chain/end] [retriever:Retriever > chain:query_constructor > parser:StructuredQueryOutputParser] [4ms] Exiting Parser run with output: +[outputs] +[chain/end] [retriever:Retriever > chain:query_constructor] [3.01s] Exiting Chain run with output: +[outputs] + + + + + +[Document(page_content='An outstanding value, the latest release of this wine dazzles with bold, black cherry and chocolate mocha flavors. The focus and definition throughout are exceptional also. This is a gem at a more than fair tariff.', metadata={'country': 'US', 'points': 91, 'price': 28.0, 'title': 'Dobbes Family Estate 2014 Grand Assemblage Pinot Noir (Willamette Valley)', 'variety': 'Pinot Noir', '_id': 10604, '_collection_name': 'wine_reviews'}), + Document(page_content='This is an amazingly fresh and fruity tank-fermented wine, imparting a subtle hint of grass before unleashing sublime layers of melon and apricot alongside measured, zesty acidity. 
New winemaker Chris Kajani is taking things in a refreshing, aim-for-the-top direction with this bottling.', metadata={'country': 'US', 'points': 92, 'price': 30.0, 'title': "Bouchaine 2013 ChΓͺne d'Argent Estate Vineyard Chardonnay (Carneros)", 'variety': 'Chardonnay', '_id': 102251, '_collection_name': 'wine_reviews'}), + Document(page_content="A streak of confectionary nougat and lemony acidity combine for a smooth, well-integrated wine, full bodied in style, that's lip-smacking in apple-cider juiciness on the finish.", metadata={'country': 'US', 'points': 92, 'price': 25.0, 'title': 'Conn Creek 2014 Chardonnay (Carneros)', 'variety': 'Chardonnay', '_id': 100685, '_collection_name': 'wine_reviews'}), + Document(page_content='Rick Longoria shows increasing mastery over this popular variety, lifting it into true complexity. After an outstanding 2010 vintage, his 2011 is even better, showing the same crisp acidity and savory orange, apricot and honey flavors, but with even greater elegance.', metadata={'country': 'US', 'points': 91, 'price': 19.0, 'title': 'Longoria 2011 Pinot Grigio (Santa Barbara County)', 'variety': 'Pinot Grigio', '_id': 105297, '_collection_name': 'wine_reviews'})] +```` + +```python + +``` diff --git a/qdrant-landing/content/documentation/301-advanced/Qdrant_data_prep.md b/qdrant-landing/content/documentation/301-advanced/Qdrant_data_prep.md new file mode 100644 index 000000000..300de7e7f --- /dev/null +++ b/qdrant-landing/content/documentation/301-advanced/Qdrant_data_prep.md @@ -0,0 +1,1035 @@ +--- +notebook_path: 301-advanced/data_prep_webinar/Qdrant_data_prep.ipynb +reading_time_min: 60 +title: 'Prerequsites:' +--- + +# Prerequsites: + +## Qdrant Cloud Account + +## Open AI API Key + +drawing + +![alt text](documentation/301-advanced/Qdrant_data_prep/900.jpg) + +Credit: + +## What will I learn today? + +1. What is a vector database. +1. Why you can't just put your raw data in a vector database. +1. What kind of data you can use with Qdrant and some use cases for each modality. +1. What is chunking, an overview of 5 different chunking methods, and a comparison of semantic and fixed chunking. +1. What is FastEmbed and what kind of embeddings does it work with. +1. How to put your data in Qdrant, and a demonstration using Qdrant Cloud + +## What's a vector database? + +![alt text](documentation/301-advanced/Qdrant_data_prep/qdrant_overview_high_level.png) + +**** ![Screenshot 2024-08-27 at 12.34.22β€―PM.png](documentation/301-advanced/Qdrant_data_prep/bf78579e67de42fc885b21ab3f1a38e4.png) + +## What kind of data can you use with Qdrant? + +![alt text](documentation/301-advanced/Qdrant_data_prep/How-Embeddings-Work.jpg) + +### Text + +![alt text](documentation/301-advanced/Qdrant_data_prep/How-Do-Embeddings-Work_.jpg) + +#### Books, Tweets, Reddit posts + +Text can also be something like the lyrics of a song or the transcript of a youtube video. + +To get your text into a vdb, you'll first need to turn it into text. + +Some examples of applications that use text data in RAG could be a chatbot for customer service, a reccomendation engine for posts on social media, or language tutor to help people learn foreign languages. + +### Audio + +![alt text](documentation/301-advanced/Qdrant_data_prep/2-Figure1-1.png) + +#### You can also use Audio with Qdrant! Audio embeddings can capture more than the lyrics in a song, they can caputre the emotions or the volume of the audio as well! 
You can build a music recommendation system or an anomaly detection system for camera audio at a facility with audio embeddings.

### Images

![alt text](documentation/301-advanced/Qdrant_data_prep/5accaa76-6fad-437b-8fbb-94b9544c3789_image7.png)

Credit:

You can also use images with Qdrant. Why might you need to embed an image?

Let's say you're building an app to help doctors diagnose patients with skin cancer.

With the power of semantic search, medical professionals could enhance
their diagnostic capabilities and make more accurate decisions regarding skin disease diagnosis.

Another example might be image recommendation on a photo-sharing site.

## What is chunking and a few common chunking methods

![alt text](documentation/301-advanced/Qdrant_data_prep/1*yIMiJaQexgNqU3BXdR5WKg.png)

How do you eat an elephant? One bite at a time! Chunking is the process of breaking your data into smaller bites so that the model can eat it.

### Fixed Size Chunking

![Screenshot 2024-08-28 at 3.37.14 PM.png](documentation/301-advanced/Qdrant_data_prep/c61fb9ee0a64446c8016d82b3709f48c.png)

Concept: Your text is split into equal-sized segments, regardless of sentence or paragraph boundaries.

Application: Useful in scenarios where uniformity in chunk size is critical, like certain types of vector indexing or when memory constraints are strict.

Example: If you are building an application based on data where you know exactly how long each chunk will be, like a government form with a clear format.

### Recursive Chunking

![Screenshot 2024-08-28 at 3.38.09 PM.png](documentation/301-advanced/Qdrant_data_prep/927c14076d9f414c89abbcddb0b846d5.png)

Concept: The document is recursively divided into smaller chunks, starting with large sections like chapters, then splitting those into smaller units like paragraphs, then splitting those paragraphs into sentences.

Application: This method is great for hierarchical or nested documents, giving your application flexibility with chunking.

Example: A novel that has distinct chapters, or tells a non-chronological story.

### Document Based Chunking

![Screenshot 2024-08-28 at 3.41.46 PM.png](documentation/301-advanced/Qdrant_data_prep/2a5240721c8841188c22224f5a095d8c.png)

Concept: The entire document is treated as a single chunk or divided as little as possible.

Application: Best for tasks requiring the processing of large texts where breaking up the document too much would lose context, such as in legal, medical, or scientific document analysis.

Example: A legal document could be chunked by charges, with each charge treated as a chunk, maintaining the document's structural integrity.

### Semantic Chunking

![Untitled-2024-08-09-0903.png](documentation/301-advanced/Qdrant_data_prep/e38eea92d36f4d67958cef06acc29797.png)

Concept: The document is divided based on meaning, so that each chunk represents a complete idea or concept.

Application: Works well for tasks where preserving the meaning is important, such as in summarization or when generating embeddings for semantic search.

Example: A stream-of-consciousness writing session. A paragraph discussing a single idea might be kept as one chunk, while the next sentence discussing another idea could be a chunk. The next chunk could be a page.

### Agentic Chunking

![alt text](documentation/301-advanced/Qdrant_data_prep/1*aHXJ5wuWuh1faf_BF7i4og.png)

Concept: The document is chunked based on the actions or intentions of an agent.
Application: Useful in narrative analysis, dialogue systems, or any context where understanding the actions and motivations of agents is crucial.

Example: In Harry Potter, each chunk would capture the story arc of an individual character, making sure your application learns both the plot and the characters.

## How do you find the right chunking method for your model? Let's compare fixed-size chunking with semantic chunking!

```python
# pip install llama_index llama-index-readers-web llama-index-embeddings-openai
```
+ +```python +!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'pg_essay.txt' +``` + +``` +--2024-08-29 14:14:32-- https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt +Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ... +Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected. +HTTP request sent, awaiting response... 200 OK +Length: 75042 (73K) [text/plain] +Saving to: β€˜pg_essay.txt’ +``` + +pg_essay.txt 0%[ ] 0 --.-KB/s\ +pg_essay.txt 100%[===================>] 73.28K --.-KB/s in 0.02s + +``` +2024-08-29 14:14:33 (3.18 MB/s) - β€˜pg_essay.txt’ saved [75042/75042] +``` + +```python +from llama_index.core import SimpleDirectoryReader + +# load documents +documents = SimpleDirectoryReader(input_files=["pg_essay.txt"]).load_data() +``` + +
+ +```python +from llama_index.core.node_parser import ( + SentenceSplitter, + SemanticSplitterNodeParser, +) +from llama_index.embeddings.openai import OpenAIEmbedding + +import os + +os.environ["OPENAI_API_KEY"] = "insert OPEN AI API KEY" +``` + +
+ +```python +embed_model = OpenAIEmbedding() +splitter = SemanticSplitterNodeParser( + buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model +) + +# also baseline splitter +base_splitter = SentenceSplitter(chunk_size=512) +``` + +
+ +```python +nodes = splitter.get_nodes_from_documents(documents) +``` + +
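Before looking at individual chunks, it can be useful to see how many chunks each strategy produces and how long they are on average. This is a small sketch using the splitters defined above; `fixed_nodes` and `avg_chars` are illustrative names, and the fixed-size nodes are recomputed later in the walkthrough as `base_nodes`.

```python
# Compare chunk counts and average chunk sizes for the two splitters.
fixed_nodes = base_splitter.get_nodes_from_documents(documents)

def avg_chars(node_list):
    return sum(len(n.get_content()) for n in node_list) / len(node_list)

print(f"semantic chunks: {len(nodes)}, avg {avg_chars(nodes):.0f} chars")
print(f"fixed-size chunks: {len(fixed_nodes)}, avg {avg_chars(fixed_nodes):.0f} chars")
```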
+ +```python +print(nodes[1].get_content()) +``` + +``` +I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. + +The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines β€” CPU, disk drives, printer, card reader β€” sitting up on a raised floor under bright fluorescent lights. + +The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in the card reader and press a button to load the program into memory and run it. The result would ordinarily be to print something on the spectacularly loud printer. + +I was puzzled by the 1401. I couldn't figure out what to do with it. +``` + +```python +print(nodes[2].get_content()) +``` + +``` +And in retrospect there's not much I could have done with it. The only form of input to programs was data stored on punched cards, and I didn't have any data stored on punched cards. The only other option was to do things that didn't rely on any input, like calculate approximations of pi, but I didn't know enough math to do anything interesting of that type. So I'm not surprised I can't remember any programs I wrote, because they can't have done much. My clearest memory is of the moment I learned it was possible for programs not to terminate, when one of mine didn't. On a machine without time-sharing, this was a social as well as a technical error, as the data center manager's expression made clear. + +With microcomputers, everything changed. Now you could have a computer sitting right in front of you, on a desk, that could respond to your keystrokes as it was running instead of just churning through a stack of punch cards and then stopping. [1] + +The first of my friends to get a microcomputer built it himself. It was sold as a kit by Heathkit. I remember vividly how impressed and envious I felt watching him sitting in front of it, typing programs right into the computer. + +Computers were expensive in those days and it took me years of nagging before I convinced my father to buy one, a TRS-80, in about 1980. The gold standard then was the Apple II, but a TRS-80 was good enough. This was when I really started programming. I wrote simple games, a program to predict how high my model rockets would fly, and a word processor that my father used to write at least one book. There was only room in memory for about 2 pages of text, so he'd write 2 pages at a time and then print them out, but it was a lot better than a typewriter. + +Though I liked programming, I didn't plan to study it in college. In college I was going to study philosophy, which sounded much more powerful. It seemed, to my naive high school self, to be the study of the ultimate truths, compared to which the things studied in other fields would be mere domain knowledge. What I discovered when I got to college was that the other fields took up so much of the space of ideas that there wasn't much left for these supposed ultimate truths. All that seemed left for philosophy were edge cases that people in other fields felt could safely be ignored. 
+ +I couldn't have put this into words when I was 18. All I knew at the time was that I kept taking philosophy courses and they kept being boring. So I decided to switch to AI. + +AI was in the air in the mid 1980s, but there were two things especially that made me want to work on it: a novel by Heinlein called The Moon is a Harsh Mistress, which featured an intelligent computer called Mike, and a PBS documentary that showed Terry Winograd using SHRDLU. I haven't tried rereading The Moon is a Harsh Mistress, so I don't know how well it has aged, but when I read it I was drawn entirely into its world. It seemed only a matter of time before we'd have Mike, and when I saw Winograd using SHRDLU, it seemed like that time would be a few years at most. All you had to do was teach SHRDLU more words. + +There weren't any classes in AI at Cornell then, not even graduate classes, so I started trying to teach myself. Which meant learning Lisp, since in those days Lisp was regarded as the language of AI. The commonly used programming languages then were pretty primitive, and programmers' ideas correspondingly so. The default language at Cornell was a Pascal-like language called PL/I, and the situation was similar elsewhere. Learning Lisp expanded my concept of a program so fast that it was years before I started to have a sense of where the new limits were. This was more like it; this was what I had expected college to do. It wasn't happening in a class, like it was supposed to, but that was ok. For the next couple years I was on a roll. +``` + +```python +print(nodes[3].get_content()) +``` + +``` +I knew what I was going to do. + +For my undergraduate thesis, I reverse-engineered SHRDLU. My God did I love working on that program. It was a pleasing bit of code, but what made it even more exciting was my belief β€” hard to imagine now, but not unique in 1985 β€” that it was already climbing the lower slopes of intelligence. + +I had gotten into a program at Cornell that didn't make you choose a major. You could take whatever classes you liked, and choose whatever you liked to put on your degree. I of course chose "Artificial Intelligence." When I got the actual physical diploma, I was dismayed to find that the quotes had been included, which made them read as scare-quotes. At the time this bothered me, but now it seems amusingly accurate, for reasons I was about to discover. + +I applied to 3 grad schools: MIT and Yale, which were renowned for AI at the time, and Harvard, which I'd visited because Rich Draves went there, and was also home to Bill Woods, who'd invented the type of parser I used in my SHRDLU clone. Only Harvard accepted me, so that was where I went. + +I don't remember the moment it happened, or if there even was a specific moment, but during the first year of grad school I realized that AI, as practiced at the time, was a hoax. By which I mean the sort of AI in which a program that's told "the dog is sitting on the chair" translates this into some formal representation and adds it to the list of things it knows. + +What these programs really showed was that there's a subset of natural language that's a formal language. But a very proper subset. It was clear that there was an unbridgeable gap between what they could do and actually understanding natural language. It was not, in fact, simply a matter of teaching SHRDLU more words. That whole way of doing AI, with explicit data structures representing concepts, was not going to work. 
Its brokenness did, as so often happens, generate a lot of opportunities to write papers about various band-aids that could be applied to it, but it was never going to get us Mike. + +So I looked around to see what I could salvage from the wreckage of my plans, and there was Lisp. I knew from experience that Lisp was interesting for its own sake and not just for its association with AI, even though that was the main reason people cared about it at the time. So I decided to focus on Lisp. In fact, I decided to write a book about Lisp hacking. It's scary to think how little I knew about Lisp hacking when I started writing that book. But there's nothing like writing a book about something to help you learn it. The book, On Lisp, wasn't published till 1993, but I wrote much of it in grad school. + +Computer Science is an uneasy alliance between two halves, theory and systems. The theory people prove things, and the systems people build things. I wanted to build things. I had plenty of respect for theory β€” indeed, a sneaking suspicion that it was the more admirable of the two halves β€” but building things seemed so much more exciting. + +The problem with systems work, though, was that it didn't last. Any program you wrote today, no matter how good, would be obsolete in a couple decades at best. People might mention your software in footnotes, but no one would actually use it. And indeed, it would seem very feeble work. Only people with a sense of the history of the field would even realize that, in its time, it had been good. + +There were some surplus Xerox Dandelions floating around the computer lab at one point. Anyone who wanted one to play around with could have one. I was briefly tempted, but they were so slow by present standards; what was the point? No one else wanted one either, so off they went. That was what happened to systems work. + +I wanted not just to build things, but to build things that would last. + +In this dissatisfied state I went in 1988 to visit Rich Draves at CMU, where he was in grad school. One day I went to visit the Carnegie Institute, where I'd spent a lot of time as a kid. While looking at a painting there I realized something that might seem obvious, but was a big surprise to me. There, right on the wall, was something you could make that would last. Paintings didn't become obsolete. Some of the best ones were hundreds of years old. + +And moreover this was something you could make a living doing. Not as easily as you could by writing software, of course, but I thought if you were really industrious and lived really cheaply, it had to be possible to make enough to survive. And as an artist you could be truly independent. You wouldn't have a boss, or even need to get research funding. + +I had always liked looking at paintings. Could I make them? I had no idea. +``` + +```python +base_nodes = base_splitter.get_nodes_from_documents(documents) +``` + +
+ +```python +print(base_nodes[1].get_content()) +``` + +``` +The only form of input to programs was data stored on punched cards, and I didn't have any data stored on punched cards. The only other option was to do things that didn't rely on any input, like calculate approximations of pi, but I didn't know enough math to do anything interesting of that type. So I'm not surprised I can't remember any programs I wrote, because they can't have done much. My clearest memory is of the moment I learned it was possible for programs not to terminate, when one of mine didn't. On a machine without time-sharing, this was a social as well as a technical error, as the data center manager's expression made clear. + +With microcomputers, everything changed. Now you could have a computer sitting right in front of you, on a desk, that could respond to your keystrokes as it was running instead of just churning through a stack of punch cards and then stopping. [1] + +The first of my friends to get a microcomputer built it himself. It was sold as a kit by Heathkit. I remember vividly how impressed and envious I felt watching him sitting in front of it, typing programs right into the computer. + +Computers were expensive in those days and it took me years of nagging before I convinced my father to buy one, a TRS-80, in about 1980. The gold standard then was the Apple II, but a TRS-80 was good enough. This was when I really started programming. I wrote simple games, a program to predict how high my model rockets would fly, and a word processor that my father used to write at least one book. There was only room in memory for about 2 pages of text, so he'd write 2 pages at a time and then print them out, but it was a lot better than a typewriter. + +Though I liked programming, I didn't plan to study it in college. In college I was going to study philosophy, which sounded much more powerful. It seemed, to my naive high school self, to be the study of the ultimate truths, compared to which the things studied in other fields would be mere domain knowledge. What I discovered when I got to college was that the other fields took up so much of the space of ideas that there wasn't much left for these supposed ultimate truths. +``` + +```python +print(base_nodes[2].get_content()) +``` + +``` +This was when I really started programming. I wrote simple games, a program to predict how high my model rockets would fly, and a word processor that my father used to write at least one book. There was only room in memory for about 2 pages of text, so he'd write 2 pages at a time and then print them out, but it was a lot better than a typewriter. + +Though I liked programming, I didn't plan to study it in college. In college I was going to study philosophy, which sounded much more powerful. It seemed, to my naive high school self, to be the study of the ultimate truths, compared to which the things studied in other fields would be mere domain knowledge. What I discovered when I got to college was that the other fields took up so much of the space of ideas that there wasn't much left for these supposed ultimate truths. All that seemed left for philosophy were edge cases that people in other fields felt could safely be ignored. + +I couldn't have put this into words when I was 18. All I knew at the time was that I kept taking philosophy courses and they kept being boring. So I decided to switch to AI. 
+ +AI was in the air in the mid 1980s, but there were two things especially that made me want to work on it: a novel by Heinlein called The Moon is a Harsh Mistress, which featured an intelligent computer called Mike, and a PBS documentary that showed Terry Winograd using SHRDLU. I haven't tried rereading The Moon is a Harsh Mistress, so I don't know how well it has aged, but when I read it I was drawn entirely into its world. It seemed only a matter of time before we'd have Mike, and when I saw Winograd using SHRDLU, it seemed like that time would be a few years at most. All you had to do was teach SHRDLU more words. + +There weren't any classes in AI at Cornell then, not even graduate classes, so I started trying to teach myself. Which meant learning Lisp, since in those days Lisp was regarded as the language of AI. The commonly used programming languages then were pretty primitive, and programmers' ideas correspondingly so. The default language at Cornell was a Pascal-like language called PL/I, and the situation was similar elsewhere. +``` + +```python +print(base_nodes[3].get_content()) +``` + +``` +I haven't tried rereading The Moon is a Harsh Mistress, so I don't know how well it has aged, but when I read it I was drawn entirely into its world. It seemed only a matter of time before we'd have Mike, and when I saw Winograd using SHRDLU, it seemed like that time would be a few years at most. All you had to do was teach SHRDLU more words. + +There weren't any classes in AI at Cornell then, not even graduate classes, so I started trying to teach myself. Which meant learning Lisp, since in those days Lisp was regarded as the language of AI. The commonly used programming languages then were pretty primitive, and programmers' ideas correspondingly so. The default language at Cornell was a Pascal-like language called PL/I, and the situation was similar elsewhere. Learning Lisp expanded my concept of a program so fast that it was years before I started to have a sense of where the new limits were. This was more like it; this was what I had expected college to do. It wasn't happening in a class, like it was supposed to, but that was ok. For the next couple years I was on a roll. I knew what I was going to do. + +For my undergraduate thesis, I reverse-engineered SHRDLU. My God did I love working on that program. It was a pleasing bit of code, but what made it even more exciting was my belief β€” hard to imagine now, but not unique in 1985 β€” that it was already climbing the lower slopes of intelligence. + +I had gotten into a program at Cornell that didn't make you choose a major. You could take whatever classes you liked, and choose whatever you liked to put on your degree. I of course chose "Artificial Intelligence." When I got the actual physical diploma, I was dismayed to find that the quotes had been included, which made them read as scare-quotes. At the time this bothered me, but now it seems amusingly accurate, for reasons I was about to discover. + +I applied to 3 grad schools: MIT and Yale, which were renowned for AI at the time, and Harvard, which I'd visited because Rich Draves went there, and was also home to Bill Woods, who'd invented the type of parser I used in my SHRDLU clone. Only Harvard accepted me, so that was where I went. +``` + + + +## How to use FastEmbed: Qdrant's efficient Python library for embedding generation + +![alt text](documentation/301-advanced/Qdrant_data_prep/social_preview.jpg) + +In the earlier examples I used a variety of methods to embed the data. 
Qdrant works with embeddings from any provider, but we also offer FastEmbed, our own efficient embedding library, which integrates seamlessly with the Qdrant client. FastEmbed supports Dense Text Embeddings, Sparse Text Embeddings, Late Interaction Text Embeddings, and Image Embeddings.

![Screenshot 2024-08-28 at 3.53.13 PM.png](documentation/301-advanced/Qdrant_data_prep/704c6118363e46b79912e7b99e0472d9.png)

![Screenshot 2024-08-28 at 3.53.06 PM.png](documentation/301-advanced/Qdrant_data_prep/88ab85414277454999d1898aa946dad2.png)

### Dense Text Embeddings

```python
# pip install qdrant-client[fastembed]
```

```
Collecting qdrant-client[fastembed]
  Downloading qdrant_client-1.11.1-py3-none-any.whl.metadata (10 kB)
Requirement already satisfied: grpcio>=1.41.0 in /usr/local/lib/python3.10/dist-packages (from qdrant-client[fastembed]) (1.64.1)
Collecting grpcio-tools>=1.41.0 (from qdrant-client[fastembed])
  Downloading grpcio_tools-1.66.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.3 kB)
Requirement already satisfied: httpx>=0.20.0 in /usr/local/lib/python3.10/dist-packages (from httpx[http2]>=0.20.0->qdrant-client[fastembed]) (0.27.2)
Requirement already satisfied: numpy>=1.21 in /usr/local/lib/python3.10/dist-packages (from qdrant-client[fastembed]) (1.26.4)
Collecting portalocker<3.0.0,>=2.7.0 (from qdrant-client[fastembed])
  Downloading portalocker-2.10.1-py3-none-any.whl.metadata (8.5 kB)
Requirement already satisfied: pydantic>=1.10.8 in /usr/local/lib/python3.10/dist-packages (from qdrant-client[fastembed]) (2.8.2)
Requirement already satisfied: urllib3<3,>=1.26.14 in /usr/local/lib/python3.10/dist-packages (from qdrant-client[fastembed]) (2.0.7)
Collecting fastembed==0.3.6 (from qdrant-client[fastembed])
  Downloading fastembed-0.3.6-py3-none-any.whl.metadata (7.7 kB)
Collecting PyStemmer<3.0.0,>=2.2.0 (from fastembed==0.3.6->qdrant-client[fastembed])
  Downloading PyStemmer-2.2.0.1.tar.gz (303 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 303.0/303.0 kB 6.0 MB/s eta 0:00:00
[?25h  Preparing metadata (setup.py) ...
[?25l[?25hdone +Requirement already satisfied: huggingface-hub<1.0,>=0.20 in /usr/local/lib/python3.10/dist-packages (from fastembed==0.3.6->qdrant-client[fastembed]) (0.23.5) +Collecting loguru<0.8.0,>=0.7.2 (from fastembed==0.3.6->qdrant-client[fastembed]) + Downloading loguru-0.7.2-py3-none-any.whl.metadata (23 kB) +Collecting mmh3<5.0,>=4.0 (from fastembed==0.3.6->qdrant-client[fastembed]) + Downloading mmh3-4.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB) +Collecting onnx<2.0.0,>=1.15.0 (from fastembed==0.3.6->qdrant-client[fastembed]) + Downloading onnx-1.16.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (16 kB) +Collecting onnxruntime<2.0.0,>=1.17.0 (from fastembed==0.3.6->qdrant-client[fastembed]) + Downloading onnxruntime-1.19.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.3 kB) +Collecting pillow<11.0.0,>=10.3.0 (from fastembed==0.3.6->qdrant-client[fastembed]) + Downloading pillow-10.4.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (9.2 kB) +Requirement already satisfied: requests<3.0,>=2.31 in /usr/local/lib/python3.10/dist-packages (from fastembed==0.3.6->qdrant-client[fastembed]) (2.32.3) +Requirement already satisfied: snowballstemmer<3.0.0,>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from fastembed==0.3.6->qdrant-client[fastembed]) (2.2.0) +Requirement already satisfied: tokenizers<1.0,>=0.15 in /usr/local/lib/python3.10/dist-packages (from fastembed==0.3.6->qdrant-client[fastembed]) (0.19.1) +Requirement already satisfied: tqdm<5.0,>=4.66 in /usr/local/lib/python3.10/dist-packages (from fastembed==0.3.6->qdrant-client[fastembed]) (4.66.5) +Collecting protobuf<6.0dev,>=5.26.1 (from grpcio-tools>=1.41.0->qdrant-client[fastembed]) + Downloading protobuf-5.28.0-cp38-abi3-manylinux2014_x86_64.whl.metadata (592 bytes) +Collecting grpcio>=1.41.0 (from qdrant-client[fastembed]) + Downloading grpcio-1.66.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.9 kB) +Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from grpcio-tools>=1.41.0->qdrant-client[fastembed]) (71.0.4) +Requirement already satisfied: anyio in /usr/local/lib/python3.10/dist-packages (from httpx>=0.20.0->httpx[http2]>=0.20.0->qdrant-client[fastembed]) (3.7.1) +Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from httpx>=0.20.0->httpx[http2]>=0.20.0->qdrant-client[fastembed]) (2024.7.4) +Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.10/dist-packages (from httpx>=0.20.0->httpx[http2]>=0.20.0->qdrant-client[fastembed]) (1.0.5) +Requirement already satisfied: idna in /usr/local/lib/python3.10/dist-packages (from httpx>=0.20.0->httpx[http2]>=0.20.0->qdrant-client[fastembed]) (3.8) +Requirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from httpx>=0.20.0->httpx[http2]>=0.20.0->qdrant-client[fastembed]) (1.3.1) +Requirement already satisfied: h11<0.15,>=0.13 in /usr/local/lib/python3.10/dist-packages (from httpcore==1.*->httpx>=0.20.0->httpx[http2]>=0.20.0->qdrant-client[fastembed]) (0.14.0) +Collecting h2<5,>=3 (from httpx[http2]>=0.20.0->qdrant-client[fastembed]) + Downloading h2-4.1.0-py3-none-any.whl.metadata (3.6 kB) +Requirement already satisfied: annotated-types>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from pydantic>=1.10.8->qdrant-client[fastembed]) (0.7.0) +Requirement already satisfied: pydantic-core==2.20.1 in 
/usr/local/lib/python3.10/dist-packages (from pydantic>=1.10.8->qdrant-client[fastembed]) (2.20.1) +Requirement already satisfied: typing-extensions>=4.6.1 in /usr/local/lib/python3.10/dist-packages (from pydantic>=1.10.8->qdrant-client[fastembed]) (4.12.2) +Collecting hyperframe<7,>=6.0 (from h2<5,>=3->httpx[http2]>=0.20.0->qdrant-client[fastembed]) + Downloading hyperframe-6.0.1-py3-none-any.whl.metadata (2.7 kB) +Collecting hpack<5,>=4.0 (from h2<5,>=3->httpx[http2]>=0.20.0->qdrant-client[fastembed]) + Downloading hpack-4.0.0-py3-none-any.whl.metadata (2.5 kB) +Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.20->fastembed==0.3.6->qdrant-client[fastembed]) (3.15.4) +Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.20->fastembed==0.3.6->qdrant-client[fastembed]) (2024.6.1) +Requirement already satisfied: packaging>=20.9 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.20->fastembed==0.3.6->qdrant-client[fastembed]) (24.1) +Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.20->fastembed==0.3.6->qdrant-client[fastembed]) (6.0.2) +Collecting coloredlogs (from onnxruntime<2.0.0,>=1.17.0->fastembed==0.3.6->qdrant-client[fastembed]) + Downloading coloredlogs-15.0.1-py2.py3-none-any.whl.metadata (12 kB) +Requirement already satisfied: flatbuffers in /usr/local/lib/python3.10/dist-packages (from onnxruntime<2.0.0,>=1.17.0->fastembed==0.3.6->qdrant-client[fastembed]) (24.3.25) +Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from onnxruntime<2.0.0,>=1.17.0->fastembed==0.3.6->qdrant-client[fastembed]) (1.13.2) +Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3.0,>=2.31->fastembed==0.3.6->qdrant-client[fastembed]) (3.3.2) +Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio->httpx>=0.20.0->httpx[http2]>=0.20.0->qdrant-client[fastembed]) (1.2.2) +Collecting humanfriendly>=9.1 (from coloredlogs->onnxruntime<2.0.0,>=1.17.0->fastembed==0.3.6->qdrant-client[fastembed]) + Downloading humanfriendly-10.0-py2.py3-none-any.whl.metadata (9.2 kB) +Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy->onnxruntime<2.0.0,>=1.17.0->fastembed==0.3.6->qdrant-client[fastembed]) (1.3.0) +Downloading fastembed-0.3.6-py3-none-any.whl (55 kB) + ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 55.6/55.6 kB 3.5 MB/s eta 0:00:00 +[?25hDownloading grpcio_tools-1.66.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB) + ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.4/2.4 MB 19.3 MB/s eta 0:00:00 +[?25hDownloading grpcio-1.66.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.7 MB) + ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.7/5.7 MB 26.7 MB/s eta 0:00:00 +[?25hDownloading portalocker-2.10.1-py3-none-any.whl (18 kB) +Downloading qdrant_client-1.11.1-py3-none-any.whl (259 kB) + ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 259.4/259.4 kB 8.8 MB/s eta 0:00:00 +[?25hDownloading h2-4.1.0-py3-none-any.whl (57 kB) + ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57.5/57.5 kB 2.3 MB/s eta 0:00:00 +[?25hDownloading loguru-0.7.2-py3-none-any.whl (62 kB) + ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.5/62.5 kB 1.9 MB/s eta 0:00:00 +[?25hDownloading 
mmh3-4.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (67 kB) + ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67.6/67.6 kB 2.2 MB/s eta 0:00:00 +[?25hDownloading onnx-1.16.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (15.9 MB) + ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.9/15.9 MB 24.0 MB/s eta 0:00:00 +[?25hDownloading onnxruntime-1.19.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (13.2 MB) + ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.2/13.2 MB 21.3 MB/s eta 0:00:00 +[?25hDownloading pillow-10.4.0-cp310-cp310-manylinux_2_28_x86_64.whl (4.5 MB) + ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.5/4.5 MB 39.9 MB/s eta 0:00:00 +[?25hDownloading protobuf-5.28.0-cp38-abi3-manylinux2014_x86_64.whl (316 kB) + ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 316.6/316.6 kB 15.6 MB/s eta 0:00:00 +[?25hDownloading hpack-4.0.0-py3-none-any.whl (32 kB) +Downloading hyperframe-6.0.1-py3-none-any.whl (12 kB) +Downloading coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB) + ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46.0/46.0 kB 2.9 MB/s eta 0:00:00 +[?25hDownloading humanfriendly-10.0-py2.py3-none-any.whl (86 kB) + ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86.8/86.8 kB 5.3 MB/s eta 0:00:00 +[?25hBuilding wheels for collected packages: PyStemmer + Building wheel for PyStemmer (setup.py) ... [?25l[?25hdone + Created wheel for PyStemmer: filename=PyStemmer-2.2.0.1-cp310-cp310-linux_x86_64.whl size=579737 sha256=bbc92ffa06a639525dc079c4923a52457559268c171e9fa33ae1d72ec0d7fd0f + Stored in directory: /root/.cache/pip/wheels/45/7d/2c/a7ebb8319e01acc5306fa1f8558bf24063d6cec2c02de330c9 +Successfully built PyStemmer +Installing collected packages: PyStemmer, mmh3, protobuf, portalocker, pillow, loguru, hyperframe, humanfriendly, hpack, grpcio, onnx, h2, grpcio-tools, coloredlogs, onnxruntime, qdrant-client, fastembed + Attempting uninstall: protobuf + Found existing installation: protobuf 3.20.3 + Uninstalling protobuf-3.20.3: + Successfully uninstalled protobuf-3.20.3 + Attempting uninstall: pillow + Found existing installation: Pillow 9.4.0 + Uninstalling Pillow-9.4.0: + Successfully uninstalled Pillow-9.4.0 + Attempting uninstall: grpcio + Found existing installation: grpcio 1.64.1 + Uninstalling grpcio-1.64.1: + Successfully uninstalled grpcio-1.64.1 +ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. +cudf-cu12 24.4.1 requires protobuf<5,>=3.20, but you have protobuf 5.28.0 which is incompatible. +google-ai-generativelanguage 0.6.6 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 5.28.0 which is incompatible. +google-cloud-bigquery-storage 2.25.0 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 5.28.0 which is incompatible. +google-cloud-datastore 2.19.0 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 5.28.0 which is incompatible. +google-cloud-firestore 2.16.1 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 5.28.0 which is incompatible. +tensorboard 2.17.0 requires protobuf!=4.24.0,<5.0.0,>=3.19.6, but you have protobuf 5.28.0 which is incompatible. 
+tensorflow 2.17.0 requires protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3, but you have protobuf 5.28.0 which is incompatible. +tensorflow-metadata 1.15.0 requires protobuf<4.21,>=3.20.3; python_version < "3.11", but you have protobuf 5.28.0 which is incompatible. +Successfully installed PyStemmer-2.2.0.1 coloredlogs-15.0.1 fastembed-0.3.6 grpcio-1.66.1 grpcio-tools-1.66.1 h2-4.1.0 hpack-4.0.0 humanfriendly-10.0 hyperframe-6.0.1 loguru-0.7.2 mmh3-4.1.0 onnx-1.16.2 onnxruntime-1.19.0 pillow-10.4.0 portalocker-2.10.1 protobuf-5.28.0 qdrant-client-1.11.1 +``` + +```python +from fastembed import TextEmbedding +from typing import List + +# Example list of documents +documents: List[str] = [ + "This is built to be faster and lighter than other embedding libraries e.g. Transformers, Sentence-Transformers, etc.", + "fastembed is supported by and maintained by Qdrant.", +] + +# This will trigger the model download and initialization +embedding_model = TextEmbedding() +print("The model BAAI/bge-small-en-v1.5 is ready to use.") + +embeddings_generator = embedding_model.embed(documents) # reminder this is a generator +embeddings_list = list(embedding_model.embed(documents)) +# you can also convert the generator to a list, and that to a numpy array +len(embeddings_list[0]) # Vector of 384 dimensions +``` + +``` +/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: +The secret `HF_TOKEN` does not exist in your Colab secrets. +To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session. +You will be able to reuse this secret in all of your notebooks. +Please note that authentication is recommended but still optional to access public models or datasets. + warnings.warn( + + + +Fetching 5 files: 0%| | 0/5 [00:00 + +```python +from fastembed import SparseTextEmbedding, SparseEmbedding +from typing import List +``` + +
+ +```python +SparseTextEmbedding.list_supported_models() +``` + +``` +[{'model': 'prithivida/Splade_PP_en_v1', + 'vocab_size': 30522, + 'description': 'Independent Implementation of SPLADE++ Model for English', + 'size_in_GB': 0.532, + 'sources': {'hf': 'Qdrant/SPLADE_PP_en_v1'}, + 'model_file': 'model.onnx'}, + {'model': 'prithvida/Splade_PP_en_v1', + 'vocab_size': 30522, + 'description': 'Independent Implementation of SPLADE++ Model for English', + 'size_in_GB': 0.532, + 'sources': {'hf': 'Qdrant/SPLADE_PP_en_v1'}, + 'model_file': 'model.onnx'}, + {'model': 'Qdrant/bm42-all-minilm-l6-v2-attentions', + 'vocab_size': 30522, + 'description': 'Light sparse embedding model, which assigns an importance score to each token in the text', + 'size_in_GB': 0.09, + 'sources': {'hf': 'Qdrant/all_miniLM_L6_v2_with_attentions'}, + 'model_file': 'model.onnx', + 'additional_files': ['stopwords.txt'], + 'requires_idf': True}, + {'model': 'Qdrant/bm25', + 'description': 'BM25 as sparse embeddings meant to be used with Qdrant', + 'size_in_GB': 0.01, + 'sources': {'hf': 'Qdrant/bm25'}, + 'model_file': 'mock.file', + 'additional_files': ['arabic.txt', + 'azerbaijani.txt', + 'basque.txt', + 'bengali.txt', + 'catalan.txt', + 'chinese.txt', + 'danish.txt', + 'dutch.txt', + 'english.txt', + 'finnish.txt', + 'french.txt', + 'german.txt', + 'greek.txt', + 'hebrew.txt', + 'hinglish.txt', + 'hungarian.txt', + 'indonesian.txt', + 'italian.txt', + 'kazakh.txt', + 'nepali.txt', + 'norwegian.txt', + 'portuguese.txt', + 'romanian.txt', + 'russian.txt', + 'slovene.txt', + 'spanish.txt', + 'swedish.txt', + 'tajik.txt', + 'turkish.txt'], + 'requires_idf': True}] +``` + +```python +model_name = "Qdrant/bm25" +# This triggers the model download +model = SparseTextEmbedding(model_name=model_name) +``` + +``` +Fetching 29 files: 0%| | 0/29 [00:00 + +```python +index = 0 +sparse_embeddings_list[index] +``` + +``` +SparseEmbedding(values=array([1.66528681, 1.66528681, 1.66528681, 1.66528681, 1.66528681, + 1.66528681]), indices=array([1558122631, 746093202, 691409538, 1391639301, 2042792262, + 1318831999])) +``` + +### Late Interaction Text Embeddings + +```python +from fastembed import LateInteractionTextEmbedding + +embedding_model = LateInteractionTextEmbedding("colbert-ir/colbertv2.0") + +documents = [ + "ColBERT is a late interaction text embedding model, however, there are also other models such as TwinBERT.", + "On the contrary to the late interaction models, the early interaction models contains interaction steps at embedding generation process", +] +queries = [ + "Are there any other late interaction text embedding models except ColBERT?", + "What is the difference between late interaction and early interaction text embedding models?", +] +``` + +``` +Fetching 5 files: 0%| | 0/5 [00:00 + +```python +document_embeddings[0].shape, query_embeddings[0].shape +``` + +``` +((26, 128), (32, 128)) +``` + +Don't worry about query embeddings having the bigger shape in this case. ColBERT authors recommend to pad queries with [MASK] tokens to 32 tokens. They also recommends to truncate queries to 32 tokens, however we don't do that in FastEmbed, so you can put some straight into the queries. 
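+
+To see what "late interaction" means in practice, here is a minimal NumPy sketch of MaxSim scoring, assuming the `document_embeddings` and `query_embeddings` produced above: every query token is compared with every document token, each query token keeps its best match, and the per-token maxima are summed into a single relevance score.
+
+```python
+import numpy as np
+
+
+def maxsim_score(query_embedding: np.ndarray, document_embedding: np.ndarray) -> float:
+    # Pairwise similarity between every query token and every document token
+    similarity = query_embedding @ document_embedding.T  # (query_tokens, doc_tokens)
+    # Each query token keeps its best-matching document token; their sum is the score
+    return float(similarity.max(axis=1).sum())
+
+
+# Score the first query against both documents from the example above
+scores = [maxsim_score(query_embeddings[0], doc) for doc in document_embeddings]
+print(scores)
+```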
+
+### Image Embeddings
+
+```python
+from fastembed import ImageEmbedding
+
+model = ImageEmbedding("Qdrant/resnet50-onnx")
+
+# NOTE: these sample paths do not exist in this Colab runtime - swap in paths
+# to images that exist on your machine
+embeddings_generator = model.embed(
+    ["tests/misc/image.jpeg", "tests/misc/small_image.jpeg"]
+)
+embeddings_list = list(embeddings_generator)
+embeddings_list
+```
+
+```
+Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]
+
+---------------------------------------------------------------------------
+FileNotFoundError                         Traceback (most recent call last)
+      6     ["tests/misc/image.jpeg", "tests/misc/small_image.jpeg"]
+      7 )
+----> 8 embeddings_list = list(embeddings_generator)
+      9 embeddings_list
+
+...
+
+/usr/local/lib/python3.10/dist-packages/PIL/Image.py in open(fp, mode, formats)
+   3430         if filename:
+-> 3431             fp = builtins.open(filename, "rb")
+   3432             exclusive_fp = True
+
+FileNotFoundError: [Errno 2] No such file or directory: '/content/tests/misc/image.jpeg'
+```
+
+The traceback simply means the example image files are not present in this environment; point `embed()` at images that exist on disk and the call goes through.
+
+Preprocessing is encapsulated in the `ImageEmbedding` class, and the applied operations are identical to the ones provided by Hugging Face Transformers. You don't need to think about batching, opening/closing files, or resizing images; FastEmbed takes care of it.
+
+## Putting Data into Qdrant
+
+```python
+from qdrant_client import QdrantClient
+
+client = QdrantClient(
+    url="insert your Qdrant URL",
+    api_key="insert your Qdrant API Key",
+)
+```
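+
+If you don't have a Qdrant Cloud cluster at hand, the same client can also point at a local or in-memory instance; a minimal sketch (the local URL below is just the default Docker port, not something configured earlier in this notebook):
+
+```python
+from qdrant_client import QdrantClient
+
+# Local Docker instance listening on the default HTTP port
+client = QdrantClient(url="http://localhost:6333")
+
+# ...or a disposable in-memory instance, handy for quick experiments
+client = QdrantClient(":memory:")
+```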
+
+```python
+# Initialize the client
+# Prepare your documents, metadata, and IDs
+docs = ["Audience Suggestion 1", "Audience suggestion 2"]
+metadata = [
+    {"source": "jane-doe"},
+    {"source": "john-doe"},
+]
+ids = [42, 2]
+
+# If you want to change the model:
+client.set_model("sentence-transformers/all-MiniLM-L6-v2")
+# List of supported models: https://qdrant.github.io/fastembed/examples/Supported_Models
+
+# Use the new add() instead of upsert()
+# This internally calls embed() of the configured embedding model
+client.add(
+    collection_name="demo_collection", documents=docs, metadata=metadata, ids=ids
+)
+
+search_result = client.query(
+    collection_name="demo_collection", query_text="This is a query document"
+)
+print(search_result)
+```
+
+```
+[QueryResponse(id=2, embedding=None, sparse_embedding=None, metadata={'document': 'Audience suggestion 2', 'source': 'john-doe'}, document='Audience suggestion 2', score=-0.0038822955), QueryResponse(id=42, embedding=None, sparse_embedding=None, metadata={'document': 'Audience Suggestion 1', 'source': 'jane-doe'}, document='Audience Suggestion 1', score=-0.0069789663)]
+```
diff --git a/qdrant-landing/content/documentation/301-advanced/code-search.md b/qdrant-landing/content/documentation/301-advanced/code-search.md
new file mode 100644
index 000000000..90160c200
--- /dev/null
+++ b/qdrant-landing/content/documentation/301-advanced/code-search.md
@@ -0,0 +1,669 @@
+---
+notebook_path: 301-advanced/code-search/code-search.ipynb
+reading_time_min: 25
+title: Code search with Qdrant
+---
+
+# Code search with Qdrant
+
+This is a notebook demonstrating how to implement a code search mechanism using two different neural encoders - one general purpose, and another trained specifically for code. Let's start by installing all the required dependencies.
+
+```python
+!pip install qdrant-client inflection sentence-transformers optimum onnx
+```
+
+```
+Collecting qdrant-client
+  Downloading qdrant_client-1.7.3-py3-none-any.whl (206 kB)
+Collecting inflection
+  Downloading inflection-0.5.1-py2.py3-none-any.whl (9.5 kB)
+Collecting sentence-transformers
+  Downloading sentence_transformers-2.5.1-py3-none-any.whl (156 kB)
+Collecting optimum
+  Downloading optimum-1.17.1-py3-none-any.whl (407 kB)
+Collecting onnx
+  Downloading onnx-1.15.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (15.7 MB)
+...
+Installing collected packages: protobuf, portalocker, inflection, hyperframe, humanfriendly, hpack, h11, dill, onnx, multiprocess, httpcore, h2, grpcio-tools, coloredlogs, httpx, datasets, sentence-transformers, qdrant-client, optimum
+ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
+tensorflow-metadata 1.14.0 requires protobuf<4.21,>=3.20.3, but you have protobuf 4.25.3 which is incompatible.
+Successfully installed coloredlogs-15.0.1 datasets-2.18.0 dill-0.3.8 grpcio-tools-1.62.0 h11-0.14.0 h2-4.1.0 hpack-4.0.0 httpcore-1.0.4 httpx-0.27.0 humanfriendly-10.0 hyperframe-6.0.1 inflection-0.5.1 multiprocess-0.70.16 onnx-1.15.0 optimum-1.17.1 portalocker-2.8.2 protobuf-4.25.3 qdrant-client-1.7.3 sentence-transformers-2.5.1
+```
+
+We are going to work with [Qdrant source code](https://github.com/qdrant/qdrant) that has already been converted into chunks. If you want to do it for a different project, please consider using one of the [LSP implementations](https://microsoft.github.io/language-server-protocol/) for your programming language. It should be fairly easy to build similar structures with the help of these tools.
+
+```python
+!wget https://storage.googleapis.com/tutorial-attachments/code-search/structures.jsonl
+```
+
+```
+--2024-03-05 11:08:28-- https://storage.googleapis.com/tutorial-attachments/code-search/structures.jsonl
+Resolving storage.googleapis.com (storage.googleapis.com)... 108.177.119.207, 108.177.127.207, 172.217.218.207, ...
+Connecting to storage.googleapis.com (storage.googleapis.com)|108.177.119.207|:443... connected.
+HTTP request sent, awaiting response... 
200 OK +Length: 4921256 (4.7M) [application/json] +Saving to: β€˜structures.jsonl’ + +structures.jsonl 100%[===================>] 4.69M 20.4MB/s in 0.2s + +2024-03-05 11:08:29 (20.4 MB/s) - β€˜structures.jsonl’ saved [4921256/4921256] +``` + +```python +import json + +structures = [] +with open("structures.jsonl", "r") as fp: + for i, row in enumerate(fp): + entry = json.loads(row) + structures.append(entry) + +structures[0] +``` + +``` +{'name': 'InvertedIndexRam', + 'signature': '# [doc = " Inverted flatten index from dimension id to posting list"] # [derive (Debug , Clone , PartialEq)] pub struct InvertedIndexRam { # [doc = " Posting lists for each dimension flattened (dimension id -> posting list)"] # [doc = " Gaps are filled with empty posting lists"] pub postings : Vec < PostingList > , # [doc = " Number of unique indexed vectors"] # [doc = " pre-computed on build and upsert to avoid having to traverse the posting lists."] pub vector_count : usize , }', + 'code_type': 'Struct', + 'docstring': '= " Inverted flatten index from dimension id to posting list"', + 'line': 15, + 'line_from': 13, + 'line_to': 22, + 'context': {'module': 'inverted_index', + 'file_path': 'lib/sparse/src/index/inverted_index/inverted_index_ram.rs', + 'file_name': 'inverted_index_ram.rs', + 'struct_name': None, + 'snippet': '/// Inverted flatten index from dimension id to posting list\n#[derive(Debug, Clone, PartialEq)]\npub struct InvertedIndexRam {\n /// Posting lists for each dimension flattened (dimension id -> posting list)\n /// Gaps are filled with empty posting lists\n pub postings: Vec,\n /// Number of unique indexed vectors\n /// pre-computed on build and upsert to avoid having to traverse the posting lists.\n pub vector_count: usize,\n}\n'}} +``` + +We will use two different neural encoders - `all-MiniLM-L6-v2` and `jina-embeddings-v2-base-code`. Since the first one is trained for general purposes, and more natural language, there is a need to convert code into more human-friendly text representation. This normalization gets rid of language specifics, so the output looks more like a description of the particular code structure. + +```python +import inflection +import re + +from typing import Dict, Any + + +def textify(chunk: Dict[str, Any]) -> str: + """ + Convert the code structure into natural language like representation. 
+ + Args: + chunk (dict): Dictionary-like representation of the code structure + Example: { + "name":"await_ready_for_timeout", + "signature":"fn await_ready_for_timeout (& self , timeout : Duration) -> bool", + "code_type":"Function", + "docstring":"= \" Return `true` if ready, `false` if timed out.\"", + "line":44, + "line_from":43, + "line_to":51, + "context":{ + "module":"common", + "file_path":"lib/collection/src/common/is_ready.rs", + "file_name":"is_ready.rs", + "struct_name":"IsReady", + "snippet":" /// Return `true` if ready, `false` if timed out.\n pub fn await_ready_for_timeout(&self, timeout: Duration) -> bool {\n let mut is_ready = self.value.lock();\n if !*is_ready {\n !self.condvar.wait_for(&mut is_ready, timeout).timed_out()\n } else {\n true\n }\n }\n" + } + } + + Returns: + str: A simplified natural language like description of the structure with some context info + Example: "Function Await ready for timeout that does Return true if ready false if timed out defined as Fn await ready for timeout self timeout duration bool defined in struct Isready module common file is_ready rs" + """ + # Get rid of all the camel case / snake case + # - inflection.underscore changes the camel case to snake case + # - inflection.humanize converts the snake case to human readable form + name = inflection.humanize(inflection.underscore(chunk["name"])) + signature = inflection.humanize(inflection.underscore(chunk["signature"])) + + # Check if docstring is provided + docstring = "" + if chunk["docstring"]: + docstring = f"that does {chunk['docstring']} " + + # Extract the location of that snippet of code + context = ( + f"module {chunk['context']['module']} " f"file {chunk['context']['file_name']}" + ) + if chunk["context"]["struct_name"]: + struct_name = inflection.humanize( + inflection.underscore(chunk["context"]["struct_name"]) + ) + context = f"defined in struct {struct_name} {context}" + + # Combine all the bits and pieces together + text_representation = ( + f"{chunk['code_type']} {name} " + f"{docstring}" + f"defined as {signature} " + f"{context}" + ) + + # Remove any special characters and concatenate the tokens + tokens = re.split(r"\W", text_representation) + tokens = filter(lambda x: x, tokens) + return " ".join(tokens) +``` + +Here is how the same structure looks like, after performing the normalization step: + +```python +textify(structures[0]) +``` + +``` +'Struct Inverted index ram that does Inverted flatten index from dimension id to posting list defined as doc inverted flatten index from dimension id to posting list derive debug clone partial eq pub struct inverted index ram doc posting lists for each dimension flattened dimension id posting list doc gaps are filled with empty posting lists pub postings vec posting list doc number of unique indexed vectors doc pre computed on build and upsert to avoid having to traverse the posting lists pub vector count usize module inverted_index file inverted_index_ram rs' +``` + +Let's do it for all the structures at once: + +```python +text_representations = list(map(textify, structures)) +``` + +Created text representations might be directly used as an input to the `all-MiniLM-L6-v2` model. 
+
+```python
+from sentence_transformers import SentenceTransformer
+
+nlp_model = SentenceTransformer("all-MiniLM-L6-v2")
+nlp_embeddings = nlp_model.encode(
+    text_representations,
+    show_progress_bar=True,
+)
+nlp_embeddings.shape
+```
+
+```
+modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]
+
+(4723, 384)
+```
+
+The second encoder works best on raw source code, so we will feed it the snippets directly, without any normalization.
+
+```python
+code_snippets = [structure["context"]["snippet"] for structure in structures]
+
+code_snippets[0]
+```
+
+```
+'/// Inverted flatten index from dimension id to posting list\n#[derive(Debug, Clone, PartialEq)]\npub struct InvertedIndexRam {\n    /// Posting lists for each dimension flattened (dimension id -> posting list)\n    /// Gaps are filled with empty posting lists\n    pub postings: Vec<PostingList>,\n    /// Number of unique indexed vectors\n    /// pre-computed on build and upsert to avoid having to traverse the posting lists.\n    pub vector_count: usize,\n}\n'
+```
+
+The `jina-embeddings-v2-base-code` model is available for free, but requires accepting the rules on [the model page](https://huggingface.co/jinaai/jina-embeddings-v2-base-code). Please do it first, and put the key below.
+
+```python
+# You have to accept the conditions in order to be able to access Jina embedding
+# model. Please visit https://huggingface.co/jinaai/jina-embeddings-v2-base-code
+# to accept the rules and generate the access token in your account settings:
+# https://huggingface.co/settings/tokens
+
+HF_TOKEN = "THIS_IS_YOUR_TOKEN"
+```
+
+Once the token is ready, we can pass the code snippets through the second model. Please note we set the `trust_remote_code` flag to `True` so the library can download and run some code from the remote server. This is required to run the model, so be aware of the potential security risks and make sure you trust the source.
+
+```python
+code_model = SentenceTransformer(
+    "jinaai/jina-embeddings-v2-base-code", token=HF_TOKEN, trust_remote_code=True
+)
+code_model.max_seq_length = 8192  # increase the context length window
+code_embeddings = code_model.encode(
+    code_snippets,
+    batch_size=4,
+    show_progress_bar=True,
+)
+code_embeddings.shape
+```
+
+```
+modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]
+
+(4723, 768)
+```
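+
+The next cell connects to Qdrant through `QDRANT_URL` and `QDRANT_API_KEY`. A minimal sketch of how these variables might be defined (both values are placeholders for your own cluster; a plain local instance needs no API key):
+
+```python
+QDRANT_URL = "https://my-cluster.cloud.qdrant.io:6333"  # placeholder - your cluster URL
+QDRANT_API_KEY = "my-api-key"  # placeholder - your API key
+```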
+ +```python +hits = client.search( + "qdrant-sources", + query_vector=("text", nlp_model.encode(query).tolist()), + limit=5, +) +for hit in hits: + print( + "| ", + hit.payload["context"]["module"], + " | ", + hit.payload["context"]["file_name"], + " | ", + hit.score, + " | `", + hit.payload["signature"], + "` |", + ) +``` + +``` +| toc | point_ops.rs | 0.59448624 | ` async fn count (& self , collection_name : & str , request : CountRequestInternal , read_consistency : Option < ReadConsistency > , shard_selection : ShardSelectorInternal ,) -> Result < CountResult , StorageError > ` | +| operations | types.rs | 0.5493385 | ` # [doc = " Count Request"] # [doc = " Counts the number of points which satisfy the given filter."] # [doc = " If filter is not provided, the count of all points in the collection will be returned."] # [derive (Debug , Deserialize , Serialize , JsonSchema , Validate)] # [serde (rename_all = "snake_case")] pub struct CountRequestInternal { # [doc = " Look only for points which satisfies this conditions"] # [validate] pub filter : Option < Filter > , # [doc = " If true, count exact number of points. If false, count approximate number of points faster."] # [doc = " Approximate count might be unreliable during the indexing process. Default: true"] # [serde (default = "default_exact_count")] pub exact : bool , } ` | +| collection_manager | segments_updater.rs | 0.5121002 | ` fn upsert_points < 'a , T > (segments : & SegmentHolder , op_num : SeqNumberType , points : T ,) -> CollectionResult < usize > where T : IntoIterator < Item = & 'a PointStruct > , ` | +| collection | point_ops.rs | 0.5063539 | ` async fn count (& self , request : CountRequestInternal , read_consistency : Option < ReadConsistency > , shard_selection : & ShardSelectorInternal ,) -> CollectionResult < CountResult > ` | +| map_index | mod.rs | 0.49973983 | ` fn get_points_with_value_count < Q > (& self , value : & Q) -> Option < usize > where Q : ? Sized , N : std :: borrow :: Borrow < Q > , Q : Hash + Eq , ` | +``` + +The results obtained with the code specific model should be different. + +```python +hits = client.search( + "qdrant-sources", + query_vector=("code", code_model.encode(query).tolist()), + limit=5, +) +for hit in hits: + print( + "| ", + hit.payload["context"]["module"], + " | ", + hit.payload["context"]["file_name"], + " | ", + hit.score, + " | `", + hit.payload["signature"], + "` |", + ) +``` + +``` +| field_index | geo_index.rs | 0.73278356 | ` fn count_indexed_points (& self) -> usize ` | +| numeric_index | mod.rs | 0.7254975 | ` fn count_indexed_points (& self) -> usize ` | +| map_index | mod.rs | 0.7124739 | ` fn count_indexed_points (& self) -> usize ` | +| map_index | mod.rs | 0.7124739 | ` fn count_indexed_points (& self) -> usize ` | +| fixtures | payload_context_fixture.rs | 0.7062038 | ` fn total_point_count (& self) -> usize ` | +``` + +In reality, we implemented the system with two different models, as we want to combine the results coming from both of them. We can do it with a batch request, so there is just a single call to Qdrant. 
+ +```python +results = client.search_batch( + "qdrant-sources", + requests=[ + models.SearchRequest( + vector=models.NamedVector( + name="text", vector=nlp_model.encode(query).tolist() + ), + with_payload=True, + limit=5, + ), + models.SearchRequest( + vector=models.NamedVector( + name="code", vector=code_model.encode(query).tolist() + ), + with_payload=True, + limit=5, + ), + ], +) +for hits in results: + for hit in hits: + print( + "| ", + hit.payload["context"]["module"], + " | ", + hit.payload["context"]["file_name"], + " | ", + hit.score, + " | `", + hit.payload["signature"], + "` |", + ) +``` + +``` +| toc | point_ops.rs | 0.59448624 | ` async fn count (& self , collection_name : & str , request : CountRequestInternal , read_consistency : Option < ReadConsistency > , shard_selection : ShardSelectorInternal ,) -> Result < CountResult , StorageError > ` | +| operations | types.rs | 0.5493385 | ` # [doc = " Count Request"] # [doc = " Counts the number of points which satisfy the given filter."] # [doc = " If filter is not provided, the count of all points in the collection will be returned."] # [derive (Debug , Deserialize , Serialize , JsonSchema , Validate)] # [serde (rename_all = "snake_case")] pub struct CountRequestInternal { # [doc = " Look only for points which satisfies this conditions"] # [validate] pub filter : Option < Filter > , # [doc = " If true, count exact number of points. If false, count approximate number of points faster."] # [doc = " Approximate count might be unreliable during the indexing process. Default: true"] # [serde (default = "default_exact_count")] pub exact : bool , } ` | +| collection_manager | segments_updater.rs | 0.5121002 | ` fn upsert_points < 'a , T > (segments : & SegmentHolder , op_num : SeqNumberType , points : T ,) -> CollectionResult < usize > where T : IntoIterator < Item = & 'a PointStruct > , ` | +| collection | point_ops.rs | 0.5063539 | ` async fn count (& self , request : CountRequestInternal , read_consistency : Option < ReadConsistency > , shard_selection : & ShardSelectorInternal ,) -> CollectionResult < CountResult > ` | +| map_index | mod.rs | 0.49973983 | ` fn get_points_with_value_count < Q > (& self , value : & Q) -> Option < usize > where Q : ? Sized , N : std :: borrow :: Borrow < Q > , Q : Hash + Eq , ` | +| field_index | geo_index.rs | 0.73278356 | ` fn count_indexed_points (& self) -> usize ` | +| numeric_index | mod.rs | 0.7254975 | ` fn count_indexed_points (& self) -> usize ` | +| map_index | mod.rs | 0.7124739 | ` fn count_indexed_points (& self) -> usize ` | +| map_index | mod.rs | 0.7124739 | ` fn count_indexed_points (& self) -> usize ` | +| fixtures | payload_context_fixture.rs | 0.7062038 | ` fn total_point_count (& self) -> usize ` | +``` + +Last but not least, if we want to improve the diversity of the results, grouping them by the module might be a good idea. 
+ +```python +results = client.search_groups( + "qdrant-sources", + query_vector=("code", code_model.encode(query).tolist()), + group_by="context.module", + limit=5, + group_size=1, +) +for group in results.groups: + for hit in group.hits: + print( + "| ", + hit.payload["context"]["module"], + " | ", + hit.payload["context"]["file_name"], + " | ", + hit.score, + " | `", + hit.payload["signature"], + "` |", + ) +``` + +``` +| field_index | geo_index.rs | 0.73278356 | ` fn count_indexed_points (& self) -> usize ` | +| numeric_index | mod.rs | 0.7254975 | ` fn count_indexed_points (& self) -> usize ` | +| map_index | mod.rs | 0.7124739 | ` fn count_indexed_points (& self) -> usize ` | +| fixtures | payload_context_fixture.rs | 0.7062038 | ` fn total_point_count (& self) -> usize ` | +| hnsw_index | graph_links.rs | 0.6998417 | ` fn num_points (& self) -> usize ` | +``` + +For a more detailed guide, please check our [code search tutorial](https://qdrant.tech/documentation/tutorials/code-search/) and [code search demo](https://github.com/qdrant/demo-code-search). diff --git a/qdrant-landing/content/documentation/301-advanced/colpali_demo_binary.md b/qdrant-landing/content/documentation/301-advanced/colpali_demo_binary.md new file mode 100644 index 000000000..47525a77d --- /dev/null +++ b/qdrant-landing/content/documentation/301-advanced/colpali_demo_binary.md @@ -0,0 +1,928 @@ +--- +notebook_path: 301-advanced/colpali-and-binary-quantization/colpali_demo_binary.ipynb +reading_time_min: 27 +title: 'ColPali and Qdrant: Document Retrieval with Vision Language Models and Binary Quantization' +--- + +Open In Colab + +# ColPali and Qdrant: Document Retrieval with Vision Language Models and Binary Quantization + +It’s no secret that even the most modern document retrieval systems have a hard time handling visually rich documents like PDFs, containing tables, images, and complex layouts. + +ColPali introduces a multimodal retrieval approach that uses Vision Language Models (VLMs) instead of the traditional OCR and text-based extraction. By processing document images directly, it creates multi-vector embeddings from both the visual and textual content, capturing the document's structure and context more effectively. This method outperforms traditional techniques, as demonstrated by the Visual Document Retrieval Benchmark (ViDoRe) introduced in the paper. + +## Standard Retrieval vs. ColPali + +The standard approach starts by running OCR to extract the text from a document. Once the text is extracted, a layout detection model interprets the structure, which is followed by chunking the text into smaller sections for embedding. This method works adequately for documents where the text content is the primary focus. + +![Standard Retrieval architecture](documentation/301-advanced/colpali_demo_binary/image-278.png) + +*Standard Retrieval architecture. Image from the ColPali paper [1]* + +Rather than relying on OCR, ColPali processes the entire document as an image using a Vision Encoder. It creates multi-vector embeddings that capture both the textual content and the visual structure of the document which are then passed through a Language Model (LLM), which integrates the information into a representation that retains both text and visual features. + +![Colpali architecture](documentation/301-advanced/colpali_demo_binary/image-279.png) + +*Colpali architecture. Image from the ColPali paper [1]* + +The retrieval quality of ColPali is significantly higher, with an NDCG@5 score of 0.81. 
This comes from a benchmark created by the authors to measure how well systems handle visually rich documents. ColPali's score shows that it does a better job of capturing both text and visual elements compared to traditional methods.
+
+| Feature | Standard Retrieval | ColPali |
+| --------------------------- | -------------------------------- | -------------------------------------------------- |
+| **Document Processing** | OCR and text-based extraction | Vision-based processing using a Vision Encoder |
+| **Handling Visual Content** | Limited (depends on captioning) | Fully integrated (handles images, tables, layouts) |
+| **Embedding Creation** | Single dense embedding from text | Multi-vector embeddings from both text and visuals |
+| **Speed (Offline)** | 7.22 seconds per page | 0.39 seconds per page |
+| **Speed (Online)** | 22 milliseconds per query | 30 milliseconds per query |
+| **Retrieval Quality** | NDCG@5 score of 0.66 | NDCG@5 score of 0.81 |
+
+## Why ColPali's Results Are So Good
+
+One of the standout features of ColPali is its explainability. Because it uses vision transformers, it can 'understand' which parts of a document were most relevant to a specific query. For example, if you're searching for the page in a report that mentions a specific date, it can highlight the patches of the document where that information is found. This level of transparency is incredibly useful for understanding how the model works and verifying the accuracy of its results.
+
+Let's take a look at the chart below, which shows the 2019 Average Hourly Generation by Fuel Type from the [original ColPali paper](https://arxiv.org/abs/2407.01449):
+
+2019 Average Hourly Generation by Fuel Type
+
+*Image from the ColPali paper [1]*
+
+In the figure below, also presented in the ColPali paper, we can see how ColPali identifies the most relevant patches of the document in response to the query "Which hour of the day had the highest overall electricity generation in 2019?" and matches query terms like "hour" and "highest generation" to the relevant sections of the document.
+
+How ColPali identifies the most relevant document image patches
+
+*Image from the ColPali paper [1]*
+
+The highlighted zones correspond to the areas of the document that have information relevant to the query. ColPali computes a query-to-page matching score based on these highlighted regions, allowing it to retrieve the most pertinent documents from a large pre-indexed corpus.
+
+# Getting Started: Setting Up ColPali and Qdrant
+
+This tutorial takes inspiration from [Daniel van Strien's guide](https://danielvanstrien.xyz/posts/post-with-code/colpali-qdrant/2024-10-02_using_colpali_with_qdrant.html) [2] on using ColPali and Qdrant for document retrieval, working with a UFO dataset that includes tables, images, and text.
+
+We're experimenting with **Binary Quantization** and using oversampling and rescoring to fine-tune the results.
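+
+Concretely, here is a minimal sketch of what that combination looks like in Qdrant: ColPali's 128-dimensional token vectors are stored as multivectors compared with MaxSim, binary quantization compresses them, and at query time oversampling plus rescoring recovers accuracy. The collection name, URL, and parameter values below are illustrative; the actual setup is built step by step in the sections that follow.
+
+```python
+from qdrant_client import QdrantClient, models
+
+client = QdrantClient(url="http://localhost:6333")  # illustrative URL
+
+client.create_collection(
+    collection_name="colpali-demo",  # illustrative name
+    vectors_config=models.VectorParams(
+        size=128,  # ColPali produces 128-dimensional token embeddings
+        distance=models.Distance.COSINE,
+        multivector_config=models.MultiVectorConfig(
+            comparator=models.MultiVectorComparator.MAX_SIM
+        ),
+        quantization_config=models.BinaryQuantization(
+            binary=models.BinaryQuantizationConfig(always_ram=True)
+        ),
+    ),
+)
+
+# At query time, oversampling fetches extra candidates from the compressed
+# (binary) index and rescoring re-ranks them with the original vectors
+search_params = models.SearchParams(
+    quantization=models.QuantizationSearchParams(
+        ignore=False,
+        rescore=True,
+        oversampling=2.0,
+    )
+)
+```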
### Step 1: Install Required Libraries

Before diving into the code, let's install and import the libraries we'll be using:

```python
!pip install uv
!uv pip install --system "colpali_engine>=0.3.1" datasets "huggingface_hub[hf_transfer]" qdrant-client "transformers>=4.45.0" stamina rich
```

```
Collecting uv
  Downloading uv-0.4.26-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.7 MB)
Installing collected packages: uv
Successfully installed uv-0.4.26
Using Python 3.10.12 environment at /usr
Resolved 73 packages in 3.57s
Prepared 21 packages in 607ms
Uninstalled 3 packages in 202ms
Installed 21 packages in 45ms
 + colpali-engine==0.3.2
 + datasets==3.0.2
 + dill==0.3.8
 + gputil==1.4.0
 + grpcio-tools==1.64.1
 + h11==0.14.0
 + h2==4.1.0
 + hf-transfer==0.1.8
 + hpack==4.0.0
 + httpcore==1.0.6
 + httpx==0.27.2
 + hyperframe==6.0.1
 + multiprocess==0.70.16
 + peft==0.11.1
 + portalocker==2.10.1
 - protobuf==3.20.3
 + protobuf==5.28.3
 + qdrant-client==1.12.0
 + stamina==24.3.0
 - tokenizers==0.19.1
 + tokenizers==0.20.1
 - transformers==4.44.2
 + transformers==4.46.0
 + xxhash==3.5.0
```

```python
import os
import torch
import time
from qdrant_client import QdrantClient
from qdrant_client.http import models
from tqdm import tqdm
from datasets import load_dataset
```

## Step 2: Downloading the UFO Documents Dataset

We will retrieve the UFO dataset from the Hugging Face Hub.

```python
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = (
    "1"  # optional setting for faster dataset downloads
)
```
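The dataset and the fine-tuned model used in this tutorial are public, so authentication is optional, but if you work with gated or private repositories you can log in to the Hugging Face Hub first. Here is a minimal, optional sketch using `huggingface_hub`:

```python
from huggingface_hub import login

# Optional: authenticate with the Hugging Face Hub.
# Only needed for gated or private datasets/models; the UFO dataset is public.
# login()  # prompts for a token interactively, or pass login(token="hf_...")
```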
Now, let's load the dataset:

```python
dataset = load_dataset("davanstrien/ufo-ColPali", split="train")
```

## Step 3: Connecting to Qdrant

You can create a free cluster in [Qdrant Cloud](https://cloud.qdrant.io/) or run Qdrant on your local machine with our [Python Client](https://github.com/qdrant/qdrant-client). Here we read the Qdrant Cloud API key from Colab's user secrets:

```python
from google.colab import userdata
```
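If you are not running this notebook in Google Colab, Colab's `userdata` secrets won't be available; a plain environment variable works just as well. This is only a sketch, and the `QDRANT_API_KEY` variable name is an arbitrary choice, not something the rest of the tutorial relies on:

```python
import os

# Hypothetical non-Colab alternative: read the API key from an environment
# variable that you exported yourself (the name QDRANT_API_KEY is arbitrary).
qdrant_api_key = os.environ.get("QDRANT_API_KEY")
```

In that case, pass `api_key=qdrant_api_key` to the client below instead of `userdata.get(...)`.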
Now we connect to the cluster:

```python
qdrant_client = QdrantClient(
    url="https://56486603-8c49-4917-b932-38fef2e2cca3.europe-west3-0.gcp.staging-cloud.qdrant.io",  # replace with your own cluster URL
    api_key=userdata.get("qdrantcloud"),
)
```

If you want to start testing without setting up persistent storage, you can initialize an in-memory Qdrant instance. **But keep in mind that the data won't persist after the session ends:**

```python
# qdrant_client = QdrantClient(
#     ":memory:"
# )
```

## Step 4: Setting Up ColPali

Here we use a ColPali model that has been fine-tuned on the UFO dataset.

```python
from colpali_engine.models import ColPali, ColPaliProcessor

# Initialize ColPali model and processor
model_name = (
    "davanstrien/finetune_colpali_v1_2-ufo-4bit"  # Use the latest version available
)
colpali_model = ColPali.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # Use "cuda:0" for GPU, "cpu" for CPU, or "mps" for Apple Silicon
)
colpali_processor = ColPaliProcessor.from_pretrained(
    "vidore/colpaligemma-3b-pt-448-base"
)
```

```
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning:
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
```

## Step 5: Creating the Collection

Next, we create the collection that will store the multivector embeddings. Each ColPali vector has 128 dimensions and is compared with the `MAX_SIM` comparator. We enable Binary Quantization so that only the quantized vectors are kept in RAM, while the original vectors and the payload stay on disk. Any collection name works; here we call it `ufo-binary`.

```python
collection_name = "ufo-binary"  # arbitrary name; pick whatever you like

qdrant_client.create_collection(
    collection_name=collection_name,
    on_disk_payload=True,  # store the payload on disk
    vectors_config=models.VectorParams(
        size=128,
        distance=models.Distance.COSINE,
        on_disk=True,  # move original vectors to disk
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM
        ),
        quantization_config=models.BinaryQuantization(
            binary=models.BinaryQuantizationConfig(
                always_ram=True  # keep only quantized vectors in RAM
            ),
        ),
    ),
)
```

```
True
```

## Step 6: Uploading the Vectors to Qdrant

In this step, we index the vectors into our Qdrant collection in batches.

For each batch, the images are processed and encoded with the ColPali model, turning them into multi-vector embeddings. These embeddings are then converted from tensors into lists of vectors, capturing key details from each image and creating a multi-vector representation for each document. This setup works well with Qdrant's multivector capabilities.

After processing, the vectors and any metadata are uploaded to Qdrant, gradually building up the index. You can lower or increase the `batch_size` depending on your available GPU resources.

We wrap the upsert call in a small helper that retries on failure:

```python
import stamina


@stamina.retry(
    on=Exception, attempts=3
)  # retry mechanism if an exception occurs during the operation
def upsert_to_qdrant(points):
    try:
        qdrant_client.upsert(
            collection_name=collection_name,
            points=points,
            wait=False,
        )
    except Exception as e:
        print(f"Error during upsert: {e}")
        return False
    return True
```
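Before launching the full indexing loop, it can help to sanity-check what ColPali produces for a single page. The snippet below is an optional addition (it assumes the dataset and model loaded in the previous steps); it encodes one image and prints the embedding shape, whose last dimension should match the `size=128` configured for the collection:

```python
# Optional sanity check: encode a single page and inspect the multivector shape.
with torch.no_grad():
    sample_batch = colpali_processor.process_images([dataset[0]["image"]]).to(
        colpali_model.device
    )
    sample_embedding = colpali_model(**sample_batch)

# Expected shape: (1, sequence_length, 128), i.e. one 128-dimensional vector per image token
print(sample_embedding.shape)
```

Now we can run the full indexing loop: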
```python
batch_size = 4  # Adjust based on your GPU memory constraints

# Use tqdm to create a progress bar
with tqdm(total=len(dataset), desc="Indexing Progress") as pbar:
    for i in range(0, len(dataset), batch_size):
        batch = dataset[i : i + batch_size]

        # The images are already PIL Image objects, so we can use them directly
        images = batch["image"]

        # Process and encode images
        with torch.no_grad():
            batch_images = colpali_processor.process_images(images).to(
                colpali_model.device
            )
            image_embeddings = colpali_model(**batch_images)

        # Prepare points for Qdrant
        points = []
        for j, embedding in enumerate(image_embeddings):
            # Convert the embedding to a list of vectors
            multivector = embedding.cpu().float().numpy().tolist()
            points.append(
                models.PointStruct(
                    id=i + j,  # we just use the index as the ID
                    vector=multivector,  # This is now a list of vectors
                    payload={
                        "source": "internet archive"
                    },  # can also add other metadata/data
                )
            )

        # Upload points to Qdrant
        try:
            upsert_to_qdrant(points)
        except Exception as e:
            print(f"Error during upsert: {e}")
            continue

        # Update the progress bar
        pbar.update(batch_size)

print("Indexing complete!")
```

```
Indexing Progress: 2244it [2:06:11,  3.37s/it]

Indexing complete!
```

Once the upload has finished, we lower the collection's `indexing_threshold` so the optimizer starts building the index for the newly uploaded vectors right away:

```python
qdrant_client.update_collection(
    collection_name=collection_name,
    optimizer_config=models.OptimizersConfigDiff(indexing_threshold=10),
)
```

```
True
```

## Step 7: Processing the Query

Let's now prepare our search query. In this step, the text query "top secret" is processed and transformed into a tensor by the `colpali_processor.process_queries` function.

```python
query_text = "top secret"
with torch.no_grad():
    batch_query = colpali_processor.process_queries([query_text]).to(
        colpali_model.device
    )
    query_embedding = colpali_model(**batch_query)
query_embedding
```

```
tensor([[[ 0.1396, -0.0132,  0.0903,  ..., -0.0198, -0.0791, -0.0244],
         [-0.1035, -0.1079,  0.0520,  ..., -0.0151, -0.0874,  0.0854],
         [-0.1201, -0.0334,  0.0742,  ..., -0.0243, -0.0056, -0.0293],
         ...,
         [-0.0603,  0.0245,  0.0493,  ...,  0.0061,  0.0405,  0.0422],
         [-0.0708,  0.0322,  0.0413,  ...,  0.0003,  0.0435,  0.0294],
         [-0.0542,  0.0510,  0.0708,  ..., -0.0197,  0.0366,  0.0114]]],
       device='cuda:0', dtype=torch.bfloat16)
```

After generating the query embedding tensor, we need to convert it into a multivector that Qdrant can use for searching.

```python
multivector_query = query_embedding[0].cpu().float().numpy().tolist()
```

## Step 8: Searching and Retrieving the Documents

In this step, we perform a search to retrieve the top 10 results closest to our query multivector.

We apply rescoring to refine the initial search results by reevaluating the most relevant candidates with a more precise scoring algorithm, and oversampling to improve search accuracy by retrieving a larger pool of candidates than the final number required. Finally, we measure and display how long the search takes.
```python
start_time = time.time()

# Search in Qdrant
search_result = qdrant_client.query_points(
    collection_name=collection_name,
    query=multivector_query,
    limit=10,
    timeout=100,
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(
            ignore=False,
            rescore=True,
            oversampling=2.0,
        )
    ),
)
end_time = time.time()

elapsed_time = end_time - start_time
print(f"Search completed in {elapsed_time:.4f} seconds")
```

```
Search completed in 0.7175 seconds
```

The search completed in roughly 0.72 seconds, about twice as fast as Scalar Quantization, which took 1.56 seconds in previous tests with the same settings.

Let's now check the first match for our search query and see whether we get a result similar to the one in the original author's example. Daniel van Strien used Scalar Quantization in his tutorial, so this comparison shows how Binary Quantization holds up in terms of accuracy and relevance.

```python
idx = search_result.points[0].id
dataset[idx]["image"]
```

![png](documentation/301-advanced/colpali_demo_binary/output_37_0.png)

And it's a match! Both Scalar and Binary Quantization returned the same top result for the same query.

However, keep in mind that this is just a quick experiment. Performance may vary, so it's important to test Binary Quantization on your own datasets to see how it performs for your specific use case. That said, it's promising to see Binary Quantization maintaining search quality while potentially offering performance improvements with ColPali.

```python
idx = search_result.points[1].id
dataset[idx]["image"]
```

![png](documentation/301-advanced/colpali_demo_binary/output_39_0.png)

```python
idx = search_result.points[2].id
dataset[idx]["image"]
```

![png](documentation/301-advanced/colpali_demo_binary/output_40_0.png)

```python
idx = search_result.points[3].id
dataset[idx]["image"]
```

![png](documentation/301-advanced/colpali_demo_binary/output_41_0.png)

```python
idx = search_result.points[4].id
dataset[idx]["image"]
```

![png](documentation/301-advanced/colpali_demo_binary/output_42_0.png)

```python
idx = search_result.points[5].id
dataset[idx]["image"]
```

![png](documentation/301-advanced/colpali_demo_binary/output_43_0.png)

This is it! Feel free to experiment with your own data and settings, and remember to evaluate both performance and quality for your specific use case before making any final decisions.

Happy searching!

### References:

[1] Faysse, M., Sibille, H., Wu, T., Omrani, B., Viaud, G., Hudelot, C., Colombo, P. (2024). *ColPali: Efficient Document Retrieval with Vision Language Models*. arXiv.

[2] van Strien, D. (2024). *Using ColPali with Qdrant to index and search a UFO document dataset*. Published October 2, 2024. Blog post:

[3] Łukawski, K. (2024). *Any Embedding Model Can Become a Late Interaction Model... If You Give It a Chance!* Qdrant Blog, August 14, 2024.
Available at: diff --git a/qdrant-landing/static/documentation/101-foundations/03_qdrant_101_audio/main_pic.png b/qdrant-landing/static/documentation/101-foundations/03_qdrant_101_audio/main_pic.png new file mode 100644 index 000000000..50fb6255c Binary files /dev/null and b/qdrant-landing/static/documentation/101-foundations/03_qdrant_101_audio/main_pic.png differ diff --git a/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/crabmera.png b/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/crabmera.png new file mode 100644 index 000000000..5ad339019 Binary files /dev/null and b/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/crabmera.png differ diff --git a/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_15_0.png b/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_15_0.png new file mode 100644 index 000000000..903720c19 Binary files /dev/null and b/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_15_0.png differ diff --git a/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_46_10.png b/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_46_10.png new file mode 100644 index 000000000..761be536c Binary files /dev/null and b/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_46_10.png differ diff --git a/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_46_2.png b/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_46_2.png new file mode 100644 index 000000000..198d9bca3 Binary files /dev/null and b/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_46_2.png differ diff --git a/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_46_6.png b/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_46_6.png new file mode 100644 index 000000000..90bb64b03 Binary files /dev/null and b/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_46_6.png differ diff --git a/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_48_1.png b/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_48_1.png new file mode 100644 index 000000000..6ef783197 Binary files /dev/null and b/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_48_1.png differ diff --git a/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_55_0.png b/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_55_0.png new file mode 100644 index 000000000..90bb64b03 Binary files /dev/null and b/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_55_0.png differ diff --git a/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_58_10.png b/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_58_10.png new file mode 100644 index 000000000..447a9c988 Binary files /dev/null and b/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_58_10.png differ diff --git a/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_58_2.png b/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_58_2.png new file mode 100644 index 000000000..6d6ef7b74 Binary files /dev/null and b/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_58_2.png differ 
diff --git a/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_58_6.png b/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_58_6.png new file mode 100644 index 000000000..cae013d7b Binary files /dev/null and b/qdrant-landing/static/documentation/101-foundations/04_qdrant_101_cv/output_58_6.png differ diff --git a/qdrant-landing/static/documentation/101-foundations/ecommerce-reverse-image-search/output_21_0.png b/qdrant-landing/static/documentation/101-foundations/ecommerce-reverse-image-search/output_21_0.png new file mode 100644 index 000000000..e944adfc5 Binary files /dev/null and b/qdrant-landing/static/documentation/101-foundations/ecommerce-reverse-image-search/output_21_0.png differ diff --git a/qdrant-landing/static/documentation/101-foundations/from-pinecone-to-qdrant/pinecone.png b/qdrant-landing/static/documentation/101-foundations/from-pinecone-to-qdrant/pinecone.png new file mode 100644 index 000000000..212023883 Binary files /dev/null and b/qdrant-landing/static/documentation/101-foundations/from-pinecone-to-qdrant/pinecone.png differ diff --git a/qdrant-landing/static/documentation/101-foundations/from-pinecone-to-qdrant/qdrant.png b/qdrant-landing/static/documentation/101-foundations/from-pinecone-to-qdrant/qdrant.png new file mode 100644 index 000000000..f184ed492 Binary files /dev/null and b/qdrant-landing/static/documentation/101-foundations/from-pinecone-to-qdrant/qdrant.png differ diff --git a/qdrant-landing/static/documentation/101-foundations/qdrant_and_text_data/crab_nlp.png b/qdrant-landing/static/documentation/101-foundations/qdrant_and_text_data/crab_nlp.png new file mode 100644 index 000000000..d055d34bf Binary files /dev/null and b/qdrant-landing/static/documentation/101-foundations/qdrant_and_text_data/crab_nlp.png differ diff --git a/qdrant-landing/static/documentation/101-foundations/qdrant_and_text_data/output_26_0.png b/qdrant-landing/static/documentation/101-foundations/qdrant_and_text_data/output_26_0.png new file mode 100644 index 000000000..38785fc96 Binary files /dev/null and b/qdrant-landing/static/documentation/101-foundations/qdrant_and_text_data/output_26_0.png differ diff --git a/qdrant-landing/static/documentation/101-foundations/qdrant_and_text_data/output_29_0.png b/qdrant-landing/static/documentation/101-foundations/qdrant_and_text_data/output_29_0.png new file mode 100644 index 000000000..0b6dd25f2 Binary files /dev/null and b/qdrant-landing/static/documentation/101-foundations/qdrant_and_text_data/output_29_0.png differ diff --git a/qdrant-landing/static/documentation/201-intermediate/Multimodal_Search_with_FastEmbed/output_10_0.png b/qdrant-landing/static/documentation/201-intermediate/Multimodal_Search_with_FastEmbed/output_10_0.png new file mode 100644 index 000000000..7e0ecb0b5 Binary files /dev/null and b/qdrant-landing/static/documentation/201-intermediate/Multimodal_Search_with_FastEmbed/output_10_0.png differ diff --git a/qdrant-landing/static/documentation/201-intermediate/Multimodal_Search_with_FastEmbed/output_12_0.png b/qdrant-landing/static/documentation/201-intermediate/Multimodal_Search_with_FastEmbed/output_12_0.png new file mode 100644 index 000000000..45d94092c Binary files /dev/null and b/qdrant-landing/static/documentation/201-intermediate/Multimodal_Search_with_FastEmbed/output_12_0.png differ diff --git "a/qdrant-landing/static/documentation/201-intermediate/Qdrant and LlamaIndex \342\200\224 A new way to keep your Q&A systems up-to-date/RankFocus.png" 
"b/qdrant-landing/static/documentation/201-intermediate/Qdrant and LlamaIndex \342\200\224 A new way to keep your Q&A systems up-to-date/RankFocus.png" new file mode 100644 index 000000000..b59a5d650 Binary files /dev/null and "b/qdrant-landing/static/documentation/201-intermediate/Qdrant and LlamaIndex \342\200\224 A new way to keep your Q&A systems up-to-date/RankFocus.png" differ diff --git "a/qdrant-landing/static/documentation/201-intermediate/Qdrant and LlamaIndex \342\200\224 A new way to keep your Q&A systems up-to-date/RerankFocus.png" "b/qdrant-landing/static/documentation/201-intermediate/Qdrant and LlamaIndex \342\200\224 A new way to keep your Q&A systems up-to-date/RerankFocus.png" new file mode 100644 index 000000000..76c9302ec Binary files /dev/null and "b/qdrant-landing/static/documentation/201-intermediate/Qdrant and LlamaIndex \342\200\224 A new way to keep your Q&A systems up-to-date/RerankFocus.png" differ diff --git "a/qdrant-landing/static/documentation/201-intermediate/Qdrant and LlamaIndex \342\200\224 A new way to keep your Q&A systems up-to-date/SetupFocus.png" "b/qdrant-landing/static/documentation/201-intermediate/Qdrant and LlamaIndex \342\200\224 A new way to keep your Q&A systems up-to-date/SetupFocus.png" new file mode 100644 index 000000000..2972d5bf2 Binary files /dev/null and "b/qdrant-landing/static/documentation/201-intermediate/Qdrant and LlamaIndex \342\200\224 A new way to keep your Q&A systems up-to-date/SetupFocus.png" differ diff --git a/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/1*aHXJ5wuWuh1faf_BF7i4og.png b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/1*aHXJ5wuWuh1faf_BF7i4og.png new file mode 100644 index 000000000..1d373f724 Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/1*aHXJ5wuWuh1faf_BF7i4og.png differ diff --git a/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/1*yIMiJaQexgNqU3BXdR5WKg.png b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/1*yIMiJaQexgNqU3BXdR5WKg.png new file mode 100644 index 000000000..035398648 Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/1*yIMiJaQexgNqU3BXdR5WKg.png differ diff --git a/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/2-Figure1-1.png b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/2-Figure1-1.png new file mode 100644 index 000000000..b4f19a58c Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/2-Figure1-1.png differ diff --git a/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/2a5240721c8841188c22224f5a095d8c.png b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/2a5240721c8841188c22224f5a095d8c.png new file mode 100644 index 000000000..5c4bbf910 Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/2a5240721c8841188c22224f5a095d8c.png differ diff --git a/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/5accaa76-6fad-437b-8fbb-94b9544c3789_image7.png b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/5accaa76-6fad-437b-8fbb-94b9544c3789_image7.png new file mode 100644 index 000000000..2385200e2 Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/5accaa76-6fad-437b-8fbb-94b9544c3789_image7.png differ diff --git 
a/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/704c6118363e46b79912e7b99e0472d9.png b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/704c6118363e46b79912e7b99e0472d9.png new file mode 100644 index 000000000..f4a16f714 Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/704c6118363e46b79912e7b99e0472d9.png differ diff --git a/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/88ab85414277454999d1898aa946dad2.png b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/88ab85414277454999d1898aa946dad2.png new file mode 100644 index 000000000..592f0c136 Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/88ab85414277454999d1898aa946dad2.png differ diff --git a/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/900.jpg b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/900.jpg new file mode 100644 index 000000000..8ce1f04c9 Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/900.jpg differ diff --git a/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/927c14076d9f414c89abbcddb0b846d5.png b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/927c14076d9f414c89abbcddb0b846d5.png new file mode 100644 index 000000000..ff0984878 Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/927c14076d9f414c89abbcddb0b846d5.png differ diff --git a/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/How-Do-Embeddings-Work_.jpg b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/How-Do-Embeddings-Work_.jpg new file mode 100644 index 000000000..a953f9471 Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/How-Do-Embeddings-Work_.jpg differ diff --git a/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/How-Embeddings-Work.jpg b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/How-Embeddings-Work.jpg new file mode 100644 index 000000000..838418d9b Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/How-Embeddings-Work.jpg differ diff --git a/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/bf78579e67de42fc885b21ab3f1a38e4.png b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/bf78579e67de42fc885b21ab3f1a38e4.png new file mode 100644 index 000000000..66991b365 Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/bf78579e67de42fc885b21ab3f1a38e4.png differ diff --git a/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/c61fb9ee0a64446c8016d82b3709f48c.png b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/c61fb9ee0a64446c8016d82b3709f48c.png new file mode 100644 index 000000000..be21e3dfe Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/c61fb9ee0a64446c8016d82b3709f48c.png differ diff --git a/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/e38eea92d36f4d67958cef06acc29797.png b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/e38eea92d36f4d67958cef06acc29797.png new file mode 100644 index 000000000..517ecd091 Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/e38eea92d36f4d67958cef06acc29797.png differ diff --git 
a/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/qdrant_overview_high_level.png b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/qdrant_overview_high_level.png new file mode 100644 index 000000000..f6f9d5f85 Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/qdrant_overview_high_level.png differ diff --git a/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/social_preview.jpg b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/social_preview.jpg new file mode 100644 index 000000000..a35933d7f Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/Qdrant_data_prep/social_preview.jpg differ diff --git a/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/image-278.png b/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/image-278.png new file mode 100644 index 000000000..bda55983b Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/image-278.png differ diff --git a/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/image-279.png b/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/image-279.png new file mode 100644 index 000000000..fbb2359b2 Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/image-279.png differ diff --git a/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/output_14_0.png b/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/output_14_0.png new file mode 100644 index 000000000..5edc20c70 Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/output_14_0.png differ diff --git a/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/output_37_0.png b/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/output_37_0.png new file mode 100644 index 000000000..eecbfd9db Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/output_37_0.png differ diff --git a/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/output_39_0.png b/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/output_39_0.png new file mode 100644 index 000000000..c9dcc88f0 Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/output_39_0.png differ diff --git a/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/output_40_0.png b/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/output_40_0.png new file mode 100644 index 000000000..f7963de9d Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/output_40_0.png differ diff --git a/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/output_41_0.png b/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/output_41_0.png new file mode 100644 index 000000000..4bac2eb5e Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/output_41_0.png differ diff --git a/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/output_42_0.png b/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/output_42_0.png new file mode 100644 index 000000000..67fb9114c Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/output_42_0.png differ 
diff --git a/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/output_43_0.png b/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/output_43_0.png new file mode 100644 index 000000000..6ed472c72 Binary files /dev/null and b/qdrant-landing/static/documentation/301-advanced/colpali_demo_binary/output_43_0.png differ