diff --git a/qdrant-landing/content/blog/colpali-optimization.md b/qdrant-landing/content/blog/colpali-optimization.md
index c21d4c338..536fe78ca 100644
--- a/qdrant-landing/content/blog/colpali-optimization.md
+++ b/qdrant-landing/content/blog/colpali-optimization.md
@@ -40,12 +40,10 @@ Consider this scenario:
 The total number of comparisons is calculated as:

 $$
-1,000 \cdot 1,000 \cdot 20,000 \cdot 128 = 2.56 \times 10^{16} \text{ comparisons!}
+1,000 \cdot 1,000 \cdot 20,000 \cdot 128 = 2.56 \times 10^{12} \text{ comparisons!}
 $$

-
-
-That's trillions of comparisons needed to build the index. Even advanced indexing algorithms like **HNSW** struggle with this scale, as computational costs grow quadratically.
+That's trillions of comparisons needed to build the index. Even advanced indexing algorithms like **HNSW** struggle with this scale, as computational costs grow quadratically with amount of multivectors per page.

 We turned to a hybrid optimization strategy combining **pooling** (to reduce computational overhead) and **reranking** (to preserve accuracy).

@@ -59,14 +57,16 @@ For those eager to explore, the [codebase is available here](https://github.com/

 ### Pooling

-Pooling is well-known in machine learning as a way to compress data while keeping important information intact. For ColPali, we reduced ~1,030 vectors per page to just 38 vectors by pooling rows in the document's 32x32 grid.
+Pooling is well-known in machine learning as a way to compress data while keeping important information. For ColPali, we reduced 1,030 vectors per page to just 38 vectors by pooling rows in the document's 32x32 grid.
+
+![](/blog/colpali-optimization/rows.png)

 Max and mean pooling are the two most popular types, so we decided to test both approaches on the rows of the grid. Likewise, we could apply pooling on columns, which we plan to explore in the future.

 - **Mean Pooling:** Averages values across rows.
 - **Max Pooling:** Selects the maximum value for each feature.

-32 vectors represent the pooled rows, while the final 6 vectors encode contextual information derived from ColPali’s special tokens (e.g., for the beginning of the sequence, and task-specific instructions like “Describe the image”).
+32 vectors represent the pooled rows, while 6 vectors encode contextual information derived from ColPali’s special tokens (e.g., for the beginning of the sequence, and task-specific instructions like “Describe the image”).

 For our experiments, we chose to preserve these 6 additional vectors.

@@ -82,26 +82,16 @@ Pooling drastically reduces retrieval costs, but there’s a risk of losing fine
 We created a custom dataset with over 20,000 unique PDF pages by merging:

 - **ViDoRe Benchmark:** Designed for PDF documents retrieval evaluation.
-- **UFO Dataset:** Visually rich documents paired with synthetic queries.
+- **UFO Dataset:** Visually rich documents paired with synthetic queries [generated by Daniel van Strien](https://huggingface.co/datasets/davanstrien/ufo-ColPali).
 - **DocVQA Dataset:** A large set of document-derived Q&A pairs.

-Each document was processed into 32x32 grids, generating both full-resolution and pooled embeddings.
-
-![](/blog/colpali-optimization/rows.png)
-
-These embeddings were stored in the **Qdrant vector database**:
-
-- **Full-Resolution Embeddings:** ~1,030 vectors per page.
-- **Pooled Embeddings:** Mean and max pooling variants.
-
-All embeddings were kept in RAM to avoid caching effect in experiments realted to the speed of retrieval.
+Each document was processed into 32x32 grids, generating both full-resolution and pooled embeddings. **Full-resolution** embeddings consisted of 1,030 vectors per page, while **pooled embeddings** included mean and max pooling variants.
+All embeddings were were stored and kept in RAM to avoid caching effects during retrieval speed experiments.

 ### Experiment Setup

-We evaluated retrieval quality using 1,000 random sampled queries and the retrieval process followed the two-stage approach:
-1. **Pooled embeddings** retrieved the top 200 candidates.
-2. **Full-resolution embeddings** reranked these candidates to produce the final top 20 results.
+We evaluated retrieval quality with 1,000 queries. First, pooled embeddings retrieved the top 200 candidates. Then, full-resolution embeddings reranked them to produce the final top 20 results.

 To measure performance, we used:

@@ -110,10 +100,7 @@ To measure performance, we used:

 ## Results

-The experiments gave us some very promissing results:
-
-- **Speed:** Retrieval time improved **13x** compared to full-resolution embeddings alone.
-- **Accuracy:** Mean pooling preserved retrieval quality nearly identical to the original ColPali.
+The experiment showed promising improvements in speed and accuracy. Retrieval time improved **13x** compared to using full-resolution embeddings alone.

 ### Metrics

@@ -123,7 +110,7 @@ The experiments gave us some very promissing results:
 | **Max** | 0.759 | 0.656 |


-Mean pooling offered the ideal balance, combining speed and precision. Max pooling did not perform well enough to be considered viable since it sacrificed significant accuracy without delivering a meaningful speed advantage.
+Mean pooling preserved nearly identical quality to the original ColPali, with NDCG@20 = 0.952 and Recall@20 = 0.917. Max pooling did not perform well enough to be considered viable since it sacrificed significant accuracy without delivering a meaningful speed advantage.

 ## What’s Next?
 Future experiments could push these results even further: