| 1 | +--- |
| 2 | +title: "Qdrant 1.12 - Distance Matrix, Facet Counting & On-Disk Indexing" |
| 3 | +draft: false |
| 4 | +short_description: "On-Disk Text & Geo Index. Distance Matrix API. Facet API for Cardinality." |
| 5 | +description: "Uncover insights with the Distance Matrix API, dynamically filter via Facet API, and offload additional payload to disk." |
| 6 | +preview_image: /blog/qdrant-1.12.x/social_preview.png |
| 7 | +social_preview_image: /blog/qdrant-1.12.x/social_preview.png |
| 8 | +date: 2024-10-08T00:00:00-08:00 |
| 9 | +author: David Myriel |
| 10 | +featured: true |
| 11 | +tags: |
| 12 | + - vector search |
| 13 | + - distance matrix |
| 14 | + - dimensionality reduction |
| 15 | + - data exploration |
| 16 | + - data visualization |
| 17 | + - faceting |
| 18 | + - facet api |
| 19 | +--- |
| 20 | +[**Qdrant 1.12.0 is out!**](https://github.com/qdrant/qdrant/releases/tag/v1.12.0) Let's look at major new features and a few minor additions: |
| 21 | + |
| 22 | +**Distance Matrix API:** Efficiently calculate pairwise distances between vectors.<br> |
| 23 | +**GUI Data Exploration:** Visually navigate your dataset and analyze vector relationships.<br> |
| 24 | +**Faceting API:** Dynamically aggregate and count unique values in specific fields.<br> |
| 25 | + |
| 26 | +**Text Index on disk:** Reduce memory usage by storing text indexing data on disk.<br> |
| 27 | +**Geo Index on disk:** Offload indexed geographic data to disk for memory efficiency. |
| 28 | + |
| 29 | +## Distance Matrix API for Data Insights |
| 30 | + |
| 31 | + |
| 32 | +> **Qdrant** is a similarity search engine. Our mission is to give you the tools to **discover and understand connections** between vast amounts of semantically relevant data. |
| 33 | + |
| 34 | +The **Distance Matrix API** is here to lay the groundwork for such tools. |
| 35 | + |
| 36 | +In data exploration, tasks like [**clustering**](https://en.wikipedia.org/wiki/DBSCAN) and [**dimensionality reduction**](https://en.wikipedia.org/wiki/Dimensionality_reduction) rely on calculating distances between data points. |
| 37 | + |
| 38 | +**Use Case:** A retail company with 10,000 customers wants to segment them by purchasing behavior. Each customer is stored as a vector in Qdrant, but without a dedicated API, clustering would need 10,000 separate batch requests, making the process inefficient and costly. |
| 39 | + |
| 40 | +You can use this API to compute a **sparse matrix of distances** that is optimized for large datasets. Then, you can filter through the retrieved data to find the exact vector relationships that matter. |
| 41 | + |
| 42 | +In terms of endpoints, we offer two different formats to show results: |
| 43 | +- **Pairs** are simple, intuitive, and ideal for graph representation. |
| 44 | +- **Offsets** are more complex, but fit naturally when defining CSR sparse matrices. |
| 45 | + |
| 46 | +### Output - Pairs |
| 47 | + |
| 48 | +Use the `pairs` endpoint to compare 10 randomly sampled points from your dataset: |
| 49 | + |
| 50 | +```http |
| 51 | +POST /collections/{collection_name}/points/search/matrix/pairs |
| 52 | +{ |
| 53 | + "sample": 10, |
| 54 | + "limit": 2 |
| 55 | +} |
| 56 | +``` |
| 57 | +Setting `sample` to 10 retrieves a random group of 10 points to compare. The `limit` is the number of closest connections to consider for each sampled point. |
| 58 | + |
| 59 | +Qdrant will list a sparse matrix of distances **between the closest pairs**: |
| 60 | + |
| 61 | +```json |
| 62 | +{ |
| 63 | + "result": { |
| 64 | + "pairs": [ |
| 65 | + {"a": 1, "b": 3, "score": 1.4063001}, |
| 66 | + {"a": 1, "b": 4, "score": 1.2531}, |
| 67 | + {"a": 2, "b": 1, "score": 1.1550001}, |
| 68 | + {"a": 2, "b": 8, "score": 1.1359}, |
| 69 | + {"a": 3, "b": 1, "score": 1.4063001}, |
| 70 | + {"a": 3, "b": 4, "score": 1.2218001}, |
| 71 | + {"a": 4, "b": 1, "score": 1.2531}, |
| 72 | + {"a": 4, "b": 3, "score": 1.2218001}, |
| 73 | + {"a": 5, "b": 3, "score": 0.70239997}, |
| 74 | + {"a": 5, "b": 1, "score": 0.6146}, |
| 75 | + {"a": 6, "b": 3, "score": 0.6353}, |
| 76 | + {"a": 6, "b": 4, "score": 0.5093}, |
| 77 | + {"a": 7, "b": 3, "score": 1.0990001}, |
| 78 | + {"a": 7, "b": 1, "score": 1.0349001}, |
| 79 | + {"a": 8, "b": 2, "score": 1.1359}, |
| 80 | + {"a": 8, "b": 3, "score": 1.0553} |
| 81 | + ] |
| 82 | + } |
| 83 | +} |
| 84 | +``` |
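
Because each pair names a source point `a`, a neighbor `b`, and a `score`, the list maps directly onto a graph. The following is a minimal Python sketch, not part of the Qdrant client: the helper name `pairs_to_adjacency` is hypothetical, and the response is a hand-copied subset of the output above.

```python
from collections import defaultdict

# A subset of the `pairs` response shown above, copied by hand.
response = {
    "result": {
        "pairs": [
            {"a": 1, "b": 3, "score": 1.4063001},
            {"a": 1, "b": 4, "score": 1.2531},
            {"a": 2, "b": 1, "score": 1.1550001},
            {"a": 2, "b": 8, "score": 1.1359},
        ]
    }
}


def pairs_to_adjacency(pairs):
    """Group pairs by source point: {a: [(b, score), ...]}.

    Hypothetical helper for illustration only.
    """
    graph = defaultdict(list)
    for pair in pairs:
        graph[pair["a"]].append((pair["b"], pair["score"]))
    return dict(graph)


graph = pairs_to_adjacency(response["result"]["pairs"])
for point_id, neighbors in graph.items():
    print(point_id, neighbors)
```

An adjacency list like this can be handed to a graph library or a clustering routine without further reshaping.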
| 85 | + |
| 86 | +### Output - Offsets |
| 87 | + |
| 88 | +The `offsets` endpoint offers another format for showing the distances between points: |
| 89 | + |
| 90 | +```http |
| 91 | +POST /collections/{collection_name}/points/search/matrix/offsets |
| 92 | +{ |
| 93 | + "sample": 10, |
| 94 | + "limit": 2 |
| 95 | +} |
| 96 | +``` |
| 97 | + |
| 98 | +Qdrant will return a compact representation of the distances between points in the **form of row and column offsets**. |
| 99 | + |
| 100 | +Two arrays, `offsets_row` and `offsets_col`, represent the positions of non-zero distance values in the matrix. Each entry in these arrays corresponds to a pair of points with a calculated distance. |
| 101 | + |
| 102 | +```json |
| 103 | +{ |
| 104 | + "result": { |
| 105 | + "offsets_row": [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7], |
| 106 | + "offsets_col": [2, 3, 0, 7, 0, 3, 0, 2, 2, 0, 2, 3, 2, 0, 1, 2], |
| 107 | + "scores": [ |
| 108 | + 1.4063001, 1.2531, 1.1550001, 1.1359, 1.4063001, |
| 109 | + 1.2218001, 1.2531, 1.2218001, 0.70239997, 0.6146, 0.6353, |
| 110 | + 0.5093, 1.0990001, 1.0349001, 1.1359, 1.0553 |
| 111 | + ], |
| 112 | + "ids": [1, 2, 3, 4, 5, 6, 7, 8] |
| 113 | + } |
| 114 | +} |
| 115 | +``` |
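
To see how the two formats relate, the sketch below rebuilds the pairs from the offsets response and derives a CSR-style row pointer in plain Python. The helper names `offsets_to_pairs` and `row_pointer` are hypothetical, and the row pointer assumes `offsets_row` is sorted in ascending order, as it is in the sample response.

```python
# The `offsets` response shown above, copied by hand.
result = {
    "offsets_row": [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7],
    "offsets_col": [2, 3, 0, 7, 0, 3, 0, 2, 2, 0, 2, 3, 2, 0, 1, 2],
    "scores": [
        1.4063001, 1.2531, 1.1550001, 1.1359, 1.4063001,
        1.2218001, 1.2531, 1.2218001, 0.70239997, 0.6146, 0.6353,
        0.5093, 1.0990001, 1.0349001, 1.1359, 1.0553,
    ],
    "ids": [1, 2, 3, 4, 5, 6, 7, 8],
}


def offsets_to_pairs(result):
    """Map matrix positions back to point ids: (id_a, id_b, score)."""
    ids = result["ids"]
    return [
        (ids[row], ids[col], score)
        for row, col, score in zip(
            result["offsets_row"], result["offsets_col"], result["scores"]
        )
    ]


def row_pointer(offsets_row, n_rows):
    """CSR-style row pointer: with rows sorted ascending, the entries of
    row i occupy positions indptr[i] up to (not including) indptr[i + 1]."""
    indptr = [0] * (n_rows + 1)
    for row in offsets_row:
        indptr[row + 1] += 1          # count the entries in each row
    for i in range(n_rows):
        indptr[i + 1] += indptr[i]    # running total
    return indptr


pairs = offsets_to_pairs(result)
indptr = row_pointer(result["offsets_row"], len(result["ids"]))
print(pairs[0])
print(indptr)
```

The first reconstructed tuple corresponds to the first entry of the `pairs` output, which shows that both endpoints describe the same sparse matrix.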
| 116 | +*To learn more about the distance matrix, read [**The Distance Matrix documentation**](/documentation/concepts/explore/#distance-matrix).* |
| 117 | + |
| 118 | +## Distance Matrix API in the Graph UI |
| 119 | + |
| 120 | +We are adding more visualization options to the [**Graph Exploration Tool**](/blog/qdrant-1.11.x/#web-ui-graph-exploration-tool), introduced in v1.11. |
| 121 | + |
| 122 | +You can now leverage the **Distance Matrix API** from within this tool for a **clearer picture** of your data and its relationships. |
| 123 | + |
| 124 | +**Example:** You can retrieve 900 `sample` points, with a `limit` of 5 connections per vector and a `tree` visualization: |
| 125 | + |
| 126 | +```json |
| 127 | +{ |
| 128 | + "limit": 5, |
| 129 | + "sample": 900, |
| 130 | + "tree": true |
| 131 | +} |
| 132 | +``` |
| 133 | +The new graphing method is cleaner and reveals **relationships and outliers:** |
| 134 | + |
| 135 | + |
| 136 | + |
| 137 | +*To learn more about the Web UI Dashboard, read the [**Interfaces documentation**](/documentation/interfaces/web-ui/).* |
| 138 | + |
| 139 | +## Facet API for Metadata Cardinality |
| 140 | + |
| 141 | + |
| 142 | + |
| 143 | +In modern applications like e-commerce, users often rely on [**filters**](/articles/vector-search-filtering/), such as **brand** or **color**, to refine search results. The **Facet API** is designed to help users understand the distribution of values in a dataset. |
| 144 | + |
| 145 | +The `facet` endpoint can efficiently count and aggregate values for a specific [**payload field**](/documentation/concepts/payload/) in your dataset. |
| 146 | + |
| 147 | +You can use it to retrieve unique values for a field, along with the number of points that contain each value. This functionality is similar to `GROUP BY` with `COUNT(*)` in SQL databases. |
| 148 | + |
| 149 | +> **Note:** Facet counting can only be applied to fields that support `match` conditions, such as fields with a keyword index. |
| 150 | + |
| 151 | +### Configuration |
| 152 | + |
| 153 | +Here’s a sample query using the REST API to facet on the `size` field, filtered by products where the `color` is red: |
| 154 | + |
| 155 | +```http |
| 156 | +POST /collections/{collection_name}/facet |
| 157 | +{ |
| 158 | + "key": "size", |
| 159 | + "filter": { |
| 160 | + "must": { |
| 161 | + "key": "color", |
| 162 | + "match": { "value": "red" } |
| 163 | + } |
| 164 | + } |
| 165 | +} |
| 166 | +``` |
| 167 | +This returns counts for each unique value in the `size` field, filtered by `color` = `red`: |
| 168 | + |
| 169 | +```json |
| 170 | +{ |
| 171 | + "response": { |
| 172 | + "hits": [ |
| 173 | + {"value": "L", "count": 19}, |
| 174 | + {"value": "S", "count": 10}, |
| 175 | + {"value": "M", "count": 5}, |
| 176 | + {"value": "XL", "count": 1}, |
| 177 | + {"value": "XXL", "count": 1} |
| 178 | + ] |
| 179 | + }, |
| 180 | + "time": 0.0001 |
| 181 | +} |
| 182 | +``` |
| 183 | +The results are sorted by count in descending order and only values with non-zero counts are returned. |
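
To make the `GROUP BY` analogy concrete, here is a small Python sketch of the same aggregation over an in-memory list. It only illustrates the semantics and says nothing about how Qdrant computes facets internally: the `facet` function, its `must_match` argument, and the sample payloads are all hypothetical.

```python
from collections import Counter

# Hypothetical payloads; in Qdrant these would live on the stored points.
points = [
    {"color": "red", "size": "L"},
    {"color": "red", "size": "S"},
    {"color": "red", "size": "L"},
    {"color": "blue", "size": "L"},
    {"color": "red", "size": "M"},
    {"color": "red", "size": "L"},
    {"color": "red", "size": "S"},
]


def facet(points, key, must_match=None):
    """Count the values of `key` among points matching every must_match pair.

    Rough equivalent of:
    SELECT key, COUNT(*) ... WHERE ... GROUP BY key ORDER BY COUNT(*) DESC
    """
    must_match = must_match or {}
    counts = Counter(
        point[key]
        for point in points
        if key in point
        and all(point.get(k) == v for k, v in must_match.items())
    )
    # most_common() orders by count, descending; zero counts never appear.
    return [{"value": value, "count": count} for value, count in counts.most_common()]


hits = facet(points, "size", {"color": "red"})
print(hits)
```

As with the endpoint, only values that occur at least once in the filtered set show up in the result.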
| 184 | + |
| 185 | +### Configuration - Precise Facet |
| 186 | + |
| 187 | +By default, facet counts are approximate, which keeps queries fast. If you need a precise count, you can enable the `exact` parameter: |
| 188 | + |
| 189 | +```http |
| 190 | +POST /collections/{collection_name}/facet |
| 191 | +{ |
| 192 | + "key": "size", |
| 193 | + "exact": true |
| 194 | +} |
| 195 | +``` |
| 196 | +This lets you trade performance for precision, depending on the needs of your application. |
| 197 | + |
| 198 | +*To learn more about faceting, read the [**Facet API documentation**](/documentation/concepts/payload/#facet-counts).* |
| 199 | + |
| 200 | +## Text Index on Disk Support |
| 201 | + |
| 202 | + |
| 203 | +[**Qdrant text indexing**](/documentation/concepts/indexing/#full-text-index) tokenizes text into smaller units (tokens) based on chosen settings (e.g., tokenizer type, token length). These tokens are stored in an inverted index for fast text searches. |
| 204 | + |
| 205 | +> With `on_disk` text indexing, the inverted index is stored on disk, reducing memory usage. |
| 206 | + |
| 207 | +### Configuration |
| 208 | +Just like with other indexes, simply add `on_disk: true` when creating the index: |
| 209 | + |
| 210 | +```http |
| 211 | +PUT /collections/{collection_name}/index |
| 212 | +{ |
| 213 | + "field_name": "review_text", |
| 214 | + "field_schema": { |
| 215 | + "type": "text", |
| 216 | + "tokenizer": "word", |
| 217 | + "min_token_len": 2, |
| 218 | + "max_token_len": 20, |
| 219 | + "lowercase": true, |
| 220 | + "on_disk": true |
| 221 | + } |
| 222 | +} |
| 223 | +``` |
| 224 | + |
| 225 | +*To learn more about indexes, read the [**Indexing documentation**](/documentation/concepts/indexing/).* |
| 226 | + |
| 227 | +## Geo Index on Disk Support |
| 228 | + |
| 229 | +For [**large-scale geographic datasets**](/documentation/concepts/payload/#geo) where storing all indexes in memory is impractical, **geo indexing** allows efficient filtering of points based on geographic coordinates. |
| 230 | + |
| 231 | +With `on_disk` geo indexing, the index is written to disk instead of residing in memory, making it possible to handle large datasets without exhausting system memory. |
| 232 | + |
| 233 | +> This can be crucial when dealing with millions of geo points that don’t require real-time access. |
| 234 | +
|
| 235 | +### Configuration |
| 236 | + |
| 237 | +To enable this feature, modify the index schema for the geographic field by setting the `on_disk: true` flag. |
| 238 | + |
| 239 | +```http |
| 240 | +PUT /collections/{collection_name}/index |
| 241 | +{ |
| 242 | + "field_name": "location", |
| 243 | + "field_schema": { |
| 244 | + "type": "geo", |
| 245 | + "on_disk": true |
| 246 | + } |
| 247 | +} |
| 248 | +``` |
| 249 | + |
| 250 | +### Performance Considerations |
| 251 | + |
| 252 | +- **Cold Query Latency:** On-disk indexes require I/O to load index segments, introducing slight latency on first access. Subsequent queries will benefit from disk caching. |
| 253 | +- **Hot vs. Cold Indexes:** Frequently queried fields should stay in memory for faster performance, while on-disk indexes are better suited to large, infrequently queried fields. |
| 254 | +- **Memory vs. Disk Trade-offs:** Users can manage memory by deciding which fields to store on disk. |
| 255 | + |
| 256 | + |
| 257 | + |
| 258 | +> To learn how to get the best performance from Qdrant, read the [**Optimization Guide**](/documentation/guides/optimize/). |
| 259 | +
|
| 260 | +## Just the Beginning |
| 261 | + |
| 262 | +The easiest way to reach that **Hello World** moment is to [**try vector search in a live cluster**](/documentation/quickstart-cloud/). Our **interactive tutorial** will show you how to create a cluster, add data and try some filtering clauses. |
| 263 | + |
| 264 | +**All of the new features from version 1.12 can be tested in the Web UI:** |
| 265 | + |
| 266 | + |