---
layout: post
title: Introducing byte vector support for Faiss in OpenSearch vector engine
authors:
- naveen
- navneev
- vamshin
- dylantong
- kolchfa
date: 2024-11-22
categories:
- technical-posts
has_science_table: true
meta_keywords: Faiss byte vectors in OpenSearch, similarity search, vector search, large-scale applications, memory efficiency, quantization techniques, benchmarking results, signed byte range
meta_description: Learn how byte vectors improve memory efficiency and performance in large-scale similarity search applications. Discover benchmarking results, quantization techniques, and use cases for Faiss byte vectors in OpenSearch.
---

The growing popularity of generative AI and large language models (LLMs) has led to an increased demand for efficient vector search and similarity operations. These models often rely on high-dimensional vector representations of text, images, or other data. Performing similarity searches or nearest neighbor queries on these vectors becomes computationally expensive, especially as vector databases grow in size. OpenSearch's support for Faiss byte vectors offers a promising solution to these challenges.

Using byte vectors instead of float vectors for vector search provides significant improvements in memory efficiency and performance. This is especially beneficial for large-scale vector databases or environments with limited resources. Faiss byte vectors enable you to store quantized embeddings, significantly reducing memory consumption and lowering costs. This approach typically results in only minimal recall loss compared to using full-precision (float) vectors.


## How to use Faiss byte vectors

A byte vector is a compact representation where each dimension is a signed 8-bit integer ranging from -128 to 127. To use byte vectors, you must convert your input vectors, typically in `float` format, into the `byte` type before ingestion. This process requires quantization techniques, which compress float vectors while maintaining essential data characteristics. For more information, see [Quantization techniques](https://opensearch.org/docs/latest/field-types/supported-field-types/knn-vector#quantization-techniques).
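
How you quantize depends on your model and data. As one illustrative approach (a minimal sketch, not the only technique), a simple min-max scalar quantizer can map float embeddings that fall within a known range to signed 8-bit integers; the range boundaries and NumPy usage here are assumptions for illustration:

```python
import numpy as np

def quantize_to_byte(vector, min_val, max_val):
    """Map float values in [min_val, max_val] to signed 8-bit integers in [-128, 127]."""
    v = np.asarray(vector, dtype=np.float32)
    scaled = (v - min_val) / (max_val - min_val) * 255.0 - 128.0
    # Round and clip to guard against values slightly outside the expected range.
    return np.clip(np.round(scaled), -128, 127).astype(np.int8)

# Example: an 8-dimensional embedding whose values fall in [-1.0, 1.0].
byte_vector = quantize_to_byte([-0.98, 0.22, 1.0, 0.0, 0.08, -0.35, 0.09, -0.86], -1.0, 1.0)
print(byte_vector.tolist())  # integers in [-128, 127], ready for ingestion
```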


To use a `byte` vector, set the `data_type` parameter to `byte` when creating a k-NN index (the default is `float`):

```json
PUT test-index
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector",
        "dimension": 8,
        "data_type": "byte",
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "faiss",
          "parameters": {
            "ef_construction": 100,
            "m": 16
          }
        }
      }
    }
  }
}
```

During ingestion, make sure each dimension of the vector is within the supported [-128, 127] range:

```json
PUT test-index/_doc/1
{
  "my_vector1": [-126, 28, 127, 0, 10, -45, 12, -110]
}
```

```json
PUT test-index/_doc/2
{
  "my_vector1": [100, -25, 4, -67, -2, 127, 99, 0]
}
```
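
If you ingest documents programmatically, you can validate the byte range before sending each request. The following is a minimal sketch using the `opensearch-py` client; the connection settings are placeholders for illustration:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])  # adjust for your cluster

def index_byte_vector(doc_id, vector):
    # Reject values outside the signed 8-bit range rather than silently clipping them.
    if any(not -128 <= v <= 127 for v in vector):
        raise ValueError("Each dimension must be an integer in [-128, 127]")
    client.index(index="test-index", id=doc_id, body={"my_vector1": vector}, refresh=True)

index_byte_vector(1, [-126, 28, 127, 0, 10, -45, 12, -110])
index_byte_vector(2, [100, -25, 4, -67, -2, 127, 99, 0])
```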

During querying, make sure the query vector is also within the byte range:

```json
GET test-index/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector1": {
        "vector": [-1, 45, -100, 125, -128, -8, 5, 10],
        "k": 2
      }
    }
  }
}
```
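
The same query can be sent from the `opensearch-py` client used in the ingestion sketch above; the vector and `k` values mirror the request shown here:

```python
response = client.search(
    index="test-index",
    body={
        "size": 2,
        "query": {
            "knn": {
                "my_vector1": {
                    "vector": [-1, 45, -100, 125, -128, -8, 5, 10],
                    "k": 2
                }
            }
        }
    },
)

# Print the ID and similarity score of each returned neighbor.
for hit in response["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```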

**Note**: When using `byte` vectors, expect some loss of recall compared to using `float` vectors. Byte vectors are best suited for large-scale applications and use cases that prioritize reducing memory usage in exchange for a minimal loss in recall.

## Benchmarking results

To compare recall, indexing, and search performance between float vectors and byte vectors with Faiss HNSW, we ran benchmarking tests on popular datasets using OpenSearch Benchmark.

**Note**: Without single instruction, multiple data (SIMD) optimization (such as AVX2 or NEON), or when AVX2 is disabled (on x86 architectures), the quantization process introduces additional latency. For details on AVX2-compatible processors, see [CPUs with AVX2](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX2). In an AWS environment, all community Amazon Machine Images (AMIs) with [HVM](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/virtualization_types.html) support AVX2 optimization.
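
If you're unsure whether your hardware exposes AVX2, one quick check on a Linux host is to look for the flag in `/proc/cpuinfo`; this small sketch is illustrative and assumes a Linux environment:

```python
# Linux-only: check whether the CPU advertises the AVX2 flag.
with open("/proc/cpuinfo") as f:
    print("AVX2 supported:", "avx2" in f.read())
```
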
These tests were conducted on a single-node cluster, except for the cohere-10m dataset, which used two `r5.2xlarge` instances.

### Configuration

The following table lists the cluster configuration for benchmarking tests.

|`m` |`ef_construction` |`ef_search` |Replicas |Primary shards |Indexing clients |
|--- |--- |--- |--- |--- |--- |
|16 |100 |100 |0 |8 |16 |

The following table lists the dataset configuration for benchmarking tests.

|Dataset ID |Dataset |Vector dimension |Data size |Number of queries |Training data range |Query data range |Space type |
|--- |--- |--- |--- |--- |--- |--- |--- |
|**Dataset 1** |gist-960-euclidean |960 |1,000,000 |1,000 |[0.0, 1.48] |[0.0, 0.729] |L2 |
|**Dataset 2** |cohere-ip-10m |768 |10,000,000 |10,000 |[-4.142334, 5.5211477] |[-4.109505, 5.4809895] |innerproduct |
|**Dataset 3** |cohere-ip-1m |768 |1,000,000 |10,000 |[-4.1073565, 5.504557] |[-4.109505, 5.4809895] |innerproduct |
|**Dataset 4** |sift-128-euclidean |128 |1,000,000 |10,000 |[0.0, 218.0] |[0.0, 184.0] |L2 |

### Recall, memory, and indexing results

|Dataset ID |Faiss HNSW recall@100 |Faiss HNSW byte recall@100 |% Reduction in recall |Faiss HNSW memory usage (GB) |Faiss HNSW byte memory usage (GB) |% Reduction in memory |Faiss HNSW mean indexing throughput (docs/sec) |Faiss HNSW byte mean indexing throughput (docs/sec) |% Gain in indexing throughput |
|--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |
|**Dataset 1** |0.91 |0.89 |2.20 |3.72 |1.04 |72.00 |4673 |9686 |107.28 |
|**Dataset 2** |0.91 |0.83 |8.79 |30.03 |8.57 |71.46 |4911 |10207 |107.84 |
|**Dataset 3** |0.94 |0.86 |8.51 |3.00 |0.86 |71.33 |6112 |11673 |90.98 |
|**Dataset 4** |0.99 |0.98 |1.01 |0.62 |0.26 |58.06 |38273 |43267 |13.05 |

### Query results

|Dataset ID |Query clients |Faiss HNSW p90 (ms) |Faiss HNSW byte p90 (ms) |Faiss HNSW p99 (ms) |Faiss HNSW byte p99 (ms) |
|--- |--- |--- |--- |--- |--- |
|**Dataset 1** |**1** |5.35 |5.34 |5.95 |5.59 |
|**Dataset 1** |**8** |6.68 |6.64 |10.23 |9.14 |
|**Dataset 1** |**16** |10.59 |7.38 |12.94 |11.47 |
| | | | | | |
|**Dataset 2** |**1** |7.39 |7.14 |8.35 |7.59 |
|**Dataset 2**|**8** |15.47 |14.83 |21.38 |16.20 |
|**Dataset 2** |**16** |25.01 |25.32 |31.98 |29.42 |
| | | | | | |
|**Dataset 3** |**1** |4.97 |4.72 |5.62 |5.02 |
|**Dataset 3** |**8** |6.75 |5.98 |7.69 |7.7 |
|**Dataset 3** |**16** |10.51 |6.94 |13.87 |12.4 |
| | | | | | |
|**Dataset 4** |**1** |2.91 |3.03 |3.16 |3.15 |
|**Dataset 4**|**8** |3.38 |3.30 |6.30 |4.75 |
|**Dataset 4** |**16** |4.35 |3.80 |8.76 |8.83 |

### Key findings

The benchmarking results yield the following key findings:

- **Memory savings**: Byte vectors reduced memory usage by up to **72%**, with higher-dimensional vectors achieving greater reductions (a quick back-of-the-envelope estimate follows this list).
- **Indexing performance**: The mean indexing throughput for byte vectors was up to **107.84%** (roughly 2x) higher than for float vectors, especially with larger vector dimensions.
- **Search performance**: Search latencies were similar, with byte vectors occasionally performing better.
- **Recall**: For byte vectors, there was a slight (up to **8.8%**) reduction in recall compared to float vectors, depending on the dataset and the quantization technique used.
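
The memory savings follow directly from the per-dimension storage size. Here is the back-of-the-envelope estimate referenced above for Dataset 1 (1,000,000 vectors, 960 dimensions), ignoring HNSW graph and index overhead:

```python
num_vectors, dims = 1_000_000, 960

float_storage = num_vectors * dims * 4  # float32 uses 4 bytes per dimension
byte_storage = num_vectors * dims * 1   # int8 uses 1 byte per dimension

print(f"float32: {float_storage / 1024**3:.2f} GiB")         # ~3.58 GiB of raw vector data
print(f"int8:    {byte_storage / 1024**3:.2f} GiB")          # ~0.89 GiB of raw vector data
print(f"reduction: {1 - byte_storage / float_storage:.0%}")  # 75%
```

Because the per-vector HNSW graph overhead is the same in both cases, the observed end-to-end reductions (58% to 72% in the preceding tables) sit somewhat below this theoretical 75%, with lower-dimensional datasets seeing the smallest savings.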

## How does Faiss work with byte vectors internally?

Faiss doesn't directly support the `byte` data type for storing vectors. To provide byte vector support, OpenSearch uses a [`QT_8bit_direct_signed` scalar quantizer](https://faiss.ai/cpp_api/struct/structfaiss_1_1ScalarQuantizer.html). This quantizer accepts float vectors whose values lie within the signed 8-bit range and encodes them as unsigned 8-bit integer vectors. During indexing and search, these encoded unsigned 8-bit integer vectors are decoded back into the original signed 8-bit vectors for distance computation.

This quantization approach reduces the memory footprint by a factor of four. However, encoding and decoding during scalar quantization introduce additional latency. To mitigate this, you can use [SIMD optimization](https://opensearch.org/docs/latest/search-plugins/knn/knn-index#simd-optimization-for-the-faiss-engine) with the `QT_8bit_direct_signed` quantizer to reduce search latencies and improve indexing throughput.

### Example

The following example shows how an input vector is encoded and decoded using the `QT_8bit_direct_signed` scalar quantizer:

```c
// Input vector:
[-126, 28, 127, 0, 10, -45, 12, -110]

// Encoded vector generated by adding 128 to each dimension of the input vector to convert signed int8 to unsigned int8:
[2, 156, 255, 128, 138, 83, 140, 18]

// Encoded vector is decoded back into the original signed int8 vector by subtracting 128 from each dimension for distance computation:
[-126, 28, 127, 0, 10, -45, 12, -110]
```
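
The same shift is easy to verify with a couple of lines of NumPy. This snippet only illustrates the encode/decode arithmetic shown above; it is not the internal Faiss implementation:

```python
import numpy as np

signed = np.array([-126, 28, 127, 0, 10, -45, 12, -110], dtype=np.int16)

encoded = (signed + 128).astype(np.uint8)  # [2, 156, 255, 128, 138, 83, 140, 18]
decoded = encoded.astype(np.int16) - 128   # recovers the original signed values

assert np.array_equal(decoded, signed)
```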

## Future enhancements

In future versions, we plan to enhance this feature by adding an `on_disk` mode with a `4x` compression level in Faiss. This mode will accept `fp32` vectors as input, perform online training, and quantize the data into byte-sized vectors, eliminating the need for external quantization.

## Conclusion

OpenSearch 2.17 introduced support for Faiss byte vectors, enabling you to store quantized byte vector embeddings efficiently. This reduces memory consumption by up to 75%, lowers costs, and maintains high performance. These advantages make byte vectors an excellent choice for large-scale similarity search applications, especially those with limited memory resources or those that handle large volumes of data within the signed byte value range.


## References

* [Benchmarking datasets](https://github.com/erikbern/ann-benchmarks?tab=readme-ov-file#data-sets)
* [Cohere/wikipedia-22-12-simple-embeddings](https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings)
* Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The Faiss library. [https://arxiv.org/abs/2401.08281](https://arxiv.org/abs/2401.08281)
