
Update 2023-12-05-improving-document-retrieval-with-spade-semantic-en… #2483

Merged · 17 commits · Dec 8, 2023
Changes from 1 commit
Apply suggestions from code review
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
kolchfa-aws and natebower authored Dec 8, 2023

commit e50effcc403d04f85f3a637f828ea58f63a84e95
@@ -15,7 +15,7 @@
has_science_table: true
---

OpenSearch 2.11 introduced neural sparse search---a new efficient method of semantic retrieval. In this blog post, you'll learn about using sparse encoders for semantic search. You'll find that neural sparse search reduces costs, performs faster, and improves search relevance. We're excited to share benchmarking results that show why neural sparse search is now the top-performing search method. You can even try it out by building your own search engine in just five steps. For a TLDR on benchmarking learnings, see [Key takeaways](#here-are-the-key-takeaways).

## What are dense and sparse vector embeddings?

@@ -68,25 +68,25 @@

### Here are the key takeaways:
Suggested change:
-### Here are the key takeaways:
+### Here are the key takeaways

I would just make this "Key takeaways".


* Both modes provide the highest relevance on the BEIR and Amazon ESCI datasets.
* Without online inference, the search latency of document-only mode is comparable to BM25.
* Sparse encoding results in a much smaller index size than dense encoding. A document-only sparse encoder generates an index that is **10.4%** of the size of a dense encoding index. For a bi-encoder, the index size is **7.2%** of the size of a dense encoding index.
* Dense encoding uses k-NN retrieval and incurs a 7.9% increase in RAM cost at search time. Neural sparse search uses a native Lucene index, so the RAM cost does not increase at search time.

## Benchmarking results

The benchmarking results are presented in the following tables.

### Table I. Relevance comparison on BEIR<sup>*</sup> benchmark and Amazon ESCI, in terms of NDCG@10 and rank.

Suggested change:
-### Table I. Relevance comparison on BEIR<sup>*</sup> benchmark and Amazon ESCI, in terms of NDCG@10 and rank.
+### Table I. Relevance comparison on BEIR<sup>*</sup> benchmark and Amazon ESCI in terms of NDCG@10 and rank


<table>
<tr style="text-align:center;">
<td></td>
<td colspan="2">BM25</td>
<td colspan="2">Dense(with TAS-B model)</td>
<td colspan="2">Hybrid(Dense + BM25)</td>
<td colspan="2">Dense (with TAS-B model)</td>
<td colspan="2">Hybrid (Dense + BM25)</td>
<td colspan="2">Neural sparse search bi-encoder</td>
<td colspan="2">Neural sparse search doc-only</td>
<td colspan="2">Neural sparse search document-only</td>
</tr>
<tr>
<td><b>Dataset</b></td>
@@ -102,7 +102,7 @@
<td><b>Rank</b></td>
</tr>
<tr>
<td>TREC-COVID</td>
<td>0.688</td>
<td>4</td>
<td>0.481</td>
@@ -206,7 +206,7 @@
<td>2</td>
</tr>
<tr>
<td>SCIDOCS</td>
<td>0.165</td>
<td>2</td>
<td>0.149</td>
@@ -298,11 +298,11 @@
</tr>
</table>

<sup>*</sup> BEIR stands for Benchmarking Information Retrieval. For more information, see [the BEIR GitHub page](https://github.com/beir-cellar/beir).

### Table II. Speed comparison in terms of latency and throughput


| | BM25 | Dense (with TAS-B model) | Neural sparse search bi-encoder | Neural sparse search document-only |
|---------------------------|---------------|---------------------------|----------------------------------|------------------------------------|
| P50 latency (ms) | 8 ms | 56.6 ms | 176.3 ms | 10.2 ms |
| P90 latency (ms) | 12.4 ms | 71.12 ms | 267.3 ms | 15.2 ms |
@@ -311,17 +311,17 @@
| Mean throughput (op/s) | 2214.6 op/s | 298.2 op/s |106.3 op/s | 1790.2 op/s |


<sup>*</sup> We tested latency on a subset of MS MARCO v2 containing 1M documents in total. To obtain latency data, we used 20 clients to loop search requests.

### Table III. Capacity consumption comparison

| |BM25 |Dense (with TAS-B model) |Neural sparse search bi-encoder | Neural sparse search document-only |
|-|-|-|-|-|
|Index size |1 GB |65.4 GB |4.7 GB |6.8 GB |
|RAM usage |480.74 GB |675.36 GB |480.64 GB |494.25 GB |
|Runtime RAM delta |+0.01 GB |+53.34 GB |+0.06 GB |+0.03 GB |

<sup>*</sup> We performed this experiment using the full MS MARCO v2 dataset, containing 8.8M passages. For all methods, we excluded the `_source` fields and force merged the index before measuring index size. We set the heap size of the OpenSearch JVM to half of the node RAM, so an empty OpenSearch cluster still consumed close to 480 GB of memory.
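
To illustrate that setup, here's a minimal sketch of creating an index with `_source` disabled and then force merging it. The index name `msmarco-v2` is hypothetical, and the benchmark's actual index settings may have differed:

```json
PUT /msmarco-v2
{
  "mappings": {
    "_source": { "enabled": false }
  }
}

POST /msmarco-v2/_forcemerge?max_num_segments=1
```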

## Build your search engine in five steps
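
Not every request body is reproduced below, so as a minimal sketch of the model setup step, the following call registers and deploys a pretrained sparse encoder in one step. The model name matches the pretrained model discussed later in this post; the version string is an assumption that you should verify against the pretrained models documentation:

```json
POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-v1",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}
```

Registration is asynchronous: OpenSearch returns a `task_id` that you can poll to check progress: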

@@ -371,7 +371,7 @@
```
GET /_plugins/_ml/tasks/<task_id>
```

Once the task is complete, the task state changes to `COMPLETED` and OpenSearch returns the `model_id` for the deployed model:

```json
{
@@ -456,7 +456,7 @@
}
```

### Neural sparse query parameters

The `neural_sparse` query supports two parameters:

@@ -465,9 +465,9 @@
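
As a sketch of how these parameters fit together (based on the `neural_sparse` query syntax in OpenSearch 2.11), a search request passes the raw query text and the ID of the deployed sparse encoder. The index and field names here are illustrative only:

```json
GET /my-index/_search
{
  "query": {
    "neural_sparse": {
      "passage_embedding": {
        "query_text": "What is neural sparse search?",
        "model_id": "<model_id>"
      }
    }
  }
}
```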

## Selecting a model

OpenSearch provides several pretrained encoder models that you can use out of the box without fine-tuning. For a list of sparse encoding models provided by OpenSearch, see [Sparse encoding models](https://opensearch.org/docs/latest/ml-commons-plugin/pretrained-models/#sparse-encoding-models).

Use the following recommendations to select a sparse encoder model:

- For **bi-encoder** mode, we recommend using the `opensearch-neural-sparse-encoding-v1` pretrained model. For this model, both online search and offline ingestion share the same model file.

@@ -477,6 +477,6 @@
## Next steps

- For more information about neural sparse search, see [Neural sparse search](https://opensearch.org/docs/latest/search-plugins/neural-sparse-search/).
- For an end-to-end neural search tutorial, see [Neural search tutorial](https://opensearch.org/docs/latest/search-plugins/neural-search-tutorial/).
- For a list of all search methods OpenSearch supports, see [Search methods](https://opensearch.org/docs/latest/search-plugins/index/#search-methods).
- Provide your feedback on the [OpenSearch Forum](https://forum.opensearch.org/).