
Update 2023-12-05-improving-document-retrieval-with-spade-semantic-encoders.md #2483

Merged
merged 17 commits into opensearch-project:main on Dec 8, 2023

Conversation

@dagneyb (Contributor) commented Dec 7, 2023

Description

[Describe what this change achieves]

Issues Resolved

[List any issues this PR will resolve]

Check List

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the BSD-3-Clause License.

Signed-off-by: Fanit Kolchina <[email protected]>
@model-collapse left a comment

Only one suggestion; the others are good.

@model-collapse left a comment

Several extra comments here, minor changes.


In our previous [blog post](https://opensearch.org/blog/semantic-science-benchmarks), we mentioned that searching with dense embeddings presents challenges when encoders are facing unfamiliar content. When an encoder trained on one dataset is used on a different dataset, the encoder often produces unpredictable embeddings, resulting in poor search result relevance.

Often, BM25 performs better than dense encoders on BEIR datasets that incorporate strong domain knowledge. In these cases, sparse encoders fall back on keyword-based matching, ensuring that their search results are no worse than BM25 ones. For a comparison of search result relevance benchmarks, see [Table I](#benchmarking-results).


on some BEIR dataset which incorporate


sparse encoders are able to fall back on keyword-based ...


You can run a neural sparse search in two modes: **bi-encoder** and **document-only**.

In bi-encoder mode, both documents and search queries are passed through deep encoders. In document-only mode, documents are still passed through deep encoders, but search queries are instead tokenized. In this mode, document encoders are trained to learn more synonym association in order to increase recall. By eliminating the online inference phase, you can **save computational resources** and **significantly reduce latency**. For benchmarks, compare the `Neural sparse doc-only` column with other columns in [Table II](#table-ii-speed-comparison-in-terms-of-latency-and-throughput).


search queries are just simply tokenized instead.
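As context for the two modes: in document-only mode, the model does its work at ingestion time. A minimal sketch of that setup, assuming a sparse encoding model has already been registered and deployed (the pipeline name, field names, and model ID placeholder below are illustrative, not values from this PR):

```json
PUT /_ingest/pipeline/sparse-encoding-pipeline
{
  "description": "Expands passage_text into sparse token-weight pairs at ingestion time",
  "processors": [
    {
      "sparse_encoding": {
        "model_id": "<registered sparse encoding model ID>",
        "field_map": {
          "passage_text": "passage_embedding"
        }
      }
    }
  ]
}
```

With a pipeline like this attached to an index, document-only mode skips model inference at query time entirely, which is where the latency savings described above come from.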

@natebower (Collaborator) left a comment

@kolchfa-aws Please see my comments and changes and let me know if you have any questions. Thanks!

has_science_table: true
---

OpenSearch 2.11 introduced neural sparse search---a new, efficient way of semantic retrieval. In this blog post, you'll learn about using sparse encoders for semantic search. You'll find that neural sparse search reduces costs, performs faster, and improves search relevance. We're excited to share benchmarking results that show why neural sparse search is now the top-performing search method. You can even try it out by building your own search engine in just five steps. For a TLDR on benchmarking learnings, see [Key takeaways](#here-are-the-key-takeaways).

Let's replace "TLDR"; it's too informal. In the fourth sentence, let's clarify slightly: the top-performing search method as compared to what?


When you use a transformer-based encoder, such as BERT, to generate traditional dense vector embeddings, the encoder translates each word into a vector. Collectively, these vectors make up a semantic vector space. In this space, the closer the vectors are, the more similar the words are in meaning.

In sparse encoding, the encoder takes the text and creates a list of tokens that have similar semantic meaning. The model vocabulary ([WordPiece](https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt)) contains the most commonly used words along with various tense endings (for example, `-ed` and `-ing`) and suffixes (for example, `-ate` and `-ion`). You can think of the vocabulary as a semantic space where each document is a sparse vector.

First sentence: "uses the text to create"?
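To make the sparse representation concrete, a stored document might look like the following purely illustrative sketch (the tokens and weights are invented, and `passage_embedding` is a hypothetical field name): each entry maps a vocabulary token to a weight, and every token not listed is implicitly zero.

```json
{
  "passage_text": "improving document retrieval with sparse encoders",
  "passage_embedding": {
    "document": 2.1,
    "retrieval": 1.8,
    "search": 1.4,
    "sparse": 1.2,
    "encode": 0.9,
    "##rs": 0.3
  }
}
```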


## Sparse encoders perform better on unfamiliar datasets

In our previous [blog post](https://opensearch.org/blog/semantic-science-benchmarks), we mentioned that searching with dense embeddings presents challenges when encoders are facing unfamiliar content. When an encoder trained on one dataset is used on a different dataset, the encoder often produces unpredictable embeddings, resulting in poor search result relevance.

"encounter" instead of "are facing" (or a better verb)?


In our previous [blog post](https://opensearch.org/blog/semantic-science-benchmarks), we mentioned that searching with dense embeddings presents challenges when encoders are facing unfamiliar content. When an encoder trained on one dataset is used on a different dataset, the encoder often produces unpredictable embeddings, resulting in poor search result relevance.

Often, BM25 performs better than dense encoders on BEIR datasets that incorporate strong domain knowledge. In these cases, sparse encoders fall back on keyword-based matching, ensuring that their search results are no worse than BM25 ones. For a comparison of search result relevance benchmarks, see [Table I](#benchmarking-results).

Instead of "than BM25 ones", "are no worse than those produced by BM25"?

The `neural_sparse` query supports two parameters:

- `model_id` (String): The ID of the model that is used to generate tokens and weights from the query text. A sparse encoding model will expand the tokens from query text, while the tokenizer model will only tokenize the query text itself.
- `max_token_score` (Float): An extra parameter required for performance optimization. Just like the OpenSearch `match` query, the `neural_sparse` query is transformed to a Lucene BooleanQuery, combining term-level subqueries using disjunction. The difference is that for the neural sparse query, we use FeatureQuery instead of TermQuery to match the terms. Lucene employs the WAND (Weak AND) algorithm for dynamic pruning, which skips non-competitive tokens based on their score upper bounds. However, FeatureQuery uses `FLOAT.MAX_VALUE` as the score upper bound, which makes the WAND optimization ineffective. The `max_token_score` parameter resets the score upper bound for each token in a query, which is consistent with the original FeatureQuery. Thus, setting the value to 3.5 for the bi-encoder model and 2 for the document-only model can accelerate search without precision loss. After OpenSearch is upgraded to Lucene version 9.8, this parameter will be deprecated.

"Weak AND (WAND)"?

## Next steps

- For more information about neural sparse search, see [Neural sparse search](https://opensearch.org/docs/latest/search-plugins/neural-sparse-search/).
- For an OpenSearch end-to-end neural search tutorial, see [Neural search tutorial](https://opensearch.org/docs/latest/search-plugins/neural-search-tutorial/).

Should "OpenSearch" follow "end-to-end"?

Removed

kolchfa-aws and others added 6 commits December 8, 2023 08:44
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
@natebower (Collaborator) left a comment

LGTM. @pajuric This should be ready to republish whenever you're ready. @kolchfa-aws can merge for you.

Signed-off-by: Fanit Kolchina <[email protected]>
@kolchfa-aws commented

@pajuric Let me know if I should adjust the meta/keywords.

@natebower (Collaborator) left a comment

@kolchfa-aws Just a few tweaks.

_authors/xinyual.markdown (outdated; resolved)
_authors/yych.markdown (outdated; resolved)
_authors/zhichaog.markdown (outdated; resolved)
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
kolchfa-aws and others added 4 commits December 8, 2023 10:47
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
@kolchfa-aws (Collaborator) left a comment

LGTM

@kolchfa-aws kolchfa-aws merged commit e0d439b into opensearch-project:main Dec 8, 2023
3 of 4 checks passed