Update 2023-12-05-improving-document-retrieval-with-spade-semantic-encoders.md #2483

Conversation

Signed-off-by: Dagney <[email protected]>
Only one suggestion; the others are good.
Several extra comments here, minor changes.
> In our previous [blog post](https://opensearch.org/blog/semantic-science-benchmarks), we mentioned that searching with dense embeddings presents challenges when encoders are facing unfamiliar content. When an encoder trained on one dataset is used on a different dataset, the encoder often produces unpredictable embeddings, resulting in poor search result relevance.
>
> Often, BM25 performs better than dense encoders on BEIR datasets that incorporate strong domain knowledge. In these cases, sparse encoders fall back on keyword-based matching, ensuring that their search results are no worse than BM25 ones. For a comparison of search result relevance benchmarks, see [Table I](#benchmarking-results).
Suggestion: "on some BEIR datasets which incorporate"
sparse encoders are able to fall back on keyword-based ...
> You can run a neural sparse search in two modes: **bi-encoder** and **document-only**.
>
> In bi-encoder mode, both documents and search queries are passed through deep encoders. In document-only mode, documents are still passed through deep encoders, but search queries are instead tokenized. In this mode, document encoders are trained to learn more synonym association in order to increase recall. By eliminating the online inference phase, you can **save computational resources** and **significantly reduce latency**. For benchmarks, compare the `Neural sparse doc-only` column with other columns in [Table II](#table-ii-speed-comparison-in-terms-of-latency-and-throughput).
search queries are just simply tokenized instead.
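For context on this thread, here is a minimal sketch of how documents are encoded at ingestion time in either mode, using OpenSearch's `sparse_encoding` ingest processor (the pipeline name, field names, and model ID are placeholders, not taken from the post):

```json
PUT /_ingest/pipeline/sparse-encoding-pipeline
{
  "description": "Expands passage_text into sparse token-weight pairs at ingestion time",
  "processors": [
    {
      "sparse_encoding": {
        "model_id": "<sparse-encoder-model-id>",
        "field_map": {
          "passage_text": "passage_embedding"
        }
      }
    }
  ]
}
```

The two modes differ only on the query side: in bi-encoder mode, the `neural_sparse` query references the same deep encoder, while in document-only mode it references a tokenizer model, so no model inference runs at search time.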
@kolchfa-aws Please see my comments and changes and let me know if you have any questions. Thanks!
> has_science_table: true
> ---
>
> OpenSearch 2.11 introduced neural sparse search---a new, efficient way of semantic retrieval. In this blog post, you'll learn about using sparse encoders for semantic search. You'll find that neural sparse search reduces costs, performs faster, and improves search relevance. We're excited to share benchmarking results that show why neural sparse search is now the top-performing search method. You can even try it out by building your own search engine in just five steps. For a TLDR on benchmarking learnings, see [Key takeaways](#here-are-the-key-takeaways).
Let's replace TLDR, as it's too informal. In the fourth sentence, let's clarify slightly: the top-performing search method as compared to what?
> When you use a transformer-based encoder, such as BERT, to generate traditional dense vector embeddings, the encoder translates each word into a vector. Collectively, these vectors make up a semantic vector space. In this space, the closer the vectors are, the more similar the words are in meaning.
>
> In sparse encoding, the encoder takes the text and creates a list of tokens that have similar semantic meaning. The model vocabulary ([WordPiece](https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt)) contains most commonly used words along with various tense endings (for example, `-ed` and `-ing`) and suffixes (for example, `-ate` and `-ion`). You can think of the vocabulary as a semantic space where each document is a sparse vector.
First sentence: "uses the text to create"?
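To make "each document is a sparse vector" concrete, here is an illustrative (hypothetical, not from the post) token-weight expansion that a sparse encoder might produce for the passage "New York City":

```json
{
  "passage_embedding": {
    "new": 1.31,
    "york": 1.54,
    "city": 0.87,
    "nyc": 1.16,
    "manhattan": 0.42
  }
}
```

Every other entry in the roughly 30,000-token WordPiece vocabulary gets a zero weight, which is what makes the vector sparse; note that the encoder can also assign weight to tokens such as `nyc` that never appear in the original text.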
> ## Sparse encoders perform better on unfamiliar datasets
>
> In our previous [blog post](https://opensearch.org/blog/semantic-science-benchmarks), we mentioned that searching with dense embeddings presents challenges when encoders are facing unfamiliar content. When an encoder trained on one dataset is used on a different dataset, the encoder often produces unpredictable embeddings, resulting in poor search result relevance.
"encounter" instead of "are facing" (or a better verb)?
> Often, BM25 performs better than dense encoders on BEIR datasets that incorporate strong domain knowledge. In these cases, sparse encoders fall back on keyword-based matching, ensuring that their search results are no worse than BM25 ones. For a comparison of search result relevance benchmarks, see [Table I](#benchmarking-results).
Instead of "than BM25 ones", "are no worse than those produced by BM25"?
> The `neural_sparse` query supports two parameters:
>
> - `model_id` (String): The ID of the model that is used to generate tokens and weights from the query text. A sparse encoding model will expand the tokens from query text, while the tokenizer model will only tokenize the query text itself.
> - `max_token_score` (Float): An extra parameter required for performance optimization. Just like the OpenSearch `match` query, the `neural_sparse` query is transformed to a Lucene BooleanQuery, combining term-level subqueries using disjunction. The difference is that for the neural sparse query, we use FeatureQuery instead of TermQuery to match the terms. Lucene employs the WAND (Weak AND) algorithm for dynamic pruning, which skips non-competitive tokens based on their score upper bounds. However, FeatureQuery uses `FLOAT.MAX_VALUE` as the score upper bound, which makes the WAND optimization ineffective. The `max_token_score` parameter resets the score upper bound for each token in a query, which is consistent with the original FeatureQuery. Thus, setting the value to 3.5 for the bi-encoder model and 2 for the document-only model can accelerate search without precision loss. After OpenSearch is upgraded to Lucene version 9.8, this parameter will be deprecated.
"Weak AND (WAND)"?
> ## Next steps
>
> - For more information about neural sparse search, see [Neural sparse search](https://opensearch.org/docs/latest/search-plugins/neural-sparse-search/).
> - For an OpenSearch end-to-end neural search tutorial, see [Neural search tutorial](https://opensearch.org/docs/latest/search-plugins/neural-search-tutorial/).
Should "OpenSearch" follow "end-to-end"?
Removed
LGTM. @pajuric This should be ready to republish whenever you're ready. @kolchfa-aws can merge for you.
@pajuric Let me know if I should adjust the meta/keywords.
@kolchfa-aws Just a few tweaks
LGTM
Description
[Describe what this change achieves]
Issues Resolved
[List any issues this PR will resolve]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the BSD-3-Clause License.