diff --git a/_authors/xinyual.markdown b/_authors/xinyual.markdown
index ae87da0450..93c855a044 100644
--- a/_authors/xinyual.markdown
+++ b/_authors/xinyual.markdown
@@ -5,4 +5,4 @@ photo: '/assets/media/authors/xinyual.jpg'
github: xinyual
---
-**Xinyuan Lu** is a machine learning engineer with the OpenSearch project. He is working on large language model(LLM) related applications and search relevance.
\ No newline at end of file
+**Xinyuan Lu** is a machine learning engineer with the OpenSearch Project. He works on large language model (LLM)-related applications and search relevance.
\ No newline at end of file
diff --git a/_authors/yych.markdown b/_authors/yych.markdown
index b13832fefe..88734bdaf7 100644
--- a/_authors/yych.markdown
+++ b/_authors/yych.markdown
@@ -5,4 +5,4 @@ photo: '/assets/media/authors/yych.png'
github: model-collapse
---
-**Charlie Yang** is an AWS engineering manager working on the OpenSearch Project. He is focusing on machine learning, search relevance and performance optimization.
\ No newline at end of file
+**Charlie Yang** is an AWS engineering manager with the OpenSearch Project. He focuses on machine learning, search relevance, and performance optimization.
\ No newline at end of file
diff --git a/_authors/zanniu.markdown b/_authors/zanniu.markdown
new file mode 100644
index 0000000000..34ec8087b4
--- /dev/null
+++ b/_authors/zanniu.markdown
@@ -0,0 +1,8 @@
+---
+name: Zan Niu
+short_name: zanniu
+photo: '/assets/media/authors/zanniu.jpeg'
+github: zane-neo
+---
+
+**Zan Niu** is a software engineer at Amazon Web Services. He primarily works on OpenSearch machine learning plugins (ml-commons). He is passionate about high-performance architecture, distributed systems, and machine learning.
diff --git a/_authors/zhichaog.markdown b/_authors/zhichaog.markdown
index d0e1c2e5dc..d113861a4b 100644
--- a/_authors/zhichaog.markdown
+++ b/_authors/zhichaog.markdown
@@ -5,4 +5,4 @@ photo: '/assets/media/authors/zhichaog.png'
github: zhichao-aws
---
-**Zhichao Geng** is a machine learning engineer working on OpenSearch. His interest is improving search relevance using machine learning.
\ No newline at end of file
+**Zhichao Geng** is a machine learning engineer with the OpenSearch Project. His interests include improving search relevance using machine learning.
\ No newline at end of file
diff --git a/_posts/2023-12-05-improving-document-retrieval-with-spade-semantic-encoders.md b/_posts/2023-12-05-improving-document-retrieval-with-spade-semantic-encoders.md
deleted file mode 100644
index 13f5829085..0000000000
--- a/_posts/2023-12-05-improving-document-retrieval-with-spade-semantic-encoders.md
+++ /dev/null
@@ -1,446 +0,0 @@
----
-layout: post
-title: Improving document retrieval with sparse semantic encoders
-authors:
- - zhichaog
- - xinyual
- - dagney
- - yych
-date: 2023-12-05 01:00:00 -0700
-categories:
- - technical-posts
-meta_keywords: search relevance, neural sparse search, semantic search, semantic search with sparse encoders
-meta_description: Learn how the neural sparse framework in OpenSearch 2.11 can help you improve search relevance and optimize semantic searches with spare encoders using just a few APIs.
-has_science_table: true
----
-
-In our previous [blog post](https://opensearch.org/blog/semantic-science-benchmarks), one finding shared was that zero-shot semantic search based on dense encoders will have challenges when being applied to scenarios with unfamiliar corpus. This was highlighted with the [BEIR](https://github.com/beir-cellar/beir) benchmark, which consists of diverse retrieval tasks so that the “transferability” of a pretrained embedding model to unseen datasets can be evaluated.
-
-In this blog post, we will present Neural Sparse, our sparse semantic retrieval framework that is now the top-performing search method on the latest BEIR benchmark. You will learn about semantic search with sparse encoders as well as how to implement this method in OpenSearch with just a few API calls.
-
-## Sparse Encoder is now a better choice
-When using transformer-based encoders (e.g. BERT) in traditional dense text embedding, the output of each position in the response layer is translated into a vector, projecting the text into a semantic vector space where distance correlates to similarity in meaning. Neural sparse conducts the process in a novel way that makes the encoder “vote” for the most representative BERT tokens. The vocabulary being adopted (WordPiece) contains most daily used words and also various suffixes, including tense suffixes (for example, ##ed, ##ing,) and common word roots (for example, ##ate, ##ion), where the symbol ## represents continuation. The vocabulary itself spans into a semantic space where all the documents can be regarded as sparse vectors.
-
-
-
-
-
- |
-
-
- |
-
-
-
- Figure 1: Left: words encoded in the dense vector sparse. Right: A typical result of sparse encoding.
- |
-
-
-
-Searching with dense embedding will present challenges when facing “unfamiliar” content. In this case, the encoder will produce unpredictable embeddings, leading to bad relevance. That is also why in some BEIR datasets that contain strong domain knowledge, BM25 is the still best performer. In these cases, sparse encoders will try to degenerate themselves into keyword-based matching, protecting the search result to be no worse than BM25. A relevance comparison is provided in **Table I**.
-
-In dense encoding, documents are usually represented as high-dimensional vectors; therefore, k-NN indexes need to be adopted in similarity search. On the contrary, the sparse encoding results are more similar to “term vectors” used by keyword-based matching; therefore, native Lucene indexes can be leveraged. Compared to k-NN indexes, sparse embeddings has the following advantages, leading to reduced costs: 1) Much smaller index size, 2) Reduced runtime RAM cost, and 3) Lower computation cost. The quantized comparison can be found in **Table II**.
-
-### Try extreme efficiency with document-only encoders
-There are two modes supported by Neural Sparse: 1) with bi-encoders and 2) with document-only encoders. Bi-encoder mode is outlined above, while document-only mode, wherein the search queries are tokenized instead of being passed to deep encoders. In this mode, the document encoders are trained to learn more synonym association so as to increase the recall. And by eliminating the online inference phase, a few computational resources can be saved while the latency can also be reduced significantly. We can observe this in **Table II** by comparing “Neural Sparse Doc-only” with other solutions.
-
-## Neural Sparse Search outperforms in Benchmarking
-
-We have conducted some benchmarking using a cluster containing 3 r5.8xlarge data nodes and 1 r5.12xlarge leader&ml node. First, all the evaluated methods are compared in terms of NCDG@10. Then we also compare the runtime speed of each method as well as the resource cost.
-
-Key takeaways:
-
-* Both bi-encoder and document-only mode generate the highest relevance on the BEIR benchmark, along with the Amazon ESCI dataset.
-* Without online inference, the search latency of document-only mode is comparable to BM25.
-* Neural sparse search have much smaller index size than dense encoding. A document-only encoder generates an index with 10.4% of dense encoding’s index size, while the number for a bi-encoder is 7.2%.
-* Dense encoding adopts k-NN retrieval and will have a 7.9% increase in RAM cost when search traffic received. Neural sparse search is based on native Lucene, and the RAM cost will not increase in runtime.
-
-
-The detailed results are presented in the following tables.
-
-Table I. Relevance comparison on BEIR* benchmark and Amazon ESCI, in the term of both NDCG@10 and the rank.
-
-
-
- |
- BM25 |
- Dense(with TAS-B model) |
- Hybrid(Dense + BM25) |
- Neural Sparse Search bi-encoder |
- Neural Sparse Search doc-only |
-
-
- Dataset |
- NDCG |
- Rank |
- NDCG |
- Rank |
- NDCG |
- Rank |
- NDCG |
- Rank |
- NDCG |
- Rank |
-
-
- Trec Covid |
- 0.688 |
- 4 |
- 0.481 |
- 5 |
- 0.698 |
- 3 |
- 0.771 |
- 1 |
- 0.707 |
- 2 |
-
-
- NFCorpus |
- 0.327 |
- 4 |
- 0.319 |
- 5 |
- 0.335 |
- 3 |
- 0.36 |
- 1 |
- 0.352 |
- 2 |
-
-
- NQ |
- 0.326 |
- 5 |
- 0.463 |
- 3 |
- 0.418 |
- 4 |
- 0.553 |
- 1 |
- 0.521 |
- 2 |
-
-
- HotpotQA |
- 0.602 |
- 4 |
- 0.579 |
- 5 |
- 0.636 |
- 3 |
- 0.697 |
- 1 |
- 0.677 |
- 2 |
-
-
- FiQA |
- 0.254 |
- 5 |
- 0.3 |
- 4 |
- 0.322 |
- 3 |
- 0.376 |
- 1 |
- 0.344 |
- 2 |
-
-
- ArguAna |
- 0.472 |
- 2 |
- 0.427 |
- 4 |
- 0.378 |
- 5 |
- 0.508 |
- 1 |
- 0.461 |
- 3 |
-
-
- Touche |
- 0.347 |
- 1 |
- 0.162 |
- 5 |
- 0.313 |
- 2 |
- 0.278 |
- 4 |
- 0.294 |
- 3 |
-
-
- DBPedia |
- 0.287 |
- 5 |
- 0.383 |
- 4 |
- 0.387 |
- 3 |
- 0.447 |
- 1 |
- 0.412 |
- 2 |
-
-
- SCIDOCS |
- 0.165 |
- 2 |
- 0.149 |
- 5 |
- 0.174 |
- 1 |
- 0.164 |
- 3 |
- 0.154 |
- 4 |
-
-
- FEVER |
- 0.649 |
- 5 |
- 0.697 |
- 4 |
- 0.77 |
- 2 |
- 0.821 |
- 1 |
- 0.743 |
- 3 |
-
-
- Climate FEVER |
- 0.186 |
- 5 |
- 0.228 |
- 3 |
- 0.251 |
- 2 |
- 0.263 |
- 1 |
- 0.202 |
- 4 |
-
-
- SciFact |
- 0.69 |
- 3 |
- 0.643 |
- 5 |
- 0.672 |
- 4 |
- 0.723 |
- 1 |
- 0.716 |
- 2 |
-
-
- Quora |
- 0.789 |
- 4 |
- 0.835 |
- 3 |
- 0.864 |
- 1 |
- 0.856 |
- 2 |
- 0.788 |
- 5 |
-
-
- Amazon ESCI |
- 0.081 |
- 3 |
- 0.071 |
- 5 |
- 0.086 |
- 2 |
- 0.077 |
- 4 |
- 0.095 |
- 1 |
-
-
- Average |
- 0.419 |
- 3.71 |
- 0.41 |
- 4.29 |
- 0.45 |
- 2.71 |
- 0.492 |
- 1.64 |
- 0.462 |
- 2.64 |
-
-
-
-***BEIR** is short for Benchmarking Information Retrieval, check our its [Github](https://github.com/beir-cellar/beir) page.
-
-Table II.Speed Comparison, in the term of latency and throughput
-
-| | BM25 | Dense (with TAS-B model) | Neural Sparse Search bi-encoder | Neural Sparse Search doc-only |
-|---------------------------|---------------|---------------------------| ------------------------------- | ------------------------------ |
-| P50 latency (ms) | 8ms | 56.6ms |176.3ms | 10.2ms |
-| P90 latency (ms) | 12.4ms | 71.12ms |267.3ms | 15.2ms |
-| P99 Latency (ms) | 18.9ms | 86.8ms |383.5ms | 22ms |
-| Max throughput (op/s) | 2215.8op/s | 318.5op/s |107.4op/s | 1797.9op/s |
-| Mean throughput (op/s) | 2214.6op/s | 298.2op/s |106.3op/s | 1790.2op/s |
-
-
-*The latencies were tested on a subset of MSMARCO v2, with in total 1M documents. We used 20 clients to loop search requests to get the latency data.
-
-Table III.Capacity consumption comparison
-
-| |BM25 |Dense (with TAS-B model) |Neural Sparse Search Bi-encoder | Neural Sparse Search Doc-only |
-|-|-|-|-|-|
-|Index size |1 GB |65.4 GB |4.7 GB |6.8 GB |
-|RAM usage |480.74 GB |675.36 GB |480.64 GB |494.25 GB |
-|Runtime RAM delta |+0.01 GB |+53.34 GB |+0.06 GB |+0.03 GB |
-
-*We performed this experiment using the full dataset of MSMARCO v2, with 8.8M passages. We excluded all _source fields for all methods and force merged the index before measuring index size. We set the heap size of the OpenSearch JVM to half the node RAM, so an empty OpenSearch cluster also consumes close to 480 GB of memory.
-
-## Build your search engine in five steps
-
-Several pretrained encoder models are published in the OpenSearch model repository. As the state-of-the-art of BEIR benchmark, they are already available for out-of-the-box use, reducing fine-tuning effort. You can follow these three steps to build your search engine:
-
-1. **Prerequisites**: To run the following simple cases in the cluster, change the settings:
-
- ```
- PUT /_cluster/settings
- {
- "transient" : {
- "plugins.ml_commons.allow_registering_model_via_url" : true,
- "plugins.ml_commons.only_run_on_ml_node" : false,
- "plugins.ml_commons.native_memory_threshold" : 99
- }
- }
- ```
-
- **allow_registering_model_via_url** is required to be true because you need to register your pretrained model by URL. Set **only_run_on_ml_node** to false if you don’t have a machine learning (ML) node on your cluster.
-2. **Deploy encoders**: The ML Commons plugin supports deploying pretrained models via URL. Taking `opensearch-neural-sparse-encoding` as an example, you can deploy the encoder via this API:
-
- ```
- POST /_plugins/_ml/models/_register?deploy=true
- {
- "name": "opensearch-neural-sparse-encoding",
- "version": "1.0.0",
- "description": "opensearch-neural-sparse-encoding",
- "model_format": "TORCH_SCRIPT",
- "function_name": "SPARSE_ENCODING",
- "model_content_hash_value": "d1ebaa26615090bdb0195a62b180afd2a8524c68c5d406a11ad787267f515ea8",
- "url": "https://artifacts.opensearch.org/models/ml-models/amazon/neural-sparse/opensearch-neural-sparse-encoding-v1/1.0.1/torch_script/neural-sparse_opensearch-neural-sparse-encoding-v1-1.0.1-torch_script.zip"
- }
- ```
-
- After that, you will get the task_id in your response:
-
- ```
- {
- "task_id": "",
- "status": "CREATED"
- }
- ```
-
- Use task_id to search register model task like:
-
- ```
- GET /_plugins/_ml/tasks/
- ```
-
- You can get register model task information. The state will change. After the state is completed, you can get the model_id like::
-
- ```
- {
- "model_id": "",
- "task_type": "REGISTER_MODEL",
- "function_name": "SPARSE_TOKENIZE",
- "state": "COMPLETED",
- "worker_node": [
- "wubXZX7xTIC7RW2z8nzhzw"
- ],
- "create_time": 1701390988405,
- "last_update_time": 1701390993724,
- "is_async": true
- }
- ```
-
-3. **Set up the ingestion process**: Each document should be encoded into sparse vectors before being indexed. In OpenSearch, this procedure is implemented by an ingestion processor. You can create the ingestion pipeline using this API:
-
- ```
- PUT /_ingest/pipeline/neural-sparse-pipeline
- {
- "description": "An example neural sparse encoding pipeline",
- "processors" : [
- {
- "sparse_encoding": {
- "model_id": "",
- "field_map": {
- "passage_text": "passage_embedding"
- }
- }
- }
- ]
- }
- ```
-
-4. **Set up index mapping**: Neural search leverages the `rank_features` field type for indexing, such that the token weights can be stored. The index will use the above ingestion processor to embed text. The index can be created as follows:
-
- ```
- PUT /my-neural-sparse-index
- {
- "settings": {
- "default_pipeline": "neural-sparse-pipeline"
- },
- "mappings": {
- "properties": {
- "passage_embedding": {
- "type": "rank_features"
- },
- "passage_text": {
- "type": "text"
- }
- }
- }
- }
- ```
-
-5. **Ingest documents with the ingestion processor**: After setting index, customer can put doc. Customer provide text field while processor will automatically transfer text content into embedding vector and put it into `rank_features` field according the `field_map` in the processor:
-
- ```
- PUT /my-neural-sparse-index/_doc/
- {
- "passage_text": "Hello world"
- }
- ```
-
-### Model selection
-
-Neural sparse has two working modes: bi-encoder and document-only. For bi-encoder mode, we recommend using the pretrained model named “opensearch-neural-sparse-encoding-v1”, while both online search and offline ingestion share the same model file. For document-only mode, we recommended using the pretrained model “opensearch-neural-sparse-encoding-doc-v1” for the ingestion processor and using the model “opensearch-neural-sparse-tokenizer-v1” to implement online query tokenization. Altough presented as a “ml-commons” model, “opensearch-neural-sparse-tokenizer-v1” only translates the query into tokens without any model inference. All the models are published [here](https://opensearch.org/docs/latest/ml-commons-plugin/pretrained-models/).
-
-### **Try your engine with a query clause**
-
-Congratulations! Now you have your own semantic search engine based on sparse encoders. To try a sample query, we can invoke the `_search` endpoint using the `neural_sparse` clause in query DSL:
-
-```
- GET /my-neural-sparse-index/_search/
- {
- "query": {
- "neural_sparse": {
- "passage_embedding": {
- "query_text": "Hello world a b",
- "model_id": "",
- "max_token_score": 2.0
- }
- }
- }
-}
-```
-
-Here are two parameters:
-- **“model_id” (string)**: The ID of the model that will be used to generate tokens and weights from the query text. The model must be indexed in OpenSearch before it can be used in neural search. A sparse encoding model will expand the tokens from query text, while the tokenizer model will only generate the token inside the query text.
-- **“max_token_score” (float)**: An extra parameter required for performance optimization. Just like the common procedure of OpenSearch match query, the neural_sparse query is transformed to a Lucene BooleanQuery combining disjunction of term-level sub-queries. The difference is we use FeatureQuery instead of TermQuery for term here. Lucene leverages the WAND (Weak AND) algorithm for dynamic pruning, which skips non-competitive tokens based on their score upper bounds. However, FeatureQuery uses FLOAT.MAX_VALUE as the score upper bound, which makes WAND optimization ineffective. The parameter resets the upper bound of each token in this query, and the default value is FLOAT.MAX_VALUE, which is consistent with the origin FeatureQuery. Setting the value to “3.5” for the bi-encoder model and “2” for the document-only model can accelerate search without precision loss. After OpenSearch is upgraded to Lucene version 9.8, this parameter will be deprecated.
diff --git a/_posts/2023-12-05-improving-document-retrieval-with-sparse-semantic-encoders.md b/_posts/2023-12-05-improving-document-retrieval-with-sparse-semantic-encoders.md
new file mode 100644
index 0000000000..419ff26c12
--- /dev/null
+++ b/_posts/2023-12-05-improving-document-retrieval-with-sparse-semantic-encoders.md
@@ -0,0 +1,482 @@
+---
+layout: post
+title: Improving document retrieval with sparse semantic encoders
+authors:
+ - zhichaog
+ - xinyual
+ - dagney
+ - yych
+ - kolchfa
+date: 2023-12-05 01:00:00 -0700
+categories:
+ - technical-posts
+meta_keywords: search relevance, neural sparse search, semantic search, semantic search with sparse encoders
+meta_description: Learn how the neural sparse framework in OpenSearch 2.11 can help you improve search relevance and optimize semantic searches with sparse encoders using just a few APIs.
+has_science_table: true
+---
+
+OpenSearch 2.11 introduced neural sparse search---a new efficient method of semantic retrieval. In this blog post, you'll learn about using sparse encoders for semantic search. You'll find that neural sparse search reduces costs, performs faster, and improves search relevance. We're excited to share benchmarking results and show how neural sparse search outperforms other search methods. You can even try it out by building your own search engine in just five steps. To skip straight to the results, see [Benchmarking results](#benchmarking-results).
+
+## What are dense and sparse vector embeddings?
+
+When you use a transformer-based encoder, such as BERT, to generate traditional dense vector embeddings, the encoder translates each word into a vector. Collectively, these vectors make up a semantic vector space. In this space, the closer the vectors are, the more similar the words are in meaning.
+
+In sparse encoding, the encoder uses the text to create a list of tokens that have similar semantic meaning. The model vocabulary ([WordPiece](https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt)) contains the most commonly used words along with various tense endings (for example, `-ed` and `-ing`) and suffixes (for example, `-ate` and `-ion`). You can think of the vocabulary as a semantic space where each document is a sparse vector.
+
+The following images show example results of dense and sparse encoding.
+
+<!-- Side-by-side figure: example dense encoding result (left) and example sparse encoding result (right) -->
+
+_**Left**: Dense vector semantic space. **Right**: Sparse vector semantic space._
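+
+To make this concrete, a short passage such as "What is the capital of France?" might be encoded into a token-weight map similar to the following. This is a purely illustrative sketch: the tokens and weights shown are hypothetical, and an actual encoder produces its own weights and can add related tokens (such as `paris`) that do not appear in the original text:
+
+```json
+{
+  "capital": 1.21,
+  "france": 1.17,
+  "paris": 0.86,
+  "city": 0.43,
+  "country": 0.31
+}
+```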
+
+## Sparse encoders use more efficient data structures
+
+In dense encoding, documents are represented as high-dimensional vectors. To search these documents, you need to use a k-NN index as an underlying data structure. In contrast, sparse search can use a native Lucene index because sparse encodings are similar to term vectors used by keyword-based matching.
+
+Compared to k-NN indexes, **sparse embeddings have the following cost-reducing advantages**:
+
+1. Much smaller index size
+1. Reduced runtime RAM cost
+1. Lower computational cost
+
+For a detailed comparison, see [Table II](#table-ii-speed-comparison-in-terms-of-latency-and-throughput).
+
+## Sparse encoders perform better on unfamiliar datasets
+
+In our previous [blog post](https://opensearch.org/blog/semantic-science-benchmarks), we mentioned that searching with dense embeddings presents challenges when encoders encounter unfamiliar content. When an encoder trained on one dataset is used on a different dataset, the encoder often produces unpredictable embeddings, resulting in poor search result relevance.
+
+Often, BM25 performs better than dense encoders on BEIR datasets that incorporate strong domain knowledge. In these cases, sparse encoders can fall back on keyword-based matching, ensuring that their search results are no worse than those produced by BM25. For a comparison of search result relevance benchmarks, see [Table I](#table-i-relevance-comparison-on-beir-benchmark-and-amazon-esci-in-terms-of-ndcg10-and-rank).
+
+## Among sparse encoders, document-only encoders are the most efficient
+
+You can run a neural sparse search in two modes: **bi-encoder** and **document-only**.
+
+In bi-encoder mode, both documents and search queries are passed through deep encoders. In document-only mode, documents are still passed through deep encoders, but search queries are instead tokenized. In this mode, document encoders are trained to learn more synonym associations in order to increase recall. By eliminating the online inference phase, you can **save computational resources** and **significantly reduce latency**. For benchmarks, compare the `Neural sparse document-only` column with the other columns in [Table II](#table-ii-speed-comparison-in-terms-of-latency-and-throughput).
+
+## Neural sparse search outperforms other search methods in benchmarking tests
+
+For benchmarking, we used a cluster containing 3 `r5.8xlarge` data nodes and 1 `r5.12xlarge` leader/machine learning (ML) node. We measured search relevance for all evaluated search methods in terms of NDCG@10. Additionally, we compared the runtime speed and the resource cost of each method.
+
+**Here are the key takeaways:**
+
+* Both modes provide the highest relevance on the BEIR and Amazon ESCI datasets.
+* Without online inference, the search latency of document-only mode is comparable to BM25.
+* Sparse encoding results in a much smaller index size than dense encoding. A document-only sparse encoder generates an index that is **10.4%** of the size of a dense encoding index. For a bi-encoder, the index size is **7.2%** of the size of a dense encoding index.
+* Dense encoding uses k-NN retrieval and incurs a 7.9% increase in RAM cost at search time. Neural sparse search uses a native Lucene index, so the RAM cost does not increase at search time.
+
+## Benchmarking results
+
+The benchmarking results are presented in the following tables.
+
+### Table I. Relevance comparison on BEIR benchmark and Amazon ESCI in terms of NDCG@10 and rank
+
+| Dataset | BM25 NDCG | BM25 Rank | Dense (with TAS-B model) NDCG | Dense (with TAS-B model) Rank | Hybrid (Dense + BM25) NDCG | Hybrid (Dense + BM25) Rank | Neural sparse search bi-encoder NDCG | Neural sparse search bi-encoder Rank | Neural sparse search document-only NDCG | Neural sparse search document-only Rank |
+|---|---|---|---|---|---|---|---|---|---|---|
+| Trec-Covid | 0.688 | 4 | 0.481 | 5 | 0.698 | 3 | 0.771 | 1 | 0.707 | 2 |
+| NFCorpus | 0.327 | 4 | 0.319 | 5 | 0.335 | 3 | 0.36 | 1 | 0.352 | 2 |
+| NQ | 0.326 | 5 | 0.463 | 3 | 0.418 | 4 | 0.553 | 1 | 0.521 | 2 |
+| HotpotQA | 0.602 | 4 | 0.579 | 5 | 0.636 | 3 | 0.697 | 1 | 0.677 | 2 |
+| FiQA | 0.254 | 5 | 0.3 | 4 | 0.322 | 3 | 0.376 | 1 | 0.344 | 2 |
+| ArguAna | 0.472 | 2 | 0.427 | 4 | 0.378 | 5 | 0.508 | 1 | 0.461 | 3 |
+| Touche | 0.347 | 1 | 0.162 | 5 | 0.313 | 2 | 0.278 | 4 | 0.294 | 3 |
+| DBPedia | 0.287 | 5 | 0.383 | 4 | 0.387 | 3 | 0.447 | 1 | 0.412 | 2 |
+| SciDocs | 0.165 | 2 | 0.149 | 5 | 0.174 | 1 | 0.164 | 3 | 0.154 | 4 |
+| FEVER | 0.649 | 5 | 0.697 | 4 | 0.77 | 2 | 0.821 | 1 | 0.743 | 3 |
+| Climate FEVER | 0.186 | 5 | 0.228 | 3 | 0.251 | 2 | 0.263 | 1 | 0.202 | 4 |
+| SciFact | 0.69 | 3 | 0.643 | 5 | 0.672 | 4 | 0.723 | 1 | 0.716 | 2 |
+| Quora | 0.789 | 4 | 0.835 | 3 | 0.864 | 1 | 0.856 | 2 | 0.788 | 5 |
+| Amazon ESCI | 0.081 | 3 | 0.071 | 5 | 0.086 | 2 | 0.077 | 4 | 0.095 | 1 |
+| Average | 0.419 | 3.71 | 0.41 | 4.29 | 0.45 | 2.71 | 0.492 | 1.64 | 0.462 | 2.64 |
+
+* For more information about Benchmarking Information Retrieval (BEIR), see [the BEIR GitHub page](https://github.com/beir-cellar/beir).
+
+### Table II. Speed comparison in terms of latency and throughput
+
+| | BM25 | Dense (with TAS-B model) | Neural sparse search bi-encoder | Neural sparse search document-only |
+|---------------------------|---------------|---------------------------| ------------------------------- | ------------------------------ |
+| P50 latency (ms) | 8 ms | 56.6 ms | 176.3 ms | 10.2 ms |
+| P90 latency (ms) | 12.4 ms | 71.12 ms | 267.3 ms | 15.2 ms |
+| P99 latency (ms) | 18.9 ms | 86.8 ms | 383.5 ms | 22 ms |
+| Max throughput (op/s) | 2215.8 op/s | 318.5 op/s |107.4 op/s | 1797.9 op/s |
+| Mean throughput (op/s) | 2214.6 op/s | 298.2 op/s |106.3 op/s | 1790.2 op/s |
+
+
+* We tested latency on a subset of MS MARCO v2 containing 1M documents in total. To obtain latency data, we used 20 clients to loop search requests.
+
+### Table III. Resource consumption comparison
+
+| |BM25 |Dense (with TAS-B model) |Neural sparse search bi-encoder | Neural sparse search document-only |
+|-|-|-|-|-|
+|Index size |1 GB |65.4 GB |4.7 GB |6.8 GB |
+|RAM usage |480.74 GB |675.36 GB |480.64 GB |494.25 GB |
+|Runtime RAM delta |+0.01 GB |+53.34 GB |+0.06 GB |+0.03 GB |
+
+* We performed this experiment using the full MS MARCO v2 dataset, containing 8.8M passages. For all methods, we excluded the `_source` fields and force merged the index before measuring index size. We set the heap size of the OpenSearch JVM to half of the node RAM, so an empty OpenSearch cluster still consumed close to 480 GB of memory.
+
+## Build your search engine in five steps
+
+Follow these steps to build your search engine:
+
+1. **Prerequisites**: For this simple setup, update the following cluster settings:
+
+ ```json
+ PUT /_cluster/settings
+ {
+ "transient": {
+ "plugins.ml_commons.allow_registering_model_via_url": true,
+ "plugins.ml_commons.only_run_on_ml_node": false,
+ "plugins.ml_commons.native_memory_threshold": 99
+ }
+ }
+ ```
+
+ For more information about ML-related cluster settings, see [ML Commons cluster settings](https://opensearch.org/docs/latest/ml-commons-plugin/cluster-settings/).
+2. **Deploy encoders**: The ML Commons plugin supports deploying pretrained models using a URL. For this example, you'll deploy the `opensearch-neural-sparse-encoding` encoder:
+
+ ```json
+ POST /_plugins/_ml/models/_register?deploy=true
+ {
+ "name": "opensearch-neural-sparse-encoding",
+ "version": "1.0.0",
+ "description": "opensearch-neural-sparse-encoding",
+ "model_format": "TORCH_SCRIPT",
+ "function_name": "SPARSE_ENCODING",
+ "model_content_hash_value": "d1ebaa26615090bdb0195a62b180afd2a8524c68c5d406a11ad787267f515ea8",
+ "url": "https://artifacts.opensearch.org/models/ml-models/amazon/neural-sparse/opensearch-neural-sparse-encoding-v1/1.0.1/torch_script/neural-sparse_opensearch-neural-sparse-encoding-v1-1.0.1-torch_script.zip"
+ }
+ ```
+
+ OpenSearch responds with a `task_id`:
+
+ ```json
+ {
+ "task_id": "",
+ "status": "CREATED"
+ }
+ ```
+
+ Use the `task_id` to check the status of the task:
+
+ ```json
+ GET /_plugins/_ml/tasks/
+ ```
+
+ Once the task is complete, the task state changes to `COMPLETED` and OpenSearch returns the `model_id` for the deployed model:
+
+ ```json
+ {
+ "model_id": "",
+ "task_type": "REGISTER_MODEL",
+ "function_name": "SPARSE_TOKENIZE",
+ "state": "COMPLETED",
+ "worker_node": [
+ "wubXZX7xTIC7RW2z8nzhzw"
+ ],
+ "create_time": 1701390988405,
+ "last_update_time": 1701390993724,
+ "is_async": true
+ }
+ ```
+
+3. **Set up ingestion**: In OpenSearch, a `sparse_encoding` ingest processor encodes documents into sparse vectors before indexing them. Create an ingest pipeline as follows:
+
+ ```json
+ PUT /_ingest/pipeline/neural-sparse-pipeline
+ {
+ "description": "An example neural sparse encoding pipeline",
+ "processors" : [
+ {
+ "sparse_encoding": {
+ "model_id": "",
+ "field_map": {
+ "passage_text": "passage_embedding"
+ }
+ }
+ }
+ ]
+ }
+ ```
+
+4. **Set up index mapping**: Neural search uses the `rank_features` field type to store token weights when documents are indexed. The index will use the ingest pipeline you created to generate text embeddings. Create the index as follows:
+
+ ```json
+ PUT /my-neural-sparse-index
+ {
+ "settings": {
+ "default_pipeline": "neural-sparse-pipeline"
+ },
+ "mappings": {
+ "properties": {
+ "passage_embedding": {
+ "type": "rank_features"
+ },
+ "passage_text": {
+ "type": "text"
+ }
+ }
+ }
+ }
+ ```
+
+5. **Ingest documents using the ingest pipeline**: After creating the index, you can ingest documents into it. When you index a text field, the ingest processor converts text into a vector embedding and stores it in the `passage_embedding` field specified in the processor:
+
+ ```json
+ PUT /my-neural-sparse-index/_doc/
+ {
+ "passage_text": "Hello world"
+ }
+ ```
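+
+    The `sparse_encoding` processor stores the generated token weights alongside the original text. A retrieved document might look similar to the following sketch; the tokens and weights shown here are illustrative only, and the actual values depend on the model:
+
+    ```json
+    {
+      "passage_text": "Hello world",
+      "passage_embedding": {
+        "hello": 1.42,
+        "world": 1.08,
+        "hi": 0.51,
+        "planet": 0.27
+      }
+    }
+    ```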
+
+### Try your engine with a query clause
+
+Congratulations! You've now created your own semantic search engine based on sparse encoders. To try a sample query, invoke the `_search` endpoint using the `neural_sparse` query:
+
+```json
+GET /my-neural-sparse-index/_search
+{
+  "query": {
+    "neural_sparse": {
+      "passage_embedding": {
+        "query_text": "Hello world a b",
+        "model_id": "",
+        "max_token_score": 2.0
+      }
+    }
+  }
+}
+```
+
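+The response is a standard `_search` response in which documents that share tokens with the expanded query receive higher scores based on the stored token weights. The following is an abbreviated sketch; the document ID and scores are hypothetical:
+
+```json
+{
+  "hits": {
+    "total": { "value": 1, "relation": "eq" },
+    "max_score": 2.86,
+    "hits": [
+      {
+        "_index": "my-neural-sparse-index",
+        "_id": "1",
+        "_score": 2.86,
+        "_source": {
+          "passage_text": "Hello world"
+        }
+      }
+    ]
+  }
+}
+```
+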
+### Neural sparse query parameters
+
+The `neural_sparse` query supports two parameters:
+
+- `model_id` (String): The ID of the model that is used to generate tokens and weights from the query text. A sparse encoding model will expand the tokens from query text, while the tokenizer model will only tokenize the query text itself.
+- `max_token_score` (Float): An extra parameter required for performance optimization. Just like a `match` query, a `neural_sparse` query is transformed into a Lucene BooleanQuery that combines term-level subqueries using disjunction. The difference is that a `neural_sparse` query uses FeatureQuery instead of TermQuery to match the terms. Lucene employs the Weak AND (WAND) algorithm for dynamic pruning, which skips non-competitive tokens based on their score upper bounds. However, FeatureQuery uses `FLOAT.MAX_VALUE` as the score upper bound, which makes the WAND optimization ineffective. The `max_token_score` parameter resets the score upper bound for each token in a query; its default value is `FLOAT.MAX_VALUE`, which is consistent with the original FeatureQuery. Setting the value to 3.5 for the bi-encoder model and to 2 for the document-only model can accelerate search without precision loss. After OpenSearch is upgraded to Lucene version 9.8, this parameter will be deprecated.
+
+## Selecting a model
+
+OpenSearch provides several pretrained encoder models that you can use out of the box without fine-tuning. For a list of sparse encoding models provided by OpenSearch, see [Sparse encoding models](https://opensearch.org/docs/latest/ml-commons-plugin/pretrained-models/#sparse-encoding-models).
+
+Use the following recommendations to select a sparse encoding model:
+
+- For **bi-encoder** mode, we recommend using the `opensearch-neural-sparse-encoding-v1` pretrained model. For this model, both online search and offline ingestion share the same model file.
+
+- For **document-only** mode, we recommend using the `opensearch-neural-sparse-encoding-doc-v1` pretrained model for ingestion and the `opensearch-neural-sparse-tokenizer-v1` model at search time to implement online query tokenization (see the example following this list). The tokenizer does not perform model inference; it only translates the query into tokens.
+
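+The following is a minimal sketch of a document-only mode search against the index from the tutorial above. It is the same `neural_sparse` query shown earlier, except that `model_id` refers to the deployed `opensearch-neural-sparse-tokenizer-v1` model (placeholder shown), so the query text is tokenized at search time instead of being expanded by a deep encoder:
+
+```json
+GET /my-neural-sparse-index/_search
+{
+  "query": {
+    "neural_sparse": {
+      "passage_embedding": {
+        "query_text": "Hello world",
+        "model_id": "<tokenizer model ID>",
+        "max_token_score": 2.0
+      }
+    }
+  }
+}
+```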
+
+## Next steps
+
+- For more information about neural sparse search, see [Neural sparse search](https://opensearch.org/docs/latest/search-plugins/neural-sparse-search/).
+- For an end-to-end neural search tutorial, see [Neural search tutorial](https://opensearch.org/docs/latest/search-plugins/neural-search-tutorial/).
+- For a list of all search methods OpenSearch supports, see [Search methods](https://opensearch.org/docs/latest/search-plugins/index/#search-methods).
+- Provide your feedback on the [OpenSearch Forum](https://forum.opensearch.org/).
\ No newline at end of file
diff --git a/_posts/2023-12-07-semantic-options-benchmarks.md b/_posts/2023-12-07-semantic-options-benchmarks.md
new file mode 100644
index 0000000000..522f15eff0
--- /dev/null
+++ b/_posts/2023-12-07-semantic-options-benchmarks.md
@@ -0,0 +1,156 @@
+---
+layout: post
+title: "Semantic Search with OpenSearch: Architecture options and Benchmarks"
+authors:
+ - seanzheng
+ - zanniu
+ - ylwu
+date: 2023-12-07
+categories:
+  - technical-posts
+meta_keywords: semantic search with OpenSearch, semantic search engine, deep neural network, benchmarking tests
+meta_description: Learn several ways to configure your OpenSearch cluster for semantic search, how each approach works, and how benchmarking results can help you select the right option for your use case.
+has_math: false
+has_science_table: false
+---
+Unlike traditional lexical search algorithms such as BM25, which only take keywords into account, semantic search improves search relevance by understanding the context and semantic meaning of search terms and content. In general, semantic search has two key elements: 1. **Embedding generation**: A machine learning (ML) model, usually a deep neural network (for example, TAS-B), is used to generate embeddings for both search terms and content; 2. **k-NN**: Searches return results based on embedding proximity using a vector search algorithm like k-nearest neighbors (k-NN).
+
+OpenSearch introduced the k-NN plugin to support vector search in 2019. However, users were left to manage embedding generation outside of OpenSearch. This changed with OpenSearch 2.9, when the new Neural Search plugin was released (available as an experimental feature in 2.4). The Neural Search plugin enables the integration of ML models into your search workloads. During ingestion and search, the plugin uses the ML model to transform text into vectors. Then it performs vector-based search using k-NN and returns semantically similar text-based search results.
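+
+For example, after a text embedding model is deployed and an ingest pipeline maps a text field to a vector field, a semantic query can be expressed with the `neural` query clause. The following is a sketch only; the index name, field name, and model ID are placeholders:
+
+```json
+GET /my-semantic-index/_search
+{
+  "query": {
+    "neural": {
+      "passage_embedding": {
+        "query_text": "wild west adventure",
+        "model_id": "<model ID>",
+        "k": 10
+      }
+    }
+  }
+}
+```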
+
+The addition of the vector transformation to the search process does come with a cost. It involves making inferences using deep neural network (DNN) language models, such as TAS-B, and inference with these DNN models is usually RAM and CPU intensive. If not set up correctly, it can put pressure on cluster resources and impact the health of your cluster. In the rest of this post, we'll introduce several different ways of configuring OpenSearch clusters for semantic search, explain in detail how each approach works, and present a set of benchmarks to help you choose the option that best fits your use case.
+
+## Terms
+
+Before we discuss the options, here are the definitions of some terms we’ll use throughout this post:
+
+* **Data node**: Where OpenSearch data is stored. A data node manages a cluster’s search and indexing tasks and is the primary coordinator of an OpenSearch cluster.
+* **ML node**: OpenSearch introduced ML nodes in 2.3. An ML node is dedicated to ML-related tasks, such as inference for language models. For instructions on setting up a dedicated ML node, see the OpenSearch documentation.
+* **ML connector**: Introduced with ML extensibility in 2.9, an ML connector allows you to connect your preferred inference service (for example, Amazon SageMaker) to OpenSearch. Once created, an ML connector can be used to build an ML model, which is registered just like a local model (see the example following this list).
+* **Local/remote inference**: With the newly introduced ML connector, OpenSearch allows ML inference to be hosted either locally on data or ML nodes; or remotely on public inference services.
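+
+The following request sketches the ML connector mentioned above for an Amazon SageMaker inference endpoint. It is an illustration only: the names, credentials, and URLs are placeholders, and the exact fields depend on the connector blueprint for your inference service and your OpenSearch version:
+
+```json
+POST /_plugins/_ml/connectors/_create
+{
+  "name": "sagemaker-embedding-connector",
+  "description": "Connector to a SageMaker embedding endpoint",
+  "version": 1,
+  "protocol": "aws_sigv4",
+  "credential": {
+    "access_key": "<access key>",
+    "secret_key": "<secret key>"
+  },
+  "parameters": {
+    "region": "<AWS Region>",
+    "service_name": "sagemaker"
+  },
+  "actions": [
+    {
+      "action_type": "predict",
+      "method": "POST",
+      "headers": { "content-type": "application/json" },
+      "url": "<SageMaker endpoint invocation URL>",
+      "request_body": "<request body expected by the endpoint>"
+    }
+  ]
+}
+```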
+
+## Architecture options
+
+OpenSearch provides multiple options for enabling semantic search: 1. Local inference on data nodes, 2. Local inference on ML nodes, 3. Remote inference on data nodes, and 4. Remote inference on ML nodes.
+
+**Option 1: Local inference on data nodes**
+
+With this option, both the Neural Search and ML Commons plugins reside on data nodes, just as any other plugin. Language models are loaded onto local data nodes, and inference is also executed locally.
+
+**Ingest flow**: As illustrated in Figure 1, the Neural Search plugin receives ingestion requests through the ingestion pipeline. It sends the text blob to ML Commons to generate embeddings. ML Commons runs the inference locally and returns the generated embeddings. Neural Search then ingests the generated embeddings into a k-NN index.
+
+**Query flow**: For query requests, the Neural Search plugin also sends the query to ML Commons, which will inference locally and return an embedding. Upon receiving the embedding, Neural Search will create a vector search request and send it to the k-NN plugin, which will execute the query and return a list of document IDs. These document IDs will then be returned to the user.
+
+![Figure 1: Local inference on data nodes](/assets/media/blog-images/2023-12-07-semantic-options-benchmarks/semantic-options-1.png)
+
+**Option 2: Local inference on ML nodes**
+
+With this option, dedicated ML nodes are set up to perform all ML-related tasks, including inference for language models. Everything else is identical to option 1. In both the ingestion and query flows, the inference request to generate embeddings will now be sent to the ML Commons plugin, which resides on a dedicated ML node instead of on a data node, as shown in the following figure:
+![Figure 2: Local inference on ML nodes](/assets/media/blog-images/2023-12-07-semantic-options-benchmarks/semantic-options-2.png)
+
+**Option 3: Remote inference on data nodes**
+
+This option was introduced in OpenSearch 2.9 with the ML extensibility feature. With this option, you use the ML connector to integrate with a remote server (outside of OpenSearch) for model inference (for example, SageMaker). Again, everything else is identical to option 1, except the inference requests are now forwarded by ML Commons from data nodes to the remote SageMaker endpoint through an ML connector, as shown in the following figure:
+
+![Figure 3: Remote inference on data nodes](/assets/media/blog-images/2023-12-07-semantic-options-benchmarks/semantic-options-3.png)
+
+
+**Option 4: Remote inference on ML nodes**
+
+This option is a combination of options 2 and 3. It still uses remote inference from SageMaker but also uses a dedicated ML node to host ML Commons, as shown in the following figure:
+
+![Figure 4: Remote inference on ML nodes](/assets/media/blog-images/2023-12-07-semantic-options-benchmarks/semantic-options-4.png)
+
+Each of the four options presents some pros and cons:
+
+* Option 1 is the default out-of-the-box option; it requires the least setup and configuration, and inference requests are organically distributed by OpenSearch's request routing. However, running ML models on data nodes could potentially affect other data node tasks, such as querying and ingestion.
+* Option 2 manages all ML tasks with dedicated ML nodes. The benefit of this option is that it decouples the ML tasks from the rest of the cluster, improving the reliability of the cluster. But this also adds an extra network hop to ML nodes, which increases inference latency.
+* Option 3 leverages an existing inference service, such as SageMaker. The remote connection will introduce extra network latency, but it also provides the benefit of offloading resource-intensive tasks to a dedicated inference server, which improves the reliability of the cluster and offers more model serving flexibility.
+* Option 4 adds dedicated ML nodes on top of remote inference. Similarly to option 2, the dedicated ML node manages all ML requests, which further separates the ML workload from the rest of the cluster. But this comes with the cost of the ML node. Also, because the heavy lifting of the ML workload happens outside of the cluster, the ML node utilization could be low with this option.
+
+## Benchmarking
+
+To better understand the ingestion/query performance difference between these options, we designed a series of benchmarking tests. The following are our results and observations.
+
+### Experiment setup
+
+1. Dataset: We used MS MARCO as the primary dataset for benchmarking. MS MARCO is a collection of datasets focused on deep learning in search. It contains 12M documents, with an average length of 1,500 words, and is approximately 100 GB in size. Note that we configured the models to truncate input so that only the first 128 tokens of each document were used in our experiments.
+2. Model: We chose sentence-transformers/all-MiniLM-L6-v2 from a list of [pretrained models](https://opensearch.org/docs/latest/ml-commons-plugin/pretrained-models/#supported-pretrained-models) supported by OpenSearch.
+ 1. All pretrained models support truncation/padding to control the input length; we set both at 128.
+3. Cluster configuration:
+    1. Node type: m5.xlarge (4 cores, 16 GB RAM)
+    2. To ensure an apples-to-apples comparison, we configured all cluster options to use the same type and number of nodes in order to keep the cost similar:
+        1. Option 1: Local inference on data nodes: 3 data nodes
+        2. Option 2: Local inference on ML nodes: 2 data nodes, 1 ML node
+        3. Option 3: Remote inference on data nodes: 2 data nodes, 1 SageMaker node
+        4. Option 4: Remote inference on ML nodes: 1 data node, 1 ML node, 1 SageMaker node
+4. Benchmarking tool: We used [OpenSearch Benchmark](https://github.com/opensearch-project/opensearch-benchmark) to generate traffic and collect results.
+
+### Experiment 1: Ingestion
+
+**Ingestion setup**
+
+|Configuration |Value|
+|--- |--- |
+|Number of clients |8|
+|Bulk size |200|
+|Document count |1M|
+|Local model truncation |128|
+|SageMaker model truncation |128|
+|Local model padding |128|
+|SageMaker model padding |128|
+|Dataset | [MS MARCO](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/msmarco.zip)|
+
+**Experiment 1: Results**
+
+|Case |Mean throughput (doc/s) |Inference p90 (ms/doc) |SageMaker inference p90 (ms/req) |SageMaker overhead p90 (ms/req) |e2e latency p90 (ms/bulk)|
+|---|---|---|---|---|---|
+|Option 1: Local inference on data nodes (3 data nodes) |**_213.13_**|72.46 |N/A |N/A |8944.53|
+|Option 2: Local inference on ML nodes (2 data nodes + 1 ML node) |72.76 |67.79 |N/A |N/A |25936.7|
+|Option 3: Remote inference on data nodes (2 data nodes + 1 remote ML node) |**_94.41_** |101.9 |97 |3.5 |17455.9|
+|Option 4: Remote inference on ML nodes (1 data node + 1 local ML node + 1 remote ML node) |79.79 |60.37 |54.8 |3.5 |21714.6|
+
+**Experiment 1: Observations**
+
+* Option 1 provides much higher throughput than the other options. This is probably because ML models were deployed to all three data nodes, while the other options have only one dedicated ML node performing inference work. Note that we didn’t perform other tasks during the experiment, so all the nodes are dedicated to ingestion. This might not be the case in a real-world scenario. When the cluster multitasks, the ML inference workload may impact other tasks and cluster health.
+* Comparing options 2 and 3, we can see that even though option 2 has lower inference latency, its throughput is much lower than with option 3, which has a remote ML node. This could be because the SageMaker node is built and optimized solely for inference, while the local ML node still runs the OpenSearch stack and is not optimized for an inference workload.
+* Remote inference added only trivial overhead (3.5 ms of SageMaker overhead). We ran our tests on a public network; tests run on a virtual private cloud (VPC)-based network might yield slightly different results, but the difference is unlikely to be significant.
+
+### Experiment 2: Query
+
+**Query setup**
+
+|Configuration |Value|
+|--- |--- |
+|Number of clients |50|
+|Document count |500k|
+|Local model truncation |128|
+|SageMaker model truncation |128|
+|Local model padding |128|
+|SageMaker model padding |128|
+|Dataset | [MS MARCO](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/msmarco.zip)|
+
+**Experiment 2: Results**
+
+|Case |Mean throughput (query/s) |Inference p90 (ms/query) |SageMaker inference p90 (ms/req) |SageMaker overhead p90 (ms/req)|e2e latency p90 (ms/query)|
+|---|---|---|---|---|---|
+|Option 1: Local inference on data nodes (3 data nodes) |128.49 |37.6 |N/A |N/A |82.6|
+|Option 2: Local inference on ML nodes (2 data nodes + 1 ML node) |141.5 |29.5 |N/A |N/A |72.9|
+|Option 3: Remote inference on data nodes (2 data nodes + remote ML node) |**_162.19_** |26.4 |21.5 |4.9 |72.5|
+|Option 4: Remote inference on ML nodes (1 data node + 1 local ML node + remote ML node) |136.2 |26.6 |21.6 |5 |76.65|
+
+**Experiment 2: Observations**
+
+* Inference latency is much lower than in the ingestion experiment (~30 ms compared to 60–100 ms). This is primarily because query terms are usually much shorter than documents.
+* Externally hosted models outperformed local models on inference tasks by about 10%, even considering the network overhead.
+* Unlike with ingestion, inference latency is a considerable part of end-to-end query latency, so the configuration with the lowest inference latency achieves higher throughput and lower end-to-end latency.
+* The remote model with a dedicated ML node ranked lowest in throughput, which could be because all remote requests have to pass through the single ML node instead of through multiple data nodes, as in the other configurations.
+
+## Conclusion/Recommendations
+
+In this blog post, we provided multiple options for configuring your OpenSearch cluster for semantic search, including local/remote inference and dedicated ML nodes. You can choose between these options to optimize costs and benefits based on your desired outcome. Based on our benchmarking results and observations, we recommend the following:
+
+* Remotely connected models separate ML workloads from the OpenSearch cluster, with only a small amount of extra latency. This option also provides flexibility in terms of the amount of computation power used for making inferences (for example, leveraging SageMaker GPU instances). This is our **recommended** option for any production-oriented systems.
+* Local inference works out of the box on existing clusters without any additional resources. You can use this option to quickly set up a development environment or build proofs of concept (PoCs). Because the heavy ML workload could potentially affect cluster querying and ingestion performance, we don't recommend this option for production systems. If you do have to use local inference for your production systems, we strongly recommend using dedicated ML nodes to separate the ML workload from the rest of your cluster.
+* Dedicated ML nodes help improve query latency for local models (by taking over all ML-related tasks from data nodes), but they don't help much with remote inference because the heavy lifting is performed outside of the OpenSearch cluster. Also, because ML nodes don't handle any non-ML tasks, adding an ML node won't improve query or ingestion throughput.
+
+
diff --git a/assets/media/authors/zanniu.jpeg b/assets/media/authors/zanniu.jpeg
new file mode 100644
index 0000000000..cb93b78e58
Binary files /dev/null and b/assets/media/authors/zanniu.jpeg differ
diff --git a/assets/media/blog-images/2023-12-07-semantic-options-benchmarks/semantic-options-1.png b/assets/media/blog-images/2023-12-07-semantic-options-benchmarks/semantic-options-1.png
new file mode 100644
index 0000000000..bb8d13d35b
Binary files /dev/null and b/assets/media/blog-images/2023-12-07-semantic-options-benchmarks/semantic-options-1.png differ
diff --git a/assets/media/blog-images/2023-12-07-semantic-options-benchmarks/semantic-options-2.png b/assets/media/blog-images/2023-12-07-semantic-options-benchmarks/semantic-options-2.png
new file mode 100644
index 0000000000..b0835e6487
Binary files /dev/null and b/assets/media/blog-images/2023-12-07-semantic-options-benchmarks/semantic-options-2.png differ
diff --git a/assets/media/blog-images/2023-12-07-semantic-options-benchmarks/semantic-options-3.png b/assets/media/blog-images/2023-12-07-semantic-options-benchmarks/semantic-options-3.png
new file mode 100644
index 0000000000..0190dc114c
Binary files /dev/null and b/assets/media/blog-images/2023-12-07-semantic-options-benchmarks/semantic-options-3.png differ
diff --git a/assets/media/blog-images/2023-12-07-semantic-options-benchmarks/semantic-options-4.png b/assets/media/blog-images/2023-12-07-semantic-options-benchmarks/semantic-options-4.png
new file mode 100644
index 0000000000..f51eefb591
Binary files /dev/null and b/assets/media/blog-images/2023-12-07-semantic-options-benchmarks/semantic-options-4.png differ