diff --git a/README.md b/README.md
index ef67e99..9ff1266 100644
--- a/README.md
+++ b/README.md
@@ -21,15 +21,16 @@ An enterprise-grade AI retriever designed to streamline AI integration into your
## 📝 Description
Denser Retriever combines multiple search technologies into a single platform. It utilizes **gradient boosting (
-xgboost)** machine learning technique to combines:
+xgboost)** machine learning technique to combine:
- **Keyword-based searches** that focus on fetching precisely what the query mentions.
- **Vector databases** that are great for finding a wide range of potentially relevant answers.
- **Machine Learning rerankers** that fine-tune the results to ensure the most relevant answers top the list.
-We show that the combinations can significantly improve the retrieval accuracy on MTEB benchmarks when compared to each individual retrievers.
+Our experiments on MTEB datasets show that the combination of keyword search, vector search and a reranker via an xgboost model (denoted as ES+VS+RR_n) can significantly improve the vector search (VS) baseline.
+
+![mteb_ndcg_plot](mteb_ndcg_plot.png)
-![](./www/content/docs/experiments/rag-flow-data.png)
## 🚀 Features
@@ -38,8 +39,6 @@ The initial release of Denser Retriever provides the following features.
- Supporting heterogeneous retrievers such as **keyword search**, **vector search**, and **ML model reranking**
- Leveraging **xgboost** ML technique to effectively combine heterogeneous retrievers
- **State-of-the-art accuracy** on [MTEB](https://github.com/embeddings-benchmark/mteb) Retrieval benchmarking
-- Providing an **out-of-the-box retriever** which significantly outperforms the current best vector search model in
- similar model size
- Demonstrating how to use Denser retriever to power an **end-to-end applications** such as chatbot and semantic search
## 📦 Installation
@@ -48,6 +47,8 @@ We use [Poetry](https://python-poetry.org/docs/) to install and manage Denser Re
Retriever with the following command under repo root directory.
```bash
+git clone https://github.com/denser-org/denser-retriever
+cd denser-retriever
make install
```
diff --git a/docker/milvus/standalone/hello_milvus.py b/docker/milvus/standalone/hello_milvus.py
index c940949..78f9e47 100644
--- a/docker/milvus/standalone/hello_milvus.py
+++ b/docker/milvus/standalone/hello_milvus.py
@@ -33,9 +33,7 @@
print(fmt.format("start connecting to Milvus"))
connections.connect(
"default",
- # host="localhost",
- host="54.68.68.29",
- # host="44.237.177.8",
+ host="localhost",
port="19530",
user="root",
password="Milvus",
diff --git a/docker/milvus/standalone/list_connections.py b/docker/milvus/standalone/list_connections.py
index d1c0483..53d8935 100644
--- a/docker/milvus/standalone/list_connections.py
+++ b/docker/milvus/standalone/list_connections.py
@@ -6,8 +6,7 @@
connections.connect(
"default",
- # host="localhost",
- host="54.68.68.29",
+ host="localhost",
port="19530",
user="root",
password="Milvus",
diff --git a/examples/denser_chat.py b/examples/denser_chat.py
index 85c4a19..e2345d5 100644
--- a/examples/denser_chat.py
+++ b/examples/denser_chat.py
@@ -60,10 +60,6 @@ def denser_chat():
st.session_state.messages.append({"role": "user", "content": prompt})
- enc = tiktoken.encoding_for_model(default_openai_model)
- prompt_length = len(enc.encode(prompt))
- logger.info(f"prompt length:{prompt_length}")
-
with st.chat_message("assistant"):
message_placeholder = st.empty()
full_response = ""
diff --git a/experiments/config_local.yaml b/experiments/config_local.yaml
index 96152cd..6495207 100644
--- a/experiments/config_local.yaml
+++ b/experiments/config_local.yaml
@@ -20,10 +20,6 @@ vector:
milvus_port: 19530
milvus_user: root
milvus_passwd: Milvus
- # https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5
- # sentence-transformers/all-MiniLM-L6-v2 (dim: 384), Alibaba-NLP/gte-large-en-v1.5 (dim 1024),
- # Alibaba-NLP/gte-base-en-v1.5 (dim: 768)
- # Snowflake/snowflake-arctic-embed-m
emb_model: Snowflake/snowflake-arctic-embed-m
emb_dims: 768
one_model: false
@@ -31,9 +27,6 @@ vector:
topk: 100
rerank:
- # https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2
- # cross-encoder/ms-marco-MiniLM-L-2-v2, cross-encoder/ms-marco-MiniLM-L-4-v2, or cross-encoder/ms-marco-MiniLM-L-6-v2
- # rerank_model: cross-encoder/ms-marco-TinyBERT-L-2-v2
rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2
rerank_bs: 100
topk: 100
diff --git a/experiments/config_server.yaml b/experiments/config_server.yaml
index 1c7d9a2..9fc2fb4 100644
--- a/experiments/config_server.yaml
+++ b/experiments/config_server.yaml
@@ -10,20 +10,16 @@ model_features: es+vs+rr_n
keyword:
es_user: elastic
- es_passwd: WzAkbzjZj9AfNXxzmOmp
- es_host: http://54.68.68.29:9200
+ es_passwd: YOUR_ES_PASSWORD
+    es_host: http://YOUR_SERVER_IP:9200
es_ingest_passage_bs: 5000
topk: 100
vector:
- milvus_host: 54.68.68.29
+  milvus_host: YOUR_SERVER_IP
milvus_port: 19530
milvus_user: root
milvus_passwd: Milvus
- # https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5
- # sentence-transformers/all-MiniLM-L6-v2 (dim: 384), Alibaba-NLP/gte-large-en-v1.5 (dim 1024),
- # Alibaba-NLP/gte-base-en-v1.5 (dim: 768)
- # Snowflake/snowflake-arctic-embed-m
emb_model: Snowflake/snowflake-arctic-embed-m
emb_dims: 768
one_model: false
@@ -31,9 +27,6 @@ vector:
topk: 100
rerank:
- # https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2
- # cross-encoder/ms-marco-MiniLM-L-2-v2, cross-encoder/ms-marco-MiniLM-L-4-v2, or cross-encoder/ms-marco-MiniLM-L-6-v2
- # rerank_model: cross-encoder/ms-marco-TinyBERT-L-2-v2
rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2
rerank_bs: 100
topk: 100
diff --git a/experiments/train_and_test.py b/experiments/train_and_test.py
index 38d9532..c492c73 100644
--- a/experiments/train_and_test.py
+++ b/experiments/train_and_test.py
@@ -35,7 +35,6 @@ def __init__(self, config_file, dataset_name):
# (100 passages per query) and reranker passages (maximum 200 passages per query)
def generate_retriever_data(self, train_on, eval_on):
generate_data(self.dataset_name, train_on, self.config_file, ingest=True)
- # generate_data(self.dataset_name, train_on, self.config_file, ingest=False)
if eval_on != train_on:
generate_data(self.dataset_name, eval_on, self.config_file, ingest=False)
@@ -329,7 +328,7 @@ def report(self, eval_on):
if __name__ == "__main__":
- config_file = "experiments/config_server.yaml"
+ # config_file = "experiments/config_server.yaml"
# dataset = ["mteb/arguana", "test", "test"]
# dataset = ["mteb/climate-fever", "test", "test"]
@@ -349,13 +348,14 @@ def report(self, eval_on):
# dataset_name, train_on, eval_on = dataset
# model_dir = "/home/ubuntu/denser_output_retriever/exp_msmarco/models/"
- if len(sys.argv) != 4:
- print("Usage: python train_and_test.py [dataset_name] [train] [test]")
+ if len(sys.argv) != 5:
+ print("Usage: python train_and_test.py [config_file] [dataset_name] [train] [test]")
sys.exit(0)
- dataset_name = sys.argv[1]
- train_on = sys.argv[2]
- eval_on = sys.argv[3]
+ config_file = sys.argv[1]
+ dataset_name = sys.argv[2]
+ train_on = sys.argv[3]
+ eval_on = sys.argv[4]
experiment = Experiment(config_file, dataset_name)
diff --git a/mteb_ndcg_plot.png b/mteb_ndcg_plot.png
new file mode 100644
index 0000000..8fb00ec
Binary files /dev/null and b/mteb_ndcg_plot.png differ
diff --git a/tests/config-cpws.yaml b/tests/config-cpws.yaml
index 66f2ca3..9999fde 100644
--- a/tests/config-cpws.yaml
+++ b/tests/config-cpws.yaml
@@ -25,9 +25,6 @@ vector:
topk: 5
rerank:
- # https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2
- # cross-encoder/ms-marco-MiniLM-L-2-v2, cross-encoder/ms-marco-MiniLM-L-4-v2, or cross-encoder/ms-marco-MiniLM-L-6-v2
- # rerank_model: cross-encoder/ms-marco-TinyBERT-L-2-v2
rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2
rerank_bs: 100
topk: 5
@@ -45,6 +42,5 @@ fields:
output_prefix: denser_output_retriever/
-## temp parameters
max_doc_size: 0
max_query_size: 0
diff --git a/tests/config-denser.yaml b/tests/config-denser.yaml
index 1aa195d..3f69439 100644
--- a/tests/config-denser.yaml
+++ b/tests/config-denser.yaml
@@ -27,15 +27,11 @@ vector:
topk: 5
rerank:
- # https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2
- # cross-encoder/ms-marco-MiniLM-L-2-v2, cross-encoder/ms-marco-MiniLM-L-4-v2, or cross-encoder/ms-marco-MiniLM-L-6-v2
- # rerank_model: cross-encoder/ms-marco-TinyBERT-L-2-v2
rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2
rerank_bs: 100
topk: 5
output_prefix: denser_output_retriever/
-## temp parameters
max_doc_size: 0
max_query_size: 0
diff --git a/tests/config-titanic.yaml b/tests/config-titanic.yaml
index 3d1f6a4..e13909a 100644
--- a/tests/config-titanic.yaml
+++ b/tests/config-titanic.yaml
@@ -25,9 +25,6 @@ vector:
topk: 5
rerank:
- # https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2
- # cross-encoder/ms-marco-MiniLM-L-2-v2, cross-encoder/ms-marco-MiniLM-L-4-v2, or cross-encoder/ms-marco-MiniLM-L-6-v2
- # rerank_model: cross-encoder/ms-marco-TinyBERT-L-2-v2
rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2
rerank_bs: 100
topk: 5
@@ -44,6 +41,5 @@ fields:
output_prefix: denser_output_retriever/
-## temp parameters
max_doc_size: 0
max_query_size: 0
diff --git a/www/content/docs/examples/e2e-chat.mdx b/www/content/docs/examples/e2e-chat.mdx
index aa28010..7ecc044 100644
--- a/www/content/docs/examples/e2e-chat.mdx
+++ b/www/content/docs/examples/e2e-chat.mdx
@@ -19,7 +19,7 @@ Under the [repo](https://github.com/denser-org/denser-retriever/tree/main) direc
poetry run streamlit run examples/denser_chat.py
```
-This command first builds a mini retriever with the following code. It then launches a webpage with interactive UI so suers can input queries.
+This command first builds a mini retriever with the following code.
```python
index_name = "unit_test_denser"
@@ -27,10 +27,8 @@ retriever = RetrieverGeneral(index_name, "tests/config-denser.yaml")
retriever.ingest("tests/test_data/denser_website_passages_top10.jsonl")
```
-Once it is launched, we will see a chatbot interface similar to the following screenshot. We can ask any question to the chatbot and get the response. Below shows an example query of `what use cases does denser support?` The retriever first returns the relevant passages listed under the section of `Sources` at the bottom of the screenshot. These passages are fed to a LLM and the final summarization is displayed on the chat window.
+It then launches a webpage with an interactive UI so users can input queries. Once launched, we will see a chatbot interface similar to the following screenshot. We can ask the chatbot any question and get a response. Below is an example query, `what use cases does denser support?`. The retriever first returns the relevant passages, listed under the `Sources` section at the bottom of the screenshot. These passages are fed to an LLM and the final summary is displayed in the chat window.
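+
+Behind the UI, the app follows a retrieve-then-summarize pattern. A rough sketch (illustrative only; the exact prompt construction in `denser_chat.py` may differ):
+
+```python
+# `retriever` is the RetrieverGeneral built above.
+query = "what use cases does denser support?"
+passages, docs = retriever.retrieve(query, {})
+context = "\n\n".join(p["text"] for p in passages)
+prompt = f"Answer using the context.\n\nContext:\n{context}\n\nQuestion: {query}"
+# The prompt is then sent to an LLM (e.g. an OpenAI chat model) and the
+# response is shown in the chat window, with the passages listed as Sources.
+```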
import DenserChat from "./denser_chat.png"
-
-With the same code, we can build a chatbot on **Wikipedia** dataset at [here](https://denser.ai). Feel free to try it out!
diff --git a/www/content/docs/examples/e2e-search.mdx b/www/content/docs/examples/e2e-search.mdx
index 9e6c339..ea939ec 100644
--- a/www/content/docs/examples/e2e-search.mdx
+++ b/www/content/docs/examples/e2e-search.mdx
@@ -19,7 +19,7 @@ Under the [repo](https://github.com/denser-org/denser-retriever/tree/main) direc
poetry run streamlit run examples/denser_search.py
```
-This command first builds a mini retriever with the following code. It then launches a webpage with interactive UI so uers can input queries.
+This command first builds a mini retriever with the following code.
```python
index_name = "unit_test_titanic"
@@ -27,7 +27,7 @@ retriever = RetrieverGeneral(index_name, "tests/config-titanic.yaml")
retriever.ingest("tests/test_data/titanic_top10.jsonl")
```
-Once it is launched, we will see a search interface similar to the following screenshot. We can input any queries, along with the filters to search relevant results. Below shows an example query of `cumings` with `Sex` field is filled with `female` The retriever returns the relevant passages which matches the specified filter value.
+It then launches a webpage with an interactive UI so users can input queries. Once launched, we will see a search interface similar to the following screenshot. We can input any query, along with filters, to search for relevant results. Below is an example query, `cumings`, with the `Sex` field set to `female`. The retriever returns the relevant passages that match the specified filter value.
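+
+For illustration, the same filtered search can be issued programmatically. A minimal sketch, assuming field filters are passed as the second argument to `retrieve()` with keys following the `fields` in `tests/config-titanic.yaml`:
+
+```python
+passages, docs = retriever.retrieve("cumings", {"Sex": "female"})
+for p in passages:
+    print(p["source"], p["score"])
+```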
import DenserSearch from "./denser_search.png"
diff --git a/www/content/docs/experiments/index_and_query.mdx b/www/content/docs/experiments/index_and_query.mdx
index ed0fe8d..5e113be 100644
--- a/www/content/docs/experiments/index_and_query.mdx
+++ b/www/content/docs/experiments/index_and_query.mdx
@@ -11,6 +11,23 @@ In index and query use case, users provide a collection of documents such as tex
```bash
poetry run python experiments/index_and_query_from_docs.py
```
+If the run is successful, we would expect to see something similar to the following.
+
+```bash
+2024-05-27 12:00:55 INFO: ES ingesting passages.jsonl record 96
+2024-05-27 12:00:55 INFO: Done building ES index
+2024-05-27 12:00:55 INFO: Remove existing Milvus index state_of_the_union
+2024-05-27 12:00:59 INFO: Milvus vector DB ingesting passages.jsonl record 96
+2024-05-27 12:01:03 INFO: Done building Vector DB index
+[{'source': 'tests/test_data/state_of_the_union.txt',
+'text': 'One of the most serious constitutional responsibilities...',
+'title': '', 'pid': 73,
+'score': -1.6985594034194946}]
+```
+
+## Build and query a retriever from a text file
+
+### Overview
The index and query use case consists of two steps:
@@ -27,7 +44,7 @@ The following diagram illustrates a denser retriever, which consists of three co
- **Vector search** uses neural network models to encode both the query and the documents into dense vector representations in a high-dimensional space. We use [Milvus](https://milvus.io/docs/install_standalone-docker.md) and [snowflake-arctic-embed-m](https://github.com/Snowflake-Labs/arctic-embed?tab=readme-ov-file) model, which achieves state-of-the-art performance on the MTEB/BEIR [leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for each of their size variants.
- **A ML cross-encoder re-ranker** can be utilized to further boost accuracy over these two retriever approaches above. We use [cross-encoder/ms-marco-MiniLM-L-6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2), which has a good balance between accuracy and inference latency.
-## Build and query a retriever from a text file
+In the following section, we will explain the underlying processes and mechanisms involved.
@@ -37,7 +54,7 @@ The following diagram illustrates a denser retriever, which consists of three co
We config the above three components in the following yaml file (available at [repo](https://github.com/denser-org/denser-retriever/blob/main/experiments/config_local.yaml)). Most of the parameters are self-explanatory. The sections of `keyword`, `vector`, `rerank` config the Elasticsearch, Milvus, and reranker respectively.
-We uses `combine=model` to combine Elasticsearch, Milvus and reranker via a xgboost model `experiments/models/scifact_xgb_es+vs+rr_n.json`, which was trained using mteb [scifact](https://huggingface.co/datasets/mteb/scifact) dataset (see [training](./training) on how to train such a model). Besides the model combination, we can also use `linear` or `rank` to combine elasticsearch, Milvus and reranker. The experiments on MTEB datasets suggest that the model combination can lead to significantly higher accuracy than the linear or rank methods.
+We use **combine: model** to combine Elasticsearch, Milvus and the reranker via an [xgboost](https://github.com/dmlc/xgboost) model, **experiments/models/msmarco_xgb_es+vs+rr_n.json**, which was trained on the mteb [msmarco](https://huggingface.co/datasets/mteb/msmarco) dataset (see the [training](https://retriever.denser.ai/docs/experiments/training) recipe for how to train such a model). Besides the model combination, we can also use **linear** or **rank** to combine Elasticsearch, Milvus and the reranker. The experiments on MTEB datasets suggest that the model combination can lead to significantly higher accuracy than the linear or rank methods.
Some parameters, for example, `es_ingest_passage_bs`, are only used in training a xgboost model (i.e. not needed in query stage).
@@ -49,7 +66,7 @@ combine: model
keyword_weight: 0.5
vector_weight: 0.5
rerank_weight: 0.5
-model: ./experiments/models/scifact_xgb_es+vs+rr_n.json
+model: ./experiments/models/msmarco_xgb_es+vs+rr_n.json
model_features: es+vs+rr_n
keyword:
@@ -64,10 +81,6 @@ vector:
milvus_port: 19530
milvus_user: root
milvus_passwd: Milvus
- # https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5
- # sentence-transformers/all-MiniLM-L6-v2 (dim: 384), Alibaba-NLP/gte-large-en-v1.5 (dim 1024),
- # Alibaba-NLP/gte-base-en-v1.5 (dim: 768)
- # Snowflake/snowflake-arctic-embed-m
emb_model: Snowflake/snowflake-arctic-embed-m
emb_dims: 768
one_model: false
@@ -75,16 +88,12 @@ vector:
topk: 100
rerank:
- # https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2
- # cross-encoder/ms-marco-MiniLM-L-2-v2, cross-encoder/ms-marco-MiniLM-L-4-v2, or cross-encoder/ms-marco-MiniLM-L-6-v2
- # rerank_model: cross-encoder/ms-marco-TinyBERT-L-2-v2
rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2
rerank_bs: 100
topk: 100
output_prefix: ./denser_output_retriever/
-## temp parameters
max_doc_size: 0
max_query_size: 10000
```
@@ -95,7 +104,7 @@ max_query_size: 10000
### Generate passages
-We now describe how to build a retriever from a given text file: the [state_of_the_union.txt](https://github.com/denser-org/denser-retriever/blob/main/tests/test_data/state_of_the_union.txt). The following code shows how to read the text file, split the file to text chunks and save them to a jsonl file `passage_file`.
+We now describe how to build a retriever from a given text file: the [state_of_the_union.txt](https://github.com/denser-org/denser-retriever/blob/main/tests/test_data/state_of_the_union.txt). The following code shows how to read the text file, split it into text chunks, and save them to a jsonl file, `passages.jsonl`.
```python
from langchain_community.document_loaders import TextLoader
@@ -111,10 +120,13 @@ passage_file = "passages.jsonl"
save_HF_docs_as_denser_passages(texts, passage_file, 0)
```
-Each line in `passage_file` is a passage, which contains fields of `source`, `title`, `text` and `pid` (passage id).
+Each line in `passages.jsonl` is a passage, which contains the fields `source`, `title`, `text` and `pid` (passage id).
```json
-{"source": "tests/test_data/state_of_the_union.txt", "title": "", "text": "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny.", "pid": 0}
+{"source": "tests/test_data/state_of_the_union.txt",
+"title": "",
+"text": "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny.",
+"pid": 0}
```
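+
+To sanity-check the generated file, we can read the first passage back (a small illustrative snippet):
+
+```python
+import json
+
+# Load the first line of passages.jsonl and inspect its fields.
+with open("passages.jsonl") as f:
+    passage = json.loads(f.readline())
+print(passage["source"], passage["pid"])
+```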
@@ -123,7 +135,7 @@ Each line in `passage_file` is a passage, which contains fields of `source`, `ti
### Build a Denser retriever
-We can build a Denser retriever with the given `passage_file` and `experiments/config.yaml` config file.
+We can build a Denser retriever with the given `passages.jsonl` and `experiments/config_local.yaml` config file.
```python
# Build denser index
@@ -148,7 +160,11 @@ print(passages)
Each returned passage receives a confidence `score` to indicate how relevant it is to the given query. We get something similar to the following.
```python
-[{'source': 'tests/test_data/state_of_the_union.txt', 'text': 'One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', 'title': '', 'pid': 73, 'score': -0.6116511225700378}]
+[{'source': 'tests/test_data/state_of_the_union.txt',
+'text': 'One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.',
+'title': '',
+'pid': 73,
+'score': -0.6116511225700378}]
```
@@ -186,38 +202,23 @@ print(passages)
## Build and query a retriever from a webpage
-The steps to build and query a retriever from a webpage is similar to the above, except for the passage corpus generation. We list the code below and we can also find the code from [repo](https://github.com/denser-org/denser-retriever/blob/main/experiments/index_and_query_from_webpage.py).
+Building a retriever from a webpage is similar to the above, except for the passage corpus generation. The **index_and_query_from_webpage.py** source code can be found [here](https://github.com/denser-org/denser-retriever/blob/main/experiments/index_and_query_from_webpage.py).
-```python
-import bs4
-from langchain_community.document_loaders import WebBaseLoader
-from langchain_text_splitters import RecursiveCharacterTextSplitter
-from denser_retriever.utils import save_HF_docs_as_denser_passages
-from denser_retriever.retriever_general import RetrieverGeneral
+To run this use case, go to the denser-retriever repo and run:
-# Load, chunk and index the contents of the blog to create a retriever.
-loader = WebBaseLoader(
- web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
- bs_kwargs=dict(
- parse_only=bs4.SoupStrainer(
- class_=("post-content", "post-title", "post-header")
- )
- ),
-)
-docs = loader.load()
-
-text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
-texts = text_splitter.split_documents(docs)
-passage_file = "passages.jsonl"
-save_HF_docs_as_denser_passages(texts, passage_file, 0)
+```bash
+poetry run python experiments/index_and_query_from_webpage.py
+```
-# Build denser index
-retriever_denser = RetrieverGeneral("agent_webpage", "experiments/config_local.yaml")
-retriever_denser.ingest(passage_file)
+If successful, we expect to see something similar to the following.
-# Query
-query = "What is Task Decomposition?"
-passages, docs = retriever_denser.retrieve(query, {})
-print(passages)
-
-```
+```bash
+2024-05-27 12:10:47 INFO: ES ingesting passages.jsonl record 66
+2024-05-27 12:10:47 INFO: Done building ES index
+2024-05-27 12:10:52 INFO: Milvus vector DB ingesting passages.jsonl record 66
+2024-05-27 12:10:56 INFO: Done building Vector DB index
+[{'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/',
+'text': 'Fig. 1. Overview of a LLM-powered autonomous agent system...',
+'title': '',
+'pid': 2,
+'score': -1.6985594034194946}]
+```
diff --git a/www/content/docs/experiments/mteb_retrieval.mdx b/www/content/docs/experiments/mteb_retrieval.mdx
index c71ff7b..192ccb8 100644
--- a/www/content/docs/experiments/mteb_retrieval.mdx
+++ b/www/content/docs/experiments/mteb_retrieval.mdx
@@ -31,7 +31,9 @@ MTEB [retrieval datasets](https://github.com/embeddings-benchmark/mteb) consists
## Train and test xgboost models
-For each dataset in [MTEB](https://github.com/embeddings-benchmark/mteb), we trained an xgboost models on the training dataset and tested on the test dataset. To speed up the experiments, we used up to 10k queries per dataset in training (`max_query_size: 10000` in `config_server.yaml`). For datasets which do not have training data, we used the development data to train. If neither training nor development data exists, we applied the 3-fold cross-validation. That is, we randomly split the test data into three folds, we used two folds to train a xgboost model and tested on the third fold. We applied this process three times so the whole test dataset can be evaluated. We fixed the xgboost model training with the following settings. Specifically, we used the ndcg metric as model update objective, a moderate learning rate (`eta`) of 0.1, regularization parameter (`gamma`) of 1.0, `min_child_weight` of 0.1, maximum depth of tree up to 6, and evaluation metric of ndcg@10. We used a fixed number (100) of boosting iterations (`num_boost_round`), thus no attempting to optimize the training per dataset.
+For each dataset in [MTEB](https://github.com/embeddings-benchmark/mteb), we trained an xgboost model on the training dataset and tested on the test dataset. To speed up the experiments, we used up to 10k queries per dataset in training (`max_query_size: 10000` in `config_server.yaml`). For datasets that do not have training data, we used the development data to train. If neither training nor development data exists, we applied 3-fold cross-validation. That is, we randomly split the test data into three folds, used two folds to train an xgboost model, and tested on the third fold. We repeated this process three times so that the whole test dataset could be evaluated.
+
+We fixed the xgboost model training with the following settings. Specifically, we used the ndcg metric as the model update objective, a moderate learning rate (`eta`) of 0.1, a regularization parameter (`gamma`) of 1.0, a `min_child_weight` of 0.1, a maximum tree depth of 6, and ndcg@10 as the evaluation metric. We used a fixed number (100) of boosting iterations (`num_boost_round`), making no attempt to optimize the training per dataset.
```python
params = {
@@ -44,10 +46,10 @@ params = {
}
```
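+
+As a minimal sketch (assuming `dtrain` is an `xgboost.DMatrix` whose per-query groups were set with `set_group`, as prepared in train_and_test.py), the boosting call with these params looks like:
+
+```python
+import xgboost as xgb
+
+# Sketch only: train for a fixed 100 rounds with the ranking params above.
+bst = xgb.train(params, dtrain, num_boost_round=100)
+bst.save_model("xgb_es+vs+rr_n.json")
+```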
-The source code for the experiment can be found at [train_and_test.py](https://github.com/denser-org/denser-retriever/blob/main/experiments/train_and_test.py). We ran the following command to train 8 xgboost models (ES+VS, ES+RR, VS+RR, ES+VS+RR, ES+VS_n, ES+RR_n, VS+RR_n, and ES+VS+RR_n) using MSMARCO training data. The definitions of these 8 models can be found at [training](./training). The parameters are dataset_name, train split, and test split respectively.
+The source code for the experiment can be found at [train_and_test.py](https://github.com/denser-org/denser-retriever/blob/main/experiments/train_and_test.py). We ran the following command to train 8 xgboost models (ES+VS, ES+RR, VS+RR, ES+VS+RR, ES+VS_n, ES+RR_n, VS+RR_n, and ES+VS+RR_n) using MSMARCO training data. The definitions of these 8 models can be found at [training](./training). The parameters are the config file, dataset name, train split, and test split, respectively. We need to configure the hosts, users and passwords for Elasticsearch and Milvus in the config file experiments/[config_server.yaml](https://github.com/denser-org/denser-retriever/blob/main/experiments/config_server.yaml).
```shell
-poetry run python experiments/train_and_test.py mteb/msmarco train test
+poetry run python experiments/train_and_test.py experiments/config_server.yaml mteb/msmarco train test
```
After the training, we can find the models at `/home/ubuntu/denser_output_retriever/exp_msmarco/models/xgb_*`. We note that the prefix `/home/ubuntu/denser_output_retriever/` is defined in the [config_server.yaml](https://github.com/denser-org/denser-retriever/blob/main/experiments/config_server.yaml) file
@@ -60,7 +62,7 @@ In addition to training, this experiment also evaluated the 8 trained models on
## Test xgboost models
-Another use case is to evaluate a trained model on 26 MTEB datasets. We first specify the model in [config_server.yaml](https://github.com/denser-org/denser-retriever/blob/main/experiments/config_server.yaml) file.
+To evaluate a trained model on 26 MTEB datasets, we need to specify the model in the [config_server.yaml](https://github.com/denser-org/denser-retriever/blob/main/experiments/config_server.yaml) file.
```yaml
model: PATH_TO_YOUR_TRAINED_MODEL
@@ -69,7 +71,7 @@ model: PATH_TO_YOUR_TRAINED_MODEL
We can then evaluate the MTEB dataset (MSMARCO as an example) by running:
```shell
-poetry run python experiments/test.py mteb/msmarco
+poetry run python experiments/test.py
```
We will get the ndcg@10 score after the evaluation.
@@ -109,11 +111,11 @@ import MtebChart from "@/components/mteb-chart"
}))} />
-Here are the observations from the experiment results.
+The MTEB experiment results are summarized as follows.
-Vector search with [snowflake-arctic-embed-m](https://github.com/Snowflake-Labs/arctic-embed?tab=readme-ov-file) model can significantly boost the elasticsearch NDCG@10 baseline from 38.32 to 54.24. The combination of elasticsearch, vector search and a reranker via xgboost models can further improve the vector search baseline. For instance, the ES+VS+RR_n model achieves the highest NDCG@10 score of 56.47, surpassing the vector search baseline by an absolute increase of 2.23 and a relative improvement of 4.11%.
+Vector search with the [snowflake-arctic-embed-m](https://github.com/Snowflake-Labs/arctic-embed?tab=readme-ov-file) model can significantly boost the Elasticsearch NDCG@10 baseline from 38.32 to 54.24. The combination of Elasticsearch, vector search and a reranker via xgboost models can further improve the vector search baseline. For instance, the ES+VS+RR_n model achieves the highest NDCG@10 score of 56.47, surpassing the vector search baseline (NDCG@10 of 54.24) by an absolute increase of 2.23 and a relative improvement of 4.11%.
-For datasets which have training data (FEVER, FiQA2018, HotpotQA, NFCorpus, and SciFact), the combinations of elasticsearch, vector search and reranker via xgboost models are more beneficial, which can be witnessed by the following table.
+For datasets that have training data (FEVER, FiQA2018, HotpotQA, NFCorpus, and SciFact), combining Elasticsearch, vector search and a reranker via xgboost models is even more beneficial, as the following table shows.
| Name | VS | ES+VS+RR_n | Delta | Delta% |
| -------- | ----- | ---------- | ----- | ------ |
@@ -125,4 +127,4 @@ For datasets which have training data (FEVER, FiQA2018, HotpotQA, NFCorpus, and
| SciFact | 73.16 | 75.33 | 2.17 | 2.96 |
| Average | 59.41 | 62.05 | 2.63 | 4.68 |
-The ES+VS+RR_n model improves the vector search NDCG@10 baseline by 2.63 absolute and 4.68% relative gains on these five datasets. It is worth noting that, on the widely used benchmark dataset MSMARCO, the ES+VS+RR_n leads significant relative NDCG@10 gian of 13.07% when compared to vector search baseline.
+The ES+VS+RR_n model (NDCG@10 of 62.05) improves the vector search baseline (NDCG@10 of 59.41) by 2.63 absolute points and a 4.68% relative gain on these five datasets. It is worth noting that, on the widely used benchmark dataset MSMARCO, ES+VS+RR_n yields a significant relative NDCG@10 gain of 13.07% over the vector search baseline.
diff --git a/www/content/docs/experiments/training.mdx b/www/content/docs/experiments/training.mdx
index 86a9c91..3a31670 100644
--- a/www/content/docs/experiments/training.mdx
+++ b/www/content/docs/experiments/training.mdx
@@ -6,50 +6,52 @@ title: Training
We run this experiment on a **server**, which requires ES and Milvus installations specified [here](/docs/install/install-server).
-In training use case, users provide a training dataset to train a [xgboost](https://github.com/dmlc/xgboost) model which governs how to combine elasticsearch, vector search and reranking. A training set consists of three components:
+In the training use case, users provide a training dataset to train an [xgboost](https://github.com/dmlc/xgboost) model which governs how to combine keyword search, vector search and reranking. A training set consists of three components (an illustrative qrels entry follows the list):
1. A query set
2. A passage corpus
3. A qrels file which annotates the relevance of passages for queries
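+
+For illustration, a qrels entry ties a query id to a passage id with a relevance label. The sketch below assumes the common BEIR/TREC tab-separated layout; the ids match the scifact example discussed later in this doc:
+
+```
+query-id	corpus-id	score
+3	14717500	1
+```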
-We use the [mteb/scifact](https://huggingface.co/datasets/mteb/scifact) dataset to illustrate the use case. The source code for this use case can be found at [train_and_test.py](https://github.com/denser-org/denser-retriever/blob/main/experiments/train_and_test.py). We use the following command to run this use case
+The source code for this use case can be found at [train_and_test.py](https://github.com/denser-org/denser-retriever/blob/main/experiments/train_and_test.py). The training command usage is:
+
+```bash
+python train_and_test.py [config_file] [dataset_name] [train] [test]
+```
+We use the [mteb/scifact](https://huggingface.co/datasets/mteb/scifact) dataset to illustrate the use case:
```bash
-poetry run python experiments/train_and_test.py experiments/config_server.yaml mteb/scifact
+poetry run python experiments/train_and_test.py experiments/config_server.yaml mteb/scifact train test
```
-
-The training use case consists of the following steps:
-
-```python
- dataset_name = "mteb/scifact"
- train_on = "train"
- eval_on = "test"
-
- experiment = Experiment(dataset_name)
-
- # Generate retriever data, this takes time
- experiment.generate_retriever_data(train_on, eval_on)
- experiment.compute_baselines(eval_on)
- experiment.train(train_on, eval_on)
- experiment.test(eval_on)
+That is, we use experiments/[config_server.yaml](https://github.com/denser-org/denser-retriever/blob/main/experiments/config_server.yaml) as the config file (we need to configure the hosts, users and passwords for Elasticsearch and Milvus), use the `mteb/scifact` dataset, use the `train` split to train an xgboost model, and test the trained model on the `test` split. If successful, we would get results similar to the following.
+```bash
+INFO:__main__:train: train, eval: test, cross-validation: False
+metric_keyword.json: "NDCG@10": 0.58425,
+metric_vector.json: "NDCG@10": 0.73167,
+metric_es+vs.json: "NDCG@10": 0.73288,
+metric_es+rr.json: "NDCG@10": 0.69086,
+metric_vs+rr.json: "NDCG@10": 0.72731,
+metric_es+vs+rr.json: "NDCG@10": 0.73081,
+metric_es+vs_n.json: "NDCG@10": 0.75084,
+metric_es+rr_n.json: "NDCG@10": 0.69692,
+metric_vs+rr_n.json: "NDCG@10": 0.73625,
+metric_es+vs+rr_n.json: "NDCG@10": 0.75335
```
-
-We specify the training dataset `mteb/scifact`. Specifically we use the `train` split to train a xgboost model to combine elasticsearch, vector search and reranking. We test the trained model on `test` split data and report the ndcg@10 score.
+We explain the experiments in the following steps.
## Generate retriever data
-`Generate_retriever_data` is used to generate the featurized query passage data to train xgboost models. The following table shows the mteb/scifact statistics. The dataset has 5,183 passages, 920 and 300 training and test queries. For each query, each passage in the corpus receives a relevance label, with 0 and 1 indicate irrelevant and relevant respectively.
+We need to generate featurized query passage data to train xgboost models. The following table shows the mteb/scifact statistics. The dataset has 5,183 passages, with 809 training queries and 300 test queries. For each query, each passage in the corpus receives a relevance label, with 0 and 1 indicating irrelevant and relevant respectively.
| #Corpus | #Train Query | #Test Query|
|---|---|---|
-|5,183 | 920| 300|
+|5,183 | 809| 300|
-We first build an elasticsearch index, a vector index and a reranker using scifact passages. The elasticsearch, vector search and reranker settings are configured in [config_server.yaml](https://github.com/denser-org/denser-retriever/blob/main/experiments/config_server.yaml).
+We first build an Elasticsearch index, a vector index and a reranker using scifact passages. The Elasticsearch, vector search and reranker settings are configured in [config_server.yaml](https://github.com/denser-org/denser-retriever/blob/main/experiments/config_server.yaml).
- For vector search, we use [snowflake-arctic-embed-m](https://github.com/Snowflake-Labs/arctic-embed?tab=readme-ov-file) model, which achieves state-of-the-art performance on the MTEB/BEIR [leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for each of their size variants.
- For ML reranker, we use [cross-encoder/ms-marco-MiniLM-L-6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2), which has a good balance between accuracy and inference latency.
-The Denser retriever is illustrated in the following diagram, with the top and bottom boxes describing the training and inference respectively. For each query in the training data, we query elasticsearch and vector database to retrieve two sets of topk (100) passages respectively. We note that these two sets may overlap. We then apply a ML reranker to rerank the passages returned from elasticsearch and vector search.
+The Denser retriever is illustrated in the following diagram, with the top and bottom boxes describing training and inference respectively. For each query in the training data, we query Elasticsearch and the vector database to retrieve two sets of topk (100) passages. We note that these two sets may overlap. We then apply an ML reranker to rerank the passages returned from Elasticsearch and vector search.
import RagFlowData from './rag-flow-data.png'
@@ -68,42 +70,40 @@ Let's consider a query and two passages in the following. The first passage is a
{"source": "4414547", "title": "Mosaic PPM1D mutations are associated with predisposition to breast and ovarian cancer", "text": "Improved sequencing technologies offer unprecedented opportunities for investigating the role of rare genetic variation in common disease. However, there are considerable challenges with respect to study design, data analysis and replication. Using pooled next-generation sequencing of 507 genes implicated in the repair of DNA in 1,150 samples, an analytical strategy focused on protein-truncating variants (PTVs) and a large-scale sequencing case–control replication experiment in 13,642 individuals, here we show that rare PTVs in the p53-inducible protein phosphatase PPM1D are associated with predisposition to breast cancer and ovarian cancer. PPM1D PTV mutations were present in 25 out of 7,781 cases versus 1 out of 5,861 controls (P = 1.12 × 10−5), including 18 mutations in 6,912 individuals with breast cancer (P = 2.42 × 10−4) and 12 mutations in 1,121 individuals with ovarian cancer (P = 3.10 × 10−9). Notably, all of the identified PPM1D PTVs were mosaic in lymphocyte DNA and clustered within a 370-base-pair region in the final exon of the gene, carboxy-terminal to the phosphatase catalytic domain. Functional studies demonstrate that the mutations result in enhanced suppression of p53 in response to ionizing radiation exposure, suggesting that the mutant alleles encode hyperactive PPM1D isoforms. Thus, although the mutations cause premature protein truncation, they do not result in the simple loss-of-function effect typically associated with this class of variant, but instead probably have a gain-of-function effect. Our results have implications for the detection and management of breast and ovarian cancer risk. More generally, these data provide new insights into the role of rare and of mosaic genetic variants in common conditions, and the use of sequencing in their identification.", "pid": -1}
```
- For elasticsearch (ES), vector search (VS) and Reranker (RR), we generate three features: `rank`, `score` and `missing` on a query passage pair. We list the featurized query and passage pairs in the following table.
+ For Elasticsearch (ES), vector search (VS) and Reranker (RR), we generate three features (`rank`, `score` and `missing`) for each query passage pair. We list the featurized query and passage pairs in the following table.
|QID |PID | Label | ES Rank | ES Score | ES Missing | VS Rank | VS Score | VS Missing| RR Rank | RR Score | RR Missing
| ---- |--- | ------ | ----------- | ---- | ---- | ----------- | ---- | ---- | ----------- | ---- |---- |
| 3 |14717500 |1 | 3 | 74.42| 0| 5| -1.29| 0| 1| 2.98| 0 |
| 3 |4414547 |0 | 29 | 32.08| 0| 4| -1.28| 0| 2| 1.47| 0 |
- The first data point represents the query and passage `14717500`. The passage is annotated with label `1` (relevant) with respect to the query. The passage receives rank position of `3` and relevance score of `74.42` in elasticsearch retriever. It is ranked in the top 100 passages from elasticsearch and thus is not missing (ES Missing value of `0`). Similarly the passage receives rank position of `5` and score `-1.29` from vector search. We note both elasticsearch and vector search top 100 passages are reranked by a reranker, so the reranker missing feature is always `0`.
+ The first data point represents the query and passage `14717500`. The passage is annotated with label `1` (relevant) with respect to the query. The passage receives rank position `3` and relevance score `74.42` from the Elasticsearch retriever. It is ranked in the top 100 passages from Elasticsearch and thus is not missing (ES Missing value of `0`). Similarly, the passage receives rank position `5` and score `-1.29` from vector search. We note that both the Elasticsearch and vector search top 100 passages are reranked by a reranker, so the reranker missing feature is always `0`.
-We now have a featurized query passage training (138,322) and test (51,601) data from scifact dataset.
+We now have featurized query passage training (138,322 pairs) and test (51,601 pairs) data from the scifact dataset.
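+
+To make the featurization concrete, the first row of the table can be written out as a feature dictionary (key names here are illustrative; the training code stores these as numeric vectors):
+
+```python
+# Featurized (query, passage) pair for QID 3 / PID 14717500 from the table.
+features = {
+    "es_rank": 3, "es_score": 74.42, "es_missing": 0,
+    "vs_rank": 5, "vs_score": -1.29, "vs_missing": 0,
+    "rr_rank": 1, "rr_score": 2.98, "rr_missing": 0,
+}
+label = 1  # 1 = relevant, 0 = irrelevant
+```
+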
## Compute baselines
-`Compute_baselines` is used to compute the ndcg@10 score for elasticsearch and vector search baselines. For elasticsearch, the topk passages per query are sorted in the descending order of elasticsearch scores. The sorted passages can be used to compute the elasticsearch ndcg@10 score. Vector search ndcg@10 can be computed similarly with the vector scores of passages. We list the baseline ndcg@10 scores in the following table.
+For Elasticsearch, the topk passages per query are sorted in descending order of Elasticsearch scores to compute its ndcg@10 score. Vector search ndcg@10 can be computed similarly with the vector scores. We list the baseline ndcg@10 scores in the following table.
| | Elasticsearch | Vector Search|
|---|---|---|
|ndcg@10 | 58.42| 73.16 |
-We note that Vector search leads to higher accuracy than elasticsearch (73.16 vs 58.42), which suggests that vector search can capture semantic similarity better compared to keyword search.
+We note that vector search leads to higher accuracy than Elasticsearch (73.16 vs 58.42), which suggests that vector search captures semantic similarity better than keyword search.
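+
+For reference, ndcg@10 with binary labels can be computed as follows (a minimal sketch, not the evaluation code used in the experiments):
+
+```python
+import math
+
+# ranked: 0/1 relevance labels of the retrieved passages, in the
+# retriever's score order. NDCG@10 divides the DCG of this ordering
+# by the DCG of the ideal (label-sorted) ordering.
+def ndcg_at_10(ranked):
+    def dcg(labels):
+        return sum(l / math.log2(i + 2) for i, l in enumerate(labels[:10]))
+    ideal = dcg(sorted(ranked, reverse=True))
+    return dcg(ranked) / ideal if ideal > 0 else 0.0
+
+print(ndcg_at_10([1, 0, 1, 0, 0]))  # ~0.92
+```
+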
## Train xgboost models
-`Train` method is used to train xgboost models on scifact training data.
+There are six ways of combining Elasticsearch (ES), vector search (VS) and reranking (RR) to build a Denser retriever: ES, VS, ES+VS, ES+RR, VS+RR, or ES+VS+RR. Out of these six combinations, four (ES+VS, ES+RR, VS+RR, and ES+VS+RR) require xgboost models to combine different retrieval scores.
-There are six ways of combining elasticsearch (ES), vector search (VS) and reranking (RR) to build a Denser retriever: ES, VS, ES+VS, ES+RR, VS+RR, or ES+VS+RR. Either ES or VS can merely use their relevance score to generate descending sorted passages, and therefore they do not require a xgboost model.
-
-We train one xgboost model for the remaining of four configurations: ES+VS, ES+RR, VS+RR, and ES+VS+RR. In addition, we introduce feature `normalization` to the raw elasticsearch, vector search and reranker scores. We therefore add four more configurations: ES+VS_n, ES+RR_n, VS+RR_n, and ES+VS+RR_n. We introduce two feature normalizations:
+We train one xgboost model for each of these four configurations: ES+VS, ES+RR, VS+RR, and ES+VS+RR. In addition, we introduce feature `normalization` to the raw Elasticsearch, vector search and reranker scores. We therefore add four more configurations: ES+VS_n, ES+RR_n, VS+RR_n, and ES+VS+RR_n. We introduce two feature normalizations (sketched after the list):
+- **Norm1: Standardization** normalizes the feature values to have a zero mean and unit variance.
+- **Norm2: Min-max** normalizes the feature values based on their min and max ranges.
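+
+A minimal sketch of these two normalizations (illustrative helpers assuming numpy arrays, not the repo's exact implementation):
+
+```python
+import numpy as np
+
+def standardize(x):  # Norm1: zero mean, unit variance
+    return (x - x.mean()) / x.std()
+
+def min_max(x):  # Norm2: scale into [0, 1] by the min/max range
+    return (x - x.min()) / (x.max() - x.min())
+
+scores = np.array([74.42, 32.08, 12.50])
+print(standardize(scores), min_max(scores))
+```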
-We end up with training 8 xgboost models for the scifact dataset. The xgboost model training code can be found at [train_and_test.py](https://github.com/denser-org/denser-retriever/tree/main/experiments/train_and_test.py), which is based on the xgboost [rank](https://github.com/dmlc/xgboost/tree/master/demo/rank) code.
+We end up training 8 xgboost models for the scifact dataset. These 8 models, along with the ES and VS baselines, are illustrated in the following table.
-|ID | elasticsearch | vector search | reranker | normalization |
+|ID | Elasticsearch | Vector search | Reranker | Normalization |
|---|---|---|---|---|
|ES | ✅ | ❌ | ❌ | ❌ |
|VS | ❌ |✅ | ❌ | ❌ |
@@ -116,15 +116,17 @@ We end up with training 8 xgboost models for the scifact dataset. The xgboost mo
|VS+RR_n | ❌ |✅ | ✅ | ✅ |
|ES+VS+RR_n | ✅ |✅ | ✅ | ✅ |
+The xgboost model training code can be found at [train_and_test.py](https://github.com/denser-org/denser-retriever/tree/main/experiments/train_and_test.py), which is adapted from the xgboost [rank](https://github.com/dmlc/xgboost/tree/master/demo/rank) code.
+
## Test xgboost models
-`Test` method is used to test xgboost models on scifact test data and report ndcg@10 scores. We list all 8 models accuracy in the following table. Ref is the reference ndcg@10 of `snowflake-arctic-embed-m` from Huggingface [leaderboard](https://huggingface.co/spaces/mteb/leaderboard), which is consistent with our reported VS accuracy.
+Once the xgboost models are trained, we can test them on the scifact test data and report ndcg@10 scores. We list the accuracy of all 8 models in the following table. Ref is the reference ndcg@10 of `snowflake-arctic-embed-m` from the Huggingface [leaderboard](https://huggingface.co/spaces/mteb/leaderboard), which is consistent with our reported VS accuracy.
| | ES | VS | ES+VS/ES+VS_n | ES+RR/ES+RR_n | VS+RR/VS+RR_n | ES+VS+RR/ES+VS+RR_n |ref |
|---|---|---|---|---|---|---|---|
|ndcg@10| 58.42 | 73.16 | 73.28/75.08 | 69.08/69.69 | 72.73/73.62 | 73.08/75.33 | 73.55 |
-The experiments show that the combination of ES, VS and RR lead to higher accuracy. For example, the ES+VS+RR_n leads to the ndcg@10 score of 75.33, resulting 1.78 ndcg@10 increase compared to vector search baseline.
+The experiments show that combining ES, VS and RR leads to higher accuracy. For example, ES+VS+RR_n reaches an ndcg@10 score of 75.33, a 2.17 ndcg@10 increase over the vector search baseline (ndcg@10 of 73.16).
We also support the linear combination of ES, VS, and RR, for example, with the following setting.
@@ -136,7 +138,7 @@ vector_weight: 0.5
rerank_weight: 0.5
```
-However, we find that the linear combination performs worse than XGBoost models, achieving an NDCG@10 score of only 62.73 with the given weights. The reasons for the low accuracy are:
+However, we find that the linear combination performs worse than XGBoost models, achieving an NDCG@10 score of only 62.73 with equal weights of 0.5, 0.5, and 0.5. The reasons for the low accuracy are listed below (a sketch of the linear scoring follows the list):
- The scores from ES, VS, and RR are neither bounded nor calibrated, making it difficult for the linear weights to accurately model their relative importance.
- Some query-passage pairs may have missing score features.
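+
+The linear score itself is just a weighted sum of the raw retriever scores, mirroring the yaml weights above (a sketch, not the repo's implementation):
+
+```python
+def linear_combine(es_score, vs_score, rr_score,
+                   keyword_weight=0.5, vector_weight=0.5, rerank_weight=0.5):
+    # Because the raw ES/VS/RR scores are unbounded and uncalibrated,
+    # fixed weights model their relative importance poorly.
+    return (keyword_weight * es_score
+            + vector_weight * vs_score
+            + rerank_weight * rr_score)
+```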
@@ -145,6 +147,6 @@ On the contrary, the xgboost models can effectively estimate the feature importa
Xgboost model has a nice feature to estimate the feature importance. We plot the feature importance in the following picture. It shows that the normalized vector search score (VS Norm2) is the most important feature to predict if a passage is relevant or not. The normalized reranker feature (RR Norm1) is the second most important feature.
-![](./feature_importance.png)
+![Feature importance](./feature_importance.png)
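+
+To reproduce such a plot, xgboost's built-in importance plotting can be used (a sketch; replace the path with your own trained model):
+
+```python
+import xgboost as xgb
+import matplotlib.pyplot as plt
+
+bst = xgb.Booster()
+bst.load_model("PATH_TO_YOUR_TRAINED_MODEL")
+xgb.plot_importance(bst)  # bar chart of feature importance scores
+plt.show()
+```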
diff --git a/www/content/docs/index.mdx b/www/content/docs/index.mdx
index 9a1daeb..576420a 100644
--- a/www/content/docs/index.mdx
+++ b/www/content/docs/index.mdx
@@ -14,7 +14,7 @@ In the world of AI, a "retriever" is a tool used to sift through vast amounts of
## How Does Denser Retriever Work?
-Denser Retriever integrates multiple search technologies into a single platform. It utilizes **gradient boosting (xgboost)** machine learning technique to combines:
+Denser Retriever integrates multiple search technologies into a single platform. It utilizes the **gradient boosting (xgboost)** machine learning technique to combine:
- **Keyword-based searches** that focus on fetching precisely what the query mentions.
- **Vector databases** that are great for finding a wide range of potentially relevant answers.
@@ -65,8 +65,7 @@ The initial release of Denser Retriever provides the following features.
- Supporting heterogeneous retrievers such as `keyword search`, `vector search`, and `ML model reranking`
- Leveraging **xgboost** ML technique to effectively combine heterogeneous retrievers
- **State-of-the-art accuracy** on [MTEB](https://github.com/embeddings-benchmark/mteb) Retrieval benchmarking
-- Providing an **out-of-the-box retriever** which significantly outperforms the current best vector search model in similar model size
-- Demonstrating how to use Denser retriever to power an `end-to-end applications` such as chatbot and semantic search
+- Demonstrating how to use Denser retriever to power `end-to-end applications` such as chatbots and semantic search
## Why Denser Retriever?
diff --git a/www/content/docs/install/install-local.mdx b/www/content/docs/install/install-local.mdx
index fbd65fa..79b654b 100644
--- a/www/content/docs/install/install-local.mdx
+++ b/www/content/docs/install/install-local.mdx
@@ -7,7 +7,7 @@ description: Setup the Denser Retriever Requirements on a local host.
This setup is not suitable for production use as the data is not persistent and environment variables are not kept secret.
-Elasticsearch and Milvus are required to run the Denser Retriever. They support the keyword search and vector search respectively. We follow the following instructions to install elasticsearch and Milvus on a local host (for example, your laptop).
+Elasticsearch and Milvus are required to run the Denser Retriever. They support keyword search and vector search respectively. Follow the instructions below to install Elasticsearch and Milvus on a local host (for example, your laptop).
## Prerequisites
diff --git a/www/content/docs/install/install-server.mdx b/www/content/docs/install/install-server.mdx
index 2036ed6..298af16 100644
--- a/www/content/docs/install/install-server.mdx
+++ b/www/content/docs/install/install-server.mdx
@@ -3,11 +3,11 @@ title: Self-host
description: Setup the Denser Retriever Requirements on a self-hosted server.
---
-If we plan to host Elasticsearch and Milvus services for production usage, we need a server (for example, an AWS instance) to provide reliable and scalable services. In this section, we list the following instructions to install elasticsearch and Milvus on a server.
+If we plan to host Elasticsearch and Milvus services for production usage, we need a server (for example, an AWS instance) to provide reliable and scalable services. In this section, we list the instructions to install Elasticsearch and Milvus on a server.
## Install Keyword Search
-We use Elasticsearch under the hood as keyword search implementation due to its high performance and robustness. We follow the [Elasticsearch install guide](https://www.elastic.co/guide/en/elasticsearch/reference/current/targz.html) to install Elasticsearch version 8.13 on an AWS EC2 instance (`t2.medium` with 2 vCPU and 4 GiB Memory). Users may refer to the official elasticsearch doc for greater details. To be self-contained, we list the installation commands required in this doc.
+We use Elasticsearch under the hood as the keyword search implementation due to its high performance and robustness. We follow the [Elasticsearch install guide](https://www.elastic.co/guide/en/elasticsearch/reference/current/targz.html) to install Elasticsearch version 8.13 on an AWS EC2 instance (for example, a `t2.medium` with 2 vCPU and 4 GiB Memory). Users may refer to the official Elasticsearch doc for further details. To be self-contained, we list the required installation commands in this doc.
### Download and install archive for Linux
@@ -28,12 +28,12 @@ We run the following command to start Elasticsearch:
When starting Elasticsearch for the first time, security features are enabled and configured by default. Specifically,
-- Authentication and authorization are enabled, and a password is generated for the elastic built-in superuser.
+- Authentication and authorization are enabled, and a password is generated for the `elastic` built-in superuser.
- Certificates and keys for TLS are generated for the transport and HTTP layer, and TLS is enabled and configured with these keys and certificates.
-The password for the `elastic` user is output to your terminal. Take a note of the this password `es_passwd` which will be used to connect elasticsearch server.
+The password for the `elastic` user is output to your terminal. Take note of this password (`es_passwd`), which will be used to connect to the Elasticsearch server.
-The default setting requires both elastic password and TLS to access elasticsearch service. To make it simple, you can disable TLS by changing the `enabled: true` to `enabled: false` at `config/elasticsearch.yml`. After the change, the ssl config block looks like the following:
+The default setting requires both the elastic password and TLS to access the Elasticsearch service. To keep things simple, you can disable TLS by changing `enabled: true` to `enabled: false` in `config/elasticsearch.yml`. After the change, the ssl config block looks like the following:
```bash
xpack.security.http.ssl:
@@ -86,12 +86,12 @@ pkill -F pid
## Install Vector Database
-We use Milvus under the hood as vector database implementation due to its high performance and robustness. We follow the Milvus instructions at [here](https://milvus.io/docs/install_standalone-docker-compose.md) to install Milvus **Standalone** Service on a `t2.medium` instance (the same one as we installed the keyword search service).
+We use Milvus under the hood as the vector database implementation due to its high performance and robustness. We follow the Milvus instructions [here](https://milvus.io/docs/install_standalone-docker-compose.md) to install the Milvus **Standalone** Service on a `t2.medium` instance (the same one on which we installed the Elasticsearch service).
To be self-contained, we list the installation commands required in this doc.
### Configure Milvus
-We include `milvus` directory in `denser-retriever` repo to support Milvus installation. Open the `milvus/standalone/docker-compose.yml` file and find the following block:
+We include a `milvus` [directory](https://github.com/denser-org/denser-retriever/tree/main/docker/milvus) in the `denser-retriever` repo to support Milvus installation. Open the `milvus/standalone/docker-compose.yml` file and find the following block:
```shell
volumes:
- /home/ubuntu/denser-retriever/docker/milvus/standalone/milvus.yaml:/milvus/configs/milvus.yaml # Map the local path to the container path
diff --git a/www/content/docs/misc/filters.mdx b/www/content/docs/misc/filters.mdx
index 485cfab..b055d34 100644
--- a/www/content/docs/misc/filters.mdx
+++ b/www/content/docs/misc/filters.mdx
@@ -35,33 +35,31 @@ field_name:field_name_internal:type
```yaml
version: "0.1"
+# linear or rank
+combine: linear
keyword_weight: 0.5
vector_weight: 0.5
rerank_weight: 0.5
-# linear or rank
-combine: linear
keyword:
es_user: elastic
- es_passwd: WzAkbzjZj9AfNXxzmOmp
- es_host: http://54.68.68.29:9200
+ es_passwd: YOUR_ES_PASSWORD
+ es_host: http://localhost:9200
es_ingest_passage_bs: 5000
topk: 5
vector:
- milvus_host: 54.68.68.29
+ milvus_host: localhost
milvus_port: 19530
milvus_user: root
milvus_passwd: Milvus
emb_model: sentence-transformers/all-MiniLM-L6-v2
emb_dims: 384
+ one_model: true
vector_ingest_passage_bs: 1000
topk: 5
rerank:
- # https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2
- # cross-encoder/ms-marco-MiniLM-L-2-v2, cross-encoder/ms-marco-MiniLM-L-4-v2, or cross-encoder/ms-marco-MiniLM-L-6-v2
- # rerank_model: cross-encoder/ms-marco-TinyBERT-L-2-v2
rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2
rerank_bs: 100
topk: 5
@@ -78,7 +76,6 @@ fields:
output_prefix: denser_output_retriever/
-## temp parameters
max_doc_size: 0
max_query_size: 0
```
@@ -89,7 +86,7 @@ max_query_size: 0
### Build and query a Denser retriever
-Once we have the config file, the code to ingest and query a retriever is similar to [previous example](../experiments/index_and_query). We run the following python code to build a retriever index and then query.
+Once we have the config file, we run the following python code to build a retriever index and then query it.
```python
from denser_retriever.retriever_general import RetrieverGeneral
diff --git a/www/content/docs/quick-start.mdx b/www/content/docs/quick-start.mdx
index 698bd97..f9a5025 100644
--- a/www/content/docs/quick-start.mdx
+++ b/www/content/docs/quick-start.mdx
@@ -4,13 +4,13 @@ title: Quick Start
## Overview
-The following diagram illustrates a denser retriever, which consists of three components:
+The following diagram illustrates a Denser Retriever, which consists of three components:
import DenserRetriever from './experiments/denser-retriever.png'
-- **Keyword search** relies on traditional search techniques that use exact keyword matching. We use [elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/targz.html) in denser retriever.
+- **Keyword search** relies on traditional search techniques that use exact keyword matching. We use [Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/targz.html) in Denser Retriever.
- **Vector search** uses neural network models to encode both the query and the documents into dense vector representations in a high-dimensional space. We use [Milvus](https://milvus.io/docs/install_standalone-docker.md) and [snowflake-arctic-embed-m](https://github.com/Snowflake-Labs/arctic-embed?tab=readme-ov-file) model, which achieves state-of-the-art performance on the MTEB/BEIR [leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for each of their size variants.
- **A ML cross-encoder re-ranker** can be utilized to further boost accuracy over these two retriever approaches above. We use [cross-encoder/ms-marco-MiniLM-L-6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2), which has a good balance between accuracy and inference latency.
@@ -18,16 +18,13 @@ import DenserRetriever from './experiments/denser-retriever.png'
To get started, we need
-- Install `denser-retriever` python package, see [here](./install/install-package)
-- Install `Elasticsearch` and `Milvus`
- - Either on a local machine (for example, my laptop), see [here](./install/install-local)
- - Or on a server (for example, an AWS instance), see [here](./install/install-server)
+- Install the `denser-retriever` python package, see [here](./install/installation)
+- Install `Elasticsearch` and `Milvus`, either on a local machine (for example, your laptop), see [here](./install/install-local), or on a server (for example, an AWS instance), see [here](./install/install-server)
## Experiments
-- Try out Denser Retriever which was built on Wikipedia dataset at [here](https://denser.ai)
- [**Build an index and query:**](./experiments/index_and_query) users provide a collection of documents such as text files or webpages to build a retriever. Users can then ask questions to obtain relevant results from the provided documents.
-- [**Training:**](./experiments/training) Users provide a training dataset to train a xgboost model which governs on how to combine elasticsearch, vector search and reranking. Users can then use such a model to effectively combine elasticsearch, vector search and reranker to get optimal results.
-- [**MTEB Experiments**](./experiments/mteb_retrieval]) User want to replicate the MTEB retrieval experiments.
+- [**Training:**](./experiments/training) Users provide a training dataset to train an xgboost model which governs how to combine keyword search, vector search and reranking. Users can then use such a model to effectively combine keyword search, vector search and a reranker to get optimal results.
+- [**MTEB Experiments:**](./experiments/mteb_retrieval) Users can replicate the MTEB retrieval experiments.
## Examples