
Commit

Clean up before release
zhiheng-huang committed May 28, 2024
1 parent 6786869 commit fa11456
Showing 21 changed files with 145 additions and 181 deletions.
11 changes: 6 additions & 5 deletions README.md
@@ -21,15 +21,16 @@ An enterprise-grade AI retriever designed to streamline AI integration into your
## 📝 Description

Denser Retriever combines multiple search technologies into a single platform. It utilizes **gradient boosting (
xgboost)** machine learning technique to combines:
xgboost)** machine learning technique to combine:

- **Keyword-based searches** that focus on fetching precisely what the query mentions.
- **Vector databases** that are great for finding a wide range of potentially relevant answers.
- **Machine Learning rerankers** that fine-tune the results to ensure the most relevant answers top the list.

We show that the combinations can significantly improve the retrieval accuracy on MTEB benchmarks when compared to each individual retrievers.
Our experiments on MTEB datasets show that combining keyword search, vector search, and a reranker via an xgboost model (denoted as ES+VS+RR_n) significantly improves the vector search (VS) baseline.
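The combination described above can be illustrated with a minimal, dependency-free sketch. Everything here is hypothetical: the weights and the linear stand-in for a trained model are invented for illustration, whereas the real project learns the combination with an xgboost model over retriever scores.

```python
# Minimal sketch of combining heterogeneous retriever scores.
# In Denser Retriever an xgboost model learns this combination; here a
# hand-set linear function stands in so the example stays dependency-free.

def combine_scores(candidates, weights=(0.3, 0.3, 0.4)):
    """Rank passages by a weighted sum of (keyword, vector, rerank) scores.

    candidates: list of (passage_id, keyword_score, vector_score, rerank_score)
    """
    kw_w, vec_w, rr_w = weights
    scored = [
        (pid, kw_w * kw + vec_w * vec + rr_w * rr)
        for pid, kw, vec, rr in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

candidates = [
    ("doc1", 0.9, 0.2, 0.1),  # strong keyword match only
    ("doc2", 0.4, 0.8, 0.9),  # vector search and reranker agree
    ("doc3", 0.1, 0.3, 0.2),
]
ranking = combine_scores(candidates)
print([pid for pid, _ in ranking])  # ['doc2', 'doc1', 'doc3']
```

A passage that several retrievers agree on ("doc2") outranks one favored by keyword search alone, which is the intuition behind training a learned combiner instead of trusting any single retriever.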

![mteb_ndcg_plot](mteb_ndcg_plot.png)

![](./www/content/docs/experiments/rag-flow-data.png)

## 🚀 Features

@@ -38,8 +39,6 @@ The initial release of Denser Retriever provides the following features.
- Supporting heterogeneous retrievers such as **keyword search**, **vector search**, and **ML model reranking**
- Leveraging **xgboost** ML technique to effectively combine heterogeneous retrievers
- **State-of-the-art accuracy** on [MTEB](https://github.com/embeddings-benchmark/mteb) Retrieval benchmarking
- Providing an **out-of-the-box retriever** which significantly outperforms the current best vector search model in
similar model size
- Demonstrating how to use Denser Retriever to power **end-to-end applications** such as chatbots and semantic search

## 📦 Installation
@@ -48,6 +47,8 @@ We use [Poetry](https://python-poetry.org/docs/) to install and manage Denser Re
Retriever with the following command under repo root directory.

```bash
git clone https://github.com/denser-org/denser-retriever
cd denser-retriever
make install
```

4 changes: 1 addition & 3 deletions docker/milvus/standalone/hello_milvus.py
@@ -33,9 +33,7 @@
print(fmt.format("start connecting to Milvus"))
connections.connect(
"default",
# host="localhost",
host="54.68.68.29",
# host="44.237.177.8",
host="localhost",
port="19530",
user="root",
password="Milvus",
3 changes: 1 addition & 2 deletions docker/milvus/standalone/list_connections.py
@@ -6,8 +6,7 @@

connections.connect(
"default",
# host="localhost",
host="54.68.68.29",
host="localhost",
port="19530",
user="root",
password="Milvus",
4 changes: 0 additions & 4 deletions examples/denser_chat.py
@@ -60,10 +60,6 @@ def denser_chat():

st.session_state.messages.append({"role": "user", "content": prompt})

enc = tiktoken.encoding_for_model(default_openai_model)
prompt_length = len(enc.encode(prompt))
logger.info(f"prompt length:{prompt_length}")

with st.chat_message("assistant"):
message_placeholder = st.empty()
full_response = ""
7 changes: 0 additions & 7 deletions experiments/config_local.yaml
@@ -20,20 +20,13 @@ vector:
milvus_port: 19530
milvus_user: root
milvus_passwd: Milvus
# https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5
# sentence-transformers/all-MiniLM-L6-v2 (dim: 384), Alibaba-NLP/gte-large-en-v1.5 (dim 1024),
# Alibaba-NLP/gte-base-en-v1.5 (dim: 768)
# Snowflake/snowflake-arctic-embed-m
emb_model: Snowflake/snowflake-arctic-embed-m
emb_dims: 768
one_model: false
vector_ingest_passage_bs: 2000 # 1000
topk: 100

rerank:
# https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2
# cross-encoder/ms-marco-MiniLM-L-2-v2, cross-encoder/ms-marco-MiniLM-L-4-v2, or cross-encoder/ms-marco-MiniLM-L-6-v2
# rerank_model: cross-encoder/ms-marco-TinyBERT-L-2-v2
rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2
rerank_bs: 100
topk: 100
13 changes: 3 additions & 10 deletions experiments/config_server.yaml
@@ -10,30 +10,23 @@ model_features: es+vs+rr_n

keyword:
es_user: elastic
es_passwd: WzAkbzjZj9AfNXxzmOmp
es_host: http://54.68.68.29:9200
es_passwd: YOUR_ES_PASSWORD
es_host: http://00.00.00.00:9200
es_ingest_passage_bs: 5000
topk: 100

vector:
milvus_host: 54.68.68.29
milvus_host: 00.00.00.00
milvus_port: 19530
milvus_user: root
milvus_passwd: Milvus
# https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5
# sentence-transformers/all-MiniLM-L6-v2 (dim: 384), Alibaba-NLP/gte-large-en-v1.5 (dim 1024),
# Alibaba-NLP/gte-base-en-v1.5 (dim: 768)
# Snowflake/snowflake-arctic-embed-m
emb_model: Snowflake/snowflake-arctic-embed-m
emb_dims: 768
one_model: false
vector_ingest_passage_bs: 2000 # 1000
topk: 100

rerank:
# https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2
# cross-encoder/ms-marco-MiniLM-L-2-v2, cross-encoder/ms-marco-MiniLM-L-4-v2, or cross-encoder/ms-marco-MiniLM-L-6-v2
# rerank_model: cross-encoder/ms-marco-TinyBERT-L-2-v2
rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2
rerank_bs: 100
topk: 100
14 changes: 7 additions & 7 deletions experiments/train_and_test.py
@@ -35,7 +35,6 @@ def __init__(self, config_file, dataset_name):
# (100 passages per query) and reranker passages (maximum 200 passages per query)
def generate_retriever_data(self, train_on, eval_on):
generate_data(self.dataset_name, train_on, self.config_file, ingest=True)
# generate_data(self.dataset_name, train_on, self.config_file, ingest=False)
if eval_on != train_on:
generate_data(self.dataset_name, eval_on, self.config_file, ingest=False)

@@ -329,7 +328,7 @@ def report(self, eval_on):


if __name__ == "__main__":
config_file = "experiments/config_server.yaml"
# config_file = "experiments/config_server.yaml"

# dataset = ["mteb/arguana", "test", "test"]
# dataset = ["mteb/climate-fever", "test", "test"]
@@ -349,13 +348,14 @@ def report(self, eval_on):
# dataset_name, train_on, eval_on = dataset
# model_dir = "/home/ubuntu/denser_output_retriever/exp_msmarco/models/"

if len(sys.argv) != 4:
print("Usage: python train_and_test.py [dataset_name] [train] [test]")
if len(sys.argv) != 5:
print("Usage: python train_and_test.py [config_file] [dataset_name] [train] [test]")
sys.exit(0)

dataset_name = sys.argv[1]
train_on = sys.argv[2]
eval_on = sys.argv[3]
config_file = sys.argv[1]
dataset_name = sys.argv[2]
train_on = sys.argv[3]
eval_on = sys.argv[4]

experiment = Experiment(config_file, dataset_name)
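The positional-argument handling changed above can also be expressed with `argparse`. This is a sketch, not the project's actual code (train_and_test.py reads `sys.argv` directly); only the argument names mirror the usage string.

```python
# Sketch: the same four positional arguments parsed with argparse
# instead of indexing sys.argv directly.
import argparse

def parse_args(argv):
    parser = argparse.ArgumentParser(
        description="Train and test a retriever (hypothetical CLI sketch)."
    )
    parser.add_argument("config_file")
    parser.add_argument("dataset_name")
    parser.add_argument("train")
    parser.add_argument("test")
    return parser.parse_args(argv)

args = parse_args(
    ["experiments/config_local.yaml", "mteb/arguana", "test", "test"]
)
print(args.dataset_name)  # mteb/arguana
```

With `argparse`, a wrong argument count produces a usage message automatically, replacing the manual `len(sys.argv)` check.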

Binary file added mteb_ndcg_plot.png
4 changes: 0 additions & 4 deletions tests/config-cpws.yaml
@@ -25,9 +25,6 @@ vector:
topk: 5

rerank:
# https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2
# cross-encoder/ms-marco-MiniLM-L-2-v2, cross-encoder/ms-marco-MiniLM-L-4-v2, or cross-encoder/ms-marco-MiniLM-L-6-v2
# rerank_model: cross-encoder/ms-marco-TinyBERT-L-2-v2
rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2
rerank_bs: 100
topk: 5
@@ -45,6 +42,5 @@ fields:

output_prefix: denser_output_retriever/

## temp parameters
max_doc_size: 0
max_query_size: 0
4 changes: 0 additions & 4 deletions tests/config-denser.yaml
@@ -27,15 +27,11 @@ vector:
topk: 5

rerank:
# https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2
# cross-encoder/ms-marco-MiniLM-L-2-v2, cross-encoder/ms-marco-MiniLM-L-4-v2, or cross-encoder/ms-marco-MiniLM-L-6-v2
# rerank_model: cross-encoder/ms-marco-TinyBERT-L-2-v2
rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2
rerank_bs: 100
topk: 5

output_prefix: denser_output_retriever/

## temp parameters
max_doc_size: 0
max_query_size: 0
4 changes: 0 additions & 4 deletions tests/config-titanic.yaml
@@ -25,9 +25,6 @@ vector:
topk: 5

rerank:
# https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2
# cross-encoder/ms-marco-MiniLM-L-2-v2, cross-encoder/ms-marco-MiniLM-L-4-v2, or cross-encoder/ms-marco-MiniLM-L-6-v2
# rerank_model: cross-encoder/ms-marco-TinyBERT-L-2-v2
rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2
rerank_bs: 100
topk: 5
@@ -44,6 +41,5 @@ fields:

output_prefix: denser_output_retriever/

## temp parameters
max_doc_size: 0
max_query_size: 0
6 changes: 2 additions & 4 deletions www/content/docs/examples/e2e-chat.mdx
@@ -19,18 +19,16 @@ Under the [repo](https://github.com/denser-org/denser-retriever/tree/main) direc
poetry run streamlit run examples/denser_chat.py
```

This command first builds a mini retriever with the following code. It then launches a webpage with interactive UI so suers can input queries.
This command first builds a mini retriever with the following code.

```python
index_name = "unit_test_denser"
retriever = RetrieverGeneral(index_name, "tests/config-denser.yaml")
retriever.ingest("tests/test_data/denser_website_passages_top10.jsonl")
```

Once it is launched, we will see a chatbot interface similar to the following screenshot. We can ask any question to the chatbot and get the response. Below shows an example query of `what use cases does denser support?` The retriever first returns the relevant passages listed under the section of `Sources` at the bottom of the screenshot. These passages are fed to a LLM and the final summarization is displayed on the chat window.
It then launches a webpage with an interactive UI so users can input queries. Once launched, it shows a chatbot interface similar to the following screenshot. We can ask the chatbot any question and get a response. Below is an example query, `what use cases does denser support?` The retriever first returns the relevant passages, listed under the `Sources` section at the bottom of the screenshot. These passages are fed to an LLM, and the final summary is displayed in the chat window.
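The retrieve-then-summarize flow described above can be sketched with plain strings. The prompt template and passage format below are invented for illustration; they are not what examples/denser_chat.py actually sends to the LLM.

```python
# Sketch of assembling an LLM prompt from retrieved passages (RAG style).
# The template is hypothetical; see examples/denser_chat.py for the real flow.

def build_prompt(query, passages):
    """Concatenate retrieved passages into a grounded prompt for an LLM."""
    sources = "\n".join(
        f"[{i + 1}] {text}" for i, text in enumerate(passages)
    )
    return (
        "Answer the question using only the sources below.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}\nAnswer:"
    )

prompt = build_prompt(
    "what use cases does denser support?",
    ["Denser supports chatbots.", "Denser supports semantic search."],
)
print(prompt.splitlines()[0])  # Answer the question using only the sources below.
```

Numbering the passages lets the LLM's answer cite its sources, matching the `Sources` section shown in the screenshot.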

import DenserChat from "./denser_chat.png"

<Screenshot src={DenserChat} alt="Denser Chat" className="shadow-xs" />

With the same code, we can build a chatbot on **Wikipedia** dataset at [here](https://denser.ai). Feel free to try it out!
4 changes: 2 additions & 2 deletions www/content/docs/examples/e2e-search.mdx
@@ -19,15 +19,15 @@ Under the [repo](https://github.com/denser-org/denser-retriever/tree/main) direc
poetry run streamlit run examples/denser_search.py
```

This command first builds a mini retriever with the following code. It then launches a webpage with interactive UI so uers can input queries.
This command first builds a mini retriever with the following code.

```python
index_name = "unit_test_titanic"
retriever = RetrieverGeneral(index_name, "tests/config-titanic.yaml")
retriever.ingest("tests/test_data/titanic_top10.jsonl")
```

Once it is launched, we will see a search interface similar to the following screenshot. We can input any queries, along with the filters to search relevant results. Below shows an example query of `cumings` with `Sex` field is filled with `female` The retriever returns the relevant passages which matches the specified filter value.
It then launches a webpage with an interactive UI so users can input queries. Once launched, it shows a search interface similar to the following screenshot. We can input any query, along with filters, to search for relevant results. Below is an example query, `cumings`, with the `Sex` field set to `female`. The retriever returns the relevant passages that match the specified filter value.

import DenserSearch from "./denser_search.png"
