
Commit

Clean up before release
zhiheng-huang committed May 28, 2024
1 parent 6786869 commit fa11456
Showing 21 changed files with 145 additions and 181 deletions.
11 changes: 6 additions & 5 deletions README.md
@@ -21,15 +21,16 @@ An enterprise-grade AI retriever designed to streamline AI integration into your
## 📝 Description

Denser Retriever combines multiple search technologies into a single platform. It utilizes **gradient boosting (
xgboost)** machine learning technique to combines:
xgboost)** machine learning technique to combine:

- **Keyword-based searches** that focus on fetching precisely what the query mentions.
- **Vector databases** that are great for finding a wide range of potentially relevant answers.
- **Machine Learning rerankers** that fine-tune the results to ensure the most relevant answers top the list.

We show that the combinations can significantly improve the retrieval accuracy on MTEB benchmarks when compared to each individual retrievers.
Our experiments on MTEB datasets show that combining keyword search, vector search, and a reranker via an xgboost model (denoted as ES+VS+RR_n) significantly improves the vector search (VS) baseline.
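The combination described above can be illustrated with a minimal, dependency-free sketch. Everything here is hypothetical: the weights and the linear stand-in for a trained model are invented for illustration, whereas the real project learns the combination with an xgboost model over retriever scores.

```python
# Minimal sketch of combining heterogeneous retriever scores.
# In Denser Retriever an xgboost model learns this combination; here a
# hand-set linear function stands in so the example stays dependency-free.

def combine_scores(candidates, weights=(0.3, 0.3, 0.4)):
    """Rank passages by a weighted sum of (keyword, vector, rerank) scores.

    candidates: list of (passage_id, keyword_score, vector_score, rerank_score)
    """
    kw_w, vec_w, rr_w = weights
    scored = [
        (pid, kw_w * kw + vec_w * vec + rr_w * rr)
        for pid, kw, vec, rr in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

candidates = [
    ("doc1", 0.9, 0.2, 0.1),  # strong keyword match only
    ("doc2", 0.4, 0.8, 0.9),  # vector search and reranker agree
    ("doc3", 0.1, 0.3, 0.2),
]
ranking = combine_scores(candidates)
print([pid for pid, _ in ranking])  # ['doc2', 'doc1', 'doc3']
```

A passage that several retrievers agree on ("doc2") outranks one favored by keyword search alone, which is the intuition behind training a learned combiner instead of trusting any single retriever.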

![mteb_ndcg_plot](mteb_ndcg_plot.png)

![](./www/content/docs/experiments/rag-flow-data.png)

## 🚀 Features

@@ -38,8 +39,6 @@ The initial release of Denser Retriever provides the following features.
- Supporting heterogeneous retrievers such as **keyword search**, **vector search**, and **ML model reranking**
- Leveraging **xgboost** ML technique to effectively combine heterogeneous retrievers
- **State-of-the-art accuracy** on [MTEB](https://github.com/embeddings-benchmark/mteb) Retrieval benchmarking
- Providing an **out-of-the-box retriever** which significantly outperforms the current best vector search model in
similar model size
- Demonstrating how to use Denser Retriever to power **end-to-end applications** such as chatbots and semantic search

## 📦 Installation
@@ -48,6 +47,8 @@ We use [Poetry](https://python-poetry.org/docs/) to install and manage Denser Re
Retriever with the following command under repo root directory.

```bash
git clone https://github.com/denser-org/denser-retriever
cd denser-retriever
make install
```

4 changes: 1 addition & 3 deletions docker/milvus/standalone/hello_milvus.py
@@ -33,9 +33,7 @@
print(fmt.format("start connecting to Milvus"))
connections.connect(
"default",
# host="localhost",
host="54.68.68.29",
# host="44.237.177.8",
host="localhost",
port="19530",
user="root",
password="Milvus",
3 changes: 1 addition & 2 deletions docker/milvus/standalone/list_connections.py
@@ -6,8 +6,7 @@

connections.connect(
"default",
# host="localhost",
host="54.68.68.29",
host="localhost",
port="19530",
user="root",
password="Milvus",
4 changes: 0 additions & 4 deletions examples/denser_chat.py
@@ -60,10 +60,6 @@ def denser_chat():

st.session_state.messages.append({"role": "user", "content": prompt})

enc = tiktoken.encoding_for_model(default_openai_model)
prompt_length = len(enc.encode(prompt))
logger.info(f"prompt length:{prompt_length}")

with st.chat_message("assistant"):
message_placeholder = st.empty()
full_response = ""
7 changes: 0 additions & 7 deletions experiments/config_local.yaml
@@ -20,20 +20,13 @@ vector:
milvus_port: 19530
milvus_user: root
milvus_passwd: Milvus
# https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5
# sentence-transformers/all-MiniLM-L6-v2 (dim: 384), Alibaba-NLP/gte-large-en-v1.5 (dim 1024),
# Alibaba-NLP/gte-base-en-v1.5 (dim: 768)
# Snowflake/snowflake-arctic-embed-m
emb_model: Snowflake/snowflake-arctic-embed-m
emb_dims: 768
one_model: false
vector_ingest_passage_bs: 2000 # 1000
topk: 100

rerank:
# https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2
# cross-encoder/ms-marco-MiniLM-L-2-v2, cross-encoder/ms-marco-MiniLM-L-4-v2, or cross-encoder/ms-marco-MiniLM-L-6-v2
# rerank_model: cross-encoder/ms-marco-TinyBERT-L-2-v2
rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2
rerank_bs: 100
topk: 100
13 changes: 3 additions & 10 deletions experiments/config_server.yaml
@@ -10,30 +10,23 @@ model_features: es+vs+rr_n

keyword:
es_user: elastic
es_passwd: WzAkbzjZj9AfNXxzmOmp
es_host: http://54.68.68.29:9200
es_passwd: YOUR_ES_PASSWORD
es_host: http://00.00.00.00:9200
es_ingest_passage_bs: 5000
topk: 100

vector:
milvus_host: 54.68.68.29
milvus_host: 00.00.00.00
milvus_port: 19530
milvus_user: root
milvus_passwd: Milvus
# https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5
# sentence-transformers/all-MiniLM-L6-v2 (dim: 384), Alibaba-NLP/gte-large-en-v1.5 (dim 1024),
# Alibaba-NLP/gte-base-en-v1.5 (dim: 768)
# Snowflake/snowflake-arctic-embed-m
emb_model: Snowflake/snowflake-arctic-embed-m
emb_dims: 768
one_model: false
vector_ingest_passage_bs: 2000 # 1000
topk: 100

rerank:
# https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2
# cross-encoder/ms-marco-MiniLM-L-2-v2, cross-encoder/ms-marco-MiniLM-L-4-v2, or cross-encoder/ms-marco-MiniLM-L-6-v2
# rerank_model: cross-encoder/ms-marco-TinyBERT-L-2-v2
rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2
rerank_bs: 100
topk: 100
14 changes: 7 additions & 7 deletions experiments/train_and_test.py
@@ -35,7 +35,6 @@ def __init__(self, config_file, dataset_name):
# (100 passages per query) and reranker passages (maximum 200 passages per query)
def generate_retriever_data(self, train_on, eval_on):
generate_data(self.dataset_name, train_on, self.config_file, ingest=True)
# generate_data(self.dataset_name, train_on, self.config_file, ingest=False)
if eval_on != train_on:
generate_data(self.dataset_name, eval_on, self.config_file, ingest=False)

@@ -329,7 +328,7 @@ def report(self, eval_on):


if __name__ == "__main__":
config_file = "experiments/config_server.yaml"
# config_file = "experiments/config_server.yaml"

# dataset = ["mteb/arguana", "test", "test"]
# dataset = ["mteb/climate-fever", "test", "test"]
@@ -349,13 +348,14 @@ def report(self, eval_on):
# dataset_name, train_on, eval_on = dataset
# model_dir = "/home/ubuntu/denser_output_retriever/exp_msmarco/models/"

if len(sys.argv) != 4:
print("Usage: python train_and_test.py [dataset_name] [train] [test]")
if len(sys.argv) != 5:
print("Usage: python train_and_test.py [config_file] [dataset_name] [train] [test]")
sys.exit(0)

dataset_name = sys.argv[1]
train_on = sys.argv[2]
eval_on = sys.argv[3]
config_file = sys.argv[1]
dataset_name = sys.argv[2]
train_on = sys.argv[3]
eval_on = sys.argv[4]

experiment = Experiment(config_file, dataset_name)
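The positional-argument handling changed above can also be expressed with `argparse`. This is a sketch, not the project's actual code (train_and_test.py reads `sys.argv` directly); only the argument names mirror the usage string.

```python
# Sketch: the same four positional arguments parsed with argparse
# instead of indexing sys.argv directly.
import argparse

def parse_args(argv):
    parser = argparse.ArgumentParser(
        description="Train and test a retriever (hypothetical CLI sketch)."
    )
    parser.add_argument("config_file")
    parser.add_argument("dataset_name")
    parser.add_argument("train")
    parser.add_argument("test")
    return parser.parse_args(argv)

args = parse_args(
    ["experiments/config_local.yaml", "mteb/arguana", "test", "test"]
)
print(args.dataset_name)  # mteb/arguana
```

With `argparse`, a wrong argument count produces a usage message automatically, replacing the manual `len(sys.argv)` check.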

Binary file added mteb_ndcg_plot.png
4 changes: 0 additions & 4 deletions tests/config-cpws.yaml
@@ -25,9 +25,6 @@ vector:
topk: 5

rerank:
# https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2
# cross-encoder/ms-marco-MiniLM-L-2-v2, cross-encoder/ms-marco-MiniLM-L-4-v2, or cross-encoder/ms-marco-MiniLM-L-6-v2
# rerank_model: cross-encoder/ms-marco-TinyBERT-L-2-v2
rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2
rerank_bs: 100
topk: 5
@@ -45,6 +42,5 @@ fields:

output_prefix: denser_output_retriever/

## temp parameters
max_doc_size: 0
max_query_size: 0
4 changes: 0 additions & 4 deletions tests/config-denser.yaml
@@ -27,15 +27,11 @@ vector:
topk: 5

rerank:
# https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2
# cross-encoder/ms-marco-MiniLM-L-2-v2, cross-encoder/ms-marco-MiniLM-L-4-v2, or cross-encoder/ms-marco-MiniLM-L-6-v2
# rerank_model: cross-encoder/ms-marco-TinyBERT-L-2-v2
rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2
rerank_bs: 100
topk: 5

output_prefix: denser_output_retriever/

## temp parameters
max_doc_size: 0
max_query_size: 0
4 changes: 0 additions & 4 deletions tests/config-titanic.yaml
@@ -25,9 +25,6 @@ vector:
topk: 5

rerank:
# https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2
# cross-encoder/ms-marco-MiniLM-L-2-v2, cross-encoder/ms-marco-MiniLM-L-4-v2, or cross-encoder/ms-marco-MiniLM-L-6-v2
# rerank_model: cross-encoder/ms-marco-TinyBERT-L-2-v2
rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2
rerank_bs: 100
topk: 5
@@ -44,6 +41,5 @@ fields:

output_prefix: denser_output_retriever/

## temp parameters
max_doc_size: 0
max_query_size: 0
6 changes: 2 additions & 4 deletions www/content/docs/examples/e2e-chat.mdx
@@ -19,18 +19,16 @@ Under the [repo](https://github.com/denser-org/denser-retriever/tree/main) direc
poetry run streamlit run examples/denser_chat.py
```

This command first builds a mini retriever with the following code. It then launches a webpage with interactive UI so suers can input queries.
This command first builds a mini retriever with the following code.

```python
index_name = "unit_test_denser"
retriever = RetrieverGeneral(index_name, "tests/config-denser.yaml")
retriever.ingest("tests/test_data/denser_website_passages_top10.jsonl")
```

Once it is launched, we will see a chatbot interface similar to the following screenshot. We can ask any question to the chatbot and get the response. Below shows an example query of `what use cases does denser support?` The retriever first returns the relevant passages listed under the section of `Sources` at the bottom of the screenshot. These passages are fed to a LLM and the final summarization is displayed on the chat window.
It then launches a webpage with an interactive UI so users can input queries. Once launched, it shows a chatbot interface similar to the following screenshot. We can ask the chatbot any question and get a response. Below is an example query, `what use cases does denser support?` The retriever first returns the relevant passages, listed under the `Sources` section at the bottom of the screenshot. These passages are fed to an LLM, and the final summary is displayed in the chat window.
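The retrieve-then-summarize flow described above can be sketched with plain strings. The prompt template and passage format below are invented for illustration; they are not what examples/denser_chat.py actually sends to the LLM.

```python
# Sketch of assembling an LLM prompt from retrieved passages (RAG style).
# The template is hypothetical; see examples/denser_chat.py for the real flow.

def build_prompt(query, passages):
    """Concatenate retrieved passages into a grounded prompt for an LLM."""
    sources = "\n".join(
        f"[{i + 1}] {text}" for i, text in enumerate(passages)
    )
    return (
        "Answer the question using only the sources below.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}\nAnswer:"
    )

prompt = build_prompt(
    "what use cases does denser support?",
    ["Denser supports chatbots.", "Denser supports semantic search."],
)
print(prompt.splitlines()[0])  # Answer the question using only the sources below.
```

Numbering the passages lets the LLM's answer cite its sources, matching the `Sources` section shown in the screenshot.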

import DenserChat from "./denser_chat.png"

<Screenshot src={DenserChat} alt="Denser Chat" className="shadow-xs" />

With the same code, we can build a chatbot on **Wikipedia** dataset at [here](https://denser.ai). Feel free to try it out!
4 changes: 2 additions & 2 deletions www/content/docs/examples/e2e-search.mdx
@@ -19,15 +19,15 @@ Under the [repo](https://github.com/denser-org/denser-retriever/tree/main) direc
poetry run streamlit run examples/denser_search.py
```

This command first builds a mini retriever with the following code. It then launches a webpage with interactive UI so uers can input queries.
This command first builds a mini retriever with the following code.

```python
index_name = "unit_test_titanic"
retriever = RetrieverGeneral(index_name, "tests/config-titanic.yaml")
retriever.ingest("tests/test_data/titanic_top10.jsonl")
```

Once it is launched, we will see a search interface similar to the following screenshot. We can input any queries, along with the filters to search relevant results. Below shows an example query of `cumings` with `Sex` field is filled with `female` The retriever returns the relevant passages which matches the specified filter value.
It then launches a webpage with an interactive UI so users can input queries. Once launched, it shows a search interface similar to the following screenshot. We can input any query, along with filters, to search for relevant results. Below is an example query, `cumings`, with the `Sex` field set to `female`. The retriever returns the relevant passages that match the specified filter value.

import DenserSearch from "./denser_search.png"
