Quick start and LeCaRDv2 exp
zhiheng-huang committed Jan 26, 2025
1 parent afdf79f commit f6567bb
Showing 15 changed files with 328 additions and 55 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -107,3 +107,5 @@ denser_output_retriever/
docker-volume/
*.pkl
*.failed

exps
103 changes: 94 additions & 9 deletions README.md
@@ -21,7 +21,7 @@ An enterprise-grade AI retriever designed to streamline AI integration into your

- Optimizes retrieval by combining keyword search, vector search, and reranking
- Simple example for quick start
- Built-in support for [MTEB](https://github.com/embeddings-benchmark/mteb) scifact and MsMarco datasets
- Built-in experiments for [MTEB](https://github.com/embeddings-benchmark/mteb) scifact, MsMarco, and LeCaRDv2 datasets
- Full MTEB Retrieval benchmark experiments
- Ready-to-use components for chatbots and semantic search applications

@@ -43,11 +43,50 @@ We need to start the elasticsearch and Milvus services before running the experi
docker compose up -d
```

After starting the services, we run the following command to use `"tests/test_data/state_of_the_union.txt"` file to build a retriever and run a
query `"What did the president say about Ketanji Brown Jackson"` to retrieve the top 10 passages.
After starting the services, use the following code to build a retriever and run a query. Retriever construction and querying are governed by the configuration file `denser_retriever/configs/fusion_msmarco.json`, which specifies the embedding model, the reranker model, and the top-k arguments used in retrieval.

```python
from langchain_core.documents import Document
from denser_retriever.core.retriever import DenserRetriever
from denser_retriever.config import load_retriever_config

# Create sample documents
texts = [
    Document(page_content="Python is a high-level programming language known for its simplicity and readability."),
    Document(page_content="Machine learning is a subset of AI that enables systems to learn from data."),
    Document(page_content="Natural Language Processing (NLP) helps computers understand human language."),
    Document(page_content="Deep learning models use neural networks with multiple layers.")
]

# Initialize retriever
index_name = "tech_docs"
config = load_retriever_config("denser_retriever/configs/fusion_msmarco.json")
retriever_config = config.get_retriever_config(index_name, True)
retriever = DenserRetriever(**retriever_config)

# Ingest documents
retriever.ingest(texts)

# Test queries
queries = [
    "What is Python?",
    "Explain machine learning",
]

# Retrieve results
for query in queries:
    result = retriever.retrieve(query=query, k=5, usage=True)
    print(f"\nQuery: {query}")
    print(result.to_json())

# Cleanup
retriever.delete_all(delete_index=True)
```

The same code is also available at `experiments/quick_start.py`. Run the following command to execute it.

```commandline
python -m denser_retriever.experiments.toy_example
```bash
python -m denser_retriever.experiments.quick_start
```

## Ingestion
@@ -141,9 +180,15 @@ Metadata: {
--------------------------------------------------------------------------------
```

## Scifact Dataset Experiment

### Training
## MTEB Experiments

<details>
<summary>
Scifact Dataset Experiment
</summary>

### Training

Training refers to training a logistic regression model that fuses keyword search, vector search, and reranking scores (a minimal sketch of this fusion step appears after the results table below). The trained
logistic regression model is used in the `fusion` combine method. Without training, we can still use `vector`, `hybrid`,
@@ -226,7 +271,12 @@ the `weights_es+vs+rr.json` model from the training, obtained identical ndcg@10
| reranker | 0.6759 | 0.83 |
| fusion | 0.7434 | 2.10 |
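As a hedged illustration of the fusion step described above: assuming each (query, passage) pair has a keyword-search score (`es`), a vector-search score (`vs`), and a reranker score (`rr`), a logistic regression combiner can be trained and applied as in the sketch below. The scikit-learn usage and the toy scores are illustrative assumptions, not the repository's exact implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: one row of [es, vs, rr] scores per (query, passage) pair.
X_train = np.array([
    [0.9, 0.8, 0.7],  # relevant passage
    [0.2, 0.1, 0.3],  # irrelevant passage
    [0.7, 0.9, 0.8],
    [0.1, 0.2, 0.1],
])
y_train = np.array([1, 0, 1, 0])  # 1 = relevant, 0 = irrelevant

fusion_model = LogisticRegression()
fusion_model.fit(X_train, y_train)

# At query time, fuse the three scores of each candidate and rank by probability.
candidates = np.array([[0.8, 0.6, 0.9], [0.3, 0.4, 0.2]])
fused = fusion_model.predict_proba(candidates)[:, 1]
print(fused.argsort()[::-1])  # candidate indices, best first
```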

## MsMarco Dataset Experiment
</details>

<details>
<summary>
MsMarco Dataset Experiment
</summary>

### Training

@@ -288,6 +338,41 @@ NDCG@10 scores. The fusion method outperforms the other methods with a higher co
| reranker | 0.4013 | 10.02 |
| fusion | 0.4707 | 16.65 |

</details>

<details>
<summary>
LeCaRDv2 Dataset Experiment
</summary>

The LeCaRDv2 task is to identify and retrieve the case document that best matches, or is most relevant to, the scenario described in each query. The query set contains 159 queries, each outlining a distinct situation, and the corpus contains 3,795 candidate case documents. The original data is available at https://github.com/THUIR/LeCaRDv2.


### Evaluation

The LeCaRDv2 dataset has only 159 test queries and no training split. Since we cannot train a logistic regression model on this data, we evaluate with the model trained on the MsMarco dataset. The embedding model `BAAI/bge-m3` and the reranker model `BAAI/bge-reranker-v2-m3` are used in the evaluation; all models are specified in `fusion_lecardv2.json`.

```bash
python -m denser_retriever.experiments.evaluate \
lecardv2 \
mteb/lecardv2 \
--split test \
--config denser_retriever/configs/fusion_lecardv2.json \
--output-dir exps/exp_lecardv2/pred \
--top-k 100
```

The evaluation results are listed below. The hybrid method achieves the highest NDCG@10 score (0.7510), which is strong compared to entries on the [Huggingface MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard). The fusion method does not perform well, mainly because 1) the reranker model underperforms on this dataset, and 2) the fusion logistic regression model was trained on the MsMarco dataset. A sketch of the NDCG@10 computation appears after the table.

| Method | NDCG@10 |
|----------|---------|
| vector | 0.7034 |
| hybrid | 0.7510 |
| reranker | 0.6022 |
| fusion | 0.7054 |
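
For reference, NDCG@10, the metric reported in the tables above, can be computed as in the following minimal sketch. The relevance grades are made up, and the repository's evaluation code may use a different implementation (for example, `pytrec_eval`).

```python
import math

def ndcg_at_k(ranked_relevances, k=10):
    # ranked_relevances: graded relevance of retrieved documents, in ranked order.
    def dcg(rels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Made-up example: relevance grades of the top retrieved case documents.
print(ndcg_at_k([3, 2, 0, 1, 0, 0, 0, 0, 0, 0]))
```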

</details>

## Unit Tests

Run the following command to execute all unit tests.
@@ -302,7 +387,7 @@ If you want to run a specific test, for example the `test_retrieve` method in `t
pytest tests/test_retriever.py::TestRetriever::test_retrieve
```

## 📃 Documentation (slightly outdated)
## 📃 Documentation (outdated)

The official documentation is hosted on [retriever.denser.ai](https://retriever.denser.ai). The complete MTEB retrieval experiment is available at [retriever-docs.denser.ai](https://retriever-docs.denser.ai/docs/core/experiments/mteb_retrieval).

3 changes: 2 additions & 1 deletion denser_retriever/__init__.py
@@ -8,7 +8,7 @@
VoyageAPIEmbeddings,
)
from denser_retriever.core.keyword import DenserKeywordSearch, ElasticKeywordSearch
from denser_retriever.core.reranker import DenserReranker, HFReranker, CohereReranker
from denser_retriever.core.reranker import DenserReranker, HFReranker, CohereReranker, BGEReranker
from denser_retriever.core.retriever import DenserRetriever
from denser_retriever.core.vectordb.base import DenserVectorDB
from denser_retriever.core.vectordb import MilvusDenserVectorDB
@@ -38,6 +38,7 @@ def get_version() -> str:
"DenserReranker",
"HFReranker",
"CohereReranker",
"BGEReranker",
"DenserRetriever",
"DenserVectorDB",
"MilvusDenserVectorDB",
12 changes: 10 additions & 2 deletions denser_retriever/config.py
@@ -5,12 +5,13 @@
ElasticKeywordSearch,
create_elasticsearch_client,
)
from denser_retriever.core.reranker import HFReranker
from denser_retriever.core.reranker import HFReranker, BGEReranker
from denser_retriever.core.vectordb.milvus import MilvusDenserVectorDB
from denser_retriever.core.embeddings import (
VoyageAPIEmbeddings,
SentenceTransformerEmbeddings,
BGEEmbeddings,
BGEM3Embeddings,
)

# Define the combine method type
@@ -85,7 +86,10 @@ def get_retriever_config(

# Configure reranker
if self.reranker_model:
reranker = HFReranker(model_name=self.reranker_model)
if self.reranker_model.startswith("BAAI"):
reranker = BGEReranker(model_name=self.reranker_model)
else:
reranker = HFReranker(model_name=self.reranker_model)
else:
reranker = None

@@ -107,6 +111,10 @@
embeddings = BGEEmbeddings(
model_name=self.embedding.model, embedding_size=self.embedding.size
)
elif self.embedding.type == "bgem3":
embeddings = BGEM3Embeddings(
model_name=self.embedding.model, embedding_size=self.embedding.size
)
else:
raise ValueError(f"Unknown embedding type: {self.embedding.type}")
else:
31 changes: 31 additions & 0 deletions denser_retriever/configs/fusion_lecardv2.json
@@ -0,0 +1,31 @@
{
  "max_query_len": 2000,
  "es": {
    "url": "http://localhost:9200",
    "username": "elastic",
    "password": "YOUR_PASSWORD",
    "analysis": "default"
  },
  "milvus": {
    "uri": "http://localhost:19530",
    "user": "root",
    "password": "YOUR_PASSWORD"
  },
  "reranker_model": "BAAI/bge-reranker-v2-m3",
  "embedding": {
    "type": "bgem3",
    "model": "BAAI/bge-m3",
    "size": 1024
  },
  "combine_config": {
    "method": "fusion",
    "keyword_top_k": 100,
    "vector_top_k": 100,
    "reranker_top_k": 100,
    "lr_config": {
      "lr_features": "es+vs+rr",
      "lr_model": "denser_retriever/models/weights_es+vs+rr_msmarco.json"
    }
  },
  "aggregation": false
}
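
As a hedged illustration of how this new config might be consumed, mirroring the quick-start API shown in the README diff above; the index name `lecardv2` and the query text are placeholders, and the LeCaRDv2 corpus is assumed to be already ingested:

```python
from denser_retriever.core.retriever import DenserRetriever
from denser_retriever.config import load_retriever_config

# Build a retriever governed by the LeCaRDv2 fusion config shown above.
config = load_retriever_config("denser_retriever/configs/fusion_lecardv2.json")
retriever_config = config.get_retriever_config("lecardv2", True)
retriever = DenserRetriever(**retriever_config)

# Retrieve the top 10 candidate case documents for a query scenario.
result = retriever.retrieve(query="<LeCaRDv2 query scenario text>", k=10)
print(result.to_json())
```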
6 changes: 5 additions & 1 deletion denser_retriever/configs/hybrid.json
@@ -2,10 +2,14 @@
"max_query_len": 2000,
"es": {
"url": "http://localhost:9200",
"username": "elastic",
"password": "YOUR_PASSWORD",
"analysis": "default"
},
"milvus": {
"uri": "http://localhost:19530"
"uri": "http://localhost:19530",
"user": "root",
"password": "YOUR_PASSWORD"
},
"embedding": {
"type": "sentence_transformer",
2 changes: 2 additions & 0 deletions denser_retriever/configs/reranker.json
@@ -2,6 +2,8 @@
"max_query_len": 2000,
"es": {
"url": "http://localhost:9200",
"username": "elastic",
"password": "YOUR_PASSWORD",
"analysis": "default"
},
"reranker_model": "cross-encoder/ms-marco-MiniLM-L-6-v2",
16 changes: 10 additions & 6 deletions denser_retriever/configs/train_lecardv2.json
@@ -1,23 +1,27 @@
{
"ingest_bs": 20,
"max_query_size": 10,
"max_query_size": 0,
"max_query_len": 2000,
"max_doc_size": 100,
"max_doc_size": 0,
"max_doc_len": 8000,
"es": {
"url": "http://localhost:9200",
"username": "elastic",
"password": "YOUR_PASSWORD",
"analysis": "default"
},
"milvus": {
"uri": "http://localhost:19530"
"uri": "http://localhost:19530",
"user": "root",
"password": "YOUR_PASSWORD"
},
"reranker_model": "BAAI/bge-reranker-base",
"embedding": {
"type": "voyage",
"model": "voyage-law-2",
"type": "bgem3",
"model": "BAAI/bge-m3",
"size": 1024
},
"voyage_api_key": "pa-b76ti3S2pWuSl0go1S7f8-x150YAXUoh6UANO2LpHbI",
"voyage_api_key": "",
"combine_config": {
"method": "fusion",
"keyword_top_k":100,
32 changes: 32 additions & 0 deletions denser_retriever/configs/train_msmarco.json
@@ -0,0 +1,32 @@
{
  "ingest_bs": 10000,
  "max_query_size": 10000,
  "max_query_len": 2000,
  "max_doc_size": 0,
  "max_doc_len": 8000,
  "es": {
    "url": "http://localhost:9200",
    "username": "elastic",
    "password": "YOUR_PASSWORD",
    "analysis": "default"
  },
  "milvus": {
    "uri": "http://localhost:19530",
    "user": "root",
    "password": "YOUR_PASSWORD"
  },
  "reranker_model": "cross-encoder/ms-marco-MiniLM-L-6-v2",
  "embedding": {
    "type": "sentence_transformer",
    "model": "Snowflake/snowflake-arctic-embed-m",
    "size": 768,
    "one_model": false
  },
  "combine_config": {
    "method": "fusion",
    "keyword_top_k": 100,
    "vector_top_k": 100,
    "reranker_top_k": 100
  },
  "aggregation": false
}
4 changes: 3 additions & 1 deletion denser_retriever/configs/vector.json
@@ -1,7 +1,9 @@
{
"max_query_len": 2000,
"milvus": {
"uri": "http://localhost:19530"
"uri": "http://localhost:19530",
"user": "root",
"password": "YOUR_PASSWORD"
},
"embedding": {
"type": "sentence_transformer",
37 changes: 37 additions & 0 deletions denser_retriever/core/embeddings.py
@@ -96,6 +96,43 @@ def embed_query(self, text):
        return self.client.encode_queries(text)


class BGEM3Embeddings(DenserEmbeddings):
    _instances: Dict[str, "BGEM3Embeddings"] = {}

    def __new__(cls, model_name: str, embedding_size: int):
        key = f"{model_name}_{embedding_size}"
        if key in cls._instances:
            return cls._instances[key]

        instance = super(BGEM3Embeddings, cls).__new__(cls)
        cls._instances[key] = instance
        instance.__initialized = False
        return instance

    def __init__(self, model_name: str, embedding_size: int):
        if hasattr(self, "__initialized") and self.__initialized:
            return

        try:
            from FlagEmbedding import BGEM3FlagModel
        except ImportError as exc:
            raise ImportError("Could not import FlagEmbedding python package.") from exc

        self.client = BGEM3FlagModel(
            model_name,
            use_fp16=True,
        )  # Setting use_fp16 to True speeds up computation with a slight performance degradation
        self.embedding_size = embedding_size
        self.__initialized = True

    def embed_documents(self, texts):
        res = self.client.encode(texts)['dense_vecs'].tolist()
        return res

    def embed_query(self, text):
        res = self.client.encode([text])['dense_vecs'].tolist()
        return res

class VoyageAPIEmbeddings(DenserEmbeddings):
    _instances: Dict[str, "VoyageAPIEmbeddings"] = {}

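A hedged usage sketch of the new `BGEM3Embeddings` class above, assuming `FlagEmbedding` is installed and the `BAAI/bge-m3` weights can be downloaded; it only exercises the methods shown in the diff:

```python
from denser_retriever.core.embeddings import BGEM3Embeddings

# Instances are cached per (model_name, embedding_size) key, so constructing the
# class twice with the same arguments returns the same object.
emb = BGEM3Embeddings(model_name="BAAI/bge-m3", embedding_size=1024)

doc_vecs = emb.embed_documents(["A contract dispute over delivery terms."])
query_vec = emb.embed_query("contract dispute")
print(len(doc_vecs[0]), len(query_vec[0]))  # both should be 1024
```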
