diff --git a/README.md b/README.md
index b46cd33..ef67e99 100644
--- a/README.md
+++ b/README.md
@@ -29,7 +29,7 @@ xgboost)** machine learning technique to combines:
We show that the combinations can significantly improve retrieval accuracy on MTEB benchmarks compared to each individual retriever.
-![](./www/pages/docs/experiments/RAG-flow-data.png)
+![](./www/content/docs/experiments/rag-flow-data.png)
## 🚀 Features
diff --git a/bun.lockb b/bun.lockb
index 9a2b25d..5099b16 100755
Binary files a/bun.lockb and b/bun.lockb differ
diff --git a/www/app/global.css b/www/app/global.css
index 65f25f3..c9e641f 100644
--- a/www/app/global.css
+++ b/www/app/global.css
@@ -542,3 +542,7 @@ details > summary::before {
details[open] > summary::before {
transform: rotate(90deg);
}
+
+.prose table th {
+ @apply text-left;
+}
diff --git a/www/components/mteb-chart.tsx b/www/components/mteb-chart.tsx
new file mode 100644
index 0000000..f57d39f
--- /dev/null
+++ b/www/components/mteb-chart.tsx
@@ -0,0 +1,77 @@
+"use client"
+
+import {
+ Bar,
+ BarChart,
+ CartesianGrid,
+ Legend,
+ Rectangle,
+ ResponsiveContainer,
+ Tooltip,
+ XAxis,
+ YAxis,
+} from "recharts"
+
+export interface MtebChartProps {
+  data: Record<string, string | number>[]
+}
+
+export default function MtebChart({ data }: MtebChartProps) {
+  return (
+    {/* Chart layout reconstructed from the imported recharts components; styling is illustrative. */}
+    <ResponsiveContainer width="100%" height={400}>
+      <BarChart data={data}>
+        <CartesianGrid strokeDasharray="3 3" />
+        <XAxis dataKey="name" />
+        <YAxis />
+        <Tooltip />
+        <Legend />
+        <Bar dataKey="es" name="ES" fill="#8884d8" activeBar={<Rectangle />} />
+        <Bar dataKey="vs" name="VS" fill="#82ca9d" activeBar={<Rectangle />} />
+        <Bar dataKey="ES+VS+RR_n" fill="#ffc658" activeBar={<Rectangle />} />
+      </BarChart>
+    </ResponsiveContainer>
+  )
+}
diff --git a/www/content/docs/experiments/RAG-flow-data.png b/www/content/docs/experiments/RAG-flow-data.png
index e6a56e4..a44af43 100644
Binary files a/www/content/docs/experiments/RAG-flow-data.png and b/www/content/docs/experiments/RAG-flow-data.png differ
diff --git a/www/content/docs/experiments/mteb_ndcg_plot.png b/www/content/docs/experiments/mteb_ndcg_plot.png
deleted file mode 100644
index 8fb00ec..0000000
Binary files a/www/content/docs/experiments/mteb_ndcg_plot.png and /dev/null differ
diff --git a/www/content/docs/experiments/mteb_retrieval.mdx b/www/content/docs/experiments/mteb_retrieval.mdx
index a1f9463..c71ff7b 100644
--- a/www/content/docs/experiments/mteb_retrieval.mdx
+++ b/www/content/docs/experiments/mteb_retrieval.mdx
@@ -3,35 +3,36 @@ title: MTEB Retrieval Experiments
---
- We run this experiment on a **server**, which requires ES and Milvus installations specified [here](/docs/install/install-server).
+ We run this experiment on a **server**, which requires ES and Milvus
+ installations specified [here](/docs/install/install-server).
-
-
## MTEB datasets
+
MTEB [retrieval datasets](https://github.com/embeddings-benchmark/mteb) consist of 15 datasets. The dataset statistics, including name, corpus size, and train, dev, and test query counts, are listed in the following table.
-|Name |#Corpus | #Train Query | #Dev Query | #Test Query|
-|---|---|---|---|---|
-|ArguAna |8,674 |0 |0 |1,406|
-|ClimateFEVER |5,416,593 | 0 |0 |1,535|
-|CQADupstack |457,199 |0 |0 |9,963|
-|DBPedia |4,635,922|0 |67 |400 |
-|FEVER |5,416,568 |109,810| 6,666 |6,666 |
-|FiQA2018 |57,638| 5,500 |500 |648|
-|HotpotQA| 5,233,329 |85,000 | 5,447 |7,405 |
-|MSMARCO |8,841,823| 502,939 |6,980 |43|
-|NFCorpus |3,633| 2,590 |324 |323 |
-|NQ| 2,681,468| 0 | 0 |3,452 |
-|QuoraRetrieval |522,931| 0 |5,000 |10,000|
-|SCIDOCS |25,657| 0 |0 |1,000 |
-|SciFact|5,183| 809 |0 |300 |
-|Touche2020 |382,545| 0 |0 |49 |
-|TRECCOVID | 171,332| 0 |0 |50 |
+| Name | #Corpus | #Train Query | #Dev Query | #Test Query |
+| -------------- | --------- | ------------ | ---------- | ----------- |
+| ArguAna | 8,674 | 0 | 0 | 1,406 |
+| ClimateFEVER | 5,416,593 | 0 | 0 | 1,535 |
+| CQADupstack | 457,199 | 0 | 0 | 9,963 |
+| DBPedia | 4,635,922 | 0 | 67 | 400 |
+| FEVER | 5,416,568 | 109,810 | 6,666 | 6,666 |
+| FiQA2018 | 57,638 | 5,500 | 500 | 648 |
+| HotpotQA | 5,233,329 | 85,000 | 5,447 | 7,405 |
+| MSMARCO | 8,841,823 | 502,939 | 6,980 | 43 |
+| NFCorpus | 3,633 | 2,590 | 324 | 323 |
+| NQ | 2,681,468 | 0 | 0 | 3,452 |
+| QuoraRetrieval | 522,931 | 0 | 5,000 | 10,000 |
+| SCIDOCS | 25,657 | 0 | 0 | 1,000 |
+| SciFact | 5,183 | 809 | 0 | 300 |
+| Touche2020 | 382,545 | 0 | 0 | 49 |
+| TRECCOVID | 171,332 | 0 | 0 | 50 |
## Train and test xgboost models
For each dataset in [MTEB](https://github.com/embeddings-benchmark/mteb), we trained an xgboost model on the training split and tested on the test split. To speed up the experiments, we used up to 10k queries per dataset in training (`max_query_size: 10000` in `config_server.yaml`). For datasets without training data, we used the development data to train. If neither training nor development data exists, we applied 3-fold cross-validation: we randomly split the test data into three folds, trained an xgboost model on two folds, and tested on the third. We repeated this process three times so the whole test dataset could be evaluated. We fixed the xgboost training with the following settings: the ndcg metric as the model update objective, a moderate learning rate (`eta`) of 0.1, a regularization parameter (`gamma`) of 1.0, `min_child_weight` of 0.1, a maximum tree depth of 6, and ndcg@10 as the evaluation metric. We used a fixed number (100) of boosting iterations (`num_boost_round`), making no attempt to optimize training per dataset.
+
```python
params = {
"objective": "rank:ndcg",
@@ -42,6 +43,7 @@ params = {
"eval_metric": "ndcg@10",
}
```
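The 3-fold cross-validation described above can be sketched as follows. This is a minimal illustration rather than the experiment code; the `queries` argument and fold construction are hypothetical stand-ins.

```python
import random

def three_fold_splits(queries, seed=42):
    """Shuffle queries, split into 3 folds, and yield (train, test) pairs
    so that every query is tested exactly once across the three rounds."""
    rng = random.Random(seed)
    shuffled = list(queries)
    rng.shuffle(shuffled)
    folds = [shuffled[i::3] for i in range(3)]  # round-robin split into 3 folds
    for i in range(3):
        test = folds[i]
        train = [q for j, fold in enumerate(folds) if j != i for q in fold]
        yield train, test

# Two folds train the model, the held-out fold is evaluated, three times over.
for train, test in three_fold_splits(range(10)):
    print(len(train), len(test))
```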
+
The source code for the experiment can be found at [train_and_test.py](https://github.com/denser-org/denser-retriever/blob/main/experiments/train_and_test.py). We ran the following command to train 8 xgboost models (ES+VS, ES+RR, VS+RR, ES+VS+RR, ES+VS_n, ES+RR_n, VS+RR_n, and ES+VS+RR_n) using MSMARCO training data. The definitions of these 8 models can be found at [training](./training). The parameters are the dataset name, train split, and test split, respectively.
```shell
@@ -69,32 +71,43 @@ We can then evaluate the MTEB dataset (MSMARCO as an example) by running:
```shell
poetry run python experiments/test.py mteb/msmarco
```
+
We will get the ndcg@10 score after the evaluation.
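For intuition, the ndcg@k metric over a ranked list of graded relevance labels can be sketched by hand. This is an illustrative implementation using the exponential gain form, not the MTEB evaluation code.

```python
import math

def ndcg_at_k(relevances, k=10):
    """ndcg@k: DCG of the ranking divided by DCG of the ideal ranking."""
    def dcg(rels):
        # Exponential gain (2^rel - 1) discounted by log2 of the rank position.
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A perfect ranking scores 1.0; pushing relevant results lower reduces the score.
print(ndcg_at_k([1, 1, 0, 0]))  # 1.0
print(round(ndcg_at_k([0, 1, 0, 1]), 3))
```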
## Experiment results
We list the ndcg@10 scores of different models in the following table. Ref is the reference ndcg@10 of `snowflake-arctic-embed-m` from the Huggingface [leaderboard](https://huggingface.co/spaces/mteb/leaderboard), which is consistent with our reported VS accuracy. The bold numbers are the highest accuracy per dataset in our experiments. We use VS instead of Ref as the vector search baseline. Delta and % are the absolute and relative ndcg@10 gains of the ES+VS+RR_n model over the VS baseline.
-|Name | ES | VS | ES+VS/ES+VS_n | ES+RR/ES+RR_n | VS+RR/VS+RR_n | ES+VS+RR/ES+VS+RR_n |Ref | Delta/%|
-|---|---|---|---|---|---|---|---|---|
-|ArguAna |42.93 |56.49 | 56.68/57.27 |47.45/48.21 | 56.32/56.44 |56.81/**57.28** |56.44 | 0.79/1.39% |
-|ClimateFEVER |18.10 |39.12 | 39.21/39.01 |28.20/28.34 | 39.06/38.71 |39.11/**39.25** |39.37 |0.13/0.33%|
-|CQADupstack |25.13 | 42.23 | 42.40/42.51 |37.68/37.54 | 43.92/44.25 |43.85/**44.32** |43.81 | 2.09/4.94%|
-|DBPedia |27.42 |44.66 | 45.26/44.26 |47.94/48.26 | 48.62/49.08 |48.79/**49.13** |44.73 | 4.47/10.00% |
-|FEVER |72.80 |88.90 | 89.29/90.05 |84.38/84.94 | 89.84/90.30 |90.21/**91.00** |89.02 | 2.10/2.36%|
-|FiQA2018 |23.89 |42.29 | 42.57/42.79 |36.62/36.31 | 43.04/43.09 |43.19/**43.22** |42.4 | 0.93/2.19%|
-|HotpotQA|54.94 |73.65 | 74.74/75.01 |74.93/75.39 | 77.64/78.07 |77.95/**78.37** |73.65 |4.72/6.40% |
-|MSMARCO |21.84 |41.77 | 41.65/41.72 |46.93/47.15 | 47.11/**47.24** |47.09/47.23 | 41.77 | 5.46/13.07% |
-|NFCorpus |31.40 |36.74 | 37.37/37.63 |34.51/35.36 | 37.32/37.31 |**37.70**/37.15 |36.77 | 0.41/1.11% |
-|NQ|27.21 |61.33 | 60.51/61.20 |55.60/55.47 | 61.50/62.24 |62.27/**62.35** | 62.43 | 1.02/1.66% |
-|QuoraRetrieval|74.23 |80.73 | 86.64/86.91 |84.14/84.40 | 87.76/88.10 |88.39/**88.54** | 87.42 | 7.81/9.67% |
-|SCIDOCS |14.68 |21.03 | 20.49/20.06 |16.48/16.48 | **20.51**/20.19 |20.34/20.03 |21.10 | -1.00/-4.75% |
-|SciFact|58.42 |73.16 | 73.28/75.08 | 69.08/69.69 | 72.73/73.62 |73.08/**75.33** |73.55 | 2.17/2.96% |
-|Touche2020 |29.92 |**32.65** | 31.86/34.26 |29.76/29.93 | 30.47/29.30 |31.51/30.98 |31.47 | -1.67/-5.11% |
-|TRECCOVID |52.02 |78.92 | 77.78/79.12 |75.59/76.95 | 80.34/81.19 |81.97/**83.01** |79.65 | 4.09/5.18% |
-|Average |38.32 |54.24 |54.64/55.12 |51.28/51.62| 55.74/55.94 |56.15/**56.47** | 54.91 |2.23/4.11%|
-
-![](./mteb_ndcg_plot.png)
+| Name | ES | VS | ES+VS/ES+VS_n | ES+RR/ES+RR_n | VS+RR/VS+RR_n | ES+VS+RR/ES+VS+RR_n | Ref | Delta/% |
+| -------------- | ----- | --------- | ------------- | ------------- | --------------- | ------------------- | ----- | ------------ |
+| ArguAna | 42.93 | 56.49 | 56.68/57.27 | 47.45/48.21 | 56.32/56.44 | 56.81/**57.28** | 56.44 | 0.79/1.39% |
+| ClimateFEVER | 18.10 | 39.12 | 39.21/39.01 | 28.20/28.34 | 39.06/38.71 | 39.11/**39.25** | 39.37 | 0.13/0.33% |
+| CQADupstack | 25.13 | 42.23 | 42.40/42.51 | 37.68/37.54 | 43.92/44.25 | 43.85/**44.32** | 43.81 | 2.09/4.94% |
+| DBPedia | 27.42 | 44.66 | 45.26/44.26 | 47.94/48.26 | 48.62/49.08 | 48.79/**49.13** | 44.73 | 4.47/10.00% |
+| FEVER | 72.80 | 88.90 | 89.29/90.05 | 84.38/84.94 | 89.84/90.30 | 90.21/**91.00** | 89.02 | 2.10/2.36% |
+| FiQA2018 | 23.89 | 42.29 | 42.57/42.79 | 36.62/36.31 | 43.04/43.09 | 43.19/**43.22** | 42.4 | 0.93/2.19% |
+| HotpotQA | 54.94 | 73.65 | 74.74/75.01 | 74.93/75.39 | 77.64/78.07 | 77.95/**78.37** | 73.65 | 4.72/6.40% |
+| MSMARCO | 21.84 | 41.77 | 41.65/41.72 | 46.93/47.15 | 47.11/**47.24** | 47.09/47.23 | 41.77 | 5.46/13.07% |
+| NFCorpus | 31.40 | 36.74 | 37.37/37.63 | 34.51/35.36 | 37.32/37.31 | **37.70**/37.15 | 36.77 | 0.41/1.11% |
+| NQ | 27.21 | 61.33 | 60.51/61.20 | 55.60/55.47 | 61.50/62.24 | 62.27/**62.35** | 62.43 | 1.02/1.66% |
+| QuoraRetrieval | 74.23 | 80.73 | 86.64/86.91 | 84.14/84.40 | 87.76/88.10 | 88.39/**88.54** | 87.42 | 7.81/9.67% |
+| SCIDOCS | 14.68 | 21.03 | 20.49/20.06 | 16.48/16.48 | **20.51**/20.19 | 20.34/20.03 | 21.10 | -1.00/-4.75% |
+| SciFact | 58.42 | 73.16 | 73.28/75.08 | 69.08/69.69 | 72.73/73.62 | 73.08/**75.33** | 73.55 | 2.17/2.96% |
+| Touche2020 | 29.92 | **32.65** | 31.86/34.26 | 29.76/29.93 | 30.47/29.30 | 31.51/30.98 | 31.47 | -1.67/-5.11% |
+| TRECCOVID | 52.02 | 78.92 | 77.78/79.12 | 75.59/76.95 | 80.34/81.19 | 81.97/**83.01** | 79.65 | 4.09/5.18% |
+| Average | 38.32 | 54.24 | 54.64/55.12 | 51.28/51.62 | 55.74/55.94 | 56.15/**56.47** | 54.91 | 2.23/4.11% |
+
+import MtedResult from "./mted_result.json"
+import MtebChart from "@/components/mteb-chart"
+
+
+<MtebChart
+  data={MtedResult.map((item) => ({
+    name: item.Name,
+    es: item.ES.replaceAll("*", ""),
+    vs: item.VS.replaceAll("*", ""),
+    "ES+VS+RR_n": item["ES+VS+RR/ES+VS+RR_n"].split("/")[1].replaceAll("*", ""),
+  }))}
+/>
+
Here are the observations from the experiment results.
@@ -102,16 +115,14 @@ Vector search with [snowflake-arctic-embed-m](https://github.com/Snowflake-Labs/
For datasets which have training data (FEVER, FiQA2018, HotpotQA, MSMARCO, NFCorpus, and SciFact), the combination of elasticsearch, vector search, and a reranker via xgboost models is more beneficial, as shown in the following table.
-|Name | VS | ES+VS+RR_n | Delta | Delta%|
-|---|---|---|---|---|
-|FEVER |88.9| 91.00|2.10| 2.36|
-|FiQA2018|42.29|43.22|0.93 |2.19 |
-|HotpotQA |73.65 | 78.37 | 4.72 |6.4|
-|MSMARCO | 41.77 | 47.23 |5.46 |13.07|
-|NFCorpus |36.74|37.15|0.41 |1.11|
-|SciFact| 73.16| 75.33| 2.17 |2.96|
-|Average | 59.41 | 62.05 | 2.63 | 4.68 |
+| Name | VS | ES+VS+RR_n | Delta | Delta% |
+| -------- | ----- | ---------- | ----- | ------ |
+| FEVER | 88.9 | 91.00 | 2.10 | 2.36 |
+| FiQA2018 | 42.29 | 43.22 | 0.93 | 2.19 |
+| HotpotQA | 73.65 | 78.37 | 4.72 | 6.4 |
+| MSMARCO | 41.77 | 47.23 | 5.46 | 13.07 |
+| NFCorpus | 36.74 | 37.15 | 0.41 | 1.11 |
+| SciFact | 73.16 | 75.33 | 2.17 | 2.96 |
+| Average | 59.41 | 62.05 | 2.63 | 4.68 |
The ES+VS+RR_n model improves the vector search NDCG@10 baseline by 2.63 absolute and 4.68% relative on these six datasets. It is worth noting that, on the widely used benchmark dataset MSMARCO, ES+VS+RR_n achieves a significant relative NDCG@10 gain of 13.07% over the vector search baseline.
-
-
diff --git a/www/content/docs/experiments/mted_result.json b/www/content/docs/experiments/mted_result.json
new file mode 100644
index 0000000..0cedc4e
--- /dev/null
+++ b/www/content/docs/experiments/mted_result.json
@@ -0,0 +1 @@
+[{"Name":"ArguAna","ES":"42.93","VS":"56.49","ES+VS/ES+VS_n":"56.68/57.27","ES+RR/ES+RR_n":"47.45/48.21","VS+RR/VS+RR_n":"56.32/56.44","ES+VS+RR/ES+VS+RR_n":"56.81/**57.28**","Ref":"56.44","Delta/%":"0.79/1.39%"},{"Name":"ClimateFEVER","ES":"18.10","VS":"39.12","ES+VS/ES+VS_n":"39.21/39.01","ES+RR/ES+RR_n":"28.20/28.34","VS+RR/VS+RR_n":"39.06/38.71","ES+VS+RR/ES+VS+RR_n":"39.11/**39.25**","Ref":"39.37","Delta/%":"0.13/0.33%"},{"Name":"CQADupstack","ES":"25.13","VS":"42.23","ES+VS/ES+VS_n":"42.40/42.51","ES+RR/ES+RR_n":"37.68/37.54","VS+RR/VS+RR_n":"43.92/44.25","ES+VS+RR/ES+VS+RR_n":"43.85/**44.32**","Ref":"43.81","Delta/%":"2.09/4.94%"},{"Name":"DBPedia","ES":"27.42","VS":"44.66","ES+VS/ES+VS_n":"45.26/44.26","ES+RR/ES+RR_n":"47.94/48.26","VS+RR/VS+RR_n":"48.62/49.08","ES+VS+RR/ES+VS+RR_n":"48.79/**49.13**","Ref":"44.73","Delta/%":"4.47/10.00%"},{"Name":"FEVER","ES":"72.80","VS":"88.90","ES+VS/ES+VS_n":"89.29/90.05","ES+RR/ES+RR_n":"84.38/84.94","VS+RR/VS+RR_n":"89.84/90.30","ES+VS+RR/ES+VS+RR_n":"90.21/**91.00**","Ref":"89.02","Delta/%":"2.10/2.36%"},{"Name":"FiQA2018","ES":"23.89","VS":"42.29","ES+VS/ES+VS_n":"42.57/42.79","ES+RR/ES+RR_n":"36.62/36.31","VS+RR/VS+RR_n":"43.04/43.09","ES+VS+RR/ES+VS+RR_n":"43.19/**43.22**","Ref":"42.4","Delta/%":"0.93/2.19%"},{"Name":"HotpotQA","ES":"54.94","VS":"73.65","ES+VS/ES+VS_n":"74.74/75.01","ES+RR/ES+RR_n":"74.93/75.39","VS+RR/VS+RR_n":"77.64/78.07","ES+VS+RR/ES+VS+RR_n":"77.95/**78.37**","Ref":"73.65","Delta/%":"4.72/6.40%"},{"Name":"MSMARCO","ES":"21.84","VS":"41.77","ES+VS/ES+VS_n":"41.65/41.72","ES+RR/ES+RR_n":"46.93/47.15","VS+RR/VS+RR_n":"47.11/**47.24**","ES+VS+RR/ES+VS+RR_n":"47.09/47.23","Ref":"41.77","Delta/%":"5.46/13.07%"},{"Name":"NFCorpus","ES":"31.40","VS":"36.74","ES+VS/ES+VS_n":"37.37/37.63","ES+RR/ES+RR_n":"34.51/35.36","VS+RR/VS+RR_n":"37.32/37.31","ES+VS+RR/ES+VS+RR_n":"**37.70**/37.15","Ref":"36.77","Delta/%":"0.41/1.11%"},{"Name":"NQ","ES":"27.21","VS":"61.33","ES+VS/ES+VS_n":"60.51/61.20","ES+RR/ES+RR_n":"55.60/55.47","VS+RR/VS+RR_n":"61.50/62.24","ES+VS+RR/ES+VS+RR_n":"62.27/**62.35**","Ref":"62.43","Delta/%":"1.02/1.66%"},{"Name":"QuoraRetrieval","ES":"74.23","VS":"80.73","ES+VS/ES+VS_n":"86.64/86.91","ES+RR/ES+RR_n":"84.14/84.40","VS+RR/VS+RR_n":"87.76/88.10","ES+VS+RR/ES+VS+RR_n":"88.39/**88.54**","Ref":"87.42","Delta/%":"7.81/9.67%"},{"Name":"SCIDOCS","ES":"14.68","VS":"21.03","ES+VS/ES+VS_n":"20.49/20.06","ES+RR/ES+RR_n":"16.48/16.48","VS+RR/VS+RR_n":"**20.51**/20.19","ES+VS+RR/ES+VS+RR_n":"20.34/20.03","Ref":"21.10","Delta/%":"-1.00/-4.75%"},{"Name":"SciFact","ES":"58.42","VS":"73.16","ES+VS/ES+VS_n":"73.28/75.08","ES+RR/ES+RR_n":"69.08/69.69","VS+RR/VS+RR_n":"72.73/73.62","ES+VS+RR/ES+VS+RR_n":"73.08/**75.33**","Ref":"73.55","Delta/%":"2.17/2.96%"},{"Name":"Touche2020","ES":"29.92","VS":"**32.65**","ES+VS/ES+VS_n":"31.86/34.26","ES+RR/ES+RR_n":"29.76/29.93","VS+RR/VS+RR_n":"30.47/29.30","ES+VS+RR/ES+VS+RR_n":"31.51/30.98","Ref":"31.47","Delta/%":"-1.67/-5.11%"},{"Name":"TRECCOVID","ES":"52.02","VS":"78.92","ES+VS/ES+VS_n":"77.78/79.12","ES+RR/ES+RR_n":"75.59/76.95","VS+RR/VS+RR_n":"80.34/81.19","ES+VS+RR/ES+VS+RR_n":"81.97/**83.01**","Ref":"79.65","Delta/%":"4.09/5.18%"},{"Name":"Average","ES":"38.32","VS":"54.24","ES+VS/ES+VS_n":"54.64/55.12","ES+RR/ES+RR_n":"51.28/51.62","VS+RR/VS+RR_n":"55.74/55.94","ES+VS+RR/ES+VS+RR_n":"56.15/**56.47**","Ref":"54.91","Delta/%":"2.23/4.11%"}]
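Cells like `"56.81/**57.28**"` pack two scores (the plain and `_n` model variants) plus Markdown bold markers into one string. Extracting the `_n` score for charting can be sketched in Python; the helper name is ours, for illustration only.

```python
import json

def score_n(cell):
    """Take the second (normalized, _n) score and strip Markdown bold markers."""
    return float(cell.split("/")[1].replace("*", ""))

row = json.loads('{"Name": "ArguAna", "ES+VS+RR/ES+VS+RR_n": "56.81/**57.28**"}')
print(score_n(row["ES+VS+RR/ES+VS+RR_n"]))  # 57.28
```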
diff --git a/www/content/docs/experiments/training.mdx b/www/content/docs/experiments/training.mdx
index 5e30bf7..86a9c91 100644
--- a/www/content/docs/experiments/training.mdx
+++ b/www/content/docs/experiments/training.mdx
@@ -51,7 +51,9 @@ We first build an elasticsearch index, a vector index and a reranker using scifa
The Denser retriever is illustrated in the following diagram, with the top and bottom boxes describing the training and inference respectively. For each query in the training data, we query elasticsearch and vector database to retrieve two sets of topk (100) passages respectively. We note that these two sets may overlap. We then apply a ML reranker to rerank the passages returned from elasticsearch and vector search.
- ![](./RAG-flow-data.png)
+import RagFlowData from './rag-flow-data.png'
+
+
Let's consider a query and two passages in the following. The first passage is annotated with label 1 (relevant) while the second is 0 (irrelevant).
diff --git a/www/package.json b/www/package.json
index 37c5c70..4aa399a 100644
--- a/www/package.json
+++ b/www/package.json
@@ -24,6 +24,7 @@
"react": "18.2.0",
"react-dom": "18.2.0",
"react-use": "^17.5.0",
+ "recharts": "^2.12.7",
"sharp": "^0.33.4",
"shiki": "1.2.1",
"sst": "^3.0.13",