diff --git a/README.md b/README.md index b46cd33..ef67e99 100644 --- a/README.md +++ b/README.md @@ -29,7 +29,7 @@ xgboost)** machine learning technique to combines: We show that the combinations can significantly improve the retrieval accuracy on MTEB benchmarks when compared to each individual retrievers. -![](./www/pages/docs/experiments/RAG-flow-data.png) +![](./www/content/docs/experiments/rag-flow-data.png) ## 🚀 Features diff --git a/bun.lockb b/bun.lockb index 9a2b25d..5099b16 100755 Binary files a/bun.lockb and b/bun.lockb differ diff --git a/www/app/global.css b/www/app/global.css index 65f25f3..c9e641f 100644 --- a/www/app/global.css +++ b/www/app/global.css @@ -542,3 +542,7 @@ details > summary::before { details[open] > summary::before { transform: rotate(90deg); } + +.prose table th { + @apply text-left; +} diff --git a/www/components/mteb-chart.tsx b/www/components/mteb-chart.tsx new file mode 100644 index 0000000..f57d39f --- /dev/null +++ b/www/components/mteb-chart.tsx @@ -0,0 +1,77 @@ +"use client" + +import { + Bar, + BarChart, + CartesianGrid, + Legend, + Rectangle, + ResponsiveContainer, + Tooltip, + XAxis, + YAxis, +} from "recharts" + +export interface MtebChartProps { + data: any +} + +export default function MtebChart({ data }: MtebChartProps) { + return ( + + + + + + + + { + if (value === "evrn") return "ES+VS+RR_n" + return value.toUpperCase() + }} + wrapperStyle={{ + paddingTop: "1rem", + fontSize: "12px", + }} + /> + + + + + + ) +} diff --git a/www/content/docs/experiments/RAG-flow-data.png b/www/content/docs/experiments/RAG-flow-data.png index e6a56e4..a44af43 100644 Binary files a/www/content/docs/experiments/RAG-flow-data.png and b/www/content/docs/experiments/RAG-flow-data.png differ diff --git a/www/content/docs/experiments/mteb_ndcg_plot.png b/www/content/docs/experiments/mteb_ndcg_plot.png deleted file mode 100644 index 8fb00ec..0000000 Binary files a/www/content/docs/experiments/mteb_ndcg_plot.png and /dev/null differ diff --git 
a/www/content/docs/experiments/mteb_retrieval.mdx b/www/content/docs/experiments/mteb_retrieval.mdx
index a1f9463..c71ff7b 100644
--- a/www/content/docs/experiments/mteb_retrieval.mdx
+++ b/www/content/docs/experiments/mteb_retrieval.mdx
@@ -3,35 +3,36 @@ title: MTEB Retrieval Experiments
---
- We run this experiment on a **server**, which requires ES and Milvus installations specified [here](/docs/install/install-server).
+ We run this experiment on a **server**, which requires ES and Milvus
+ installations specified [here](/docs/install/install-server).
-
-
## MTEB datasets
+
MTEB [retrieval datasets](https://github.com/embeddings-benchmark/mteb) consist of 15 datasets. The dataset statistics, including name, corpus size, and train/dev/test query sizes, are listed in the following table.
-|Name |#Corpus | #Train Query | #Dev Query | #Test Query|
-|---|---|---|---|---|
-|ArguAna |8,674 |0 |0 |1,406|
-|ClimateFEVER |5,416,593 | 0 |0 |1,535|
-|CQADupstack |457,199 |0 |0 |9,963|
-|DBPedia |4,635,922|0 |67 |400 |
-|FEVER |5,416,568 |109,810| 6,666 |6,666 |
-|FiQA2018 |57,638| 5,500 |500 |648|
-|HotpotQA| 5,233,329 |85,000 | 5,447 |7,405 |
-|MSMARCO |8,841,823| 502,939 |6,980 |43|
-|NFCorpus |3,633| 2,590 |324 |323 |
-|NQ| 2,681,468| 0 | 0 |3,452 |
-|QuoraRetrieval |522,931| 0 |5,000 |10,000|
-|SCIDOCS |25,657| 0 |0 |1,000 |
-|SciFact|5,183| 809 |0 |300 |
-|Touche2020 |382,545| 0 |0 |49 |
-|TRECCOVID | 171,332| 0 |0 |50 |
+| Name           | #Corpus   | #Train Query | #Dev Query | #Test Query |
+| -------------- | --------- | ------------ | ---------- | ----------- |
+| ArguAna        | 8,674     | 0            | 0          | 1,406       |
+| ClimateFEVER   | 5,416,593 | 0            | 0          | 1,535       |
+| CQADupstack    | 457,199   | 0            | 0          | 9,963       |
+| DBPedia        | 4,635,922 | 0            | 67         | 400         |
+| FEVER          | 5,416,568 | 109,810      | 6,666      | 6,666       |
+| FiQA2018       | 57,638    | 5,500        | 500        | 648         |
+| HotpotQA       | 5,233,329 | 85,000       | 5,447      | 7,405       |
+| MSMARCO        | 8,841,823 | 502,939      | 6,980      | 43          |
+| NFCorpus       | 3,633     | 2,590        | 324        | 323         |
+| NQ             | 2,681,468 | 0            | 0          | 3,452       |
+| QuoraRetrieval | 522,931   | 0            | 5,000      | 10,000      |
+| SCIDOCS        | 25,657    | 0            | 0          | 1,000       |
+| SciFact        | 5,183     | 809          | 0          | 300         |
+| Touche2020     | 382,545   | 0            | 0          | 49          |
+| TRECCOVID      | 171,332   | 0            | 0          | 50          |

## Train and test xgboost models

For each dataset in [MTEB](https://github.com/embeddings-benchmark/mteb), we trained an xgboost model on the training dataset and tested on the test dataset. To speed up the experiments, we used up to 10k queries per dataset in training (`max_query_size: 10000` in `config_server.yaml`). For datasets which do not have training data, we used the development data to train. If neither training nor development data exists, we applied 3-fold cross-validation: we randomly split the test data into three folds, used two folds to train an xgboost model, and tested on the third fold. We applied this process three times so that the whole test dataset could be evaluated. We fixed the xgboost model training with the following settings. Specifically, we used the ndcg metric as the model update objective, a moderate learning rate (`eta`) of 0.1, a regularization parameter (`gamma`) of 1.0, a `min_child_weight` of 0.1, a maximum tree depth of 6, and ndcg@10 as the evaluation metric. We used a fixed number (100) of boosting iterations (`num_boost_round`), thus not attempting to optimize the training per dataset.
+
```python
params = {
    "objective": "rank:ndcg",
@@ -42,6 +43,7 @@ params = {
    "eval_metric": "ndcg@10",
}
```
+
The source code for the experiment can be found at [train_and_test.py](https://github.com/denser-org/denser-retriever/blob/main/experiments/train_and_test.py). We ran the following command to train 8 xgboost models (ES+VS, ES+RR, VS+RR, ES+VS+RR, ES+VS_n, ES+RR_n, VS+RR_n, and ES+VS+RR_n) using MSMARCO training data. The definitions of these 8 models can be found at [training](./training). The parameters are dataset_name, train split, and test split, respectively.
```shell
```shell @@ -69,32 +71,43 @@ We can then evaluate the MTEB dataset (MSMARCO as an example) by running: ```shell poetry run python experiments/test.py mteb/msmarco ``` + We will get the ndcg@10 score after the evaluation. ## Experiment results We list the ndcg@10 scores of different models in the following table. Ref is the reference ndcg@10 of `snowflake-arctic-embed-m` from Huggingface [leaderboard](https://huggingface.co/spaces/mteb/leaderboard), which is consistent with our reported VS accuracy. The bold numbers are the highest accuracy per dataset in our experiments. We use VS instead of Ref as the vector search baseline. Delta and % are the ndcg@10 absolute and relative gains of ES+VS+RR_n model compared to VS baseline. -|Name | ES | VS | ES+VS/ES+VS_n | ES+RR/ES+RR_n | VS+RR/VS+RR_n | ES+VS+RR/ES+VS+RR_n |Ref | Delta/%| -|---|---|---|---|---|---|---|---|---| -|ArguAna |42.93 |56.49 | 56.68/57.27 |47.45/48.21 | 56.32/56.44 |56.81/**57.28** |56.44 | 0.79/1.39% | -|ClimateFEVER |18.10 |39.12 | 39.21/39.01 |28.20/28.34 | 39.06/38.71 |39.11/**39.25** |39.37 |0.13/0.33%| -|CQADupstack |25.13 | 42.23 | 42.40/42.51 |37.68/37.54 | 43.92/44.25 |43.85/**44.32** |43.81 | 2.09/4.94%| -|DBPedia |27.42 |44.66 | 45.26/44.26 |47.94/48.26 | 48.62/49.08 |48.79/**49.13** |44.73 | 4.47/10.00% | -|FEVER |72.80 |88.90 | 89.29/90.05 |84.38/84.94 | 89.84/90.30 |90.21/**91.00** |89.02 | 2.10/2.36%| -|FiQA2018 |23.89 |42.29 | 42.57/42.79 |36.62/36.31 | 43.04/43.09 |43.19/**43.22** |42.4 | 0.93/2.19%| -|HotpotQA|54.94 |73.65 | 74.74/75.01 |74.93/75.39 | 77.64/78.07 |77.95/**78.37** |73.65 |4.72/6.40% | -|MSMARCO |21.84 |41.77 | 41.65/41.72 |46.93/47.15 | 47.11/**47.24** |47.09/47.23 | 41.77 | 5.46/13.07% | -|NFCorpus |31.40 |36.74 | 37.37/37.63 |34.51/35.36 | 37.32/37.31 |**37.70**/37.15 |36.77 | 0.41/1.11% | -|NQ|27.21 |61.33 | 60.51/61.20 |55.60/55.47 | 61.50/62.24 |62.27/**62.35** | 62.43 | 1.02/1.66% | -|QuoraRetrieval|74.23 |80.73 | 86.64/86.91 |84.14/84.40 | 87.76/88.10 
|88.39/**88.54** | 87.42 | 7.81/9.67% | -|SCIDOCS |14.68 |21.03 | 20.49/20.06 |16.48/16.48 | **20.51**/20.19 |20.34/20.03 |21.10 | -1.00/-4.75% | -|SciFact|58.42 |73.16 | 73.28/75.08 | 69.08/69.69 | 72.73/73.62 |73.08/**75.33** |73.55 | 2.17/2.96% | -|Touche2020 |29.92 |**32.65** | 31.86/34.26 |29.76/29.93 | 30.47/29.30 |31.51/30.98 |31.47 | -1.67/-5.11% | -|TRECCOVID |52.02 |78.92 | 77.78/79.12 |75.59/76.95 | 80.34/81.19 |81.97/**83.01** |79.65 | 4.09/5.18% | -|Average |38.32 |54.24 |54.64/55.12 |51.28/51.62| 55.74/55.94 |56.15/**56.47** | 54.91 |2.23/4.11%| - -![](./mteb_ndcg_plot.png) +| Name | ES | VS | ES+VS/ES+VS_n | ES+RR/ES+RR_n | VS+RR/VS+RR_n | ES+VS+RR/ES+VS+RR_n | Ref | Delta/% | +| -------------- | ----- | --------- | ------------- | ------------- | --------------- | ------------------- | ----- | ------------ | +| ArguAna | 42.93 | 56.49 | 56.68/57.27 | 47.45/48.21 | 56.32/56.44 | 56.81/**57.28** | 56.44 | 0.79/1.39% | +| ClimateFEVER | 18.10 | 39.12 | 39.21/39.01 | 28.20/28.34 | 39.06/38.71 | 39.11/**39.25** | 39.37 | 0.13/0.33% | +| CQADupstack | 25.13 | 42.23 | 42.40/42.51 | 37.68/37.54 | 43.92/44.25 | 43.85/**44.32** | 43.81 | 2.09/4.94% | +| DBPedia | 27.42 | 44.66 | 45.26/44.26 | 47.94/48.26 | 48.62/49.08 | 48.79/**49.13** | 44.73 | 4.47/10.00% | +| FEVER | 72.80 | 88.90 | 89.29/90.05 | 84.38/84.94 | 89.84/90.30 | 90.21/**91.00** | 89.02 | 2.10/2.36% | +| FiQA2018 | 23.89 | 42.29 | 42.57/42.79 | 36.62/36.31 | 43.04/43.09 | 43.19/**43.22** | 42.4 | 0.93/2.19% | +| HotpotQA | 54.94 | 73.65 | 74.74/75.01 | 74.93/75.39 | 77.64/78.07 | 77.95/**78.37** | 73.65 | 4.72/6.40% | +| MSMARCO | 21.84 | 41.77 | 41.65/41.72 | 46.93/47.15 | 47.11/**47.24** | 47.09/47.23 | 41.77 | 5.46/13.07% | +| NFCorpus | 31.40 | 36.74 | 37.37/37.63 | 34.51/35.36 | 37.32/37.31 | **37.70**/37.15 | 36.77 | 0.41/1.11% | +| NQ | 27.21 | 61.33 | 60.51/61.20 | 55.60/55.47 | 61.50/62.24 | 62.27/**62.35** | 62.43 | 1.02/1.66% | +| QuoraRetrieval | 74.23 | 80.73 | 86.64/86.91 | 
84.14/84.40 | 87.76/88.10 | 88.39/**88.54** | 87.42 | 7.81/9.67% | +| SCIDOCS | 14.68 | 21.03 | 20.49/20.06 | 16.48/16.48 | **20.51**/20.19 | 20.34/20.03 | 21.10 | -1.00/-4.75% | +| SciFact | 58.42 | 73.16 | 73.28/75.08 | 69.08/69.69 | 72.73/73.62 | 73.08/**75.33** | 73.55 | 2.17/2.96% | +| Touche2020 | 29.92 | **32.65** | 31.86/34.26 | 29.76/29.93 | 30.47/29.30 | 31.51/30.98 | 31.47 | -1.67/-5.11% | +| TRECCOVID | 52.02 | 78.92 | 77.78/79.12 | 75.59/76.95 | 80.34/81.19 | 81.97/**83.01** | 79.65 | 4.09/5.18% | +| Average | 38.32 | 54.24 | 54.64/55.12 | 51.28/51.62 | 55.74/55.94 | 56.15/**56.47** | 54.91 | 2.23/4.11% | + +import MtedResult from "./mted_result.json" +import MtebChart from "@/components/mteb-chart" + +
+<MtebChart
+  data={MtedResult.map((item) => ({
+    name: item.Name,
+    es: item.ES.replaceAll("*", ""),
+    vs: item.VS.replaceAll("*", ""),
+    "ES+VS+RR_n": item["ES+VS+RR/ES+VS+RR_n"].split("/")[1].replaceAll("*", ""),
+  }))}
+/>
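The chart data above is reshaped from `mted_result.json` before plotting: the markdown bold markers (`**`) are stripped and only the normalized (post-`/`) ES+VS+RR_n score is kept. As a reference outside the TSX, here is a hedged Python sketch of the same cleanup; the conversion to floats is our addition for convenience (the component itself keeps strings), and the sample record is copied from `mted_result.json`.

```python
def chart_row(item: dict) -> dict:
    """Mirror the MDX mapping: strip bold markers and keep the
    normalized (after the slash) ES+VS+RR_n score as a number."""
    clean = lambda s: float(s.replace("*", ""))
    return {
        "name": item["Name"],
        "es": clean(item["ES"]),
        "vs": clean(item["VS"]),
        "ES+VS+RR_n": clean(item["ES+VS+RR/ES+VS+RR_n"].split("/")[1]),
    }

# Sample record copied from mted_result.json
arguana = {
    "Name": "ArguAna",
    "ES": "42.93",
    "VS": "56.49",
    "ES+VS+RR/ES+VS+RR_n": "56.81/**57.28**",
}
print(chart_row(arguana))
```

This yields one flat dict per dataset with numeric scores, which is the shape a bar chart expects.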
Here are the observations from the experiment results.
@@ -102,16 +115,14 @@ Vector search with [snowflake-arctic-embed-m](https://github.com/Snowflake-Labs/
For datasets which have training data (FEVER, FiQA2018, HotpotQA, MSMARCO, NFCorpus, and SciFact), the combinations of elasticsearch, vector search, and reranker via xgboost models are more beneficial, as the following table shows.
-|Name | VS | ES+VS+RR_n | Delta | Delta%|
-|---|---|---|---|---|
-|FEVER |88.9| 91.00|2.10| 2.36|
-|FiQA2018|42.29|43.22|0.93 |2.19 |
-|HotpotQA |73.65 | 78.37 | 4.72 |6.4|
-|MSMARCO | 41.77 | 47.23 |5.46 |13.07|
-|NFCorpus |36.74|37.15|0.41 |1.11|
-|SciFact| 73.16| 75.33| 2.17 |2.96|
-|Average | 59.41 | 62.05 | 2.63 | 4.68 |
+| Name     | VS    | ES+VS+RR_n | Delta | Delta% |
+| -------- | ----- | ---------- | ----- | ------ |
+| FEVER    | 88.9  | 91.00      | 2.10  | 2.36   |
+| FiQA2018 | 42.29 | 43.22      | 0.93  | 2.19   |
+| HotpotQA | 73.65 | 78.37      | 4.72  | 6.4    |
+| MSMARCO  | 41.77 | 47.23      | 5.46  | 13.07  |
+| NFCorpus | 36.74 | 37.15      | 0.41  | 1.11   |
+| SciFact  | 73.16 | 75.33      | 2.17  | 2.96   |
+| Average  | 59.41 | 62.05      | 2.63  | 4.68   |
The ES+VS+RR_n model improves the vector search NDCG@10 baseline by 2.63 points absolute and 4.68% relative on these six datasets. It is worth noting that, on the widely used benchmark dataset MSMARCO, ES+VS+RR_n yields a significant relative NDCG@10 gain of 13.07% over the vector search baseline.
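The Delta and Delta% columns are plain arithmetic over the table values: the absolute gain of ES+VS+RR_n over the VS baseline, and that gain as a percentage of the baseline. A quick sanity-check sketch (the helper name is ours, not from the repo):

```python
def gains(baseline: float, combined: float) -> tuple:
    """Absolute and relative ndcg@10 gain of a combined model over a baseline."""
    delta = combined - baseline
    return round(delta, 2), round(100 * delta / baseline, 2)

# MSMARCO: VS = 41.77, ES+VS+RR_n = 47.23
print(gains(41.77, 47.23))  # (5.46, 13.07)
```

This reproduces the 5.46 absolute and 13.07% relative gain reported for MSMARCO, and the same check works for any row of the table.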
- - diff --git a/www/content/docs/experiments/mted_result.json b/www/content/docs/experiments/mted_result.json new file mode 100644 index 0000000..0cedc4e --- /dev/null +++ b/www/content/docs/experiments/mted_result.json @@ -0,0 +1 @@ +[{"Name":"ArguAna","ES":"42.93","VS":"56.49","ES+VS/ES+VS_n":"56.68/57.27","ES+RR/ES+RR_n":"47.45/48.21","VS+RR/VS+RR_n":"56.32/56.44","ES+VS+RR/ES+VS+RR_n":"56.81/**57.28**","Ref":"56.44","Delta/%":"0.79/1.39%"},{"Name":"ClimateFEVER","ES":"18.10","VS":"39.12","ES+VS/ES+VS_n":"39.21/39.01","ES+RR/ES+RR_n":"28.20/28.34","VS+RR/VS+RR_n":"39.06/38.71","ES+VS+RR/ES+VS+RR_n":"39.11/**39.25**","Ref":"39.37","Delta/%":"0.13/0.33%"},{"Name":"CQADupstack","ES":"25.13","VS":"42.23","ES+VS/ES+VS_n":"42.40/42.51","ES+RR/ES+RR_n":"37.68/37.54","VS+RR/VS+RR_n":"43.92/44.25","ES+VS+RR/ES+VS+RR_n":"43.85/**44.32**","Ref":"43.81","Delta/%":"2.09/4.94%"},{"Name":"DBPedia","ES":"27.42","VS":"44.66","ES+VS/ES+VS_n":"45.26/44.26","ES+RR/ES+RR_n":"47.94/48.26","VS+RR/VS+RR_n":"48.62/49.08","ES+VS+RR/ES+VS+RR_n":"48.79/**49.13**","Ref":"44.73","Delta/%":"4.47/10.00%"},{"Name":"FEVER","ES":"72.80","VS":"88.90","ES+VS/ES+VS_n":"89.29/90.05","ES+RR/ES+RR_n":"84.38/84.94","VS+RR/VS+RR_n":"89.84/90.30","ES+VS+RR/ES+VS+RR_n":"90.21/**91.00**","Ref":"89.02","Delta/%":"2.10/2.36%"},{"Name":"FiQA2018","ES":"23.89","VS":"42.29","ES+VS/ES+VS_n":"42.57/42.79","ES+RR/ES+RR_n":"36.62/36.31","VS+RR/VS+RR_n":"43.04/43.09","ES+VS+RR/ES+VS+RR_n":"43.19/**43.22**","Ref":"42.4","Delta/%":"0.93/2.19%"},{"Name":"HotpotQA","ES":"54.94","VS":"73.65","ES+VS/ES+VS_n":"74.74/75.01","ES+RR/ES+RR_n":"74.93/75.39","VS+RR/VS+RR_n":"77.64/78.07","ES+VS+RR/ES+VS+RR_n":"77.95/**78.37**","Ref":"73.65","Delta/%":"4.72/6.40%"},{"Name":"MSMARCO","ES":"21.84","VS":"41.77","ES+VS/ES+VS_n":"41.65/41.72","ES+RR/ES+RR_n":"46.93/47.15","VS+RR/VS+RR_n":"47.11/**47.24**","ES+VS+RR/ES+VS+RR_n":"47.09/47.23","Ref":"41.77","Delta/%":"5.46/13.07%"},{"Name":"NFCorpus","ES":"31.40","VS":"36.74","ES+VS/ES+VS
_n":"37.37/37.63","ES+RR/ES+RR_n":"34.51/35.36","VS+RR/VS+RR_n":"37.32/37.31","ES+VS+RR/ES+VS+RR_n":"**37.70**/37.15","Ref":"36.77","Delta/%":"0.41/1.11%"},{"Name":"NQ","ES":"27.21","VS":"61.33","ES+VS/ES+VS_n":"60.51/61.20","ES+RR/ES+RR_n":"55.60/55.47","VS+RR/VS+RR_n":"61.50/62.24","ES+VS+RR/ES+VS+RR_n":"62.27/**62.35**","Ref":"62.43","Delta/%":"1.02/1.66%"},{"Name":"QuoraRetrieval","ES":"74.23","VS":"80.73","ES+VS/ES+VS_n":"86.64/86.91","ES+RR/ES+RR_n":"84.14/84.40","VS+RR/VS+RR_n":"87.76/88.10","ES+VS+RR/ES+VS+RR_n":"88.39/**88.54**","Ref":"87.42","Delta/%":"7.81/9.67%"},{"Name":"SCIDOCS","ES":"14.68","VS":"21.03","ES+VS/ES+VS_n":"20.49/20.06","ES+RR/ES+RR_n":"16.48/16.48","VS+RR/VS+RR_n":"**20.51**/20.19","ES+VS+RR/ES+VS+RR_n":"20.34/20.03","Ref":"21.10","Delta/%":"-1.00/-4.75%"},{"Name":"SciFact","ES":"58.42","VS":"73.16","ES+VS/ES+VS_n":"73.28/75.08","ES+RR/ES+RR_n":"69.08/69.69","VS+RR/VS+RR_n":"72.73/73.62","ES+VS+RR/ES+VS+RR_n":"73.08/**75.33**","Ref":"73.55","Delta/%":"2.17/2.96%"},{"Name":"Touche2020","ES":"29.92","VS":"**32.65**","ES+VS/ES+VS_n":"31.86/34.26","ES+RR/ES+RR_n":"29.76/29.93","VS+RR/VS+RR_n":"30.47/29.30","ES+VS+RR/ES+VS+RR_n":"31.51/30.98","Ref":"31.47","Delta/%":"-1.67/-5.11%"},{"Name":"TRECCOVID","ES":"52.02","VS":"78.92","ES+VS/ES+VS_n":"77.78/79.12","ES+RR/ES+RR_n":"75.59/76.95","VS+RR/VS+RR_n":"80.34/81.19","ES+VS+RR/ES+VS+RR_n":"81.97/**83.01**","Ref":"79.65","Delta/%":"4.09/5.18%"},{"Name":"Average","ES":"38.32","VS":"54.24","ES+VS/ES+VS_n":"54.64/55.12","ES+RR/ES+RR_n":"51.28/51.62","VS+RR/VS+RR_n":"55.74/55.94","ES+VS+RR/ES+VS+RR_n":"56.15/**56.47**","Ref":"54.91","Delta/%":"2.23/4.11%"}] diff --git a/www/content/docs/experiments/training.mdx b/www/content/docs/experiments/training.mdx index 5e30bf7..86a9c91 100644 --- a/www/content/docs/experiments/training.mdx +++ b/www/content/docs/experiments/training.mdx @@ -51,7 +51,9 @@ We first build an elasticsearch index, a vector index and a reranker using scifa The Denser retriever is 
illustrated in the following diagram, with the top and bottom boxes describing training and inference, respectively. For each query in the training data, we query elasticsearch and the vector database to retrieve two sets of topk (100) passages. We note that these two sets may overlap. We then apply an ML reranker to rerank the passages returned from elasticsearch and vector search.
- ![](./RAG-flow-data.png)
+import RagFlowData from './rag-flow-data.png'
+
+
Let's consider a query and two passages in the following. The first passage is annotated with label 1 (relevant) while the second is 0 (irrelevant).
diff --git a/www/package.json b/www/package.json
index 37c5c70..4aa399a 100644
--- a/www/package.json
+++ b/www/package.json
@@ -24,6 +24,7 @@
   "react": "18.2.0",
   "react-dom": "18.2.0",
   "react-use": "^17.5.0",
+  "recharts": "^2.12.7",
   "sharp": "^0.33.4",
   "shiki": "1.2.1",
   "sst": "^3.0.13",
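The candidate flow described in `training.mdx` — two top-100 lists from elasticsearch and vector search, possibly overlapping, merged and then reranked — can be sketched roughly as follows. This is a minimal illustration: the scoring function stands in for the actual ML reranker, and all names and scores are made up.

```python
def fuse_and_rerank(es_hits, vs_hits, score, topk=100):
    """Union the top-k passages from elasticsearch and vector search
    (the two sets may overlap), then rerank the merged candidates."""
    candidates = dict.fromkeys(es_hits[:topk] + vs_hits[:topk])  # ordered de-dup
    return sorted(candidates, key=score, reverse=True)

# Toy example with a made-up relevance score per passage id;
# note "p2" appears in both retrievers' results but is kept once.
scores = {"p1": 0.2, "p2": 0.9, "p3": 0.5, "p4": 0.7}
print(fuse_and_rerank(["p1", "p2", "p3"], ["p2", "p4"], scores.get))
# ['p2', 'p4', 'p3', 'p1']
```

In the real system the final ordering comes from the trained xgboost reranker rather than a lookup table, but the union-then-rerank structure is the same.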