This is the repository for the paper *PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval*, Findings of ACL 2021.
**News**: We have released a new repository for the RocketQA series: https://github.com/PaddlePaddle/RocketQA. It provides an easy-to-use toolkit for running and fine-tuning state-of-the-art dense retrievers, and we will continue to maintain that repository in the future.
PAIR is a novel approach to improving dense passage retrieval. Its three major technical contributions are: (1) formal formulations of the two kinds of similarity relations (query-centric and passage-centric), (2) high-quality pseudo-labeled data generated via knowledge distillation, and (3) an effective two-stage training procedure that incorporates the passage-centric similarity relation constraint. Experimental results show that PAIR significantly outperforms previous state-of-the-art models on both MSMARCO and Natural Questions (NQ).
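To give an intuition for the two similarity relations, here is a minimal conceptual sketch. This is *not* the repository's training code: the dot-product similarity, the in-batch shapes, and the weighting factor `alpha` are assumptions made only for exposition.

```python
# Illustrative sketch of the two similarity relations PAIR jointly optimizes
# (not the repository's training code; dot-product similarity and the weighting
# factor `alpha` are assumptions for exposition).
import numpy as np

def softmax_ce(pos_score, neg_scores):
    """Cross-entropy of ranking one positive score above a set of negative scores."""
    scores = np.concatenate(([pos_score], neg_scores))
    return -(pos_score - np.log(np.exp(scores).sum()))

def pair_loss(q, p_pos, p_negs, alpha=0.1):
    """q, p_pos: embedding vectors; p_negs: matrix of negative passage embeddings."""
    # Query-centric relation: s(q, p+) should exceed s(q, p-).
    l_qc = softmax_ce(q @ p_pos, p_negs @ q)
    # Passage-centric relation: s(q, p+) should exceed s(p+, p-).
    l_pc = softmax_ce(p_pos @ q, p_negs @ p_pos)
    return alpha * l_qc + (1.0 - alpha) * l_pc
```

In the repository, this objective is optimized by the dual-encoder training scripts described below; the two training stages differ mainly in the data they consume (pseudo-labeled hybrid-domain data for pre-training vs. in-domain data for fine-tuning).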
The pipeline of the PAIR training approach is shown in the figure below.

Requirements:
- Python 3.7
- PaddlePaddle 1.8 (Please refer to the Installation Guide)
- cuda >= 9.0
- cudnn >= 7.0
- faiss
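After installing these, a quick optional check that the main Python dependencies are importable (illustrative only):

```python
# Optional sanity check that the main Python dependencies are available.
import paddle
import faiss

print("PaddlePaddle version:", paddle.__version__)  # expect 1.8.x
print("faiss imported successfully")
```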
To download the raw corpora of MSMARCO & Natural Questions, as well as the preprocessed training data, run
sh wget_data.sh
The downloaded data will be saved into corpus/ (including the training and development/test sets of MSMARCO & NQ, and all the passages in MSMARCO and Wikipedia to be indexed) and data_train/ (including the preprocessed training data for the pre-training and fine-tuning stages of PAIR; a small snippet for inspecting these files follows the directory listing below).
├── corpus/
│ ├── marco # The original dataset of MSMARCO
│ │ ├── train.query.txt
│ │ ├── train.query.txt.format
│ │ ├── qrels.train.tsv
│ │ ├── dev.query.txt
│ │ ├── dev.query.txt.format
│ │ ├── qrels.dev.tsv
│ │ ├── para.txt
│ │ ├── para.title.txt
│ │ ├── para_8part # The passages are split into 8 parts to facilitate inference
│ ├── nq # The original dataset of NQ
│ │ ├── ... # (has the same directory structure as MSMARCO)
├── data_train/
│ ├── marco_pretrain.tsv # Training examples for the pre-training stage; positives and negatives of hybrid-domain queries are pseudo labels sampled via knowledge distillation from the MSMARCO corpus
│ ├── marco_finetune.tsv # Training examples for the fine-tuning stage; positives and negatives of in-domain queries are ground-truth labels and pseudo labels sampled via knowledge distillation from the MSMARCO corpus
│ ├── nq_pretrain.tsv # Training examples for the pre-training stage; positives and negatives of hybrid-domain queries are pseudo labels sampled via knowledge distillation from the Wikipedia corpus
│ ├── nq_finetune.tsv # Training examples for the fine-tuning stage; positives and negatives of in-domain queries are ground-truth labels and pseudo labels sampled via knowledge distillation from the Wikipedia corpus
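The exact column layout of the .tsv training files is not documented here, so a small hedged snippet like the following can be used to inspect what was downloaded (the file path and field handling are assumptions):

```python
# Inspect the first example of a downloaded training file without assuming its schema.
path = "data_train/marco_pretrain.tsv"  # any of the four .tsv files listed above
with open(path, encoding="utf-8") as f:
    fields = f.readline().rstrip("\n").split("\t")
print(f"{path}: first line has {len(fields)} tab-separated fields")
for i, field in enumerate(fields):
    print(f"  field {i}: {field[:60]!r}")
```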
To download our trained models and the initial pre-trained language model (ERNIE 2.0), run
sh wget_trained_model.sh
The downloaded model parameters will be saved into checkpoint/, including:
├── checkpoint/
│ ├── ernie_base_twin_init # (ERNIE 2.0 base) initial parameters for dual-encoder
│ ├── marco_finetuned_encoder # Final dual-encoder model with shared parameters on MSMARCO
│ ├── nq_finetuned_encoder # Final dual-encoder model with shared parameters on NQ
To reproduce the results of the paper, you can follow the commands in run_marco.sh / run_nq.sh. These scripts contain the entire process of PAIR; each step depends on the result of the previous step.
To pre-train a dual-encoder model, run
cd model
sh script/run_dual_encoder_train.sh $TRAIN_SET $MODEL_PATH $nodes $use_cross_batch $use_lamb true
To fine-tune a dual-encoder model (note that the final argument switches from true to false), run
cd model
sh script/run_dual_encoder_train.sh $TRAIN_SET $MODEL_PATH $nodes $use_cross_batch $use_lamb false
To run inference with the dual-encoder and get the top-K retrieval results (retrieved with FAISS), run
sh script/run_retrieval.sh $TEST_SET $MODEL_PATH $DATA_PATH $TOP_K
Here, we split the whole set of candidate passages into 8 parts and predict their embeddings on 8 GPU cards simultaneously. After getting the top-K results on each part, we merge them to obtain the final file (i.e., $recall_topk_file in Data Processing).
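For intuition, a minimal sketch of this sharded retrieval-and-merge scheme (not the internals of run_retrieval.sh; the index type, embedding shapes, and id bookkeeping are assumptions) could look like:

```python
# Sketch of sharded top-K retrieval with FAISS and merging of per-shard results.
import numpy as np
import faiss

def search_shard(passage_embs, query_embs, top_k):
    """Search one shard of passage embeddings with inner-product (dot) similarity."""
    index = faiss.IndexFlatIP(passage_embs.shape[1])
    index.add(passage_embs.astype(np.float32))
    scores, local_ids = index.search(query_embs.astype(np.float32), top_k)
    return scores, local_ids

def merge_shards(shard_results, shard_offsets, top_k):
    """shard_results: list of (scores, local_ids); shard_offsets: global id of each shard's first passage."""
    num_queries = shard_results[0][0].shape[0]
    merged = []
    for qi in range(num_queries):
        candidates = []
        for (scores, ids), offset in zip(shard_results, shard_offsets):
            candidates.extend((s, offset + i) for s, i in zip(scores[qi], ids[qi]))
        candidates.sort(key=lambda c: -c[0])
        merged.append(candidates[:top_k])  # final global top-K for query qi
    return merged
```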
Tip: remember to specify the GPU cards before training via
export CUDA_VISIBLE_DEVICES=0,1,xxx
To evaluate the models on MSMARCO development set, run
python metric/msmarco_eval.py corpus/marco/qrels.dev.tsv $recall_topk_file
To evaluate the models on NQ test set, run
python metric/nq_eval.py $recall_topk_file
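For reference, the reported metrics can be sketched as follows. This is not the repository's evaluation code, and the on-disk format of $recall_topk_file is an assumption, so the snippet works on already-parsed dictionaries:

```python
# Sketch of the reported retrieval metrics over parsed results.
# qrels: {qid: set of relevant passage ids}
# ranked: {qid: list of retrieved passage ids, best first}

def mrr_at_k(qrels, ranked, k=10):
    """Mean Reciprocal Rank at cutoff k (the MSMARCO Dev metric MRR@10)."""
    total = 0.0
    for qid, pids in ranked.items():
        for rank, pid in enumerate(pids[:k], start=1):
            if pid in qrels.get(qid, set()):
                total += 1.0 / rank
                break
    return total / len(ranked)

def recall_at_k(qrels, ranked, k):
    """Fraction of queries with at least one relevant passage in the top k (R@K)."""
    hits = sum(1 for qid, pids in ranked.items() if qrels.get(qid, set()) & set(pids[:k]))
    return hits / len(ranked)
```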
The table below shows the results of our experiments on two datasets.
Model | MSMARCO Dev MRR@10 | MSMARCO Dev R@50 | MSMARCO Dev R@1000 | NQ Test R@5 | NQ Test R@20 | NQ Test R@100
---|---|---|---|---|---|---
PAIR | 37.9 | 86.4 | 98.2 | 74.9 | 83.5 | 89.1
If you find our paper and code useful, please cite the following paper:
@inproceedings{ren2021pair,
title={PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval},
author={Ren, Ruiyang and Lv, Shangwen and Qu, Yingqi and Liu, Jing and Zhao, Wayne Xin and She, Qiaoqiao and Wu, Hua and Wang, Haifeng},
booktitle={Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021},
pages={2173--2183},
year={2021},
publisher={Association for Computational Linguistics},
}