This directory contains scripts to download and process the Wikipedia DPR data created by Facebook Research. Although the code is fully functional, detailed instructions are not available yet. In a nutshell, one needs to:
- Download the data: passages and queries. We suggest placing them in a collection sub-directory such as `download` (see the first sketch after this list).
- Each DPR dataset comes with a training set, which we split into three subsets: `bitext` (regular training data), `dev` (development), and `train_fusion` (a set used to learn a fusion model). Splitting and processing the queries can be done using the following script (an illustrative split is also sketched after this list).
- Finally, passages need to be converted using `convert_pass.py` (see the last sketch after this list).