Releases: cl-tohoku/quiz-datasets
v1.0.1
v1.0.0
Data source
Questions
abc_01-12
- 17,735 questions used in the first (2003) through 12th (2012) abc/EQIDEN quiz competitions.
aio_01_dev
- 1,992 questions used in the development set for the first AI王 competition (2020).
aio_01_test
- 2,000 questions used in the test set for the first AI王 competition.
aio_01_unused
- 608 questions prepared but unused for the first AI王 competition.
aio_02_train
- 22,335 questions distributed as the training set for the second AI王 competition (2021).
- The questions are the same as the concatenation of abc_01-12, aio_01_dev, aio_01_test, and aio_01_unused.
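As a quick consistency check, the sizes listed above for the four source sets can be summed and compared with the size listed for aio_02_train. The snippet uses only the counts stated in this section; the variable names are illustrative.

```python
# Question counts as listed above for each source set.
counts = {
    "abc_01-12": 17_735,
    "aio_01_dev": 1_992,
    "aio_01_test": 2_000,
    "aio_01_unused": 608,
}

total = sum(counts.values())
print(total)  # 22335, the size listed for aio_02_train
```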
aio_02_dev
- 1,000 questions used in the development set for the second AI王 competition.
Passages
In the following sets of passages, each passage consists of consecutive sentences, totaling no more than 400 characters, taken from Japanese Wikipedia as of 2022-04-04.
The sets differ in how many Wikipedia pages the sentences are extracted from.
jawiki-20220404-c400-small
- 394,124 passages from 28,246 pages which have at least 500 incoming links within Wikipedia.
jawiki-20220404-c400-medium
- 1,678,986 passages from 233,981 pages which have at least 100 incoming links within Wikipedia.
jawiki-20220404-c400-large
- 4,288,198 passages from 903,024 pages which have at least 10 incoming links within Wikipedia.
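The exact procedure used to build the passages is not described here. As a rough illustration only, a greedy packing of consecutive sentences under a 400-character limit might look like the sketch below. The function name `chunk_sentences`, the greedy strategy, and the handling of over-long sentences are all assumptions, not the authors' implementation.

```python
def chunk_sentences(sentences, max_chars=400):
    """Greedily pack consecutive sentences into passages of at most max_chars.

    Assumption: a single sentence longer than max_chars is kept as its own
    passage rather than being split; the real pipeline may differ.
    """
    passages = []
    current = ""
    for sentence in sentences:
        # Start a new passage when adding this sentence would exceed the limit.
        if current and len(current) + len(sentence) > max_chars:
            passages.append(current)
            current = ""
        current += sentence
    if current:
        passages.append(current)
    return passages
```

Calling `chunk_sentences` on the sentences of one page returns that page's passages in their original order.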
Datasets
The format of the datasets is described in the README.
File format: gzipped JSON Lines (.jsonl.gz)
Questions \ Passages | jawiki-20220404-c400-small | jawiki-20220404-c400-medium | jawiki-20220404-c400-large |
---|---|---|---|
abc_01-12 | Download (581 MB) | Download (521 MB) | Download (480 MB) |
aio_01_dev | Download (65.1 MB) | Download (58.6 MB) | Download (53.6 MB) |
aio_01_test | Download (65.6 MB) | Download (59 MB) | Download (54.1 MB) |
aio_01_unused | Download (19.8 MB) | Download (17.6 MB) | Download (16 MB) |
aio_02_train | Download (735 MB) | Download (660 MB) | Download (607 MB) |
aio_02_dev | Download (32.9 MB) | Download (29.6 MB) | Download (27.3 MB) |
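Because these files are gzipped JSON Lines, they can be streamed one record at a time without decompressing to disk. The following is a minimal sketch using only the Python standard library. The helper name `iter_jsonl_gz` is hypothetical, and the record fields are not assumed here since they are documented in the README.

```python
import gzip
import json


def iter_jsonl_gz(path):
    """Yield one parsed JSON object per non-empty line of a .jsonl.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, if any
                yield json.loads(line)
```

Iterating over `iter_jsonl_gz(path)` with the path of a downloaded file keeps memory use low, which matters for the larger files listed in the table.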
DPR-formatted datasets
Retriever input files
The format is the same as DPR's datasets for training retrievers (e.g., data.retriever.nq-train).
Questions without any positive passages are excluded from these datasets.
File format: gzipped JSON (.json.gz)
Questions \ Passages | jawiki-20220404-c400-small | jawiki-20220404-c400-medium | jawiki-20220404-c400-large |
---|---|---|---|
abc_01-12 | Download (405 MB) | Download (425 MB) | Download (414 MB) |
aio_01_dev | Download (49.6 MB) | Download (52.5 MB) | Download (50.9 MB) |
aio_01_test | Download (48.6 MB) | Download (52 MB) | Download (51.3 MB) |
aio_01_unused | Download (14.1 MB) | Download (15 MB) | Download (14.5 MB) |
aio_02_train | Download (517 MB) | Download (544 MB) | Download (530 MB) |
aio_02_dev | Download (23 MB) | Download (24.2 MB) | Download (23.5 MB) |
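A retriever input file can be loaded as a single JSON document after decompression. The sketch below assumes the DPR training layout, that is, a top-level list of examples, each with a `positive_ctxs` list. That key name comes from DPR's format and has not been checked against these files; the function names are hypothetical.

```python
import gzip
import json


def load_retriever_file(path):
    """Load a gzipped JSON file and return the parsed top-level object."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def summarize(examples):
    """Return (number of questions, mean number of positive passages).

    Assumes each example is a dict with a "positive_ctxs" list, as in DPR.
    """
    n = len(examples)
    total_pos = sum(len(ex.get("positive_ctxs", [])) for ex in examples)
    return n, (total_pos / n if n else 0.0)
```

Note that `json.load` reads the whole file into memory, so the largest files may need several gigabytes of RAM.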
Questions TSV files
The format is the same as DPR's datasets for validating retrievers (e.g., data.retriever.qas.nq-train).
File format: TSV (.tsv)
Questions | |
---|---|
abc_01-12 | Download (2.69 MB) |
aio_01_dev | Download (326 KB) |
aio_01_test | Download (334 KB) |
aio_01_unused | Download (104 KB) |
aio_02_train | Download (3.43 MB) |
aio_02_dev | Download (153 KB) |
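In DPR's question files, each row holds a question and a list of acceptable answers written as a Python-style list literal. Assuming these TSV files follow the same two-column layout, a reader might look like the sketch below. The function name `read_questions_tsv` is hypothetical, and `ast.literal_eval` is used here as a safer substitute for `eval`.

```python
import ast
import csv


def read_questions_tsv(path):
    """Return a list of (question, answers) pairs from a DPR-style TSV file.

    Assumes column 0 is the question text and column 1 is a list literal
    such as ["answer1", "answer2"].
    """
    pairs = []
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) < 2:
                continue  # skip malformed or empty rows
            pairs.append((row[0], ast.literal_eval(row[1])))
    return pairs
```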
Passages TSV files
The format is the same as DPR's passages file (e.g., data.wikipedia_split.psgs_w100).
File format: gzipped TSV (.tsv.gz)
Passages | |
---|---|
jawiki-20220404-c400-small | Download (113 MB) |
jawiki-20220404-c400-medium | Download (433 MB) |
jawiki-20220404-c400-large | Download (1020 MB) |
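DPR's passages file is a tab-separated table with a header row and the columns `id`, `text`, and `title`, in that order. Assuming these files follow that layout, the passages can be streamed directly from the gzipped file, as in this sketch. The function name `iter_passages` is hypothetical and the column order is taken from DPR, not verified against these files.

```python
import csv
import gzip


def iter_passages(path):
    """Yield (passage_id, text, title) tuples from a gzipped DPR-style TSV."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row or row[0] == "id":
                continue  # skip blank lines and the header row
            yield row[0], row[1], row[2]
```

Streaming avoids holding millions of passages in memory at once, which is relevant for the largest set.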