Releases: cl-tohoku/quiz-datasets

v1.0.1

30 May 09:32

Added the filtered passages files in JSON Lines format.

Passages
jawiki-20220404-c400-small Download (116 MB)
jawiki-20220404-c400-medium Download (448 MB)
jawiki-20220404-c400-large Download (1.03 GB)

v1.0.0

06 Aug 11:41

Data source

Questions

  • abc_01-12
    • 17,735 questions used in the first (2003) through 12th (2012) abc/EQIDEN quiz competitions.
  • aio_01_dev
  • aio_01_test
    • 2,000 questions used in the test set for the first AI王 competition.
  • aio_01_unused
    • 608 questions prepared but unused for the first AI王 competition.
  • aio_02_train
    • 22,335 questions distributed as the training set for the second AI王 competition (2021).
    • This set is the concatenation of abc_01-12, aio_01_dev, aio_01_test, and aio_01_unused.
  • aio_02_dev
    • 1,000 questions used in the development set for the second AI王 competition.
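Because aio_02_train is described as the concatenation of four sets, and sizes are listed for three of them, the size of aio_01_dev can be inferred by subtraction. The short sketch below does only that arithmetic; the resulting number is a derived value, not one stated in these release notes.

```python
# Sizes stated in the release notes above.
known_sizes = {
    "abc_01-12": 17_735,
    "aio_01_test": 2_000,
    "aio_01_unused": 608,
}
train_total = 22_335  # aio_02_train

# aio_02_train = abc_01-12 + aio_01_dev + aio_01_test + aio_01_unused,
# so the remaining questions must come from aio_01_dev.
implied_aio_01_dev = train_total - sum(known_sizes.values())
print(implied_aio_01_dev)
```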

Passages

In each of the following sets, a passage is a run of consecutive sentences, at most 400 characters long, taken from Japanese Wikipedia as of 2022-04-04.
The sets differ in how many Wikipedia pages the sentences are extracted from.

  • jawiki-20220404-c400-small
    • 394,124 passages from 28,246 pages which have at least 500 incoming links within Wikipedia.
  • jawiki-20220404-c400-medium
    • 1,678,986 passages from 233,981 pages which have at least 100 incoming links within Wikipedia.
  • jawiki-20220404-c400-large
    • 4,288,198 passages from 903,024 pages which have at least 10 incoming links within Wikipedia.

Datasets

The format of the datasets is described in the README.

File format: gzipped JSON Lines (.jsonl.gz)

| Questions \ Passages | jawiki-20220404-c400-small | jawiki-20220404-c400-medium | jawiki-20220404-c400-large |
|---|---|---|---|
| abc_01-12 | Download (581 MB) | Download (521 MB) | Download (480 MB) |
| aio_01_dev | Download (65.1 MB) | Download (58.6 MB) | Download (53.6 MB) |
| aio_01_test | Download (65.6 MB) | Download (59 MB) | Download (54.1 MB) |
| aio_01_unused | Download (19.8 MB) | Download (17.6 MB) | Download (16 MB) |
| aio_02_train | Download (735 MB) | Download (660 MB) | Download (607 MB) |
| aio_02_dev | Download (32.9 MB) | Download (29.6 MB) | Download (27.3 MB) |
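Since these files are gzipped JSON Lines, they can be streamed one record at a time without decompressing to disk first. The following is a minimal sketch using only the Python standard library; the function name `iter_jsonl_gz` and the file path in the usage comment are hypothetical, and the record fields are not assumed here (they are documented in the README).

```python
import gzip
import json
from typing import Iterator


def iter_jsonl_gz(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, if any
                yield json.loads(line)


# Hypothetical usage (replace the path with a downloaded file):
# for record in iter_jsonl_gz("path/to/dataset.jsonl.gz"):
#     print(sorted(record.keys()))
#     break
```

Streaming matters here because the larger files are several hundred megabytes compressed.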

DPR-formatted datasets

Retriever input files

The format is the same as that of DPR's datasets for training retrievers (e.g., data.retriever.nq-train).
Questions without any positive passages are excluded from these datasets.

File format: gzipped JSON (.json.gz)

| Questions \ Passages | jawiki-20220404-c400-small | jawiki-20220404-c400-medium | jawiki-20220404-c400-large |
|---|---|---|---|
| abc_01-12 | Download (405 MB) | Download (425 MB) | Download (414 MB) |
| aio_01_dev | Download (49.6 MB) | Download (52.5 MB) | Download (50.9 MB) |
| aio_01_test | Download (48.6 MB) | Download (52 MB) | Download (51.3 MB) |
| aio_01_unused | Download (14.1 MB) | Download (15 MB) | Download (14.5 MB) |
| aio_02_train | Download (517 MB) | Download (544 MB) | Download (530 MB) |
| aio_02_dev | Download (23 MB) | Download (24.2 MB) | Download (23.5 MB) |

Questions TSV files

The format is the same as that of DPR's datasets for validating retrievers (e.g., data.retriever.qas.nq-train).

File format: TSV (.tsv)

Questions
abc_01-12 Download (2.69 MB)
aio_01_dev Download (326 KB)
aio_01_test Download (334 KB)
aio_01_unused Download (104 KB)
aio_02_train Download (3.43 MB)
aio_02_dev Download (153 KB)
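A minimal sketch for reading a questions TSV file follows. It assumes the DPR convention of two tab-separated columns per row, a question and a list of answers written as a Python-style list literal; that column layout is taken from DPR's files and is an assumption about these ones. The function name `read_qas_tsv` is hypothetical.

```python
import ast
import csv


def read_qas_tsv(path: str) -> list:
    """Return (question, answers) pairs from a DPR-style questions TSV file."""
    pairs = []
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed rows
            question = row[0]
            # Parse the answers column safely instead of using eval().
            answers = ast.literal_eval(row[1])
            pairs.append((question, answers))
    return pairs
```

`ast.literal_eval` is used in place of `eval` so that only literals are parsed and no code in the file is executed.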

Passages TSV files

The format is the same as that of DPR's passages file (e.g., data.wikipedia_split.psgs_w100).

File format: gzipped TSV (.tsv.gz)

Passages
jawiki-20220404-c400-small Download (113 MB)
jawiki-20220404-c400-medium Download (433 MB)
jawiki-20220404-c400-large Download (1020 MB)
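The passages files can be streamed in a similar way. The sketch below assumes the DPR passages layout: tab-separated columns in the order id, text, title, with a header row whose first cell is `id`. That layout comes from DPR's psgs_w100 file and is an assumption about these files; the function name `iter_passages` is hypothetical.

```python
import csv
import gzip
from typing import Iterator, Tuple


def iter_passages(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (passage_id, text, title) from a gzipped DPR-style passages TSV."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) < 3 or row[0] == "id":
                continue  # skip the header row and malformed rows
            yield row[0], row[1], row[2]
```

Iterating lazily avoids holding all passages in memory, which is relevant for the large set with its 4,288,198 passages.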