Baselines for NTCIR Data Search
$ curl -sSL | python
$ source ~/.bash_profile
$ poetry install
Since this package uses Anserini, Java 11 and Maven 3.3+ are also required.
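Before building, it can help to confirm that the Java and Maven executables are reachable. The snippet below is a minimal sketch, not part of this package: the helper name find_missing_tools is hypothetical, and it only checks that java and mvn are on PATH, not that their versions satisfy the requirements above.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def find_missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names in `tools` that cannot be found on PATH.

    `which` is injectable so the function can be exercised without
    depending on what is installed on the current machine.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    # "java" and "mvn" are the usual executable names for Java and Maven.
    missing = find_missing_tools(["java", "mvn"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("java and mvn are both on PATH (versions not checked).")
```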
Please visit and download the test collection, which includes
These files are expected to be in the data directory.
Run the command below to compile the Java code of Anserini:
$ poetry run invoke build
If the build succeeds, you will see a message like the following:
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 37.788 s
[INFO] Finished at: 2020-03-06T10:58:14+09:00
[INFO] ------------------------------------------------------------------------
Compiled anserini successfully.
Let's index the collection of Japanese statistical data.
The preprocess task reads the contents of data/data_search_j_collection.jsonl
and produces multiple files in JSONL format.
$ poetry run invoke preprocess ja data/data_search_j_collection.jsonl collections/ja
You can list the generated files with:
$ ls collections/ja
collection.00000000.jsonl collection.00000003.jsonl collection.00000006.jsonl collection.00000009.jsonl collection.00000012.jsonl
collection.00000001.jsonl collection.00000004.jsonl collection.00000007.jsonl collection.00000010.jsonl collection.00000013.jsonl
collection.00000002.jsonl collection.00000005.jsonl collection.00000008.jsonl collection.00000011.jsonl
To process the English statistical data (data_search_e_collection.jsonl), replace ja with en in the command above.
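The listing above suggests that the collection is split into numbered JSONL files. The following is a minimal sketch of that kind of sharding only. It assumes the file naming pattern seen in the listing, the function name split_jsonl and the lines_per_file value are hypothetical, and the actual preprocess task may do additional work on each record.

```python
from pathlib import Path
from typing import Iterable, List


def split_jsonl(
    lines: Iterable[str],
    out_dir: Path,
    lines_per_file: int = 100_000,  # hypothetical chunk size
) -> List[Path]:
    """Write `lines` into numbered files of at most `lines_per_file` lines."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    handle = None
    try:
        for i, line in enumerate(lines):
            # Start a new output file at every chunk boundary.
            if i % lines_per_file == 0:
                if handle is not None:
                    handle.close()
                path = out_dir / f"collection.{len(written):08d}.jsonl"
                handle = path.open("w", encoding="utf-8")
                written.append(path)
            # Normalize so that every record ends with exactly one newline.
            handle.write(line.rstrip("\n") + "\n")
    finally:
        if handle is not None:
            handle.close()
    return written
```

Passing an open file object as lines streams the input, so the whole collection never has to fit in memory.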
Then, the index task indexes the collection as follows:
$ poetry run invoke index ja collections/ja indices/ja
2020-03-06 13:22:53,100 INFO [main] index.IndexCollection ( - Total 1,338,402 documents indexed in 00:02:27
Created the index successfully.
You can find the index files in indices/ja.
After the index has been built, the search task can produce ranked lists using several search models such as BM25 and LMIR.
$ poetry run invoke search ja indices/ja data/data_search_j_train_topics.tsv results
This will read queries from data/data_search_j_train_topics.tsv, retrieve results from indices/ja, and write them to results.
$ ls results/
ja-bm25.accurate.txt ja-bm25.txt ja-bm25prf+bm25.txt ja-qld.txt ja-rm3+bm25.txt ja-rm3+qld.txt ja-sdm+bm25.txt ja-sdm+qld.txt
Each output file is named <language>-<search_model>.txt, where <search_model> identifies the model(s) used:
- bm25 : BM25 scoring model
- bm25.accurate : BM25 scoring model with accurate document lengths (Anserini's bm25.accurate option)
- bm25prf : BM25PRF query expansion model
- qld : query likelihood Dirichlet scoring model
- rm3 : RM3 query expansion model
- sdm : Sequential Dependence Model
Please refer to Anserini for details on search models.
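The naming scheme can be decoded programmatically when collecting results. The sketch below is an illustration, not part of this package: the function name parse_result_name is hypothetical, and treating + as a separator between combined models is an assumption inferred from names such as ja-rm3+bm25.txt.

```python
from typing import List, Tuple


def parse_result_name(file_name: str) -> Tuple[str, List[str]]:
    """Split '<language>-<search_model>.txt' into its language and model names."""
    suffix = ".txt"
    stem = file_name[: -len(suffix)] if file_name.endswith(suffix) else file_name
    # Only the first hyphen separates the language from the model part.
    language, _, model_part = stem.partition("-")
    # Assumption: '+' joins the models that were combined in one run.
    return language, model_part.split("+")


if __name__ == "__main__":
    for name in ["ja-bm25.txt", "ja-bm25.accurate.txt", "ja-rm3+bm25.txt"]:
        print(name, "->", parse_result_name(name))
```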
Note that these results are in the TREC format, not the NTCIR format required in the NTCIR-15 Data Search task.
You can convert a TREC file into an NTCIR file with the ntcirify task, e.g.
$ poetry run invoke ntcirify trec_file.txt ntcir_file.txt
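For inspecting the TREC-format results directly, the sketch below parses one line of a standard TREC run file, which has six whitespace-separated columns: topic ID, the literal Q0, document ID, rank, score, and run tag. The names TrecEntry and parse_trec_line are hypothetical, and the sketch does not reproduce the NTCIR output format written by ntcirify.

```python
from typing import NamedTuple


class TrecEntry(NamedTuple):
    topic_id: str
    doc_id: str
    rank: int
    score: float
    run_tag: str


def parse_trec_line(line: str) -> TrecEntry:
    """Parse one line of a TREC run file into a TrecEntry."""
    # Raises ValueError if the line does not have exactly six columns.
    topic_id, _q0, doc_id, rank, score, run_tag = line.split()
    return TrecEntry(topic_id, doc_id, int(rank), float(score), run_tag)


if __name__ == "__main__":
    # The IDs and tag below are made-up values used only for illustration.
    print(parse_trec_line("topic-1 Q0 doc-42 1 12.5 example-run"))
```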