pyNTCIREVAL is a Python version of NTCIREVAL ( http://research.nii.ac.jp/ntcir/tools/ntcireval-en.html ), which was developed by Dr. Tetsuya Sakai ( http://www.f.waseda.jp/tetsuya/sakai.html ). Only part of NTCIREVAL's functionality has been implemented in the current version of pyNTCIREVAL: retrieval effectiveness metrics for ranked retrieval (e.g. DCG and ERR). As shown below, pyNTCIREVAL can also be used directly in Python code.
For Japanese users, there is a very nice textbook written in Japanese that discusses various evaluation metrics and how to use NTCIREVAL: see http://www.f.waseda.jp/tetsuya/book.html .
These evaluation metrics are available in the current version:
- Hit@k: 1 if the top k contains a relevant doc, and 0 otherwise.
- P@k (precision at k): the number of relevant docs in the top k divided by k (Hit@k and P@k are illustrated by the small sketch right after this list).
- AP (Average Precision) [6, 7].
- ERR (Expected Reciprocal Rank) and nERR@k [2, 8].
- RBP (Rank-biased Precision) [4].
- nDCG (original nDCG) [3].
- MSnDCG (Microsoft version of nDCG) [1].
- Q-measure [8].
- RR (Reciprocal Rank).
- O-measure [5].
- P-measure and P-plus [5].
- NCU (Normalised Cumulative Utility) [7].
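To make the two simplest definitions above concrete, here is a tiny plain-Python sketch of Hit@k and P@k. These helper functions are purely illustrative and are not part of the pyNTCIREVAL API, which is shown in the examples further below.
# Illustrative only: Hit@k and P@k over a list of (doc_id, rel_level) pairs
def hit_at_k(labeled_ranked_list, k):
    # 1 if the top k contains at least one relevant document, 0 otherwise
    return int(any(rel > 0 for _, rel in labeled_ranked_list[:k]))

def precision_at_k(labeled_ranked_list, k):
    # number of relevant documents in the top k, divided by k
    return sum(rel > 0 for _, rel in labeled_ranked_list[:k]) / k
pyNTCIREVAL itself is installed with pip: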
pip install pyNTCIREVAL
from pyNTCIREVAL import Labeler
from pyNTCIREVAL.metrics import Precision
# dict of { document ID: relevance level }
qrels = {0: 1, 1: 0, 2: 0, 3: 0, 4: 1, 5: 0, 6: 0, 7: 1, 8: 0, 9: 0}
ranked_list = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] # a list of document IDs
# labeling: [doc_id] -> [(doc_id, rel_level)]
labeler = Labeler(qrels)
labeled_ranked_list = labeler.label(ranked_list)
assert labeled_ranked_list == [
(0, 1), (1, 0), (2, 0), (3, 0), (4, 1),
(5, 0), (6, 0), (7, 1), (8, 0), (9, 0)
]
# Let's compute Precision@5: two of the top five documents (IDs 0 and 4) are relevant, so P@5 = 2/5 = 0.4
metric = Precision(cutoff=5)
result = metric.compute(labeled_ranked_list)
assert result == 0.4
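The same labeled list can be reused with a different cutoff. The snippet below continues the example above and assumes that Precision follows the P@k definition given earlier; it is a sketch, not additional official documentation.
# P@3: only doc 0 is relevant within the top 3 of the ranked list
metric = Precision(cutoff=3)
assert abs(metric.compute(labeled_ranked_list) - 1/3) < 1e-9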
Many evaluation metric classes require xrelnum and grades as input for initialization. xrelnum is a list containing the number of documents at each relevance level, while grades is a list containing a grade for each relevance level (level 0 is excluded). For example, suppose there are three relevance levels (irrelevant, partially relevant, and highly relevant) and a document collection includes 5 irrelevant, 3 partially relevant, and 2 highly relevant documents for a certain topic. In this case, xrelnum = [5, 3, 2]. If we want to assign grades 0, 1, and 2 to these three levels, then grades = [1, 2] (again, level 0 is excluded).
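As a cross-check of this example, here is a minimal plain-Python sketch (not using the pyNTCIREVAL API) that derives xrelnum from a qrels dict; the qrels values below are made up so that the counts match the 5/3/2 example.
from collections import Counter
# hypothetical qrels: 5 irrelevant (level 0), 3 partially relevant (level 1),
# and 2 highly relevant (level 2) documents for one topic
qrels = {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1, 7: 1, 8: 2, 9: 2}
counts = Counter(qrels.values())
xrelnum = [counts.get(level, 0) for level in range(3)]  # one entry per relevance level
assert xrelnum == [5, 3, 2]
grades = [1, 2]  # grades for levels 1 and 2; level 0 is excluded
In practice, pyNTCIREVAL's Labeler can compute xrelnum directly, as the next example shows: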
from pyNTCIREVAL import Labeler
from pyNTCIREVAL.metrics import MSnDCG
# dict of { document ID: relevance level }
qrels = {0: 2, 1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 0, 7: 2, 8: 0, 9: 0}
grades = [1, 2] # a grade for relevance levels 1 and 2 (Note that level 0 is excluded)
ranked_list = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] # a list of document IDs
# labeling: [doc_id] -> [(doc_id, rel_level)]
labeler = Labeler(qrels)
labeled_ranked_list = labeler.label(ranked_list)
assert labeled_ranked_list == [
(0, 2), (1, 0), (2, 1), (3, 0), (4, 1),
(5, 0), (6, 0), (7, 2), (8, 0), (9, 0)
]
# compute the number of documents for each relevance level
rel_level_num = 3
xrelnum = labeler.compute_per_level_doc_num(rel_level_num)
assert xrelnum == [6, 2, 2]
# Let's compute nDCG@5
metric = MSnDCG(xrelnum, grades, cutoff=5)
result = metric.compute(labeled_ranked_list)
assert result == 0.6885695823073614
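As a sanity check, the value above can be reproduced by hand with the standard Microsoft nDCG formulation [1], i.e. summing gain / log2(rank + 1) over the top 5 and dividing by the same sum for an ideal ranking. This is a minimal sketch of that computation, not pyNTCIREVAL's internal code.
import math
gains = [2, 0, 1, 0, 1]  # gains of the top 5 documents in the ranked list above
ideal = [2, 2, 1, 1, 0]  # gains of the top 5 documents in an ideal ranking
# enumerate() starts at 0, so rank = r + 1 and the discount is log2(rank + 1) = log2(r + 2)
dcg = sum(g / math.log2(r + 2) for r, g in enumerate(gains))
idcg = sum(g / math.log2(r + 2) for r, g in enumerate(ideal))
assert abs(dcg / idcg - 0.6885695823073614) < 1e-9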
[1] Burges, C. et al.: Learning to rank using gradient descent, ICML 2005.
[2] Chapelle, O. et al.: Expected Reciprocal Rank for Graded Relevance, CIKM 2009.
[3] Järvelin, K. and Kekäläinen, J.: Cumulated Gain-based Evaluation of IR Techniques, ACM TOIS 20(4), 2002.
[4] Moffat, A. and Zobel, J.: Rank-biased Precision for Measurement of Retrieval Effectiveness, ACM TOIS 27(1), 2008.
[5] Sakai, T.: On the Properties of Evaluation Metrics for Finding One Highly Relevant Document, IPSJ TOD, Vol.48, No.SIG9 (TOD35), 2007.
[6] Sakai, T.: Alternatives to Bpref, SIGIR 2007.
[7] Sakai, T. and Robertson, S.: Modelling A User Population for Designing Information Retrieval Metrics, EVIA 2008.
[8] Sakai, T. and Song, R.: Evaluating Diversified Search Results Using Per-intent Graded Relevance, SIGIR 2011.