A Python library for conducting interleaving, which compares two or more rankers based on observed user clicks by interleaving their results.
A/B testing is a well-known technique for comparing two or more systems based on user behavior in a production environment, and has been used to improve system quality in many services. Interleaving, an alternative to A/B testing for comparing rankings, has been reported to be roughly 100 times more efficient than A/B testing [1, 2]. Since efficiency matters greatly when many alternatives must be compared, interleaving is a promising technique for user-based ranking evaluation. This library aims to provide most of the algorithms that have been proposed in the literature:
- Balanced interleaving [3]
- Team draft interleaving [4]
- Probabilistic interleaving [5]
- Optimized interleaving [6]
- Team draft multileaving [7]
- Probabilistic multileaving [8]
- Optimized multileaving [7]
- Roughly optimized multileaving [9]
- Pairwise preference multileaving [10]
Note that probabilistic interleaving and probabilistic multileaving use different strategies to select the ranker from which each document is drawn. In the original papers, probabilistic interleaving samples a ranker with replacement, i.e. one of the two rankers is sampled at every document selection, whereas probabilistic multileaving samples rankers without replacement: let D be the set of all the rankers; a ranker is sampled from D without replacement, and when D becomes empty, all the rankers are put back into D. The Probabilistic class has a keyword argument replace with which either of these strategies can be selected, as sketched below.
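For example, here is a minimal sketch of switching between the two strategies. It assumes that Probabilistic is constructed like TeamDraft in the usage example below and that replace takes a boolean; check the class docstring before relying on these details.
>>> import interleaving
>>>
>>> a = [1, 2, 3, 4, 5] # Ranking 1
>>> b = [4, 3, 5, 1, 2] # Ranking 2
>>> # Sample a ranker with replacement at every document selection
>>> # (boolean values for replace are an assumption, not taken from the text above)
>>> method = interleaving.Probabilistic([a, b], replace=True)
>>> ranking = method.interleave()
>>> # Sample rankers without replacement, refilling once every ranker has been used
>>> method = interleaving.Probabilistic([a, b], replace=False)
>>> ranking = method.interleave()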
This library requires the following packages:
- NumPy
- SciPy
- PuLP
interleaving and its prerequisites can be installed with
$ pip install git+https://github.com/mpkato/interleaving.git
Alternatively:
$ git clone https://github.com/mpkato/interleaving.git
$ cd interleaving
$ python setup.py install
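To confirm that the package is importable after installation (no output means the import succeeded):
$ python -c "import interleaving"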
>>> import interleaving
>>>
>>> a = [1, 2, 3, 4, 5] # Ranking 1
>>> b = [4, 3, 5, 1, 2] # Ranking 2
>>> method = interleaving.TeamDraft([a, b]) # initialize an interleaving method
>>>
>>> ranking = method.interleave() # interleaving
>>> ranking
[1, 4, 2, 3, 5]
>>>
>>> clicks = [0, 2] # clicked positions in the interleaved ranking, i.e. documents 1 and 2 are clicked
>>> result = interleaving.TeamDraft.evaluate(ranking, clicks)
>>> result # (0, 1) indicates Ranking 1 won Ranking 2.
[(0, 1)]
>>>
>>> clicks = [1, 3] # observed clicks, i.e. documents 4 and 3 are clicked
>>> result = interleaving.TeamDraft.evaluate(ranking, clicks)
>>> result # (1, 0) indicates Ranking 2 won Ranking 1.
[(1, 0)]
>>>
>>> clicks = [0, 1] # observed clicks, i.e. documents 1 and 4 are clicked
>>> result = interleaving.TeamDraft.evaluate(ranking, clicks)
>>> result # neither (0, 1) nor (1, 0) appears: a tie between Rankings 1 and 2
[]
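In practice, an interleaving comparison repeats the interleave/evaluate cycle over many impressions and aggregates the per-impression credits to decide an overall winner. The following is a minimal sketch of that loop using only the API shown above; the random click model and the number of impressions are illustrative assumptions, not part of the library.
>>> import random
>>> import interleaving
>>>
>>> a = [1, 2, 3, 4, 5] # Ranking 1
>>> b = [4, 3, 5, 1, 2] # Ranking 2
>>> method = interleaving.TeamDraft([a, b])
>>> wins = [0, 0] # impressions won by Ranking 1 and Ranking 2, respectively
>>> for _ in range(1000): # number of impressions (arbitrary for this sketch)
...     ranking = method.interleave() # a fresh interleaved ranking per impression
...     clicks = [random.randrange(len(ranking))] # one simulated click (placeholder for real user clicks)
...     for winner, loser in interleaving.TeamDraft.evaluate(ranking, clicks):
...         wins[winner] += 1
...
>>> # The ranking with more wins is preferred; similar counts suggest a tie.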
The ranking sampling algorithm of optimized multileaving [7] and roughly optimized multileaving [9] may take a long time or even run into an infinite loop. To work around this problem, this implementation supports a secure_sampling flag that limits the number of sampling attempts to sample_num.
>>> import interleaving
>>> interleaving.Optimized([[1, 2], [2, 3]], sample_num=4, secure_sampling=True)
References:
- [1] Chapelle et al. "Large-scale Validation and Analysis of Interleaved Search Evaluation." ACM TOIS 30(1), 2012.
- [2] Schuth, Hofmann, and Radlinski. "Predicting Search Satisfaction Metrics with Interleaved Comparisons." SIGIR 2015.
- [3] Joachims. "Evaluating Retrieval Performance Using Clickthrough Data." Text Mining 2003.
- [4] Radlinski, Kurup, and Joachims. "How Does Clickthrough Data Reflect Retrieval Quality?" CIKM 2008.
- [5] Hofmann, Whiteson, and de Rijke. "A Probabilistic Method for Inferring Preferences from Clicks." CIKM 2011.
- [6] Radlinski and Craswell. "Optimized Interleaving for Online Retrieval Evaluation." WSDM 2013.
- [7] Schuth et al. "Multileaved Comparisons for Fast Online Evaluation." CIKM 2014.
- [8] Schuth et al. "Probabilistic Multileave for Online Retrieval Evaluation." SIGIR 2015.
- [9] Manabe et al. "A Comparative Live Evaluation of Multileaving Methods on a Commercial cQA Search." SIGIR 2017.
- [10] Oosterhuis and de Rijke. "Sensitive and Scalable Online Evaluation with Theoretical Guarantees." CIKM 2017.
MIT License (see LICENSE file).