This directory contains all code and resources required to reproduce the experiments from the paper and to run the algorithms on new corpora. To keep the documentation simple, all steps are persisted in a Makefile.
Outline
- Transferred Relevance Judgments: The resulting new relevance judgments on ClueWeb12, ClueWeb12+, and Common Crawl 2015.
- ClueWeb12+: Steps to produce our simulated ClueWeb12+. ClueWeb12+ is ClueWeb12 plus snapshots of judged web pages that are missing from the ClueWeb12 crawl but were archived by the Wayback Machine during the ClueWeb12 crawling period.
- Evaluation: All evaluations reported in the paper.
- (Re)Producing Transferred Relevance Judgments: The steps used to identify near-duplicates between judged documents in the old corpora and documents in the new corpora.
We have produced new relevance judgments by finding (near-)duplicates of judged ClueWeb09/ClueWeb12 documents in newer crawls. All transferred relevance judgments are located in src/main/resources/artificial-qrels/.
Here is an overview of the judgments:
- Original judgments (we removed near-duplicates judged for the same topic):
- Transferred judgments to ClueWeb12:
- Transferred judgments to Common Crawl 2015:
The associated topics can be found in Anserini.
The creation of ClueWeb12+ was performed in two steps (please install java:8, then run make install to install all dependencies):
- We use the Wayback CDX API to find the timestamps at which judged URLs were saved in the Wayback Machine: de.webis.sigir2021.App
- We crawl the snapshots identified by the CDX API from the Wayback Machine: de.webis.sigir2021.CrawlWaybackSnapshots
Both steps create WARC files. The resulting WARC files are available online, combined in a single tar here:
The results of both steps are also available as the intermediate topic-level WARCs, which can be found here:
We have indexed the ClueWeb12+ in Elasticsearch with make ranking-create-index-clueweb09-in-wayback12.
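The timestamp lookup in the first step above can be illustrated with a minimal Python sketch. It is not the implementation in de.webis.sigir2021.App: the function names are hypothetical, the endpoint and query parameters follow the public Wayback CDX server API, and the default date range is an assumed ClueWeb12 crawling window that should be checked against the paper.

```python
import json
from urllib.parse import urlencode

# Public CDX endpoint of the Wayback Machine.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_query_url(judged_url, start="20120210", end="20120510"):
    """Build a CDX query for captures of judged_url between start and end.

    start/end use the yyyyMMdd format; the defaults are an ASSUMED
    ClueWeb12 crawling window, not values taken from this repository.
    """
    params = {
        "url": judged_url,
        "from": start,
        "to": end,
        "output": "json",            # first row of the response is the header
        "filter": "statuscode:200",  # only successful captures
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_json(body):
    """Turn a CDX JSON body (header row + capture rows) into a list of dicts."""
    rows = json.loads(body)
    if not rows:
        return []
    header, captures = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in captures]


def snapshot_url(timestamp, original_url):
    """URL of one archived capture; the id_ modifier requests the archived
    content without the Wayback Machine's link rewriting."""
    return f"https://web.archive.org/web/{timestamp}id_/{original_url}"
```

Fetching `cdx_query_url(...)` with any HTTP client and passing the response body to `parse_cdx_json` yields one dict per capture; its `timestamp` and `original` fields can then be handed to `snapshot_url` for the crawling step.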
Please install docker, python:3.6, and pipenv.
There are make targets for all our experiment steps.
- Install docker, python:3.6, and pipenv
- Install dependencies by running: make simulation-install and make ranking-install
- Verify your installation by running: make simulation-test and make ranking-test (you can skip ranking-test when you want to reuse our run-files)
All Jupyter notebooks, which contain some additional experiments, can be found in src/main/jupyter. Start Jupyter with make jupyter.
- Download the run-files from TREC.
- Run make simulation-evaluate-original-runs to produce original-runs-evaluation.jsonl
We set up the Elasticsearch indices with make ranking-create-index-clueweb09-in-wayback12.
Use the associated make targets to create the run-files (the run-files are also available online at https://files.webis.de/sigir2021-relevance-label-transfer-resources/case-study-run-files/):
- make ranking-create-original-web-<TRACK_YEAR> to create the original run-files for <TRACK_YEAR>.
- make ranking-create-transferred-web-<TRACK_YEAR> to create the transferred run-files for <TRACK_YEAR>.
- make ranking-create-transferred-cw12-and-wb12 to create all transferred run-files for ClueWeb12+.
- make ranking-create-transferred-cc15 to create all transferred run-files for Common Crawl 2015.
- When all run-files are available, run make simulation-reproducibility-analysis to produce reproducibility-evaluation-per-query-zero-scores-removed.jsonl (this jsonl file is used to create the evaluations in section "5.2 Experiments with Best-Case Topic Selections")
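To run the run-file targets for several track years in one go, a small driver along the following lines may help. It is only a sketch: the target names are taken from the list above, but the helper functions are hypothetical and the values accepted for <TRACK_YEAR> are not listed here, so the years passed in are an assumption to verify against the Makefile.

```python
import subprocess


def run_file_targets(track_years):
    """List the make targets from the steps above, in the order given there.

    track_years is supplied by the caller; which years the Makefile
    actually supports is an ASSUMPTION to check before running.
    """
    targets = []
    for year in track_years:
        targets.append(f"ranking-create-original-web-{year}")
        targets.append(f"ranking-create-transferred-web-{year}")
    targets.append("ranking-create-transferred-cw12-and-wb12")
    targets.append("ranking-create-transferred-cc15")
    # The analysis needs all run-files, so it comes last.
    targets.append("simulation-reproducibility-analysis")
    return targets


def run_targets(targets, dry_run=True):
    """Print each make invocation, or execute it when dry_run is False."""
    for target in targets:
        command = ["make", target]
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops at the first failing target.
            subprocess.run(command, check=True)
```

Calling `run_targets(run_file_targets([2009]))` only prints the commands; pass `dry_run=False` from the repository root to execute them.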
We create the transferred relevance judgments in two steps:
- First, we create a jsonl file for each track that contains near-duplicates for the judgments with the class de.webis.sigir2021.trec.LabelTransfer
- Second, we use those jsonl files to create qrel files for each track with the class de.webis.sigir2021.trec.CreateQrels
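The second step can be pictured with the following minimal sketch, which turns jsonl records into lines in the standard TREC qrel format (topic, iteration, document id, relevance). It is not the logic of de.webis.sigir2021.trec.CreateQrels: the field names topic, targetId, and relevance are hypothetical placeholders for whatever the real jsonl files contain, and keeping only the first label per topic and document is a simplification.

```python
import json


def qrel_lines(jsonl_lines):
    """Convert jsonl records into TREC qrel lines: '<topic> 0 <docid> <rel>'.

    The keys 'topic', 'targetId', and 'relevance' are HYPOTHETICAL;
    adapt them to the actual schema of the near-duplicate jsonl files.
    """
    seen = set()
    lines = []
    for raw in jsonl_lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        record = json.loads(raw)
        key = (str(record["topic"]), record["targetId"])
        if key in seen:
            continue  # simplification: keep the first label per (topic, document)
        seen.add(key)
        lines.append(f"{key[0]} 0 {key[1]} {int(record['relevance'])}")
    return lines
```

Writing the returned lines to a file, one per line, gives a qrel file that standard TREC evaluation tools can read.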
All resources for creating the transferred relevance judgments are available online at https://files.webis.de/sigir2021-relevance-label-transfer-resources/:
- https://files.webis.de/sigir2021-relevance-label-transfer-resources/url-transfer-from-cw09-to-cw12-with-similarity.jsonl
- https://files.webis.de/sigir2021-relevance-label-transfer-resources/relevance-transfer-only-near-duplicates.jsonl
- https://files.webis.de/sigir2021-relevance-label-transfer-resources/relevance-transfer-exact-duplicates.jsonl
- https://files.webis.de/sigir2021-relevance-label-transfer-resources/url-transfer-from-cw09-or-cw12-to-cc15-with-similarity.jsonl
These resources were created with (please install java:8, hadoop:2.7.1, spark:2.2.1):
- Near-duplicate detection with SimHash
- Transfer to near-duplicates with identical/canonical URLs with the classes de.webis.sigir2021.Evaluation and de.webis.sigir2021.EvaluationCC15
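For readers unfamiliar with SimHash, the idea behind the near-duplicate detection can be sketched in a few lines of Python. This is a generic textbook variant, not the Spark-based implementation used for the paper: the word-level tokenization, the MD5-based token hash, the 64-bit fingerprint size, and the distance threshold of 3 are all illustrative assumptions.

```python
import hashlib
import re


def simhash(text, bits=64):
    """Compute a SimHash fingerprint from lowercase word tokens.

    Tokenization and the MD5 token hash are ASSUMPTIONS for illustration.
    bits must be a multiple of 8 and at most 128 (the MD5 digest size).
    """
    counts = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, max_distance=3, bits=64):
    """Treat two texts as near-duplicates if their fingerprints are close.

    max_distance=3 is an illustrative threshold, not the paper's setting.
    """
    distance = hamming_distance(simhash(text_a, bits), simhash(text_b, bits))
    return distance <= max_distance
```

Documents with mostly identical tokens end up with fingerprints that differ in only a few bits, so candidate pairs can be found by comparing fingerprints instead of full texts.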