
Showcase for the CopyCat Library: "Transferring Relevance Judgments"

This directory contains all code and resources required to reproduce the experiments from the paper and to run the algorithms on new corpora. To simplify the documentation of our steps, all steps are persisted in a Makefile.

Outline

  • Transferred Relevance Judgments: The resulting new relevance judgments on ClueWeb12, ClueWeb12+, and Common Crawl 2015.
  • ClueWeb12+: Steps to produce our simulated ClueWeb12+. ClueWeb12+ is the ClueWeb12 plus snapshots of judged web pages that are not included in the ClueWeb12 crawl but were available in the Wayback Machine during the ClueWeb12 crawling period.
  • Evaluation: All evaluations reported in the paper.
  • (Re)Producing Transferred Relevance Judgments: The steps used to identify near-duplicates between judged documents in the old corpora and documents in the new corpora.

Transferred Relevance Judgments

We have produced new relevance judgments by finding (near-)duplicates of judged ClueWeb09/ClueWeb12 documents in newer crawls. All transferred relevance judgments are located in src/main/resources/artificial-qrels/.
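The transferred judgments follow the standard TREC qrels format (topic, iteration, document ID, relevance label, whitespace-separated). A minimal sketch for loading such a file into a nested dict; the file path is a placeholder for any of the files in src/main/resources/artificial-qrels/:

```python
from collections import defaultdict

def load_qrels(path):
    """Parse a TREC-style qrels file: '<topic> <iteration> <doc-id> <label>'."""
    qrels = defaultdict(dict)
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 4:
                continue  # skip blank or malformed lines
            topic, _iteration, doc_id, label = parts
            qrels[topic][doc_id] = int(label)
    return qrels
```

The resulting mapping (topic → doc-id → label) is the usual input shape for evaluation tools.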

Here is an overview of the judgments:

The associated topics can be found in Anserini.

ClueWeb12+

The creation of ClueWeb12+ was performed in two steps (please install java:8, then run make install to install all dependencies):

  1. We use the Wayback CDX-Api to find the timestamps at which judged URLs were captured by the Wayback Machine: de.webis.sigir2021.App
  2. We crawl the snapshots identified by the CDX-Api from the Wayback Machine: de.webis.sigir2021.CrawlWaybackSnapshots
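Conceptually, step 1 boils down to one CDX lookup per judged URL, restricted to the crawling period. A minimal sketch of such a query URL (the actual implementation is the Java class de.webis.sigir2021.App; the date range shown is illustrative):

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(url, from_ts, to_ts):
    """Build a Wayback CDX-API query for snapshots of `url` captured
    between `from_ts` and `to_ts` (yyyyMMdd... timestamp prefixes)."""
    params = {
        "url": url,
        "from": from_ts,             # start of the crawling period
        "to": to_ts,                 # end of the crawling period
        "output": "json",
        "filter": "statuscode:200",  # keep only successful captures
    }
    return CDX_ENDPOINT + "?" + urlencode(params)
```

Fetching the returned URL yields a JSON list of captures whose timestamps identify the snapshots crawled in step 2.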

Both steps create WARC files. The resulting WARC files are available online, combined in a single tar here:

The results of both steps are also available as intermediate topic-level WARCs, which can be found here:

We have indexed ClueWeb12+ in Elasticsearch with make ranking-create-index-clueweb09-in-wayback12.
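Once the index exists, run-files are retrieval results over it. A sketch of the kind of search request body such a step issues; the field name "body" is a hypothetical example, since the actual index schema is defined by the make target above:

```python
import json

def match_query(query_text, size=1000):
    """Build an Elasticsearch match-query body over a hypothetical
    'body' field, retrieving up to `size` documents."""
    return {
        "size": size,
        "query": {"match": {"body": query_text}},
    }

# Serialized, this is the JSON one would POST to <es-host>/<index>/_search.
request_body = json.dumps(match_query("raspberry pi"))
```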

Evaluation

Please install docker, python:3.6, and pipenv. There are make targets for all our experiment steps.

Preparations

  • Install docker, python:3.6, and pipenv
  • Install dependencies by running: make simulation-install and make ranking-install
  • Verify your installation by running: make simulation-test and make ranking-test (you can skip ranking-test if you want to reuse our run-files)

Produce Evaluations Reported in the Paper

All Jupyter notebooks, including some additional experiments, are located in src/main/jupyter; start Jupyter with make jupyter.

Produce Evaluations on Original Run-Files

Produce Rankings from the Case-Study

We set up the Elasticsearch indices with make ranking-create-index-clueweb09-in-wayback12. Use the associated make targets to create the run-files (the run-files are also available online at https://files.webis.de/sigir2021-relevance-label-transfer-resources/case-study-run-files/):

  • make ranking-create-original-web-<TRACK_YEAR> to create the original run-files for <TRACK_YEAR>.
  • make ranking-create-transferred-web-<TRACK_YEAR> to create the transferred run-files for <TRACK_YEAR>.
  • make ranking-create-transferred-cw12-and-wb12 to create all transferred run files for ClueWeb12+.
  • make ranking-create-transferred-cc15 to create all transferred run files for Common Crawl 2015.
  • When all run-files are available, run make simulation-reproducibility-analysis to produce reproducibility-evaluation-per-query-zero-scores-removed.jsonl. This JSONL file is used to create the evaluations in Section 5.2, "Experiments with Best-Case Topic Selections".
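The run-files produced by these targets follow the standard TREC run format (topic, "Q0", document ID, rank, score, tag). A minimal sketch for reading one, with a placeholder file path:

```python
def load_run(path):
    """Parse a TREC run file: '<topic> Q0 <doc-id> <rank> <score> <tag>'."""
    run = {}
    with open(path) as f:
        for line in f:
            topic, _q0, doc_id, rank, score, _tag = line.split()
            run.setdefault(topic, []).append((doc_id, int(rank), float(score)))
    # ensure the ranking of each topic is sorted by rank
    for topic in run:
        run[topic].sort(key=lambda entry: entry[1])
    return run
```

The same per-topic ranking structure can then be scored against the transferred qrels.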

(Re)Producing Transferred Relevance Judgments

We create the transferred relevance judgments in two steps:

All resources for creating the transferred relevance judgments are available online at https://files.webis.de/sigir2021-relevance-label-transfer-resources/:

These resources are created as follows (please install java:8, hadoop:2.7.1, and spark:2.2.1):
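The actual near-duplicate detection runs as a Hadoop/Spark job using the CopyCat library. To illustrate the idea only, here is a stand-alone sketch that flags near-duplicates via word 8-gram shingles and Jaccard similarity; the shingle size and the 0.85 threshold are illustrative choices, not the production configuration:

```python
def shingles(text, n=8):
    """Word n-gram shingles of a document's plain text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(doc_a, doc_b, threshold=0.85):
    """Treat two documents as near-duplicates above a similarity threshold."""
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold
```

A judged document from the old corpus that is a near-duplicate of a document in the new corpus transfers its relevance label to that document.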