LongEval Retrieval (used at CLEF 2023) #234

mam10eks · 2023-05-13T08:04:14Z

Dataset Information:

The goal is to integrate the LongEval data for Task 1 (retrieval).

The information from the official task description:

The goal of Task 1 is to propose an information retrieval system which can handle changes over time. The proposed retrieval system should follow the temporal evolution of Web documents. The Longeval Websearch collection relies on a large set of data (corpus of pages, queries, user interaction) provided by a commercial search engine (Qwant). It is designed to reflect the changes of the Web across time, by providing evolving document and query sets. The queries in the collection were collected from Qwant's users over several months and can thus be expected to reflect the changes in the search preferences of the users. The documents in the collection were then selected to be able to well evaluate retrieval on these queries at the time they were collected, and thus also change over time.

Links to Resources:

https://clef-longeval.github.io/

Dataset ID(s) & supported entities:

longeval/en/train: docs, queries, qrels
longeval/en/heldout: docs, queries
longeval/en/a-short-july: docs, queries
longeval/en/b-long-september: docs, queries
longeval/fr/train: docs, queries, qrels
longeval/fr/heldout: docs, queries
longeval/fr/a-short-july: docs, queries
longeval/fr/b-long-september: docs, queries

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

Dataset definition (in ir_datasets/datasets/[topid].py)
Tests (in tests/integration/[topid].py)
Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
Documentation (in ir_datasets/etc/[topid].yaml)
- Documentation generated in https://github.com/seanmacavaney/ir-datasets.com/
Downloadable content (in ir_datasets/etc/downloads.json)
- Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
- Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.

Additional comments/concerns/ideas/etc.


mam10eks · 2023-05-13T08:05:08Z

I have started to work on this and have a first prototype locally that uses TrecDocs and TsvQueries, so not much code should be needed here.
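To illustrate the two input formats mentioned above, here is a minimal standard-library sketch. It is not the prototype and does not use the ir_datasets `TrecDocs` or `TsvQueries` classes. The tag names `DOC`, `DOCNO`, and `TEXT`, the two-column query layout, and the function names are assumptions made for illustration.

```python
import re

# Assumed TREC-style layout: <DOC><DOCNO>id</DOCNO><TEXT>body</TEXT></DOC>
DOC_RE = re.compile(
    r"<DOC>\s*<DOCNO>(.*?)</DOCNO>\s*<TEXT>(.*?)</TEXT>\s*</DOC>",
    re.DOTALL,
)


def parse_trec_docs(text):
    """Yield (doc_id, body) pairs from a TREC-style document string."""
    for match in DOC_RE.finditer(text):
        yield match.group(1).strip(), match.group(2).strip()


def parse_tsv_queries(text):
    """Yield (query_id, query_text) pairs from tab-separated lines."""
    for line in text.splitlines():
        if "\t" not in line:
            continue  # skip blank or malformed lines
        query_id, query_text = line.split("\t", 1)
        yield query_id, query_text
```

Both helpers take the file contents as a string, so they can be tried on a small sample without touching the real collection.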

seanmacavaney · 2023-05-15T18:58:19Z

Awesome! Given LongEval's focus on the temporal dimension, I think time should be encoded at a higher level in the dataset ids, e.g.:

longeval (placeholder)
- /[2023-07|2023-09|...] (placeholder)
  - /[en|fr|...] (docs)
    - /[train|heldout|eval|...] (docs, queries, qrels)

Though maybe I'm missing something about how the task is structured?
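
The hierarchy proposed in this comment can be spelled out as a flat list of ids. The following is only a sketch: the snapshot, language, and split values are the placeholder examples from the proposal, and the function name `longeval_ids` is hypothetical, not part of ir_datasets.

```python
# Placeholder values copied from the proposal; the real snapshots,
# languages, and splits may differ.
SNAPSHOTS = ["2023-07", "2023-09"]
LANGUAGES = ["en", "fr"]
SPLITS = ["train", "heldout", "eval"]


def longeval_ids(snapshots=SNAPSHOTS, languages=LANGUAGES, splits=SPLITS):
    """Yield (dataset_id, entities) pairs for the proposed hierarchy."""
    yield "longeval", ()  # placeholder level, no entities
    for snapshot in snapshots:
        yield f"longeval/{snapshot}", ()  # placeholder level, no entities
        for language in languages:
            yield f"longeval/{snapshot}/{language}", ("docs",)
            for split in splits:
                yield (
                    f"longeval/{snapshot}/{language}/{split}",
                    ("docs", "queries", "qrels"),
                )


for dataset_id, entities in longeval_ids():
    print(dataset_id, ", ".join(entities))
```

With the time snapshot directly under the top-level id, every language and split of one snapshot shares a common prefix.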

mam10eks · 2023-05-16T11:28:27Z

Yes, that makes perfect sense. Can I implement this ticket? (I already have a prototype; it is not much code, as LongEval comes in formats already supported by ir_datasets.)

seanmacavaney · 2023-05-16T16:17:11Z

That would be awesome! I love when folks release data in standard formats :-)

romaindeveaud · 2023-06-20T12:08:20Z

If I may add something, the LongEval collection is subject to a custom license from Qwant (https://lindat.mff.cuni.cz/repository/xmlui/page/Qwant_LongEval_BY-NC-SA_License, this is basically an extension of the CC-BY-NC License) that requires an explicit agreement as well as providing contact information.
Is it something that is feasible within ir-datasets?

mam10eks · 2023-07-19T23:06:37Z

Dear Romain,

Thanks for reaching out.
Yes, this is feasible.

The ir-datasets integration would expect the user to download the data manually (I already have a prototype implementation that assumes this).
That is, ir-datasets would not download the dataset itself; it would only show a message asking the user to obtain the data (filling out the explicit agreement and contact information in the process) and then store it in a predefined directory.
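
As a rough illustration of that behaviour, here is a standard-library sketch. It does not use the ir_datasets API. The function name `require_manual_download`, the directory name `longeval`, and the message wording are all hypothetical.

```python
from pathlib import Path

# License page mentioned earlier in this thread.
LICENSE_URL = (
    "https://lindat.mff.cuni.cz/repository/xmlui/page/"
    "Qwant_LongEval_BY-NC-SA_License"
)


def require_manual_download(base_dir, name="longeval"):
    """Return the data directory, or explain how to obtain the data.

    Nothing is downloaded here: if the directory is missing or empty,
    the user is asked to fetch the data manually.
    """
    path = Path(base_dir) / name
    if not path.is_dir() or not any(path.iterdir()):
        raise FileNotFoundError(
            "LongEval is not downloaded automatically. Please accept the "
            f"license at {LICENSE_URL}, download the data manually, and "
            f"place it in {path}"
        )
    return path
```

A caller would run this check before opening any files, so the license instructions appear as soon as the data is found to be missing.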

Best regards,

Maik

mam10eks added the add-dataset label May 13, 2023

mam10eks changed the title ~~LongEval (used at CLEF 2023)~~ LongEval Retrieval (used at CLEF 2023) May 13, 2023
