LongEval Retrieval (used at CLEF 2023) #234

Open
8 tasks
mam10eks opened this issue May 13, 2023 · 6 comments

Comments

@mam10eks
Contributor

Dataset Information:

The goal is to integrate the LongEval data for Task 1 (retrieval).

The information from the official task description:

The goal of Task 1 is to propose an information retrieval system which can handle changes over the time. The proposed retrieval system should follow the temporal timewise evolution of Web documents. The Longeval Websearch collection relies on a large set of data (corpus of pages, queries, user interaction) provided by a commercial search engine (Qwant). It is designed to reflect the changes of the Web across time, by providing evolving document and query sets. The queries in the collection were collected from Qwant's users over several months and can thus be expected to reflect the changes in the search preferences of the users. The documents in the collection were then selected to be able to well evaluate retrieval on these queries at the time they were collected, and thus also change over a time.

Links to Resources:

https://clef-longeval.github.io/

Dataset ID(s) & supported entities:

  • longeval/en/train: docs, queries, qrels
  • longeval/en/heldout: docs, queries
  • longeval/en/a-short-july: docs, queries
  • longeval/en/b-long-september: docs, queries
  • longeval/fr/train: docs, queries, qrels
  • longeval/fr/heldout: docs, queries
  • longeval/fr/a-short-july: docs, queries
  • longeval/fr/b-long-september: docs, queries

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

  • Dataset definition (in ir_datasets/datasets/[topid].py)
  • Tests (in tests/integration/[topid].py)
  • Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
  • Documentation (in ir_datasets/etc/[topid].yaml)
  • Downloadable content (in ir_datasets/etc/downloads.json)
    • Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
    • Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.

Additional comments/concerns/ideas/etc.

@mam10eks
Contributor Author

I have started working on this and have a first local prototype that uses TrecDocs and TsvQueries, so not much code should be needed here.
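
For readers unfamiliar with the two formats mentioned above, here is a minimal, standard-library-only sketch of what parsing TREC-style documents and tab-separated queries can look like. It is not the ir_datasets implementation of TrecDocs or TsvQueries. The field names (`doc_id`, `text`, `query_id`), the helper functions, and the sample data are assumptions for illustration, and the actual LongEval files may contain additional fields.

```python
import csv
import io
import re
from typing import Iterator, NamedTuple


class Doc(NamedTuple):
    doc_id: str
    text: str


class Query(NamedTuple):
    query_id: str
    text: str


# Hypothetical sample data; the real LongEval files may differ in detail.
SAMPLE_TREC = """<DOC>
<DOCNO>doc1</DOCNO>
<TEXT>
first example document
</TEXT>
</DOC>
<DOC>
<DOCNO>doc2</DOCNO>
<TEXT>
second example document
</TEXT>
</DOC>
"""

SAMPLE_TSV = "q1\texample query\nq2\tanother query\n"

# Matches one <DOC> block with a <DOCNO> and a <TEXT> element.
DOC_RE = re.compile(
    r"<DOC>\s*<DOCNO>(.*?)</DOCNO>\s*<TEXT>(.*?)</TEXT>\s*</DOC>", re.DOTALL
)


def parse_trec_docs(raw: str) -> Iterator[Doc]:
    """Yield one Doc per <DOC>...</DOC> block."""
    for match in DOC_RE.finditer(raw):
        yield Doc(match.group(1).strip(), match.group(2).strip())


def parse_tsv_queries(raw: str) -> Iterator[Query]:
    """Yield one Query per tab-separated line of the form: id, text."""
    reader = csv.reader(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) >= 2:  # skip empty or malformed lines
            yield Query(row[0], row[1])


docs = list(parse_trec_docs(SAMPLE_TREC))
queries = list(parse_tsv_queries(SAMPLE_TSV))
```

Because both formats are this simple, an integration mostly has to point existing readers at the right files rather than define new parsing logic.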

@mam10eks changed the title from "LongEval (used at CLEF 2023)" to "LongEval Retrieval (used at CLEF 2023)" on May 13, 2023
@seanmacavaney
Collaborator

Awesome! Given LongEval's focus on the temporal aspect, I think time should be encoded at a higher level in the dataset ids, e.g.:

  • longeval (placeholder)
    • /[2023-07|2023-09|...] (placeholder)
      • /[en|fr|...] (docs)
        • /[train|heldout|eval|...] (docs, queries, qrels)

Though maybe I'm missing something about how the task is structured?
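
To make the proposed hierarchy concrete, the following sketch enumerates the ids it would produce. The snapshot, language, and split values are only the placeholders from the outline above (the "..." entries are omitted), and `proposed_ids` is a hypothetical helper, not part of ir_datasets. Not every combination necessarily exists in the actual collection.

```python
from typing import List

# Placeholder values copied from the outline above; the final ids may differ.
SNAPSHOTS = ["2023-07", "2023-09"]
LANGUAGES = ["en", "fr"]
SPLITS = ["train", "heldout", "eval"]


def proposed_ids(base: str = "longeval") -> List[str]:
    """Enumerate ids level by level: base, snapshot, language, split."""
    ids = [base]
    for snapshot in SNAPSHOTS:
        ids.append(f"{base}/{snapshot}")
        for lang in LANGUAGES:
            ids.append(f"{base}/{snapshot}/{lang}")
            for split in SPLITS:
                ids.append(f"{base}/{snapshot}/{lang}/{split}")
    return ids


for dataset_id in proposed_ids():
    print(dataset_id)
```

Putting the time component directly after the base id groups everything belonging to one snapshot under a common prefix, which matches the temporal focus of the task.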

@mam10eks
Contributor Author

Yes, that makes perfect sense. Can I implement this ticket? (I already have a prototype; it is not much code, as LongEval comes in formats already supported in ir_datasets.)

@seanmacavaney
Collaborator

That would be awesome! I love when folks release data in standard formats :-)

@romaindeveaud

If I may add something: the LongEval collection is subject to a custom license from Qwant (https://lindat.mff.cuni.cz/repository/xmlui/page/Qwant_LongEval_BY-NC-SA_License, basically an extension of the CC-BY-NC License) that requires explicit agreement and contact information.
Is that feasible within ir-datasets?

@mam10eks
Contributor Author

Dear Romain,

Thanks for reaching out.
Yes, this is feasible.

The ir-datasets integration would expect that the user manually downloads the data (I already have a prototype implementation that assumes this).
I.e., ir-datasets would not download the dataset itself; it would only show a message asking the user to obtain the data (filling out the explicit agreement and contact information in the process) and then store it in a predefined directory.

Best regards,

Maik
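
The manual-download behaviour described above could be sketched roughly as follows. This is a standalone illustration using only the standard library, not the ir_datasets mechanism: the function `require_manual_download`, the message text, and the directory and file names are all hypothetical.

```python
import tempfile
from pathlib import Path

# Hypothetical message; the wording in a real integration would differ.
INSTRUCTIONS = (
    "The LongEval data is not downloaded automatically. Please obtain it "
    "from the official source (accepting the license and providing contact "
    "information), then place it in: {path}"
)


def require_manual_download(path: Path) -> Path:
    """Return path if it is a non-empty directory, else raise with instructions."""
    if not path.is_dir() or not any(path.iterdir()):
        raise FileNotFoundError(INSTRUCTIONS.format(path=path))
    return path


# Demonstration in a temporary directory (names are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "longeval"
    try:
        require_manual_download(data_dir)  # data not there yet
    except FileNotFoundError as err:
        message = str(err)
    data_dir.mkdir()
    (data_dir / "collection.trec").write_text("placeholder")
    checked = require_manual_download(data_dir)  # now succeeds
```

With this approach the license agreement stays between the user and the data provider, and the library only checks that the files are present locally.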
