AmenRa/a-multi-domain-benchmark-for-personalized-search-evaluation

A Multi-domain Benchmark for Personalized Search Evaluation

We provide large-scale multi-domain benchmark datasets for Personalized Search.

The datasets can be found here.
Models' source code can be found here.
Pre-computed baseline runs are available on ranxhub.

Citation

Please cite the following paper if you use the data or code in this repo.

@inproceedings{bassani2022multi,
  title={A Multi-Domain Benchmark for Personalized Search Evaluation},
  author={Bassani, Elias and Kasela, Pranav and Raganato, Alessandro and Pasi, Gabriella},
  booktitle={Proceedings of the 31st ACM International Conference on Information \& Knowledge Management},
  pages={3822--3827},
  year={2022}
}

Folder structure of each dataset

- train:
  - queries.jsonl
  - query_ids.txt
- val:
  - bm25_run.json
  - qrels.json
  - queries.jsonl
  - query_ids.txt
- test:
  - bm25_run.json
  - qrels.json
  - queries.jsonl
  - query_ids.txt
- collection.jsonl
- fos_hierarachies.jsonl
- in_refs.jsonl
- out_refs.jsonl
- has_authors.jsonl
- authors.jsonl
- affiliations.jsonl
- conference_instances.jsonl
- conference_series.jsonl
- journals.jsonl
- bm25_config.json

File descriptions

queries.jsonl

Each JSON line is as follows:

{
  "id": ...
  "text": ...
  "rel_doc_ids": ...      # IDs of the relevant documents
  "user_id": ...          # Same as `author_id` in other files
  "user_doc_ids": ...     # IDs of the associated user documents
  "bm25_doc_ids": ...     # IDs of the documents retrieved by BM25
  "bm25_doc_scores": ...  # Scores assigned by BM25 to the retrieved documents
  "timestamp": ...
}
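
As a minimal sketch of how these files might be consumed, the snippet below reads a JSONL file line by line and pairs each BM25 document ID with its score. The helper names `read_jsonl` and `bm25_ranking` are hypothetical, and the code assumes that `bm25_doc_ids` and `bm25_doc_scores` are parallel lists of equal length, which the field descriptions suggest but do not state.

```python
import json
from typing import Iterator


def read_jsonl(path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def bm25_ranking(query: dict) -> list:
    """Pair each BM25 doc ID with its score.

    Assumption: the two lists are aligned position by position.
    """
    return list(zip(query["bm25_doc_ids"], query["bm25_doc_scores"]))
```

For example, `for query in read_jsonl("train/queries.jsonl"): ...` would stream the training queries without loading the whole file into memory.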

collection.jsonl

Each JSON line is as follows:

{
  "id": ...
  "title": ...
  "text": ...
  "keywords": ...
  "fields_of_study": ...
  "publication_date": ...
  "timestamp": ...
  "conference_instance_id": ...
  "conference_series_id": ...
  "journal_id": ...
  "issue_id": ...
  "volume": ...
  "publisher": ...
  "doi": ...
}

authors.jsonl

Each JSON line is as follows:

{
  "id": ...
  "name": ...
  "affiliation_id": ...
  "docs": [{"doc_id": "...", "timestamp": ...}, ...]
}

has_authors.jsonl

Each JSON line is as follows:

{
  "doc_id": ...
  "timestamp": ...
  "author_ids": ["123678452", ...]
}
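
The records above could be turned into lookup tables in both directions. The following is only a sketch: `load_doc_authors` is a hypothetical helper, and it assumes each `doc_id` appears on at most one line of the file.

```python
import json
from collections import defaultdict


def load_doc_authors(path):
    """Build doc -> authors and author -> docs mappings from has_authors.jsonl."""
    doc_to_authors = {}
    author_to_docs = defaultdict(list)
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            # Assumption: one line per document, so later lines would overwrite.
            doc_to_authors[record["doc_id"]] = record["author_ids"]
            for author_id in record["author_ids"]:
                author_to_docs[author_id].append(record["doc_id"])
    return doc_to_authors, dict(author_to_docs)
```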

in_refs.jsonl (incoming references)

Each JSON line is as follows:

{
  "doc_id": ...
  "in_refs": [{"doc_id": "...", "timestamp": ...}, ...]
}
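
Because each incoming reference carries its own timestamp, one possible use is to count only the references that existed before a given point in time. The sketch below illustrates this; `count_in_refs_before` is a hypothetical helper, and it assumes timestamps are numeric values that can be compared with `<`.

```python
import json


def count_in_refs_before(path, cutoff):
    """Count, per document, the incoming references with timestamp < cutoff.

    Assumption: timestamps are numbers on a shared scale (e.g. Unix time).
    """
    counts = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            counts[record["doc_id"]] = sum(
                1 for ref in record["in_refs"] if ref["timestamp"] < cutoff
            )
    return counts
```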

out_refs.jsonl (outgoing references)

Each JSON line is as follows:

{
  "doc_id": ...
  "timestamp": ...
  "out_refs": ["2048600620", ...]
}

affiliations.jsonl

Each JSON line is as follows:

{
  "id": ...
  "name": ...   # Name of the institution
}

conference_instances.jsonl

Each JSON line is as follows:

{
  "id": ...
  "name": ...
  "conference_series_id": ...
}

conference_series.jsonl

Each JSON line is as follows:

{
  "id": ...
  "name": ...
}

journals.jsonl

Each JSON line is as follows:

{
  "id": ...
  "name": ...
}

fields_of_study_hierarchies.jsonl

The fields of study associated with the documents are organized in a hierarchical tree structure.
Each JSON line is as follows:

{
  "id": ...
  "hierarchy": ...
}