
🪐MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset

This is the official code and data repository for the paper: 🪐MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset.

Overview

1. Download Dataset/Model Checkpoints

The 🪐MARS benchmark and our best model checkpoints for the three tasks in 🪐MARS can be downloaded at this link.
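
If the download contains one JSON file per task, a minimal loading sketch might look like the following. Note that the directory name `MARS` and file names such as `task1.json` are placeholders for illustration, not the actual release layout:

```python
import json
from pathlib import Path

# Placeholder layout: one JSON file per task under the downloaded folder.
# Adjust the directory and file names to match the actual release.
DATA_DIR = Path("MARS")

def load_task(task_name: str):
    """Load all examples for one MARS task from a JSON file."""
    with open(DATA_DIR / f"{task_name}.json", encoding="utf-8") as f:
        return json.load(f)

examples = load_task("task1")  # hypothetical task file name
print(f"Loaded {len(examples)} examples")
```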

2. Benchmark Curation

Code for instructing ChatGPT to curate the 🪐MARS benchmark can be found in the benchmark_curation folder.
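
As a rough illustration of what a ChatGPT-based curation step looks like (the actual prompts and pipeline live in `benchmark_curation`; the prompt text below is invented for this sketch), a single call via the official `openai` Python client could be:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt only; see benchmark_curation/ for the real instructions.
prompt = (
    "Given the event 'a person waters a plant', propose a counterfactual "
    "change to one of its components and describe the resulting event."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```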

3. Evaluation

Code for evaluating language models on the 🪐MARS benchmark can be found in the evaluation folder.
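
For classification-style tasks, evaluation typically reduces to comparing model predictions against gold labels. A minimal accuracy sketch, with hypothetical label strings that may not match the benchmark's actual label set, could look like:

```python
def accuracy(predictions, labels):
    """Fraction of examples where the model's prediction matches the gold label."""
    assert len(predictions) == len(labels)
    correct = sum(p == g for p, g in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical binary-choice labels for illustration
preds = ["plausible", "implausible", "plausible"]
golds = ["plausible", "plausible", "plausible"]
print(f"Accuracy: {accuracy(preds, golds):.2%}")
```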

4. Citing this work

Please use the BibTeX entry below to cite our paper:

@inproceedings{Wang2024MARSBT,
  title={MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset},
  author={Weiqi Wang and Yangqiu Song},
  year={2024},
  url={https://doi.org/10.48550/arXiv.2406.02106},
  doi={10.48550/arXiv.2406.02106}
}

5. Acknowledgement

The authors of this paper were supported by the NSFC Fund (U20B2053) from the National Natural Science Foundation of China (NSFC), and by the RIF (R6020-19 and R6021-20) and the GRF (16211520 and 16205322) from the Research Grants Council (RGC) of Hong Kong. We also thank the UGC Research Matching Grants (RMGS20EG01-D, RMGS20CR11, RMGS20CR12, RMGS20EG19, RMGS20EG21, RMGS23CR05, RMGS23EG08) for their support.