CHARM✨ Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
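Benchmark | Chinese language | Commonsense reasoning | Chinese-specific knowledge | Chinese and global knowledge domains | Reasoning-memorization correlation |
---|---|---|---|---|---|
Benchmarks mentioned in davis2023benchmarks | ✘ | ✔ | ✘ | ✘ | ✘ |
XNLI, XCOPA, XStoryCloze | ✔ | ✔ | ✘ | ✘ | ✘ |
LogiQA, CLUE, CMMLU | ✔ | ✘ | ✔ | ✘ | ✘ |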
CORECODE | ✔ | ✔ | ✘ | ✘ | ✘ |
CHARM (ours) | ✔ | ✔ | ✔ | ✔ | ✔ |
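- [2024.7.26] OpenCompass now supports all CHARM inference and evaluation tasks. 🔥🔥🔥
- [2024.6.06] Updated the leaderboard with evaluations of LLaMA-3, GPT-4o, Gemini-1.5, Yi1.5, Qwen1.5, and other models.
- [2024.5.24] Released the CHARM data! 🔥🔥🔥
- [2024.5.15] CHARM was accepted to the main conference of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)! 🔥🔥🔥
- [2024.3.21] Paper released on arXiv.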
The following steps show how to download CHARM and evaluate it with OpenCompass.
To install OpenCompass, refer to the OpenCompass installation steps.
git clone https://github.com/opendatalab/CHARM ${path_to_CHARM_repo}
cd ${path_to_opencompass}
mkdir -p data
ln -snf ${path_to_CHARM_repo}/data/CHARM ./data/CHARM
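Before launching a run, it can help to confirm that the symlink created above actually resolves to the dataset. The following is a minimal sketch, not part of CHARM or OpenCompass: the helper name `charm_data_ready` is hypothetical, and it only checks that `data/CHARM` under the OpenCompass checkout is a non-empty directory, without validating the files inside.

```python
from pathlib import Path


def charm_data_ready(opencompass_root):
    """Return True if <opencompass_root>/data/CHARM resolves to a non-empty directory.

    Hypothetical helper: Path.is_dir() follows symlinks, so a dangling link
    (for example, a wrong ${path_to_CHARM_repo}) is reported as not ready.
    """
    link = Path(opencompass_root) / "data" / "CHARM"
    return link.is_dir() and any(link.iterdir())
```

For example, `charm_data_ready("/path/to/opencompass")` should return `True` once the `ln -snf` command has been run against a valid CHARM checkout.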
cd ${path_to_opencompass}
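# Edit the config file `configs/eval_charm_rea.py`: uncomment the existing models, or add the models you want to evaluate
python run.py configs/eval_charm_rea.py -r --dump-eval-details
# Edit the config file `configs/eval_charm_mem.py`: uncomment the existing models, or add the models you want to evaluate
python run.py configs/eval_charm_mem.py -r --dump-eval-details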
Inference and evaluation results are written to `${path_to_opencompass}/outputs`, laid out as follows:
outputs
├── CHARM_mem
│   └── chat
│       └── 20240605_151442
│           ├── predictions
│           │   ├── internlm2-chat-1.8b-turbomind
│           │   ├── llama-3-8b-instruct-lmdeploy
│           │   └── qwen1.5-1.8b-chat-hf
│           ├── results
│           │   ├── internlm2-chat-1.8b-turbomind_judged-by--GPT-3.5-turbo-0125
│           │   ├── llama-3-8b-instruct-lmdeploy_judged-by--GPT-3.5-turbo-0125
│           │   └── qwen1.5-1.8b-chat-hf_judged-by--GPT-3.5-turbo-0125
│           └── summary
│               └── 20240605_205020 # MEMORY_SUMMARY_DIR
│                   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Anachronisms_Judgment
│                   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Movie_and_Music_Recommendation
│                   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Sport_Understanding
│                   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Time_Understanding
│                   └── judged-by--GPT-3.5-turbo-0125.csv # MEMORY_SUMMARY_CSV
└── CHARM_rea
    └── chat
        └── 20240605_152359
            ├── predictions
            │   ├── internlm2-chat-1.8b-turbomind
            │   ├── llama-3-8b-instruct-lmdeploy
            │   └── qwen1.5-1.8b-chat-hf
            ├── results # REASON_RESULTS_DIR
            │   ├── internlm2-chat-1.8b-turbomind
            │   ├── llama-3-8b-instruct-lmdeploy
            │   └── qwen1.5-1.8b-chat-hf
            └── summary
                ├── summary_20240605_205328.csv # REASON_SUMMARY_CSV
                └── summary_20240605_205328.txt
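The four annotated paths in the tree (`REASON_SUMMARY_CSV`, `REASON_RESULTS_DIR`, `MEMORY_SUMMARY_DIR`, `MEMORY_SUMMARY_CSV`) are the values the analysis scripts expect. The sketch below shows one way to look them up instead of copying timestamps by hand. It is an illustration under stated assumptions, not a tool shipped with CHARM: the function names are hypothetical, the layout is assumed to match the tree above (including the `chat` subdirectory), and when several timestamped runs or several judge CSVs exist it simply picks the one whose name sorts last.

```python
from pathlib import Path


def latest(paths):
    """Return the path that sorts last; timestamped names sort chronologically."""
    paths = sorted(paths)
    if not paths:
        raise FileNotFoundError("no matching entries found")
    return paths[-1]


def locate_charm_outputs(outputs_root):
    """Map the four placeholder names to paths under an OpenCompass outputs directory.

    Hypothetical helper: assumes the directory layout shown in the tree above.
    """
    root = Path(outputs_root)

    # Reasoning run: outputs/CHARM_rea/chat/<run timestamp>/
    rea_run = latest(p for p in (root / "CHARM_rea" / "chat").iterdir() if p.is_dir())
    reason_csv = latest((rea_run / "summary").glob("summary_*.csv"))

    # Memorization run: outputs/CHARM_mem/chat/<run timestamp>/summary/<summary timestamp>/
    mem_run = latest(p for p in (root / "CHARM_mem" / "chat").iterdir() if p.is_dir())
    mem_summary_dir = latest(p for p in (mem_run / "summary").iterdir() if p.is_dir())
    mem_csv = latest(mem_summary_dir.glob("*.csv"))

    return {
        "REASON_SUMMARY_CSV": reason_csv,
        "REASON_RESULTS_DIR": rea_run / "results",
        "MEMORY_SUMMARY_DIR": mem_summary_dir,
        "MEMORY_SUMMARY_CSV": mem_csv,
    }
```

Calling `locate_charm_outputs("/path/to/opencompass/outputs")` returns a dictionary whose entries can be exported as shell variables before running the commands that follow.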
cd ${path_to_CHARM_repo}
# Generates Table 5, Table 6, Table 9, and Table 10 of the paper; see https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/summarize_reasoning.py ${REASON_SUMMARY_CSV}
# Generates Figure 3 and Figure 9 of the paper; see https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/summarize_mem_rea.py ${REASON_SUMMARY_CSV} ${MEMORY_SUMMARY_CSV}
# Generates Table 7, Table 12, Table 13, and Figure 11 of the paper; see https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/analyze_mem_indep_rea.py data/CHARM ${REASON_RESULTS_DIR} ${MEMORY_SUMMARY_DIR} ${MEMORY_SUMMARY_CSV}
@misc{sun2024benchmarking,
title={Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations},
author={Jiaxing Sun and Weiquan Huang and Jiang Wu and Chenya Gu and Wei Li and Songyang Zhang and Hang Yan and Conghui He},
year={2024},
eprint={2403.14112},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
This project is released under the Apache 2.0 license.