📚 See our Paper Here.
📃 Dataset and Benchmark Process Here.
💡 We are currently calling for contributions
- Introduction: English | Chinese
- Share your experience here: English version | Chinese version
- User-Centric 🏄🏻♀️🏄🏼🏄🏽♂️
  - Dataset
    - Real-world usage scenarios
    - Collected through a user survey with 712 participants in 23 countries
  - Evaluation
    - LLMs' efficacy as cooperative services in satisfying user needs
- Intent-Divided 🙇🧑💻🧑🎨🪂
  - System abilities and performance differ across scenarios,
  - Users' expectations differ across intents,
  - Evaluation criteria should differ across situations,
  - Therefore, we organize this benchmark by user intent.
  - Based on related literature, our intent taxonomy is:
    - Objective
      - Factual QA, Solve Professional Problem, Text Assistant, Use through APIs
    - Subjective
      - Seek Creativity, Ask for Advice, Leisure
- Multi-Cultural
  - The dataset is contributed by users from 23 countries in Asia, Europe, North America, Oceania, South America, and Africa.
  - Their reported scenarios cover multiple cultural backgrounds.
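
As a small illustration, the intent taxonomy listed above can be written down as a lookup table. This is only a sketch: the dictionary layout and the helper name `category_of` are assumptions made for this example and are not part of the benchmark's code.

```python
# Intent taxonomy as listed in this README.
# The dict layout and the helper below are illustrative assumptions.
INTENT_TAXONOMY = {
    "Objective": [
        "Factual QA",
        "Solve Professional Problem",
        "Text Assistant",
        "Use through APIs",
    ],
    "Subjective": [
        "Seek Creativity",
        "Ask for Advice",
        "Leisure",
    ],
}


def category_of(intent: str) -> str:
    """Return the taxonomy category ('Objective' or 'Subjective') of an intent."""
    for category, intents in INTENT_TAXONOMY.items():
        if intent in intents:
            return category
    raise KeyError(f"Unknown intent: {intent!r}")


if __name__ == "__main__":
    for name in ("Factual QA", "Leisure"):
        print(name, "->", category_of(name))
```
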
Model | Solve Problem | Factual QA | Text Assistant | Ask for Advice | Seek Creativity | Leisure | API | All |
---|---|---|---|---|---|---|---|---|
Cases | 379 | 259 | 82 | 116 | 86 | 83 | 26 | 1031 |
GPT-4-0125-preview | *8.28 | *8.68 | 7.91 | *7.69 | *7.47 | *7.57 | *8.38 | *8.16 |
Claude-3-opus | 7.61 | 7.71 | 7.68 | 7.01 | 7.10 | 7.16 | 7.77 | 7.50 |
Qwen-max | 7.53 | 7.64 | *8.20 | 7.28 | 7.09 | 6.63 | 7.65 | 7.48 |
GLM-4 | 7.52 | 7.29 | 7.65 | 7.20 | 7.10 | 6.37 | 8.04 | 7.32 |
ERNIE-Bot-4 | 7.51 | 7.17 | 7.23 | 7.09 | 7.20 | 7.02 | 8.00 | 7.30 |
Moonshot-v1-8k | 7.25 | 7.53 | 7.62 | 6.92 | 7.05 | 7.01 | 7.92 | 7.29 |
Spark-3.5 | 6.97 | 6.70 | 7.45 | 7.05 | 6.44 | 6.33 | 7.08 | 6.86 |
Baichuan2-Turbo | 6.55 | 6.83 | 6.91 | 6.35 | 6.17 | 6.02 | 7.19 | 6.57 |
GPT-3.5-turbo | 6.55 | 6.73 | 7.01 | 6.35 | 6.17 | 5.69 | 6.73 | 6.51 |
Deepseek-chat | 6.74 | 6.24 | 6.83 | 6.09 | 5.52 | 4.93 | 6.58 | 6.29 |
For each intent and for the overall score, we mark the three best-performing LLM services: the first with '*', the second in bold, and the third underlined.
The dataset comes from a user survey with 712 participants in 23 countries.
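
This README does not state how the "All" column is computed. The reported numbers are, however, consistent with a case-weighted mean of the per-intent scores, and the sketch below checks that reading for three rows of the leaderboard. The aggregation rule and the function name `weighted_overall` are assumptions for illustration, not the authors' evaluation code.

```python
# Case counts per intent, copied from the "Cases" row of the leaderboard.
CASES = {
    "Solve Problem": 379,
    "Factual QA": 259,
    "Text Assistant": 82,
    "Ask for Advice": 116,
    "Seek Creativity": 86,
    "Leisure": 83,
    "API": 26,
}

# Per-intent scores (same column order as CASES) and the reported "All" score.
LEADERBOARD = {
    "GPT-4-0125-preview": ([8.28, 8.68, 7.91, 7.69, 7.47, 7.57, 8.38], 8.16),
    "Claude-3-opus": ([7.61, 7.71, 7.68, 7.01, 7.10, 7.16, 7.77], 7.50),
    "Qwen-max": ([7.53, 7.64, 8.20, 7.28, 7.09, 6.63, 7.65], 7.48),
}


def weighted_overall(per_intent_scores, cases=CASES):
    """Case-weighted mean of per-intent scores (assumed aggregation rule)."""
    weights = list(cases.values())
    total = sum(w * s for w, s in zip(weights, per_intent_scores))
    return total / sum(weights)


if __name__ == "__main__":
    for model, (scores, reported) in LEADERBOARD.items():
        print(f"{model}: weighted={weighted_overall(scores):.2f} reported={reported:.2f}")
```

Because the per-intent scores are themselves rounded to two decimals, the recomputed value can differ from the reported one in the last digit.
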
Chinese Cases
Intent | Description | Cases | Evaluation Criteria |
---|---|---|---|
Solve Problem | Seek answers or explanations in fields such as programming, natural sciences, humanities, and social sciences; address and learn about professional matters | 大模型现在为什么都是decoder-only架构 (Why are large models now all decoder-only architectures?); 纯流体的粘度测试怎么做 (How do you test the viscosity of a pure fluid?); 烟草花叶病毒属外壳蛋白进入叶绿体的已知机制介绍 (Introduce the known mechanisms by which Tobamovirus coat proteins enter chloroplasts); 如何证明费马大定理? (How do you prove Fermat's Last Theorem?) | 1. Factuality, 2. User Satisfaction, 3. Clarity, 4. Logical Coherence, 5. Completeness |
Factual QA | Fast and direct access to factual information | 大雪农历初几 (On which day of the lunar month does the Major Snow solar term fall?); 一加仑是多少升 (How many liters are in a gallon?); 西瓜书的目录是什么 (What is the table of contents of the "Watermelon Book"?) | 1. Factuality, 2. User Satisfaction, 3. Clarity, 4. Completeness, 5. Logical Coherence |
Text Assistant | Summarizing, translating, editing, or creating content | 请你帮我撰写一段给领导2024龙年的拜年微信 (Please help me write a WeChat New Year greeting to my boss for 2024, the Year of the Dragon) | 1. Clarity, 2. User Satisfaction, 3. Logical Coherence, 4. Factuality, 5. Creativity |
Use through APIs | Use LLMs through an application programming interface (API) instead of a user interface; explore and test LLM capabilities, such as evaluating them on various tasks or simulating agents, environments, or datasets | 大模型CEval评测 (C-Eval evaluation of large models); MBTI测试 (MBTI test); 评价模型生成内容的helpfulness (Rate the helpfulness of model-generated content) | 1. Factuality, 2. User Satisfaction, 3. Clarity, 4. Logical Coherence, 5. Completeness |
Ask for Advice | Career development, personal counseling, gift recommendations, etc., or creating personal schedules, travel plans, shopping lists, etc. | 如何快速提高英语听力能力? (How can I quickly improve my English listening skills?); 哪些有效方式可以缓解失眠症状? (What are effective ways to relieve insomnia symptoms?); 适合中老年人的健康监测智能设备推荐 (Recommend smart health-monitoring devices suitable for middle-aged and elderly people) | 1. User Satisfaction, 2. Factuality, 3. Fairness and Responsibility, 4. Creativity, 5. Richness |
Seek Creativity | Brainstorming for inspiration, innovative ideas, etc. | 设计三个生鲜超市slogan (Design three slogans for a fresh-food supermarket); 我在构思经济学的课题,关于后疫情时代消费者行为变化,给我几个具体的idea (I am planning an economics research topic on changes in consumer behavior in the post-pandemic era; give me a few concrete ideas); 如何发财 (How to get rich) | 1. User Satisfaction, 2. Logical Coherence, 3. Creativity, 4. Richness, 5. Factuality |
Leisure | Movie and music recommendations, games, and other entertaining activities | 下饭剧推荐 (Recommend shows to watch while eating); 分享一个关于程序员的幽默笑话 (Share a funny joke about programmers); 推荐几款好玩的音乐节奏游戏 (Recommend some fun music rhythm games) | 1. User Satisfaction, 2. Engagement, 3. Appropriateness, 4. Creativity, 5. Factuality |
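
The per-intent criteria in the table above can be captured as a simple mapping, which makes it easy to see which criteria apply to every intent. This is a minimal sketch: the names `CRITERIA`, `rubric`, and `shared_criteria` are hypothetical and chosen for this example, and the code is not the benchmark's evaluation pipeline.

```python
# Evaluation criteria per intent, in the order listed in the table above.
# The variable and function names are hypothetical, for illustration only.
CRITERIA = {
    "Solve Problem": ["Factuality", "User Satisfaction", "Clarity",
                      "Logical Coherence", "Completeness"],
    "Factual QA": ["Factuality", "User Satisfaction", "Clarity",
                   "Completeness", "Logical Coherence"],
    "Text Assistant": ["Clarity", "User Satisfaction", "Logical Coherence",
                       "Factuality", "Creativity"],
    "Use through APIs": ["Factuality", "User Satisfaction", "Clarity",
                         "Logical Coherence", "Completeness"],
    "Ask for Advice": ["User Satisfaction", "Factuality",
                       "Fairness and Responsibility", "Creativity", "Richness"],
    "Seek Creativity": ["User Satisfaction", "Logical Coherence", "Creativity",
                        "Richness", "Factuality"],
    "Leisure": ["User Satisfaction", "Engagement", "Appropriateness",
                "Creativity", "Factuality"],
}


def rubric(intent: str) -> str:
    """Render the criteria for one intent as a numbered, single-line rubric."""
    return ", ".join(f"{i}. {name}" for i, name in enumerate(CRITERIA[intent], 1))


def shared_criteria() -> set:
    """Criteria that appear in the list of every intent."""
    return set.intersection(*(set(names) for names in CRITERIA.values()))


if __name__ == "__main__":
    print(rubric("Leisure"))
    print(sorted(shared_criteria()))
```
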
- The evaluation criteria for each intent are shown in the table above.
- Details are shown in the paper.
- Please cite our paper if you find our work valuable. Thank you!
@inproceedings{wang2024user,
title={A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models},
author={Wang, Jiayin and Mo, Fengran and Ma, Weizhi and Sun, Peijie and Zhang, Min and Nie, Jian-Yun},
booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
pages={3588--3612},
year={2024}
}