User-Centric Evaluation of LLMs

📚 Our paper (EMNLP 2024 Resource Award): A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models.

📃 Dataset: HuggingFace. Benchmark process: here.

💡 We are currently calling for contributions.

Introduction: ENG | 中文
Share Your Experience Here: English Version | 中文版

Our Highlights

User-Centric 🏄🏻‍♀️🏄🏼🏄🏽‍♂️
- Dataset
  - Real-world usage scenarios
  - The dataset was collected through a user survey with 712 participants from 23 countries
- Evaluation
  - LLMs' efficacy as cooperative services in satisfying user needs
Intent-Divided 🙇🧑‍💻🧑‍🎨🪂
- System abilities and performance may differ across scenarios.
- Users' expectations differ across intents.
- Evaluation criteria should differ across situations.
- Therefore, we categorize this benchmark by user intents.
- Following related literature, our intent taxonomy is:
  - Objective
    - Factual QA, Solve Professional Problem, Text Assistant, Use through APIs
  - Subjective
    - Seek Creativity, Ask for Advice, Leisure
Multi-Cultural
- The dataset is contributed by users from 23 countries in Asia, Europe, North America, Oceania, South America, and Africa.
- Their reported scenarios cover multiple cultural backgrounds.
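
The intent taxonomy listed above can be written down as a small lookup structure. The following is a minimal Python sketch only; the names `INTENT_TAXONOMY` and `category_of` are illustrative assumptions and are not taken from the repository.

```python
# Illustrative encoding of the intent taxonomy listed above.
# The variable and function names are assumptions made for this sketch.
INTENT_TAXONOMY = {
    "Objective": [
        "Factual QA",
        "Solve Professional Problem",
        "Text Assistant",
        "Use through APIs",
    ],
    "Subjective": [
        "Seek Creativity",
        "Ask for Advice",
        "Leisure",
    ],
}


def category_of(intent: str) -> str:
    """Return "Objective" or "Subjective" for a given intent name."""
    for category, intents in INTENT_TAXONOMY.items():
        if intent in intents:
            return category
    raise KeyError(f"Unknown intent: {intent!r}")


if __name__ == "__main__":
    for name in ("Factual QA", "Leisure"):
        print(name, "->", category_of(name))
```

Keeping the taxonomy in one mapping makes it straightforward to group per-intent results into objective and subjective subsets.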

Benchmark Results

| | Solve Problem | Factual QA | Text Assistant | Ask for Advice | Seek Creativity | Leisure | API | All |
|---|---|---|---|---|---|---|---|---|
| Cases | 379 | 259 | 82 | 116 | 86 | 83 | 26 | 1031 |
| GPT-4-0125-preview | *8.28 | *8.68 | 7.91 | *7.69 | *7.47 | *7.57 | *8.38 | *8.16 |
| Claude-3-opus | 7.61 | 7.71 | 7.68 | 7.01 | 7.10 | 7.16 | 7.77 | 7.50 |
| Qwen-max | 7.53 | 7.64 | *8.20 | 7.28 | 7.09 | 6.63 | 7.65 | 7.48 |
| GLM-4 | 7.52 | 7.29 | 7.65 | 7.20 | 7.10 | 6.37 | 8.04 | 7.32 |
| ERNIE-Bot-4 | 7.51 | 7.17 | 7.23 | 7.09 | 7.20 | 7.02 | 8.00 | 7.30 |
| Moonshot-v1-8k | 7.25 | 7.53 | 7.62 | 6.92 | 7.05 | 7.01 | 7.92 | 7.29 |
| Spark-3.5 | 6.97 | 6.70 | 7.45 | 7.05 | 6.44 | 6.33 | 7.08 | 6.86 |
| Baichuan2-Turbo | 6.55 | 6.83 | 6.91 | 6.35 | 6.17 | 6.02 | 7.19 | 6.57 |
| GPT-3.5-turbo | 6.55 | 6.73 | 7.01 | 6.35 | 6.17 | 5.69 | 6.73 | 6.51 |
| Deepseek-chat | 6.74 | 6.24 | 6.83 | 6.09 | 5.52 | 4.93 | 6.58 | 6.29 |

For each intent and for the overall score, we mark the three best-performing LLM services: the best is marked with '*', the second is bolded, and the third is underlined.
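
The ranking described above can be recomputed from the reported scores. The sketch below is an illustration, not the authors' evaluation code: the names `SCORES`, `top_three`, and `weighted_overall` are hypothetical, ties are broken by table order (an arbitrary choice), and the idea that the "All" column is a case-weighted mean of the per-intent scores is an assumption. That assumption agrees with the reported values to within rounding for the rows checked by hand, but it is not stated in this section.

```python
# Per-intent scores copied from the results table; column order follows INTENTS.
INTENTS = ["Solve Problem", "Factual QA", "Text Assistant", "Ask for Advice",
           "Seek Creativity", "Leisure", "API"]
CASES = [379, 259, 82, 116, 86, 83, 26]  # number of cases per intent

SCORES = {
    "GPT-4-0125-preview": [8.28, 8.68, 7.91, 7.69, 7.47, 7.57, 8.38],
    "Claude-3-opus":      [7.61, 7.71, 7.68, 7.01, 7.10, 7.16, 7.77],
    "Qwen-max":           [7.53, 7.64, 8.20, 7.28, 7.09, 6.63, 7.65],
    "GLM-4":              [7.52, 7.29, 7.65, 7.20, 7.10, 6.37, 8.04],
    "ERNIE-Bot-4":        [7.51, 7.17, 7.23, 7.09, 7.20, 7.02, 8.00],
    "Moonshot-v1-8k":     [7.25, 7.53, 7.62, 6.92, 7.05, 7.01, 7.92],
    "Spark-3.5":          [6.97, 6.70, 7.45, 7.05, 6.44, 6.33, 7.08],
    "Baichuan2-Turbo":    [6.55, 6.83, 6.91, 6.35, 6.17, 6.02, 7.19],
    "GPT-3.5-turbo":      [6.55, 6.73, 7.01, 6.35, 6.17, 5.69, 6.73],
    "Deepseek-chat":      [6.74, 6.24, 6.83, 6.09, 5.52, 4.93, 6.58],
}


def top_three(intent: str) -> list[str]:
    """Return the three highest-scoring services for one intent.

    sorted() is stable, so services with equal scores keep their table order.
    """
    col = INTENTS.index(intent)
    ranked = sorted(SCORES, key=lambda model: SCORES[model][col], reverse=True)
    return ranked[:3]


def weighted_overall(model: str) -> float:
    """Case-weighted mean of the per-intent scores (assumed meaning of "All")."""
    total = sum(score * n for score, n in zip(SCORES[model], CASES))
    return total / sum(CASES)


if __name__ == "__main__":
    for intent in INTENTS:
        print(intent, top_three(intent))
    for model in SCORES:
        print(model, round(weighted_overall(model), 2))
```

Because the per-intent scores are already rounded to two decimals, a recomputed overall score can differ from the reported one in the last digit.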

Dataset

The dataset comes from a user survey with 712 participants from 23 countries.

Example Cases

Chinese Cases (English translation)

| Intent | Description | Cases | Evaluation Criteria |
|---|---|---|---|
| Solve Problem | Seek answers or explanations in programming, natural sciences, humanities, social sciences, etc.; address professional problems and learn about a profession | Why do large models now all use a decoder-only architecture?; How is the viscosity of a pure fluid measured?; Introduce the known mechanisms by which Tobamovirus coat proteins enter chloroplasts; How can Fermat's Last Theorem be proved? | 1 Factuality, 2 User Satisfaction, 3 Clarity, 4 Logical Coherence, 5 Completeness |
| Factual QA | Fast and direct access to factual information | On which day of the lunar month does the Major Snow solar term fall?; How many liters are in a gallon?; What is the table of contents of the Watermelon Book? | 1 Factuality, 2 User Satisfaction, 3 Clarity, 4 Completeness, 5 Logical Coherence |
| Text Assistant | Summarizing, translating, editing, or creating content | Please write a WeChat New Year greeting to my boss for 2024, the Year of the Dragon | 1 Clarity, 2 User Satisfaction, 3 Logical Coherence, 4 Factuality, 5 Creativity |
| Use through APIs | Use through an Application Programming Interface instead of a user interface; explore and test the capabilities of LLMs, such as evaluating them on various tasks or simulating agents, environments, or datasets | Evaluate a large model on CEval; MBTI test; Rate the helpfulness of model-generated content | 1 Factuality, 2 User Satisfaction, 3 Clarity, 4 Logical Coherence, 5 Completeness |
| Ask for Advice | Career development, personal counseling, gift recommendations, etc., or creating personal schedules, travel plans, shopping lists, etc. | How can I quickly improve my English listening skills?; What are effective ways to relieve insomnia symptoms?; Recommend smart health-monitoring devices suitable for middle-aged and elderly people | 1 User Satisfaction, 2 Factuality, 3 Fairness and Responsibility, 4 Creativity, 5 Richness |
| Seek Creativity | Brainstorming for inspiration, innovative ideas, etc. | Design three slogans for a fresh-food supermarket; I am developing an economics research topic on changes in consumer behavior in the post-pandemic era, give me a few specific ideas; How to get rich | 1 User Satisfaction, 2 Logical Coherence, 3 Creativity, 4 Richness, 5 Factuality |
| Leisure | Movie and music recommendations, games, and other entertaining activities | Recommend shows to watch while eating; Share a humorous joke about programmers; Recommend some fun music rhythm games | 1 User Satisfaction, 2 Engagement, 3 Appropriateness, 4 Creativity, 5 Factuality |

Evaluation

The evaluation criteria for each intent are shown in the table above.
Details are shown in the paper.
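
To illustrate how intent-specific criteria might drive an evaluation, here is a minimal Python sketch. It only encodes the numbered criteria from the table above and formats them into a grading rubric; the function name `build_rubric`, the rubric wording, and the 1-10 scale are assumptions for illustration. The actual evaluation procedure is the one described in the paper.

```python
# Criteria per intent, in the numbered order given in the Example Cases table.
CRITERIA = {
    "Solve Problem": ["Factuality", "User Satisfaction", "Clarity",
                      "Logical Coherence", "Completeness"],
    "Factual QA": ["Factuality", "User Satisfaction", "Clarity",
                   "Completeness", "Logical Coherence"],
    "Text Assistant": ["Clarity", "User Satisfaction", "Logical Coherence",
                       "Factuality", "Creativity"],
    "Use through APIs": ["Factuality", "User Satisfaction", "Clarity",
                         "Logical Coherence", "Completeness"],
    "Ask for Advice": ["User Satisfaction", "Factuality",
                       "Fairness and Responsibility", "Creativity", "Richness"],
    "Seek Creativity": ["User Satisfaction", "Logical Coherence", "Creativity",
                        "Richness", "Factuality"],
    "Leisure": ["User Satisfaction", "Engagement", "Appropriateness",
                "Creativity", "Factuality"],
}


def build_rubric(intent: str, question: str, answer: str) -> str:
    """Format a hypothetical grading prompt using the criteria for one intent."""
    criteria = CRITERIA[intent]  # raises KeyError for an unknown intent
    lines = [
        f"User intent: {intent}",
        f"User question: {question}",
        f"Model answer: {answer}",
        "Rate the answer from 1 to 10 on each criterion:",  # scale is assumed
    ]
    lines += [f"{rank}. {name}" for rank, name in enumerate(criteria, start=1)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_rubric("Factual QA",
                       "How many liters are in a gallon?",
                       "One US gallon is about 3.785 liters."))
```

Keeping the criteria in a per-intent mapping means the same grading routine can be reused across intents while the rubric changes with the user's goal.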

Citation

Please cite our paper if you find our work valuable. Thank you!

@inproceedings{wang2024user,
  title={A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models},
  author={Wang, Jiayin and Mo, Fengran and Ma, Weizhi and Sun, Peijie and Zhang, Min and Nie, Jian-Yun},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  pages={3588--3612},
  year={2024}
}
