As Large Language Models (LLMs) demonstrate strong abilities to reason, plan, and make decisions, researchers are increasingly leveraging them to carry out complex tasks in diverse environments, giving rise to the term Language Agents. In this burgeoning field, evaluation methods are developing rapidly, and measuring an agent's performance is challenging because tasks range from QA, math, and code to planning, decision-making, and web navigation. It is therefore essential to understand the existing evaluation approaches and benchmarks!
📚 This awesome (and ongoing) repository aims to provide a comprehensive, organized list of LLM/language agent benchmarks! As a bonus, we also include a curated collection of other awesome language-agent-adjacent lists. Pull requests and issues are welcome! 😎
- 🧠 General
- 💻 Computer Use
- 🔨 Software Engineering
- 🧰 Tool-Use
- 🔍 RAG
- 🤖🤖 Multi-Agent
- 🗄️ Other Awesome Lists
## 🧠 General

Depending on the domain, some agents are evaluated directly on LLM benchmarks, so we briefly include a couple of popular ones here. You can find comprehensive lists of LLM benchmarks in the 🗄️ Other Awesome Lists section!
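Many of the QA-style entries below boil down to the same loop: sample a question, query the model or agent, and compare its answer to a reference. The sketch below illustrates that loop under stated assumptions (the Hugging Face `datasets` library, the public `hotpot_qa` dataset, and a hypothetical `ask_model` stub standing in for the system under test); official leaderboards usually also report token-level F1 rather than exact match alone.

```python
# Minimal sketch (not a canonical evaluation harness) of scoring a model on a
# QA-style benchmark such as HotpotQA.
# Assumptions: the Hugging Face `datasets` library is installed and the public
# "hotpot_qa" dataset (distractor config) is used; `ask_model` is a hypothetical
# placeholder for whichever LLM or agent you want to evaluate.
from datasets import load_dataset


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences don't count as errors.
    return " ".join(text.lower().strip().split())


def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)


def ask_model(question: str) -> str:
    # Hypothetical stub: replace with a call to your own agent or LLM API.
    raise NotImplementedError


def evaluate(n_samples: int = 100) -> float:
    data = load_dataset("hotpot_qa", "distractor", split="validation")
    hits = 0
    for example in data.select(range(n_samples)):
        # Count a hit when the agent's answer exactly matches the reference after normalization.
        hits += exact_match(ask_model(example["question"]), example["answer"])
    return hits / n_samples
```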
- Yang, Zhilin, et al. "HotpotQA: A dataset for diverse, explainable multi-hop question answering." arXiv preprint arXiv:1809.09600 (2018). [paper] [project]
- Thorne, James, et al. "FEVER: a large-scale dataset for fact extraction and VERification." arXiv preprint arXiv:1803.05355 (2018). [paper] [project]
- Joshi, Mandar, et al. "Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension." arXiv preprint arXiv:1705.03551 (2017). [paper] [project]
- Min, Sewon, et al. "AmbigQA: Answering ambiguous open-domain questions." arXiv preprint arXiv:2004.10645 (2020). [paper] [project]
- Hendrycks, Dan, et al. "Measuring massive multitask language understanding." arXiv preprint arXiv:2009.03300 (2020). [paper] [project]
- Cobbe, Karl, et al. "Training verifiers to solve math word problems." arXiv preprint arXiv:2110.14168 (2021). [paper] [project]
- Patel, Arkil, Satwik Bhattamishra, and Navin Goyal. "Are NLP models really able to solve simple math word problems?." arXiv preprint arXiv:2103.07191 (2021). [paper] [project]
- Lu, Pan, et al. "Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning." arXiv preprint arXiv:2209.14610 (2022). [paper] [project]
- Hendrycks, Dan, et al. "Measuring mathematical problem solving with the MATH dataset." arXiv preprint arXiv:2103.03874 (2021). [paper] [project]
- Chen, Mark, et al. "Evaluating large language models trained on code." arXiv preprint arXiv:2107.03374 (2021). [paper] [project]
- Austin, Jacob, et al. "Program synthesis with large language models." arXiv preprint arXiv:2108.07732 (2021). [paper] [project]
- Shridhar, Mohit, et al. "Alfworld: Aligning text and embodied environments for interactive learning." arXiv preprint arXiv:2010.03768 (2020). [paper] [project]
- Kannan, Shyam Sundar, Vishnunandan LN Venkatesh, and Byung-Cheol Min. "Smart-llm: Smart multi-agent robot task planning using large language models." arXiv preprint arXiv:2309.10062 (2023). [paper] [project]
- Chang, Matthew, et al. "PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks." arXiv preprint arXiv:2411.00081 (2024). [paper] [project]
- Gao, Chen, et al. "EmbodiedCity: A Benchmark Platform for Embodied Agent in Real-world City Environment." arXiv preprint arXiv:2410.09604 (2024). [paper] [project]
- Choi, Jae-Woo, et al. "Lota-bench: Benchmarking language-oriented task planners for embodied agents." arXiv preprint arXiv:2402.08178 (2024). [paper] [project]
- Xu, Tianqi, et al. "CRAB: Cross-platform agent benchmark for multi-modal embodied language model agents." NeurIPS 2024 Workshop on Open-World Agents. 2024. [paper] [project]
- Zhang, Lingfeng, et al. "ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models." arXiv preprint arXiv:2410.14682 (2024). [paper]
- "VillagerBench: Benchmarking Multi-Agent Collaboration in Minecraft." [paper]
- Wu, Yue, et al. "Smartplay: A benchmark for llms as intelligent agents." arXiv preprint arXiv:2310.01557 (2023). [paper]
- Gioacchini, Luca, et al. "AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents." arXiv preprint arXiv:2404.06411 (2024). [paper] [project]
- Liu, Xiao, et al. "Agentbench: Evaluating llms as agents." arXiv preprint arXiv:2308.03688 (2023). [paper] [project]
- Liu, Xiao, et al. "Visualagentbench: Towards large multimodal models as visual foundation agents." arXiv preprint arXiv:2408.06327 (2024). [paper] [project]
- Ma, Chang, et al. "AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents." arXiv preprint arXiv:2401.13178 (2024). [paper] [project]
- Yin, Guoli, et al. "MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains." arXiv preprint arXiv:2407.18961 (2024). [paper] [project]
- Liu, Zhiwei, et al. "Bolaa: Benchmarking and orchestrating llm-augmented autonomous agents." arXiv preprint arXiv:2308.05960 (2023). [paper] [project]
- Mialon, Grégoire, et al. "Gaia: a benchmark for general ai assistants." arXiv preprint arXiv:2311.12983 (2023). [paper] [project]
- Wang, Wei, et al. "BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems." arXiv preprint arXiv:2408.15971 (2024). [paper] [project]
- Li, Youquan, et al. "FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs' Responsiveness to Human Feedback." arXiv preprint arXiv:2410.09412 (2024). [paper]
- Shen, Yongliang, et al. "Taskbench: Benchmarking large language models for task automation." arXiv preprint arXiv:2311.18760 (2023). [paper] [project]
- Valmeekam, Karthik, et al. "Planbench: An extensible benchmark for evaluating large language models on planning and reasoning about change." Advances in Neural Information Processing Systems 36 (2024). [paper] [project]
- Xiao, Ruixuan, et al. "Flowbench: Revisiting and benchmarking workflow-guided planning for llm-based agents." arXiv preprint arXiv:2406.14884 (2024). [paper]
- Xie, Jian, et al. "Travelplanner: A benchmark for real-world planning with language agents." arXiv preprint arXiv:2402.01622 (2024). [paper] [project]
- Cheng, Ching-An, et al. "Llf-bench: Benchmark for interactive learning from language feedback." arXiv preprint arXiv:2312.06853 (2023). [paper] [project]
- Wu, Cheng-Kuang, et al. "StreamBench: Towards Benchmarking Continuous Improvement of Language Agents." arXiv preprint arXiv:2406.08747 (2024). [paper] [project]
- Tu, Quan, et al. "Charactereval: A chinese benchmark for role-playing conversational agent evaluation." arXiv preprint arXiv:2401.01275 (2024). [paper] [project]
- Wang, Zekun Moore, et al. "Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models." arXiv preprint arXiv:2310.00746 (2023). [paper] [project]
- Chen, Hongzhan, et al. "RoleInteract: Evaluating the Social Interaction of Role-Playing Agents." arXiv preprint arXiv:2403.13679 (2024). [paper] [project]
- Liu, Jiaheng, et al. "RoleAgent: Building, Interacting, and Benchmarking High-quality Role-Playing Agents from Scripts." The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track. [paper]
- Yan, Ming, et al. "Larp: Language-agent role play for open-world games." arXiv preprint arXiv:2312.17653 (2023). [paper]
- Debenedetti, Edoardo, et al. "AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents." arXiv preprint arXiv:2406.13352 (2024). [paper] [project]
- Zhang, Hanrong, et al. "Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents." arXiv preprint arXiv:2410.02644 (2024). [paper] [project]
- Andriushchenko, Maksym, et al. "Agentharm: A benchmark for measuring harmfulness of llm agents." arXiv preprint arXiv:2410.09024 (2024). [paper] [project]
- Yin, Sheng, et al. "SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents." arXiv preprint arXiv:2412.13178 (2024). [paper] [project]
- Yuan, Tongxin, et al. "R-judge: Benchmarking safety risk awareness for llm agents." arXiv preprint arXiv:2401.10019 (2024). [paper] [project]
- "JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework." [paper] [project]
## 💻 Computer Use

- Kapoor, Raghav, et al. "OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web." European Conference on Computer Vision. Springer, Cham, 2025. [paper]
- Xu, Tianqi, et al. "Crab: Cross-environment agent benchmark for multimodal language model agents." arXiv preprint arXiv:2407.01511 (2024). [paper] [project]
- Xie, Tianbao, et al. "Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments." arXiv preprint arXiv:2404.07972 (2024). [paper] [project]
- Bonatti, Rogerio, et al. "Windows agent arena: Evaluating multi-modal os agents at scale." arXiv preprint arXiv:2409.08264 (2024). [paper] [project]
- Wang, Zilong, et al. "Officebench: Benchmarking language agents across multiple applications for office automation." arXiv preprint arXiv:2407.19056 (2024). [paper]
- Styles, Olly, et al. "WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting." arXiv preprint arXiv:2405.00823 (2024). [paper]
- Drouin, Alexandre, et al. "WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks?." arXiv preprint arXiv:2403.07718 (2024). [paper] [project]
- Humphreys, Peter C., et al. "A data-driven approach for learning to control computers." International Conference on Machine Learning. PMLR, 2022. [paper]
- Gao, Difei, et al. "Assistgui: Task-oriented desktop graphical user interface automation." arXiv preprint arXiv:2312.13108 (2023). [paper] [project]
- Zhang, Ziniu, et al. "MMInA: Benchmarking Multihop Multimodal Internet Agents." arXiv preprint arXiv:2404.09992 (2024). [paper] [project]
- Zhou, Shuyan, et al. "Webarena: A realistic web environment for building autonomous agents." arXiv preprint arXiv:2307.13854 (2023). [paper] [project]
- Koh, Jing Yu, et al. "Visualwebarena: Evaluating multimodal agents on realistic visual web tasks." arXiv preprint arXiv:2401.13649 (2024). [paper] [project]
- Yoran, Ori, et al. "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?." arXiv preprint arXiv:2407.15711 (2024). [paper] [project]
- Jang, Lawrence, et al. "Videowebarena: Evaluating long context multimodal agents with video understanding web tasks." arXiv preprint arXiv:2410.19100 (2024). [paper]
- Xu, Frank F., et al. "TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks." arXiv preprint arXiv:2412.14161 (2024). [paper] [project]
- Xu, Kevin, et al. "Tur[k]ingBench: A challenge benchmark for web agents." arXiv preprint arXiv:2403.11905 (2024). [paper] [project]
- Yao, Shunyu, et al. "Webshop: Towards scalable real-world web interaction with grounded language agents." Advances in Neural Information Processing Systems 35 (2022): 20744-20757. [paper] [project]
- Lin, Kevin Qinghong, et al. "VideoGUI: A Benchmark for GUI Automation from Instructional Videos." arXiv preprint arXiv:2406.10227 (2024). [paper] [project]
- Deng, Xiang, et al. "Mind2web: Towards a generalist agent for the web." Advances in Neural Information Processing Systems 36 (2024). [paper] [project]
- Shi, Tianlin, et al. "World of bits: An open-domain platform for web-based agents." International Conference on Machine Learning. PMLR, 2017. [paper]
- Deng, Shihan, et al. "Mobile-bench: An evaluation benchmark for llm-based mobile agents." arXiv preprint arXiv:2407.00993 (2024). [paper]
- Rawles, Christopher, et al. "AndroidWorld: A dynamic benchmarking environment for autonomous agents." arXiv preprint arXiv:2405.14573 (2024). [paper] [project]
- Zhang, Danyang, et al. "Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction." arXiv preprint arXiv:2305.08144 (2023). [paper] [project]
- Lee, Juyong, et al. "Benchmarking Mobile Device Control Agents across Diverse Configurations." arXiv preprint arXiv:2404.16660 (2024). [paper] [project]
- Wang, Luyuan, et al. "MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents." arXiv preprint arXiv:2406.08184 (2024). [paper] [project]
- Trivedi, Harsh, et al. "Appworld: A controllable world of apps and people for benchmarking interactive coding agents." arXiv preprint arXiv:2407.18901 (2024). [paper] [project]
- Sun, Liangtai, et al. "Meta-gui: Towards multi-modal conversational agents on mobile gui." arXiv preprint arXiv:2205.11029 (2022). [paper] [project]
- Wang, Junyang, et al. "Mobile-agent: Autonomous multi-modal mobile device agent with visual perception." arXiv preprint arXiv:2401.16158 (2024). [paper]
- Zhang, Chi, et al. "Appagent: Multimodal agents as smartphone users." arXiv preprint arXiv:2312.13771 (2023). [paper] [project]
- Rodriguez, Juan, et al. "BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks." arXiv preprint arXiv:2412.04626 (2024). [paper] [project]
- Zou, Anni, et al. "DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems." arXiv preprint arXiv:2407.10701 (2024). [paper] [project]
## 🔨 Software Engineering

- Jimenez, Carlos E., et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" arXiv preprint arXiv:2310.06770 (2023). [paper] [project]
- Aleithan, Reem, et al. "SWE-Bench+: Enhanced Coding Benchmark for LLMs." arXiv preprint arXiv:2410.06992 (2024). [paper] [project]
- Zan, Daoguang, et al. "Swe-bench-java: A github issue resolving benchmark for java." arXiv preprint arXiv:2408.14354 (2024). [paper] [project]
- Yang, John, et al. "SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?." arXiv preprint arXiv:2410.03859 (2024). [paper] [project]
- Gu, Alex, et al. "Cruxeval: A benchmark for code reasoning, understanding and execution." arXiv preprint arXiv:2401.03065 (2024). [paper] [project]
- Siegel, Zachary S., et al. "CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark." arXiv preprint arXiv:2409.11363 (2024). [paper] [project]
- Guo, Chengquan, et al. "RedCode: Risky Code Execution and Generation Benchmark for Code Agents." arXiv preprint arXiv:2411.07781 (2024). [paper] [project]
- Zhuo, Terry Yue, et al. "Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions." arXiv preprint arXiv:2406.15877 (2024). [paper] [project]
- Zhang, Yaolun, et al. "PyBench: Evaluating LLM Agent on various real-world coding tasks." arXiv preprint arXiv:2407.16732 (2024). [paper]
- Yang, John, et al. "Intercode: Standardizing and benchmarking interactive coding with execution feedback." Advances in Neural Information Processing Systems 36 (2024). [paper] [project]
- Li, Bowen, et al. "Devbench: A comprehensive benchmark for software development." arXiv preprint arXiv:2403.08604 (2024). [paper] [project]
- Si, Chenglei, et al. "Design2Code: How Far Are We From Automating Front-End Engineering?." arXiv preprint arXiv:2403.03163 (2024). [paper] [project]
- Zhang, Yuge, et al. "Benchmarking Data Science Agents." arXiv preprint arXiv:2402.17168 (2024). [paper] [project]
- Gu, Ken, et al. "Blade: Benchmarking language model agents for data-driven science." arXiv preprint arXiv:2408.09667 (2024). [paper] [project]
- Zhang, Dan, et al. "DataSciBench: An LLM Agent Benchmark for Data Science." [paper]
- Huang, Yiming, et al. "DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models." arXiv preprint arXiv:2410.07331 (2024). [paper] [project]
- Hu, Xueyu, et al. "Infiagent-dabench: Evaluating agents on data analysis tasks." arXiv preprint arXiv:2401.05507 (2024). [paper] [project]
- Chan, Jun Shern, et al. "Mle-bench: Evaluating machine learning agents on machine learning engineering." arXiv preprint arXiv:2410.07095 (2024). [paper] [project]
- Chen, Ziru, et al. "Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery." arXiv preprint arXiv:2410.05080 (2024). [paper] [project]
- Jansen, Peter, et al. "DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents." arXiv preprint arXiv:2406.06769 (2024). [paper] [project]
- Huang, Qian, et al. "Benchmarking large language models as ai research agents." NeurIPS 2023 Foundation Models for Decision Making Workshop. 2023. [paper] [project]
## 🧰 Tool-Use

- Wang, Jize, et al. "GTA: A Benchmark for General Tool Agents." arXiv preprint arXiv:2407.08713 (2024). [paper] [project]
- Yao, Shunyu, et al. "τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains." arXiv preprint arXiv:2406.12045 (2024). [paper] [project]
- Lu, Jiarui, et al. "Toolsandbox: A stateful, conversational, interactive evaluation benchmark for llm tool use capabilities." arXiv preprint arXiv:2408.04682 (2024). [paper] [project]
- Chen, Zehui, et al. "T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step." arXiv preprint arXiv:2312.14033 (2023). [paper] [project]
- Wu, Mengsong, et al. "Seal-tools: Self-instruct tool learning dataset for agent tuning and detailed benchmark." CCF International Conference on Natural Language Processing and Chinese Computing. Singapore: Springer Nature Singapore, 2024. [paper] [project]
- Guo, Zishan, Yufei Huang, and Deyi Xiong. "Ctooleval: A chinese benchmark for llm-powered agent evaluation in real-world API interactions." Findings of the Association for Computational Linguistics ACL 2024. 2024. [paper] [project]
- Li, Minghao, et al. "Api-bank: A comprehensive benchmark for tool-augmented llms." arXiv preprint arXiv:2304.08244 (2023). [paper] [project]
## 🔍 RAG

- Yang, Xiao, et al. "CRAG -- Comprehensive RAG Benchmark." arXiv preprint arXiv:2406.04744 (2024). [paper] [project]
- Wang, Shuting, et al. "OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain." arXiv preprint arXiv:2412.13018 (2024). [paper] [project]
- Lyu, Yuanjie, et al. "Crud-rag: A comprehensive chinese benchmark for retrieval-augmented generation of large language models." ACM Transactions on Information Systems (2024). [paper] [project]
- Pipitone, Nicholas, and Ghita Houir Alami. "LegalBench-RAG: A Benchmark for Retrieval-Augmented Generation in the Legal Domain." arXiv preprint arXiv:2408.10343 (2024). [paper] [project]
- RAG-ConfusionQA [paper] [project]
- Kuzman, Taja, et al. "PandaChat-RAG: Towards the Benchmark for Slovenian RAG Applications." (2024). [paper] [project]
## 🤖🤖 Multi-Agent

- Ossowski, Timothy, et al. "COMMA: A Communicative Multimodal Multi-Agent Benchmark." arXiv preprint arXiv:2410.07553 (2024). [paper]
- Xu, Lin, et al. "MAgIC: Benchmarking Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration." arXiv preprint arXiv:2311.08562 (2023). [paper] [project]
## 🗄️ Other Awesome Lists

- https://github.com/Paitesanshi/LLM-Agent-Survey
- https://github.com/git-disl/awesome-LLM-game-agent-papers
- https://github.com/woooodyy/llm-agent-paper-list
- https://github.com/ysymyth/awesome-language-agents
- https://github.com/AGI-Edgerunners/LLM-Agents-Papers
- https://github.com/zjunlp/LLMAgentPapers
- https://github.com/hyp1231/awesome-llm-powered-agent
- https://github.com/AGI-Edgerunners/LLM-Planning-Papers
- https://github.com/taichengguo/LLM_MultiAgents_Survey_Papers
- https://github.com/kaushikb11/awesome-llm-agents
- https://github.com/e2b-dev/awesome-ai-agents
- https://github.com/kyrolabs/awesome-agents
- https://github.com/jxzhangjhu/Awesome-LLM-RAG
- https://github.com/frutik/Awesome-RAG
- https://github.com/hymie122/RAG-Survey
- https://github.com/Hannibal046/Awesome-LLM
- https://github.com/Shubhamsaboo/awesome-llm-apps
- https://github.com/shure-dev/Awesome-LLM-Papers-Comprehensive-Topics
- https://github.com/HqWu-HITCS/Awesome-Chinese-LLM
- https://github.com/tensorchord/Awesome-LLMOps
- https://github.com/hijkzzz/Awesome-LLM-Strawberry
- https://github.com/WangRongsheng/awesome-LLM-resourses
- https://github.com/GT-RIPL/Awesome-LLM-Robotics
- https://github.com/DefTruth/Awesome-LLM-Inference
- https://github.com/luban-agi/Awesome-Domain-LLM
- https://github.com/WLiK/LLM4Rec-Awesome-Papers
- https://github.com/RManLuo/Awesome-LLM-KG
- https://github.com/jackaduma/awesome_LLMs_interview_notes
- https://github.com/XiaoxinHe/Awesome-Graph-LLM
- https://github.com/fr0gger/Awesome-GPT-Agents
- https://github.com/atfortes/Awesome-LLM-Reasoning
- https://github.com/codefuse-ai/Awesome-Code-LLM
- https://github.com/JShollaj/awesome-llm-interpretability
- https://github.com/lmmlzn/Awesome-LLMs-Datasets
- https://github.com/horseee/Awesome-Efficient-LLM
- https://github.com/corca-ai/awesome-llm-security
- https://github.com/HuangOwen/Awesome-LLM-Compression