diff --git a/_posts/2023-05-03-arena.md b/_posts/2023-05-03-arena.md new file mode 100644 index 0000000..2018f47 --- /dev/null +++ b/_posts/2023-05-03-arena.md @@ -0,0 +1,223 @@ +--- +layout: distill +title: Chatbot Arena +description: Benchmarking LLMs in the Wild with Elo Ratings +giscus_comments: true +date: 2023-05-03 +featured: true +thumbnail: assets/img/blog/arena/cover.png +authors: + - name: Lianmin Zheng* + url: https://lmzheng.net/ + affiliations: + name: UC Berkeley, LMSys + - name: Ying Sheng* + url: https://sites.google.com/view/yingsheng/home + - name: Wei-Lin Chiang + url: https://infwinston.github.io/ + - name: Hao Zhang + url: https://cseweb.ucsd.edu/~haozhang/ + - name: Joseph E. Gonzalez + url: https://people.eecs.berkeley.edu/~jegonzal/ + - name: Ion Stoica + url: https://people.eecs.berkeley.edu/~istoica/ +--- + +We present Chatbot Arena, a benchmark platform for large language models (LLMs) that features anonymous, randomized battles in a crowdsourced manner. In this blog post, we are releasing our initial results and a leaderboard based on the Elo rating system, which is a widely-used rating system in chess and other competitive games. We invite the entire community to join this effort by contributing new models and evaluating them by asking questions and voting for your favorite answer. + + + +
Table 1. LLM Leaderboard (April 24 - May 1, 2023). The latest and detailed version here.
+Rank | Model | Elo Rating | Description | +
---|---|---|---|
1 | 🥇 vicuna-13b | 1169 | a chat assistant fine-tuned from LLaMA on user-shared conversations by LMSYS | +
2 | 🥈 koala-13b | 1082 | a dialogue model for academic research by BAIR | +
3 | 🥉 oasst-pythia-12b | 1065 | an Open Assistant for everyone by LAION | +
4 | alpaca-13b | 1008 | a model fine-tuned from LLaMA on instruction-following demonstrations by Stanford | +
5 | chatglm-6b | 985 | an open bilingual dialogue language model by Tsinghua University | +
6 | fastchat-t5-3b | 951 | a chat assistant fine-tuned from FLAN-T5 by LMSYS | +
7 | dolly-v2-12b | 944 | an instruction-tuned open large language model by Databricks | +
8 | llama-13b | 932 | open and efficient foundation language models by Meta | +
9 | stablelm-tuned-alpha-7b | 858 | Stability AI language models | +
Figure 1. The side-by-side chatting and voting interface.
+ +Please note that we periodically release blog posts to update the leaderboard. Feel free to check the following updates: + +- [May 10 Updates](https://lmsys.org/blog/2023-05-10-leaderboard/) +- [May 25 Updates](https://lmsys.org/blog/2023-05-25-leaderboard/) +- [June 22 Updates](https://lmsys.org/blog/2023-06-22-leaderboard/) +- [Dataset Release (July 20)](https://lmsys.org/blog/2023-07-20-dataset/) +- [Dec. 7 Updates](https://lmsys.org/blog/2023-12-07-leaderboard/) +- [Policy Updates (March 1, 2024)](https://lmsys.org/blog/2024-03-01-policy/) + +## Introduction + +Following the great success of ChatGPT, there has been a proliferation of open-source large language models that are finetuned to follow instructions. These models are capable of providing valuable assistance in response to users’ questions/prompts. Notable examples include Alpaca and Vicuna, based on LLaMA, and OpenAssistant and Dolly, based on Pythia. + +Despite the constant release of new models every week, the community faces a challenge in benchmarking these models effectively. Benchmarking LLM assistants is extremely challenging because the problems can be open-ended, and it is very difficult to write a program to automatically evaluate the response quality. +In this case, we typically have to resort to human evaluation based on pairwise comparison. + +There are some desired properties for a good benchmark system based on pairwise comparison. + +- **Scalability**. The system should scale to a large number of models when it is not feasible to collect sufficient data for all possible model pairs. +- **Incrementality**. The system should be able to evaluate a new model using a relatively small number of trials. +- **Unique order**. The system should provide a unique order for all models. Given any two models, we should be able to tell which ranks higher or whether they are tied. + +Existing LLM benchmark systems rarely satisfy all of these properties. Classical LLM benchmark frameworks, such as [HELM](https://crfm.stanford.edu/helm/latest/) and [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), provide multi-metric measurements for tasks commonly used in academic research. However, they are not based on pairwise comparison and are not effective at evaluating open-ended questions. OpenAI also launched the [evals](https://github.com/openai/evals) project to collect better questions, but this project does not provide ranking mechanisms for all participating models. When we launched our [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/) model, we utilized a GPT-4-based evaluation pipeline, but it does not provide a solution for scalable and incremental ratings. + +In this blog post, we introduce Chatbot Arena, an LLM benchmark platform featuring anonymous randomized battles in a crowdsourced manner. Chatbot Arena adopts the [Elo rating system](https://en.wikipedia.org/wiki/Elo_rating_system), which is a widely-used rating system in chess and other competitive games. The Elo rating system is promising to provide the desired property mentioned above. We noticed that the [Anthropic LLM paper](https://arxiv.org/pdf/2204.05862.pdf) also adopted the Elo rating system. + +To collect data, we launched the arena with several popular open-source LLMs one week ago. In the arena, a user can chat with two anonymous models side-by-side and vote for which one is better. This crowdsourcing way of data collection represents some use cases of LLMs in the wild. 
A comparison between several evaluation methods is shown in Table 2. + +Table 2: Comparison between different evaluation methods.
+| | HELM / lm-evaluation-harness | OpenAI/eval | Alpaca Evaluation | Vicuna Evaluation | Chatbot Arena | +|
---|---|---|---|---|---|
Question Source | Academic datasets | Mixed | Self-instruct evaluation set | GPT-4 generated | User prompts | +
Evaluator | Program | Program/Model | Human | GPT-4 | User | +
Metrics | Basic metrics | Basic metrics | Win rate | Win rate | Elo ratings | +
Figure 2: Battle count of each combination of models
+ +Figure 2 shows the battle count for each combination of models. When we initially launched the tournament, we had prior information on the likely ranking based on our benchmarks and chose to pair models according to this ranking, giving preference to what we believed would be strong pairings. However, we later switched to uniform sampling to get better overall coverage of the rankings. Towards the end of the tournament, we also introduced a new model, `fastchat-t5-3b`. All of these factors result in non-uniform battle frequencies across models. + + +Figure 3: Battle counts for the top-15 languages.</center>
+ +Figure 3 plots the language distribution and shows most user prompts are in English. + +## Elo Rating System + +The [Elo rating system](https://en.wikipedia.org/wiki/Elo_rating_system) is a method for calculating the relative skill levels of players, which has been widely adopted in competitive games and sports. The difference in the ratings between two players serves as a predictor of the outcome of a match. The Elo rating system works well for our case because we have multiple models and we run pairwise battles between them. + +If player A has a rating of `Ra` and player B a rating of `Rb`, the exact formula (using the logistic curve with base 10) for the probability of player A winning is + +`Ea = 1 / (1 + 10 ^ ((Rb - Ra) / 400))` + +The ratings of players can be linearly updated after each battle. Suppose player A (with rating `Ra`) was expected to score `Ea` points but actually scored `Sa` points. The formula for updating that player's rating is + +`Ra' = Ra + K * (Sa - Ea)` + +where `K` is a constant that controls how much a single battle can move a rating. + +Using the collected data, we compute the Elo ratings of the models in this [notebook](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing) and put the main results in Table 1. You are welcome to try the notebook and play with the voting data by yourself. The data only contains voting results without conversation histories because releasing the conversation history will raise concerns such as privacy and toxicity. + +## Pairwise Win Rates + +As a basis for calibration, we also present here the pairwise win rates for each model in the tournament (Figure 4) as well as the predicted pairwise win rates estimated using Elo ratings (Figure 5). +By comparing the figures, we find that the Elo ratings can predict win rates relatively well. + + +Figure 4: Fraction of Model A wins for all non-tied A vs. B battles.</center>
+ + +Figure 5: Predicted win rate using Elo ratings for Model A in an A vs. B battle
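If you want to experiment with the rating computation outside the notebook, below is a minimal Python sketch of the two pieces described above: sequential Elo updates over a list of battle records, and the Elo-implied win rate used to produce Figure 5. The record format, the `K` factor of 32, and the initial rating of 1000 are illustrative assumptions for this sketch, not necessarily the exact settings used in our notebook.

```python
from collections import defaultdict

def compute_elo(battles, k=32, init_rating=1000):
    """Sequentially update Elo ratings from (model_a, model_b, winner) records.

    `winner` is "model_a", "model_b", or "tie"; a tie counts as half a win for each side.
    """
    ratings = defaultdict(lambda: init_rating)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        ea = 1 / (1 + 10 ** ((rb - ra) / 400))  # expected score for model A
        sa = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] = ra + k * (sa - ea)
        ratings[model_b] = rb + k * ((1 - sa) - (1 - ea))
    return dict(ratings)

def predicted_win_rate(ratings, model_a, model_b):
    """Probability that model_a beats model_b implied by the Elo ratings."""
    return 1 / (1 + 10 ** ((ratings[model_b] - ratings[model_a]) / 400))

battles = [
    ("vicuna-13b", "alpaca-13b", "model_a"),
    ("koala-13b", "vicuna-13b", "model_b"),
    ("alpaca-13b", "koala-13b", "tie"),
]
ratings = compute_elo(battles)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
print(predicted_win_rate(ratings, "vicuna-13b", "koala-13b"))
```

Because each vote only moves the two ratings involved, a newly added model can be rated from a relatively small number of its own battles, which is the incrementality property discussed above.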
+ +## Future Plans + +We plan to work on the following items: + +- Add more closed-source models (ChatGPT-3.5, ChatGPT-4, and Claude-v1 are avaiable now in the anonymous Arena) +- Add more open-source models +- Release periodically updated leaderboards (e.g., monthly) +- Implement better sampling algorithms, tournament mechanisms, and serving systems to support a much larger number of models +- Provide fine-grained rankings on different task types. + +We appreciate any feedback from you to make the arena better. + +## Join Us + +We invite the entire community to join this benchmarking effort by contributing your models and votes for the anonymous models you think provide better answers. You can visit [https://lmarena.ai](https://lmarena.ai) to vote for better models. If you want to see a specific model in the arena, you can follow this [guide](https://github.com/lm-sys/FastChat/blob/main/docs/arena.md#how-to-add-a-new-model) to help us add it. + +## Acknowledgment + +We thank other members of the Vicuna team for valuable feedback and MBZUAI for donating compute resources. Additionally, we extend our thanks to Tianjun Zhang and Eric Wallace for their insightful discussions. + +## Links + +- Demo: [https://lmarena.ai](https://lmarena.ai) +- Leaderboard: [https://lmarena.ai/?leaderboard](https://lmarena.ai/?leaderboard) +- GitHub: [https://github.com/lm-sys/FastChat](https://github.com/lm-sys/FastChat) +- Colab notebook: [https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing) + +## Citation + +Please cite the following [papers](https://arxiv.org/abs/2403.04132) if you find our work useful. + +``` +@misc{chiang2024chatbot, + title={Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference}, + author={Wei-Lin Chiang and Lianmin Zheng and Ying Sheng and Anastasios Nikolas Angelopoulos and Tianle Li and Dacheng Li and Hao Zhang and Banghua Zhu and Michael Jordan and Joseph E. Gonzalez and Ion Stoica}, + year={2024}, + eprint={2403.04132}, + archivePrefix={arXiv}, + primaryClass={cs.AI} +} + +@inproceedings{zheng2023judging, + title={Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena}, + author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica}, + booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track}, + year={2023}, + url={https://openreview.net/forum?id=uccHPGDlao} +} + +@inproceedings{zheng2024lmsyschatm, + title={LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset}, + author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Tianle Li and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zhuohan Li and Zi Lin and Eric Xing and Joseph E. 
Gonzalez and Ion Stoica and Hao Zhang}, + booktitle={The Twelfth International Conference on Learning Representations}, + year={2024}, + url={https://openreview.net/forum?id=BOfDKxfwt0} +} +``` diff --git a/_posts/2023-05-10-leaderboard.md b/_posts/2023-05-10-leaderboard.md new file mode 100644 index 0000000..c989a0b --- /dev/null +++ b/_posts/2023-05-10-leaderboard.md @@ -0,0 +1,149 @@ +--- +layout: distill +title: Chatbot Arena Leaderboard Updates (Week 2) +giscus_comments: true +date: 2023-05-10 +featured: false +thumbnail: assets/img/blog/leaderboard_week2/leaderboard_cover.png +authors: + - name: Chatbot Arena Team + affiliations: + name: LMSYS Org +--- + +We release an updated leaderboard with more models and new data we collected last week, after the announcement of the anonymous [Chatbot Arena](https://lmsys.org/blog/2023-05-03-arena/). We are actively iterating on the design of the arena and leaderboard scores. + +In this update, we have added 4 new yet strong players into the Arena, including three **proprietary models** and one open-source model. They are: + +- OpenAI GPT-4 +- OpenAI GPT-3.5-turbo +- Anthropic Claude-v1 +- RWKV-4-Raven-14B + +Table 1 displays the Elo ratings of all 13 models, which are based on the 13K voting data and calculations shared in this [notebook](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing). You can also try the voting [demo](https://lmarena.ai). + + + +Table 1. LLM Leaderboard (April 24 - May 8, 2023). The latest and detailed version here.
+Rank | Model | Elo Rating | Description | License |
---|---|---|---|---|
1 | 🥇 GPT-4 | 1274 | ChatGPT-4 by OpenAI | Proprietary |
2 | 🥈 Claude-v1 | 1224 | Claude by Anthropic | Proprietary |
3 | 🥉 GPT-3.5-turbo | 1155 | ChatGPT-3.5 by OpenAI | Proprietary |
4 | Vicuna-13B | 1083 | a chat assistant fine-tuned from LLaMA on user-shared conversations by LMSYS | Weights available; Non-commercial |
5 | Koala-13B | 1022 | a dialogue model for academic research by BAIR | Weights available; Non-commercial |
6 | RWKV-4-Raven-14B | 989 | an RNN with transformer-level LLM performance | Apache 2.0 |
7 | Oasst-Pythia-12B | 928 | an Open Assistant for everyone by LAION | Apache 2.0 |
8 | ChatGLM-6B | 918 | an open bilingual dialogue language model by Tsinghua University | Weights available; Non-commercial |
9 | StableLM-Tuned-Alpha-7B | 906 | Stability AI language models | CC-BY-NC-SA-4.0 |
10 | Alpaca-13B | 904 | a model fine-tuned from LLaMA on instruction-following demonstrations by Stanford | Weights available; Non-commercial |
11 | FastChat-T5-3B | 902 | a chat assistant fine-tuned from FLAN-T5 by LMSYS | Apache 2.0 |
12 | Dolly-V2-12B | 863 | an instruction-tuned open large language model by Databricks | MIT |
13 | LLaMA-13B | 826 | open and efficient foundation language models by Meta | Weights available; Non-commercial |
Figure 1: One example where Claude is preferred over GPT-4.
+ +In Figure 1, the user posed a tricky question that demanded careful reasoning and planning. Although both Claude and GPT-4 provided similar answers, Claude's response was marginally better because it placed the needle on top. However, we observed that the outcome of this example cannot always be replicated due to the randomness of sampling: GPT-4 sometimes gives the same ordering as Claude, but it failed to do so in this particular generation. Additionally, we noted that the behavior of GPT-4 differed slightly when using the OpenAI API versus the ChatGPT interface, which could be attributed to different prompts, sampling parameters, or other unknown factors. + + +Figure 2: One example where a user thinks both Claude and GPT-4 are wrong.</center>
+ +In Figure 2, both Claude and GPT-4 still struggle with this kind of tricky reasoning question despite their amazing capabilities. + +Besides these tricky cases, there are also a lot of easy questions that do not require complex reasoning or knowledge. For such questions, open-source models like Vicuna can perform on par with GPT-4, so we might be able to use a slightly weaker (but smaller or cheaper) LLM in place of a more powerful model like GPT-4. + +**Win Fraction Matrix** +We present the win fraction of all model pairs in Figure 3. + + +Figure 3: Fraction of Model A Wins for All Non-tied A vs. B Battles.</center>
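The win-fraction matrix in Figure 3 is straightforward to reproduce from raw votes. The sketch below is an illustration rather than the exact notebook code; it assumes battle records of the form `(model_a, model_b, winner)` and, for each pair of models, reports the fraction of their non-tied battles won by the first model.

```python
from collections import defaultdict

def win_fraction_matrix(battles):
    """For each ordered pair (a, b), the fraction of non-tied a-vs-b battles won by a."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for model_a, model_b, winner in battles:
        if winner == "tie":
            continue  # only non-tied battles are counted
        winner_model = model_a if winner == "model_a" else model_b
        loser_model = model_b if winner == "model_a" else model_a
        wins[(winner_model, loser_model)] += 1
        totals[(winner_model, loser_model)] += 1
        totals[(loser_model, winner_model)] += 1
    return {pair: wins[pair] / totals[pair] for pair in totals}

battles = [
    ("gpt-4", "vicuna-13b", "model_a"),
    ("vicuna-13b", "gpt-4", "model_b"),   # gpt-4 wins again, presented on the other side
    ("vicuna-13b", "gpt-4", "model_a"),   # vicuna-13b wins this one
]
matrix = win_fraction_matrix(battles)
print(matrix[("gpt-4", "vicuna-13b")])    # 2/3
print(matrix[("vicuna-13b", "gpt-4")])    # 1/3
```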
+ +**Language-specific leaderboards** +Lastly, we present two language-specific leaderboards, by isolating the conversation data into two subsets based on the language: (1) English-only and (2) non-English. From Figure 4, we can tell that Koala is worse at non-English languages and ChatGLM-6B is better at non-English languages. This is because of the different compositions of their training data. + + +Figure 4: The English-only and non-English leaderboards.
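The language split itself is simple once each battle record carries a detected language tag. The sketch below is an outline under that assumption, reusing a compact version of the Elo computation described in the May 3 post; it is not the exact analysis code.

```python
from collections import defaultdict

def elo_ratings(battles, k=32, init_rating=1000):
    """Minimal sequential Elo over records like
    {"model_a": ..., "model_b": ..., "winner": ..., "language": ...}."""
    ratings = defaultdict(lambda: init_rating)
    for b in battles:
        ra, rb = ratings[b["model_a"]], ratings[b["model_b"]]
        expected_a = 1 / (1 + 10 ** ((rb - ra) / 400))
        score_a = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[b["winner"]]
        ratings[b["model_a"]] = ra + k * (score_a - expected_a)
        ratings[b["model_b"]] = rb + k * (expected_a - score_a)
    return dict(ratings)

def language_specific_leaderboards(battles, english_label="English"):
    """Recompute Elo separately on the English-only and non-English subsets."""
    english = [b for b in battles if b.get("language") == english_label]
    non_english = [b for b in battles if b.get("language") != english_label]
    return elo_ratings(english), elo_ratings(non_english)
```

Sorting each returned dictionary by rating gives the two leaderboards shown in Figure 4.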
+ +More figures, analyses, and calculations can be found in this [notebook](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing). + +## Next Steps + +**Help us add more models** +Since the launch of Chatbot Arena, we have seen growing interest from the community. Many model developers are eager to put their chatbots into the Arena and see how they perform against others. +Please help us add more models by following [this guide](https://github.com/lm-sys/FastChat/blob/main/docs/arena.md#how-to-add-a-new-model). + +**Bring your own self-hosted chatbot (BYOC)** +We also plan to open some APIs to allow competitors to register their self-hosted chatbots and participate in the Arena. + +**Area-specific Arena** +Similar to the language-specific Arena, we will extend a single, monolithic leaderboard to more areas, and publish more functionality-specific leaderboards, +such as writing, coding, and reasoning. In which specific area or ability do you want to see the LLMs evaluated? +Please give us feedback on [Discord](https://discord.gg/HSWAKCrnFx) or [Twitter](https://twitter.com/lmsysorg). + +## Acknowledgement + +This blog post is primarily contributed by Lianmin Zheng, Ying Sheng, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. +We thank other members of LMSYS team (Wei-Lin Chiang, Siyuan Zhuang, and more) for valuable feedback and MBZUAI for donating compute resources. +Additionally, we extend our thanks to community contributors for their votes and model support. diff --git a/_posts/2023-05-25-leaderboard.md b/_posts/2023-05-25-leaderboard.md new file mode 100644 index 0000000..00458a5 --- /dev/null +++ b/_posts/2023-05-25-leaderboard.md @@ -0,0 +1,171 @@ +--- +layout: distill +title: Chatbot Arena Leaderboard Updates (Week 4) +giscus_comments: true +date: 2023-05-25 +featured: false +thumbnail: assets/img/blog/leaderboard_week4/leaderboard_cover.png +authors: + - name: Chatbot Arena Team + affiliations: + name: LMSYS Org +--- + +In this update, we are excited to welcome the following models joining the [Chatbot Arena](https://lmsys.org/blog/2023-05-03-arena/): + +1. Google PaLM 2, chat-tuned with the code name [chat-bison@001](https://cloud.google.com/vertex-ai/docs/release-notes#May_10_2023) on Google Cloud Vertex AI +2. Anthropic Claude-instant-v1 +3. MosaicML MPT-7B-chat +4. Vicuna-7B + +A new Elo rating leaderboard based on the 27K anonymous voting data collected **in the wild** between April 24 and May 22, 2023 is released in Table 1 below. + +We provide a [Google Colab notebook](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing) to analyze the voting data, including the computation of the Elo ratings. +You can also try the voting [demo](https://lmarena.ai). + + + +Table 1. LLM Leaderboard (April 24 - May 22, 2023). The latest and detailed version here.
+Rank | Model | Elo Rating | Description | License |
---|---|---|---|---|
1 | 🥇 GPT-4 | 1225 | ChatGPT-4 by OpenAI | Proprietary |
2 | 🥈 Claude-v1 | 1195 | Claude by Anthropic | Proprietary |
3 | 🥉 Claude-instant-v1 | 1153 | Lighter, less expensive, and much faster version of Claude | Proprietary |
4 | GPT-3.5-turbo | 1143 | ChatGPT-3.5 by OpenAI | Proprietary |
5 | Vicuna-13B | 1054 | a chat assistant fine-tuned from LLaMA on user-shared conversations by LMSYS | Weights available; Non-commercial |
6 | PaLM 2 | 1042 | PaLM 2 tuned for chat (chat-bison@001 on Google Vertex AI). The PaLM 2 model family is powering Bard. | Proprietary |
7 | Vicuna-7B | 1007 | a chat assistant fine-tuned from LLaMA on user-shared conversations by LMSYS | Weights available; Non-commercial |
8 | Koala-13B | 980 | a dialogue model for academic research by BAIR | Weights available; Non-commercial |
9 | mpt-7b-chat | 952 | a chatbot fine-tuned from MPT-7B by MosaicML | CC-BY-NC-SA-4.0 |
10 | FastChat-T5-3B | 941 | a chat assistant fine-tuned from FLAN-T5 by LMSYS | Apache 2.0 |
11 | Alpaca-13B | 937 | a model fine-tuned from LLaMA on instruction-following demonstrations by Stanford | Weights available; Non-commercial |
12 | RWKV-4-Raven-14B | 928 | an RNN with transformer-level LLM performance | Apache 2.0 |
13 | Oasst-Pythia-12B | 921 | an Open Assistant for everyone by LAION | Apache 2.0 |
14 | ChatGLM-6B | 921 | an open bilingual dialogue language model by Tsinghua University | Weights available; Non-commercial |
15 | StableLM-Tuned-Alpha-7B | 882 | Stability AI language models | CC-BY-NC-SA-4.0 |
16 | Dolly-V2-12B | 866 | an instruction-tuned open large language model by Databricks | MIT |
17 | LLaMA-13B | 854 | open and efficient foundation language models by Meta | Weights available; Non-commercial |
Figure 1: Fraction of Model A Wins for All Non-tied A vs. B Battles.
+ +If you want to see more models, please help us [add them](https://github.com/lm-sys/FastChat/blob/main/docs/arena.md#how-to-add-a-new-model) or contact us to give us API access. + +## Overview + +### Google PaLM 2 + +Google's PaLM 2 is one of the most significant models announced since our last leaderboard update. We added the PaLM 2 Chat to the Chatbot Arena via the [Google Cloud Vertex AI API](https://cloud.google.com/vertex-ai/docs/release-notes#May_10_2023). The model is chat-tuned under the code name _chat-bison@001_. + +In the past two weeks, PaLM 2 has competed in around 1.8k anonymous battles against the other 16 chatbots and is currently ranked 6th on the leaderboard. It ranks above all open-source chatbots except Vicuna-13B, whose Elo rating is 12 points higher (Vicuna 1054 vs. PaLM 2 1042), which is virtually a tie. We noted the following interesting results from PaLM 2's Arena data. + +PaLM 2 does better when playing against the top four players, i.e., GPT-4, Claude-v1, GPT-3.5-turbo, and Claude-instant-v1, and it also wins 53% of its battles against Vicuna, but it does worse when playing against weaker players. This can be seen in Figure 1, which shows the win fraction matrix. Among all battles PaLM 2 has participated in, 21.6% were lost to a chatbot that is not one of GPT-4, Claude-v1, GPT-3.5-turbo, or Claude-instant-v1. For reference, another proprietary model, GPT-3.5-turbo, loses only 12.8% of its battles to those chatbots. + +In short, we find that the current PaLM 2 version available through the Google Cloud Vertex AI API has the following deficiencies when compared to other models we have evaluated: + +1. PaLM 2 seems more strongly regulated than other models, which impacts its ability to answer some questions. +2. The currently offered PaLM 2 has limited multilingual abilities. +3. The currently offered PaLM 2 has unsatisfactory reasoning capabilities. + +**PaLM 2 is more strongly regulated** + +PaLM 2 seems to be more strongly regulated than other models. In many user conversations, when the users ask questions that PaLM 2 is uncertain or uncomfortable giving an answer to, PaLM 2 is more likely to abstain from responding than other models. + +Based on a rough estimate, among all pairwise battles, PaLM 2 has lost 20.9% of its battles due to refusing to answer, and it has lost 30.8% of its battles against chatbots outside the top four (GPT-4, Claude-v1, GPT-3.5-turbo, Claude-instant-v1) due to refusing to answer. + +This partially explains why PaLM 2 frequently loses battles to weaker chatbots on the leaderboard. It also highlights a flaw in the Chatbot Arena methodology: casual users are more likely to penalize abstention than subtly inaccurate responses. Below we provide several failure cases illustrating how PaLM 2 loses battles to weaker chatbots because it refuses to answer the question. + +We also noticed that, sometimes, it is hard to clearly specify the boundary for LLM regulation. In the offered PaLM 2 versions, we see several undesired tendencies: + +- PaLM 2 refuses many roleplay questions, even when users ask it to emulate a Linux terminal or a programming language interpreter. +- Sometimes PaLM 2 refuses to answer easy and non-controversial factual questions. + +Several examples are shown below: + + + +Figure 2: Example questions that PaLM 2 refuses to answer.</center>
+ +**Limited multilingual abilities** + +We do not see strong multilingual abilities from PaLM 2 through the currently offered public API (chat-bison@001) on Google Cloud Vertex AI. PaLM 2 tends not to answer non-English questions, including questions written in popular languages such as Chinese, Spanish, and Hebrew. We were unable to reproduce several multilingual examples demonstrated in the PaLM 2 technical report using the current PaLM 2 versions. We are waiting for Google to gradually release the latest version of PaLM 2. + +We also calculate the Elo ratings of all models when considering only English conversations and only non-English conversations, respectively, as illustrated in Figure 3. The results confirm our observations: on the non-English leaderboard, PaLM 2 ranks 16th. + + +Figure 3: The English-only and non-English leaderboards.</center>
+ +**PaLM 2's reasoning ability is unsatisfactory** + +We also observe that the offered PaLM 2 version does not demonstrate strong reasoning capabilities. On one hand, it seems to detect whether a question is in plain text and tends to refuse many questions that are not, such as those involving programming languages, debugging, and code interpretation. On the other hand, PaLM 2 did not perform well on some entry-level reasoning tasks when compared against other chatbots. See several examples in Figure 4. + + + +Figure 4: Examples where PaLM 2 fails on simple reasoning tasks.</center>
+ +**Elo ratings after removing non-English and refusal conversations** + +We remove all non-English conversations and all conversations for which PaLM 2 didn’t provide an answer and calculate the Elo ratings of each model with the filtered data. This rating represents a hypothetical upper bound of PaLM 2's Elo in the Arena. See Figure 5 below. + + +Figure 5: The leaderboard after removing PaLM 2's non-English and refusal conversations.
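To make the filtering concrete, here is a rough sketch of how such a subset can be constructed before recomputing the ratings. The record layout and the phrase-matching refusal check are assumptions made only for this illustration; the actual analysis may detect refusals differently.

```python
ASSUMED_REFUSAL_PHRASES = ("i'm not able to help with that", "i can't answer", "i cannot answer")

def looks_like_refusal(reply: str) -> bool:
    """Crude phrase-matching refusal check, used only for this illustration."""
    reply = reply.lower()
    return any(phrase in reply for phrase in ASSUMED_REFUSAL_PHRASES)

def filter_battles_for_upper_bound(battles, model="palm-2"):
    """Keep English battles and drop those in which `model` refused to answer.

    Each record is assumed to look like:
    {"model_a": ..., "model_b": ..., "winner": ..., "language": ..., "replies": {model_name: text}}
    """
    kept = []
    for b in battles:
        if b.get("language") != "English":
            continue
        if model in (b["model_a"], b["model_b"]) and looks_like_refusal(b["replies"].get(model, "")):
            continue
        kept.append(b)
    return kept

# Recomputing Elo on `kept` gives the hypothetical upper bound shown in Figure 5.
```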
+ +### Smaller Models Are Competitive + +We observe that several smaller models, including vicuna-7B and mpt-7b-chat, have achieved high ratings on the leaderboard. These smaller models perform favorably when compared against larger models with twice as many parameters. + +We speculate that high-quality pre-training and fine-tuning datasets are more critical than model size. However, it is possible that larger models would still perform better on more complex reasoning tasks or more subtle knowledge questions (e.g., trivia). +Hence, curating high-quality datasets in both the pretraining and finetuning stages seems to be a key approach to reducing model sizes while keeping model quality high. + +### Claude-v1 and Claude-instant-v1 + +Claude-instant-v1 is a low-cost, faster alternative to Claude-v1 offered by Anthropic. When benchmarked in the wild in the Arena, Claude-instant performs close to GPT-3.5-turbo (1153 vs. 1143). The rating gap between Claude and Claude-instant seems smaller than that between GPT-4 and GPT-3.5-turbo. Claude-instant has a context length of 9K and is priced at $0.00163 per 1K prompt tokens and $0.00551 per 1K completion tokens, compared to its OpenAI counterpart, GPT-3.5-turbo, which has a context length of 4K and a uniform price of $0.002 per 1K tokens (regardless of prompt or completion). + +### Limitations of the “In-the-wild” Evaluation + +However, we want to point out a few facts about the current Chatbot Arena and leaderboard. The current Arena is designed to benchmark LLM-based chatbots **"in the wild"**. That means the voting data provided by our Arena users and the prompt-answer pairs generated during the voting process reflect how the chatbots perform in normal human-chatbot interactions. This might not align with many benchmarking results in the LLM research literature, which tends to characterize long-tail abilities like zero-shot generalization, complex reasoning, etc. Hence, the current Chatbot Arena has limitations in clearly reflecting the long-tail capability differences between chatbots. See the later section for more details and our plan. + +## Next Steps + +**Evaluating long-tail capability of LLMs** + +As pointed out by the community in [thread 1](https://twitter.com/tinkerteller/status/1656914923316998144?s=20) and [thread 2](https://twitter.com/LechMazur/status/1659915936919347202?s=20), the current Arena and leaderboard design has one major limitation: performing user studies on a small scale often cannot generate enough hard or medium-difficulty prompts to reveal the long-tail capability differences between LLMs. Moreover, for difficult questions, it is also very hard for regular Arena users to judge which LLM has generated the better answer -- some domain-specific questions are considered very difficult, even for 99% of non-expert humans. + +However, long-tail capabilities, such as complex reasoning, can be crucial for LLMs to complete real-world tasks. Building long-tail capability into LLMs is a holy-grail problem and the most actively studied and invested-in area of LLM development. + +We are listening carefully to community feedback and thinking about how to improve the leaderboard to overcome these limitations and capture the long-tail capability differences between LLMs. On top of the Chatbot Arena, we are actively designing a new tournament mechanism to examine the chatbots using presets of expert-designed questions and expert judges. We will have more updates soon. 
+ +**More models** + +Since the launch of Arena, we have received many requests from the community to add more models. Due to the limited compute resources and bandwidth we have, we may not be able to serve all of them. We are working on improving the scalability of our serving systems. +In the meanwhile, you can still contribute support for [new models](https://github.com/lm-sys/FastChat/blob/main/docs/arena.md#how-to-add-a-new-model) or contact us if you can help us scale the system. diff --git a/_posts/2023-06-22-leaderboard.md b/_posts/2023-06-22-leaderboard.md new file mode 100644 index 0000000..5540a75 --- /dev/null +++ b/_posts/2023-06-22-leaderboard.md @@ -0,0 +1,353 @@ +--- +layout: distill +title: Chatbot Arena Leaderboard Updates (Week 8) +description: Introducing MT-Bench and Vicuna-33B +giscus_comments: true +date: 2023-06-22 +featured: false +thumbnail: assets/img/blog/leaderboard_week8/ability_breakdown.png +authors: + - name: Lianmin Zheng + url: https://lmzheng.net/ + affiliations: + name: UC Berkeley, LMSys + - name: Wei-Lin Chiang + url: https://infwinston.github.io/ + - name: Ying Sheng + url: https://sites.google.com/view/yingsheng/home + - name: Hao Zhang + url: https://cseweb.ucsd.edu/~haozhang/ +--- + +In this blog post, we share the latest update on Chatbot Arena leaderboard, which now includes more open models and three metrics: + +1. **Chatbot Arena Elo**, based on 42K anonymous votes from [Chatbot Arena](https://lmsys.org/blog/2023-05-03-arena/) using the Elo rating system. +2. **MT-Bench score**, based on a challenging multi-turn benchmark and GPT-4 grading, proposed and validated in our [Judging LLM-as-a-judge paper](https://arxiv.org/abs/2306.05685). +3. **MMLU**, a widely adopted [benchmark](https://arxiv.org/abs/2009.03300). + +Furthermore, we’re excited to introduce our **new series of Vicuna-v1.3 models**, ranging from 7B to 33B parameters, trained on an extended set of user-shared conversations. +Their weights are now [available](https://github.com/lm-sys/FastChat/tree/main#vicuna-weights). + +## Updated Leaderboard and New Models + + + + + +Table 1. LLM Leaderboard (April 24 - June 19, 2023). The latest and detailed version here.
+Model | MT-bench (score) | Arena Elo Rating | MMLU | License |
---|---|---|---|---|
GPT-4 | 8.99 | 1227 | 86.4 | Proprietary |
GPT-3.5-turbo | 7.94 | 1130 | 70.0 | Proprietary |
Claude-v1 | 7.90 | 1178 | 75.6 | Proprietary |
Claude-instant-v1 | 7.85 | 1156 | 61.3 | Proprietary |
Vicuna-33B | 7.12 | - | 59.2 | Non-commercial |
WizardLM-30B | 7.01 | - | 58.7 | Non-commercial |
Guanaco-33B | 6.53 | 1065 | 57.6 | Non-commercial |
Tulu-30B | 6.43 | - | 58.1 | Non-commercial |
Guanaco-65B | 6.41 | - | 62.1 | Non-commercial |
OpenAssistant-LLaMA-30B | 6.41 | - | 56.0 | Non-commercial |
PaLM-Chat-Bison-001 | 6.40 | 1038 | - | Proprietary |
Vicuna-13B | 6.39 | 1061 | 52.1 | Non-commercial |
MPT-30B-chat | 6.39 | - | 50.4 | CC-BY-NC-SA-4.0 |
WizardLM-13B | 6.35 | 1048 | 52.3 | Non-commercial |
Vicuna-7B | 6.00 | 1008 | 47.1 | Non-commercial |
Baize-v2-13B | 5.75 | - | 48.9 | Non-commercial |
Nous-Hermes-13B | 5.51 | - | 49.3 | Non-commercial |
MPT-7B-Chat | 5.42 | 956 | 32.0 | CC-BY-NC-SA-4.0 |
GPT4All-13B-Snoozy | 5.41 | 986 | 43.0 | Non-commercial |
Koala-13B | 5.35 | 992 | 44.7 | Non-commercial |
MPT-30B-Instruct | 5.22 | - | 47.8 | CC-BY-SA 3.0 |
Falcon-40B-Instruct | 5.17 | - | 54.7 | Apache 2.0 |
H2O-Oasst-OpenLLaMA-13B | 4.63 | - | 42.8 | Apache 2.0 |
Alpaca-13B | 4.53 | 930 | 48.1 | Non-commercial |
ChatGLM-6B | 4.50 | 905 | 36.1 | Non-commercial |
OpenAssistant-Pythia-12B | 4.32 | 924 | 27.0 | Apache 2.0 |
RWKV-4-Raven-14B | 3.98 | 950 | 25.6 | Apache 2.0 |
Dolly-V2-12B | 3.28 | 850 | 25.7 | MIT |
FastChat-T5-3B | 3.04 | 897 | 47.7 | Apache 2.0 |
StableLM-Tuned-Alpha-7B | 2.75 | 871 | 24.4 | CC-BY-NC-SA-4.0 |
LLaMA-13B | 2.61 | 826 | 47.0 | Non-commercial |
Figure 1: Sample questions from the MT-Bench.
+ +### But Still, How to Grade Chatbots' Answers? + +Though we believe human preference is the gold standard, it is notoriously slow and expensive to collect. +In our first [Vicuna blogpost](https://lmsys.org/blog/2023-03-30-vicuna/), we explored an automated evaluation pipeline based on GPT-4. +This approach has since got popular and adopted in several [concurrent and follow-up works](#related-work). + +In our latest paper, ["Judging LLM-as-a-judge"](https://arxiv.org/abs/2306.05685), we conducted a systematic study to answer how reliable those LLM judges are. +We provide a brief overview of conclusions here but recommend reading the paper for more details. + +We begin by acknowledging potential limitations of LLM-as-a-judge: + +- **Position bias** where LLM judges may favor the first answer in a pairwise comparison. +- **Verbosity bias** where LLM judges may favor lengthier answers, regardless of their quality. +- **Self-enhancement bias** where LLM judges may favor their own responses. +- **Limited reasoning ability** referring to LLM judges' possible shortcomings in grading math and reasoning questions. + +Our study then explores how few-shot judge, chain-of-thought judge, reference-based judge, and fine-tuned judge can help to mitigate these limitations. + +Upon implementing some of these solutions, we discovered that despite limitations, strong LLM judges like GPT-4 can align impressively well with both controlled and crowdsourced human preferences, achieving over 80% agreement. +This level of agreement is comparable to the agreement between two different human judges. +Therefore, if used carefully, LLM-as-a-judge can act as a _scalable_ and _explainable_ approximation of human preferences. + +We also found that single-answer grading based on GPT-4, without pairwise comparison, can also rank models effectively and match human preferences well. +In Table 1, we present the MT-Bench as a column on the leaderboard based on single-answer grading with GPT-4. + +## Results and Analysis + +### MT-Bench Effectively Distinguishes Among Chatbots + +Table 1 provides a detailed rundown of the MT-bench-enhanced leaderboard, where we conduct an exhaustive evaluation of 28 popular instruction-tuned models. +We observe a clear distinction among chatbots of varying abilities, with scores showing a high correlation with the Chatbot Arena Elo rating. +In particular, MT-Bench reveals noticeable performance gaps between GPT-4 and GPT-3.5/Claude, and between open and proprietary models. + +To delve deeper into the distinguishing factors among chatbots, we select a few representative chatbots and break down their performance per category in Figure 2. +GPT-4 shows superior performance in Coding and Reasoning compared to GPT-3.5/Claude, while Vicuna-13B lags significantly behind in several specific categories: Extraction, Coding, and Math. +This suggests there is still ample room for improvement for open-source models. + + +Figure 2: The comparison of 6 representative LLMs regarding their abilities in 8 categories: Writing, Roleplay, Reasoning, Math, Coding, Extraction, STEM, Humanities.
+ +### Multi-turn Conversation Capabilities + +We next analyze the multi-turn scores of selected models, presented in Table 2. + +Table 2. The breakdown of LLMs' MT-bench scores in the 1st and 2nd turn of a dialogue. Full score is 10.
+Model | Average 1st Turn Score | Average 2nd Turn Score | Score Difference |
---|---|---|---|
GPT-4 | 8.96 | 9.03 | 0.07 |
Claude-v1 | 8.15 | 7.65 | -0.50 |
GPT-3.5-turbo | 8.08 | 7.81 | -0.26 |
Vicuna-33B | 7.46 | 6.79 | -0.67 |
WizardLM-30B | 7.13 | 6.89 | -0.24 |
WizardLM-13B | 7.12 | 5.59 | -1.53 |
Guanaco-33B | 6.88 | 6.18 | -0.71 |
Vicuna-13B | 6.81 | 5.96 | -0.85 |
PaLM2-Chat-Bison | 6.71 | 6.09 | -0.63 |
Vicuna-7B | 6.69 | 5.30 | -1.39 |
Koala-13B | 6.08 | 4.63 | -1.45 |
MPT-7B-Chat | 5.85 | 4.99 | -0.86 |
Falcon-40B-instruct | 5.81 | 4.53 | -1.29 |
H2OGPT-Oasst-Open-LLaMA-13B | 5.51 | 3.74 | -1.78 |
Figure 3: MT-bench provides more explainability in evaluating LLMs' human preferences.
+ +In conclusion, we have shown that MT-Bench effectively differentiates between chatbots of varying capabilities. +It's scalable, offers valuable insights with category breakdowns, and provides explainability for human judges to verify. +However, LLM judges should be used carefully. It can still make errors, especially when grading math/reasoning questions. + +## How to Evaluate New Models on MT-Bench? + +Evaluating models on MT-bench is simple and fast. Our script supports all huggingface models, and we’ve provided [detailed instructions](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge#mt-bench), +in which you can generate model’s answers to the MT-bench questions and their GPT-4 judgments. You can also examine the answers and reviews on our gradio browsing demo. + +## Next steps + +**Release of Conversations Data** + +We're in the process of releasing Chatbot Arena conversations data to the broader research community. Stay tuned for updates! + +**MT-bench-1K** + +MT-Bench currently consists of a concise set of 80 carefully curated questions, ensuring the highest quality. +We're actively expanding the question set to MT-Bench-1K by integrating high-quality prompts from the Chatbot Arena and generating new ones automatically using LLMs. +If you have any good ideas, we'd be delighted to hear from you. + +**Invitation for collaborations** + +We're engaging with various organizations to explore possibilities for standardizing the evaluation of human preferences for LLMs at scale. +If this interests you, please feel free to reach out to us. + +## Related work + +There has been a great amount of interesting work studying how to evaluate human preferences and how to use strong LLM as judges for evaluation. +You are welcome to check them out and see more opinions on this topic: + +- [Judging LLM-as-a-judge with MT-Bench and Chatbot Arena](https://arxiv.org/abs/2306.05685) +- [Can foundation models label data like humans?](https://huggingface.co/blog/llm-leaderboard) +- [How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources](https://arxiv.org/abs/2306.04751) +- [The False Promise of Imitating Proprietary LLMs](https://arxiv.org/abs/2305.15717) +- [AlpacaEval and AlpacaFarm](https://github.com/tatsu-lab/alpaca_eval) +- [Large Language Models are not Fair Evaluators](https://arxiv.org/abs/2305.17926) + +## Links + +Below are readily available tools and code to run MT-bench and other metrics used in this blogpost: + +- The MT-bench uses [fastchat.llm_judge](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge), +- The [Arena Elo calculator](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing). +- The MMLU is based on [InstructEval](https://github.com/declare-lab/instruct-eval/blob/main/mmlu.py) and [Chain-of-Thought Hub](https://github.com/FranxYao/chain-of-thought-hub/tree/main/MMLU). + +If you wish to see more models on leaderboard, we invite you to [contribute to FastChat](https://github.com/lm-sys/FastChat/blob/main/docs/arena.md#how-to-add-a-new-model) or [contact us](mailto:lmsysorg@gmail.com) to provide us with API access. 
diff --git a/_posts/2023-07-20-dataset.md b/_posts/2023-07-20-dataset.md new file mode 100644 index 0000000..b404505 --- /dev/null +++ b/_posts/2023-07-20-dataset.md @@ -0,0 +1,117 @@ +--- +layout: distill +title: Chatbot Arena Conversation Dataset Release +giscus_comments: true +date: 2023-07-20 +featured: false +thumbnail: assets/img/blog/arena/cover.png +authors: + - name: Chatbot Arena Team + affiliations: + name: LMSYS Org +--- + +Since its launch three months ago, [Chatbot Arena](https://lmsys.org/blog/2023-05-03-arena/) has become a widely cited LLM evaluation platform that emphasizes large-scale, community-based, and interactive human evaluation. In that short time span, we collected around 53K votes from 19K unique IP addresses for 22 models. + +In this blog post, we are releasing an updated leaderboard with more models and two datasets for human preference related study: + +- **33K crowd-sourced conversations** with human preference annotations from Chatbot Arena. ([link](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations)) +- **3K expert-level human annotations** from MT-bench. ([link](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments)) + +As estimated by this Llama2 analysis blog [post](https://www.interconnects.ai/p/llama-2-from-meta?sd=pf), Meta spent about 8 million on human preference data for LLama 2 and that dataset is not avaialble now. +Therefore, we think our datasets are highly valuable due to the expensive nature of obtaining human preferences and the limited availability of open, high-quality datasets. + +## Updated Leaderboard + +We are hosting the latest leaderboard at [lmsys/chatbot-arena-leaderboard](https://lmarena.ai/?leaderboard). Below is a screenshot. Since the last update, we added two 30B models: Vicuna-33B-v1.3 and MPT-30B-chat, both of which perform very well in the arena. +Two days ago, we also introduced Llama 2 and Claude 2 to the arena. The leaderboard will soon include them after we get enough votes. +Please help us by casting your votes at our voting [website](https://lmarena.ai). + +Besides the slowly updated Arena Elo ratings, we also use MT-bench, a fast GPT-4 based automatic evaluation pipeline to evaluate all new models, including LLama 2 (chat), Claude 2, WizardLM-13B-v1.1, XGen-7B-8K-Inst, and ChatGLM2-6B. +You are welcome to check out the interactive [lmsys/chatbot-arena-leaderboard](https://lmarena.ai/?leaderboard) to sort the models according to different metrics. +Some early evaluation results of LLama 2 can be found in our [tweets](https://twitter.com/lmsysorg/status/1681744327192752128). + + +Figure 1. Chatbot Arena Leaderboard (see more)
+ +## Dataset 1: 33K Chatbot Arena Conversation Data + +Link: [lmsys/chatbot_arena_conversations](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations) + +This dataset contains 33K cleaned conversations with pairwise human preferences collected on Chatbot Arena from April to June 2023. +Each sample includes two model names, their full conversation text, the user vote, the anonymized user ID, the detected language tag, the OpenAI moderation API tag, the additional toxic tag, and the timestamp. + +To ensure the safe release of data, we have attempted to remove all conversations that contain personally identifiable information (PII). In addition, we have included the OpenAI moderation API output to flag inappropriate conversations. However, we have chosen not to remove all of these conversations so that researchers can study safety-related questions associated with LLM usage in the wild as well as the OpenAI moderation process. As an example, we included additional toxic tags that are generated by our own toxic tagger, which are trained by fine-tuning T5 and RoBERTa on manually labeled data. + +### Uniqueness and Potential Usage + +Compared to existing human preference datasets like [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf), and [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1). This dataset + +- Contains the outputs of 20 LLMs including stronger LLMs such as GPT-4 and Claude-v1. It also contains many failure cases of these state-of-the-art models. +- Contains unrestricted conversations from over 13K users in the wild. + +We believe this data will help the AI research community answer important questions around topics like: + +- Characteristics of real-world user prompts +- Train better models with RLHF +- Improve and evaluate LLM evaluation methods +- Build model selection and request dispatching algorithms +- Study the design and application of inappropriate content filtering mechanisms + +### Disclaimers and Terms + +- This dataset includes offensive conversations. It is not intended for training dialogue agents without applying appropriate filtering measures. We are not responsible for any outputs of the models trained on this dataset. +- Statements or opinions made in this dataset do not reflect the views of researchers or institutions involved in the data collection effort. +- Users of this data are responsible for ensuring its appropriate use, which includes abiding by any applicable laws and regulations. +- Users of this data should adhere to the terms of use for a specific model when using its direct outputs. +- Please contact us if you find any issues with the dataset. + +### Visualization and Elo Rating Calculation + +This Colab [notebook](https://colab.research.google.com/drive/1J2Wf7sxc9SVmGnSX_lImhT246pxNVZip?usp=sharing) provides some visualizations and shows how to compute Elo ratings with the dataset. We pasted some figures here. + + +Figure 2. Fraction of Model A Wins for All Non-tied A vs. B Battles.
+ +Figure 3. Battle Counts of Each Models Pair.
+ +## Dataset 2: 3K MT-bench Human Annotations + +Link: [lmsys/mt_bench_human_judgments](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments) + +In addition to the crowd-sourced evaluation with Chatbot Arena, we also conducted a controlled human evaluation with MT-bench. + +This dataset contains 3.3K expert-level pairwise human preferences for model responses generated by 6 models in response to 80 MT-bench questions. +The 6 models are GPT-4, GPT-3.5, Claud-v1, Vicuna-13B, Alpaca-13B, and LLaMA-13B. The annotators are mostly graduate students with expertise in the topic areas of each of the questions. The details of data collection can be found in our [paper](https://arxiv.org/abs/2306.05685). + +### Agreement Calculation + +This Colab [notebook](https://colab.research.google.com/drive/1ctgygDRJhVGUJTQy8-bRZCl1WNcT8De6?usp=sharing) shows how to compute the agreement between humans and GPT-4 judge with the dataset. Our results show that humans and GPT-4 judge achieve over 80\% agreement, the same level of agreement between humans. + +## Acknowlement + +We thank the whole community for contributing to the arena dataset. +We also plan to gradually release more conversations in the future after doing thorough review. + +## Citation + +``` +@misc{chiang2024chatbot, + title={Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference}, + author={Wei-Lin Chiang and Lianmin Zheng and Ying Sheng and Anastasios Nikolas Angelopoulos and Tianle Li and Dacheng Li and Hao Zhang and Banghua Zhu and Michael Jordan and Joseph E. Gonzalez and Ion Stoica}, + year={2024}, + eprint={2403.04132}, + archivePrefix={arXiv}, + primaryClass={cs.AI} +} +@inproceedings{zheng2023judging, + title={Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena}, + author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica}, + booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track}, + year={2023}, + url={https://openreview.net/forum?id=uccHPGDlao} +} +``` diff --git a/_posts/2024-03-01-policy.md b/_posts/2024-03-01-policy.md new file mode 100644 index 0000000..d7c9574 --- /dev/null +++ b/_posts/2024-03-01-policy.md @@ -0,0 +1,92 @@ +--- +layout: distill +title: Chatbot Arena Policy Update +description: Live and Community-Driven LLM Evaluation +giscus_comments: true +date: 2024-03-01 +featured: false +thumbnail: assets/img/blog/arena_policy/arena_logo_v0_4x3.png +authors: + - name: Chatbot Arena Team + affiliations: + name: LMSYS Org +--- + +## Our Mission + +Chatbot Arena ([lmarena.ai](https://lmarena.ai)) is an open-source project developed by members from [LMSYS](https://lmarena.ai/?about) and UC Berkeley SkyLab. Our mission is to advance LLM development and understanding through live, open, and community-driven evaluations. We maintain the open evaluation platform for any user to rate LLMs via pairwise comparisons under real-world use cases and publish [leaderboard](https://lmarena.ai/?leaderboard) periodically. + + + +## Our Progress + +Chatbot Arena was first launched in [May 2023](https://lmsys.org/blog/2023-05-03-arena/) and has emerged as a critical platform for live, community-driven LLM evaluation, attracting millions of participants and collecting over 800,000 votes. 
This extensive engagement has enabled the evaluation of more than 90 LLMs, including both commercial GPT-4, Gemini/Bard and open-weight Llama and Mistral models, significantly enhancing our understanding of their capabilities and limitations. + +Our periodic [leaderboard](https://lmarena.ai/?leaderboard) and blog post updates have become a valuable resource for the community, offering critical insights into model performance that guide the ongoing development of LLMs. Our commitment to open science is further demonstrated through the sharing of [user preference data](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations) and [one million user prompts](https://huggingface.co/datasets/lmsys/lmsys-chat-1m), supporting research and model improvement. + +We also collaborate with open-source and commercial model providers to bring their latest models to community for preview testing. We believe this initiative helps advancing the field and encourages user engagement to collect crucial votes for evaluating all the models in the Arena. Moreover, it provides an opportunity for the community to test and provide anonymized feedback before the models are officially released. + +The platform's infrastructure ([FastChat](https://github.com/lm-sys/FastChat)) and evaluation tools, available on GitHub, emphasize our dedication to transparency and community engagement in the evaluation process. This approach not only enhances the reliability of our findings but also fosters a collaborative environment for advancing LLMs. + +In our ongoing efforts, we feel obligated to establish policies that guarantee evaluation transparency and trustworthiness. Moreover, we actively involve the community in shaping any modifications to the evaluation process, reinforcing our commitment to openness and collaborative progress. + +## Our Policy + +