diff --git a/_posts/2023-05-03-arena.md b/_posts/2023-05-03-arena.md new file mode 100644 index 0000000..2018f47 --- /dev/null +++ b/_posts/2023-05-03-arena.md @@ -0,0 +1,223 @@ +--- +layout: distill +title: Chatbot Arena +description: Benchmarking LLMs in the Wild with Elo Ratings +giscus_comments: true +date: 2023-05-03 +featured: true +thumbnail: assets/img/blog/arena/cover.png +authors: + - name: Lianmin Zheng* + url: https://lmzheng.net/ + affiliations: + name: UC Berkeley, LMSys + - name: Ying Sheng* + url: https://sites.google.com/view/yingsheng/home + - name: Wei-Lin Chiang + url: https://infwinston.github.io/ + - name: Hao Zhang + url: https://cseweb.ucsd.edu/~haozhang/ + - name: Joseph E. Gonzalez + url: https://people.eecs.berkeley.edu/~jegonzal/ + - name: Ion Stoica + url: https://people.eecs.berkeley.edu/~istoica/ +--- + +We present Chatbot Arena, a benchmark platform for large language models (LLMs) that features anonymous, randomized battles in a crowdsourced manner. In this blog post, we are releasing our initial results and a leaderboard based on the Elo rating system, which is a widely-used rating system in chess and other competitive games. We invite the entire community to join this effort by contributing new models and evaluating them by asking questions and voting for your favorite answer. + + + +

Table 1. LLM Leaderboard (April 24 - May 1, 2023). The latest and detailed version here.

| Rank | Model | Elo Rating | Description |
| --- | --- | --- | --- |
| 1 | 🥇 vicuna-13b | 1169 | a chat assistant fine-tuned from LLaMA on user-shared conversations by LMSYS |
| 2 | 🥈 koala-13b | 1082 | a dialogue model for academic research by BAIR |
| 3 | 🥉 oasst-pythia-12b | 1065 | an Open Assistant for everyone by LAION |
| 4 | alpaca-13b | 1008 | a model fine-tuned from LLaMA on instruction-following demonstrations by Stanford |
| 5 | chatglm-6b | 985 | an open bilingual dialogue language model by Tsinghua University |
| 6 | fastchat-t5-3b | 951 | a chat assistant fine-tuned from FLAN-T5 by LMSYS |
| 7 | dolly-v2-12b | 944 | an instruction-tuned open large language model by Databricks |
| 8 | llama-13b | 932 | open and efficient foundation language models by Meta |
| 9 | stablelm-tuned-alpha-7b | 858 | Stability AI language models |

+ +­ + +Table 1 displays the Elo ratings of nine popular models, which are based on the 4.7K voting data and calculations shared in this [notebook](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing). You can also try the voting [demo](https://lmarena.ai). + + +

Figure 1. The side-by-side chatting and voting interface.

+ +Please note that we periodically release blog posts to update the leaderboard. Feel free to check the following updates: + +- [May 10 Updates](https://lmsys.org/blog/2023-05-10-leaderboard/) +- [May 25 Updates](https://lmsys.org/blog/2023-05-25-leaderboard/) +- [June 22 Updates](https://lmsys.org/blog/2023-06-22-leaderboard/) +- [Dataset Release (July 20)](https://lmsys.org/blog/2023-07-20-dataset/) +- [Dec. 7 Updates](https://lmsys.org/blog/2023-12-07-leaderboard/) +- [Policy Updates (March 1, 2024)](https://lmsys.org/blog/2024-03-01-policy/) + +## Introduction + +Following the great success of ChatGPT, there has been a proliferation of open-source large language models that are finetuned to follow instructions. These models are capable of providing valuable assistance in response to users’ questions/prompts. Notable examples include Alpaca and Vicuna, based on LLaMA, and OpenAssistant and Dolly, based on Pythia. + +Despite the constant release of new models every week, the community faces a challenge in benchmarking these models effectively. Benchmarking LLM assistants is extremely challenging because the problems can be open-ended, and it is very difficult to write a program to automatically evaluate the response quality. +In this case, we typically have to resort to human evaluation based on pairwise comparison. + +There are some desired properties for a good benchmark system based on pairwise comparison. + +- **Scalability**. The system should scale to a large number of models when it is not feasible to collect sufficient data for all possible model pairs. +- **Incrementality**. The system should be able to evaluate a new model using a relatively small number of trials. +- **Unique order**. The system should provide a unique order for all models. Given any two models, we should be able to tell which ranks higher or whether they are tied. + +Existing LLM benchmark systems rarely satisfy all of these properties. 
Classical LLM benchmark frameworks, such as [HELM](https://crfm.stanford.edu/helm/latest/) and [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), provide multi-metric measurements for tasks commonly used in academic research. However, they are not based on pairwise comparison and are not effective at evaluating open-ended questions. OpenAI also launched the [evals](https://github.com/openai/evals) project to collect better questions, but this project does not provide ranking mechanisms for all participating models. When we launched our [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/) model, we utilized a GPT-4-based evaluation pipeline, but it does not provide a solution for scalable and incremental ratings. + +In this blog post, we introduce Chatbot Arena, an LLM benchmark platform featuring anonymous randomized battles in a crowdsourced manner. Chatbot Arena adopts the [Elo rating system](https://en.wikipedia.org/wiki/Elo_rating_system), which is a widely-used rating system in chess and other competitive games. The Elo rating system is a promising way to provide the desired properties mentioned above. We noticed that the [Anthropic LLM paper](https://arxiv.org/pdf/2204.05862.pdf) also adopted the Elo rating system. + +To collect data, we launched the arena with several popular open-source LLMs one week ago. In the arena, a user can chat with two anonymous models side-by-side and vote for which one is better. This crowdsourced approach to data collection represents some use cases of LLMs in the wild. A comparison between several evaluation methods is shown in Table 2. + +
+

Table 2: Comparison between different evaluation methods.

+
| | HELM / lm-evaluation-harness | OpenAI/eval | Alpaca Evaluation | Vicuna Evaluation | Chatbot Arena |
| --- | --- | --- | --- | --- | --- |
| Question Source | Academic datasets | Mixed | Self-instruct evaluation set | GPT-4 generated | User prompts |
| Evaluator | Program | Program/Model | Human | GPT-4 | User |
| Metrics | Basic metrics | Basic metrics | Win rate | Win rate | Elo ratings |

+
+ +## Data Collection + +We hosted the arena at [https://lmarena.ai](https://lmarena.ai) with our multi-model serving system, [FastChat](https://github.com/lm-sys/FastChat). When a user enters the arena, they can chat with two anonymous models side-by-side, as shown in Figure 1. +After getting responses from the two models, users can continue chatting or vote for the model they think is better. Once a vote is submitted, the model names will be revealed. Users can continue chatting or restart a new battle with two new randomly chosen anonymous models. The platform logs all user interactions. In our analysis, we only use the votes when the model names are hidden. + +The arena was launched about one week ago and we have collected 4.7k valid anonymous votes since then. We share some exploratory analysis in this [notebook](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing) and present a short summary here. + + +

Figure 2: Battle count of each combination of models

+ +Figure 2 shows the battle count for each combination of models. When we initially launched the tournament, we had prior information on the likely ranking based on our benchmarks and chose to pair models according to this ranking, giving preference to what we believed would be strong pairings. However, we later switched to uniform sampling to get better overall coverage of the rankings. Towards the end of the tournament, we also introduced a new model, `fastchat-t5-3b`. All of these factors result in non-uniform model frequency. + + +

Figure 3: Battle counts for the top-15 languages.

+ +Figure 3 plots the language distribution and shows most user prompts are in English. + +## Elo Rating System + +The [Elo rating system](https://en.wikipedia.org/wiki/Elo_rating_system) is a method for calculating the relative skill levels of players, which has been widely adopted in competitive games and sports. The difference in the ratings between two players serves as a predictor of the outcome of a match. The Elo rating system works well for our case because we have multiple models and we run pairwise battles between them. + +If player A has a rating of `Ra` and player B a rating of `Rb`, the exact formula (using the logistic curve with base 10) for the probability of player A winning is + +$$E_a = \frac{1}{1 + 10^{(R_b - R_a)/400}}$$ + +The ratings of players can be linearly updated after each battle. Suppose player A (with rating `Ra`) was expected to score `Ea` points but actually scored `Sa` points. The formula for updating that player's rating, where `K` is the update step size (the K-factor), is + +$$R_a' = R_a + K \cdot (S_a - E_a)$$ + +Using the collected data, we compute the Elo ratings of the models in this [notebook](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing) and put the main results in Table 1. You are welcome to try the notebook and play with the voting data by yourself. The data only contains voting results without conversation histories because releasing the conversation histories would raise concerns such as privacy and toxicity. + +## Pairwise Win Rates + +As a basis for calibration, we also present here the pairwise win rates for each model in the tournament (Figure 4) as well as the predicted pairwise win rate estimated using Elo ratings (Figure 5). +By comparing the figures, we find the Elo ratings can predict win rates relatively well. + + +
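As a rough illustration of the Elo rating system described above, the following sketch implements the standard expected-score formula and the linear update rule, then predicts a pairwise win rate from two ratings in Table 1. The K-factor of 32, the initial rating of 1000, the function names, and the three votes are assumptions chosen for illustration; they are not taken from the notebook.

```python
from collections import defaultdict


def expected_score(r_a, r_b, scale=400.0, base=10.0):
    """Predicted probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + base ** ((r_b - r_a) / scale))


def compute_elo(battles, k=32.0, init_rating=1000.0):
    """Run online Elo updates over (model_a, model_b, winner) records.

    `winner` is "a", "b", or "tie". K and the initial rating are
    illustrative assumptions, not values from the original analysis.
    """
    ratings = defaultdict(lambda: init_rating)
    for model_a, model_b, winner in battles:
        r_a, r_b = ratings[model_a], ratings[model_b]
        e_a = expected_score(r_a, r_b)
        s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        # Linear update: A gains exactly what B loses.
        ratings[model_a] = r_a + k * (s_a - e_a)
        ratings[model_b] = r_b - k * (s_a - e_a)
    return dict(ratings)


# Hypothetical votes, for illustration only.
battles = [
    ("vicuna-13b", "llama-13b", "a"),
    ("koala-13b", "vicuna-13b", "b"),
    ("llama-13b", "koala-13b", "tie"),
]
ratings = compute_elo(battles)

# Predicted win rate of vicuna-13b (1169) over koala-13b (1082) from Table 1.
p = expected_score(1169, 1082)
```

Because each update moves the two ratings by equal and opposite amounts, the sum of all ratings stays constant, and the final ratings depend on the order in which battles are processed.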

Figure 4: Fraction of Model A wins for all non-tied A vs. B battles.
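The quantity plotted in Figure 4 can be sketched as follows: ties are dropped, and for each ordered pair the number of wins is divided by the number of non-tied battles between the two models. The record layout `(model_a, model_b, winner)` and the sample votes are hypothetical.

```python
from collections import defaultdict


def win_fraction(battles):
    """Return {(x, y): fraction of non-tied x-vs-y battles won by x}."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for model_a, model_b, winner in battles:
        if winner == "tie":
            continue  # ties are excluded from the win fraction
        if winner == "a":
            wins[(model_a, model_b)] += 1
        else:
            wins[(model_b, model_a)] += 1
        # Count the battle for both orderings of the pair, regardless of
        # which side of the interface each model appeared on.
        totals[(model_a, model_b)] += 1
        totals[(model_b, model_a)] += 1
    return {pair: wins[pair] / n for pair, n in totals.items()}


# Hypothetical votes, for illustration only.
votes = [
    ("vicuna-13b", "koala-13b", "a"),
    ("koala-13b", "vicuna-13b", "a"),
    ("vicuna-13b", "koala-13b", "a"),
    ("vicuna-13b", "koala-13b", "tie"),
]
fractions = win_fraction(votes)
```

For any pair, the two directional fractions sum to one, so the resulting matrix is fully determined by its upper triangle.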

+ + +

Figure 5: Predicted win rate using Elo ratings for Model A in an A vs. B battle

+ +## Future Plans + +We plan to work on the following items: + +- Add more closed-source models (ChatGPT-3.5, ChatGPT-4, and Claude-v1 are available now in the anonymous Arena) +- Add more open-source models +- Release periodically updated leaderboards (e.g., monthly) +- Implement better sampling algorithms, tournament mechanisms, and serving systems to support a much larger number of models +- Provide fine-grained rankings on different task types. + +We appreciate any feedback from you to make the arena better. + +## Join Us + +We invite the entire community to join this benchmarking effort by contributing your models and voting for the anonymous models you think provide better answers. You can visit [https://lmarena.ai](https://lmarena.ai) to vote for better models. If you want to see a specific model in the arena, you can follow this [guide](https://github.com/lm-sys/FastChat/blob/main/docs/arena.md#how-to-add-a-new-model) to help us add it. + +## Acknowledgment + +We thank other members of the Vicuna team for valuable feedback and MBZUAI for donating compute resources. Additionally, we extend our thanks to Tianjun Zhang and Eric Wallace for their insightful discussions. + +## Links + +- Demo: [https://lmarena.ai](https://lmarena.ai) +- Leaderboard: [https://lmarena.ai/?leaderboard](https://lmarena.ai/?leaderboard) +- GitHub: [https://github.com/lm-sys/FastChat](https://github.com/lm-sys/FastChat) +- Colab notebook: [https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing) + +## Citation + +Please cite the following [papers](https://arxiv.org/abs/2403.04132) if you find our work useful. 
+ +``` +@misc{chiang2024chatbot, + title={Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference}, + author={Wei-Lin Chiang and Lianmin Zheng and Ying Sheng and Anastasios Nikolas Angelopoulos and Tianle Li and Dacheng Li and Hao Zhang and Banghua Zhu and Michael Jordan and Joseph E. Gonzalez and Ion Stoica}, + year={2024}, + eprint={2403.04132}, + archivePrefix={arXiv}, + primaryClass={cs.AI} +} + +@inproceedings{zheng2023judging, + title={Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena}, + author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica}, + booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track}, + year={2023}, + url={https://openreview.net/forum?id=uccHPGDlao} +} + +@inproceedings{zheng2024lmsyschatm, + title={LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset}, + author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Tianle Li and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zhuohan Li and Zi Lin and Eric Xing and Joseph E. 
Gonzalez and Ion Stoica and Hao Zhang}, + booktitle={The Twelfth International Conference on Learning Representations}, + year={2024}, + url={https://openreview.net/forum?id=BOfDKxfwt0} +} +``` diff --git a/_posts/2023-05-10-leaderboard.md b/_posts/2023-05-10-leaderboard.md new file mode 100644 index 0000000..c989a0b --- /dev/null +++ b/_posts/2023-05-10-leaderboard.md @@ -0,0 +1,149 @@ +--- +layout: distill +title: Chatbot Arena Leaderboard Updates (Week 2) +giscus_comments: true +date: 2023-05-10 +featured: false +thumbnail: assets/img/blog/leaderboard_week2/leaderboard_cover.png +authors: + - name: Chatbot Arena Team + affiliations: + name: LMSYS Org +--- + +We release an updated leaderboard with more models and new data we collected last week, after the announcement of the anonymous [Chatbot Arena](https://lmsys.org/blog/2023-05-03-arena/). We are actively iterating on the design of the arena and leaderboard scores. + +In this update, we have added 4 new yet strong players into the Arena, including three **proprietary models** and one open-source model. They are: + +- OpenAI GPT-4 +- OpenAI GPT-3.5-turbo +- Anthropic Claude-v1 +- RWKV-4-Raven-14B + +Table 1 displays the Elo ratings of all 13 models, which are based on the 13K voting data and calculations shared in this [notebook](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing). You can also try the voting [demo](https://lmarena.ai). + + + +

Table 1. LLM Leaderboard (April 24 - May 8, 2023). The latest and detailed version here.

| Rank | Model | Elo Rating | Description | License |
| --- | --- | --- | --- | --- |
| 1 | 🥇 GPT-4 | 1274 | ChatGPT-4 by OpenAI | Proprietary |
| 2 | 🥈 Claude-v1 | 1224 | Claude by Anthropic | Proprietary |
| 3 | 🥉 GPT-3.5-turbo | 1155 | ChatGPT-3.5 by OpenAI | Proprietary |
| 4 | Vicuna-13B | 1083 | a chat assistant fine-tuned from LLaMA on user-shared conversations by LMSYS | Weights available; Non-commercial |
| 5 | Koala-13B | 1022 | a dialogue model for academic research by BAIR | Weights available; Non-commercial |
| 6 | RWKV-4-Raven-14B | 989 | an RNN with transformer-level LLM performance | Apache 2.0 |
| 7 | Oasst-Pythia-12B | 928 | an Open Assistant for everyone by LAION | Apache 2.0 |
| 8 | ChatGLM-6B | 918 | an open bilingual dialogue language model by Tsinghua University | Weights available; Non-commercial |
| 9 | StableLM-Tuned-Alpha-7B | 906 | Stability AI language models | CC-BY-NC-SA-4.0 |
| 10 | Alpaca-13B | 904 | a model fine-tuned from LLaMA on instruction-following demonstrations by Stanford | Weights available; Non-commercial |
| 11 | FastChat-T5-3B | 902 | a chat assistant fine-tuned from FLAN-T5 by LMSYS | Apache 2.0 |
| 12 | Dolly-V2-12B | 863 | an instruction-tuned open large language model by Databricks | MIT |
| 13 | LLaMA-13B | 826 | open and efficient foundation language models by Meta | Weights available; Non-commercial |

+ +If you want to see more models, please help us [add them](https://github.com/lm-sys/FastChat/blob/main/docs/arena.md#how-to-add-a-new-model) or [contact us](mailto:lmsysorg@gmail.com) by giving us API access. + +## Overview + +Thanks to the community's help, we have gathered 13k anonymous votes. Looking at the rankings and data collected from this leaderboard update, we have a few interesting findings. + +**Gaps between proprietary and open-source models** +We observe a substantial gap between the three proprietary models and all the open-source models. +In particular, GPT-4 is leading the board, achieving an Elo score of 1274. It is almost 200 points higher than the best open-source alternative on this board -- our Vicuna-13B. +After dropping ties, GPT-4 wins 82% of the matches against Vicuna-13B, and it even wins 79% of the matches against its previous generation, GPT-3.5-turbo. + +However, it is important to note that these open-source models on the leaderboard generally have fewer parameters, in the range of 3B - 14B, than proprietary models. +In fact, recent advancements in LLMs and data curation have allowed for significant improvements in performance with smaller models. +[Google's latest PaLM 2](https://ai.google/discover/palm2) is a great example of this: knowing that PaLM 2 achieves even better performance than its previous generation using smaller model sizes, +we remain very optimistic about the potential for open-source language models to catch up. Through our [FastChat-based Chatbot Arena](https://github.com/lm-sys/FastChat) and this leaderboard effort, +we hope to contribute a trusted platform for evaluating LLMs, and help advance this field and create better language models for everyone. 
+ +**Comparing proprietary models** +Among the three proprietary models, we observe, based on our collected voting results, +that Anthropic's Claude model is preferred by our users over GPT-3.5-turbo, which is often discussed as its opponent. +In fact, Claude is highly competitive even when competing against the most powerful model -- OpenAI's GPT-4. +Looking at the win rate plots (Figure 3 below), among the 66 non-tied matches between GPT-4 and Claude, Claude wins over GPT-4 in 32 (48%) matches. Great job Anthropic team! + +**Comparing open-source chatbots** +In this update, we have added the RWKV-4-Raven-14B model into the Arena thanks to the community [contribution](https://github.com/lm-sys/FastChat/issues/633). Unlike all other models, the RWKV model is an RNN instead of a transformer-based model, but it performs surprisingly well! +It quickly climbed the leaderboard and now ranks #6 overall. It wins more than 50% of non-tied matches against all other open-source models except Vicuna. You are welcome to check out its [repo](https://github.com/BlinkDL/RWKV-LM) to learn more about other features like memory saving and fast inference. +Kudos to the RWKV developers. + +**Fluctuations of Elo scores** +The Elo scores of existing models can go up and down depending on the results of the new games played. This is similar to the way the Elo scores of chess players vary over time (see [here](https://en.chessbase.com/post/historical-chess-ratings-dynamically-presented)). +Since the three strong proprietary models joined, the Chatbot Arena has never been more competitive! +As a consequence, we observe the Elo scores of all open-source models have decreased a bit. This is because open-source models lose many pairwise matches against the proprietary models. + +## Detailed Results + +**When does GPT-4 fail?** +We present a few examples in which GPT-4 is not preferred by users. + + +

Figure 1: One example where Claude is preferred over GPT-4.

+ +In Figure 1, the user posed a tricky question that demanded careful reasoning and planning. Although both Claude and GPT-4 provided similar answers, Claude's response was marginally better as the needle was positioned on top. +However, we observed that the outcome of this example cannot always be replicated due to the randomness of sampling. +Sometimes GPT-4 can also give the same order as Claude, but it fails at this generation trial. +Additionally, we noted that the behavior of GPT-4 differed slightly when using the OpenAI API versus the ChatGPT interface, which could be attributed to different prompts, sampling parameters, or other unknown factors. + + +

Figure 2: One example where a user thinks both Claude and GPT-4 are wrong.

+ +Figure 2 shows that both Claude and GPT-4 still struggle with this kind of tricky reasoning question despite their amazing capabilities. + +Besides these tricky cases, there are also a lot of easy questions that do not require complex reasoning or knowledge. In these cases, open-source models like Vicuna can perform on par with GPT-4, so we might be able to use a slightly weaker (but smaller or cheaper) LLM in place of a more powerful one like GPT-4. + +**Win Fraction Matrix** +We present the win fraction of all model pairs in Figure 3. + + +

Figure 3: Fraction of Model A Wins for All Non-tied A vs. B Battles.

+ +**Language-specific leaderboards** +Lastly, we present two language-specific leaderboards, by isolating the conversation data into two subsets based on the language: (1) English-only and (2) non-English. From Figure 4, we can tell that Koala is worse at non-English languages and ChatGLM-6B is better at non-English languages. This is because of the different compositions of their training data. + + +

Figure 4: The English-only and non-English leaderboards.
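The language split behind Figure 4 can be sketched as follows: the votes are partitioned into English and non-English subsets, and Elo ratings are computed separately on each. The record fields (`model_a`, `model_b`, `winner`, `language`), the K-factor of 32, the initial rating of 1000, and the sample votes are all assumptions made for illustration.

```python
def elo(battles, k=32.0, init=1000.0):
    """Online Elo over (model_a, model_b, winner) tuples; K and init are assumed."""
    ratings = {}
    for a, b, winner in battles:
        r_a, r_b = ratings.get(a, init), ratings.get(b, init)
        e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
        s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[a] = r_a + k * (s_a - e_a)
        ratings[b] = r_b - k * (s_a - e_a)
    return ratings


def leaderboards_by_language(records):
    """Build separate English-only and non-English leaderboards."""
    def as_battle(r):
        return (r["model_a"], r["model_b"], r["winner"])

    def rank(ratings):
        # Highest rating first.
        return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

    english = [as_battle(r) for r in records if r["language"] == "English"]
    other = [as_battle(r) for r in records if r["language"] != "English"]
    return {"english": rank(elo(english)), "non_english": rank(elo(other))}


# Hypothetical votes, for illustration only.
records = [
    {"model_a": "koala-13b", "model_b": "chatglm-6b", "winner": "a", "language": "English"},
    {"model_a": "koala-13b", "model_b": "chatglm-6b", "winner": "b", "language": "Chinese"},
    {"model_a": "chatglm-6b", "model_b": "koala-13b", "winner": "a", "language": "Spanish"},
]
boards = leaderboards_by_language(records)
```

With these made-up votes, the two leaderboards rank the same pair of models in opposite orders, which is the kind of divergence the language-specific view is meant to expose.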

+ +More figures, analyses, and calculations can be found in this [notebook](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing). + +## Next Steps + +**Help us add more models** +Since the launch of Chatbot Arena, we have seen growing interest from the community. Many model developers are eager to put their chatbots into the Arena and see how they perform against others. +Please help us add more models by following [this guide](https://github.com/lm-sys/FastChat/blob/main/docs/arena.md#how-to-add-a-new-model). + +**Bring your own self-hosted chatbot (BYOC)** +We also plan to open some APIs to allow competitors to register their self-hosted chatbots and participate in the Arena. + +**Area-specific Arena** +Similar to the language-specific Arena, we will extend a single, monolithic leaderboard to more areas, and publish more functionality-specific leaderboards, +such as writing, coding, and reasoning. In which specific area or ability do you want to see the LLMs evaluated? +Please give us feedback on [Discord](https://discord.gg/HSWAKCrnFx) or [Twitter](https://twitter.com/lmsysorg). + +## Acknowledgement + +This blog post is primarily contributed by Lianmin Zheng, Ying Sheng, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. +We thank other members of LMSYS team (Wei-Lin Chiang, Siyuan Zhuang, and more) for valuable feedback and MBZUAI for donating compute resources. +Additionally, we extend our thanks to community contributors for their votes and model support. 
diff --git a/_posts/2023-05-25-leaderboard.md b/_posts/2023-05-25-leaderboard.md new file mode 100644 index 0000000..00458a5 --- /dev/null +++ b/_posts/2023-05-25-leaderboard.md @@ -0,0 +1,171 @@ +--- +layout: distill +title: Chatbot Arena Leaderboard Updates (Week 4) +giscus_comments: true +date: 2023-05-25 +featured: false +thumbnail: assets/img/blog/leaderboard_week4/leaderboard_cover.png +authors: + - name: Chatbot Arena Team + affiliations: + name: LMSYS Org +--- + +In this update, we are excited to welcome the following models joining the [Chatbot Arena](https://lmsys.org/blog/2023-05-03-arena/): + +1. Google PaLM 2, chat-tuned with the code name [chat-bison@001](https://cloud.google.com/vertex-ai/docs/release-notes#May_10_2023) on Google Cloud Vertex AI +2. Anthropic Claude-instant-v1 +3. MosaicML MPT-7B-chat +4. Vicuna-7B + +A new Elo rating leaderboard based on the 27K anonymous voting data collected **in the wild** between April 24 and May 22, 2023 is released in Table 1 below. + +We provide a [Google Colab notebook](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing) to analyze the voting data, including the computation of the Elo ratings. +You can also try the voting [demo](https://lmarena.ai). + + + +
+

Table 1. LLM Leaderboard (April 24 - May 22, 2023). The latest and detailed version here.

| Rank | Model | Elo Rating | Description | License |
| --- | --- | --- | --- | --- |
| 1 | 🥇 GPT-4 | 1225 | ChatGPT-4 by OpenAI | Proprietary |
| 2 | 🥈 Claude-v1 | 1195 | Claude by Anthropic | Proprietary |
| 3 | 🥉 Claude-instant-v1 | 1153 | Lighter, less expensive, and much faster version of Claude | Proprietary |
| 4 | GPT-3.5-turbo | 1143 | ChatGPT-3.5 by OpenAI | Proprietary |
| 5 | Vicuna-13B | 1054 | a chat assistant fine-tuned from LLaMA on user-shared conversations by LMSYS | Weights available; Non-commercial |
| 6 | PaLM 2 | 1042 | PaLM 2 tuned for chat (chat-bison@001 on Google Vertex AI). The PaLM 2 model family is powering Bard. | Proprietary |
| 7 | Vicuna-7B | 1007 | a chat assistant fine-tuned from LLaMA on user-shared conversations by LMSYS | Weights available; Non-commercial |
| 8 | Koala-13B | 980 | a dialogue model for academic research by BAIR | Weights available; Non-commercial |
| 9 | mpt-7b-chat | 952 | a chatbot fine-tuned from MPT-7B by MosaicML | CC-By-NC-SA-4.0 |
| 10 | FastChat-T5-3B | 941 | a chat assistant fine-tuned from FLAN-T5 by LMSYS | Apache 2.0 |
| 11 | Alpaca-13B | 937 | a model fine-tuned from LLaMA on instruction-following demonstrations by Stanford | Weights available; Non-commercial |
| 12 | RWKV-4-Raven-14B | 928 | an RNN with transformer-level LLM performance | Apache 2.0 |
| 13 | Oasst-Pythia-12B | 921 | an Open Assistant for everyone by LAION | Apache 2.0 |
| 14 | ChatGLM-6B | 921 | an open bilingual dialogue language model by Tsinghua University | Weights available; Non-commercial |
| 15 | StableLM-Tuned-Alpha-7B | 882 | Stability AI language models | CC-BY-NC-SA-4.0 |
| 16 | Dolly-V2-12B | 866 | an instruction-tuned open large language model by Databricks | MIT |
| 17 | LLaMA-13B | 854 | open and efficient foundation language models by Meta | Weights available; Non-commercial |

+ +­ + +**Win Fraction Matrix** +The win fraction matrix of all model pairs is shown in Figure 1. + + +

Figure 1: Fraction of Model A Wins for All Non-tied A vs. B Battles.

+ +If you want to see more models, please help us [add them](https://github.com/lm-sys/FastChat/blob/main/docs/arena.md#how-to-add-a-new-model) or contact us by giving us API access. + +## Overview + +### Google PaLM 2 + +Google's PaLM 2 is one of the most significant models announced since our last leaderboard update. We added the PaLM 2 Chat to the Chatbot Arena via the [Google Cloud Vertex AI API](https://cloud.google.com/vertex-ai/docs/release-notes#May_10_2023). The model is chat-tuned under the code name _chat-bison@001_. + +In the past two weeks, PaLM 2 has competed in around 1.8k anonymous battles with the other 16 chatbots and is currently ranked 6th on the leaderboard. It ranks above all open-source chatbots except Vicuna-13B, whose Elo rating is 12 points higher (Vicuna 1054 vs. PaLM 2 1042), which in terms of Elo ratings is a virtual tie. We noted the following interesting results from PaLM 2's Arena data. + +PaLM 2 performs better when playing against the top 4 players, i.e., GPT-4, Claude-v1, ChatGPT, and Claude-instant-v1, and it also wins 53% of the plays with Vicuna, but it performs worse when playing against weaker players. This can be seen in Figure 1, which shows the win fraction matrix. Among all battles PaLM 2 has participated in, 21.6% were lost to a chatbot that is not one of GPT-4, Claude-v1, GPT-3.5-turbo, and Claude-instant-v1. For reference, another proprietary model, GPT-3.5-turbo, only loses 12.8% of battles to those chatbots. + +In short, we find that the current PaLM 2 version available at Google Cloud Vertex API has the following deficiencies when compared to other models we have evaluated: + +1. PaLM 2 seems more strongly regulated than other models, which impacts its ability to answer some questions. +2. The currently offered PaLM 2 has limited multilingual abilities. +3. The currently offered PaLM 2 has unsatisfactory reasoning capabilities. + +**PaLM 2 is more strongly regulated** + +PaLM 2 seems to be more strongly regulated than other models. 
In many user conversations, when the users ask questions that PaLM 2 is uncertain or uncomfortable giving an answer to, PaLM 2 is more likely to abstain from responding than other models. + +Based on a rough estimate, among all pairwise battles, PaLM 2 has lost 20.9% of the battles due to refusing to answer, and it has lost 30.8% of the battles to chatbots not belonging to one of the top four (GPT-4, Claude-v1, ChatGPT, Claude-instant-v1) due to refusing to answer. + +This partially explains why PaLM 2 frequently loses plays to weaker chatbots on the leaderboard. This also highlights a flaw in the chatbot arena methodology, as casual users are more likely to penalize abstention over subtly inaccurate responses. Below we provide several failure cases illustrating how PaLM loses plays to weaker chatbots because it refuses to answer the question. + +We also noticed that, sometimes, it is hard to clearly specify the boundary for LLM regulation. In the offered PaLM 2 versions, we see several undesired tendencies: + +- PaLM 2 refuses many roleplay questions, even if the users asked it to emulate a Linux terminal or a programming language interpreter. +- Sometimes PaLM 2 refuses to answer easy and non-controversial factual questions. + +Several examples are shown below: + + + +

Figure 2: Example questions that PaLM 2 refuses to answer.

+ +**Limited multilingual abilities** + +We do not see strong multilingual abilities from PaLM 2 with the currently offered public API chat-bison@001 at Google Vertex API. PaLM 2 tends to not answer non-English questions, including questions written in popular languages such as Chinese, Spanish, and Hebrew. We were unable to reproduce several multilingual examples demonstrated in the PaLM 2 technical report using the current PaLM 2 versions. We are waiting for Google to gradually release the latest version of PaLM 2. + +We also calculate the Elo ratings of all models when only considering English and only considering non-English conversations, respectively, illustrated in Figure 3. The results confirm the observations – on the non-English leaderboard, PaLM 2 ranks 16th. + + +

Figure 3: The English-only and non-English leaderboards.

+ +**PaLM 2's reasoning ability is unsatisfactory** + +We also observe that the offered PaLM 2 version does not demonstrate strong reasoning capabilities. On one hand, it seems to detect whether the question is in plain text, and tends to refuse many questions not in plain text, such as those involving programming languages, debugging, and code interpretation. On the other hand, we see PaLM 2 didn’t perform well on some entry-level reasoning tasks when compared against other chatbots. See several examples in Figure 4. + + + +

Figure 4: Examples where PaLM 2 fails on simple reasoning tasks.

+ +**Elo ratings after removing non-English and refusal conversations** + +We remove all non-English conversations and all conversations for which PaLM 2 didn’t provide an answer and calculate the Elo ratings of each model with the filtered data. This rating represents a hypothetical upper bound of PaLM 2's Elo in the Arena. See Figure 5 below. + + +

Figure 5: The leaderboard after removing PaLM 2's non-English and refusal conversations.

+ +### Smaller Models Are Competitive + +We observe several smaller models, including vicuna-7B and mpt-7b-chat, have achieved high ratings on the leaderboard. These smaller models perform favorably when compared against larger models with doubled parameters. + +We speculate that high-quality pre-training and fine-tuning datasets are more critical than model size. However, it is possible that larger models would still perform better with more complex reasoning tasks or answering more subtle questions (e.g., Trivia). +Hence, curating high-quality datasets in both pretraining and finetuning stages seems to be a key approach to reducing model sizes while keeping model quality high. + +### Claude-v1 and Claude-instant-v1 + +Claude-instant-v1 is a low-cost, faster alternative to Claude-v1 offered by Anthropic. If benchmarked in the wild in the arena, we observe that Claude-instant is close to GPT-3.5-turbo (1153 vs. 1143). The rating gap between Claude and Claude-instant seems smaller than that between GPT-4 and GPT-3.5-turbo. Claude-instant has a context length of 9K, is charged at a price of 0.00163/1K prompt token and 0.00551/1K completion token, compared to its OpenAI opponent product – GPT-3.5-turbo – with a context length of 4K and a uniform price of 0.002/1K token (regardless of prompt or completion). + +### Limitations of the “In-the-wild” Evaluation + +However, we want to point out a few facts about the current chatbot Arena and leaderboard. The current Arena is designed to benchmark LLM-based chatbots **"in the wild"**. That means, the voting data provided by our Arena users and the prompts-answers generated during the voting process reflect how the chatbots perform in normal human-chatbot interactions. This might not align with many benchmarking results in the LLM research literature, which tends to characterize long-tail abilities like zero-shot, complex reasoning, etc. 
Hence, the current chatbot arena has limitations in clearly reflecting the long-tail capability difference between chatbots. See the later section for more details and our plan. + +## Next Steps + +**Evaluating long-tail capability of LLMs** + +As pointed out by the community in [thread 1](https://twitter.com/tinkerteller/status/1656914923316998144?s=20) and [thread 2](https://twitter.com/LechMazur/status/1659915936919347202?s=20), the current Arena and leaderboard design has one major limitation: performing user studies on a small scale often cannot generate many of the hard or medium prompts that are necessary to tell the long-tail capability difference between LLMs. Moreover, for difficult questions, it is also very hard for regular Arena users to judge which LLM has generated a better answer -- some domain-specific questions are considered very difficult, even for 99% of non-expert humans. + +However, long-tail capability, such as complex reasoning, can be crucial for LLMs to complete real-world tasks. Building long-tail capability into LLMs is the holy-grail problem and is the most actively studied and invested-in area in LLM development. + +We listen carefully to the community feedback and are thinking about how to improve the leaderboard to overcome these limitations and capture the long-tail capability differences among LLMs. On top of the Chatbot Arena, we are actively designing a new tournament mechanism to examine the chatbots using presets of expert-designed questions and expert judges. We will have more updates soon. + +**More models** + +Since the launch of Arena, we have received many requests from the community to add more models. Due to the limited compute resources and bandwidth we have, we may not be able to serve all of them. We are working on improving the scalability of our serving systems. 
+In the meanwhile, you can still contribute support for [new models](https://github.com/lm-sys/FastChat/blob/main/docs/arena.md#how-to-add-a-new-model) or contact us if you can help us scale the system. diff --git a/_posts/2023-06-22-leaderboard.md b/_posts/2023-06-22-leaderboard.md new file mode 100644 index 0000000..5540a75 --- /dev/null +++ b/_posts/2023-06-22-leaderboard.md @@ -0,0 +1,353 @@ +--- +layout: distill +title: Chatbot Arena Leaderboard Updates (Week 8) +description: Introducing MT-Bench and Vicuna-33B +giscus_comments: true +date: 2023-06-22 +featured: false +thumbnail: assets/img/blog/leaderboard_week8/ability_breakdown.png +authors: + - name: Lianmin Zheng + url: https://lmzheng.net/ + affiliations: + name: UC Berkeley, LMSys + - name: Wei-Lin Chiang + url: https://infwinston.github.io/ + - name: Ying Sheng + url: https://sites.google.com/view/yingsheng/home + - name: Hao Zhang + url: https://cseweb.ucsd.edu/~haozhang/ +--- + +In this blog post, we share the latest update on Chatbot Arena leaderboard, which now includes more open models and three metrics: + +1. **Chatbot Arena Elo**, based on 42K anonymous votes from [Chatbot Arena](https://lmsys.org/blog/2023-05-03-arena/) using the Elo rating system. +2. **MT-Bench score**, based on a challenging multi-turn benchmark and GPT-4 grading, proposed and validated in our [Judging LLM-as-a-judge paper](https://arxiv.org/abs/2306.05685). +3. **MMLU**, a widely adopted [benchmark](https://arxiv.org/abs/2009.03300). + +Furthermore, we’re excited to introduce our **new series of Vicuna-v1.3 models**, ranging from 7B to 33B parameters, trained on an extended set of user-shared conversations. +Their weights are now [available](https://github.com/lm-sys/FastChat/tree/main#vicuna-weights). + +## Updated Leaderboard and New Models + + + + + +
+

Table 1. LLM Leaderboard (April 24 - June 19, 2023). The latest and detailed version here.

+
| Model | MT-bench (score) | Arena Elo Rating | MMLU | License |
| --- | --- | --- | --- | --- |
| GPT-4 | 8.99 | 1227 | 86.4 | Proprietary |
| GPT-3.5-turbo | 7.94 | 1130 | 70.0 | Proprietary |
| Claude-v1 | 7.90 | 1178 | 75.6 | Proprietary |
| Claude-instant-v1 | 7.85 | 1156 | 61.3 | Proprietary |
| Vicuna-33B | 7.12 | - | 59.2 | Non-commercial |
| WizardLM-30B | 7.01 | - | 58.7 | Non-commercial |
| Guanaco-33B | 6.53 | 1065 | 57.6 | Non-commercial |
| Tulu-30B | 6.43 | - | 58.1 | Non-commercial |
| Guanaco-65B | 6.41 | - | 62.1 | Non-commercial |
| OpenAssistant-LLaMA-30B | 6.41 | - | 56.0 | Non-commercial |
| PaLM-Chat-Bison-001 | 6.40 | 1038 | - | Proprietary |
| Vicuna-13B | 6.39 | 1061 | 52.1 | Non-commercial |
| MPT-30B-chat | 6.39 | - | 50.4 | CC-BY-NC-SA-4.0 |
| WizardLM-13B | 6.35 | 1048 | 52.3 | Non-commercial |
| Vicuna-7B | 6.00 | 1008 | 47.1 | Non-commercial |
| Baize-v2-13B | 5.75 | - | 48.9 | Non-commercial |
| Nous-Hermes-13B | 5.51 | - | 49.3 | Non-commercial |
| MPT-7B-Chat | 5.42 | 956 | 32.0 | CC-BY-NC-SA-4.0 |
| GPT4All-13B-Snoozy | 5.41 | 986 | 43.0 | Non-commercial |
| Koala-13B | 5.35 | 992 | 44.7 | Non-commercial |
| MPT-30B-Instruct | 5.22 | - | 47.8 | CC-BY-SA 3.0 |
| Falcon-40B-Instruct | 5.17 | - | 54.7 | Apache 2.0 |
| H2O-Oasst-OpenLLaMA-13B | 4.63 | - | 42.8 | Apache 2.0 |
| Alpaca-13B | 4.53 | 930 | 48.1 | Non-commercial |
| ChatGLM-6B | 4.50 | 905 | 36.1 | Non-commercial |
| OpenAssistant-Pythia-12B | 4.32 | 924 | 27.0 | Apache 2.0 |
| RWKV-4-Raven-14B | 3.98 | 950 | 25.6 | Apache 2.0 |
| Dolly-V2-12B | 3.28 | 850 | 25.7 | MIT |
| FastChat-T5-3B | 3.04 | 897 | 47.7 | Apache 2.0 |
| StableLM-Tuned-Alpha-7B | 2.75 | 871 | 24.4 | CC-BY-NC-SA-4.0 |
| LLaMA-13B | 2.61 | 826 | 47.0 | Non-commercial |
+ +Welcome to try the Chatbot Arena voting [demo](https://lmarena.ai). +Keep in mind that each benchmark has its limitations. Please consider the results as guiding references. See our discussion below for more technical details. + +## Evaluating Chatbots with MT-bench and Arena + +### Motivation + +While several benchmarks exist for evaluating Large Language Model's (LLM) performance, such as [MMLU](https://arxiv.org/abs/2009.03300), [HellaSwag](https://arxiv.org/abs/1905.07830), and [HumanEval](https://github.com/openai/human-eval), +we noticed that these benchmarks might fall short when assessing LLMs' human preferences. +Traditional benchmarks often test LLMs on close-ended questions with concise outputs (e.g., multiple choices), which do not reflect the typical use cases of LLM-based chat assistants. + +To fill this gap, in this leaderboard update, in addition to the Chatbot Arena Elo system, we add a new benchmark: MT-Bench. + +- [MT-bench](https://arxiv.org/abs/2306.05685) is a challenging multi-turn question set designed to evaluate the conversational and instruction-following ability of models. You can view sample questions and answers of MT-bench [here](https://huggingface.co/spaces/lmsys/mt-bench). +- [Chatbot Arena](https://lmarena.ai) is a crowd-sourced battle platform, where users ask chatbots any question and vote for their preferred answer. + +Both benchmarks are designed to use human preferences as the primary metric. + +### Why MT-Bench? + +MT-Bench is a carefully curated benchmark that includes 80 high-quality, multi-turn questions. +These questions are tailored to assess the conversation flow and instruction-following capabilities of models in multi-turn dialogues. +They include both common use cases and challenging instructions meant to distinguish between chatbots. +MT-Bench serves as a **quality-controlled complement** to our crowd-sourced based evaluation -- Chatbot Arena. 
+ +Through running the Chatbot Arena for 2 months and analyzing our users' prompts, we've identified 8 primary categories of user prompts: Writing, Roleplay, Extraction, Reasoning, Math, Coding, Knowledge I (STEM), and Knowledge II (humanities/social science). +We crafted 10 multi-turn questions per category, yielding 80 questions in total (160 turns, counting both turns of each question). We display some sample questions below in Figure 1. You can find more [here](https://huggingface.co/spaces/lmsys/mt-bench). + + 

Figure 1: Sample questions from the MT-Bench.

+ +### But Still, How to Grade Chatbots' Answers? + +Though we believe human preference is the gold standard, it is notoriously slow and expensive to collect. +In our first [Vicuna blogpost](https://lmsys.org/blog/2023-03-30-vicuna/), we explored an automated evaluation pipeline based on GPT-4. +This approach has since become popular and been adopted in several [concurrent and follow-up works](#related-work). + +In our latest paper, ["Judging LLM-as-a-judge"](https://arxiv.org/abs/2306.05685), we conducted a systematic study to answer how reliable those LLM judges are. +We provide a brief overview of conclusions here but recommend reading the paper for more details. + +We begin by acknowledging potential limitations of LLM-as-a-judge: + +- **Position bias** where LLM judges may favor the first answer in a pairwise comparison. +- **Verbosity bias** where LLM judges may favor lengthier answers, regardless of their quality. +- **Self-enhancement bias** where LLM judges may favor their own responses. +- **Limited reasoning ability** referring to LLM judges' possible shortcomings in grading math and reasoning questions. + +Our study then explores how few-shot, chain-of-thought, reference-based, and fine-tuned judges can help to mitigate these limitations. + +Upon implementing some of these solutions, we discovered that despite limitations, strong LLM judges like GPT-4 can align impressively well with both controlled and crowdsourced human preferences, achieving over 80% agreement. +This level of agreement is comparable to the agreement between two different human judges. +Therefore, if used carefully, LLM-as-a-judge can act as a _scalable_ and _explainable_ approximation of human preferences. + +We also found that single-answer grading based on GPT-4, without pairwise comparison, can rank models effectively and match human preferences well. +In Table 1, we present the MT-Bench as a column on the leaderboard based on single-answer grading with GPT-4. 
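To make the agreement figure and the position-bias concern above more concrete, here is a minimal sketch. It assumes agreement is simply the fraction of comparisons on which two judges pick the same winner, and it shows one common way to reduce position bias (asking the judge twice with the answers swapped and keeping only consistent verdicts). The function names and the `"model_a"` / `"model_b"` / `"tie"` labels are illustrative assumptions, not necessarily what the study uses.

```python
def agreement_rate(pairs):
    """Fraction of comparisons where two judges picked the same winner.

    `pairs` is a list of (verdict_1, verdict_2) tuples (hypothetical format).
    """
    if not pairs:
        raise ValueError("need at least one comparison")
    matches = sum(1 for first, second in pairs if first == second)
    return matches / len(pairs)


def debias_by_swapping(verdict_ab, verdict_ba):
    """Combine two verdicts from one judge to reduce position bias.

    `verdict_ab` is the verdict with answers shown in the order (A, B).
    `verdict_ba` is the verdict with the order swapped, expressed in the
    swapped frame, so it is mapped back before comparing. Inconsistent
    verdicts are conservatively treated as a tie (an assumption).
    """
    flip = {"model_a": "model_b", "model_b": "model_a", "tie": "tie"}
    mapped_back = flip[verdict_ba]
    return verdict_ab if verdict_ab == mapped_back else "tie"


# Toy (human verdict, LLM-judge verdict) pairs, made up for illustration.
votes = [
    ("model_a", "model_a"),
    ("model_b", "model_a"),
    ("tie", "tie"),
    ("model_b", "model_b"),
]
print(agreement_rate(votes))
print(debias_by_swapping("model_a", "model_b"))
```

A judge that always prefers whichever answer is shown first would return `"model_a"` in both orderings, which `debias_by_swapping` turns into a tie rather than a win.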
+ +## Results and Analysis + +### MT-Bench Effectively Distinguishes Among Chatbots + +Table 1 provides a detailed rundown of the MT-bench-enhanced leaderboard, where we conduct an exhaustive evaluation of 28 popular instruction-tuned models. +We observe a clear distinction among chatbots of varying abilities, with scores showing a high correlation with the Chatbot Arena Elo rating. +In particular, MT-Bench reveals noticeable performance gaps between GPT-4 and GPT-3.5/Claude, and between open and proprietary models. + +To delve deeper into the distinguishing factors among chatbots, we select a few representative chatbots and break down their performance per category in Figure 2. +GPT-4 shows superior performance in Coding and Reasoning compared to GPT-3.5/Claude, while Vicuna-13B lags significantly behind in several specific categories: Extraction, Coding, and Math. +This suggests there is still ample room for improvement for open-source models. + + +

Figure 2: The comparison of 6 representative LLMs regarding their abilities in 8 categories: Writing, Roleplay, Reasoning, Math, Coding, Extraction, STEM, Humanities.
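The correlation between MT-bench scores and Arena Elo ratings mentioned in this section can be checked directly from Table 1. The sketch below uses only the rows of Table 1 that report both numbers and computes a plain Pearson coefficient by hand; the choice of Pearson (rather than a rank correlation) and the helper name `pearson` are assumptions made for illustration.

```python
import math

# (MT-bench score, Arena Elo) for the Table 1 models that have both values.
table1 = {
    "GPT-4": (8.99, 1227), "GPT-3.5-turbo": (7.94, 1130),
    "Claude-v1": (7.90, 1178), "Claude-instant-v1": (7.85, 1156),
    "Guanaco-33B": (6.53, 1065), "PaLM-Chat-Bison-001": (6.40, 1038),
    "Vicuna-13B": (6.39, 1061), "WizardLM-13B": (6.35, 1048),
    "Vicuna-7B": (6.00, 1008), "MPT-7B-Chat": (5.42, 956),
    "GPT4All-13B-Snoozy": (5.41, 986), "Koala-13B": (5.35, 992),
    "Alpaca-13B": (4.53, 930), "ChatGLM-6B": (4.50, 905),
    "OpenAssistant-Pythia-12B": (4.32, 924), "RWKV-4-Raven-14B": (3.98, 950),
    "Dolly-V2-12B": (3.28, 850), "FastChat-T5-3B": (3.04, 897),
    "StableLM-Tuned-Alpha-7B": (2.75, 871), "LLaMA-13B": (2.61, 826),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


mt_scores = [mt for mt, _ in table1.values()]
elo_ratings = [elo for _, elo in table1.values()]
print(f"Pearson r between MT-bench and Arena Elo: {pearson(mt_scores, elo_ratings):.3f}")
```

Models without an Arena rating in Table 1 are left out, so the coefficient only describes the models that were evaluated on both benchmarks.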

+ +### Multi-turn Conversation Capabilities + +We next analyze the multi-turn scores of selected models, presented in Table 2. + +

Table 2. The breakdown of LLMs' MT-bench scores in the 1st and 2nd turn of a dialogue. Full score is 10.

+
| Model | Average 1st Turn Score | Average 2nd Turn Score | Score Difference |
| --- | --- | --- | --- |
| GPT-4 | 8.96 | 9.03 | 0.07 |
| Claude-v1 | 8.15 | 7.65 | -0.50 |
| GPT-3.5-turbo | 8.08 | 7.81 | -0.26 |
| Vicuna-33B | 7.46 | 6.79 | -0.67 |
| WizardLM-30B | 7.13 | 6.89 | -0.24 |
| WizardLM-13B | 7.12 | 5.59 | -1.53 |
| Guanaco-33B | 6.88 | 6.18 | -0.71 |
| Vicuna-13B | 6.81 | 5.96 | -0.85 |
| PaLM2-Chat-Bison | 6.71 | 6.09 | -0.63 |
| Vicuna-7B | 6.69 | 5.30 | -1.39 |
| Koala-13B | 6.08 | 4.63 | -1.45 |
| MPT-7B-Chat | 5.85 | 4.99 | -0.86 |
| Falcon-40B-instruct | 5.81 | 4.53 | -1.29 |
| H2OGPT-Oasst-Open-LLaMA-13B | 5.51 | 3.74 | -1.78 |
+ +The MT-bench incorporates challenging follow-up questions as part of its design. +For open models, the performance drops significantly from the first to the second turn (e.g., Vicuna-7B, WizardLM-13B), while strong proprietary models maintain consistency. +We also notice a considerable performance gap between LLaMA-based models and those with permissive licenses (MPT-7B, Falcon-40B, and instruction-tuned Open-LLaMA). + +### Explainability in LLM judges + +Another advantage of LLM judges is their ability to provide explainable evaluations. +Figure 3 presents an instance of GPT-4's judgment on an MT-bench question, with answers from alpaca-13b and gpt-3.5-turbo. +GPT-4 provides thorough and logical feedback to support its judgment. +Our [study](https://arxiv.org/abs/2306.05685) found that such reviews are beneficial in guiding humans to make better-informed decisions (refer to Section 4.2 for more details). +All the GPT-4 judgments can be found on our [demo site](https://huggingface.co/spaces/lmsys/mt-bench). + + + 

Figure 3: MT-bench provides more explainability in evaluating LLMs' human preferences.

+ +In conclusion, we have shown that MT-Bench effectively differentiates between chatbots of varying capabilities. +It's scalable, offers valuable insights with category breakdowns, and provides explainability for human judges to verify. +However, LLM judges should be used carefully. They can still make errors, especially when grading math/reasoning questions. + +## How to Evaluate New Models on MT-Bench? + +Evaluating models on MT-bench is simple and fast. Our script supports all Hugging Face models, and we’ve provided [detailed instructions](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge#mt-bench), +with which you can generate a model’s answers to the MT-bench questions and their GPT-4 judgments. You can also examine the answers and reviews on our Gradio browsing demo. + +## Next steps + +**Release of Conversations Data** + +We're in the process of releasing Chatbot Arena conversations data to the broader research community. Stay tuned for updates! + +**MT-bench-1K** + +MT-Bench currently consists of a concise set of 80 carefully curated questions, ensuring the highest quality. +We're actively expanding the question set to MT-Bench-1K by integrating high-quality prompts from the Chatbot Arena and generating new ones automatically using LLMs. +If you have any good ideas, we'd be delighted to hear from you. + +**Invitation for collaborations** + +We're engaging with various organizations to explore possibilities for standardizing the evaluation of human preferences for LLMs at scale. +If this interests you, please feel free to reach out to us. + +## Related work + +There has been a great amount of interesting work studying how to evaluate human preferences and how to use strong LLMs as judges for evaluation. 
+You are welcome to check them out and see more opinions on this topic: + +- [Judging LLM-as-a-judge with MT-Bench and Chatbot Arena](https://arxiv.org/abs/2306.05685) +- [Can foundation models label data like humans?](https://huggingface.co/blog/llm-leaderboard) +- [How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources](https://arxiv.org/abs/2306.04751) +- [The False Promise of Imitating Proprietary LLMs](https://arxiv.org/abs/2305.15717) +- [AlpacaEval and AlpacaFarm](https://github.com/tatsu-lab/alpaca_eval) +- [Large Language Models are not Fair Evaluators](https://arxiv.org/abs/2305.17926) + +## Links + +Below are readily available tools and code to run MT-bench and other metrics used in this blogpost: + +- The MT-bench uses [fastchat.llm_judge](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge), +- The [Arena Elo calculator](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing). +- The MMLU is based on [InstructEval](https://github.com/declare-lab/instruct-eval/blob/main/mmlu.py) and [Chain-of-Thought Hub](https://github.com/FranxYao/chain-of-thought-hub/tree/main/MMLU). + +If you wish to see more models on leaderboard, we invite you to [contribute to FastChat](https://github.com/lm-sys/FastChat/blob/main/docs/arena.md#how-to-add-a-new-model) or [contact us](mailto:lmsysorg@gmail.com) to provide us with API access. 
diff --git a/_posts/2023-07-20-dataset.md b/_posts/2023-07-20-dataset.md new file mode 100644 index 0000000..b404505 --- /dev/null +++ b/_posts/2023-07-20-dataset.md @@ -0,0 +1,117 @@ +--- +layout: distill +title: Chatbot Arena Conversation Dataset Release +giscus_comments: true +date: 2023-07-20 +featured: false +thumbnail: assets/img/blog/arena/cover.png +authors: + - name: Chatbot Arena Team + affiliations: + name: LMSYS Org +--- + +Since its launch three months ago, [Chatbot Arena](https://lmsys.org/blog/2023-05-03-arena/) has become a widely cited LLM evaluation platform that emphasizes large-scale, community-based, and interactive human evaluation. In that short time span, we collected around 53K votes from 19K unique IP addresses for 22 models. + +In this blog post, we are releasing an updated leaderboard with more models and two datasets for studying human preferences: + +- **33K crowd-sourced conversations** with human preference annotations from Chatbot Arena. ([link](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations)) +- **3K expert-level human annotations** from MT-bench. ([link](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments)) + +As estimated by this Llama 2 analysis blog [post](https://www.interconnects.ai/p/llama-2-from-meta?sd=pf), Meta spent about 8 million on human preference data for Llama 2, and that dataset is not available now. +Therefore, we think our datasets are highly valuable due to the expensive nature of obtaining human preferences and the limited availability of open, high-quality datasets. + +## Updated Leaderboard + +We are hosting the latest leaderboard at [lmsys/chatbot-arena-leaderboard](https://lmarena.ai/?leaderboard). Below is a screenshot. Since the last update, we added two 30B models: Vicuna-33B-v1.3 and MPT-30B-chat, both of which perform very well in the arena. +Two days ago, we also introduced Llama 2 and Claude 2 to the arena. 
The leaderboard will include them soon, once we get enough votes. +Please help us by casting your votes at our voting [website](https://lmarena.ai). + +Besides the slowly updated Arena Elo ratings, we also use MT-bench, a fast GPT-4 based automatic evaluation pipeline, to evaluate all new models, including Llama 2 (chat), Claude 2, WizardLM-13B-v1.1, XGen-7B-8K-Inst, and ChatGLM2-6B. +You are welcome to check out the interactive [lmsys/chatbot-arena-leaderboard](https://lmarena.ai/?leaderboard) to sort the models according to different metrics. +Some early evaluation results of Llama 2 can be found in our [tweets](https://twitter.com/lmsysorg/status/1681744327192752128). + + 

Figure 1. Chatbot Arena Leaderboard (see more)

+ +## Dataset 1: 33K Chatbot Arena Conversation Data + +Link: [lmsys/chatbot_arena_conversations](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations) + +This dataset contains 33K cleaned conversations with pairwise human preferences collected on Chatbot Arena from April to June 2023. +Each sample includes two model names, their full conversation text, the user vote, the anonymized user ID, the detected language tag, the OpenAI moderation API tag, the additional toxic tag, and the timestamp. + +To ensure the safe release of data, we have attempted to remove all conversations that contain personally identifiable information (PII). In addition, we have included the OpenAI moderation API output to flag inappropriate conversations. However, we have chosen not to remove all of these conversations so that researchers can study safety-related questions associated with LLM usage in the wild as well as the OpenAI moderation process. As an example, we included additional toxic tags that are generated by our own toxic tagger, which is trained by fine-tuning T5 and RoBERTa on manually labeled data. + +### Uniqueness and Potential Usage + +Compared to existing human preference datasets like [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) and [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1), this dataset: + +- Contains the outputs of 20 LLMs including stronger LLMs such as GPT-4 and Claude-v1. It also contains many failure cases of these state-of-the-art models. +- Contains unrestricted conversations from over 13K users in the wild. 
+ +We believe this data will help the AI research community answer important questions around topics like: + +- Characteristics of real-world user prompts +- Train better models with RLHF +- Improve and evaluate LLM evaluation methods +- Build model selection and request dispatching algorithms +- Study the design and application of inappropriate content filtering mechanisms + +### Disclaimers and Terms + +- This dataset includes offensive conversations. It is not intended for training dialogue agents without applying appropriate filtering measures. We are not responsible for any outputs of the models trained on this dataset. +- Statements or opinions made in this dataset do not reflect the views of researchers or institutions involved in the data collection effort. +- Users of this data are responsible for ensuring its appropriate use, which includes abiding by any applicable laws and regulations. +- Users of this data should adhere to the terms of use for a specific model when using its direct outputs. +- Please contact us if you find any issues with the dataset. + +### Visualization and Elo Rating Calculation + +This Colab [notebook](https://colab.research.google.com/drive/1J2Wf7sxc9SVmGnSX_lImhT246pxNVZip?usp=sharing) provides some visualizations and shows how to compute Elo ratings with the dataset. We pasted some figures here. + + +

Figure 2. Fraction of Model A Wins for All Non-tied A vs. B Battles.

+ +
+ + +

Figure 3. Battle Counts of Each Model Pair.
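As a rough illustration of the Elo rating calculation referenced in this section, the sketch below applies the standard online Elo update to a list of battles. The record layout (`model_a`, `model_b`, `winner`), the winner labels, and the constants (`k=4`, `scale=400`, `base=10`, initial rating 1000) are assumptions made for this example; the linked notebook is the reference for how the ratings are actually computed on the dataset.

```python
from collections import defaultdict


def compute_elo(battles, k=4.0, scale=400.0, base=10.0, init_rating=1000.0):
    """Online Elo update over (model_a, model_b, winner) tuples.

    `winner` is "model_a", "model_b", or anything else for a tie
    (hypothetical labels chosen for this sketch).
    """
    ratings = defaultdict(lambda: init_rating)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the logistic Elo model.
        expected_a = 1.0 / (1.0 + base ** ((rb - ra) / scale))
        if winner == "model_a":
            score_a = 1.0
        elif winner == "model_b":
            score_a = 0.0
        else:
            score_a = 0.5
        # Zero-sum update: what model_a gains, model_b loses.
        delta = k * (score_a - expected_a)
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return dict(ratings)


# Toy battles, made up for illustration only.
toy_battles = [
    ("gpt-4", "vicuna-13b", "model_a"),
    ("vicuna-13b", "koala-13b", "model_a"),
    ("koala-13b", "gpt-4", "tie"),
]
for name, rating in sorted(compute_elo(toy_battles).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Because the update is sequential, the result depends on the order of the battles; shuffling the battles and averaging over several runs is one way to reduce that sensitivity.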

+ +## Dataset 2: 3K MT-bench Human Annotations + +Link: [lmsys/mt_bench_human_judgments](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments) + +In addition to the crowd-sourced evaluation with Chatbot Arena, we also conducted a controlled human evaluation with MT-bench. + +This dataset contains 3.3K expert-level pairwise human preferences for model responses generated by 6 models in response to 80 MT-bench questions. +The 6 models are GPT-4, GPT-3.5, Claude-v1, Vicuna-13B, Alpaca-13B, and LLaMA-13B. The annotators are mostly graduate students with expertise in the topic areas of each of the questions. The details of data collection can be found in our [paper](https://arxiv.org/abs/2306.05685). + +### Agreement Calculation + +This Colab [notebook](https://colab.research.google.com/drive/1ctgygDRJhVGUJTQy8-bRZCl1WNcT8De6?usp=sharing) shows how to compute the agreement between humans and the GPT-4 judge with the dataset. Our results show that humans and the GPT-4 judge achieve over 80\% agreement, the same level as the agreement between humans. + +## Acknowledgement + +We thank the whole community for contributing to the arena dataset. +We also plan to gradually release more conversations in the future after a thorough review. + +## Citation + +``` +@misc{chiang2024chatbot, + title={Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference}, + author={Wei-Lin Chiang and Lianmin Zheng and Ying Sheng and Anastasios Nikolas Angelopoulos and Tianle Li and Dacheng Li and Hao Zhang and Banghua Zhu and Michael Jordan and Joseph E. Gonzalez and Ion Stoica}, + year={2024}, + eprint={2403.04132}, + archivePrefix={arXiv}, + primaryClass={cs.AI} +} +@inproceedings{zheng2023judging, + title={Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena}, + author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric Xing and Hao Zhang and Joseph E. 
Gonzalez and Ion Stoica}, + booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track}, + year={2023}, + url={https://openreview.net/forum?id=uccHPGDlao} +} +``` diff --git a/_posts/2024-03-01-policy.md b/_posts/2024-03-01-policy.md new file mode 100644 index 0000000..d7c9574 --- /dev/null +++ b/_posts/2024-03-01-policy.md @@ -0,0 +1,92 @@ +--- +layout: distill +title: Chatbot Arena Policy Update +description: Live and Community-Driven LLM Evaluation +giscus_comments: true +date: 2024-03-01 +featured: false +thumbnail: assets/img/blog/arena_policy/arena_logo_v0_4x3.png +authors: + - name: Chatbot Arena Team + affiliations: + name: LMSYS Org +--- + +## Our Mission + +Chatbot Arena ([lmarena.ai](https://lmarena.ai)) is an open-source project developed by members from [LMSYS](https://lmarena.ai/?about) and UC Berkeley SkyLab. Our mission is to advance LLM development and understanding through live, open, and community-driven evaluations. We maintain the open evaluation platform for any user to rate LLMs via pairwise comparisons under real-world use cases and publish [leaderboard](https://lmarena.ai/?leaderboard) periodically. + + + +## Our Progress + +Chatbot Arena was first launched in [May 2023](https://lmsys.org/blog/2023-05-03-arena/) and has emerged as a critical platform for live, community-driven LLM evaluation, attracting millions of participants and collecting over 800,000 votes. This extensive engagement has enabled the evaluation of more than 90 LLMs, including both commercial GPT-4, Gemini/Bard and open-weight Llama and Mistral models, significantly enhancing our understanding of their capabilities and limitations. + +Our periodic [leaderboard](https://lmarena.ai/?leaderboard) and blog post updates have become a valuable resource for the community, offering critical insights into model performance that guide the ongoing development of LLMs. 
Our commitment to open science is further demonstrated through the sharing of [user preference data](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations) and [one million user prompts](https://huggingface.co/datasets/lmsys/lmsys-chat-1m), supporting research and model improvement. + +We also collaborate with open-source and commercial model providers to bring their latest models to community for preview testing. We believe this initiative helps advancing the field and encourages user engagement to collect crucial votes for evaluating all the models in the Arena. Moreover, it provides an opportunity for the community to test and provide anonymized feedback before the models are officially released. + +The platform's infrastructure ([FastChat](https://github.com/lm-sys/FastChat)) and evaluation tools, available on GitHub, emphasize our dedication to transparency and community engagement in the evaluation process. This approach not only enhances the reliability of our findings but also fosters a collaborative environment for advancing LLMs. + +In our ongoing efforts, we feel obligated to establish policies that guarantee evaluation transparency and trustworthiness. Moreover, we actively involve the community in shaping any modifications to the evaluation process, reinforcing our commitment to openness and collaborative progress. + +## Our Policy + +
Last Updated: May 31, 2024
+ +**Open source**: The platform ([FastChat](https://github.com/lm-sys/FastChat)) including UI frontend, model serving backend, model evaluation and ranking pipelines are all open source and available on GitHub. This means that anyone can clone, audit or run another instance of Chatbot Arena to produce a similar leaderboard. + +**Transparent**: The evaluation process, including rating computation, identifying anomalous users, and LLM selection are all made publicly available so others can reproduce our analysis and fully understand the process of collecting data. Furthermore, we will involve the community in deciding any changes in the evaluation process. + +**Listing models on the leaderboard**: The public leaderboard will only include models that are accessible to other third parties. Specifically, it will only include models that are either (1) open weights or/and (2) publicly available through APIs (e.g., gpt-4-0613, gemini-pro-api), or (3) available as a service (e.g., Bard, GPT-4+browsing). In the remainder of this document we refer to these models as **publicly released models**. + +Once a publicly released model is listed on the leaderboard, the model will remain accessible at [lmarena.ai](https://lmarena.ai) for at least **two weeks** for the community to evaluate it. + +**Evaluating publicly released models**. Evaluating such a model consists of the following steps: + +1. Add the model to Arena for blind testing and let the community know it was added. +2. Accumulate enough votes until the model's rating stabilizes. +3. Once the model's rating stabilizes, we list the model on the public leaderboard. There is one exception: the model provider can reach out before its listing and ask for an one-day heads up. In this case, we will privately share the rating with the model provider and wait for an additional day before listing the model on the public leaderboard. 
+ +**Evaluating unreleased models**: We collaborate with open-source and commercial model providers to bring their unreleased models to community for preview testing. + +Model providers can test their unreleased models anonymously, meaning the models' names will be anonymized. A model is considered unreleased if its weights are neither open, nor available via a public API or service. Evaluating an unreleased model consists of the following steps: + +1. Add the model to Arena with an anonymous label. i.e., its identity will not be shown to users. +2. Keep it until we accumulate enough votes for its rating to stabilize or until the model provider withdraws it. +3. Once we accumulate enough votes, we will share the result privately with the model provider. These include the rating, as well as release samples of up to 20% of the votes. (See Sharing data with the model providers for further details). +4. Remove the model from Arena. + +If while we test an unreleased model, that model is publicly released, we immediately switch to the publicly released model evaluation process. + +To ensure the leaderboard accurately reflects model rankings, we rely on live comparisons between models. Hence, we may deprecate models from the leaderboard one month after they are no longer available online or publicly accessible. + +**Sharing data with the community**: We will periodically share data with the community. In particular, we will periodically share 20% of the arena vote data we have collected including the prompts, the answers, the identity of the model providing each answer (if the model is or has been on the leaderboard), and the votes. For the models we collected votes for but have never been on the leaderboard, we will still release data but we will label the model as "anonymous". + +**Sharing data with the model providers**: Upon request, we will offer early data access with model providers who wish to improve their models. 
However, this data will be a subset of data that we periodically share with the community. In particular, with a model provider, we will share the data that includes their model's answers. For battles, we may not reveal the opponent model and may use an "anonymous" label. This data will be later shared with the community during the periodic releases. If the model is not on the leaderboard at the time of sharing, the model’s answers will also be labeled as "anonymous". Before sharing the data, we will remove user PII (e.g., using Azure PII detection for text). + +## FAQ + +### Why another eval? + +Most LLM benchmarks are static, which makes them prone to contamination, as these LLMs are trained on most available data on the Internet. Chatbot Arena aims to alleviate this problem by providing live evaluation with a continuous stream of new prompts from real people. We also believe that the open nature of the platform will attract users that accurately reflect the broader set of LLM users and real use cases. + +### What model to evaluate? Why not all? + +We will continuously add new models and retire old ones. It is not feasible to add every possible model due to the cost and the scalability of our evaluation process, i.e., it might take too much time to accumulate enough votes to accurately rate each model. Today, the decision to add new models is rather ad-hoc: we add models based on the community’s perceived interest. We intend to formalize this process in the near future. + +### Why should the community trust our eval? + +We seek to provide transparency by open-sourcing all tools as well as the platform we are using. We invite the community to use our platform and tools to statistically reproduce our results. + +### Why do you only share 20% of data, not all? + +Arena data is used for LLM benchmarking purposes. We periodically share data to mitigate the potential risk of overfitting or benchmark leakage. We will actively review this policy based on the community's feedback. 
+ +### Who will fund this effort? Any conflicts of interest? + +Chatbot Arena is funded only by gifts, in the form of money, cloud credits, or API credits. The gifts have no strings attached. + +## Any feedback? + +Feel free to send us an email or leave feedback on [GitHub](https://github.com/lm-sys/FastChat/issues)! diff --git a/assets/img/blog/arena/battle_counts.png b/assets/img/blog/arena/battle_counts.png new file mode 100644 index 0000000..ffdc28c Binary files /dev/null and b/assets/img/blog/arena/battle_counts.png differ diff --git a/assets/img/blog/arena/chat_demo.png b/assets/img/blog/arena/chat_demo.png new file mode 100644 index 0000000..1f38f36 Binary files /dev/null and b/assets/img/blog/arena/chat_demo.png differ diff --git a/assets/img/blog/arena/cover.png b/assets/img/blog/arena/cover.png new file mode 100644 index 0000000..7bfc45e Binary files /dev/null and b/assets/img/blog/arena/cover.png differ diff --git a/assets/img/blog/arena/lang_counts.png b/assets/img/blog/arena/lang_counts.png new file mode 100644 index 0000000..0b7955c Binary files /dev/null and b/assets/img/blog/arena/lang_counts.png differ diff --git a/assets/img/blog/arena/predicted_win_fraction.png b/assets/img/blog/arena/predicted_win_fraction.png new file mode 100644 index 0000000..f835155 Binary files /dev/null and b/assets/img/blog/arena/predicted_win_fraction.png differ diff --git a/assets/img/blog/arena/win_fraction.png b/assets/img/blog/arena/win_fraction.png new file mode 100644 index 0000000..9a57cc0 Binary files /dev/null and b/assets/img/blog/arena/win_fraction.png differ diff --git a/assets/img/blog/arena_policy/arena_logo_v0_4x3.png b/assets/img/blog/arena_policy/arena_logo_v0_4x3.png new file mode 100644 index 0000000..e177a39 Binary files /dev/null and b/assets/img/blog/arena_policy/arena_logo_v0_4x3.png differ diff --git a/assets/img/blog/arena_policy/winrate_heatmap.png b/assets/img/blog/arena_policy/winrate_heatmap.png new file mode 100644 index 0000000..2e8d172 Binary files 
/dev/null and b/assets/img/blog/arena_policy/winrate_heatmap.png differ diff --git a/assets/img/blog/leaderboard_week12/battle_count.png b/assets/img/blog/leaderboard_week12/battle_count.png new file mode 100644 index 0000000..8c6cd1b Binary files /dev/null and b/assets/img/blog/leaderboard_week12/battle_count.png differ diff --git a/assets/img/blog/leaderboard_week12/leaderboard.png b/assets/img/blog/leaderboard_week12/leaderboard.png new file mode 100644 index 0000000..684b0d3 Binary files /dev/null and b/assets/img/blog/leaderboard_week12/leaderboard.png differ diff --git a/assets/img/blog/leaderboard_week12/winrate.png b/assets/img/blog/leaderboard_week12/winrate.png new file mode 100644 index 0000000..80b2670 Binary files /dev/null and b/assets/img/blog/leaderboard_week12/winrate.png differ diff --git a/assets/img/blog/leaderboard_week2/claude_vs_gpt4.png b/assets/img/blog/leaderboard_week2/claude_vs_gpt4.png new file mode 100644 index 0000000..b2dcb1e Binary files /dev/null and b/assets/img/blog/leaderboard_week2/claude_vs_gpt4.png differ diff --git a/assets/img/blog/leaderboard_week2/claude_vs_gpt4_fail.png b/assets/img/blog/leaderboard_week2/claude_vs_gpt4_fail.png new file mode 100644 index 0000000..dd15571 Binary files /dev/null and b/assets/img/blog/leaderboard_week2/claude_vs_gpt4_fail.png differ diff --git a/assets/img/blog/leaderboard_week2/english_vs_non_english.png b/assets/img/blog/leaderboard_week2/english_vs_non_english.png new file mode 100644 index 0000000..7925970 Binary files /dev/null and b/assets/img/blog/leaderboard_week2/english_vs_non_english.png differ diff --git a/assets/img/blog/leaderboard_week2/leaderboard_cover.png b/assets/img/blog/leaderboard_week2/leaderboard_cover.png new file mode 100644 index 0000000..6ee2703 Binary files /dev/null and b/assets/img/blog/leaderboard_week2/leaderboard_cover.png differ diff --git a/assets/img/blog/leaderboard_week2/win_fraction_matrix.png 
b/assets/img/blog/leaderboard_week2/win_fraction_matrix.png new file mode 100644 index 0000000..ca1840a Binary files /dev/null and b/assets/img/blog/leaderboard_week2/win_fraction_matrix.png differ diff --git a/assets/img/blog/leaderboard_week4/PaLM2_reasoning_1.png b/assets/img/blog/leaderboard_week4/PaLM2_reasoning_1.png new file mode 100644 index 0000000..bd5ac90 Binary files /dev/null and b/assets/img/blog/leaderboard_week4/PaLM2_reasoning_1.png differ diff --git a/assets/img/blog/leaderboard_week4/PaLM2_reasoning_2.png b/assets/img/blog/leaderboard_week4/PaLM2_reasoning_2.png new file mode 100644 index 0000000..0dfe9ba Binary files /dev/null and b/assets/img/blog/leaderboard_week4/PaLM2_reasoning_2.png differ diff --git a/assets/img/blog/leaderboard_week4/PaLM2_refusal_1.png b/assets/img/blog/leaderboard_week4/PaLM2_refusal_1.png new file mode 100644 index 0000000..81eacf6 Binary files /dev/null and b/assets/img/blog/leaderboard_week4/PaLM2_refusal_1.png differ diff --git a/assets/img/blog/leaderboard_week4/PaLM2_refusal_2.png b/assets/img/blog/leaderboard_week4/PaLM2_refusal_2.png new file mode 100644 index 0000000..99912f8 Binary files /dev/null and b/assets/img/blog/leaderboard_week4/PaLM2_refusal_2.png differ diff --git a/assets/img/blog/leaderboard_week4/english_non_refusal_leaderboard.png b/assets/img/blog/leaderboard_week4/english_non_refusal_leaderboard.png new file mode 100644 index 0000000..3818413 Binary files /dev/null and b/assets/img/blog/leaderboard_week4/english_non_refusal_leaderboard.png differ diff --git a/assets/img/blog/leaderboard_week4/language_leaderboard.png b/assets/img/blog/leaderboard_week4/language_leaderboard.png new file mode 100644 index 0000000..8df70e0 Binary files /dev/null and b/assets/img/blog/leaderboard_week4/language_leaderboard.png differ diff --git a/assets/img/blog/leaderboard_week4/leaderboard_cover.png b/assets/img/blog/leaderboard_week4/leaderboard_cover.png new file mode 100644 index 0000000..46a4a13 Binary files 
/dev/null and b/assets/img/blog/leaderboard_week4/leaderboard_cover.png differ diff --git a/assets/img/blog/leaderboard_week4/win_fraction_matrix.png b/assets/img/blog/leaderboard_week4/win_fraction_matrix.png new file mode 100644 index 0000000..d63e3ba Binary files /dev/null and b/assets/img/blog/leaderboard_week4/win_fraction_matrix.png differ diff --git a/assets/img/blog/leaderboard_week8/ability_breakdown.png b/assets/img/blog/leaderboard_week8/ability_breakdown.png new file mode 100644 index 0000000..87c6a74 Binary files /dev/null and b/assets/img/blog/leaderboard_week8/ability_breakdown.png differ diff --git a/assets/img/blog/leaderboard_week8/explainability_sample.png b/assets/img/blog/leaderboard_week8/explainability_sample.png new file mode 100644 index 0000000..c8da069 Binary files /dev/null and b/assets/img/blog/leaderboard_week8/explainability_sample.png differ diff --git a/assets/img/blog/leaderboard_week8/leaderboard.png b/assets/img/blog/leaderboard_week8/leaderboard.png new file mode 100644 index 0000000..8e91370 Binary files /dev/null and b/assets/img/blog/leaderboard_week8/leaderboard.png differ diff --git a/assets/img/blog/leaderboard_week8/sample_question.png b/assets/img/blog/leaderboard_week8/sample_question.png new file mode 100644 index 0000000..bf5ef83 Binary files /dev/null and b/assets/img/blog/leaderboard_week8/sample_question.png differ