[policy edits] #20

Merged · 3 commits · Dec 16, 2024
13 changes: 11 additions & 2 deletions _posts/2024-03-01-policy.md
@@ -32,7 +32,7 @@ In our ongoing efforts, we feel obligated to establish policies that guarantee e

## Our Policy

-<div style="text-align: right">Last Updated: Nov 18, 2024</div>
+<div style="text-align: right">Last Updated: Dec 15, 2024</div>

**Open source**: The platform ([FastChat](https://github.com/lm-sys/FastChat)), including the UI frontend, model serving backend, and model evaluation and ranking pipelines, is fully open source and available on GitHub. This means anyone can clone, audit, or run another instance of Chatbot Arena to produce a similar leaderboard.

@@ -62,7 +62,16 @@ Third-party endpoints will be explicitly labeled as "third-party" on the leaderb

1. Add the model to Arena for blind testing and let the community know it was added.
2. Accumulate enough votes until the model's rating stabilizes.
-3. Once the model's rating stabilizes, we list the model on the public leaderboard. There is one exception: the model provider can reach out before its listing and ask for a one-day heads-up. In this case, we will privately share the rating with the model provider and wait for an additional day before listing the model on the public leaderboard.
+3. Once the model's rating stabilizes, we list the model on the public leaderboard.
+4. Testing timeline:

+- Public models will be tested until their leaderboard ranking converges: after 3,000 votes, or earlier if the confidence interval is narrow enough to distinguish the model from the surrounding models.
+- We will release model results to the community on the public leaderboard immediately once the ranking converges. There is one exception: the model provider can reach out before its listing and ask for a one-day heads-up. In this case, we will share the rating with the model provider and wait for an additional day before listing the model on the public leaderboard.
+- The sampling weight of a model is set to 5 until 3,000 votes are collected; after release, we lower the sampling weight to 1. Models may be retired after 3,000 votes if there are two more recent models in the same series (e.g. `gpt-4o-0513` and `gpt-4o-0806`), and/or if more than 3 providers offer models at the same or lower price that are strictly better (according to overall Arena Score); see the sketch below.
+- The top-10 models according to the overall Arena Score will be given a sampling weight of 3. This ensures the best community experience when visiting our site.
+- The best model from each of the top-10 providers according to the overall Arena Score will be kept at a sampling weight of 1 to ensure diversity of battles.

+This policy may be modified moving forward; visit this website for the most recent version.
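The timeline rules above are stated in prose; the following is a minimal sketch of how they could be encoded. All names here (`ModelStats`, `ranking_converged`, `sampling_weight`, `may_retire`) are hypothetical illustrations, not FastChat/LMSYS code; only the thresholds (3,000 votes, weights 5/3/1, the retirement conditions) are taken from the policy text.

```python
# Illustrative sketch of the testing-timeline rules. All names are
# hypothetical; only the thresholds come from the policy text above.
from dataclasses import dataclass

@dataclass
class ModelStats:
    name: str
    provider: str
    votes: int
    arena_score: float   # overall Arena Score
    ci_lower: float      # confidence interval on the score
    ci_upper: float
    released: bool       # already listed on the public leaderboard?

def ranking_converged(model: ModelStats, neighbors: list[ModelStats]) -> bool:
    """Converged after 3,000 votes, or earlier if the model's confidence
    interval no longer overlaps those of the surrounding models."""
    if model.votes >= 3000:
        return True
    return all(model.ci_lower > n.ci_upper or model.ci_upper < n.ci_lower
               for n in neighbors)

def sampling_weight(model: ModelStats, top10_models: set[str]) -> int:
    """Weight schedule: 5 while accumulating votes before release,
    3 for the current top-10 models, 1 otherwise after release (which
    also covers the best model from each of the top-10 providers)."""
    if not model.released and model.votes < 3000:
        return 5
    if model.name in top10_models:
        return 3
    return 1

def may_retire(model: ModelStats, newer_in_series: int,
               strictly_better_cheaper_providers: int) -> bool:
    """Retirement rule: eligible after 3,000 votes if two more recent
    models exist in the same series, and/or more than 3 providers offer
    a model at the same or lower price with a strictly better overall
    Arena Score."""
    return model.votes >= 3000 and (
        newer_in_series >= 2 or strictly_better_cheaper_providers > 3)
```

Under this reading, a newly added model battles at weight 5 until its ranking converges, then settles at weight 1 after release, or 3 while it sits in the top-10.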

**Evaluating unreleased models**: We collaborate with open-source and commercial model providers to bring their unreleased models to the community for preview testing.
