[policy edits]
aangelopoulos committed Dec 16, 2024
1 parent 2476a10 commit 5955005
Showing 1 changed file with 10 additions and 1 deletion.
11 changes: 10 additions & 1 deletion _posts/2024-03-01-policy.md
@@ -62,7 +62,16 @@ Third-party endpoints will be explicitly labeled as "third-party" on the leaderb

1. Add the model to Arena for blind testing and let the community know it was added.
2. Accumulate enough votes until the model's rating stabilizes.
-3. Once the model's rating stabilizes, we list the model on the public leaderboard. There is one exception: the model provider can reach out before its listing and ask for a one-day heads-up. In this case, we will privately share the rating with the model provider and wait for an additional day before listing the model on the public leaderboard.
+3. Once the model's rating stabilizes, we list the model on the public leaderboard.
+4. Testing timeline
+
+- Public models will be tested until their leaderboard ranking converges: after 3,000 votes, or earlier if the confidence interval is small enough to distinguish it from surrounding models.
+- We will release results from the model to the community on the public leaderboard immediately once the ranking converges. There is one exception: the model provider can reach out before its listing and ask for a one-day heads-up. In this case, we will privately share the rating with the model provider and wait for an additional day before listing the model on the public leaderboard.
+- The sampling weight of a model is set to 5 until 3,000 votes are collected. After release, we set the sampling rate to 1. Models may be retired after 3,000 votes if there are two more recent models in the same family and/or if more than 3 providers offer models that are the same price or cheaper and strictly better (according to overall Arena Score) than this model.
+- The top-10 models according to the overall Arena Score will be given a sampling weight of 3. This is to ensure the best community experience when visiting our site.
+- The top-10 models according to the overall Arena Score will be given a sampling rate of 1 to ensure diversity of battles.
+
+This policy may be modified moving forward; visit this website for the most recent version.

**Evaluating unreleased models**: We collaborate with open-source and commercial model providers to bring their unreleased models to the community for preview testing.
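
The "Testing timeline" rules added in this commit can be read as three small decision functions. The following is only a rough sketch of one possible reading, not the Arena implementation: every name in it (`ModelStatus`, `sampling_weight`, `may_be_retired`, `ranking_converged`) is hypothetical, "distinguishable from surrounding models" is assumed to mean non-overlapping confidence intervals, and because the two top-10 bullets overlap as written, the sketch follows the weight-3 bullet for top-10 models and uses 1 as the default for everything else.

```python
from dataclasses import dataclass

# Constants taken from the policy text in this diff.
VOTE_THRESHOLD = 3000
NEW_MODEL_WEIGHT = 5
TOP_10_WEIGHT = 3
DEFAULT_WEIGHT = 1


@dataclass
class ModelStatus:
    """Hypothetical record of a model's state; field names are illustrative."""
    name: str
    votes: int
    rank: int  # 1-based rank by overall Arena Score
    newer_models_in_family: int = 0
    # Providers offering a same-price-or-cheaper, strictly better model.
    dominating_providers: int = 0


def sampling_weight(m: ModelStatus) -> int:
    """Weight 5 until 3,000 votes, then 3 for the top 10, else 1."""
    if m.votes < VOTE_THRESHOLD:
        return NEW_MODEL_WEIGHT
    if m.rank <= 10:  # assumption: the weight-3 bullet governs the top 10
        return TOP_10_WEIGHT
    return DEFAULT_WEIGHT


def may_be_retired(m: ModelStatus) -> bool:
    """Eligible after 3,000 votes if either retirement condition holds."""
    if m.votes < VOTE_THRESHOLD:
        return False
    return m.newer_models_in_family >= 2 or m.dominating_providers > 3


def ranking_converged(votes, ci, neighbor_cis):
    """Converged at 3,000 votes, or earlier if the model's interval is
    separated from every neighbor's interval (assumed interpretation)."""
    if votes >= VOTE_THRESHOLD:
        return True
    low, high = ci
    # Without neighbors there is nothing to distinguish from; stay unconverged.
    return bool(neighbor_cis) and all(
        n_high < low or n_low > high for n_low, n_high in neighbor_cis
    )
```

For example, under these assumptions `sampling_weight(ModelStatus("m", votes=100, rank=1))` yields the new-model weight of 5, while the same model with 3,000 or more votes yields 3.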
