Commit

Update 2024-03-01-policy.md
aangelopoulos authored Dec 16, 2024
1 parent 5955005 commit 7f90129
Showing 1 changed file with 3 additions and 3 deletions.
6 changes: 3 additions & 3 deletions _posts/2024-03-01-policy.md
@@ -66,10 +66,10 @@ Third-party endpoints will be explicitly labeled as "third-party" on the leaderb
4. Testing timeline

  - Public models will be tested until their leaderboard ranking converges: after 3,000 votes, or earlier if the confidence interval is small enough to distinguish it from surrounding models.
- - We will release results from the model to the community on the public leaderboard immediately once the ranking converges. There is one exception: the model provider can reach out before its listing and ask for an one-day heads up. In this case, we will privately share the rating with the model provider and wait for an additional day before listing the model on the public leaderboard.
- - The sampling weight of a model is set to 5 until 3,000 votes are collected. Then after release we assign the sampling rate to 1. Models may be retired after 3,000 votes if there are two more recent models in the same family and/or if there are more than 3 providers that offer models cheaper or same price and strictly better (according to overall Arena Score) than this model.
+ - We will release model results to the community on the public leaderboard immediately once the ranking converges. There is one exception: the model provider can reach out before its listing and ask for an one-day heads up. In this case, we will share the rating with the model provider and wait for an additional day before listing the model on the public leaderboard.
+ - The sampling weight of a model is set to 5 until 3,000 votes are collected. Then after release we assign the sampling weight to 1. Models may be retired after 3,000 votes if there are two more recent models in the same series (e.g. `gpt-4o-0513` and `gpt-4o-0806`) and/or if there are more than 3 providers that offer models cheaper or same price and strictly better (according to overall Arena Score) than this model.
  - The top-10 models according to the overall Arena Score will be given a sampling weight of 3. This is to ensure the best community experience when visiting our site.
- - The top-10 models according to the overall Arena Score will be given a sampling rate of 1 to ensure diversity of battles.
+ - The best model from the top-10 providers according to the overall Arena Score will be given a sampling weight of 1 to ensure diversity of battles.
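
The sampling-weight and retirement rules in the updated wording can be read as a small decision procedure. The Python sketch below is one possible reading and not the leaderboard's actual implementation. The `Model` fields, the function names `sampling_weight` and `retirement_eligible`, the use of the vote count alone as a stand-in for "released", and the precedence between rules are all assumptions, since the policy text does not specify them.

```python
from dataclasses import dataclass

VOTE_THRESHOLD = 3000  # the "3,000 votes" figure from the policy text


@dataclass
class Model:
    name: str
    provider: str
    series: str          # e.g. "gpt-4o"
    votes: int
    arena_score: float   # overall Arena Score
    price: float         # hypothetical: one comparable price per model
    release_order: int   # hypothetical: larger means more recent in its series


def sampling_weight(model, models):
    """Return a weight under an assumed rule precedence (not stated in the policy)."""
    if model.votes < VOTE_THRESHOLD:
        return 5  # still collecting votes
    ranked = sorted(models, key=lambda m: m.arena_score, reverse=True)
    if model.name in {m.name for m in ranked[:10]}:
        return 3  # top-10 by overall Arena Score
    # All other released models, including the best model of each
    # top-10 provider, get weight 1 in this reading.
    return 1


def retirement_eligible(model, models):
    """True if the model *may* be retired; the policy leaves the decision open."""
    if model.votes < VOTE_THRESHOLD:
        return False
    newer_in_series = [
        m for m in models
        if m.series == model.series and m.release_order > model.release_order
    ]
    # Distinct providers offering a model that is no more expensive
    # and strictly better by overall Arena Score.
    better_providers = {
        m.provider for m in models
        if m.price <= model.price and m.arena_score > model.arena_score
    }
    return len(newer_in_series) >= 2 or len(better_providers) > 3
```

In this sketch a model with fewer than 3,000 votes always gets weight 5 and is never eligible for retirement. Whether a model that converges early drops to weight 1 before reaching 3,000 votes is not settled by the text, so that case is left out.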

This policy may be modified moving forward; visit this website for the most recent version.
