[policy edits] #20

Merged · 3 commits · Dec 16, 2024
13 changes: 11 additions & 2 deletions _posts/2024-03-01-policy.md
@@ -32,7 +32,7 @@ In our ongoing efforts, we feel obligated to establish policies that guarantee e

## Our Policy

-<div style="text-align: right">Last Updated: Nov 18, 2024</div>
+<div style="text-align: right">Last Updated: Dec 15, 2024</div>

**Open source**: The platform ([FastChat](https://github.com/lm-sys/FastChat)), including the UI frontend, model serving backend, and model evaluation and ranking pipelines, is fully open source and available on GitHub. This means anyone can clone, audit, or run another instance of Chatbot Arena to produce a similar leaderboard.

@@ -62,7 +62,16 @@ Third-party endpoints will be explicitly labeled as "third-party" on the leaderb

1. Add the model to Arena for blind testing and let the community know it was added.
2. Accumulate enough votes until the model's rating stabilizes.
-3. Once the model's rating stabilizes, we list the model on the public leaderboard. There is one exception: the model provider can reach out before its listing and ask for a one-day heads-up. In this case, we will privately share the rating with the model provider and wait for an additional day before listing the model on the public leaderboard.
+3. Once the model's rating stabilizes, we list the model on the public leaderboard.
+4. Testing timeline:

+- Public models will be tested until their leaderboard ranking converges: after 3,000 votes, or earlier if the confidence interval is narrow enough to distinguish the model from the surrounding models.
+- We will release model results to the community on the public leaderboard immediately once the ranking converges. There is one exception: the model provider can reach out before its listing and ask for a one-day heads-up. In this case, we will share the rating with the model provider and wait for an additional day before listing the model on the public leaderboard.
+- The sampling weight of a model is set to 5 until 3,000 votes are collected; after release, we lower the sampling weight to 1. Models may be retired after 3,000 votes if there are two more recent models in the same series (e.g. `gpt-4o-0513` and `gpt-4o-0806`), and/or if more than 3 providers offer models at the same or lower price that are strictly better (according to overall Arena Score); see the sketch below.
+- The top-10 models according to the overall Arena Score will be given a sampling weight of 3. This ensures the best community experience when visiting our site.
+- The best model from each of the top-10 providers according to the overall Arena Score will be kept at a sampling weight of 1 to ensure diversity of battles.

+This policy may be modified moving forward; visit this website for the most recent version.
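The timeline rules above are stated in prose; the following is a minimal sketch of how they could be encoded. All names here (`ModelStats`, `ranking_converged`, `sampling_weight`, `may_retire`) are hypothetical illustrations, not FastChat/LMSYS code; only the thresholds (3,000 votes, weights 5/3/1, the retirement conditions) are taken from the policy text.

```python
# Illustrative sketch of the testing-timeline rules. All names are
# hypothetical; only the thresholds come from the policy text above.
from dataclasses import dataclass

@dataclass
class ModelStats:
    name: str
    provider: str
    votes: int
    arena_score: float   # overall Arena Score
    ci_lower: float      # confidence interval on the score
    ci_upper: float
    released: bool       # already listed on the public leaderboard?

def ranking_converged(model: ModelStats, neighbors: list[ModelStats]) -> bool:
    """Converged after 3,000 votes, or earlier if the model's confidence
    interval no longer overlaps those of the surrounding models."""
    if model.votes >= 3000:
        return True
    return all(model.ci_lower > n.ci_upper or model.ci_upper < n.ci_lower
               for n in neighbors)

def sampling_weight(model: ModelStats, top10_models: set[str]) -> int:
    """Weight schedule: 5 while accumulating votes before release,
    3 for the current top-10 models, 1 otherwise after release (which
    also covers the best model from each of the top-10 providers)."""
    if not model.released and model.votes < 3000:
        return 5
    if model.name in top10_models:
        return 3
    return 1

def may_retire(model: ModelStats, newer_in_series: int,
               strictly_better_cheaper_providers: int) -> bool:
    """Retirement rule: eligible after 3,000 votes if two more recent
    models exist in the same series, and/or more than 3 providers offer
    a model at the same or lower price with a strictly better overall
    Arena Score."""
    return model.votes >= 3000 and (
        newer_in_series >= 2 or strictly_better_cheaper_providers > 3)
```

Under this reading, a newly added model battles at weight 5 until its ranking converges, then settles at weight 1 after release, or 3 while it sits in the top-10.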

**Evaluating unreleased models**: We collaborate with open-source and commercial model providers to bring their unreleased models to the community for preview testing.
