Small fix for PPE blog (#8)
* small fix

---------

Co-authored-by: Evan Frick <[email protected]>
efrick2002 and Evan Frick authored Oct 22, 2024
1 parent 2de4f35 commit e98f489
Showing 1 changed file with 5 additions and 3 deletions.
8 changes: 5 additions & 3 deletions _posts/2024-10-20-preference-proxy-evaluations.md
@@ -123,12 +123,14 @@ Turning to the right-hand side of [Figure 2](#figure2), the accuracy of the rewa

Overall, accuracy on the human preference dataset correlates more strongly with downstream performance than the correctness metrics do. This is because correctness and human preference do not necessarily align. Moreover, the information contained in Loss, Max score, and End score may not prove relevant in DPO, which is off-policy. Those employing RLHF algorithms with a higher risk of over-optimization may find these alternative measures helpful. However, when calculating correlation against style-controlled ELOs<sup>\*</sup>, we notice a slight decrease in correlations on the human preference dataset. Notably, the correctness preference measurements show no change, suggesting correctness preference may be a more robust measure of reward model preference quality, independent of response style. Style-controlled correlation heatmaps are shown in Appendix [Style Control](#style-control).

- <sup>*</sup> Style controlled ELOs are calculated as detailed in our previous blog, [*Does Style Matter?\*](/blog/2024/style-control/)
+ <sup>\*</sup> Style controlled ELOs are calculated as detailed in our previous blog, [_Does Style Matter?_](/blog/2024/style-control/)
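
For concreteness, here is a minimal sketch (not the exact evaluation code; the function names, inputs, and shapes are illustrative) of the two measurements discussed above. It assumes per-model benchmark scores and post-RLHF, optionally style-controlled, Elo ratings are available as parallel arrays, and that the human preference dataset provides pairwise labels:

```
import numpy as np
from scipy.stats import spearmanr

def elo_correlation(benchmark_scores, elo_ratings):
    # Spearman rank correlation between per-model reward benchmark scores
    # and downstream (optionally style-controlled) Elo ratings.
    rho, _ = spearmanr(benchmark_scores, elo_ratings)
    return rho

def preference_accuracy(rm_scores_a, rm_scores_b, human_prefers_a):
    # Fraction of response pairs where the reward model's higher-scored
    # response matches the human preference label.
    rm_prefers_a = np.asarray(rm_scores_a) > np.asarray(rm_scores_b)
    return float(np.mean(rm_prefers_a == np.asarray(human_prefers_a)))
```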

&nbsp;
<a id="figure4"></a>
<img src="/assets/img/blog/preference_proxy_evaluations/AggregationPlots.png" style="width:100%; height:auto" />

<p style="color:gray; text-align: center;"><sub>Figure 4: Spearman Correlation, Confidence Agreement, and Accuracy metrics: For each metric, we take the quantiles of category scores (Hard, Easy, Instruction Following, Coding, Math, and Similar). The Pearson Correlation is calculated relative to Post-RLHF Human Preference Elo ratings for each quantile. Notably, accuracy peaks at 0.80 correlation at low quantile aggregation.</sub></p>

&nbsp;

Additionally, we observe that measuring the lower-bound score may correlate more strongly with downstream RLHF performance than the average or upper-bound score. In [Figure 4](#figure4), we first re-scale each category's scores to mean 0 and SD 1, then vary the quantile of the aggregation strategy across human preference dataset categories (Hard Prompts, Easy Prompts, etc.). In this case, the 0 quantile is the minimum and the 1 quantile is the maximum. We find that for nearly every metric, decreasing the quantile increases correlation with downstream ELO. We posit this reflects the requirement that reward models be robust under all input distributions in order to mitigate reward hacking: any domain weakness in a reward model can be exploited by the LLM during training.
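
As a rough illustration of this aggregation (hypothetical code rather than the script behind Figure 4): each category column is standardized to mean 0 and SD 1, the q-quantile is taken across categories for each model, and the result is correlated with downstream Elo.

```
import numpy as np

def quantile_aggregated_correlation(scores, elo, q):
    # scores: (num_models, num_categories) array of per-category benchmark scores.
    # elo:    (num_models,) array of post-RLHF Elo ratings.
    # q:      aggregation quantile in [0, 1]; q=0 takes each model's worst
    #         standardized category, q=1 its best.
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)  # mean 0, SD 1 per category
    aggregated = np.quantile(z, q, axis=1)                    # aggregate across categories
    return np.corrcoef(aggregated, elo)[0, 1]                 # Pearson correlation with Elo
```

Sweeping q from 1 down to 0 corresponds to the trend described above: lower quantiles weight a reward model's weakest categories more heavily.
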
@@ -145,13 +147,13 @@ PPE is another step towards rigorous evaluations of reward models and LLM-Judges

```
@misc{frick2024evaluaterewardmodelsrlhf,
- title={How to Evaluate Reward Models for RLHF},
+ title={How to Evaluate Reward Models for RLHF},
author={Evan Frick and Tianle Li and Connor Chen and Wei-Lin Chiang and Anastasios N. Angelopoulos and Jiantao Jiao and Banghua Zhu and Joseph E. Gonzalez and Ion Stoica},
year={2024},
eprint={2410.14872},
archivePrefix={arXiv},
primaryClass={cs.LG},
- url={https://arxiv.org/abs/2410.14872},
+ url={https://arxiv.org/abs/2410.14872},
}
```
