
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
Under review as a conference paper at ICLR 2026
Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu,
Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi.
Rewardbench: Evaluating reward models for language modeling, 2024. URL
https://arxiv.
org/abs/2403.13787.
Michel Ledoux and Michel Talagrand. Probability in banach spaces: Isoperimetry and processes.
1991. URL https://api.semanticscholar.org/CorpusID:118526268.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong
Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow
instructions with human feedback. Advances in neural information processing systems, 35:27730–
27744, 2022.
Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression:
Simple and scalable off-policy reinforcement learning, 2019. URL
https://arxiv.org/
abs/1910.00177.
Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational
space control. In Proceedings of the 24th international conference on Machine learning, pp.
745–750, 2007.
Bowen Qin, Duanyu Feng, and Xi Yang. Towards understanding the influence of reward margin on
preference model performance, 2024. URL https://arxiv.org/abs/2404.04932.
Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit Sikchi, Joey Hejna, Bradley Knox, Chelsea
Finn, and Scott Niekum. Scaling laws for reward model overoptimization in direct alignment
algorithms, 2024a. URL https://arxiv.org/abs/2406.02900.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea
Finn. Direct preference optimization: Your language model is secretly a reward model, 2024b.
URL https://arxiv.org/abs/2305.18290.
Robert E Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new
explanation for the effectiveness of voting methods. The annals of statistics, pp. 1651–1686, 1998.
Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to
algorithms. Cambridge university press, 2014.
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford,
Dario Amodei, and Paul Christiano. Learning to summarize from human feedback, 2022. URL
https://arxiv.org/abs/2009.01325.
Louis L Thurstone. A law of comparative judgment. In Scaling, pp. 81–92. Routledge, 2017.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay
Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation
and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Manya Wadhwa, Jifan Chen, Junyi Jessy Li, and Greg Durrett. Using natural language explanations
to rescale human judgments, 2024. URL https://arxiv.org/abs/2305.14770.
Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin,
Enyu Zhou, Chenyu Shi, Songyang Gao, Nuo Xu, Yuhao Zhou, Xiaoran Fan, Zhiheng Xi, Jun
Zhao, Xiao Wang, Tao Ji, Hang Yan, Lixing Shen, Zhan Chen, Tao Gui, Qi Zhang, Xipeng Qiu,
Xuanjing Huang, Zuxuan Wu, and Yu-Gang Jiang. Secrets of rlhf in large language models part ii:
Reward modeling, 2024a. URL https://arxiv.org/abs/2401.06080.
Shiqi Wang, Zhengze Zhang, Rui Zhao, Fei Tan, and Cam Tu Nguyen. Reward difference optimiza-
tion for sample reweighting in offline rlhf, 2024b. URL
https://arxiv.org/abs/2408.
09385.
Zhilin Wang, Yi Dong, Jiaqi Zeng, Virginia Adams, Makesh Narsimhan Sreedhar, Daniel Egert,
Olivier Delalleau, Jane Polak Scowcroft, Neel Kant, Aidan Swope, and Oleksii Kuchaiev. Helpsteer:
Multi-attribute helpfulness dataset for steerlm, 2023. URL
https://arxiv.org/abs/2311.
09528.
12