A paper from OpenAI, May 2023 ("Let's Verify Step by Step"). It trains Process-supervised Reward Models (PRMs), which score the correctness of each individual reasoning step (see Figure 2), and evaluates them on the more challenging MATH dataset. Conclusion: when used for best-of-N sampling, PRMs outperform ORMs (which score only the final answer as a whole; see Figure 3), and the gap widens as the number of candidate solutions grows. This suggests PRMs are more robust than ORMs and less easily fooled by solutions that look superficially correct, e.g., ones that reach the right answer through flawed reasoning.
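To make the best-of-N comparison concrete, here is a minimal Python sketch of PRM reranking, assuming a hypothetical `prm` callable that returns one correctness probability per reasoning step. The product aggregation (scoring a solution by the probability that every step is correct) matches the paper's described setup at a high level, but the names, interface, and `dummy_prm` heuristic are illustrative assumptions, not the authors' code.

```python
import math
from typing import Callable, List

# Hypothetical PRM interface: (problem, steps) -> one P(step correct) per step.
StepScorer = Callable[[str, List[str]], List[float]]

def best_of_n(problem: str, candidates: List[List[str]], prm: StepScorer) -> List[str]:
    """Pick the best of N sampled solutions using a PRM.

    A candidate's score is the product of its per-step correctness
    probabilities, i.e. an estimate of P(every step is correct).
    An ORM would instead emit a single scalar for the whole solution.
    """
    def score(steps: List[str]) -> float:
        return math.prod(prm(problem, steps))

    return max(candidates, key=score)

# Toy usage with a placeholder scorer (a real PRM would be a trained model):
if __name__ == "__main__":
    def dummy_prm(problem: str, steps: List[str]) -> List[float]:
        # Assumption: crude stand-in heuristic, only for demonstration.
        return [0.9 if "=" in s else 0.6 for s in steps]

    sols = [["x = 2 + 3", "x = 5"], ["add the numbers", "the answer is 5"]]
    print(best_of_n("What is 2 + 3?", sols, dummy_prm))
```

Scoring by the product means a single low-probability step drags down the whole solution, which is one intuition for why PRM reranking penalizes answers that are right for the wrong reasons, while an ORM sees only the final outcome.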