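There is some tension between helpfulness and harmlessness, but it is alleviated in larger models, which are even more robust to the ratio of helpful to harmless training data. Without needing any harmful examples, we show that OOD detection techniques can be used to reject more strange and harmful requests. Scaling, RLHF Robustness, and Iterated ‘Online’ Training: Reward model accuracy follows the model and dataset lo...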
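As a rough illustration of the idea of rejecting out-of-distribution (OOD) requests without using any harmful examples, the sketch below fits a Gaussian to embeddings of in-distribution requests only and flags inputs that lie unusually far from it. The Mahalanobis distance is one common OOD score, chosen here as an assumption and not necessarily the exact method the notes refer to; the function names, the random stand-in embeddings, and the 99th-percentile threshold are all hypothetical.

```python
import numpy as np

def fit_gaussian(embeddings):
    """Estimate the mean and a regularized inverse covariance of in-distribution embeddings."""
    mean = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])  # small ridge term so the inverse is stable
    return mean, np.linalg.inv(cov)

def mahalanobis_score(x, mean, inv_cov):
    """Distance of x from the in-distribution Gaussian; larger means more unusual."""
    diff = x - mean
    return float(np.sqrt(diff @ inv_cov @ diff))

def should_reject(x, mean, inv_cov, threshold):
    """Flag a request as OOD when its score exceeds the chosen threshold."""
    return mahalanobis_score(x, mean, inv_cov) > threshold

# Hypothetical stand-in for embeddings of ordinary (in-distribution) requests.
rng = np.random.default_rng(0)
in_dist = rng.normal(0.0, 1.0, size=(500, 8))

mean, inv_cov = fit_gaussian(in_dist)

# Assumed policy: reject anything beyond the 99th percentile of in-distribution scores.
scores = np.array([mahalanobis_score(e, mean, inv_cov) for e in in_dist])
threshold = float(np.quantile(scores, 0.99))

typical = np.zeros(8)       # close to the in-distribution mean
unusual = np.full(8, 6.0)   # far from anything seen when fitting
```

Note that only in-distribution data is used to fit the detector, so no harmful samples are required; in practice the embeddings would come from a model's internal representations instead of random draws.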
Claude is designed to be different, with a focus on being "helpful, honest, and harmless" while still managing to carry out all the tasks you'd expect of an AI assistant (for example, summarization, search, creative and collaborative writing, Q&A, and coding). ...