Aligning large language models with humans is of critical importance, and reinforcement learning from human feedback (RLHF) has become the key technical paradigm underpinning this goal. A typical RLHF pipeline includes a reward model that measures human preferences, Proximal Policy Optimization (PPO) that optimizes the policy model's outputs, and process supervision that improves step-by-step reasoning. However, the challenges of reward design, environment interaction, and agent training, compounded by the huge trial-and-error cost of large language models, pose significant barriers for researchers trying to put this alignment recipe into practice. Among these techniques, PPO plays the central role. This article takes a close look at the PPO algorithm in the MOSS-RLHF framework, examining how its training is launched and configured.
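At the core of the PPO step is the clipped surrogate objective: the policy is pushed toward responses with higher advantage, but the probability ratio against the rollout-time policy is clipped so that a single update cannot move the model too far. Below is a minimal sketch of that objective for token-level language-model training; the function and argument names are illustrative assumptions, not the actual code in train_ppo.py.

import torch

def ppo_policy_loss(new_logprobs: torch.Tensor,
                    old_logprobs: torch.Tensor,
                    advantages: torch.Tensor,
                    mask: torch.Tensor,
                    clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate loss, averaged over response tokens (mask == 1)."""
    # Probability ratio between the current policy and the rollout-time policy.
    ratio = torch.exp(new_logprobs - old_logprobs)
    # Unclipped and clipped surrogate terms; PPO takes the pessimistic minimum.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token_loss = -torch.min(unclipped, clipped)
    # Average only over valid response tokens.
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1)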
In the MOSS-RLHF repository, the relevant entry points are train_ppo.py and train_rm.py, together with the launch scripts train_ppo_zh.sh, train_ppo_en.sh, and train_rm.sh, plus the shared utils.py. The Chinese PPO run is started by train_ppo_zh.sh, which launches train_ppo.py through accelerate:

# SPDX-License-Identifier: Apache-2.0
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 \
accelerate launch \
  --config_file accelerate_config.yaml \
  train_ppo.py \
  --tokenizer_name_or_path models/moss-rlhf-reward-model-7B-zh \
  --policy_model_path models/sft_model \
  --critic_model_path models/moss-rlhf-reward-model-7B-zh/recover \
  --model_save_path outputs/models/ppo/ppo_model_zh \
  --data_path data/ppo_data \
  --seed 42 \
  --maxlen_prompt 2048 \
  --maxlen_res 512 \
  --lr 5e-7 \
  --critic_lr 1.5e-6 \
  ...
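These arguments map directly onto the RLHF recipe: the policy is initialized from the SFT checkpoint (policy_model_path), the critic from the Chinese reward-model checkpoint (critic_model_path), prompts are capped at 2048 tokens and responses at 512, and the policy and critic use separate learning rates (5e-7 and 1.5e-6). During rollouts, the reward-model score is typically combined with a per-token KL penalty against the frozen SFT reference so the policy does not drift too far from it. The following is a minimal sketch of that standard reward shaping, with assumed names and an assumed kl_coef; the exact formulation in train_ppo.py may differ.

import torch

def shape_rewards(rm_score: torch.Tensor,        # (batch,) scalar reward-model scores
                  policy_logprobs: torch.Tensor, # (batch, resp_len) current policy log-probs
                  ref_logprobs: torch.Tensor,    # (batch, resp_len) frozen SFT reference log-probs
                  mask: torch.Tensor,            # (batch, resp_len) 1 for real response tokens
                  kl_coef: float = 0.1) -> torch.Tensor:
    # Per-token KL penalty, approximated by the log-probability difference.
    kl = policy_logprobs - ref_logprobs
    rewards = -kl_coef * kl * mask
    # Add the scalar reward-model score at each sequence's last real token.
    last_idx = (mask.sum(dim=1).long() - 1).clamp(min=0)
    rewards[torch.arange(rewards.size(0)), last_idx] += rm_score
    return rewards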
train_ppo_en.sh launches the same trainer for the English models. The surviving excerpt is truncated in the source, but it shows the English reward-model checkpoint used for the critic and the same core hyperparameters:

  ...model-7B-en/recover \
  --critic_model_path models/moss-rlhf-reward-model-7B-en/recover \
  --model_save_path outputs/models/ppo/ppo_model_en \
  --data_path data/ppo_data \
  --seed 42 \
  --maxlen_prompt 2048 \
  --maxlen_res 512 \
  --lr 5e-7 \
  --critic_lr 1.5e-6 \
  --gamma 1...
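The --gamma flag is the discount factor applied when per-token rewards are turned into advantages for PPO, most commonly via generalized advantage estimation (GAE). Below is a minimal single-sequence sketch using assumed names (rewards, values, lam); it illustrates the technique rather than the repository's exact implementation.

import torch

def compute_gae(rewards: torch.Tensor,  # (T,) shaped per-token rewards
                values: torch.Tensor,   # (T,) critic value estimates
                gamma: float = 1.0,
                lam: float = 0.95):
    """Return (advantages, returns) for one response of length T."""
    T = rewards.size(0)
    advantages = torch.zeros(T)
    gae = 0.0
    # Walk backwards, accumulating discounted temporal-difference errors.
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    # The critic is regressed toward these returns; advantages drive the policy loss.
    returns = advantages + values
    return advantages, returns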