However, the generation of these responses occurs at the token level, in a sequential, autoregressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing the policy at the token level. ...
3. TDPO: Token-level Direct Preference Optimization

To address the problem of a marked drop in generation diversity after alignment, researchers proposed the TDPO (Token-level Direct Preference Optimization) algorithm. TDPO redefines the objective function of the whole alignment pipeline from a token-level perspective and, by rewriting the Bradley-Terry model in the form of an advantage function, allows the entire alignment process to be analyzed and optimized at the token level.
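To make the token-level view concrete, the sketch below shows how the two per-sequence quantities TDPO works with can be computed from per-token distributions: the policy-vs-reference log-probability margin, and the sequential forward KL divergence (the sum over response positions of the KL between the reference and the current policy). This is a minimal illustration under my own naming; `per_token_stats`, its arguments, and the masking convention are assumptions, not code from the released repository.

```python
import torch

def per_token_stats(policy_logits, ref_logits, labels, mask):
    """Illustrative sketch (not the repository's code).

    policy_logits, ref_logits: [batch, seq_len, vocab] logits, assumed already
                               shifted so that position t scores labels[:, t]
    labels:                    [batch, seq_len] response token ids
    mask:                      [batch, seq_len] 1 for response tokens, 0 for prompt/padding
    Returns the per-sequence policy-vs-reference log-prob margin and sequential KL.
    """
    policy_logp = policy_logits.log_softmax(-1)   # log pi_theta(. | x, y_<t)
    ref_logp = ref_logits.log_softmax(-1)         # log pi_ref(. | x, y_<t)

    # Per-token log-ratio of the generated token: log pi_theta(y_t|...) - log pi_ref(y_t|...)
    token_logp_policy = policy_logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    token_logp_ref = ref_logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    logps_margin = ((token_logp_policy - token_logp_ref) * mask).sum(-1)

    # Sequential forward KL: sum_t KL( pi_ref(.|x, y_<t) || pi_theta(.|x, y_<t) )
    token_kl = (ref_logp.exp() * (ref_logp - policy_logp)).sum(-1)
    seq_kl = (token_kl * mask).sum(-1)
    return logps_margin, seq_kl
```

Summing a per-position KL, rather than looking only at a single sequence-level ratio, is what lets the objective see where along the response the policy drifts away from the reference.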
In the released repository, the training entry point train.py begins by enabling TF32 matmul kernels and importing helper utilities from utils.py:

```python
import torch
torch.backends.cuda.matmul.allow_tf32 = True  # allow TF32 matmul on Ampere+ GPUs
import torch.nn as nn
import transformers
from utils import get_local_dir, get_local_run_dir  # further imports are truncated in this excerpt
```
From the early RLHF (Reinforcement Learning from Human Feedback) algorithm, through the more recent DPO (Direct Preference Optimization), to the latest TDPO (Token-level Direct Preference Optimization), alignment algorithms for large language models have made significant progress. RLHF aligns the model through human feedback combined with PPO (Proximal Policy Optimization) ...
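For comparison, the sequence-level DPO loss mentioned above can be written in a few lines. This is the standard DPO objective with illustrative variable names; each argument is the summed log-probability of a whole response under the policy or the frozen reference model.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * (margin_chosen - margin_rejected)).

    Each argument is a [batch] tensor of summed response log-probabilities.
    """
    chosen_margin = policy_chosen_logps - ref_chosen_logps        # log pi/pi_ref on preferred y_w
    rejected_margin = policy_rejected_logps - ref_rejected_logps  # log pi/pi_ref on dispreferred y_l
    logits = chosen_margin - rejected_margin
    return -F.logsigmoid(beta * logits).mean()
```

DPO collapses the RLHF reward-model-plus-PPO pipeline into this single classification-style loss on preference pairs; TDPO keeps the same pairwise form but adds token-level terms, as sketched after the paper links below.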
Paper title: Token-level Direct Preference Optimization (ICML 2024)
Paper link: https://arxiv.org/abs/2404.11999
Code link: https://github.com/Vance0124/Token-level-Direct-Preference-Optimization
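Combining the DPO-style margin with the sequential KL sketched earlier gives the TDPO objectives. The sketch below is my reconstruction from the description above, not the repository's code; in particular the alpha weight, the stop-gradient on the preferred response's KL in the TDPO2 variant, and the sign conventions are assumptions that should be checked against the paper and the released trainers.

```python
import torch
import torch.nn.functional as F

def tdpo_loss(chosen_logps_margin, rejected_logps_margin,
              chosen_seq_kl, rejected_seq_kl,
              beta=0.1, alpha=0.5, use_tdpo2=True):
    """Sketch of the TDPO1 / TDPO2 objectives (reconstruction, verify against the paper).

    *_logps_margin: [batch] log pi_theta(y|x) - log pi_ref(y|x) for chosen / rejected responses
    *_seq_kl:       [batch] sequential forward KL D_SeqKL(x, y; pi_ref || pi_theta)
    """
    margin_diff = chosen_logps_margin - rejected_logps_margin  # same term as in DPO

    if use_tdpo2:
        # TDPO2: down-weight the KL gap by alpha and stop gradients through the chosen KL
        kl_gap = rejected_seq_kl - chosen_seq_kl.detach()
        logits = margin_diff - alpha * kl_gap
    else:
        # TDPO1: full KL gap, no stop-gradient
        kl_gap = rejected_seq_kl - chosen_seq_kl
        logits = margin_diff - kl_gap

    return -F.logsigmoid(beta * logits).mean()
```

Relative to dpo_loss above, the only change is the extra sequential-KL gap term, which keeps KL growth on dispreferred responses in balance with that on preferred ones; this is how the method is described as trading off alignment quality against generation diversity.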