Offline model-based RL algorithms suffer from model OOD: the model overfits the limited dataset and produces extrapolation error at test time. Rather than constraining policy exploration to the in-support region, this paper directly studies the ability to make decisions in out-of-support regions; the proposed method is MAPLE. Because of the policy constraint, previous methods mostly consider how to exploit the offline dataset as much as possible and restrict the value function to the behavior policy's...
The core idea of offline RL is to combine online RL with conservatism or regularization. Typical model-free algorithms apply conservatism directly to the policy or the value function, so the states they learn on are confined to the offline dataset, which yields a conservative algorithm. In contrast, model-based algorithms, relying on uncertainty quantification, allow both states and actions to move somewhat beyond the offline dataset, and thus potentially generalize better. However, the uncertainty...
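To make the uncertainty-quantification idea concrete, here is a minimal sketch (not from any specific paper) of estimating model uncertainty as the disagreement of an ensemble of learned dynamics models; `models`, `state`, and `action` are assumed placeholders for an ensemble of next-state predictors and a query point.

```python
import numpy as np

def ensemble_disagreement(models, state, action):
    """Uncertainty score for (state, action): disagreement among an ensemble of
    learned dynamics models, each mapping (state, action) -> predicted next state."""
    preds = np.stack([m(state, action) for m in models])   # shape (K, state_dim)
    # Max deviation of any ensemble member from the ensemble mean, in L2 norm.
    return float(np.max(np.linalg.norm(preds - preds.mean(axis=0), axis=-1)))
```

Intuitively, the score is small where the dataset pins the dynamics down and large in out-of-support regions, which is exactly where a conservative algorithm should be cautious.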
(4) Whether model-based models can theoretically help improve model-free DP remains an open question, because DP, although it does not learn a dynamics model explicitly, effectively learns a non-parametric model. (5) Essentially, both DP and model-based RL solve prediction problems: the former predicts future returns, the latter predicts future states. Therefore, for offline RL with non-linear function approximation, model-based models and DP methods share a theoret...
The authors propose the Model-based Offline Policy Optimization (MOPO) algorithm, which applies a model-based approach to offline RL while adding a soft reward penalty that reflects the uncertainty of the dynamics ("applying them with rewards artificially penalized by the uncertainty of the dynamics"). This amounts to a trade-off between generalization and risk. The authors'...
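A minimal sketch of the penalized-reward idea, assuming an ensemble of probabilistic dynamics models whose predicted standard deviations at the current (s, a) are already available; `reward_hat`, `ensemble_stds`, and `lam` are placeholder names, and the max-std-norm heuristic is one possible uncertainty estimator rather than the only choice in the paper.

```python
import numpy as np

def penalized_reward(reward_hat, ensemble_stds, lam=1.0):
    """Soft reward penalty: r_tilde(s, a) = r_hat(s, a) - lam * u(s, a),
    where u(s, a) estimates the dynamics uncertainty at (s, a).
    ensemble_stds has shape (K, state_dim): each ensemble member's predicted std."""
    u = np.max(np.linalg.norm(ensemble_stds, axis=-1))  # largest per-member std norm
    return reward_hat - lam * u
```

A larger `lam` makes the agent stay closer to well-covered regions (less risk), while a smaller `lam` lets it exploit the model's generalization further from the data.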
Key points: this paper uses a model-based approach for offline RL, in two steps. First, a pessimistic MDP (P-MDP) is learned from the offline data; second, a near-optimal policy is learned inside this P-MDP. The properties of the P-MDP guarantee that this near-optimal policy's performance in the P-MDP is (approximately) a lower bound on its performance in the real environment. Concretely, because the dataset cannot cover the entire state-action space, ...
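A rough sketch of the pessimistic-MDP construction under these assumptions: ensemble disagreement (as in the earlier sketch) flags "unknown" state-action pairs, and the P-MDP sends them to an absorbing halt state with a large negative reward. The threshold, penalty value, and function names below are illustrative, not the paper's exact procedure.

```python
import numpy as np

HALT_PENALTY = -100.0  # illustrative value for the halting penalty (kappa in MOReL)

def p_mdp_step(models, state, action, reward_fn, threshold=0.1):
    """One step in the pessimistic MDP: if the dynamics ensemble disagrees too much
    at (state, action), the pair is treated as 'unknown' and the episode halts with
    a large negative reward; otherwise the learned model's mean prediction is used."""
    preds = np.stack([m(state, action) for m in models])
    disagreement = np.max(np.linalg.norm(preds - preds.mean(axis=0), axis=-1))
    if disagreement > threshold:
        return None, HALT_PENALTY, True                 # absorbing HALT state
    return preds.mean(axis=0), reward_fn(state, action), False
```

Because any policy that wanders outside the covered region is punished inside the P-MDP, optimizing return in the P-MDP pushes the policy toward behavior whose real-environment return is at least as good.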
Tags: offline reinforcement learning, model-based reinforcement learning, causal discovery. Model-based methods have recently been shown promising for offline reinforcement learning (RL), which aims at learning good policies from historical data without interacting with the enviro......
the performance in the real environment is approximately lower-bounded by the performance in the P-MDP. This enables it to serve as a good surrogate for purposes of policy evaluation and learning, and overcome common pitfalls of model-based RL like model exploitation. Theoretically, we show that...
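Schematically (the exact constants are in the paper), the guarantee quoted above takes the following form, where \(J_{M}(\pi)\) and \(J_{\hat{M}_p}(\pi)\) denote the discounted return of \(\pi\) in the real MDP and in the P-MDP; this rendering is a hedged paraphrase, not the paper's exact statement.

```latex
J_{M}(\pi) \;\ge\; J_{\hat{M}_p}(\pi)
  \;-\; \underbrace{\epsilon\bigl(\text{model error},\ \text{escape probability}\bigr)}_{
    \text{small when } \pi \text{ stays in the known region}}
```

The slack term shrinks as the learned model becomes accurate on the covered region and as the policy becomes unlikely to leave it, which is why the P-MDP return is a usable surrogate objective.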
- Offline RL Without Off-Policy Evaluation, Brandfonbrener et al., 2021. NeurIPS. Algorithm: One-step algorithm.
- Offline Reinforcement Learning with Soft Behavior Regularization, Xu et al., 2021. arXiv. Algorithm: SBAC.

Model-Based:
- MOReL: Model-Based Offline Reinforcement Learning, Kidambi et al., 2020. TWIML...
...cannot be estimated well; for details, see the note 论文理解【Offline RL】——【BCQ】Off-Policy Deep Reinforcement Learning without Exploration. In this paper, TT models offline trajectories directly: it generates both states and actions within the dataset distribution, so there is no need to worry about generating OOD data. To some extent, Trajectory Transformer can be viewed as a combination of model-based RL and (implicit) policy constra...
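To illustrate "modeling the trajectory directly", here is a minimal sketch of Trajectory-Transformer-style tokenization: every dimension of states, actions, and rewards is discretized into integer tokens and flattened into one sequence for an autoregressive model. The uniform binning over a fixed `[low, high]` range is an assumption for brevity; TT fits per-dimension bins from the offline dataset.

```python
import numpy as np

def discretize_trajectory(states, actions, rewards, num_bins=100, low=-1.0, high=1.0):
    """Tokenize a trajectory for sequence modeling: discretize every dimension of
    (s_t, a_t, r_t) into integer bins and flatten them into one token stream."""
    def to_tokens(x):
        x = np.clip(np.asarray(x, dtype=float), low, high)
        return np.floor((x - low) / (high - low) * (num_bins - 1)).astype(int)
    steps = [np.concatenate([to_tokens(s), to_tokens(a), to_tokens([r])])
             for s, a, r in zip(states, actions, rewards)]
    return np.concatenate(steps)   # token order: s_0, a_0, r_0, s_1, a_1, r_1, ...
```

Because states and actions are generated jointly from the same learned sequence distribution, sampled continuations stay close to the data distribution by construction.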
Past model-free offline RL methods can roughly be divided into two categories: RL-based and IL-based. RL-based methods mostly involve TD learning; they estimate value functions under various constraints to avoid the extrapolation-error problem (see the BCQ paper for details). With the value function as the intermediary, the advantage of this class of methods is that they can usually learn a policy that outperforms the best behavior policy in the dataset (they can stitch together suboptimal trajectories).
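To make the "value estimation under constraints" point concrete, here is a minimal BCQ-flavored sketch of a constrained TD target: the bootstrap maximizes only over actions proposed by a behavior-cloned generator, so the critic is never queried on OOD actions. `q_fn` and `candidate_actions` are stand-ins for the learned critic and the generator's proposals; real BCQ additionally uses a perturbation network and clipped twin critics.

```python
import numpy as np

def constrained_td_target(reward, next_state, q_fn, candidate_actions, gamma=0.99):
    """TD target restricted to in-distribution actions: the max is taken only over
    actions sampled from a generative model of the dataset's behavior policy,
    which avoids bootstrapping from out-of-distribution (extrapolated) Q-values."""
    q_values = np.array([q_fn(next_state, a) for a in candidate_actions])
    return reward + gamma * float(np.max(q_values))
```

Restricting the max in this way is what enables suboptimal-trajectory stitching without the extrapolation error that an unconstrained max over all actions would introduce.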