Model-free offline RL methods implement pessimism mainly by constraining the policy or the value function (i.e., restricting the policy to the in-support region). In model-based methods, by contrast, the dynamics model itself may generalize in ways that are not fully understood yet often fairly accurate, so reasonably correct predictions may be available even in out-of-support regions, allowing a better policy to be learned. However, these dynamics ...
The core question is therefore: can we design an offline RL algorithm whose generalization goes beyond the provided dataset? To address this, the paper first argues that model-based RL is a natural choice for achieving such generalization, because: ① model-based RL can effectively receive more supervision, since the model is trained on every transition, even in sparse-reward settings; ② the model is trained via supervised learning, which, compared with model-...
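The claim that every transition supplies supervision can be made concrete with a short sketch. This is a minimal illustration, not code from the paper: it assumes a PyTorch MLP dynamics model and a placeholder dataset of (s, a, s') tuples, and shows that the regression loss is defined per transition, independent of reward sparsity.

```python
# Minimal sketch (not the paper's code): every (s, a, s') transition in the
# offline dataset gives the dynamics model a supervised regression target,
# regardless of how sparse the rewards are. Network sizes and data here are
# placeholders.
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),           # predicts next state (or a delta)
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Synthetic stand-in for an offline dataset of transitions.
state_dim, action_dim, n = 17, 6, 10_000
states = torch.randn(n, state_dim)
actions = torch.randn(n, action_dim)
next_states = torch.randn(n, state_dim)

model = DynamicsModel(state_dim, action_dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):
    perm = torch.randperm(n)
    for i in range(0, n, 256):
        idx = perm[i:i + 256]
        pred = model(states[idx], actions[idx])
        loss = ((pred - next_states[idx]) ** 2).mean()  # plain supervised MSE per transition
        opt.zero_grad()
        loss.backward()
        opt.step()
```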
Key points: this paper takes a model-based approach to offline RL, in two steps. First, learn a pessimistic MDP (P-MDP) from the offline data; second, learn a near-optimal policy in that P-MDP. The properties of the P-MDP guarantee that the policy's performance in the P-MDP is a lower bound on its performance in the real environment. Concretely, since the dataset cannot cover the entire state-action space, ...
offline RL has been confined almost exclusively to model-free RL approaches. In this work, we present MOReL, an algorithmic framework for model-based offline RL. This framework consists of two steps: (a) learning a pessimistic MDP (P-MDP) using the offline dataset; and (b) learning a ...
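A minimal sketch of the P-MDP construction described above, assuming an ensemble of learned dynamics models; the disagreement threshold, penalty value, and reward function are illustrative placeholders rather than the authors' settings. When ensemble members disagree too much on (s, a), the pair is treated as unknown and the episode is sent to an absorbing HALT state with a large negative reward.

```python
# Minimal sketch of the P-MDP idea (not the authors' implementation): if the
# ensemble of learned dynamics models disagrees too much on (s, a), treat the
# pair as "unknown" and transition to an absorbing HALT state with a large
# negative reward; otherwise follow one model's prediction.
import numpy as np

class PessimisticMDP:
    def __init__(self, models, reward_fn, threshold=1.0, halt_penalty=-100.0):
        self.models = models            # list of callables: (s, a) -> predicted next state
        self.reward_fn = reward_fn      # callable: (s, a) -> reward, learned or known
        self.threshold = threshold      # disagreement threshold (illustrative value)
        self.halt_penalty = halt_penalty
        self.halted = False

    def step(self, state, action):
        if self.halted:                 # HALT is absorbing
            return state, self.halt_penalty, True
        preds = np.stack([m(state, action) for m in self.models])
        # Maximum pairwise disagreement across ensemble members.
        disagreement = np.max(np.linalg.norm(preds[:, None] - preds[None, :], axis=-1))
        if disagreement > self.threshold:   # unknown state-action pair
            self.halted = True
            return state, self.halt_penalty, True
        next_state = preds[0]
        return next_state, self.reward_fn(state, action), False
```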
The authors propose Model-based Offline Policy Optimization (MOPO), a model-based approach to offline RL in which a soft reward penalty is added to account for uncertainty in the environment dynamics (applying them with rewards artificially penalized by the uncertainty of the dynamics). This amounts to a trade-off between generalization and risk. The authors' ...
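A rough sketch of the uncertainty-penalized reward, under the assumption that each ensemble member predicts a Gaussian next-state distribution plus a reward; the specific uncertainty estimator and λ value are illustrative, not MOPO's released implementation.

```python
# Illustrative sketch of an uncertainty-penalized reward: the reward used for
# policy optimization inside the learned model is the predicted reward minus
# lambda times an uncertainty estimate u(s, a), here taken as the largest
# predicted standard deviation across an ensemble.
import numpy as np

def penalized_reward(state, action, ensemble, lam=1.0):
    """ensemble: list of models, each returning (next_state_mean, next_state_std, reward)."""
    means, stds, rewards = zip(*[m(state, action) for m in ensemble])
    u = max(np.linalg.norm(s, ord=np.inf) for s in stds)   # u(s, a): max std component
    r_hat = np.mean(rewards)
    return r_hat - lam * u                                 # penalized reward r_hat - lam * u(s, a)
```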
Existing model-based offline RL methods impose pessimistic constraints on the learned model within the support region of the offline data to avoid extrapolation errors, but these approaches limit the generalization potential of the policy in out-of-distribution (OOD) regions. The artificial fixed ...
Model-based offline RL, in contrast, aims to fit a dynamics model of the real environment from the dataset as well as possible, and then apply or extend online RL algorithms on top of the learned model. However, existing methods tend to ignore the temporal continuity of trajectories in the dataset, splitting them into independent state-transition fragments for training. A different way of thinking is to treat each trajectory as a single sample and exploit the expressive power of diffusion models to capture the distribution of decision trajectories. This ...
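To make the trajectory-as-sample idea concrete, here is a minimal sketch assuming fixed-horizon trajectories and a standard DDPM-style forward noising step; the dimensions, noise schedule, and the omitted denoiser are placeholders rather than any particular paper's design.

```python
# Minimal sketch: treat each whole trajectory as one training sample for a
# diffusion model, instead of splitting it into independent transitions.
import torch

horizon, state_dim, action_dim, batch = 32, 17, 6, 64
# One sample = an entire trajectory: concatenated (state, action) per timestep.
trajectories = torch.randn(batch, horizon, state_dim + action_dim)

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t):
    """Forward diffusion: noise whole trajectories at step t."""
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise, noise

t = torch.randint(0, T, (batch,))
x_t, eps = q_sample(trajectories, t)
# A denoiser would be trained to predict eps from (x_t, t); omitted here.
```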
Model-based offline reinforcement learning (RL) is a compelling approach that addresses the challenge of learning from limited, static data by generating imaginary trajectories using learned models. However, it falls short in solving long-horizon tasks due to high bias in value estimation from model...
However, most model-based planning algorithms are not designed for offline settings. Simply combining the ingredients of offline RL with existing methods either results in overly restrictive planning or leads to inferior performance. We propose a new lightweight model-based offline planning framework, ...
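As a rough illustration of what such lightweight model-based offline planning can look like (not the proposed framework itself), the sketch below samples candidate action sequences from a behavior prior, rolls them out in the learned dynamics model with an uncertainty penalty, and executes the first action of the best sequence; all callables and hyperparameters here are hypothetical.

```python
# Illustrative sketch of model-based offline planning: keep candidate actions
# close to the data via a behavior prior, penalize uncertain model rollouts,
# and return the first action of the highest-scoring sequence.
import numpy as np

def plan(state, behavior_prior, dynamics, reward_fn, uncertainty_fn,
         horizon=10, n_candidates=64, lam=1.0):
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        s, total = state, 0.0
        first_action = None
        for t in range(horizon):
            a = behavior_prior(s)                       # action sampled near the data
            if t == 0:
                first_action = a
            total += reward_fn(s, a) - lam * uncertainty_fn(s, a)
            s = dynamics(s, a)                          # learned-model rollout
        if total > best_return:
            best_return, best_first_action = total, first_action
    return best_first_action
```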