Based on this observation, the paper proposes an algorithm that replaces explicit uncertainty measures and argues that uncertainty estimation is unnecessary for offline RL. Contributions: ① proposes COMBO; ② theoretically proves that the Q-function learned by COMBO is a lower bound on the true Q-function, and that this bound is tighter than that of the model-free algorithm CQL, while COMBO requires no uncertainty quantification; ③ strong experimental results, even on image-based tasks ...
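As a sketch of the conservative Bellman update behind this lower bound (a reconstruction from the COMBO write-up; treat the exact notation here as an assumption): Q-values are pushed down on state-action pairs sampled from model rollouts ρ(s,a) and pushed up on pairs from the dataset D, on top of a standard Bellman regression over a mixture d_f of real and model-generated data:

$$\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\; \beta\Big(\mathbb{E}_{s,a\sim\rho}\big[Q(s,a)\big] - \mathbb{E}_{s,a\sim\mathcal{D}}\big[Q(s,a)\big]\Big) + \tfrac{1}{2}\,\mathbb{E}_{(s,a,s')\sim d_f}\Big[\big(Q(s,a) - \widehat{\mathcal{B}}^{\pi}\hat{Q}^{k}(s,a)\big)^{2}\Big]$$

Since the penalty only needs samples from the model and from the dataset, no explicit uncertainty estimate of the dynamics is involved, which is exactly the point made above.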
A key problem in offline RL is distribution shift. Model-free remedies include constraining the target policy to stay close to the behavior policy (a generic sketch of this idea follows below) and penalizing the Q-values of OOD state-action pairs. These approaches confine the policy strictly to the data manifold of the behavior policy, making it hard to reach particularly high performance. Later methods introduced ensemble models to estimate the uncertainty of the Q-function (for state-action pairs in the offline dataset as well as OOD state-action pairs ...
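A generic illustration of the first remedy, keeping the learned policy near the behavior policy (in the spirit of TD3+BC; the function names and the coefficient alpha are assumptions, not any specific paper's code):

```python
import torch

def constrained_actor_loss(critic, policy, states, dataset_actions, alpha=2.5):
    """Generic sketch: maximize the critic's value while staying near dataset actions."""
    policy_actions = policy(states)
    # Value of the policy's own actions under the learned critic.
    q_values = critic(states, policy_actions)
    # Penalty for deviating from the actions actually present in the offline dataset.
    behavior_penalty = ((policy_actions - dataset_actions) ** 2).mean()
    return -q_values.mean() + alpha * behavior_penalty
```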
The authors propose the Model-based Offline Policy Optimization (MOPO) algorithm, which approaches offline RL with a model-based method and adds a penalty term to the reward (a soft reward penalty) to capture the uncertainty of the environment dynamics ("applying them with rewards artificially penalized by the uncertainty of the dynamics"). This amounts to a trade-off between generalization and risk. The authors ...
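A minimal sketch of what such an uncertainty-penalized reward can look like (the ensemble-disagreement heuristic and all names here are illustrative assumptions, not code from the paper):

```python
import numpy as np

def penalized_reward(reward_pred, ensemble_next_state_stds, lam=1.0):
    """Sketch: subtract a dynamics-uncertainty penalty from the model's predicted reward.

    reward_pred: r_hat(s, a) predicted by the learned dynamics/reward model
    ensemble_next_state_stds: predicted next-state std per ensemble member, shape (E, state_dim)
    lam: penalty coefficient trading off generalization against risk
    """
    # One common heuristic: the largest predicted-std norm across ensemble members.
    uncertainty = max(np.linalg.norm(std) for std in ensemble_next_state_stds)
    return reward_pred - lam * uncertainty
```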
Offline RL has been confined almost exclusively to model-free RL approaches. In this work, we present MOReL, an algorithmic framework for model-based offline RL. This framework consists of two steps: (a) learning a pessimistic MDP (P-MDP) using the offline dataset; and (b) learning a ...
Key points: this paper uses a model-based approach for offline RL. It proceeds in two steps: first, learn a pessimistic MDP (P-MDP) from the offline data; second, learn a near-optimal policy in this P-MDP. The properties of the P-MDP guarantee that the policy's performance in the P-MDP is a lower bound on its performance in the real environment. Concretely, since the dataset cannot cover the entire state-action space, ...
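A rough sketch of how such a pessimistic MDP can be built from an ensemble of learned dynamics models (the detection rule, the threshold, and all helper names are illustrative assumptions, not the paper's code): state-action pairs on which the ensemble members disagree too much are treated as unknown and routed to an absorbing halt state with a large negative reward.

```python
import numpy as np

HALT_STATE = None   # absorbing terminal state of the pessimistic MDP
KAPPA = 100.0       # large penalty for entering the unknown region (assumed value)
THRESHOLD = 0.1     # ensemble-disagreement threshold (assumed value)

def p_mdp_step(state, action, ensemble):
    """One transition of a pessimistic MDP built from an ensemble of learned models.

    Each ensemble member is assumed to expose predict(state, action) -> (next_state, reward).
    """
    predictions = [m.predict(state, action) for m in ensemble]
    next_states = [np.asarray(ns) for ns, _ in predictions]
    # Disagreement: largest pairwise distance between the members' next-state predictions.
    disagreement = max(np.linalg.norm(a - b) for a in next_states for b in next_states)
    if disagreement > THRESHOLD:
        # Unknown state-action pair: halt with a large negative reward.
        return HALT_STATE, -KAPPA, True
    # Known region: use (for example) the first member's prediction.
    next_state, reward = predictions[0]
    return next_state, reward, False
```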
Model-based RL
1. Once the model has been learned, training can proceed on-policy: the agent can interact with the learned model an unlimited number of times, which avoids some of the problems brought by off-policy learning (see the sketch below).
2. The biggest benefit, of course, is the drastic reduction in interaction with the real environment; Batch RL and offline RL therefore frequently borrow techniques from MBRL.
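A toy illustration of point 1, generating synthetic on-policy experience inside a learned model without touching the real environment (all names here are assumptions):

```python
def imagined_rollout(model, policy, start_state, horizon=5):
    """Toy sketch: roll the current policy out inside a learned dynamics model,
    producing synthetic transitions for training."""
    transitions = []
    state = start_state
    for _ in range(horizon):
        action = policy(state)
        next_state, reward = model.predict(state, action)
        transitions.append((state, action, reward, next_state))
        state = next_state
    return transitions
```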
Existing model-based offline RL methods impose pessimistic constraints on the learned model within the support region of the offline data to avoid extrapolation errors, but these approaches limit the generalization potential of the policy in the out-of-distribution (OOD) region. The artificial fixed ...
README (MIT license). Overview: This is a re-implementation of the offline model-based RL algorithm MOPO, written entirely in PyTorch (including ...
In offline RL, the agent is not allowed to interact with the real environment, but it can access an offline dataset of transitions. The goal of OPE is to estimate the value of the target policy using only this given dataset of transitions. A more exhaustive definition can be found in refs. [6, 22]...
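As a concrete but generic illustration of OPE (not the estimator from this particular paper), a simple per-trajectory importance-sampling estimator reweights returns logged under the behavior policy by the target policy's likelihood ratio; all names here are assumptions:

```python
import numpy as np

def importance_sampling_ope(trajectories, target_prob, behavior_prob, gamma=0.99):
    """Generic per-trajectory importance-sampling OPE sketch (illustrative only).

    trajectories: list of trajectories, each a list of (state, action, reward) tuples
    target_prob(s, a):   probability of action a under the target policy
    behavior_prob(s, a): probability of action a under the behavior (logging) policy
    """
    estimates = []
    for traj in trajectories:
        ratio, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            ratio *= target_prob(s, a) / behavior_prob(s, a)
            ret += (gamma ** t) * r
        estimates.append(ratio * ret)
    # The average of importance-weighted returns estimates the target policy's value.
    return float(np.mean(estimates))
```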
Key: context-based meta-RL, based on Dreamer
OpenReview: 6, 6, 6, 6
ExpEnv: Point Robot Navigation, Escape Room, Reacher Sparse

Reward-Consistent Dynamics Models are Strongly Generalizable for Offline Reinforcement Learning
Fan-Ming Luo, Tian Xu, Xingchen Cao, Yang Yu
Key: reward learning, ...