A key problem in offline RL is distribution shift. Model-free methods address it by constraining the target policy to stay close to the behavior policy, or by penalizing the Q-values of out-of-distribution (OOD) state-action pairs. These methods confine the policy strictly to the data manifold of the behavior policy, which makes it hard to reach especially high performance. Later methods introduce ensemble models to estimate the uncertainty of the Q-function (over state-action pairs in the offline dataset as well as OOD state-action pairs...
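As a rough illustration of the ensemble idea, the sketch below (not taken from any of the cited papers) measures Q-value uncertainty as the disagreement across an ensemble of Q-networks and uses it to form a pessimistic value estimate; the ensemble size, network widths, and the coefficient beta are illustrative assumptions.

    # Illustrative sketch only: ensemble-based Q uncertainty for offline RL.
    # Ensemble size, hidden width, and beta are assumed, not taken from the papers.
    import torch
    import torch.nn as nn

    class QEnsemble(nn.Module):
        def __init__(self, obs_dim, act_dim, n_members=5, hidden=256):
            super().__init__()
            self.members = nn.ModuleList([
                nn.Sequential(
                    nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                    nn.Linear(hidden, hidden), nn.ReLU(),
                    nn.Linear(hidden, 1),
                )
                for _ in range(n_members)
            ])

        def forward(self, obs, act):
            x = torch.cat([obs, act], dim=-1)
            # Stack member predictions: shape (n_members, batch, 1).
            return torch.stack([m(x) for m in self.members], dim=0)

    def pessimistic_q(ensemble, obs, act, beta=1.0):
        # Penalize the mean Q by the ensemble's disagreement (std), so OOD
        # state-action pairs with high disagreement receive lower values.
        qs = ensemble(obs, act)
        return qs.mean(dim=0) - beta * qs.std(dim=0)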
MOORe: Model-based Offline-to-Online Reinforcement Learning. The main challenge in offline-to-online RL is distribution shift: the offline data is generated by one or more behavior policies, whereas the data gathered during online training comes from the current policy. This difference between policies leads to a difference in data distributions, i.e., distribution shift. To mitigate the distribution shift problem...
The authors propose the Model-based Offline Policy Optimization (MOPO) algorithm, which tackles offline RL with a model-based approach and adds a soft reward penalty that captures the uncertainty of the learned dynamics ("applying them with rewards artificially penalized by the uncertainty of the dynamics"). This amounts to a trade-off between generalization and risk. The authors'...
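As a rough sketch of the soft reward penalty idea, the snippet below penalizes the model-predicted reward by the largest predicted next-state standard deviation across an ensemble of Gaussian dynamics models; the ensemble interface (predict_mean_std) and the coefficient lam are assumptions here, and the actual MOPO implementation differs in its details.

    # Sketch of a MOPO-style penalized reward: r_tilde = r - lam * u(s, a),
    # where u(s, a) measures the dynamics ensemble's uncertainty.
    # The member interface predict_mean_std and the value of lam are assumptions.
    import numpy as np

    def penalized_reward(ensemble, s, a, r_model, lam=1.0):
        # Each ensemble member predicts a Gaussian over the next state and
        # returns (mean, std), each of shape (state_dim,).
        stds = [member.predict_mean_std(s, a)[1] for member in ensemble]
        # Use the largest std norm across members as the uncertainty u(s, a).
        u = max(np.linalg.norm(std) for std in stds)
        return r_model - lam * u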
MOReL: Model-Based Offline Reinforcement Learning. Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, Thorsten Joachims. NeurIPS 2020 | May 2020. In offline reinforcement learning (RL), the goal is to learn a highly rewarding policy based solely on a dataset of hi...
Hence, a model-based offline adaptive policy optimization method with episodic memory is proposed in this work to improve the generalization of the policy. Inspired by active learning, a constraint strength is proposed to adaptively trade off return and risk, balancing robustness and generalization ability...
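The excerpt does not spell out how the constraint strength is adapted; purely as a generic sketch, one simple way to adapt such a coefficient is to push it toward keeping a measured uncertainty (or risk) signal near a target level, loosely analogous to tuning a Lagrange multiplier. Everything below (names, update rule, bounds) is an assumption for illustration, not the paper's method.

    # Generic sketch (not the paper's algorithm): adapt a constraint/penalty
    # coefficient so that a measured uncertainty signal tracks a target level.
    def update_constraint_strength(alpha, measured_uncertainty, target,
                                   lr=1e-3, alpha_min=0.0, alpha_max=10.0):
        # Increase alpha when uncertainty exceeds the target (be more conservative);
        # decrease it otherwise (allow more generalization).
        alpha = alpha + lr * (measured_uncertainty - target)
        return min(max(alpha, alpha_min), alpha_max)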
Offline reinforcement learning (RL) enables learning policies from pre-collected datasets without environment interaction, which provides a promising direction for making RL usable in real-world systems. Although recent offline RL studies have achieved much progress, existing methods still face many practica...
Repository contents: algo, common, config, models, results, static_fns, LICENSE, README.md, plotter.py, train.py, trainer.py (MIT license). Overview: This is a re-implementation of the offline model-based RL algorithm MOPO, written entirely in PyTorch (including...
《MOReL: Model-Based Offline Reinforcement Learning》R Kidambi, A Rajeswaran, P Netrapalli, T Joachims [Cornell University & University of Washington & Microsoft Research] (2020) http://t.cn/A6ABh...
python recursive.py --env <env_name> --exp_name <experiment_name> --sub_exp_name <exp_save_dir> --param_path configs/params_<env_name>_offline.json --bc_init --random_seeds 0 --target_kl 0.01 --max_path_length 1000
env_name: ant, half_cheetah, hopper, walker2d, cheetah_run ...
As shown in the figure below, offline RL only needs to be deployed directly in the real environment, so its deployment efficiency is very high; on-policy algorithms generally update the policy only after collecting data for entire trajectories, so their deployment efficiency comes next; off-policy algorithms generally update after each collected transition, so their deployment efficiency is very low (roughly the opposite of the notion of sample efficiency)...