Dealing with sparse rewards is one of the biggest challenges in Reinforcement Learning (RL). We present a novel technique called Hindsight Experience Replay which allows sample-efficient learning from rewards which are sparse and binary and therefore avoids the need for complicated reward engineering. It can be combined with an arbitrary off-policy RL algorithm and may be seen as a form of implicit curriculum. We demonstrate our approach on the task of manipulating objects with a robotic arm.
The pivotal idea behind our approach is to re-examine each trajectory with a different goal: while the trajectory may not help us learn how to achieve the goal state g, it definitely tells us something about how to achieve the state s that was actually reached. This information can be harvested by using an off-policy RL algorithm and experience replay, where we replace g in the replay buffer by s. In addition, we can still replay with the original goal g left intact in the replay buffer.
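As a concrete illustration, the following is a minimal sketch of this relabeling step, using the last achieved state of the episode as the hindsight goal. The function names (`compute_reward`, `relabel_episode`), the transition layout, and the distance tolerance are assumptions made for the example, not the paper's reference implementation.

```python
import numpy as np

def compute_reward(achieved_state, goal, tol=1e-3):
    """Sparse binary reward: 0 if the goal is achieved, -1 otherwise."""
    return 0.0 if np.linalg.norm(achieved_state - goal) < tol else -1.0

def relabel_episode(episode, original_goal):
    """Yield each transition twice: once with the original goal g,
    and once, in hindsight, with the state s that was actually reached."""
    hindsight_goal = episode[-1]["next_state"]  # the state s the episode ended in
    for t in episode:
        # Replay with the original goal g left intact.
        yield (t["state"], t["action"],
               compute_reward(t["next_state"], original_goal),
               t["next_state"], original_goal)
        # Replay the same transition, but with g replaced by s.
        yield (t["state"], t["action"],
               compute_reward(t["next_state"], hindsight_goal),
               t["next_state"], hindsight_goal)
```

Both copies of each transition go into the replay buffer, so the off-policy learner still receives informative rewards on episodes where the original goal was never reached.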
URL: papers.nips.cc/paper/2017/file/453fadbd8a1a3af50a9df4df899537b5-Paper.pdf
Author: OpenAI
Conference: NIPS 2017
Topic: Experience Replay
1. Background
A common challenge, especially for robotics, is the need to engineer a reward function that not only reflects the task at hand but is also carefully shaped (Ng et al., 1999) to guide the policy optimization. It is therefore of great practical relevance to develop algorithms which can learn from unshaped reward signals, e.g. a binary signal indicating successful task completion.
One ability humans have, unlike the current generation of model-free RL algorithms, is to learn almost as much from achieving an undesired outcome as from the desired one. In this paper we introduce a technique called Hindsight Experience Replay (HER) which allows the algorithm to perform exactly this kind of reasoning and can be combined with any off-policy RL algorithm. Not only does HER improve the sample efficiency in this setting, but more importantly, it makes learning possible even if the reward signal is sparse and binary. Our approach is based on training universal policies (Schaul et al., 2015a) which take as input not only the current state, but also a goal state. The pivotal idea behind HER is to replay each episode with a different goal than the one the agent was trying to achieve, e.g. one of the goals which was achieved in the episode.
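To make the universal-policy idea concrete, below is a minimal sketch of a goal-conditioned value network in the spirit of UVFA (Schaul et al., 2015a): the goal is simply concatenated to the state (and action) before being fed to the function approximator. The class name, layer sizes, and architecture are illustrative assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class GoalConditionedQ(nn.Module):
    """Q(s, g, a): a universal value function that takes the goal as an extra input."""
    def __init__(self, state_dim, goal_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, goal, action):
        # Conditioning on the goal lets a single network generalize across goals.
        return self.net(torch.cat([state, goal, action], dim=-1))
```

Because the goal enters the network as an ordinary input, transitions replayed with a substituted goal can be consumed by an unchanged off-policy learner.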
2. Hindsight Experience Replay
2.1. Algorithm
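The paper's full algorithm box is not reproduced in these notes; as a rough placeholder consistent with the description above, the sketch below shows how the relabeling step plugs into a generic off-policy training loop. All interfaces here (`env.sample_goal`, `env.step`, `agent.act`, `agent.update`, `buffer`) are hypothetical names for illustration only.

```python
def train(env, agent, buffer, num_episodes=1000, batch_size=128, updates_per_episode=40):
    # Hedged sketch: `env`, `agent`, and `buffer` are hypothetical placeholders.
    for _ in range(num_episodes):
        goal = env.sample_goal()
        state, episode, done = env.reset(), [], False
        while not done:
            action = agent.act(state, goal)
            next_state, done = env.step(action)
            episode.append({"state": state, "action": action, "next_state": next_state})
            state = next_state
        # Store every transition twice: with the original goal and with the
        # hindsight goal (the state actually achieved), as in relabel_episode above.
        for transition in relabel_episode(episode, goal):
            buffer.add(transition)
        for _ in range(updates_per_episode):
            agent.update(buffer.sample(batch_size))  # any off-policy algorithm, e.g. DQN or DDPG
```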
3. Related Work
3.1. Universal Value Function Approximators (UVFA)