In extremely sparse reward settings, some form of reward shaping is usually required to provide the agent with a manually designed, denser reward signal. Such shaping demands domain-specific knowledge and can even conflict with the agent's long-term goal, which makes it impractical in some cases. This paper proposes that curiosity can serve as an intrinsic reward signal that drives the agent to explore its environment and learn skills that may be useful later in its life, thereby overcoming the sparse-reward problem. Curiosity has long been used to explain the drive to explore the environment and discover novel states, and with curiosity-driven exploration the agent gains knowledge that is helpful at later stages without any manually designed reward shaping.
The curiosity-driven approach itself is not novel; what makes this paper noteworthy is how the intrinsic reward signal is generated using inverse prediction. Because raw states can be stochastic and contain plenty of redundant information, predicting the action in an inverse manner (i.e., predicting which action was taken from two consecutive states) lets us model only those changes in the environment that could be caused by the agent's actions or that affect the agent, and ignore the rest. To do this, the paper uses a transformed feature space in which only the information relevant to the action performed by the agent is represented, rather than the raw pixel input. Specifically, the proposed method learns this feature space with self-supervision: a neural network is trained on the proxy inverse-dynamics task of predicting the agent's action given its current and next states. This feature space is then used to train a forward dynamics model that predicts the feature representation of the next state, given the feature representation of the current state and the action. Finally, the prediction error of the forward dynamics model is provided to the agent as an intrinsic reward that encourages its curiosity.
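To make this concrete, below is a minimal PyTorch sketch of the inverse-model / forward-model idea described above. The class name, MLP layer sizes, and one-hot action encoding are my own illustrative choices (the paper uses a convolutional encoder on pixels), so treat this as a sketch of the mechanism rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CuriosityModule(nn.Module):
    """Sketch of the inverse/forward dynamics models that produce the intrinsic reward."""

    def __init__(self, obs_dim, n_actions, feature_dim=256):
        super().__init__()
        self.n_actions = n_actions
        # phi(s): encodes an observation into the learned feature space
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, feature_dim), nn.ReLU(),
        )
        # Inverse model: predicts a_t from phi(s_t) and phi(s_{t+1})
        self.inverse_model = nn.Sequential(
            nn.Linear(2 * feature_dim, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )
        # Forward model: predicts phi(s_{t+1}) from phi(s_t) and a_t
        self.forward_model = nn.Sequential(
            nn.Linear(feature_dim + n_actions, 256), nn.ReLU(),
            nn.Linear(256, feature_dim),
        )

    def forward(self, obs, next_obs, action):
        phi = self.encoder(obs)
        phi_next = self.encoder(next_obs)
        a_onehot = F.one_hot(action, self.n_actions).float()

        # Inverse-dynamics loss: the self-supervised proxy task that shapes the feature space
        logits = self.inverse_model(torch.cat([phi, phi_next], dim=-1))
        inverse_loss = F.cross_entropy(logits, action)

        # Forward-dynamics error: prediction error measured in feature space
        phi_next_pred = self.forward_model(torch.cat([phi, a_onehot], dim=-1))
        forward_error = 0.5 * (phi_next_pred - phi_next.detach()).pow(2).sum(dim=-1)

        # The per-transition forward error is handed to the agent as the intrinsic reward
        intrinsic_reward = forward_error.detach()
        return intrinsic_reward, inverse_loss, forward_error.mean()
```

The key design choice this sketch tries to show is that the forward model operates in the feature space learned by the inverse-dynamics task, so prediction errors (and hence the intrinsic reward) are driven by agent-relevant changes rather than by uncontrollable pixel noise.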
URL: proceedings.mlr.press/v70/pathak17a/pathak17a.pdf
Conference: ICML 2017
Topic: Exploration
Post: bluediary8.tistory.com/30
Video: www.youtube.com/watch?v=_Z9ZP1eiKsI
1. Curiosity-Driven Exploration
Our agent is composed of two subsystems: a reward generator that outputs a curiosity-driven intrinsic reward signal $r_t^i$, and a policy $\pi$ that outputs a sequence of actions to maximize that reward signal. Let the sparse extrinsic reward be $r_t^e$; then the policy subsystem is trained to maximize $r_t^i + r_t^e$, with the latter mostly zero due to its sparsity.
Refer to the video link above for a detailed explanation and the intuition.
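As a rough illustration of how the two subsystems interact, here is a short Python sketch of one training loop. The names `env`, `policy`, `to_tensor`, and `num_steps` are illustrative placeholders, and `icm` refers to the curiosity module sketched earlier; this is not the paper's training code.

```python
num_steps = 10_000  # illustrative training horizon

obs = env.reset()
for t in range(num_steps):
    action = policy.act(obs)
    next_obs, r_extrinsic, done, info = env.step(action)

    # Intrinsic reward r_t^i comes from the forward-model prediction error
    r_intrinsic, inverse_loss, forward_loss = icm(
        to_tensor(obs), to_tensor(next_obs), to_tensor(action)
    )

    # The policy maximizes r_t^i + r_t^e; under sparse rewards r_t^e is mostly zero
    policy.update(obs, action, r_intrinsic.item() + r_extrinsic, next_obs, done)

    obs = env.reset() if done else next_obs
```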