
[2017.12] Learning Multi-level Hierarchies with Hindsight


Hierarchical agents have the potential to solve sequential decision-making tasks with greater sample efficiency than their non-hierarchical counterparts because hierarchical agents can break down tasks into sets of subtasks that only require short sequences of decisions. In order to realize this potential of faster learning, hierarchical agents need to be able to learn their multiple levels of policies in parallel so these simpler subproblems can be solved simultaneously. Yet, learning multiple levels of policies in parallel is hard because it is inherently unstable: changes in a policy at one level of the hierarchy may cause changes in the transition and reward functions at higher levels in the hierarchy, making it difficult to jointly learn multiple levels of policies. In this paper, we introduce a new Hierarchical Reinforcement Learning (HRL) framework, Hierarchical Actor-Critic (HAC), that can overcome the instability issues that arise when agents try to jointly learn multiple levels of policies. The main idea behind HAC is to train each level of the hierarchy independently of the lower levels by training each level as if the lower level policies are already optimal. We demonstrate experimentally that our approach can significantly accelerate learning relative to other non-hierarchical and hierarchical methods. 

Figure 1: An ant agent uses a 3-level hierarchy to traverse through rooms to reach its goal, represented by the yellow cube. $\pi_2$ uses as input the current state (joint positions $\theta$ and velocities $\dot{\theta}$) and goal state (yellow box) and outputs a subgoal state (green box) for $\pi_1$ to achieve. $\pi_1$ takes in the current state and its goal state (green box) and outputs a subgoal state (purple box) for $\pi_0$ to achieve. $\pi_0$ takes in the current state and goal state (purple box) and outputs a vector of joint torques. $\pi_0$ has a fixed number of attempts to move to the purple box before $\pi_1$ outputs another subgoal state. Similarly, $\pi_1$ has a fixed number of subgoal states that it can output to try to move the agent to the green box before $\pi_2$ outputs another subgoal.

URL: arxiv.org/pdf/1712.00948.pdf
Topic: Hierarchical RL

1. Introduction

Hierarchy has the potential to accelerate learning in sequential decision-making tasks because hierarchical agents can decompose problems into smaller subproblems. In order to take advantage of these shorter horizon subproblems, an HRL algorithm must be able to learn the multiple levels within the hierarchy in parallel. That is, at the same time one level in the hierarchy is learning the sequence of subtasks needed to solve a task, the level below should be learning the sequence of shorter time scale actions needed to solve each subtask. Yet the existing HRL algorithms that are capable of automatically learning hierarchies in continuous domains do not efficiently learn the multiple levels within the hierarchy in parallel. Instead, these algorithms often resort to learning the hierarchy one level at a time in a bottom-up fashion.

Learning multiple levels of policies in parallel is challenging due to non-stationary state transition functions. In nested, multi-level hierarchies, the transition function for any level above the ground level depends on the current policies below that level. However, learning multiple policies in parallel is still possible because the transition function for each level above ground level will stabilize once all lower-level policies have converged to optimal or near-optimal policies. Our framework is able to simulate a transition function that uses an optimal lower-level policy hierarchy and thus can learn multiple levels of policies in parallel. We introduce a new HRL framework, Hierarchical Actor-Critic (HAC), that can significantly accelerate learning by enabling hierarchical agents to jointly learn a hierarchy of policies. Our framework is primarily comprised of two components: (i) a particular hierarchical architecture and (ii) a method for learning the multiple levels of policies in parallel given sparse rewards.

In addition, HAC enables agents to learn multiple policies in parallel using only sparse reward functions as a result of two types of hindsight transitions. Hindsight action transitions help agents learn multiple levels of policies simultaneously by training each subgoal policy with respect to a transition function that simulates the optimal lower-level policy hierarchy. Hindsight action transitions are implemented by using the subgoal state achieved in hindsight, instead of the originally proposed subgoal state, as the action component of the transition. For instance, when a subgoal level proposes subgoal state A, but the level below is unsuccessful and the agent ends in state B after the allotted number of attempts, the subgoal level receives a transition in which state B, not state A, is the action component. With hindsight action transitions, a subgoal level can focus on learning the sequences of subgoal states that can reach a goal state, while the lower-level policies focus on learning the sequences of actions to achieve those subgoal states. The second type of hindsight transition, hindsight goal transitions, helps each level learn a goal-conditioned policy in sparse reward tasks by extending the idea of Hindsight Experience Replay (Andrychowicz et al., 2017) to the hierarchical setting. In these transitions, one of the states achieved in hindsight is used as the goal state in the transition instead of the original goal state.
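To make the two transition types concrete, here is a minimal sketch of how a subgoal level's replay data might be relabeled. This is not the authors' implementation: the -1/0 reward, the distance threshold, and all function and argument names (e.g. `hindsight_action_transition`, `tol`) are illustrative assumptions, and states are assumed to be NumPy vectors.

```python
import numpy as np

def hindsight_action_transition(state, proposed_subgoal, state_reached, goal, tol=0.1):
    """Build a subgoal-level transition as if the lower levels had acted optimally:
    the action stored is the subgoal state actually reached, not the one proposed."""
    action = state_reached                                # hindsight action component
    achieved = np.linalg.norm(state_reached - goal) < tol
    reward = 0.0 if achieved else -1.0                    # sparse, shortest-path-style reward
    return (state, action, reward, state_reached, goal)

def hindsight_goal_transitions(episode, tol=0.1, relabels_per_step=4):
    """HER-style relabeling: copy each transition and swap in goals that were
    actually reached later in the episode, so some copies contain the sparse reward."""
    out = []
    for i, (state, action, _, next_state, _) in enumerate(episode):
        future_states = [t[3] for t in episode[i:]]       # states achieved from here on
        for new_goal in future_states[:relabels_per_step]:
            achieved = np.linalg.norm(next_state - new_goal) < tol
            reward = 0.0 if achieved else -1.0
            out.append((state, action, reward, next_state, new_goal))
    return out
```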

1.1. Preliminaries

We are interested in solving a Markov Decision Process (MDP) augmented with a set of goals $\mathcal{G}$ (each a state or set of states) that we would like an agent to learn. We define an MDP augmented with a set of goals as a Universal MDP (UMDP). A UMDP is a tuple $\mathcal{U}=(\mathcal{S}, \mathcal{G}, \mathcal{A}, T, R, \gamma)$, in which $\mathcal{S}$ is the set of states; $\mathcal{G}$ is the set of goals; $\mathcal{A}$ is the set of actions; $T$ is the transition probability function; $R$ is the reward function; and $\gamma \in [0, 1)$ is the discount rate. At the beginning of each episode in a UMDP, a goal $g \in \mathcal{G}$ is selected and held fixed for the entirety of the episode. The solution to a UMDP is a control policy $\pi: \mathcal{S}, \mathcal{G} \rightarrow \mathcal{A}$ that maximizes the value function $v_{\pi}(s, g)=\mathbb{E}_{\pi}\left[\sum_{n=0}^{\infty} \gamma^{n} R_{t+n+1} \mid s_{t}=s, g_{t}=g\right]$ for an initial state $s$ and goal $g$.
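As a toy illustration of this interface (the environment, names, and numbers below are hypothetical, not from the paper), an episode draws a goal once and the agent then acts with respect to that fixed goal under a sparse 0/-1 reward:

```python
import random

class ToyUMDP:
    """1-D point agent: states and goals are positions on a line (illustrative only)."""
    def __init__(self, goal_low=-1.0, goal_high=1.0, tolerance=0.05):
        self.goal_low, self.goal_high, self.tolerance = goal_low, goal_high, tolerance

    def reset(self):
        self.state = 0.0
        # a goal g in G is drawn once and stays fixed for the whole episode
        self.goal = random.uniform(self.goal_low, self.goal_high)
        return self.state, self.goal

    def step(self, action):
        self.state += max(-0.1, min(0.1, action))   # bounded primitive action
        reward = 0.0 if abs(self.state - self.goal) < self.tolerance else -1.0
        return self.state, reward
```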

In order to implement hierarchical agents in tasks with continuous state and action spaces, we will use two techniques from the RL literature: (i) the Universal Value Function Approximator (UVFA) (Schaul et al., 2015) and (ii) Hindsight Experience Replay (HER) (Andrychowicz et al., 2017). The UVFA will be used to estimate the action-value function of a goal-conditioned policy $\pi$,

$q_{\pi}(s, g, a)=\mathbb{E}_{\pi}\left[\sum_{n=0}^{\infty} \gamma^{n} R_{t+n+1} \mid s_{t}=s, g_{t}=g, a_{t}=a\right].$

In our experiments, the UVFAs used will be in the form of feedforward neural networks. UVFAs are important for learning goal-conditioned policies because they can potentially generalize Q-values from certain regions of the (state, goal, action) tuple space to other regions of the tuple space, which can accelerate learning. However, UVFAs are less helpful in difficult tasks that use sparse reward functions. In these tasks, where the sparse reward is rarely achieved, the UVFA will not have large regions of the (state, goal, action) tuple space with relatively high Q-values that it can generalize to other regions. For this reason, we also use Hindsight Experience Replay (Andrychowicz et al., 2017). HER is a data augmentation technique that can accelerate learning in sparse reward tasks. HER first creates copies of the [state, action, reward, next state, goal] transitions that are created in traditional off-policy RL. In the copied transitions, the original goal element is replaced with a state that was actually achieved during the episode, which guarantees that at least one of the HER transitions will contain the sparse reward.
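Since the UVFA is described above as a feedforward network, a rough sketch of a goal-conditioned Q-network is just a network over the concatenated (state, goal, action) vector. The layer sizes, initialization, and class name below are arbitrary illustrative choices, not the paper's architecture:

```python
import numpy as np

class UVFA:
    """Minimal feedforward Q(s, g, a) approximator: one hidden layer, scalar output."""
    def __init__(self, state_dim, goal_dim, action_dim, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = state_dim + goal_dim + action_dim
        self.W1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, (hidden, 1))
        self.b2 = np.zeros(1)

    def q_value(self, state, goal, action):
        x = np.concatenate([state, goal, action])     # condition on the goal as extra input
        h = np.maximum(0.0, x @ self.W1 + self.b1)    # ReLU hidden layer
        return (h @ self.W2 + self.b2).item()         # estimated Q-value
```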

2. Hierarchical Actor-Critic (HAC)

HAC contains two components: (i) a particular hierarchical architecture and (ii) a method for learning the levels of the hierarchy simultaneously and independently. In this section, we will more formally present our proposed system as a Universal MDP (UMDP) transformation operation.

The purpose of our framework is to efficiently learn a $k$-level hierarchy $\Pi_{k-1}$ consisting of $k$ individual policies $\pi_{0}, \ldots, \pi_{k-1}$, in which $k$ is a hyperparameter chosen by the user. In order to learn $\pi_{0}, \ldots, \pi_{k-1}$ in parallel, our framework transforms the original UMDP, $\mathcal{U}_{\text{original}}=(\mathcal{S}, \mathcal{G}, \mathcal{A}, T, R, \gamma)$, into a set of $k$ UMDPs $\mathcal{U}_{0}, \ldots, \mathcal{U}_{k-1}$, in which $\mathcal{U}_{i}=(\mathcal{S}_{i}, \mathcal{G}_{i}, \mathcal{A}_{i}, T_{i}, R_{i}, \gamma_{i})$. In the remainder of the section, we describe these tuples at a high level.

In our approach, each level of the UMDP hierarchy learns its own deterministic policy: $\pi_{i}: \mathcal{S}_{i}, \mathcal{G}_{i} \rightarrow \mathcal{A}_{i}$, $0 \leq i \leq k-1$. The state space for every level $i$ is identical to the state space in the original problem: $\mathcal{S}_{i}=\mathcal{S}$. Since each level will learn to solve the shortest path problem with respect to a goal state, we set the goal space at each level $i$ to be identical to the state space: $\mathcal{G}_{i}=\mathcal{S}$. Finally, the action space at all levels except the bottom-most level is identical to the goal space of the next level down (i.e., the state space): $\mathcal{A}_{i}=\mathcal{S}, i>0$. These levels output subgoal states for the next lower level to achieve. The action space of the bottom-most level is identical to the set of primitive actions that are available to the agent: $\mathcal{A}_{0}=\mathcal{A}$.
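Putting these space definitions together, here is a small sketch (a hypothetical helper, not the authors' code) of how the per-level spaces line up for a $k$-level agent:

```python
def build_level_spaces(k, state_dim, primitive_action_dim):
    """Per-level (state, goal, action) dimensions for a k-level hierarchy:
    S_i = S and G_i = S at every level; A_i = S for subgoal levels (i > 0),
    while the bottom level uses the primitive action space (A_0 = A)."""
    levels = []
    for i in range(k):
        action_dim = primitive_action_dim if i == 0 else state_dim
        levels.append({"level": i,
                       "state_dim": state_dim,   # S_i = S
                       "goal_dim": state_dim,    # G_i = S (goals are target states)
                       "action_dim": action_dim})
    return levels

# e.g. a 3-level agent like the one in Figure 1 (dimensions below are placeholders)
print(build_level_spaces(k=3, state_dim=32, primitive_action_dim=8))
```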